KR102417670B1

KR102417670B1 - Method and apparatus for automatic note-level singing transcription using artificial neural network

Info

Publication number: KR102417670B1
Application number: KR1020210165277A
Authority: KR
Inventors: 금상은; 이종필
Original assignee: 뉴튠(주)
Priority date: 2021-11-26
Filing date: 2021-11-26
Publication date: 2022-07-07
Also published as: KR102497878B1

Abstract

An apparatus for automatic note-level singing transcription using an artificial neural network according to an embodiment of the present invention comprises one or more processors and a memory module storing instructions executable by the one or more processors. The processor may include: a first artificial neural network that takes frequency-domain audio data as first input information and outputs, as first output information, pitch information including pitch height information of the vocal for each frame of the audio data; and a second artificial neural network that takes, as second input information, intermediate output information output from a block constituting the first artificial neural network and outputs, as second output information, vocal information indicating whether a vocal is present in each frame of the audio data. This provides a deep-learning-based technique for more accurately estimating the notes corresponding to a vocal melody in polyphonic music.

Description

Method and apparatus for automatic note-level singing transcription using artificial neural network

The present invention relates to a method and apparatus for automatic note-level vocal transcription using an artificial neural network. More particularly, it relates to a technique for accurately outputting vocal pitch information for input audio using an artificial neural network, and to a technique for efficiently training an artificial neural network using the same.

Transcription means writing down, as a score, music that was not originally notated. Its purpose is either to arrange the music and add it to a repertoire or to analyze it academically. There are two approaches to transcription: prescriptive notation, which records the original form in simplified terms, and descriptive notation, which records the music in detail exactly as it was performed.

Automatically transferring the music of an input sound source into a score by means of a computer processor or the like is called automatic transcription. Specifically, an automatic transcription system, which belongs to the field of melody recognition within speech processing, recognizes the pitch (interval), length (duration), and lyrics of notes in a song sung by a human voice or played on an instrument, and presents the result in the form of a score. Compared with the conventional approach, in which an expert familiar with music listens to a song and transcribes it by hand, automatic transcription lets the system recognize and analyze the musical characteristics of the singer's song and convert them into a score, so that non-experts can also use it easily.

Automatic transcription methods known so far include a method that merges segments divided into phoneme units and uses them as syllable boundary information, so that feature information extracted from a continuous speech signal can be recognized as individual notes, and a method that forms syllable segments using energy information obtained by connecting the maximum values of the speech that occur at each pitch interval of the speech signal.

However, conventional automatic transcription systems such as those described above have several problems. Because a speech signal is continuous, syllable segmentation becomes markedly less effective where the boundaries are ambiguous, and an additional step is needed to find a representative pitch value for each predicted syllable segment before the pitch can be recognized. In addition, because measures (bars) cannot be detected, the recognition result cannot be presented as a complete score, so satisfactory results cannot be provided. Furthermore, transcription is possible only when the input data is a monophonic vocal without accompaniment.

In addition, people usually sing in their own style, so singing tempo varies greatly from person to person. Conventional duration recognition, however, maps the input onto standard notes based on generalized standard data, and therefore cannot adapt to the different tempo of each person's singing.

Korean Patent Laid-Open Publication No. 10-2015-0084133 (published 2015.07.22.) - 'Pitch recognition using sound interference and scale transcription method using the same'
Korean Patent No. 10-1696555 (2019.06.05.) - 'Text location search system through voice recognition in image or geographic information'

Accordingly, the method and apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment were devised to solve the problems described above. Their object is to provide a deep-learning-based technique that predicts the notes corresponding to a vocal melody in polyphonic music more accurately, by outputting transcription information that includes pitch information and vocal information more effectively than the prior art.

More specifically, an object of the method and apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment is to output note-level transcription information effectively even from a smaller amount of frame-level data, by using a model that converts frame-level transcription information into note-level transcription information.

An apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment includes one or more processors and a memory module storing instructions executable by the one or more processors. The processor may include a first artificial neural network that takes frequency-domain audio data as first input information and outputs, as first output information, pitch information including pitch height information of the vocal for each frame of the audio data, and a second artificial neural network that takes, as second input information, intermediate output information output from a block constituting the first artificial neural network and outputs, as second output information, vocal information on whether a vocal is present in each frame of the audio data.

The pitch information may include, for each frame of the audio data, information on whether a vocal is present and information on the pitch height of the vocal.
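As a rough illustration of what frame-level pitch information of this kind could look like, the sketch below decodes one class index per frame into a vocal-present flag and a MIDI pitch. The class layout (index 0 meaning "no vocal", indices 1 and up mapping to semitone steps from a hypothetical `LOWEST_MIDI_PITCH`) and the function name `decode_frames` are assumptions made for this example; the specification does not fix them.

```python
from typing import List, Optional, Tuple

# Assumed class layout (not given in the specification):
# class 0 = no vocal in the frame,
# class k >= 1 = MIDI pitch LOWEST_MIDI_PITCH + k - 1.
LOWEST_MIDI_PITCH = 36


def decode_frames(class_ids: List[int]) -> List[Tuple[bool, Optional[int]]]:
    """Turn per-frame class indices into (vocal_present, midi_pitch) pairs."""
    decoded: List[Tuple[bool, Optional[int]]] = []
    for class_id in class_ids:
        if class_id == 0:
            # Frame without a vocal: no pitch is reported.
            decoded.append((False, None))
        else:
            decoded.append((True, LOWEST_MIDI_PITCH + class_id - 1))
    return decoded
```

With this layout a single output carries both pieces of information named above: whether a vocal is present and, if so, its pitch height.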

The first artificial neural network may include a convolution block that receives the first input information, and at least one residual (ResNet) block and a pooling block that receive the output of the convolution block as input.

The second input information may consist of the sum of the intermediate information output from the ResNet block and the pooling block.
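The summation of intermediate outputs can be pictured with a small sketch. It assumes that every intermediate feature map has already been brought to the same `[frame][channel]` shape; how differently shaped maps would be aligned is not stated here, and the name `sum_intermediate_outputs` is hypothetical.

```python
from typing import List

Feature = List[List[float]]  # [frame][channel], an assumed layout


def sum_intermediate_outputs(outputs: List[Feature]) -> Feature:
    """Element-wise sum of equally shaped intermediate feature maps."""
    if not outputs or not outputs[0]:
        raise ValueError("at least one non-empty intermediate output is required")
    n_frames, n_channels = len(outputs[0]), len(outputs[0][0])
    for feat in outputs:
        if len(feat) != n_frames or any(len(row) != n_channels for row in feat):
            raise ValueError("intermediate outputs must share one shape")
    # Add the maps position by position to form the second network's input.
    return [
        [sum(feat[t][c] for feat in outputs) for c in range(n_channels)]
        for t in range(n_frames)
    ]
```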

The first artificial neural network may further include a first LSTM block that receives the output of the pooling block as input.

The second artificial neural network may include a second LSTM block that receives the second input information.

The processor may use, as a loss function, the difference between reference data and fourth output information, and may adjust the parameters of the first artificial neural network module so that the loss function is minimized. Here, the fourth output information is obtained by summing the first output information output from the softmax of the first artificial neural network with third output information, and the third output information is obtained by summing the first output information with the second output information output from the softmax of the second artificial neural network module.

The processor may adjust the parameters of the vocal transcription artificial neural network module by applying different weights to the first output information and the fourth output information.
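One possible reading of the weighted loss described above is a weighted sum of two error terms, one computed on the first output information and one on the fourth output information. The sketch below follows that reading only as an illustration: the use of cross-entropy, the default weights, the requirement that both outputs are already per-frame probability distributions, and the names `cross_entropy` and `combined_loss` are all assumptions, not details taken from the specification.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int, eps: float = 1e-12) -> float:
    """Negative log-probability assigned to the reference class."""
    return -math.log(max(probs[target], eps))


def combined_loss(
    first_out: Sequence[Sequence[float]],   # per-frame probabilities (first output)
    fourth_out: Sequence[Sequence[float]],  # per-frame probabilities (fourth output)
    ref_first: Sequence[int],               # reference class per frame for first_out
    ref_fourth: Sequence[int],              # reference class per frame for fourth_out
    w_first: float = 1.0,                   # assumed weight
    w_fourth: float = 0.5,                  # assumed weight
) -> float:
    """Weighted sum of the mean per-frame cross-entropy of the two outputs."""
    loss_first = sum(
        cross_entropy(p, t) for p, t in zip(first_out, ref_first)
    ) / len(ref_first)
    loss_fourth = sum(
        cross_entropy(p, t) for p, t in zip(fourth_out, ref_fourth)
    ) / len(ref_fourth)
    return w_first * loss_first + w_fourth * loss_fourth
```

Choosing different values for `w_first` and `w_fourth` changes how strongly each output steers the parameter update.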

An apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment includes one or more processors and a memory module storing instructions executable by the one or more processors. The processor includes: a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including pitch height information of the vocal for each frame of the first audio data; a pre-processing module that converts the first output information into first training data including note-level vocal information; and a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including pitch height information of the vocal for each note of the second audio data. The third artificial neural network may be trained based on the first training data.

The apparatus for automatic note-level vocal transcription using an artificial neural network may further include a pre-trained fourth artificial neural network that takes, as fourth input information, intermediate output information output from a block constituting the third artificial neural network and outputs, as fourth output information, information on whether a vocal is present for each note of the second audio data.

The processor may train the third artificial neural network based on the first training data and on second training data labeled with note-level pitch height information of the vocal.

The apparatus for automatic note-level vocal transcription using an artificial neural network may further include a fifth artificial neural network that takes third audio data in the frequency domain as fifth input information and outputs, as fifth output information, pitch information including pitch height information of the vocal for each note of the fifth audio data. The processor may train the fifth artificial neural network using the third output information output by the third artificial neural network as third training data.
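The training arrangement in the preceding paragraphs, in which the output of one network becomes training data for another and can be combined with labeled data, might be organized along the following lines. This is only a data-handling sketch: `teacher` stands in for any already trained network, and the type aliases and the function name `build_training_set` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Clip = Sequence[float]                  # stand-in for frequency-domain audio data
Notes = List[Tuple[float, float, int]]  # (onset_s, offset_s, midi_pitch)
Example = Tuple[Clip, Notes]


def build_training_set(
    labeled: Sequence[Example],
    unlabeled_clips: Sequence[Clip],
    teacher: Callable[[Clip], Notes],
) -> List[Example]:
    """Combine labeled pairs with pairs labeled by an already trained model."""
    # The teacher's output is stored as if it were a label for the clip.
    pseudo_labeled = [(clip, teacher(clip)) for clip in unlabeled_clips]
    return list(labeled) + pseudo_labeled
```

A network trained on the returned list can in turn play the role of `teacher` for the next network.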

An apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment includes one or more processors and a memory module storing instructions executable by the one or more processors. The processor includes: a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including pitch height information of the vocal for each frame of the first audio data; a pre-processing module that converts the first output information into first training data including note-level vocal information; and a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including pitch height information of the vocal for each note of the second audio data. The third artificial neural network is trained based on the first training data, and the memory module may include a first memory module that stores the first training data.

A method for automatic note-level vocal transcription using a processor that includes an artificial neural network according to an embodiment may include: a first output information output step in which the processor outputs first output information using a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as the first output information, pitch information including pitch height information of the vocal for each frame of the first audio data; a pre-processing step in which the processor converts the first output information into first training data including note-level vocal information; a first training data storage step in which the processor stores the first training data in a first memory module; a third output information output step in which the processor outputs third output information using a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as the third output information, pitch information including pitch height information of the vocal for each note of the second audio data; and a step in which the processor trains the third artificial neural network based on the first training data.

The method and apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment use the estimated vocal information when inferring the pitch information (which itself includes vocal information), and use the estimated pitch information when inferring the vocal information. They can therefore infer the pitch information and vocal information contained in audio data more effectively than the prior art. As a result, no pre-processing step for separating the vocal is needed during transcription, which greatly reduces the amount of computation, and accurate vocal transcription can be performed without separating the vocal.

In addition, the method and apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment convert frame-level pitch information into pseudo-note-level pitch information and then use the converted data as training data for the artificial neural network that performs note-level transcription. An artificial neural network that outputs note-level pitch information can therefore be trained efficiently even with less note-level training data.
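To make the idea of turning frame-level pitch information into pseudo-note-level information concrete, the sketch below groups runs of identical frame pitches into notes and discards very short runs. It is a generic run-length grouping offered under stated assumptions, not the conversion procedure of the specification: the hop size, the minimum run length, the use of `None` for frames without a vocal, and the name `frames_to_notes` are all hypothetical.

```python
from typing import List, Optional, Tuple

Note = Tuple[float, float, int]  # (onset_s, offset_s, midi_pitch)


def frames_to_notes(
    frame_pitches: List[Optional[int]],
    hop_s: float = 0.01,   # assumed time between consecutive frames, in seconds
    min_frames: int = 3,   # assumed minimum run length kept as a note
) -> List[Note]:
    """Group runs of identical frame pitches into notes; None means no vocal."""
    notes: List[Note] = []
    start = 0
    for i in range(1, len(frame_pitches) + 1):
        run_ended = (
            i == len(frame_pitches) or frame_pitches[i] != frame_pitches[start]
        )
        if run_ended:
            pitch = frame_pitches[start]
            # Keep only voiced runs that last long enough to count as a note.
            if pitch is not None and i - start >= min_frames:
                notes.append((start * hop_s, i * hop_s, pitch))
            start = i
    return notes
```

For example, with `hop_s=0.5` the frames `[None, 60, 60, 60, 60, 62, 64, 64, 64, None]` yield two notes, `(0.5, 2.5, 60)` and `(3.0, 4.5, 64)`; the single frame at pitch 62 is dropped as too short.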

In addition, because the artificial neural network is trained on the basis of semi-supervised learning, an artificial neural network that outputs note-level pitch information can be trained efficiently even with only a small amount of unlabeled data.

Accordingly, the method and apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment can transcribe sound sources of various genres flexibly and accurately, and can transcribe the vocal melody directly from polyphonic music without a source separation pre-processing step, which greatly improves transcription speed compared with the prior art.

FIG. 1 is a block diagram of a music transcription system including a music transcription apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating some components of a music transcription apparatus according to an embodiment.
FIG. 3 is a diagram illustrating input information and output information of a first artificial neural network according to an embodiment.
FIG. 4 is a block diagram illustrating some components constituting the first artificial neural network according to an embodiment.
FIG. 5 is a block diagram illustrating some components of a ResNet block according to an embodiment.
FIG. 6 is a diagram illustrating input information and output information of a second artificial neural network according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the configuration of a first artificial neural network module including a first artificial neural network and a second artificial neural network according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating a first artificial neural network module including a first artificial neural network and a second artificial neural network according to another embodiment of the present invention.
FIG. 9 is a diagram illustrating a method of calculating an integrated loss function of a music transcription artificial neural network module according to the present invention.
FIG. 10 is a diagram illustrating components of an automatic music transcription apparatus using an artificial neural network according to an embodiment.
FIG. 11 is a diagram for explaining a pre-processing module of the automatic music transcription apparatus according to an embodiment.
FIG. 12 is a diagram illustrating components of an automatic music transcription apparatus using an artificial neural network according to another embodiment.

Hereinafter, embodiments according to the present invention are described with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related known configurations or functions are omitted where they would hinder understanding of the embodiments. Although embodiments of the present invention are described below, the technical idea of the present invention is not limited to them and may be modified and practiced in various ways by those skilled in the art.

The terms used in this specification are used to describe the embodiments and are not intended to limit or restrict the disclosed invention. A singular expression includes the plural unless the context clearly indicates otherwise.

In this specification, terms such as "include", "comprise", or "have" indicate that the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification are present; they do not exclude in advance the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "indirectly connected" with another element in between. Terms that include ordinal numbers, such as "first" and "second", may be used to describe various components, but the components are not limited by these terms.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. Parts of the drawings that are unrelated to the description are omitted in order to describe the present invention clearly.

Although the title of the present invention is 'Method and apparatus for automatic note-level vocal transcription using an artificial neural network', for convenience of description the 'method and apparatus for automatic note-level vocal transcription using an artificial neural network' is abbreviated below to 'music transcription apparatus'.

FIG. 1 is a block diagram of a music transcription system including a music transcription apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the music transcription system 1 may include a user terminal 100 that provides an automatic transcription service and a music transcription apparatus 200 that performs the automatic transcription. The user terminal 100 may be implemented as a mobile terminal, and the music transcription apparatus 200 may be implemented as a remote server.

Accordingly, the user terminal 100 may include any kind of handheld wireless communication device, such as a PCS (Personal Communication System), GSM (Global System for Mobile communication), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), or Wibro (Wireless Broadband Internet) terminal, a smartphone, a smart pad, a tablet PC, a smart watch, smart glasses, or a wearable device.

The server on which the music transcription apparatus 200 is implemented is a conventional server, that is, computer hardware on which a program runs. It may monitor or control the entire network (for example, printer control or file management), connect to other networks through a mainframe or a public network, and support the sharing of software resources such as data, programs, and files, as well as hardware resources such as modems, fax machines, printers, and other equipment.

The user terminal 100 may include a recording unit 110, a communication unit 120, a MIDI generation unit 130, and a display unit 140.

The recording unit 110 may record a sound source. Specifically, the recording unit 110 may record a sound source heard from outside the user terminal 100, or a sound source being played back on the user terminal 100 itself. The sound source recorded by the recording unit 110 may be transmitted to the music transcription apparatus 200 through the communication unit 120.

The communication unit 120 may transmit data, including the recorded sound source, to the music transcription apparatus 200, and may also receive data from the music transcription apparatus 200. For example, the communication unit 120 may receive a transcribed score from the music transcription apparatus 200, or additional service information related to the transcribed score.

The MIDI generation unit 130 may generate and play back a MIDI (Musical Instrument Digital Interface) file corresponding to the transcribed score.

The display unit 140 may output the transcribed score or other additional service information so that the user can perceive it. To this end, the display unit 140 may include any of various display panels, such as a liquid crystal display (LCD) panel, a light emitting diode (LED) panel, or an organic light emitting diode (OLED) panel. When the display unit includes a device that provides a graphical user interface (GUI), that is, a software interface, such as a touch pad, it may also serve as an input unit (not shown) that receives user input.

The music transcription apparatus 200 may include a processor 300, a memory module 400, an additional service module 500, and a user management module 600.

The processor 300 may automatically generate a transcription, using an artificial neural network module, for a sound source stored in the memory module 400 or received from the user terminal 100. This is described in detail later.

The memory module 400 may store sound sources received from the user terminal 100, as well as various data that the processor 300 needs for training the artificial neural network module and performing inference with it. It may also store reference data received through the user terminal 100 or the like, and various data generated by the processor during training or inference.

The additional service module 500 may generate additional service information related to the score transcribed by the processor 300.

As an example, the additional service module 500 may provide a composition assistant service as an additional service. The composition assistant service provided according to the present invention is one of the music application services that use the user terminal 100. It may refer to a composition assistant service with improved performance in which a sound source collected from the user terminal 100 is automatically converted into a score by the music transcription apparatus 200 implemented as a remote server, and the score is delivered to the user terminal 100 for storage.

Here, a composition assistant service with improved performance means one that uses the music transcription apparatus 200, which provides more accurate transcription than the prior art.

Even a person with no musical training may come up with a melody while humming, and may wish to connect such melodies into a finished composition. For example, suppose a mother hums a lullaby for her baby. The improvised lullaby could be recorded, stored as a sound source, and played back later. If it is instead converted into a score, less storage is required, the song can be continued later, and anyone can subsequently reproduce the song stored as a score.

To receive this composition assistant service, the user records a sound source through the mobile terminal 100 and transmits it to the music transcription apparatus 200 implemented as a remote server. In the mobile terminal 100, the sound source may be recorded through the recording unit 110 and transmitted through the communication unit 120.

In addition, based on the data generated by the processor 300, the additional service module 500 may also provide a melody-based music search system, a melody-based similar sound source search service, and the like.

For a registered user, the music transcription apparatus 200 may receive a sound source file, analyze and transcribe the sound source, and transmit the transcription result to the user terminal 100. In the music transcription apparatus 200, analysis and transcription of the sound source may be performed by the processor 300, and the transcription result may be transmitted to the user terminal 100 through a communication unit.

The user terminal 100 may display the transcribed score and play the corresponding music through the MIDI generation unit 130 and the display unit 140. The user can complete a composition by making additions, deletions, and modifications while viewing the transcribed score and listening to the corresponding music.

The user management module 600 may store data for managing the users of the user terminals 100 that use the music transcription apparatus 200. Meanwhile, although not shown in the drawings, the music transcription apparatus 200 may include a communication unit, which may transmit data to the user terminal 100. For example, the communication unit may transmit score information transcribed by the processor 300 to the user terminal 100, or may transmit various additional service information generated by the additional service module 500.

FIG. 2 is a block diagram illustrating some components of a music transcription apparatus according to an embodiment, and FIG. 3 is a diagram illustrating input information and output information of a first artificial neural network according to an embodiment. FIG. 4 is a block diagram illustrating some components of the first artificial neural network according to an embodiment, and FIG. 5 is a block diagram illustrating some components of a ResNet block according to an embodiment.

Referring to FIG. 2, the music transcription apparatus 200 according to an embodiment may include a processor 300, a memory module 400, an additional service module 500, and a user management module 600, and the processor 300 may include a first artificial neural network module 310 that includes a first artificial neural network 311 and a second artificial neural network 312. Descriptions that overlap with FIG. 1 are omitted, and the music transcription artificial neural network module is described in detail with reference to FIGS. 2 to 7.

Meanwhile, although FIG. 1 shows the processor 300 as including only the first artificial neural network 311, the processor 300 may also include a second artificial neural network module 320 that includes a third artificial neural network 313 and a fourth artificial neural network 314, and a third artificial neural network module 330 that includes a fifth artificial neural network 315 and a sixth artificial neural network 316. In view of its function, each artificial neural network module may also be called a music transcription artificial neural network module.

The first artificial neural network 311 refers to an artificial neural network module that takes first input information 10 containing audio data as input and outputs frame-level pitch information for the vocal contained in the input audio data. The second artificial neural network 312 refers to an artificial neural network module that takes, as input, second input information 30 obtained by summing intermediate information of the first artificial neural network 311, and outputs vocal information indicating whether a vocal is present in the input audio data.

The pitch information output by the first artificial neural network 311 includes, for each frame of the input audio data, information on whether a vocal is present in that frame and, if so, the pitch of the vocal.

Describing the first artificial neural network 311 in detail with reference to FIGS. 3 and 4, the first artificial neural network 311 according to an embodiment is a pre-trained artificial neural network module that takes first input information 10 containing audio data as input and outputs first output information 20 containing pitch information for the vocal contained in the audio data. The first artificial neural network 311 may include an inference session (not shown) that infers the first output information 20 from the first input information 10, and a training session (not shown) that adjusts and updates the parameters of the first artificial neural network 311 so as to increase the accuracy of the first output information 20, based on the first input information 10, the first output information 20, and first reference data.

The first input information 10 is the sound source data for which the user wants to generate transcription information, and may be data organized in units of frames. FIG. 4 shows, as an example, first input information 10 in which 31 frames of data are input to the first artificial neural network 311 at a time. However, embodiments of the present invention are not limited to this example, and the number of audio frames contained in the first input information 10 may be set to various values.
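The frame-based input described above can be illustrated with a short Python sketch that cuts a sequence of frames into fixed-size chunks of 31. This is only a sketch under stated assumptions: the function name `chunk_frames`, the non-overlapping chunking, and the zero-padding of the last chunk are illustrative choices that the description does not specify.

```python
from typing import List, Sequence

Frame = List[float]  # one frame of frequency-domain audio data (assumed layout)


def chunk_frames(frames: Sequence[Frame], chunk_size: int = 31) -> List[List[Frame]]:
    """Split frames into consecutive chunks of chunk_size frames.

    The last chunk is padded with all-zero frames (an assumption made here
    so that every chunk has the same length).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not frames:
        return []
    n_bins = len(frames[0])
    chunks: List[List[Frame]] = []
    for start in range(0, len(frames), chunk_size):
        chunk = [list(f) for f in frames[start:start + chunk_size]]
        # Pad the final, shorter chunk with silent frames.
        while len(chunk) < chunk_size:
            chunk.append([0.0] * n_bins)
        chunks.append(chunk)
    return chunks
```

With 70 input frames, this sketch yields three chunks of 31 frames each, the last of which holds 8 real frames followed by padding.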

The first output information 20 may include frame-by-frame pitch information for the audio data input to the first artificial neural network 311. Accordingly, the first output information 20 may include, for each frame, information on whether a vocal is present and on the level of the vocal; when no vocal is present, the output for that frame may be 0.

Accordingly, as shown at the lower right of FIG. 3, the frame-by-frame information output by the first artificial neural network 311 is NV (Non-Vocal) for frames in which no vocal is present, and pitch information (for example, D2 or B5) for frames in which a vocal is present.
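The frame labels above (NV, D2, B5) can be illustrated with a small Python sketch. It assumes, purely for illustration, that each frame's pitch is encoded as a MIDI note number and that 0 is reserved for frames without a vocal; the description does not state the actual encoding, and `frame_label` is a hypothetical helper.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def frame_label(midi_pitch: int) -> str:
    """Return 'NV' for a non-vocal frame, otherwise a note name such as 'D2'.

    Assumes MIDI note numbering (60 = C4), with 0 meaning "no vocal".
    """
    if midi_pitch == 0:
        return "NV"
    octave = midi_pitch // 12 - 1
    return f"{NOTE_NAMES[midi_pitch % 12]}{octave}"


labels = [frame_label(p) for p in [0, 38, 38, 83]]
```

Under these assumptions, MIDI numbers 38 and 83 map to D2 and B5, the two examples given in the text.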

Meanwhile, the first artificial neural network 311 may also be referred to as a pitch information output artificial neural network module. Since FIG. 4 shows, as an example, 31 frames of data being input as the first input information 10, the information output from the first artificial neural network 311 may include pitch information for each of the 31 frames. Accordingly, as shown in FIG. 4, the output information may include, for each frame (the first frame, the second frame, the third frame, through the 31st frame), information on whether a sound is present and, if so, its level.

The first artificial neural network 311 may be implemented by combining several known building blocks of artificial neural networks, such as convolution blocks and ResNet blocks. As an example, as shown in FIG. 4, the first artificial neural network 311 may include a convolution block 121, a first ResNet block 122, a second ResNet block 123, a third ResNet block 124, a pooling block 125, and a first LSTM block 126.

Each component of the first artificial neural network 311 shown in FIG. 4 is only an example, and embodiments of the present invention are not limited to the components shown in FIG. 4. For example, a plurality of convolution blocks may be provided instead of one, and one, two, or four or more ResNet blocks may be provided instead of three.

Looking in detail at the blocks that make up the first artificial neural network 311 according to the present invention, the convolution block 121 at the input of the first artificial neural network 311 is a block that performs a convolution operation on the first input information 10. Typically, the convolution operation used in CNNs may be performed.

A ResNet block is a block that implements a Residual Network (ResNet). The ResNet blocks shown in FIG. 4 are a variant of the known ResNet, modified for the purposes of the present invention. In one embodiment, a plurality of ResNet blocks may be connected in series, so that the output information of one ResNet block is input to the next ResNet block in the series.

As shown in FIG. 4, the first artificial neural network 311 according to the present invention is illustrated with a total of three ResNet blocks, namely a first ResNet block 111, a second ResNet block 112, and a third ResNet block 113, because experiments showed that this configuration gives the best output when both the accuracy of the output information and processor efficiency are considered. However, embodiments of the present invention are not limited thereto, and the number and arrangement of ResNet blocks may be varied to suit the purpose of the invention.

Each ResNet block of the first artificial neural network 311 may be implemented as a modified Residual Network as shown in FIG. 5. The first ResNet block 112 may receive the output information of the convolution block 121 as input and, through several network operations, output 1-1 intermediate information 11.

Specifically, the second ResNet block 113 may receive the 1-1 intermediate information 11 output by the first ResNet block 112 as input, perform several network operations on it, and output 2-1 intermediate information 12. In the same manner, the third ResNet block 114 may receive the 2-1 intermediate information 12 as input and output 3-1 intermediate information 13.

Describing the first ResNet block 112 with reference to FIG. 5 (the second and third ResNet blocks have the same structure as the first), the first ResNet block 112 may include, in order, a BN/LReLU block 121 (batch normalization followed by leaky ReLU), a MaxPool (1X4) block 122, a Conv 2D block 123, a BN/LReLU block 124, and a Conv 2D block 125, and each block may perform the operation corresponding to its name. Operations such as BN/LReLU, MaxPool, and Conv 2D are well-known techniques, so a detailed description of them is omitted.

The first ResNet block 112 according to the present invention differs from the known ResNet as follows. The information output from the first ResNet block 112 may be the sum of the information output from the Conv 2D block 125 and the information output from the MaxPool (1X4) block 122, and this sum is output as the 1-1 intermediate information 11. That is, if the information output from the MaxPool (1X4) block is called X, then X becomes Y through the 2D convolution operation of the Conv 2D block, and the first intermediate information 11 may finally be implemented as the sum of Y and the information Z output from the Conv 2D block 325.
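The skip connection described above can be sketched in plain Python. This is a minimal sketch, not the implementation of the invention: batch normalization is omitted, the Conv 2D, BN/LReLU, Conv 2D stages are represented by a caller-supplied placeholder function `body` (hypothetical), and the skip path is taken from the MaxPool (1X4) output, following the first statement of the paragraph.

```python
from typing import Callable, List

Matrix = List[List[float]]  # rows = time frames, columns = frequency bins


def leaky_relu(x: Matrix, slope: float = 0.01) -> Matrix:
    """Leaky ReLU applied element-wise."""
    return [[v if v > 0 else slope * v for v in row] for row in x]


def max_pool_1x4(x: Matrix) -> Matrix:
    """Max-pool each row over non-overlapping groups of 4 frequency bins."""
    return [[max(row[i:i + 4]) for i in range(0, len(row) - 3, 4)] for row in x]


def residual_block(x: Matrix, body: Callable[[Matrix], Matrix]) -> Matrix:
    """Return body(pooled) + pooled, where pooled = MaxPool1x4(LReLU(x)).

    `body` stands in for the convolution stages and must preserve the shape.
    """
    pooled = max_pool_1x4(leaky_relu(x))
    z = body(pooled)
    # Element-wise sum of the skip path and the convolution path.
    return [[p + q for p, q in zip(p_row, z_row)] for p_row, z_row in zip(pooled, z)]
```

Because the skip path starts after the pooling step, both paths have the same reduced frequency dimension, so they can be added without any extra projection.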

The third intermediate information 13 output from the third ResNet block 114 may be input to the first LSTM block 316 through the pooling block 115. The first LSTM block is a block implemented as a long short-term memory (LSTM) network, a type of RNN. Because audio is continuous in time, an LSTM network, which makes use of results from earlier data, has the advantage of effectively outputting pitch or interval information for the input audio data. Since the first artificial neural network 311 shown in FIG. 4 uses audio data corresponding to 31 frames as the first input information 10, the first LSTM block 316 may likewise be implemented using a network composed of 31 layers.

The first output information 20 output by the first LSTM block 316 may include pitch information for the input audio data.

Specifically, the first output information 20 may give the pitch of the voice for each input frame. It may include 1-1 output information (Omv), which indicates whether a human vocal is present in the audio data, and 1-2 output information (Om), which indicates the pitch of the vocal when one is present.

For example, as shown in FIG. 4, when a vocal is present in the first and second frames and absent in the third frame, the 1-1 output information (Omv) is ON for the first and second frames and OFF for the third frame. In this case, the 1-1 output information is output as V (Vocal) for the first and second frames and as NV (Non-Vocal) for the third frame.

Meanwhile, the 1-2 output information (Om), which concerns pitch, does not take ON/OFF values but may be expressed as a numerically computed pitch value for each frame. In this case, a value corresponding to the pitch of the vocal may be output for the first and second frames, and 0 may be output for the third frame because no vocal is present.
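The relationship between the two outputs in this three-frame example can be illustrated as follows. The sketch assumes that the ON/OFF output (Omv) is available as a list of booleans and the pitch output (Om) as a list of numbers, and it shows one possible way to force the pitch to 0 in non-vocal frames; `gate_pitch` is a hypothetical helper, not something the description defines.

```python
from typing import List


def gate_pitch(voicing: List[bool], pitch: List[float]) -> List[float]:
    """Keep the pitch value where voicing is ON and output 0.0 where it is OFF."""
    if len(voicing) != len(pitch):
        raise ValueError("voicing and pitch must cover the same frames")
    return [p if v else 0.0 for v, p in zip(voicing, pitch)]


# Frames 1 and 2 contain a vocal, frame 3 does not.
omv = [True, True, False]
om = gate_pitch(omv, [62.0, 64.0, 57.3])
```

Here the third frame's raw value is discarded because its ON/OFF flag is OFF, which matches the 0 output described for that frame.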

The first artificial neural network 311 included in the first artificial neural network module 310 has been described in detail so far. The second artificial neural network 312, which is also included in the first artificial neural network module 310 and is connected in parallel with the first artificial neural network 311, is described in detail below.

FIG. 6 is a diagram illustrating input information and output information of a second artificial neural network according to an embodiment of the present invention, and FIG. 7 is a diagram illustrating the configuration of a first artificial neural network module including a first artificial neural network and a second artificial neural network according to an embodiment of the present invention.

Describing the second artificial neural network 312 in detail with reference to FIGS. 6 and 7, the second artificial neural network 312 according to an embodiment is a pre-trained artificial neural network module that takes, as input, second input information 30 obtained by summing several pieces of intermediate information output from the first artificial neural network 311, and outputs second output information 40 containing vocal information that indicates, frame by frame, whether a vocal is present in the audio data contained in the first input information 10.

Accordingly, although not shown in the drawings, the second artificial neural network 312 may include an inference session (not shown) that infers the second output information 40 from the second input information 30, and a training session (not shown) that adjusts and updates the parameters of the second artificial neural network 312 so as to increase the accuracy of the second output information 40, based on the second input information 30, the second output information 40, and second reference data.

Looking at the structure of the second artificial neural network 312 in detail, as shown in FIG. 7, the second artificial neural network 312 may include a convolution block 121 that receives the second input information 30 and a second LSTM block 122 that outputs the second output information 40.

Specifically, the second input information 30 may be the sum of intermediate information output from several blocks of the first artificial neural network 311. More specifically, it may be implemented as the sum of the first intermediate information 11 output from the first ResNet block 112, the second intermediate information 12 output from the second ResNet block 113, the third intermediate information 13 output from the third ResNet block 114, and the fourth intermediate information 14 output from the pooling block 115.

In another embodiment, the second input information 30 may be implemented from information that has passed through a plurality of max-pooling blocks. Specifically, the second input information 30 may be the sum of 1-2 intermediate information 21, generated by passing the 1-1 intermediate information 11 through a first MP block 131 (a first max-pooling block); 2-2 intermediate information 22, generated by passing the 2-1 intermediate information 12 through a second MP block 132 (a second max-pooling block); 3-2 intermediate information 23, generated by passing the 3-1 intermediate information 13 through a third MP block 133 (a third max-pooling block); and the fourth intermediate information 14 that has passed through the pooling block 115. The second input information 30 configured in this way may be input to the second LSTM block 122 through the convolution block 121 and finally output as the second output information 40.
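The construction of the second input information from max-pooled intermediate information can be sketched as follows. The sketch treats each piece of intermediate information as a one-dimensional feature vector and assumes, for illustration only, that each MP block reduces its input to the length of the pooling block's output so that the terms can be added element-wise. The pool sizes and the names `max_pool` and `build_second_input` are hypothetical.

```python
from typing import List, Sequence, Tuple


def max_pool(vec: Sequence[float], size: int) -> List[float]:
    """Non-overlapping 1-D max pooling with window `size`."""
    return [max(vec[i:i + size]) for i in range(0, len(vec) - size + 1, size)]


def build_second_input(
    intermediates: Sequence[Tuple[Sequence[float], int]],
    pooled_output: Sequence[float],
) -> List[float]:
    """Sum the max-pooled intermediate vectors and the pooling block output.

    `intermediates` holds (vector, pool_size) pairs, one per ResNet block.
    """
    total = list(pooled_output)
    for vec, size in intermediates:
        reduced = max_pool(vec, size)
        if len(reduced) != len(total):
            raise ValueError("pooled vector does not match the target length")
        total = [a + b for a, b in zip(total, reduced)]
    return total
```

Pooling each intermediate vector to a common length before summing is what allows features taken from different depths of the first network to be combined into a single input.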

For convenience of explanation, FIG. 6 shows the second artificial neural network 312 as including only the convolution block 121 and the second LSTM block 122. However, depending on the embodiment, the second artificial neural network 312 may include the plurality of MP blocks shown in FIG. 6, and the number of MP blocks may correspond to the number of ResNet blocks included in the first artificial neural network 311.

The second output information 40 output by the second LSTM block 122 may include vocal information for the input audio data.

Specifically, with respect to the audio data contained in the first input information 10, the second output information 40 may include, for each frame, vocal information (V) indicating that a vocal is present or non-vocal information (NV) indicating that no vocal is present.

The second output information 40 output by the second artificial neural network 312 is in effect contained in the first output information output by the first artificial neural network 311. However, whereas the first artificial neural network 311 focuses on outputting pitch information, that is, the pitch of the vocal in the audio data, the second artificial neural network 312 can be understood as focusing on outputting whether each section of the audio data does or does not contain a vocal. Accordingly, given the nature of its output, the second artificial neural network 312 may also be referred to as a vocal information output artificial neural network.

FIG. 8 is a diagram illustrating a first artificial neural network module including a first artificial neural network and a second artificial neural network according to another embodiment of the present invention.

The basic configuration of the first artificial neural network module according to FIG. 8 is the same as that described with reference to FIG. 7, but the configuration of the first artificial neural network 311 differs.

Specifically, as shown in the drawing, the first artificial neural network 311 according to FIG. 8 may include, at the output stage of the first artificial neural network 311 of FIG. 7, a summing block 317 that takes as input the sum of the information output from the first LSTM block 116 and the information output from the second LSTM block 122, and outputs the final information of the first artificial neural network 311. The summing block 317 is composed of a dense layer and an FC layer. It sums the information output from the first LSTM block 316 and the information output from the second LSTM block 122, performs a network operation on the summed information, and outputs the first output information 20.
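The summing block can be sketched as an element-wise sum followed by a fully connected layer. This is a minimal sketch under stated assumptions: the two LSTM outputs for a frame are taken to be vectors of equal length, a single dense layer stands in for the dense and FC layers, and the weights, the bias, and the names `dense` and `summing_block` are hypothetical.

```python
from typing import List, Sequence


def dense(x: Sequence[float], weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row, y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def summing_block(pitch_feat: Sequence[float], vocal_feat: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  bias: Sequence[float]) -> List[float]:
    """Add the two LSTM outputs element-wise, then apply a dense layer."""
    if len(pitch_feat) != len(vocal_feat):
        raise ValueError("both LSTM outputs must have the same length")
    summed = [a + b for a, b in zip(pitch_feat, vocal_feat)]
    return dense(summed, weights, bias)


out = summing_block([1.0, 2.0], [0.5, -1.0],
                    weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                    bias=[0.0, 0.0, 0.5])
```

Summing before the dense layer lets the vocal features influence every output unit of the pitch branch through the same learned weights.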

In addition, although not shown in the drawing, the information output from the summing block (317) may in turn be input to the second artificial neural network (312) and used by it to output vocal information. Specifically, the second artificial neural network (312) may further include a summing block connected in series with the second LSTM block (122), which sums the information output from the second LSTM block (122) and the information output from the summing block (317), performs a network operation on the summed information, and outputs the second output information (40) as final output information.

With the configuration shown in FIG. 8, the vocal information output from the second artificial neural network (312) can be used once more as intermediate input information in the first artificial neural network (311), which outputs pitch information, so that vocal information can be clearly distinguished in the input audio data. Using this information therefore has the advantage that pitch information for the vocal can be output more clearly.

Likewise, the pitch information output from the first artificial neural network (311) can be used once more as intermediate input information in the second artificial neural network (312), which outputs vocal information, with the advantage that vocal information can be distinguished more accurately in the input audio data.

So far, the specific configuration and process of the first artificial neural network module, which corresponds to the music transcription artificial neural network module according to the present invention, have been described. A learning method of the first artificial neural network module is described below.

In inferring pitch information and vocal information, respectively, the first artificial neural network (311) and the second artificial neural network (312) according to the present invention may each be trained on its own. That is, the first artificial neural network (311) may calculate a loss function based only on pitch information and then learn, using first reference data, in the direction of increasing the accuracy of the pitch information, and the second artificial neural network (312) may calculate a loss function based only on vocal information and then learn, using second reference data, in the direction of increasing the accuracy of the vocal information.

In addition, rather than training the first artificial neural network (311) and the second artificial neural network (312) independently, the first artificial neural network module (310) according to the present invention may sum the output information of the first artificial neural network (311) and the second artificial neural network (312) and calculate the loss function on that basis to perform learning. This is described in detail with reference to FIG. 9.

FIG. 9 is a diagram illustrating a method of calculating an integrated loss function of the music transcription artificial neural network module according to the present invention.

Referring to FIG. 9, the first loss function (Lpitch) of the first artificial neural network (311) according to the present invention may be constructed based only on the information output from the first artificial neural network (311). That is, as shown on the left side of FIG. 9, for each frame, information (a1) indicating that pitch information is present or information (a2) indicating that pitch information is absent is generated, and the first loss function (Lpitch) is then generated based on the difference between the generated information and the first reference information.

The second loss function (LVocal) of the second artificial neural network (312) may be generated based on vocal information and melody information. That is, as shown on the right side of FIG. 9, among the components of the first loss function, first vocal information (V1), which is information on the sections with pitch information (a1), and first non-vocal information (NV1), which is information on the sections without pitch information (in effect, sections without a vocal; a2), are combined. Then second vocal information (V2), which is information on the sections in which vocal information is present among the information output from the second artificial neural network (312), and second non-vocal information (NV2), which is information on the sections in which vocal information is absent, are added in to generate the second loss function (LVocal).

The loss function of the entire artificial neural network module, the total loss function Ltotal, may then be expressed as the sum of the first loss function (Lpitch) and the second loss function (LVocal) weighted by a coefficient, specifically as in Equation (1) below.

Equation (1): $L_{total} = L_{pitch} + a \cdot L_{Vocal}$

In Equation (1), a denotes a coefficient; in the present invention, 0.1, 0.5, or 1 may be applied.
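A single-frame numerical sketch of the integrated loss, reading Equation (1) as the weighted sum that the surrounding text describes, might look like the following. Several points are assumptions made only for illustration and are not stated in the disclosure: cross-entropy is used as the loss form, the last index of the pitch distribution is taken to be the "no pitch" class, and the two voicing estimates are combined by adding them and renormalising. All function names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def pitch_loss(pitch_probs, ref_class):
    """Cross-entropy of one frame's pitch distribution against the reference class.

    Assumption: the last index of pitch_probs is the 'no pitch' class (a2).
    """
    return -math.log(pitch_probs[ref_class] + EPS)


def vocal_loss(pitch_probs, vocal_probs, ref_is_vocal):
    """Voicing loss built from both networks' outputs for one frame."""
    v1 = sum(pitch_probs[:-1])   # V1: mass on the pitched classes (a1)
    nv1 = pitch_probs[-1]        # NV1: mass on the 'no pitch' class (a2)
    v2, nv2 = vocal_probs        # V2, NV2: output of the vocal network
    vocal = v1 + v2
    non_vocal = nv1 + nv2
    p_vocal = vocal / (vocal + non_vocal)  # renormalise (an assumption)
    target = p_vocal if ref_is_vocal else 1.0 - p_vocal
    return -math.log(target + EPS)


def total_loss(pitch_probs, vocal_probs, ref_class, a=0.5):
    """L_total = L_pitch + a * L_vocal, with a chosen from {0.1, 0.5, 1}."""
    ref_is_vocal = ref_class != len(pitch_probs) - 1
    return (pitch_loss(pitch_probs, ref_class)
            + a * vocal_loss(pitch_probs, vocal_probs, ref_is_vocal))


# Toy frame: three pitch classes plus 'no pitch'; reference is pitch class 1.
pitch_probs = [0.1, 0.6, 0.2, 0.1]
vocal_probs = (0.8, 0.2)
loss = total_loss(pitch_probs, vocal_probs, ref_class=1, a=0.5)
```

Because the voicing term reuses the pitch network's output, gradients from the voicing loss would also reach the pitch network in a real training setup, which is consistent with the joint parameter adjustment described in the text.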

Because the loss function generated in this way adjusts the parameters of the entire artificial neural network in consideration of both vocal information and pitch information, it has the advantage that vocal information and pitch information can be output more accurately.

So far, the first artificial neural network module, which outputs vocal information and pitch information, has been described in detail. Below, as another embodiment of the present invention, a second artificial neural network module that outputs note-level transcription information using the output information of the first artificial neural network module is described in detail.

FIG. 10 is a diagram illustrating the components of an automatic music transcription apparatus using an artificial neural network according to an embodiment, and FIG. 11 is a diagram for explaining the pre-processing module of the automatic music transcription apparatus according to an embodiment.

Referring to FIG. 10, the automatic music transcription apparatus according to an embodiment may include: a first artificial neural network module (310) that outputs pitch information for first audio data in frame-level units; a pre-processing module (340) that generates first learning data by converting the pitch information output by the first artificial neural network module (310) into note-level pitch information for the first audio data; a first memory module (210) that stores the first learning data; a second memory module (220) that stores second learning data labeled with pitch information in note-level units; a second artificial neural network module (320) that performs learning based on at least one of the first learning data and the second learning data and outputs note-level pitch information for input second audio data in frame-level units; and a post-processing module (350) that converts the frame-level information output from the second artificial neural network module (320) into note-level information.

In the automatic music transcription apparatus according to FIG. 10, the first artificial neural network module (310) and the second artificial neural network module (320) are largely the same in configuration as the artificial neural network modules described with reference to the preceding drawings. As one embodiment, the artificial neural network module according to FIG. 7 may be adopted as the first artificial neural network module (310), and the artificial neural network module according to FIG. 8 may be adopted as the second artificial neural network module (320). In this case, the first artificial neural network module (310) serves as the artificial neural network that generates the learning data the second artificial neural network module (320) needs for learning, and the second artificial neural network module (320) may perform the actual automatic vocal transcription.

Overlapping descriptions of the first artificial neural network module (310) and the second artificial neural network module (320) are omitted, and the other components and the overall process are described. The pre-processing module (340) may process the frame-level pitch information output by the first artificial neural network module (310) and convert it into pseudo-note-level pitch information.

Specifically, as shown in FIG. 10, the pre-processing module (340) converts frame-level pitch information into pseudo-note-level pitch information through two main processes: a pitch quantization process and a rhythm quantization process. The term pseudo-note level is used instead of note level because the data were not collected accurately at the note level from the outset but were derived by processing frame-level pitch information into note-level form, and therefore differ from exact note-level data.

The pitch quantization process performed by the pre-processing module (340) rounds continuous pitch to semitone steps. Specifically, in many pitch estimation models based on artificial neural networks, the output is expressed as a softmax function over quantized pitch values, where the spacing between adjacent pitch values tends to be much smaller than a semitone. The pitch quantization process therefore takes the pitch with the highest confidence and quantizes it to the nearest MIDI note number.
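The pitch quantization step described above can be sketched for a single frame as follows. The sketch assumes the model's output bins are given as centre frequencies in Hz and uses the standard MIDI convention in which A4 = 440 Hz is note number 69; the names `hz_to_midi` and `quantize_frame` are hypothetical.

```python
import math


def hz_to_midi(freq_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number, A4 = 440 Hz = 69."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def quantize_frame(bin_freqs_hz, bin_probs):
    """Pick the most confident pitch bin and round it to the nearest MIDI note.

    bin_freqs_hz: centre frequency of each output bin (finer than a semitone).
    bin_probs:    softmax output of the pitch model for one frame.
    """
    best = max(range(len(bin_probs)), key=lambda i: bin_probs[i])
    return int(round(hz_to_midi(bin_freqs_hz[best])))


# Toy frame: three closely spaced bins around A4.
note = quantize_frame([430.0, 445.0, 470.0], [0.2, 0.7, 0.1])
```

Frames judged to contain no vocal would need a separate rest label; that handling is left out of this sketch.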

When the pitch quantization process is completed, the pre-processing module (340) performs the rhythm quantization process as the next step. The rhythm quantization process snaps the segments of the quantized pitch line to beat-based units.

As an example, the quantized pitch may be smoothed with three median filters, and, to bring the filtered output to beat-based units, the sizes of the median filters at a given tempo may be set to 1/32, 1/16, and 1/12 of a beat, respectively. The number of filters may vary depending on the usage environment, but it was found experimentally that three cascaded filters improved the label quality the most.
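A minimal sketch of the cascaded median filtering is given below. It assumes a fixed frame hop of 10 ms, converts each beat fraction to an odd window length in frames, and shrinks the window at the sequence edges; none of these details are given in the disclosure, and the function names are hypothetical.

```python
def median_filter(values, size):
    """Median filter with an odd window; the window is truncated at the edges."""
    half = size // 2
    n = len(values)
    out = []
    for i in range(n):
        window = sorted(values[max(0, i - half):min(n, i + half + 1)])
        out.append(window[len(window) // 2])
    return out


def filter_size(tempo_bpm, beat_fraction, hop_seconds):
    """Window length in frames for a given fraction of a beat, forced to be odd."""
    frames = (60.0 / tempo_bpm) * beat_fraction / hop_seconds
    size = max(1, int(round(frames)))
    return size if size % 2 == 1 else size + 1


def rhythm_quantize(midi_frames, tempo_bpm, hop_seconds=0.01,
                    fractions=(1 / 32, 1 / 16, 1 / 12)):
    """Apply the three median filters in cascade to a quantized pitch line."""
    out = list(midi_frames)
    for fraction in fractions:
        out = median_filter(out, filter_size(tempo_bpm, fraction, hop_seconds))
    return out


# Toy pitch line: a one-frame spike inside a held note at 120 BPM.
frames = [60] * 10 + [72] + [60] * 10
smoothed = rhythm_quantize(frames, tempo_bpm=120)
```

Because a median filter only ever outputs values already present in its window, the smoothed line stays on the semitone grid produced by pitch quantization.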

When the rhythm quantization process is completed, the pre-processing module (340) removes fragments that are too short to be regarded as singing and, finally, applies a simple rule that minimizes octave errors, which completes the pre-processing. The data generated in this way are stored in the first memory module (210) and may be used as learning data when the second artificial neural network module learns.
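The removal of short fragments, together with the grouping of frames into notes, can be sketched as follows. The minimum length, the use of 0 as the rest label, and the function name `frames_to_notes` are assumptions for illustration; the octave-error rule is not specified in the text and is therefore not sketched.

```python
def frames_to_notes(midi_frames, min_frames=3, rest=0):
    """Group runs of equal pitch into notes and drop runs shorter than min_frames.

    Returns a list of (onset_frame, offset_frame, midi_pitch) tuples, where
    offset_frame is exclusive. Multiply the frame indices by the hop size
    to obtain times in seconds.
    """
    notes = []
    start = 0
    n = len(midi_frames)
    for i in range(1, n + 1):
        # A run ends at the end of the sequence or when the pitch changes.
        if i == n or midi_frames[i] != midi_frames[start]:
            pitch = midi_frames[start]
            if pitch != rest and (i - start) >= min_frames:
                notes.append((start, i, pitch))
            start = i
    return notes


# Toy pitch line: the two-frame fragment at pitch 62 is discarded.
frames = [0, 0, 60, 60, 60, 60, 60, 62, 62, 0, 64, 64, 64, 64, 64, 64]
notes = frames_to_notes(frames, min_frames=3)
```

Each resulting tuple corresponds to one pseudo-note with an onset, an offset, and a pitch.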

The data stored in the second memory module (220) are data that include pitch information labeled at the note level.

The second artificial neural network module (320) may include a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information (60), pitch information including the pitch height of the vocal in note units for the second audio data, and a fourth artificial neural network that takes, as fourth input information, intermediate output information output from a block constituting the third artificial neural network and outputs, as fourth output information, vocal information on the presence or absence of a vocal in frame units for the second audio data.

The third artificial neural network and the fourth artificial neural network of the second artificial neural network module (320) correspond to the first artificial neural network (311) and the second artificial neural network (312) of the first artificial neural network module (310), respectively, and most of their components are the same. However, as mentioned above, as one embodiment, the artificial neural network module according to FIG. 7 may be adopted as the first artificial neural network module (310), and the artificial neural network module according to FIG. 8 may be adopted as the second artificial neural network module (320).

As for the learning method of the second artificial neural network module (320), the second artificial neural network module (320) may perform learning based on at least one of the first learning data stored in the first memory module (210) and the second learning data stored in the second memory module (220).

In general, in vocal transcription, note-level transcription, which outputs transcription information as onset, offset, and note information, can convey more accurate transcription information than frame-level transcription in the frequency domain.

However, in note-level transcription, different people may sing the same note differently, and many other variables exist, so it is very difficult to predict notes accurately with heuristic methods. Most importantly, training an artificial neural network requires data labeled at the note level, but very little data is labeled at the score level and accurately aligned with the audio, which makes it very difficult to train an artificial neural network.

By using the first artificial neural network module (310) and the pre-processing module (340), however, the automatic vocal transcription apparatus according to the present invention can easily convert existing frame-level pitch information into pseudo-note-level pitch information, which has characteristics similar to actual note-level units. Because the second artificial neural network module (320) can perform learning based on the data generated in this way, its output has characteristics very similar to note-level data, which has the advantage of increasing the efficiency of a note-level automatic transcription apparatus.

That is, because the components of the second artificial neural network module (320) are very similar to those of the first artificial neural network module (310), the model itself predicts results at the frame level. However, it is trained on the pseudo-note-level first learning data and the note-level second learning data, and time-series learning is performed together in the LSTM stage that produces the output, so the final output information has characteristics very similar to note-level information, which in effect yields note-level transcription information.

The post-processing module (350) may convert the frame-level output information output from the second artificial neural network module (320) into note-level transcription information. In effect, the post-processing module (350) may convert frame-level pitch information into note-level pitch information and output it through the same pitch quantization process and rhythm quantization process performed by the pre-processing module (340) described above. Since these processes have been described in detail above, their description is omitted here.

FIG. 12 is a diagram illustrating the components of an automatic music transcription apparatus using an artificial neural network according to another embodiment.

Referring to FIG. 12, the automatic music transcription apparatus according to an embodiment may include: a first artificial neural network module (310) that outputs pitch information for first audio data in frame-level units; a pre-processing module (340) that generates first learning data by converting the pitch information output by the first artificial neural network module (310) into note-level pitch information for the first audio data; a first memory module (210) that stores the first learning data; a second memory module (220) that stores second learning data labeled with pitch information in note-level units; a second artificial neural network module (320) that performs learning based on at least one of the first learning data and the second learning data and outputs note-level pitch information for input second audio data in frame-level units; a first post-processing module (351) that converts the frame-level information output from the second artificial neural network module (320) into note-level information; a third memory module (230) in which third learning data, which is random noise learning data, is stored; a third artificial neural network module (330) that performs learning based on at least one of the information output from the first post-processing module (351) and the third learning data and outputs note-level pitch information for input third audio data in frame-level units; and a second post-processing module (352) that converts the frame-level information output from the third artificial neural network module (330) into note-level information.

In the automatic music transcription apparatus according to FIG. 12, the first artificial neural network module (310), the second artificial neural network module (320), and the third artificial neural network module (330) are largely the same in configuration as the artificial neural network modules described with reference to the preceding drawings. As one embodiment, the artificial neural network module according to FIG. 7 may be adopted as the first artificial neural network module (310), and the artificial neural network module according to FIG. 8 may be adopted as the second artificial neural network module (320) and the third artificial neural network module (330). In this case, the first artificial neural network module (310) serves as the artificial neural network that generates the learning data the second artificial neural network module (320) needs for learning, and the second artificial neural network module (320) performs the actual automatic vocal transcription while also outputting the data on which the third artificial neural network module (330) learns. That is, when the third artificial neural network module (330) performs learning, the second artificial neural network module (320) acts as the teacher model that outputs more accurate information, and the third artificial neural network module (330) acts as the student model in that it learns based on the information output by the second artificial neural network module (320). In other words, the automatic music transcription apparatus according to FIG. 12 is characterized in that the third artificial neural network module (330) performs learning based on a semi-supervised learning method.

Overlapping descriptions of the first artificial neural network module (310), the second artificial neural network module (320), and the pre-processing module (340) are omitted. The first post-processing module (351) and the second post-processing module (352) also play the same role as the post-processing module (350) described above, so a detailed description of them is omitted.

The third artificial neural network module (330) may include a fifth artificial neural network that takes third audio data in the frequency domain as fifth input information and outputs, as fifth output information, pitch information including the pitch height of the vocal in note units for the third audio data, and a sixth artificial neural network that takes, as sixth input information, intermediate output information output from a block constituting the fifth artificial neural network and outputs, as sixth output information, vocal information on the presence or absence of a vocal in frame units for the third audio data.

The fifth artificial neural network and the sixth artificial neural network of the third artificial neural network module (330) correspond to the third artificial neural network and the fourth artificial neural network of the second artificial neural network module (320), respectively, and most of their components are the same. However, as mentioned above, as one embodiment, the artificial neural network module according to FIG. 7 may be adopted as the first artificial neural network module (310), and the artificial neural network module according to FIG. 8 may be adopted as the second artificial neural network module (320) and the third artificial neural network module (330).

As for the learning method of the third artificial neural network module (330), the third artificial neural network module (330) is characterized by performing semi-supervised learning, with the second artificial neural network module (320) as the teacher, based on the third learning data stored in the third memory module (230) and the data output by the second artificial neural network module (320).

The semi-supervised learning according to the present invention may be carried out in several ways. In the first method, unlabeled data is input to both the trained second artificial neural network module (320) and the third artificial neural network module (330), the resulting outputs are compared with each other, and learning is performed so as to reduce the difference.

In the second method, random noise data is added to the unlabeled data, the mixed data is input to both the trained second artificial neural network module (320) and the third artificial neural network module (330), the resulting outputs are compared with each other, and learning is performed so as to reduce the difference.

In the third method, only the unlabeled data is input to the second artificial neural network module (320), while the unlabeled data mixed with random noise data is input to the third artificial neural network module (330); the resulting outputs are compared with each other, and learning is performed so as to reduce the difference.

All three methods described above produced output results with relatively higher accuracy than the prior art, and, among them, the third method, in which random noise data is added only for the third artificial neural network module (330) in the student role, produced the most accurate results.
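The three variants differ only in which model receives the noisy input, which the following toy sketch makes explicit. It is an illustration under stated assumptions rather than the disclosed training procedure: the noise is modelled as additive Gaussian noise on the input features, the discrepancy between teacher and student outputs is measured with cross-entropy, and the models are stand-in functions. All names are hypothetical.

```python
import math
import random


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between the teacher's and the student's output distributions."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def add_noise(features, scale, rng):
    """Additive Gaussian noise on the input features (an assumed noise model)."""
    return [x + rng.gauss(0.0, scale) for x in features]


def consistency_loss(teacher, student, features, rng,
                     noise_teacher=False, noise_student=True, scale=0.1):
    """Loss to minimise with respect to the student's parameters.

    Method 1: noise_teacher=False, noise_student=False
    Method 2: noise_teacher=True,  noise_student=True
    Method 3: noise_teacher=False, noise_student=True  (default here)
    """
    teacher_in = add_noise(features, scale, rng) if noise_teacher else features
    student_in = add_noise(features, scale, rng) if noise_student else features
    return cross_entropy(teacher(teacher_in), student(student_in))


def make_toy_model(bias):
    """Stand-in for a trained network: softmax over (features + bias)."""
    def model(features):
        logits = [x + b for x, b in zip(features, bias)]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]
    return model


rng = random.Random(0)
teacher = make_toy_model([0.0, 0.5, -0.5])
student = make_toy_model([0.1, 0.0, 0.0])
unlabeled_frame = [0.2, 1.0, -0.3]
loss = consistency_loss(teacher, student, unlabeled_frame, rng)
```

In an actual training loop only the student's parameters would be updated from this loss, while the teacher stays fixed.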

So far, the method and apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment have been described in detail with reference to the drawings.

Accordingly, the method and apparatus for automatic note-level vocal transcription using an artificial neural network according to an embodiment can perform transcription that is both flexible and accurate for sound sources of various genres, and can transcribe the vocal melody directly from polyphonic music without a source separation pre-processing step, which has the advantage of greatly improving transcription speed compared with the prior art.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the operating system.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known to and usable by those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

100: user terminal 110: recording unit
120: communication unit 130: MIDI generation unit
140: display unit 200: music transcription apparatus
300: processor 310: first artificial neural network module
320: second artificial neural network module 400: memory module
500: additional service module 600: user management module

Claims

An automatic vocal transcription apparatus using an artificial neural network, comprising:
one or more processors; and
a memory module storing instructions executable by the one or more processors,
wherein the processor comprises:
a first artificial neural network that takes audio data in the frequency domain as first input information and outputs, as first output information, pitch information including the pitch height of the vocal for each frame of the audio data; and
a second artificial neural network that takes, as second input information, intermediate output information output from a block constituting the first artificial neural network, and outputs, as second output information, vocal information indicating whether a vocal is present for each frame of the audio data.

The apparatus of claim 1,
wherein the pitch information includes, for each frame of the audio data, information on whether a vocal is present and information on the pitch height of the vocal.

The apparatus of claim 1,
wherein the first artificial neural network comprises a convolution block that receives the first input information, and at least one residual block (ResNet block) and a pooling block that receive the output information of the convolution block as input information, and
wherein the second input information consists of the sum of the intermediate information output from the residual block and the pooling block.

The apparatus of claim 3,
wherein the first artificial neural network further comprises a first LSTM block that receives, as input information, the output information of the pooling block, and
wherein the second artificial neural network further comprises a second LSTM block that receives the second input information.

A note-level automatic vocal transcription apparatus using an artificial neural network, comprising:
one or more processors; and
a memory module storing instructions executable by the one or more processors,
wherein the processor comprises:
a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including the pitch height of the vocal for each frame of the first audio data;
a pre-processing module that converts the first output information into first learning data including note-level vocal information; and
a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including the pitch height of the vocal for each note of the second audio data,
wherein the third artificial neural network is trained on the basis of the first learning data.
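The conversion performed by the pre-processing module, from frame-level pitch output to note-level learning data, can be sketched as follows. This is only an illustration under stated assumptions: it assumes the frame-level output has already been reduced to one MIDI pitch number per frame (with `None` for frames without a vocal), groups consecutive frames of equal pitch into one note, and discards runs shorter than a hypothetical `min_frames` threshold. The function name `frames_to_notes` and the note tuple layout are introduced here for illustration and are not taken from the embodiment.

```python
from typing import List, Optional, Tuple

# (onset_frame, offset_frame_exclusive, midi_pitch); layout assumed for this sketch
Note = Tuple[int, int, int]

def frames_to_notes(frame_pitches: List[Optional[int]],
                    min_frames: int = 2) -> List[Note]:
    """Group consecutive frames with the same pitch into note-level segments."""
    notes: List[Note] = []
    start = 0
    current: Optional[int] = None
    # A trailing None acts as a sentinel so the final run is flushed.
    for i, pitch in enumerate(list(frame_pitches) + [None]):
        if pitch != current:
            # Close the previous run if it was voiced and long enough.
            if current is not None and i - start >= min_frames:
                notes.append((start, i, current))
            start = i
            current = pitch
    return notes

# Frame-level pitch track: None marks frames where no vocal is present.
frames = [None, 60, 60, 60, None, 62, 64, 64, None]
notes = frames_to_notes(frames)
```

With the default threshold, the single-frame run at pitch 62 is treated as a spurious detection and dropped, so the example yields two notes: pitch 60 over frames 1 to 3 and pitch 64 over frames 6 to 7.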

The apparatus of claim 5,
further comprising a pre-trained fourth artificial neural network that takes, as fourth input information, intermediate output information output from a block constituting the third artificial neural network, and outputs, as fourth output information, information on whether a vocal is present for each note of the second audio data.

The apparatus of claim 5,
wherein the processor trains the third artificial neural network on the basis of the first learning data and second learning data labeled with note-level pitch height information of the vocal.

The apparatus of claim 5,
further comprising a fifth artificial neural network that takes third audio data in the frequency domain as fifth input information and outputs, as fifth output information, pitch information including the pitch height of the vocal for each note of the third audio data,
wherein the processor trains the fifth artificial neural network using the third output information output by the third artificial neural network as third learning data.

A note-level automatic vocal transcription apparatus using an artificial neural network, comprising:
one or more processors; and
a memory module storing instructions executable by the one or more processors,
wherein the processor comprises:
a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including the pitch height of the vocal for each frame of the first audio data;
a pre-processing module that converts the first output information into first learning data including note-level vocal information; and
a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including the pitch height of the vocal for each note of the second audio data,
wherein the third artificial neural network is trained on the basis of the first learning data, and
wherein the memory module comprises a first memory module that stores the first learning data.

A note-level automatic vocal transcription method using a processor that includes an artificial neural network, the method comprising:
a first output information output step in which the processor outputs first output information using a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as the first output information, pitch information including the pitch height of the vocal for each frame of the first audio data;
a pre-processing step in which the processor converts the first output information into first learning data including note-level vocal information;
a first learning data storage step in which the processor stores the first learning data in a first memory module;
a third output information output step in which the processor outputs third output information using a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as the third output information, pitch information including the pitch height of the vocal for each note of the second audio data; and
a step in which the processor trains the third artificial neural network on the basis of the first learning data.