KR101079697B1

KR101079697B1 - High speed processing method for image data using parallel processors on graphics processing unit

Info

Publication number: KR101079697B1
Application number: KR1020090094302A
Authority: KR
Inventors: 김선태; 이선엽; 홍창희
Original assignee: 주식회사 글로벌미디어테크
Priority date: 2009-10-05
Filing date: 2009-10-05
Publication date: 2011-11-03
Also published as: KR20110037051A

Abstract

The present invention relates to a high-speed image processing method using the parallel processors of a general-purpose graphics processing unit. In a method that processes still-image or video input data by performing a discrete wavelet transform (DWT) and its inverse, the invention is characterized by building a multiple memory buffer structure in the device area on the graphics processing unit (GPU) side, which separates the data space to be referenced from the data space to be updated and thereby improves computation speed. Using this multiple memory buffer structure, the method comprises: copying input data from the host area on the CPU side to the device area; and executing kernels in parallel on the GPU to perform the operations of the discrete wavelet transform or its inverse.

Discrete Wavelet Transform, JPEG 2000, Threads, Blocks, Convolution, CPU, GPU

Description

HIGH SPEED PROCESSING METHOD FOR IMAGE DATA USING PARALLEL PROCESSORS ON GRAPHICS PROCESSING UNIT

The present invention relates to a method of encoding and decoding still images and video at high speed, and in particular to a method of executing, on a graphics processing unit (GPU), the discrete wavelet transform (DWT) and inverse transform performed during encoding and decoding.

In the information society, how quickly and without error vast amounts of data can be processed by efficiently compressing still images, video, and multimedia information has become a very important issue. Since JPEG (Joint Photographic Experts Group) was adopted as an international standard in 1992, it has been used in a wide range of multimedia applications. However, despite being easy to implement, JPEG is based on lossy compression and shows marked performance degradation in noisy, low-bit-rate environments. To solve these problems, development of a new standard was undertaken, and in 2000 Part 1 of the JPEG 2000 standard was published as international standard ISO/IEC 15444-1.

JPEG 2000 consists of 13 parts. Part 1 (Core coding system), which has been published, defines the core format, including the codestream syntax and the required encoding and decoding steps. Part 2 (Extensions) defines extended formats for flexibility in areas such as the DWT, coefficient quantization, and ROI; Part 3 (Motion JPEG 2000) defines the motion sequence file format; Part 4 (Conformance) defines test procedures for the encoding and decoding processes; Part 5 (Reference Software) provides codec software implementations that conform to Part 1; and Part 6 (Compound Image File Format) defines multi-object composition per page for document images, fax, and the like. Motion JPEG 2000 in particular encodes and decodes each frame independently, which makes frame-level editing easy, and it has been adopted as the image compression standard for digital cinema.

In JPEG 2000, the source image data is partitioned in a tiling step; each tile component is transformed by the DWT (discrete wavelet transform), quantized, and then compressed by entropy encoding. The compressed image data is stored or transmitted, and decoding follows nearly the reverse order of encoding.

JPEG 2000 has several advantages: it is free of patent license fees; it is implemented as an open architecture so that it can be optimized for other image formats and coding applications; it achieves visually lossless compression even at high compression ratios, which makes it highly practical; and, owing to the DWT, image quality does not degrade under repeated encoding and decoding. JPEG 2000 thus provides the essential technical requirements of high-end and newly emerging applications, meeting the needs of fields such as digital broadcasting, digital cinema, digital video editing, remote sensing, medical imaging, satellite imagery, digital archives, color facsimile, printing, scanning, mobile, and the Internet.

Despite these advantages, the JPEG 2000 compression algorithm requires about 30 times more computation than JPEG because of the DWT and the EBCOT (Embedded Block Coding with Optimized Truncation) encoding engine. This increase in computation causes a practical difficulty: processing speed drops when large, high-resolution images are processed.

To address this problem, products have been sold that speed up real-time processing of high-quality video with dedicated hardware such as embedded DSPs and FPGA chips. That approach, however, brings another problem: increased cost. Because the inventors aim to solve the problem fundamentally and cost-effectively, they consider that the disadvantages of such products will outweigh their advantages in the market.

Meanwhile, Korean Patent No. 771153, "Discrete Wavelet Transform Apparatus and Method" (hereinafter "Patent Document 1"), proposes a lifting-based DWT as a way to reduce the computation of the DWT and speed it up. Patent Document 1 also discloses a method of reducing the size of the internal memory required for the DWT by changing the input order of the image pixels. However, the lifting-based DWT is already known from the standard, and it is merely somewhat faster than the convolution-based DWT, so it does not bring a large technical improvement.

Korean Patent Application Publication No. 10-2008-0073511, "Graphics Processing Unit Based High-Speed Image Processing Method and Apparatus" (hereinafter "Patent Document 2"), discloses a method of implementing a high-speed compression algorithm for JPEG 2000 still images using a GPU. GPU performance has advanced rapidly in recent years, and as the programmability of the GPU itself has gradually expanded, a basis has been created for using GPUs for general purposes other than graphics processing. Noting that the processing speed of the latest GPUs exceeds that of the latest CPUs several times over, Patent Document 2 implements a high-speed JPEG 2000 still-image compression algorithm through general-purpose use of the GPU with a high-level shading language. In particular, to run the DWT, the core module of the JPEG 2000 algorithm, in a fragment shader on the GPU, it proposes a structure that uses render-to-texture with framebuffer objects, enabling high-speed execution on the GPU.

The GPU is the core module of a graphics card. With advances in graphics computation for 3D games and the introduction of physics engines, floating-point performance improved greatly and programming on the GPU became possible, and research on applying GPUs to general computation is now active. This technical trend is collectively called GPGPU (General-Purpose computation on Graphics Processing Units), also known as GPU computing, and it has been rated the most suitable solution for HPC (High Performance Computing) (GPGPU.org, http://gpgpu.org/).

Conventional computer systems have depended on the computing power of the CPU (Central Processing Unit). However, raising integration density and clock speed in CPU manufacturing led to energy-efficiency and heat problems, making further performance gains difficult. CPU performance is therefore now being improved by combining multiple cores instead. The GPU, by contrast, which once merely handled drawing the screen, has kept pace with rising screen resolutions and the demand for fast graphics data computation in video and games, and now outperforms the CPU when processing large amounts of data. For example, a recent GPU has 240 cores with built-in floating-point and arithmetic units and delivers close to 1 TFLOPS, about 30 times or more the computing performance of a CPU at the same price.

GPGPU technology developed into shader-based GPGPU, which processes general computation using the GPU's programmable graphics pipeline. However, graphics APIs such as OpenGL and DirectX, which are used with shader-based pipeline techniques such as vertex shaders, fragment shaders, and geometry shaders, demand considerable graphics expertise. Developers and researchers in other fields who were not familiar with high-level shading languages therefore found them hard to approach. Patent Document 2 thus achieved a certain gain in image processing speed by performing the DWT on a GPU, but it did not escape the limits of shader-based GPGPU. That is, Patent Document 2 maps the input data to texture data in order to use the GPU's graphics pipeline, which requires a deep understanding of the graphics pipeline and an additional step to convert the result from texture data back into the desired data format.

An object of the present invention is to provide a low-cost, software-based method that makes large-volume, high-quality image computation easier and faster. The method processes the discrete wavelet transform and inverse transform required in encoding and decoding in parallel on a general-purpose video card equipped with a GPU, and thereby greatly improves the computation speed of large-volume data transformation.

Another object of the present invention is to process the discrete wavelet transform and inverse transform required in encoding and decoding in parallel more efficiently on the GPU, and to overcome the drawbacks of shader-based GPGPU as in Patent Document 2. That is, the invention presents a method that directly controls the parallel processors of the GPU and builds a memory model exploiting the memory hierarchy of the graphics card so that the discrete wavelet transform and its inverse are performed effectively. A further object is that such a method be easy to implement in programming languages widely used by developers, such as C or Fortran.

Any other object not stated here is still to be regarded as included in the objects of the present invention if a person skilled in the art can readily infer it from this specification.

To solve the above problems, the present invention provides a method of processing images by performing a discrete wavelet transform (DWT) and its inverse on still-image or video input data, wherein

a multiple memory buffer structure is built in the device area on the graphics processing unit (GPU) side to separate the data space to be referenced from the data space to be updated, and the method comprises:

copying input data from the host area on the CPU side to the device area; and

executing kernels in parallel on the GPU to perform the operations of the discrete wavelet transform or its inverse.

In a preferred embodiment of the high-speed image processing method using the parallel processors of a general-purpose graphics processing unit, variables and filter coefficients that are accessed frequently in the computation are preferably declared as constant variables so that they are stored in, and read from, the fast-access constant memory area of the device area.

The method preferably further comprises computing the optimal number of threads for the computation from the number of blocks per grid and the number of threads per block given by Formula (1), and creating that number of threads.

Formula (1)

(Nbg: number of blocks per grid; Ntb: number of threads per block; N: amount of data to be processed)
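The expression of Formula (1) is not reproduced in this text, so the following is only a minimal sketch of one widely used convention, assuming that the number of blocks per grid is the ceiling of N divided by the number of threads per block. The function name `blocks_per_grid` and the sample values are hypothetical, and the relation shown is not claimed to be identical to Formula (1).

```python
def blocks_per_grid(n_items: int, threads_per_block: int) -> int:
    """Smallest number of blocks whose threads cover n_items.

    Assumed convention (ceiling division); not taken from Formula (1).
    """
    if n_items < 0 or threads_per_block <= 0:
        raise ValueError("n_items must be >= 0 and threads_per_block > 0")
    return (n_items + threads_per_block - 1) // threads_per_block


# Hypothetical example: one thread per pixel of a 1920 x 1080 frame.
n = 1920 * 1080          # N: amount of data to be processed
n_tb = 256               # Ntb: threads per block (assumed value)
n_bg = blocks_per_grid(n, n_tb)   # Nbg: blocks per grid
```

With this convention the product Nbg × Ntb is never smaller than N, so every data element is assigned at least one thread.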

In the method, the multiple memory buffer structure built in the device area for the DWT preferably consists of:

a first memory buffer that receives the data copied from the host area;

a second memory buffer that is loaded with data from the first memory buffer and is referenced by the operations on data in the vertical direction; and

a third memory buffer that is updated with the results of the vertical-direction operations and is referenced by the operations on data in the horizontal direction.
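The separation of reference and update buffers can be illustrated with a minimal CPU-side sketch. It assumes a hypothetical 3-tap smoothing filter with clamped borders in place of the actual wavelet filters, and it assumes that the horizontal pass writes its results back into the second buffer; neither choice is stated in the text. Each pass reads only from its reference buffer and writes only to its update buffer, so no output element depends on a value that another element has already overwritten.

```python
TAPS = (0.25, 0.5, 0.25)  # hypothetical filter, NOT a JPEG 2000 wavelet filter


def clamp(i, n):
    """Clamp index i into the valid range 0..n-1."""
    return max(0, min(n - 1, i))


def vertical_pass(ref, out):
    """Filter each column: read only from ref, write only to out."""
    h, w = len(ref), len(ref[0])
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(t * ref[clamp(y + k - 1, h)][x]
                            for k, t in enumerate(TAPS))


def horizontal_pass(ref, out):
    """Filter each row: read only from ref, write only to out."""
    h, w = len(ref), len(ref[0])
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(t * ref[y][clamp(x + k - 1, w)]
                            for k, t in enumerate(TAPS))


def forward_two_pass(host_image):
    h, w = len(host_image), len(host_image[0])
    buf1 = [row[:] for row in host_image]    # first buffer: copy from host
    buf2 = [row[:] for row in buf1]          # second buffer: vertical reference
    buf3 = [[0.0] * w for _ in range(h)]     # third buffer: vertical results
    vertical_pass(buf2, buf3)
    horizontal_pass(buf3, buf2)              # assumption: results go back to buf2
    return buf2
```

On a GPU each (x, y) iteration would correspond to one thread; the loops here only stand in for that parallel execution.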

For the inverse DWT, the multiple memory buffer structure built in the device area preferably consists of:

a second memory buffer that is loaded with data from the first memory buffer and is referenced by the operations on data in the horizontal direction; and

a third memory buffer that is updated with the results of the horizontal-direction operations and is referenced by the operations on data in the vertical direction.

In a preferred embodiment, the coordinates of each memory buffer are arranged in an (x, y) coordinate system.

In a preferred embodiment, the computation further includes an extension processing step, which can be implemented with multiple threads on the GPU and processed in parallel.

The method further comprises calculating a repetition unit of the data from the values of its width and height,

and the coordinates of the data are preferably laid out in two dimensions using this repetition unit and the size of the pad (PAD) required for the filter operation.
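The text does not spell out how the repetition unit is computed. As a hedged illustration, the sketch below assumes the whole-sample symmetric extension commonly used with JPEG 2000, in which a row of n samples repeats with period 2(n − 1). The names `mirror_index` and `extend_row` are hypothetical. Because each padded sample depends only on its own index, every sample could be filled by an independent thread.

```python
def mirror_index(i: int, n: int) -> int:
    """Map any integer index onto 0..n-1 by whole-sample symmetric extension.

    Assumed repetition unit: period = 2 * (n - 1).
    """
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period                      # non-negative in Python for period > 0
    return i if i < n else period - i


def extend_row(row, pad):
    """Return the row with `pad` mirrored samples added on both sides."""
    n = len(row)
    return [row[mirror_index(i, n)] for i in range(-pad, n + pad)]


extended = extend_row([1, 2, 3, 4], 2)   # [3, 2, 1, 2, 3, 4, 3, 2]
```

The same index mapping applies to columns with the height in place of the width, which gives the two-dimensional layout of padded coordinates.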

Preferably, the discrete wavelet transform (DWT) and its inverse are convolution-based.

As described above, the present invention processes the discrete wavelet transform and its inverse in parallel on the parallel processors of a general-purpose graphics processing unit (GPU), and thereby achieves a substantial speedup over conventional CPU-based computation.

The speedup is obtained not with expensive dedicated hardware but by using the parallel processors of a general-purpose graphics processing unit through software, so the invention has considerable economic value.

Furthermore, in running the discrete wavelet transform and inverse transform algorithms, the core operations of the JPEG 2000 compression format, on the GPU as in the embodiment, the invention enables high-speed parallel processing by creating a specially designed memory model in the host area and the device area, organizing the thread set for optimal multithreading, and having the kernel program run transform and inverse transform algorithms suited to multithreaded parallel computation.

Even where an effect is not specifically mentioned in this specification, any potential effect expected from the technical features of the invention is to be treated as if described herein.

Specific embodiments of the invention are described below with reference to the accompanying drawings. Where a detailed description of related well-known functions or other matters obvious to those skilled in the art could unnecessarily obscure the gist of the invention, it is omitted.

As one embodiment, the following describes high-speed processing of the discrete wavelet transform used in the JPEG 2000 compression format, and its inverse, on the parallel processors of a GPU. The discrete wavelet transform and inverse transform algorithms below are convolution-based implementations.
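To make the convolution-based formulation concrete, here is a minimal one-dimensional sketch. It assumes the 5/3 analysis filter pair in floating point without rounding and whole-sample symmetric extension at the borders; the text does not state which filter the embodiment uses, so this choice and the name `dwt53_1d` are illustrative assumptions only. Each output coefficient is a short dot product over the input, so each could be computed by its own thread.

```python
# Assumed 5/3 analysis filters (floating point, no rounding).
LOW = (-0.125, 0.25, 0.75, 0.25, -0.125)   # centered on even samples
HIGH = (-0.5, 1.0, -0.5)                   # centered on odd samples


def mirror(i, n):
    """Whole-sample symmetric extension of index i onto 0..n-1."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def dwt53_1d(x):
    """One level of a convolution-based 1-D DWT: returns (low, high)."""
    n = len(x)
    low = [sum(c * x[mirror(2 * k + j - 2, n)] for j, c in enumerate(LOW))
           for k in range((n + 1) // 2)]
    high = [sum(c * x[mirror(2 * k + j, n)] for j, c in enumerate(HIGH))
            for k in range(n // 2)]
    return low, high
```

A two-dimensional transform applies the same routine along one direction and then the other, which is where separate reference and update buffers become useful.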

<Flow of parallel processing using the GPU>

FIG. 1 shows the general flow of parallel processing using a GPU. The areas in which processing instructions execute are divided into the CPU area and the GPU area. In GPGPU parallel computing, the area where the CPU (110) resides is called the host area (100), and the area where the GPU (210; more precisely, a number of GPU processor cores) resides, for example a video card with a GPU such as the GeForce 9600GT, is called the device area (200). For multiprocessing on a GPU, a clear distinction between the host area (100) and the device area (200) is essential, because neither area can directly reference variables created in the other.

The host area (100) may comprise the CPU (110) and the host's main memory (120). The device area (200) may comprise the GPU (210) and the graphics card memory (220) that the GPU (210) can reference during computation.

Because the host area (100) and the device area (200) are separate, exchanging data between them goes through the following process.

First, assume that there is an operation to perform and data for it to process. Data in the host area (100) cannot be directly referenced or updated from the device area (200). Therefore, for the device area (200) to process host data, the CPU (110) composes the operation to be performed and hands it to the GPU (210) (P1, Copy Processing Data), and then must copy the data the operation needs into the device memory (220) of the GPU (210) (P2, Instruct the Processing). The program handed over from the host area (100) and loaded in the device area (200) is called a kernel program. For a kernel in the device area (200) to use a value taken as input in the host area (100), the memory must be copied from host to device before use. Variables (__device__) that correspond to variables in the host area (100) are likewise copied to the device area (200). A variable that is referenced frequently in the device area (200) but rarely changes once set can be declared as a constant variable (__constant__) for faster access.

A variable declared as constant is loaded into the device's constant memory, so inside a device kernel no special method is needed to access it; it can be used like an ordinary global variable.

Kernel programs are created inside the GPU in a quantity determined by the precomputed amount of computation. The quantity depends on the operation; in one embodiment of the invention it ranges from roughly 200,000 at minimum to 8.8 million at maximum. A kernel program performs a fixed operation according to its instructions, and the data it references must reside inside the device area (200). Within the device area, the kernel program references, updates, copies, and moves data (P3, Execute parallel in each core). The computation can be carried out in parallel by the many cores (201) in the GPU (210). The result of the kernel program is stored in the memory (220) of the device area (200), and only when that result is copied from device memory to the memory (120) of the host area (100) can the CPU (110) reference it (P4, Copy the result).
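The host/device round trip described above can be mimicked with a small CPU-only model. This is a sketch under stated assumptions, not a GPU program: the `Device` class, its `upload`, `launch`, and `download` methods, and `scale_kernel` are all hypothetical names, and the loop over thread ids only stands in for parallel execution on the cores. The point it illustrates is that the kernel sees only copies held in device memory, and the host sees the result only after it is copied back.

```python
import copy


class Device:
    """Toy stand-in for the device area: memory the host cannot touch directly."""

    def __init__(self):
        self.memory = {}

    def upload(self, name, host_data):
        # Host-to-device copy of the data to be processed.
        self.memory[name] = copy.deepcopy(host_data)

    def launch(self, kernel, n_threads, *names):
        # One kernel instance per thread id; sequential here, parallel on a GPU.
        buffers = [self.memory[n] for n in names]
        for tid in range(n_threads):
            kernel(tid, *buffers)

    def download(self, name):
        # Device-to-host copy of the result.
        return copy.deepcopy(self.memory[name])


def scale_kernel(tid, src, dst):
    """Hypothetical kernel: each thread doubles one element."""
    if tid < len(src):
        dst[tid] = 2 * src[tid]


host_in = [1, 2, 3, 4]
dev = Device()
dev.upload("src", host_in)
dev.upload("dst", [0] * len(host_in))
dev.launch(scale_kernel, 4, "src", "dst")
host_out = dev.download("dst")   # [2, 4, 6, 8]
```

Changing `host_in` after the upload has no effect on the device copy, which mirrors the rule that neither area can reference the other's variables directly.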

The host area (100) holds the program that directs the process described above. This program composes a parallel computation, delegates it to the GPU (210), and waits for the result. Because of the difference in multiprocessor scale, parallel computation on the GPU (210) offers far greater capacity and speed than computation on the CPU (110), so delegating the computation accelerates processing.

<Multi-thread structure of the GPU>

FIG. 2 shows, as a logical concept, the internal structure of the GPU (210) that allows operations to be processed in parallel. Physically, the GPU (210) contains N multiprocessors (MPs) capable of floating-point operations, and each multiprocessor contains M scalar processors (SPs) that perform operations such as trigonometric functions. FIG. 2 conceptually models the physical structure of these N×M processors from the viewpoint of threads. Conceptually, the multi-thread structure of a typical GPU (210) consists of a grid (2000) that contains a number of thread blocks (2100), each of which contains a number of threads (2110).

As in the physical model, the conceptual model of the GPU 210 has N×M threads 2110. The execution scheme under this conceptual model is generally called SIMT (Single Instruction Multiple Threads), a term for the parallel processing of the GPU 210 in which one instruction code is executed by many threads 2110 simultaneously. In multi-threaded computation on the GPU there is no limit on the total number of threads 2110. Thread groups are defined automatically according to the actual physical configuration of each GPU 210, and a thread manager built into the GPU 210 can create the groups automatically.

<Calculating the optimal workload from the input data size>

The total number of threads 2110 can be defined as a product of the grid 2000 and block 2100 dimensions, that is, threads per block × blocks per grid. However, the number of threads created on the GPU 210 does not exactly match the number of threads the computation needs; more threads than needed may be created, in some cases considerably more. Because one kernel instance normally runs per thread, this can cause indexing problems or redundant computation. Therefore, when writing a program to run in the device region 200, the workload calculation that determines the optimal thread count for a given parallel operation is important for raising the efficiency of the whole program.

The present invention devises and implements a scheme that computes the optimal workload from the size of the input data. FIG. 3 illustrates the procedure for computing the optimal number of threads. This procedure runs on the CPU. The number of blocks per grid (blocks_per_grid) and the number of threads per block (threads_per_block) are declared as variables, and the data width (w) and data height (h) are defined (S10). N, the amount of data to process, is the product of the data width and the data height (S11).

It is determined whether N is smaller than the maximum number of threads per block (max_threads_per_block) (S12). This maximum depends on the hardware specification of the GPU used. If N is smaller than the maximum, the number of threads per block is set to N (S13). Otherwise, the number of threads per block is set to the maximum number of threads per block (S14). Since the amount of data N normally exceeds the maximum number of threads per block, the procedure usually proceeds to S14. Finally, the number of blocks per grid is computed by the following formula (S15). In this way, the present invention can compute the workload of any program and create an appropriately sized set of threads.

Formula (1)

By Formula (1), as many compute threads as there are pixels in the image data to be processed can be created. For example, consider the optimal thread count when the data width is 10, the data height is 50, and the maximum number of threads per block is 256. N is 500 and Ntb is 256. Substituting these into the formula gives about 2.53 for Nbg, and because Formula (1) takes the largest integer not exceeding the computed real value, the number of blocks per grid is 2. The number of threads per grid is therefore 512 (2 × 256): 500 threads are used for the computation and 12 are surplus.
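The steps S10 to S15 can be sketched on the host side as follows. This is a minimal Python illustration, not the patent's implementation. Formula (1) is not reproduced in the text, so the block count here is computed by ceiling division, which is an assumption chosen because it agrees with the worked example (2 blocks, 512 threads, 12 surplus). The function name `launch_config` is hypothetical.

```python
def launch_config(width, height, max_threads_per_block):
    """Return (threads_per_block, blocks_per_grid) for width x height data."""
    n = width * height                              # S11: amount of data
    if n < max_threads_per_block:                   # S12
        threads_per_block = n                       # S13
    else:
        threads_per_block = max_threads_per_block   # S14
    # S15 (assumption): ceiling division so that every element gets a thread.
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    return threads_per_block, blocks_per_grid


# Worked example from the text: width 10, height 50, at most 256 threads per block.
ntb, nbg = launch_config(10, 50, 256)
total_threads = ntb * nbg
surplus_threads = total_threads - 10 * 50
```

With these inputs the sketch yields 256 threads per block and 2 blocks per grid, so 512 threads are created for 500 data elements.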

These surplus threads can cause redundant computation or data indexing errors. Their involvement can be blocked by inserting into the computation a step that assigns each thread a unique index, derives x and y coordinates from that index, and discards any thread that falls outside the valid range.
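The guard described above can be emulated sequentially. The sketch below is only an illustration in Python: the nested loops stand in for the blocks and threads that a GPU would run concurrently, and the row-major mapping from index to (x, y) is an assumption. The function name `active_coordinates` is hypothetical.

```python
def active_coordinates(width, height, threads_per_block, blocks_per_grid):
    """Emulate the per-thread range check; return kept coordinates and discard count."""
    active = []
    discarded = 0
    for block in range(blocks_per_grid):
        for thread in range(threads_per_block):
            index = block * threads_per_block + thread   # unique thread index
            x, y = index % width, index // width         # assumed row-major mapping
            if y >= height:                              # outside the data: surplus thread
                discarded += 1
                continue
            active.append((x, y))
    return active, discarded


coords, dropped = active_coordinates(10, 50, 256, 2)
```

For the 10 × 50 example, 500 threads map to distinct coordinates and the remaining 12 are discarded before they touch any data.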

<Discrete wavelet transform and inverse transform of JPEG 2000 images>

JPEG 2000 is an international standard image format in which images are encoded with the DWT algorithm and decoded with the inverse transform. Because the DWT and its inverse require a large amount of computation, dedicated hardware exists for them; the present invention instead processes them in software using the GPU.

Basically, the DWT halves the image region in both the vertical and horizontal directions and separates it by frequency into low-frequency and high-frequency regions. One level of transform produces four quadrants (LL, LH, HL, HH), and the algorithm recursively transforms the LL region for the specified number of levels. The inverse DWT reconstructs the final image by inverse-transforming the four quadrants in the horizontal and vertical directions, so it is the reverse of the forward transform. By the nature of the algorithm, the transform filter must be applied to every data pixel at every specified level, so the amount of computation is very large.

Many discrete wavelet filters exist for the transform and inverse transform, but the JPEG 2000 standard typically uses the CDF 9/7 filter (Cohen-Daubechies-Feauveau 9/7 filter) and the CDF 5/3 filter (Cohen-Daubechies-Feauveau 5/3 filter). In the present invention, the filter coefficients needed for the DWT and its inverse, together with variables that must be referenced repeatedly, are loaded into the constant memory of the device region 200 and referenced there, which allows much faster computation than keeping them in the host region 100.
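To make the 5/3 filter concrete, the following Python sketch applies a one-dimensional reversible 5/3 transform and its inverse to a short integer sequence. It uses the lifting form of the filter with whole-sample symmetric boundary handling, which is a substitution made for brevity: the text describes convolution with coefficients held in constant memory, and this sequential sketch is not that kernel. The names `fdwt53`, `idwt53`, and `_mirror` are hypothetical.

```python
def _mirror(i, n):
    """Reflect index i into [0, n) (whole-sample symmetric extension)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def fdwt53(x):
    """Forward reversible 5/3 lifting; returns (low, high). Needs len(x) >= 2."""
    n = len(x)
    y = list(x)
    for i in range(1, n, 2):   # predict: odd samples become high-pass values
        y[i] = x[i] - ((x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)]) >> 1)
    for i in range(0, n, 2):   # update: even samples become low-pass values
        y[i] = x[i] + ((y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2)
    return y[0::2], y[1::2]


def idwt53(low, high):
    """Inverse of fdwt53: undo the update step, then the predict step."""
    n = len(low) + len(high)
    y = [0] * n
    y[0::2] = low
    y[1::2] = high
    for i in range(0, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    for i in range(1, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    return y


signal = [10, 12, 14, 13, 9, 7, 8]
low, high = fdwt53(signal)
restored = idwt53(low, high)
```

Because every step uses integer arithmetic that the inverse undoes exactly, the round trip restores the input, and an odd-length input gives one more low-pass value than high-pass values.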

(1) DWT operation

The DWT operation of the present invention combines operations performed in the host region 100 with operations performed in the device region 200; the operations in the device region 200 are processed in parallel by the GPU 210. FIG. 4 is a schematic flowchart of the overall procedure. In FIG. 4, blocks with a bold border are executed inside the GPU 210 and processed in parallel.

First, the data of the input image to be transformed (Image Source Component) is loaded into the memory of the host region (S100). The input data is then copied to the GPU region, that is, the device region (S101): the input data stored in the source buffer 121 of the host memory is copied to the first memory buffer 221 of the device region. The host program repeats the following commands until the specified level is reached (S102).

The thread count required for the operation is computed by the method shown in FIG. 3 (S103). The kernels to be run on the GPU are then loaded and executed, and the computation in the GPU region begins in earnest. When kernels are executed in the GPU region to perform the DWT, the memory structure in the GPU region is very important, because the pixel values are computed concurrently and asynchronously, so the exact moment at which a value is referenced or updated cannot be determined. This differs greatly from computation on the CPU, where a single memory holds the structure of the original data.

Computation on the CPU proceeds sequentially using a temporary buffer, so the data that the current operation is referencing or updating cannot be disturbed by another task; this is not true of computation on the GPU. Therefore, to keep the data consistent while many transform operations run in parallel, the present invention separates the data space to be referenced from the data space to be updated and organizes the memory of the device region 200 as a multiple memory buffer structure. FIG. 5 shows one embodiment of this structure and of the flow of data between buffers in the DWT. This is described again below.

Returning to FIG. 4, the data in the first memory buffer 221 is copied to the second memory buffer 223 (S104). A separate kernel may be provided for this copy so that it is processed in parallel. Next, a kernel is executed that attaches an extension to the vertical-direction data for boundary processing (S105).

An extension replicates a portion of the data at its edge boundary by extending it symmetrically. Extension must be performed one row or one column at a time in the two-dimensional structure of the data. Accordingly, the number of device-region threads used to parallelize the extension equals the number of rows or columns of the data. It is also possible to perform the extension on the CPU in the host region and copy the result to the device region. However, performing it in parallel in the device region is much faster than performing it sequentially on the CPU, so a kernel for extension was written and executed.
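A symmetric extension of this kind can be sketched as below. The Python code assumes whole-sample symmetry (the edge sample itself is not repeated), which the text does not specify, and it processes rows in a loop where the text assigns one thread per row or column. The names `extend_row` and `extend_rows` are hypothetical.

```python
def extend_row(row, padlen):
    """Mirror padlen samples around each end of row (edge sample not repeated)."""
    if not 0 < padlen < len(row):
        raise ValueError("padlen must be positive and smaller than the row length")
    left = row[padlen:0:-1]            # row[padlen], ..., row[1]
    right = row[-2:-padlen - 2:-1]     # row[-2], ..., row[-padlen - 1]
    return left + row + right


def extend_rows(rows, padlen):
    """Each row is independent, mirroring the one-thread-per-row decomposition."""
    return [extend_row(r, padlen) for r in rows]


extended = extend_row([1, 2, 3, 4], 2)
```

Each extended row is longer than the original by twice the pad length, which matches the wider pitch of the padded buffers described for the device region.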

Next, a kernel is executed that applies the one-dimensional discrete wavelet transform to the vertical-direction data (S106). As it is transformed, the result is copied to the third memory buffer 225, which is used for processing the horizontal-direction data.

Next, a kernel is executed that attaches an extension to the horizontal-direction data (S107), followed by a kernel that applies the one-dimensional discrete wavelet transform to the horizontal-direction data (S108). As it is transformed, the result is interleaved and copied to the first memory buffer 221, the original data memory buffer (S109). When the specified level is reached, the loop exits. At this point all transform results are stored in the original data buffer 221. To view the final result, the data is copied from the memory buffer of the device region to the memory of the host region (S110). DWT data is obtained through this process.

As shown in FIG. 5, the multiple memory buffer structure in the device region 200 in the preferred embodiment has three memory buffers. The data copied into the first memory buffer 221 of the device region 200 is copied to the second memory buffer 223, which is designed for the vertical-direction operation. The second memory buffer 223 is referenced when the kernel that applies the one-dimensional discrete wavelet transform to the vertical-direction data runs. Once the vertical-direction transform is done, the transformed data is copied to the third memory buffer 225, which is referenced for the one-dimensional discrete wavelet transform of the horizontal-direction data. The data transformed in this way is copied back to the first memory buffer 221.

The multiple memory buffer structure of the present invention is characterized by separating the data space to be referenced from the data space to be updated. This structure keeps the data consistent while many transform operations run in parallel.
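Why separating the referenced space from the updated space matters can be seen in a small Python sketch. A neighbour-averaging step stands in for the wavelet filter here; that choice, and the function names, are assumptions made only for illustration. When the update is done in place, later elements read values that have already been overwritten; when it reads one buffer and writes another, the result does not depend on the order in which elements are processed.

```python
def smooth_in_place(data):
    """Read and write the same buffer: results depend on processing order."""
    buf = list(data)
    for i in range(1, len(buf) - 1):
        buf[i] = (buf[i - 1] + buf[i + 1]) // 2
    return buf


def smooth_separate(src, order=None):
    """Read only from src and write only to dst: any processing order is safe."""
    dst = list(src)
    if order is None:
        order = range(1, len(src) - 1)
    for i in order:
        dst[i] = (src[i - 1] + src[i + 1]) // 2
    return dst


data = [0, 8, 0, 8, 0]
in_place = smooth_in_place(data)
separate = smooth_separate(data)
separate_reversed = smooth_separate(data, order=[3, 2, 1])
```

The in-place version produces a result contaminated by its own earlier writes, while the two-buffer version gives the same output for forward and reversed processing, which is the property concurrent threads need.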

(2) Inverse DWT

FIG. 6 schematically shows the inverse DWT procedure. Blocks with a bold border are executed inside the GPU 210 and processed in parallel. The data to be inverse-transformed resides in the host region 100; it is data that was transformed by the DWT procedure above (S200). The data is copied to the GPU region, that is, the device region (S201).

Unlike the DWT, the inverse transform must determine at each level a region twice the size of the low-frequency region and store there the result of synthesizing the low-frequency and high-frequency regions for that level. The size of this region differs depending on whether the length of the input data is even or odd, so it is not easy to compute consistently. For this purpose a program is run that computes the region size adaptively from the data length, and the results are stored as an array of structures (S202). The host program repeats its commands until the number of levels performed equals the specified level; if the specified level has not been reached, the inverse transform operation is performed (S203). After the thread count is computed as described above (S204), the kernels to be run on the GPU are loaded and executed.
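The per-level bookkeeping of step S202 might look like the following Python sketch. It assumes that, for an odd length, the low-frequency part receives the extra sample (ceiling of half the length); the text does not state which half is larger. The `LevelRegion` structure and the `level_regions` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LevelRegion:
    level: int
    width: int     # width of the region synthesised at this level
    height: int    # height of the region synthesised at this level
    low_w: int     # low-frequency part of the width
    high_w: int    # high-frequency part of the width
    low_h: int     # low-frequency part of the height
    high_h: int    # high-frequency part of the height


def level_regions(width, height, levels):
    """Return one LevelRegion per level, finest level first."""
    regions = []
    w, h = width, height
    for level in range(1, levels + 1):
        low_w, low_h = (w + 1) // 2, (h + 1) // 2   # assumed: low part gets the odd sample
        regions.append(LevelRegion(level, w, h, low_w, w - low_w, low_h, h - low_h))
        w, h = low_w, low_h                          # the next level works on the LL region
    return regions


regions = level_regions(4096, 2160, 5)
synthesis_order = list(reversed(regions))            # the inverse starts at the coarsest level
```

For a 4096 × 2160 image the height becomes odd at the fifth level (135), so the low and high parts differ there (68 and 67), which is the case the adaptive calculation has to handle.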

The data copied into the device region is copied to the second memory buffer 223′, a device memory buffer designed for the horizontal-direction operation (S205). A separate kernel handles this copy so that it is processed in parallel, and it performs the interleaving at the same time.

A kernel is executed that attaches an extension to the horizontal-direction data (S206), followed by a kernel that applies the one-dimensional inverse discrete wavelet transform to the horizontal-direction data (S207). As it is inverse-transformed, the result is copied to the third memory buffer 225′, a device memory buffer for processing the vertical-direction data. Next, a kernel is executed that attaches an extension to the vertical-direction data (S208), followed by a kernel that applies the one-dimensional inverse discrete wavelet transform to the vertical-direction data (S209). The result is copied to the first memory buffer 221′, the original data memory buffer (S210). When the specified level is reached, the loop exits.

At this point all inverse transform results are stored in the original data buffer 221′. To view the final result, the data is copied from the memory buffer of the device region to the memory of the host region (S211). Through this inverse DWT, the inverse-transformed image or image component is obtained (S212).

<Conceptual structure of the memory buffers>

We now look at the conceptual structure of the original data in the host region 100 and of the data operated on in the device region 200 during the DWT. FIG. 8 shows the physical structure of the original data. Such one-dimensional data can be divided at a fixed interval (pitch or width) and treated conceptually as two-dimensional. FIG. 9(a) shows the conceptual structure of the original data, and FIG. 9(b) shows its conceptual coordinates. The (x, y) coordinates shown in FIG. 9(b) correspond exactly to the two-dimensional structure of the data.
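The relation between the physical one-dimensional layout and the conceptual (x, y) coordinates can be written down directly. The Python sketch below assumes a row-major layout in which `pitch` is the number of elements per conceptual row; the helper names `to_index` and `to_xy` are hypothetical.

```python
def to_index(x, y, pitch):
    """Linear offset of conceptual coordinate (x, y)."""
    return y * pitch + x


def to_xy(index, pitch):
    """Conceptual coordinate (x, y) of a linear offset."""
    return index % pitch, index // pitch


flat = list(range(12))   # physical one-dimensional data
pitch = 4                # four elements per conceptual row
rows = [flat[i:i + pitch] for i in range(0, len(flat), pitch)]
```

Here `rows[y][x]` and `flat[to_index(x, y, pitch)]` refer to the same element, so either view can be used to address the data.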

In the DWT, the data is processed in the vertical direction and then in the horizontal direction. Because data in the host region 100 cannot be referenced or updated directly from the device region 200, a memory model identical to the original data of the host region must exist in the device region for the device region 200 to reference the data needed for computation. The conceptual structure of the data in the first memory buffer of the device region, into which the host region copies the original data, is therefore the same as that of the original data in the host region. FIG. 10 illustrates the conceptual structure of the data stored in the first memory buffer; it is substantially the same as FIG. 9(a).

The data stored in the first memory buffer is loaded into the second memory buffer, and the extension is applied to the data loaded there. FIG. 11 shows an example of the conceptual structure of the second memory buffer. Unlike the first memory buffer of FIG. 10, the width and height of the data are transposed for the vertical-direction decomposition. After the extension, the pitch is wider than the height of the data in the first memory buffer by two pad lengths (PADLEN).
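The layout of the second memory buffer can be sketched as a transposed copy with room reserved for the pads. In the Python sketch below, the pad value of 2 and the zero fill of the pad cells are assumptions for illustration (the extension step would fill them afterwards), and `load_transposed` is a hypothetical name.

```python
PADLEN = 2  # hypothetical pad length; the real value depends on the filter


def load_transposed(src, width, height, padlen=PADLEN):
    """Copy row-major src into a transposed buffer with padlen free cells per side."""
    pitch = height + 2 * padlen          # pitch = data height + two pads
    dst = [0] * (pitch * width)          # one padded row per source column
    for y in range(height):
        for x in range(width):
            dst[x * pitch + padlen + y] = src[y * width + x]
    return dst, pitch


source = [1, 2, 3,
          4, 5, 6]                       # width 3, height 2
buffer2, pitch2 = load_transposed(source, width=3, height=2)
```

Each source column becomes a contiguous run in the new buffer, so a vertical-direction filter can walk adjacent memory locations.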

The third memory buffer is updated with the result of the vertical-direction decomposition, and the buffer coordinates are transposed again. After the extension is applied to the third memory buffer, the result of the horizontal-direction decomposition is written to the first memory buffer, and interleaving can be performed at the same time. FIG. 12 shows an example of the conceptual structure of the third memory buffer. As shown in FIG. 12, the pitch is wider than the width of the data by two PADLENs. In FIGS. 10 to 12, the numbers inside the rectangles indicate how the coordinate values change as each operation is executed.

The second and third memory buffers are wider by PADLEN on the left and right in order to hold the extension. In another embodiment of the present invention, these buffers may be created with identical layouts, with a horizontal-direction kernel and a vertical-direction kernel written to perform the respective operations without transposition. In the present embodiment, however, the structure above was adopted to keep functions such as the convolution as simple as possible and to ensure that the data referenced inside a loop is contiguous.

When the inverse DWT is computed in the device region 200, the data is processed in the horizontal direction and then in the vertical direction. The horizontal-direction operation poses no difficulty because it matches the conceptual coordinates of the original data, but for the vertical-direction operation the data must be rearranged to implement the operation.

The data moves between the memory buffers in the following order. First, the data is loaded from the first memory buffer 221′ into the second memory buffer 223′ while horizontal and vertical interleaving are performed at the same time. The extension is applied to the second memory buffer 223′. Next, the result of the horizontal-direction synthesis is written to the third memory buffer 225′; the transposition takes place at this point. After the extension is applied to the third memory buffer 225′, the result of the vertical-direction synthesis can be written to the second memory buffer 223′. Finally, only the data area of the final result is copied to the first memory buffer 221′.

Among the problems that are hard to handle when parallelizing the discrete wavelet transform and its inverse on a GPU are that conflicts can occur when data is stored in device memory and that threads may have to wait when accessing data. To address these, the present invention performs the discrete wavelet transform and its inverse using the multiple memory buffer structure in the device region, as described above, which guarantees the mutual independence of the pixel data and prevents conflicts in the data flow of the computation.

In particular, the preferred embodiment of the JPEG 2000 discrete wavelet transform and its inverse was described as processing data with three buffers. In the device region, the buffer for the horizontal operation, the buffer for the vertical operation, and the buffer for the original data are structurally different. For data to be referenced or copied smoothly between these different structures, a consistent coordinate system must be established and the kernels must operate on the data according to it. The present invention devises and applies a function that maintains a consistent coordinate system and computes the required coordinates when needed.

As described above, this function, which builds the reference coordinate system for the data, takes the size of the data space as width and height values, computes the repetition unit of the data, and constructs two-dimensional coordinates from the repetition unit and the pad size required by the filter operation.

The best way to maximize the efficiency of parallel computation on the GPU is to minimize the interaction between host and device. In the present invention, independently organized data areas exist in the host region and the device region, and the design prevents the latency that arises when one kernel program waits for the result of another. Such latency is one of the main factors that lower the efficiency of multi-threaded parallel computation on the GPU.

Another factor that lowers the efficiency of multi-threaded parallel computation on the GPU is the number of times that commands executed in the host region alternate with commands executed in the device region. The more often this happens, the lower the efficiency. The present invention therefore structures the algorithm so that the parallel operations performed on the GPU are grouped and executed in a batch, allowing the discrete wavelet transform and inverse transform of the JPEG 2000 compression format, which require a large amount of computation, to be processed in real time.

The table below compares the computation time of the forward discrete wavelet transform (FDWT) and its inverse (IDWT) when performed on the CPU and when performed on the GPU as described in the present invention. The target image resolution was 4096×2160 (4K). The CPU was an Intel Q8200 at 2.33 GHz with a 1333 MHz FSB and 4 MB L2 cache, the RAM was 2 GB DDR3, and the GPU was an Nvidia 250 GTS with 512 MB.

| Transform | CPU computation time | GPU computation time | CPU:GPU computational efficiency |
|-----------|----------------------|----------------------|----------------------------------|
| FDWT      | 2,389.4 ms           | 105.4 ms             | 22.7 times                       |
| IDWT      | 2,007.5 ms           | 58.1 ms              | 34.6 times                       |

As a result, the GPU-based discrete wavelet transform is 22.7 times faster than the CPU-based one, and the GPU-based inverse transform is 34.6 times faster than the CPU-based one.
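The efficiency figures follow directly from the measured times. As a quick check, the Python snippet below divides the CPU time by the GPU time for each transform and rounds to one decimal place; the variable names are hypothetical and the timings are those reported above.

```python
# (CPU time in ms, GPU time in ms) as reported in the comparison
timings_ms = {
    "FDWT": (2389.4, 105.4),
    "IDWT": (2007.5, 58.1),
}

# Speedup = CPU time / GPU time, rounded to one decimal place
speedup = {name: round(cpu / gpu, 1) for name, (cpu, gpu) in timings_ms.items()}
```

The computed ratios reproduce the stated 22.7 and 34.6.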

As the experiment above shows, performing the discrete wavelet transform with the GPU method of the present invention, as in the case of the JPEG 2000 compression format, is considerably more efficient than the conventional CPU method. The method can be applied not only to JPEG 2000 but also to various other processing that requires a discrete wavelet transform.

The scope of protection of the present invention is not limited to the embodiments explicitly described above. It is further noted that the scope of protection cannot be limited by changes or substitutions that are obvious in the technical field to which the present invention belongs.

FIG. 1 is a diagram showing the schematic flow of parallel processing using a GPU according to the present invention.

FIG. 2 is a diagram showing, as a logical concept, the internal structure of the GPU 210, which can process operations in parallel.

FIG. 3 is a diagram showing the process for calculating the optimal number of threads in an embodiment of the present invention.

FIG. 4 is a flowchart schematically illustrating the overall process of performing a DWT operation according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating the multiple memory buffer structure built in the memory of the device region 200 when performing the DWT operation of FIG. 4.

FIG. 6 is a flowchart schematically illustrating the overall process of performing an IDWT (inverse DWT) operation according to an embodiment of the present invention.

FIG. 7 is a diagram illustrating the multiple memory buffer structure built in the memory of the device region 200 when performing the IDWT operation of FIG. 6.

FIG. 8 is a diagram illustrating the physical structure of the original data on the CPU side before the DWT operation is performed according to the present invention.

FIG. 9(a) illustrates the conceptual structure of the original data of FIG. 8, and FIG. 9(b) is a diagram showing the conceptual coordinates of the original data.

FIG. 10 is a diagram illustrating the conceptual structure of the data stored in the first memory buffer 221 in the DWT operation of the present invention.

FIG. 11 is a diagram illustrating the conceptual structure of the data stored in the second memory buffer 223 in the DWT operation of the present invention.

FIG. 12 is a diagram illustrating the conceptual structure of the data stored in the third memory buffer 225 in the DWT operation of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100: host region    110: CPU

120: memory (RAM)    121: source buffer

200: device region    210: GPU

220: GPU memory

221: first memory buffer for DWT

221′: first memory buffer for IDWT

223: second memory buffer for DWT

223′: second memory buffer for IDWT

225: third memory buffer for DWT

225′: third memory buffer for IDWT

※ The accompanying drawings are provided by way of reference to aid understanding of the technical idea of the present invention, and they do not limit the scope of rights of the present invention.

Claims

A high-speed image processing method using parallel processors of a general-purpose graphics processing unit, in which image processing is performed by applying a discrete wavelet transform (DWT) and its inverse transform to still image or video input data, the method comprising:

executing kernels in parallel on the GPU to perform the operations of the discrete wavelet transform or its inverse transform.

(Deleted)

The method of claim 1,

further comprising generating the multiple threads by calculating the optimal number of threads for performing the operations from the number of blocks per grid calculated by Equation (1).

Equation (1)

The method of claim 1,

wherein the multiple memory buffer structure built in the device region for the DWT

comprises a third memory buffer that is updated with the results of operations on data in the vertical direction and is referenced by operations on data in the horizontal direction.
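The separation between the buffer being referenced and the buffer being updated can be sketched as follows. This Python sketch is a CPU-side illustration under assumptions, not the claimed GPU implementation: the names `haar_1d`, `vertical_pass`, `horizontal_pass`, and `result` are hypothetical, and a simple Haar average/difference step stands in for the actual wavelet filter. Each pass reads only from its source buffer and writes only to its destination buffer, so no pass overwrites data that it still needs to read.

```python
def haar_1d(values):
    """One-level Haar step (assumed stand-in filter): averages, then differences."""
    half = len(values) // 2
    low = [(values[2 * i] + values[2 * i + 1]) / 2 for i in range(half)]
    high = [(values[2 * i] - values[2 * i + 1]) / 2 for i in range(half)]
    return low + high


def vertical_pass(src, dst):
    """Transform every column of src; src is only read, dst is only written."""
    rows, cols = len(src), len(src[0])
    for c in range(cols):
        column = haar_1d([src[r][c] for r in range(rows)])
        for r in range(rows):
            dst[r][c] = column[r]


def horizontal_pass(src, dst):
    """Transform every row of src; src is only read, dst is only written."""
    for r, row in enumerate(src):
        dst[r][:] = haar_1d(row)


source = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
third_buffer = [[0.0] * 4 for _ in range(4)]  # updated by the vertical pass
result = [[0.0] * 4 for _ in range(4)]        # hypothetical output buffer

vertical_pass(source, third_buffer)    # vertical results update the third buffer
horizontal_pass(third_buffer, result)  # horizontal pass references the third buffer
```

Because the columns in the vertical pass and the rows in the horizontal pass are independent of one another, each could be assigned to its own thread without any write landing in a buffer that another thread is reading.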

The method of claim 1,

wherein the multiple memory buffer structure built in the device region for the inverse transform of the DWT

comprises a third memory buffer that is updated with the results of operations on data in the horizontal direction and is referenced by operations on data in the vertical direction.

The method of claim 3 or 4,

wherein the coordinates of each memory buffer are organized in an (x, y) coordinate system.

The method of claim 1,

further comprising an extension processing step in the operations, wherein the extension processing step is implemented with multiple threads on the GPU and processed in parallel.
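As an illustration of why extension processing lends itself to parallel execution, the Python sketch below computes each extended sample from its own index alone. It assumes whole-sample symmetric (mirror) extension, which is the boundary handling commonly used with JPEG 2000 wavelet filters; the text above does not state which extension is used, and the names `reflect` and `extend` are hypothetical.

```python
def reflect(i, n):
    """Map any integer index into [0, n) by whole-sample symmetric reflection."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period  # Python's modulo is non-negative for a positive period
    return i if i < n else period - i


def extend(signal, pad):
    """Return signal with pad mirrored samples added on each side.

    Every output element depends only on its own index, so on a GPU each
    element could be produced by a separate thread.
    """
    n = len(signal)
    return [signal[reflect(i, n)] for i in range(-pad, n + pad)]


extended = extend([1, 2, 3, 4], 2)  # [3, 2, 1, 2, 3, 4, 3, 2]
```

The boundary samples are mirrored without duplicating the edge value, so the signal 1, 2, 3, 4 padded by two becomes 3, 2, 1, 2, 3, 4, 3, 2.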

(Deleted)

The method of claim 1,

wherein the discrete wavelet transform (DWT) and its inverse transform are convolution-based.
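A convolution-based one-dimensional DWT level can be sketched as follows. This Python sketch is an assumption-laden illustration, not the claimed implementation: it uses the 5/3 analysis filter pair associated with JPEG 2000 in its convolution form, applies whole-sample symmetric extension at the boundaries, and the names `LOW`, `HIGH`, `reflect`, and `dwt53_1d` are hypothetical. Each output coefficient is an independent weighted sum of input samples, which is the property that allows one thread per coefficient.

```python
# 5/3 analysis filters in convolution form (assumed filter choice).
LOW = (-1 / 8, 2 / 8, 6 / 8, 2 / 8, -1 / 8)  # centered on even samples
HIGH = (-1 / 2, 1.0, -1 / 2)                  # centered on odd samples


def reflect(i, n):
    """Map any integer index into [0, n) by whole-sample symmetric reflection."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def dwt53_1d(x):
    """One DWT level by filtering and downsampling by two; returns (low, high)."""
    n = len(x)
    low = [
        sum(c * x[reflect(2 * k + j - 2, n)] for j, c in enumerate(LOW))
        for k in range((n + 1) // 2)
    ]
    high = [
        sum(c * x[reflect(2 * k + 1 + j - 1, n)] for j, c in enumerate(HIGH))
        for k in range(n // 2)
    ]
    return low, high


flat_low, flat_high = dwt53_1d([5.0] * 8)
ramp_low, ramp_high = dwt53_1d([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

For a constant input the low-pass band reproduces the constant and the high-pass band is zero, since the low-pass taps sum to one and the high-pass taps sum to zero. For a linear ramp the interior high-pass coefficients are also zero, and only the coefficient at the mirrored boundary is non-zero.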