KR20190103916A - Apparatus and method for processing image

Info

Publication number: KR20190103916A
Application number: KR1020180081904A
Authority: KR
Inventors: 조남익; 추성권; 안석현; 이상훈; 권기범
Original assignee: 서울대학교산학협력단
Priority date: 2018-02-28
Filing date: 2018-07-13
Publication date: 2019-09-05
Also published as: KR102086042B1

Abstract

According to an embodiment of the present invention, an image processing apparatus comprises: a feature extraction unit that extracts visual information by applying a convolution filter to an input frame; a temporal information processing unit that extracts temporal information by using a feature map including the visual information, feeds back the extracted temporal information to update an internal state, and outputs a probability result; and a background separation unit that separates a foreground and a background of the input frame based on the probability result. The temporal information processing unit is implemented using a convolutional recurrent neural network and has a structure in which a convolutional recurrent encoder and a convolutional recurrent decoder are connected.

Description

Image processing apparatus and method {APPARATUS AND METHOD FOR PROCESSING IMAGE}

Embodiments disclosed herein relate to an image processing apparatus and method. More specifically, they relate to an image processing apparatus and method for separating a foreground and a background of a video using an artificial neural network.

Existing image processing methods separate the foreground and the background of an input image by using a value obtained by simply comparing the previous image with the current image. In this regard, Korean Patent No. 10-2013-0063963, a prior art document, describes an image processing method that compares a current image with a previous image, accumulates the change values between images to generate an accumulated image change value, and uses the accumulated image change value to estimate the change in the image caused by object movement.

Existing image processing methods, including the one above, set the beginning of a video as the training interval of a background model and learn all regions as the background model without foreground separation. After the training interval, they separate the foreground and the background of each input image and update the background model. Various techniques are used to reduce noise and improve performance in learning, separating, and updating the background; these techniques are designed by hand by the user, and different parameters are used for different image characteristics. As a result, the user must design every part of the image processing directly, complex memory update operations are performed, and speed can be limited because the computation is CPU-centric.

As described above, because existing image processing methods are not learning-based, they must be designed directly by the user and are therefore not easy to implement, and their performance is limited by inefficient use of hardware resources.

Therefore, an apparatus and method for solving these problems are needed.

Meanwhile, the background art described above is technical information that the inventors possessed for deriving the present invention, or acquired in the course of deriving it, and is not necessarily known art disclosed to the general public before the filing of the present application.

An object of the embodiments disclosed herein is to provide an image processing apparatus and method for separating a foreground and a background of a video using an artificial neural network.

An object of the embodiments disclosed herein is to provide an image processing apparatus and method for extracting temporal information of a video using an artificial neural network and separating a foreground and a background using the extracted temporal information.

An object of the embodiments disclosed herein is to provide an image processing apparatus and method that use an artificial neural network to automatically learn the learning, separation, or updating of a background without user intervention.

An object of the embodiments disclosed herein is to provide an image processing apparatus and method capable of detecting motion information by training an artificial neural network so that it uses not only visual information but also temporal information while preserving spatial information.

As a technical means for achieving the above-described technical objects, according to an embodiment, an image providing apparatus includes: a feature extractor that extracts visual information by applying a convolution filter to an input frame; a time information processor that extracts temporal information using a feature map including the visual information, feeds back the extracted temporal information to update an internal state, and outputs a probability result; and a background separator that separates a foreground and a background of the input frame based on the probability result, wherein the time information processor is implemented using a convolutional recurrent neural network and has a structure in which a convolutional recurrent encoder and a convolutional recurrent decoder are connected.

According to another embodiment, an image providing method of an image providing apparatus includes: extracting visual information by applying a convolution filter to an input frame; extracting temporal information using a feature map including the visual information, feeding back the extracted temporal information to update an internal state, and outputting a probability result; and separating a foreground and a background of the input frame based on the probability result, wherein the outputting of the probability result uses a convolutional recurrent neural network having a structure in which a convolutional recurrent encoder and a convolutional recurrent decoder are connected.

According to yet another embodiment, there is provided a computer-readable recording medium on which a program for performing an image processing method is recorded, the method including: extracting visual information by applying a convolution filter to an input frame; extracting temporal information using a feature map including the visual information, feeding back the extracted temporal information to update an internal state, and outputting a probability result; and separating a foreground and a background of the input frame based on the probability result, wherein the outputting of the probability result uses a convolutional recurrent neural network having a structure in which a convolutional recurrent encoder and a convolutional recurrent decoder are connected.

According to yet another embodiment, there is provided a computer program that is executed by an image processing apparatus and stored in a medium to perform an image processing method, the method including: extracting visual information by applying a convolution filter to an input frame; extracting temporal information using a feature map including the visual information, feeding back the extracted temporal information to update an internal state, and outputting a probability result; and separating a foreground and a background of the input frame based on the probability result, wherein the outputting of the probability result uses a convolutional recurrent neural network having a structure in which a convolutional recurrent encoder and a convolutional recurrent decoder are connected.

According to any one of the above-described means for solving the problems, an image processing apparatus and method for separating a foreground and a background of a video using an artificial neural network can be provided.

In addition, according to any one of the means for solving the problems, an image processing apparatus and method can be provided that extract temporal information of a video using an artificial neural network and separate a foreground and a background using the extracted temporal information.

In addition, according to any one of the means for solving the problems, an image processing apparatus and method can be provided that use an artificial neural network to automatically learn the learning, separation, or updating of a background without user intervention.

In addition, according to any one of the means for solving the problems, an image processing apparatus and method can be provided that detect motion information by training an artificial neural network so that it uses not only visual information but also temporal information while preserving spatial information.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a block diagram illustrating an image processing apparatus according to an embodiment.
FIG. 2 is a diagram for describing a process of extracting visual information from an input frame according to an embodiment.
FIG. 3 is a diagram for describing a process of processing temporal information according to an embodiment.
FIG. 4 is a diagram for describing a process of extracting a probability result using visual information according to an embodiment.
FIG. 5 is a diagram for describing the structure of a convolutional LSTM according to an embodiment.
FIG. 6 is a diagram illustrating the structure of a convolutional LSTM according to an embodiment.
FIG. 7 is a diagram for describing how the sizes of the maps and the convolutional LSTMs of the convolutional recurrent encoder change with scale according to an embodiment.
FIG. 8 is a diagram for describing how the sizes of the maps and the convolutional LSTMs of the convolutional recurrent decoder change with scale according to an embodiment.
FIG. 9 is a flowchart illustrating an image processing method according to an embodiment.

Hereinafter, various embodiments will be described in detail with reference to the accompanying drawings. The embodiments described below may be modified and implemented in various different forms. To describe the features of the embodiments more clearly, detailed descriptions of matters well known to those of ordinary skill in the art to which the embodiments belong are omitted. In the drawings, parts irrelevant to the description of the embodiments are omitted, and similar reference numerals designate similar parts throughout the specification.

Throughout the specification, when a component is said to be "connected" to another component, this includes not only the case where it is "directly connected" but also the case where it is "connected with another component in between". In addition, when a component is said to "include" another component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram illustrating an image processing apparatus according to an embodiment.

As illustrated in FIG. 1, the image processing apparatus 100 includes a feature extractor 110, a time information processor 120, and a background separator 130.

The feature extractor 110 may extract, from an input frame, visual information that may be organized as a feature map. Here, the input frame may be one of a plurality of frames constituting a video.

The feature map extracted by the feature extractor 110 includes visual information about at least a partial region of the input frame. The feature extractor 110 may use a convolution filter to obtain the feature map. Here, the convolution filter may be implemented using a convolutional neural network (hereinafter, 'CNN'). Accordingly, the feature extractor 110 may use the CNN to learn, from input frames (that is, RGB images), the convolution filters needed for background separation, and may extract the feature map using the learned convolution filters.

The time information processor 120 may extract temporal information using the feature map. Here, the time information processor 120 may use a convolutional recurrent neural network (hereinafter, 'CRNN') to extract the temporal information.

The time information processor 120 may extract temporal information using the feature map including the visual information, feed back the extracted temporal information to update an internal state, and output a probability result.

In this case, the time information processor 120 may use a CRNN having a structure in which a convolutional recurrent encoder and a convolutional recurrent decoder are connected. The convolutional recurrent encoder and the convolutional recurrent decoder may downsample and upsample the map size by a predetermined ratio to reduce the computation required for learning. As a result, the convolutional recurrent encoder and the convolutional recurrent decoder can extract temporal information at each of a plurality of scales. Here, a scale may refer to each stage constituting the convolutional recurrent encoder and the convolutional recurrent decoder. Each scale may include a map of a form similar to the feature map and, like the feature map, may consist of a plurality of layers. In addition, a map may include information about a tensor and a state (or hidden state), which is the information used to train the cells constituting the neural network.

The time information processor 120 may extract temporal information using the CRNN composed of a plurality of scales, and may update an internal state, that is, a cell state, at each of the plurality of scales. In this way, the time information processor 120 may output a probability result obtained using the temporal information.

The background separator 130 may detect the movement of a specific object in the input frame based on the probability result, and may thereby separate the foreground and the background in the input frame. For example, the background separator 130 may separate the foreground and the background using a preset reference value. If the reference value is set to about 0.5, a probability value from the probability result output by the time information processor 120 that exceeds the reference value (for example, about 0.9) is classified as foreground, and a probability value that is equal to or less than the reference value (for example, about 0.2) is classified as background.
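The reference-value rule described above can be sketched as follows. This is a minimal illustration, assuming the probability result is a per-pixel array of values in [0, 1]; the function name `separate_foreground` and the sample values are hypothetical and do not come from the specification.

```python
import numpy as np

def separate_foreground(prob_map, reference=0.5):
    """Return a boolean mask: True = foreground, False = background.

    prob_map is assumed to be a per-pixel probability result (hypothetical layout).
    """
    prob_map = np.asarray(prob_map, dtype=float)
    # Values strictly above the reference are foreground;
    # values equal to or below it are background, as in the description.
    return prob_map > reference

# Illustrative values only: 0.9 -> foreground, 0.2 -> background,
# and a value exactly at the reference (0.5) -> background.
probs = np.array([[0.9, 0.2],
                  [0.5, 0.7]])
mask = separate_foreground(probs)
```

Using a strict comparison keeps a value exactly equal to the reference on the background side, which matches the "equal to or less than" wording above.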

In this way, the proposed image processing apparatus 100 can separate the foreground and the background of an image using artificial neural networks. As described above, the image processing apparatus 100 may use a structure in which two artificial neural networks, a CNN and a CRNN, are combined in sequence to detect the movement of a specific object in a video. In particular, by using the CRNN structure, the image processing apparatus 100 does not rely only on the visual information of the video, and enables training of an artificial neural network that uses temporal information while preserving spatial information without loss.

The image processing apparatus 100 has a structure in which recurrent neural networks are stacked at each of the plurality of scales defined by the CRNN structure (which includes the convolutional recurrent encoder and the convolutional recurrent decoder). As a result, the image processing apparatus 100 can learn temporal information spanning a relatively longer time than a single recurrent neural network (hereinafter, 'RNN'), which may improve the performance of separating the background and the foreground.

In addition, by using artificial neural networks, the image processing apparatus 100 can automatically learn the learning, separation, or updating of the background without direct design by the user, and its performance varies little with parameter selection.

FIG. 2 is a diagram for describing a process of extracting visual information from an input frame according to an embodiment.

As shown in FIG. 2, the feature extractor 110 may extract visual information from the input frame using a convolution filter, and may use a CNN for this purpose.

The feature extractor 110 may apply the convolution filter 220 to a partial region 211 of the input frame 210 and may extract visual information (or CNN features) that may be organized as the feature map 230. Although FIG. 2 shows the feature map 230 as including four layers 231, 232, 233, and 234, the feature map 230 is not limited thereto and may be set to have any number of layers, one or more, using the convolution filters.
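The way a bank of convolution filters turns one RGB frame into a multi-layer feature map can be sketched as follows. This is an illustrative sketch only: the filter weights are random rather than learned, the 3x3 filter size, zero padding, and ReLU activation are assumptions, and the names `conv2d_same` and `extract_feature_map` are hypothetical.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Zero-padded 2-D cross-correlation of an (H, W) image with an odd-sized kernel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))  # zero padding keeps the output (H, W)
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    # The same kernel weights are reused at every spatial position.
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def extract_feature_map(frame, filters):
    """frame: (H, W, 3) RGB; filters: (n_layers, kh, kw, 3). Returns (n_layers, H, W)."""
    layers = []
    for f in filters:
        acc = sum(conv2d_same(frame[:, :, c], f[:, :, c]) for c in range(frame.shape[2]))
        layers.append(np.maximum(acc, 0.0))  # ReLU activation (an assumption)
    return np.stack(layers)

rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))                # stand-in for one input frame
filters = rng.standard_normal((4, 3, 3, 3))  # four filters -> four feature-map layers
feature_map = extract_feature_map(frame, filters)
print(feature_map.shape)  # (4, 8, 8)
```

Each filter yields one layer, so four filters give the four-layer feature map drawn in FIG. 2; changing the number of filters changes the number of layers.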

The feature extractor 110 may learn, from the input image 210, the convolution filters needed for background separation. Here, the convolution filters of the feature extractor 110 act as a filter bank, so that a plurality of feature maps 230 can be learned and generated. In this way, the feature extractor 110 can automatically learn and extract the information needed for background separation.

Meanwhile, the CNN described above is a type of artificial neural network that, by learning from images, can support various functions for searching, classifying, and understanding images in image processing. Unlike other artificial neural networks, which learn by connecting all regions of the input, a CNN can reuse a single parameter in several regions of the input frame. As a result, a CNN can process an image with few parameters, and because image characteristics do not change with spatial position, it can be used by the image processing apparatus 100 to extract visual information.

FIG. 3 is a diagram for describing a process of processing temporal information according to an embodiment.

As illustrated in FIG. 3, the time information processor 120 may extract the probability result 350 from the feature map 310. Here, the time information processor 120 may extract the probability result 350, which is a predicted value for separating the foreground and the background, from temporal information, that is, information about how the visual information changes over time. The time information processor 120 may be implemented using the CRNN 300.

In addition, the time information processor 120 may receive the cell state produced by the previous input together with the next input. In this way, the time information processor 120 may extract, from the feature map 310, the probability result 350, which is a predicted value for the separation of the foreground and the background.

FIG. 4 is a diagram for describing a process of extracting a probability result using visual information according to an embodiment.

As illustrated in FIG. 4, the time information processor 120 may extract a probability result from the input feature map 321 using the CRNN 300. Here, the feature map 321 is generated from the input frame 310 by the feature extractor 110.

The CRNN 300 may have a structure in which a convolutional recurrent encoder 301 and a convolutional recurrent decoder 302 are connected. The time information processor 120 may divide each of the convolutional recurrent encoder 301 and the convolutional recurrent decoder 302 into a plurality of scales according to downsampling and upsampling operations. In the convolutional recurrent encoder 301, the time information processor 120 may form scales that are downsampled by a predetermined ratio relative to the feature map; in the convolutional recurrent decoder 302, it may form scales in which the output of the convolutional recurrent encoder 301 is upsampled again by the predetermined ratio. In this way, the output of the convolutional recurrent decoder 302 can be restored to a size corresponding to the size of the feature map input to the convolutional recurrent encoder 301, so that the time information processor 120 can provide an output, that is, a probability result, with the same resolution as the input frame.
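The size bookkeeping of the encoder and decoder scales can be sketched as follows. The specification only says "a predetermined ratio"; the ratio of 2, average pooling for downsampling, and nearest-neighbour repetition for upsampling are assumptions chosen for illustration, and the recurrent operations applied at each scale are omitted. The function names are hypothetical.

```python
import numpy as np

def downsample(x, ratio=2):
    """Average-pool a (C, H, W) map by `ratio`; H and W must be divisible by `ratio`."""
    c, h, w = x.shape
    return x.reshape(c, h // ratio, ratio, w // ratio, ratio).mean(axis=(2, 4))

def upsample(x, ratio=2):
    """Nearest-neighbour upsampling of a (C, H, W) map by `ratio`."""
    return x.repeat(ratio, axis=1).repeat(ratio, axis=2)

feature_map = np.arange(4 * 8 * 8, dtype=float).reshape(4, 8, 8)
first_map = downsample(feature_map)   # encoder scale: spatial size reduced
second_map = upsample(first_map)      # decoder scale: spatial size restored
print(feature_map.shape, first_map.shape, second_map.shape)
# (4, 8, 8) (4, 4, 4) (4, 8, 8)
```

Because the decoder undoes the encoder's ratio, the final map has the same spatial size as the feature map, which is what allows a per-pixel probability result at the input resolution while the intermediate scale works on a quarter as many positions.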

The time information processor 120 may configure each of the convolutional recurrent encoder 301 and the convolutional recurrent decoder 302 to include a plurality of scales that are downsampled or upsampled by a predetermined ratio. In the convolutional recurrent encoder 301, the time information processor 120 may reduce the size of the receptive field by a predetermined ratio relative to the feature map, and in the convolutional recurrent decoder 302 it may increase the size of the receptive field again by the predetermined ratio. In this way, the time information processor 120 can use the convolutional recurrent decoder 302 to enlarge the receptive field of the tensor to a size corresponding to the size of the feature map, and can thus provide probability information with the same resolution as the input frame 310.

The time information processor 120 may perform a two-dimensional convolution operation on each of the feature map 321, the first map 322, and the second map 323 using a convolutional long short-term memory (hereinafter, 'LSTM'). Although the description here refers to the first map 322 and the second map 323 for convenience, each of the first map 322 and the second map 323 may include information about the tensor (x) and the state (h) obtained by the convolutional LSTM of the preceding stage.

A convolutional recurrent operation is performed on the feature map 321 by applying the first convolutional LSTM 331, and on the first map 322 by applying the second convolutional LSTM 332. Here, the first map 322 and the second convolutional LSTM 332 may have a size reduced by the predetermined ratio.

Subsequently, a convolutional recurrent operation is performed on the second map 323 by applying the third convolutional LSTM 333. Here, the second map 323 and the third convolutional LSTM 333 may have a size increased by the predetermined ratio, and may have the same size as the feature map 321 and the first convolutional LSTM 331. As indicated by the arrow 340, the first convolutional LSTM 331 may also be used in place of the third convolutional LSTM 333 to restore the resolution of the original image frame 310.

In this way, the time information processing unit 120 may output the probability result 350 using the CRNN 300. The probability result 350 includes information for separating the foreground from the background.

By using the downsampling and upsampling operations in the structure of the convolutional recurrent encoder 301 and the convolutional recurrent decoder 302, the time information processing unit 120 can secure a spatial receptive field and a temporal receptive field that each meet or exceed a predetermined reference value. Accordingly, compared with a structure that trains the neural network while keeping every map at the same size as the input frame, the time information processing unit 120 can reduce the amount of computation while still outputting a result at the same resolution as the original.
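A rough way to see the reduction in computation is to count spatial positions per layer. The sketch below assumes that the work of a layer is proportional to the number of positions in its map (that is, to 1/stride squared) and that the reduced scale uses a ratio of 1/2; neither assumption is stated in the source, and `relative_cost` is a hypothetical helper.

```python
def relative_cost(strides):
    """Cost of a multi-scale stack relative to running every layer at full size.

    Assumes the work of a layer is proportional to its number of spatial
    positions, i.e. to 1 / stride**2.
    """
    return sum(1.0 / (s * s) for s in strides) / len(strides)


# Three layers at full, half, and full resolution (assumed ratio of 1/2).
print(relative_cost([1, 2, 1]))  # 0.75
```

Under these assumptions, a three-layer stack with one half-resolution layer does about three quarters of the work of three full-resolution layers.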

FIG. 5 is a diagram for explaining the structure of a convolutional LSTM according to an embodiment.

As shown in FIG. 5, the convolutional LSTM 331 may generate gates 331-1, 331-2, 331-3, and 331-4, each of the same size as the input, through convolution operations on the tensor (x) and the state (h) that are distinguished as c, i, f, and o, respectively.

That is, the convolutional LSTM 331 may generate four gates, one for each of the convolution operations c, i, f, and o. Temporal information may be extracted using the convolutional LSTM 331 composed of these four gates.

FIG. 6 is a diagram illustrating the structure of a convolutional LSTM according to an embodiment.

As shown in FIG. 6, the convolutional LSTM 400 receives the previous internal state, that is, the cell state (C), and outputs the current cell state (C'). Convolutional LSTMs of a similar structure may be connected before and after the convolutional LSTM 400 (not shown), and the convolutional LSTM 400 may receive the tensor (x) 41 and the state (h) 43 as inputs in order to update the cell state.

The convolutional LSTM 400 may include a cell state operation line 410, a forget gate layer 420, an input gate layer 430, and an output gate layer 440.

The cell state operation line 410 receives the previous cell state (C) and may output the current cell state (C') by adding the output of the input gate layer 430 to the output of the forget gate layer 420.

The forget gate layer 420 performs a convolution operation on each of the sliding window 42 of the tensor (x) 41 and the sliding window 44 of the state (h) 43, adds the convolution results, and applies a sigmoid operation to the sum to produce the forget gate output.

The input gate layer 430 performs a convolution operation on each of the sliding window 42 of the tensor (x) 41 and the sliding window 44 of the state (h) 43. It adds the convolution results and applies a hyperbolic tangent (tanh) operation to produce one input gate output, and it likewise adds convolution results and applies a sigmoid operation to produce another input gate output. The output of the input gate is then added to the output of the forget gate.

The output gate layer 440 performs a convolution operation on each of the sliding window 42 of the tensor (x) 41 and the sliding window 44 of the state (h) 43, adds the convolution results, and applies a sigmoid operation to the sum to produce the output gate output. The output gate layer 440 may then apply a hyperbolic tangent (tanh) operation to the current cell state (C') to output the current state (or hidden state) (h').
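For reference, the gate operations described for FIG. 6 resemble the commonly used convolutional LSTM formulation, which can be written as below. The weight symbols W and the biases b are notation assumed here and do not appear in the source; * denotes convolution and \odot denotes element-wise multiplication. In this common formulation the previous cell state is multiplied by the forget gate, and the candidate value by the input gate, before the two terms are added, which is a detail the description above does not spell out.

```latex
\begin{aligned}
f &= \sigma\left(W_{xf} * x + W_{hf} * h + b_f\right) \\
i &= \sigma\left(W_{xi} * x + W_{hi} * h + b_i\right) \\
\tilde{c} &= \tanh\left(W_{xc} * x + W_{hc} * h + b_c\right) \\
o &= \sigma\left(W_{xo} * x + W_{ho} * h + b_o\right) \\
C' &= f \odot C + i \odot \tilde{c} \\
h' &= o \odot \tanh\left(C'\right)
\end{aligned}
```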

For each of the tensor (x) and the state (h), a sliding window of a predetermined size, here 3x3, is moved step by step across the tensor (x) and the state (h) to perform the convolutional LSTM operation, by which the network can be trained or information for separating the foreground from the background can be extracted.
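The 3x3 sliding-window operation can be sketched in plain Python for a single channel. This is an illustration under stated assumptions, not the disclosed implementation: zero padding is assumed so that the gates keep the size of the input, biases are omitted, the cell update follows the common formulation in which the forget gate scales the previous cell state, and the names `conv3x3` and `conv_lstm_step` are hypothetical.

```python
import math


def conv3x3(m, k):
    """Slide a 3x3 window over m with zero padding; the output has the size of m."""
    h, w = len(m), len(m[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            s = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        s += m[rr][cc] * k[dr + 1][dc + 1]
            out[r][c] = s
    return out


def sigmoid(v):
    """Numerically safe logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def conv_lstm_step(x, h, c, kernels):
    """One step of a single-channel convolutional LSTM (common formulation).

    kernels maps each gate name ('f', 'i', 'c', 'o') to a pair of 3x3 kernels,
    one applied to the tensor x and one applied to the state h.
    """
    pre = {}
    for gate, (kx, kh) in kernels.items():
        ax, ah = conv3x3(x, kx), conv3x3(h, kh)
        pre[gate] = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ax, ah)]
    rows, cols = len(x), len(x[0])
    c_new = [[0.0] * cols for _ in range(rows)]
    h_new = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for q in range(cols):
            f = sigmoid(pre["f"][r][q])     # forget gate
            i = sigmoid(pre["i"][r][q])     # input gate
            g = math.tanh(pre["c"][r][q])   # candidate cell value
            o = sigmoid(pre["o"][r][q])     # output gate
            c_new[r][q] = f * c[r][q] + i * g
            h_new[r][q] = o * math.tanh(c_new[r][q])
    return h_new, c_new


# Usage with all-zero kernels: every gate is 0.5 and the candidate is 0.
zeros = [[0.0] * 3 for _ in range(3)]
kernels = {g: (zeros, zeros) for g in "fico"}
x = [[1.0] * 4 for _ in range(4)]
h0 = [[0.0] * 4 for _ in range(4)]
c0 = [[2.0] * 4 for _ in range(4)]
h1, c1 = conv_lstm_step(x, h0, c0, kernels)
print(c1[0][0])  # 1.0
```

Because each gate is produced by convolution rather than by a fully connected product, the hidden state and cell state remain two-dimensional maps of the same size as the input.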

FIG. 7 is a diagram for explaining how the sizes of the maps and convolutional LSTMs of the convolutional recurrent encoder change with scale, according to an embodiment.

As shown in FIG. 7, the table 500 presents information on a plurality of scales corresponding to the encoder and decoder structure.

In the table 500, a layer may denote a scale corresponding to a predetermined ratio, and the encoder and decoder indices corresponding to each ratio are shown. The operation indicates either downsampling or upsampling: downsampling is performed in the encoder part, and upsampling is performed in the decoder part. The stride indicates the reduction ratio relative to the original size. For example, a stride of 8 denotes a map reduced to 1/8 of the original size.
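The relation between stride and map size can be made concrete with a small calculation. The frame size of 240x320 below is an example chosen for illustration and does not come from the source, and `scaled_size` is a hypothetical helper that assumes the frame size is divisible by the stride.

```python
def scaled_size(height, width, stride):
    """Map size at a given stride, assuming the frame size is divisible by it."""
    if height % stride or width % stride:
        raise ValueError("frame size must be divisible by the stride in this sketch")
    return height // stride, width // stride


print(scaled_size(240, 320, 8))  # (30, 40)
```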

Here, the structure has five scales, that is, five stacked convolutional LSTMs. It is a network in which each convolutional LSTM preserves the spatial layout at its own size while responding differently depending on the state produced by previous inputs.

The image processing apparatus 100 may use the structure of the convolutional recurrent encoder 510 to output the probability result. The convolutional recurrent encoder 510 may include a first feature map 511a and a first convolutional LSTM 511b. The convolutional recurrent encoder 510 may also include a first map 512a, downsampled at a scale of 1/2, and a second convolutional LSTM 512b, and may further include a second map 513a, downsampled again at a scale of 1/2, and a third convolutional LSTM 513b.

FIG. 8 is a diagram for explaining how the sizes of the maps and convolutional LSTMs of the convolutional recurrent decoder change with scale, according to an embodiment.

For a detailed description of the table 500 shown in FIG. 8, refer to the description of FIG. 7.

The image processing apparatus 100 may use the structure of the convolutional recurrent decoder 520 to output the probability result.

The convolutional recurrent decoder 520 may include a third map 521a and a fourth convolutional LSTM 521b, obtained by upsampling the second map 513a and the third convolutional LSTM 513b by a factor of two, and may further include a fourth map 522a and a fifth convolutional LSTM 522b, upsampled again by a factor of two.

Meanwhile, in the table 500, the correspondence between the convolutional LSTMs of the convolutional recurrent encoder 510 and those of the convolutional recurrent decoder 520 is indicated by the arrows 530 and 540.
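The five-scale arrangement and the encoder-to-decoder correspondence can be sketched as a list of strides. The pairing rule used here, matching scales that share the same stride, is an assumption about what the arrows 530 and 540 indicate; the source states only that a correspondence exists. `build_schedule` and `matching_pairs` are hypothetical names.

```python
def build_schedule(num_scales=5, ratio=2):
    """Return the stride of each scale: down to the center, then back up."""
    if num_scales % 2 == 0:
        raise ValueError("this sketch assumes an odd number of scales")
    center = num_scales // 2
    return [ratio ** (center - abs(i - center)) for i in range(num_scales)]


def matching_pairs(strides):
    """Pair each encoder scale with the decoder scale of the same stride (assumed rule)."""
    center = len(strides) // 2
    return [(i, len(strides) - 1 - i) for i in range(center)]


strides = build_schedule()
print(strides)                  # [1, 2, 4, 2, 1]
print(matching_pairs(strides))  # [(0, 4), (1, 3)]
```

With five scales, the first and second scales of the encoder pair with the fifth and fourth scales of the decoder, and the center scale at stride 4 has no partner.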

FIG. 9 is a flowchart illustrating an image processing method according to an embodiment.

As shown in FIG. 9, the image processing apparatus 100 may extract, from an input frame, visual information that may be organized as a feature map (S610). The image processing apparatus 100 may extract the visual information in the form of a feature map by applying a convolution filter to the input frame. Because the convolution filter is implemented using a CNN, the feature map may also be referred to as a CNN feature.

The image processing apparatus 100 may extract temporal information from the feature map using the CRNN (S620). The image processing apparatus 100 may extract temporal information at a plurality of scales for each of the convolutional recurrent encoder and the convolutional recurrent decoder, using a convolutional LSTM at each of the scales. Here, the convolutional recurrent encoder downsamples its input at a predetermined ratio, and the convolutional recurrent decoder upsamples its input at a predetermined ratio. The scales of the convolutional recurrent encoder and those of the convolutional recurrent decoder may correspond to each other, and when the number of scales is odd, the operation switches from downsampling to upsampling at the scale located in the center.

The image processing apparatus 100 checks whether processing has been completed at all of the scales defined within the CRNN (S630).

If the check in step S630 shows that the extraction of temporal information has been completed at all scales, the image processing apparatus 100 proceeds to step S640.

If the check in step S630 shows that the extraction of temporal information has not been completed at all scales, the image processing apparatus 100 returns to step S620 to extract temporal information at the scales for which extraction has not yet been completed.
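The repetition over scales in steps S620 and S630 can be sketched as a simple loop. The per-scale layers are represented by placeholder callables, since the real layers would be convolutional LSTMs with downsampling or upsampling; `extract_temporal_information` is a hypothetical name.

```python
def extract_temporal_information(feature_map, scale_layers):
    """Apply every scale in order; the loop condition plays the role of step S630."""
    current = feature_map
    done = 0
    while done < len(scale_layers):            # S630: is any scale left?
        current = scale_layers[done](current)  # S620: process the next scale
        done += 1
    return current


# Placeholder layers that only record the stride of the scale they stand for.
layers = [lambda trace, s=s: trace + [s] for s in (1, 2, 4, 2, 1)]
result = extract_temporal_information([], layers)
print(result)  # [1, 2, 4, 2, 1]
```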

The image processing apparatus 100 may output a probability result using the temporal information, and may separate the foreground and the background of the input frame using the probability result (S640). In doing so, the image processing apparatus 100 may separate the background from the foreground by comparing the probability result with a preset reference value.
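The comparison with a reference value in step S640 can be sketched as a per-pixel threshold. The sketch assumes that the probability result holds a foreground probability for each pixel and uses 0.5 as the reference value; the source says only that the value is preset. `separate_foreground` is a hypothetical name.

```python
def separate_foreground(prob_map, threshold=0.5):
    """Return a binary mask: 1 where the probability exceeds the threshold, else 0."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]


mask = separate_foreground([[0.1, 0.9], [0.6, 0.4]])
print(mask)  # [[0, 1], [1, 0]]
```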

The image processing apparatus 100 according to the present embodiment can perform background subtraction (for example, motion detection), which is one of the automatic analysis techniques for video. In this way, the image processing apparatus 100 can detect a moving object in a video and narrow the region to be analyzed down to the region in which the moving object is present, thereby increasing the speed of image analysis. In addition, the image processing apparatus 100 can obtain information on the presence, direction, and type of motion in the video. The image processing apparatus 100 can therefore be used in image-based surveillance fields such as security and guarding, and can also be applied to various other fields in which functions such as video synthesis, object recognition, and image retrieval are used.
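Narrowing the analysis region to where the moving object is present can be illustrated by taking the bounding box of a foreground mask. This is one possible way to do it, not a step described in the source; the binary mask format and the name `foreground_bounding_box` are assumptions.

```python
def foreground_bounding_box(mask):
    """Return (top, left, bottom, right) of the foreground pixels, or None if there are none."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(foreground_bounding_box(example_mask))  # (1, 1, 2, 2)
```

Later analysis would then only need to examine the 2x2 region inside the box instead of the full 4x4 frame.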

The term 'unit' as used in the present embodiments refers to software or to a hardware component such as a field programmable gate array (FPGA) or an ASIC, and a 'unit' performs certain roles. However, a 'unit' is not limited to software or hardware. A 'unit' may be configured to reside in an addressable storage medium, or may be configured to execute on one or more processors. Thus, as an example, a 'unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The functionality provided in the components and 'units' may be combined into a smaller number of components and 'units', or may be further separated into additional components and 'units'.

In addition, the components and 'units' may be implemented to run on one or more CPUs in a device or a secure multimedia card.

The image processing method according to an embodiment of the present invention may also be implemented as a computer program (or computer program product) including instructions executable by a computer. The computer program includes programmable machine instructions processed by a processor, and may be implemented in a high-level programming language, an object-oriented programming language, an assembly language, a machine language, or the like. The computer program may also be recorded on a tangible computer-readable recording medium (for example, a memory, a hard disk, a magnetic/optical medium, or a solid-state drive (SSD)).

Accordingly, the image processing method according to an embodiment of the present invention may be implemented by executing the computer program described above on a computing device. The computing device may include at least some of a processor, a memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and the storage device. These components are connected to one another using various buses, and may be mounted on a common motherboard or in another suitable manner.

Here, the processor may process instructions within the computing device. Such instructions include, for example, instructions stored in the memory or the storage device for displaying graphic information to provide a graphical user interface (GUI) on an external input/output device, such as a display connected to the high-speed interface. In another embodiment, multiple processors and/or multiple buses may be used, as appropriate, together with multiple memories and memory types. The processor may also be implemented as a chipset made up of chips that include multiple independent analog and/or digital processors.

The memory stores information within the computing device. In one example, the memory may consist of a volatile memory unit or a set of such units. In another example, the memory may consist of a non-volatile memory unit or a set of such units. The memory may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device may provide a large amount of storage space to the computing device. The storage device may be a computer-readable medium or a configuration including such a medium; for example, it may include devices within a storage area network (SAN) or other configurations, and may be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory, or another similar semiconductor memory device or device array.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: image processing apparatus 110: feature extraction unit
120: time information processing unit 130: background separation unit

Claims

An image processing apparatus comprising:
a feature extraction unit configured to extract visual information by applying a convolution filter to an input frame;
a time information processing unit configured to extract temporal information using a feature map including the visual information, update an internal state by feeding back the extracted temporal information, and output a probability result; and
a background separation unit configured to separate a foreground and a background of the input frame based on the probability result,
wherein the time information processing unit is implemented using a convolutional recurrent neural network and has a structure in which a convolutional recurrent encoder and a convolutional recurrent decoder are connected.

The image processing apparatus of claim 1,
wherein the convolution filter is
a filter implemented using a convolutional neural network (CNN).

The image processing apparatus of claim 1,
wherein, in the convolutional recurrent encoder,
scales are arranged in an order in which the feature map is downsampled at a predetermined ratio, and
in the convolutional recurrent decoder,
scales are arranged in an order in which the output of the convolutional recurrent encoder is upsampled at a predetermined ratio.

The image processing apparatus of claim 3,
wherein each of the scales
has the form of a map including a tensor and a state, which are information for updating the internal state.

The image processing apparatus of claim 3,
wherein, in the time information processing unit,
the convolutional recurrent neural network extracts temporal information using a convolutional long short-term memory (LSTM) at each of the scales.

The image processing apparatus of claim 4,
wherein the convolutional LSTM
is used in each of the convolutional recurrent encoder and the convolutional recurrent decoder, and the convolutional LSTM used at each of a plurality of scales of the convolutional recurrent encoder is used at each of the corresponding plurality of scales of the convolutional recurrent decoder.

An image processing method of an image processing apparatus, the method comprising:
extracting visual information by applying a convolution filter to an input frame;
extracting temporal information using a feature map including the visual information, updating an internal state by feeding back the extracted temporal information, and outputting a probability result; and
separating a foreground and a background of the input frame based on the probability result,
wherein the outputting of the probability result comprises
outputting the probability result using a convolutional recurrent neural network having a structure in which a convolutional recurrent encoder and a convolutional recurrent decoder are connected.

The image processing method of claim 7,
wherein the outputting of the probability result comprises:
configuring the convolutional recurrent encoder by arranging scales in an order in which the feature map is downsampled at a predetermined ratio; and
configuring the convolutional recurrent decoder by arranging scales in an order in which upsampling is performed after the downsampling has ended.

The image processing method of claim 8,
wherein each of the scales
has the form of a map divided into a tensor and a state, which are information for updating the internal state.

The image processing method of claim 8,
wherein the outputting of the probability result comprises
extracting temporal information using a convolutional long short-term memory (LSTM) at each of the scales.

A computer-readable recording medium on which a program for performing the method of claim 7 is recorded.

A computer program that is executed by an image processing apparatus and stored in a medium in order to perform the method of claim 7.