KR101930940B1

KR101930940B1 - Apparatus and method for analyzing image

Info

Publication number: KR101930940B1
Application number: KR1020170092221A
Authority: KR
Inventors: 임재균; 김봉모; 조영관
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2017-07-20
Filing date: 2017-07-20
Publication date: 2018-12-20

Abstract

According to an embodiment of the present invention, an image analyzing apparatus comprises: a feature map generating unit for generating a feature map of an image; an area extracting unit for extracting, based on the feature map, an area of the image where an object is estimated to be present; a feature extracting unit for extracting a feature of the area; a context extracting unit for extracting a context of the image; and a caption generating unit for generating a context-based feature by reflecting the context in the feature of the area, and for generating and outputting a caption for the area based on the context-based feature.

Description

APPARATUS AND METHOD FOR ANALYZING IMAGE

The present invention relates to an image analysis apparatus and method. More particularly, it relates to an apparatus and method that, in the course of analyzing an image using a neural network, analyze the image with reference to a context such as the place or time that the image represents.

Advances in deep learning, a type of machine learning, are rapidly advancing the field of image recognition. Image classification, one area of image recognition, is a technique for assigning an image to a single category by means of a word or sentence. Object detection, another area, is a technique for identifying and expressing what objects are present in an image and where each object is located in the image.

FIG. 1 shows the result of performing the image classification described above on a first image. As a result of the classification, the first image is classified as 'cat'.

FIG. 2 shows the result of performing the object detection described above on a second image. As a result of the object detection, a cat, a dog, and a duck are found as objects in the second image, and the position of each object is indicated by a bounding box.

Object detection, in particular, has advanced to the point where objects present in an image can be expressed in sentences. The technique of expressing object detection results as sentences is sometimes referred to as 'dense captioning'. FIG. 3 shows an example of a dense captioning result. Referring to FIG. 3, a third image is expressed by a sentence such as 'An orange spotted cat is riding a skateboard with red wheels...'.

Korean Registered Patent Publication No. 10-1719278 (published October 24, 2016)

As described above, in object detection, what objects are present in an image and where each object is located are expressed in words or sentences.

The problem to be solved by the present invention is to propose a way of improving the accuracy of the results of such object detection while also improving its efficiency.

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned here will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

An image analysis apparatus according to an embodiment is trained by means of a neural network and includes: a feature map generating unit that generates a feature map of an image; a region extracting unit that extracts, based on the feature map, a region of the image in which an object is estimated to be present; a feature extracting unit that extracts a feature of the region; a context extracting unit that extracts a context of the image; and a caption generating unit that generates a context-based feature by reflecting the context in the feature of the region, and generates a caption for the region based on the context-based feature so that the caption is output.

According to an embodiment, the context of the place or time that an image represents can be taken into account when generating a caption for an object present in the image. Here, 'taking the context into account' means that the context can act as a constraint in the process of detecting an object or in the process of generating a caption for the object. For example, when a plurality of captions have been generated for a detected object, the context can serve as a criterion for selecting one of them. Alternatively, only captions consistent with the context may be derived. In addition, when little information can be obtained from the detected object, the context itself may serve as information about the object. That is, when the context is taken into account, the accuracy, efficiency, or speed of object detection can be improved.

FIG. 1 illustrates an example result of image classification.
FIG. 2 illustrates an example result of object detection on an image.
FIG. 3 illustrates another example result of object detection on an image.
FIG. 4 illustrates an exemplary configuration of an image analysis apparatus according to an embodiment.
FIG. 5 illustrates an exemplary configuration of an image analysis apparatus according to an embodiment.
FIG. 6 conceptually illustrates the process by which a feature map is generated from a convolutional neural network and the process by which image classification is performed based on the feature map.
FIG. 7 conceptually illustrates a feature map of an image.
FIG. 8 shows the regions in which an object is estimated to be present, marked on the feature map shown in FIG. 7.
FIG. 9 conceptually illustrates the feature vectors corresponding to the respective regions shown in FIG. 8.
FIG. 10 conceptually illustrates a combined vector according to an embodiment.
FIG. 11 conceptually illustrates a combined vector according to an embodiment.
FIG. 12 illustrates an exemplary procedure of an image analysis method according to an embodiment.

The advantages and features of the present invention, and the manner of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention; the invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, a detailed description of known functions or configurations will be omitted when it is judged that it could unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of a user or operator. Their definitions should therefore be made on the basis of the content of this specification as a whole.

FIG. 4 illustrates an exemplary configuration of an image analysis apparatus according to an embodiment. Before describing FIG. 4, note that the image analysis apparatus 1000 can be implemented on a computer that includes a memory storing instructions programmed to perform the functions described below and a microprocessor that executes those instructions.

Referring to FIG. 4, the image analysis apparatus 1000 may include an input unit 100, an analysis unit 200, a storage unit 300, a learning unit 400, and an output unit 500. Depending on the embodiment, it may omit at least one of these components or may further include additional components not mentioned here.

The input unit 100 receives an image from the outside. Here, the 'outside' may be an imaging device that captures images or a memory that stores previously captured images. The input unit 100 may include a port for connecting to this 'outside'.

The analysis unit 200 classifies an image. The classification criteria may include, for example, what place the image represents or when the image was captured, but are not limited to these.

The analysis unit 200 sets the classification result as the context of the image. For example, when the place represented by a first image is classified as a kitchen, the analysis unit 200 may set 'kitchen' as the context of the first image, and when the place represented by a second image is classified as a baseball field, it may set 'baseball field' as the context of the second image. Likewise, when the time represented by a third image is dawn, the analysis unit 200 may set 'dawn' as the context of the third image. In other words, the context means the circumstances of the image.

The analysis unit 200 performs object detection on the image in consideration of the context described above. The analysis unit 200 is described in more detail with reference to FIG. 5.

The storage unit 300 stores images. An image stored in the storage unit 300 may be used for training (deep learning) of the analysis unit 200 or may be an image on which the analysis unit 200 performs object detection. The storage unit 300 can be implemented with a memory.

The learning unit 400 trains the analysis unit 200 using the images stored in the storage unit 300. What the learning unit 400 trains may include, for example, the parameters used to generate a feature map of an image, the parameters used to extract a context from an image, the parameters used to extract regions of an image in which an object is estimated to be present, the parameters used to extract features from such regions, and the parameters used to generate a caption based on those features, but is not limited to these. The algorithm that the learning unit 400 uses for training and what it trains are described in more detail later.

Here, the analysis unit 200 described above may already have been trained by the learning unit 400. That is, the image analysis apparatus 1000 according to an embodiment can analyze an image using an analysis unit 200 whose training by the learning unit 400 is already complete.

The output unit 500 outputs a caption for the image itself or for an object in the image. A caption consists of a word or a sentence containing words. The output unit 500 can be implemented with a monitor or similar device capable of displaying images and text.

As discussed above, when detecting an object in an image and generating a caption for that object, an embodiment classifies the image to extract a context and then detects the object and generates the caption in consideration of that context. Here, 'taking the context into account' means that the context can act as a constraint in the process of detecting an object or in the process of generating a caption for the object. For example, when a plurality of captions have been generated for a detected object, the context can serve as a criterion for selecting one of them. Alternatively, only captions consistent with the context may be derived. In addition, when little information can be obtained from the detected object, the context itself may serve as information about the object. That is, when the context is taken into account, the accuracy, efficiency, or speed of object detection can be improved.

FIG. 5 illustrates an exemplary configuration of the image analysis apparatus 1000 in which the detailed configuration of the analysis unit 200 is shown, according to an embodiment. However, FIG. 5 merely illustrates the image analysis apparatus 1000 and its components by way of example, so the idea of the present invention is not to be construed as limited to what is shown in FIG. 5.

Referring to FIG. 5, the analysis unit 200 may include a feature map generating unit 210, an extracting unit 220, a context extracting unit 250, and a caption generating unit 260, and the extracting unit 220 may include a region extracting unit 230 and a feature extracting unit 240.

The feature map generating unit 210 generates a feature map from an image. A feature map contains meaningful information from the image, such as its contours, luminance, and colors and the shapes of objects present in the image.

The feature map generating unit 210 can generate the feature map using a deep learning model that has already been trained by the learning unit 400. Deep learning is defined as a set of machine learning algorithms that attempt a high level of abstraction (summarizing the key content or functions in large amounts of data or complex material) through a combination of several nonlinear transformation techniques.

The model used by the feature map generating unit 210 may be any one of the deep learning models, for example a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), or a deep belief network (DBN). The description below assumes that a convolutional neural network is used.

Briefly, a convolutional neural network is a kind of multilayer perceptron designed to require minimal preprocessing. It includes convolution layers, which perform convolution on the input image, and subsampling layers, which perform subsampling on the image, and it extracts a feature map from the image. Here, a subsampling layer is a layer that extracts local maximum values from its input and maps them to a two-dimensional image.
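The subsampling layer described above (local maxima mapped to a smaller two-dimensional image) can be sketched as follows. This is only an illustration: the 2 x 2 non-overlapping window and the function name `max_pool` are assumptions, since the description does not state a window size or stride.

```python
from typing import List

Matrix = List[List[float]]


def max_pool(image: Matrix, window: int = 2) -> Matrix:
    """Non-overlapping max pooling: keep the largest value of each window x window block.

    The 2 x 2 default is an assumption, not a value taken from the description.
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - window + 1, window):
        pooled_row = []
        for c in range(0, cols - window + 1, window):
            block = [image[r + dr][c + dc] for dr in range(window) for dc in range(window)]
            pooled_row.append(max(block))
        pooled.append(pooled_row)
    return pooled


example = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 3],
]
pooled_example = max_pool(example)  # a 4 x 4 input becomes a 2 x 2 output
```

Each output value summarizes one block of the input, which is why the feature maps shrink at every subsampling stage.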

FIG. 6 illustrates by way of example the structure of the convolutional neural network described above, the process by which a feature map is extracted from it, and the process of classifying an image. Of the stages (a) through (h) shown in FIG. 6, the feature map generating unit 210 uses stages (a) through (f) to generate the feature map. For reference, stage (g) is used by the feature extracting unit 240 and the context extracting unit 250, and stage (h) is used by the context extracting unit 250; these are examined in more detail in the descriptions of the feature extracting unit 240 and the context extracting unit 250.

Looking first at the structure of the convolutional neural network shown in FIG. 6, the network includes a plurality of convolution layers ((b), (d), and (f) in FIG. 6), a plurality of subsampling layers (max-pooling layers; (c) and (e) in FIG. 6), and a fully connected layer ((g) in FIG. 6). Each of these structures is well known, so their description is omitted here.

Next, consider an example of the feature map extraction process. Assume that an image 101 with a resolution of 32 x 32 is input through the input unit 100 (a). When a convolution layer using a convolution kernel 102 is applied to the image 101, a plurality of feature maps smaller than the image are generated (b). In (b), 20 convolution kernels of size 5 x 5 are applied, producing 20 feature maps of size 28 x 28. Then, as illustrated, subsampling layers and convolution layers are applied alternately from (c) through (f), so that 20 feature maps 211 of size 3 x 3 are produced at (f).
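The size arithmetic in this example can be checked with a short sketch. Only the first step (a 5 x 5 kernel turning 32 x 32 into 28 x 28) is stated in the description; the remaining kernel and window sizes in `LAYERS` are assumptions chosen so that the chain ends at 3 x 3, and stride-1 convolution without padding is likewise assumed.

```python
from typing import List, Tuple


def conv_output_size(size: int, kernel: int) -> int:
    """Side length after a stride-1 convolution without padding (assumed)."""
    return size - kernel + 1


def pool_output_size(size: int, window: int) -> int:
    """Side length after non-overlapping pooling with the given window (assumed)."""
    return size // window


# (layer type, kernel or window size) for stages (b) to (f) of FIG. 6.
LAYERS: List[Tuple[str, int]] = [
    ("conv", 5),  # (b): 32 -> 28, as stated in the description
    ("pool", 2),  # (c): assumed 2 x 2 max pooling
    ("conv", 5),  # (d): assumed 5 x 5 kernel
    ("pool", 2),  # (e): assumed 2 x 2 max pooling
    ("conv", 3),  # (f): assumed 3 x 3 kernel
]


def trace_sizes(input_size: int) -> List[int]:
    """Return the side length of the feature maps after each stage."""
    sizes = [input_size]
    for kind, k in LAYERS:
        previous = sizes[-1]
        if kind == "conv":
            sizes.append(conv_output_size(previous, k))
        else:
            sizes.append(pool_output_size(previous, k))
    return sizes


sizes = trace_sizes(32)  # side lengths at stages (a) through (f)
```

Other combinations of kernel sizes could also end at 3 x 3; the point is only that alternating convolution and subsampling steadily reduces the spatial size.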

FIG. 7 shows, in the form of a three-dimensional tensor, the feature map generated by passing through stages (a) to (f) of a convolutional neural network such as that of FIG. 6. When the input image 101 has a resolution of W (width) x H (height), passing it through stages (a) to (f) of the convolutional neural network of the feature map generating unit 210 produces C feature maps of size W' x H', which is smaller than the input image 101. Stacking these C feature maps yields the feature map 211 in the form of a three-dimensional tensor, as shown in FIG. 7.
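As a rough illustration of this stacking, the sketch below builds a three-dimensional tensor from C two-dimensional maps using plain nested lists. The function names and the channel-first C x H' x W' layout are assumptions made for the example; the description does not prescribe a memory layout.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]
Tensor = List[FeatureMap]


def make_feature_tensor(maps: List[FeatureMap]) -> Tensor:
    """Stack C feature maps (each H' x W') into a C x H' x W' nested list."""
    height, width = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all feature maps must share the same H' x W' size")
    return [[row[:] for row in m] for m in maps]


def tensor_shape(tensor: Tensor) -> Tuple[int, int, int]:
    """Return (C, H', W') for a stacked feature tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


# Twenty 3 x 3 maps, matching the sizes in the worked example of FIG. 6.
maps = [[[float(c)] * 3 for _ in range(3)] for c in range(20)]
tensor = make_feature_tensor(maps)
shape = tensor_shape(tensor)
```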

Returning to FIG. 5, the extracting unit 220 extracts regions of the image in which an object is estimated to be present and extracts features from the extracted regions. The extracting unit 220 includes the region extracting unit 230 and the feature extracting unit 240.

Looking first at the region extracting unit 230, it extracts at least one region of the image in which an object is estimated to be present, based on the feature map generated by the feature map generating unit 210. Region extraction methods include, for example, Faster R-CNN, SSD (Single Shot MultiBox Detector), and YOLO (You Only Look Once); the description below assumes that Faster R-CNN is used.

With Faster R-CNN, the region extracting unit 230 may select, from among the feature maps, a feature map containing the coordinates of the class of each region of the image, identify in the selected feature map the coordinates that delimit the regions, and extract the identified coordinates as the regions in which an object is estimated to be present.

In addition, the region extracting unit 230 may mark each of the at least one extracted region with a bounding box that encloses the outermost boundary of the corresponding object. FIG. 8 shows a feature map 231 on which bounding boxes 232 have been marked by the region extracting unit 230. Referring to FIG. 8, four bounding boxes 232 are marked on the feature map 231. The number and size of the bounding boxes 232 are merely exemplary, and the number of bounding boxes 232 is denoted by B below. Each bounding box 232 indicates that an object may be present at the position of that bounding box 232 in the image.

Referring again to FIG. 5, the feature extracting unit 240 extracts a feature for each of the at least one region extracted by the region extracting unit 230. To do so, the feature extracting unit 240 may extract from the feature map 231 shown in FIG. 8, for each bounding box 232, a vector whose top face is the portion marked by the bounding box 232 and whose height is C. The feature extracting unit 240 then converts each extracted vector into a vector 241 of the form 1 x D by passing it through the fully connected layer, the portion labeled (g) in the structure of the convolutional neural network shown in FIG. 6. FIG. 9 illustrates the B vectors 241 of the form 1 x D generated in this way. A vector 241 of the form 1 x D is also referred to as a feature vector or a region code, and each element of the feature vector 241 represents a feature of the region that the vector represents.
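The per-region feature extraction can be sketched as below. Two points are assumptions rather than part of the description: the crop is reduced to one value per channel by taking its maximum (a simplification of the RoI pooling used in Faster R-CNN, needed here because bounding boxes differ in size while a fully connected layer expects a fixed-length input), and the weights are random placeholders rather than trained parameters. All function names are hypothetical.

```python
import random
from typing import List, Tuple

Tensor = List[List[List[float]]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop_region(tensor: Tensor, box: Box) -> Tensor:
    """Cut the bounding box out of every channel, keeping the full height C."""
    top, left, bottom, right = box
    return [[row[left:right] for row in channel[top:bottom]] for channel in tensor]


def pool_region(region: Tensor) -> List[float]:
    """Assumed step: reduce each channel of the crop to its maximum, giving C values."""
    return [max(max(row) for row in channel) for channel in region]


def fully_connected(vector: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Affine layer: output[j] = sum_i vector[i] * weights[i][j] + bias[j]."""
    return [
        sum(v * row[j] for v, row in zip(vector, weights)) + bias[j]
        for j in range(len(bias))
    ]


def region_features(tensor: Tensor, boxes: List[Box], weights, bias) -> List[List[float]]:
    """Return one 1 x D feature vector per bounding box (B vectors in total)."""
    return [fully_connected(pool_region(crop_region(tensor, box)), weights, bias) for box in boxes]


# Toy sizes: C = 4 channels, a 6 x 6 feature map, D = 5 output elements, B = 2 boxes.
C, H, W, D = 4, 6, 6, 5
rng = random.Random(0)
tensor = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
weights = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(C)]  # placeholder, untrained
bias = [0.0] * D
boxes = [(0, 0, 3, 3), (2, 1, 6, 5)]
features = region_features(tensor, boxes, weights, bias)
```

Whatever the size of each bounding box, every region ends up as a vector of the same length D, which is what allows the B vectors of FIG. 9 to be handled uniformly.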

Referring again to FIG. 5, the context extracting unit 250 extracts the context of an image. To do so, the context extracting unit 250 classifies the input image. The classification criteria may include, for example, what place the image represents or when the image was captured, but are not limited to these.

The context extracting unit 250 sets the classification result as the context of the image. For example, when the place represented by a first image is classified as a kitchen, the context extracting unit 250 may set 'kitchen' as the context of the first image. When the place represented by a second image is classified as a baseball field, the context extracting unit 250 may set 'baseball field' as the context of the second image. Alternatively, when the time represented by a third image is dawn, the context extracting unit 250 may set 'dawn' as the context of the third image. In other words, the context means the circumstances of the image.

Here, examples of how the context extracting unit 250 classifies an image are described. In a first method, the context extracting unit 250 passes the feature map generated by the feature map generating unit 210 through the fully connected layer, the portion labeled (g) in FIG. 6. As a result, the feature map 211 is converted into a vector 241 of the form 1 x D' (where D' is a natural number). The vector 241 of the form 1 x D' is a context vector representing the classification result of the image. For example, when the place of the image is a kitchen, the vector 241 of the form 1 x D' expresses 'kitchen' in vector form.

Alternatively, in a second method, the context extracting unit 250 passes the feature map generated by the feature map generating unit 210 through the fully connected layer, the portion labeled (g) in FIG. 6. As a result, the feature map 211 is converted into a vector 241 of the form 1 x D'. The context extracting unit 250 then passes the vector 241 of the form 1 x D' through the fully connected output layer, the portion labeled (h) in FIG. 6. As a result, the vector 241 of the form 1 x D' is converted into a single element 251. This single element 251 is a context vector representing the classification result of the image. For example, when the place of the image is a bathroom, the element 251 indicates 'bathroom'.
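The first two methods can be contrasted in a small sketch. The description does not say how the output layer reduces the 1 x D' vector to a single element, so the code assumes that the output layer produces one score per class and that the single element is the index of the highest score. The weights, the label list, and all function names are hypothetical.

```python
from typing import List

Tensor = List[List[List[float]]]


def flatten(tensor: Tensor) -> List[float]:
    """Unroll a C x H' x W' feature tensor into one long vector."""
    return [value for channel in tensor for row in channel for value in row]


def affine(vector: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: output[j] = sum_i vector[i] * weights[i][j] + bias[j]."""
    return [
        sum(v * row[j] for v, row in zip(vector, weights)) + bias[j]
        for j in range(len(bias))
    ]


def context_vector(tensor: Tensor, weights, bias) -> List[float]:
    """First method: stage (g) turns the feature map into a 1 x D' context vector."""
    return affine(flatten(tensor), weights, bias)


def context_element(ctx: List[float], out_weights, out_bias) -> int:
    """Second method: stage (h) reduces the 1 x D' vector to one element.

    Assumption: the element is the index of the highest class score.
    """
    scores = affine(ctx, out_weights, out_bias)
    return max(range(len(scores)), key=scores.__getitem__)


# Tiny hand-made example: a 1 x 1 x 2 feature tensor, D' = 3, two hypothetical classes.
LABELS = ["kitchen", "baseball field"]
tensor = [[[1.0, 2.0]]]
fc_weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
fc_bias = [0.0, 0.0, 0.0]
out_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
out_bias = [0.0, 0.0]

ctx = context_vector(tensor, fc_weights, fc_bias)      # first method
element = context_element(ctx, out_weights, out_bias)  # second method
label = LABELS[element]
```

The first method keeps a richer D'-element representation of the context, while the second collapses it to a single value that is cheaper to attach to each region feature.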

In a third method, the input unit 100 receives information about the place or time of an image from the outside, and the context extracting unit 250 may classify the image based on the information thus input. In this case, the information input through the input unit 100 may have the same form as the result of the second method.

The caption generating unit 260 generates a caption for an object. The generated caption may be a word or a sentence.

The caption generating unit 260 can use a deep learning model when generating a caption for an object. The model may be, but is not limited to, a long short-term memory (LSTM) network already trained by the learning unit 400; the description below assumes that an already trained LSTM is used.

The input to the LSTM is a context-based feature. A context-based feature is a feature of a region extracted by the feature extracting unit 240 in which the context of the image, as extracted by the context extracting unit 250, has been reflected; this is done for each region feature. For example, if a first region extracted by the feature extracting unit 240 contains 'man' as an object, and the context extracting unit 250 holds the information 'kitchen' as the context of the image, the context-based feature may be information of the form 'kitchen, man'.

The form of the context-based feature is now described in more detail. First, as described with reference to FIG. 9, the feature vector extracted by the feature extracting unit 240 has the form 1 x D for each of the B regions in the image.

When the context extracting unit 250 extracts the context according to the first method described above, the context vector has the form 1 x D'. In this case, the description generating unit 260 concatenates the feature vector having D elements and the context vector having D' elements in series to generate a joint vector of the form 1 x (D+D'). If D' equals D, the joint vector has the form 1 x 2D. FIG. 10 illustrates, by way of example, the joint vector 261 when D' equals D.
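The serial concatenation can be sketched as follows. The function name and the use of plain Python lists are illustrative assumptions; the same image-level context vector is appended to the feature vector of each of the B regions.

```python
def concat_context(region_features, context_vector):
    """Return one 1 x (D + D') joint vector per region.

    region_features: B vectors of length D (one per region).
    context_vector:  one vector of length D' shared by the whole image.
    """
    return [list(f) + list(context_vector) for f in region_features]
```

With B = 2, D = 3 and D' = 3, each resulting joint vector has 6 elements, matching the 1 x 2D case.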

Alternatively, when the context extracting unit 250 extracts the context according to the second or third method described above, the context vector is expressed by a single element. In this case, the description generating unit 260 concatenates the context expressed by one element in series to the feature vector having D elements, generating a joint vector of the form 1 x (D+1). FIG. 11 illustrates, by way of example, the joint vector 262 when the context is extracted according to the second method.
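A minimal sketch of the 1 x (D+1) case follows. The text does not say how the single context element is encoded, so this sketch assumes it is the numeric index of the place in a fixed label list; `PLACES` and the function name are hypothetical.

```python
PLACES = ["kitchen", "toilet", "street"]  # hypothetical label set

def concat_scalar_context(region_features, place):
    """Append a one-element context code to each D-element feature vector."""
    code = float(PLACES.index(place))  # assumption: context is a class index
    return [list(f) + [code] for f in region_features]
```

Because the third method may supply the place from outside, the same function could take an externally provided label in place of a classifier output.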

The description generating unit 260 feeds the joint vector generated by any of the methods described above into the LSTM as input and, as a result, may generate a description of each object as a sentence or the like.

Accordingly, in one embodiment, the context regarding the place or the time represented by an image may be taken into account when generating a description of an object present in the image. Here, 'taking the context into account' means that the context can act as a constraint in the process of detecting an object or in the process of generating a description of the object. For example, when a plurality of descriptions are generated from a detected object, the context may serve as the criterion for selecting one of them. Alternatively, only descriptions consistent with the context may be derived. In addition, when little information can be obtained from the detected object, the context itself may serve as the information about that object. That is, when the context is taken into account, the accuracy, efficiency, or speed of object detection may be improved.

Meanwhile, when generating the joint vector, the description generating unit 260 may multiply the feature vector and the context vector by different weights. For example, the description generating unit 260 compares the size of a region extracted by the region extracting unit 230 with a predefined reference. If the size of the extracted region is larger than the reference, both the context vector and the feature vector may be multiplied by a weight of 1; if the size is smaller than the reference, the context vector may be multiplied by a weight greater than 1. When an extracted region is small, the kinds or amount of information obtainable from it are likely to be limited, so the description of that region may be less accurate. To address this, in one embodiment, when the size of the region is smaller than the reference, the context vector is multiplied by a weight greater than 1 before the joint vector is generated. In this case, the lack of information about the region can be compensated for by the context, so the loss of accuracy in the description can be mitigated.

In addition, the description generating unit 260 may compare the number of regions extracted by the region extracting unit 230 with a predefined reference. If the number of extracted regions is smaller than the reference, both the context vector and the feature vector may be multiplied by a weight of 1; if the number is larger than the reference, the context vector may be multiplied by a weight greater than 1.
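The two weighting rules above can be sketched together. The thresholds `min_area` and `max_regions`, the weight `boost`, the box format (x1, y1, x2, y2), and the choice to combine the two rules with a logical OR are all assumptions made for illustration; the text only requires a weight greater than 1 in the stated cases.

```python
def context_weight(box, num_regions, min_area=32 * 32, max_regions=10, boost=2.0):
    """Return the weight applied to the context vector for one region."""
    x1, y1, x2, y2 = box
    area = max(0, x2 - x1) * max(0, y2 - y1)
    # Small region, or more regions than the reference: rely more on context.
    if area < min_area or num_regions > max_regions:
        return boost
    return 1.0

def weighted_joint_vector(feature, context, weight):
    # The feature vector keeps weight 1; only the context part is scaled.
    return list(feature) + [weight * c for c in context]
```

For example, a 10 x 10 box falls below the assumed 32 x 32 reference, so its context vector would be doubled before concatenation.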

Meanwhile, at least one of the feature map generating unit 210, the region extracting unit 230, the feature extracting unit 240, the context extracting unit 250, and the description generating unit 260 described so far may have already been trained by the learning unit 400 using a deep learning method. For example, the parameters used by the region extracting unit 230 to extract regions (for example, the parameters used in Faster R-CNN) may be trained so as to minimize the error (objectness loss) between the extracted result and the ground truth. That is, the parameters may be trained so that the error is minimized by comparing the position of the bounding box extracted by the region extracting unit 230 with the position of the bounding box given by the ground truth.

The same applies to the context extracting unit 250. The parameters used by the context extracting unit 250 to extract the context (for example, those of the fully connected layers) may be trained so as to minimize the error (classification loss) between the extracted result and the ground truth. That is, the parameters may be trained so that the error is minimized by comparing the place of the image extracted by the context extracting unit 250 with the place of the image given by the ground truth.
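The text does not name a specific classification loss. A common choice for this kind of place classification is softmax cross-entropy, sketched below under that assumption; the function names are illustrative.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(scores, true_index):
    """Cross-entropy between predicted place scores and the ground-truth place."""
    return -math.log(softmax(scores)[true_index])
```

The loss shrinks as the score of the ground-truth place rises relative to the others, which is the quantity training would drive down.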

In addition, when training is performed by the learning unit 400, the training may be performed end to end. That is, once an image has been input and object detection for that image has been completed, ending one epoch, the learning unit 400 may perform training at that point, and training may be performed each time an epoch ends. In this case, the influence of the context may be reflected in at least one of the process of generating a feature map from the image, the process of extracting regions and features from the feature map, and the process of generating a description from the features.

FIG. 12 is a diagram illustrating the procedure of an image analysis method according to an embodiment. The image analysis method shown in FIG. 12 can be implemented by the image analysis apparatus 1000 described so far. The method shown in FIG. 12 is merely an example, and the image analysis method is not to be construed as limited to what is shown in FIG. 12.

Referring to FIG. 12, the input unit 100 receives an image (S100). The feature map generating unit 210 of the analysis unit 200 generates a feature map from the image (S110). The region extracting unit 230 of the analysis unit 200 extracts, from the feature map, regions in which an object is estimated to be present (S120). The feature extracting unit 240 of the analysis unit 200 extracts a feature for each of the regions extracted in step S120 (S130). The context extracting unit 250 of the analysis unit 200 extracts the context of the image from the feature map (S140). The description generating unit 260 combines the features extracted in step S130 with the context extracted in step S140 and then generates a description of the objects in the image (S150). The output unit 500 outputs the description of the objects in the image. This image analysis method is performed by the image analysis apparatus 1000 described above, and because the process has already been discussed in the description of the image analysis apparatus 1000, a detailed description is omitted here.
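The flow of steps S110 to S150 can be sketched as a single function. Every stage is passed in as a callable, and all parameter names are hypothetical; the sketch only shows the order of the steps and the fact that the context is extracted once per image and joined to every region feature.

```python
def analyze_image(image, make_feature_map, propose_regions, region_feature,
                  extract_context, caption):
    """Run the analysis steps on one already-received image (S100)."""
    feature_map = make_feature_map(image)                          # S110
    regions = propose_regions(feature_map)                         # S120
    features = [region_feature(feature_map, r) for r in regions]   # S130
    context = extract_context(feature_map)                         # S140
    # S150: join each region feature with the context, then caption it.
    return [caption(list(f) + list(context)) for f in features]
```

Passing the stages as arguments keeps the sketch independent of any particular network; in the apparatus described here they would correspond to units 210, 230, 240, 250 and 260.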

Meanwhile, the image analysis method according to an embodiment can be implemented in the form of a computer program stored in a computer-readable recording medium and programmed to perform each step of the method.

As described above, in one embodiment, the context regarding the place or the time represented by an image may be taken into account when generating a description of an object present in the image. Here, 'taking the context into account' means that the context can act as a constraint in the process of detecting an object or in the process of generating a description of the object. For example, when a plurality of descriptions are generated from a detected object, the context may serve as the criterion for selecting one of them. Alternatively, only descriptions consistent with the context may be derived. In addition, when little information can be obtained from the detected object, the context itself may serve as the information about that object. That is, when the context is taken into account, the accuracy, efficiency, or speed of object detection may be improved.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the present invention.

According to an embodiment, the accuracy, efficiency, or speed of object detection may be improved.

1000: image analysis apparatus
200: analysis unit
210: feature map generating unit 230: region extracting unit
240: feature extracting unit 250: context extracting unit
260: description generating unit

Claims

An image analysis apparatus comprising:
a feature map generating unit that generates a feature map of an image using a neural network;
a region extracting unit that extracts, based on the feature map, a region of the image in which an object is estimated to be present;
a feature extracting unit that extracts a feature of the region;
a context extracting unit that extracts, from the feature map, a context including information on the place in which the object is present in the image; and
a description generating unit that generates a context-based feature by reflecting the context in the feature of the region, and generates a description (caption) of the region based on the context-based feature so that the description is output,
wherein, when a feature vector for the feature representing the region is a 1 by n vector, a context vector for the context is a 1 by m vector, and n and m are natural numbers, a 1 by (n+m) joint vector in which the feature vector and the context vector are concatenated in series is generated as the context-based feature.

(Deleted)

The image analysis apparatus according to claim 1,
wherein the description generating unit
compares the size of the region with a predefined reference, sets a weight according to the result of the comparison, and generates the joint vector by concatenating in series the result of multiplying the context vector by the weight and the feature vector.

The image analysis apparatus according to claim 1,
wherein the description generating unit
compares the number of regions with a predefined reference, sets a weight according to the result of the comparison, and generates the joint vector by concatenating in series the result of multiplying the context vector by the weight and the feature vector.

The image analysis apparatus according to claim 1,
wherein the context extracting unit
has been trained through a process of minimizing the difference between the context extracted by the context extracting unit and the ground truth for the context of the image.

A computer program stored in a computer-readable recording medium, programmed to perform steps comprising:
generating a feature map of an image using a neural network;
extracting, based on the feature map, a region of the image in which an object is estimated to be present;
extracting a feature of the region;
extracting, from the feature map, a context including information on the place in which the object is present in the image;
generating a context-based feature by reflecting the context in the feature of the region;
generating a description (caption) of the region based on the context-based feature; and
outputting the generated description,
wherein, when a feature vector for the feature representing the region is a 1 by n vector, a context vector for the context is a 1 by m vector, and n and m are natural numbers, a 1 by (n+m) joint vector in which the feature vector and the context vector are concatenated in series is generated as the context-based feature.