KR20220029335A

KR20220029335A - Method and apparatus to complement the depth image

Info

Publication number: KR20220029335A
Application number: KR1020210066115A
Authority: KR
Inventors: 엘브이 자오후이; 장 웨이지아; 팬 밍밍
Original assignee: 삼성전자주식회사
Priority date: 2020-08-31
Filing date: 2021-05-24
Publication date: 2022-03-08
Also published as: CN112001914B; CN112001914A

Abstract

A method and apparatus for complementing a depth image are provided. The method includes: obtaining an original color image and a corresponding original depth image; obtaining a first depth image from the original color image using a first deep neural network; obtaining a second depth image using a second deep neural network, based on the original depth image and the intermediate feature images generated by the intermediate layers of the first deep neural network; and obtaining a final depth image by merging the first depth image and the second depth image.

Description

METHOD AND APPARATUS TO COMPLEMENT THE DEPTH IMAGE

The following embodiments relate to the field of image processing and, more particularly, to a method and apparatus for complementing a depth image.

High-quality, complete depth image information plays an important role in many depth-based applications, such as 3D reconstruction, autonomous driving, augmented reality, and robotics. However, current consumer-grade depth cameras suffer from missing depth values: the image quality is poor, the depth image is sparse, or the depth image contains holes. Conventional depth map complementation algorithms that address these problems fall into two broad categories: traditional methods based on filtering, and deep learning methods that fill in depth values by building a regression model.

Traditional methods dilate and fill the depth image, mainly on the basis of filtering, Markov random field models, and the like, and use texture information such as edges as a constraint to obtain a complete depth image. Methods of this type require many features to be designed by hand, which has limited the development of traditional methods.

Deep learning methods mainly build a regression model and use it to learn a mapping from the original depth image to the complete depth image. Methods of this type have drawbacks such as blurry output images, indistinct edges, and unsatisfactory results at edges and in regions where depth is missing over a large area.

Deep-learning-based methods for complementing depth images have achieved a certain degree of success. Depending on whether RGB image information is used, these methods fall into two categories: methods guided by an RGB image and methods not guided by an RGB image. Methods without RGB guidance generally build a regression model using an encoder-decoder, a generative adversarial network, or the like, and this single-regression-model approach has achieved visible results in the field of color image restoration. However, because depth recovery requires accurate depth values, these methods often degenerate into simple interpolation or copying of neighboring pixels, which leaves the output image blurred and its edges unclear. RGB-guided methods attempt to mine RGB image information through feature encoding and feature fusion and use it to guide the depth complementation process. They achieve some improvement in accuracy, but their results at edges and in regions where depth is missing over a large area are still unsatisfactory.

The problems of conventional deep-learning-based methods appear mainly in the following aspects.

1. In approaches that do not adopt an intermediate representation, the existing feature fusion schemes are too simple to fuse the color image and the depth image effectively, so the generated depth image is of poor quality. For example, simple image concatenation (splicing) or pixel-wise addition at the input stage or the feature stage yields insufficient information fusion. As another example, even when a pyramid network is used to extract multi-level features from the depth map and the RGB map separately and these features are fused at the input of the decoder, the result is still not ideal: edge quality is low, texture is poorly restored, and structure is incomplete.

2. In approaches that adopt an intermediate representation, existing methods convert the color image into an intermediate representation with a deep network and then generate the depth image from that representation, which makes the network's prediction task easier and thereby improves the quality of the depth image. For example, to simplify the prediction task at each stage, surface normals and occlusion boundaries have been proposed as intermediate representations, and the original depth map is finally complemented through global optimization. As another example, a branch network has been proposed to learn a representation of intermediate features, which are then concatenated with the RGB image and the depth image for depth prediction. With these approaches, the effectiveness of depth complementation depends on the quality of the hand-extracted features or generated intermediate representations and on the strategy used to fuse them.

3. Most existing depth image complementation methods target one specific depth complementation task. For example, the DeepLiDAR method and the multi-scale cascade hourglass network address only sparse-to-dense depth complementation, while other methods address only depth complementation for images with holes. Conventional methods are therefore neither general nor robust.

A method is needed that solves the problems of the prior art and improves the result of depth image complementation.

A method of complementing a depth image according to an embodiment of the present invention includes: acquiring an original color image and a corresponding original depth image; acquiring a first depth image from the original color image using a first deep neural network; acquiring a second depth image using a second deep neural network, based on the original depth image and the intermediate feature images generated by the respective intermediate layers of the first deep neural network; and merging the first depth image and the second depth image to acquire a final depth image.

Here, the first deep neural network may include a first encoder network and a first decoder network each having an N-layer residual structure, and the second deep neural network may include a second encoder network and a second decoder network each having an N-layer residual structure, where N is an integer greater than 1. The acquiring of the second depth image may include performing feature decoding using the second decoder network, based on the outputs of the first encoder network and the second encoder network, the intermediate feature images of the first decoder network, and the intermediate feature images of the second encoder network.

Here, the acquiring of the second depth image may include performing feature encoding using the second encoder network of the second deep neural network, based on the original color image and the intermediate feature images of the first encoder network.

Here, the first deep neural network may further include a first preprocessing network preceding the first encoder network and a first depth prediction network following the first decoder network. The acquiring of the first depth image may include: converting the original color image, using the first preprocessing network, into a first feature image suitable for deep neural network processing and inputting the first feature image to the first encoder network; and synthesizing the feature image output by the first decoder network into the first depth image using the first depth prediction network. The second deep neural network may further include a second preprocessing network preceding the second encoder network and a second depth prediction network following the second decoder network. The acquiring of the second depth image may include: converting the original depth image, using the second preprocessing network, into a second feature image suitable for deep neural network processing and inputting the second feature image to the second encoder network; and fusing, using the second depth prediction network, the feature images output by the first decoder network and the second decoder network with the second feature image to acquire the second depth image.

Here, the input of the first-layer decoding unit of the second decoder network may be the sum of the feature image output by the second encoder network and the feature image output by the first encoder network. The input of each decoding unit from the second layer to the N-th layer of the second decoder network may be a feature image obtained by fusing, by means of an SE block, the feature image output by the preceding decoding unit, the feature image output by the corresponding decoding unit of the first decoder network, and the feature image output by the corresponding encoding unit of the second encoder network. The input of the second depth prediction network may be a feature image obtained by fusing, by means of the SE block, the feature image output by the second decoder network, the feature image output by the first decoder network, and the second feature image.
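By way of illustration only, the SE-block fusion of three feature images may be sketched as follows. The sketch assumes that "SE block" denotes a squeeze-and-excitation block, that the feature images are concatenated along the channel axis before reweighting, and that the reweighted channels are reduced back to C channels by summing same-index channels (a stand-in for a learned reduction layer). The function name `se_fuse` and the weight matrices `w1` and `w2` are hypothetical and do not appear in the specification.

```python
import math


def se_fuse(feature_groups, w1, w2):
    """Fuse several feature images with a squeeze-and-excitation style gate.

    feature_groups: list of feature images; each is a list of C channels,
                    and each channel is a 2D list of floats (same H x W).
    w1: hidden x K weight matrix, w2: K x hidden weight matrix,
        where K is the total channel count after concatenation (assumed).
    Returns (fused C-channel feature image, per-channel scales).
    """
    # Concatenate all feature images along the channel axis.
    channels = [ch for group in feature_groups for ch in group]

    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in channels]

    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]

    # Reweight every channel by its learned scale.
    scaled = [[[v * s for v in r] for r in ch] for ch, s in zip(channels, scales)]

    # Assumed reduction: sum same-index channels across the input groups.
    c = len(feature_groups[0])
    fused = []
    for k in range(c):
        members = scaled[k::c]
        fused.append([[sum(vals) for vals in zip(*rows)] for rows in zip(*members)])
    return fused, scales
```

With all-zero weights every channel scale is 0.5, so the fused result is the plain average of two inputs; trained weights would instead emphasize the more informative channels.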

Here, the input of the first-layer encoding unit of the second encoder network may be the sum of the first feature image and the second feature image, and the input of each encoding unit from the second layer to the N-th layer of the second encoder network may be the sum of the feature image output by the preceding encoding unit and the feature image output by the corresponding encoding unit of the first encoder network.
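The wiring of the second encoder network described above may be sketched as follows. This is a minimal illustration under stated assumptions: feature images are single-channel 2D lists, each encoding unit is an arbitrary callable that preserves shape, and the "corresponding" unit of the first encoder network is taken to be the one at the same depth as the preceding unit of the second encoder network, so that the summed feature images have matching sizes. The names `add_maps` and `run_depth_encoder` are hypothetical.

```python
def add_maps(a, b):
    """Element-wise sum of two equally sized 2D feature maps."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def run_depth_encoder(first_feature, second_feature, color_outputs, depth_layers):
    """Run the N encoding units of the second (depth) encoder.

    first_feature:  output of the first (color) preprocessing network.
    second_feature: output of the second (depth) preprocessing network.
    color_outputs:  color_outputs[k] is the output of unit k+1 of the
                    first encoder network.
    depth_layers:   callables standing in for the N encoding units.
    Returns the list of outputs of all N units of the second encoder.
    """
    outputs = []
    # Unit 1 receives the sum of the two preprocessed feature images.
    x = add_maps(first_feature, second_feature)
    for k, layer in enumerate(depth_layers):
        if k > 0:
            # Units 2..N receive the previous depth output plus the
            # color-branch output of the same depth (assumed pairing).
            x = add_maps(outputs[k - 1], color_outputs[k - 1])
        outputs.append(layer(x))
    return outputs
```

Summation keeps the channel count unchanged, so color-branch information is injected at every level of the depth branch without widening the layers.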

Here, each residual block of the second encoder network and the second decoder network may perform one gating operation after performing its convolution.
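For illustration, one common formulation of a gated convolution multiplies the convolved feature by a sigmoid gate computed from the same input. The sketch below assumes that formulation, a single channel, zero padding, and no activation on the feature path; the specification does not fix these details, and the names `conv2d_same`, `gated_conv2d`, and `gate_bias` are hypothetical.

```python
import math


def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same size)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - ph, j + dj - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def gated_conv2d(image, feature_kernel, gate_kernel, gate_bias=0.0):
    """Convolution followed by one gating step.

    The gate is a per-pixel value in (0, 1); pixels the gate treats as
    invalid are suppressed, pixels it treats as valid pass through.
    """
    feature = conv2d_same(image, feature_kernel)
    gate = conv2d_same(image, gate_kernel)
    return [
        [f / (1.0 + math.exp(-(g + gate_bias))) for f, g in zip(f_row, g_row)]
        for f_row, g_row in zip(feature, gate)
    ]
```

Because the gate is learned per pixel, it can act as a soft validity mask, which is consistent with assigning valid depth pixels a higher weight than invalid ones.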

Here, the merging of the first depth image and the second depth image to acquire the final depth image may include: acquiring a first pixel weight map for the first depth image and a second pixel weight map for the second depth image using an attention network; and weighting the first depth image and the second depth image by the first pixel weight map and the second pixel weight map, respectively, and summing the results to acquire the final depth image.
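The per-pixel weighted merge may be sketched as follows. The attention network itself is omitted; the sketch assumes it emits one score map per depth image and that the two pixel weight maps are obtained by a softmax over the two scores at each pixel, so the weights sum to 1. The softmax normalization and the name `merge_depth` are assumptions for illustration.

```python
import math


def merge_depth(depth1, depth2, score1, score2):
    """Merge two depth images with per-pixel weights.

    score1 / score2 are hypothetical attention-network outputs (2D lists
    with the same size as the depth images). A two-way softmax turns them
    into the first and second pixel weight maps.
    """
    merged = []
    for d1_row, d2_row, s1_row, s2_row in zip(depth1, depth2, score1, score2):
        row = []
        for d1, d2, s1, s2 in zip(d1_row, d2_row, s1_row, s2_row):
            m = max(s1, s2)  # subtract the max for numerical stability
            e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
            w1 = e1 / (e1 + e2)
            w2 = 1.0 - w1
            row.append(w1 * d1 + w2 * d2)
        merged.append(row)
    return merged
```

With equal scores the merge reduces to a plain average; where one branch scores much higher, the final depth follows that branch almost exactly.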

Here, the method may further include, before the first deep neural network, the second deep neural network, and/or the attention network are used, training the first deep neural network, the second deep neural network, and/or the attention network using a loss function. The training may include generating the loss function in consideration of a first mean squared error loss between the first depth image and a ground-truth depth image, a second mean squared error loss between the second depth image and the ground-truth depth image, a third mean squared error loss between the final depth image and the ground-truth depth image, and a structural loss between the final depth image and the ground-truth depth image, where the structural loss may be 1 minus the structural similarity index.

Here, in the training of the first deep neural network, the second deep neural network, and/or the attention network, the loss function may be obtained as a weighted sum of the first mean squared error loss, the second mean squared error loss, the third mean squared error loss, and the structural loss.
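The weighted-sum loss described above may be sketched as follows. The sketch makes several assumptions that the specification does not fix: images are flattened to 1D lists of floats, the structural similarity index is computed over the whole image as a single window (practical implementations typically average over local windows), the stabilizing constants follow the customary 0.01 and 0.03 factors, and the four weights default to 1.0 as placeholders. The function names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two flattened images."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def global_ssim(x, y, data_range=1.0):
    """Structural similarity index over a single global window (assumed)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )


def total_loss(first, second, final, truth, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of three MSE losses and the structural loss (1 - SSIM)."""
    w1, w2, w3, ws = weights
    structural = 1.0 - global_ssim(final, truth)
    return (
        w1 * mse(first, truth)
        + w2 * mse(second, truth)
        + w3 * mse(final, truth)
        + ws * structural
    )
```

Supervising both branch outputs as well as the merged output gives each branch its own training signal, while the structural term penalizes loss of edges and detail in the final depth image only.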

Here, the acquiring of the original color image and the corresponding original depth image may include, when no original depth image exists, acquiring a depth image whose pixel values are all 0 as the corresponding original depth image.
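The fallback for a missing original depth image may be sketched as follows, assuming images are stored as 2D lists and that the zero depth image takes the height and width of the color image. The function name `get_original_depth` is hypothetical.

```python
def get_original_depth(color_image, depth_image=None):
    """Return the depth image to feed to the depth branch.

    If no original depth image exists, an all-zero depth image with the
    same height and width as the color image is used instead.
    """
    if depth_image is not None:
        return depth_image
    height = len(color_image)
    width = len(color_image[0]) if height else 0
    return [[0.0] * width for _ in range(height)]
```

In this case every depth pixel is invalid, and the final result relies on the depth estimated from the color image.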

An apparatus for complementing a depth image according to an embodiment of the present invention includes: an image acquisition module configured to acquire an original color image and a corresponding original depth image; a color branch module configured to acquire a first depth image from the original color image using a first deep neural network; a depth branch module configured to acquire a second depth image using a second deep neural network, based on the original depth image and the intermediate feature images generated by the respective intermediate layers of the first deep neural network; and an image merging module configured to merge the first depth image and the second depth image to acquire a final depth image.

Here, the first deep neural network may include a first encoder network and a first decoder network each having an N-layer residual structure, and the second deep neural network may include a second encoder network and a second decoder network each having an N-layer residual structure, where N is an integer greater than 1. The depth branch module may perform feature decoding using the second decoder network, based on the outputs of the first encoder network and the second encoder network, the intermediate feature images of the first decoder network, and the intermediate feature images of the second encoder network.

Here, the depth branch module may perform feature encoding using the second encoder network of the second deep neural network, based on the original depth image and the intermediate feature images of the first encoder network.

Here, the first deep neural network may further include a first preprocessing network preceding the first encoder network and a first depth prediction network following the first decoder network. The color branch module may convert the original color image, using the first preprocessing network, into a first feature image suitable for deep neural network processing, input the first feature image to the first encoder network, and synthesize the feature image output by the first decoder network into the first depth image using the first depth prediction network. The second deep neural network may further include a second preprocessing network preceding the second encoder network and a second depth prediction network following the second decoder network. The depth branch module may convert the original depth image, using the second preprocessing network, into a second feature image suitable for deep neural network processing, input the second feature image to the second encoder network, and fuse, using the second depth prediction network, the feature images output by the first decoder network and the second decoder network with the second feature image to acquire the second depth image.

Here, the image merging module may acquire a first pixel weight map for the first depth image and a second pixel weight map for the second depth image using an attention network, and may weight the first depth image and the second depth image by the first pixel weight map and the second pixel weight map, respectively, and sum the results to acquire the final depth image.

Here, the apparatus may further include a training module that, before the first deep neural network, the second deep neural network, and/or the attention network are used, trains the first deep neural network, the second deep neural network, and/or the attention network using a loss function. The training module may generate the loss function in consideration of a first mean squared error loss between the first depth image and a ground-truth depth image, a second mean squared error loss between the second depth image and the ground-truth depth image, a third mean squared error loss between the final depth image and the ground-truth depth image, and a structural loss between the final depth image and the ground-truth depth image, where the structural loss may be 1 minus the structural similarity index.

Here, the training module may obtain the loss function as a weighted sum of the first mean squared error loss, the second mean squared error loss, the third mean squared error loss, and the structural loss.

FIG. 1A is a diagram illustrating a model for complementing a depth image according to an embodiment.
FIG. 1B is a diagram illustrating a model for complementing a depth image according to another embodiment.
FIG. 2 is a diagram illustrating an SE block fusion method according to an embodiment.
FIG. 3 is a diagram illustrating an attention-mechanism-based fusion method according to an embodiment.
FIG. 4 is a diagram illustrating depth images of two modes.
FIG. 5 is a diagram illustrating a loss function according to an embodiment.
FIG. 6 is a flowchart illustrating a method of complementing a depth image according to an embodiment.
FIG. 7 is a diagram illustrating an apparatus for complementing a depth image according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, and the scope of the patent application is not restricted or limited by these embodiments. All modifications, equivalents, and substitutes of the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are used for description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in the present application.

In addition, in the description given with reference to the accompanying drawings, the same components are assigned the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted. In describing the embodiments, a detailed description of related known technology is omitted when it is determined that such a description would unnecessarily obscure the gist of the embodiments.

In addition, terms such as first, second, A, B, (a), and (b) may be used in describing the components of the embodiments. These terms serve only to distinguish one component from another, and they do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, the component may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

A component included in one embodiment and a component having a common function are described using the same name in other embodiments. Unless stated otherwise, a description given for one embodiment may also apply to other embodiments, and detailed descriptions are omitted to the extent that they overlap.

The present disclosure provides a deep-learning-based, color-image-centered method of complementing a depth image. Specifically, the depth image complementation model used in the method may include two branch networks: a color branch network and a depth branch network. The color branch network performs depth estimation on the original color image to obtain one complete depth image. The depth branch network performs inference on the original depth image and some intermediate-layer feature images of the color branch network to obtain another complete depth image. The two complete depth images are then fused to generate the final complemented depth image.

The method of the present disclosure learns, through the network, a mapping from the color image to the complete depth image and makes full use of color image information to assist depth complementation. As a result, the model can stably generate a high-quality, complete depth image even when the original depth image is very sparse (or even when no original depth image exists), and it performs well on both depth hole filling and sparse depth densification. In addition, so that the network can effectively distinguish valid pixels from invalid pixels and the generated depth image retains the original depth information, the depth branch network uses gated convolution to propagate mask information. The gating operation of gated convolution can effectively identify the positions of valid and invalid pixels, and valid pixels receive higher weights than invalid pixels. Furthermore, so that the final depth image is rich in detail and has high edge quality, training of the deep learning network model may be supplemented with supervision by a structural loss based on the structural similarity index measure (SSIM). Finally, the depth image complementation model of the present disclosure can be trained end to end, which avoids the use of intermediate features and thus the risk posed by low-quality intermediate features.

Hereinafter, a method and an apparatus for complementing a depth image according to an embodiment of the present disclosure are described in detail with reference to the accompanying FIGS. 1 to 7.

The method according to the present invention, configured as described above, is described below with reference to the drawings.

FIG. 1A is a diagram illustrating a model for complementing a depth image according to an embodiment.

FIG. 1B is a diagram illustrating a model for complementing a depth image according to another embodiment.

Referring to FIGS. 1A and 1B, the depth image complementation model 100 of the present disclosure may include a first deep neural network (i.e., a color branch network) 110, a second deep neural network (i.e., a depth branch network) 120, and a fusion module 130.

Specifically, the first deep neural network 110 performs depth estimation based on the original color image (e.g., an RGB image) to obtain a depth image. Accordingly, the input of the first deep neural network 110 may be the original color image, and its output may be a depth image. The second deep neural network 120 performs inference based on the original depth image and some intermediate-layer feature images of the first deep neural network 110 to obtain a depth image. Accordingly, the input of the second deep neural network 120 may be the original depth image, the inputs of its intermediate layers may be feature images output by intermediate layers of the first deep neural network 110, and the output of the second deep neural network 120 may be a depth image.

The fusion module 130 may fuse the depth image output by the first deep neural network 110 with the depth image output by the second deep neural network 120 to generate the final complemented depth image.

Here, the original color image and the original depth image may be obtained by simultaneously capturing the same scene from the same position with a paired and calibrated color camera and depth camera and then registering the two images; they may be obtained from a local memory or a local database as needed; or they may be received from an external data source (e.g., the Internet, a server, or a database) through an input device or a transmission medium. The original color image and the original depth image correspond to each other. For example, through image registration, the original color image and the original depth image collected by the sensors may be projected into the same coordinate system so that the pixels of the two images correspond one to one.

According to an embodiment of the present disclosure, the main structure of the first deep neural network 110 may be an encoder-decoder network (Encoder-Decoder Network) formed by stacking residual blocks. The residual structure effectively ensures that low-level features of the network are passed to higher layers, which allows the network to retain the texture information and structural information of the low-level features. For example, the first deep neural network 110 may include a first encoder network 112 and a first decoder network 113, each having an N-layer residual structure, where N is an integer greater than 1.

In addition, the first deep neural network 110 may further include a first preprocessing network 111 before the encoder network (e.g., the first encoder network 112) and a first depth prediction network 114 after the decoder network (e.g., the first decoder network 113). Feature images output by at least one of the first preprocessing network 111, the first encoder network 112, and the first decoder network 113 of the first deep neural network 110 may be retained and input in parallel to the corresponding layers of the second deep neural network 120 for feature fusion, as described in detail later.

Specifically, the first preprocessing network 111 may convert the input original color image into a first feature image suitable for deep neural network processing and input the first feature image to the first encoder network 112. For example, the first preprocessing network 111 may consist of at least one convolutional layer. The first preprocessing network 111 may only convolve the original color image, without changing its size.

The first encoder network 112 may perform feature encoding on the first feature image through cascaded coding units having an N-layer residual structure, where N is an integer greater than 1.

Each coding unit of the first encoder network 112 may include a plurality of cascaded residual blocks (Residual Block). Each residual block performs at least one convolution on its input feature image, and the last residual block performs at least one convolution and one downsampling on its input feature image. The present disclosure does not limit the value of N, the number of residual blocks, or the number of convolutions performed by each residual block. For example, the first encoder network 112 may include four coding units, each coding unit may include two residual blocks, and each residual block may include two convolutional layers, with the last residual block including two convolutional layers and one downsampling layer (e.g., with a downsampling factor of 1/2). The resolution of the feature image output by the first encoder network 112 may then be 1/16 of the resolution of the input feature image. Accordingly, the resolution of the input original color image may be an integer multiple of 16, for example, 304 × 224.
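The resolution arithmetic in this example can be checked with a short sketch. The function name `encoder_output_size` and its parameters are hypothetical; the sketch assumes only what the example states, namely four coding units that each halve the height and width.

```python
def encoder_output_size(height, width, num_units=4, factor=2):
    """Track the feature-map size through cascaded coding units.

    Each coding unit is assumed to end with one downsampling step that
    divides height and width by `factor` (2 in the example above).
    """
    for _ in range(num_units):
        if height % factor or width % factor:
            raise ValueError("size is not divisible at every downsampling step")
        height //= factor
        width //= factor
    return height, width


# A 304 x 224 (width x height) input is reduced by 2**4 = 16 per side.
print(encoder_output_size(224, 304))  # (14, 19)
```

An input whose sides are not multiples of 16 would fail at one of the four halving steps, which is why the example asks for a resolution that is an integer multiple of 16.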

In addition, each residual block may further include a normalization layer (e.g., a batch normalization layer) and an activation layer (e.g., a PReLU layer). The normalization layer normalizes the input feature image so that the output features have the same scale, and the activation layer applies a nonlinearity to the normalized feature image.

The first decoder network 113 may perform feature decoding on the feature image output by the first encoder network 112 through cascaded decoding units having an N-layer residual structure. In other words, the first decoder network 113 may adopt the same residual structure and restore the image to its original resolution through a corresponding number of deconvolution operations (implemented through upsampling and convolution).

Specifically, each decoding unit of the first decoder network 113 may include a plurality of cascaded residual blocks. Each residual block performs at least one convolution on its input feature image, and the first residual block performs one upsampling and at least one convolution on its input feature image. The present disclosure does not limit the value of N, the number of residual blocks, or the number of convolutions performed by each residual block. For example, the first decoder network 113 may include four corresponding decoding units, each decoding unit may include two residual blocks, and each residual block may include two convolutional layers, with the first residual block including one upsampling layer (e.g., with an upsampling factor of 2) and two convolutional layers, so that the resolution of the feature image output by the first decoder network 113 may be restored to the original resolution.

In addition, each residual block may further include a normalization layer (e.g., a batch normalization layer) and an activation layer (e.g., a PReLU layer). The normalization layer normalizes the input feature image so that the output features have the same scale, and the activation layer applies a nonlinearity to the normalized feature image.

The first depth prediction network 114 may combine the feature image output by the first decoder network 113 into a single depth image (e.g., a first depth image). After the convolution processing of the first preprocessing network 111, the first encoder network 112, and the first decoder network 113, the original color image may have been converted into a feature image with C channels; for example, C may be 32, 64, or 128. The first depth prediction network 114 therefore needs to combine this C-channel feature image into a single-channel depth image. For example, the first depth prediction network 114 may include two convolutional layers for this purpose: the first convolutional layer may reduce the number of feature channels to half, i.e., C/2, and the second convolutional layer may compress the C/2-channel feature image into a single-channel depth image. A normalization layer (e.g., a batch normalization layer) and an activation layer (e.g., a PReLU layer) may further be included between the first and second convolutional layers. The normalization layer normalizes the feature image output by the first convolutional layer so that the output features have the same scale, and the activation layer applies a nonlinearity to the normalized feature image and outputs the result to the second convolutional layer.
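The channel reduction C → C/2 → 1 described above can be sketched in NumPy as follows. This is an illustrative sketch only: the 1×1 kernel size, the random weights, the simplified per-channel normalization standing in for batch normalization, the PReLU slope of 0.25, and all function names are assumptions that the text does not specify.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) array; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def normalize(x, eps=1e-5):
    """Per-channel normalization (a simplified stand-in for batch normalization)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def prelu(x, slope=0.25):
    """PReLU activation with a fixed (normally learned) negative slope."""
    return np.where(x > 0, x, slope * x)


def depth_head(features, rng):
    """Combine a C-channel feature image into a single-channel depth image."""
    c = features.shape[0]
    w1 = rng.standard_normal((c // 2, c)) * 0.1   # first conv: C -> C/2
    w2 = rng.standard_normal((1, c // 2)) * 0.1   # second conv: C/2 -> 1
    hidden = conv1x1(features, w1, np.zeros(c // 2))
    hidden = prelu(normalize(hidden))
    return conv1x1(hidden, w2, np.zeros(1))


rng = np.random.default_rng(0)
depth = depth_head(rng.standard_normal((32, 8, 8)), rng)
print(depth.shape)  # (1, 8, 8)
```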

According to an exemplary embodiment of the present disclosure, the structure of the second deep neural network 120 is basically the same as that of the first deep neural network 110, and its main structure may likewise be an encoder-decoder network formed by stacking residual blocks. For example, the second deep neural network 120 may include a second encoder network 122 and a second decoder network 123, each having an N-layer residual structure. The second deep neural network 120 may further include a second preprocessing network 121 before the encoder-decoder network (e.g., the second encoder network 122 and the second decoder network 123) and a second depth prediction network 124 after the encoder-decoder network. Each of the second preprocessing network 121, the second encoder network 122, the second decoder network 123, and the second depth prediction network 124 performs the same function as the corresponding network in the first deep neural network 110.

The first deep neural network 110 and the second deep neural network 120 may differ as follows. First, in the decoding stage of the second decoder network 123, the input of each decoding unit may be formed by fusing, in the manner of an SE block (Squeeze-and-Excitation Block), the feature image output by the preceding layer, the feature image output by the corresponding layer of the first deep neural network 110, and the feature image output by the corresponding layer in the coding stage of the second encoder network 122.

Second, in the coding stage of the second encoder network 122, the input of each coding unit may be formed by directly adding the feature image output by the preceding layer and the feature image output by the corresponding layer of the first deep neural network 110.

Third, each residual block of the second encoder network 122 and the second decoder network 123 uses gated convolution; that is, a gate (Gate) operation is added after each convolutional layer.
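As a rough illustration of a convolution followed by a gate operation, the sketch below uses the sigmoid-gated formulation commonly used for gated convolution: one convolution produces features, a second produces a gate in the range 0 to 1, and the two are multiplied element by element. This formulation, the 1×1 kernels, and the names `gated_conv1x1`, `w_feature`, and `w_gate` are assumptions made for illustration; the passage itself states only that a gate operation follows each convolutional layer.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_conv1x1(x, w_feature, w_gate):
    """Apply a 1x1 convolution followed by a learned soft gate.

    x:          (C_in, H, W) input feature image
    w_feature:  (C_out, C_in) weights of the feature convolution
    w_gate:     (C_out, C_in) weights of the gating convolution
    Returns the gated output and the gate itself, both (C_out, H, W).
    """
    feature = np.einsum("oc,chw->ohw", w_feature, x)
    gate = sigmoid(np.einsum("oc,chw->ohw", w_gate, x))  # values in (0, 1)
    return feature * gate, gate


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))
out, gate = gated_conv1x1(x, rng.standard_normal((8, 4)), rng.standard_normal((8, 4)))
print(out.shape, gate.shape)
```

In this formulation a gate value near 1 passes a position through almost unchanged and a value near 0 suppresses it, which is one way a network could weight valid pixels more heavily than invalid ones.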

Specifically, the second preprocessing network 121 may convert the input original depth image into a second feature image suitable for deep neural network processing. For example, the second preprocessing network 121 may consist of at least one convolutional layer. The second preprocessing network 121 may only convolve the original depth image, without changing its size.

According to an embodiment of the present disclosure, the second encoder network 122 may perform feature encoding based on the original depth image alone, through cascaded coding units having an N-layer residual structure. For example, as shown in FIG. 1A, the second encoder network 122 may perform feature encoding on the second feature image output by the second preprocessing network 121 through the cascaded coding units having the N-layer residual structure.

According to an embodiment of the present disclosure, the second encoder network 122 may perform feature encoding, through cascaded coding units having an N-layer residual structure, based on the original depth image and the intermediate feature images output by the intermediate layers of the first encoder network 112. For example, as shown in FIG. 1B, the second encoder network 122 may perform feature encoding based on the second feature image, the first feature image, and the feature images output by the intermediate layers of the first encoder network 112. As described above, feature images output by at least one of the first preprocessing network 111, the first encoder network 112, and the first decoder network 113 of the first deep neural network 110 may be retained and input in parallel to the corresponding layers of the second deep neural network 120 for feature fusion. For example, FIG. 1B uses a dedicated symbol to denote direct addition. Thus, the input of the first-layer coding unit of the second encoder network 122 is the feature image obtained by directly adding the first feature image output by the first preprocessing network 111 and the second feature image output by the second preprocessing network 121. The input of each coding unit from the second layer to the N-th layer of the second encoder network 122 is the feature image obtained by directly adding the feature image output by the preceding coding unit and the feature image output by the corresponding coding unit of the first encoder network 112. For example, the input of the second-layer coding unit of the second encoder network 122 is the feature image obtained by directly adding the feature image output by the first-layer coding unit of the second encoder network 122 and the feature image output by the first-layer coding unit of the first encoder network 112, and so on.
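The direct-addition scheme of FIG. 1B can be sketched as a loop over the coding units of both branches. The sketch is illustrative only: `coding_unit` is a hypothetical stand-in (2×2 average pooling) for a real coding unit made of residual blocks, and `run_encoders` is a name introduced here, not one from the disclosure.

```python
import numpy as np


def coding_unit(x):
    """Hypothetical stand-in for a coding unit: 2x2 average pooling halves H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def run_encoders(first_feature, second_feature, num_units=4):
    """Run the color and depth encoders with direct-addition fusion.

    first_feature:  output of the first (color) preprocessing network
    second_feature: output of the second (depth) preprocessing network
    """
    color_x, depth_x = first_feature, second_feature
    color_outs, depth_outs = [], []
    for _ in range(num_units):
        fused = depth_x + color_x        # direct addition forms the unit's input
        depth_x = coding_unit(fused)     # depth-branch coding unit
        color_x = coding_unit(color_x)   # color-branch coding unit (no fusion)
        color_outs.append(color_x)
        depth_outs.append(depth_x)
    return color_outs, depth_outs


color_outs, depth_outs = run_encoders(np.ones((4, 32, 48)), np.ones((4, 32, 48)))
print(depth_outs[-1].shape)  # (4, 2, 3)
```

In the first iteration the two preprocessing outputs are added; in each later iteration the output of the preceding depth-branch unit is added to the output of the corresponding color-branch unit, matching the description above.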

Each coding unit of the second encoder network 122 may include a plurality of cascaded residual blocks. Each residual block performs at least one convolution on its input feature image, and the last residual block performs at least one convolution and one downsampling on its input feature image. The present disclosure does not limit the value of N, the number of residual blocks, or the number of convolutions performed by each residual block. For example, the second encoder network 122 may include four coding units, each coding unit may include two residual blocks, and each residual block may include two convolutional layers, with the last residual block including two convolutional layers and one downsampling layer (e.g., with a downsampling factor of 1/2). The resolution of the feature image output by the second encoder network 122 may then be 1/16 of the resolution of the input feature image. Accordingly, the resolution of the input original depth image (which corresponds pixel by pixel to the original color image) may be an integer multiple of 16, for example, 304 × 224.

In addition, each residual block of the second encoder network 122 may further include a normalization layer (e.g., a batch normalization layer) and an activation layer (e.g., a PReLU layer). The normalization layer normalizes the input feature image so that the output features have the same scale, and the activation layer applies a nonlinearity to the normalized feature image.

According to an embodiment of the present disclosure, the second decoder network 123 may perform feature decoding, through cascaded decoding units having an N-layer residual structure, based on the feature image output by the second encoder network 122, the feature image output by the first encoder network 112, the feature images output by the intermediate layers of the second encoder network 122, and the feature images output by the intermediate layers of the first decoder network 113. As described above, the feature images output by the first preprocessing network 111, the first encoder network 112, and the first decoder network 113 of the first deep neural network 110 may be retained and input in parallel to the corresponding layers of the second deep neural network 120 for feature fusion. For example, FIGS. 1A and 1B use one symbol to denote direct addition and another symbol to denote fusion in the manner of an SE block, which is described in detail below.

Thus, the input of the first-layer decoding unit of the second decoder network 123 is the feature image obtained by directly adding the feature image output by the second encoder network 122 and the feature image output by the first encoder network 112. The input of each decoding unit from the second layer to the N-th layer of the second decoder network 123 is the feature image obtained by fusing, using an SE block, the feature image output by the preceding decoding unit, the feature image output by the corresponding decoding unit of the first decoder network 113, and the feature image output by the corresponding coding unit of the second encoder network 122. For example, the input of the second-layer decoding unit of the second decoder network 123 is the feature image obtained by fusing, using an SE block, the feature image output by the first-layer decoding unit of the second decoder network 123, the feature image output by the first-layer decoding unit of the first decoder network 113, and the feature image output by the (N-1)-th layer coding unit of the second encoder network 122, and so on.
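The wiring described above can be written down as a small lookup. The general index N - k + 1 for the coding unit paired with decoding layer k is inferred here from the single worked example (layer 2 pairs with coding unit N-1) and the phrase "and so on"; the labels `enc1`, `enc2`, `dec1`, `dec2` and the function name are hypothetical.

```python
def decoder_inputs(layer, num_layers):
    """List the sources fused into decoding unit `layer` (1-based) of the depth branch.

    Labels: enc1/dec1 are units of the first (color) network, enc2/dec2 are
    units of the second (depth) network. Returns (fusion_mode, sources).
    """
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must be between 1 and num_layers")
    if layer == 1:
        # Outputs of both encoder networks, combined by direct addition.
        return ("add", [("enc2", num_layers), ("enc1", num_layers)])
    # Later layers: SE-block fusion of three same-resolution feature images.
    return ("se", [("dec2", layer - 1),
                   ("dec1", layer - 1),
                   ("enc2", num_layers - layer + 1)])


for k in range(1, 5):
    print(k, decoder_inputs(k, 4))
```

With N = 4 and an upsampling factor of 2 per decoding unit, the three sources listed for each layer share the same resolution, which is a quick consistency check on the inferred index.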

Each decoding unit of the second decoder network 123 may include a plurality of cascaded residual blocks. Each residual block performs at least one convolution on its input feature image, and the first residual block performs one upsampling and at least one convolution on its input feature image. The present disclosure does not limit the value of N, the number of residual blocks, or the number of convolutions performed by each residual block. In addition, each residual block performs one gate operation after each convolution, as described in detail below. For example, the second decoder network 123 may include four corresponding decoding units, each decoding unit may include two residual blocks, and each residual block may include two convolutional layers, with the first residual block including one upsampling layer (e.g., with an upsampling factor of 2) and two convolutional layers, so that the resolution of the feature image output by the second decoder network 123 may be restored to the original resolution.

The second depth prediction network 124 may combine, into a single depth image (e.g., a second depth image), the feature image formed by fusing the feature image output by the second decoder network 123, the feature image output by the first decoder network 113, and the second feature image output by the second preprocessing network 121. For example, FIGS. 1A and 1B use a dedicated symbol to denote fusion in the manner of an SE block, which is described in detail below. Thus, the input of the second depth prediction network 124 is the feature image obtained by fusing, using an SE block, the feature image output by the second decoder network 123, the feature image output by the first decoder network 113, and the second feature image output by the second preprocessing network 121. After passing through the second preprocessing network 121, the second encoder network 122, and the second decoder network 123, the original depth image may have been converted into a feature image with C channels; for example, C may be 32, 64, or 128. The second depth prediction network 124 therefore needs to combine this C-channel feature image into a single-channel depth image. For example, the second depth prediction network 124 may include two convolutional layers for this purpose: the first convolutional layer may reduce the number of feature channels to half, i.e., C/2, and the second convolutional layer may compress the C/2-channel feature image into a single-channel depth image. A normalization layer (e.g., a batch normalization layer) and an activation layer (e.g., a PReLU layer) may further be included between the first and second convolutional layers. The normalization layer normalizes the feature image output by the first convolutional layer so that the output features have the same scale, and the activation layer applies a nonlinearity to the normalized feature image and outputs the result to the second convolutional layer.

이하, 제2 심층 신경망(120)에서 사용되는 SE 블록 및 게이트 컨볼루션에 대해 자세히 설명한다.Hereinafter, the SE block and gate convolution used in the second deep neural network 120 will be described in detail.

SE 블록(Squeeze-and-Excitation Block)SE Block (Squeeze-and-Excitation Block)

SE 블록의 핵심 아이디어는 예를 들어 C2 개의 채널 특징을 C1개의 채널 특징(C2는 C1의 정수 배수일 수 있음)으로 압축해야 할 때, 네트워크를 통해 각 채널의 특징 가중치를 자동으로 학습하여, 유효 특징의 가중치를 확대하고, 유효하지 않거나 비효율적인 특징의 가중치를 줄여, 네트워크가 다른 특징을 선택적으로 사용할 수 있도록 한다. 제2 심층 신경망(120)에서, SE 블록은 서로 다른 특징의 가중치를 학습하고 학습된 가중치로 특징을 융합하는 데 사용된다.The core idea of the SE block is that, for example, when C2 channel features need to be compressed into C1 channel features (C2 can be an integer multiple of C1), the network automatically learns the feature weights of each channel, which is effective By extending the weights of features and reducing the weights of invalid or inefficient features, the network can selectively use other features. In the second deep neural network 120, the SE block is used to learn the weights of different features and fuse the features with the learned weights.

FIG. 2 is a diagram illustrating an SE block fusion method according to an embodiment.

Referring to FIG. 2, for the second-layer decoding unit of the second decoder network 123, the C-channel feature image output by the first-layer decoding unit of the second decoder network 123 (depth feature 211), the C-channel feature image output by the first-layer decoding unit of the first decoder network 113 (color feature 212), and the C-channel feature image output by the (N-1)-th layer encoding unit of the second encoder network 122 (encoder feature 213) are spliced (concatenated) to obtain a single 3C-channel feature vector (spliced feature 220). The SE block 230 then generates, from the 3C-channel feature vector, a weight map 240 containing 3C weights.

The obtained weight map 240 is then divided, in the original order, into three weight vectors of C channels each (depth feature weight 251, color feature weight 252, and encoder feature weight 253). Each weight lies in the range 0 to 1.

Next, the three original C-channel feature images are weighted by channel-wise multiplication 260, producing three weighted C-channel feature images (weighted depth feature 271, weighted color feature 272, and weighted encoder feature 273).

Finally, a single C-channel feature image (fused feature image 290) is generated by channel-wise addition 280 and is provided as the input to the second-layer decoding unit of the second decoder network 123.

For the other modules of the second deep neural network 120 that use SE block fusion (for example, the other decoding units of the second decoder network 123 and the second depth prediction network 124), the input is generated in the same manner as described above.
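The fusion steps of FIG. 2 (splice, weight generation, split, channel-wise multiplication, channel-wise addition) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the text does not specify the internals of the SE block 230, so the sketch assumes the commonly used form (global average pooling, two fully connected layers, a sigmoid), and the names `se_fuse`, `w1`, and `w2` are hypothetical.

```python
import math


def global_avg(channel):
    """Squeeze step: mean of one H x W channel."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


def dense(weights, vec):
    """Fully connected layer without bias; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def se_fuse(depth_feat, color_feat, enc_feat, w1, w2):
    """Fuse three C-channel feature images into one C-channel image.

    Each feature image is a list of C channels, each channel an H x W
    nested list. w1 and w2 stand in for learned parameters (assumed).
    """
    c = len(depth_feat)
    spliced = depth_feat + color_feat + enc_feat           # 3C channels
    squeezed = [global_avg(ch) for ch in spliced]          # 3C scalars
    hidden = [max(0.0, h) for h in dense(w1, squeezed)]    # ReLU
    # Sigmoid keeps every weight between 0 and 1.
    weights = [1.0 / (1.0 + math.exp(-z)) for z in dense(w2, hidden)]
    fused = []
    for k in range(c):
        # Channel k of each source (depth, color, encoder) with its weight.
        parts = [(spliced[k + s * c], weights[k + s * c]) for s in range(3)]
        h, w = len(parts[0][0]), len(parts[0][0][0])
        # Channel-wise multiplication followed by channel-wise addition.
        fused.append([[sum(ch[i][j] * wt for ch, wt in parts)
                       for j in range(w)] for i in range(h)])
    return fused, weights
```

With all-zero stand-in parameters every weight is 0.5, so the fused channel is simply half the sum of the three inputs; trained parameters would instead emphasize whichever source is more informative.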

Gated Convolution (gate convolution)

Gated convolution addresses the problem that ordinary convolution treats every input as a valid pixel. That is, ordinary convolution cannot distinguish invalid pixels in an image, whereas gated convolution adds a gating operation on top of ordinary convolution: a module with learnable parameters generates corresponding weights, and these weights are then used to suppress the original output.

In the original image restoration (inpainting) task, a 0/1 mask is used to mark invalid and valid pixels. However, because the convolution process resembles local filtering, information from neighboring pixels is used, and a 0/1 mask alone cannot reflect the confidence of a pixel.

For example, the original image is as follows.

The corresponding mask is as follows.

After a 3×3 convolution kernel whose weights are all 1 is applied, the following pixel is obtained.

The corresponding mask changes as follows.

That is, the network regards every output value as valid and ignores the zeros contained in the original image, so the output is still 10 even after the mask weighting is applied. After gated convolution is added, however, the gating operation can generate a corresponding weight of 0.6, and the weighted value 6 is output. The network thus recognizes that not all of the information in the original input image is valid, and because the weighted output becomes 6, the output at that position is suppressed. The more zeros the original image contains, the smaller this value becomes; if the original input is all zeros, the mask also becomes 0 and the confidence of the output becomes 0 as well. Through this mechanism, the output of the network is weighted.
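The gating mechanism can be sketched for a single output position as follows. This is an illustration under stated assumptions: the gate is modeled as a sigmoid of a second convolution, which is a common formulation but is not spelled out in the text, and the function names, the patch values, and the gate parameters are all hypothetical.

```python
import math


def conv_at(patch, kernel, bias=0.0):
    """Ordinary convolution response for one 3 x 3 patch."""
    return sum(p * k for prow, krow in zip(patch, kernel)
               for p, k in zip(prow, krow)) + bias


def gated_conv_at(patch, feat_kernel, gate_kernel, gate_bias=0.0):
    """Gated convolution: the feature response scaled by a learned gate."""
    feature = conv_at(patch, feat_kernel)
    # The sigmoid maps the gate response to a weight between 0 and 1.
    gate = 1.0 / (1.0 + math.exp(-conv_at(patch, gate_kernel, gate_bias)))
    return feature * gate, gate


# Hypothetical patch whose entries sum to 10; zeros mark missing depth.
patch = [[2.0, 0.0, 1.0],
         [0.0, 3.0, 0.0],
         [1.0, 0.0, 3.0]]
ones = [[1.0] * 3 for _ in range(3)]
zeros = [[0.0] * 3 for _ in range(3)]
# Gate parameters chosen by hand so that the gate is about 0.6.
out, gate = gated_conv_at(patch, ones, zeros, gate_bias=math.log(1.5))
```

Here the ungated response is 10 and the gated response is about 6, mirroring the numbers in the example above; in a trained network the gate parameters would be learned rather than set by hand.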

Because the original depth image has missing values, adding a gating operation after the convolution operations in the second deep neural network 120 makes it possible to identify the positions of valid and invalid pixels effectively and to give valid pixels higher weights than invalid pixels. This supervises the output feature image spatially and improves the network's ability to handle images with missing regions.

Referring back to FIG. 1, the fusion module 130 may merge the first depth image output by the first deep neural network 110 with the second depth image output by the second deep neural network 120 to obtain the final, complemented depth image (that is, the final depth image).

According to an embodiment of the present disclosure, the fusion module 130 may be implemented as an attention module. The fusion module 130 may of course be implemented in any other feasible manner, and the present disclosure places no limitation on how the fusion module 130 is implemented. A method of implementing the fusion module 130 with an attention module is described in detail below.

The attention module uses a learnable network module to generate two weight maps for the two input depth images, multiplies each weight map with its original depth image, and adds the weighted depth images to obtain the final depth image. The attention module operates on spatial positions and outputs a weight for every pixel of the depth image, so each output weight map has the same resolution as the depth image. For example, if the size of the depth image is H×W, the size of the weight map is also H×W.

FIG. 3 is a diagram illustrating an attention-based fusion method according to an embodiment.

Referring to FIG. 3, a first depth image 311 and a second depth image 312 (for example, D1 and D2) are input and spliced (concatenated), and the spliced depth image 320 is input to the attention module 330. The attention module 330 generates a weight for each pixel of the spliced depth image, producing a corresponding weight map 340, which is then divided, in the original order, into two weight maps 351 and 352 (for example, W1 and W2) corresponding to the first depth image and the second depth image, respectively. The two weight maps 351 and 352 are multiplied pixel by pixel (360) with the first depth image 311 and the second depth image 312, respectively, to obtain a weighted first depth image 371 and a weighted second depth image 372. The weighted first depth image and the weighted second depth image are then added pixel by pixel (380) to obtain the final depth image 390 (for example, D). This process can be expressed as Equation 1 below.

[Equation 1]

$$D = D_1 \odot W_1 + D_2 \odot W_2$$

Here, $D_1$ denotes the first depth image output by the first deep neural network 110, $D_2$ denotes the second depth image output by the second deep neural network 120, $W_1$ and $W_2$ denote the weights corresponding to the respective depth images, and $\odot$ denotes pixel-wise multiplication.
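The pixel-wise weighted fusion of the two depth images can be sketched as follows. The text does not state how the attention module normalizes its weights, so this sketch assumes a per-pixel softmax over two score maps (so that the two weights at each pixel sum to 1); the score maps `s1` and `s2` stand in for the output of the learnable attention module and are hypothetical.

```python
import math


def fuse_depths(d1, d2, s1, s2):
    """Fuse two H x W depth images using per-pixel attention scores.

    d1, d2: depth images (nested lists); s1, s2: raw attention scores of
    the same size. Returns the fused depth image W1*D1 + W2*D2.
    """
    fused = []
    for r1, r2, q1, q2 in zip(d1, d2, s1, s2):
        row = []
        for a, b, x, y in zip(r1, r2, q1, q2):
            m = max(x, y)                      # subtract max for stability
            e1, e2 = math.exp(x - m), math.exp(y - m)
            w1 = e1 / (e1 + e2)                # assumed softmax weight
            w2 = 1.0 - w1
            row.append(w1 * a + w2 * b)        # pixel-wise weighted sum
        fused.append(row)
    return fused
```

When the two scores at a pixel are equal, the result is the average of the two depth values; when one score dominates, the fused value follows the corresponding branch.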

Referring back to FIG. 1, the depth image complementation model 100 has the following advantages. (1) A depth estimation method based on the color image is introduced into the depth complementation task. Through supervised training, the color branch can learn the mapping from a color image to a depth image; because the color image has high resolution and rich texture information, the depth image obtained by depth estimation has rich detail, sharp edges, and good visual quality. (2) The depth image complementation model 100 can be trained end to end without relying on intermediate representations or hand-designed features, so the network avoids the risk that poor quality of such other features degrades training, and training speed can be improved. (3) The depth estimation network (that is, the first deep neural network 110) is independent of the depth prediction network (the second deep neural network 120), so the network can stably output a corresponding depth image even when the original depth image is very sparse or missing altogether. With this design, the depth image complementation model 100 performs well on both depth hole filling and sparse depth densification (as shown in FIG. 4).

FIG. 4 is a diagram illustrating depth images of two modes.

As shown in FIG. 4, reference numeral 410 denotes a depth image with contiguous missing values; outside the holes, the depth values are continuous and dense. Reference numeral 420 denotes a sparse depth image, in which white points mark positions that have depth values and black regions mark positions that have none. The brightness of a white point indicates distance: the brighter the point, the farther the distance, and the darker the point, the nearer the distance. For the hole-filling task, the depth image complementation model 100 may be trained using depth images with missing regions, such as 410 of FIG. 4, as training samples. For the sparse depth densification task, the depth image complementation model 100 may be trained using sparse depth images, such as 420 of FIG. 4, as training samples.

Hereinafter, a method of training the depth image complementation model 100 according to an embodiment of the present disclosure is described in detail.

First, training samples need to be prepared. A training sample includes an original color image and an original depth image. Here, the original color image and the original depth image correspond to each other: through image registration, the original color image and the original depth image collected by the sensors may be projected into the same coordinate system so that the pixels of the two images correspond one to one.

According to an embodiment of the present disclosure, when training samples are scarce, the data may be expanded by data augmentation operations such as random horizontal flip, random vertical flip, and color jitter. These operations allow the network to learn the correspondence across more scenes and under different conditions, which improves the robustness of the model.
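A minimal sketch of the flip augmentations follows. It assumes, although the text does not say so explicitly, that each geometric flip is applied to the color image and the depth image together so that the registered pixels stay aligned; color jitter would be applied to the color image only and is omitted here. The function names and the probability parameter `p` are hypothetical.

```python
import random


def hflip(image):
    """Mirror an H x W image left to right."""
    return [row[::-1] for row in image]


def vflip(image):
    """Mirror an H x W image top to bottom."""
    return [row[:] for row in reversed(image)]


def augment_pair(color, depth, rng, p=0.5):
    """Randomly flip a registered color/depth pair with probability p each.

    The same flip is applied to both images so that pixel (i, j) of the
    color image still corresponds to pixel (i, j) of the depth image.
    """
    if rng.random() < p:
        color, depth = hflip(color), hflip(depth)
    if rng.random() < p:
        color, depth = vflip(color), vflip(depth)
    return color, depth
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible across runs.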

According to an embodiment of the present disclosure, for the hole-filling task, the training samples may include a plurality of pairs of an original color image and an original depth image. Here, the original depth image may be a depth image with missing regions (holes). For the sparse depth densification task, the training samples may include a plurality of pairs of an original color image and a sparse depth image. Here, the sparse depth image may be obtained from a database that contains sparse depth images, or by sparsely sampling a ground-truth depth image or a dense depth image. For example, when the original depth image is a depth image with holes, the original depth image may be filled to obtain a ground-truth depth image, and the ground-truth depth image may then be sparsely sampled to obtain a sparse depth image. In addition, the depth image complementation model 100 may be trained with training samples that include both the plurality of pairs of an original color image and an original depth image and the plurality of pairs of an original color image and a sparse depth image, so that the hole-filling task and the sparse depth densification task are served at the same time.
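Sparse sampling of a dense or ground-truth depth image can be sketched as follows. The sampling pattern is not specified in the text; this sketch assumes independent uniform random sampling per pixel and uses 0 to mark a missing depth value. The function name and the `keep_ratio` parameter are hypothetical.

```python
import random


def sparse_sample(dense_depth, keep_ratio, rng):
    """Keep each depth value with probability keep_ratio, else set it to 0.

    dense_depth: H x W nested list of depth values.
    Returns a new sparse depth image; 0.0 marks a missing value (assumed).
    """
    return [[v if rng.random() < keep_ratio else 0.0 for v in row]
            for row in dense_depth]
```

A small `keep_ratio` produces a sparse input of the kind shown at 420 of FIG. 4, while the untouched dense image can serve as the training target.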

Next, a loss function needs to be constructed. Model training continuously updates the parameters of the network through backpropagation, using the Adam optimizer under the supervision of the loss function, so that the network fits the input data better and the difference between the predicted depth image and the ground-truth depth image is reduced.
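For reference, one Adam update for a single scalar parameter can be sketched as follows. The hyperparameters shown are the commonly used defaults, not values stated in this disclosure, and a practical implementation would apply the same rule to every network parameter using gradients obtained by backpropagation.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update.

    m, v: running first and second moment estimates; t: step count from 1.
    Returns the updated (param, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)     # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)     # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

On the first step the bias correction makes the update size close to the learning rate regardless of the gradient magnitude.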

According to an embodiment of the present disclosure, in addition to using the mean squared error (MSE) between the pixel values of the predicted depth image and the ground-truth depth image as a loss function, a structural loss based on the structural similarity index measure (SSIM) between the predicted depth image and the ground-truth depth image is introduced as a loss function. This improves the quality of the final depth image, removes noise and checkerboard artifacts produced by the network, enriches the detail of the final depth image, and improves edge quality.

FIG. 5 is a diagram illustrating a loss function according to an embodiment.

Referring to FIG. 5, an MSE loss (MSE1) is used to supervise the depth prediction part of the color branch so that the color branch learns the mapping between the color image and the depth image. Likewise, an MSE loss (MSE2) is used on the depth prediction part of the depth branch so that it learns the relationship between the original depth image and the complemented depth image. For the final depth fusion part, MSE and SSIM are used as loss functions (MSE3 and SSIM) to supervise the final depth image.

The MSE loss function may be expressed as Equation 2 below.

[Equation 2]

$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( D_i - D_i^{*} \right)^2$$

Here, $N$ is the number of valid pixels in the image, $D$ is the predicted depth value, and $D^{*}$ is the ground-truth depth value.
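The MSE over valid pixels can be sketched as follows. The text does not define how a valid pixel is identified; this sketch assumes that a ground-truth value of 0 marks an invalid pixel, and the function name is hypothetical.

```python
def masked_mse(pred, target, invalid=0.0):
    """Mean squared error over pixels whose ground-truth value is valid.

    pred, target: H x W nested lists. Pixels where target == invalid are
    skipped (assumed convention). Returns 0.0 if there is no valid pixel.
    """
    total, count = 0.0, 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            if t != invalid:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0
```

Dividing by the number of valid pixels rather than by the image size keeps the loss comparable between dense and sparse ground-truth images.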

The SSIM loss function may be expressed as Equation 3 below.

[Equation 3]

$$L_{SSIM} = 1 - \mathrm{SSIM}(x, y)$$

Here, SSIM is the structural similarity index, and $x$ and $y$ denote the predicted depth image and the ground-truth depth image, respectively. SSIM may be expressed as Equation 4 below.

[Equation 4]

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

Here, $\mu_x$ is the mean of the pixel values of image $x$, $\mu_y$ is the mean of the pixel values of image $y$, $\sigma_x^2$ is the variance of the pixel values of image $x$, $\sigma_y^2$ is the variance of the pixel values of image $y$, $\sigma_{xy}$ is the covariance of the pixel values of image $x$ and image $y$, and $c_1$ and $c_2$ are constants. The structural similarity ranges from 0 to 1.
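The structural similarity index and the structural loss (1 minus SSIM) can be sketched as follows. This sketch computes the means, variances, and covariance over the whole image, following the definitions above, whereas practical SSIM implementations usually compute them over local windows. The default constants correspond to the commonly used values for images scaled to [0, 1] and are an assumption, as are the function names.

```python
def ssim(x, y, c1=1e-4, c2=9e-4):
    """Global SSIM between two H x W images given as nested lists."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(pred, target):
    """Structural loss: 1 minus the structural similarity index."""
    return 1.0 - ssim(pred, target)
```

Identical images give an SSIM of 1 and therefore a structural loss of 0; the loss grows as the predicted structure departs from the ground truth.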

The loss function of an embodiment of the present disclosure may be expressed as Equation 5 below.

[Equation 5]

$$L = \lambda \cdot l = \lambda_1 l_{out} + \lambda_2 l_{ssim} + \lambda_3 l_D + \lambda_4 l_C$$

Here, $\lambda = [\lambda_1, \lambda_2, \lambda_3, \lambda_4]$ is the vector of loss weight coefficients, which in effect expresses the penalty strength of the different loss functions; for example, but not limited to, $\lambda = [\,\ldots\,]$. $l = [l_{out}, l_{ssim}, l_D, l_C]$ is the loss vector composed of the four kinds of loss (e.g., MSE3, SSIM, MSE2, and MSE1). $l_{out}$ is the mean squared error loss of the final depth image, $l_{ssim}$ is the structural loss of the final depth image, and $l_D$ and $l_C$ are the mean squared error losses of the depth prediction branch and the depth estimation branch, respectively.
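The weighted combination of the four losses can be sketched as follows. The weight values used in the usage lines are placeholders chosen only for illustration; they are not the coefficients of this disclosure, and the loss values are likewise made up.

```python
def total_loss(weights, losses):
    """Weighted sum of losses: the dot product of the two vectors.

    weights: loss weight coefficients; losses: the individual losses in
    the same order (final MSE, structural loss, depth-branch MSE,
    color-branch MSE).
    """
    if len(weights) != len(losses):
        raise ValueError("weights and losses must have the same length")
    return sum(w * l for w, l in zip(weights, losses))


# Placeholder numbers for illustration only.
example_weights = (1.0, 1.0, 0.5, 0.5)
example_losses = (0.2, 0.1, 0.4, 0.6)
example_total = total_loss(example_weights, example_losses)
```

Giving the two branch losses smaller weights than the final-output losses, as in this placeholder, is one possible design choice; the relative weights control how strongly each part of the model is penalized.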

FIG. 6 is a flowchart illustrating a method of complementing a depth image according to an embodiment.

Referring to FIG. 6, in step 601, the depth image complementing apparatus acquires an original color image and a corresponding original depth image. Here, the original color image and the original depth image may be obtained by simultaneously capturing the same scene from the same position with a paired and calibrated color camera and depth camera and then registering the two images; or they may be obtained from a local memory or a local database as needed; or they may be received from an external data source (e.g., the Internet, a server, or a database) through an input device or a transmission medium. The original color image and the original depth image correspond to each other; for example, through image registration, the original color image and the original depth image collected by the sensors may be projected into the same coordinate system so that the pixels of the two images correspond one to one.

Meanwhile, when no corresponding original depth image exists, the depth image complementing apparatus may use a depth image whose pixel values are all 0 as the corresponding original depth image.
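The fallback for a missing depth input can be sketched as follows; the function name is hypothetical, and the images are represented as nested lists for simplicity.

```python
def depth_or_zeros(color, depth=None):
    """Return the given depth image, or an all-zero one matching `color`.

    color: H x W nested list of pixels; depth: H x W nested list or None.
    """
    if depth is not None:
        return depth
    # No depth image is available: use zeros with the same height and width.
    return [[0.0 for _ in row] for row in color]
```

With an all-zero depth input, the final output relies entirely on the depth estimated from the color image.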

In step 602, the depth image complementing apparatus obtains a first depth image from the original color image using a first deep neural network. Here, the first deep neural network may be implemented by the first deep neural network 110 of the depth image complementation model 100.

In step 603, the depth image complementing apparatus obtains a second depth image using a second deep neural network, based on the original depth image and the intermediate feature images generated by the intermediate layers of the first deep neural network. Here, the second deep neural network may be implemented by the second deep neural network 120 of the depth image complementation model 100.

For example, obtaining the second depth image may include performing feature decoding with the second decoder network 123 based on the outputs of the first encoder network 112 and the second encoder network 122, the intermediate feature images of the first decoder network 113, and the intermediate feature images of the second encoder network 122. Here, the input of the first-layer decoding unit of the second decoder network 123 may be the sum of the feature image output by the second encoder network 122 and the feature image output by the first encoder network 112. The input of each decoding unit from the second layer to the N-th layer of the second decoder network 123 is the feature image obtained by using the SE block to fuse the feature image output by the preceding decoding unit, the feature image output by the corresponding decoding unit of the first decoder network 113, and the feature image output by the corresponding encoding unit of the second encoder network 122.

As another example, obtaining the second depth image may include performing feature encoding with the second encoder network 122 of the second deep neural network 120 based on the original depth image and the intermediate feature images of the first encoder network 112. Here, the input of the first-layer encoding unit of the second encoder network 122 is the sum of the first feature image output by the first preprocessing network 111 and the second feature image output by the second preprocessing network 121. The input of each encoding unit from the second layer to the N-th layer of the second encoder network 122 is the sum of the feature image output by the preceding encoding unit and the feature image output by the corresponding encoding unit of the first encoder network 112.
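The data flow through the second encoder network can be sketched as follows, with each encoding unit stood in for by an arbitrary callable and each feature image by a flat list of numbers. This is a simplification: real encoding units change the resolution and channel count, and the pairing of "corresponding" color-branch features with depth-branch layers (here, the color feature at index i is added to the output of depth layer i) is an assumption of this sketch. All names are hypothetical.

```python
def add(a, b):
    """Element-wise sum of two feature images of equal size."""
    return [x + y for x, y in zip(a, b)]


def run_depth_encoder(layers, depth_pre, color_pre, color_feats):
    """Run the depth-branch encoder with features injected from the color branch.

    layers: N callables standing in for the encoding units.
    depth_pre, color_pre: outputs of the two preprocessing networks.
    color_feats: N-1 color-branch encoder outputs; color_feats[i] is added
    to the output of depth layer i to form the input of layer i + 1.
    Returns the list of outputs of all N layers.
    """
    outputs = []
    x = add(depth_pre, color_pre)            # input of the first layer
    for i, layer in enumerate(layers):
        outputs.append(layer(x))
        if i < len(color_feats):
            x = add(outputs[-1], color_feats[i])   # input of the next layer
    return outputs
```

Keeping every layer's output makes the intermediate features available to the decoder, which fuses them with the SE block as described for the second decoder network 123.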

Meanwhile, obtaining the second depth image may include: converting the original depth image, using the second preprocessing network 121, into a second feature image suitable for deep neural network processing and inputting the second feature image to the second encoder network 122; and synthesizing, using the second depth prediction network 124, the feature image obtained by fusing the feature images output by the first decoder network 113 and the second decoder network 123 with the second feature image into the second depth image. Here, the input of the second depth prediction network 124 is the feature image obtained by using the SE block to fuse the feature image output by the second decoder network 123, the feature image output by the first decoder network 113, and the second feature image output by the second preprocessing network 121.

In step 604, the depth image complementing apparatus obtains the final depth image by merging the first depth image and the second depth image. Here, the merging of the first depth image and the second depth image may be performed by the fusion module 130 of the depth image complementation model 100.

The fusion module 130 may be implemented by an attention network. In this case, the attention network may be used to obtain a first pixel weight map for the first depth image and a second pixel weight map for the second depth image, and the final depth image may be obtained as the weighted sum of the first depth image and the second depth image based on the first pixel weight map and the second pixel weight map.

The depth image complementing method may further include, before the first deep neural network, the second deep neural network, and/or the attention network are used, training the first deep neural network, the second deep neural network, and/or the attention network using a loss function. The training may be performed by the method of training the depth image complementation model 100 described above.

In the method of training the depth image complementation model 100, the loss function may be generated by taking into account a first mean squared error loss (MSE1) between the first depth image and the ground-truth depth image, a second mean squared error loss (MSE2) between the second depth image and the ground-truth depth image, a third mean squared error loss (MSE3) between the final depth image and the ground-truth depth image, and a structural loss (SSIM) between the final depth image and the ground-truth depth image. Here, the structural loss is 1 minus the structural similarity index. For example, the loss function may be obtained as a weighted sum of the first mean squared error loss, the second mean squared error loss, the third mean squared error loss, and the structural loss.

FIG. 7 is a diagram illustrating an apparatus for complementing a depth image according to an embodiment.

Referring to FIG. 7, the depth image complementing apparatus 700 may include an image acquisition module 701, a color branch module 702, a depth branch module 703, and an image merging module 704.

The image acquisition module 701 may acquire an original color image and a corresponding original depth image. Here, the original color image and the original depth image may be obtained by simultaneously capturing the same scene from the same position with a paired and calibrated color camera and depth camera and then registering the two images; or they may be obtained from a local memory or a local database as needed; or they may be received from an external data source (e.g., the Internet, a server, or a database) through an input device or a transmission medium. The original color image and the original depth image correspond to each other; for example, through image registration, the original color image and the original depth image collected by the sensors may be projected into the same coordinate system so that the pixels of the two images correspond one to one.

한편, 대응하는 원본 깊이 이미지가 존재하지 않는 경우, 이미지 획득 모듈(701)은 픽셀 값이 0 인 깊이 이미지를 대응하는 원본 깊이 이미지로 획득할 수 있다.Meanwhile, when a corresponding original depth image does not exist, the image acquisition module 701 may acquire a depth image having a pixel value of 0 as a corresponding original depth image.
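A minimal sketch of this fallback is shown below. The function name and the array layout (height, width, channels) are assumptions made for illustration.

```python
import numpy as np

def get_original_depth(color_image, depth_image=None):
    """Return the depth input; an all-zero image if no depth image exists."""
    if depth_image is None:
        height, width = color_image.shape[:2]
        # Every pixel is 0, i.e. no depth measurement is available anywhere.
        return np.zeros((height, width), dtype=np.float32)
    return depth_image.astype(np.float32)
```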

The color branch module 702 may obtain a first depth image using a first deep neural network, based on the original color image. Here, the first deep neural network may be implemented by the first deep neural network 110 of the depth image complementation model 100.

The depth branch module 703 may obtain a second depth image using a second deep neural network, based on the original depth image and the intermediate feature images generated by each intermediate layer of the first deep neural network. Here, the second deep neural network may be implemented by the second deep neural network 120 of the depth image complementation model 100.

The depth branch module 703 may be configured to perform feature decoding using the second decoder network 123, based on the outputs of the first encoder network 112 and the second encoder network 122, the intermediate feature images of the first decoder network 113, and the intermediate feature images of the second encoder network 122. Here, the input of the first-layer decoding unit of the second decoder network 123 may be the sum of the feature image output by the second encoder network 122 and the feature image output by the first encoder network 112. The input of each decoding unit from the second layer to the N-th layer of the second decoder network 123 is a feature image obtained by fusing, by means of an SE block, the feature image output by the preceding-layer decoding unit, the feature image output by the corresponding-layer decoding unit of the first decoder network 113, and the feature image output by the corresponding-layer encoding unit of the second encoder network 122.

The depth branch module 703 may be configured to perform feature encoding using the second encoder network 122 of the second deep neural network 120, based on the original depth image and the intermediate feature images of the first encoder network 112. Here, the input of the first-layer coding unit of the second encoder network 122 is the sum of the first feature image output by the first preprocessing network 111 and the second feature image output by the second preprocessing network 121. The input of each coding unit from the second layer to the N-th layer of the second encoder network 122 is the sum of the feature image output by the preceding-layer coding unit and the feature image output by the corresponding-layer coding unit of the first encoder network 112.
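The encoder wiring described above can be sketched as follows. This is a toy illustration: `coding_unit` is a hypothetical stand-in (2x2 average pooling) for a residual coding unit, and the reading that the "corresponding layer" of the color branch is the one whose output has the same shape as the preceding depth-branch output is an assumption.

```python
import numpy as np

def coding_unit(x):
    """Hypothetical stand-in for one coding unit: 2x2 average pooling on (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def encode_depth_branch(first_feature, second_feature, color_encoder_outputs):
    """Depth-branch encoder with element-wise sums from the color branch.

    first_feature:         output of the first (color) preprocessing network
    second_feature:        output of the second (depth) preprocessing network
    color_encoder_outputs: outputs of the color-branch coding units, layers 1..N
    """
    # Layer 1 input: sum of the two preprocessed feature images.
    x = coding_unit(first_feature + second_feature)
    outputs = [x]
    for k in range(1, len(color_encoder_outputs)):
        # Layer k+1 input: previous depth-branch output plus the
        # color-branch output of matching shape (layer k).
        x = coding_unit(x + color_encoder_outputs[k - 1])
        outputs.append(x)
    return outputs
```

Element-wise summation requires the two branches to produce feature images of identical shape at each layer, which is why both encoders share the same number of layers N.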

The depth branch module 703 may be configured to convert the original depth image into a second feature image suitable for deep neural network processing using the second preprocessing network 121, to input the second feature image to the second encoder network 122, and to synthesize the second depth image, using the second depth prediction network 124, from a feature image obtained by fusing the feature images output by the first decoder network 113 and the second decoder network 123 with the second feature image. Here, the input of the second depth prediction network 124 is a feature image obtained by fusing, by means of an SE block, the feature image output by the second decoder network 123, the feature image output by the first decoder network 113, and the second feature image output by the second preprocessing network 121.
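One plausible reading of fusion by means of an SE (squeeze-and-excitation) block is sketched below. The concatenation of the three feature images along the channel axis, the hidden width, and the final 1x1 projection back to C channels are assumptions made for illustration; the weight matrices stand in for learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_fuse(features, w_reduce, w_expand, w_project):
    """Fuse K feature images of shape (C, H, W) with an SE-style channel gate.

    w_reduce:  (hidden, K*C) weights of the first fully connected layer
    w_expand:  (K*C, hidden) weights of the second fully connected layer
    w_project: (C, K*C) weights of a 1x1 convolution back to C channels
    """
    stacked = np.concatenate(features, axis=0)        # (K*C, H, W)
    squeezed = stacked.mean(axis=(1, 2))              # squeeze: global average pooling
    hidden = np.maximum(w_reduce @ squeezed, 0.0)     # excitation: FC + ReLU
    gate = sigmoid(w_expand @ hidden)                 # per-channel weights in (0, 1)
    reweighted = stacked * gate[:, None, None]        # rescale every channel
    return np.einsum("oc,chw->ohw", w_project, reweighted)  # 1x1 conv to C channels
```

The channel gate lets the network amplify informative channels and suppress uninformative ones before the three sources are combined.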

The image merging module 704 may obtain a final depth image by merging the first depth image and the second depth image. Here, the first depth image and the second depth image may be merged into the final depth image by the fusion module 130 of the depth image complementation model 100.

The fusion module 130 may be implemented as an attention network. In this case, the image merging module 704 may use the attention network to obtain a first pixel weight map for the first depth image and a second pixel weight map for the second depth image, and may obtain the final depth image as the weighted sum of the first depth image and the second depth image based on the first pixel weight map and the second pixel weight map.
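The per-pixel weighted sum can be sketched as below. The attention network itself is not modeled here: `logits1` and `logits2` stand for its two raw output maps, and normalizing them with a two-way softmax so that the weights sum to 1 at every pixel is an assumption.

```python
import numpy as np

def merge_depth_images(depth1, depth2, logits1, logits2):
    """Blend two depth images with per-pixel weights from a two-way softmax."""
    m = np.maximum(logits1, logits2)            # subtract the max for numerical stability
    e1, e2 = np.exp(logits1 - m), np.exp(logits2 - m)
    weight1 = e1 / (e1 + e2)                    # first pixel weight map
    weight2 = e2 / (e1 + e2)                    # second pixel weight map
    final = weight1 * depth1 + weight2 * depth2
    return final, weight1, weight2
```

With pixel-wise weights, the merged result can favor the depth branch where measured depth is available and fall back on the color branch elsewhere.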

The depth image complementing apparatus 700 may further include a training module (not shown), or may be included in a communication device or distributed network capable of communicating with a training module. Before the first deep neural network, the second deep neural network, and/or the attention network are used, the training module may train them using a loss function. The training may be performed by the method of training the depth image complementation model 100 described above.

The training module may generate the loss function by considering a first mean squared error loss (MSE1) between the first depth image and the actual (ground-truth) depth image, a second mean squared error loss (MSE2) between the second depth image and the actual depth image, a third mean squared error loss (MSE3) between the final depth image and the actual depth image, and a structural loss between the final depth image and the actual depth image. Here, the structural loss is 1 minus the structural similarity index (SSIM). The training module may obtain the loss function as a weighted sum of the first mean squared error loss, the second mean squared error loss, the third mean squared error loss, and the structural loss.

Hereinafter, an example in which the sparse depth densification task is implemented on the NYU-Depth-V2 (hereinafter, NYU) database using the depth image complementing method according to an embodiment of the present disclosure is described in detail.

In the first step, the data is preprocessed to prepare training samples. The NYU database provides 465 indoor scenes captured with a Kinect, together with RGB images captured by a color camera; 249 of the scenes are used as training scenes and 216 as validation scenes. It also provides 654 annotated images as a test set. In addition, the official release provides the camera parameters and data preprocessing tools. The data preprocessing proceeds as follows.

(1) Using the officially provided tools, the raw data is matched, projected, and cropped to obtain about 500K pairs of original images of the same resolution in total, of which about 220K belong to the training scenes and about 280K to the test scenes.

(2) Because the original data provides ground truth for only some of the depth images, the remaining depth images, for which no ground truth is provided, are filled in using the colorization method described in the official documentation, so that ground truth is obtained for all depth images.

(3) For comparison with existing methods, 50K image pairs are randomly selected from the training scenes and used to train the depth image complementation model 100.

(4) All training images are resized to, for example, 304×224. The training image size is of course not limited to this.

(5) Sparse sampling is performed on the ground truth of every depth image obtained in step (2); for example, 500 valid pixels are randomly selected from the ground-truth depth image to generate a sparse depth image.

(6) Horizontal flipping, vertical flipping, and color jitter are randomly applied to the images to increase the diversity of the data.

(7) The depth image is converted into a tensor and input to the depth image complementation model 100 for processing.
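Steps (5) and (6) above can be sketched as follows. Treating a pixel as valid when its depth is greater than 0, applying the same flip to the color image, the sparse depth image, and the ground truth so that they stay aligned, and the function names are all assumptions; color jitter is omitted.

```python
import numpy as np

def sample_sparse_depth(gt_depth, num_points=500, rng=None):
    """Keep `num_points` randomly chosen valid pixels; set all others to 0."""
    rng = np.random.default_rng() if rng is None else rng
    flat_gt = gt_depth.ravel()
    valid = np.flatnonzero(flat_gt > 0)             # assumed validity test
    keep = rng.choice(valid, size=min(num_points, valid.size), replace=False)
    sparse = np.zeros(flat_gt.size, dtype=gt_depth.dtype)
    sparse[keep] = flat_gt[keep]
    return sparse.reshape(gt_depth.shape)

def random_flip(color, sparse, gt, rng=None):
    """Apply the same random horizontal and vertical flip to all three images."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < 0.5:                          # horizontal flip (width axis)
        color, sparse, gt = color[:, ::-1], sparse[:, ::-1], gt[:, ::-1]
    if rng.random() < 0.5:                          # vertical flip (height axis)
        color, sparse, gt = color[::-1], sparse[::-1], gt[::-1]
    return color, sparse, gt
```

Sampling without replacement guarantees exactly 500 distinct points whenever the ground truth contains at least that many valid pixels.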

In the second step, the depth image complementation model 100 is trained using the training samples prepared above and the loss function described with reference to FIG. 5. During training, the batch size is 4, the initial learning rate is 0.001 and is halved every 5 epochs, and training runs for 50 epochs in total.
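The learning-rate schedule stated above amounts to a step decay, sketched here. The 0-based epoch index and the function name are assumptions; the numeric values come from the text.

```python
def learning_rate(epoch, initial_lr=0.001, decay=0.5, step=5):
    """Learning rate for a 0-based epoch index under step decay."""
    return initial_lr * decay ** (epoch // step)

# Learning rates for the 50 training epochs.
schedule = [learning_rate(epoch) for epoch in range(50)]
```

Under this reading the rate is halved nine times over the 50 epochs, so the final epochs run at 0.001 × 0.5⁹.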

In the third step, after training is completed, the parameters of the depth image complementation model 100 are fixed. At this point, the model has learned, from the training data, the mapping from an original depth image and an original color image to a complete depth image. When a new pair of data is fed to the depth image complementation model 100, the model can infer the complemented depth image.

Similarly, the depth hole complementation (hole-filling) task may be implemented on the NYU database using the depth image complementing method according to an embodiment of the present disclosure; this is not described in further detail here.

Compared with conventional methods, the depth image complementing method of the present disclosure achieves better results in both the depth hole complementation task and the sparse depth densification task.

1) Depth hole complementation task

For the NYU database, all image pairs for which no ground truth is provided (about 500K) are used as the training set, and in testing the 1,449 official image pairs with complete depth maps are used as the test set to verify the final accuracy.

For comparison, DeepLiDAR was trained and tested by reproducing its open-source code to obtain its depth hole complementation results. As shown in Table 1, the depth image complementing method according to an exemplary embodiment of the present disclosure is markedly better than DeepLiDAR on every reported metric, namely root mean squared error (RMSE), mean absolute error (MAE), inverse root mean squared error (iRMSE), and inverse mean absolute error (iMAE); for example, its RMSE is 36.78 mm versus 82.03 mm for DeepLiDAR.

| Algorithm | RMSE | MAE | iRMSE | iMAE |
|---|---|---|---|---|
| DeepLiDAR | 82.033001 | 49.314480 | 16.459752 | 9.298696 |
| DepthNet (model 100) | 36.783371 | 12.827534 | 5.660427 | 1.995547 |

Table 1. Comparison of depth hole complementation performance on the NYU dataset (unit: mm)

2) Sparse depth complementation task

With the data configuration described above, the training set consists of 50K depth image pairs randomly selected from the official training set (about 220K) and expanded by data augmentation, and in testing the 654 official image pairs are used as the test set to verify the final accuracy.

The test results are likewise based on the test set of the NYU-Depth-V2 dataset. Every input image is obtained by randomly sampling the corresponding ground-truth depth image to produce a sparse image with 500 valid points, and the sparse-to-dense depth complementation test is then performed. As shown in Table 2, the depth image complementing method according to an exemplary embodiment of the present disclosure outperforms the existing networks on the reported metrics, namely root mean squared error (RMSE) and relative error (REL).

| Algorithm | RMSE | REL |
|---|---|---|
| Dfusenet | 219.5 | 0.0441 |
| Sparse-to-dense | 200 | 0.038 |
| CSPN++ | 115.0 | - |
| DeepLiDAR | 115.0 | 0.022 |
| DepthNet (model 100) | 105.65 | 0.015 |

Table 2. Comparison of sparse depth complementation performance on the NYU dataset (unit: mm)

Comparing the results of the two tasks shows that the method of the present disclosure performs well on both, particularly on the sparse depth complementation task, where it outperforms the state-of-the-art methods compared here. The experimental results also indicate that the model according to the present disclosure is robust. For different missing patterns, the model generates a complete depth map through the depth estimation network based on the color image and fuses it with the depth image generated by the depth prediction branch operating on the depth image, so that the model can output a relatively reasonable depth image even when the depth image is missing.
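For reference, the error metrics named in the two comparisons (RMSE, MAE, iRMSE, iMAE, and REL) are commonly computed as sketched below. Evaluating only pixels with ground truth greater than 0 and requiring positive predictions for the inverse-depth metrics are assumptions; the disclosure does not describe its evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Error metrics over pixels that have valid ground truth (gt > 0)."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    err = p - g                      # depth error
    inv_err = 1.0 / p - 1.0 / g      # inverse-depth error (assumes p > 0)
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "iRMSE": float(np.sqrt(np.mean(inv_err ** 2))),
        "iMAE": float(np.mean(np.abs(inv_err))),
        "REL": float(np.mean(np.abs(err) / g)),
    }
```

RMSE and MAE carry the unit of the depth values, the inverse metrics carry its reciprocal, and REL is dimensionless.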

According to an embodiment of the present disclosure, a computing device may be provided. The computing device includes a processor and a memory; the memory stores a computer program, and the method of complementing a depth image of the present disclosure is implemented when the computer program is executed by the processor.

According to the method and apparatus for complementing a depth image of the present disclosure, the color branch network performs depth estimation from the original color image and thereby learns a mapping from the color image to a complete depth image. To make full use of the color image information in complementing the depth image, the depth branch network performs depth inference (prediction) using the original depth image and some intermediate-layer feature images of the color branch network. As a result, the model can stably generate a high-quality, complete depth image even when the original depth image is very sparse (or even absent), and achieves good results in both depth hole filling and sparse depth densification.

In addition, according to the method and apparatus for complementing a depth image of the present disclosure, mask information is propagated using gated convolution in the depth branch network, which effectively distinguishes valid pixels from invalid pixels in the image and allows the generated depth image to preserve the original depth information well.
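The commonly used formulation of gated convolution multiplies an activated feature response by a sigmoid gate computed from the same input. The sketch below uses 1x1 kernels and a ReLU activation purely for brevity; the kernel size, the activation, and the function name are assumptions, and the weight matrices stand in for learned parameters.

```python
import numpy as np

def gated_conv1x1(x, w_feature, w_gate):
    """Gated 1x1 convolution on an input of shape (C_in, H, W).

    w_feature, w_gate: (C_out, C_in) weights of the feature and gate paths.
    """
    feature = np.einsum("oc,chw->ohw", w_feature, x)                    # feature path
    gate = 1.0 / (1.0 + np.exp(-np.einsum("oc,chw->ohw", w_gate, x)))  # soft mask in (0, 1)
    return np.maximum(feature, 0.0) * gate                              # gated response
```

Because the gate is learned per pixel and per channel, it can act as a soft validity mask that attenuates responses at pixels with no depth measurement.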

In addition, according to the method and apparatus for complementing a depth image of the present disclosure, model training is supplemented with supervision by a structural loss based on the structural similarity index (SSIM), so that the final generated depth image has rich detail and high edge quality.

In addition, according to the method and apparatus for complementing a depth image of the present disclosure, the model can be trained end to end, which avoids the use of intermediate features and thus effectively avoids the risk posed by low-quality intermediate features.

The method and apparatus for complementing a depth image according to embodiments of the present disclosure have been described above with reference to FIGS. 1 to 7.

The modules shown in FIG. 7 may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments used to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can read and run the program code or code segments to perform those operations.

For example, an embodiment of the present disclosure may also be implemented as a computing device that includes a storage component and a processor, where the storage component stores a set of computer-executable instructions that, when executed by the processor, performs the depth image complementing method according to an embodiment of the present disclosure.

Specifically, the computing device may be deployed in a server or a client, or in a node device of a distributed network environment. The computing device may also be a personal computer (PC), a tablet device, a personal digital assistant, a smartphone, a web application, or any other device capable of executing the instruction set.

Here, the computing device need not be a single computing device; it may be any collection of devices or circuits capable of executing the instructions (or instruction set) individually or jointly. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).

In the computing device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.

Some of the operations described in the depth image complementing method according to an embodiment of the present disclosure may be implemented in software, some in hardware, and some in a combination of software and hardware.

The processor may execute instructions or code stored in one of the storage components, and the storage components may also store data. Instructions and data may also be transmitted and received over a network through a network interface device, which may use any known transmission protocol.

Accordingly, the depth image complementing method described with reference to FIG. 6 may be implemented by a system including at least one computing device and at least one storage device storing instructions.

According to an embodiment of the present disclosure, the at least one computing device is a computing device for performing the depth image complementing method, a set of computer-executable instructions is stored in the storage device, and the method of complementing a depth image described with reference to FIG. 6 is performed when the set of computer-executable instructions is executed by the at least one computing device.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may also be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method of complementing a depth image, the method comprising:
acquiring an original color image and a corresponding original depth image;
obtaining a first depth image using a first deep neural network, based on the original color image;
obtaining a second depth image using a second deep neural network, based on the original depth image and an intermediate feature image generated by each intermediate layer of the first deep neural network; and
obtaining a final depth image by merging the first depth image and the second depth image.

The method of claim 1, wherein
the first deep neural network comprises a first encoder network and a first decoder network having an N-layer residual structure,
the second deep neural network comprises a second encoder network and a second decoder network having an N-layer residual structure,
N is an integer greater than 1, and
the obtaining of the second depth image comprises:
performing feature decoding using the second decoder network, based on outputs of the first encoder network and the second encoder network, an intermediate feature image of the first decoder network, and an intermediate feature image of the second encoder network.

The method of claim 2, wherein
the obtaining of the second depth image comprises:
performing feature encoding using the second encoder network of the second deep neural network, based on the original color image and an intermediate feature image of the first encoder network.

제2항에 있어서,
상기 제1 심층 신경망은,
상기 제1 인코더 네트워크 이전의 제1 전처리 네트워크 및 상기 제1 디코더 이후의 제1 깊이 예측 네트워크를 더 포함하고,
상기 제1 깊이 이미지를 획득하는 단계는,
상기 제1 전처리 네트워크를 사용하여 상기 원본 컬러 이미지를 심층 신경망 처리에 적합한 제1 특징 이미지로 변환하고, 상기 제1 특징 이미지를 상기 제1 인코더 네트워크에 입력하는 단계; 및
상기 제1 깊이 예측 네트워크를 사용하여 상기 제1 디코더 네트워크에 의해 출력된 특징 이미지를 상기 제1 깊이 이미지로 합성하는 단계
를 포함하고,
상기 제2 심층 신경망은,
상기 제2 인코더 네트워크 이전의 제2 전처리 네트워크 및 상기 제2 디코더 네트워크 이후의 제2 깊이 예측 네트워크를 더 포함하고,
상기 제2 깊이 이미지를 획득하는 단계는,
상기 제2 전처리 네트워크를 사용하여 상기 원본 깊이 이미지를 심층 신경망 처리에 적합한 제2 특징 이미지로 변환하고, 상기 제2 특징 이미지를 제2 인코더 네트워크에 입력하는 단계; 및
상기 제2 깊이 예측 네트워크를 사용하여 상기 제1 디코더 네트워크 및 상기 제2 디코더 네트워크에 의해 출력된 특징 이미지와 상기 제2 특징 이미지를 융합하여 상기 제2 깊이 이미지를 획득하는 단계
를 포함하는
깊이 이미지를 보완하는 방법.
The method of claim 2, wherein
the first deep neural network further comprises
a first preprocessing network before the first encoder network and a first depth prediction network after the first decoder,
the obtaining of the first depth image comprises:
converting the original color image into a first feature image suitable for deep neural network processing using the first preprocessing network, and inputting the first feature image to the first encoder network; and
synthesizing the feature image output by the first decoder network into the first depth image using the first depth prediction network,
the second deep neural network further comprises
a second preprocessing network before the second encoder network and a second depth prediction network after the second decoder network, and
the obtaining of the second depth image comprises:
converting the original depth image into a second feature image suitable for deep neural network processing using the second preprocessing network, and inputting the second feature image to the second encoder network; and
obtaining the second depth image by fusing, using the second depth prediction network, the feature images output by the first decoder network and the second decoder network with the second feature image.

제4항에 있어서,
상기 제2 디코더 네트워크 중 제1 레이어 디코딩 단위의 입력은,
상기 제2 인코더 네트워크에 의해 출력된 특징 이미지와 상기 제1 인코더 네트워크에 의해 출력된 특징 이미지의 합이고,
상기 제2 디코더 네트워크 중 제2 레이어에서 제N 레이어까지의 디코딩 단위로서의 각 레이어 디코딩 단위의 입력은,
SE 블록을 이용하는 방식으로 이전 레이어 디코딩 단위에서 출력된 특징 이미지, 상기 제1 디코더 네트워크의 대응하는 레이어 디코딩 단위에서 출력된 특징 이미지 및 상기 제2 인코더 네트워크의 대응하는 레이어 인코딩 단위에서 출력된 특징 이미지를 융합하여 획득한 특징 이미지이고,
상기 제2 깊이 예측 네트워크의 입력은,
상기 SE 블록을 이용하는 방식으로 상기 제2 디코더 네트워크에서 출력된 특징 이미지, 상기 제1 디코더 네트워크에서 출력된 특징 이미지 및 상기 제2 특징 이미지를 융합하여 획득한 특징 이미지인
깊이 이미지를 보완하는 방법.
The method of claim 4, wherein
an input of a first-layer decoding unit of the second decoder network
is a sum of the feature image output by the second encoder network and the feature image output by the first encoder network,
an input of each decoding unit from a second layer to an N-th layer of the second decoder network
is a feature image obtained by fusing, using an SE block, the feature image output from the preceding-layer decoding unit, the feature image output from the corresponding-layer decoding unit of the first decoder network, and the feature image output from the corresponding-layer encoding unit of the second encoder network, and
an input of the second depth prediction network
is a feature image obtained by fusing, using the SE block, the feature image output from the second decoder network, the feature image output from the first decoder network, and the second feature image.
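
As a rough illustration of fusing three feature images with an SE (squeeze-and-excitation) block, the sketch below concatenates the three inputs along the channel axis and re-weights each channel with a standard SE gate (global average pooling, a reducing dense layer with ReLU, an expanding dense layer with sigmoid). The concatenation step, the function name `se_fuse`, and the weight arguments `w_reduce` and `w_expand` are assumptions made for this example; the claim only states that an SE block is used, and any subsequent channel reduction is omitted here.

```python
import math
from typing import List

Channel = List[List[float]]  # one H x W feature map


def se_fuse(
    feature_groups: List[List[Channel]],
    w_reduce: List[List[float]],
    w_expand: List[List[float]],
) -> List[Channel]:
    """Concatenate feature groups and re-weight channels with an SE gate."""
    # Assumption: fusion starts by concatenating along the channel axis.
    channels = [ch for group in feature_groups for ch in group]
    # Squeeze: global average pooling, one scalar per channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in channels
    ]
    # Excitation: dense (reduce) -> ReLU -> dense (expand) -> sigmoid.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce
    ]
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every channel by its learned weight.
    return [
        [[v * s for v in row] for row in ch] for ch, s in zip(channels, scales)
    ]
```

With all weights set to zero, every gate evaluates to 0.5, so each channel is simply halved; trained weights would instead emphasize the more informative source per channel.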

제5항에 있어서,
상기 제2 인코더 네트워크 중 제1 레이어 인코딩 단위의 입력은,
상기 제1 특징 이미지와 상기 제2 특징 이미지의 합이고,
상기 제2 인코더 네트워크 중 제2 레이어에서 제N 레이어까지의 인코딩 단위로서의 각 레이어 인코딩 단위의 입력은,
이전 레이어 인코딩 단위에서 출력된 특징 이미지와 상기 제1 인코더 네트워크의 대응하는 레이어 인코딩 단위에서 출력된 특징 이미지의 합인
깊이 이미지를 보완하는 방법.
The method of claim 5, wherein
an input of a first-layer encoding unit of the second encoder network
is a sum of the first feature image and the second feature image, and
an input of each encoding unit from a second layer to an N-th layer of the second encoder network
is a sum of the feature image output from the preceding-layer encoding unit and the feature image output from the corresponding-layer encoding unit of the first encoder network.
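
The input wiring of the second encoder network can be sketched as below. The names `depth_encoder_forward` and `add_features` are hypothetical, feature images are reduced to single-channel 2D lists for brevity, and the sketch assumes that the "corresponding" encoding unit of the first encoder network is the one at the same depth as the preceding unit of the second encoder network, so that the two summed feature images have matching shapes.

```python
from typing import Callable, List, Sequence

FeatureMap = List[List[float]]  # single-channel feature image (illustrative)


def add_features(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two equally sized feature images."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def depth_encoder_forward(
    first_feature: FeatureMap,
    second_feature: FeatureMap,
    color_encoder_outputs: Sequence[FeatureMap],
    depth_encoding_units: Sequence[Callable[[FeatureMap], FeatureMap]],
) -> List[FeatureMap]:
    """Run the N encoding units of the second (depth) encoder network."""
    outputs: List[FeatureMap] = []
    # Layer 1 input: first feature image + second feature image.
    x = add_features(first_feature, second_feature)
    for i, unit in enumerate(depth_encoding_units):
        if i > 0:
            # Layers 2..N input: previous depth-encoder output plus the
            # color-encoder output of matching shape (assumed index i - 1).
            x = add_features(outputs[i - 1], color_encoder_outputs[i - 1])
        outputs.append(unit(x))
    return outputs
```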

제2항에 있어서,
상기 제2 인코더 네트워크 및 상기 제2 디코더 네트워크의 각각의 잔차 블록은 컨볼루션 프로세스 실행 후에 한 번의 게이트 프로세스를 실행하는
깊이 이미지를 보완하는 방법.
The method of claim 2, wherein
each residual block of the second encoder network and the second decoder network performs one gating process after performing a convolution process.
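
One common way to realize "convolution followed by a gate" is a gated convolution, in which a second convolution produces a per-pixel gate that is passed through a sigmoid and multiplied with the convolved features. The sketch below assumes that form; the claim itself does not specify how the gate is computed, and the names `conv2d_same` and `gated_residual_block` as well as the single-channel layout are illustrative only.

```python
import math
from typing import List

Image = List[List[float]]


def conv2d_same(image: Image, kernel: Image) -> Image:
    """Single-channel 2D cross-correlation with zero padding ('same' size)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - ph, j + dj - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def gated_residual_block(x: Image, feature_kernel: Image, gate_kernel: Image) -> Image:
    """Residual block: convolution, then one gating step, then skip connection."""
    features = conv2d_same(x, feature_kernel)
    gate = conv2d_same(x, gate_kernel)
    # Assumed gate: sigmoid of a second convolution, applied per pixel.
    gated = [
        [f / (1.0 + math.exp(-g)) for f, g in zip(fr, gr)]
        for fr, gr in zip(features, gate)
    ]
    # Residual (skip) connection.
    return [[a + b for a, b in zip(xr, gr)] for xr, gr in zip(x, gated)]
```

A gate of this kind lets the block suppress features computed from missing depth pixels, which is a frequent motivation for gating in depth completion networks.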

제1항에 있어서,
상기 제1 깊이 이미지와 상기 제2 깊이 이미지를 합병하여 상기 최종 깊이 이미지를 획득하는 단계는,
어텐션 네트워크를 사용하여 상기 제1 깊이 이미지의 제1 픽셀 가중치 맵 및 상기 제2 깊이 이미지의 제2 픽셀 가중치 맵을 획득하는 단계; 및
상기 제1 픽셀 가중치 맵 및 상기 제2 픽셀 가중치 맵에 기초하여, 상기 제1 깊이 이미지 및 상기 제2 깊이 이미지 각각에 가중치를 부여하고 합하여 상기 최종 깊이 이미지를 획득하는 단계
를 포함하는 깊이 이미지를 보완하는 방법.
The method of claim 1, wherein
the acquiring of the final depth image by merging the first depth image and the second depth image comprises:
obtaining a first pixel weight map of the first depth image and a second pixel weight map of the second depth image using an attention network; and
obtaining the final depth image by weighting and summing the first depth image and the second depth image based on the first pixel weight map and the second pixel weight map.
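
The per-pixel weighted merge can be illustrated as follows. The attention network is not modeled; its two weight maps are taken as inputs. The helper `softmax_weight_maps`, which turns two raw score maps into weights that sum to 1 at every pixel, is an assumption added for illustration and is not stated in the claim.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def softmax_weight_maps(score1: Image, score2: Image) -> Tuple[Image, Image]:
    """Assumed normalization: per-pixel softmax over two score maps."""
    w1: Image = []
    w2: Image = []
    for r1, r2 in zip(score1, score2):
        row1, row2 = [], []
        for a, b in zip(r1, r2):
            m = max(a, b)  # subtract the max for numerical stability
            ea, eb = math.exp(a - m), math.exp(b - m)
            row1.append(ea / (ea + eb))
            row2.append(eb / (ea + eb))
        w1.append(row1)
        w2.append(row2)
    return w1, w2


def merge_depth_images(
    depth1: Image, depth2: Image, weight1: Image, weight2: Image
) -> Image:
    """Final depth = weight1 * depth1 + weight2 * depth2, pixel by pixel."""
    return [
        [w1 * d1 + w2 * d2 for d1, d2, w1, w2 in zip(rd1, rd2, rw1, rw2)]
        for rd1, rd2, rw1, rw2 in zip(depth1, depth2, weight1, weight2)
    ]
```

For example, a pixel where the second weight is 0 takes its value entirely from the first depth image, which is the expected behavior where the original depth measurement is missing.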

제8항에 있어서,
상기 제1 심층 신경망 및 상기 제2 심층 신경망 및/또는 상기 어텐션 네트워크를 사용하기 전, 손실 함수를 사용하여 상기 제1 심층 신경망 및 상기 제2 심층 신경망 및/또는 상기 어텐션 네트워크에 대해 훈련하는 단계
를 더 포함하고,
상기 제1 심층 신경망 및 상기 제2 심층 신경망 및/또는 상기 어텐션 네트워크에 대해 훈련하는 단계는,
상기 제1 깊이 이미지와 실제 깊이 이미지의 제1 평균 제곱 오차 손실, 상기 제2 깊이 이미지와 상기 실제 깊이 이미지의 제2 평균 제곱 오차 손실, 상기 최종 깊이 이미지와 상기 실제 깊이 이미지의 제3 평균 제곱 오차 손실 및 상기 최종 깊이 이미지와 상기 실제 깊이 이미지의 구조적 손실을 고려하여 상기 손실 함수를 생성하고,
상기 구조적 손실은, 1 - 구조적 유사성 지수인
깊이 이미지를 보완하는 방법.
The method of claim 8, further comprising:
before using the first deep neural network and the second deep neural network, and/or the attention network, training the first deep neural network and the second deep neural network, and/or the attention network, using a loss function,
wherein the training of the first deep neural network and the second deep neural network, and/or the attention network comprises
generating the loss function by taking into account a first mean squared error loss between the first depth image and an actual depth image, a second mean squared error loss between the second depth image and the actual depth image, a third mean squared error loss between the final depth image and the actual depth image, and a structural loss between the final depth image and the actual depth image, and
the structural loss is 1 minus a structural similarity index.

제9항에 있어서,
상기 제1 심층 신경망 및 상기 제2 심층 신경망 및/또는 상기 어텐션 네트워크에 대해 훈련하는 단계는,
상기 제1 평균 제곱 오차 손실, 상기 제2 평균 제곱 오차 손실, 상기 제3 평균 제곱 오차 손실 및 상기 구조적 손실의 가중 합산을 통해 상기 손실 함수를 얻는
깊이 이미지를 보완하는 방법.
The method of claim 9, wherein
the training of the first deep neural network and the second deep neural network, and/or the attention network comprises
obtaining the loss function as a weighted sum of the first mean squared error loss, the second mean squared error loss, the third mean squared error loss, and the structural loss.
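
Written out, the weighted sum described in this claim might take the following form. The symbols are introduced here for illustration only and do not appear in the claims: D_1, D_2, and D denote the first, second, and final depth images, D^* the actual depth image, and \lambda_1 to \lambda_4 the weights, whose values the claims do not specify.

```latex
L = \lambda_1 \, L_{\mathrm{MSE}}(D_1, D^{*})
  + \lambda_2 \, L_{\mathrm{MSE}}(D_2, D^{*})
  + \lambda_3 \, L_{\mathrm{MSE}}(D, D^{*})
  + \lambda_4 \, \bigl(1 - \mathrm{SSIM}(D, D^{*})\bigr)
```

The last term is the structural loss of claim 9, that is, 1 minus the structural similarity index between the final depth image and the actual depth image.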

제1항에 있어서,
상기 원본 컬러 이미지 및 대응하는 상기 원본 깊이 이미지를 획득하는 단계는,
상기 원본 깊이 이미지가 존재하지 않는 경우, 픽셀 값이 0 인 깊이 이미지를 대응하는 원본 깊이 이미지로 획득하는 단계
를 포함하는
깊이 이미지를 보완하는 방법.
The method of claim 1, wherein
the acquiring of the original color image and the corresponding original depth image comprises:
when the original depth image does not exist, acquiring a depth image whose pixel values are 0 as the corresponding original depth image.
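
A minimal sketch of this fallback is shown below, assuming images are stored as nested lists indexed by row and column. The function name `acquire_inputs` and the use of `None` to signal a missing depth image are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple

DepthImage = List[List[float]]


def acquire_inputs(
    color_image: Sequence[Sequence[object]],
    depth_image: Optional[DepthImage] = None,
) -> Tuple[Sequence[Sequence[object]], DepthImage]:
    """Return the color image with a depth image of the same height and width."""
    if depth_image is None:
        height, width = len(color_image), len(color_image[0])
        # No original depth image: substitute an all-zero depth image.
        depth_image = [[0.0] * width for _ in range(height)]
    return color_image, depth_image
```

With this substitution the same two-branch network can be applied when only a color image is available, in which case the depth branch receives no measured depth at all.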

원본 컬러 이미지 및 대응하는 원본 깊이 이미지를 획득하도록 구성된 이미지 획득 모듈;
상기 원본 컬러 이미지에 기초하여, 제1 심층 신경망을 사용하여 제1 깊이 이미지를 획득하도록 구성된 컬러 분기 모듈;
상기 원본 깊이 이미지 및 상기 제1 심층 신경망의 각 중간 레이어에 의해 생성된 중간 특징 이미지에 기초하여, 제2 심층 신경망을 사용하여 제2 깊이 이미지를 획득하도록 구성된 깊이 분기 모듈; 및
상기 제1 깊이 이미지와 상기 제2 깊이 이미지를 합병하여 최종 깊이 이미지를 획득하도록 구성된 이미지 합병 모듈
을 포함하는 깊이 이미지 보완 장치.
An apparatus for complementing a depth image, the apparatus comprising:
an image acquisition module configured to acquire an original color image and a corresponding original depth image;
a color branch module configured to obtain a first depth image using a first deep neural network, based on the original color image;
a depth branch module configured to obtain a second depth image using a second deep neural network, based on the original depth image and an intermediate feature image generated by each intermediate layer of the first deep neural network; and
an image merging module configured to obtain a final depth image by merging the first depth image and the second depth image.

제12항에 있어서,
상기 제1 심층 신경망은,
N개 레이어의 잔차 구조를 갖는 제1 인코더 네트워크와 제1 디코더 네트워크
를 포함하고,
상기 제2 심층 신경망은,
N개 레이어의 잔차 구조를 갖는 제2 인코더 네트워크와 제2 디코더 네트워크
를 포함하고,
상기 N은 1 보다 큰 정수이고,
상기 깊이 분기 모듈은,
상기 제1 인코더 네트워크 및 상기 제2 인코더 네트워크의 출력, 상기 제1 디코더 네트워크의 중간 특징 이미지 및 상기 제2 인코더 네트워크의 중간 특징 이미지에 기초하여, 상기 제2 디코더 네트워크를 사용하여 특징 디코딩을 진행하는
깊이 이미지 보완 장치.
The apparatus of claim 12, wherein
the first deep neural network comprises
a first encoder network and a first decoder network having an N-layer residual structure,
the second deep neural network comprises
a second encoder network and a second decoder network having an N-layer residual structure,
N is an integer greater than 1, and
the depth branch module is configured to
perform feature decoding using the second decoder network, based on outputs of the first encoder network and the second encoder network, an intermediate feature image of the first decoder network, and an intermediate feature image of the second encoder network.

제13항에 있어서,
상기 깊이 분기 모듈은,
상기 원본 깊이 이미지 및 상기 제1 인코더 네트워크의 중간 특징 이미지에 기초하여, 상기 제2 심층 신경망의 상기 제2 인코더 네트워크를 사용하여 특징을 인코딩하는
깊이 이미지 보완 장치.
The apparatus of claim 13, wherein
the depth branch module is configured to
encode features using the second encoder network of the second deep neural network, based on the original depth image and an intermediate feature image of the first encoder network.

제13항에 있어서, 상기 제1 심층 신경망은,
상기 제1 인코더 네트워크 이전의 제1 전처리 네트워크 및 상기 제1 디코더 이후의 제1 깊이 예측 네트워크를 더 포함하고,
상기 컬러 분기 모듈은,
상기 제1 전처리 네트워크를 사용하여 상기 원본 컬러 이미지를 심층 신경망 처리에 적합한 제1 특징 이미지로 변환하고, 상기 제1 특징 이미지를 상기 제1 인코더 네트워크에 입력하고, 상기 제1 깊이 예측 네트워크를 사용하여 제1 디코더 네트워크에 의해 출력된 특징 이미지를 상기 제1 깊이 이미지로 합성하고,
상기 제2 심층 신경망은,
상기 제2 인코더 네트워크 이전의 제2 전처리 네트워크 및 상기 제2 디코더 네트워크 이후의 제2 깊이 예측 네트워크를 더 포함하고,
상기 깊이 분기 모듈은,
상기 제2 전처리 네트워크를 사용하여 상기 원본 깊이 이미지를 심층 신경망 처리에 적합한 제2 특징 이미지로 변환하고, 상기 제2 특징 이미지를 상기 제2 인코더 네트워크에 입력하고, 상기 제2 깊이 예측 네트워크를 사용하여 상기 제1 디코더 네트워크 및 상기 제2 디코더 네트워크에 의해 출력된 특징 이미지와 상기 제2 특징 이미지를 융합하여 상기 제2 깊이 이미지를 획득하는
깊이 이미지 보완 장치.
The apparatus of claim 13, wherein the first deep neural network further comprises
a first preprocessing network before the first encoder network and a first depth prediction network after the first decoder,
the color branch module is configured to
convert the original color image into a first feature image suitable for deep neural network processing using the first preprocessing network, input the first feature image to the first encoder network, and synthesize the feature image output by the first decoder network into the first depth image using the first depth prediction network,
the second deep neural network further comprises
a second preprocessing network before the second encoder network and a second depth prediction network after the second decoder network, and
the depth branch module is configured to
convert the original depth image into a second feature image suitable for deep neural network processing using the second preprocessing network, input the second feature image to the second encoder network, and obtain the second depth image by fusing, using the second depth prediction network, the feature images output by the first decoder network and the second decoder network with the second feature image.

제12항에 있어서,
상기 이미지 합병 모듈은,
어텐션 네트워크를 사용하여 상기 제1 깊이 이미지의 제1 픽셀 가중치 맵 및 상기 제2 깊이 이미지의 제2 픽셀 가중치 맵을 획득하고,
상기 제1 픽셀 가중치 맵 및 상기 제2 픽셀 가중치 맵에 기초하여, 상기 제1 깊이 이미지 및 상기 제2 깊이 이미지에 가중치를 부여하고 합하여 상기 최종 깊이 이미지를 획득하는
깊이 이미지 보완 장치.
The apparatus of claim 12, wherein
the image merging module is configured to:
obtain a first pixel weight map of the first depth image and a second pixel weight map of the second depth image using an attention network; and
obtain the final depth image by weighting and summing the first depth image and the second depth image based on the first pixel weight map and the second pixel weight map.

제16항에 있어서,
상기 제1 심층 신경망 및 상기 제2 심층 신경망 및/또는 상기 어텐션 네트워크를 사용하기 전, 손실 함수를 사용하여 상기 제1 심층 신경망 및 상기 제2 심층 신경망 및/또는 상기 어텐션 네트워크에 대해 훈련하는 훈련 모듈을 더 포함하고,
상기 훈련 모듈은,
상기 제1 깊이 이미지와 실제 깊이 이미지의 제1 평균 제곱 오차 손실, 상기 제2 깊이 이미지와 상기 실제 깊이 이미지의 제2 평균 제곱 오차 손실, 상기 최종 깊이 이미지와 상기 실제 깊이 이미지의 제3 평균 제곱 오차 손실 및 상기 최종 깊이 이미지와 상기 실제 깊이 이미지의 구조적 손실을 고려하여 상기 손실 함수를 생성하고,
상기 구조적 손실은, 1 - 구조적 유사성 지수인
깊이 이미지 보완 장치.
The apparatus of claim 16, further comprising
a training module configured to train the first deep neural network and the second deep neural network, and/or the attention network, using a loss function, before the first deep neural network and the second deep neural network, and/or the attention network are used,
wherein the training module is configured to
generate the loss function by taking into account a first mean squared error loss between the first depth image and an actual depth image, a second mean squared error loss between the second depth image and the actual depth image, a third mean squared error loss between the final depth image and the actual depth image, and a structural loss between the final depth image and the actual depth image, and
the structural loss is 1 minus a structural similarity index.

제17항에 있어서,
상기 훈련 모듈은,
상기 제1 평균 제곱 오차 손실, 상기 제2 평균 제곱 오차 손실, 상기 제3 평균 제곱 오차 손실 및 상기 구조적 손실의 가중 합산을 통해 상기 손실 함수를 얻는
깊이 이미지 보완 장치.
The apparatus of claim 17, wherein
the training module is configured to
obtain the loss function as a weighted sum of the first mean squared error loss, the second mean squared error loss, the third mean squared error loss, and the structural loss.

프로세서; 및
컴퓨터 프로그램을 저장하는 메모리
를 포함하고,
상기 프로세서에 의해 상기 컴퓨터 프로그램이 실행될 때,
제1항 내지 제11항 중 어느 한 항의 상기 깊이 이미지를 보완하는 방법이 구현되는
컴퓨팅 장치.
A computing device comprising:
a processor; and
a memory storing a computer program,
wherein, when the computer program is executed by the processor,
the method of complementing a depth image of any one of claims 1 to 11 is implemented.

제1항 내지 제11항 중 어느 한 항의 방법을 실행하기 위한 프로그램이 기록되어 있는 것을 특징으로 하는 컴퓨터에서 판독 가능한 기록 매체.
A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 1 to 11.