KR102225579B1

KR102225579B1 - Method for semantic segmentation based on knowledge distillation with improved learning performance

Info

Publication number: KR102225579B1
Application number: KR1020200057993A
Authority: KR
Inventors: 허용석; 박상용
Original assignee: 아주대학교산학협력단
Priority date: 2020-05-14
Filing date: 2020-05-14
Publication date: 2021-03-10

Abstract

The present invention provides a semantic image segmentation method which comprises the steps of: inputting an input image to a teacher network and a student network; generating an adaptive probability map using a probability map of the teacher network and a GT for the input image; defining an adaptive cross entropy loss function using the adaptive probability map and a probability map of the student network; and determining each pixel of the input image as one of a plurality of labels using the adaptive cross entropy loss function. Therefore, learning performance of the student network can be improved.

Description

Semantic image segmentation method based on knowledge distillation with improved learning performance {METHOD FOR SEMANTIC SEGMENTATION BASED ON KNOWLEDGE DISTILLATION WITH IMPROVED LEARNING PERFORMANCE}

The present invention relates to a semantic image segmentation method based on knowledge distillation.

Semantic segmentation is an algorithm that classifies the objects in an image pixel by pixel, and it has the advantage of detecting and classifying objects at the same time.

Knowledge distillation, meanwhile, is one of the network compression methods that improve speed by effectively compressing a model so that a trained artificial neural network can be applied to an actual service. Knowledge distillation uses a relatively high-performing network, called the teacher network, to train a network with low computational cost and memory usage, called the student network. To do so, a loss function that compares the feature map of the teacher network with the feature map of the student network is defined, and the student network is trained with it.

Because the teacher network transfers knowledge to the student network through knowledge distillation, the student network can reach a higher recognition rate than when it is trained from scratch using only the ordinary backpropagation algorithm. However, performance may degrade when inaccurate predictions of the teacher network are passed on to the student network. Accordingly, research is under way on knowledge-distillation-based semantic image segmentation that can improve the performance of the student network.

The present invention proposes a method that can improve the performance of the student network in knowledge distillation for semantic image segmentation. It does so by defining the loss function on a new adaptive probability map, which is constructed so that knowledge about pixels incorrectly predicted by the teacher network is not transferred to the student network.

A semantic image segmentation method according to an embodiment may include: inputting an input image to a teacher network and a student network; generating an adaptive probability map using a probability map of the teacher network and the ground truth (GT) for the input image; defining an adaptive cross entropy loss function using the adaptive probability map and a probability map of the student network; and determining each pixel of the input image as one of a plurality of labels using the adaptive cross entropy loss function.

Here, the adaptive probability map may include as many matrices as there are labels, each matrix having a size equal to the number of horizontal pixels by the number of vertical pixels of the input image.

In addition, when the label predicted by the teacher network for a first pixel of the input image matches the label of the GT, the adaptive probability map for the first pixel may be the sum of the value of the teacher network's probability map and the one-hot encoding vector of the GT.

Alternatively, when the label predicted by the teacher network for the first pixel matches the label of the GT, the adaptive probability map for the first pixel may be the value of the teacher network's probability map.

In addition, the probability map of the teacher network may be the softmax of the feature map of the last layer of the teacher network.
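As an illustration of the statement above, the following minimal sketch turns a last-layer feature map into a probability map by applying softmax at every pixel. The nested-list layout (rows, then pixels, then channels), the function names, and the sample values are assumptions made for this example and do not come from the patent.

```python
import math

def softmax(scores):
    """Numerically stable softmax over the channel scores of one pixel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probability_map(feature_map):
    """Apply softmax at every pixel of an H x W x C nested-list feature map."""
    return [[softmax(pixel) for pixel in row] for row in feature_map]

# Hypothetical 1 x 2 feature map with C = 3 channels.
feature_map = [[[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]]
teacher_probs = probability_map(feature_map)
```

Each pixel's vector in `teacher_probs` sums to 1, so it can be read as a per-label probability distribution for that pixel.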

In addition, when the label predicted by the teacher network for the first pixel of the input image does not match the label of the GT, the adaptive probability map for the first pixel may be the one-hot encoding vector of the GT multiplied by a weight.

In addition, the adaptive cross entropy loss function for a pixel of the input image may be the value obtained by multiplying, for each of the plurality of labels, the value of the adaptive probability map by the logarithm of the value of the student network's probability map, and then summing all of the products.

In addition, the adaptive cross entropy loss function may be the value obtained by summing the per-pixel adaptive cross entropy loss functions over all pixels of the input image and then dividing the sum by the number of horizontal pixels and the number of vertical pixels of the input image.

The semantic image segmentation method according to an embodiment of the present invention may further include training the student network using the adaptive cross entropy loss function.

The present invention also provides a computer-readable storage medium storing a computer program that performs the above-described method when executed by a processor.

According to an embodiment disclosed in the present invention, only the information about pixels that are correctly predicted in the teacher network's probability map is transferred to the student network, so the learning performance of the student network can be improved.

FIG. 1 is a diagram showing a knowledge-distillation-based semantic image segmentation method according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining an adaptive probability map generated according to an embodiment of the present invention.
FIG. 3 is a flowchart showing a semantic image segmentation method according to an embodiment of the present invention.
FIGS. 4 to 7 are diagrams for explaining the image segmentation accuracy of the semantic image segmentation method according to an embodiment of the present invention and of the prior art.

The present invention may be modified in various ways and may have various embodiments, and specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to a specific embodiment, and it should be understood to include all changes, equivalents, and substitutes that fall within the spirit and technical scope of the present invention. In describing each drawing, similar reference numerals are used for similar elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that other elements may exist in between. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no other element exists in between.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" are intended to indicate that the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification exist, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly defined in the present application.

FIG. 1 is a diagram showing a knowledge-distillation-based semantic image segmentation method according to an embodiment of the present invention.

When the input image is input to the teacher network and the student network, each network may output a feature map through its encoder and decoder. A probability map is then generated from the feature map of the teacher network and from the feature map of the student network, and a loss function may be defined based on the generated probability maps. The student network is then trained with the defined loss function.

Meanwhile, when defining the loss function, the semantic image segmentation method according to an embodiment of the present invention uses an adaptive probability map instead of the probability map of the teacher network. The adaptive probability map is generated so that the teacher network's probability map is excluded wherever the teacher network's prediction is incorrect. Accordingly, the method defines the loss function on a probability map with improved prediction accuracy, which has the effect of improving the performance of the student network.

FIG. 2 is a diagram for explaining an adaptive probability map generated according to an embodiment of the present invention.

The semantic image segmentation method according to an embodiment of the present invention assumes that the map output by the teacher network and the map output by the student network have the same size.

Meanwhile, the semantic image segmentation method according to an embodiment of the present invention discloses a new adaptive probability map $P \in \mathbb{R}^{W \times H \times C}$ that uses both the vector obtained by one-hot encoding the ground truth (GT) and the vector of the teacher network's probability map.

Here, W is the width of the adaptive probability map and H is its height. C is the number of channels of the adaptive probability map, which equals the number of labels that can be assigned to each pixel of the input image. The adaptive probability vector $P_{i,j} \in \mathbb{R}^{C}$ is the value of the (i, j)-th pixel in the adaptive probability map P and is defined as in Equation 1.

[Equation 1]

$$P_{i,j} = \begin{cases} q^{T}_{i,j} + y_{i,j}, & \text{if } \hat{l}^{T}_{i,j} = l_{i,j} \\ \alpha \, y_{i,j}, & \text{otherwise} \end{cases}$$

In Equation 1, $q^{T}_{i,j}$ is the vector of softmax values of the (i, j)-th pixel in the feature map of the last layer of the teacher network, and $y_{i,j}$ is the vector obtained by one-hot encoding the (i, j)-th pixel of the GT. $\alpha$ is the weight of the one-hot encoded vector. $l_{i,j}$ is the label of the GT at the (i, j)-th pixel, and $\hat{l}^{T}_{i,j}$ is the label predicted by the teacher network, which is defined as in Equation 2.

[Equation 2]

$$\hat{l}^{T}_{i,j} = \underset{c}{\arg\max} \; q^{T}_{i,j,c}$$

Here, $q^{T}_{i,j,c}$ is the value of the c-th channel in the (i, j)-th vector of the teacher network's probability map.
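The per-pixel rule of Equation 1 and the label rule of Equation 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, labels are assumed to be zero-based channel indices, and the weight value `alpha=2.0` is chosen only for the example because the text does not state a value.

```python
def one_hot(label, num_classes):
    """One-hot encode a GT label as a length-C vector."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]

def teacher_label(teacher_probs):
    """Equation 2: the channel with the highest teacher probability."""
    return max(range(len(teacher_probs)), key=lambda c: teacher_probs[c])

def adaptive_probability_vector(teacher_probs, gt_label, alpha):
    """Equation 1 for one pixel.

    Teacher correct: teacher probabilities plus the one-hot GT vector.
    Teacher wrong:   the one-hot GT vector scaled by the weight alpha.
    """
    y = one_hot(gt_label, len(teacher_probs))
    if teacher_label(teacher_probs) == gt_label:
        return [q + t for q, t in zip(teacher_probs, y)]
    return [alpha * t for t in y]

# The teacher predicts label 0 for this pixel (hypothetical values).
teacher_probs = [0.7, 0.2, 0.1]
correct = adaptive_probability_vector(teacher_probs, gt_label=0, alpha=2.0)
wrong = adaptive_probability_vector(teacher_probs, gt_label=1, alpha=2.0)
```

In the second call the teacher's distribution is discarded entirely, so none of its incorrect preference for label 0 reaches the target vector.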

Referring to FIG. 2, the labels 210 predicted by the teacher network may contain pixels whose prediction is correct (shown as O) and pixels whose prediction is wrong (shown as X). Accordingly, in the adaptive probability map P 230 according to an embodiment of the present invention, the sum of the probability vector and the one-hot encoded vector 220 of the GT is applied to correctly predicted pixels, and a value determined by the one-hot encoded vector 220 of the GT and the weight is applied to incorrectly predicted pixels.

Meanwhile, in the adaptive probability map P 230 according to another embodiment, the probability vector value alone may be applied to correctly predicted pixels.
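The O/X marking described for FIG. 2 and the other embodiment, in which correctly predicted pixels keep only the teacher's probability vector, can be sketched over a whole map as follows. The 2 x 2 sample data, the nested-list layout, the function names, and `alpha=1.0` are assumptions made for this illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda c: values[c])

def correctness_mask(teacher_probs, gt_labels):
    """True (O) where the teacher's label equals the GT label, False (X) otherwise."""
    return [[argmax(pixel) == gt for pixel, gt in zip(prob_row, gt_row)]
            for prob_row, gt_row in zip(teacher_probs, gt_labels)]

def adaptive_map_teacher_only(teacher_probs, gt_labels, alpha):
    """Variant map: teacher vector on O pixels, alpha * one-hot(GT) on X pixels."""
    mask = correctness_mask(teacher_probs, gt_labels)
    result = []
    for prob_row, gt_row, mask_row in zip(teacher_probs, gt_labels, mask):
        out_row = []
        for pixel, gt, ok in zip(prob_row, gt_row, mask_row):
            if ok:
                out_row.append(list(pixel))
            else:
                out_row.append([alpha if c == gt else 0.0
                                for c in range(len(pixel))])
        result.append(out_row)
    return result

# Hypothetical 2 x 2 teacher probability map with C = 2 labels.
teacher_probs = [[[0.8, 0.2], [0.6, 0.4]],
                 [[0.3, 0.7], [0.9, 0.1]]]
gt_labels = [[0, 1],
             [1, 0]]
mask = correctness_mask(teacher_probs, gt_labels)
adaptive_map = adaptive_map_teacher_only(teacher_probs, gt_labels, alpha=1.0)
```

Only the top-right pixel is marked X here, so it is the only pixel whose teacher vector is replaced by the weighted one-hot GT vector.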

Using the adaptive probability map P described above, the semantic image segmentation method according to an embodiment of the present invention can define the adaptive cross entropy loss function $L^{ACE}_{i,j}$ for the (i, j)-th pixel as in Equation 3.

[Equation 3]

$$L^{ACE}_{i,j} = -\sum_{c=1}^{C} P_{i,j,c} \log q^{S}_{i,j,c}$$

Here, $P_{i,j,c}$ is the value of the c-th channel in the (i, j)-th vector of the adaptive probability map, and $q^{S}_{i,j,c}$ is the value of the c-th channel in the (i, j)-th vector $q^{S}_{i,j}$ of the student network's probability map. Accordingly, the adaptive cross entropy loss function can be defined as in Equation 4.

[Equation 4]

$$L^{ACE} = \frac{1}{W \cdot H} \sum_{i=1}^{W} \sum_{j=1}^{H} L^{ACE}_{i,j}$$

The semantic image segmentation method according to an embodiment of the present invention can train the student network based on the adaptive cross entropy loss function of Equation 4.
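The loss of Equations 3 and 4 can be sketched in a few lines. The sketch assumes the usual negative sign of a cross entropy so that a smaller value means a better fit, and it adds a small `eps` inside the logarithm to avoid log(0); both points, along with the function names and sample values, are assumptions of this example rather than statements from the patent.

```python
import math

def pixel_ace(adaptive_vec, student_vec, eps=1e-12):
    """Equation 3: negative sum over labels of P_c * log(q_c) for one pixel."""
    return -sum(p * math.log(q + eps)
                for p, q in zip(adaptive_vec, student_vec))

def ace_loss(adaptive_map, student_probs):
    """Equation 4: per-pixel losses summed and divided by W * H."""
    height = len(adaptive_map)
    width = len(adaptive_map[0])
    total = sum(pixel_ace(p, q)
                for p_row, q_row in zip(adaptive_map, student_probs)
                for p, q in zip(p_row, q_row))
    return total / (width * height)

# Hypothetical 1 x 2 adaptive map: a teacher-correct pixel and a teacher-wrong pixel.
adaptive_map = [[[1.7, 0.2, 0.1], [0.0, 2.0, 0.0]]]
good_student = [[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]]
poor_student = [[[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]]
loss_good = ace_loss(adaptive_map, good_student)
loss_poor = ace_loss(adaptive_map, poor_student)
```

A student whose probabilities agree with the adaptive map (`good_student`) receives a lower loss than one that disagrees (`poor_student`).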

FIG. 3 is a flowchart showing a semantic image segmentation method according to an embodiment of the present invention.

In step 310, the knowledge-distillation-based semantic image segmentation method according to an embodiment of the present invention may input an input image to the teacher network and the student network.

In step 320, an adaptive probability map may be generated using the probability map of the teacher network and the GT for the input image. Here, the adaptive probability map may include as many matrices as there are labels, each matrix having a size equal to the number of horizontal pixels by the number of vertical pixels of the input image.

Meanwhile, when the label predicted by the teacher network for a first pixel of the input image matches the label of the GT, the adaptive probability map for the first pixel may be the value of the teacher network's probability map.

In addition, when the label predicted by the teacher network for the first pixel does not match the label of the GT, the adaptive probability map for the first pixel may be the one-hot encoding vector of the GT multiplied by a weight.

In step 330, an adaptive cross entropy loss function may be defined using the adaptive probability map and the probability map of the student network.

Here, the adaptive cross entropy loss function for a pixel of the input image may be the value obtained by multiplying, for each of the plurality of labels, the value of the adaptive probability map by the logarithm of the value of the student network's probability map, and then summing all of the products.

The adaptive cross entropy loss function may then be the value obtained by summing the per-pixel adaptive cross entropy loss functions over all pixels of the input image and dividing the sum by the number of horizontal pixels and the number of vertical pixels of the input image.

In step 340, each pixel of the input image may be determined as one of a plurality of labels using the adaptive cross entropy loss function.
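One plausible reading of step 340 is that, once the student network has been trained with the adaptive cross entropy loss, each pixel receives the label with the highest probability in the student's probability map. The text does not spell out this rule, so the per-pixel argmax below, the function name, and the sample values are assumptions of this sketch.

```python
def predict_labels(student_probs):
    """Assign each pixel the label (channel index) with the highest probability."""
    return [[max(range(len(pixel)), key=lambda c: pixel[c]) for pixel in row]
            for row in student_probs]

# Hypothetical 2 x 2 student probability map with C = 3 labels.
student_probs = [[[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
                 [[0.2, 0.2, 0.6], [0.05, 0.9, 0.05]]]
label_map = predict_labels(student_probs)
```

The result `label_map` has the same height and width as the input and holds one label index per pixel.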

In addition, the knowledge-distillation-based semantic image segmentation method according to an embodiment of the present invention may further include training the student network using the adaptive cross entropy loss function.
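To make the training signal concrete, the following short derivation gives the gradient of the per-pixel loss with respect to the student's pre-softmax scores. It assumes, as is standard but not stated in the text for the student, that the student's probability vector is the softmax of scores $z_c$. The symbols are local to this sketch: $P_c$ is the adaptive probability value for label $c$ at one pixel, $q_c$ is the student's probability, $y_c$ is the one-hot GT value, and $\alpha$ is the weight.

```latex
% Per-pixel loss, with the student probabilities assumed to be a softmax of scores z
L = -\sum_{c=1}^{C} P_c \log q_c ,
\qquad q_c = \frac{e^{z_c}}{\sum_{m=1}^{C} e^{z_m}}

% Since \partial \log q_c / \partial z_k = \delta_{ck} - q_k :
\frac{\partial L}{\partial z_k}
  = -\sum_{c=1}^{C} P_c \,(\delta_{ck} - q_k)
  = q_k \sum_{c=1}^{C} P_c \; - \; P_k

% Teacher correct (P = teacher probabilities + one-hot GT, so \sum_c P_c = 2):
\frac{\partial L}{\partial z_k} = 2\,q_k - P_k

% Teacher wrong (P = \alpha\, y, so \sum_c P_c = \alpha):
\frac{\partial L}{\partial z_k} = \alpha \,(q_k - y_k)
```

Under these assumptions, a pixel the teacher got wrong contributes an ordinary cross entropy gradient toward the GT scaled by $\alpha$, while a pixel the teacher got right pulls the student toward the average of the teacher's distribution and the one-hot GT.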

FIGS. 4 to 7 are diagrams for explaining the image segmentation accuracy of the semantic image segmentation method according to an embodiment of the present invention and of the prior art.

In the experiments to verify the performance of the knowledge-distillation-based semantic image segmentation method according to an embodiment of the present invention, the teacher network used the Deeplab-V3+ architecture with Xception65 as its encoder. Resnet34 was used as the encoder of the student network.

Cityscapes and Camvid images were used as the experimental data sets. Cityscapes has 19 labels in total, with 2,975 training images, 500 validation images, and 1,525 test images. Camvid has 12 labels in total, with 367 training images, 101 validation images, and 233 test images.

In FIGS. 4 and 5, "Teacher network" denotes the prediction accuracy of the teacher network, and "CE" denotes the prediction accuracy when the student network is trained without knowledge distillation, using a convolutional cross entropy (CE) loss function based only on the student network's probability map.

MIMIC+CE denotes the prediction accuracy when the student network is trained with the CE loss function together with the MIMIC feature distillation method disclosed in A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," arXiv, vol. arXiv:1412.6550, 2015. [Online]. Available: http://arxiv.org/abs/1412.6550. Pair-wise+CE denotes the prediction accuracy when the student network is trained with the CE loss function together with the global pair-wise feature distillation method disclosed in Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, "Structured knowledge distillation for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2019, pp. 2604-2613.

MIMIC+ACE denotes the prediction accuracy when the student network is trained with the adaptive cross entropy loss function according to an embodiment of the present invention together with the MIMIC feature distillation method, and Pair-wise+ACE denotes the prediction accuracy when the student network is trained with the adaptive cross entropy loss function according to an embodiment of the present invention together with the global pair-wise feature distillation method.

도 4는 Cityscapes 이미지가 입력된 경우, Resnet34를 학생 네트워크의 인코더로 활용한 본 발명의 일 실시예에 따른 의미론적 영상 분할 방법의 예측 정확도를 설명하기 위한 표이다.4 is a table for explaining the prediction accuracy of a semantic image segmentation method according to an embodiment of the present invention using Resnet34 as an encoder for a student network when a Cityscapes image is input.

도 4를 참고하면, 검증 데이터, 훈련 데이터 및 시험 데이터에 대하여 메모리 사용량 및 연산량이 큰 교사 네트워크의 예측 정확도가 가장 높은 것을 확인할 수 있다. 또한, 기존의 CE, MIMIC+CE, Pair-wise+CE 기법에 비하여, 본 발명의 일 실시예에 따른 의미론적 영상 분할 방법의 예측 정확도(MIMIC+ACE, Pair-wise+ACE)가 더 높은 것을 확인할 수 있다.Referring to FIG. 4, it can be seen that the prediction accuracy of the teacher network having a large memory usage and a large computational amount for verification data, training data, and test data is the highest. In addition, the prediction accuracy (MIMIC+ACE, Pair-wise+ACE) of the semantic image segmentation method according to an embodiment of the present invention is higher than that of the existing CE, MIMIC+CE, and Pair-wise+CE techniques. I can confirm.

도 5는 Cityscapes 이미지가 입력된 경우, Resnet34를 학생 네트워크의 인코더로 활용한 본 발명의 일 실시예에 따른 의미론적 영상 분할 방법의 결과를 설명하기 위한 도면이다.5 is a diagram for explaining a result of a semantic image segmentation method according to an embodiment of the present invention using Resnet34 as an encoder for a student network when a Cityscapes image is input.

도 5의 (a)는 라벨을 설명하기 위한 것으로 총 19개의 라벨이 다른 색상으로 표시된 것을 확인할 수 있다. 도 5의 (b)는 각각 입력 이미지, 교사 네트워크의 라벨링 결과, GT, CE 기법의 라벨링 결과를 나타낸다. 도 5의 (c)는 MIMIC+CE 기법의 라벨링 결과와 본 발명의 일 실시예에 따른 지식 증류법 기반 의미론적 영상 분할 방법(MIMIC+ACE)을 적용한 경우, 학생 네트워크의 라벨링 결과를 나타낸다. 그리고, 도 5의 (d)는 Pair-wise+CE 기법의 라벨링 결과와 본 발명의 일 실시예에 따른 지식 증류법 기반 의미론적 영상 분할 방법(Pair-wise+ACE)을 적용한 경우, 학생 네트워크의 라벨링 결과를 나타낸다. 5A is for explaining the label, and it can be seen that a total of 19 labels are displayed in different colors. 5B shows the input image, the labeling result of the teacher network, and the labeling result of the GT and CE techniques, respectively. 5C shows the labeling result of the MIMIC+CE technique and the labeling result of the student network when the knowledge distillation-based semantic image segmentation method (MIMIC+ACE) according to an embodiment of the present invention is applied. In addition, FIG. 5(d) shows the labeling result of the Pair-wise+CE technique and the labeling of the student network when the knowledge distillation-based semantic image segmentation method (Pair-wise+ACE) according to an embodiment of the present invention is applied. Show the results.

도 5의 (b) 내지 (d)를 참고하면, 본 발명의 일 실시예에 따른 지식 증류법 기반 의미론적 영상 분할 방법을 적용한 경우의 라벨링 결과는, 교사 네트워크의 라벨링 결과보다는 정확도가 낮지만, 종래 기법(CE, MIMIC+CE, Pair-wise+CE)보다는 정확도가 향상된 것을 확인할 수 있다.5B to 5D, the labeling result when the semantic image segmentation method based on the knowledge distillation method according to an embodiment of the present invention is applied is less accurate than the labeling result of the teacher network. It can be seen that the accuracy is improved over the techniques (CE, MIMIC+CE, Pair-wise+CE).

도 6은 Camvid 이미지가 입력된 경우, Resnet34를 학생 네트워크의 인코더로 활용한 본 발명의 일 실시예에 따른 의미론적 영상 분할 방법의 예측 정확도를 설명하기 위한 표이다.6 is a table for explaining the prediction accuracy of a semantic image segmentation method according to an embodiment of the present invention using Resnet34 as an encoder of a student network when a Camvid image is input.

도 4와 마찬가지로, 검증 데이터, 훈련 데이터 및 시험 데이터에 대하여 메모리 사용량 및 연산량이 큰 교사 네트워크의 예측 정확도가 가장 높은 것을 확인할 수 있다. 또한, 종래 기법(CE, MIMIC+CE, Pair-wise+CE)에 비하여, 본 발명의 일 실시예에 따른 의미론적 영상 분할 방법의 예측 정확도(MIMIC+ACE, Pair-wise+ACE)가 더 높은 것을 확인할 수 있다.Like FIG. 4, it can be seen that the prediction accuracy of the teacher network having a large memory usage and a large computational amount for verification data, training data, and test data is the highest. In addition, the prediction accuracy (MIMIC+ACE, Pair-wise+ACE) of the semantic image segmentation method according to an embodiment of the present invention is higher than that of the conventional techniques (CE, MIMIC+CE, Pair-wise+CE). Can be confirmed.

도 7은 Camvid 이미지가 입력된 경우, Resnet34를 학생 네트워크의 인코더로 활용한 본 발명의 일 실시예에 따른 의미론적 영상 분할 방법의 결과를 설명하기 위한 도면이다.FIG. 7 is a diagram illustrating a result of a semantic image segmentation method according to an embodiment of the present invention using Resnet34 as an encoder of a student network when a Camvid image is input.

도 7의 (a)는 라벨을 설명하기 위한 것으로 총 12개의 라벨이 다른 색상으로 표시된 것을 확인할 수 있다. 도 7의 (b)는 각각 입력 이미지, 교사 네트워크의 라벨링 결과, GT, 종래 CE 기법의 라벨링 결과를 나타낸다. 도 7의 (c)는 MIMIC+CE 기법의 라벨링 결과와 본 발명의 일 실시예에 따른 지식 증류법 기반 의미론적 영상 분할 방법(MIMIC+ACE)을 적용한 경우, 학생 네트워크의 라벨링 결과를 나타낸다. 그리고, 도 7의 (d)는 Pair-wise+CE 기법의 라벨링 결과와 본 발명의 일 실시예에 따른 지식 증류법 기반 의미론적 영상 분할 방법(Pair-wise+ACE)을 적용한 경우, 학생 네트워크의 라벨링 결과를 나타낸다. 7A is for explaining the label, and it can be seen that a total of 12 labels are displayed in different colors. 7B shows the input image, the labeling result of the teacher network, the GT, and the labeling result of the conventional CE technique, respectively. 7C shows the labeling result of the MIMIC+CE technique and the labeling result of the student network when the knowledge distillation-based semantic image segmentation method (MIMIC+ACE) according to an embodiment of the present invention is applied. 7(d) shows the labeling result of the Pair-wise+CE technique and the case of applying the knowledge distillation-based semantic image segmentation method (Pair-wise+ACE) according to an embodiment of the present invention, the labeling of the student network. Show the results.

As in FIG. 5, FIG. 7 shows that the labeling result obtained with the knowledge-distillation-based semantic image segmentation method according to an embodiment of the present invention is less accurate than the labeling result of the teacher network, but more accurate than the labeling results of the conventional student networks (CE, MIMIC+CE, Pair-wise+CE).

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments of the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

Claims

A semantic image segmentation method comprising:
inputting an input image to a teacher network and a student network;
generating an adaptive probability map using a probability map of the teacher network and a GT for the input image;
defining an adaptive cross entropy loss function using the adaptive probability map and a probability map of the student network; and
determining each pixel of the input image as one of a plurality of labels using the adaptive cross entropy loss function,
wherein the adaptive probability map for a first pixel included in the input image is,
depending on whether the label predicted by the teacher network for the first pixel matches the label of the GT, any one of: the sum of the value of the probability map of the teacher network and the value of the one-hot encoding vector of the GT; the value of the probability map of the teacher network; and the value obtained by multiplying the value of the one-hot encoding vector of the GT by a weight.
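
For illustration only, the per-pixel selection recited above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `adaptive_probability_map`, the `(C, H, W)` array layout, the default `weight=1.0`, and the `match_mode` switch are all hypothetical. The switch exists because the claims name two candidate values for the matching case (the sum, or the teacher value alone), and this sketch does not assume which one an embodiment uses.

```python
import numpy as np

def adaptive_probability_map(teacher_prob, gt_labels, weight=1.0, match_mode="teacher"):
    """Build an adaptive probability map of shape (C, H, W).

    teacher_prob: (C, H, W) float array of teacher probabilities (assumed layout).
    gt_labels:    (H, W) integer array of GT label indices.
    weight:       hypothetical placeholder for the weight applied on a mismatch.
    match_mode:   hypothetical switch, "teacher" or "sum", for the matching case.
    """
    num_labels = teacher_prob.shape[0]
    # One-hot encode the GT so it has the same (C, H, W) shape as the teacher map.
    one_hot = (np.arange(num_labels)[:, None, None] == gt_labels[None]).astype(teacher_prob.dtype)
    # Per pixel: does the teacher's most probable label equal the GT label?
    match = teacher_prob.argmax(axis=0) == gt_labels  # shape (H, W)

    if match_mode == "sum":
        matched = teacher_prob + one_hot   # teacher value plus GT one-hot value
    elif match_mode == "teacher":
        matched = teacher_prob             # teacher value only
    else:
        raise ValueError("match_mode must be 'sum' or 'teacher'")

    mismatched = weight * one_hot          # GT one-hot value times the weight
    # Broadcast the (H, W) mask over the label axis and pick per pixel.
    return np.where(match[None], matched, mismatched)
```

With a two-label, two-pixel input where the teacher is right on the first pixel and wrong on the second, the first pixel keeps the teacher distribution and the second pixel is replaced by the weighted GT one-hot vector.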

The method of claim 1,
wherein the adaptive probability map includes as many matrices as the number of labels, each matrix having a size equal to the number of pixels in the horizontal direction and the number of pixels in the vertical direction of the input image.

The method of claim 1,
wherein the adaptive probability map for the first pixel included in the input image is,
when the label predicted by the teacher network for the first pixel matches the label of the GT, the sum of the value of the probability map of the teacher network and the value of the one-hot encoding vector of the GT.

The method of claim 1,
wherein the adaptive probability map for the first pixel included in the input image is,
when the label predicted by the teacher network for the first pixel matches the label of the GT, the value of the probability map of the teacher network.

The method of claim 1,
wherein the probability map of the teacher network is a softmax value of the feature map of the last layer of the teacher network.
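
As a rough illustration of how a last-layer feature map becomes a probability map, the softmax can be taken over the label axis at every pixel. The function name `probability_map` and the `(C, H, W)` layout are assumptions made for this sketch, and the max-subtraction step is a common numerical-stability measure rather than something stated in the source.

```python
import numpy as np

def probability_map(last_layer_features):
    """Softmax over the label axis of a (C, H, W) last-layer feature map (assumed layout)."""
    # Subtracting the per-pixel maximum leaves the softmax unchanged but avoids overflow.
    shifted = last_layer_features - last_layer_features.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    # Normalize so that the label probabilities at each pixel sum to 1.
    return exp / exp.sum(axis=0, keepdims=True)
```

Each pixel of the result holds one probability per label, so the output has the same shape as the input feature map.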

The method of claim 1,
wherein the adaptive probability map for the first pixel included in the input image is,
when the label predicted by the teacher network for the first pixel does not match the label of the GT, the value obtained by multiplying the value of the one-hot encoding vector of the GT by a weight.

The method of claim 1,
wherein the adaptive cross entropy loss function for a pixel included in the input image is
a value obtained by multiplying, for each of the plurality of labels, the value of the adaptive probability map by the log value of the probability map of the student network, and then summing all of the products.

The method of claim 6,
wherein the adaptive cross entropy loss function is
a value obtained by summing the adaptive cross entropy loss function over all pixels included in the input image and then dividing the sum by the number of pixels in the horizontal direction and the number of pixels in the vertical direction of the input image.
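
The per-pixel sum over labels and the averaging over all pixels described in the claims can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `adaptive_cross_entropy`, the `(C, H, W)` layout, and the small `eps` guard against `log(0)` are assumptions. The leading minus sign follows the usual cross entropy convention and is also an assumption, since the claim text only speaks of multiplying and summing.

```python
import numpy as np

def adaptive_cross_entropy(adaptive_map, student_prob, eps=1e-12):
    """Average adaptive cross entropy over an image.

    adaptive_map: (C, H, W) adaptive probability map (assumed layout).
    student_prob: (C, H, W) student probability map.
    eps:          hypothetical guard so that log(0) is never evaluated.
    """
    # Per pixel: multiply the adaptive value by the log of the student
    # probability for every label, then sum over the label axis.
    # The minus sign is the conventional cross entropy sign (assumption).
    per_pixel = -(adaptive_map * np.log(student_prob + eps)).sum(axis=0)  # (H, W)
    height, width = per_pixel.shape
    # Sum over all pixels, then divide by width times height.
    return per_pixel.sum() / (width * height)
```

When the adaptive map is a plain one-hot GT vector, this reduces to the ordinary cross entropy averaged over pixels, which is a convenient sanity check for the sketch.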

The method of claim 1,
further comprising training the student network using the adaptive cross entropy loss function.

A computer-readable storage medium storing a computer program that, when executed by a processor, performs the method according to any one of claims 1 to 9.