KR102634166B1

KR102634166B1 - Face recognition apparatus using multi-scale convolution block layer

Info

Publication number: KR102634166B1
Application number: KR1020160127570A
Authority: KR
Inventors: 박준영; 노승인; 강봉남; 김대진
Original assignee: 한화비전 주식회사; 포항공과대학교 산학협력단
Priority date: 2016-10-04
Filing date: 2016-10-04
Publication date: 2024-02-08
Also published as: KR20180037436A

Abstract

The present invention relates to a face recognition apparatus comprising: an input unit that receives an image; a training unit that trains a convolutional neural network composed of a plurality of multi-scale convolution blocks; a feature extraction unit that extracts features from the input image using the trained convolutional neural network; an identity determination unit that recognizes the identity of a face by comparing the extracted features with a reference image; and an output unit that displays the recognition result, thereby improving face recognition performance.

Description

Face recognition apparatus using multi-scale convolution block layer

The present invention relates to a face recognition apparatus using multi-scale convolution block layers and, more specifically, to a face recognition apparatus that recognizes faces in images by learning facial features through an artificial neural network composed of multi-scale convolution block layers.

Face recognition technology detects a face in an image and then determines whose face it is from its features. It is an essential technology in a variety of applications, such as identity verification from video, facial expression recognition, drowsiness detection, and automatic photo classification.

In pattern recognition, and in face recognition in particular, performance depends closely on how well the features that represent facial landmarks such as the eyes, nose, and mouth are designed. Conventional face recognition therefore relied on feature engineering, in which features are devised by hand at the design stage, to improve recognition performance.

Feature engineering requires features to be devised by hand, and such handcrafted features depend heavily on the designer's domain knowledge rather than on the characteristics of the data. In addition, the features are designed separately from the way the recognition classifier operates, which creates a mismatch between the two.

Handcrafted features also have many parameters to tune, and it is not easy to adjust them manually in the desired direction. For face recognition, for example, features must be designed so that they are similar for the same person and dissimilar for different people, and devising such features by the designer's judgment alone is very difficult.

Meanwhile, learning models have recently begun to be applied to many vision applications, including human action recognition.

Korean Patent No. 1563569

The problem to be solved by the present invention is to provide a face recognition apparatus that trains features and the recognition classifier that uses them at the same time.

The problems of the present invention are not limited to the problem mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

A face recognition apparatus for solving the above problem may include: an input unit that receives an image; a training unit that trains a convolutional neural network composed of a plurality of multi-scale convolution blocks; a feature extraction unit that extracts features from the input image using the trained convolutional neural network; an identity determination unit that recognizes the identity of a face by comparing the extracted features with a reference image; and an output unit that displays the recognition result.

The training unit may train the convolutional neural network based on the extracted features.

The input unit may receive training triplet images, the feature extraction unit may extract features from the training triplet images, and the training unit may train the convolutional neural network based on the features extracted from the training triplet images.

The training unit trains the convolutional neural network by applying a loss function to the features extracted from the training triplet images and backpropagating the result, where the loss function may be defined as in Equation 1 below.

The training unit may further include an auxiliary loss function and may further train the convolutional neural network by applying the auxiliary loss function to the features extracted from the training triplet images and backpropagating the result.

The convolutional neural network may include a plurality of convolutional neural networks, and the training unit may form an ensemble from the plurality of features extracted using the plurality of convolutional neural networks, reduce the dimensionality of the ensemble, and train a classifier, further included in the training unit, based on the dimension-reduced ensemble.

Other specific details of the present invention are included in the detailed description and the drawings.

Embodiments of the present invention provide at least the following effect.

Through learning, the features are refined to suit the recognition classifier, which improves face recognition performance.

The effects of the present invention are not limited to those exemplified above, and further effects are included in this specification. Other effects not mentioned will be clearly understood by those skilled in the art from the description of the claims.

FIG. 1 is a block diagram showing the overall configuration of a face recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing the operation of a face recognition apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram showing the convolutional neural network of a face recognition apparatus according to an embodiment of the present invention.
FIGS. 4A and 4B are diagrams showing configuration examples of the multi-scale convolution block of a face recognition apparatus according to one embodiment and another embodiment of the present invention.
FIG. 5 is a diagram showing a process in which a face recognition apparatus according to an embodiment of the present invention is trained using triplet images.
FIG. 6 is a diagram showing triplet images of a face recognition apparatus according to an embodiment of the present invention and the result of training on the triplet images with a loss function.
FIG. 7 is a diagram showing a process in which a face recognition apparatus according to an embodiment of the present invention extracts features and is trained using an auxiliary loss function.
FIG. 8 is a diagram showing a process in which a face recognition apparatus according to yet another embodiment of the present invention forms an ensemble and is trained.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive sense unless expressly and specifically defined.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular also includes the plural unless the context specifically states otherwise. As used in this specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more elements other than those mentioned.

The embodiments described in this specification are explained with reference to cross-sectional views and/or schematic diagrams that are idealized illustrations of the present invention. The form of the illustrations may therefore be modified by manufacturing techniques and/or tolerances. In each drawing, elements may be shown somewhat enlarged or reduced for convenience of explanation. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the mentioned items and every combination of one or more of them.

Spatially relative terms should be understood to encompass different orientations of the elements in use or operation in addition to the orientation shown in the drawings. Elements may also be oriented in other directions, in which case spatially relative terms are interpreted according to the orientation.

Hereinafter, the configuration of a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the overall configuration of a face recognition apparatus 1 according to an embodiment of the present invention.

Referring to FIG. 1, the face recognition apparatus 1 of the present invention includes an input unit 11, a control unit 12, a storage unit 13, and an output unit 14.

The input unit 11 receives an image from the outside. For example, an image may be acquired by photographing the face of the person to be detected with an image acquisition device such as a camera or camcorder. When the input unit 11 captures an image through an image acquisition device, it may include a light-collecting unit that collects light, an imaging unit that senses the collected light and converts the sensed light signal into an electrical signal, and an A/D converter that converts the electrical signal into a digital signal.

The imaging unit performs functions such as exposure and gamma adjustment, gain adjustment, white balance, and color matrix processing, and generally includes an imaging element such as a CCD (Charge Coupled Device) or CMOS image sensor. In a CCD, when light strikes a plurality of photodiodes, the electrons generated by the photoelectric effect are accumulated and then transferred. The image information that makes up the picture is generated by analyzing how the number of generated electrons varies with the number of photons and reconstructing that information. A CCD offers sharp image quality and low noise, but has high power consumption and slow processing speed.

A CMOS image sensor is an image sensor that uses CMOS (Complementary Metal Oxide Semiconductor) technology. Each cell has its own amplifier, so the electrons generated by light are amplified directly into an electrical signal and transferred. CMOS image sensors are inexpensive, consume little power, and process quickly, but have the disadvantage of more noise.

The input unit 11 may also receive an image that already exists as a file. In this case, the input unit 11 may include an input device such as a keyboard or mouse, or a touch screen, and may receive the stored image file from a storage device.

The input unit 11 may receive training triplet images 23 and same-person pair images. The processing of the received training triplet images 23 and same-person pair images is described later with reference to FIGS. 5 and 6.

The control unit 12 controls the overall operation of the face recognition apparatus 1. For example, the control unit 12 extracts features from the input image 21 received from the input unit 11, recognizes the face, and trains the feature extraction process. To indicate where a feature is located on the displayed image, it also sets a frame that encloses the extracted feature region. FIG. 2 shows the frame as a rectangle, but the frame is not limited to this and may have various shapes, such as a circle, ellipse, or polygon, as long as it indicates the feature region clearly. The control unit 12 displays the output image 22 through the output unit 14. It also stores the image received from the input unit 11, or a reference image and its corresponding information, in the storage unit 13, and controls the loading of those images from the storage unit 13. The control unit 12 according to an embodiment of the present invention is preferably a CPU (Central Processing Unit), MCU (Micro Controller Unit), or DSP (Digital Signal Processor), but is not limited to these, and various logic processors may be used.

The control unit 12 includes a feature extraction unit 121, an identity determination unit 123, and a training unit 122.

The feature extraction unit 121 is the component that extracts features from the input image. It extracts features by convolving the input image multiple times through a deep convolutional neural network 34 composed of a plurality of multi-scale convolution blocks 33. How the feature extraction unit 121 extracts features using the convolutional neural network 34 is described in more detail later with reference to FIGS. 3 to 4B.

The identity determination unit 123 is the component that performs face recognition. It compares the features extracted by the feature extraction unit 121 with the features of reference images stored in advance in the storage unit 13, determines whether they are identical, and thereby recognizes who the person is. The criteria for determining the identity of features include, but are not limited to, the position, type, and size of the features.
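The comparison against stored reference features can be pictured with a minimal sketch. It assumes each face is represented by a fixed-length feature vector and that two faces are treated as the same person when the Euclidean distance between their vectors falls below a threshold. The distance measure, the threshold, and the names `identify` and `references` are illustrative assumptions and are not taken from the specification.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(query, references, threshold):
    """Return the label of the closest reference feature vector,
    or None when even the closest one is farther than `threshold`.

    `references` maps a label (for example a name or ID) to a feature
    vector; this layout is a hypothetical choice for the sketch.
    """
    best_label, best_dist = None, float("inf")
    for label, feature in references.items():
        dist = euclidean_distance(query, feature)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None


# Hypothetical two-dimensional features for two enrolled people.
references = {"person_a": [0.0, 1.0], "person_b": [1.0, 0.0]}
match = identify([0.1, 0.9], references, threshold=0.5)
```

Returning `None` for a query that is far from every reference lets a caller distinguish an unknown face from a known one.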

The identity determination unit 123 passes the recognition result to the output unit 14. The result may include, but is not limited to, information such as a name, ID, or employee number that indicates which reference image the input image corresponds to. The identity determination unit 123 receives this information from the storage unit 13 and passes it to the output unit 14.

The training unit 122 trains the convolutional neural network 34 included in the feature extraction unit 121 based on the features extracted by the feature extraction unit 121. The training unit 122 compares the output of the feature extraction unit 121 with the originally desired result and uses the difference (error) to train the feature extraction unit 121. It therefore uses backpropagation, which trains the network backward from the output of the feedforward network. The extracted features that the training unit 122 uses to train the convolutional neural network 34 may be features extracted from the training triplet images 23 described later, or features that the feature extraction unit 121 extracted from an image input to the input unit 11 for face recognition. The specific way in which the training unit 122 trains the convolutional neural network 34 is described in detail later with reference to FIGS. 5 and 6.

The output unit 14 receives from the identity determination unit 123, and displays, the image in which facial features were detected together with the result of recognizing whose face it is and determining identity. As shown in FIG. 2, the displayed image of the recognized face shows a frame indicating the face and the positions of the features. The output unit 14 may be any of various types of device, such as an LCD, monitor, projector, TV, or printer, as long as it can display the recognized face.

The storage unit 13 may store the image received from the input unit 11, or the face and features recognized by the control unit 12. It may also store the reference images used to recognize a face and determine who the corresponding person is, the features of the reference images, and the information corresponding to the reference images.

The storage unit 13 is preferably flash memory, which is compact and resistant to external shock, but is not limited to this and may include various storage devices such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or SD (Secure Digital) card.

FIG. 2 is a diagram schematically showing the operation of the face recognition apparatus 1 according to an embodiment of the present invention.

Referring to FIG. 2, an input image 21, which is an image of a person's face, is input through the input unit 11 either by capture or as an existing file. The input unit 11 passes the input image 21 to the control unit 12. The control unit 12 extracts features from the image through the convolutional neural network 34 and compares the extracted features with the reference images and features stored in the storage unit 13 to determine the identity of the person. The control unit 12 passes the recognition result to the output unit 14, and the output unit 14 displays the boxed features and the information corresponding to the input image 21 together with the input image 21, so that the user can see which features were extracted and which person they correspond to. Meanwhile, the control unit 12 uses the plurality of extracted features to train the convolutional neural network 34 that performs feature extraction. Through this training, the convolutional neural network 34 is improved so that it extracts features with better performance when the next input image 21 arrives.

Hereinafter, the process by which the feature extraction unit 121 extracts features is described in detail with reference to FIG. 3.

FIG. 3 is a diagram showing the convolutional neural network 34 of the face recognition apparatus 1 according to an embodiment of the present invention.

Referring to FIG. 3, the deep convolutional neural network 34, which carries out the feature extraction process of the feature extraction unit 121 of the face recognition apparatus 1 according to an embodiment of the present invention, is built by stacking a plurality of multi-scale convolution blocks 33. The feature extraction process is formed by connecting the convolutional neural network 34 to an input layer 31 and an output layer 32.

The input layer 31 receives the image passed through the input unit 11 in the form of data. Because image data holds the width, height, and color information of the image, it can be represented as a three-dimensional matrix, and when multiple images are used, the whole set of image data can be input as four-dimensional tensor data. Color information may be expressed in RGB (Red, Green, Blue), CMYK (Cyan, Magenta, Yellow, blacK), or the like, but is not limited to these, and the way the image data is organized is not limited to the above.
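As a small illustration of these shapes, the sketch below builds one image as a nested list indexed by height, width, and color channel, then stacks several images into a batch. The (height, width, channels) ordering, the three RGB channels, and the helper names `make_image` and `shape` are assumptions made for the example. The specification does not fix a memory layout.

```python
def make_image(height, width, channels=3, fill=0):
    """Build a height x width x channels nested list filled with `fill`."""
    return [[[fill] * channels for _ in range(width)] for _ in range(height)]


def shape(nested):
    """Return the dimensions of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


image = make_image(4, 5)                       # one RGB image: 3-D
batch = [make_image(4, 5) for _ in range(2)]   # two images: 4-D

print(shape(image))  # (4, 5, 3)
print(shape(batch))  # (2, 4, 5, 3)
```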

For simplicity, FIG. 3 shows the convolutional neural network 34 as a stack of multi-scale convolution blocks 33, but the configuration is not limited to this and may further include ordinary convolution blocks. The detailed configuration of the convolution blocks is described later with reference to FIGS. 4A and 4B.

At the end of each multi-scale convolution block 33, the feature extraction unit 121 may further perform a batch normalization operation, which normalizes the data per mini-batch, multiplies the result by a scale factor, and adds a shift factor. Batch normalization is used to counter internal covariate shift, the phenomenon in which the distribution of the inputs to each layer of the network changes inconsistently, which increases the complexity of learning and can cause gradients to vanish or explode. It is also used to keep normalization from eliminating the nonlinearity of the activation function.
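A minimal sketch of the batch normalization step for a single feature over one mini-batch follows. The parameter names `gamma` (scale factor), `beta` (shift factor), and `eps` (a small constant for numerical stability) follow common usage and are assumptions, as is the use of the biased mini-batch variance. The specification only states that the normalized result is scaled and shifted.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.

    values: the feature's value for each sample in the mini-batch.
    gamma:  scale factor applied after normalization.
    beta:   shift factor added after scaling.
    eps:    small constant that avoids division by zero.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=3.0)
```

With the default `gamma` and `beta`, the output has a mean of approximately zero and a variance of approximately one. In a trained network the two factors are learned, which lets the layer recover a different scale and offset when that helps.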

At the end of each multi-scale convolution block 33, the feature extraction unit 121 may further perform an operation using an activation function. An activation function is a function that weights and combines multiple pieces of input information and outputs the resulting value. The data produced by each convolution may be used directly as the input to the activation function, or the data produced by the batch normalization step may be used as its input. The ReLU (Rectified Linear Unit) function may be used as the activation function, but the activation function is not limited to this.
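To make the description concrete, here is a minimal sketch of a weighted combination of inputs followed by ReLU, which returns its argument when positive and zero otherwise. The function name `weighted_relu` and the optional `bias` term are illustrative assumptions.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def weighted_relu(inputs, weights, bias=0.0):
    """Weight and sum the inputs, then apply ReLU to the total."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(total)


print(weighted_relu([1.0, 2.0], [0.5, 1.0]))   # 2.5
print(weighted_relu([1.0, 2.0], [0.5, -1.0]))  # 0.0
```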

도면에서 확인할 수 있듯이 컨볼루션 신경망(34)은 복수의 다중 크기 컨볼루션 블록(33)을 적층함으로써 구성할 수 있다. 따라서 이미지 데이터가 다중 크기 컨볼루션 블록(33)을 거쳐갈수록 상위 레벨의 피쳐를 표현하게 된다. 여기서 상위 레벨 피쳐란 미세한 영역의 점과 같은 미세한 특징이 아닌 눈, 코, 입과 같은 보다 거시적인 특징을 의미한다.As can be seen in the drawing, the convolutional neural network 34 can be constructed by stacking a plurality of multi-sized convolutional blocks 33. Therefore, as image data passes through the multi-size convolution block 33, higher-level features are expressed. Here, high-level features refer to more macroscopic features such as eyes, nose, and mouth, rather than microscopic features such as dots in a microscopic area.

출력층(32)에는 컨볼루션 신경망(34)을 통해 피쳐 추출이 완료된 이미지 데이터가 전달된다. 얼굴의 검출이 완료된 것으로, 이와 같이 처리된 이미지 데이터를 출력층(32)에서 동일성 판단부(123)로 전달함으로써 얼굴 인식 단계로 넘어간다.Image data from which feature extraction has been completed is delivered to the output layer 32 through the convolutional neural network 34. Now that the face detection has been completed, the processed image data is transferred from the output layer 32 to the identity determination unit 123 to proceed to the face recognition step.

Hereinafter, the configuration and operation of the multi-size convolution block (33) are described in detail with reference to FIGS. 4A and 4B.

FIGS. 4A and 4B are diagrams showing configuration examples of the multi-size convolution block (33) of the face recognition device (1) according to one embodiment and another embodiment of the present invention.

Referring to FIGS. 4A and 4B, the multi-size convolution block (33) of the present invention is composed of a plurality of operation layers (331, 332, 333), which operate on the image received from the input layer (311, 312) and pass the result to the output layer (321, 322). In the drawings the operation layers (331, 332, 333) consist of a 1x1 convolution layer (331), a 3x3 convolution layer (332), and a 3x3 max pooling layer (333), but the types of operation layers are not limited thereto and may include an average pooling layer, a 5x5 convolution layer, and the like.

The convolution layers (331, 332) compute their result values by taking the matrix product of a kernel of a given size and the image data. For the 1x1 convolution layer (331), the kernel is a 1x1 matrix. The number of positions by which the kernel is moved at each step is called the stride, and the dimensions of the result matrix are determined by the stride and the kernel size. The 1x1 convolution layer (331) of the present invention may serve to reduce the dimensionality of the input data, and multi-size features may be expressed by performing a convolution over a local region followed by a convolution over a wider region, although its function is not limited thereto.
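The relationship between kernel size, stride, and output size can be illustrated with a small sketch. It assumes that the product of kernel and image data is the usual sliding-window sum of element-wise products, with no padding, so that an input of width W and a kernel of width K with stride S give an output of width floor((W - K) / S) + 1. The function name `conv2d` and the toy data are assumptions for illustration.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over the image (no padding) and sum the element-wise products."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * image[i * stride + a][j * stride + b]
            row.append(acc)
        out.append(row)
    return out

image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4x4 toy input, values 0..15
box = [[1.0] * 3 for _ in range(3)]                               # 3x3 kernel of ones
result_3x3 = conv2d(image, box, stride=1)      # (4 - 3) // 1 + 1 = 2, so a 2x2 output
result_1x1 = conv2d(image, [[2.0]], stride=2)  # (4 - 1) // 2 + 1 = 2, so a 2x2 output
```

With a stride of 2 the 1x1 kernel keeps every second pixel and scales it, which shows how the stride alone shrinks the result matrix.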

The max pooling layer (333) extracts and outputs only the largest value within the region covered by a given kernel. Because only the largest value is kept, the remaining values are discarded, and the dimensionality is reduced, this operation is also called downsampling.
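A minimal sketch of max pooling follows. The function name `max_pool2d`, the 2x2 window, and the stride of 2 are assumptions chosen to make the downsampling visible; the drawings use a 3x3 max pooling layer (333).

```python
def max_pool2d(image, size=2, stride=2):
    """Keep only the largest value inside each size x size window (no padding)."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(image[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

pooled = max_pool2d([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 8],
])
```

The 4x4 input is reduced to a 2x2 output, each entry being the maximum of one window.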

Because single or stacked operations of these operation layers (331, 332, 333) are carried out along a plurality of separate paths, the block is a multi-scale convolution block. The multi-size convolution block (33) operates by normalizing the data computed along these paths, or passing them through an activation function, and sending them out as the output image (22) data of the output layer (321, 322).
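The path structure described above can be sketched on a one-dimensional signal. This is a toy under stated assumptions: the helper names (`conv1d_same`, `max_pool1d_same`, `multi_scale_block`), the zero padding, the particular choice of three paths, and the collection of path outputs as separate channels are all introduced for illustration; the specification does not state how the path outputs are merged.

```python
def conv1d_same(signal, kernel):
    """1-D sliding-window sum with zero padding, so the length is preserved."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def max_pool1d_same(signal, size=3):
    """1-D max pooling with stride 1; padding with -inf never wins the maximum."""
    pad = size // 2
    padded = [float("-inf")] * pad + list(signal) + [float("-inf")] * pad
    return [max(padded[i:i + size]) for i in range(len(signal))]

def multi_scale_block(signal, paths):
    """Run the input through every path; each path is a stack of operation layers."""
    outputs = []
    for path in paths:
        data = signal
        for layer in path:
            data = layer(data)
        outputs.append(data)
    return outputs  # one channel per path (assumed merge strategy)

paths = [
    [lambda s: conv1d_same(s, [0.5])],                                  # width-1 convolution only
    [lambda s: conv1d_same(s, [0.5]),
     lambda s: conv1d_same(s, [1.0, 1.0, 1.0])],                        # width-1 then width-3
    [lambda s: max_pool1d_same(s, 3), lambda s: conv1d_same(s, [0.5])], # pooling then width-1
]
channels = multi_scale_block([1.0, 2.0, 3.0, 4.0], paths)
```

Each path sees the same input but covers a different receptive field, which is the sense in which the block is multi-scale.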

The types and arrangement of the operation layers (331, 332, 333) constituting the multi-size convolution block (33) are not limited to the examples of FIGS. 4A and 4B; various arrangements are possible depending on the size of the input data and the purpose.

Hereinafter, a method of training the convolutional neural network (34) of the present invention using the training triplet images (23) and the loss function (4) is described with reference to FIGS. 5 and 6.

FIG. 5 is a diagram showing the process by which the face recognition device (1) according to an embodiment of the present invention is trained using the training triplet images (23).

Referring to FIG. 5, the sample images provided for training the convolutional neural network (34) of the present invention are organized as triplets. A training triplet (23) consists of a reference face image (231), a same face image (232) (positive face image), and a different face image (233) (negative face image). The training triplet images (23) are therefore preferably sorted in advance according to a specific criterion and provided as sets.

The reference face image (231) and the same face image (232) are face images of the same person. The convolutional neural network (34) therefore needs to be trained so that, after feature extraction, their features are judged to be similar. The different face image (233), in contrast, is a face image of a person other than the one shown in the reference face image (231) and the same face image (232). The convolutional neural network (34) therefore needs to be trained so that, after feature extraction, its features are judged to differ from those extracted from the reference face image (231) and the same face image (232).
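One possible way to arrange labeled images into such triplets in advance is sketched below. The names `Triplet` and `build_triplets`, the dictionary layout, and the image identifiers are hypothetical; the specification only states that the triplets are prepared according to a specific criterion.

```python
from itertools import permutations
from typing import NamedTuple

class Triplet(NamedTuple):
    reference: str  # reference face image (231)
    positive: str   # same face image (232)
    negative: str   # different face image (233)

def build_triplets(images_by_person):
    """Pair every two images of one person with every image of every other person."""
    triplets = []
    for person, images in images_by_person.items():
        for ref, pos in permutations(images, 2):
            for other, other_images in images_by_person.items():
                if other != person:
                    triplets.extend(Triplet(ref, pos, neg) for neg in other_images)
    return triplets

# Hypothetical data set: two images of one person, one image of another.
triplets = build_triplets({"alice": ["a1", "a2"], "bob": ["b1"]})
```

A person with only one image cannot supply a reference/positive pair, so only the first person contributes triplets here.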

Each image of the training triplet (23) passes through the convolutional neural network (34) composed of the multi-size convolution blocks (33) described with reference to FIGS. 3 to 4B, and produces an output result at the output layer. These result values are input to a loss function, which measures how far the output results are from the desired results.

A loss function defines the difference between the desired result and the computed result as a loss value and calculates that value. The loss function (4) used in the present invention is defined by Equation 1 below.

Here L is the loss value, and the three image variables denote, in order, the reference face image (231), the different face image (233), and the same face image (232) of the training triplet (23). F is the output result of the convolutional neural network (34), and m is the minimum ratio between the difference of the output results for the reference face image (231) and the same face image (232) and the difference of the output results for the reference face image (231) and the different face image (233). m also serves as the margin in the training process.

The first term on the right-hand side of Equation 1 is defined so that the ratio between the difference of the output results for the reference face image (231) and the same face image (232) and the difference of the output results for the reference face image (231) and the different face image (233) is maximized. To this end, the denominator is the sum of m and the difference of the output results for the reference face image (231) and the same face image (232), and the numerator is that denominator value minus the difference of the output results for the reference face image (231) and the different face image (233). The second term on the right-hand side of Equation 1 is defined so that the difference between the output results for the reference face image (231) and the same face image (232) is minimized. Accordingly, when the training unit (122) trains the convolutional neural network (34) so as to minimize the loss value of Equation 1, which consists of these two terms, a result such as that shown in FIG. 6 is obtained.
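The two terms described above can be sketched in code. This is one reading of the verbal description and not a reproduction of Equation 1: it assumes that the "difference of output results" is the squared Euclidean distance between feature vectors, that the two terms are added with equal weight, and that no clamping is applied. The names `sq_dist` and `triplet_ratio_loss` are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors (assumed distance measure)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_ratio_loss(f_ref, f_pos, f_neg, m=1.0):
    d_pos = sq_dist(f_ref, f_pos)  # reference vs. same face image
    d_neg = sq_dist(f_ref, f_neg)  # reference vs. different face image
    # First term: denominator is d_pos + m; numerator is that value minus d_neg.
    ratio_term = (d_pos + m - d_neg) / (d_pos + m)
    # Second term: pulls the reference and the same face image together.
    return ratio_term + d_pos

# Hypothetical 2-D features: the positive starts far away, the negative close by.
loss_before = triplet_ratio_loss([0.0, 0.0], [0.0, 2.0], [1.0, 0.0])
# The relationship is reversed: the positive is close, the negative is far away.
loss_after = triplet_ratio_loss([0.0, 0.0], [1.0, 0.0], [0.0, 2.0])
```

Under these assumptions the loss falls when the same face image moves closer to the reference and the different face image moves farther away, which is the direction of training the description calls for.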

Hereinafter, a method of training the convolutional neural network (34) using the loss function (4) of the present invention is described with reference to FIG. 6.

FIG. 6 is a diagram showing the training triplet images (23) of the face recognition device (1) according to an embodiment of the present invention and the result of training on them with the loss function (4).

FIG. 6 expresses the differences between the output results of the convolutional neural network (34) for the training triplet images (23) as distances between the images. Referring to FIG. 6, in the training triplet (23) of an embodiment of the present invention, the distance between the reference face image (231) and the same face image (232) is initially larger than the distance between the reference face image (231) and the different face image (233). After training, the relationship is reversed, so that the reference face image (231) and the same face image (232) lie close together and the reference face image (231) and the different face image (233) lie far apart.

For such training, the training unit (122) of the present invention uses backpropagation. Backpropagation applies gradient descent, a function optimization method that finds a local minimum of a function: the derivative of the loss function (4) is computed and passed back to the preceding network layer, and the parameters of the feedforward network are corrected as the computation proceeds toward the lower network layers. The derivative of the loss function (4) of the present invention with respect to the reference face image (231) data is given by Equation 2 below.

The derivative of the loss function (4) of the present invention with respect to the same face image (232) data is given by Equation 3 below.

The derivative of the loss function (4) of the present invention with respect to the different face image (233) data is given by Equation 4 below.

Training proceeds as follows: the training unit (122) passes the results of backpropagation according to the above equations back to the lower layers of the convolutional neural network (34), so that the parameters of the network that operated on the training triplet images (23) are corrected. Because gradient descent is used, each parameter changes gradually over the course of training toward a local minimum, which is its optimal value.

Because memory limits make it impossible to compute with all the data at once, the training unit (122) may perform the backpropagation operation per mini-batch, which can also reduce the number of calculations.
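The per-mini-batch update can be illustrated with a deliberately small example that is unrelated to face data: fitting a single parameter w in y = w * x by gradient descent. The function name `minibatch_gradient_descent`, the mean squared error loss, the learning rate, and the data are assumptions for illustration only.

```python
def minibatch_gradient_descent(xs, ys, batch_size=2, lr=0.05, epochs=20):
    """Fit y = w * x by gradient descent, updating w once per mini-batch."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only.
            grad = sum(2 * x * (w * x - y) for x, y in zip(bx, by)) / len(bx)
            w -= lr * grad  # step against the gradient
    return w

# Toy data generated from y = 2 * x.
w = minibatch_gradient_descent([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Only one mini-batch is needed in memory for each update, and the parameter moves step by step toward the minimum of the loss.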

The present invention may also be trained with same-person image pairs (intra pairs) instead of triplets. In that case, the data set is composed only of the same face image (232) and the reference face image (231), with the different face image (233) excluded from the training triplet (23), and the operation is performed with the loss function (4) described above.

The order of operations in the equations set out in this specification is not limited to what is described, and includes operations that can be regarded as substantially identical, such as replacing multiplication of a value by a factor with division of the value by the multiplicative inverse of that factor.

However, when backpropagation is performed with gradient descent, the loss value propagated to the lower layers of the network may become negligible, so that the gradient vanishes. As a result, the local minimum may never be reached and training efficiency may decline as training continues. A method for solving this problem is described with reference to FIG. 7.

FIG. 7 is a diagram showing the process by which the face recognition device (1) according to an embodiment of the present invention extracts features and is trained using an auxiliary loss function (42).

Referring to FIG. 7, in addition to the learning process of substituting the result values obtained through the convolutional neural network (34) into the loss function (4) described above and performing backpropagation, the training unit (122) of the face recognition device (1) of the present invention may further use an auxiliary loss function (42). To distinguish the two, the loss function (4) described above is called the triplet loss function (41). In the course of stacking the multi-size convolution blocks (33), the data computed up to an intermediate point are processed by a separate convolution block and input to an auxiliary loss function (42), which differs from the triplet loss function (41) of the present invention; its value is computed and backpropagation is performed. One or more auxiliary loss functions (42) may be used, and with the results thus obtained the training unit (122) can compensate for the limited ability of the triplet loss function (41) to reach a local minimum. In one embodiment of the present invention a softmax regression function is selected as the auxiliary loss function (42), but the invention is not limited thereto, and various loss functions may be used depending on the application stage and the learning purpose.
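A softmax-based auxiliary loss of the kind mentioned above might look like the following sketch. It assumes that the auxiliary branch outputs one score (logit) per identity class and that the auxiliary loss is the negative log-probability of the correct class. The function `combined_loss` and its `aux_weight` parameter are purely hypothetical: the specification says only that backpropagation is also performed from the auxiliary loss, not how the losses are weighted or combined.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(logits, label):
    """Negative log-probability assigned to the correct identity class."""
    return -math.log(softmax(logits)[label])

def combined_loss(main_loss, aux_losses, aux_weight=0.3):
    """Add weighted auxiliary losses to the main loss (the weighting is an assumption)."""
    return main_loss + aux_weight * sum(aux_losses)

# Hypothetical logits from an intermediate branch; class 0 is the correct identity.
aux = softmax_loss([2.0, 0.0, 0.0], label=0)
total = combined_loss(main_loss=1.0, aux_losses=[aux])
```

Because the auxiliary loss is attached at an intermediate point, its gradient reaches the lower blocks through fewer layers than the gradient of the main loss does.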

The manner of constructing the convolutional neural network (34) by stacking multi-size convolution blocks (33) is likewise not limited to the example of FIG. 7, and various modifications are possible depending on the purpose.

Hereinafter, the process by which the face recognition device (1) according to still another embodiment of the present invention forms an ensemble and is trained using that ensemble is described with reference to FIG. 8.

FIG. 8 is a diagram showing the process by which the face recognition device (1) according to still another embodiment of the present invention forms an ensemble and is trained.

Referring to FIG. 8, the training unit (122) can collect the features obtained through a plurality of separate convolutional neural networks (34) to form an M-dimensional (M dim) feature (510) data set. A feature data set combined in this way is called an ensemble. The training unit (122) may reduce the M-dimensional ensemble (51) to N dimensions and train the classifier (53) on it. Dimension reduction methods for the M-dimensional ensemble (51) of the present invention include, but are not limited to, principal component analysis (PCA), linear discriminant analysis (LDA), sparse regression, and boosting.
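As a toy illustration of forming an ensemble and reducing its dimension, the sketch below concatenates the features of two hypothetical networks (M = 2) and projects them onto the first principal component (N = 1), found by power iteration. The concatenation step, the function name `pca_first_component`, and the data are assumptions; a practical system would use a numerical library and keep more than one component.

```python
import math

def pca_first_component(rows, iters=50):
    """Return the leading principal direction and the 1-D projection of each row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(dim)]
           for a in range(dim)]
    vec = [1.0] * dim
    for _ in range(iters):  # power iteration converges to the top eigenvector
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    projected = [sum(c[j] * vec[j] for j in range(dim)) for c in centered]
    return vec, projected

# Hypothetical 1-D features of four images from two separately trained networks.
net_a = [[1.0], [2.0], [3.0], [4.0]]
net_b = [[2.0], [4.0], [6.0], [8.0]]
ensemble = [a + b for a, b in zip(net_a, net_b)]        # M = 2 features per image
direction, reduced = pca_first_component(ensemble)      # N = 1 feature per image
```

In this toy data the two networks' features are perfectly correlated, so a single component retains all of the variation.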

Here, the classifier (53) is a component that classifies the extracted features by finding the group to which each feature belongs. Algorithms for the classifier (53) include, but are not limited to, JB/TL (Jointly Bayesian Transfer Learning), SVM (Support Vector Machine), AdaBoost, Random Forest, and Random Fern.

The ensemble reduced to N dimensions (52) contains the features reduced to N dimensions (520) and is used together with the training triplet image (23) data to train the classifier (53). Forming an ensemble from features extracted by a plurality of convolutional neural networks (34), reducing its dimension, and training on it in this way can improve the performance of the classifier (53).

This specification has described, as a representative case, a device and method for extracting, recognizing, and training on features of a person's face, but the device and method of the present invention can be used to extract, recognize, and train on features from a wide range of image patterns, not only faces. The scope of the invention is therefore not limited to face recognition, and the invention can be used in other fields of image pattern recognition.

Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Although the present invention has been described in connection with the preferred embodiments mentioned above, various modifications and variations can be made without departing from the spirit and scope of the invention. The appended claims therefore cover such modifications and variations insofar as they fall within the gist of the present invention.

1: Face recognition device    4: Loss function
11: Input unit    12: Control unit
13: Storage unit    14: Output unit
21: Input image    22: Output image
23: Training triplet images    31, 311, 312: Input layer
32, 321, 322: Output layer    33: Multi-size convolution block
34: Convolutional neural network    41: Triplet loss function
42: Auxiliary loss function    51: M-dimensional ensemble
52: Ensemble reduced to N dimensions    53: Classifier
121: Feature extraction unit    122: Training unit
123: Identity determination unit    231: Reference face image
232: Same face image    233: Different face image
331: 1x1 convolution layer    332: 3x3 convolution layer
333: Max pooling layer    510: M-dimensional feature
520: Feature reduced to N dimensions

Claims

A face recognition device comprising:
an input unit that receives an image;
a training unit that trains a convolutional neural network composed of a plurality of multi-size convolution blocks;
a feature extraction unit that extracts features from the input image using the trained convolutional neural network;
an identity determination unit that recognizes the identity of a face by comparing the extracted features with a reference image; and
an output unit that displays the recognized result,
wherein each of the plurality of multi-size convolution blocks includes:
an input layer that receives an image;
an output layer that outputs the image on which the operation has been completed;
a plurality of paths connecting the input layer and the output layer; and
a plurality of operation layers provided on the plurality of paths to perform operations on the image input to the input layer,
and wherein at least one operation layer selected for each of the plurality of paths from among the plurality of operation layers is stacked on the corresponding path.

The face recognition device according to claim 1,
wherein the training unit trains the convolutional neural network based on the extracted features.

The face recognition device according to claim 1,
wherein the input unit receives training triplet images,
the feature extraction unit extracts features from the training triplet images, and
the training unit trains the convolutional neural network based on the features extracted from the training triplet images.

◈ Claim 4 was abandoned upon payment of the registration fee. ◈

The face recognition device according to claim 3,
wherein the training unit trains the convolutional neural network by backpropagating the features extracted from the training triplet images through a loss function,
the loss function being expressed in terms of a reference face image, a different face image, and a same face image of the training triplet images,
where F is the output result of the convolutional neural network, and
m is the minimum ratio between the difference of the output results for the reference face image and the same face image and the difference of the output results for the reference face image and the different face image.

◈ Claim 5 was abandoned upon payment of the registration fee. ◈

The face recognition device according to claim 4,
wherein the training unit further includes an auxiliary loss function and further trains the convolutional neural network by backpropagating the features extracted from the training triplet images through the auxiliary loss function.

◈ Claim 6 was abandoned upon payment of the registration fee. ◈

The face recognition device according to claim 1,
wherein the convolutional neural network includes a plurality of convolutional neural networks, and
the training unit forms an ensemble from a plurality of features extracted using the plurality of convolutional neural networks, reduces the dimension of the ensemble thus formed, and trains a classifier, which the training unit further includes, based on the ensemble with the reduced dimension.