KR20200029659A

KR20200029659A - Method and apparatus for face recognition

Info

Publication number: KR20200029659A
Application number: KR1020180106659A
Authority: KR
Inventors: 김대진; 강봉남
Original assignee: 포항공과대학교 산학협력단
Priority date: 2018-09-06
Filing date: 2018-09-06
Publication date: 2020-03-19
Also published as: KR102244013B9; KR102244013B1

Abstract

Disclosed are a face recognition method and an apparatus therefor. The apparatus comprises: a memory storing an instruction for receiving an image from an external server, an instruction for extracting a valid image, an instruction for aligning the valid image, an instruction for extracting a global appearance feature by training a convolutional neural network, an instruction for extracting a relational local feature by training a pair-relational network, and an instruction for embedding an identity identification feature; and a processor for executing at least one instruction stored in the memory. The relational local feature is extracted by combining the distinctive features found in local parts of the facial area of a subject in the image. The extracted relational local feature is then combined with the global appearance feature, which represents the entire facial area, thereby improving how reliably the subject is matched against a pre-stored user.

Description

Face recognition method and apparatus {METHOD AND APPARATUS FOR FACE RECOGNITION}

The present invention relates to a face recognition method and apparatus, and more particularly to a face recognition method and apparatus using an artificial neural network.

Biometric recognition is a technology that identifies a specific person using unique physical features such as fingerprints, the face, the iris, and veins.

Unlike a key or a password, such features are difficult for others to steal or copy and carry no risk of being changed or lost, so biometric recognition is widely used in the security field today.

Among the various physical features, face recognition has attracted attention as a major research topic because, compared with other biometric technologies such as iris or vein recognition, its recognition procedure is simple and natural for the user.

Face recognition detects a face region in a photograph or video image and identifies the subject to whom the detected face belongs. However, because a subject's face is easily altered by changes in lighting, pose, and facial expression, or by occlusion, it is difficult to determine from the extracted face region whether the subject is the same person as a pre-registered user.

Conventional face recognition techniques identify a face using a trained convolutional neural network (CNN) model.

However, conventional face recognition using a trained CNN aims to classify photographs or video images into categories. It therefore cannot identify which features are used to recognize a face in the image or which features are highly discriminative, so the accuracy of subject identification is reduced.

An object of the present invention for solving the above problems is to provide a high-speed, high-precision, and highly reliable face recognition method.

Another object of the present invention for solving the above problems is to provide a high-speed, high-precision, and highly reliable face recognition apparatus.

Another object of the present invention for solving the above problems is to provide a high-speed, high-precision, and highly reliable pair-relational network modeling method.

Another object of the present invention for solving the above problems is to provide a high-speed, high-precision, and highly reliable pair-relational network modeling apparatus.

To achieve the above object, a face recognition method according to an embodiment of the present invention includes: receiving a video image in which the face of a subject to be identified is captured; normalizing the video image; inputting the video image into a convolutional neural network (CNN) trained to extract a plurality of facial feature points, thereby deriving a feature map that includes the facial feature points in the video image; applying global average pooling (GAP) to the feature map to output a global appearance feature that represents the appearance of the subject's entire face in the video image; inputting the feature map into a pair-relational network (PRN, Pairwise Related Network) to extract a plurality of local appearance features and form relation pairs; and embedding an identity identification feature into the relation pairs to extract a relational local feature.

Here, the trained pair-relational network may be a model whose weights have been learned with a loss function derived from the global appearance feature and the relational local feature of the learning image.

The face recognition method according to an embodiment of the present invention may further include aligning the video image before normalizing the video image.

Aligning the video image may include: rotating the video image, using the positions of the subject's two eyes in the video image, so that the in-plane rotation angle (RIP, Rotation in Plane) becomes 0; aligning the X-axis position of the video image using the facial feature points in the video image; and aligning the Y-axis position and size of the video image using the facial feature points in the video image.
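
The rotation step can be illustrated with a short sketch. It is not taken from the patent: the function names, the choice of the eye midpoint as the rotation center, and the use of landmark coordinates rather than pixel data are all assumptions made for illustration.

```python
import math

def roll_angle_deg(left_eye, right_eye):
    """In-plane rotation (RIP) of the line through the two eyes, in degrees."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def level_eyes(landmarks, left_eye, right_eye):
    """Rotate (x, y) landmarks about the eye midpoint so the eye line is horizontal.

    Assumption: the eye midpoint is used as the rotation center.
    """
    theta = math.radians(-roll_angle_deg(left_eye, right_eye))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    rotated = []
    for x, y in landmarks:
        rx, ry = x - cx, y - cy
        rotated.append((cx + rx * cos_t - ry * sin_t,
                        cy + rx * sin_t + ry * cos_t))
    return rotated
```

In a full pipeline the same rotation would be applied to the image pixels; the sketch only transforms the landmark coordinates.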

Here, aligning the X-axis position of the video image may include: extracting, from among the facial feature points, a first feature point located outermost in a first direction; extracting a second feature point located outermost in a second direction opposite to the first direction; and adjusting the X-axis position of the video image so that the first feature point and the second feature point are at the same X-axis distance from the center of the video image.
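
As a minimal sketch of the X-axis step, the horizontal shift can be computed from the two outermost feature points. The function name is hypothetical, and the first and second directions are assumed here to be the negative and positive X directions.

```python
def x_shift_to_center(landmarks, image_width):
    """Horizontal shift that puts the outermost landmarks at equal distance
    from the image center.

    Assumption: the first/second feature points are the landmarks with the
    smallest and largest x coordinates.
    """
    xs = [x for x, _ in landmarks]
    x_min, x_max = min(xs), max(xs)
    center_x = image_width / 2.0
    # After shifting by dx: center_x - (x_min + dx) == (x_max + dx) - center_x
    return center_x - (x_min + x_max) / 2.0
```

For example, with outermost landmarks at x = 20 and x = 90 in a 128-pixel-wide image, the shift is 9, which places them at x = 29 and x = 99, each 35 pixels from the center at x = 64.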

In addition, aligning the Y-axis position and size of the video image may include: extracting a third feature point, which is the midpoint between the subject's two eyes in the video image; extracting a fourth feature point, which is the center of the subject's lips in the video image; and adjusting the size and Y-axis position of the video image using the third feature point and the fourth feature point.

With respect to the Y axis, the video image may be arranged so that the third feature point is spaced 30% downward from the top edge and the fourth feature point is spaced 35% upward from the bottom edge.
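
The two percentages fix both the scale and the vertical offset of the aligned image. The sketch below works this out under the assumption that both percentages are measured relative to the output image height; the function and parameter names are hypothetical.

```python
def y_scale_and_offset(eye_mid_y, mouth_mid_y, image_height,
                       top_ratio=0.30, bottom_ratio=0.35):
    """Return (scale, offset) so that y' = scale * y + offset places the eye
    midpoint at top_ratio of the height from the top and the lip center at
    bottom_ratio of the height from the bottom.

    Assumption: ratios are fractions of the output image height.
    """
    target_eye_y = top_ratio * image_height
    target_mouth_y = (1.0 - bottom_ratio) * image_height
    scale = (target_mouth_y - target_eye_y) / (mouth_mid_y - eye_mid_y)
    offset = target_eye_y - scale * eye_mid_y
    return scale, offset
```

With these ratios the eye-to-lip distance always occupies the remaining 35% of the output height, so a face whose eye midpoint is at y = 100 and lip center at y = 170 is scaled by about 0.7 for a 140-pixel-high output.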

Deriving the feature map may include: computing a per-channel convolution of the normalized video image with a plurality of convolution layers; and applying max pooling to the per-channel convolution.

Here, at least one of the convolution layers may have a bottleneck structure that includes a residual function.
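
To make the idea of a residual bottleneck concrete, the following simplified sketch applies it to the channel vector at a single spatial location. It is an illustration only: every layer is written as a plain linear map over channels (equivalent to a 1x1 convolution at one location), so the spatial convolution that a real bottleneck block has in the middle is replaced by another linear map for brevity, and all names and weights are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, v):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def bottleneck_residual(x, w_reduce, w_mid, w_expand):
    """Compute relu(x + F(x)), where F narrows the channels, transforms
    them, and widens them back to the input width."""
    h = relu(linear(w_reduce, x))   # C channels -> fewer channels
    h = relu(linear(w_mid, h))      # transform in the narrow space
    f = linear(w_expand, h)         # back to C channels
    return relu([a + b for a, b in zip(x, f)])
```

Because the input is added back to F(x), the block passes its input through unchanged when F outputs zero, which is the property that makes deep stacks of such blocks easier to train.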

In outputting the global appearance feature, average pooling may be applied to the feature map using a filter of a specific size.
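
Global average pooling is the case in which the averaging filter covers the whole spatial extent of the feature map, so each channel collapses to one number. A minimal sketch, assuming the feature map is stored as a list of channels, each a list of rows (the layout and function name are assumptions):

```python
def global_average_pooling(feature_map):
    """Average each channel of a [channels][rows][cols] feature map,
    returning one value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled
```

The result is a vector with one entry per channel, which here plays the role of the global appearance feature.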

In addition, forming the relation pairs may include: receiving the feature map output from the convolutional neural network; extracting a local appearance feature group around a plurality of facial feature points in the feature map; and extracting a plurality of the local appearance features from the local appearance feature group to form the relation pairs.

Here, extracting the local appearance feature group may include: setting at least part of the face region in the feature map as a region of interest (ROI) and projecting it; and extracting, from at least one facial feature point located in the region of interest, the local appearance feature group that includes the local appearance features.
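
The pairing step can be sketched as follows. This is a simplification under stated assumptions: each local appearance feature is read from the single feature-map cell at a landmark instead of from a projected region of interest, a relation pair is represented as the concatenation of two local features, and unordered pairs are used. The text above does not fix any of these choices, and the function names are hypothetical.

```python
from itertools import combinations

def local_features_at(feature_map, landmarks):
    """Read the channel vector at each landmark's (row, col) cell of a
    [channels][rows][cols] feature map."""
    return [[channel[r][c] for channel in feature_map] for r, c in landmarks]

def relation_pairs(local_features):
    """Concatenate every unordered pair of local features."""
    return [a + b for a, b in combinations(local_features, 2)]
```

For N landmarks this yields N(N-1)/2 pairs, each twice as long as a single local feature.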

Extracting the relational local feature may further include: embedding the identity identification feature into the relation pairs with a recurrent network based on LSTM (Long Short-Term Memory) units; computing a first weight with a first multi-layer perceptron (MLP) and applying it individually to at least one relational local feature; summing the at least one relational local feature with an aggregation function to extract a predicted relational feature; and computing a second weight with a second multi-layer perceptron and applying it to the predicted relational feature, thereby generating the pair-relational network.

Here, the relational local feature may be provided as a single vector.
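
The per-pair transform, summation, and second transform can be sketched in a few lines. The sketch makes several assumptions: each multi-layer perceptron is reduced to a single dense layer with ReLU, the LSTM-based identity embedding is omitted, and the weights are placeholders supplied by the caller rather than learned values.

```python
def dense_relu(weights, bias, v):
    """One dense layer with ReLU, standing in for a multi-layer perceptron."""
    return [max(0.0, sum(w * x for w, x in zip(row, v)) + b)
            for row, b in zip(weights, bias)]

def relational_feature(pairs, w1, b1, w2, b2):
    """Apply the first layer to every pair, sum the results element-wise,
    then apply the second layer to the sum."""
    per_pair = [dense_relu(w1, b1, p) for p in pairs]
    aggregated = [sum(column) for column in zip(*per_pair)]
    return dense_relu(w2, b2, aggregated)
```

Because the aggregation is a sum, the output is a single vector that does not depend on the order in which the pairs are presented.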

In addition, the LSTM-based recurrent network may include a plurality of fully connected (FC) layers and may be trained using a loss function.

To achieve the above object, a face recognition apparatus according to another embodiment of the present invention includes a processor and a memory storing at least one instruction executed by the processor, wherein the at least one instruction includes: an instruction to receive a video image in which the face of a subject to be identified is captured; an instruction to normalize the video image; an instruction to input the video image into a convolutional neural network trained to extract a plurality of facial feature points, thereby deriving a feature map that includes the facial feature points in the video image; an instruction to apply global average pooling to the feature map to output a global appearance feature that represents the appearance of the subject's entire face in the video image; an instruction to input the feature map into a pair-relational network to extract a plurality of local appearance features and form relation pairs; and an instruction to embed an identity identification feature into the relation pairs to extract a relational local feature.

Here, the processor may align the video image before normalizing it.

In addition, when executing the instruction to derive the feature map, the processor may compute a per-channel convolution of the normalized video image with a plurality of convolution layers and apply max pooling to the per-channel convolution to output the feature map.

Here, at least one of the convolution layers may have a bottleneck structure that includes a residual function.

In addition, when executing the instruction to output the global appearance feature, the processor may apply average pooling to the feature map using a filter of a specific size.

When executing the instruction to form the relation pairs, the processor may receive the feature map output from the convolutional neural network, extract a local appearance feature group around a plurality of facial feature points in the feature map, and extract a plurality of the local appearance features from the local appearance feature group to form the relation pairs.

Here, when extracting the local appearance feature group, the processor may extract a local region within the face region of the video image as a region of interest and, based on the extracted region of interest, extract the local appearance feature group that includes at least one of the local appearance features.

In addition, when generating the relational local feature, the processor may embed the identity identification feature into the relation pairs with an LSTM-based recurrent network, compute a first weight with a first multi-layer perceptron and apply it individually to at least one relational local feature, sum the at least one relational local feature with an aggregation function to extract a predicted relational feature, and compute a second weight with a second multi-layer perceptron and apply it to the predicted relational feature.

To achieve the above object, a pair-relational network modeling method according to yet another embodiment of the present invention includes: receiving, from a trained convolutional neural network, a feature map that includes a plurality of facial feature points; extracting a local appearance feature group around the plurality of facial feature points in the feature map; extracting a plurality of local appearance features from the local appearance feature group to form relation pairs; embedding an identity identification feature into the relation pairs with an LSTM-based recurrent network to extract a relational local feature; and passing a feature that combines the relational local feature with a global appearance feature received from the trained convolutional neural network through a plurality of fully connected layers, and training so that a loss function is minimized.

To achieve the above object, a pair-relational network modeling apparatus according to yet another embodiment of the present invention includes a processor and a memory storing at least one instruction executed by the processor, wherein the at least one instruction includes: an instruction to receive, from a trained convolutional neural network, a feature map that includes a plurality of facial feature points; an instruction to extract a local appearance feature group around the plurality of facial feature points in the feature map; an instruction to extract a plurality of local appearance features from the local appearance feature group to form relation pairs; an instruction to embed an identity identification feature into the relation pairs with an LSTM-based recurrent network to extract a relational local feature; and an instruction to pass a feature that combines the relational local feature with a global appearance feature received from the trained convolutional neural network through a plurality of fully connected layers, and to train so that a loss function is minimized.

The face recognition method and apparatus according to an embodiment of the present invention identify the subject in a video image by combining the global appearance feature output by the convolutional neural network (CNN) with the relational local feature output by the pair-relational network (PRN, Pairwise Related Network), and can thereby provide high-precision, high-accuracy face recognition.

In addition, by normalizing the aligned video image, the face recognition method and apparatus can provide high-precision, highly reliable face recognition that identifies the subject even when the learning image is deformed, for example by a change in the subject's facial expression or pose.

In addition, because at least one convolution layer has a bottleneck structure that includes a residual function, the training time of the convolutional neural network (CNN) model is shortened, and a high-speed face recognition method can be provided.

In addition, the face recognition method and apparatus generate the relational local feature with the pair-relational network (PRN) model and thereby capture the relational structure among the local appearance features that represent the individual parts of the subject's face in the video image, so that a highly accurate face recognition method can be provided.

FIG. 1 is a block diagram of a face recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a face recognition method executed by a processor in a face recognition apparatus according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating the step of extracting learning images according to an embodiment of the present invention.
FIG. 4 is an image illustrating the step of aligning a valid video image according to an embodiment of the present invention.
FIG. 5 is a block diagram illustrating a face recognition method according to an embodiment of the present invention.
FIG. 6 is an image illustrating a convolutional neural network model according to an embodiment of the present invention.
FIG. 7 is a block diagram of a pair-relational network model according to an embodiment of the present invention.
FIG. 8 is an image illustrating the step in which the pair-relational network model forms relation pairs according to an embodiment of the present invention.
FIG. 9 is an image illustrating the step of extracting an identity identification feature according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To aid overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

FIG. 1 is a block diagram of a face recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the face recognition apparatus may include a processor 1000 and a memory 5000.

The processor 1000 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed.

The processor 1000 may execute program commands stored in the memory 5000, which is described below.

In addition, the processor 1000 may change the instructions stored in the memory 5000. According to an embodiment, the processor 1000 may update the information in the memory 5000 through machine learning. In other words, the processor 1000 may change the instructions stored in the memory 5000 through machine learning.

The memory 5000 may consist of a volatile storage medium and/or a non-volatile storage medium. For example, the memory 5000 may consist of read-only memory (ROM) and/or random access memory (RAM).

The memory 5000 may store at least one instruction. More specifically, the memory 5000 may store at least one instruction executed by the processor 1000.

As described above, the memory 5000 may include at least one instruction. According to an embodiment, the memory 5000 may include an instruction to receive video images from an external server S, an instruction to extract valid video images, an instruction to align the valid video images, an instruction to train the convolutional neural network (CNN) model and extract a global appearance feature, an instruction to train the pair-relational network (PRN) model and extract a relational local feature, and an instruction to embed an identity identification feature.

The memory 5000 may store at least one piece of data produced by execution of the processor 1000.

이상 본 발명의 실시예에 따른 얼굴 인식 장치를 살펴보았다. 이하에서는 본 발명의 실시예에 따른 얼굴 인식 방법을 설명하겠다. The face recognition device according to the embodiment of the present invention has been described above. Hereinafter, a face recognition method according to an embodiment of the present invention will be described.

도 2는 본 발명의 실시예에 따른 얼굴 인식 장치 내 프로세서에 의해 실행되는 얼굴 인식 방법의 순서도이다.2 is a flowchart of a face recognition method executed by a processor in a face recognition apparatus according to an embodiment of the present invention.

도 2를 참조하면, 프로세서(1000)는 외부 서버(S)로부터 복수의 영상 이미지들을 수신할 수 있다(S1000). 실시예에 따르면, 영상 이미지는 식별하고자 하는 대상자의 얼굴이 촬영된 컬러 이미지일 수 있으며, 외부 서버(S)는 VGGFace2 DB, Labeled Faces in the Wild(LFW) DB 및 YouTube Face(YTF) DB 중 적어도 하나일 수 있다. Referring to FIG. 2, the processor 1000 may receive a plurality of image images from the external server S (S1000). According to an embodiment, the video image may be a color image of a face of a subject to be identified, and the external server S may include at least one of VGGFace2 DB, Labeled Faces in the Wild (LFW) DB, and YouTube Face (YTF) DB. It can be one.

이후, 프로세서(1000)는 컨볼루셔널 신경망(CNN) 모델을 이용하여, 수신된 영상 이미지로부터 글로벌 외형 특징을 추출할 수 있다(S3000). Thereafter, the processor 1000 may extract a global appearance feature from the received image image using a convolutional neural network (CNN) model (S3000).

More specifically, the processor 1000 may extract a plurality of learning images for training the convolutional neural network (CNN) (S3100). The learning images are image data for training the CNN model and may include training images and validation images. The step of extracting the learning images is described in more detail below with reference to FIG. 3.

FIG. 3 is a diagram for explaining the step of extracting learning images according to an embodiment of the present invention.

Referring to FIG. 3, the processor 1000 may extract a plurality of facial feature points from the received image (S3110). The facial feature points may be used by the CNN model and the pair relational network (PRN) model, both described later, as data for identifying the subject. For example, the processor 1000 may extract the facial feature points using a face detector or a feature point detector.

Here, the processor 1000 may exclude images in which no facial feature points are detected. Accordingly, the processor 1000 may extract valid images that contain a plurality of facial feature points (S3130).

The processor 1000 may align the extracted valid image (S3150). According to an embodiment, the processor 1000 may align the valid image using the plurality of facial feature points within it.

FIG. 4 is a diagram for explaining the step of aligning a valid image according to an embodiment of the present invention.

Referring to FIG. 4, the processor 1000 may rotationally align the valid image (S3151).

More specifically, the processor 1000 may extract the positions of the subject's two eyes in the image. Using the extracted eye positions, the processor 1000 may rotate the valid image so that the two eyes lie on a horizontal line. In other words, the processor 1000 may rotate the valid image so that the in-plane rotation angle (RIP, Rotation in Plane) of the two eyes becomes 0, thereby aligning the image horizontally.
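The rotation step can be illustrated with a minimal sketch. It only rotates the two eye coordinates, not the pixel data, and the function name `rotation_to_level_eyes`, the sample coordinates, and the choice of the eye midpoint as the rotation center are assumptions made for illustration; the source does not specify them.

```python
import math

import numpy as np


def rotation_to_level_eyes(left_eye, right_eye):
    """Return (angle in degrees, 2x2 matrix) that makes the eye line horizontal.

    Hypothetical helper: the source only states that the in-plane rotation
    (RIP) of the two eyes is driven to 0.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.atan2(dy, dx)              # current in-plane angle of the eye line
    c, s = math.cos(-angle), math.sin(-angle)
    rot = np.array([[c, -s], [s, c]])       # rotation by -angle cancels the tilt
    return math.degrees(angle), rot


left_eye = np.array([40.0, 60.0])           # made-up sample coordinates (x, y)
right_eye = np.array([100.0, 80.0])
angle_deg, rot = rotation_to_level_eyes(left_eye, right_eye)

# Rotate both eyes about the midpoint between them (an assumed rotation center).
center = (left_eye + right_eye) / 2.0
new_left = rot @ (left_eye - center) + center
new_right = rot @ (right_eye - center) + center
```

After the rotation the two eyes share the same y coordinate while the distance between them is unchanged. Applying the same rotation to the whole image would require an image library, which is outside this sketch.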

The processor 1000 may perform X-axis alignment on the horizontally aligned valid image (S3151).

More specifically, the processor 1000 may extract a first feature point P_L and a second feature point P_R from among the extracted facial feature points. The first feature point P_L may be the feature point located at the end in a first direction (the -x direction) among the extracted facial feature points. According to an embodiment, P_L may be the leftmost feature point in the valid image. The second feature point P_R may be the feature point located at the end in a second direction (the +x direction) among the extracted facial feature points. For example, P_R may be the rightmost feature point in the valid image.

The processor 1000 may use the first feature point P_L and the second feature point P_R to extract the horizontal center P_W of the face region in the valid image (S2333). Here, the horizontal center P_W may be the point whose distance to the first feature point P_L equals its distance to the second feature point P_R.

The processor 1000 may translate the valid image so that the extracted horizontal center P_W lies at the center of the image along the X axis. The valid image is thereby X-axis aligned.
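As a rough illustration of the X-axis alignment, the sketch below computes P_W as the midpoint of the leftmost and rightmost landmark x coordinates and the horizontal shift that moves it to the image center. The function name `x_shift_to_center`, the landmark values, and the image width of 160 are hypothetical.

```python
import numpy as np


def x_shift_to_center(landmarks, image_width):
    """Horizontal shift (pixels) that moves the face's horizontal center P_W
    to the middle of the image. Hypothetical helper for illustration."""
    xs = landmarks[:, 0]
    p_l = xs.min()                  # x of the leftmost landmark (P_L)
    p_r = xs.max()                  # x of the rightmost landmark (P_R)
    p_w = (p_l + p_r) / 2.0         # equidistant from P_L and P_R
    return image_width / 2.0 - p_w


# Made-up landmarks as (x, y) rows.
landmarks = np.array([[30.0, 70.0], [55.0, 60.0], [95.0, 60.0], [110.0, 72.0]])
shift = x_shift_to_center(landmarks, image_width=160)
shifted = landmarks + np.array([shift, 0.0])   # apply the shift to the landmarks
```

With these sample values P_W is at x = 70, so the face is shifted 10 pixels to the right and its horizontal center lands at x = 80.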

The valid image for which rotational alignment and X-axis alignment are complete may then be Y-axis aligned by the processor 1000 (S3155).

More specifically, the processor 1000 may extract a third feature point E_C and a fourth feature point L_C from the facial feature points in the valid image. The third feature point E_C may be the midpoint between the two eyes in the image, and the fourth feature point L_C may be the midpoint of the mouth in the valid image.

Thereafter, the processor 1000 may align the Y-axis position and the size of the valid image according to the distance ratio of the extracted third feature point E_C and fourth feature point L_C. According to an embodiment, the processor 1000 may adjust the size so that, along the Y axis, the third feature point E_C lies 30% of the image height below the top edge of the valid image and the fourth feature point L_C lies 35% of the image height above the bottom edge.
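The 30% and 35% margins fix both a scale and a vertical offset: E_C must land at 30% of the height and L_C at 65% of the height, so the eye-to-mouth distance becomes 35% of the height. The sketch below works this out under the assumption that the percentages refer to the output image height and that the output height is 140 pixels (the size used in the later resizing step); the function name and input coordinates are hypothetical.

```python
def y_scale_and_offset(eye_mid_y, mouth_mid_y, out_height):
    """Scale s and offset t such that y_new = s * y + t places
    E_C at 30% from the top and L_C at 35% from the bottom."""
    target_eye_y = 0.30 * out_height
    target_mouth_y = (1.0 - 0.35) * out_height
    scale = (target_mouth_y - target_eye_y) / (mouth_mid_y - eye_mid_y)
    offset = target_eye_y - scale * eye_mid_y
    return scale, offset


# Made-up input: eye midpoint at y = 100, mouth midpoint at y = 170.
scale, offset = y_scale_and_offset(eye_mid_y=100.0, mouth_mid_y=170.0, out_height=140)
new_eye_y = scale * 100.0 + offset      # approximately 42 (30% of 140)
new_mouth_y = scale * 170.0 + offset    # approximately 91 (65% of 140)
```

In this example the original eye-to-mouth distance of 70 pixels is mapped to 49 pixels, so the scale is about 0.7.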

By pre-aligning the learning images used to train the convolutional neural network (CNN) and pair relational network (PRN) models, the face recognition method according to an embodiment of the present invention can provide high-precision, high-reliability face recognition that identifies the subject even when the learning images vary, for example through changes in the subject's facial expression or pose.

Referring again to FIG. 3, the processor 1000 may resize the aligned valid image (S3170). For example, the processor 1000 may set the resolution of the valid image to 140 x 140.

Thereafter, the processor 1000 may extract a normalized image (S3190). In other words, the processor 1000 may normalize the pixel (RGB) values in the valid image.

More specifically, according to an embodiment, the processor 1000 may divide each pixel (RGB) value in the valid image by 255 so that each value is normalized to lie between 0 and 1. The processor 1000 may thus normalize the plurality of valid images to generate a plurality of normalized images.
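A minimal sketch of this normalization, assuming 8-bit RGB input held in a NumPy array (the array layout and the function name `normalize_rgb` are assumptions):

```python
import numpy as np


def normalize_rgb(image_uint8):
    """Divide every 8-bit channel value by 255 so that it lies in [0, 1]."""
    return image_uint8.astype(np.float32) / 255.0


# One made-up pixel with channel values 0, 128 and 255.
image = np.array([[[0, 128, 255]]], dtype=np.uint8)
normalized = normalize_rgb(image)
```

The conversion to floating point before the division matters: dividing the unsigned 8-bit array with integer division would collapse almost every value to 0.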

Referring again to FIG. 2, the processor 1000 may generate a convolutional neural network (CNN) model using the generated normalized images (S3500). Accordingly, the processor 1000 may extract a global appearance feature from an image using the generated CNN model (S3000). The step of extracting the global appearance feature with the CNN model is described in more detail below.

FIG. 5 is a block diagram for explaining a face recognition method according to an embodiment of the present invention.

Referring to FIG. 5, the processor 1000 may use the normalized training images and validation images among the valid images to generate a convolutional neural network (CNN) model whose weights are learned by deep learning. The CNN model can therefore output a global appearance feature f^g for distinguishing the identity of the subject in at least one input image.

FIG. 6 is a diagram for explaining a convolutional neural network (CNN) model according to an embodiment of the present invention.

Referring to FIG. 6, the convolutional neural network (CNN) may include convolution layers, a pooling layer, a fully connected layer (fc), and an output layer.

According to an embodiment, the convolutional neural network (CNN) may consist of first to fifth convolution layers. The first to fifth convolution layers may compute a plurality of convolutions by applying a plurality of filters to the input image.

For example, the first convolution layer may apply 64 convolution filters of size 5 x 5 with a stride of 1 to the individual RGB channels of the image.

Thereafter, the second convolution layer may apply max pooling of size 3 x 3 with a stride of 2 to the output of the first convolution layer. By extracting the maximum value within each region of the first convolution layer's output, the second convolution layer can generate a feature map in which local appearance features are emphasized.
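To make the pooling operation concrete, the following sketch applies 3 x 3 max pooling with a stride of 2 to a single small channel. It assumes no padding, which the source does not state, and uses a 7 x 7 toy input rather than a real layer output.

```python
import numpy as np


def max_pool2d(x, size=3, stride=2):
    """Max pooling over one (H, W) channel, without padding (an assumption)."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()    # keep only the strongest response per window
    return out


x = np.arange(49).reshape(7, 7)         # toy channel with values 0..48
pooled = max_pool2d(x)
```

The 7 x 7 input shrinks to 3 x 3, and each output cell holds the largest value of its 3 x 3 window, which is how pooling keeps strong local responses while reducing resolution.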

According to an embodiment, the size of the feature map may be 9 x 9 x 2048, and a local appearance feature may be a facial feature for a local region of the feature map. The feature map with emphasized local appearance features may be used as the input of the pair relational network (PRN) model described later.

In addition, the second to fifth convolution layers may use a bottleneck structure that includes a residual function. This reduces the dimensionality in the second to fifth convolution layers and thus the amount of convolution computation. The face recognition method according to an embodiment of the present invention can therefore shorten the training time of the CNN model and enable fast face identification.

At the output layer of the convolutional neural network (CNN) model, the feature map output from the fifth convolution layer is taken as input, and the global appearance feature f^g may be extracted by a global average pooling layer that applies a 9 x 9 filter to each channel (RGB).
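Averaging a 9 x 9 window over a 9 x 9 feature map amounts to taking the mean of each channel, which yields one value per channel. A minimal sketch, using random numbers as a stand-in for a real feature map and assuming a height x width x channel array layout:

```python
import numpy as np


def global_average_pool(feature_map):
    """Average a (H, W, C) feature map over H and W, giving a (C,) vector."""
    return feature_map.mean(axis=(0, 1))


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((9, 9, 2048))   # stand-in for the CNN output
f_g = global_average_pool(feature_map)            # global appearance feature
```

For the 9 x 9 x 2048 feature map described above, the result is a 2048-dimensional vector.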

The extracted global appearance feature f^g may be combined with the relational local feature generated by the pair relational network (PRN) model described later and used as the input of a loss function described later.

Referring again to FIG. 2, the processor 1000 may extract a relational local feature (S5000). More specifically, as described above, the processor 1000 may generate a pair relational network (PRN) model that extracts the relational local feature (F) from the local appearance features output by the CNN model. The PRN model is described in more detail below with reference to FIG. 7.

FIG. 7 is a block diagram of a pair relational network model according to an embodiment of the present invention.

Referring to FIG. 7, as described above, the pair relational network (PRN) model may extract local appearance features from the feature map output by the CNN model and generate relational local features (r) composed of relation pairs.

FIG. 8 is a diagram for explaining the step in which the pair relational network model forms relation pairs according to an embodiment of the present invention.

Referring to FIG. 8, the pair relational network (PRN) model may receive the feature map extracted by the CNN model. As described above, the feature map extracted by the CNN model may include a plurality of facial feature points.

Thereafter, the PRN model may extract a local appearance feature group F centered on the plurality of feature points in the input feature map.

According to an embodiment, the PRN model may extract regions of interest (ROI) of size 1 x 1, each containing at least one feature point. Here, a region of interest may be a region that represents a specific part of the face in the image.

Thereafter, the PRN model may project the extracted regions of interest (ROI) onto the 9 x 9 x 2048 feature map to extract the local appearance feature group F, which contains a plurality of local appearance features f^l.
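One way to picture the 1 x 1 ROI projection is to map each landmark from image coordinates to a cell of the feature map and read out that cell's channel vector. The proportional mapping below, the 140-pixel image size, and the function name `local_features` are assumptions for illustration; the source does not say how landmark coordinates are mapped to feature map cells. A 4-channel toy feature map replaces the 2048-channel one to keep the example small.

```python
import numpy as np


def local_features(feature_map, landmarks_xy, image_size=140):
    """Read one C-dimensional vector per landmark from a (H, W, C) feature map.

    Assumed mapping: landmark coordinates are scaled proportionally from the
    image grid to the feature map grid and clipped to the last cell.
    """
    grid_h, grid_w, _ = feature_map.shape
    feats = []
    for x, y in landmarks_xy:
        col = min(int(x * grid_w / image_size), grid_w - 1)
        row = min(int(y * grid_h / image_size), grid_h - 1)
        feats.append(feature_map[row, col])     # 1 x 1 ROI -> one feature vector
    return np.stack(feats)


feature_map = np.arange(9 * 9 * 4, dtype=np.float32).reshape(9, 9, 4)
landmarks = [(10.0, 20.0), (139.0, 139.0)]      # made-up (x, y) positions
F = local_features(feature_map, landmarks)      # local appearance feature group
```

Each row of `F` is one local appearance feature f^l; with the real feature map each row would have 2048 entries.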

Referring again to FIG. 7, the PRN model may generate relational local features r_{i,j} from the extracted local appearance features f^l. As described above, a relational local feature r_{i,j} may be a feature formed by pairing local appearance features f^l into a relation pair.

Accordingly, the face recognition method according to an embodiment of the present invention generates the local appearance feature group F with the PRN model and captures the relational structure among the local appearance features f^l that represent the characteristics of each part of the subject's face in the image, which can enable highly accurate recognition of the subject's face.

The pair relational network (PRN) model is described in more detail below with reference to [Equation 1] to [Equation 4].

First, the relational local feature r_{i,j} generated by the PRN model may be expressed, as in [Equation 1] below, as a relation between two local appearance features.
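The image of [Equation 1] did not survive in this text. Based on the symbol definitions that follow it, a consistent reading is the one below; this is an editorial reconstruction from those definitions, not the formula as published.

```latex
r_{i,j} = G_{\theta}\left(P_{i,j}\right), \qquad P_{i,j} = \left(f^{l}_{i},\ f^{l}_{j}\right)
```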

G_θ: multi-layer perceptron (MLP) with weights θ

P_{i,j}: relation pair consisting of the i-th feature (f^l_i) and the j-th feature (f^l_j) in the local appearance feature group F

Thereafter, because the PRN model learns from the relation pairs regardless of the order in which they are combined, it may use an aggregation function, as in [Equation 2] below, so that learning does not depend on the combination order.
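The image of [Equation 2] is also missing. Given that the text describes the aggregation function as summing the relation pairs and later names the result f_agg, a consistent reading is the one below; it is an editorial reconstruction and the exact index range of the sum is an assumption.

```latex
f_{agg} = A\left(r_{i,j}\right) = \sum_{i,j} r_{i,j}
```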

r_{i,j}: relation pair

A(r_{i,j}): aggregation function over the relation pairs

As in [Equation 2], the PRN model may use the aggregation function to sum the collected relation pairs.

Here, the PRN model may form relation pairs from features that can be combined in any order. Accordingly, the PRN model may use the aggregation function to compute the sum of the relation pairs independently of any combination order information.

Thereafter, the PRN model may apply the weights F_Φ to the aggregated relational local features r_{i,j} to form a predicted relational feature model M, as in [Equation 3] below. The weights G_θ and F_Φ of the PRN model may be implemented as multi-layer perceptrons (MLP) with a plurality of neurons per layer. For example, the weights G_θ and F_Φ may be implemented as three-layer MLPs with 1000 neurons per layer.
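The image of [Equation 3] is missing as well. From the definitions of M, f_agg and F_Φ that follow, a consistent reading is the one below; it is an editorial reconstruction, not the published formula.

```latex
M = F_{\Phi}\left(f_{agg}\right)
```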

M: predicted relational feature model

f_agg: aggregated relational feature

F_Φ: multi-layer perceptron (MLP) with weights Φ
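The symbols defined for [Equation 1] to [Equation 3] can be tied together in a small numerical sketch: form all pairs of local features, pass each pair through G_θ, sum the results, and pass the sum through F_Φ. Everything concrete here is an assumption made for illustration: the random untrained weights, the ReLU activation, pairing by concatenation, unordered pairs with i < j, and the tiny layer widths (the embodiment mentions three layers of 1000 neurons).

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)


def make_mlp(sizes):
    """Random weights for a small ReLU MLP; a stand-in for trained parameters."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    for w, b in params:
        x = np.maximum(x @ w + b, 0.0)      # linear layer followed by ReLU
    return x


def relation_pairs(F):
    """Stack every unordered pair (f_i, f_j), i < j, as one concatenated row."""
    return np.stack([np.concatenate([F[i], F[j]])
                     for i, j in itertools.combinations(range(len(F)), 2)])


def prn(F, g_theta, f_phi):
    r = mlp(g_theta, relation_pairs(F))     # one relational feature per pair
    f_agg = r.sum(axis=0)                   # aggregation: sum over all pairs
    return mlp(f_phi, f_agg)                # predicted relational feature M


d, hidden = 8, 16                           # toy sizes
g_theta = make_mlp([2 * d, hidden, hidden, hidden])
f_phi = make_mlp([hidden, hidden, hidden, hidden])
F = rng.standard_normal((5, d))             # five made-up local appearance features
m = prn(F, g_theta, f_phi)
```

Because the pair features are summed, the order in which the pairs are processed does not change `f_agg`, which is the order independence the text attributes to the aggregation function.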

Therefore, referring to [Equation 1] to [Equation 3], the pair relational network (PRN) model can be expressed as in [Equation 4] below.
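The image of [Equation 4] is missing from this text. Composing the relations described for [Equation 1] to [Equation 3] (G_θ applied to each relation pair, summation, then F_Φ) gives the following reading; it is an editorial reconstruction, and the notation PRN(P) for the model output is an assumption.

```latex
PRN(P) = F_{\Phi}\left(\sum_{i,j} G_{\theta}\left(P_{i,j}\right)\right)
```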

Referring again to FIG. 2, the processor 1000 may embed an identity identification feature S_id in the pair relational network (PRN) model (S7000). The step of reflecting the identity identification feature S_id in the PRN model is described in more detail below with reference to FIG. 9.

FIG. 9 is a diagram for explaining the step of embedding an identity identification feature according to an embodiment of the present invention.

Referring to FIG. 9, the relational local features (p_{i,j}) extracted by the PRN model may have distinctive characteristics that allow subjects with different identities to be told apart. The relational local features (p_{i,j}) may therefore depend on the subject. Accordingly, as in [Equation 5] below, the processor 1000 may extract an identity identification feature S_id that represents the subject's characteristic information and embed it in the relation pairs (r_{i,j}), thereby reflecting it in the PRN model.

The processor 1000 may extract the subject's identity identification feature S_id as in [Equation 5] below.
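The image of [Equation 5] is missing, and the surrounding text does not fully determine it. One reading that is consistent with the statement that S_id is embedded in the relation pairs, and with the symbols defined for [Equation 1], is that the identity identification feature becomes an additional input of G_θ. This is a tentative editorial reconstruction and should not be taken as the published formula.

```latex
r_{i,j} = G_{\theta}\left(P_{i,j},\ S_{id}\right)
```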

S_id: identity identification feature

Here, the identity identification feature S_id may be modeled from the local appearance feature group F using a recurrent network E_Ψ based on LSTM (long short-term memory) layers, as in [Equation 6] below.
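The image of [Equation 6] is missing. Since the text states that S_id is modeled from the local appearance feature group F by the recurrent network E_Ψ, and the symbol list that follows defines exactly E_Ψ and F, a consistent reading is the one below; it is an editorial reconstruction.

```latex
S_{id} = E_{\Psi}\left(F\right)
```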

E_Ψ: recurrent network

F: local appearance feature group

Here, the recurrent network E_Ψ may consist of an LSTM layer and fully connected (FC) layers. According to an embodiment, the LSTM layer may have 2048 memory cells, and its output may be the input of a two-layer multi-layer perceptron (MLP) whose layers have 256 and 9630 neurons, respectively.

In addition, the recurrent network E_Ψ may learn the identity identification feature S_id using a cross-entropy loss function. The global appearance feature f^g extracted by the CNN model and the relational local features (p_{i,j}) extracted by the PRN model may be used as inputs of the loss function.

According to an embodiment, the loss function takes as input the feature in which the global appearance feature f^g and the relational local features (p_{i,j}) are combined, and two fully connected (FC) layers may be trained so that the loss is minimized.
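The combination and loss can be sketched as follows, under several assumptions that the source does not confirm: "combined" is read as concatenation, the two fully connected layers use a ReLU in between, the final layer produces one score per identity, and all sizes and weights are small random placeholders.

```python
import numpy as np


def softmax_cross_entropy(logits, label):
    """Cross-entropy of the softmax over identity scores for one sample."""
    shifted = logits - logits.max()                     # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]


rng = np.random.default_rng(0)
f_g = rng.standard_normal(32)      # stand-in for the global appearance feature
f_rel = rng.standard_normal(16)    # stand-in for the relational local feature
combined = np.concatenate([f_g, f_rel])

num_ids = 10                       # made-up number of identities
w1 = rng.standard_normal((combined.size, 24)) * 0.1
w2 = rng.standard_normal((24, num_ids)) * 0.1
hidden = np.maximum(combined @ w1, 0.0)    # first fully connected layer + ReLU
logits = hidden @ w2                       # second fully connected layer
loss = softmax_cross_entropy(logits, label=3)
```

Training would adjust `w1`, `w2` and the upstream networks to reduce this loss over many labeled face images; the sketch only evaluates it once.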

Therefore, the face recognition method according to an embodiment of the present invention combines the global appearance feature f^g and the relational local features (p_{i,j}) extracted by the CNN model and the PRN model, respectively. By taking into account the characteristics of both the local regions and the entire region of the subject's face in the image, it can provide a face recognition method with strengthened identification of the subject.

The face recognition method and apparatus according to an embodiment of the present invention have been described above.

The face recognition method and apparatus according to an embodiment of the present invention include a memory and a processor that executes at least one instruction stored in the memory. The memory contains an instruction for receiving an image from an external server, an instruction for extracting a valid image, an instruction for aligning the valid image, an instruction for extracting a global appearance feature by training a convolutional neural network, an instruction for extracting a relational local feature by training a pair relational network, and an instruction for embedding an identity identification feature. The relational local feature is extracted by combining the distinctive features that appear in local parts of the subject's face region in the image, and it is combined with the global appearance feature, which represents the features of the face region as a whole. The method and apparatus can thereby improve the identification of identity between a pre-stored user and the subject.

The operations of the method according to an embodiment of the present invention can be implemented as a computer-readable program or code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable program or code is stored and executed in a distributed manner.

In addition, the computer-readable recording medium may include a hardware device specially configured to store and execute program instructions, such as ROM, RAM, or flash memory. The program instructions may include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like.

Although some aspects of the present invention have been described in the context of an apparatus, they may also represent a description of the corresponding method, where a block or apparatus corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method may also represent a corresponding block, item, or feature of a corresponding apparatus. Some or all of the method steps may be performed by (or using) a hardware device such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such a device.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

1000: Processor 5000: Memory
S: External server

Claims

A face recognition method comprising:
receiving an image capturing the face of a subject to be identified;
normalizing the image;
inputting the image into a convolutional neural network (CNN) trained to extract a plurality of facial feature points, thereby deriving a feature map that includes the facial feature points in the image;
applying global average pooling (GAP) to the feature map, thereby outputting a global appearance feature that represents the appearance of the subject's entire face in the image;
inputting the feature map into a pair relational network (PRN, Pairwise Related Network), extracting a plurality of local appearance features, and forming relation pairs; and
embedding an identity identification feature in the relation pairs to extract a relational local feature.

The method according to claim 1,
wherein the pair relational network is a model in which the weights of a loss function derived from the global appearance feature and the relational local feature of the learning image have been learned.

The method according to claim 2,
further comprising aligning the image before the step of normalizing the image.

The method according to claim 3,
wherein aligning the video image comprises:
rotating the video image, using position information of both eyes of the subject in the video image, so that the rotation-in-plane (RIP) angle becomes zero;
aligning an X-axis position of the video image using the facial feature points in the video image; and
aligning a Y-axis position and a size of the video image using the facial feature points in the video image.
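
The rotation step above can be sketched as follows. This is a minimal example, assuming the RIP angle is measured as the angle of the line joining the two eye positions and that points are rotated about the midpoint between the eyes; the function names and coordinates are hypothetical and not taken from the specification.

```python
import math


def rip_angle(left_eye, right_eye):
    """In-plane rotation angle (radians) of the line joining the two eyes."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.atan2(dy, dx)


def rotate_point(point, center, angle):
    """Rotate `point` about `center` by -angle, undoing the in-plane rotation."""
    x = point[0] - center[0]
    y = point[1] - center[1]
    c, s = math.cos(-angle), math.sin(-angle)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)


# Hypothetical eye positions in image coordinates.
left_eye, right_eye = (30.0, 40.0), (70.0, 60.0)
angle = rip_angle(left_eye, right_eye)
center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
new_left = rotate_point(left_eye, center, angle)
new_right = rotate_point(right_eye, center, angle)
residual = rip_angle(new_left, new_right)  # close to zero after alignment
```

In practice the same rotation would be applied to every pixel or landmark of the image, not only to the two eye points.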

The method according to claim 4,
wherein aligning the X-axis position of the video image comprises:
extracting, from among the facial feature points, a first feature point located outermost in a first direction;
extracting a second feature point located outermost in a second direction opposite to the first direction; and
adjusting the X-axis position of the video image so that the first feature point and the second feature point are at equal X-axis distances from the center of the video image.
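
A minimal sketch of this X-axis adjustment is given below. It assumes that the first and second directions are the negative and positive X directions, that the image center lies at half the image width, and that the adjustment is a pure horizontal shift; `x_shift_to_center` and the landmark values are hypothetical.

```python
def x_shift_to_center(landmarks, image_width):
    """Horizontal shift that puts the two outermost landmarks at equal
    X-axis distances from the image center."""
    first = min(landmarks, key=lambda p: p[0])   # outermost in the -X direction
    second = max(landmarks, key=lambda p: p[0])  # outermost in the +X direction
    face_mid_x = (first[0] + second[0]) / 2
    return image_width / 2 - face_mid_x


# Hypothetical landmark coordinates (x, y) in a 128-pixel-wide image.
landmarks = [(20.0, 50.0), (45.0, 40.0), (60.0, 42.0), (90.0, 55.0)]
shift = x_shift_to_center(landmarks, image_width=128)
shifted = [(x + shift, y) for x, y in landmarks]
```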

The method according to claim 4,
wherein aligning the Y-axis position and the size of the video image comprises:
extracting a third feature point that is the midpoint between both eyes of the subject in the video image;
extracting a fourth feature point that is the center point of the lips of the subject in the video image; and
adjusting the size and the Y-axis position of the video image using the third feature point and the fourth feature point.

The method according to claim 6,
wherein, in the video image, with respect to the Y-axis, the third feature point is positioned 30% down from the top edge and the fourth feature point is positioned 35% up from the bottom edge.
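
The placement above determines both a scale factor and a vertical offset. The following sketch shows one way to compute them, assuming the percentages are fractions of the output image height and that the alignment is a uniform scale followed by a vertical shift; the function name and the input coordinates are hypothetical.

```python
def y_scale_and_offset(eye_mid_y, mouth_y, out_height,
                       top_frac=0.30, bottom_frac=0.35):
    """Scale and vertical offset that place the eye midpoint 30% below the
    top edge and the lip center 35% above the bottom edge."""
    target_eye_y = top_frac * out_height
    target_mouth_y = (1.0 - bottom_frac) * out_height
    scale = (target_mouth_y - target_eye_y) / (mouth_y - eye_mid_y)
    offset = target_eye_y - scale * eye_mid_y
    return scale, offset


# Hypothetical input: eye midpoint at y=80, lip center at y=150, 140-pixel output.
scale, offset = y_scale_and_offset(eye_mid_y=80.0, mouth_y=150.0, out_height=140)
new_eye_y = scale * 80.0 + offset
new_mouth_y = scale * 150.0 + offset
```

With these fractions the eye-to-lip distance always spans the remaining 35% of the output height, which is what fixes the face size.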

The method according to claim 1,
wherein deriving the feature map comprises:
computing a per-channel convolution of the normalized video image by a plurality of convolution layers; and
applying max pooling to the per-channel convolution.
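
For a single channel, the two operations above can be sketched as follows. This is a simplified illustration, assuming an unpadded stride-1 convolution (computed as cross-correlation, as is common in CNN frameworks) and 2x2 max pooling with stride 2; the kernel values, sizes, and function names are hypothetical.

```python
def conv2d_valid(channel, kernel):
    """Unpadded, stride-1 2D cross-correlation of one channel with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(channel) - kh + 1):
        row = []
        for c in range(len(channel[0]) - kw + 1):
            row.append(sum(channel[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


def max_pool_2x2(channel):
    """2x2 max pooling with stride 2 (height and width must be even)."""
    pooled = []
    for r in range(0, len(channel), 2):
        row = []
        for c in range(0, len(channel[0]), 2):
            row.append(max(channel[r][c], channel[r][c + 1],
                           channel[r + 1][c], channel[r + 1][c + 1]))
        pooled.append(row)
    return pooled


# Hypothetical 3x3 input channel and 2x2 kernel.
channel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
conv = conv2d_valid(channel, kernel)  # 2x2 response map
pooled = max_pool_2x2(conv)           # 1x1 map after pooling
```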

The method according to claim 8,
wherein at least one of the convolution layers is provided in a bottleneck structure including a residual function.

The method according to claim 1,
wherein, in outputting the global appearance feature, average pooling is applied to the feature map using a filter of a specific size.

The method according to claim 1,
wherein forming the relation pair comprises:
receiving the feature map output from the convolutional neural network;
extracting a local appearance feature group centered on the plurality of facial feature points in the feature map; and
extracting a plurality of the local appearance features from the local appearance feature group to form the relation pair.
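
The pairing step above can be sketched as follows. The sketch assumes that a relation pair is the concatenation of two local appearance feature vectors and that every unordered pair of distinct features is formed; neither assumption is stated in the claim, and `form_relation_pairs` is a hypothetical name.

```python
from itertools import combinations


def form_relation_pairs(local_features):
    """Concatenate every unordered pair of local appearance feature vectors."""
    return [fi + fj for fi, fj in combinations(local_features, 2)]


# Hypothetical local appearance features, one per facial feature point.
local_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
pairs = form_relation_pairs(local_features)
```

Under these assumptions, N local features yield N(N-1)/2 relation pairs, so four features give six pairs.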

The method according to claim 11,
wherein extracting the local appearance feature group comprises:
setting at least a portion of the face region in the feature map as a region of interest (ROI) and projecting the region of interest; and
extracting, from at least one of the facial feature points located within the region of interest, the local appearance feature group including the local appearance features.
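
One possible reading of the ROI projection is sketched below: a facial feature point given in image coordinates is mapped onto the lower-resolution feature map, and a small square window around it is taken as the region of interest. The square image and feature map, the window size, the clipping rule, and the name `project_roi` are all assumptions made for illustration.

```python
def project_roi(landmark, image_size, fmap_size, roi_size=3):
    """Map an image-space landmark to a clipped square window on the feature map.

    Returns (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    """
    scale = fmap_size / image_size
    cx = int(landmark[0] * scale)
    cy = int(landmark[1] * scale)
    half = roi_size // 2
    # Clip so the window stays entirely inside the feature map.
    x0 = max(0, min(cx - half, fmap_size - roi_size))
    y0 = max(0, min(cy - half, fmap_size - roi_size))
    return x0, y0, x0 + roi_size, y0 + roi_size


# Hypothetical 128x128 image and 8x8 feature map.
inner = project_roi((70.0, 40.0), image_size=128, fmap_size=8)
edge = project_roi((127.0, 2.0), image_size=128, fmap_size=8)
```

The feature-map cells inside each window would then be pooled or flattened to obtain the local appearance features for that feature point.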

The method according to claim 1,
wherein extracting the relational local feature further comprises:
embedding the identity identification feature in the relation pair by a recurrent network based on long short-term memory (LSTM) units;
calculating a first weight by a first multi-layer perceptron (MLP) and applying the first weight individually to at least one of the relational local features;
extracting a predictive relational feature by summing the at least one relational local feature with an aggregation function; and
calculating a second weight by a second multi-layer perceptron and applying the second weight to the predictive relational feature to generate the pairwise related network.
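
The two-perceptron structure above can be sketched as follows. This is a heavily simplified illustration: the LSTM-based embedding is omitted, each multi-layer perceptron is reduced to a single fully connected layer with ReLU, and the aggregation function is assumed to be an element-wise sum. All names and weight values are hypothetical.

```python
def dense_relu(x, weights, bias):
    """One fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def relational_feature(pairs, w1, b1, w2, b2):
    """Apply the first layer to each pair, sum the results, apply the second layer."""
    per_pair = [dense_relu(p, w1, b1) for p in pairs]    # first MLP, per pair
    aggregated = [sum(vals) for vals in zip(*per_pair)]  # sum aggregation
    return dense_relu(aggregated, w2, b2)                # second MLP


# Hypothetical relation pairs and weights.
pairs = [[1.0, 2.0], [3.0, -1.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[0.5, 0.5]], [1.0]
result = relational_feature(pairs, w1, b1, w2, b2)
reordered = relational_feature(pairs[::-1], w1, b1, w2, b2)
```

Because the aggregation is a sum, the output does not depend on the order in which the relation pairs are presented.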

The method according to claim 1,
wherein the relational local feature is provided in the form of a single vector.

The method according to claim 13,
wherein the LSTM-based recurrent network includes a plurality of fully connected (FC) layers and is trained using a loss function.

A face recognition apparatus comprising:
a processor; and
a memory storing at least one instruction executed by the processor,
wherein the at least one instruction includes:
an instruction to receive a video image in which the face of a subject to be identified is captured;
an instruction to normalize the video image;
an instruction to input the video image into a convolutional neural network trained to extract a plurality of facial feature points, and to derive a feature map including the facial feature points in the video image;
an instruction to apply global average pooling to the feature map and output a global appearance feature representing the appearance of the entire face of the subject in the video image;
an instruction to input the feature map into a pairwise related network and extract a plurality of local appearance features to form a relation pair; and
an instruction to embed an identity identification feature in the relation pair to extract a relational local feature.

The apparatus according to claim 16,
wherein the processor aligns the video image before normalizing the video image.

The apparatus according to claim 16,
wherein the processor,
when executing the instruction to derive the feature map, computes a per-channel convolution of the normalized video image by a plurality of convolution layers, and outputs the feature map by applying max pooling to the per-channel convolution.

The apparatus according to claim 18,
wherein at least one of the convolution layers is provided in a bottleneck structure including a residual function.

The apparatus according to claim 16,
wherein the processor,
when executing the instruction to output the global appearance feature, applies average pooling to the feature map using a filter of a specific size.

The apparatus according to claim 16,
wherein the relational local feature is provided in the form of a single vector.

The apparatus according to claim 16,
wherein the processor,
when executing the instruction to form the relation pair, receives the feature map output from the convolutional neural network, extracts a local appearance feature group centered on the plurality of facial feature points in the feature map, and extracts a plurality of the local appearance features from the local appearance feature group to form the relation pair.

The apparatus according to claim 22,
wherein the processor,
when extracting the local appearance feature group, extracts a local area within the face region of the video image as a region of interest, and extracts, based on the extracted region of interest, the local appearance feature group including at least one of the local appearance features.

The apparatus according to claim 16,
wherein the processor,
when executing the instruction to extract the relational local feature, embeds the identity identification feature in the relation pair by an LSTM-based recurrent network, calculates a first weight by a first multi-layer perceptron and applies the first weight individually to at least one of the relational local features, extracts a predictive relational feature by summing the at least one relational local feature with an aggregation function, and calculates a second weight by a second multi-layer perceptron and applies the second weight to the predictive relational feature.

A method of modeling a pairwise related network, the method comprising:
receiving, from a trained convolutional neural network, a feature map including a plurality of facial feature points;
extracting a local appearance feature group centered on the plurality of facial feature points in the feature map;
extracting a plurality of local appearance features from the local appearance feature group to form a relation pair;
extracting a relational local feature by embedding an identity identification feature in the relation pair by an LSTM-based recurrent network; and
passing a feature that combines the relational local feature with a global appearance feature received from the trained convolutional neural network through a plurality of fully connected layers, and performing machine learning so that a loss function is minimized.
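
The final training step above can be sketched as follows. The sketch assumes that the two features are combined by concatenation, that a single fully connected layer produces per-identity scores, and that the loss is softmax cross-entropy; the claim itself names none of these choices, and all values and function names are hypothetical.

```python
import math


def linear(x, weights, bias):
    """One fully connected layer without activation."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax_cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


# Hypothetical global and relational features for one training image.
global_feature = [1.0, 0.0]
relational_local_feature = [0.0, 1.0]
combined = global_feature + relational_local_feature  # concatenation

# Hypothetical weights of a two-identity classifier.
weights = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
bias = [0.0, 0.0]
logits = linear(combined, weights, bias)
loss = softmax_cross_entropy(logits, label=0)
```

Training would repeat this computation over many labeled images and update the weights in the direction that reduces the loss.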

An apparatus for modeling a pairwise related network, the apparatus comprising:
a processor; and
a memory storing at least one instruction executed by the processor,
wherein the at least one instruction includes:
an instruction to receive, from a trained convolutional neural network, a feature map including a plurality of facial feature points;
an instruction to extract a local appearance feature group centered on the plurality of facial feature points in the feature map;
an instruction to extract a plurality of local appearance features from the local appearance feature group to form a relation pair;
an instruction to extract a relational local feature by embedding an identity identification feature in the relation pair by an LSTM-based recurrent network; and
an instruction to pass a feature that combines the relational local feature with a global appearance feature received from the trained convolutional neural network through a plurality of fully connected layers, and to perform machine learning so that a loss function is minimized.