KR20220106426A

KR20220106426A - Method for Human Activity Recognition Using Semi-supervised Multi-modal Deep Embedded Clustering for Social Media Data

Info

Publication number: KR20220106426A
Application number: KR1020210009319A
Authority: KR
Inventors: 이동만; 김동민; 한수민; 손희석
Original assignee: 한국과학기술원
Priority date: 2021-01-22
Filing date: 2021-01-22
Publication date: 2022-07-29
Also published as: KR102514920B1

Abstract

A method for recognizing human activity using semi-supervised multi-modal deep embedded clustering for social media data according to one embodiment of the present invention may comprise: a multi-modal data pre-processing step of extracting image features and text features by pre-processing multi-modal social media data, including images and text, provided as training data; an embedding network initialization step of initializing an image embedding network for the image features and a text embedding network for the text features; a supervised learning step of training the image embedding network and the text embedding network using labeled data among the training data; and an unsupervised learning step of training the image embedding network and the text embedding network using both unlabeled data and labeled data included in the training data. According to one embodiment of the present invention, both labeled and unlabeled data are utilized so that a generalized feature representation can be learned and overfitting to the labeled features can be prevented.

Description

Method for Human Activity Recognition Using Semi-supervised Multi-modal Deep Embedded Clustering for Social Media Data

This application relates to a human activity recognition method using semi-supervised multi-modal deep embedded clustering for social media data.

Social media is an ambient data platform where people share their daily activities and memorable experiences. Human Activity Recognition (HAR) can use the behavior patterns contained in this shared data to provide a solid foundation for a variety of applications, such as context-aware recommendation systems and healthcare services.

Existing HAR approaches use supervised machine learning models (e.g., SVM, LSTM) that take the text and metadata (e.g., timestamps, places, and keywords) of social media posts as input to learn important patterns related to human activity. However, because users mostly describe their daily activities and thoughts using both text and images, uni-modal text features alone cannot capture sufficient patterns of the human activities shared on social media (see Non-Patent Documents 1-3).

To improve the recognition performance of this prior art, the multi-modal features of social media posts (e.g., images and captions) can be utilized, and several approaches have been proposed in this regard. For example, Non-Patent Document 4 proposes a technique that concatenates image features from a CNN model with text features from a Doc2Vec model and uses them together to train a fully connected neural network that identifies social media posts related to illegal drugs. In addition, Non-Patent Document 5 proposes a technique that uses the visual semantics from a CNN model together with text features from an NLP network model to train the conventional SVM and DNN models, respectively, and uses the trained models to detect sarcastic social media posts.

However, constructing a labeled data set large enough to train a multi-modal supervised method is not an easy task, and no such data set has been presented for multi-modal HAR using social media.

Meanwhile, Deep Embedded Clustering (DEC) is an unsupervised learning method that simultaneously learns an optimal feature representation and the cluster assignment of unlabeled data by iteratively optimizing a clustering objective (see Non-Patent Document 6). During training, DEC minimizes the Kullback-Leibler (KL) divergence between the cluster soft assignment probabilities and a proposed target distribution so that the embedding network can learn the feature representation. MultiDEC is an extension of DEC for processing multi-modal data such as image-caption pairs: whereas DEC has a single embedding network, MultiDEC consists of two embedding networks that are jointly trained to simultaneously learn image and text representations with similar distributions (see Non-Patent Document 7).

Although DEC and MultiDEC are powerful methods, they are clustering algorithms and are therefore difficult to use directly for HAR. To exploit their proven effectiveness for classification tasks, SSLDEC, a semi-supervised DEC, was recently published (see Non-Patent Document 8). SSLDEC learns the target distribution of the labeled data, iteratively estimates the class probability distribution of the unlabeled data by measuring feature distances from the clusters of labeled data, and optimizes it during training. SSLDEC is thus a transductive learning method that maintains recognition performance even with a relatively small labeled data set. However, the supervision scheme of SSLDEC may fail to maximize HAR performance when the labeled data cannot accurately represent the target distribution or when that distribution cannot accommodate the distribution of the unlabeled data. In particular, its applicability may drop sharply on social media data, where these characteristics are pronounced.

[Non-Patent Document 1] Zhu, Z., Blanke, U., Calatroni, A., Troster, G.: Human activity recognition using social media data. In: Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia, p. 21. ACM (2013)
[Non-Patent Document 2] Zhu, Z., Blanke, U., Troster, G.: Recognizing composite daily activities from crowd-labelled social media data. Pervasive Mob. Comput. 26, 103-120 (2016)
[Non-Patent Document 3] Gong, J., Li, R., Yao, H., Kang, X., Li, S.: Recognizing human daily activity using social media sensors and deep learning. Int. J. Environ. Res. Public Health 16(20), 3955 (2019)
[Non-Patent Document 4] Roy, A., Paul, A., Pirsiavash, H., Pan, S.: Automated detection of substance use-related social media posts based on image and text analysis. In: 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 772-779. IEEE (2017)
[Non-Patent Document 5] Schifanella, R., de Juan, P., Tetreault, J., Cao, L.: Detecting sarcasm in multimodal social platforms. In: Proceedings of the 24th ACM International Conference on Multimedia, pp. 1136-1145. ACM (2016)
[Non-Patent Document 6] Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: International Conference on Machine Learning, pp. 478-487 (2016)
[Non-Patent Document 7] Yang, S., Huang, K.H., Howe, B.: MultiDEC: multi-modal clustering of image-caption pairs. arXiv preprint arXiv:1901.01860 (2019)
[Non-Patent Document 8] Enguehard, J., O'Halloran, P., Gholipour, A.: Semi-supervised learning with deep embedded clustering for image classification and segmentation. IEEE Access 7, 11093-11104 (2019)

Therefore, there is a need in the art for a human activity recognition technique that maximizes HAR performance when applied to social media data.

To solve the above problem, an embodiment of the present invention provides a human activity recognition method using semi-supervised multi-modal deep embedded clustering for social media data.

The human activity recognition method using semi-supervised multi-modal deep embedded clustering for social media data may comprise: a multi-modal data pre-processing step of extracting image features and text features by pre-processing multi-modal social media data, including images and text, provided as training data; an embedding network initialization step of initializing an image embedding network for the image features and a text embedding network for the text features; a supervised learning step of training the image embedding network and the text embedding network using labeled data among the training data; and an unsupervised learning step of training the image embedding network and the text embedding network using both unlabeled data and labeled data included in the training data.

The means for solving the problem described above do not enumerate all the features of the present invention. The various features of the present invention and the resulting advantages and effects may be understood in more detail with reference to the specific embodiments below.

According to an embodiment of the present invention, utilizing both labeled and unlabeled data makes it possible to learn a generalized feature representation and to prevent overfitting to the labeled features.

In addition, according to an embodiment of the present invention, improved HAR accuracy can be achieved compared with existing uni-modal approaches.

FIG. 1 is a flowchart of a human activity recognition method using semi-supervised multi-modal deep embedded clustering for social media data according to an embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating a human activity recognition method using semi-supervised multi-modal deep embedded clustering for social media data according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the embedding network initialization and semi-supervised learning processes according to an embodiment of the present invention.
FIG. 4 is a diagram showing the accuracy of an embodiment of the present invention as a function of the number of unlabeled data items and the learning rate.

Hereinafter, preferred embodiments are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice it. However, in describing a preferred embodiment of the present invention, a detailed description of a related known function or configuration is omitted where it is judged that it would unnecessarily obscure the gist of the present invention. In addition, the same reference numerals are used throughout the drawings for parts having similar functions and operations.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element interposed therebetween. Furthermore, 'including' a certain component means that other components may be further included, rather than excluded, unless specifically stated otherwise.

The present invention proposes a semi-supervised learning method for HAR that uses multi-modal social media data, that is, both images and text, and can achieve high recognition accuracy while using only a limited amount of labeled data.

The amount of unlabeled data on social media grows exponentially every day, whereas only a small amount of labeled data exists for any specific task. In such domains, a semi-supervised learning method that uses a small amount of labeled data together with a much larger unlabeled data set can significantly improve learning performance.

To present a semi-supervised learning method for HAR, the present invention adopts MultiDEC, a clustering method that can learn deep embedded feature representations of social media images and text respectively, and extends it into a semi-supervised model that incorporates a small amount of labeled data into the training procedure.

According to an embodiment of the present invention, both the cross-entropy loss and the KL divergence loss can be minimized. This makes it possible to learn a more generalized feature distribution by exploiting the feature distribution of a large unlabeled data set while optimizing the learning result for the features of the labeled data.

Hereinafter, a human activity recognition method using semi-supervised multi-modal deep embedded clustering for social media data according to an embodiment of the present invention is described in more detail with reference to FIGS. 1 to 4.

FIG. 1 is a flowchart of a human activity recognition method using semi-supervised multi-modal deep embedded clustering for social media data according to an embodiment of the present invention, and FIG. 2 is a schematic diagram illustrating the same method.

Referring to FIG. 1, the human activity recognition method using semi-supervised multi-modal deep embedded clustering for social media data according to an embodiment of the present invention may comprise a multi-modal data pre-processing step (S110), an embedding network initialization step (S120), a supervised learning step (S130), and an unsupervised learning step (S140). The supervised learning step (S130) and the unsupervised learning step (S140) may be performed repeatedly until convergence, and the method may be performed by a processing device capable of processing multi-modal data and running the embedding networks.

Referring to FIG. 2, when multi-modal social media data, for example Instagram posts, is provided as a training data set, the multi-modal data pre-processing step (Preprocessing) (S110) may be performed to extract image and text features.

Thereafter, the initialization step (Parameter Initialization) (S120) may be performed for the embedding network of each of the image features and text features, that is, the image embedding network and the text embedding network. Here, the image embedding network and the text embedding network have parameters θ and θ', respectively, and serve to learn deep representations of the image features and text features. That is, the image embedding network and the text embedding network may embed the image features (X) and text features (X') into the corresponding latent spaces Z and Z', respectively.

Thereafter, to train the embedding networks, the supervised learning step (Supervised Learning) (S130) and the unsupervised learning step (Unsupervised Learning) (S140) may be performed alternately.

In the supervised learning step (S130), labeled multi-modal data is used to learn the class assignment.

In the unsupervised learning step (S140), by contrast, unlabeled multi-modal data is used to compute the cluster assignment probabilities for the image features and text features.

Thereafter, these probabilities may be compared with the joint target probability distribution in order to adjust the cluster centers μ and μ' and improve cluster purity.

This semi-supervised learning method helps to learn optimal representations of the image features and text features and to apply multi-modal deep embedded clustering to HAR.

Hereinafter, each step is described in more detail.

The Instagram posts provided to the multi-modal data pre-processing step (S110) consist of image-text pairs, where the text may be a mixture of a caption and a plurality of hashtags.

To pre-process the images of the Instagram posts, for example, a 2048-dimensional feature representation may be extracted from a ResNet-50 pre-trained on the ImageNet data set, and the extracted 2048-dimensional features may then be compressed to 300 dimensions using Principal Component Analysis (PCA).

To pre-process the text of the Instagram posts, for example, a Korean tokenizer may be used to split the text into separate words, which are then embedded into a 300-dimensional Doc2Vec Skip-gram feature space.

However, the data pre-processing method is not necessarily limited to this, and may be selected from among the various techniques known to those skilled in the art.

In the embedding network initialization step (S120), given the pre-processed image and text data points X and X', two embedding networks (that is, the image embedding network and the text embedding network) with initial parameters θ and θ' may be created, for example as in MultiDEC.

Here, the two embedding networks are symmetric and each goes through the same process, so only the image embedding network is described.

The image embedding network may train a stacked autoencoder including encoding layers (Encoder) and decoding layers (Decoder).

The stacked autoencoder has parameters θ. Its encoder, built from stacked DNN layers, compresses the input data X into the latent space Z, and its decoder regenerates X* from Z by minimizing the mean squared error loss between X and X*. To use the embedded features for HAR with j human activity classes, the K-means algorithm can be applied to Z with k equal to j, producing j initial clusters with centers μ. Then, a j × j confusion matrix is generated to associate the j clusters with the most relevant human activity classes. The (m, n) element of this matrix is the number of data points with the m-th class label that are contained in the n-th cluster. A cost matrix is generated by subtracting each cell value from the maximum value of the confusion matrix, and the Hungarian algorithm is applied to find the class assignment that minimizes the cost. Finally, the centers μ are rearranged to follow the assignment found. FIG. 3(a) illustrates the embedding network initialization step.
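The cluster-to-class association described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names are hypothetical, class labels and cluster indices are assumed to be integers in the range 0 to j-1, and a brute-force search over permutations stands in for the Hungarian algorithm, which is only practical for a small j.

```python
from itertools import permutations


def confusion_matrix(labels, clusters, j):
    """conf[m][n] = number of samples with class label m that fall in cluster n."""
    conf = [[0] * j for _ in range(j)]
    for m, n in zip(labels, clusters):
        conf[m][n] += 1
    return conf


def best_assignment(conf):
    """Return a list where entry m is the cluster associated with class m.

    Brute-force stand-in for the Hungarian algorithm (assumption: small j).
    """
    j = len(conf)
    peak = max(max(row) for row in conf)
    # Cost is low where a class and a cluster share many samples.
    cost = [[peak - value for value in row] for row in conf]
    best = min(
        permutations(range(j)),
        key=lambda perm: sum(cost[m][perm[m]] for m in range(j)),
    )
    return list(best)


def reorder_centroids(centroids, cluster_of_class):
    """Rearrange centroids so that index m holds the centroid of class m."""
    return [centroids[cluster_of_class[m]] for m in range(len(centroids))]


labels = [0, 0, 1, 1, 2, 2]      # known class labels of labeled samples
clusters = [2, 2, 0, 0, 1, 1]    # K-means cluster index of the same samples
conf = confusion_matrix(labels, clusters, 3)
mapping = best_assignment(conf)
centroids = reorder_centroids(["c0", "c1", "c2"], mapping)
```

For a realistic number of classes, a polynomial-time solver for the linear assignment problem would replace the permutation search.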

In the supervised learning step (S130), the model can be optimized by minimizing the cross entropy between the soft assignment probabilities q_ij and r_ij of the image and text samples x_i and x'_i from the labeled training set L and the given class label indicator y_ij. Here, y_ij is a binary indicator that is set to 1 when the data point x_i is assigned to the cluster of its correct class label j and lies close to its center, and to 0 otherwise.

First, using a Student's t-distribution with one degree of freedom, the image soft assignment probability q_ij, which is the similarity between the image embedding z_i and the image cluster center μ_j, can be computed according to Equation 1 below. Likewise, the text soft assignment probability r_ij can be computed according to Equation 2 below using the text embedding z'_i and the text cluster center μ'_j.

[Equation 1]

[Equation 2]
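Assuming the standard DEC soft assignment, that is, a Student's t kernel with one degree of freedom normalized over all clusters j', Equations 1 and 2 would take the following form. This is a reconstruction from the surrounding description, not a copy of the published equations.

```latex
q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^2\right)^{-1}}
              {\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2\right)^{-1}}
\qquad
r_{ij} = \frac{\left(1 + \lVert z'_i - \mu'_j \rVert^2\right)^{-1}}
              {\sum_{j'} \left(1 + \lVert z'_i - \mu'_{j'} \rVert^2\right)^{-1}}
```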

Thereafter, the supervised loss functions SL_img and SL_txt for the image and text models can be defined, as in Equations 3 and 4 below, as the sum of the cross-entropy values between the computed soft assignment probabilities q_ij and r_ij and the given class label indicator y_ij of each sample x_i ∈ L.

[Equation 3]

[Equation 4]
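Read as an ordinary cross entropy summed over the labeled samples, Equations 3 and 4 would take the following form. This is a reconstruction under that assumption, using only the symbols defined in the text.

```latex
SL_{img} = -\sum_{x_i \in L} \sum_{j} y_{ij} \log q_{ij}
\qquad
SL_{txt} = -\sum_{x'_i \in L} \sum_{j} y_{ij} \log r_{ij}
```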

In the supervised learning step (S130) described above, the model learns to place the embedded feature z_i of each labeled data point x_i ∈ L as close as possible to the center μ_j of its labeled class j.
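The supervised step can be illustrated numerically with the sketch below. It assumes the standard DEC Student's t kernel for the soft assignment and a one-hot label indicator given as an integer class index; the function names are hypothetical, and no gradient update is shown.

```python
import math


def soft_assign(z, centroids):
    """Soft assignment of embedding z to each centroid (Student's t, 1 d.o.f.)."""
    sims = [
        1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)))
        for mu in centroids
    ]
    total = sum(sims)
    return [s / total for s in sims]


def supervised_loss(embeddings, labels, centroids):
    """Sum of cross entropies between one-hot labels and soft assignments."""
    loss = 0.0
    for z, y in zip(embeddings, labels):
        q = soft_assign(z, centroids)
        loss -= math.log(q[y])  # only the true-class term of y_ij * log q_ij survives
    return loss


centroids = [[0.0, 0.0], [1.0, 0.0]]
q = soft_assign([0.0, 0.0], centroids)
loss = supervised_loss([[0.0, 0.0]], [0], centroids)
```

The loss shrinks as a labeled embedding moves toward the center of its own class and away from the others, which is the behavior the paragraph above describes.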

In the unsupervised learning step (S140), training can be performed with deep embedded clustering on both the unlabeled data set U and the labeled data set L (U ∪ L). This training proceeds by minimizing the KL divergence of q and r with respect to the generated target probability distribution p.

First, the joint target distribution p_ij can be computed from q_ij and r_ij together according to Equation 5 below. In an embodiment of the present invention, q_ij and r_ij are each raised to the second power, as proposed in DEC, to improve cluster purity and place more emphasis on data points assigned with high confidence. In addition, following MultiDEC, the two resulting distributions are averaged so that q_ij and r_ij contribute equally.

[Equation 5]
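A possible reading of the joint target distribution is sketched below. It assumes that each modality is sharpened as in DEC (squared soft assignments divided by the per-cluster soft frequency, then normalized per sample) and that the two sharpened distributions are averaged with equal weight; the frequency normalization and the function names are assumptions.

```python
def sharpen(probs):
    """Square-and-normalize an N x K soft-assignment matrix (DEC-style target)."""
    n_clusters = len(probs[0])
    # Soft cluster frequencies: column sums of the assignment matrix.
    freq = [sum(row[j] for row in probs) for j in range(n_clusters)]
    target = []
    for row in probs:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        target.append([w / total for w in weights])
    return target


def joint_target(q, r):
    """Average the sharpened image (q) and text (r) distributions."""
    pq, pr = sharpen(q), sharpen(r)
    return [
        [0.5 * (a + b) for a, b in zip(row_q, row_r)]
        for row_q, row_r in zip(pq, pr)
    ]


q = [[0.6, 0.4], [0.3, 0.7]]
r = [[0.6, 0.4], [0.3, 0.7]]
p = joint_target(q, r)
```

In this example the dominant entry of each row grows after sharpening, so confident assignments are emphasized in the target.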

Thereafter, once the joint target distribution has been computed, the model can be trained by minimizing the KL divergence with the unsupervised loss functions defined in Equations 6 and 7 below. In addition to minimizing the KL divergence between p and each of q and r, the DEC model can be trained with an additional loss between the mean h of the target probability distribution p and the prior knowledge w of the class distribution. Here, the prior knowledge may be obtained from the class distribution of L.

[Equation 6]

[Equation 7]

Here, [expression not reproduced in this text].
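The unsupervised loss of one modality can be sketched as below. The sketch assumes that the additional term is itself a KL divergence between the mean h of the target distribution and the prior w; the text only states that an additional loss between h and w is introduced, so this form and the function names are assumptions.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def unsupervised_loss(p, q, prior):
    """Clustering loss sum_i KL(p_i || q_i) plus a prior term KL(h || w)."""
    n = len(p)
    clustering = sum(kl(p_i, q_i) for p_i, q_i in zip(p, q))
    # h: mean of the target distribution over all samples in U and L.
    h = [sum(row[j] for row in p) / n for j in range(len(prior))]
    return clustering + kl(h, prior)


prior = [0.5, 0.5]  # w: class distribution estimated from the labeled set L
matched = unsupervised_loss([[0.5, 0.5]], [[0.5, 0.5]], prior)
mismatched = unsupervised_loss([[0.9, 0.1]], [[0.5, 0.5]], prior)
```

The text model would use the same function with r in place of q.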

In addition, the learning rate η_u of the unsupervised learning step may be set smaller than the learning rate η_s of the supervised learning step so that Z_L is not excessively influenced by U. An embodiment of the present invention uses η_u = k × η_s, where k is an input parameter.

Thus, according to an embodiment of the present invention, the feature distributions of U and L are used together so that the model learns more generalized parameters θ and μ and the trained model is prevented from overfitting to L.

Algorithm 1 below represents the overall procedure of the embodiment of the present invention described above.
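The alternation of the two learning steps can be outlined as follows. This is a sketch assembled from the steps described above, not a reproduction of Algorithm 1: the callables `supervised_step` and `unsupervised_step`, the scale factor value 0.1, and the loss-difference stopping rule are all hypothetical.

```python
def train(supervised_step, unsupervised_step, eta_s=0.01, k=0.1,
          max_iters=100, tol=1e-3):
    """Alternate supervised and unsupervised updates until the loss settles.

    supervised_step(lr) and unsupervised_step(lr) are assumed to perform one
    update pass with the given learning rate and return the resulting loss.
    """
    eta_u = k * eta_s  # smaller rate for the unsupervised step
    previous = float("inf")
    for iteration in range(1, max_iters + 1):
        total = supervised_step(eta_s) + unsupervised_step(eta_u)
        if abs(previous - total) < tol:  # assumed convergence criterion
            return iteration, total
        previous = total
    return max_iters, previous


# Dummy steps: the supervised loss halves on every pass.
state = {"loss": 1.0, "eta_u": None}


def fake_supervised(lr):
    state["loss"] /= 2
    return state["loss"]


def fake_unsupervised(lr):
    state["eta_u"] = lr
    return 0.0


iterations, final_loss = train(fake_supervised, fake_unsupervised)
```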

To evaluate the performance of the embodiment of the present invention described above, a data set was built by collecting geo-tagged Instagram posts containing various human activities, posted from January 2015 to December 2018 in urban areas near 25 stations on Seoul Subway Line 2. Data collection was limited to Korean Instagram posts with non-empty captions, and the captions were cleaned by removing URLs, numbers, email addresses, and emoticons. In addition, spam posts posted repeatedly by the same author, with the same caption, or from the same location were filtered out. When a single post contained two or more images, only the first image was taken. As a result, a data set of 967,598 image-text pairs was constructed.
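The caption cleaning described above could look like the following sketch. The regular expressions are illustrative assumptions; in particular, the emoji ranges only approximate what the text calls emoticons, and the actual filtering rules used for the data set are not specified.

```python
import re

# Hypothetical patterns; the source does not give the exact rules.
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")
NUMBER = re.compile(r"\d+")
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_caption(text):
    """Remove URLs, email addresses, numbers, and emoji, then collapse spaces."""
    for pattern in (URL, EMAIL, NUMBER, EMOJI):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def keep_post(caption):
    """Keep only posts whose caption is non-empty after cleaning."""
    return bool(clean_caption(caption))


sample = "Lunch at 12 http://a.bc/x mail me@x.com \U0001F600 #cafe"
cleaned = clean_caption(sample)
```

URLs are removed before numbers so that digits inside a link do not leave fragments behind.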

Table 1 below summarizes the number of labeled Instagram posts for each human activity class.

[Table 1]

Here, the major human activity classes of the American Time Use Survey (ATUS) classification, which is widely used in HAR, were used as class labels. To build a data set validated by a consensus mechanism, 27 participants were first recruited and divided into three groups. Each group was then given the same set of Instagram posts and asked to annotate the most likely human activity that each post represents. Only posts for which two or more participants agreed on the same human activity class label were used.
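The agreement rule can be expressed compactly, as in the sketch below. It assumes that each post comes with a short list of annotations, one per annotator; the function name and the threshold parameter are hypothetical.

```python
from collections import Counter


def consensus_label(annotations, min_votes=2):
    """Return the most frequent label if at least min_votes annotators chose it."""
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_votes else None


agreed = consensus_label(["Eating", "Eating", "Shopping"])
rejected = consensus_label(["Eating", "Shopping", "Traveling"])
```

With three annotations per post, a label reaching two votes is always unique, so no tie-breaking rule is needed in that setting.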

Table 1 shows the 23 human activity class labels and the number of posts for each. Among the 17 ATUS classes, the Socializing, Relaxing, and Leisure classes appear very frequently in social media. Instagram posts are divided into the sub-classes defined in the ATUS classification, namely socializing and communicating, attending or hosting social events, relaxing and leisure, and arts and entertainment (other than sports). In addition, Advertisement and Unknown classes were added to filter out spam and ambiguous posts. In total, a HAR data set of 16,894 labeled posts was constructed. For the evaluation of the present invention, 16,248 posts from the 12 most frequent classes in the data set are used; these classes are marked with asterisks in Table 1.

As evaluation metrics, the accuracy score, the macro F1 score, and normalized mutual information (NMI) are adopted. The accuracy score represents plain recognition performance, the macro F1 score is a normalized HAR performance metric, and NMI represents the similarity between the probability distribution of the true classes of the test set and that of the predicted classes.
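For reference, the three metrics can be computed as in the following pure-Python sketch. The function names are illustrative, and normalizing the mutual information by the arithmetic mean of the two entropies is an assumption; the source does not state which NMI normalization was used.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def entropy(labels):
    """Shannon entropy (natural log) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(y_true, y_pred):
    """Mutual information divided by the mean of the two entropies (assumed)."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    ct, cp = Counter(y_true), Counter(y_pred)
    mi = sum((c / n) * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    h = (entropy(y_true) + entropy(y_pred)) / 2
    return mi / h if h > 0 else 1.0
```

Because macro F1 averages the per-class scores without weighting by class size, rare classes count as much as frequent ones, which is presumably why the text calls it a normalized metric. NMI ignores how clusters are named, so a prediction that swaps two class labels consistently still scores 1 on NMI while scoring 0 on accuracy.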

Using these three standard metrics, 5-fold cross-validation was performed and the results were averaged. In the embodiment of the present invention, unlabeled data were used together with the training set of labeled data for model training. A stochastic gradient descent (SGD) optimizer with a batch size of 256 and a learning rate of 0.01 (η_s) was used. To train the stacked autoencoder in the model, the same configuration (i.e., layer structure and hyperparameters) as the conventional DEC model was used.

For comparative evaluation, baseline models with different data modalities were also implemented. As text-only baselines, a linear SVM with TF-IDF text input vectors (TF-IDF + SVM), an LSTM with Word2Vec input (Word2Vec + LSTM), and an LSTM with dictionary embedding and Word2Vec input (Word2Vec + LSTM + DE) were implemented. As an image-only baseline, a ResNet50 model pre-trained on ImageNet was implemented. To evaluate whether the embodiment of the present invention effectively exploits the multi-modal nature of image-sharing social media data and improves HAR performance, semi-supervised deep embedded clustering models using a single modality (i.e., text only or image only) were also implemented and compared with the multi-modal model according to the present invention.

The performance evaluation results are summarized in Table 2 below, and it was confirmed that the model according to the embodiment of the present invention achieves the best HAR performance on all evaluation metrics.

[Table 2]

Specifically, in terms of the accuracy score, the performance is 7.43% and 11.89% higher than the best text-only and image-only models, respectively. In terms of the macro F1 score, a normalized metric, it is 5.69% and 21.41% better than the text-only and image-only models, respectively. In addition, a higher NMI indicates that the model is more effective at achieving optimal clustering toward the correct distribution. Comparing the multi-modal and uni-modal semi-supervised models confirms that integrating multi-modal data helps improve HAR performance.

FIG. 4 is a diagram illustrating the accuracy of an embodiment of the present invention as a function of the number of unlabeled samples and the learning rate.

To verify the effect of semi-supervised learning with unlabeled samples, the accuracy improvement was measured while fixing the number of labeled samples and increasing the number of unlabeled samples. Specifically, the number of labeled samples was fixed at 12,998 (i.e., the training portion of a 5-fold split) and the number n of unlabeled samples (U) was increased from 0 to 951,350.
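The two figures quoted above are consistent with the data set sizes given earlier, as the short check below illustrates. It assumes that the 16,248 evaluated posts are split into five nearly equal folds and that they are a subset of the 967,598 collected image-text pairs; the source does not state either point explicitly, and `kfold_sizes` is a hypothetical helper.

```python
def kfold_sizes(n_samples: int, n_folds: int = 5):
    """Return (train_size, test_size) for each fold of a near-even split."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds receive one additional test sample.
    test_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]
    return [(n_samples - t, t) for t in test_sizes]


TOTAL_PAIRS = 967_598   # image-text pairs collected (from the text)
LABELED_USED = 16_248   # labeled posts in the 12 evaluated classes (from the text)

splits = kfold_sizes(LABELED_USED)
max_unlabeled = TOTAL_PAIRS - LABELED_USED

if __name__ == "__main__":
    print(splits)          # per-fold (train, test) sizes
    print(max_unlabeled)   # pairs left over once the evaluated posts are removed
```

Under these assumptions the larger test folds leave 12,998 labeled training samples, and the remaining pool matches the upper limit of 951,350 unlabeled samples.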

The result in FIG. 3(a) shows that the accuracy increases logarithmically as the number of unlabeled samples increases. Compared with using no unlabeled data, using unlabeled data together with labeled data improved accuracy by about 8.9%. The accuracy rises very quickly up to 100,000 unlabeled samples, but the improvement slows as the number of unlabeled samples grows from 100,000 to 200,000. In other words, for this data set, at least 100,000 to 200,000 unlabeled samples are needed to generalize the feature distribution of about 12,000 labeled samples. This result shows that incorporating unlabeled data into model training helps improve HAR performance and that the model of the present invention benefits from it effectively.

In addition, a further experiment was performed in which an input parameter k that scales the unsupervised learning rate was introduced so that supervised learning is not excessively affected by unsupervised learning.

In the result of FIG. 3(b), when the supervised and unsupervised learning rates are set to the same value (k = 1), the performance is 66.56%, which is 2.41% higher than the baseline model; when the number of unlabeled samples is 1 million, the best performance of 71.58% is obtained at k = 0.2. This shows that k can be used to control the generalization effect of unsupervised learning.
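One way to read the role of k is as a multiplier on the supervised learning rate, so that k = 1 gives equal rates and k = 0.2 gives an unsupervised rate one fifth as large. The sketch below illustrates that reading together with the alternation of supervised and unsupervised steps recited in the claims. The relation eta_u = k * eta_s and both function names are assumptions; the source only says that k adjusts the unsupervised learning rate.

```python
ETA_S = 0.01  # supervised learning rate stated in the text


def unsupervised_lr(k: float, eta_s: float = ETA_S) -> float:
    """Assumed relation: eta_u = k * eta_s, so k = 1 means equal rates."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return k * eta_s


def alternating_schedule(n_rounds: int, k: float, eta_s: float = ETA_S):
    """Yield (phase, learning_rate) pairs, alternating the two phases."""
    for _ in range(n_rounds):
        yield ("supervised", eta_s)
        yield ("unsupervised", unsupervised_lr(k, eta_s))


if __name__ == "__main__":
    for phase, lr in alternating_schedule(2, k=0.2):
        print(phase, lr)
```

In an actual training loop the round count would be replaced by a convergence test, which the source does not specify.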

The present invention is not limited by the above-described embodiments and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the present invention pertains that the components according to the present invention can be substituted, modified, and changed without departing from the technical spirit of the present invention.

Claims

A method for recognizing human activity using semi-supervised multi-modal deep embedded clustering for social media data, the method comprising:
a multi-modal data pre-processing step of extracting image features and text features by pre-processing multi-modal social media data including images and text provided as training data;
an embedding network initialization step of initializing an image embedding network for the image features and a text embedding network for the text features;
a supervised learning step of training the image embedding network and the text embedding network using labeled data among the training data; and
an unsupervised learning step of training the image embedding network and the text embedding network using both unlabeled data and labeled data included in the training data.

The method of claim 1, wherein the supervised learning step comprises:
calculating image and text soft assignment probabilities, each being the similarity between the image and text features of the labeled data and the image and text cluster centers corresponding to the label; calculating a supervised loss function based on the image and text soft assignment probabilities; and learning to place the image and text features of the labeled data as close as possible to the image and text cluster centers corresponding to the label.

The method of claim 2, wherein the unsupervised learning step comprises:
calculating a target probability distribution based on the image and text soft assignment probabilities, and training the image embedding network and the text embedding network by minimizing the KL divergence.

The method of claim 1,
wherein the generalization effect of unsupervised learning is controlled by setting the learning rate of the unsupervised learning step to be smaller than the learning rate of the supervised learning step.

The method of claim 1,
wherein the supervised learning step and the unsupervised learning step are performed alternately and repeatedly until convergence.