KR102475108B1

KR102475108B1 - System for modeling automatically of machine learning with hyper-parameter optimization and method thereof

Info

Publication number: KR102475108B1
Application number: KR1020200144337A
Authority: KR
Inventors: 정태윤; 박판종; 박용순; 김동길
Original assignee: 강릉원주대학교산학협력단
Priority date: 2020-11-02
Filing date: 2020-11-02
Publication date: 2022-12-07
Also published as: KR20220059120A

Abstract

The present invention relates to a system for building a machine learning model with optimized hyperparameters. The system comprises: a learning algorithm modeling module that sequentially trains a plurality of preset learning models on preprocessed and normalized training data and, based on the training results, detects the learning model with the highest accuracy among them as the optimal learning model; a hyperparameter optimization module that detects optimal hyperparameter values for the optimal learning model detected by the learning algorithm modeling module; a retraining module that retrains the optimal learning model by applying the detected optimal hyperparameter values to it; and a learning model prediction module that performs prediction with the retrained optimal learning model using the normalized test data. The system is thereby configured to automatically build an optimal learning model having the hyperparameter combination with the highest accuracy.

Description

Method for automating machine learning modeling with optimized hyperparameters and system for automating machine learning modeling using the same {System for modeling automatically of machine learning with hyper-parameter optimization and method thereof}

The present invention relates to a method and system for automating machine learning modeling. More specifically, it relates to a method, and a system using the method, configured to automatically preprocess training data and test data, train a plurality of preset machine learning models on the preprocessed data, automatically select the machine learning model with the best accuracy from the training results, obtain an optimized hyperparameter combination for the selected model, and use that combination to provide a machine learning model with optimized hyperparameters.

Machine learning dates back to the 1950s. After progressing through the 1980s and 1990s it stagnated, and it then advanced markedly from the mid-2000s. Recently, as the Internet of Things has become widespread, enormous amounts of data are being generated, and practically usable machine learning results are emerging as the learning effect is maximized by preprocessing and optimizing the data to be learned from such big data.

Representative machine learning models for supervised learning include linear regression, logistic regression, decision trees, support vector machines, and artificial neural networks. In supervised learning, the input is called a predictor variable or feature, and the output is called a response variable or target variable. A supervised learning task is called 'regression' when the target variable is numerical and 'classification' when it is categorical.

In machine learning, one of the most important tasks is generating a highly accurate model from training data so that predictions are accurate. Creating a machine learning model requires advanced mathematical knowledge such as linear algebra and sequences. Learning these theories costs an individual considerable time and money, which imposes many limitations. In addition, selecting and implementing the algorithm used in a machine learning model requires studying programming languages, which takes still more time. Consequently, for ordinary people without expertise in mathematics and statistical analysis, training and building a machine learning model is not easy.

In particular, deep learning, a branch of machine learning, is widely used in fields such as computer vision and natural language processing, and its purpose is to obtain predicted values for given input data. A user therefore builds a deep learning model to compute predicted values, and deep learning has two characteristics. First, the model is trained so that it yields accurate predictions for given inputs; because training requires a large amount of computation, it is carried out in a shared GPU cluster environment composed of multiple GPUs. Second, because the prediction accuracy of a deep learning model is strongly affected by the initial settings of variables called hyperparameters, a procedure is performed that applies various hyperparameter combinations to find the optimal combination that maximizes the model's prediction accuracy. This process is called hyperparameter optimization.

Hyperparameter optimization has three main characteristics. First, the wider the hyperparameter search range, the more likely the user is to find the optimal hyperparameter combination for the deep learning model. Deep learning researchers therefore apply as many hyperparameter combinations as possible to the model in an effort to find the combination that optimizes its prediction accuracy. Second, a hyperparameter combination that converges quickly toward the correct answer early in training is likely to be the optimal combination. Third, the accuracy of each hyperparameter combination as a function of the number of training iterations is difficult to predict; in other words, the model's prediction accuracy at a given number of iterations can be checked only by actually running the training. Accordingly, comparing the early convergence of various combinations can quickly identify combinations likely to be optimal, but it is difficult to decide how many training iterations to use for that comparison. Because of these characteristics, users train various hyperparameter combinations for a sufficiently long time to find the optimal one.

Meanwhile, as a way of finding a hyperparameter combination that improves machine learning performance, manually changing hyperparameters and checking the results involves a very large number of cases and takes a long time.

Korean Patent Laid-Open Publication No. 10-2019-0134983
Korean Registered Patent Publication No. 10-2096301
Korean Registered Patent Publication No. 10-2037279

The present invention, devised to solve the above problem, aims to provide a method and a system for automating machine learning modeling with optimized hyperparameter values, configured to minimize the tuning process for hyperparameter optimization and thereby automatically select and provide the machine learning model having the optimal hyperparameter values that give the best performance.

To achieve the above technical object, a system for automatically building a machine learning model according to a first aspect of the present invention comprises: a data storage unit that stores training data and test data prepared in advance; a data preprocessing unit that preprocesses the training data and the test data; a normalization module that normalizes the preprocessed training data and test data by adjusting their skewness and kurtosis; a learning algorithm modeling module that sequentially trains a plurality of preset learning models on the normalized training data and, based on the training results, detects the learning model with the highest accuracy among them as the optimal learning model; a hyperparameter optimization module that detects optimal hyperparameter values for the optimal learning model detected by the learning algorithm modeling module; a retraining module that retrains the optimal learning model by applying the detected optimal hyperparameter values to it; and a learning model prediction module that performs prediction with the retrained optimal learning model using the normalized test data. The system is thereby configured to automatically build an optimal learning model having the hyperparameter combination with the highest accuracy.

In the system according to the first aspect, the hyperparameter optimization module preferably comprises: a data frame setting module that initially sets and stores data frames for the hyperparameters of the plurality of preset learning models and the variables for each data item; a hyperparameter combination generation module that generates a plurality of hyperparameter combinations for the optimal learning model detected by the learning algorithm modeling module, using the initially set hyperparameter data frame and the variables for each data item; an accuracy measurement module that sequentially applies the plurality of hyperparameter combinations to the optimal learning model, trains it, and measures the accuracy score of the optimal learning model for each hyperparameter combination; and an optimal value detection module that detects the hyperparameter combination with the best accuracy score and sets and stores it as the optimal hyperparameter values.

In the system according to the first aspect, the data preprocessing unit preferably comprises: a data separation module that separates the training data and test data into numerical data and categorical data according to data type; a data conversion module that detects categorical data in the training data and test data and converts it into numerical data; and a missing value replacement module that detects data with missing values in the training data and test data and replaces the missing values.

In the system according to the first aspect, the learning algorithm modeling module preferably presets a plurality of learning models, sequentially trains them on the preprocessed and normalized training data, presents the training result of each learning model, quantifies each result and stores it as an accuracy score, and automatically selects the learning model with the highest accuracy score among them as the optimal learning model.

A method for automatically building a machine learning model according to a second aspect of the present invention comprises the steps of: (a) preprocessing training data and test data prepared in advance; (b) normalizing the preprocessed training data and test data by adjusting their skewness and kurtosis and removing outliers; (c) sequentially training a plurality of preset learning models on the normalized training data and, based on the training results, detecting the learning model with the highest accuracy among them as the optimal learning model; (d) detecting optimal hyperparameter values for the detected optimal learning model; (e) retraining the optimal learning model by applying the detected optimal hyperparameter values to it; and (f) performing prediction with the retrained optimal learning model using the normalized test data. The method is thereby configured to automatically build an optimal learning model having the optimal hyperparameter values with the highest accuracy.

In the method according to the second aspect, step (d) preferably comprises: (d1) initially setting and storing data frames for the hyperparameters of the plurality of preset learning models and the variables for each data item; (d2) generating a plurality of hyperparameter combinations for the optimal learning model using the initially set hyperparameter data frame and the variables for each data item; (d3) sequentially applying the plurality of hyperparameter combinations to the optimal learning model, training it, and measuring the accuracy score of the optimal learning model for each hyperparameter combination; and (d4) detecting the hyperparameter combination with the best accuracy score and setting and storing it as the optimal hyperparameter values.
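As a non-limiting illustration, steps (d2) to (d4) can be sketched in Python as an exhaustive search over a grid of candidate values. The grid contents, the function names, and the stand-in scoring function `toy_score` are assumptions introduced only for this sketch; in practice the scoring function would train the optimal learning model with the given combination and return its measured accuracy.

```python
from itertools import product

def make_combinations(grid):
    """Expand {name: [candidate values]} into a list of {name: value} dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

def find_best_combination(grid, score_fn):
    """Score every combination in order and return (best_params, best_score)."""
    best_params, best_score = None, float("-inf")
    for params in make_combinations(grid):
        score = score_fn(params)  # expected to train with `params` and return accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid for illustration only.
grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}

def toy_score(params):
    # Stand-in for "train the model and measure accuracy"; no training happens here.
    bonus = 0.02 if params["weights"] == "distance" else 0.0
    return 0.9 - 0.01 * abs(params["n_neighbors"] - 3) + bonus

best_params, best_score = find_best_combination(grid, toy_score)
print(best_params, round(best_score, 2))
```

The number of combinations is the product of the candidate counts per hyperparameter, so the grid should be kept small when each evaluation involves a full training run.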

In the method according to the second aspect, step (b) preferably comprises: (b1) deleting, from the training data and test data, data that cannot be converted into numerical data; (b2) separating the training data and test data into numerical data and categorical data according to data type; (b3) detecting categorical data in the training data and test data and converting it into numerical data; and (b4) after the data conversion, detecting data with missing values in the training data and test data and replacing the missing values.

In the method according to the second aspect, step (c) preferably presets a plurality of learning models, sequentially trains them on the preprocessed and normalized training data, presents the training result of each learning model, quantifies each result and stores it as an accuracy score, and selects and outputs the learning model with the highest accuracy score among them as the optimal learning model.

The system and method for automating machine learning modeling according to the present invention programmatically preprocess the training data and test data and use them to automatically select the best-performing optimal learning model. In addition, by selecting and applying the hyperparameter combination with the highest accuracy score for the optimal learning model, they can provide an optimal learning model with the highest accuracy.

In particular, the system according to the present invention sequentially trains the optimal learning model with the hyperparameter combinations applied and, using the accuracy scores of the training results, retrains it with the combination that gives the best accuracy. The optimal hyperparameter values for the optimal learning model can thereby be obtained with a minimal number of iterations.

Accordingly, the system according to the present invention makes it possible to perform machine learning modeling with excellent performance even without expert knowledge of machine learning.

FIG. 1 is a block diagram of a system for automatically building a machine learning model according to a preferred embodiment of the present invention.
FIG. 2 is a flowchart sequentially describing the process of selecting the best-performing machine learning model and performing verification and prediction in the system for automatically building a machine learning model according to a preferred embodiment of the present invention.
FIG. 3 is a table showing exemplary types of machine learning algorithms.
FIG. 4 is a table showing the types of hyperparameters used in each machine learning algorithm.

Hereinafter, a system and method for automatically building a machine learning model with optimized hyperparameters according to a preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system for automatically building a machine learning model with optimized hyperparameters according to a preferred embodiment of the present invention, and FIG. 2 is a flowchart sequentially describing the process in that system of selecting the best-performing machine learning model, optimizing its hyperparameters, and then performing verification and prediction.

Referring to FIGS. 1 and 2, the system (1) for automatically building a machine learning model according to the present invention comprises a data storage unit (10), a data preprocessing unit (20), a normalization module (30), a learning algorithm modeling module (40), a hyperparameter optimization module (50), a retraining module (60), a learning model performance evaluation module (70), and a learning model prediction module (80), and is characterized by being configured to automatically build the learning model with the best accuracy through a training process. The data storage unit (10) stores and manages training data and test data prepared in advance. The data preprocessing unit (20) preprocesses the training data and test data for training. The normalization module (30) normalizes the preprocessed training data and test data by adjusting their skewness and kurtosis and removing outliers. The learning algorithm modeling module (40) sequentially trains a plurality of preset learning models on the normalized training data and, based on the training results, detects the learning model with the highest accuracy among them as the optimal learning model.

The hyperparameter optimization module (50) detects optimal hyperparameter values for the optimal learning model detected by the learning algorithm modeling module. The retraining module (60) retrains the optimal learning model by applying the detected optimal hyperparameter values to it.

The performance evaluation module (70) evaluates the performance of the retrained optimal learning model. The learning model prediction module (80) performs prediction with the optimal learning model using the normalized test data.
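As a non-limiting illustration, the retraining, evaluation, and prediction flow handled by modules (60), (70), and (80) might be sketched in Python as follows. The `KNearest` class is a toy stand-in for the optimal learning model, and the `best_params` value, the data, and the use of separate evaluation data for the performance check are all assumptions made for this sketch.

```python
from collections import Counter

class KNearest:
    """Toy k-nearest-neighbor classifier used only as a stand-in model."""
    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for p in X:
            # Sort training indices by squared Euclidean distance to p.
            order = sorted(
                range(len(self.X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, self.X[i])),
            )
            votes = Counter(self.y[i] for i in order[: self.k])
            predictions.append(votes.most_common(1)[0][0])
        return predictions

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical optimal hyperparameter values found in the preceding step.
best_params = {"k": 3}

X_train = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_train = [0, 0, 0, 1, 1, 1]
X_eval, y_eval = [[0.15], [1.05]], [0, 1]  # held-out data for the performance check
X_test = [[0.05], [1.15]]                  # data to predict on

model = KNearest(**best_params).fit(X_train, y_train)  # retraining with best values
eval_accuracy = accuracy(y_eval, model.predict(X_eval))  # performance evaluation
test_predictions = model.predict(X_test)                 # prediction
```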

Each of the above components is described in more detail below. The system according to the present invention may be implemented using a programming language such as Python.

The data storage unit (10) is a database that stores and manages the training data used for training and the test data used for prediction; the training data and test data may be read from data made publicly available, for example online, for training machine learning models. To supply the data used by the machine learning model, the data storage unit prepares the training data to be used for the learning model and the test data to be used for prediction, and identifies the dependent variable (label) and the independent variables (features) of the prepared data.

The training data and test data prepared by the data storage unit must be preprocessed for accurate training, and the data preprocessing unit performs this preprocessing. The data preprocessing unit (20) comprises a data deletion module (22), a data separation module (24), a data conversion module (26), a missing value replacement module (28), and a dummy variable generation module (29).

In general, training data and test data can be divided into numerical data and categorical data. Numerical data is data expressed as numbers and includes variables that can be expressed with numbers alone, such as age (e.g., '28', '35', '41') and score (e.g., '95.2', '88.7', '93.6'). Categorical data consists of variables expressed as text or as a mix of text and numbers. Text-only examples include sex (e.g., 'male', 'female') and season (e.g., 'spring', 'summer', 'fall', 'winter'); examples mixing text and numbers include date (e.g., '2020-04-22') and time (e.g., 'pm 13:15:21'). Because categorical data in the training data cannot be used for training, it is preferable to convert categorical data into numerical data that can be used for training and to delete data that is difficult to convert, for more accurate training.
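As a non-limiting illustration, the distinction between numerical and categorical columns might be sketched in Python as follows. The rule used here, treating a column as numerical when every non-missing value parses as a number, and the function names are assumptions made for this sketch rather than a rule stated in this description.

```python
def is_number(value):
    """True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def split_columns(rows):
    """Return (numeric_columns, categorical_columns) for a list of dict records."""
    numeric, categorical = [], []
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] is not None]  # ignore missing values
        if values and all(is_number(v) for v in values):
            numeric.append(col)
        else:
            categorical.append(col)
    return numeric, categorical

# Sample records built from the example values given in the description.
rows = [
    {"age": "28", "score": "95.2", "sex": "male", "season": "spring"},
    {"age": "35", "score": "88.7", "sex": "female", "season": None},
    {"age": "41", "score": "93.6", "sex": "male", "season": "winter"},
]
numeric_cols, categorical_cols = split_columns(rows)
```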

The data deletion module (22) deletes, from the categorical data in the training data and test data, data that cannot be converted into numerical data. The data separation module (24) separates the training data and test data into numerical data and categorical data according to data type. The data conversion module (26) detects the categorical data among the data separated by the data separation module and converts it into numerical data.

The missing value replacement module (28) detects data with missing values in the training data and test data, after their categorical data has been converted into numerical data, and replaces the missing values.

When a missing value is detected in the data set converted to numerical data, the missing value replacement module according to the present invention preferably selects the variable with the highest correlation to the data variable that has the missing value, obtains a plurality of missing value replacement models that each replace the missing value using one of a plurality of statistics for that most highly correlated variable, evaluates the results obtained with those replacement models through a preset learning algorithm, and obtains the replacement value from the replacement model that performs best in the evaluation. Here, the statistics preferably include at least two of the mean, standard deviation, variance, median, and quartiles of the data variable concerned.
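As a non-limiting illustration, one possible reading of this procedure is sketched below in Python. The reading assumed here is that the missing value is filled with a statistic of the same variable computed among the rows that share the value of the most highly correlated variable; this interpretation, the column names, and the sample data are assumptions. The final step of scoring each candidate with a preset learning algorithm is omitted from the sketch.

```python
from statistics import mean, median

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def most_correlated(columns, target):
    """Column with the largest |r| against `target`, over rows where target is known.

    Assumes the other columns contain no missing values.
    """
    known = [i for i, v in enumerate(columns[target]) if v is not None]
    t = [columns[target][i] for i in known]
    strength = {
        name: abs(pearson([vals[i] for i in known], t))
        for name, vals in columns.items()
        if name != target
    }
    return max(strength, key=strength.get)

def fill_by_group(columns, target, by, stat):
    """Fill missing `target` values with `stat` of the target among rows sharing `by`.

    Assumes every group has at least one known target value.
    """
    groups = {}
    for key, value in zip(columns[by], columns[target]):
        if value is not None:
            groups.setdefault(key, []).append(value)
    return [
        stat(groups[key]) if value is None else value
        for key, value in zip(columns[by], columns[target])
    ]

# Hypothetical data set: `age` has two missing values.
columns = {
    "age": [22, None, 40, 38, None, 24, 60],
    "pclass": [3, 3, 1, 1, 1, 3, 1],
    "noise": [5, 1, 5, 1, 3, 3, 3],
}
by = most_correlated(columns, "age")
candidates = {
    name: fill_by_group(columns, "age", by, stat)
    for name, stat in {"mean": mean, "median": median}.items()
}
```

Each entry of `candidates` corresponds to one replacement model; choosing among them would require training the preset learning algorithm on each filled data set and comparing the resulting scores.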

The dummy variable generation module (29) generates dummy variables for specific data in the training data and test data. When a variable can be answered with 'yes' or 'no', as sex can with male and female, a dummy variable (one-hot encoding) can be used to convert male to the number '0' and female to the number '1'.
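As a non-limiting illustration, the conversion described above might be sketched in Python as follows. The function names and the `is_` column prefix are assumptions; `encode_binary` reproduces the male/female mapping given in the description, and `one_hot` shows the more general case of one 0/1 column per category.

```python
def encode_binary(values, mapping):
    """Replace each category with its numeric code, e.g. male -> 0, female -> 1."""
    return [mapping[v] for v in values]

def one_hot(values):
    """Build one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}

sex_codes = encode_binary(["male", "female", "male"], {"male": 0, "female": 1})
season_columns = one_hot(["spring", "fall", "spring"])
```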

The normalization module (30) performs primary normalization by adjusting the skewness and kurtosis of the training data and test data preprocessed by the data preprocessing unit, thereby preventing the data from being lopsided. The normalization module preferably also performs secondary normalization by removing outliers using the median and the interquartile range (IQR), thereby minimizing outliers. Here, an outlier is an observation that lies abnormally far from the distribution of the other values in the data.
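
The two normalization passes might look like the sketch below. The patent does not specify the transformation used to adjust skewness or the exact outlier rule, so both are assumptions here: a `log1p` transform is applied when the skewness exceeds a threshold of 1.0, and rows outside the conventional fences Q1 - 1.5 IQR and Q3 + 1.5 IQR are dropped. The function names and the `fare` column are hypothetical.

```python
import numpy as np
import pandas as pd


def reduce_skew(series, threshold=1.0):
    """Log-transform a non-negative, right-skewed series; otherwise return it as is."""
    if series.min() >= 0 and series.skew() > threshold:
        return np.log1p(series)
    return series


def drop_iqr_outliers(df, columns, k=1.5):
    """Keep only rows within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Toy data (illustrative only): one extreme value in an otherwise tight column.
train = pd.DataFrame({"fare": [7.0, 8.0, 8.0, 9.0, 10.0, 12.0, 13.0, 15.0, 500.0]})
skew_before = train["fare"].skew()
train["fare"] = reduce_skew(train["fare"])   # primary normalization
skew_after = train["fare"].skew()
cleaned = drop_iqr_outliers(train, ["fare"])  # secondary normalization
print(round(skew_before, 2), round(skew_after, 2), len(cleaned))
```

Kurtosis can be inspected the same way with `Series.kurt()`. Because rows cannot be dropped from data that still has to be predicted, a practical variant would compute the fences on the training data only.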

The training data is the final data used to train the machine learning model: it is preprocessed by the data preprocessing unit described above and then normalized by the normalization module. Through the normalization module's adjustment of skewness and kurtosis, the data of each variable is brought close to a normal distribution, and outliers are removed.

The learning algorithm modeling module (40) is designed to automatically select, from among several learning models, the algorithm with the highest accuracy as the optimal learning model.

FIG. 3 is a table showing examples of machine learning algorithms. Referring to FIG. 3, machine learning models include KNN, SVM, Decision Tree, GBM, XGBoost, LightGBM, and the like, and the learning model that provides the best performance for the case at hand is selected and used.

Accordingly, the learning algorithm modeling module (40) is configured to preset a plurality of learning models to be modeled, train the plurality of learning models sequentially using a for statement, present the training result of each learning model, quantify the training result of each learning model and store it in Accuracy, and automatically select the learning model having the highest accuracy score among the plurality of learning models and output it as the optimal learning model. Here, accuracy (Accuracy) is the most intuitive evaluation metric for representing the performance of a learning model.
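
The for-statement loop described above can be sketched with scikit-learn estimators. The patent names the model families but not a library, so the classes, the Iris stand-in data, and the held-out validation split used to measure accuracy are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def select_best_model(models, X_train, y_train, X_val, y_val):
    """Train each candidate in turn and return (best_name, accuracy_by_name)."""
    accuracy = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # For classifiers, score() returns the mean accuracy.
        accuracy[name] = model.score(X_val, y_val)
    best_name = max(accuracy, key=accuracy.get)
    return best_name, accuracy


# Stand-in data; the pipeline would pass its own normalized training data.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Preset candidate models (an illustrative subset of those in FIG. 3).
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}
best_name, accuracy = select_best_model(models, X_train, y_train, X_val, y_val)
print(best_name, accuracy[best_name])
```

Any estimator that exposes `fit` and `score` could be added to the dictionary without changing the loop.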

FIG. 4 is a table showing the types of hyperparameters used by each machine learning algorithm. As shown in FIG. 4, each learning model requires a different set of hyperparameters. Therefore, once the optimal learning model has been determined, an optimal learning model with excellent performance can be obtained by finding the optimal hyperparameter values for it and applying them.

The hyperparameter optimization module (50) includes a data frame setting module (52), a hyperparameter combination generation module (54), an accuracy measurement module (56), and an optimal value detection module (58), and detects the optimal hyperparameter values for the optimal learning model. The data frame setting module (52) initializes and stores a data frame (Data Frame) for the hyperparameters of the plurality of preset learning models, together with the variables for each piece of data. The hyperparameter combination generation module (54) generates a plurality of hyperparameter combinations for the optimal learning model detected by the learning algorithm modeling module, using the initialized hyperparameter data frame and the variables for each piece of data. The accuracy measurement module (56) applies the plurality of hyperparameter combinations to the optimal learning model one after another, trains the model with each, and measures the accuracy score of the optimal learning model for each hyperparameter combination. The optimal value detection module (58) detects the hyperparameter combination with the highest accuracy score among the hyperparameter combinations, and sets and stores the detected combination as the optimal hyperparameter values.
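
The four sub-modules can be sketched as an exhaustive search over a small grid. The grids, the function names, the Iris stand-in data, and the assumption that KNN was the selected model are all hypothetical; the patent does not list concrete values or say how the accuracy is measured, so a held-out validation split is assumed.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# (52) Candidate values per preset model; the values are illustrative.
PARAM_GRIDS = {
    "KNN": {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]},
    "DecisionTree": {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]},
}
MODEL_CLASSES = {"KNN": KNeighborsClassifier, "DecisionTree": DecisionTreeClassifier}


def make_combinations(grid):
    """(54) Expand a grid into a data frame with one combination per row."""
    return pd.DataFrame(list(product(*grid.values())), columns=list(grid))


def measure_accuracy(model_cls, combos, X_train, y_train, X_val, y_val):
    """(56) Train once per combination and record the validation accuracy."""
    scores = []
    for params in combos.to_dict("records"):
        model = model_cls(**params).fit(X_train, y_train)
        scores.append(model.score(X_val, y_val))
    return combos.assign(accuracy=scores)


def best_combination(scored):
    """(58) Return the hyperparameters of the highest-scoring row as a dict."""
    best_idx = scored["accuracy"].idxmax()
    return scored.drop(columns="accuracy").to_dict("records")[best_idx]


X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
best_model_name = "KNN"  # assumed result of the model selection step
combos = make_combinations(PARAM_GRIDS[best_model_name])
scored = measure_accuracy(
    MODEL_CLASSES[best_model_name], combos, X_train, y_train, X_val, y_val
)
best_params = best_combination(scored)
print(best_params)
```

scikit-learn's `GridSearchCV` performs a comparable exhaustive search with cross-validation; the explicit loop is used here only to mirror the module structure in the text.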

The re-learning module (60) applies the optimal hyperparameter values set by the hyperparameter optimization module to the optimal learning model and then retrains it.

The learning model performance evaluation module (70) evaluates the performance of the optimal learning model retrained by the re-learning module (60).

The learning model prediction module (80) predicts the Label of the test data using the best optimal learning model, that is, the model selected by the learning algorithm modeling module (40) and retrained after the optimal hyperparameter values were applied. To predict the test data set with the optimal learning model retrained under the optimized hyperparameter combination, the learning model prediction module (80) applies the same data preprocessing and normalization processes to the test data, runs the model on it, and checks the prediction results. The test data has no Label, and its variables (Features) are the same as those of the training data. Checking the results predicted for the test data by the learning model prediction module (80) makes it possible to finally confirm the accuracy of the learning model.
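
Retraining and prediction, with the test features passed through the same transformation as the training features, might be sketched as follows. The function `retrain_and_predict`, the use of `log1p` as a stand-in for the shared preprocessing and normalization steps, and the toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier


def retrain_and_predict(model, best_params, preprocess, X_train, y_train, X_test):
    """Refit a fresh copy of model with best_params, then label X_test."""
    tuned = clone(model).set_params(**best_params)
    tuned.fit(preprocess(X_train), y_train)
    # The test features go through the same transformation as the training features.
    return tuned.predict(preprocess(X_test))


# Toy data (illustrative only): two well-separated classes, unlabeled test rows.
X_train = np.array([[1.0], [2.0], [3.0], [100.0], [120.0], [150.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[2.5], [130.0]])

predictions = retrain_and_predict(
    KNeighborsClassifier(),
    {"n_neighbors": 3},  # assumed output of the hyperparameter optimization step
    np.log1p,            # stand-in for the shared preprocessing and normalization
    X_train,
    y_train,
    X_test,
)
print(predictions)
```

Any statistics a preprocessing step depends on, such as medians or quantile fences, would need to be computed on the training data and reused unchanged for the test data.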

Although the present invention has been described above with reference to its preferred embodiments, these are merely examples and do not limit the present invention. A person of ordinary skill in the art to which the present invention belongs will appreciate that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the present invention. Differences relating to such modifications and applications should be construed as falling within the scope of the present invention as defined in the appended claims.

1: system for automatically building a learning model
10: data storage unit
20: data preprocessing unit
22: data deletion module
24: data separation module
26: data conversion module
28: missing value replacement module
29: dummy variable generation module
30: normalization module
40: learning algorithm modeling module
50: hyperparameter optimization module
52: data frame setting module
54: hyperparameter combination generation module
56: accuracy measurement module
58: optimal value detection module
60: re-learning module
70: learning model performance evaluation module
80: learning model prediction module

Claims

A system for automatically building a machine learning model, comprising:
a data storage unit storing previously prepared training data and test data;
a data preprocessing unit that preprocesses the training data and the test data;
a normalization module that normalizes the preprocessed training data and test data by adjusting their skewness and kurtosis;
a learning algorithm modeling module that sequentially trains a plurality of preset learning models using the normalized training data and, based on the training results, detects the learning model having the highest accuracy (Accuracy) among the plurality of learning models as an optimal learning model;
a hyperparameter optimization module that detects optimal hyperparameter values for the optimal learning model detected by the learning algorithm modeling module;
a re-learning module that retrains the optimal learning model by applying the detected optimal hyperparameter values to the optimal learning model; and
a learning model prediction module that performs prediction with the retrained optimal learning model using the normalized test data,
wherein the hyperparameter optimization module comprises:
a data frame setting module that initializes and stores a data frame (Data Frame) for the hyperparameters of the plurality of preset learning models and the variables for each piece of data;
a hyperparameter combination generation module that generates a plurality of hyperparameter combinations for the optimal learning model detected by the learning algorithm modeling module, using the initialized hyperparameter data frame and the variables for each piece of data;
an accuracy measurement module that sequentially applies the plurality of hyperparameter combinations to the optimal learning model, trains the model with each, and measures the accuracy score of the optimal learning model for each hyperparameter combination; and
an optimal value detection module that detects the hyperparameter combination having the highest accuracy score among the hyperparameter combinations and sets the detected hyperparameter combination as the optimal hyperparameter values,
whereby the system is configured to automatically build an optimal learning model having the hyperparameter combination with the highest accuracy.

(Deleted)

The system of claim 1, wherein the data preprocessing unit comprises:
a data separation module that separates the training data and test data into numerical data and categorical data according to data type;
a data conversion module that detects categorical data among the training data and test data and converts the categorical data into numerical data; and
a missing value replacement module that detects data with missing values among the training data and test data and replaces the missing values.

The system of claim 1, wherein the learning algorithm modeling module:
presets a plurality of learning models;
sequentially trains the plurality of learning models using the preprocessed and normalized training data, and presents the training result of each learning model;
quantifies the training result of each learning model and stores it as an accuracy score; and
automatically selects the learning model having the highest accuracy score among the plurality of learning models and sets it as the optimal learning model.

A method for automatically building a machine learning model, each step of which is performed by a computing system, the method comprising:
(a) preprocessing previously prepared training data and test data;
(b) normalizing the preprocessed training data and test data by adjusting their skewness and kurtosis and removing outliers;
(c) sequentially training a plurality of preset learning models using the normalized training data and, based on the training results, detecting the learning model having the highest accuracy among the plurality of learning models as an optimal learning model;
(d) detecting optimal hyperparameter values for the detected optimal learning model;
(e) retraining the optimal learning model by applying the detected optimal hyperparameter values to the optimal learning model; and
(f) performing prediction with the retrained optimal learning model using the normalized test data,
wherein step (d) comprises:
(d1) initializing and storing a data frame (Data Frame) for the hyperparameters of the plurality of preset learning models and the variables for each piece of data;
(d2) generating a plurality of hyperparameter combinations for the optimal learning model, using the initialized hyperparameter data frame and the variables for each piece of data;
(d3) sequentially applying the plurality of hyperparameter combinations to the optimal learning model, training the model with each, and measuring the accuracy score of the optimal learning model for each hyperparameter combination; and
(d4) detecting the hyperparameter combination having the highest accuracy score among the hyperparameter combinations, and setting and storing the detected hyperparameter combination as the optimal hyperparameter values,
whereby the method is configured to automatically build an optimal learning model having the optimal hyperparameter values with the highest accuracy.

(Deleted)

The method of claim 5, wherein step (b) comprises:
(b1) deleting, from the training data and test data, data that cannot be converted into numerical data;
(b2) separating the training data and test data into numerical data and categorical data according to data type;
(b3) detecting categorical data among the training data and test data and converting the categorical data into numerical data; and
(b4) after the data conversion, detecting data with missing values among the training data and test data and replacing the missing values.

The method of claim 5, wherein step (c) comprises:
presetting a plurality of learning models;
sequentially training the plurality of learning models using the preprocessed and normalized training data, and presenting the training result of each learning model;
quantifying the training result of each learning model and storing it as an accuracy score; and
selecting the learning model having the highest accuracy score among the plurality of learning models as the optimal learning model and outputting it.