KR101588431B1

KR101588431B1 - Method for data classification based on manifold learning

Info

Publication number: KR101588431B1
Application number: KR1020140029855A
Authority: KR
Inventors: 이재욱; 지현웅; 한상우; 김경옥
Original assignee: 서울대학교산학협력단
Priority date: 2014-03-13
Filing date: 2014-03-13
Publication date: 2016-01-25
Also published as: KR20150107252A

Abstract

According to the present invention, a data classification method based on manifold learning comprises: removing outliers from the entire collected data; constructing a dispersion function based on the data from which the outliers have been removed; projecting the entire data onto a manifold through a dynamic system based on the dispersion function; and applying a manifold learning algorithm to the data obtained in the projecting step.

Description

METHOD FOR DATA CLASSIFICATION BASED ON MANIFOLD LEARNING

The present invention relates to a data classification method based on manifold learning and, more particularly, to a method for minimizing noise while reducing dimensionality in the manifold learning process.

High-dimensional real-world data, such as image data, are generally known to lie on a lower-dimensional nonlinear manifold within the high-dimensional space. Manifold learning exploits this: it finds the low-dimensional structure in high-dimensional data and represents the data in fewer dimensions while preserving their characteristics. Among existing methods for finding the low-dimensional structure of data, nonlinear methods find that structure more effectively than linear methods, but they are more likely to fail to identify it when the data contain a large amount of noise.

In this regard, Korean Patent No. 10-1172579 (title of invention: Method and Apparatus for Detecting Data Including Abnormal Attributes) discloses a method and apparatus for detecting data that include abnormal attributes in a data set having one or more attributes. More specifically, it includes: a data preprocessing step of dividing the numerical range of each attribute in the data into at least one interval and replacing each attribute value with the interval containing it, thereby converting the data into transactions; a data association pattern determination step of determining, from the set of transactions, data association patterns (DAPs) that represent normal relationships among related attributes; a significance determination step of determining the significance of each determined data association pattern; and an outlier determination step of identifying the data that contain a data association pattern as a subset and determining, using the significance of the data association patterns for the attributes of the identified data, whether the data include an abnormal attribute.

The present invention is intended to solve the problems of the prior art described above. Its object is to improve data classification performance by minimizing noise through a process of projecting noisy data onto a manifold during manifold learning.

As a technical means for achieving the above object, a data classification method based on manifold learning according to one aspect of the present invention comprises: removing outliers from the entire collected data; constructing a dispersion function based on the data from which the outliers have been removed; projecting the entire data onto a manifold through a dynamic system based on the dispersion function; and applying a manifold learning algorithm to the data obtained in the projecting step.

According to the embodiment of the present invention described above, the problem that existing manifold learning methods have with data containing a large amount of noise can be resolved. With a conventional manifold learning method, the structure that includes the noise is a deformed structure, and because the method learns this deformed structure, it fails to learn the data structure it was originally meant to learn. In the present invention, the dispersion function is learned from the normal data, excluding the outlier data, and the outlier data are projected onto the structure of the normal data. As a result, the normal data structure that was originally intended can be learned without discarding the normal information carried by the outlier data.

FIG. 1 is a diagram illustrating a data classification apparatus based on manifold learning according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a manifold learning process according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an algorithm for constructing a dispersion function according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an algorithm for data projection according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the concept of noise data projection for manifold learning according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. Also, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a diagram illustrating a data classification apparatus based on manifold learning according to an embodiment of the present invention.

The data classification apparatus 10 includes a data collection unit 100, a manifold learning unit 110, and a data classification unit 120.

The data classification apparatus 10 can be connected to external terminal devices through a network and can receive various kinds of data from various terminals; for this purpose it includes a wired or wireless communication module. The network may be implemented as a wired network such as a local area network (LAN), a wide area network (WAN), or a value added network (VAN), or as any kind of wireless network such as a mobile radio communication network or a satellite communication network. The external terminal device may be implemented as a computer or a portable terminal that can connect to the data classification apparatus 10 through the network. Here, the computer includes, for example, a notebook, desktop, laptop, or tablet PC equipped with a web browser, and the portable terminal is, for example, a wireless communication device that ensures portability and mobility, and may include any kind of handheld wireless communication device such as a cellular terminal or a smartphone.

For reference, the components shown in FIG. 1 according to an embodiment of the present invention refer to software or to hardware components such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit), and perform predetermined roles.

However, the "components" are not limited to software or hardware; each component may be configured to reside in an addressable storage medium or configured to execute on one or more processors.

Thus, as an example, a component includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The components and the functions provided within them may be combined into a smaller number of components or further separated into additional components.

The data collection unit 100 collects various data to be learned or classified from the external terminal devices mentioned above and other sources. For example, various kinds of image data may be collected, such as user image data used for fingerprint or face recognition or MRI scan data, and large volumes of data such as big data may also be collected.

Such data may be stored in a memory or other storage device included in the data classification apparatus 10, and may be stored in structured form using a database.

The manifold learning unit 110 performs a manifold learning algorithm on the collected data, carrying out the steps described below: outlier removal, construction of the dispersion function, and projection of the data onto the manifold.

Briefly, regarding manifold learning: an m-dimensional (m < n) manifold contained in n-dimensional real space is a topological space that resembles m-dimensional Euclidean space within a sufficiently small local region. In general, an m-dimensional smooth manifold $\mathcal{M}$ is known to be expressible locally (around each point) as the set of points at which (n-m) independent smooth functions vanish (see M. W. Hirsch, Differential Topology, Springer, Jul. 1976). The global structure of a general manifold is very complex, but under some mild conditions the manifold $\mathcal{M}$ can be approximated globally (redefining it, where necessary, so that the locally defined pieces are glued together smoothly) as follows:

[Equation 1]

$$\mathcal{M} = \{\, x \in \mathbb{R}^n : H(x) = 0 \,\}$$

Here, $H : \mathbb{R}^n \to \mathbb{R}^{n-m}$ is a smooth function. (Note that such an H is not unique.) The manifold learning problem can then be formulated as the problem of constructing such a function H.
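As a concrete instance of this zero-set description, the short worked example below uses the unit sphere in three-dimensional space. The choice of the sphere and the symbol $\tilde{H}$ are illustrative assumptions, not taken from the patent.

```latex
% n = 3, m = 2: the unit sphere as the zero set of one (= n - m) smooth function
H(x) = x_1^2 + x_2^2 + x_3^2 - 1, \qquad
\{\, x \in \mathbb{R}^3 : H(x) = 0 \,\} = S^2 .

% H is not unique: \tilde{H}(x) = 2\,H(x) has exactly the same zero set.
```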

The manifold learning unit 110 may be stored in the memory of the data classification apparatus 10 in the form of a program module, and its functions are performed when a processor included in the data classification apparatus 10 executes that program module. The specific algorithm performed by the manifold learning unit 110 is described in detail later.

The data classification unit 120 can classify data using the data whose dimensionality has been reduced by the manifold learning unit 110. The data classification unit 120 may be stored in the memory of the data classification apparatus 10 in the form of a program module, and its functions are performed when a processor included in the data classification apparatus 10 executes that program module.

FIG. 2 is a flowchart illustrating a manifold learning process according to an embodiment of the present invention.

First, data to be learned or classified are collected through the data collection unit 100 (S210).

Next, outliers are removed from the collected data (S220).

In general, in the ideal noise-free case the data can be represented as in Equation 1 introduced above, but real data are corrupted by noise.

Taking this noise into account, for a data set of N data points (N a natural number), each data point is assumed to have the form […], where […] and the noise term takes sufficiently small values. Most of the data in the data set are further assumed to be confined to some neighborhood of the manifold. Because H is a smooth function, for each data point there exists […] satisfying […]. The set […] is referred to here as the dispersed manifold of the manifold, of size […].

Conventional manifold learning methods are suited to data with sufficiently little noise, so applying them directly to data corrupted by large, high-variance noise is very difficult. To construct the function H stably, the noisy data in the data set must first be projected onto the dispersed manifold of sufficiently small size, and the manifold learning algorithm then applied to the projected data points. One way to do this is to apply to the noisy data points the dynamic system of Equation 2, which is associated with the function H.

[Equation 2]

Here, […] is the trajectory of Equation 2 starting from […], and […] is the Jacobian matrix of H with respect to x. When the dynamic system of Equation 2 is applied to a point x belonging to the dispersed manifold, the trajectory is known to converge to a point z satisfying […], following a path perpendicular to the tangent plane of the manifold at that point (see J. Lee and H.-D. Chiang, "Theory of stability regions for a class of nonhyperbolic dynamical systems and its application to constraint satisfaction problems," IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications, vol. 49, no. 2, pp. 196-209, Feb. 2002).
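The projection behavior described here can be illustrated with a small numerical sketch. Equation 2 itself is not reproduced in this text, so the sketch assumes a gradient-type system of the form dx/dt = -J_H(x)^T H(x); that form, the unit-circle choice of H, the step size, and all function names are assumptions made for illustration only. Forward-Euler integration moves a noisy point toward the zero set of H.

```python
def H(x):
    """Hypothetical constraint function: zero exactly on the unit circle."""
    return x[0] ** 2 + x[1] ** 2 - 1.0


def jacobian_H(x):
    """Jacobian of H (a 1x2 row vector, stored as a list)."""
    return [2.0 * x[0], 2.0 * x[1]]


def project_onto_zero_set(x0, step=0.05, tol=1e-8, max_iter=10_000):
    """Integrate dx/dt = -J_H(x)^T H(x) with forward Euler (assumed form)."""
    x = list(x0)
    for _ in range(max_iter):
        h = H(x)
        if abs(h) < tol:  # close enough to the zero set H(x) = 0
            break
        J = jacobian_H(x)
        x = [x[i] - step * J[i] * h for i in range(2)]
    return x


noisy_point = [1.3, 0.4]          # a point off the circle
z = project_onto_zero_set(noisy_point)
print(z, H(z))
```

For this particular H the update is purely radial, so the point reaches the circle along the direction normal to it, which mirrors the perpendicular approach mentioned above.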

However, because the function H is not known a priori, the conventional method described above cannot be implemented explicitly. To overcome this, a dynamic system must be constructed that can project a noisy point x closely onto its corresponding point z without requiring information about H.

To this end, the present invention first searches for outliers and then constructs the dispersion function from the data that remain after they are removed.

To search for outliers, an outlier detection algorithm is applied to the given data set. For example, outliers can be removed using the SVDD (support vector domain description) algorithm.

In the SVDD algorithm, the smallest sphere in a high-dimensional space that contains the given data is found by solving an optimization problem such as Equation 3.

[Equation 3]

Each data point has an image in a high-dimensional space; the space containing these images is called the feature space, and each image is called a feature vector. The SVDD algorithm searches the feature space for the smallest sphere that contains all the feature vectors, described by the center of the sphere and its radius R. The constraint therefore requires every feature vector to lie within radius R of the center of the sphere. A slack variable that relaxes the constraint is introduced to allow points to fall slightly outside radius R; however, assigning large slack values to many data points increases the objective, so minimizing the objective requires keeping the data outside the sphere to a minimum. Data with a positive slack value lie outside the sphere in the feature space and can be judged to be outliers. The constant C controls how much relaxation of the constraint is allowed: a high value of C raises the penalty for positive slack and thus requires most of the data to be contained in the sphere.

Using the method of Lagrange multipliers, the above optimization problem can be transformed into an optimization problem such as Equation 4.

[Equation 4]

Furthermore, to apply the kernel method, the above optimization problem can be converted into a dual problem such as Equation 5.

[Equation 5]

Here, the inner product in the high-dimensional space can be expressed by a kernel function; combining this with the dual problem, a decision function such as Equation 6 can be learned.

[Equation 6]

With this decision function, a point is classified as an outlier when the value of f(x) is negative, and the frequency of outliers can be adjusted through the value of C. In this way, SVDD yields the data from which the outliers have finally been removed.
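Equations 3 to 6 are not reproduced in this text. For orientation only, the standard SVDD formulation from the literature is sketched below; it matches the description above (a center, a radius R, slack variables, the constant C, a kernel, and a decision function that is negative for outliers). The symbols $a$, $\xi_i$, $\alpha_i$, $\Phi$, and $K$ are assumptions borrowed from that standard formulation and may differ from the patent's own notation.

```latex
% Primal problem (cf. Equation 3)
\min_{R,\,a,\,\xi}\; R^2 + C\sum_{i=1}^{N}\xi_i
\quad\text{s.t.}\quad \lVert \Phi(x_i) - a \rVert^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0 .

% Dual problem (cf. Equation 5), with K(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j)
\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i K(x_i, x_i)
 - \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j)
\quad\text{s.t.}\quad \sum_{i=1}^{N}\alpha_i = 1,\;\; 0 \le \alpha_i \le C .

% Decision function (cf. Equation 6); f(x) < 0 marks x as an outlier
f(x) = R^2 - K(x, x) + 2\sum_{i=1}^{N} \alpha_i K(x_i, x)
 - \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j) .
```

In this form the center is $a = \sum_i \alpha_i \Phi(x_i)$, so $f(x)$ equals $R^2$ minus the squared feature-space distance from $x$ to the center.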

Next, the dispersion function is constructed by learning it from the data from which the outliers have been removed (S230).

As mentioned above, the function H is not known a priori, so this is addressed by constructing a dynamic system.

To this end, a scalar function η is first defined as in Equation 7.

[Equation 7]

This η is called the dispersion function, and the following properties can be read from Equation 7. First, η is a non-negative scalar function. Second, the value of the dispersion function at each point estimates the degree of dispersion from the manifold. Third, it takes its minimum value of 0 at points on the manifold and increases monotonically with distance from the manifold. These properties contrast with those of a density function, which describes the distribution of the data. To construct η directly from the data, a parameterized dispersion estimate is used, given as in Equation 8.

[Equation 8]

Here, […] is a parameterized pseudo-density function estimate, which can be given as in Equation 9.

[Equation 9]

Here, […] are weights, and the sum of all the weights is 1. In Equation 9, the natural number N is the total number of data points, and the natural number M, which is smaller than N, is the number of outliers in the data; that is, N-M is the number of data points remaining after the outliers are removed. The function […] in Equation 9 is a kernel function. For example, a radial Gaussian function such as Equation 10 below can be used as this kernel function.

[Equation 10]

Here, h is a bandwidth parameter satisfying […]. Because the dynamic system of Equation 18, described later, requires only first-derivative information of η, the value of […] may be chosen arbitrarily, provided it exceeds the maximum of the function […] so that η remains non-negative; the result is the same.
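A minimal sketch of such a dispersion estimate follows. Because Equations 8 to 10 are not reproduced in this text, the sketch assumes that the dispersion estimate is a constant `c` minus a weighted Gaussian kernel sum over the outlier-free data; the exact kernel normalization, the uniform placeholder weights (the text obtains weights by optimization), the sample data, and all names are hypothetical.

```python
import math


def gaussian_kernel(x, xi, h):
    """Radial Gaussian kernel with bandwidth h (assumed form of Equation 10)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * h ** 2))


def pseudo_density(x, data, weights, h):
    """Weighted kernel sum over the outlier-free data (assumed form of Equation 9)."""
    return sum(w * gaussian_kernel(x, xi, h) for w, xi in zip(weights, data))


def dispersion(x, data, weights, h, c):
    """Assumed dispersion estimate: a constant c minus the pseudo-density."""
    return c - pseudo_density(x, data, weights, h)


# Hypothetical outlier-free samples lying on the unit circle.
data = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
        for k in range(12)]
weights = [1.0 / len(data)] * len(data)  # uniform placeholder weights summing to 1
h = 0.5
c = 1.0  # weights sum to 1 and the kernel is at most 1, so the density is at most c

eta_on = dispersion((1.0, 0.0), data, weights, h, c)   # a point on the circle
eta_off = dispersion((2.0, 0.0), data, weights, h, c)  # a point away from it
print(eta_on, eta_off)
```

Under these assumptions the estimate is smaller on the data than away from it, and changing `c` shifts every value equally without changing the gradient, which is why only the lower bound on the constant matters.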

However, because the points of […] are distributed around the manifold, it is desirable for the dispersion function to be as small as possible at […]. The value of […] is therefore set to be only slightly larger than the maximum of […].

Meanwhile, to find the parameters for which the dispersion function η takes its minimum value on the training data, an optimization problem such as Equation 11 is set up.

[Equation 11]

Then, to find the optimal weights for Equation 11, a linear programming problem such as Equation 12 is solved.

[Equation 12]

The optimal parameter set can be obtained with any available linear programming solver. To limit the variability of η, its norm […] in the reproducing kernel Hilbert space of K can be introduced into the objective function as a regularizing constraint (see T. Hastie, R. Tibshirani, and J. Friedman, "The Elements of Statistical Learning," Springer, 2009, pp. 168-169). This yields the following quadratic programming problem.

[Equation 13]

The problem of Equation 13 can likewise be solved with any available quadratic programming solver. Repeated simulations showed that the exact optimal values of […] are not critical: even if the weights are not exact, it suffices to obtain an η that can project the dispersed points onto […] through the dynamic system of Equation 18. In the present invention, therefore, the optimization time can be shortened by selecting approximate values of […] within some range of the optimal solution. Denoting the optimal solution by […] and […], the dispersed manifold can be obtained as in Equation 14.

[Equation 14]

Here, […], and […] can be set so that it has the same number of connected components as […].

A more detailed embodiment of the construction of the dispersion function is described below.

For example, in Equation 13, if the kernel function is given as an (N-M)×(N-M) matrix […], the quadratic programming problem can be set up as in Equation 15.

[Equation 15]

Equation 15 can be expressed more plainly in vector-matrix notation, as in Equation 16.

[Equation 16]

The given problem is thus a quadratic programming problem with (N-M) inequality constraints and one equality constraint, and it can be solved using the active set method as follows.

To apply the active set method, the quadratic programming problem of Equation 16 is rewritten in the general quadratic form of Equation 17.

[Equation 17]

Here, […] and […]. The optimization problem rewritten in this quadratic form can be solved to optimality by the algorithm shown in FIG. 3.

FIG. 3 is a diagram illustrating an algorithm for constructing a dispersion function according to an embodiment of the present invention.

As shown, the final dispersion function can be computed through the active set algorithm for solving Equation 17.

Referring again to FIG. 2, the entire data are projected onto the manifold using the dispersion function constructed through learning in the preceding step (S230). To this end, a dynamic system such as Equation 18 is constructed using the dispersion function.

[Equation 18]

Because η is twice differentiable, the dynamic system is guaranteed to have a unique trajectory from any initial point. A state vector x satisfying F(x) = 0 is called an equilibrium point of Equation 18.

Projection is then carried out for all data points using the algorithm of FIG. 4.

FIG. 4 is a diagram illustrating an algorithm for data projection according to an embodiment of the present invention.

As shown, the point to be projected is set to […], the time variable t needed to define the dynamic system is set to 0, and r is set to an arbitrary positive number smaller than the previously obtained optimum […]; the while loop of FIG. 4 is then executed.

Through this algorithm, a new data set […] is constructed.
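The projection step can be sketched as follows. Neither Equation 18 nor the algorithm of FIG. 4 is reproduced in this text, so this is only an illustration under stated assumptions: the dynamic system is taken to be the gradient flow dx/dt = -∇η(x), η is taken to be a constant minus a weighted Gaussian kernel sum, integration uses forward Euler, and the loop stops when the gradient is nearly zero (the stopping rule of FIG. 4, which involves the threshold r, is not known here). The data, weights, and names are hypothetical.

```python
import math


def kernel(x, xi, h):
    """Radial Gaussian kernel with bandwidth h (assumed form)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * h ** 2))


def eta(x, data, weights, h, c=1.0):
    """Assumed dispersion function: constant c minus a weighted kernel sum."""
    return c - sum(w * kernel(x, xi, h) for w, xi in zip(weights, data))


def grad_eta(x, data, weights, h):
    """Gradient of the assumed eta; the constant c drops out."""
    g = [0.0] * len(x)
    for w, xi in zip(weights, data):
        k = kernel(x, xi, h)
        for d in range(len(x)):
            # derivative of -w*K(x, xi) with respect to x is +w*K*(x - xi)/h^2
            g[d] += w * k * (x[d] - xi[d]) / h ** 2
    return g


def project(x0, data, weights, h, step=0.5, tol=1e-6, max_iter=5000):
    """Forward-Euler integration of dx/dt = -grad eta(x) until near equilibrium."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad_eta(x, data, weights, h)
        if math.sqrt(sum(v * v for v in g)) < tol:
            break
        x = [xv - step * gv for xv, gv in zip(x, g)]
    return x


# Hypothetical outlier-free samples on the unit circle, uniform weights.
data = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
        for k in range(12)]
weights = [1.0 / len(data)] * len(data)
h = 0.5

start = [1.6, 0.3]                       # a noisy point away from the data
z = project(start, data, weights, h)
radius = math.hypot(z[0], z[1])
print(z, eta(start, data, weights, h), eta(z, data, weights, h))
```

With a finite bandwidth the equilibrium sits slightly inside the sampled circle, because kernel smoothing pulls the ridge of the estimated density inward; a smaller `h` reduces this bias at the cost of a rougher estimate.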

FIG. 5 is a diagram for explaining the concept of noise data projection for manifold learning according to an embodiment of the present invention.

First, (a) shows the well-known swiss-roll data with noise data added. (b) shows the outlier data (shown as dots) and the dispersion function (shown as shading) obtained from the data excluding the outliers. (c) shows the trajectories (shown in red) along which the outliers are projected using the dispersion function obtained in (b). (d) shows the final data after the noise data have been projected by this process; the data are visibly more concentrated than in (a).
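For readers who want to reproduce a data set of the kind shown in panel (a), the sketch below generates a noisy swiss roll. The text does not state the parameters used for FIG. 5, so the parametrization (a commonly used one for this benchmark), the noise level, the sample count, and the function name are all assumptions.

```python
import math
import random


def noisy_swiss_roll(n_points, noise_std=0.5, seed=0):
    """Sample points on a swiss roll and add isotropic Gaussian noise."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        t = 1.5 * math.pi * (1.0 + 2.0 * rng.random())  # angle along the roll
        height = 21.0 * rng.random()                     # position across the roll
        clean = (t * math.cos(t), height, t * math.sin(t))
        points.append(tuple(v + rng.gauss(0.0, noise_std) for v in clean))
    return points


samples = noisy_swiss_roll(500)
print(len(samples), samples[0])
```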

본 발명의 일 실시예는 컴퓨터에 의해 실행되는 프로그램 모듈과 같은 컴퓨터에 의해 실행가능한 명령어를 포함하는 기록 매체의 형태로도 구현될 수 있다. 컴퓨터 판독 가능 매체는 컴퓨터에 의해 액세스될 수 있는 임의의 가용 매체일 수 있고, 휘발성 및 비휘발성 매체, 분리형 및 비분리형 매체를 모두 포함한다. 또한, 컴퓨터 판독가능 매체는 컴퓨터 저장 매체 및 통신 매체를 모두 포함할 수 있다. 컴퓨터 저장 매체는 컴퓨터 판독가능 명령어, 데이터 구조, 프로그램 모듈 또는 기타 데이터와 같은 정보의 저장을 위한 임의의 방법 또는 기술로 구현된 휘발성 및 비휘발성, 분리형 및 비분리형 매체를 모두 포함한다. 통신 매체는 전형적으로 컴퓨터 판독가능 명령어, 데이터 구조, 프로그램 모듈, 또는 반송파와 같은 변조된 데이터 신호의 기타 데이터, 또는 기타 전송 메커니즘을 포함하며, 임의의 정보 전달 매체를 포함한다. One embodiment of the present invention may also be embodied in the form of a recording medium including instructions executable by a computer, such as program modules, being executed by a computer. Computer readable media can be any available media that can be accessed by a computer and includes both volatile and nonvolatile media, removable and non-removable media. In addition, the computer-readable medium may include both computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Communication media typically includes any information delivery media, including computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or other transport mechanism.

Although the method and system of the present invention have been described with reference to specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present invention is provided for illustration, and a person of ordinary skill in the art to which the invention pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and from their equivalents should be construed as falling within the scope of the present invention.

100: data collection unit 110: manifold learning unit
120: data classification unit

Claims

A data classification method based on manifold learning, comprising:
removing outliers from the entire collected data;
constructing a dispersion function based on the data from which the outliers have been removed;
projecting the entire data onto a manifold through a dynamic system based on the dispersion function; and
applying a manifold learning algorithm based on the data obtained in the projecting step.

The method according to claim 1,
further comprising performing data classification based on the data to which the manifold learning algorithm has been applied.

The method according to claim 1,
wherein the step of removing the outliers
removes the outliers based on a support vector domain description (SVDD) algorithm.

The method according to claim 1,
wherein the step of constructing the dispersion function comprises:
setting a dispersion function based on a parameterized dispersion estimator;
solving an optimization problem to find the parameters that minimize the dispersion function over the data from which the outliers have been removed; and
obtaining the dispersion function by applying the weights computed by solving the optimization problem to the parameterized dispersion estimator.

The method according to claim 4,
wherein the step of solving the optimization problem
solves the quadratic programming problem of Equation 1 below based on the data from which the outliers have been removed,
wherein, in Equation 1, the natural number N is the number of the entire data, and
the natural number M (where M < N) is the number of the outliers.
[Equation 1]

ω: weight
γ: threshold
η: dispersion function
x: data
P: dispersion estimator
k: kernel function

The method according to claim 1,
wherein the step of projecting onto the manifold through the dynamic system
projects the entire data through the dynamic system of Equation 2 below,
wherein, in Equation 2, [symbol not reproduced] is the weight computed in the step of constructing the dispersion function, and
[symbol not reproduced] (where [condition not reproduced]) is a preset noise.
[Equation 2]

η: dispersion function
x: data