KR101869026B1

KR101869026B1 - Method and apparatus for clustering software

Info

Publication number: KR101869026B1
Application number: KR1020160103775A
Authority: KR
Inventors: 박종화; 조성제
Original assignee: 단국대학교 산학협력단
Priority date: 2016-08-16
Filing date: 2016-08-16
Publication date: 2018-06-20
Also published as: KR20180019429A

Abstract

The present invention relates to a software clustering method and apparatus. The disclosed software clustering method includes: generating data containing a plurality of birthmarks extracted from a plurality of software programs; assigning weights to the birthmarks through text mining on the data; vectorizing the birthmarks into numerical vectors according to their weighted frequency of occurrence and expressing the result as a matrix; and clustering the software programs using a specific number of categories based on the data expressed as the matrix. Because the birthmarks extracted from the software programs are reduced through preprocessing before they are used for clustering, the accuracy of categorization improves. As a result, overhead such as the number of comparisons and the time required for similarity analysis between programs can be reduced efficiently.

Description

METHOD AND APPARATUS FOR CLUSTERING SOFTWARE

The present invention relates to a software clustering method and apparatus, and more particularly to a software clustering method and apparatus that use birthmarks extracted from a plurality of software programs for clustering.

Software categorization is the task of grouping programs that are more similar to one another than to programs in other groups into a single set of programs. Software categorization can narrow the search range of a software filtering system that identifies and blocks illegal programs distributed over the Internet.

Because illegal software is frequently modified through illegal copying before it is distributed, such a software filtering system needs to measure the similarity between software programs, and the time required for identification increases as the number of programs to be compared grows.

Software clustering or classification is needed for this purpose. Software clustering analyzes the features of uncategorized software and groups similar programs into clusters. Software classification defines categories according to software functions and analyzes the features of the software so that similar programs fall into the same category.

Software uploaded to open-source sites is categorized by main function separately on each website, so the categories defined by main function are not always the same from one website to another. A clustering technique therefore needs to be applied to improve the accuracy of categorization by main function, and thereby to reduce the time, the number of comparisons, and the range required for similarity analysis over a large number of programs.

Conventionally, machine learning based studies on software have been conducted as research on software classification. Software classification has been performed using birthmarks that identify the functions of a program, such as APIs (Application Programming Interfaces) or strings.

However, analyzing the similarity of a program against a large number of other programs incurs overhead in the form of indirectly or additionally required time.

Korean Patent No. 10-1579347, registered on December 15, 2015.

According to an embodiment, there are provided a software clustering method and apparatus that extract birthmarks from a plurality of software programs, reduce the data through a preprocessing process, and then use the reduced data for clustering.

The problems to be solved are not limited to those mentioned above, and other problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

A software clustering method according to a first aspect may include: generating data containing a plurality of birthmarks extracted from a plurality of software programs; assigning weights to the plurality of birthmarks through text mining on the data; vectorizing the plurality of birthmarks into numerical vectors according to their weighted frequency of occurrence and expressing the result as a matrix; and clustering the plurality of software programs using a specific number of categories based on the data expressed as the matrix.

A software clustering apparatus according to a second aspect may include: a data generation unit that generates data containing a plurality of birthmarks extracted from a plurality of software programs; an analysis unit that assigns weights to the plurality of birthmarks through text mining on the data, vectorizes the plurality of birthmarks into numerical vectors according to their weighted frequency of occurrence, and expresses the result as a matrix; and a processing unit that clusters the plurality of software programs using a specific number of categories based on the data expressed as the matrix.

According to the embodiment, birthmarks are extracted from a plurality of software programs, and the data is reduced through a preprocessing process before it is used for clustering, which improves the accuracy of categorization. As a result, overhead such as the number of comparisons and the time required for similarity analysis between programs can be reduced efficiently.

FIG. 1 is a block diagram of a software clustering apparatus according to an embodiment.
FIG. 2 is a flowchart illustrating a software clustering method according to an embodiment.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains, and the present invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted when they are judged likely to obscure the gist of the present invention unnecessarily. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of users or operators. Their definitions should therefore be based on the content of this specification as a whole.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings.

FIG. 1 is a block diagram of a software clustering apparatus according to an embodiment.

As shown in FIG. 1, the software clustering apparatus 100 includes a data generation unit 110, an analysis unit 120, and a processing unit 130.

The data generation unit 110 generates data containing a plurality of birthmarks extracted from a plurality of software programs.

The data generation unit 110 may extract API (Application Programming Interface) information from the software and define it as the birthmark.

The analysis unit 120 assigns weights to the plurality of birthmarks through text mining on the data generated by the data generation unit 110, vectorizes the birthmarks into numerical vectors according to their weighted frequency of occurrence, and expresses the result as a matrix.

Among the plurality of birthmarks, the analysis unit 120 assigns a relatively lower weight to birthmarks that are commonly used by many software programs, in order to reduce their influence relative to the other birthmarks.

In addition, before the matrix obtained by vectorization is used for clustering, the analysis unit 120 removes its zero elements and converts it from a sparse matrix to a dense matrix.

The analysis unit 120 also reduces the matrix obtained by vectorization using singular value decomposition (SVD) before it is used for clustering.

The processing unit 130 clusters the plurality of software programs using a specific number of categories based on the data expressed as a matrix by the analysis unit 120.

According to an embodiment, the data generation unit 110, the analysis unit 120, and the processing unit 130 may be implemented by a processor such as a CPU (Central Processing Unit).

After assigning weights to the plurality of birthmarks, the software clustering apparatus 100 may use machine learning based on the scikit-learn library for vectorization and clustering. For example, the weights may be assigned by applying TF-IDF (Term Frequency-Inverse Document Frequency). The TfidfVectorizer module of the scikit-learn library may be used to apply TF-IDF and perform the vectorization, the TruncatedSVD module may be used for singular value decomposition, and the KMeans module may be used for clustering. A Gaussian mixture model (GMM, also called a mixture of Gaussians) algorithm may also be used for clustering.
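
The Gaussian mixture alternative mentioned at the end of the paragraph above can be sketched as follows. The patent names only the algorithm, so the use of scikit-learn's `GaussianMixture` class, the two-component setting, and the small feature matrix are all assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical reduced feature vectors, one row per program (not from the patent).
features = np.array([
    [1.0, 0.1], [0.9, 0.3], [1.1, 0.2],   # programs of one kind
    [0.1, 1.0], [0.3, 0.9], [0.2, 1.1],   # programs of another kind
])

# Assumed setting: two mixture components, fixed seed for repeatability.
gmm = GaussianMixture(n_components=2, random_state=0)
labels = gmm.fit_predict(features)          # hard cluster label per program
membership = gmm.predict_proba(features)    # soft membership per component
```

Unlike K-means, the mixture model also yields a membership probability for each component, which may be useful when a program plausibly belongs to more than one category.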

FIG. 2 is a flowchart illustrating a software clustering method according to an embodiment.

As shown in FIG. 2, the software clustering method according to the embodiment includes generating data containing a plurality of birthmarks extracted from a plurality of software programs (S210).

Here, API information may be extracted from the software and defined as the birthmark.

The method further includes assigning weights to the plurality of birthmarks through text mining on the data containing the birthmarks (S220).

The method then further includes vectorizing the plurality of birthmarks into numerical vectors according to their weighted frequency of occurrence and expressing the result as a matrix (S230).

Here, among the plurality of birthmarks, a relatively lower weight is assigned to birthmarks that are commonly used by many software programs, in order to reduce their influence relative to the other birthmarks.

The method further includes removing the zero elements of the matrix obtained by vectorization and converting it from a sparse matrix to a dense matrix before it is used for clustering (S240).

The method further includes reducing the matrix obtained by vectorization using singular value decomposition before it is used for clustering (S250).

Next, the method further includes clustering the plurality of software programs using a specific number of categories based on the data that has been expressed as a matrix, converted from the sparse matrix, and reduced (S260).

Hereinafter, the software clustering process performed by the software clustering apparatus 100 according to the embodiment is described in more detail with reference to FIGS. 1 and 2.

First, the data generation unit 110 generates data containing a plurality of birthmarks extracted from a plurality of software programs (S210).

For example, the data generation unit 110 may extract the API information contained in the IAT (Import Address Table) of a file in PE (Portable Executable) format, the Windows executable file format, and define it as the birthmark. A software birthmark is unique information that can be used to identify a program from its executable file. Such information is used to measure the similarity between software programs in order to detect program theft, and for software identification, malicious code detection, and the like. Birthmarks can be broadly divided into static birthmarks and dynamic birthmarks. A static birthmark is feature information that can be extracted from the executable file without running the program, whereas a dynamic birthmark is feature information that can be extracted from the course of execution when the program is run under a given system environment, external input, and so on. Because a static birthmark analyzes the code without running the program, the analysis can be automated, and because it analyzes the entire code, it offers higher code coverage and lower extraction overhead than a dynamic birthmark.
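
A minimal sketch of how step S210 might shape its output, assuming the API names have already been read from each program's import table by some PE parser. The program names, the API lists, and the helper `to_birthmark_documents` are hypothetical and serve only to show one "document" per program, which is the form text-mining tools expect.

```python
# Hypothetical import lists: program name -> API names read from its import table.
imports_by_program = {
    "editor.exe": ["CreateFileW", "ReadFile", "WriteFile", "HeapAlloc"],
    "browser.exe": ["socket", "connect", "recv", "HeapAlloc", "CreateFileW"],
    "player.exe": ["waveOutOpen", "waveOutWrite", "ReadFile", "HeapAlloc"],
}


def to_birthmark_documents(imports):
    """Turn each program's API list into one whitespace-separated document."""
    names = sorted(imports)  # fixed order so matrix rows are reproducible
    documents = [" ".join(imports[name]) for name in names]
    return names, documents


program_names, documents = to_birthmark_documents(imports_by_program)
```

Keeping the program names in a list parallel to the documents lets later steps map matrix rows and cluster labels back to the original files.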

The analysis unit 120 then assigns weights to the plurality of birthmarks through text mining on the data generated by the data generation unit 110 (S220), vectorizes the birthmarks into numerical vectors according to their weighted frequency of occurrence, and expresses the result as a matrix (S230).

According to the embodiment, after assigning weights to the plurality of birthmarks, the analysis unit 120 of the software clustering apparatus 100 may use machine learning based on the scikit-learn library for vectorization and clustering.

Among the plurality of birthmarks, the analysis unit 120 assigns a relatively lower weight to birthmarks that are commonly used by many software programs, in order to reduce their influence relative to the other birthmarks. The APIs extracted from software as birthmarks include APIs that control functions essential to any program, such as file control, networking, and memory allocation. Because such APIs are present in most programs, they are hard to treat as information unique to a program and lower the reliability of the clustering, so they are given relatively lower weights than other APIs. To reduce the influence of APIs that are frequently called by many programs, the weights may be applied using TF-IDF, a technique from text mining. The TfidfVectorizer module of the scikit-learn library may be used to apply TF-IDF and perform the vectorization.
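
The weighting described above can be sketched with `TfidfVectorizer`. The three API documents are hypothetical, and the `lowercase=False` and `token_pattern` settings are assumptions chosen so that API names are kept intact and case-sensitive; the patent does not state which parameters were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical birthmark documents: one whitespace-separated API list per program.
documents = [
    "CreateFileW ReadFile WriteFile HeapAlloc",
    "socket connect recv HeapAlloc CreateFileW",
    "waveOutOpen waveOutWrite ReadFile HeapAlloc",
]

# Assumed settings: keep case, treat every whitespace-delimited token as a term.
vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
tfidf_matrix = vectorizer.fit_transform(documents)  # rows: programs, columns: APIs

# Inverse document frequency learned for each API name.
idf_by_api = {
    api: vectorizer.idf_[column] for api, column in vectorizer.vocabulary_.items()
}
```

In this toy data `HeapAlloc` appears in every program and therefore receives the smallest IDF, while an API that appears in a single program, such as `socket`, receives the largest.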

In addition, before the matrix obtained by vectorization is used for clustering, the analysis unit 120 removes its zero elements and converts it from a sparse matrix to a dense matrix. The space into which the analysis unit 120 vectorizes the data generated by the data generation unit 110 is a sparse matrix containing many zero elements, that is, many '0' entries, so it contains a great deal of unnecessary space. For example, the analysis unit 120 may reduce the space by applying the fit_transform() function of the TfidfVectorizer module. The fit_transform() function removes the '0' entries from the vectorized data and converts it from a sparse matrix to a dense matrix, thereby reducing the space (S240).
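
A minimal sketch of the storage saving discussed above, using hypothetical API documents. One caveat, stated as a library fact rather than as the patent's wording: in scikit-learn, `TfidfVectorizer.fit_transform()` returns a SciPy sparse matrix, which stores only the non-zero entries, and calling `.toarray()` on it produces a NumPy array in which every zero is materialized. The sketch simply counts both.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical birthmark documents (not from the patent).
documents = [
    "CreateFileW ReadFile WriteFile HeapAlloc",
    "socket connect recv HeapAlloc CreateFileW",
    "waveOutOpen waveOutWrite ReadFile HeapAlloc",
]

vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
sparse_matrix = vectorizer.fit_transform(documents)  # only non-zero cells are stored
dense_matrix = sparse_matrix.toarray()               # every cell, zeros included

stored_values = sparse_matrix.nnz                            # non-zero entries kept
total_cells = dense_matrix.shape[0] * dense_matrix.shape[1]  # cells in the full grid
is_sparse = sparse.issparse(sparse_matrix)
```

With real import tables the vocabulary can run to thousands of API names while each program uses only a small fraction of them, so the gap between `stored_values` and `total_cells` is typically far larger than in this toy case.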

The analysis unit 120 then reduces the space further by applying singular value decomposition to the matrix obtained by vectorization before it is used for clustering. Although the space was reduced in step S240 by converting the sparse matrix, the number of extracted APIs may still be large, so the complexity may still be too high for use in machine learning. For example, the analysis unit 120 may reduce the space using truncated SVD, which is used for data compression, noise removal, and the like. To this end, the analysis unit 120 may use the TruncatedSVD module of the scikit-learn library. Because truncated SVD can be applied to any matrix, whatever its form, it is suitable for the matrix that has had TF-IDF applied and has been converted from the sparse-matrix state (S250).
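
The reduction in step S250 can be sketched as follows. The documents and the choice of two components are assumptions for illustration; the patent does not state how many components were kept.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical birthmark documents (not from the patent).
documents = [
    "CreateFileW ReadFile WriteFile HeapAlloc",
    "socket connect recv HeapAlloc CreateFileW",
    "waveOutOpen waveOutWrite ReadFile HeapAlloc",
    "socket connect send HeapAlloc",
]

tfidf_matrix = TfidfVectorizer(
    lowercase=False, token_pattern=r"\S+"
).fit_transform(documents)

# Assumed setting: keep 2 components; fixed seed because the default solver is randomized.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf_matrix)  # accepts the sparse TF-IDF matrix directly

original_width = tfidf_matrix.shape[1]
reduced_width = reduced.shape[1]
```

`TruncatedSVD` works on the sparse TF-IDF matrix without densifying it first, which is one reason it is commonly paired with `TfidfVectorizer`.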

In this way, data that has been preprocessed by the analysis unit 120 through weighting, vectorization, sparse-matrix conversion, singular value decomposition, and so on is provided to the processing unit 130.

Next, the processing unit 130 clusters the plurality of software programs using a specific number of categories based on the preprocessed data. For example, the processing unit 130 may use the K-means algorithm for clustering by using the KMeans module of the scikit-learn library. The processing unit 130 also uses the silhouette coefficient as a measure of the clustering result. The silhouette coefficient indicates how well the data has been clustered. For example, the processing unit 130 measures the silhouette coefficient by applying the Euclidean distance, cosine distance, and Manhattan distance formulas in turn to calculate the distance between an arbitrarily selected data point i and the other data points. The quality of the clustering is judged on the basis of the silhouette coefficients calculated with each distance formula. For this purpose, the pairwise_distances module of the scikit-learn library, which provides distance measurement, may be used (S260).
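
Step S260 can be sketched as K-means followed by silhouette scoring under the three distance measures named above. For a data point i, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance from i to the other members of its own cluster and b is the mean distance from i to the members of the nearest other cluster. The feature matrix, the number of clusters, and the use of `silhouette_score` (which computes the pairwise distances internally) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical preprocessed feature vectors, one row per program (not from the patent).
reduced = np.array([
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
])

# Assumed setting: two categories; n_init and random_state fixed for repeatability.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

# Mean silhouette coefficient under each distance measure.
scores = {
    metric: silhouette_score(reduced, labels, metric=metric)
    for metric in ("euclidean", "cosine", "manhattan")
}
```

Scores close to 1 indicate compact, well separated clusters, and scores near 0 or below indicate overlapping or misassigned points. Comparing the three values shows how sensitive the assessment is to the choice of distance.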

As described so far, according to the embodiment, birthmarks are extracted from a plurality of software programs, and the data is reduced through a preprocessing process before it is used for clustering, which improves the accuracy of categorization. As a result, overhead such as the number of comparisons and the time required for similarity analysis between programs can be reduced efficiently. This software clustering method can be used to classify not only Windows executable files but also executable files such as Android apps and iOS apps.
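
The claimed reduction in comparisons can be illustrated with simple arithmetic. The two-stage lookup below (compare a new program with each cluster centre, then only with the members of the chosen cluster) is an assumption about how a filtering system might use the clusters; the patent does not spell out this procedure, and the cluster sizes are invented.

```python
def comparisons_without_clustering(total_programs):
    """A new program is compared with every known program."""
    return total_programs


def comparisons_with_clustering(cluster_sizes, assigned_cluster):
    """One comparison per cluster centre, then one per member of the chosen cluster."""
    return len(cluster_sizes) + cluster_sizes[assigned_cluster]


# Hypothetical collection of 1000 programs split into three clusters.
cluster_sizes = [400, 250, 350]
baseline = comparisons_without_clustering(sum(cluster_sizes))
with_clusters = comparisons_with_clustering(cluster_sizes, assigned_cluster=1)
```

Under these assumptions the new program is compared 253 times instead of 1000 times. The actual saving depends on how evenly the programs are distributed over the clusters.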

In addition, a software filtering system that employs the clustering method and apparatus according to an embodiment of the present invention may be used to efficiently identify illegal apps (for example, stolen apps, plagiarized apps, hacked apps, and obscene apps) in various app markets (for example, the Naver app store, *** play, and the Apple App Store). For example, it may also be used to check whether a new app is illegal when someone develops and uploads it.

Combinations of the blocks of the block diagram and the steps of the flowchart attached to this specification may be performed by computer program instructions. These computer program instructions may be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, so that the instructions executed by that processor create means for performing the functions described in each block of the block diagram or each step of the flowchart. These computer program instructions may also be stored in a computer-usable or computer-readable recording medium that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in that medium can produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart. The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to create a computer-executed process, and the instructions that run on the equipment can provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or step may represent a module, a segment, or a portion of code that contains one or more executable instructions for executing the specified logical function or functions. It should also be noted that, in some alternative embodiments, the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or they may sometimes be performed in reverse order, depending on the corresponding function.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

100: software clustering apparatus
110: data generation unit
120: analysis unit
130: processing unit

Claims

A software clustering method performed by a software clustering apparatus, the method comprising:
generating data containing a plurality of birthmarks extracted from a plurality of software programs;
assigning weights to the plurality of birthmarks through text mining on the data;
vectorizing the plurality of birthmarks into numerical vectors according to their weighted frequency of occurrence and expressing the result as a matrix; and
clustering the plurality of software programs using a specific number of categories based on the data expressed as the matrix.

The software clustering method of claim 1,
wherein the assigning of the weights assigns a relatively lower weight to a birthmark that is commonly used by many of the software programs, among the plurality of birthmarks, than to the other birthmarks.

The software clustering method of claim 1,
wherein API (Application Programming Interface) information is extracted from the software and defined as the birthmark.

The software clustering method of claim 1,
wherein machine learning using the scikit-learn library is performed for the assigning of the weights, the vectorization, and the clustering.

The software clustering method of claim 1,
wherein the weights are assigned by applying term frequency-inverse document frequency (TF-IDF).

The software clustering method of claim 5,
wherein the TfidfVectorizer module of the scikit-learn library is used for the application of the TF-IDF and for the vectorization.
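
As a rough illustration of what `TfidfVectorizer` produces, the sketch below builds the weighted matrix for three hypothetical programs. The input strings and the tokenizer settings (`lowercase=False`, `token_pattern=r"\S+"`) are assumptions chosen so that API names are kept intact; the claims do not specify them.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical birthmark documents; ReadFile occurs twice in the first program.
docs = ["CreateFileW ReadFile ReadFile", "socket connect", "CreateFileW socket"]

vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
tfidf = vectorizer.fit_transform(docs)  # rows: programs, columns: distinct birthmarks

vocab = vectorizer.vocabulary_  # birthmark name -> column index
print(tfidf.shape, sparse.issparse(tfidf))

# With the default norm="l2", every row is scaled to unit Euclidean length.
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(row_norms)
```

In the first row, `ReadFile` outweighs `CreateFileW` because it occurs more often in that program and in fewer programs overall.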

The software clustering method of claim 1,
further comprising converting the matrix expressed by the vectorization from a sparse matrix to a dense matrix by removing zero elements before the matrix is used for the clustering.
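
The claim does not say how the conversion is carried out. One plausible reading, offered here only as an assumption, is a change of storage format: a SciPy sparse matrix, which stores only the non-zero entries, is converted into an ordinary NumPy array, which some scikit-learn estimators require as input.

```python
import numpy as np
from scipy import sparse

# Hypothetical 2 x 3 weighted matrix held in compressed sparse row (CSR) form.
sparse_matrix = sparse.csr_matrix(np.array([[0.0, 0.7, 0.0], [0.5, 0.0, 0.0]]))
print(sparse_matrix.nnz)  # number of explicitly stored (non-zero) entries

# toarray() materializes the same values as a dense NumPy array.
dense_matrix = sparse_matrix.toarray()
print(type(dense_matrix).__name__, dense_matrix.shape)
```

The dense form uses memory for every cell, so for a large vocabulary it is usually preferable to reduce the number of columns first.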

The software clustering method of claim 1,
further comprising reducing the matrix expressed by the vectorization using singular value decomposition (SVD) before the matrix is used for the clustering.

The software clustering method of claim 8,
wherein the TruncatedSVD module of the scikit-learn library is used for the singular value decomposition.
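
A minimal sketch of the reduction step follows. The input documents and the target of two components are assumptions for illustration; the claims do not fix the reduced dimensionality.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical birthmark documents, one per program.
docs = [
    "CreateFileW ReadFile WriteFile CloseHandle",
    "CreateFileW ReadFile WriteFile CloseHandle FlushFileBuffers",
    "socket connect send recv",
    "socket connect send recv closesocket",
]

tfidf = TfidfVectorizer(lowercase=False, token_pattern=r"\S+").fit_transform(docs)

# TruncatedSVD accepts the sparse matrix directly and returns a dense,
# lower-dimensional array with one row per program.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
```

Because the output is already a dense array, it can be passed to clustering estimators without a separate sparse-to-dense conversion.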

The software clustering method of claim 1,
wherein the Kmeans module of the scikit-learn library or a Gaussian mixture model (GMM, mixture of Gaussians) is used for the clustering.
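
Both clustering options can be sketched on the same input. In scikit-learn the k-means class is spelled `KMeans` and the Gaussian mixture model is `GaussianMixture`. The feature values below are hypothetical stand-ins for a reduced birthmark matrix, and the number of categories (2) is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical reduced feature matrix: two well-separated groups of four programs each.
features = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
    [5.0, 5.0], [5.2, 5.1], [5.1, 5.3], [5.3, 5.2],
])

# Option 1: k-means assigns each program to the nearest of n_clusters centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Option 2: a Gaussian mixture model fits n_components Gaussians and assigns
# each program to the component with the highest posterior probability.
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(features)

print(kmeans_labels)
print(gmm_labels)
```

The two methods may number the clusters differently, so results should be compared by grouping rather than by label value.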

A computer program stored on a computer-readable recording medium
to cause a processor to perform the software clustering method of any one of claims 1 to 10.

A computer-readable recording medium storing a computer program
that causes a processor to perform the software clustering method of any one of claims 1 to 10.

A software clustering apparatus comprising:
a data generation unit configured to generate data including a plurality of birthmarks extracted from a plurality of software programs;
an analysis unit configured to assign weights to the plurality of birthmarks through text mining on the data, to perform vectorization that converts the plurality of birthmarks into numerical vectors according to their weighted frequencies of occurrence, and to express the result as a matrix; and
a processing unit configured to cluster the plurality of software programs using a specific number of categories based on the data expressed as the matrix.
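
The division of work among the three units can be sketched as three small classes. Everything below is hypothetical: the class and method names, the input format (a mapping from program name to a list of API names), and the use of TF-IDF and k-means are assumptions for illustration, not the structure of the patented apparatus.

```python
from typing import Dict, List, Tuple

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


class DataGenerationUnit:
    """Builds one birthmark document per program (hypothetical unit 110)."""

    def generate(self, api_calls: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
        names = list(api_calls)
        docs = [" ".join(api_calls[name]) for name in names]
        return names, docs


class AnalysisUnit:
    """Weights and vectorizes the birthmarks (hypothetical unit 120)."""

    def analyze(self, docs: List[str]):
        vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
        return vectorizer.fit_transform(docs)


class ProcessingUnit:
    """Clusters the programs into a fixed number of categories (hypothetical unit 130)."""

    def __init__(self, n_categories: int) -> None:
        self.n_categories = n_categories

    def cluster(self, matrix):
        model = KMeans(n_clusters=self.n_categories, n_init=10, random_state=0)
        return model.fit_predict(matrix)


class SoftwareClusteringApparatus:
    """Chains the three units (hypothetical apparatus 100)."""

    def __init__(self, n_categories: int) -> None:
        self.data_generation_unit = DataGenerationUnit()
        self.analysis_unit = AnalysisUnit()
        self.processing_unit = ProcessingUnit(n_categories)

    def run(self, api_calls: Dict[str, List[str]]) -> Dict[str, int]:
        names, docs = self.data_generation_unit.generate(api_calls)
        matrix = self.analysis_unit.analyze(docs)
        labels = self.processing_unit.cluster(matrix)
        return dict(zip(names, labels.tolist()))


# Usage with made-up programs and API names.
sample = {
    "editor_a": ["CreateFileW", "ReadFile", "WriteFile", "CloseHandle"],
    "editor_b": ["CreateFileW", "ReadFile", "WriteFile", "CloseHandle", "FlushFileBuffers"],
    "chat_a": ["socket", "connect", "send", "recv"],
    "chat_b": ["socket", "connect", "send", "recv", "closesocket"],
}
result = SoftwareClusteringApparatus(n_categories=2).run(sample)
print(result)
```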

The software clustering apparatus of claim 13,
wherein the analysis unit assigns a relatively lower weight to a birthmark that is commonly used by many of the software programs than to the other birthmarks.

The software clustering apparatus of claim 13,
wherein application programming interface (API) information is extracted from the software programs and defined as the birthmarks.

The software clustering apparatus of claim 13,
wherein machine learning using the scikit-learn library is performed for the assigning of the weights, the vectorization, and the clustering.

The software clustering apparatus of claim 13,
wherein the weights are assigned by applying term frequency-inverse document frequency (TF-IDF).

The software clustering apparatus of claim 17,
wherein the TfidfVectorizer module of the scikit-learn library is used for the application of the TF-IDF and for the vectorization.

The software clustering apparatus of claim 13,
wherein the analysis unit converts the matrix expressed by the vectorization from a sparse matrix to a dense matrix by removing zero elements before the matrix is used for the clustering.

The software clustering apparatus of claim 13,
wherein the analysis unit reduces the matrix expressed by the vectorization using singular value decomposition (SVD) before the matrix is used for the clustering.

The software clustering apparatus of claim 20,
wherein the TruncatedSVD module of the scikit-learn library is used for the singular value decomposition.

The software clustering apparatus of claim 13,
wherein the Kmeans module of the scikit-learn library or a Gaussian mixture model (GMM, mixture of Gaussians) is used for the clustering.