KR101752257B1

KR101752257B1 - A system of linked open data cloud information service and a providing method thereof, and a recoding medium storing program for executing the same

Info

Publication number: KR101752257B1
Application number: KR1020160097034A
Authority: KR
Inventors: 최미숙
Original assignee: 최미숙
Priority date: 2016-07-29
Filing date: 2016-07-29
Publication date: 2017-07-11

Abstract

The present invention relates to a linked open data cloud information service system and a method for providing the same. More specifically, it provides a linked open data cloud information service system, a method for providing the same, and a recording medium storing a program for implementing the method, which collect unstructured data associated with structured data, convert the collected data into triples in Resource Description Framework (RDF) format, continuously store and update them, and publish linked open data (LOD) in response to user queries, so that not only public structured data but also the associated unstructured data can be processed and provided.

Description

Linked open data cloud information service system, method for providing the same, and recording medium storing a program for implementing the same {A system of linked open data cloud information service and a providing method thereof, and a recoding medium storing program for executing the same}

The present invention relates to a linked open data cloud information service system and a method for providing the same. More specifically, it relates to a linked open data cloud information service system, a method for providing the same, and a recording medium storing a program for implementing the method, which collect unstructured data associated with structured data, convert the collected data into triples in Resource Description Framework (RDF) format, continuously store and update them, and publish linked open data (LOD) in response to user queries, so that not only public structured data but also the associated unstructured data can be processed and provided.

With new types of multimedia content, the broad expansion of social network services (SNS), and the spread and use of smart devices, the volume of data generated and distributed on the web is growing exponentially. The enormous amount of data that exists on the web and continues to grow can be used to interpret the world. This is "big data". Put simply, big data means a vast amount of digitized information. By filtering out unnecessary data from big data and extracting and analyzing only the useful information, it is possible to read people's thoughts, opinions, and trends, and further to predict their behavior in advance. Because of this usefulness, big data is one of the next-generation information technologies (IT) attracting attention not only in Korea but worldwide.

The domestic big data market was worth roughly 300 billion won in 2015 and is expected to grow to 1 trillion won by 2020. The domestic market related to big data is also growing by more than 28% per year. The use of big data is currently concentrated in research and consulting, but there is also considerable potential for its use in commercial and advertising markets.

"Public data" refers to data handled by public institutions that has been opened so that anyone can freely use, redesign, or reproduce it. The governments of the United States, the United Kingdom, Korea, and others have long made various efforts to spread public data. One of the efforts to improve the quality of public data is Linked Open Data (LOD).

LOD is data that has the characteristics of both Linked Data and Open Data. In other words, it is open data built in accordance with the principles of linked data. These principles are being developed mainly by the World Wide Web Consortium (W3C), the global web standards body. The four principles are as follows.

The four principles of Linked Data

1. Use Uniform Resource Identifiers (URIs).

2. URIs must be accessible via the HTTP protocol.

3. Use standards such as RDF and SPARQL.

4. Include rich link information.

The defining feature of linked data is that it uses Uniform Resource Identifiers (URIs). The concept is similar to the familiar Uniform Resource Locator (URL). The difference is that a URL indicates the type and location of a particular information resource, whereas a URI refers to an object (identifier) stored on the web.

Today's web is connected on a document basis. With URIs, the web can be connected around data. Suppose, for example, that an arboretum creates data about pine trees. The arboretum compiles and posts material such as the name and meaning of the pine, its distribution area, and photographs. Meanwhile, the National Library of Korea holds books related to pine trees and has entered and posted them as data. Overlapping material has thus appeared on the web.

In the past, the information on the two websites was connected by displaying the related information manually, for example by placing "Reference: (address of the National Library of Korea website)" at the bottom of the arboretum's web page. With URIs, the data held by each institution can be linked to and shown directly on a web page. The arboretum can connect "information on books about pine trees" directly to its existing pine tree introduction page. Although the information is stored in different databases, each side can fetch and use the data it needs through the web.

LOD does not come for free. It requires an investment of money and time, and specialist developers must be brought in. Even so, the reason many public institutions should take an interest in LOD is clear: LOD improves the quality of public data.

The W3C is at the forefront of advocating the importance of LOD. Tim Berners-Lee defines the quality of open data by dividing it into five levels. The data at each level is as follows.

Level 1 is data published without regard to format. A table that has been scanned and uploaded as a PDF or HWP file is level-1 public data. Level 2 provides the table as structured data rather than as an image; providing a table as an Excel file is an example. Level 3 is data in a non-proprietary format; data published as a CSV file instead of an Excel file can be called level-3 data. Level 4 provides URIs so that entities can be referenced. The final level, level 5, is data that can be linked with other data. Tim Berners-Lee thus ranked LOD as the highest level of data. Many public institutions refer to this five-level scheme when discussing data quality.
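The five levels described above can be sketched as a small classifier. This is only an illustration: the `Dataset` fields and the `open_data_level` function are hypothetical names, and the sketch assumes that each level requires every lower level to be satisfied.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset."""
    machine_readable: bool     # structured data rather than a scanned image
    open_format: bool          # non-proprietary format such as CSV
    uses_uris: bool            # entities are identified by URIs
    links_to_other_data: bool  # links out to other datasets


def open_data_level(ds: Dataset) -> int:
    """Return the level (1 to 5), assuming levels are cumulative."""
    level = 1
    for satisfied in (ds.machine_readable, ds.open_format,
                      ds.uses_uris, ds.links_to_other_data):
        if not satisfied:
            break
        level += 1
    return level


scanned_pdf = Dataset(False, False, False, False)
excel_table = Dataset(True, False, False, False)
csv_table = Dataset(True, True, False, False)
linked_open_data = Dataset(True, True, True, True)
```

Under these assumptions a scanned PDF rates level 1, an Excel table level 2, a CSV file level 3, and fully linked data level 5.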

According to the "2014 Linked Open Data Domestic Implementation Casebook" published by the National Information Society Agency (NIA), the data published by domestic information management organizations and ordinary websites rates one to three stars on the above scale.

As of 2014, Korean public institutions had built LOD in only about ten cases. LOD in Korea can be regarded as just beginning to take root. For LOD to be built properly, above all the volume of data itself must be large, and there must be many elements that can link it together. In Korea, the volume of public data is currently small, which appears to be why there are few cases of LOD use. Overseas, DBpedia is cited as a good example of LOD use. DBpedia is a repository that collects information from Wikipedia converted into RDF. Using the information in DBpedia, Wikipedia information can be displayed directly on a personal website.

Therefore, because the volume of public data in Korea is currently small, a method that can also make use of data other than public data needs to be proposed in order for LOD to be built properly.

Korean Patent Laid-open Publication No. 10-2015-0136701 (publication date: December 8, 2015)

Accordingly, the present invention has been devised to solve the problems described above, and an object of the present invention is to provide a linked open data cloud information service system, a method for providing the same, and a recording medium storing a program for implementing the method, which collect unstructured data associated with structured data, convert the collected data into triples in Resource Description Framework (RDF) format, continuously store and update them, and publish linked open data (LOD) in response to user queries, so that not only public structured data but also the associated unstructured data can be processed and provided.

The objects of the embodiments of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To achieve the above object, a linked open data cloud information service system according to an embodiment of the present invention includes: a data collection unit (101) for collecting structured data and unstructured data associated with the structured data; a mapping rule storage unit (102) storing mapping rules for mapping the collected data to Resource Description Framework (RDF) format on the basis of a preset schema or an ontology schema; a triple conversion unit (103) for generating and converting the data collected by the data collection unit into triples in RDF format on the basis of the mapping rules; an error verification unit (104) for verifying conformance errors in the triples generated and converted by the triple conversion unit; a triple store (105) for storing and managing the generated and converted triples; a query receiving unit (106) for receiving external queries; and an LOD publishing unit (107) for publishing linked open data (LOD) for the received query. The data collection unit includes: a keyword extraction unit (201) for extracting a first keyword set of the structured data and a second keyword set for each item of the collected unstructured data set; a collection unit (202) for collecting an unstructured data set associated with the first keyword set extracted by the keyword extraction unit; a selection unit (203) for calculating the similarity between the collected unstructured data set and the first keyword set and selecting the corresponding unstructured data when the calculated similarity is equal to or greater than a preset value; and a linking unit (204) for linking the meaning of the selected unstructured data with that of the structured data.

Meanwhile, to achieve the above object, a method for providing a linked open data cloud information service according to an embodiment of the present invention includes: a data collection step (S410) of collecting structured data and unstructured data associated with the structured data; a conversion step (S420) of generating and converting the collected unstructured data into triples in Resource Description Framework (RDF) format on the basis of mapping rules; a verification step (S430) of verifying conformance errors in the generated and converted triples; a storage step (S440) of storing the verified triples; a query receiving step (S450) of receiving an external query; and an LOD publishing step (S460) of publishing linked open data (LOD) for the received query. The data collection step includes: a first keyword extraction step (S510) of extracting a first keyword set of the structured data; a collection step (S520) of collecting an unstructured data set associated with the extracted first keyword set; a second keyword extraction step (S530) of extracting a second keyword set for each item of the collected unstructured data set; a similarity calculation step (S540) of calculating the similarity between the collected unstructured data set and the first keyword set; a selection step (S550) of selecting the corresponding unstructured data when the calculated similarity is equal to or greater than a preset value; and a linking step (S560) of linking the meaning of the selected unstructured data with that of the structured data.
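The sub-steps S510 to S560 of the data collection step can be sketched as follows. The text does not specify how keywords are extracted or which similarity measure is used, so this sketch assumes naive word tokenization and Jaccard similarity; `extract_keywords`, `jaccard`, `collect_related`, and the 0.2 threshold are all hypothetical.

```python
import re

# Assumed stop-word list; a real system would use a proper keyword extractor.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "on", "for"}


def extract_keywords(text: str) -> set:
    """Naive keyword extraction: lowercase word tokens minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def collect_related(record: dict, corpus: list, threshold: float = 0.2) -> list:
    first = extract_keywords(" ".join(record.values()))            # S510
    candidates = [d for d in corpus if first & extract_keywords(d)]  # S520
    links = []
    for doc in candidates:
        second = extract_keywords(doc)                             # S530
        score = jaccard(first, second)                             # S540
        if score >= threshold:                                     # S550
            links.append({"record": record, "document": doc,
                          "similarity": score})                    # S560
    return links


record = {"name": "pine tree", "region": "Korea"}
corpus = [
    "Pine tree forests in Korea",
    "Pine nuts are a popular snack",
    "Stock prices fell today",
]
links = collect_related(record, corpus)
```

With this sample data only the first document passes the threshold: it shares three of four distinct keywords with the record, while the second shares one of six and the third shares none.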

In addition, according to an embodiment of the present invention, there is provided a computer-readable recording medium storing a program for implementing the method for providing a linked open data cloud information service.

According to the linked open data cloud information service system and the method for providing the same according to an embodiment of the present invention, users of limited structured data such as public linked data are provided with associated unstructured data from the web, SNS, news, papers, e-mail, mobile phones, and the like, which can increase the utilization of public linked data. In addition, by linking with up-to-date data from the web, SNS, news, and the like, the limitation of structured data that cannot be updated in real time can be overcome.

In addition, building an LOD-based integrated database puts the spirit of opening and sharing content into practice, and further enables experts to produce new content on the basis of the vast data of a given field.

In addition, the invention provides a foundation on which high-quality information can be obtained easily, and linked open data can be used to build intelligent information services and achieve a virtuous cycle of knowledge and information.

In addition, compared with the case where only publicly available structured data exists, the present invention makes it easy to produce and obtain vast amounts of unstructured data related to a desired application service in machine-readable form, so that improved profitability through the activation of data services can be expected.

FIG. 1 is a block diagram of an embodiment of a linked open data cloud information service system according to the present invention.
FIG. 2 is a block diagram of an embodiment of the data collection unit of FIG. 1.
FIG. 3 is a block diagram of an embodiment of the LOD publishing unit of FIG. 1.
FIG. 4 is a flowchart of an embodiment of a method for providing a linked open data cloud information service according to the present invention.
FIG. 5 is a flowchart of an embodiment of the data collection step of FIG. 4 according to the present invention.
FIG. 6 is a flowchart of an embodiment of the LOD publishing step of FIG. 5 according to the present invention.
FIG. 7 is an explanatory diagram of an embodiment of visualized, published LOD according to the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to particular embodiments, however, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between.

In contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

The terms used in this specification are used only to describe particular embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, processes, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, processes, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, the present invention is described in more detail with reference to the accompanying drawings. Before that, the terms and words used in this specification and the claims should not be interpreted as limited to their ordinary or dictionary meanings; on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention. Unless otherwise defined, the technical and scientific terms used have the meanings commonly understood by those of ordinary skill in the art to which this invention pertains, and descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the present invention are omitted from the following description and the accompanying drawings. The drawings introduced below are provided as examples so that the idea of the present invention can be sufficiently conveyed to those skilled in the art. Accordingly, the present invention is not limited to the drawings presented below and may be embodied in other forms. Throughout the specification, the same reference numerals denote the same components, and the same components in the drawings are denoted by the same reference numerals wherever possible.

FIG. 1 is a block diagram of an embodiment of a linked open data cloud information service system according to the present invention.

The overall system to which the present invention is applied includes a linked open data cloud information service system (100) and a user terminal (300) connected through a communication network, and a number of web servers (not shown) from which Resource Description Framework (RDF) data is obtained.

Here, the communication network may be configured regardless of its communication mode, whether wired or wireless, and may consist of various networks such as a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN). Preferably, the communication network referred to in the present invention may be the well-known World Wide Web (WWW).

Here, the Resource Description Framework (RDF) is a unified data model for describing information or data (resources) on the web in a single standard language; it expresses the metadata description of the information or data (resource) as a triple of subject, predicate, and object. The subject and predicate are expressed as URIs, and the object as a string or a URI.
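As a minimal sketch of this data model, the following represents a triple whose subject and predicate are URIs and whose object is either a URI or a string, and writes it in N-Triples syntax. The `Triple` and `Literal` classes, the `to_ntriples` helper, and the `example.org` URIs are illustrative assumptions; only simple string literals are handled, and only backslashes and double quotes are escaped.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """A plain string object (as opposed to a URI object)."""
    value: str


class Triple(NamedTuple):
    subject: str               # URI
    predicate: str             # URI
    obj: Union[str, Literal]   # URI (plain str) or string literal


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a single N-Triples line."""
    if isinstance(t.obj, Literal):
        escaped = t.obj.value.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
PINE = "http://example.org/tree/pine"            # hypothetical URI
BOOK = "http://example.org/book/pine-guide"      # hypothetical URI

label = Triple(PINE, RDFS_LABEL, Literal("Pine"))
related = Triple(PINE, "http://example.org/schema/relatedBook", BOOK)
```

Here `label` has a string object and `related` has a URI object, which is the distinction the paragraph above draws.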

The user terminal (300) is a digital device that generally includes functions allowing the user to search for desired LOD information or carry out various other activities through the communication network. Any digital device that has memory means and is equipped with a microprocessor and thus has computing capability, such as a personal computer (for example, a desktop or notebook computer), a workstation, a PDA, a web pad, or a mobile phone, may be adopted as the user terminal (300) according to the present invention. The user terminal (300) can also access a user interface (not shown) of the linked open data cloud information service system (100) to submit LOD-related queries.

Referring to FIG. 1, the linked open data cloud information service system (100) according to the present invention includes a data collection unit (101), a mapping rule storage unit (102), a triple conversion unit (103), an error verification unit (104), a triple store (105), a query receiving unit (106), and an LOD publishing unit (107).

The data collection unit (101) collects structured data and unstructured data associated with the structured data.

The mapping rule storage unit (102) stores mapping rules for mapping the collected data to Resource Description Framework (RDF) format on the basis of a preset schema or an ontology schema. Specifically, it supports mapping RDF, Excel, CSV, user-defined delimiter formats, and the like to RDF format.
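One way such a mapping rule might look for CSV input is sketched below. The text does not define the rule format, so the `MAPPING_RULE` dictionary layout (a subject URI template plus a column-to-predicate table), the `rows_to_triples` function, and the `example.org` URIs are assumptions made for illustration.

```python
import csv
import io

# Hypothetical mapping rule: how to build the subject URI for each row,
# and which predicate URI each CSV column maps to.
MAPPING_RULE = {
    "subject_template": "http://example.org/tree/{id}",
    "columns": {
        "name": "http://www.w3.org/2000/01/rdf-schema#label",
        "region": "http://example.org/schema/region",
    },
}


def rows_to_triples(csv_text: str, rule: dict) -> list:
    """Convert CSV rows into (subject, predicate, object) tuples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = rule["subject_template"].format(**row)
        for column, predicate in rule["columns"].items():
            value = row.get(column)
            if value:  # skip empty cells rather than emit empty objects
                triples.append((subject, predicate, value))
    return triples


CSV_TEXT = "id,name,region\n1,Pine,Korea\n2,Oak,\n"
triples = rows_to_triples(CSV_TEXT, MAPPING_RULE)
```

Keeping the rule as data rather than code means it can be stored and edited separately from the converter, which matches the separation between the mapping rule storage unit and the triple conversion unit.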

The triple conversion unit (103) generates and converts the data collected by the data collection unit (101) into triples in Resource Description Framework (RDF) format on the basis of the mapping rules.

The triple conversion unit (103) is an ontology (triple) converter that collects RDBMS data, structured data, and unstructured data and generates triples. It provides a way to convert existing in-house information into RDF triples in order to implement the Semantic Web or Linked Open Data, and it works closely with the triple store (105) to implement the Semantic Web and Linked Data.

The error verification unit (104) verifies conformance errors in the triples generated and converted by the triple conversion unit (103).
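The text does not say which conformance checks are performed. As a sketch under that caveat, the following checks only what the RDF description elsewhere in the document implies: the subject and predicate should be HTTP URIs and the object should be a non-empty string. The function names `is_http_uri` and `validate_triple` are hypothetical.

```python
from urllib.parse import urlparse


def is_http_uri(value: str) -> bool:
    """True if value is an absolute http(s) URI with a host part."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate_triple(triple: tuple) -> list:
    """Return a list of error messages; an empty list means the triple passed."""
    subject, predicate, obj = triple
    errors = []
    if not is_http_uri(subject):
        errors.append("subject is not an HTTP URI")
    if not is_http_uri(predicate):
        errors.append("predicate is not an HTTP URI")
    if not isinstance(obj, str) or not obj.strip():
        errors.append("object is empty")
    return errors


good = ("http://example.org/tree/1",
        "http://www.w3.org/2000/01/rdf-schema#label", "Pine")
bad = ("tree-1", "http://www.w3.org/2000/01/rdf-schema#label", "  ")
```

Returning all errors for a triple, rather than stopping at the first, makes it easier to report every problem in one pass before the triples are stored.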

The triple store (105) stores and manages the generated and converted triples.

Specifically, a triple store is a system that stores and manages ontologies as triple-form data in the Semantic Web. When an RDBMS such as Oracle is used as a triple store, storing large volumes of data requires high-performance hardware (vertical scaling, or scale-up), so the cost of expansion rises sharply. For this reason, big data frameworks with horizontal scalability (scale-out), which distribute and manage data across multiple servers as the data grows, are being used as triple stores.
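The scale-out idea can be illustrated with a toy store that spreads triples across several shards by hashing the subject. This is not the framework the text has in mind, which is not named; `shard_for` and `ShardedTripleStore` are hypothetical, and each in-memory list merely stands in for a separate server.

```python
import hashlib


def shard_for(subject: str, num_shards: int) -> int:
    """Pick a shard deterministically from a hash of the subject URI."""
    digest = hashlib.sha1(subject.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedTripleStore:
    """Toy triple store; each list stands in for one server."""

    def __init__(self, num_shards: int):
        self.shards = [[] for _ in range(num_shards)]

    def add(self, triple: tuple) -> None:
        index = shard_for(triple[0], len(self.shards))
        self.shards[index].append(triple)

    def about(self, subject: str) -> list:
        """All triples with this subject; only one shard is consulted."""
        shard = self.shards[shard_for(subject, len(self.shards))]
        return [t for t in shard if t[0] == subject]


PINE = "http://example.org/tree/pine"
OAK = "http://example.org/tree/oak"
store = ShardedTripleStore(num_shards=4)
store.add((PINE, "http://www.w3.org/2000/01/rdf-schema#label", "Pine"))
store.add((PINE, "http://example.org/schema/region", "Korea"))
store.add((OAK, "http://www.w3.org/2000/01/rdf-schema#label", "Oak"))
```

Because all triples about one subject land on the same shard, a lookup by subject touches a single server, and capacity grows by adding shards rather than by buying a larger machine. Changing the shard count in this naive scheme would require redistributing existing triples.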

The query receiving unit (106) receives queries from the user. The query receiving unit (106) may provide a visualization-based interactive search service to the user terminal (300). The queries are written in SPARQL.
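To give a feel for what a SPARQL query asks of the store, the sketch below matches a single triple pattern against an in-memory list of triples. It is not a SPARQL engine: it handles one pattern only, terms beginning with `?` are treated as variables, and `match_pattern` and the sample data are hypothetical.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
TRIPLES = [
    ("http://example.org/tree/pine", RDFS_LABEL, "Pine"),
    ("http://example.org/tree/pine", "http://example.org/schema/region", "Korea"),
    ("http://example.org/tree/oak", RDFS_LABEL, "Oak"),
]


def match_pattern(triples: list, pattern: tuple) -> list:
    """Return one variable-binding dict per triple that matches the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds to the value; a repeated variable
                # must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Roughly corresponds to the SPARQL query:
#   SELECT ?tree ?name
#   WHERE { ?tree <http://www.w3.org/2000/01/rdf-schema#label> ?name }
labels = match_pattern(TRIPLES, ("?tree", RDFS_LABEL, "?name"))
```

A full query would combine several such patterns and join their bindings on shared variables.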

The LOD publishing unit (107) publishes LOD for the received query.

Meanwhile, the linked open data cloud information service system (100) may further include a mapping rule management unit (not shown) for modifying and managing the mapping rules stored in the mapping rule storage unit (102).

At least some of the data collection unit (101), the mapping rule storage unit (102), the triple conversion unit (103), the error verification unit (104), the triple store (105), the query receiving unit (106), and the LOD publishing unit (107) may be program modules that communicate with the linked open data cloud information service system (100). Such program modules may be included in the linked open data cloud information service system (100) in the form of an operating system, application program modules, and other program modules, and may be physically stored on various known storage devices. These program modules may also be stored in a remote storage device capable of communicating with the linked open data cloud information service system (100). Such program modules encompass, but are not limited to, routines, subroutines, programs, objects, components, and data structures that perform the particular tasks described later or implement particular abstract data types in accordance with the present invention.

Big data refers to data sets that exceed the capabilities of existing data collection, storage, management, and analysis. Depending on its degree of structure, big data can be classified into structured data, semi-structured data, and unstructured data.

Structured data is data stored in fixed fields, that is, data stored in a set format. Semi-structured data is data that is not stored in fixed fields but includes metadata or a schema; examples include XML (Extensible Markup Language) and HTML (Hypertext Markup Language). Unstructured data is data that is not stored in fixed fields; examples include text documents, image data, video data, and audio data.

The data collection unit (101) can collect and analyze big data as described above. Examples of big data analysis techniques include text mining, opinion mining, social network analysis, cluster analysis, neural network analysis, and Markov models, but the techniques are not limited to these examples.

Text mining is a technique for extracting and processing useful information from semi-structured or unstructured text data on the basis of natural language processing. Natural language processing (NLP) is technology that enables natural language understanding and natural language generation. A natural language is a language that people use to communicate, as opposed to an artificial language (computer language). Natural language understanding means mechanically analyzing natural language into a form that a computer can understand. Natural language generation means enabling a computer to output natural language. Natural language processing consists of morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis.

Opinion mining is a technique that collects and analyzes structured and unstructured text from social media such as blogs, social network services (SNS), wikis, user-created content (UCC), and microblogs to determine the reputation (for example, positive, negative, or neutral) of a product or service.

Social network analysis is a technique for analyzing and extracting users' influence, interests, and tendencies on the basis of the connection structure and connection strength of a social network.

Cluster analysis is a technique that repeatedly merges entities with similar characteristics to ultimately discover groups that share similar characteristics.

FIG. 2 is a configuration diagram of an embodiment of the data collection unit of FIG. 1.

As shown in FIG. 2, the data collection unit (101) includes: a keyword extraction unit (201) that extracts a first keyword set from the structured data and a second keyword set for each item of the collected unstructured data set; a collection unit (202) that collects an unstructured data set associated with the first keyword set extracted by the keyword extraction unit (201); a selection unit (203) that calculates the similarity between the collected unstructured data and the first keyword set and selects the unstructured data when the calculated similarity is equal to or greater than a preset value; and a connection unit (204) that semantically links the selected unstructured data to the structured data.

The keyword extraction unit (201) converts the structured data into Resource Description Framework (RDF) structured data and then analyzes the converted data to extract the first keyword set. For example, a large museum provides structured data such as its items, maps, and event information; this museum information is converted into RDF structured data and analyzed to extract the museum location, exhibits, and the like as main keywords. The keyword extraction unit (201) extracts keywords judged to represent the meaning of a document, and may join two or more morphemes in a morphologically analyzed document into a single keyword.

The collection unit (202) collects, on the basis of the first keyword set extracted by the keyword extraction unit (201), an unstructured data set that is related to the structured data and can be semantically linked to it. For example, unstructured data related to a museum may include encyclopedia information on exhibited items, recent news, and visitor reviews; the collection unit (202) collects unstructured text data or document data from the web, blogs, social network services (SNS), news, and the like on the basis of the first keyword set extracted by the keyword extraction unit (201).

When unstructured text data and document data are collected, association analysis, cluster analysis, predictive models, decision trees, artificial neural networks, genetic algorithms, case-based reasoning, fuzzy theory, and the like may be used.

The keyword extraction unit (201) also extracts the second keyword set from the collected unstructured data or document data.

For example, keywords can be extracted from unstructured data using the n-gram algorithm.

The n-gram algorithm is widely used in index analysis based on statistics and probability, and is one of the index search algorithms adopted by early search sites. Because the concept is intuitive and fast to compute, it is easy to apply in many fields. The n-gram algorithm is used for keyword extraction in search systems and for API extraction from malicious code.

The n-gram algorithm creates a window of n units, slides it over the string from left to right one unit at a time, and records the frequency of occurrence of each extracted sequence. Here n indicates the size of each extracted unit: a value of 1 gives unigrams, 2 gives bigrams, and 3 gives trigrams, and larger values are possible.

For example, splitting the sentence "I am a boy" into character-level 3-grams yields "I a", " am", "am ", "m a", " a ", "a b", " bo", and "boy". The same operation can be done at the word level; word-level 2-grams are "I am", "am a", and "a boy". Whether to include spaces, and whether to work at the word level or the character level, can be set as the designer wishes.
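The sliding-window procedure can be sketched in a few lines of Python. The function names `char_ngrams` and `word_ngrams` are illustrative choices, not names from the patent, and the sketch assumes whitespace is kept for character-level n-grams and used as the separator for word-level n-grams.

```python
from collections import Counter
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    # Slide a window of n characters from left to right, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> List[str]:
    # Same idea, but the unit is a whitespace-separated word.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I am a boy"
trigrams = char_ngrams(sentence, 3)   # character-level 3-grams
bigrams = word_ngrams(sentence, 2)    # word-level 2-grams

# Frequency of occurrence of each extracted sequence.
trigram_counts = Counter(trigrams)
```

For a ten-character sentence the character-level window produces eight trigrams; an input shorter than n simply yields an empty list.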

However, words such as "is" and "have" that appear frequently in every document are hard to regard as important. A technique therefore exists that examines the whole document set, lowers the weight of such words, and highlights keywords that occur often only within a particular document: the TF-IDF method.

TF-IDF (Term Frequency-Inverse Document Frequency) is a weight used in information retrieval and text mining; given a document set consisting of several documents, it is a statistical value indicating how important a word is within a particular document. It can be used to extract the key terms of a document, to rank search results in a search engine, or to measure the similarity between documents.

TF (term frequency) indicates how often a particular word appears in a document; the higher the value, the more important the word can be considered within that document. However, a word that is used frequently across the document set is simply a common word. This is measured by DF (document frequency), and the reciprocal of DF is called IDF (inverse document frequency). TF-IDF is the product of TF and IDF.

In more detail, TF is the frequency with which a particular word appears in a document; a word that appears often in a document can be expected to be correspondingly important. DF is the number of documents in the document set in which the word appears. What matters for DF is therefore not how many times a word appears in a given document but in how many documents it appears. The document set is usually defined per folder. In the present invention, the document set may be the collected unstructured data set.

A high DF value means that the word appears in many documents of the document set, and therefore that it is not a very important word. Words used everywhere, such as "is" and "have", have high DF values.

IDF is the reciprocal of the DF described above. Because a larger DF indicates a less important word, taking the reciprocal (1/DF) gives a value that becomes smaller as DF becomes larger. A larger IDF value therefore means that the word rarely appears in other documents. Because DF values span a very wide range, a logarithm is usually applied, although this can be adjusted to suit the target environment.

TF, IDF, and TF-IDF can be expressed by the following equations, respectively.

<Equation 1>

<Equation 2>

<Equation 3>

TF-IDF = TF × IDF
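A common formulation consistent with the description above, given here as an assumption rather than as the patent's own Equations 1 and 2, is shown below. The symbols are introduced only for this sketch: f_{t,d} is the number of times term t occurs in document d, D is the document set, and N = |D| is the number of documents. The patent's equations may differ, for example in normalization or in the base of the logarithm.

```latex
\mathrm{TF}(t,d) = f_{t,d}, \qquad
\mathrm{DF}(t) = \bigl|\{\, d \in D : t \in d \,\}\bigr|, \qquad
\mathrm{IDF}(t) = \log\frac{N}{\mathrm{DF}(t)}, \qquad
\text{TF-IDF}(t,d) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t)
```

With this form, a term that occurs in every document has DF(t) = N, so IDF(t) = log 1 = 0 and its TF-IDF weight vanishes regardless of how often it occurs.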

That is, for unstructured data, keywords in the data (document) can be extracted using the n-gram algorithm and TF-IDF. Meaningless words at the level of stop words are removed so that the extracted keywords are reliable keywords for the actual document (data).
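A minimal sketch of keyword extraction with stop-word removal and TF-IDF follows. It assumes word-level unigrams, raw counts for TF, and log(N/DF) for IDF; the stop-word list, the sample documents, and the function names are all illustrative assumptions rather than details taken from the patent.

```python
import math
from collections import Counter
from typing import Dict, List

# Assumed, illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"is", "have", "a", "the", "of"}


def tokenize(doc: str) -> List[str]:
    # Lowercase, split on whitespace, and drop stop words.
    return [w for w in doc.lower().split() if w not in STOP_WORDS]


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))          # count each term once per document
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)            # raw term counts within this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


def top_keywords(score: Dict[str, float], k: int) -> List[str]:
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    "the museum exhibits celadon celadon vase",
    "the museum is open",
    "visitors have praised the museum",
]
scores = tf_idf(docs)
keywords = top_keywords(scores[0], 2)
```

In this toy corpus "museum" appears in every document, so its weight is zero, while "celadon" appears twice in a single document and ranks first.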

The selection unit (203) calculates, using any of various methods, the similarity between the collected unstructured data set and the first keyword set, and selects an item of unstructured data when its calculated similarity is equal to or greater than a preset value.

For example, let vector A be the TF-IDF values for the first keyword set extracted from the structured data and vector B be the TF-IDF values for the second keyword set extracted from each item of unstructured data. The similarity between the structured data and the unstructured data can then be calculated by applying cosine similarity, which measures the similarity between vectors, as in Equation 4 below.

<Equation 4>

\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

That is, given vectors A and B, the cosine similarity cos(θ) is the scalar (dot) product of the vectors divided by the product of their magnitudes.
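The similarity calculation and the threshold-based selection can be sketched as follows. The sketch assumes that TF-IDF vectors are stored as sparse dictionaries keyed by term; the threshold of 0.5 and the sample vectors are arbitrary illustrative values, since the patent only states that a preset value is used.

```python
import math
from typing import Dict, List


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Dot product over the terms that the two sparse vectors share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with an all-zero vector as 0
    return dot / (norm_a * norm_b)


def select_related(structured: Dict[str, float],
                   unstructured: List[Dict[str, float]],
                   threshold: float) -> List[int]:
    # Keep the indices of items whose similarity meets the preset value.
    return [i for i, vec in enumerate(unstructured)
            if cosine_similarity(structured, vec) >= threshold]


A = {"museum": 1.0, "celadon": 2.0}            # vector for the first keyword set
candidates = [
    {"museum": 1.0, "celadon": 2.0},           # same direction as A
    {"weather": 3.0},                          # no shared terms
    {"celadon": 1.0, "vase": 1.0},             # partial overlap
]
selected = select_related(A, candidates, threshold=0.5)
```

Because TF-IDF weights are non-negative, the resulting similarity lies between 0 and 1, which makes a fixed threshold straightforward to apply.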

The connection unit (204) extracts the context of the selected unstructured data on the basis of a glossary of subject terms, and annotates the link between the structured data and the unstructured data on the basis of the extracted result.

FIG. 3 is a configuration diagram of an embodiment of the LOD issuing unit of FIG. 1.

As shown in FIG. 3, the LOD issuing unit (107) includes: a vocabulary selection unit (301) that selects vocabulary for the user's query received from the query receiving unit (106); a search unit (302) that searches the triple store (105) using the selected vocabulary; a processing unit (303) that visualizes the retrieved triples; and an output unit (304) that outputs the visualized linked open data (LOD) to a user interface.

The data visualization performed by the processing unit (303) is the process of expressing and conveying data analysis results visually so that they are easy to understand. The purpose of visualization is to communicate information clearly and effectively through charts and graphs. Big data from public institutions can be processed using cloud computing resources and open-source tools and techniques such as Hadoop, R, machine learning, and data mining, and the various processing results can be expressed and shared as easy-to-understand graphics or charts.

The output LOD can be downloaded in various data formats. The LOD metadata detail page provides detailed information about the corresponding entity (URI); the metadata on that page is basically composed of RDF triples, which can be converted to and provided in various formats such as Turtle, N-Triples, RDF/XML, and JSON.
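To illustrate how the same triples can be emitted in one of the listed formats, the sketch below serializes triples as N-Triples lines. The `example.org` URIs and the function name `to_ntriples` are hypothetical, and only a small subset of the literal escaping rules is handled (backslash, double quote, newline); language tags, datatypes, and blank nodes are omitted.

```python
from typing import Iterable, Tuple

# (subject URI, predicate URI, object, object_is_uri)
TripleRow = Tuple[str, str, str, bool]


def to_ntriples(triples: Iterable[TripleRow]) -> str:
    lines = []
    for s, p, o, o_is_uri in triples:
        if o_is_uri:
            obj = f"<{o}>"
        else:
            # Minimal literal escaping; not a complete implementation.
            escaped = (o.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n"))
            obj = f'"{escaped}"'
        # Each N-Triples statement ends with " ." on its own line.
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


rows = [
    ("http://example.org/museum/1", "http://example.org/schema/locatedIn",
     "http://example.org/place/Seoul", True),
    ("http://example.org/museum/1", "http://example.org/schema/name",
     'City "Folk" Museum', False),
]
nt = to_ntriples(rows)
```

A production system would more likely rely on an RDF library for serialization to Turtle, RDF/XML, or JSON-based formats; this sketch only shows the shape of the output.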

FIG. 4 is a flowchart of an embodiment of the method for providing a linked open data cloud information service according to the present invention.

Referring to FIG. 4, the method first collects structured data and unstructured data associated with the structured data (S410).

The collected unstructured data is generated as, and converted into, Resource Description Framework (RDF) triples on the basis of mapping rules (S420).
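One way to picture rule-based triple conversion is a table that maps field names of a collected record to predicate URIs. The patent does not specify the form of the mapping rules, so the dictionary-based rule, the `example.org` URIs, and the function name below are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping rule: record field name -> predicate URI.
MAPPING_RULE: Dict[str, str] = {
    "name": "http://example.org/schema/name",
    "location": "http://example.org/schema/location",
}


def record_to_triples(subject_uri: str, record: Dict[str, str],
                      rule: Dict[str, str]) -> List[Triple]:
    # Emit one triple per field that the rule knows about; skip the rest.
    return [(subject_uri, rule[field], value)
            for field, value in record.items() if field in rule]


record = {"name": "City Museum", "location": "Seoul", "internal_id": "42"}
triples = record_to_triples("http://example.org/museum/1", record, MAPPING_RULE)
```

Keeping the rule as data rather than code is what would let a mapping rule management step modify it without touching the conversion logic.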

The generated and converted triples are checked for conformance errors (S430), and the verified triples are stored in a store (e.g., the triple store (105)) (S440).
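The patent does not state which conformance checks are performed. As one hypothetical example, the sketch below accepts a triple only if its subject and predicate are absolute URIs and its object is non-empty, and separates rejected triples so that they are not stored. The specific checks and the function names are assumptions.

```python
from typing import List, Tuple
from urllib.parse import urlparse

Triple = Tuple[str, str, str]


def is_absolute_uri(value: str) -> bool:
    # Treat a value as an absolute URI if it has both a scheme and a host.
    parts = urlparse(value)
    return bool(parts.scheme) and bool(parts.netloc)


def validate(triples: List[Triple]) -> Tuple[List[Triple], List[Triple]]:
    valid: List[Triple] = []
    rejected: List[Triple] = []
    for t in triples:
        s, p, o = t
        ok = is_absolute_uri(s) and is_absolute_uri(p) and o.strip() != ""
        (valid if ok else rejected).append(t)
    return valid, rejected


good = ("http://example.org/museum/1", "http://example.org/schema/name", "City Museum")
bad_subject = ("museum/1", "http://example.org/schema/name", "City Museum")
empty_object = ("http://example.org/museum/1", "http://example.org/schema/name", "  ")

valid, rejected = validate([good, bad_subject, empty_object])
```

Only the triples in `valid` would proceed to the storage step (S440) under this assumed policy.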

Thereafter, when a query is received from a user (S450), the triple store (105) is searched for the received query and LOD is issued (S460).

The method may further include a mapping rule management step (not shown) for modifying and managing the mapping rules.

FIG. 5 is a flowchart of an embodiment of the data collection step of FIG. 4 according to the present invention.

In the data collection step (S410), a first keyword set is first extracted from the structured data (S510).

The structured data is converted into Resource Description Framework (RDF) structured data, and the converted data is analyzed to extract the main keywords. For example, a large museum provides structured data such as its items, maps, and event information; this museum information is converted into RDF structured data and analyzed to extract the museum location, exhibits, and the like as main keywords. Keywords judged to represent the meaning of a document are extracted, and two or more morphemes in a morphologically analyzed document may be joined into a single keyword.

Next, an unstructured data set associated with the extracted first keyword set is collected (S520). On the basis of the extracted first keyword set, an unstructured data set that is related to the structured data and can be semantically linked to it is collected. For example, unstructured data related to a museum may include encyclopedia information on exhibited items, recent news, and visitor reviews; unstructured text data or document data from the web, blogs, news, and the like can be collected on the basis of the extracted keywords.

Next, a second keyword set is extracted for each item of the collected unstructured data set (S530).

For unstructured data, the second keyword set in the data (document) can be extracted using the n-gram algorithm and TF-IDF. Meaningless words at the level of stop words are removed so that the extracted keywords are reliable keywords for the actual document (data). The n-gram algorithm and TF-IDF were described with reference to FIG. 2, so a detailed description is omitted here.

Next, the similarity between the collected unstructured data set and the first keyword set is calculated (S540). Any of various methods can be used for this calculation.

For example, let vector A be the TF-IDF values for the first keyword set extracted from the structured data and vector B be the TF-IDF values for the second keyword set extracted from each item of unstructured data. The similarity between the structured data and the unstructured data can then be calculated by applying cosine similarity, which measures the similarity between vectors, as in Equation 5 below.

<Equation 5>

\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

That is, given vectors A and B, the cosine similarity cos(θ) is the scalar (dot) product of the vectors divided by the product of their magnitudes.

Next, when the calculated similarity is equal to or greater than the preset value, the corresponding unstructured data is selected (S550).

The selected unstructured data is semantically linked to the structured data (S560). The context of the selected unstructured data is extracted on the basis of a glossary of subject terms, and the link between the structured data and the unstructured data can be annotated on the basis of the extracted result.

FIG. 6 is a flowchart of an embodiment of the LOD issuing step of FIG. 4 according to the present invention.

The LOD issuing step (S460) is performed when a query is received from a user (S450).

First, vocabulary is selected for the query received from the user (S610).

The triple store is searched using the selected vocabulary (S620).

The retrieved triples are visualized (S630), and the visualized linked open data (LOD) is output to the user interface (S640).
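Steps S610 to S630 can be sketched as three small functions. The patent does not define how vocabulary is selected or how triples are visualized, so this sketch assumes a hypothetical keyword-to-predicate table for S610, a simple predicate filter for S620, and a node-and-edge structure that a charting front end could render for S630. All names and `ex:` identifiers are illustrative.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical vocabulary: query keyword -> predicate used in the store.
VOCABULARY: Dict[str, str] = {
    "location": "ex:locatedIn",
    "exhibit": "ex:exhibits",
}

STORE: List[Triple] = [
    ("ex:museum1", "ex:locatedIn", "ex:Seoul"),
    ("ex:museum1", "ex:exhibits", "ex:celadonVase"),
]


def select_vocabulary(query: str) -> Set[str]:
    # S610: pick the predicates whose keyword occurs in the query text.
    q = query.lower()
    return {pred for word, pred in VOCABULARY.items() if word in q}


def search(store: List[Triple], predicates: Set[str]) -> List[Triple]:
    # S620: retrieve the triples that use one of the selected predicates.
    return [t for t in store if t[1] in predicates]


def to_graph(triples: List[Triple]) -> Dict[str, list]:
    # S630: reshape triples into nodes and labeled edges for rendering.
    nodes = sorted({t[0] for t in triples} | {t[2] for t in triples})
    edges = [{"source": s, "label": p, "target": o} for s, p, o in triples]
    return {"nodes": nodes, "edges": edges}


graph = to_graph(search(STORE, select_vocabulary("Museum location?")))
```

In the described system the query itself is written in SPARQL, so the keyword table here only stands in for whatever vocabulary selection the implementation actually uses.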

FIG. 7 is an explanatory diagram of an embodiment of LOD visualized and issued according to the present invention.

As shown in FIG. 7, RDF data (701) retrieved from the BBC domain, RDF data (702) retrieved from the NYTimes domain, and RDF data (703) retrieved from the DBpedia domain can be displayed linked to one another.

The method for providing a linked open data cloud information service according to an embodiment of the present invention has been described above. It goes without saying that a computer-readable recording medium storing a program for implementing the method, and a program stored on a computer-readable recording medium for implementing the method, can also be implemented.

That is, those skilled in the art will readily understand that the method described above may be provided on a computer-readable recording medium by tangibly embodying a program of instructions for implementing it. In other words, the method may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and USB memory. The computer-readable recording medium may also be a transmission medium, such as an optical or metal line or a waveguide, that carries a carrier wave transmitting signals specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The present invention is not limited to the embodiments described above; its range of application is broad, and various modifications can of course be made without departing from the gist of the invention as claimed in the claims.

100: linked open data cloud information service system
101: data collection unit 102: mapping rule storage unit
103: triple conversion unit 104: error verification unit
105: triple store 106: query receiving unit
107: LOD issuing unit 201: keyword extraction unit
202: collection unit 203: selection unit
204: connection unit 301: vocabulary selection unit
302: search unit 303: processing unit
304: output unit

Claims

링크드 오픈 데이터 클라우드 정보 서비스 시스템에 있어서,
정형 데이터 및 상기 정형 데이터에 연관되는 비정형 데이터를 수집하기 위한 데이터 수집부(101);
상기 데이터 수집부(101)에서 수집되는 데이터를 기설정된 스키마 또는 온톨로지 스키마에 근거하여 자원서술체계(RDF) 형식에 맞도록 매핑하기 위한 매핑룰을 저장하고 있는 매핑룰 저장부(102);
상기 데이터 수집부가 수집한 데이터를 상기 매핑룰에 기반하여 자원서술체계(RDF)형식의 트리플로 생성 및 변환하기 위한 트리플 변환부(103);
상기 트리플 변환부에서 생성 및 변환한 트리플의 적합성 오류를 검증하기 위한 오류 검증부(104);
상기 생성 및 변환한 트리플을 저장하고 관리하는 트리플 저장소(105);
외부 질의를 수신하는 질의 수신부(106);
상기 수신한 질의에 대한 링크드 오픈 데이터(LOD)를 발행하는 LOD 발행부(107); 및
상기 매핑룰 저장부에 저장된 매핑룰을 수정 및 관리하기 위한 매핑룰 관리부
를 포함하고,
상기 데이터 수집부는,
상기 정형 데이터의 제1 키워드 집합 및 상기 수집한 비정형 데이터 집합의 각 데이터에 대한 제2 키워드 집합을 추출하는 키워드 추출부(201);
상기 키워드 추출부에서 추출한 상기 제1 키워드 집합과 연관되는 비정형 데이터 집합을 수집하기 위한 수집부(202);
상기 수집한 비정형 데이터 집합과 상기 제1 키워드 집합 간의 유사도를 계산하여, 상기 계산된 유사도가 기설정 값 이상인 경우 해당 비정형 데이터를 선택하는 선택부(203); 및
상기 선택한 비정형 데이터와 상기 정형 데이터의 의미를 연결시키는 연결부(204)
를 포함하고,
상기 키워드 추출부는,
상기 제2 키워드 집합을 추출하기 위해, n-gram 알고리즘 및 TF(Term Frequency)와 IDF(Inverse Document Frequency)를 곱한 TF-IDF (Term Frequency-Inverse Document Frequency)를 사용하고,
상기 선택부는,
하기 수학식과 같이 백터간 유사도를 측정하는 코사인 유사도(Cosine Similarity)를 적용하여, 유사도를 계산하고,
<수학식>

(여기서, 벡터 A는 상기 정형 데이터에서 추출된 상기 제1 키워드 집합에 대한 TF-IDF, 벡터 B는 상기 각 비정형 데이터에서 추출된 상기 제2 키워드 집합에 대한 TF-IDF),
상기 연결부는,
상기 정형 데이터와 상기 연관된 비정형 데이터 간의 링크를 어노테이션하는 것을 특징으로 하고,
상기 트리플 저장소는,
데이터가 증가함에 따라 여러 대의 서버에 데이터를 분산 관리하는 수평 확장성(Scale out)을 갖춘 빅 데이터 프레임워크를 활용하는 것을 특징으로 하고,
상기 질의 수신부는,
SPARQL을 사용하고, 사용자 인터페이스로 시각화 기반 대화형 검색 서비스를 제공하는 것을 특징으로 하고,
상기 LOD 발행부는,
사용자의 질의에 대하여 어휘를 선정하는 어휘 선정부(301);
상기 선정된 어휘를 이용하여 상기 트리플 저장소를 검색하기 위한 검색부(302);
검색된 트리플을 시각화처리하는 처리부(303); 및
시각화 처리된 링크드 오픈 데이터(LOD)를 상기 사용자 인터페이스로 출력하기 위한 출력부(304)를 포함하는 것을 특징으로 하는 링크드 오픈 데이터 클라우드 정보 서비스 시스템.
A linked open data cloud information service system, comprising:
a data collection unit (101) for collecting structured data and unstructured data associated with the structured data;
a mapping rule storage unit (102) for storing mapping rules for mapping the data collected by the data collection unit (101) to the Resource Description Framework (RDF) format based on a predefined schema or an ontology schema;
a triple conversion unit (103) for generating triples in the RDF format by converting the data collected by the data collection unit based on the mapping rules;
an error verification unit (104) for verifying conformance errors in the triples generated and converted by the triple conversion unit;
a triple store (105) for storing and managing the generated and converted triples;
a query receiving unit (106) for receiving an external query;
an LOD publishing unit (107) for publishing linked open data (LOD) in response to the received query; and
a mapping rule management unit for modifying and managing the mapping rules stored in the mapping rule storage unit,
wherein the data collection unit comprises:
a keyword extraction unit (201) for extracting a first keyword set from the structured data and a second keyword set for each item of the collected unstructured data set;
a collection unit (202) for collecting an unstructured data set associated with the first keyword set extracted by the keyword extraction unit;
a selection unit (203) for calculating the similarity between the collected unstructured data set and the first keyword set, and selecting the corresponding unstructured data when the calculated similarity is equal to or greater than a preset value; and
a linking unit (204) for linking the meaning of the selected unstructured data with that of the structured data,
wherein the keyword extraction unit,
to extract the second keyword set, uses an n-gram algorithm and TF-IDF (Term Frequency-Inverse Document Frequency), obtained by multiplying TF (Term Frequency) by IDF (Inverse Document Frequency),
wherein the selection unit
calculates the similarity by applying cosine similarity, which measures the similarity between vectors, as in the following equation,
<Equation>
$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
(where vector A is the TF-IDF for the first keyword set extracted from the structured data, and vector B is the TF-IDF for the second keyword set extracted from each item of unstructured data),
wherein the linking unit
annotates the link between the structured data and the associated unstructured data,
wherein the triple store
utilizes a big data framework with horizontal scalability (scale-out) that distributes and manages data across multiple servers as the data grows,
wherein the query receiving unit
uses SPARQL and provides a visualization-based interactive search service as a user interface, and
wherein the LOD publishing unit comprises:
a vocabulary selection unit (301) for selecting a vocabulary for the user's query;
a search unit (302) for searching the triple store using the selected vocabulary;
a processing unit (303) for visualizing the retrieved triples; and
an output unit (304) for outputting the visualized linked open data (LOD) to the user interface.
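The keyword extraction and selection recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`ngrams`, `tfidf_vectors`, `cosine_similarity`, `select_related`), the use of word-level uni- and bi-grams, the smoothed IDF variant, and the threshold of 0.1 are all assumptions chosen for illustration. The claim itself only specifies n-grams, TF multiplied by IDF, cosine similarity, and a preset threshold.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Return word-level 1-grams up to n-grams of a lowercased text."""
    words = text.lower().split()
    grams = []
    for k in range(1, n + 1):
        grams += [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return grams


def tfidf_vectors(docs, n=2):
    """Build one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [ngrams(d, n) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    total = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            # TF x IDF; the smoothed IDF form is an assumption of this sketch.
            term: (count / len(toks)) * (math.log((1 + total) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| |B|) for two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_related(structured_text, unstructured_docs, threshold=0.1):
    """Keep the unstructured documents whose similarity to the structured
    data is at or above the threshold (hypothetical default)."""
    vectors = tfidf_vectors([structured_text] + unstructured_docs)
    a = vectors[0]
    return [
        doc for doc, b in zip(unstructured_docs, vectors[1:])
        if cosine_similarity(a, b) >= threshold
    ]


structured = "seoul subway station passenger statistics"
candidates = [
    "news about seoul subway station crowding",
    "recipe for kimchi stew",
]
selected = select_related(structured, candidates)
```

In this made-up example only the first candidate shares terms with the structured record, so it is the only one kept. A real deployment would need a tokenizer suited to Korean text rather than whitespace splitting.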

삭제 (Deleted)

링크드 오픈 데이터 클라우드 정보 서비스 제공 방법에 있어서,
정형 데이터 및 상기 정형 데이터에 연관되는 비정형 데이터를 수집하는 데이터수집단계(S410);
수집한 비정형 데이터를 매핑룰에 기반하여 자원서술체계(RDF) 형식의 트리플로 생성 및 변환하는 변환단계(S420);
상기 생성 및 변환한 트리플의 적합성 오류를 검증하는 검증단계(S430);
검증된 트리플을 저장하는 저장단계(S440);
외부 질의를 수신하는 질의수신단계(S450);
상기 수신한 질의에 대한 링크드 오픈 데이터(LOD)를 발행하는 LOD 발행단계(S460); 및
상기 매핑룰을 수정 및 관리하는 매핑룰 관리단계
를 포함하고,
상기 데이터수집단계는,
상기 정형 데이터의 제1 키워드 집합을 추출하는 제1 키워드 추출단계(S510);
상기 추출한 제1 키워드 집합과 연관되는 비정형 데이터 집합을 수집하는 수집단계(S520);
상기 수집한 비정형 데이터 집합의 각 데이터에 대한 제2 키워드 집합을 추출하는 제2 키워드 추출단계(S530);
상기 수집한 비정형 데이터 집합과 상기 제1 키워드 집합 간의 유사도를 계산하는 유사도 계산단계(S540);
상기 계산된 유사도가 기설정 값 이상인 경우 해당 비정형 데이터를 선택하는 선택단계(S550); 및
상기 선택한 비정형 데이터와 상기 정형 데이터의 의미를 연결시키는 연결단계(S560)
를 포함하고,
상기 제2 키워드 추출단계는,
상기 제2 키워드 집합을 추출하기 위해, n-gram 알고리즘 및 TF(Term Frequency)와 IDF(Inverse Document Frequency)를 곱한 TF-IDF (Term Frequency-Inverse Document Frequency)를 사용하고,
상기 유사도 계산단계는,
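하기 수학식과 같이 벡터간 유사도를 측정하는 코사인 유사도(Cosine Similarity)를 적용하여, 유사도를 계산하고,
<수학식>
$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$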
(여기서, 벡터 A는 상기 정형 데이터에서 추출된 상기 제1 키워드 집합에 대한 TF-IDF, 벡터 B는 상기 각 비정형 데이터에서 추출된 상기 제2 키워드 집합에 대한 TF-IDF),
상기 연결단계는,
상기 정형 데이터와 상기 연관된 비정형 데이터 간의 링크를 어노테이션하는 것을 특징으로 하고,
상기 저장단계는,
데이터가 증가함에 따라 여러 대의 서버에 데이터를 분산 관리하는 수평 확장성(Scale out)을 갖춘 빅 데이터 프레임워크를 활용하는 것을 특징으로 하고,
상기 질의수신단계는,
SPARQL을 사용하고, 사용자 인터페이스로 시각화 기반 대화형 검색 서비스를 제공하는 것을 특징으로 하고,
상기 LOD 발행단계는,
사용자의 질의에 대하여 어휘를 선정하는 어휘 선정단계(S610);
상기 선정된 어휘를 이용하여 트리플 저장소를 검색하는 검색단계(S620);
검색된 트리플을 시각화처리하는 처리단계(S630); 및
시각화 처리된 링크드 오픈 데이터(LOD)를 상기 사용자 인터페이스로 출력하기 위한 출력단계(S640)
를 포함하는 것을 특징으로 하는 링크드 오픈 데이터 클라우드 정보 서비스 제공 방법.
A method for providing a linked open data cloud information service, the method comprising:
a data collection step (S410) of collecting structured data and unstructured data associated with the structured data;
a conversion step (S420) of generating triples in the Resource Description Framework (RDF) format by converting the collected unstructured data based on mapping rules;
a verification step (S430) of verifying conformance errors in the generated and converted triples;
a storing step (S440) of storing the verified triples;
a query receiving step (S450) of receiving an external query;
an LOD publishing step (S460) of publishing linked open data (LOD) in response to the received query; and
a mapping rule management step of modifying and managing the mapping rules,
wherein the data collection step comprises:
a first keyword extraction step (S510) of extracting a first keyword set from the structured data;
a collection step (S520) of collecting an unstructured data set associated with the extracted first keyword set;
a second keyword extraction step (S530) of extracting a second keyword set for each item of the collected unstructured data set;
a similarity calculation step (S540) of calculating the similarity between the collected unstructured data set and the first keyword set;
a selection step (S550) of selecting the corresponding unstructured data when the calculated similarity is equal to or greater than a preset value; and
a linking step (S560) of linking the meaning of the selected unstructured data with that of the structured data,
wherein the second keyword extraction step,
to extract the second keyword set, uses an n-gram algorithm and TF-IDF (Term Frequency-Inverse Document Frequency), obtained by multiplying TF (Term Frequency) by IDF (Inverse Document Frequency),
wherein the similarity calculation step
calculates the similarity by applying cosine similarity, which measures the similarity between vectors, as in the following equation,
<Equation>
$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
(where vector A is the TF-IDF for the first keyword set extracted from the structured data, and vector B is the TF-IDF for the second keyword set extracted from each item of unstructured data),
wherein the linking step
annotates the link between the structured data and the associated unstructured data,
wherein the storing step
utilizes a big data framework with horizontal scalability (scale-out) that distributes and manages data across multiple servers as the data grows,
wherein the query receiving step
uses SPARQL and provides a visualization-based interactive search service as a user interface, and
wherein the LOD publishing step comprises:
a vocabulary selection step (S610) of selecting a vocabulary for the user's query;
a search step (S620) of searching the triple store using the selected vocabulary;
a processing step (S630) of visualizing the retrieved triples; and
an output step (S640) of outputting the visualized linked open data (LOD) to the user interface.
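The conversion (S420), verification (S430), storing (S440) and search (S620) steps of this method can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the claimed implementation: the `http://example.org/` namespaces, the dictionary-based `MAPPING_RULES`, the validity check in `is_valid`, and the in-memory list used as a triple store are all hypothetical. The `search` function only mimics a single SPARQL-style triple pattern and is not a SPARQL engine.

```python
# Hypothetical namespaces; the patent does not name any.
BASE = "http://example.org/resource/"
VOCAB = "http://example.org/vocab/"

# Hypothetical mapping rules: source field name -> RDF predicate IRI.
MAPPING_RULES = {
    "name": VOCAB + "name",
    "line": VOCAB + "line",
    "operator": VOCAB + "operator",
}


def to_triples(record_id, record, rules):
    """Conversion step: turn one record into (subject, predicate, object)
    triples, using only the fields covered by a mapping rule."""
    subject = BASE + record_id
    return [
        (subject, rules[field], str(value))
        for field, value in record.items()
        if field in rules
    ]


def is_valid(triple):
    """Verification step (assumed check): subject and predicate must be
    IRIs and the object must not be empty."""
    s, p, o = triple
    return s.startswith("http://") and p.startswith("http://") and o != ""


def search(store, s=None, p=None, o=None):
    """Search step: match one triple pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


record = {"name": "Gangnam", "line": 2, "operator": "", "memo": "ignored"}
triples = to_triples("station/1", record, MAPPING_RULES)
store = [t for t in triples if is_valid(t)]   # storing step keeps verified triples
names = search(store, p=VOCAB + "name")
```

Here the `memo` field has no mapping rule and is dropped at conversion, while the empty `operator` value is converted but rejected at verification, so only two triples reach the store.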

삭제 (Deleted)