KR20220160756A

KR20220160756A - Method and apparatus for generating document embedding

Info

Publication number: KR20220160756A
Application number: KR1020210068791A
Authority: KR
Inventors: 정은수; 서지현; 유지아; 남승우; 손형관; 오규삼; 윤용근
Original assignee: 삼성에스디에스 주식회사
Priority date: 2021-05-28
Filing date: 2021-05-28
Publication date: 2022-12-06

Abstract

Disclosed are a method and an apparatus for generating document embeddings. The method according to one embodiment of the present invention comprises the steps of: extracting, from a plurality of documents, a plurality of entities, relations among the plurality of entities, and attributes of each of the plurality of entities based on a hierarchical structure of each of the plurality of documents; generating a knowledge graph corresponding to the plurality of documents, based on the plurality of entities and the relations among the plurality of entities; extracting, from the knowledge graph, a plurality of subgraphs respectively corresponding to the hierarchical division units at the lowest level of the hierarchy, based on the attributes of each of the plurality of entities; generating a first embedding vector for each of the plurality of subgraphs; and generating an embedding vector for each higher-level hierarchical division unit above the lowest-level hierarchical division units, based on the first embedding vector for each of the plurality of subgraphs.

Description

Method and apparatus for generating document embedding {METHOD AND APPARATUS FOR GENERATING DOCUMENT EMBEDDING}

The disclosed embodiments relate to techniques for generating document embeddings.

Conventional document embedding techniques have developed through a variety of approaches. Among them, the Bag-of-Words embedding method, which relies on word frequency, is limited in its ability to represent contextual information because it does not take word order into account.

In addition, recent document embedding methods that use language models such as BERT and GPT feed sentences into the language model as they are, without extracting sentence features, and train the model to generate a single document embedding vector. As a result, the number of input tokens is limited, information about earlier tokens diminishes in inverse proportion to the number of tokens, and, because the input is restricted to sentences, information contained in images or tables cannot be reflected.

In addition, existing techniques consider only a single level of a document, such as the word level or the sentence level, and therefore fail to reflect the hierarchical structure of which the document is composed.

Republic of Korea Patent Laid-Open Publication No. 10-2020-0064880 (published 2020.06.08)

The disclosed embodiments are intended to provide a method and an apparatus for generating document embeddings.

A method for generating document embeddings according to one disclosed embodiment includes: extracting, from a plurality of documents, a plurality of entities, relations between the plurality of entities, and an attribute for each of the plurality of entities based on a hierarchical structure of each of the plurality of documents; generating a knowledge graph corresponding to the plurality of documents based on the plurality of entities and the relations between the plurality of entities; extracting, from the knowledge graph, a plurality of subgraphs respectively corresponding to a plurality of lowest-level hierarchical division units, based on the attribute for each of the plurality of entities; generating a first embedding vector for each of the plurality of subgraphs; and generating, based on the first embedding vector for each of the plurality of subgraphs, an embedding vector for each of the higher-level hierarchical division units above the plurality of lowest-level hierarchical division units.

In the extracting, per-level identification information that identifies the region from which each of the plurality of entities was extracted may be extracted as the attribute for each of the plurality of entities, based on the hierarchical structure of the document, among the plurality of documents, from which that entity was extracted.

The hierarchical structure may be composed of a combination, in an upper-lower relationship, of at least two hierarchical division units among types of hierarchical division units including paragraph, page, section, chapter, and document.

The per-level identification information may include identification information of the lowest-level hierarchical division unit constituting the hierarchical structure and identification information of each higher-level hierarchical division unit above the lowest-level hierarchical division unit.

The knowledge graph may include entity nodes corresponding to the respective entities and edges corresponding to the relations between the plurality of entities.

The extracting of the plurality of subgraphs may include: identifying, in the knowledge graph, a plurality of entity nodes respectively corresponding to a plurality of entities having the same per-level identification information; and extracting a subgraph including the identified entity nodes and the edges for the identified entity nodes.
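
The subgraph extraction step can be sketched as follows. This is a minimal illustration, not the patented implementation: the node names, relation labels, and the `extract_subgraphs` helper are hypothetical, and the sketch assumes that "edges for the identified entity nodes" means edges whose two endpoints both belong to the same group.

```python
from collections import defaultdict

# Hypothetical knowledge graph. Each entity node carries its per-level
# identification information (document, chapter, paragraph) as an attribute.
node_ids = {
    "A": ("doc1", "chap1", "par1"),
    "B": ("doc1", "chap1", "par1"),
    "C": ("doc1", "chap1", "par1"),
    "D": ("doc1", "chap1", "par2"),
    "E": ("doc1", "chap1", "par2"),
}
# Edges are (head, relation, tail) triples; the relation labels are invented.
edges = [
    ("A", "uses", "B"),
    ("B", "produces", "C"),
    ("C", "feeds", "D"),  # crosses a paragraph boundary
    ("D", "extends", "E"),
]


def extract_subgraphs(node_ids, edges):
    """Group nodes by identical per-level IDs and collect the edges inside each group."""
    groups = defaultdict(set)
    for node, ids in node_ids.items():
        groups[ids].add(node)
    subgraphs = {}
    for ids, members in groups.items():
        # Assumption: keep only edges whose two endpoints are both in the group.
        inner = [e for e in edges if e[0] in members and e[2] in members]
        subgraphs[ids] = {"nodes": sorted(members), "edges": inner}
    return subgraphs


subgraphs = extract_subgraphs(node_ids, edges)
```

Under this assumption the edge between C and D is dropped, because it connects two different lowest-level units; an implementation that keeps boundary edges would be an equally plausible reading of the text.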

The first embedding vector may be generated based on a similarity between the subgraphs corresponding to the respective lowest-level hierarchical division units.

The generating of the embedding vector for each of the higher-level hierarchical division units may include: checking, based on the per-level identification information, the identification information of each higher-level hierarchical division unit above the lowest-level hierarchical division units; and generating a second embedding vector for each of a plurality of second-highest-level hierarchical division units by using a plurality of first embedding vectors whose higher-level hierarchical division units have the same identification information.

The generating of the embedding vector for each of the higher-level hierarchical division units may include: checking, based on the second embedding vector, the identification information of the highest-level hierarchical division unit above the second-highest-level hierarchical division units; and generating a third embedding vector for each of a plurality of highest-level hierarchical division units by using a plurality of second embedding vectors whose highest-level hierarchical division units have the same identification information.
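
The roll-up from first to second to third embedding vectors can be sketched as follows. The passage above does not say how vectors that share a parent are combined, so this sketch simply assumes an element-wise mean; the `mean` and `roll_up` helpers and the vector values are hypothetical.

```python
from collections import defaultdict


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def roll_up(child_vectors, parent_of):
    """Combine child vectors that share the same parent ID (assumed: mean pooling)."""
    grouped = defaultdict(list)
    for child_id, vec in child_vectors.items():
        grouped[parent_of(child_id)].append(vec)
    return {parent: mean(vecs) for parent, vecs in grouped.items()}


# Hypothetical first embedding vectors keyed by (document, chapter, paragraph).
first = {
    ("doc1", "chap1", "par1"): [1.0, 0.0],
    ("doc1", "chap1", "par2"): [3.0, 2.0],
    ("doc1", "chap2", "par1"): [0.0, 4.0],
}
second = roll_up(first, lambda ids: ids[:2])  # one vector per chapter
third = roll_up(second, lambda ids: ids[:1])  # one vector per document
```

Keying each vector by its full ID tuple means the parent is just a prefix of the key, so the same helper serves both roll-up steps. Any other pooling or learned aggregation could be substituted for `mean` without changing the grouping logic.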

An apparatus for generating document embeddings according to one embodiment includes: an entity extractor that extracts, from a plurality of documents, a plurality of entities, relations between the plurality of entities, and an attribute for each of the plurality of entities based on a hierarchical structure of each of the plurality of documents; a graph generator that generates a knowledge graph corresponding to the plurality of documents based on the plurality of entities and the relations between the plurality of entities; a graph extractor that extracts, from the knowledge graph, a plurality of subgraphs respectively corresponding to a plurality of lowest-level hierarchical division units, based on the attribute for each of the plurality of entities; and an embedding vector generator that generates a first embedding vector for each of the plurality of subgraphs and generates, based on the first embedding vector for each of the plurality of subgraphs, an embedding vector for each of the higher-level hierarchical division units above the plurality of lowest-level hierarchical division units.

The entity extractor may extract, as the attribute for each of the plurality of entities, per-level identification information that identifies the region from which that entity was extracted, based on the hierarchical structure of the document, among the plurality of documents, from which the entity was extracted.

The graph extractor may identify, in the knowledge graph, a plurality of entity nodes respectively corresponding to a plurality of entities having the same per-level identification information, and may extract a subgraph including the identified entity nodes and the edges for the identified entity nodes.

The embedding vector generator may check, based on the per-level identification information, the identification information of each higher-level hierarchical division unit above the lowest-level hierarchical division units, and may generate a second embedding vector for each of a plurality of second-highest-level hierarchical division units by using a plurality of first embedding vectors whose higher-level hierarchical division units have the same identification information.

The embedding vector generator may check, based on the second embedding vector, the identification information of the highest-level hierarchical division unit above the second-highest-level hierarchical division units, and may generate a third embedding vector for each of a plurality of highest-level hierarchical division units by using a plurality of second embedding vectors whose highest-level hierarchical division units have the same identification information.

According to the disclosed embodiments, a subgraph containing a plurality of interrelated entities included in the smallest level that carries a meaning or an assertion is turned into a single embedding vector based on the hierarchical structure of the document, so that document embedding vectors reflecting the characteristics and content of a plurality of documents can be generated.

Also, according to the disclosed embodiments, a user can select at least one of the smallest level and the largest level and compare or search across documents and across levels, thereby obtaining document comparison results at the level the user wants; a similarity that serves as the basis for comparison across documents and across levels can also be provided.

Also, according to the disclosed embodiments, the smallest level that carries a meaning or an assertion includes not only text but also information contained in images and tables, so that documents can be integrated in a way that incorporates diverse descriptions.

FIG. 1 is a block diagram of an apparatus for generating document embeddings according to one embodiment.
FIG. 2 is a diagram for explaining the hierarchical structure of a document and its hierarchical division units according to one embodiment.
FIG. 3 is a diagram illustrating a process of generating a knowledge graph for a paragraph, which is the lowest-level hierarchical division unit, according to one embodiment.
FIG. 4 is a diagram illustrating a process of extracting a plurality of subgraphs for each of a plurality of lowest-level hierarchical division units according to one embodiment.
FIG. 5 is a diagram for explaining, by way of example, a process of converting a plurality of entities included in a subgraph for a lowest-level hierarchical division unit, and the relations between the plurality of entities, into vectors according to one embodiment.
FIG. 6 is a diagram for explaining, by way of example, a similarity evaluation model used to generate the embedding vector corresponding to each of a plurality of lowest-level hierarchical division units according to one embodiment.
FIG. 7 is a diagram for explaining, by way of example, a process of generating an embedding vector for each of the higher-level hierarchical division units above the lowest-level hierarchical division units according to one embodiment.
FIG. 8 is a flowchart of a method for generating a document embedding vector according to one embodiment.
FIG. 9 is a block diagram for explaining, by way of example, a computing environment including a computing device according to one embodiment.

Hereinafter, specific embodiments will be described with reference to the drawings. The following detailed description is provided to aid a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, this is merely an example, and the disclosed embodiments are not limited thereto.

In describing the embodiments, detailed descriptions of related known technology are omitted where they are judged likely to obscure the gist of the disclosed embodiments unnecessarily. The terms used below are defined in consideration of their functions in the disclosed embodiments and may vary according to the intention or custom of users or operators. Their definitions should therefore be based on the content of this specification as a whole. The terminology used in the detailed description is intended only to describe the embodiments and should in no way be limiting. Unless clearly used otherwise, expressions in the singular include the plural. In this description, expressions such as "including" or "having" are intended to indicate certain features, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed to exclude the presence or possibility of one or more other features, numbers, steps, operations, elements, or parts or combinations thereof beyond those described.

FIG. 1 is a block diagram of an apparatus for generating document embeddings according to one embodiment.

Referring to FIG. 1, an apparatus 100 for generating document embeddings according to one embodiment (hereinafter, the generating apparatus) includes an entity extractor 110, a graph generator 120, a graph extractor 130, and an embedding vector generator 140.

According to one embodiment, the entity extractor 110, the graph generator 120, the graph extractor 130, and the embedding vector generator 140 may each be implemented using one or more physically separate devices, or by one or more hardware processors or a combination of one or more hardware processors and software; unlike the illustrated example, they may not be clearly distinguished in their specific operation.

The entity extractor 110 extracts, from a plurality of documents, a plurality of entities, relations between the plurality of entities, and an attribute for each of the plurality of entities based on the hierarchical structure of each of the plurality of documents.

Here, each of the plurality of documents is a document that can be written or read using a device with information processing capability, such as a personal computer, tablet PC, laptop PC, or smartphone, and may refer to a document containing text, images, and tables in which content such as meanings, assertions, and information is expressed in natural language. For example, the plurality of documents may include document files created with an electronic document authoring program, web pages, documents listed in online encyclopedias, and the like. However, any document that contains natural-language text, images, and tables and can be written or read using a device with information processing capability qualifies, and the documents are not necessarily limited to a particular form or type.

Meanwhile, an entity is a piece of tangible or intangible information that exists independently in the real world and is composed of one or more mutually associated attributes; an entity has one or more relations with other entities.

An attribute is an item that makes up an entity and describes a characteristic of that entity.

A relation between entities is a logical connection between entities or between attributes.
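
As a rough illustration of these three definitions, the sketch below models entities, attributes, and relations as plain Python data classes. The class names, the `relations_of` helper, and all example values are hypothetical and chosen only for illustration; the source does not prescribe any particular data model.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity and the attributes that describe it."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    """A logical connection between two entities, stored as a triple."""
    head: str
    label: str
    tail: str


def relations_of(entity, relations):
    """Return every relation in which the entity takes part."""
    return [r for r in relations if entity.name in (r.head, r.tail)]


# Invented example values, for illustration only.
a = Entity("A", {"kind": "method"})
b = Entity("B", {"kind": "dataset"})
c = Entity("C", {"kind": "metric"})
relations = [Relation("A", "trained_on", "B"), Relation("A", "evaluated_by", "C")]
```

Storing relations as (head, label, tail) triples keeps them independent of the entity objects, which makes it straightforward to turn them into graph edges later.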

Specifically, according to one embodiment, the entity extractor 110 may extract one or more entities from each of the plurality of documents by using a pre-trained entity extraction model. The entity extraction model may be, for example, an artificial neural network based model pre-trained to return the one or more entities contained in input text data when text data in natural-language form is input.

The entity extraction scheme used by the entity extractor 110 is not necessarily limited to the above example; the entity extractor 110 may also extract entities from the plurality of documents by using any of various known Named Entity Recognition (NER) techniques.

According to one embodiment, the entity extractor 110 may extract, as the attribute for each of the plurality of entities, per-level identification information that identifies the region from which that entity was extracted, based on the hierarchical structure of the document, among the plurality of documents, from which the entity was extracted.

A hierarchical structure is a structure in which the relationships between the structural parts of a document, as arranged by the document's author, include upper-lower relationships, containment relationships, one-to-many relationships, and the like.

Meanwhile, according to one embodiment, the hierarchical structure may be composed of a combination, in an upper-lower relationship, of at least two hierarchical division units among types of hierarchical division units including paragraph, page, section, chapter, and document.

Among the hierarchical division units, a paragraph may refer to a chunk composed of one or more sentences that expresses a single assertion or piece of information. According to one embodiment, a paragraph may also include images or tables related to the paragraph.

Among the hierarchical division units, a page may refer to one side of a document, according to a format preset for the document.

Among the hierarchical division units, a section may refer to a region holding related material or information about an individual object, event, treatment, or operation covered in the document.

Among the hierarchical division units, a chapter may refer to a region that covers a single topic and is composed of a plurality of paragraphs, a plurality of pages, or a plurality of sections.

However, the hierarchical division units are not necessarily limited to the units described above; any unit that carries a meaning or an assertion contained in the document, and can therefore hold contextual information, may be used.

According to one embodiment, the types of hierarchical division units constituting the hierarchical structure and the combination of their upper-lower relationships may be preset or determined by a user.

Meanwhile, according to one embodiment, the entity extractor 110 may extract, as the attribute for each of the plurality of entities, per-level identification information that identifies the region from which that entity was extracted, based on the hierarchical structure of the document, among the plurality of documents, from which the entity was extracted.

Here, according to one embodiment, the per-level identification information may include identification information of the lowest-level hierarchical division unit constituting the hierarchical structure and identification information of each higher-level hierarchical division unit above the lowest-level hierarchical division unit.

FIG. 2 is a diagram for explaining the hierarchical structure of a document and its hierarchical division units according to one embodiment.

The document 200 shown in FIG. 2 includes three hierarchical division units (document, chapter, and paragraph) and has a hierarchical structure whose upper-lower relationships descend from document to paragraph. Here, the lowest-level hierarchical division unit of the document 200 may be the paragraph, the second-highest-level hierarchical division unit may be the chapter, and the highest-level hierarchical division unit may be the document.

According to one embodiment, the document 200, which corresponds to the highest-level hierarchical division unit, includes the chapters 210 and 220, which are the second-highest-level hierarchical division units labeled “I. Introduction” and “II. Methods”, and each chapter may include paragraphs 211, 212, and 221, which are the lowest-level hierarchical division units.
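
The document-chapter-paragraph structure of FIG. 2 can be sketched as a small nested data structure. The dictionary layout, the `lowest_level_units` helper, and the identifier `par2` for paragraph 212 are assumptions made for illustration; the other identifiers follow the doc1 / chap 1 / par1 naming used in this description.

```python
# Hypothetical nested representation of document 200 from FIG. 2.
document_200 = {
    "id": "doc1",
    "chapters": [
        # Chapter 210 with paragraphs 211 and 212 ("par2" is an assumed ID).
        {"id": "chap 1", "title": "I. Introduction", "paragraphs": ["par1", "par2"]},
        # Chapter 220 with paragraph 221.
        {"id": "chap 2", "title": "II. Methods", "paragraphs": ["par1"]},
    ],
}


def lowest_level_units(doc):
    """Yield one (document, chapter, paragraph) ID tuple per paragraph."""
    for chapter in doc["chapters"]:
        for par in chapter["paragraphs"]:
            yield (doc["id"], chapter["id"], par)


units = list(lowest_level_units(document_200))
```

Each tuple lists the identifiers from the highest level down to the lowest, so two paragraphs that are both called par1 remain distinguishable by their chapter.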

According to one embodiment, the entity extractor 110 may extract, as the attribute for each of the plurality of entities, per-level information on the levels that contain that entity, based on the hierarchical structure of the document 200.

Specifically, as the per-level identification information, the entity extractor 110 may extract the types of hierarchical division units constituting the hierarchical structure of the document and the identification information of each hierarchical division unit as the attribute for each of the plurality of entities.

For example, to identify the first of the two paragraphs 211 and 212 included in the chapter “I. Introduction” 210 of the document 200, the entity extractor 110 may extract par1, the identification information for the first paragraph 211. It may also extract chap 1, the identification information of the chapter 210, which is the second-highest-level hierarchical division unit above the paragraph 211, and doc1, the identification information of the document 200, which is the highest-level hierarchical division unit above the paragraph.

That is, for the paragraph 211, where the lowest-level hierarchical division unit is the paragraph and the hierarchical structure is document-chapter-paragraph, the entity extractor 110 may extract the per-level identification information doc1, chap 1, and par1 as the attribute for each of the plurality of entities (for example, A, B, and C) included in the paragraph 211.

As another example, to identify the first paragraph 221 included in the chapter “II. Methods” 220 of the document 200, the entity extractor 110 may extract par1, the identification information for the paragraph 221. It may also extract chap 2, the identification information of the chapter 220, which is the second-highest-level hierarchical division unit above the paragraph 221, and doc1, the identification information of the document 200, which is the highest-level hierarchical division unit above the paragraph.

That is, for the paragraph 221, where the lowest-level hierarchical division unit is the paragraph and the hierarchical structure is document-chapter-paragraph, the entity extractor 110 may extract the per-level identification information doc1, chap 2, and par1 as the attribute for the entity (for example, A) included in the paragraph 221.
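
The two examples above can be sketched as follows. The `tag_entities` helper and the dictionary keys are hypothetical; the entity names and identifiers are the ones given for paragraphs 211 and 221.

```python
def tag_entities(entities, doc_id, chap_id, unit_id):
    """Attach the per-level IDs of a lowest-level unit to each entity found in it."""
    return [
        {"entity": name, "doc": doc_id, "chapter": chap_id, "unit": unit_id}
        for name in entities
    ]


# Paragraph 211: entities A, B, and C under doc1 / chap 1 / par1.
par_211 = tag_entities(["A", "B", "C"], "doc1", "chap 1", "par1")
# Paragraph 221: entity A under doc1 / chap 2 / par1.
par_221 = tag_entities(["A"], "doc1", "chap 2", "par1")
```

Both paragraphs carry the identifier par1, so the paragraph identifier alone is ambiguous; only the combination of document, chapter, and paragraph identifiers pins down where an occurrence of entity A was extracted.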

Here, according to one embodiment, the figure 222 of the document 200 may be treated as a hierarchical division unit at the same level as the paragraph 221. For example, to identify the figure 222 included in the chapter “II. Methods” 220 of the document 200, fig1, the identification information for the figure 222, may be extracted. In addition, chap 2, the identification information of the chapter 220, which is the second-highest-level hierarchical division unit above the figure 222, and doc1, the identification information of the document 200, which is the highest-level hierarchical division unit above the figure 222, may be extracted.

즉, 엔티티 추출부(110)는 최하위 계층 구분 단위가 문단이며, 계층적 구조가 문서-챕터-문단으로 구성된 그림(222)의 계층 별 식별 정보인 doc1, chap 2, fig1을 그림(222)에 포함된 복수의 엔티티(예를 들어, A1, A2, A3 및 A4) 각각에 대한 속성으로 추출할 수 있다. That is, the entity extraction unit 110 converts doc1, chap 2, and fig1, which are identification information for each layer of the figure 222 in which the lowest hierarchical unit is a paragraph and the hierarchical structure is document-chapter-paragraph, to the figure 222. It may be extracted as an attribute for each of a plurality of included entities (eg, A1, A2, A3, and A4).

According to another embodiment, the figure 222 of the document 200 may be treated as part of the paragraph 221, in the same way as a sentence. For example, to identify the figure 222 associated with the paragraph 221 of the document 200, par1, the identification information of the paragraph 221, may be extracted as the identification information for the figure 222. In addition, chap2, the identification information of the chapter 220, which is the next-higher hierarchical division unit of the paragraph 221 containing the figure 222, and doc1, the identification information of the document 200, which is the highest hierarchical division unit of the paragraph 221, may be extracted.

That is, for the figure 222, whose lowest hierarchical division unit is the paragraph and whose hierarchical structure is document-chapter-paragraph, the entity extraction unit 110 may extract the per-level identification information doc1, chap2, and par1 as attributes of each of the plurality of entities (for example, A1, A2, A3, and A4) included in the figure 222.
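
The attribute extraction described above can be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: the `Entity` class and the helpers `tag_entities` and `figure_unit_id` are hypothetical names, and the dotted identifier format is assumed from the doc1.chap1.par1 notation used elsewhere in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An entity plus its per-level identification information (hypothetical)."""
    name: str
    doc: str
    chap: str
    unit: str  # lowest-level unit id, e.g. "par1" or "fig1"

    @property
    def hier_id(self) -> str:
        # Dotted form, e.g. "doc1.chap2.par1" (assumed format).
        return f"{self.doc}.{self.chap}.{self.unit}"


def tag_entities(names, doc, chap, unit):
    """Attach the same hierarchy identifiers to every entity of one unit."""
    return [Entity(n, doc, chap, unit) for n in names]


def figure_unit_id(fig_id, par_id, figure_as_own_unit):
    """Pick the lowest-level id for a figure under the two embodiments."""
    return fig_id if figure_as_own_unit else par_id


# Entity A from paragraph 221.
par_entities = tag_entities(["A"], "doc1", "chap2", "par1")

# Entities A1..A4 from figure 222, under each embodiment.
fig_names = ["A1", "A2", "A3", "A4"]
fig_own = tag_entities(
    fig_names, "doc1", "chap2", figure_unit_id("fig1", "par1", True))
fig_in_par = tag_entities(
    fig_names, "doc1", "chap2", figure_unit_id("fig1", "par1", False))
```

Under the first embodiment the figure's entities carry `doc1.chap2.fig1`; under the second they share `doc1.chap2.par1` with the entities of the paragraph itself.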

The graph generation unit 120 generates a knowledge graph corresponding to the plurality of documents based on the plurality of entities and the relations among the plurality of entities.

According to one embodiment, the knowledge graph may include entity nodes corresponding to the respective entities and edges corresponding to the relations among the entities.

According to one embodiment, the graph generation unit 120 may determine knowledge about one or more entities extracted from the plurality of documents by using a pre-built knowledge base such as Wikidata, YAGO (Yet Another Great Ontology), Freebase, DBpedia, WordNet, Cyc, or BabelNet.

Specifically, knowledge is expressed as a triple of the form <subject entity, predicate, object entity>, and the graph generation unit 120 may generate the knowledge graph corresponding to the plurality of documents based on triples that contain the plurality of entities and the relations among them.

According to one embodiment, the knowledge graph may include entity nodes corresponding to the respective entities and edges corresponding to the relations among the entities.

FIG. 3 is a diagram illustrating a process of generating a knowledge graph for a paragraph, which is the lowest hierarchical division unit, according to one embodiment.

According to one embodiment, the graph generation unit 120 may generate the knowledge graph using a pre-built knowledge base.

The knowledge graph shown in FIG. 3 may correspond to a paragraph consisting of two sentences: 'A technique belongs to B field.' and 'B is known as C.'

The knowledge in the first sentence of the paragraph is expressed as the triple <A technique, belongs to, B field>, and the knowledge graph for the first sentence may include entity node 1 (311) corresponding to “A technique”, entity node 2 (313) corresponding to “B field”, and an edge 312 corresponding to “belongs to”. Likewise, the knowledge in the second sentence is expressed as the triple <B, is known as, C>, and the knowledge graph for that sentence may include entity node 2 (313) corresponding to “B”, entity node 3 (315) corresponding to “C”, and an edge 314 corresponding to “is known as”.
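
As a rough illustration of how triples could be assembled into a graph, the Python sketch below builds a node set and an adjacency map from the two example triples. The function name `build_knowledge_graph` and the data layout are assumptions made for this example; the sketch also assumes that “B field” and “B” have already been resolved to the same entity label, as entity node 2 (313) suggests.

```python
from collections import defaultdict


def build_knowledge_graph(triples):
    """Return (nodes, edges); edges maps subject -> [(predicate, object), ...]."""
    nodes = set()
    edges = defaultdict(list)
    for subj, pred, obj in triples:
        nodes.update((subj, obj))        # one entity node per distinct entity
        edges[subj].append((pred, obj))  # one labelled edge per triple
    return nodes, dict(edges)


# Triples for the two-sentence paragraph of FIG. 3 (labels normalized by hand).
triples = [
    ("A technique", "belongs to", "B"),
    ("B", "is known as", "C"),
]
nodes, edges = build_knowledge_graph(triples)
```

Because both sentences mention the same entity B, the two triples share a node, which is what joins the sentence-level graphs into a single graph for the paragraph.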

The graph extraction unit 130 extracts, from the knowledge graph, a plurality of subgraphs corresponding to the respective lowest hierarchical division units, based on the attributes of each of the plurality of entities.

According to one embodiment, the graph extraction unit 130 may identify, in the knowledge graph, the entity nodes corresponding to entities whose per-level identification information is identical.

Then, according to one embodiment, the graph extraction unit 130 may extract a subgraph that includes the identified entity nodes and the edges of the identified entity nodes.

Here, according to one embodiment, the edges among the entity nodes may mean the edges in the knowledge graph between entity nodes that correspond to entities having identical per-level identification information.

FIG. 4 is a diagram illustrating a process of extracting a plurality of subgraphs corresponding to the respective lowest hierarchical division units, according to one embodiment.

Referring to FIG. 4, the graph extraction unit 130 may extract, from the knowledge graph, subgraphs 410, 430, and 450 for the respective paragraphs, which are the lowest hierarchical division units.

According to one embodiment, the graph extraction unit 130 may identify, in the knowledge graph, the entity nodes corresponding to entities whose per-level identification information is identical.

According to one embodiment, the per-level identification information of the paragraphs corresponding to the subgraphs 410, 430, and 450 may be doc1.chap1.par1, doc2.chap2.par1, and doc1.chap1.par2, respectively.

For example, the graph extraction unit 130 may identify entity nodes 1, 2, and 3 (411, 413, 415), which correspond to the entities that share the per-level identification information doc1.chap1.par1.

As another example, the graph extraction unit 130 may identify entity nodes 1, 5, and 6 (411, 431, 433), which correspond to the entities that share the per-level identification information doc2.chap2.par1.

As yet another example, the graph extraction unit 130 may identify entity nodes 2 and 4 (413, 451), which correspond to the entities that share the identification information doc1.chap1.par2.

Specifically, according to one embodiment, the graph extraction unit 130 may extract a subgraph that includes the identified entity nodes and the edges between the identified entity nodes.

For example, the graph extraction unit 130 may extract the subgraph 410, which includes entity nodes 1, 2, and 3 (411, 413, 415) and the edges 412 and 414 between them.

As another example, the graph extraction unit 130 may extract the subgraph 430, which includes entity nodes 1, 5, and 6 (411, 431, 433) and the edges 432 and 434 between them.

As yet another example, the graph extraction unit 130 may extract the subgraph 450, which includes entity nodes 2 and 4 (413, 451) and the edge 452 between them.
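
The grouping of nodes and edges by shared identification information can be sketched in Python as below. This is only an illustration under stated assumptions: `extract_subgraphs` is a hypothetical helper, nodes are referred to by their entity-node numbers, and the edge endpoints are guessed for the example because the description gives only the edge reference numerals (412, 414, 432, 434, 452), not which nodes each edge joins.

```python
def extract_subgraphs(node_ids, edges):
    """Group nodes by hierarchical id; keep edges whose endpoints share that id.

    node_ids: dict mapping node -> set of hierarchical ids it was extracted from
    edges:    iterable of (source, target) node pairs
    """
    subgraphs = {}
    for node, ids in node_ids.items():
        for hid in ids:
            sub = subgraphs.setdefault(hid, {"nodes": set(), "edges": []})
            sub["nodes"].add(node)
    for src, dst in edges:
        # An edge belongs to every unit that both of its endpoints belong to.
        for hid in node_ids[src] & node_ids[dst]:
            subgraphs[hid]["edges"].append((src, dst))
    return subgraphs


# Entity nodes 1..6 of FIG. 4 with the units they appear in.
node_ids = {
    1: {"doc1.chap1.par1", "doc2.chap2.par1"},
    2: {"doc1.chap1.par1", "doc1.chap1.par2"},
    3: {"doc1.chap1.par1"},
    4: {"doc1.chap1.par2"},
    5: {"doc2.chap2.par1"},
    6: {"doc2.chap2.par1"},
}
# Assumed endpoints, for illustration only.
edges = [(1, 2), (2, 3), (1, 5), (5, 6), (2, 4)]

subgraphs = extract_subgraphs(node_ids, edges)
```

A node such as entity node 1 can appear in more than one subgraph, which is why each node maps to a set of identifiers here. A stricter variant could store the identifier on each edge as well, so that an edge is only ever assigned to the unit it was extracted from.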

The embedding vector generation unit 140 generates a first embedding vector for each of the plurality of subgraphs.

Specifically, according to one embodiment, to generate the first embedding vector, the embedding vector generation unit 140 may generate an input embedding vector by converting each of the plurality of entities and each of the relations among the entities into a vector.

For example, the embedding vector generation unit 140 may generate the input embedding vector for each of the entities and each of the relations among them by using TF-IDF (Term Frequency-Inverse Document Frequency), one-hot encoding, word embedding, or the like, but is not necessarily limited thereto.
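
To make the one-hot option concrete, the Python sketch below encodes the entities and relations of a small subgraph. The helper names `one_hot_vocab` and `input_embedding` are hypothetical, and concatenating the subject, predicate, and object vectors of each triple into one row is an assumption of this example; the description does not state how the individual vectors are combined.

```python
def one_hot_vocab(items):
    """Map each distinct item to a one-hot vector over the sorted vocabulary."""
    vocab = sorted(set(items))
    return {
        item: [1 if i == j else 0 for j in range(len(vocab))]
        for i, item in enumerate(vocab)
    }


def input_embedding(triples):
    """Encode each (subject, predicate, object) triple as one concatenated row."""
    entity_vecs = one_hot_vocab([x for s, _, o in triples for x in (s, o)])
    relation_vecs = one_hot_vocab([p for _, p, _ in triples])
    return [
        entity_vecs[s] + relation_vecs[p] + entity_vecs[o]
        for s, p, o in triples
    ]


triples = [("A", "belongs to", "B"), ("B", "is known as", "C")]
rows = input_embedding(triples)
```

With three entities and two relations, each row has 3 + 2 + 3 = 8 components. TF-IDF weights or pretrained word embeddings could replace the one-hot vectors without changing the overall shape of the procedure.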

Then, according to one embodiment, the embedding vector generation unit 140 may generate a first embedding vector that represents the lowest hierarchical division unit, based on the plurality of entities and the relations among them.

Here, according to one embodiment, the first embedding vector may be generated based on the similarity between the subgraphs corresponding to the respective lowest hierarchical division units.

Meanwhile, according to one embodiment, the similarity between the subgraphs corresponding to the respective lowest hierarchical division units may be determined using a pre-trained similarity evaluation model. Specifically, the similarity evaluation model may be an artificial-neural-network-based model that is pre-trained to output the similarity between a plurality of subgraphs, each of which is a partial graph of the knowledge graph, from the entities and the relations among the entities included in those subgraphs.

FIG. 5 is a diagram illustrating a process of converting the plurality of entities, and the relations among them, included in a subgraph for a lowest hierarchical division unit into vectors, according to one embodiment.

According to one embodiment, to generate the first embedding vector, the embedding vector generation unit 140 may generate an input embedding vector 510 based on the plurality of entities and the relations among them.

Referring to FIG. 5, for example, the embedding vector generation unit 140 may generate the input embedding vector 510 based on the vectors 511, 512, 513, 514, and 515 for the respective entities 411, 413, and 415 and relations 412 and 414 included in the subgraph 410 shown in FIG. 3.

FIG. 6 is a diagram illustrating a similarity evaluation model used to generate an embedding vector corresponding to each of the lowest hierarchical division units, according to one embodiment.

According to one embodiment, the embedding vector generation unit 140 may generate the first embedding vector for each of the lowest hierarchical division units by feeding, into the pre-trained similarity evaluation model, the input embedding vector converted from the plurality of entities and the relations among them.

Here, according to one embodiment, the similarity evaluation model may be updated based on the similarity between the subgraphs for the respective lowest hierarchical division units.

According to one embodiment, the embedding vector generation unit 140 may train the similarity evaluation model so that the first embedding vectors for the respective subgraphs become closer in similarity.

For example, referring to FIG. 6, the embedding vector generation unit 140 may input the input embedding vectors 1 and 2 (611, 612), for the subgraphs whose lowest hierarchical division units have the per-level identification information Doc1.chap1.par1 and Doc1.chap1.par2, into the similarity evaluation model to generate the first embedding vectors 621 and 622 corresponding to the respective subgraphs.

According to one embodiment, the embedding vector generation unit 140 may update the loss function 631 of the similarity evaluation model based on the first embedding vectors 621 and 622.

Specifically, according to one embodiment, the loss function 631 of the similarity evaluation model may be obtained through Equation 1 below.

[Equation 1]

Here, the two vector symbols in Equation 1 denote the first embedding vectors 621 and 622 for the respective paragraphs, which are the lowest hierarchical division units; m denotes a margin; and Y denotes the similarity between the subgraphs for the respective paragraphs. However, the loss function of the similarity evaluation model is not limited to Equation 1 above, and any metric loss capable of learning similarity, such as triplet loss, margin loss, or N-pair loss, may be used.
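
The quantities named here (two embedding vectors, a margin m, and a similarity label Y) match the ingredients of the widely used contrastive loss. The Python sketch below implements that standard form as an assumption; it is not confirmed to be the exact expression of Equation 1. The function names, the Euclidean distance, and the convention that Y = 1 marks a similar pair are all choices made for this illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, y, margin=1.0):
    """Standard contrastive loss for one pair of embedding vectors.

    y = 1: the pair is similar, so the loss grows with the distance.
    y = 0: the pair is dissimilar, so the loss is non-zero only while the
           distance is still smaller than the margin.
    """
    d = euclidean(u, v)
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2


similar_pair = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1)
close_dissimilar = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0, margin=6.0)
far_dissimilar = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0, margin=1.0)
```

Minimizing a loss of this kind pulls the embeddings of similar subgraphs together and pushes dissimilar ones apart until they are at least a margin away, which is the behaviour a metric loss is expected to provide.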

According to one embodiment, the similarity used by the similarity evaluation model may be defined in various ways, for example through graph isomorphism or through methodologies that iteratively compare graph features.
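
One well-known family of iterative comparison methods is Weisfeiler-Lehman label refinement. The Python sketch below is a simplified illustration in that style, offered as an assumption about what such a methodology could look like rather than as the method used here: it treats the graph as undirected, ignores relation labels, and scores two graphs by the overlap of their refined label multisets. The function names `wl_features` and `wl_similarity` are hypothetical.

```python
from collections import Counter


def wl_features(adj, labels, iterations=2):
    """Collect node labels over several refinement rounds as a multiset."""
    feats = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        # New label = own label plus the sorted labels of the neighbours.
        current = {
            n: current[n] + "|" + ",".join(sorted(current[m] for m in adj[n]))
            for n in adj
        }
        feats.update(current.values())
    return feats


def wl_similarity(g1, g2, iterations=2):
    """Multiset Jaccard overlap of refined labels, in the range [0, 1]."""
    f1 = wl_features(g1[0], g1[1], iterations)
    f2 = wl_features(g2[0], g2[1], iterations)
    inter = sum((f1 & f2).values())
    union = sum((f1 | f2).values())
    return inter / union if union else 1.0


# Each graph is (adjacency map, node-label map); every node is a key of both.
g1 = ({1: [2], 2: [1, 3], 3: [2]}, {1: "A", 2: "B", 3: "C"})
g2 = ({2: [4], 4: [2]}, {2: "B", 4: "D"})

same = wl_similarity(g1, g1)
different = wl_similarity(g1, g2)
```

Two identical graphs score 1.0, and graphs that share only some labelled structure score between 0 and 1. Scores of this kind could serve as the similarity targets when training the similarity evaluation model.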

The embedding vector generation unit 140 generates an embedding vector for each higher hierarchical division unit above the plurality of lowest hierarchical division units, based on the first embedding vector for each of the plurality of subgraphs.

According to one embodiment, the embedding vector generation unit 140 may check, based on the per-level identification information, the identification information of each higher hierarchical division unit above a lowest hierarchical division unit.

Then, according to one embodiment, the embedding vector generation unit 140 may generate a second embedding vector corresponding to each next-higher hierarchical division unit by using the plurality of first embedding vectors whose higher hierarchical division units have identical identification information.

According to one embodiment, the embedding vector generation unit 140 may check, based on the second embedding vector, the identification information of the highest hierarchical division unit above the next-higher hierarchical division unit.

According to one embodiment, the embedding vector generation unit 140 may generate a third embedding vector for each of the plurality of highest hierarchical division units by using the plurality of second embedding vectors whose highest hierarchical division units have identical identification information.

FIG. 7 is a diagram illustrating a process of generating an embedding vector for each higher hierarchical division unit above the lowest hierarchical division units, according to one embodiment.

Referring to FIG. 7, the process of generating an embedding vector for each higher hierarchical division unit above the lowest hierarchical division units may include a first embedding vector generation process 710, a second embedding vector generation process 730, and a third embedding vector generation process 750.

According to one embodiment, the first embedding vector generation process 710 may include generating a first embedding vector corresponding to each of the subgraphs for the lowest hierarchical division units by feeding, into the pre-trained similarity evaluation model, the input embedding vector for the plurality of entities and the relations among them.

For example, through the first embedding vector generation process 710, the embedding vector generation unit 140 may generate the first embedding vector corresponding to the subgraph for the paragraph whose per-level identification information is Doc1.chap1.par1.

According to one embodiment, in the second embedding vector generation process 730, a second embedding vector 731 for a next-higher hierarchical division unit (chap) may be generated using the plurality of first embedding vectors 711, ···, 719 whose identification information 725 for the higher hierarchical division unit is identical.

According to one embodiment, in the third embedding vector generation process 750, a third embedding vector 751 for a highest hierarchical division unit (Doc) may be generated using the plurality of second embedding vectors 731, ···, 739 whose identification information 745 for the highest hierarchical division unit is identical.
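
The bottom-up roll-up from paragraph vectors to chapter vectors to document vectors can be sketched in Python as follows. The description says only that vectors sharing the same higher-level identification information are used together; averaging them (mean pooling) is an assumption of this example, as are the helper names `parent_id` and `aggregate` and the sample vectors.

```python
from collections import defaultdict


def parent_id(hier_id):
    """Drop the last level: 'Doc1.chap1.par1' -> 'Doc1.chap1'."""
    return hier_id.rsplit(".", 1)[0]


def aggregate(embeddings):
    """Average the vectors that share a parent id (mean pooling, assumed)."""
    groups = defaultdict(list)
    for hid, vec in embeddings.items():
        groups[parent_id(hid)].append(vec)
    return {
        pid: [sum(col) / len(vecs) for col in zip(*vecs)]
        for pid, vecs in groups.items()
    }


# First embedding vectors, one per paragraph (made-up values).
first = {
    "Doc1.chap1.par1": [1.0, 0.0],
    "Doc1.chap1.par2": [3.0, 2.0],
    "Doc1.chap2.par1": [0.0, 4.0],
}
second = aggregate(first)   # one vector per chapter
third = aggregate(second)   # one vector per document
```

Applying the same function twice mirrors processes 730 and 750: the first call groups by chapter, the second by document. Other pooling choices, such as summation or a learned attention layer, would fit the same structure.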

FIG. 8 is a flowchart of a method of generating a document embedding vector according to one embodiment.

The method shown in FIG. 8 may be performed, for example, by the document embedding generation apparatus 100 shown in FIG. 1.

Referring to FIG. 8, the document embedding generation apparatus 100 extracts, from a plurality of documents, a plurality of entities, the relations among the plurality of entities, and an attribute of each of the plurality of entities based on the hierarchical structure of each of the plurality of documents (810).

Next, the document embedding generation apparatus 100 generates a knowledge graph corresponding to the plurality of documents based on the plurality of entities and the relations among them (820).

Next, the document embedding generation apparatus 100 extracts, from the knowledge graph, a plurality of subgraphs corresponding to the respective lowest hierarchical division units, based on the attribute of each of the plurality of entities (830).

Next, the document embedding generation apparatus 100 generates a first embedding vector for each of the plurality of subgraphs (840).

Next, the document embedding generation apparatus 100 generates an embedding vector for each higher hierarchical division unit above the plurality of lowest hierarchical division units, based on the first embedding vector for each of the plurality of subgraphs (850).

FIG. 9 is a block diagram illustrating a computing environment 10 that includes a computing device according to one embodiment. In the illustrated embodiment, each component may have functions and capabilities other than those described below, and additional components beyond those described below may be included.

The illustrated computing environment 10 includes a computing device 12. In one embodiment, the computing device 12 may be the document embedding generation apparatus 100 shown in FIG. 1.

The computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate according to the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored in the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions, which, when executed by the processor 14, may be configured to cause the computing device 12 to perform operations according to the exemplary embodiments.

The computer-readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. A program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In one embodiment, the computer-readable storage medium 16 may be a memory (a volatile memory such as a random access memory, a non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the computing device 12 and can store the desired information, or a suitable combination thereof.

The communication bus 18 interconnects the various other components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.

The computing device 12 may also include one or more input/output interfaces 22, which provide interfaces for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interfaces 22 and the network communication interfaces 26 are connected to the communication bus 18. The input/output devices 24 may be connected to other components of the computing device 12 through the input/output interfaces 22. Exemplary input/output devices 24 may include input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touchscreen, or the like), a voice or sound input device, various types of sensor devices, and/or an image capturing device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. An exemplary input/output device 24 may be included inside the computing device 12 as a component of the computing device 12, or may be connected to the computing device 12 as a separate device distinct from the computing device 12.

Meanwhile, embodiments of the present invention may include a program for performing the methods described in this specification on a computer, and a computer-readable recording medium containing the program. The computer-readable recording medium may include program instructions, local data files, local data structures, and the like, alone or in combination. The medium may be one specially designed and configured for the present invention, or one commonly available in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of the program may include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

Although representative embodiments of the present invention have been described in detail above, those of ordinary skill in the art to which the present invention pertains will understand that various modifications can be made to the embodiments described above without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims below and their equivalents.

10: computing environment
12: computing device
14: processor
16: computer-readable storage medium
18: communication bus
20: program
22: input/output interface
24: input/output device
26: network communication interface
100: document embedding generation apparatus
110: entity extraction unit
120: graph generation unit
130: graph extraction unit
140: embedding vector generation unit

Claims

A method for generating document embedding, comprising:
extracting, from a plurality of documents, a plurality of entities, relations between the plurality of entities, and an attribute of each of the plurality of entities based on a hierarchical structure of each of the plurality of documents;
generating a knowledge graph corresponding to the plurality of documents based on the plurality of entities and the relations between the plurality of entities;
extracting, from the knowledge graph, a plurality of subgraphs corresponding to each of a plurality of lowest hierarchical division units based on the attribute of each of the plurality of entities;
generating a first embedding vector for each of the plurality of subgraphs; and
generating an embedding vector for each of the upper hierarchical division units of the plurality of lowest hierarchical division units based on the first embedding vector for each of the plurality of subgraphs.

The method of claim 1,
wherein the extracting extracts, as the attribute of each of the plurality of entities, identification information for each layer for identifying a region from which each of the plurality of entities is extracted, based on the hierarchical structure of the document, among the plurality of documents, from which each of the plurality of entities is extracted.

The method of claim 1,
wherein the hierarchical structure consists of a combination of upper-lower relationships among at least two hierarchical division units selected from types of hierarchical division units including a paragraph, a page, a section, a chapter, and a document.

The method of claim 2,
wherein the identification information for each layer includes identification information of the lowest hierarchical division unit constituting the hierarchical structure and identification information of each of the upper hierarchical division units of the lowest hierarchical division unit.
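
For illustration only, the identification information for each layer described in claims 2 to 4 might be represented as follows. This is a minimal sketch, not the claimed implementation: the names `LevelIds`, `Entity`, `lowest_unit_key`, and `upper_unit_keys` are hypothetical, and the document > page > paragraph hierarchy is just one assumed combination of the division unit types listed in claim 3.

```python
from typing import NamedTuple


class LevelIds(NamedTuple):
    """Hypothetical per-layer IDs, ordered from highest to lowest level."""
    document: str
    page: int
    paragraph: int


class Entity(NamedTuple):
    """An extracted entity together with its hierarchy-based attribute."""
    name: str
    ids: LevelIds


def lowest_unit_key(entity: Entity) -> tuple:
    # The full ID path identifies the lowest division unit (here, a paragraph).
    return tuple(entity.ids)


def upper_unit_keys(entity: Entity) -> list:
    # Dropping trailing IDs one at a time yields each upper division unit,
    # from the next-higher level up to the highest level.
    ids = tuple(entity.ids)
    return [ids[:k] for k in range(len(ids) - 1, 0, -1)]
```

Under these assumptions, an entity extracted from paragraph 2 of page 3 of document "D1" carries the key `("D1", 3, 2)` for its lowest unit and the keys `("D1", 3)` and `("D1",)` for its upper units.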

The method of claim 2,
wherein the knowledge graph includes an entity node corresponding to each of the plurality of entities and an edge corresponding to each relation between the plurality of entities.

The method of claim 5,
wherein the extracting of the plurality of subgraphs comprises:
identifying, in the knowledge graph, a plurality of entity nodes corresponding to each of a plurality of entities having the same identification information for each layer; and
extracting a subgraph including the identified plurality of entity nodes and edges for the identified plurality of entity nodes.
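
The two steps of claim 6 can be sketched roughly as below. This is an illustrative sketch under stated assumptions rather than the claimed implementation: the function name `extract_subgraphs` and the data layout are hypothetical, each entity node is assumed to carry a single ID tuple, and "edges for the identified entity nodes" is interpreted here as edges whose two endpoints both belong to the identified node set.

```python
from collections import defaultdict


def extract_subgraphs(node_ids, edges):
    """Group entity nodes by identical per-layer IDs and collect their edges.

    node_ids: dict mapping entity name -> tuple of per-layer IDs
    edges:    list of (head, relation, tail) triples of the knowledge graph
    Returns a dict mapping each ID tuple -> (set of nodes, list of edges).
    """
    # Step 1: identify the entity nodes that share the same per-layer IDs.
    nodes_by_unit = defaultdict(set)
    for name, ids in node_ids.items():
        nodes_by_unit[ids].add(name)

    # Step 2: keep the edges whose endpoints both lie inside the node set
    # (an assumed reading of "edges for the identified entity nodes").
    subgraphs = {}
    for ids, nodes in nodes_by_unit.items():
        inner = [(h, r, t) for (h, r, t) in edges if h in nodes and t in nodes]
        subgraphs[ids] = (nodes, inner)
    return subgraphs
```

With nodes A and B tagged `("D1", 1, 1)` and node C tagged `("D1", 1, 2)`, an edge between A and B stays in the first subgraph, while an edge between B and C crosses units and is dropped by this interpretation.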

The method of claim 5,
wherein the first embedding vector is generated based on a similarity between the subgraphs corresponding to each of the plurality of lowest hierarchical division units.
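
Claim 7 does not fix a particular similarity measure or embedding model. Purely as a stand-in, the sketch below uses the Jaccard similarity between the edge sets of two subgraphs and takes each subgraph's row of pairwise similarities as its vector. Both choices, and the names `jaccard` and `similarity_embeddings`, are assumptions made for illustration and are not taken from the specification.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_embeddings(subgraph_edges):
    """Build one vector per subgraph from its similarity to every subgraph.

    subgraph_edges: dict mapping a unit key -> set of (head, relation, tail)
    Returns a dict mapping each unit key -> list of similarities, ordered by
    the sorted unit keys (an assumed, simplistic embedding scheme).
    """
    keys = sorted(subgraph_edges)
    return {
        k: [jaccard(subgraph_edges[k], subgraph_edges[j]) for j in keys]
        for k in keys
    }
```

A learned graph embedding model could replace this scheme; the sketch only shows how a vector can depend on similarities between subgraphs.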

The method of claim 2,
wherein the generating of the embedding vector for each of the upper hierarchical division units comprises:
checking identification information of each of the upper hierarchical division units of the lowest hierarchical division unit based on the identification information for each layer; and
generating a second embedding vector for each of a plurality of next-higher hierarchical division units using a plurality of first embedding vectors for which the identification information of the upper hierarchical division unit of the lowest hierarchical division unit is identical.

The method of claim 8,
wherein the generating of the embedding vector for each of the upper hierarchical division units comprises:
checking identification information of the highest hierarchical division unit of the next-higher hierarchical division unit based on the second embedding vector; and
generating a third embedding vector for each of a plurality of highest hierarchical division units using a plurality of second embedding vectors for which the identification information of the highest hierarchical division unit of the next-higher hierarchical division unit is identical.
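
The bottom-up grouping described in claims 8 and 9 can be illustrated as follows. The claims state only that vectors sharing the same upper-unit identification information are used together; the element-wise mean used here is an assumed aggregation, and the name `aggregate_to_parent` and the ID-tuple layout are hypothetical.

```python
from collections import defaultdict


def aggregate_to_parent(vectors):
    """Combine child vectors that share the same parent ID into one vector.

    vectors: dict mapping an ID tuple (highest level first) -> list of floats
    Returns a dict keyed by the parent ID tuple (last ID dropped), whose value
    is the element-wise mean of the children (an assumed aggregation rule).
    """
    groups = defaultdict(list)
    for ids, vec in vectors.items():
        groups[ids[:-1]].append(vec)
    return {
        parent: [sum(col) / len(vecs) for col in zip(*vecs)]
        for parent, vecs in groups.items()
    }


# First embedding vectors for three paragraphs on two pages of document "D1".
first = {
    ("D1", 1, 1): [1.0, 0.0],
    ("D1", 1, 2): [3.0, 2.0],
    ("D1", 2, 1): [0.0, 4.0],
}
second = aggregate_to_parent(first)   # one vector per page
third = aggregate_to_parent(second)   # one vector per document
```

Applying the same function twice mirrors the two stages: paragraph vectors are merged into page vectors, and page vectors are merged into a document vector.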

An apparatus for generating document embedding, comprising:
an entity extraction unit configured to extract, from a plurality of documents, a plurality of entities, relations between the plurality of entities, and an attribute of each of the plurality of entities based on a hierarchical structure of each of the plurality of documents;
a graph generation unit configured to generate a knowledge graph corresponding to the plurality of documents based on the plurality of entities and the relations between the plurality of entities;
a graph extraction unit configured to extract, from the knowledge graph, a plurality of subgraphs corresponding to each of a plurality of lowest hierarchical division units based on the attribute of each of the plurality of entities; and
an embedding vector generation unit configured to generate a first embedding vector for each of the plurality of subgraphs and to generate, based on the first embedding vector for each of the plurality of subgraphs, an embedding vector for each of the upper hierarchical division units of the plurality of lowest hierarchical division units.

The apparatus of claim 10,
wherein the entity extraction unit extracts, as the attribute of each of the plurality of entities, identification information for each layer for identifying a region from which each of the plurality of entities is extracted, based on the hierarchical structure of the document, among the plurality of documents, from which each of the plurality of entities is extracted.

The apparatus of claim 10,
wherein the hierarchical structure consists of a combination of upper-lower relationships among at least two hierarchical division units selected from types of hierarchical division units including a paragraph, a page, a section, a chapter, and a document.

The apparatus of claim 11,
wherein the identification information for each layer includes identification information of the lowest hierarchical division unit constituting the hierarchical structure and identification information of each of the upper hierarchical division units of the lowest hierarchical division unit.

The apparatus of claim 11,
wherein the knowledge graph includes an entity node corresponding to each of the plurality of entities and an edge corresponding to each relation between the plurality of entities.

The apparatus of claim 14,
wherein the graph extraction unit identifies, in the knowledge graph, a plurality of entity nodes corresponding to each of a plurality of entities having the same identification information for each layer, and
extracts a subgraph including the identified plurality of entity nodes and edges for the identified plurality of entity nodes.

The apparatus of claim 14,
wherein the first embedding vector is generated based on a similarity between the subgraphs corresponding to each of the plurality of lowest hierarchical division units.

The apparatus of claim 11,
wherein the embedding vector generation unit checks identification information of each of the upper hierarchical division units of the lowest hierarchical division unit based on the identification information for each layer, and
generates a second embedding vector for each of a plurality of next-higher hierarchical division units using a plurality of first embedding vectors for which the identification information of the upper hierarchical division unit of the lowest hierarchical division unit is identical.

The apparatus of claim 17,
wherein the embedding vector generation unit checks identification information of the highest hierarchical division unit of the next-higher hierarchical division unit based on the second embedding vector, and
generates a third embedding vector for each of a plurality of highest hierarchical division units using a plurality of second embedding vectors for which the identification information of the highest hierarchical division unit of the next-higher hierarchical division unit is identical.