KR102461454B1 - Document Summarization System And Summary Method Thereof

Info

Publication number: KR102461454B1
Application number: KR1020200014265A
Authority: KR
Inventors: 서동국; 이동호; 김건우
Original assignee: 한양대학교 에리카산학협력단
Priority date: 2020-02-06
Filing date: 2020-02-06
Publication date: 2022-11-08
Also published as: KR20210100370A

Abstract

A document summarization system and a summarization method thereof are disclosed. The summarization method includes: performing morphological analysis on input text and embedding each word with a morpheme tag and an entity name tag attached; inputting at least one embedded word vector into an encoder; re-inputting the word vector into a decoder and generating a tag context vector that includes the word vector and a first hidden state, which is the last hidden state of the decoder; calculating an attention weight that includes the tag context vector and a second hidden state, which is the hidden state of the word vector input to the encoder, and generating a context vector that includes the attention weight and the second hidden state; and obtaining a generation probability (Pgen) using the context vector, the first hidden state, and the word vector, and generating a sentence by generating or extracting words according to a preset reference probability value.

Description

Document Summarization System And Summary Method Thereof

The present invention relates to a document summarization system and a summarization method that extract sentences or generate new sentences.

As the inflow, transmission, and sharing of information centered on the Internet increase, vast amounts of text are being produced. This means that Internet users have access to a large body of text, but it also means that users may have difficulty searching for and selecting the text they want. To solve this problem and provide convenience to users, document summarization techniques that condense large amounts of text into a brief summary have emerged.

Existing document summarization techniques fall into two broad categories: extractive summarization, which selects key sentences from a document and assembles a summary from them, and generative (abstractive) summarization, which produces a summary by directly generating abstract sentences that reflect the content of the document well. In addition, the PG-Net (Pointer Generator Network) model has been proposed, and research to improve on it has recently been actively conducted.

Korean Patent Application Publication No. 10-2018-0023351 (Method and apparatus for generating a news summary, Konkuk University Industry-Academic Cooperation Foundation, 2018.03.07)
Korean Registered Patent Publication No. 10-0785927 (Method and apparatus for generating a text summary, Samsung Electronics Co., Ltd., 2007.12.07)
Japanese Patent Publication No. 2944346 (Document summarizing apparatus and computer-readable recording medium, SHARP CORP, 1999.06.25)

Proceedings of the 28th Hangul and Korean Information Processing Conference (Measuring sentence similarity using a morpheme embedding model and a GRU encoder in a question-answering system, Lee Dong-gun, Oh Gyo-jung, Choi Ho-jin, Heo Jeong, 2016.10.08)

Provided are a document summarization system and a summarization method that use a tag-information-based PG-Net model together with an embedding learning methodology at the level of morphemes and entity names, in order to reflect the characteristic of Korean that a single word consists of a root and affixes.

An embodiment of the document summarization method may include: performing morphological analysis on input text and embedding each word with a morpheme tag and an entity name tag attached; inputting at least one embedded word vector into an encoder; re-inputting the at least one word vector into a decoder and generating a tag context vector that includes the word vector and a first hidden state, which is the last hidden state of the decoder; calculating an attention weight that includes the tag context vector and a second hidden state, which is the hidden state of the word vector input to the encoder, and generating a context vector that includes the attention weight and the second hidden state; and obtaining a generation probability (Pgen) using the context vector, the first hidden state, and the word vector, and generating a sentence by generating or extracting words according to a preset reference probability value.

In addition, according to an embodiment, the embedding step may divide each word into a root and an affix and determine the part of speech and the entity name corresponding to the root and the affix.

In addition, according to an embodiment, the word vector may be a vector value that includes a first vector, in which the morpheme tag vector and the entity name tag vector of the root are combined, and a second vector, in which the morpheme tag vector and the entity name tag vector of the affix are combined.

In addition, according to an embodiment, the method may further include inputting the word vectors into the encoder for training.

In addition, according to an embodiment, in the step of generating the tag context vector, when the first word vector among the at least one word vector input to the encoder is a word, a START tag is inserted into the decoder and the tag context vector is generated by extracting the word and calculating the first hidden state; when the first word vector is not a word, the tag context vector may be generated without the calculation process.

In addition, according to an embodiment, the step of generating the context vector may calculate the collection of attention weight values for the second hidden states, apply a softmax function to those values to obtain an attention distribution, and compute the weighted sum of the encoder hidden states using the attention distribution.

In addition, according to an embodiment, in the step of generating or extracting the word, the generation probability (Pgen) ranges from 0.0 to 1.0. When the generation probability (Pgen) is greater than or equal to the preset reference probability value, a word with a high weight in the constructed word dictionary is generated as the next word; when the generation probability (Pgen) is less than the preset reference probability value, a word with a high weight among the word vectors inserted into the encoder may be extracted as the next word.

In addition, according to an embodiment, in the step of generating the sentence, when the last word vector is an END tag, a sentence is generated using the generated or extracted words; when the last word vector is not an END tag, the process may return to the decoder and repeat the sentence generation process.

An embodiment of the document summarization system may include: an embedding unit that performs morphological analysis on input text and embeds each word with a morpheme tag and an entity name tag attached; a generation unit in which the at least one embedded word vector is re-input to a decoder through an encoder, and which generates a tag context vector including the word vector and a first hidden state that is the last hidden state of the decoder, calculates an attention weight including the tag context vector and a second hidden state that is the hidden state of the word vector input to the encoder, generates a context vector including the attention weight and the second hidden state, obtains a generation probability (Pgen) using the context vector, the first hidden state, and the word vector, and generates or extracts words according to a preset reference probability value; and an output unit that extracts a sentence using the words generated or extracted by the generation unit.

By considering the contextual elements of words and the characteristics unique to Korean, and thereby defining the relationships between words more clearly, a document summarization system and method that derive a summary can be provided.

FIG. 1 is a block diagram of a document summarization system according to an embodiment.
FIG. 2 is a diagram showing morpheme tags and entity name tags attached to text in the embedding unit according to an embodiment.
FIG. 3 is a conceptual diagram showing examples of entity name tag types according to an embodiment.
FIG. 4 is a detailed diagram of the generation unit according to an embodiment.
FIG. 5 is a flowchart of a document summarization method according to an embodiment.
FIG. 6 is a flowchart of the process of generating a tag context vector according to an embodiment.
FIG. 7 is a flowchart of the process from context vector generation to sentence generation according to an embodiment.
FIG. 8 is a flowchart of the sentence generation process in the generation unit according to an embodiment.
FIG. 9 is a diagram showing a news headline generated from a news article according to an embodiment.

Hereinafter, exemplary embodiments of the present invention, including the document summarization system and the summarization method, are described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote parts or components that perform substantially the same function.

FIG. 1 is a block diagram showing the overall system for document summarization according to an embodiment of the present invention.

Referring to FIG. 1, the document summarization system 1 may include an input unit 30, an embedding unit 10 that embeds text input from the input unit 30, a generation unit 20 that generates or extracts words, and an output unit 40 that extracts a sentence using the generated or extracted words.

The embedding unit 10 may apply the Skip-gram model among word embedding learning models. Embedding may additionally be performed using the Word2Vec or Doc2Vec method.
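As a rough illustration of what a Skip-gram model trains on, the sketch below builds (center, context) pairs from a sequence of tagged tokens. The function name `skipgram_pairs`, the window size, and the sample tokens with their tags are hypothetical and are not taken from the patent; the patent does not specify how training pairs are formed.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for a Skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical tokens in "surface/POS tag/entity tag" form (illustrative only).
tokens = ["Seoul/NNP/LC", "eseo/JKB/O", "hoeui/NNG/O"]
pairs = skipgram_pairs(tokens, window=1)
```

Treating each tagged token as a single vocabulary item is one possible way to let the tags influence the learned embedding; other encodings are equally plausible.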

In addition, the generation unit 20 may include an encoder 21, which receives the at least one word vector derived by the embedding unit 10, and a decoder 22, into which the word vector is re-input.

In addition, the encoder 21 may be a bidirectional RNN, and the decoder 22 may be a unidirectional RNN.

The output unit 40 may extract a sentence containing at least one word.

In addition, the generation unit 20 and the output unit 40 may form a document summarization system based on a Pointer-Generator Network (PG-Net) model that includes a coverage mechanism.

The form in which text input through the input unit 30 is embedded in the embedding unit 10 is shown in FIGS. 2 and 3.

FIG. 2 shows morpheme tags and entity name tags attached to text in the embedding unit 10, and FIG. 3 shows examples of entity name tag types. The tag types in FIG. 3 are only examples for explaining the present invention, and the invention is not limited to them. The morpheme and entity name tags used in FIGS. 2 and 3 may also be defined differently depending on the user of the system. For example, the PS tag used in the present invention, an abbreviation of Person denoting a person's name, may be replaced with a PP tag, an abbreviation of People denoting a person.

Referring to FIG. 2, the text of a document may be morphologically analyzed and segmented into morphemes, and a part-of-speech tag corresponding to each morpheme may be attached (S101). Referring to FIGS. 2 and 3, an entity name tag corresponding to each morpheme may then be attached (S102) using the entity name tag types. Accordingly, one word may be defined as the combination of the morpheme tag and entity name tag of its root with the morpheme tag and entity name tag of its affix.
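The two tagging steps (S101 and S102) can be pictured with a small data structure. This is a minimal sketch under stated assumptions: the `TaggedMorpheme` class, the two lookup tables, and the sample morphemes are hypothetical stand-ins for a real morphological analyzer and entity recognizer, and the "O" tag for "no entity" is a common convention rather than something the patent states.

```python
from dataclasses import dataclass


@dataclass
class TaggedMorpheme:
    surface: str     # morpheme text
    pos_tag: str     # part-of-speech tag attached in S101
    entity_tag: str  # entity name tag attached in S102 ("O" = no entity, assumed)


# Hypothetical lookup tables standing in for real analyzers.
POS_LEXICON = {"Hong Gil-dong": "NNP", "i": "JKS"}
ENTITY_LEXICON = {"Hong Gil-dong": "PS"}


def tag_word(morphemes):
    """Attach a POS tag and an entity name tag to each morpheme of one word."""
    return [
        TaggedMorpheme(m, POS_LEXICON.get(m, "UNK"), ENTITY_LEXICON.get(m, "O"))
        for m in morphemes
    ]


# One word split into a root and an affix (illustrative segmentation).
word = tag_word(["Hong Gil-dong", "i"])
```

Here the root carries the PS (person name) entity tag mentioned in the description, while the affix carries only a part-of-speech tag.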

The detailed configuration and operation of the generation unit 20 are described below with reference to FIG. 4.

FIG. 4 is a diagram showing the configuration of the generation unit 20 in detail.

The generation unit 20 may include a preprocessing unit 26, a determination unit 27, and a summary unit 28.

The preprocessing unit 26 calculates the generation probability (Pgen) of a word. To do so, it may generate the tag context vector and the context vector and use the context vector, the first hidden state, and the word vector. The terms defined in the present invention, including the first hidden state and the word vector, are described in detail below with reference to FIGS. 5 to 8.

The determination unit 27 decides whether a word is to be generated or extracted. It may compare the generation probability with a preset reference probability value and pass a word-generation or word-extraction command to the summary unit 28.

The summary unit 28 may generate or extract a word according to the decision of the determination unit 27. A sentence may then be generated using the generated or extracted words.

The document summarization system has been described above. The document summarization method is described below with reference to FIGS. 5 to 8.

The terms defined in the present invention are explained first, followed by the description of FIGS. 5 to 8.

First, the word vector may be defined as the sum of a vector in which the morpheme tag vector and the entity name tag vector of the root are combined and a vector in which the morpheme tag vector and the entity name tag vector of the affix are combined. The vector combining the morpheme tag vector and the entity name tag vector of the root may be defined as the first vector, and the vector combining the morpheme tag vector and the entity name tag vector of the affix may be defined as the second vector.
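The definition of the word vector can be sketched as follows. The description says the tag vectors are "combined" without naming the operation, so concatenation is an assumption here, as are the function names `combine` and `word_vector` and the toy values; the final element-wise sum follows the "sum of" wording above.

```python
def combine(morpheme_tag_vec, entity_tag_vec):
    """Combine two tag vectors; concatenation is an assumed choice."""
    return list(morpheme_tag_vec) + list(entity_tag_vec)


def word_vector(root_morph, root_entity, affix_morph, affix_entity):
    """Sum the root-side (first) vector and the affix-side (second) vector."""
    first = combine(root_morph, root_entity)      # first vector (root)
    second = combine(affix_morph, affix_entity)   # second vector (affix)
    if len(first) != len(second):
        raise ValueError("first and second vectors must have the same length")
    return [a + b for a, b in zip(first, second)]


# Toy 2-dimensional morpheme tag vectors and 1-dimensional entity tag vectors.
vec = word_vector([0.5, 1.0], [2.0], [0.25, 0.0], [1.0])
```

With these toy inputs the first vector is `[0.5, 1.0, 2.0]`, the second is `[0.25, 0.0, 1.0]`, and the word vector is their element-wise sum.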

In addition, the first hidden state may be defined as the last hidden state of the decoder 22.

In addition, the tag context vector may be defined as including the word vector and the first hidden state.

In addition, the second hidden state may be defined as the hidden state of the word vector input to the encoder 21.

The terms described above were defined specifically for ease of understanding of the present invention and may be redefined as different corresponding terms depending on the inventor.

FIG. 5 is a flowchart showing the document summarization method.

Text may be input (S100), morpheme and entity name tag embedding (S110) may be applied to the text, and a tag context vector may then be generated (S120). A context vector is next generated using the tag context vector (S130), the generation probability is calculated and whether to generate or extract is determined (S140), and finally a sentence is generated (S150) from the words generated or extracted according to that determination.

In the morpheme and entity name tag embedding step (S110), each word may be divided into a root and an affix, and the part of speech and the entity name corresponding to the root and the affix may be determined. The embedding was described above with reference to FIGS. 2 and 3 in the description of the document summarization system, so further description is omitted.

The step of generating the tag context vector (S120) is described in detail below with reference to FIG. 6 and the equation.

FIG. 6 is a flowchart showing the process of generating the tag context vector.

After the at least one embedded word vector is input to the encoder 21 (S23), the word vector may be re-input to the decoder (S24). Next, it may be determined whether the first word vector among the at least one word vector input to the decoder 22 is a word (S25). If the first word vector is a word, a START tag is inserted into the decoder 22 (S26), and the word is extracted and the first hidden state is calculated (S27) to generate the tag context vector (S120).

In addition, the method may further include inputting the word vectors into the encoder 21 for training.

In addition, the tag context vector may be derived using Equation 1.

[Equation 1]

The symbols of Equation 1 denote, in order: all tag information of each word vector input to the decoder 22 at time t; the last hidden state (first hidden state) of the decoder 22 at time t-1; and the transpose.

In addition, when the first word vector among the at least one word vector input to the decoder is not a word, the tag context vector may be generated without the calculation of Equation 1.

The step of generating the context vector (S130) is described in detail below with reference to FIG. 7 and the equations.

FIG. 7 is a flowchart showing the process from context vector generation to sentence generation.

In the step of generating the context vector (S130), an attention weight including the tag context vector and the second hidden state is calculated (S121), and a context vector including the attention weight and the second hidden state is generated (S130).

In addition, the context vector may be derived (S130) through the calculations of Equations 2 to 5.

[Equations 2 to 5]

The symbols of Equations 2 to 5 denote, in order: the second hidden state; the collection of attention weight values for each second hidden state; the attention distribution; and the context vector.
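The softmax and weighted-sum steps of the context vector calculation can be sketched as below. This is a minimal sketch, not the patent's Equations 2 to 5: the attention scores are simply passed in as numbers, because how the patent derives them from the tag context vector and the second hidden states is not reproduced in this text. The names `softmax` and `context_vector` and the toy values are illustrative.

```python
import math


def softmax(scores):
    """Turn raw attention scores into an attention distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, encoder_states):
    """Weighted sum of encoder hidden states under the attention distribution."""
    attn = softmax(scores)
    dim = len(encoder_states[0])
    ctx = [sum(a * h[d] for a, h in zip(attn, encoder_states)) for d in range(dim)]
    return attn, ctx


# Two encoder positions with equal scores and 2-dimensional hidden states.
attn, ctx = context_vector([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With equal scores, both positions receive the same attention, so the context vector is the average of the two hidden states.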

The step of calculating the generation probability and determining whether to generate or extract (S140) is described in detail below with reference to FIGS. 7 and 8 and the equations.

FIG. 8 is a flowchart showing the sentence generation process in the generation unit.

First, in the step of calculating the generation probability (Pgen) (S141), the probability may be derived using the context vector, the first hidden state, and the word vector.

In addition, the generation probability (Pgen) may range from 0.0 to 1.0.

In addition, the generation probability (Pgen) may be calculated (S141) using Equation 6.

[Equation 6]

The symbols of Equation 6 denote, in order: the context vector; the last hidden state (first hidden state) of the decoder 22 at time t; the word vector input to the decoder 22 at time t; a scalar value; and the weights. The generation probability (Pgen) is obtained using a sigmoid function, and each of the weights and the scalar value may be learnable parameters.
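For orientation, the original pointer-generator network formulation (See et al., 2017) computes the generation probability from the same three inputs with a sigmoid. Whether the patent's Equation 6 takes exactly this form is an assumption, since the equation is not reproduced in this text; the symbol names below are those of the published model, not of the patent.

```latex
p_{\mathrm{gen}} = \sigma\!\left( w_{h^*}^{\top} h_t^{*} + w_{s}^{\top} s_t + w_{x}^{\top} x_t + b_{\mathrm{ptr}} \right)
```

In this notation, $h_t^{*}$ is the context vector, $s_t$ is the decoder state, $x_t$ is the decoder input, $w_{h^*}$, $w_{s}$, and $w_{x}$ are learnable weight vectors, $b_{\mathrm{ptr}}$ is a learnable scalar, and $\sigma$ is the sigmoid function, which keeps the result between 0 and 1.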

Next, in the step of generating or extracting a word according to the preset reference probability value (S142 and S143), when the generation probability (Pgen) is greater than or equal to the preset reference probability value, a word with a high weight in the constructed word dictionary may be generated as the next word. When the generation probability (Pgen) is less than the preset reference probability value, a word with a high weight among the word vectors inserted into the encoder 21 may be extracted as the next word.

The preset reference probability value mentioned above may be 0.5. However, it may be set differently depending on the user.
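The threshold rule can be sketched as a small function. The name `choose_next_word`, the dictionary inputs, and the choice of the single highest-weight candidate are assumptions for illustration; the description only says that "a word with a high weight" is chosen.

```python
def choose_next_word(p_gen, vocab_weights, source_weights, threshold=0.5):
    """Generate from the word dictionary or extract from the source text.

    vocab_weights:  word -> weight over the constructed word dictionary
    source_weights: word -> weight over the words fed to the encoder
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0.0, 1.0]")
    if p_gen >= threshold:
        # Generation branch: take the highest-weight dictionary word (assumed).
        return "generate", max(vocab_weights, key=vocab_weights.get)
    # Extraction branch: take the highest-weight source word (assumed).
    return "extract", max(source_weights, key=source_weights.get)


vocab = {"market": 0.1, "rises": 0.9}
source = {"KOSPI": 0.7, "today": 0.3}
at_threshold = choose_next_word(0.5, vocab, source)
below_threshold = choose_next_word(0.2, vocab, source)
```

Because the comparison is "greater than or equal to", a generation probability exactly equal to the threshold falls on the generation side.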

In addition, whether to generate or extract (S142 and S143) may be determined by the calculation of Equation 7.

[Equation 7]

In Equation 7, when the word in question is an out-of-vocabulary (OOV) word, one term of the equation has the value 0, and when the word does not appear in the input text, a second term may have the value 0.
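The two zero-valued cases match the behavior of the final word distribution in the published pointer-generator network, which mixes a vocabulary distribution with the attention placed on source words. The sketch below implements that published mixture; treating it as equivalent to the patent's Equation 7 is an assumption, and the function name and toy values are illustrative.

```python
def final_distribution(p_gen, vocab_dist, source_tokens, attention):
    """Mix the vocabulary distribution with attention over source tokens.

    A word missing from vocab_dist (OOV) gets no vocabulary mass, and a word
    missing from source_tokens gets no copy mass.
    """
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for token, a in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * a
    return final


dist = final_distribution(
    p_gen=0.5,
    vocab_dist={"the": 0.75, "cat": 0.25},
    source_tokens=["cat", "Gildong"],  # "Gildong" is OOV in this toy example
    attention=[0.5, 0.5],
)
```

In this toy case "the" receives only vocabulary mass, "Gildong" receives only copy mass, and "cat" receives both, so an OOV word from the source text can still be produced.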

The step of generating the sentence (S150) is described in detail below with reference to FIG. 7.

In the step of generating the sentence (S150), when the last word vector is an END tag, a sentence may be generated (S150) using the generated or extracted words. When the last word vector is not an END tag, the process may return to the decoder 22 (S144) and repeat the sentence generation process.
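The repeat-until-END control flow can be sketched as a simple loop. The `step_fn` callback, the `"<END>"` marker string, the `max_steps` safety limit, and the scripted `toy_step` decoder are all hypothetical; a real decoder step would compute the next word from the model state.

```python
END = "<END>"  # assumed string form of the END tag


def generate_sentence(step_fn, max_steps=50):
    """Call the decoder step repeatedly until it emits the END tag."""
    words = []
    state = None
    for _ in range(max_steps):  # safety limit, not part of the description
        word, state = step_fn(words, state)
        if word == END:
            break  # END reached: assemble the sentence
        words.append(word)  # otherwise go back to the decoder (S144)
    return " ".join(words)


def toy_step(words, state):
    """Scripted stand-in for one decoder step (illustrative only)."""
    script = ["stocks", "rise", END]
    return script[len(words)], state


sentence = generate_sentence(toy_step)
```

The loop collects each generated or extracted word and joins them into the output sentence once the END tag appears.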

FIG. 9 shows a news headline generated from a news article by the document summarization system and summarization method according to an embodiment.

However, FIG. 9 is merely one embodiment for carrying out the present invention. Depending on the user, the news article may be replaced with another corresponding text, and the derived sentence (headline) may differ.

1: document summarization system
10: embedding unit
20: generation unit
21: encoder
22: decoder
26: preprocessing unit
27: determination unit
28: summary unit
30: input unit
40: output unit

Claims

A document summarization method using an embedding unit that analyzes and embeds input text, a generation unit that generates or extracts words, and an output unit that extracts a sentence, the method comprising:
performing, by the embedding unit, morphological analysis on the input text and embedding each word with a morpheme tag and an entity name tag attached;
inputting, by the generation unit, at least one embedded word vector into an encoder;
re-inputting, by the generation unit, the at least one input word vector into a decoder, and generating a tag context vector including the word vector and a first hidden state that is the last hidden state of the decoder;
calculating, by the generation unit, an attention weight including the tag context vector and a second hidden state that is the hidden state of the word vector input to the encoder, and generating a context vector including the attention weight and the second hidden state; and
obtaining, by the generation unit, a generation probability (Pgen) using the context vector, the first hidden state, and the word vector, and generating a sentence by generating or extracting words according to a preset reference probability value,
wherein, in the generating of the tag context vector, the tag context vector is generated by the equation
[Equation]
(symbols, in order: all tag information of the i-th word vectors input to the decoder at time t; the last hidden state (first hidden state) of the decoder at time t-1; T: transpose; t: time t), and
wherein, in the generating of the context vector, the attention weight is calculated by the equation
[Equation]
(symbols, in order: the collection of attention weight values for each second hidden state; all tag information of the i-th word vectors input to the decoder at time t; the last hidden state (first hidden state) of the decoder at time t-1; the second hidden state of the n-th input word; T: transpose; t: time t).

The document summarization method of claim 1,
wherein the embedding step divides the word into a root and an affix and determines a part of speech and an entity name corresponding to the root and the affix.

The document summarization method of claim 1,
wherein the word vector is a vector value including a first vector, in which the morpheme tag vector and the entity name tag vector of the root are combined, and a second vector, in which the morpheme tag vector and the entity name tag vector of the affix are combined.

The document summarization method of claim 1, further comprising:
training by inputting the word vectors into the encoder.

The document summarization method of claim 1,
wherein the step of generating the tag context vector comprises:
when the first word vector among the at least one word vector input to the decoder is a word, inserting a START tag into the decoder, extracting the word, and calculating the first hidden state to generate the tag context vector; and
when the first word vector among the at least one word vector input to the decoder is not a word, generating the tag context vector without performing the calculation process.

The document summarization method of claim 1,
wherein the step of generating the context vector comprises calculating the attention weight collection value of each second hidden state, applying a softmax function to the attention weight collection values to calculate an attention distribution, and calculating the context vector as the weighted sum of the hidden states of the encoder under the attention distribution.
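
The three steps in this claim (scores, softmax, weighted sum) can be sketched as below. This is a minimal illustration under stated assumptions, not the patented implementation: the attention weight collection values are passed in as plain numbers because the scoring formula of claim 1 is not reproduced in this text, and the function names softmax and context_vector are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, encoder_states):
    """Return the attention distribution and the weighted sum of encoder hidden states."""
    attention = softmax(scores)
    dim = len(encoder_states[0])
    context = [0.0] * dim
    for a, h in zip(attention, encoder_states):
        for k in range(dim):
            context[k] += a * h[k]
    return attention, context


# Three encoder positions with equal scores, so each gets the same attention.
scores = [0.0, 0.0, 0.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention, context = context_vector(scores, encoder_states)
```

With equal scores the attention distribution is uniform, so the context vector is simply the mean of the encoder hidden states.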

The document summarization method of claim 1,
wherein, in the step of generating or extracting the word:
the generation probability (Pgen) ranges from 0.0 to 1.0;
when the generation probability (Pgen) is greater than or equal to the preset reference probability value, a word having a high weight in a constructed word dictionary is generated as the next word; and
when the generation probability (Pgen) is smaller than the preset reference probability value, a word having a high weight among the word vectors input to the encoder is extracted as the next word.
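
The threshold rule in this claim can be sketched as follows. The claim speaks of "a word having a high weight"; picking the single highest-weighted word is an assumption made here for simplicity, and the function name next_word and its dictionary-based arguments are hypothetical.

```python
def next_word(p_gen, threshold, vocab_weights, source_weights):
    """Choose the next word by comparing p_gen with a preset reference value.

    vocab_weights:  word -> weight over the constructed word dictionary.
    source_weights: word -> weight over the words input to the encoder.
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0.0, 1.0]")
    if p_gen >= threshold:
        # Generate: take the highest-weighted dictionary word (assumed argmax).
        return "generate", max(vocab_weights, key=vocab_weights.get)
    # Extract: take the highest-weighted word from the encoder input (assumed argmax).
    return "extract", max(source_weights, key=source_weights.get)


vocab_weights = {"summary": 0.9, "text": 0.1}
source_weights = {"Hanyang": 0.8, "university": 0.2}
generated = next_word(0.7, 0.5, vocab_weights, source_weights)
extracted = next_word(0.3, 0.5, vocab_weights, source_weights)
```

Note that a generation probability exactly equal to the reference value falls on the "generate" side, as the claim specifies "greater than or equal to".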

The document summarization method of claim 1,
wherein the step of generating the sentence comprises:
when the last word vector is an END tag, generating the sentence using the generated or extracted words; and
when the last word vector is not an END tag, returning to the decoder and repeating the sentence generation process.
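
The loop implied by this claim can be sketched as below. The marker string "<END>", the step-function interface, and the max_steps safety cap are assumptions added for the example; the claim itself only states that generation repeats until an END tag is reached.

```python
END = "<END>"  # hypothetical spelling of the END tag


def build_sentence(step_fn, max_steps=50):
    """Call step_fn (one decoder step) until it returns END, then join the words."""
    words = []
    for _ in range(max_steps):  # safety cap, not part of the claim
        word = step_fn(words)
        if word == END:
            break
        words.append(word)
    return " ".join(words)


def toy_step(words_so_far):
    """Stand-in for a decoder step: replays a fixed script, one word per call."""
    script = ["documents", "are", "summarized", END]
    return script[len(words_so_far)]


sentence = build_sentence(toy_step)
```

In a real system the step function would wrap the decoder, the attention computation, and the generate-or-extract decision; here it is replaced by a scripted stub so the control flow can be run in isolation.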

A document summarization system comprising:
an embedding unit that morphologically analyzes input text and embeds it by attaching a morpheme tag and an entity name tag to each word;
a generator in which at least one embedded word vector is re-input to a decoder through an encoder, and which generates a tag context vector including the word vector and a first hidden state that is the last hidden state of the decoder, calculates an attention weight including the tag context vector and a second hidden state that is the hidden state of a word vector input to the encoder, generates a context vector including the attention weight and the second hidden state, obtains a generation probability (Pgen) using the context vector, the first hidden state, and the word vector, and generates or extracts a word according to a preset reference probability value; and
an output unit that extracts a sentence using the word generated or extracted by the generator,
characterized in that the generator generates the tag context vector by the formula
[formula image not reproduced]
(where the symbols, also not reproduced, denote: all tag information of the i-th word vectors input to the decoder at time t; and the last hidden state (first hidden state) of the decoder at time t-1; T: transpose; t: time t), and
the generator calculates the attention weight by the formula
[formula image not reproduced]
(where the symbols, also not reproduced, denote: the attention weight collection value of each second hidden state; all tag information of the i-th word vectors input to the decoder at time t; the last hidden state (first hidden state) of the decoder at time t-1; and the second hidden state of the n-th input word; T: transpose; t: time t).

The document summarization system of claim 9,
wherein the embedding unit performs the embedding using a Word2Vec or Doc2Vec method.

The document summarization system of claim 9,
wherein the generator includes the encoder and the decoder, and further comprises:
a preprocessor that generates the tag context vector and the context vector, and calculates the generation probability (Pgen) using the context vector, the first hidden state, and the word vector;
a determining unit that determines whether to generate or extract a word by comparing the generation probability with the preset reference probability value; and
a summary unit that generates or extracts a word according to the determination of the determining unit.

The document summarization system of claim 9,
wherein the encoder is a bidirectional RNN, and
the decoder is a unidirectional RNN.
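
To illustrate the difference between the two directions named in this claim, the sketch below runs a toy scalar recurrence forward and backward over the same input and pairs the results. The tanh cell, the fixed weights w_in and w_rec, and the function names rnn_pass and bidirectional_states are assumptions for illustration; the claim does not specify the cell type or its parameters.

```python
import math


def rnn_pass(inputs, w_in=0.5, w_rec=0.5):
    """Unidirectional pass: each state depends only on the current and earlier inputs."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_states(inputs):
    """Pair a forward state and a backward state for every input position."""
    forward = rnn_pass(inputs)
    backward = rnn_pass(inputs[::-1])[::-1]  # run right-to-left, then realign
    return [(f, b) for f, b in zip(forward, backward)]


states = bidirectional_states([1.0, 0.0, -1.0])
```

Each encoder position thus sees context from both sides, whereas a decoder that produces words one at a time can only use a single pass such as rnn_pass.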

(Deleted)