KR20200029930A - System and method for translating based on context

Info

Publication number: KR20200029930A
Application number: KR1020180108539A
Authority: KR
Inventors: 최규현
Original assignee: 한국전자통신연구원
Priority date: 2018-09-11
Filing date: 2018-09-11
Publication date: 2020-03-19

Abstract

According to the present invention, provided is a context-based translation system which comprises: an embedding construction unit receiving context corpus data and constructing a context information embedding vector; a context information extraction unit extracting context information based on the constructed context information embedding vector; and a translation sentence generation unit generating a language model based on the extracted context information and applying the language model to a translation model to generate translation sentences.

Description

Context-Based Translation System and Method {SYSTEM AND METHOD FOR TRANSLATING BASED ON CONTEXT}

The present invention relates to a context-based translation system and method, and more particularly to a word embedding method that turns the context information of a specific domain into concrete data, and to a context-based translation system using that method.

Translation results in existing automatic interpretation and translation technology depend on the information in the data used for training.

In such prior-art automatic interpretation and translation systems, when a translation needs to follow the flow of the context, words that do not fit the situation are often selected.

For this reason, conventional automatic interpretation and translation technology produces inappropriate translations that do not match the flow of the conversation, which leads to communication errors.

Recently, one proposed remedy has been to collect a large volume of training data that contains context information and use it for training. With this approach, however, it is difficult to determine how much weight the automatic interpretation and translation system should give to the context-bearing training data. As a result, translation results that fit the context flow cannot be predicted, and accurate communication becomes impossible.

An embodiment of the present invention provides a context-based translation system and method that generate a translation result by applying a domain context information embedding construction method, so that the context information contained in utterances produced in a specific domain situation is reflected in the translation result.

However, the technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, a context-based translation system according to a first aspect of the present invention includes: an embedding construction unit that receives context corpus data and constructs a context information embedding vector; a context information extraction unit that extracts context information based on the constructed context information embedding vector; and a translation sentence generation unit that generates a language model based on the extracted context information and applies the language model to a translation model to generate a translated sentence.

A context-based translation method according to a second aspect of the present invention includes: receiving context corpus data; constructing a context information embedding vector based on the context corpus data; extracting context information by quantifying the context flow through a vector operation technique applied to the constructed context information embedding vector together with the context information of previous utterances; generating a language model based on the extracted context information; and generating a translation result by applying the language model to a translation model.

According to any one of the above means, an embedding vector separate from the pre-trained translation model is constructed from a bilingual corpus rich in context information, so that a translation result clearly reflecting the context flow information can be generated.

In addition, the constructed embedding vector can be supplied additionally to an external translation model. Because the externally constructed context information is reflected directly at translation time, this avoids the problem that arises when such information is merged into the training data and learned all at once, namely that a specific piece of context information fails to surface because it is overshadowed by other context information.

In addition, the extra time needed for training is reduced, and only context information related to the topic of the utterance needs to be collected to create a single embedding vector space, so the approach is simpler and more effective than conventional methods.

FIG. 1 is a block diagram of a context-based translation system according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining an embedding construction model.
FIG. 3 is a diagram for explaining a context information extraction model.
FIG. 4 is a diagram for explaining how a translated sentence is generated by reflecting context information.
FIG. 5 is a flowchart of a context-based translation method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The invention may, however, be implemented in many different forms and is not limited to the embodiments described here. Parts unrelated to the description are omitted from the drawings so that the invention can be described clearly.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present invention relates to a context-based translation system (1) and method.

When people communicate, they do not hold conversations that are unrelated to the flow of the context. That is, they speak so that what came earlier in the conversation and what comes later are connected through the context.

Because human speech depends on the context and environment of the conversation, and its content is shaped by both, context information is important for communication.

With this in mind, the context-based translation system (1) and method according to the present invention can generate a translation result by applying a domain context information embedding construction method, so that the context information contained in utterances produced in a specific domain situation is reflected in the translation result.

In addition, the present invention can summarize and compress context information and realize it as a vector space, based on an embedding method that makes the context information contained in the context flow concrete.

The vector information generated in this way is used when the neural-network-based translation system (1) generates a translated sentence, so that the translation system (1) can produce a translation that fits the flow of the context.

A context-based translation system (1) according to an embodiment of the present invention is described below with reference to FIGS. 1 to 4.

FIG. 1 is a block diagram of the context-based translation system (1) according to an embodiment of the present invention.

The context-based translation system (1) according to an embodiment of the present invention includes an embedding construction unit (100), a context information extraction unit (200), and a translation sentence generation unit (300).

The embedding construction unit (100) receives context corpus data and constructs a context information embedding vector from it. The context information embedding vector constructed in this way is used as input to a previously implemented neural-network-based automatic interpretation model.

When the context information embedding vector is supplied as input to the decoder of the neural-network-based automatic interpretation model, which generates the translated words, the context information extraction unit (200) quantifies the context flow by applying a vector operation technique to the context information embedding vector together with the context information of previous utterances that have already been input to the automatic interpretation model. It thereby extracts specific context information, which can be used in generating translated words.

The translation sentence generation unit (300) generates a language model based on the extracted context information and applies the language model to a translation model to generate a translated sentence.

Embedding construction unit (100)

FIG. 2 is a diagram for explaining the embedding construction model.

The embedding construction unit (100) receives context corpus data and constructs a context information embedding vector. Corpora in which context information is clearly apparent, such as lecture subtitles, news articles, and paper abstracts, may be used as the context corpus data.

When such context corpus data is input, the embedding construction unit (100), which builds the word embedding vectors, converts the context corpus data into numerical vector values.

Here, the embedding construction unit (100) can use the word2vec technique, which turns words into vector values, to project the words into a single space through a classification process based on the relatedness of the context information embedding vectors. That is, the embedding construction unit (100) builds a word embedding that expresses how strongly a given word is related to other words.
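
The description names word2vec but does not give training details. The following is a minimal sketch of the underlying idea only: it substitutes simple co-occurrence counts for actual word2vec training, so that words from a context corpus land in one vector space in which relatedness can be measured. The toy corpus, the window size, and the names `build_embeddings` and `cosine` are illustrative assumptions and do not come from the specification.

```python
import math
from collections import defaultdict

def build_embeddings(sentences, window=2):
    """Map each word to a sparse co-occurrence vector (a stand-in for word2vec)."""
    vectors = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1.0
    return {w: dict(v) for w, v in vectors.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical context corpus (the specification suggests lecture subtitles,
# news articles, or paper abstracts; these four sentences are made up).
corpus = [
    "the protest movement marched through the city",
    "the protest rally grew into a national movement",
    "daily exercise keeps the body healthy",
    "the gym offers exercise classes for the body",
]

emb = build_embeddings(corpus)
sim_protest_movement = cosine(emb["protest"], emb["movement"])
sim_protest_exercise = cosine(emb["protest"], emb["exercise"])
print(sim_protest_movement > sim_protest_exercise)
```

In this toy space "protest" sits closer to "movement" than to "exercise", which is the kind of relatedness the word embedding is meant to expose. A real system would presumably train word2vec, or a comparable model, on a much larger domain corpus.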

Context information extraction unit (200)

FIG. 3 is a diagram for explaining the context information extraction model.

The context information extraction unit (200) extracts context information based on the context information embedding vector constructed by the embedding construction unit (100).

In an embodiment of the present invention, the translation model reflects context information when translating the speaker's source language into the target language in real time. To this end, the context information extraction unit (200) can extract context information using the context information embedding vector constructed by the embedding construction unit (100). That is, the context information extraction unit (200) extracts the context information contained in the context information embedding vector so that it is reflected when target-language output is generated later in the translation process.

The context information extraction unit (200) does not rely only on the word embedding vector built by the embedding construction unit (100). It also quantifies the context flow by applying a vector operation technique to the context information embedding vector together with the context information of previous utterances that have already been used as source-language input, so that specific context information is made concrete, extracted, and used when translated words are generated.

In this way, an embodiment of the present invention uses the context information intended by the current speaker together with the externally constructed context information, so that the word that should appear in the current context is selected appropriately.
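
The specification does not say which vector operations are used to quantify the context flow. As one possible reading, the sketch below averages the vectors of words from previous utterances, blends the result with a domain embedding vector through a weighted sum, and scores candidate words by cosine similarity to the blended context vector. The three-dimensional vectors, the weight `alpha`, and all function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized dense vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def combine(history_vec, domain_vec, alpha=0.5):
    """Weighted sum of the conversation-history vector and the domain embedding."""
    return [alpha * h + (1 - alpha) * d for h, d in zip(history_vec, domain_vec)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical embeddings; the axes loosely mean (politics, fitness, general).
embedding = {
    "protest": [0.9, 0.0, 0.1],
    "demonstration": [0.8, 0.1, 0.1],
    "movement": [0.7, 0.1, 0.2],
    "exercise": [0.0, 0.9, 0.1],
}

# Context carried by previous utterances (assumed already tokenized).
previous_utterance_words = ["protest", "demonstration"]
history = mean_vector([embedding[w] for w in previous_utterance_words])

# Assumed output of the embedding construction unit for the active domain.
domain_vector = [0.8, 0.0, 0.2]

context = combine(history, domain_vector, alpha=0.5)
scores = {w: cosine(context, embedding[w]) for w in ("movement", "exercise")}
best = max(scores, key=scores.get)
print(best)
```

Here the blended context vector points toward the political sense, so "movement" scores higher than "exercise". The choice of `alpha` corresponds to how much weight the speaker's own history receives relative to the externally built domain embedding, a balance the specification leaves open.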

Translation sentence generation unit (300)

FIG. 4 is a diagram for explaining how a translated sentence is generated by reflecting context information.

The translation sentence generation unit (300) generates a language model based on the context information extracted by the context information extraction unit (200) and applies the language model to the translation model to generate a translated sentence.

The translation sentence generation unit (300) uses a language model before generating a translated sentence. It creates a new language model from the extracted context information, and the translation model then uses this newly created language model to generate the translation.

Because the language model is generated from the extracted context information, the translation model can produce a translation that reflects the context information well and does not read awkwardly within the context flow.

Meanwhile, the context-based translation system (1) according to an embodiment of the present invention may be configured to include a communication module, a memory, and a processor.

The communication module connects to an internal network and transmits and receives data to and from external terminals and devices. The communication module may include a wireless communication module. The wireless communication module may be implemented with wireless LAN (WLAN), Bluetooth, HDR WPAN, UWB, ZigBee, Impulse Radio, 60GHz WPAN, Binary-CDMA, WiFi wireless USB technology, wireless HDMI technology, and the like.

The communication module may also include a wired communication module. The wired communication module may be implemented as a power line communication device, a telephone line communication device, cable home (MoCA), Ethernet, IEEE1294, an integrated wired home network, or an RS-485 control device.

The memory stores a program for generating a translated sentence based on context information. Here, the memory is a collective term for non-volatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices.

For example, the memory may include NAND flash memory such as a compact flash (CF) card, a secure digital (SD) card, a memory stick, a solid-state drive (SSD), and a micro SD card; magnetic computer storage devices such as a hard disk drive (HDD); and optical disc drives such as CD-ROM and DVD-ROM.

The processor executes the program stored in the memory so that the functions of the embedding construction unit (100), the context information extraction unit (200), and the translation sentence generation unit (300) are performed.

For reference, the components shown in FIGS. 1 to 4 according to an embodiment of the present invention may be implemented in software or in hardware such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC), and may perform predetermined roles.

However, 'components' are not limited to software or hardware; each component may be configured to reside in an addressable storage medium or configured to execute on one or more processors.

Thus, as an example, a component includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The components and the functions provided within them may be combined into a smaller number of components or further separated into additional components.

The method performed in the context-based translation system (1) according to an embodiment of the present invention is described below with reference to FIG. 5.

FIG. 5 is a flowchart of a context-based translation method according to an embodiment of the present invention.

In the context-based translation method according to an embodiment of the present invention, context corpus data is first received (S110), and a context information embedding vector is constructed based on the context corpus data (S120).

Next, context information is extracted by quantifying the context flow through a vector operation technique applied to the constructed context information embedding vector together with the context information of previous utterances (S130).

Next, a language model is generated based on the extracted context information (S140), and a translation result is generated by applying the language model to the translation model (S150).
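
The ordering of steps S110 to S150 can be sketched as a small pipeline skeleton. Only the sequence of steps comes from the description; the class name `ContextTranslationPipeline`, the callable signatures, and the stub functions are placeholders invented for illustration, and they do no real embedding or translation work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]

@dataclass
class ContextTranslationPipeline:
    """Wires the four processing steps together in the order S120 to S150."""
    build_embedding: Callable[[List[str]], Dict[str, Vector]]            # S120
    extract_context: Callable[[Dict[str, Vector], List[str]], Vector]    # S130
    build_language_model: Callable[[Vector], Dict[str, float]]           # S140
    translate: Callable[[str, Dict[str, float]], str]                    # S150

    def run(self, corpus: List[str], previous_utterances: List[str],
            source_sentence: str) -> str:
        # S110: the context corpus arrives as the `corpus` argument.
        embedding = self.build_embedding(corpus)
        context = self.extract_context(embedding, previous_utterances)
        language_model = self.build_language_model(context)
        return self.translate(source_sentence, language_model)

# Stub steps that only record the order in which they are called.
calls: List[str] = []

def build_embedding(corpus):
    calls.append("S120")
    return {"topic": [float(len(corpus))]}

def extract_context(embedding, previous_utterances):
    calls.append("S130")
    return [embedding["topic"][0] + len(previous_utterances)]

def build_language_model(context):
    calls.append("S140")
    return {"bias": context[0]}

def translate(source, language_model):
    calls.append("S150")
    return f"<translation of {source!r} with bias {language_model['bias']}>"

pipeline = ContextTranslationPipeline(
    build_embedding, extract_context, build_language_model, translate
)
result = pipeline.run(["doc one", "doc two"], ["previous utterance"], "source sentence")
print(calls)
print(result)
```

Keeping each step behind its own callable mirrors the statement that steps may be split, merged, or reordered depending on the implementation, since any one stage can be replaced without touching the others.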

Meanwhile, in the above description, steps S110 to S150 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. Some steps may be omitted as needed, and the order of the steps may be changed. In addition, even where not repeated here, the content already described with reference to FIGS. 1 to 4 also applies to the context-based translation method of FIG. 5.

Existing automatic interpretation and translation technology uses large amounts of training data to learn how to translate a source language into a target language. During training, word embedding information for the source and target languages is used to learn which source-language word translates to which target-language word, along with the grammar and context structure between words.

However, whether a model learns the linguistic rules well is determined by the quantity and quality of the training data. If the training data is unsuitable, the model is trained incorrectly and good translation results cannot be obtained.

Moreover, even a well-trained model assigns different weights to words according to how often each appeared in the training data. As a result, the word that should have been chosen for the translation may be passed over in favor of another.

For example, given the real-world utterance "이 운동을 시작했을 때" ("when I started this 운동"), whether 운동 should be understood as 'moving the body in order to train it' or as 'acting in an organized way to achieve a goal' is determined by the context information.

If the translation model was trained mainly on the former sense, it will translate 운동 as 'exercise'; if it was trained more on the latter sense, it will translate it as 'movement'.

In this example, if the preceding context concerned protests and demonstrations, 운동 should be translated as 'movement'. A model that has learned during training to translate 운동 as 'exercise', however, will fail to produce 'movement'.

To resolve this problem, an embodiment of the present invention constructs, from a bilingual corpus rich in context information, an embedding vector that is separate from the pre-trained translation model. The values of this embedding vector contain the context flow information that is clearly apparent in the corpus.

The constructed embedding vector can also be supplied additionally to an external translation model, which is a way of informing an existing model, from the outside, of context information it did not know.

Because the externally constructed context information is reflected directly at translation time, this avoids the problem that arises when such information is merged into the training data and learned all at once, namely that a specific piece of context information fails to surface because it is overshadowed by other context information.
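
To make the 운동 example concrete, the sketch below shows one way a context-derived language model could override a bias in the translation model. The specification does not state how the language model is applied; the log-linear combination used here is an assumption, as are the probabilities, the similarity scores, the scaling factor, and the function names.

```python
import math

def softmax(scores):
    """Normalize arbitrary scores into a probability distribution."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def rescore(tm_probs, lm_probs, lam=1.0):
    """Log-linear combination of translation-model and context language-model scores."""
    return {w: math.log(tm_probs[w]) + lam * math.log(lm_probs[w]) for w in tm_probs}

# Hypothetical translation-model output for the Korean word "운동":
# the model saw the fitness sense more often during training.
tm_probs = {"exercise": 0.7, "movement": 0.3}

# Hypothetical similarity of each candidate to the extracted context vector
# (the preceding conversation was about protests and demonstrations).
context_similarity = {"exercise": 0.05, "movement": 0.95}

# Context language model: similarities sharpened by an assumed factor of 5.
lm_probs = softmax({w: 5.0 * s for w, s in context_similarity.items()})

combined = rescore(tm_probs, lm_probs)
without_context = max(tm_probs, key=tm_probs.get)
with_context = max(combined, key=combined.get)
print(without_context, with_context)
```

With these made-up numbers the translation model alone prefers "exercise", while the context-aware combination selects "movement". Since the context term is applied at translation time rather than folded into the training data, the translation model itself stays unchanged, which matches the advantage described above.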

Meanwhile, an embodiment of the present invention may also be implemented as a computer program stored in a medium executed by a computer, or as a recording medium containing instructions executable by a computer. A computer-readable medium can be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. A computer-readable medium may also include both computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically include computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or other transport mechanisms, and include any information delivery media.

Although the method and system of the present invention have been described in connection with specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present invention is illustrative, and a person of ordinary skill in the art to which the invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

1: context-based translation system
100: embedding construction unit
200: context information extraction unit
300: translation sentence generation unit

Claims

A context-based translation system comprising:
an embedding construction unit that receives context corpus data and constructs a context information embedding vector;
a context information extraction unit that extracts context information based on the constructed context information embedding vector; and
a translation sentence generation unit that generates a language model based on the extracted context information and applies the language model to a translation model to generate a translated sentence.