KR102480235B1

KR102480235B1 - Device and system for deep learning-based optical character recognition

Info

Publication number: KR102480235B1
Application number: KR1020200152535A
Authority: KR
Inventors: 김연규; 이록규; 이혁재; 강민석
Original assignee: 엔에이치엔클라우드 주식회사
Priority date: 2020-11-16
Filing date: 2020-11-16
Publication date: 2022-12-22
Also published as: KR20220066475A

Abstract

A deep learning-based optical character recognition device according to an embodiment of the present invention includes at least one processor, at least one memory, and at least one program that is stored in the memory and executed by the at least one processor to perform deep learning-based optical character recognition. The at least one program includes: a first module that extracts feature information on text in an input image based on a deep learning neural network; a second module that generates initial encoding information having location information on the text based on the extracted feature information; a third module that performs a shape transformation based on the generated initial encoding information to obtain an attention map that retains two-dimensional information of the input image; and a fourth module that detects the text in the input image through a deep learning model based on the obtained attention map.

Description

Deep learning-based optical character recognition device and system thereof {DEVICE AND SYSTEM FOR DEEP LEARNING-BASED OPTICAL CHARACTER RECOGNITION}

The present invention relates to a deep learning-based optical character recognition (OCR) device and a system thereof.

More specifically, it relates to a deep learning-based OCR device and system that recognize characters in an image using a plurality of deep learning modules, each performing a different function.

Recently, with the rapid spread of digital storage media, documents that previously existed only on paper are being actively digitized.

This trend is accelerating further with advances in optical character recognition (OCR), a technology that automatically recognizes the characters contained in a document.

Specifically, OCR generally refers to a technology that recognizes characters, symbols, or marks printed or handwritten on paper or the like by optical means and converts them into computer text.

That is, OCR refers to the series of processes that convert characters contained in an image generated by an optical device, such as a scanner or camera, into text that can be edited on a digital device such as a computer.

When OCR is performed on a document image in which a text region coexists with other regions (for example, an image region and/or a background region), the two kinds of region must be accurately distinguished. With the prior art in this field alone, however, it is not easy to accurately identify the characteristics or location of the text and clearly separate the text region from the other regions.

In addition, characters in a document do not always have a regular form (for example, characters written along a straight line); they often have an irregular form, such as an arched or curved layout.

However, the prior art lacks a technique for quickly and accurately recognizing characters of various irregular forms, as well as regular ones, while effectively separating them from the regions other than the characters, and a new technique for this purpose is needed.

KR 10-2010-0044668 B1

The present invention has been devised to solve the problems described above, and its object is to provide a deep learning-based OCR device and system that recognize characters in an image using a plurality of deep learning modules performing different functions.

Specifically, an object of the present invention is to provide a deep learning-based OCR device and system that use a plurality of deep learning modules to obtain encoding information based on the features and location information of text in an image, and that recognize characters in the image based on the obtained encoding information.

However, the technical problems to be solved by the present invention and its embodiments are not limited to those described above, and other technical problems may exist.

Here, the first module uses a Resnet45-based deep learning neural network to extract, as the feature information, data on the height (H), width (W), and number of output channels (C) derived from the input image.

In addition, the two-dimensional information includes height (H) and width (W) information based on the input image.

In addition, the third module generates the attention map by performing the shape transformation, which removes artifacts in the initial encoding information.

In addition, the third module generates the attention map by performing a second-scheme dimension combination of first dimension information of the form '1 * number of output channels (C) * height (H) * width (W)', obtained from the first module, and second dimension information of the form 'maximum number of channels (T) * 1 * height (H) * width (W)', obtained from the second module.

In addition, the third module obtains third dimension information by combining the maximum-number-of-channels (T) and number-of-output-channels (C) dimensions based on the first and second dimension information, and obtains fourth dimension information by combining the height (H) and width (W) dimensions based on the obtained third dimension information.

In addition, the third module performs the second-scheme dimension combination, in which fifth dimension information is obtained by applying a dimension reduction process based on a 1x1 convolution layer (1x1 conv layer) to the obtained fourth dimension information.

In addition, by performing the second-scheme dimension combination, the third module obtains the attention map in the form 'number of output channels (C) * height (H) * width (W)', which retains the two-dimensional information including the height (H) and width (W) information.
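The shape bookkeeping described for the third module can be illustrated with a small sketch. This is not the patented implementation: the passage does not state which tensor operations realize each combination step, so the broadcast-then-merge reading, the function names, and the sample sizes (T=25, C=512, H=8, W=32) are all assumptions. Only the two input shapes and the final 'C * H * W' form come from the text.

```python
def broadcast_shape(a, b):
    """Broadcast two equal-rank shapes (axes of size 1 stretch to match)."""
    if len(a) != len(b):
        raise ValueError("shapes must have the same rank")
    out = []
    for x, y in zip(a, b):
        if x == y or y == 1:
            out.append(x)
        elif x == 1:
            out.append(y)
        else:
            raise ValueError(f"incompatible axes: {x} vs {y}")
    return tuple(out)


def merge_leading_axes(shape):
    """Fold the first two axes into one, e.g. (T, C, H, W) -> (T*C, H, W)."""
    return (shape[0] * shape[1],) + tuple(shape[2:])


def conv1x1_shape(shape, out_channels):
    """A 1x1 convolution changes only the channel axis; H and W are untouched."""
    return (out_channels,) + tuple(shape[1:])


def combine_shapes(T, C, H, W):
    first = (1, C, H, W)    # from the first module (stated in the text)
    second = (T, 1, H, W)   # from the second module (stated in the text)
    joint = broadcast_shape(first, second)   # assumed: (T, C, H, W)
    merged = merge_leading_axes(joint)       # assumed: (T*C, H, W)
    reduced = conv1x1_shape(merged, C)       # final form per the text: (C, H, W)
    return joint, merged, reduced


if __name__ == "__main__":
    for shape in combine_shapes(T=25, C=512, H=8, W=32):  # hypothetical sizes
        print(shape)
```

Under this reading, H and W pass through every step unchanged, which is the property the text emphasizes when it says the attention map retains the two-dimensional information.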

In addition, the fourth module is implemented based on the decoder of a Transformer model.

In addition, the fourth module performs deep learning with the decoder, using the attention map as input data, and detects the text in the input image based on that deep learning.

Meanwhile, a deep learning-based optical character recognition system according to an embodiment of the present invention includes: an encoder deep learning server including a first module that extracts feature information on text in an input image based on a deep learning neural network, and a second module that generates initial encoding information having location information on the text based on the extracted feature information; an encoding schema conversion server including a third module that performs a shape transformation based on the generated initial encoding information to obtain an attention map that retains the two-dimensional information of the input image as it is; and a decoder deep learning server including a fourth module that detects the text in the input image by performing deep learning based on the obtained attention map.

By recognizing characters in an image using a plurality of deep learning modules that perform different functions, the deep learning-based OCR device and system according to an embodiment of the present invention can obtain clear feature information and location information on the text in the image as it passes through each module. Performing character recognition on the basis of this information makes it possible to quickly and accurately recognize not only regularly shaped text but also irregularly shaped text, while effectively separating it from the non-text regions.

In addition, the device and system extract feature information and location information of text in an image using a plurality of deep learning modules and obtain initial encoding information (in the embodiment, a first attention map, or early attention map) from them. Decoding for character recognition is thus performed on encoded data that reflects the characteristics and position of each piece of text in the image, which improves the accuracy and performance of the OCR result.

In addition, the device and system apply a separate deconvolution (deconv) scheme to the initial encoding information obtained as above, that is, an encoding schema that minimizes the loss of the image's two-dimensional information. This yields encoding information (a second attention map) that retains as much of the two-dimensional information in the image as possible, minimizes artifacts in the encoding information, and thereby improves the quality of the subsequent character recognition result based on that encoding information.

In addition, the device and system implement the deep learning module that decodes the above encoding information as input data to recognize characters in the image on the basis of the decoder of a Transformer model. This provides a deep learning neural network for OCR that combines the advantages of the encoder of the present invention with those of the Transformer decoder.

However, the effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below.

FIG. 1 is a conceptual diagram of a deep learning-based optical character recognition (OCR) system according to an embodiment of the present invention.
FIG. 2 is an internal block diagram of a character recognition server according to an embodiment of the present invention.
FIG. 3 is an example of a diagram illustrating the internal operation of the first module (CNN-based feature extraction unit) according to an embodiment of the present invention.
FIG. 4 is an example of a diagram illustrating the internal operation of the second module (CNN-based alignment unit) according to an embodiment of the present invention.
FIG. 5 is an example of a diagram illustrating a method of combining the dimension data from the first module and the second module according to an embodiment of the present invention.
FIG. 6 is an example of a diagram illustrating a method of performing the first-scheme dimension combination based on the dimension data from the first module and the second module according to an embodiment of the present invention.
FIG. 7 is an example showing the artifacts of the encoding information (second attention map) minimized by the third module (first and second module combining unit) according to an embodiment of the present invention.
FIG. 8 is an example of a diagram illustrating a method of performing the second-scheme dimension combination based on the dimension data from the first module and the second module according to an embodiment of the present invention.
FIG. 9 is an example of a diagram illustrating the internal operation of the fourth module (CNN-based character recognition unit) according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. The effects and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various forms. In the following embodiments, terms such as first and second are not used in a limiting sense but to distinguish one component from another. A singular expression includes the plural unless the context clearly indicates otherwise. Terms such as "include" or "have" mean that the features or components described in the specification are present and do not preclude the addition of one or more other features or components. In the drawings, the sizes of components may be exaggerated or reduced for convenience of description. For example, the size and thickness of each component shown in the drawings are chosen arbitrarily for convenience of description, so the present invention is not necessarily limited to what is illustrated.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

FIG. 1 is a conceptual diagram of a deep learning-based optical character recognition (OCR) system according to an embodiment of the present invention.

Referring to FIG. 1, in an embodiment of the present invention, the deep learning-based OCR system may be implemented to include a character recognition server 100.

In the embodiment, the character recognition server 100 may include an encoder deep learning server 110, an encoding schema conversion server 120, and a decoder deep learning server 130, and these servers may be connected through a network.

Here, the network refers to a connection structure that allows the nodes, such as the encoder deep learning server 110, the encoding schema conversion server 120, and the decoder deep learning server 130, to exchange information with one another. Examples of such a network include, but are not limited to, a 3rd Generation Partnership Project (3GPP) network, a Long Term Evolution (LTE) network, a Worldwide Interoperability for Microwave Access (WiMAX) network, the Internet, a local area network (LAN), a wireless LAN, a wide area network (WAN), a personal area network (PAN), a Bluetooth network, a satellite broadcasting network, an analog broadcasting network, and a Digital Multimedia Broadcasting (DMB) network.

- Character Recognition Server (100)

FIG. 2 is an internal block diagram of the character recognition server 100 according to an embodiment of the present invention.

Referring to FIG. 2, in an embodiment of the present invention, the character recognition server 100 may provide a deep learning-based OCR service (hereinafter, the character recognition service). Using a plurality of deep learning modules that perform different functions, the service obtains encoding information (in the embodiment, an attention map) based on the features and location information of text in an input image, and recognizes the characters in the input image based on the obtained encoding information.

Specifically, in the embodiment, the character recognition server 100 may include, as the plurality of deep learning modules, at least one encoder deep learning server 110, at least one encoding schema conversion server 120, and at least one decoder deep learning server 130.

The embodiment is described on the basis that the character recognition server 100 includes at least one encoder deep learning server 110, at least one encoding schema conversion server 120, and at least one decoder deep learning server 130. Depending on the embodiment, however, various other configurations are possible; for example, at least one of the encoder deep learning server 110, the encoding schema conversion server 120, and the decoder deep learning server 130 may be implemented as a separate device.

In the embodiment, each of these deep learning modules may be implemented based on a deep learning neural network.

Here, the deep learning neural network may include a network such as a convolutional neural network (CNN), for example a U-Net convolutional neural network, and the embodiment of the present invention does not limit or restrict the topology of the deep learning neural network itself.

In addition, in the embodiment, each of these deep learning modules may be trained on a predetermined training data set.

For example, a deep learning module may be trained in a predetermined manner on a training data set consisting of images obtained by scanning a plurality of documents that contain characters in various forms.

Here, the predetermined manner may include Hebbian learning (Hebbian rule), the perceptron rule, gradient descent (delta rule, least mean squares), and/or backpropagation, and the embodiment of the present invention does not limit or restrict the training method of the deep learning neural network itself.
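As a reminder of what one of the listed training rules looks like, the following is a minimal sketch of a single delta-rule (least mean squares) update for one linear unit. It is a textbook illustration only; the function name, learning rate, and sample values are assumptions, and the patent does not tie the modules to this or any other specific rule.

```python
def lms_step(weights, bias, inputs, target, learning_rate=0.1):
    """One delta-rule update: move the weights along the error gradient."""
    prediction = sum(w * x for w, x in zip(weights, inputs)) + bias
    error = target - prediction
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * error
    return new_weights, new_bias, error


if __name__ == "__main__":
    weights, bias = [0.0, 0.0], 0.0
    for step in range(5):
        weights, bias, error = lms_step(weights, bias, [1.0, 2.0], target=1.0)
        print(step, round(error, 4))  # the error shrinks at every step
```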

Returning to the character recognition server 100: in the embodiment, the character recognition server 100, which includes the deep learning modules described above, may operate in conjunction with a character recognition deep learning neural network built from the trained deep learning neural networks of the individual modules.

In addition, in the embodiment, the character recognition server 100 may extract, through deep learning performed on the linked deep learning neural network, the feature information and location information of the text in the input image on which OCR is to be performed. The text may include irregular text, for example curved text.

In addition, the character recognition server 100 may obtain, through deep learning, initial encoding information based on the extracted information, that is, a first attention map (early attention map).

In addition, in the embodiment, the character recognition server 100 may perform, through deep learning, a shape transformation on the obtained initial encoding information (first attention map).

That is, the character recognition server 100 may perform an encoding schema conversion based on the initial encoding information.

Through this shape transformation, the character recognition server 100 may obtain the final encoding information to be input to the decoder, that is, a second attention map (attention map).

In addition, in the embodiment, the character recognition server 100 may recognize, through deep learning, the characters in the input image based on the obtained encoding information (second attention map). The characters may include irregular characters, for example curved characters.

Specifically, the character recognition server 100 may perform OCR that detects the characters in the input image by decoding the encoding information through deep learning.
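The data flow just described (feature extraction, early attention map, schema conversion, decoding) can be sketched as a simple composition of four stages. This is only an illustration of the ordering: `recognize_text` and its four callable parameters are hypothetical names, and the toy stand-ins below merely tag strings so the flow is visible; no real network is involved.

```python
from typing import Any, Callable


def recognize_text(
    image: Any,
    extract_features: Callable[[Any], Any],     # stand-in for the first module
    align: Callable[[Any], Any],                # stand-in for the second module
    convert_schema: Callable[[Any, Any], Any],  # stand-in for the third module
    decode: Callable[[Any], str],               # stand-in for the fourth module
) -> str:
    features = extract_features(image)                  # feature information
    early_attention_map = align(features)               # initial encoding information
    attention_map = convert_schema(features, early_attention_map)  # final encoding
    return decode(attention_map)                        # recognized text


if __name__ == "__main__":
    result = recognize_text(
        "image",
        extract_features=lambda img: f"features({img})",
        align=lambda f: f"early_map({f})",
        convert_schema=lambda f, m: f"attention_map({f}, {m})",
        decode=lambda a: f"text_from({a})",
    )
    print(result)
```

Passing the stages in as callables mirrors the text's point that each server may run on the same device or on separate ones.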

Hereinafter, each deep learning module of the character recognition server 100 is described in detail with reference to the accompanying drawings.

<Encoder Deep Learning Server>

In an embodiment of the present invention, the encoder deep learning server 110 may operate in conjunction with a deep learning neural network to extract the feature information and location information of the text in an input image. The text may include irregular text, for example curved text.

In addition, in the embodiment, the encoder deep learning server 110 may obtain initial encoding information, that is, a first attention map (early attention map), based on the extracted feature information and location information.

Specifically, in the embodiment, the encoder deep learning server 110 may be implemented based on an encoder deep learning neural network (hereinafter, the encoder).

Here, the encoder according to the embodiment may include a first module, the CNN-based feature extraction unit 111 (hereinafter, the feature extraction unit), and a second module, the CNN-based alignment unit 112 (hereinafter, the alignment unit).

FIG. 3 is an example of a diagram illustrating the internal operation of the first module (CNN-based feature extraction unit 111) according to an embodiment of the present invention.

Referring to FIG. 3, the feature extraction unit 111 (first module) of the encoder may extract feature information (FPN_Features) on the text in the input image based on a deep learning neural network.

Here, the feature information according to the embodiment means the data output, with a height (H), a width (W), and a number of output channels (C), when the input image is passed through the convolution layer blocks.

For example, the feature extraction unit 111 (first module) may perform deep learning that extracts the feature information on the text in the input image using a resnet45-based deep learning neural network composed of a plurality of convolution layers.

For the internal parameter design of the deep learning neural network of the feature extraction unit 111 (first module), refer to FIG. 3 in place of a written description.

도 4는 본 발명의 실시예에 따른 제 2 모듈(CNN 기반 정렬부(112))의 내부 동작방식을 설명하는 도면의 일례이다. 4 is an example of a diagram explaining an internal operation method of the second module (CNN-based alignment unit 112) according to an embodiment of the present invention.

한편, 도 4를 참조하면, 실시예에서 인코더의 정렬부(112: 제 2 모듈)는, 딥러닝 뉴럴 네트워크를 기반으로 특징 추출부(111: 제 1 모듈)로부터 출력되는 데이터(즉, 특징정보)에 기초하여 해당하는 입력 이미지 내 텍스트에 대한 위치정보를 가지는 초기 인코딩 정보 즉, 제 1 어텐션 맵(early attention map)을 획득할 수 있다. On the other hand, referring to FIG. 4, in the embodiment, the alignment unit 112 (second module) of the encoder, based on the deep learning neural network, outputs data from the feature extraction unit 111 (first module) (ie, feature information). ), it is possible to obtain initial encoding information, that is, a first attention map having location information on text in a corresponding input image.

In detail, in the embodiment the alignment unit (112: second module) may perform a first feed-forward process through first- to fourth-stage convolution layers on the output data of the feature extraction unit (111: first module).

Here, in the embodiment the first- to fourth-stage convolution layers may operate with batch normalization, and the initial hyper-parameter settings for the batch normalization may be 'c 64; eps 1e-05; momentum 0.1'.
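As a rough illustration of what the 'eps' and 'momentum' settings control, the sketch below normalizes a single channel of a mini-batch in plain Python. It assumes the convention used by common frameworks such as PyTorch, where eps is added to the variance for numerical stability and momentum is the weight given to the newest batch statistic when the running estimates are updated; the source does not spell this convention out. 'c 64' is read here as 64 channels, each normalized independently, and the function name batch_norm_channel is hypothetical.

```python
import math

def batch_norm_channel(values, running_mean, running_var,
                       eps=1e-05, momentum=0.1, gamma=1.0, beta=0.0):
    """Normalize one channel of a mini-batch (training-mode sketch).

    values: flat list of activations of a single channel across the batch.
    Returns (normalized_values, new_running_mean, new_running_var).
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    normalized = [gamma * (v - mean) / math.sqrt(var + eps) + beta
                  for v in values]
    # Assumed convention: momentum weights the *new* batch statistic.
    # The biased variance is reused here for brevity.
    new_mean = (1 - momentum) * running_mean + momentum * mean
    new_var = (1 - momentum) * running_var + momentum * var
    return normalized, new_mean, new_var

out, rm, rv = batch_norm_channel([1.0, 2.0, 3.0, 4.0],
                                 running_mean=0.0, running_var=1.0)
```

With a momentum of 0.1 the running statistics move only one tenth of the way toward each new batch, so they change slowly during training.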

In addition, the activation function applied to each of the first- to fourth-stage convolution layers may be a rectified linear unit (ReLU).

In addition, the first- to fourth-stage convolution layers may be configured as 'k 3x3; s 2x2; p 1x1; c 64'.
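Reading 'k 3x3; s 2x2; p 1x1' as kernel size, stride, and padding (an interpretation of the notation, not something the source states explicitly), each stage roughly halves the height and width of its input. The sketch below applies the standard convolution output-size formula four times; the 32x128 input size and the function names are hypothetical and chosen only for illustration.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def downsample_trace(height, width, stages=4):
    """Return the (H, W) size before the first stage and after each stage."""
    sizes = [(height, width)]
    for _ in range(stages):
        height = conv_output_size(height)
        width = conv_output_size(width)
        sizes.append((height, width))
    return sizes

# Hypothetical 32x128 feature map passed through the four stages.
trace = downsample_trace(32, 128)
# trace == [(32, 128), (16, 64), (8, 32), (4, 16), (2, 8)]
```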

Once the first feed-forward process through the first- to fourth-stage convolution layers has been performed on the output data (FPN_Features) of the feature extraction unit (111: first module) as described above, the alignment unit (112: second module) in the embodiment may perform a second feed-forward process based on first- to fourth-stage deconvolution layers.

In detail, in the embodiment the alignment unit (112: second module) may perform the second feed-forward process through the first- to fourth-stage deconvolution layers on the output data of the first- to fourth-stage convolution layers.

Here, in the embodiment the first- to fourth-stage deconvolution layers may operate with batch normalization, and the initial hyper-parameter settings for the batch normalization may be 'c 64; eps 1e-05; momentum 0.1'.

In addition, among the first- to fourth-stage deconvolution layers, the first- to third-stage deconvolution layers may use a rectified linear unit (ReLU) as the activation function, and the fourth-stage deconvolution layer may use a sigmoid function as the activation function.
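The two activation functions named above can be written out directly. One plausible reading, which the source does not state, is that the sigmoid on the last stage keeps every value of the resulting attention map strictly between 0 and 1 so that it can act as a per-position weight. A minimal sketch:

```python
import math

def relu(x):
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

samples = [-2.0, 0.0, 2.0]
relu_out = [relu(x) for x in samples]        # negatives clipped to 0
sigmoid_out = [sigmoid(x) for x in samples]  # each value lies in (0, 1)
```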

In addition, the first- to fourth-stage deconvolution layers may be configured as 'k 3x3; s 1x1; c 64'.

In addition, in the embodiment the first- to fourth-stage deconvolution layers may perform deep learning based on bilinear interpolation, in which case the scale value may be set to '2'.

In addition, for the first- to fourth-stage deconvolution layers, the padding layer may be set to 'reflectionPad2d: 1'.
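The source lists the per-stage settings but not the order in which they are applied. The sketch below assumes one common arrangement: bilinear upsampling by a factor of 2, then reflection padding of 1, then a 3x3 convolution with stride 1 and no further padding. Under that assumption the padding exactly offsets the shrinkage of the 3x3 kernel, so each stage doubles the height and width. The helper names and the 2x8 starting size are hypothetical.

```python
def reflect_pad_row(row, pad=1):
    """Reflection padding of a 1D row: mirror around the edge elements
    without repeating them (requires pad < len(row))."""
    left = row[pad:0:-1]
    right = row[-2:-pad - 2:-1]
    return left + row + right

def deconv_stage_size(size, scale=2, reflection_pad=1, kernel=3, stride=1):
    """Spatial size after upsample -> reflection pad -> conv (assumed order)."""
    upsampled = size * scale
    padded = upsampled + 2 * reflection_pad
    return (padded - kernel) // stride + 1

def upsample_trace(height, width, stages=4):
    """Return the (H, W) size before the first stage and after each stage."""
    sizes = [(height, width)]
    for _ in range(stages):
        height = deconv_stage_size(height)
        width = deconv_stage_size(width)
        sizes.append((height, width))
    return sizes

padded = reflect_pad_row([1, 2, 3])   # [2, 1, 2, 3, 2]
trace = upsample_trace(2, 8)          # ends at (32, 128)
```

Four such stages undo the fourfold halving of a stride-2 convolution stack, which is consistent with the attention map sharing the H and W of the feature information.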

That is, in the embodiment the alignment unit (112: second module) may take the output data (FPN_Features) of the feature extraction unit (111: first module) as input data and perform deep learning based on the first- to fourth-stage convolution and deconvolution layers of a deep learning neural network implemented as described above, thereby obtaining the initial encoding information (first attention map) for the text in the corresponding input image.

FIG. 5 is an example of a diagram for explaining a method of combining dimension data from the first module and the second module according to an embodiment of the present invention, and FIG. 6 is an example of a diagram for explaining a method of performing a first mode of dimensional combining based on the dimension data from the first module and the second module according to an embodiment of the present invention.

Meanwhile, referring to FIGS. 5 and 6, in the embodiment the encoder may perform the first mode of dimensional combining based on the dimension data from the feature extraction unit (111: first module) and the dimension data from the alignment unit (112: second module).

In detail, in the embodiment the feature extraction unit (111: first module) may perform deep learning and output the input image with the dimensions '1 * number of output channels (C) * height (H) * width (W)'.

That is, the feature extraction unit (111: first module) may obtain first dimension information by converting the dimensions of its input data (in the embodiment, the input image or the like) into '1*C*H*W'.

Meanwhile, in the embodiment the alignment unit (112: second module) may perform deep learning and output, for the input image, the dimensions 'maximum number of channels (T) * 1 * height (H) * width (W)'.

That is, the alignment unit (112: second module) may obtain second dimension information by converting the dimensions of its input data (in the embodiment, the feature information or the like) into 'T*1*H*W'.

In addition, in the embodiment the encoder, which includes the feature extraction unit (111: first module) and the alignment unit (112: second module), may perform 1) dimensional combining of the height (H) and width (W) information in the encoder, based on the first dimension information and the second dimension information obtained as described above.

In detail, in the embodiment the encoder may perform dimensional combining of the height (H) and width (W) information based on the first dimension information '1*C*H*W' and the second dimension information 'T*1*H*W', thereby obtaining third dimension information of the form 'T*C*(HxW)'.

Here, the encoder according to the embodiment may further perform 2) dimensional combining over '(HxW)'.

As a result, in the embodiment the encoder may finally obtain, after feature extraction, fourth dimension information of the form '(C*T)', in which the height (H) information is compressed.

That is, in the embodiment the encoder may perform encoding that reduces the two-dimensional (2D) information based on the original 'H*W' to one dimension (1D).
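The first mode of dimensional combining can be sketched on toy data. The source gives only the shapes, so two points below are assumptions: that the '1*C*H*W' and 'T*1*H*W' tensors are combined by an element-wise (broadcast) product over the shared H x W grid, and that the flattened (HxW) axis is then reduced by summation. The function name and the toy values are hypothetical.

```python
def combine_first_mode(features, attention):
    """First-mode combining (sketch).

    features:  nested list [C][H][W] ('1*C*H*W' without the leading 1).
    attention: nested list [T][H][W] ('T*1*H*W' without the singleton axis).
    Returns a [T][C] table: one C-dimensional vector per output slot.
    """
    pooled = []
    for attn_map in attention:            # loop over T
        row = []
        for channel in features:          # loop over C
            # Product over the shared H x W grid, flattened and summed
            # (assumed reduction). The 2D layout is discarded here.
            total = sum(a * f
                        for attn_row, feat_row in zip(attn_map, channel)
                        for a, f in zip(attn_row, feat_row))
            row.append(total)
        pooled.append(row)
    return pooled

# Toy sizes: C=2, H=2, W=3, T=2.
features = [[[1, 2, 3], [4, 5, 6]],
            [[1, 0, 1], [0, 1, 0]]]
attention = [[[1, 0, 0], [0, 0, 0]],   # slot 0 attends to the top-left cell
             [[0, 0, 0], [0, 0, 1]]]   # slot 1 attends to the bottom-right cell
encoded = combine_first_mode(features, attention)
# encoded == [[1, 1], [6, 0]]
```

The result has shape T x C with no H or W axis left, which is the 2D-to-1D reduction described above.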

<Encoding schema conversion server>

In an embodiment of the present invention, the encoding schema conversion server 120 may operate in conjunction with a deep learning neural network to perform a shape conversion on the initial encoding information (first attention map) obtained from the alignment unit (112: second module).

That is, in the embodiment the encoding schema conversion server 120 may perform an encoding schema conversion based on the initial encoding information.

The encoding schema conversion server 120 may then obtain, based on the shape conversion performed as described above, the final encoding information to be input to the decoder, that is, a second attention map.

In an embodiment of the present invention, when the alignment unit (112: second module) extracts encoding information having location information for the text in the corresponding input image from the data output by the feature extraction unit (111: first module) (in the embodiment, the feature information), using an ordinary deconvolution (deconv) scheme may produce certain artifacts.

When the deep learning neural network (that is, the fourth module 131 in the embodiment) performs decoding with encoding information containing such artifacts (for example, distortion or noise in the second attention map) as its input data, the artifacts spread across that input data can expose the decoding to noise and/or distortion, including that caused by objects other than text. This can in turn degrade the quality of the decoding result, that is, the optical character recognition (OCR) result.

FIG. 7 is an example showing the artifacts of the encoding information (second attention map) minimized by means of the third module (first and second module combining unit 121) according to an embodiment of the present invention.

Accordingly, referring to FIG. 7, in an embodiment of the present invention the encoding schema conversion server 120 may, in order to minimize the artifacts in the initial encoding information obtained from the feature extraction unit (111: first module) and the alignment unit (112: second module), perform the encoding schema conversion using a deconvolution block that implements a separate encoding process, and may thereby generate the final form of the encoding information (that is, the second attention map) to be input to the decoder (in the embodiment, the fourth module 131).

In detail, in the embodiment the encoding schema conversion server 120 may be implemented based on an encoding schema conversion deep learning neural network.

Here, in the embodiment the encoding schema conversion deep learning neural network may include, as the third module, a first and second module combining unit (121; hereinafter, the combining unit).

In more detail, the combining unit (121: third module) according to the embodiment may use a deep learning neural network to perform a separate encoding process based on the data output from the alignment unit (112: second module) (that is, the first attention map in the embodiment).

In general, for irregular text that appears in the input image with characteristics such as curvature, it is important to preserve as much as possible of the two-dimensional information based on height (H) and width (W).

However, as described above, the dimensional combining that produces the initial encoding information (first attention map) in the encoder performs encoding that reduces the two-dimensional (2D) information based on the original 'H*W' to one dimension (1D), and in doing so the height (H) and width (W) information may each be lost.

FIG. 8 is an example of a diagram for explaining a method of performing a second mode of dimensional combining based on the dimension data from the first module and the second module according to an embodiment of the present invention.

Accordingly, referring to FIG. 8, in an embodiment of the present invention the combining unit (121: third module) may re-form the first attention map, which the alignment unit (112: second module) outputs in a (C*T) form with the H information compressed after feature extraction by the feature extraction unit (111: first module) in the encoder, into a (C*(H*W)) form that preserves as much as possible of the two-dimensional information comprising height (H) and width (W).

The embodiment of the present invention is described on the basis that the combining unit (121: third module) is implemented as a device separate from the encoder, but this is only an example. Various other embodiments are possible; for instance, the encoder may include the combining unit (121: third module), so that the function of the combining unit (121: third module) is carried out as part of the operation of the alignment unit (112: second module).

Specifically, in an embodiment of the present invention the combining unit (121: third module) may perform the second mode of dimensional combining based on the first dimension information of the feature extraction unit (111: first module) and the second dimension information of the alignment unit (112: second module) described above.

In more detail, in the embodiment the combining unit (121: third module) may perform 1) channel dimensional combining of the maximum number of channels (T) and the number of output channels (C), based on '1*C*H*W', the first dimension information of the feature extraction unit (111: first module), and 'T*1*H*W', the second dimension information of the alignment unit (112: second module).

That is, in the embodiment the combining unit (121: third module) may perform dimensional combining of the T and C channels based on '1*C*H*W' and 'T*1*H*W', and may thereby obtain third dimension information of the form '(TxC)*HxW'.

After obtaining the third dimension information as described above, the combining unit (121: third module) in the embodiment may perform 2) dimensional combining of the height (H) and width (W) information.

In detail, in the embodiment the combining unit (121: third module) may perform dimensional combining of the height (H) and width (W) information based on the third dimension information of the form '(TxC)*HxW', and may thereby obtain fourth dimension information of the form '(TxC)*(HxW)'.

Next, having obtained the fourth dimension information, the combining unit (121: third module) may perform 3) a dimension reduction process using a 1x1 convolution layer (1x1 conv layer) based on the obtained fourth dimension information.

Specifically, in the embodiment the combining unit (121: third module) may perform the dimension reduction process on the fourth dimension information of the form '(TxC)*(HxW)' by means of the 1x1 convolution layer.

The combining unit (121: third module) may then obtain, by performing the dimension reduction process, fifth dimension information of the form 'C*(HxW)'.

That is, in the embodiment the combining unit (121: third module) may perform the second mode of dimensional combining based on the first dimension information (1*C*H*W) obtained from the feature extraction unit (111: first module) and the second dimension information (T*1*H*W) obtained from the alignment unit (112: second module), and may thereby perform dimensional combining that keeps the two-dimensional (2D) information based on the original 'H*W' intact.
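The three steps of the second mode can be traced on toy data. As a sketch under stated assumptions: the source gives only the shapes, so combining '1*C*H*W' with 'T*1*H*W' is modelled here as an element-wise (broadcast) product, and the 1x1 convolution is written as a plain linear map over the channel axis applied independently at every position, with the bias omitted. The function name, the toy values, and the weight matrix are hypothetical.

```python
def combine_second_mode(features, attention, weights):
    """Second-mode combining (sketch).

    features:  nested list [C][H][W].
    attention: nested list [T][H][W].
    weights:   nested list [C_out][T*C] for the 1x1 convolution.
    Returns a nested list [C_out][H*W].
    """
    # Steps 1 and 2: merge T and C into one channel axis of size T*C,
    # and flatten H and W row by row into one axis of size H*W.
    merged = [[a * f
               for attn_row, feat_row in zip(attn_map, channel)
               for a, f in zip(attn_row, feat_row)]
              for attn_map in attention
              for channel in features]
    # Step 3: a 1x1 convolution mixes channels only, so every one of the
    # H*W positions is kept and handled separately.
    positions = len(merged[0])
    return [[sum(w * merged[k][p] for k, w in enumerate(w_row))
             for p in range(positions)]
            for w_row in weights]

# Toy sizes: C=2, H=2, W=3, T=2, so T*C=4 and H*W=6.
features = [[[1, 2, 3], [4, 5, 6]],
            [[1, 0, 1], [0, 1, 0]]]
attention = [[[1, 0, 0], [0, 0, 0]],
             [[0, 0, 0], [0, 0, 1]]]
weights = [[1, 0, 1, 0],   # output channel 0 sums channel 0 over both slots
           [0, 1, 0, 1]]   # output channel 1 sums channel 1 over both slots
second_map = combine_second_mode(features, attention, weights)
# second_map == [[1, 0, 0, 0, 0, 6], [1, 0, 0, 0, 0, 0]]
```

The output keeps one value per (channel, position) pair, so the position of each response on the H x W grid is still recoverable.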

In summary, in an embodiment of the present invention the combining unit (121: third module) may generate a second attention map (encoding information) whose shape, unlike that of the first attention map, minimizes the loss of two-dimensional information about the input image.

In addition, in the embodiment the combining unit (121: third module) may generate a second attention map (encoding information) whose shape preserves as much as possible of the two-dimensional information (height and width information) in the input image while also matching the input data format of the decoder.

The combining unit (121: third module) can thus have the second attention map (encoding information) serve as the input data of the decoder (that is, the fourth module 131), so that decoding is performed on input data that preserves as much as possible of the two-dimensional information in the input image.

In this way, in the embodiment the combining unit (121: third module) provides encoding information (second attention map) that minimizes the loss of two-dimensional information in the input image, which can minimize artifacts in the encoding information (for example, distortion and/or noise in the second attention map) and thereby improve the performance of deep learning-based optical character recognition (OCR) on text that includes irregular text (for example, curved text).

<Decoder deep learning server>

In an embodiment of the present invention, the decoder deep learning server 130 may operate in conjunction with a deep learning neural network to recognize characters in the input image (which may include irregular characters, for example curved characters) based on the encoding information (second attention map) obtained from the combining unit (121: third module).

That is, in the embodiment the decoder deep learning server 130 may perform optical character recognition (OCR), detecting characters in the input image by performing deep learning-based decoding of the encoding information.

In detail, in the embodiment the decoder deep learning server 130 may be implemented based on a decoder deep learning neural network (hereinafter, the decoder).

Here, the decoder according to the embodiment may include, as the fourth module, a CNN-based character recognition unit (131; hereinafter, the character recognition unit).

FIG. 9 is an example of a diagram illustrating the internal operation of the fourth module (CNN-based character recognition unit 131) according to an embodiment of the present invention.

In more detail, referring to FIG. 9, in the embodiment the character recognition unit (131: fourth module) may use a deep learning neural network to perform decoding based on the encoding information (second attention map) output from the combining unit (121: third module), and thereby detect characters in the input image.

Here, the character recognition unit (131: fourth module) according to the embodiment may be implemented based on the decoder of a transformer model.

In detail, when an ordinary attention scheme is used in the decoding process, the decoder must refer once more to the entire input data from the encoder (and/or the combining unit (121: third module)) at every time step at which it predicts an output character.

In this scheme the decoder does not refer to all of the input data in equal proportion; it concentrates (attends) more on the part of the input data that is related to the character to be predicted at that time step. Because of this characteristic, the prediction can be affected by the previous state.

For this reason, the transformer model approach, which minimizes the influence of the previous state, has come into common use in recent years.

For reference, in the transformer model (transformer recognizer) approach, decoding proceeds on the basis of self-attention, in which attention is applied to the sequence itself, and this has the advantage of minimizing the influence of the previous state.
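To make the idea of self-attention concrete, the following is a minimal sketch of scaled dot-product self-attention with a causal mask, as used in a transformer decoder. It is a simplification rather than the network of the embodiment: the learned query, key, and value projections and the multiple heads are omitted, so each token vector serves as its own query, key, and value. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(tokens):
    """tokens: list of equal-length vectors, one per decoding step.

    Each output is a weighted average of the tokens up to and including
    the current step (causal mask), with weights given by scaled dot
    products between the current token and each visible token.
    """
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        visible = tokens[: i + 1]  # causal mask: no access to later steps
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, visible))
                        for d in range(dim)])
    return outputs

out = causal_self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each output is computed directly from all visible positions instead of being carried through a recurrent hidden state, which is one way to read the reduced dependence on the previous state.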

A detailed description of the transformer model is omitted here and replaced by reference to the paper: Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

The transformer model in that paper contains both an encoder and a decoder structure, and the encoder and the decoder are implemented with similar structures.

However, character recognition based on such a structure is inefficient at extracting the location information of text in the input image, and its accuracy may also be degraded.

Accordingly, in an embodiment of the present invention, the character recognition unit (131: fourth module) may be implemented using only the decoder of the transformer model, and optical character recognition (OCR) based on deep learning may be performed by inputting to the character recognition unit (131: fourth module) the second attention map (encoding information) obtained from the encoder (and/or the combining unit (121: third module)), which performs encoding by effectively extracting the features and location information of the text in the input image.

Here, the internal design of the deep learning neural network of the character recognition unit (131: fourth module) according to the embodiment is as shown in FIG. 9, which is referred to in place of a written description.

That is, in the embodiment the character recognition unit (131: fourth module) may be implemented as shown in FIG. 9 based on the decoder of the transformer model, and the character recognition unit (131: fourth module) so implemented may perform deep learning with the encoding information (second attention map) output from the encoder and/or the combining unit (121: third module) as its input data.

The character recognition unit (131: fourth module) may then perform optical character recognition (OCR), detecting the characters in the corresponding input image by performing the deep learning.

In addition, the character recognition unit (131: fourth module) may output and provide the detected character data in a predetermined manner (for example, on a display).

In this way, in an embodiment of the present invention, the character recognition unit (131: fourth module) is implemented based on the decoder of a transformer model, which uses self-attention to minimize the influence of the previous state. This makes it possible to implement a deep learning-based optical character recognition (OCR) model that combines the advantages of the encoder and/or the combining unit (121: third module) according to the embodiment with the advantages of the decoder of the known transformer model.

In addition, as described above, an embodiment of the present invention provides a character recognition deep learning neural network that includes: an encoder that quickly and accurately extracts the features and location information of text in an input image and outputs initial encoding information (first attention map); a combining unit (121: third module) that converts the initial encoding information output by the encoder to generate encoding information (second attention map) in which the artifacts of the initial encoding information are minimized; and a transformer decoder-based character recognition unit (131: fourth module) that takes the encoding information output by the combining unit (121: third module) as input data and detects the actual text (characters) in the input image with high accuracy and speed. This can further improve the performance of optical character recognition (OCR) on text in an input image (which may include irregular text, for example curved text).

- Deep learning-based optical character recognition (OCR) method

Hereinafter, a method by which the character recognition server 100 according to an embodiment of the present invention performs optical character recognition (OCR) based on deep learning will be described in detail.

Descriptions that overlap with the content above may be summarized or omitted.

In an embodiment of the present invention, the character recognition server 100 may use the encoder to extract feature information and location information for text in an input image (which may include irregular text, for example curved text).

In addition, in the embodiment the character recognition server 100 may obtain initial encoding information, that is, a first attention map (early attention map), based on the feature information and location information extracted as described above.

In detail, in the embodiment the character recognition server 100 may use the feature extraction unit (111: first module) of the encoder to perform deep learning based on a resnet45 deep learning neural network.

The character recognition server 100 may then extract, through the deep learning performed, feature information (FPN_Features) for the text in the input image.

또한, 실시예에서 문자인식 서버(100)는, 인코더의 정렬부(112: 제 2 모듈)를 기초로 해당 입력 이미지 내 텍스트에 대한 위치정보를 가지는 초기 인코딩 정보 즉, 제 1 어텐션 맵을 획득할 수 있다. In addition, in the embodiment, the character recognition server 100 obtains initial encoding information having location information on text in the corresponding input image, that is, the first attention map, based on the sorting unit 112 (second module) of the encoder. can

In detail, the character recognition server 100 may take the data output from the feature extraction unit 111 (first module), that is, the feature information, as input data and perform deep learning based on the first- to fourth-stage convolution layers and the first- to fourth-stage deconvolution layers of the alignment unit 112 (second module).

The character recognition server 100 may thereby obtain the initial encoding information having location information on the text in the input image, that is, the first attention map.
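As an illustration of why a four-stage convolution followed by a four-stage deconvolution can yield a map with the same height and width as its input, the sketch below traces the spatial sizes through such a stack. The specification does not state kernel sizes, strides, or padding; the values used here (kernel 3, stride 2, padding 1, output padding 1) and the 32 x 128 feature size are assumptions chosen only for this example, and the function names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Standard output-size formula for a strided convolution (assumed settings)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Standard output-size formula for a transposed convolution (assumed settings)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_alignment_shapes(height, width, stages=4):
    """Return the (H, W) size after each of `stages` convs and `stages` deconvs."""
    shapes = [(height, width)]
    for _ in range(stages):          # downsampling path
        h, w = shapes[-1]
        shapes.append((conv_out(h), conv_out(w)))
    for _ in range(stages):          # upsampling path
        h, w = shapes[-1]
        shapes.append((deconv_out(h), deconv_out(w)))
    return shapes


# Hypothetical 32 x 128 feature map: shrinks to 2 x 8, then returns to 32 x 128.
print(trace_alignment_shapes(32, 128))
```

Under these assumed settings the last entry equals the first, so the resulting map can be aligned position by position with the feature information.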

In addition, in the embodiment, the character recognition server 100 may use the encoding schema conversion server 120 to perform a shape transformation on the initial encoding information (first attention map) obtained from the alignment unit 112 (second module).

In detail, the character recognition server 100 may use the combining unit 121 (third module) of the encoding schema conversion server 120 to perform deep learning that carries out a separate dimension combining based on the initial encoding information.

Through this, the character recognition server 100 may obtain encoding information (a second attention map) that preserves the two-dimensional information (height and width information) of the input image as far as possible while having a shape that matches the input data shape of the decoder.
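The shape transformation described above can be sketched with NumPy as follows. This is only one possible reading of the dimension combining recited in the claims (first dimension information of shape 1 x C x H x W, second dimension information of shape T x 1 x H x W, a 1x1 convolution for dimension reduction, and an output of shape C x H x W). The use of element-wise multiplication, the order of the reshapes, the random weights, and the function name `combine_dimensions` are all assumptions made for illustration, not the operations of the actual embodiment.

```python
import numpy as np


def combine_dimensions(features, early_attention, weight):
    """features: (1, C, H, W); early_attention: (T, 1, H, W); weight: (C, T*C)."""
    _, C, H, W = features.shape
    T = early_attention.shape[0]
    # Broadcasting pairs every attention step with every feature channel: (T, C, H, W).
    weighted = early_attention * features
    # Merge the T and C axes into a single channel axis: (T*C, H, W).
    third = weighted.reshape(T * C, H, W)
    # Merge H and W into a single spatial axis: (T*C, H*W).
    fourth = third.reshape(T * C, H * W)
    # A 1x1 convolution is a per-position linear map over channels: (C, H*W).
    fifth = weight @ fourth
    # Restore the explicit height and width axes: (C, H, W).
    return fifth.reshape(C, H, W)


rng = np.random.default_rng(0)
T, C, H, W = 5, 8, 4, 16                          # hypothetical sizes
features = rng.standard_normal((1, C, H, W))      # stand-in for module 1 output
early_attention = rng.random((T, 1, H, W))        # stand-in for module 2 output
weight = rng.standard_normal((C, T * C))          # stand-in for learned 1x1 conv weights
attention_map = combine_dimensions(features, early_attention, weight)
print(attention_map.shape)  # (8, 4, 16)
```

Because the 1x1 convolution acts only on the channel axis, each output position depends only on the same position in the input, which is why the height and width layout survives the reduction in this sketch.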

In addition, in the embodiment, the character recognition server 100 may use the decoder to perform optical character recognition (OCR) based on the encoding information (second attention map) obtained from the combining unit 121 (third module).

In detail, the character recognition server 100 may perform deep learning that takes the encoding information (second attention map) output from the combining unit 121 (third module) as input data, based on the deep learning neural network that implements the character recognition unit 131 (fourth module) of the decoder.

Here, the character recognition unit 131 (fourth module) according to the embodiment may be implemented based on the decoder of a Transformer model, which performs self-attention and thereby operates so as to minimize the influence of the previous state.
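For readers unfamiliar with the mechanism, the following is a minimal single-head sketch of scaled dot-product self-attention with a causal mask, as commonly used in Transformer decoders. The specification only states that a Transformer decoder is used; the single head, the causal mask, the random projection weights, and the sizes below are assumptions for illustration and do not describe the embodiment's actual network.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """x: (steps, dim). Returns (output, attention weights) for one head."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Similarity of every step with every other step, scaled by sqrt(dim).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: a step may not attend to later steps.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
steps, dim = 4, 8                                   # hypothetical sizes
x = rng.standard_normal((steps, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Each step looks at the earlier steps directly through the attention weights rather than through a recurrent hidden state, which is one way to understand the reduced dependence on the previous state mentioned above.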

Based on the deep learning of the character recognition unit 131 (fourth module) performed as above, the character recognition server 100 may perform optical character recognition (OCR) that detects the text in the input image.
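How decoder outputs become a text string is not detailed in this passage. As a hedged illustration, the sketch below uses simple greedy decoding: at each step the highest-scoring symbol is taken, and decoding stops at an end-of-sequence symbol. The character set, the `<eos>` token, the score values, and the function name `greedy_decode` are hypothetical and are not taken from the specification.

```python
def greedy_decode(step_scores, charset, end_token="<eos>"):
    """Pick the best-scoring symbol at each step until the end token appears."""
    chars = []
    for scores in step_scores:
        # Index of the highest score at this decoding step.
        best = max(range(len(scores)), key=scores.__getitem__)
        symbol = charset[best]
        if symbol == end_token:
            break
        chars.append(symbol)
    return "".join(chars)


# Hypothetical character set and per-step scores from a recognizer.
charset = ["<eos>", "O", "C", "R"]
step_scores = [
    [0.1, 0.7, 0.1, 0.1],   # "O"
    [0.0, 0.2, 0.6, 0.2],   # "C"
    [0.1, 0.1, 0.2, 0.6],   # "R"
    [0.9, 0.0, 0.1, 0.0],   # end of sequence
    [0.2, 0.5, 0.2, 0.1],   # ignored: comes after the end token
]
print(greedy_decode(step_scores, charset))  # OCR
```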

In addition, in the embodiment, the character recognition server 100 may output and provide the result data obtained through the deep learning-based optical character recognition (OCR) in a predetermined manner (for example, display output).

As described above, the deep learning-based optical character recognition (OCR) device and system according to an embodiment of the present invention perform character recognition in an image based on a plurality of deep learning modules that perform different functions. Clear feature information and location information on the text in the image can therefore be obtained as the data passes through each deep learning module. By performing character recognition based on the information obtained in this way, not only regular text but also irregular text can be effectively separated from non-text regions and recognized quickly and accurately.

In addition, the embodiments of the present invention described above may be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. A hardware device may be replaced with one or more software modules to perform the processing according to the present invention, and vice versa.

The specific implementations described in the present invention are examples and do not limit the scope of the present invention in any way. For brevity of the specification, descriptions of conventional electronic components, control systems, software, and other functional aspects of the systems may be omitted. In addition, the line connections or connecting members between the components shown in the drawings illustrate functional connections and/or physical or circuit connections by way of example; in an actual device they may be replaced by, or supplemented with, various functional, physical, or circuit connections. Furthermore, unless a component is specifically described with terms such as "essential" or "important", it may not be a component that is necessarily required for the application of the present invention.

Although the detailed description of the present invention has been given with reference to preferred embodiments, those skilled in the art or those having ordinary knowledge in the relevant technical field will understand that the present invention can be modified and changed in various ways without departing from the spirit and technical scope of the present invention set forth in the claims below. Therefore, the technical scope of the present invention should not be limited to the contents of the detailed description of the specification but should be determined by the claims.

Claims

A deep learning-based optical character recognition device comprising:
at least one processor; and
at least one memory,
wherein at least one program is stored in the memory and executed by the at least one processor to perform optical character recognition based on deep learning,
the at least one program comprising:
a first module that extracts feature information on text in an input image based on a deep learning neural network;
a second module that generates initial encoding information having location information on the text based on the extracted feature information;
a third module that obtains an attention map retaining two-dimensional information of the input image by performing a shape transformation based on the generated initial encoding information; and
a fourth module that detects the text in the input image through a deep learning model based on the obtained attention map,
wherein the third module generates the attention map by performing the shape transformation so as to remove artifacts in the initial encoding information.

The device of claim 1,
wherein the first module,
based on a Resnet45-based deep learning neural network, extracts, as the feature information, data on a height (H), a width (W), and a number of output channels (C) based on the input image.

The device of claim 2,
wherein the two-dimensional information
includes height (H) and width (W) information based on the input image.

(Deleted)

The device of claim 1,
wherein the third module
generates the attention map by performing dimension combining according to a second scheme on
first dimension information of the form '1 * number of output channels (C) * height (H) * width (W)' obtained from the first module, and
second dimension information of the form 'maximum number of channels (T) * 1 * height (H) * width (W)' obtained from the second module.

The device of claim 5,
wherein the third module
obtains third dimension information by performing dimension combining of the maximum number of channels (T) and the number of output channels (C) based on the first dimension information and the second dimension information, and
obtains fourth dimension information by performing dimension combining of the height (H) and width (W) information based on the obtained third dimension information.

The device of claim 6,
wherein the third module
performs the dimension combining according to the second scheme, in which fifth dimension information is obtained by performing a dimension reduction process using a 1x1 convolution layer (1x1 conv layer) based on the obtained fourth dimension information.

The device of claim 7,
wherein the third module,
by performing the dimension combining according to the second scheme, obtains the attention map in the form 'number of output channels (C) * height (H) * width (W)', which retains the two-dimensional information including the height (H) and width (W) information.

The device of claim 1,
wherein the fourth module is implemented based on a decoder of a Transformer model.

The device of claim 9,
wherein the fourth module
performs deep learning based on the decoder using the attention map as input data, and detects the text in the input image based on the performed deep learning.

A deep learning-based optical character recognition system comprising:
an encoder deep learning server including a first module that extracts feature information on text in an input image based on a deep learning neural network, and a second module that generates initial encoding information having location information on the text based on the extracted feature information;
an encoding schema conversion server including a third module that obtains an attention map in a form that retains two-dimensional information of the input image as it is, by performing a shape transformation based on the generated initial encoding information; and
a decoder deep learning server including a fourth module that detects the text in the input image by performing deep learning based on the obtained attention map,
wherein the third module generates the attention map by performing the shape transformation so as to remove artifacts in the initial encoding information.

A deep learning-based optical character recognition method performed by a deep learning server, the method comprising:
extracting feature information on text in an input image based on a deep learning neural network;
generating initial encoding information having location information on the text based on the extracted feature information;
obtaining an attention map retaining two-dimensional information of the input image by performing a shape transformation based on the generated initial encoding information; and
detecting the text in the input image through a deep learning model based on the obtained attention map,
wherein obtaining the attention map retaining the two-dimensional information of the input image by performing the shape transformation based on the initial encoding information
includes removing artifacts in the initial encoding information and generating the attention map retaining the two-dimensional information, which includes height (H) and width (W) information based on the input image.

The method of claim 12,
wherein extracting the feature information
includes extracting, as the feature information, data on a height (H), a width (W), and a number of output channels (C) based on the input image, based on a Resnet45-based deep learning neural network.

(Deleted)

The method of claim 12,
wherein obtaining the attention map
further includes generating the attention map by performing dimension combining according to a second scheme based on first dimension information of the form '1 * number of output channels (C) * height (H) * width (W)' for the feature information and second dimension information of the form 'maximum number of channels (T) * 1 * height (H) * width (W)' for the initial encoding information.

The method of claim 15,
wherein obtaining the attention map further includes:
obtaining third dimension information by performing dimension combining of the maximum number of channels (T) and the number of output channels (C) based on the first dimension information and the second dimension information; and
obtaining fourth dimension information by performing dimension combining of the height (H) and width (W) information based on the obtained third dimension information.

The method of claim 16,
wherein obtaining the attention map
further includes performing the dimension combining according to the second scheme, in which fifth dimension information is obtained by performing a dimension reduction process using a 1x1 convolution layer (1x1 conv layer) based on the obtained fourth dimension information.

The method of claim 17,
wherein obtaining the attention map
further includes obtaining, by performing the dimension combining according to the second scheme, the attention map in the form 'number of output channels (C) * height (H) * width (W)', which retains the two-dimensional information including the height (H) and width (W) information.

The method of claim 12,
wherein detecting the text
includes detecting the text based on a deep learning neural network implemented based on a decoder of a Transformer model.

The method of claim 19,
wherein detecting the text
includes performing deep learning based on the decoder using the attention map as input data, and detecting the text in the input image based on the performed deep learning.