KR102629150B1

KR102629150B1 - A method for building datasets by recognizing documents with a complex structure including tables using document structure tags when performing OCR

Info

Publication number: KR102629150B1
Application number: KR1020230107354A
Authority: KR
Inventors: 황선희; 조창희; 고형석; 이홍재
Original assignee: (주)유알피
Priority date: 2023-08-17
Filing date: 2023-08-17
Publication date: 2024-01-25

Abstract

The present invention relates to a method for building a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR. In a device that builds a dataset by recognizing documents with a complex structure, the method includes: identifying format items according to the type of an input target document and extracting the text corresponding to the format items; generating document structured data including at least one format item identified in the target document and relationship information about the text of the format items; identifying the structure of a table included in the target document and extracting the text contained in the cell areas of the table to generate table format data; and attaching document structuring tags that identify the document structure to the document structured data and converting it into text data in a predetermined format.

Description

Method for building datasets by recognizing documents with a complex structure including tables using document structure tags when performing OCR {A METHOD FOR BUILDING DATASETS BY RECOGNIZING DOCUMENTS WITH A COMPLEX STRUCTURE INCLUDING TABLES USING DOCUMENT STRUCTURE TAGS WHEN PERFORMING OCR}

When text is extracted from a PDF document or an image document, the formatting is dropped, so the document can no longer be understood as originally intended. The present invention therefore relates to a method for building a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR. The method extracts text from the input document, determines the document structure (header, footer, page number, body, and so on), and attaches document structure tags to the extracted text. If the body contains a table, the table structure and the linkage between the contents of its cells are expressed as a matrix structure and table tags are attached, so that the table is converted into data that can be understood as in the original document.

As the document digitization market expands and electronic documents gain the same legal effect as paper documents, various industries are scanning paper documents and converting them into electronic documents.

Recently, with the advance of artificial intelligence-based optical character recognition (OCR) technology, it has become possible to search or clip the contents of digitized documents, and many materials originally produced on paper are being converted into electronic documents and then processed or analyzed.

In general, OCR is implemented by reading the entire document and recognizing its characters. Unnecessary parts that repeat throughout the document, such as headers, footers, and page numbers, are sometimes recognized as well, so the inconvenience of removing or classifying them by hand remains.

In addition, because document formatting is dropped during text extraction, documents that contain forms or tables often cannot be understood as the original document intended.

Therefore, there is a need for a technology that recognizes documents during OCR so that, in addition to text extraction, the formatting of the document is reflected and its structure can be identified.

To solve the above problems, an object of the present invention is to provide a method that, when performing OCR, not only extracts text but also identifies the structure of the document and attaches tags reflecting that structure to the extracted text, thereby converting the document into data as the original document intended.

A method for building a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR according to an embodiment of the present invention may include, in a device that builds a dataset by recognizing documents with a complex structure: identifying format items according to the type of an input target document and extracting the text corresponding to the format items; generating document structured data including at least one format item identified in the target document and relationship information about the text of the format items; identifying the structure of a table included in the target document and extracting the text contained in the cell areas of the table to generate table format data; and attaching document structuring tags that identify the document structure to the document structured data and converting it into text data in a predetermined format.

In addition, the target document includes at least one of an image containing text and a PDF file, and the step of identifying format items according to the type of the input target document and extracting the text corresponding to the format items is characterized by applying optical character recognition (OCR) technology to recognize and extract text from a document in the form of a text image.

In addition, the step of identifying format items according to the type of the input target document and extracting the text corresponding to the format items may include: identifying the document type from the meta information of the target document; and checking the template data for the document type and extracting text according to the location areas and pattern rules of the format items defined in the template data, thereby generating pair data of format item and text content.

In addition, the step of identifying the structure of a table included in the target document and extracting the text contained in the cell areas of the table to generate table format data may include: identifying each cell area of the table in the image data of the area where the table is located in the target document, and determining the matrix structure of the entire table; defining a table tag for each identified cell area, extracting the text of the cell area, and attaching the table tag; and linking the table format data with table tags attached to the document structured data.

In addition, the table tag is characterized by expressing the relative position of each cell within the table as a matrix structure.

In addition, the table tag includes a cell tag, and the cell tag is characterized by indicating that a specific cell has been added or that a plurality of cells have been merged.

In addition, the step of attaching the document structuring tags and converting into text data in a predetermined format is characterized by writing the document structured data in a markup language using predefined document structuring tags and storing it.

When text is extracted from PDF documents and scanned images, the document's formatting is structured and tagged, so that during document preprocessing only the data under the required tags can be selected, edited, and extracted.

In addition, when text extracted from PDF documents and scanned documents is used as training data, unnecessary text such as page numbers and repeatedly displayed headers and footers can be excluded, and only the necessary parts can be extracted and used.
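As a rough illustration of selecting data by tag, the sketch below drops header, footer, and page-number elements from a tagged document and keeps the rest as training text. The element names (`document`, `header`, `footer`, `page_number`, `title`, `body`) and the XML-like layout are assumptions made for this example; the patent does not fix a concrete tag vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names for the parts that should not reach the training set.
DROP = {"header", "footer", "page_number"}


def training_text(markup):
    """Return the text of all top-level elements except the dropped tags."""
    root = ET.fromstring(markup)
    kept = []
    for element in root:
        text = (element.text or "").strip()
        if element.tag not in DROP and text:
            kept.append(text)
    return "\n".join(kept)


sample = (
    "<document>"
    "<header>ACME Confidential</header>"
    "<title>Report</title>"
    "<body>Sales rose.</body>"
    "<page_number>3</page_number>"
    "</document>"
)
print(training_text(sample))
```

Because the filter works on tag names only, the same tagged file can serve several purposes, for example keeping only `body` for one dataset and `title` plus `body` for another.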

In addition, attaching table tags makes it possible to understand the structure of a table in the document together with the contents of each cell, and the table can be reproduced from the tagged document data.

In addition, when extracting text from a document laid out in multiple columns, such as an academic paper, the characters are read column by column according to the paragraph areas, so that the text is extracted as the original document intended.
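One simple way to picture column-aware reading is to sort text blocks first by the column they fall in and then by vertical position. The sketch below assumes equal-width columns and blocks described by hypothetical `x`, `y`, and `text` fields; the patent does not describe how paragraph areas are actually detected, so this is only an illustration of the reading order.

```python
def reading_order(blocks, page_width, columns=2):
    """Sort blocks column by column (left to right), top to bottom in each column."""
    column_width = page_width / columns

    def key(block):
        # Clamp so that a block on the right edge still lands in the last column.
        column = min(int(block["x"] // column_width), columns - 1)
        return (column, block["y"])

    return sorted(blocks, key=key)


blocks = [
    {"text": "R1", "x": 600, "y": 100},
    {"text": "L2", "x": 50, "y": 300},
    {"text": "L1", "x": 50, "y": 100},
    {"text": "R2", "x": 600, "y": 300},
]
print([b["text"] for b in reading_order(blocks, page_width=1000)])
# ['L1', 'L2', 'R1', 'R2']
```

A plain top-to-bottom sort would interleave the two columns (L1, R1, L2, R2), which is the failure mode this effect is meant to avoid.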

FIG. 1 is an overall relationship diagram of a device that builds a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR according to an embodiment of the present invention.
FIG. 2 is a block diagram of the functions of the device that builds a dataset by recognizing documents with a complex structure according to an embodiment of the present invention.
FIG. 3 is an example of table format data in which the device according to an embodiment of the present invention expresses a table contained in a document as a matrix structure and attaches table tags.
FIG. 4 shows data obtained when the device according to an embodiment of the present invention recognizes the document structure and text in an image file containing text, attaches document structure tags, and converts the result into text data in a predetermined format.
FIG. 5 shows the hardware structure of the device that builds a dataset by recognizing documents with a complex structure according to an embodiment of the present invention.
FIG. 6 is a flowchart of the method for building a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR according to an embodiment of the present invention.
FIG. 7 is a flowchart showing in detail the step of generating table format data in the method according to an embodiment of the present invention.

Hereinafter, specific embodiments of the present invention are described in detail with reference to the drawings. However, the spirit of the present invention is not limited to the embodiments presented. A person skilled in the art who understands the spirit of the present invention could readily propose other, retrogressive inventions or other embodiments falling within the scope of the spirit of the present invention by adding, changing, or deleting components within the same scope, and these are also considered to fall within the scope of the spirit of the present invention.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention of the inventor or to custom, so their definitions should be based on the content of this specification as a whole. Where a detailed description of a known configuration or function related to the present invention is judged likely to obscure the gist of the present invention, that description is omitted.

Hereinafter, a device that builds a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR according to the present invention is described with reference to the drawings.

FIG. 1 is an overall relationship diagram of a device that builds a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR according to an embodiment of the present invention (hereinafter referred to as the device that builds a dataset by recognizing documents with a complex structure).

Referring to FIG. 1, the device 100 that builds a dataset by recognizing documents with a complex structure is connected over a network to at least one user terminal 200 and at least one application server 300, and can communicate with them.

The network referred to in the present invention may be a core network integrated with a wired public network, a wireless mobile communication network, the mobile Internet, or the like. It may also mean the worldwide open computer network architecture that provides the TCP/IP protocol and the various services in its upper layers, such as HTTP (Hyper Text Transfer Protocol), HTTPS (Hyper Text Transfer Protocol Secure), Telnet, and FTP (File Transfer Protocol). It is not limited to these examples and broadly means any data communication network capable of transmitting and receiving data in various forms.

When text is extracted from a PDF document or an image document, the formatting is dropped, so the document can no longer be understood as originally intended. The device 100 that builds a dataset by recognizing documents with a complex structure therefore extracts text from the input document, determines the document structure (header, footer, page number, body, and so on), and attaches document structure tags to the extracted text. If the body contains a table, the device expresses the table structure and the linkage between the contents of its cells as a matrix structure and attaches table tags, converting the table into data that can be understood as in the original document.

To this end, the device 100 receives a PDF document or a scanned document (image) from at least one of the user terminal 200 and the external server 300, identifies format items according to the type of the received target document, extracts the text corresponding to the format items, and generates document structured data including at least one format item identified in the target document and relationship information about the text of the format items.

In addition, the device identifies the structure of a table included in the target document, extracts the text contained in the cell areas of the table to generate table format data, and links that data to the document structured data.

When document structuring is complete, document structuring tags that identify the document structure are attached to the document structured data, which is then converted into text data in a predetermined format.

In the present invention, the user terminal 200 or the external server 300 can register a target document through a user interface or an integration interface provided by the device 100 that builds a dataset by recognizing documents with a complex structure.

Here, the target document is at least one of an image containing text and a PDF file. It may include an image file obtained by scanning a paper document, a scan saved as a PDF file, and a file obtained by converting an electronic document such as a Word, PPT, or Hangul (HWP) document to PDF.

The external server 300 may be a server that produces documents or processes collected documents, or a system that archives and stores documents.

FIG. 2 is a block diagram of the functions of the device 100 that builds a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR according to an embodiment of the present invention.

Referring to FIG. 2, the device 100 that builds a dataset by recognizing documents with a complex structure may include an OCR analysis unit 110, a document structuring unit 120, a table recognition unit 130, and a document tagging unit 140.

The OCR analysis unit 110 identifies format items according to the type of the input target document and extracts the text corresponding to the format items.

The target document input to the OCR analysis unit 110 is an image or PDF document containing text, and may be a paper, report, press release, administrative document, or the like that contains forms and tables.

The OCR analysis unit 110 applies optical character recognition (OCR) technology to recognize and extract text from an input target document that is in the form of a text image.

For example, optical character recognition can proceed in the following steps.

First, if the target document is degraded by noise, or the image is skewed or rotated, preprocessing can be performed to restore the image to a form suitable for analysis.

Next, text is detected in the preprocessed image. Because a document contains not only text but also various objects such as pictures, graphs, and lines, the text regions are detected first, and text recognition is then performed to determine which characters appear in each detected region.

Text recognition can be performed by applying a deep learning-based OCR model such as a CNN (convolutional neural network), an RNN (recurrent neural network), or a CRNN, which combines a CNN and an RNN.

The OCR analysis unit 110 includes a format identification unit 111 and a text extraction unit 112.

The format identification unit 111 identifies the document type from the meta information of the target document, checks the template data for that document type, and extracts text according to the location areas and pattern rules of the format items defined in the template data.

For example, a general document may consist of format items such as a title, body, header, footer, and page number.

As another example, if the input target document is an academic paper, it may consist of a title, abstract, introduction, research methods, results, discussion, acknowledgement, references, and so on.

The format identification unit 111 registers and manages templates that define the document formats used in each domain. When it receives a document for OCR analysis, it checks the meta information received with the document to determine the document type.

Then, according to the document type, it checks the format items defined in the template data and the extraction rules used to identify each format item.

The format identification rules contained in the template data may include the location area (coordinates) within the document image, the character style (size, font type), and the presence of special characters or symbols.

When the format identification unit 111 has located the area of a format item, the text extraction unit 112 recognizes and extracts the text in that area.

The text extraction unit 112 can extract the text of a specific area of the image.

The extracted text is mapped to pair data of format item and text content and stored, and the document structuring unit 120 then structures and stores it according to the relationships between the format items.
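The template lookup and pair generation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Rule` class, the `TEMPLATES` table, the region coordinates, and the word format (`text`, `x`, `y`) are all hypothetical, and the OCR step is assumed to have already produced positioned words.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    item: str              # format item name, e.g. "title"
    region: tuple          # location area as (x0, y0, x1, y1)
    pattern: str = r".+"   # pattern rule the extracted text must satisfy


# Hypothetical template data, keyed by the document type found in the meta information.
TEMPLATES = {
    "report": [
        Rule("header", (0, 0, 1000, 50)),
        Rule("title", (0, 50, 1000, 150)),
        Rule("body", (0, 150, 1000, 900)),
        Rule("page_number", (0, 900, 1000, 1000), r"-?\s*\d+\s*-?"),
    ]
}


def extract_pairs(doc_type, words):
    """Return (format item, text) pairs for the words that fall in each rule's region."""
    pairs = []
    for rule in TEMPLATES[doc_type]:
        x0, y0, x1, y1 = rule.region
        inside = [w for w in words if x0 <= w["x"] < x1 and y0 <= w["y"] < y1]
        inside.sort(key=lambda w: (w["y"], w["x"]))
        text = " ".join(w["text"] for w in inside)
        if text and re.fullmatch(rule.pattern, text):
            pairs.append((rule.item, text))
    return pairs


words = [
    {"text": "Annual", "x": 100, "y": 60},
    {"text": "Report", "x": 200, "y": 60},
    {"text": "- 3 -", "x": 500, "y": 950},
]
print(extract_pairs("report", words))
# [('title', 'Annual Report'), ('page_number', '- 3 -')]
```

Style rules such as font size or the presence of special symbols, which the description also mentions, would be additional fields on the rule; they are left out here to keep the sketch short.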

The document structuring unit 120 generates document structured data that includes the at least one format item identified by the OCR analysis unit 110 and relationship information about the extracted text.

For example, the document structured data may take the form of one or more pairs, each consisting of an attribute name and an attribute value.

The document structured data may also include information indicating the relationships among the pairs.
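A possible in-memory shape for such data is a list of name-value pairs plus a list of relations between them. The class and field names below (`Pair`, `StructuredDocument`, `relations`, the `"contains"` relation kind) are assumptions for illustration only; the patent does not specify a storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class Pair:
    name: str    # attribute name, e.g. "body"
    value: str   # attribute value, i.e. the extracted text


@dataclass
class StructuredDocument:
    pairs: list = field(default_factory=list)
    # Each relation is (parent index, child index, kind).
    relations: list = field(default_factory=list)

    def add(self, name, value, parent=None, kind="contains"):
        """Store a pair and, optionally, its relation to a parent pair."""
        self.pairs.append(Pair(name, value))
        index = len(self.pairs) - 1
        if parent is not None:
            self.relations.append((parent, index, kind))
        return index

    def children(self, parent):
        """Return the pairs directly related to the given parent index."""
        return [self.pairs[c] for p, c, _ in self.relations if p == parent]


doc = StructuredDocument()
title = doc.add("title", "Annual Report")
body = doc.add("body", "Sales rose in every region.")
table = doc.add("table", "<table2*4>", parent=body)
print([p.name for p in doc.children(body)])  # ['table']
```

Keeping relations separate from the pairs lets the same structure express nesting (a table inside the body) as well as other links, such as a caption that refers to a figure.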

Meanwhile, the format identification unit 111 can recognize a table contained in the target document and extract the location area of the table.

The extracted location area is analyzed by the table recognition unit 130, which structures the table and extracts the text in its cells.

The table recognition unit 130 identifies the structure of the table included in the target document and extracts the text contained in the cell areas of the table to generate table format data.

The table recognition unit 130 includes a table structure identification unit 131 and a cell text extraction unit 132.

The table structure identification unit 131 analyzes the location area of the table extracted by the format identification unit 111 to identify the area of the entire table, and identifies each cell area of the table in the image data of the area where the table is located.

At this point, the relative position of each cell within the table is expressed as a matrix structure.

FIG. 3 is an example of table format data in which the device 100 that builds a dataset by recognizing documents with a complex structure, including tables, using document structuring tags when performing OCR according to an embodiment of the present invention expresses a table contained in a document as a matrix structure and attaches table tags.

With reference to FIG. 3, the process of recognizing a table contained in a document and expressing it as a matrix structure is described.

FIG. 3(a) is an example of a table contained in a document.

The table structure identification unit 131 identifies the outline of the table to determine the entire area of the table within the document, and recognizes the cell lines within the table area to determine the area of each cell.
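Once the ruling lines of a table are known, the cell areas of a regular grid follow directly from consecutive line positions. The sketch below assumes line detection has already produced the x-coordinates of the vertical lines and the y-coordinates of the horizontal lines; the function name and the coordinates are hypothetical, and merged cells are not handled here.

```python
def grid_cells(vertical_xs, horizontal_ys):
    """Map 1-based (row, col) positions to cell boxes (x0, y0, x1, y1)."""
    xs, ys = sorted(vertical_xs), sorted(horizontal_ys)
    cells = {}
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            cells[(r + 1, c + 1)] = (xs[c], ys[r], xs[c + 1], ys[r + 1])
    return cells


# Five vertical and three horizontal lines give a 2 x 4 grid.
cells = grid_cells([0, 100, 200, 300, 400], [0, 50, 100])
print(len(cells))      # 8
print(cells[(1, 1)])   # (0, 0, 100, 50)
```

The (row, col) keys are the matrix positions that the table tags later refer to.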

The overall structure of the table can then be expressed as a matrix structure, as in (b).

In addition, the table structure identification unit 131 defines, for each identified cell, a table tag that indicates the position of the cell and whether it is merged.

In particular, for a region where cells have been merged, the table tag is defined so that it includes the range of the merged cells.

For example, the table in FIG. 3(a) is determined to have 2 rows and 4 columns, and can be defined with the table tag '<table2*4>'.

The first cell of the table in FIG. 3(a), the cell containing the text '구분' ("Category"), is recognized in the matrix structure as a merge of cells (1,1) and (1,2), and its table tag can be defined as '<data:(1,1)(1,2)>'.
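The two tag forms quoted above can be generated from a list of cells in which each cell lists the matrix positions it covers. This is a sketch under assumptions: only the `<table2*4>` and `<data:(1,1)(1,2)>` forms come from the text, while the helper names and the layout of the remaining cells (taken here as unmerged) are hypothetical because FIG. 3 is not reproduced in this section.

```python
def table_tag(rows, cols):
    """Tag for the overall matrix size, e.g. 2 rows by 4 columns."""
    return f"<table{rows}*{cols}>"


def cell_tag(positions):
    """Tag for one cell; a merged cell lists every (row, col) it spans."""
    return "<data:" + "".join(f"({r},{c})" for r, c in sorted(positions)) + ">"


def covers_grid(rows, cols, cells):
    """Check that the cells cover every matrix position exactly once."""
    seen = [pos for positions in cells for pos in positions]
    expected = {(r, c) for r in range(1, rows + 1) for c in range(1, cols + 1)}
    return len(seen) == len(set(seen)) and set(seen) == expected


# First cell merged across (1,1) and (1,2); the other cells are assumed unmerged.
cells = [[(1, 1), (1, 2)], [(1, 3)], [(1, 4)], [(2, 1)], [(2, 2)], [(2, 3)], [(2, 4)]]
print(table_tag(2, 4))     # <table2*4>
print(cell_tag(cells[0]))  # <data:(1,1)(1,2)>
```

The coverage check is a cheap guard against a cell-detection error: a missed or double-counted cell shows up as a gap or a duplicate position before any tag is written.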

FIG. 3(c) is an example in which table (a) has been recognized and table tags attached.

For example, as in FIG. 3(c), the table tags may follow the structure of a markup language similar to HTML or XML, with start tags, end tags, elements, and attributes.

However, the tags are not limited to this and may be any tags that express document formatting in a text format.

Because the table structure is defined as a matrix structure in this way, the tagged table format data can be accurately reproduced as a table, and the reproduced table can be edited, for example by merging or adding cells.
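To illustrate why the matrix positions make a table reproducible, the sketch below parses tagged text back into a grid. The serialized layout (a `<table R*C>` tag followed by `<data:...>` tags, each followed by its cell text) is an assumption based on the two tag forms quoted in the description, and the sample cell texts are invented; the actual format in FIG. 3(c) may differ, for instance by using end tags.

```python
import re

TABLE_RE = re.compile(r"<table(\d+)\*(\d+)>")
CELL_RE = re.compile(r"<data:((?:\(\d+,\d+\))+)>([^<]*)")
POS_RE = re.compile(r"\((\d+),(\d+)\)")


def rebuild(tagged):
    """Rebuild a rows x cols grid; a merged cell's text fills every position it spans."""
    size = TABLE_RE.search(tagged)
    rows, cols = int(size.group(1)), int(size.group(2))
    grid = [[None] * cols for _ in range(rows)]
    for cell in CELL_RE.finditer(tagged):
        text = cell.group(2).strip()
        for r, c in POS_RE.findall(cell.group(1)):
            grid[int(r) - 1][int(c) - 1] = text
    return grid


tagged = (
    "<table2*4>"
    "<data:(1,1)(1,2)>Category<data:(1,3)>A<data:(1,4)>B"
    "<data:(2,1)>x<data:(2,2)>y<data:(2,3)>1<data:(2,4)>2"
)
for row in rebuild(tagged):
    print(row)
```

Any position left as `None` after parsing signals a cell that was never tagged, which is a simple consistency check on the tagged data.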

The cell text extraction unit 132 extracts text from within each cell area that has been recognized and defined by a table tag.

If the text in a cell area spans multiple lines, it is desirable to recognize the cell's border and wrap lines within the cell area, so that the text is read in the same order as in the original.
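One way to picture reading multi-line text within a cell boundary is sketched below. It assumes the OCR step already yields words with centre coordinates, which this description does not specify. The function name `read_cell_text` and the `line_tol` tolerance are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]          # (text, x centre, y centre), assumed OCR output
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) of the cell border


def read_cell_text(words: List[Word], cell: Box, line_tol: float = 5.0) -> str:
    """Join the words inside one cell, line by line, top to bottom and left to right."""
    x0, y0, x1, y1 = cell
    # Keep only words whose centre lies inside the cell border.
    inside = sorted(
        (w for w in words if x0 <= w[1] <= x1 and y0 <= w[2] <= y1),
        key=lambda w: w[2],
    )
    lines: List[List[Word]] = []
    for w in inside:
        # A word joins the current line if its y centre is close to that line's first word.
        if lines and abs(w[2] - lines[-1][0][2]) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return "\n".join(
        " ".join(text for text, _, _ in sorted(line, key=lambda w: w[1]))
        for line in lines
    )
```

Restricting the search to the cell border keeps text from a neighbouring cell on the same visual row from being merged into the line.
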

Meanwhile, the table recognition unit 130 links the table format data, with its table tags attached, to the document structured data.

For example, when a document is recognized as having a title, header, footer, and body and is stored as such in the document structured data, and a table is located within the body, the table-tagged text may be included in the document structured data as a substructure of the body.

The document tagging unit 140 attaches document structuring tags, which identify the document structure, to the document structured data and converts it into text data in a predetermined format.

That is, the document structured data can be written and stored in a markup language using predefined document structuring tags.

Here, the markup language, like HTML or XML, includes start tags, end tags, elements, and attributes, and can express the structure of the document.
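A minimal sketch of this conversion, using Python's standard `xml.etree.ElementTree`, is given below. The tag names (`document`, `title`, `body`) and the `type` attribute are illustrative assumptions. The actual document structuring tags are those predefined for each document type and are not listed here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def to_markup(doc_type: str, items: List[Tuple[str, str]]) -> str:
    """Serialize (format item, text) pairs as XML-like markup text.

    Assumption: each format item name is used directly as an element name,
    and the document type is stored as an attribute on the root element.
    """
    root = ET.Element("document", {"type": doc_type})
    for name, text in items:
        child = ET.SubElement(root, name)
        child.text = text
    return ET.tostring(root, encoding="unicode")


markup = to_markup("report", [("title", "Annual Report"), ("body", "Revenue rose.")])
print(markup)
```
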

FIG. 4 shows data produced by the device 100, which builds a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR according to an embodiment of the present invention. The device recognizes the document structure and text in an image file containing text, attaches document structure tags, and converts the result into text data in a predetermined format. The figure shows the generated document structured data expressed in a markup language, with the document structure tag predefined for each format item attached.

A document that has been tagged and converted into markup form as in FIG. 4 can be stored flexibly as text regardless of the document's structure, and can also be converted into structured data for use.

In addition, the structure of a document can be defined in advance as a rule (schema) for each document type. Tagged document text can then be parsed according to that rule (schema) and converted into document structured data, and only the required format text can be selectively extracted for editing or processing.

For example, headers, footers, and page items that appear repeatedly in a document can be identified and excluded when the document is processed.

As another example, when extracting keywords for a document, the text recognized as the title can be used instead of analyzing the entire document content.
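The two uses just described, dropping repeated items and taking keywords from the title only, could look roughly like the sketch below. The schema here is simplified to a set of allowed element names per document type. That simplification, the element names, and the function names are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set, Tuple

# Hypothetical schema: the element names allowed for each document type.
SCHEMAS: Dict[str, Set[str]] = {
    "report": {"title", "header", "body", "footer", "page"},
}
# Items that repeat on every page and are dropped during processing.
REPEATED: Set[str] = {"header", "footer", "page"}


def parse_items(xml_text: str) -> List[Tuple[str, str]]:
    """Parse tagged text into (format item, text) pairs, checked against the schema."""
    root = ET.fromstring(xml_text)
    doc_type = root.get("type", "")
    if doc_type not in SCHEMAS:
        raise ValueError(f"no schema for document type {doc_type!r}")
    items = []
    for child in root:
        if child.tag not in SCHEMAS[doc_type]:
            raise ValueError(f"tag <{child.tag}> is not allowed for {doc_type!r}")
        items.append((child.tag, child.text or ""))
    return items


def content_only(items: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop headers, footers, and page items."""
    return [(tag, text) for tag, text in items if tag not in REPEATED]


def title_text(items: List[Tuple[str, str]]) -> str:
    """Return only the title text, e.g. as input to keyword extraction."""
    return " ".join(text for tag, text in items if tag == "title")
```
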

FIG. 5 is a diagram showing the hardware structure of the device 100, which builds a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR according to an embodiment of the present invention.

Referring to FIG. 5, the hardware structure of the device 100, which builds a dataset by recognizing documents with a complex structure, comprises a central processing unit 1000, a memory 2000, a user interface 3000, a database interface 4000, a network interface 5000, a web server 6000, and the like.

The user interface 3000 provides the user with input and output interfaces by means of a graphical user interface (GUI).

The database interface 4000 provides an interface between the database and the hardware structure.

The network interface 5000 provides network connections between devices held by the user.

The web server 6000 provides a means for users to access the hardware structure over a network. Most users can connect to the web server remotely to use the device 100, which builds a dataset by recognizing documents with a complex structure.

Each component or method step described above may be implemented as computer-readable code on a computer-readable recording medium or transmitted through a transmission medium. A computer-readable recording medium is a data storage device capable of storing data that can be read by a computer system.

Examples of computer-readable recording media include, but are not limited to, databases, ROM, RAM, CD-ROM, DVD, magnetic tape, floppy disks, and optical data storage devices. Transmission media may include carrier waves transmitted over the Internet or various types of communication channels. The computer-readable recording medium may also be distributed across network-coupled computer systems so that the computer-readable code is stored and executed in a distributed manner.

In addition, at least one component applied to the present invention may include, or be implemented by, a processor such as a central processing unit (CPU) or microprocessor that performs the respective function, and two or more of the components may be combined into a single component that performs all operations or functions of the combined components. Some of the functions of at least one component applied to the present invention may also be performed by another of these components. Communication between the components may be performed over a bus (not shown).

FIG. 6 is a flowchart of a method for building a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR according to an embodiment of the present invention.

The method for building a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR is described below with reference to FIG. 6.

The device 100, which builds a dataset by recognizing documents with a complex structure, may receive an image file and/or a PDF file containing text from the user terminal 200 or the external server 300.

On receiving the file, the device 100 checks the type of the input target document from the target document's meta information, determines the format items to be identified for that document type, and performs the step of extracting the text corresponding to those format items (S610).

First, the device 100 determines the document type from the meta information of the target document.

It then checks the template data for that document type and extracts text according to the location areas and pattern rules of the format items defined in the template data, generating format item-text content pair data.
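Step S610 can be pictured with the sketch below, in which template data maps each format item to a location area and a pattern rule. The template contents, the coordinate values, the OCR line format, and the name `extract_pairs` are all hypothetical. The description states only that template data defines location areas and pattern rules per document type.

```python
import re
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) on the page image
OcrLine = Tuple[str, float, float]       # (text, x centre, y centre), assumed OCR output

# Hypothetical template data: format item -> (location area, pattern rule).
TEMPLATES: Dict[str, Dict[str, Tuple[Box, str]]] = {
    "report": {
        "title": ((0, 0, 600, 80), r".+"),
        "page": ((0, 780, 600, 842), r"\d+"),
    },
}


def extract_pairs(doc_type: str, lines: List[OcrLine]) -> List[Tuple[str, str]]:
    """Return (format item, text) pairs for lines matching both area and pattern."""
    pairs = []
    for item, ((x0, y0, x1, y1), pattern) in TEMPLATES[doc_type].items():
        for text, x, y in lines:
            in_area = x0 <= x <= x1 and y0 <= y <= y1
            if in_area and re.fullmatch(pattern, text):
                pairs.append((item, text))
    return pairs


# The document type would come from the file's meta information.
meta = {"doc_type": "report"}
ocr_lines = [("Annual Report", 300, 40), ("Revenue rose.", 300, 400), ("3", 300, 800)]
print(extract_pairs(meta["doc_type"], ocr_lines))
```
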

After step S610, the step of generating document structured data that includes at least one format item identified in the target document and relationship information about the text of the format item (S620) is performed.

Here, the document structured data may store at least one item of format item-text pair data together with the order and inclusion relationships among the pairs.
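One possible in-memory shape for such data is sketched below. Order is represented by position in a list and inclusion by nesting. The class name `FormatItem` and its fields are assumptions, since the description does not fix a concrete data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FormatItem:
    """A format item-text pair; child order encodes sequence, nesting encodes inclusion."""
    name: str
    text: str = ""
    children: List["FormatItem"] = field(default_factory=list)

    def add(self, child: "FormatItem") -> "FormatItem":
        self.children.append(child)
        return child


def flatten(item: FormatItem, depth: int = 0) -> List[Tuple[int, str]]:
    """List (depth, name) in document order, showing both relationships."""
    out = [(depth, item.name)]
    for child in item.children:
        out.extend(flatten(child, depth + 1))
    return out


doc = FormatItem("document")
doc.add(FormatItem("title", "Annual Report"))
body = doc.add(FormatItem("body", "Revenue rose."))
body.add(FormatItem("table"))  # a table recognized inside the body
print(flatten(doc))
```
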

Meanwhile, after the document's format items such as title, header, body, and footer are recognized in step S620, table recognition may be performed on format items that can contain a table, such as the body.

In step S620, only the area where the table is located is identified. Analysis of the table's structure and extraction of its text are carried out afterward.

After step S620, the step of identifying the structure of the table included in the target document and extracting the text included in the table's cell areas to generate table format data (S630) is performed.

Step S630 is described in detail with reference to FIG. 7.

FIG. 7 is a flowchart showing in detail the step of generating table format data in the method for building a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR according to an embodiment of the present invention.

Referring to FIG. 7, in step S630 each cell area of the table is identified in the image data of the area where the table is located in the target document, and the matrix structure of the whole table is determined (S631).

First, the outline of the table is identified to determine the full area of the table within the document, and then the cell lines within the table area are recognized to determine the cell areas.
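Assuming the cell lines have already been detected as x positions of vertical lines and y positions of horizontal lines (the detection method itself is not described here), the cell areas of a regular, unmerged grid could be derived as in the sketch below. The name `cell_regions` is hypothetical, and merged cells are outside the scope of this sketch.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def cell_regions(xs: List[float], ys: List[float]) -> Dict[Tuple[int, int], Box]:
    """Map each 1-based (row, column) position to the box between adjacent lines.

    xs: x positions of vertical lines, including the table's left and right outline.
    ys: y positions of horizontal lines, including the top and bottom outline.
    """
    xs = sorted(set(xs))
    ys = sorted(set(ys))
    return {
        (r + 1, c + 1): (xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(len(ys) - 1)
        for c in range(len(xs) - 1)
    }


# Five vertical and three horizontal lines give a 2-row, 4-column matrix.
cells = cell_regions([0, 50, 100, 150, 200], [0, 20, 40])
print(len(cells), cells[(1, 1)])
```
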

Next, a table tag is defined for each identified cell area, the text of the cell area is extracted, and the table tag is attached (S632).

Here, the table tag is formulated in matrix form so that a cell's position within the table and any merged structure can be determined.

Next, the table format data with table tags attached is linked to the document structured data (S633).

Because the document structured data contains the format items recognized and extracted in step S610, such as title, body, header, footer, and page, together with their text, the table format data can be included as a sub-format of the already extracted format item in which the table was recognized. For example, if a table is recognized in the body, the table format data is included as a sub-format of the body.

Referring again to FIG. 6, once step S630 has been performed, recognition of the target document's structure is complete. The step of attaching document structuring tags that identify the document structure to the structured data, in which the document's structure and text are stored, and converting it into text data in a predetermined format (S640) is then performed.

Here, the text data in a predetermined format may be in a markup language similar to HTML or XML.

For example, the markup language may include predefined tags, elements, and attributes, and tagging rules (a schema) may be defined and set in advance for each document type.

The converted text data in the predetermined format may then be stored in storage or in a database (DB).

In addition, at the request of the user terminal 200 and/or the external server 300, the text data can be parsed by applying the rule (schema) defined for the document type, and the text of the required format items can be extracted and delivered, or the text data can be processed and provided.

The table format data can also be reproduced as a table or edited for use.

Through the document recognition method described above, documents with a complex structure, including tables, can be recognized in structured form, and the structured, tagged data is reproducible and can therefore be processed and used in various forms.

100: Device that builds a dataset by recognizing documents with a complex structure
110: OCR analysis unit
111: Format identification unit 112: Text extraction unit
120: Document structuring unit
130: Table recognition unit
131: Table structure identification unit 132: Cell text extraction unit
140: Document tagging unit
200: User terminal
300: External server

Claims

In a device that builds a dataset by recognizing documents with a complex structure, identifying format items according to the type of an input target document and extracting text corresponding to the format items;
generating document structured data including at least one format item identified in the target document and relationship information about the text of the format item;
identifying the structure of a table included in the target document and extracting text included in the cell areas of the table to generate table format data; and
attaching document structuring tags that identify the document structure to the document structured data and converting it into text data in a predetermined format,
wherein the target document includes at least one of an image containing text and a PDF file,
wherein the step of identifying format items according to the type of the input target document and extracting text corresponding to the format items comprises:
identifying the document type from meta information of the target document; and
applying optical character recognition (OCR) technology, checking the template data for the target document type, and extracting the text located in the area of each format item according to the format item identification rules defined in the template data, to generate format item-text content pair data,
wherein the format item identification rules defined in the template data include position coordinates on the document image and character style,
wherein the step of attaching the document structuring tags and converting into text data in a predetermined format generates XML-based document text by attaching document structuring tags to the document structured data, based on a schema that defines the structure of each document for each target document type, and
wherein the XML-based document text can be parsed according to the schema and converted into document structured data, and the text of a specific format item can be extracted,
a method for building a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR.

Deleted

The method of claim 1,
wherein the step of identifying the structure of the table included in the target document and extracting text included in the cell areas of the table to generate table format data comprises:
identifying each cell area of the table in the image data of the area where the table is located in the target document to determine the matrix structure of the whole table;
defining a table tag for each identified cell area, extracting the text of the cell area, and attaching the table tag; and
linking the table format data with table tags attached to the document structured data,
a method for building a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR.

The method of claim 4,
wherein the table tag
expresses the relative position of each cell within the table, formulated in matrix form,
a method for building a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR.

The method of claim 5,
wherein the table tag
includes a cell tag, and the cell tag indicates information on a specific cell having been added or on a plurality of cells having been merged,
a method for building a dataset by recognizing documents with a complex structure including tables using document structuring tags when performing OCR.

Deleted