KR102394480B1

KR102394480B1 - Methods and systems for syntactic and semantic information extraction from plant procedures

Info

Publication number: KR102394480B1
Application number: KR1020200127076A
Authority: KR
Inventors: 최용선; 민 덕 응웬
Original assignee: 인제대학교 산학협력단
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2022-05-04
Also published as: KR20220043546A

Abstract

The present invention proposes a method and system for extracting syntactic and semantic information that accurately extracts the syntactic and semantic information contained in each statement (sentence) of a procedure, so that a large plant such as a nuclear power plant can be operated and maintained safely and efficiently. According to the invention, the system includes: a preprocessing unit comprising a non-text processing unit that removes image and table objects from an input procedure and a text processing unit that extracts structural features of the text of the procedure from which the image and table objects have been removed; an extended natural language processing unit that receives text information from the preprocessing unit and analyzes and corrects the text using natural language processing (NLP) technology; an information extraction unit that, for each paragraph of the NLP-processed procedure, identifies all significant semantic entities in the paragraph, classifies the paragraph type, and identifies the detailed components of action-statement paragraphs; and an output unit that generates and outputs all of the extracted information in the form of output documents.

Description

Methods and systems for syntactic and semantic information extraction from plant procedures

The present invention relates to the operation and management of procedures. In particular, it relates to a method and system for extracting syntactic and semantic information that accurately extracts the syntactic and semantic information contained in each statement (sentence) of a procedure, so that the procedure can be checked against all requirements of the procedure writing guidelines established to eliminate human-error factors, thereby allowing a large plant such as a nuclear power plant to be operated and maintained safely and efficiently.

In this specification, a plant refers to a large-scale facility such as a nuclear power plant, an oil refinery, a desalination facility, or a construction project; the embodiment described below uses a nuclear power plant as its example. Accordingly, the plant procedures of the present invention are not limited to the nuclear power plant procedures used as the example in the embodiment, and the invention naturally applies to procedures for other types of plants as well.

Procedures play a key role in ensuring safe and carefully controlled operation of a plant with many facilities, such as a nuclear power plant, by broadly supporting the activities of every worker who operates and maintains equipment. Procedures also convey to system operators the knowledge and activities of the specialists who designed and built the plant, including the various requirements the plant must meet, such as safety standards, and how those requirements were actually implemented; beyond that, they play an important role in training all plant workers on this material. Procedures also help plant managers understand how to meet the standards and expectations for plant operation and maintenance precisely.

Procedures must therefore be technically and operationally accurate, integrating the latest knowledge available in all relevant areas, including the requirements, policies, physical facilities, processes, and personnel needed to operate the plant safely. As controlled technical documents, procedures must also present the purpose of every work activity, program, or process in the plant, the intent of each specific task, and the content and sequence of the work clearly and understandably, so that they do not induce human errors or mistakes and so that the quality and safety of every worker's performance can be ensured.

Depending on the type of plant, a large number of procedures of many kinds are required. A large plant such as a nuclear power plant needs many procedures with various cross-reference relationships, reflecting its scale and the complexity of its processes, and those procedures must be revised continuously. Nuclear power plants in particular require finely subdivided technical procedures for complex equipment, and developing and improving them requires many experts. The technical backgrounds and work experience of these experts with respect to a particular piece of equipment can vary widely, and their understanding of the complex requirements placed on procedures is not uniform either.

It is therefore very difficult for such a varied group of experts to review the technical content of a procedure consistently while taking into account all applicable operating experience, important technical issues, human-error factors affecting workers, and ways to improve productivity. Managing a large volume of procedures while satisfying every standard the plant requires is likewise an extremely difficult task.

A solution is therefore needed that makes it possible, in a more efficient way, to develop sound procedures that satisfy all of the standards and to improve them continuously. As a base technology for efficient and effective procedure management, a methodology is required for extracting all significant syntactic and semantic information contained in procedures.

Korean Registered Patent No. 10-1917038 (2018. 11. 02., Apparatus for integrated management of reference documents and presentation of procedures required for preparing nuclear power plant commissioning test procedures)
Korean Registered Patent No. 10-1186965 (2012. 09. 24., System and method for computerized processing of procedures for nuclear power plants)

An object of the present invention is to solve the problems described above by making it possible to review written procedures efficiently and effectively while ensuring the consistency of the terms used in them.

Another object of the present invention is to improve the accuracy of the extracted information by reducing the part-of-speech (POS) tagging and parsing errors that can occur during natural language processing.

To achieve these objects, the present invention provides a system for extracting syntactic and semantic information from procedures, comprising: a preprocessing unit including a non-text processing unit that removes image and table objects from an input procedure and a text processing unit that extracts structural features of the text of the procedure from which the image and table objects have been removed; an extended natural language processing unit that receives text information from the preprocessing unit, analyzes the text using natural language processing (NLP) technology, and corrects the result; an information extraction unit that, for each paragraph of the NLP-processed procedure, identifies all significant semantic entities in the paragraph, classifies the paragraph type, and identifies the detailed components of action-statement paragraphs; and an output unit that generates and outputs all of the extracted information in the form of output documents.

The extended natural language processing unit includes: a first natural language processing unit for tokenization, sentence splitting, and lemmatization; a second natural language processing unit for part-of-speech tagging and parsing; and a third natural language processing unit that uses internal rules integrated with a vocabulary database to detect and correct wrong POS tags and parsing results produced during natural language processing.

The parsing is performed in two ways: constituency-based and dependency-based.

The information extraction unit includes: a semantic entity identification unit that identifies the semantic entities; a paragraph type classification unit that classifies the type of each paragraph of the procedure; and a component identification unit that identifies the detailed components of action-statement paragraphs. After the semantic entity identification unit has run, the paragraph type classification unit and the component identification unit may run in either order.

The semantic entity identification unit identifies the significant semantic entities in each paragraph of the procedure by applying an ontology lookup method and an internal rule method.

Tagging by the ontology lookup method is applied only to tokens that match a semantic entity contained in the ontology.

Tagging by the internal rule method assigns a predefined concept to each token that satisfies a conditional expression involving POS tags, syntactic tags, and semantic tags.
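The two tagging methods can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Token` fields, the three-entry `ONTOLOGY`, the concept names, and the single rule are hypothetical and are not taken from the patent, which does not disclose its ontology or rule set.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    pos: str    # POS tag from the NLP stage (Penn Treebank style, assumed)
    dep: str    # dependency label relative to the head token (assumed)
    head: int   # index of the head token within the sentence
    concepts: list = field(default_factory=list)  # semantic tags

# Hypothetical mini-ontology mapping surface forms to concepts.
ONTOLOGY = {"open": "ActionVerb", "pump": "Equipment", "valve": "Equipment"}

def tag_by_ontology(tokens):
    """Ontology lookup: tag only tokens whose text matches an ontology entry."""
    for tok in tokens:
        concept = ONTOLOGY.get(tok.text.lower())
        if concept is not None:
            tok.concepts.append(concept)

def tag_by_rules(tokens):
    """Hypothetical rule combining a POS, a syntactic, and a semantic condition."""
    for tok in tokens:
        head = tokens[tok.head]
        if (tok.pos.startswith("NN")                # POS condition
                and tok.dep == "compound"           # syntactic condition
                and "Equipment" in head.concepts):  # semantic condition
            tok.concepts.append("EquipmentModifier")

sentence = [
    Token("Open", "VB", "root", 0),
    Token("charging", "NN", "compound", 2),
    Token("pump", "NN", "dobj", 0),
]
tag_by_ontology(sentence)  # lookup runs first so rules can use its tags
tag_by_rules(sentence)
```

In this sketch the rule pass runs after the lookup pass because the rule's semantic condition reads the tags that the lookup assigned; that ordering is a design choice of the sketch, not something the text prescribes.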

The paragraph type classification unit classifies paragraphs into the various paragraph types belonging to each of three groups: a first group containing an action verb and its target object; a second group that is more closely related to a specific action statement than the third group; and a third group that is less closely related to a specific action statement than the second group. In the embodiment, paragraphs are classified into a total of 17 paragraph types.

The component identification unit identifies the action verb and the target object, as well as each of the remaining optional components, according to the POS tags, semantic entity tags, and parsing tags.

According to another feature of the present invention, there is provided a method for extracting the syntactic and semantic information contained in a procedure, comprising: a first step of preprocessing the procedure; a second step of performing natural language processing to analyze the text information produced by the preprocessing, and detecting and correcting the POS tag and parsing errors that occur during the natural language processing; a third step of using the results of the first and second steps to perform, for each paragraph of the procedure, semantic entity identification, paragraph type classification, and identification of the detailed components of action-statement paragraphs; and a fourth step of generating and outputting the extracted information as output documents of two or more forms.

The second step detects and corrects POS tag and parsing errors using internal rules and a vocabulary database built for the target plant.

The semantic entity identification of the third step combines an ontology lookup method, which applies when the word in a token is contained in the ontology, with a pattern-based rule method, which assigns a predefined concept to each token that satisfies a conditional expression involving POS tags, syntactic tags, and semantic tags.

The paragraph type classification of the third step classifies paragraphs into the various paragraph types belonging to each of a first group containing an action verb and its target object, a second group that is more closely related to a specific action statement than the third group, and a third group that is less closely related to a specific action statement than the second group.

The component identification of the third step identifies the action verb and the target object, as well as each of the remaining optional components, according to the POS tags, semantic entity tags, and parsing tags.

With the method and system of the present invention for extracting syntactic and semantic information from plant procedures, the terms used in procedures can be written consistently, and the accuracy of the extracted information can be improved by reducing the POS tagging and parsing errors that can occur during natural language processing.

In addition, because all extraction results are generated and provided as output documents of various forms, feedback can be collected from domain experts and used to verify the extraction results.

Furthermore, because the ontology and the rules can be extended, the performance of a procedure writing system can also be expected to improve.

FIG. 1 is an overall configuration diagram of the system for extracting syntactic and semantic information from nuclear power plant procedures according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the process of extracting syntactic and semantic information according to the present invention.
FIG. 3 shows the results produced by the extended natural language processing unit and the information extraction unit of the present invention.
FIG. 4 is a tree diagram illustrating examples of the two types of parsing used in the natural language processing of the present invention.
FIG. 5 is a parse tree diagram showing a parsing error and the corrected result in the natural language processing of the present invention.
FIG. 6 is a conceptual diagram of how components are arranged within an action statement, used to explain the identification of the components that an action statement may contain during information extraction.
FIG. 7 shows examples of the output documents produced by the information extraction process of the present invention.

The objects and effects of the present invention, and the technical configurations for achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. In describing the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the intentions or customs of users and operators.

The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the invention is complete and so that a person of ordinary skill in the art to which the invention belongs is fully informed of its scope; the invention is defined only by the scope of the claims. Accordingly, the definitions of the terms should be based on the content of this specification as a whole.

Before describing the present invention, the concepts of information extraction and ontology are reviewed. Information extraction is the process of selectively structuring and combining the large amount of data contained in machine-readable documents; natural language processing (NLP) technology is normally used to analyze the text of an input document before information is extracted. Information extraction within a specific domain can be expected to give considerably better results than when it is applied to ordinary non-technical text, because homonymy and coreference problems are reduced and the domain's technical terms can be interpreted more reliably. Domain-specific information extraction is often combined with a domain ontology to improve its performance further.

An ontology represents the knowledge of a specific domain in a form a computer can interpret. It consists of the main concepts of the domain, the relationships between concepts, and the individual entities belonging to each concept. Ontology construction, also known as ontology learning, can be approached in several ways depending on the technique used, including statistical, linguistic, machine learning, logical inference, and hybrid methods. Ontologies can be applied in the nuclear industry; examples include sharing information on failure analysis, the use of semantic web technology in reactor portals, and control room modernization projects. Examples of ontology use can be found in other industries as well.

The present invention proposes a methodology for extracting the syntactic and semantic information contained in procedures on the basis of the procedure writing guidelines and domain ontology mentioned above. The invention is described in more detail below with reference to the embodiment shown in the drawings.

FIG. 1 is an overall configuration diagram of the system for extracting syntactic and semantic information from nuclear power plant procedures according to an embodiment of the present invention. The system (10) shown in FIG. 1 automatically extracts all significant syntactic and semantic information contained in the input procedures.

As shown in FIG. 1, the system (10) of the present invention comprises a preprocessing unit (100) that preprocesses the procedure document file using the API provided by the word processor software; an extended natural language processing unit (NLP and make-up unit) (200) that uses an open natural language processing tool as a component to perform basic natural language processing and then corrects its results; and an information extraction unit (300) that identifies all significant semantic entities in the procedure, classifies paragraph types, and identifies the detailed components of action-statement paragraphs. In the embodiment, all of the units (100, 200, 300) were implemented in Microsoft C#. The open natural language processing tool can be chosen according to the language in which the procedure is written; examples include the 'Stanford Core NLP Toolkit' and 'Open Korean Text', but the tool is not limited to these. Although not shown in the drawing, the system also includes an output unit that outputs the extracted information in the form of output documents.

The preprocessing unit (100) includes a non-text processing unit (Non-Text object Handling) (110) and a text processing unit (Create and fill-in text feature for 'Paragraph' instances) (120). The non-text processing unit (110) handles the image objects in the procedure, that is, pictures, drawings, charts, and the like. In other words, it removes the non-text content from the procedure so that natural language processing can be performed. The non-text processing unit (110) stores each image object found in the procedure in the database (400, DB) together with its location information.

Non-text content in a procedure may also include tables. Table objects are therefore removed as well; for each table object, the structure of the table, including nesting relationships, is recorded, and any image objects inside the table are stored in the database (400) together with their location information.
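The separation of non-text objects can be sketched as a recursive walk over the document. This is a minimal illustration under stated assumptions: the dictionary-based document model, the `kind` values, and the tuple-valued `position` are hypothetical, a plain list stands in for the database (400), and the real system reads the document through a word processor API that is not shown here.

```python
def split_non_text(elements):
    """Separate text paragraphs from image/table objects.

    Returns (texts, stored): `texts` goes on to text processing and
    `stored` stands in for the records written to the database.
    """
    texts, stored = [], []

    def visit(items, path):
        for i, el in enumerate(items):
            pos = path + (i,)  # position as a path of indices (assumed form)
            if el["kind"] == "text":
                texts.append({"position": pos, "text": el["text"]})
            elif el["kind"] == "image":
                stored.append({"kind": "image", "position": pos})
            elif el["kind"] == "table":
                # Record the table structure, then descend into each cell so
                # that nested tables and images inside cells are captured too.
                stored.append({"kind": "table", "position": pos,
                               "n_cells": len(el["cells"])})
                for c, cell in enumerate(el["cells"]):
                    visit(cell, pos + (c,))

    visit(elements, ())
    return texts, stored

doc = [
    {"kind": "text", "text": "1.0 PURPOSE"},
    {"kind": "image"},
    {"kind": "table", "cells": [[{"kind": "text", "text": "Step"}],
                                [{"kind": "image"}]]},
]
texts, stored = split_non_text(doc)
```

Text found inside table cells is kept in `texts` here, so only the table's structure and its images are set aside.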

The text processing unit (120) covers all text paragraphs, including those inside table objects, and extracts the structural features of the text (text index, bullet symbols, etc.) and its text attributes (typeface, size, color, underline, etc.). It also extracts additional structural attributes such as the parent text paragraph in the hierarchy and the index level.
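As a rough illustration of the structural attributes, the sketch below derives an index level and a parent paragraph from dotted section numbers such as "4.1.1". The numbering convention, the `Paragraph` fields, and the rule that an unnumbered paragraph hangs under the most recent numbered one are all assumptions; font attributes such as typeface or color would come from the word processor API and are not modeled.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Dotted numeric index at the start of a paragraph, e.g. "4", "4.1", "4.1.1."
INDEX_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+")

@dataclass
class Paragraph:
    text: str
    index: Optional[str] = None   # e.g. "4.1"
    level: Optional[int] = None   # depth of the index, 1 for "4"
    parent: Optional[int] = None  # list position of the parent paragraph

def build_paragraphs(lines):
    paragraphs = []
    last_at_level = {}   # index level -> position of latest paragraph there
    last_indexed = None  # position of the latest numbered paragraph
    for line in lines:
        m = INDEX_RE.match(line)
        if m:
            index = m.group(1)
            level = index.count(".") + 1
            para = Paragraph(line[m.end():], index, level,
                             last_at_level.get(level - 1))
            last_at_level[level] = len(paragraphs)
            last_indexed = len(paragraphs)
        else:
            # Unnumbered text (notes, cautions) hangs under the latest step.
            para = Paragraph(line, parent=last_indexed)
        paragraphs.append(para)
    return paragraphs

paras = build_paragraphs([
    "4 PROCEDURE",
    "4.1 Start the pump",
    "NOTE: Check oil level.",
    "4.1.1 Open valve V-1",
])
```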

The extended natural language processing unit (NLP unit) (200) according to the present invention is a natural language processing tool that takes only the text information from the preprocessing unit (100) as input and performs basic natural language processing followed by correction. It comprises first to third natural language processing units (210, 220, 230).

The first natural language processing unit (210) performs tokenization, sentence splitting, and lemmatization; the second natural language processing unit (220) performs part-of-speech (POS) tagging and parsing; and the third natural language processing unit (230) applies predefined rules to correct errors in morphological analysis and POS tagging. The extended natural language processing unit (200) runs the first (210), second (220), and third (230) natural language processing units in that order.
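The correction stage can be illustrated with one rule of the kind such a unit might apply. The rule below is an assumption chosen for illustration, not one disclosed in the patent: procedure steps are usually imperative, and a general-purpose tagger may label a sentence-initial verb such as "Open" as an adjective, so the sketch retags a first word found in a hypothetical action-verb lexicon.

```python
# Hypothetical domain lexicon of action verbs (stands in for the
# vocabulary database; the entries are illustrative only).
ACTION_VERBS = {"open", "close", "verify", "start", "stop"}

def correct_pos(tagged):
    """Return a corrected copy of one sentence given as (word, pos) pairs.

    Assumed rule: if the first word is a known action verb but was not
    tagged as a verb, retag it as a base-form verb (Penn Treebank "VB").
    """
    corrected = list(tagged)
    if corrected:
        word, pos = corrected[0]
        if word.lower() in ACTION_VERBS and not pos.startswith("VB"):
            corrected[0] = (word, "VB")
    return corrected

fixed = correct_pos([("Open", "JJ"), ("valve", "NN")])
```

Because parsing depends on the POS tags, a real implementation would presumably re-run or patch the parse after a tag is changed; the sketch stops at the tag itself.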

The information extraction unit (300) according to the present invention uses the results of the preprocessing unit (100) and the natural language processing unit (200) to perform three types of information extraction: identifying the significant semantic entities in each paragraph of the procedure, classifying the paragraph type, and identifying the detailed components of action-statement paragraphs. For this purpose, the information extraction unit (300) includes a semantic entity identification unit (semantic element extraction, SE) (310), a paragraph type classification unit (paragraph type classification, PC) (320), and an action-statement component identification unit (step components identification, CI) (330).

The semantic entity identification unit (310) extracts every word in the procedure that carries meaning, and the paragraph type classification unit (320) classifies the type of each paragraph. The action-statement component identification unit (330) distinguishes the action verb that directs the work and its target object, as well as the clauses and phrases that carry secondary meaning related to the instruction. After the semantic entity identification unit (310) has run, the paragraph type classification unit (320) and the action-statement component identification unit (330) can operate independently of each other.
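A minimal sketch of the two mandatory components, the action verb and its target object, follows. It assumes each token carries a POS tag, a dependency label, and a head index, and it uses the Stanford-style labels `root` and `dobj`; the token format and the selection logic are hypothetical, and the optional components (conditions, locations, and so on) are not handled.

```python
def identify_components(tokens):
    """Find the action verb and target object of one action statement.

    tokens: list of dicts with 'text', 'pos', 'dep', and 'head' (index).
    Returns None when no root verb is found.
    """
    verb_i = next((i for i, t in enumerate(tokens)
                   if t["dep"] == "root" and t["pos"].startswith("VB")), None)
    if verb_i is None:
        return None
    # The target object is taken to be the direct object of the root verb.
    obj = next((t["text"] for t in tokens
                if t["dep"] == "dobj" and t["head"] == verb_i), None)
    return {"action_verb": tokens[verb_i]["text"], "target_object": obj}

step = [
    {"text": "Open", "pos": "VB", "dep": "root", "head": 0},
    {"text": "the", "pos": "DT", "dep": "det", "head": 2},
    {"text": "valve", "pos": "NN", "dep": "dobj", "head": 0},
]
components = identify_components(step)
```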

Next, a method of extracting the syntactic and semantic information contained in nuclear power plant procedures using the system configured as above is described with reference to FIGS. 1 to 7. FIG. 2 is a flowchart showing the process of extracting syntactic and semantic information according to the present invention.

The preprocessing unit (100) receives the document file of a nuclear power plant procedure (s100) and preprocesses it, handling non-text and text content separately (s110). Preprocessing removes the non-text content so that the natural language of the procedure can be processed, and extracts various feature information from the text. To this end, the non-text processing unit (110) stores objects such as pictures, drawings, charts, and tables found in the procedure in the database (400) together with their location information and table structure information; this is the step that removes image and table objects (s112). The text processing unit (120) extracts structural features such as the text index and bullet symbols, and attributes such as typeface, size, color, and underline. The text processing unit (120) may also extract other feature information that the text carries (s114). Only the text information extracted in this way is passed to the natural language processing unit (200).

The natural language processing unit 200 performs natural language processing on the text information in the order of the first to third natural language processing units 210, 220, and 230 (s120). Specifically, the first natural language processing unit 210 first divides a text paragraph into separate tokens covering all types of text, such as words, numbers, punctuation marks, and symbols (s122). For English text, tokens may also be split at the spaces between words. Each token may carry additional attribute information, such as the length of the text it contains and index information for its start and end positions.

An example of the tokenization result is shown in FIG. 3, which illustrates the results produced by the natural language processing unit 200 and the information extraction unit 300. When tokenization of the statement in (1) of FIG. 3 is complete, the statement is divided into tokens on the basis of words and punctuation marks, as shown in the tokens panel of (2).
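Tokenization with position attributes can be illustrated with a short regular-expression sketch. The pattern and the `tokenize` function are assumptions made for illustration; the tokenizer actually used in the invention is not specified in the description.

```python
import re

# A word (letters, digits, underscore) or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(paragraph):
    """Return tokens with their text, start/end offsets, and length."""
    return [
        {
            "text": match.group(),
            "start": match.start(),
            "end": match.end(),
            "length": len(match.group()),
        }
        for match in TOKEN_PATTERN.finditer(paragraph)
    ]
```

For a statement such as "TURN FCEDS13, Flow Station Pump Switch to the ON position." (reconstructed here from the tokens listed in the description), each word and each punctuation mark becomes its own token.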

When token division is complete, sentence splitting is performed (s124). Sentence splitting divides a text paragraph at sentence boundary markers such as periods, question marks, and exclamation marks. Then, for each token in the split sentences whose word is an inflected or derived form, lemmatization is performed: the base form of the word is determined by morphological analysis and is also stored in the token (s126).
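Sentence splitting and lemmatization can be sketched as below. The tiny `LEMMAS` table is a hypothetical stand-in for real morphological analysis, and both function names are illustrative only.

```python
import re

# Hypothetical lookup table standing in for a morphological analyzer.
LEMMAS = {"pumps": "pump", "replaced": "replace", "is": "be", "inches": "inch"}

def split_sentences(paragraph):
    """Split at '.', '?' or '!' when followed by whitespace (so 0.180 stays intact)."""
    parts = re.split(r"(?<=[.?!])\s+", paragraph.strip())
    return [part for part in parts if part]

def add_lemmas(words):
    """Attach a base form to every word; unknown words keep their surface form."""
    return [{"text": w, "lemma": LEMMAS.get(w.lower(), w)} for w in words]
```

Requiring whitespace after the boundary marker is a simple way to avoid splitting inside decimal numbers such as 0.180, which are common in procedures.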

When the first natural language processing unit 210 has finished, the second natural language processing unit 220 performs part-of-speech (POS) tagging and syntactic parsing (s130). POS tagging assigns a part of speech to the word in each token (s132), as shown in (3) POS of FIG. 3: TURN-VB, FCEDS13-NNP, Flow-NNP, Station-NNP, Pump-NNP, Switch-NNP, to-TO, the-DT, ON-JJ, position-NN. Syntactic parsing produces a hierarchical tree of the tokens in each sentence (s134). There are two forms of parsing: constituency-based and dependency-based. [Table 1] below lists, in alphabetical order, the tags and relation labels used in this specification, including FIGS. 4, 5a, and 5b.

[Table 1] Tags and relation labels

Constituent tags
●ADVP: adverb phrase
●NP: noun phrase
●PP: prepositional phrase
●QP: quantifier phrase
●S: simple declarative clause
●SBAR: subordinate clause
●VP: verb phrase

Relation labels
●amod: adjectival modifier
●appos: appositional modifier
●case: case-marking element
●compound: pair of nouns (verbs) in compound relation
●det: determiner
●dobj: direct object
●nmod: nominal modifier
●root: points to the root of the sentence

POS tags
●.: sentence-final punctuation
●,: comma
●CD: cardinal number
●DT: determiner
●IN: preposition or subordinating conjunction
●JJ: adjective
●JJR: comparative adjective
●NN: noun, singular or mass
●NNP: proper noun, singular
●NNS: noun, plural
●RB: adverb
●TO: to
●VB: verb, base form
●VBP: verb, non-3rd person singular present
●VBZ: verb, 3rd person singular present

Constituency-based parsing represents the hierarchical structure of the sentence constituents, each as a group of tokens, whereas dependency-based parsing produces a directed tree in which pairs of lexical token nodes are connected by arrows. Examples of such parsing are shown in FIG. 4.

FIG. 4(a) shows the constituency-based parse tree of the action statement shown in (1) of FIG. 3. Each terminal node shows the POS tag of the token containing the displayed text, and every non-terminal node shows a constituent tag. FIG. 4(b) shows the dependency-based parse tree of the same action statement. Here, the label on each arrow indicates the grammatical role of the dependent token with respect to its head token.

The third natural language processing unit 230 improves information extraction performance by correcting the POS tagging and parsing errors that can occur during natural language processing. The POS tag of each word and the parse tree of each sentence are not unique; the unit detects and corrects errors in the results that the natural language processing components recommend as the most probable.

Conventional natural language processing recommends, for each word or sentence, the candidate with the highest value of a particular quantitative score among the possible choices. For general unstructured sentences, to which an annotated reference corpus is difficult to apply, systematic evaluation of POS tags and parse trees is therefore not meaningful, and every tag has to be evaluated and corrected manually by experts. Specialized technical documents such as nuclear power plant procedures, however, use many domain-specific terms, and each action statement is written in a semi-structured form based on imperative, assertive expressions. In addition, some syntactically or semantically important words, such as conditional or logical terms, action verbs, abbreviations, and symbolic codes, are written in capital letters for emphasis. Because of these characteristics of how procedures are written, general natural language processing methods produce a considerable number of POS tagging and parsing errors, and the final information extraction results may contain many errors. Such errors can lead to incorrect outcomes in the operation and maintenance of critical facilities such as nuclear power plants.

Accordingly, the present invention provides the third natural language processing unit 230, which corrects parsing and POS tagging errors by applying predetermined rules. That is, based on a lexical database built for nuclear power plants and on the particular way procedures are written, it detects and corrects incorrect POS tags and parsing results using simple internal rules integrated with the lexical database.

The process by which the third natural language processing unit 230 detects and corrects a POS tagging error is described below by way of example (s140).

Assume the following action statement; its initial parse tree is shown in FIG. 5a.

'IF as-found thickness is less than 0.180 inches,

THEN REPLACE X Relay Lever per 0_MNT_005, Relay Replacement.'

In FIG. 5a, the tokens 'THEN' and 'REPLACE' are incorrectly assigned the POS tags 'NNP' (proper noun, singular) and 'VBP' (verb, non-3rd person singular present), respectively. The constituent tag of 'THEN' is also incorrectly evaluated as 'NP' (noun phrase); that is, 'THEN' is misinterpreted as the subject of the verb 'REPLACE'.

The third natural language processing unit 230 therefore detects this POS tagging error (s142) and improves the parse tree by applying a rule-based POS tagging correction, as shown in FIG. 5b (s144). In FIG. 5b, the POS tag of 'THEN' has been changed to 'RB' (adverb), and the action statement has been parsed again with this tag, so that the POS tag of the 'REPLACE' token is corrected to 'VB' (verb, base form) and the constituent tag of 'THEN' is corrected to 'ADVP' (adverb phrase).
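The detection and correction step can be sketched as a lexicon-driven pass over the tagger output. The `KEYWORD_TAGS` lexicon and the `correct_pos` function are hypothetical; the description states only that rules integrated with a lexical database are applied, and that the sentence is parsed again afterwards.

```python
# Hypothetical lexicon: capitalized procedure keywords and their expected POS tags.
KEYWORD_TAGS = {"THEN": "RB"}

def correct_pos(tagged):
    """Fix known keyword tags and report which token positions were changed.

    `tagged` is a list of (word, tag) pairs from a general-purpose tagger.
    """
    corrected, changed = [], []
    for position, (word, tag) in enumerate(tagged):
        expected = KEYWORD_TAGS.get(word)
        if expected is not None and tag != expected:
            changed.append(position)   # error detected (s142)
            tag = expected             # rule-based correction (s144)
        corrected.append((word, tag))
    return corrected, changed
```

A non-empty `changed` list would signal that the sentence should be parsed again with the corrected tags fixed; in the example above, that second parse is what yields 'VB' for 'REPLACE' and 'ADVP' for 'THEN'. The re-parsing itself is outside this sketch.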

Through the third natural language processing unit 230, the present invention can thus prevent, as far as possible, the POS tagging and parsing errors that occur with general natural language processing methods.

Next is the information extraction process performed by the information extraction unit 300 (s150). In this process, based on the results of the natural language processing unit 200, the semantic entity identification unit 310, the paragraph type classification unit 320, and the action statement component identification unit 330 each perform a different type of information extraction. After the semantic entity identification unit 310 has run, the paragraph type classification unit 320 and the action statement component identification unit 330 may run in either order.

First, the process by which the semantic entity identification unit 310 identifies all significant semantic entities in each paragraph of the text is described (s152). Semantic entities are identified by two methods: (i) ontology lookup and (ii) internal rules.

In the ontology lookup method, when the word in a token (or its lemma) is contained in the ontology, the token is tagged with the corresponding concept. The result of ontology lookup tagging is shown in item (4) of FIG. 3.
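Ontology lookup by word or lemma can be sketched with a plain dictionary. The `ONTOLOGY` entries and the concept name `MeasurementUnit` are invented for illustration; only the SSCP concept name appears in the description.

```python
# Hypothetical ontology: entity (lowercase) -> concept.
ONTOLOGY = {"pump": "SSCP", "switch": "SSCP", "inch": "MeasurementUnit"}

def ontology_lookup(tokens):
    """Tag each token with a concept if its word or its lemma is in the ontology."""
    tagged = []
    for token in tokens:
        concept = ONTOLOGY.get(token["text"].lower())
        if concept is None:
            # Fall back to the lemma so inflected forms such as "inches" still match.
            concept = ONTOLOGY.get(token["lemma"].lower())
        tagged.append({**token, "concept": concept})
    return tagged
```

Tokens whose word and lemma are both absent from the ontology keep `concept = None` and are left for the internal rules to handle.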

The concepts in the ontology consist of the possible answer elements to the three questions in Table 2 below, which each action statement should address.

[Table 2]
1. Who performs the task? Organization, department, job position, etc.
2. What task is performed? Action verb, structure/system/equipment and parts, etc.
3. How is the task performed safely, efficiently, and correctly? Tool, material, measurement, unit of measure, criterion, state, etc.

In addition, items that frequently appear in actual procedures, such as various abbreviations and symbolic codes, were added to the concept list. For reference, the entity sets for each concept, in English and Korean, were developed from several sources such as specialist literature and actual procedures, and the concept list and the entity set of each concept were refined several times in cooperation with domain experts.

Meanwhile, the action verb is the key element of every action statement that the procedure user performs. It is therefore identified not by simple ontology lookup but as a verb that satisfies all of the conditions in Table 3 below.

[Table 3]
1. A verb in base form that is not connected with 'to'
2. A verb that has no nominative (subject) noun and is not related to a verb that has a nominative noun
3. In the case of the verb 'have', it must not be related to a tense expression
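The three conditions for an action verb can be expressed as a single predicate. This is a sketch under assumed inputs: the token fields `preceded_by_to`, `has_subject`, `linked_to_subject_verb`, and `is_tense_auxiliary` are hypothetical flags that a real system would derive from the POS tags and the dependency parse.

```python
def is_action_verb(token):
    """Return True only if the token meets all three conditions for an action verb."""
    # Condition 1: base-form verb that is not connected with 'to'.
    if token["pos"] != "VB" or token.get("preceded_by_to", False):
        return False
    # Condition 2: no subject noun, and not related to a verb that has one.
    if token.get("has_subject", False) or token.get("linked_to_subject_verb", False):
        return False
    # Condition 3: 'have' must not be part of a tense expression.
    if token["lemma"] == "have" and token.get("is_tense_auxiliary", False):
        return False
    return True
```

Missing flags default to False, so a bare imperative such as 'REPLACE' tagged 'VB' qualifies without further annotation.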

The ontology lookup method applies only to tokens that match semantic entities contained in the ontology, and the ontology is limited because it contains only representative semantic entities for each concept. Additional semantic entities therefore need to be identified.

According to the present invention, internal rules are used to identify semantic entities that are not contained in the ontology. These internal rules are expressed in a pattern-based form: tokens that satisfy a condition involving POS tags, syntactic tags, semantic tags, and the like are tagged with the concept specified by the rule. As an example, the following rule states that when noun tokens immediately precede an SSCP semantic entity token found by ontology lookup, the compound noun formed by the words in all of these tokens is identified as a new SSCP semantic entity.

((({POS = NN} | {POS = NNP})*
{OntologyLookup = SSCP}) : SSCPinstance
{POS !=~ "NN.*"})
-> SSCP = : SSCPinstance

The result of semantic entity tagging by this internal rule is shown in item (5) of FIG. 3. Applying the rule to the 'Pump Switch' tokens, which were tagged as an SSCP semantic entity by ontology lookup, and to the two immediately preceding tokens with the POS tag 'NNP' ('Flow' and 'Station') identifies a new SSCP semantic entity made up of the words in four tokens: 'Flow Station Pump Switch'.
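The effect of this compound-noun rule can be reproduced with a short procedural sketch. It is not the rule engine used by the invention: `extend_sscp` is a hypothetical function, and it assumes that each token already carries a `pos` tag and, where ontology lookup matched, a `concept` of "SSCP".

```python
NOUN_TAGS = {"NN", "NNP"}

def extend_sscp(tokens):
    """Extend each SSCP match leftwards over adjacent nouns; return entity strings."""
    entities, i, n = [], 0, len(tokens)
    while i < n:
        if tokens[i].get("concept") != "SSCP":
            i += 1
            continue
        # Absorb the run of noun tokens immediately before the SSCP match.
        start = i
        while (start > 0 and tokens[start - 1]["pos"] in NOUN_TAGS
               and tokens[start - 1].get("concept") != "SSCP"):
            start -= 1
        # Cover the whole run of tokens tagged SSCP by ontology lookup.
        end = i
        while end + 1 < n and tokens[end + 1].get("concept") == "SSCP":
            end += 1
        # The rule requires that the next token is not a noun.
        if end + 1 == n or not tokens[end + 1]["pos"].startswith("NN"):
            entities.append(" ".join(t["text"] for t in tokens[start:end + 1]))
        i = end + 1
    return entities
```

With the tags given in the description, the comma after 'FCEDS13' stops the leftward extension, so only 'Flow Station Pump Switch' is produced.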

In this way, the semantic entity identification unit 310 of the information extraction unit 300 identifies all significant semantic entities in a paragraph by applying the internal rule method together with the ontology lookup method.

When semantic entity identification is complete, paragraph type classification and component identification are performed; these two processes may be performed independently of each other.

First, paragraph types are classified (s154). According to the present invention, the paragraph type classification unit 320 classifies each paragraph of the text into one of several types (for example, 17 types) divided into three groups.

The first paragraph group consists of paragraphs that all contain an action verb and its target object. Depending on whether a special action verb or the conditional component of the action statement shown in FIG. 6 is present, it comprises the five paragraph types in Table 4 below.

[Table 4]
●(Unconditional) action statement
Begins with an action verb; may be preceded by a critical information or adverb component.
●Branch statement (GO TO ~ or PROCEED TO ~)
●Reference statement (REFER TO ~, SEE ~, USE ~, REPEAT ~ or PER)
Special forms of action statement that begin with a designated action verb.
●Conditional action statement (IF/WHEN <condition>, THEN <action>)
●Continuous action statement (WHILE/IF AT ANY TIME <condition>, <action>)
Special forms of action statement that begin with a designated keyword. Here, <condition> may consist of a single conditional clause, may be a compound of several conditional clauses joined by logical operators (for example, AND), or may be a list of conditional clauses presented after an expression such as 'any of the following'. The <action> clause generally has the form of an action statement.

The second paragraph group comprises the eight paragraph types in Table 5 below, which are closely related to specific action statements.

[Table 5]
●NCW heading
●NCW statement
A NOTE, CAUTION, or WARNING expression and each paragraph that follows it. Each of these is placed before the related action statement and must not contain any action.
●Hold point statement
HOLD POINT (including cases where a subject keyword precedes it, as in QA HOLD POINT).
●Record line
●Calculation line
●Confirmation signature line
When a new observation, or the result of a simple calculation using observations, must be recorded, a paragraph that is placed after the related action statement and contains space for the record. An independent paragraph that requires a confirmation signature is classified as a confirmation signature line.
●Logical operator paragraph
●List element
A logical operator paragraph is one in which a logical operator by itself forms a paragraph. A list element is each bulleted, itemized paragraph listed after an action statement that contains the phrase 'the following'.

The third paragraph group comprises the four paragraph types in Table 6 below, which are relatively weakly related to specific action statements.

[Table 6]
●(Sub)section title
●Caption of a figure or table
The title of a (sub)section, or the caption of a figure or table.
●Continuation expression paragraph
An expression paragraph inserted when an action statement continues on the next page.
●Information paragraph
A paragraph that does not belong to any of the preceding types.
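The keyword-driven paragraph types in Tables 4 to 6 lend themselves to a simple ordered rule list. The sketch below covers only a subset of the types and uses hypothetical regular expressions; unconditional action statements, record lines, list elements, and titles would need the tags and layout attributes produced by the earlier stages.

```python
import re

# Ordered rules: the first matching pattern decides the paragraph type.
RULES = [
    ("NCW heading", re.compile(r"^(NOTE|CAUTION|WARNING)\b")),
    ("Hold point statement", re.compile(r"^(\w+\s+)?HOLD POINT\b")),
    ("Continuous action statement", re.compile(r"^(WHILE|IF AT ANY TIME)\b")),
    ("Conditional action statement", re.compile(r"^(IF|WHEN)\b.*\bTHEN\b")),
    ("Branch statement", re.compile(r"^(GO TO|PROCEED TO)\b")),
    ("Reference statement", re.compile(r"^(REFER TO|SEE|USE|REPEAT)\b")),
    ("Logical operator paragraph", re.compile(r"^(AND|OR)$")),
]

def classify(paragraph, default="Information paragraph"):
    """Return the first paragraph type whose keyword pattern matches."""
    text = paragraph.strip()
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return default
```

The continuous action rule is listed before the conditional rule so that 'IF AT ANY TIME ...' is not captured by the more general 'IF ... THEN' pattern.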

Next is the process of identifying the components of an action statement (s156).

As shown in FIG. 6, an action statement contains two mandatory core components, the action verb and the target object, and may additionally contain four optional components (condition, critical information/critical location, adverb, and supporting information/non-critical location).

In the present invention, for every paragraph classified as an action statement, each of the six components that is present is identified. Pattern-based rules are used for this purpose. The condition of a pattern-based rule may include POS tags, semantic entity tags, parsing tags, and the like; for each paragraph, the rule whose condition matches the tags of the current paragraph is found, and the components specified by that rule are identified.

An example of component identification is given below. The rule identifies three components, the action verb, the target object, and the supporting information, from an action statement paragraph that satisfies three conditions.

(({Dependency = dobj} {SemanticType = ActionVerb}) : av
{Node = NP | POS = NN | POS = CD} : to1)
({Dependency = appos} {Node = NP | POS = NN | POS = CD} : to1
{Node = NP | POS = NN | POS = CD} : to2)
({Dependency = nmod} {Node = NP | POS = NN | POS = CD} : to1
{Node = PP | Node = VP | Node = SBAR} : si))
-> Action verb = av, Target object = to1 + to2, Supplemental information = si.

From FIGS. 4(a) and 4(b) described above, it can be confirmed that the three conditions of this pattern-based rule hold for the four variables: av is 'TURN', to1 is 'FCEDS13', to2 is 'Flow Station Pump Switch', and si is 'to the ON position'.

Item (6) of FIG. 3 shows the three action statement components identified by this process: 'TURN', corresponding to av, is identified as the action verb; 'FCEDS13, Flow Station Pump Switch', corresponding to to1 + to2, as the target object; and 'to the ON position', corresponding to si, as the supporting information component.
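The matching of the three conditions (dobj, appos, nmod) can be sketched over a list of dependency edges. The edge list, the `phrases` mapping from a head word to its full phrase, and the function `identify_components` are all assumptions made for illustration; the edges below follow the relations named in the rule, not a parse read from FIG. 4.

```python
def identify_components(edges, phrases, action_verbs):
    """Match the dobj/appos/nmod pattern and return the three components.

    `edges` holds (relation, head, dependent) triples; `phrases` maps a
    dependent word to the phrase it heads. Returns None if any condition fails.
    """
    found = {}
    for relation, head, dependent in edges:
        if relation == "dobj" and head in action_verbs:
            found["av"], found["to1"] = head, dependent
    if "to1" not in found:
        return None
    for relation, head, dependent in edges:
        if head != found["to1"]:
            continue
        if relation == "appos":
            found["to2"] = phrases.get(dependent, dependent)
        elif relation == "nmod":
            found["si"] = phrases.get(dependent, dependent)
    if "to2" not in found or "si" not in found:
        return None
    return {
        "action_verb": found["av"],
        "target_object": f'{found["to1"]}, {found["to2"]}',
        "supporting_information": found["si"],
    }
```

All three conditions must hold for the rule to fire, which is why the function returns None as soon as one relation is missing rather than producing a partial result.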

The results extracted by the information extraction unit 300 are provided as various forms of output (s160). Examples are shown in FIG. 7: FIG. 7(a) is a procedure in which each semantic entity is highlighted in a color that depends on its concept type, and FIG. 7(b) is a procedure in which each paragraph is highlighted in a color that depends on its classified type. These outputs are used to collect feedback from domain experts in order to verify the information extraction results, and also to extend the ontology and rules and thereby improve performance.
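Color-coded output of the kind described for FIG. 7(a) could be produced, for example, as HTML. The color table and the `highlight` function are hypothetical; the description does not state the output format or the colors used.

```python
import html

# Hypothetical concept-to-color table.
COLORS = {"SSCP": "#ffd54f", "ActionVerb": "#81d4fa"}

def highlight(tokens):
    """Render tokens as HTML, wrapping tagged tokens in a colored span."""
    parts = []
    for token in tokens:
        text = html.escape(token["text"])
        color = COLORS.get(token.get("concept"))
        if color:
            parts.append(f'<span style="background:{color}">{text}</span>')
        else:
            parts.append(text)
    return " ".join(parts)
```

Escaping the token text keeps symbols that occur in procedures, such as '<' and '&', from breaking the markup.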

Meanwhile, the method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The system, apparatus, and method according to embodiments of the present invention have been described in detail above. The embodiments described above may be applied equally to any electronic device. Implementation examples of various electronic devices to which embodiments of the present invention may be adapted are described below.

An electronic device according to various embodiments of the present invention may include, for example, at least one of a smartphone, a tablet personal computer (tablet PC), a mobile phone, a video phone, an e-book reader, a desktop personal computer (desktop PC), a laptop personal computer (laptop PC), a netbook computer, a workstation, a server, a personal digital assistant (PDA), a portable multimedia player (PMP), an MP3 player, a mobile medical device, a camera, or a wearable device (for example, smart glasses, a head-mounted device (HMD), electronic clothing, an electronic bracelet, an electronic necklace, an electronic appcessory, an electronic tattoo, a smart mirror, or a smart watch).

Although the present invention has been described with reference to the illustrated embodiments, these are merely examples, and those of ordinary skill in the art to which the present invention pertains will clearly understand that various modifications, changes, and equivalent other embodiments are possible without departing from the spirit and scope of the present invention. The true scope of technical protection of the present invention should therefore be determined by the technical spirit of the appended claims.

100: pre-processing unit
110: non-text processing unit 120: text processing unit
200: extended natural language processing unit
210 to 230: first to third natural language processing units
300: information extraction unit 310: semantic entity identification unit
320: paragraph type classification unit 330: action statement component identification unit
400: database

Claims

A system for extracting syntactic and semantic information from a procedure, the system comprising:
a pre-processing unit comprising a non-text processing unit that removes image and table objects included in an input procedure, and a text processing unit that extracts structural characteristics of the text of the procedure from which the image and table objects have been removed;
an extended natural language processing unit that receives text information from the pre-processing unit and analyzes and corrects the text using natural language processing (NLP) technology;
an information extraction unit that, for each paragraph of the natural-language-processed procedure, identifies all significant semantic entities included in the paragraph, classifies the type of the paragraph, and identifies the detailed components of action statement paragraphs; and
an output unit that generates and outputs all of the extracted information in the form of an output.

The system of claim 1,
wherein the extended natural language processing unit comprises:
a first natural language processing unit for tokenization, sentence segmentation, and lemmatization;
a second natural language processing unit for part-of-speech (POS) tagging and parsing; and
a third natural language processing unit that detects and corrects incorrect POS tags and parsing results arising during natural language processing, using internal rules integrated with a lexical database.
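
The correction step performed by the third natural language processing unit can be illustrated with a minimal sketch. The claim does not disclose the actual rules or the contents of the lexical database, so everything below is an assumption made for illustration: the `Token` type, the `ACTION_VERB_LEXICON` set, the Penn Treebank-style tags, and the single rule (a sentence-initial word listed as an action verb is retagged as a base-form verb).

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str  # Penn Treebank-style tag, assumed for this sketch


# Hypothetical plant-specific lexicon of words that act as action verbs.
ACTION_VERB_LEXICON = {"open", "close", "verify", "start", "stop", "check"}


def correct_pos_tags(tokens):
    """Return a corrected copy of the tagged tokens.

    Illustrative rule only: imperative procedure steps usually begin with
    a verb, so a sentence-initial word found in the lexicon is forced to
    the base-form verb tag "VB" if the tagger chose something else.
    """
    corrected = [Token(t.text, t.pos) for t in tokens]
    if corrected:
        first = corrected[0]
        if first.text.lower() in ACTION_VERB_LEXICON and first.pos != "VB":
            first.pos = "VB"
    return corrected


# "Open" is a typical tagger error case: adjective instead of verb.
tagged = [Token("Open", "JJ"), Token("valve", "NN"), Token("V-101", "NNP")]
fixed = correct_pos_tags(tagged)
```

A real rule set would likely also repair the parse that depends on the corrected tag; this sketch stops at the tag itself.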

The system of claim 2,
wherein the parsing
is performed in two ways: constituency-based and dependency-based.

The system of claim 1,
wherein the information extraction unit comprises:
a semantic entity identification unit that identifies the semantic entities;
a paragraph type classification unit that classifies the type of each paragraph of the procedure; and
a component identification unit that identifies the detailed components of the action statement paragraphs,
wherein, after the semantic entity identification unit has operated, the paragraph type classification unit and the component identification unit may operate in either order.

The system of claim 4,
wherein the semantic entity identification unit
identifies the significant semantic entities included in each paragraph of the procedure by applying an ontology lookup method and an internal rule method.

The system of claim 5,
wherein tagging according to the ontology lookup method
is applied only to tokens that match a semantic entity included in the ontology.

The system of claim 5,
wherein tagging according to the internal rule method
tags, with a predefined concept, each token that satisfies a conditional expression involving POS tags, syntactic tags, and semantic tags.
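
The combination of ontology lookup and internal rules can be sketched as a two-pass tagger. This is only an illustration under stated assumptions: the `ONTOLOGY` dictionary, the concept names `Equipment`, `Parameter`, and `EquipmentID`, and the single rule are all hypothetical, and the rule's condition uses only a POS tag and a neighbouring semantic tag (syntactic tags are omitted for brevity).

```python
# Hypothetical ontology: surface form -> concept.
ONTOLOGY = {"pump": "Equipment", "valve": "Equipment", "pressure": "Parameter"}


def identify_entities(tokens):
    """Tag (text, pos) pairs with a concept name or None.

    Pass 1 (ontology lookup): only tokens that match an ontology entry
    receive a tag.
    Pass 2 (internal rule): an untagged proper noun that directly follows
    an Equipment token is tagged with the predefined concept EquipmentID.
    """
    tags = [ONTOLOGY.get(text.lower()) for text, _ in tokens]
    for i in range(1, len(tokens)):
        _, pos = tokens[i]
        if tags[i] is None and pos == "NNP" and tags[i - 1] == "Equipment":
            tags[i] = "EquipmentID"
    return tags


sentence = [("Open", "VB"), ("valve", "NN"), ("V-101", "NNP")]
entity_tags = identify_entities(sentence)
```

Running the lookup first keeps the rules small, since each rule can condition on semantic tags already assigned to neighbouring tokens.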

The system of claim 4,
wherein the paragraph type classification unit
classifies each paragraph into one of several paragraph types belonging to:
a first group comprising an action verb and its target object;
a second group that is more closely related to a specific action statement than a third group; and
the third group, which is less closely related to a specific action statement than the second group.

The system of claim 4,
wherein the component identification unit
identifies the action verb and the target object, as well as each of a plurality of remaining optional components, according to POS tags, semantic entity tags, and parsing tags.
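
As a rough sketch of component identification, the function below picks out an action verb, a target object, and any remaining tagged tokens from one action statement. The selection logic is an assumption, not the claimed method: it uses only POS tags and semantic entity tags (parsing tags are left out), treats the first base-form verb as the action verb, and treats the first following `Equipment` token as the target object. The concept names are hypothetical.

```python
def identify_components(tokens):
    """Split an action statement into components.

    tokens: list of (text, pos, entity) triples, where entity is a
    semantic entity tag or None.
    """
    result = {"action_verb": None, "target_object": None, "optional": []}
    for text, pos, entity in tokens:
        if result["action_verb"] is None:
            # Skip everything until the first base-form verb.
            if pos == "VB":
                result["action_verb"] = text
        elif result["target_object"] is None and entity == "Equipment":
            result["target_object"] = text
        elif entity is not None:
            # Any other tagged token is kept as an optional component.
            result["optional"].append((entity, text))
    return result


statement = [
    ("Open", "VB", None),
    ("valve", "NN", "Equipment"),
    ("V-101", "NNP", "EquipmentID"),
]
components = identify_components(statement)
```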

A method of extracting syntactic and semantic information included in a procedure, the method comprising:
a first step in which a pre-processing unit performs pre-processing on the procedure;
a second step in which an extended natural language processing unit performs natural language processing to analyze the text information resulting from the pre-processing, and detects and corrects POS tag errors that occur during the natural language processing;
a third step in which an information extraction unit, using the results of the first and second steps, performs semantic entity identification, paragraph type classification, and identification of the detailed components of action statement paragraphs for each paragraph of the procedure; and
a fourth step in which an output unit generates and outputs the extracted information as two or more types of output.
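
The four steps can be wired together as a simple pipeline. The sketch below is a skeleton with stand-in logic, written under several assumptions: the input is a list of blocks with a hypothetical `kind` field, step 2 is reduced to whitespace tokenization (no POS tagging or error correction), step 3 only distinguishes `action` from `other` paragraphs using a hypothetical verb list, and the two output forms are JSON and tab-separated text.

```python
import json

# Hypothetical list of action verbs used by the stand-in classifier.
ACTION_VERBS = {"open", "close", "verify", "start", "stop"}


def preprocess(blocks):
    """Step 1: drop image and table blocks, keep text paragraphs."""
    return [b["text"] for b in blocks if b["kind"] == "text"]


def run_nlp(paragraphs):
    """Step 2 (stand-in): whitespace tokenization only."""
    return [p.split() for p in paragraphs]


def extract(tokenized):
    """Step 3 (stand-in): classify each paragraph and record its verb."""
    records = []
    for tokens in tokenized:
        is_action = bool(tokens) and tokens[0].lower() in ACTION_VERBS
        records.append({
            "text": " ".join(tokens),
            "type": "action" if is_action else "other",
            "action_verb": tokens[0] if is_action else None,
        })
    return records


def render(records):
    """Step 4: produce two output forms, JSON and tab-separated lines."""
    as_json = json.dumps(records, ensure_ascii=False)
    as_tsv = "\n".join(f"{r['type']}\t{r['text']}" for r in records)
    return as_json, as_tsv


blocks = [
    {"kind": "text", "text": "Open valve V-101"},
    {"kind": "image", "text": ""},
    {"kind": "text", "text": "NOTE: Pump may be hot"},
]
records = extract(run_nlp(preprocess(blocks)))
as_json, as_tsv = render(records)
```

Keeping each step as a separate function mirrors the unit boundaries in the claim, so a stand-in can later be swapped for a real tagger or classifier without touching the other steps.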

The method of claim 10,
wherein the second step includes a process in which the extended natural language processing unit detects and corrects POS tag errors and parsing errors using internal rules and a lexical database built for a given plant.

The method of claim 10,
wherein the semantic entity identification of the third step
is performed by the information extraction unit by combining an ontology lookup method, which is applied when the word in a token is included in the ontology, with a pattern-based rule method that tags, with a predefined concept, each token satisfying a conditional expression involving POS tags, syntactic tags, and semantic tags.

The method of claim 10,
wherein, in the paragraph type classification of the third step,
the information extraction unit classifies each paragraph into one of several paragraph types belonging to a first group comprising an action verb and its target object, a second group that is more closely related to a specific action statement than a third group, and the third group, which is less closely related to a specific action statement than the second group.

The method of claim 10,
wherein, in the component identification of the third step, the information extraction unit identifies the action verb and the target object, as well as each of a plurality of remaining optional components, according to POS tags, semantic entity tags, and parsing tags.