KR102051825B1

KR102051825B1 - Semantic-based similar patent search apparatus and method, storage media storing the same

Info

Publication number: KR102051825B1
Application number: KR1020180017567A
Authority: KR
Inventors: 김남규; 백민지; 이동훈; 변성호
Original assignee: 국민대학교산학협력단
Priority date: 2018-02-13
Filing date: 2018-02-13
Publication date: 2020-01-08
Also published as: KR20190097750A

Abstract

The present invention relates to a semantic-based similar patent search apparatus and method. The apparatus includes: a core technical term extraction unit that determines a second-country patent document belonging to the same technical field as a first-country patent document and extracts at least one core technical term from each of the first-country and second-country patent documents; a patent document refining unit that refines the first-country and second-country patent documents based on the extracted core technical terms; and a similarity calculation unit that calculates the similarity between the core technical terms by vectorizing the core technical terms contained in the refined patent documents. The present invention can thereby calculate the similarity of technical terms by analyzing how terms are used in patent documents.

Description

Semantic-based similar patent search apparatus and method, and recording medium storing the same {SEMANTIC-BASED SIMILAR PATENT SEARCH APPARATUS AND METHOD, STORAGE MEDIA STORING THE SAME}

The present invention relates to semantic-based similar patent search technology, and more particularly to a semantic-based similar patent search apparatus and method capable of calculating the similarity of technical terms by analyzing how terms are used in patent documents.

A prior art search covers not only patents filed domestically but also patents filed abroad. Code-based search can be performed through the International Patent Classification (IPC) in much the same way as a domestic patent search, while multilingual keyword search is generally performed through a synonym dictionary. For example, when searching for inventions filed in Japanese using Korean input, entering “밀리미터파” (millimeter wave) causes the value to be translated into “ミリメーター波” through the synonym dictionary and passed to the Japanese patent search system. However, dictionary-based multilingual keyword search has the problem that retrieval is limited when the same concept is expressed differently across the languages of different countries.

Korean Patent Laid-Open Publication No. 10-2010-0113423 (2010.10.21) relates to a keyword recommendation method and apparatus using an inverse vector space model. It applies in reverse the vector space model, which finds the document closest to an input keyword among many documents, to find and recommend the keywords closest to an input text among many keywords, so that a user can easily select keywords for text the user has written from the recommended keywords.

Korean Patent Laid-Open Publication No. 10-2010-0113423 (2010.10.21)

An embodiment of the present invention aims to provide a semantic-based similar patent search apparatus and method capable of calculating the similarity of technical terms by analyzing how terms are used in patent documents.

An embodiment of the present invention aims to provide a semantic-based similar patent search apparatus and method capable of automatically expanding semantically similar words across different language systems by using terms extracted through morphological analysis as a dictionary.

An embodiment of the present invention aims to provide a semantic-based similar patent search apparatus and method capable of improving the efficiency of prior art searches by automatically expanding the keywords used in a search on a semantic basis.

Among the embodiments, a semantic-based similar patent search apparatus includes: a core technical term extraction unit that determines a second-country patent document belonging to the same technical field as a first-country patent document and extracts at least one core technical term from each of the first-country patent document and the second-country patent document; a patent document refining unit that refines the first-country patent document and the second-country patent document based on the extracted at least one core technical term; and a similarity calculation unit that calculates the similarity between the core technical terms by vectorizing the core technical terms contained in the refined patent documents.

The core technical term extraction unit may, in a pair of patent documents linked by priority information, determine the patent document corresponding to the earlier application as the first-country patent document and the patent document corresponding to the later application as the second-country patent document.

The core technical term extraction unit may extract the technical terms contained in the first-country and second-country patent documents through morphological analysis and determine the at least one core technical term based on at least one of frequency and technical relevance.

The patent document refining unit may generate an indexed second-country patent document by replacing each index technical term contained in the refined second-country patent document with the first-country core technical term that matches it.

The similarity calculation unit may vectorize the at least one core technical term through Word2Vec training based on the refined first-country patent document and the indexed second-country patent document.

The similarity calculation unit may generate a similarity matrix by calculating the cosine similarity between the vectors of first-country and second-country core technical terms that are not index technical terms.

Among the embodiments, a semantic-based similar patent search method includes: (a) determining a second-country patent document belonging to the same technical field as a first-country patent document and extracting at least one core technical term from each of the first-country patent document and the second-country patent document; (b) refining the first-country patent document and the second-country patent document based on the extracted at least one core technical term; and (c) calculating the similarity between the core technical terms by vectorizing the core technical terms contained in the refined patent documents.

Step (a) may be a step of determining, in a pair of patent documents linked by priority information, the patent document corresponding to the earlier application as the first-country patent document and the patent document corresponding to the later application as the second-country patent document.

Step (a) may be a step of extracting the technical terms contained in the first-country and second-country patent documents through morphological analysis and determining the at least one core technical term based on at least one of frequency and technical relevance.

Step (b) may be a step of generating an indexed second-country patent document by replacing each index technical term contained in the refined second-country patent document with the first-country core technical term that matches it.

Step (c) may be a step of vectorizing the at least one core technical term through Word2Vec training based on the refined first-country patent document and the indexed second-country patent document.

Step (c) may be a step of generating a similarity matrix by calculating the cosine similarity between the vectors of first-country and second-country core technical terms that are not index technical terms.

Among the embodiments, a computer-executable recording medium includes: a process of determining a second-country patent document belonging to the same technical field as a first-country patent document and extracting at least one core technical term from each of the first-country patent document and the second-country patent document; a process of refining the first-country patent document and the second-country patent document based on the extracted at least one core technical term; and a process of calculating the similarity between the core technical terms by vectorizing the core technical terms contained in the refined patent documents.

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, so the scope of the disclosed technology should not be understood as being limited thereby.

The semantic-based similar patent search apparatus and method according to an embodiment of the present invention can automatically expand semantically similar words across different language systems by using terms extracted through morphological analysis as a dictionary.

The semantic-based similar patent search apparatus and method according to an embodiment of the present invention can improve the efficiency of prior art searches by automatically expanding the keywords used in a search on a semantic basis.

FIG. 1 is a diagram illustrating a semantic-based similar patent search system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the similar patent search apparatus of FIG. 1.
FIG. 3 is a flowchart illustrating the similar patent search process performed by the similar patent search apparatus of FIG. 1.
FIG. 4 is a diagram showing an overall overview of a semantic-based similar patent search system according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating the process by which the core technical term extraction unit of FIG. 2 maps patent documents based on priority information.
FIG. 6 is an exemplary diagram illustrating the process by which the patent document refining unit of FIG. 2 generates an indexed patent document.
FIG. 7 is an exemplary diagram illustrating the process by which the similarity calculation unit of FIG. 2 extracts terms through Word2Vec training and calculates cosine similarity.

The description of the present invention is merely an embodiment for structural or functional explanation, so the scope of the present invention should not be construed as limited by the embodiments described in the text. That is, because the embodiments can be modified in various ways and can take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only such effects, so the scope of the present invention should not be understood as being limited thereby.

Meanwhile, the meaning of the terms used in the present application should be understood as follows.

Terms such as "first" and "second" are intended to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is referred to as being "connected" to another component, it should be understood that it may be directly connected to that other component, but that other components may also exist in between. By contrast, when a component is referred to as being "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing the relationship between components, such as "between" and "directly between", or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the presence of stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The identifiers attached to the steps (for example, a, b, c) are used for convenience of description and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the one given. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes every kind of recording device in which data readable by a computer system is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium can also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

Unless defined otherwise, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as consistent with their meaning in the context of the related art, and cannot be interpreted as having an ideal or excessively formal meaning unless clearly defined in the present application.

FIG. 1 is a diagram illustrating a semantic-based similar patent search system according to an embodiment of the present invention.

Referring to FIG. 1, the semantic-based similar patent search system 100 may include a user terminal 110, a similar patent search apparatus 130, and a database 150.

The user terminal 110 may be a computing device that can request a similar patent search and check the search results. It may be implemented as a smartphone, a notebook computer, or a computer, but is not limited to these and may also be implemented as various other devices such as a tablet PC. The user terminal 110 may be connected to the similar patent search apparatus 130 through a network, and a plurality of user terminals 110 may be connected to the similar patent search apparatus 130 simultaneously.

The similar patent search apparatus 130 may be implemented as a server, corresponding to a computer or a program, that can provide the user terminal 110 with the results of searching for similar patents based on keywords received from the user terminal 110. The similar patent search apparatus 130 may be connected wirelessly to the user terminal 110 through Bluetooth, WiFi, or the like, and may exchange data with the user terminal 110 through a network.

In one embodiment, the similar patent search apparatus 130 may analyze patent documents of two countries linked by priority to extract technical terms, refine the patent documents based on the extracted technical terms, and calculate the similarity between technical terms through Word2Vec training. Based on the calculated similarity, the similar patent search apparatus 130 may build a synonym dictionary between the core technical terms used in the patent documents of the two countries, and may use this dictionary when searching for similar patents to provide more effective search results.

The similar patent search apparatus 130 may be implemented to include the database 150 or may be implemented independently of the database 150. When implemented independently of the database 150, the similar patent search apparatus 130 may be connected to the database 150 by wire or wirelessly to exchange data.

The database 150 is a storage device that can store the various kinds of information needed for similar patent search. The database 150 may store technical terms received from the user terminal 110, as well as the technical terms that the similar patent search apparatus 130 extracts from patent documents and information related to Word2Vec training. The database 150 is not limited to these and may store information collected or processed in various forms in the course of analyzing how terms are used in patent documents and calculating the similarity of technical terms.

The database 150 may consist of one or more independent sub-databases, each storing information within a specific scope, or may be configured as an integrated database in which the independent sub-databases are merged into one. When it consists of independent sub-databases, the sub-databases may be connected wirelessly through Bluetooth, WiFi, or the like, and may exchange data with one another through a network. When configured as an integrated database, the database 150 may include a control unit that integrates the sub-databases into one and manages the data exchange and control flow among them.

FIG. 2 is a block diagram illustrating the similar patent search apparatus of FIG. 1.

Referring to FIG. 2, the similar patent search apparatus 130 may include a core technical term extraction unit 210, a patent document refining unit 230, a similarity calculation unit 250, and a control unit 270.

The core technical term extraction unit 210 may determine a second-country patent document belonging to the same technical field as a first-country patent document and extract at least one core technical term from each of the first-country and second-country patent documents. Starting from the first-country patent document, the core technical term extraction unit 210 may select a second-country patent document whose content is highly similar and which therefore contains many technical terms with the same meaning. For example, the core technical term extraction unit 210 may determine the second-country patent document based on whether the patent documents belong to the same technical field.

In one embodiment, the core technical term extraction unit 210 may, in a pair of patent documents linked by priority information, determine the patent document corresponding to the earlier application as the first-country patent document and the patent document corresponding to the later application as the second-country patent document. The core technical term extraction unit 210 may use the priority claim system of patent law to identify pairs of patent documents from two countries whose content is highly similar. Patents filed in two different countries and linked by priority are characterized by having almost identical content written in the language of each country.

FIG. 5 is an exemplary diagram illustrating the process by which the core technical term extraction unit of FIG. 2 maps patent documents based on priority information. FIG. 5 shows a method for mapping pairs of Korean and Japanese patents. When a patent first filed in Japan is filed again in Korea, the Japanese patent application number 530 may be recorded as priority information 510 in the bibliographic data of the patent. The core technical term extraction unit 210 may collect information on patents filed in the first country during a specific period, extract the patents whose priority information contains the application number of a second-country patent, and then collect the second-country patent documents based on those application numbers.
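The priority-based mapping described above can be sketched as a lookup on application numbers. This is a minimal illustration, not the patented implementation: the `PatentRecord` type, its field names, the function name `pair_by_priority`, and the application numbers in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PatentRecord:
    """Hypothetical bibliographic record; the field names are assumptions."""
    application_number: str
    country: str
    priority_number: Optional[str] = None  # application number cited as priority, if any
    text: str = ""


def pair_by_priority(
    later_filings: List[PatentRecord],
    earlier_filings: List[PatentRecord],
) -> List[Tuple[PatentRecord, PatentRecord]]:
    """Return (earlier, later) pairs linked through the priority number."""
    by_number = {p.application_number: p for p in earlier_filings}
    pairs = []
    for later in later_filings:
        # Filings without a priority claim, or whose priority target was not
        # collected, are simply skipped.
        earlier = by_number.get(later.priority_number)
        if earlier is not None:
            pairs.append((earlier, later))
    return pairs
```

For example, a Korean record whose `priority_number` is `"JP-2015-000001"` is paired with the Japanese record carrying that application number, while a Korean record with no priority claim yields no pair.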

In one embodiment, the core technical term extraction unit 210 may extract the technical terms contained in the first-country and second-country patent documents through morphological analysis and determine at least one core technical term based on at least one of frequency and technical relevance. The core technical term extraction unit 210 may perform the morphological analysis of the first-country and second-country patent documents separately for each country. Considering that most technical terms are nouns, the core technical term extraction unit 210 may use the morphological analysis to remove all terms whose part of speech is not a noun.

Among the extracted nouns, the core technical term extraction unit 210 may determine only the technical terms with high frequency and high technical relevance as core technical terms. Here, frequency may correspond to how often the technical term appears in the patent document, and technical relevance may correspond to a numerical measure of how closely the technical term relates to the technical field of the invention described in the patent document. Among the technical terms extracted through morphological analysis, the core technical term extraction unit 210 may determine as core technical terms those whose frequency is at or above a specific count or whose technical relevance is at or above a specific threshold.
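The threshold rule in the last sentence can be sketched as follows. The sketch assumes the nouns have already been extracted by a morphological analyzer, and that a relevance score per term is supplied from elsewhere; the source does not say how technical relevance is computed, so the `relevance` mapping, the function name, and both default thresholds are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def select_core_terms(
    nouns: Iterable[str],
    relevance: Dict[str, float],
    min_count: int = 2,          # assumed frequency threshold
    min_relevance: float = 0.5,  # assumed relevance threshold
) -> Set[str]:
    """Keep nouns that meet the frequency threshold OR the relevance threshold."""
    counts = Counter(nouns)
    return {
        term
        for term, count in counts.items()
        if count >= min_count or relevance.get(term, 0.0) >= min_relevance
    }
```

With `nouns = ["antenna", "antenna", "wave", "figure", "radar"]` and `relevance = {"radar": 0.9, "figure": 0.1}`, the function keeps `antenna` (frequent) and `radar` (relevant) and drops the rest.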

The patent document refining unit 230 may refine the first-country and second-country patent documents based on the extracted at least one core technical term. More specifically, based on the core technical terms derived from the morphological analysis of each country's patent documents, the patent document refining unit 230 may refine a patent document by keeping only those core technical terms and removing everything else. In other words, refining may mean keeping only specific terms in a patent document and removing all other terms.
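Refinement as defined here amounts to a filter over the token sequence. A minimal sketch, assuming the document has already been tokenized and that order and repetition are preserved (which matters if the result is later used as training text); the function name `refine` is hypothetical.

```python
from typing import List, Set


def refine(tokens: List[str], core_terms: Set[str]) -> List[str]:
    """Keep only core technical terms, preserving their order and repetitions."""
    return [token for token in tokens if token in core_terms]
```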

In one embodiment, the patent document refining unit 230 may generate an indexed second-country patent document by replacing each index technical term contained in the refined second-country patent document with the first-country core technical term that matches it. Here, index technical terms are technical terms of the two countries that express the same technical concept and are already widely known, so that they can be translated literally one-to-one without loss of meaning. The set of index technical terms may correspond to a subset of the set of synonyms between the technical terms of the two countries. The indexed patent document is described in more detail with reference to FIG. 6.
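Indexing can be sketched as a dictionary substitution over the refined second-country tokens. The mapping from second-country index terms to first-country terms is assumed to be given, and the function name and sample terms are illustrative only.

```python
from typing import Dict, List


def indexify(refined_tokens: List[str], index_map: Dict[str, str]) -> List[str]:
    """Replace second-country index terms with their first-country counterparts.

    Terms without an entry in `index_map` are left in the original language.
    """
    return [index_map.get(token, token) for token in refined_tokens]
```

For instance, with `index_map = {"アンテナ": "안테나"}`, the refined Japanese sequence `["アンテナ", "ミリ波", "基板"]` becomes `["안테나", "ミリ波", "基板"]`: the well-known term is anchored to its Korean counterpart, while the remaining terms stay in Japanese.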

The similarity calculation unit 250 may calculate the similarity between core technical terms by vectorizing the core technical terms contained in the refined patent documents. More specifically, the similarity calculation unit 250 may compute a vector for each core technical term and may generate a similarity matrix between the core technical terms based on those vectors. Here, vectorization may mean computing a vector for a specific technical term.

In one embodiment, the similarity calculation unit 250 may vectorize at least one core technical term through Word2Vec training based on the refined first-country patent document and the indexed second-country patent document. By merging the refined first-country patent document and the indexed second-country patent document into one and then running Word2Vec training on the whole in a single pass, the similarity calculation unit 250 can vectorize the core technical terms used in both countries' patent documents from the same perspective.

In one embodiment, the similarity calculation unit 250 may generate a similarity matrix by calculating the cosine similarity between first-country and second-country core technical term vectors that do not correspond to indicator technical terms. Using the vectors obtained through Word2Vec training, the similarity calculation unit 250 may calculate cosine similarity only between core technical terms that are not indicator technical terms and that come from patent documents of different countries.
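The restricted similarity matrix can be sketched as below. The sketch assumes that term vectors are already available as plain lists of floats; the two-dimensional vectors, the term lists, and the function names are illustrative assumptions, not values produced by an actual Word2Vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors, first_terms, second_terms, indicator_terms):
    """Compare only non-indicator terms, and only across the two countries."""
    rows = [t for t in first_terms if t not in indicator_terms]
    cols = [t for t in second_terms if t not in indicator_terms]
    matrix = [[cosine(vectors[r], vectors[c]) for c in cols] for r in rows]
    return rows, cols, matrix

# Hypothetical toy vectors; "안테나" is treated as an indicator term.
vectors = {
    "밀리미터파": [1.0, 0.0],
    "안테나": [1.0, 1.0],
    "ミリ波": [2.0, 0.0],
    "レーダ": [0.0, 3.0],
}
rows, cols, matrix = similarity_matrix(
    vectors,
    first_terms=["밀리미터파", "안테나"],
    second_terms=["ミリ波", "レーダ"],
    indicator_terms={"안테나"},
)
print(rows, cols, matrix)
```

Rows hold first-country terms and columns hold second-country terms, so same-country pairs and indicator terms never enter the matrix.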

The control unit 270 may control the overall operation of the similar patent search apparatus 130 and may manage the control flow or data flow among the core technical term extraction unit 210, the patent document refinement unit 230, and the similarity calculation unit 250.

FIG. 3 is a flowchart illustrating the similar patent search process performed by the similar patent search apparatus of FIG. 1.

Referring to FIG. 3, the similar patent search apparatus 130 may, through the core technical term extraction unit 210, determine a second-country patent document belonging to the same technical field as the first-country patent document and extract at least one core technical term from each of the first-country patent document and the second-country patent document (step S310).

The similar patent search apparatus 130 may, through the patent document refinement unit 230, refine the first-country patent document and the second-country patent document based on the at least one extracted core technical term (step S330). The similar patent search apparatus 130 may, through the similarity calculation unit 250, calculate the similarity between core technical terms by vectorizing the core technical terms contained in the refined patent documents (step S350).

In one embodiment, the similar patent search apparatus 130 may generate a core technical term dictionary between the first country and the second country based on the similarity information between core technical terms calculated by the similarity calculation unit 250. The core technical term dictionary may contain information on pairs of technical terms, drawn from the core technical terms of the two countries, that have high semantic similarity. When the similar patent search apparatus 130 receives a similar patent search request from the user terminal 110, it may look up the accompanying search keyword in the core technical term dictionary, use the other country's technical terms found there to perform the similar patent search, and provide the search results to the user terminal 110.
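One way the dictionary and keyword lookup could work is sketched below. The cutoff `min_similarity`, the function names, and the similarity values in the sample matrix are placeholders chosen for illustration; the specification does not give a cutoff or a data layout for the dictionary.

```python
def build_term_dictionary(rows, cols, matrix, min_similarity=0.7):
    """Map each first-country term to second-country terms above the cutoff."""
    dictionary = {}
    for row_term, row in zip(rows, matrix):
        matches = [(c, s) for c, s in zip(cols, row) if s >= min_similarity]
        if matches:
            # Most similar counterpart first.
            dictionary[row_term] = sorted(matches, key=lambda m: m[1], reverse=True)
    return dictionary

def expand_query(keyword, dictionary):
    """Return the keyword plus its other-country counterparts, if any."""
    return [keyword] + [term for term, _ in dictionary.get(keyword, [])]

# Placeholder similarity matrix (rows: Korean terms, columns: Japanese terms).
rows = ["밀리미터파", "레이더"]
cols = ["ミリ波", "レーダ"]
matrix = [[0.92, 0.31], [0.28, 0.88]]

dictionary = build_term_dictionary(rows, cols, matrix)
print(expand_query("밀리미터파", dictionary))
```

A keyword with no dictionary entry is returned unchanged, so the search falls back to the original keyword alone.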

FIG. 4 is a diagram showing an overall overview of a semantic-based similar patent search system according to an embodiment of the present invention.

Referring to FIG. 4, to secure pairs of Korean and Japanese patent documents whose contents are highly similar, the similar patent search apparatus 130 may collect the Japanese patent documents linked by priority to the Korean patents under analysis (410). Next, the similar patent search apparatus 130 may perform morphological analysis by applying a Korean morphological analyzer and a Japanese morphological analyzer to the respective countries' patent documents, and may refine each patent document by applying a predefined technical term dictionary (420). In one embodiment, the similar patent search apparatus 130 may analyze the patent documents of both countries to determine the core technical terms and may refine each patent document based on them.

The documents refined in this way are later merged in order to derive the semantic similarity of terms; because the patent documents of the two countries are written in different languages, preprocessing is needed to link them. For this purpose, the similar patent search apparatus 130 may use, as indicator technical terms, technical terms that have already been verified to denote the same concept in both countries.

The similar patent search apparatus 130 may generate an indexed Japanese patent document by replacing Japanese technical terms with Korean technical terms in the refined Japanese patent document, limited to the verified indicator technical terms (430). Next, the similar patent search apparatus 130 may merge the Korean patent documents with the indexed Japanese patent documents and then represent each technical term as a vector through Word2Vec analysis (440). Finally, the similar patent search apparatus 130 may derive a similarity matrix between the core technical terms of Korea and Japan by calculating the distance between the vectors of the technical terms (450).

FIG. 6 is an exemplary diagram illustrating the process by which the patent document refinement unit of FIG. 2 generates an indexed patent document.

Referring to FIG. 6, the similar patent search apparatus 130 may, through the patent document refinement unit 230, generate refined patent documents (630, 670) by applying to the original Korean and Japanese patent texts (610, 650) either the core technical terms extracted by the core technical term extraction unit 210 or a predefined technical term dictionary.

The patent document refinement unit 230 may insert indicator technical terms by replacing the Japanese technical terms in the refined Japanese patent document 670 that belong to the indicator technical term set with the corresponding Korean technical terms, and may generate the indexed Japanese patent document 690 as the result. In the indexed Japanese patent document 690, technical terms that are indicator technical terms appear in Korean, while technical terms that are not indicator technical terms remain in Japanese. That is, the indexed Japanese patent document 690 may be expressed in a form in which Korean and Japanese are mixed.
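The substitution step can be sketched as a dictionary lookup over the refined token list. The indicator pairs below (アンテナ/안테나 for "antenna", 信号/신호 for "signal") and the function name `index_document` are assumptions chosen for illustration and are not taken from FIG. 6.

```python
def index_document(refined_tokens, indicator_map):
    """Replace indicator terms with their first-country form; keep the rest."""
    return [indicator_map.get(token, token) for token in refined_tokens]

# Hypothetical refined Japanese document and verified indicator pairs.
refined_ja = ["アンテナ", "ミリ波", "信号"]
indicator_map = {"アンテナ": "안테나", "信号": "신호"}

indexed_ja = index_document(refined_ja, indicator_map)
print(indexed_ja)
```

The result mixes Korean and Japanese tokens: indicator terms now share a surface form with the Korean documents, while "ミリ波" stays in Japanese because it is not in the indicator set.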

In the subsequent process, the similar patent search apparatus 130 may analyze the similarity between Korean and Japanese terms based on the positions of the terms in the refined Korean patent document 630 and the indexed Japanese patent document 690. For example, in the refined Korean patent document 630, “밀리미터파” (millimeter wave) appears to the right of “발명” (invention), and in the indexed Japanese patent document 690, “ミリ波” (millimeter wave) appears to the right of “發明” (invention). If this pattern occurs frequently, the similar patent search apparatus 130 may conclude that “밀리미터파” and “ミリ波” have high semantic similarity.
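The positional intuition can be illustrated with a plain neighbor count. This is a simplified stand-in for the idea, not the Word2Vec training the specification uses, and the two token lists are hypothetical documents in which the indicator terms 안테나 (antenna) and 신호 (signal) are assumed to have already been unified.

```python
from collections import Counter, defaultdict

def context_counts(documents, window=1):
    """Count, for each term, which terms occur within `window` positions."""
    contexts = defaultdict(Counter)
    for tokens in documents:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[term][tokens[j]] += 1
    return contexts

# Hypothetical refined Korean document and indexed Japanese document.
refined_ko = ["안테나", "밀리미터파", "신호"]
indexed_ja = ["안테나", "ミリ波", "신호"]

contexts = context_counts([refined_ko, indexed_ja])
shared = set(contexts["밀리미터파"]) & set(contexts["ミリ波"])
print(sorted(shared))
```

Because both terms are surrounded by the same indicator terms, their neighbor counts coincide, which is the kind of shared pattern the paragraph describes.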

FIG. 7 is an exemplary diagram illustrating the process by which the similarity calculation unit of FIG. 2 extracts terms through Word2Vec training and calculates cosine similarity.

Referring to FIG. 7, the similar patent search apparatus 130 may, through the similarity calculation unit 250, merge the set of “refined Korean patent” documents with the set of “indexed Japanese patent” documents and run Word2Vec training on the merged set in a single pass, thereby vectorizing the technical terms used in both countries' patent documents from the same perspective.

The similarity calculation unit 250 may obtain a vector 710 for each technical term through the Word2Vec algorithm. The technical term vectors 710 contain a mixture of technical terms from both countries. The similarity calculation unit 250 may calculate the cosine similarity 730 between each pair of technical terms based on the technical term vectors 710. In one embodiment, the similarity calculation unit 250 may calculate cosine similarity only between technical terms of the two countries, and only for technical terms that are not indicator technical terms. The similarity value may range from 0 to 1, and the closer the value is to 1, the more similar the corresponding technical terms may be interpreted to be.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

100: semantic-based similar patent search system
110: user terminal 130: similar patent search apparatus
150: database
210: core technical term extraction unit 230: patent document refinement unit
250: similarity calculation unit 270: control unit
510: priority information 530: application number
610: original Korean patent text 630: refined Korean patent
650: original Japanese patent text 670: refined Japanese patent
690: indexed Japanese patent document
710: technical term vectors 730: cosine similarity

Claims

A semantic-based similar patent search apparatus comprising:
a core technical term extraction unit that determines a second-country patent document belonging to the same technical field as a first-country patent document and extracts at least one core technical term from each of the first-country patent document and the second-country patent document;
a patent document refinement unit that refines each of the first-country patent document and the second-country patent document based on the at least one extracted core technical term, and generates an indexed second-country patent document, as a document in which first-country terms and second-country terms are mixed, by replacing each indicator technical term in the refined second-country patent document with the first-country core technical term matching that indicator technical term; and
a similarity calculation unit that vectorizes the core technical terms contained in each document through Word2Vec training based on the refined first-country patent document and the indexed second-country patent document, and calculates the similarity between the core technical terms,
wherein the core technical term extraction unit determines, in a pair of patent documents linked based on priority information, the patent document corresponding to the earlier application as the first-country patent document and the patent document corresponding to the later application as the second-country patent document.

(Deleted)

The apparatus of claim 1, wherein the core technical term extraction unit
extracts the technical terms contained in the first-country and second-country patent documents through morphological analysis and determines the at least one core technical term based on at least one of frequency and technical relevance.

(Deleted)

The apparatus of claim 1, wherein the similarity calculation unit
generates a similarity matrix by calculating the cosine similarity between first-country and second-country core technical term vectors that do not correspond to the indicator technical terms.

A similar patent search method performed by a semantic-based similar patent search apparatus, the method comprising:
(a) determining a second-country patent document belonging to the same technical field as a first-country patent document and extracting at least one core technical term from each of the first-country patent document and the second-country patent document;
(b) refining each of the first-country patent document and the second-country patent document based on the at least one extracted core technical term, and generating an indexed second-country patent document, as a document in which first-country terms and second-country terms are mixed, by replacing each indicator technical term in the refined second-country patent document with the first-country core technical term matching that indicator technical term; and
(c) vectorizing the core technical terms contained in each document through Word2Vec training based on the refined first-country patent document and the indexed second-country patent document, and calculating the similarity between the core technical terms,
wherein step (a) determines, in a pair of patent documents linked based on priority information, the patent document corresponding to the earlier application as the first-country patent document and the patent document corresponding to the later application as the second-country patent document.

(Deleted)

The method of claim 7, wherein step (a)
comprises extracting the technical terms contained in the first-country and second-country patent documents through morphological analysis and determining the at least one core technical term based on at least one of frequency and technical relevance.

(Deleted)

The method of claim 7, wherein step (c)
comprises generating a similarity matrix by calculating the cosine similarity between first-country and second-country core technical term vectors that do not correspond to the indicator technical terms.

A computer-executable recording medium storing a program for executing a similar patent search method performed by a semantic-based similar patent search apparatus that includes a core technical term extraction unit, a patent document refinement unit, and a similarity calculation unit, the method comprising:
a process in which the core technical term extraction unit determines a second-country patent document belonging to the same technical field as a first-country patent document and extracts at least one core technical term from each of the first-country patent document and the second-country patent document;
a process in which the patent document refinement unit refines each of the first-country patent document and the second-country patent document based on the at least one extracted core technical term, and generates an indexed second-country patent document, as a document in which first-country terms and second-country terms are mixed, by replacing each indicator technical term in the refined second-country patent document with the first-country core technical term matching that indicator technical term; and
a process in which the similarity calculation unit vectorizes the core technical terms contained in each document through Word2Vec training based on the refined first-country patent document and the indexed second-country patent document, and calculates the similarity between the core technical terms,
wherein the extracting process determines, in a pair of patent documents linked based on priority information, the patent document corresponding to the earlier application as the first-country patent document and the patent document corresponding to the later application as the second-country patent document.