KR20100060165A

KR20100060165A - Method and system for determining similar word with input string

Info

Publication number: KR20100060165A
Application number: KR1020080118647A
Authority: KR
Inventors: 이도길; 김태일; 기윤서
Original assignee: 엔에이치엔(주)
Priority date: 2008-11-27
Filing date: 2008-11-27
Publication date: 2010-06-07
Also published as: KR101126406B1

Abstract

PURPOSE: A method and system for determining similar words are provided to improve the response speed of a similar-word service by selecting candidate strings from stored queries and calculating the actual edit distance only for those candidate strings. CONSTITUTION: A user interface unit (130) receives a first string from a user and provides similar words for the first string to the user as recommended queries. A candidate string selection unit (140) selects candidate strings from among stored second strings by using the edit distance between the first string and the second strings, the phonetic codes of the first and second strings, and the character similarity score between the first and second strings, and calculates a final similarity score between each candidate string and the first string. A similar word determination unit (160) determines the candidate strings whose final similarity scores rank highest as similar words of the first string.

Description

Method and System for Determining Similar Word with Input String

The present invention relates to a search service and, more particularly, to a method and apparatus for providing similar words as recommended queries.

With recent advances in science and technology and rising economic standards, communication networks such as high-speed Internet have become widespread and the number of high-speed network users has grown rapidly. This rapid growth has enabled the development of new network-based services and the diversification of service items. Among the services that use such communication networks, the most common is the search service.

A search service is a service that, when a user enters a query, provides the user with search results corresponding to that query (for example, web sites containing the query, articles containing the query, or images whose file names contain the query).

However, users of a search service may fail to enter the query they intended, either because they mistype it or because they do not know exactly which query they want. In such cases the search service provider has no choice but to search on the query as actually entered, so users cannot obtain the search results they want.

To address this inconvenience, recent search services offer features such as recommended queries and related queries for the query a user enters. Here, providing recommended queries means offering, as recommendations, some of the queries that are similar to the query entered by the user.

Conventionally, recommended queries have been provided through approximate string matching: among the queries stored in a database, those similar to the query entered by the user are offered as recommended queries. Edit distance calculation has been widely used for such approximate string matching. The edit distance is defined as the number of insertion, deletion, and substitution operations required to make two strings identical.

However, because edit distance calculation has very high computational complexity, calculating the edit distance between the user's query and every query stored in the database is very inefficient.

In addition, although edit distance calculation can be useful for identifying strings with similar spelling, it is not useful for identifying strings with similar pronunciation, so it cannot offer the user a recommended query whose pronunciation is similar to that of the entered query.

The present invention has been made to solve the problems described above. Its technical object is to provide a method and system for determining similar words that select some of the pre-stored queries as candidate strings and calculate the actual edit distance for similarity determination only for the selected candidate strings.

Another technical object of the present invention is to provide a method and system for determining similar words in which the edit distance used for selecting candidate strings can be calculated using a wildcard character search.

A further technical object of the present invention is to provide a method and system for determining similar words that can offer, as recommended queries, queries whose pronunciation is similar to that of the query entered by the user.

To achieve the above objects, a method for determining similar words according to one aspect of the present invention comprises: when a first string is input, selecting as candidate strings, from among pre-stored second strings, second strings whose first edit distance from the first string is at or below a threshold, second strings whose phonetic code is identical to that of the first string, or, among second strings that contain characters in common with the first string, those whose character similarity score is within the top N; calculating a final similarity score between each candidate string and the first string; and determining the candidate strings whose final similarity score is within the top N as similar words of the first string. Here, the first and second strings may be search queries.

In one embodiment, the method may further comprise receiving the first string from a user before the candidate string selection step, and providing the determined similar words to the user as recommended queries after the similar word determination step.

In the candidate string selection step, the second strings whose first edit distance is at or below the threshold are selected by using a wildcard character search for each operation used to compute the first edit distance. The operations of the first edit distance include at least one of an insertion operation, a deletion operation, a substitution operation, and a transposition operation.

The method may further comprise normalizing the first string and the second strings before the candidate string selection step. When the first and second strings are in English, the normalization converts lowercase letters to uppercase or inserts a virtual character before and after each string. When the first and second strings are in Korean (Hangul), the normalization removes special characters or spaces from the strings or converts the strings into keyboard-input jamo units.

In the candidate string selection step, the phonetic codes of the first and second strings are obtained using the Soundex or Metaphone phonetic algorithm when the strings are in English, and using the Kodex phonetic algorithm when the strings are in Korean.

In the candidate string selection step, the second strings that contain characters in common with the first string are second strings that contain an n-gram in common with the first string. The character similarity score is determined using the size of the common n-gram, the number of common n-grams, the positional similarity of the common n-gram, and the difference in length between the first string and each second string.

The character similarity score is proportional to the size of the common n-gram and to the number of common n-grams, and inversely proportional to the positional similarity value of the common n-gram and to the difference in length between the first string and the second string. The positional similarity value of a common n-gram is defined as the smaller of two differences: the difference between the number of characters preceding the common n-gram in the first string and in the second string, and the difference between the number of characters following the common n-gram in the first string and in the second string (so a smaller value indicates more closely aligned positions).
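The positional similarity value defined above can be sketched in a few lines of Python. This is only an illustration: the function name `position_gap` is hypothetical, and using the first occurrence of the n-gram in each string is an assumption, since the text does not say how repeated occurrences are handled.

```python
def position_gap(first: str, second: str, ngram: str) -> int:
    """Smaller of the 'characters before' gap and the 'characters after' gap.

    Assumption: the first occurrence of the n-gram in each string is used.
    """
    i = first.find(ngram)
    j = second.find(ngram)
    if i < 0 or j < 0:
        raise ValueError("ngram must occur in both strings")
    # Difference in the number of characters preceding the common n-gram.
    before_gap = abs(i - j)
    # Difference in the number of characters following the common n-gram.
    after_first = len(first) - i - len(ngram)
    after_second = len(second) - j - len(ngram)
    after_gap = abs(after_first - after_second)
    return min(before_gap, after_gap)
```

For instance, `position_gap("abcd", "abcdef", "bc")` compares one preceding character in each string (gap 0) with one versus three following characters (gap 2) and returns the smaller value.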

In the final similarity score calculation step, the final similarity score is calculated using at least one of: the character similarity score between each candidate string and the first string; an edit distance score, obtained by scoring a second edit distance between each candidate string and the first string; a phonetic code edit distance score, obtained by scoring a third edit distance between the phonetic code of each candidate string and the phonetic code of the first string; and a common character score, obtained by scoring the number of characters common to each candidate string and the first string. For example, the final similarity score may be calculated using the product of the character similarity score, the edit distance score, the phonetic code edit distance score, and the common character score.

The edit distance score may be calculated as any one of: the reciprocal of the sum of the second edit distance and a first reference value; the difference between a second reference value and the second edit distance; or the ratio of the difference between the length of the first string and the second edit distance to the length of the first string. The phonetic code edit distance score may be calculated as any one of: the reciprocal of the sum of the third edit distance and the first reference value; the difference between the second reference value and the third edit distance; or the ratio of the difference between the phonetic code length of the first string and the third edit distance to the phonetic code length of the first string. The common character score may be calculated as the ratio of the number of characters common to the first characters (the characters in the first string) and the second characters (the characters in the candidate string) to the sum of the numbers of first and second characters, or as the ratio of the number of common characters to the number of first characters.
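The scoring alternatives listed above can be written out as small Python functions. This is a sketch under stated assumptions: all function and parameter names are hypothetical, the default reference value of 1 is illustrative, and characters are counted as distinct characters (sets), which the text does not specify.

```python
def inverse_score(distance: int, first_ref: float = 1.0) -> float:
    """Reciprocal of (edit distance + first reference value)."""
    return 1.0 / (distance + first_ref)

def difference_score(distance: int, second_ref: float) -> float:
    """Second reference value minus the edit distance."""
    return second_ref - distance

def ratio_score(distance: int, first_length: int) -> float:
    """(length of first string - edit distance) / length of first string.

    The first string (or its phonetic code) must be non-empty.
    """
    return (first_length - distance) / first_length

def common_char_score(first: str, candidate: str, over_both: bool = True) -> float:
    """Common characters over both character counts, or over the first only.

    Assumption: each distinct character is counted once.
    """
    first_chars, cand_chars = set(first), set(candidate)
    common = len(first_chars & cand_chars)
    denominator = len(first_chars) + len(cand_chars) if over_both else len(first_chars)
    return common / denominator
```

The same three distance-based functions apply to both the second edit distance (between strings) and the third edit distance (between phonetic codes); only the inputs differ.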

To achieve the above objects, a system for determining similar words according to one aspect of the present invention comprises: a user interface unit that receives a first string entered by a user and provides similar words for the first string to the user as recommended queries; a candidate string selection unit that selects candidate strings from among pre-stored second strings by using the edit distance between the first string and the second strings, the phonetic codes of the first and second strings, and the character similarity score between the first and second strings, and calculates a final similarity score between each selected candidate string and the first string; and a similar word determination unit that determines the candidate strings whose final similarity score is within the top N as similar words of the first string and provides them to the user interface unit.

According to the present invention, some of the pre-stored queries are selected as candidate strings and the actual edit distance for similarity determination is calculated only for the selected candidate strings, which improves the response speed of the service that provides similar words for an input query.

In addition, by using a wildcard character search when calculating the edit distance used for selecting candidate strings, the present invention reduces the complexity of the edit distance calculation and improves the performance of the similar word determination system.

In addition, by determining similar words using phonetic codes, the present invention can offer, as recommended queries, queries whose pronunciation is similar to that of the query entered by the user, which improves the recall of the similar word determination system.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 shows a network configuration that includes a similar word determination system according to an embodiment of the present invention. As shown, the similar word determination system (100) receives a first string from a user terminal (120) connected through the Internet (110), determines similar words for the received first string, and provides the determined similar words to the user terminal (120) as recommended queries. As shown, the similar word determination system (100) includes a user interface unit (130), a candidate string selection unit (140), a database (150), and a similar word determination unit (160).

First, the user interface unit (130) receives, through the user terminal (120), the first string entered by the user, and provides the user with the similar words for the first string supplied by the similar word determination unit (160), described later, as recommended queries.

The candidate string selection unit (140) selects candidate strings from among the second strings by using the first edit distance between the first string received through the user interface unit (130) and each second string pre-stored in the database (150), the phonetic codes of the first and second strings, and the character similarity score between the first and second strings. It then calculates, for each candidate string, a final similarity score that quantifies the similarity between the first string and that candidate string. In one embodiment, the first string and the second strings stored in the database (150) may be search queries used in providing a search service.

Specifically, the candidate string selection unit (140) selects as candidate strings those second strings whose first edit distance is at or below the threshold (for example, second strings whose first edit distance is 1), second strings having the same phonetic code as the first string, or second strings whose character similarity score is within the top N, and calculates a final similarity score for each selected candidate string. To this end, as shown in FIG. 2, the candidate string selection unit (140) includes an edit distance calculation unit (200), a normalization unit (210), a phonetic code comparison unit (220), a character similarity score calculation unit (230), a candidate string determination unit (240), and a final similarity score calculation unit (250). The candidate string selection unit (140) is described in detail below with reference to FIG. 2.

First, the edit distance calculation unit (200) calculates the first edit distance between the first string and each second string stored in the database (150), the second edit distance between the first string and each candidate string determined by the candidate string determination unit (240), described later, or the third edit distance between the phonetic code of the first string and the phonetic code of each candidate string. The phonetic codes of the first string and of each candidate string can be obtained from the phonetic code comparison unit (220), described later. The first edit distance is used to select candidate strings from among the second strings, and the second and third edit distances are used to calculate the final similarity score of each candidate string.

In one embodiment, to reduce the time required to calculate the first edit distance, the edit distance calculation unit (200) may use a wildcard character search for each operation used to compute the first edit distance.

Here, the operations are the insertion, deletion, substitution, and transposition operations. An insertion adds a new character to a string; a deletion removes a character from a string; a substitution replaces a character in a string with a new character; and a transposition swaps the order of two adjacent characters in a string.
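An edit distance over these four operations can be sketched with the standard dynamic-programming table. The sketch below uses the optimal string alignment variant (each substring is edited at most once), which is one common way to count adjacent transpositions; the text does not state which variant is intended, and the name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertion, deletion, substitution, and
    transposition of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

With this definition, swapping two neighbouring characters costs 1 rather than the 2 it would cost with insertion, deletion, and substitution alone.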

The way the edit distance calculation unit (200) uses the wildcard character search to calculate the first edit distance between the first string and a second string is described in detail with reference to the example in FIG. 3. The example in FIG. 3 shows how the edit distance calculation unit (200) uses the wildcard search to find the second strings whose first edit distance is 1. When the first string is "abc" and second strings at a first edit distance of 1 are sought, the edit distance calculation unit (200) handles the insertion operation by inserting one arbitrary character, written "?", into the first string "abc" and searching for second strings of the forms "?abc", "a?bc", "ab?c", and "abc?".

For the deletion operation, it searches for second strings of the forms "bc", "ac", and "ab", each obtained by deleting one character from the first string "abc". For the substitution operation, it searches for second strings of the forms "?bc", "a?c", and "ab?", each obtained by replacing one character of the first string with an arbitrary character.

For the transposition operation, it searches for second strings of the forms "bac", in which the adjacent "a" and "b" of the first string "abc" are swapped, and "acb", in which the adjacent "b" and "c" are swapped.

Thus, when the edit distance calculation unit (200) uses the wildcard character search to find second strings at a first edit distance of 1 from a first string of length n, the number of lookups is n+1 for insertion, n each for deletion and substitution, and n-1 for transposition, for a total of 4n lookups over all operations. Without the wildcard character search, the number of lookups is 26(n+1) for insertion, n for deletion, 25n for substitution, and n-1 for transposition, for a total of 53n+25 (these counts correspond to a 26-letter alphabet). Using the wildcard search therefore improves speed.
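The pattern generation described above can be sketched as follows. The function names are hypothetical, and matching with Python's `fnmatch.fnmatchcase` over a plain list is only a stand-in for whatever wildcard-capable index the system uses. The sketch also assumes the query contains no `*`, `?`, or `[` characters, which `fnmatch` would treat as wildcards.

```python
from fnmatch import fnmatchcase

def distance_one_patterns(query: str) -> list[str]:
    """Build the 4n lookup patterns for a query of length n ('?' = any one character)."""
    n = len(query)
    insertions = [query[:i] + "?" + query[i:] for i in range(n + 1)]         # n + 1
    deletions = [query[:i] + query[i + 1:] for i in range(n)]                # n
    substitutions = [query[:i] + "?" + query[i + 1:] for i in range(n)]      # n
    transpositions = [
        query[:i] + query[i + 1] + query[i] + query[i + 2:] for i in range(n - 1)
    ]                                                                        # n - 1
    return insertions + deletions + substitutions + transpositions

def find_distance_one(query: str, stored: list[str]) -> list[str]:
    """Return stored strings that match any distance-1 pattern (excluding the query itself)."""
    patterns = distance_one_patterns(query)
    return [s for s in stored if s != query and any(fnmatchcase(s, p) for p in patterns)]
```

For the query "abc" this yields 12 patterns (4 insertions, 3 deletions, 3 substitutions, 2 transpositions), matching the 4n count given in the text. Duplicate patterns, which can arise when the query has repeated characters, are not removed here.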

Referring again to FIG. 2, the normalization unit (210) normalizes the first string and the second strings before the edit distance is calculated. In one embodiment, when the first and second strings are in English, the normalization unit (210) may normalize them by converting all lowercase letters to uppercase or by inserting a virtual character at the very beginning and very end of each string.

For example, the string "standard" is normalized to "_STANDARD_" by converting all lowercase letters to uppercase and inserting a virtual character such as "_" at the beginning and end of the string.
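A minimal Python sketch of this English normalization follows. The function name and the `pad` parameter are hypothetical; the "_" padding character is the one used in the example above.

```python
def normalize_english(text: str, pad: str = "_") -> str:
    """Uppercase the string and add a virtual character at both ends."""
    return f"{pad}{text.upper()}{pad}"
```

Calling `normalize_english("standard")` gives "_STANDARD_", as in the example.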

In another embodiment, when the first and second strings are in Korean (Hangul), the normalization unit (210) may normalize them by removing special characters or spaces, or by converting the syllables of each string into a string spelled out in keyboard-input jamo units. The strings are converted into keyboard-input jamo units rather than ordinary jamo units because doing so is effective for detecting typing errors. This is most evident in strings that contain compound vowels and final consonants.

For example, consider the string "월드". Normalized into keyboard-input jamo units it becomes "ㅇ, ㅜ, ㅓ, ㄹ, ㄷ, ㅡ", whereas normalized into ordinary jamo units it becomes "ㅇ, ㅝ, ㄹ (final consonant), ㄷ, ㅡ". Users who intend to type "월드" frequently type the misspelling "우러드" instead. Normalizing "우러드" yields "ㅇ, ㅜ, ㄹ, ㅓ, ㄷ, ㅡ" in both keyboard-input and ordinary jamo units. Comparing the normalized forms of "월드" and "우러드", the edit distance is 1 with keyboard-input jamo normalization (a single transposition of ㅓ and ㄹ) but 3 with ordinary jamo normalization. The normalization unit (210) is therefore needed to provide more accurate similar words for such mistyped strings.
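The keyboard-input jamo conversion can be sketched in Python using the arithmetic layout of precomposed Hangul syllables in Unicode (U+AC00 to U+D7A3). The table and function names are hypothetical, and splitting compound vowels and compound final consonants into two key presses assumes a standard two-set (Dubeolsik) keyboard, which the text does not name.

```python
# Jamo tables in the order used by Unicode Hangul syllable composition.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Compound jamo that need two key presses (assumed two-set layout).
KEY_SPLIT = {
    "ㅘ": "ㅗㅏ", "ㅙ": "ㅗㅐ", "ㅚ": "ㅗㅣ", "ㅝ": "ㅜㅓ",
    "ㅞ": "ㅜㅔ", "ㅟ": "ㅜㅣ", "ㅢ": "ㅡㅣ",
    "ㄳ": "ㄱㅅ", "ㄵ": "ㄴㅈ", "ㄶ": "ㄴㅎ", "ㄺ": "ㄹㄱ",
    "ㄻ": "ㄹㅁ", "ㄼ": "ㄹㅂ", "ㄽ": "ㄹㅅ", "ㄾ": "ㄹㅌ",
    "ㄿ": "ㄹㅍ", "ㅀ": "ㄹㅎ", "ㅄ": "ㅂㅅ",
}

def to_keystroke_jamo(text: str) -> str:
    """Spell each Hangul syllable as the sequence of jamo keys typed for it."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 11172:  # precomposed Hangul syllable
            jamo = (
                INITIALS[offset // 588]
                + MEDIALS[(offset % 588) // 28]
                + FINALS[offset % 28]
            )
        else:
            jamo = ch  # leave other characters unchanged
        out.extend(KEY_SPLIT.get(j, j) for j in jamo)
    return "".join(out)
```

Applied to the example, "월드" becomes "ㅇㅜㅓㄹㄷㅡ" and "우러드" becomes "ㅇㅜㄹㅓㄷㅡ", which differ only by one swap of adjacent jamo.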

However, because similar words for the first string can be provided by the configuration described above even without the normalization unit (210), the normalization unit (210) may be included optionally.

All second strings stored in the database (150) may be normalized in advance and stored in normalized form, so that only the first string needs to be normalized when similar words are actually determined.

The phonetic code comparison unit (220) obtains the phonetic codes of the first string and the second strings, compares the phonetic code of the first string with those of the second strings, and finds the second strings that have the same phonetic code as the first string. In one embodiment, when the first and second strings are in English, the phonetic code comparison unit (220) may obtain their phonetic codes through a phonetic algorithm such as Soundex or Metaphone; when the first and second strings are in Korean, it may obtain their phonetic codes through a phonetic algorithm such as Kodex.
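As an illustration of a phonetic code for English strings, the following is a compact Python sketch of the widely used American Soundex rules (first letter kept, consonants mapped to digits, adjacent equal codes collapsed, H and W ignored, padded to four characters). It is not taken from the patent, which only names the algorithm, and the function name is hypothetical.

```python
def soundex(word: str) -> str:
    """Return the four-character American Soundex code of a word."""
    codes = {}
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    ):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels and other letters map to ""
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]
```

Strings with different spellings but the same code, such as "Robert" and "Rupert", would be treated as having identical phonetic codes in the comparison described above.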

Meanwhile, to search more quickly for second strings having the same phonetic code as the first string, the present invention may obtain and store in advance the phonetic codes of all second strings stored in the database 150.

Thus, by considering the phonetic codes of the first and second strings when determining candidate strings, the present invention can provide, as similar words, strings that are spelled differently but have the same sound value.
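As a hedged illustration of matching by phonetic code, the sketch below implements the common American Soundex rules for English and groups stored strings by code ahead of time, as the text suggests for the database. The function names and the in-memory dictionary are assumptions made for the example; the patent does not prescribe this data structure, and Kodex for Korean is not shown.

```python
from collections import defaultdict

# Soundex consonant groups; vowels, H, W and Y carry no digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODE = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return the 4-character Soundex code of an English word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODE.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        digit = _CODE.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit                 # a vowel resets prev to ""
    return (result + "000")[:4]


def build_phonetic_index(strings):
    """Precompute code -> stored strings, so lookup needs no rescan."""
    index = defaultdict(list)
    for s in strings:
        index[soundex(s)].append(s)
    return index


def same_code_candidates(query, index):
    """Stored strings whose phonetic code equals that of the query."""
    return index.get(soundex(query), [])
```

With stored strings "Rupert", "Rubin", and "Robert", a query of "Robert" retrieves "Rupert" and "Robert", which share a code despite different spellings.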

Meanwhile, the phonetic code comparison unit 220 provides the phonetic code of the first string and the phonetic codes of the candidate strings determined by the candidate string determination unit 240, described later, to the edit distance calculation unit 200 described above, so that the edit distance calculation unit can calculate the third edit distance.

The character similarity score calculation unit 230 calculates a character similarity score for those second strings stored in the database 150 that contain characters in common with the first string, for example an ngram in common with the first string. It likewise calculates a character similarity score for those candidate strings, determined by the candidate string determination unit 240 described later, that contain characters in common with the first string, for example a common ngram. For convenience of description, it is assumed below that the character similarity score calculation unit 230 calculates the character similarity score for second strings or candidate strings that share an ngram with the first string.

Here, the character similarity scores of the second strings that share an ngram with the first string are used by the candidate string determination unit 240, described later, to determine the candidate strings, and the character similarity scores of the candidate strings that share an ngram with the first string are used by the final similarity score calculation unit 250, described later, to calculate a final similarity score for each candidate string. Because the score is calculated in the same way in both cases, the description below is given in terms of calculating the character similarity score of a second string that shares an ngram with the first string.

Specifically, the character similarity score calculation unit 230 obtains from the database 150 the second strings that contain an ngram included in the first string, and then calculates a character similarity score for each obtained second string using a character similarity score calculation function.

In one embodiment, the character similarity score calculation function may be defined using the size of the common ngrams, the number of common ngrams, the similarity of the positions at which the common ngrams are found, and the difference between the lengths of the first and second strings. The function may be defined so that the character similarity score is proportional to the size and the number of the common ngrams, and inversely proportional to the positional similarity term of the common ngrams (a distance between the positions at which they are found) and to the length difference between the two strings. The character similarity score calculation function according to one embodiment of the present invention may be defined as in Equation 1 below.

In Equation 1, the terms denote, in order, the character similarity score between the first string q and the second string t, the length of a string x, and the set of ngrams contained in a string x. The two remaining terms denote the similarity of the positions at which a common ngram is found within the first and second strings, and the difference between the lengths of the first and second strings. In Equation 1, the proportionality of the character similarity score to the size and number of the common ngrams, its inverse proportionality to the positional similarity of the common ngrams, and its inverse proportionality to the length difference between the two strings are each reflected by a corresponding term.
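The quantities that Equation 1 is described as using can be computed with a short sketch. This block deliberately stops at the ingredients (common ngrams, their size and count, and the length difference) and does not combine them, because the exact form of Equation 1 is not recoverable from this text. The function names and the returned dictionary are hypothetical.

```python
def ngrams(text, n=2):
    """Set of all contiguous substrings of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def score_ingredients(q, t, n=2):
    """Quantities named in the description of the character similarity score."""
    common = ngrams(q, n) & ngrams(t, n)
    return {
        "common_ngrams": common,                   # ngrams shared by q and t
        "ngram_size": n,                           # size of each common ngram
        "common_count": len(common),               # number of common ngrams
        "length_difference": abs(len(q) - len(t)),
    }
```

For the strings "dishwashing" and "something" with bigrams, the common ngrams are "hi", "in", and "ng", and the length difference is 2.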

Meanwhile, the term that defines the similarity of the positions at which a common ngram is found contains a distance between the position of the ngram n in the first string q and its position in the second string t. This distance is defined as in Equation 2 below.

In Equation 2, one term denotes a forward distance and the other a backward distance, and the distance between the two positions is defined as the smaller of the forward distance and the backward distance. The forward distance and the backward distance in Equation 2 may be defined as in Equations 3 and 4 below, respectively.

As Equations 3 and 4 show, the forward distance is calculated as the difference between the forward positions of the ngram in the two strings, and the backward distance is calculated as the difference between its backward positions in the two strings. For example, suppose that the first string q is "dishwashing", the second string t is "something", and bigrams are used. For the common ngram "hi", the forward position is 7 and the backward position is 2 in the first string, and the forward position is 5 and the backward position is 2 in the second string. The forward distance is therefore 2, the difference between the forward positions, and the backward distance is 0, the difference between the backward positions. The distance between the position of n in the first string and the position of n in the second string is the smaller of 2 and 0, namely 0.
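The "dishwashing" / "something" example can be checked with a minimal sketch. It assumes that the forward position is the zero-based index of the first occurrence of the ngram and that the backward position is the number of characters following that occurrence; both conventions are inferred from the numbers in the example rather than stated in the text, and the function names are hypothetical.

```python
def forward_position(ngram, text):
    """Zero-based index of the first occurrence of ngram in text."""
    pos = text.find(ngram)
    if pos < 0:
        raise ValueError(f"{ngram!r} does not occur in {text!r}")
    return pos


def backward_position(ngram, text):
    """Number of characters that follow the first occurrence of ngram."""
    return len(text) - (forward_position(ngram, text) + len(ngram))


def position_distance(ngram, q, t):
    """Smaller of the forward distance and the backward distance."""
    forward = abs(forward_position(ngram, q) - forward_position(ngram, t))
    backward = abs(backward_position(ngram, q) - backward_position(ngram, t))
    return min(forward, backward)
```

Taking the minimum means that an ngram near the end of both strings is treated as well aligned even when the strings differ in length, which is why "hi" scores 0 in this example.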

The candidate string determination unit 240 determines the following as candidate strings: the second strings, found by the edit distance calculation unit 200 described above, whose first edit distance is 1; the second strings, found by the phonetic code comparison unit 220 described above, whose phonetic code is identical to that of the first string; or, from among the second strings that share an ngram with the first string, the second strings whose character similarity score calculated by the character similarity score calculation unit 230 is within the top N.

That is, in the present invention the candidate string determination unit 240 determines, from among the second strings stored in the database 150, the candidate strings whose actual similarity to the first string is to be evaluated by the final similarity score calculation unit 250, described later. This reduces the number of second strings subject to the actual similarity evaluation and thereby improves the service response time when similar words are provided to the user.
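The three-way candidate selection can be sketched as a union of three filtered sets. This is an illustration only: the distance, phonetic code, and character similarity functions are passed in as parameters because their exact definitions are given elsewhere, all names are hypothetical, and the linear scans stand in for the precomputed lookups that the text describes.

```python
import heapq


def ngrams(text, n=2):
    """Set of all contiguous substrings of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def select_candidates(query, stored, edit_distance, phonetic_code,
                      char_similarity, top_n=3, max_distance=1, n=2):
    """Union of the three candidate sources described in the text."""
    # Source 1: stored strings within the edit distance threshold.
    by_distance = {s for s in stored
                   if edit_distance(query, s) <= max_distance}
    # Source 2: stored strings with the same phonetic code as the query.
    query_code = phonetic_code(query)
    by_sound = {s for s in stored if phonetic_code(s) == query_code}
    # Source 3: top-N character similarity among strings sharing an ngram.
    query_grams = ngrams(query, n)
    sharing = [s for s in stored if query_grams & ngrams(s, n)]
    by_similarity = set(heapq.nlargest(
        top_n, sharing, key=lambda s: char_similarity(query, s)))
    return by_distance | by_sound | by_similarity
```

Only the strings returned here would go on to the more expensive final scoring, which is the source of the response-time gain described above.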

Meanwhile, the final similarity score calculation unit 250 calculates the final similarity score between the first string and each candidate string determined by the candidate string determination unit 240. In one embodiment, the final similarity score calculation unit 250 may calculate the final similarity score of each candidate string using at least one of four factors. The first factor is the character similarity score between each candidate string and the first string, calculated by the character similarity score calculation unit 230. The second factor is an edit distance score obtained by scoring the second edit distance between the first string and each candidate string. The third factor is an inter-phonetic-code edit distance score obtained by scoring the third edit distance between the phonetic code of the first string and the phonetic code of each candidate string. The fourth factor is a common character score obtained by scoring the number of characters common to each candidate string and the first string.

For example, the final similarity score calculation unit 250 may calculate the final similarity score of each candidate string by multiplying all four factors, by multiplying only some of them, by adding all four, or by adding only some of them. The way in which the final similarity score calculation unit 250 calculates each factor used for the final similarity score of each candidate string is described in detail below.
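Combining the factors by product or by sum can be sketched in a few lines. The function name and the `mode` parameter are hypothetical, and passing a subset of the factors corresponds to the "only some of them" options in the text.

```python
import math


def final_similarity(factors, mode="product"):
    """Combine the supplied factor values by multiplication or addition."""
    values = list(factors)
    if not values:
        raise ValueError("at least one factor is required")
    if mode == "product":
        return math.prod(values)
    if mode == "sum":
        return sum(values)
    raise ValueError(f"unknown mode: {mode!r}")
```

A product drives the score to zero as soon as any one factor is zero, whereas a sum lets a strong factor compensate for a weak one; the text leaves the choice open.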

First, for the first factor, the character similarity score between each candidate string and the first string, the result calculated by the character similarity score calculation unit 230 described above may be used.

Next, the final similarity score calculation unit 250 may calculate the second factor, the edit distance score, in one of three ways: as the reciprocal of the sum of the second edit distance and a first reference value, as in Equation 5 below; as the difference between a second reference value and the second edit distance, as in Equation 6; or as the ratio of the difference between the length of the first string and the second edit distance to the length of the first string, as in Equation 7. Equation 5 uses a first reference value of 1, but the value is not limited to this and may be replaced with another value.

In Equations 5 to 7, the symbols denote, in order, the edit distance score, the minimum edit distance between the first string q and the candidate string t, the predetermined second reference value, and the length of the first string q.

Next, the final similarity score calculation unit 250 may calculate the third factor, the inter-phonetic-code edit distance score, in one of three ways: as the reciprocal of the sum of the third edit distance and the first reference value, as in Equation 8 below; as the difference between the second reference value and the third edit distance, as in Equation 9; or as the ratio of the difference between the phonetic code length of the first string and the third edit distance to the phonetic code length of the first string, as in Equation 10. Equation 8 uses a first reference value of 1, but the value is not limited to this and may be replaced with another value.

In Equations 8 to 10, the symbols denote, in order, the inter-phonetic-code edit distance score, the phonetic code of the first string q, the edit distance between the phonetic code of the first string and the phonetic code of the candidate string, the second reference value, and the length of the phonetic code of the first string.
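The three scoring variants described for the second factor (Equations 5 to 7) and for the third factor (Equations 8 to 10) share the same shape, so one sketch can cover both. The functions below take an already computed edit distance and, where needed, a length; the function names are hypothetical, and the symbols of the original equations are not reproduced here.

```python
def reciprocal_score(distance, first_reference=1):
    """Reciprocal of (distance + first reference value)."""
    return 1 / (distance + first_reference)


def difference_score(distance, second_reference):
    """Second reference value minus the distance."""
    return second_reference - distance


def ratio_score(distance, length):
    """(length - distance) / length, where length belongs to the first
    string, or to its phonetic code for the third factor."""
    if length <= 0:
        raise ValueError("length must be positive")
    return (length - distance) / length
```

For the second factor, `distance` would be the second edit distance and `length` the length of the first string; for the third factor, both would be taken over the phonetic codes instead.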

Next, the final similarity score calculation unit 250 may calculate the fourth factor, the common character score, in one of two ways. As in Equation 11 below, it may use the ratio of the number of characters common to the first characters (the characters contained in the first string) and the second characters (the characters contained in each candidate string) to the sum of the number of first characters and the number of second characters. Alternatively, as in Equation 12 below, it may use the ratio of the number of characters common to the first characters and the second characters to the number of first characters.

In Equations 11 and 12, the symbols denote, in order, the common character score, the set of characters belonging to the first string q, and the number of characters belonging to the first string q.
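The two common character scores can be sketched with set operations. The sketch assumes that "the number of characters" means the number of distinct characters in each string, which follows from the text's reference to a set of characters but is not stated outright; the function names are hypothetical.

```python
def common_char_score_sum(q, t):
    """Common characters divided by (characters of q + characters of t)."""
    cq, ct = set(q), set(t)
    if not cq and not ct:
        raise ValueError("both strings are empty")
    return len(cq & ct) / (len(cq) + len(ct))


def common_char_score_query(q, t):
    """Common characters divided by the characters of the first string q."""
    cq, ct = set(q), set(t)
    if not cq:
        raise ValueError("first string is empty")
    return len(cq & ct) / len(cq)
```

The first variant penalizes long candidates because their characters enlarge the denominator, while the second depends only on how much of the first string is covered.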

In this way, the final similarity score calculation unit 250 can calculate the final similarity score between each candidate string and the first string through various methods.

Referring again to FIG. 1, the similar word determination unit 160 determines, from among the candidate strings selected by the candidate string selection unit 140, those whose final similarity score is within the top N as similar words of the first string, and provides them to the user interface unit 130. The similar words provided to the user interface unit 130 are presented to the user through the user interface unit 130 as recommended queries for the first string.

Meanwhile, although the embodiment above describes the database 150 that stores the second strings as being included in the similar word determination system 100, in a modified embodiment the database 150 may be included in a separate system.

A similar word determination method according to the present invention is described below with reference to FIGS. 4 and 5. FIG. 4 is a flowchart showing a similar word determination method according to one embodiment of the present invention.

As shown, the first string entered by a user is received through the user terminal (S400).

Then, candidate strings are selected from among the second strings stored in the database using the first edit distance between the received first string and the second strings recorded in the database, the phonetic codes of the first and second strings, or the character similarity score between the first string and the second strings (S410). In one embodiment, the first and second strings may be search queries used in providing a search service. The method of selecting candidate strings is described in detail below with reference to FIG. 5.

First, the first string and the second strings are normalized (S502). In one embodiment, when the first and second strings are English, normalization is performed by converting lowercase letters to uppercase or by inserting virtual characters before and after the strings. When the first and second strings are Korean, normalization may be performed by removing special characters or spaces from the strings or by converting the strings into keyboard-input jamo units.

Then, the first edit distance between the normalized first string and each normalized second string is calculated, and the second strings whose first edit distance is at or below a threshold are obtained (S504). Here, the threshold may be 1. In one embodiment, the first edit distance may be calculated using a wildcard character search method for each operation involved in calculating the first edit distance. The operations may include at least one of an insertion operation, a deletion operation, a substitution operation, and a transposition operation. The method of calculating the first edit distance between the first string and a second string using the wildcard character search method was described above in connection with the edit distance calculation unit, so a detailed description is omitted here.
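One way to find stored strings within edit distance 1 by wildcard lookups, rather than by computing a distance against every stored string, is sketched below. This is a generic technique offered as an assumption; it is not necessarily the wildcard character search method of FIG. 3, whose details appear elsewhere in the description. The `*` wildcard symbol is assumed not to occur in stored strings, and all names are hypothetical.

```python
from collections import defaultdict

WILDCARD = "*"  # assumed never to appear in a stored string


def build_wildcard_index(strings):
    """Index each stored string under every one-character-masked form."""
    exact = set(strings)
    masked = defaultdict(set)
    for s in strings:
        for i in range(len(s)):
            masked[s[:i] + WILDCARD + s[i + 1:]].add(s)
    return exact, masked


def within_distance_one(query, exact, masked):
    """Stored strings reachable from query by at most one operation."""
    hits = set()
    if query in exact:
        hits.add(query)
    for i in range(len(query)):
        # Substitution: mask one character of the query.
        hits |= masked.get(query[:i] + WILDCARD + query[i + 1:], set())
        # Stored string equals the query with one character removed.
        shorter = query[:i] + query[i + 1:]
        if shorter in exact:
            hits.add(shorter)
    for i in range(len(query) + 1):
        # Stored string equals the query with one character added.
        hits |= masked.get(query[:i] + WILDCARD + query[i:], set())
    for i in range(len(query) - 1):
        # Transposition of two adjacent characters.
        swapped = query[:i] + query[i + 1] + query[i] + query[i + 2:]
        if swapped in exact:
            hits.add(swapped)
    return hits
```

The number of lookups grows with the length of the query rather than with the number of stored strings, at the cost of an index that holds one masked key per character of every stored string.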

Meanwhile, separately from S502 and S504, the phonetic code of each second string stored in the database is compared with that of the first string to obtain the second strings whose phonetic code is identical to that of the first string (S506). In one embodiment, the phonetic codes of the first and second strings may be obtained using the Soundex or Metaphone phonetic algorithm when the strings are English, and using the Kodex phonetic algorithm when the strings are Korean.

Meanwhile, separately from S502 to S506, from among the second strings that contain characters in common with the first string, for example a common ngram, the second strings whose character similarity score is within the top N are obtained (S508). In one embodiment, the character similarity score may be determined using a character similarity score calculation function defined in terms of the size of the common ngrams, the number of common ngrams, the similarity of the positions at which the common ngrams are found, and the length difference between the first string and each second string.

Here, as in Equation 1 above, the character similarity score calculation function may be defined so that the character similarity score is proportional to the size and the number of the common ngrams and inversely proportional to the positional similarity of the common ngrams and to the length difference between the first string and each second string. The character similarity score was described in detail above in connection with Equation 1, so a detailed description is omitted here.

Finally, the second strings whose first edit distance is at or below the threshold, the second strings whose phonetic code is identical to that of the first string, or the second strings whose character similarity score is within the top N among those containing an ngram in common with the first string are determined as the candidate strings (S510).

Referring again to FIG. 4, a final similarity score with respect to the first string is calculated for each candidate string determined in S510 (S404). In one embodiment, the final similarity score may be calculated using at least one of four factors. The first factor is the character similarity score between each candidate string and the first string. The second factor is an edit distance score obtained by scoring the second edit distance between each candidate string and the first string. The third factor is a phonetic code edit distance score obtained by scoring the third edit distance between the phonetic code of each candidate string and the phonetic code of the first string. The fourth factor may be a common character score obtained by scoring the number of characters common to each candidate string and the first string.

Accordingly, the final similarity score may be calculated by multiplying all four factors, by multiplying only some of them, by adding all four, or by adding only some of them.

Here, the method of calculating the first factor, the character similarity score between a candidate string and the first string, was described above in connection with the character similarity score calculation unit, so a detailed description is omitted.

Next, as in Equations 5 to 7 above, the second factor, the edit distance score, may be calculated using any one of the reciprocal of the sum of the second edit distance and the first reference value, the difference between the second reference value and the second edit distance, or the ratio of the difference between the length of the first string and the second edit distance to the length of the first string.

Likewise, as in Equations 8 to 10 above, the third factor, the inter-phonetic-code edit distance score, may be calculated using any one of the reciprocal of the sum of the third edit distance and the first reference value, the difference between the second reference value and the third edit distance, or the ratio of the difference between the phonetic code length of the first string and the third edit distance to the phonetic code length of the first string.

Likewise, as in Equations 11 and 12 above, the fourth factor, the common character score, may be calculated using either the ratio of the number of characters common to the first characters (the characters contained in the first string) and the second characters (the characters contained in the candidate string) to the sum of the number of first characters and the number of second characters, or the ratio of the number of characters common to the first and second characters to the number of first characters.

Finally, from among the candidate strings, those whose final similarity score is within the top N are determined as similar words of the first string (S406), and the determined similar words are provided to the user as recommended queries (S408).

The similar word determination method described above may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the art of computer software.

Computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Such a recording medium may also be a transmission medium, such as an optical or metal line or a waveguide, that includes a carrier wave transmitting a signal specifying program instructions, data structures, and the like.

In addition, the program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

Meanwhile, those skilled in the art to which the present invention pertains will understand that the present invention described above may be embodied in other specific forms without changing its technical spirit or essential features.

Therefore, the embodiments described above are to be understood as illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and from their equivalents are to be construed as falling within the scope of the present invention.

FIG. 1 is a schematic block diagram of a similar word determination system according to an embodiment of the present invention.

FIG. 2 is a diagram showing the detailed configuration of the candidate string determination unit shown in FIG. 1.

FIG. 3 is a diagram showing an example of the wildcard character search method.

FIG. 4 is a flowchart showing a similar word determination method according to an embodiment of the present invention.

FIG. 5 is a flowchart showing a candidate string determination method.

<Explanation of reference numerals for the main parts of the drawings>

100: similar word determination system 110: Internet

120: user terminal 130: user interface unit

140: candidate string selection unit 150: database

160: similar word determination unit

Claims

A method for determining and providing similar words, comprising:

when a first string is input, selecting as candidate strings, from among pre-stored second strings, a second string whose first edit distance from the first string is at or below a reference value, a second string having the same phonetic code as the first string, or second strings whose character similarity score with the first string ranks within the top N among the second strings that share characters with the first string;

calculating a final similarity score between each candidate string and the first string; and

determining, as similar words of the first string, the candidate strings whose final similarity score ranks within the top N.

The method of claim 1,

wherein the first and second strings are search queries.

The method of claim 1,

further comprising, before the candidate string selection step, receiving the first string from a user; and

further comprising, after the similar word determination step, providing the determined similar word to the user as a recommended query.

The method of claim 1, wherein, in the candidate string selection step,

the second strings whose first edit distance is at or below the reference value are selected using a wildcard character search performed for each operation used in calculating the first edit distance.

The method of claim 4,

wherein the operations of the first edit distance include at least one of an insertion operation, a deletion operation, a substitution operation, and a transposition operation.
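
The per-operation wildcard search described in the two preceding claims can be illustrated with a minimal Python sketch, limited to an edit distance of one. The sentinel character, the function names, and the use of regular expressions over an in-memory list are all assumptions made for illustration; the claims do not specify how the wildcard lookup is implemented or indexed.

```python
import re

# Hypothetical wildcard sentinel; assumed never to occur in real queries.
WILDCARD = "\x00"


def wildcard_patterns(query):
    """Build one pattern per edit operation and position (edit distance 1)."""
    patterns = set()
    n = len(query)
    for i in range(n + 1):
        # Insertion: the stored string has one extra character at position i.
        patterns.add(query[:i] + WILDCARD + query[i:])
    for i in range(n):
        # Deletion: the stored string lacks the character at position i.
        patterns.add(query[:i] + query[i + 1:])
        # Substitution: the stored string may differ at position i.
        patterns.add(query[:i] + WILDCARD + query[i + 1:])
    for i in range(n - 1):
        # Transposition: two adjacent characters are swapped.
        patterns.add(query[:i] + query[i + 1] + query[i] + query[i + 2:])
    return patterns


def to_regex(pattern):
    """Turn a pattern into a regex where the sentinel matches any one character."""
    parts = ["." if ch == WILDCARD else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.DOTALL)


def candidates_within_one_edit(query, stored):
    """Return stored strings that match at least one wildcard pattern."""
    regexes = [to_regex(p) for p in wildcard_patterns(query)]
    return [s for s in stored if any(r.fullmatch(s) for r in regexes)]


stored_queries = ["naver", "navr", "navers", "anver", "never", "google"]
print(candidates_within_one_edit("naver", stored_queries))
```

Because a substitution pattern also matches the unchanged character, the query itself (edit distance zero) is returned when it is present in the stored strings.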

The method of claim 1, further comprising, before the candidate string selection step,

normalizing the first string and the second strings.

The method of claim 6,

wherein the normalization converts lowercase letters to uppercase or inserts virtual characters before and after the first and second strings when the first and second strings are English, and removes special characters or spaces from the first and second strings or converts the first and second strings into keyboard-input grapheme (jamo) units when the first and second strings are Korean.
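
The normalization options listed above might look like the following Python sketch. The padding character `$` and the function names are hypothetical. Unicode NFD decomposition is used here as a stand-in for conversion to keyboard-input jamo units: it splits each Hangul syllable into conjoining jamo but does not split compound jamo into individual keystrokes, so it only approximates what the claim describes. The claim presents the options as alternatives; this sketch applies them together for brevity.

```python
import unicodedata

# Hypothetical virtual character marking the start and end of a string.
PAD = "$"


def normalize_english(text):
    """Uppercase the string and add a virtual character on both sides."""
    return PAD + text.upper() + PAD


def normalize_korean(text):
    """Drop spaces and special characters, then decompose syllables into jamo."""
    kept = "".join(ch for ch in text if ch.isalnum())
    # NFD splits each precomposed Hangul syllable into its conjoining jamo.
    return unicodedata.normalize("NFD", kept)


print(normalize_english("Naver"))
print([hex(ord(ch)) for ch in normalize_korean("한 글!")])
```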

A method for determining similar words, wherein the phonetic codes of the first and second strings are obtained using the Soundex or Metaphone phonetic algorithm when the first and second strings are English, and using the Kodex phonetic algorithm when the first and second strings are Korean.
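
As an illustration of matching on identical phonetic codes, the sketch below implements the commonly described American Soundex rules in Python and filters stored strings by code equality. Only Soundex is shown; Metaphone and Kodex are not covered, and the helper name `same_phonetic_code` is hypothetical.

```python
# Letter-to-digit table for American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of an English word."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels reset the previous digit to ""
    return (code + "000")[:4]


def same_phonetic_code(query, stored):
    """Return stored strings whose Soundex code equals that of the query."""
    target = soundex(query)
    return [s for s in stored if soundex(s) == target]


print(soundex("Robert"), soundex("Rupert"))
print(same_phonetic_code("Robert", ["Rupert", "Rubin", "Ashcraft"]))
```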

A method for determining similar words, wherein the second queries that share characters with the first string are second queries that share an n-gram with the first string, and

the character similarity score is determined using the size of the n-gram shared with the first string, the number of shared n-grams, the similarity of the positions at which the shared n-gram is found, and the difference in length between the first string and each second string.

The method of claim 9,

wherein the character similarity score is proportional to the size of the shared n-gram and the number of shared n-grams, and inversely proportional to the similarity of the positions at which the shared n-gram is found and to the difference in length between the first string and the second strings.

The method of claim 9,

wherein the similarity of the positions at which the shared n-gram is found is defined as the smaller of (a) the difference between the number of characters preceding the shared n-gram in the first string and the number of characters preceding the shared n-gram in each second string, and (b) the difference between the number of characters following the shared n-gram in the first string and the number of characters following the shared n-gram in each second string.
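
The position measure defined in the preceding claim can be sketched in Python as follows. Using the first occurrence of the shared n-gram in each string, and taking absolute differences, are assumptions; the claim does not say how repeated occurrences are handled.

```python
def position_similarity(first, second, ngram):
    """Smaller of the prefix-length difference and the suffix-length difference.

    A smaller value means the shared n-gram sits at a more similar position.
    """
    i = first.find(ngram)   # first occurrence only (assumption)
    j = second.find(ngram)
    if i < 0 or j < 0:
        raise ValueError("n-gram must occur in both strings")
    before_diff = abs(i - j)
    after_first = len(first) - i - len(ngram)
    after_second = len(second) - j - len(ngram)
    after_diff = abs(after_first - after_second)
    return min(before_diff, after_diff)


# "av" is preceded by 1 vs. 2 characters but followed by 2 characters in both.
print(position_similarity("naver", "knaver", "av"))
```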

The method of claim 1, wherein, in the final similarity score calculation step,

the final similarity score is calculated using at least one of: a character similarity score between each candidate string and the first string; an edit distance score obtained by scoring a second edit distance between each candidate string and the first string; a phonetic code edit distance score obtained by scoring a third edit distance between the phonetic code of each candidate string and the phonetic code of the first string; and a common character score obtained by scoring the number of characters shared by each candidate string and the first string.

The method of claim 12,

wherein the final similarity score is calculated using the product of the character similarity score between each candidate string and the first string, the edit distance score, the phonetic code edit distance score, and the common character score.
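
A minimal Python sketch of the product-based final score, followed by the top-N selection from claim 1, is given below. The component score values in the example are made up purely for illustration, and the function names are hypothetical.

```python
import math


def final_similarity_score(char_sim, edit_score, phonetic_score, common_score):
    """Multiply the four component scores into one final similarity score."""
    return math.prod((char_sim, edit_score, phonetic_score, common_score))


def top_n(scored_candidates, n):
    """Return the n candidates with the highest final similarity score."""
    ranked = sorted(scored_candidates.items(), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:n]]


# Illustrative component scores only; not taken from the patent.
scores = {
    "never": final_similarity_score(0.8, 0.5, 1.0, 0.5),
    "navel": final_similarity_score(0.9, 0.5, 0.5, 0.8),
    "nav": final_similarity_score(0.2, 0.3, 0.5, 0.4),
}
print(top_n(scores, 2))
```

One consequence of multiplying is that a single component score of zero eliminates a candidate regardless of the other components.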

The method of claim 12,

wherein the edit distance score is calculated using any one of: the reciprocal of the sum of the second edit distance and a first reference value; the difference between a second reference value and the second edit distance; and the ratio of the difference between the length of the first string and the second edit distance to the length of the first string.
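
The three alternative edit distance scores above can be written out in Python as follows. The edit distance itself is computed here with the optimal string alignment variant (insertion, deletion, substitution, adjacent transposition), which is an assumption about how the distance is measured. The default reference values of 1 and 10 are hypothetical; the claim does not state them.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def reciprocal_score(distance, first_reference=1):
    """Reciprocal of (edit distance + first reference value)."""
    return 1 / (distance + first_reference)


def difference_score(distance, second_reference=10):
    """Second reference value minus the edit distance."""
    return second_reference - distance


def ratio_score(first, distance):
    """(length of first string - edit distance) / length of first string."""
    return (len(first) - distance) / len(first)


dist = edit_distance("naver", "anver")
print(dist, reciprocal_score(dist), difference_score(dist), ratio_score("naver", dist))
```

The same three formulas apply to the phonetic code edit distance score if the phonetic codes are passed in place of the strings.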

The method of claim 12,

wherein the phonetic code edit distance score is calculated using any one of: the reciprocal of the sum of the third edit distance and the first reference value; the difference between the second reference value and the third edit distance; and the ratio of the difference between the length of the phonetic code of the first string and the third edit distance to the length of the phonetic code of the first string.

The method of claim 12,

wherein the common character score is calculated using either (a) the ratio of the number of characters common to first characters, which are the characters contained in the first string, and second characters, which are the characters contained in the candidate string, to the sum of the number of first characters and the number of second characters, or (b) the ratio of the number of characters common to the first characters and the second characters to the number of first characters.
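
Both variants of the common character score can be sketched in Python as shown below. Counting shared characters as a multiset intersection (so that a repeated character counts once per matched occurrence) is an assumption; the claim does not say whether duplicates are counted.

```python
from collections import Counter


def common_character_count(first, candidate):
    """Number of characters shared by both strings, counted as a multiset."""
    shared = Counter(first) & Counter(candidate)
    return sum(shared.values())


def common_score_over_sum(first, candidate):
    """Shared characters divided by the combined length of both strings."""
    total = len(first) + len(candidate)
    return common_character_count(first, candidate) / total if total else 0.0


def common_score_over_first(first, candidate):
    """Shared characters divided by the length of the first string."""
    return common_character_count(first, candidate) / len(first) if first else 0.0


print(common_character_count("naver", "never"))
print(common_score_over_sum("naver", "never"))
print(common_score_over_first("naver", "never"))
```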

A recording medium on which a program for performing the method of any one of claims 1 to 16 is recorded.

A system for determining similar words, comprising:

a user interface unit that receives a first string input by a user and provides the user with similar words for the first string as recommended queries;

a candidate string selection unit that selects candidate strings from among pre-stored second strings using the edit distance between the first string and each second string, the phonetic codes of the first and second strings, and the character similarity score between the first and second strings, and that calculates a final similarity score between each selected candidate string and the first string; and

a similar word determination unit that determines, as similar words of the first string, the candidate strings whose final similarity score ranks within the top N, and provides them to the user interface unit.

The system of claim 18,

wherein the first and second strings are search queries.

The system of claim 19, wherein the candidate string selection unit comprises:

an edit distance calculation unit that calculates a first edit distance between the first string and each second string, a second edit distance between the first string and each candidate string, and a third edit distance between the phonetic code of the first string and the phonetic code of each candidate string;

a phonetic code comparison unit that obtains phonetic codes from the first string and the second strings and compares them;

a character similarity score calculation unit that calculates a character similarity score for those strings, among the second strings or the candidate strings, that share characters with the first string;

a candidate string determination unit that determines, as the candidate strings, a second string whose first edit distance is at or below a reference value, a second string having the same phonetic code as the first string, or second strings whose character similarity score ranks within the top N; and

a final similarity score calculation unit that calculates a final similarity score between each candidate string and the first string.

The system of claim 20,

wherein the edit distance calculation unit, when calculating the first edit distance, calculates the first edit distance for each operation using a wildcard character search.

The system of claim 20,

wherein the phonetic code comparison unit obtains the phonetic codes of the first and second strings using the Soundex or Metaphone phonetic algorithm when the first and second strings are English, and using the Kodex phonetic algorithm when the first and second strings are Korean.

The system of claim 20,

wherein the strings, among the second strings or the candidate strings, that share characters with the first string are strings that contain an n-gram in common with the second string or the candidate string, and

the character similarity score calculation unit calculates the character similarity score using the size of the shared n-gram, the number of shared n-grams, the similarity of the positions at which the shared n-gram is found, and the difference in length between the first string and each second string or each candidate string.

The system of claim 20,

wherein the character similarity score is proportional to the size of the shared n-gram and the number of shared n-grams, and inversely proportional to the similarity of the positions at which the shared n-gram is found and to the difference in length between the first string and each second string or each candidate string.

The system of claim 21,

wherein the character similarity score calculation unit is defined by the smaller of (a) the difference between the number of characters preceding the shared n-gram in the first string and the number of characters preceding the shared n-gram in each second string or candidate string, and (b) the difference between the number of characters following the shared n-gram in the first string and the number of characters following the shared n-gram in each second string or candidate string.

The system of claim 20, wherein the candidate string selection unit

further comprises a normalization unit that normalizes the first string and each second string.

The system of claim 26,

wherein the normalization unit normalizes the first and second strings by converting lowercase letters to uppercase or by inserting virtual characters before and after the first and second strings when the first and second strings are English, and by removing special characters or spaces from the first and second strings or by converting the first and second strings into keyboard-input grapheme (jamo) units when the first and second strings are Korean.

The system of claim 20,

wherein the final similarity score calculation unit calculates the final similarity score using at least one of: a character similarity score between each candidate string and the first string; an edit distance score obtained by scoring the second edit distance; a phonetic code edit distance score obtained by scoring the third edit distance; and a common character score obtained by scoring the number of characters shared by each candidate string and the first string.

The system of claim 28,

wherein the final similarity score calculation unit calculates the final similarity score by multiplying together the character similarity score between each candidate string and the first string, the edit distance score, the phonetic code edit distance score, and the common character score.

The system of claim 28,

wherein the final similarity score calculation unit calculates the edit distance score using any one of: the reciprocal of the sum of the second edit distance and a first reference value; the difference between a second reference value and the first edit distance; and the ratio of the difference between the length of the first string and the first edit distance to the length of the first string.

The system of claim 28,

wherein the final similarity score calculation unit calculates the phonetic code edit distance score using any one of: the reciprocal of the sum of the second edit distance and the first reference value; the difference between the second reference value and the second edit distance; and the ratio of the difference between the length of the phonetic code of the first string and the second edit distance to the length of the phonetic code of the first string.

The system of claim 28,

wherein the final similarity score calculation unit calculates the common character score using either (a) the ratio of the number of characters common to first characters, which are the characters contained in the first string, and second characters, which are the characters contained in the candidate string, to the sum of the number of first characters and the number of second characters, or (b) the ratio of the number of characters common to the first characters and the second characters to the number of first characters.