KR101389148B1

KR101389148B1 - Suggesting and refining user input based on original user input

Info

Publication number: KR101389148B1
Application number: KR1020077028339A
Authority: KR
Inventors: 쥔 우; 데캉 린; 저 췐; 제 저우
Original assignee: Google Inc.
Priority date: 2005-05-04
Filing date: 2006-05-04
Publication date: 2014-04-24
Also published as: US20060253427A1; JP5203934B2; US9411906B2; US8438142B2; WO2006121702A1; KR20080008400A; US20130103696A1; CN101297291A; CN102945237A; US9020924B2; US20150220547A1; EP1877939A1; CN102945237B; JP2008541233A

Abstract

A system and method are disclosed for generating modified/refined user input based on original user input, such as a search query. The method may be implemented for Roman-based languages and/or non-Roman-based languages such as Chinese. In general, the method includes: receiving an original user input and identifying the key terms in it; determining potential alternative inputs by replacing key term(s) of the original input with other terms according to a similarity matrix and/or by replacing a word sequence of the original input with another word sequence according to an expansion/contraction table, in which one word sequence is a substring of the other; and selecting the most appropriate alternative inputs according to a predetermined criterion, for example that the likelihood of the alternative input is at least that of the original input. A cache containing pre-computed original user inputs and their corresponding alternative inputs may also be provided.

Original user input, alternative user input, potential alternative user input

Description

SUGGESTING AND REFINING USER INPUT BASED ON ORIGINAL USER INPUT

The present invention relates generally to generating alternative user inputs. More specifically, systems and methods are disclosed for generating modified or refined user input based on original user input, such as a search query.

Many users modify or refine their original search query during a given search session, often repeatedly. For example, a user may change the original search query into a more specific query, a broader query, and/or a query that uses alternative query terms, until the desired search results are produced. Users refine queries in Roman-based languages such as English as well as in non-Roman-based languages such as Chinese, Japanese, Korean (CJK), and Thai. Users generally modify or refine a search query when the original query did not yield a good set of search results, for example when the query is too specific or too broad, or when it uses inappropriate terms. For instance, the original query may yield too many irrelevant results when one or more search terms are ambiguous and some of the returned documents relate to a meaning of the ambiguous term other than the one the user intended, and/or when the user is interested in only one of the many aspects of a given search term. The original query may also yield too many irrelevant results when the user is merely exploring concepts related to a given search term.

Many search engines offer a list of suggested search queries related to the user's original search query. For example, if the user's original query is "Amazon", the search engine may suggest alternative related queries such as "Amazon.com", "Amazon Rainforest", and "Amazon River". Search query suggestions may be particularly useful to users of non-Roman-based languages, such as CJK users. Specifically, because non-Roman-based languages generally have a large character set, and each character may require several keystrokes on a conventional Roman-based keyboard, such users may prefer clicking or selecting a suggested query over typing a modified query. For example, many Chinese users enter Chinese characters using pinyin (phonetic spelling). A conventional pinyin input system typically converts the pinyin input and provides a list of candidate Chinese character sets from which the user selects the intended one. Clearly, such a multi-step input process can be tedious and time consuming.

Search query suggestions may also be useful to users of Roman-based languages. Many search engines, such as Yahoo, Teoma, Alta Vista, Askjeeves, AllTheWeb, and Baidu, offer such a feature in the form of related searches, query refinement, or query clustering.

Systems and methods are disclosed for generating modified or refined user input based on original user input, such as a search query. It should be appreciated that the present invention can be implemented in numerous ways, including as a process, an apparatus, a system, a device, a method, or a computer readable medium such as a computer readable storage medium or a computer network in which program instructions are sent over optical or electronic communication lines. The term computer generally refers to any device with computing power, such as a personal digital assistant (PDA), a cellular telephone, or a network switch. Several inventive embodiments of the present invention are described below.

The method may be applied to queries in non-Roman-based languages such as Chinese. The method generally includes: receiving an original user input and identifying the key terms in it; determining potential alternative user inputs by replacing key term(s) in the original input with other terms according to a similarity matrix and/or by replacing a sequence of words in the original input with another sequence of words according to an expansion/contraction table, in which one sequence is a substring of the other; computing the likelihood of each potential alternative user input; and selecting the most appropriate alternative user inputs according to a predetermined criterion, for example that the likelihood of each selected alternative user input is at least that of the original user input. The method may also include determining whether the original user input is in a pre-computed cache of suggested alternative user inputs and, if so, outputting the pre-computed most appropriate alternative user inputs stored in that cache.

The similarity matrix may be generated using a corpus and may hold similarity values between pairs of similar terms, including phrasal terms. For example, "New York" and "Los Angeles" may have a very high similarity even though the corresponding word pairs (New and Los, York and Angeles) do not. In one embodiment, the similarity matrix may be generated by constructing feature vectors for the words in the corpus and using those feature vectors to determine the similarity value between two words or phrases.
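As an illustration of the feature-vector idea, the following Python sketch builds context-count vectors from a toy corpus and compares two terms. The windowed co-occurrence features, the cosine measure, the toy sentences, and the names `feature_vectors` and `cosine` are all assumptions made for this example; the passage above does not specify which features or which similarity function are used.

```python
from collections import Counter, defaultdict
from math import sqrt

def feature_vectors(sentences, window=1):
    """Map each word to a Counter of neighbouring words (hypothetical features)."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus, invented for illustration only.
corpus = [
    "local communities welcome visitors",
    "local neighborhoods welcome visitors",
    "the camera takes photos",
]
vecs = feature_vectors(corpus)
sim_close = cosine(vecs["communities"], vecs["neighborhoods"])  # shared contexts
sim_far = cosine(vecs["communities"], vecs["camera"])           # no shared contexts
```

Terms that occur in the same contexts score close to 1, while terms with no shared context score 0; a full similarity matrix would store such a value for every term pair of interest.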

The expansion/contraction table may be generated from a user input database and may have a frequency value associated with each pair of term sequences. In one embodiment, the table may be generated by determining frequent word sequences, filtering out non-phrasal word sequences, and associating a count with each term sequence as its frequency value. Merely for illustration, an example entry in the expansion/contraction table is the pair "The United States of America" and "United States".
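The table construction can be sketched in Python as follows, under stated assumptions: the query log is a plain list of strings, "frequent" means a count of at least `min_count`, and the non-phrasal filtering step is omitted, so single words also appear as entries. The function names and the toy log are hypothetical.

```python
from collections import Counter

def ngrams(words, max_len):
    """Yield every contiguous word sequence of length 1..max_len as a tuple."""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def build_table(queries, min_count=2, max_len=5):
    """Pair frequent sequences where one is a proper sub-sequence of the other."""
    counts = Counter()
    for query in queries:
        counts.update(ngrams(query.lower().split(), max_len))
    frequent = {seq: c for seq, c in counts.items() if c >= min_count}
    table = {}
    for longer in frequent:
        for shorter in frequent:
            if len(shorter) < len(longer) and contains(longer, shorter):
                # Store the counts of both sequences as the entry's frequency values.
                table[(" ".join(shorter), " ".join(longer))] = (
                    frequent[shorter], frequent[longer])
    return table

# Toy query log, invented for illustration only.
log = [
    "the united states of america",
    "the united states of america map",
    "united states",
    "united states population",
]
table = build_table(log)
entry = table[("united states", "the united states of america")]
```

Here `entry` holds the counts of the shorter and the longer sequence in the toy log. A fuller version would drop sequences that are not phrases before pairing them.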

The likelihood of a potential alternative user input may be computed by determining at least one of: (a) the relevance between the original user input and the potential alternative user input, (b) the probability that the potential alternative user input will be selected by the user, and (c) a score for the position of the potential alternative user input. In particular, the relevance between the original and the potential alternative user input may be determined using correlation values between the aligned terms of the two inputs.

In another embodiment, a system for suggesting alternative user inputs generally includes a suggestion/refinement server configured to: receive an original user input having at least one key term; identify the key terms of the original user input; determine potential alternative user inputs by performing at least one of (a) replacing at least one key term of the original user input with an alternative term according to a similarity matrix that holds similarity values between pairs of terms, and (b) replacing a sequence of words of the original user input with an alternative sequence of words according to an expansion/contraction table, in which one sequence is a substring of the other and each sequence of terms has an associated frequency value; compute the likelihood of each potential alternative user input; and select and output the most appropriate alternative user inputs according to a predetermined criterion.

In another embodiment, a computer program product for suggesting alternative user inputs is used in conjunction with a computer system and includes a computer readable storage medium storing instructions executable on a computer processor. The instructions may generally include: receiving an original user input and identifying its key terms; determining potential alternative user inputs by replacing key term(s) of the original input with alternative terms according to a similarity matrix and/or by replacing a sequence of words of the original input with an alternative sequence of words according to an expansion/contraction table, in which one sequence is a substring of the other; computing the likelihood of each potential alternative user input and, additionally, the predicted user satisfaction with it; and selecting the most appropriate alternative user inputs according to a predetermined criterion, for example that the likelihood of each selected alternative user input is at least that of the original user input.

An application implementing the systems and methods may be implemented on a server site, such as a search engine, or on a client site, such as a user's computer (for example, as a download), to provide suggested alternative inputs or to interface with a remote server such as a search engine.

These and other features and advantages of the present invention are presented in more detail in the following detailed description and the accompanying figures, which illustrate by way of example the principles of the invention.

The present invention will be readily understood from the following detailed description in conjunction with the accompanying drawings, in which like reference numerals designate like structural elements.

FIG. 1A is a block diagram of an exemplary system for generating suggested modified/refined user input, such as a user search query.

FIG. 1B is a block diagram illustrating a process for generating a similarity matrix by a similar word extractor of a suggestion/refinement server.

FIG. 1C is a block diagram illustrating a process for generating an expansion/contraction table by an expansion/contraction table generator of the suggestion/refinement server.

FIG. 1D is a block diagram illustrating a process for generating an initial modification/refinement cache by a session parser of the suggestion/refinement server.

FIG. 2A is a flowchart illustrating an exemplary process for generating modified/refined user input that may be implemented by the system shown in FIG. 1A.

FIG. 2B is a flowchart illustrating an exemplary process for generating suggested modified/refined user input, such as a user query, that may be implemented by the system shown in FIG. 1A.

FIG. 3 is an exemplary query lattice diagram generated by parsing an original user query.

FIG. 4 is a flowchart illustrating an exemplary process for constructing a similarity matrix used to generate suggested modified/refined queries by replacing query terms.

FIG. 5 is a table listing features and corresponding counts for the term "communities" generated from an exemplary text.

FIG. 6 is a table listing exemplary features and corresponding counts for the term "communities" generated from a corpus.

FIG. 7 is an exemplary similarity matrix used to replace terms in order to generate suggested modified/refined queries.

FIG. 8 is a flowchart illustrating an exemplary process for constructing an expansion/contraction table of compound pairs used to generate suggested modified/refined queries by replacing compounds in a query.

FIG. 9 is a table showing some exemplary entries of an expansion/contraction table used to replace compounds in a query in order to generate suggested modified/refined queries.

FIG. 10 is a flowchart illustrating an exemplary process for determining the score of a suggested modified/refined query.

FIG. 11 is a diagram illustrating an example of the alignment mapping between the terms of two queries Q and Q'.

FIG. 12 is a flowchart illustrating an exemplary process for generating correlation values for detected new entities.

Systems and methods are disclosed for generating modified or refined user input based on original user input, such as a search query. For purposes of clarity only, the examples presented herein are generally expressed in terms of Chinese query inputs. However, the systems and methods for suggesting refined/modified user input may be similarly applicable to other non-Roman-based languages, such as Japanese, Korean, and Thai, as well as to Roman-based languages. They may also be similarly applicable to other, non-query user inputs. The following description is presented to enable any person skilled in the art to make and use the invention. Descriptions of specific embodiments and applications are provided only as examples, and various modifications will be readily apparent to those skilled in the art. The general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Thus, the present invention is to be accorded the widest scope, encompassing numerous alternatives, modifications, and equivalents consistent with the principles and features disclosed herein. For purposes of clarity, details relating to technical material known in the technical fields related to the invention have not been described in detail so as not to unnecessarily obscure the present invention.

The systems and methods generate modified or refined user input based on original user input, such as a query, using the users' query history and the relationships between the terms of user queries. They may include systems and methods for extracting new terms, including new named entities (e.g., proper names and names of movies, songs, and products), and the relationships between terms. The systems and methods described herein are suited to generating query (or other user input) refinements, but may also be adapted to many other applications, such as news article classification, spelling correction, media search, and segmentation. For many users, the initial search query is often not the best one, so the user modifies or refines it, sometimes several times, during a given search session.

FIG. 1A is a block diagram of an exemplary system 20 for generating suggested modified/refined inputs 26 from an original user input such as a user search query 22. The system 20 generally includes a suggestion/refinement server 24 that generates the suggested modified/refined queries 26 using probabilities that may be derived from various data sources. One such data source is an additional suggestion/refinement cache 36, which stores pre-computed query suggestions or refinements. The suggestion/refinement cache 36 may be initially generated by a session parser 24C of the suggestion/refinement server 24. Other data sources include a similarity matrix 38, which may be generated by a similar word extractor 24A of the suggestion/refinement server 24, and an expansion/contraction table 39, which may be generated by an expansion/contraction table generator 24B. The similarity matrix 38 and the expansion/contraction table 39 generally approximate the relationships between terms and/or sequences of terms. The system 20 may periodically update and/or regenerate the similarity matrix 38 and/or the expansion/contraction table 39. The similar word extractor 24A, the expansion/contraction table generator 24B, and the session parser 24C of the suggestion/refinement server 24 are described in more detail below with reference to FIGS. 1B-1D, respectively.

FIG. 1B is a block diagram illustrating a process for generating the similarity matrix 38 by the similar word extractor 24A. As shown, the similar word extractor 24A may use various data sources to generate the similarity matrix 38. Examples of these data sources include a corpus such as a web corpus 30 (e.g., news, web pages, and anchor text information), queries and associated user selections such as those stored in query logs 32, and/or session data 34, which may include the history of queries in each given session. The query logs 32 may contain not only a log of user queries but also the search result selections made by the user and the length of time the user stayed at a selected search result before, for example, returning to the search results.

FIG. 1C is a block diagram illustrating a process for generating the expansion/contraction table 39 by the expansion/contraction table generator 24B. As shown, the expansion/contraction table generator 24B may use the query logs 32 and/or the session data 34 as its data sources. FIG. 1D is a block diagram illustrating a process for generating an initial modification/refinement cache 36a by the session parser 24C. As shown, the session parser 24C may use the session data 34 as its data source.

FIGS. 2A and 2B are flowcharts illustrating exemplary processes that may be performed by the suggestion/refinement server 24. In particular, FIG. 2A is a flowchart illustrating an exemplary process 40 for generating a modification/refinement user input cache, as may be implemented by the system 20 shown in FIG. 1A. At block 41, an initial modification/refinement cache may be generated from the session data using the session parser. As noted above, the session data may include the history of queries in each given user input or query session. Next, at block 42, the process 40 enters a loop comprising blocks 43-48 that is performed for each of a predetermined number of the most common user inputs, e.g., queries. Specifically, at block 43, the process looks up the suggested modified/refined queries in the cache. The lookup at block 43 may yield suggestions 1, 2, ..., M.

Each user input or query entry in the modification/refinement cache contains a list of a predetermined number N of suggested queries. Thus, blocks 44-47 may also be performed to generate suggestions M+1, M+2, ..., N, i.e., to fill the list of suggested queries for each query. In particular, at blocks 44 and 45, an extended query lattice may be (conceptually) constructed in order to generate additional suggested modified/refined (alternative) queries. Block 44 generally represents the term replacement method of query modification/refinement, and block 45 generally represents the expansion/contraction method. Specifically, at block 44, the extended query lattice may be built by replacing terms in the original query with similar terms using a similarity matrix of similar terms. Term replacement replaces a word or term (including phrasal terms) in the original query with a similar word or term. Similar terms may include synonyms or near synonyms (e.g., community and neighborhood), acronyms, and/or terms in the same syntactic/semantic category (e.g., Toyota and Honda, Dell and HP, DVD and digital camera, and Nokia and Motorola).
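The term replacement step of block 44 might be sketched in Python as below. This is a simplified illustration, not the lattice construction itself: it swaps one term at a time, represents the similarity matrix as a nested dictionary, and uses an arbitrary threshold `min_sim`. The similarity values, the threshold, and the function name are assumptions made for this example.

```python
def substitute_terms(query_terms, similarity, min_sim=0.5):
    """Return alternative queries made by swapping one term for a similar term."""
    candidates = []
    for i, term in enumerate(query_terms):
        for other, sim in similarity.get(term, {}).items():
            if sim >= min_sim:
                # Rebuild the query with the i-th term replaced.
                candidate = query_terms[:i] + [other] + query_terms[i + 1:]
                candidates.append((" ".join(candidate), sim))
    # Highest-similarity replacements first.
    return sorted(candidates, key=lambda pair: -pair[1])

# Made-up similarity values, for illustration only.
similarity = {
    "toyota": {"honda": 0.8, "camera": 0.1},
    "dealer": {"dealership": 0.9},
}
alternatives = substitute_terms(["toyota", "dealer"], similarity)
```

With these made-up values the low-similarity pair (toyota, camera) is filtered out, leaving "toyota dealership" and "honda dealer" as candidates.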

At block 45, the extended query lattice may additionally and/or alternatively be built by adding or deleting terms in the original query using an expansion/contraction table of compound pairs. In particular, each entry in the expansion/contraction table is a pair of compounds in which one compound is a substring of the other, e.g., T₁T₂ <=> T₁T₂T₃ and T₄T₅T₆ <=> T₄T₅. Examples of Chinese compound pairs include Shanghai and Shanghai City, and television and television set. Compound pairs may include ambiguous terms and their unambiguous contexts (e.g., Amazon and Amazon rain forest and/or Amazon.com), concepts and their definitions (e.g., cell and stem cell and/or cell phone), terms and their attributes (e.g., computer and memory, hard disk drive, and/or DVD drive), and names (e.g., names of people or companies) and their corresponding activities, occupations, products, and so on (e.g., actor-movie such as Tom Hanks and Forrest Gump, company-product such as Apple and iPod, person-company or person-title such as Bill Gates and Microsoft or CEO, author-book, and singer-song).
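A minimal Python sketch of the expansion/contraction step follows, assuming space-delimited queries and a table given as (shorter, longer) string pairs. Real Chinese input would first need segmentation, and the function name and the sample pairs are hypothetical.

```python
def expand_or_contract(query, pairs):
    """Return queries obtained by swapping a sequence for its paired sequence."""
    padded = f" {query} "  # pad so that only whole words match
    results = []
    for shorter, longer in pairs:
        if f" {longer} " in padded:
            # Contraction: the longer sequence is present, so shorten it.
            results.append(padded.replace(f" {longer} ", f" {shorter} ", 1).strip())
        elif f" {shorter} " in padded:
            # Expansion: only the shorter sequence is present, so lengthen it.
            results.append(padded.replace(f" {shorter} ", f" {longer} ", 1).strip())
    return results

# Sample pairs taken from the examples in the text.
pairs = [("amazon", "amazon rain forest"), ("cell", "stem cell")]
expanded = expand_or_contract("amazon tours", pairs)
contracted = expand_or_contract("stem cell research", pairs)
```

The longer form is checked first so that a query already containing "stem cell" is contracted rather than expanded a second time.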

Once the extended query lattice has been constructed to include the various alternative paths, the paths and scores of a predetermined number of the best queries in the lattice are identified at block 46 as potential suggested queries. At block 47, the score of the original common user query is computed so that only those potential suggested queries whose scores are at least that of the original common user query are provided as suggested modified/refined queries. The score may represent the likelihood that a given query (the original or a potential suggested query) is the query selected or intended by the user. Queries whose scores are at least that of the original common user query may be provided as suggested modified/refined queries to fill the suggestion list entries in the modification/refinement cache. The resulting suggested queries may be stored in the pre-computed modification/refinement query cache. The process 40, or the loop comprising blocks 42-49, may be repeated periodically to update the modification/refinement cache.

FIG. 2B is a flowchart illustrating an exemplary process 50 for generating suggested modified/refined user inputs, such as user queries, as may be implemented by the system 20 shown in FIG. 1A. At block 51, a user input such as a user query is received. At decision block 52, the original user input received at block 51 may be compared with the entries in the optional pre-computed modification/refinement cache. If it is determined at decision block 52 that the original user query is in the modification/refinement cache, then at block 53 a query suggestion list of size N is at least partially filled with up to N pre-computed query suggestions from the pre-computed modification/refinement cache. If it is determined at decision block 54 that the suggestion list is full, the process 50 ends. Note that the suggestion list may be of a predefined size N, e.g., 10 suggestions or a single best suggestion. Alternatively, if it is determined at decision block 54 that the suggestion list is not full, the process 50 continues to blocks 55 and 56. Similarly, if it is determined at decision block 52 that the original user query is not in the modification/refinement cache, the process also continues to blocks 55 and 56. Note that blocks 55 through 58 are similar to blocks 44 through 47 described with reference to FIG. 2A. Accordingly, the description of similar content is not repeated here for purposes of clarity.

At blocks 55 and 56, an expanded query lattice is (conceptually) built in order to generate suggested modified/refined (alternative) queries. After the expanded query lattice has been built to include the various alternative paths, the paths and scores for a predetermined number of best queries in the expanded query lattice are identified as potential suggested queries at block 57. At block 58, the score of the original user query is computed so that only those potential suggested queries whose scores are at least that of the original user query are provided as suggested modified/refined queries. Queries whose scores are at least that of the original user query may be provided to the user as suggested modified/refined queries to fill the suggestion list or the remainder of the suggestion list. Although not shown, a single best query may alternatively be provided. In addition, the original user query and the resulting suggested queries may additionally be stored in the pre-computed modification/refinement query cache.
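
The flow of process 50 can be sketched roughly as follows. This is only an illustrative sketch: the function name `suggest`, the dictionary-based cache, and the `generate_candidates` and `score` callables are assumptions introduced here and do not appear in the specification.

```python
from typing import Callable, Dict, List


def suggest(query: str,
            cache: Dict[str, List[str]],
            generate_candidates: Callable[[str], List[str]],
            score: Callable[[str], float],
            n: int = 10) -> List[str]:
    """Fill a suggestion list of size n: cache first, expanded lattice second."""
    # Blocks 52-53: take up to n pre-computed suggestions from the cache.
    suggestions = list(cache.get(query, []))[:n]
    # Block 54: stop if the list is already full.
    if len(suggestions) >= n:
        return suggestions
    # Blocks 55-57: expand the query and rank the candidates by score.
    original_score = score(query)
    ranked = sorted(generate_candidates(query), key=score, reverse=True)
    # Block 58: keep only candidates scoring at least as high as the original.
    for cand in ranked:
        if len(suggestions) >= n:
            break
        if (cand != query and cand not in suggestions
                and score(cand) >= original_score):
            suggestions.append(cand)
    # Optionally store the original query and its suggestions in the cache.
    cache[query] = suggestions
    return suggestions
```

In this sketch a partially filled list from the cache is topped up from the lattice, which mirrors the "remainder of the suggestion list" wording above.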

The various blocks of processes 40 and 50, shown and described above with reference to FIGS. 2A and 2B for generating suggested modified/refined user inputs, are described in more detail below.

FIG. 3 illustrates an exemplary expanded query lattice diagram. As shown, the original query may include various core words or terms T₁, T₂, T₃, T₄ and non-core words or terms s₁, s₂, s₃. For example, in the Chinese query "URL of sina" (sina being a Chinese portal site), the core term or entity is "sina" rather than "URL". In general, non-core terms also include stop words. Stop words may generally be defined as, for example, the 30 most frequently occurring Chinese words or the 100 most frequently occurring English words in a corpus such as a web corpus.

After the core entities of the original query have been identified, one or more query modification or refinement methods, e.g., term substitution and/or expansion/shrinking, can be applied to build the expanded query lattice. As noted above, term substitution refers to replacing a core entity with words and/or terms that are similar to it (e.g., synonyms or near-synonyms), which may be identified using, for example, a similarity matrix. For purposes of illustration, FIG. 3 shows that the expanded query lattice may be built by replacing term T₁ with T₁' or T₁'' and/or replacing term T₄ with T₄'.

As noted above, expansion/shrinking refers to adding core entities to, or deleting some of the core entities from, the original query using, for example, an expansion/shrinking table of compounds. For purposes of illustration, the expansion/shrinking table may include a table entry for the pair of compounds T₁T₂ and T₁T₂T₅, so that the compound T₁T₂ in the original query of FIG. 3 may be replaced with the compound T₁T₂T₅ (i.e., adding the new term T₅) to further build the expanded query lattice. Similarly, the expansion/shrinking table may include a table entry for the pair of compounds T₂T₃T₄ and T₃T₄, so that the compound T₂T₃T₄ in the original query of FIG. 3 may be replaced with the compound T₃T₄ (i.e., deleting the core entity T₂) to further build the expanded query lattice.
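
The two lattice-building operations described above can be illustrated with a minimal sketch. It enumerates only single-step alternatives of a query made of core terms and ignores the non-core terms s₁, s₂, s₃; the function name `expand_query` and the data layout (a dictionary of similar terms and a list of compound pairs) are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Terms = Tuple[str, ...]


def expand_query(terms: List[str],
                 similar: Dict[str, List[str]],
                 pairs: List[Tuple[Terms, Terms]]) -> Set[Terms]:
    """Return the one-step alternatives of `terms` (the original is excluded)."""
    original = tuple(terms)
    out: Set[Terms] = set()
    # Term substitution: swap one term for a similar term.
    for i, term in enumerate(original):
        for alt in similar.get(term, []):
            out.add(original[:i] + (alt,) + original[i + 1:])
    # Expansion/shrinking: swap a compound for the other member of its pair,
    # in either direction, wherever it occurs contiguously in the query.
    for a, b in pairs:
        for src, dst in ((a, b), (b, a)):
            for i in range(len(original) - len(src) + 1):
                if original[i:i + len(src)] == src:
                    out.add(original[:i] + dst + original[i + len(src):])
    out.discard(original)
    return out
```

With the FIG. 3 example (terms T1 to T4, T1 replaceable by T1' or T1'', T4 by T4', and the two table entries above), this yields the three substitutions, the expansion T1 T2 T5 T3 T4, and the shrinking T1 T3 T4. The naive matcher also applies T3 T4 to T2 T3 T4 in the expanding direction, giving a path with a repeated T2; in the scheme described here it would be left to the scoring step to rank such a path low.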

One exemplary method for generating the similarity matrix of similar terms is described in detail with reference to FIGS. 4 through 7. FIG. 4 is a flowchart illustrating an exemplary process 60 for constructing a similarity matrix for use in generating suggested modified/refined queries by substituting query terms. The similarity matrix may be a matrix of distributional word similarities between each pair of words or terms w. The distributional word similarity may be obtained by constructing a feature vector for each word w from a corpus, e.g., web pages, and determining the similarity between each pair of words as the cosine of the angle between their respective feature vectors. The features of a word or term may include the words surrounding it in all of its occurrences.

An example of the construction of the feature vectors and the similarity matrix is presented with reference to FIG. 4, although various other feature vector and similarity matrix construction methods may similarly be employed. In particular, at block 62, a feature vector is constructed for each word/term w in a corpus such as a web corpus, along with a count for each feature f in the feature vector. The features of a word/term w may include the words occurring before and after the word/term w, up to the first non-stop word. For purposes of illustration, given the sentence "Because communities assess at different percentages of fair market value, the only way to compare tax rates among communities is by using equalized rates", the features of the word communities and their corresponding co-occurrence counts are listed in the table of FIG. 5. Note that in languages that have different forms of a given word, e.g., singular and plural such as "community" and "communities", or different tenses such as "walk", "walking" and "walked", the system may treat the different forms of a word as separate words, although generally as similar terms. Such treatment of different forms of a given word may not be relevant for languages that typically do not have such distinctions, as is the case for Chinese, for example. Note also that features with the prefix "L:" or "R:" are words that appear to the left or right of the word w, respectively. In this embodiment, the counts of the one or more features on each of the left and right sides of a given instance of the word w sum to 1. For example, in the first instance of the word "communities", the left and right features are each assigned a count of 1. When there are one or more stop words adjacent to the word "communities", e.g., "among", "is", and "by", the count for each side of that instance is divided equally among that side's features and is thus computed as a fraction. In the second instance of the word "communities", there are two left features, so each left feature is assigned a count of 0.5. Similarly, in the second instance there are three right features ("is", "by", and "using"), so each right feature is assigned a count of 0.33.
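
The fractional counting rule can be sketched as follows. The function name `context_features` and the particular stop-word set passed in are assumptions made for illustration; the specification does not list the stop words it uses, and the output below is not claimed to reproduce the table of FIG. 5.

```python
from collections import defaultdict
from typing import Dict, List, Set


def context_features(tokens: List[str], target: str,
                     stop_words: Set[str]) -> Dict[str, float]:
    """Accumulate fractional L:/R: feature counts over every instance of `target`."""
    counts: Dict[str, float] = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Walk outward on each side, up to and including the first non-stop word.
        for prefix, indices in (("L:", range(i - 1, -1, -1)),
                                ("R:", range(i + 1, len(tokens)))):
            side = []
            for j in indices:
                side.append(tokens[j])
                if tokens[j] not in stop_words:
                    break
            # Each side of one instance contributes a total count of 1,
            # split equally among the features found on that side.
            for word in side:
                counts[prefix + word] += 1.0 / len(side)
    return dict(counts)
```

Applied to the lower-cased example sentence with "among", "is", and "by" treated as stop words, the second instance of "communities" contributes 0.5 each to L:among and L:rates and one third each to R:is, R:by, and R:using, in line with the description above.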

Referring again to FIG. 4, at block 64, the value of each feature f in the feature vector may be determined as the point-wise mutual information MI between the word w and the feature f. The point-wise mutual information is used because frequently occurring words, such as stop words, tend to have high counts even though such words are semantically vacuous. Because the count of a feature may therefore not be a good indicator of its importance, the point-wise mutual information MI(w, f) between the word w and the feature f may be used as the value of the feature f. The point-wise mutual information MI(w, f) may be defined as the logarithm of the ratio between the observed joint probability of w and f, P(w, f), and the probability with which w and f would be expected to co-occur if they were independent, P(w)P(f):

$$\mathrm{MI}(w, f) = \log \frac{P(w, f)}{P(w)\,P(f)}$$

where the probabilities of the feature, P(f), and of the word, P(w) (e.g., relative frequencies), may be determined using, for example, their respective probabilities in the corpus. By way of example, FIG. 6 is a table listing exemplary features and corresponding probabilities for the term "communities" generated from a web corpus. The feature vector table shown in FIG. 6 lists a subset of the features of the word "communities" as well as the probabilities and the mutual information between the features and the word "communities". Note that feature vectors can be fairly large. For example, the full set of features of the word communities extracted from the corpus contains approximately 2,000 elements.
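
A minimal sketch of the point-wise mutual information computation is given below. It assumes that all probabilities are estimated as relative frequencies from a single table of (word, feature) co-occurrence counts and that the natural logarithm is used; neither choice is fixed by the specification, and the function name `pmi_values` is hypothetical.

```python
import math
from typing import Dict, Tuple


def pmi_values(pair_counts: Dict[Tuple[str, str], float]
               ) -> Dict[Tuple[str, str], float]:
    """Map each (word, feature) pair to log(P(w, f) / (P(w) * P(f)))."""
    total = sum(pair_counts.values())
    word_totals: Dict[str, float] = {}
    feat_totals: Dict[str, float] = {}
    for (w, f), c in pair_counts.items():
        word_totals[w] = word_totals.get(w, 0.0) + c
        feat_totals[f] = feat_totals.get(f, 0.0) + c
    # Pairs with a zero count are skipped because log(0) is undefined.
    return {
        (w, f): math.log((c / total)
                         / ((word_totals[w] / total) * (feat_totals[f] / total)))
        for (w, f), c in pair_counts.items() if c > 0
    }
```

A pair that co-occurs exactly as often as independence would predict gets a value of 0, while a frequent but uninformative feature such as a stop word is pulled toward 0 regardless of its raw count.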

Referring again to FIG. 4, at block 66, the similarity measure or value sim between two words or phrases w₁ and w₂ may be determined as the cosine of the angle between their feature vectors, using the values of the features in the feature vectors. In particular, the similarity sim between two terms or words w₁ and w₂ may be defined as

$$\mathrm{sim}(w_1, w_2) = \frac{\sum_{i=1}^{n} f_{1i}\, f_{2i}}{\sqrt{\sum_{i=1}^{n} f_{1i}^{2}}\;\sqrt{\sum_{i=1}^{n} f_{2i}^{2}}}$$

where the feature vectors of w₁ and w₂ are represented as (f₁₁, f₁₂, ..., f₁ₙ) and (f₂₁, f₂₂, ..., f₂ₙ), respectively.
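
The cosine measure can be sketched for sparse feature vectors stored as dictionaries from feature name to value. The dictionary representation, the function name, and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math
from typing import Dict


def cosine_similarity(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse feature vectors."""
    # Features missing from a vector are treated as having value 0.
    dot = sum(val * v2.get(feat, 0.0) for feat, val in v1.items())
    norm1 = math.sqrt(sum(val * val for val in v1.values()))
    norm2 = math.sqrt(sum(val * val for val in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention for an empty or all-zero vector
    return dot / (norm1 * norm2)
```

Storing only non-zero features keeps the computation proportional to the number of observed features rather than the vocabulary size, which matters when a vector holds on the order of 2,000 elements.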

At block 68, the similarity matrix is constructed from the similarity values for each pair of words or terms, and may be used to generate suggested modified/refined queries by replacing query terms with similar terms. In particular, the similarity values may be used, for example, in determining the scores for potential suggested queries. The similarity matrix may be recomputed periodically, and/or similarity values for terms, e.g., newly identified terms, may be added to the matrix. FIG. 7 is an exemplary similarity matrix that may be used to substitute similar terms in order to generate suggested modified/refined queries.

Having presented an exemplary method for generating the similarity matrix used in the term-substitution query modification/refinement method, an exemplary method for generating the expansion/shrinking table of compound pairs used in the expansion/shrinking query modification/refinement method is now described in more detail with reference to FIGS. 8 through 11. FIG. 8 is a flowchart illustrating an exemplary process 70 for constructing an expansion/shrinking table of compound pairs. As noted above, each entry in the expansion/shrinking table is a pair of compounds in which one compound is a substring of the other, so that if a query contains a compound that is one member of a compound pair in an entry of the table, that compound may be replaced by the other compound of the pair in expanding the lattice. Ideally, each compound in the expansion/shrinking table should be a meaningful phrase. By way of example, a compound pair may be Shanghai and Shanghai City, or television and television set. As noted above, compound pairs may include, for example, ambiguous terms and their unambiguous contexts (e.g., Amazon and Amazon rain forest), names of people and their corresponding activities, attributes of terms, refinements of concepts, actors, authors, products, person-position pairs, and the like.

At block 71, the queries in a query log (or another database of user inputs) may be segmented into word sequences so as to maximize the overall probability for each query. In particular, because Chinese words need not be explicitly delimited by spaces or other breaks, a query may be a string of Chinese characters without breaks, and segmentation may be used to divide the sequence of characters into a sequence of words. The sequence of words may be such that the product of the probabilities of the words is the maximum among all possible segmentations of the sequence of characters. As is evident, block 71 need not be performed for certain languages, such as English, in which there are clear delimiters between adjacent words.
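
One common way to find the segmentation with the maximal product of word probabilities is dynamic programming over string positions. The sketch below assumes a dictionary of known words with strictly positive unigram probabilities and a maximum word length; the specification does not say which search procedure is used, so this is only one possible realization, and the names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def segment(text: str, word_prob: Dict[str, float], max_len: int = 4) -> List[str]:
    """Split `text` into known words so the product of their probabilities is maximal.

    Returns an empty list when no segmentation into known words exists.
    """
    n = len(text)
    # best[i] = (log-probability, words) of the best segmentation of text[:i]
    best: List[Tuple[float, List[str]]] = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob and best[start][0] > -math.inf:
                # Summing logs is equivalent to multiplying probabilities.
                cand = best[start][0] + math.log(word_prob[word])
                if cand > best[end][0]:
                    best[end] = (cand, best[start][1] + [word])
    return best[n][1]
```

Working in log space avoids numerical underflow for long strings while selecting the same segmentation as the raw product would.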

To identify compounds/phrases, frequent word sequences or n-grams (sequences of n words) are identified at block 72. Also at block 72, word sequences in which every adjacent pair of words is a frequent n-gram are counted so as to identify frequent word sequences of any length. Note that a frequent word sequence may or may not be a compound. For example, some of the frequent word sequences may be compounds while others may be non-phrase or non-compound sequences.

At block 73, non-phrase sequences are identified by requiring that a compound/phrase appear at the beginning as well as at the end of a minimum number of queries (not necessarily the same query); a sequence that fails this requirement is treated as a non-phrase. The minimum number of queries may be any number greater than or equal to 1, but is typically much greater than 1, e.g., 50 or 100.
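
Blocks 72 and 73 can be illustrated, in simplified form, for n-grams of a single fixed length. The function name, the thresholds `min_count` and `min_edge`, and the restriction to one value of n are assumptions made here; the specification describes sequences of any length and does not give concrete data structures.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def phrase_candidates(queries: Iterable[List[str]], n: int,
                      min_count: int, min_edge: int) -> Set[Tuple[str, ...]]:
    """Frequent n-grams that also begin and end at least `min_edge` queries."""
    total: Counter = Counter()
    starts: Counter = Counter()
    ends: Counter = Counter()
    for q in queries:
        grams = [tuple(q[i:i + n]) for i in range(len(q) - n + 1)]
        total.update(grams)
        if grams:
            starts[grams[0]] += 1   # n-gram at the beginning of this query
            ends[grams[-1]] += 1    # n-gram at the end of this query
    # Block 72: frequency threshold; block 73: begin/end requirement.
    return {g for g, c in total.items()
            if c >= min_count and starts[g] >= min_edge and ends[g] >= min_edge}
```

The begin/end requirement serves as evidence that the sequence stands on its own as a unit rather than being a fragment that only ever occurs in the middle of a longer phrase.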

At block 74, a feature vector is constructed for each n-gram from a corpus such as a web corpus, along with a count for each feature f in the feature vector. At block 75, the value of each feature f in the feature vector may be determined as the point-wise mutual information MI between the n-gram and the feature f. At block 76, the similarity measure or value sim between two n-grams may be determined as the cosine of the angle between their feature vectors, using the values of the features in the feature vectors. Note that blocks 74, 75, and 76 are similar to blocks 62, 64, and 66, respectively, of process 60 described above with reference to FIG. 4. Accordingly, the description of similar content is not repeated here for purposes of clarity.

The expansion/shrinking table may then be constructed at block 77 as pairs of compounds in which one compound is a substring of the other. In addition, the counts of the compounds may be determined and stored in the expansion/shrinking table.
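
Block 77 can be sketched as a pairing of each retained compound with every longer compound that contains it as a contiguous word sequence. The row layout, the `min_count` cut-off parameter, and the function names are assumptions made for illustration, and the counts used in any example are invented.

```python
from typing import Dict, List, Tuple

Compound = Tuple[str, ...]
Row = Tuple[Compound, int, Compound, int]


def is_contiguous_sub(short: Compound, long: Compound) -> bool:
    """True if `short` occurs as a contiguous run of words inside `long`."""
    return any(long[i:i + len(short)] == short
               for i in range(len(long) - len(short) + 1))


def build_table(counts: Dict[Compound, int], min_count: int = 1) -> List[Row]:
    """Rows of (shorter compound, its count, longer compound, its count)."""
    # Drop rare compounds first; the count acts as a cut-off on table size.
    kept = sorted(c for c, k in counts.items() if k >= min_count)
    rows: List[Row] = []
    for short in kept:
        for long in kept:
            if len(short) < len(long) and is_contiguous_sub(short, long):
                rows.append((short, counts[short], long, counts[long]))
    return rows
```

The pairwise comparison is quadratic in the number of compounds; a production system would more plausibly index compounds by their sub-sequences, but that is beyond what the text specifies.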

FIG. 9 is a table illustrating some exemplary entries in an expansion/shrinking table for use in replacing compounds in queries to generate suggested modified/refined queries. As shown, each row of the expansion/shrinking table contains two compounds or word sequences, one of which is a substring of the other. In addition, each compound is associated with a count (or other frequency value) that may be determined, for example, from a query log or some other database of user inputs. The counts may be used as a cut-off to reduce the size of the database and/or may be used, at least in part, to determine a weight for a term or compound, e.g., by using log(count). As described above with reference to FIGS. 2A and 2B, once the query lattice has been expanded by substituting terms and/or by adding terms to or deleting terms from the original query, the paths and scores of the N best queries are determined from the expanded lattice as potential suggested queries. FIG. 10 is a flowchart illustrating an exemplary process 80 for determining the score of a suggested modified/refined query, e.g., a path in the expanded query lattice.

The determination of query suggestions may be treated as a prediction problem based on the previous queries in the current query session. Given the history of queries Q_1, Q_2, ..., Q_{n-1} in the current search session, a prediction may be made as to which next query Q_n the user is most likely to select. The suggested or predicted next query Q_n should not only be relevant to the history of queries Q_1, Q_2, ..., Q_{n-1} in the current session but should also yield good search results. A measure of how good the search results are may be, for example, a function of click position (the position of the search result that the user selects) and click duration (how long the user stays on the selected search result page).

In one embodiment, the score for each potential suggested query may be determined as the value of an objective function F:

$$F(Q, Q_1, \ldots, Q_{n-1}) = \mathrm{Rel}(Q, Q_1, \ldots, Q_{n-1}) \cdot \mathrm{Click}(Q) \cdot \mathrm{Position}(Q)$$

where Rel(Q, Q_1, ..., Q_{n-1}) is the relevance between the query history Q_1, ..., Q_{n-1} and the candidate suggested query Q;

Click(Q) is the probability that the candidate suggested query Q will be selected by the user; and

Position(Q) is the position of the search result that will be clicked for the candidate suggested query Q.

As described above with reference to FIG. 2, one or more suggested or predicted next queries Q may be provided to the user. Thus, the N best suggested next queries (e.g., paths in the expanded query lattice) are the N queries with the highest objective function values, and the best (e.g., most likely) suggested next query may be expressed as the query that maximizes the value of the objective function F:

$$Q_n = \arg\max_{Q} \, F(Q, Q_1, \ldots, Q_{n-1})$$
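
Selecting the N best candidates under the objective function F can be sketched as below. The three component functions are passed in as callables because their estimation is described separately; the function name `top_n` and the callable signatures are assumptions made for illustration.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple


def top_n(candidates: Iterable[str],
          history: Sequence[str],
          n: int,
          rel: Callable[[str, Sequence[str]], float],
          click: Callable[[str], float],
          position: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Return the n candidates with the highest F = Rel * Click * Position."""
    scored = [(rel(q, history) * click(q) * position(q), q) for q in candidates]
    # nlargest returns (score, query) tuples in descending score order.
    return heapq.nlargest(n, scored)
```

Calling `top_n` with n = 1 gives the single best suggestion, which corresponds to the arg max form above.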

The determination of the score for each potential suggested or predicted next query Q (90) is shown in the flowchart of FIG. 10. At block 82, the relevance Rel(Q, Q_1, ..., Q_{n-1}) between the predicted query Q and the history of user queries Q_1, ..., Q_{n-1} in the current session is determined using the correlations of aligned terms of the queries. In particular, to estimate the relevance function Rel, the terms or core entities of the original query Q are identified. Using the correlations between core entities, the relevance Rel(Q, Q') between two queries Q and Q' can be derived from the correlations of their core entities. In particular, the relevance Rel(Q, Q') may be expressed as:

$$\mathrm{Rel}(Q, Q') = \max_{f} \prod_{i=1}^{k} \mathrm{Cor}(T_i, T_i') \cdot w(T_i)$$

where:

the alignment function f = f(T_1, T_2, ..., T_k, T_1', T_2', ..., T_k') maps the terms of the related queries Q and Q' onto each other, e.g., a mapping between {T_1, ..., T_k, e} and {T_1', ..., T_k'}, an example of which is shown in FIG. 11;

Cor(T_i, T_i') is the correlation between the terms T_i and T_i', and is a vector of real numbers;

Q = T_1, T_2, ..., T_k (the core entities of query Q, where any term T_i may be the empty term e);

Q' = T_1', T_2', ..., T_k' (the core entities of query Q', where any term T_i' may be the empty term e); and

w(T_i) is the importance of the term T_i, e.g., the TF/IDF for T_i, where TF denotes term frequency (the count of the term) and IDF denotes inverse document frequency.
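
The maximization over alignments can be illustrated with a brute-force sketch. It makes several simplifying assumptions that are not in the specification: Cor is treated as a single number rather than a vector, the shorter query is padded with an empty term (represented by `None`), and every one-to-one alignment is enumerated, which is only practical for short queries. The names `relevance` and `EMPTY` are hypothetical.

```python
import math
from itertools import permutations
from typing import Callable, Optional, Sequence

EMPTY = None  # stands in for the empty term e


def relevance(q_terms: Sequence[str],
              q2_terms: Sequence[str],
              cor: Callable[[Optional[str], Optional[str]], float],
              weight: Callable[[Optional[str]], float]) -> float:
    """Max over alignments of the product of Cor(T_i, T_i') * w(T_i)."""
    k = max(len(q_terms), len(q2_terms))
    # Pad both term lists to length k with the empty term.
    a = list(q_terms) + [EMPTY] * (k - len(q_terms))
    b = list(q2_terms) + [EMPTY] * (k - len(q2_terms))
    # Each permutation of b is one candidate alignment f.
    return max(
        math.prod(cor(t, t2) * weight(t) for t, t2 in zip(a, perm))
        for perm in permutations(b)
    )
```

The caller decides how `cor` and `weight` treat the empty term. Because the number of alignments grows factorially with k, a real system would presumably use a more efficient assignment procedure; the text does not say which.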

Next, at block 84, the probability that the query Q will be selected by the user, Click(Q), is determined, for example, from the click duration or normalized click duration. At block 86, the position score for the predicted query Q, Position(Q), is determined, for example, from the click position, normalized click position, or inverted click position. Finally, at block 88, the value of the objective function F for the potential suggested or predicted next query Q is determined from the results of blocks 82, 84, and 86, as described above.

The determination of the correlation values Cor(T_i, T_i') used in determining the relevance between two queries is described in more detail with reference to FIG. 12. In particular, FIG. 12 is a flowchart illustrating an exemplary process 90 for generating correlation values between pairs of terms or core entities T, T'. At block 92, new core entities may be identified from a corpus, e.g., web pages and user queries, using mutual information. In one illustrative implementation of block 92, if Motorola is an entity, and "Motorola announced", "Motorola cell phone", and "buy Motorola" as well as "Nokia announced", "Nokia cell phone", and "buy Nokia" are in the corpus, then Nokia is also identified as an entity. Note that although off-the-shelf dictionaries can provide conventional core entities, many new core entities are frequently introduced into the vocabulary. Examples of new core entities include proper names, e.g., names of people and companies, as well as various other new words and phrases such as product models, movie and music titles, and the like.

At block 94, the correlation values between pairs of core entities T, T' may be determined using, for example, query logs, web pages, and anchor text. The correlation between two core entities T₁ and T₂ may be defined as a function of a vector of real numbers:

$$\mathrm{Cor}(T_1, T_2) = f(w_1, w_2, \ldots, w_n)$$

where w_1, w_2, ..., w_n are the weights of certain pre-determined relationships. Examples of the pre-determined relationships include (1) synonyms, acronyms, and antonyms; (2) compound phrases such as Shanghai vs. Shanghai City and television vs. television machine; (3) terms in the same syntactic/semantic category, e.g., Toyota and Honda; (4) ambiguous terms and their unambiguous contexts; (5) names of people and their corresponding activities, e.g., Oprah and talk show host; (6) attributes of terms, e.g., computer and memory; (7) refinements of concepts, e.g., Amazon and Amazon River, Amazon Rain Forest, and Amazon.com; and (8) movie-actor, book-author, company-product, person-position, and the like, e.g., Tom Hanks and Forrest Gump, and Bill Gates and CEO.

At block 96, the values of the correlation vector Cor(T₁, T₂) may be normalized to the range [0, 1].
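By way of non-limiting illustration, blocks 94 and 96 may be sketched as follows. The description leaves the function f unspecified; this sketch assumes a weighted sum over per-relationship strengths and assumes min-max rescaling for the normalization. The weights, relationship ordering, and strengths are hypothetical values chosen for the example.

```python
def correlation(weights, strengths):
    """Assumed form of f(w_1, ..., w_n): a weighted sum of per-relationship strengths."""
    if len(weights) != len(strengths):
        raise ValueError("one strength is required per relationship weight")
    return sum(w * s for w, s in zip(weights, strengths))


def normalize(raw):
    """Min-max rescale raw correlation values to the range [0, 1]."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {pair: 0.0 for pair in raw}
    return {pair: (value - lo) / (hi - lo) for pair, value in raw.items()}


# Hypothetical weights for (synonym, same category, concept refinement).
weights = [2.0, 1.0, 0.5]
raw = {
    ("TV", "television"): correlation(weights, [1, 0, 0]),
    ("Toyota", "Honda"): correlation(weights, [0, 1, 0]),
    ("Amazon", "Amazon River"): correlation(weights, [0, 0, 1]),
}
normalized = normalize(raw)
print(normalized)
```

With these assumed weights the synonym pair receives the highest normalized value and the refinement pair the lowest.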

The systems and methods for generating modified or refined user input can produce top results that a user is likely to select and/or suggest queries that the user is likely to use. The systems and methods quantitatively measure the correlation between two queries. As is evident, the two queries need not have any terms or even synonyms in common. For example, suggested queries for an original query (e.g., in Chinese) of "'Now and Forever' mp3", for an mp3 file of the song "Now and Forever", may include "CoCoLee" (the singer of the song) as well as other songs or albums by the same artist. Thus, the suggested queries may not simply be expansions of the original query but may instead be queries with better search results, e.g., search results that the user is more likely to select. In one example, where the original query is short and ambiguous, the suggested queries may include queries that achieve query sense disambiguation. As another example, where the original query is long and/or contains mutually exclusive terms, the suggested queries may include queries that split the original query into shorter queries.

While exemplary embodiments of the present invention are described and illustrated herein, it is noted that they are merely illustrative and that modifications can be made to them without departing from the spirit and scope of the invention. Accordingly, the scope of the invention is intended to be defined only in terms of the following claims, as may be amended, with each claim being incorporated into this description of specific embodiments as an embodiment of the invention.

Claims

A computer-implemented method comprising:

receiving an original query;

generating a first feature vector for a first term in the original query;

generating a respective feature vector for each of one or more different terms in a collection of terms;

associating a respective similarity value with each of the one or more different terms, the similarity value being based at least in part on a similarity measure between the first feature vector for the first term and the respective feature vector for each of the one or more different terms;

identifying one or more similar terms from the one or more different terms based on the respective similarity value associated with each of the one or more different terms;

generating an alternative query for each of the identified one or more similar terms by replacing the first term in the original query with the respective identified similar term;

computing a score for each alternative query based on the similarity value associated with the identified similar term in the respective alternative query; and

identifying one or more of the alternative queries as query suggestions for the original query based at least in part on the score computed for each alternative query.
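By way of non-limiting illustration, the steps recited above may be sketched as follows. The claim does not name a particular similarity measure; cosine similarity over sparse co-occurrence counts is assumed here, and the score of each alternative query is simply taken to be the similarity value of the substituted term. The feature values, the `threshold`, and the `top_k` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def suggest(query, term, feature_vectors, threshold=0.5, top_k=2):
    """Replace `term` with each sufficiently similar term and rank the resulting queries."""
    first = feature_vectors[term]
    scored = []
    for other, vector in feature_vectors.items():
        if other == term:
            continue
        similarity = cosine(first, vector)
        if similarity >= threshold:
            alternative = [other if t == term else t for t in query]
            scored.append((similarity, " ".join(alternative)))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [alternative for _, alternative in scored[:top_k]]


feature_vectors = {
    "toyota": {"car": 3, "dealer": 2, "price": 1},
    "honda": {"car": 3, "dealer": 1, "price": 2},
    "banana": {"fruit": 4, "price": 1},
}
suggestions = suggest(["toyota", "price"], "toyota", feature_vectors)
print(suggestions)
```

Here "honda" has a cosine similarity of 13/14 to "toyota" and passes the assumed threshold, while "banana" does not, so the only suggestion is "honda price".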

The method of claim 1,

wherein the original query is a search query.

The method of claim 1,

wherein the original query is in a non-Roman-based language.

The method of claim 1,

further comprising storing the original query and the one or more alternative queries in a cache.

The method of claim 1,

wherein the similarity measures are stored in a similarity matrix, and

the similarity matrix is generated by generating feature vectors for respective terms in a collection of terms in at least one of a corpus, a user input log, and user session data, and determining a respective similarity measure between pairs of the terms in the collection of terms using the corresponding feature vectors.

The method of claim 1,

wherein the score is computed by determining at least one of (a) a relevance between the original query and a first alternative query, (b) a probability that the first alternative query will be selected, or (c) a score of a position of a selected search result for the first alternative query.

The method of claim 6,

wherein the determining comprises determining the relevance between the original query and the first alternative query, and

determining the relevance comprises:

aligning terms of the original query with terms of the first alternative query; and

determining correlation values between the aligned terms.

A system comprising a server device configured to receive an original query and to perform operations,

the operations comprising:

generating a first feature vector for a first term in the original query;

generating a respective feature vector for each of one or more different terms in a collection of terms;

associating a respective similarity value with each of the one or more different terms, the similarity value being based at least in part on a similarity measure between the first feature vector for the first term and the respective feature vector for each of the one or more different terms;

identifying one or more similar terms from the one or more different terms based on the respective similarity value associated with each of the one or more different terms;

generating an alternative query for each of the identified one or more similar terms by replacing the first term in the original query with the respective identified similar term;

computing a score for each alternative query based on the similarity value associated with the identified similar term in the respective alternative query; and

identifying one or more of the alternative queries as query suggestions for the original query based at least in part on the score computed for each alternative query.

The system of claim 8,

wherein the original query is a search query.

The system of claim 8,

wherein the original query is in a non-Roman-based language.

The system of claim 8,

further comprising a pre-computed cache of the one or more alternative queries,

wherein the server device is further configured to

determine whether the original query is in the pre-computed cache and, if the original query is determined to be in the pre-computed cache, output at least one pre-computed alternative query.

The system of claim 8,

wherein the server device is further configured to generate the similarity matrix by generating feature vectors for respective terms in a collection of terms in at least one of a corpus, a user input log, and user session data, and determining a respective similarity measure between pairs of the terms in the collection of terms using the corresponding feature vectors.

The system of claim 8,

wherein the server device is further configured to compute the score by determining at least one of (a) a relevance between the original query and a first alternative query, (b) a probability that the first alternative query will be selected, or (c) a score of a position of a selected search result for the first alternative query.

The system of claim 13,

wherein the server device is further configured to determine the relevance between the original query and the first alternative query, and

determining the relevance comprises:

aligning terms of the original query with terms of the first alternative query; and

determining correlation values between the aligned terms.

A computer-readable storage medium for use in conjunction with a computer system,

the computer-readable storage medium storing instructions executable on a computer processor,

the instructions comprising:

instructions for receiving an original query;

instructions for generating a respective vector for a first term in the original query;

instructions for generating a first feature vector for each of one or more different terms in a collection of terms;

instructions for associating a respective similarity value with each of the one or more different terms, the similarity value being based at least in part on a similarity measure between the first feature vector for the first term and the respective feature vector for each of the one or more different terms;

instructions for identifying one or more similar terms from the one or more different terms based on the respective similarity value associated with each of the one or more different terms;

instructions for generating an alternative query for each of the identified one or more similar terms by replacing the first term in the original query with the respective identified similar term;

instructions for computing a score for each alternative query based on the similarity value associated with the identified similar term in the respective alternative query; and

instructions for identifying one or more of the alternative queries as query suggestions for the original query based at least in part on the score computed for each alternative query.

The computer-readable storage medium of claim 15,

wherein the instructions

further comprise instructions for storing the original query and each of the one or more alternative queries in a cache.

The computer-readable storage medium of claim 15,

wherein the similarity measures are stored in a similarity matrix, and

the similarity matrix is generated by generating feature vectors for respective terms in a collection of terms in at least one of a corpus, a user input log, and user session data, and determining a respective similarity measure between pairs of the terms in the collection of terms using the corresponding feature vectors.

The computer-readable storage medium of claim 15,

wherein the score is computed by determining at least one of (a) a relevance between the original query and a first alternative query, (b) a probability that the first alternative query will be selected, or (c) a score of a position of a selected search result for the first alternative query.

The computer-readable storage medium of claim 18,

wherein the instructions comprise instructions for determining the relevance between the original query and the first alternative query, and

the instructions for determining the relevance comprise:

instructions for aligning terms of the original query with terms of the first alternative query; and

instructions for determining correlation values between the aligned terms.

A computer-implemented method comprising:

receiving an original query;

identifying a first compound comprising a first sequence of one or more terms in the original query;

identifying a second compound comprising a different second sequence of one or more terms, the second compound being an expansion or contraction of the first compound;

generating an alternative query by replacing the first compound in the original query with the second compound identified as the expansion or contraction of the first compound;

computing a score for the alternative query based at least in part on a relevance between the alternative query and a history of one or more previously received queries; and

identifying the alternative query as a query suggestion for the original query based at least in part on the score computed for the alternative query.
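By way of non-limiting illustration, the compound replacement and history-based scoring recited above may be sketched as follows. The claim does not define how relevance to the query history is measured; the mean Jaccard term overlap used here is an assumption, as are the table contents and the function names.

```python
def alternatives(query, table):
    """Replace each compound found in `query` with its expansions or contractions."""
    results = []
    n = len(query)
    for i in range(n):
        for j in range(i + 1, n + 1):
            compound = tuple(query[i:j])
            for replacement in table.get(compound, []):
                results.append(query[:i] + list(replacement) + query[j:])
    return results


def history_score(alternative, history):
    """Assumed relevance measure: mean Jaccard term overlap with previously received queries."""
    terms = set(alternative)
    overlaps = [len(terms & set(past)) / len(terms | set(past)) for past in history]
    return sum(overlaps) / len(overlaps) if overlaps else 0.0


# Hypothetical table: "shanghai" may be expanded to "shanghai city".
table = {("shanghai",): [("shanghai", "city")]}
history = [["shanghai", "city", "hotel"], ["shanghai", "city", "map"]]

candidates = alternatives(["shanghai", "hotel"], table)
scores = [history_score(candidate, history) for candidate in candidates]
print(candidates, scores)
```

A candidate whose score exceeds a chosen threshold, or the score of the original query, could then be offered as a suggestion; that selection rule is likewise an assumption of this sketch.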

The method of claim 20,

wherein the first compound and the second compound are stored in an expansion/contraction table generated from at least one of a user input log and a user input database, and

the expansion/contraction table includes frequency values indicating occurrences of sequences of words.

The method of claim 21,

wherein the expansion/contraction table is generated by determining frequent word sequences, filtering out non-phrasal word sequences, and associating counts with the sequences of terms as the frequency values.
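By way of non-limiting illustration, generation of such a table from a query log may be sketched as follows. The claims do not specify how non-phrasal sequences are detected; this sketch assumes a minimum count and rejects sequences that begin or end with a word from a small hypothetical stopword list. The names `STOPWORDS`, `min_count`, and `max_len` are assumptions for the example.

```python
from collections import Counter

STOPWORDS = {"the", "of", "in", "a"}  # hypothetical filter for non-phrasal sequences


def ngrams(tokens, max_len):
    """Yield every contiguous word sequence of up to `max_len` words."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def is_contiguous_sub(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    return any(long[i:i + len(short)] == short for i in range(len(long) - len(short) + 1))


def build_table(query_log, min_count=2, max_len=3):
    """Map each frequent sequence to its expansions and contractions with their counts."""
    counts = Counter(g for q in query_log for g in ngrams(q.split(), max_len))
    frequent = {
        g: c for g, c in counts.items()
        if c >= min_count and g[0] not in STOPWORDS and g[-1] not in STOPWORDS
    }
    table = {}
    for a in frequent:
        related = {
            b: frequent[b] for b in frequent
            if b != a and (is_contiguous_sub(a, b) or is_contiguous_sub(b, a))
        }
        if related:
            table[a] = related
    return table


log = ["shanghai city hotel", "shanghai city map", "shanghai hotel"]
table = build_table(log)
print(table)
```

In this toy log "shanghai" and "shanghai city" are both frequent and one contains the other, so each is recorded as the expansion or contraction of the other together with its count.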

the instructions comprising:

instructions for receiving an original query;

instructions for identifying a first compound comprising one or more terms in the original query;

instructions for identifying a second compound comprising a different second sequence of one or more terms, the second compound being an expansion or contraction of the first compound;

instructions for generating an alternative query by replacing the first compound in the original query with the second compound identified as the expansion or contraction of the first compound;

instructions for computing a score for the alternative query based at least in part on a relevance between the alternative query and a history of one or more previously received queries; and

instructions for identifying the alternative query as a query suggestion for the original query based at least in part on the score computed for the alternative query.

A system comprising a server configured to receive an original query and to perform operations,

the operations comprising:

identifying a first compound comprising a first sequence of one or more terms in the original query;

identifying a second compound comprising a different second sequence of one or more terms, the second compound being an expansion or contraction of the first compound;

generating an alternative query by replacing the first compound in the original query with the second compound identified as the expansion or contraction of the first compound;

computing a score for the alternative query based at least in part on a relevance between the alternative query and a history of one or more previously received queries; and

identifying the alternative query as a query suggestion for the original query based at least in part on the score computed for the alternative query.

The system of claim 24,

wherein the first compound and the second compound are stored in an expansion/contraction table generated from at least one of a user input log and a user input database, and

the expansion/contraction table includes frequency values indicating occurrences of sequences of words.

The system of claim 25,

wherein the expansion/contraction table is generated by determining frequent word sequences, filtering out non-phrasal word sequences, and associating counts with the sequences of terms as the frequency values.

The computer-readable storage medium of claim 23,

wherein the expansion/contraction table includes frequency values indicating occurrences of sequences of words.

The computer-readable storage medium of claim 27,

wherein the expansion/contraction table is generated by determining frequent word sequences, filtering out non-phrasal word sequences, and associating counts with the sequences of terms as the frequency values.

(Deleted)