CN103838739A - Method and system for detecting error correction words in search engine - Google Patents

Method and system for detecting error correction words in search engine

Info

Publication number
CN103838739A
Authority
CN
China
Prior art keywords
error correction
correction term
user
query word
search results
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210476236.3A
Other languages
Chinese (zh)
Other versions
CN103838739B (en)
Inventor
阮星华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210476236.3A priority Critical patent/CN103838739B/en
Publication of CN103838739A publication Critical patent/CN103838739A/en
Application granted granted Critical
Publication of CN103838739B publication Critical patent/CN103838739B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for detecting error correction words in a search engine. The method comprises: counting, from users' search logs, the number of times a user clicks the search results of the input word after an error correction word has been provided, the number of times the user clicks the error correction word, and the number of times the user clicks the search results of the error correction word; calculating an error correction quality value for the error correction word from the counted numbers; and generating a detection result for the error correction word according to the error correction quality value. The invention further provides a system for detecting error correction words in a search engine. With this technical solution, unreasonable error correction words in a search engine can be detected efficiently.

Description

Method and system for detecting error correction terms in a search engine
[Technical field]
The present invention relates to search technology in the Internet field, and in particular to a method and system for detecting error correction terms in a search engine.
[Background]
Providing error correction terms is an effective way for a search engine to correct a user's query word and guide the user. When a user cannot supply a complete and accurate query word because of a misspelling or an unclear memory, the search engine can offer an error correction term that corrects the input or points the user to the right query word, so that the user obtains the results they need. For example, a user who wants to search for "Zhong Guan-cun" may mistakenly enter "the Zhong Guan village", "Zhong Guan village" or "zhong closes village"; the search engine can then offer the correct error correction term "Zhong Guan-cun".
Search engines currently face three problems when providing error correction terms. First, the query word the user enters is correct, but the search engine offers a wrong error correction term. Second, the query word is wrong, and the error correction term offered is also wrong. Third, the query word is wrong, and the search engine offers no error correction term. In each case the search engine fails to return the results the user needs, which is counterproductive: for example, the results are poorly related to what the user actually wants, or a wrong correction leads the user to doubt the trustworthiness of the search engine. It is therefore important to verify the effect of a search engine's error correction terms, for example their precision and recall, and to detect and identify unreasonable error correction terms.
At present, unreasonable error correction terms are detected and identified as follows. A batch of wrong query words is constructed by hand, for example "Lao Shilaisi" and "Microsoft is than ear Gates". These are submitted to the search engine to obtain its corrections, and a person then checks whether "Lao Shilaisi" was corrected to "Rolls Royce" and whether "Microsoft is than ear Gates" was corrected to "the Bill Gates of Microsoft". This manual approach is very inefficient and consumes considerable manpower and resources.
[Summary of the invention]
The invention provides a method and system for detecting error correction terms in a search engine, which can efficiently detect unreasonable error correction terms in the search engine.
The specific technical solution of the invention is as follows:
According to a preferred embodiment of the invention, a method for detecting error correction terms in a search engine comprises:
counting, from users' search logs, the number of times a user clicks the search results of the input query word after an error correction term has been provided, the number of times the user clicks the error correction term, and the number of times the user clicks the search results of the error correction term;
calculating an error correction quality value for the error correction term from the counted numbers, and generating a detection result for the error correction term according to the error correction quality value.
In the above method, the numbers are counted from the users' search logs as follows:
the query word, query word flag, error correction term and user click behavior are extracted from the users' search logs according to a preset configuration script, and the extracted query word, query word flag, error correction term and user click behavior are composed into error correction sequences; the configuration script contains the sequence numbers of the fields in the search log;
the numbers are counted according to the error correction sequences.
In the above method,
the query word flag indicates whether the query word was entered by the user or was reached by clicking an error correction term;
when the search engine provides an error correction term, the extracted error correction term is the one provided; when the search engine provides no error correction term, the extracted error correction term is a preset character;
the user click behavior is either that the user clicked none of the search results provided, or the specific search result that the user clicked.
In the above method, counting the numbers according to the error correction sequences specifically comprises:
when the user click behavior in an error correction sequence that contains an error correction term provided by the search engine is a click on a specific search result, adding 1 to a preset first counter; the value of the first counter equals the number of times the user clicks the search results of the input query word after the error correction term has been provided;
in the error correction sequences that follow, within the error correction sequence set, the error correction sequence containing the error correction term, when the query word flag indicates that the query word was reached by clicking the error correction term, adding 1 to a preset second counter; the value of the second counter equals the number of times the user clicks the error correction term;
when the user click behavior in an error correction sequence whose query word flag indicates that the query word was reached by clicking the error correction term is a click on a specific search result, adding 1 to a preset third counter; the value of the third counter equals the number of times the search results of the error correction term are clicked.
In the above method, calculating the error correction quality value of the error correction term from the counted numbers specifically comprises:
calculating the error correction quality value I of the error correction term with the following formula:
$$I = \beta_1 \times (1 - a) \times V_1 + \beta_2 \times j + \beta_3 \times i \times V_2$$
where
[Formula image BDA00002443867200031, which gives the definition of $V_1$, is not reproduced in the text.]
$$V_2 = \frac{1}{\log_2(1 + g)}$$
a is the proportion of cases in which the number of times the user clicks the search results of the input query word after the error correction term has been provided is nonzero; k is the arithmetic mean of that number; j is the proportion of cases in which the number of times the user clicks the error correction term is nonzero; i is the proportion of cases in which the number of times the search results of the error correction term are clicked is nonzero; g is the arithmetic mean of the number of times the search results of the error correction term are clicked; and $\beta_1$, $\beta_2$, $\beta_3$ are three weight adjustment factors configured according to the operating efficiency of the system.
In the above method, generating the detection result for the error correction term according to the error correction quality value specifically comprises:
sorting the calculated error correction quality values in ascending order;
taking as the detection result the pairs of query word and error correction term whose error correction quality value is below a preset threshold; or taking as the detection result the one or more pairs of query word and error correction term with the smallest error correction quality values.
A system for detecting error correction terms in a search engine comprises a statistics unit, a calculation unit and a generation unit, wherein:
the statistics unit counts, from users' search logs, the number of times a user clicks the search results of the input query word after an error correction term has been provided, the number of times the user clicks the error correction term, and the number of times the user clicks the search results of the error correction term;
the calculation unit calculates an error correction quality value for the error correction term from the numbers counted by the statistics unit;
the generation unit generates a detection result for the error correction term according to the error correction quality value calculated by the calculation unit.
In the above system, the statistics unit counts the numbers from the users' search logs as follows:
the query word, query word flag, error correction term and user click behavior are extracted from the users' search logs according to a preset configuration script, and the extracted query word, query word flag, error correction term and user click behavior are composed into error correction sequences; the configuration script contains the sequence numbers of the fields in the search log;
the numbers are counted according to the error correction sequences.
In the above system,
the query word flag indicates whether the query word was entered by the user or was reached by clicking an error correction term;
when the search engine provides an error correction term, the extracted error correction term is the one provided; when the search engine provides no error correction term, the extracted error correction term is a preset character;
the user click behavior is either that the user clicked none of the search results provided, or the specific search result that the user clicked.
In the above system, counting the numbers according to the error correction sequences specifically comprises:
when the user click behavior in an error correction sequence that contains an error correction term provided by the search engine is a click on a specific search result, adding 1 to a preset first counter; the value of the first counter equals the number of times the user clicks the search results of the input query word after the error correction term has been provided;
in the error correction sequences that follow, within the error correction sequence set, the error correction sequence containing the error correction term, when the query word flag indicates that the query word was reached by clicking the error correction term, adding 1 to a preset second counter; the value of the second counter equals the number of times the user clicks the error correction term;
when the user click behavior in an error correction sequence whose query word flag indicates that the query word was reached by clicking the error correction term is a click on a specific search result, adding 1 to a preset third counter; the value of the third counter equals the number of times the search results of the error correction term are clicked.
In the above system, the calculation unit calculating the error correction quality value of the error correction term from the counted numbers specifically comprises:
calculating the error correction quality value I of the error correction term with the following formula:
$$I = \beta_1 \times (1 - a) \times V_1 + \beta_2 \times j + \beta_3 \times i \times V_2$$
where
[Formula image BDA00002443867200051, which gives the definition of $V_1$, is not reproduced in the text.]
$$V_2 = \frac{1}{\log_2(1 + g)}$$
a is the proportion of cases in which the number of times the user clicks the search results of the input query word after the error correction term has been provided is nonzero; k is the arithmetic mean of that number; j is the proportion of cases in which the number of times the user clicks the error correction term is nonzero; i is the proportion of cases in which the number of times the search results of the error correction term are clicked is nonzero; g is the arithmetic mean of the number of times the search results of the error correction term are clicked; and $\beta_1$, $\beta_2$, $\beta_3$ are three weight adjustment factors configured according to the operating efficiency of the system.
In the above system, the generation unit generating the detection result for the error correction term according to the error correction quality value specifically comprises:
sorting the calculated error correction quality values in ascending order;
taking as the detection result the pairs of query word and error correction term whose error correction quality value is below a preset threshold; or taking as the detection result the one or more pairs of query word and error correction term with the smallest error correction quality values.
As the above technical solution shows, the invention has the following beneficial effect:
Unreasonable error correction terms in a search engine can be detected automatically from users' search logs. Compared with traditional manual detection and identification, the detection process is an automated modeling and analysis procedure, so unreasonable error correction terms are detected and identified more efficiently, saving manpower and resources.
[Brief description of the drawings]
Fig. 1 is a flowchart of a preferred embodiment of the method of the invention for detecting error correction terms in a search engine;
Fig. 2 is a structural diagram of a preferred embodiment of the system of the invention for detecting error correction terms in a search engine.
[Detailed description]
The basic idea of the invention is: count, from users' search logs, the number of times a user clicks the search results of the input query word after an error correction term has been provided, the number of times the user clicks the error correction term, and the number of times the user clicks the search results of the error correction term; calculate an error correction quality value for the error correction term from the counted numbers; and generate a detection result for the error correction term according to the error correction quality value.
To make the objects, technical solution and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.
The invention provides a method for detecting error correction terms in a search engine. Fig. 1 is a flowchart of a preferred embodiment of this method. As shown in Fig. 1, the preferred embodiment comprises the following steps:
Step 101: extract error correction sequences from users' search logs according to preset field sequence numbers. Each error correction sequence comprises a query word, a query word flag, an error correction term and the user's click behavior.
Specifically, while providing search service to users, the search engine records their search logs. Each search log entry contains a cookie, an IP address, a search time, a query word, a query word flag, an error correction term and the user's click behavior. The search logs of one period can be extracted at a preset interval; for example, with the period configured as one day, the search engine extracts the search logs of the previous day.
The extracted search logs are first grouped by cookie, so that entries with the same cookie form one group. Each group is then checked for an error correction term: if the group contains one, its entries are sorted by search time; if it contains none, the group is deleted.
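The grouping and filtering just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the `LogRecord` type, the `group_sessions` function and the use of "-" as the preset character for "no value" are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple

NO_VALUE = "-"  # assumed preset character for "no correction" / "no click"


class LogRecord(NamedTuple):
    """One search log entry (hypothetical in-memory form)."""
    cookie: str
    ip: str
    time: datetime
    query: str
    flag: int        # 0 = typed by the user, 1 = reached by clicking a correction
    correction: str  # NO_VALUE when the engine offered no correction
    click: str       # NO_VALUE when nothing was clicked, else the clicked url


def group_sessions(records):
    """Group entries by cookie, drop groups that never saw a correction,
    and sort each remaining group by search time."""
    groups = defaultdict(list)
    for record in records:
        groups[record.cookie].append(record)
    kept = {}
    for cookie, group in groups.items():
        if any(r.correction != NO_VALUE for r in group):
            kept[cookie] = sorted(group, key=lambda r: r.time)
    return kept
```

Filtering before sorting avoids sorting groups that are discarded anyway; the order of the two operations does not change the result.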
A configuration script is set up in advance. Its content is the sequence numbers of fields in the search log. From each sorted group of search logs that contains an error correction term, the fields corresponding to the sequence numbers in the configuration script are extracted. In this preferred embodiment the extracted fields are the query word, the query word flag, the error correction term and the user's click behavior. The fields extracted from one search log entry form one error correction sequence, so each group of search logs yields one or more error correction sequences, which together form an error correction sequence set. Because different search engines use different search log formats, the field sequence numbers in the configuration script can be set to match the log format in use, as long as the query word, query word flag, error correction term and user click behavior can be extracted from the search log according to the configuration script.
For example, take the search log entry: 001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:17:51 Zhong Guan village 0 Zhong Guan-cun -
Here 001FCB319096148CEC8D4404A78A3ADD is the cookie, which identifies a specific user; 124.129.51.242 is the IP address; 28/Mar/2012:15:17:51 is the search time; and "Zhong Guan village" is the query word. The 0 or 1 that follows is the query word flag: 0 means the query word preceding it was entered by the user, and 1 means the query word preceding it was reached by clicking an error correction term. "Zhong Guan-cun" is the error correction term provided by the search engine; if the search engine provides no error correction term, a preset character such as "-" takes its place. The final field is the user's click behavior: "-" means the user clicked none of the search results provided, and a url means that url is the specific search result the user clicked. For the search log entry above, the extracted error correction sequence is: Zhong Guan village 0 Zhong Guan-cun -
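A sketch of the field extraction may help here. The patent does not state how fields are delimited, so this example assumes tab-separated entries (query words can themselves contain spaces); `FIELD_INDEXES` and `extract_sequence` are hypothetical names standing in for the configuration script and the extraction routine.

```python
# Hypothetical configuration: 1-based sequence numbers of the fields to keep
# (query word, query word flag, error correction term, click behavior).
FIELD_INDEXES = [4, 5, 6, 7]


def extract_sequence(line, field_indexes=FIELD_INDEXES, sep="\t"):
    """Return the error correction sequence for one log entry.

    Assumes fields are separated by `sep`; a different log format would
    only need different indexes or a different separator.
    """
    fields = line.rstrip("\n").split(sep)
    return tuple(fields[i - 1] for i in field_indexes)


entry = "\t".join([
    "001FCB319096148CEC8D4404A78A3ADD",
    "124.129.51.242",
    "28/Mar/2012:15:17:51",
    "Zhong Guan village",
    "0",
    "Zhong Guan-cun",
    "-",
])
sequence = extract_sequence(entry)
```

Keeping the indexes in configuration rather than in code mirrors the text's point that only the configuration needs to change when the log format differs.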
Step 102: from the error correction sequences, count the number of times the user clicks the search results of the input query word after an error correction term has been provided, the number of times the user clicks the error correction term, and the number of times the search results of the error correction term are clicked.
Specifically, within the generated error correction sequence set, find the error correction sequence that contains an error correction term, and use the user's click behavior in that sequence to decide whether the user clicked a search result. If the click behavior is a url, the user still clicked a search result of the input query word after the search engine provided the error correction term, so the value C1 of the preset first counter is increased by 1. If the click behavior is "-", the user did not click a search result of the input query word after the error correction term was provided; no processing is done and the first counter keeps its current value.
Next, in the error correction sequences that follow the one containing the error correction term in the error correction sequence set, use the query word flag to decide whether the user clicked the error correction term. If the query word flag is 1, the query word preceding it was reached by clicking the error correction term rather than entered by the user, so the user is judged to have clicked the error correction term and the value C2 of the preset second counter is increased by 1; otherwise no processing is done and the second counter keeps its current value. In these following sequences the user either clicked the error correction term provided, in which case C2 becomes 1, or did not, in which case C2 stays 0. The value C2 of the second counter therefore takes only two values, 1 and 0.
Finally, use the user's click behavior to decide whether, after clicking the error correction term, the user clicked search results of the error correction term. If the click behavior in an error correction sequence whose query word flag is 1 is a url, a specific search result of the error correction term was clicked, so the value C3 of the preset third counter is increased by 1. Within the error correction sequence set, C3 is increased by 1 for every search result of the error correction term that the user clicks after clicking the error correction term; otherwise no processing is done and the third counter keeps its current value.
In this way an error correction five-tuple is obtained for each error correction sequence that contains an error correction term. The five-tuple comprises the query word, the error correction term, the value C1 of the first counter, the value C2 of the second counter and the value C3 of the third counter. C1 is the number of times the user clicks the search results of the input query word in the error correction sequence containing the error correction term; C2 is the number of times the user clicks the error correction term; and C3 is the number of times the search results of the error correction term are clicked after the error correction term has been clicked. One or more such error correction sequences yield one or more error correction five-tuples.
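The three counters can be sketched as below. This is an illustrative reading of step 102, with two assumptions that go beyond the text: the preset character is "-", and a later sequence is attributed to a correction only if its query word equals that error correction term. The function name `count_five_tuples` is hypothetical.

```python
NO_VALUE = "-"  # assumed preset character


def count_five_tuples(sequences):
    """Build (query, correction, C1, C2, C3) five-tuples.

    `sequences` is one cookie's error correction sequence set, ordered by
    search time, as (query, flag, correction, click) tuples.
    """
    results = []
    for pos, (query, _flag, correction, click) in enumerate(sequences):
        if correction == NO_VALUE:
            continue  # the engine offered no correction for this query
        # C1: the user clicked a result of the original query anyway.
        c1 = 1 if click != NO_VALUE else 0
        c2 = 0
        c3 = 0
        for later_query, later_flag, _corr, later_click in sequences[pos + 1:]:
            # Assumption: match later entries to this correction by query word.
            if later_flag == 1 and later_query == correction:
                c2 = 1  # C2 only records whether the correction was clicked
                if later_click != NO_VALUE:
                    c3 += 1  # one more result of the corrected query clicked
        results.append((query, correction, c1, c2, c3))
    return results


sequence_set = [
    ("Zhong Guan village", 0, "Zhong Guan-cun", "-"),
    ("Zhong Guan-cun", 1, "-", "url"),
    ("Zhong Guan-cun", 1, "-", "url"),
    ("hard disc player", 0, "-", "url"),
    ("notebook", 0, "-", "url"),
]
five_tuples = count_five_tuples(sequence_set)
```

With the sequence set shown, the single five-tuple produced is ("Zhong Guan village", "Zhong Guan-cun", 0, 1, 2).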
Step 103: calculate the error correction quality value of the error correction term from the counted number of times the user clicks the search results of the input query word, the number of times the user clicks the error correction term, and the number of times the search results of the error correction term are clicked.
Specifically, the generated error correction five-tuples that share the same query word and error correction term are collected into an error correction five-tuple sequence.
The error correction quality value I of the error correction term is then calculated with formula (1) from the values C1, C2 and C3 of the first, second and third counters obtained in step 102.
$$I = \beta_1 \times (1 - a) \times V_1 + \beta_2 \times j + \beta_3 \times i \times V_2 \qquad (1)$$
Here a is the proportion of five-tuples in the error correction five-tuple sequence whose C1 is nonzero. k is the arithmetic mean of C1, equal to the sum of the C1 values in the five-tuple sequence divided by the number of five-tuples whose C1 is nonzero; it reflects the outcome before the query word is corrected, and the further it is from 1, the greater the need to correct the query word the user entered. j is the proportion of five-tuples whose C2 is nonzero, and i is the proportion whose C3 is nonzero. g is the arithmetic mean of C3, equal to the sum of the C3 values in the five-tuple sequence divided by the number of five-tuples whose C3 is nonzero; it reflects the outcome after the query word is corrected, and the closer it is to 1, the better the search results after correction. $\beta_1$, $\beta_2$, $\beta_3$ are three configurable weight adjustment factors that can be tuned by feedback according to the operating efficiency of the system. $V_1$ and $V_2$ are two variable factors related to k and g:
[Formula image BDA00002443867200091, which gives the definition of $V_1$, is not reproduced in the text.]
$$V_2 = \frac{1}{\log_2(1 + g)}$$
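Formula (1) can be evaluated as in the sketch below. Because the definition of V1 survives only as an unreproduced formula image, the caller must supply it as `v1_factor`; that parameter, the default weights of 1.0 and the fallback value 0.0 when a mean or V2 would involve division by zero are all assumptions of this example, not part of the patent text.

```python
import math


def ratio_nonzero(values):
    """Proportion of values that are not 0 (used for a, j and i)."""
    return sum(1 for v in values if v != 0) / len(values)


def mean_of_nonzero(values):
    """Sum divided by the count of nonzero values (used for k and g).
    Returns 0.0 when every value is 0; this fallback is an assumption."""
    nonzero = [v for v in values if v != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


def v2_factor(g):
    """V2 = 1 / log2(1 + g). Undefined at g == 0; 0.0 is an assumed fallback."""
    return 1.0 / math.log2(1.0 + g) if g > 0 else 0.0


def correction_score(five_tuples, v1_factor, betas=(1.0, 1.0, 1.0)):
    """Evaluate I for a five-tuple sequence sharing one (query, correction).

    `v1_factor` maps k to V1 and must be provided by the caller.
    """
    c1 = [t[2] for t in five_tuples]
    c2 = [t[3] for t in five_tuples]
    c3 = [t[4] for t in five_tuples]
    a, k = ratio_nonzero(c1), mean_of_nonzero(c1)
    j = ratio_nonzero(c2)
    i, g = ratio_nonzero(c3), mean_of_nonzero(c3)
    b1, b2, b3 = betas
    return b1 * (1 - a) * v1_factor(k) + b2 * j + b3 * i * v2_factor(g)


# Placeholder V1 (constant 1.0), used only to make the example runnable.
score = correction_score(
    [("Zhong Guan village", "Zhong Guan-cun", 0, 1, 2)],
    v1_factor=lambda k: 1.0,
)
```

For the single five-tuple (0, 1, 2) this gives a = 0, j = 1, i = 1 and g = 2, so with the placeholder V1 and unit weights the score is 2 + 1/log2(3).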
Step 104: generate the detection result for the error correction term according to the error correction quality value.
Specifically, the error correction quality values finally calculated from the search logs are sorted in ascending order. The pairs of query word and error correction term whose error correction quality value is below a preset threshold are taken as the detection result and provided to the user; alternatively, the one or more pairs with the smallest error correction quality values are taken as the detection result and provided to the user. The preset threshold can be set as required, for example to 0.3. The user here is the operations and maintenance staff of the search engine. These pairs are provided because the error correction terms in them are more likely to be unreasonable, which greatly reduces the time needed to find unreasonable error correction terms.
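The two selection rules in step 104 can be sketched as follows. The function name `detect_suspicious`, its parameters and the sample scores are hypothetical; only the ascending sort, the threshold rule and the smallest-values rule come from the text.

```python
def detect_suspicious(scores, threshold=None, bottom_n=None):
    """Select (query word, error correction term) pairs for review.

    `scores` maps each pair to its error correction quality value.
    Pass `threshold` to keep pairs scoring below it, or `bottom_n` to keep
    the pairs with the smallest values. Results are in ascending order.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1])
    if threshold is not None:
        return [pair for pair, value in ranked if value < threshold]
    if bottom_n is not None:
        return [pair for pair, _value in ranked[:bottom_n]]
    raise ValueError("provide either threshold or bottom_n")


# Made-up scores, for illustration only.
example_scores = {
    ("query a", "correction a"): 0.9,
    ("query b", "correction b"): 0.1,
    ("query c", "correction c"): 0.25,
}
below_threshold = detect_suspicious(example_scores, threshold=0.3)
lowest_one = detect_suspicious(example_scores, bottom_n=1)
```

The threshold rule suits a stable score scale, while the smallest-values rule bounds the review workload regardless of how the scores are distributed.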
After receiving the pairs of query word and error correction term, the user can manually correct the unreasonable error correction terms and add the corrected terms to the error correction dictionary, improving the search engine's error correction. The user can also analyze, from the unreasonable error correction terms, where the search engine's error correction is deficient, for example which error correction strategy produced an unreasonable term, and adjust the strategies accordingly. If the above technical solution is applied to upgrade testing of the error correction function, it can also be used for regression verification, checking whether the upgraded error correction now handles the terms that were previously handled incorrectly.
Example
For example, the following search logs are extracted from a very large volume of search logs:
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:17:51 Zhong Guan village 0 Zhong Guan-cun -
00206813DB9FADE0125312BEF459A3C4 223.246.75.15 28/Mar/2012:15:17:52 iphone5 0 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:17:55 Zhong Guan-cun 1 - url
00206813DB9FADE0125312BEF459A3C4 223.246.75.15 28/Mar/2012:15:17:55 sd card 0 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:18:50 Zhong Guan-cun 1 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:30:51 hard disc player 0 - url
002A8A0B671A89511B0DF4A8DB466A06 211.80.191.239 28/Mar/2012:15:30:56 Liu Dehua 0 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:35:56 notebook 0 - url
Sorting the above search logs by cookie and search time gives the following sorted log:
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:17:51 Zhong Guan village 0 Zhong Guan-cun -
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:17:55 Zhong Guan-cun 1 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:18:50 Zhong Guan-cun 1 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:30:51 hard disc player 0 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:35:56 notebook 0 - url
00206813DB9FADE0125312BEF459A3C4 223.246.75.15 28/Mar/2012:15:17:52 iphone5 0 - url
00206813DB9FADE0125312BEF459A3C4 223.246.75.15 28/Mar/2012:15:17:55 sd card 0 - url
002A8A0B671A89511B0DF4A8DB466A06 211.80.191.239 28/Mar/2012:15:30:56 Liu De China 0 - url
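The sorting step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the patent: it assumes each log line holds seven tab-separated fields (cookie, IP, time, query word, query word mark, error correction term, click) and that times use the `28/Mar/2012:15:17:51` layout of the example. The names `parse_line` and `sort_log` are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"  # assumed from the example timestamps


def parse_line(line):
    """Split one tab-separated log line into a record (assumed 7-field layout)."""
    cookie, ip, ts, query, mark, correction, click = line.rstrip("\n").split("\t")
    return {
        "cookie": cookie,
        "ip": ip,
        "time": datetime.strptime(ts, TIME_FORMAT),
        "query": query,
        "mark": mark,              # "0" = typed by the user, "1" = reached via the correction term
        "correction": correction,  # "-" when no correction term was offered
        "click": click,            # "-" when no search result was clicked
    }


def sort_log(lines):
    """Parse the lines and order them by cookie, then by search time."""
    records = [parse_line(line) for line in lines]
    return sorted(records, key=lambda r: (r["cookie"], r["time"]))
```

Sorting on the pair (cookie, time) places all searches of one user next to each other in chronological order, which is what the later per-user steps rely on.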
In the above search logs, the search engine offered no error correction term during the searches of the users whose cookies are 00206813DB9FADE0125312BEF459A3C4 and 002A8A0B671A89511B0DF4A8DB466A06. Their search logs are therefore deleted, and the search log of 001FCB319096148CEC8D4404A78A3ADD is retained:
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:17:51 Zhong Guan village 0 Zhong Guan-cun -
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:17:55 Zhong Guan-cun 1 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:18:50 Zhong Guan-cun 1 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:30:51 hard disc player 0 - url
001FCB319096148CEC8D4404A78A3ADD 124.129.51.242 28/Mar/2012:15:35:56 notebook 0 - url
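The filtering step, which keeps only the users to whom the search engine offered at least one error correction term, might look like the sketch below. It assumes the records are dictionaries with `cookie` and `correction` keys and that `-` stands for "no error correction term offered"; the function name is hypothetical.

```python
from collections import defaultdict

NO_CORRECTION = "-"  # assumed placeholder for "no error correction term offered"


def keep_sessions_with_correction(records):
    """Group records by cookie and drop cookies that never saw a correction term."""
    by_cookie = defaultdict(list)
    for record in records:
        by_cookie[record["cookie"]].append(record)
    return {
        cookie: rows
        for cookie, rows in by_cookie.items()
        if any(row["correction"] != NO_CORRECTION for row in rows)
    }
```

Dropping these sessions early shrinks the data before any counting is done, since a session without an offered correction cannot contribute to the statistics.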
For the search log whose cookie is 001FCB319096148CEC8D4404A78A3ADD, the configuration script specifies extracting the 4th through 7th fields of each search log entry, so the following error correction sequence set is extracted from the above search log:
Zhong Guan village 0 Zhong Guan-cun -
Zhong Guan-cun 1 - url
Zhong Guan-cun 1 - url
hard disc player 0 - url
notebook 0 - url
In this error correction sequence set, the 0 in the first sequence indicates that the query word "Zhong Guan village" was typed directly into the search box, and the "Zhong Guan-cun" that follows is the error correction term the search engine offered for the input "Zhong Guan village". The 1 in the second sequence indicates that the query word "Zhong Guan-cun" was reached by clicking the error correction term rather than typed directly into the search box. The trailing "url" indicates that the user clicked a url among the search results (a results page generally lists 10 urls); a "-" in this position indicates that no search result was clicked. The first sequence, for example, ends in "-": the user clicked none of the search results and went straight to the results page of the error correction term by clicking it. From this sequence set, count the number of times the user clicked search results of the input query word, the number of times the user clicked the error correction term, and the number of times the user clicked search results of the error correction term. This yields the error correction five-tuple (Zhong Guan village, Zhong Guan-cun, 0, 1, 2).
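The extraction and counting just described can be sketched as follows. This is an illustrative sketch, not the patent's code. It assumes tab-separated log lines, `-` as the placeholder for "no error correction term" and "no click", and string marks `"0"` and `"1"`. One further assumption is made so that the sketch reproduces the example count of 1: consecutive sequences reached through the same error correction term are treated as a single click on that term. The names `extract_sequence` and `five_tuple` are hypothetical.

```python
NO_VALUE = "-"  # assumed placeholder for "no correction term" / "no click"


def extract_sequence(line, field_numbers=(4, 5, 6, 7), sep="\t"):
    """Pick the configured 1-based fields: query word, mark, correction term, click."""
    fields = line.rstrip("\n").split(sep)
    return tuple(fields[n - 1] for n in field_numbers)


def five_tuple(sequences):
    """Count clicks for the first offered correction in one user's sequence set."""
    start = next((n for n, seq in enumerate(sequences) if seq[2] != NO_VALUE), None)
    if start is None:
        return None  # the engine never offered a correction term
    original, term = sequences[start][0], sequences[start][2]
    result_clicks = term_clicks = term_result_clicks = 0
    in_jump = False  # True while we are inside a run of sequences reached via the term
    for query, mark, correction, click in sequences[start:]:
        if query == original and correction == term:
            in_jump = False
            if click != NO_VALUE:
                result_clicks += 1  # clicked a result of the input query word
        elif query == term and mark == "1":
            if not in_jump:
                term_clicks += 1  # assumption: one click on the term per run
            in_jump = True
            if click != NO_VALUE:
                term_result_clicks += 1  # clicked a result of the correction term
        else:
            in_jump = False
    return (original, term, result_clicks, term_clicks, term_result_clicks)
```

Applied to the five sequences of cookie 001FCB319096148CEC8D4404A78A3ADD, the sketch returns the five-tuple given in the text.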
By analogy, a series of error correction five-tuples can be obtained:
(Zhong Guan village, Zhong Guan-cun, 0,1,2)
(zhong closes village, Zhong Guan-cun, 1,1,1)
(Zhong Guan village, Zhong Guan-cun, 1,1,3)
(Youku tvv new drama, Youku tvb new drama, 0,1,1)
(middle base express delivery, ultimate express delivery, 1,0,0)
(Closing Ceremony of the Games, Olympic Games closing ceremony, 1,0,0)
Take the one-day error correction five-tuple (Zhong Guan village, Zhong Guan-cun, 0,1,2) of Cookie=001FCB319096148CEC8D4404A78A3ADD as an example. The user was dissatisfied with the results for "Zhong Guan village" and so clicked none of them; the user then clicked the error correction term and clicked twice within its search results, which met the search need. The current correction from "Zhong Guan village" to "Zhong Guan-cun" is therefore likely to be correct. For the five-tuple (Closing Ceremony of the Games, Olympic Games closing ceremony, 1,0,0), by contrast, the user clicked a result after searching "Closing Ceremony of the Games" and did not click the error correction term "Olympic Games closing ceremony" offered by the search engine, which suggests that the current correction works poorly. One or two searches by a single user, however, have limited reference value; only statistical analysis over many searches by a large number of users can single out the error correction terms that are genuinely unreasonable.
Applying the above processing to all cookies of one day yields a massive amount of data. Aggregating this error correction click behavior data gives:
Zhong Guan village → Zhong Guan-cun:
(Zhong Guan village, Zhong Guan-cun, 1,1,3)
(Zhong Guan village, Zhong Guan-cun, 0,1,2)
(Zhong Guan village, Zhong Guan-cun, 1,0,0)
(Zhong Guan village, Zhong Guan-cun, 0,1,2)
Closing Ceremony of the Games → Olympic Games closing ceremony:
(Closing Ceremony of the Games, Olympic Games closing ceremony, 1,0,0)
(Closing Ceremony of the Games, Olympic Games closing ceremony, 2,0,0)
(Closing Ceremony of the Games, Olympic Games closing ceremony, 1,0,0)
Youku tvv new drama → Youku tvb new drama:
(Youku tvv new drama, Youku tvb new drama, 0,1,1)
(Youku tvv new drama, Youku tvb new drama, 0,1,1)
(Youku tvv new drama, Youku tvb new drama, 0,1,1)
Among the three error corrections above (Zhong Guan village → Zhong Guan-cun, Closing Ceremony of the Games → Olympic Games closing ceremony, Youku tvv new drama → Youku tvb new drama), "Youku tvv new drama → Youku tvb new drama" works best, because nearly every user who searched "Youku tvv new drama" clicked the error correction term "Youku tvb new drama" and then clicked within its search results. "Zhong Guan village → Zhong Guan-cun" comes second. "Closing Ceremony of the Games → Olympic Games closing ceremony" works worst: judging by click behavior, almost no user who searched "Closing Ceremony of the Games" clicked the error correction term, and these users did click results of the original query word "Closing Ceremony of the Games". The phrase "Closing Ceremony of the Games → Olympic Games closing ceremony" can therefore be provided to the user as a detection result.
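The aggregation step and the qualitative ranking above can be illustrated with a short sketch. It groups five-tuples by the pair (query word, error correction term) and reports, for each pair, the share of five-tuples in which the error correction term was clicked at least once. The function names and the choice of this single share as a summary are illustrative assumptions, not part of the patent.

```python
from collections import defaultdict


def group_by_correction(five_tuples):
    """Collect the three click counts of every five-tuple under its (query, term) pair."""
    groups = defaultdict(list)
    for query, term, result_clicks, term_clicks, term_result_clicks in five_tuples:
        groups[(query, term)].append((result_clicks, term_clicks, term_result_clicks))
    return dict(groups)


def correction_click_share(groups):
    """Share of five-tuples per pair in which the correction term was clicked."""
    return {
        pair: sum(1 for _, term_clicks, _ in rows if term_clicks > 0) / len(rows)
        for pair, rows in groups.items()
    }
```

On the aggregated data above, the share is highest for the Youku correction, lower for "Zhong Guan village → Zhong Guan-cun", and zero for the closing ceremony correction, which matches the ordering described in the text.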
To implement the above method, the present invention also provides a system for detecting error correction terms in a search engine. Fig. 2 is a schematic structural diagram of a preferred embodiment of this system. As shown in Fig. 2, the system comprises a statistic unit 20, a computing unit 21, and a generation unit 22, wherein:
The statistic unit 20 is configured to count, from users' search logs, the number of times users click search results of the input query word after an error correction term is offered, the number of times users click the error correction term, and the number of times users click search results of the error correction term;
The computing unit 21 is configured to compute the error correction quality value of the error correction term from the counts produced by the statistic unit;
The generation unit 22 is configured to generate the detection result for the error correction term according to the error correction quality value computed by the computing unit.
The counting performed by the statistic unit 20 from users' search logs specifically comprises:
Extracting the query word, the query word mark, the error correction term, and the user's click behavior from the users' search logs according to a preset configuration script, and composing error correction sequences from the extracted query word, query word mark, error correction term, and click behavior; the configuration script contains the sequence numbers of fields in the search log;
Counting according to the error correction sequences.
The query word mark indicates either that the query word was input by the user or that the query word was reached by clicking an error correction term. When the search engine offers an error correction term, the extracted error correction term is the term offered; when the search engine offers no error correction term, the extracted error correction term is a preset character. The user's click behavior is either that the user clicked none of the search results offered, or the specific search result that the user clicked.
Counting according to the error correction sequences specifically comprises: when, in an error correction sequence containing an error correction term offered by the search engine, the user's click behavior is a click on a specific search result, adding 1 to a preset first counter, whose value equals the number of times users click search results of the input query word after the error correction term is offered; among the error correction sequences that follow the sequence containing said error correction term in the error correction sequence set, when the query word mark indicates that the query word was reached by clicking the error correction term, adding 1 to a preset second counter, whose value equals the number of times users click the error correction term; and when, in an error correction sequence whose query word mark indicates that the query word was reached by clicking the error correction term, the user's click behavior is a click on a specific search result, adding 1 to a preset third counter, whose value equals the number of times users click search results of the error correction term.
The computation by the computing unit 21 of the error correction quality value from the counts specifically comprises:
Computing the error correction quality value of the error correction term with the following formula:
$$I = \beta_1 \times (1 - a) \times V_1 + \beta_2 \times j + \beta_3 \times i \times V_2$$
where
Figure BDA00002443867200151
$$V_2 = \frac{1}{\log_2(1 + g)}$$
a is the proportion of cases in which the number of times the user clicks search results of the input query word after the error correction term is offered is not 0; k is the arithmetic mean of the number of times the user clicks search results of the input query word after the error correction term is offered; j is the proportion of cases in which the number of times the user clicks the error correction term is not 0; i is the proportion of cases in which the number of clicks on search results of the error correction term is not 0; g is the arithmetic mean of the number of clicks on search results of the error correction term; $\beta_1$, $\beta_2$, $\beta_3$ are three weight-adjustment factors configured according to how the system performs in operation.
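A hedged sketch of this computation follows. Two points are assumptions rather than content of the patent. First, the definition of V1 survives only as a formula image, so the sketch takes it as a caller-supplied function `v1_of_k` of the mean k. Second, when g = 0 the denominator log2(1 + g) is 0; since i is then also 0, the sketch treats the whole third term as 0. The function name, the argument layout (a list of count triples for one query word and error correction term pair), and the default weights are hypothetical.

```python
import math


def quality_value(counts, v1_of_k, betas=(1.0, 1.0, 1.0)):
    """Compute I for one (query word, correction term) pair.

    counts: list of (result_clicks, term_clicks, term_result_clicks) triples.
    v1_of_k: caller-supplied stand-in for V1, whose formula is not in the text.
    """
    if not counts:
        raise ValueError("at least one five-tuple is required")
    n = len(counts)
    a = sum(1 for c in counts if c[0] != 0) / n  # share with clicks on original results
    k = sum(c[0] for c in counts) / n            # mean clicks on original results
    j = sum(1 for c in counts if c[1] != 0) / n  # share with clicks on the correction term
    i = sum(1 for c in counts if c[2] != 0) / n  # share with clicks on corrected results
    g = sum(c[2] for c in counts) / n            # mean clicks on corrected results
    v1 = v1_of_k(k)
    # Assumption: g == 0 implies i == 0, so the third term is taken as 0.
    v2 = 1.0 / math.log2(1.0 + g) if g > 0 else 0.0
    b1, b2, b3 = betas
    return b1 * (1.0 - a) * v1 + b2 * j + b3 * i * v2
```

For example, `quality_value(rows, lambda k: 1.0)` evaluates the formula with V1 fixed at 1, which is only a placeholder choice for experimentation.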
The generation by the generation unit 22 of the detection result according to the error correction quality value specifically comprises:
Sorting the computed error correction quality values in ascending order;
Taking as the detection result the phrases composed of the query word and error correction term whose error correction quality value is below a preset threshold; or taking as the detection result the phrases composed of the query word and error correction term that correspond to the one or more smallest error correction quality values.
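The two selection rules can be sketched as below. The sketch assumes the quality values are held in a dictionary keyed by (query word, error correction term) pairs; the function and parameter names are hypothetical.

```python
def select_unreasonable(scores, threshold=None, n_smallest=None):
    """Return the pairs flagged as detection results, lowest quality value first.

    scores: dict mapping (query word, correction term) to its quality value.
    Exactly one of threshold or n_smallest should be given.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1])  # ascending order
    if threshold is not None:
        return [pair for pair, value in ranked if value < threshold]
    if n_smallest is not None:
        return [pair for pair, _ in ranked[:n_smallest]]
    raise ValueError("give either threshold or n_smallest")
```

The threshold rule suits a fixed quality bar, while the n-smallest rule bounds the amount of manual review regardless of how the values are distributed.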
In this preferred embodiment, the system for detecting error correction terms in a search engine may be integrated into the search engine, or may exist separately, independent of the search engine; no limitation is imposed here.
The above technical solution of the present invention can automatically detect unreasonable error correction terms in a search engine from users' search logs. Compared with traditional manual detection and identification methods, the detection process is an automated modeling and analysis process, so unreasonable error correction terms are detected and identified more efficiently, saving manpower and material resources. Moreover, a full day of search logs can be analyzed, giving a recall rate far higher than that of traditional methods.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A method for detecting error correction terms in a search engine, characterized in that the method comprises:
Counting, from users' search logs, the number of times users click search results of the input query word after an error correction term is offered, the number of times users click the error correction term, and the number of times users click search results of the error correction term;
Computing an error correction quality value of the error correction term from the counts, and generating a detection result for the error correction term according to the error correction quality value.
2. The method according to claim 1, characterized in that the counting from users' search logs comprises:
Extracting the query word, the query word mark, the error correction term, and the user's click behavior from the users' search logs according to a preset configuration script, and composing error correction sequences from the extracted query word, query word mark, error correction term, and click behavior; the configuration script contains the sequence numbers of fields in the search log;
Counting according to the error correction sequences.
3. The method according to claim 2, characterized in that:
The query word mark indicates either that the query word was input by the user or that the query word was reached by clicking an error correction term;
When the search engine offers an error correction term, the extracted error correction term is the term offered; when the search engine offers no error correction term, the extracted error correction term is a preset character;
The user's click behavior is either that the user clicked none of the search results offered, or the specific search result that the user clicked.
4. The method according to claim 2, characterized in that the counting according to the error correction sequences specifically comprises:
When, in an error correction sequence containing an error correction term offered by the search engine, the user's click behavior is a click on a specific search result, adding 1 to a preset first counter, whose value equals the number of times users click search results of the input query word after the error correction term is offered;
Among the error correction sequences that follow the sequence containing said error correction term in the error correction sequence set, when the query word mark indicates that the query word was reached by clicking the error correction term, adding 1 to a preset second counter, whose value equals the number of times users click the error correction term;
When, in an error correction sequence whose query word mark indicates that the query word was reached by clicking the error correction term, the user's click behavior is a click on a specific search result, adding 1 to a preset third counter, whose value equals the number of times users click search results of the error correction term.
5. The method according to claim 1, characterized in that computing the error correction quality value of the error correction term from the counts specifically comprises:
Computing the error correction quality value of the error correction term with the following formula:
$$I = \beta_1 \times (1 - a) \times V_1 + \beta_2 \times j + \beta_3 \times i \times V_2$$
where
Figure FDA00002443867100021
$$V_2 = \frac{1}{\log_2(1 + g)}$$
a is the proportion of cases in which the number of times the user clicks search results of the input query word after the error correction term is offered is not 0; k is the arithmetic mean of the number of times the user clicks search results of the input query word after the error correction term is offered; j is the proportion of cases in which the number of times the user clicks the error correction term is not 0; i is the proportion of cases in which the number of clicks on search results of the error correction term is not 0; g is the arithmetic mean of the number of clicks on search results of the error correction term; $\beta_1$, $\beta_2$, $\beta_3$ are three weight-adjustment factors configured according to how the system performs in operation.
6. The method according to claim 1, characterized in that generating the detection result for the error correction term according to the error correction quality value specifically comprises:
Sorting the computed error correction quality values in ascending order;
Taking as the detection result the phrases composed of the query word and error correction term whose error correction quality value is below a preset threshold; or taking as the detection result the phrases composed of the query word and error correction term that correspond to the one or more smallest error correction quality values.
7. A system for detecting error correction terms in a search engine, characterized in that the system comprises a statistic unit, a computing unit, and a generation unit, wherein:
The statistic unit is configured to count, from users' search logs, the number of times users click search results of the input query word after an error correction term is offered, the number of times users click the error correction term, and the number of times users click search results of the error correction term;
The computing unit is configured to compute the error correction quality value of the error correction term from the counts produced by the statistic unit;
The generation unit is configured to generate the detection result for the error correction term according to the error correction quality value computed by the computing unit.
8. The system according to claim 7, characterized in that the counting performed by the statistic unit from users' search logs specifically comprises:
Extracting the query word, the query word mark, the error correction term, and the user's click behavior from the users' search logs according to a preset configuration script, and composing error correction sequences from the extracted query word, query word mark, error correction term, and click behavior; the configuration script contains the sequence numbers of fields in the search log;
Counting according to the error correction sequences.
9. The system according to claim 8, characterized in that:
The query word mark indicates either that the query word was input by the user or that the query word was reached by clicking an error correction term;
When the search engine offers an error correction term, the extracted error correction term is the term offered; when the search engine offers no error correction term, the extracted error correction term is a preset character;
The user's click behavior is either that the user clicked none of the search results offered, or the specific search result that the user clicked.
10. The system according to claim 8, characterized in that the counting according to the error correction sequences specifically comprises:
When, in an error correction sequence containing an error correction term offered by the search engine, the user's click behavior is a click on a specific search result, adding 1 to a preset first counter, whose value equals the number of times users click search results of the input query word after the error correction term is offered;
Among the error correction sequences that follow the sequence containing said error correction term in the error correction sequence set, when the query word mark indicates that the query word was reached by clicking the error correction term, adding 1 to a preset second counter, whose value equals the number of times users click the error correction term;
When, in an error correction sequence whose query word mark indicates that the query word was reached by clicking the error correction term, the user's click behavior is a click on a specific search result, adding 1 to a preset third counter, whose value equals the number of times users click search results of the error correction term.
11. The system according to claim 7, characterized in that the computation by the computing unit of the error correction quality value from the counts specifically comprises:
Computing the error correction quality value of the error correction term with the following formula:
$$I = \beta_1 \times (1 - a) \times V_1 + \beta_2 \times j + \beta_3 \times i \times V_2$$
where
Figure FDA00002443867100041
$$V_2 = \frac{1}{\log_2(1 + g)}$$
a is the proportion of cases in which the number of times the user clicks search results of the input query word after the error correction term is offered is not 0; k is the arithmetic mean of the number of times the user clicks search results of the input query word after the error correction term is offered; j is the proportion of cases in which the number of times the user clicks the error correction term is not 0; i is the proportion of cases in which the number of clicks on search results of the error correction term is not 0; g is the arithmetic mean of the number of clicks on search results of the error correction term; $\beta_1$, $\beta_2$, $\beta_3$ are three weight-adjustment factors configured according to how the system performs in operation.
12. The system according to claim 7, characterized in that the generation by the generation unit of the detection result according to the error correction quality value specifically comprises:
Sorting the computed error correction quality values in ascending order;
Taking as the detection result the phrases composed of the query word and error correction term whose error correction quality value is below a preset threshold; or taking as the detection result the phrases composed of the query word and error correction term that correspond to the one or more smallest error correction quality values.
CN201210476236.3A 2012-11-21 2012-11-21 The detection method and system of error correction term in a kind of search engine Active CN103838739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210476236.3A CN103838739B (en) 2012-11-21 2012-11-21 The detection method and system of error correction term in a kind of search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210476236.3A CN103838739B (en) 2012-11-21 2012-11-21 The detection method and system of error correction term in a kind of search engine

Publications (2)

Publication Number Publication Date
CN103838739A true CN103838739A (en) 2014-06-04
CN103838739B CN103838739B (en) 2019-05-28

Family

ID=50802253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210476236.3A Active CN103838739B (en) 2012-11-21 2012-11-21 The detection method and system of error correction term in a kind of search engine

Country Status (1)

Country Link
CN (1) CN103838739B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036004A (en) * 2014-06-17 2014-09-10 百度在线网络技术(北京)有限公司 Search error correction method and search error correction device
CN106202153A (en) * 2016-06-21 2016-12-07 广州智索信息科技有限公司 The spelling error correction method of a kind of ES search engine and system
CN106339404A (en) * 2016-06-30 2017-01-18 北京奇艺世纪科技有限公司 Search word recognition method and device
CN109002521A (en) * 2018-07-12 2018-12-14 北京猫眼文化传媒有限公司 Error correction method, device and the storage medium of search key
CN110851459A (en) * 2018-07-25 2020-02-28 上海柯林布瑞信息技术有限公司 Searching method and device, storage medium and server

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1670723A (en) * 2004-03-16 2005-09-21 微软公司 Systems and methods for improved spell checking
CN101131706A (en) * 2007-09-28 2008-02-27 北京金山软件有限公司 Query amending method and system thereof
US20090019002A1 (en) * 2007-07-13 2009-01-15 Medio Systems, Inc. Personalized query completion suggestion
CN101350004A (en) * 2008-09-11 2009-01-21 北京搜狗科技发展有限公司 Method for forming personalized error correcting model and input method system of personalized error correcting
CN101369285A (en) * 2008-10-17 2009-02-18 清华大学 Spell emendation method for query word in Chinese search engine
CN101727271A (en) * 2008-10-22 2010-06-09 北京搜狗科技发展有限公司 Method and device for providing error correcting prompt and input method system
US20100306229A1 (en) * 2009-06-01 2010-12-02 Aol Inc. Systems and Methods for Improved Web Searching
CN102043833A (en) * 2010-11-25 2011-05-04 北京搜狗科技发展有限公司 Search method and device based on query word
US20120158765A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation User Interface for Interactive Query Reformulation

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1670723A (en) * 2004-03-16 2005-09-21 微软公司 Systems and methods for improved spell checking
US20090019002A1 (en) * 2007-07-13 2009-01-15 Medio Systems, Inc. Personalized query completion suggestion
CN101131706A (en) * 2007-09-28 2008-02-27 北京金山软件有限公司 Query amending method and system thereof
CN101350004A (en) * 2008-09-11 2009-01-21 北京搜狗科技发展有限公司 Method for forming personalized error correcting model and input method system of personalized error correcting
CN101369285A (en) * 2008-10-17 2009-02-18 清华大学 Spell emendation method for query word in Chinese search engine
CN101727271A (en) * 2008-10-22 2010-06-09 北京搜狗科技发展有限公司 Method and device for providing error correcting prompt and input method system
US20100306229A1 (en) * 2009-06-01 2010-12-02 Aol Inc. Systems and Methods for Improved Web Searching
CN102043833A (en) * 2010-11-25 2011-05-04 北京搜狗科技发展有限公司 Search method and device based on query word
US20120158765A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation User Interface for Interactive Query Reformulation

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036004A (en) * 2014-06-17 2014-09-10 百度在线网络技术(北京)有限公司 Search error correction method and search error correction device
CN106202153A (en) * 2016-06-21 2016-12-07 广州智索信息科技有限公司 The spelling error correction method of a kind of ES search engine and system
CN106339404A (en) * 2016-06-30 2017-01-18 北京奇艺世纪科技有限公司 Search word recognition method and device
CN106339404B (en) * 2016-06-30 2019-10-22 北京奇艺世纪科技有限公司 A kind of search word recognition method and device
CN109002521A (en) * 2018-07-12 2018-12-14 北京猫眼文化传媒有限公司 Error correction method, device and the storage medium of search key
CN110851459A (en) * 2018-07-25 2020-02-28 上海柯林布瑞信息技术有限公司 Searching method and device, storage medium and server
CN110851459B (en) * 2018-07-25 2021-08-13 上海柯林布瑞信息技术有限公司 Searching method and device, storage medium and server

Also Published As

Publication number Publication date
CN103838739B (en) 2019-05-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant