Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a method for automatically and quickly evaluating online search engines. The method compares the differences in results among search engines. It is suitable both for regular comparative evaluation of search engines and for frequent evaluation while a search engine is being optimized, to check whether the optimized algorithm is actually an improvement.
To achieve this purpose, the invention adopts the following technical solution:
A document-library-independent method for automatic comparison and evaluation of multiple search engines, characterized by comprising the following steps:
A. selecting an evaluation word;
B. capturing a search result and storing the search result as a document;
C. extracting a document text;
D. calculating the correlation;
E. integrating the documents and ordering according to the relevance of the documents;
F. calculating DCG;
G. ranking according to the DCG results and summarizing the evaluation results.
Preferably, the evaluation words comprise: page search keywords for web search, and movie names or actor names for video search.
Preferably, the capturing comprises two capture passes;
the first capture comprises: generating the search result links of each search engine according to the keywords, performing the first capture, using a template to extract from each search engine the relevant information of each result and the link to each result's detail page, and storing the information and the links; the template is a regular expression comprising a search condition;
the second capture comprises: capturing the corresponding pages according to the detail-page links obtained in the first capture, and storing each page as a document in result order.
Preferably, the text extraction method comprises: a DOM-tree-based HTML extraction method, or a longest-text-string extraction method;
the DOM-tree-based HTML extraction method comprises: converting the HTML text into a DOM tree, and then extracting the content related to the body text by analyzing the nodes of the DOM tree, so as to remove irrelevant information from the page; the irrelevant information includes page noise and HTML tags;
the longest-text-string extraction method comprises: finding the longest text string in the HTML page content, expanding from that string backward and forward until the expansion reaches a threshold, and then truncating and extracting to obtain the body text of the page.
Preferably, the relevance calculation method comprises a word frequency ratio method, expressed as: relevance = proportion of word frequency in the document - proportion of word frequency in all captured results.
Preferably, the sorting by relevance comprises: dividing the documents evenly into several grades, and setting a corresponding relevance coefficient score for each grade.
Preferably, the DCG is calculated by the formula:
$$DCG_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
where s is the total number of documents, i is the ordinal position of the document in the result ranking, and rel_i is the relevance coefficient score of the grade to which the document at position i belongs.
Preferably, the calculation results obtained in step F are ranked and analyzed to obtain several output results, and a report is generated; the output results include: the ranking by average DCG score, the ranking by total DCG score, and the ranking by the number of keywords for which each engine's search results were better or worse.
Compared with the prior art, the invention has the beneficial effects that:
1) automation is realized, manual participation is not needed, and a large amount of labor is saved;
2) the evaluation result can be obtained in a short time;
3) the method is flexible: many parts of the process are configurable, and the relevance calculation and other components can be adjusted automatically;
4) the whole method can be applied to various vertical searches, including not only simple web search but also news search, video search and the like.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Analysis of the search engines and studies of how users use them show that users care mainly about two aspects of a search engine: accuracy and ranking. Accuracy means that the content shown in the search results is the content the user wants; ranking means that the results closest to the user's need are placed first, so that the user can find the desired content directly without scrolling down or turning pages.
The method comprises the following specific steps:
1) Selecting evaluation words
The selection of evaluation words directly determines how closely the evaluation results match actual performance, so the evaluation words should cover as much search volume as possible. The range and number of words can be changed according to the actual situation: for evaluating web search, web search keywords are selected; for video search, frequently searched movie names, actor names and the like are selected.
2) Capturing search results of each search engine
Research on user behavior shows that most users only look at the first 2 pages of search results, that is, about 40 results, so by default the invention captures the first 40 results of each search for analysis (the number is configurable as needed). Most search engines return only a link to the source address and an abstract for each result. Because these returned results are incomplete, the invention performs a second capture: it visits the source address and captures the complete result page, which is used to calculate the relevance between the page and the search term.
In the two-pass capture, the search result links of each search engine are first generated according to the keywords and the first capture is performed. A regular-expression template is used to extract from each search engine the relevant information of each result and the link to each result's detail page; the information and links are stored, and the links are used for the second capture.
The second capture takes the detail-page links from the results of the first capture, captures the corresponding pages, stores them in order, and provides them to step 3.
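The two-pass capture can be sketched as follows. This is only an illustration under stated assumptions: the template `RESULT_TEMPLATE`, the `Result` record, and the function names are hypothetical, the listing markup is invented for the example, and the network is replaced by an in-memory lookup so that the sketch runs without external access.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical template: a regular expression with named groups for the
# detail-page link and the title of each result on a listing page.
RESULT_TEMPLATE = re.compile(
    r'<a class="result" href="(?P<link>[^"]+)">(?P<title>[^<]*)</a>'
)

@dataclass
class Result:
    rank: int       # position of the result on the search engine's listing
    title: str
    link: str
    page: str = ""  # full page, filled in by the second capture

def first_capture(listing_html: str, limit: int = 40) -> List[Result]:
    """Apply the template to a listing page; keep the first `limit` results."""
    results = []
    for rank, match in enumerate(RESULT_TEMPLATE.finditer(listing_html), start=1):
        if rank > limit:
            break
        results.append(Result(rank, match.group("title"), match.group("link")))
    return results

def second_capture(results: List[Result], fetch: Callable[[str], str]) -> List[Result]:
    """Fetch the detail page behind each stored link, preserving result order."""
    for result in results:
        result.page = fetch(result.link)
    return results

# Usage with an in-memory stand-in for the network (assumption for the example).
LISTING = (
    '<a class="result" href="http://example.com/a">First</a>'
    '<a class="result" href="http://example.com/b">Second</a>'
)
PAGES = {
    "http://example.com/a": "<html>page A</html>",
    "http://example.com/b": "<html>page B</html>",
}
docs = second_capture(first_capture(LISTING), lambda url: PAGES[url])
```

Passing the fetch function as a parameter keeps the template logic testable; in a real run it would be replaced by an HTTP client of the implementer's choice.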
3) Text extraction
Most result pages captured from a source address contain noise such as advertisements, so before the relevance is calculated, the body text of the result page must be extracted so that the calculation is not affected by the noise.
For text extraction, a common method such as the DOM (Document Object Model) tree based HTML (HyperText Markup Language) extraction method or the longest-text-string extraction method can be used to obtain the body text of the result page; the relevance between the page and the search keyword is then calculated from this text.
The DOM-tree-based HTML extraction method first converts the HTML text into a DOM tree, then extracts the content related to the body text by analyzing the nodes of the DOM tree, removing irrelevant information such as page noise and HTML tags. The key difficulty of this method is repairing the DOM tree correctly when it is incomplete.
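A simplified illustration of node-based filtering is given below. It is not a full DOM-tree implementation: it uses the event-based `html.parser` module of the Python standard library instead of building a tree, and the set `NOISE_TAGS` and the names `BodyTextExtractor` and `extract_text` are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List

# Assumed set of elements whose content is treated as page noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}

class BodyTextExtractor(HTMLParser):
    """Collect text nodes, skipping everything nested inside noise elements."""

    def __init__(self) -> None:
        super().__init__()
        self._noise_depth = 0          # > 0 while inside a noise element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._noise_depth == 0 and text:
            self.chunks.append(text)

def extract_text(html: str) -> str:
    """Return the text outside noise elements, with all tags removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = (
    "<html><script>var x = 1;</script>"
    '<nav><a href="#">Home</a></nav>'
    "<p>Hello <b>world</b></p></html>"
)
body = extract_text(sample)
```

The parser tolerates unclosed tags, which sidesteps the tree-repair problem in this sketch but does not solve it; a production system would more likely build and repair a real tree.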
The longest-text-string extraction method is suitable for pages whose body text is long. The longest text string is found in the HTML content and is then expanded backward and forward until the expansion reaches a threshold; the result is then truncated and extracted to obtain the body text of the page.
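The longest-text-string method can be sketched as follows. The sketch assumes that a "text string" is a run of text between two tags, that expansion alternates between the preceding and following runs, and that the threshold is a character count; the function names and the default threshold are hypothetical.

```python
import re
from typing import List

TAG = re.compile(r"<[^>]+>")

def text_runs(html: str) -> List[str]:
    """Split the HTML on tags and keep the non-empty text runs in page order."""
    return [run.strip() for run in TAG.split(html) if run.strip()]

def extract_main_text(html: str, threshold: int = 200) -> str:
    """Grow outward from the longest text run until `threshold` characters."""
    runs = text_runs(html)
    if not runs:
        return ""
    # Seed: index of the longest text run.
    start = end = max(range(len(runs)), key=lambda i: len(runs[i]))
    total = len(runs[start])
    # Expand backward, then forward, until the threshold is reached
    # or there is nothing left to add on either side.
    while total < threshold and (start > 0 or end < len(runs) - 1):
        if start > 0:
            start -= 1
            total += len(runs[start])
        if total < threshold and end < len(runs) - 1:
            end += 1
            total += len(runs[end])
    # Truncate the joined runs to the threshold.
    return " ".join(runs[start:end + 1])[:threshold]

page = "<div>Ad</div><p>" + "x" * 50 + "</p><p>tail</p>"
main_text = extract_main_text(page, threshold=50)
```

The simple tag pattern does not treat script or style content specially, so in practice such elements would need to be removed first.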
4) Computing correlations
The calculation of relevance is the key link in the method of the invention. Steps 2 and 3 prepare for it, and the final evaluation result is correct only if the relevance between each search result and the search keyword is calculated correctly.
The choice of relevance calculation rule also varies with the type of vertical search: web search emphasizes how well the content matches; news search must consider both content match and time; video search emphasizes titles, comments and the like.
In the invention, the relevance algorithm can be adjusted flexibly: a small set of manually evaluated results can be used as samples to adjust the weights of the relevance calculation dynamically by machine learning, or an established relevance algorithm can be adopted directly.
For example, in a news search test, a word frequency ratio method is used to calculate the relevance of the plain text. The specific algorithm is: relevance = proportion of word frequency in the document - proportion of word frequency in all captured results, that is:
where
a factor of 3 is applied to balance the weight against P(D);
in the formula, n is the number of words after word segmentation, N(i) is the number of occurrences of word i, L(i) is the length of word i, and L(T) is the full-text length;
T(i) is the number of times that word i appears in all search results of all search engines;
the time relevance takes the form of an inverse curve,
where T(n) is the current time, T(t) is the publication time, and the weight W in the numerator balances the weight between P(M) and P(T);
the final relevance is calculated as the harmonic mean of the two,
which increases the weight of the term with the lower relevance and makes the result more realistic.
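The effect of the harmonic mean can be illustrated with a short sketch. It covers only the final combination step and assumes that the content relevance P(M) and the time relevance P(T) have already been calculated as positive scores on comparable scales; the function name and the sample values are hypothetical.

```python
def combine_relevance(p_m: float, p_t: float) -> float:
    """Harmonic mean of content relevance P(M) and time relevance P(T)."""
    if p_m <= 0 or p_t <= 0:
        # Assumption for the sketch: a non-positive component gives zero.
        return 0.0
    return 2 * p_m * p_t / (p_m + p_t)

# A document that matches well but is old: the low time score dominates.
content_score, time_score = 0.9, 0.1
harmonic = combine_relevance(content_score, time_score)
arithmetic = (content_score + time_score) / 2
```

In this example the harmonic mean is well below the arithmetic mean, so a document cannot reach a high final relevance on content match alone.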
5) Integrate and sort by relevance
Step 4 calculates the relevance of each result document. In this step, all result documents returned by all search engines for a single search keyword are pooled and sorted by relevance, and then divided evenly into three grades: superior, intermediate and inferior (the number of grades can be changed as required, and the grading is automatic). The relevance coefficient scores of the grades are set to 3, 2 and 1 respectively (with N grades, the scores run from N down to 1) and are supplied to the DCG formula to calculate the final DCG score.
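The pooling and grading can be sketched as follows. The tuple layout `(engine, rank, relevance)`, the function name, and the sample values are assumptions for illustration; when the number of documents is not divisible by the number of grades, this sketch lets the grade sizes differ by at most one.

```python
from typing import Dict, List, Tuple

# Hypothetical record: (engine name, rank within that engine, relevance).
Doc = Tuple[str, int, float]

def grade_documents(docs: List[Doc], n_grades: int = 3) -> Dict[Tuple[str, int], int]:
    """Pool the documents of all engines for one keyword, sort them by
    relevance, and map each (engine, rank) to a score from n_grades down to 1."""
    ordered = sorted(docs, key=lambda d: d[2], reverse=True)
    scores = {}
    for position, (engine, rank, _relevance) in enumerate(ordered):
        grade = position * n_grades // len(ordered)   # 0 is the best grade
        scores[(engine, rank)] = n_grades - grade
    return scores

pooled = [
    ("A", 1, 0.90), ("A", 2, 0.40), ("A", 3, 0.70),
    ("B", 1, 0.20), ("B", 2, 0.80), ("B", 3, 0.10),
]
rel = grade_documents(pooled)
```

The resulting scores are keyed by engine and rank so that they can be regrouped per search engine for the DCG calculation.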
6) Calculating DCG
DCG (Discounted Cumulative Gain) is an evaluation measure for ranking quality: the score is high when highly relevant documents are ranked near the top of the result page, and low when documents with low relevance are ranked near the top. The DCG of s documents is calculated as:
$$DCG_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}$$
Step 5 has sorted the search results for a single search keyword and assigned each document a relevance coefficient score, which is rel_i in the formula. All results for the keyword are then grouped by search engine. Within each search engine's group, the formula is applied according to the rank i of each result in that search engine to obtain the total DCG score of the keyword for that engine; computing every group gives the DCG score of the keyword for each search engine.
In the calculation process of the DCG, there are the following cases:
1. The results of search engine A are generally more relevant than those of search engine B, but A orders them less well than B. Since rel_A is generally higher than rel_B, the DCG of A is higher than that of B, which is logical.
2. The results of search engine A are about as relevant as those of search engine B, but A orders them better. In this case the results of B with higher rel scores are ranked later and are pulled down by the rank discount 1/log_2 i, so the overall DCG of B is lower than that of A, which is logical.
3. The results of search engine A are more relevant than those of search engine B and are also ordered better, so the DCG of A is higher than that of B, which is logical.
These 3 cases show that the DCG result can be used as a criterion for evaluating search engine results in the implementation of the invention.
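The cases above can be checked with a minimal sketch. It assumes the conventional form of DCG in which the first position is not discounted and position i (i >= 2) is discounted by 1/log_2 i, consistent with the discount named in case 2; the function name and the sample score lists are hypothetical.

```python
import math
from typing import Sequence

def dcg(rels: Sequence[float]) -> float:
    """DCG of one engine's results for one keyword.
    rels[0] is the score at rank 1, rels[1] at rank 2, and so on."""
    if not rels:
        return 0.0
    # Rank 1 is undiscounted; rank i >= 2 is divided by log2(i).
    return rels[0] + sum(
        rel / math.log2(i) for i, rel in enumerate(rels[1:], start=2)
    )

# Case 1: A is more relevant overall but ordered less well.
case1_a, case1_b = dcg([2, 3, 3]), dcg([2, 1, 1])
# Case 2: same relevance scores, A is ordered better.
case2_a, case2_b = dcg([3, 2, 1]), dcg([1, 2, 3])
# Case 3: A is more relevant and ordered better.
case3_a, case3_b = dcg([3, 3, 2]), dcg([1, 1, 2])
```

In each of the three pairs the score of A comes out higher than that of B, matching the reasoning given for the cases.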
7) Sorting according to DCG results and summarizing evaluation results
The results obtained in step 6 are ranked and analyzed in detail to obtain several output results, such as the ranking by average DCG score, the ranking by total DCG score, and the ranking by the number of keywords for which each engine's search results were better or worse. A report is generated so that the results can be compared and inspected at a glance.
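The summary step can be sketched as follows. The input layout (keyword mapped to per-engine DCG scores), the function name, and the sample figures are assumptions for illustration, and the sketch assumes that every engine has a score for every keyword.

```python
from collections import Counter
from typing import Dict, List, Union

def summarize(dcg_by_keyword: Dict[str, Dict[str, float]]) -> Dict[str, Union[dict, List[str]]]:
    """Aggregate per-keyword DCG scores into totals, averages, and win counts."""
    totals: Counter = Counter()
    wins: Counter = Counter()
    for scores in dcg_by_keyword.values():
        for engine, score in scores.items():
            totals[engine] += score
        # The engine with the highest DCG "wins" this keyword.
        wins[max(scores, key=scores.get)] += 1
    n_keywords = len(dcg_by_keyword)
    averages = {engine: total / n_keywords for engine, total in totals.items()}
    ranking = sorted(totals, key=totals.get, reverse=True)
    return {
        "total": dict(totals),
        "average": averages,
        "wins": dict(wins),
        "ranking": ranking,   # engines ordered by total DCG, best first
    }

sample_scores = {
    "keyword 1": {"A": 5.0, "B": 4.0},
    "keyword 2": {"A": 3.0, "B": 6.0},
    "keyword 3": {"A": 4.0, "B": 1.0},
}
report = summarize(sample_scores)
```

The returned dictionary holds the figures named above and could be rendered as a table or chart in the generated report.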
The method of the invention obtains evaluation results simply and quickly, and avoids the large amount of time and labor consumed by manual evaluation. It was tested on news search, a type of vertical search: 3000 hot news words and 4 search engines (Baidu, Sogou, Zhongxie and Yahoo) were selected (Google was not included as an evaluation target because of problems such as frequent blocking), 40 search results were taken from each search engine, and the evaluation took about 2 hours (the bottleneck being web page capture). Comparison with the results of manual evaluation showed that the evaluation results of the invention differ from the manual results by no more than 5%.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the invention and do not limit them. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalent substitutions may be made to the embodiments without departing from the spirit and scope of the invention, and that all such modifications and substitutions are covered by the claims.