CN103544307B - A kind of multiple search engine automation contrast evaluating method independent of document library - Google Patents

Info

Publication number
CN103544307B
Authority
CN
China
Prior art keywords
search
text
document
results
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310538069.5A
Other languages
Chinese (zh)
Other versions
CN103544307A (en)
Inventor
张鹏飞
赵毅强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Cloud Business Network Technology Co ltd
Original Assignee
Beijing Zhongsou Cloud Business Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Cloud Business Network Technology Co Ltd
Priority to CN201310538069.5A
Publication of CN103544307A
Application granted
Publication of CN103544307B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an automated method for comparing and evaluating multiple search engines that does not depend on a document library. The method comprises the following steps: A. selecting evaluation words; B. capturing search results and saving them as documents; C. extracting the document text; D. calculating relevance; E. integrating the documents and sorting them by relevance; F. calculating DCG; G. ranking by the DCG results and summarizing the evaluation results. The invention achieves the following effects: it is automated and requires no manual participation, saving a large amount of labor; it is fast, producing evaluation results in a short time; it is flexible, because many steps of the process are configurable and the relevance calculation can be adjusted as needed; and it can be applied to a variety of vertical searches, covering not only plain web search but also news search, video search, and so on.

Description

Multi-search-engine automatic comparison and evaluation method independent of document library
Technical Field
The invention belongs to the field of search engines, and particularly relates to an automatic comparison and evaluation method for multiple search engines independent of a document library.
Background
In today's network environment, search engines have become an indispensable tool for internet users, and many search engines exist on the internet. There are two main approaches to comparing search engine results. In the first, keywords are selected manually and searched on each engine, every result on the returned pages is scored by hand, and the scores are compared to judge the strengths and weaknesses of the engines. The second relies on a document library (corpus) and evaluates each search engine algorithm in terms of precision and recall.
Manually evaluating search engine results is labor intensive and time consuming. While a search engine is being optimized it must be evaluated frequently, which makes manual evaluation very difficult and, in practice, unrealistic.
The method that relies on a document library can only be used for offline search engines. Because each search engine has a different document library, it cannot evaluate search engines that are running online.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a method that automatically and quickly evaluates online search engines. It compares the differences between the results of the search engines and is suited both to regular comparative evaluation across engines and to frequent evaluation during optimization, to check whether an optimized algorithm is actually an improvement.
To achieve this purpose, the invention adopts the following technical scheme:
A multi-search-engine automatic comparison and evaluation method independent of a document library, characterized by comprising the following steps:
A. selecting evaluation words;
B. capturing search results and storing them as documents;
C. extracting the document text;
D. calculating relevance;
E. integrating the documents and sorting them by relevance;
F. calculating DCG;
G. ranking according to the DCG results and summarizing the evaluation results.
Preferably, the evaluation word comprises: a page search keyword in a web search, a movie name or an actor name in a video search.
Preferably, the capturing comprises two capture passes.
The first capture comprises: generating the search result link for each search engine from the keyword, capturing the result page, using a template to extract the relevant information of each result and the link to each result's detail page, and storing that information and those links; the template is a regular expression that includes the search condition.
The second capture comprises: capturing the corresponding pages through the detail-page links obtained in the first capture, and storing each page as a document in order.
Preferably, the text extraction method comprises: an HTML extraction method based on the DOM tree, and a longest-text-string extraction method.
The HTML extraction method based on the DOM tree comprises: converting the HTML text into a DOM tree, then analyzing the nodes of the DOM tree to extract the content that belongs to the body text, thereby removing irrelevant information from the page; the irrelevant information includes page noise and HTML tags.
The longest-text-string extraction method comprises: finding the longest text string in the HTML page content, expanding it backward and forward until a threshold is reached, and then truncating and extracting it to obtain the body text.
Preferably, the relevance calculation method comprises a word frequency ratio method, expressed as: relevance = the proportion of word frequency in the document - the proportion of word frequency in all the captured results.
Preferably, the sorting by relevance comprises: dividing the documents evenly into several grades and setting a corresponding relevance coefficient score for each grade.
Preferably, the DCG is calculated by the formula:
where s is the total number of documents, i is the ordinal position of the document, and rel_i is the relevance coefficient score of the grade in which the document falls.
Preferably, the calculation results obtained in step F are sorted and analyzed to obtain various output results, and a report is generated; the output results include: a ranking by average DCG score, a ranking by total DCG score, and a ranking by the number of good and bad search results across all keywords.
Compared with the prior art, the invention has the following beneficial effects:
1) it is automated and requires no manual participation, saving a large amount of labor;
2) it is fast, and evaluation results are obtained in a short time;
3) it is flexible: many steps of the process are configurable, and the relevance calculation and other components can be adjusted as needed;
4) the whole method can be applied to a variety of vertical searches, covering not only plain web search but also news search, video search, and so on.
Drawings
FIG. 1 is a flow chart of the evaluation process of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Analysis of the search engines and research on how users use them show that users care mainly about two aspects of a search engine: accuracy and ordering. Accuracy means that the content shown in the search results is what the user wants. Ordering means that the results closest to the user's need are placed first, so that the user can find the desired content directly without scrolling down or turning pages.
The method comprises the following specific steps:
1) Selecting evaluation words
The choice of evaluation words directly determines how well the evaluation results match actual performance, so the words should cover as much search volume as possible. The range and number of words can be changed according to the actual situation: for a web search evaluation, web search keywords are selected; for a video search evaluation, frequently searched movie names, actor names, and the like are selected.
2) Capturing search results of each search engine
Research on user behavior shows that most users look only at the first 2 pages of search results, that is, about 40 results, so by default the invention captures the first 40 results for analysis (the number is configurable). For most results, a search engine returns only the link to the source address and a summary. Because these are incomplete, the invention performs a second capture that visits the source address and fetches the complete result page, which is then used to calculate the relevance between the page and the search terms.
The two captures proceed as follows. First, the search result link for each search engine is generated from the keyword and the first capture is performed. A regular expression template is used to extract, from each search engine, the relevant information of each result and the link to each result's detail page. This information and these links are stored, and the links are used for the second capture.
The second capture takes the detail-page links from the results of the first capture, fetches the corresponding pages, and stores them in order for use in step 3.
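The two-pass capture can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the template pattern, the engine name `engine_a`, the sample result page, and the `fetch` callable are all hypothetical, and a real template would have to be written separately for each search engine's markup.

```python
import re
from typing import Callable, List

# Hypothetical per-engine template: a regular expression that pulls the
# result link and title out of one engine's result page markup.
TEMPLATES = {
    "engine_a": re.compile(
        r'<h3 class="t"><a href="(?P<url>[^"]+)">(?P<title>.*?)</a></h3>', re.S
    ),
}

def first_capture(engine: str, result_page_html: str, limit: int = 40) -> List[dict]:
    """Pass 1: extract rank, title and detail link for each result."""
    results = []
    for rank, m in enumerate(TEMPLATES[engine].finditer(result_page_html), start=1):
        if rank > limit:
            break
        results.append({"engine": engine, "rank": rank,
                        "title": m.group("title"), "url": m.group("url")})
    return results

def second_capture(results: List[dict], fetch: Callable[[str], str]) -> List[dict]:
    """Pass 2: fetch each linked page and keep it as a document, in rank order."""
    return [dict(r, html=fetch(r["url"])) for r in results]

# Offline demonstration with a made-up result page and a stub fetcher.
sample_page = (
    '<h3 class="t"><a href="http://example.com/a">First</a></h3>'
    '<h3 class="t"><a href="http://example.com/b">Second</a></h3>'
)

def stub_fetch(url: str) -> str:
    return f"<html><body>page at {url}</body></html>"

documents = second_capture(first_capture("engine_a", sample_page), stub_fetch)
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; in a real run it would be replaced by an HTTP client call.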
3) Text extraction
Most result pages captured from source addresses contain noise such as advertisements. Before relevance is calculated, the body text of each result page must therefore be extracted so that the noise does not affect the result.
For text extraction, a common method such as the DOM (Document Object Model) tree based HTML extraction method or the longest-text-string extraction method can be used to obtain the body text of the result page; the relevance between the article and the search keyword is then calculated from that text.
The DOM tree based HTML extraction method first converts the HTML text into a DOM tree, then analyzes the nodes of the tree to extract the content that belongs to the body text and remove irrelevant information such as page noise and HTML tags. The key difficulty of this method is repairing the DOM tree correctly when it is incomplete.
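A simplified sketch of tag-aware text extraction is given below. It uses Python's standard `html.parser`, which is an event-based parser rather than a full DOM builder, so it only approximates the node analysis described here; the set of tags treated as noise (`NOISE_TAGS`) is an assumption made for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping subtrees that are usually page noise."""
    NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}  # assumed noise set

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a noise subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = ("<html><head><style>p{color:red}</style></head><body>"
        "<nav>Home | About</nav><p>Search engines <b>rank</b> pages.</p>"
        "<script>track();</script></body></html>")
body_text = extract_text(page)
```

Note that an unclosed noise tag would cause this sketch to skip the rest of the page, which is one concrete instance of the tree repair problem mentioned above.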
The longest-text-string extraction method is suited to pages whose body contains a long passage of text. The longest text string is found in the HTML content and expanded backward and forward until a threshold is reached; the result is then truncated and extracted to obtain the body text.
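One possible reading of the longest-text-string method is sketched below. The details are assumptions: text runs are obtained by splitting at tags, expansion alternates between the previous and next run, and the threshold is a character count (`threshold`, a hypothetical parameter).

```python
import re

def longest_string_extract(html: str, threshold: int = 200) -> str:
    # Drop script/style bodies, then split the page into text runs at tags.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    runs = [r.strip() for r in re.split(r"<[^>]+>", html)]
    runs = [r for r in runs if r]
    if not runs:
        return ""
    # Seed with the longest run, assumed to lie inside the body text.
    seed = max(range(len(runs)), key=lambda k: len(runs[k]))
    lo = hi = seed
    length = len(runs[seed])
    # Expand backward and forward until the threshold is reached.
    while length < threshold and (lo > 0 or hi < len(runs) - 1):
        if lo > 0:
            lo -= 1
            length += len(runs[lo]) + 1
        if length < threshold and hi < len(runs) - 1:
            hi += 1
            length += len(runs[hi]) + 1
    # Truncate to the threshold and return the extracted text.
    return " ".join(runs[lo:hi + 1])[:threshold]

sample = ("<div>Buy now</div><p>Intro line.</p>"
          "<p>This is the long central paragraph of the article body.</p>"
          "<p>Closing line.</p><div>Footer</div>")
text = longest_string_extract(sample, threshold=80)
```

With a threshold of 80 characters, the expansion stops after taking one neighbor on each side of the longest run, so the advertisement and footer runs at the edges are never reached.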
4) Computing relevance
The relevance calculation is the key link in the process of the invention. Steps 2 and 3 exist to prepare for it, and the final evaluation result is correct only if the relevance between each search result and the search keyword is calculated correctly.
The choice of relevance rule also depends on the vertical search: web search emphasizes how well the content matches, news search must consider both content match and time, and video search emphasizes titles, comments, and the like.
In the invention, the relevance algorithm can be adjusted flexibly. The weights used in the relevance calculation can be tuned dynamically by machine learning, using a small set of manually evaluated results as samples, or an established relevance algorithm can be adopted directly.
For example, in a news search test, a word frequency ratio method is used to calculate the relevance of the plain text. The specific algorithm is: relevance = the proportion of word frequency in the document - the proportion of word frequency in all captured results, that is:
where
a factor of 3 is applied to balance the weight against P(D);
in the formula, n is the number of words after word segmentation, N(i) is the number of occurrences of word i, L(i) is the length of word i, and L(T) is the length of the full text;
where T(i) is the number of times that word i appears in all search results of all search engines;
the time relevance takes the form of an inverse curve:
where T(n) is the current time, T(t) is the publication time, and the numerator W is a weight used to balance P(M) and P(T);
the final relevance is calculated as the harmonic mean of the two,
which gives more weight to the lower of the two scores and makes the result more realistic.
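Since the formulas themselves are not reproduced in this text, the following is only a rough sketch of how a content score and a time score might be combined by a harmonic mean. Everything concrete in it is an assumption: the content score is taken as the share of the text occupied by the query words (the sum of N(i) × L(i) over L(T)), the time score is taken as W / (W + age), W defaults to 24 hours, and the corpus-wide term based on T(i) is left out because its exact role cannot be recovered from the description.

```python
def content_score(text: str, query_words: list) -> float:
    """Assumed form: share of the text occupied by query words, sum N(i)*L(i) / L(T)."""
    if not text:
        return 0.0
    covered = sum(text.count(w) * len(w) for w in query_words if w)
    return min(covered / len(text), 1.0)

def time_score(now: float, published: float, w: float = 24.0) -> float:
    """Assumed inverse curve: 1.0 for a fresh page, decaying with age (in hours)."""
    age = max(now - published, 0.0)
    return w / (w + age)

def harmonic_mean(a: float, b: float) -> float:
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def relevance(text: str, query_words: list, now: float, published: float) -> float:
    return harmonic_mean(content_score(text, query_words),
                         time_score(now, published))

# Same text, published at the current time versus 72 hours earlier.
fresh = relevance("rain rain today", ["rain"], now=100.0, published=100.0)
stale = relevance("rain rain today", ["rain"], now=100.0, published=28.0)
```

The harmonic mean is dominated by the smaller input: for scores of 0.9 and 0.1 it gives 0.18, whereas the arithmetic mean would give 0.5, which is why a page that is strong on only one of the two criteria ends up with a low combined score.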
5) Integrate and sort by relevance
Step 4 yields the relevance of each result document. All result documents returned for a single search keyword by all search engines are then pooled and sorted by relevance, and the sorted list is divided evenly into three categories: good, medium, and poor (more categories can be used if required; the division is automatic). The categories are assigned relevance coefficient scores from 3 down to 1 (from N down to 1 when there are N categories), and these scores are passed to the DCG formula to calculate the final DCG score.
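The pooling and grading step can be sketched as below. The dictionary keys (`relevance`, `rel`, `engine`) and the function name `assign_grades` are hypothetical; the only behavior taken from the description is sorting by relevance, splitting evenly into N grades, and scoring the grades from N down to 1.

```python
def assign_grades(docs, levels=3):
    """Sort pooled documents by relevance and split them evenly into `levels` grades.

    Each document gets `rel` = levels for the top grade down to 1 for the bottom.
    """
    ranked = sorted(docs, key=lambda d: d["relevance"], reverse=True)
    n = len(ranked)
    for pos, doc in enumerate(ranked):
        grade_index = pos * levels // n   # 0 = best grade
        doc["rel"] = levels - grade_index
    return ranked

# Made-up relevance values for one keyword, pooled from two engines.
pooled = [
    {"engine": "engine_a", "relevance": 0.9},
    {"engine": "engine_b", "relevance": 0.1},
    {"engine": "engine_a", "relevance": 0.5},
    {"engine": "engine_b", "relevance": 0.8},
    {"engine": "engine_a", "relevance": 0.2},
    {"engine": "engine_b", "relevance": 0.4},
]
graded = assign_grades(pooled)
```

Grading on the pooled list rather than per engine is what makes the scores comparable across engines: a document only earns the top score if it is among the most relevant results returned by any engine for that keyword.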
6) Calculating DCG
DCG is a measure for evaluating ranking quality: when highly relevant documents are placed near the top of the result page the score is high, and when poorly relevant documents are placed near the top the score is low. The DCG calculation formula for s documents is as follows:
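The formula image is not reproduced in this text. A form consistent with the symbols s, i, and rel_i defined for it, and with the 1/log2 i rank discount mentioned in the cases that follow, is the classic DCG definition. It is given here as an assumption rather than as the patent's exact formula:

```latex
\mathrm{DCG}_s = rel_1 + \sum_{i=2}^{s} \frac{rel_i}{\log_2 i}
```

The first result is left undiscounted because \log_2 1 = 0, and every later result is divided by the logarithm of its rank, so a relevant document contributes less the further down it appears.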
Step 5 has sorted the search results for a single search keyword and assigned each document a relevance coefficient score, which is rel_i in the formula. All results for the keyword are then grouped by search engine. Within each group, the formula is applied to the results according to their rank i in that search engine to obtain the keyword's total DCG score for that engine; processing every group gives the keyword's DCG score on each search engine.
In the calculation process of the DCG, there are the following cases:
1. The results of search engine A are generally more relevant than those of search engine B, but A orders them less well than B. Because rel_A is generally higher than rel_B, the DCG of A is higher than that of B, which is logical.
2. The results of search engine A are about as relevant as those of search engine B, but A orders them better. The high rel scores that B ranks later are pulled down by the rank discount 1/log2 i, so the overall DCG of B is lower than that of A, which is logical.
3. The results of search engine A are more relevant than those of search engine B and also better ordered, so the DCG of A is higher than that of B, which is logical.
These 3 cases show that the DCG result can serve as the criterion for evaluating search engine results in the implementation of the invention.
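Case 2 can be checked numerically with a short sketch. It assumes the classic DCG form, in which the first result is undiscounted and each later result is divided by log2 of its rank; the relevance lists are made up for illustration.

```python
import math

def dcg(rels):
    """DCG of a ranked list of relevance scores (classic form, assumed here)."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels, start=1))

# Case 2: the same relevance scores in a different order.
engine_a = [3, 3, 2, 2, 1, 1]   # best documents first
engine_b = [1, 1, 2, 2, 3, 3]   # best documents last
score_a, score_b = dcg(engine_a), dcg(engine_b)
```

Both engines hold the same set of scores, yet `score_a` comes out higher than `score_b` because B's high-scoring documents sit at ranks 5 and 6, where the discount is largest.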
7) Sorting according to DCG results and summarizing evaluation results
The results obtained in step 6 are sorted and analyzed in detail to produce various outputs, such as a ranking by average DCG score, a ranking by total DCG score, and a ranking by the number of good or bad search results across all keywords. A report is generated so that the results can be compared and inspected visually.
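A minimal sketch of the summary step follows. The input layout (keyword mapped to per-engine DCG), the function name `summarize`, and the interpretation of "good or bad results" as a count of keywords on which an engine scored highest are all assumptions; the sketch also assumes every engine has a score for every keyword.

```python
from collections import defaultdict

def summarize(dcg_scores):
    """dcg_scores maps keyword -> {engine: DCG}. Returns three rankings."""
    total = defaultdict(float)
    wins = defaultdict(int)
    for per_engine in dcg_scores.values():
        for engine, score in per_engine.items():
            total[engine] += score
        # Count the keyword as a win for the engine with the highest DCG.
        wins[max(per_engine, key=per_engine.get)] += 1
    n_keywords = len(dcg_scores)
    average = {e: t / n_keywords for e, t in total.items()}

    def rank(table):
        return sorted(table, key=table.get, reverse=True)

    return {"by_total": rank(total), "by_average": rank(average),
            "by_wins": rank(wins), "average": average}

scores = {  # made-up DCG values for three keywords and two engines
    "keyword 1": {"engine_a": 9.1, "engine_b": 6.7},
    "keyword 2": {"engine_a": 5.0, "engine_b": 5.5},
    "keyword 3": {"engine_a": 4.0, "engine_b": 3.0},
}
report = summarize(scores)
```

Reporting the win count alongside the total and average helps separate an engine that is consistently a little better from one that is much better on a few keywords only.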
The method of the invention obtains evaluation results simply and quickly and avoids the large cost in time and labor of manual evaluation. In a test on news search, one of the vertical searches, 3000 hot news words and 4 search engines (Baidu, Sogou, Zhongsou, and Yahoo) were selected; Google was not included as an evaluation target because of problems such as frequent blocking. With 40 search results taken from each search engine, the evaluation took about 2 hours, the bottleneck being web page capture. Compared with the results of manual evaluation, the evaluation results of the invention differed by no more than 5%.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (6)

1. A multi-search-engine automatic comparison and evaluation method independent of a document library is characterized by comprising the following steps:
A. selecting evaluation words;
B. capturing search results and storing them as documents;
C. extracting the document text;
D. calculating relevance;
E. integrating the documents and sorting them by relevance;
F. calculating DCG;
G. ranking according to the DCG results and summarizing the evaluation results;
the evaluation words are high-frequency words in 3000 selected search engine results;
the relevance calculation method comprises a word frequency ratio method, expressed as: relevance = the proportion of word frequency in the document - the proportion of word frequency in all the captured results;
the text extraction method comprises: an HTML extraction method based on the DOM tree, and a longest-text-string extraction method;
the HTML extraction method based on the DOM tree comprises: converting the HTML text into a DOM tree, then analyzing the nodes of the DOM tree to extract the content that belongs to the body text, thereby removing irrelevant information from the page; the irrelevant information includes page noise and HTML tags;
the longest-text-string extraction method comprises: finding the longest text string in the HTML page content, expanding it backward and forward until a threshold is reached, and then truncating and extracting it to obtain the body text.
2. An evaluation method according to claim 1, wherein the evaluation word comprises: a page search keyword in a web search, a movie name or an actor name in a video search.
3. The evaluation method according to claim 1, wherein the capturing comprises two capture passes;
the first capture comprises: generating the search result link for each search engine from the keyword, capturing the result page, using a template to extract the relevant information of each result and the link to each result's detail page, and storing that information and those links; the template is a regular expression that includes the search condition;
the second capture comprises: capturing the corresponding pages through the detail-page links obtained in the first capture, and storing each page as a document in order.
4. The evaluation method according to claim 1, wherein the sorting by relevance comprises: dividing the documents evenly into several grades and setting a corresponding relevance coefficient score for each grade.
5. The evaluation method according to claim 1, wherein the DCG is calculated by the formula:
where s is the total number of documents, i is the ordinal position of the document, and rel_i is the relevance coefficient score of the grade in which the document falls.
6. The evaluation method according to claim 1, wherein the calculation results obtained in step F are sorted and analyzed to obtain various output results, and a report is generated; the output results include: a ranking by average DCG score, a ranking by total DCG score, and a ranking by the number of good and bad search results across all keywords.
CN201310538069.5A 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library Expired - Fee Related CN103544307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310538069.5A CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310538069.5A CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Publications (2)

Publication Number Publication Date
CN103544307A (en) 2014-01-29
CN103544307B (en) 2017-08-08

Family

ID=49967759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310538069.5A Expired - Fee Related CN103544307B (en) 2013-11-04 2013-11-04 A kind of multiple search engine automation contrast evaluating method independent of document library

Country Status (1)

Country Link
CN (1) CN103544307B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808601B (en) * 2014-12-31 2019-07-23 北京奇虎科技有限公司 Assessment search engine resource includes the calculation method and device of loss
CN104699830B (en) * 2015-03-30 2017-04-12 北京奇虎科技有限公司 Method and device for evaluating search engine ordering algorithm effectiveness
CN104699825B (en) * 2015-03-30 2016-10-05 北京奇虎科技有限公司 The balancing method of Performance of Search Engine and device
CN106227762B (en) * 2016-07-15 2019-06-28 苏群 A kind of method for vertical search and system based on user's assistance
CN107704467B (en) * 2016-08-09 2021-08-24 百度在线网络技术(北京)有限公司 Search quality evaluation method and device
CN106776299A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 Search engine test device and method
WO2018187949A1 (en) * 2017-04-12 2018-10-18 邹霞 Perspective analysis method for machine learning model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079033A (en) * 2006-06-30 2007-11-28 腾讯科技(深圳)有限公司 Integrative searching result sequencing system and method
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720870B2 (en) * 2007-12-18 2010-05-18 Yahoo! Inc. Method and system for quantifying the quality of search results based on cohesion

Also Published As

Publication number Publication date
CN103544307A (en) 2014-01-29


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20170427

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Applicant after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY Co.,Ltd.

Address before: Shou Heng Technology Building No. 51 Beijing 100191 Haidian District Xueyuan Road room 0902

Applicant before: BEIJING ZHONGSOU NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170808

Termination date: 20211104