CN103310014A - Method for improving accuracy of search result

Info

Publication number: CN103310014A
Application number: CN2013102760404A
Authority: CN
Inventors: 王宝会; 王洪军
Original assignee: Beihang University
Current assignee: Beijing easy to use Lianyou Technology Co.,Ltd.
Priority date: 2013-07-02
Filing date: 2013-07-02
Publication date: 2013-09-18
Anticipated expiration: 2033-07-02
Also published as: CN103310014B

Abstract

The invention relates to a method for improving the accuracy of search results, comprising the following steps: (1) classifying HTML (Hypertext Markup Language) tags and setting a weighting coefficient for each class of tag; (2) structuring the content of each HTML page according to the classification in step (1) to form structured data, and generating index data for each class of tag; (3) using the index data generated in step (2) and the weighting coefficients set in step (1), calculating the relevance of the match between a search term and an HTML page with a weighting algorithm, and calculating, according to the relevance, the frequency of occurrence of the tags in the HTML page. The method improves search accuracy within HTML pages and achieves a higher degree of match than comparable approaches under the same conditions, and it solves the problem of inaccurate, low-speed search when an HTML page contains many types of tags.

Description

A method for improving the accuracy of search results

Technical field

The present invention relates to the fields of full-text search and text mining, and in particular to optimizing search results for web pages in full-text search and to analyzing web page content.

Background art

In full-text search, the relevance between a search term and a document is generally calculated with the TF-IDF algorithm. TF-IDF is a statistical method for assessing how important a word is to one document in a document collection or corpus. The greater the importance, the more relevant the word is considered to be to that document, and the more relevant a document is, the higher it ranks in the final list of search results.

The theoretical basis of TF-IDF comes from Shannon's information theory. Its main idea is that the importance of a word or phrase is directly proportional to the frequency with which it occurs in a document (TF: Term Frequency), while the importance of the word or phrase to that document is inversely proportional to the frequency with which it occurs in other documents (IDF: Inverse Document Frequency); that is, TFIDF = TF × IDF.
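The plain TF-IDF score described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name `tf_idf`, the toy corpus, the base-10 logarithm, and the `1 +` smoothing in the denominator are assumptions (the last two are chosen to agree with the worked example later in the description).

```python
import math

def tf_idf(term, document, corpus):
    """Classic TF-IDF of `term` for one tokenized document in a corpus.

    `document` is a list of words; `corpus` is a list of such lists.
    """
    # TF: share of the document's words that are the term.
    tf = document.count(term) / len(document)
    # Number of documents in the corpus that contain the term.
    containing = sum(1 for doc in corpus if term in doc)
    # IDF: assumed form log10(D / (1 + d)), not stated in the background text.
    idf = math.log10(len(corpus) / (1 + containing))
    return tf * idf

# Hypothetical toy corpus of four tokenized documents.
corpus = [
    ["html", "search", "index"],
    ["html", "page", "title"],
    ["full", "text", "search"],
    ["tag", "weight", "table"],
]
score = tf_idf("index", corpus[0], corpus)
```

Note that every word position counts equally here, which is exactly the limitation the next paragraph raises for structured documents.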

For ordinary text documents, which have no distinctions of position or structure, this algorithm solves the relevance calculation well. For particular document types such as HTML pages, however, a feature word reflects the document content to different degrees depending on where it appears, so its weight in the relevance calculation should differ accordingly. The TF-IDF algorithm does not take the structural features of a document into account when calculating relevance.

Summary of the invention

Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a method for improving the accuracy of search results. The method improves search accuracy within HTML pages and achieves a higher degree of match than comparable approaches under the same conditions, and it solves the problem of inaccurate, low-speed search when an HTML page contains many types of tags.

Technical solution of the invention: a method for improving the accuracy of search results, implemented in the following steps:

1. Classify the HTML tags and set a weighting coefficient for each class of tag.

(11) Classify the tags according to the HTML standard;

(12) Set a weighting coefficient for each tag according to the meaning and importance of the tags classified in (11);

(13) Output the weighting results as (tag name, weight value) pairs.

Taking the classification of semantically meaningful tags under the HTML standard as an example, the HTML tags are classified as follows: h1, h2, em, caption, li, th, title, … A weighting coefficient is set for each class of tag; the more important the tag, the higher its coefficient. For example:

(title, 0.8), (h1, 0.7), (h2, 0.65), (h3, 0.6), (h4, 0.5), (li, 0.5), (h5, 0.45), (h6, 0.4), (th, 0.4), (caption, 0.4), (em, 0.3), (strong, 0.3), (b, 0.3), …

In addition, content in all other regions of the page has a weighting coefficient of 0.2.
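The weighting table of step 1 maps naturally onto a dictionary. The sketch below uses only the coefficients listed in the text; the names `TAG_WEIGHTS`, `DEFAULT_WEIGHT`, `weight_for`, and `weighted_pairs` are illustrative assumptions, and the source list is open-ended, so a real table may contain more tags.

```python
# Coefficients exactly as listed in the text; the source list ends with "…",
# so further tags may exist that are not shown here.
TAG_WEIGHTS = {
    "title": 0.8, "h1": 0.7, "h2": 0.65, "h3": 0.6, "h4": 0.5, "li": 0.5,
    "h5": 0.45, "h6": 0.4, "th": 0.4, "caption": 0.4,
    "em": 0.3, "strong": 0.3, "b": 0.3,
}
# Weight for content in any other region of the page.
DEFAULT_WEIGHT = 0.2

def weight_for(tag: str) -> float:
    """Return the weighting coefficient for a tag name (case-insensitive)."""
    return TAG_WEIGHTS.get(tag.lower(), DEFAULT_WEIGHT)

def weighted_pairs():
    """Step (13): the (tag name, weight value) pairs, highest weight first."""
    return sorted(TAG_WEIGHTS.items(), key=lambda pair: -pair[1])
```

Falling back to `DEFAULT_WEIGHT` keeps unlisted tags such as `p` or `div` searchable while ranking them below every listed tag.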

2. Structure the HTML page content according to the above classification to form structured data, and generate index data for each class of tag (see Fig. 1).

The specific steps for structuring the HTML page content and generating the index data are:

2.1: Parse the HTML page content and, according to the tag classification set in step 1, convert it into structured data. The structured data is represented as a two-dimensional table whose columns are the tag classes; each HTML page is converted into one record (row) of the table. If a page contains several pieces of data for the same class of tag, they are merged into a single field and delimited with a separator. For example, if a page contains several h2 tags, their contents are merged into one value, delimited with a specific separator such as "^", and stored in the h2 field of that record.

2.2: According to the tag classification, build an index over the rows of the structured data, field by field; for example, build an h1 field index, an h2 field index, a full-text index, and so on, over all the data. The index data of each field is called a field index database. For each word, it records which records contain the word in that field, how many times the word occurs there, and at what positions; it also records the number of records in which each word occurs.
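Steps 2.1 and 2.2 can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the patent's implementation: the names `FieldExtractor`, `to_record`, `build_field_index`, and the `fulltext` field are hypothetical, words are split on whitespace, and text is attributed only to the innermost tracked tag that is open.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tag classes taken from the weighting list in step 1.
FIELD_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "th", "caption", "em", "strong", "b"}
SEPARATOR = "^"  # the example separator named in step 2.1

class FieldExtractor(HTMLParser):
    """Collects the text of tracked tags plus the full text of the page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []              # stack of currently open tracked tags
        self.fields = defaultdict(list)  # tag name -> list of text pieces
        self.full_text = []

    def handle_starttag(self, tag, attrs):
        if tag in FIELD_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened tag with this name.
            last = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[last]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.full_text.append(text)
        if self.open_tags:
            self.fields[self.open_tags[-1]].append(text)

def to_record(html: str) -> dict:
    """Step 2.1: convert one HTML page into one record of the table."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    record = {tag: SEPARATOR.join(parts) for tag, parts in parser.fields.items()}
    record["fulltext"] = " ".join(parser.full_text)
    return record

def build_field_index(records, field):
    """Step 2.2: map each word to {record number: [positions]} for one field."""
    index = defaultdict(dict)
    for number, record in enumerate(records, start=1):
        words = record.get(field, "").replace(SEPARATOR, " ").split()
        for position, word in enumerate(words, start=1):
            index[word].setdefault(number, []).append(position)
    return index
```

With this layout, the number of occurrences of a word in a record's field is `len(index[word][number])`, and the number of records containing the word is `len(index[word])`, which are the two counts the relevance calculation needs.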

3. At search time, calculate the relevance between the search terms and each HTML page according to the weighting algorithm.

The steps for calculating the relevance are as follows:

3.1 Suppose the user enters several search terms in one query. First, the set of hit records is obtained by searching the full-text field; then the relevance between each search term and each hit record is calculated, as follows.

3.2 First calculate the relevance between each search term and one field of a record, as follows.

3.3 Look up in the index data the number of times the word occurs in a given field of the record, denoted n_{i}, and the total number of words in that field of the record, denoted N_{i}; then calculate the TF value by the formula

{TF}_{i} = \frac{n_{i}}{N_{i}}

3.4 Look up in the index data the number of records in which the word occurs, denoted d_{j}, and denote the total number of records as D; then calculate the IDF value by the formula

{IDF}_{j} = \log \frac{D}{1 + d_{j}}

3.5 Look up the weighting coefficient set for this field in step 1, denoted W_{k}; then calculate the relevance {TIW}_{x} between the word and the field of the record by the formula

{TIW}_{x} = {TF}_{i} × {IDF}_{j} × W_{k}

The relevance between all search terms and one record is then calculated according to the formula

G = Σ_{y = 1}^{n} (\frac{Σ_{x = 1}^{m} {TIW}_{x, y}}{m})

where x indexes the fields that a search term hits in the record and ranges from 1 to m, and y indexes the search terms entered in one query and ranges from 1 to n. The calculation is divided into the following two steps.

3.6 Repeat steps 3.3 to 3.5 to calculate the relevance between the search term and each of the other hit fields, then calculate the relevance between the search term and the record by the formula

{GC}_{y} = \frac{Σ_{x = 1}^{m} {TIW}_{x, y}}{m}

3.7 For each search term entered by the user, repeat step 3.6 to calculate the relevance {GC}_{y} between each search term and each hit record, then calculate the relevance between each hit record and the user's input by the formula

G = Σ_{y = 1}^{n} {GC}_{y}
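Steps 3.3 to 3.7 can be sketched as three small functions. This is a minimal sketch under assumptions: the function names are hypothetical, the base-10 logarithm and the `1 + d_j` denominator are inferred from the worked example in the embodiment (IDF = log(100/(1+3)) = 1.398), and a term that hits no field is assumed to contribute 0.

```python
import math

def tiw(n_i, N_i, d_j, D, w_k):
    """Steps 3.3-3.5: weighted relevance of one word to one field of a record.

    n_i: occurrences of the word in the field; N_i: total words in the field;
    d_j: records containing the word; D: total records; w_k: field weight.
    """
    tf = n_i / N_i
    idf = math.log10(D / (1 + d_j))  # assumed base-10, per the worked example
    return tf * idf * w_k

def term_relevance(field_scores):
    """Step 3.6: GC_y, the mean TIW over the m fields that the term hits."""
    if not field_scores:
        return 0.0  # assumption: a term with no hit field contributes nothing
    return sum(field_scores) / len(field_scores)

def record_relevance(per_term_field_scores):
    """Step 3.7: G, the sum of GC_y over all n search terms."""
    return sum(term_relevance(scores) for scores in per_term_field_scores)
```

For example, `record_relevance([[0.4, 0.2], [0.1]])` averages the first term's two field scores to 0.3 and adds the second term's 0.1, giving approximately 0.4.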

The advantage of the invention is that, for HTML documents, the relevance between the user's search input and each hit record is calculated with a weighted relevance algorithm based on HTML tags, so that the relevance value reflects the structural features of the HTML document. Sorting the search results by this relevance before presenting them gives the user a better experience.

Description of drawings

Fig. 1 is a flowchart of HTML page structuring and index building;

Fig. 2 is a flowchart of the implementation of the present invention.

Embodiment

Embodiments of the present invention are described in detail below with reference to a concrete example.

HyperText Markup Language (HTML) is a markup language designed for "creating web pages and other information that can be viewed in a web browser". HTML is used to structure information, such as headings, paragraphs, and lists, and can also be used to describe, to some extent, the appearance and semantics of a document. It was created by Tim Berners-Lee in 1982; the IETF developed it further using a simplified SGML (Standard Generalized Markup Language) syntax, and it later became an international standard maintained by the World Wide Web Consortium (W3C). The W3C currently recommends writing web pages to the XHTML 1.1, XHTML 1.0, or HTML 4.01 standards, but many web pages (such as Google) have already switched to the newer HTML5.

Analyzing the HTML standard yields the complete list of HTML tags, shown in the following table:

Tag	Description	DTD
<!--...-->	Defines a comment.	STF
<!DOCTYPE>	Defines the document type.	STF
<a>	Defines an anchor.	STF
<abbr>	Defines an abbreviation.	STF
<acronym>	Defines an acronym.	STF
<address>	Defines contact information for the author or owner of a document.	STF
<applet>	Deprecated. Defines an embedded applet.	TF
<area>	Defines an area inside an image map.	STF
<b>	Defines bold text.	STF
<base>	Defines a default address or a default target for all links on a page.	STF
<basefont>	Deprecated. Defines a default font, color, or size for the text in a page.	TF
<bdo>	Defines the text direction.	STF
<big>	Defines big text.	STF
<blockquote>	Defines a long quotation.	STF
<body>	Defines the body of the document.	STF
<br>	Defines a single line break.	STF
<button>	Defines a push button.	STF
<caption>	Defines a table caption.	STF
<center>	Deprecated. Defines centered text.	TF
<cite>	Defines a citation.	STF
<code>	Defines computer code text.	STF
<col>	Defines attribute values for one or more columns in a table.	STF
<colgroup>	Defines a group of columns in a table for formatting.	STF
<dd>	Defines the description of an item in a definition list.	STF
<del>	Defines deleted text.	STF
<dir>	Deprecated. Defines a directory list.	TF
<div>	Defines a section in a document.	STF
<dfn>	Defines a definition term.	STF
<dl>	Defines a definition list.	STF
<dt>	Defines an item in a definition list.	STF
<em>	Defines emphasized text.	STF
<fieldset>	Defines a border around elements in a form.	STF
<font>	Deprecated. Defines the font, size, and color of text.	TF
<form>	Defines an HTML form for user input.	STF
<frame>	Defines a window (a frame) in a frameset.	F
<frameset>	Defines a set of frames.	F
<h1> to <h6>	Defines HTML headings.	STF
<head>	Defines information about the document.	STF
<hr>	Defines a horizontal line.	STF
<html>	Defines an HTML document.	STF
<i>	Defines italic text.	STF
<iframe>	Defines an inline frame.	TF
<img>	Defines an image.	STF
<input>	Defines an input control.	STF
<ins>	Defines inserted text.	STF
<isindex>	Deprecated. Defines a searchable index related to a document.	TF
<kbd>	Defines keyboard text.	STF
<label>	Defines a label for an input element.	STF
<legend>	Defines a caption for a fieldset element.	STF
<li>	Defines a list item.	STF
<link>	Defines the relationship between a document and an external resource.	STF
<map>	Defines an image map.	STF
<menu>	Deprecated. Defines a menu list.	TF
<meta>	Defines meta information about an HTML document.	STF
<noframes>	Defines alternate content for users that do not support frames.	TF
<noscript>	Defines alternate content for users that do not support client-side scripts.	STF
<object>	Defines an embedded object.	STF
<ol>	Defines an ordered list.	STF
<optgroup>	Defines a group of related options in a select list.	STF
<option>	Defines an option in a select list.	STF
<p>	Defines a paragraph.	STF
<param>	Defines a parameter for an object.	STF
<pre>	Defines preformatted text.	STF
<q>	Defines a short quotation.	STF
<s>	Deprecated. Defines strikethrough text.	TF
<samp>	Defines sample computer code.	STF
<script>	Defines a client-side script.	STF
<select>	Defines a select list (drop-down list).	STF
<small>	Defines small text.	STF
<span>	Defines a section in a document.	STF
<strike>	Deprecated. Defines strikethrough text.	TF
<strong>	Defines strongly emphasized text.	STF
<style>	Defines style information for a document.	STF
<sub>	Defines subscript text.	STF
<sup>	Defines superscript text.	STF
<table>	Defines a table.	STF
<tbody>	Defines the body content of a table.	STF
<td>	Defines a cell in a table.	STF
<textarea>	Defines a multi-line text input control.	STF
<tfoot>	Defines the footer content of a table.	STF
<th>	Defines a header cell in a table.	STF
<thead>	Defines the header content of a table.	STF
<title>	Defines the title of a document.	STF
<tr>	Defines a row in a table.	STF
<tt>	Defines typewriter text.	STF
<u>	Deprecated. Defines underlined text.	TF
<ul>	Defines an unordered list.	STF
<var>	Defines a variable part of a text.	STF
<xmp>	Deprecated. Defines preformatted text.	?

DTD: indicates in which XHTML 1.0 DTD the tag is allowed. S = Strict, T = Transitional, F = Frameset.

As shown in Figure 2, the present invention is implemented as follows:

Step 1: Classify the HTML tags in the table above and set a weighting coefficient for each class of tag. The classification is as follows (tag name, weighting coefficient): (title, 0.8), (h1, 0.7), (h2, 0.65), (h3, 0.6), (h4, 0.5), (li, 0.5), (h5, 0.45), (h6, 0.4), (th, 0.4), (caption, 0.4), (em, 0.3), (strong, 0.3), (b, 0.3), … In addition, content in all other regions of the page has a weighting coefficient of 0.2.

Step 2.1: When structuring the HTML page content, store the page content in a two-dimensional data table with the storage structure shown in the following table:

Step 2.2: Build index databases with the tag as the unit. Taking the table above as an example, the following index databases are built:

The title index database:

Diaoyudaoite: {3,100}; (1,1,1); (2,1,2); (3,1,1)

First: {20,100}; (2,1,11); (3,1,6)

Put on display: {5,100}; (3,1,8)

…

The h1 index database:

Diaoyudaoite: {3,100}; (1,1,1); (2,1,2); (3,1,1)

First: {20,100}; (2,1,11); (3,1,6)

Put on display: {5,100}; (3,1,8)

…

The h2 index database:

First: {10,100}; (1,1,13)

Put on display: {2,100}; (1,1,15)

…

The index databases of the other fields

The full-text field index database.

Each record in an index database has three parts: (1) the word or phrase; (2) the IDF data inside the braces, which are the number of records containing the word and the total number of records; (3) the TF data inside the parentheses, one tuple per record, whose values are the record number, the number of occurrences, and the position of the first occurrence.
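The three-part record layout can be modeled with two small data classes. The sketch below parses the textual form shown in the listings above; the class and function names (`Posting`, `IndexEntry`, `parse_entry`) are illustrative assumptions, and the patent does not state how index records are actually serialized.

```python
import re
from dataclasses import dataclass

@dataclass
class Posting:
    """One parenthesized tuple: where and how often a word occurs in a record."""
    record_number: int
    occurrences: int
    first_position: int

@dataclass
class IndexEntry:
    """One line of a field index database."""
    term: str
    record_count: int    # records containing the term (first value in braces)
    total_records: int   # total number of records (second value in braces)
    postings: list

def parse_entry(line: str) -> IndexEntry:
    """Parse a line such as 'First: {20,100}; (2,1,11); (3,1,6)'."""
    term, rest = line.split(":", 1)
    counts = re.search(r"\{\s*(\d+)\s*,\s*(\d+)\s*\}", rest)
    postings = [
        Posting(*map(int, match))
        for match in re.findall(r"\((\d+),\s*(\d+),\s*(\d+)\)", rest)
    ]
    return IndexEntry(term.strip(), int(counts.group(1)),
                      int(counts.group(2)), postings)

entry = parse_entry("Diaoyudaoite: {3,100}; (1,1,1); (2,1,2); (3,1,1)")
```

Here `entry.record_count` and `entry.total_records` supply d_j and D for the IDF value, and each `Posting` supplies the occurrence count n_i for the TF value of one record.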

Step 3.1: Suppose the user's search input is "Diaoyudaoite is put on display first", which is split into the search terms "Diaoyudaoite", "first", and "put on display".

Step 3.2: Searching the full-text field yields the set of hit records; the relevance between each word and each hit field of each hit record is then calculated in turn. Take the relevance between the word "Diaoyudaoite" and the title field of the first record as an example:

Step 3.3: First calculate the TF value. From the index, TF = 1 (occurrences) / 3 (total words) = 0.333.

Step 3.4: Then calculate the IDF value. From the index, IDF = log(100 / (1 + 3)) = 1.398, where log is the base-10 logarithm.

Step 3.5: Calculate the weighted relevance of this field: TIW(title) = 0.333 × 1.398 × 0.8 = 0.3724.

Step 3.6: Repeat steps 3.3 to 3.5 to calculate the relevance between "Diaoyudaoite" and the other fields of the first record: TIW(h1) = 1.0 × log(100 / (1 + 3)) × 0.7 = 0.9786; TIW(h2) = 0. Finally, calculate the overall relevance between "Diaoyudaoite" and the first record: GC = (TIW(title) + TIW(h1)) / 2 = (0.3724 + 0.9786) / 2 = 0.6755.

Step 3.7: Repeat steps 3.3 to 3.6 to calculate the relevance of "put on display" and of "first" to the first record: 0.0079 and 0.0123, respectively. Finally, calculate the relevance between the user's input and the first record: 0.6755 + 0.0079 + 0.0123 ≈ 0.6958.

Step 3.8: Repeat steps 3.3 to 3.7 to calculate the relevance between the user's input and the second and third records, which is 0.7958 and 0.8741, respectively. The final ordering of the search results is therefore (3, 2, 1).
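The arithmetic of steps 3.3 to 3.8 can be re-traced in a few lines. This sketch only replays the numbers given in the text; the variable names are illustrative, and the per-term values 0.0079 and 0.0123 and the record scores 0.7958 and 0.8741 are taken as stated rather than recomputed. Because the text rounds TF to 0.333 before multiplying, the unrounded results differ from the printed ones in the fourth decimal place.

```python
import math

D = 100  # total number of records, from the {3,100} index entry
d = 3    # records containing "Diaoyudaoite"

# Step 3.4: IDF with a base-10 logarithm, about 1.398.
idf = math.log10(D / (1 + d))

# Steps 3.3 and 3.5: title field, TF = 1/3, weight 0.8.
tiw_title = (1 / 3) * idf * 0.8
# Step 3.6: h1 field, TF = 1.0, weight 0.7; the h2 field is not hit.
tiw_h1 = 1.0 * idf * 0.7
gc = (tiw_title + tiw_h1) / 2  # mean over the two hit fields

# Step 3.7: add the stated relevance of the other two search terms.
g1 = gc + 0.0079 + 0.0123

# Step 3.8: rank the three records by relevance, highest first.
scores = {1: g1, 2: 0.7958, 3: 0.8741}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting the record numbers by descending score reproduces the ordering (3, 2, 1) given in step 3.8.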

As can be seen, the present invention calculates the relevance between each relevant document and the user's input well, so that the search results reflect the structural features of HTML, and the user obtains the desired results more accurately and has a better experience.

The above are embodiments of the present invention and do not limit the scope of its claims. It should be understood that those of ordinary skill in the art may make improvements and changes without departing from the principle of the invention, for example applying it to other document types (including but not limited to pdf, doc/docx, xls/xlsx, and ppt/pptx documents), changing the classification of tags, or changing the weighting coefficient values of the tag classes; such improvements and changes are also regarded as falling within the scope of protection of the present invention.

Claims

1. A method for improving the accuracy of search results, characterized in that it is implemented in the following steps:

(1) classifying HTML tags and setting a weighting coefficient for each class of tag;

(2) structuring the HTML page content according to the classification of step (1) to form structured data, and generating index data for each class of tag;

(3) using the index data generated in step (2) and the weighting coefficients obtained in step (1), calculating the relevance of the match between a search term and an HTML page according to a weighting algorithm, and calculating, according to the relevance, the frequency of occurrence of the tag in the HTML page.

2. The method for improving the accuracy of search results according to claim 1, characterized in that the step of classifying HTML tags and setting a weighting coefficient for each class of tag in step (1) comprises:

(11) classifying the tags according to the HTML standard;

(12) setting a weighting coefficient for each tag according to the meaning and importance of the tags classified in step (11).

3. The method for improving the accuracy of search results according to claim 1, characterized in that the specific steps for structuring the HTML page content and forming the index data in step (2) are:

(21) parsing the HTML page content and, according to the tag classification set in step (1), converting the HTML page content into structured data; the structured data is represented as a two-dimensional table whose columns are the tag classes, and each HTML page is converted into one record of the table; if an HTML page contains several pieces of data for the same class of tag, they are merged into a single field and delimited with a separator;

(22) according to the tag classification, building an index over the rows of the structured data, field by field; the index data of each field is called a field index database, which records, for each word, which records contain the word in that field, how many times the word occurs there, and at what positions; it also records the number of records in which each word occurs.

4. The method for improving the accuracy of search results according to claim 1, characterized in that the relevance calculation in step (3) is:

G = Σ_{y = 1}^{n} (\frac{Σ_{x = 1}^{m} {TIW}_{x, y}}{m})

where {TIW}_{x, y} is the relevance between a search term and one field of a record; x indexes the fields that the search term hits in the record and ranges from 1 to m; and y indexes the search terms entered in one query and ranges from 1 to n.