CN103310014A - Method for improving accuracy of search result - Google Patents

Method for improving accuracy of search result

Info

Publication number
CN103310014A
CN103310014A CN2013102760404A CN201310276040A CN103310014A CN 103310014 A CN103310014 A CN 103310014A CN 2013102760404 A CN2013102760404 A CN 2013102760404A CN 201310276040 A CN201310276040 A CN 201310276040A CN 103310014 A CN103310014 A CN 103310014A
Authority
CN
China
Prior art keywords
data
definition
html
stf
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102760404A
Other languages
Chinese (zh)
Other versions
CN103310014B (en)
Inventor
王宝会
王洪军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing easy to use Lianyou Technology Co.,Ltd.
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201310276040.4A priority Critical patent/CN103310014B/en
Publication of CN103310014A publication Critical patent/CN103310014A/en
Application granted granted Critical
Publication of CN103310014B publication Critical patent/CN103310014B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for improving the accuracy of search results, comprising the following steps: (1) classifying HTML (Hypertext Markup Language) tags and setting a weighting coefficient for each class of tag; (2) structuring the HTML page content according to the classification in step (1) to form structured data, and generating index data for each class of tag; (3) according to the index data generated in step (2) and the weighting coefficients obtained in step (1), calculating the relevance of the match between a search term and an HTML page with a weighting algorithm, and calculating, according to the relevance, the frequency of occurrence of the tags in the HTML page. The method improves search accuracy within HTML pages, achieves a high degree of matching with the accuracy of data under the same conditions, and solves the problem of inaccurate and slow search when an HTML page contains many kinds of tags.

Description

A method for improving the accuracy of search results
Technical field
The present invention relates to the fields of full-text search and text mining, and in particular to optimizing search results for web pages in full-text search and to analyzing web page content.
Background
In full-text search, the relevance between a search term and a document is generally computed with the TF-IDF algorithm. TF-IDF is a statistical method for assessing how important a word is to one document within a document collection or corpus. The more important the word, the more relevant it is considered to be to that document, and the more relevant a document is, the higher it ranks in the final result list.
The theoretical basis of TF-IDF derives from Shannon's information theory. Its main idea is that the importance of a word or phrase to a document is directly proportional to the frequency with which it occurs in that document (TF: term frequency) and inversely proportional to the frequency with which it occurs in other documents (IDF: inverse document frequency); that is, TF-IDF = TF × IDF.
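The plain TF-IDF computation described above can be sketched as follows. This is a minimal illustration rather than the patented method: the tokenized toy corpus is hypothetical, and the base-10 logarithm with a `1 +` term in the denominator is an assumption chosen to match the IDF arithmetic used later in this document's worked example.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens that equal `term`."""
    return doc_tokens.count(term) / len(doc_tokens) if doc_tokens else 0.0

def idf(term, corpus):
    """Inverse document frequency, smoothed with +1 in the denominator."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / (1 + containing))

def tf_idf(term, doc_tokens, corpus):
    """Importance of `term` to one document within the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus: each document is already split into words.
corpus = [
    ["island", "stone", "exhibition"],
    ["stone", "age", "tools"],
    ["weather", "report"],
    ["city", "news"],
]

score = tf_idf("island", corpus[0], corpus)
```

Note that the score depends only on counts; where the word appears in the document plays no role, which is the limitation the following paragraph points out.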
For ordinary text documents, which have no distinction of position or structure, this algorithm solves the relevance computation well. For documents of particular types such as HTML pages, however, a feature word reflects the document content to different degrees depending on where it appears, so its weight in the relevance computation should differ as well. The TF-IDF algorithm does not take the structural features of a document into account when computing relevance.
Summary of the invention
The technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a method for improving the accuracy of search results that improves retrieval accuracy within HTML pages and achieves a high degree of matching with the accuracy of data under the same conditions, solving the problem of inaccurate and slow search when an HTML page contains many kinds of tags.
Technical solution of the invention: a method for improving the accuracy of search results, implemented in the following steps:
1. Classify the HTML tags and set a weighting coefficient for each class of tag.
(11) Classify the tags according to the HTML standard;
(12) Assign a weighting coefficient to each tag according to the meaning and importance of the tags classified in (11);
(13) Output the tag weighting result in the form (tag name, weight).
Taking the classification of semantically significant tags under the HTML standard as an example, the HTML tags are classified as h1, h2, em, caption, li, th, title, ..., and a weighting coefficient is set for each class of tag; the more important the tag, the higher its coefficient. For example:
(title, 0.8), (h1, 0.7), (h2, 0.65), (h3, 0.6), (h4, 0.5), (li, 0.5), (h5, 0.45), (h6, 0.4), (th, 0.4), (caption, 0.4), (em, 0.3), (strong, 0.3), (b, 0.3) …
In addition, the weighting coefficient for content in the other areas of the page is 0.2.
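As a minimal sketch, the example weighting result of step 1 can be held in a lookup table. The coefficients are the ones listed above; the names `TAG_WEIGHTS`, `DEFAULT_WEIGHT`, `tag_weight` and `weighting_result` are hypothetical and chosen only for illustration.

```python
# Example coefficients from step 1, keyed by tag name.
TAG_WEIGHTS = {
    "title": 0.8, "h1": 0.7, "h2": 0.65, "h3": 0.6, "h4": 0.5, "li": 0.5,
    "h5": 0.45, "h6": 0.4, "th": 0.4, "caption": 0.4,
    "em": 0.3, "strong": 0.3, "b": 0.3,
}
DEFAULT_WEIGHT = 0.2  # content in the other areas of the page

def tag_weight(tag_name):
    """Return the weighting coefficient for a tag, case-insensitively."""
    return TAG_WEIGHTS.get(tag_name.lower(), DEFAULT_WEIGHT)

def weighting_result():
    """Step (13): (tag name, weight) pairs, most important tag first."""
    return sorted(TAG_WEIGHTS.items(), key=lambda item: -item[1])
```

Keeping the coefficients in data rather than in code makes it straightforward to change the classification or the weights, which the description explicitly allows.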
2. Structure the HTML page content according to the above classification to form structured data, and generate index data for each class of tag (Fig. 1).
The specific steps for structuring the HTML page content and generating the index data are:
2.1: Parse the HTML page content and convert it to structured data according to the tag classification set in step 1. The structured data is represented as a two-dimensional table whose columns are the tag classes; each HTML page is converted into one record (row) of the table. If an HTML page contains several pieces of data for the same tag class, they are merged into one field and delimited by a separator. For example, if a page contains several h2 tags, their contents are merged into a single value, delimited by a specific separator such as "^", and placed in the h2 field of the record.
2.2: According to the tag classification, build an index over the rows of the structured data field by field; for example, an h1 field index over all records, an h2 field index, a full-text index, and so on. The index data of each field is called a field index library. For each word, the library records which records contain the word in this field, how many times it occurs, and where it occurs; it also records the number of records in which each word occurs.
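Step 2.1 can be sketched with Python's standard `html.parser` module. This is an illustrative assumption, not the parser used by the invention: the class `FieldExtractor`, the function `html_to_record` and the field name `fulltext` are hypothetical, and the sketch does not filter out script or style content.

```python
from html.parser import HTMLParser

FIELDS = {"title", "h1", "h2", "h3", "h4", "h5", "h6",
          "li", "th", "caption", "em", "strong", "b"}
SEPARATOR = "^"  # delimiter for several values of the same tag class

class FieldExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # classified tags that are currently open
        self.values = {}     # tag name -> list of text values
        self.full_text = []  # every text fragment, for the full-text field

    def handle_starttag(self, tag, attrs):
        if tag in FIELDS:
            self.stack.append(tag)
            self.values.setdefault(tag, []).append("")

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag (tolerates unclosed tags).
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        self.full_text.append(text)
        for tag in set(self.stack):
            current = self.values[tag][-1]
            self.values[tag][-1] = (current + " " + text).strip()

def html_to_record(html):
    """Convert one HTML page into one record: field name -> merged text."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    record = {tag: SEPARATOR.join(v for v in parts if v)
              for tag, parts in parser.values.items()}
    record["fulltext"] = " ".join(parser.full_text)
    return record

page = ("<html><head><title>Stone exhibition</title></head><body>"
        "<h1>Stone</h1><h2>First day</h2><p>Opening remarks</p>"
        "<h2>Second day</h2></body></html>")
record = html_to_record(page)
```

In this example the two h2 headings end up in a single h2 field delimited by "^", as step 2.1 describes, while the paragraph text only contributes to the full-text field.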
3. At retrieval time, compute the relevance between the search terms and each HTML page using the weighting algorithm.
The relevance is computed as follows:
3.1 Suppose the user enters several search terms in one query. First, the set of hit records is obtained by searching the full-text field; then the relevance of each search term to each hit record is computed, as follows.
3.2 First compute the relevance of a search term to one field of a record, as follows.
3.3 Look up in the index data the number of times the word occurs in a given field of the record, denoted $n_i$, and compute the total number of words in that field of the record, denoted $N_i$. The TF value is then
$$TF_i = \frac{n_i}{N_i}$$
3.4 Look up in the index data the number of records in which the word occurs, denoted $d_j$, and denote the total number of records $D$. The IDF value is then
$$IDF_j = \log\frac{D}{1 + d_j}$$
where the logarithm is taken to base 10, as in the worked example of the embodiment.
3.5 Look up the weighting coefficient set for this field in step 1, denoted $W_k$. The relevance $TIW_x$ of the word to one field of a record is then
$$TIW_x = TF_i \times IDF_j \times W_k$$
The relevance of all search terms to a record is computed by
$$G = \sum_{y=1}^{n}\left(\frac{\sum_{x=1}^{m} TIW_{x,y}}{m}\right)$$
where x indexes the fields hit by a search term in the record and ranges from 1 to m, and y indexes the search terms entered in one query and ranges from 1 to n. The computation proceeds in the following two steps.
3.6 Repeat steps 3.3 to 3.5 to compute the relevance of the term to each of the other hit fields, then compute the relevance of the term to the record:
$$GC = \frac{\sum_{x=1}^{m} TIW_x}{m}$$
3.7 For each search term entered by the user, repeat step 3.6 to obtain the relevance $GC_y$ of each term to each hit record, then compute the relevance of each hit record to the user's input:
$$G = \sum_{y=1}^{n} GC_y$$
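Steps 3.3 to 3.7 can be sketched as three small functions. The data layout is an assumption made for illustration: `index` maps a field name to a mapping from word to {record number: occurrence count}, `field_lengths` gives the total word count of each field per record, and the toy values and function names are hypothetical. The base-10 logarithm follows the worked example of the embodiment.

```python
import math

def tiw(n_i, N_i, d_j, D, w_k):
    """Steps 3.3 to 3.5: TF * IDF * W for one word in one field of one record."""
    tf = n_i / N_i
    idf = math.log10(D / (1 + d_j))
    return tf * idf * w_k

def term_relevance(term, record_no, index, field_lengths, weights, total_records):
    """Step 3.6: GC, the mean TIW over the fields of the record hit by the term."""
    scores = []
    for field, postings in index.items():
        records = postings.get(term, {})
        if record_no in records:
            scores.append(tiw(records[record_no],
                              field_lengths[field][record_no],
                              len(records), total_records, weights[field]))
    return sum(scores) / len(scores) if scores else 0.0

def relevance(terms, record_no, index, field_lengths, weights, total_records):
    """Step 3.7: G, the sum of GC over all search terms."""
    return sum(term_relevance(t, record_no, index, field_lengths,
                              weights, total_records) for t in terms)

# Hypothetical toy index: "stone" occurs in the title of records 1 and 2
# and in the h1 of record 1, out of 10 records in total.
index = {"title": {"stone": {1: 1, 2: 1}}, "h1": {"stone": {1: 1}}}
field_lengths = {"title": {1: 2, 2: 4}, "h1": {1: 1}}
weights = {"title": 0.8, "h1": 0.7}

g = relevance(["stone", "missing"], 1, index, field_lengths, weights, 10)
```

A term that hits no field contributes 0, so only the fields actually hit enter the average, matching the definition of m as the number of hit fields.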
The advantage of the invention is that, for HTML documents specifically, a weighted relevance algorithm based on HTML tags is used when computing the relevance between the user's query and the hit records, so that the relevance value reflects the structural features of the HTML document. The search results are then sorted by relevance and presented to the user, giving the user a better experience.
Description of drawings
Fig. 1 is a process diagram of HTML page structuring and index building;
Fig. 2 is a flowchart of the implementation of the invention.
Embodiment
Embodiments of the invention are described in detail below with a concrete example.
HyperText Markup Language (HTML) is a markup language designed for creating web pages and other information that can be viewed in a web browser. HTML is used to structure information, such as headings, paragraphs and lists, and can also describe the appearance and semantics of a document to some extent. HTML was created by Tim Berners-Lee and further developed by the IETF using a simplified SGML (Standard Generalized Markup Language) syntax; it later became an international standard maintained by the World Wide Web Consortium (W3C). The W3C currently recommends writing web pages to the XHTML 1.1, XHTML 1.0 or HTML 4.01 standards, although many pages (such as Google's) have already switched to the newer HTML5.
Analyzing the HTML standard yields the complete list of HTML tags, shown in the following table:
Tag | Description | DTD
<!--...--> | Defines a comment. | STF
<!DOCTYPE> | Defines the document type. | STF
<a> | Defines an anchor. | STF
<abbr> | Defines an abbreviation. | STF
<acronym> | Defines an acronym. | STF
<address> | Defines contact information for the author or owner of the document. | STF
<applet> | Deprecated. Defines an embedded applet. | TF
<area> | Defines an area inside an image map. | STF
<b> | Defines bold text. | STF
<base> | Defines the default address or default target for all links on the page. | STF
<basefont> | Deprecated. Defines the default font, color or size of text on the page. | TF
<bdo> | Defines the text direction. | STF
<big> | Defines large text. | STF
<blockquote> | Defines a long quotation. | STF
<body> | Defines the body of the document. | STF
<br> | Defines a single line break. | STF
<button> | Defines a push button. | STF
<caption> | Defines a table caption. | STF
<center> | Deprecated. Defines centered text. | TF
<cite> | Defines a citation. | STF
<code> | Defines computer code text. | STF
<col> | Defines attribute values for one or more columns in a table. | STF
<colgroup> | Defines a group of columns in a table for formatting. | STF
<dd> | Defines the description of an item in a definition list. | STF
<del> | Defines deleted text. | STF
<dir> | Deprecated. Defines a directory list. | TF
<div> | Defines a section in a document. | STF
<dfn> | Defines a definition term. | STF
<dl> | Defines a definition list. | STF
<dt> | Defines an item in a definition list. | STF
<em> | Defines emphasized text. | STF
<fieldset> | Defines a border around elements in a form. | STF
<font> | Deprecated. Defines the font, size and color of text. | TF
<form> | Defines an HTML form for user input. | STF
<frame> | Defines a window (frame) in a frameset. | F
<frameset> | Defines a frameset. | F
<h1> to <h6> | Defines HTML headings. | STF
<head> | Defines information about the document. | STF
<hr> | Defines a horizontal rule. | STF
<html> | Defines an HTML document. | STF
<i> | Defines italic text. | STF
<iframe> | Defines an inline frame. | TF
<img> | Defines an image. | STF
<input> | Defines an input control. | STF
<ins> | Defines inserted text. | STF
<isindex> | Deprecated. Defines a searchable index related to the document. | TF
<kbd> | Defines keyboard text. | STF
<label> | Defines a label for an input element. | STF
<legend> | Defines a caption for a fieldset element. | STF
<li> | Defines a list item. | STF
<link> | Defines the relationship between a document and an external resource. | STF
<map> | Defines an image map. | STF
<menu> | Deprecated. Defines a menu list. | TF
<meta> | Defines meta-information about an HTML document. | STF
<noframes> | Defines alternate content for users whose browsers do not support frames. | TF
<noscript> | Defines alternate content for users whose browsers do not support client-side scripts. | STF
<object> | Defines an embedded object. | STF
<ol> | Defines an ordered list. | STF
<optgroup> | Defines a group of related options in a select list. | STF
<option> | Defines an option in a select list. | STF
<p> | Defines a paragraph. | STF
<param> | Defines a parameter for an object. | STF
<pre> | Defines preformatted text. | STF
<q> | Defines a short quotation. | STF
<s> | Deprecated. Defines strikethrough text. | TF
<samp> | Defines sample computer code. | STF
<script> | Defines a client-side script. | STF
<select> | Defines a select list (drop-down list). | STF
<small> | Defines small text. | STF
<span> | Defines a section in a document. | STF
<strike> | Deprecated. Defines strikethrough text. | TF
<strong> | Defines strongly emphasized text. | STF
<style> | Defines style information for a document. | STF
<sub> | Defines subscript text. | STF
<sup> | Defines superscript text. | STF
<table> | Defines a table. | STF
<tbody> | Defines the body content of a table. | STF
<td> | Defines a cell in a table. | STF
<textarea> | Defines a multi-line text input control. | STF
<tfoot> | Defines the footer content of a table. | STF
<th> | Defines a header cell in a table. | STF
<thead> | Defines the header content of a table. | STF
<title> | Defines the title of the document. | STF
<tr> | Defines a row in a table. | STF
<tt> | Defines typewriter (teletype) text. | STF
<u> | Deprecated. Defines underlined text. | TF
<ul> | Defines an unordered list. | STF
<var> | Defines the variable part of a text. | STF
<xmp> | Deprecated. Defines preformatted text. |
DTD: indicates in which XHTML 1.0 DTD the tag is allowed. S = Strict, T = Transitional, F = Frameset.
As shown in Fig. 2, the invention is implemented as follows:
Step 1: Classify the HTML tags in the table above and set the weighting coefficient for each class of tag. The classification is as follows, given as (tag name, weighting coefficient): (title, 0.8), (h1, 0.7), (h2, 0.65), (h3, 0.6), (h4, 0.5), (li, 0.5), (h5, 0.45), (h6, 0.4), (th, 0.4), (caption, 0.4), (em, 0.3), (strong, 0.3), (b, 0.3) ... In addition, the weighting coefficient for content in the other areas of the page is 0.2.
Step 2.1: When the HTML page content is structured, the page content is stored in a two-dimensional data table with the following storage structure:
[Table image in the original: storage structure of the two-dimensional data table]
Step 2.2: Build index libraries tag by tag. Taking the table above as an example, the following index libraries are built:
title index library:
Diaoyudaoite: {3,100}; (1,1,1); (2,1,2); (3,1,1)
First: {20,100}; (2,1,11); (3,1,6)
Put on display: {5,100}; (3,1,8)
……
h1 index library:
Diaoyudaoite: {3,100}; (1,1,1); (2,1,2); (3,1,1)
First: {20,100}; (2,1,11); (3,1,6)
Put on display: {5,100}; (3,1,8)
……
h2 index library:
First: {10,100}; (1,1,13)
Put on display: {2,100}; (1,1,15)
……
Index libraries of the other fields
Full-text field index library
Each entry in an index library has three parts: (1) the word or phrase; (2) the IDF data in braces, namely the number of records in which the word occurs and the total number of records; (3) the TF data in parentheses, with one tuple per record whose values are, in order, the record number, the number of occurrences, and the position of the first occurrence.
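A field index library with entries of this three-part shape can be sketched as follows. The function name `build_field_index`, the sample records and the whitespace tokenizer are hypothetical; positions here are 1-based word positions, which is an assumption, since the description does not say how positions are counted. Chinese text would need a word segmenter passed in as `tokenize`.

```python
from collections import defaultdict

def build_field_index(records, field, tokenize=str.split):
    """Build the index library of one field.

    records: {record number: {field name: text}}
    Returns {word: ((records containing word, total records),
                    [(record number, occurrences, first position), ...])}
    """
    postings = defaultdict(list)
    for record_no in sorted(records):
        # Values merged with the "^" separator are indexed as one text.
        text = records[record_no].get(field, "").replace("^", " ")
        counts = defaultdict(int)
        first_pos = {}
        for pos, word in enumerate(tokenize(text), start=1):
            counts[word] += 1
            first_pos.setdefault(word, pos)
        for word in counts:
            postings[word].append((record_no, counts[word], first_pos[word]))
    total = len(records)
    return {word: ((len(entries), total), entries)
            for word, entries in postings.items()}

# Hypothetical structured records with only a title field.
records = {
    1: {"title": "stone first exhibition"},
    2: {"title": "the stone"},
    3: {"title": "stone stone show"},
}
title_index = build_field_index(records, "title")
```

The first pair of each entry supplies the values needed for IDF, and the per-record tuples supply the occurrence counts needed for TF.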
Step 3.1: Suppose the user's query is "Diaoyudaoite first put on display", consisting of the terms "Diaoyudaoite", "first" and "put on display".
Step 3.2: Searching the full-text field yields the set of hit records; the relevance of each term to each hit field of each hit record is then computed in turn. Take the relevance of the term "Diaoyudaoite" to the title field of record 1 as an example:
Step 3.3: First compute the TF value. From the index, TF = 1 (occurrence) / 3 (total words) = 0.333.
Step 3.4: Then compute the IDF value. From the index, IDF = log(100 / (1 + 3)) = 1.398.
Step 3.5: Compute the weighted relevance for this field: TIW(title) = 0.333 × 1.398 × 0.8 = 0.3724.
Step 3.6: Repeat steps 3.3 to 3.5 to compute the relevance of "Diaoyudaoite" to the other fields of record 1: TIW(h1) = 1.0 × log(100 / (1 + 3)) × 0.7 = 0.9786; TIW(h2) = 0. Finally, compute the overall relevance of "Diaoyudaoite" to record 1: GC = (TIW(title) + TIW(h1)) / 2 = (0.3724 + 0.9786) / 2 = 0.6755.
Step 3.7: Repeat steps 3.3 to 3.6 to compute the relevance of "put on display" and "first" to record 1: 0.0079 and 0.0123. Finally, the relevance of the user's input to record 1 is 0.6755 + 0.0079 + 0.0123 = 0.6957.
Step 3.8: Repeat steps 3.3 to 3.7 to compute the relevance of records 2 and 3 to the user's input: 0.7958 and 0.8741 respectively. The final ranking of the search results is therefore (3, 2, 1).
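The arithmetic of steps 3.3 to 3.8 can be checked with a short script. It is only a sketch of the worked example: it applies the same intermediate rounding as the text (TF to three decimals, IDF to three decimals, results to four), takes the contributions of the other two terms and the totals for records 2 and 3 directly from the text, and uses hypothetical variable names.

```python
import math

D = 100                                    # total number of records
idf = round(math.log10(D / (1 + 3)), 3)    # "Diaoyudaoite" occurs in 3 records
tf_title = 0.333                           # 1 occurrence / 3 words, as rounded in the text

tiw_title = round(tf_title * idf * 0.8, 4)   # title weight 0.8
tiw_h1 = round(1.0 * idf * 0.7, 4)           # h1 weight 0.7, TF = 1.0
gc = round((tiw_title + tiw_h1) / 2, 4)      # mean over the two hit fields

# Contributions of the two remaining terms for record 1, as given in the text.
g1 = gc + 0.0079 + 0.0123

# Totals for records 2 and 3 are taken from the text, not recomputed.
scores = {1: g1, 2: 0.7958, 3: 0.8741}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting by descending relevance puts record 3 first and record 1 last, in agreement with the ranking stated in step 3.8.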
As can be seen, the invention computes the relevance of each relevant document to the user's input well, so that the search results reflect the structural features of HTML and the user obtains the desired results more accurately, giving a better user experience.
The above describes embodiments of the invention and does not limit the scope of its claims. It should be understood that those of ordinary skill in the art may make improvements and changes without departing from the principle of the invention, for example applying the method to other document types (including but not limited to PDF, doc/docx, xls/xlsx and ppt/pptx documents), changing the tag classification, or changing the weighting coefficients of the tag classes; such improvements and changes are also regarded as falling within the scope of protection of the invention.

Claims (4)

1. A method for improving the accuracy of search results, characterized in that it is implemented in the following steps:
(1) classifying HTML tags and setting a weighting coefficient for each class of tag;
(2) structuring the HTML page content according to the classification of step (1) to form structured data, and generating index data for each class of tag;
(3) according to the index data generated in step (2) and the weighting coefficients obtained in step (1), calculating the relevance of the match between a search term and an HTML page with a weighting algorithm, and calculating, according to the relevance, the frequency of occurrence of the tags in the HTML page.
2. The method for improving the accuracy of search results according to claim 1, characterized in that the step of classifying HTML tags and setting the weighting coefficient of each class of tag in step (1) comprises:
(11) classifying the tags according to the HTML standard;
(12) assigning a weighting coefficient to each tag according to the meaning and importance of the tags classified in step (11);
(13) outputting the tag weighting result in the form (tag name, weight).
3. The method for improving the accuracy of search results according to claim 1, characterized in that the method of structuring the HTML page content and forming the index data in step (2) comprises the following steps:
(21) parsing the HTML page content and converting it to structured data according to the tag classification set in step (1); the structured data is represented as a two-dimensional table whose columns are the tag classes, and each HTML page is converted into one record of the table; if an HTML page contains several pieces of data for the same tag class, they are merged into one field and delimited by a separator;
(22) according to the tag classification, building an index over the rows of the structured data field by field, the index data of each field being called a field index library; for each word, the field index library records which records contain the word in this field, how many times it occurs, and where it occurs; it also records the number of records in which each word occurs.
4. The method for improving the accuracy of search results according to claim 1, characterized in that the relevance calculation in step (3) is:
$$G = \sum_{y=1}^{n}\left(\frac{\sum_{x=1}^{m} TIW_{x,y}}{m}\right)$$
where $TIW_{x,y}$ is the relevance of a search term to each field in a record; x indexes the fields hit by the search term in the record and ranges from 1 to m; and y indexes the search terms entered in one query and ranges from 1 to n.
CN201310276040.4A 2013-07-02 2013-07-02 A kind of method improving retrieval result accuracy rate Active CN103310014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310276040.4A CN103310014B (en) 2013-07-02 2013-07-02 A kind of method improving retrieval result accuracy rate

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310276040.4A CN103310014B (en) 2013-07-02 2013-07-02 A kind of method improving retrieval result accuracy rate

Publications (2)

Publication Number Publication Date
CN103310014A true CN103310014A (en) 2013-09-18
CN103310014B CN103310014B (en) 2016-06-29

Family

ID=49135232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310276040.4A Active CN103310014B (en) 2013-07-02 2013-07-02 A kind of method improving retrieval result accuracy rate

Country Status (1)

Country Link
CN (1) CN103310014B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447205A (en) * 2016-01-05 2016-03-30 腾讯科技(深圳)有限公司 Retrieved result sorting method and device
CN105740406A (en) * 2016-01-28 2016-07-06 北京致远协创软件有限公司 Information indexing and searching method
CN105912694A (en) * 2016-04-25 2016-08-31 上海斐讯数据通信技术有限公司 Log file applied to embedded device, and creating and query systems and methods therefor
CN106815220A (en) * 2015-11-27 2017-06-09 英业达科技有限公司 Data are classified and method for searching
CN107357891A (en) * 2017-07-12 2017-11-17 中云开源数据技术(上海)有限公司 A kind of homepage Link Recommendation method
CN107729486A (en) * 2017-10-17 2018-02-23 北京奇艺世纪科技有限公司 A kind of video searching method and device
CN109599186A (en) * 2018-11-21 2019-04-09 金色熊猫有限公司 Data processing method, device and medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107656985B (en) * 2017-09-11 2020-11-27 北京京东尚科信息技术有限公司 Webpage query method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101094194A (en) * 2006-06-19 2007-12-26 腾讯科技(深圳)有限公司 Method for picking up web information needed by user in web page
CN101571859A (en) * 2008-04-28 2009-11-04 国际商业机器公司 Method and apparatus for labelling document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICHAL CUTLER AND SO ON: "Using the Structure of HTML Documents to Improve Retrieval", 《PROCEEDINGS OF THE USENIX SYMPOSIUM ON INTERNET TECHNOLOGIES AND SYSTEMS》, 31 December 1997 (1997-12-31), pages 1 - 12 *
YOUSSEF BASSIL AND SO ON: "Semantic-Sensitive Web Information Retrieval Model for HTML Documents", 《EUROPEAN JOURNAL OF SCIENTIFIC RESEARCH》, vol. 69, no. 4, 29 February 2012 (2012-02-29), pages 1 - 11 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815220A (en) * 2015-11-27 2017-06-09 英业达科技有限公司 Data are classified and method for searching
CN105447205A (en) * 2016-01-05 2016-03-30 腾讯科技(深圳)有限公司 Retrieved result sorting method and device
CN105447205B (en) * 2016-01-05 2023-10-24 腾讯科技(深圳)有限公司 Method and device for sorting search results
CN105740406A (en) * 2016-01-28 2016-07-06 北京致远协创软件有限公司 Information indexing and searching method
CN105912694A (en) * 2016-04-25 2016-08-31 上海斐讯数据通信技术有限公司 Log file applied to embedded device, and creating and query systems and methods therefor
CN107357891A (en) * 2017-07-12 2017-11-17 中云开源数据技术(上海)有限公司 A kind of homepage Link Recommendation method
CN107729486A (en) * 2017-10-17 2018-02-23 北京奇艺世纪科技有限公司 A kind of video searching method and device
CN107729486B (en) * 2017-10-17 2021-02-09 北京奇艺世纪科技有限公司 Video searching method and device
CN109599186A (en) * 2018-11-21 2019-04-09 金色熊猫有限公司 Data processing method, device and medium

Also Published As

Publication number Publication date
CN103310014B (en) 2016-06-29

Similar Documents

Publication Publication Date Title
CN103310014B (en) A kind of method improving retrieval result accuracy rate
Sun et al. Dom based content extraction via text density
US7721195B2 (en) RTF template and XSL/FO conversion: a new way to create computer reports
US9639631B2 (en) Converting XML to JSON with configurable output
EP2057557B1 (en) Joint optimization of wrapper generation and template detection
US7370061B2 (en) Method for querying XML documents using a weighted navigational index
JP5209235B2 (en) Visualizing document annotations in the context of the source document
CN106528583A (en) Method for extracting and comparing web page main body
CN102915361B (en) Webpage text extracting method based on character distribution characteristic
CN102662969A (en) Internet information object positioning method based on webpage structure semantic meaning
CN101872350A (en) Web page text extracting method and device thereof
US8239425B1 (en) Isolating desired content, metadata, or both from social media
CN110390037B (en) Information classification method, device and equipment based on DOM tree and storage medium
CN112380337A (en) Highlight method and device based on rich text
Bu et al. An FAR-SW based approach for webpage information extraction
US8983980B2 (en) Domain constraint based data record extraction
CN112199960B (en) Standard knowledge element granularity analysis system
Lam et al. A method for web information extraction
JP5564442B2 (en) Text search device
US20210011895A1 (en) Hierarchical document sectioning for contextual retrieval
JP5187187B2 (en) Experience information search system
Silva Slicing XML documents
Ranaivo-Malançon et al. Transforming semi-structured indigenous dictionary into machine-readable dictionary
Fatima et al. Review on Web Page Data Extraction Technique
WO2023137160A1 (en) System, method, and computer program product for tokenizing document citations

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210506

Address after: 100085 room 1008, 10 / F, block F, No.9, Shangdi 3rd Street, Haidian District, Beijing

Patentee after: Beijing easy to use Lianyou Technology Co.,Ltd.

Address before: 100191 No. 37, Haidian District, Beijing, Xueyuan Road

Patentee before: BEIHANG University

TR01 Transfer of patent right
CP02 Change in the address of a patent holder

Address after: Room 1601, 14th Floor, No. 27 Zhichun Road, Haidian District, Beijing, 100086

Patentee after: Beijing easy to use Lianyou Technology Co.,Ltd.

Address before: 100085 room 1008, 10 / F, block F, No.9, Shangdi 3rd Street, Haidian District, Beijing

Patentee before: Beijing easy to use Lianyou Technology Co.,Ltd.

CP02 Change in the address of a patent holder