CN102254014A - Adaptive information extraction method for webpage characteristics - Google Patents
Adaptive information extraction method for webpage characteristics
- Publication number
- CN102254014A
- Authority
- CN
- China
- Prior art keywords
- page
- result
- information
- academic
- text unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110205137 CN102254014B (en) | 2011-07-21 | 2011-07-21 | Adaptive information extraction method for webpage characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110205137 CN102254014B (en) | 2011-07-21 | 2011-07-21 | Adaptive information extraction method for webpage characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102254014A (en) | 2011-11-23 |
CN102254014B (en) | 2013-06-05 |
Family
ID=44981278
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110205137 Expired - Fee Related CN102254014B (en) | 2011-07-21 | 2011-07-21 | Adaptive information extraction method for webpage characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102254014B (en) |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102411630A (en) * | 2011-12-22 | 2012-04-11 | 南京烽火星空通信发展有限公司 | Attribute searching method |
CN102663123A (en) * | 2012-04-20 | 2012-09-12 | 哈尔滨工业大学 | Semantic attribute automatic extraction method on basis of pseudo-seed attributes and random walk sort and system for implementing same |
CN102662969A (en) * | 2012-03-11 | 2012-09-12 | 复旦大学 | Internet information object positioning method based on webpage structure semantic meaning |
CN102841920A (en) * | 2012-06-30 | 2012-12-26 | 北京百度网讯科技有限公司 | Method and device for extracting webpage frame information |
CN102867064A (en) * | 2012-09-28 | 2013-01-09 | 用友软件股份有限公司 | Associated field query device and associated field query method |
CN102932400A (en) * | 2012-07-20 | 2013-02-13 | 北京网康科技有限公司 | Method and device for identifying uniform resource locator primary links |
CN103051895A (en) * | 2012-12-07 | 2013-04-17 | 浙江大学 | Method and device of context model selection |
CN103218362A (en) * | 2012-01-19 | 2013-07-24 | 中兴通讯股份有限公司 | Method and system for constructing domain ontology |
CN103577578A (en) * | 2012-03-30 | 2014-02-12 | 奇智软件(北京)有限公司 | Marker file parsing method and device |
CN103793285A (en) * | 2012-10-29 | 2014-05-14 | 百度在线网络技术(北京)有限公司 | Method and platform server for processing online anomalies |
CN104252530A (en) * | 2014-09-10 | 2014-12-31 | 北京京东尚科信息技术有限公司 | Single-computer crawler grabbing method and system |
CN104331438A (en) * | 2014-10-24 | 2015-02-04 | 北京奇虎科技有限公司 | Method and device for selectively extracting content of novel webpage |
CN104376108A (en) * | 2014-11-26 | 2015-02-25 | 克拉玛依红有软件有限责任公司 | Unstructured natural language information extraction method based on 6W semantic annotation |
CN104699797A (en) * | 2015-03-18 | 2015-06-10 | 浪潮集团有限公司 | Webpage data structured analytic method and device |
CN105095400A (en) * | 2015-07-07 | 2015-11-25 | 清华大学 | Method for finding personal homepage |
CN106469176A (en) * | 2015-08-20 | 2017-03-01 | 百度在线网络技术(北京)有限公司 | A kind of method and apparatus for extracting text snippet |
CN106484920A (en) * | 2016-11-21 | 2017-03-08 | 北京恒华伟业科技股份有限公司 | A kind of abstracting method of evaluation document index |
CN106681596A (en) * | 2017-01-03 | 2017-05-17 | 北京百度网讯科技有限公司 | Information display method and device |
CN106708913A (en) * | 2015-11-12 | 2017-05-24 | 财团法人资讯工业策进会 | Intelligent product storage system and method thereof |
CN107808000A (en) * | 2017-11-13 | 2018-03-16 | 哈尔滨工业大学(威海) | A kind of hidden web data collection and extraction system and method |
CN108153851A (en) * | 2017-12-21 | 2018-06-12 | 北京工业大学 | A kind of rule-based and semantic universal forum topic post page info abstracting method |
CN108241680A (en) * | 2016-12-26 | 2018-07-03 | 北京国双科技有限公司 | The method and apparatus for obtaining the amount of reading of webpage |
CN109033282A (en) * | 2018-07-11 | 2018-12-18 | 山东邦尼信息科技有限公司 | A kind of Web page text extracting method and device based on extraction template |
CN109117435A (en) * | 2017-06-22 | 2019-01-01 | 索意互动(北京)信息技术有限公司 | A kind of client, server, search method and its system |
CN109657180A (en) * | 2018-12-11 | 2019-04-19 | 中科国力(镇江)智能技术有限公司 | It is a kind of intelligence web page contents automatically obscure extraction system |
US10289963B2 (en) | 2017-02-27 | 2019-05-14 | International Business Machines Corporation | Unified text analytics annotator development life cycle combining rule-based and machine learning based techniques |
CN110020366A (en) * | 2017-12-07 | 2019-07-16 | 北大方正集团有限公司 | Mailbox message abstracting method and device |
CN110189210A (en) * | 2019-06-05 | 2019-08-30 | 浙江米奥兰特商务会展股份有限公司 | Purchaser's information collecting method, device, equipment and the storage medium that foreign trade is brought together |
CN110781497A (en) * | 2019-10-21 | 2020-02-11 | 新华三信息安全技术有限公司 | Method for detecting web page link and storage medium |
CN113254751A (en) * | 2021-06-24 | 2021-08-13 | 北森云计算有限公司 | Method, equipment and storage medium for accurately extracting complex webpage structured information |
CN113268573A (en) * | 2021-05-19 | 2021-08-17 | 上海博亦信息科技有限公司 | Extraction method of academic talent information |
CN113434797A (en) * | 2021-06-29 | 2021-09-24 | 中国电信集团***集成有限责任公司 | Webpage information extraction method and device |
CN114116757A (en) * | 2020-08-31 | 2022-03-01 | 阿里巴巴集团控股有限公司 | Data processing method and device, electronic equipment and readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004046312A (en) * | 2002-07-09 | 2004-02-12 | Nippon Telegr & Teleph Corp <Ntt> | Site manager information extraction method and device, site manager information extraction program, and recording medium with the program recorded |
CN101620608A (en) * | 2008-07-04 | 2010-01-06 | 全国组织机构代码管理中心 | Information collection method and system |
US20100083095A1 (en) * | 2008-09-29 | 2010-04-01 | Nikovski Daniel N | Method for Extracting Data from Web Pages |
CN101727498A (en) * | 2010-01-15 | 2010-06-09 | 西安交通大学 | Automatic extraction method of web page information based on WEB structure |
- 2011-07-21: CN application CN 201110205137 filed; patent CN102254014B, not active (Expired - Fee Related)
Cited By (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102411630A (en) * | 2011-12-22 | 2012-04-11 | 南京烽火星空通信发展有限公司 | Attribute searching method |
CN103218362A (en) * | 2012-01-19 | 2013-07-24 | 中兴通讯股份有限公司 | Method and system for constructing domain ontology |
CN102662969A (en) * | 2012-03-11 | 2012-09-12 | 复旦大学 | Internet information object positioning method based on webpage structure semantic meaning |
CN103577578B (en) * | 2012-03-30 | 2017-04-05 | 北京奇虎科技有限公司 | A kind of tab file analysis method and device |
CN103577578A (en) * | 2012-03-30 | 2014-02-12 | 奇智软件(北京)有限公司 | Marker file parsing method and device |
CN102663123A (en) * | 2012-04-20 | 2012-09-12 | 哈尔滨工业大学 | Semantic attribute automatic extraction method on basis of pseudo-seed attributes and random walk sort and system for implementing same |
CN102663123B (en) * | 2012-04-20 | 2014-09-03 | 哈尔滨工业大学 | Semantic attribute automatic extraction method on basis of pseudo-seed attributes and random walk sort and system for implementing same |
CN102841920A (en) * | 2012-06-30 | 2012-12-26 | 北京百度网讯科技有限公司 | Method and device for extracting webpage frame information |
CN102841920B (en) * | 2012-06-30 | 2017-05-10 | 北京百度网讯科技有限公司 | Method and device for extracting webpage frame information |
CN102932400A (en) * | 2012-07-20 | 2013-02-13 | 北京网康科技有限公司 | Method and device for identifying uniform resource locator primary links |
CN102932400B (en) * | 2012-07-20 | 2015-06-17 | 北京网康科技有限公司 | Method and device for identifying uniform resource locator primary links |
CN102867064A (en) * | 2012-09-28 | 2013-01-09 | 用友软件股份有限公司 | Associated field query device and associated field query method |
CN102867064B (en) * | 2012-09-28 | 2015-12-02 | 用友网络科技股份有限公司 | Associate field inquiry unit and associate field querying method |
CN103793285A (en) * | 2012-10-29 | 2014-05-14 | 百度在线网络技术(北京)有限公司 | Method and platform server for processing online anomalies |
CN103051895B (en) * | 2012-12-07 | 2016-04-13 | 浙江大学 | The method and apparatus that a kind of context model is selected |
CN103051895A (en) * | 2012-12-07 | 2013-04-17 | 浙江大学 | Method and device of context model selection |
CN104252530A (en) * | 2014-09-10 | 2014-12-31 | 北京京东尚科信息技术有限公司 | Single-computer crawler grabbing method and system |
CN104252530B (en) * | 2014-09-10 | 2017-09-15 | 北京京东尚科信息技术有限公司 | A kind of unit crawler capturing method and system |
CN104331438A (en) * | 2014-10-24 | 2015-02-04 | 北京奇虎科技有限公司 | Method and device for selectively extracting content of novel webpage |
CN104331438B (en) * | 2014-10-24 | 2018-04-17 | 北京奇虎科技有限公司 | To novel web page contents selectivity abstracting method and device |
CN104376108A (en) * | 2014-11-26 | 2015-02-25 | 克拉玛依红有软件有限责任公司 | Unstructured natural language information extraction method based on 6W semantic annotation |
CN104376108B (en) * | 2014-11-26 | 2017-06-06 | 克拉玛依红有软件有限责任公司 | A kind of destructuring natural language information abstracting method based on the semantic marks of 6W |
CN104699797A (en) * | 2015-03-18 | 2015-06-10 | 浪潮集团有限公司 | Webpage data structured analytic method and device |
CN104699797B (en) * | 2015-03-18 | 2018-02-23 | 浪潮集团有限公司 | A kind of web page data structured analysis method and device |
CN105095400A (en) * | 2015-07-07 | 2015-11-25 | 清华大学 | Method for finding personal homepage |
CN106469176A (en) * | 2015-08-20 | 2017-03-01 | 百度在线网络技术(北京)有限公司 | A kind of method and apparatus for extracting text snippet |
CN106469176B (en) * | 2015-08-20 | 2019-08-16 | 百度在线网络技术(北京)有限公司 | It is a kind of for extracting the method and apparatus of text snippet |
CN106708913A (en) * | 2015-11-12 | 2017-05-24 | 财团法人资讯工业策进会 | Intelligent product storage system and method thereof |
CN106708913B (en) * | 2015-11-12 | 2020-03-27 | 财团法人资讯工业策进会 | Intelligent product storage system and method thereof |
CN106484920A (en) * | 2016-11-21 | 2017-03-08 | 北京恒华伟业科技股份有限公司 | A kind of abstracting method of evaluation document index |
CN108241680A (en) * | 2016-12-26 | 2018-07-03 | 北京国双科技有限公司 | The method and apparatus for obtaining the amount of reading of webpage |
CN108241680B (en) * | 2016-12-26 | 2020-10-13 | 北京国双科技有限公司 | Method and device for acquiring reading amount of webpage |
CN106681596A (en) * | 2017-01-03 | 2017-05-17 | 北京百度网讯科技有限公司 | Information display method and device |
US10289963B2 (en) | 2017-02-27 | 2019-05-14 | International Business Machines Corporation | Unified text analytics annotator development life cycle combining rule-based and machine learning based techniques |
CN109117435A (en) * | 2017-06-22 | 2019-01-01 | 索意互动(北京)信息技术有限公司 | A kind of client, server, search method and its system |
CN107808000A (en) * | 2017-11-13 | 2018-03-16 | 哈尔滨工业大学(威海) | A kind of hidden web data collection and extraction system and method |
CN110020366B (en) * | 2017-12-07 | 2021-06-15 | 北大方正集团有限公司 | Mailbox information extraction method and device |
CN110020366A (en) * | 2017-12-07 | 2019-07-16 | 北大方正集团有限公司 | Mailbox message abstracting method and device |
CN108153851B (en) * | 2017-12-21 | 2021-06-18 | 北京工业大学 | General forum subject post page information extraction method based on rules and semantics |
CN108153851A (en) * | 2017-12-21 | 2018-06-12 | 北京工业大学 | A kind of rule-based and semantic universal forum topic post page info abstracting method |
CN109033282A (en) * | 2018-07-11 | 2018-12-18 | 山东邦尼信息科技有限公司 | A kind of Web page text extracting method and device based on extraction template |
CN109033282B (en) * | 2018-07-11 | 2021-07-23 | 山东邦尼信息科技有限公司 | Webpage text extraction method and device based on extraction template |
CN109657180A (en) * | 2018-12-11 | 2019-04-19 | 中科国力(镇江)智能技术有限公司 | It is a kind of intelligence web page contents automatically obscure extraction system |
CN110189210A (en) * | 2019-06-05 | 2019-08-30 | 浙江米奥兰特商务会展股份有限公司 | Purchaser's information collecting method, device, equipment and the storage medium that foreign trade is brought together |
CN110781497A (en) * | 2019-10-21 | 2020-02-11 | 新华三信息安全技术有限公司 | Method for detecting web page link and storage medium |
CN114116757A (en) * | 2020-08-31 | 2022-03-01 | 阿里巴巴集团控股有限公司 | Data processing method and device, electronic equipment and readable storage medium |
CN114116757B (en) * | 2020-08-31 | 2022-10-18 | 阿里巴巴集团控股有限公司 | Data processing method and device, electronic equipment and readable storage medium |
CN113268573A (en) * | 2021-05-19 | 2021-08-17 | 上海博亦信息科技有限公司 | Extraction method of academic talent information |
CN113254751A (en) * | 2021-06-24 | 2021-08-13 | 北森云计算有限公司 | Method, equipment and storage medium for accurately extracting complex webpage structured information |
CN113434797A (en) * | 2021-06-29 | 2021-09-24 | 中国电信集团***集成有限责任公司 | Webpage information extraction method and device |
CN113434797B (en) * | 2021-06-29 | 2024-05-31 | ***数智科技有限公司 | Webpage information extraction method and device |
Also Published As
Publication number | Publication date |
---|---|
CN102254014B (en) | 2013-06-05 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN102254014B (en) | Adaptive information extraction method for webpage characteristics | |
CN109492077B (en) | Knowledge graph-based petrochemical field question-answering method and system | |
CN111723215B (en) | Device and method for establishing biotechnological information knowledge graph based on text mining | |
US8751218B2 (en) | Indexing content at semantic level | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN101464898B (en) | Method for extracting feature word of text | |
US8868621B2 (en) | Data extraction from HTML documents into tables for user comparison | |
CN105893611B (en) | Method for constructing interest topic semantic network facing social network | |
CN104268148B (en) | A kind of forum page Information Automatic Extraction method and system based on time string | |
CN103678412B (en) | A kind of method and device of file retrieval | |
CN101079025B (en) | File correlation computing system and method | |
Schenker | Graph-theoretic techniques for web content mining | |
CN103823824A (en) | Method and system for automatically constructing text classification corpus by aid of internet | |
CN106649666A (en) | Left-right recursion-based new word discovery method | |
CN109145260A (en) | A kind of text information extraction method | |
CN102609427A (en) | Public opinion vertical search analysis system and method | |
Döhmen et al. | Multi-hypothesis CSV parsing | |
CN109165373B (en) | Data processing method and device | |
CN108153851B (en) | General forum subject post page information extraction method based on rules and semantics | |
CN109657114B (en) | Method for extracting webpage semi-structured data | |
CN106970938A (en) | Web page towards focusing is obtained and information extraction method | |
CN114090861A (en) | Education field search engine construction method based on knowledge graph | |
CN106649557A (en) | Semantic association mining method for defect report and mail list | |
CN104346382B (en) | Use the text analysis system and method for language inquiry | |
WO2016099422A2 (en) | Content sensitive document ranking method by analyzing the citation contexts | |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
C53 | Correction of patent for invention or patent application | |
CB03 | Change of inventor or designer information | Inventor after: Jin Hai, Li Yi, Zhao Feng, Yan Fengwei, Chen Heng. Inventor before: Jin Hai, Li Yi, Zhao Feng, Yan Fengwei |
COR | Change of bibliographic data | Free format text: CORRECT: INVENTOR; FROM: JIN HAI LI YI ZHAO FENG YAN FENGWEI TO: JIN HAI LI YI ZHAO FENG YAN FENGWEI CHEN HENG |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130605. Termination date: 20200721 |