CN101957816B - Method and *** for automatically extracting web page metadata based on multi-page comparison - Google Patents
- Publication number
- CN101957816B (application CN 200910054701)
- Authority
- CN
- China
- Prior art keywords
- page
- metadata
- template
- data
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
Description
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200910054701 CN101957816B (zh) | 2009-07-13 | 2009-07-13 | Method and *** for automatically extracting web page metadata based on multi-page comparison |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200910054701 CN101957816B (zh) | 2009-07-13 | 2009-07-13 | Method and *** for automatically extracting web page metadata based on multi-page comparison |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101957816A (zh) | 2011-01-26 |
CN101957816B (zh) | 2013-03-20 |
Family
ID=43485149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200910054701 (Expired - Fee Related) CN101957816B (zh) | Method and *** for automatically extracting web page metadata based on multi-page comparison | 2009-07-13 | 2009-07-13 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101957816B (zh) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222310A (zh) * | 2011-07-18 | 2011-10-19 | 深圳证券信息有限公司 | Securities information publishing method and platform |
CN103544176B (zh) * | 2012-07-13 | 2018-08-10 | 百度在线网络技术(北京)有限公司 | Method and device for generating a page structure template corresponding to multiple pages |
CN102819597B (zh) * | 2012-08-13 | 2015-04-22 | 北京星网锐捷网络技术有限公司 | Web page classification method and device |
CN102968466B (zh) * | 2012-11-09 | 2016-05-18 | 同济大学 | Index network construction method based on web page classification and its index network builder |
SG11201506510WA (en) * | 2013-03-15 | 2015-09-29 | Ab Initio Technology Llc | System for metadata management |
US10108590B2 (en) | 2013-05-03 | 2018-10-23 | International Business Machines Corporation | Comparing markup language files |
CN104424334A (zh) * | 2013-09-11 | 2015-03-18 | 方正信息产业控股有限公司 | Method and apparatus for constructing XML document nodes |
CN103870567A (zh) * | 2014-03-11 | 2014-06-18 | 浪潮集团有限公司 | Method for automatic identification of web page collection templates for vertical search engines in cloud computing |
US9679076B2 (en) | 2014-03-24 | 2017-06-13 | Xiaomi Inc. | Method and device for controlling page rollback |
CN103914523A (zh) * | 2014-03-24 | 2014-07-09 | 小米科技有限责任公司 | Page rollback control method and device |
US20160004783A1 (en) * | 2014-07-01 | 2016-01-07 | EveryMundo, LLC | Automated generation of web site entry pages |
CN104317948A (zh) * | 2014-11-05 | 2015-01-28 | 北京中科辅龙信息技术有限公司 | Page data crawling method and *** |
CN105653531B (zh) * | 2014-11-12 | 2020-02-07 | 中兴通讯股份有限公司 | Data extraction method and device |
CN105335516A (zh) * | 2015-11-04 | 2016-02-17 | 浪潮软件集团有限公司 | Method for constructing a general-purpose collection *** |
CN105955984A (zh) * | 2016-04-19 | 2016-09-21 | ***股份有限公司 | Network data search method based on a crawler mode |
CN108090080A (zh) * | 2016-11-22 | 2018-05-29 | 北京京东尚科信息技术有限公司 | Method and *** for replacing a parsing template, and crawling method |
CN107092689A (zh) * | 2017-04-24 | 2017-08-25 | 深圳市茁壮网络股份有限公司 | Metadata generation method and *** |
CN107992556B (zh) * | 2017-11-28 | 2020-08-21 | 福建中金在线信息科技有限公司 | Site management method, apparatus, electronic device, and storage medium |
CN108763279B (zh) * | 2018-04-11 | 2020-12-15 | 北京中科闻歌科技股份有限公司 | Distributed template-based web page data collection method and *** |
CN109445784B (zh) * | 2018-09-29 | 2020-08-14 | Oppo广东移动通信有限公司 | Structured data processing method, apparatus, storage medium, and electronic device |
CN111125589B (zh) * | 2018-10-31 | 2023-09-05 | 新方正控股发展有限责任公司 | Data collection method and apparatus, and computer-readable storage medium |
CN111125565A (zh) * | 2019-11-01 | 2020-05-08 | 上海掌门科技有限公司 | Method and device for inputting information in an application |
CN111460442A (zh) * | 2020-04-24 | 2020-07-28 | 怀化学院 | Attack detection method based on Internet cross-search defects |
CN112035722B (zh) * | 2020-08-04 | 2023-10-13 | 北京启明星辰信息安全技术有限公司 | Method and apparatus for extracting dynamic web page information, and computer-readable storage medium |
CN112685364A (zh) * | 2020-12-24 | 2021-04-20 | 北京浪潮数据技术有限公司 | Flume metadata information analysis and extraction method and related components |
CN116702702B (zh) * | 2023-04-14 | 2024-02-13 | 北京雅昌艺术印刷有限公司 | XML-based automatic typesetting method and *** |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101404666A (zh) * | 2008-10-06 | 2009-04-08 | 赵洪宇 | Method for unlimited-level collection based on Web pages |
CN101464905A (zh) * | 2009-01-08 | 2009-06-24 | 中国科学院计算技术研究所 | A *** and method for web page information extraction |
- 2009-07-13: application CN 200910054701 filed (CN101957816B, zh); status: not active, Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN101957816A (zh) | 2011-01-26 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
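CN101957816B (zh) | Method and *** for automatically extracting web page metadata based on multi-page comparison | |
CN103823824B (zh) | Method and *** for automatically building a text classification corpus using the Internet | |
CN110263180B (zh) | Intent knowledge graph generation method, intent recognition method, and device | |
CN103136360B (zh) | Internet behavior annotation engine and the corresponding behavior annotation method | |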
Chen et al. | Websrc: A dataset for web-based structural reading comprehension | |
US7739257B2 (en) | Search engine | |
Khare et al. | Understanding deep web search interfaces: A survey | |
Peters et al. | Content extraction using diverse feature sets | |
Zheng et al. | Template-independent news extraction based on visual consistency | |
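CN101727498A (zh) | Method for automatic extraction of web page information based on web structure | |
CN103914478A (zh) | Web page training method and ***, web page prediction method and *** | |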
Pol et al. | A survey on web content mining and extraction of structured and semistructured data | |
CN103559234A (zh) | Automated semantic annotation *** and method for RESTful Web services | |
Omari et al. | Cross-supervised synthesis of web-crawlers | |
Furche et al. | Real understanding of real estate forms | |
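CN110083760B (zh) | Method for extracting multi-record dynamic web page information based on visual blocks | |
CN100357942C (zh) | Search method for a mobile Internet intelligent information search engine | |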
Arya et al. | Content extraction from news web pages using tag tree | |
Gkotsis et al. | Self-supervised automated wrapper generation for weblog data extraction | |
Lim et al. | Generalized and lightweight algorithms for automated web forum content extraction | |
Lindholm | Extracting content from online news sites | |
Mane et al. | Template extraction from heterogeneous web pages | |
Boronat | A comparison of HTML-aware tools for Web Data extraction | |
Parapar et al. | Blog posts and comments extraction and impact on retrieval effectiveness | |
Flesca et al. | Reasoning and ontologies in data extraction |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
ASS | Succession or assignment of patent right | Owner name: SHANGHAI HUAYAN PROPERTY DEVELOPMENT CO., LTD.; Free format text: FORMER OWNER: SHANGHAI XIEYU NETWORK TECHNOLOGY CO., LTD.; Effective date: 20110810 |
C41 | Transfer of patent application or patent right or utility model | |
COR | Change of bibliographic data | Free format text: CORRECT: ADDRESS; FROM: 200434 HONGKOU, SHANGHAI TO: 200052 CHANGNING, SHANGHAI |
TA01 | Transfer of patent application right | Effective date of registration: 20110810; Address after: 16, Biology Building, No. 1326, Shanghai, West Yan'an Road; Applicant after: Shanghai Huayan House Development Co., Ltd.; Address before: 200434 Shanghai city Jipu road 375 Lane 34, room 103; Applicant before: Shanghai Xieyu Network Technology Co., Ltd. |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
C56 | Change in the name or address of the patentee | Owner name: SHANGHAI HUAYAN FANGMENG NETWORK TECHNOLOGY CO., L; Free format text: FORMER NAME: SHANGHAI HUAYAN PROPERTY DEVELOPMENT CO., LTD. |
CP03 | Change of name, title or address | Address after: 200052, Changning District, West Yan'an Road, No. 16, building 1326, Shanghai; Patentee after: Shanghai Huayan real NSFocus network Polytron Technologies Inc; Address before: 16, Biology Building, No. 1326, Shanghai, West Yan'an Road; Patentee before: Shanghai Huayan House Development Co., Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130320; Termination date: 20180713 |