CN111859867B - Web data extraction system based on XML and XPath and use method thereof - Google Patents

Web data extraction system based on XML and XPath and use method thereof

Info

Publication number
CN111859867B
CN111859867B CN202010696786.0A CN202010696786A CN111859867B CN 111859867 B CN111859867 B CN 111859867B CN 202010696786 A CN202010696786 A CN 202010696786A CN 111859867 B CN111859867 B CN 111859867B
Authority
CN
China
Prior art keywords
extraction
module
webpage
data
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010696786.0A
Other languages
Chinese (zh)
Other versions
CN111859867A (en)
Inventor
官鲁卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Meicube Engineering Consulting Co ltd
Original Assignee
Guangxi Meicube Engineering Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Meicube Engineering Consulting Co ltd
Priority to CN202010696786.0A
Publication of CN111859867A
Application granted
Publication of CN111859867B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/154 Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data extraction, and in particular relates to a Web data extraction system based on XML and XPath and a method of using it. The invention converts, extracts, crawls and integrates unstructured data, and allows the extraction and crawling rules to be set as needed, so that different crawling requirements can be met. It is therefore suitable for processing different types of unstructured data and improves the efficiency and quality of that processing.

Description

Web data extraction system based on XML and XPath and use method thereof
Technical Field
The invention relates to the technical field of data extraction, in particular to a Web data extraction system based on XML and XPath and a use method thereof.
Background
With the arrival of the network and information age, data and information fill our lives and penetrate every field. Over the past few years, large amounts of data have been generated daily from weblogs, social networking sites, online transaction records and other data sources. Whether current database import technology and retrieval methods can fully normalize this daily output into a database format, and whether useful information can be mined and analyzed from massive, hard-to-process unstructured data, has become a focus of current research.
Unstructured data refers to data that cannot conveniently be represented in the two-dimensional logical tables of a database, including text, pictures, XML, HTML, audio, video and the like. More than 85% of the data currently collected from the Internet is unstructured or semi-structured. For example, Baidu processes data on the order of tens of PB per day; Facebook has more than 1 billion registered users, who upload more than 1 billion pictures per month, and it generates more than 300 TB of log data per day; Taobao has more than 370 million members and more than 880 million online commodities, and carries out tens of millions of transactions every day, generating about 20 TB of data; the total storage capacity of Yahoo exceeds 100 PB; and so on. Driven by related technologies such as cloud computing and the Internet of Things, this growth in data volume is gradually bringing the world into the era of big data. Existing unstructured data is inconvenient to process and to extract, so a Web data extraction system based on XML and XPath and a method of using it are needed.
Disclosure of Invention
The invention provides a Web data extraction system based on XML and XPath and a method of using it, which solve the problems in the prior art described above.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The Web data extraction system based on XML and XPath comprises a conversion module for converting webpages in a network database. The conversion module is connected in sequence to an analysis module, an extraction building module, an extraction integration module, a crawling integration module and a pushing module. The extraction building module is connected in sequence to an extraction setting module and an extraction instruction modification module. The analysis module, the extraction setting module and the crawling integration module are all connected to the same storage module.
Preferably, the conversion module converts the target webpage currently being extracted into a well-formed XML webpage; the analysis module parses the converted XML webpage into a DOM tree; the extraction building module extracts the content of the required target Web information points from the parsed DOM tree; the extraction integration module integrates the extracted content into an extraction list; the crawling integration module downloads the corresponding target file content from the network database according to the extraction list, then converts and compresses the downloaded content; and the pushing module pushes the downloaded content to the network database for users to use.
Preferably, the extraction instruction modification module receives extraction mode instructions, and the extraction setting module modifies the extraction mode in the extraction building module according to the instructions received.
The method of using the Web data extraction system based on XML and XPath comprises the following steps:
S1: conversion and parsing of the target webpage
First, the conversion module converts the target webpage currently being extracted into a well-formed XML webpage, and the analysis module parses the converted XML webpage into a DOM tree and passes the DOM tree to the extraction building module;
S2: extraction mode regulation
The operator issues an extraction mode instruction to the extraction instruction modification module, and the extraction setting module then modifies the extraction mode in the extraction building module according to the instruction received;
S3: target content extraction
The extraction building module extracts the content of the required target Web information points from the parsed DOM tree according to the extraction mode set by the extraction setting module, and the extraction integration module then integrates the extracted content into an extraction list;
S4: crawling and pushing of the extracted content
The crawling integration module downloads the corresponding target file content from the network database according to the extraction list, then converts and compresses the downloaded content, and finally the pushing module pushes the downloaded content to the network database for users to use.
In the present invention, the Web data extraction system based on XML and XPath and its method of use convert, extract, crawl and integrate unstructured data, and allow the extraction and crawling rules to be set as needed, so that different crawling requirements can be met. The invention is therefore suitable for processing different types of unstructured data and improves the efficiency and quality of that processing.
Drawings
FIG. 1 is a schematic diagram of a Web data extraction system based on XML and XPath and a method for using the same;
FIG. 2 is a schematic flow chart of a Web data extraction system based on XML and XPath and a relation between an extraction module and a crawler module in a using method thereof;
FIG. 3 is a schematic flow chart of the XML and XPath based Web data extraction system and method of use thereof according to the present invention;
FIG. 4 is a flowchart illustrating a Web page collection procedure using HTTP protocol in a Web crawler in the XML and XPath based Web data extraction system and the method for using the same according to the present invention;
FIG. 5 is a schematic diagram of a Web data extraction system based on XML and XPath and a method for using the same according to the present invention;
FIG. 6 is a schematic diagram of the Web crawler program flow in the Web data extraction system based on XML and XPath and the method for using the same according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
Referring to fig. 1, the Web data extraction system based on XML and XPath includes a conversion module for converting webpages in a network database. The conversion module is connected in sequence to an analysis module, an extraction building module, an extraction integration module, a crawling integration module and a pushing module. The extraction building module is connected in sequence to an extraction setting module and an extraction instruction modification module. The analysis module, the extraction setting module and the crawling integration module are all connected to the same storage module.
The conversion module converts the target webpage currently being extracted into a well-formed XML webpage; the analysis module parses the converted XML webpage into a DOM tree; the extraction building module extracts the content of the required target Web information points from the parsed DOM tree; the extraction integration module integrates the extracted content into an extraction list; the crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list, then converts and compresses that content; and the pushing module pushes the downloaded and crawled content to the network database for users to use.
Further, the extraction instruction modification module receives extraction mode instructions, and the extraction setting module modifies the extraction mode in the extraction building module according to the instructions received.
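As a rough illustration of this module chain, the sketch below models each module as a function and lets one stage be swapped at run time, which is the role the text gives to the extraction setting module. The class name `ExtractionPipeline`, the method `set_stage` and the toy stages are assumptions made for this sketch and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stage type: each module consumes the previous module's output.
Stage = Callable[[object], object]


@dataclass
class ExtractionPipeline:
    """Runs stages in order: conversion, parsing, extraction, integration."""
    stages: List[Stage] = field(default_factory=list)

    def set_stage(self, index: int, stage: Stage) -> None:
        # Stands in for the extraction setting module: replace one stage in place.
        self.stages[index] = stage

    def run(self, page: object) -> object:
        result = page
        for stage in self.stages:
            result = stage(result)
        return result


# Toy stand-ins for the real modules (illustrative assumptions only).
pipeline = ExtractionPipeline([
    str.strip,                                        # "conversion": tidy the raw page
    str.split,                                        # "parsing": break it into tokens
    lambda toks: [t for t in toks if t.isdigit()],    # "extraction": current mode
    sorted,                                           # "integration": build the list
])
```

Calling `pipeline.set_stage(2, ...)` with a different filter changes what is extracted without touching the other stages.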
The method of using the Web data extraction system based on XML and XPath comprises the following steps:
S1: conversion and parsing of the target webpage
First, the conversion module converts the target webpage currently being extracted into a well-formed XML webpage, and the analysis module parses the converted XML webpage into a DOM tree and passes the DOM tree to the extraction building module.
Specifically, HTML with irregular tags is converted into XML: the HTML page is preprocessed (optimized and corrected) with the SgmlReader tool so that it conforms to the XML standard, a DOM tree is then built with an XML parsing package, and finally the DOM tree is processed through the interfaces defined by the W3C, so that operations on the HTML document become operations on the DOM. The process is as follows:
Sample learning stage:
The HTML document is converted into XML (in practice XHTML) format using SgmlReader, and the formatted document is then converted into a DOM tree;
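The conversion step can be sketched in Python as follows. SgmlReader is a .NET tool, so this sketch substitutes the standard-library `html.parser` module, which is a different technique with the same goal: turning sloppy HTML into a single-rooted, well-formed tree. The class `HtmlToTree`, the function `html_to_tree` and the synthetic `document` root are assumptions of the sketch, and it does not implement HTML's implied end-tag rules (an unclosed `<p>` simply nests the next one).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlToTree(HTMLParser):
    """Builds a well-formed element tree from possibly irregular HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")   # synthetic root keeps the tree single-rooted
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attrib = {name: (value or "") for name, value in attrs}
        node = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                       # text after a child belongs to its tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(html: str) -> ET.Element:
    parser = HtmlToTree()
    parser.feed(html)
    parser.close()
    return parser.root
```

The result can be serialized with `ET.tostring` to obtain XML text, or queried directly as a tree.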
Extraction rule definition stage:
The user marks specific data items in the DOM tree visually, and the system processes each marked item as follows: it first generates an XPath expression for the Web data item, then associates that expression with the corresponding destination table field to complete the definition of a mapping rule; the other fields are processed in the same way. The definitions of the extraction rules are stored in the IERF file of the webpage cluster;
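A mapping rule of this kind can be pictured as a pair of destination field and XPath expression. The sketch below serializes such pairs into a small XML document. The patent does not describe the layout of an IERF file, so the element names `rules` and `rule`, the attributes `cluster` and `field`, and the sample XPath expressions are all invented for illustration.

```python
import xml.etree.ElementTree as ET


def build_rule_file(cluster: str, mappings: dict) -> str:
    """Serialize field -> XPath mappings as a small XML rule document.

    The element and attribute names are hypothetical; the source does not
    specify how an IERF file is laid out.
    """
    root = ET.Element("rules", {"cluster": cluster})
    for field_name, xpath in mappings.items():
        rule = ET.SubElement(root, "rule", {"field": field_name})
        rule.text = xpath
    return ET.tostring(root, encoding="unicode")


# Example: two destination table fields mapped to XPath expressions.
rule_xml = build_rule_file("news-article", {
    "title": ".//div[@id='main']/h1",
    "author": ".//span[@class='byline']",
})
```

Keeping one rule document per cluster of similarly structured pages means the same mappings can be reused for every page in that cluster.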
Data extraction stage:
The IERF file is parsed, and if the webpage to be extracted satisfies the extraction rules, the content nodes specified by the rules are extracted;
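The extraction stage might look like the following sketch, which loads a rule document, checks that every rule matches the page, and returns one record. The rule layout (`rules`/`rule` elements with a `field` attribute) is a hypothetical stand-in for the IERF format, and `xml.etree.ElementTree` supports only a subset of XPath, so the expressions are kept simple.

```python
import xml.etree.ElementTree as ET


def extract_record(page: ET.Element, rule_xml: str):
    """Apply every rule to the page; return None if the page does not fit."""
    record = {}
    for rule in ET.fromstring(rule_xml):
        node = page.find(rule.text)           # limited XPath subset
        if node is None:
            return None                       # page does not satisfy the rule set
        record[rule.get("field")] = "".join(node.itertext()).strip()
    return record


# Hypothetical rule document and sample pages for illustration.
RULES = """<rules cluster="news-article">
  <rule field="title">.//div[@id='main']/h1</rule>
  <rule field="author">.//span[@class='byline']</rule>
</rules>"""

PAGE = ET.fromstring(
    "<html><body><div id='main'><h1>Flood <em>warning</em></h1>"
    "<span class='byline'>A. Reporter</span></div></body></html>"
)

OTHER_PAGE = ET.fromstring("<html><body><div id='main'><h1>No byline</h1></div></body></html>")
```

Here `extract_record(PAGE, RULES)` yields a title and an author, while `OTHER_PAGE` is rejected because it lacks the byline node.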
S2: extraction mode regulation
The operator issues an extraction mode instruction to the extraction instruction modification module, and the extraction setting module then modifies the extraction mode in the extraction building module according to the instruction received;
Specifically, Wrapper rules are used to extract information from the parsed DOM tree; this step covers setting and modifying both the region rules and the in-region data rules of the Wrapper;
Webpage data is extracted with Wrapper rules, using extraction rules that combine reference text with the DOM. For the extraction of data from a single webpage record, the relation between the extraction module and the crawler module is shown in fig. 2:
wherein the region rule is represented as follows:
A region rule locates the data region within the webpage content. Text that rarely changes in the webpage is used as a reference point, so that the position of the data region, namely the nearest parent node of the sub-information, can be located quickly. This avoids dependence on the webpage structure, so the extraction rule is not affected by dynamic changes to that structure. When the data region cannot easily be located uniquely, the number of reference texts can be increased. The process is described in pseudocode in fig. 3;
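Since fig. 3 is not reproduced here, the following is only one possible reading of the region rule: return the innermost container elements whose subtree contains every reference text. The function name `locate_regions` and the sample page are assumptions of this sketch, not the patent's pseudocode.

```python
import xml.etree.ElementTree as ET


def locate_regions(root: ET.Element, reference_texts):
    """Return the innermost container elements that contain every reference text."""
    def contains_all(el):
        text = "".join(el.itertext())
        return all(ref in text for ref in reference_texts)

    # Candidate regions are elements with children whose text covers all anchors.
    matches = [el for el in root.iter() if len(el) and contains_all(el)]
    # Keep only the innermost candidates: drop any that enclose another candidate.
    return [el for el in matches
            if not any(d is not el and d in matches for d in el.iter())]


# Sample page: the word "Price" alone is ambiguous, "Price" plus "Stock" is not.
PAGE = ET.fromstring(
    "<body>"
    "<div id='ads'><p>Price</p><p>Buy now</p></div>"
    "<div id='list'><p>Price</p><p>Stock</p><ul><li>12.50</li></ul></div>"
    "</body>"
)
```

With the single reference text `"Price"` two regions are returned, and adding `"Stock"` narrows the result to the `list` block, which mirrors the remark about increasing the number of reference texts.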
In-region data rule representation
After the data region has been located, the data in it is extracted. No absolute path from the root node of the webpage to the child node is needed; a relative path within the DOM subtree is used instead, so the path is relatively short, robust, and only weakly dependent on the page structure. All nodes are traversed in a right-order traversal, and the text data contained in the subtree is extracted. If only part of the string in a data field needs to be extracted, text feature pattern matching can be used;
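A minimal sketch of an in-region data rule follows: a path relative to the region root selects the nodes, their subtree text is collected, and an optional regular expression keeps only part of the string. The function `extract_in_region` and the sample data are hypothetical. Note one assumption in particular: `itertext` walks the subtree in document order, whereas the text above names a "right-order" traversal without defining it.

```python
import re
import xml.etree.ElementTree as ET


def extract_in_region(region: ET.Element, relative_path: str, pattern=None):
    """Collect text from nodes matched by a path relative to the region root."""
    values = []
    for node in region.findall(relative_path):
        # Gather all text in the node's subtree (document order).
        text = " ".join(part.strip() for part in node.itertext() if part.strip())
        if pattern is not None:
            match = re.search(pattern, text)
            if match is None:
                continue                      # field does not fit the text pattern
            text = match.group(1) if match.groups() else match.group(0)
        values.append(text)
    return values


# Sample region, as it might be returned by a region rule.
REGION = ET.fromstring(
    "<div id='list'><ul>"
    "<li><b>Widget</b> costs USD 12.50</li>"
    "<li><b>Gadget</b> costs USD 7.00</li>"
    "</ul></div>"
)
```

Because the path `ul/li` starts at the region rather than at the page root, changes elsewhere in the page layout do not invalidate it.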
S3: target content extraction
The extraction building module extracts the content of the required target Web information points from the parsed DOM tree according to the extraction mode set by the extraction setting module; the Web crawler crawls the content so extracted; and the extraction integration module then integrates the extracted content into an extraction list;
The structure of the web crawler is shown in fig. 4;
URL queue: the URL records in the URL queue come from two sources. One is the seed URLs, which are mainly webpage links predefined by the user; the other is the URLs that the crawler continually obtains from the webpages it crawls. After the crawler program starts, it begins grabbing from the seed URLs and follows the first-in, first-out principle of a queue. This supports a breadth-first grabbing strategy, avoids the topic drift to which depth-first strategies are prone, and improves the topical relevance of the grabbed webpages.
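The first-in, first-out behavior can be sketched with a double-ended queue. The names `UrlQueue`, `crawl_order` and `links_of` are assumptions of this sketch; `links_of` stands in for fetching a page and extracting its links, and the built-in `seen` set is a convenience that the text assigns to the URL analysis step.

```python
from collections import deque


class UrlQueue:
    """FIFO frontier: seed URLs first, then links in the order they were found."""

    def __init__(self, seeds):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.push(url)

    def push(self, url):
        if url not in self._seen:             # never enqueue the same URL twice
            self._seen.add(url)
            self._queue.append(url)

    def pop(self):
        return self._queue.popleft()          # oldest entry first: breadth-first order

    def __len__(self):
        return len(self._queue)


def crawl_order(seeds, links_of):
    """Return the visiting order; links_of(url) replaces real fetching and parsing."""
    queue, order = UrlQueue(seeds), []
    while len(queue):
        url = queue.pop()
        order.append(url)
        for link in links_of(url):
            queue.push(link)
    return order


# Toy link graph used in place of the Web.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
```

Every page one link away from a seed is visited before any page two links away, which is what keeps the crawl close to the seed topic.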
Protocol processor: this layer is the basis of the web crawler. It sits at the bottom of the whole crawler system and is mainly responsible for collecting webpage data over various network protocols. Typical network protocols include HTTP, HTTPS and FTP; because HTTP is currently the dominant protocol, the web crawler designed here supports data transmission over HTTP only. The webpage collection procedure using HTTP is briefly as follows:
a, parse the address and port number of the site from the URL and establish a connection to it;
b, assemble the HTTP request header and send it to the target site. If no response is received within a certain period of time, give up grabbing the page; if a response is received, analyze it and go to the next step;
c, judge the response mainly by its status code. A 2XX code indicates that the page was returned correctly. A 301 or 302 code indicates that the page has been redirected; a new URL must be extracted from the response header, and the procedure returns to the previous step. Any other code (e.g., 404, meaning the webpage was not found) indicates that the link failed; the crawler keeps trying to grab the URL and abandons it after three failures;
d, read webpage information such as the type and length of the page from the response header;
e, acquire the page content.
The specific flow is shown in fig. 5;
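Steps a to e can be sketched as a small control loop. To keep the example runnable without a network, the transport is passed in as a function `send(host, port, url)` that returns `(status, headers, body)`; that signature, the names `endpoint`, `collect` and `stub_send`, the plain-dict headers and the cap on redirects are all assumptions of this sketch rather than details from the patent.

```python
from urllib.parse import urlsplit, urljoin

MAX_ATTEMPTS = 3      # abandon a URL after three failures, as in step c
MAX_REDIRECTS = 5     # assumption: the source does not cap redirects


def endpoint(url):
    """Step a: derive the site's host and port from the URL."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def collect(url, send):
    """Steps b to e; send may raise TimeoutError when no response arrives."""
    failures = redirects = 0
    while failures < MAX_ATTEMPTS and redirects <= MAX_REDIRECTS:
        host, port = endpoint(url)
        try:
            status, headers, body = send(host, port, url)
        except TimeoutError:
            return None                       # no response in time: give up on the page
        if 200 <= status < 300:
            info = {"type": headers.get("Content-Type"),
                    "length": headers.get("Content-Length")}
            return url, info, body            # steps d and e
        if status in (301, 302) and "Location" in headers:
            url = urljoin(url, headers["Location"])
            redirects += 1
            continue                          # back to the request step
        failures += 1                         # e.g. 404: retry until three failures
    return None


# Stub transport for demonstration; a real one would open a connection to (host, port).
STUB_SITE = {
    "http://example.com/old": (302, {"Location": "/new"}, b""),
    "http://example.com/new": (200, {"Content-Type": "text/html",
                                     "Content-Length": "5"}, b"hello"),
}
attempts = []


def stub_send(host, port, url):
    attempts.append(url)
    return STUB_SITE.get(url, (404, {}, b""))
```

Injecting the transport keeps the status-code logic separate from socket handling, so the redirect and retry paths can be exercised without touching a real site.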
URL analysis: this component obtains the semantic information of tags such as Meta or HREF from newly grabbed webpages, obtains the URLs, and filters the newly obtained URLs. Filtering mainly means deleting URLs whose targets are pictures, sounds, videos or advertisements. It also plays an important role in avoiding repeated grabbing: the newly obtained URLs are compared with the Visited queue (history table), and any URL found to have been visited already is deleted.
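The filtering described here might be sketched as follows. Only `href` attributes of anchor tags are handled; Meta tags and advertisement detection are left out. The extension list, the names `LinkCollector` and `new_urls`, and the sample data are assumptions of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
import posixpath

# Assumed set of media extensions to discard; the source gives no list.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".mp3", ".wav", ".mp4", ".avi"}


class LinkCollector(HTMLParser):
    """Collects href attribute values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def new_urls(page_url, html, visited):
    """Return absolute, non-media URLs from the page that are not in visited."""
    collector = LinkCollector()
    collector.feed(html)
    fresh = []
    for href in collector.links:
        url = urljoin(page_url, href)         # resolve relative links
        if urlsplit(url).scheme not in ("http", "https"):
            continue
        extension = posixpath.splitext(urlsplit(url).path)[1].lower()
        if extension in MEDIA_EXTENSIONS or url in visited or url in fresh:
            continue
        fresh.append(url)
    return fresh


SAMPLE_HTML = (
    '<a href="/a.html">A</a><a href="pic.JPG">P</a><a href="/b.html">B</a>'
    '<a href="/a.html">again</a><a href="mailto:x@example.com">mail</a>'
)
VISITED = {"http://example.com/b.html"}
```

Only `/a.html` survives in this sample: the picture is dropped by extension, `/b.html` is already in the history set, and the repeated and non-HTTP links are discarded.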
The web crawler process flow is generally as shown in fig. 6;
S4: crawling and pushing of the extracted content
The crawling integration module downloads the corresponding target file content from the network database according to the extraction list, then converts and compresses the downloaded content, and finally the pushing module pushes the downloaded content to the network database for users to use.
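The text does not say which conversion or compression format is used. Purely as an illustration, the sketch below assumes the downloaded content is text, converts it to UTF-8 bytes and compresses it with gzip; the function names `pack` and `unpack` are hypothetical.

```python
import gzip


def pack(content: str) -> bytes:
    """Convert text to UTF-8 bytes and gzip-compress them (assumed formats)."""
    return gzip.compress(content.encode("utf-8"))


def unpack(blob: bytes) -> str:
    """Reverse of pack: decompress and decode back to text."""
    return gzip.decompress(blob).decode("utf-8")
```

A lossless round trip matters here because the pushed content is meant to be consumed by users later.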
This design converts, extracts, crawls and integrates unstructured data, and allows the extraction and crawling rules to be set as needed, so that different crawling requirements can be met. It is therefore suitable for processing different types of unstructured data and improves the efficiency and quality of that processing.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall be covered by the scope of protection of the present invention.

Claims (1)

1. The Web data extraction system based on XML and XPath comprises a conversion module for converting webpages in a network database, and is characterized in that the conversion module is sequentially connected with an analysis module, an extraction building module, an extraction integration module, a crawling integration module and a pushing module, the extraction building module is sequentially connected with an extraction setting module and an extraction instruction modification module, and the analysis module, the extraction setting module and the crawling integration module are all connected with the same storage module;
the conversion module converts the currently extracted target webpage into a canonical XML webpage, the analysis module analyzes the converted XML webpage into a DOM tree, the extraction building module extracts required target Web information point content from the analyzed DOM tree, the extraction integration module integrates the extracted Web information point content to form an extraction list, the crawling integration module downloads and crawls corresponding target file content from a network database according to the extraction list integrated by extraction, and finally converts and compresses the downloaded and crawled corresponding target file content, and the pushing module pushes the downloaded and crawled target file content to the network database for users to use;
the extraction instruction modification module is used for receiving the extraction mode instruction content, and the extraction setting module modifies the extraction mode in the extraction establishing module according to the received extraction mode instruction content;
the application method of the Web data extraction system based on XML and XPath comprises the following steps of:
s1 conversion and analysis of target web pages
Firstly, a conversion module converts a currently extracted target webpage into a canonical XML webpage, and an analysis module analyzes the converted XML webpage into a DOM tree and transmits the DOM tree to an extraction building module;
s2 extraction mode regulation and control
The operator generates extraction mode instruction content to the extraction instruction modification module, and then the extraction setting module modifies the extraction mode in the extraction establishing module according to the received extraction mode instruction content;
specifically, the rule of the Wrapper is adopted to extract information from the parsed DOM tree, wherein the information comprises the regional rule in the rule of the Wrapper and the setting and modification of the data rule in the region;
extracting webpage data by adopting a rule of a Wrapper, and then realizing the extraction of the webpage data by adopting an extraction rule of combining a reference text and a DOM;
the region rule locates the data region in the webpage content, using text information that rarely changes in the webpage as a reference point, so that the position of the data region, namely the nearest parent node of the sub-information, is located quickly; dependence on the webpage structure is avoided, so the extraction rule is not affected by dynamic changes of the webpage structure; and the problem that the data region is not easy to locate uniquely is solved by increasing the number of reference texts;
after the data area is positioned, the data in the area is extracted, and an absolute path from the root node to the child node of the webpage is not needed, but a relative path of a DOM subtree is adopted, so that the path is relatively short, the robustness is good, and the page structure is not very dependent; traversing all nodes according to a right-order traversing mode, and extracting text data information contained in the subtrees; if only partial character strings of the data field need to be extracted, adopting a text characteristic pattern matching method;
s3 target content extraction
The extraction building module extracts the content of the required target Web information points from the parsed DOM tree according to the extraction mode set by the extraction setting module, the Web crawler crawls the content so extracted, and the extraction integration module then integrates the extracted content into an extraction list;
URL queue: the URL records in the URL queue come from two sources; one is the seed URLs, which are webpage links predefined by the user; the other is the URLs that the crawler continually obtains from the webpages it crawls; after the crawler program starts, it begins grabbing from the seed URLs and follows the first-in, first-out principle of a queue; this supports a breadth-first grabbing strategy, avoids the topic drift to which depth-first grabbing strategies are prone, and improves the topical relevance of the grabbed webpages;
protocol processor: the protocol processor is the basis of the web crawler, sits at the bottom of the whole crawler system, and is responsible for collecting webpage data over various network protocols; the web crawler supports data transmission over the HTTP protocol only, and the webpage collection procedure using the HTTP protocol is briefly as follows:
a, parsing the address and port number of the site from the URL and establishing a connection to it;
b, assembling the HTTP request header and sending it to the target site; if no response is received within a certain period of time, giving up grabbing the page; if a response is received, analyzing it and going to the next step;
c, judging by the status code: if it is 2XX, the page was returned correctly; if it is 301 or 302, the page has been redirected, a new URL must be extracted from the response header, and the procedure returns to the previous step; if it is code 404, meaning the webpage was not found, the link has failed; the crawler continues to try grabbing the URL and abandons it after three failures;
d, reading the page information from the response header;
e, acquiring the page content;
URL analysis: obtaining the semantic information of Meta or HREF tags from newly grabbed webpages, obtaining the URLs, and filtering the newly obtained URLs; filtering means deleting URLs whose targets are pictures, sounds or videos, comparing the obtained URLs with the Visited queue, and deleting any URL found to have been visited already, thereby avoiding repeated grabbing;
s4 extraction content crawling pushing
And the crawling integration module downloads the corresponding target file content from the network database according to the extraction list of extraction integration, then converts and compresses the downloaded target file content corresponding to crawling, and finally the pushing module pushes the downloaded target file content to the network database for users to use.
CN202010696786.0A 2020-07-20 2020-07-20 Web data extraction system based on XML and XPath and use method thereof Active CN111859867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010696786.0A CN111859867B (en) 2020-07-20 2020-07-20 Web data extraction system based on XML and XPath and use method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010696786.0A CN111859867B (en) 2020-07-20 2020-07-20 Web data extraction system based on XML and XPath and use method thereof

Publications (2)

Publication Number Publication Date
CN111859867A CN111859867A (en) 2020-10-30
CN111859867B CN111859867B (en) 2024-03-12

Family

ID=73001005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010696786.0A Active CN111859867B (en) 2020-07-20 2020-07-20 Web data extraction system based on XML and XPath and use method thereof

Country Status (1)

Country Link
CN (1) CN111859867B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256708B (en) * 2020-12-22 2021-04-30 远光软件股份有限公司 Method, device, terminal and storage medium for acquiring and storing text content

Citations (4)

Publication number Priority date Publication date Assignee Title
CA2538504A1 (en) * 2006-03-03 2007-09-03 Watchfire Corporation Method and system for obtaining script related information for website crawling
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
CN109657121A (en) * 2018-12-09 2019-04-19 佛山市金穗数据服务有限公司 A kind of Web page information acquisition method and device based on web crawlers
WO2019237547A1 (en) * 2018-06-11 2019-12-19 平安科技(深圳)有限公司 Data crawling method and apparatus, and computer device and storage medium

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CA2538504A1 (en) * 2006-03-03 2007-09-03 Watchfire Corporation Method and system for obtaining script related information for website crawling
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
WO2019237547A1 (en) * 2018-06-11 2019-12-19 平安科技(深圳)有限公司 Data crawling method and apparatus, and computer device and storage medium
CN109657121A (en) * 2018-12-09 2019-04-19 佛山市金穗数据服务有限公司 A kind of Web page information acquisition method and device based on web crawlers

Non-Patent Citations (1)

Title
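Design and Implementation of an XPath-Based News Information Extraction *** (基于XPath 的新闻信息抽取***设计与实现); 阮娟; 《智能计算机与应用》 (Intelligent Computer and Applications); Vol. 5 (No. 2); pp. 58-61 *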

Also Published As

Publication number Publication date
CN111859867A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
US8185530B2 (en) Method and system for web document clustering
CN109543086B (en) Network data acquisition and display method oriented to multiple data sources
US8321396B2 (en) Automatically extracting by-line information
CN107273409B (en) Network data acquisition, storage and processing method and system
US7401287B2 (en) Device, method, and computer program product for generating information of link structure of documents
KR100461019B1 (en) web contents transcoding system and method for small display devices
US7941420B2 (en) Method for organizing structurally similar web pages from a web site
US7693903B2 (en) Method for gathering and summarizing internet information
US7379932B2 (en) System and a method for focused re-crawling of Web sites
US7055094B2 (en) Virtual tags and the process of virtual tagging utilizing user feedback in transformation rules
US20120066380A1 (en) Update notification method and system
CN102073726B (en) Structured data import method and device for search engine system
US20060101332A1 (en) Virtual tags and the process of virtual tagging
KR102222287B1 (en) Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL
US20130232424A1 (en) User operation detection system and user operation detection method
CN102662969A (en) Internet information object positioning method based on webpage structure semantic meaning
US20180232410A1 (en) Refining structured data indexes
US20110219017A1 (en) System and methods for citation database construction and for allowing quick understanding of scientific papers
WO2020101479A1 (en) System and method to detect and generate relevant content from uniform resource locator (url)
CN1841377A (en) Crawling databases for information
US11334592B2 (en) Self-orchestrated system for extraction, analysis, and presentation of entity data
CN111859867B (en) Web data extraction system based on XML and XPath and use method thereof
CN103927367A (en) Microblog acquisition system and method based on events
CN102004805A (en) Webpage denoising system and method based on maximum similarity matching
JPH11134341A (en) System for displaying selection of descriptive information in hyper media description language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant