CN111859867A - Web data extraction system based on XML and XPath and use method thereof - Google Patents

Web data extraction system based on XML and XPath and use method thereof

Info

Publication number
CN111859867A
Authority
CN
China
Prior art keywords
extraction
module
crawling
content
xml
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010696786.0A
Other languages
Chinese (zh)
Other versions
CN111859867B (en)
Inventor
官鲁卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Meicube Engineering Consulting Co Ltd
Original Assignee
Guangxi Meicube Engineering Consulting Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Meicube Engineering Consulting Co Ltd filed Critical Guangxi Meicube Engineering Consulting Co Ltd
Priority to CN202010696786.0A priority Critical patent/CN111859867B/en
Publication of CN111859867A publication Critical patent/CN111859867A/en
Application granted granted Critical
Publication of CN111859867B publication Critical patent/CN111859867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/154 Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data extraction and provides a Web data extraction system based on XML and XPath and a method of using it. The system comprises a conversion module for converting web pages in a network database; the conversion module is connected in sequence to a parsing module, an extraction establishment module, an extraction integration module, a crawling integration module and a pushing module; the extraction establishment module is connected in sequence to an extraction setting module and an extraction instruction modification module; and the parsing module, the extraction setting module and the crawling integration module are all connected to the same storage module. The invention converts, extracts, crawls and integrates unstructured data, and lets extraction and crawling rules be set as required, so that it meets the crawling requirements of different data, suits the processing of different types of unstructured data, and improves the efficiency and quality of unstructured data processing.

Description

Web data extraction system based on XML and XPath and use method thereof
Technical Field
The invention relates to the technical field of data extraction, in particular to a Web data extraction system based on XML and XPath and a use method thereof.
Background
With the advent of the network and information age, data and information flood our lives and permeate every field. In the past few years, large amounts of data have been generated daily by weblogs, social networking sites, online transaction records and other data sources. Whether current database import and retrieval techniques can fully normalize this daily output into a database format, or mine and analyze useful information from massive unstructured data that is valuable but hard to process, has become a focus of current research.
Unstructured data refers to data that cannot conveniently be represented in a two-dimensional logical database table, including text, pictures, XML, HTML, audio and video. More than 85% of the data currently collected from the Internet is unstructured or semi-structured. For example, Baidu processes tens of PB of data per day; Facebook has more than 1 billion registered users, who upload more than 1 billion photos each month, and it generates more than 300 TB of log data each day; Taobao has more than 370 million members and more than 880 million online commodities, handles millions of transactions per day, and generates about 20 TB of data; Yahoo's total storage capacity exceeds 100 PB; and so on. Because of this data volume, and driven by related technologies such as cloud computing and the Internet of Things, the world has gradually entered the "big data" era. Existing unstructured data is inconvenient to process and to extract data from, so a Web data extraction system based on XML and XPath and a method of using it are needed.
Disclosure of Invention
The invention provides a Web data extraction system based on XML and XPath and a method of using it, which solve the problems in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
A Web data extraction system based on XML and XPath comprises a conversion module for converting web pages in a network database. The conversion module is connected in sequence to a parsing module, an extraction establishment module, an extraction integration module, a crawling integration module and a pushing module; the extraction establishment module is connected in sequence to an extraction setting module and an extraction instruction modification module; and the parsing module, the extraction setting module and the crawling integration module are all connected to the same storage module.
Preferably, the conversion module converts the currently extracted target web page into a standard XML web page; the parsing module parses the converted XML web page into a DOM tree; the extraction establishment module extracts the required target Web information point content from the parsed DOM tree; the extraction integration module integrates the extracted Web information point content into an extraction list; the crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list, and then converts and compresses that content; and the pushing module pushes the downloaded and crawled target file content to the network database for use by users.
Preferably, the extraction instruction modification module is configured to receive extraction mode instruction content, and the extraction setting module modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content.
The method of using the Web data extraction system based on XML and XPath comprises the following steps:
S1 Conversion and parsing of the target web page
First, the conversion module converts the currently extracted target web page into a standard XML web page, and the parsing module parses the converted XML web page into a DOM tree and passes the DOM tree to the extraction establishment module;
S2 Extraction mode control
An operator issues extraction mode instruction content to the extraction instruction modification module, and the extraction setting module then modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content;
S3 Target content extraction
The extraction establishment module extracts the required target Web information point content from the parsed DOM tree according to the extraction mode set by the extraction setting module, and the extraction integration module then integrates the extracted Web information point content into an extraction list;
S4 Crawling and pushing the extracted content
The crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list produced by the extraction integration module, then converts and compresses that content, and finally the pushing module pushes the downloaded and crawled target file content to the network database for use by users.
In the present invention, the Web data extraction system based on XML and XPath and the method of using it convert, extract, crawl and integrate unstructured data, and let extraction and crawling rules be set as required, so that the invention meets the crawling requirements of different data, suits the processing of different types of unstructured data, and improves the efficiency and quality of unstructured data processing.
Drawings
Fig. 1 is a schematic structural diagram of a Web data extraction system based on XML and XPath and a method for using the same according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Referring to Fig. 1, the Web data extraction system based on XML and XPath includes a conversion module for converting web pages in a network database. The conversion module is connected in sequence to a parsing module, an extraction establishment module, an extraction integration module, a crawling integration module and a pushing module; the extraction establishment module is connected in sequence to an extraction setting module and an extraction instruction modification module; and the parsing module, the extraction setting module and the crawling integration module are all connected to the same storage module.
Specifically, the conversion module converts the currently extracted target web page into a standard XML web page; the parsing module parses the converted XML web page into a DOM tree; the extraction establishment module extracts the required target Web information point content from the parsed DOM tree; the extraction integration module integrates the extracted Web information point content into an extraction list; the crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list, and then converts and compresses that content; and the pushing module pushes the downloaded and crawled target file content to the network database for use by users.
Further, the extraction instruction modification module receives extraction mode instruction content, and the extraction setting module modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content.
The use method of the Web data extraction system based on XML and XPath comprises the following steps:
S1 Conversion and parsing of the target web page
First, the conversion module converts the currently extracted target web page into a standard XML web page, and the parsing module parses the converted XML web page into a DOM tree and passes the DOM tree to the extraction establishment module.
Specifically, HTML with irregular tags is converted into XML: the HTML page is preprocessed (tidied and corrected) with the SgmlReader tool so that it conforms to the XML standard, a DOM tree is then built with an XML parsing package, and finally the DOM tree is processed through the interfaces defined by the W3C, so that operations on the HTML document become operations on the DOM. The process is as follows:
Sample learning stage:
The HTML document is converted into XML (in practice XHTML) format using SgmlReader, and the formatted document is then converted into a DOM tree;
Extraction rule definition stage:
The user marks specific data items in the DOM tree visually, and the system then does the following internally: it first generates an XPath expression for the Web data item, then associates the expression with the corresponding field of the target table to complete the definition of the mapping rule, and handles the other fields in the same way. The mapping rules for each field of the target table and the definition of the extraction rules are stored in the IERF file for the clustered web pages;
Data extraction stage:
The IERF file is parsed, and if the web page to be extracted satisfies the extraction rule, the content nodes specified by the rule are extracted;
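The three stages above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: it does not use SgmlReader but swaps in the standard library's `html.parser` to re-emit loosely tagged HTML as well-formed XML, and it uses `xml.etree.ElementTree`, which supports only a subset of XPath. The names `html_to_xml`, `MAPPING_RULES` and `extract_record` are hypothetical, and the dictionary merely stands in for the IERF file, whose format the text does not describe.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _XmlBuilder(HTMLParser):
    """Re-emit loosely tagged HTML as well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # names of elements that are currently open

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            f" {name}={quoteattr(value or '')}" for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every element left open inside the one being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        while self.stack:  # close whatever the page never closed
            self.out.append(f"</{self.stack.pop()}>")


def html_to_xml(html):
    """Sample learning stage: HTML text in, well-formed XML text out."""
    builder = _XmlBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.out)


# Rule definition stage: one XPath expression per target-table field
# (hypothetical stand-in for the IERF file).
MAPPING_RULES = {
    "title": ".//div[@class='news']/h1",
    "date": ".//span[@class='date']",
}


def extract_record(xml_text, rules):
    """Data extraction stage: apply each mapping rule to the DOM tree."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, xpath in rules.items():
        node = root.find(xpath)
        record[field] = None if node is None else (node.text or "").strip()
    return record


SAMPLE_HTML = """<html><body>
<div class="news"><h1>Title text<p>Body text<br>
<span class="date">2020-07-20</div>
</body></html>"""

record = extract_record(html_to_xml(SAMPLE_HTML), MAPPING_RULES)
```

The converter only guarantees well-formedness for pages with a single root element and no duplicate attributes; a production tidy step would need to handle more cases.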
S2 Extraction mode control
An operator issues extraction mode instruction content to the extraction instruction modification module, and the extraction setting module then modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content.
Specifically, information is extracted from the parsed DOM tree using a Wrapper rule; this includes setting and modifying the region rule and the intra-region data rule within the Wrapper rule.
The web page data is extracted with the Wrapper rule, using an extraction rule that combines reference text with the DOM. For the extraction of a single record from a single web page, the relationship between the extraction module and the crawler module is as follows:
[Figure: relationship between the extraction module and the crawler module]
Here, the region rule is a way of locating the data region within the web page content. Text in the page that is unlikely to change is used as a reference point, so the data region can be located quickly as the nearest parent node that contains that text. This avoids dependence on the page structure, so the extraction rule is not affected by dynamic changes to it. If the data region cannot be located uniquely, the number of reference texts can be increased as appropriate. The process is described in pseudocode as:
[Figure: pseudocode for the region rule]
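The pseudocode figure is not reproduced in this text, so the following is only a sketch of one plausible reading of the region rule, assuming that the data region is the deepest element whose subtree contains every reference text. The function name `locate_region` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def _find_anchor(root, parent, ref):
    """Return the element that directly holds the reference text, if any."""
    for el in root.iter():
        if ref in (el.text or ""):
            return el
        # Tail text sits after el but belongs to el's parent element.
        if ref in (el.tail or "") and el in parent:
            return parent[el]
    return None


def locate_region(root, reference_texts):
    """Return the deepest element containing all reference texts, or None."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}

    def ancestors(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain  # the node itself first, the root last

    anchors = [_find_anchor(root, parent, ref) for ref in reference_texts]
    if not anchors or any(a is None for a in anchors):
        return None
    common = ancestors(anchors[0])
    for anchor in anchors[1:]:
        others = set(ancestors(anchor))
        common = [node for node in common if node in others]
    return common[0] if common else None


PAGE = ET.fromstring(
    "<html><body>"
    "<div id='nav'><a>Home</a></div>"
    "<div id='main'><table>"
    "<tr><td>Price</td><td>12.50</td></tr>"
    "<tr><td>Stock</td><td>40</td></tr>"
    "</table></div>"
    "</body></html>"
)

region = locate_region(PAGE, ["Price", "Stock"])
```

Because the region is found from the text rather than from a fixed path, wrapping the table in extra `div` elements would not change the result; adding a further reference text narrows or widens the region as the paragraph above suggests.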
Intra-region data rule representation
Once the data region has been located, the data inside it is extracted using a path relative to the DOM subtree rather than an absolute path from the root node of the page down to a child node. The path is therefore relatively short and robust, and depends little on the page structure. All nodes are traversed in right-order traversal mode, and the text data contained in the subtree is extracted. If only part of the character string of a data field is needed, text feature pattern matching can be used;
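A data rule of this kind can be sketched as a relative path plus an optional text pattern. The example below is an assumption-laden illustration: `extract_field` and `subtree_text` are hypothetical names, the relative paths use the XPath subset supported by `xml.etree.ElementTree`, and the subtree text is collected in document order via `itertext()`, which may differ from the traversal order the description intends.

```python
import re
import xml.etree.ElementTree as ET

# A data region that has already been located (hypothetical sample).
REGION = ET.fromstring(
    "<table>"
    "<tr><td>Price</td><td>USD 12.50 per unit</td></tr>"
    "<tr><td>Stock</td><td>40</td></tr>"
    "</table>"
)


def subtree_text(node):
    """Collect all non-empty text fragments of a subtree, in document order."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def extract_field(region, relative_path, pattern=None):
    """Follow a path relative to the region, then optionally match a pattern."""
    node = region.find(relative_path)
    if node is None:
        return None
    text = subtree_text(node)
    if pattern is None:
        return text
    match = re.search(pattern, text)
    return match.group(0) if match else None


full_value = extract_field(REGION, "tr[1]/td[2]")
price_only = extract_field(REGION, "tr[1]/td[2]", r"\d+\.\d+")
```

The path `tr[1]/td[2]` is two steps long regardless of how deeply the table is nested in the page, which is the robustness argument made above.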
S3 Target content extraction
The extraction establishment module extracts the required target Web information point content from the parsed DOM tree according to the extraction mode set by the extraction setting module, using a Web crawler to crawl the target Web information point content, and the extraction integration module then integrates the extracted Web information point content into an extraction list.
The structure of the web crawler is as follows:
[Figure: structure of the web crawler]
URL queue: URL records in the queue come from two sources. One is the seed URLs, which are mainly web page links defined by the user in advance; the other is the URLs that the crawler keeps discovering in the pages it fetches. After it starts, the crawler begins fetching from the seed URLs, following the queue's first-in, first-out principle. This makes a breadth-first crawling strategy easy to implement, avoids the tendency of depth-first crawling to drift off topic, and improves the topical relevance of the fetched pages.
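A first-in, first-out URL queue that yields breadth-first order can be sketched with `collections.deque`. The class name `UrlQueue`, the `crawl_order` helper and the in-memory `LINKS` graph are hypothetical; the graph replaces real page fetching so that the ordering can be seen in isolation, and the duplicate check is added here for convenience.

```python
from collections import deque


class UrlQueue:
    """FIFO queue fed by seed URLs and by URLs discovered while crawling."""

    def __init__(self, seed_urls):
        self._queue = deque()
        self._seen = set()  # every URL ever queued, to avoid duplicates
        for url in seed_urls:
            self.add(url)

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        return self._queue.popleft() if self._queue else None


# Hypothetical link graph: page -> links found on that page.
LINKS = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "seed"], "c": []}


def crawl_order(seeds, links):
    """Return the order in which pages would be visited."""
    queue = UrlQueue(seeds)
    order = []
    while (url := queue.next_url()) is not None:
        order.append(url)
        for found in links.get(url, []):
            queue.add(found)
    return order


order = crawl_order(["seed"], LINKS)
```

Both pages linked from the seed are visited before the page that is two links away, which is the breadth-first behaviour the queue is meant to give.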
Protocol processor: this layer is the foundation of the web crawler. It sits at the bottom of the whole crawler system and is mainly responsible for fetching web page data over various network protocols. Common network protocols are HTTP, HTTPS and FTP; because HTTP is currently the dominant one, the web crawler designed here supports data transfer over HTTP only. The steps for collecting a web page over HTTP are briefly as follows:
a, parse the site address and port number from the URL and establish a connection to them;
b, assemble the HTTP (Hypertext Transfer Protocol) request header and send it to the target site. If no response is received within a set time, give up fetching the page; if a response is received, parse it and go to the next step;
c, check the status code. A 2XX code means the page was returned correctly. A 301 or 302 code means the page has been redirected: extract the new URL from the response header and return to the previous step. Any other code (for example 404, page not found) means the link has failed; the crawler then retries the URL and abandons it after three failed attempts;
d, read page information such as the page type and length from the response header;
e, fetch the page content.
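Steps a to e can be sketched as follows. This is an illustrative outline, not the crawler described in the patent: the transport is injected as a hypothetical `send(url)` callable returning `(status, headers, body)` so that the control flow runs without network access, `max_redirects` is an added safety limit that the text does not mention, and `fake_send` is a stand-in server used only for demonstration.

```python
from urllib.parse import urljoin, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Step a: work out which host and port to connect to."""
    parts = urlsplit(url)
    return parts.hostname, parts.port or DEFAULT_PORTS.get(parts.scheme, 80)


def fetch_page(url, send, max_attempts=3, max_redirects=5):
    """Steps b to e, with send(url) standing in for the real HTTP exchange."""
    attempts = 0
    redirects = 0
    while attempts < max_attempts:
        try:
            status, headers, body = send(url)
        except TimeoutError:
            return None  # step b: no response in time, give up the page
        if 200 <= status < 300:
            # Steps d and e: page type and length from the headers, then content.
            return {
                "url": url,
                "type": headers.get("Content-Type"),
                "length": headers.get("Content-Length"),
                "body": body,
            }
        if status in (301, 302) and "Location" in headers:
            redirects += 1
            if redirects > max_redirects:
                return None
            # Step c: take the new URL from the response header and go back.
            url = urljoin(url, headers["Location"])
            continue
        attempts += 1  # step c: failed link, retry up to three times
    return None


def fake_send(url):
    """Hypothetical server used only to exercise the control flow."""
    if url == "http://example.com/old":
        return 301, {"Location": "/new"}, b""
    if url == "http://example.com/new":
        return 200, {"Content-Type": "text/html", "Content-Length": "5"}, b"hello"
    return 404, {}, b""


page = fetch_page("http://example.com/old", fake_send)
```

In a real crawler `send` would wrap an HTTP client with a timeout, and header names would need case-insensitive lookup; both are left out to keep the sketch short.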
The specific flow is as follows:
[Figure: flow of web page collection over HTTP]
URL analysis: this component is mainly responsible for obtaining URLs from newly fetched pages, using semantic information marked up in Meta or HREF, and for filtering them. Filtering mainly means deleting URLs that point to pictures, sound, video or advertisements. Another important task is to compare the extracted URLs with the visited queue (history table) and delete any that have already been accessed, so that pages are not fetched twice.
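The URL analysis step can be sketched with the standard library. The sketch makes several assumptions: only `href` attributes of `a` tags are read (Meta information is ignored), media files are recognized by a hypothetical list of file suffixes, advertisement filtering is left out because the text gives no criterion for it, and `new_urls` is an illustrative name.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical list of suffixes treated as pictures, sound or video.
SKIPPED_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4", ".avi")


class _HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def new_urls(page_url, html, visited):
    """Return absolute, filtered URLs from the page that are not yet visited."""
    collector = _HrefCollector()
    collector.feed(html)
    result = []
    for href in collector.hrefs:
        # Resolve relative links and drop the #fragment part.
        url = urldefrag(urljoin(page_url, href)).url
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # e.g. mailto: or javascript: links
        if parts.path.lower().endswith(SKIPPED_SUFFIXES):
            continue  # picture, sound or video file
        if url in visited or url in result:
            continue  # already fetched, or repeated on this page
        result.append(url)
    return result


SAMPLE = (
    '<a href="/a.html">A</a>'
    '<a href="pic.JPG">P</a>'
    '<a href="http://example.com/seen">S</a>'
    '<a href="/a.html#top">dup</a>'
    '<a href="mailto:x@example.com">M</a>'
)

found = new_urls("http://example.com/index.html", SAMPLE, {"http://example.com/seen"})
```

Here `visited` plays the role of the history table; in the crawler it would be the set of URLs that have already been fetched.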
The web crawler program flow is roughly as follows:
[Figure: web crawler program flow]
S4 Crawling and pushing the extracted content
The crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list produced by the extraction integration module, then converts and compresses that content, and finally the pushing module pushes the downloaded and crawled target file content to the network database for use by users.
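The text does not say which conversion or compression format is used. As a minimal sketch, assuming the crawled content is a list of records, JSON serialization and gzip compression could fill those two roles; `package_content` and `unpack_content` are hypothetical names.

```python
import gzip
import json


def package_content(records):
    """Convert records to UTF-8 JSON (assumed format) and gzip-compress them."""
    payload = json.dumps(records, ensure_ascii=False, sort_keys=True)
    return gzip.compress(payload.encode("utf-8"))


def unpack_content(blob):
    """Reverse of package_content, as a consumer of the pushed data would do."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


records = [{"title": "Title text", "date": "2020-07-20"}]
blob = package_content(records)
restored = unpack_content(blob)
```

The compressed bytes are what the pushing module would hand on; how they are transported to the network database is outside what the description specifies.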
This design converts, extracts, crawls and integrates unstructured data, and lets extraction and crawling rules be set as required, so that it meets the crawling requirements of different data, suits the processing of different types of unstructured data, and improves the efficiency and quality of unstructured data processing.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the equipment or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (3)

1. A Web data extraction system based on XML and XPath, comprising a conversion module for converting web pages in a network database, characterized in that the conversion module is connected in sequence to a parsing module, an extraction establishment module, an extraction integration module, a crawling integration module and a pushing module; the extraction establishment module is connected in sequence to an extraction setting module and an extraction instruction modification module; and the parsing module, the extraction setting module and the crawling integration module are all connected to the same storage module.
2. The XML and XPath-based Web data extraction system of claim 1, wherein the conversion module converts the currently extracted target web page into a standard XML web page; the parsing module parses the converted XML web page into a DOM tree; the extraction establishment module extracts the required target Web information point content from the parsed DOM tree; the extraction integration module integrates the extracted Web information point content into an extraction list; the crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list, and then converts and compresses that content; and the pushing module pushes the downloaded and crawled target file content to the network database for use by users.
3. The XML and XPath-based Web data extraction system of claim 1, wherein the extraction instruction modification module is configured to receive extraction mode instruction content, and the extraction setting module modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content.
The method of using the Web data extraction system based on XML and XPath is characterized by comprising the following steps:
S1 Conversion and parsing of the target web page
First, the conversion module converts the currently extracted target web page into a standard XML web page, and the parsing module parses the converted XML web page into a DOM tree and passes the DOM tree to the extraction establishment module;
S2 Extraction mode control
An operator issues extraction mode instruction content to the extraction instruction modification module, and the extraction setting module then modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content;
S3 Target content extraction
The extraction establishment module extracts the required target Web information point content from the parsed DOM tree according to the extraction mode set by the extraction setting module, using a Web crawler to crawl the target Web information point content, and the extraction integration module then integrates the extracted Web information point content into an extraction list;
S4 Crawling and pushing the extracted content
The crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list produced by the extraction integration module, then converts and compresses that content, and finally the pushing module pushes the downloaded and crawled target file content to the network database for use by users.
CN202010696786.0A 2020-07-20 2020-07-20 Web data extraction system based on XML and XPath and use method thereof Active CN111859867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010696786.0A CN111859867B (en) 2020-07-20 2020-07-20 Web data extraction system based on XML and XPath and use method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010696786.0A CN111859867B (en) 2020-07-20 2020-07-20 Web data extraction system based on XML and XPath and use method thereof

Publications (2)

Publication Number Publication Date
CN111859867A (en) 2020-10-30
CN111859867B (en) 2024-03-12

Family

ID=73001005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010696786.0A Active CN111859867B (en) 2020-07-20 2020-07-20 Web data extraction system based on XML and XPath and use method thereof

Country Status (1)

Country Link
CN (1) CN111859867B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256708A (en) * 2020-12-22 2021-01-22 远光软件股份有限公司 Method, device, terminal and storage medium for acquiring and storing text content

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2538504A1 (en) * 2006-03-03 2007-09-03 Watchfire Corporation Method and system for obtaining script related information for website crawling
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
CN109657121A (en) * 2018-12-09 2019-04-19 佛山市金穗数据服务有限公司 A kind of Web page information acquisition method and device based on web crawlers
WO2019237547A1 (en) * 2018-06-11 2019-12-19 平安科技(深圳)有限公司 Data crawling method and apparatus, and computer device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
阮娟: "基于XPath 的新闻信息抽取***设计与实现", 《智能计算机与应用》, vol. 5, no. 2, pages 58 - 61 *

Also Published As

Publication number Publication date
CN111859867B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
US8185530B2 (en) Method and system for web document clustering
US8321396B2 (en) Automatically extracting by-line information
US8255394B2 (en) Apparatus, system, and method for efficient content indexing of streaming XML document content
RU2530340C2 (en) Update notification method and system
CN101909079B (en) User online behavior data acquisition method in backbone link and system
US20060230100A1 (en) Web content transcoding system and method for small display device
CN109857956B (en) News webpage key information automatic extraction method based on label and block characteristics
US20110219017A1 (en) System and methods for citation database construction and for allowing quick understanding of scientific papers
CN110309386B (en) Method and device for crawling web page
US20180232410A1 (en) Refining structured data indexes
WO2020101479A1 (en) System and method to detect and generate relevant content from uniform resource locator (url)
US20050010556A1 (en) Method and apparatus for information retrieval
CN104598536B (en) A kind of distributed network information structuring processing method
KR20170073693A (en) Extracting similar group elements
CN111723265A (en) Extensible news website universal crawler method and system
CN116204660A (en) Multi-source heterogeneous data driven domain knowledge graph construction system method
CN1841377A (en) Crawling databases for information
CN114443928B (en) Web text data crawler method and system
CN111859867B (en) Web data extraction system based on XML and XPath and use method thereof
CN114117242A (en) Data query method and device, computer equipment and storage medium
Oita et al. Archiving data objects using Web feeds
CN103927367A (en) Microblog acquisition system and method based on events
US7296034B2 (en) Integrated support in an XML/XQuery database for web-based applications
CN113094568A (en) Data extraction method based on data crawler technology
EP2411930A2 (en) A system for automatic semantic-based mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant