CN111859867A

CN111859867A - Web data extraction system based on XML and XPath and use method thereof

Info

Publication number: CN111859867A
Application number: CN202010696786.0A
Authority: CN
Inventors: 官鲁卫
Original assignee: Guangxi Meicube Engineering Consulting Co Ltd
Current assignee: Guangxi Meicube Engineering Consulting Co Ltd
Priority date: 2020-07-20
Filing date: 2020-07-20
Publication date: 2020-10-30
Anticipated expiration: 2040-07-20
Also published as: CN111859867B

Abstract

The invention belongs to the technical field of data extraction and in particular relates to a Web data extraction system based on XML and XPath and a method of using it. The system comprises a conversion module for converting web pages in a network database. The conversion module is connected in sequence to a parsing module, an extraction establishment module, an extraction integration module, a crawling integration module and a pushing module; the extraction establishment module is connected in sequence to an extraction setting module and an extraction instruction modification module; and the parsing module, the extraction setting module and the crawling integration module are all connected to the same storage module. The invention converts, extracts, crawls and integrates unstructured data and allows extraction and crawling rules to be set as required, so that it meets the crawling requirements of different data, is suitable for processing different types of unstructured data, and improves the efficiency and quality of unstructured data processing.

Description

Web data extraction system based on XML and XPath and use method thereof

Technical Field

The invention relates to the technical field of data extraction, in particular to a Web data extraction system based on XML and XPath and a use method thereof.

Background

With the advent of the network and information age, data and information flood our lives and permeate every field. In the past few years, large amounts of data have been generated every day from weblogs, social networking sites, online transaction records and other data sources. Whether current database import technology and retrieval methods can fully normalize this daily data into a database format, or mine and analyze useful information from massive unstructured data that is valuable but difficult to process, has become a hot topic of current research.

Unstructured data refers to data that cannot conveniently be represented in the two-dimensional logical tables of a database, including text, pictures, XML, HTML, audio and video. More than 85% of the data currently collected from the Internet is unstructured or semi-structured. For example, Baidu processes tens of PB of data per day; Facebook has more than 1 billion registered users, who upload more than 1 billion photos each month, and it generates more than 300 TB of log data each day; Taobao has more than 370 million members and more than 880 million online commodities, handles millions of transactions per day, and generates about 20 TB of data; Yahoo's total storage capacity exceeds 100 PB; and so on. Because of this data volume, and driven by related technologies such as cloud computing and the Internet of Things, the world has gradually entered the "big data" era. Existing approaches make unstructured data inconvenient to process and to extract, so a Web data extraction system based on XML and XPath and a method of using it are needed.

Disclosure of Invention

The invention provides a Web data extraction system based on XML and XPath and a method of using it, which solve the problems in the prior art described above.

To achieve this purpose, the invention adopts the following technical solution:

A Web data extraction system based on XML and XPath comprises a conversion module for converting web pages in a network database. The conversion module is connected in sequence to a parsing module, an extraction establishment module, an extraction integration module, a crawling integration module and a pushing module. The extraction establishment module is connected in sequence to an extraction setting module and an extraction instruction modification module, and the parsing module, the extraction setting module and the crawling integration module are all connected to the same storage module.

Preferably, the conversion module converts the target web page currently being extracted into a standard XML web page, and the parsing module parses the converted XML web page into a DOM tree. The extraction establishment module extracts the required target Web information point content from the parsed DOM tree, and the extraction integration module integrates the extracted Web information point content to form an extraction list. The crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list, then converts and compresses the downloaded content. Finally, the pushing module pushes the downloaded and crawled target file content to the network database for use by users.

Preferably, the extraction instruction modification module is configured to receive extraction mode instruction content, and the extraction setting module modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content.

The method of using the Web data extraction system based on XML and XPath comprises the following steps:

S1 conversion and parsing of target web page

First, the conversion module converts the target web page currently being extracted into a standard XML web page, and the parsing module parses the converted XML web page into a DOM tree and transmits the DOM tree to the extraction establishment module;

S2 extraction mode control

An operator sends extraction mode instruction content to the extraction instruction modification module, and the extraction setting module then modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content;

S3 target content extraction

The extraction establishment module extracts the required target Web information point content from the parsed DOM tree according to the extraction mode set by the extraction setting module, and the extraction integration module then integrates the extracted Web information point content to form an extraction list;

S4 crawling and pushing extracted content

The crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list, then converts and compresses the downloaded content, and finally the pushing module pushes the downloaded and crawled target file content to the network database for use by users.
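The text does not say which conversion format or compression scheme step S4 uses. As a minimal sketch only, the following Python assumes the crawled content is a list of records, converts it to JSON and compresses it with gzip; the function names `pack_records` and `unpack_records`, the record layout and both format choices are assumptions made for illustration.

```python
import gzip
import json


def pack_records(records):
    """Convert crawled records to JSON text, then compress the bytes with gzip.

    JSON and gzip are assumed stand-ins; the source names neither format.
    """
    payload = json.dumps(records, ensure_ascii=False, sort_keys=True)
    return gzip.compress(payload.encode("utf-8"))


def unpack_records(blob):
    """Reverse pack_records: decompress, then parse the JSON text."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical crawled content for one entry of the extraction list.
crawled = [{"url": "http://example.com/item/1", "title": "Widget"}]
blob = pack_records(crawled)
restored = unpack_records(blob)
```

A pushing module would then hand `blob` to whatever storage interface the network database exposes, which is outside the scope of this sketch.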

In the present invention, the Web data extraction system based on XML and XPath and its method of use convert, extract, crawl and integrate unstructured data, and allow extraction and crawling rules to be set as required. The invention thereby meets the crawling requirements of different data, is suitable for processing different types of unstructured data, and improves the efficiency and quality of unstructured data processing.

Drawings

Fig. 1 is a schematic structural diagram of a Web data extraction system based on XML and XPath and a method for using the same according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

Referring to Fig. 1, the Web data extraction system based on XML and XPath includes a conversion module for converting web pages in a network database. The conversion module is connected in sequence to a parsing module, an extraction establishment module, an extraction integration module, a crawling integration module and a pushing module; the extraction establishment module is connected in sequence to an extraction setting module and an extraction instruction modification module; and the parsing module, the extraction setting module and the crawling integration module are all connected to the same storage module.

Specifically, the conversion module converts the target web page currently being extracted into a standard XML web page, and the parsing module parses the converted XML web page into a DOM tree. The extraction establishment module extracts the required target Web information point content from the parsed DOM tree, and the extraction integration module integrates the extracted Web information point content to form an extraction list. The crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list, then converts and compresses the downloaded content, and the pushing module pushes the downloaded and crawled target file content to the network database for use by users.

Further, the extraction instruction modification module is used to receive extraction mode instruction content, and the extraction setting module modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content.

S1 conversion and parsing of target web page

First, the conversion module converts the target web page currently being extracted into a standard XML web page, and the parsing module parses the converted XML web page into a DOM tree and transmits the DOM tree to the extraction establishment module.

Specifically, HTML with irregular tags is converted into XML: the HTML page is preprocessed (optimized and corrected) with the SgmlReader tool so that it conforms to the XML standard, a DOM tree is then built with an XML parsing package, and the DOM tree is finally processed through the interfaces defined by the W3C. Operations on the HTML document are thereby converted into operations on the DOM. The process is as follows:

Sample learning stage:

The HTML document is converted into XML (actually XHTML) format with SgmlReader, and the formatted document is then converted into a DOM tree;
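SgmlReader is a .NET tool, so the following is only an analogous sketch in Python using the standard library, not the tool named above. It assumes a small hand-written tidy step is enough for illustration: `TidyBuilder`, `html_to_dom`, `VOID_TAGS` and `IMPLICITLY_CLOSED` are hypothetical names, and the repair rules cover only void elements, a few implicitly closed tags and unclosed elements.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags that HTML authors often leave unclosed before the next sibling.
IMPLICITLY_CLOSED = {"li", "p", "tr", "td", "th", "option"}


class TidyBuilder(HTMLParser):
    """Builds a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # <li>One<li>Two: close the previous <li> before opening the next.
        if (tag in IMPLICITLY_CLOSED and self.open_tags
                and self.open_tags[-1] == tag):
            self.builder.end(self.open_tags.pop())
        # Valueless attributes (e.g. <input disabled>) get an empty string.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # void elements never have content
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        while self.open_tags:  # close elements left open inside this one
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()
        while self.open_tags:  # close whatever is still open at end of input
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_dom(html):
    """Return the root element of a well-formed tree built from html."""
    parser = TidyBuilder()
    parser.feed(html)
    return parser.finish()


messy = "<html><body><ul><li>One<li>Two</ul><p>Price: <b>42<br></p></body></html>"
root = html_to_dom(messy)
xhtml = ET.tostring(root, encoding="unicode")  # serialized, well-formed markup
```

The resulting tree can be serialized and re-parsed by any XML parser, which is the property the later XPath-based stages rely on.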

Extraction rule definition stage:

The user marks specific data items in the DOM tree visually, and the system internally performs the following processing: first, it generates an XPath expression for the Web data item; it then associates the expression with the corresponding field of the target table, which completes the definition of one mapping rule; the other fields are treated in the same way. The mapping rules for each field of the target table and the definition information of the extraction rules are stored in the IERF file for that cluster of web pages;

Data extraction stage:

The IERF file is parsed, and if the web page to be extracted satisfies the extraction rule, the content nodes specified by the rule are extracted;
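The rule definition and data extraction stages can be sketched as follows. The structure of an IERF file is not given in the text, so this example assumes a JSON stand-in with a hypothetical `page_check` path (deciding whether a page satisfies the rule) and a `fields` map from target table fields to XPath expressions. It uses the XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical rule file content; the real IERF format is not specified.
RULES_JSON = """
{
  "page_check": ".//div[@class='product']",
  "fields": {
    "title": ".//div[@class='product']/h1",
    "price": ".//div[@class='product']/span[@class='price']"
  }
}
"""


def extract_record(root, rules):
    """Apply field-to-XPath mapping rules to one parsed page.

    Returns None when the page does not satisfy the rule's page check.
    """
    if root.find(rules["page_check"]) is None:
        return None
    record = {}
    for field, xpath in rules["fields"].items():
        node = root.find(xpath)
        # Collect all text under the matched node, or None if nothing matched.
        record[field] = ("".join(node.itertext()).strip()
                         if node is not None else None)
    return record


page = ET.fromstring(
    "<html><body><div class='product'>"
    "<h1>Widget</h1><span class='price'>19.90</span>"
    "</div></body></html>"
)
rules = json.loads(RULES_JSON)
record = extract_record(page, rules)
unrelated = extract_record(ET.fromstring("<html><body/></html>"), rules)
```

Keeping the rules in a data file rather than in code is what lets one rule set be reused across every page of the same cluster.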

S2 extraction mode control

Specifically, information is extracted from the parsed DOM tree using Wrapper rules; this includes setting and modifying the region rule and the intra-region data rule within the Wrapper rules.

The web page data is extracted with Wrapper rules, using an extraction rule that combines reference text with the DOM. In the extraction of a single record from a single web page, the relation between the extraction module and the crawler module is as follows:

Here, the region rule is a way of locating the data region in the web page content. Text in the web page that rarely changes is used as a reference point, so that the position of the data region, namely the nearest parent node containing the reference information, can be located quickly. This avoids dependence on the web page structure, so the extraction rule is not affected by dynamic changes to that structure. When a data region cannot easily be located uniquely, the number of reference texts can be increased as appropriate. The process is described in pseudo code as:
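The following Python sketch illustrates the region rule as described in the paragraph above; it is not the pseudo code of the original filing. It assumes the data region is the deepest element whose subtree text contains every reference text, and the function name `locate_region` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def locate_region(node, reference_texts):
    """Return the deepest element whose subtree contains all reference texts.

    Returns None when the subtree under node does not contain all of them.
    """
    text = "".join(node.itertext())
    if not all(ref in text for ref in reference_texts):
        return None
    for child in node:
        found = locate_region(child, reference_texts)
        if found is not None:
            return found  # a child already covers every anchor: go deeper
    return node  # no single child does, so this is the nearest parent


page = ET.fromstring(
    "<html><body>"
    "<div id='nav'><a>Home</a></div>"
    "<div id='main'><table id='spec'>"
    "<tr><td>Model</td><td>X-100</td></tr>"
    "<tr><td>Price</td><td>19.90</td></tr>"
    "</table><p>Reviews</p></div>"
    "</body></html>"
)
region = locate_region(page, ["Model", "Price"])
```

Because the search is driven by text anchors and not by a fixed path from the root, wrapping the table in extra layout elements would not change the result. With a single reference text the function returns the node holding that text itself, which is one reason to supply more than one anchor.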

Intra-region data rule representation

Once the data region has been located, the data inside it is extracted. A relative path within the DOM subtree is used instead of an absolute path from the root node of the web page to the child node, so the path is short, robust, and only weakly dependent on the page structure. All nodes are traversed in right-order traversal, and the text data contained in the subtree is extracted. If only part of the string in a data field needs to be extracted, a text feature pattern matching method can be used;
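A minimal sketch of an intra-region data rule, assuming the region has already been located. Relative paths are evaluated from the region element, and an optional regular expression stands in for the text feature pattern matching. The function name `extract_field` and the sample data are hypothetical, and text is collected in document order with `itertext()`, which is an assumption about the traversal and not something the source confirms.

```python
import re
import xml.etree.ElementTree as ET

# A located data region (hypothetical sample).
region = ET.fromstring(
    "<table>"
    "<tr><td>Model</td><td>X-100</td></tr>"
    "<tr><td>Price</td><td>USD <b>19.90</b> incl. tax</td></tr>"
    "</table>"
)


def extract_field(region, relative_path, pattern=None):
    """Extract text at a path relative to the region element.

    If pattern is given, return only its first capture group.
    """
    node = region.find(relative_path)
    if node is None:
        return None
    # Gather all text in the subtree and normalize whitespace.
    text = " ".join("".join(node.itertext()).split())
    if pattern is None:
        return text
    match = re.search(pattern, text)
    return match.group(1) if match else None


model = extract_field(region, "./tr[1]/td[2]")
price = extract_field(region, "./tr[2]/td[2]", r"(\d+\.\d+)")
```

The paths here are two steps long because they start at the region, whereas an absolute path would also have to encode every ancestor between the page root and the table.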

S3 target content extraction

The extraction establishment module extracts the required target Web information point content from the parsed DOM tree according to the extraction mode set by the extraction setting module, using a Web crawler to crawl the extracted target Web information point content, and the extraction integration module then integrates the extracted content to form an extraction list;

The structure of the web crawler is as follows:

URL queue: URL records in the URL queue come from two sources. One is the seed URLs, which are mainly web page links defined in advance by the user; the other is the URLs that the crawler continually obtains from the web pages it crawls subsequently. After the crawler program starts, it begins capturing from the seed URLs and follows the first-in, first-out principle of the queue. This favors a breadth-first capture strategy and avoids the tendency of a depth-first strategy to drift away from the topic, which improves the topical relevance of the captured web pages.
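The first-in, first-out behaviour of the URL queue can be sketched with a `deque`. In this illustration `UrlQueue`, `crawl_order` and the `links_of` callable (which maps a URL to the links found on that page) are hypothetical names, and the `seen` set is an added assumption that keeps the traversal from looping.

```python
from collections import deque


class UrlQueue:
    """FIFO queue of URLs seeded with user-defined links."""

    def __init__(self, seed_urls):
        self._queue = deque()
        self._seen = set()
        for url in seed_urls:
            self.add(url)

    def add(self, url):
        # Enqueue each URL at most once.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        # First in, first out: the oldest queued URL is served first.
        return self._queue.popleft() if self._queue else None


def crawl_order(seed_urls, links_of, limit=100):
    """Return the order in which pages would be visited (breadth first)."""
    queue = UrlQueue(seed_urls)
    visited = []
    while len(visited) < limit:
        url = queue.next_url()
        if url is None:
            break
        visited.append(url)
        for link in links_of(url):  # newly discovered URLs join the back
            queue.add(link)
    return visited


# Toy link graph standing in for real pages.
graph = {"seed": ["a", "b"], "a": ["c", "seed"], "b": ["d"], "c": [], "d": []}
order = crawl_order(["seed"], lambda url: graph.get(url, []))
```

Both pages linked from the seed are visited before any page two links away, which is the breadth-first property the text relies on to stay close to the topic.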

Protocol processor: this layer is the foundation of the web crawler. It sits at the bottom of the whole crawler system and is mainly responsible for acquiring web page data over various network protocols. The common network protocols are HTTP, HTTPS and FTP, and HTTP is currently dominant, so the web crawler designed here is considered to support data transmission over HTTP only. The steps for collecting web pages over HTTP are briefly as follows:

a. Parse the site address and port number from the URL, and establish a connection to that address and port;

b. Assemble an HTTP (Hypertext Transfer Protocol) request header and send it to the target site. If no response is received within a certain period, abandon the capture of the page; if a response is received, parse it and go to the next step;

c. Judge the result mainly by the status code. If it is 2XX, the page was returned correctly. If it is 301 or 302, the page has been redirected: extract the new URL from the response header and return to the previous step. Any other code (for example 404, which indicates that the web page was not found) means the link failed; the crawler then retries the URL and, if it still fails after three attempts, abandons it;

d. Read web page information such as page type and length from the response header;

e. Acquire the page content.
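Steps a to e can be sketched as below. To keep the example runnable without network access, the actual transport is injected as `send_request`, a hypothetical callable that returns `(status, headers, body)` for a URL and raises `TimeoutError` when no response arrives in time. The function names, the redirect limit and the exact header keys are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit


def site_address(url):
    """Step a: derive the host and port to connect to from the URL."""
    parts = urlsplit(url)
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def fetch_page(url, send_request, max_attempts=3, max_redirects=5):
    """Fetch one page; return (final_url, info, body) or None on failure."""
    attempts = 0
    redirects = 0
    while attempts < max_attempts:
        try:
            status, headers, body = send_request(url)  # steps a and b
        except TimeoutError:
            return None  # no response in time: give up on this page
        if 200 <= status < 300:  # step c: page returned correctly
            info = {"type": headers.get("Content-Type"),      # step d
                    "length": headers.get("Content-Length")}
            return url, info, body  # step e
        if status in (301, 302) and "Location" in headers:
            redirects += 1
            if redirects > max_redirects:  # assumed guard against loops
                return None
            url = urljoin(url, headers["Location"])  # follow the redirect
            continue
        attempts += 1  # any other code counts as one failed attempt
    return None  # still failing after max_attempts tries


# Fake transport standing in for real HTTP traffic.
responses = {
    "http://example.com/old": (301, {"Location": "/new"}, b""),
    "http://example.com/new": (200, {"Content-Type": "text/html",
                                     "Content-Length": "5"}, b"hello"),
}
calls = []


def fake_send(url):
    calls.append(url)
    return responses.get(url, (404, {}, b""))


result = fetch_page("http://example.com/old", fake_send)
```

A real `send_request` could wrap the standard library's `http.client.HTTPConnection` with a timeout, using `site_address` to choose the host and port.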

The specific flow is as follows:

URL analysis: this component is mainly responsible for obtaining semantic information marked by Meta, HREF and similar tags from newly captured web pages, obtaining URLs, and filtering the newly obtained URLs. Filtering mainly means deleting URLs that point to pictures, sounds, videos or advertisements. Another important task is to compare the captured URLs with the visited queue (history table) and delete any URL that has already been accessed, so that repeated capturing is avoided.
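A minimal sketch of the URL analysis step, assuming that only `href` attributes of `<a>` tags are collected and that media files are recognized by file extension. The names `LinkCollector`, `new_urls` and `SKIPPED_EXTENSIONS` are hypothetical, and advertisement filtering and Meta tag handling are left out because the text gives no criteria for them.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumed list of picture, sound and video extensions to drop.
SKIPPED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif",
                      ".mp3", ".wav", ".mp4", ".avi"}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def new_urls(page_url, html, history):
    """Return unvisited, non-media URLs found in html; update history."""
    collector = LinkCollector()
    collector.feed(html)
    fresh = []
    for href in collector.links:
        # Resolve relative links and drop the fragment part.
        url = urljoin(page_url, href).split("#", 1)[0]
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # e.g. mailto: links
        if posixpath.splitext(parts.path)[1].lower() in SKIPPED_EXTENSIONS:
            continue  # picture, sound or video file
        if url in history:
            continue  # already visited or already queued
        history.add(url)
        fresh.append(url)
    return fresh


history = {"http://example.com/seen.html"}
sample = ('<a href="/a.html">A</a><a href="logo.PNG">img</a>'
          '<a href="/seen.html">S</a><a href="/a.html#top">A again</a>'
          '<a href="mailto:x@example.com">mail</a>')
found = new_urls("http://example.com/index.html", sample, history)
```

Recording each accepted URL in the history table immediately means the same link appearing twice on one page is queued only once.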

The web crawler program flow is roughly as follows:

S4 crawling and pushing extracted content

This design converts, extracts, crawls and integrates unstructured data, and allows extraction and crawling rules to be set as required. It thereby meets the crawling requirements of different data, is suitable for processing different types of unstructured data, and improves the efficiency and quality of unstructured data processing.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the equipment or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

The above description covers only the preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A Web data extraction system based on XML and XPath, comprising a conversion module for converting web pages in a network database, characterized in that the conversion module is connected in sequence to a parsing module, an extraction establishment module, an extraction integration module, a crawling integration module and a pushing module; the extraction establishment module is connected in sequence to an extraction setting module and an extraction instruction modification module; and the parsing module, the extraction setting module and the crawling integration module are all connected to the same storage module.

2. The Web data extraction system based on XML and XPath of claim 1, wherein the conversion module converts the target web page currently being extracted into a standard XML web page, the parsing module parses the converted XML web page into a DOM tree, the extraction establishment module extracts the required target Web information point content from the parsed DOM tree, the extraction integration module integrates the extracted Web information point content to form an extraction list, the crawling integration module downloads and crawls the corresponding target file content from the network database according to the extraction list and then converts and compresses the downloaded content, and the pushing module pushes the downloaded and crawled target file content to the network database for use by users.

3. The Web data extraction system based on XML and XPath of claim 1, wherein the extraction instruction modification module is configured to receive extraction mode instruction content, and the extraction setting module modifies the extraction mode in the extraction establishment module according to the received extraction mode instruction content.

4. A method of using the Web data extraction system based on XML and XPath, characterized by comprising the following steps:

S1 conversion and parsing of target web page

S2 extraction mode control

S3 target content extraction

S4 crawling and pushing extracted content