CN111859867B - Web data extraction system based on XML and XPath and use method thereof

Info

Publication number: CN111859867B
Application number: CN202010696786.0A
Authority: CN
Inventors: 官鲁卫
Original assignee: Guangxi Meicube Engineering Consulting Co ltd
Current assignee: Guangxi Meicube Engineering Consulting Co ltd
Priority date: 2020-07-20
Filing date: 2020-07-20
Publication date: 2024-03-12
Anticipated expiration: 2040-07-20
Also published as: CN111859867A

Abstract

The invention belongs to the technical field of data extraction, and in particular relates to a Web data extraction system based on XML and XPath and a method of using it. The invention converts, extracts, crawls and integrates unstructured data, and allows the extraction and crawling rules to be set as needed, so that it meets the crawling requirements of different data, suits the processing of different types of unstructured data, and improves the efficiency and quality of unstructured data processing.

Description

Web data extraction system based on XML and XPath and use method thereof

Technical Field

The invention relates to the technical field of data extraction, in particular to a Web data extraction system based on XML and XPath and a use method thereof.

Background

With the arrival of the network and information age, data and information fill our lives and penetrate every field. Over the past few years, large amounts of data have been generated daily from weblogs, social networking sites, online transaction records and other data sources. Whether current database import techniques and retrieval methods can fully normalize this daily generated data into a database format, or mine and analyze useful information from massive unstructured data that is difficult to process, has become a hot topic of current research.

Unstructured data refers to data that is inconvenient to represent with the two-dimensional logical tables of a database, including text, pictures, XML, HTML, audio/video and the like. More than 85% of the data currently collected from the Internet is unstructured or semi-structured. For example, Baidu processes roughly tens of PB of data per day; Facebook has more than 1 billion registered users, more than 1 billion pictures uploaded per month, and more than 300 TB of log data generated per day; Taobao has more than 370 million members and more than 880 million online commodities, and carries out tens of millions of transactions every day, generating about 20 TB of data; the total storage capacity of Yahoo exceeds 100 PB; and so on. Driven by related technologies such as cloud computing and the Internet of Things, data volumes have grown so much that the world is gradually entering the big data era. Existing unstructured data is inconvenient to process and to extract, and therefore a Web data extraction system based on XML and XPath and a method of using it are needed.

Disclosure of Invention

The Web data extraction system based on XML and XPath and the use method thereof solve the problems in the prior art.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

The Web data extraction system based on XML and XPath comprises a conversion module for converting webpages in a network database. The conversion module is connected in sequence with an analysis module, an extraction building module, an extraction integration module, a crawling integration module and a pushing module; the extraction building module is connected in sequence with an extraction setting module and an extraction instruction modification module; and the analysis module, the extraction setting module and the crawling integration module are all connected with the same storage module.

Preferably, the conversion module converts the currently extracted target webpage into a canonical XML webpage, the analysis module analyzes the converted XML webpage into a DOM tree, the extraction building module extracts required target Web information point content from the analyzed DOM tree, the extraction integration module integrates the extracted Web information point content to form an extraction list, the crawling integration module downloads the corresponding target file content from the network database according to the extraction list after extraction integration, and finally converts and compresses the downloaded target file content after crawling, and the pushing module pushes the downloaded target file content to the network database for users to use.

Preferably, the extraction instruction modifying module is configured to receive extraction mode instruction content, and the extraction setting module modifies an extraction mode in the extraction establishing module according to the received extraction mode instruction content.

The application method of the Web data extraction system based on XML and XPath comprises the following steps:

s1 conversion and analysis of target web pages

Firstly, a conversion module converts a currently extracted target webpage into a canonical XML webpage, and an analysis module analyzes the converted XML webpage into a DOM tree and transmits the DOM tree to an extraction building module;

s2 extraction mode regulation and control

The operator sends extraction mode instruction content to the extraction instruction modification module, and the extraction setting module then modifies the extraction mode in the extraction building module according to the received extraction mode instruction content;

s3 target content extraction

The extraction building module extracts the required target Web information point content from the parsed DOM tree according to the extraction mode set by the extraction setting module, and the extraction integration module then integrates the extracted Web information point content to form an extraction list;

s4 extraction content crawling pushing

The crawling integration module downloads the corresponding target file content from the network database according to the integrated extraction list, then converts and compresses the downloaded target file content, and finally the pushing module pushes the downloaded target file content to the network database for users to use.

In the present invention, the Web data extraction system based on XML and XPath and the method of using it realize the conversion, extraction, crawling and integration of unstructured data, and the extraction and crawling rules are set as needed, so that the crawling requirements of different data are met, different types of unstructured data can be processed, and the efficiency and quality of unstructured data processing are improved.

Drawings

FIG. 1 is a schematic diagram of a Web data extraction system based on XML and XPath and a method for using the same;

FIG. 2 is a schematic diagram of the relation between the extraction module and the crawler module in the Web data extraction system based on XML and XPath and the method of using it;

FIG. 3 is a schematic flow chart of the XML and XPath based Web data extraction system and method of use thereof according to the present invention;

FIG. 4 is a flowchart illustrating a Web page collection procedure using HTTP protocol in a Web crawler in the XML and XPath based Web data extraction system and the method for using the same according to the present invention;

FIG. 5 is a schematic diagram of a Web data extraction system based on XML and XPath and a method for using the same according to the present invention;

FIG. 6 is a schematic diagram of the Web crawler program flow in the Web data extraction system based on XML and XPath and the method of using it according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention.

Referring to fig. 1, the Web data extraction system based on XML and XPath includes a conversion module for converting a Web page in a network database, where the conversion module is sequentially connected with an analysis module, an extraction building module, an extraction integration module, a crawling integration module and a pushing module, the extraction building module is sequentially connected with an extraction setting module and an extraction instruction modification module, and the analysis module, the extraction setting module and the crawling integration module are all connected with the same storage module.

The conversion module converts the currently extracted target webpage into a canonical XML webpage, the analysis module analyzes the converted XML webpage into a DOM tree, the extraction building module extracts required target Web information point content from the analyzed DOM tree, the extraction integration module integrates the extracted Web information point content to form an extraction list, the crawling integration module downloads and crawls corresponding target file content from the network database according to the extraction list integrated by extraction, and finally converts and compresses the downloaded and crawled corresponding target file content, and the pushing module pushes the downloaded and crawled target file content to the network database for users to use.

Further, the extraction instruction modifying module is configured to receive extraction mode instruction content, and the extraction setting module modifies an extraction mode in the extraction establishing module according to the received extraction mode instruction content.

s1 conversion and analysis of target web pages

First, the conversion module converts the currently extracted target webpage into a canonical XML webpage, and the analysis module parses the converted XML webpage into a DOM tree and passes the DOM tree to the extraction building module.

Specifically, HTML with irregular tags is converted into XML: the HTML page is preprocessed (optimized and corrected) with the SgmlReader tool so that it conforms to the XML standard, a DOM tree is then built with an XML parsing package, and finally the DOM tree is processed through the interfaces defined by the W3C, so that operations on the HTML document become operations on the DOM. The process is as follows:

sample learning stage:

the HTML document is converted into XML (actually XHTML) format using SgmlReader, and the formatted document is then converted into a DOM tree;
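The conversion step above can be illustrated with a minimal Python sketch. SgmlReader is a .NET tool, so the sketch does not use it; it assumes the standard-library `html.parser` and `xml.etree.ElementTree` modules as stand-ins, and the names `TidyingParser` and `html_to_dom` are hypothetical. It only closes unclosed tags and ignores stray closing tags, which is far less than a full tidy pass.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TidyingParser(HTMLParser):
    """Builds a well-formed element tree from HTML with irregular tags."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close every element left open inside the one being closed.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()
        while self.open_tags:  # close anything still open at end of input
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_dom(html):
    """Return the root Element of the repaired document."""
    parser = TidyingParser()
    parser.feed(html)
    return parser.finish()


root = html_to_dom("<html><body><p>First<p>Second<br><b>bold</body></html>")
xhtml = ET.tostring(root, encoding="unicode")  # serializes as well-formed XML
```

The resulting `root` can be queried with the tree interface, which is the point of the step: operations on the HTML document become operations on a tree.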

extraction rule definition phase:

the user marks specific data items in the DOM tree through a visual interface, and the system then processes them as follows: first, it generates an XPath expression for the Web data item; then it maps the expression to the corresponding destination table field, which completes the definition of a mapping rule; the other fields are processed in the same way. The definition information of the extraction rules is stored in the IERF files of the cluster webpages;
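A minimal sketch of the rule definition phase, assuming Python's `xml.etree.ElementTree` as the DOM: given a node the user has marked, it builds a positional XPath expression and maps it to a destination field. The function name `xpath_of`, the field names, and the use of JSON as the rule file format are assumptions for illustration; the text does not specify the IERF file layout.

```python
import json
import xml.etree.ElementTree as ET


def xpath_of(root, target):
    """Return a positional path from root to target, e.g. ./body[1]/div[2]."""
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


doc = ET.fromstring(
    "<html><body>"
    "<div>menu</div>"
    "<div><span>Widget</span><span>9.99</span></div>"
    "</body></html>"
)

# Nodes the user "marked", keyed by hypothetical destination table fields.
marked = {"name": doc[0][1][0], "price": doc[0][1][1]}

# Mapping rules: destination field -> XPath expression.
rules = {field: xpath_of(doc, node) for field, node in marked.items()}

# Stand-in for the rule file; JSON is an assumption, not the IERF format.
rule_text = json.dumps(rules, indent=2)
```

Positional paths like these are absolute and therefore brittle, which is why the region rule and relative paths described later in the text are used to reduce dependence on page structure.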

data extraction stage:

the IERF file is parsed, and if the webpage to be extracted satisfies the extraction rule, the content nodes specified by the rule are extracted;
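The extraction stage can be sketched as follows, under the same assumptions: the rule file is represented as a JSON object mapping hypothetical field names to XPath expressions (the real IERF format is not given in the text), and a page is treated as satisfying the rule only when every expression resolves to a node. `extract_record` is an illustrative name.

```python
import json
import xml.etree.ElementTree as ET


def extract_record(page_root, rule_text):
    """Apply every field rule; return None when the page does not match."""
    rules = json.loads(rule_text)
    record = {}
    for field, path in rules.items():
        node = page_root.find(path)
        if node is None:
            return None  # page does not satisfy the extraction rule
        # Collect all text inside the matched content node.
        record[field] = "".join(node.itertext()).strip()
    return record


rule_text = '{"name": "./body/div[2]/span[1]", "price": "./body/div[2]/span[2]"}'

matching_page = ET.fromstring(
    "<html><body><div>menu</div>"
    "<div><span>Widget</span><span>9.99</span></div></body></html>"
)
other_page = ET.fromstring("<html><body><div>menu only</div></body></html>")

record = extract_record(matching_page, rule_text)
skipped = extract_record(other_page, rule_text)
```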

s2 extraction mode regulation and control

Specifically, Wrapper rules are used to extract information from the parsed DOM tree; this covers setting and modifying both the region rule and the in-region data rule of the Wrapper rules.

Webpage data is extracted with Wrapper rules, using extraction rules that combine reference text with the DOM. In the data extraction process of a single webpage record, the relation between the extraction module and the crawler module is shown in FIG. 2:

wherein:

The region rule locates the data region within the webpage content. Text in the webpage that is unlikely to change is used as a reference point, so that the position of the data region, namely the nearest parent node of the sub-information, can be located quickly. This avoids dependence on the webpage structure, so the extraction rule is not affected by dynamic changes to that structure. When the data region is hard to locate uniquely, the number of reference texts can be increased appropriately. The process is described in pseudocode as shown in FIG. 3;
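One possible reading of the region rule, sketched in Python with `xml.etree.ElementTree`: the data region is taken to be the deepest element whose text content contains every reference text. This is an interpretation for illustration, not the pseudocode of FIG. 3; `locate_region` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def locate_region(root, reference_texts):
    """Return the deepest element containing all reference texts, or None."""

    def contains_all(elem):
        text = "".join(elem.itertext())
        return all(ref in text for ref in reference_texts)

    if not contains_all(root):
        return None
    region = root
    while True:
        narrower = [child for child in region if contains_all(child)]
        if len(narrower) != 1:
            # Zero children match: region is the nearest enclosing node.
            # Several match: the references do not single out one subtree.
            return region
        region = narrower[0]


page = ET.fromstring(
    '<html><body>'
    '<div id="nav">Home Price</div>'
    '<div id="item"><h2>Product details</h2>'
    '<table><tr><td>Price</td><td>9.99</td></tr></table></div>'
    '</body></html>'
)

ambiguous = locate_region(page, ["Price"])                  # stops at <body>
unique = locate_region(page, ["Product details", "Price"])  # the item <div>
```

With a single reference text the region cannot be narrowed past `<body>` because two subtrees contain it; adding a second reference text pins down the item block, which mirrors the remark about increasing the number of reference texts.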

in-region data rule representation

After the data region is located, the data in the region is extracted. An absolute path from the root node of the webpage to the child node is not needed; a relative path within the DOM subtree is used instead, so the path is relatively short, robust, and not strongly dependent on the page structure. All nodes are traversed in right-order traversal, and the text data contained in the subtree is extracted. If only part of the string in a data field needs to be extracted, a text feature pattern matching method can be used;
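A minimal sketch of the in-region data rule, assuming the region element has already been located: a short relative path selects the node inside the region, and an optional regular expression keeps only part of the string. `extract_in_region` and the sample markup are hypothetical, and `itertext()` walks the subtree in document order, which may differ from the traversal order named in the text.

```python
import re
import xml.etree.ElementTree as ET


def extract_in_region(region, relative_path, pattern=None):
    """Extract text at a path relative to the region, optionally a substring."""
    node = region.find(relative_path)
    if node is None:
        return None
    text = "".join(node.itertext()).strip()
    if pattern is None:
        return text
    # Text feature pattern matching: keep only the first captured group.
    match = re.search(pattern, text)
    return match.group(1) if match else None


region = ET.fromstring(
    "<div><table><tr><td>Price</td><td>USD 9.99 incl. tax</td></tr></table></div>"
)

full_text = extract_in_region(region, "table/tr/td[2]")
amount = extract_in_region(region, "table/tr/td[2]", r"(\d+\.\d+)")
```

Because the path starts at the region rather than at the document root, changes elsewhere on the page do not invalidate it.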

s3 target content extraction

The extraction building module extracts the required target Web information point content from the parsed DOM tree according to the extraction mode set by the extraction setting module, the web crawler crawls the target Web information point content extracted from the parsed DOM tree, and the extraction integration module then integrates the extracted Web information point content to form an extraction list;

wherein the structure of the web crawler is shown in fig. 4;

URL queue: the URL records in the URL queue come from two sources. One is the seed URLs, which are mainly web page links predefined by the user; the other is the URLs that the crawler keeps obtaining from subsequent web pages as it crawls them. After the crawler program starts, it begins grabbing from the seed URLs and follows the first-in, first-out principle of a queue. This supports a breadth-first grabbing strategy, avoids the topic drift that depth-first grabbing strategies are prone to, and improves the topic relevance of the grabbed web pages.
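The queue discipline described above can be sketched with a `collections.deque`. The function `crawl_order`, the `get_links` callback and the `max_pages` limit are hypothetical names introduced for illustration; the `seen` set anticipates the duplicate check described under URL analysis.

```python
from collections import deque


def crawl_order(seed_urls, get_links, max_pages=100):
    """Visit URLs breadth-first, starting from the seeds, in FIFO order."""
    queue = deque(seed_urls)  # seed URLs enter the queue first
    seen = set(seed_urls)     # URLs already queued or visited
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # first in, first out
        order.append(url)
        for link in get_links(url):  # URLs discovered on the fetched page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# A tiny in-memory link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
visited_order = crawl_order(["a"], lambda url: graph.get(url, []))
```

Pages one link away from the seed are all visited before any page two links away, which is what keeps the crawl close to the seed topic.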

Protocol processor: this layer is the foundation of the web crawler. It sits at the bottom of the whole crawler system and is mainly responsible for collecting web page data over various network protocols. Typical network protocols include HTTP, HTTPS and FTP; since HTTP is currently the dominant one, the web crawler designed here supports data transmission over the HTTP protocol only. The web page collection procedure using the HTTP protocol is briefly as follows:

a, parse the address and port number of the site from the URL, and establish a connection to it;

b, assemble an HTTP request header and send it to the target site; if no response is received within a certain period of time, abandon grabbing the page; if a response is received, parse it and go to the next step;

c, this step mainly checks the status code: 2XX indicates that the page was returned correctly; 301 or 302 indicates a page redirect, in which case a new URL is extracted from the response header and the procedure returns to the previous step; any other code (e.g., 404, indicating that the web page was not found) indicates that the link failed. The crawler then keeps retrying the URL and abandons it after three failures;

d, obtain page information such as the type and length of the web page from the response header;

e, acquire the page content.

The specific flow is shown in fig. 5;
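Steps a to e can be sketched in Python with the standard-library `http.client` module. The names `http_send` and `fetch_page`, the `User-Agent` value, the 10-second timeout and the redirect limit are assumptions for illustration; only the status-code handling and the three-failure limit come from the steps above. The transport is passed in as a function so the control flow can be exercised without network access.

```python
import http.client
import socket
from urllib.parse import urljoin, urlsplit


def http_send(url, timeout=10.0):
    """Send one GET request; return (status, headers, body)."""
    parts = urlsplit(url)  # step a: host and port from the URL
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80,
                                      timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Step b: assemble and send the request header.
        conn.request("GET", path, headers={"User-Agent": "example-crawler"})
        response = conn.getresponse()
        # Step d: header fields such as content-type and content-length.
        headers = {k.lower(): v for k, v in response.getheaders()}
        return response.status, headers, response.read()  # step e
    finally:
        conn.close()


def fetch_page(url, send=http_send, max_failures=3, max_redirects=5):
    """Return (final_url, headers, body), or None when the URL is abandoned."""
    failures = 0
    redirects = 0
    while failures < max_failures:
        try:
            status, headers, body = send(url)
        except (TimeoutError, socket.timeout):
            return None  # step b: no response in time, abandon the page
        if 200 <= status < 300:  # step c: page returned correctly
            return url, headers, body
        if (status in (301, 302) and "location" in headers
                and redirects < max_redirects):
            # Step c: follow the redirect target from the response header.
            url = urljoin(url, headers["location"])
            redirects += 1
            continue
        failures += 1  # any other code counts as a failed attempt
    return None
```

A call such as `fetch_page("http://example.com/")` would use the real transport; passing a fake `send` function makes the retry and redirect logic testable offline.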

URL analysis: semantic information from tags such as Meta or HREF is obtained from newly grabbed web pages to collect URLs, and the newly collected URLs are filtered. Filtering mainly means deleting URLs whose targets are pictures, sounds, videos or advertisements. It also has another important role: each collected URL is compared with the Visited queue (history table), and URLs that have already been visited are deleted, so that repeated grabbing is avoided.
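A minimal sketch of the URL analysis step using the standard-library `html.parser`. It covers `href` attributes on anchor tags only; Meta-based URLs and advertisement filtering are left out because the text gives no rule for them. `LinkCollector`, `new_urls` and the list of media suffixes are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# File suffixes treated as picture, sound or video targets (an assumed list).
MEDIA_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".mp3", ".wav", ".mp4", ".avi")


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def new_urls(page_url, html, visited):
    """Return absolute URLs from the page that are not media and not visited."""
    collector = LinkCollector()
    collector.feed(html)
    result = []
    for href in collector.links:
        url = urljoin(page_url, href)  # resolve relative links
        if urlsplit(url).path.lower().endswith(MEDIA_SUFFIXES):
            continue  # drop picture, sound and video targets
        if url in visited or url in result:
            continue  # drop URLs already in the Visited history
        result.append(url)
    return result


html = (
    '<a href="/a.html">A</a><a href="logo.PNG">img</a>'
    '<a href="/b.html">B</a><a href="/a.html">again</a>'
)
fresh = new_urls("http://example.test/index.html", html,
                 visited={"http://example.test/b.html"})
```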

The web crawler process flow is generally as shown in fig. 6;

s4 extraction content crawling pushing

The design realizes the conversion, extraction and crawling and integration treatment of unstructured data, and simultaneously sets the extraction and crawling rules according to the needs, thereby realizing the crawling needs of different data, being suitable for the processing operation of the unstructured data of different types and improving the processing efficiency and quality of the unstructured data.

In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall be covered by the scope of protection of the present invention.

Claims

1. The Web data extraction system based on XML and XPath comprises a conversion module for converting webpages in a network database, and is characterized in that the conversion module is sequentially connected with an analysis module, an extraction building module, an extraction integration module, a crawling integration module and a pushing module, the extraction building module is sequentially connected with an extraction setting module and an extraction instruction modification module, and the analysis module, the extraction setting module and the crawling integration module are all connected with the same storage module;

the conversion module converts the currently extracted target webpage into a canonical XML webpage, the analysis module analyzes the converted XML webpage into a DOM tree, the extraction building module extracts required target Web information point content from the analyzed DOM tree, the extraction integration module integrates the extracted Web information point content to form an extraction list, the crawling integration module downloads and crawls corresponding target file content from a network database according to the extraction list integrated by extraction, and finally converts and compresses the downloaded and crawled corresponding target file content, and the pushing module pushes the downloaded and crawled target file content to the network database for users to use;

the extraction instruction modification module is used for receiving the extraction mode instruction content, and the extraction setting module modifies the extraction mode in the extraction establishing module according to the received extraction mode instruction content;

the application method of the Web data extraction system based on XML and XPath comprises the following steps of:

s1 conversion and analysis of target web pages

s2 extraction mode regulation and control

extracting webpage data by adopting a rule of a Wrapper, and then realizing the extraction of the webpage data by adopting an extraction rule of combining a reference text and a DOM;

the region rule locates the data region in the webpage content: text information in the webpage that is unlikely to change is used as a reference point, so that the position of the data region, namely the nearest parent node of the sub-information, is located quickly, dependence on the webpage structure is avoided, and the extraction rule is not affected by dynamic changes of the webpage structure; where the data region is not easy to locate uniquely, the problem is solved by increasing the number of reference texts;

after the data area is positioned, the data in the area is extracted, and an absolute path from the root node to the child node of the webpage is not needed, but a relative path of a DOM subtree is adopted, so that the path is relatively short, the robustness is good, and the page structure is not very dependent; traversing all nodes according to a right-order traversing mode, and extracting text data information contained in the subtrees; if only partial character strings of the data field need to be extracted, adopting a text characteristic pattern matching method;

s3 target content extraction

URL queue: the URL records in the URL queue come from two places, one place is seed URL, and the URL is a webpage link predefined by a user; the other part is from the URL which is continuously obtained from the subsequent webpage by the crawler in the process of crawling the subsequent webpage; after the crawler program is started, firstly, grabbing from a seed URL, and adopting a first-in first-out principle of a queue; the method is beneficial to realizing breadth-first grabbing strategies, skillfully avoids the characteristic of easily deviating topics of depth-first grabbing strategies, and improves the topic relevance of grabbing webpages;

protocol processor: the system is a basis of a web crawler, is positioned at the bottommost layer of the whole crawler system and is responsible for realizing the collection work of webpage data by utilizing various network protocols; web crawlers currently only support data transmission of the HTTP protocol, and a brief description is given here of a web page collection procedure using the HTTP protocol:

c, checking the status code: if it is 2XX, the page was returned correctly; if it is 301 or 302, the page is redirected, a new URL needs to be extracted from the response header, and the procedure returns to the previous step; if it is 404, indicating that the web page was not found, the link has failed; the crawler keeps retrying the URL and abandons it after three failures;

d, finding out the page information through the response head;

e, acquiring page content;

URL analysis: semantic information of Meta or HREF tags is obtained from newly grabbed web pages to obtain URLs, and the newly obtained URLs are filtered; filtering means deleting URLs whose targets are pictures, sounds or videos, comparing each obtained URL with the Visited queue, and deleting any URL found to have already been visited, thereby avoiding repeated grabbing;

s4 extraction content crawling pushing