CN109657121A

CN109657121A - Web page information acquisition method and device based on a web crawler

Info

Publication number: CN109657121A
Application number: CN201811499475.4A
Authority: CN
Inventors: 高培玉; 林华; 林一华; 郭茜
Original assignee: Foshan Jinsui Data Service Co Ltd
Current assignee: Foshan Jinsui Data Service Co Ltd
Priority date: 2018-12-09
Filing date: 2018-12-09
Publication date: 2019-04-19

Abstract

The invention discloses a web page information acquisition method and device based on a web crawler. A web crawler model is constructed; HTTP requests are initiated according to the URLs in a domain name range to obtain all web pages in that range; the DOM tree is traversed to extract key information and generate collection rules; and page information is collected according to the collection rules. This simplifies the cumbersome data entry work of developers, addresses the long system entry time and low entry accuracy experienced by business personnel, and reduces the computational overhead of collecting web page data. A large number of irrelevant web pages can be skipped directly, and web pages can be collected directly according to semantic information.

Description

A web page information acquisition method and device based on a web crawler

Technical field

The present disclosure relates to the field of data collection, and in particular to a web page information acquisition method and device based on a web crawler.

Background art

In traditional web page text acquisition, the results collected from web pages of various kinds often contain a large number of pages unrelated to the collection topic. Because the goal of such acquisition is the largest possible network coverage, the contradiction between limited search engine resources and unlimited web page text resources keeps growing. The many forms of data in pages continually drive the development of network information: massive amounts of images, video and audio, multimedia, text documents and other heterogeneous data appear on the network every day. Existing techniques for processing this content often cannot comprehensively discover and obtain the multi-source heterogeneous data in dense pages, and have difficulty supporting the collection of web pages according to semantic information. Moreover, in the development of some network-based systems, much data needs to be reused. On the one hand, this increases the data entry workload of front-office staff; on the other hand, the data of auxiliary systems built earlier cannot be used, which causes a great waste of resources.

Summary of the invention

The present disclosure provides a web page information acquisition method and device based on a web crawler, in which a web crawler model captures the key information in web pages by regular-expression rules.

To achieve the above object, according to one aspect of the present disclosure, a web page information acquisition method based on a web crawler is provided, the method comprising the following steps:

Step 1, construct a web crawler model;

Step 2, initiate an HTTP request according to the starting URL to obtain a web page;

Step 3, the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule;

Step 4, initiate HTTP requests according to the URLs in the domain name range to obtain all web pages in the domain name range;

Step 5, parse all obtained web pages and generate the XML DOM tree;

Step 6, traverse the DOM tree to extract key information and generate collection rules;

Step 7, collect page information according to the collection rules;

Step 8, repeat steps 2 to 7 until the web crawler model has crawled all URLs in the domain name range from the web pages according to the regular-expression matching rule.
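The eight steps can be pictured as a queue-driven crawl loop. The following Java sketch is only an illustration under stated assumptions, not the patented implementation: the class name CrawlLoopSketch, the fetch function (a stand-in for the HTTP request of steps 2 and 4), the substring test used as the domain name range, and the in-memory pages are all hypothetical, and the URL rule is read with its backslashes restored. DOM parsing and rule generation (steps 5 to 7) are reduced to recording the visited page.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlLoopSketch {
    // URL rule with backslashes restored (an assumption about the garbled original).
    static final Pattern URL_RULE =
        Pattern.compile("http://(\\w+(-\\w+)*)(\\.(\\w+(-\\w+)*))*(\\?\\S*)?");

    // Crawls from startUrl; fetch is a hypothetical stand-in for the HTTP request.
    static List<String> crawl(String startUrl, String domain,
                              Function<String, String> fetch) {
        Deque<String> queue = new ArrayDeque<>();   // step 1.3: queue of URLs
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {                  // step 8: loop until no URL is left
            String url = queue.poll();
            String html = fetch.apply(url);         // steps 2 and 4: obtain the page
            if (html == null) {
                continue;                           // page not available
            }
            visited.add(url);                       // steps 5 to 7 would parse and collect here
            Matcher m = URL_RULE.matcher(html);     // step 3: extract URLs by regex
            while (m.find()) {
                String next = m.group();
                // Simplified domain name range check: keep URLs containing the domain.
                if (next.contains(domain) && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
            "http://a.example.com", "see http://b.example.com and http://other.org",
            "http://b.example.com", "back to http://a.example.com");
        System.out.println(crawl("http://a.example.com", "example.com", pages::get));
    }
}
```

The seen set keeps a URL from being queued twice, so the loop terminates once no new URL inside the domain name range can be extracted, which corresponds to the stopping condition of step 8.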

Further, in step 1, the method of constructing the web crawler model comprises the following steps:

Step 1.1, create a configuration script for the web crawler and write the starting URL into the configuration script;

Step 1.2, configure in the configuration script the domain name range that the web crawler crawls;

Step 1.3, construct the queue of the web crawler, whose elements are the stored URLs, i.e. the queue is used to store URLs;

Step 1.4, construct the regular-expression matching rule of the web crawler. The regular expression for URLs in the web crawler model is http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?, which denotes URLs that contain http:// and the "/" or " " symbols; the web crawler stores each URL that satisfies the matching rule in the crawl queue.
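As an illustration of steps 1.3 and 1.4, the sketch below applies the URL rule to a piece of page text and stores the matches in a queue. It assumes the rule reads http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)? once the backslashes lost in extraction are restored; the class name UrlRuleSketch and the sample text are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlRuleSketch {
    // Assumed reading of the matching rule, written as a Java string literal.
    static final Pattern URL_RULE =
        Pattern.compile("http://(\\w+(-\\w+)*)(\\.(\\w+(-\\w+)*))*(\\?\\S*)?");

    // Adds every substring of the page text that satisfies the rule to a queue.
    static Deque<String> extract(String pageText) {
        Deque<String> queue = new ArrayDeque<>();
        Matcher m = URL_RULE.matcher(pageText);
        while (m.find()) {
            queue.add(m.group());
        }
        return queue;
    }

    public static void main(String[] args) {
        String page = "visit http://www.my-site.com?q=1 or https://secure.example.org today";
        System.out.println(extract(page)); // only the http:// URL satisfies the rule
    }
}
```

Read this way, the rule accepts only the http:// scheme and stops at the first "/" after the host name, so a path such as /path is not part of the match; a production crawler would probably need a broader rule.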

Further, in step 3, the method by which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule is to crawl all URLs in the web page starting from the URL of the web page: all URLs are extracted from the current page and placed in the queue of the web crawler. A URL is represented by http:// together with the "/" or " " symbols, and all URLs in the current page are extracted by the regular-expression matching rule

http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?

Further, in step 3, the domain name range is all URLs in the web pages obtained from the same URL.

Further, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is to use HTMLParser or SGMLParser to structurally parse the source code of each web page, obtaining the URL, subject, user name, content and time attribute information as a tree structure, and to visit every node of the web page by traversing the logical relationships between the nodes of the tree, thereby generating a DOM tree in XML form.

Further, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is to use HTMLParser or SGMLParser to structurally parse the source code of each web page and obtain all DOM tree node elements in XML form. The DOM tree node elements include the URL, subject, user name, content and time attribute information. Every node of the web page is visited by traversing the logical relationships between the nodes of the tree, and a DOM tree in XML form is generated.

The DOM tree is represented by an undirected tree G(V, E), where V is the node set of the DOM tree and E is the logical relationship between any two DOM tree nodes, i.e. E is the URL connection relationships in the web page. An intermediate tree is generated as follows. First, the URL v(r) of the root node v ∈ V^N and the values of B_v and n_v are read, where B_v is the set of all nodes reachable through arc v. The n_v values of the nodes in V^N \ {r} are then compared, where V^N \ {r} is the node set of the tree with the root node r removed, and the sequence L is constructed in order of the n_v values. H(Y, A) is then initialized: a vertex y is selected to represent the leaf node r, with C_y = {r} and n_y = |N| + 1/2, and the arcs of the |D^N_r| leaf nodes connected to this leaf node are added to the set z. Here D^N_r denotes the set of all URLs of the leaf node v ∈ V, and the known connecting node of a leaf node arc is y.

The intermediate tree is H(Y, A) and the DOM tree is T^N. Each vertex y ∈ Y of the intermediate tree is a set C_y ∈ V^N of nodes of T^N, and the arc set A is a set of edges of T^N. Each node and each leaf node is represented by an independent vertex y ∈ Y. N denotes the set of DOM tree node elements, and |N| is the number of nodes in N. V^N and E^N are the sets of all DOM tree nodes and edges in T^N; the elements of E^N are called the edge set and the elements of V^N the node set of the tree. The intermediate tree H is initialized as empty, and the DOM tree nodes v ∈ V^N are added to H by depth-first traversal until all DOM tree nodes in V^N are in the intermediate tree H.

Each node y ∈ Y of the intermediate tree H represents a set C_y containing one or more child nodes v, and n_y represents the n_v value of those nodes in the DOM tree. Nodes with the same n_v value have URLs that belong to the same connection relationship and can be accessed through the URL. If a node y in H has multiple child nodes, the nodes v ∈ C_y have the same n_v value. An arc a in H represents an edge in T^N. While the arcs in H are read, an arc whose front end is connected to a node but whose other end has no connecting node, or no DOM tree node element, is called a leaf node arc, and the set of leaf node arcs is z. If a leaf node arc is found to have a leaf node during the traversal, it is deleted from the set z. Each arc a ∈ A has a corresponding set B_a, the set of all nodes reachable through arc a. According to z, DOM tree node elements, denoted v', are extracted in turn from the DOM tree nodes by depth-first traversal and added to H(Y, A) as updates until the traversal ends. The final H(Y, A) is the XML DOM tree.
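The depth-first visiting order that this construction relies on can be sketched in a few lines of Java. This is a simplified illustration only: it uses the XML parser bundled with the JDK rather than HTMLParser or SGMLParser, it does not reproduce the bookkeeping of the intermediate tree H(Y, A) or the sets B_a, C_y and z, and the class name DomDepthFirstSketch and the sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class DomDepthFirstSketch {
    // Visits element nodes depth-first (pre-order) and records their names.
    static void visit(Node node, List<String> order) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            order.add(node.getNodeName());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c, order);
        }
    }

    // Parses a well-formed XML string and returns the element names in visit order.
    static List<String> traverse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> order = new ArrayList<>();
            visit(doc.getDocumentElement(), order);
            return order;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<post><subject>s</subject><body><user>u</user><time>t</time></body></post>";
        System.out.println(traverse(xml));
    }
}
```

Each element is recorded before its children, so a node is always added after its parent, which is the property the description uses when it adds nodes to H until every node of V^N has been placed.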

Further, in step 6, the method of traversing the DOM tree to extract key information and generate collection rules is as follows. Traversing the DOM tree to extract key information means extracting the subject, user name, content and time attribute information from a node of the DOM tree; the key information is the subject, user name, content and time attribute information. The generated collection rules are expressed as the regular expressions

(<subject[^>])([\s\S]*?)(</subject>), (<username[^>])([\s\S]*?)(</username>),

(<content[^>])([\s\S]*?)(</content>), (<time[^>])([\s\S]*?)(</time>),

(?is)(?<=<td>).+?(?=</td>),

where:

(?is) is the mode modifier: i means ignore case, and s means single-line mode, in which a newline can be matched;

(?<=<td>) is a positive lookbehind: the matched result must begin after <td>, but <td> itself is not matched, so the result does not include <td>;

.+? matches any characters lazily: after each character is matched, the subsequent expression is attempted, and matching stops as soon as the subsequent expression succeeds;

(?=</td>) is a positive lookahead: the matched result must end before </td>, but </td> itself is not matched, so the result does not include </td>;

\s matches any whitespace character, including space, tab and form feed, and is equivalent to [ \f\n\r\t\v];

\S matches any non-whitespace character and is equivalent to [^ \f\n\r\t\v];

? after a quantifier indicates non-greedy matching.

The regular expressions of the collection rules match the start part, attributes, content and end part of one pair of XML tags.
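A short Java sketch may make the collection rules more concrete. It is written under assumptions: the opening-tag group is spelled <tag[^>]*> so that tags with or without attributes match, the backslashes of [\s\S] and the lookbehind (?<=<td>) are restored, and the class name CollectionRuleSketch, the method names and the sample inputs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CollectionRuleSketch {
    // Returns the content between <tag ...> and </tag>, or null if the tag is absent.
    static String field(String xml, String tag) {
        String name = Pattern.quote(tag);
        Pattern p = Pattern.compile("(<" + name + "[^>]*>)([\\s\\S]*?)(</" + name + ">)");
        Matcher m = p.matcher(xml);
        return m.find() ? m.group(2) : null;  // group 2 is the lazily matched content
    }

    // Returns the text of the first <td> cell, using lookbehind and lookahead
    // so that the surrounding tags are not part of the result.
    static String firstCell(String html) {
        Matcher m = Pattern.compile("(?is)(?<=<td>).+?(?=</td>)").matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(field("<subject id=\"1\">Hello</subject>", "subject"));
        System.out.println(firstCell("<tr><td>first</td><td>second</td></tr>"));
    }
}
```

Because the content group is non-greedy, the match ends at the first closing tag, and the (?is) flags let the table-cell rule span line breaks and ignore the case of the td tag.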

Further, in step 7, the method of collecting page information according to the collection rules is to match, according to the collection rules, the start part, attributes, content and end part of the XML tags for subject, user name, content and time; to call a word segmentation system to segment the subject and content attributes that require segmentation; and then to insert the segmented document object into a database record table. The word segmentation system is the Chinese Academy of Sciences word segmentation system ICTCLAS50, and the page information subject, user name, content and time are added to the dictionary file keydict.txt.

Further, in step 8, having crawled all URLs in the domain name range from the web pages means that no URL can be extracted from the current page, or that no URL in it passes the regular-expression matching rule

http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?

i.e. no URL can be extracted from the current page.

The present invention also provides a web page information acquisition device based on a web crawler, the device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to run in the following units of the device:

A crawler model construction unit, for constructing the web crawler model;

A page acquisition unit, for initiating an HTTP request according to the starting URL to obtain a web page;

A regular-expression matching unit, for the web crawler model to crawl the URLs within the domain name range from the web page according to the regular-expression matching rule;

A multi-page request unit, for initiating HTTP requests according to the URLs in the domain name range to obtain all web pages in the domain name range;

A DOM tree generation unit, for parsing all obtained web pages and generating the XML DOM tree;

A collection rule generation unit, for traversing the DOM tree to extract key information and generate collection rules;

A page information collection unit, for collecting page information according to the collection rules;

A loop collection unit, for cyclically running the page acquisition unit, regular-expression matching unit, multi-page request unit, DOM tree generation unit, collection rule generation unit and page information collection unit until the web crawler model has crawled all URLs in the domain name range from the web pages according to the regular-expression matching rule.

The beneficial effects of the present disclosure are as follows: the present invention provides a web page information acquisition method and device based on a web crawler that simplify the cumbersome data entry work of developers, address the long system entry time and low entry accuracy experienced by business personnel, and reduce the computational overhead of collecting web page data. A large number of irrelevant web pages can be skipped directly, and web pages can be collected directly according to semantic information.

Brief description of the drawings

The above and other features of the present disclosure will become more apparent from the detailed description of the embodiments shown in the accompanying drawings, in which identical reference labels denote identical or similar elements. The drawings described below are evidently only some embodiments of the present disclosure; a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

Fig. 1 is a flow chart of a web page information acquisition method based on a web crawler;

Fig. 2 is a diagram of a web page information acquisition device based on a web crawler.

Specific embodiments

The concept, specific structure and technical effects of the present disclosure are described clearly and completely below with reference to the embodiments and drawings, so that the purpose, solutions and effects of the disclosure can be fully understood. It should be noted that, where no conflict arises, the embodiments of the present application and the features in them may be combined with one another.

Fig. 1 shows a flow chart of a web page information acquisition method based on a web crawler according to the present disclosure. The method according to an embodiment of the present disclosure is explained below with reference to Fig. 1.

The present disclosure proposes a web page information acquisition method based on a web crawler, which specifically comprises the following steps:

Step 1, construct a web crawler model;

Step 7, collect page information according to the collection rules;

The domain name range crawled by the web crawler is configured in the script: first_URLs serves as the template for constructing URLs, and the crawled domain name range is 10000 to 20000, which is used to splice new URLs.

The main code, in Java form, is as follows:

Further, in step 3, the method by which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule is to crawl all URLs in the web page starting from the URL of the web page: all URLs are extracted from the current page and placed in the queue of the web crawler. A URL is represented by http:// together with the "/" or " " symbols, and all URLs in the current page are extracted by the regular-expression matching rule

http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?

Further, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is to use HTMLParser or SGMLParser to structurally parse the source code of each web page and obtain all DOM tree node elements in XML form. The DOM tree node elements include the URL, subject, user name, content and time attribute information. Every node of the web page is visited by traversing the logical relationships between the nodes of the tree, and a DOM tree in XML form is generated.

The DOM tree is represented by an undirected tree G(V, E), where V is the node set of the DOM tree and E is the logical relationship between any two DOM tree nodes, i.e. E is the URL connection relationships in the web page. An intermediate tree is generated as follows. First, the URL v(r) of the root node v ∈ V^N and the values of B_v and n_v are read, where B_v is the set of all nodes reachable through arc v. The n_v values of the nodes in V^N \ {r} are then compared, where V^N \ {r} is the node set of the tree with the root node r removed, and the sequence L is constructed in order of the n_v values. H(Y, A) is then initialized: a vertex y is selected to represent the leaf node r, with C_y = {r} and n_y = |N| + 1/2, where |N| is the number of nodes in N, and the arcs of the |D^N_r| leaf nodes connected to this leaf node are added to the set z. Here D^N_r denotes the set of all URLs of the leaf node v ∈ V, and the known connecting node of a leaf node arc is y.

The intermediate tree is H(Y, A) and the DOM tree is T^N. Each vertex y ∈ Y of the intermediate tree is a set C_y ∈ V^N of nodes of T^N, and the arc set A is a set of edges of T^N. Each node and each leaf node is represented by an independent vertex y ∈ Y. N denotes the set of DOM tree node elements. V^N and E^N are the sets of all DOM tree nodes and edges in T^N; the elements of E^N are called the edge set and the elements of V^N the node set of the tree. The intermediate tree H is initialized as empty, and the DOM tree nodes v ∈ V^N are added to H by depth-first traversal until all DOM tree nodes in V^N are in the intermediate tree H.

Each node y ∈ Y of the intermediate tree H represents a set C_y containing one or more child nodes v, and n_y represents the n_v value of those nodes in the DOM tree. Nodes with the same n_v value have URLs that belong to the same connection relationship and can be accessed through the URL. If a node y in H has multiple child nodes, the nodes v ∈ C_y have the same n_v value. An arc a in H represents an edge in T^N. While the arcs in H are read, an arc whose front end is connected to a node but whose other end has no connecting node, or no DOM tree node element, is called a leaf node arc, and the set of leaf node arcs is z. If a leaf node arc is found to have a leaf node during the traversal, it is deleted from the set z. Each arc a ∈ A has a corresponding set B_a, the set of all nodes reachable through arc a. According to z, DOM tree node elements, denoted v', are extracted in turn from the DOM tree nodes by depth-first traversal and added to H(Y, A) as updates until the traversal ends. The final H(Y, A) is the XML DOM tree.

HTMLParser or SGMLParser is used to parse the HTML file of the web page source code.

HTMLParser or SGMLParser saves the parsed information as a tree structure. Node is the basic data type in which the information is saved.

The definition of Node:

public interface Node extends Cloneable;

The methods in Node fall into several categories:

Functions for traversing the tree, once the web page source code has been structurally parsed to obtain all DOM tree node elements in XML form:

Node getParent(): gets the parent node

NodeList getChildren(): gets the list of child nodes

Node getFirstChild(): gets the first child node

Node getLastChild(): gets the last child node

Node getPreviousSibling(): gets the previous sibling node

Node getNextSibling(): gets the next sibling node

Functions for obtaining the content of a Node:

String getText(): gets the text

String toPlainTextString(): gets the plain text information

String toHtml(): gets the HTML information (original HTML)

String toHtml(boolean verbatim): gets the HTML information (original HTML)

String toString(): gets the string information (original HTML)

Page getPage(): gets the Page object corresponding to this Node

int getStartPosition(): gets the start position of this Node in the HTML page

int getEndPosition(): gets the end position of this Node in the HTML page

Functions for Filter filtering:

void collectInto(NodeList list, NodeFilter filter): filters this node against the filter condition and puts qualifying nodes into list

Functions for Visitor traversal:

void accept(NodeVisitor visitor): applies visitor to this Node

Functions for modifying content, which are used less often:

void setPage(Page page): sets the Page object corresponding to this Node

void setText(String text): sets the text

void setChildren(NodeList children): sets the list of child nodes

Other functions:

void doSemanticAction(): executes the operation corresponding to this Node (only a few Tags have a corresponding operation)

Object clone(): the abstract function of the Cloneable interface

In practice, HTMLParser is mostly used to process HTML pages, for which the Filter and Visitor functions are necessary; the first and second categories of functions are then the most used. The first category is easy to understand; the second category is illustrated below.

The following is the HTML file used for testing:

(<content[^>])([\s\S]*?)(</content>), (<time[^>])([\s\S]*?)(</time>),

(?is)(?<=<td>).+?(?=</td>),

Preferably, an embodiment of the web page information acquisition method based on a web crawler provided by the present disclosure is as follows:

Step A, construct a web crawler model;

Step B, the web crawler model crawls the URLs within the domain name range from the web page and, by modifying the HTTP data packet of the web page, crawls the page returned by a preset URL. The modified HTTP data packet of the web page is the returned query result for user-specified data values, and the HTTP page elements are parsed to obtain the user-specified data values. The page returned by the preset URL is the page returned by executing the method that queries the user-specified data values, and the user-specified data values include the page information subject, user name, content and time;

Step C, package the data into a data set in Json format by modifying the HTTP data packet of the web page, and extract the Json data set to the background server;

Step D, parse the Json data set in the server background and, based on the Json key-value pair format, convert between Json and Java business data entity objects across systems. According to the configured data entity object model, check whether the extracted data set has the same data types (for example, a character string cannot be converted into a numeric type), and reject invalid data whose data types fail the check. Such invalid data is data in which any one of the page information subject, user name, content or time attributes does not conform;

Step E, according to the configured data entity object model, put the specified data into the specified attributes of the data entity object to assign values to the entity object;

Step F, convert the assigned entity object data into Json format and output it from the server to the client;

Step G, parse the data in Json format, modify the user's HTTP data packet according to the Json key-value pairs, and write it to the server.

Further, in step D, the method of converting between Json and Java business data entity objects across systems is that the Java application reads the data in Json format and, using existing Json techniques, formats it into data readable by the Java application.
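The data type check of step D can be sketched as follows. This is a minimal illustration under assumptions, not the embodiment's code: the Json data set is taken to be already parsed into string key-value pairs, the entity model (attribute names and the choice of a numeric time stamp) is hypothetical, and the class name TypeCheckSketch is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TypeCheckSketch {
    // Hypothetical entity object model: attribute name -> expected Java type.
    static final Map<String, Class<?>> MODEL = Map.of(
        "subject", String.class,
        "userName", String.class,
        "content", String.class,
        "time", Long.class);

    // Returns true when every modelled attribute is present and convertible.
    static boolean conforms(Map<String, String> record) {
        for (Map.Entry<String, Class<?>> e : MODEL.entrySet()) {
            String value = record.get(e.getKey());
            if (value == null) {
                return false;                 // attribute missing
            }
            if (e.getValue() == Long.class) {
                try {
                    Long.parseLong(value);    // a string that is not numeric fails here
                } catch (NumberFormatException ex) {
                    return false;
                }
            }
        }
        return true;
    }

    // Keeps only the records whose data types pass the check.
    static List<Map<String, String>> rejectInvalid(List<Map<String, String>> records) {
        List<Map<String, String>> valid = new ArrayList<>();
        for (Map<String, String> r : records) {
            if (conforms(r)) {
                valid.add(r);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        Map<String, String> ok = Map.of(
            "subject", "s", "userName", "u", "content", "c", "time", "1544313600");
        Map<String, String> bad = Map.of(
            "subject", "s", "userName", "u", "content", "c", "time", "yesterday");
        System.out.println(rejectInvalid(List.of(ok, bad)).size());
    }
}
```

Records that pass the check would then be assigned to the entity object attributes as in step E; records that fail on any one attribute are dropped, matching the rejection rule described in step D.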

Fig. 2 shows a diagram of a web page information acquisition device based on a web crawler provided by an embodiment of the present disclosure. The device of this embodiment comprises a processor, a memory, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps in the above embodiment of the web page information acquisition device based on a web crawler.

The device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to run in the following units of the device:

A crawler model construction unit, for constructing the web crawler model;

The web page information acquisition device based on a web crawler can run on computing devices such as desktop computers, notebooks, palmtop computers and cloud servers. The runnable device may include, but is not limited to, a processor and a memory. A person skilled in the art will understand that this is merely an example of a web page information acquisition device based on a web crawler and does not limit it; the device may include more or fewer components than in the example, combine certain components, or use different components. For example, it may also include input and output devices, network access devices, buses and so on.

The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control centre of the runnable device of the web page information acquisition device based on a web crawler, and connects the parts of the entire runnable device through various interfaces and lines.

The memory can be used to store the computer program and/or modules. The processor implements the various functions of the web page information acquisition device based on a web crawler by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area can store data created through use of a mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

Preferably, the web crawler model of the present disclosure can be injected at the C/S front end:

1) Through C/S, the web crawler model enumerates IE to find the operation page object and implements script injection.

2) Through C/S, the web crawler model checks the server and client version numbers to implement real-time updates.

jQuery is a JavaScript library compatible with multiple browsers; its function is to provide AJAX interaction for websites.

The web crawler model uses jQuery to add custom elements to the current page and calls AJAX to interact with the background.

In data-based applications, the web crawler model can obtain data from a server side independent of the actual web page and write it dynamically into the web page. The background returns the data to the front end through AJAX, and the relevant data is parsed and written into the record table of the relevant database.

Although the description of the present disclosure is quite detailed and several embodiments are described in particular, it is not intended to be limited to any of these details or embodiments or to any particular embodiment. Rather, it should be construed, by reference to the appended claims and in view of the prior art, as providing the broadest possible interpretation of those claims, so as to effectively cover the intended scope of the disclosure. Furthermore, the disclosure is described above in terms of embodiments foreseeable by the inventors for the purpose of providing a useful description, and insubstantial changes to the disclosure that are not presently foreseen may nonetheless represent equivalents of it.

Claims

1. A web page information acquisition method based on a web crawler, characterized in that the method comprises the following steps:

Step 1, construct a web crawler model;

Step 7, collect page information according to the collection rules;

Step 8, repeat steps 2 to 7 until the web crawler model has crawled all URLs in the domain name range from the web pages according to the regular-expression matching rule.

2. The web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 1, the method of constructing the web crawler model comprises the following steps:

Step 1.4, construct the regular-expression matching rule of the web crawler, the web crawler storing each URL that satisfies the matching rule in the crawl queue.

3. The web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 3, the method by which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule is to crawl all URLs in the web page starting from the URL of the web page, all URLs being extracted from the current page and placed in the queue of the web crawler, and all URLs in the current page being extracted by the regular-expression matching rule

http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?

4. The web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 3, the domain name range is all URLs in the web pages obtained from the same URL.

5. The web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is to use HTMLParser or SGMLParser to structurally parse the source code of each web page, obtaining the URL, subject, user name, content and time attribute information as a tree structure, and to visit every node of the web page by traversing the logical relationships between the nodes of the tree, thereby generating a DOM tree in XML form.

6. The web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 5, the method of parsing all acquired web pages and generating an XML DOM tree is as follows: HTMLParser or SGMLParser is used to structurally parse the source code of each web page to obtain all DOM tree node elements in XML form, the DOM tree node elements comprising URL, theme, user name, content, and time attribute information; each node of the web page is visited by traversing the logical relationships between the nodes of the tree, generating the DOM tree in XML form; the DOM tree is represented as an undirected tree G(V, E), where V is the node set of the DOM tree and E is the logical relationship between any two DOM tree nodes, that is, E is the connection relationship between the URLs in the web page;

an intermediary tree is generated as follows: first, the root node v ∈ V^N, whose URL is v(r), is read together with its B_v and n_v values, where B_v denotes the set of all nodes reachable through arc v; next, the n_v values of the nodes in V^N \ {r} are compared, where V^N \ {r} is the node set of the tree with the root node r removed, and a sequence L is constructed in order of n_v value; H(Y, A) is then initialized by selecting a vertex y to represent the leaf node r, setting C_y = {r} and n_y = |N|+1/2, and adding the |D^N_r| arcs connected to the leaf node to the set Z, with y recorded as the known connecting node of each leaf node arc;

the intermediary tree is H(Y, A) and the DOM tree is T^N, wherein each vertex y ∈ Y of the intermediary tree is a set C_y of nodes of T^N taken from V^N, and the arc set A is the set of edges of T^N; each node and each leaf node is represented by a separate vertex y ∈ Y; N denotes the set of DOM tree node elements; V^N and E^N are the sets of all DOM tree nodes and all edges in T^N, E^N being called the edge set and V^N the node set of the tree; the intermediary tree H is initialized to empty, and the DOM tree nodes v ∈ V^N are added to H by depth-first traversal until all DOM tree nodes in V^N are in H; each node y ∈ Y of H represents a set C_y containing one or more child nodes v, and n_y represents the n_v value of those nodes in the DOM tree; nodes with the same n_v value have URLs belonging to the same connection relationship and can be reached through those URLs, so that if a node y has multiple child nodes, the nodes v ∈ C_y in H all have the same n_v value;

an arc a in H represents an edge in T^N; while the arcs in H are read, an arc whose front end is connected to a node but whose other end has no connecting node or no DOM tree node element is called a leaf node arc, and the set of leaf node arcs is Z; if a leaf node arc is found to have a leaf node during traversal, that arc is deleted from Z; each arc a ∈ A has a corresponding set B_a, the set of all nodes reachable through arc a; following the depth-first traversal, DOM tree node elements, denoted v', are extracted in turn from the DOM tree nodes by way of Z and added to H(Y, A) as an update.
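
As an illustration of the structural parsing and depth-first traversal recited in claim 6, the following is a minimal sketch using the HTMLParser class from Python's standard html.parser module. The Node and DomBuilder classes, the sample markup, and the tag names url, theme, username, content, and time are assumptions made for the example; the claim does not fix them. The sketch covers only tree construction and traversal, not the intermediary tree H(Y, A).

```python
from html.parser import HTMLParser


class Node:
    """One DOM tree node: tag name, attributes, text, parent, and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = ""
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simple tree; assumes the input markup is well nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Step back up to the parent of the element being closed.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def depth_first(node):
    """Yield the node and all of its descendants in depth-first pre-order."""
    yield node
    for child in node.children:
        yield from depth_first(child)


# Hypothetical field tags and sample page, for illustration only.
FIELDS = {"url", "theme", "username", "content", "time"}
SAMPLE = (
    "<post><url>http://example.com/a</url><theme>News</theme>"
    "<username>alice</username><content>Hello</content>"
    "<time>2018-12-09</time></post>"
)

builder = DomBuilder()
builder.feed(SAMPLE)
builder.close()  # flush any buffered text

tags = [n.tag for n in depth_first(builder.root)]
fields = {n.tag: n.text for n in depth_first(builder.root) if n.tag in FIELDS}
```

Here tags lists the nodes in the order a depth-first traversal visits them, and fields holds the five attribute values gathered along the way.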

7. The web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 6, the method of traversing the DOM tree and extracting key information to generate a collection rule is as follows: the DOM tree is traversed and the key information, namely the theme, user name, content, and time attribute information, is extracted from a node of the DOM tree; the generated collection rule is expressed by the following regular expressions:

(<content[^>])([\s\S]*?)(</content>), (<time[^>])([\s\S]*?)(</time>),

(?is)(?<=<td>).+?(?=</td>).
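
The collection rules of claim 7 can be tried out with Python's standard re module, as in the sketch below. Two assumptions are made. First, the opening-tag group is taken to be meant to match the whole start tag, so it is written as <tag[^>]*> rather than with a single-character class. Second, the helper make_rule and the field names are hypothetical and serve only to show one rule being generated per field.

```python
import re


def make_rule(tag):
    """Build a rule for one field; groups are start tag, content, end tag."""
    # Assumption: [^>]*> lets the start tag carry zero or more attributes.
    return re.compile(r"(<%s[^>]*>)([\s\S]*?)(</%s>)" % (tag, tag))


# One rule per key-information field (field names are illustrative).
RULES = {name: make_rule(name)
         for name in ("theme", "username", "content", "time")}

# Table-cell rule: (?is) makes the match case-insensitive and lets "."
# match newlines; the lookbehind and lookahead keep the <td> tags out
# of the matched text.
TD_RULE = re.compile(r"(?is)(?<=<td>).+?(?=</td>)")

page = '<content id="c1">Hello\nworld</content><time>2018-12-09</time>'
content = RULES["content"].search(page).group(2)
when = RULES["time"].search(page).group(2)
cells = TD_RULE.findall("<tr><TD>a</TD><td>b\nc</td></tr>")
```

The non-greedy quantifiers stop each match at the first closing tag, which keeps adjacent fields or cells from being merged into a single match.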

8. The web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 7, the method of acquiring page information according to the collection rule is to acquire, according to the collection rule, the beginning, attribute, content, and end portions of the XML tags that match the theme, user name, content, and time.
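
One possible reading of claim 8 is that each matched XML tag is split into its beginning, attributes, content, and end. The sketch below shows that split with Python's re module. The TagParts type, the collect function, and the double-quoted attribute syntax are assumptions for illustration and are not taken from the claim.

```python
import re
from typing import NamedTuple, Optional


class TagParts(NamedTuple):
    start: str         # full opening tag, including attributes
    attributes: dict   # attribute name -> value
    content: str       # text between the opening and closing tags
    end: str           # closing tag


# Assumption: attributes are written as name="value" with double quotes.
ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def collect(page: str, tag: str) -> Optional[TagParts]:
    """Return the four parts of the first <tag>...</tag> in page, or None."""
    name = re.escape(tag)
    match = re.search(r"(<%s\b([^>]*)>)([\s\S]*?)(</%s>)" % (name, name), page)
    if match is None:
        return None
    return TagParts(
        start=match.group(1),
        attributes=dict(ATTR.findall(match.group(2))),
        content=match.group(3),
        end=match.group(4),
    )


parts = collect('<theme lang="en">Crawler news</theme>', "theme")
missing = collect('<theme lang="en">Crawler news</theme>', "time")
```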

9. A web page information acquisition method based on a web crawler, characterized in that the method comprises the following steps:

Step A, constructing a web crawler model;

Step B, the web crawler model, using the URLs crawled from web pages within the domain name range, modifies the HTTP data packet of the web page and crawls again the page returned by a preset URL; the modified HTTP data packet of the web page is the returned query result for the user-specified data values, and the user-specified data values are obtained by parsing the HTTP page elements; the page returned by the preset URL is the page returned by executing the method that queries the user-specified data values; the user-specified data values comprise the message theme, user name, content, and time of the page;

Step C, packaging the data into a JSON-format data set by modifying the HTTP data packet of the web page, and extracting the JSON data set to the back-end server;

Step D, parsing the JSON data set in the server back end; converting between JSON and Java business data entity objects on the basis of the JSON key-value pair format; verifying, according to the configured data entity object model, whether the extracted data set has matching data types, and rejecting invalid data whose data types do not match, the invalid data being data in which any one of the page information attributes theme, user name, content, and time does not conform;

Step E, placing the specified data into the specified attributes of the data entity object according to the configured data entity object model, thereby assigning values to the entity object;

Step G, parsing the JSON-format data, modifying the user's HTTP data packet according to the JSON key-value pair format, and writing it to the server.
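
Steps D and E (parsing the JSON data set, rejecting records whose attribute types do not match the entity model, and assigning the remaining values to entity attributes) can be sketched as follows. The claim refers to Java business data entity objects; a Python dataclass stands in for one here. The PageInfo class, the key names, and the year-month-day time format are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PageInfo:
    """Stand-in for the configured data entity object model (assumed)."""
    theme: str
    username: str
    content: str
    time: datetime


def to_entity(record):
    """Return a PageInfo, or None if any attribute is missing or mistyped."""
    try:
        values = {k: record[k] for k in ("theme", "username", "content", "time")}
    except (KeyError, TypeError):
        return None  # missing key, or the record is not a JSON object
    if not all(isinstance(v, str) for v in values.values()):
        return None  # data type does not match the entity model
    try:
        # Assumed time format: YYYY-MM-DD.
        values["time"] = datetime.strptime(values["time"], "%Y-%m-%d")
    except ValueError:
        return None
    return PageInfo(**values)  # entity object assignment


def load_entities(payload):
    """Parse a JSON array and keep only the records that pass verification."""
    return [e for e in map(to_entity, json.loads(payload)) if e is not None]


payload = json.dumps([
    {"theme": "News", "username": "alice", "content": "Hi", "time": "2018-12-09"},
    {"theme": "Bad", "username": 42, "content": "x", "time": "2018-12-09"},
    {"theme": "Bad", "username": "bob", "content": "x", "time": "not a date"},
])
entities = load_entities(payload)
```

Only the first record survives: the second has a non-string user name and the third has a time value that cannot be parsed.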

10. A web page information acquisition device based on a web crawler, characterized in that the device comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to run the following units of the device:

a crawler model construction unit, for constructing a web crawler model;

a regular-expression matching unit, for the web crawler model to crawl the URLs within the domain name range from web pages according to a regular-expression matching rule;

a multi-page request unit, for initiating HTTP requests according to the URLs within the domain name range to obtain all web pages within the domain name range;

a loop acquisition unit, for cyclically executing the page acquisition unit, the regular-expression matching unit, the multi-page request unit, the DOM tree generation unit, the collection rule generation unit, and the page information acquisition unit until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
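
The interplay of the regular-expression matching unit, the multi-page request unit, and the loop acquisition unit can be sketched as a loop that keeps requesting pages until no unvisited URL within the domain name range remains. In the sketch below, the crawl function, the HREF pattern, and the in-memory SITE dictionary used in place of real HTTP requests are all assumptions for illustration; "within the domain name range" is interpreted as sharing the start URL's host.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Assumed matching rule: the quoted value of an href attribute.
HREF = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def crawl(start_url, fetch, handle_page):
    """Visit every page reachable from start_url on the same host."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:  # loop until every in-range URL is crawled
        url = queue.popleft()
        html = fetch(url)                     # multi-page request
        handle_page(url, html)                # parsing and acquisition
        for link in HREF.findall(html):       # regular-expression matching
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen


# In-memory stand-in for a web site, so the sketch runs without a network.
SITE = {
    "http://example.com/": '<a href="/a">a</a> <a href="http://other.org/x">x</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="b">b</a>',
    "http://example.com/b": "",
}

pages = []
visited = crawl(
    "http://example.com/",
    lambda url: SITE.get(url, ""),
    lambda url, html: pages.append(url),
)
```

The link to other.org is skipped because its host differs from that of the start URL, and the seen set prevents any page from being requested twice.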