CN109657121A - Web page information acquisition method and device based on a web crawler


Info

Publication number
CN109657121A
Authority
CN
China
Prior art keywords
node
url
web
dom tree
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811499475.4A
Other languages
Chinese (zh)
Inventor
高培玉
林华
林一华
郭茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Jinsui Data Service Co Ltd
Original Assignee
Foshan Jinsui Data Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Jinsui Data Service Co Ltd
Priority to CN201811499475.4A
Publication of CN109657121A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web page information acquisition method and device based on a web crawler. A web crawler model is constructed; HTTP requests are issued for the URLs within the domain name range to obtain all web pages in that range; the DOM tree is traversed to extract key information and generate collection rules; and page information is collected according to those rules. This simplifies developers' cumbersome data entry operations, addresses the long system entry time and low entry accuracy experienced by business personnel, reduces the computational overhead of collecting web page data, allows large numbers of irrelevant web pages to be skipped directly, and supports collecting web page information directly according to semantic information.

Description

Web page information acquisition method and device based on a web crawler
Technical field
The present disclosure relates to the field of data collection, and in particular to a web page information acquisition method and device based on a web crawler.
Background
In traditional web page text collection, the results gathered from web pages of various kinds often include a large number of pages unrelated to the collection topic. The goal of such collection is the widest possible network coverage, which sharpens the contradiction between limited search engine resources and practically unlimited web text resources. The variety of data forms in pages keeps driving the growth of network information: massive amounts of images, video and audio multimedia, text documents and other data appear on the network every day. Current techniques are largely limited to processing this content; they cannot comprehensively discover and obtain multi-source heterogeneous data in the collected pages, and they have difficulty supporting web page information collection based on semantic information. In addition, during the development of some network-based systems, much data needs to be reused at the upper level. On the one hand this increases the data entry work of front-office staff; on the other hand, data from auxiliary systems built earlier cannot be used, causing considerable waste of resources.
Summary of the invention
The present disclosure provides a web page information acquisition method and device based on a web crawler, in which a web crawler model captures the key information in web pages by means of regular-expression rules.
To achieve the above object, according to one aspect of the present disclosure, a web page information acquisition method based on a web crawler is provided, the method comprising the following steps:
Step 1, construct a web crawler model;
Step 2, issue an HTTP request to the starting URL and obtain the web page;
Step 3, the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule;
Step 4, issue HTTP requests to the URLs within the domain name range and obtain all web pages in the domain name range;
Step 5, parse all obtained web pages and generate an XML DOM tree;
Step 6, traverse the DOM tree, extract key information and generate collection rules;
Step 7, collect page information according to the collection rules;
Step 8, repeat steps 2 to 7 until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
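The loop formed by steps 2 to 8 can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the class name CrawlLoopSketch, the simplified LINK pattern, the scopePrefix test for the domain name range, and the fetcher function standing in for the HTTP request are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CrawlLoopSketch {
    // Simplified link pattern (an assumption for this sketch, not the rule of step 1.4).
    private static final Pattern LINK = Pattern.compile("http://[\\w.-]+(/[\\w./-]*)?");

    /** Visits pages breadth-first until no unvisited in-scope URL remains (step 8). */
    static List<String> crawl(String startUrl, String scopePrefix,
                              Function<String, String> fetcher) {
        Deque<String> queue = new ArrayDeque<>();   // crawl queue holding URLs
        Set<String> seen = new LinkedHashSet<>();   // URLs already queued
        List<String> visited = new ArrayList<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = fetcher.apply(url);       // steps 2 and 4: obtain the page
            visited.add(url);
            if (html == null) {
                continue;                           // nothing to parse for this URL
            }
            Matcher m = LINK.matcher(html);         // step 3: pull URLs out of the page
            while (m.find()) {
                String found = m.group();
                // Hypothetical scope test: keep URLs that share the given prefix.
                if (found.startsWith(scopePrefix) && seen.add(found)) {
                    queue.add(found);
                }
            }
            // Steps 5 to 7 (DOM tree, collection rules, collection) would run here.
        }
        return visited;
    }
}
```

Passing the fetcher as a function keeps the sketch independent of any particular HTTP client; a real crawler would supply one that performs the request.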
Further, in step 1, the method of constructing the web crawler model comprises the following steps:
Step 1.1, create a configuration script for the web crawler and write the starting URL into the configuration script;
Step 1.2, configure in the configuration script the domain name range that the web crawler crawls;
Step 1.3, construct the web crawler's queue, whose elements are the stored URLs, i.e. the queue is used to store URLs;
Step 1.4, construct the web crawler's regular-expression matching rule. The regular expression for URLs in the web crawler model is http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?. The rule matches URLs that begin with http:// and contain "/" or "." symbols, and the web crawler stores the URLs that satisfy the rule in the crawl queue.
Further, in step 3, the method by which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule is as follows: starting from the URL of the web page, all URLs in the web page are crawled, i.e. all URLs are extracted from the current page and placed in the web crawler's queue. A URL consists of http:// together with "/" or "." symbols, and all URLs in the current page are extracted with the regular-expression matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?
Further, in step 3, the domain name range is the set of all URLs in the web pages obtained from the same URL.
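As an illustration of step 3, the sketch below applies a URL rule to page text and keeps the distinct matches for one host. The pattern is a reconstruction of the garbled expression in the text (the escapes \w, \. and \?\S* are assumed), and the class name UrlExtractorSketch, the helper hostOf and the host comparison used as the "domain name range" are hypothetical. Note that this pattern, as reconstructed, matches a host and an optional query string but no path segment.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlExtractorSketch {
    // Reconstructed URL rule (assumption): host labels with optional hyphens, optional query.
    static final Pattern URL_RULE = Pattern.compile(
        "http://(\\w+(-\\w+)*)(\\.(\\w+(-\\w+)*))*(\\?\\S*)?");

    /** Returns the distinct URLs in pageText whose host equals allowedHost. */
    static List<String> extract(String pageText, String allowedHost) {
        Set<String> unique = new LinkedHashSet<>();
        Matcher m = URL_RULE.matcher(pageText);
        while (m.find()) {
            String url = m.group();
            if (hostOf(url).equals(allowedHost)) {
                unique.add(url);          // would be placed in the crawl queue
            }
        }
        return new ArrayList<>(unique);
    }

    /** Host part of a matched URL: text after "http://" up to an optional '?'. */
    static String hostOf(String url) {
        String rest = url.substring("http://".length());
        int q = rest.indexOf('?');
        return q >= 0 ? rest.substring(0, q) : rest;
    }
}
```

Because \S* runs to the next whitespace character, this rule suits whitespace-separated text; applying it to raw HTML attributes would also capture trailing quote characters.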
Further, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is as follows: HTMLParser or SGMLParser performs structural parsing on the source code of the web page to obtain the URL, subject, user name, content and time attribute information as a tree structure, and each node of the web page is visited by traversing the logical relationships between the nodes of the tree, generating a DOM tree in XML form.
Further, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is as follows: HTMLParser or SGMLParser performs structural parsing on the source code of the web page to obtain all DOM tree node elements in XML form, the DOM tree node elements comprising URL, subject, user name, content and time attribute information, and each node of the web page is visited by traversing the logical relationships between the nodes of the tree, generating a DOM tree in XML form.
The DOM tree is represented as an undirected tree G(V, E), where V is the node set of the DOM tree and E is the logical relationship between any two DOM tree nodes, i.e. E is the connection relationship between the URLs in the web page.
An intermediate tree is generated as follows. First, for the root node v ∈ V_N, whose URL is v(r), the values B_v and n_v are read, where B_v is the set of all nodes reachable through arc v. Then the n_v values of the nodes in V_N\{r} are compared, where V_N\{r} is the node set of the tree with the root node r removed, and a sequence L is constructed in order of n_v value. Next, H(Y, A) is initialized: a vertex y is selected to represent the leaf node r, with C_y = {r} and n_y = |N|+1/2, and the arcs of the |D_N^r| leaf nodes connected to the leaf node are added to the set z, where D_N^r denotes the set of all URLs of the leaf node v ∈ V. The known connected node of a leaf node arc is taken to be y.
The notation is as follows. The intermediate tree is H(Y, A) and the DOM tree is T_N. Each vertex y ∈ Y of the intermediate tree is a set C_y ∈ V_N of nodes in T_N, and the arcs A are the set of edges in T_N; each node and each leaf node is represented by a separate vertex y ∈ Y. N denotes the set of DOM tree node elements, and |N| is the number of nodes in N. V_N and E_N are the sets of all DOM tree nodes and edges in T_N; E_N is called the edge set and V_N the tree node set.
The intermediate tree H is initialized as empty, and the DOM tree nodes v ∈ V_N are added to H by depth-first traversal until all DOM tree nodes in V_N are in H. Each node y ∈ Y of H represents a set C_y containing one or more child nodes v, and n_y represents the n_v value of the nodes in the DOM tree. Nodes with the same n_v value have URLs that belong to the same connection relationship and can be reached from one another through those URLs. If a node y in H has multiple child nodes, the nodes v ∈ C_y have the same n_v value. An arc a in H represents an edge in T_N. While the arcs in H are being read, an arc whose front end is connected to a node but whose other end has no connected node, or no DOM tree node element, is called a leaf node arc, and the set of leaf node arcs is z. If during traversal a leaf node arc turns out to have a leaf node, the arc is deleted from z. Each arc a ∈ A has a corresponding set B_a, the set of all nodes reachable through arc a. Following the depth-first traversal, DOM tree node elements, denoted v', are extracted in turn from the DOM tree nodes according to z and added to H(Y, A), which is updated until the traversal ends. The final H(Y, A) is the XML DOM tree.
Further, in step 6, the method of traversing the DOM tree, extracting key information and generating collection rules is as follows: the DOM tree is traversed and the key information, namely the subject, user name, content and time attribute information, is extracted from a node of the DOM tree. The generated collection rules are expressed as the following regular expressions:
(<subject[^>]*>)([\s\S]*?)(</subject>), (<user name[^>]*>)([\s\S]*?)(</user name>),
(<content[^>]*>)([\s\S]*?)(</content>), (<time[^>]*>)([\s\S]*?)(</time>),
(?is)(?<=<td>).+?(?=</td>),
where:
(?is) is a mode modifier: i means ignore case, and s means single-line mode, in which "." also matches line breaks;
(?<=<td>) is a positive lookbehind: the match must be preceded by <td>, but <td> itself is not part of the match, so the result does not include <td>;
.+? matches any characters lazily: after each character it consumes, it tries to match the following expression, and it keeps consuming only until the following expression succeeds;
(?=</td>) is a positive lookahead: the match must be followed by </td>, but </td> itself is not part of the match, so the result does not include </td>;
\s matches any whitespace character, including space, tab and form feed, and is equivalent to [ \f\n\r\t\v];
\S matches any non-whitespace character and is equivalent to [^ \f\n\r\t\v];
? after a quantifier indicates non-greedy matching.
The regular expressions of the collection rules match the start, attributes, content and end of one pair of XML tags.
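A collection rule of this shape can be exercised with a short sketch. The tag pattern below assumes the form (<tag[^>]*>)([\s\S]*?)(</tag>), which is a reading of the garbled expression in the text rather than a confirmed original, and the class name CollectionRuleSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CollectionRuleSketch {
    // Lookaround rule for table cells: case-insensitive, "." also matches line breaks.
    static final Pattern TD_RULE = Pattern.compile("(?is)(?<=<td>).+?(?=</td>)");

    /** Builds the three-group rule: opening tag with attributes, content, closing tag. */
    static Pattern ruleFor(String tag) {
        String t = Pattern.quote(tag);
        return Pattern.compile("(<" + t + "[^>]*>)([\\s\\S]*?)(</" + t + ">)");
    }

    /** Collects the content (group 2) of the first match for each requested tag. */
    static Map<String, String> collect(String xml, String... tags) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String tag : tags) {
            Matcher m = ruleFor(tag).matcher(xml);
            if (m.find()) {
                out.put(tag, m.group(2).trim());
            }
        }
        return out;
    }

    /** Returns the text of every <td> cell without the surrounding tags. */
    static List<String> cells(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TD_RULE.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
```

The lazy quantifiers keep each match inside one pair of tags; a greedy `*` would run on to the last closing tag in the document.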
Further, in step 7, the method of collecting page information according to the collection rules is as follows: the start, attributes, content and end of the XML tags for subject, user name, content and time are matched and collected according to the collection rules; the word segmentation system is called to segment the subject and content attributes that need segmentation; the segmented document objects are then inserted into the database record table. The word segmentation system is the Chinese Academy of Sciences word segmentation system ICTCLAS50, and the page information (subject, user name, content, time) is added to the dictionary file keydict.txt.
Further, in step 8, having crawled all URLs within the domain name range from the web pages means that no URL can be extracted from the current page, or that no URL in it satisfies the regular-expression matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?, i.e. no further URL can be extracted from the current page.
The present invention also provides a web page information acquisition device based on a web crawler, the device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to run the following units of the device:
a crawler model construction unit, for constructing the web crawler model;
a page acquisition unit, for issuing an HTTP request to the starting URL and obtaining the web page;
a regular-expression matching unit, with which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule;
a multi-page request unit, for issuing HTTP requests to the URLs within the domain name range and obtaining all web pages in the domain name range;
a DOM tree generation unit, for parsing all obtained web pages and generating an XML DOM tree;
a collection rule generation unit, for traversing the DOM tree, extracting key information and generating collection rules;
a page information collection unit, for collecting page information according to the collection rules;
a loop collection unit, for repeatedly running the page acquisition unit, regular-expression matching unit, multi-page request unit, DOM tree generation unit, collection rule generation unit and page information collection unit until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
The beneficial effects of the present disclosure are as follows: the present invention provides a web page information acquisition method and device based on a web crawler that simplifies developers' cumbersome data entry operations, addresses the long system entry time and low entry accuracy experienced by business personnel, reduces the computational overhead of collecting web page data, allows large numbers of irrelevant web pages to be skipped directly, and supports collecting web page information directly according to semantic information.
Brief description of the drawings
The above and other features of the present disclosure will become more apparent from the detailed description of the embodiments shown in the accompanying drawings, in which identical reference numerals denote identical or similar elements. The drawings described below are evidently only some embodiments of the present disclosure; a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a flowchart of a web page information acquisition method based on a web crawler;
Fig. 2 is a diagram of a web page information acquisition device based on a web crawler.
Detailed description of the embodiments
The concept, specific structure and technical effects of the present disclosure are described clearly and completely below with reference to the embodiments and the drawings, so that the purpose, scheme and effects of the present disclosure can be fully understood. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.
Fig. 1 is a flowchart of a web page information acquisition method based on a web crawler according to the present disclosure. A web page information acquisition method based on a web crawler according to an embodiment of the present disclosure is described below with reference to Fig. 1.
The present disclosure proposes a web page information acquisition method based on a web crawler, which specifically comprises the following steps:
Step 1, construct a web crawler model;
Step 2, issue an HTTP request to the starting URL and obtain the web page;
Step 3, the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule;
Step 4, issue HTTP requests to the URLs within the domain name range and obtain all web pages in the domain name range;
Step 5, parse all obtained web pages and generate an XML DOM tree;
Step 6, traverse the DOM tree, extract key information and generate collection rules;
Step 7, collect page information according to the collection rules;
Step 8, repeat steps 2 to 7 until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
Further, in step 1, the method of constructing the web crawler model comprises the following steps:
Step 1.1, create a configuration script for the web crawler and write the starting URL into the configuration script;
Step 1.2, configure in the configuration script the domain name range that the web crawler crawls;
To configure the domain name range that the web crawler crawls: in the script, first_URLs serves as the template for constructing URLs, and the crawl range is 10000 to 20000, which is used for splicing new URLs.
The main code in Java form is as follows:
Step 1.3, construct the web crawler's queue, whose elements are the stored URLs, i.e. the queue is used to store URLs;
Step 1.4, construct the web crawler's regular-expression matching rule. The regular expression for URLs in the web crawler model is http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?. The rule matches URLs that begin with http:// and contain "/" or "." symbols, and the web crawler stores the URLs that satisfy the rule in the crawl queue.
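The splicing of new URLs from a template and a numeric range, as described for step 1.2, might look like the sketch below. It is not the listing referred to in the text, which is not reproduced here; the class name SeedUrlSketch, the method spliceUrls, the inclusive treatment of the range and the example template are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class SeedUrlSketch {
    /**
     * Appends each identifier in [from, to] to the URL template.
     * Treating both ends of the range as inclusive is an assumption.
     */
    static List<String> spliceUrls(String firstUrlTemplate, int from, int to) {
        List<String> urls = new ArrayList<>();
        for (int id = from; id <= to; id++) {
            urls.add(firstUrlTemplate + id);   // e.g. template + "10000"
        }
        return urls;
    }
}
```

With a hypothetical template such as "http://example.com/item?id=" and the range 10000 to 20000, this would yield 10001 candidate URLs to place in the crawl queue of step 1.3.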
Further, in step 3, the method by which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule is as follows: starting from the URL of the web page, all URLs in the web page are crawled, i.e. all URLs are extracted from the current page and placed in the web crawler's queue. A URL consists of http:// together with "/" or "." symbols, and all URLs in the current page are extracted with the regular-expression matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?
Further, in step 3, the domain name range is the set of all URLs in the web pages obtained from the same URL.
Further, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is as follows: HTMLParser or SGMLParser performs structural parsing on the source code of the web page to obtain the URL, subject, user name, content and time attribute information as a tree structure, and each node of the web page is visited by traversing the logical relationships between the nodes of the tree, generating a DOM tree in XML form.
Further, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is as follows: HTMLParser or SGMLParser performs structural parsing on the source code of the web page to obtain all DOM tree node elements in XML form, the DOM tree node elements comprising URL, subject, user name, content and time attribute information, and each node of the web page is visited by traversing the logical relationships between the nodes of the tree, generating a DOM tree in XML form.
The DOM tree is represented as an undirected tree G(V, E), where V is the node set of the DOM tree and E is the logical relationship between any two DOM tree nodes, i.e. E is the connection relationship between the URLs in the web page.
An intermediate tree is generated as follows. First, for the root node v ∈ V_N, whose URL is v(r), the values B_v and n_v are read, where B_v is the set of all nodes reachable through arc v. Then the n_v values of the nodes in V_N\{r} are compared, where V_N\{r} is the node set of the tree with the root node r removed, and a sequence L is constructed in order of n_v value. Next, H(Y, A) is initialized: a vertex y is selected to represent the leaf node r, with C_y = {r} and n_y = |N|+1/2, where |N| is the number of nodes in N, and the arcs of the |D_N^r| leaf nodes connected to the leaf node are added to the set z, where D_N^r denotes the set of all URLs of the leaf node v ∈ V. The known connected node of a leaf node arc is taken to be y.
The notation is as follows. The intermediate tree is H(Y, A) and the DOM tree is T_N. Each vertex y ∈ Y of the intermediate tree is a set C_y ∈ V_N of nodes in T_N, and the arcs A are the set of edges in T_N; each node and each leaf node is represented by a separate vertex y ∈ Y. N denotes the set of DOM tree node elements. V_N and E_N are the sets of all DOM tree nodes and edges in T_N; E_N is called the edge set and V_N the tree node set.
The intermediate tree H is initialized as empty, and the DOM tree nodes v ∈ V_N are added to H by depth-first traversal until all DOM tree nodes in V_N are in H. Each node y ∈ Y of H represents a set C_y containing one or more child nodes v, and n_y represents the n_v value of the nodes in the DOM tree. Nodes with the same n_v value have URLs that belong to the same connection relationship and can be reached from one another through those URLs. If a node y in H has multiple child nodes, the nodes v ∈ C_y have the same n_y value. An arc a in H represents an edge in T_N. While the arcs in H are being read, an arc whose front end is connected to a node but whose other end has no connected node, or no DOM tree node element, is called a leaf node arc, and the set of leaf node arcs is z. If during traversal a leaf node arc turns out to have a leaf node, the arc is deleted from z. Each arc a ∈ A has a corresponding set B_a, the set of all nodes reachable through arc a. Following the depth-first traversal, DOM tree node elements, denoted v', are extracted in turn from the DOM tree nodes according to z and added to H(Y, A), which is updated until the traversal ends. The final H(Y, A) is the XML DOM tree.
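The depth-first visit of DOM tree nodes that this construction relies on can be illustrated with the XML parser that ships with the JDK. This sketch only shows the traversal order; it does not build the intermediate tree H(Y, A), it accepts well-formed XML rather than arbitrary HTML, and the class name DomTraversalSketch is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomTraversalSketch {
    /** Parses well-formed XML and lists element names in depth-first (pre-order) order. */
    static List<String> depthFirstElements(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            visit(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            // Parser configuration, SAX and I/O failures are wrapped for brevity.
            throw new IllegalStateException("could not parse document", e);
        }
    }

    /** Records the node if it is an element, then descends into its children in order. */
    private static void visit(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(node.getNodeName());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c, out);
        }
    }
}
```

Each element is recorded before its children, so a parent such as a post node appears ahead of its subject and user name children, matching the order in which nodes would be added during a depth-first build.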
HTMLParser or SGMLParser parses the HTML file of the web page's source code and saves the parsed information as a tree structure. Node is the basic data type in which the information is saved.
The definition of Node:
public interface Node extends Cloneable;
The methods in Node fall into several categories.
Functions for traversing the tree, used to obtain all DOM tree node elements in XML form by structurally parsing the source code of the web page:
Node getParent(): gets the parent node
NodeList getChildren(): gets the list of child nodes
Node getFirstChild(): gets the first child node
Node getLastChild(): gets the last child node
Node getPreviousSibling(): gets the previous sibling node
Node getNextSibling(): gets the next sibling node
Functions for obtaining the content of a Node:
String getText(): gets the text
String toPlainTextString(): gets the plain text information
String toHtml(): gets the HTML information (original HTML)
String toHtml(boolean verbatim): gets the HTML information (original HTML)
String toString(): gets the string information (original HTML)
Page getPage(): gets the Page object corresponding to this Node
int getStartPosition(): gets the start position of this Node in the HTML page
int getEndPosition(): gets the end position of this Node in the HTML page
Function for Filter filtering:
void collectInto(NodeList list, NodeFilter filter): filters this node against the filter's condition and puts qualifying nodes into list
Function for Visitor traversal:
void accept(NodeVisitor visitor): applies the visitor to this Node
Functions for modifying content (these are used less often):
void setPage(Page page): sets the Page object corresponding to this Node
void setText(String text): sets the text
void setChildren(NodeList children): sets the child node list
Other functions:
void doSemanticAction(): executes the operation corresponding to this Node (only a few Tags have a corresponding operation)
Object clone(): the abstract function of the Cloneable interface
In practice HTMLParser is mostly used to process HTML pages, for which the Filter- and Visitor-related functions are required, while the first and second categories of functions are used the most. The first category is easy to understand; the second category is illustrated below.
Here is the html file for test:
Further, in step 6, the method of traversing the DOM tree, extracting key information and generating collection rules is as follows: the DOM tree is traversed and the key information, namely the subject, user name, content and time attribute information, is extracted from a node of the DOM tree. The generated collection rules are expressed as the following regular expressions:
(<subject[^>]*>)([\s\S]*?)(</subject>), (<user name[^>]*>)([\s\S]*?)(</user name>),
(<content[^>]*>)([\s\S]*?)(</content>), (<time[^>]*>)([\s\S]*?)(</time>),
(?is)(?<=<td>).+?(?=</td>),
where:
(?is) is a mode modifier: i means ignore case, and s means single-line mode, in which "." also matches line breaks;
(?<=<td>) is a positive lookbehind: the match must be preceded by <td>, but <td> itself is not part of the match, so the result does not include <td>;
.+? matches any characters lazily: after each character it consumes, it tries to match the following expression, and it keeps consuming only until the following expression succeeds;
(?=</td>) is a positive lookahead: the match must be followed by </td>, but </td> itself is not part of the match, so the result does not include </td>;
\s matches any whitespace character, including space, tab and form feed, and is equivalent to [ \f\n\r\t\v];
\S matches any non-whitespace character and is equivalent to [^ \f\n\r\t\v];
? after a quantifier indicates non-greedy matching.
The regular expressions of the collection rules match the start, attributes, content and end of one pair of XML tags.
Further, in step 7, the method of collecting page information according to the collection rules is as follows: the start, attributes, content and end of the XML tags for subject, user name, content and time are matched and collected according to the collection rules; the word segmentation system is called to segment the subject and content attributes that need segmentation; the segmented document objects are then inserted into the database record table. The word segmentation system is the Chinese Academy of Sciences word segmentation system ICTCLAS50, and the page information (subject, user name, content, time) is added to the dictionary file keydict.txt.
Further, in step 8, having crawled all URLs within the domain name range from the web pages means that no URL can be extracted from the current page, or that no URL in it satisfies the regular-expression matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?, i.e. no further URL can be extracted from the current page.
Preferably, an embodiment of the web page information acquisition method based on a web crawler provided by the present disclosure is as follows:
Step A, construct the web crawler model;
Step B, the web crawler model crawls the URLs within the domain name range from the web page and, by modifying the HTTP data packet of the web page, crawls the page returned by a preset URL. The modified HTTP data packet of the web page is the returned query result for the user-specified data values, and the user-specified data values are obtained by parsing the HTTP page elements. The page returned by the preset URL is the page returned by executing the query for the user-specified data values, which comprise the page information subject, user name, content and time;
Step C, by modifying the HTTP data packet of the web page, package the data into a data set in Json format, and pass the extracted Json data set to the background server;
Step D, parse the Json data set in the server background. Based on the Json key-value format, convert between Json and Java business data entity objects across systems. According to the configured data entity object model, check whether the extracted data set has matching data types (for example, a character string cannot be converted into a numeric type), and reject invalid data whose data types do not match; such invalid data is data in which any one of the page information attributes subject, user name, content or time does not conform;
Step E, according to the configured data entity object model, put the specified data into the specified attributes of the data entity object, i.e. assign values to the entity object;
Step F, convert the assigned entity object data into Json format and output it from the server to the client;
Step G, parse the data in Json format, modify the user's HTTP data packet according to the Json key-value pairs, and write it to the server.
Further, in step D, the conversion between Json and Java business data entity objects across systems means that the Java application reads the data in Json format and, using existing Json techniques, converts it into data readable by the Java application.
An embodiment of the present disclosure provides a web page information acquisition device based on a web crawler; Fig. 2 is a diagram of this device. The device of this embodiment comprises a processor, a memory, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the steps in the above embodiment of the web page information acquisition device based on a web crawler.
The device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to run the following units of the device:
a crawler model construction unit, for constructing the web crawler model;
a page acquisition unit, for issuing an HTTP request to the starting URL and obtaining the web page;
a regular-expression matching unit, with which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule;
a multi-page request unit, for issuing HTTP requests to the URLs within the domain name range and obtaining all web pages in the domain name range;
a DOM tree generation unit, for parsing all obtained web pages and generating an XML DOM tree;
a collection rule generation unit, for traversing the DOM tree, extracting key information and generating collection rules;
a page information collection unit, for collecting page information according to the collection rules;
a loop collection unit, for repeatedly running the page acquisition unit, regular-expression matching unit, multi-page request unit, DOM tree generation unit, collection rule generation unit and page information collection unit until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
A kind of Web page information acquisition device based on web crawlers can run on desktop PC, notes Originally, palm PC and cloud server etc. calculate in equipment.A kind of Web page information based on web crawlers acquires dress It sets, the device that can be run may include, but be not limited only to, processor, memory.It will be understood by those skilled in the art that the example Son is only a kind of example of Web page information acquisition device based on web crawlers, does not constitute and is climbed to one kind based on network The restriction of the Web page information acquisition device of worm may include component more more or fewer than example, or the certain portions of combination Part or different components, such as a kind of Web page information acquisition device based on web crawlers can also include input Output equipment, network access equipment, bus etc..
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the device on which the web page information acquisition device runs, and it uses various interfaces and lines to connect all parts of that device.
The memory can be used to store the computer program and/or modules. The processor implements the various functions of the web page information acquisition device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function). The data storage area can store data created through use of the device (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
Preferably, the web crawler model of the disclosure can be injected into the front end of a C/S (client/server) application:
1) Through the C/S client, the web crawler model enumerates the IE instances to find the operation page object and injects a script into it.
2) Through C/S, the web crawler model checks the server and client version numbers so that it can be updated in real time.
jQuery is a JavaScript library that is compatible with multiple browsers; here it is used to provide AJAX interaction for the website.
The web crawler model uses jQuery to add custom elements to the current page and calls AJAX to interact with the back end.
In data-driven applications of the web crawler model, data can be obtained from a server that is independent of the actual web page and written into the web page dynamically. The front end returns data to the back end through AJAX, and the back end parses the relevant data and writes it into the corresponding record table of the database.
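The back-end half of this exchange can be sketched as follows. This is a minimal illustration in Python, not the implementation of the disclosure: the table name `page_record`, the field names in `COLUMNS`, and the use of SQLite are all assumptions made for the example, and the JSON payload stands in for whatever the front end posts through AJAX.

```python
import json
import sqlite3
from typing import Any, Dict, List

# Assumed field names; the disclosure only names subject, user name, content, time.
COLUMNS = ("subject", "user_name", "content", "time")


def store_records(conn: sqlite3.Connection, payload: str) -> int:
    """Parse a JSON array posted by the front end and insert it into a record table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_record "
        "(subject TEXT, user_name TEXT, content TEXT, time TEXT)"
    )
    items: List[Dict[str, Any]] = json.loads(payload)
    # Missing keys become NULL; parameter binding avoids SQL injection.
    rows = [tuple(item.get(col) for col in COLUMNS) for item in items]
    conn.executemany("INSERT INTO page_record VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

An in-memory database (`sqlite3.connect(":memory:")`) is enough to try the function without touching disk.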
Although the description of the disclosure is quite detailed and describes several embodiments in particular, it is not intended to be limited to any of these details or embodiments or to any particular embodiment. Rather, it should be read with reference to the appended claims so as to give those claims the broadest possible interpretation in view of the prior art, thereby effectively covering the intended scope of the disclosure. Furthermore, the disclosure has been described above in terms of embodiments foreseeable by the inventors in order to provide a useful description; insubstantial modifications of the disclosure that are not presently foreseen may nonetheless represent equivalents of it.

Claims (10)

1. A web page information acquisition method based on a web crawler, characterized in that the method comprises the following steps:
Step 1, construct a web crawler model;
Step 2, initiate an HTTP request to the starting URL to obtain a web page;
Step 3, the web crawler model crawls the URLs within the domain name range from the web page according to a regular-expression matching rule;
Step 4, initiate HTTP requests to the URLs within the domain name range to obtain all web pages within the domain name range;
Step 5, parse all the obtained web pages and generate an XML DOM tree;
Step 6, traverse the DOM tree and extract key information to generate collection rules;
Step 7, collect page information according to the collection rules;
Step 8, repeat steps 2 to 7 until the web crawler model has crawled, according to the regular-expression matching rule, all URLs within the domain name range from the web pages.
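The request-and-crawl loop of steps 2, 3, 4, and 8 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: `URL_RULE` is a stand-in pattern, `fetch` is a hypothetical callable that performs the HTTP request and returns the page source, and steps 5 to 7 (DOM tree, collection rules, collection) are left out.

```python
import re
from collections import deque
from typing import Callable, Dict, Set
from urllib.parse import urlparse

# Assumed URL rule for the sketch; the claim only requires "a regular-expression matching rule".
URL_RULE = re.compile(r"http://[\w\-./?=&%]+")


def crawl(start_url: str, domain: str, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Breadth-first crawl restricted to `domain`; returns {url: page source}."""
    queue = deque([start_url])
    seen: Set[str] = {start_url}
    pages: Dict[str, str] = {}
    while queue:                                  # step 8: repeat until no URL is left
        url = queue.popleft()
        html = fetch(url)                         # steps 2 and 4: request the page
        pages[url] = html
        for link in URL_RULE.findall(html):       # step 3: regex matching
            host = urlparse(link).hostname or ""
            if host.endswith(domain) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch testable offline; in practice it could wrap `urllib.request.urlopen` or any HTTP client.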
2. A web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 1, the method of constructing the web crawler model comprises the following steps:
Step 1.1, create a configuration script for a web crawler and write the starting URL into the configuration script;
Step 1.2, specify in the configuration script the domain name range that the web crawler crawls;
Step 1.3, construct the queue of the web crawler; the elements of the queue are stored URLs, that is, the queue is used to store URLs;
Step 1.4, construct the regular-expression matching rule of the web crawler; the web crawler stores the URLs that satisfy the regular-expression matching rule in the crawl queue.
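Steps 1.1 to 1.4 amount to a small configuration object plus a queue. The following Python sketch shows one way to hold them together; the class name `CrawlerConfig`, its field names, and the `offer` method are hypothetical and chosen only for illustration.

```python
import re
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class CrawlerConfig:
    start_url: str                                      # step 1.1: starting URL
    domain_scope: str                                   # step 1.2: domain name range
    url_rule: str                                       # step 1.4: regex source text
    queue: Deque[str] = field(default_factory=deque)    # step 1.3: URL queue

    def offer(self, url: str) -> bool:
        """Store `url` in the crawl queue only if it fully matches the rule."""
        if re.fullmatch(self.url_rule, url) and url not in self.queue:
            self.queue.append(url)
            return True
        return False
```

Rejecting duplicates at `offer` time is a design choice of the sketch; the claim itself does not say how repeated URLs are handled.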
3. A web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 3, the method by which the web crawler model crawls the URLs within the domain name range from the web page information according to the regular-expression matching rule is: starting from the URL of a web page, crawl all URLs in that web page, extract all URLs from the current page, and put them into the queue of the web crawler; the URLs are matched by the regular-expression matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?, which extracts all URLs in the current page.
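A short Python sketch of this extraction step follows. It assumes the matching rule reads `http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?`, that is, the published pattern with its backslashes restored; that reading is an assumption, and the function name `extract_urls` is hypothetical.

```python
import re
from typing import List

# Assumed reading of the claim's rule with the escape characters restored.
URL_RULE = re.compile(r"http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?")


def extract_urls(page_source: str) -> List[str]:
    """Return every substring of the page that matches the URL rule."""
    # finditer + group(0) yields whole matches; findall would return group tuples.
    return [m.group(0) for m in URL_RULE.finditer(page_source)]
```

Under this reading the rule covers the host name and an optional query string but not a path, so `http://example.com/path` is captured only up to `http://example.com`.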
4. A web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 3, the domain name range is all the URLs in the web pages obtained from the same URL.
5. A web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 5, the method of parsing all the obtained web pages and generating an XML DOM tree is: use HTMLParser or SGMLParser to structurally parse the source code of the web page into a tree structure and obtain the URL, subject, user name, content, and time attribute information; then traverse and visit every node of the web page through the logical relationships between the nodes of the tree structure to generate a DOM tree in XML form.
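As an illustration of the parsing step, the following Python sketch builds an XML element tree from page source with the standard-library `html.parser.HTMLParser`. The class `DomBuilder`, the wrapper element `document`, and the `VOID_TAGS` list are assumptions of the sketch; text is attached to the enclosing element only, so mixed content is flattened.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never take a closing tag (partial list, assumed sufficient for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomBuilder(HTMLParser):
    """Turn a stream of HTML tags into an ElementTree rooted at <document>."""

    def __init__(self) -> None:
        super().__init__()
        self.root = ET.Element("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self._stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; stray closers are ignored.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            node = self._stack[-1]
            node.text = (node.text or "") + text


def html_to_xml(source: str) -> ET.Element:
    builder = DomBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

The resulting tree can be serialized with `ET.tostring(root)` or walked with `root.iter()` to visit every node.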
6. A web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 5, the method of parsing all the obtained web pages and generating an XML DOM tree is: use HTMLParser or SGMLParser to structurally parse the source code of the web page to obtain all DOM tree node elements in XML form, the DOM tree node elements including URL, subject, user name, content, and time attribute information; traverse and visit every node of the web page through the logical relationships between the nodes of the tree structure to generate the DOM tree in XML form. The DOM tree is represented by an undirected tree G(V, E), where V is the node set of the DOM tree and E is the logical relationship between any two DOM tree nodes, that is, E is the connection relationship of the URLs in the web page. An intermediate tree is generated as follows. First, read the URL v(r) of the root node v ∈ V_N together with its B_v and n_v values, where B_v denotes the set of all nodes that can be reached through arc v. Then compare the n_v values of the nodes in V_N\{r}, where V_N\{r} is the node set of the tree with the root node r removed, and construct the sequence L in order of n_v value. Then initialize H(Y, A): select a vertex y to represent the leaf node r, set C_y = {r} and n_y = |N|+1/2, and add the |D_N^r| leaf-node arcs connected to the leaf node to the set Z, letting the known connecting node of these leaf-node arcs be y. The intermediate tree is H(Y, A) and the DOM tree is T_N. Each vertex y ∈ Y of the intermediate tree is a node set C_y ∈ V_N of T_N, and the arc set A is the set of edges in T_N; each node and leaf node is represented by an independent vertex y ∈ Y. N denotes the set of DOM tree node elements; V_N and E_N are the sets of all DOM tree nodes and edges in T_N; the elements of E_N are called the edge set and the elements of V_N are called the node set of the tree. Initialize the intermediate tree H as empty and add the DOM tree nodes v ∈ V_N to H by depth-first traversal until all DOM tree nodes in V_N are in the intermediate tree H. Each node y ∈ Y of the intermediate tree H represents a set C_y, which contains one or more child nodes v, and n_y represents the n_v value of the nodes in the DOM tree. Nodes with the same n_v value have URLs that belong to the same connection relationship and can be reached from one another through those URLs. If a node y in H has multiple child nodes, the nodes v ∈ C_y have the same n_v value. An arc a in H represents an edge in T_N. While the arcs in H are being read, when one end of an arc is connected to a node and the leaf node at the other end has no connecting node or no DOM tree node element, the arc is called a leaf-node arc, and the set of leaf-node arcs is Z. If a leaf-node arc is found to have a leaf node during traversal, the leaf-node arc is deleted from the set Z. Each arc a ∈ A has a corresponding set B_a, where B_a denotes the set of all nodes that can be reached through arc a. For Z, DOM tree node elements are extracted in turn from the DOM tree nodes by depth-first traversal, denoted v', and added to H(Y, A) as updates.
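Two ingredients of this claim lend themselves to a small Python sketch: the depth-first traversal of the undirected tree G(V, E), and the set B_a of nodes reachable through each arc a. The sketch below covers only those two parts; it does not attempt the grouping of nodes by n_v value or the leaf-node arc set Z, which the claim does not define precisely enough to code. The adjacency-list representation and the function name `dfs_reachable` are assumptions.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Graph = Dict[str, List[str]]   # undirected tree as an adjacency list


def dfs_reachable(graph: Graph, root: str):
    """Depth-first traversal from `root`.

    Returns the visiting order and, for each arc (parent, child), the set of
    nodes reachable through that arc (the subtree hanging below the child).
    """
    order: List[str] = []
    reach: Dict[Tuple[str, str], FrozenSet[str]] = {}

    def visit(node: str, parent: Optional[str]) -> Set[str]:
        order.append(node)
        subtree = {node}
        for nxt in graph.get(node, []):
            if nxt == parent:          # a tree has no cycles, so skipping the parent suffices
                continue
            below = visit(nxt, node)
            reach[(node, nxt)] = frozenset(below)
            subtree |= below
        return subtree

    visit(root, None)
    return order, reach
```

The recursion depth equals the depth of the tree, so a very deep DOM tree would call for an explicit stack instead.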
7. A web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 6, the method of traversing the DOM tree and extracting key information to generate collection rules is: traversing the DOM tree extracts the subject, user name, content, and time attribute information from a node of the DOM tree; the key information is this subject, user name, content, and time attribute information; the generated collection rules are expressed as the following regular expressions:
(<subject[^>])([\s\S]*?)(</subject>), (<user name[^>])([\s\S]*?)(</user name>),
(<content[^>])([\s\S]*?)(</content>), (<time[^>])([\s\S]*?)(</time>),
(?is)(?<=<td>).+?(?=</td>).
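The collection rules can be exercised with Python's `re` module as in the sketch below. Two assumptions are made and should be read as such: the opening-tag part is written `[^>]*>` so that a bare tag such as `<subject>` also matches, and the tag names in `FIELDS` are placeholders (`user` stands in for the user name tag, since an XML tag name cannot contain a space).

```python
import re
from typing import Dict

FIELDS = ("subject", "user", "content", "time")   # assumed tag names


def build_rule(tag: str) -> "re.Pattern[str]":
    """Three groups: opening tag, inner content (non-greedy), closing tag."""
    t = re.escape(tag)
    return re.compile(rf"(<{t}[^>]*>)([\s\S]*?)(</{t}>)")


# Table-cell rule: (?is) makes it case-insensitive and lets '.' span newlines.
TD_RULE = re.compile(r"(?is)(?<=<td>).+?(?=</td>)")


def collect(xml_text: str) -> Dict[str, str]:
    """Apply one rule per field and keep the inner content of the first match."""
    result: Dict[str, str] = {}
    for name in FIELDS:
        match = build_rule(name).search(xml_text)
        if match:
            result[name] = match.group(2).strip()
    return result
```

Groups 1 and 3 of each rule hold the start and end tags, so the same match also yields the tag attributes when they are needed.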
8. A web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 7, the method of collecting page information according to the collection rules is: collect, according to the collection rules, the start part, attributes, content, and end part of the XML tags that match the subject, user name, content, and time.
9. A web page information acquisition method based on a web crawler, characterized in that the method comprises the following steps:
Step A, construct a web crawler model;
Step B, starting from the URLs crawled within the domain name range from the web page, the web crawler model modifies the HTTP data packet of the web page and crawls the page returned by a preset URL; the modified HTTP data packet of the web page is the returned query result for the user-specified data values; the HTTP page elements are parsed to obtain the user-specified data values; the page returned by the preset URL is the page returned by executing the method that queries the user-specified data values; the user-specified data values include the subject, user name, content, and time of the page information;
Step C, package the data into a JSON-format data set by modifying the HTTP data packet of the web page, and extract the JSON data set to the back-end server;
Step D, parse the JSON data set in the server back end; based on the JSON key-value format, convert between JSON and Java business data entity objects in the system; according to the configured data entity object model, check whether the extracted data set has the same data types, and reject invalid data whose data types fail the check; invalid data with inconsistent data types is data in which any one of the subject, user name, content, and time attributes of the page information does not conform;
Step E, according to the configured data entity object model, put the specified data into the specified attributes of the data entity object to assign values to the entity object;
Step F, convert the assigned entity object data into JSON format and output it from the server to the client;
Step G, parse the JSON-format data, modify the user's HTTP data packet according to the JSON key-value form, and write it to the server.
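Steps D to F describe a validate, assign, and serialize round trip between JSON and entity objects. The claim names Java entity objects; the sketch below uses a Python dataclass as a stand-in, and the entity name `PageRecord`, its field names, and the rule that every attribute must be a string are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class PageRecord:          # hypothetical data entity object model
    subject: str
    user_name: str
    content: str
    time: str


def parse_records(payload: str) -> List[PageRecord]:
    """Step D and E: keep only items whose four attributes exist and are strings."""
    records: List[PageRecord] = []
    for item in json.loads(payload):
        try:
            values = {f.name: item[f.name] for f in fields(PageRecord)}
        except (KeyError, TypeError):
            continue                      # missing attribute or non-object item
        if all(isinstance(v, str) for v in values.values()):
            records.append(PageRecord(**values))
    return records


def to_json(records: List[PageRecord]) -> str:
    """Step F: convert assigned entity objects back to JSON for the client."""
    return json.dumps([asdict(r) for r in records])
```

Rejecting an item silently mirrors the claim's "reject invalid data"; a production system would more likely log the rejected item as well.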
10. A web page information acquisition device based on a web crawler, characterized in that the device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to run the following units of the device:
A crawler model construction unit, for constructing the web crawler model;
A page acquisition unit, for initiating an HTTP request to the starting URL to obtain a web page;
A regular-expression matching unit, with which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule;
A multi-page request unit, for initiating HTTP requests to the URLs within the domain name range to obtain all web pages within the domain name range;
A DOM tree generation unit, for parsing all the obtained web pages and generating an XML DOM tree;
A collection rule generation unit, for traversing the DOM tree and extracting key information to generate collection rules;
A page information acquisition unit, for collecting page information according to the collection rules;
A loop collection unit, for repeatedly running the page acquisition unit, regular-expression matching unit, multi-page request unit, DOM tree generation unit, collection rule generation unit, and page information acquisition unit until the web crawler model has crawled, according to the regular-expression matching rule, all URLs within the domain name range from the web pages.
CN201811499475.4A 2018-12-09 2018-12-09 A kind of Web page information acquisition method and device based on web crawlers Pending CN109657121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811499475.4A CN109657121A (en) 2018-12-09 2018-12-09 A kind of Web page information acquisition method and device based on web crawlers

Publications (1)

Publication Number Publication Date
CN109657121A true CN109657121A (en) 2019-04-19

Family

ID=66113862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811499475.4A Pending CN109657121A (en) 2018-12-09 2018-12-09 A kind of Web page information acquisition method and device based on web crawlers

Country Status (1)

Country Link
CN (1) CN109657121A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007193642A (en) * 2006-01-20 2007-08-02 Nippon Telegr & Teleph Corp <Ntt> Xpath processor, xpath processing method, xpath processing program and storage medium
US20080098300A1 (en) * 2006-10-24 2008-04-24 Brilliant Shopper, Inc. Method and system for extracting information from web pages
US20140123303A1 (en) * 2012-10-31 2014-05-01 Tata Consultancy Services Limited Dynamic data masking
CN107066576A (en) * 2017-04-12 2017-08-18 成都四方伟业软件股份有限公司 A kind of big data web crawlers paging system of selection and system
CN107608949A (en) * 2017-10-16 2018-01-19 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model
CN107808000A (en) * 2017-11-13 2018-03-16 哈尔滨工业大学(威海) A kind of hidden web data collection and extraction system and method
CN108052632A (en) * 2017-12-20 2018-05-18 成都律云科技有限公司 A kind of method for obtaining network information, system and company information search system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
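LI Wen et al.: "Web information extraction model based on XML and DOM technology", Journal of Dalian Jiaotong University *
GAO Mengchao et al.: "Design and implementation of a crowdsourcing-based social network data collection model", Computer Engineering *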

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134853A (en) * 2019-05-13 2019-08-16 重庆八戒传媒有限公司 Data crawling method and system
CN110287394A (en) * 2019-06-28 2019-09-27 北京金山安全软件有限公司 Website resource crawling method and device, computer equipment and storage medium
CN110297962A (en) * 2019-06-28 2019-10-01 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN110287394B (en) * 2019-06-28 2022-01-11 北京金山安全软件有限公司 Website resource crawling method and device, computer equipment and storage medium
CN110611713A (en) * 2019-09-17 2019-12-24 深圳市网心科技有限公司 Data downloading method and system, electronic equipment and storage medium
CN111859867A (en) * 2020-07-20 2020-10-30 广西美立方工程咨询有限公司 Web data extraction system based on XML and XPath and use method thereof
CN111859867B (en) * 2020-07-20 2024-03-12 广西美立方工程咨询有限公司 Web data extraction system based on XML and XPath and use method thereof
CN112084389A (en) * 2020-08-17 2020-12-15 上海交通大学 Network crawler-based academic institution geographical position information extraction method
CN113065051A (en) * 2021-04-02 2021-07-02 西南石油大学 Visual agricultural big data analysis interactive system
CN113922980A (en) * 2021-08-23 2022-01-11 北京天融信网络安全技术有限公司 DNS monitoring method, equipment and storage medium based on HTTP detection information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190419