CN109657121A - Web page information acquisition method and device based on a web crawler
- Publication number
- CN109657121A CN201811499475.4A CN201811499475A
- Authority
- CN
- China
- Prior art keywords
- node
- url
- web
- dom tree
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a Web page information acquisition method and device based on a web crawler. A web crawler model is constructed; HTTP requests are initiated for the URLs within a domain name range to obtain all web pages in that range; the DOM tree is traversed to extract key information and generate collection rules; and page information is collected according to those rules. The method simplifies developers' cumbersome data-entry work, shortens the time business personnel spend entering data into the system and addresses low entry accuracy, reduces the computational cost of collecting web page data, can skip large numbers of irrelevant web pages directly, and supports collecting web page information directly according to semantic information.
Description
Technical field
The present disclosure relates to the field of data collection, and in particular to a Web page information acquisition method and device based on a web crawler.
Background art
Traditional Web page text acquisition crawls web pages of many different kinds, so the collected results often contain a large number of pages unrelated to the collection topic. Because its goal is the widest possible network coverage, the contradiction between limited search engine resources and effectively unlimited Web page text resources keeps growing. The varied forms of data in pages also keep the volume of network information growing: massive amounts of image, video and audio multimedia, text documents and other heterogeneous data appear on the network every day. Existing techniques often cannot handle this content; they cannot comprehensively discover and obtain the multi-source heterogeneous data in the collected pages, and they find it difficult to support collecting web page information according to semantic information. In addition, when some network-based systems are developed, much data needs to be reused in the new system. On the one hand this increases the data-entry work of front-office staff; on the other hand, the data of auxiliary systems built earlier cannot be reused, which wastes considerable resources.
Summary of the invention
The present disclosure provides a Web page information acquisition method and device based on a web crawler, in which a web crawler model captures the key information in web pages by means of rules.
To achieve the above object, according to one aspect of the present disclosure, a Web page information acquisition method based on a web crawler is provided, the method comprising the following steps:
Step 1: construct a web crawler model;
Step 2: initiate an HTTP request for the starting URL to obtain a web page;
Step 3: the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule;
Step 4: initiate HTTP requests for the URLs within the domain name range to obtain all web pages in the domain name range;
Step 5: parse all acquired web pages and generate an XML DOM tree;
Step 6: traverse the DOM tree, extract key information and generate collection rules;
Step 7: collect page information according to the collection rules;
Step 8: repeat steps 2 to 7 until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
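The loop formed by steps 2 to 8 can be sketched as a queue-driven crawl. The following is a minimal illustration under stated assumptions, not the patented implementation: the class name CrawlLoopSketch, the method crawl and the parameter linksOf are hypothetical, the HTTP request and link extraction of steps 2 to 4 are abstracted into the linksOf function, steps 5 to 7 are omitted, and "within the domain name range" is assumed to mean "same host as the starting URL".

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class CrawlLoopSketch {
    /**
     * Breadth-first crawl restricted to the host of startUrl.
     * linksOf stands in for "fetch the page and extract its URLs".
     * Returns every URL that was queued, in the order it was visited.
     */
    static List<String> crawl(String startUrl, Function<String, List<String>> linksOf) {
        String host = URI.create(startUrl).getHost();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {               // step 8: stop when no URL is left
            String url = queue.poll();
            for (String link : linksOf.apply(url)) {
                boolean inRange = host != null && host.equals(URI.create(link).getHost());
                if (inRange && seen.add(link)) { // skip out-of-range and repeated URLs
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Passing the fetch-and-extract step as a function keeps the sketch independent of any particular HTTP client or parser.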
Further, in step 1, the method of constructing the web crawler model comprises the following steps:
Step 1.1: create a configuration script for the web crawler and write the starting URL into the configuration script;
Step 1.2: configure in the configuration script the domain name range that the web crawler crawls;
Step 1.3: construct the queue of the web crawler; the elements of the queue are stored URLs, i.e. the queue is used to store URLs;
Step 1.4: construct the regular-expression matching rule of the web crawler. The regular-expression form of the URLs handled by the web crawler model, i.e. the matching rule, is http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?. The rule matches URLs that contain http:// and the "/" or "." symbols, and the web crawler stores each URL that satisfies the matching rule in the crawl queue.
Further, in step 3, the web crawler model crawls the URLs within the domain name range from the web page information according to the regular-expression matching rule as follows: starting from the URL of the web page, it crawls all URLs in that page, extracting every URL from the current page and putting it into the queue of the web crawler. A URL is written with http:// and the "/" or "." symbols, and all URLs in the current page are extracted with the matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?
Further, in step 3, the domain name range is the set of all URLs in the web pages obtained from the same URL.
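As an illustration of URL extraction with a matching rule, the sketch below applies a regular expression to page text and fills a queue. It assumes that the rule reads http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)? once its backslash escapes are restored; the class name UrlRuleSketch and the method extractUrls are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlRuleSketch {
    // Assumed reading of the matching rule, with backslash escapes restored.
    static final Pattern URL_RULE = Pattern.compile(
            "http://(\\w+(-\\w+)*)(\\.(\\w+(-\\w+)*))*(\\?\\S*)?");

    /** Returns the distinct URLs found in pageText, in order of first appearance. */
    static Deque<String> extractUrls(String pageText) {
        Deque<String> queue = new ArrayDeque<>();
        Matcher m = URL_RULE.matcher(pageText);
        while (m.find()) {
            String url = m.group();
            if (!queue.contains(url)) {
                queue.add(url); // enqueue each URL only once
            }
        }
        return queue;
    }

    public static void main(String[] args) {
        String text = "see http://www.example-site.com?id=1 and http://news.example.com";
        System.out.println(extractUrls(text));
    }
}
```

Under this reading the pattern covers the host name and an optional query string beginning with "?"; path segments after the host are not captured, and the query part runs up to the next whitespace character.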
Further, in step 5, all acquired web pages are parsed and the XML DOM tree is generated as follows: HTMLParser or SGMLParser performs structural parsing of the page source code to obtain the URL, subject, user name, content and time attribute information as a tree structure, and each node of the page is visited by traversing the logical relations between the tree nodes to generate the DOM tree in XML form.
Further, in step 5, all acquired web pages are parsed and the XML DOM tree is generated as follows. HTMLParser or SGMLParser performs structural parsing of the page source code to obtain all DOM tree node elements in XML form; each DOM tree node element contains URL, subject, user name, content and time attribute information. Each node of the page is visited by traversing the logical relations between the tree nodes to generate the DOM tree in XML form. The DOM tree is represented as an undirected tree $G(V, E)$, where $V$ is the node set of the DOM tree and $E$ is the logical relation between any two DOM tree nodes, i.e. $E$ is the URL connection relation within the page.
An intermediate tree is then generated. First, for the root node $v \in V_N$, its URL $v(r)$ and the values $B_v$ and $n_v$ are read, where $B_v$ is the set of all nodes reachable through arc $v$. Next, the $n_v$ values of the nodes in $V_N \setminus \{r\}$ are compared, where $V_N \setminus \{r\}$ is the node set of the tree with the root node $r$ removed, and a sequence $L$ is constructed in order of $n_v$. $H(Y, A)$ is then initialized: a vertex $y$ is chosen to represent the leaf node $r$, with $C_y = \{r\}$ and $n_y = |N| + 1/2$, and the $|D^N_r|$ leaf-node arcs connected to that leaf node are added to the set $z$, where $D^N_r$ denotes the set of all URLs of the leaf node $v \in V$ and the known connecting node of each leaf-node arc is $y$.
The notation is as follows. The intermediate tree is $H(Y, A)$ and the DOM tree is $T_N$. Each vertex $y \in Y$ of the intermediate tree is a set of $T_N$ nodes $C_y \subseteq V_N$, and the arc set $A$ is a set of edges of $T_N$; every node and leaf node is represented by its own vertex $y \in Y$. $N$ is the set of DOM tree node elements and $|N|$ is the number of nodes in $N$. $V_N$ and $E_N$ are the sets of all DOM tree nodes and edges in $T_N$; $E_N$ is called the edge set and $V_N$ the node set of the tree.
The intermediate tree $H$ is initialized as empty, and the DOM tree nodes $v \in V_N$ are added to $H$ by depth-first traversal until every DOM tree node in $V_N$ is in $H$. Each node $y \in Y$ of $H$ represents a set $C_y$ containing one or more child nodes $v$, and $n_y$ is the $n_v$ value of those DOM tree nodes. Nodes with the same $n_v$ value have URLs belonging to the same connection relation and can be reached from one another by URL. If a node $y$ in $H$ has several child nodes, the nodes $v \in C_y$ all have the same $n_v$ value. An arc $a$ in $H$ represents an edge of $T_N$. While the arcs of $H$ are being read, an arc whose front end is connected to a node but whose other end has no connected node, or no DOM tree node element, is called a leaf-node arc, and the set of leaf-node arcs is $z$; if a leaf-node arc is found to have a leaf node during traversal, it is deleted from $z$. Each arc $a \in A$ has a corresponding set $B_a$, the set of all nodes reachable through arc $a$. For the arcs in $z$, DOM tree node elements, denoted $v'$, are extracted one by one from the DOM tree nodes by depth-first traversal and added to $H(Y, A)$ as updates until the traversal ends. The final $H(Y, A)$ is the XML DOM tree.
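The depth-first traversal that the construction relies on can be illustrated on a small XML document. The sketch below is a plain depth-first walk, not the intermediate-tree construction $H(Y, A)$ itself, and it uses the XML parser bundled with the JDK rather than HTMLParser or SGMLParser; the class name DomDepthFirstSketch and the sample tag names are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class DomDepthFirstSketch {
    /** Parses xml and returns its element names in depth-first (pre-order) order. */
    static List<String> elementNamesDepthFirst(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> names = new ArrayList<>();
            visit(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Record the node first, then descend into each child in document order.
    private static void visit(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            visit(children.item(i), out);
        }
    }

    public static void main(String[] args) {
        String xml = "<page><subject>s</subject><post><user>u</user></post></page>";
        System.out.println(elementNamesDepthFirst(xml));
    }
}
```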
Further, in step 6, traversing the DOM tree to extract key information and generate collection rules proceeds as follows. Extracting key information means extracting the subject, user name, content and time attribute information from a node of the DOM tree; the key information is that subject, user name, content and time attribute information. The generated collection rules are expressed as the following regular expressions:
(<subject[^>]*>)([\s\S]*?)(</subject>), (<user name[^>]*>)([\s\S]*?)(</user name>),
(<content[^>]*>)([\s\S]*?)(</content>), (<time[^>]*>)([\s\S]*?)(</time>),
(?is)(?<=<td>).+?(?=</td>),
where:
(?is) sets the matching modes: i means ignore case, and s means single-line mode, in which "." also matches line breaks;
(?<=<td>) is a positive lookbehind: the match must be preceded by <td>, but <td> itself is not consumed and is not included in the result;
.+? matches any characters non-greedily: after each character it matches, it tries the following sub-expression, and it extends the match only while that sub-expression fails;
(?=</td>) is a positive lookahead: the match must be followed by </td>, but </td> itself is not consumed and is not included in the result;
\s matches any whitespace character, including space, tab and form feed, and is equivalent to [ \f\n\r\t\v];
\S matches any non-whitespace character and is equivalent to [^ \f\n\r\t\v];
? after a quantifier indicates non-greedy matching.
Each collection-rule regular expression matches the opening tag, attributes, content and closing tag of one pair of XML tags.
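How such collection rules behave can be checked with a short sketch. It assumes the tag rule has the form (<tag[^>]*>)([\s\S]*?)(</tag>) and the table-cell rule is (?is)(?<=<td>).+?(?=</td>); the class name CollectionRuleSketch and the methods extractTag and tableCells are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CollectionRuleSketch {
    /** Returns the content between the first <tag ...> and </tag>, or null if absent. */
    static String extractTag(String xml, String tag) {
        String name = Pattern.quote(tag); // treat the tag name literally
        Pattern rule = Pattern.compile("(<" + name + "[^>]*>)([\\s\\S]*?)(</" + name + ">)");
        Matcher m = rule.matcher(xml);
        return m.find() ? m.group(2) : null; // group 2 is the non-greedy content
    }

    /** Returns the text of every <td> cell, excluding the tags themselves. */
    static List<String> tableCells(String html) {
        Matcher m = Pattern.compile("(?is)(?<=<td>).+?(?=</td>)").matcher(html);
        List<String> cells = new ArrayList<>();
        while (m.find()) {
            cells.add(m.group());
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(extractTag("<subject id=\"1\">Crawler news</subject>", "subject"));
        System.out.println(tableCells("<tr><td>a</td><td>b</td></tr>"));
    }
}
```

Because the content group is non-greedy, each match ends at the nearest closing tag, so two adjacent tag pairs yield two separate matches instead of one long one.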
Further, in step 7, page information is collected according to the collection rules as follows: the opening tag, attributes, content and closing tag of the XML tags for subject, user name, content and time are matched and collected according to the collection rules; the word segmentation system is called to segment the subject and content attributes that require segmentation; and the segmented document object is then inserted into the database record table. The word segmentation system is the Chinese Academy of Sciences segmentation system ICTCLAS50, and the page information subject, user name, content and time are added to the dictionary file keydict.txt.
Further, in step 8, having crawled all URLs within the domain name range from the web pages means that no URL can be extracted from the current page, or that no URL in the current page satisfies the matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?
i.e. no further URL can be extracted from the current page.
The present invention also provides a Web page information acquisition device based on a web crawler. The device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the program runs in the following units of the device:
a crawler model construction unit, for constructing the web crawler model;
a page acquisition unit, for initiating an HTTP request for the starting URL to obtain a web page;
a regular-expression matching unit, for the web crawler model to crawl the URLs within the domain name range from the web page according to the regular-expression matching rule;
a multi-page request unit, for initiating HTTP requests for the URLs within the domain name range to obtain all web pages in the domain name range;
a DOM tree generation unit, for parsing all acquired web pages and generating the XML DOM tree;
a collection rule generation unit, for traversing the DOM tree, extracting key information and generating collection rules;
a page information collection unit, for collecting page information according to the collection rules;
a loop collection unit, for repeatedly running the page acquisition unit, regular-expression matching unit, multi-page request unit, DOM tree generation unit, collection rule generation unit and page information collection unit until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
The beneficial effects of the present disclosure are as follows: the invention provides a Web page information acquisition method and device based on a web crawler that simplifies developers' cumbersome data-entry work, shortens the time business personnel spend entering data into the system and addresses low entry accuracy, reduces the computational cost of collecting web page data, can skip large numbers of irrelevant web pages directly, and supports collecting web page information directly according to semantic information.
Brief description of the drawings
The above and other features of the present disclosure will become more apparent from the detailed description of the embodiments shown in the accompanying drawings, in which identical reference labels denote identical or similar elements. The drawings described below are only some embodiments of the present disclosure; those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a flow chart of a Web page information acquisition method based on a web crawler;
Fig. 2 is a diagram of a Web page information acquisition device based on a web crawler.
Detailed description of the embodiments
The concept, specific structure and technical effects of the present disclosure are described clearly and completely below with reference to the embodiments and drawings, so that the purpose, scheme and effects of the disclosure can be fully understood. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.
Fig. 1 is a flow chart of a Web page information acquisition method based on a web crawler according to the present disclosure; the method according to an embodiment of the disclosure is described below with reference to Fig. 1.
The present disclosure proposes a Web page information acquisition method based on a web crawler, which specifically comprises the following steps:
Step 1: construct a web crawler model;
Step 2: initiate an HTTP request for the starting URL to obtain a web page;
Step 3: the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule;
Step 4: initiate HTTP requests for the URLs within the domain name range to obtain all web pages in the domain name range;
Step 5: parse all acquired web pages and generate an XML DOM tree;
Step 6: traverse the DOM tree, extract key information and generate collection rules;
Step 7: collect page information according to the collection rules;
Step 8: repeat steps 2 to 7 until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
Further, in step 1, the method of constructing the web crawler model comprises the following steps:
Step 1.1: create a configuration script for the web crawler and write the starting URL into the configuration script;
Step 1.2: configure in the configuration script the domain name range that the web crawler crawls;
To configure the domain name range crawled by the web crawler: the script entry first_URLs serves as the template from which URLs are constructed, and the crawl range is 10000~20000, which is used to splice new URLs.
The main code in Java form is as follows:
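The original listing is not reproduced in this text. As a stand-in, the following is a minimal sketch of URL splicing, assuming that the range 10000~20000 is a run of numeric identifiers substituted into a URL template; the class name UrlRangeSketch, the method spliceUrls, the %d placeholder and the example.com address are all hypothetical and do not come from the source.

```java
import java.util.ArrayList;
import java.util.List;

final class UrlRangeSketch {
    /**
     * Builds one URL per identifier in [from, to] by substituting the
     * identifier for the %d placeholder in the template.
     */
    static List<String> spliceUrls(String template, int from, int to) {
        List<String> urls = new ArrayList<>();
        for (int id = from; id <= to; id++) {
            urls.add(String.format(template, id));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical template playing the role of first_URLs.
        List<String> urls = spliceUrls("http://example.com/thread/%d", 10000, 20000);
        System.out.println(urls.size() + " URLs, first: " + urls.get(0));
    }
}
```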
Step 1.3: construct the queue of the web crawler; the elements of the queue are stored URLs, i.e. the queue is used to store URLs;
Step 1.4: construct the regular-expression matching rule of the web crawler. The regular-expression form of the URLs handled by the web crawler model, i.e. the matching rule, is http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?. The rule matches URLs that contain http:// and the "/" or "." symbols, and the web crawler stores each URL that satisfies the matching rule in the crawl queue.
Further, in step 3, the web crawler model crawls the URLs within the domain name range from the web page information according to the regular-expression matching rule as follows: starting from the URL of the web page, it crawls all URLs in that page, extracting every URL from the current page and putting it into the queue of the web crawler. A URL is written with http:// and the "/" or "." symbols, and all URLs in the current page are extracted with the matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?
Further, in step 3, the domain name range is the set of all URLs in the web pages obtained from the same URL.
Further, in step 5, all acquired web pages are parsed and the XML DOM tree is generated as follows: HTMLParser or SGMLParser performs structural parsing of the page source code to obtain the URL, subject, user name, content and time attribute information as a tree structure, and each node of the page is visited by traversing the logical relations between the tree nodes to generate the DOM tree in XML form.
Further, in step 5, all acquired web pages are parsed and the XML DOM tree is generated as follows. HTMLParser or SGMLParser performs structural parsing of the page source code to obtain all DOM tree node elements in XML form; each DOM tree node element contains URL, subject, user name, content and time attribute information. Each node of the page is visited by traversing the logical relations between the tree nodes to generate the DOM tree in XML form. The DOM tree is represented as an undirected tree $G(V, E)$, where $V$ is the node set of the DOM tree and $E$ is the logical relation between any two DOM tree nodes, i.e. $E$ is the URL connection relation within the page.
An intermediate tree is then generated. First, for the root node $v \in V_N$, its URL $v(r)$ and the values $B_v$ and $n_v$ are read, where $B_v$ is the set of all nodes reachable through arc $v$. Next, the $n_v$ values of the nodes in $V_N \setminus \{r\}$ are compared, where $V_N \setminus \{r\}$ is the node set of the tree with the root node $r$ removed, and a sequence $L$ is constructed in order of $n_v$. $H(Y, A)$ is then initialized: a vertex $y$ is chosen to represent the leaf node $r$, with $C_y = \{r\}$ and $n_y = |N| + 1/2$, where $|N|$ is the number of nodes in $N$, and the $|D^N_r|$ leaf-node arcs connected to that leaf node are added to the set $z$, where $D^N_r$ denotes the set of all URLs of the leaf node $v \in V$ and the known connecting node of each leaf-node arc is $y$.
The notation is as follows. The intermediate tree is $H(Y, A)$ and the DOM tree is $T_N$. Each vertex $y \in Y$ of the intermediate tree is a set of $T_N$ nodes $C_y \subseteq V_N$, and the arc set $A$ is a set of edges of $T_N$; every node and leaf node is represented by its own vertex $y \in Y$. $N$ is the set of DOM tree node elements. $V_N$ and $E_N$ are the sets of all DOM tree nodes and edges in $T_N$; $E_N$ is called the edge set and $V_N$ the node set of the tree.
The intermediate tree $H$ is initialized as empty, and the DOM tree nodes $v \in V_N$ are added to $H$ by depth-first traversal until every DOM tree node in $V_N$ is in $H$. Each node $y \in Y$ of $H$ represents a set $C_y$ containing one or more child nodes $v$, and $n_y$ is the $n_v$ value of those DOM tree nodes. Nodes with the same $n_v$ value have URLs belonging to the same connection relation and can be reached from one another by URL. If a node $y$ in $H$ has several child nodes, the nodes $v \in C_y$ all have the same $n_v$ value. An arc $a$ in $H$ represents an edge of $T_N$. While the arcs of $H$ are being read, an arc whose front end is connected to a node but whose other end has no connected node, or no DOM tree node element, is called a leaf-node arc, and the set of leaf-node arcs is $z$; if a leaf-node arc is found to have a leaf node during traversal, it is deleted from $z$. Each arc $a \in A$ has a corresponding set $B_a$, the set of all nodes reachable through arc $a$. For the arcs in $z$, DOM tree node elements, denoted $v'$, are extracted one by one from the DOM tree nodes by depth-first traversal and added to $H(Y, A)$ as updates until the traversal ends. The final $H(Y, A)$ is the XML DOM tree.
HTMLParser or SGMLParser parses the HTML file that makes up the source code of the web page, and the parsed information is saved as a tree structure. Node is the basic data type in which that information is stored.
The definition of Node:
public interface Node extends Cloneable;
The methods of Node fall into several categories.
Functions for traversing the tree, which are used when the page source code is structurally parsed to obtain all DOM tree node elements in XML form:
Node getParent(): gets the parent node
NodeList getChildren(): gets the list of child nodes
Node getFirstChild(): gets the first child node
Node getLastChild(): gets the last child node
Node getPreviousSibling(): gets the previous sibling node
Node getNextSibling(): gets the next sibling node
Functions for getting the content of a Node:
String getText(): gets the text
String toPlainTextString(): gets the plain-text information
String toHtml(): gets the HTML information (original HTML)
String toHtml(boolean verbatim): gets the HTML information (original HTML)
String toString(): gets the string information (original HTML)
Page getPage(): gets the Page object corresponding to this Node
int getStartPosition(): gets the start position of this Node in the HTML page
int getEndPosition(): gets the end position of this Node in the HTML page
Functions for Filter filtering:
void collectInto(NodeList list, NodeFilter filter): filters this node against the conditions of filter and puts the qualifying nodes into list
Functions for Visitor traversal:
void accept(NodeVisitor visitor): applies visitor to this Node
Functions for modifying content, which are used less often:
void setPage(Page page): sets the Page object corresponding to this Node
void setText(String text): sets the text
void setChildren(NodeList children): sets the list of child nodes
Other functions:
void doSemanticAction(): performs the operation corresponding to this Node (only a few Tags have one)
Object clone(): the abstract function of the Cloneable interface
In practice HTMLParser is used mostly to process HTML pages, for which the Filter- or Visitor-related functions are essential, while the first and second categories of functions are the most frequently used. The first category is easy to understand; the second category is illustrated below.
The following is the HTML file used for testing:
Further, in step 6, traversing the DOM tree to extract key information and generate collection rules proceeds as follows. Extracting key information means extracting the subject, user name, content and time attribute information from a node of the DOM tree; the key information is that subject, user name, content and time attribute information. The generated collection rules are expressed as the following regular expressions:
(<subject[^>]*>)([\s\S]*?)(</subject>), (<user name[^>]*>)([\s\S]*?)(</user name>),
(<content[^>]*>)([\s\S]*?)(</content>), (<time[^>]*>)([\s\S]*?)(</time>),
(?is)(?<=<td>).+?(?=</td>),
where:
(?is) sets the matching modes: i means ignore case, and s means single-line mode, in which "." also matches line breaks;
(?<=<td>) is a positive lookbehind: the match must be preceded by <td>, but <td> itself is not consumed and is not included in the result;
.+? matches any characters non-greedily: after each character it matches, it tries the following sub-expression, and it extends the match only while that sub-expression fails;
(?=</td>) is a positive lookahead: the match must be followed by </td>, but </td> itself is not consumed and is not included in the result;
\s matches any whitespace character, including space, tab and form feed, and is equivalent to [ \f\n\r\t\v];
\S matches any non-whitespace character and is equivalent to [^ \f\n\r\t\v];
? after a quantifier indicates non-greedy matching.
Each collection-rule regular expression matches the opening tag, attributes, content and closing tag of one pair of XML tags.
Further, in step 7, page information is collected according to the collection rules as follows: the opening tag, attributes, content and closing tag of the XML tags for subject, user name, content and time are matched and collected according to the collection rules; the word segmentation system is called to segment the subject and content attributes that require segmentation; and the segmented document object is then inserted into the database record table. The word segmentation system is the Chinese Academy of Sciences segmentation system ICTCLAS50, and the page information subject, user name, content and time are added to the dictionary file keydict.txt.
Further, in step 8, having crawled all URLs within the domain name range from the web pages means that no URL can be extracted from the current page, or that no URL in the current page satisfies the matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?
i.e. no further URL can be extracted from the current page.
Preferably, an embodiment of the Web page information acquisition method based on a web crawler provided by the present disclosure is as follows:
Step A: construct a web crawler model;
Step B: the web crawler model crawls the URLs within the domain name range from the web page and then, by modifying the HTTP data packet of the web page, crawls the page returned by a preset URL. The modified HTTP data packet of the web page is the returned query result for the user-specified data values, and the user-specified data values are obtained by parsing the HTTP page elements. The page returned by the preset URL is the page returned by executing the query for the user-specified data values, which comprise the page information subject, user name, content and time;
Step C: package the data into a Json-format data set by way of modifying the HTTP data packet of the web page, and extract the Json data set to the background server;
Step D: parse the Json data set in the server background and, based on the Json key-value format, convert between Json and Java business data entity objects across systems. According to the configured data entity object model, check whether each item of the extracted data set has the expected data type (for example, a character string cannot be converted into a numeric type), and reject invalid data whose data type fails the check. Invalid data whose data type fails the check is data in which any one of the attributes page information subject, user name, content or time does not conform;
Step E: according to the configured data entity object model, put the specified data into the specified attributes of the data entity object, i.e. assign values to the entity object;
Step F: convert the assigned entity object data into Json format and output it from the server to the client;
Step G: parse the Json-format data, modify the user's HTTP data packet according to the Json key-value pairs, and write it to the server.
Further, in step D, the conversion between Json and Java business data entity objects across systems is performed by a Java application that reads the Json-format data and, using existing Json techniques, converts it into data readable by the Java application.
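The type check and entity assignment of steps D and E can be sketched as follows. This is an illustration under assumptions rather than the method of the source: the Json data set is taken to be already parsed into string key-value pairs by some Json library, the key names subject, userName, content and time are hypothetical, the time value is assumed to be an ISO date such as 2018-12-07, and the class names RecordCheckSketch and PageRecord are invented for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Map;
import java.util.Optional;

final class RecordCheckSketch {
    /** Hypothetical entity object holding the four attributes named in the text. */
    static final class PageRecord {
        final String subject;
        final String userName;
        final String content;
        final LocalDate time;

        PageRecord(String subject, String userName, String content, LocalDate time) {
            this.subject = subject;
            this.userName = userName;
            this.content = content;
            this.time = time;
        }
    }

    /**
     * Converts key-value pairs into an entity object. Returns an empty
     * Optional when an attribute is missing or has the wrong data type,
     * so that the caller can reject the record as invalid data.
     */
    static Optional<PageRecord> toEntity(Map<String, String> fields) {
        String subject = fields.get("subject");
        String userName = fields.get("userName");
        String content = fields.get("content");
        String time = fields.get("time");
        if (subject == null || userName == null || content == null || time == null) {
            return Optional.empty();
        }
        try {
            // A string that is not a date cannot be converted, so the record is rejected.
            return Optional.of(new PageRecord(subject, userName, content, LocalDate.parse(time)));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

Returning an Optional instead of throwing lets a batch of records be filtered in one pass, with invalid entries simply dropped.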
A Web page information acquisition device based on a web crawler provided by an embodiment of the present disclosure is shown in Fig. 2, which is a diagram of the device. The device of this embodiment comprises a processor, a memory, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the steps of the above embodiment of the Web page information acquisition device based on a web crawler.
The device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the program runs in the following units of the device:
a crawler model construction unit, for constructing the web crawler model;
a page acquisition unit, for initiating an HTTP request for the starting URL to obtain a web page;
a regular-expression matching unit, for the web crawler model to crawl the URLs within the domain name range from the web page according to the regular-expression matching rule;
a multi-page request unit, for initiating HTTP requests for the URLs within the domain name range to obtain all web pages in the domain name range;
a DOM tree generation unit, for parsing all acquired web pages and generating the XML DOM tree;
a collection rule generation unit, for traversing the DOM tree, extracting key information and generating collection rules;
a page information collection unit, for collecting page information according to the collection rules;
a loop collection unit, for repeatedly running the page acquisition unit, regular-expression matching unit, multi-page request unit, DOM tree generation unit, collection rule generation unit and page information collection unit until the web crawler model has crawled all URLs within the domain name range from the web pages according to the regular-expression matching rule.
The Web page information acquisition device based on a web crawler can run on computing devices such as desktop computers, notebooks, palmtop computers and cloud servers. The device on which it runs may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that this is only an example of a Web page information acquisition device based on a web crawler and does not limit such a device, which may include more or fewer components than the example, combine certain components, or use different components; for example, the device may also include input/output devices, network access devices, buses and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control centre of the device on which the Web page information acquisition device based on a web crawler runs, and it connects the parts of that device through various interfaces and lines.
The memory can be used to store the computer program and/or modules. The processor implements the various functions of the Web page information acquisition device based on a web crawler by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area can store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
Preferably, the web crawler model of the present disclosure can be injected into a C/S front end:
1) Through C/S, the web crawler model enumerates IE instances to find the operation page object and inject the script.
2) Through C/S, the web crawler model checks the server and client version numbers, which enables real-time updates.
jQuery is a JavaScript library that is compatible with multiple browsers and provides AJAX interaction for websites.
The web crawler model uses jQuery to add custom elements to the current page and calls AJAX to interact with the back end.
In data-based applications, the web crawler model can obtain data from a server that is independent of the actual web page and write it into the page dynamically; the back end returns the data to the front end via AJAX, and the relevant data is parsed and written into the record table of the corresponding database.
Although the present disclosure has been described in considerable detail and with particular reference to several embodiments, it is not intended to be limited to any of these details or embodiments, or to any specific embodiment. Rather, it should be read with reference to the appended claims, giving those claims the broadest possible interpretation in view of the prior art, so as to effectively cover the intended scope of the disclosure. Furthermore, the disclosure has been described above in terms of embodiments foreseeable by the inventor in order to provide a useful description, and insubstantial modifications of the disclosure that are not presently foreseen may nonetheless represent equivalents of it.
Claims (10)
1. A Web page information acquisition method based on a web crawler, characterized in that the method comprises the following steps:
Step 1, constructing a web crawler model;
Step 2, initiating an HTTP request according to a starting URL to obtain a web page;
Step 3, the web crawler model crawling the URLs within the domain name range from the web page according to a regular-expression matching rule;
Step 4, initiating HTTP requests according to the URLs within the domain name range to obtain all web pages within the domain name range;
Step 5, parsing all obtained web pages and generating an XML DOM tree;
Step 6, traversing the DOM tree and extracting key information to generate collection rules;
Step 7, collecting page information according to the collection rules;
Step 8, repeating step 2 to step 7 until the web crawler model has crawled, according to the regular-expression matching rule, all URLs within the domain name range from the web pages.
2. The Web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 1, the method of constructing the web crawler model comprises the following steps:
Step 1.1, creating a configuration script for the web crawler and writing the starting URL into the configuration script;
Step 1.2, configuring, in the configuration script, the domain name range that the web crawler crawls;
Step 1.3, constructing the queue of the web crawler, whose elements are the stored URLs, i.e. the queue is used to store URLs;
Step 1.4, constructing the regular-expression matching rule of the web crawler, the web crawler storing the URLs that satisfy the matching rule in the crawl queue.
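Steps 1.1 to 1.4 can be pictured as a small configuration object. The following Python sketch is only one possible reading: the class name `CrawlerModel`, its field names, and the method `enqueue_matches` are hypothetical, and the configuration "script" is reduced to constructor arguments.

```python
import re
from collections import deque
from dataclasses import dataclass, field
from typing import Deque
from urllib.parse import urlparse


@dataclass
class CrawlerModel:
    start_url: str          # step 1.1: starting URL written into the configuration
    domain: str             # step 1.2: domain name range to crawl
    url_rule: re.Pattern    # step 1.4: regular-expression matching rule
    queue: Deque[str] = field(default_factory=deque)  # step 1.3: URL queue

    def __post_init__(self) -> None:
        self.queue.append(self.start_url)

    def accepts(self, url: str) -> bool:
        """Check that the URL's host lies inside the configured domain range."""
        host = urlparse(url).hostname or ""
        return host == self.domain or host.endswith("." + self.domain)

    def enqueue_matches(self, html: str) -> int:
        """Store every new in-domain URL that satisfies the matching rule."""
        added = 0
        for match in self.url_rule.finditer(html):
            url = match.group(0)
            if self.accepts(url) and url not in self.queue:
                self.queue.append(url)
                added += 1
        return added
```

A model built this way holds everything the later steps need: the queue supplies the next URL to request, and `enqueue_matches` feeds newly found URLs back into it.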
3. The Web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 3, the method by which the web crawler model crawls the URLs within the domain name range from the web page according to the regular-expression matching rule is: starting from the URL of the web page, crawl all URLs in the web page, extract all URLs from the current page, and put them into the queue of the web crawler; all URLs in the current page are extracted by the regular-expression matching rule
http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?
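A short Python sketch shows how such a rule can be applied. It assumes that the rule, whose backslashes appear to have been lost in the published text, reads `http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?`; the function name `extract_urls` is hypothetical.

```python
import re
from typing import List

# Assumed reading of the matching rule (backslashes restored).
URL_RULE = re.compile(r"http://(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?")


def extract_urls(text: str) -> List[str]:
    """Return each distinct URL matched by the rule, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_RULE.finditer(text):
        url = match.group(0)   # group(0) is the whole match, not a sub-group
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Under this reading the rule matches a host name and an optional query string but no path, so `http://example.com/a/b` yields only `http://example.com`. The query part `\?\S*` also runs to the next whitespace character, which means markup that directly follows a query string would be captured with it.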
4. The Web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 3, the domain name range is all URLs in the web pages obtained from the same URL.
5. The Web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is: use HTMLParser or SGMLParser to structurally parse the source code of the web page to obtain the tree's URL, subject, user name, content, and time attribute information, and visit each node of the web page by traversing the logical relationships between the nodes of the tree, generating a DOM tree in XML form.
6. The Web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 5, the method of parsing all obtained web pages and generating the XML DOM tree is: use HTMLParser or SGMLParser to structurally parse the source code of the web page to obtain all DOM tree node elements in XML form, the DOM tree node elements including URL, subject, user name, content, and time attribute information; visit each node of the web page by traversing the logical relationships between the nodes of the tree, generating a DOM tree in XML form. The DOM tree is represented by an undirected tree G(V, E), where V is the node set of the DOM tree and E is the logical relationship between any two DOM tree nodes, i.e. E is the connection relationship among the URLs in the web page.
An intermediate tree is then generated. First read the root node v ∈ V_N, whose URL is v(r), together with its B_v and n_v values, where B_v is the set of all nodes reachable through arc v; then compare the n_v values of the nodes in V_N \ {r}, where V_N \ {r} is the node set of the tree with the root node r removed, and construct the sequence L in order of n_v value. Next initialize H(Y, A): select a vertex y to represent leaf node r, let C_y = {r} and n_y = |N|+1/2, add the arcs of the |D_N^r| leaf nodes connected to the leaf node to the set Z, and let y be the known connecting node of these leaf-node arcs.
The intermediate tree is H(Y, A) and the DOM tree is T_N. Each vertex y ∈ Y of the intermediate tree is a node set C_y ⊆ V_N of T_N, and the arc set A is the set of edges of T_N; each node and each leaf node is represented by its own vertex y ∈ Y. N denotes the set of DOM tree node elements; V_N and E_N are the sets of all DOM tree nodes and edges in T_N, E_N being called the edge set and V_N the node set of the tree.
Initialize the intermediate tree H as empty and add the DOM tree nodes v ∈ V_N to H by depth-first traversal until all DOM tree nodes in V_N are in H. Each node y ∈ Y of H represents a set C_y that contains one or more child nodes v, and n_y represents the n_v value of the nodes in the DOM tree; nodes with the same n_v value have URLs that belong to the same connection relationship and can be accessed through those URLs. If a node y in H has several child nodes, the nodes v ∈ C_y share the same n_v value. An arc a in H represents an edge in T_N. While the arcs in H are being read, an arc whose front end is connected to a node but whose other end has no connected node, or has no DOM tree node element, is called a leaf-node arc, and the set of leaf-node arcs is Z; if a leaf-node arc turns out to have a leaf node during traversal, it is deleted from Z. Each arc a ∈ A has a corresponding set B_a, the set of all nodes reachable through arc a. For the set Z, DOM tree node elements are extracted from the DOM tree nodes one by one in depth-first order, denoted v', and added to H(Y, A) as updates.
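The core of this construction, a depth-first traversal that places DOM nodes with equal n_v values into the same vertex set C_y and records arcs between the resulting vertices, can be sketched in Python. This is a deliberately simplified reading: the function `build_intermediate_tree`, the dictionary representation of the tree, and the use of the n_v value itself as the vertex key are assumptions, and the leaf-node arc set Z and the sets B_a are not modelled.

```python
from typing import Dict, Hashable, List, Set, Tuple


def build_intermediate_tree(
    children: Dict[str, List[str]],
    n_value: Dict[str, Hashable],
    root: str,
) -> Tuple[Dict[Hashable, List[str]], Set[Tuple[Hashable, Hashable]]]:
    """Group DOM nodes by n_v during a depth-first walk.

    Returns (C, A): C maps each vertex key y to its node list C_y, and A holds
    the arcs between vertices implied by the DOM tree edges.
    """
    groups: Dict[Hashable, List[str]] = {}
    arcs: Set[Tuple[Hashable, Hashable]] = set()
    visited: Set[str] = set()
    stack = [(root, None)]                 # (node, parent) pairs
    while stack:
        node, parent = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        y = n_value[node]
        groups.setdefault(y, []).append(node)
        if parent is not None and n_value[parent] != y:
            arcs.add((n_value[parent], y))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(children.get(node, [])):
            stack.append((child, node))
    return groups, arcs
```

The traversal ends when every DOM node has been placed in some C_y, which corresponds to the stopping condition stated in the claim.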
7. The Web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 6, the method of traversing the DOM tree and extracting key information to generate collection rules is: traversing the DOM tree extracts the subject, user name, content, and time attribute information from a node of the DOM tree, this information being the key information; the generated collection rules are expressed as the following regular expressions:
(<subject[^>])([\s\S]*?)(</subject>),
(<user name[^>])([\s\S]*?)(</user name>),
(<content[^>])([\s\S]*?)(</content>),
(<time[^>])([\s\S]*?)(</time>),
(?is)(?<=<td>).+?(?=</td>).
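The following Python sketch shows how rules of this shape could be generated per field and applied to an XML string. It rests on assumptions that go beyond the published text: the opening-tag part is written as `[^>]*>` so that it consumes the whole opening tag, the tag name `username` stands in for the translated field name, and `build_rules` and `collect` are hypothetical helper names.

```python
import re
from typing import Dict, Iterable, List

FIELDS = ("subject", "username", "content", "time")


def build_rules(fields: Iterable[str] = FIELDS) -> Dict[str, re.Pattern]:
    """Generate one three-group rule per field: open tag, body, close tag."""
    return {
        name: re.compile(rf"(<{name}[^>]*>)([\s\S]*?)(</{name}>)")
        for name in fields
    }


def collect(xml: str, rules: Dict[str, re.Pattern]) -> Dict[str, List[str]]:
    """Apply every rule and keep the body (group 2) of each match."""
    return {
        name: [m.group(2).strip() for m in rule.finditer(xml)]
        for name, rule in rules.items()
    }


# Table-cell rule: case-insensitive, dot matches newlines, tags excluded
# from the match by the lookbehind and lookahead.
TD_RULE = re.compile(r"(?is)(?<=<td>).+?(?=</td>)")
```

Because the body group is lazy (`*?`), each match stops at the first closing tag, so repeated elements on one page produce separate matches instead of one oversized match.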
8. The Web page information acquisition method based on a web crawler according to claim 1, characterized in that, in step 7, the method of collecting page information according to the collection rules is: according to the collection rules, collect the opening part, attributes, content, and closing part of the XML tags that match subject, user name, content, and time.
9. A Web page information acquisition method based on a web crawler, characterized in that the method comprises the following steps:
Step A, constructing a web crawler model;
Step B, the web crawler model crawling the URLs within the domain name range from the web page, and crawling again the page returned by a preset URL by modifying the HTTP data packet of the web page, wherein the modified HTTP data packet of the web page is the returned query result for the user-specified data values, the HTTP page elements are parsed to obtain the user-specified data values, the page returned by the preset URL is the page returned by executing the method that queries the user-specified data values, and the user-specified data values include the subject, user name, content, and time of the page information;
Step C, packaging the data into a data set in JSON format by modifying the HTTP data packet of the web page, and extracting the JSON data set to the back-end server;
Step D, parsing the JSON data set in the server back end and, based on the JSON key-value format, converting between JSON and the system's Java business data entity objects; according to the configured data entity object model, checking whether the extracted data set is of the same data type, and rejecting invalid data whose data type fails the check, such invalid data being data in which any one of the attributes subject, user name, content, or time of the page information does not conform;
Step E, placing the specified data into the specified attributes of the data entity object according to the configured data entity object model, thereby assigning values to the entity object;
Step F, converting the assigned entity object data into JSON format and outputting it from the server to the client;
Step G, parsing the data in JSON format, modifying the user's HTTP data packet according to the JSON key-value form, and writing it to the server.
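Steps D to F (parse the JSON data set, check each record against the entity model, assign the entity, and serialise it again) can be sketched as follows. The claim names Java entity objects; this sketch uses a Python dataclass as a stand-in, and the names `PageInfo`, `to_entity`, `parse_records`, and `to_json`, as well as the choice of string-typed fields, are assumptions.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class PageInfo:
    """Hypothetical entity model with the four attributes named in the claim."""
    subject: str
    username: str
    content: str
    time: str


def to_entity(record: dict) -> Optional[PageInfo]:
    """Steps D and E: return an entity, or None if any attribute fails the type check."""
    values = {}
    for f in fields(PageInfo):
        value = record.get(f.name)
        if not isinstance(value, str):   # missing or wrongly typed attribute
            return None
        values[f.name] = value
    return PageInfo(**values)


def parse_records(payload: str) -> List[PageInfo]:
    """Step D: parse the JSON data set and drop invalid records."""
    entities = []
    for record in json.loads(payload):
        entity = to_entity(record)
        if entity is not None:
            entities.append(entity)
    return entities


def to_json(entities: List[PageInfo]) -> str:
    """Step F: convert the assigned entities back to JSON for the client."""
    return json.dumps([asdict(e) for e in entities], ensure_ascii=False)
```

Rejecting a record as soon as one attribute fails mirrors the rule in step D that data is invalid if any one of the four attributes does not conform.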
10. A Web page information acquisition device based on a web crawler, characterized in that the device comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to run the following units of the device:
a crawler model construction unit, for constructing a web crawler model;
a page acquisition unit, for initiating an HTTP request according to a starting URL to obtain a web page;
a regular-expression matching unit, for the web crawler model to crawl the URLs within the domain name range from the web page according to a regular-expression matching rule;
a multi-page request unit, for initiating HTTP requests according to the URLs within the domain name range to obtain all web pages within the domain name range;
a DOM tree generation unit, for parsing all obtained web pages and generating an XML DOM tree;
a collection rule generation unit, for traversing the DOM tree and extracting key information to generate collection rules;
a page information acquisition unit, for collecting page information according to the collection rules;
a loop collection unit, for repeatedly running the page acquisition unit, regular-expression matching unit, multi-page request unit, DOM tree generation unit, collection rule generation unit, and page information acquisition unit until the web crawler model has crawled, according to the regular-expression matching rule, all URLs within the domain name range from the web pages.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811499475.4A CN109657121A (en) | 2018-12-09 | 2018-12-09 | A kind of Web page information acquisition method and device based on web crawlers |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811499475.4A CN109657121A (en) | 2018-12-09 | 2018-12-09 | A kind of Web page information acquisition method and device based on web crawlers |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109657121A true CN109657121A (en) | 2019-04-19 |
Family
ID=66113862
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811499475.4A Pending CN109657121A (en) | 2018-12-09 | 2018-12-09 | A kind of Web page information acquisition method and device based on web crawlers |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109657121A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007193642A (en) * | 2006-01-20 | 2007-08-02 | Nippon Telegr & Teleph Corp <Ntt> | Xpath processor, xpath processing method, xpath processing program and storage medium |
US20080098300A1 (en) * | 2006-10-24 | 2008-04-24 | Brilliant Shopper, Inc. | Method and system for extracting information from web pages |
US20140123303A1 (en) * | 2012-10-31 | 2014-05-01 | Tata Consultancy Services Limited | Dynamic data masking |
CN107066576A (en) * | 2017-04-12 | 2017-08-18 | 成都四方伟业软件股份有限公司 | A kind of big data web crawlers paging system of selection and system |
CN107608949A (en) * | 2017-10-16 | 2018-01-19 | 北京神州泰岳软件股份有限公司 | A kind of Text Information Extraction method and device based on semantic model |
CN107808000A (en) * | 2017-11-13 | 2018-03-16 | 哈尔滨工业大学(威海) | A kind of hidden web data collection and extraction system and method |
CN108052632A (en) * | 2017-12-20 | 2018-05-18 | 成都律云科技有限公司 | A kind of method for obtaining network information, system and company information search system |
Non-Patent Citations (2)
Title |
---|
李文等: "基于XML和DOM技术的Web信息抽取模型", 《大连交通大学学报》 * |
高梦超等: "基于众包的社交网络数据采集模型设计与实现", 《计算机工程》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134853A (en) * | 2019-05-13 | 2019-08-16 | 重庆八戒传媒有限公司 | Data crawling method and system |
CN110287394A (en) * | 2019-06-28 | 2019-09-27 | 北京金山安全软件有限公司 | Website resource crawling method and device, computer equipment and storage medium |
CN110297962A (en) * | 2019-06-28 | 2019-10-01 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN110287394B (en) * | 2019-06-28 | 2022-01-11 | 北京金山安全软件有限公司 | Website resource crawling method and device, computer equipment and storage medium |
CN110611713A (en) * | 2019-09-17 | 2019-12-24 | 深圳市网心科技有限公司 | Data downloading method and system, electronic equipment and storage medium |
CN111859867A (en) * | 2020-07-20 | 2020-10-30 | 广西美立方工程咨询有限公司 | Web data extraction system based on XML and XPath and use method thereof |
CN111859867B (en) * | 2020-07-20 | 2024-03-12 | 广西美立方工程咨询有限公司 | Web data extraction system based on XML and XPath and use method thereof |
CN112084389A (en) * | 2020-08-17 | 2020-12-15 | 上海交通大学 | Network crawler-based academic institution geographical position information extraction method |
CN113065051A (en) * | 2021-04-02 | 2021-07-02 | 西南石油大学 | Visual agricultural big data analysis interactive system |
CN113922980A (en) * | 2021-08-23 | 2022-01-11 | 北京天融信网络安全技术有限公司 | DNS monitoring method, equipment and storage medium based on HTTP detection information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109657121A (en) | A kind of Web page information acquisition method and device based on web crawlers | |
US10942708B2 (en) | Generating web API specification from online documentation | |
CN101488151B (en) | System and method for gathering website contents | |
CN109582909A (en) | Webpage automatic generation method, device, electronic equipment and storage medium | |
CN101799753B (en) | Method and device for realizing tree structure | |
US8839192B2 (en) | System and method for presentation of cross organizational applications | |
JP2018097846A (en) | Api learning | |
CN104881490B (en) | A kind of WEB form data access method and system | |
US20110107243A1 (en) | Searching Existing User Interfaces to Enable Design, Development and Provisioning of User Interfaces | |
CN111045678A (en) | Method, device and equipment for executing dynamic code on page and storage medium | |
CN102279894A (en) | Method for searching, integrating and providing comment information based on semantics and searching system | |
Szeredi et al. | The semantic web explained: The technology and mathematics behind web 3.0 | |
CN108509544B (en) | Method and device for acquiring mind map, equipment and readable storage medium | |
CN111090417A (en) | Binary file analysis method, device, equipment and medium | |
CN113986241B (en) | Configuration method and device of business rules based on knowledge graph | |
CN103744987B (en) | Video website media asset aggregation method and system based on DOM tree matching | |
US10326858B2 (en) | System and method for dynamically generating personalized websites | |
CN107220250A (en) | A kind of template configuration method and system | |
CN114356306A (en) | Method for realizing visual customization of system components | |
CN104270257A (en) | Network element level network management service configuration adaptive system and method based on PB and XPATH | |
CN115358200A (en) | Template document automatic generation method based on SysML meta model | |
CN115017182A (en) | Visual data analysis method and equipment | |
CN111061975B (en) | Method and device for processing irrelevant content in page | |
CN117111909A (en) | Code automatic generation method, system, computer equipment and storage medium | |
CN106991144B (en) | Method and system for customizing data crawling workflow |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190419 |