CN107423382A - network crawling method and device - Google Patents
- Publication number
- CN107423382A, CN201710571635.0A
- Authority
- CN
- China
- Prior art keywords
- child node
- link
- task
- website address
- subtask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides a network crawling method and device. The method includes: a child node receives a subtask sent by a host node, where the subtask includes the task type of a crawling task and the website addresses in the search group corresponding to the child node; the search group includes at least one website address and is obtained by the host node dividing at least one website address according to the distributed programming framework map-reduce and the task type of the crawling task; the child node crawls according to the subtask and stores the crawled data in a local storage; the child node queries the local storage to obtain a query result and sends the query result to the host node. The invention makes it possible to crawl large amounts of web page data.
Description
Technical field
The present invention relates to communication technology, and in particular to a network crawling method and device.
Background
As Internet resources grow ever richer, more and more platforms need large amounts of supporting data to carry out their functions. The usual channels for obtaining data resources are: logging in to a hosting platform and fetching the data, connecting directly to the database of another system, and exchanging data through a data interface. These channels, however, often fail to deliver the relevant data or are costly. Network crawling is therefore currently used to crawl the data on web pages so that a platform can find web pages and the related data.
Because the curl (CommandLine Uniform Resource Locator) function supports browser behaviours such as GET and POST and can therefore simulate browser operation, existing network crawling methods usually use the curl function in the RCurl package to fetch web page data. However, an existing crawling method that relies only on the curl function cannot complete crawling tasks with large data volumes. A network crawling method capable of crawling large amounts of data is therefore urgently needed.
Summary of the invention
The present invention provides a network crawling method and device to solve the problem that existing network crawling methods cannot complete crawling tasks with large data volumes.
In a first aspect, the present invention provides a network crawling method applied to a network crawling system, the network crawling system including one host node and multiple child nodes. For any child node, the method includes:
The child node receives a subtask sent by the host node. The subtask includes the task type of a crawling task and the website addresses in the search group corresponding to the child node. The search group includes at least one website address and is obtained by the host node dividing the at least one website address according to the distributed programming framework map-reduce and the task type of the crawling task;
The child node crawls according to the subtask and stores the crawled data in a local storage;
The child node queries the local storage to obtain a query result, and sends the query result to the host node.
Optionally, the child node crawling according to the subtask and storing the crawled data in the local storage includes:
The child node traverses and connects to the website addresses in the subtask, obtaining first website addresses, for which the connection succeeds, and second website addresses, for which the connection fails;
The child node obtains the links corresponding to the web data pages to be crawled in the first website address;
The child node traverses and connects to the link corresponding to each web data page to be crawled in the first website address, obtaining first links, for which the connection succeeds, and second links, for which the connection fails;
The child node filters the web page data corresponding to the first link according to the task type of the crawling task, obtaining the web page data corresponding to the first link;
The child node parses the web page data corresponding to the first link to obtain target crawled data;
The child node stores the target crawled data and the corresponding first link in the local storage.
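As an illustration only, the connect, filter, parse and store steps above can be sketched in Python. The callables `fetch`, `is_relevant` and `parse`, and the dictionary `store` that stands in for the local storage, are assumptions introduced for this sketch; the original text does not name any of them.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical callable: returns the page text, or None if the connection fails.
Fetch = Callable[[str], Optional[str]]


def split_by_connection(urls: Iterable[str], fetch: Fetch) -> Tuple[Dict[str, str], List[str]]:
    """Traverse urls once; return (pages of successful connections, failed urls)."""
    ok: Dict[str, str] = {}
    failed: List[str] = []
    for url in urls:
        page = fetch(url)
        if page is None:
            failed.append(url)
        else:
            ok[url] = page
    return ok, failed


def crawl_links(links, fetch, is_relevant, parse, store):
    """Filter, parse and store pages for links that connect; return the failed links."""
    pages, second_links = split_by_connection(links, fetch)
    for link, page in pages.items():
        if is_relevant(page):          # filtering according to the task type
            store[link] = parse(page)  # target data is kept together with its link
    return second_links
```

The returned list plays the role of the second links, which are handled separately by the reconnection steps.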
Optionally, the method further includes:
The child node reconnects to the second link and determines whether the connection to the second link succeeds;
If so, the child node filters the web page data corresponding to the second link according to the task type of the crawling task to obtain the web page data corresponding to the second link, parses that web page data to obtain the target crawled data, and stores the target crawled data and the corresponding second link in the local storage;
If not, the child node repeats the operation of connecting to the second link and determining whether the connection succeeds; if the number of repeated connections exceeds a first preset number, the child node stores the second link in the local storage.
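The bounded reconnection described above can be sketched as follows. This is a minimal sketch under stated assumptions: `fetch` is a hypothetical callable that returns `None` when the connection fails, `max_retries` stands for the preset number, and `failed_store` stands for the part of the local storage that records links that could not be reached.

```python
def retry_link(link, fetch, max_retries, failed_store):
    """Reconnect to link up to max_retries times; record the link if every attempt fails."""
    for _ in range(max_retries):
        page = fetch(link)
        if page is not None:
            return page           # connected: the caller filters, parses and stores the page
    failed_store.append(link)     # preset number of attempts used up: keep the failed link
    return None
```

Recording the failed link rather than discarding it keeps a trace of which links produced no data.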
Optionally, the method further includes:
The child node reconnects to the second website address and determines whether the connection to the second website address succeeds;
If so, the child node obtains the links corresponding to the web data pages to be crawled in the second website address;
The child node traverses and connects to the link corresponding to each web data page to be crawled in the second website address, obtaining third links, for which the connection succeeds, and fourth links, for which the connection fails;
The child node filters the web page data corresponding to the third link according to the task type of the crawling task, obtaining the web page data corresponding to the third link;
The child node parses the web page data corresponding to the third link to obtain the target crawled data;
The child node stores the target crawled data and the corresponding third link in the local storage;
If not, the child node repeats the operation of connecting to the second website address and determining whether the connection succeeds; if the number of repeated connections exceeds a second preset number, the child node stores the second website address in the local storage.
Optionally, the method further includes:
The child node reconnects to the fourth link and determines whether the connection to the fourth link succeeds;
If so, the child node filters the web page data corresponding to the fourth link according to the task type of the crawling task to obtain the web page data corresponding to the fourth link, parses that web page data to obtain the target crawled data, and stores the target crawled data and the corresponding fourth link in the local storage;
If not, the child node repeats the operation of connecting to the fourth link and determining whether the connection succeeds; if the number of repeated connections exceeds a third preset number, the child node stores the fourth link in the local storage.
Optionally, the subtask further includes a status indicator bit, which indicates whether the child node is to execute the subtask.
In a second aspect, the present invention provides a network crawling method applied to a network crawling system, the network crawling system including one host node and multiple child nodes. The method includes:
The host node obtains a query request input by a user and derives the task type of a crawling task from the query request, the query request corresponding to at least one website address;
The host node divides the at least one website address according to map-reduce and the task type of the crawling task, obtaining at least one search group. Each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node;
The host node sends each child node its corresponding subtask, the subtask including the task type of the crawling task and the website addresses in the corresponding search group;
The host node receives the query results sent by the child nodes and aggregates them to obtain a target query result.
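The host-side flow above (divide, dispatch, aggregate) can be sketched in Python. Everything named here is an assumption made for illustration: child nodes are modelled as plain callables that take a subtask dictionary and return a result dictionary, the division is a simple round-robin split, and aggregation is a dictionary merge. The original text does not specify how the division or the aggregation is implemented.

```python
from typing import Callable, Dict, List


def make_search_groups(addresses: List[str], n_children: int) -> List[List[str]]:
    """Divide addresses round-robin into at most n_children non-empty search groups."""
    groups = [addresses[i::n_children] for i in range(n_children)]
    return [g for g in groups if g]


def run_host(task_type: str, addresses: List[str],
             children: List[Callable[[dict], Dict[str, str]]]) -> Dict[str, str]:
    """Send one subtask per search group and merge the children's query results."""
    groups = make_search_groups(addresses, len(children))
    target: Dict[str, str] = {}
    for child, group in zip(children, groups):
        subtask = {"task_type": task_type, "addresses": group}
        target.update(child(subtask))  # aggregation step (assumed: dictionary merge)
    return target
```

In a real deployment the calls to the child nodes would go over the network and run in parallel; the sequential loop only shows the data flow.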
Optionally, the subtask further includes a status indicator bit, which indicates whether the child node is to execute the subtask.
In a third aspect, the present invention provides a network crawling device applied to a network crawling system, the network crawling system including one host node and multiple child nodes. The device includes:
A receiving module, configured to receive a subtask sent by the host node. The subtask includes the task type of a crawling task and the website addresses in the search group corresponding to the child node. The search group includes at least one website address and is obtained by the host node dividing the at least one website address according to map-reduce and the task type of the crawling task;
A crawling module, configured to crawl according to the subtask and store the crawled data in a local storage;
A query module, configured to query the local storage to obtain a query result;
A sending module, configured to send the query result to the host node.
Optionally, the crawling module is specifically configured to: traverse and connect to the website addresses in the subtask, obtaining first website addresses, for which the connection succeeds, and second website addresses, for which the connection fails;
Obtain the links corresponding to the web data pages to be crawled in the first website address;
Traverse and connect to the link corresponding to each web data page to be crawled in the first website address, obtaining first links, for which the connection succeeds, and second links, for which the connection fails;
Filter the web page data corresponding to the first link according to the task type of the crawling task, obtaining the web page data corresponding to the first link;
Parse the web page data corresponding to the first link to obtain target crawled data;
Store the target crawled data and the corresponding first link in the local storage.
Optionally, the crawling module is specifically configured to: reconnect to the second link and determine whether the child node's connection to the second link succeeds;
If so, the child node filters the web page data corresponding to the second link according to the task type of the crawling task to obtain the web page data corresponding to the second link, parses that web page data to obtain the target crawled data, and stores the target crawled data and the corresponding second link in the local storage;
If not, the child node repeats the operation of connecting to the second link and determining whether the connection succeeds; if the number of repeated connections exceeds the first preset number, the child node stores the second link in the local storage.
Optionally, the crawling module is specifically configured to: reconnect to the second website address and determine whether the child node's connection to the second website address succeeds;
If so, the child node obtains the links corresponding to the web data pages to be crawled in the second website address;
Traverse and connect to the link corresponding to each web data page to be crawled in the second website address, obtaining third links, for which the connection succeeds, and fourth links, for which the connection fails;
Filter the web page data corresponding to the third link according to the task type of the crawling task, obtaining the web page data corresponding to the third link;
Parse the web page data corresponding to the third link to obtain the target crawled data;
Store the target crawled data and the corresponding third link in the local storage;
If not, the child node repeats the operation of connecting to the second website address and determining whether the connection succeeds; if the number of repeated connections exceeds the second preset number, the child node stores the second website address in the local storage.
Optionally, the crawling module is specifically configured to: reconnect to the fourth link and determine whether the child node's connection to the fourth link succeeds;
If so, the child node filters the web page data corresponding to the fourth link according to the task type of the crawling task to obtain the web page data corresponding to the fourth link, parses that web page data to obtain the target crawled data, and stores the target crawled data and the corresponding fourth link in the local storage;
If not, the child node repeats the operation of connecting to the fourth link and determining whether the connection succeeds; if the number of repeated connections exceeds the third preset number, the child node stores the fourth link in the local storage.
In a fourth aspect, the present invention provides a network crawling device applied to a network crawling system, the network crawling system including one host node and multiple child nodes. The device includes:
A receiving module, configured to obtain a query request input by a user and derive the task type of a crawling task from the query request, the query request corresponding to at least one website address;
A division module, configured to divide the at least one website address according to map-reduce and the task type of the crawling task, obtaining at least one search group. Each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node;
A sending module, configured to send each child node its corresponding subtask, the subtask including the task type of the crawling task and the website addresses in the corresponding search group;
The receiving module is further configured to receive the query results sent by the child nodes and aggregate them to obtain a target query result.
In the network crawling method and device provided by the present invention, the host node obtains the task type of the crawling task and the website addresses from the user's query request. The host node then divides the website addresses according to map-reduce and the task type of the crawling task, forms a subtask for each child node, and sends each subtask to its child node. Each child node crawls according to its subtask, stores the crawled data in its local storage, queries the local storage to obtain a query result, and sends the query result to the host node. The host node receives the query results sent by the child nodes and obtains the target query result. By using multiple child nodes, the invention can crawl large amounts of web page data, which lets users obtain the information they need quickly and comprehensively and also meets their varied crawling needs.
Brief description of the drawings
Fig. 1 is a schematic diagram of an application scenario of the network crawling method provided by the present invention;
Fig. 2 is a signaling flow chart of the network crawling method provided by the present invention;
Fig. 3 is flow chart one of the network crawling method provided by the present invention;
Fig. 4 is flow chart two of the network crawling method provided by the present invention;
Fig. 5 is structural diagram one of the network crawling device provided by the present invention;
Fig. 6 is structural diagram two of the network crawling device provided by the present invention.
Detailed description
Fig. 1 is a schematic diagram of an application scenario of the network crawling method provided by the present invention. As shown in Fig. 1, the network crawling system of the invention includes one host node and multiple child nodes. The host node may be one server, and the multiple child nodes may be multiple servers. The system can be applied to scenarios in which public data is collected. For example, the system can gather information such as an article's raw material production, product information, quality tracing and sales information, making the chain of information over an article's production cycle transparent so that users can understand it clearly, grasp it accurately, and carry out related work. As another example, the system can be applied to school admission information, paper publication information, and the like. The system is applicable to all trades and professions: users can obtain the public data they need without face-to-face contact or extensive manual searching, which saves their time and cost.
The technical solution of the network crawling method provided by the present invention is described in detail below with reference to the system shown in Fig. 1. Fig. 2 is a signaling flow chart of the network crawling method provided by the present invention. In this embodiment, the host node can send the subtasks corresponding to a query request to multiple child nodes, so that each child node queries whether its own local storage holds the target crawled data. Each child node then sends its query result to the host node, and the host node aggregates the query results to obtain the target query result, that is, the public data the user needs. As shown in Fig. 2, this embodiment describes in detail only the crawling process between the host node and any one child node; the process between the host node and the remaining child nodes is the same and is not repeated here. The network crawling method of this embodiment includes:
S101, the host node obtains a query request input by a user and derives the task type of a crawling task from the query request; the query request corresponds to at least one website address.
Specifically, in this embodiment the host node can analyze the query request input by the user to obtain not only the task type of the crawling task but also the websites on which the user wants the query to be carried out. For example, the task type of the crawling task may be a query for article raw materials, a query for article quality information, a query for school admission information, and so on. This embodiment does not limit the specific kinds of task type, provided that the host node can derive the task type of the crawling task from the query request. Likewise, the website addresses corresponding to the query request may, based on user experience, be the addresses of various search engines, school websites, specialised department websites and the like on which crawling is carried out. This embodiment does not limit the number or kind of website addresses, provided that the host node can obtain the website addresses from the query request.
S102, the host node divides the at least one website address according to the distributed programming framework map-reduce and the task type of the crawling task, obtaining at least one search group. Each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node.
Specifically, based on map-reduce and the task type of the crawling task, the host node can determine the number of website addresses in the user's query request and then divide the website addresses according to the number that each child node can handle, so that crawling proceeds quickly and efficiently within each child node's capacity. The number of website addresses in the search group for each child node may be set according to that child node's capacity; the groups may contain the same or different numbers of website addresses, and this embodiment does not limit this.
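A capacity-aware division of this kind can be sketched as follows. This is an illustrative sketch, not the division procedure of the invention: the list `capacities`, holding the number of addresses each child node can handle, and the choice to assign addresses in order are both assumptions.

```python
def divide_by_capacity(addresses, capacities):
    """Assign addresses to child nodes in order, at most capacities[i] for child i."""
    if sum(capacities) < len(addresses):
        raise ValueError("total capacity is smaller than the number of addresses")
    groups, start = [], 0
    for cap in capacities:
        groups.append(addresses[start:start + cap])  # a group may be shorter than cap
        start += cap
    return groups
```

For example, `divide_by_capacity(["a", "b", "c", "d", "e"], [2, 3])` gives one group of two addresses and one of three, so the groups need not be the same size.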
Further, based on map-reduce and the task type of the crawling task, the host node can also determine each child node's crawling criteria, crawling time, crawling order and so on, so that each child node carries out its subtask according to a specific execution strategy. For example, if the user needs to query the quality information of an article, the host node can determine, based on map-reduce and the task type of the crawling task, that each child node carries out its subtask using an execution strategy of searching for the article's barcode.
Further, the host node can also control the working state of each child node; this embodiment does not limit how the host node does so. Optionally, the subtask further includes a status indicator bit, which indicates whether the child node is to execute the subtask. Specifically, through the status indicator bit the host node can observe each child node's current working state in real time, and can therefore dynamically decide whether each child node stops or starts its subtask. If the host node needs to assign a subtask to a child node, it can use the status indicator bit to instruct that child node to execute the subtask; if the host node needs a child node to stop executing a subtask, it can use the status indicator bit to instruct that child node to stop.
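One possible shape for a subtask carrying such a status indicator bit is sketched below. The class name `Subtask`, the field names and the function `handle` are hypothetical; the original text only states that the subtask includes the task type, the website addresses and a status indicator bit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subtask:
    task_type: str                                        # task type of the crawling task
    addresses: List[str] = field(default_factory=list)    # website addresses of the search group
    execute: bool = True                                  # status indicator bit set by the host node


def handle(subtask: Subtask) -> str:
    """Child-side check: run the subtask only if the indicator bit says so."""
    return "running" if subtask.execute else "stopped"
```

The host node would flip `execute` and resend the subtask to start or stop the work of a given child node.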
S103, the host node sends the subtask to the child node; the subtask includes the task type of the crawling task and the website addresses in the search group.
S104, the child node crawls according to the subtask and stores the crawled data in the child node's local storage.
Specifically, because child nodes correspond one-to-one with subtasks and subtasks correspond one-to-one with search groups, the host node can send each child node its subtask, and each child node receives its own subtask. Because the subtask includes the task type of the crawling task and the website addresses, each child node can crawl the web page data at those addresses according to its execution strategy and obtain crawled data, which it stores in its own local storage. In this embodiment, each child node may clear its local storage before carrying out its subtask. This embodiment does not limit the form in which crawled data is stored in the local storage, provided that each child node can query its own local storage.
S105, the child node queries the local storage according to the subtask and obtains a query result.
S106, the child node sends the query result to the host node.
S107, the host node aggregates the query results sent by the child nodes and obtains the target query result.
Specifically, because the crawled data to be queried is stored in the child node's local storage, the child node can query that local storage according to the subtask and obtain a query result. The child node then sends the query result to the host node, and the host node aggregates the query results it receives from the child nodes to obtain the target query result, so as to give the user timely and accurate information.
Further, if the user later issues a query request of the same type as the current one, the child node can query its local storage directly without crawling the web page data at the website addresses again, which saves crawling time and speeds up the query.
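The reuse of previously crawled data can be sketched as a small cache keyed by task type. The class `ChildNode`, its `query` method and the injected `crawl` callable are assumptions made for this sketch; the original text does not describe how the local storage is indexed.

```python
class ChildNode:
    """Child node that crawls once per task type and answers later queries locally."""

    def __init__(self, crawl):
        self._crawl = crawl   # hypothetical crawl function: (task_type, addresses) -> data
        self._local = {}      # stand-in for the local storage, keyed by task type

    def query(self, task_type, addresses):
        if task_type not in self._local:
            # First request of this type: crawl and keep the result locally.
            self._local[task_type] = self._crawl(task_type, addresses)
        return self._local[task_type]
```

A second call with the same task type returns the stored data without invoking the crawl function again.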
In a specific embodiment, a user inputs to the host node a query request for article quality information. The host node receives the query request and then distributes a subtask to each child node. Because each child node's local storage holds the quality information of all articles, such as the quality inspection number, inspection organisation, inspection time, inspection personnel and inspection result, each child node can query its local storage according to its subtask and obtain a query result. Each child node then sends its query result to the host node, which aggregates the results to obtain the specific quality information of the article. Because the system makes article quality information transparent, the user can grasp it without face-to-face contact or extensive query work, which saves the user's cost and improves the user's efficiency.
In the network crawling method provided by this embodiment, the host node obtains the task type of the crawling task and the website addresses from the user's query request. The host node then divides the website addresses according to map-reduce and the task type of the crawling task, forms a subtask for each child node, and sends each subtask to its child node. Each child node crawls according to its subtask, stores the crawled data in its local storage, queries the local storage to obtain a query result, and sends the query result to the host node. The host node receives the query results sent by the child nodes and obtains the target query result. By using multiple child nodes, this embodiment can crawl large amounts of web page data, which lets users obtain the information they need quickly and comprehensively and also meets their varied crawling needs.
With reference to Fig. 3 and Fig. 4, the following describes in detail how any child node crawls according to its subtask and stores the crawled data in its local storage. Fig. 3 is flow chart one of the network crawling method provided by the present invention, and Fig. 4 is flow chart two of the network crawling method provided by the present invention. When a child node traverses and connects to the website addresses in its subtask, a connection either succeeds or fails. This embodiment therefore describes the successful case with reference to Fig. 3 and the failed case with reference to Fig. 4.
On the one hand, as shown in Fig. 3, the network crawling method of this embodiment further includes:
S201, the child node traverses and connects to the website addresses in the subtask, obtaining the first website address, for which the connection succeeds.
S202, the child node obtains the links corresponding to the web data pages to be crawled in the first website address.
S203, the child node traverses and connects to the link corresponding to each web data page to be crawled in the first website address, obtaining first links, for which the connection succeeds, and second links, for which the connection fails.
Specifically, for a first website address, for which the connection succeeded, the child node first obtains all links corresponding to the web page data to be crawled at that address, then connects to the links one by one and, according to the connection results, divides them into first links, for which the connection succeeds, and second links, for which it fails.
Further, in this embodiment the child node may handle the first links and the second links differently. For a first link, step S204 is performed; for a second link, step S207 is performed. Step S204 may be performed before step S207, step S207 may be performed before step S204, or the two may be performed at the same time; this embodiment does not limit their order.
For a first link, for which the connection succeeded, the network crawling method of this embodiment further includes:
S204, the child node filters the web page data corresponding to the first link according to the task type of the crawling task, obtaining the web page data corresponding to the first link.
S205, the child node parses the web page data corresponding to the first link to obtain target crawled data.
S206, the child node stores the target crawled data and the corresponding first link in the local storage.
Specifically, because the task type of the crawl task determines how the child node carries out the subtask, the child node can filter each item of web data corresponding to the first link: web data unrelated to the task type is discarded, and web data that is related to the task type and conforms to the applicable standard or specification is retained as the web data corresponding to the first link. For example, if the child node needs to obtain article barcodes, web data unrelated to article barcodes is filtered out. Moreover, because an article barcode has 13 digits, web data on which the article barcode has only 12 digits is also filtered out.
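The barcode example can be sketched as a simple filter. This is only an illustration under stated assumptions: the pattern `BARCODE_RE`, the function `filter_pages`, the representation of web data as a mapping from link to page text, and the sample pages are hypothetical and do not come from the patent. Relevance to the task type is approximated here by the presence of a well-formed 13-digit code.

```python
import re
from typing import Dict

# Exactly 13 consecutive digits that are not part of a longer digit run.
BARCODE_RE = re.compile(r"(?<!\d)\d{13}(?!\d)")


def filter_pages(pages: Dict[str, str]) -> Dict[str, str]:
    """Keep only pages whose text contains a 13-digit barcode."""
    return {link: text for link, text in pages.items() if BARCODE_RE.search(text)}


pages = {
    "http://shop.example/item/1": "Barcode: 6901234567892",  # 13 digits, kept
    "http://shop.example/item/2": "Barcode: 690123456789",   # 12 digits, dropped
    "http://shop.example/about": "Contact us",               # unrelated, dropped
}
kept = filter_pages(pages)
```

A different task type would swap in a different predicate; the filtering step itself stays the same.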
Further, the child node in this embodiment parses the web data corresponding to the first link and stores the target crawled data obtained by parsing together with the corresponding first link in the local storage. This lets the child node determine which link each item of target crawled data was obtained from, and, once each child node has sent its results to the host node, makes it easy for the user to look the data up.
For a second link that failed to connect, the network crawling method of this embodiment further includes:
S207: The child node reconnects to the second link and determines whether the connection succeeds. If so, step S208 is performed; if not, step S209 is performed.
S208: The child node filters each item of web data corresponding to the second link according to the task type of the crawl task, obtains the web data corresponding to the second link, parses that web data to obtain the target crawled data, and stores the target crawled data and the corresponding second link in the local storage.
The implementation of S208 is similar to that of S204, S205 and S206 in the Fig. 3 embodiment and is not repeated here.
S209: The child node repeats the operation of connecting to the second link and determining whether the connection succeeds. If the number of repeated connections exceeds a first preset number of times, the child node stores the second link in the local storage.
Specifically, a connection can fail while the child node is connecting to a link, for reasons such as network or parsing problems. Existing network crawling methods have no fault-tolerance mechanism and therefore cannot reconnect to such a link. The child node in this embodiment, by contrast, keeps trying to connect to a second link that failed, which provides more data sources for crawling the target web data and allows the user to obtain more comprehensive information.
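The bounded reconnection of steps S207 to S209 can be sketched as a retry loop. The names `reconnect`, `make_flaky_probe` and `FIRST_PRESET_TIMES`, the value 5 and the example URLs are assumptions made for illustration; the patent leaves the preset number to be chosen empirically. The sketch stops as soon as the limit is reached and then records the link as failed.

```python
from typing import Callable, List, Tuple


def reconnect(
    link: str,
    try_connect: Callable[[str], bool],
    max_retries: int,
) -> Tuple[bool, int]:
    """Retry a failed link up to max_retries times; return (connected, attempts)."""
    for attempt in range(1, max_retries + 1):
        if try_connect(link):
            return True, attempt
    return False, max_retries


def make_flaky_probe(succeed_on: int) -> Callable[[str], bool]:
    """Hypothetical probe that fails until its succeed_on-th call."""
    calls = 0

    def probe(_link: str) -> bool:
        nonlocal calls
        calls += 1
        return calls >= succeed_on

    return probe


FIRST_PRESET_TIMES = 5  # assumed value for the first preset number of times
error_links: List[str] = []  # stands in for the failed links kept in local storage

recovered, attempts = reconnect("http://a.example/2", make_flaky_probe(3), FIRST_PRESET_TIMES)
dead_ok, dead_attempts = reconnect("http://a.example/9", lambda _link: False, FIRST_PRESET_TIMES)
if not dead_ok:
    # Retries exhausted: keep the link so the failure can be inspected later.
    error_links.append("http://a.example/9")
```

A recovered link would then go through the same filter, parse and store steps as a first link.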
It should be noted that the first preset number of times can be set empirically; this embodiment does not limit it. In this embodiment, the local storage of each child node may include a success record table and an error record table. If a website address comprises multiple addresses at multiple levels, such as first-level, second-level and third-level addresses, the success record table may store the links at the different levels of the same website address, together with the corresponding target crawled data, in order of level, and the error record table may likewise list and store the links at the different levels of the same website address in order of level. If there is only one website address, the success record table directly stores the website address and its corresponding target crawled data, and the error record table may store information such as the website address, the reason for the error and the number of repeated connections to the website address, to make errors easier to locate. The reason for the error may be a network problem, a parsing problem, and so on; this embodiment does not limit it.
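One possible layout for the success record table and the error record table is sketched below using an in-memory SQLite database. The table names, column names and sample rows are assumptions chosen for illustration; the patent does not prescribe a storage engine or schema.

```python
import sqlite3

# In-memory database standing in for a child node's local storage.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE success_record (site TEXT, level INTEGER, link TEXT, data TEXT);
    CREATE TABLE error_record (site TEXT, level INTEGER, link TEXT, reason TEXT, retries INTEGER);
    """
)

site = "http://shop.example"
# Rows are inserted out of level order on purpose; ordering is applied on read.
conn.executemany(
    "INSERT INTO success_record VALUES (?, ?, ?, ?)",
    [
        (site, 2, site + "/list/item1", "6901234567892"),
        (site, 1, site + "/list", ""),
    ],
)
conn.execute(
    "INSERT INTO error_record VALUES (?, ?, ?, ?, ?)",
    (site, 2, site + "/list/item2", "network", 5),
)

# Links of one website address, listed by level.
success_links = [
    row[0]
    for row in conn.execute(
        "SELECT link FROM success_record WHERE site = ? ORDER BY level", (site,)
    )
]
errors = conn.execute("SELECT link, reason, retries FROM error_record").fetchall()
```

Keeping the error reason and retry count next to the link is what makes the later error location straightforward.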
On the other hand, as shown in Fig. 4, the network crawling method of this embodiment further includes:
S301: The child node attempts to connect to each website address in the subtask in turn, and obtains the second website address, i.e. a website address that failed to connect.
Specifically, a connection can fail while the child node is connecting to a website address, for reasons such as network or parsing problems. Existing network crawling methods have no fault-tolerance mechanism and therefore cannot reconnect to such a website address. The child node in this embodiment, by contrast, keeps trying to connect to a second website address that failed, which provides more data sources for crawling the target web data and allows the user to obtain more comprehensive information.
S302: The child node reconnects to the second website address and determines whether the connection succeeds. If so, step S303 is performed; if not, step S311 is performed.
S303: The child node obtains the links corresponding to the web data pages to be crawled in the second website address.
S304: The child node attempts to connect to each of those links in turn, and obtains the third links, which connected successfully, and the fourth links, which failed to connect.
Specifically, for a second website address (one that initially failed to connect and was then reconnected successfully), the child node in this embodiment first obtains all links corresponding to the web data to be crawled in the second website address, then connects to those links one by one, and finally, according to the result of each connection, divides the links into third links (connection succeeded) and fourth links (connection failed).
Further, the child node in this embodiment may process the third links and the fourth links differently. For a third link, this embodiment performs step S305; for a fourth link, it performs step S308. Step S305 may be performed before step S308, step S308 may be performed before step S305, or the two may be performed simultaneously; this embodiment does not limit the execution order of steps S305 and S308.
For a third link that connected successfully, the network crawling method of this embodiment further includes:
S305: The child node filters each item of web data corresponding to the third link according to the task type of the crawl task, and obtains the web data corresponding to the third link.
S306: The child node parses the web data corresponding to the third link and obtains the target crawled data.
S307: The child node stores the target crawled data and the corresponding third link in the local storage.
Specifically, because the task type of the crawl task determines how the child node carries out the subtask, the child node can filter each item of web data corresponding to the third link: web data unrelated to the task type is discarded, and web data that is related to the task type and conforms to the applicable standard or specification is retained as the web data corresponding to the third link. For example, if the child node needs to obtain article barcodes, web data unrelated to article barcodes is filtered out. Moreover, because an article barcode has 13 digits, web data on which the article barcode has only 12 digits is also filtered out.
Further, the child node in this embodiment parses the web data corresponding to the third link and stores the target crawled data obtained by parsing together with the corresponding third link in the local storage. This lets the child node determine which link each item of target crawled data was obtained from, and, once each child node has sent its results to the host node, makes it easy for the user to look the data up.
For a fourth link that failed to connect, the network crawling method of this embodiment further includes:
S308: The child node reconnects to the fourth link and determines whether the connection succeeds. If so, step S309 is performed; if not, step S310 is performed.
S309: The child node filters each item of web data corresponding to the fourth link according to the task type of the crawl task, obtains the web data corresponding to the fourth link, parses that web data to obtain the target crawled data, and stores the target crawled data and the corresponding fourth link in the local storage.
The implementation of S309 is similar to that of S305, S306 and S307 in the Fig. 4 embodiment and is not repeated here.
S310: The child node repeats the operation of connecting to the fourth link and determining whether the connection succeeds. If the number of repeated connections exceeds a third preset number of times, the child node stores the fourth link in the local storage.
Specifically, a connection can fail while the child node is connecting to a link, for reasons such as network or parsing problems. Existing network crawling methods have no fault-tolerance mechanism and therefore cannot reconnect to such a link. The child node in this embodiment, by contrast, keeps trying to connect to a fourth link that failed, which provides more data sources for crawling the target web data and allows the user to obtain more comprehensive information.
S311: The child node repeats the operation of connecting to the second website address and determining whether the connection succeeds. If the number of repeated connections exceeds a second preset number of times, the child node stores the second website address in the local storage.
Specifically, a connection can still fail while the child node is connecting to a website address, for reasons such as network or parsing problems. Existing network crawling methods have no fault-tolerance mechanism and therefore cannot reconnect to such a website address. The child node in this embodiment, by contrast, keeps trying to connect to a second website address that failed, which provides more data sources for crawling the target web data and allows the user to obtain more comprehensive information.
It should be noted that the second preset number of times and the third preset number of times can both be set empirically, and that the first, second and third preset numbers of times may be the same or different; this embodiment does not limit them. In this embodiment, the local storage of each child node may include a success record table and an error record table. If a website address comprises multiple addresses at multiple levels, such as first-level, second-level and third-level addresses, the success record table may list and store the links at the different levels of the same website address, together with the target crawled data, in order of level, and the error record table may likewise list and store the links at the different levels of the same website address in order of level. If there is only one website address, the success record table directly stores the website address and its corresponding target crawled data, and the error record table may store information such as the website address, the reason for the error and the number of repeated connections to the website address, to make errors easier to locate. The reason for the error may be a network problem, a parsing problem, and so on; this embodiment does not limit it.
Fig. 5 is a first schematic structural diagram of the network crawling device provided by the present invention. As shown in Fig. 5, the network crawling device 10 of this embodiment is applied to a network crawling system that includes one host node and multiple child nodes. The network crawling device 10 includes:
a receiving module 11, configured to receive the subtask sent by the host node, where the subtask includes the task type of the crawl task and the website addresses in the search group corresponding to the child node, the search group includes at least one website address, and the search group is obtained by the host node dividing the at least one website address according to map-reduce and the task type of the crawl task;
a crawling module 12, configured to crawl according to the subtask and store the crawled data in the local storage;
a query module 13, configured to query the local storage and obtain a query result;
a sending module 14, configured to send the query result to the host node.
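The cooperation of the four modules of device 10 can be sketched as one small class. This is an illustration only: the class and field names, the dictionary used as local storage, the keyword-based query and the injected `fetch` function are all assumptions introduced here, not details given in the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Subtask:
    task_type: str          # task type of the crawl task
    addresses: List[str]    # website addresses of this child node's search group


@dataclass
class ChildNode:
    fetch: Callable[[str], str]                              # hypothetical page fetcher
    storage: Dict[str, str] = field(default_factory=dict)    # stands in for local storage
    subtask: Optional[Subtask] = None

    def receive(self, subtask: Subtask) -> None:
        """Receiving module 11: accept the subtask from the host node."""
        self.subtask = subtask

    def crawl(self) -> None:
        """Crawling module 12: crawl each address and store the data locally."""
        for address in self.subtask.addresses:
            self.storage[address] = self.fetch(address)

    def query(self, keyword: str) -> List[str]:
        """Query module 13: look up stored pages that mention the keyword."""
        return sorted(a for a, text in self.storage.items() if keyword in text)

    def send(self, keyword: str) -> Dict[str, object]:
        """Sending module 14: build the result that would go to the host node."""
        return {"task_type": self.subtask.task_type, "result": self.query(keyword)}


pages = {
    "http://a.example": "barcode 6901234567892",
    "http://b.example": "nothing here",
}
node = ChildNode(fetch=lambda address: pages[address])
node.receive(Subtask("barcode", list(pages)))
node.crawl()
report = node.send("barcode")
```

In a real deployment `send` would transmit the result over the network; here it only returns the payload so the flow can be followed end to end.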
Optionally, the crawling module 12 is specifically configured to:
attempt to connect to each website address in the subtask in turn, and obtain the first website address, which connected successfully, and the second website address, which failed to connect;
obtain the links corresponding to the web data pages to be crawled in the first website address;
attempt to connect to each of those links in turn, and obtain the first links, which connected successfully, and the second links, which failed to connect;
filter each item of web data corresponding to the first link according to the task type of the crawl task, and obtain the web data corresponding to the first link;
parse the web data corresponding to the first link and obtain the target crawled data;
store the target crawled data and the corresponding first link in the local storage.
Optionally, the crawling module 12 is further configured to:
reconnect to the second link and determine whether the child node connects to the second link successfully;
if so, filter each item of web data corresponding to the second link according to the task type of the crawl task, obtain the web data corresponding to the second link, parse that web data to obtain the target crawled data, and store the target crawled data and the corresponding second link in the local storage;
if not, repeat the operation of connecting to the second link and determining whether the connection succeeds, and, if the number of repeated connections exceeds the first preset number of times, store the second link in the local storage.
Optionally, the crawling module 12 is further configured to:
reconnect to the second website address and determine whether the child node connects to the second website address successfully;
if so, obtain the links corresponding to the web data pages to be crawled in the second website address;
attempt to connect to each of those links in turn, and obtain the third links, which connected successfully, and the fourth links, which failed to connect;
filter each item of web data corresponding to the third link according to the task type of the crawl task, and obtain the web data corresponding to the third link;
parse the web data corresponding to the third link and obtain the target crawled data;
store the target crawled data and the corresponding third link in the local storage;
if not, repeat the operation of connecting to the second website address and determining whether the connection succeeds, and, if the number of repeated connections exceeds the second preset number of times, store the second website address in the local storage.
Optionally, the crawling module 12 is further configured to:
reconnect to the fourth link and determine whether the child node connects to the fourth link successfully;
if so, filter each item of web data corresponding to the fourth link according to the task type of the crawl task, obtain the web data corresponding to the fourth link, parse that web data to obtain the target crawled data, and store the target crawled data and the corresponding fourth link in the local storage;
if not, repeat the operation of connecting to the fourth link and determining whether the connection succeeds, and, if the number of repeated connections exceeds the third preset number of times, store the fourth link in the local storage.
The network crawling device 10 provided by this embodiment of the present invention can carry out the method embodiments above. Its implementation principle and technical effect are described in those method embodiments and are not repeated here.
Fig. 6 is a second schematic structural diagram of the network crawling device provided by the present invention. As shown in Fig. 6, the network crawling device 20 of this embodiment is applied to a network crawling system that includes one host node and multiple child nodes. The network crawling device 20 includes:
a receiving module 21, configured to obtain the query request input by the user and to obtain the task type of the crawl task according to the query request, where the query request corresponds to at least one website address;
a division module 22, configured to divide the at least one website address according to map-reduce and the task type of the crawl task to obtain at least one search group, where each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node;
a sending module 23, configured to send to each child node its corresponding subtask, where the subtask includes the task type of the crawl task and the website addresses in the corresponding search group;
the receiving module 21 being further configured to receive the query results sent by the child nodes and to aggregate the multiple query results to obtain the target query result.
The network crawling device 20 provided by this embodiment of the present invention can carry out the method embodiments above. Its implementation principle and technical effect are described in those method embodiments and are not repeated here.
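The host-node side, i.e. dividing website addresses into search groups and aggregating the child nodes' query results, can be sketched as below. The patent does not state the division rule beyond its use of map-reduce and the task type, so the round-robin split in `divide`, the dictionary-shaped subtasks and results, and all names and URLs are assumptions made for illustration.

```python
from typing import Dict, List


def divide(addresses: List[str], num_children: int) -> List[List[str]]:
    """Split addresses into at most num_children non-empty search groups (round robin)."""
    groups = [addresses[i::num_children] for i in range(num_children)]
    return [group for group in groups if group]


def build_subtasks(task_type: str, addresses: List[str], num_children: int) -> List[Dict[str, object]]:
    """One subtask per search group, each carrying the task type and its addresses."""
    return [
        {"task_type": task_type, "addresses": group}
        for group in divide(addresses, num_children)
    ]


def aggregate(results: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge the per-child query results into one target query result."""
    merged: Dict[str, str] = {}
    for result in results:
        merged.update(result)
    return merged


sites = [f"http://site{i}.example" for i in range(1, 6)]
subtasks = build_subtasks("barcode", sites, num_children=2)
target = aggregate([
    {"http://site1.example/item": "6901234567892"},
    {"http://site2.example/item": "6901234567915"},
])
```

Each subtask would be sent to one child node, and the dictionaries passed to `aggregate` stand for the query results the child nodes send back.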
A person of ordinary skill in the art will understand that all or some of the steps of the method embodiments above can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments above. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the embodiments above are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art will understand that the technical solutions described in those embodiments can still be modified, and that some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A network crawling method, applied to a network crawling system that includes one host node and multiple child nodes, characterized in that, for any child node, the method comprises:
the child node receiving the subtask sent by the host node, where the subtask includes the task type of the crawl task and the website addresses in the search group corresponding to the child node, the search group includes at least one website address, and the search group is obtained by the host node dividing the at least one website address according to the distributed programming framework map-reduce and the task type of the crawl task;
the child node crawling according to the subtask and storing the crawled data in local storage;
the child node querying the local storage, obtaining a query result, and sending the query result to the host node.
2. The method according to claim 1, characterized in that the child node crawling according to the subtask and storing the crawled data in local storage comprises:
the child node attempting to connect to each website address in the subtask in turn, and obtaining the first website address, which connected successfully, and the second website address, which failed to connect;
the child node obtaining the links corresponding to the web data pages to be crawled in the first website address;
the child node attempting to connect to each of those links in turn, and obtaining the first links, which connected successfully, and the second links, which failed to connect;
the child node filtering each item of web data corresponding to the first link according to the task type of the crawl task, and obtaining the web data corresponding to the first link;
the child node parsing the web data corresponding to the first link and obtaining the target crawled data;
the child node storing the target crawled data and the corresponding first link in the local storage.
3. The method according to claim 2, characterized in that the method further comprises:
the child node reconnecting to the second link and determining whether the child node connects to the second link successfully;
if so, the child node filtering each item of web data corresponding to the second link according to the task type of the crawl task, obtaining the web data corresponding to the second link, parsing that web data to obtain the target crawled data, and storing the target crawled data and the corresponding second link in the local storage;
if not, repeating the operation of connecting to the second link and determining whether the connection succeeds, and, if the number of repeated connections exceeds a first preset number of times, the child node storing the second link in the local storage.
4. The method according to claim 2, characterized in that the method further comprises:
the child node reconnecting to the second website address and determining whether the child node connects to the second website address successfully;
if so, the child node obtaining the links corresponding to the web data pages to be crawled in the second website address;
the child node attempting to connect to each of those links in turn, and obtaining the third links, which connected successfully, and the fourth links, which failed to connect;
the child node filtering each item of web data corresponding to the third link according to the task type of the crawl task, and obtaining the web data corresponding to the third link;
the child node parsing the web data corresponding to the third link and obtaining the target crawled data;
the child node storing the target crawled data and the corresponding third link in the local storage;
if not, repeating the operation of connecting to the second website address and determining whether the connection succeeds, and, if the number of repeated connections exceeds a second preset number of times, the child node storing the second website address in the local storage.
5. The method according to claim 4, characterized in that the method further comprises:
the child node reconnecting to the fourth link and determining whether the child node connects to the fourth link successfully;
if so, the child node filtering each item of web data corresponding to the fourth link according to the task type of the crawl task, obtaining the web data corresponding to the fourth link, parsing that web data to obtain the target crawled data, and storing the target crawled data and the corresponding fourth link in the local storage;
if not, repeating the operation of connecting to the fourth link and determining whether the connection succeeds, and, if the number of repeated connections exceeds a third preset number of times, the child node storing the fourth link in the local storage.
6. The method according to claim 1, characterized in that the subtask further includes a status indication bit, the status indication bit being used to indicate whether the child node executes the subtask.
7. A network crawling method, applied to a network crawling system that includes one host node and multiple child nodes, characterized in that the method comprises:
the host node obtaining the query request input by the user and obtaining the task type of the crawl task according to the query request, where the query request corresponds to at least one website address;
the host node dividing the at least one website address according to map-reduce and the task type of the crawl task to obtain at least one search group, where each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node;
the host node sending to each child node its corresponding subtask, where the subtask includes the task type of the crawl task and the website addresses in the corresponding search group;
the host node receiving the query results sent by the child nodes and aggregating the multiple query results to obtain the target query result.
8. The method according to claim 7, characterized in that the subtask further includes a status indication bit, the status indication bit being used to indicate whether the child node executes the subtask.
9. A network crawling device, applied to a network crawling system that includes one host node and multiple child nodes, characterized in that the device comprises:
a receiving module, configured to receive the subtask sent by the host node, where the subtask includes the task type of the crawl task and the website addresses in the search group corresponding to the child node, the search group includes at least one website address, and the search group is obtained by the host node dividing the at least one website address according to map-reduce and the task type of the crawl task;
a crawling module, configured to crawl according to the subtask and store the crawled data in local storage;
a query module, configured to query the local storage and obtain a query result;
a sending module, configured to send the query result to the host node.
10. A network crawling device, applied to a network crawling system that includes one host node and multiple child nodes, characterized in that the device comprises:
a receiving module, configured to obtain the query request input by the user and to obtain the task type of the crawl task according to the query request, where the query request corresponds to at least one website address;
a division module, configured to divide the at least one website address according to map-reduce and the task type of the crawl task to obtain at least one search group, where each search group includes at least one website address, each search group corresponds to one subtask, and each subtask corresponds to one child node;
a sending module, configured to send to each child node its corresponding subtask, where the subtask includes the task type of the crawl task and the website addresses in the corresponding search group;
the receiving module being further configured to receive the query results sent by the child nodes and to aggregate the multiple query results to obtain the target query result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710571635.0A CN107423382A (en) | 2017-07-13 | 2017-07-13 | network crawling method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107423382A true CN107423382A (en) | 2017-12-01 |
Family
ID=60426478
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710571635.0A Pending CN107423382A (en) | 2017-07-13 | 2017-07-13 | network crawling method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107423382A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103310012A (en) * | 2013-07-02 | 2013-09-18 | 北京航空航天大学 | Distributed web crawler system |
CN103455597A (en) * | 2013-09-03 | 2013-12-18 | 山东省计算中心 | Distributed information hiding detection method facing mass web images |
CN104537005A (en) * | 2014-12-15 | 2015-04-22 | 北京国双科技有限公司 | Data processing method and device for webpage crawling |
WO2015145455A1 (en) * | 2014-03-28 | 2015-10-01 | Hewlett-Packard Development Company, L.P. | Resource directory |
US9177061B2 (en) * | 2007-08-29 | 2015-11-03 | Enpulz, Llc | Search engine with geographical verification processing |
CN105426407A (en) * | 2015-11-02 | 2016-03-23 | 浪潮软件集团有限公司 | Web data acquisition method based on content analysis |
CN106339385A (en) * | 2015-07-08 | 2017-01-18 | 阿里巴巴集团控股有限公司 | System for crawling webpages, method for distributing webpage crawling nodes and method for crawling webpages |
CN106682041A (en) * | 2015-11-11 | 2017-05-17 | 北京国双科技有限公司 | Method and device for detecting webpage broken link |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109088908A (en) * | 2018-06-06 | 2018-12-25 | 武汉酷犬数据科技有限公司 | Network-oriented distributed general-purpose data collection method and system |
CN109033269A (en) * | 2018-07-10 | 2018-12-18 | 卓源信息科技股份有限公司 | Distributed method for crawling regional talent supply-and-demand topic data |
CN110297962A (en) * | 2019-06-28 | 2019-10-01 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN110297962B (en) * | 2019-06-28 | 2021-08-24 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104951399B (en) | Software testing system and method | |
CN107885777A (en) | Control method and system for crawling web data based on collaborative crawlers | |
CN104503891B (en) | Method and apparatus for online monitoring of JVM threads | |
CN107203424A (en) | Method and apparatus for scheduling deep learning jobs in a distributed cluster | |
CN103310012A (en) | Distributed web crawler system | |
CN107423382A (en) | Network crawling method and device | |
CN107145556B (en) | Universal distributed acquisition system | |
CN106776693A (en) | Website data acquisition method and device | |
CN106844730A (en) | Display method and device for file content | |
Christensen | Next-generation catalogues: what do users think | |
CN109637238A (en) | Exercise generation method, device, equipment and storage medium | |
CN104424188A (en) | System and method for updating obtained webpage data | |
Najadat et al. | Evaluating Jordanian universities' websites based on data envelopment analysis | |
Stanford | Map your knowledge strategy | |
CN115422427A (en) | Employment skill requirement analysis system | |
Murali et al. | Crowdsourcing for disaster relief: A multi-platform model | |
CN114912538A (en) | Information push model training method, information push method, device and equipment | |
CN107870824A (en) | Method and device for inspecting a component | |
Cazares et al. | A Training Web Platform to Improve Cognitive Skills for Phishing Attacks Detection | |
CN106372071A (en) | Method and device for acquiring information of data warehouse | |
CN109948939A (en) | Credit evaluation system for industrial solid waste supervision subjects | |
Lin | Information visualization from the perspective of big data analysis and fusion | |
CN109658088A (en) | Browser-based multi-platform account association method, apparatus, and browser | |
Xiang et al. | Distributed University Timetabling with Multiply Sectioned Constraint Networks. | |
CN102945245B (en) | Data configuration method, data configuration device and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171201 |