CN106528567A - Method and device for updating web crawler cluster information - Google Patents

Method and device for updating web crawler cluster information

Info

Publication number
CN106528567A
Authority
CN
China
Prior art keywords
target
link
local
broadcast
crawls
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510579940.5A
Other languages
Chinese (zh)
Other versions
CN106528567B (en
Inventor
崔志伸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201510579940.5A priority Critical patent/CN106528567B/en
Publication of CN106528567A publication Critical patent/CN106528567A/en
Application granted granted Critical
Publication of CN106528567B publication Critical patent/CN106528567B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for updating web crawler cluster information. Each web crawler in a web crawler cluster is equipped with a local inspector. The method comprises the steps: a target local inspector inquires whether a target crawling link exists in the target local inspector according to a message sent by the web crawler corresponding to the target local inspector, wherein the target crawling link is carried in the message; and when inquiring that the target crawling link does not exist, the target local inspector stores the target crawling link and sends a broadcast carrying the target crawling link to other local inspectors so that other local inspectors update a crawling link according to the broadcast. According to the method and the device for updating the web crawler cluster information, the technical problem of relatively low crawling efficiency of the web crawlers in the related technology is solved.

Description

Method and device for updating web crawler cluster information
Technical Field
The present application relates to the field of Internet crawlers, and in particular to a method and device for updating web crawler cluster information.
Background
When crawling websites, a web crawler cluster needs to filter out duplicate links so that the same page is not crawled more than once. While the web crawlers crawl pages, the links that have already been crawled are stored in an inspector that is used to filter duplicate pages. So that every crawler in the cluster holds, as far as possible, an identical inspector at all times and duplicate pages are not crawled again, the inspectors need to be updated synchronously.
Existing schemes deploy a single unified inspector in the cluster, and all web crawlers access that same inspector to exclude duplicate pages. However, this forces all web crawlers in the cluster to compete for the same inspector resource: every time a web crawler crawls a page, the inspector must check whether the link is a duplicate, which makes the crawling efficiency of the web crawlers relatively low.
No effective solution to the above problem has yet been proposed.
Summary of the Invention
The embodiments of the present application provide a method and device for updating web crawler cluster information, so as to at least solve the technical problem in the related art that the crawling efficiency of web crawlers is relatively low.
According to one aspect of the embodiments of the present application, a method for updating web crawler cluster information is provided. Each web crawler in the web crawler cluster is equipped with a local inspector. A target local inspector queries, according to a message sent by its corresponding web crawler, whether a target crawling link exists in the target local inspector, wherein the message carries the target crawling link; when the query finds that the target crawling link does not exist, the target local inspector saves the target crawling link and sends a broadcast carrying the target crawling link to the other local inspectors, so that the other local inspectors update their crawling links according to the broadcast.
According to another aspect of the embodiments of the present application, a device for updating web crawler cluster information is also provided. Each web crawler in the web crawler cluster is equipped with a local inspector, and the device includes: a query unit, configured to query, according to a message sent by the web crawler corresponding to a target local inspector, whether a target crawling link exists in the target local inspector, wherein the message carries the target crawling link; and a broadcast unit, configured to, when the query finds that the target crawling link does not exist, save the target crawling link and send a broadcast carrying the target crawling link to the other local inspectors, so that the other local inspectors update their crawling links according to the broadcast.
In the embodiments of the present application, the target local inspector queries, according to the message sent by its corresponding web crawler, whether the target crawling link exists in the target local inspector, wherein the message carries the target crawling link; when the query finds that the target crawling link does not exist, the target local inspector saves it and sends a broadcast carrying it to the other local inspectors, so that they update their crawling links according to the broadcast. In this way, each web crawler filters duplicate target crawling links through its own local inspector, which improves crawling efficiency. Meanwhile, each local inspector both receives broadcasts and sends broadcasts to synchronize the information on links that have been crawled, so the local inspectors in the web crawler cluster hold consistent information, which also ensures that different crawlers do not repeatedly crawl the same link. When multiple crawlers perform crawling tasks at the same time, both relatively high crawling efficiency and relatively high accuracy are ensured, thereby solving the technical problem in the related art that the crawling efficiency of web crawlers is relatively low.
Brief Description of the Drawings
The drawings described herein are provided for further understanding of the present application and form a part of it. The exemplary embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for updating web crawler cluster information according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an optional web crawler cluster structure according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the device for updating web crawler cluster information according to an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings of the present application are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
According to the embodiments of the present application, a method embodiment of a method for updating web crawler cluster information is provided. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
Fig. 1 is a flowchart of the method for updating web crawler cluster information according to an embodiment of the present application. Each web crawler in the web crawler cluster is equipped with a local inspector. As shown in Fig. 1, the method includes the following steps:
Step S102: the target local inspector queries, according to a message sent by its corresponding web crawler, whether a target crawling link exists in the target local inspector, wherein the message carries the target crawling link.
Step S104: when the query finds that the target crawling link does not exist, the target local inspector saves the target crawling link and sends a broadcast carrying the target crawling link to the other local inspectors, so that the other local inspectors update their crawling links according to the broadcast.
Each web crawler in the web crawler cluster is equipped with a local inspector, and the target local inspector may be the local inspector of any web crawler in the cluster. When the target local inspector finds that a link has not been crawled, the corresponding web crawler can crawl the link, and the target local inspector sends, by broadcast, a message that the link has been crawled. The other local inspectors that receive the broadcast store the link, so that when their corresponding web crawlers crawl, the link is filtered out and the same link is not crawled again. Because the local inspector of every web crawler in the cluster can receive the broadcast, the local inspectors can synchronously update the information they store locally. In this embodiment, broadcasting lets multiple local inspectors update their information synchronously, so whichever local inspector is used in the cluster to filter duplicate links, duplicate links are filtered out accurately. Because each web crawler has its own local inspector and uses it to check for duplicate links, the crawlers do not have to contend for a single inspector, which improves the efficiency of filtering duplicate links and therefore the crawling efficiency of the web crawlers. Because crawled links are stored in every local inspector in the cluster, each crawler's filtering through its own local inspector is accurate; that is, the accuracy of filtering duplicate links improves along with the crawling efficiency, achieving accurate and efficient crawling.
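Steps S102 and S104 can be illustrated with a minimal Python sketch. The class name LocalInspector, its methods, and the in-memory set are assumptions made for illustration only; the application does not specify a data structure or transport, and here the broadcast is modeled as a direct call to each peer.

```python
class LocalInspector:
    """Hypothetical local inspector: one per crawler, holding crawled links."""

    def __init__(self, name):
        self.name = name
        self.seen = set()   # links known to have been crawled (assumed storage)
        self.peers = []     # the other local inspectors in the cluster

    def handle_message(self, link):
        """Step S102: query the local store. Step S104: save and broadcast if new."""
        if link in self.seen:
            return False            # link already exists locally
        self.seen.add(link)         # save the target crawling link
        for peer in self.peers:     # broadcast, modeled as direct calls
            peer.on_broadcast(link)
        return True                 # link was new

    def on_broadcast(self, link):
        """Update the local store from a broadcast sent by another inspector."""
        self.seen.add(link)


a, b, c = LocalInspector("a"), LocalInspector("b"), LocalInspector("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]

first = a.handle_message("www.abcdefg.com")   # new link at inspector a
second = b.handle_message("www.abcdefg.com")  # b already learned it from a
```

In this sketch each lookup touches only the crawler's own inspector, which is the property the description relies on to avoid contention for a shared inspector.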
Optionally, after the target local inspector queries, according to the message sent by its corresponding web crawler, whether the target crawling link exists in the target local inspector, the method further includes: when the query finds that the target crawling link does not exist, the target local inspector sends an instruction permitting the crawl to its corresponding web crawler, so that the web crawler crawls the target crawling link; when the query finds that the target crawling link exists, the target local inspector sends an instruction to abandon the crawl to its corresponding web crawler, so that the web crawler abandons crawling the target crawling link.
The target local inspector checks whether it has stored the target crawling link. If the link is found, it has already been crawled and does not need to be crawled again, so the corresponding web crawler is notified not to crawl it. If the link is not found, it has not been crawled and may be crawled, so the corresponding web crawler is notified to crawl it. Because the inspector is queried before each crawl, the same target crawling link is not crawled repeatedly. Because the crawling-link information in every local inspector in the cluster is synchronized, each web crawler can query its own local inspector to learn whether a target crawling link has been crawled, which avoids contention for a single inspector, improves query efficiency and, overall, improves crawling efficiency.
As shown in Fig. 2, when web crawler A is to crawl the target crawling link www.abcdefg.com, it looks up the link in local inspector a. If the link is not found in local inspector a, web crawler A crawls www.abcdefg.com. If the link is found in local inspector a, the link is determined to have been crawled already and web crawler A abandons crawling it, which avoids crawling the same link repeatedly.
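The permit/abandon exchange between a crawler and its local inspector might look like the following sketch. The Instruction enum, the inspect and crawl_all functions, and the fetch callback are hypothetical names introduced here; the broadcast to other inspectors is deliberately left out so that only the instruction logic is shown.

```python
from enum import Enum


class Instruction(Enum):
    ALLOW = "allow"      # the link was not found, so the crawler may crawl it
    ABANDON = "abandon"  # the link was found, so the crawler must skip it


def inspect(seen, link):
    """Hypothetical inspector query: decide which instruction to return."""
    if link in seen:
        return Instruction.ABANDON
    seen.add(link)       # remember the link so later queries find it
    return Instruction.ALLOW


def crawl_all(links, seen, fetch):
    """Hypothetical crawler loop: ask the inspector before every fetch."""
    fetched = []
    for link in links:
        if inspect(seen, link) is Instruction.ALLOW:
            fetched.append(fetch(link))
    return fetched


seen_links = set()
pages = crawl_all(
    ["www.abcdefg.com", "www.abcdefg.com"],
    seen_links,
    lambda link: "page:" + link,  # stand-in for a real page download
)
```

The second occurrence of www.abcdefg.com receives the abandon instruction, so only one page is fetched.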
Specifically, the web crawler cluster further includes a broadcast module, and the target local inspector sending a broadcast carrying the target crawling link to the other local inspectors includes: the target local inspector sends crawl information carrying the target crawling link to the broadcast module, so that the broadcast module generates a broadcast according to the crawl information and delivers the broadcast to the other local inspectors that subscribe to it. Local inspectors send broadcasts through the broadcast module in the cluster and also receive the broadcasts it sends, so all local inspectors in the cluster are updated synchronously. The other web crawlers in the cluster can receive the broadcast and record the target crawling link, so that the crawled links are stored in each web crawler's own local inspector.
For example, as shown in Fig. 2, the web crawler cluster includes web crawler A, web crawler B, web crawler C, ..., web crawler N, whose local inspectors are local inspector a, local inspector b, local inspector c, ..., local inspector n respectively. The cluster also includes broadcast module X, and every web crawler that subscribes to the broadcast can receive the broadcasts sent by broadcast module X. When web crawler A is to crawl the target crawling link www.abcdefg.com, it looks up the link in local inspector a; the link is not found there, so web crawler A crawls www.abcdefg.com. Local inspector a sends broadcast module X the information that www.abcdefg.com has been crawled, and broadcast module X generates a broadcast carrying www.abcdefg.com. Local inspector b, local inspector c, ..., local inspector n, which subscribe to the broadcast, store the www.abcdefg.com carried in the broadcast locally. When web crawler B later needs to crawl www.abcdefg.com, local inspector b finds the link, so web crawler B does not crawl it. After web crawler B crawls another link that is not stored in local inspector b, crawl information is likewise sent; the process is the same as for local inspector a and is not repeated here.
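The Fig. 2 walkthrough can be approximated by a small in-process publish/subscribe sketch. BroadcastModule, subscribe, publish, check and store are assumed names, the synchronous delivery is a simplification of whatever messaging the cluster would actually use, and www.example.org is a made-up second link.

```python
class BroadcastModule:
    """Hypothetical stand-in for broadcast module X."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, inspector):
        self.subscribers.append(inspector)

    def publish(self, sender, link):
        # Deliver the crawled link to every subscriber except the sender.
        for inspector in self.subscribers:
            if inspector is not sender:
                inspector.store(link)


class LocalInspector:
    """Hypothetical local inspector that subscribes on creation."""

    def __init__(self, module):
        self.links = set()
        self.module = module
        module.subscribe(self)

    def check(self, link):
        """Return True if the crawler may crawl the link, False otherwise."""
        if link in self.links:
            return False
        self.links.add(link)
        self.module.publish(self, link)  # report the crawl to the module
        return True

    def store(self, link):
        self.links.add(link)


x = BroadcastModule()
a, b, c = LocalInspector(x), LocalInspector(x), LocalInspector(x)

a_may_crawl = a.check("www.abcdefg.com")   # not yet known to inspector a
b_may_crawl = b.check("www.abcdefg.com")   # already stored by the broadcast
b_other = b.check("www.example.org")       # a link new to inspector b
```

After the last call, inspectors a and c also hold www.example.org, mirroring the statement that B's process is the same as A's.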
Optionally, the target local inspector sending a broadcast carrying the target crawling link to the other local inspectors, so that the other local inspectors update their crawling links according to the broadcast, includes: the local inspector sends the broadcast carrying the target crawling link to the other local inspectors, so that the other local inspectors receive the broadcast and save the target crawling link carried in it.
Because each crawler in the web crawler cluster has its own local inspector, the cluster can be extended by adding a crawler together with its local inspector, or changed by removing a crawler together with its local inspector. When a crawler and its local inspector are added, the new local inspector only needs to subscribe to the broadcasts of the broadcast module in order to receive the update information the broadcast module sends, which keeps the update information of the local inspectors synchronized. The information stored in the local inspectors is therefore consistent, and neither adding nor removing a local inspector affects how the remaining local inspectors in the cluster filter duplicate links, nor does it affect the accuracy with which the web crawlers crawl links. Because each crawler corresponds to one local inspector, adding or removing a crawler and its local inspector as a pair does not reduce the crawling efficiency of the other crawlers.
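Scaling the cluster by subscribing or unsubscribing an inspector could be sketched as follows. The unsubscribe method, the class names and the link www.example.org are assumptions; the sketch also does not backfill links crawled before a new inspector subscribed, since the application does not describe that step.

```python
class BroadcastModule:
    """Hypothetical broadcast module with a dynamic subscriber set."""

    def __init__(self):
        self.subscribers = set()

    def subscribe(self, inspector):
        self.subscribers.add(inspector)

    def unsubscribe(self, inspector):
        self.subscribers.discard(inspector)

    def publish(self, sender, link):
        # Deliver the link to every current subscriber except the sender.
        for inspector in list(self.subscribers):
            if inspector is not sender:
                inspector.links.add(link)


class LocalInspector:
    """Hypothetical local inspector holding only its set of crawled links."""

    def __init__(self):
        self.links = set()


x = BroadcastModule()
a, b = LocalInspector(), LocalInspector()
x.subscribe(a)
x.subscribe(b)

n = LocalInspector()   # a newly added crawler's inspector
x.subscribe(n)         # subscribing is the only step needed to join

a.links.add("www.abcdefg.com")
x.publish(a, "www.abcdefg.com")

x.unsubscribe(b)       # crawler B and inspector b leave the cluster
a.links.add("www.example.org")
x.publish(a, "www.example.org")
```

Inspectors a and n stay consistent after b leaves, and b simply stops receiving updates.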
With the above embodiment, each web crawler filters duplicate target crawling links through its own local inspector, which improves crawling efficiency. Meanwhile, each local inspector both receives broadcasts and sends broadcasts to synchronize the information on links that have been crawled, so the local inspectors in the web crawler cluster hold consistent information, which also ensures that different crawlers do not repeatedly crawl the same link. When multiple crawlers perform crawling tasks at the same time, both relatively high crawling efficiency and relatively high accuracy are ensured.
According to the embodiments of the present application, a device embodiment of a device for updating web crawler cluster information is also provided. Each web crawler in the web crawler cluster is equipped with a local inspector. The device can execute the updating method described above, and the updating method described above can also be executed by the device.
As shown in Fig. 3, the device for updating web crawler cluster information includes: a query unit 10, configured to query, according to a message sent by the web crawler corresponding to the target local inspector, whether a target crawling link exists in the target local inspector, wherein the message carries the target crawling link; and a broadcast unit 30, configured to, when the query finds that the target crawling link does not exist, save the target crawling link and send a broadcast carrying the target crawling link to the other local inspectors, so that the other local inspectors update their crawling links according to the broadcast.
Each web crawler in the web crawler cluster is equipped with a local inspector. When a local inspector determines that a link has not been crawled, the link can be crawled, and a message that the link has been crawled is sent by broadcast. The local inspectors that receive the broadcast store the link, so that when their corresponding web crawlers crawl, the link is filtered out and the same link is not crawled again. Because the local inspector of every web crawler in the cluster can receive the broadcast, the local inspectors can synchronously update the information they store locally. In this embodiment, broadcasting lets multiple local inspectors update their information synchronously, so whichever local inspector a web crawler uses to filter duplicate links, duplicate links are filtered out accurately. Because each web crawler has its own local inspector and uses it to check for duplicate links, the crawlers do not have to contend for a single inspector, which improves the efficiency of filtering duplicate links and therefore the crawling efficiency of the web crawlers. Because crawled links are stored in every local inspector in the cluster, each crawler's filtering through its own local inspector is accurate; that is, the accuracy of filtering duplicate links improves along with the crawling efficiency, achieving accurate and efficient crawling.
Optionally, the device further includes: a first sending unit, configured to, after the target local inspector queries, according to the message sent by its corresponding web crawler, whether the target crawling link exists in the target local inspector, and when the query finds that the target crawling link does not exist, send an instruction permitting the crawl to the web crawler corresponding to the target local inspector, so that the web crawler crawls the target crawling link; and a second sending unit, configured to, when the query finds that the target crawling link exists, send an instruction to abandon the crawl to the web crawler corresponding to the target local inspector, so that the web crawler abandons crawling the target crawling link.
The target local inspector checks whether it has stored the target crawling link. If the link is found, it has already been crawled and does not need to be crawled again, so the corresponding web crawler is notified not to crawl it. If the link is not found, it has not been crawled and may be crawled, so the corresponding web crawler is notified to crawl it. Because the inspector is queried before each crawl, the same target crawling link is not crawled repeatedly. Because the crawling-link information in every local inspector in the cluster is synchronized, each web crawler can query its own local inspector to learn whether a target crawling link has been crawled, which avoids contention for a single inspector, improves query efficiency and, overall, improves crawling efficiency.
As shown in Fig. 2, when web crawler A is to crawl the target crawling link www.abcdefg.com, it looks up the link in local inspector a. If the link is not found in local inspector a, web crawler A crawls www.abcdefg.com. If the link is found in local inspector a, the link is determined to have been crawled already and web crawler A abandons crawling it, which avoids crawling the same link repeatedly.
Specifically, the web crawler cluster further includes a broadcast module, and the broadcast unit includes: a sending module, configured to send crawl information carrying the target crawling link to the broadcast module, so that the broadcast module generates a broadcast according to the crawl information and delivers the broadcast to the other local inspectors that subscribe to it.
Local inspectors send broadcasts through the broadcast module in the web crawler cluster and also receive the broadcasts it sends, so all local inspectors in the cluster are updated synchronously. The other web crawlers in the cluster can receive the broadcast and record the target crawling link, so that the crawled links are stored in each web crawler's own local inspector.
For example, as shown in Fig. 2, the web crawler cluster includes web crawler A, web crawler B, web crawler C, ..., web crawler N, whose local inspectors are local inspector a, local inspector b, local inspector c, ..., local inspector n respectively. The cluster also includes broadcast module X, and every web crawler that subscribes to the broadcast can receive the broadcasts sent by broadcast module X. When web crawler A is to crawl the target crawling link www.abcdefg.com, it looks up the link in local inspector a; the link is not found there, so web crawler A crawls www.abcdefg.com. Local inspector a sends broadcast module X the information that www.abcdefg.com has been crawled, and broadcast module X generates a broadcast carrying www.abcdefg.com. Local inspector b, local inspector c, ..., local inspector n, which subscribe to the broadcast, store the www.abcdefg.com carried in the broadcast locally. When web crawler B later needs to crawl www.abcdefg.com, local inspector b finds the link, so web crawler B does not crawl it. After web crawler B crawls another link that is not stored in local inspector b, crawl information is likewise sent; the process is the same as for local inspector a and is not repeated here.
Optionally, the broadcast unit is further configured to send the broadcast carrying the target crawling link to the other local inspectors, so that the other local inspectors receive the broadcast and save the target crawling link carried in it.
Because each crawler in the web crawler cluster has its own local inspector, the cluster can be extended by adding a crawler together with its local inspector, or changed by removing a crawler together with its local inspector. When a crawler and its local inspector are added, the new local inspector only needs to subscribe to the broadcasts of the broadcast module in order to receive the update information the broadcast module sends, which keeps the update information of the local inspectors synchronized. The information stored in the local inspectors is therefore consistent, and neither adding nor removing a local inspector affects how the remaining local inspectors in the cluster filter duplicate links, nor does it affect the accuracy with which the web crawlers crawl links. Because each crawler corresponds to one local inspector, adding or removing a crawler and its local inspector as a pair does not reduce the crawling efficiency of the other crawlers.
With the above embodiment, each web crawler filters duplicate target crawling links through its own local inspector, which improves crawling efficiency. Meanwhile, each local inspector both receives broadcasts and sends broadcasts to synchronize the information on links that have been crawled, so the local inspectors in the web crawler cluster hold consistent information, which also ensures that different crawlers do not repeatedly crawl the same link. When multiple crawlers perform crawling tasks at the same time, both relatively high crawling efficiency and relatively high accuracy are ensured.
Above-mentioned the embodiment of the present application sequence number is for illustration only, does not represent the quality of embodiment.
In above-described embodiment of the application, the description to each embodiment all emphasizes particularly on different fields, and does not have in certain embodiment The part of detailed description, may refer to the associated description of other embodiment.
In several embodiments provided herein, it should be understood that disclosed technology contents, other can be passed through Mode realize.Wherein, device embodiment described above is only schematic, and the division of such as unit can Think a kind of division of logic function, when actually realizing, there can be other dividing mode, such as multiple units or component can To combine or be desirably integrated into another system, or some features can be ignored, or not perform.It is another, it is shown The coupling each other shown or discuss or direct-coupling or communication connection can be by some interfaces, unit or module INDIRECT COUPLING or communication connection, can be electrical or other forms.
The unit as separating component explanation can be or may not be it is physically separate, it is aobvious as unit The part for showing can be or may not be physical location, you can local to be located at one, or can also be distributed to On multiple units.Some or all of unit therein can be selected according to the actual needs to realize this embodiment scheme Purpose.
In addition, each functional unit in the application each embodiment can be integrated in a processing unit, it is also possible to It is that unit is individually physically present, it is also possible to which two or more units are integrated in a unit.It is above-mentioned integrated Unit both can be realized in the form of hardware, it would however also be possible to employ the form of SFU software functional unit is realized.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present application.

Claims (8)

1. A method for updating web crawler cluster information, characterized in that each web crawler in the web crawler cluster is provided with a local detector, the method comprising:
a target local detector querying, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link;
when the query finds that the target crawl link does not exist, the target local detector saving the target crawl link and sending a broadcast carrying the target crawl link to other local detectors, so that the other local detectors update their crawl links according to the broadcast.
2. The method according to claim 1, characterized in that, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, the method further comprises:
when the query finds that the target crawl link does not exist, the target local detector sending an instruction permitting crawling to its corresponding web crawler, so that the web crawler crawls the target crawl link;
when the query finds that the target crawl link exists, the target local detector sending an instruction to abandon crawling to its corresponding web crawler, so that the web crawler abandons crawling the target crawl link.
3. The method according to claim 1, characterized in that the web crawler cluster further comprises a broadcast module, and the target local detector sending the broadcast carrying the target crawl link to the other local detectors comprises:
the target local detector sending crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast according to the crawl information and broadcasts it to the other local detectors that subscribe to the broadcast.
4. The method according to claim 1, characterized in that the target local detector sending the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawl links according to the broadcast, comprises:
the local detector sending the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link carried by the broadcast.
5. A device for updating web crawler cluster information, characterized in that each web crawler in the web crawler cluster is provided with a local detector, the device comprising:
a query unit, configured to query, according to a message sent by the web crawler corresponding to a target local detector, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link;
a broadcast unit, configured to, when the query finds that the target crawl link does not exist, save the target crawl link and send a broadcast carrying the target crawl link to other local detectors, so that the other local detectors update their crawl links according to the broadcast.
6. The device according to claim 5, characterized in that the device further comprises:
a first sending unit, configured to, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, and when the query finds that the target crawl link does not exist, send an instruction permitting crawling to the web crawler corresponding to the target local detector, so that the web crawler crawls the target crawl link;
a second sending unit, configured to, when the query finds that the target crawl link exists, send an instruction to abandon crawling from the target local detector to the web crawler corresponding to the target local detector, so that the web crawler abandons crawling the target crawl link.
7. The device according to claim 5, characterized in that the web crawler cluster further comprises a broadcast module, and the broadcast unit comprises:
a sending module, configured to send crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast according to the crawl information and broadcasts it to the other local detectors that subscribe to the broadcast.
8. The device according to claim 7, characterized in that the broadcast unit is further configured to send the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link carried by the broadcast.
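The variant in claims 2, 3, 6, and 7, in which a separate broadcast module relays crawl information to subscribed detectors and the detector answers its crawler with a permit or abandon instruction, might be sketched as follows. This is an illustration only: `BroadcastModule`, `Detector`, `handle_message`, `receive_broadcast`, and the `ALLOW` / `ABANDON` constants are hypothetical names, and the in-process subscriber list stands in for whatever publish/subscribe mechanism an implementation would actually use.

```python
ALLOW, ABANDON = "allow", "abandon"  # hypothetical instruction values


class BroadcastModule:
    """Assumed relay: forwards crawl information to subscribed detectors."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, detector):
        self.subscribers.append(detector)

    def publish(self, sender, link):
        # Generate the broadcast and deliver it to every subscriber but the sender.
        for detector in self.subscribers:
            if detector is not sender:
                detector.receive_broadcast(link)


class Detector:
    """Hypothetical local detector that subscribes to the broadcast module."""

    def __init__(self, module):
        self.links = set()
        self.module = module
        module.subscribe(self)

    def handle_message(self, link):
        """Answer the crawler's query with an instruction."""
        if link in self.links:
            return ABANDON                 # link already crawled somewhere
        self.links.add(link)               # save the target crawl link
        self.module.publish(self, link)    # send crawl information to the module
        return ALLOW

    def receive_broadcast(self, link):
        self.links.add(link)               # save the link carried by the broadcast


module = BroadcastModule()
d1, d2 = Detector(module), Detector(module)
r1 = d1.handle_message("http://example.com/a")
r2 = d2.handle_message("http://example.com/a")
r3 = d2.handle_message("http://example.com/b")
```

Routing through a module means a detector only needs to know the module, not every other detector, at the cost of one extra hop per broadcast.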
CN201510579940.5A 2015-09-11 2015-09-11 Method and device for updating web crawler cluster information Active CN106528567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510579940.5A CN106528567B (en) 2015-09-11 2015-09-11 Method and device for updating web crawler cluster information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510579940.5A CN106528567B (en) 2015-09-11 2015-09-11 Method and device for updating web crawler cluster information

Publications (2)

Publication Number Publication Date
CN106528567A (en) 2017-03-22
CN106528567B (en) 2019-11-12

Family

ID=58348122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510579940.5A Active CN106528567B (en) 2015-09-11 2015-09-11 Method and device for updating web crawler cluster information

Country Status (1)

Country Link
CN (1) CN106528567B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298633A (en) * 2011-09-08 2011-12-28 厦门市美亚柏科信息股份有限公司 Method and system for investigating repeated data in distributed mass data
CN102932448A (en) * 2012-10-30 2013-02-13 工业和信息化部电信传输研究所 Distributed network crawler URL (uniform resource locator) duplicate removal system and method
CN103067521A (en) * 2013-01-08 2013-04-24 中国科学院声学研究所 Distributed-type nodes and distributed-type system in a crawler cluster
CN103258036A (en) * 2013-05-15 2013-08-21 广州一呼百应网络技术有限公司 Distributed real-time search engine based on p2p
CN103559083A (en) * 2013-10-11 2014-02-05 北京奇虎科技有限公司 Web crawl task scheduling method and task scheduler

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113965371A (en) * 2021-10-19 2022-01-21 北京天融信网络安全技术有限公司 Task processing method, device, terminal and storage medium in website monitoring process
CN113965371B (en) * 2021-10-19 2023-08-29 北京天融信网络安全技术有限公司 Task processing method, device, terminal and storage medium in website monitoring process

Also Published As

Publication number Publication date
CN106528567B (en) 2019-11-12

Similar Documents

Publication Publication Date Title
US9443019B2 (en) Optimized web domains classification based on progressive crawling with clustering
Qin et al. Analyzing terrorist networks: A case study of the global salafi jihad network
CN104077402B (en) Data processing method and data handling system
US8959091B2 (en) Keyword assignment to a web page
CN102339320A (en) Malicious web recognition method and device
CN105608194A (en) Method for analyzing main characteristics in social media
CN105740460B (en) Web crawling recommended method and device
CN107145556B (en) Universal distributed acquisition system
CN103530336A (en) Equipment and method for identifying invalid parameters in URLs
CN104408169A (en) Multi-dimensional expression language based dimension query method and device
CN105930502B (en) System, client and method for collecting data
CN107798106A (en) A kind of URL De-weight methods in distributed reptile system
CN107203623B (en) Load balancing and adjusting method of web crawler system
CN103019860B (en) Based on disposal route and the system of collaborative filtering
CN106528567A (en) Method and device for updating web crawler cluster information
CN109145194A (en) The acquisition method and device of user behavior data
CN107193870A (en) The extracting method and system of web page contents
CN103905434A (en) Method and device for processing network data
CN107423382A (en) network crawling method and device
CN107544994B (en) Associated data processing method and device
CN103049488B (en) A kind of collaborative filtering disposal route and system
CN104915439A (en) Search result pushing method and device
CN106815248A (en) Web analytics method and device
CN106778352B (en) Multisource privacy protection method for combined release of set value data and social network data
CN105069135B (en) The data crawling method and system of the website OTA

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant