CN109388736A - Response scheduling method in crawler system - Google Patents

Info

Publication number
CN109388736A
Authority
CN
China
Prior art keywords
news
frequency
attributes
entrance
subtask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811106373.1A
Other languages
Chinese (zh)
Inventor
石松
孙志国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Truth Network Technology (beijing) Co Ltd
Original Assignee
Truth Network Technology (beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Truth Network Technology (beijing) Co Ltd filed Critical Truth Network Technology (beijing) Co Ltd
Priority to CN201811106373.1A priority Critical patent/CN109388736A/en
Publication of CN109388736A publication Critical patent/CN109388736A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

This application relates to a response scheduling method in a crawler system. The method comprises: dividing a website into multiple entry seed tasks according to its hierarchical section structure; presetting an initial collection frequency for each entry seed task according to the section level in which it sits and the amount of news published per unit time; presetting multiple news attributes and formulating a corresponding adjustment rule for each news attribute; and updating the calculated collection frequency of each entry seed task in real time according to the adjustment rules. The application assigns each entry seed an optimal collection frequency in real time, avoids the waste of service resources caused by a mismatch between information output and collection frequency, and indirectly relieves the load on the monitored website.

Description

Response scheduling method in crawler system
Technical field
This application relates to the technical field of network resource search, and in particular to a response scheduling method in a crawler system.
Background
With the arrival of the big data era, the mining and analysis of massive data has become a research hotspot, and data collection is the basis of data mining and analysis. In data collection, what matters most is that collection is real-time, accurate, and comprehensive. The real-time performance of data collection, that is, whether information is discovered in time, can directly affect how an event develops, so the frequency at which the program scans the monitored website is a key consideration when designing a crawler.
In the related art, a data collection system usually sets a single, uniform scan frequency for an entire website. During collection, this often leads to poor real-time performance when collecting data from fast-changing sections, or to wasted system resources when collecting data from slow-changing sections, and the crawler is also more likely to be detected as non-human browsing and banned. Some systems do divide a website into sections and set different collection frequencies for different sections and different time periods. However, different websites on the market, and different sections within one website, each have their own peak periods, and when the publishing volume or page views of a website change, the previously preset frequency again suffers from poor real-time performance.
Summary of the invention
To overcome, at least to some extent, the poor real-time performance or wasted system resources caused by setting a single uniform scan frequency for a website, the application provides a response scheduling method in a crawler system, comprising:
dividing a website into multiple entry seed tasks according to its hierarchical section structure;
presetting an initial collection frequency according to the section level in which each entry seed task sits and the amount of news per unit time;
presetting multiple news attributes, and formulating a corresponding adjustment rule for each news attribute;
updating the calculated collection frequency of each entry seed task in real time according to the adjustment rules.
Further, dividing the website into multiple entry seed tasks according to its hierarchical section structure comprises: the sections of the hierarchical section structure correspond one-to-one with the entry seed tasks.
Further, presetting the initial collection frequency according to the section level in which each entry seed task sits and the amount of news per unit time comprises:
presetting the amount of news W published under one section column in one hour;
presetting the number of news items n contained in each page;
calculating the initial collection frequency p1 = 1/(W/n), where p1 is the hourly collection frequency for the website, expressed as the interval in hours between collections.
Further, the multiple news attributes include:
one or more of news output volume, news read count, the level of the section in which the news appears, the news collection time period, the website response speed, whether the news is hot news, and whether the news is a first publication.
Further, formulating the corresponding adjustment rule for each news attribute comprises:
presetting a parameter for each news attribute;
determining the adjustment rule according to the change in the parameter values collected in each cycle.
Further, the adjustment rule comprises:
presetting a relation parameter value between each news attribute and the collection frequency,
and calculating the adjustment rule by means of an aggregate function.
Further, the aggregate function includes:
one or more of row count (count), average (avg), sum (sum), and maximum (max).
Further, updating the calculated collection frequency of each entry seed task in real time according to the adjustment rules comprises:
calculating an intermediate collection frequency after the first news attribute according to the adjustment rule of the first news attribute and the initial collection frequency;
based on the intermediate collection frequency, successively accumulating the adjustment rules of the other news attributes to obtain the calculated collection frequency.
The technical solution provided by the embodiments of the application can have the following beneficial effects:
The application divides a website into multiple entry seed tasks according to its hierarchical section structure and sets a collection frequency for each entry seed task separately, which avoids the poor real-time performance or wasted system resources caused by setting one uniform scan frequency for a whole website. Further, the calculated collection frequency of each entry seed task is updated in real time according to the adjustment rules, so the collection frequency adjusts automatically across time periods, avoiding the waste of service resources caused by a mismatch between news output volume and collection frequency.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the application.
Brief description of the drawings
The accompanying drawings are incorporated into and form part of this specification, show embodiments consistent with the application, and together with the specification serve to explain the principles of the application.
Fig. 1 is a flowchart of a response scheduling method in a crawler system provided by one embodiment of the application.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flowchart of a response scheduling method in a crawler system provided by one embodiment of the application.
As shown in Fig. 1, the method of this embodiment comprises:
S1: dividing a website into multiple entry seed tasks according to its hierarchical section structure;
S2: presetting an initial collection frequency according to the section level in which each entry seed task sits and the amount of news per unit time;
S3: presetting multiple news attributes, and formulating a corresponding adjustment rule for each news attribute;
S4: updating the calculated collection frequency of each entry seed task in real time according to the adjustment rules.
As an optional implementation of the invention, dividing the website into multiple entry seed tasks according to its hierarchical section structure comprises: the sections of the hierarchical section structure correspond one-to-one with the entry seed tasks.
Sections are, for example, an entertainment section, a finance section, a sports section, and so on. During collection it often happens that data from a fast-changing section, such as the entertainment section, is collected with poor real-time performance, or that collecting data from a slow-changing section, such as the finance section, wastes system resources and makes the crawler more likely to be detected as non-human browsing and banned. An entry seed is therefore set separately for each section, which avoids the poor real-time performance or wasted system resources caused by setting one uniform collection frequency for a whole site.
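The one-to-one mapping between sections and entry seed tasks can be illustrated with a minimal Python sketch. The names SeedTask and split_into_seed_tasks, the example URLs, and the 30-minute default interval are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass
class SeedTask:
    """One entry seed task; the field names are illustrative assumptions."""
    section: str             # section name, e.g. "entertainment"
    entry_url: str           # hypothetical entry (seed) URL of the section
    interval_minutes: float  # current collection interval for this task


def split_into_seed_tasks(sections, default_interval_minutes=30.0):
    """Create exactly one seed task per section (one-to-one mapping).

    `sections` maps a section name to its entry URL.
    """
    return [
        SeedTask(name, url, default_interval_minutes)
        for name, url in sections.items()
    ]
```

Because each task carries its own interval, a fast-changing section and a slow-changing section of the same site can later be scheduled independently.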
As an optional implementation of the invention, presetting an initial collection frequency according to the section level in which each entry seed task sits and the amount of news per unit time comprises:
presetting the amount of news W published under one section column in one hour;
presetting the number of news items n contained in each page;
calculating the initial collection frequency p1 = 1/(W/n), where p1 is the hourly collection frequency for the website, expressed as the interval in hours between collections.
For example, suppose a portal page publishes 100 news items in the hour from 10:00 to 11:00, each page holds 50 items, and every item should be stored within one hour. The preset collection frequency is calculated from the page data: the entry generates 100/50 = 2 pages of data per hour, so two pages must be collected per hour, and the frequency is 1/2 = 0.5 hours per collection, that is, the preset frequency is one collection every 30 minutes.
Setting an initial collection frequency value provides the basis for the subsequent calculation of the collection frequency.
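The formula p1 = 1/(W/n) and the worked example above can be sketched in Python as follows. The function name initial_interval_hours and the input validation are assumptions added for illustration.

```python
def initial_interval_hours(news_per_hour: float, items_per_page: int) -> float:
    """Return p1 = 1 / (W / n), in hours per collection.

    news_per_hour  -- W, news items published under the section in one hour
    items_per_page -- n, news items contained in each page
    """
    if news_per_hour <= 0 or items_per_page <= 0:
        raise ValueError("W and n must both be positive")
    pages_per_hour = news_per_hour / items_per_page  # W / n
    return 1.0 / pages_per_hour
```

With W = 100 and n = 50 this gives 0.5 hours per collection, which is the 30-minute interval of the example.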
As an optional implementation of the invention, the multiple news attributes include one or more of:
news output volume, news read count, the level of the section in which the news appears, the news collection time period, the website response speed, whether the news is hot news, and whether the news is a first publication;
each news attribute has its own adjustment rule for the collection frequency.
News output volume: the larger the volume of news an entry produces, the higher the collection frequency should be; otherwise data may be missed or delayed.
News read count: from the number of visits a news item receives in a given period, the number of visitors in that period can be inferred. Raising the collection frequency and concurrency during periods of heavy traffic makes the crawler less likely to be detected by the website. The read count also reflects the quality of the news a column publishes: among entries with the same output frequency, those whose news is read more should have their timeliness guaranteed first, so their collection frequency should be correspondingly higher.
Level of the section in which the seed sits: the position of an entry seed in the hierarchy of a website determines the probability that news under that entry is browsed and clicked. A home-page entry usually has higher news output and higher quality, so its collection frequency should also be higher.
News collection time period: the publishing periods of a website are directly related to human behavior, and the output of each entry differs across time periods, so the collection frequency of an entry is calculated along different time dimensions.
Website response speed: the response speed of a website in different periods reflects its load capacity in those periods. When the load capacity of a website is weak, the collection frequency is reduced, which both eases the load on the website and helps prevent the crawler from being detected by the website.
Whether the news is hot news: hot news implies high information quality and high attention, so it should be collected more promptly and at a higher frequency.
First publication or repost: a first publication usually attracts more attention than a repost, so first publications are collected at a higher frequency than reposts.
As an optional implementation of the invention, formulating the corresponding adjustment rule for each news attribute comprises:
presetting a parameter for each news attribute;
determining the adjustment rule according to the change in the parameter values collected in each cycle.
As an optional implementation of the invention, the adjustment rule comprises:
presetting a relation parameter value between each news attribute and the collection frequency,
and calculating the adjustment rule by means of an aggregate function.
For example:
the relation parameter value between website response speed and collection frequency is preset to 2, meaning that each time the response time increases by 100%, 2 minutes are added to the collection interval.
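One possible reading of this relation parameter is sketched below in Python. The text does not say whether the relation is linear or per doubling; the sketch assumes a linear relation to a baseline response time, and the names response_time_adjustment, baseline_s and current_s are hypothetical.

```python
def response_time_adjustment(baseline_s: float, current_s: float,
                             relation_param: float = 2.0) -> float:
    """Minutes to add to the collection interval (negative means subtract).

    Assumption: each additional 100% of the baseline response time adds
    `relation_param` minutes; a faster response shortens the interval.
    """
    if baseline_s <= 0:
        raise ValueError("baseline response time must be positive")
    relative_increase = (current_s - baseline_s) / baseline_s
    return relation_param * relative_increase
```

Under this assumption, a response time that rises from 1 s to 2 s adds 2 minutes to the interval, and a rise from 1 s to 3 s adds 4 minutes.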
As an optional implementation of the invention, the aggregate function includes:
one or more of row count (count), average (avg), sum (sum), and maximum (max).
For example, to calculate the response speed of a website:
the total time taken to collect 50 news items is calculated with the sum function (sum);
the average time per item, which is taken as the response speed of the website, is calculated with the average function (avg).
Calculating each news attribute with aggregate functions is simple and also quantifies the attribute, providing the basis for the subsequent calculation of the collection frequency.
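The response-speed example can be sketched with the four aggregate functions named above, using only the Python standard library. The function name summarize_fetch_times and the use of seconds as the unit are assumptions.

```python
from statistics import mean


def summarize_fetch_times(fetch_times_s):
    """Aggregate the per-item fetch times of one collection cycle.

    Returns count, sum, avg and max; "avg" is the value the text treats
    as the response speed of the website.
    """
    if not fetch_times_s:
        raise ValueError("at least one fetch time is required")
    return {
        "count": len(fetch_times_s),
        "sum": sum(fetch_times_s),
        "avg": mean(fetch_times_s),
        "max": max(fetch_times_s),
    }
```

Comparing the "avg" value of successive cycles gives the parameter change on which an adjustment rule for the response-speed attribute can be based.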
As an optional implementation of the invention, updating the calculated collection frequency of each entry seed task in real time according to the adjustment rules comprises:
calculating an intermediate collection frequency after the first news attribute according to the adjustment rule of the first news attribute and the initial collection frequency;
based on the intermediate collection frequency, successively accumulating the adjustment rules of the other news attributes to obtain the calculated collection frequency.
The intermediate collection frequency after each news attribute is applied is shown in Table 1.
Table 1: Updates to the calculated collection frequency
In Table 1, the initial collection frequency calculated from the hourly news output volume is 20 minutes per collection. The news read count increases, and the adjustment calculated by the aggregate function is -4, giving a first intermediate collection frequency of 16 minutes per collection. The level of the section in which the news appears is unchanged but the number of items increases, and the adjustment calculated by the aggregate function is -4, giving a second intermediate collection frequency of 12 minutes per collection. The news collection time period is an off-peak period, and the adjustment obtained from the preset relation parameter value between collection time period and collection frequency is +1, giving a third intermediate collection frequency of 13 minutes per collection. The website response speed slows down, and the adjustment calculated by the aggregate function is +4, giving a fourth intermediate collection frequency of 17 minutes per collection. The number of hot news items increases, and the adjustment calculated by the aggregate function is -3, giving a fifth intermediate collection frequency of 14 minutes per collection. The number of first publications decreases, and the adjustment calculated by the aggregate function is +4, giving a calculated collection frequency of 18 minutes per collection.
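The cumulative update described for Table 1 can be reproduced with a short Python sketch. The adjustment values are those stated in the text; the names TABLE_1_ADJUSTMENTS and apply_adjustments are hypothetical.

```python
from itertools import accumulate

# (news attribute, adjustment in minutes) in the order used in the text
TABLE_1_ADJUSTMENTS = [
    ("news read count", -4),
    ("section level", -4),
    ("collection time period", +1),
    ("website response speed", +4),
    ("hot news", -3),
    ("first publication", +4),
]


def apply_adjustments(initial_minutes, adjustments):
    """Return the collection interval after each successive adjustment.

    The last element is the calculated collection frequency; the earlier
    elements are the intermediate collection frequencies.
    """
    deltas = [delta for _, delta in adjustments]
    running = accumulate([initial_minutes] + deltas)
    return list(running)[1:]  # drop the initial value itself
```

Starting from the initial 20 minutes, apply_adjustments(20, TABLE_1_ADJUSTMENTS) yields the intermediate values 16, 12, 13, 17 and 14 and the final value 18, matching the sequence described above.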
In this embodiment, the website is divided into multiple entry seed tasks according to its hierarchical section structure, and a collection frequency is set for each entry seed task separately, which avoids the poor real-time performance or wasted system resources caused by setting one uniform scan frequency for a whole website. Further, the calculated collection frequency of each entry seed task is updated in real time according to the adjustment rules, so the collection frequency adjusts automatically across time periods, which avoids the waste of service resources caused by a mismatch between news output volume and collection frequency and indirectly relieves the load on the monitored website.
It will be understood that identical or similar parts of the above embodiments may refer to one another, and content not described in detail in some embodiments may be found in the same or similar content of other embodiments.
It should be noted that, in the description of the application, the terms "first", "second" and so on are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, in the description of the application, unless otherwise stated, "multiple" means at least two.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the application belong.
It should be understood that each part of the application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiment methods may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the application may be integrated into one processing module, may each exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples" and the like means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the application. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the application have been shown and described above, it will be understood that the above embodiments are exemplary and are not to be construed as limiting the application. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the application.
It should be noted that the invention is not limited to the above preferred embodiments. Anyone may derive products in various other forms under the teaching of the invention; however, regardless of any change in shape or structure, every technical solution that is identical or similar to that of the application falls within the scope of protection of the invention.

Claims (8)

1. A response scheduling method in a crawler system, characterized by comprising:
dividing a website into multiple entry seed tasks according to its hierarchical section structure;
presetting an initial collection frequency according to the section level in which each entry seed task sits and the amount of news per unit time;
presetting multiple news attributes, and formulating a corresponding adjustment rule for each news attribute;
updating the calculated collection frequency of each entry seed task in real time according to the adjustment rules.
2. The method according to claim 1, characterized in that dividing the website into multiple entry seed tasks according to its hierarchical section structure comprises: the sections of the hierarchical section structure correspond one-to-one with the entry seed tasks.
3. The method according to claim 1, characterized in that presetting the initial collection frequency according to the section level in which each entry seed task sits and the amount of news per unit time comprises:
presetting the amount of news W published under one section column in one hour;
presetting the number of news items n contained in each page;
calculating the initial collection frequency p1 = 1/(W/n), where p1 is the hourly collection frequency for the website.
4. The method according to claim 1, characterized in that the multiple news attributes include:
one or more of news output volume, news read count, the level of the section in which the news appears, the news collection time period, the website response speed, whether the news is hot news, and whether the news is a first publication.
5. The method according to claim 1, characterized in that formulating the corresponding adjustment rule for each news attribute comprises:
presetting a parameter for each news attribute;
determining the adjustment rule according to the change in the parameter values collected in each cycle.
6. The method according to claim 5, characterized in that the adjustment rule comprises:
presetting a relation parameter value between each news attribute and the collection frequency,
and calculating the adjustment rule by means of an aggregate function.
7. The method according to claim 6, characterized in that the aggregate function includes:
one or more of row count (count), average (avg), sum (sum), and maximum (max).
8. The method according to claim 1, characterized in that updating the calculated collection frequency of each entry seed task in real time according to the adjustment rules comprises:
calculating an intermediate collection frequency after the first news attribute according to the adjustment rule of the first news attribute and the initial collection frequency;
based on the intermediate collection frequency, successively accumulating the adjustment rules of the other news attributes to obtain the calculated collection frequency.
CN201811106373.1A 2018-09-21 2018-09-21 Response scheduling method in crawler system Pending CN109388736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811106373.1A CN109388736A (en) 2018-09-21 2018-09-21 Response scheduling method in crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811106373.1A CN109388736A (en) 2018-09-21 2018-09-21 Response scheduling method in crawler system

Publications (1)

Publication Number Publication Date
CN109388736A true CN109388736A (en) 2019-02-26

Family

ID=65418723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811106373.1A Pending CN109388736A (en) 2018-09-21 2018-09-21 Response scheduling method in crawler system

Country Status (1)

Country Link
CN (1) CN109388736A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753163A (en) * 2020-07-08 2020-10-09 北京鼎泰智源科技有限公司 Data acquisition method
CN112835931A (en) * 2019-11-22 2021-05-25 珠海格力电器股份有限公司 Method and device for determining data acquisition frequency
WO2024078070A1 (en) * 2022-10-14 2024-04-18 卡奥斯工业智能研究院(青岛)有限公司 Data collection resource quantity control method and apparatus, and device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605670A (en) * 2013-10-29 2014-02-26 北京奇虎科技有限公司 Method and device for determining grabbing frequency of network resource points
CN103617264A (en) * 2013-12-02 2014-03-05 北京奇虎科技有限公司 Method and device for grabbing timeliness seed page
CN105117501A (en) * 2015-10-09 2015-12-02 广州神马移动信息科技有限公司 Web crawler scheduling method and web crawler system applying same
CN105868327A (en) * 2016-03-28 2016-08-17 浪潮软件集团有限公司 Distributed web crawler capturing method based on different updating strategies
CN106126716A (en) * 2016-06-30 2016-11-16 北京奇艺世纪科技有限公司 A kind of data crawling method and device
CN107193828A (en) * 2016-03-14 2017-09-22 百度在线网络技术(北京)有限公司 Novel webpage capture method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190226