CN101635718A - Network crawler system and method for acquiring resource as well as network resource gripping device - Google Patents

Network crawler system and method for acquiring resource as well as network resource gripping device

Info

Publication number
CN101635718A
Authority
CN
China
Prior art keywords
url
formation
customization
extracting
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910091624A
Other languages
Chinese (zh)
Inventor
郑伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN200910091624A priority Critical patent/CN101635718A/en
Publication of CN101635718A publication Critical patent/CN101635718A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a network crawler system, a method for acquiring resources, and a network resource gripping device. The network crawler system comprises a user customization management unit, a control unit, and a gripping unit. A user performs customization operations through a user interface provided by the user customization management unit and saves the customization result; the control unit and the gripping unit start a crawl task and crawl resources for the configured task according to the user's customization result. The network crawler system provides a friendly user customization interface with a rich set of setting items, so the user's requirements can be met without secondary development. This improves the usability of network crawlers and lets the user search for network resources conveniently and flexibly.

Description

Network crawler system, method for acquiring resources, and network resource gripping device
Technical field
The present invention relates to network resource search technology, and in particular to a network crawler system, a method for acquiring resources on the Internet or a local area network (LAN), and a network resource gripping device.
Background art
With the growth and spread of network applications, more and more resources are placed on the network. In a network that carries a massive volume of resources, a major problem users face is how to find the required resources quickly and accurately. Existing Internet search engines, on the one hand, cannot search resources on a LAN; on the other hand, because the number of resources is so large, their indexes are not updated in time, so recently updated resources cannot be found. In addition, a search returns many results, but many of them are not the information the user wants.
Many enterprises therefore use existing open-source search engines to build their own network search engines, and the supplier of resources for such an engine is the network crawler. A network crawler is software that automatically crawls (grabs) resources on the Internet or a LAN. Besides supplying source material to search engines, network crawlers have other applications, such as periodically monitoring certain websites.
Existing network crawlers cannot meet the individual needs of different users in terms of ease of use and degree of customization. For some needs, the user must carry out secondary development before the need can be met; moreover, existing crawlers provide few setting items, so it is also hard to satisfy personalized search needs; some existing crawlers do not even have a friendly configuration interface. In short, existing network crawlers are inconvenient to use and do not allow network resources to be searched flexibly.
Summary of the invention
In view of this, the main object of the present invention is to provide a network crawler system that improves the ease of use of network crawlers and lets users search network resources conveniently and flexibly.
Another object of the present invention is to provide a method by which a network crawler system acquires resources, which improves the ease of use of network crawlers and lets users search network resources conveniently and flexibly.
A further object of the present invention is to provide a network resource gripping device that improves the ease of use of network crawlers and lets users search network resources conveniently and flexibly.
To achieve the above objects, the technical solution of the present invention is realized as follows:
A network crawler system comprises a customization management unit, a control unit, and a gripping unit (the unit that crawls resources), wherein:
the customization management unit is configured to provide a user interface through which the user performs customization operations and saves the customization result;
the control unit is configured to read the customization result produced by the customization management unit, send a task crawl notification to the gripping unit, and start the crawl task;
the gripping unit is configured to carry out crawling for the configured task.
The system further comprises a monitoring unit, configured to monitor crawling behavior, display the running status of crawl tasks, and query the results of crawl tasks.
The customization operations of the customization management unit comprise one or any combination of the following: the crawl depth, creation of crawl tasks, classification of crawled resources, the scope of crawled resources, crawl client settings, how crawled resources are saved, and crawl report settings.
The gripping unit is specifically configured to crawl network resources according to the crawl depth that the user sets in the customization operations;
wherein the gripping unit comprises a first queue and a second queue, the first queue storing the uniform resource locators (URLs) of the current depth and the second queue storing the URLs of the next depth.
The URLs of the next depth stored in the second queue are obtained while crawling the network resources at the URLs in the first queue.
The gripping unit further comprises a list for saving all URLs obtained in the current crawl process and their crawl status information.
A method by which a network crawler system acquires resources, based on the system of claim 1, comprises:
the user performs customization operations through the user interface and saves the customization result;
according to the user's customization result, the crawl task is started and crawling is carried out for the configured task.
The customization result comprises one or any combination of the following: the crawl depth, creation of crawl tasks, classification of crawled resources, the scope of crawled resources, crawl client settings, how crawled resources are saved, and crawl report settings.
The gripping unit comprises a first queue and a second queue, the first queue storing the URLs of the current depth and the second queue storing the URLs of the next depth; the customization operations include the crawl depth; and carrying out the crawling comprises:
the gripping unit takes a URL from the first queue and crawls the network resource at that URL;
the gripping unit determines, according to the configured crawl depth, whether network resources need to be crawled from the URLs in the second queue, and if so, continues crawling.
There are one or more crawl tasks; when there is more than one, the crawl tasks run in parallel and are relatively independent, and each crawl task maintains its own crawl status.
A network resource gripping device comprises a gripping unit,
the gripping unit being configured to crawl network resources according to the crawl depth set by the user;
wherein the gripping unit comprises a first queue and a second queue, the first queue storing the uniform resource locators (URLs) of the current depth and the second queue storing the URLs of the next depth.
The URLs of the next depth stored in the second queue are obtained while crawling the network resources at the URLs in the first queue.
The gripping unit further comprises a list for saving all URLs obtained in the current crawl process and their crawl status information.
As can be seen from the technical solution provided above, the network crawler system of the present invention comprises a customization management unit, a control unit, and a gripping unit. The user performs customization operations through the user interface provided by the customization management unit and saves the customization result; the control unit and the gripping unit start the crawl task and crawl for the configured task according to the user's customization result. The network crawler system of the present invention provides a friendly customization interface with a rich set of setting items, so the user's needs can be met without secondary development. This improves the ease of use of network crawlers and lets the user search network resources conveniently and flexibly.
Description of drawings
Fig. 1 is a schematic diagram of the structure of the network crawler system of the present invention;
Fig. 2 is a flow chart of the method by which the network crawler system of the present invention acquires resources;
Fig. 3 is a flow chart of an embodiment in which the network crawler system of the present invention acquires resources.
Detailed description of the embodiments
Fig. 1 is a schematic diagram of the structure of the network crawler system of the present invention. As shown in Fig. 1, the system comprises a customization management unit, a control unit, and a gripping unit, wherein:
The customization management unit is configured to provide a friendly user interface through which the user performs customization operations and saves the customization result. The customization management unit is further configured to set user permissions through the user interface.
The control unit is configured to read the customization result produced by the customization management unit, send a task crawl notification to the gripping unit, and start the crawl task.
The gripping unit is configured to carry out crawling for the configured task. Specifically, it creates the URL queues and starts crawl threads to crawl, analyze, and store network resources.
The gripping unit is specifically configured to crawl network resources according to the crawl depth that the user sets in the customization operations. The gripping unit comprises a first queue and a second queue: the first queue stores the uniform resource locators (URLs) of the current depth, and the second queue stores the URLs of the next depth. The URLs of the next depth stored in the second queue are obtained while crawling the network resources at the URLs in the first queue.
Two queues are set up in advance in the gripping unit to hold uniform resource locators (URLs, also called web addresses). The first queue is the current-depth queue Q1, which holds the URLs that need to be processed immediately; the second queue is the next-depth queue Q2. Once Q1 has been fully processed, all elements of Q2 are moved into Q1. Each queue member records the URL, its crawl status (for example not crawled, crawled successfully, redirected, network error, or website error), and so on. The queues can be implemented in memory or with an in-memory database.
A list is also set up in advance in the gripping unit to save all URLs obtained in the current task (that is, the current crawl process) together with their crawl status and related information. Its purpose is to prevent a URL from being processed more than once. The list can be implemented in memory or with an in-memory database.
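The two queues and the de-duplication list described above can be sketched in memory as follows. This is only an illustrative sketch: the names `CrawlStatus`, `QueueEntry`, and `CrawlState` are hypothetical and do not come from the patent, and a dictionary stands in for the list of URLs and statuses.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class CrawlStatus(Enum):
    # Status values named in the description; the enum itself is an assumption.
    NOT_CRAWLED = "not crawled"
    SUCCESS = "crawled successfully"
    REDIRECTED = "redirected"
    NETWORK_ERROR = "network error"
    SERVER_ERROR = "website server error"


@dataclass
class QueueEntry:
    """One queue member: the URL plus its crawl status."""
    url: str
    status: CrawlStatus = CrawlStatus.NOT_CRAWLED


class CrawlState:
    """In-memory state for one crawl task: Q1, Q2, and the seen-URL list."""

    def __init__(self, seed_urls):
        self.q1 = deque()  # URLs of the current depth
        self.q2 = deque()  # URLs discovered for the next depth
        self.seen = {}     # url -> CrawlStatus; prevents repeated processing
        self.depth = 0     # seed URLs are at depth 0
        for url in seed_urls:
            self.enqueue(self.q1, url)

    def enqueue(self, queue, url):
        """Append url to the given queue unless it was already seen."""
        if url in self.seen:
            return False
        self.seen[url] = CrawlStatus.NOT_CRAWLED
        queue.append(QueueEntry(url))
        return True
```

Keeping the status in the same mapping that is used for de-duplication means one lookup answers both "has this URL been seen?" and "what happened to it?".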
The network crawler system of the present invention can further comprise a monitoring unit, configured to monitor crawling behavior, display the running status of crawl tasks, and query the results of crawl tasks.
Fig. 2 is a flow chart of the method by which the network crawler system of the present invention acquires resources. With reference to Fig. 1, the method comprises the following steps:
Step 200: the user performs customization operations through the user interface and saves the customization result.
The user can customize one or more tasks, customizing each task separately. The user interface can be implemented as a desktop application, a Web application, or the like. Further, permissions can be set on the user interface so that only users with permission can perform customization operations. The customization result can be saved in a file or a database, and contains one or more crawl tasks.
The customization result of the customization management unit includes, but is not limited to, the following:
1) creation of the crawl task, including the task's start time, the crawl period, and the initial uniform resource locators (URLs, also called web addresses) to crawl;
2) classification of crawled resources, for example by URL prefix;
3) the scope of crawled resources, including the crawl depth and the URL prefixes or regular expressions that may be crawled;
4) crawl client settings, including whether to use a network proxy server (Proxy), the version of the Hypertext Transfer Protocol (HTTP) to use, the user agent (UserAgent) to use, the timeout of HTTP requests, and so on; website servers use the UserAgent to distinguish client types;
5) how crawled resources are saved, including the file types of the resources to save, the features that a resource must match to be saved, and so on;
6) crawl report settings, including the time taken to crawl resources, the list of URLs that failed to be crawled, the update frequency of resources, and so on.
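As an illustration of how the six groups of settings above might be held and saved to a file, here is a minimal sketch. The class `CrawlTaskConfig`, every field name, and every default value are assumptions chosen for the example; the patent does not define a storage format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class CrawlTaskConfig:
    # 1) task creation
    seed_urls: List[str]
    start_time: str = "2009-08-26T00:00:00"  # hypothetical ISO 8601 string
    period_seconds: int = 86400
    # 2) classification: URL prefix -> category label
    categories: Dict[str, str] = field(default_factory=dict)
    # 3) scope
    max_depth: int = 2
    allowed_prefixes: List[str] = field(default_factory=list)
    # 4) client settings
    proxy: Optional[str] = None
    http_version: str = "1.1"
    user_agent: str = "ExampleCrawler/0.1"   # hypothetical value
    timeout_seconds: float = 10.0
    # 5) saving
    save_mime_types: List[str] = field(default_factory=lambda: ["text/html"])
    max_file_bytes: int = 1_000_000
    # 6) report
    report_failed_urls: bool = True


def config_to_json(config: CrawlTaskConfig) -> str:
    """Serialize a task configuration so it can be written to a file."""
    return json.dumps(asdict(config), indent=2)


def config_from_json(text: str) -> CrawlTaskConfig:
    """Rebuild a task configuration from its JSON form."""
    return CrawlTaskConfig(**json.loads(text))
```

A customization result holding several tasks could then simply be a list of such objects, one per crawl task.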
Step 201: according to the user's customization result, start the crawl task and crawl for the configured task.
Multiple crawl tasks run in parallel and are relatively independent; each task maintains its own crawl status. The control unit schedules tasks according to their settings.
Two queues are set up in advance in the gripping unit to hold uniform resource locators (URLs, also called web addresses). The first queue is the current-depth queue Q1, which holds the URLs that need to be processed immediately; the second queue is the next-depth queue Q2. Once Q1 has been fully processed, all elements of Q2 are moved into Q1. Each queue member records the URL, its crawl status (for example not crawled, crawled successfully, redirected, network error, or website error), and so on. The queues can be implemented in memory or with an in-memory database. Specifically, the gripping unit comprises a first queue and a second queue, the first queue storing the URLs of the current depth and the second queue storing the URLs of the next depth; the customization operations include the crawl depth; and carrying out the crawling comprises:
the gripping unit takes a URL from the first queue and crawls the network resource at that URL;
the gripping unit determines, according to the configured crawl depth, whether network resources need to be crawled from the URLs in the second queue, and if so, continues crawling.
A list is also set up in advance in the gripping unit to save all URLs obtained in the current task together with their crawl status and related information. Its purpose is to prevent a URL from being processed more than once. The list can be implemented in memory or with an in-memory database.
The one or more initial URLs set by the user are placed in Q1, and the current depth is 0. Each URL is downloaded and analyzed by one thread.
Further, the method also comprises step 202: monitoring the crawling behavior.
After the crawl task ends, a crawl report is output according to the user's customization. The report content can include:
the processing time of each URL and its HTTP response status code;
the list of URLs marked as network error;
the list of URLs marked as website server error;
the list of URLs with a high update frequency;
the total duration of the task.
Further, after the crawl task finishes, it is determined whether the user needs to be notified; if so, the user is notified that the task has finished in the configured manner (for example by e-mail or text message).
Fig. 3 is a flow chart of an embodiment in which the network crawler system of the present invention acquires resources. Assume the user has performed customization operations through the user interface and the URLs that need immediate processing have been saved in queue Q1. As shown in Fig. 3, the flow comprises:
Step 300: take a URL from queue Q1, delete it from Q1 at the same time, and hand it to a crawl thread for processing.
Step 301: download the resource from the specified URL into local memory.
Following the HTTP protocol, an existing download utility class is used to download the resource from the specified URL into local memory. This step can apply user settings such as the Proxy, the HTTP request timeout, and the UserAgent. The UserAgent setting is mainly used to emulate different browsers: some websites support both HTTP and WAP, and the content obtained from the same URL differs depending on the User Agent used.
When a URL is visited, a redirecting page may be encountered. The new URL then needs to be downloaded and the original URL is marked as redirected; if the number of redirects exceeds the maximum number that the user has configured, the crawl status is marked as website server error.
When a URL is visited, a network connection error may be returned, for example an HTTP request connection timeout or an unreachable destination host; the crawl status is then marked as network error.
When a URL is visited, a server-side error may be returned, for example a status code of 4XX or 5XX; the crawl status is then marked as website server error.
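The three cases above amount to a small decision function. The sketch below assumes that the download step reports either an HTTP status code or no response at all; the function name `classify_outcome`, its parameters, and the treatment of codes below 300 as success are assumptions made for illustration.

```python
from typing import Optional


def classify_outcome(http_status: Optional[int],
                     redirects: int = 0,
                     max_redirects: int = 5) -> str:
    """Map one download attempt to a crawl status label.

    http_status is None when no HTTP response was received at all
    (connection timeout, unreachable host, and similar failures).
    """
    if http_status is None:
        return "network error"
    if redirects > max_redirects:
        # More redirects than the user allows counts as a server error.
        return "website server error"
    if 300 <= http_status < 400:
        return "redirected"
    if 400 <= http_status < 600:
        return "website server error"
    return "crawled successfully"
```

For example, `classify_outcome(302, redirects=1)` yields `"redirected"`, while the same response after six redirects with the default limit of five yields `"website server error"`.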
Step 302: determine whether the download succeeded. If not, go to step 310; if so, go to step 303.
The download is unsuccessful when the crawl status has been marked as website server error or network error.
Step 303: determine whether the downloaded resource is a web page. A simple method is to extract the Multipurpose Internet Mail Extensions (MIME) information from the HTTP header: if the MIME type is text/html or text/wml, the resource is judged to be a web page and step 304 is executed; otherwise, the downloaded resource is judged not to be a web page and the flow goes to step 308.
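The MIME test in step 303 can be sketched as a check on the Content-Type header value. The helper name `is_web_page` is hypothetical; the two MIME types are the ones named in the text.

```python
WEB_PAGE_MIME_TYPES = {"text/html", "text/wml"}


def is_web_page(content_type_header: str) -> bool:
    """Return True if the Content-Type header value denotes a web page."""
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    mime = content_type_header.split(";", 1)[0].strip().lower()
    return mime in WEB_PAGE_MIME_TYPES
```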
Step 304: analyze the web page and extract new URLs. For a web page, the URLs contained in its content are extracted in order to obtain more URLs. A HyperText Markup Language (HTML) page is processed as follows:
The character set of the page is determined from the Content-Type attribute of the returned HTTP header and the Content-Type attribute in the page's META information. The specific method is a conventional technique for those skilled in the art and is not described in detail here.
The page is parsed with an existing HTML parser. Most pages link to other pages through the HREF attribute of A tags, so the A tags extracted after parsing are processed. Many A tags do not use a standard URL in the HREF attribute but a relative path, for example <a href="../b.html">aaa</a>; such an HREF attribute must be resolved against the URL of the containing page to obtain the real URL. The text content of the source A tag is saved with the URL as its default title. The specific implementation is a conventional technique for those skilled in the art and is not described in detail here.
The text contained in the page's TITLE tag is extracted as the title.
Wireless Markup Language (WML) pages are processed in a similar way to HTML pages.
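The link extraction described for step 304 can be sketched with Python's standard `html.parser` and `urllib.parse.urljoin`. The patent only says that an existing HTML parser is used, so the choice of parser and the class name `LinkExtractor` are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from the A tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # absolute URL of the A tag currently open
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative paths against the URL of the containing page.
                self._href = urljoin(self.base_url, href.strip())
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # The anchor text serves as the default title of the URL.
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
```

Feeding the example tag from the text into an extractor whose page URL is `http://www.163.com/a/c/index.html` resolves `../b.html` to `http://www.163.com/a/b.html` with the default title `aaa`.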
Step 305: normalize the newly extracted URLs.
Each newly extracted URL is normalized in order to eliminate elements of its address such as parent-directory and current-directory path expressions.
For example, although http://www.163.com/a/b.html and http://www.163.com/a/../a/b.html look different, they are in fact the same URL.
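A minimal sketch of this normalization, assuming that only "." and ".." path segments need to be removed. The function name `normalize_url` is hypothetical, and dropping the fragment is an added assumption that the text does not state.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Remove '.' and '..' path segments so equivalent URLs compare equal."""
    parts = urlsplit(url)
    segments = []
    for seg in parts.path.split("/"):
        if seg == "..":
            # Go up one level, but never above the root (the leading "").
            if len(segments) > 1:
                segments.pop()
        elif seg != ".":
            segments.append(seg)
    if parts.path.endswith(("/.", "/..")):
        segments.append("")  # keep the trailing slash of a directory path
    path = "/".join(segments) or "/"
    # Assumption: the fragment is dropped because it does not change the resource.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
```

With this sketch, both example URLs from the text normalize to `http://www.163.com/a/b.html`, so the duplicate check in the next step treats them as one URL.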
Step 306: determine whether the new URL is valid. If it is valid, continue with step 307; otherwise go to step 308.
The depth of the new URL is calculated as the current depth plus 1, and the new URL is then judged to be valid or not on the following basis:
whether the new URL is a duplicate; if it is, it is invalid;
whether the new URL matches the URL prefixes configured by the user; if it does not, it is invalid;
whether the depth of the new URL exceeds the limit; if it does, it is invalid.
If the final result is a valid URL, step 307 is executed; otherwise step 308 is executed.
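The three validity checks can be sketched as a single predicate. The function name and parameters are hypothetical, and only prefix matching is shown even though the settings also allow regular expressions.

```python
from typing import Collection, Sequence


def is_valid_new_url(url: str, current_depth: int, seen: Collection[str],
                     allowed_prefixes: Sequence[str], max_depth: int) -> bool:
    """Apply the duplicate, prefix, and depth checks to a newly found URL."""
    new_depth = current_depth + 1  # a new URL is one level deeper than its page
    if url in seen:
        return False               # duplicate
    if allowed_prefixes and not url.startswith(tuple(allowed_prefixes)):
        return False               # outside the user-configured scope
    if new_depth > max_depth:
        return False               # deeper than the configured crawl depth
    return True
```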
Step 307: tag the new URL with a category according to the classification information set by the user, and put it into queue Q2.
Step 308: determine whether the user needs the resource currently held in memory to be stored. The basis for the decision includes:
whether the file has been updated, which can be judged by the MD5 digest of the content or simply by the file size; a file that has not been updated need not be saved;
whether the file type is one of the types that the user has set as savable;
whether the file meets other requirements set by the user, such as containing or not containing a configured keyword;
whether the file exceeds the maximum savable file size;
whether the maximum number of files allowed to be saved has been exceeded, and so on.
If the result is to save, step 309 is executed; otherwise the flow goes to step 310.
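The save decision in step 308 can be sketched as a chain of checks. The function `should_save` and all of its parameters are hypothetical; the MD5 comparison follows the text, and the keyword check shows only the "must contain" variant.

```python
import hashlib
from typing import Optional, Set


def should_save(content: bytes, mime: str, previous_md5: Optional[str],
                allowed_mimes: Set[str], max_bytes: int,
                saved_count: int, max_files: int,
                required_keyword: Optional[bytes] = None) -> bool:
    """Decide whether a downloaded resource should be written to storage."""
    if hashlib.md5(content).hexdigest() == previous_md5:
        return False  # content unchanged since the last crawl
    if mime not in allowed_mimes:
        return False  # not a file type the user wants to keep
    if required_keyword is not None and required_keyword not in content:
        return False  # does not contain the configured keyword
    if len(content) > max_bytes:
        return False  # larger than the maximum savable file size
    if saved_count >= max_files:
        return False  # the file-count limit has been reached
    return True
```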
Step 309: save the resource locally.
To avoid filename conflicts and illegal characters, a universally unique identifier (UUID) can be used as the name of the saved resource. The URL, the MIME information, the saved filename, and the title are saved together as one record in a file or database, for use by the search engine's indexing program or by other programs during analysis.
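A sketch of the UUID-based naming and the accompanying record. The record layout, the function name `make_record`, and the optional file extension are assumptions for illustration.

```python
import uuid


def make_record(url: str, mime: str, title: str, extension: str = "") -> dict:
    """Build the metadata record stored alongside a saved resource."""
    # A random UUID avoids name clashes and characters that are illegal
    # in filenames, whatever the original URL looked like.
    filename = uuid.uuid4().hex + extension
    return {"url": url, "mime": mime, "filename": filename, "title": title}
```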
Step 310: the flow ends.
When queue Q1 is empty and all crawl threads of the current task have finished, the URLs of the current depth have all been crawled. At this point, if queue Q2 still contains URLs, the URLs in Q2 are transferred to Q1 and a crawl thread is started again for each URL; if Q2 contains no URLs, the crawl task ends.
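The depth-by-depth hand-over between Q2 and Q1 can be sketched as the loop below. It is single-threaded for clarity, whereas the text uses one crawl thread per URL, and `fetch_links` is a hypothetical callable that stands in for steps 301 to 305 by returning the normalized links found at a URL.

```python
from collections import deque


def crawl_by_depth(seed_urls, fetch_links, max_depth):
    """Visit URLs level by level; return (url, depth) in processing order."""
    q1 = deque(dict.fromkeys(seed_urls))  # current depth, seeds de-duplicated
    q2 = deque()                          # next depth
    seen = set(q1)
    depth = 0
    visited = []
    while q1:
        url = q1.popleft()
        visited.append((url, depth))
        for link in fetch_links(url):
            # Only new URLs within the configured depth go into Q2.
            if link not in seen and depth + 1 <= max_depth:
                seen.add(link)
                q2.append(link)
        if not q1 and q2:
            # Current depth finished: move Q2 into Q1 and go one level deeper.
            q1, q2 = q2, deque()
            depth += 1
    return visited
```

When Q1 runs empty and Q2 is empty as well, the loop exits, which corresponds to the end of the crawl task.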
Further, after the crawl task ends, a crawl report is output according to the user's customization. The report content can include:
the processing time of each URL and its HTTP response status code;
and/or the list of URLs marked as network error;
and/or the list of URLs marked as website server error;
and/or the list of URLs with a high update frequency;
and/or the total duration of the task.
Further, after the crawl task finishes, it is determined whether the user needs to be notified; if so, the user is notified that the task has finished in the configured manner (for example by e-mail or text message).
Next, the time of the next crawl run is calculated from the period set by the user, and the main control module is notified to schedule the task.
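One way to compute the next run time from the configured period is sketched below. It assumes that runs are aligned to the task's start time rather than to the moment the previous run finished; the text does not say which, and the function name `next_run_time` is hypothetical.

```python
from datetime import datetime, timedelta


def next_run_time(start: datetime, period: timedelta, now: datetime) -> datetime:
    """Return the first scheduled run time strictly after `now`."""
    if now < start:
        return start
    # Whole periods already elapsed since the configured start time.
    periods_done = (now - start) // period
    return start + (periods_done + 1) * period
```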
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (13)

1. A network crawler system, characterized in that it comprises a customization management unit, a control unit, and a gripping unit, wherein:
the customization management unit is configured to provide a user interface through which the user performs customization operations and saves the customization result;
the control unit is configured to read the customization result produced by the customization management unit, send a task crawl notification to the gripping unit, and start the crawl task;
the gripping unit is configured to carry out crawling for the configured task.
2. The network crawler system according to claim 1, characterized in that the system further comprises a monitoring unit, configured to monitor crawling behavior, display the running status of crawl tasks, and query the results of crawl tasks.
3. The network crawler system according to claim 1 or 2, characterized in that the customization operations of the customization management unit comprise one or any combination of the following: the crawl depth, creation of crawl tasks, classification of crawled resources, the scope of crawled resources, crawl client settings, how crawled resources are saved, and crawl report settings.
4. The network crawler system according to claim 1 or 2, characterized in that the gripping unit is specifically configured to crawl network resources according to the crawl depth that the user sets in the customization operations;
wherein the gripping unit comprises a first queue and a second queue, the first queue storing the uniform resource locators (URLs) of the current depth and the second queue storing the URLs of the next depth.
5. The network crawler system according to claim 4, characterized in that the URLs of the next depth stored in the second queue are obtained while crawling the network resources at the URLs in the first queue.
6. The network crawler system according to claim 4, characterized in that the gripping unit further comprises a list for saving all URLs obtained in the current crawl process and their crawl status information.
7. A method by which a network crawler system acquires resources, characterized in that, based on the system of claim 1, the method comprises:
the user performs customization operations through the user interface and saves the customization result;
according to the user's customization result, the crawl task is started and crawling is carried out for the configured task.
8. The method according to claim 7, characterized in that the customization result comprises one or any combination of the following: the crawl depth, creation of crawl tasks, classification of crawled resources, the scope of crawled resources, crawl client settings, how crawled resources are saved, and crawl report settings.
9. The method according to claim 7, characterized in that the gripping unit comprises a first queue and a second queue, the first queue storing the URLs of the current depth and the second queue storing the URLs of the next depth; the customization operations include the crawl depth; and carrying out the crawling comprises:
the gripping unit takes a URL from the first queue and crawls the network resource at that URL;
the gripping unit determines, according to the configured crawl depth, whether network resources need to be crawled from the URLs in the second queue, and if so, continues crawling.
10. The method according to claim 8, characterized in that there are one or more crawl tasks; when there is more than one, the crawl tasks run in parallel and are relatively independent, and each crawl task maintains its own crawl status.
11. A network resource gripping device, characterized in that it comprises a gripping unit,
the gripping unit being configured to crawl network resources according to the crawl depth set by the user;
wherein the gripping unit comprises a first queue and a second queue, the first queue storing the uniform resource locators (URLs) of the current depth and the second queue storing the URLs of the next depth.
12. The network resource gripping device according to claim 11, characterized in that the URLs of the next depth stored in the second queue are obtained while crawling the network resources at the URLs in the first queue.
13. The network resource gripping device according to claim 11 or 12, characterized in that the gripping unit further comprises a list for saving all URLs obtained in the current crawl process and their crawl status information.
CN200910091624A 2009-08-26 2009-08-26 Network crawler system and method for acquiring resource as well as network resource gripping device Pending CN101635718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910091624A CN101635718A (en) 2009-08-26 2009-08-26 Network crawler system and method for acquiring resource as well as network resource gripping device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910091624A CN101635718A (en) 2009-08-26 2009-08-26 Network crawler system and method for acquiring resource as well as network resource gripping device

Publications (1)

Publication Number Publication Date
CN101635718A true CN101635718A (en) 2010-01-27

Family

ID=41594779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910091624A Pending CN101635718A (en) 2009-08-26 2009-08-26 Network crawler system and method for acquiring resource as well as network resource gripping device

Country Status (1)

Country Link
CN (1) CN101635718A (en)

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314463A (en) * 2010-07-07 2012-01-11 北京瑞信在线***技术有限公司 Distributed crawler system and webpage data extraction method for the same
CN102567513A (en) * 2011-12-27 2012-07-11 北京神州绿盟信息安全科技股份有限公司 Method and equipment for collecting phishing websites
CN102567513B (en) * 2011-12-27 2014-09-17 北京神州绿盟信息安全科技股份有限公司 Method and equipment for collecting phishing websites
CN102710748A (en) * 2012-05-02 2012-10-03 华为技术有限公司 Data acquisition method, system and equipment
CN102710748B (en) * 2012-05-02 2016-01-27 华为技术有限公司 Data capture method, system and equipment
CN103455492B (en) * 2012-05-29 2018-10-30 腾讯科技(深圳)有限公司 A kind of method and apparatus of search and webpage
CN103455492A (en) * 2012-05-29 2013-12-18 腾讯科技(深圳)有限公司 Method and device for searching web pages
CN103514171A (en) * 2012-06-20 2014-01-15 同程网络科技股份有限公司 Method for implementing self-defined crawler based on optical character recognition and vertical search
CN103514171B (en) * 2012-06-20 2016-08-03 同程网络科技股份有限公司 Optically-based character recognition and the self-defined reptile method of vertical search
CN102929721B (en) * 2012-09-29 2015-04-08 北京奇虎科技有限公司 Balanced scheduling system and method based on station quota
CN102868639A (en) * 2012-09-29 2013-01-09 北京奇虎科技有限公司 Balanced scheduling system and balanced scheduling method based on site quota
CN102929721A (en) * 2012-09-29 2013-02-13 北京奇虎科技有限公司 Balanced scheduling system and method based on station quota
CN102868639B (en) * 2012-09-29 2016-08-03 北京奇虎科技有限公司 Balance dispatching system and method based on website quota
CN103037010A (en) * 2012-12-26 2013-04-10 人民搜索网络股份公司 Distributed network crawler system and catching method thereof
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN104038471B (en) * 2013-03-08 2017-08-11 ***通信集团浙江有限公司 A kind of method and carrier network that IDC resources are managed in internet
CN104038471A (en) * 2013-03-08 2014-09-10 ***通信集团浙江有限公司 Method for managing IDC resources in internet and service provider network
CN103279507A (en) * 2013-05-16 2013-09-04 北京尚友通达信息技术有限公司 Webpage spider operational method and system
CN103279507B (en) * 2013-05-16 2016-12-28 北京尚友通达信息技术有限公司 Webpage spider operational method and system
CN103401849A (en) * 2013-07-18 2013-11-20 盘石软件(上海)有限公司 Abnormal session analyzing method for website logs
CN103401849B (en) * 2013-07-18 2017-02-15 盘石软件(上海)有限公司 Abnormal session analyzing method for website logs
CN103559304A (en) * 2013-11-18 2014-02-05 北京暴风科技股份有限公司 Implementation method and device for Internet data customization
CN103746929A (en) * 2014-01-13 2014-04-23 刘保太 Optimal access flow scheduling method based on DNS (Domain Name System) and optimal access flow scheduling equipment based on DNS
CN104484424A (en) * 2014-12-19 2015-04-01 浪潮通用软件有限公司 Establishing method for resource price information base of construction enterprise based on internet
CN104915439A (en) * 2015-06-25 2015-09-16 百度在线网络技术(北京)有限公司 Search result pushing method and device
CN106503017A (en) * 2015-09-08 2017-03-15 摩贝(上海)生物科技有限公司 A kind of distributed reptile system task grasping system and method
CN105447184A (en) * 2015-12-15 2016-03-30 北京百分点信息科技有限公司 Information capturing method and device
CN105447184B (en) * 2015-12-15 2019-06-11 北京百分点信息科技有限公司 Information extraction method and device
CN106331108A (en) * 2016-08-25 2017-01-11 北京量科邦信息技术有限公司 Crawler realization method and system capable of breaking through IP limit
CN107861861A (en) * 2016-11-14 2018-03-30 平安科技(深圳)有限公司 Short message interface lookup method and device
CN107832136A (en) * 2017-11-28 2018-03-23 广州启生信息技术有限公司 The management method and device of a kind of web crawler
CN109033195A (en) * 2018-06-28 2018-12-18 上海盛付通电子支付服务有限公司 The acquisition methods of webpage information obtain equipment and computer-readable medium
CN109992707A (en) * 2019-03-18 2019-07-09 广州视源电子科技股份有限公司 Data crawling method and device, storage medium and server

Similar Documents

Publication Publication Date Title
CN101635718A (en) Network crawler system and method for acquiring resource as well as network resource gripping device
CN101651707B (en) Method for automatically acquiring user behavior log of network
CN101515300B (en) Method and system for grabbing Ajax webpage content
CN105094888B (en) A kind of application plug loading method and device
CN102571932B (en) For application on site, user provides status service
CN109936621B (en) Information security multi-page message pushing method, device, equipment and storage medium
CN105243159A (en) Visual script editor-based distributed web crawler system
CN103118007B (en) A kind of acquisition methods of user access activity and system
CN101399716B (en) Distributed audit system and method for monitoring using state of office computer
CN102355488A (en) Crawler seed obtaining method and equipment and crawler crawling method and equipment
US8949462B1 (en) Removing personal identifiable information from client event information
CN105471635B (en) A kind of processing method of system log, device and system
JP2009116733A (en) Application retrieval system, application retrieval method, monitor terminal, retrieval server, and program
EP2521043A1 (en) Method for establishing a relationship between semantic data and the running of a widget
WO2012048617A1 (en) Method and system for updating widget, widget client and widget server
CN103077107A (en) Method and system for maintaining data
US20020052889A1 (en) Method for managing alterations of contents
CN112307292A (en) Information processing method and system based on advanced persistent threat attack
CN110688354B (en) Analysis method of slow log file in database, terminal and storage medium
CN102882988A (en) Method, device and equipment for acquiring address information of resource information
US9529911B2 (en) Building of a web corpus with the help of a reference web crawl
CN103825772A (en) Method for identifying user click behavior and gateway equipment
CN101315629A (en) Downloading method and system for web page dynamic contents
US20110055279A1 (en) Application server, object management method, and object management program
CN103475630A (en) Session preservation method and apparatus thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20100127