CN101635718A

CN101635718A - Network crawler system and method for acquiring resource as well as network resource gripping device

Info

Publication number: CN101635718A
Application number: CN200910091624A
Authority: CN
Inventors: 郑伟
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2009-08-26
Filing date: 2009-08-26
Publication date: 2010-01-27

Abstract

The invention provides a network crawler system, a method for acquiring resources, and a network resource gripping device. The network crawler system comprises a user customization management unit, a control unit and a gripping unit. A user performs customization operations through a user operation interface provided by the user customization management unit and saves the customization result; the control unit and the gripping unit start a gripping task and carry out the gripping behavior for the set task according to the user's customization result. The network crawler system provides a friendly user customization interface with a rich set of setting items, so the user's requirements can be satisfied without secondary programming development, which improves the usability of network crawlers and lets the user search for network resources conveniently and flexibly.

Description

Network crawler system, method for acquiring resources, and network resource gripping device

Technical field

The present invention relates to Internet resource search technology, and in particular to a network crawler system, a method for acquiring resources on the Internet or a local area network (LAN), and a network resource gripping device.

Background technology

With the growth and spread of network applications, more and more resources are placed on the network. In a network that carries a massive volume of resources, a major problem users face is how to find the resources they need quickly and accurately. Existing Internet search engines have several shortcomings: they cannot search resources on a local area network (LAN); because the number of resources is so large, their indexes are not updated in time, so recently updated resources cannot be found; and they return many results, much of which is not the information the user wants.

Many enterprises therefore build their own network search engines on top of existing open-source search engines, and the web crawler is what supplies those engines with resources. A web crawler is software that automatically grips resources on the Internet or a local area network. Besides supplying source material to search engines, web crawlers have other applications, such as monitoring certain websites at regular intervals.

Existing web crawlers cannot satisfy the individual needs of different users in terms of ease of use and degree of customization. Some user requirements can only be met if the user carries out secondary programming development; existing web crawlers also provide few setting items, which makes it hard to satisfy personalized search requirements; and some existing web crawlers do not even have a friendly configuration interface. In short, existing web crawlers are inconvenient to use and do not let the user search network resources flexibly.

Summary of the invention

In view of this, the main purpose of the present invention is to provide a network crawler system that improves the ease of use of web crawlers and lets users search network resources conveniently and flexibly.

Another purpose of the present invention is to provide a method by which a network crawler system acquires resources, which improves the ease of use of web crawlers and lets users search network resources conveniently and flexibly.

A further purpose of the present invention is to provide a network resource gripping device that improves the ease of use of web crawlers and lets users search network resources conveniently and flexibly.

To achieve the above purposes, the technical solution of the present invention is realized as follows:

A network crawler system comprises a customization management unit, a control unit and a gripping unit, wherein:

The customization management unit is used to provide a user interface, through which the user performs customization operations and saves the customization result;

The control unit is used to read the customization result produced by the customization management unit, send a task gripping notification to the gripping unit, and start the gripping task;

The gripping unit is used to carry out gripping for the set task.

The system further comprises a monitoring unit, used to monitor the gripping behavior, display the running status of a gripping task, and query the result of a gripping task.

The customization operations of the customization management unit comprise one or any combination of the following: the gripping depth, creation of the gripping task, classification of the gripped resources, the scope of the gripped resources, the gripping client settings, how the gripped resources are saved, and the gripping report settings.

The gripping unit is specifically used to grip network resources according to the gripping depth that the user sets in the customization operations;

Wherein the gripping unit comprises a first queue and a second queue; the first queue is used to store the uniform resource locators (URLs) of the current depth, and the second queue is used to store the URLs of the next depth.

The URLs of the next depth stored in the second queue are obtained while gripping the network resources at the URLs in the first queue.

The gripping unit further comprises a list, used to store all the URLs obtained in the current gripping process together with their gripping status information.

A method by which a network crawler system acquires resources, based on the system of claim 1, comprises:

The user performs customization operations through the user interface and saves the customization result;

According to the user's customization result, the gripping task is started and gripping is carried out for the set task.

The customization result comprises one or any combination of the following: the gripping depth, creation of the gripping task, classification of the gripped resources, the scope of the gripped resources, the gripping client settings, how the gripped resources are saved, and the gripping report settings.

The gripping unit comprises a first queue and a second queue; the first queue stores the URLs of the current depth, and the second queue stores the URLs of the next depth. The customization operations include the gripping depth. Carrying out the gripping comprises:

The gripping unit takes a URL from the first queue and grips the network resource at that URL;

The gripping unit judges, according to the configured gripping depth, whether the network resources at the URLs in the second queue need to be gripped, and if so, continues gripping.

There are one or more gripping tasks. When there is more than one gripping task, the gripping tasks run in parallel and are relatively independent, and each gripping task maintains its own gripping status.

A network resource gripping device comprises a gripping unit,

The gripping unit is used to grip network resources according to the gripping depth set by the user;

As can be seen from the technical solution provided above, the network crawler system of the present invention comprises a customization management unit, a control unit and a gripping unit. The user performs customization operations through the user interface provided by the customization management unit and saves the customization result; the control unit and the gripping unit start the gripping task and carry out the gripping behavior for the set task according to the user's customization result. The network crawler system of the present invention provides a friendly customization interface with a rich set of setting items, so the user's needs can be met without secondary programming development. This improves the ease of use of web crawlers and lets the user search network resources conveniently and flexibly.

Description of drawings

Fig. 1 is a schematic diagram of the structure of the network crawler system of the present invention;

Fig. 2 is a flow chart of the method by which the network crawler system of the present invention acquires resources;

Fig. 3 is a flow chart of an embodiment in which the network crawler system of the present invention acquires resources.

Embodiment

Fig. 1 is a schematic diagram of the structure of the network crawler system of the present invention. As shown in Fig. 1, the system comprises a customization management unit, a control unit and a gripping unit, wherein:

The customization management unit is used to provide a friendly user interface, through which the user performs customization operations and saves the customization result. The customization management unit is further used to set user permissions through the user interface.

The control unit is used to read the customization result produced by the customization management unit, send a task gripping notification to the gripping unit, and start the gripping task.

The gripping unit is used to carry out the gripping behavior for the set task. Specifically, it creates the URL queues and starts gripping threads that grip, analyze and store network resources.

The gripping unit is specifically used to grip network resources according to the gripping depth that the user sets in the customization operations. The gripping unit comprises a first queue and a second queue; the first queue is used to store the uniform resource locators (URLs) of the current depth, and the second queue is used to store the URLs of the next depth. The URLs of the next depth stored in the second queue are obtained while gripping the network resources at the URLs in the first queue.

Two queues are set up in advance in the gripping unit to store uniform resource locators (URLs, also called web addresses). The first queue is the current-depth queue Q1, which stores the URLs that need to be processed immediately; the second queue is the next-depth queue Q2. Once Q1 has been completely processed, all the elements of Q2 are moved to Q1. Each queue member holds the URL, its gripping status (for example not gripped, gripped successfully, redirected, network error, or website error), and so on. The queues can be implemented in memory or in an in-memory database.

A list is also set up in advance in the gripping unit to store information such as all the URLs obtained by the current task, that is, in the current gripping process, and their gripping status. Its purpose is to prevent URLs from being processed repeatedly. The list is implemented in memory or in an in-memory database.
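The two queues and the list described above can be sketched as a small in-memory structure. This is only an illustration under assumptions: the names GripStatus, TaskState, add_seed and add_discovered are hypothetical and do not come from the specification, and a dict stands in for the list so that duplicate checks are fast.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class GripStatus(Enum):
    """Gripping statuses named in the description (identifiers are illustrative)."""
    NOT_GRIPPED = "not gripped"
    SUCCESS = "gripped successfully"
    REDIRECTED = "redirected"
    NETWORK_ERROR = "network error"
    SERVER_ERROR = "website server error"


@dataclass
class TaskState:
    """State of one gripping task: two depth queues plus the list of known URLs."""
    q1: deque = field(default_factory=deque)  # URLs of the current depth
    q2: deque = field(default_factory=deque)  # URLs of the next depth
    seen: dict = field(default_factory=dict)  # URL -> GripStatus

    def add_seed(self, url: str) -> None:
        """Place an initial URL in Q1 unless it is already known."""
        if url not in self.seen:
            self.seen[url] = GripStatus.NOT_GRIPPED
            self.q1.append(url)

    def add_discovered(self, url: str) -> bool:
        """Queue a newly found URL for the next depth; return False for a duplicate."""
        if url in self.seen:
            return False
        self.seen[url] = GripStatus.NOT_GRIPPED
        self.q2.append(url)
        return True
```

Keeping one TaskState per task matches the statement that each task maintains its own gripping status.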

The network crawler system of the present invention may further comprise a monitoring unit, used to monitor the gripping behavior: it displays the running status of a gripping task and queries the result of a gripping task.

Fig. 2 is a flow chart of the method by which the network crawler system of the present invention acquires resources. With reference to Fig. 1, the method comprises the following steps:

Step 200: The user performs customization operations through the user interface and saves the customization result.

The user can customize one or more tasks, and each task is customized separately. The user interface can be implemented as a desktop application, a Web application, or the like. Further, permissions can be set on the user interface so that only users with permission can perform customization operations. The customization result can be saved in a file or a database, and contains one or more gripping tasks.

The customization result of the customization management unit includes, but is not limited to, the following:

1) Creation of the gripping task, including the start time of the task, the gripping period, and the initial uniform resource locators (URLs, also called web addresses) to grip;

2) Classification of the gripped resources, for example by URL prefix;

3) The scope of the gripped resources, including the gripping depth and the URL prefixes or regular expressions that may be gripped;

4) Gripping client settings, including whether to use a network proxy server (Proxy), the version of the Hypertext Transfer Protocol (HTTP) to use, the user agent (UserAgent) to use, the timeout of HTTP requests, and so on; the website server uses the UserAgent to distinguish client types;

5) How the gripped resources are saved, including the file types of the resources to save, the features that resources to be saved must match, and so on;

6) Gripping report settings, including the time taken to grip resources, the list of URLs that failed to be gripped, the update frequency of resources, and so on.
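One way to picture a saved customization result is as a plain record per task. The sketch below is an assumption made for illustration: the class names ClientSettings and GrippingTask, every field name, and every default value are hypothetical, since the specification does not define a storage format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ClientSettings:
    """Item 4: gripping client settings (all defaults are placeholders)."""
    proxy: Optional[str] = None            # None means no proxy server is used
    http_version: str = "1.1"
    user_agent: str = "ExampleCrawler/0.1"
    timeout_seconds: float = 10.0


@dataclass
class GrippingTask:
    """One customized gripping task, covering items 1 to 6 of the list above."""
    start_time: datetime                    # item 1: start time
    period: timedelta                       # item 1: gripping period
    start_urls: list                        # item 1: initial URLs
    category_prefixes: dict = field(default_factory=dict)  # item 2: prefix -> category
    max_depth: int = 2                      # item 3: gripping depth
    allowed_prefixes: list = field(default_factory=list)   # item 3: URL prefixes
    client: ClientSettings = field(default_factory=ClientSettings)
    save_mime_types: list = field(default_factory=lambda: ["text/html"])  # item 5
    report_failed_urls: bool = True         # item 6: one possible report switch
```

A task could then be created with `GrippingTask(start_time=datetime(2009, 8, 26), period=timedelta(days=1), start_urls=["http://www.example.com/"])` and serialized to a file or database row.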

Step 201: According to the user's customization result, the gripping task is started and the gripping behavior is carried out for the set task.

Multiple gripping tasks run in parallel and are relatively independent; each task maintains its own gripping status. The control unit schedules the tasks according to their settings.

Two queues are set up in advance in the gripping unit to store uniform resource locators (URLs, also called web addresses). The first queue is the current-depth queue Q1, which stores the URLs of the current depth that need to be processed immediately; the second queue is the next-depth queue Q2, which stores the URLs of the next depth. Once Q1 has been completely processed, all the elements of Q2 are moved to Q1. Each queue member holds the URL, its gripping status (for example not gripped, gripped successfully, redirected, network error, or website error), and so on. The queues can be implemented in memory or in an in-memory database. The customization operations include the gripping depth.

A list is also set up in advance in the gripping unit to store information such as all the URLs obtained by the current task and their gripping status. Its purpose is to prevent URLs from being processed repeatedly. The list is implemented in memory or in an in-memory database.

Carrying out the gripping comprises the following: the one or more initial URLs set by the user are placed in Q1, and the current depth is 0. Each URL is downloaded and analyzed by one thread.

Further, the method also comprises step 202: monitoring the gripping behavior.

After the gripping task ends, a gripping report is output according to the user's customization. The report content may include:

The processing time of each URL and its HTTP response status code;

The list of URLs marked as network errors;

The list of URLs marked as website server errors;

The list of URLs with a high update frequency;

The total duration of the task.

Further, after the gripping task ends, it is judged whether the user needs to be notified; if so, the user is notified that the task has finished in the configured manner (for example by e-mail or SMS).

Fig. 3 is a flow chart of an embodiment in which the network crawler system of the present invention acquires resources. Assume that the user has performed customization operations through the user interface and that the URLs needing immediate processing have been saved in queue Q1. As shown in Fig. 3, the flow comprises:

Step 300: Obtain a URL from queue Q1. At the same time, delete this URL from Q1 and hand it to a gripping thread for processing.

Step 301: Download the resource from the specified URL to local memory.

Following the HTTP protocol, an existing download utility class is used to download the resource from the specified URL to local memory. This step can apply the settings made by the user, such as the Proxy, the HTTP request timeout, and the UserAgent. The UserAgent setting is mainly used to simulate different browsers: some websites support both HTTP and WAP, and accessing the same URL with different UserAgents returns different content.
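As a rough illustration of applying the Proxy and UserAgent settings with an existing download utility, the following sketch uses Python's standard urllib.request. The specification does not name a library, so this choice is an assumption, and the function name build_download and the example values are hypothetical.

```python
import urllib.request


def build_download(url, user_agent, proxy=None):
    """Prepare an opener and a request that carry the user's client settings.

    Assumption: an empty proxy mapping is used to mean "no proxy", so that
    environment proxy variables are not picked up implicitly.
    """
    proxies = {"http": proxy, "https": proxy} if proxy else {}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return opener, request


# Usage (performs a real network call, so it is left commented out):
# opener, request = build_download("http://www.example.com/", "ExampleBrowser/1.0")
# body = opener.open(request, timeout=10.0).read()   # timeout = HTTP request timeout
```

The timeout is passed at call time, which keeps the per-task timeout setting separate from the opener itself.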

When a URL is accessed, a redirecting page may be encountered. The new URL then has to be downloaded and the original URL is marked as redirected; if the number of redirects exceeds the user-configured maximum number of allowed redirects, the gripping status is marked as website server error;

When a URL is accessed, a network connection error may be returned, for example an HTTP connection timeout or an unreachable destination host; the gripping status is then marked as network error;

When a URL is accessed, a server-side error may be returned, for example a status code of 4XX or 5XX; the gripping status is then marked as website server error.
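The three marking rules above can be condensed into one small decision function. This is a sketch under assumptions: the function name classify_response, the convention that a status code of None represents a network connection failure, and the set of redirect codes are all choices made here for illustration.

```python
from typing import Optional

# HTTP status codes treated as redirects in this sketch (an assumption).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify_response(status_code: Optional[int],
                      redirects_so_far: int,
                      max_redirects: int) -> str:
    """Map one download attempt to a gripping status.

    status_code is None when no HTTP response was received at all
    (connection timeout, unreachable host, and so on).
    """
    if status_code is None:
        return "network error"
    if status_code in REDIRECT_CODES:
        # This response is one more redirect on top of those already followed.
        if redirects_so_far + 1 > max_redirects:
            return "website server error"
        return "redirected"
    if 400 <= status_code <= 599:
        return "website server error"
    return "gripped successfully"
```

Any other status code is treated as success here, which is a simplification rather than something the description states.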

Step 302: Judge whether the download succeeded. If not, go to step 310; if so, go to step 303.

If the gripping status has been marked as website server error or network error, the download was unsuccessful.

Step 303: Judge whether the downloaded resource contains a web page. A simple method is to extract the Multipurpose Internet Mail Extensions (MIME) information from the HTTP header: if the MIME type is text/html or text/wml, the resource is judged to contain a web page and step 304 is executed; otherwise the downloaded resource is judged not to contain a web page and the flow goes to step 308.
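The MIME check of step 303 might look like the following sketch, which parses a Content-Type header value with the standard email.message module. The function name is_web_page is hypothetical; the two accepted types are the ones named in the text.

```python
from email.message import Message

# MIME types that step 303 treats as web pages.
WEB_PAGE_TYPES = {"text/html", "text/wml"}


def is_web_page(content_type_header: str) -> bool:
    """Return True if the Content-Type header value denotes a web page."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() strips parameters such as charset and lowercases the type.
    return msg.get_content_type() in WEB_PAGE_TYPES
```

Parsing the header instead of comparing raw strings keeps values such as "text/html; charset=utf-8" from being rejected.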

Step 304: Analyze the web page and extract new URLs. For a web page, the URLs contained in the page content need to be extracted to obtain more URLs. A Hypertext Markup Language (HTML) web page is processed as follows:

The character set of the web page is determined from the Content-Type attribute of the returned HTTP header and the Content-Type attribute in the page's HTTP META information. The specific method is a conventional technique for those skilled in the art and is not described in detail here.
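The specification leaves character set detection to conventional techniques. One such technique, shown purely as an assumed example, is to prefer the charset parameter of the HTTP header and fall back to a META declaration near the top of the page. The function name detect_charset, the 4096-byte scan window and the utf-8 default are all hypothetical choices.

```python
import re
from email.message import Message

# Matches charset=... inside a <meta ...> tag, quoted or not.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                          re.IGNORECASE)


def detect_charset(content_type_header: str, body: bytes, default: str = "utf-8") -> str:
    """Pick a charset from the HTTP header first, then from the page's META tag."""
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()
        if charset:
            return charset
    match = META_CHARSET.search(body[:4096])  # only scan the start of the page
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```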

The page is parsed with an existing HTML parser. Most web pages link to other pages through the HREF attribute of the A tag, so the A tags extracted by the parser have to be processed. Many A tags do not use a standard absolute URL in the HREF attribute but a relative path, such as <a href="../b.html">aaa</a>; such an HREF value must be resolved against the URL of the containing page to obtain the real URL. The text content of the source A tag is saved with the URL as its default title. The specific implementation is a conventional technique for those skilled in the art and is not described in detail here.
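A minimal sketch of this A-tag processing, assuming Python's standard html.parser and urllib.parse.urljoin as the "existing HTML parser" and resolution routine. The class name LinkExtractor and the page URL used in the usage note are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, link text) pairs from the A tags of one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []      # list of (absolute_url, default_title)
        self._href = None    # URL of the A tag currently open, if any
        self._text = []      # text fragments seen inside that A tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative paths against the URL of the containing page.
                self._href = urljoin(self.page_url, href.strip())
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
```

Feeding `<a href="../b.html">aaa</a>` to `LinkExtractor("http://www.163.com/a/c/index.html")` leaves the pair `("http://www.163.com/a/b.html", "aaa")` in `links`, so the link text is kept as the default title.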

The text contained in the page's title (TITLE) tag is extracted as the title.

Wireless Markup Language (WML) web pages are processed similarly to HTML web pages.

Step 305: Normalize the newly extracted URLs.

Each newly extracted URL is normalized in order to eliminate elements of its address such as parent-directory and current-directory expressions.

For example, although http://www.163.com/a/b.html and http://www.163.com/a/../a/b.html look different, they are in fact the same URL.
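The normalization of step 305 can be sketched with the standard library as follows. The function name normalize_url is hypothetical, and lowercasing the host and dropping the fragment are extra assumptions beyond what the text requires.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Collapse "." and ".." path segments so equivalent URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"
    normalized = posixpath.normpath(path)
    # normpath drops a trailing slash, which is significant in URLs.
    if path.endswith("/") and not normalized.endswith("/"):
        normalized += "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), normalized, parts.query, ""))
```

With this sketch, both URLs from the example above normalize to http://www.163.com/a/b.html, so the duplicate check in the next step sees them as one URL.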

Step 306: Judge whether the new URL is valid. If so, continue with step 307; otherwise go to step 308.

The depth of the new URL is calculated as the current depth plus 1. Whether the new URL is valid is then judged on the following basis:

Whether the new URL is a duplicate; if it is, it is invalid;

Whether the new URL matches the user-configured URL prefixes; if not, it is invalid;

Whether the depth of the new URL exceeds the limit; if it does, it is invalid.

If the final result is a valid URL, step 307 is executed; otherwise step 308 is executed.
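The three validity criteria of step 306 translate almost directly into code. In this sketch the function name is_valid_url and its parameters are hypothetical, and an empty prefix list is assumed to mean that no prefix restriction was configured.

```python
def is_valid_url(url, current_depth, seen, allowed_prefixes, max_depth):
    """Apply the three checks of step 306 to one newly extracted URL."""
    new_depth = current_depth + 1          # depth of the new URL
    if url in seen:                        # duplicate
        return False
    if allowed_prefixes and not any(url.startswith(p) for p in allowed_prefixes):
        return False                       # outside the configured prefixes
    if new_depth > max_depth:              # deeper than the gripping depth
        return False
    return True
```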

Step 307: Tag the new URL with a category according to the classification information set by the user, and put it into queue Q2.

Step 308: Judge whether the resource currently held in memory needs to be stored for the user. The criteria include:

Whether the file has been updated, which can be judged by the MD5 digest of the content or simply by the file size; a file that has not been updated does not need to be saved;

Whether the file type is one that the user has configured as savable;

Whether other requirements set by the user are met, such as containing or not containing configured keywords;

Whether the maximum savable file size is exceeded;

Whether the maximum number of files allowed to be saved is exceeded; and so on.

If the result is that the resource should be saved, step 309 is executed; otherwise the flow goes to step 310.
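The save decision of step 308 can be illustrated as a chain of checks. The SavePolicy class, its field names and defaults, and the should_save signature are assumptions made for this sketch; only the "must contain keyword" variant of the keyword rule is shown.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SavePolicy:
    """User-configured save rules (names and defaults are illustrative)."""
    allowed_mime_types: set
    required_keywords: list = field(default_factory=list)
    max_file_bytes: int = 1_000_000
    max_file_count: int = 1000


def should_save(content: bytes, mime_type: str, previous_md5, saved_count: int,
                policy: SavePolicy) -> bool:
    """Return True only if every criterion of step 308 allows saving."""
    # Unchanged content (same MD5 digest as the previous run) is not saved again.
    if previous_md5 is not None and hashlib.md5(content).hexdigest() == previous_md5:
        return False
    if mime_type not in policy.allowed_mime_types:
        return False
    text = content.decode("utf-8", errors="ignore")
    if any(keyword not in text for keyword in policy.required_keywords):
        return False
    if len(content) > policy.max_file_bytes:
        return False
    if saved_count >= policy.max_file_count:
        return False
    return True
```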

Step 309: Save the resource locally.

To avoid file name conflicts and illegal characters, a universally unique identifier (UUID) can be used as the name of the saved resource. The URL, the MIME information, the saved file name and the title are saved as one record in a file or database, for use by the search engine's indexing program or by other programs during analysis.
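A sketch of the record described in step 309, assuming a random (version 4) UUID and a plain dict as the record. The function name make_record and the key names are hypothetical; the text only lists which four pieces of information are stored together.

```python
import uuid


def make_record(url: str, mime_type: str, title: str) -> dict:
    """Build the index record for one saved resource."""
    # A hex UUID contains only 0-9 and a-f, so it is always a safe file name.
    file_name = uuid.uuid4().hex
    return {"url": url, "mime": mime_type, "file_name": file_name, "title": title}
```

The resource bytes would then be written under `file_name`, and the record appended to the file or database that the indexing program reads.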

Step 310: The flow ends.

When queue Q1 is empty and all gripping threads of the current task have finished, the URLs of the current depth have all been gripped. At this point, if queue Q2 still contains URLs, the URLs in Q2 are transferred to Q1 and a gripping thread is started again for each URL; if Q2 contains no URLs, the gripping task ends.
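The depth-by-depth hand-over from Q2 to Q1 can be sketched as the loop below. For brevity it processes URLs one after another instead of starting a thread per URL, and it takes the download-and-parse step as a caller-supplied function; the names run_task and fetch_links are hypothetical.

```python
from collections import deque


def run_task(start_urls, max_depth, fetch_links):
    """Grip level by level; return a dict mapping each known URL to its depth.

    fetch_links(url) stands in for steps 301 to 305 and must return the
    normalized URLs found on that page.
    """
    q1, q2 = deque(), deque()
    seen = {}
    for url in start_urls:                 # initial URLs go into Q1 at depth 0
        if url not in seen:
            seen[url] = 0
            q1.append(url)
    depth = 0
    while q1:
        while q1:                          # process the whole current depth
            url = q1.popleft()
            for new_url in fetch_links(url):
                if new_url not in seen and depth + 1 <= max_depth:
                    seen[new_url] = depth + 1
                    q2.append(new_url)     # belongs to the next depth
        q1, q2 = q2, deque()               # Q2 becomes the new Q1
        depth += 1
    return seen                            # Q1 and Q2 both empty: task ends
```

Swapping the two queues only after Q1 is drained is what guarantees that every URL of one depth is gripped before any URL of the next depth.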

Further, after the gripping task ends, a gripping report is output according to the user's customization. The report content may include:

The processing time of each URL and its HTTP response status code;

And/or the list of URLs marked as network errors;

And/or the list of URLs marked as website server errors;

And/or the list of URLs with a high update frequency;

And/or the total duration of the task.
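Assembling such a report from per-URL records might look like this sketch. The record keys (url, seconds, http_status, status) and the function name build_report are assumptions; the list of frequently updated URLs is omitted because it needs history from earlier runs.

```python
def build_report(records, total_seconds):
    """Summarize one finished gripping task from its per-URL records."""
    return {
        # Processing time and HTTP response status code of each URL.
        "per_url": [(r["url"], r["seconds"], r["http_status"]) for r in records],
        "network_errors": [r["url"] for r in records
                           if r["status"] == "network error"],
        "server_errors": [r["url"] for r in records
                          if r["status"] == "website server error"],
        "total_seconds": total_seconds,
    }
```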

Next, according to the period set by the user, the time at which the gripping task is to be executed next is calculated, and the main control module is notified to schedule the task.
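One plausible way to calculate the next execution time, assuming that runs are aligned to the configured start time plus whole multiples of the period. The specification does not state the alignment rule, and the function name next_run_time is hypothetical.

```python
from datetime import datetime, timedelta


def next_run_time(start_time: datetime, period: timedelta, now: datetime) -> datetime:
    """Return the first scheduled time strictly after now."""
    if now < start_time:
        return start_time
    elapsed_periods = (now - start_time) // period   # whole periods already passed
    return start_time + (elapsed_periods + 1) * period
```

For a task that starts on 2009-08-26 at 00:00 with a one-day period, a run finishing at noon on 2009-08-27 would be rescheduled for 2009-08-28 at 00:00.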

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A network crawler system, characterized in that it comprises a customization management unit, a control unit and a gripping unit, wherein,

The gripping unit is used to carry out gripping for the set task.

2. The network crawler system according to claim 1, characterized in that the system further comprises a monitoring unit, used to monitor the gripping behavior, display the running status of a gripping task, and query the result of a gripping task.

3. The network crawler system according to claim 1 or 2, characterized in that the customization operations of the customization management unit comprise one or any combination of the following: the gripping depth, creation of the gripping task, classification of the gripped resources, the scope of the gripped resources, the gripping client settings, how the gripped resources are saved, and the gripping report settings.

4. The network crawler system according to claim 1 or 2, characterized in that the gripping unit is specifically used to grip network resources according to the gripping depth that the user sets in the customization operations;

5. The network crawler system according to claim 4, characterized in that the URLs of the next depth stored in the second queue are obtained while gripping the network resources at the URLs in the first queue.

6. The network crawler system according to claim 4, characterized in that the gripping unit further comprises a list, used to store all the URLs obtained in the current gripping process together with their gripping status information.

7. A method by which a network crawler system acquires resources, characterized in that, based on the system of claim 1, the method comprises:

8. The method according to claim 7, characterized in that the customization result comprises one or any combination of the following: the gripping depth, creation of the gripping task, classification of the gripped resources, the scope of the gripped resources, the gripping client settings, how the gripped resources are saved, and the gripping report settings.

9. The method according to claim 7, characterized in that the gripping unit comprises a first queue and a second queue; the first queue stores the URLs of the current depth, and the second queue stores the URLs of the next depth; the customization operations include the gripping depth; carrying out the gripping comprises:

10. The method according to claim 8, characterized in that there are one or more gripping tasks; when there is more than one gripping task, the gripping tasks run in parallel and are relatively independent, and each gripping task maintains its own gripping status.

11. A network resource gripping device, characterized in that it comprises a gripping unit,

12. The network resource gripping device according to claim 11, characterized in that the URLs of the next depth stored in the second queue are obtained while gripping the network resources at the URLs in the first queue.

13. The network resource gripping device according to claim 11 or 12, characterized in that the gripping unit further comprises a list, used to store all the URLs obtained in the current gripping process together with their gripping status information.