CN105045838A

CN105045838A - Network crawler system based on distributed storage system

Info

Publication number: CN105045838A
Application number: CN201510377049.3A
Authority: CN
Inventors: 贺樑; 黄保荃; 杨燕
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2015-07-01
Filing date: 2015-07-01
Publication date: 2015-11-11

Abstract

The present invention discloses a network crawler system based on a distributed storage system. The system comprises a basic service module, a grabber, as well as a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module, which are all arranged in the grabber, wherein the task scheduling module controls a procedure of grabbing data by the grabber; the parsing service module parses the content of a webpage and provides user-defined template extraction information; the page downloading module downloads a source code of the webpage; the page updating module acquires data information of the updated webpage; the data storage module, with a structural information extraction method, stores the extracted content into a database of the distributed storage system; and the basic service module completes flow control of the grabber, a monitoring and warning mechanism of the grabber, a URL deduplication service, a URL normalization service and a js/css resource management service. The network crawler system based on the distributed storage system has the characteristics that a crawler method is flexible and intelligent and automatic structural extraction of webpage content information is realized.

Description

Network crawler system based on a distributed storage system

Technical field

The invention belongs to the technical field of computer data mining and search, and relates to a network crawler system based on a distributed storage system and to a method of structured information extraction.

Background art

With the development and spread of Internet technology, Web resources have grown explosively, and webpages have become an important source of information in daily life. Internet resources are diverse, open, dynamic and heterogeneous, and cannot be managed in a unified way, which makes it difficult for people to find the information they need quickly and accurately. The heterogeneity of Internet resources also makes it difficult to obtain structured information.

Summary of the invention

The object of the invention is to provide, in view of the deficiencies of the prior art, a network crawler system based on a distributed storage system that can find the required information quickly and accurately.

The specific technical scheme for achieving the object of the invention is as follows:

A network crawler system based on a distributed storage system, characterized in that the system comprises a basic service module, a grabber, and a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module arranged in the grabber. The task scheduling module controls the flow by which the grabber captures data; the parsing service module parses the content of a webpage and provides user-defined templates for extracting information; the page downloading module downloads the source code of a webpage and supports loading pages that use JavaScript and Ajax as well as forms that are loaded asynchronously and dynamically; the page updating module acquires the data of a webpage after it has been updated; the data storage module, using the structured information extraction method, stores the extracted content into a database of the distributed storage system; the basic service module provides flow control of the grabber, a monitoring and alarm mechanism for the grabber, a URL deduplication service, a URL normalization service and a js/css resource management service.
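As a non-limiting illustration of the URL normalization service and the URL deduplication service named above, a minimal sketch is given below. The source does not specify the normalization rules; the rules chosen here (lower-casing the scheme and host, dropping the fragment, the default port and any user information, and using "/" for an empty path) and the names `normalize_url` and `UrlDeduplicator` are assumptions made only for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports; a port equal to the default is dropped.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` under the assumed rules."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class UrlDeduplicator:
    """In-memory deduplication keyed on the normalized URL."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

In a multi-server deployment the in-memory set would presumably be replaced by a shared store; that choice is outside what the source describes.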

The structured information extraction method comprises:

A) A vector space model algorithm based on a constructed dictionary, with the following specific steps:

1) According to the domain whose data the user wants to capture, build a dictionary of keywords for that domain in advance; the dictionary is regarded as an m-dimensional term vector, denoted β_m;

2) Using a word segmentation tool, split the text content of the webpage into individual words;

3) Count the number of times each word occurs in the dictionary; the more often a word occurs, the higher its relevance;

4) Each word has a corresponding count; each word is treated as one dimension, so that a webpage with n words is regarded as an n-dimensional vector;

5) Each webpage thus yields an n-dimensional term vector, denoted α_n;

6) The relevance between the webpage and the dictionary is calculated by the following formula:

7) Webpages whose relevance S(α_n, β_m) is greater than a preset threshold θ are put into the queue awaiting structured extraction;
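The relevance formula of step 6) is not reproduced in this text. As a non-limiting illustration of steps 1) to 7), the sketch below assumes cosine similarity, a common choice for vector space models, computed over the union of the webpage vocabulary and the dictionary keywords, with the dictionary represented as a binary vector. The function names `tokenize`, `relevance` and `is_candidate`, and the regular-expression tokenizer standing in for a word segmentation tool, are assumptions for this example only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Stand-in for a word segmentation tool: lower-cased alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(page_text, dictionary):
    """Assumed S(alpha, beta): cosine similarity of page counts and a binary dictionary vector."""
    keywords = {k.lower() for k in dictionary}
    counts = Counter(tokenize(page_text))          # alpha: word -> occurrence count
    dot = sum(counts[k] for k in keywords)         # beta is 1 on keywords, 0 elsewhere
    page_norm = math.sqrt(sum(c * c for c in counts.values()))
    dict_norm = math.sqrt(len(keywords))
    if page_norm == 0 or dict_norm == 0:
        return 0.0
    return dot / (page_norm * dict_norm)


def is_candidate(page_text, dictionary, theta):
    """Step 7): keep the page only if its relevance exceeds the threshold theta."""
    return relevance(page_text, dictionary) > theta
```

Under this assumption, words of the webpage that are absent from the dictionary lower the score, so an off-topic page falls below the threshold even if it mentions a keyword once.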

B) Building the webpage template, with the following specific steps:

1) Analyze the html structure of the webpage; the webpage comprises a head tag and a body tag;

2) Within the head tag, the target tag is the title tag;

3) Within the body tag, the target tags are the p tag, the a tag and the form tag;

4) Combine the above target tags to complete the template of the webpage;
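As a non-limiting illustration of steps 1) to 4), a template may be sketched as a mapping from the head and body sections to their target tags, applied with the standard-library HTML parser. The names `TEMPLATE`, `TemplateExtractor` and `extract` are assumptions for this example; the source does not disclose the concrete template format.

```python
from html.parser import HTMLParser

# Assumed template representation: section tag -> target tags inside it.
TEMPLATE = {"head": {"title"}, "body": {"p", "a", "form"}}


class TemplateExtractor(HTMLParser):
    """Collects the text found directly inside each target tag of the template."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.section = None      # "head" or "body" while inside that section
        self.open_tags = []      # stack of target tags currently open
        self.fields = {t: [] for tags in template.values() for t in tags}

    def handle_starttag(self, tag, attrs):
        if tag in self.template:
            self.section = tag
        elif self.section and tag in self.template[self.section]:
            self.open_tags.append(tag)
            self.fields[tag].append("")

    def handle_endtag(self, tag):
        if tag in self.template:
            self.section = None
        elif self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Text is credited to the innermost open target tag only.
        if self.open_tags:
            self.fields[self.open_tags[-1]][-1] += data


def extract(html, template=TEMPLATE):
    parser = TemplateExtractor(template)
    parser.feed(html)
    return parser.fields
```

This sketch does not repair malformed markup such as unclosed p tags; a production template would likely need a more tolerant parser.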

C) Structured information extraction, with the following specific steps:

The webpage template built in B) is used to extract the information from the webpages to be extracted that were obtained by the vector space model algorithm of A); structured data are finally obtained and stored in the database of the distributed storage system in xml format.
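As a non-limiting illustration of storing the extracted content in xml format, the sketch below serializes a mapping of tag names to extracted strings with the standard library. The element layout (a `page` root carrying a `url` attribute) and the name `fields_to_xml` are assumptions; the source does not give the xml schema or the database interface, so writing to the distributed database is omitted.

```python
import xml.etree.ElementTree as ET


def fields_to_xml(url, fields):
    """Serialize extracted fields (tag name -> list of strings) to an xml string."""
    root = ET.Element("page", attrib={"url": url})
    for tag, values in sorted(fields.items()):
        for value in values:
            child = ET.SubElement(root, tag)
            child.text = value.strip()
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")
```

The resulting string can be parsed back with `ET.fromstring`, which makes it straightforward to verify a record before it is stored.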

The flow of capturing data is as follows:

1) The user provides seed URLs as the entry points for capturing Internet webpages;

2) The user customizes the basic service configuration file, which covers: flow control, URL filtering rules, the URL deduplication service, URL normalization, the browser engine and the monitoring service;

3) According to the URL filtering rules and the URL deduplication service, qualified URLs are selected for capture;

4) The page downloading module downloads, one by one, the pages corresponding to the qualified URLs; multi-threaded concurrency is used to speed up page downloading; the page downloading module calls the browser engine of the basic service, and the browser engine loads and renders JavaScript/Ajax webpages by calling the Chrome kernel, ensuring that the downloaded page data are complete;

5) The task scheduling module starts the parsing engine module, and the parsing engine module puts the external-link URLs that satisfy the filtering rules into the queue to be crawled;

6) The task scheduling module starts the data storage module; the data storage module uses the structured information extraction method and stores the extracted content in the database of the distributed storage system.
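As a non-limiting illustration of steps 1), 3) and 5) of the flow above, the sketch below runs a breadth-first crawl from seed URLs, applying a filtering rule and deduplication before a link enters the queue. The names `crawl`, `fetch` and `allow`, the regular expression used to find links, and the `max_pages` limit are assumptions for this example. Downloading is sequential here for clarity, whereas step 4) describes multi-threaded downloading through a browser engine.

```python
import re
from collections import deque

# Simplified link finder (assumption): double-quoted absolute href values only.
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seeds, fetch, allow, max_pages=100):
    """Breadth-first crawl.

    seeds: iterable of seed URLs (step 1).
    fetch: callable url -> html string, or None on failure (stands in for step 4).
    allow: callable url -> bool implementing the URL filtering rule (step 3).
    """
    queue = deque(seeds)
    seen = set(queue)            # URL deduplication
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in LINK_RE.findall(html):
            if allow(link) and link not in seen:   # step 5: enqueue qualified links
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` as a parameter keeps the loop testable without network access; a real downloader would be supplied in its place.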

Compared with the prior art, the present invention has the following advantages:

(1) The efficiency of the web crawler is improved. Multithreading increases crawling concurrency, and the structured information extraction method improves the efficiency of extracting webpage information.

(2) The present invention proposes the use of a vector space model method, which makes it possible to crawl webpages on a given topic.

(3) The present invention combines the vector space model with the template method to extract webpage information intelligently and automatically.

(4) The present invention is easy to operate and low in cost. Only the configuration file of the system and several Linux servers are needed to crawl webpage data on the order of millions of pages.

Brief description of the drawings

Fig. 1 is the system architecture diagram of the present invention;

Fig. 2 is the flow chart of the structured information extraction method of the present invention;

Fig. 3 is the system flow chart of the present invention.

Embodiment

The present invention is described in detail below, taking the crawling of basic information about Youku series as an example.

Youku

1) http://movie.youku.com and http://tv.youku.com serve as the entry points for the system's crawl of Youku.

2) Customize the configuration file of the system: flow control, URL filtering rules, the URL deduplication service, URL normalization, the browser engine and the monitoring service.

3) Youku limits the access frequency of any single IP address. The local grabber therefore works through a proxy and switches to a different IP address every hour, so that the Youku website does not refuse the system's capture requests.

4) The crawl tasks of the system are ordered from highest to lowest priority as follows: crawling of Youku URLs, crawling of Youku basic information pages, crawling of Youku playback information pages, crawling of Youku per-episode playback information, crawling of Youku comment information pages, and crawling of Youku incremental pages.

5) Starting from the seed URLs http://movie.youku.com and http://tv.youku.com, the system discovers the URLs of outgoing links, reads the URL filtering rules in the Youku configuration file, and filters out the URLs that do not satisfy the filtering rules. The URLs that satisfy the filtering rules are put into the queue to be crawled.

6) Download the page of each URL.

7) Build the keyword dictionary for the video domain. The dictionary is built from the 75 domain lexicons provided by Datatang ("Data Hall"); the dictionary is regarded as an m-dimensional term vector, denoted β_m.

8) Segment the page content of each URL, splitting the webpage content into individual words.

9) Count the total number of times each word of the webpage occurs in the dictionary.

10) Each word has a corresponding count; each word is treated as one dimension, so that a webpage with n words is regarded as an n-dimensional vector, denoted α_n.

11) Calculate the relevance between the webpage and the dictionary:

12) The threshold θ preset in the system is 0.4. A webpage whose relevance calculated in step 11) is greater than 0.4 is put into the queue awaiting structured extraction.

13) The present embodiment crawls only the Youku series basic information webpage, so only the template for the basic information webpage is built. Note: a production environment needs more than one template, for example a template for the Youku series basic information page, a template for the playback information page, a template for the per-episode playback information page, a template for the comment information page, a template for the Youku aggregate index page, and so on.

14) Template for the Youku series basic information webpage:

15) Using the template of the previous step, the system automatically matches the webpage against the template, stores the matched information in an xml file, and stores the xml file in the database of the distributed storage system. A sample of the structurally extracted data is as follows:

16) Through the above steps, the Youku series basic information webpages can be crawled.
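As a non-limiting illustration of the task priority order of step 4), the sketch below keeps crawl tasks in a heap so that higher-priority task types are always served first, with first-in first-out order among tasks of the same type. The task type identifiers, the numeric priorities and the class name `TaskQueue` are assumptions for this example; the source states only the relative order.

```python
import heapq
import itertools

# Assumed identifiers for the six task types, 0 being the highest priority.
PRIORITY = {
    "url_discovery": 0,
    "basic_info": 1,
    "playback_info": 2,
    "episode_playback": 3,
    "comments": 4,
    "incremental": 5,
}


class TaskQueue:
    """Priority queue of (task type, URL) pairs."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker preserving insertion order

    def push(self, kind, url):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), kind, url))

    def pop(self):
        _, _, kind, url = heapq.heappop(self._heap)
        return kind, url

    def __len__(self):
        return len(self._heap)
```

For example, a comment page pushed before a basic information page is still popped after it, because `basic_info` has the smaller priority number.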

Claims

1. A network crawler system based on a distributed storage system, characterized in that the system comprises a basic service module, a grabber, and a task scheduling module, a parsing service module, a page downloading module, a page updating module and a data storage module arranged in the grabber, wherein the task scheduling module controls the flow by which the grabber captures data; the parsing service module parses the content of a webpage and provides user-defined templates for extracting information; the page downloading module downloads the source code of a webpage and supports loading pages that use JavaScript and Ajax as well as forms that are loaded asynchronously and dynamically; the page updating module acquires the data of a webpage after it has been updated; the data storage module, using a structured information extraction method, stores the extracted content into a database of the distributed storage system; and the basic service module provides flow control of the grabber, a monitoring and alarm mechanism for the grabber, a URL deduplication service, a URL normalization service and a js/css resource management service.

2. The network crawler system according to claim 1, characterized in that the structured information extraction method comprises: