CN102609456A - System and method for real-time and smart article capturing - Google Patents
Abstract
A system for real-time, intelligent article crawling comprises a real-time crawling module, a web page extraction system, a near-duplicate document removal module, an automatic document classification module, and an article publishing module. The real-time crawling module further comprises seven online sub-modules, namely a job extraction module, a job parsing module, a job crawl-time-range checking module, a job crawl-interval checking module, a job scheduling module, a job downloading module, and a job crawl-frequency adjustment module, and three offline sub-modules, namely a job crawl-time-range discovery module, a job crawl-interval discovery module, and a free-proxy collection and verification module. The intelligent extraction system for article-type web pages comprises a module for loading pages awaiting extraction, a wrapper query module, a web page extraction module, a module for collecting pages that failed extraction, a learning decision module, a web page learning module, and an extraction wrapper management module.
Description
Technical field
The present invention relates to the fields of crawling technology, web mining, information extraction, and natural language processing within Internet technology. It can be applied wherever articles must be crawled accurately, in real time, and at large scale, such as portal websites and search-engine websites.
Background art
The present invention also offers advantages that traditional crawling systems lack:
Through same-site learning, non-article pages on a website, such as channel pages, special-topic pages, list pages, and advertising pages, can be filtered out automatically;
Crawled articles are deduplicated by near-duplicate document detection;
Crawled articles can be semantically analyzed: classified automatically, with summaries and keywords generated automatically;
Paging sequences of up to 50 pages per article can be located accurately, and the paged content merged in order;
The crawl scope of a website can be configured flexibly, supporting crawling of the articles under one or more list areas on a site, a channel, or any page.
In practice, the articles reprinted by this crawling system are of high quality and can be published directly to users. The system adapts automatically to template changes across the thousands of websites it crawls, greatly reducing the manual effort crawling requires, substantially improving the news coverage and timeliness of portal websites, and lowering their labor costs.
The invention has application scenarios in all portal-type websites, where it can effectively improve news coverage and timeliness while reducing labor costs, and it can likewise be used in news search engines.
The information extraction field today has many technical solutions, whose common core problem is how to generate and maintain extraction wrappers. The existing techniques fall mainly into two categories:
1) Extraction systems whose wrappers are generated automatically by machine can crawl articles in bulk, but cannot extract articles precisely, so the usability of the crawled articles is low;
2) Extraction systems whose wrappers are generated by hand extract articles precisely, but generating and maintaining wrappers for thousands of Internet websites can only be done with enormous manual effort.
The extraction module of the present invention is built around a self-developed method, "automatic article extraction based on same-site learning and automatic rule generation," which solves both of the above problems well.
In practice, the present technical scheme achieves automatic machine generation and maintenance of extraction wrappers, so that extraction no longer requires large amounts of manual work; at the same time it achieves precise article extraction, with very little redundancy or omission in the results, giving very high usability.
The technical terms used in the present invention are explained as follows:
Extraction wrapper: web page information extraction is a branch of information extraction, and wrapper generation for web page information extraction has developed into a relatively independent field. A wrapper is a program composed of a set of extraction rules and the computer code that applies those rules, dedicated to extracting the needed information from a specific information source and returning the results;
Article extraction based on same-site learning and automatic rule generation: the automatic wrapper-generation method of the present invention, which can extract article information from web pages accurately and intelligently;
Same-site learning: taking the website as the unit, collecting a sufficient number of pages from one site and performing machine statistical learning on them together, then deriving the needed rules from the result;
Crawler (or fetching crawler): the module in the crawling system that is solely responsible for downloading pages;
The extraction wrapper developed for this system comprises two libraries:
Style tree (path) library:
A collection of styles. A style is the path, together with its weight information, constructed by walking upward in the DOM tree from a given DOM node until the body node is reached. In the library, paths are organized per website; identical paths are merged, and their frequency is recorded as the weight;
Pattern library:
A "pattern" here comprises
1) a signature for each segment produced by the segmentation step of the method:
pattern = md5((content: text/img) + pre-order traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e. the pattern's frequency of occurrence;
2) a regular expression generated automatically by statistical learning over these segments:
pattern = regular expression.
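The segment signature above can be sketched in Python. The field separators, function name, and dict layout below are illustrative assumptions, not the patent's exact encoding:

```python
import hashlib

def build_pattern(text_or_img, tag_sequence, site_name, value=1):
    """Build a segment signature for the pattern library: an MD5 over the
    segment content, the pre-order traversal sequence of its tags, and the
    site name, plus a value field carrying the occurrence frequency."""
    raw = f"{text_or_img}|{'>'.join(tag_sequence)}|{site_name}"
    key = hashlib.md5(raw.encode("utf-8")).hexdigest()
    return {"key": key, "value": value}

p = build_pattern("hot news today", ["div", "p", "a"], "example.com")
```

Because the key hashes content, tag structure, and site together, an identical segment on the same site always maps to the same library entry, which is what makes merge-on-insert possible.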
Proxy technology:
Proxy technology means that after a proxy server receives a client request, it checks and verifies the request's legitimacy; if legitimate, the proxy fetches the required information as if it were a client and forwards it back to the client;
Real-time crawling:
A crawling technique that emphasizes timeliness. The goal is to fetch content in real time, as soon as the source site publishes an update.
Summary of the invention
The present invention solves the problems described above.
The real-time intelligent article crawling system according to the present invention comprises a real-time crawling module, a web page extraction system, a near-duplicate document removal module, an automatic document classification module, and an article publishing module.
The real-time crawling module comprises online and offline sub-modules. The online sub-modules are as follows:
Job extraction module: extracts jobs in turn from the job set;
Job parsing module: parses each job; the parse result forms a set of attributes and rules;
Job crawl-time-range checking module: queries the job's time-range parameter; if the range does not include the current time, the job is skipped without crawling; otherwise, proceed to the crawl-interval check;
Job crawl-interval checking module: queries the job's crawl interval; if the interval specifies a next crawl time later than the current time, the job is skipped without crawling; otherwise, proceed to crawl;
Job scheduling module: schedules the job according to the other attributes obtained by the parsing module. At scheduling time, if the job existed before, it is not reassigned and keeps its original server; otherwise, the server with the fewest jobs in the cluster is selected, balancing the crawl load across servers and thereby optimizing overall crawl speed; at the same time, the scheduler avoids placing too many same-site jobs on one server, so that a single server does not put too much crawl pressure on a single website;
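The scheduling policy just described, sticky assignment for existing jobs, least-loaded server otherwise, capped per-site concentration, can be sketched as follows. The data shapes and the cap value are illustrative assumptions:

```python
from collections import Counter

def schedule(job, assignments, job_counts, site_counts, same_site_cap=3):
    """Pick a server for a job: an existing job keeps its server; a new job
    goes to the least-loaded server among those not already running too many
    jobs for this job's site."""
    if job["id"] in assignments:          # existing job keeps its original server
        return assignments[job["id"]]
    candidates = [s for s in job_counts
                  if site_counts[s][job["site"]] < same_site_cap]
    if not candidates:                    # every server is at the same-site cap
        candidates = list(job_counts)
    server = min(candidates, key=lambda s: job_counts[s])
    assignments[job["id"]] = server
    job_counts[server] += 1
    site_counts[server][job["site"]] += 1
    return server
```

The same-site cap is what prevents one server from concentrating all of a single website's jobs and overloading that site.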
Job download module: performs the actual download. It takes an appropriate number of proxies from the proxy pool, typically 5; if no usable proxy is available, crawling proceeds without one. The no-proxy option is merged with the 5 proxies to form the proxy list; according to the parameters obtained from parsing, one entry is chosen from the list at random for this round of downloading;
Job crawl-frequency adjustment module: based on the job's base crawl interval, if the current round found an update, the next round's interval is shortened at random; if no update was found, the next round's interval is lengthened at random; in either case the interval is kept within [0.5, 2] times the job's base interval;
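The frequency adjustment above, shrink on update, grow otherwise, clamped to [0.5, 2] times the base, can be sketched directly. The specific random scaling factors are illustrative assumptions; only the clamp range comes from the text:

```python
import random

def adjust_interval(base, current, found_update):
    """Adaptive crawl-interval adjustment: shrink the interval at random when
    an update was found this round, lengthen it otherwise, and clamp the
    result to [0.5, 2] times the job's base interval."""
    factor = random.uniform(0.6, 0.9) if found_update else random.uniform(1.1, 1.5)
    return min(max(current * factor, 0.5 * base), 2.0 * base)
```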
The real-time crawling module also comprises the following offline sub-modules:
Job crawl-time-range discovery module: intelligently analyzes historical crawl logs and derives from them the active time range of each job;
Job crawl-interval discovery module: reads the previous day's crawl log, analyzes every round of each job over the day, and derives each job's update behavior. If the job's current base interval already lets more than 50% of rounds catch an update, it is left unchanged; otherwise the base interval is enlarged appropriately, to reduce pointless crawl requests;
Free-proxy collection and verification module: downloads the day's free proxy lists from proxy-sharing sites on the Internet and verifies them. A verification URL set is formed by sampling historical crawl links at random; each proxy is used to fetch these URLs several times; proxies that fail to crawl or crawl too slowly are weeded out, and those with high success rates and good speed are kept to form the day's proxy pool, which supports the online crawling.
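The keep-or-discard decision for a verified proxy can be sketched as a pure filter over the trial results. The input shape and the exact thresholds are illustrative assumptions; the text specifies only "high success rate" and "good speed":

```python
def filter_proxies(trials, min_success=0.8, max_avg_secs=3.0):
    """Given per-proxy trial results as {proxy: (successes, total, total_secs)},
    keep only proxies with a high enough success rate and a low enough
    average fetch time, forming the day's proxy pool."""
    kept = []
    for proxy, (ok, total, secs) in trials.items():
        if total and ok / total >= min_success and secs / total <= max_avg_secs:
            kept.append(proxy)
    return kept
```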
The web page extraction system of the present invention is an intelligent extraction system for article-type web pages, comprising:
(1) a page-loading module, which periodically queries the local index and loads newly indexed pages into the system's memory;
(2) a wrapper query module, which, for each page awaiting extraction, queries for a concrete extraction wrapper; if one is found, the page enters the extraction module for concrete extraction; otherwise, the page is marked as having failed extraction;
(3) a web page extraction module, which extracts the concrete article information from a page using an existing extraction wrapper;
(4) a failed-page collection module, which collects the pages that failed extraction in this round, grouped by website, so that learning can be focused;
(5) a learning decision module, which queries the failed pages by site and, from each site's failure count, computes the site's extraction success/failure ratio for this round to decide whether to enter the web page learning module;
(6) a web page learning module, which performs machine learning on all failed pages and finally generates a new extraction wrapper;
(7) an extraction wrapper management module, which manages the system's extraction wrappers, i.e. the path library and the pattern library; it provides a wrapper-use interface to the web page extraction module and a wrapper-update interface to the web page learning module.
The web page extraction module further comprises:
an HTML parsing module, which parses the incoming page's HTML and builds a DOM tree;
a body-region locating module, which locates the body (main text) region according to the wrapper information;
an article header and paging extraction module, used to extract the article header and the article's paging information;
a body-region correction module, used to correct the body region;
a body-region blocking module, used to partition the body region into blocks, while judging block properties and removing redundant blocks;
a segmentation and filtering module, used to segment the body region and filter blocks;
a data assembly module, used to merge and organize the information into the final article-type result.
The web page learning module further comprises:
an HTML parsing module, which parses the incoming page's HTML and builds a DOM tree;
a body-region locating module, used to locate the body region;
a path library update module, used to insert paths into the library with merging, while also tidying the path library;
an article header and paging extraction module, used to extract the article header and paging information;
a body-region correction module, used to correct the body region;
a body-region blocking module, used to partition the body region into blocks, while judging block properties and removing redundant blocks;
a pattern learning module, which segments the body region, builds a pattern for each block, and inserts it into the pattern library with merging;
a pattern induction module, which induces over all patterns, generates rules, and inserts them into the pattern library with merging;
a wrapper tidying module, which tidies the system's wrappers and removes invalid information.
The body-region blocking module further comprises:
a frequent-pattern recognition module, which identifies frequent patterns using the MDR method;
a blocking module, which, for each frequent pattern found, locates block heads and searches upward for block parent nodes to obtain the best combination of blocking nodes, then combines them into blocks;
a block marking module, which marks all identified blocks in the body region's DOM tree.
The present invention also provides a real-time intelligent article crawling method, comprising a real-time crawling step, a web page extraction step, a near-duplicate document removal step, an automatic article classification step, and an article publishing step.
The real-time crawling step comprises online and offline sub-steps, wherein:
the online sub-steps comprise:
Step 4: interval check. Query the job's crawl interval; if the interval specifies a next crawl time later than the current time, do not crawl and return to step 1; otherwise, continue to the next step;
Step 7: crawl-frequency adjustment. Based on the job's base crawl interval, if this round found an update, shorten the next round's interval at random; if no update was found, lengthen the next round's interval at random; in either case keep the interval within [0.5, 2] times the base. When the adjustment is done, return to step 1 and repeat the whole flow;
The real-time crawling step also comprises the following offline sub-steps:
The web page extraction step also describes an intelligent extraction method for article-type web pages, comprising the steps:
Step 4: mark extraction failure. Mark the failed pages and collect them so that step 6 is convenient, then go to step 2;
Step 7: web page learning. Learn from all failed pages of each website and generate new extraction wrappers;
Step 9: end.
The core links of the intelligent extraction method for article-type pages are the extraction link and the learning link. The extraction link, i.e. step 3 above, comprises the following steps:
Step 3.1: HTML parsing. Parse the incoming page's HTML and build a DOM tree.
First preprocess the HTML, including character-encoding conversion, script/style filtering, and removal of invisible characters; then, following the HTML code and the HTML standard, use the HtmlParser component to parse the page and obtain the DOM tree;
Step 3.2: locate the body region. Query the style tree of the extraction wrapper for this website's location path to obtain the body-region path, then traverse the DOM tree along that path to reach the concrete DOM node; that node is the body region we are looking for;
Step 3.3: extract the article header and paging information. The article header is mainly the article title; the extraction steps comprise:
(1) take the first few "lines" inside the body region, compute the title matching degree of each, and take the maximum, yielding the candidate title line inside the region. A "line" here means one of the groups of adjacent DOM nodes formed by splitting the whole page's DOM tree at HTML line-break tags such as <br> and <P>, together with its corresponding HTML code;
(2) take the few lines just before the body region, compute their title matching degrees, and take the maximum, yielding the candidate title line before the region;
(3) then compare the two candidates by heuristic rules and title matching degree, and choose one as the title.
The title matching degree is measured by the following formula:
P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summary/len_max_summary) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len
where:
len_punc is the length of punctuation in the line;
len_all is the total text length of the line;
len_title is the edit distance between the line's content and the page's title field;
len_max_title is the maximum of the lengths of the line content and the page's title field;
keywords refers to the keyword information carried by the page, summary refers to the abstract field carried by the page, and authortext refers to the anchor text corresponding to the page's URL; these three groups of variables have meanings analogous to the above;
WH is a tag-type weight: if tags such as h1, h2, ..., center appear among the nodes under the line, the node is weighted up;
H_len is a node content-length weight: large-scale statistics show that title lengths between 16 and 23 characters are the most common, and every other length interval has its own probability, from which the node's length weight is computed;
a, b, c, d, e, f, g are influence factors for the respective terms and can be tuned in application.
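A minimal sketch of the scoring formula above, keeping only the punctuation and edit-distance terms; the tag, keyword, summary, and length-weight terms are omitted, and the default weights are assumptions:

```python
import string

def title_match_degree(line, page_title, weights=None):
    """Simplified title-matching-degree score: the punctuation-ratio term
    plus the normalized edit-distance term from the formula above."""
    w = weights or {"a": 0.3, "b": 0.7}
    a, b = w["a"], w["b"]
    len_all = max(len(line), 1)
    len_punc = sum(ch in string.punctuation for ch in line)
    len_title = edit_distance(line, page_title)
    len_max_title = max(len(line), len(page_title), 1)
    return a * (1 - len_punc / len_all) + b * (1 - len_title / len_max_title)

def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]
```

A line identical to the page's title field with no punctuation scores the maximum; a line of pure punctuation scores near zero, which is the intended ranking behavior.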
Article paging information is recognized by scanning the last few lines of the body region, line by line, for number sequences. If a consecutive number sequence such as "1, 2, 3, 4, 5, 6, ..." is found, and the URL links those numbers carry belong to the same website as the page, recognition succeeds.
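The paging check above can be sketched as follows. The input shape, each trailing line as a (text, link-hosts) pair, is an assumption about how the line data would be prepared:

```python
import re

def find_paging(lines, page_host):
    """Scan trailing body-region lines for a run of consecutive integers
    whose links all share the page's own host; return the page numbers
    on success, None otherwise."""
    for text, hosts in reversed(lines):
        nums = [int(n) for n in re.findall(r"\d+", text)]
        consecutive = (len(nums) >= 2 and
                       all(b - a == 1 for a, b in zip(nums, nums[1:])))
        same_site = hosts and all(h == page_host for h in hosts)
        if consecutive and same_site:
            return nums
    return None
```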
Step 3.4: correct the body region.
With the help of the cues in news-article layout, and combining the article header found in the step above with the article-tail information (paging), the body region can be corrected to be as accurate as possible:
1) when the article header (title, time, etc.) is found before the region, correct the body region as follows:
if the article header falls inside the region, cut off the in-region material before it;
if the article header is outside the region, merge the intervening part into the body region;
2) when article-tail information (paging, etc.) is found after the end of the region:
if the article tail falls inside the region, cut off the in-region trailing part;
if the article tail is outside the region, make no correction.
Step 3.5: partition the body region into blocks. This comprises blocking, block-property judgment, and redundant-block removal. The blocking step is as follows:
Step 3.5.1: identify frequent patterns using the MDR method (proposed by Bing Liu);
Step 3.5.2: for each pattern found, locate block heads and search upward for block parent nodes to obtain the best combination of blocking nodes, then combine them into blocks;
Step 3.5.3: mark all identified blocks in the body region's DOM tree.
Meanwhile, the following criteria are applied when building the block tree:
(1) among all children of the same parent, the nodes between two marked blocks also form a block, the nodes before the first block form a block, and the nodes after the last block form a block;
(2) if a node's subtree contains a marked block, the node itself is also a block.
The block-property judgment and redundant-block removal for the body region are performed concretely as follows:
(1) for each block obtained, compute the ratio of its link text to its total text;
(2) if a block's link ratio exceeds a threshold (0.5), the block is considered redundant and is removed from the tree, replaced with an hr tag;
(3) the remaining blocks identified from frequent patterns carry clear semantic information, so they are marked to prevent them from being split in subsequent operations (for example, a TV-schedule table).
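The redundancy test above is a one-line ratio check; only the 0.5 threshold comes from the text:

```python
def is_redundant(block_text, link_text, threshold=0.5):
    """Redundant-block test: a block whose link text makes up more than
    half of its total text is treated as navigation or advertising noise
    and removed from the block tree."""
    total = len(block_text)
    return total > 0 and len(link_text) / total > threshold
```

A block that is mostly anchor text (a link bar, a "related articles" list) trips the threshold, while an article paragraph containing one link does not.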
Step 3.6: segment and filter the body region.
The block tree of the body region is segmented, yielding the body's segment sequence. Segmentation is done because observation shows that redundant information always appears in the form of one segment or several segments, so segmenting the body makes removing redundant information convenient in subsequent operations.
The segments of the body region are then filtered block by block:
(1) generate patterns. For every segment, extract its HTML code, simplify the HTML fragment so that only tag names and content remain, take the MD5 key, and build the pattern:
pattern = md5((content: text/img) + pre-order traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e. the pattern's frequency of occurrence;
(2) filter blocks. The patterns obtained are looked up in the wrapper's pattern library and merged on insertion:
if an identical pattern is found in the library, the pattern is weighted up, i.e. the value fields are merged;
if no identical pattern is found, the pattern is inserted.
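The merge-on-insert behavior above is an upsert keyed by the pattern's MD5 signature; representing the library as a dict from key to value (frequency) is an assumption:

```python
def merge_pattern(library, key, value=1):
    """Insert-or-merge for the pattern library: an existing pattern has its
    value (occurrence frequency) increased by the new value; a new pattern
    is simply inserted. Returns the pattern's current weight."""
    library[key] = library.get(key, 0) + value
    return library[key]
```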
Step 3.7: data assembly and result generation. Merge and organize the information, generate the summary, etc.; extraction succeeds;
Step 3.8: extraction ends.
The learning link, i.e. step 7 above, comprises:
Step 7.1: HTML parsing. Parse the incoming page's HTML and build a DOM tree;
Step 7.2: locate the body region, through the body-region recognition method.
The purpose of locating the body region is to find a reasonable candidate area for the body text first, which narrows the DOM-tree scope the method must process and at the same time reduces the method's error probability.
The body text is contained in one or more nested Div or Table nodes, so body localization is exactly the search for the right such Div or Table. It is realized by choosing the Div or Table with the highest information degree, computed by the formula:
H = len_not_link * log(1 + len_link/len_allTxt) + a * len_not_link * log(1 + len_not_link/len_html)
where:
a is an influence factor, currently defaulting to 0.5;
len_not_link is the length of non-link text in the node;
len_link is the length of link text in the node;
len_allTxt is the total text length of the node;
len_html is the HTML length of the node;
the +1 added inside each log keeps every log result greater than 0.
After the wanted Div or Table is found, trace back in the DOM tree to the body node; when the trace ends, a path has been formed. During the trace, the positional information of each DOM node passed through, i.e. its left-to-right ordinal among its parent's children, is also recorded.
Finally, a DOM-tree path is obtained, whose nodes each carry their positional information.
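The information-degree formula above transcribes directly into code; nothing beyond the formula itself is assumed here:

```python
import math

def information_degree(len_not_link, len_link, len_all_txt, len_html, a=0.5):
    """Information degree H used to pick the body Div/Table: both terms are
    scaled by the non-link text length, and the +1 inside each log keeps
    the log results positive."""
    return (len_not_link * math.log(1 + len_link / len_all_txt)
            + a * len_not_link * math.log(1 + len_not_link / len_html))
```

Because both terms scale with len_not_link, a prose-heavy node outscores a link-heavy navigation node of similar size, which is what steers the search toward the article body.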
Step 7.3: path insertion and merging. The path above is incorporated into the system wrapper's path library, merging with weighting on insertion:
if a duplicate path is found, merge with weighting, i.e. revise the DFS field by adding the new path's DFS value to the old path's;
if no duplicate is found, simply insert the new path;
Step 7.4: extract the article header and paging information. The title extraction procedure, the title-matching-degree measurement formula together with its variables and influence factors, and the paging recognition method are all identical to those described in step 3.3 of the extraction link;
Step 7.5: correct the body region. The correction rules, using the article header and article-tail (paging) information to trim or extend the region, are identical to those of step 3.4;
Step 7.6: partition the body region into blocks, comprising blocking, block-property judgment, and redundant-block removal. The blocking steps 7.6.1 to 7.6.3, the block-tree criteria, and the block-property judgment and redundant-block removal methods are identical to those of step 3.5;
Step 7.7: pattern learning, comprising two major steps, body-region segmentation and pattern-by-pattern learning.
First the body region is segmented: the block tree of the body region is segmented into the body's segment sequence, for the same reason as in step 3.6, since redundant information appears in the form of segments, segmenting the body makes it convenient to remove.
After segmentation, a pattern is generated for every segment: extract its HTML code, simplify the HTML fragment so that only tag names and content remain, take the MD5 key, and build
pattern = md5((content: text/img) + pre-order traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e. the pattern's occurrence count.
Then learn pattern by pattern: each pattern obtained is put into the wrapper's pattern library with merge-on-insert. If an identical pattern is found in the library, the pattern is weighted up, i.e. the value fields are merged; if not found, the pattern is simply inserted;
Step 7.8: pattern induction, i.e. automatic regular-expression generation. The concrete steps are as follows:
Step 7.8.1: take the original strings of all patterns in the library, group them by website, and cluster each group by string similarity, obtaining several highly cohesive clusters;
Step 7.8.2: within each cluster, compute the merged regular expression for every pair of segments, obtaining all possible distinct regular expressions. Rank these by frequency of occurrence and take the most frequent one; then examine the second most frequent, and if it covers part of the cluster with suitable coverage and weight, it is also kept as a desirable pattern. Extracting the pattern of two segments works by recursively finding the optimal common fragments of the two segments' remainders; the parts before each common fragment are exactly the differing parts that must be merged. Overall this is a dynamic-programming method over a two-dimensional table;
Step 7.8.3: of all the regular expressions obtained, keep those whose weight exceeds a certain threshold and add them to the pattern library.
After pattern induction finishes, a number of regular expressions are obtained; they go into the pattern library together with their weight information;
Step 7.9, end.
All of the learning steps above ultimately update two libraries: the style tree library (path library) and the pattern library; the two updated libraries are consolidated into the overall wrapper library, completing all learning steps.
The present invention also has advantages that conventional crawling systems lack:
1) through same-site learning, non-article pages of a site, such as channel pages, topic pages, list pages, and advertisement pages, can be filtered out automatically;
2) crawled articles are deduplicated by near-duplicate document detection;
3) crawled articles can be understood semantically and classified automatically, with summaries and keywords generated automatically;
4) paging sequences of up to 50 pages per article can be located accurately, and the paged content merged in order;
5) the crawling scope of a site can be configured flexibly, supporting the crawling of articles under one or more list areas of a site, under a channel, or under any page;
In practical application, the articles crawled by this system are of high quality and can be published directly to end users; at the same time the system adapts automatically to template changes across the thousands of crawled sites, greatly reducing the manual effort that crawling requires, substantially improving the news coverage and timeliness of portal sites, and lowering their labor costs.
This patent has application scenarios in all portal sites, where it can effectively improve news coverage and timeliness while reducing labor costs.
It can likewise be applied in news search engines.
Description of drawings
Fig. 1 is the module structure diagram of the system;
Fig. 2 is the data flowchart of the system;
Fig. 3 is the structure diagram of the online modules of the real-time crawling module;
Fig. 4 is the structure diagram of the offline modules of the real-time crawling module;
Fig. 5 is the online operation flowchart of the real-time crawling module;
Fig. 6 is the offline operation flowchart of the real-time crawling module;
Fig. 7 is the module structure diagram of the article-page intelligent extraction system of the present invention;
Fig. 8 is the module structure diagram of the web page extraction module;
Fig. 9 is the module structure diagram of the web page learning module;
Fig. 10 is the module structure diagram of the body-region blocking module;
Fig. 11 is an overall flowchart of the automatic article extraction method based on same-site learning and automatic rule generation;
Fig. 12 is an overall flowchart of the automatic article extraction method based on same-site learning and automatic rule generation;
Fig. 13 is a flowchart of the learning phase of the automatic article extraction method based on same-site learning and automatic rule generation;
Fig. 14 is the data flowchart of the body-region blocking module;
Fig. 15 is a schematic diagram of body-region correction in the extraction method;
Figs. 16-19 are drawings of the overall working example of the system and of the real-time crawling module example;
Figs. 20-28 show a web page extraction example, based on ifeng.com (Phoenix Net), of the system's web page extraction module.
Embodiment
The crawling system of the present invention consists of 5 modules or subsystems in total, as shown in Fig. 1, comprising: the real-time crawling module, the web page extraction system, the near-duplicate document deduplication module, the automatic document classification module, and the article publishing module.
The overall system data flow is shown in Fig. 2; the concrete steps are as follows:
Step 4: the web page extraction system periodically queries the local crawl index; when a new index entry is found, all web pages downloaded in step 3 are loaded into the system site by site according to the index and are extracted concretely according to the "automatic article extraction algorithm based on same-site learning and automatic rule generation" of the present invention; if extraction fails, automatic learning is performed site by site, so that extraction wrappers are generated automatically and the next round of extraction succeeds; extraction also includes the automatic summarization module and the automatic keyword generation module, which generate the summary and keyword information of the article;
Step 7: the article publishing module periodically queries the local index; when a new index entry is found, the article is loaded into the system according to the index and published through the web to the concrete content system.
The article real-time intelligent crawling system of the present invention comprises the real-time crawling module, the web page extraction system, the near-duplicate document deduplication module, the automatic document classification module, and the article publishing module.
The detailed technical scheme of the real-time crawling module is as follows:
When highly real-time crawling is required, i.e., it is hoped that content updates on the target site are fetched within 1-3 minutes, connection and download requests must be issued frequently to the target web server; in actual crawling this puts excessive pressure on the target server, which then adopts blocking strategies, making our crawling unstable or even failing.
Highly real-time crawling also consumes hardware resources such as network bandwidth heavily, driving up cost.
Many existing crawling systems solve the above problems by controlling crawl frequency and adding crawl servers, to ensure real-time performance and crawl safety.
The real-time crawling module of the present invention achieves various real-time crawling schemes by combining rational job scheduling, a dynamic self-adaptive method for job crawl intervals, an automatic discovery method for the daily crawl time range of each job, and active proxy collection and verification.
Compared with other real-time crawling techniques, this scheme has lower cost and a simpler structure.
The real-time crawling module is divided into online and offline parts.
It comprises 7 online modules: the job extraction module, the job parsing module, the job crawl time range check module, the job crawl time interval check module, the job scheduling module, the job download module, and the job crawl frequency adjustment module; and 3 offline modules: the job crawl time range discovery module, the job crawl time interval discovery module, and the free proxy collection and verification module.
The job extraction module extracts one job at a time, in turn, from the job set;
The job parsing module parses each job; the parsing result forms a set of attributes and rules;
The job crawl time range check module queries the job's time range parameter; if the time range does not include the current time, the job is skipped without crawling; otherwise the crawl time interval check is entered;
The job crawl time interval check module queries the job's crawl time interval; if the interval specifies a next crawl time later than the current time, the job is skipped without crawling; otherwise the job is crawled;
The job scheduling module schedules jobs according to the other job attributes produced by the job parsing module. At scheduling time, if the job already existed, it is not reassigned and continues to be crawled by its original server; otherwise the server in the cluster currently holding the fewest jobs is selected, balancing the crawl load and thereby optimizing the overall crawl speed; at the same time, jobs of the same site are kept off the same server as far as possible, to prevent a single server from exerting too much crawl pressure on a single site;
The job download module performs the concrete download. It fetches an appropriate number of proxies, generally 5, from the proxy pool; if no proxy is available, crawling proceeds without one; the no-proxy option and the 5 proxies above are merged into a proxy list; according to the job parameters obtained by parsing, one proxy is chosen at random from the list to perform this round of the job's download; the download uses a conventional page download engine;
The job crawl frequency adjustment module works from the job's crawl interval base: if this round of crawling found an update, the next round's crawl interval is randomly reduced; if no update was found, the next round's crawl interval is randomly enlarged; in either case the interval is kept within [0.5, 2] times the job's crawl interval base;
Said real-time crawling module also comprises the following offline submodules:
The job crawl time range discovery module performs intelligent analysis of the historical crawl logs, deriving the crawl time range of each job;
The job crawl time interval discovery module reads the previous day's crawl logs, analyzes the crawl outcome of every round of each job in that day, and derives each job's update behavior; if more than 50% of a job's rounds captured an update, its current interval base is left unchanged; otherwise the current interval base is enlarged appropriately, to reduce pointless crawl requests;
The free proxy collection and verification module downloads the day's free proxy lists from proxy sharing sites on the internet and verifies these free proxies: a verification URL set is formed by randomly selecting links from the crawl history; each proxy is used to crawl the URLs several times; proxies that cannot crawl successfully or crawl too slowly are weeded out, while proxies with a high success rate and high speed are kept, forming the proxy pool of the day, which provides proxy support for online crawling. In the best case we can download a single site through 5-10 proxies; compared with a conventional crawl engine, this greatly reduces the frequency with which any single crawl server's IP appears, gives a modest improvement in our crawl network quality, and greatly reduces the risk of any single server being blocked.
The online modules perform the actual crawling of each job, running whenever there are crawl jobs pending; the offline modules only provide data and resource support for the online modules, such as a refreshed proxy pool, and run once per day in idle time. Because their operations are relatively time-consuming, they are placed offline so as not to affect the online modules.
The online operation flow of the real-time crawling module is as follows (Fig. 5):
Step 4, time interval check: query the job's crawl time interval; if the interval specifies a next crawl time later than the current time, do not crawl and return to step 1; otherwise enter the next step;
Step 7, crawl frequency adjustment: based on the job's crawl interval base, if this round found an update, randomly reduce the next round's crawl interval, generally by a factor of 0.2; if no update was found, randomly enlarge the next round's interval, generally by a factor of 0.2; finally ensure that the next round's crawl interval stays within [0.5, 2] times the job's interval base; after the frequency adjustment completes, return to step 1 and repeat the whole flow;
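The adaptive interval update of step 7 can be sketched as follows. The 0.8/1.2 factors stand in for the "generally a factor of 0.2" adjustment, the jitter range is a hypothetical choice, and the [0.5, 2]x clamp follows the text:

```python
import random

def adjust_interval(base: float, current: float, found_update: bool) -> float:
    """Adaptive crawl-interval update: shrink on an update, grow otherwise,
    clamped to [0.5, 2] times the job's interval base (seconds)."""
    factor = 0.8 if found_update else 1.2   # ~0.2x step, per the text
    jitter = random.uniform(0.9, 1.1)       # hypothetical randomization
    nxt = current * factor * jitter
    return min(max(nxt, 0.5 * base), 2.0 * base)
```

A job that keeps yielding updates converges toward 0.5x its base interval; a quiet job drifts out to 2x, reducing pointless requests without ever leaving the configured band.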
The offline flow of the crawling system comprises crawl time range discovery, crawl time interval discovery, and proxy collection and verification; these steps provide knowledge for the online crawl flow, such as job time ranges and valid proxies;
The offline part generally runs once a day between 0:00 and 6:00, when crawling is relatively idle.
Its concrete steps are as follows (Fig. 6):
The verification method is to randomly extract, from the crawl history, crawlable URLs numbering 3 times the number of proxies; each proxy is then randomly assigned 3 of these URLs to crawl, earning a reward of +1 point per success and a penalty of -1 point per failure; a crawl completed within 5 seconds earns a further +5 point reward, and within 10 seconds a +2 point reward;
After verification, proxies scoring below 2 points are generally excluded, thereby weeding out proxies that cannot crawl successfully or crawl too slowly; proxies with a high success rate and high speed are kept, forming the proxy pool of the day, which provides proxy support for online crawling.
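A minimal sketch of the scoring and filtering rules just described (+1/-1 per success/failure, +5 bonus under 5 s, +2 under 10 s, discard below 2 points); the data shapes are hypothetical:

```python
def score_proxy(results) -> int:
    """results: list of (success: bool, seconds: float) for one proxy."""
    score = 0
    for ok, secs in results:
        if ok:
            score += 1
            if secs < 5:
                score += 5    # fast-crawl bonus
            elif secs < 10:
                score += 2    # moderate-speed bonus
        else:
            score -= 1        # failure penalty
    return score

def build_pool(proxy_results: dict, threshold: int = 2) -> list:
    """Keep proxies whose score reaches the threshold; the rest are weeded out."""
    return [p for p, res in proxy_results.items() if score_proxy(res) >= threshold]
```

A proxy that succeeds quickly on all three assigned URLs scores 18 and enters the day's pool; one that is slow or failing falls below the cutoff and is discarded.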
The web page extraction system of the present invention is also an article-page intelligent extraction system; its detailed technical scheme is as follows.
The information extraction field already has many technical schemes, and the core of all of them is how to generate and maintain extraction wrappers. Technically they fall mainly into two types:
1) extraction systems using automatically machine-generated extraction wrappers can crawl articles in large volume but cannot achieve accurate article extraction, so the usability of the crawled articles is low;
2) extraction systems using manually written extraction wrappers extract articles accurately, but the generation and maintenance of wrappers for thousands of internet sites can only rely on massive manual participation;
The extraction module of the present invention takes the independently developed "automatic article extraction based on same-site learning and automatic rule generation" method as its core and solves both problems well.
In practical application, this technical scheme achieves automatic machine generation and maintenance of extraction wrappers, so extraction requires no large manual effort; it also achieves accurate article extraction, with very little redundancy or omission in the results and very high usability.
The article-page intelligent extraction system of the present invention mainly comprises the following submodules, as in Fig. 3:
(1) the to-be-extracted page loading module, mainly responsible for periodically querying the local index and, when a new index entry is found, loading the corresponding web pages into system memory;
(2) the wrapper query module: for each page awaiting extraction, query the concrete extraction wrapper information; if found, enter the extraction module and extract concretely according to the wrapper; otherwise, mark the page as failed to extract;
(3) the web page extraction module, responsible for extracting the concrete article information from a page according to the existing extraction wrapper;
(4) the failed-page collection module, responsible for collecting the pages that failed extraction this round, grouped by site, for convenient focused learning;
(5) the learning decision module: query the failed-page collection by site and, from each site's number of failed pages, compute the site's extraction success/failure ratio for this round, deciding whether to enter the web page learning module;
(6) the web page learning module, responsible for performing machine learning on all failed pages and finally generating new extraction wrappers;
(7) the extraction wrapper management module, responsible for managing the system's extraction wrappers, i.e., the path library and the pattern library; it provides a wrapper-use interface to the web page extraction module and a wrapper-update interface to the web page learning module.
Said web page extraction module further comprises:
the HTML parsing module, which parses the incoming page's HTML and builds the DOM tree;
the body-region locating module, which locates the body-text region according to the wrapper information;
the article header and pagination extraction module, used to extract the article header and the article's pagination information;
the body-region correction module, used to correct the body-text region; the body-region blocking module, used to divide the body-text region into blocks while performing block property determination and redundant block removal;
the segmentation and filtering module, used to segment the body-text region and filter blocks;
the data arrangement module, used to merge and organize the information, forming the article-type result and generating the final article information.
Said web page learning module further comprises:
the HTML parsing module, which parses the incoming page's HTML and builds the DOM tree;
the body-region locating module, used to locate the body-text region;
the path library update module, used to insert and merge paths and to consolidate the path library;
the article header and pagination extraction module, used to extract the article header and the article's pagination information;
the body-region correction module, used to correct the body-text region;
the body-region blocking module, used to divide the body-text region into blocks while performing block property determination and redundant block removal;
the pattern learning module, which segments the body-text region, builds a pattern for each segment, and inserts and merges the patterns into the pattern library;
the pattern induction module, which performs induction over all patterns, generates rules, and inserts and merges them into the pattern library;
the wrapper consolidation module, which consolidates the system's wrappers and removes invalid information.
Said body-region blocking module further comprises:
the frequent pattern recognition module, which uses the MDR method to recognize frequent patterns;
the blocking module, which, for the frequent patterns obtained, searches for block headers and searches upward for block parent nodes to obtain the best blocking node combination, then combines them to form blocks;
the block marking module, which marks all identified blocks in the body-region DOM tree.
The core of the article-page intelligent extraction system is the "automatic article extraction based on same-site learning and automatic rule generation" method.
This method mainly comprises two phases: the extraction phase and the learning phase. Its overall flowchart is shown in Fig. 7; the concrete steps are:
Step 4, mark extraction failures: mark the pages that failed extraction and collect them for the convenience of step 6, then return to step 2;
Step 7, web page learning: learn from all failed pages of each site and generate new extraction wrappers;
Step 9, the current round of extraction ends.
The core phases of the extraction method are the extraction phase and the learning phase, introduced one by one below.
The extraction phase, i.e., step 3 above; its flow is shown in Fig. 8:
Step 3.1, HTML parsing: parse the incoming page's HTML and build the DOM tree;
First the HTML is preprocessed, including character-encoding conversion, script/style filtering, and removal of invisible characters; then, according to the HTML code and the HTML standard, the HtmlParser component is used to parse the page and obtain the DOM tree;
Step 3.2, locate the body-text region;
The body-text region is the DOM node in the DOM tree that contains the main content of the article. The locating method is: query the site's body-region path in the style tree of the extraction wrapper, then traverse the DOM tree along this path to the concrete DOM node; that node is the body-text region we seek;
Step 3.3, extract the article header and pagination information;
The article header comprises the article title information, article time information, article source information, etc.
The method of extracting the title is roughly as follows:
(1) take the first several lines inside the body-text region, compute the title matching degree of each, and take the maximum, giving the candidate title line inside the region; here a "line" means one of the adjacent DOM node sets, with its corresponding HTML code, formed by splitting the page's DOM tree at HTML line-break tags such as <br> and <p>;
(2) for the several lines immediately before the body-text region, compute the title matching degree of each and take the maximum, giving the candidate title line before the region;
(3) then compare the two candidates according to heuristic rules and their title matching degrees, and select one as the title;
The title matching degree is measured by the following formula:
P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summery/len_max_summery) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len
Where:
len_punc is the punctuation length in the line;
len_all is the total text length in the line;
len_title is the edit distance between the line content and the page's title field;
len_max_title is the maximum of the line content length and the page title field length;
keywords refers to the keyword information carried by the page; summery refers to the summary field carried by the page; authortext refers to the anchor text corresponding to the page's URL; the meanings of these three groups of variables are analogous to the above;
WH is the tag-type weighting: if tags such as h1, h2, ..., center appear among the nodes of the line, the node is weighted accordingly;
H_len is the node content-length weighting: large-scale statistics show that title lengths most commonly fall between 16 and 23 characters, with every other interval having its own distribution probability, from which the length weighting of the node is computed;
a, b, c, d, e, f, g are the influence factors of the respective terms and can be adjusted in application.
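The title matching degree can be computed directly from the formula above. In this sketch the line's precomputed length quantities, edit distances, and weightings, and the factors a..g, are passed in as plain dicts; the field names are hypothetical:

```python
def title_match_degree(line: dict, w: dict) -> float:
    """P_t per the formula above. `line` carries the precomputed lengths,
    edit distances, and the WH / H_len weightings; `w` carries a..g."""
    return (w["a"] * (1 - line["len_punc"] / line["len_all"])
            + w["b"] * (1 - line["dist_title"] / line["max_title"])
            + w["c"] * (1 - line["dist_keywords"] / line["max_keywords"])
            + w["d"] * (1 - line["dist_summery"] / line["max_summery"])
            + w["e"] * (1 - line["dist_authortext"] / line["max_authortext"])
            + w["f"] * line["WH"]
            + w["g"] * line["H_len"])
```

A line with no punctuation that exactly matches the page's title, keywords, summary, and anchor-text fields (all edit distances zero) attains the maximum of the first five terms.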
The article time information and source information are found by matching within the several lines below the article title; since the body-text region and the title have already been determined, this portion is very small, so recognition accuracy is very high;
The pagination information is recognized by examining several lines at the tail of the body-text region, line by line, for number sequences; if a consecutive number sequence such as "1, 2, 3, 4, 5, 6..." is found, and the URLs these numbers link to belong to the same site as the page, recognition succeeds;
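The pagination check just described (a consecutive run of numbered links pointing at the page's own host) can be sketched as follows. The minimum run length of 3 and the data shapes are assumptions, not from the patent:

```python
from urllib.parse import urlparse

def find_paging(tail_lines, page_host):
    """Scan the tail lines of the body region for a run of consecutive
    numbered links ('1', '2', '3', ...) on the same host as the page.
    tail_lines: list of lines, each a list of (anchor_text, url) pairs."""
    for links in tail_lines:
        nums = [(int(t), u) for t, u in links if t.strip().isdigit()]
        consecutive = (len(nums) >= 3 and
                       all(b - a == 1 for (a, _), (b, _) in zip(nums, nums[1:])))
        if consecutive and all(urlparse(u).netloc == page_host for _, u in nums):
            return [u for _, u in nums]   # pagination URLs in order
    return None
```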
Step 3.4, correct the body-text region;
With the help of the cues of the news article format, and combining the article header from the step above with the article tail information (pagination information), the body-text region can be corrected to be as accurate as possible:
(here the article tail information means the pagination information) as shown in Fig. 6, the correction is as follows:
1) after the article header (title, time, etc.) is found, the body-text region is corrected:
if the article header is inside the region, the content before the header is cut off;
if the article header is outside the region, the part between it and the region is merged into the body-text region;
2) after the article tail information (pagination, etc.) is found at the tail of the region:
if the article tail is inside the region, the part of the region after the tail is cut off;
if the article tail is outside the region, no correction is made.
Step 3.5, divide the body-text region into blocks, comprising two steps: blocking, and block property determination with redundant block removal. The blocking step:
The purpose of body-region blocking is to divide the page into several complete zones, determine the properties of each zone one by one, and then remove redundancy, improving the precision of the extraction results.
The steps of the body-region blocking method are as follows:
Step 3.5.1, use the MDR method (proposed by Bing Liu) to recognize frequent patterns;
Step 3.5.2, for the patterns obtained, search for block headers and search upward for block parent nodes to obtain the best blocking node combination, then combine them to form blocks;
Step 3.5.3, mark all identified blocks in the body-region DOM tree;
Meanwhile, the following criteria are applied when building blocks:
(1) among all children of the same parent, the nodes between two marked blocks also form a block, the nodes before the first block form a block, and the nodes after the last block form a block;
(2) if a marked block exists in a node's subtree, the node itself is also a block;
Step 3.6, block property determination and redundant block removal for the body-region blocks;
For each block obtained, the ratio of its link text to its total text is judged;
if a block's link ratio exceeds the threshold (0.5), it is treated as a redundant block, removed from the tree, and replaced with an <hr> tag;
the blocks identified from the remaining frequent patterns are marked because of their explicit semantics, so that they are not split in subsequent operations (for example, a TV program listing table);
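The redundant-block test of step 3.6 reduces to a link-to-text ratio against the 0.5 threshold. A sketch, with blocks represented as plain dicts of precomputed lengths (a hypothetical shape; a real implementation would walk DOM nodes):

```python
def link_ratio(block: dict) -> float:
    """Ratio of link text to total text in a block; an empty block
    counts as fully redundant."""
    return block["link_len"] / block["text_len"] if block["text_len"] else 1.0

def prune_redundant(blocks, threshold=0.5):
    """Replace blocks whose link ratio exceeds the threshold with an
    <hr> placeholder, per the redundant-block rule."""
    return [b if link_ratio(b) <= threshold else {"tag": "hr"} for b in blocks]
```

A navigation strip (mostly anchor text) is replaced by the <hr> placeholder, while a paragraph with an occasional inline link survives.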
Step 3.7, body-region segmentation and filtering;
The blocked body-region tree is segmented to obtain a sequence of text segments. Segmentation is performed because observation shows that redundant information always appears in the form of one or more whole segments, so segmenting the body region makes it more convenient to remove redundant information in subsequent operations.
Block filtering is then performed on the segmented body-text region:
(1) generate patterns: for each segment, extract its HTML code, simplify the HTML fragment so that only tag names and content remain, compute the MD5 digest as the key, and build the pattern;
A pattern is expressed as follows:
Pattern = md5((content: text/img) + forward traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e., the frequency of occurrence of the pattern;
(2) filter blocks: the resulting patterns are put into the wrapper's pattern library and merged on insertion;
if the library already contains an identical pattern, the weights are merged, i.e., the value fields are added;
if not, the pattern is inserted.
Step 3.8, data arrangement and result generation: merge and organize the information, extract the summary, etc.; extraction succeeds;
Step 3.9, extraction ends.
Many steps of the learning phase correspond to the extraction phase, and some steps are identical.
The learning phase, i.e., step 7 above; its flow is shown in Fig. 9:
Step 7.1, HTML parsing: parse the incoming page's HTML and build the DOM tree;
Same as in the extraction phase;
Step 7.2, locate the body-text region;
Unlike the extraction phase, the learning phase locates the body-text region by the body-region recognition method.
The purpose of localization is to find a preliminary, reasonably good zone for the body text, reducing the DOM tree scope the method must handle and at the same time reducing the method's error probability; moreover, experiments show that many pages already have their body text correctly extracted at this localization stage;
Experimental statistics show that the body text is always contained in one or more nested Div or Table nodes, so the idea of body-text localization is exactly to find the right Div or Table; our method is to find the Div or Table with the highest information degree;
The information degree formula:
H = len_not_link * log(1 + len_link/len_allTxt) + a * len_not_link * log(1 + len_not_link/len_html)
Where:
a is an influence factor, currently defaulting to 0.5;
len_not_link is the non-link text length in the node;
len_allTxt is the total text length in the node;
len_html is the HTML length of the node;
When computing H, 1 is added inside each log so that the log results are all > 0;
the formula takes the link-text ratio into account, helping to find candidate nodes with as much non-link text as possible;
the formula takes the ratio of non-link text to HTML length into account, ensuring the candidate node is sufficiently contracted and avoiding candidates that are too broad;
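The information-degree selection can be sketched directly from the formula: nodes are represented as dicts of the four length quantities (a hypothetical shape), and the candidate Div/Table with the highest H is chosen as the body-text region:

```python
import math

def information_degree(node: dict, a: float = 0.5) -> float:
    """H per the formula above; `a` defaults to 0.5 as in the text."""
    return (node["len_not_link"] * math.log(1 + node["len_link"] / node["len_allTxt"])
            + a * node["len_not_link"] * math.log(1 + node["len_not_link"] / node["len_html"]))

def locate_body(candidates, a: float = 0.5):
    """Pick the Div/Table candidate with the highest information degree."""
    return max(candidates, key=lambda n: information_degree(n, a))
```

A node dominated by non-link text scores far higher than a small, link-heavy navigation node, so the article body wins the comparison.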
Finally, once the desired Div or Table is found, we trace back from it to the body node in the DOM tree; after the traceback finishes, a path has been formed; during the traceback, the positional information of each DOM node, that is, its left-to-right ordinal among its parent's children, is also recorded.
In the end we obtain a DOM tree path whose nodes also carry their positional information, for example:
"Div index=3 DFS=1 ==> Body index=0 DFS=1 ==> www.ifeng.com"
Step 7.3, path insertion and merging: incorporate the path above into the path library of the system wrapper, merging with weighting on insertion;
if a duplicate path is found, merge with weighting; the weighting modifies the DFS field, i.e., the new path's DFS value is added to the old path's;
if no duplicate is found, the new path is simply inserted;
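The path insert-or-merge rule can be sketched as follows; representing the path library as a dict from path string to accumulated DFS weight is an assumption for illustration:

```python
def merge_path(path_lib: dict, path: str, dfs: int = 1) -> None:
    """Insert-or-merge a traceback path into the path library:
    a duplicate path adds its DFS value to the stored one."""
    if path in path_lib:
        path_lib[path] += dfs   # merge weighting: new DFS added to old
    else:
        path_lib[path] = dfs    # new path: simply inserted

lib = {}
p = "Div index=3 ==> Body index=0 ==> www.ifeng.com"
merge_path(lib, p)
merge_path(lib, p)   # the same path seen again: DFS weight becomes 2
```

Paths that recur across many pages of a site accumulate weight, so the site's dominant body-region path dominates the library.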
Step 7.4, extract the article header and pagination information;
Same as in the extraction phase;
Step 7.5, correct the body-text region;
Same as in the extraction phase;
Step 7.6, divide the body-text region into blocks;
Same as in the extraction phase;
Step 7.7, block property determination and redundant block removal for the body-region blocks;
Same as in the extraction phase;
Step 7.8, pattern learning;
First segment the body-text region, in the same way as in the extraction phase;
After segmentation, generate a pattern for every segment;
The pattern generation process is: for each segment, extract its HTML code, simplify the HTML fragment so that only tag names and content remain, compute the MD5 digest as the key, and build the pattern;
A pattern is expressed as follows:
Pattern = md5((content: text/img) + forward traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e., the number of occurrences of the pattern.
Then the resulting patterns are put into the wrapper's pattern library and merged on insertion: if the library already contains an identical pattern, the weights are merged, i.e., the value fields are added; if not, the pattern is simply inserted;
Step 7.9, pattern induction, i.e. automatic regular-expression generation;
Among the patterns obtained in the previous step, many entries in the library can be merged into regular expressions;
For example, patterns like the following should be merged:
"For more splendid content, come to the Health channel"
"For more splendid pictures, come to the Pictures channel"
"For more splendid news, come to the Information channel"
The pattern after merging:
"For more splendid *, come to the * channel"
After merging, we obtain the other kind of entry in the pattern library: regular expressions.
This process is called pattern induction.
The concrete steps of pattern induction are as follows:
Step 7.9.1: for all patterns in the library, extract the original strings and group them by website; each group is clustered by string similarity, yielding several highly cohesive clusters.
String similarity is computed as follows: after a simple word segmentation whose unit is the "word", compute the word-level edit distance and derive the similarity; during segmentation, an HTML tag counts as one word, an English word as one word, and each Chinese character or punctuation mark as one word;
The clustering method is K-Means;
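A minimal sketch of this word-level similarity, assuming the stated tokenization (an HTML tag, an English word, or any single other character each count as one word; treating a digit run as one word is an added assumption) and a plain dynamic-programming edit distance:

```python
import re

# Tokenizer for Step 7.9.1: one "word" per HTML tag, English word, digit run
# (assumption), or single remaining character (covers CJK and punctuation).
TOKEN = re.compile(r"<[^>]+>|[A-Za-z]+|\d+|.", re.S)

def tokens(s):
    return TOKEN.findall(s)

def edit_distance(a, b):
    # classic dynamic-programming edit distance over token lists
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

def similarity(s1, s2):
    t1, t2 = tokens(s1), tokens(s2)
    return 1 - edit_distance(t1, t2) / max(len(t1), len(t2), 1)
```

These similarity values would then feed the K-Means clustering of each website's group.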
Step 7.9.2: within each cluster, compute the merged regular expression for every pair of segments, obtaining all possible distinct regular expressions; sort these by likelihood (frequency of occurrence) from high to low and take the largest (its coverage is necessarily the widest); then verify the second largest, and if it covers part of the segments in the cluster with suitable weight, it is also a desirable pattern;
How to merge two segments into one pattern: recursively seek the optimal common fragments of the remainders of the two segments; the non-common parts are exactly the places that need merging; overall this is a two-dimensional-table dynamic-programming method;
Handling of differing parts: if both differing parts are digits, merge them with "\d"; if they are mixed digits and letters, replace with "\d[a-z]"; other differences are replaced with "*"; when digits differ, the differing parts are expanded to cover all digit sequences, to improve adaptability;
For example:
"/imgs/89089089.jpg" and
"/imgs/89010197.jpg" merge into "/imgs/\d*.jpg"
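A simplified sketch of this differing-part rule (a hypothetical helper, not the patent's two-dimensional dynamic program): split each string into digit and non-digit runs, keep runs that match, replace differing all-digit runs with a digit wildcard, and fall back to a generic wildcard otherwise.

```python
import re

RUNS = re.compile(r"\d+|\D+")  # alternating digit / non-digit runs

def merge_to_regex(a, b):
    ra, rb = RUNS.findall(a), RUNS.findall(b)
    if len(ra) != len(rb):
        return ".*"                      # shapes differ: give up gracefully
    out = []
    for x, y in zip(ra, rb):
        if x == y:
            out.append(re.escape(x))     # common fragment, kept verbatim
        elif x.isdigit() and y.isdigit():
            out.append(r"\d+")           # digits differ: cover all digit runs
        else:
            out.append(".*")             # any other difference
    return "".join(out)

rx = merge_to_regex("/imgs/89089089.jpg", "/imgs/89010197.jpg")
# rx now matches any /imgs/<digits>.jpg URL
```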
Step 7.9.3: among all the regular expressions obtained, keep those whose weight exceeds a certain threshold and add them to the pattern library;
After pattern induction finishes, several regular expressions have been obtained; they are stored in the pattern library together with their weight information;
Step 7.10, end.
All of the above learning steps ultimately update two libraries: the style tree library (path library) and the pattern library; after updating, these two libraries are consolidated into the overall wrapper library, completing all learning steps.
An example of the present system is given below.
Taking the grabbing of http://www.21cbh.com/channel/review/ as an example, the steps of the overall grabbing system are as follows:
Step 4: enter the document deduplication module and perform near-duplicate removal on all extracted documents, weeding out articles whose similar content has already been grabbed;
Step 7: the overall grabbing flow ends.
The real-time grabbing module is divided into online and offline operating steps. The online modules perform the concrete grabbing work, while the offline modules provide supporting data, such as the proxy library, for the operation of the online modules;
The offline modules generally run once per day, at about 0:00, and then no longer run that day; the online modules run by polling, executing once every 30 seconds.
Taking the grabbing of http://www.21cbh.com/channel/review/ as an example, the offline operating steps of the real-time grabbing module of the overall grabbing system are as follows:
| | DAY 1 | DAY 2 | DAY 3 | DAY 4 | DAY 5 | DAY 6 | DAY 7 |
| First grab time of each day | 02:13 | 03:10 | 02:05 | 01:25 | 04:56 | 03:11 | 04:16 |
| Last grab time of each day | 06:15 | 06:32 | 06:54 | 07:21 | 07:23 | 06:26 | 08:11 |
After analysis, take the minimum first-grab time and the maximum last-grab time over the 7 days; the new time range obtained for this job is 1 o'clock to 8 o'clock, i.e. 01-08, and the job parameter settings are revised accordingly. After revision they read:
"2 248836 01-08"
"2 298603 01-08"
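This 7-day analysis can be sketched as follows, assuming the range is formed from the hour of the earliest first-grab time and the hour of the latest last-grab time, which reproduces the 01-08 result above:

```python
# Offline time-range discovery: earliest first-grab hour to latest
# last-grab hour over a week of logs, formatted as the job's "HH-HH" range.
first = ["02:13", "03:10", "02:05", "01:25", "04:56", "03:11", "04:16"]
last  = ["06:15", "06:32", "06:54", "07:21", "07:23", "06:26", "08:11"]

def hour_range(first_times, last_times):
    lo = min(int(t.split(":")[0]) for t in first_times)
    hi = max(int(t.split(":")[0]) for t in last_times)
    return f"{lo:02d}-{hi:02d}"

print(hour_range(first, last))  # 01-08
```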
These proxies are then verified. The verification method is to provide grabable URLs numbering 3 times the number of proxies; each proxy is then randomly assigned 3 URLs to grab; each success earns a reward of +1 point and each failure a penalty of -1 point; grabbing successfully within 5 seconds earns a reward of +5 points, and within 10 seconds +2 points.
Finally each proxy's scores are totalled, proxies scoring below 2 points are excluded, and the effective proxy list shown in Figure 19 is formed, in which the last column is each proxy's score.
Finally all of these proxies are put into our proxy library to support the online operation.
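The proxy scoring above can be sketched as follows, assuming each proxy's three probe results are available as (succeeded, seconds) pairs; the probe itself would wrap a real HTTP request through the proxy and is not shown.

```python
# Free-proxy verification: +1 per success, -1 per failure, +5 bonus for a
# fetch within 5 s, +2 within 10 s; proxies scoring below 2 are discarded.
def score_proxy(results):
    """results: list of (succeeded, seconds) tuples, one per probe URL."""
    score = 0
    for ok, secs in results:
        if not ok:
            score -= 1
            continue
        score += 1
        if secs <= 5:
            score += 5
        elif secs <= 10:
            score += 2
    return score

def filter_proxies(probe_results, threshold=2):
    scored = ((p, score_proxy(r)) for p, r in probe_results.items())
    return {p: s for p, s in scored if s >= threshold}

probes = {
    "1.2.3.4:8080": [(True, 3.0), (True, 4.2), (True, 9.0)],   # fast, reliable
    "5.6.7.8:3128": [(False, 0), (False, 0), (True, 20.0)],    # mostly failing
}
print(filter_proxies(probes))  # {'1.2.3.4:8080': 15}
```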
Taking the grabbing of http://www.21cbh.com/channel/review/ as an example, the online operating steps of the real-time grabbing module of the overall grabbing system are as follows:
1) grab http://www.21cbh.com/channel/review/ without expansion;
2) grab the zone of this page specified by the DOM node <div class="home_box">;
3) grab the URL links in this zone that satisfy the following URL regular expression:
http://www.21cbh.com/HTML/.*?\.html
4) the grab-interval base is 298603 milliseconds;
5) the grab time range is 1 o'clock to 8 o'clock each day;
Step 4, time-interval judgment. Query this job's time-interval base of 298603 milliseconds; if the interval specifies a next grab time later than the current time, do not grab and return to step 1; otherwise enter the next step;
Step 7, grab-frequency adjustment. Based on this job's grab-interval base of 298603: if this round grabbed an update, reduce the next round's grab interval by 0.2 times; if this round grabbed no update, increase the next round's grab interval by 0.2 times.
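A sketch of this frequency adjustment; "0.2 times" is read here as 0.2 times the interval base, and the [0.5, 2]-times-base clamp from the module description is applied as well.

```python
# Grab-frequency adjustment: shrink the next interval when this round found
# an update, grow it otherwise, keeping it within [0.5, 2] x the base.
BASE_MS = 298603

def next_interval(current_ms, found_update, base_ms=BASE_MS):
    step = 0.2 * base_ms
    nxt = current_ms - step if found_update else current_ms + step
    return min(max(nxt, 0.5 * base_ms), 2 * base_ms)

iv = float(BASE_MS)
iv = next_interval(iv, found_update=True)    # page updated: poll faster
iv = next_interval(iv, found_update=False)   # no update: back off again
```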
The following is a practical operating example of the extraction-system part of the present invention.
Taking the grabbing of the latest Taiwan news of the Phoenix site, http://news.ifeng.com/taiwan/rss/rtlist_0/index.shtml, as an example, the flow of the overall extraction system is as follows:
After all webpages in this latest-news list have been crawled, the webpage extraction system is entered:
All grabbed webpages are loaded here, 42 in total;
Step 4, mark extraction failures. Failed webpages are marked and collected to facilitate step 6, and the flow returns to step 2;
Here our extraction succeeded on 16 webpages out of 42 in total, a success ratio of 16/42 < 0.5, so learning is needed;
Step 7, webpage learning. All failed webpages of each website are learned from, generating new extraction wrappers; a corresponding learning instance is given later;
Step 9, this extraction round ends.
Within the above instance, one concrete web page extraction needs to be elaborated. Here we take the webpage "http://news.ifeng.com/mil/3/detail_2011_11/21/10798106_0.shtml" of the site www.ifeng.com as an example to demonstrate how our web page extraction steps obtain one piece of complete and accurate article information.
The system reads in a round of webpages to be extracted and processes them one by one; one of them has the link address "http://news.ifeng.com/mil/3/detail_2011_11/21/10798106_0.shtml" and site www.ifeng.com, as shown in Figure 20:
1. HTML parsing, finally constructing the DOM tree;
Webpage preprocessing must be performed first: character-format conversion, script/style information filtering, invisible-character elimination, and so on;
Then, according to the HTML code and the HTML standard, the HtmlParser component is adopted to parse the webpage and obtain the DOM tree;
2. Searching for the text field;
Through the domain name www.ifeng.com, the path (style) shown in Figure 21 is found in the path library:
Such a DOM tree path guides us to the red-frame text field shown in Figure 22:
3. Extract the article head and the paging information;
The concrete method for extracting the header and paging information is as follows:
First, the article header is mainly the title, and the method for extracting the title is:
(1) extract the first several lines inside the text field, compute each line's title matching degree, and take the maximum, obtaining the in-field candidate header line;
(2) for the several lines just before the text field, likewise compute each line's title matching degree and take the maximum, obtaining the before-field candidate header line;
(3) then compare the two and select one as the title according to heuristic rules and the title matching degree;
The measurement formula of the title matching degree is as follows:

P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summary/len_max_summary) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len

where:
len_punc is the length of the punctuation marks in the line;
len_all is the length of all words in the line;
len_title is the edit distance between the line content and the content of the webpage's title field;
len_max_title is the maximum of the lengths of the line content and the webpage's title field content;
keywords refers to the keyword information carried by the webpage, summary to the abstract field carried by the webpage, and authortext to the anchor text corresponding to the webpage's URL; the meanings of these three groups of variables are analogous to the above;
WH is the tag-type weighting: when tags such as h1, h2, ..., center appear among the nodes under the line, the node is weighted accordingly;
H_len is the node-content-length weighting: large-scale statistics show that title lengths between 16 and 23 are the most common, and every other interval has its own distribution probability, from which the length weighting of the node is computed;
a, b, c, d, e, f and g are the influence factors of the respective terms and can be revised in application.
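A direct transcription of P_t, assuming the len_* quantities (punctuation and word lengths, the edit distances and maxima against the title, keywords, summary and anchor-text fields) and the two weightings WH and H_len are supplied pre-computed; the influence factors a..g default to 1 here purely for illustration.

```python
# Title matching degree P_t, term by term as in the formula above.
def title_score(len_punc, len_all,
                len_title, len_max_title,
                len_keywords, len_max_keywords,
                len_summary, len_max_summary,
                len_authortext, len_max_authortext,
                tag_weight, length_weight,        # WH and H_len
                a=1, b=1, c=1, d=1, e=1, f=1, g=1):
    return (a * (1 - len_punc / len_all)
            + b * (1 - len_title / len_max_title)
            + c * (1 - len_keywords / len_max_keywords)
            + d * (1 - len_summary / len_max_summary)
            + e * (1 - len_authortext / len_max_authortext)
            + f * tag_weight
            + g * length_weight)

# a punctuation-free line with zero edit distance to every field scores 7
print(title_score(0, 10, 0, 10, 0, 10, 0, 10, 0, 10, 1.0, 1.0))  # 7.0
```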
Then, the article paging information identification method is: at the tail of the text field, seek several lines and look line by line for number sequences; if a consecutive number sequence such as "1, 2, 3, 4, 5, 6, ..." is found, and these numbers themselves carry URL links that belong to the same website as this webpage, recognition succeeds;
The webpage of this instance has no paging;
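The paging-detection rule can be sketched as follows, assuming the tail links are available as (anchor text, href) pairs and requiring at least two same-site numeric anchors that form a consecutive run:

```python
from urllib.parse import urlparse

# Paging detection: numeric anchors at the text-field tail, consecutive
# values, with hrefs on the same site as the page itself.
def detect_paging(links, page_url):
    site = urlparse(page_url).netloc
    nums = [(int(t), h) for t, h in links
            if t.strip().isdigit() and urlparse(h).netloc == site]
    vals = [n for n, _ in nums]
    consecutive = len(vals) >= 2 and all(b - a == 1
                                         for a, b in zip(vals, vals[1:]))
    return [h for _, h in nums] if consecutive else []

links = [("1", "http://www.ifeng.com/a_1.shtml"),
         ("2", "http://www.ifeng.com/a_2.shtml"),
         ("3", "http://www.ifeng.com/a_3.shtml")]
pages = detect_paging(links, "http://www.ifeng.com/a_0.shtml")
print(len(pages))  # 3
```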
4. Revise the text field;
The revision mode is shown in Figure 23 and elaborated concretely as follows:
With the help of the cues of the news-article format, and combining the article header and article-tail information (paging information) from the steps above, the text field can be revised to make it as accurate as possible:
1) after the article head (title, time, etc.) is found before the field, revise the text field:
if the article head lies inside the field, cut off the information before the article head;
if the article head lies outside the field, merge the part in between into the text field;
2) after the article-tail information (paging, etc.) is found at the field tail:
if the article tail lies inside the field, cut off the tailing part inside the field;
if the article tail lies outside the field, make no revision.
5. Partition the text field into blocks, comprising a blocking step and a block-property-determination and redundant-block-removal step;
The blocking step comprises:
1) the MDR method identifies frequent patterns;
2) joints such as frequent patterns and headings are sought and combined to form blocks.
As shown in Figure 24, we have obtained two blocks.
The mode of block property determination and redundant removal is:
for each block obtained, judge the ratio of its link text to its overall word count;
if a block's link ratio is greater than the threshold (0.5f), it is regarded as a redundant block and removed (in fact substituted with an hr tag in the tree);
the blocks of the remaining frequent patterns are marked because of their clear semantic information, and are no longer split in subsequent operations.
After the text field has been processed, we obtain the result shown in Figure 25.
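The redundant-block test above reduces to a simple ratio check; a block whose linked-text length exceeds half its total text length is treated as a navigation or advertising block and dropped:

```python
# Redundant-block test: link-text ratio above the threshold means the block
# is navigation/ads rather than article content.
LINK_RATIO_THRESHOLD = 0.5

def is_redundant_block(link_text_len, total_text_len):
    if total_text_len == 0:
        return True                      # empty block: nothing to keep
    return link_text_len / total_text_len > LINK_RATIO_THRESHOLD

print(is_redundant_block(80, 100))   # True: mostly links, drop it
print(is_redundant_block(10, 100))   # False: keep as content
```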
6. Segment and filter the text field, comprising the two steps of segmentation and block filtering;
The text field is segmented, thereby obtaining the text segmentation sequence.
The result after segmentation is shown in Figure 26, where the content in each black box is one segment.
Block filtering performs pattern-matching filtering against the pattern library segment by segment;
a pattern is extracted from each paragraph one by one and then matched against the library;
here, the following pattern is matched:
this filters out the tail picture segment pattern of Figure 26; this picture is an advertisement and should be eliminated;
Finally, text-field extraction ends, with the result shown in Figure 27.
7. Data arrangement and result generation. Information such as keywords and the summary is extracted and assembled into one accurately extracted article;
The method ends.
In the overall extraction instance, the concrete steps of webpage learning also need to be elaborated.
In our instance, 26 webpages failed extraction, and this batch of webpages all enter webpage learning;
Taking the learning of each of these webpages as an example, the steps are as follows:
1. HTML parsing. For the imported webpage, parse the HTML and build the DOM tree;
The concrete method is the same as in the extraction instance;
2. Searching for the text field; locate the text field through the text-field recognition method;
(1) extract all Div and Table nodes in the webpage's DOM tree, then compute each node's information degree according to the following information-degree formula:
H = len_not_link * log(1 + len_link/len_allTxt) + a * len_not_link * log(1 + len_not_link/len_html)

where:
a is an influence factor, currently defaulting to 0.5;
len_not_link is the length of the non-link words in the node;
len_link is the length of the link words in the node;
len_allTxt is the length of all words in the node;
len_html is the HTML length of the node;
during computation, 1 is added to each parameter of log, so that every log result is > 0;
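A direct transcription of the information-degree formula, with the stated default a = 0.5; the lengths would be measured over the node's text and HTML in practice.

```python
import math

# Information degree of a candidate Div/Table node:
# H = len_not_link * log(1 + len_link/len_allTxt)
#     + a * len_not_link * log(1 + len_not_link/len_html)
def information_degree(len_not_link, len_link, len_all_txt, len_html, a=0.5):
    return (len_not_link * math.log(1 + len_link / len_all_txt)
            + a * len_not_link * math.log(1 + len_not_link / len_html))

# a text-heavy node outscores a link-heavy one of the same total size
content_h = information_degree(900, 100, 1000, 5000)
nav_h     = information_degree(100, 900, 1000, 5000)
print(content_h > nav_h)  # True
```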
(2) find the node with the maximum information degree among all nodes; the red-frame text field shown in Figure 14 is exactly the node we find;
(3) having found the desired Div or Table, trace back to the body node in the DOM tree; when the trace-back ends, a path has been formed; during the trace-back, the positional information of each DOM node passed through is also recorded, namely its left-to-right sequence number within its parent node;
(4) finally, a DOM tree path as shown in Figure 28 is obtained; the nodes of the path also carry their positional information, and the DFS information is all 1, i.e. the occurrence count is 1;
3. Path warehousing and merging; the above path is merged into the wrapper's path library; on merging, paths are weight-combined:
if an identical path is found, the weights are merged by revising the DFS field, i.e. the DFS value of the new path is added to that of the old path;
if no duplicate is found, the new path is simply stored;
4. Extract the article header and the article paging information;
the concrete method is the same as the corresponding step of the web page extraction instance;
5. Revise the text field;
the concrete steps are the same as the corresponding step of the web page extraction instance;
6. Partition the text field into blocks, comprising blocking, block property determination, and redundant-block removal;
the concrete steps are the same as the corresponding step of the web page extraction instance;
7. Pattern learning, comprising the three steps of text-field segmentation, block-by-block pattern generation, and block-by-block learning;
the segmentation step is the same as the corresponding step of the web page extraction instance;
the block-by-block pattern-generation step is the same as the corresponding step of the web page extraction instance;
the block-by-block learning mode is:
the result after segmentation is shown in Figure 26, and according to the pattern-generation method, each paragraph generates a pattern as in the figure below, in which the second field is the concrete value information;
the obtained patterns are put into the wrapper's pattern library with warehouse-in merging: if the library contains an identical pattern, the weights are merged by adding the value fields; if not, the pattern is simply stored;
8. Pattern induction, i.e. automatic regular-expression generation;
the concrete steps of pattern induction are as follows:
Step 8.1: for all patterns in the library, extract the original strings and group them by website; each group is clustered by string similarity, yielding several highly cohesive clusters;
Step 8.2: within each cluster, compute the merged regular expression for every pair of segments, obtaining all possible distinct regular expressions; sort these by frequency of occurrence from high to low and take the largest; then verify the second largest, and if it covers part of the segments in the cluster with suitable weight, it is also a desirable pattern;
merging two segments into one pattern: recursively seek the optimal common fragments of the remainders of the two segments; the non-common parts are exactly the places that need merging; overall this is a two-dimensional-table dynamic-programming method;
Step 8.3: among all the regular expressions obtained, keep those whose weight exceeds a certain threshold and add them to the pattern library;
after pattern induction finishes, several regular expressions have been obtained; they are stored in the pattern library together with their weight information;
9. End.
Claims (34)
1. An article real-time intelligent grabbing system, characterized in that said system comprises a real-time grabbing module, a webpage extraction system, a document near-duplicate removal module, an automatic document classification module, and an article release module.
2. The system according to claim 1, wherein said real-time grabbing module comprises online and offline operation submodules, the online operation submodule comprising:
a task extraction module, which extracts one job in turn from the task job set;
a task parsing module, which parses each task job, the parsing result forming several attributes and rules;
a task grab-time-range inspection module, which queries the task's time-range parameter; if the time range does not include the current time, the job is not grabbed and is skipped; otherwise, the grab-time-interval check is performed;
a task grab-time-interval inspection module, which queries the task's grab time interval; if the interval specifies a next grab time later than the current time, the job is not grabbed and is skipped; otherwise, the task grab is performed;
a task scheduling module, which schedules jobs according to the job attributes from the task parsing module; at scheduling time, if the job existed before, it is not reassigned and is still grabbed by its home server; otherwise, the server with the fewest existing jobs in the server group is selected, so as to balance the grabbing tasks and thereby optimize the overall grab speed; at the same time, placing too many jobs of the same website on one server is avoided as far as possible, to prevent a single server from putting too much grabbing pressure on a single website;
a task download module, which performs the concrete download of the task, fetching an appropriate number of proxies, generally 5, from the proxy library; if no proxy is available, non-proxy grabbing is adopted; the no-proxy option and the above 5 proxies are merged to form a proxy list; according to the task parameters obtained from parsing, one proxy is selected at random from the proxy list, and this round of the task is downloaded;
a task grab-frequency adjustment module, which, based on the job's grab-interval base, randomly reduces the next round's grab interval if this round grabbed an update, and randomly enlarges the next round's grab interval if no update was found, while guaranteeing that the interval remains within [0.5, 2] times the job's grab-interval base.
3. The system according to claim 1, wherein said real-time grabbing module comprises online and offline operation submodules, the offline operation submodule comprising:
a task grab-time-range discovery module, which performs intelligent analysis of the historical grab logs and therefrom analyzes the time range of each task job;
a task grab-time-interval discovery module, which reads in yesterday's grab logs, analyzes the grab situation of all of each job's rounds of the day, and therefrom analyzes each job's update situation; if the job's current interval base is found to let more than 50% of the rounds grab updates, no adjustment is made; otherwise the current interval base is suitably enlarged, to reduce meaningless grab requests;
a free-proxy collection and verification module, which downloads the day's free proxy lists from free proxy-sharing websites on the internet and verifies these free proxies: historically grabbed links are selected at random to form a verification URL set; each proxy is used to perform several URL grabs; proxies that cannot grab successfully or grab too slowly are weeded out, and proxies with high success ratios and high speed are retained, forming the day's proxy library to provide proxy support for online grabbing.
4. An article real-time intelligent grabbing method, characterized in that said method comprises a real-time grabbing step, a webpage extraction step, a document near-duplicate removal step, an automatic article classification step, and an article release step; said real-time grabbing step further comprises online and offline operation substeps.
5. The method according to claim 4, said online operation substep comprising:
Step 1: extract one job in turn from the task job set;
Step 2, job parsing: parse each job, the parsing result forming several attributes and rules;
Step 3, time-range judgment: query the task's time-range parameter; if the time range does not include the current time, do not grab and return to step 1; otherwise enter the next step;
Step 4, time-interval judgment: query the task's grab time interval; if the interval specifies a next grab time later than the current time, do not grab and return to step 1; otherwise enter the next step;
Step 5, job scheduling: schedule jobs according to the job attributes from the task parsing module; at scheduling time, if the job existed before, do not reassign it and still grab it with its home server; otherwise select the server with the fewest existing jobs in the server group, so as to balance the grabbing tasks and thereby optimize the overall grab speed; at the same time avoid placing too many jobs of the same website on one server as far as possible, to prevent a single server from putting too much grabbing pressure on a single website;
Step 6, task download: fetch an appropriate number of proxies, generally 5, from the proxy library; if no proxy is available, adopt non-proxy grabbing; merge the no-proxy option and the above 5 proxies to form a proxy list; according to the task parameters obtained from parsing, select one proxy at random from the proxy list and download this round of the task;
Step 7, grab-frequency adjustment: based on the job's grab-interval base, if this round grabbed an update, randomly reduce the next round's grab interval; if no update was found, randomly enlarge the next round's grab interval; but guarantee that the interval remains within [0.5, 2] times the job's grab-interval base; after the frequency adjustment is completed, return to step 1 and repeat the whole flow.
6. The method according to claim 4, said offline operation substep comprising:
Step 1, analyze logs to discover time ranges: perform intelligent analysis of the historical grab logs and therefrom analyze the time range of each task job;
Step 2, analyze logs to discover new time intervals: read in yesterday's grab logs, analyze the grab situation of all of each job's rounds of the day, and therefrom analyze each job's update situation; if the job's current interval base is found to let more than 50% of the rounds grab updates, make no adjustment; otherwise suitably enlarge the current interval base, to reduce meaningless grab requests;
Step 3, free-proxy collection and verification: download the day's free proxy lists from free proxy-sharing websites on the internet and verify these free proxies: select historically grabbed links at random to form a verification URL set; use each proxy to perform several URL grabs; weed out proxies that cannot grab successfully or grab too slowly, and retain proxies with high success ratios and high speed, forming the day's proxy library to provide proxy support for online grabbing.
7. The webpage extraction system according to claim 1, being an article-type webpage intelligent extraction system, comprising:
(1) a to-be-extracted webpage loading module, which regularly queries the local index and, on finding a new index, loads the webpages into system memory according to the index;
(2) a wrapper query module, which, for each webpage remaining to be extracted, queries the concrete extraction wrapper information; if found, the extraction module is entered according to the extraction wrapper and concrete extraction is performed; otherwise, the webpage is marked as failing extraction;
(3) a webpage extraction module, which extracts the concrete article information from the webpage according to the existing extraction wrapper;
(4) an extraction-failure webpage collection module, which collects the webpages that failed extraction in this round, grouped by website, to facilitate focused learning;
(5) a learning judgment module, which queries the extraction-failure webpage collection by website and, according to each website's number of failed webpages, computes the website's success/failure extraction ratio for this round and decides whether to enter the webpage learning module;
(6) a webpage learning module, which performs machine learning on all failed webpages and finally generates new extraction wrappers;
(7) an extraction-wrapper management module, which manages the system's extraction wrappers, namely the path library and the pattern library, and provides a wrapper-use interface to the webpage extraction module and a wrapper-update interface to the webpage learning module.
8. The article-type webpage intelligent extraction system as claimed in claim 7, characterized in that said webpage extraction module further comprises:
an HTML parsing module, which, for the imported webpage, parses the HTML and builds the DOM tree;
a text-field seeking module, which seeks the text field according to the wrapper information;
an article-head and paging-information extraction module, used to extract the article header and the article paging information;
a text-field correction module, used to revise the text field;
a text-field blocking module, used to partition the text field into blocks while performing block property determination and redundant-block removal;
a segmentation filtering module, used to segment the text field while filtering the blocks;
a data arrangement module, used to merge and arrange the information, forming an article-type result.
9. The article-type webpage intelligent extraction system as claimed in claim 7, characterized in that said webpage learning module further comprises:
an HTML parsing module, which, for the imported webpage, parses the HTML and builds the DOM tree;
a text-field seeking module, used to seek the text field;
a path-library update module, used for warehousing and merging while tidying the path library;
an article-head and paging-information extraction module, used to extract the article header and the article paging information;
a text-field correction module, used to revise the text field;
a text-field blocking module, used to partition the text field into blocks while performing block property determination and redundant-block removal;
a pattern-learning module, which segments the text field, builds patterns block by block, and merges them into the pattern library;
a pattern-induction module, which induces over all patterns, generates rules, and merges them into the pattern library;
a wrapper tidying module, which tidies the system's wrappers and removes invalid information.
10. The article-type webpage intelligent extraction system as claimed in claim 8 or 9, said text-field blocking module further comprising:
a frequent-pattern identification module, which adopts the MDR method to identify frequent patterns;
a blocking module, which, for the obtained frequent patterns, seeks block heads and searches over the block parent nodes to obtain the optimal blocking-node combination, and then combines them to form blocks;
a block marking module, which marks all identified blocks in the text-field DOM tree.
11. An article-type webpage intelligent extraction method, comprising the steps of:
Step 1, loading of webpages to be extracted: at set intervals, load the collection of webpages remaining to be extracted; if there are no webpages to extract, go directly to step 6;
Step 2, wrapper query: for each webpage remaining to be extracted, query the concrete extraction wrapper information; if found, enter step 3 and perform the concrete extraction; otherwise extraction fails, enter step 5;
Step 3, webpage extraction: according to the wrapper, extract the webpage concretely; after extraction ends, organize the extraction result into the article type;
Step 4, mark extraction failure: mark and collect the failed webpages to facilitate step 6, and return to step 2;
Step 5: collect all webpages failing extraction, forming the extraction-failure webpage collection;
Step 6, learning judgment: query the extraction-failure webpage collection by website; for each website's failure collection, judge the website's success/failure extraction ratio for this round and decide whether to perform machine learning; if learning is to occur, add the webpages to the to-be-learned webpage collection;
Step 7, webpage learning: learn from all failed webpages of each website and generate new extraction wrappers;
Step 8, extraction-wrapper management: put the new extraction wrappers into the wrapper set;
Step 9, end.
12. The method according to claim 11, wherein said step 3, webpage extraction, comprises the steps of:
Step 3.1, HTML parsing: for the imported webpage, parse the HTML and build the DOM tree;
Step 3.2, seek the text field;
Step 3.3, extract the article header and the article paging information;
Step 3.4, revise the text field: with the help of the cues of the news-article format, and combining the article header and article paging information from the steps above, the text field can be revised to make it more accurate;
Step 3.5, partition the text field into blocks, then perform block property determination and redundant-block removal;
Step 3.6, segment and filter the text field: first segment the text-field blocking tree, thereby obtaining the text segmentation sequence, then filter the blocks;
Step 3.7, data arrangement and result generation: merge and arrange the information, extract the summary, etc.; extraction succeeds;
Step 3.8, end.
13. The method according to claim 12, wherein said step 3.1 first preprocesses the HTML, including character-encoding conversion, script/style filtering and removal of invisible characters, and then, following the HTML code and the HTML standard, parses the page with the HtmlParser component to obtain the DOM tree.
14. The method according to claim 12, wherein in said step 3.2 the location path of the site is queried in the style tree of the extraction wrapper to obtain the text-region path; the DOM tree is then traversed along this path to locate a concrete DOM node, and this node is the text region sought.
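Following a stored path through a DOM tree, as claim 14 describes, can be sketched as below. The dict-based node model and the (tag, index) path encoding are assumptions for illustration; the claim fixes only that a stored path is walked down to one concrete node.

```python
def locate_by_path(dom_root, path):
    """Locate the text region by walking a stored wrapper path (a sketch).

    A DOM node is modelled as {'tag': str, 'children': [...]}, and `path`
    is a list of (tag, index) steps: at each level, take the index-th
    child carrying that tag. Returns the located node, or None if the
    path no longer fits this page."""
    node = dom_root
    for tag, index in path:
        matches = [c for c in node.get('children', []) if c.get('tag') == tag]
        if index >= len(matches):
            return None
        node = matches[index]
    return node
```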
15. The method according to claim 12, wherein in said step 3.3 the article header is mainly the title information, and its extraction comprises:
(1) taking the first few rows inside the text region, computing the title matching degree of each row and taking the maximum, to obtain the in-region candidate title row; a "row" here is one of the groups of adjacent DOM nodes, together with their corresponding HTML code, formed by splitting the page's DOM tree at HTML line-break tags such as <br> and <p>;
(2) taking the few rows just before the start of the text region, computing the title matching degree of each row and taking the maximum, to obtain the candidate title row before the region;
(3) then comparing the candidates according to heuristic rules and their title matching degrees, and selecting one as the title.
16. The method according to claim 15, wherein said title matching degree is measured by the following formula:

P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summery/len_max_summery) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len

Wherein:
len_punc is the length of the punctuation in the row;
len_all is the total text length of the row;
len_title is the edit distance between the row content and the page's title field;
len_max_title is the maximum of the lengths of the row content and the page's title field;
keywords refers to the keyword information carried by the page, summery to the abstract field carried by the page, and authortext to the anchor text of the page URL; the meanings of these three groups of variables are analogous to the title variables above;
WH is the tag-type weight: if tags such as h1, h2, ..., center appear among the nodes of the row, the row is given extra weight;
H_len is the content-length weight: large-scale statistics show that title lengths between 16 and 23 characters are the most common, and every other length interval has its own probability of occurrence; the length weight of a node is computed from this distribution;
a, b, c, d, e, f, g are the influence factors of the respective terms and can be revised in applications.
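The title matching degree P_t of claim 16 can be sketched as below. Simplifications are assumed and labelled: the WH and H_len weights are reduced to 0/1 and 1.0/0.5 indicators (the claim uses a full length-probability distribution), and `page_fields` is a dict holding the 'title', 'keywords', 'summery' and 'authortext' fields the claim names.

```python
def edit_distance(s, t):
    """Levenshtein edit distance by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

PUNCT = set(',.!?;:\'"()[]')

def title_match_degree(row_text, row_tags, page_fields,
                       w=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """P_t of claim 16 under the simplifications stated above.
    w = (a, b, c, d, e, f, g), the claim's influence factors."""
    a, b, c, d, e, f, g = w
    len_all = max(len(row_text), 1)
    len_punc = sum(ch in PUNCT for ch in row_text)

    def field_term(name):
        # 1 - edit_distance(row, field) / max length, per the claim
        field = page_fields.get(name, '')
        len_max = max(len(row_text), len(field), 1)
        return 1 - edit_distance(row_text, field) / len_max

    wh = 1.0 if {'h1', 'h2', 'h3', 'center'} & set(row_tags) else 0.0
    h_len = 1.0 if 16 <= len(row_text) <= 23 else 0.5  # assumed length prior
    return (a * (1 - len_punc / len_all)
            + b * field_term('title') + c * field_term('keywords')
            + d * field_term('summery') + e * field_term('authortext')
            + f * wh + g * h_len)
```

A row that matches the page's title field and sits in an h1 tag scores far above a boilerplate row, which is the separation the formula is designed to produce.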
17. The method according to claim 12, wherein in said step 3.3 the article pagination information is recognized by scanning the last few rows of the text region line by line for a numeric sequence; if a run of consecutive numbers such as "1, 2, 3, 4, 5, 6 ..." is found, and the URL links carried by those numbers belong to the same site as the page itself, the recognition succeeds.
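The pagination test of claim 17 can be sketched as follows. Modelling a row as a list of (anchor_text, href) pairs is an assumption; the claim fixes only the two conditions: a consecutive numeric run, and links that belong to the page's own site.

```python
from urllib.parse import urlparse

def find_pagination_row(tail_rows, page_url):
    """Scan the last rows of the text region for a pagination bar:
    a run of consecutive integers ("1, 2, 3, ...") whose links point
    at the same site as the page itself. Returns the matching row,
    or None if no row qualifies."""
    host = urlparse(page_url).netloc
    for row in tail_rows:
        nums = [int(text) for text, href in row
                if text.strip().isdigit() and urlparse(href).netloc == host]
        if len(nums) >= 2 and nums == list(range(nums[0], nums[0] + len(nums))):
            return row
    return None
```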
18. The method according to claim 12, wherein said step 3.4 comprises:
1) when an article head has been found before the region, revising the text region as follows:
if the article head is inside the region, cutting off the information before the head;
if the article head is outside the region, merging the part between the head and the region into the text region;
2) when article tail information has been found after the end of the region:
if the article tail is inside the region, cutting off the tail part of the region;
if the article tail is outside the region, making no revision.
19. The method according to claim 12, wherein the text-region blocking step in said step 3.5 comprises:
Step 3.5.1, identifying frequent patterns with the MDR method;
Step 3.5.2, for the patterns obtained, searching for block headers, searching upward through the parent nodes of the blocks, and so on, to obtain the best combination of blocking nodes, and then combining them into blocks;
Step 3.5.3, marking all identified blocks in the text-region DOM tree.
20. The method according to claim 19, wherein in said step 3.5.2 the combination into blocks follows these criteria:
(1) among all child nodes of the same parent, the nodes between two marked blocks also form a block, the nodes before the first block form a block, and the nodes after the last block form a block;
(2) if a marked block exists in the subtree of a node, the node itself is also a block.
21. The method according to claim 12, wherein the block-property judgment and redundant-block removal step in said step 3.5 works as follows:
for each block obtained, the ratio of its link text to its total text length is judged;
if a block's link ratio is greater than the threshold (0.5), the block is considered redundant and is removed from the tree, with an hr tag substituted in its place;
the remaining frequent-pattern blocks are marked, because their semantics are clear, so that they are not split in subsequent operations.
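The redundancy test of claim 21 reduces to a single ratio check, sketched below; treating an empty block as redundant is an added assumption not stated in the claim.

```python
def is_redundant_block(link_text_len, total_text_len, threshold=0.5):
    """A block whose link text exceeds the given share of its total
    text (0.5 in the claim) is treated as a redundant navigation or
    recommendation block and removed from the tree."""
    if total_text_len == 0:
        return True  # assumption: an empty block carries no article text
    return link_text_len / total_text_len > threshold
```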
22. The method according to claim 12, wherein in said step 3.6 the per-block filtering of the text region works as follows: first the text region is segmented, the segmentation method being to cut at HTML line-break tags; then a pattern is generated for each segment; then filtering is performed: if the pattern matches one in the wrapper library, the matched pattern's weight is increased and the segment is removed; if it does not match, a new pattern is built and stored in the library with the minimum weight.
23. The method according to claim 11, wherein said web page learning step 7 comprises:
Step 7.1, HTML parsing: parsing the incoming page's HTML and building a DOM tree;
Step 7.2, locating the text region through the text-region recognition method;
Step 7.3, path warehousing and merging:
the path is merged into the path library of the system wrapper; when merging, if the path is already present its weight is revised, i.e. the occurrence frequency of the new path is added to that of the old path; if no duplicate is found, the new path is simply stored;
Step 7.4, extracting the article header and the article pagination information;
Step 7.5, revising the text region;
Step 7.6, partitioning the text region into blocks, while performing the block-property judgment and removing redundant blocks;
Step 7.7, pattern learning: first segmenting the text region, i.e. segmenting the text-region block tree to obtain the text segment sequence, then generating a pattern for each segment and warehousing it for learning;
Step 7.8, pattern induction, i.e. automatic regular-expression generation;
Step 7.9, end.
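The frequency-weighted path merge of step 7.3 can be sketched in a few lines; encoding a path as a hashable tuple and the library as a dict are assumptions for illustration.

```python
def merge_path(path_library, path):
    """Step 7.3 as a sketch: the path library maps a text-region path
    to its occurrence frequency. A repeated path has the new frequency
    added to the old one; a new path is simply warehoused with count 1."""
    key = tuple(path)
    path_library[key] = path_library.get(key, 0) + 1
    return path_library
```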
24. The method according to claim 23, wherein in said step 7.2,
the text is contained in one or more nested Div or Table nodes, and locating the text region means finding the single best Div or Table, realized as the Div or Table with the highest information degree, computed by the formula:

H = len_not_link * log(1 + len_link/len_allTxt) + a * len_not_link * log(1 + len_not_link/len_html)

Wherein:
a is an influence factor, currently defaulting to 0.5;
len_link is the length of the link text in the node;
len_not_link is the length of the non-link text in the node;
len_allTxt is the total text length in the node;
len_html is the HTML length of the node;
during the computation, 1 is added to each argument of log so that every log result is greater than 0.
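The information degree H of claim 24 translates directly into code; guarding against empty nodes is an added assumption, and everything else follows the formula term by term.

```python
import math

def information_degree(len_link, len_not_link, len_html, a=0.5):
    """Information degree H for a candidate Div/Table node, with
    len_allTxt = len_link + len_not_link. The +1 inside each log keeps
    both log terms positive, as the claim notes; a defaults to 0.5."""
    len_all_txt = len_link + len_not_link
    if len_all_txt == 0 or len_html == 0:
        return 0.0  # assumption: an empty node cannot be the text region
    return (len_not_link * math.log(1 + len_link / len_all_txt)
            + a * len_not_link * math.log(1 + len_not_link / len_html))
```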
25. The method according to claim 24, wherein, after said best Div or Table is found, it is traced back in the DOM tree to the body node; when the traceback ends, a path has been formed; during the traceback, the positional information of each DOM node passed, i.e. its left-to-right sequence number among the children of its parent, is also recorded; finally a DOM-tree path is obtained whose nodes carry their positional information.
26. The method according to claim 23, wherein in said step 7.4 the article header is mainly the title information, and its extraction comprises:
(1) taking the first few rows inside the text region, computing the title matching degree of each row and taking the maximum, to obtain the in-region candidate title row; a "row" here is one of the groups of adjacent DOM nodes, together with their corresponding HTML code, formed by splitting the page's DOM tree at HTML line-break tags such as <br> and <p>;
(2) taking the few rows in front of the text region, computing the title matching degree of each row and taking the maximum, to obtain the candidate title row before the region;
(3) then comparing the candidates according to heuristic rules and their title matching degrees, and selecting one as the title.
27. The method according to claim 26, wherein the measurement formula of said title matching degree is as follows:

P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summery/len_max_summery) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len

Wherein:
len_punc is the length of the punctuation in the row;
len_all is the total text length of the row;
len_title is the edit distance between the row content and the page's title field;
len_max_title is the maximum of the lengths of the row content and the page's title field;
keywords refers to the keyword information carried by the page, summery to the abstract field carried by the page, and authortext to the anchor text of the page URL; the meanings of these three groups of variables are analogous to the title variables above;
WH is the tag-type weight: if tags such as h1, h2, ..., center appear among the nodes of the row, the row is given extra weight;
H_len is the content-length weight: large-scale statistics show that title lengths between 16 and 23 characters are the most common, and every other length interval has its own probability of occurrence; the length weight of a node is computed from this distribution;
a, b, c, d, e, f, g are the influence factors of the respective terms and can be revised in applications.
28. The method according to claim 23, wherein in said step 7.4 the article pagination information is recognized by scanning the last few rows of the text region line by line for a numeric sequence; if a run of consecutive numbers such as "1, 2, 3, 4, 5, 6 ..." is found, and the URL links carried by those numbers belong to the same site as the page itself, the recognition succeeds.
29. The method according to claim 23, wherein said step 7.5, with the cues of the news-article format, together with the article header and pagination information from the preceding steps, revises the text region to make it more accurate, comprising:
1) when an article head has been found before the region, revising the text region as follows:
if the article head is inside the region, cutting off the information before the head;
if the article head is outside the region, merging the part between the head and the region into the text region;
2) when the article pagination information has been found after the end of the region:
if the article tail is inside the region, cutting off the tail part of the region;
if the article tail is outside the region, making no revision.
30. The method according to claim 23, wherein the text-region blocking step of said step 7.6 comprises the steps:
Step 7.6.1, identifying frequent patterns with the MDR method;
Step 7.6.2, for the frequent patterns obtained, searching for block headers, searching upward through the parent nodes of the blocks, and so on, to obtain the best combination of blocking nodes, and then combining them into blocks;
Step 7.6.3, marking all identified blocks in the text-region DOM tree.
31. The method according to claim 30, wherein said block combination follows these criteria:
(1) among all child nodes of the same parent, the nodes between two marked blocks also form a block, the nodes before the first block form a block, and the nodes after the last block form a block;
(2) if a marked block exists in the subtree of a node, the node itself is also a block.
32. The method according to claim 23, wherein the block-property judgment and redundant-block removal step of said step 7.6 follows these criteria:
for each block obtained, the ratio of its link text to its total text length is judged;
if a block's link ratio is greater than the threshold (0.5), the block is considered redundant and is removed from the tree, with an hr tag substituted in its place;
the remaining frequent-pattern blocks are marked, because their semantics are clear, so that they are not split in subsequent operations.
33. The method according to claim 23, wherein the pattern learning of said step 7.7 comprises:
said text-region segmentation method:
following the cues of the line-break tags in the text region, the content is segmented, the content between two line-break tags forming one segment;
said pattern generation and pattern learning process:
(1) for each segment, its HTML code is extracted and the HTML fragment is simplified so that only the tag names and the content remain; an md5 key is taken and the pattern is built;
a pattern is expressed as:
Pattern = md5((content: text/img) + forward traversal sequence of the segment's tags + site name) + value
wherein value is the weight information, i.e. the occurrence frequency of the pattern;
(2) the pattern obtained is then put into the wrapper library with a merging warehouse-in:
if an identical pattern is found in the library, the pattern's weight is increased, i.e. the value fields are merged;
if none is found, the pattern is warehoused.
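The pattern key and merging warehouse-in of claim 33 can be sketched as below. The '|'-separated serialization is an assumption; the claim fixes only the md5(content + tag sequence + site name) + value structure.

```python
import hashlib

def build_pattern_key(content, tag_sequence, site_name):
    """md5 over the simplified content, the segment's forward (pre-order)
    tag sequence, and the site name."""
    raw = content + '|' + '>'.join(tag_sequence) + '|' + site_name
    return hashlib.md5(raw.encode('utf-8')).hexdigest()

def warehouse_pattern(library, key):
    """Merging warehouse-in: an identical pattern accumulates weight
    (its value field); a new pattern starts at weight 1."""
    library[key] = library.get(key, 0) + 1
```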
34. The method according to claim 23, wherein the pattern induction of said step 7.8 comprises the steps:
Step 7.8.1: for all patterns in the library, their original strings are extracted and grouped by site; each group is clustered by string similarity, yielding several highly cohesive groupings;
Step 7.8.2: for each grouping obtained, the merged regular expression is computed for every pair of segments within it, yielding all possible distinct regular expressions; these are ranked by occurrence frequency and the most frequent one is taken; the second most frequent is then verified, and if it covers part of the grouping and the covered segments and weight are suitable, it is also kept as a desirable pattern;
the pattern of a pair of segments is extracted by recursively finding the optimum common fragment of the remainders of the two segments; the parts before the fragment are exactly where they differ and need to be merged; overall this is a dynamic-programming method over a two-dimensional table;
Step 7.8.3: among all the regular expressions obtained, those whose weight exceeds a certain threshold are kept and added to the library;
after the pattern induction finishes, several regular expressions are obtained and, with their weight value information attached, go into the library.
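The "optimum common fragment" search at the heart of claim 34's pair merging can be sketched as a longest-common-substring dynamic program over a two-dimensional table; reading "optimum" as "longest" is an assumption, since the claim does not define the optimality criterion.

```python
def best_common_fragment(s, t):
    """Find the longest common fragment of two segments by dynamic
    programming. The differing prefixes before the returned fragment
    are the parts a merged regular expression must generalize over.
    Returns (fragment, start_in_s, start_in_t)."""
    best_len, end_s, end_t = 0, 0, 0
    prev = [0] * (len(t) + 1)   # prev[j]: common suffix length at s[i-1], t[j-1]
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, end_s, end_t = cur[j], i, j
        prev = cur
    return s[end_s - best_len:end_s], end_s - best_len, end_t - best_len
```

Applied recursively to the remainders on each side of the fragment, this yields the common skeleton of the pair, with the differing parts left to be merged into regular-expression alternatives.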
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100083895A CN102609456A (en) | 2012-01-12 | 2012-01-12 | System and method for real-time and smart article capturing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102609456A true CN102609456A (en) | 2012-07-25 |
Family
ID=46526828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012100083895A Pending CN102609456A (en) | 2012-01-12 | 2012-01-12 | System and method for real-time and smart article capturing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102609456A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324522A (en) * | 2013-06-20 | 2013-09-25 | 北京奇虎科技有限公司 | Method and device for scheduling tasks for capturing data from servers |
CN103617264A (en) * | 2013-12-02 | 2014-03-05 | 北京奇虎科技有限公司 | Method and device for grabbing timeliness seed page |
CN105260388A (en) * | 2015-09-11 | 2016-01-20 | 广州极数宝数据服务有限公司 | Optimization method of distributed vertical crawler service system |
CN105550165A (en) * | 2015-12-23 | 2016-05-04 | 深圳市八零年代网络科技有限公司 | Plug-in and method capable of importing webpage article into webpage text editor |
CN105955980A (en) * | 2013-05-31 | 2016-09-21 | 北京奇虎科技有限公司 | File download device and method |
CN106294364A (en) * | 2015-05-15 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Realize the method and apparatus that web crawlers captures webpage |
CN106557334A (en) * | 2015-09-25 | 2017-04-05 | 北京国双科技有限公司 | Determination methods and device that reptile task is completed |
CN107193828A (en) * | 2016-03-14 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Novel webpage capture method and apparatus |
CN107610693A (en) * | 2016-07-11 | 2018-01-19 | 科大讯飞股份有限公司 | The construction method and device of text corpus |
CN108270812A (en) * | 2016-12-30 | 2018-07-10 | 深圳市青果乐园网络科技有限公司 | For obtaining method and system of the article publication with situation of sharing |
CN109918557A (en) * | 2019-03-12 | 2019-06-21 | 厦门商集网络科技有限责任公司 | A kind of web data crawls merging method and computer readable storage medium |
CN111178057A (en) * | 2020-01-02 | 2020-05-19 | 大汉软件股份有限公司 | Content analysis and extraction system of government affair electronic document |
CN111538887A (en) * | 2020-04-30 | 2020-08-14 | 广东所能网络有限公司 | Big data image-text recognition system and method based on artificial intelligence |
CN112488840A (en) * | 2019-09-12 | 2021-03-12 | 京东数字科技控股有限公司 | Information output method and device |
CN112836018A (en) * | 2021-02-07 | 2021-05-25 | 北京联创众升科技有限公司 | Method and device for processing emergency plan |
CN113704589A (en) * | 2021-09-03 | 2021-11-26 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101203847A (en) * | 2005-03-11 | 2008-06-18 | 雅虎公司 | System and method for managing listings |
CN102073683A (en) * | 2010-12-22 | 2011-05-25 | 四川大学 | Distributed real-time news information acquisition system |
CN102096705A (en) * | 2010-12-31 | 2011-06-15 | 南威软件股份有限公司 | Article acquisition method |
CN102402627A (en) * | 2011-12-31 | 2012-04-04 | 凤凰在线(北京)信息技术有限公司 | System and method for real-time intelligent capturing of article |
CN102567530A (en) * | 2011-12-31 | 2012-07-11 | 凤凰在线(北京)信息技术有限公司 | Intelligent extraction system and intelligent extraction method for article type web pages |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105955980A (en) * | 2013-05-31 | 2016-09-21 | 北京奇虎科技有限公司 | File download device and method |
CN103324522A (en) * | 2013-06-20 | 2013-09-25 | 北京奇虎科技有限公司 | Method and device for scheduling tasks for capturing data from servers |
CN103324522B (en) * | 2013-06-20 | 2016-09-28 | 北京奇虎科技有限公司 | The method and apparatus that the task of capturing data from each server is scheduling |
CN103617264A (en) * | 2013-12-02 | 2014-03-05 | 北京奇虎科技有限公司 | Method and device for grabbing timeliness seed page |
CN103617264B (en) * | 2013-12-02 | 2020-07-07 | 北京奇虎科技有限公司 | Method and device for capturing timeliness seed page |
CN106294364A (en) * | 2015-05-15 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Realize the method and apparatus that web crawlers captures webpage |
CN106294364B (en) * | 2015-05-15 | 2020-04-10 | 阿里巴巴集团控股有限公司 | Method and device for realizing web crawler to capture webpage |
CN105260388A (en) * | 2015-09-11 | 2016-01-20 | 广州极数宝数据服务有限公司 | Optimization method of distributed vertical crawler service system |
CN106557334B (en) * | 2015-09-25 | 2020-02-07 | 北京国双科技有限公司 | Method and device for judging completion of crawler task |
CN106557334A (en) * | 2015-09-25 | 2017-04-05 | 北京国双科技有限公司 | Determination methods and device that reptile task is completed |
CN105550165A (en) * | 2015-12-23 | 2016-05-04 | 深圳市八零年代网络科技有限公司 | Plug-in and method capable of importing webpage article into webpage text editor |
CN107193828A (en) * | 2016-03-14 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Novel webpage capture method and apparatus |
CN107610693B (en) * | 2016-07-11 | 2021-01-29 | 科大讯飞股份有限公司 | Text corpus construction method and device |
CN107610693A (en) * | 2016-07-11 | 2018-01-19 | 科大讯飞股份有限公司 | The construction method and device of text corpus |
CN108270812A (en) * | 2016-12-30 | 2018-07-10 | 深圳市青果乐园网络科技有限公司 | For obtaining method and system of the article publication with situation of sharing |
CN108270812B (en) * | 2016-12-30 | 2021-03-23 | 深圳市青果乐园网络科技有限公司 | Method and system for acquiring article publishing and sharing conditions |
CN109918557A (en) * | 2019-03-12 | 2019-06-21 | 厦门商集网络科技有限责任公司 | A kind of web data crawls merging method and computer readable storage medium |
CN112488840A (en) * | 2019-09-12 | 2021-03-12 | 京东数字科技控股有限公司 | Information output method and device |
CN111178057A (en) * | 2020-01-02 | 2020-05-19 | 大汉软件股份有限公司 | Content analysis and extraction system of government affair electronic document |
CN111178057B (en) * | 2020-01-02 | 2024-01-30 | 大汉软件股份有限公司 | Content analysis and extraction system for government electronic documents |
CN111538887A (en) * | 2020-04-30 | 2020-08-14 | 广东所能网络有限公司 | Big data image-text recognition system and method based on artificial intelligence |
CN111538887B (en) * | 2020-04-30 | 2023-11-10 | 贵阳杰汇数字创新中心有限公司 | Big data graph and text recognition system and method based on artificial intelligence |
CN112836018A (en) * | 2021-02-07 | 2021-05-25 | 北京联创众升科技有限公司 | Method and device for processing emergency plan |
CN113704589A (en) * | 2021-09-03 | 2021-11-26 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
CN113704589B (en) * | 2021-09-03 | 2023-10-13 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102609456A (en) | System and method for real-time and smart article capturing | |
CN102567530B (en) | Intelligent extraction system and intelligent extraction method for article type web pages | |
CN102402627B (en) | System and method for real-time intelligent capturing of article | |
CN100485603C (en) | Systems and methods for generating concept units from search queries | |
US8868621B2 (en) | Data extraction from HTML documents into tables for user comparison | |
CN101950312B (en) | Method for analyzing webpage content of internet | |
CN105005600B (en) | Preprocessing method of URL (Uniform Resource Locator) in access log | |
CN105893583A (en) | Data acquisition method and system based on artificial intelligence | |
CN109271477A (en) | A kind of method and system by internet building taxonomy library | |
US20080306941A1 (en) | System for automatically extracting by-line information | |
CN109299480A (en) | Terminology Translation method and device based on context of co-text | |
CN104933168B (en) | A kind of web page contents automatic acquiring method | |
US20060026496A1 (en) | Methods, apparatus and computer programs for characterizing web resources | |
CN101079024A (en) | Special word list dynamic generation system and method | |
CN108416034B (en) | Information acquisition system based on financial heterogeneous big data and control method thereof | |
CN110362824A (en) | A kind of method, apparatus of automatic error-correcting, terminal device and storage medium | |
CN103530429A (en) | Webpage content extracting method | |
CN104199833A (en) | Network search term clustering method and device | |
CN102662969A (en) | Internet information object positioning method based on webpage structure semantic meaning | |
CN104615734B (en) | A kind of community management service big data processing system and its processing method | |
CN104598536B (en) | A kind of distributed network information structuring processing method | |
CN108959580A (en) | A kind of optimization method and system of label data | |
CN103942268A (en) | Method and device for combining search and application and application interface | |
CN103870495A (en) | Method and device for extracting information from website | |
US11334592B2 (en) | Self-orchestrated system for extraction, analysis, and presentation of entity data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120725 |