CN102609456A - System and method for real-time and smart article capturing - Google Patents
Abstract
A system for real-time, intelligent article crawling comprises a real-time crawling module, a web page extraction system, a near-duplicate document removal module, an automatic document classification module, and an article publishing module. The real-time crawling module further comprises seven online sub-modules, namely a job extraction module, a job parsing module, a job crawl-time-range checking module, a job crawl-interval checking module, a job scheduling module, a job downloading module, and a job crawl-frequency adjustment module, and three offline sub-modules, namely a job crawl-time-range discovery module, a job crawl-interval discovery module, and a free-proxy collection and verification module. The intelligent extraction system for article-type web pages comprises a module for loading pages awaiting extraction, a wrapper query module, a web page extraction module, a module for collecting pages that failed extraction, a learning decision module, a web page learning module, and an extraction wrapper management module.
Description
Technical field
The present invention relates to the fields of crawling technology, web mining, information extraction, and natural language processing within Internet technology. It can be applied wherever articles must be crawled accurately, in real time, and at large scale, such as portal websites and search-engine websites.
Background art
The present invention also offers advantages that traditional crawling systems lack:
Through same-site learning, non-article pages on a website, such as channel pages, special-topic pages, list pages, and advertising pages, can be filtered out automatically;
Crawled articles are deduplicated by near-duplicate document detection;
Crawled articles can be semantically analyzed: classified automatically, with summaries and keywords generated automatically;
Paging sequences of up to 50 pages per article can be located accurately, and the paged content merged in order;
The crawl scope of a website can be configured flexibly, supporting crawling of the articles under one or more list areas on a site, a channel, or any page.
In practice, the articles reprinted by this crawling system are of high quality and can be published directly to users. The system adapts automatically to template changes across the thousands of websites it crawls, greatly reducing the manual effort crawling requires, substantially improving the news coverage and timeliness of portal websites, and lowering their labor costs.
The invention has application scenarios in all portal-type websites, where it can effectively improve news coverage and timeliness while reducing labor costs, and it can likewise be used in news search engines.
The information extraction field today has many technical solutions, whose common core problem is how to generate and maintain extraction wrappers. The existing techniques fall mainly into two categories:
1) Extraction systems whose wrappers are generated automatically by machine can crawl articles in bulk, but cannot extract articles precisely, so the usability of the crawled articles is low;
2) Extraction systems whose wrappers are generated by hand extract articles precisely, but generating and maintaining wrappers for thousands of Internet websites can only be done with enormous manual effort.
The extraction module of the present invention is built around a self-developed method, "automatic article extraction based on same-site learning and automatic rule generation," which solves both of the above problems well.
In practice, the present technical scheme achieves automatic machine generation and maintenance of extraction wrappers, so that extraction no longer requires large amounts of manual work; at the same time it achieves precise article extraction, with very little redundancy or omission in the results, giving very high usability.
The technical terms used in the present invention are explained as follows:
Extraction wrapper: web page information extraction is a branch of information extraction, and wrapper generation for web page information extraction has developed into a relatively independent field. A wrapper is a program composed of a set of extraction rules and the computer code that applies those rules, dedicated to extracting the needed information from a specific information source and returning the results;
Article extraction based on same-site learning and automatic rule generation: the automatic wrapper-generation method of the present invention, which can extract article information from web pages accurately and intelligently;
Same-site learning: taking the website as the unit, collecting a sufficient number of pages from one site and performing machine statistical learning on them together, then deriving the needed rules from the result;
Crawler (or fetching crawler): the module in the crawling system that is solely responsible for downloading pages;
The extraction wrapper developed for this system comprises two libraries:
Style tree (path) library:
A collection of styles. A style is the path, together with its weight information, constructed by walking upward in the DOM tree from a given DOM node until the body node is reached. In the library, paths are organized per website; identical paths are merged, and their frequency is recorded as the weight;
Pattern library:
A "pattern" here comprises
1) a signature for each segment produced by the segmentation step of the method:
pattern = md5((content: text/img) + pre-order traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e. the pattern's frequency of occurrence;
2) a regular expression generated automatically by statistical learning over these segments:
pattern = regular expression.
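The segment signature above can be sketched in Python. The field separators, function name, and dict layout below are illustrative assumptions, not the patent's exact encoding:

```python
import hashlib

def build_pattern(text_or_img, tag_sequence, site_name, value=1):
    """Build a segment signature for the pattern library: an MD5 over the
    segment content, the pre-order traversal sequence of its tags, and the
    site name, plus a value field carrying the occurrence frequency."""
    raw = f"{text_or_img}|{'>'.join(tag_sequence)}|{site_name}"
    key = hashlib.md5(raw.encode("utf-8")).hexdigest()
    return {"key": key, "value": value}

p = build_pattern("hot news today", ["div", "p", "a"], "example.com")
```

Because the key hashes content, tag structure, and site together, an identical segment on the same site always maps to the same library entry, which is what makes merge-on-insert possible.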
Proxy technology:
Proxy technology means that after a proxy server receives a client request, it checks and verifies the request's legitimacy; if legitimate, the proxy fetches the required information as if it were a client and forwards it back to the client;
Real-time crawling:
A crawling technique that emphasizes timeliness. The goal is to fetch content in real time, as soon as the source site publishes an update.
Summary of the invention
The present invention solves the problems described above.
The real-time intelligent article crawling system according to the present invention comprises a real-time crawling module, a web page extraction system, a near-duplicate document removal module, an automatic document classification module, and an article publishing module.
The real-time crawling module comprises online and offline sub-modules. The online sub-modules are as follows:
Job extraction module: extracts jobs in turn from the job set;
Job parsing module: parses each job; the parse result forms a set of attributes and rules;
Job crawl-time-range checking module: queries the job's time-range parameter; if the range does not include the current time, the job is skipped without crawling; otherwise, proceed to the crawl-interval check;
Job crawl-interval checking module: queries the job's crawl interval; if the interval specifies a next crawl time later than the current time, the job is skipped without crawling; otherwise, proceed to crawl;
Job scheduling module: schedules the job according to the other attributes obtained by the parsing module. At scheduling time, if the job existed before, it is not reassigned and keeps its original server; otherwise, the server with the fewest jobs in the cluster is selected, balancing the crawl load across servers and thereby optimizing overall crawl speed; at the same time, the scheduler avoids placing too many same-site jobs on one server, so that a single server does not put too much crawl pressure on a single website;
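The scheduling policy just described, sticky assignment for existing jobs, least-loaded server otherwise, capped per-site concentration, can be sketched as follows. The data shapes and the cap value are illustrative assumptions:

```python
from collections import Counter

def schedule(job, assignments, job_counts, site_counts, same_site_cap=3):
    """Pick a server for a job: an existing job keeps its server; a new job
    goes to the least-loaded server among those not already running too many
    jobs for this job's site."""
    if job["id"] in assignments:          # existing job keeps its original server
        return assignments[job["id"]]
    candidates = [s for s in job_counts
                  if site_counts[s][job["site"]] < same_site_cap]
    if not candidates:                    # every server is at the same-site cap
        candidates = list(job_counts)
    server = min(candidates, key=lambda s: job_counts[s])
    assignments[job["id"]] = server
    job_counts[server] += 1
    site_counts[server][job["site"]] += 1
    return server
```

The same-site cap is what prevents one server from concentrating all of a single website's jobs and overloading that site.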
Job download module: performs the actual download. It takes an appropriate number of proxies from the proxy pool, typically 5; if no usable proxy is available, crawling proceeds without one. The no-proxy option is merged with the 5 proxies to form the proxy list; according to the parameters obtained from parsing, one entry is chosen from the list at random for this round of downloading;
Job crawl-frequency adjustment module: based on the job's base crawl interval, if the current round found an update, the next round's interval is shortened at random; if no update was found, the next round's interval is lengthened at random; in either case the interval is kept within [0.5, 2] times the job's base interval;
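The frequency adjustment above, shrink on update, grow otherwise, clamped to [0.5, 2] times the base, can be sketched directly. The specific random scaling factors are illustrative assumptions; only the clamp range comes from the text:

```python
import random

def adjust_interval(base, current, found_update):
    """Adaptive crawl-interval adjustment: shrink the interval at random when
    an update was found this round, lengthen it otherwise, and clamp the
    result to [0.5, 2] times the job's base interval."""
    factor = random.uniform(0.6, 0.9) if found_update else random.uniform(1.1, 1.5)
    return min(max(current * factor, 0.5 * base), 2.0 * base)
```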
The real-time crawling module also comprises the following offline sub-modules:
Job crawl-time-range discovery module: intelligently analyzes historical crawl logs and derives from them the active time range of each job;
Job crawl-interval discovery module: reads the previous day's crawl log, analyzes every round of each job over the day, and derives each job's update behavior. If the job's current base interval already lets more than 50% of rounds catch an update, it is left unchanged; otherwise the base interval is enlarged appropriately, to reduce pointless crawl requests;
Free-proxy collection and verification module: downloads the day's free proxy lists from proxy-sharing sites on the Internet and verifies them. A verification URL set is formed by sampling historical crawl links at random; each proxy is used to fetch these URLs several times; proxies that fail to crawl or crawl too slowly are weeded out, and those with high success rates and good speed are kept to form the day's proxy pool, which supports the online crawling.
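The keep-or-discard decision for a verified proxy can be sketched as a pure filter over the trial results. The input shape and the exact thresholds are illustrative assumptions; the text specifies only "high success rate" and "good speed":

```python
def filter_proxies(trials, min_success=0.8, max_avg_secs=3.0):
    """Given per-proxy trial results as {proxy: (successes, total, total_secs)},
    keep only proxies with a high enough success rate and a low enough
    average fetch time, forming the day's proxy pool."""
    kept = []
    for proxy, (ok, total, secs) in trials.items():
        if total and ok / total >= min_success and secs / total <= max_avg_secs:
            kept.append(proxy)
    return kept
```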
The web page extraction system of the present invention is an intelligent extraction system for article-type web pages, comprising:
(1) a page-loading module, which periodically queries the local index and loads newly indexed pages into the system's memory;
(2) a wrapper query module, which, for each page awaiting extraction, queries for a concrete extraction wrapper; if one is found, the page enters the extraction module for concrete extraction; otherwise, the page is marked as having failed extraction;
(3) a web page extraction module, which extracts the concrete article information from a page using an existing extraction wrapper;
(4) a failed-page collection module, which collects the pages that failed extraction in this round, grouped by website, so that learning can be focused;
(5) a learning decision module, which queries the failed pages by site and, from each site's failure count, computes the site's extraction success/failure ratio for this round to decide whether to enter the web page learning module;
(6) a web page learning module, which performs machine learning on all failed pages and finally generates a new extraction wrapper;
(7) an extraction wrapper management module, which manages the system's extraction wrappers, i.e. the path library and the pattern library; it provides a wrapper-use interface to the web page extraction module and a wrapper-update interface to the web page learning module.
The web page extraction module further comprises:
an HTML parsing module, which parses the incoming page's HTML and builds a DOM tree;
a body-region locating module, which locates the body (main text) region according to the wrapper information;
an article header and paging extraction module, used to extract the article header and the article's paging information;
a body-region correction module, used to correct the body region;
a body-region blocking module, used to partition the body region into blocks, while judging block properties and removing redundant blocks;
a segmentation and filtering module, used to segment the body region and filter blocks;
a data assembly module, used to merge and organize the information into the final article-type result.
The web page learning module further comprises:
an HTML parsing module, which parses the incoming page's HTML and builds a DOM tree;
a body-region locating module, used to locate the body region;
a path library update module, used to insert paths into the library with merging, while also tidying the path library;
an article header and paging extraction module, used to extract the article header and paging information;
a body-region correction module, used to correct the body region;
a body-region blocking module, used to partition the body region into blocks, while judging block properties and removing redundant blocks;
a pattern learning module, which segments the body region, builds a pattern for each block, and inserts it into the pattern library with merging;
a pattern induction module, which induces over all patterns, generates rules, and inserts them into the pattern library with merging;
a wrapper tidying module, which tidies the system's wrappers and removes invalid information.
The body-region blocking module further comprises:
a frequent-pattern recognition module, which identifies frequent patterns using the MDR method;
a blocking module, which, for each frequent pattern found, locates block heads and searches upward for block parent nodes to obtain the best combination of blocking nodes, then combines them into blocks;
a block marking module, which marks all identified blocks in the body region's DOM tree.
The present invention also provides a real-time intelligent article crawling method, comprising a real-time crawling step, a web page extraction step, a near-duplicate document removal step, an automatic article classification step, and an article publishing step.
The real-time crawling step comprises online and offline sub-steps, wherein:
the online sub-steps comprise:
Step 4: interval check. Query the job's crawl interval; if the interval specifies a next crawl time later than the current time, do not crawl and return to step 1; otherwise, continue to the next step;
Step 7: crawl-frequency adjustment. Based on the job's base crawl interval, if this round found an update, shorten the next round's interval at random; if no update was found, lengthen the next round's interval at random; in either case keep the interval within [0.5, 2] times the base. When the adjustment is done, return to step 1 and repeat the whole flow;
The real-time crawling step also comprises the following offline sub-steps:
The web page extraction step also describes an intelligent extraction method for article-type web pages, comprising the steps:
Step 4: mark extraction failure. Mark the failed pages and collect them so that step 6 is convenient, then go to step 2;
Step 7: web page learning. Learn from all failed pages of each website and generate new extraction wrappers;
Step 9: end.
The core links of the intelligent extraction method for article-type pages are the extraction link and the learning link. The extraction link, i.e. step 3 above, comprises the following steps:
Step 3.1: HTML parsing. Parse the incoming page's HTML and build a DOM tree.
First preprocess the HTML, including character-encoding conversion, script/style filtering, and removal of invisible characters; then, following the HTML code and the HTML standard, use the HtmlParser component to parse the page and obtain the DOM tree;
Step 3.2: locate the body region. Query the style tree of the extraction wrapper for this website's location path to obtain the body-region path, then traverse the DOM tree along that path to reach the concrete DOM node; that node is the body region we are looking for;
Step 3.3: extract the article header and paging information. The article header is mainly the article title; the extraction steps comprise:
(1) take the first few "lines" inside the body region, compute the title matching degree of each, and take the maximum, yielding the candidate title line inside the region. A "line" here means one of the groups of adjacent DOM nodes formed by splitting the whole page's DOM tree at HTML line-break tags such as <br> and <P>, together with its corresponding HTML code;
(2) take the few lines just before the body region, compute their title matching degrees, and take the maximum, yielding the candidate title line before the region;
(3) then compare the two candidates by heuristic rules and title matching degree, and choose one as the title.
The title matching degree is measured by the following formula:
P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summary/len_max_summary) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len
where:
len_punc is the length of punctuation in the line;
len_all is the total text length of the line;
len_title is the edit distance between the line's content and the page's title field;
len_max_title is the maximum of the lengths of the line content and the page's title field;
keywords refers to the keyword information carried by the page, summary refers to the abstract field carried by the page, and authortext refers to the anchor text corresponding to the page's URL; these three groups of variables have meanings analogous to the above;
WH is a tag-type weight: if tags such as h1, h2, ..., center appear among the nodes under the line, the node is weighted up;
H_len is a node content-length weight: large-scale statistics show that title lengths between 16 and 23 characters are the most common, and every other length interval has its own probability, from which the node's length weight is computed;
a, b, c, d, e, f, g are influence factors for the respective terms and can be tuned in application.
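A minimal sketch of the scoring formula above, keeping only the punctuation and edit-distance terms; the tag, keyword, summary, and length-weight terms are omitted, and the default weights are assumptions:

```python
import string

def title_match_degree(line, page_title, weights=None):
    """Simplified title-matching-degree score: the punctuation-ratio term
    plus the normalized edit-distance term from the formula above."""
    w = weights or {"a": 0.3, "b": 0.7}
    a, b = w["a"], w["b"]
    len_all = max(len(line), 1)
    len_punc = sum(ch in string.punctuation for ch in line)
    len_title = edit_distance(line, page_title)
    len_max_title = max(len(line), len(page_title), 1)
    return a * (1 - len_punc / len_all) + b * (1 - len_title / len_max_title)

def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]
```

A line identical to the page's title field with no punctuation scores the maximum; a line of pure punctuation scores near zero, which is the intended ranking behavior.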
Article paging information is recognized by scanning the last few lines of the body region, line by line, for number sequences. If a consecutive number sequence such as "1, 2, 3, 4, 5, 6, ..." is found, and the URL links those numbers carry belong to the same website as the page, recognition succeeds.
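The paging check above can be sketched as follows. The input shape, each trailing line as a (text, link-hosts) pair, is an assumption about how the line data would be prepared:

```python
import re

def find_paging(lines, page_host):
    """Scan trailing body-region lines for a run of consecutive integers
    whose links all share the page's own host; return the page numbers
    on success, None otherwise."""
    for text, hosts in reversed(lines):
        nums = [int(n) for n in re.findall(r"\d+", text)]
        consecutive = (len(nums) >= 2 and
                       all(b - a == 1 for a, b in zip(nums, nums[1:])))
        same_site = hosts and all(h == page_host for h in hosts)
        if consecutive and same_site:
            return nums
    return None
```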
Step 3.4: correct the body region.
With the help of the cues in news-article layout, and combining the article header found in the step above with the article-tail information (paging), the body region can be corrected to be as accurate as possible:
1) when the article header (title, time, etc.) is found before the region, correct the body region as follows:
if the article header falls inside the region, cut off the in-region material before it;
if the article header is outside the region, merge the intervening part into the body region;
2) when article-tail information (paging, etc.) is found after the end of the region:
if the article tail falls inside the region, cut off the in-region trailing part;
if the article tail is outside the region, make no correction.
Step 3.5: partition the body region into blocks. This comprises blocking, block-property judgment, and redundant-block removal. The blocking step is as follows:
Step 3.5.1: identify frequent patterns using the MDR method (proposed by Bing Liu);
Step 3.5.2: for each pattern found, locate block heads and search upward for block parent nodes to obtain the best combination of blocking nodes, then combine them into blocks;
Step 3.5.3: mark all identified blocks in the body region's DOM tree.
Meanwhile, the following criteria are applied when building the block tree:
(1) among all children of the same parent, the nodes between two marked blocks also form a block, the nodes before the first block form a block, and the nodes after the last block form a block;
(2) if a node's subtree contains a marked block, the node itself is also a block.
The block-property judgment and redundant-block removal for the body region are performed concretely as follows:
(1) for each block obtained, compute the ratio of its link text to its total text;
(2) if a block's link ratio exceeds a threshold (0.5), the block is considered redundant and is removed from the tree, replaced with an hr tag;
(3) the remaining blocks identified from frequent patterns carry clear semantic information, so they are marked to prevent them from being split in subsequent operations (for example, a TV-schedule table).
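The redundancy test above is a one-line ratio check; only the 0.5 threshold comes from the text:

```python
def is_redundant(block_text, link_text, threshold=0.5):
    """Redundant-block test: a block whose link text makes up more than
    half of its total text is treated as navigation or advertising noise
    and removed from the block tree."""
    total = len(block_text)
    return total > 0 and len(link_text) / total > threshold
```

A block that is mostly anchor text (a link bar, a "related articles" list) trips the threshold, while an article paragraph containing one link does not.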
Step 3.6: segment and filter the body region.
The block tree of the body region is segmented, yielding the body's segment sequence. Segmentation is done because observation shows that redundant information always appears in the form of one segment or several segments, so segmenting the body makes removing redundant information convenient in subsequent operations.
The segments of the body region are then filtered block by block:
(1) generate patterns. For every segment, extract its HTML code, simplify the HTML fragment so that only tag names and content remain, take the MD5 key, and build the pattern:
pattern = md5((content: text/img) + pre-order traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e. the pattern's frequency of occurrence;
(2) filter blocks. The patterns obtained are looked up in the wrapper's pattern library and merged on insertion:
if an identical pattern is found in the library, the pattern is weighted up, i.e. the value fields are merged;
if no identical pattern is found, the pattern is inserted.
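The merge-on-insert behavior above is an upsert keyed by the pattern's MD5 signature; representing the library as a dict from key to value (frequency) is an assumption:

```python
def merge_pattern(library, key, value=1):
    """Insert-or-merge for the pattern library: an existing pattern has its
    value (occurrence frequency) increased by the new value; a new pattern
    is simply inserted. Returns the pattern's current weight."""
    library[key] = library.get(key, 0) + value
    return library[key]
```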
Step 3.7: data assembly and result generation. Merge and organize the information, generate the summary, etc.; extraction succeeds;
Step 3.8: extraction ends.
The learning link, i.e. step 7 above, comprises:
Step 7.1: HTML parsing. Parse the incoming page's HTML and build a DOM tree;
Step 7.2: locate the body region, through the body-region recognition method.
The purpose of locating the body region is to find a reasonable candidate area for the body text first, which narrows the DOM-tree scope the method must process and at the same time reduces the method's error probability.
The body text is contained in one or more nested Div or Table nodes, so body localization is exactly the search for the right such Div or Table. It is realized by choosing the Div or Table with the highest information degree, computed by the formula:
H = len_not_link * log(1 + len_link/len_allTxt) + a * len_not_link * log(1 + len_not_link/len_html)
where:
a is an influence factor, currently defaulting to 0.5;
len_not_link is the length of non-link text in the node;
len_link is the length of link text in the node;
len_allTxt is the total text length of the node;
len_html is the HTML length of the node;
the +1 added inside each log keeps every log result greater than 0.
After the wanted Div or Table is found, trace back in the DOM tree to the body node; when the trace ends, a path has been formed. During the trace, the positional information of each DOM node passed through, i.e. its left-to-right ordinal among its parent's children, is also recorded.
Finally, a DOM-tree path is obtained, whose nodes each carry their positional information.
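The information-degree formula above transcribes directly into code; nothing beyond the formula itself is assumed here:

```python
import math

def information_degree(len_not_link, len_link, len_all_txt, len_html, a=0.5):
    """Information degree H used to pick the body Div/Table: both terms are
    scaled by the non-link text length, and the +1 inside each log keeps
    the log results positive."""
    return (len_not_link * math.log(1 + len_link / len_all_txt)
            + a * len_not_link * math.log(1 + len_not_link / len_html))
```

Because both terms scale with len_not_link, a prose-heavy node outscores a link-heavy navigation node of similar size, which is what steers the search toward the article body.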
Step 7.3: path insertion and merging. The path above is incorporated into the system wrapper's path library, merging with weighting on insertion:
if a duplicate path is found, merge with weighting, i.e. revise the DFS field by adding the new path's DFS value to the old path's;
if no duplicate is found, simply insert the new path;
Step 7.4: extract the article header and paging information. The title extraction procedure, the title-matching-degree measurement formula together with its variables and influence factors, and the paging recognition method are all identical to those described in step 3.3 of the extraction link;
Step 7.5: correct the body region. The correction rules, using the article header and article-tail (paging) information to trim or extend the region, are identical to those of step 3.4;
Step 7.6: partition the body region into blocks, comprising blocking, block-property judgment, and redundant-block removal. The blocking steps 7.6.1 to 7.6.3, the block-tree criteria, and the block-property judgment and redundant-block removal methods are identical to those of step 3.5;
Step 7.7: pattern learning, comprising two major steps, body-region segmentation and pattern-by-pattern learning.
First the body region is segmented: the block tree of the body region is segmented into the body's segment sequence, for the same reason as in step 3.6, since redundant information appears in the form of segments, segmenting the body makes it convenient to remove.
After segmentation, a pattern is generated for every segment: extract its HTML code, simplify the HTML fragment so that only tag names and content remain, take the MD5 key, and build
pattern = md5((content: text/img) + pre-order traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e. the pattern's occurrence count.
Then learn pattern by pattern: each pattern obtained is put into the wrapper's pattern library with merge-on-insert. If an identical pattern is found in the library, the pattern is weighted up, i.e. the value fields are merged; if not found, the pattern is simply inserted;
Step 7.8: pattern induction, i.e. automatic regular-expression generation. The concrete steps are as follows:
Step 7.8.1: take the original strings of all patterns in the library, group them by website, and cluster each group by string similarity, obtaining several highly cohesive clusters;
Step 7.8.2: within each cluster, compute the merged regular expression for every pair of segments, obtaining all possible distinct regular expressions. Rank these by frequency of occurrence and take the most frequent one; then examine the second most frequent, and if it covers part of the cluster with suitable coverage and weight, it is also kept as a desirable pattern. Extracting the pattern of two segments works by recursively finding the optimal common fragments of the two segments' remainders; the parts before each common fragment are exactly the differing parts that must be merged. Overall this is a dynamic-programming method over a two-dimensional table;
Step 7.8.3: of all the regular expressions obtained, keep those whose weight exceeds a certain threshold and add them to the pattern library.
After pattern induction finishes, a number of regular expressions are obtained; they go into the pattern library together with their weight information;
Step 7.9, end.
All of the learning steps above ultimately update two libraries: the style tree library (path library) and the pattern library; the two updated libraries are consolidated into the overall wrapper library, completing all learning steps.
The present invention also has advantages that conventional crawling systems lack:
1) through same-site learning, non-article pages of a site, such as channel pages, topic pages, list pages, and advertisement pages, can be filtered out automatically;
2) crawled articles are deduplicated by near-duplicate document detection;
3) crawled articles can be understood semantically and classified automatically, with summaries and keywords generated automatically;
4) paging sequences of up to 50 pages per article can be located accurately, and the paged content merged in order;
5) the crawling scope of a site can be configured flexibly, supporting the crawling of articles under one or more list areas of a site, under a channel, or under any page;
In practical application, the articles crawled by this system are of high quality and can be published directly to end users; at the same time the system adapts automatically to template changes across the thousands of crawled sites, greatly reducing the manual effort that crawling requires, substantially improving the news coverage and timeliness of portal sites, and lowering their labor costs.
This patent has application scenarios in all portal sites, where it can effectively improve news coverage and timeliness while reducing labor costs.
It can likewise be applied in news search engines.
Description of drawings
Fig. 1 is the module structure diagram of the system;
Fig. 2 is the data flowchart of the system;
Fig. 3 is the structure diagram of the online modules of the real-time crawling module;
Fig. 4 is the structure diagram of the offline modules of the real-time crawling module;
Fig. 5 is the online operation flowchart of the real-time crawling module;
Fig. 6 is the offline operation flowchart of the real-time crawling module;
Fig. 7 is the module structure diagram of the article-page intelligent extraction system of the present invention;
Fig. 8 is the module structure diagram of the web page extraction module;
Fig. 9 is the module structure diagram of the web page learning module;
Fig. 10 is the module structure diagram of the body-region blocking module;
Fig. 11 is an overall flowchart of the automatic article extraction method based on same-site learning and automatic rule generation;
Fig. 12 is an overall flowchart of the automatic article extraction method based on same-site learning and automatic rule generation;
Fig. 13 is a flowchart of the learning phase of the automatic article extraction method based on same-site learning and automatic rule generation;
Fig. 14 is the data flowchart of the body-region blocking module;
Fig. 15 is a schematic diagram of body-region correction in the extraction method;
Figs. 16-19 are drawings of the overall working example of the system and of the real-time crawling module example;
Figs. 20-28 show a web page extraction example, based on ifeng.com (Phoenix Net), of the system's web page extraction module.
Embodiment
The crawling system of the present invention consists of 5 modules or subsystems in total, as shown in Fig. 1, comprising: the real-time crawling module, the web page extraction system, the near-duplicate document deduplication module, the automatic document classification module, and the article publishing module.
The overall system data flow is shown in Fig. 2; the concrete steps are as follows:
Step 4: the web page extraction system periodically queries the local crawl index; when a new index entry is found, all web pages downloaded in step 3 are loaded into the system site by site according to the index and are extracted concretely according to the "automatic article extraction algorithm based on same-site learning and automatic rule generation" of the present invention; if extraction fails, automatic learning is performed site by site, so that extraction wrappers are generated automatically and the next round of extraction succeeds; extraction also includes the automatic summarization module and the automatic keyword generation module, which generate the summary and keyword information of the article;
Step 7: the article publishing module periodically queries the local index; when a new index entry is found, the article is loaded into the system according to the index and published through the web to the concrete content system.
The article real-time intelligent crawling system of the present invention comprises the real-time crawling module, the web page extraction system, the near-duplicate document deduplication module, the automatic document classification module, and the article publishing module.
The detailed technical scheme of the real-time crawling module is as follows:
When highly real-time crawling is required, i.e., it is hoped that content updates on the target site are fetched within 1-3 minutes, connection and download requests must be issued frequently to the target web server; in actual crawling this puts excessive pressure on the target server, which then adopts blocking strategies, making our crawling unstable or even failing.
Highly real-time crawling also consumes hardware resources such as network bandwidth heavily, driving up cost.
Many existing crawling systems solve the above problems by controlling crawl frequency and adding crawl servers, to ensure real-time performance and crawl safety.
The real-time crawling module of the present invention achieves various real-time crawling schemes by combining rational job scheduling, a dynamic self-adaptive method for job crawl intervals, an automatic discovery method for the daily crawl time range of each job, and active proxy collection and verification.
Compared with other real-time crawling techniques, this scheme has lower cost and a simpler structure.
The real-time crawling module is divided into online and offline parts.
It comprises 7 online modules: the job extraction module, the job parsing module, the job crawl time range check module, the job crawl time interval check module, the job scheduling module, the job download module, and the job crawl frequency adjustment module; and 3 offline modules: the job crawl time range discovery module, the job crawl time interval discovery module, and the free proxy collection and verification module.
The job extraction module extracts one job at a time, in turn, from the job set;
The job parsing module parses each job; the parsing result forms a set of attributes and rules;
The job crawl time range check module queries the job's time range parameter; if the time range does not include the current time, the job is skipped without crawling; otherwise the crawl time interval check is entered;
The job crawl time interval check module queries the job's crawl time interval; if the interval specifies a next crawl time later than the current time, the job is skipped without crawling; otherwise the job is crawled;
The job scheduling module schedules jobs according to the other job attributes produced by the job parsing module. At scheduling time, if the job already existed, it is not reassigned and continues to be crawled by its original server; otherwise the server in the cluster currently holding the fewest jobs is selected, balancing the crawl load and thereby optimizing the overall crawl speed; at the same time, jobs of the same site are kept off the same server as far as possible, to prevent a single server from exerting too much crawl pressure on a single site;
The job download module performs the concrete download. It fetches an appropriate number of proxies, generally 5, from the proxy pool; if no proxy is available, crawling proceeds without one; the no-proxy option and the 5 proxies above are merged into a proxy list; according to the job parameters obtained by parsing, one proxy is chosen at random from the list to perform this round of the job's download; the download uses a conventional page download engine;
The job crawl frequency adjustment module works from the job's crawl interval base: if this round of crawling found an update, the next round's crawl interval is randomly reduced; if no update was found, the next round's crawl interval is randomly enlarged; in either case the interval is kept within [0.5, 2] times the job's crawl interval base;
Said real-time crawling module also comprises the following offline submodules:
The job crawl time range discovery module performs intelligent analysis of the historical crawl logs, deriving the crawl time range of each job;
The job crawl time interval discovery module reads the previous day's crawl logs, analyzes the crawl outcome of every round of each job in that day, and derives each job's update behavior; if more than 50% of a job's rounds captured an update, its current interval base is left unchanged; otherwise the current interval base is enlarged appropriately, to reduce pointless crawl requests;
The free proxy collection and verification module downloads the day's free proxy lists from proxy sharing sites on the internet and verifies these free proxies: a verification URL set is formed by randomly selecting links from the crawl history; each proxy is used to crawl the URLs several times; proxies that cannot crawl successfully or crawl too slowly are weeded out, while proxies with a high success rate and high speed are kept, forming the proxy pool of the day, which provides proxy support for online crawling. In the best case we can download a single site through 5-10 proxies; compared with a conventional crawl engine, this greatly reduces the frequency with which any single crawl server's IP appears, gives a modest improvement in our crawl network quality, and greatly reduces the risk of any single server being blocked.
The online modules perform the actual crawling of each job, running whenever there are crawl jobs pending; the offline modules only provide data and resource support for the online modules, such as a refreshed proxy pool, and run once per day in idle time. Because their operations are relatively time-consuming, they are placed offline so as not to affect the online modules.
The online operation flow of the real-time crawling module is as follows (Fig. 5):
Step 4, time interval check: query the job's crawl time interval; if the interval specifies a next crawl time later than the current time, do not crawl and return to step 1; otherwise enter the next step;
Step 7, crawl frequency adjustment: based on the job's crawl interval base, if this round found an update, randomly reduce the next round's crawl interval, generally by a factor of 0.2; if no update was found, randomly enlarge the next round's interval, generally by a factor of 0.2; finally ensure that the next round's crawl interval stays within [0.5, 2] times the job's interval base; after the frequency adjustment completes, return to step 1 and repeat the whole flow;
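The adaptive interval update of step 7 can be sketched as follows. The 0.8/1.2 factors stand in for the "generally a factor of 0.2" adjustment, the jitter range is a hypothetical choice, and the [0.5, 2]x clamp follows the text:

```python
import random

def adjust_interval(base: float, current: float, found_update: bool) -> float:
    """Adaptive crawl-interval update: shrink on an update, grow otherwise,
    clamped to [0.5, 2] times the job's interval base (seconds)."""
    factor = 0.8 if found_update else 1.2   # ~0.2x step, per the text
    jitter = random.uniform(0.9, 1.1)       # hypothetical randomization
    nxt = current * factor * jitter
    return min(max(nxt, 0.5 * base), 2.0 * base)
```

A job that keeps yielding updates converges toward 0.5x its base interval; a quiet job drifts out to 2x, reducing pointless requests without ever leaving the configured band.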
The offline flow of the crawling system comprises crawl time range discovery, crawl time interval discovery, and proxy collection and verification; these steps provide knowledge for the online crawl flow, such as job time ranges and valid proxies;
The offline part generally runs once a day between 0:00 and 6:00, when crawling is relatively idle.
Its concrete steps are as follows (Fig. 6):
The verification method is to randomly extract, from the crawl history, crawlable URLs numbering 3 times the number of proxies; each proxy is then randomly assigned 3 of these URLs to crawl, earning a reward of +1 point per success and a penalty of -1 point per failure; a crawl completed within 5 seconds earns a further +5 point reward, and within 10 seconds a +2 point reward;
After verification, proxies scoring below 2 points are generally excluded, thereby weeding out proxies that cannot crawl successfully or crawl too slowly; proxies with a high success rate and high speed are kept, forming the proxy pool of the day, which provides proxy support for online crawling.
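A minimal sketch of the scoring and filtering rules just described (+1/-1 per success/failure, +5 bonus under 5 s, +2 under 10 s, discard below 2 points); the data shapes are hypothetical:

```python
def score_proxy(results) -> int:
    """results: list of (success: bool, seconds: float) for one proxy."""
    score = 0
    for ok, secs in results:
        if ok:
            score += 1
            if secs < 5:
                score += 5    # fast-crawl bonus
            elif secs < 10:
                score += 2    # moderate-speed bonus
        else:
            score -= 1        # failure penalty
    return score

def build_pool(proxy_results: dict, threshold: int = 2) -> list:
    """Keep proxies whose score reaches the threshold; the rest are weeded out."""
    return [p for p, res in proxy_results.items() if score_proxy(res) >= threshold]
```

A proxy that succeeds quickly on all three assigned URLs scores 18 and enters the day's pool; one that is slow or failing falls below the cutoff and is discarded.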
The web page extraction system of the present invention is also an article-page intelligent extraction system; its detailed technical scheme is as follows.
The information extraction field already has many technical schemes, and the core of all of them is how to generate and maintain extraction wrappers. Technically they fall mainly into two types:
1) extraction systems using automatically machine-generated extraction wrappers can crawl articles in large volume but cannot achieve accurate article extraction, so the usability of the crawled articles is low;
2) extraction systems using manually written extraction wrappers extract articles accurately, but the generation and maintenance of wrappers for thousands of internet sites can only rely on massive manual participation;
The extraction module of the present invention takes the independently developed "automatic article extraction based on same-site learning and automatic rule generation" method as its core and solves both problems well.
In practical application, this technical scheme achieves automatic machine generation and maintenance of extraction wrappers, so extraction requires no large manual effort; it also achieves accurate article extraction, with very little redundancy or omission in the results and very high usability.
The article-page intelligent extraction system of the present invention mainly comprises the following submodules, as in Fig. 3:
(1) the to-be-extracted page loading module, mainly responsible for periodically querying the local index and, when a new index entry is found, loading the corresponding web pages into system memory;
(2) the wrapper query module: for each page awaiting extraction, query the concrete extraction wrapper information; if found, enter the extraction module and extract concretely according to the wrapper; otherwise, mark the page as failed to extract;
(3) the web page extraction module, responsible for extracting the concrete article information from a page according to the existing extraction wrapper;
(4) the failed-page collection module, responsible for collecting the pages that failed extraction this round, grouped by site, for convenient focused learning;
(5) the learning decision module: query the failed-page collection by site and, from each site's number of failed pages, compute the site's extraction success/failure ratio for this round, deciding whether to enter the web page learning module;
(6) the web page learning module, responsible for performing machine learning on all failed pages and finally generating new extraction wrappers;
(7) the extraction wrapper management module, responsible for managing the system's extraction wrappers, i.e., the path library and the pattern library; it provides a wrapper-use interface to the web page extraction module and a wrapper-update interface to the web page learning module.
Said web page extraction module further comprises:
the HTML parsing module, which parses the incoming page's HTML and builds the DOM tree;
the body-region locating module, which locates the body-text region according to the wrapper information;
the article header and pagination extraction module, used to extract the article header and the article's pagination information;
the body-region correction module, used to correct the body-text region; the body-region blocking module, used to divide the body-text region into blocks while performing block property determination and redundant block removal;
the segmentation and filtering module, used to segment the body-text region and filter blocks;
the data arrangement module, used to merge and organize the information, forming the article-type result and generating the final article information.
Said web page learning module further comprises:
the HTML parsing module, which parses the incoming page's HTML and builds the DOM tree;
the body-region locating module, used to locate the body-text region;
the path library update module, used to insert and merge paths and to consolidate the path library;
the article header and pagination extraction module, used to extract the article header and the article's pagination information;
the body-region correction module, used to correct the body-text region;
the body-region blocking module, used to divide the body-text region into blocks while performing block property determination and redundant block removal;
the pattern learning module, which segments the body-text region, builds a pattern for each segment, and inserts and merges the patterns into the pattern library;
the pattern induction module, which performs induction over all patterns, generates rules, and inserts and merges them into the pattern library;
the wrapper consolidation module, which consolidates the system's wrappers and removes invalid information.
Said body-region blocking module further comprises:
the frequent pattern recognition module, which uses the MDR method to recognize frequent patterns;
the blocking module, which, for the frequent patterns obtained, searches for block headers and searches upward for block parent nodes to obtain the best blocking node combination, then combines them to form blocks;
the block marking module, which marks all identified blocks in the body-region DOM tree.
The core of the article-page intelligent extraction system is the "automatic article extraction based on same-site learning and automatic rule generation" method.
This method mainly comprises two phases: the extraction phase and the learning phase. Its overall flowchart is shown in Fig. 7; the concrete steps are:
Step 4, mark extraction failures: mark the pages that failed extraction and collect them for the convenience of step 6, then return to step 2;
Step 7, web page learning: learn from all failed pages of each site and generate new extraction wrappers;
Step 9, the current round of extraction ends.
The core phases of the extraction method are the extraction phase and the learning phase, introduced one by one below.
The extraction phase, i.e., step 3 above; its flow is shown in Fig. 8:
Step 3.1, HTML parsing: parse the incoming page's HTML and build the DOM tree;
First the HTML is preprocessed, including character-encoding conversion, script/style filtering, and removal of invisible characters; then, according to the HTML code and the HTML standard, the HtmlParser component is used to parse the page and obtain the DOM tree;
Step 3.2, locate the body-text region;
The body-text region is the DOM node in the DOM tree that contains the main content of the article. The locating method is: query the site's body-region path in the style tree of the extraction wrapper, then traverse the DOM tree along this path to the concrete DOM node; that node is the body-text region we seek;
Step 3.3, extract the article header and pagination information;
The article header comprises the article title information, article time information, article source information, etc.
The method of extracting the title is roughly as follows:
(1) take the first several lines inside the body-text region, compute the title matching degree of each, and take the maximum, giving the candidate title line inside the region; here a "line" means one of the adjacent DOM node sets, with its corresponding HTML code, formed by splitting the page's DOM tree at HTML line-break tags such as <br> and <p>;
(2) for the several lines immediately before the body-text region, compute the title matching degree of each and take the maximum, giving the candidate title line before the region;
(3) then compare the two candidates according to heuristic rules and their title matching degrees, and select one as the title;
The title matching degree is measured by the following formula:
P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summery/len_max_summery) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len
Where:
len_punc is the punctuation length in the line;
len_all is the total text length in the line;
len_title is the edit distance between the line content and the page's title field;
len_max_title is the maximum of the line content length and the page title field length;
keywords refers to the keyword information carried by the page; summery refers to the summary field carried by the page; authortext refers to the anchor text corresponding to the page's URL; the meanings of these three groups of variables are analogous to the above;
WH is the tag-type weighting: if tags such as h1, h2, ..., center appear among the nodes of the line, the node is weighted accordingly;
H_len is the node content-length weighting: large-scale statistics show that title lengths most commonly fall between 16 and 23 characters, with every other interval having its own distribution probability, from which the length weighting of the node is computed;
a, b, c, d, e, f, g are the influence factors of the respective terms and can be adjusted in application.
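The title matching degree can be computed directly from the formula above. In this sketch the line's precomputed length quantities, edit distances, and weightings, and the factors a..g, are passed in as plain dicts; the field names are hypothetical:

```python
def title_match_degree(line: dict, w: dict) -> float:
    """P_t per the formula above. `line` carries the precomputed lengths,
    edit distances, and the WH / H_len weightings; `w` carries a..g."""
    return (w["a"] * (1 - line["len_punc"] / line["len_all"])
            + w["b"] * (1 - line["dist_title"] / line["max_title"])
            + w["c"] * (1 - line["dist_keywords"] / line["max_keywords"])
            + w["d"] * (1 - line["dist_summery"] / line["max_summery"])
            + w["e"] * (1 - line["dist_authortext"] / line["max_authortext"])
            + w["f"] * line["WH"]
            + w["g"] * line["H_len"])
```

A line with no punctuation that exactly matches the page's title, keywords, summary, and anchor-text fields (all edit distances zero) attains the maximum of the first five terms.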
The article time information and source information are found by matching within the several lines below the article title; since the body-text region and the title have already been determined, this portion is very small, so recognition accuracy is very high;
The pagination information is recognized by examining several lines at the tail of the body-text region, line by line, for number sequences; if a consecutive number sequence such as "1, 2, 3, 4, 5, 6..." is found, and the URLs these numbers link to belong to the same site as the page, recognition succeeds;
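The pagination check just described (a consecutive run of numbered links pointing at the page's own host) can be sketched as follows. The minimum run length of 3 and the data shapes are assumptions, not from the patent:

```python
from urllib.parse import urlparse

def find_paging(tail_lines, page_host):
    """Scan the tail lines of the body region for a run of consecutive
    numbered links ('1', '2', '3', ...) on the same host as the page.
    tail_lines: list of lines, each a list of (anchor_text, url) pairs."""
    for links in tail_lines:
        nums = [(int(t), u) for t, u in links if t.strip().isdigit()]
        consecutive = (len(nums) >= 3 and
                       all(b - a == 1 for (a, _), (b, _) in zip(nums, nums[1:])))
        if consecutive and all(urlparse(u).netloc == page_host for _, u in nums):
            return [u for _, u in nums]   # pagination URLs in order
    return None
```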
Step 3.4, correct the body-text region;
With the help of the cues of the news article format, and combining the article header from the step above with the article tail information (pagination information), the body-text region can be corrected to be as accurate as possible:
(here the article tail information means the pagination information) as shown in Fig. 6, the correction is as follows:
1) after the article header (title, time, etc.) is found, the body-text region is corrected:
if the article header is inside the region, the content before the header is cut off;
if the article header is outside the region, the part between it and the region is merged into the body-text region;
2) after the article tail information (pagination, etc.) is found at the tail of the region:
if the article tail is inside the region, the part of the region after the tail is cut off;
if the article tail is outside the region, no correction is made.
Step 3.5, divide the body-text region into blocks, comprising two steps: blocking, and block property determination with redundant block removal. The blocking step:
The purpose of body-region blocking is to divide the page into several complete zones, determine the properties of each zone one by one, and then remove redundancy, improving the precision of the extraction results.
The steps of the body-region blocking method are as follows:
Step 3.5.1, use the MDR method (proposed by Bing Liu) to recognize frequent patterns;
Step 3.5.2, for the patterns obtained, search for block headers and search upward for block parent nodes to obtain the best blocking node combination, then combine them to form blocks;
Step 3.5.3, mark all identified blocks in the body-region DOM tree;
Meanwhile, the following criteria are applied when building blocks:
(1) among all children of the same parent, the nodes between two marked blocks also form a block, the nodes before the first block form a block, and the nodes after the last block form a block;
(2) if a marked block exists in a node's subtree, the node itself is also a block;
Step 3.6, block property determination and redundant block removal for the body-region blocks;
For each block obtained, the ratio of its link text to its total text is judged;
if a block's link ratio exceeds the threshold (0.5), it is treated as a redundant block, removed from the tree, and replaced with an <hr> tag;
the blocks identified from the remaining frequent patterns are marked because of their explicit semantics, so that they are not split in subsequent operations (for example, a TV program listing table);
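The redundant-block test of step 3.6 reduces to a link-to-text ratio against the 0.5 threshold. A sketch, with blocks represented as plain dicts of precomputed lengths (a hypothetical shape; a real implementation would walk DOM nodes):

```python
def link_ratio(block: dict) -> float:
    """Ratio of link text to total text in a block; an empty block
    counts as fully redundant."""
    return block["link_len"] / block["text_len"] if block["text_len"] else 1.0

def prune_redundant(blocks, threshold=0.5):
    """Replace blocks whose link ratio exceeds the threshold with an
    <hr> placeholder, per the redundant-block rule."""
    return [b if link_ratio(b) <= threshold else {"tag": "hr"} for b in blocks]
```

A navigation strip (mostly anchor text) is replaced by the <hr> placeholder, while a paragraph with an occasional inline link survives.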
Step 3.7, body-region segmentation and filtering;
The blocked body-region tree is segmented to obtain a sequence of text segments. Segmentation is performed because observation shows that redundant information always appears in the form of one or more whole segments, so segmenting the body region makes it more convenient to remove redundant information in subsequent operations.
Block filtering is then performed on the segmented body-text region:
(1) generate patterns: for each segment, extract its HTML code, simplify the HTML fragment so that only tag names and content remain, compute the MD5 digest as the key, and build the pattern;
A pattern is expressed as follows:
Pattern = md5((content: text/img) + forward traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e., the frequency of occurrence of the pattern;
(2) filter blocks: the resulting patterns are put into the wrapper's pattern library and merged on insertion;
if the library already contains an identical pattern, the weights are merged, i.e., the value fields are added;
if not, the pattern is inserted.
Step 3.8, data arrangement and result generation: merge and organize the information, extract the summary, etc.; extraction succeeds;
Step 3.9, extraction ends.
Many steps of the learning phase correspond to the extraction phase, and some steps are identical.
The learning phase, i.e., step 7 above; its flow is shown in Fig. 9:
Step 7.1, HTML parsing: parse the incoming page's HTML and build the DOM tree;
Same as in the extraction phase;
Step 7.2, locate the body-text region;
Unlike the extraction phase, the learning phase locates the body-text region by the body-region recognition method.
The purpose of localization is to find a preliminary, reasonably good zone for the body text, reducing the DOM tree scope the method must handle and at the same time reducing the method's error probability; moreover, experiments show that many pages already have their body text correctly extracted at this localization stage;
Experimental statistics show that the body text is always contained in one or more nested Div or Table nodes, so the idea of body-text localization is exactly to find the right Div or Table; our method is to find the Div or Table with the highest information degree;
The information degree formula:
H = len_not_link * log(1 + len_link/len_allTxt) + a * len_not_link * log(1 + len_not_link/len_html)
Where:
a is an influence factor, currently defaulting to 0.5;
len_not_link is the non-link text length in the node;
len_allTxt is the total text length in the node;
len_html is the HTML length of the node;
When computing H, 1 is added inside each log so that the log results are all > 0;
the formula takes the link-text ratio into account, helping to find candidate nodes with as much non-link text as possible;
the formula takes the ratio of non-link text to HTML length into account, ensuring the candidate node is sufficiently contracted and avoiding candidates that are too broad;
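The information-degree selection can be sketched directly from the formula: nodes are represented as dicts of the four length quantities (a hypothetical shape), and the candidate Div/Table with the highest H is chosen as the body-text region:

```python
import math

def information_degree(node: dict, a: float = 0.5) -> float:
    """H per the formula above; `a` defaults to 0.5 as in the text."""
    return (node["len_not_link"] * math.log(1 + node["len_link"] / node["len_allTxt"])
            + a * node["len_not_link"] * math.log(1 + node["len_not_link"] / node["len_html"]))

def locate_body(candidates, a: float = 0.5):
    """Pick the Div/Table candidate with the highest information degree."""
    return max(candidates, key=lambda n: information_degree(n, a))
```

A node dominated by non-link text scores far higher than a small, link-heavy navigation node, so the article body wins the comparison.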
Finally, once the desired Div or Table is found, we trace back from it to the body node in the DOM tree; after the traceback finishes, a path has been formed; during the traceback, the positional information of each DOM node, that is, its left-to-right ordinal among its parent's children, is also recorded.
In the end we obtain a DOM tree path whose nodes also carry their positional information, for example:
"Div index=3 DFS=1 ==> Body index=0 DFS=1 ==> www.ifeng.com"
Step 7.3, path insertion and merging: incorporate the path above into the path library of the system wrapper, merging with weighting on insertion;
if a duplicate path is found, merge with weighting; the weighting modifies the DFS field, i.e., the new path's DFS value is added to the old path's;
if no duplicate is found, the new path is simply inserted;
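The path insert-or-merge rule can be sketched as follows; representing the path library as a dict from path string to accumulated DFS weight is an assumption for illustration:

```python
def merge_path(path_lib: dict, path: str, dfs: int = 1) -> None:
    """Insert-or-merge a traceback path into the path library:
    a duplicate path adds its DFS value to the stored one."""
    if path in path_lib:
        path_lib[path] += dfs   # merge weighting: new DFS added to old
    else:
        path_lib[path] = dfs    # new path: simply inserted

lib = {}
p = "Div index=3 ==> Body index=0 ==> www.ifeng.com"
merge_path(lib, p)
merge_path(lib, p)   # the same path seen again: DFS weight becomes 2
```

Paths that recur across many pages of a site accumulate weight, so the site's dominant body-region path dominates the library.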
Step 7.4, extract the article header and pagination information;
Same as in the extraction phase;
Step 7.5, correct the body-text region;
Same as in the extraction phase;
Step 7.6, divide the body-text region into blocks;
Same as in the extraction phase;
Step 7.7, block property determination and redundant block removal for the body-region blocks;
Same as in the extraction phase;
Step 7.8, pattern learning;
First segment the body-text region, in the same way as in the extraction phase;
After segmentation, generate a pattern for every segment;
The pattern generation process is: for each segment, extract its HTML code, simplify the HTML fragment so that only tag names and content remain, compute the MD5 digest as the key, and build the pattern;
A pattern is expressed as follows:
Pattern = md5((content: text/img) + forward traversal sequence of the segment's tags + site name) + value
where value is the weight information, i.e., the number of occurrences of the pattern.
Then the resulting patterns are put into the wrapper's pattern library and merged on insertion: if the library already contains an identical pattern, the weights are merged, i.e., the value fields are added; if not, the pattern is simply inserted;
Step 7.9, pattern induction, i.e. automatic regular-expression generation;
Among the patterns obtained in the previous step, many entries in the library can be merged into regular expressions;
For example, patterns like the following should be merged:
"For more splendid content, come to the Health channel"
"For more splendid pictures, come to the Pictures channel"
"For more splendid news, come to the Information channel"
The pattern after merging:
"For more splendid *, come to the * channel"
After merging, we obtain the other kind of entry in the pattern library: regular expressions.
This process is called pattern induction.
The concrete steps of pattern induction are as follows:
Step 7.9.1: for all patterns in the library, extract the original strings and group them by website; each group is clustered by string similarity, yielding several highly cohesive clusters.
String similarity is computed as follows: after a simple word segmentation whose unit is the "word", compute the word-level edit distance and derive the similarity; during segmentation, an HTML tag counts as one word, an English word as one word, and each Chinese character or punctuation mark as one word;
The clustering method is K-Means;
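A minimal sketch of this word-level similarity, assuming the stated tokenization (an HTML tag, an English word, or any single other character each count as one word; treating a digit run as one word is an added assumption) and a plain dynamic-programming edit distance:

```python
import re

# Tokenizer for Step 7.9.1: one "word" per HTML tag, English word, digit run
# (assumption), or single remaining character (covers CJK and punctuation).
TOKEN = re.compile(r"<[^>]+>|[A-Za-z]+|\d+|.", re.S)

def tokens(s):
    return TOKEN.findall(s)

def edit_distance(a, b):
    # classic dynamic-programming edit distance over token lists
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

def similarity(s1, s2):
    t1, t2 = tokens(s1), tokens(s2)
    return 1 - edit_distance(t1, t2) / max(len(t1), len(t2), 1)
```

These similarity values would then feed the K-Means clustering of each website's group.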
Step 7.9.2: within each cluster, compute the merged regular expression for every pair of segments, obtaining all possible distinct regular expressions; sort these by likelihood (frequency of occurrence) from high to low and take the largest (its coverage is necessarily the widest); then verify the second largest, and if it covers part of the segments in the cluster with suitable weight, it is also a desirable pattern;
How to merge two segments into one pattern: recursively seek the optimal common fragments of the remainders of the two segments; the non-common parts are exactly the places that need merging; overall this is a two-dimensional-table dynamic-programming method;
Handling of differing parts: if both differing parts are digits, merge them with "\d"; if they are mixed digits and letters, replace with "\d[a-z]"; other differences are replaced with "*"; when digits differ, the differing parts are expanded to cover all digit sequences, to improve adaptability;
For example:
"/imgs/89089089.jpg" and
"/imgs/89010197.jpg" merge into "/imgs/\d*.jpg"
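A simplified sketch of this differing-part rule (a hypothetical helper, not the patent's two-dimensional dynamic program): split each string into digit and non-digit runs, keep runs that match, replace differing all-digit runs with a digit wildcard, and fall back to a generic wildcard otherwise.

```python
import re

RUNS = re.compile(r"\d+|\D+")  # alternating digit / non-digit runs

def merge_to_regex(a, b):
    ra, rb = RUNS.findall(a), RUNS.findall(b)
    if len(ra) != len(rb):
        return ".*"                      # shapes differ: give up gracefully
    out = []
    for x, y in zip(ra, rb):
        if x == y:
            out.append(re.escape(x))     # common fragment, kept verbatim
        elif x.isdigit() and y.isdigit():
            out.append(r"\d+")           # digits differ: cover all digit runs
        else:
            out.append(".*")             # any other difference
    return "".join(out)

rx = merge_to_regex("/imgs/89089089.jpg", "/imgs/89010197.jpg")
# rx now matches any /imgs/<digits>.jpg URL
```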
Step 7.9.3: among all the regular expressions obtained, keep those whose weight exceeds a certain threshold and add them to the pattern library;
After pattern induction finishes, several regular expressions have been obtained; they are stored in the pattern library together with their weight information;
Step 7.10, end.
All of the above learning steps ultimately update two libraries: the style tree library (path library) and the pattern library; after updating, these two libraries are consolidated into the overall wrapper library, completing all learning steps.
An example of the present system is given below.
Taking the grabbing of http://www.21cbh.com/channel/review/ as an example, the steps of the overall grabbing system are as follows:
Step 4: enter the document deduplication module and perform near-duplicate removal on all extracted documents, weeding out articles whose similar content has already been grabbed;
Step 7: the overall grabbing flow ends.
The real-time grabbing module is divided into online and offline operating steps. The online modules perform the concrete grabbing work, while the offline modules provide supporting data, such as the proxy library, for the operation of the online modules;
The offline modules generally run once per day, at about 0:00, and then no longer run that day; the online modules run by polling, executing once every 30 seconds.
Taking the grabbing of http://www.21cbh.com/channel/review/ as an example, the offline operating steps of the real-time grabbing module of the overall grabbing system are as follows:
| | DAY 1 | DAY 2 | DAY 3 | DAY 4 | DAY 5 | DAY 6 | DAY 7 |
| First grab time of each day | 02:13 | 03:10 | 02:05 | 01:25 | 04:56 | 03:11 | 04:16 |
| Last grab time of each day | 06:15 | 06:32 | 06:54 | 07:21 | 07:23 | 06:26 | 08:11 |
After analysis, take the minimum first-grab time and the maximum last-grab time over the 7 days; the new time range obtained for this job is 1 o'clock to 8 o'clock, i.e. 01-08, and the job parameter settings are revised accordingly. After revision they read:
"2 248836 01-08"
"2 298603 01-08"
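This 7-day analysis can be sketched as follows, assuming the range is formed from the hour of the earliest first-grab time and the hour of the latest last-grab time, which reproduces the 01-08 result above:

```python
# Offline time-range discovery: earliest first-grab hour to latest
# last-grab hour over a week of logs, formatted as the job's "HH-HH" range.
first = ["02:13", "03:10", "02:05", "01:25", "04:56", "03:11", "04:16"]
last  = ["06:15", "06:32", "06:54", "07:21", "07:23", "06:26", "08:11"]

def hour_range(first_times, last_times):
    lo = min(int(t.split(":")[0]) for t in first_times)
    hi = max(int(t.split(":")[0]) for t in last_times)
    return f"{lo:02d}-{hi:02d}"

print(hour_range(first, last))  # 01-08
```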
These proxies are then verified. The verification method is to provide grabable URLs numbering 3 times the number of proxies; each proxy is then randomly assigned 3 URLs to grab; each success earns a reward of +1 point and each failure a penalty of -1 point; grabbing successfully within 5 seconds earns a reward of +5 points, and within 10 seconds +2 points.
Finally each proxy's scores are totalled, proxies scoring below 2 points are excluded, and the effective proxy list shown in Figure 19 is formed, in which the last column is each proxy's score.
Finally all of these proxies are put into our proxy library to support the online operation.
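The proxy scoring above can be sketched as follows, assuming each proxy's three probe results are available as (succeeded, seconds) pairs; the probe itself would wrap a real HTTP request through the proxy and is not shown.

```python
# Free-proxy verification: +1 per success, -1 per failure, +5 bonus for a
# fetch within 5 s, +2 within 10 s; proxies scoring below 2 are discarded.
def score_proxy(results):
    """results: list of (succeeded, seconds) tuples, one per probe URL."""
    score = 0
    for ok, secs in results:
        if not ok:
            score -= 1
            continue
        score += 1
        if secs <= 5:
            score += 5
        elif secs <= 10:
            score += 2
    return score

def filter_proxies(probe_results, threshold=2):
    scored = ((p, score_proxy(r)) for p, r in probe_results.items())
    return {p: s for p, s in scored if s >= threshold}

probes = {
    "1.2.3.4:8080": [(True, 3.0), (True, 4.2), (True, 9.0)],   # fast, reliable
    "5.6.7.8:3128": [(False, 0), (False, 0), (True, 20.0)],    # mostly failing
}
print(filter_proxies(probes))  # {'1.2.3.4:8080': 15}
```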
Taking the grabbing of http://www.21cbh.com/channel/review/ as an example, the online operating steps of the real-time grabbing module of the overall grabbing system are as follows:
1) grab http://www.21cbh.com/channel/review/ without expansion;
2) grab the zone of this page specified by the DOM node <div class="home_box">;
3) grab the URL links in this zone that satisfy the following URL regular expression:
http://www.21cbh.com/HTML/.*?\.html
4) the grab-interval base is 298603 milliseconds;
5) the grab time range is 1 o'clock to 8 o'clock each day;
Step 4, time-interval judgment. Query this job's time-interval base of 298603 milliseconds; if the interval specifies a next grab time later than the current time, do not grab and return to step 1; otherwise enter the next step;
Step 7, grab-frequency adjustment. Based on this job's grab-interval base of 298603: if this round grabbed an update, reduce the next round's grab interval by 0.2 times; if this round grabbed no update, increase the next round's grab interval by 0.2 times.
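A sketch of this frequency adjustment; "0.2 times" is read here as 0.2 times the interval base, and the [0.5, 2]-times-base clamp from the module description is applied as well.

```python
# Grab-frequency adjustment: shrink the next interval when this round found
# an update, grow it otherwise, keeping it within [0.5, 2] x the base.
BASE_MS = 298603

def next_interval(current_ms, found_update, base_ms=BASE_MS):
    step = 0.2 * base_ms
    nxt = current_ms - step if found_update else current_ms + step
    return min(max(nxt, 0.5 * base_ms), 2 * base_ms)

iv = float(BASE_MS)
iv = next_interval(iv, found_update=True)    # page updated: poll faster
iv = next_interval(iv, found_update=False)   # no update: back off again
```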
The following is a practical operating example of the extraction-system part of the present invention.
Taking the grabbing of the latest Taiwan news of the Phoenix site, http://news.ifeng.com/taiwan/rss/rtlist_0/index.shtml, as an example, the flow of the overall extraction system is as follows:
After all webpages in this latest-news list have been crawled, the webpage extraction system is entered:
All grabbed webpages are loaded here, 42 in total;
Step 4, mark extraction failures. Failed webpages are marked and collected to facilitate step 6, and the flow returns to step 2;
Here our extraction succeeded on 16 webpages out of 42 in total, a success ratio of 16/42 < 0.5, so learning is needed;
Step 7, webpage learning. All failed webpages of each website are learned from, generating new extraction wrappers; a corresponding learning instance is given later;
Step 9, this extraction round ends.
Within the above instance, one concrete web page extraction needs to be elaborated. Here we take the webpage "http://news.ifeng.com/mil/3/detail_2011_11/21/10798106_0.shtml" of the site www.ifeng.com as an example to demonstrate how our web page extraction steps obtain one piece of complete and accurate article information.
The system reads in a round of webpages to be extracted and processes them one by one; one of them has the link address "http://news.ifeng.com/mil/3/detail_2011_11/21/10798106_0.shtml" and site www.ifeng.com, as shown in Figure 20:
1. HTML parsing, finally constructing the DOM tree;
Webpage preprocessing must be performed first: character-format conversion, script/style information filtering, invisible-character elimination, and so on;
Then, according to the HTML code and the HTML standard, the HtmlParser component is adopted to parse the webpage and obtain the DOM tree;
2. Searching for the text field;
Through the domain name www.ifeng.com, the path (style) shown in Figure 21 is found in the path library:
Such a DOM tree path guides us to the red-frame text field shown in Figure 22:
3. Extract the article head and the paging information;
The concrete method for extracting the header and paging information is as follows:
First, the article header is mainly the title, and the method for extracting the title is:
(1) extract the first several lines inside the text field, compute each line's title matching degree, and take the maximum, obtaining the in-field candidate header line;
(2) for the several lines just before the text field, likewise compute each line's title matching degree and take the maximum, obtaining the before-field candidate header line;
(3) then compare the two and select one as the title according to heuristic rules and the title matching degree;
The measurement formula of the title matching degree is as follows:

P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summary/len_max_summary) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len

where:
len_punc is the length of the punctuation marks in the line;
len_all is the length of all words in the line;
len_title is the edit distance between the line content and the content of the webpage's title field;
len_max_title is the maximum of the lengths of the line content and the webpage's title field content;
keywords refers to the keyword information carried by the webpage, summary to the abstract field carried by the webpage, and authortext to the anchor text corresponding to the webpage's URL; the meanings of these three groups of variables are analogous to the above;
WH is the tag-type weighting: when tags such as h1, h2, ..., center appear among the nodes under the line, the node is weighted accordingly;
H_len is the node-content-length weighting: large-scale statistics show that title lengths between 16 and 23 are the most common, and every other interval has its own distribution probability, from which the length weighting of the node is computed;
a, b, c, d, e, f and g are the influence factors of the respective terms and can be revised in application.
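A direct transcription of P_t, assuming the len_* quantities (punctuation and word lengths, the edit distances and maxima against the title, keywords, summary and anchor-text fields) and the two weightings WH and H_len are supplied pre-computed; the influence factors a..g default to 1 here purely for illustration.

```python
# Title matching degree P_t, term by term as in the formula above.
def title_score(len_punc, len_all,
                len_title, len_max_title,
                len_keywords, len_max_keywords,
                len_summary, len_max_summary,
                len_authortext, len_max_authortext,
                tag_weight, length_weight,        # WH and H_len
                a=1, b=1, c=1, d=1, e=1, f=1, g=1):
    return (a * (1 - len_punc / len_all)
            + b * (1 - len_title / len_max_title)
            + c * (1 - len_keywords / len_max_keywords)
            + d * (1 - len_summary / len_max_summary)
            + e * (1 - len_authortext / len_max_authortext)
            + f * tag_weight
            + g * length_weight)

# a punctuation-free line with zero edit distance to every field scores 7
print(title_score(0, 10, 0, 10, 0, 10, 0, 10, 0, 10, 1.0, 1.0))  # 7.0
```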
Then, the article paging information identification method is: at the tail of the text field, seek several lines and look line by line for number sequences; if a consecutive number sequence such as "1, 2, 3, 4, 5, 6, ..." is found, and these numbers themselves carry URL links that belong to the same website as this webpage, recognition succeeds;
The webpage of this instance has no paging;
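The paging-detection rule can be sketched as follows, assuming the tail links are available as (anchor text, href) pairs and requiring at least two same-site numeric anchors that form a consecutive run:

```python
from urllib.parse import urlparse

# Paging detection: numeric anchors at the text-field tail, consecutive
# values, with hrefs on the same site as the page itself.
def detect_paging(links, page_url):
    site = urlparse(page_url).netloc
    nums = [(int(t), h) for t, h in links
            if t.strip().isdigit() and urlparse(h).netloc == site]
    vals = [n for n, _ in nums]
    consecutive = len(vals) >= 2 and all(b - a == 1
                                         for a, b in zip(vals, vals[1:]))
    return [h for _, h in nums] if consecutive else []

links = [("1", "http://www.ifeng.com/a_1.shtml"),
         ("2", "http://www.ifeng.com/a_2.shtml"),
         ("3", "http://www.ifeng.com/a_3.shtml")]
pages = detect_paging(links, "http://www.ifeng.com/a_0.shtml")
print(len(pages))  # 3
```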
4. Revise the text field;
The revision mode is shown in Figure 23 and elaborated concretely as follows:
With the help of the cues of the news-article format, and combining the article header and article-tail information (paging information) from the steps above, the text field can be revised to make it as accurate as possible:
1) after the article head (title, time, etc.) is found before the field, revise the text field:
if the article head lies inside the field, cut off the information before the article head;
if the article head lies outside the field, merge the part in between into the text field;
2) after the article-tail information (paging, etc.) is found at the field tail:
if the article tail lies inside the field, cut off the tailing part inside the field;
if the article tail lies outside the field, make no revision.
5. Partition the text field into blocks, comprising a blocking step and a block-property-determination and redundant-block-removal step;
The blocking step comprises:
1) the MDR method identifies frequent patterns;
2) joints such as frequent patterns and headings are sought and combined to form blocks.
As shown in Figure 24, we have obtained two blocks.
The mode of block property determination and redundant removal is:
for each block obtained, judge the ratio of its link text to its overall word count;
if a block's link ratio is greater than the threshold (0.5f), it is regarded as a redundant block and removed (in fact substituted with an hr tag in the tree);
the blocks of the remaining frequent patterns are marked because of their clear semantic information, and are no longer split in subsequent operations.
After the text field has been processed, we obtain the result shown in Figure 25.
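The redundant-block test above reduces to a simple ratio check; a block whose linked-text length exceeds half its total text length is treated as a navigation or advertising block and dropped:

```python
# Redundant-block test: link-text ratio above the threshold means the block
# is navigation/ads rather than article content.
LINK_RATIO_THRESHOLD = 0.5

def is_redundant_block(link_text_len, total_text_len):
    if total_text_len == 0:
        return True                      # empty block: nothing to keep
    return link_text_len / total_text_len > LINK_RATIO_THRESHOLD

print(is_redundant_block(80, 100))   # True: mostly links, drop it
print(is_redundant_block(10, 100))   # False: keep as content
```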
6. Segment and filter the text field, comprising the two steps of segmentation and block filtering;
The text field is segmented, thereby obtaining the text segmentation sequence.
The result after segmentation is shown in Figure 26, where the content in each black box is one segment.
Block filtering performs pattern-matching filtering against the pattern library segment by segment;
a pattern is extracted from each paragraph one by one and then matched against the library;
here, the following pattern is matched:
this filters out the tail picture segment pattern of Figure 26; this picture is an advertisement and should be eliminated;
Finally, text-field extraction ends, with the result shown in Figure 27.
7. Data arrangement and result generation. Information such as keywords and the summary is extracted and assembled into one accurately extracted article;
The method ends.
In the overall extraction instance, the concrete steps of webpage learning also need to be elaborated.
In our instance, 26 webpages failed extraction, and this batch of webpages all enter webpage learning;
Taking the learning of each of these webpages as an example, the steps are as follows:
1. HTML parsing. For the imported webpage, parse the HTML and build the DOM tree;
The concrete method is the same as in the extraction instance;
2. Searching for the text field; locate the text field through the text-field recognition method;
(1) extract all Div and Table nodes in the webpage's DOM tree, then compute each node's information degree according to the following information-degree formula:
H = len_not_link * log(1 + len_link/len_allTxt) + a * len_not_link * log(1 + len_not_link/len_html)

where:
a is an influence factor, currently defaulting to 0.5;
len_not_link is the length of the non-link words in the node;
len_link is the length of the link words in the node;
len_allTxt is the length of all words in the node;
len_html is the HTML length of the node;
during computation, 1 is added to each parameter of log, so that every log result is > 0;
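A direct transcription of the information-degree formula, with the stated default a = 0.5; the lengths would be measured over the node's text and HTML in practice.

```python
import math

# Information degree of a candidate Div/Table node:
# H = len_not_link * log(1 + len_link/len_allTxt)
#     + a * len_not_link * log(1 + len_not_link/len_html)
def information_degree(len_not_link, len_link, len_all_txt, len_html, a=0.5):
    return (len_not_link * math.log(1 + len_link / len_all_txt)
            + a * len_not_link * math.log(1 + len_not_link / len_html))

# a text-heavy node outscores a link-heavy one of the same total size
content_h = information_degree(900, 100, 1000, 5000)
nav_h     = information_degree(100, 900, 1000, 5000)
print(content_h > nav_h)  # True
```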
(2) find the node with the maximum information degree among all nodes; the red-frame text field shown in Figure 14 is exactly the node we find;
(3) having found the desired Div or Table, trace back to the body node in the DOM tree; when the trace-back ends, a path has been formed; during the trace-back, the positional information of each DOM node passed through is also recorded, namely its left-to-right sequence number within its parent node;
(4) finally, a DOM tree path as shown in Figure 28 is obtained; the nodes of the path also carry their positional information, and the DFS information is all 1, i.e. the occurrence count is 1;
3. Path warehousing and merging; the above path is merged into the wrapper's path library; on merging, paths are weight-combined:
if an identical path is found, the weights are merged by revising the DFS field, i.e. the DFS value of the new path is added to that of the old path;
if no duplicate is found, the new path is simply stored;
4. Extract the article header and the article paging information;
the concrete method is the same as the corresponding step of the web page extraction instance;
5. Revise the text field;
the concrete steps are the same as the corresponding step of the web page extraction instance;
6. Partition the text field into blocks, comprising blocking, block property determination, and redundant-block removal;
the concrete steps are the same as the corresponding step of the web page extraction instance;
7. Pattern learning, comprising the three steps of text-field segmentation, block-by-block pattern generation, and block-by-block learning;
the segmentation step is the same as the corresponding step of the web page extraction instance;
the block-by-block pattern-generation step is the same as the corresponding step of the web page extraction instance;
the block-by-block learning mode is:
the result after segmentation is shown in Figure 26, and according to the pattern-generation method, each paragraph generates a pattern as in the figure below, in which the second field is the concrete value information;
the obtained patterns are put into the wrapper's pattern library with warehouse-in merging: if the library contains an identical pattern, the weights are merged by adding the value fields; if not, the pattern is simply stored;
8. Pattern induction, i.e. automatic regular-expression generation;
the concrete steps of pattern induction are as follows:
Step 8.1: for all patterns in the library, extract the original strings and group them by website; each group is clustered by string similarity, yielding several highly cohesive clusters;
Step 8.2: within each cluster, compute the merged regular expression for every pair of segments, obtaining all possible distinct regular expressions; sort these by frequency of occurrence from high to low and take the largest; then verify the second largest, and if it covers part of the segments in the cluster with suitable weight, it is also a desirable pattern;
merging two segments into one pattern: recursively seek the optimal common fragments of the remainders of the two segments; the non-common parts are exactly the places that need merging; overall this is a two-dimensional-table dynamic-programming method;
Step 8.3: among all the regular expressions obtained, keep those whose weight exceeds a certain threshold and add them to the pattern library;
after pattern induction finishes, several regular expressions have been obtained; they are stored in the pattern library together with their weight information;
9. End.
Claims (34)
1. An article real-time intelligent grabbing system, characterized in that said system comprises a real-time grabbing module, a webpage extraction system, a document near-duplicate removal module, an automatic document classification module, and an article release module.
2. The system according to claim 1, wherein said real-time grabbing module comprises online and offline operation submodules, the online operation submodule comprising:
a task extraction module, which extracts one job in turn from the task job set;
a task parsing module, which parses each task job, the parsing result forming several attributes and rules;
a task grab-time-range inspection module, which queries the task's time-range parameter; if the time range does not include the current time, the job is not grabbed and is skipped; otherwise, the grab-time-interval check is performed;
a task grab-time-interval inspection module, which queries the task's grab time interval; if the interval specifies a next grab time later than the current time, the job is not grabbed and is skipped; otherwise, the task grab is performed;
a task scheduling module, which schedules jobs according to the job attributes from the task parsing module; at scheduling time, if the job existed before, it is not reassigned and is still grabbed by its home server; otherwise, the server with the fewest existing jobs in the server group is selected, so as to balance the grabbing tasks and thereby optimize the overall grab speed; at the same time, placing too many jobs of the same website on one server is avoided as far as possible, to prevent a single server from putting too much grabbing pressure on a single website;
a task download module, which performs the concrete download of the task, fetching an appropriate number of proxies, generally 5, from the proxy library; if no proxy is available, non-proxy grabbing is adopted; the no-proxy option and the above 5 proxies are merged to form a proxy list; according to the task parameters obtained from parsing, one proxy is selected at random from the proxy list, and this round of the task is downloaded;
a task grab-frequency adjustment module, which, based on the job's grab-interval base, randomly reduces the next round's grab interval if this round grabbed an update, and randomly enlarges the next round's grab interval if no update was found, while guaranteeing that the interval remains within [0.5, 2] times the job's grab-interval base.
3. The system according to claim 1, wherein said real-time grabbing module comprises online and offline operation submodules, the offline operation submodule comprising:
a task grab-time-range discovery module, which performs intelligent analysis of the historical grab logs and therefrom analyzes the time range of each task job;
a task grab-time-interval discovery module, which reads in yesterday's grab logs, analyzes the grab situation of all of each job's rounds of the day, and therefrom analyzes each job's update situation; if the job's current interval base is found to let more than 50% of the rounds grab updates, no adjustment is made; otherwise the current interval base is suitably enlarged, to reduce meaningless grab requests;
a free-proxy collection and verification module, which downloads the day's free proxy lists from free proxy-sharing websites on the internet and verifies these free proxies: historically grabbed links are selected at random to form a verification URL set; each proxy is used to perform several URL grabs; proxies that cannot grab successfully or grab too slowly are weeded out, and proxies with high success ratios and high speed are retained, forming the day's proxy library to provide proxy support for online grabbing.
4. An article real-time intelligent grabbing method, characterized in that said method comprises a real-time grabbing step, a webpage extraction step, a document near-duplicate removal step, an automatic article classification step, and an article release step; said real-time grabbing step further comprises online and offline operation substeps.
5. The method according to claim 4, said online operation substep comprising:
Step 1: extract one job in turn from the task job set;
Step 2, job parsing: parse each job, the parsing result forming several attributes and rules;
Step 3, time-range judgment: query the task's time-range parameter; if the time range does not include the current time, do not grab and return to step 1; otherwise enter the next step;
Step 4, time-interval judgment: query the task's grab time interval; if the interval specifies a next grab time later than the current time, do not grab and return to step 1; otherwise enter the next step;
Step 5, job scheduling: schedule jobs according to the job attributes from the task parsing module; at scheduling time, if the job existed before, do not reassign it and still grab it with its home server; otherwise select the server with the fewest existing jobs in the server group, so as to balance the grabbing tasks and thereby optimize the overall grab speed; at the same time avoid placing too many jobs of the same website on one server as far as possible, to prevent a single server from putting too much grabbing pressure on a single website;
Step 6, task download: fetch an appropriate number of proxies, generally 5, from the proxy library; if no proxy is available, adopt non-proxy grabbing; merge the no-proxy option and the above 5 proxies to form a proxy list; according to the task parameters obtained from parsing, select one proxy at random from the proxy list and download this round of the task;
Step 7, grab-frequency adjustment: based on the job's grab-interval base, if this round grabbed an update, randomly reduce the next round's grab interval; if no update was found, randomly enlarge the next round's grab interval; but guarantee that the interval remains within [0.5, 2] times the job's grab-interval base; after the frequency adjustment is completed, return to step 1 and repeat the whole flow.
6. The method according to claim 4, said offline operation substep comprising:
Step 1, analyze logs to discover time ranges: perform intelligent analysis of the historical grab logs and therefrom analyze the time range of each task job;
Step 2, analyze logs to discover new time intervals: read in yesterday's grab logs, analyze the grab situation of all of each job's rounds of the day, and therefrom analyze each job's update situation; if the job's current interval base is found to let more than 50% of the rounds grab updates, make no adjustment; otherwise suitably enlarge the current interval base, to reduce meaningless grab requests;
Step 3, free-proxy collection and verification: download the day's free proxy lists from free proxy-sharing websites on the internet and verify these free proxies: select historically grabbed links at random to form a verification URL set; use each proxy to perform several URL grabs; weed out proxies that cannot grab successfully or grab too slowly, and retain proxies with high success ratios and high speed, forming the day's proxy library to provide proxy support for online grabbing.
7. The webpage extraction system according to claim 1, being an article-type webpage intelligent extraction system, comprising:
(1) a to-be-extracted webpage loading module, which regularly queries the local index and, on finding a new index, loads the webpages into system memory according to the index;
(2) a wrapper query module, which, for each webpage remaining to be extracted, queries the concrete extraction wrapper information; if found, the extraction module is entered according to the extraction wrapper and concrete extraction is performed; otherwise, the webpage is marked as failing extraction;
(3) a webpage extraction module, which extracts the concrete article information from the webpage according to the existing extraction wrapper;
(4) an extraction-failure webpage collection module, which collects the webpages that failed extraction in this round, grouped by website, to facilitate focused learning;
(5) a learning judgment module, which queries the extraction-failure webpage collection by website and, according to each website's number of failed webpages, computes the website's success/failure extraction ratio for this round and decides whether to enter the webpage learning module;
(6) a webpage learning module, which performs machine learning on all failed webpages and finally generates new extraction wrappers;
(7) an extraction-wrapper management module, which manages the system's extraction wrappers, namely the path library and the pattern library, and provides a wrapper-use interface to the webpage extraction module and a wrapper-update interface to the webpage learning module.
8. The article-type webpage intelligent extraction system as claimed in claim 7, characterized in that said webpage extraction module further comprises:
an HTML parsing module, which, for the imported webpage, parses the HTML and builds the DOM tree;
a text-field seeking module, which seeks the text field according to the wrapper information;
an article-head and paging-information extraction module, used to extract the article header and the article paging information;
a text-field correction module, used to revise the text field;
a text-field blocking module, used to partition the text field into blocks while performing block property determination and redundant-block removal;
a segmentation filtering module, used to segment the text field while filtering the blocks;
a data arrangement module, used to merge and arrange the information, forming an article-type result.
9. The article-type webpage intelligent extraction system as claimed in claim 7, characterized in that said webpage learning module further comprises:
an HTML parsing module, which, for the imported webpage, parses the HTML and builds the DOM tree;
a text-field seeking module, used to seek the text field;
a path-library update module, used for warehousing and merging while tidying the path library;
an article-head and paging-information extraction module, used to extract the article header and the article paging information;
a text-field correction module, used to revise the text field;
a text-field blocking module, used to partition the text field into blocks while performing block property determination and redundant-block removal;
a pattern-learning module, which segments the text field, builds patterns block by block, and merges them into the pattern library;
a pattern-induction module, which induces over all patterns, generates rules, and merges them into the pattern library;
a wrapper tidying module, which tidies the system's wrappers and removes invalid information.
10. The article-type webpage intelligent extraction system as claimed in claim 8 or 9, said text-field blocking module further comprising:
a frequent-pattern identification module, which adopts the MDR method to identify frequent patterns;
a blocking module, which, for the obtained frequent patterns, seeks block heads and searches over the block parent nodes to obtain the optimal blocking-node combination, and then combines them to form blocks;
a block marking module, which marks all identified blocks in the text-field DOM tree.
11. An article-type webpage intelligent extraction method, comprising the steps of:
Step 1, loading of webpages to be extracted: at set intervals, load the collection of webpages remaining to be extracted; if there are no webpages to extract, go directly to step 6;
Step 2, wrapper query: for each webpage remaining to be extracted, query the concrete extraction wrapper information; if found, enter step 3 and perform the concrete extraction; otherwise extraction fails, enter step 5;
Step 3, webpage extraction: according to the wrapper, extract the webpage concretely; after extraction ends, organize the extraction result into the article type;
Step 4, mark extraction failure: mark and collect the failed webpages to facilitate step 6, and return to step 2;
Step 5: collect all webpages failing extraction, forming the extraction-failure webpage collection;
Step 6, learning judgment: query the extraction-failure webpage collection by website; for each website's failure collection, judge the website's success/failure extraction ratio for this round and decide whether to perform machine learning; if learning is to occur, add the webpages to the to-be-learned webpage collection;
Step 7, webpage learning: learn from all failed webpages of each website and generate new extraction wrappers;
Step 8, extraction-wrapper management: put the new extraction wrappers into the wrapper set;
Step 9, end.
12. The method according to claim 11, wherein said step 3, webpage extraction, comprises the steps of:
Step 3.1, HTML parsing: for the imported webpage, parse the HTML and build the DOM tree;
Step 3.2, seek the text field;
Step 3.3, extract the article header and the article paging information;
Step 3.4, revise the text field: with the help of the cues of the news-article format, and combining the article header and article paging information from the steps above, the text field can be revised to make it more accurate;
Step 3.5, partition the text field into blocks, then perform block property determination and redundant-block removal;
Step 3.6, segment and filter the text field: first segment the text-field blocking tree, thereby obtaining the text segmentation sequence, then filter the blocks;
Step 3.7, data arrangement and result generation: merge and arrange the information, extract the summary, etc.; extraction succeeds;
Step 3.8, end.
13. The method according to claim 12, wherein said step 3.1 first preprocesses the HTML, including character-encoding conversion, script/style filtering and removal of invisible characters, and then, following the HTML code and the HTML standard, parses the page with the HtmlParser component to obtain the DOM tree.
14. The method according to claim 12, wherein in said step 3.2 the location path of the site is queried in the style tree of the extraction wrapper to obtain the text-region path; the DOM tree is then traversed along this path to locate a concrete DOM node, and this node is the text region sought.
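Following a stored path through a DOM tree, as claim 14 describes, can be sketched as below. The dict-based node model and the (tag, index) path encoding are assumptions for illustration; the claim fixes only that a stored path is walked down to one concrete node.

```python
def locate_by_path(dom_root, path):
    """Locate the text region by walking a stored wrapper path (a sketch).

    A DOM node is modelled as {'tag': str, 'children': [...]}, and `path`
    is a list of (tag, index) steps: at each level, take the index-th
    child carrying that tag. Returns the located node, or None if the
    path no longer fits this page."""
    node = dom_root
    for tag, index in path:
        matches = [c for c in node.get('children', []) if c.get('tag') == tag]
        if index >= len(matches):
            return None
        node = matches[index]
    return node
```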
15. The method according to claim 12, wherein in said step 3.3 the article header is mainly the title information, and its extraction comprises:
(1) taking the first few rows inside the text region, computing the title matching degree of each row and taking the maximum, to obtain the in-region candidate title row; a "row" here is one of the groups of adjacent DOM nodes, together with their corresponding HTML code, formed by splitting the page's DOM tree at HTML line-break tags such as <br> and <p>;
(2) taking the few rows just before the start of the text region, computing the title matching degree of each row and taking the maximum, to obtain the candidate title row before the region;
(3) then comparing the candidates according to heuristic rules and their title matching degrees, and selecting one as the title.
16. The method according to claim 15, wherein said title matching degree is measured by the following formula:

P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summery/len_max_summery) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len

Wherein:
len_punc is the length of the punctuation in the row;
len_all is the total text length of the row;
len_title is the edit distance between the row content and the page's title field;
len_max_title is the maximum of the lengths of the row content and the page's title field;
keywords refers to the keyword information carried by the page, summery to the abstract field carried by the page, and authortext to the anchor text of the page URL; the meanings of these three groups of variables are analogous to the title variables above;
WH is the tag-type weight: if tags such as h1, h2, ..., center appear among the nodes of the row, the row is given extra weight;
H_len is the content-length weight: large-scale statistics show that title lengths between 16 and 23 characters are the most common, and every other length interval has its own probability of occurrence; the length weight of a node is computed from this distribution;
a, b, c, d, e, f, g are the influence factors of the respective terms and can be revised in applications.
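The title matching degree P_t of claim 16 can be sketched as below. Simplifications are assumed and labelled: the WH and H_len weights are reduced to 0/1 and 1.0/0.5 indicators (the claim uses a full length-probability distribution), and `page_fields` is a dict holding the 'title', 'keywords', 'summery' and 'authortext' fields the claim names.

```python
def edit_distance(s, t):
    """Levenshtein edit distance by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

PUNCT = set(',.!?;:\'"()[]')

def title_match_degree(row_text, row_tags, page_fields,
                       w=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """P_t of claim 16 under the simplifications stated above.
    w = (a, b, c, d, e, f, g), the claim's influence factors."""
    a, b, c, d, e, f, g = w
    len_all = max(len(row_text), 1)
    len_punc = sum(ch in PUNCT for ch in row_text)

    def field_term(name):
        # 1 - edit_distance(row, field) / max length, per the claim
        field = page_fields.get(name, '')
        len_max = max(len(row_text), len(field), 1)
        return 1 - edit_distance(row_text, field) / len_max

    wh = 1.0 if {'h1', 'h2', 'h3', 'center'} & set(row_tags) else 0.0
    h_len = 1.0 if 16 <= len(row_text) <= 23 else 0.5  # assumed length prior
    return (a * (1 - len_punc / len_all)
            + b * field_term('title') + c * field_term('keywords')
            + d * field_term('summery') + e * field_term('authortext')
            + f * wh + g * h_len)
```

A row that matches the page's title field and sits in an h1 tag scores far above a boilerplate row, which is the separation the formula is designed to produce.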
17. The method according to claim 12, wherein in said step 3.3 the article pagination information is recognized by scanning the last few rows of the text region line by line for a numeric sequence; if a run of consecutive numbers such as "1, 2, 3, 4, 5, 6 ..." is found, and the URL links carried by those numbers belong to the same site as the page itself, the recognition succeeds.
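The pagination test of claim 17 can be sketched as follows. Modelling a row as a list of (anchor_text, href) pairs is an assumption; the claim fixes only the two conditions: a consecutive numeric run, and links that belong to the page's own site.

```python
from urllib.parse import urlparse

def find_pagination_row(tail_rows, page_url):
    """Scan the last rows of the text region for a pagination bar:
    a run of consecutive integers ("1, 2, 3, ...") whose links point
    at the same site as the page itself. Returns the matching row,
    or None if no row qualifies."""
    host = urlparse(page_url).netloc
    for row in tail_rows:
        nums = [int(text) for text, href in row
                if text.strip().isdigit() and urlparse(href).netloc == host]
        if len(nums) >= 2 and nums == list(range(nums[0], nums[0] + len(nums))):
            return row
    return None
```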
18. The method according to claim 12, wherein said step 3.4 comprises:
1) when an article head has been found before the region, revising the text region as follows:
if the article head is inside the region, cutting off the information before the head;
if the article head is outside the region, merging the part between the head and the region into the text region;
2) when article tail information has been found after the end of the region:
if the article tail is inside the region, cutting off the tail part of the region;
if the article tail is outside the region, making no revision.
19. The method according to claim 12, wherein the text-region blocking step in said step 3.5 comprises:
Step 3.5.1, identifying frequent patterns with the MDR method;
Step 3.5.2, for the patterns obtained, searching for block headers, searching upward through the parent nodes of the blocks, and so on, to obtain the best combination of blocking nodes, and then combining them into blocks;
Step 3.5.3, marking all identified blocks in the text-region DOM tree.
20. The method according to claim 19, wherein in said step 3.5.2 the combination into blocks follows these criteria:
(1) among all child nodes of the same parent, the nodes between two marked blocks also form a block, the nodes before the first block form a block, and the nodes after the last block form a block;
(2) if a marked block exists in the subtree of a node, the node itself is also a block.
21. The method according to claim 12, wherein the block-property judgment and redundant-block removal step in said step 3.5 works as follows:
for each block obtained, the ratio of its link text to its total text length is judged;
if a block's link ratio is greater than the threshold (0.5), the block is considered redundant and is removed from the tree, with an hr tag substituted in its place;
the remaining frequent-pattern blocks are marked, because their semantics are clear, so that they are not split in subsequent operations.
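The redundancy test of claim 21 reduces to a single ratio check, sketched below; treating an empty block as redundant is an added assumption not stated in the claim.

```python
def is_redundant_block(link_text_len, total_text_len, threshold=0.5):
    """A block whose link text exceeds the given share of its total
    text (0.5 in the claim) is treated as a redundant navigation or
    recommendation block and removed from the tree."""
    if total_text_len == 0:
        return True  # assumption: an empty block carries no article text
    return link_text_len / total_text_len > threshold
```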
22. The method according to claim 12, wherein in said step 3.6 the per-block filtering of the text region works as follows: first the text region is segmented, the segmentation method being to cut at HTML line-break tags; then a pattern is generated for each segment; then filtering is performed: if the pattern matches one in the wrapper library, the matched pattern's weight is increased and the segment is removed; if it does not match, a new pattern is built and stored in the library with the minimum weight.
23. The method according to claim 11, wherein said web page learning step 7 comprises:
Step 7.1, HTML parsing: parsing the incoming page's HTML and building a DOM tree;
Step 7.2, locating the text region through the text-region recognition method;
Step 7.3, path warehousing and merging:
the path is merged into the path library of the system wrapper; when merging, if the path is already present its weight is revised, i.e. the occurrence frequency of the new path is added to that of the old path; if no duplicate is found, the new path is simply stored;
Step 7.4, extracting the article header and the article pagination information;
Step 7.5, revising the text region;
Step 7.6, partitioning the text region into blocks, while performing the block-property judgment and removing redundant blocks;
Step 7.7, pattern learning: first segmenting the text region, i.e. segmenting the text-region block tree to obtain the text segment sequence, then generating a pattern for each segment and warehousing it for learning;
Step 7.8, pattern induction, i.e. automatic regular-expression generation;
Step 7.9, end.
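The frequency-weighted path merge of step 7.3 can be sketched in a few lines; encoding a path as a hashable tuple and the library as a dict are assumptions for illustration.

```python
def merge_path(path_library, path):
    """Step 7.3 as a sketch: the path library maps a text-region path
    to its occurrence frequency. A repeated path has the new frequency
    added to the old one; a new path is simply warehoused with count 1."""
    key = tuple(path)
    path_library[key] = path_library.get(key, 0) + 1
    return path_library
```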
24. The method according to claim 23, wherein in said step 7.2,
the text is contained in one or more nested Div or Table nodes, and locating the text region means finding the single best Div or Table, realized as the Div or Table with the highest information degree, computed by the formula:

H = len_not_link * log(1 + len_link/len_allTxt) + a * len_not_link * log(1 + len_not_link/len_html)

Wherein:
a is an influence factor, currently defaulting to 0.5;
len_link is the length of the link text in the node;
len_not_link is the length of the non-link text in the node;
len_allTxt is the total text length in the node;
len_html is the HTML length of the node;
during the computation, 1 is added to each argument of log so that every log result is greater than 0.
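The information degree H of claim 24 translates directly into code; guarding against empty nodes is an added assumption, and everything else follows the formula term by term.

```python
import math

def information_degree(len_link, len_not_link, len_html, a=0.5):
    """Information degree H for a candidate Div/Table node, with
    len_allTxt = len_link + len_not_link. The +1 inside each log keeps
    both log terms positive, as the claim notes; a defaults to 0.5."""
    len_all_txt = len_link + len_not_link
    if len_all_txt == 0 or len_html == 0:
        return 0.0  # assumption: an empty node cannot be the text region
    return (len_not_link * math.log(1 + len_link / len_all_txt)
            + a * len_not_link * math.log(1 + len_not_link / len_html))
```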
25. The method according to claim 24, wherein, after said best Div or Table is found, it is traced back in the DOM tree to the body node; when the traceback ends, a path has been formed; during the traceback, the positional information of each DOM node passed, i.e. its left-to-right sequence number among the children of its parent, is also recorded; finally a DOM-tree path is obtained whose nodes carry their positional information.
26. The method according to claim 23, wherein in said step 7.4 the article header is mainly the title information, and its extraction comprises:
(1) taking the first few rows inside the text region, computing the title matching degree of each row and taking the maximum, to obtain the in-region candidate title row; a "row" here is one of the groups of adjacent DOM nodes, together with their corresponding HTML code, formed by splitting the page's DOM tree at HTML line-break tags such as <br> and <p>;
(2) taking the few rows in front of the text region, computing the title matching degree of each row and taking the maximum, to obtain the candidate title row before the region;
(3) then comparing the candidates according to heuristic rules and their title matching degrees, and selecting one as the title.
27. The method according to claim 26, wherein the measurement formula of said title matching degree is as follows:

P_t = a*(1 - len_punc/len_all) + b*(1 - len_title/len_max_title) + c*(1 - len_keywords/len_max_keywords) + d*(1 - len_summery/len_max_summery) + e*(1 - len_authortext/len_max_authortext) + f*WH + g*H_len

Wherein:
len_punc is the length of the punctuation in the row;
len_all is the total text length of the row;
len_title is the edit distance between the row content and the page's title field;
len_max_title is the maximum of the lengths of the row content and the page's title field;
keywords refers to the keyword information carried by the page, summery to the abstract field carried by the page, and authortext to the anchor text of the page URL; the meanings of these three groups of variables are analogous to the title variables above;
WH is the tag-type weight: if tags such as h1, h2, ..., center appear among the nodes of the row, the row is given extra weight;
H_len is the content-length weight: large-scale statistics show that title lengths between 16 and 23 characters are the most common, and every other length interval has its own probability of occurrence; the length weight of a node is computed from this distribution;
a, b, c, d, e, f, g are the influence factors of the respective terms and can be revised in applications.
28. The method according to claim 23, wherein in said step 7.4 the article pagination information is recognized by scanning the last few rows of the text region line by line for a numeric sequence; if a run of consecutive numbers such as "1, 2, 3, 4, 5, 6 ..." is found, and the URL links carried by those numbers belong to the same site as the page itself, the recognition succeeds.
29. The method according to claim 23, wherein said step 7.5, with the cues of the news-article format, together with the article header and pagination information from the preceding steps, revises the text region to make it more accurate, comprising:
1) when an article head has been found before the region, revising the text region as follows:
if the article head is inside the region, cutting off the information before the head;
if the article head is outside the region, merging the part between the head and the region into the text region;
2) when the article pagination information has been found after the end of the region:
if the article tail is inside the region, cutting off the tail part of the region;
if the article tail is outside the region, making no revision.
30. The method according to claim 23, wherein the text-region blocking step of said step 7.6 comprises the steps:
Step 7.6.1, identifying frequent patterns with the MDR method;
Step 7.6.2, for the frequent patterns obtained, searching for block headers, searching upward through the parent nodes of the blocks, and so on, to obtain the best combination of blocking nodes, and then combining them into blocks;
Step 7.6.3, marking all identified blocks in the text-region DOM tree.
31. The method according to claim 30, wherein said block combination follows these criteria:
(1) among all child nodes of the same parent, the nodes between two marked blocks also form a block, the nodes before the first block form a block, and the nodes after the last block form a block;
(2) if a marked block exists in the subtree of a node, the node itself is also a block.
32. The method according to claim 23, wherein the block-property judgment and redundant-block removal step of said step 7.6 follows these criteria:
for each block obtained, the ratio of its link text to its total text length is judged;
if a block's link ratio is greater than the threshold (0.5), the block is considered redundant and is removed from the tree, with an hr tag substituted in its place;
the remaining frequent-pattern blocks are marked, because their semantics are clear, so that they are not split in subsequent operations.
33. The method according to claim 23, wherein the pattern learning of said step 7.7 comprises:
said text-region segmentation method:
following the cues of the line-break tags in the text region, the content is segmented, the content between two line-break tags forming one segment;
said pattern generation and pattern learning process:
(1) for each segment, its HTML code is extracted and the HTML fragment is simplified so that only the tag names and the content remain; an md5 key is taken and the pattern is built;
a pattern is expressed as:
Pattern = md5((content: text/img) + forward traversal sequence of the segment's tags + site name) + value
wherein value is the weight information, i.e. the occurrence frequency of the pattern;
(2) the pattern obtained is then put into the wrapper library with a merging warehouse-in:
if an identical pattern is found in the library, the pattern's weight is increased, i.e. the value fields are merged;
if none is found, the pattern is warehoused.
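The pattern key and merging warehouse-in of claim 33 can be sketched as below. The '|'-separated serialization is an assumption; the claim fixes only the md5(content + tag sequence + site name) + value structure.

```python
import hashlib

def build_pattern_key(content, tag_sequence, site_name):
    """md5 over the simplified content, the segment's forward (pre-order)
    tag sequence, and the site name."""
    raw = content + '|' + '>'.join(tag_sequence) + '|' + site_name
    return hashlib.md5(raw.encode('utf-8')).hexdigest()

def warehouse_pattern(library, key):
    """Merging warehouse-in: an identical pattern accumulates weight
    (its value field); a new pattern starts at weight 1."""
    library[key] = library.get(key, 0) + 1
```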
34. The method according to claim 23, wherein the pattern induction of said step 7.8 comprises the steps:
Step 7.8.1: for all patterns in the library, their original strings are extracted and grouped by site; each group is clustered by string similarity, yielding several highly cohesive groupings;
Step 7.8.2: for each grouping obtained, the merged regular expression is computed for every pair of segments within it, yielding all possible distinct regular expressions; these are ranked by occurrence frequency and the most frequent one is taken; the second most frequent is then verified, and if it covers part of the grouping and the covered segments and weight are suitable, it is also kept as a desirable pattern;
the pattern of a pair of segments is extracted by recursively finding the optimum common fragment of the remainders of the two segments; the parts before the fragment are exactly where they differ and need to be merged; overall this is a dynamic-programming method over a two-dimensional table;
Step 7.8.3: among all the regular expressions obtained, those whose weight exceeds a certain threshold are kept and added to the library;
after the pattern induction finishes, several regular expressions are obtained and, with their weight value information attached, go into the library.
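The "optimum common fragment" search at the heart of claim 34's pair merging can be sketched as a longest-common-substring dynamic program over a two-dimensional table; reading "optimum" as "longest" is an assumption, since the claim does not define the optimality criterion.

```python
def best_common_fragment(s, t):
    """Find the longest common fragment of two segments by dynamic
    programming. The differing prefixes before the returned fragment
    are the parts a merged regular expression must generalize over.
    Returns (fragment, start_in_s, start_in_t)."""
    best_len, end_s, end_t = 0, 0, 0
    prev = [0] * (len(t) + 1)   # prev[j]: common suffix length at s[i-1], t[j-1]
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, end_s, end_t = cur[j], i, j
        prev = cur
    return s[end_s - best_len:end_s], end_s - best_len, end_t - best_len
```

Applied recursively to the remainders on each side of the fragment, this yields the common skeleton of the pair, with the differing parts left to be merged into regular-expression alternatives.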
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100083895A CN102609456A (en) | 2012-01-12 | 2012-01-12 | System and method for real-time and smart article capturing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102609456A true CN102609456A (en) | 2012-07-25 |
Family
ID=46526828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012100083895A Pending CN102609456A (en) | 2012-01-12 | 2012-01-12 | System and method for real-time and smart article capturing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102609456A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324522A (en) * | 2013-06-20 | 2013-09-25 | 北京奇虎科技有限公司 | Method and device for scheduling tasks for capturing data from servers |
CN103617264A (en) * | 2013-12-02 | 2014-03-05 | 北京奇虎科技有限公司 | Method and device for grabbing timeliness seed page |
CN105260388A (en) * | 2015-09-11 | 2016-01-20 | 广州极数宝数据服务有限公司 | Optimization method of distributed vertical crawler service system |
CN105550165A (en) * | 2015-12-23 | 2016-05-04 | 深圳市八零年代网络科技有限公司 | Plug-in and method capable of importing webpage article into webpage text editor |
CN105955980A (en) * | 2013-05-31 | 2016-09-21 | 北京奇虎科技有限公司 | File download device and method |
CN106294364A (en) * | 2015-05-15 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Realize the method and apparatus that web crawlers captures webpage |
CN106557334A (en) * | 2015-09-25 | 2017-04-05 | 北京国双科技有限公司 | Determination methods and device that reptile task is completed |
CN107193828A (en) * | 2016-03-14 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Novel webpage capture method and apparatus |
CN107610693A (en) * | 2016-07-11 | 2018-01-19 | 科大讯飞股份有限公司 | The construction method and device of text corpus |
CN108270812A (en) * | 2016-12-30 | 2018-07-10 | 深圳市青果乐园网络科技有限公司 | For obtaining method and system of the article publication with situation of sharing |
CN109918557A (en) * | 2019-03-12 | 2019-06-21 | 厦门商集网络科技有限责任公司 | A kind of web data crawls merging method and computer readable storage medium |
CN111178057A (en) * | 2020-01-02 | 2020-05-19 | 大汉软件股份有限公司 | Content analysis and extraction system of government affair electronic document |
CN111538887A (en) * | 2020-04-30 | 2020-08-14 | 广东所能网络有限公司 | Big data image-text recognition system and method based on artificial intelligence |
CN112488840A (en) * | 2019-09-12 | 2021-03-12 | 京东数字科技控股有限公司 | Information output method and device |
CN112836018A (en) * | 2021-02-07 | 2021-05-25 | 北京联创众升科技有限公司 | Method and device for processing emergency plan |
CN113704589A (en) * | 2021-09-03 | 2021-11-26 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101203847A (en) * | 2005-03-11 | 2008-06-18 | 雅虎公司 | System and method for managing listings |
CN102073683A (en) * | 2010-12-22 | 2011-05-25 | 四川大学 | Distributed real-time news information acquisition system |
CN102096705A (en) * | 2010-12-31 | 2011-06-15 | 南威软件股份有限公司 | Article acquisition method |
CN102402627A (en) * | 2011-12-31 | 2012-04-04 | 凤凰在线(北京)信息技术有限公司 | System and method for real-time intelligent capturing of article |
CN102567530A (en) * | 2011-12-31 | 2012-07-11 | 凤凰在线(北京)信息技术有限公司 | Intelligent extraction system and intelligent extraction method for article type web pages |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105955980A (en) * | 2013-05-31 | 2016-09-21 | 北京奇虎科技有限公司 | File download device and method |
CN103324522A (en) * | 2013-06-20 | 2013-09-25 | 北京奇虎科技有限公司 | Method and device for scheduling tasks for capturing data from servers |
CN103324522B (en) * | 2013-06-20 | 2016-09-28 | 北京奇虎科技有限公司 | The method and apparatus that the task of capturing data from each server is scheduling |
CN103617264A (en) * | 2013-12-02 | 2014-03-05 | 北京奇虎科技有限公司 | Method and device for grabbing timeliness seed page |
CN103617264B (en) * | 2013-12-02 | 2020-07-07 | 北京奇虎科技有限公司 | Method and device for capturing timeliness seed page |
CN106294364A (en) * | 2015-05-15 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Realize the method and apparatus that web crawlers captures webpage |
CN106294364B (en) * | 2015-05-15 | 2020-04-10 | 阿里巴巴集团控股有限公司 | Method and device for realizing web crawler to capture webpage |
CN105260388A (en) * | 2015-09-11 | 2016-01-20 | 广州极数宝数据服务有限公司 | Optimization method of distributed vertical crawler service system |
CN106557334B (en) * | 2015-09-25 | 2020-02-07 | 北京国双科技有限公司 | Method and device for judging completion of crawler task |
CN106557334A (en) * | 2015-09-25 | 2017-04-05 | 北京国双科技有限公司 | Determination methods and device that reptile task is completed |
CN105550165A (en) * | 2015-12-23 | 2016-05-04 | 深圳市八零年代网络科技有限公司 | Plug-in and method capable of importing webpage article into webpage text editor |
CN107193828A (en) * | 2016-03-14 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Novel webpage capture method and apparatus |
CN107610693B (en) * | 2016-07-11 | 2021-01-29 | 科大讯飞股份有限公司 | Text corpus construction method and device |
CN107610693A (en) * | 2016-07-11 | 2018-01-19 | 科大讯飞股份有限公司 | The construction method and device of text corpus |
CN108270812A (en) * | 2016-12-30 | 2018-07-10 | 深圳市青果乐园网络科技有限公司 | For obtaining method and system of the article publication with situation of sharing |
CN108270812B (en) * | 2016-12-30 | 2021-03-23 | 深圳市青果乐园网络科技有限公司 | Method and system for acquiring article publishing and sharing conditions |
CN109918557A (en) * | 2019-03-12 | 2019-06-21 | 厦门商集网络科技有限责任公司 | A kind of web data crawls merging method and computer readable storage medium |
CN112488840A (en) * | 2019-09-12 | 2021-03-12 | 京东数字科技控股有限公司 | Information output method and device |
CN111178057A (en) * | 2020-01-02 | 2020-05-19 | 大汉软件股份有限公司 | Content analysis and extraction system of government affair electronic document |
CN111178057B (en) * | 2020-01-02 | 2024-01-30 | 大汉软件股份有限公司 | Content analysis and extraction system for government electronic documents |
CN111538887A (en) * | 2020-04-30 | 2020-08-14 | 广东所能网络有限公司 | Big data image-text recognition system and method based on artificial intelligence |
CN111538887B (en) * | 2020-04-30 | 2023-11-10 | 贵阳杰汇数字创新中心有限公司 | Big data graph and text recognition system and method based on artificial intelligence |
CN112836018A (en) * | 2021-02-07 | 2021-05-25 | 北京联创众升科技有限公司 | Method and device for processing emergency plan |
CN113704589A (en) * | 2021-09-03 | 2021-11-26 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
CN113704589B (en) * | 2021-09-03 | 2023-10-13 | 海粟智链(青岛)科技有限公司 | Internet system for collecting industrial chain data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102609456A (en) | System and method for real-time and smart article capturing | |
CN102567530B (en) | Intelligent extraction system and intelligent extraction method for article type web pages | |
CN102402627B (en) | System and method for real-time intelligent capturing of article | |
CN100485603C (en) | Systems and methods for generating concept units from search queries | |
US8868621B2 (en) | Data extraction from HTML documents into tables for user comparison | |
CN101950312B (en) | Method for analyzing webpage content of internet | |
CN105005600B (en) | Preprocessing method of URL (Uniform Resource Locator) in access log | |
CN105893583A (en) | Data acquisition method and system based on artificial intelligence | |
CN109271477A (en) | A kind of method and system by internet building taxonomy library | |
US20080306941A1 (en) | System for automatically extracting by-line information | |
CN109299480A (en) | Terminology Translation method and device based on context of co-text | |
CN104933168B (en) | A kind of web page contents automatic acquiring method | |
US20060026496A1 (en) | Methods, apparatus and computer programs for characterizing web resources | |
CN101079024A (en) | Special word list dynamic generation system and method | |
CN108416034B (en) | Information acquisition system based on financial heterogeneous big data and control method thereof | |
CN110362824A (en) | A kind of method, apparatus of automatic error-correcting, terminal device and storage medium | |
CN103530429A (en) | Webpage content extracting method | |
CN104199833A (en) | Network search term clustering method and device | |
CN102662969A (en) | Internet information object positioning method based on webpage structure semantic meaning | |
CN104615734B (en) | A kind of community management service big data processing system and its processing method | |
CN104598536B (en) | A kind of distributed network information structuring processing method | |
CN108959580A (en) | A kind of optimization method and system of label data | |
CN103942268A (en) | Method and device for combining search and application and application interface | |
CN103870495A (en) | Method and device for extracting information from website | |
US11334592B2 (en) | Self-orchestrated system for extraction, analysis, and presentation of entity data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120725 |