CN102750390A

CN102750390A - Automatic news webpage element extracting method

Info

Publication number: CN102750390A
Application number: CN2012102328312A
Authority: CN
Inventors: 张长水; 宋成儒; 翁时锋
Original assignee: Individual
Current assignee: Ningbo Zhongqing Cyyun New Media Technology Co Ltd
Priority date: 2012-07-05
Filing date: 2012-07-05
Publication date: 2012-10-24
Anticipated expiration: 2032-07-05
Also published as: CN102750390B

Abstract

The invention provides an automatic method for extracting the key elements of a news web page, comprising the following steps: (1) extracting the web page title and web page meta information from the web page source code and obtaining a keyword dictionary related to the web page content; and (2) traversing the text nodes in the web page source code and, using the keyword dictionary, detecting and extracting the news title, publication time, news source and body text in the order news title, publication time, news source, body text or in the order news title, news source, publication time, body text. The method does not depend on any specific template and is highly general.

Description

News web page key element extraction method

Technical field

The present invention relates to Internet information analysis technology, and in particular to a method for automatically extracting the key elements of news web pages.

Background technology

In recent years, with the widespread adoption of the Internet, more and more people obtain useful information from online media. Online information is highly timely, and many major news events first spread on the Internet. Analyzing online information, news in particular, can therefore help us keep track of social developments, detect local anomalies in time, and maintain social harmony and stability.

News on the Internet is vast. Manual analysis cannot keep up with the rate at which news is updated and is also prone to oversights, so the analysis is usually done by computer. Given a news web page, the first step toward understanding and analyzing its information is to automatically extract the four key news elements: the news title, publication time, news source and body text, as shown in Fig. 1.

Existing news element extraction methods mostly focus only on the news title and body text. They mainly include the following three kinds of methods:

1. Regular expressions

A regular expression is a character string generated according to specific syntax rules and is used to describe or match strings that conform to a given syntactic pattern. If news web pages are generated from the same template, the code pattern of the body text region can be expressed as a regular expression; see Fig. 2, a schematic diagram of extracting news elements using regular expressions. The same expression can then be used to extract the content of each new input web page. This method is simple, convenient and targeted: once created, it can be run indefinitely.

However, the regular expression method has drawbacks. Manually generalizing the code pattern of a web page is very complicated. A regular expression written for one template applies only to that template and is useless for web pages in other formats. Even for the original template, adding nesting or making slight modifications in the body text may cause content extraction to fail.

2. Wrappers

The regular expression method requires manual authoring and maps one-to-one to web page templates. Researchers subsequently tried to find ways to automatically derive a unified model for multi-template web pages. N. Kushmerick first proposed an algorithm named WIEN in 1997 to realize this idea, and the resulting model is called a wrapper. Here a wrapper denotes a procedure that, for a new information source, uses existing template data and empirical knowledge of web pages to automatically perform induction and derivation in a manner similar to artificial intelligence. The derived result can then be applied to automatic information extraction from the new source.

Although the wrapper extraction method partly overcomes the inefficiency and narrow applicability of the regular expression method, it never departs from the essence of that method: the cost of induction is high, and it remains fundamentally dependent on templates.

In summary, existing news element extraction methods depend too heavily on templates, have poor generality, and require complicated induction of template code.

Summary of the invention

The purpose of the present invention is to provide a method for automatically extracting the key elements of news web pages, so as to solve the problems of existing news element extraction methods: excessive dependence on templates, poor generality, and complicated induction of template code.

The present invention proposes a method for automatically extracting the key elements of news web pages, comprising the following steps:

(1) extracting the web page title and web page meta information from the web page source code, and obtaining a keyword dictionary related to the web page content;

(2) traversing the text nodes in the web page source code and, using said keyword dictionary, detecting and extracting the news title, publication time, news source and body text in the order news title, publication time, news source, body text or in the order news title, news source, publication time, body text.

Further, before step (1), the method also comprises: (10) preprocessing the web page source code to remove script code.

Further, step (1) also comprises: (11) performing word segmentation on the extracted web page title and web page meta information and removing stop words.

Further, step (2) also comprises: (21) filtering the text nodes, the filtered-out text nodes being excluded from the detection range.

Further, in step (21), filtering the text nodes according to the parent node tag of each text node comprises:

(211) filtering out text nodes that have no parent node;

(212) filtering out text nodes whose parent node tag is not one of <div>, <paragraph>, <tablecolumn>, <heading> and <span>;

(213) filtering out text nodes whose parent node tag is <div> with its style set to "hidden";

(214) after the news title and publication time have been detected, filtering out text nodes whose parent node tag is <heading>;

(215) filtering out text nodes whose parent node tag is <span> or <div> and whose text length is less than the average length of a body text paragraph.

Further, in step (21), filtering the text nodes according to text content comprises:

(216) filtering out text nodes that contain copyright notice information;

(217) filtering out text nodes that contain the words "share" and/or "comment" and/or "microblog".

Further, in step (2), detecting and extracting the news title, publication time and news source comprises:

(22) when the text of a text node is contained in the web page title and its text length is not less than one third of the text length of the web page title, or when the text similarity between the text of any text node and the web page title is not less than a preset threshold, extracting the text content of that text node as the news title, and performing no further news title detection thereafter;

(23) matching the content of a text node against time formats, extracting the content of a successfully matched text node as the publication time, and performing no further publication time detection thereafter;

(24) when the content of a text node contains the word "source" or "author", extracting the content of that text node as the news source, and performing no further news source detection thereafter.

Further, in step (2), detecting and extracting the body text comprises:

(25) establishing a high-hit set that stores the text nodes with a high number of keyword dictionary hits;

(26) refining the high-hit set by clustering, and taking the longest set of consecutive nodes as the refined set;

(27) finding the lowest common parent node of the refined set;

(28) traversing the document tree rooted at the lowest common parent node and obtaining the body text.

Further, after step (25), the method also comprises:

(251) establishing a suspect set that stores text nodes whose keyword dictionary hit count is insufficient or whose text length is greater than a preset value;

(252) comparing the information content of the high-hit set and the suspect set;

(253) if the information content of the high-hit set is less than that of the suspect set, lowering the hit count threshold for admission to the high-hit set, traversing the text nodes in the web page source code again, and rebuilding the high-hit set;

(254) if the information content of the high-hit set is greater than that of the suspect set, proceeding to step (26).

Further, step (28) comprises:

(281) if a text node is identical to the node of the news title, publication time or news source, taking that text node as the start of the body text;

(282) if the parent node tag of a text node is a link type and none of its ancestor nodes is a list type, extracting the content of that text node and adding it to the body text;

(283) if the parent node tag of a text node is one of <div>, <paragraph>, <tablecolumn>, <heading> and <span>, extracting the content of that text node and adding it to the body text.

Compared with the prior art, the beneficial effects of the present invention are as follows: starting from a statistical analysis of Chinese news web pages and combining the advantages of machine learning methods and the regular expression method, the present invention proposes a complete automated procedure for accurately extracting the four elements: news title, publication time, news source and body text. The present invention does not depend on any specific template and is highly general.

Description of drawings

Fig. 1 is a schematic diagram of the four elements of a news web page;

Fig. 2 is a schematic diagram of extracting news elements using regular expressions;

Fig. 3 is a flow chart of a method for automatically extracting news web page key elements according to an embodiment of the invention;

Fig. 4 is a schematic diagram of the bag-of-words model formed from the web page title and web page meta information in Fig. 1;

Fig. 5 is a more detailed flow chart of another method for automatically extracting news web page key elements according to an embodiment of the invention;

Fig. 6 is a schematic diagram of a structural feature of news web pages;

Fig. 7 is a flow framework diagram of learning unknown sources with the active learning method of the present invention.

Embodiment

The present invention is described in detail below with reference to the accompanying drawings.

See Fig. 3, which is a flow chart of a method for automatically extracting news web page key elements according to an embodiment of the invention. The method comprises the following steps:

S31: extract the web page title and web page meta information from the web page source code, and obtain a keyword dictionary related to the web page content.

The web page title is a high-level summary of a web page; when a web page is browsed, the information shown in the title bar at the top of the browser is the web page title. In the web page source code (HTML code), the web page title is located between the <head></head> tags, in the form <title>The network marketing teaching website</title>, where "The network marketing teaching website" is the web page title.

Web page meta information is contained in <meta> tags and provides document-related information in the form of key-value pairs, mainly as an indexing reference for search engines. Within the meta information, description is a description of the web page content and keywords holds the keywords of the web page content; these two items give a good indication of the news content.
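As an illustration only, the extraction of the web page title and the description and keywords meta values might be sketched as follows with Python's standard html.parser module. The class name HeadInfoParser and the function extract_head_info are hypothetical names introduced for this sketch; the patent does not prescribe any particular parser.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_head_info(html):
    """Return (web page title, dict of description/keywords meta values)."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta
```

The returned title and meta strings would then be segmented into words to form the keyword dictionary.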

After the web page title and web page meta information are extracted, the present invention preferably uses a bag-of-words model to extract keywords related to the web page content and form the keyword dictionary. The bag-of-words model is a concept from text mining: it ignores word order and modification relationships and treats a text fragment simply as a set of words. Taking the news page of Fig. 1 as an example, the bag-of-words model that can be formed from its web page title and web page meta information is shown in Fig. 4. On the basis of the bag-of-words model, the present invention can further represent text fragments as vectors and then compute the content similarity between text fragments through operations such as vector distance and inner product. If only the presence or absence of words is considered and vector distance is used as the similarity measure, the similarity computation reduces to counting the entries that two bags of words have in common. Besides discrete vector distance, text similarity can of course also be computed using cosine distance, Euclidean distance, city-block distance and so on.
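The two simplest measures mentioned above can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented into word tokens; the function names common_entry_count and cosine_similarity are hypothetical, and the cosine form shown is the one that applies to binary (presence or absence) word vectors.

```python
import math


def common_entry_count(tokens_a, tokens_b):
    """Number of distinct words that two bags of words have in common."""
    return len(set(tokens_a) & set(tokens_b))


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of the binary word vectors of two bags of words."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```

The common-entry count is cheap to compute for every text node, which suits a single-pass traversal; the cosine form normalizes for fragment length when fragments differ greatly in size.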

S32: traverse the text nodes in the web page source code and, using said keyword dictionary, detect and extract the news title, publication time, news source and body text in the order news title, publication time, news source, body text or in the order news title, news source, publication time, body text.

The text nodes referred to in the present invention are nodes in the Document Object Model. The Document Object Model (DOM) is an application programming interface that can be used to dynamically access documents such as HTML and XML. The present invention mainly uses the HTML DOM, which represents a document as a tree structure and defines methods for accessing and manipulating the elements in the document.

To aid understanding of the technical solution, the present invention is described below through a detailed embodiment. See Fig. 5, which is a more detailed flow chart of another method for automatically extracting news web page key elements according to an embodiment of the invention. The method comprises the following steps:

S501: preprocess the web page source code to remove script (JS) code, so that the dynamically loaded content it contains does not interfere with determining the position of the body text.

S502: extract the web page title and web page meta information from the web page source code, and obtain a keyword dictionary related to the web page content.

After the web page title and web page meta information are extracted, the present invention preferably uses a bag-of-words model to extract keywords related to the web page content and form the keyword dictionary. The bag-of-words model is shown in Fig. 4.

However, not all words in the bag are needed. Some words occur very frequently in news web pages but contribute little to expressing the news content, such as "at present", "so" and "it is reported"; the present invention refers to these as stop words. Therefore, after the keyword dictionary has been initially formed, the stop words that could interfere with the content similarity judgment can be removed, which makes the computation simpler.
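A minimal sketch of forming the keyword dictionary with stop words removed is shown below. It assumes the title and meta information have already been segmented into tokens; STOP_WORDS contains only the three examples given in the text, and build_keyword_dictionary is a hypothetical name.

```python
# Only the three examples from the text; a practical list would be much longer.
STOP_WORDS = {"at present", "so", "it is reported"}


def build_keyword_dictionary(title_tokens, meta_tokens, stop_words=STOP_WORDS):
    """Bag of words from title and meta tokens, with stop words removed."""
    return {
        token
        for token in list(title_tokens) + list(meta_tokens)
        if token and token not in stop_words
    }
```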

S503: establish a high-hit set and a suspect set. During traversal, the high-hit set stores text nodes with a high number of keyword dictionary hits (meaning high content similarity), and the suspect set stores text nodes with insufficient hits but sufficiently long node text. The purpose is to discover suspected body text and thereby determine the extent of the body text. After this, the web page structure is parsed and traversal of its text nodes begins.

S504: filter the text nodes during traversal; filtered-out text nodes are excluded from the detection range. The present invention preferably uses two kinds of rules to filter text nodes:

1. Filtering text nodes according to the parent node tag of each text node, comprising:

- filtering out text nodes that have no parent node;

- filtering out text nodes whose parent node tag is not one of <div>, <paragraph>, <tablecolumn>, <heading> and <span>;

- filtering out text nodes whose parent node tag is <div> with its style set to "hidden";

- after the news title and publication time have been detected, filtering out text nodes whose parent node tag is <heading>;

- filtering out text nodes whose parent node tag is <span> or <div> and whose text length is less than the average length of a body text paragraph. This average length is an empirical value obtained from statistics over a large number of labeled <span> or <div> samples; such text nodes are regarded as navigation information and are not examined.

2. Filtering text nodes according to text content, comprising:

- filtering out text nodes that contain copyright notice information;

- filtering out text nodes that contain the words "share" and/or "comment" and/or "microblog".
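The two groups of filtering rules above can be sketched as a single predicate. Several details here are assumptions rather than part of the source: the tags <paragraph>, <tablecolumn> and <heading> are mapped to the HTML tags p, td and h1 to h6; the copyright markers are hypothetical; and the TextNode class, the is_filtered function and the avg_paragraph_len parameter (the empirical average paragraph length, whose value the text does not give) are names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}        # assumed mapping of <heading>
ALLOWED_PARENTS = {"div", "p", "td", "span"} | HEADING_TAGS  # assumed mapping of the tag list
BLOCKED_WORDS = ("share", "comment", "microblog")
COPYRIGHT_MARKERS = ("copyright", "\u00a9")                 # hypothetical markers


@dataclass
class TextNode:
    text: str
    parent_tag: Optional[str]    # None when the node has no parent
    parent_hidden: bool = False  # True when the parent's style is "hidden"


def is_filtered(node, *, avg_paragraph_len, title_and_time_found=False):
    """Return True when the node should be excluded from detection."""
    tag = node.parent_tag
    if tag is None or tag not in ALLOWED_PARENTS:
        return True
    if tag == "div" and node.parent_hidden:
        return True
    if title_and_time_found and tag in HEADING_TAGS:
        return True
    if tag in ("span", "div") and len(node.text) < avg_paragraph_len:
        return True
    lowered = node.text.lower()
    if any(marker in lowered for marker in COPYRIGHT_MARKERS):
        return True
    return any(word in lowered for word in BLOCKED_WORDS)
```

Ordering the cheap tag checks before the content checks means most navigation nodes are rejected without scanning their text.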

S505: news title detection.

When the text of a text node is contained in the web page title and its text length is not less than one third of the text length of the web page title, or when the text similarity between the text of any text node and the web page title is not less than a preset threshold, the text content of that text node is extracted as the news title, and news title detection is not performed thereafter. There can be no body text before the news title, so once the news title has been detected, the text nodes preceding the news title node can be removed from the high-hit set and the suspect set to streamline subsequent computation.
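The title test above might be sketched as follows. The source does not fix a similarity measure or a threshold value, so this sketch assumes the ratio from Python's standard difflib.SequenceMatcher and a hypothetical default threshold of 0.8; is_news_title is likewise a name introduced for illustration.

```python
from difflib import SequenceMatcher


def is_news_title(node_text, page_title, similarity_threshold=0.8):
    """Apply the containment-plus-length rule, then the similarity rule."""
    text = node_text.strip()
    if not text or not page_title:
        return False
    # Rule 1: contained in the page title and at least one third of its length.
    if text in page_title and len(text) * 3 >= len(page_title):
        return True
    # Rule 2: similarity to the page title reaches the (assumed) threshold.
    return SequenceMatcher(None, text, page_title).ratio() >= similarity_threshold
```

The one-third length condition keeps short fragments of the page title, such as a site name, from being mistaken for the news title.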

S506: publication time detection.

Once the title has been found, the content of each text node is matched against time formats; the content of a successfully matched text node is extracted as the publication time, and publication time detection is not performed thereafter. A time format here generally consists of two parts: digits and digit connectors. The year may have 4 digits, such as "2012", or 2 digits, such as "12". The month and day have at most 2 digits; a single-digit month or day may be zero-padded, as in "02" and "03", or left unpadded, as in "2" and "3". The digit connectors are mainly hyphens, periods, spaces, date characters (such as the characters for year, month and day) and forward slashes. There can be no body text before the publication time, so once the publication time has been detected, the text nodes preceding the publication time node can be removed from the high-hit set and the suspect set to streamline subsequent computation.
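One way to express the time format described above is a regular expression. The pattern below is a sketch under stated assumptions: it covers only the date part (year, month, day), it takes the date characters to be the Chinese characters 年, 月 and 日, and DATE_RE and find_publication_date are hypothetical names.

```python
import re

# Year of 4 or 2 digits, then month and day of 1 or 2 digits, joined by a
# hyphen, period, slash, space or a date character (assumed: 年, 月, 日).
DATE_RE = re.compile(
    r"(?<!\d)(\d{4}|\d{2})[-./ 年](\d{1,2})[-./ 月](\d{1,2})日?(?!\d)"
)


def find_publication_date(text):
    """Return (year, month, day) of the first plausible date, or None."""
    for match in DATE_RE.finditer(text):
        year, month, day = (int(group) for group in match.groups())
        if 1 <= month <= 12 and 1 <= day <= 31:
            return year, month, day
    return None
```

For example, find_publication_date("2012-07-05 10:30") yields (2012, 7, 5). A two-digit year is returned as matched, since the text does not say how it should be expanded.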

S507: news source detection.

The publication time and news source of a news web page are very likely to be in the same piece of text, so after the publication time is found, source format matching is also performed on that text. When the content of a text node contains the word "source" or "author", the content of that text node is extracted as the news source, and news source detection is not performed thereafter.

S508: count each text node's hits against the keyword dictionary.

During traversal, each text node that meets the detection requirements is examined. If the content of the text node hits the keyword dictionary 2 or more times, the text node is added to the high-hit set. If it hits exactly once, the node is added to the suspect set. If the content does not hit the keyword dictionary at all but the text length of the node is greater than 20 (20 characters being roughly half a line as normally displayed on a typical Chinese news web page), the node is considered likely to belong to the body text and is also added to the suspect set.

The hit count thresholds for admission to the high-hit set and the suspect set, as well as the text length for admission to the suspect set, can all be set according to actual needs.
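The assignment rule of S508 can be sketched as below. How a "hit" is counted is not spelled out in the text; this sketch assumes one hit per distinct dictionary keyword that occurs in the node text. The names count_hits and classify_node are hypothetical, while the defaults 2 and 20 are the values given above.

```python
def count_hits(text, keyword_dictionary):
    """Assumed hit count: distinct dictionary keywords occurring in the text."""
    return sum(1 for keyword in keyword_dictionary if keyword and keyword in text)


def classify_node(text, keyword_dictionary, high_hit_threshold=2, min_suspect_length=20):
    """Return "high_hit", "suspect", or None when the node joins neither set."""
    hits = count_hits(text, keyword_dictionary)
    if hits >= high_hit_threshold:
        return "high_hit"
    if hits >= 1:
        return "suspect"
    if len(text) > min_suspect_length:
        return "suspect"
    return None
```

Keeping both thresholds as parameters mirrors the statement that they can be set as needed, and lets the later re-traversal simply call classify_node with high_hit_threshold=1.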

S509: compare the information content of the high-hit set and the suspect set.

Assessing the information content of the high-hit set relies mainly on content, while assessing that of the suspect set relies mainly on structural factors. The number of text nodes in the high-hit set is denoted N1. The LDR (Length-Distance Ratio) value of each node in the suspect set is computed, and the number of text nodes whose LDR value exceeds a certain threshold is denoted N2. The number of text nodes in the suspect set that have keyword hits is denoted N3. The information content of the high-hit set and the suspect set is compared according to the relative magnitudes of these three values.

If the information content of the high-hit set is greater than that of the suspect set, proceed to step S510; otherwise lower the hit count threshold for admission to the high-hit set (for example from 2 hits to 1 hit), traverse again and rebuild the high-hit set.

If the suspect set still has the greater information content after re-traversal, it is very likely that the web page title (<title>) and web page meta information (<meta>) are insufficient; in that case a new high-hit set can be formed directly by selecting N2 text nodes from the suspect set.

If N2 is 0, the body text is very likely to be short or scattered, which makes the LDR values very small. In this case the body text can be extracted directly, as follows:

- if the suspect set contains very few text nodes, the longest text can be taken directly as the body text;

- the first node in the suspect set is treated as a suspected title; starting from the second node, an interval threshold is set and consecutive text nodes that satisfy the interval threshold are sought, their combined content being taken as the body text.

The LDR (Length-Distance Ratio) value mentioned here is a structural feature of news web pages. It measures how tightly the body text context is connected and helps distinguish body text from non-body text. Text in a web page has a certain length, and adjacent text nodes are separated by a certain distance; the ratio of length to distance measures how tightly pieces of text are connected. As shown in Fig. 6, L is the text length and D is the distance between text nodes; the average of the ratios before and after a node can be taken as the measure of body text context compactness.

The LDR value is calculated as follows:

LDR(i) = \frac{1}{2} \left( \frac{L(i-1)}{D(i-1,\, i)} + \frac{L(i)}{D(i,\, i+1)} \right)

The LDR value is necessarily less than 1; the closer it is to 1, the tighter the contextual connection and the more likely the text is genuine body text.
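The LDR formula can be evaluated directly, as in the sketch below. It assumes that D(a, b) is the difference between the starting offsets of nodes a and b in the web page code, which is consistent with the value being less than 1 because the gap then includes the text itself plus the surrounding markup. The function name ldr and the (start_offset, text_length) tuple layout are hypothetical, and boundary nodes, which lack a neighbor on one side, are deliberately left out.

```python
def ldr(nodes, i):
    """LDR of node i; nodes is a list of (start_offset, text_length) tuples
    sorted by start_offset. Only interior nodes are supported in this sketch."""
    if not 1 <= i <= len(nodes) - 2:
        raise IndexError("LDR needs both a previous and a next node")
    (prev_start, prev_len) = nodes[i - 1]
    (cur_start, cur_len) = nodes[i]
    (next_start, _) = nodes[i + 1]
    before = prev_len / (cur_start - prev_start)  # L(i-1) / D(i-1, i)
    after = cur_len / (next_start - cur_start)    # L(i)   / D(i, i+1)
    return 0.5 * (before + after)
```

For three nodes at offsets 0, 100 and 200 with lengths 80, 90 and 50, the middle node has ratios 80/100 and 90/100, giving an LDR of about 0.85.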

S510: build the refined set.

The starting position of each text node in the high-hit set within the web page code is extracted, and the high-hit set is clustered using a clustering method. Clustering is the process of dividing the high-hit set into different classes according to the similarity of certain features of the text nodes, such that elements within a class are highly similar and classes differ greatly from one another. Considering that these text nodes may belong to three parts, namely before the body text, within the body text or after the body text, the initial number of classes is preferably set to 3. The clustering result is analyzed, and the longest set of consecutive nodes is taken as the refinement of the high-hit set, called the refined set.

The clustering method of the present invention is preferably K-means clustering. K-means clustering first requires determining the number of classes k and choosing k initial class centers. Each object is assigned to one of the classes according to its distance from the k centers, after which the k class centers are updated. This is iterated until the k centers are essentially stable, which yields the k-class clustering result.
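A minimal sketch of the refinement step follows, clustering the starting positions of the high-hit nodes with one-dimensional K-means and keeping the largest cluster. Two points are assumptions of this sketch rather than statements of the source: the initial centers are spread evenly over the range of positions, and "the longest set of consecutive nodes" is read as the cluster with the most members (in one dimension each cluster is a contiguous run of positions). The names kmeans_1d and refined_set are hypothetical.

```python
def kmeans_1d(values, k=3, max_iter=100):
    """Plain 1-D K-means; returns one cluster label per input value."""
    lo, hi = min(values), max(values)
    # Assumption: initial centers are spread evenly over the value range.
    centers = [lo + (hi - lo) * j / max(k - 1, 1) for j in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:  # centers are stable, so stop iterating
            break
        centers = updated
    return labels


def refined_set(positions, k=3):
    """Positions belonging to the largest cluster, in document order."""
    positions = sorted(positions)
    labels = kmeans_1d(positions, k)
    sizes = {lab: labels.count(lab) for lab in set(labels)}
    best = max(sizes, key=sizes.get)
    return [p for p, lab in zip(positions, labels) if lab == best]
```

With k = 3, stray hits in the navigation area before the body text and in the footer after it tend to fall into their own small clusters and are discarded.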

S511: find the lowest common parent node of the refined set.

The ancestors of each text node in the refined set (that is, the nodes above each text node in the DOM tree) are enumerated, and a count is accumulated for each ancestor that recurs. Among the nodes with the maximum count, the one positioned furthest back is taken as the body text start node; this node is precisely the lowest common parent node of the elements of the refined set. If the news title node has been obtained and the extracted body text start node is positioned before the news title node, the refined set is considered to be mixed with content outside the body text. In that case, among the nodes with the second-highest count, the one positioned furthest back is taken as the corrected body text start node, and its position is recorded for later use.
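The ancestor-counting procedure can be sketched as follows. Each ancestor is represented here as a hypothetical (source_position, node_id) tuple, and lowest_common_parent is a name introduced for illustration; the correction step using the second-highest count is not shown.

```python
from collections import Counter


def lowest_common_parent(ancestor_chains):
    """ancestor_chains holds, for each text node of the refined set, its
    ancestors as (source_position, node_id) tuples. Returns the ancestor with
    the maximum count that is positioned furthest back in the source."""
    counts = Counter(
        ancestor for chain in ancestor_chains for ancestor in set(chain)
    )
    top = max(counts.values())
    candidates = [ancestor for ancestor, count in counts.items() if count == top]
    return max(candidates, key=lambda ancestor: ancestor[0])
```

Every common ancestor is counted once per text node, so all of them share the maximum count; the one that starts latest in the source is the deepest, which is why it is the lowest common parent.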

S512: traverse the document tree rooted at the lowest common parent node and obtain the body text.

When obtaining the body text, the text nodes in the document tree rooted at the lowest common parent node are processed as follows:

1) if a node is identical to the news title, publication time or news source node already found, it is not extracted; instead, it is taken as the start of the genuine body text and any body text found so far is cleared;

2) if the parent node tag of a node is a link type, checking continues upward; if no ancestor is a list type, the possibility of navigation is ruled out, and the node content is extracted and appended to the body text extracted so far;

3) if the parent node tag of a node is one of <div>, <paragraph>, <tablecolumn>, <heading> and <span>, the node content is extracted and appended to the body text extracted so far.
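The three processing rules above might be combined as in the following sketch. The tag mappings are assumptions (link type as the a tag, list types as ul, ol and li, and <paragraph>, <tablecolumn>, <heading> as p, td and h1 to h6), and the BodyNode class and collect_body function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List

BODY_PARENTS = {"div", "p", "td", "span", "h1", "h2", "h3", "h4", "h5", "h6"}  # assumed mapping
LIST_TAGS = {"ul", "ol", "li"}                                                # assumed list types


@dataclass
class BodyNode:
    text: str
    parent_tag: str
    ancestor_tags: List[str] = field(default_factory=list)  # tags above the parent
    is_found_element: bool = False  # already identified as title, time or source


def collect_body(nodes):
    """Apply rules 1) to 3) to text nodes given in document order."""
    body = []
    for node in nodes:
        if node.is_found_element:
            body.clear()  # rule 1: the genuine body text starts after this node
        elif node.parent_tag == "a":
            # rule 2: keep link text unless some ancestor is a list (navigation)
            if not any(tag in LIST_TAGS for tag in node.ancestor_tags):
                body.append(node.text)
        elif node.parent_tag in BODY_PARENTS:
            body.append(node.text)  # rule 3
    return "\n".join(body)
```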

Starting from a statistical analysis of Chinese news web pages and combining the advantages of machine learning methods and the regular expression method, the present invention proposes a complete automated procedure for accurately extracting the four elements: news title, publication time, news source and body text. The present invention does not depend on any specific template and is highly general.

The present invention analyzes and extracts web pages in the order news title, publication time, news source, body text or in the order news title, news source, publication time, body text. Because the body text module generally contains all four required elements, the news title, publication time and news source are generally obtained in the course of extracting the body text. For some special web pages, however, if the news title and publication time have not been obtained after the body text is extracted, the following additional procedures are carried out:

I. Supplementary extraction procedure for the news title and publication time.

S61: if the news title was obtained during body text extraction, no further detection is needed; otherwise this procedure is carried out, because publication time detection can only be performed once the news title is available. There are two possible operations:

- if the news title has not been obtained but the web page title exists and the keyword dictionary contains many elements, the news title was missed because the similarity threshold was set too high. The threshold can then be lowered and the search repeated, the search range being the nodes before the body text start node.

- if the news title has not been obtained and the keyword dictionary contains few elements, the web page title is considered possibly unrelated to the body content. The body content already obtained can then be segmented into words and stripped of stop words to obtain a new keyword dictionary; the text nodes before the body text start node are traversed, and the text node with the most keyword dictionary hits is taken as the news title.

S62: if the news title and publication time have been found through step S61, no further detection is needed; otherwise there are the following possibilities:

- if the news title has not been obtained, the range of possible parent node tags for the news title can be widened. If the text of some node satisfies the condition of being contained in the web page title or being highly similar to it, that text is taken as the news title; otherwise the first sentence of the body text can be designated the news title. The publication time is handled similarly: the range of parent tags is widened, and the first time format match in the body text is designated the publication time; failing that, the last time format match before the body text is designated the publication time.

- if the news title has been obtained, only the time needs to be handled, using the method above.

II. Supplementary extraction procedure for the news source.

S71: if the news title and time have been obtained, further detection is performed regardless of whether a news source has already been extracted, to guard against interference from the words "source" and "author" appearing in the body text. The news source found earlier is saved as a fallback.

S72: the news source is always located after the news title node and may be located after the time node, but its text is generally shorter than a paragraph of the body text. The search can start after the news title node; the stop condition is that the current node is after the time node and the node's text length is greater than a certain threshold. If text in source format is found during this process, it is preferentially chosen as the news source.

S73: if no news source is found in step S72, the news source is designated to be the source saved in S71. If the source saved in step S71 is empty, the news source is designated to be the first source format match after the title, with no restriction on the parent node tag type.

S74: if the news source has still not been obtained by this step, a list of suspected sources can be output and the active learning mode entered.

- Interactive learning: the user can specify the real news source in the list of suspected sources, and the program stores the specified result in a back-end database. At regular intervals, all user-specified news sources can be read from the database and tested for significance; those that genuinely are sources are formally added to the media word list and applied in the extraction algorithm.

- Statistical analysis of suspected sources: without user participation, the list of suspected sources can be stored in the back-end database, with repeated words accumulated. At regular intervals the counts of suspected source words in the database are tallied, and each word is assigned a probability value according to its count to indicate how likely it is to be a media word. In practice, as the system runs, news media names will appear as sources many times, whereas the non-source words in the list of suspected sources will be widely dispersed. The higher a word's count, the more likely it is to represent a news source. The flow framework of active learning is shown in Fig. 7.
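The statistical analysis of suspected sources can be illustrated with a simple counter. The text does not say how the probability value is derived from the count, so this sketch assumes plain relative frequency; an in-memory Counter stands in for the back-end database, and source_probabilities and the sample words are hypothetical.

```python
from collections import Counter


def source_probabilities(suspect_source_lists):
    """Map each suspected source word to its relative frequency (an assumed
    stand-in for the probability of being a media word)."""
    counts = Counter(word for words in suspect_source_lists for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

Given the lists [["Example Daily", "today"], ["Example Daily", "editor"], ["Example Daily"], ["Other Times", "today"]], the word "Example Daily" receives the highest value, 3/7, reflecting the observation that genuine media names recur while non-source words are dispersed.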

The inventors also tested the accuracy of the method of the present invention:

Using Baidu RSS as the source of news web pages, the inventors crawled a total of 1721 non-duplicate news items in 11 categories from 429 websites as the test set. The test was run on a Toshiba M332 notebook computer with the 32-bit Win7 Ultimate operating system, an Intel(R) Core(TM)2 Duo CPU T6400 processor clocked at 2.00 GHz and 2.00 GB of memory. Detection was carried out in the order news title, publication time, news source, body text. The test results are shown in Table 1:

	Body text	News title	Publication time	News source
Accuracy (%)	96.11	98.43	98.2	97.39

Table 1

This shows that the present invention not only requires no manually written code templates but also analyzes web pages with very high accuracy. In addition, during the test on the 1721 web pages, the average time for the algorithm to analyze a single web page was 65 ms, which amounts to about 15 web pages processed per second, indicating high running efficiency.

The above discloses only several specific embodiments of the present invention, but the present invention is not limited thereto. Any variation that a person skilled in the art can conceive of, provided it does not exceed the scope of the appended claims, shall fall within the scope of protection of the present invention.

Claims

1. A method for automatically extracting the key elements of news web pages, characterized in that it comprises the following steps:

2. The method for automatically extracting the key elements of news web pages as claimed in claim 1, characterized in that, before step (1), the method also comprises: (10) preprocessing the web page source code to remove script code.

3. The method for automatically extracting the key elements of news web pages as claimed in claim 1, characterized in that step (1) also comprises: (11) performing word segmentation on the extracted web page title and web page meta information and removing stop words.

4. The method for automatically extracting the key elements of news web pages as claimed in claim 1, characterized in that step (2) also comprises: (21) filtering the text nodes, the filtered-out text nodes being excluded from the detection range.

5. The method for automatically extracting the key elements of news web pages as claimed in claim 4, characterized in that, in step (21), filtering the text nodes according to the parent node tag of each text node comprises:

(211) filtering out text nodes that have no parent node;

6. The method for automatically extracting the key elements of news web pages as claimed in claim 4, characterized in that, in step (21), filtering the text nodes according to text content comprises:

7. The method for automatically extracting the key elements of news web pages as claimed in claim 1, characterized in that, in step (2), detecting and extracting the news title, publication time and news source comprises:

8. The method for automatically extracting the key elements of news web pages as claimed in claim 1, characterized in that, in step (2), detecting and extracting the body text comprises:

(27) finding the lowest common parent node of the refined set;

9. The method for automatically extracting the key elements of news web pages as claimed in claim 8, characterized in that, after step (25), the method also comprises:

10. The method for automatically extracting the key elements of news web pages as claimed in claim 8, characterized in that step (28) comprises: