CN102750390A - Automatic news webpage element extracting method - Google Patents

Info

Publication number
CN102750390A
Authority
CN
China
Prior art keywords
node
web page
literal
text
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012102328312A
Other languages
Chinese (zh)
Other versions
CN102750390B (en)
Inventor
张长水
宋成儒
翁时锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Zhongqing Cyyun New Media Technology Co Ltd
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201210232831.2A
Publication of CN102750390A
Application granted
Publication of CN102750390B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for automatically extracting the elements of a news web page, comprising the following steps: (1) extracting the page title and page meta information from the web page source code and obtaining a keyword dictionary related to the page content; and (2) traversing the text nodes in the web page source code and, following the order headline, publication time, message source, body or the order headline, message source, publication time, body, using the keyword dictionary to detect and extract the news headline, publication time, message source, and news body. The method does not depend on any specific template and is highly general.

Description

Automatic news web page element extraction method
Technical field
The present invention relates to Internet information analysis technology, and in particular to a method for automatically extracting the elements of a news web page.
Background art
In recent years, with the widespread adoption of the Internet, people increasingly obtain useful information from online media. Online information is highly timely, and many major news events first spread on the network. Analyzing online information, and news in particular, can therefore help us follow social developments, detect local anomalies in time, and maintain social harmony and stability.
The volume of news on the Internet is enormous. Manual analysis cannot keep up with the rate at which news is updated and is prone to oversights, so the analysis is usually done by computer. Given a news web page, the first step in understanding and analyzing its information is to extract automatically its four news elements: headline, publication time, message source, and body, as shown in Fig. 1.
Existing news element extraction methods mostly focus only on the headline and the body. There are mainly the following three kinds of methods:
1. Regular expressions
A regular expression is a character string generated by specific syntax rules and used to describe or match text that conforms to a certain syntactic pattern. If news web pages are generated from the same template, the code pattern of the body region can be expressed as a regular expression; see Fig. 2, a schematic diagram of extracting news elements with regular expressions. This single expression can then be applied to every new input page to extract its content. The method is simple, convenient, and well targeted: once created, the expression can be run without limit.
However, the drawback of the regular expression method is that manually generalizing the code pattern of a web page is very complicated. A regular expression written for one template applies only to that template and is helpless against pages in other formats. Even for the original template, adding nesting to the body or modifying it slightly may cause content extraction to fail.
2. Wrappers
The regular expression method requires hand-written expressions that correspond one-to-one with web page templates. People later tried to find a way to derive automatically a unified model for pages built from multiple templates. In 1997, N. Kushmerick first proposed an algorithm named WIEN to realize this idea; the resulting model is called a wrapper. Here a wrapper denotes a procedure that, for a new information source, uses existing template data and empirical knowledge about web pages to perform induction and derivation automatically, in the manner of artificial intelligence. The derived result can then be applied to automatic information extraction from the new source.
Although the wrapper method partly remedies the inefficiency and narrow applicability of the regular expression method, it never departs from the essence of the earlier method: induction remains costly, and the dependence on templates is not fundamentally removed.
In summary, existing news element extraction methods rely too heavily on templates, have poor generality, and require complicated induction of template code.
Summary of the invention
The purpose of the present invention is to provide an automatic news web page element extraction method that solves the problems of existing news element extraction methods: excessive reliance on templates, poor generality, and complicated induction of template code.
The present invention proposes an automatic news web page element extraction method comprising the following steps:
(1) extract the page title and page meta information from the web page source code, and obtain a keyword dictionary related to the page content;
(2) traverse the text nodes in the web page source code and, following the order headline, publication time, message source, body or the order headline, message source, publication time, body, use the keyword dictionary to detect and extract the headline, publication time, message source, and body.
Further, before step (1) the method also comprises: (10) pre-processing the web page source code to remove script code.
Further, step (1) also comprises: (11) segmenting the extracted page title and page meta information into words and removing stop words.
Further, step (2) also comprises: (21) filtering the text nodes, so that the text nodes filtered out are excluded from detection.
Further, in step (21), filtering the text nodes according to the tag of their parent node comprises:
(211) filtering out text nodes that have no parent node;
(212) filtering out text nodes whose parent tag is not one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>;
(213) filtering out text nodes whose parent tag is <div> with its style set to "hidden";
(214) after the headline and publication time have been detected, filtering out text nodes whose parent tag is <heading>;
(215) filtering out text nodes whose parent tag is <span> or <div> and whose text length is less than the average body paragraph length.
Further, in step (21), filtering the text nodes according to their text content comprises:
(216) filtering out text nodes that contain copyright notice information;
(217) filtering out text nodes that contain the words "share" and/or "comment" and/or "microblog".
Further, in step (2), detecting and extracting the headline, publication time, and message source comprises:
(22) when the text of a text node is contained in the page title and its length is not less than one third of the length of the page title text, or when the text similarity between any text node and the page title is not less than a preset threshold, extracting the text content of that node as the headline, after which headline detection is no longer performed;
(23) matching the content of each text node against time formats, and extracting the content of the successfully matched text node as the publication time, after which publication time detection is no longer performed;
(24) when the content of a text node contains the word "source" or "author", extracting the content of that node as the message source, after which message source detection is no longer performed.
Further, in step (2), detecting and extracting the body comprises:
(25) setting up a high-hit set that stores the text nodes with a high number of keyword dictionary hits;
(26) purifying the high-hit set by clustering, and taking the longest continuous set of nodes as the purified set;
(27) finding the lowest common parent node of the purified set;
(28) traversing the document tree rooted at the lowest common parent node, and obtaining the body.
Further, after step (25) the method also comprises:
(251) setting up a suspect set that stores the text nodes with too few keyword dictionary hits, or with a text length greater than a preset value;
(252) comparing the information content of the high-hit set and the suspect set;
(253) if the information content of the high-hit set is less than that of the suspect set, lowering the hit threshold for admission to the high-hit set, traversing the text nodes in the web page source code again, and rebuilding the high-hit set;
(254) if the information content of the high-hit set is not less than that of the suspect set, proceeding to step (26).
Further, step (28) comprises:
(281) if a text node is the same node as the headline, publication time, or message source, taking that node as the start of the body;
(282) if the parent tag of a text node is a link type and none of the nodes above it is a list type, extracting the content of the text node and adding it to the body;
(283) if the parent tag of a text node is one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>, extracting the content of the text node and adding it to the body.
Compared with the prior art, the beneficial effect of the present invention is as follows. Starting from a statistical analysis of Chinese news web pages and combining the advantages of machine learning methods and the regular expression method, the invention proposes a complete automatic procedure for accurately extracting the four elements: headline, publication time, message source, and body. The invention does not depend on any specific template and is highly general.
Brief description of the drawings
Fig. 1 is a schematic diagram of the four elements of a news web page;
Fig. 2 is a schematic diagram of extracting news elements with the regular expression method;
Fig. 3 is a flow chart of an automatic news web page element extraction method according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the bag-of-words model formed from the page title and page meta information of Fig. 1;
Fig. 5 is a more detailed flow chart of another automatic news web page element extraction method according to an embodiment of the invention;
Fig. 6 is a schematic diagram of a structural feature of news web pages;
Fig. 7 is a framework diagram of the flow by which the active learning method of the invention learns unknown sources.
Detailed description
The present invention is described in detail below with reference to the accompanying drawings.
Referring to Fig. 3, a flow chart of an automatic news web page element extraction method according to an embodiment of the invention, the method comprises the following steps:
S31: extract the page title and page meta information from the web page source code, and obtain a keyword dictionary related to the page content.
The page title is a high-level summary of a web page; when a page is browsed, the information shown in the title bar at the top of the browser is the "page title". In the web page source code (HTML code), the page title is located between the <head></head> tags and has the form <title>Network marketing teaching website</title>, where "Network marketing teaching website" is the page title.
Page meta information is contained in <meta> tags. It provides information about the document in the form of key-value pairs and serves mainly as an indexing reference for search engines. Within the meta information, description is a description of the page content and keywords lists the keywords of the page content; these two fields give a good indication of what the news is about.
After the page title and page meta information have been extracted, the present invention preferably uses a bag-of-words model to extract the keywords related to the page content and form the keyword dictionary. The bag-of-words model is a concept from text mining: it ignores word order and modifier relationships and treats a text fragment simply as a set of words. Taking the news page of Fig. 1 as an example, the bag-of-words model formed from its page title and page meta information is shown in Fig. 4. On the basis of the bag-of-words model, the invention can further represent a text fragment as a vector and then compute the content similarity between text fragments by operations such as vector distance and inner product. If only the presence or absence of each word is considered and vector distance is used as the similarity measure, the similarity computation reduces to counting the entries shared by two bags of words. Besides this discrete vector distance, text similarity can of course also be computed with cosine distance, Euclidean distance, city-block distance, and so on.
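The keyword dictionary and the simplified similarity measure described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the sample title, and the sample meta fields are hypothetical, and the tokenizer assumes words separated by spaces or punctuation, whereas Chinese text would first need a word segmenter.

```python
import re

def tokenize(text):
    # Assumption: words are delimited by spaces or punctuation. Chinese text
    # would first need a word segmenter, which is outside this sketch.
    return set(re.findall(r"\w+", text.lower()))

def build_keyword_dictionary(title, meta):
    """Pool the words of <title> and the description/keywords meta fields."""
    parts = [title, meta.get("description", ""), meta.get("keywords", "")]
    return tokenize(" ".join(parts))

def shared_entries(bag_a, bag_b):
    """Simplified similarity: the number of entries common to two bags."""
    return len(bag_a & bag_b)

# Hypothetical sample data.
keywords = build_keyword_dictionary(
    "Flood warning issued for river towns",
    {"description": "Officials issued a flood warning",
     "keywords": "flood, warning, river"},
)
paragraph = tokenize("The flood warning covers three towns")
print(shared_entries(keywords, paragraph))
```

Because the bags are plain sets, word order plays no part in the result, which matches the bag-of-words assumption in the text.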
S32: traverse the text nodes in the web page source code and, following the order headline, publication time, message source, body or the order headline, message source, publication time, body, use the keyword dictionary to detect and extract the headline, publication time, message source, and body.
A text node in the present invention is a node in the Document Object Model. The Document Object Model (DOM) is an application programming interface that can be used to access documents such as HTML and XML dynamically. The invention mainly uses the HTML DOM, which represents a document as a tree structure and defines methods for accessing and manipulating the elements of the document.
To aid understanding of the technical solution, the invention is described below through a more detailed embodiment. Referring to Fig. 5, a more detailed flow chart of another automatic news web page element extraction method according to an embodiment of the invention, the method comprises the following steps:
S501: pre-process the web page source code to remove script (JS) code, so that the dynamically loaded content it contains does not interfere with locating the body.
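A minimal sketch of the script-removal step in S501 is given below. It uses a regular expression for brevity, which is an assumption of this example; the source does not say how scripts are removed, and a real HTML parser would handle malformed markup more robustly. The name `strip_scripts` and the sample page are hypothetical.

```python
import re

# Matches a <script ...> element together with its content, case-insensitively
# and across line breaks.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)

def strip_scripts(html):
    """Return the page source with all <script> elements removed."""
    return SCRIPT_RE.sub("", html)

# Hypothetical sample page.
page = ("<div>Body text</div>"
        "<script type='text/javascript'>document.write('ad');</script>"
        "<p>More</p>")
cleaned = strip_scripts(page)
print(cleaned)
```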
S502: extract the page title and page meta information from the web page source code, and obtain a keyword dictionary related to the page content.
After the page title and page meta information have been extracted, the present invention preferably uses a bag-of-words model to extract the keywords related to the page content and form the keyword dictionary. The bag-of-words model is shown in Fig. 4.
Not every word in the bag is needed, however. Some words appear very frequently in news web pages but contribute little to expressing the news content, such as "at present", "so", and "it is reported"; the invention refers to them as stop words. After the keyword dictionary has been initially formed, the stop words that might interfere with the judgment of text similarity can therefore be removed to make the computation more concise.
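Stop-word removal can be sketched as a simple set filter. The stop-word list below contains only the three examples quoted in the text, rendered in English; a working list would be much longer, and the function name is hypothetical.

```python
# Hypothetical stop-word list, limited to the examples given in the text.
STOP_WORDS = {"at present", "so", "it is reported"}

def remove_stop_words(keywords, stop_words=STOP_WORDS):
    """Drop stop words from an already-built keyword dictionary."""
    return {word for word in keywords if word not in stop_words}

print(sorted(remove_stop_words({"flood", "so", "river"})))
```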
S503: set up the high-hit set and the suspect set. During traversal, the high-hit set stores the text nodes with many keyword dictionary hits (meaning high content similarity), and the suspect set stores the text nodes with too few hits but sufficiently long text. The purpose is to uncover suspected body text and thereby determine the extent of the body. After this, the structure of the web page is parsed and traversal of its text nodes begins.
S504: filter the text nodes during traversal, so that the text nodes filtered out are excluded from detection. The present invention preferably filters text nodes with two kinds of rules:
1. Filtering text nodes according to the tag of their parent node, comprising:
  • filtering out text nodes that have no parent node;
  • filtering out text nodes whose parent tag is not one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>;
  • filtering out text nodes whose parent tag is <div> with its style set to "hidden";
  • after the headline and publication time have been detected, filtering out text nodes whose parent tag is <heading>;
  • filtering out text nodes whose parent tag is <span> or <div> and whose text length is less than the average body paragraph length. The average body paragraph length is an empirical value obtained from statistics over a large number of labelled <span> or <div> samples; text nodes of this kind are regarded as navigation information and are not examined.
2. Filtering text nodes according to their text content, comprising:
  • filtering out text nodes that contain copyright notice information;
  • filtering out text nodes that contain the words "share" and/or "comment" and/or "microblog".
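The two groups of filtering rules above can be sketched as a single predicate. Several points are assumptions of this example rather than statements of the source: the `TextNode` record is a hypothetical stand-in for a DOM text node, the tag names in the text are mapped to the HTML tags p, td, and h1 to h6, the average paragraph length is a placeholder value, and the copyright and noise-word checks use English strings.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping of the tag names in the text onto concrete HTML tags.
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
ALLOWED_PARENTS = {"div", "p", "td", "span"} | HEADINGS
NOISE_WORDS = ("share", "comment", "microblog")
AVG_PARAGRAPH_LEN = 20  # placeholder for the empirical average length

@dataclass
class TextNode:
    text: str
    parent_tag: Optional[str] = None  # None means "no parent node"
    parent_hidden: bool = False       # parent style is set to hidden

def is_filtered(node, headline_and_time_found=False):
    """Return True if the node should be excluded from detection."""
    tag = node.parent_tag
    if tag is None or tag not in ALLOWED_PARENTS:
        return True
    if tag == "div" and node.parent_hidden:
        return True
    if headline_and_time_found and tag in HEADINGS:
        return True
    if tag in ("span", "div") and len(node.text) < AVG_PARAGRAPH_LEN:
        return True  # short span/div text is treated as navigation
    lowered = node.text.lower()
    if "copyright" in lowered:
        return True
    return any(word in lowered for word in NOISE_WORDS)

print(is_filtered(TextNode("Home", "span")))
print(is_filtered(TextNode("A long enough paragraph of body text here.", "p")))
```

Keeping every rule in one function makes the order of checks explicit; the rule about headings depends on detection state, so that state is passed in as a flag.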
S505: news headline detection.
When the text of a text node is contained in the page title and its length is not less than one third of the length of the page title text, or when the text similarity between any text node and the page title is not less than a preset threshold, the text content of that node is extracted as the headline, and headline detection is no longer performed afterwards. No body text can appear before the headline, so once the headline has been detected, the text nodes that precede the headline node can be removed from the high-hit set and the suspect set to streamline later computation.
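The headline rule can be sketched as below. The source does not specify the similarity measure or the threshold, so this example assumes the ratio from Python's `difflib.SequenceMatcher` and a threshold of 0.8; the function name and sample strings are hypothetical.

```python
from difflib import SequenceMatcher

def is_headline(node_text, page_title, threshold=0.8):
    """Apply the containment rule or the similarity rule to one text node."""
    node_text = node_text.strip()
    if not node_text:
        return False
    # Rule 1: contained in the page title and at least one third of its length.
    contained = node_text in page_title and len(node_text) * 3 >= len(page_title)
    # Rule 2: similarity to the page title reaches the (assumed) threshold.
    similar = SequenceMatcher(None, node_text, page_title).ratio() >= threshold
    return contained or similar

title = "Flood warning issued for river towns - Example News"
print(is_headline("Flood warning issued for river towns", title))
print(is_headline("News", title))
```

The one-third length condition is what keeps a short fragment such as a site name from being accepted merely because it occurs in the page title.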
S506: publication time detection.
Once the headline has been found, the content of each text node is matched against time formats, the content of the successfully matched text node is extracted as the publication time, and publication time detection is no longer performed afterwards. A time format here generally consists of two parts: digits and the connectors between them. The year may have 4 digits, such as "2012", or 2 digits, such as "12". The month and the day have at most 2 digits; a single-digit value may be zero-padded, as in "02" for the month and "03" for the day, or left unpadded, as in "2" and "3". The connectors are mainly the hyphen, the period, the space, date words (year, month, day), and the forward slash. No body text can appear before the publication time, so once the publication time has been detected, the text nodes that precede the publication time node can be removed from the high-hit set and the suspect set to streamline later computation.
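One way to express the time format described above is a single regular expression. The pattern below is an illustrative assumption, not the pattern used by the inventors: it accepts a 4-digit or 2-digit year, 1-digit or 2-digit month and day, and the connectors hyphen, period, space, slash, and the Chinese date characters 年, 月, 日.

```python
import re

# Year (4 or 2 digits), month and day (1 or 2 digits), joined by a hyphen,
# period, space, slash, or the date characters for year/month/day.
TIME_RE = re.compile(
    r"(?<!\d)(\d{4}|\d{2})\s*[-./年 ]\s*(\d{1,2})\s*[-./月 ]\s*(\d{1,2})(?!\d)\s*日?"
)

def find_publication_time(text):
    """Return the first substring that looks like a date, or None."""
    match = TIME_RE.search(text)
    return match.group(0).strip() if match else None

print(find_publication_time("2012-07-06 10:30 Source: Xinhua"))
print(find_publication_time("12年2月3日"))
```

A pattern this permissive will also match unrelated digit groups separated by spaces, so in practice it would be applied only to nodes that survive filtering and follow the headline.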
S507: message source detection.
The publication time and the message source of a news web page are very likely to be in the same piece of text, so after the publication time has been found, that text is also matched against the source format. When the content of a text node contains the word "source" or "author", the content of that node is extracted as the message source, and message source detection is no longer performed afterwards.
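Source-format matching can be sketched as a search for a marker word followed by a colon. Two details are assumptions of this example: the marker words are given in English, and the function returns only the name after the marker, whereas the text above extracts the content of the whole node.

```python
import re

# Marker word, optional spaces, an ASCII or full-width colon, then the name.
SOURCE_RE = re.compile(r"(?:source|author)\s*[:：]\s*(\S+)", re.IGNORECASE)

def find_message_source(text):
    """Return the token following a source/author marker, or None."""
    match = SOURCE_RE.search(text)
    return match.group(1) if match else None

print(find_message_source("2012-07-06 10:30 Source: Xinhua"))
print(find_message_source("The author declined to comment."))
```

Requiring the colon is one simple guard against ordinary sentences in the body that merely mention the word "author" or "source".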
S508: count the keyword dictionary hits of each text node.
During traversal, each text node that meets the detection requirements is examined. If the content of the text node hits the keyword dictionary 2 or more times, the node is added to the high-hit set. If the content hits the keyword dictionary exactly once, the node is added to the suspect set. If the content does not hit the keyword dictionary at all but the text length of the node is greater than 20 (20 characters being roughly half a line as normally displayed on a typical Chinese news web page), the node is considered very likely to belong to the body and is also added to the suspect set.
The hit thresholds for admission to the high-hit set and the suspect set, and the text length for admission to the suspect set, can all be set according to actual needs.
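The assignment of nodes to the two sets in S508 can be sketched as follows, using the default values from the text (a hit threshold of 2 and a minimum length of 20). Counting a hit as a substring match of each keyword is an assumption of this example, as are the function name and sample keywords.

```python
def classify_node(text, keywords, hit_threshold=2, min_length=20):
    """Return "high_hit", "suspect", or None for one text node."""
    # Assumption: a keyword counts as one hit if it occurs in the text.
    hits = sum(1 for keyword in keywords if keyword in text)
    if hits >= hit_threshold:
        return "high_hit"
    if hits >= 1 or len(text) > min_length:
        return "suspect"
    return None

keywords = {"flood", "warning", "river"}
print(classify_node("The flood warning remains in force", keywords))
print(classify_node("The river rose overnight", keywords))
print(classify_node("Home", keywords))
```

Passing `hit_threshold=1` reproduces the relaxed second traversal described in S509, where a single hit is enough for the high-hit set.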
S509: assess the information content of the high-hit set and the suspect set.
The assessment of the high-hit set's information content relies mainly on content, while the assessment of the suspect set's information content relies mainly on structural factors. The number of text nodes in the high-hit set is denoted N1. The LDR (Length-Distance Ratio) value of each node in the suspect set is computed, and the number of text nodes whose LDR value exceeds a certain threshold is denoted N2. The number of text nodes in the suspect set that have keyword hits is denoted N3. The comparison between the information content of the high-hit set and that of the suspect set is obtained from the relative magnitudes of these three numbers.
If the high-hit set carries more information than the suspect set, proceed to step S510. Otherwise, lower the hit threshold for admission to the high-hit set (for example, from 2 hits to 1 hit), traverse again, and rebuild the high-hit set.
If the suspect set is still found to carry more information after this traversal, the page title (<title>) and page meta information (<meta>) very probably provide insufficient information; in that case N2 text nodes can be taken directly from the suspect set to form a new high-hit set.
If N2 is 0, the body text is very probably short or scattered, which makes the LDR values very small. The body can then be extracted directly, as follows:
  • if the suspect set contains very few text nodes, the longest text can be taken directly as the body;
  • the first node of the suspect set is treated as a suspected headline; starting from the second node, an interval threshold is set, the consecutive text nodes that satisfy the interval threshold are sought, and their combined content is taken as the body.
The LDR (Length-Distance Ratio) value mentioned here is a structural feature of news web pages. It measures how tightly a piece of text is connected to its context and helps to distinguish body text from non-body text. Text in a web page has a certain length, and adjacent text nodes are separated by a certain distance; the ratio of length to distance measures how tightly texts are connected. As shown in Fig. 6, L is the text length and D is the distance between text nodes; the average of the ratios before and after a node can be regarded as a measure of how tightly that node is connected to its context.
The LDR value is computed as follows:
\[ \mathrm{LDR}(i) = \frac{1}{2}\left( \frac{L(i-1)}{D(i-1,\,i)} + \frac{L(i)}{D(i,\,i+1)} \right) \]
The LDR value is necessarily less than 1. The closer it is to 1, the more tightly the text is connected to its context and the more likely the text is to be true body text.
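The LDR computation can be sketched as below. The source defines D only as the distance between text nodes; this example assumes D(i, i+1) is the difference between the start offsets of nodes i and i+1 in the page source, which is consistent with the statement that the value stays below 1 but is not confirmed by the text. The function name and sample numbers are hypothetical.

```python
def ldr(lengths, starts, i):
    """LDR(i) = 0.5 * (L(i-1) / D(i-1, i) + L(i) / D(i, i+1)).

    lengths[j] is the text length of node j and starts[j] its start offset.
    Assumption: D(j, j+1) = starts[j+1] - starts[j]. The first and last
    nodes have no neighbour on one side and are not handled here.
    """
    d_prev = starts[i] - starts[i - 1]
    d_next = starts[i + 1] - starts[i]
    return 0.5 * (lengths[i - 1] / d_prev + lengths[i] / d_next)

# Three hypothetical nodes: long texts packed closely together.
value = ldr(lengths=[90, 80, 100], starts=[0, 100, 200], i=1)
print(round(value, 2))
```

With these numbers the two ratios are 90/100 and 80/100, so the average is close to 1, as expected for tightly packed body paragraphs; widely spaced short texts such as navigation links would give a much smaller value.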
S510: build the purified set.
Extract the start position in the web page code of each text node in the high-hit set, and cluster the high-hit set with a clustering method. Clustering means dividing the high-hit set into different classes according to the similarity of certain features of the text nodes, so that elements within a class are highly similar and different classes differ greatly. Since these text nodes may belong to any of three parts (before the body, in the body, or after the body), the initial number of classes is preferably set to 3. The clustering result is then analyzed, and the longest continuous set of nodes is taken as the purification of the high-hit set, called the purified set.
The clustering method of the present invention is preferably K-means. K-means clustering first fixes the number k of classes and chooses k initial class centers. Each object is assigned to one of the classes according to its distance from the k centers, and the k class centers are then updated. This is iterated until the k centers are essentially stable, which yields the k-class clustering result.
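A one-dimensional K-means over node start positions, with k = 3 as suggested above, can be sketched as follows. Two choices are assumptions of this example: the initial centers are spread evenly over the sorted positions, and the "longest continuous set" is read as the cluster with the most members. The sample offsets are hypothetical.

```python
def kmeans_1d(points, k=3, iterations=50):
    """Cluster numbers into k groups with plain K-means; returns the groups."""
    pts = sorted(points)
    # Assumed initialisation: spread the centers evenly over the sorted points.
    centers = [pts[(len(pts) - 1) * j // (k - 1)] for j in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update each center to the mean of its cluster (keep it if empty).
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:  # centers are stable, so stop
            break
        centers = new_centers
    return clusters

# Hypothetical start offsets of high-hit nodes: one before the body,
# five inside it, one after it.
positions = [120, 5000, 5200, 5400, 5600, 5800, 9900]
clusters = kmeans_1d(positions, k=3)
purified = max(clusters, key=len)
print(purified)
```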
S511: find the lowest common parent node of the purified set.
Enumerate the ancestors of each text node in the purified set (that is, the nodes above it in the DOM tree), and accumulate a count for every ancestor that recurs. Among the nodes with the largest count, the one positioned last is taken as the body start node; this node is the lowest common parent node of the elements of the purified set. If the headline node has been obtained and the extracted body start node is positioned before the headline node, the purified set is considered to be mixed with content from outside the body. In that case, among the nodes with the second-largest count, the one positioned last is taken as the corrected body start node, and its position is recorded for later use.
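The ancestor-counting procedure can be sketched as below. Each text node is represented by the list of its ancestors from the root downwards; this representation, the node identifiers, and the use of depth to decide which node is "positioned last" are assumptions of this example.

```python
from collections import Counter

def lowest_common_parent(paths):
    """paths[j] lists the ancestors of text node j, root first.

    Every ancestor is counted once per node it contains; among the
    ancestors with the highest count, the deepest one is returned.
    """
    counts = Counter(ancestor for path in paths for ancestor in path)
    depth = {ancestor: d for path in paths for d, ancestor in enumerate(path)}
    top = max(counts.values())
    candidates = [a for a, c in counts.items() if c == top]
    return max(candidates, key=lambda a: depth[a])

# Hypothetical ancestor paths of three text nodes in the purified set.
paths = [
    ["html", "body", "div#main", "div#article", "p1"],
    ["html", "body", "div#main", "div#article", "p2"],
    ["html", "body", "div#main", "div#article", "div#quote", "p3"],
]
print(lowest_common_parent(paths))
```

The fallback in the text, taking the node with the second-largest count when the result lies before the headline, would amount to repeating the selection with the next-highest count value.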
S512: traverse the document tree rooted at the lowest common parent node, and obtain the body.
When obtaining the body, the text nodes in the document tree rooted at the lowest common parent node are processed as follows:
1) if the node is the same as the headline, time, or source node already found, it is not extracted; instead it is taken as the start of the true body, and any body text collected so far is cleared;
2) if the parent tag of the node is a link type, detection continues upwards; if none of the nodes above is a list type, the possibility of navigation can be ruled out, and the node content is extracted and added to the body extracted so far;
3) if the parent tag of the node is one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>, the node content is extracted and added to the body extracted so far.
Starting from a statistical analysis of Chinese news web pages and combining the advantages of machine learning methods and the regular expression method, the present invention proposes a complete automatic procedure for accurately extracting the four elements: headline, publication time, message source, and body. The invention does not depend on any specific template and is highly general.
The invention analyzes and extracts from a web page in the order headline, publication time, message source, body or the order headline, message source, publication time, body. Because the body module generally contains all four of the elements sought, the headline, publication time, and message source have normally already been obtained during the extraction of the body. For some special web pages, however, if the headline and publication time have not been obtained after the body has been extracted, the following additional flows are carried out:
I. Additional extraction flow for the headline and publication time.
S61: if the headline was obtained during body extraction, no further detection is needed; otherwise this flow is carried out, because publication time detection can be performed only once a headline exists. There are two possible operations:
  • if the headline has not been obtained, but the page title exists and the keyword dictionary contains many entries, the headline was not found because the similarity threshold was set too high. The threshold can then be lowered and the search repeated by traversing again, with the search range limited to the nodes before the body start node.
  • if the headline has not been obtained and the keyword dictionary contains few entries, the page title is considered possibly unrelated to the body content. The body content already obtained can then be segmented into words and stripped of stop words to obtain a new keyword dictionary; the text nodes before the body start node are traversed, and the text node with the most keyword dictionary hits is taken as the headline.
S62: if the headline and publication time are found through step S61, no further detection is needed; otherwise there are the following possibilities:
  • if the headline has not been obtained, the range of permitted parent tags for the headline can be widened. If the text of some node is contained in the page title or is highly similar to the page title, that node is taken as the headline; otherwise the first sentence of the body can be designated as the headline. The publication time is handled similarly: the range of parent tags is widened, and the first time-format match within the body is designated as the publication time; failing that, the last time-format match before the body is designated as the publication time.
  • if the headline has been obtained, the time is simply handled by the method above.
Two, the additional extractions algorithm flow of informed source.
If S71 headline and time obtain, no matter whether informed source extracts at this moment, all to further detect so, prevent that " source ", " author " word of comprising in the text from producing interference, the informed source that finds is before this preserved subsequent use.
After S72, informed source one are positioned the headline node, possibly be positioned at timing node after, but the word length of informed source is generally less than the paragraph in the text.Can begin search behind the headline node, stop condition be present node behind timing node and the node word length greater than certain threshold value.If in this process, find the literal of source format then preferentially be chosen as informed source.
If do not find informed source among the S73 step S72; The specify message source is the source of preserving among the S71 so; If the source of preserving among the step S72 is for empty, our specify message source is title first source format occurrence afterwards, does not limit the father node tag types.
If S74 does not obtain the text source yet to this step, can export the tabulation of doubtful source, get into the initiatively pattern of study.
- Interactive learning: the user can designate the real news source in the doubtful-source list, and the program stores this designation in a back-end database. At regular intervals, all user-designated news sources can be read from the database and checked with a marginal test; those that really are sources are formally added to the media-word list and applied in the extraction algorithm.
- Statistical analysis of doubtful sources: when no user participates, the doubtful-source list can be stored in the back-end database, with repeated words accumulated. At regular intervals the counts of the doubtful-source words in the database are tallied, and each word is assigned a probability value, based on its count, expressing how likely it is to be a media word. In practice, as the system runs, the names of news media will appear as sources many times, whereas the non-source words in the doubtful-source list will be widely scattered. The higher a word's count, the more likely it is to denote a news source. The flow of active learning is shown in Figure 7.
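The statistical analysis of doubtful sources can be illustrated with a small counting sketch. The text only says that a probability value is assigned according to the count; the relative-frequency scoring, the promotion thresholds, and all function names below are assumptions made for illustration, and an in-memory `Counter` stands in for the back-end database.

```python
from collections import Counter


def record_doubtful_sources(counts, doubtful_sources):
    """Accumulate one page's doubtful-source list; repeated words add up."""
    counts.update(word.strip() for word in doubtful_sources if word.strip())


def media_word_probabilities(counts):
    """Score each word by its share of all recorded occurrences (assumed scoring)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def promote_media_words(counts, min_probability=0.5, min_count=3):
    """Return words frequent enough to be treated as media words (assumed thresholds)."""
    probabilities = media_word_probabilities(counts)
    return sorted(
        word
        for word, probability in probabilities.items()
        if probability >= min_probability and counts[word] >= min_count
    )
```

Because genuine media names recur across many pages while stray words are scattered, a periodic call to `promote_media_words` would tend to surface the former; the thresholds would need tuning on real data.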
The inventors also tested the accuracy of the method of the present invention:
Using Baidu RSS as the source of news web pages, the inventors crawled 1721 non-duplicate news items in 11 categories from 429 websites as the test set. The test was run on a Toshiba M332 notebook computer with a 32-bit Win7 Ultimate operating system, an Intel(R) Core(TM)2 Duo CPU T6400 processor at 2.00 GHz, and 2.00 GB of memory. The test proceeded in the order headline, publication time, news source, body text. The results are shown in Table 1:
                Body text    Headline    Publication time    News source
Accuracy (%)    96.11        98.43       98.2                97.39
Table 1
This shows that the present invention not only needs no manually written code templates but also parses web pages with high accuracy. In addition, in the test on 1721 web pages, the algorithm took 65 ms on average to parse a single web page, which amounts to about 15 web pages per second (1000 ms / 65 ms ≈ 15.4), a relatively high running efficiency.
The above discloses only several specific embodiments of the present invention, but the present invention is not limited to them. Any variation that a person skilled in the art can conceive of, so long as it does not go beyond the scope of the appended claims, shall fall within the protection scope of the present invention.

Claims (10)

1. An automatic news web page element extraction method, characterized in that it comprises the following steps:
(1) extracting the web page title and the web page meta-information from the web page source code, and obtaining a keyword dictionary related to the web page content;
(2) traversing the text nodes in the web page source code, and, using said keyword dictionary, detecting and extracting the headline, publication time, news source and body text in the order headline, publication time, news source, body text or in the order headline, news source, publication time, body text.
2. The automatic news web page element extraction method of claim 1, characterized in that, before step (1), it further comprises: (10) preprocessing the web page source code to remove script code.
3. The automatic news web page element extraction method of claim 1, characterized in that step (1) further comprises: (11) performing word segmentation on the extracted web page title and web page meta-information and removing stop words.
4. The automatic news web page element extraction method of claim 1, characterized in that step (2) further comprises: (21) filtering the text nodes, and excluding the filtered-out text nodes from the detection range.
5. The automatic news web page element extraction method of claim 4, characterized in that, in step (21), the text nodes are filtered according to the parent-node tag of each text node, comprising:
(211) filtering out text nodes that have no parent node;
(212) filtering out text nodes whose parent-node tag is not one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>;
(213) filtering out text nodes whose parent-node tag is <div> with its style set to "hidden";
(214) after the headline and the publication time have been detected, filtering out text nodes whose parent-node tag is <heading>;
(215) filtering out text nodes whose parent-node tag is <span> or <div> and whose text length is less than the average length of the body paragraphs.
6. The automatic news web page element extraction method of claim 4, characterized in that, in step (21), the text nodes are filtered according to their text content, comprising:
(216) filtering out text nodes that contain copyright notice information;
(217) filtering out text nodes that contain the words "share" and/or "comment" and/or "microblog".
7. The automatic news web page element extraction method of claim 1, characterized in that, in step (2), detecting and extracting the headline, publication time and news source comprises:
(22) when the text of a text node is contained in the web page title and the text length of that text node is not less than one third of the text length of the web page title, or when the text similarity between the text of any text node and the web page title is not less than a preset threshold, extracting the text content of that text node as the headline, and thereafter no longer performing headline detection;
(23) matching the content of a text node against time formats, extracting the content of a successfully matched text node as the publication time, and thereafter no longer performing publication-time detection;
(24) when the content of a text node contains the word "source" or "author", extracting the content of that text node as the news source, and thereafter no longer performing news-source detection.
8. The automatic news web page element extraction method of claim 1, characterized in that, in step (2), detecting and extracting the body text comprises:
(25) establishing a high-hit set, which stores text nodes with a high number of keyword-dictionary hits;
(26) purifying the high-hit set by clustering, and taking the longest set of consecutive nodes as the purified set;
(27) finding the lowest common parent node of the nodes in the purified set;
(28) traversing the document tree rooted at that lowest common parent node, and obtaining the body text.
9. The automatic news web page element extraction method of claim 8, characterized in that, after step (25), it further comprises:
(251) establishing a doubtful set, which stores text nodes whose keyword-dictionary hit count is insufficient or whose text length is greater than a preset value;
(252) comparing the information content of the high-hit set with that of the doubtful set;
(253) if the information content of the high-hit set is less than that of the doubtful set, lowering the hit-count threshold for admission to the high-hit set, traversing the text nodes in the web page source code again, and re-establishing the high-hit set;
(254) if the information content of the high-hit set is not less than that of the doubtful set, proceeding to step (26).
10. The automatic news web page element extraction method of claim 8, characterized in that step (28) comprises:
(281) if a text node is the same as the node of the headline, publication time or news source, taking that text node as the start of the body text;
(282) if the parent-node tag of a text node is of link type and none of its ancestor nodes is of list type, extracting the content of that text node and adding it to the body text;
(283) if the parent-node tag of a text node is one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>, extracting the content of that text node and adding it to the body text.
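The body-text steps recited in claim 8 (high-hit set, purification, lowest common parent node, subtree traversal) can be illustrated with the following sketch. It is a simplified illustration under stated assumptions, not the claimed implementation: the `Node` class is a hypothetical stand-in for a parsed DOM, "clustering" is approximated by grouping high-hit nodes whose document-order positions differ by at most `max_gap`, and the thresholds are arbitrary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical minimal tree node; real code would use a DOM parser."""
    tag: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def text_nodes(root: Node) -> List[Node]:
    """Nodes carrying text, in document (depth-first) order."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.text.strip():
            found.append(node)
        stack.extend(reversed(node.children))
    return found


def keyword_hits(text: str, keywords: List[str]) -> int:
    """Number of dictionary keywords that occur in the text."""
    return sum(1 for keyword in keywords if keyword in text)


def longest_run(indices: List[int], max_gap: int) -> List[int]:
    """Longest group of positions whose neighbours differ by at most max_gap."""
    runs, current = [], [indices[0]]
    for index in indices[1:]:
        if index - current[-1] <= max_gap:
            current.append(index)
        else:
            runs.append(current)
            current = [index]
    runs.append(current)
    return max(runs, key=len)


def lowest_common_parent(nodes: List[Node]) -> Node:
    """Deepest node that lies on every node's root-to-node path."""
    def path_from_root(node: Node) -> List[Node]:
        path = []
        while node is not None:
            path.append(node)
            node = node.parent
        return path[::-1]

    common = None
    for level in zip(*(path_from_root(node) for node in nodes)):
        if all(entry is level[0] for entry in level):
            common = level[0]
        else:
            break
    return common


def extract_body(root: Node, keywords: List[str], min_hits: int = 1, max_gap: int = 2) -> str:
    nodes = text_nodes(root)
    # Step (25): positions of text nodes with enough keyword hits.
    high_hit = [i for i, n in enumerate(nodes) if keyword_hits(n.text, keywords) >= min_hits]
    if not high_hit:
        return ""
    # Step (26): keep only the longest nearly-contiguous group.
    purified = [nodes[i] for i in longest_run(high_hit, max_gap)]
    # Steps (27) and (28): collect all text under the lowest common parent.
    container = lowest_common_parent(purified)
    return "\n".join(node.text.strip() for node in text_nodes(container))
```

The sketch collects every text node under the common parent, so a paragraph without any keyword hit is still kept when its neighbours qualify; the per-node conditions of claim 10 are not modelled here.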
CN201210232831.2A 2012-07-05 2012-07-05 Automatic news webpage element extracting method Active CN102750390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210232831.2A CN102750390B (en) 2012-07-05 2012-07-05 Automatic news webpage element extracting method

Publications (2)

Publication Number Publication Date
CN102750390A true CN102750390A (en) 2012-10-24
CN102750390B CN102750390B (en) 2014-07-23

Family

ID=47030575

Country Status (1)

Country Link
CN (1) CN102750390B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5694594A (en) * 1994-11-14 1997-12-02 Chang; Daniel System for linking hypermedia data objects in accordance with associations of source and destination data objects and similarity threshold without using keywords or link-difining terms
CN101470728A (en) * 2007-12-25 2009-07-01 北京大学 Method and device for automatically abstracting text of Chinese news web page

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DOU SHEN 等: "Web-page classification through summarization", 《SIGIR "04 PROCEEDINGS OF THE 27TH ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NINGBO ZHONGQING HUAYUN NEW MEDIA TECHNOLOGY CO.,

Free format text: FORMER OWNER: WENG SHIFENG

Effective date: 20141210

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 315192 NINGBO, ZHEJIANG PROVINCE TO: 315100 NINGBO, ZHEJIANG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20141210

Address after: 315100, 8 floor, Di Yi Building, 666 Taikang Road, Ningbo, Zhejiang, Yinzhou District

Patentee after: NINGBO ZHONGQING CYYUN NEW MEDIA TECHNOLOGY CO., LTD.

Address before: 315192 room 298, science and technology center, 514 bachelor Road, Yinzhou District, Zhejiang, Ningbo

Patentee before: Weng Shifeng