CN101702160B - Method for acquiring internet subject information and device thereof - Google Patents

Method for acquiring internet subject information and device thereof

Info

Publication number
CN101702160B
Authority
CN
China
Prior art keywords
character string
character
subject information
html
html tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 200910110356
Other languages
Chinese (zh)
Other versions
CN101702160A (en)
Inventor
黎柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Longguan Media Co., Ltd.
Original Assignee
SHENZHEN LONGGUAN MEDIA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN LONGGUAN MEDIA CO Ltd
Priority to CN 200910110356
Publication of CN101702160A
Application granted
Publication of CN101702160B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a device for acquiring internet subject information. The method comprises: acquiring the HyperText Markup Language (HTML) source code of an internet web page; dividing the HTML source code into different character strings, using the div tag as the marker tag, and forming the character strings into a string list; and analyzing each character string in the string list one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and is also greater than a preset base value, taking the content of that character string as the subject information. Because the method and device divide the HTML source code into a plurality of character strings at the div tags and then analyze those strings to obtain the subject information, they can process web pages built from different web page templates and improve the accuracy of subject information acquisition.

Description

Method and device for acquiring internet subject information
Technical field
The present invention relates to a technology for processing internet information, and in particular to a method and a device for acquiring internet subject information.
Background art
When browsing web pages on the Web, one finds that they usually contain two kinds of content. One part embodies the subject of the page, such as the news text in a news page; we call this "subject" information. The other part consists of content unrelated to the subject, such as navigation bars, advertisements, copyright notices and questionnaires; we call this "noise" information. Noise information is usually distributed around the subject information and is sometimes mixed into the subject content, but it has no content relevance to it.
Noise information usually appears as link navigation text (anchor text), so the pages it links together usually have no content relevance to each other either. The noise content in a web page therefore causes difficulty not only for Web applications based on page content, but also for applications based on the hyperlink structure between pages.
Once the noise content in a web page has been identified and removed quickly and accurately, the subject content of the page can be collected for subsequent processing or development.
Prior art one proposes a method for removing the noise information from internet web pages and collecting the subject information. The method first builds a tag tree of the web page from its <table> tags, and then, again according to the <table> tags, partitions the page into mutually nested content blocks. Then, for a set of web pages made with the same template, it finds the content that occurs repeatedly across the set and treats it as redundant content, while the content blocks that rarely recur across the set are the effective information blocks. Experiments show that the method is effective, but it is confined to sets of web pages based on the same template, and the web page templates on the Web are countless, so the method is clearly not general enough.
HTML (HyperText Markup Language) is a markup language that defines a set of tags describing the page layout used when a web page is displayed. The most common representation of an HTML page is therefore a tag tree of the page. Many tools for building tag trees exist; DOM (Document Object Model) is a commonly used one, which organizes the tags in a web page into a tree structure according to their nesting relations. To reduce noise and collect the useful subject information, a DOM tree is first generated from the HTML code, and the tree elements are then parsed to extract the subject information.
DOM stands for Document Object Model. It represents a document as a tree according to the nesting relations between the markup in the document; the elements and attributes of the document, together with the parsed character data, comments, processing instructions and so on, are all nodes.
Prior art two is implemented in the following steps:
1. The insufficiently standard HTML document is tidied into a well-formed XHTML document;
2. The XHTML document is parsed into a tree model, the DOM tree;
3. Information extraction is then carried out around the DOM tree;
4. By inductively learning the structure of sample web pages provided by the user, an XML document is generated from the nodes in the DOM; only the nodes carrying the information the user is interested in are kept in this XML document, which completes the information extraction.
In the course of implementing the present invention, the inventor found that prior art two has at least the following shortcomings:
The DOM tree is relatively complex, so analysis is inefficient and slow; and DOM trees come in a great many varieties, which makes obtaining the correct subject information inconsistent and difficult.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above deficiencies of the prior art, to provide a method and a device for acquiring internet subject information that do not require building a structurally complex DOM tree, and that offer a general way to analyze and process all web pages on the internet accurately and rapidly in order to obtain the subject information.
A method for acquiring internet subject information provided by an embodiment of the invention comprises:
Acquiring the HyperText Markup Language (HTML) source code of an internet web page;
Dividing the HTML source code into different character strings, using the div tag as the marker tag, and forming the character strings into a string list;
Analyzing each character string in the string list one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and is also greater than a preset base value, taking the content of that character string as the subject information;
Obtaining the character string in the string list that has the largest number of characters outside HTML tags;
Analyzing the character strings immediately before and after that character string in the string list; if the preceding and/or following character string satisfies the condition that its number of characters outside HTML tags is greater than its number of characters inside HTML tags and is also greater than the preset base value, taking the content of that preceding and/or following character string as subject information.
An embodiment of the invention also provides a device for acquiring internet subject information, comprising:
A source code acquisition module, for acquiring the HyperText Markup Language (HTML) source code of an internet web page;
A string forming module, for dividing the HTML source code into different character strings, using the div tag as the marker tag, and forming the character strings into a string list;
A first string analysis module, for analyzing each character string in the string list one by one and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and is also greater than a preset base value, taking the content of that character string as the subject information;
A second string analysis module, for obtaining, after the analysis by the first string analysis module, the character string in the string list that has the largest number of characters outside HTML tags, and for analyzing the character strings immediately before and after it in the string list; if the preceding and/or following character string satisfies the condition that its number of characters outside HTML tags is greater than its number of characters inside HTML tags and is also greater than the preset base value, the content of that preceding and/or following character string is taken as subject information.
By implementing the method and device for acquiring internet subject information provided by the invention, the HTML source code is divided into a plurality of character strings at the div tags and the character strings are then analyzed to obtain the subject information. Web pages built from different web page templates can thus be processed, and the accuracy of subject information acquisition is improved.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of embodiment one of the method for acquiring internet subject information of the present invention;
Fig. 2 is a schematic flow chart of embodiment two of the method for acquiring internet subject information of the present invention;
Fig. 3 is a schematic flow chart of embodiment three of the method for acquiring internet subject information of the present invention;
Fig. 4 is a schematic structural diagram of embodiment one of the device for acquiring internet subject information of the present invention;
Fig. 5 is a schematic structural diagram of embodiment two of the device for acquiring internet subject information of the present invention;
Fig. 6 is a schematic structural diagram of embodiment three of the device for acquiring internet subject information of the present invention.
Detailed description of the embodiments
The invention provides a method and a device for acquiring internet subject information that are not tied to a single web page template and that offer a general way to analyze and process all web pages on the internet accurately in order to obtain the subject information.
Fig. 1 is a schematic flow chart of embodiment one of the method for acquiring internet subject information provided by an embodiment of the invention.
The method for acquiring internet subject information provided by this embodiment comprises:
Step 100: acquire the HyperText Markup Language (HTML) source code of an internet web page;
It should be noted that HTML is the abbreviation of HyperText Markup Language and is generally used for writing web pages. By inspecting the HTML source code of a web page on the network, one can learn the structure of the page and the specific addresses of some of its pictures or videos.
Step 101: divide the HTML source code into different character strings, using the div tag as the marker tag, and form the character strings into a string list;
It should be noted that an HTML tag is usually the full form of an English word (such as blockquote for a block quotation) or an abbreviation (such as "p" for paragraph), but tags differ from ordinary text because they are enclosed in angle brackets. The paragraph tag is therefore <p>, and the block quotation tag is <blockquote>. Some HTML tags indicate how the page is formatted (for example, starting a new paragraph), others indicate how the text is displayed (<b> makes text bold), and still others provide information that is not displayed on the page, such as the title.
HTML tags appear in pairs. Whenever a tag such as <blockquote> is used, it must be closed with another tag, </blockquote>. The slash before blockquote is what distinguishes a closing tag from an opening tag. There are exceptions, however; the <input> tag, for example, does not need a closing tag.
Usually, HTML source code begins with a DOCTYPE, which declares the type of the document. Nothing (including line breaks and spaces) may precede it, otherwise the document declaration becomes invalid. Next comes the <html> tag, and the document ends with the </html> tag. The <html> and </html> tags are themselves HTML tags; between them, the whole page has two parts, the head and the body. The title text is placed between the <head> tag and the </head> tag, and this text appears in the minimized window at the bottom of the screen when the page is open. The body is placed between the <body> tag and the </body> tag, which is where all the content of the page resides. Everything displayed on the page is contained between these two tags.
The div tag is one of the HTML tags; it is an element that provides structure and background for block-level content in the HTML source code. It consists of a start tag <div> and an end tag </div>, and everything between these two tags makes up the block. The characteristics of the contained elements are controlled by the attributes of the div tag, or by formatting the block with a style sheet.
The div tag is also called a division tag. Its role is to set the placement of text, pictures, tables and so on. When text, images or other content are placed inside a div tag, the result may be called a "DIV block", a "DIV element", a "CSS-layer", or simply a "layer".
Because div tags are present in the HTML source code of web pages built from any template, dividing the HTML source code into character strings at the div tags does not require considering which type of template the page uses, so the approach is general;
For example, the following HTML simulates the structure of a news website. Each div tag in it groups the title and the summary of one news item together.
[Figure GDA0000091949160000051: HTML source listing, an image in the original]
In this embodiment, the div tag is used as the marker tag; that is, <div> and </div> are taken as the boundaries, and the character string enclosed by each <div> and </div> pair is extracted separately. For example, the character string between the first <div> and </div> pair in the above HTML source code is extracted as the first character string, namely:
The first character string is:
[Figure GDA0000091949160000062: listing of the first character string, an image in the original]
Then the character string between the second <div> and </div> pair in the above HTML source code is extracted as the second character string, namely:
The second character string is:
[Figure GDA0000091949160000063: listing of the second character string, an image in the original]
By analogy, the character strings between all <div> and </div> pairs are extracted in this way to form the string list.
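The splitting in step 101 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the function name `split_by_div`, the sample markup, and the choice to treat every opening or closing div tag as a boundary (which flattens nested divs) are all assumptions, since the original listings survive only as images.

```python
import re

# Matches an opening <div ...> or a closing </div> tag, case-insensitively.
DIV_TAG = re.compile(r"</?div\b[^>]*>", re.IGNORECASE)


def split_by_div(html_source):
    """Split HTML source at every div boundary and return the non-empty pieces."""
    pieces = DIV_TAG.split(html_source)
    return [piece.strip() for piece in pieces if piece.strip()]


# Hypothetical two-block page: a navigation block and a news block.
sample = (
    "<div class='nav'><a href='/'>Home</a></div>"
    "<div><h2>Title</h2><p>Summary text of the first news item.</p></div>"
)

for chunk in split_by_div(sample):
    print(chunk)
```

A regular expression keeps the sketch independent of any DOM parser, which matches the stated aim of avoiding a DOM tree; a production version would need a policy for nested divs.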
Step 102: analyze each character string in the string list one by one to identify the subject information;
Specifically, each character string in the string list is analyzed one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and is also greater than a preset base value, the content of that character string is taken as the subject information.
Thus, for each character string obtained by dividing at the div tags, the number of characters outside the various HTML tags is compared with the number of characters inside the HTML tags. If the number of characters outside the HTML tags is greater than the number inside them, and is also greater than the predetermined base value, the content of that character string is judged to be subject information and is acquired.
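The test in step 102 could look like the following minimal sketch. Several details are assumptions rather than statements of the patent: the names `char_counts` and `is_subject`, counting the angle brackets as part of a tag, and ignoring whitespace outside tags. The default base value of 50 is the figure the description gives later for news pages.

```python
import re

# Any HTML tag, from "<" up to the next ">".
TAG = re.compile(r"<[^>]*>")


def char_counts(fragment):
    """Return (outside, inside): characters outside tags and characters forming tags."""
    inside = sum(len(match.group(0)) for match in TAG.finditer(fragment))
    text_only = TAG.sub("", fragment)
    outside = len("".join(text_only.split()))  # assumption: whitespace is not counted
    return outside, inside


def is_subject(fragment, base=50):
    """Apply both conditions: outside > inside and outside > base."""
    outside, inside = char_counts(fragment)
    return outside > inside and outside > base


print(char_counts("<p>abc</p>"))                 # (3, 7)
print(is_subject("<p>" + "x" * 60 + "</p>"))     # True
print(is_subject("<a href='/a'>A</a>"))          # False
```

Link-heavy navigation blocks fail the first condition because their markup outweighs their visible text, while short captions fail the second.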
By implementing the method for acquiring internet subject information provided by the invention, there is no need to be tied to a single web page template. A general method is provided in which the HTML source code is divided into different character strings at the div tags and each character string is analyzed and processed, so that all web pages on the internet can be accurately analyzed and processed to obtain the subject information.
Fig. 2 is a schematic flow chart of embodiment two of the method for acquiring internet subject information provided by the present invention.
It should first be noted that the method provided by this embodiment can be used to collect either news subject information or log (journal) subject information. Depending on which of the two is to be collected, the base value against which the number of characters outside HTML tags is compared can be set to a different value.
Step 200: download an Extensible Markup Language (XML) page and extract list information from it;
Specifically, if news subject information is to be collected, the XML page is downloaded and news list information is extracted from it; if log subject information is to be collected, log list information is extracted from the downloaded XML page;
Step 201: download the pages at the uniform resource locators (URLs) in the list information, so as to obtain the HTML source code of the web pages that contain the subject information.
Specifically, the HTML source code of the page containing the news subject information, or of the page containing the log subject information, is obtained.
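A sketch of the list extraction in step 200 follows. The patent does not specify the XML format, so the RSS-like layout, the `link` element name, the function name `extract_links` and the example.com URLs are all hypothetical; the download itself (step 201) is left out so that the example runs without network access.

```python
import xml.etree.ElementTree as ET


def extract_links(xml_text, tag="link"):
    """Return the text of every <tag> element found in the XML list page."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag) if el.text and el.text.strip()]


# Hypothetical RSS-like news list; real list pages may be structured differently.
feed = """<rss><channel>
  <item><title>First story</title><link>http://example.com/news/1.html</link></item>
  <item><title>Second story</title><link>http://example.com/news/2.html</link></item>
</channel></rss>"""

print(extract_links(feed))
```

Each returned URL would then be fetched to obtain the HTML source code that the later steps analyze.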
Step 202: filter out of the HTML source code the HTML tags that are unrelated to the subject information (opening tags and closing tags alike).
Specifically, the HTML tags in the HTML source code that are unrelated to the news subject information or the log subject information are filtered out, for example script tags, style tags, object tags, iframe tags and form tags;
Step 203: obtain the HTML source code of the internet web page;
In this embodiment, because the HTML tags unrelated to the news or log subject information have already been filtered out of the HTML source code, the character strings are analyzed more efficiently than in the previous embodiment, which lays a foundation for improving the accuracy of subject information collection.
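The filtering in step 202 might be sketched as below. The tag names come from the description; removing each element together with its content, and the function name `strip_irrelevant`, are assumptions, because the text does not say whether only the tags or the whole elements are dropped.

```python
import re

# Tag names the description lists as unrelated to the subject information.
IRRELEVANT = ("script", "style", "object", "iframe", "form")


def strip_irrelevant(html_source):
    """Remove each listed element, including everything between its tags."""
    for name in IRRELEVANT:
        pattern = re.compile(
            r"<{0}\b[^>]*>.*?</{0}\s*>".format(name),
            re.IGNORECASE | re.DOTALL,
        )
        html_source = pattern.sub("", html_source)
    return html_source


page = "<div><script>var a=1;</script><p>Hi</p><style>p{}</style></div>"
print(strip_irrelevant(page))  # <div><p>Hi</p></div>
```

Dropping script and style bodies before counting matters because their text would otherwise be counted as characters outside tags.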
Step 204: divide the HTML source code into different character strings, using the div tag as the marker tag, and form the character strings into a string list.
For example, the following HTML simulates the structure of a news website. Each div tag in it groups the title and the summary of one news item together.
[Figure GDA0000091949160000081: HTML source listing, an image in the original]
In this embodiment, the div tag is used as the marker tag; that is, <div> and </div> are taken as the boundaries, and the character string enclosed by each <div> and </div> pair is extracted separately. For example, the character string between the first <div> and </div> pair in the above HTML source code is extracted as the first character string, namely:
The first character string is:
[Figures GDA0000091949160000082 and GDA0000091949160000091: listing of the first character string, images in the original]
Then the character string between the second <div> and </div> pair in the above HTML source code is extracted as the second character string, namely:
The second character string is:
[Figure GDA0000091949160000092: listing of the second character string, an image in the original]
By analogy, the character strings between all <div> and </div> pairs are extracted in this way to form the string list.
Step 205: analyze each character string in the string list one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and is also greater than the preset base value, take the content of that character string as the subject information.
It should be noted that, if the subject information to be collected is news subject information, the base value is set to 50; a character string with fewer characters than this is generally not news subject information;
To further improve the accuracy of subject information collection beyond embodiment one, embodiment two also comprises:
Step 206: obtain the character string in the string list that has the largest number of characters outside HTML tags;
Step 207: analyze the character strings immediately before and after that character string in the string list;
Specifically, if the preceding and/or following character string satisfies the condition that its number of characters outside HTML tags is greater than its number of characters inside HTML tags and is also greater than the preset base value, the content of that preceding and/or following character string is taken as subject information.
Step 208: analyze the preceding and/or following character string to obtain a result string;
Specifically, if the preceding and/or following character string satisfies the condition that its number of characters outside HTML tags is greater than its number of characters inside HTML tags and is also greater than the preset base value, that preceding and/or following character string, together with the character string having the largest number of characters outside HTML tags, is taken as the result string;
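Steps 206 to 208 can be sketched as follows. The helper names, the whitespace-insensitive counting and the decision to concatenate the qualifying neighbors with the largest string in document order are assumptions; the description only says that they are taken together as the result string.

```python
import re

TAG = re.compile(r"<[^>]*>")


def outside_inside(fragment):
    """Return (characters outside tags, characters forming tags)."""
    inside = sum(len(m.group(0)) for m in TAG.finditer(fragment))
    outside = len("".join(TAG.sub("", fragment).split()))
    return outside, inside


def qualifies(fragment, base):
    outside, inside = outside_inside(fragment)
    return outside > inside and outside > base


def result_string(strings, base=50):
    """Join the string with the most text and any qualifying direct neighbors."""
    if not strings:
        return ""
    # Step 206: index of the string with the most characters outside tags.
    best = max(range(len(strings)), key=lambda i: outside_inside(strings[i])[0])
    parts = []
    # Steps 207 and 208: test the preceding and following strings.
    if best > 0 and qualifies(strings[best - 1], base):
        parts.append(strings[best - 1])
    parts.append(strings[best])
    if best + 1 < len(strings) and qualifies(strings[best + 1], base):
        parts.append(strings[best + 1])
    return "".join(parts)


blocks = [
    "<a href='/'>Home</a>",
    "<p>" + "a" * 60 + "</p>",
    "<p>" + "b" * 100 + "</p>",
    "<a href='/x'>x</a>",
]
print(result_string(blocks) == blocks[1] + blocks[2])  # True
```

The neighbor check recovers article text that a page splits across adjacent div blocks, which a single largest-block rule would lose.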
Step 209: process the result string to collect the subject information.
Finally, step 210: save the subject information contained in the character string processed in step 209, together with the character string itself, for use in secondary development.
By implementing the method for acquiring internet subject information provided by the invention, there is no need to be tied to a single web page template. A general method is provided in which the HTML source code is first divided into different character strings at the div tags and each character string is analyzed and processed, so that all web pages on the internet can be accurately analyzed and processed. The analyzed character strings then undergo a second analysis, which further improves the accuracy of analyzing web pages on the internet, so that the subject information is collected quickly and accurately.
Fig. 3 is a schematic flow chart of embodiment three of the method for acquiring internet subject information of the present invention.
This embodiment describes step 209 of embodiment two in detail. It specifically comprises:
Step 300: compare the characters outside each HTML tag in the result string with the filter keywords, and filter out the characters unrelated to the subject information to be collected;
The filter keywords are predetermined. They are specifically illegal keywords, or keywords for noise information unrelated to the subject information, such as advertisement keywords, navigation bar keywords and survey keywords;
Step 301: extract all image (img) tags from the filtered result string, and download and save the picture resources; the width and height of each picture can also be obtained at the same time;
Step 302: replace the network resource paths in the result string with local resource paths;
Step 303: keep the paragraph (p) tags and image (img) tags in the result string, and delete all other tags in the result string.
Finally, the subject information contained in the character string processed in steps 300 to 303 is saved, together with the character string itself, for use in secondary development.
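Steps 300 to 303 can be illustrated with the sketch below. It is a simplification under stated assumptions: the function names, the `images` directory and the sample keyword are hypothetical, the actual picture download and the width and height lookup are omitted, and keywords are removed only from text outside tags.

```python
import posixpath
import re

TAG_SPLIT = re.compile(r"(<[^>]*>)")
IMG_SRC = re.compile(r"<img\b[^>]*?\bsrc\s*=\s*[\"']([^\"']+)[\"']", re.IGNORECASE)
# Any tag that is not a p or img tag (opening or closing).
OTHER_TAGS = re.compile(r"<(?!/?(?:p|img)\b)[^>]*>", re.IGNORECASE)


def drop_keywords(result, keywords):
    """Step 300: delete filter keywords from the text outside tags only."""
    parts = TAG_SPLIT.split(result)
    for i in range(0, len(parts), 2):  # even indexes hold text, odd indexes hold tags
        for word in keywords:
            parts[i] = parts[i].replace(word, "")
    return "".join(parts)


def clean_result(result, keywords, local_dir="images"):
    """Return (cleaned string, image URLs that a real run would download)."""
    result = drop_keywords(result, keywords)
    image_urls = IMG_SRC.findall(result)               # step 301, download omitted
    for url in image_urls:                             # step 302
        local_path = posixpath.join(local_dir, posixpath.basename(url))
        result = result.replace(url, local_path)
    result = OTHER_TAGS.sub("", result)                # step 303
    return result, image_urls


raw = ("<span><p>Story text.AD-LINK</p>"
       "<img src='http://example.com/pics/a.jpg'><b>bold</b></span>")
print(clean_result(raw, ["AD-LINK"]))
```

Keeping only p and img tags preserves paragraph breaks and pictures, which is what lets the saved result retain the original layout of the article.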
By implementing the method for acquiring internet subject information provided by the invention, on the basis of the accurate and fast collection of subject information in embodiments one and two, the collected subject information is further cleaned, the original format of the news item or log is preserved, and the pictures in the original web page can also be kept, so the result is better suited to secondary development.
Fig. 4 is a schematic structural diagram of embodiment one of the device for acquiring internet subject information of the present invention.
The device for acquiring internet subject information of this embodiment comprises a source code acquisition module 10, a string forming module 11 and a first string analysis module 12, whose functions are as follows:
The source code acquisition module 10 is for acquiring the HTML source code of an internet web page;
In operation, the source code acquisition module 10 executes step 100 of embodiment one of the aforementioned method for acquiring internet subject information (hereinafter method embodiment one);
The string forming module 11 is for dividing the HTML source code into different character strings, using the div tag as the marker tag, and forming the character strings into a string list;
It should be noted that an HTML tag is usually the full form of an English word (such as blockquote for a block quotation) or an abbreviation (such as "p" for paragraph), but tags differ from ordinary text because they are enclosed in angle brackets. The paragraph tag is therefore <p>, and the block quotation tag is <blockquote>. Some HTML tags indicate how the page is formatted (for example, starting a new paragraph), others indicate how the text is displayed (<b> makes text bold), and still others provide information that is not displayed on the page, such as the title.
HTML tags appear in pairs. Whenever a tag such as <blockquote> is used, it must be closed with another tag, </blockquote>. The slash before blockquote is what distinguishes a closing tag from an opening tag. There are exceptions, however; the <input> tag, for example, does not need a closing tag.
Usually, HTML source code begins with a DOCTYPE, which declares the type of the document. Nothing (including line breaks and spaces) may precede it, otherwise the document declaration becomes invalid. Next comes the <html> tag, and the document ends with the </html> tag. The <html> and </html> tags are themselves HTML tags; between them, the whole page has two parts, the head and the body. The title text is placed between the <head> tag and the </head> tag, and this text appears in the minimized window at the bottom of the screen when the page is open. The body is placed between the <body> tag and the </body> tag, which is where all the content of the page resides. Everything displayed on the page is contained between these two tags.
The div tag is one of the HTML tags; it is an element that provides structure and background for block-level content in the HTML source code. It consists of a start tag <div> and an end tag </div>, and everything between these two tags makes up the block. The div tag is also called a division tag, and its role is to set the placement of text, pictures, tables and so on. Div tags are present in the HTML source code of web pages built from any template. In operation, the string forming module 11 of this embodiment executes step 101 of method embodiment one; that is, it divides the HTML source code into character strings at the div tags without needing to consider which type of template the page uses, and forms the resulting character strings into a string list, so the approach is general;
For example, the following HTML simulates the structure of a news website. Each div tag in it groups the title and the summary of one news item together.
[Figure GDA0000091949160000121: HTML source listing, an image in the original]
In this embodiment, the div tag is used as the marker tag; that is, <div> and </div> are taken as the boundaries, and the character string enclosed by each <div> and </div> pair is extracted separately. For example, the character string between the first <div> and </div> pair in the above HTML source code is extracted as the first character string, namely:
The first character string is:
[Figure GDA0000091949160000131: listing of the first character string, an image in the original]
Then the character string between the second <div> and </div> pair in the above HTML source code is extracted as the second character string, namely:
The second character string is:
[Figure GDA0000091949160000132: listing of the second character string, an image in the original]
By analogy, the character strings between all <div> and </div> pairs are extracted in this way to form the string list.
The first string analysis module 12, for each character string in the character string tabulation of analyzing one by one 10 formation of described character string formation module, when the character number outside the html tag in certain character string greater than the character number in the described html tag, and the outer character number of html tag is during greater than the radix set, and the content that this character string is comprised is as subject information.
Specifically, for each character string obtained by dividing at the div tags, the first character string analysis module 12 compares the number of characters outside the various HTML tags with the number of characters inside the HTML tags. If the number of characters outside the HTML tags is greater than the number inside the HTML tags and is also greater than the predetermined base number, the content of that character string is judged to be subject information and is acquired. In a specific implementation, the first character string analysis module 12 carries out step 102 of the foregoing method embodiment one.
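The comparison performed on each character string can be sketched as follows. The sketch rests on assumptions that the text does not settle: characters "inside" a tag are counted including the angle brackets, whitespace outside tags is counted like any other character, and the names `count_characters` and `is_subject` as well as the base numbers used in the example are hypothetical.

```python
import re

# Any run from "<" to the next ">" is treated as one HTML tag (assumption).
TAG = re.compile(r"<[^>]*>")


def count_characters(fragment: str) -> tuple[int, int]:
    """Return (characters outside HTML tags, characters inside HTML tags)."""
    inside = sum(len(tag) for tag in TAG.findall(fragment))
    outside = len(TAG.sub("", fragment))
    return outside, inside


def is_subject(fragment: str, base: int) -> bool:
    """Apply both conditions: outside > inside and outside > base number."""
    outside, inside = count_characters(fragment)
    return outside > inside and outside > base


body = "<p>Hello world, this is body text.</p>"
navigation = '<a href="/a">A</a>'
print(count_characters(body), is_subject(body, 20))
print(count_characters(navigation), is_subject(navigation, 20))
```

A link-heavy fragment such as the navigation example fails the first condition because its markup outweighs its visible text, which is the property the method relies on.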
By implementing the device for acquiring internet subject information provided by the invention, there is no need to rely on a uniform web page template. Instead, a general approach is provided: the HTML source code is divided into different character strings at the div tags and each character string is analyzed and processed, so that all web pages on the internet can be analyzed and processed accurately to obtain subject information.
Refer to Fig. 5, which is a structural schematic diagram of embodiment two of the device for acquiring internet subject information provided by the invention.
It should first be noted that the device provided by the embodiments of the invention can be used to collect either news subject information or blog subject information.
The device provided by the present embodiment comprises the source code acquisition module 10, the character string forming module 11 and the first character string analysis module 12 of the foregoing embodiment one of the device for acquiring internet subject information (hereinafter, device embodiment one), and further comprises a base number setting module 13, an information downloading module 14, an information filtering module 15, a second character string analysis module 16, a character string processing module 17 and an information acquisition module 18, whose functions are as follows:
The base number setting module 13 is configured to set the base number to different values according to whether the subject information to be collected is news subject information or blog subject information.
Specifically, depending on whether the subject information to be collected is news subject information or blog subject information, the base number setting module 13 sets a different base number for the test of whether the number of characters outside the HTML tags in a character string exceeds the base number.
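The selection of a base number per content type can be sketched as a simple lookup. The numeric values below are placeholders chosen only for illustration; the text does not state the actual base numbers, and the names `BASE_BY_CONTENT_TYPE` and `base_for` are hypothetical.

```python
# Placeholder base numbers: the source text gives no concrete values.
BASE_BY_CONTENT_TYPE = {
    "news": 200,  # hypothetical threshold for news subject information
    "blog": 100,  # hypothetical threshold for blog subject information
}


def base_for(content_type: str) -> int:
    """Return the base number configured for the given content type."""
    if content_type not in BASE_BY_CONTENT_TYPE:
        raise ValueError(f"unknown content type: {content_type!r}")
    return BASE_BY_CONTENT_TYPE[content_type]


print(base_for("news"), base_for("blog"))
```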
The device in embodiment two further comprises:
The information downloading module 14 is configured to download an Extensible Markup Language (XML) page and extract list information from it, and to download the Uniform Resource Locators (URLs) in the list information and send them to the source code acquisition module 10 for processing.
Specifically, if news subject information is to be collected, the information downloading module 14 downloads the XML page and extracts news list information from it; if blog subject information is to be collected, the information downloading module 14 extracts blog list information from the downloaded XML page. It then downloads the URLs in the list information.
In a specific embodiment, the information downloading module 14 carries out step 200 and step 201 of the foregoing method embodiment two.
Thereafter, the source code acquisition module 10 obtains the HTML source code according to the list information and the URLs.
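Extracting list information from a downloaded XML page can be sketched as follows. The text does not specify the XML schema, so this sketch assumes an RSS-like layout with `item`, `title` and `link` elements; the sample document `SAMPLE_XML`, the example.com addresses and the function name `extract_list` are all hypothetical. The download itself is left out so that the sketch runs without network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page in an RSS-like layout (the real schema is not given).
SAMPLE_XML = """<rss><channel>
  <item><title>First story</title><link>http://example.com/news/1.html</link></item>
  <item><title>Second story</title><link>http://example.com/news/2.html</link></item>
</channel></rss>"""


def extract_list(xml_text: str) -> list[tuple[str, str]]:
    """Return (title, URL) pairs for every list entry that carries a link."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if link:
            entries.append((title, link))
    return entries


for title, url in extract_list(SAMPLE_XML):
    # A real collector would fetch each URL here and pass the HTML onward.
    print(title, url)
```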
The information filtering module 15 is configured to filter out, from the HTML source code obtained by the source code acquisition module 10, the HTML tags that are irrelevant to the subject information.
Specifically, the information filtering module 15 filters out HTML tags irrelevant to the subject information, such as script tags, style tags, object tags, iframe tags and form tags.
In a specific implementation, the information filtering module 15 carries out step 202 of the foregoing method embodiment two.
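The filtering of irrelevant tags can be sketched with regular expressions as follows. The sketch assumes that a paired element is removed together with its content (otherwise script or style text would later be counted as characters outside tags); the text itself only says that the tags are filtered. The names `NOISE_TAGS` and `strip_noise_tags` are hypothetical.

```python
import re

# Tag names listed in the description as irrelevant to subject information.
NOISE_TAGS = ("script", "style", "object", "iframe", "form")


def strip_noise_tags(html_source: str) -> str:
    """Remove the listed elements, including their content when they are paired."""
    cleaned = html_source
    for name in NOISE_TAGS:
        # Paired element: drop the start tag, the content and the end tag.
        cleaned = re.sub(
            rf"<{name}\b[^>]*>.*?</{name}\s*>",
            "",
            cleaned,
            flags=re.IGNORECASE | re.DOTALL,
        )
        # Leftover unpaired start or end tag of the same name.
        cleaned = re.sub(rf"</?{name}\b[^>]*>", "", cleaned, flags=re.IGNORECASE)
    return cleaned


page = "<p>a</p><script>var x = 1;</script><style>p{}</style><p>b</p>"
print(strip_noise_tags(page))
```

Removing the content of form elements may discard body text on pages that wrap their whole content in a form, so a production filter might treat that tag differently.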
Thereafter, the foregoing character string forming module 11 divides the HTML source code into different character strings with the div tag as the marker tag and forms the different character strings into a character string list. The foregoing first character string analysis module 12 then analyzes each character string in the character string list one by one; when, in a given character string, the number of characters outside the HTML tags is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than the set base number, the content contained in that character string is taken as subject information.
To further improve the accuracy of collecting subject information on the basis of device embodiment one, the device in embodiment two further comprises:
The second character string analysis module 16 is configured to obtain, after the analysis by the first character string analysis module 12, the character string in the character string list that has the largest number of characters outside the HTML tags, and to analyze the preceding character string and the following character string of that character string in the character string list. If the preceding character string and/or the following character string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is greater than the set base number, the content contained in the preceding character string and/or the following character string is taken as subject information. In a specific implementation, the second character string analysis module 16 carries out steps 206 to 207 of the foregoing method embodiment two.
The device in embodiment two further comprises:
The character string processing module 17 is configured, when the preceding character string and/or the following character string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is greater than the set base number, to take the preceding character string and/or the following character string, together with the character string having the largest number of characters outside the HTML tags, as a result character string, and to process the result character string in order to collect the subject information. In a specific embodiment, the character string processing module 17 carries out steps 208 to 209 of the foregoing method embodiment two.
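The second analysis and the assembly of the result character string described for modules 16 and 17 can be sketched as follows. The sketch assumes that the qualifying neighbours are joined with the longest string in document order, and it counts characters in the same simplified way as a plain tag regex allows; the function names and the sample strings are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")


def count_characters(fragment: str) -> tuple[int, int]:
    """Return (characters outside HTML tags, characters inside HTML tags)."""
    inside = sum(len(tag) for tag in TAG.findall(fragment))
    outside = len(TAG.sub("", fragment))
    return outside, inside


def qualifies(fragment: str, base: int) -> bool:
    """Condition from the description: outside > inside and outside > base."""
    outside, inside = count_characters(fragment)
    return outside > inside and outside > base


def collect_result_string(strings: list[str], base: int) -> str:
    """Join the string with the most outside-tag characters with any
    qualifying immediate neighbour (preceding and/or following)."""
    if not strings:
        return ""
    outside_counts = [count_characters(s)[0] for s in strings]
    best = outside_counts.index(max(outside_counts))
    parts = []
    if best > 0 and qualifies(strings[best - 1], base):
        parts.append(strings[best - 1])
    parts.append(strings[best])
    if best + 1 < len(strings) and qualifies(strings[best + 1], base):
        parts.append(strings[best + 1])
    return "".join(parts)


sample = [
    "<a href='/x'>nav</a>",
    "<p>" + "a" * 30 + "</p>",
    "<p>" + "b" * 50 + "</p>",
    "<p>short</p>",
]
print(collect_result_string(sample, base=20))
```

In the sample, the third string carries the most visible text, its predecessor passes both conditions and its successor does not, so the result character string is built from the second and third strings only.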
The device in embodiment two further comprises:
The information acquisition module 18 is configured to save the result character string processed by the character string processing module 17, together with the subject information contained in the result character string, for use in secondary development by users.
By implementing the device for acquiring internet subject information provided by the invention, there is no need to rely on a uniform web page template, and a general method is provided. The HTML source code is first divided into different character strings at the div tags and each character string is analyzed and processed, so that all web pages on the internet can be analyzed and processed accurately. The analyzed character strings are then analyzed a second time, which further improves the accuracy of analyzing web pages on the internet, so that subject information is collected quickly and accurately.
Refer to Fig. 6, which is a structural schematic diagram of embodiment three of the device for acquiring internet subject information of the invention.
The present embodiment describes in detail the character string processing module 17 of the foregoing device embodiment two.
The character string processing module 17 specifically comprises a character filtering unit 170, a picture downloading unit 171, a path replacement unit 172 and a tag processing unit 173, whose functions are as follows:
The character filtering unit 170 is configured to compare the characters outside each HTML tag in the result character string with filter keywords and to filter out the characters irrelevant to the subject information. Specifically, the filter keywords are predetermined and are, for example, illegal keywords, or keywords for noise information irrelevant to the subject information, such as advertisement keywords, navigation bar keywords and survey keywords. In a specific implementation, the character filtering unit 170 carries out step 300 of the foregoing method embodiment three.
The picture downloading unit 171 is configured to extract all image tags from the result character string after filtering by the character filtering unit 170, and to download and save the picture resources; it can also obtain the width and height of each picture at the same time.
The path replacement unit 172 is configured to replace the network resource paths in the result character string with local resource paths.
The tag processing unit 173 is configured to keep the paragraph (p) tags and image tags in the result character string and to delete the other tags in the result character string.
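The four units above can be sketched as one small pipeline. Several points are assumptions made only for this sketch: image tags are taken to be `<img>` elements with a quoted `src` attribute, a text run is dropped whole when it contains a filter keyword, only image paths are rewritten, and the actual download is omitted so that the sketch runs offline. All function names, the keyword list and the `images` directory are hypothetical.

```python
import posixpath
import re
from urllib.parse import urlparse

TAG_SPLIT = re.compile(r"(<[^>]*>)")
IMG_SRC = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']""", re.IGNORECASE)
KEPT_TAG = re.compile(r"</?(?:p|img)\b[^>]*>", re.IGNORECASE)


def drop_keyword_text(html: str, keywords: list[str]) -> str:
    """Character filtering: drop text runs outside tags that contain a keyword."""
    kept = []
    for part in TAG_SPLIT.split(html):
        is_tag = part.startswith("<") and part.endswith(">")
        if not is_tag and any(keyword in part for keyword in keywords):
            continue
        kept.append(part)
    return "".join(kept)


def image_urls(html: str) -> list[str]:
    """Picture extraction: list the src value of every <img> tag."""
    return IMG_SRC.findall(html)


def localise_paths(html: str, local_dir: str) -> str:
    """Path replacement: point each image at a file name under local_dir."""
    for url in image_urls(html):
        file_name = posixpath.basename(urlparse(url).path)
        html = html.replace(url, posixpath.join(local_dir, file_name))
    return html


def keep_p_and_img(html: str) -> str:
    """Tag processing: keep p and img tags, delete every other tag."""
    return re.sub(
        r"<[^>]*>",
        lambda m: m.group(0) if KEPT_TAG.fullmatch(m.group(0)) else "",
        html,
    )


def process_result_string(html: str, keywords: list[str], local_dir: str):
    filtered = drop_keyword_text(html, keywords)
    urls = image_urls(filtered)  # a real collector would download and save these
    return keep_p_and_img(localise_paths(filtered, local_dir)), urls


raw = (
    "<h1>Headline</h1><p>Story text</p><span>Advertisement here</span>"
    '<p><img src="http://example.com/pics/a.jpg"></p>'
)
cleaned, pictures = process_result_string(raw, ["Advertisement"], "images")
print(cleaned)
print(pictures)
```

The order of the calls follows the description: keyword filtering first, then picture extraction on the filtered string, then path replacement, and tag cleanup last.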
By implementing the device for acquiring internet subject information provided by the invention, on the basis of the accurate and fast collection of subject information achieved by device embodiment one and device embodiment two, the collected subject information is further cleaned, the original form of the news item or blog entry is preserved, and the pictures in the original web page can also be kept, so that the result is better suited to secondary development.
A person of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be accomplished by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above are preferred embodiments of the invention. It should be noted that a person of ordinary skill in the art can make improvements and refinements without departing from the principle of the invention, and these improvements and refinements are also considered to fall within the scope of protection of the invention.

Claims (16)

1. A method for acquiring internet subject information, characterized by comprising:
Obtaining the Hypertext Markup Language (HTML) source code of an internet web page;
Dividing the HTML source code into different character strings with the div tag as the marker tag, and forming the different character strings into a character string list;
Analyzing each character string in the character string list one by one, and when, in a given character string, the number of characters outside the HTML tags is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, taking the content contained in that character string as subject information;
Obtaining the character string in the character string list that has the largest number of characters outside the HTML tags;
Analyzing the preceding character string and the following character string, in the character string list, of the character string having the largest number of characters outside the HTML tags; if the preceding character string and/or the following character string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is greater than the set base number, taking the content contained in the preceding character string and/or the following character string as subject information.
2. the method for claim 1 is characterized in that, described subject information is theme of news information or daily record subject information.
3. method as claimed in claim 2 is characterized in that, when subject information to be collected was theme of news information or daily record subject information, the value of described radix was set as difference.
4. The method of claim 3, characterized in that, when the subject information is news subject information, before the step of obtaining the HTML source code of the internet web page, the method comprises:
Downloading an Extensible Markup Language (XML) page and extracting list information;
Downloading the Uniform Resource Locators (URLs) in the list information, in order to obtain the HTML source code of the web page in which the subject information is located.
5. The method of claim 4, characterized in that, after obtaining the HTML source code, the method comprises:
Filtering out the HTML tags in the HTML source code that are irrelevant to the subject information.
6. The method of any one of claims 1 to 5, characterized in that, after taking the content contained in the preceding character string and/or the following character string as subject information, the method comprises:
If the preceding character string and/or the following character string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is greater than the set base number, taking the preceding character string and/or the following character string, together with the character string having the largest number of characters outside the HTML tags, as a result character string;
Processing the result character string to collect the subject information.
7. The method of claim 6, characterized in that processing the result character string specifically comprises:
Comparing the characters outside each HTML tag in the result character string with filter keywords, and filtering out the characters irrelevant to the subject information to be collected;
Extracting all image tags from the result character string after the filtering, and downloading and saving the picture resources;
Replacing the network resource paths in the result character string with local resource paths;
Keeping the paragraph (p) tags and image tags in the result character string, and deleting the other tags in the result character string.
8. The method of claim 7, characterized in that the processed result character string and the subject information contained in the result character string are saved for use in secondary development.
9. A device for acquiring internet subject information, characterized by comprising:
A source code acquisition module, configured to obtain the Hypertext Markup Language (HTML) source code of an internet web page;
A character string forming module, configured to divide the HTML source code into different character strings with the div tag as the marker tag, and to form the different character strings into a character string list;
A first character string analysis module, configured to analyze each character string in the character string list one by one, and when, in a given character string, the number of characters outside the HTML tags is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, to take the content contained in that character string as subject information;
A second character string analysis module, configured to obtain, after the analysis by the first character string analysis module, the character string in the character string list that has the largest number of characters outside the HTML tags, and to analyze the preceding character string and the following character string of that character string in the character string list; if the preceding character string and/or the following character string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is greater than the set base number, the content contained in the preceding character string and/or the following character string is taken as subject information.
10. The device of claim 9, characterized in that the subject information is news subject information or blog subject information.
11. The device of claim 10, characterized in that the device further comprises:
A base number setting module, configured to set the base number to different values according to whether the subject information to be collected is news subject information or blog subject information.
12. The device of claim 11, characterized in that the device further comprises:
An information downloading module, configured to download an Extensible Markup Language (XML) page and extract list information, and to download the Uniform Resource Locators (URLs) in the list information and send them to the source code acquisition module for processing.
13. The device of claim 12, characterized in that the device further comprises:
An information filtering module, configured to filter out, from the HTML source code obtained by the source code acquisition module, the HTML tags that are irrelevant to the subject information.
14. The device of claim 13, characterized in that the device further comprises:
A character string processing module, configured, when the preceding character string and/or the following character string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is greater than the set base number, to take the preceding character string and/or the following character string, together with the character string having the largest number of characters outside the HTML tags, as a result character string, and to process the result character string in order to collect the subject information.
15. The device of claim 14, characterized in that the character string processing module specifically comprises:
A character filtering unit, configured to compare the characters outside each HTML tag in the result character string with filter keywords, and to filter out the characters irrelevant to the subject information;
A picture downloading unit, configured to extract all image tags from the result character string after filtering by the character filtering unit, and to download and save the picture resources;
A path replacement unit, configured to replace the network resource paths in the result character string with local resource paths;
A tag processing unit, configured to keep the paragraph (p) tags and image tags in the result character string and to delete the other tags in the result character string.
16. The device of claim 15, characterized in that the device further comprises:
An information acquisition module, configured to save the result character string processed by the character string processing module, together with the subject information contained in the result character string, for use in secondary development by users.
CN 200910110356 2009-10-28 2009-10-28 Method for acquiring internet subject information and device thereof Expired - Fee Related CN101702160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910110356 CN101702160B (en) 2009-10-28 2009-10-28 Method for acquiring internet subject information and device thereof

Publications (2)

Publication Number Publication Date
CN101702160A CN101702160A (en) 2010-05-05
CN101702160B true CN101702160B (en) 2013-04-17

Family

ID=42157075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910110356 Expired - Fee Related CN101702160B (en) 2009-10-28 2009-10-28 Method for acquiring internet subject information and device thereof

Country Status (1)

Country Link
CN (1) CN101702160B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000077663A2 (en) * 1999-06-14 2000-12-21 Lockheed Martin Corporation System and method for interactive electronic media extraction for web page generation
CN101079031A (en) * 2006-06-15 2007-11-28 腾讯科技(深圳)有限公司 Web page subject extraction system and method
CN101470728A (en) * 2007-12-25 2009-07-01 北京大学 Method and device for automatically abstracting text of Chinese news web page

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙承杰等.《基于统计的网页正文信息抽取方法的研究》.《中文信息学报》.2004,第18卷(第5期),第19页,第22页. *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHENZHEN LONGSHI MEDIA CO., LTD.

Free format text: FORMER OWNER: SHENZHEN TONGZHOU ELECTRONIC CO., LTD.

Effective date: 20120424

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518129 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TA01 Transfer of patent application right

Effective date of registration: 20120424

Address after: 518057 District, Guangdong, Nanshan District hi tech Zone, the North Zone of the Fifth Industrial Zone, rainbow science and technology building, A2-3 District,

Applicant after: Shenzhen Longguan Media Co., Ltd.

Address before: 518129 Rainbow Technology Building, North hi tech Zone, Nanshan District, Guangdong, Shenzhen

Applicant before: Shenzhen Tongzhou Electronic Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130417

Termination date: 20161028