CN101702160B

CN101702160B - Method for acquiring internet subject information and device thereof

Info

Publication number: CN101702160B
Application number: CN 200910110356
Authority: CN
Inventors: 黎柯
Original assignee: SHENZHEN LONGGUAN MEDIA CO Ltd
Current assignee: Shenzhen Longguan Media Co., Ltd.
Priority date: 2009-10-28
Filing date: 2009-10-28
Publication date: 2013-04-17
Anticipated expiration: 2029-10-28
Also published as: CN101702160A

Abstract

The invention provides a method and a device for acquiring internet subject information. The method comprises: acquiring the HyperText Markup Language (HTML) source code of an internet web page; dividing the HTML source code into different character strings using the div tag as the delimiting tag, and forming the character strings into a character string list; and analyzing each character string in the list one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and the number of characters outside the HTML tags is also greater than a set base number, taking the content of that character string as subject information. By dividing the HTML source code into a plurality of character strings at div tags and analyzing those strings, the method and device can process web pages built from different templates and improve the accuracy of subject information acquisition.

Description

Method and device for acquiring internet subject information

Technical field

The present invention relates to technology for processing internet information, and in particular to a method and a device for acquiring internet subject information.

Background technology

Web pages usually contain two kinds of content. One part carries the subject of the page, such as the news article in a news page; this is referred to as "subject" information. The other part consists of content unrelated to the subject, such as navigation bars, advertisements, copyright notices and questionnaires; this is referred to as "noise" information. Noise information is usually distributed around the subject information and is sometimes mixed into it, but it has no content relevance to the subject.

Noise information usually appears as link navigation text (anchor text), so the pages linked through it are usually unrelated in content as well. Noise content therefore causes difficulty not only for Web applications based on page content, but also for applications based on the hyperlinks between pages.

Once the noise content in a web page has been identified and removed quickly and accurately, the subject content of the page can be collected for subsequent processing or development.

Prior art one proposes a method for removing noise information from internet web pages and collecting subject information. The method first builds a tag tree of the web page based on <table> tags, and then, again based on <table> tags, partitions a page into mutually nested content blocks. Then, for a set of web pages made from the same template, content that occurs repeatedly across the set is treated as redundant content, while content blocks that rarely occur in common across the set are treated as effective information blocks. Experiments show that the method is effective, but it is restricted to sets of web pages based on the same template. Since there are countless web page templates on the Web, the method is not general enough.

HTML (HyperText Markup Language) is a markup language that defines a set of tags describing the page layout used when a web page is displayed. The most common representation of an HTML page is therefore a tag tree of the page. Many tools exist for building tag trees; DOM (Document Object Model) is a commonly used one, which organizes the tags of a web page into a tree structure according to their nesting relationships. To reduce noise and collect useful subject information, a DOM tree is first generated from the HTML code, and the tree elements are then parsed to extract the subject information.

The DOM represents a document as a tree structure according to the nesting relationships between the markup in the document; the elements, attributes, parsed character data, comments and processing instructions in the document are all nodes.

The implementation steps of prior art two are as follows:

1. An HTML document that is not sufficiently standard is converted into a well-formed XHTML document;

2. The XHTML document is parsed into a tree model, namely a DOM tree;

3. Information is then extracted from the DOM tree;

4. By inductively learning the structure of sample web pages provided by the user, an XML document is generated from the nodes of the DOM tree; only the nodes holding information of interest to the user are kept in this XML document, which completes the information extraction.

In the course of implementing the present invention, the inventor found that prior art two has at least the following shortcomings:

DOM trees are relatively complex, so analysis is inefficient and slow; moreover, DOM trees vary greatly from page to page, which makes it difficult to obtain the correct subject information.

Summary of the invention

The technical problem to be solved by the present invention is, in view of the above deficiencies of the prior art, to provide a method and a device for acquiring internet subject information that do not require a structurally complex DOM tree and that offer a general way to analyze and process all web pages on the internet accurately and quickly in order to obtain subject information.

A method for acquiring internet subject information provided by an embodiment of the invention comprises:

Obtaining the HyperText Markup Language (HTML) source code of an internet web page;

Dividing the HTML source code into different character strings using the div tag as the delimiting tag, and forming the character strings into a character string list;

Analyzing each character string in the character string list one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and the number of characters outside the HTML tags is greater than a set base number, taking the content of that character string as subject information.

Obtaining the character string in the character string list that has the largest number of characters outside HTML tags;

Analyzing the character strings immediately before and after that character string in the list; if the preceding and/or following character string satisfies the condition that its number of characters outside HTML tags is greater than its number of characters inside HTML tags and also greater than the set base number, taking the content of the preceding and/or following character string as subject information.

An embodiment of the invention also provides a device for acquiring internet subject information, comprising:

A source code acquisition module, configured to obtain the HyperText Markup Language (HTML) source code of an internet web page;

A character string forming module, configured to divide the HTML source code into different character strings using the div tag as the delimiting tag, and to form the character strings into a character string list;

A first string analysis module, configured to analyze each character string in the character string list one by one and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and the number of characters outside the HTML tags is greater than a set base number, to take the content of that character string as subject information.

A second string analysis module, configured to obtain, after the analysis by the first string analysis module, the character string in the character string list that has the largest number of characters outside HTML tags, and to analyze the character strings immediately before and after that character string in the list; if the preceding and/or following character string satisfies the condition that its number of characters outside HTML tags is greater than its number of characters inside HTML tags and also greater than the set base number, the content of the preceding and/or following character string is taken as subject information.

By implementing the method and device for acquiring internet subject information provided by the invention, the HTML source code is divided into a plurality of character strings at div tags and those strings are then analyzed to obtain the subject information; web pages built from different templates can thus be processed, and the accuracy of subject information acquisition is improved.

Description of drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of embodiment one of the method for acquiring internet subject information of the present invention;

Fig. 2 is a schematic flow chart of embodiment two of the method for acquiring internet subject information of the present invention;

Fig. 3 is a schematic flow chart of embodiment three of the method for acquiring internet subject information of the present invention;

Fig. 4 is a schematic structural diagram of embodiment one of the device for acquiring internet subject information of the present invention;

Fig. 5 is a schematic structural diagram of embodiment two of the device for acquiring internet subject information of the present invention;

Fig. 6 is a schematic structural diagram of embodiment three of the device for acquiring internet subject information of the present invention.

Embodiment

The invention provides a method and a device for acquiring internet subject information that are not tied to a single web page template and that offer a general way to analyze and process all web pages on the internet accurately in order to obtain subject information.

Referring to Fig. 1, a schematic flow chart of embodiment one of the method for acquiring internet subject information provided by the embodiments of the invention is shown.

The method for acquiring internet subject information provided by this embodiment comprises:

Step 100: obtain the HyperText Markup Language (HTML) source code of an internet web page;

It should be noted that HTML is the abbreviation of HyperText Markup Language, which is generally used to write web pages. By viewing the HTML source code of a web page, one can learn the structure of the page and the specific addresses of pictures or videos in it.

Step 101: divide the HTML source code into different character strings using the div tag as the delimiting tag, and form the character strings into a character string list;

It should be noted that an HTML tag is usually a full English word (such as blockquote for a block quotation) or an abbreviation (such as "p" for paragraph), but tags differ from ordinary text because they are enclosed in angle brackets. The paragraph tag is thus <p>, and the block quotation tag is <blockquote>. Some HTML tags specify how the page is formatted (for example, starting a new paragraph), others specify how words are displayed (<b> makes text bold), and still others provide information that is not displayed on the page, such as the title.

HTML tags occur in pairs. Whenever a tag such as <blockquote> is used, it must be closed with a matching tag </blockquote>. The slash before blockquote is what distinguishes the closing tag from the opening tag. There are exceptions, however; the <input> tag, for example, needs no closing tag.

Usually, HTML source code begins with a DOCTYPE, which declares the type of the document; nothing (including line breaks and spaces) may precede it, otherwise the document declaration becomes invalid. It is followed by the <html> tag, and the document ends with the </html> tag. The <html> and </html> tags are themselves HTML tags. Between them, the page has two parts, the head and the body. The title is placed between the <head> tag and the </head> tag; it appears on the minimized window at the bottom of the screen when the page is open. The body is placed between the <body> tag and the </body> tag and holds all the content of the page. Everything displayed on the page is contained between these two tags.

The div tag is one of the HTML tags; it is an element that provides structure and background for block-level content in HTML source code. It consists of a start tag <div> and an end tag </div>, and everything between the two tags constitutes the block. The characteristics of the contained elements are controlled by the attributes of the div tag, or by formatting the block with a style sheet.

The div tag is also called a division tag. Its function is to set the placement of text, pictures, tables and the like. When text, images or other content are placed inside a div tag, the result is called a "DIV block", a "DIV element" or a "CSS layer", or simply a "layer".

Because div tags appear in the HTML source code of web pages built from any template, dividing the HTML source code into character strings at div tags does not require considering which template the page uses, and the approach is therefore general;

For example, the HTML below simulates the structure of a news website. Each div tag in it groups the title and the summary of one news item.

In this embodiment, the div tag is used as the delimiting tag, that is, <div> and </div> serve as boundaries, and the character string enclosed by each <div> and </div> pair is extracted separately. For example, the character string between the first <div> and </div> pair in the above HTML source code is extracted as the first character string, that is:

The first character string is:

Then the character string between the second <div> and </div> pair in the above HTML source code is extracted as the second character string, that is:

The second character string is:

By analogy, the character strings between all <div> and </div> pairs are extracted in this way to form the character string list.
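As a rough illustration of step 101, the following Python sketch extracts the text between each <div> and </div> pair. The sample HTML, the function name `split_by_div` and the regular expression are assumptions made here for illustration and are not taken from the patent. The sketch also assumes that div blocks are not nested, a case the description does not address.

```python
import re
from typing import List

# Hypothetical pattern: an opening <div ...>, the shortest possible body,
# and the next closing </div>. Nested div blocks are NOT handled.
DIV_BLOCK = re.compile(r"<div\b[^>]*>(.*?)</div\s*>", re.IGNORECASE | re.DOTALL)


def split_by_div(html: str) -> List[str]:
    """Return the non-empty contents of each <div>...</div> pair, in page order."""
    return [body.strip() for body in DIV_BLOCK.findall(html) if body.strip()]


# Made-up sample page: each div groups the title and summary of one news item.
SAMPLE = (
    '<div class="news"><h2>Title one</h2><p>Summary one.</p></div>'
    '<div class="news"><h2>Title two</h2><p>Summary two.</p></div>'
)

string_list = split_by_div(SAMPLE)
for index, text in enumerate(string_list, start=1):
    print(index, text)
```

A regular expression keeps the sketch close to the string-based approach described above; handling nested div blocks would need a small tag counter or a parser instead.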

Step 102: analyze each character string in the character string list one by one to identify subject information;

Specifically, each character string in the list is analyzed one by one; when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, the content of that character string is taken as subject information.

Thus, for each character string delimited by div tags, the number of characters outside the HTML tags is compared with the number of characters inside the HTML tags; if the number outside is greater than the number inside and also greater than the predetermined base number, the content of the character string is judged to be subject information.
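The test in step 102 can be sketched in Python as follows. How characters are counted is an assumption made here: the angle brackets are counted as part of a tag, and every character outside a tag, including whitespace, is counted as outside. The names `count_chars` and `is_subject` and the default base number of 50 (the value the description gives for news) are likewise illustrative.

```python
import re

# Anything from "<" to the next ">" is treated as one HTML tag.
TAG = re.compile(r"<[^>]*>")


def count_chars(text):
    """Return (characters outside tags, characters inside tags)."""
    inside = sum(len(match.group(0)) for match in TAG.finditer(text))
    outside = len(TAG.sub("", text))
    return outside, inside


def is_subject(text, base_number=50):
    """Apply both conditions: outside > inside and outside > base number."""
    outside, inside = count_chars(text)
    return outside > inside and outside > base_number


article = "<p>" + "x" * 60 + "</p>"  # 60 characters of text, 7 of markup
navigation = '<a href="http://example.com/a/very/long/link">nav</a>'

print(is_subject(article))     # long text, little markup
print(is_subject(navigation))  # short text, mostly markup
```

Link-heavy blocks such as navigation bars fail the first condition because their markup outweighs their text, while short captions fail the second.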

By implementing the method for acquiring internet subject information provided by the invention, there is no need to be tied to a single web page template. A general method is provided in which the HTML source code is divided into different character strings at div tags and each character string is analyzed, so that all web pages on the internet can be analyzed and processed accurately to obtain subject information.

Referring to Fig. 2, a schematic flow chart of embodiment two of the method for acquiring internet subject information provided by the present invention is shown.

It should first be noted that the method provided by the embodiments of the invention can be used to collect either news subject information or log (blog) subject information. Depending on which of the two is to be collected, the base number against which the number of characters outside HTML tags is compared can be set to a different value.

Step 200: download an Extensible Markup Language (XML) page and extract list information;

Specifically, if news subject information is to be collected, the XML page is downloaded and news list information is extracted from it; if log subject information is to be collected, log list information is extracted from the downloaded XML page;

Step 201: download the pages at the uniform resource locators (URLs) in the list information, so as to obtain the HTML source code of the web pages that contain the subject information.

Specifically, the HTML source code of the page containing the news subject information, or of the page containing the log subject information, can be obtained.
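Steps 200 and 201 might look like the following Python sketch. The description does not specify the layout of the XML page, so the RSS-like structure with `<item>` and `<link>` elements, the example URLs and the function names `extract_urls` and `fetch_html` are all assumptions made here for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def extract_urls(xml_text):
    """Collect the text of every <link> element (assumed list layout)."""
    root = ET.fromstring(xml_text)
    return [
        link.text.strip()
        for link in root.iter("link")
        if link.text and link.text.strip()
    ]


def fetch_html(url, encoding="utf-8"):
    """Download one page and return its HTML source as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode(encoding, errors="replace")


# Made-up list page standing in for a downloaded news list.
FEED = """<rss><channel>
<item><title>First</title><link>http://example.com/news/1.html</link></item>
<item><title>Second</title><link>http://example.com/news/2.html</link></item>
</channel></rss>"""

urls = extract_urls(FEED)
print(urls)
# Each URL would then be passed to fetch_html(url); the call is left out
# here so that the sketch runs without network access.
```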

Step 202: filter out the HTML tags in the HTML source code that are irrelevant to the subject information (that is, both the opening <tag> and the closing </tag>).

Specifically, HTML tags unrelated to the news or log subject information are filtered out of the HTML source code, for example script tags, style tags, object tags, iframe tags and form tags;
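A minimal Python sketch of the filtering in step 202 follows. It assumes that filtering a tag means removing the whole element, that is, the opening tag, the closing tag and everything between them; the description does not state this explicitly. The names `NOISE_TAGS` and `strip_noise` are illustrative.

```python
import re

# The five tag names listed in the description.
NOISE_TAGS = ("script", "style", "object", "iframe", "form")

# One pattern for all five: opening tag, shortest body, matching closing tag.
NOISE = re.compile(
    r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(NOISE_TAGS),
    re.IGNORECASE | re.DOTALL,
)


def strip_noise(html):
    """Remove script, style, object, iframe and form elements with their content."""
    return NOISE.sub("", html)


page = "<div><script>var a = 1;</script><p>Body</p><style>p{}</style></div>"
print(strip_noise(page))
```

Removing these elements before the div split means their contents can never be counted as characters outside tags in the later analysis.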

Step 203: obtain the HTML source code of the internet web page;

In this embodiment, because the HTML tags irrelevant to the news or log subject information have already been filtered out of the HTML source code, the character strings can be analyzed more efficiently than in the previous embodiment, which lays a foundation for improving the accuracy of subject information collection.

Step 204: divide the HTML source code into different character strings using the div tag as the delimiting tag, and form the character strings into a character string list.

The first character string is:

Second character string is:

Step 205: analyze each character string in the character string list one by one; when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, take the content of that character string as subject information.

It should be noted that if the subject information to be collected is news subject information, the base number is set to 50; content with fewer characters than this is generally not news subject information;

To further improve the accuracy of subject information collection beyond embodiment one, embodiment two further comprises:

Step 206: obtain the character string in the character string list that has the largest number of characters outside HTML tags;

Step 207: analyze the character strings immediately before and after that character string in the list;

Specifically, if the preceding and/or following character string satisfies the condition that its number of characters outside HTML tags is greater than its number of characters inside HTML tags and also greater than the set base number, the content of the preceding and/or following character string is taken as subject information.

Step 208: analyze the preceding and/or following character string to obtain a result character string;

Specifically, if the preceding and/or following character string satisfies the condition that its number of characters outside HTML tags is greater than its number of characters inside HTML tags and also greater than the set base number, the preceding and/or following character string is combined with the character string having the largest number of characters outside HTML tags to form the result character string;
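Steps 206 to 208 can be sketched in Python as below. The sketch assumes that the qualifying neighbours are joined to the largest string in their original page order, and it reuses the same counting convention as an assumption (angle brackets belong to the tag, whitespace counts as outside). The function names are illustrative.

```python
import re

TAG = re.compile(r"<[^>]*>")


def outside_inside(text):
    """Return (characters outside tags, characters inside tags)."""
    inside = sum(len(match.group(0)) for match in TAG.finditer(text))
    return len(TAG.sub("", text)), inside


def qualifies(text, base_number):
    outside, inside = outside_inside(text)
    return outside > inside and outside > base_number


def build_result(strings, base_number=50):
    """Join the string with the most outside characters and its qualifying neighbours."""
    if not strings:
        return ""
    # Step 206: index of the string with the most characters outside tags.
    center = max(range(len(strings)), key=lambda i: outside_inside(strings[i])[0])
    parts = []
    # Steps 207 and 208: test the preceding and following strings.
    if center > 0 and qualifies(strings[center - 1], base_number):
        parts.append(strings[center - 1])
    parts.append(strings[center])
    if center + 1 < len(strings) and qualifies(strings[center + 1], base_number):
        parts.append(strings[center + 1])
    return "".join(parts)


strings = [
    '<a href="/home">Home</a>',
    "<p>" + "a" * 60 + "</p>",   # qualifying neighbour
    "<p>" + "b" * 200 + "</p>",  # most characters outside tags
    "<p>" + "c" * 10 + "</p>",   # too short to qualify
]
result = build_result(strings)
print(len(result))
```

Checking only the immediate neighbours reflects the observation that an article split across div blocks tends to occupy adjacent entries in the list.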

Step 209: process the result character string to collect the subject information.

Finally, step 210: save the character string processed in step 209 together with the subject information it contains, for use in secondary development.

By implementing the method for acquiring internet subject information provided by the invention, there is no need to be tied to a single web page template. A general method is provided in which the HTML source code is first divided into different character strings at div tags and each character string is analyzed, so that all web pages on the internet can be analyzed and processed accurately; the analyzed character strings are then analyzed a second time, which further improves the accuracy of web page analysis, so that subject information is collected quickly and accurately.

Referring to Fig. 3, a schematic flow chart of embodiment three of the method for acquiring internet subject information of the present invention is shown.

This embodiment describes step 209 of embodiment two in detail. It specifically comprises:

Step 300: compare the characters outside each HTML tag in the result character string with filter keywords, and filter out the characters irrelevant to the subject information to be collected;

The filter keywords are predetermined and are specifically illegal keywords, or keywords of noise information irrelevant to the subject information, such as advertisement keywords, navigation bar keywords and survey keywords;
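One possible reading of step 300 is sketched below in Python. The description does not say how much text is removed when a keyword matches; the sketch assumes that a whole run of text between two tags is dropped if it contains any filter keyword. The keyword list and the function name `filter_text` are hypothetical.

```python
import re

# Splitting on a capturing group keeps the tags in the resulting list.
TAG_SPLIT = re.compile(r"(<[^>]*>)")


def filter_text(result_string, keywords):
    """Drop every text run (between tags) that contains a filter keyword."""
    kept = []
    for part in TAG_SPLIT.split(result_string):
        is_tag = part.startswith("<")
        if not is_tag and any(keyword in part for keyword in keywords):
            continue  # noise text: discard it, leave the surrounding tags
        kept.append(part)
    return "".join(kept)


KEYWORDS = ["Advertisement", "Site map"]  # hypothetical filter keywords
raw = "<p>Breaking story text.</p><span>Advertisement: buy now</span>"
print(filter_text(raw, KEYWORDS))
```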

Step 301: extract all image tags from the filtered result character string, then download and save the picture resources; the width and height of each picture can also be obtained at the same time;

Step 302: replace the network resource paths in the result character string with local resource paths;

Step 303: keep the paragraph (p) tags and image tags in the result character string, and delete all other tags from it.

Finally, the character string processed in steps 300 to 303 and the subject information it contains are saved for use in secondary development.
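Steps 301 to 303 can be sketched in Python as follows. The sketch assumes that the image tag is the HTML `<img>` tag and that a local path is formed from a directory name plus the file name of the original URL; the actual download and the reading of picture width and height are omitted. The directory name `images` and all function names are illustrative.

```python
import re
from posixpath import basename
from urllib.parse import urlparse

IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
ANY_TAG = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")
KEEP = {"p", "img"}


def image_urls(text):
    """Step 301 (partly): list the src of every <img> tag."""
    return IMG_SRC.findall(text)


def localize(text, local_dir="images"):
    """Step 302: swap each image URL for a path under a local directory."""
    for url in image_urls(text):
        name = basename(urlparse(url).path)
        text = text.replace(url, local_dir + "/" + name)
    return text


def keep_p_and_img(text):
    """Step 303: delete every tag except <p>, </p> and <img>; text is kept."""
    return ANY_TAG.sub(
        lambda m: m.group(0) if m.group(1).lower() in KEEP else "", text
    )


raw = (
    "<div><p>Text <b>bold</b></p>"
    '<img src="http://example.com/pics/a.jpg" width="10"></div>'
)
cleaned = keep_p_and_img(localize(raw))
print(cleaned)
```

Keeping only paragraph and image tags preserves the reading order and the pictures while discarding layout markup that is specific to the source site.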

By implementing the method for acquiring internet subject information provided by the invention, on the basis of the accurate and fast subject information collection of embodiments one and two, the collected subject information is further cleaned; the original format of the news item or log is kept, and the pictures of the original web page can also be kept, so the result is better suited to secondary development.

Referring to Fig. 4, a schematic structural diagram of embodiment one of the device for acquiring internet subject information of the present invention is shown.

The device of this embodiment comprises a source code acquisition module 10, a character string forming module 11 and a first string analysis module 12, whose functions are as follows:

The source code acquisition module 10 is configured to obtain the HTML source code of an internet web page;

In a specific implementation, the source code acquisition module 10 performs step 100 of embodiment one of the foregoing method for acquiring internet subject information (hereinafter, method embodiment one);

The character string forming module 11 is configured to divide the HTML source code into different character strings using the div tag as the delimiting tag, and to form the character strings into a character string list;

The div tag is one of the HTML tags; it is an element that provides structure and background for block-level content in HTML source code. It consists of a start tag <div> and an end tag </div>, and everything between the two tags constitutes the block. The div tag is also called a division tag; its function is to set the placement of text, pictures, tables and the like. Div tags appear in the HTML source code of web pages built from any template. In a specific implementation, the character string forming module 11 of this embodiment performs step 101 of method embodiment one, that is, it divides the HTML source code into character strings at div tags without needing to consider which template the page uses; dividing the HTML source code into different character strings and forming the character string list in this way is therefore general;

The first character string is:

Second character string is:

The first string analysis module 12 is configured to analyze one by one each character string in the character string list formed by the character string forming module 11; when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, the content of that character string is taken as subject information.

Specifically, for each character string delimited by div tags, the first string analysis module 12 compares the number of characters outside the HTML tags with the number of characters inside the HTML tags; if the number outside is greater than the number inside and also greater than the predetermined base number, the content of the character string is judged to be subject information. In a specific implementation, the first string analysis module 12 performs step 102 of method embodiment one.

By implementing the device for acquiring internet subject information provided by the invention, there is no need to be tied to a single web page template. A general approach is provided in which the HTML source code is divided into different character strings at div tags and each character string is analyzed, so that all web pages on the internet can be analyzed and processed accurately to obtain subject information.

Referring to Fig. 5, a schematic structural diagram of embodiment two of the device for acquiring internet subject information provided by the present invention is shown.

It should first be noted that the device provided by the embodiments of the invention can be used to collect either news subject information or log subject information.

In addition to the source code acquisition module 10, the character string forming module 11 and the first string analysis module 12 of embodiment one of the foregoing device (hereinafter, device embodiment one), the device of this embodiment further comprises a base number setting module 13, an information downloading module 14, an information filtering module 15, a second string analysis module 16, a string processing module 17 and an information acquisition module 18, whose functions are as follows:

The base number setting module 13 is configured to set the base number to different values according to whether the subject information to be collected is news subject information or log subject information;

Specifically, depending on whether news subject information or log subject information is to be collected, the base number setting module 13 can set a different base number against which the number of characters outside HTML tags in a character string is compared.

The device of embodiment two further comprises:

The information downloading module 14, configured to download an Extensible Markup Language (XML) page and extract list information, and to download the pages at the uniform resource locators (URLs) in the list information and send them to the source code acquisition module 10 for processing.

Specifically, if news subject information is to be collected, the information downloading module 14 downloads the XML page and extracts news list information from it; if log subject information is to be collected, the information downloading module 14 extracts log list information from the downloaded XML page; it then downloads the pages at the URLs in the list information;

In a specific implementation, the information downloading module 14 performs steps 200 and 201 of method embodiment two;

After this, the source code acquisition module 10 obtains the HTML source code according to the list information and the URLs;

The information filtering module 15 is configured to filter out, from the HTML source code obtained by the source code acquisition module 10, the HTML tags irrelevant to the subject information.

Specifically, the information filtering module 15 filters out HTML tags irrelevant to the subject information, such as script tags, style tags, object tags, iframe tags and form tags;

In a specific implementation, the information filtering module 15 performs step 202 of method embodiment two;

After this, the character string forming module 11 divides the HTML source code into different character strings using the div tag as the delimiting tag and forms the character strings into a character string list; the first string analysis module 12 then analyzes each character string in the list one by one, and when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than the set base number, the content of that character string is taken as subject information;

To further improve the accuracy of collecting subject information on the basis of device embodiment one, the device in this embodiment two also comprises:

The second string analysis module 16, configured to obtain, after the analysis by the first string analysis module 12, the character string in the character string list that has the largest number of characters outside the HTML tags, and to analyze the preceding character string and the following character string of that character string in the list. If the preceding character string and/or the following character string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags, and that its number of characters outside the HTML tags is greater than the set base number, the content contained in the preceding character string and/or the following character string is taken as subject information. In a specific implementation, the second string analysis module 16 performs steps 206 to 207 of the foregoing method embodiment two.
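The second analysis pass can be sketched in the same way. The function names are hypothetical, and counting the tag markup (angle brackets included) as the "characters inside the HTML tags" is again an assumption.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")


def outside_inside(segment):
    """Return (characters outside tags, characters inside tags)."""
    outside = len(TAG_RE.sub("", segment))
    return outside, len(segment) - outside


def meets_condition(segment, base):
    outside, inside = outside_inside(segment)
    return outside > inside and outside > base


def analyze_neighbours(strings, base):
    """Find the string with the most text outside tags, then test its neighbours.

    Returns the index of that string and the indices of the preceding and
    following strings that satisfy the same two conditions.
    """
    best = max(range(len(strings)), key=lambda i: outside_inside(strings[i])[0])
    neighbours = [
        i
        for i in (best - 1, best + 1)
        if 0 <= i < len(strings) and meets_condition(strings[i], base)
    ]
    return best, neighbours


strings = [
    '<div><a href="/">Home</a>',
    "<div><p>" + "a" * 30 + "</p>",
    "<div><p>" + "b" * 80 + "</p>",
    '<div><span class="copyright">c</span>',
]
print(analyze_neighbours(strings, base=20))
```

Here the third string carries the most text, its preceding string also qualifies, and the following copyright string does not.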

Device in the present embodiment two also comprises:

The string processing module 17, configured to, when the preceding character string and/or the following character string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags, and that its number of characters outside the HTML tags is greater than the set base number, take that preceding character string and/or following character string together with the character string having the largest number of characters outside the HTML tags as the result character string; and to process the result character string so as to collect the subject information. In a specific implementation, the string processing module 17 performs steps 208 to 209 of the foregoing method embodiment two.
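Forming the result character string can be sketched as a simple merge. The description does not state how the strings are combined; joining them in their original page order is an assumption, as is the function name `build_result_string`.

```python
def build_result_string(strings, best, neighbours):
    """Join the largest string with its qualifying neighbours, in page order.

    `best` is the index of the string with the most characters outside the
    HTML tags; `neighbours` holds the indices of the preceding and/or
    following strings that satisfied the conditions.
    """
    keep = sorted(set(neighbours) | {best})
    return "".join(strings[i] for i in keep)


parts = ["<div>nav", "<div>intro", "<div>body", "<div>footer"]
print(build_result_string(parts, best=2, neighbours=[1]))
```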

Device in the present embodiment two also comprises:

The information acquisition module 18, configured to save the result character string processed by the string processing module 17, together with the subject information contained in that result character string, for use in the user's secondary development.

The device for acquiring internet subject information provided by the invention does not need to adhere to a unified web page template; it provides a general method. The HTML source code is first divided into different character strings using the div tag, and each character string is analyzed and processed, so that web pages on the internet can be analyzed and processed regardless of their template. The analyzed character strings are then subjected to a second analysis, which further improves the accuracy of the web page analysis, so that subject information is collected quickly and accurately.

Fig. 6 is a structural schematic diagram of embodiment three of a device for acquiring internet subject information according to the present invention.

In this embodiment, the string processing module 17 of the foregoing device embodiment two is described in detail.

The string processing module 17 specifically comprises a character filtering unit 170, a picture download unit 171, a path replacement unit 172 and a tag processing unit 173, whose functions are as follows:

The character filtering unit 170 is configured to compare the characters outside each HTML tag in the result character string with filtering keywords, and to filter out the characters irrelevant to the subject information. Specifically, the filtering keywords are predetermined and identify noise information irrelevant to the subject information, for example illegal keywords, advertisement keywords, navigation bar keywords and survey keywords. In a specific implementation, the character filtering unit 170 performs step 300 of the foregoing method embodiment three.

The picture download unit 171 is configured to extract all image tags in the result character string after filtering by the character filtering unit 170, and to download and save the picture resources. The width and height of each picture may also be obtained at the same time.

The path replacement unit 172 is configured to replace the network resource paths in the result character string with local resource paths.

The tag processing unit 173 is configured to keep the paragraph (p) tags and image tags in the result character string and to delete the other tags in the result character string.
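The four units above can be sketched together as follows. This is an illustration only: the regular expressions, the function names, the choice to drop a whole text run when it contains a filtering keyword, and the mapping of a network path to `local_dir` plus the file name are all assumptions, and the actual picture download is omitted. The sketch also assumes that the image tag is written as `img` in the HTML source.

```python
import re

IMG_SRC_RE = re.compile(r'<img\b[^>]*?\bsrc=["\']([^"\']+)["\']', re.IGNORECASE)


def filter_keywords(result, keywords):
    """Drop text runs (outside tags) that contain any filtering keyword."""
    pieces = re.split(r"(<[^>]*>)", result)  # tags are kept as separate pieces
    kept = [p for p in pieces
            if p.startswith("<") or not any(k in p for k in keywords)]
    return "".join(kept)


def extract_image_urls(result):
    """Collect the src of every image tag; downloading is left out here."""
    return IMG_SRC_RE.findall(result)


def replace_paths(result, local_dir):
    """Point each image src at a local copy named after the remote file."""
    def to_local(match):
        url = match.group(1)
        name = url.rsplit("/", 1)[-1]
        return match.group(0).replace(url, f"{local_dir}/{name}")
    return IMG_SRC_RE.sub(to_local, result)


def keep_p_and_img(result):
    """Delete every tag except <p>, </p> and <img ...>."""
    return re.sub(r"<(?!/?p\b)(?!img\b)[^>]*>", "", result, flags=re.IGNORECASE)


sample = ('<div><p>Real news text.</p><span>Advertisement: buy now</span>'
          '<img src="http://example.com/pics/a.jpg"></div>')
step1 = filter_keywords(sample, ["Advertisement"])
step2 = replace_paths(step1, "images")
print(keep_p_and_img(step2))
```

Applying the units in this order mirrors the order in which they are described; the tag processing step runs last so that the image tags are still intact when their paths are rewritten.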

The device for acquiring internet subject information provided by the invention, building on the accurate and fast collection of subject information achieved by device embodiment one and device embodiment two, further purifies the collected subject information while preserving the original format of the news or blog entry; the pictures in the original web page can also be retained, so the result is better suited to secondary development.

One of ordinary skill in the art will appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above are preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims

1. A method for acquiring internet subject information, characterized by comprising:

Analyzing each character string in the character string list one by one, and when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, taking the content contained in that character string as subject information;

2. The method of claim 1, characterized in that the subject information is news subject information or blog subject information.

3. The method of claim 2, characterized in that the base number is set to different values depending on whether the subject information to be collected is news subject information or blog subject information.

4. The method of claim 3, characterized in that, when the subject information is news subject information, the method comprises, before the step of obtaining the HyperText Markup Language (HTML) source code of an internet web page:

Downloading an Extensible Markup Language (XML) page and extracting list information;

Downloading the uniform resource locators (URLs) in the list information, so as to obtain the HTML source code of the web page where the subject information is located.

5. The method of claim 4, characterized in that, after the HTML source code is obtained, the method comprises:

Filtering out the HTML tags in the HTML source code that are irrelevant to the subject information.

6. The method of any one of claims 1 to 5, characterized in that, after the content contained in the preceding character string and/or the following character string is taken as subject information, the method comprises:

If the preceding character string and/or the following character string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags, and that its number of characters outside the HTML tags is greater than the set base number, taking that preceding character string and/or following character string together with the character string having the largest number of characters outside the HTML tags as a result character string;

Processing the result character string to collect the subject information.

7. The method of claim 6, characterized in that processing the result character string specifically comprises:

Comparing the characters outside each HTML tag in the result character string with filtering keywords, and filtering out the characters irrelevant to the subject information to be collected;

Extracting all image tags in the result character string after the filtering, and downloading and saving the picture resources;

Replacing the network resource paths in the result character string with local resource paths;

Keeping the paragraph (p) tags and image tags in the result character string, and deleting the other tags in the result character string.

8. The method of claim 7, characterized in that the processed result character string and the subject information contained in that result character string are saved for use in secondary development.

9. A device for acquiring internet subject information, characterized by comprising:

A first string analysis module, configured to analyze each character string in the character string list one by one, and when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, to take the content contained in that character string as subject information;

10. The device of claim 9, characterized in that the subject information is news subject information or blog subject information.

11. The device of claim 10, characterized in that the device also comprises:

A base number setting module, configured to set the base number to different values depending on whether the subject information to be collected is news subject information or blog subject information.

12. The device of claim 11, characterized in that the device also comprises:

An information downloading module, configured to download an Extensible Markup Language (XML) page and extract list information, and to download the uniform resource locators (URLs) in the list information and send them to the source code acquisition module for processing.

13. The device of claim 12, characterized in that the device also comprises:

An information filtering module, configured to filter out, from the HTML source code obtained by the source code acquisition module, the HTML tags irrelevant to the subject information.

14. The device of claim 13, characterized in that the device also comprises:

A string processing module, configured to, when the preceding character string and/or the following character string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags, and that its number of characters outside the HTML tags is greater than the set base number, take that preceding character string and/or following character string together with the character string having the largest number of characters outside the HTML tags as a result character string; and to process the result character string so as to collect the subject information.

15. The device of claim 14, characterized in that the string processing module specifically comprises:

A character filtering unit, configured to compare the characters outside each HTML tag in the result character string with filtering keywords, and to filter out the characters irrelevant to the subject information;

A picture download unit, configured to extract all image tags in the result character string after filtering by the character filtering unit, and to download and save the picture resources;

A path replacement unit, configured to replace the network resource paths in the result character string with local resource paths;

A tag processing unit, configured to keep the paragraph (p) tags and image tags in the result character string and to delete the other tags in the result character string.

16. The device of claim 15, characterized in that the device also comprises:

An information acquisition module, configured to save the result character string processed by the string processing module, together with the subject information contained in that result character string, for use in the user's secondary development.