CN101246494A - Internet web page conversion method, system and equipment - Google Patents

Internet web page conversion method, system and equipment

Publication number
CN101246494A
Authority
CN
China
Prior art keywords
webpage
web page
subject content
internet web
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008100655972A
Other languages
Chinese (zh)
Other versions
CN101246494B (en)
Inventor
陈虓将
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN2008100655972A
Publication of CN101246494A
Application granted
Publication of CN101246494B
Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of Internet information processing and provides a method, a system and a device for Internet web page conversion. The method comprises the following steps: parsing an Internet web page that has been read; extracting the subject content from the parsed web page; and converting the extracted subject content and outputting a corresponding XHTML web page. Because the subject content that the user cares about is extracted before the Internet web page is converted, and only that content is converted into the XHTML web page, the length and size of the converted page are greatly reduced and the bandwidth pressure on the server is eased. The subject content of the page stands out, browsing becomes faster, and users can search for or browse information more conveniently.

Description

Internet web page conversion method, system and equipment
Technical field
The invention belongs to the field of Internet information processing, and in particular relates to an Internet web page conversion method, system and equipment.
Background art
With the development of Internet technology, wireless Internet technology is also advancing rapidly, and users can search for or browse information on the wireless Internet through wireless terminals such as mobile phones. At present, most resources on the Internet are web pages in HyperText Markup Language (HTML) format. Because HTML code is often non-standard and bloated, the browser of a wireless terminal has to be sufficiently intelligent and large to display HTML correctly. For this reason, the World Wide Web Consortium (W3C) defined the Extensible HyperText Markup Language (XHTML).
Because HTML web pages currently far outnumber XHTML web pages, most of the information that users search for or browse resides in HTML web pages. HTML web pages therefore need to be converted into XHTML web pages so that wireless Internet users can search or browse directly on a wireless terminal.
The basic principle of web page conversion is to obtain the user's request and extract the address of the original HTML web page; the system then automatically fetches that page and parses, converts and stores it. At present, when an HTML web page is converted into an XHTML web page, the entire content of the original HTML page is converted, so the XHTML page keeps everything that was in the original. Because the converted XHTML page pushes the entire content of the original HTML page to the user, the amount of data transmitted is large, which wastes transmission bandwidth and increases the load on the server. In addition, the converted XHTML page contains much information that the user does not care about, which makes it harder for the user to obtain the information actually needed and lengthens searching and browsing. Moreover, receiving and displaying information that the user does not care about causes considerable communication delay on the wireless terminal, slows down the user's access to information, and makes searching and browsing inconvenient.
Summary of the invention
The purpose of the embodiments of the invention is to provide an Internet web page conversion method that solves the following problem in the prior art: when an HTML web page is converted into an XHTML web page, the entire content of the original HTML page is kept, which wastes transmission bandwidth, increases the load on the server, and makes it inconvenient for the user to search for and browse information.
The embodiments of the invention are realized as follows: an Internet web page conversion method comprising the following steps:
parsing the Internet web page that has been read;
extracting the subject content from the parsed Internet web page; and
converting the extracted subject content and outputting the corresponding XHTML web page.
Another purpose of the embodiments of the invention is to provide an Internet web page conversion system, the system comprising:
a web page parsing unit, configured to parse the Internet web page that has been read;
a web page content purification unit, configured to extract the subject content from the parsed Internet web page; and
a conversion output unit, configured to convert the extracted subject content and output the corresponding XHTML web page.
A further purpose of the embodiments of the invention is to provide a server that includes the above Internet web page conversion system.
In the embodiments of the invention, before an Internet web page is converted into an XHTML web page, the subject content that the user cares about is extracted from the page, and only the extracted subject content is converted into the XHTML web page. This greatly reduces the length and size of the converted page, eases the bandwidth pressure on the server, keeps the subject content prominent, speeds up browsing, and makes it easier for the user to search for or browse information.
Brief description of the drawings
Fig. 1 is a flowchart of the Internet web page conversion method provided by an embodiment of the invention;
Fig. 2 is a structural diagram of the Internet web page conversion system provided by an embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the invention and do not limit it.
In the embodiments of the invention, before an Internet web page is converted into an XHTML web page, the subject content that the user cares about is extracted from the page, and only the extracted subject content is converted into the XHTML web page. This greatly reduces the length and size of the converted page, eases the bandwidth pressure on the server, and keeps the subject content prominent, which suits searching or browsing on a wireless terminal.
Fig. 1 shows the flow of the Internet web page conversion method provided by an embodiment of the invention, described in detail below:
In step S101, the Internet web page that has been read is parsed;
In the embodiments of the invention, parsing turns the content of the Internet web page into a Document Object Model (DOM) tree.
In step S102, the subject content is extracted from the parsed Internet web page;
In one embodiment of the invention, to keep the extraction of subject content accurate and efficient, the type of the Internet web page is determined first. The pages are classified, each type of page is then handled differently, and the corresponding subject content is extracted.
In the embodiments of the invention, Internet web pages are divided into two classes. Pages of the first class consist mainly of hyperlink text; they only provide a directory, that is, a common entry point to other pages with related content. They are called "navigation-type web pages" in the embodiments of the invention; examples are a website home page and the first-level directories under it. Pages of the second class contain actual content themselves and are usually the final-level pages that show specific information. They are called "theme-type web pages" in the embodiments of the invention; examples are online novels, news items and blogs.
A theme-type web page usually contains two parts. One part carries the subject information of the page, for example the news text in a news page, and is called "subject content" in the embodiments of the invention. The other part is unrelated to the subject, for example navigation bars, advertisements, copyright notices and questionnaires, and is called "noise content" in the embodiments of the invention. Clearly, for a theme-type page the subject content is the most important part and the part the user cares about most, so it must be kept in the converted XHTML page. The noise content is secondary; the user pays little or no attention to it, so it can be deleted from the converted XHTML page. For a navigation-type page, apart from a small amount of noise content, all parts of the content are equally important and can all be regarded as "subject content", so they all need to be kept.
In the embodiments of the invention, for a theme-type web page, the body text and the title are taken out of the page to form a new page. For one special class of navigation-type pages, namely forum-type pages such as BBS boards and forums, the title column and the navigation information are extracted so that the page is suitable for browsing on a wireless terminal.
To reduce the page content further, the embodiments of the invention can also remove any noise content remaining in the extracted subject content before the Internet web page is converted into an XHTML web page.
In step S103, the extracted subject content is converted and the corresponding XHTML web page is output.
In the above flow, once the content of the Internet web page has been parsed into a DOM tree, the page is judged to be either a theme-type or a navigation-type web page. A page is judged to be theme-type if it contains a text block and a title block. To make the judgment, the non-link texts are taken from the DOM tree one by one. For each, the method checks whether the subtree rooted at the parent element of that text can constitute the body text of the subject content of a theme-type page. If it can, the text block and title block of the subject content are located. Otherwise the next non-link text is tried, until a text block and title block are found. If every non-link text has been tried and no body text has been found, the page is judged to be a navigation-type web page. The specific algorithm is as follows:
1. Starting from the root element, traverse the DOM tree depth-first and look for the first non-link text. If none is found, the page is judged to be a navigation-type web page and the algorithm ends; otherwise go to 2;
2. Check whether the subtree rooted at the parent element of this text can constitute the body text of the subject content of a theme-type page. If it cannot, go to 3; otherwise go to 4;
3. Continue traversing the DOM tree from the position of the previous non-link text and look for the next non-link text. If the whole DOM tree has been traversed without finding one, the page is judged to be a navigation-type web page and the algorithm ends; otherwise go to 2;
4. Starting from the parent element of the non-link text, search upward in the DOM tree for the text block and title block of the subject content. If no title block is found, go to 3; otherwise the page is judged to be a theme-type web page and the algorithm ends.
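The four steps above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `Node` class, the function names, and the two callbacks `is_body_text` and `find_title` (stand-ins for the body-text test and the title search described later in the text) are all assumptions introduced here.

```python
class Node:
    """Minimal DOM node (hypothetical helper); tag "#text" marks a text node."""
    def __init__(self, tag, text="", children=()):
        self.tag, self.text, self.parent = tag, text, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def non_link_texts(node, inside_link=False):
    """Yield non-empty text nodes depth-first, skipping text inside an a element."""
    inside_link = inside_link or node.tag == "a"
    if node.tag == "#text" and not inside_link and node.text.strip():
        yield node
    for child in node.children:
        yield from non_link_texts(child, inside_link)


def classify_page(root, is_body_text, find_title):
    """Return "theme" or "navigation", following steps 1 to 4."""
    for text_node in non_link_texts(root):           # steps 1 and 3
        candidate = text_node.parent
        if candidate is None or not is_body_text(candidate):
            continue                                  # step 2 failed: try the next text
        if find_title(candidate) is not None:         # step 4
            return "theme"
    return "navigation"                               # no body text with a title was found


# Two tiny example trees (made up for illustration).
article = Node("html", children=[Node("body", children=[
    Node("a", children=[Node("#text", "Home")]),
    Node("p", children=[Node("#text", "A long paragraph, with punctuation.")]),
])])
links_only = Node("html", children=[Node("body", children=[
    Node("a", children=[Node("#text", "News")]),
    Node("a", children=[Node("#text", "Sport")]),
])])
```

Passing the two tests in as callbacks keeps the traversal separate from the statistical rules, which the text says may be tuned in practice.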
According to the above algorithm, a page can only be judged to be a theme-type web page if the body text of the subject content can be found. In the embodiments of the invention, the subtree rooted at the parent element of the non-link text is traversed. The total numbers of full stops and commas contained in all text elements under this subtree are counted and denoted x and y respectively. The total number of elements other than p elements (which start a new paragraph) and br elements (which insert a line break) is counted and denoted n. The total number of a elements (which mark the source or destination of a hyperlink) is counted and denoted a. A ratio threshold RATE_AWITHNEWSSIGN is also set. The basic principle for deciding whether the subtree rooted at the parent element of the non-link text is body text of the subject content is to test the values x, y, n and a against the formal characteristics of body text. If the values deviate from the statistics of body text, the subtree is judged not to be body text; if they satisfy all the statistical rules, the subtree is judged to be body text. An example algorithm is as follows:
1. If n > 100, the subtree is judged not to be body text; otherwise go to 2;
2. If n > 30 and x or y is 0, the subtree is judged not to be body text; otherwise go to 3;
3. If x and y are both 0, the subtree is judged not to be body text; otherwise go to 4;
4. If x ≥ 0 and y ≥ 0, and a/(x+y) > RATE_AWITHNEWSSIGN, the subtree is judged not to be body text; otherwise go to 5;
5. If the text under this subtree contains fewer than 80 characters, the subtree is judged not to be body text; otherwise go to 6;
6. The subtree is judged to be body text.
In the above algorithm, the values 100, 30 and 80 are example statistics and can be adjusted to actual conditions in a specific implementation.
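The six-step test above reduces to a short chain of early exits. The sketch below assumes the counts x, y, n and a and the character count have already been collected from the subtree. The value given to `RATE_AWITHNEWSSIGN` is a placeholder chosen for illustration, since the text does not state one, and the parameter names are hypothetical.

```python
RATE_AWITHNEWSSIGN = 0.5  # placeholder value; the source does not give a number


def is_body_text(x, y, n, a, char_count,
                 max_elements=100, sparse_elements=30, min_chars=80,
                 link_rate=RATE_AWITHNEWSSIGN):
    """Apply steps 1 to 6 to the statistics of one subtree.

    x, y: number of full stops and commas; n: elements other than p and br;
    a: number of a elements; char_count: characters of text under the subtree.
    """
    if n > max_elements:                              # step 1: too much markup
        return False
    if n > sparse_elements and (x == 0 or y == 0):    # step 2
        return False
    if x == 0 and y == 0:                             # step 3: no sentence punctuation
        return False
    if a / (x + y) > link_rate:                       # step 4: x + y > 0 after step 3
        return False
    if char_count < min_chars:                        # step 5: too short
        return False
    return True                                       # step 6
```

Exposing 100, 30 and 80 as keyword arguments reflects the remark that these are example statistics to be adjusted in practice.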
Statistics over a large number of web pages show that elements with related content tend to cluster into blocks. Therefore, once the body text of the subject content has been found, the text block that contains it must be sought upward. In the embodiments of the invention, starting from the root element of the subtree that forms the body text, the search moves upward along the path from that element to the root element of the whole DOM tree. The search stops when it meets an element that represents a block, namely a table element (which arranges its content in rows and columns) or a div element (a container for rendering HTML). The table or div element found in this way is called the text block element.
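The upward search for the text block element can be sketched as a walk along parent pointers. The `Node` class is a hypothetical stand-in for a DOM node, and whether the starting element itself may count as the block is not stated in the text; this sketch assumes it may.

```python
class Node:
    """Minimal DOM node (hypothetical helper); tag "#text" marks a text node."""
    def __init__(self, tag, text="", children=()):
        self.tag, self.text, self.parent = tag, text, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


BLOCK_TAGS = ("table", "div")


def find_text_block(body_text_root):
    """Climb from the body-text subtree root until a table or div is met."""
    node = body_text_root
    while node is not None:
        if node.tag in BLOCK_TAGS:
            return node
        node = node.parent
    return None  # reached the DOM root without meeting a block element


# Example tree (made up for illustration).
paragraph = Node("p", children=[Node("#text", "Body.")])
block = Node("div", children=[paragraph])
page = Node("html", children=[Node("body", children=[block])])
```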
A theme-type web page generally has a title, so the title of the subject content must also be found. In the embodiments of the invention, the text element under the Title element (which contains the title of the document), a child of the Head element (which provides an unordered collection of information about the document), is taken from the DOM tree; this text element is called the form title. The element that represents the title of the subject content is called the title element in the embodiments of the invention. A title element must satisfy some basic attributes: (1) it must be a text element; (2) its text must be neither too long nor too short; (3) it must not be a link; (4) its text must not contain a full stop. An element that satisfies these basic attributes is a candidate title element.
In the embodiments of the invention, an element that satisfies the above basic attributes can be scored according to the formal characteristics of titles, and the element with the highest score is the title element. An example algorithm is as follows:
1. If the text is between 5 and 25 characters long, add 10 points; otherwise add nothing;
2. If the text contains neither a full stop nor a comma, add 10 points; if it contains a comma but no full stop, add 5 points; otherwise add nothing;
3. Add 30 points to the candidate nearest the body text, then 20, 10 and 0 points to successively more distant candidates;
4. If the text of the form title contains the text of this element, add 30 points; otherwise add nothing.
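The four scoring rules can be written as one small function. This is a sketch under assumptions: `distance_rank` (0 for the candidate nearest the body text, 1 for the next, and so on) is a hypothetical way of encoding rule 3, and both Western and Chinese punctuation marks are checked because the source pages may use either.

```python
def score_title(text, distance_rank, form_title):
    """Score a candidate title text with rules 1 to 4 of the example algorithm."""
    score = 0
    if 5 <= len(text) <= 25:                      # rule 1: plausible title length
        score += 10
    has_stop = any(ch in text for ch in ".。")
    has_comma = any(ch in text for ch in ",，")
    if not has_stop and not has_comma:            # rule 2: punctuation
        score += 10
    elif has_comma and not has_stop:
        score += 5
    score += max(0, 30 - 10 * distance_rank)      # rule 3: 30, 20, 10, then 0
    if text and text in form_title:               # rule 4: echoes the form title
        score += 30
    return score
```

For instance, a 23-character candidate with no punctuation that is nearest the body text and appears inside the form title collects all four bonuses.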
The parent element of the body-text element and the root element define a path. Candidate title elements are sought among the nodes on this path and to the left of it, and the candidate with the highest score is chosen as the final title element.
The title of the subject content precedes the text block, so the search for the title element starts from the root element of the text block and covers the left side of the path between that element and the root element of the whole DOM tree. The specific algorithm is as follows:
Initialization: let the root element of the DOM tree be pRoot and the root element of the text block be pTextArea; set the parent element of pTextArea as the current element pCur; set the current index nCurIndex to the index of pTextArea in its parent's list of child elements minus 1; set the Boolean variable isTried = false; and set a score threshold GRADE_THRESHOLD. Then carry out the following steps:
1. If isTried == false, go directly to 2; otherwise score pCur with the scoring algorithm above and, if the score > GRADE_THRESHOLD, put pCur and its score into the candidate queue; whatever the score, go to 2;
2. If nCurIndex ≥ 1 and nCurIndex ≤ the number of child elements of pCur, make the nCurIndex-th child element of pCur the current element pCur, set nCurIndex = 0 and isTried = false, and go to 4; otherwise go to 3;
3. If pCur is on the path from pTextArea to pRoot, set the current index nCurIndex to the index of the node preceding pCur; otherwise set nCurIndex to the index of the node following pCur. Set isTried = true, make the parent of pCur the current element pCur, and go to 4;
4. If the current element is not pRoot, go to 1; otherwise go to 5;
5. Select the highest-scoring element from the candidate queue; the parent element of this node is the title element, and the algorithm ends.
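The index-based walk above is intricate, so the sketch below shows a simplified recursive variant of the same idea rather than a line-by-line transcription: it visits every subtree hanging to the left of the path from the text block to the root, scores the text nodes it finds, and returns the parent of the best one. The `Node` class, the function names, and the `score` callback are assumptions introduced for illustration.

```python
class Node:
    """Minimal DOM node (hypothetical helper); tag "#text" marks a text node."""
    def __init__(self, tag, text="", children=()):
        self.tag, self.text, self.parent = tag, text, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def walk(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def left_of_path(text_block):
    """Yield the nodes of every subtree to the left of the path text_block to root."""
    node = text_block
    while node.parent is not None:
        siblings = node.parent.children
        index = next(i for i, s in enumerate(siblings) if s is node)
        for sibling in reversed(siblings[:index]):    # closest sibling first
            yield from walk(sibling)
        node = node.parent


def find_title_element(text_block, score, threshold):
    """Return the parent of the best-scoring text node above the threshold, or None."""
    best, best_score = None, threshold
    for node in left_of_path(text_block):
        if node.tag != "#text":
            continue
        points = score(node)
        if points > best_score:
            best, best_score = node, points
    return best.parent if best is not None else None


# Example tree (made up for illustration).
headline = Node("h1", children=[Node("#text", "Big headline")])
content = Node("div", children=[Node("p", children=[Node("#text", "Body text.")])])
footer = Node("div", children=[Node("#text", "A very long footer line after the text block")])
page = Node("html", children=[Node("body", children=[
    Node("div", children=[headline]), content, footer])])
```

Text that follows the text block, such as the footer in the example, is never visited, which matches the observation that the title precedes the body text.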
Many theme-type web pages contain a navigation bar, for example navigation information such as "previous page" and "next page" at the end of an online novel. The navigation bar helps the user browse pages related to the subject, so the embodiments of the invention keep it in the subject content. The navigation bar generally follows the body text. In the embodiments of the invention, links are searched for downward starting from the root element of the text block; if a link contains navigation information, it is taken out as the navigation bar. The specific algorithm is as follows:
Initialization: let the root element of the DOM tree be pRoot and the root node of the text block be pTextArea; set the parent node of pTextArea as the current element pCur; set the current index nCurIndex to the index of pTextArea in its parent's list of child elements plus 1. Then carry out the following steps:
1. If nCurIndex exceeds the number of child elements of pCur, go to 3. If nCurIndex does not exceed the number of child nodes of pCur, take the nCurIndex-th child element pSrcChild of pCur. If the element type of this node is a (which marks the source or destination of a hyperlink) and the link text it contains is navigation information, take the parent element of this element as the navigation element and end the algorithm; otherwise go to 2;
2. If pSrcChild has child nodes, set pCur = pSrcChild and nCurIndex = 0, then go to 4; otherwise add 1 to nCurIndex, then go to 4;
3. Set the current index nCurIndex to the index of the current element pCur plus 1, set the parent element of pCur as the current element pCur, then go to 4;
4. If pCur != pRoot or nCurIndex does not exceed the number of child nodes of pCur, go to 1; otherwise the algorithm ends and no navigation element exists.
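A simplified version of the navigation search is sketched below. It replaces the explicit index bookkeeping with a generator over the nodes that follow the text block in document order. The `Node` class, the function names, and the keyword list `NAV_WORDS` are assumptions; the text does not say how link text is recognized as navigation information.

```python
class Node:
    """Minimal DOM node (hypothetical helper); tag "#text" marks a text node."""
    def __init__(self, tag, text="", children=()):
        self.tag, self.text, self.parent = tag, text, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


NAV_WORDS = ("previous page", "next page")  # assumed keywords, for illustration only


def walk(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def following(text_block):
    """Yield the nodes that come after the text block in document order."""
    node = text_block
    while node.parent is not None:
        siblings = node.parent.children
        index = next(i for i, s in enumerate(siblings) if s is node)
        for sibling in siblings[index + 1:]:
            yield from walk(sibling)
        node = node.parent


def link_text(node):
    """Concatenate the text found under a node."""
    return "".join(n.text for n in walk(node) if n.tag == "#text")


def find_navigation_element(text_block, nav_words=NAV_WORDS):
    """Return the parent of the first link after the text block that looks like navigation."""
    for node in following(text_block):
        if node.tag == "a" and any(w in link_text(node).lower() for w in nav_words):
            return node.parent
    return None  # no navigation element exists


# Example tree (made up for illustration).
content = Node("div", children=[Node("#text", "Chapter text.")])
pager = Node("div", children=[
    Node("a", children=[Node("#text", "Previous page")]),
    Node("a", children=[Node("#text", "Next page")])])
footer = Node("div", children=[Node("a", children=[Node("#text", "About us")])])
page = Node("body", children=[content, pager, footer])
```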
In the embodiments of the invention, for a theme-type web page, the root element of the text block, the title element and the navigation element are found by the methods above, and the subtrees rooted at these three elements make up the subject content of the theme-type page. For a theme-type page, all information outside the subject content should be discarded. The embodiment therefore keeps the text block, the subtree rooted at the parent element of the title element, the subtree rooted at the navigation element (if a navigation element exists), and the form title. Every other element outside these three subtrees and the form title is cut from the DOM tree. Then a new html element (which indicates that the document contains HTML elements), head element (which provides an unordered collection of information about the document) and body element (which marks the beginning and end of the document body) are created. The form title becomes a child element of the head element; the title element, the root element of the text block and the navigation element become the first, second and third child elements of the body element; the head and body elements become the first and second child elements of the html element; and the html element becomes the root element. A new DOM tree is thus assembled. For a theme-type page the original DOM tree is no longer needed and is discarded.
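The reassembly described above can be sketched with Python's standard `xml.etree.ElementTree` module. The function name and the sample fragments are hypothetical, and wrapping the form title in a `title` element under `head` is an assumption about how the form title is stored.

```python
import xml.etree.ElementTree as ET


def build_page(form_title, title_el, text_block, nav_el=None):
    """Assemble a new html/head/body tree from the extracted pieces."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")                # first child of html
    ET.SubElement(head, "title").text = form_title    # form title goes under head
    body = ET.SubElement(html, "body")                # second child of html
    for part in (title_el, text_block, nav_el):       # order: title, text block, navigation
        if part is not None:
            body.append(part)
    return html


# Sample fragments (made up for illustration).
title = ET.fromstring("<h1>Headline</h1>")
block = ET.fromstring("<div><p>Body text.</p></div>")
nav = ET.fromstring("<div><a href='2.html'>Next page</a></div>")
page = build_page("Headline - Site", title, block, nav)
markup = ET.tostring(page, encoding="unicode")
```

Building a fresh tree rather than pruning the old one in place mirrors the statement that the original DOM tree is discarded.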
Among navigation-type web pages there is one rather special class: forum-type pages such as BBS boards and forums. The characteristic of this class is a block of content made up of information such as the author of each post, the posting time, the number of replies and the post title. In the embodiments of the invention this block is called the key block. For the user, only the post title column in the key block is important. Because the screen of a wireless terminal is very small, keeping the whole posting record would prevent the post title column from standing out, and skipping past the other information to find the post titles would cost the user time. Therefore, in the embodiments of the invention, when a page is judged to be a BBS or forum page, all information other than the title column is deleted.
Pages of this class have another characteristic: they are usually built with frames, with navigation information in the left or upper frame and the forum content in the right or lower frame. In the embodiments of the invention, the content of the two frames is fetched separately and then merged into one new page, which shows the navigation information and then the forum content from top to bottom. The page content is analyzed to locate the key block, and all columns in the key block other than the title column are deleted. The navigation information is the entry point to each topic and must be kept. The specific algorithm is as follows:
Initialization: set the root element of the DOM tree as the current parent element and the first child element of the root element as the current element. Define the following constants: the minimum number of table elements BBS_MIN_TABLE_NUM, the minimum number of tr elements BBS_MIN_TR_NUM, the minimum number of td elements BBS_MIN_TD_NUM, and the maximum number of td elements BBS_MAX_TD_NUM. Then carry out the following steps:
1. If the current element is a table element (which arranges its content in rows and columns) or a tbody element (which marks rows as the body of a table) and its number of child elements is less than BBS_MIN_TR_NUM, go to 4; otherwise go to 2;
2. Collect statistics on the current table element. If the number of td elements (table cells) under each tr element (table row) of this table lies between BBS_MIN_TD_NUM and BBS_MAX_TD_NUM, go to 3; otherwise go to 6;
3. Collect statistics on the current table element and find the column with the longest link text; this column is the title column. Once the title column has been found, delete all other columns under this table, and the algorithm ends;
4. Collect statistics on the adjacent table elements under the parent element. If there are BBS_MIN_TABLE_NUM adjacent table elements and the index of the column containing links is the same in all of them, take the column with the longest links as the title column and go to 5. If that many table elements cannot be found, go to 6;
5. Delete all non-table elements under the parent element; among all table elements, delete those whose structure differs from the BBS_MIN_TABLE_NUM table elements described in 4; and in the table elements whose structure matches, delete all columns other than the title column. The algorithm ends;
6. If the current element has child elements, set this element as the parent element and its first child element as the current element, then go to 1; otherwise go to 7;
7. If the current element is not the last child element of the parent element, set the element to its right as the current element and go to 1. Otherwise, if the parent element is the root element, the algorithm ends; if not, go to 8;
8. Set the next element after the parent element as the current element and the parent element of the parent element as the new parent element, then go to 1.
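The single-table case (steps 1 to 3) can be sketched on a simplified data model in which a table is a list of rows and each cell is represented by its link text. The constant values below are placeholders, since the text names the constants but gives no numbers, and the function names are hypothetical.

```python
BBS_MIN_TR_NUM = 3   # placeholder values; the source names the constants
BBS_MIN_TD_NUM = 3   # but does not state what they are
BBS_MAX_TD_NUM = 8


def looks_like_post_list(rows):
    """Steps 1 and 2: enough rows, and every row has a plausible number of cells."""
    return (len(rows) >= BBS_MIN_TR_NUM and
            all(BBS_MIN_TD_NUM <= len(row) <= BBS_MAX_TD_NUM for row in rows))


def title_column_index(rows):
    """Step 3: the column whose link text is longest in total."""
    width = min(len(row) for row in rows)
    totals = [sum(len(row[col]) for row in rows) for col in range(width)]
    return totals.index(max(totals))


def keep_title_column(rows):
    """Drop every column except the title column; leave other tables untouched."""
    if not looks_like_post_list(rows):
        return rows
    col = title_column_index(rows)
    return [[row[col]] for row in rows]


# Example post list (made up for illustration): author, post title, reply count.
posts = [
    ["alice", "How do I convert HTML to XHTML?", "12"],
    ["bob", "Parser crashes on nested tables", "3"],
    ["carol", "Welcome thread", "40"],
]
```

The multi-table case in steps 4 and 5 would apply the same column choice across several adjacent tables with identical structure.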
In the embodiments of the invention, whether the page is a theme-type or a navigation-type web page, the noise content in the page is removed before conversion to the XHTML web page, to further reduce unnecessary content. Theme-type and navigation-type pages share some kinds of noise content, for example redundant br elements (which insert a line break) and img elements (which embed an image or video clip in the document) without a src attribute, so the same algorithm can be used to handle the noise content of both.
In the embodiments of the invention, elements that meet any of the following conditions are regarded as noise content in the page:
(1) elements not supported by XHTML;
(2) elements that serve as containers for page layout and have no child elements;
(3) img elements without a "src" attribute;
(4) several ul elements (which render text as a bulleted list) side by side;
(5) redundant br elements;
(6) non-link text that is one character long;
(7) short links;
(8) input elements (which create form input controls) that have no type attribute or whose type attribute is "input" or "password".
In the embodiments of the invention, handling the noise content in a web page amounts to pruning and adjusting nodes of the DOM tree. An example algorithm is as follows:
for each node in the web page do
    if the node's element type is h1, h2, h3, h4, h5, h6, span or another element type not supported by XHTML then
        attach all child nodes of this node to the child list of this node's parent, then delete this node
    end if
    if the node's element type is not br, hr, img, link, meta, input, postfield, body, frame or text, and the node has no child nodes then
        delete this node
    end if
    if the node's element type is img then
        if the node has no "src" attribute then
            delete this node
        else
            create a new node of element type div, move the img node into its child list, then attach the div node to the child list of the img node's original parent
        end if
    else if the node's element type is ul then
        merge it with adjacent nodes of element type ul
    else if the node's element type is br then
        if this node is the only node in its parent's child list then
            delete this node
        else if this node is the first or last child of its parent then
            delete this node
        else if the node preceding this node breaks the line automatically then
            delete this node
        end if
    else if the node is non-link text then
        if the text is only one character long then
            delete this node
        end if
    else if the node is link text then
        if the text is too short and an ancestor of this node has element type li then
            merge the several li elements that contain short link text into one li
        end if
    else if the node's element type is input then
        if the node has no type attribute, or its type attribute is "input" or "password" then
            create a new node of element type div, attach this node to the div's child list, then attach the div node to the child list of the input node's original parent
        end if
    end if
end for
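A partial Python sketch of the pruning pass follows. It covers only four of the rules (img without src, leading or trailing br, one-character non-link text, and empty containers) and leaves out the merging and div-wrapping steps. The `Node` class and function names are assumptions, and the pass is a single bottom-up sweep rather than the exact traversal order of the pseudocode.

```python
KEEP_WHEN_EMPTY = {"br", "hr", "img", "link", "meta", "input",
                   "postfield", "body", "frame", "#text"}


class Node:
    """Minimal DOM node (hypothetical helper); tag "#text" marks a text node."""
    def __init__(self, tag, text="", attrs=None, children=()):
        self.tag, self.text, self.attrs = tag, text, dict(attrs or {})
        self.children = list(children)


def is_noise(node, parent, inside_link):
    """Decide whether one child of parent is noise under the four rules covered here."""
    if node.tag == "img":
        return "src" not in node.attrs                 # image without a source
    if node.tag == "br":                               # only, first or last child
        return parent.children[0] is node or parent.children[-1] is node
    if node.tag == "#text":                            # one-character non-link text
        return not inside_link and len(node.text.strip()) == 1
    return node.tag not in KEEP_WHEN_EMPTY and not node.children  # empty container


def prune(node, inside_link=False):
    """Remove noise bottom-up so that containers emptied by pruning are removed too."""
    inside_link = inside_link or node.tag == "a"
    for child in node.children:
        prune(child, inside_link)
    node.children = [c for c in node.children
                     if not is_noise(c, node, inside_link)]


# Example tree (made up for illustration).
tree = Node("body", children=[
    Node("br"),
    Node("div", children=[Node("img")]),
    Node("p", children=[Node("#text", "Real paragraph text."),
                        Node("br"), Node("#text", "|")]),
    Node("img", attrs={"src": "a.png"}),
])
prune(tree)
```

Because the sweep is single-pass, a br that becomes the last child only after its neighbour is removed is kept; a fuller implementation might repeat the pass until nothing changes.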
Fig. 2 shows the structure of the internet web page converting system that the embodiment of the invention provides, and for convenience of explanation, only shows the part relevant with the embodiment of the invention.This system can built-inly run on internet web page is carried out in the server of conversion process.
Webpage resolution unit 21 is resolved the internet web page that reads, and during specific implementation internet web page contents is resolved to dom tree.
Extract subject content the internet web page of web page contents clean unit 22 after resolving.As one embodiment of the present of invention, when extracting the subject content of internet web page, for accuracy and the efficient that guarantees that subject content is extracted, type of webpage judge module 221 is judged the type of internet web page, type webpage and navigational route type webpage in the embodiment of the invention are the theme the internet web page branch, judgment mode repeats no more as mentioned above.Subject content extraction module 222 extracts corresponding subject content according to the type of internet web page, be specifically as follows text, title and the navigation information of theme type webpage, the perhaps title block and the navigation information of forum's class webpage in the navigational route type webpage, extracting mode repeats no more as mentioned above.As one embodiment of the present of invention, the noise content is removed the module 223 further noise contents of removing in the subject content, and removing method repeats no more as mentioned above.
The conversion output unit 23 converts the extracted subject content and outputs the corresponding XHTML web page.
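To make the division of labour among units 21, 22 and 23 concrete, here is a small end-to-end sketch. All class and function names are hypothetical, the content selection rule is injected rather than implemented, the title is simply read from the `<title>` element, and the XHTML 1.0 Strict DOCTYPE is an assumption (the text above only says "XHTML web page"). The input is assumed to be well-formed markup.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"
DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)


class ParsingUnit:
    """Unit 21: parse the page into a DOM tree."""

    def parse(self, markup):
        # Assumption: well-formed markup. Real-world HTML would need a
        # tolerant parser, which this sketch leaves out.
        return minidom.parseString(markup)


class PurificationUnit:
    """Unit 22: pick out the subject content using an injected rule."""

    def __init__(self, select_content):
        self.select_content = select_content

    def extract(self, doc):
        titles = doc.getElementsByTagName("title")
        has_text = bool(titles) and titles[0].firstChild is not None
        title = titles[0].firstChild.data if has_text else ""
        return title, self.select_content(doc)


class OutputUnit:
    """Unit 23: serialize the extracted content as an XHTML page."""

    def render(self, title, content):
        body = content.toxml() if content is not None else ""
        return (
            f"{DOCTYPE}\n"
            f'<html xmlns="{XHTML_NS}"><head><title>{escape(title)}</title></head>'
            f"<body>{body}</body></html>"
        )


def first_div_with_paragraph(doc):
    # Hypothetical selection rule, used only for this demonstration.
    for div in doc.getElementsByTagName("div"):
        if div.getElementsByTagName("p"):
            return div
    return None


def convert(markup):
    doc = ParsingUnit().parse(markup)
    title, content = PurificationUnit(first_div_with_paragraph).extract(doc)
    return OutputUnit().render(title, content)


page = (
    "<html><head><title>News &amp; views</title></head><body>"
    "<div><a href='/'>Home</a></div>"
    "<div><p>Story text.</p></div></body></html>"
)
print(convert(page))
```

Only the selected subtree reaches the output, so the converted page is shorter than the source page, which is the effect the embodiment aims for.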
In the embodiment of the present invention, before an Internet web page is converted into an XHTML web page, the subject content that the user cares about is extracted from the Internet web page, and only the extracted subject content is converted into the XHTML web page. This greatly reduces the length and storage footprint of the converted page and relieves bandwidth pressure on the server, while keeping the subject content of the page prominent, speeding up page browsing, and making it easier for the user to search or browse information. In addition, handling theme-type web pages and navigation-type web pages differently effectively ensures the efficiency and accuracy of subject content extraction, and removing the noise content of the page further purifies the page and reduces its content, which suits users who search or browse information through wireless terminals.
The above is only a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (11)

1. An Internet web page conversion method, characterized in that the method comprises the following steps:
parsing the Internet web page that has been read;
extracting subject content from the parsed Internet web page;
converting the extracted subject content and outputting the corresponding XHTML web page.
2. The method of claim 1, characterized in that the step of extracting subject content from the parsed Internet web page specifically comprises:
judging whether the Internet web page is a theme-type web page;
when the Internet web page is a theme-type web page, extracting the root element of the text block, the title element and the navigation element of the web page.
3. The method of claim 2, characterized in that the step of converting the extracted subject content and outputting the corresponding XHTML web page further comprises:
retaining the text block of the web page, the subtree whose root node is the parent element of the title element, the subtree whose parent element is the navigation element, and the form title;
cropping from the original DOM tree all elements other than the text block, the subtree whose root node is the parent element of the title element, the subtree whose parent element is the navigation element, and the form title, thereby composing a new DOM tree.
4. The method of claim 2, characterized in that the step of judging whether the Internet web page is a theme-type web page specifically comprises:
taking non-link texts in turn from the DOM tree into which the Internet web page has been parsed;
judging whether the subtree whose root element is the parent element of the non-link text can constitute the body text of the subject content of a theme-type web page; if so, searching for the text block and title block of the subject content; otherwise, continuing with the next non-link text until the text block and title block of the subject content are found;
when all non-link texts have been tried and the body text of the subject content still cannot be found, judging that the web page is a navigation-type web page.
5. The method of claim 1, characterized in that the step of extracting subject content from the parsed Internet web page specifically comprises:
judging whether the web page is a forum-type web page among navigation-type web pages;
when the web page is a forum-type web page among navigation-type web pages, extracting the title block and navigation information of the web page.
6. The method of any one of claims 1 to 5, characterized in that the step of extracting subject content from the parsed Internet web page further comprises:
removing noise content from the subject content.
7. An Internet web page conversion system, characterized in that the system comprises:
a web page parsing unit, configured to parse the Internet web page that has been read;
a web page content purification unit, configured to extract subject content from the parsed Internet web page; and
a conversion output unit, configured to convert the extracted subject content and output the corresponding XHTML web page.
8. The system of claim 7, characterized in that the web page content purification unit comprises:
a web page type judging module, configured to judge the type of the Internet web page;
a subject content extraction module, configured to extract the corresponding subject content according to the type of the Internet web page.
9. The system of claim 8, characterized in that the web page content purification unit further comprises:
a noise content removal module, configured to remove noise content from the subject content.
10. The system of claim 7, 8 or 9, characterized in that the subject content is the body text, title and navigation information of a theme-type web page, or the title block and navigation information of a forum-type web page among navigation-type web pages.
11. A server comprising the Internet web page conversion system of claim 7.
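For illustration only, the cropping step recited in claim 3 (keep a few subtrees, remove everything else from the original DOM tree) could be sketched as below. The function names are hypothetical, and the caller is assumed to have already located the subtree roots to retain, for example the text block, the parent element of the title element, and the navigation element. Ancestors of retained subtrees are kept so that the resulting tree stays connected, which is an assumption of this sketch.

```python
from xml.dom import minidom


def is_ancestor_of_kept(node, keep):
    """Return True if node is an ancestor of any retained subtree root."""
    for kept in keep:
        parent = kept.parentNode
        while parent is not None:
            if parent is node:
                return True
            parent = parent.parentNode
    return False


def prune(root, keep):
    """Remove from root every node that lies outside the subtrees in keep."""
    for child in list(root.childNodes):
        if any(child is kept for kept in keep):
            continue  # retained subtree: leave it untouched
        if is_ancestor_of_kept(child, keep):
            prune(child, keep)  # on the path to a retained subtree: descend
        else:
            root.removeChild(child)
    return root


doc = minidom.parseString(
    "<body><div id='nav'><a href='/'>Home</a></div>"
    "<div id='ad'>Buy now</div>"
    "<div id='main'><div id='head'><h1>Title</h1></div>"
    "<div id='text'><p>Text</p></div><div id='share'>Share</div></div></body>"
)
divs = doc.getElementsByTagName("div")  # document order: nav, ad, main, head, text, share
nav, text_block = divs[0], divs[4]
title_parent = doc.getElementsByTagName("h1")[0].parentNode
prune(doc.documentElement, [nav, title_parent, text_block])
print(doc.documentElement.toxml())
```

After pruning, the advertisement and share blocks are gone, while the navigation, title and text subtrees remain in their original positions.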
CN2008100655972A 2008-03-19 2008-03-19 Internet web page conversion method, system and equipment Active CN101246494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100655972A CN101246494B (en) 2008-03-19 2008-03-19 Internet web page conversion method, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008100655972A CN101246494B (en) 2008-03-19 2008-03-19 Internet web page conversion method, system and equipment

Publications (2)

Publication Number Publication Date
CN101246494A true CN101246494A (en) 2008-08-20
CN101246494B CN101246494B (en) 2011-11-02

Family

ID=39946947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100655972A Active CN101246494B (en) 2008-03-19 2008-03-19 Internet web page conversion method, system and equipment

Country Status (1)

Country Link
CN (1) CN101246494B (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024028B (en) * 2010-11-22 2014-04-02 百度在线网络技术(北京)有限公司 Method and equipment for distinctly displaying main contents of webpage on mobile terminal
CN102024028A (en) * 2010-11-22 2011-04-20 百度在线网络技术(北京)有限公司 Method and equipment for distinctly displaying main contents of webpage on mobile terminal
CN102035883A (en) * 2010-11-26 2011-04-27 百度在线网络技术(北京)有限公司 Method and device for optimizing webpage in network equipment
CN102035883B (en) * 2010-11-26 2015-07-01 百度在线网络技术(北京)有限公司 Method and device for optimizing webpage in network equipment
CN102541863A (en) * 2010-12-14 2012-07-04 联芯科技有限公司 Webpage compression method applied to mobile terminal
CN102163213A (en) * 2011-02-25 2011-08-24 中国科学院计算技术研究所 Voice browsing method and browser
CN102163213B (en) * 2011-02-25 2015-06-24 中国科学院计算技术研究所 Voice browsing method and browser
CN103246684A (en) * 2012-02-13 2013-08-14 联想(北京)有限公司 Method, device and system for web page transition
WO2014019506A1 (en) * 2012-08-03 2014-02-06 Tencent Technology (Shenzhen) Company Limited Method and device for displaying webpage contents in browser
CN102880707A (en) * 2012-09-27 2013-01-16 广州市动景计算机科技有限公司 Method and device for webpage body content recognition
CN102880707B (en) * 2012-09-27 2016-03-16 广州市动景计算机科技有限公司 Webpage body content recognition methods and device
CN102955852A (en) * 2012-11-01 2013-03-06 北京小米科技有限责任公司 Method, device and equipment for webpage resource processing
CN103853760B (en) * 2012-12-03 2017-05-03 ***通信集团公司 Method and device for extracting contents of bodies of web pages
CN103853760A (en) * 2012-12-03 2014-06-11 ***通信集团公司 Method and device for extracting contents of bodies of web pages
CN104156458A (en) * 2014-08-20 2014-11-19 百度在线网络技术(北京)有限公司 Information extraction method and device
CN104156458B (en) * 2014-08-20 2017-09-22 北京小度互娱科技有限公司 The extracting method and device of a kind of information
CN104965901A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Method and apparatus for grabbing content of target page
CN106227882A (en) * 2016-08-02 2016-12-14 浙江大学 A kind of accessible web page navigation method extracted based on navigation object
CN106227882B (en) * 2016-08-02 2019-08-23 浙江大学 A kind of accessible web page navigation method extracted based on navigation object
CN108073588A (en) * 2016-11-09 2018-05-25 北京国双科技有限公司 column information extracting method and device
CN108073588B (en) * 2016-11-09 2021-07-30 北京国双科技有限公司 Column information extraction method and device
CN108228609A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Information filtering method and device
WO2019090738A1 (en) * 2017-11-10 2019-05-16 深圳市华阅文化传媒有限公司 Method and device for purifying web fiction page
CN109271162A (en) * 2018-09-03 2019-01-25 中国建设银行股份有限公司 A kind of page generation method and device

Also Published As

Publication number Publication date
CN101246494B (en) 2011-11-02

Similar Documents

Publication Publication Date Title
CN101246494B (en) Internet web page conversion method, system and equipment
CN101251855B (en) Equipment, system and method for cleaning internet web page
CN101788991B (en) Updating reminding method and system
CN101291304B (en) Transplantable network information sharing method
CN101727461B (en) Method for extracting content of web page
CN101599089B (en) Method and system for automatically searching and extracting update information on content of video service website
CN103166981B (en) A kind of radio web page code-transferring method and device
CN101361063B (en) System and method supporting document content mining based on rules
US20110302486A1 (en) Method and apparatus for obtaining the effective contents of web page
CN106503211B (en) Method for automatically generating mobile version facing information publishing website
CN102420842B (en) A kind of sending method of webpage in mobile network and system
CN102163213B (en) Voice browsing method and browser
CN106354861A (en) Automatic film label indexing method and automatic indexing system
CN101216842A (en) Method for obtaining page key words and page information processing apparatus
CN102306201B (en) Method and system for analyzing webpage title
CN101196918A (en) Paging method and paging device
CN103064880A (en) Method, device and system based on searching information for providing users with website choice
US20150100877A1 (en) Method or system for automated extraction of hyper-local events from one or more web pages
JP2005063432A (en) Multimedia object retrieval apparatus and multimedia object retrieval method
KR101607468B1 (en) Keyword tagging method and system for contents
CN103064845A (en) Website information processing device and website information processing method
Yu et al. Web content information extraction based on DOM tree and statistical information
JP5462591B2 (en) Specific content determination device, specific content determination method, specific content determination program, and related content insertion device
CN101996190B (en) Method and device for extracting information from webpage
KR20100090178A (en) Apparatus and method refining keyword and contents searching system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160111

Address after: Floors 5-10, Fiyta Building, South Road, High-Tech Zone, Nanshan District, Shenzhen, Guangdong Province 518057

Patentee after: Shenzhen Tencent Computer System Co., Ltd.

Address before: Room 403, East, Building 2, SEG Science Park, Zhenxing Road, Futian District, Shenzhen, Guangdong Province 518044

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.