CN110309457A - Web data processing method, device, computer equipment and storage medium - Google Patents

Web data processing method, device, computer equipment and storage medium

Publication number
CN110309457A
Authority
CN
China
Prior art keywords
hypertext
current
content
tags
web object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810236011.8A
Other languages
Chinese (zh)
Other versions
CN110309457B (en
Inventor
王炼
吕远方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810236011.8A
Publication of CN110309457A
Application granted
Publication of CN110309457B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a web data processing method, device, computer equipment and storage medium. The method comprises: obtaining a hypertext file to be processed corresponding to a webpage to be processed; extracting target content hypertext data from the hypertext file to be processed, the target content hypertext data comprising one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; taking each target hypertext tag in the target content hypertext data as a current hypertext tag and generating a web object corresponding to each target hypertext tag; and forming a web object sequence from the web objects corresponding to the target hypertext tags. The method can reduce the occupation of computer resources.

Description

Web data processing method, device, computer equipment and storage medium
Technical field
The present invention relates to the field of Internet technology, and in particular to a web data processing method, device, computer equipment and storage medium.
Background art
With the rapid development of the Internet, web pages have become a carrier for publishing and sharing information. Internet users can publish various kinds of content on web pages, such as news and product introductions.
At present, besides the content to be published, a web page carries a lot of other information, such as advertisements, navigation and copyright information. Therefore, when the published content is converted for publication on another platform or saved there, the data of the entire web page has to be obtained; the data volume is large and occupies computer resources.
Summary of the invention
Based on this, it is necessary to provide, in view of the above problem, a web data processing method, device, computer equipment and storage medium. Target content hypertext data can be extracted from the hypertext file to be processed corresponding to the webpage to be processed, and the hypertext content is processed separately according to the data type represented by each hypertext tag of the target content hypertext data. When the data type represented by a hypertext tag is a text data type, the hypertext content is further processed according to the tag type, yielding the web object sequence corresponding to the target page content. The target page content is obtained efficiently and the data volume is reduced, which reduces the occupation of computer resources.
A web data processing method, the method comprising: obtaining a hypertext file to be processed corresponding to a webpage to be processed; extracting target content hypertext data from the hypertext file to be processed, the target content hypertext data comprising one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; taking each target hypertext tag in the target content hypertext data as a current hypertext tag and generating a web object corresponding to each target hypertext tag, which comprises: obtaining the current data type represented by the current hypertext tag; when the current data type is a non-text data type, obtaining a first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; when the current data type is a text data type, obtaining the current tag type corresponding to the current hypertext tag, and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object; and forming a web object sequence from the web objects corresponding to the target hypertext tags.
A web data processing device, the device comprising: a to-be-processed file obtaining module, for obtaining a hypertext file to be processed corresponding to a webpage to be processed; an extraction module, for extracting target content hypertext data from the hypertext file to be processed, the target content hypertext data comprising one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; an object generation module, for taking each target hypertext tag in the target content hypertext data as a current hypertext tag and generating a web object corresponding to each target hypertext tag, which comprises: obtaining the current data type represented by the current hypertext tag; when the current data type is a non-text data type, obtaining a first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; when the current data type is a text data type, obtaining the current tag type corresponding to the current hypertext tag, and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object; and a sequence forming module, for forming a web object sequence from the web objects corresponding to the target hypertext tags.
In one embodiment, the device further comprises: a content obtaining module, for obtaining, when the current data type is a non-text data type, the content of the current web object to be generated, and generating a third web object according to the content of the current web object to be generated.
In one embodiment, the device further comprises: a template obtaining module, for obtaining a webpage hypertext template; and a filling module, for filling each web object in the web object sequence into the webpage hypertext template to obtain the corresponding target webpage hypertext file, wherein the hypertext tag corresponding to each web object of the web object sequence in the target webpage hypertext file is a block-level tag.
In one embodiment, the device further comprises: an information obtaining module, for obtaining dynamic information and/or static information corresponding to the target content from the hypertext file to be processed. The filling module is used for: filling the dynamic information and/or static information and each web object in the web object sequence into the webpage hypertext template to obtain the target webpage hypertext file.
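The filling step can be sketched as follows. This is only a minimal illustration under stated assumptions: the template with `$title` and `$body` placeholders, the dictionary layout of a web object (`type`, `content`, `source`) and the helper names `render_object` and `fill_template` are hypothetical and do not come from the patent text.

```python
from html import escape
from string import Template

# Hypothetical webpage hypertext template: a title slot (static
# information) and a body slot for the web object sequence.
PAGE_TEMPLATE = Template("<html><body><h1>$title</h1>$body</body></html>")


def render_object(obj):
    """Wrap one web object in a block-level tag (assumed mapping)."""
    if obj["type"] == "text":
        return "<p>" + escape(obj["content"]) + "</p>"
    # In this sketch every non-text object is rendered as an image
    # inside a block-level <div>.
    return '<div><img src="' + escape(obj["source"], quote=True) + '"></div>'


def fill_template(title, web_objects):
    """Fill the static title and every web object, in order, into the template."""
    body = "".join(render_object(obj) for obj in web_objects)
    return PAGE_TEMPLATE.substitute(title=escape(title), body=body)


page = fill_template(
    "Example headline",
    [
        {"type": "text", "content": "First paragraph"},
        {"type": "image", "source": "http://www.qq.com/Image.png"},
    ],
)
print(page)
```

Because each web object ends up in its own block-level element, every object starts on a new line when the target webpage is rendered.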
In one embodiment, the object generation module comprises: a level obtaining unit, for obtaining the hierarchical relationship between the target hypertext tags; and a current tag obtaining unit, for obtaining the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and a depth-first traversal algorithm. The sequence forming module is used for:
forming the web objects corresponding to the target hypertext tags into a web object sequence according to the parsing order of the target hypertext tags.
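One way to picture the depth-first selection of current hypertext tags is the sketch below. It assumes a pre-order traversal over a small well-formed fragment parsed with Python's standard `xml.etree.ElementTree`; the fragment and the function name `depth_first_tags` are illustrative, and the patent text does not specify which depth-first variant is used.

```python
import xml.etree.ElementTree as ET

# A small, well-formed fragment standing in for the target content
# hypertext data (illustrative only).
FRAGMENT = "<div><p>intro <b>bold</b></p><img src='a.png'/><p>end</p></div>"


def depth_first_tags(element):
    """Yield tag names in depth-first (pre-order) order."""
    yield element.tag
    for child in element:
        yield from depth_first_tags(child)


root = ET.fromstring(FRAGMENT)
parse_order = list(depth_first_tags(root))
print(parse_order)  # ['div', 'p', 'b', 'img', 'p']
```

The resulting list is the parsing order; web objects generated in this order would be appended to the web object sequence in the same order.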
In one embodiment, the extraction module comprises: a path data obtaining unit, for obtaining target hypertext path data; and an extraction unit, for extracting the target content hypertext data from the hypertext file to be processed according to the target hypertext path data.
A computer equipment, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above web data processing method.
A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, causes the processor to perform the steps of the above web data processing method.
With the above web data processing method, device, computer equipment and storage medium, target content hypertext data can be extracted from the hypertext file to be processed corresponding to the webpage to be processed, and the hypertext content is processed separately according to the data type represented by each hypertext tag of the target content hypertext data. When the data type represented by a hypertext tag is a text data type, the hypertext content is further processed according to the tag type, yielding the web object sequence corresponding to the target page content. The target page content is obtained efficiently and the data volume is reduced, which reduces the occupation of computer resources.
Brief description of the drawings
Fig. 1 is the applied environment figure of the web data processing method provided in one embodiment;
Fig. 2 is a flow chart of the web data processing method in one embodiment;
Fig. 3 is a schematic diagram of the path configuration interface in one embodiment;
Fig. 4 is the flow chart of web data processing method in one embodiment;
Fig. 5 is a flow chart of taking each target hypertext tag in the target content hypertext data as the current hypertext tag in one embodiment;
Fig. 6 is hypertext tags level schematic diagram in one embodiment;
Fig. 7 is the flow chart of web data processing method in one embodiment;
Fig. 8 is the flow chart of web data processing method in one embodiment;
Fig. 9 is the schematic diagram of target webpage in one embodiment;
Fig. 10 is the flow chart of web data processing method in one embodiment;
Fig. 11 is the structural block diagram of web data processing unit in one embodiment;
Fig. 12 is the structural block diagram of web data processing unit in one embodiment;
Fig. 13 is the structural block diagram of web data processing unit in one embodiment;
Fig. 14 is the internal structure block diagram of computer equipment in one embodiment.
Specific embodiment
The present invention is further described in detail below with reference to the accompanying drawings and embodiments.
It can be understood that the terms "first", "second" and so on used in this application may be used herein to describe various elements, but unless otherwise stated these elements are not limited by these terms. The terms are only used to distinguish one element from another. For example, without departing from the scope of this application, the first web object could be called the second web object, and similarly the second web object could be called the first web object.
Fig. 1 is an application environment diagram of the web data processing method provided in one embodiment. As shown in Fig. 1, the application environment includes a terminal 110 and computer equipment 120. When the target content on a webpage to be processed is to be obtained, for example when news shown on a desktop-version webpage needs to be converted into news displayed in a mobile phone application, the computer equipment 120 obtains the hypertext file to be processed corresponding to the webpage to be processed and then executes the web data processing method provided in the embodiments of the present invention to obtain the web object sequence corresponding to the target page content. After obtaining the web object sequence, the computer equipment 120 can send it to the terminal 110, and the terminal 110 displays each web object according to the web object sequence. Each web object can serve as one paragraph of the webpage displayed on the terminal 110. The computer equipment 120 can be an independent physical server or terminal, a server cluster composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud servers, cloud databases, cloud storage and CDN. The terminal 110 can be a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, smartwatch or the like, but is not limited to these. The terminal 110 and the computer equipment 120 can be connected through Bluetooth, USB (Universal Serial Bus), a network or another communication connection, which is not limited by the present invention.
It should be noted that the application environment diagram provided in the embodiments of this application is only an example and does not limit the web data processing method provided in the embodiments of the present invention. The method can also be applied in other application environments; for example, the computer equipment 120 can generate the corresponding target webpage directly on the computer equipment 120 according to the obtained web object sequence.
As shown in Fig. 2, in one embodiment a web data processing method is provided. This embodiment is mainly illustrated by applying the method to the computer equipment 120 in Fig. 1 above. The method may specifically include the following steps:
Step S202: obtain the hypertext file to be processed corresponding to the webpage to be processed.
Specifically, the webpage to be processed is a webpage from which target content, such as the body content, needs to be extracted; the webpage to be processed is generated from the hypertext file to be processed. A hypertext file is a file written in HTML (Hyper Text Markup Language). For example, when a webpage is to be generated, a browser can obtain the hypertext file and generate the webpage from it. The hypertext file to be processed corresponding to the webpage to be processed can be obtained with crawler software or extracted directly from a server. For example, when a desktop-version webpage needs to be converted into a webpage displayed in a mobile phone application, the hypertext file to be processed can be downloaded from the server that stores the hypertext file.
Step S204: extract target content hypertext data from the hypertext file to be processed, the target content hypertext data comprising one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags.
Specifically, the target content hypertext data is the hypertext data corresponding to the target content that needs to be obtained from the webpage to be processed; which target content hypertext data to obtain can be set according to actual needs. For example, when the body of a news webpage needs to be obtained, the target content hypertext data is the hypertext data corresponding to the body. Hypertext tags identify the category or attributes of hypertext content; hypertext tags may include, for example, tt, abbr, acronym, image, fieldset, figcaption and form. The target hypertext tags differ from one hypertext file to be processed to another. The hypertext content corresponding to a target hypertext tag is content displayed in the webpage to be processed, or content from which the content displayed in the webpage can be obtained. For example, for an image in a webpage, the hypertext content can be the URL (Uniform Resource Locator) address of the image, and the image can be obtained according to the URL address. For text data in a webpage, the hypertext content can be the content displayed in the webpage. A piece of hypertext content can be identified by one or several pairs of tags. A pair of hypertext tags consists of a start tag and an end tag. The start tag is composed of a less-than sign "<", the tag name and a greater-than sign ">". The end tag differs from the start tag in that a slash "/" is added after the less-than sign; for example, <div> and </div> are a start tag and an end tag respectively. In "<div>this is hypertext content</div>", "this is hypertext content" is the hypertext content corresponding to the div tag. The number of target hypertext tags depends on the extracted target content hypertext data and is not specifically limited.
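The relationship between a tag pair and its hypertext content can be illustrated with the standard-library `html.parser` module. This is a minimal sketch for simple, non-nested markup; the class name `TagContentCollector` is hypothetical and not part of the patent.

```python
from html.parser import HTMLParser


class TagContentCollector(HTMLParser):
    """Collect (tag name, hypertext content) pairs from simple, non-nested markup."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_tag = None  # name of the start tag seen most recently

    def handle_starttag(self, tag, attrs):
        self._open_tag = tag

    def handle_data(self, data):
        # Text found between a start tag and its end tag is that tag's content.
        if self._open_tag is not None:
            self.pairs.append((self._open_tag, data))

    def handle_endtag(self, tag):
        self._open_tag = None


collector = TagContentCollector()
collector.feed("<div>this is hypertext content</div>")
collector.close()
print(collector.pairs)  # [('div', 'this is hypertext content')]
```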
In one embodiment, the target content hypertext data can be extracted from the hypertext file to be processed according to a preset path. Extracting the target content hypertext data from the hypertext file to be processed comprises: obtaining target hypertext path data, and extracting the target content hypertext data from the hypertext file to be processed according to the target hypertext path data.
Specifically, the path data can be XPath (XML Path Language) data. XPath, the path language of XML (eXtensible Markup Language), is used here as a language for locating data in an HTML document; the corresponding data in a hypertext file can be obtained according to an XPath path. The target hypertext path data is determined by the specific webpage and the target content that needs to be extracted. An XPath configuration interface can be provided, on which the XPath of each piece of content in the webpage to be processed is set. As shown in Fig. 3, title, publishtime, author, commennum, promoteimage and content in the name column respectively represent the title, publication time, author, number of comments, promotion picture and body content of the webpage to be processed. The hypertext data corresponding to the title, publication time, author, number of comments, promotion picture and body content of the webpage to be processed can therefore be obtained according to the configured XPath paths. Assuming that the body content is the target content, the corresponding target content hypertext data can be obtained according to the path data //*[@id="main_content"]. Here "//" means that the target content hypertext data is searched for in the entire hypertext file, "*" matches any node, and [@id="main_content"] selects the hypertext data whose id attribute equals "main_content", which is the target content hypertext data corresponding to the body content.
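A path-based extraction of this kind might look like the sketch below. It assumes a well-formed (XHTML-like) stand-in document and uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, where `.//*[@id="main_content"]` plays the role of the configured path `//*[@id="main_content"]`. Real-world HTML is often not well-formed XML and would need a more tolerant parser, which is outside this sketch.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for a hypertext file to be processed (illustrative).
PAGE = """
<html>
  <body>
    <div id="nav">navigation</div>
    <div id="main_content"><p>body text</p></div>
    <div id="footer">copyright</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)
# Search the whole tree for any element whose id attribute is "main_content".
matches = root.findall('.//*[@id="main_content"]')
target = matches[0]
print(target.tag, [child.tag for child in target])
```

Only the matched subtree is kept for further processing, so navigation and copyright blocks never reach the later steps.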
Step S206: take each target hypertext tag in the target content hypertext data as the current hypertext tag and generate the web object corresponding to each target hypertext tag, which comprises: obtaining the current data type represented by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; when the current data type is a text data type, obtaining the current tag type corresponding to the current hypertext tag, and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object.
Specifically, a web object is an object displayed on a webpage and can represent a complete and independent piece of content in the webpage. For example, a picture, a video or the text of a paragraph on the webpage to be processed can each correspond to one web object. The current data type represented by the current hypertext tag is the data type of the content that is displayed in the webpage to be processed according to the current hypertext content; data types may include a non-text data type and a text data type. For image tags, audio tags and video tags, the corresponding hypertext content is displayed in the webpage as an image, audio and video respectively. The data type represented by image, audio and video tags is therefore the non-text data type, while for tags such as div, h4, acronym and abbr the data type represented is the text data type. In one embodiment, the data type represented by any hypertext tag other than the image, audio and video tags can be treated as the text data type. Tag types can be classified as needed. For example, tags can be divided into a block-level tag type and an inline tag type: an inline tag is a tag whose hypertext content can be displayed on the same line as the hypertext content of other tags, and a block-level tag is a tag whose hypertext content has to start on a new line. For the non-text data type, the current hypertext content corresponding to the current hypertext tag can be obtained and used as one web object, i.e. the first web object. For the text data type, the correspondence between tag types and processing modes is set in advance, so the current tag type corresponding to the current hypertext tag can be obtained and the current hypertext content is processed in the processing mode corresponding to the current tag type to obtain the second web object. After the target content hypertext data is obtained, each target hypertext tag is taken in turn as the current hypertext tag, and the hypertext content corresponding to each current hypertext tag is processed to obtain the web objects. The order in which target hypertext tags are taken as the current hypertext tag can follow the order in which the tags are arranged; when the hypertext tags are nested in levels, the hierarchical relationship between the target hypertext tags can be obtained first, and the target hypertext tags are then taken as the current hypertext tag according to that hierarchical relationship.
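The two-stage decision described above (data type first, tag type second) can be sketched as a small lookup. The tag names `img`, `audio` and `video`, the short inline-tag subset and the function name `classify` are assumptions made for illustration; the embodiment leaves the exact sets configurable.

```python
# Assumed tag names for the image, audio and video tags.
NON_TEXT_TAGS = {"img", "audio", "video"}
# A small illustrative subset of inline tags; every other text tag is
# treated as block-level in this sketch.
INLINE_TAGS = {"abbr", "cite", "code", "q", "sub", "sup"}


def classify(tag):
    """Return (data type, tag type); the tag type is None for non-text tags."""
    if tag in NON_TEXT_TAGS:
        return ("non-text", None)
    return ("text", "inline" if tag in INLINE_TAGS else "block")


for name in ("img", "div", "code"):
    print(name, classify(name))
```

A non-text result leads directly to a first web object, while a text result selects the processing mode associated with the returned tag type.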
In one embodiment, it can also be determined whether the current tag is a comment-type tag; if so, the current hypertext content corresponding to the comment-type tag can be discarded. For example, the format of a comment tag is <!-- -->, with the comment text written after the second "-". When the current hypertext tag is <!-- this is an abbreviation -->, the corresponding current hypertext content "this is an abbreviation" is discarded.
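Discarding comment content can be illustrated with `html.parser`, which reports comments through a separate callback. This is a sketch only; the class name `CommentDroppingParser` and the `dropped` list, kept here purely to show what was discarded, are hypothetical.

```python
from html.parser import HTMLParser


class CommentDroppingParser(HTMLParser):
    """Keep visible text and set comment content aside."""

    def __init__(self):
        super().__init__()
        self.kept = []
        self.dropped = []

    def handle_comment(self, data):
        # Comment content never becomes part of a web object.
        self.dropped.append(data)

    def handle_data(self, data):
        self.kept.append(data)


parser = CommentDroppingParser()
parser.feed("<p>visible text</p><!-- this is an abbreviation -->")
parser.close()
print(parser.kept, parser.dropped)
```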
In one embodiment, when the current hypertext tag is of a format tag type, for example a font tag, the format information corresponding to the format tag and the corresponding current hypertext content can be obtained, and the correspondence between the format information and the current hypertext content is stored. The format information can be, for example, bold, italic or font color.
Step S208: form the web objects corresponding to the target hypertext tags into a web object sequence.
Specifically, after the web object corresponding to each target hypertext tag is obtained, the web objects are combined to obtain the web object sequence. They can be combined according to the parsing order of the tags, i.e. in the order in which the target hypertext tags were taken as the current hypertext tag. The web object sequence can be stored as a sectionlist. A sectionlist is a grouped list component, and one web object corresponds to one section of the sectionlist. The data type represented by each target hypertext tag can also be stored in the web object sequence; the web objects and their corresponding data types can be stored in JSON format.
In one embodiment, after the web object sequence is obtained, the target webpage is obtained according to the web object sequence and displayed. For example, when the webpage to be processed is a webpage introducing a target application, each web object can be displayed in the introduction interface of the target application in application download software. Each web object can correspond to one paragraph.
In one embodiment, when the correspondence between format information and current hypertext content has been obtained, the corresponding content in the target webpage can be formatted according to the format information. For example, when the format information specifies bold text, the corresponding content in the target webpage is set in bold.
In one embodiment, other information in the webpage to be processed can also be obtained, for example at least one of the dynamic information or the static information corresponding to the target content. The other information is then displayed on the target webpage. Static information is information that does not change over time, and dynamic information is information that can change over time. Static information may include the title, publication time and author of the target content; dynamic information may include the number of reads, comments, likes and video plays of the target content. Taking a news webpage as an example, the obtained web object sequence and the static information of the target content can be represented as follows. title, author and publishTimes correspond to the title, author and publication time of the target content respectively. In Sectionlist, type indicates the data type, where the non-text data type can be subdivided into an image type, an audio type and a video type. The content inside one pair of braces is a web object together with its description information. For example, the content that starts at "type": "image" and ends at "source": "http://www.qq.com/Image.png" is one section, which includes the web object corresponding to an image tag and the data type it represents. width and height indicate the width and height of the image, and source indicates the source address of the image.
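The structure described in this paragraph can be reconstructed roughly as below. The field names title, author, publishTimes, Sectionlist, type, width, height and source and the image address come from the description; every value other than the image address, and the `content` key of the text section, are placeholders assumed for illustration.

```python
import json

# Hypothetical reconstruction of the described JSON structure.
result = {
    "title": "Example headline",           # placeholder value
    "author": "Example author",            # placeholder value
    "publishTimes": "2018-01-01 00:00",    # placeholder value
    "Sectionlist": [
        # Assumed layout for a text section.
        {"type": "text", "content": "First paragraph of the body."},
        {
            "type": "image",
            "width": 640,                   # placeholder value
            "height": 480,                  # placeholder value
            "source": "http://www.qq.com/Image.png",
        },
    ],
}

serialized = json.dumps(result, ensure_ascii=False, indent=2)
print(serialized)
```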
In one embodiment, after the web object sequence is obtained, the text web objects in the web object sequence, i.e. those whose hypertext tag represents the text data type, can also be obtained and spliced together to obtain the target text content, which serves as the text content corresponding to the webpage to be processed. The target text content can be used as the summary information of the webpage to be processed in search results when a web search is performed, or as the text content of the webpage to be processed when an inverted index between webpage keywords and webpages is built.
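Splicing the text web objects can be sketched in a few lines. The dictionary layout of a web object and the choice of a newline as separator are assumptions; the embodiment does not prescribe either.

```python
# Illustrative web object sequence (assumed dictionary layout).
web_object_sequence = [
    {"type": "text", "content": "First paragraph."},
    {"type": "image", "source": "http://www.qq.com/Image.png"},
    {"type": "text", "content": "Second paragraph."},
]

# Keep only text web objects and join them in sequence order.
target_text = "\n".join(
    obj["content"] for obj in web_object_sequence if obj["type"] == "text"
)
print(target_text)
```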
With the above web data processing method, device, computer equipment and storage medium, target content hypertext data can be extracted from the hypertext file to be processed corresponding to the webpage to be processed, and the hypertext content is processed separately according to the data type represented by each hypertext tag of the target content hypertext data. When the data type represented by a hypertext tag is a text data type, the hypertext content is further processed according to the tag type, yielding the web object sequence corresponding to the target page content. The target page content is obtained efficiently and the data volume is reduced, which reduces the occupation of computer resources.
In one embodiment, before the current hypertext content corresponding to the current hypertext tag is processed according to the current tag type, as shown in Fig. 4, the web data processing method includes step S402: determine whether the current hypertext tag is of the first type or the second type. If it is of the first type, proceed to step S404; if it is of the second type, proceed to step S406.
Specifically, the tags of the first type and the tags of the second type can be set according to actual needs. In one embodiment, the first type can be the inline tag type and the second type can be the block-level tag type. In one embodiment, the tags of the first type may include tt, abbr, acronym, cite, code, dfn, kbd, samp, var, bdo, br, map, object, q, sub, sup, button, input, label and textarea, and the tags of the second type may include a, address, article, aside, blockquote, canvas, dd, div, dl, fieldset, figcaption, form, hgroup, hr, ol, output, p, pre, section, h1, h2, h3, h4, h5 and h6.
In one embodiment, as shown in Fig. 4, processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object includes the following steps:
Step S404: when the current tag type is the first type, obtain the content of the current web object to be generated according to the current hypertext content corresponding to the current hypertext tag.
Specifically, for the current hypertext content corresponding to a current hypertext tag of the first type, the current hypertext content can be used as content of the current web object to be generated; when a web object needs to be generated, it is generated from the content of the current web object to be generated.
In one embodiment, a preset storage area can be set up in advance to store the content of the current web object to be generated; for example, a text buffer can be preset for this purpose. When the current tag type is the first type, obtaining the content of the current web object to be generated according to the current hypertext content corresponding to the current hypertext tag includes: storing the current hypertext content corresponding to the current hypertext tag in the preset storage area as content of the current web object to be generated. For example, when the current tag type of the current hypertext tag is the inline tag type, the current hypertext content corresponding to the current hypertext tag can be stored in the text buffer, and the next target hypertext tag is then taken as the current hypertext tag. When the current data type represented by the next target hypertext tag is the text data type and its current tag type is the first type, the hypertext content corresponding to the next target hypertext tag is likewise stored in the text buffer as content of the current web object to be generated.
Step S406: when the current tag type is the second type, obtain the content of the current web object to be generated, generate the second web object according to that content, and use the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
Specifically, when the current tag type is the second type, the content of the current web object to be generated is obtained and combined to form the second web object. The current hypertext content corresponding to the current hypertext tag is then used as the content of the next web object to be generated.
In one embodiment, when the content of the current web object to be generated is stored in the preset storage region, step S406 may include: taking the content currently stored in the preset storage region as the content of the current web object to be generated; generating the second web object according to the content currently stored in the preset storage region; deleting the content currently stored in the preset storage region; and storing the current hypertext content corresponding to the current hypertext tag into the preset storage region as the content of the next web object to be generated.
Specifically, when the type of the current hypertext tag is found to be the second type, the content currently stored in the preset storage region is obtained and the second web object is generated from it. After the second web object is generated, the content currently stored in the preset storage region is deleted, the current hypertext content corresponding to the current hypertext tag is stored into the preset storage region as the content of the next web object to be generated, and the next target hypertext tag is then obtained as the current hypertext tag.
In one embodiment, when the current data type is a non-text data type, before the first web object corresponding to the current hypertext tag is obtained according to the current hypertext content corresponding to the current hypertext tag, the method further includes: obtaining the content of the current web object to be generated, and generating a third web object according to the content of the current web object to be generated.
Specifically, when the current hypertext tag indicates a non-text data type, the content of the current web object to be generated may be obtained to generate the third web object. For example, when the content of the current web object to be generated is stored in the preset storage region and that region currently holds content, the stored content may be obtained to generate the third web object. The content stored in the preset storage region is then deleted, and the first web object corresponding to the current hypertext tag is obtained according to the current hypertext content corresponding to the current hypertext tag. It can be understood that the resulting third web object is also one of the web objects that make up the web object sequence.
In the embodiments of the present invention, the hypertext content corresponding to a current hypertext tag of the first type may be stored into the preset storage region. When the data type indicated by the next target hypertext tag is the text data type and its tag type is the first type, the corresponding hypertext content continues to be stored into the preset storage region. Only when the data type indicated by the next target hypertext tag is a non-text data type, or is the text data type with a tag type of the second type, is the content in the preset storage region obtained to generate a web object. As a result, each generated web object is complete and independent.
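The buffering behaviour described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `SectionBuilder`, the tag subsets, the treatment of `img` and `video` as non-text tags, the whitespace stripping, and the final flush after the last tag are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed subsets of the tag lists given in the description.
INLINE_TAGS = {"code", "q", "sub", "sup", "label"}    # first type
BLOCK_TAGS = {"p", "div", "h1", "h2", "blockquote"}   # second type
NON_TEXT_TAGS = {"img", "video"}                      # hypothetical non-text tags


@dataclass
class SectionBuilder:
    buffer: list = field(default_factory=list)    # the preset storage region
    sections: list = field(default_factory=list)  # generated web objects

    def flush(self):
        """Turn the buffered content into one web object and clear the buffer."""
        text = "".join(self.buffer).strip()
        if text:
            self.sections.append(("text", text))
        self.buffer.clear()

    def feed(self, tag, content):
        if tag in NON_TEXT_TAGS:
            self.flush()                          # third web object, if any
            self.sections.append((tag, content))  # first web object
        elif tag in BLOCK_TAGS:
            self.flush()                          # second web object
            self.buffer.append(content)           # starts the next object
        elif tag in INLINE_TAGS:
            self.buffer.append(content)           # keep accumulating


builder = SectionBuilder()
for tag, content in [("p", "Hello "), ("code", "world"), ("img", "a.png"),
                     ("p", "Next paragraph"), ("sub", " note")]:
    builder.feed(tag, content)
builder.flush()  # assumed: emit whatever is still buffered at the end
print(builder.sections)
```

In this sketch, inline content is appended to the paragraph that precedes it, so "Hello " and "world" end up in one web object, while the image forces the buffered text out before its own web object is created.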
In one embodiment, the web data processing method further includes: when the current hypertext tag is an invalid tag, replacing the current hypertext content corresponding to the current hypertext tag with a space character, and storing the space character into the preset storage region.
Specifically, invalid tags may be configured according to actual needs. For example, when a desktop web page needs to be converted into a mobile web page, one or more of the script, select, and noscript tags may be treated as invalid tags. Once a tag is identified as invalid, its current hypertext content is replaced with a space character, and the space character is stored into the preset storage region. Replacing the hypertext content of invalid tags with space characters keeps the resulting target content concise and its layout clear.
In one embodiment, as shown in Fig. 5, taking each target hypertext tag in the target content hypertext data as the current hypertext tag in step S206 includes:
Step S502: obtain the hierarchical relationship between the target hypertext tags.
Specifically, the hierarchical relationship between target hypertext tags refers to the level relation between the target hypertext tags in the target hypertext data. After the target hypertext data is obtained, a DOM (Document Object Model) parser may be used to parse it and generate a DOM tree structure. DOM defines a set of platform-independent and language-independent interfaces through which programs and scripts can dynamically access and modify the content, structure, and style of a document. The DOM parser resolves the hypertext document into a DOM tree according to the order of the tag pairs, which yields the hierarchical relationship between the target hypertext tags. For example, suppose that in the target hypertext data the target hypertext tags appear in the order <a><b><b1></b1><b2></b2></b><c></c></a>. Then the a tag is at the first level, the b tag and c tag are at the second level, and the b1 tag and b2 tag are at the next level below the b tag. The resulting hierarchical relationship is shown in Fig. 6.
Step S504: obtain the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and a depth-first traversal algorithm.
Specifically, the depth-first traversal algorithm means that, when current hypertext tags are obtained from the target hypertext tags, acquisition proceeds along the branch of one level, and only after every level under that branch has been obtained does it return to obtain the target hypertext tags of another branch as current hypertext tags. To obtain the current hypertext tag, the level of the previous current hypertext tag is obtained first, and then, according to the depth-first traversal algorithm, the first target hypertext tag in the next level below the previous current hypertext tag is taken as the current hypertext tag. Taking the hierarchical relationship of Fig. 6 as an example, the a tag of the first level is first taken as the current hypertext tag; after the hypertext content corresponding to the a tag has been processed, the b tag, b1 tag, b2 tag, and c tag are taken as the current hypertext tag in turn.
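The traversal order for the example hierarchy can be reproduced with a short sketch. It assumes the fragment is well-formed XML so that Python's standard `xml.etree.ElementTree` can parse it; real web pages would normally need an HTML parser. The helper name `depth_first_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET


def depth_first_tags(root):
    """Yield elements in pre-order: a node first, then each child subtree."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(list(node)))


tree = ET.fromstring("<a><b><b1></b1><b2></b2></b><c></c></a>")
order = [node.tag for node in depth_first_tags(tree)]
print(order)  # ['a', 'b', 'b1', 'b2', 'c']
```

The explicit stack makes the "finish one branch before returning to the next" behaviour visible; `Element.iter()` from the same module yields the same document order.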
In one embodiment, forming the web object sequence from the web objects corresponding to the target hypertext tags in step S208 includes: forming the web object sequence from the web objects corresponding to the target hypertext tags according to the parsing order of the target hypertext tags.
Specifically, the parsing order of the target hypertext tags is the order in which the target hypertext tags are taken as the current hypertext tag. The web object sequence is formed in that order, that is, the order of the web objects in the web object sequence follows the parsing order of the target hypertext tags. Taking the hierarchical relationship of Fig. 6 as an example, the web objects in the web object sequence may be, in order, the web objects corresponding to the a tag, b tag, b1 tag, b2 tag, and c tag.
In one embodiment, each time a web object is obtained, it may be stored as a section in a sectionlist; once the last web object has been generated and stored in the sectionlist as a section, the web object sequence is obtained.
In one embodiment, as shown in Fig. 7, the web data processing method may further include the following steps:
Step S702: obtain a web page hypertext template.
Specifically, the web page hypertext template is set in advance and may, for example, be a preset mobile web page hypertext template. The template may be configured according to actual needs.
Step S704: fill each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document, in which the hypertext tag corresponding to each web object of the web object sequence is a block-level tag.
Specifically, the filling positions of the web objects in the web page hypertext template may be set in advance, and the web objects may be filled in according to their order in the web object sequence. A corresponding block-level tag may also be added before each web object, so that when the target web page is displayed according to the target web page hypertext document, each web object corresponds to one paragraph of the target web page.
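As an illustration of step S704, the sketch below wraps each web object in a block-level tag and fills the result into a template in sequence order. The template string, the `{body}` placeholder, the choice of `<p>` as the block-level tag, and the function name `fill_template` are assumptions for the example; the source does not specify the template format.

```python
from html import escape

# Hypothetical mobile page template; "{body}" marks the preset filling position.
TEMPLATE = "<html><body><article>{body}</article></body></html>"


def fill_template(sections, template=TEMPLATE):
    """Wrap each web object in a block-level <p> tag and fill the template in order."""
    paragraphs = "".join(f"<p>{escape(text)}</p>" for text in sections)
    return template.format(body=paragraphs)


page = fill_template(["First paragraph", "2 < 3"])
print(page)
```

Escaping the text before insertion is a defensive choice of this sketch, so that characters such as `<` inside a web object cannot break the generated markup.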
In one embodiment, as shown in Fig. 8, the web data processing method may further include step S802: obtaining dynamic information and/or static information corresponding to the target content from the hypertext document to be processed. In that case, step S704, that is, filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document, includes: filling the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext document.
Specifically, static information is information that does not change over time, and dynamic information is information that may change over time. Static information may include the title, publication time, and author of the target content, and dynamic information may include the number of reads, comments, likes, and video plays of the target content. The filling positions of the static information and/or dynamic information in the web page hypertext template may also be set in advance. Both dynamic and static information may be filled in, or only one of the two. For example, after the web object sequence of the above example and the static information of the target content are filled into the web page hypertext template to obtain the target web page hypertext document, a target web page displayed according to that document may be as shown in Fig. 9.
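A minimal sketch of filling static and dynamic information alongside the web objects, using Python's standard `string.Template`. The placeholder names, the page layout, and all sample values are hypothetical and serve only to show preset filling positions.

```python
from string import Template

# Hypothetical template: each $name is a preset filling position.
PAGE = Template(
    "<h1>$title</h1>"
    "<p>$author, $published</p>"
    "<p>$reads reads, $comments comments</p>"
    "$body"
)


def render(static_info, dynamic_info, sections):
    """Fill static info, dynamic info, and the web object sequence into the template."""
    body = "".join(f"<p>{s}</p>" for s in sections)
    return PAGE.substitute(**static_info, **dynamic_info, body=body)


html_text = render(
    {"title": "Sample title", "author": "Author", "published": "2018-01-01"},
    {"reads": 120, "comments": 4},
    ["First paragraph", "Second paragraph"],
)
print(html_text)
```

Keeping static and dynamic values in separate dictionaries mirrors the distinction in the text: the dynamic values could be refreshed at display time without re-extracting the static ones.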
The method provided by the embodiments of the present invention is described below, taking the conversion of a desktop web page into a web page for a mobile phone client as an example. The method includes the following steps:
Step S1002: obtain, from a server, the hypertext document to be processed corresponding to the web page to be processed. For example, the storage address of the hypertext document to be processed in the server may be obtained, and the hypertext document to be processed is then fetched according to the storage address.
Step S1004: create an empty text buffer and a sectionlist file for storing the web objects to be generated. Here, an empty text buffer is a text buffer in which no content has been stored.
Step S1006: obtain the target content hypertext data and the hierarchical relationship of the target hypertext tags. For example, when the target content is the body content, the target content hypertext data corresponding to the body content may be obtained from the hypertext document to be processed according to the preset xpath path of the body content, and the hierarchical relationship of the target hypertext tags is obtained from the DOM tree structure corresponding to the target hypertext data.
Step S1008: obtain the current hypertext tag according to the level of the previous current hypertext tag and the depth-first traversal algorithm. For example, the first time a current hypertext tag is obtained, the target hypertext tag of the first level is taken as the current hypertext tag. The second time, the first target hypertext tag of the second level is taken as the current hypertext tag. The third time, the first hypertext tag in the level below the first target hypertext tag of the second level is taken as the current hypertext tag, and so on, until every level of that branch has been obtained; acquisition then returns to the second target hypertext tag of the second level as the current hypertext tag.
Step S1010: determine whether the current data type indicated by the current hypertext tag is a non-text data type. If it is a non-text data type, go to step S1012; otherwise, go to step S1014.
Step S1012: obtain the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag. If the text buffer holds content, a section is generated from the content of the text buffer as the third web object and stored in the sectionlist. The current hypertext content corresponding to the current hypertext tag is then parsed, another section is generated from it as the first web object and stored in the sectionlist, and the method proceeds to step S1016.
Step S1014: obtain the current tag type corresponding to the current hypertext tag, and process the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object. If the tag is of the first type, the current hypertext content corresponding to the current hypertext tag is stored into the text buffer. If the tag is of the second type and the text buffer holds content, a section is generated from the content of the text buffer and stored in the sectionlist; after the text buffer is emptied, the current hypertext content corresponding to the current hypertext tag is stored into the text buffer as the content of the next web object to be generated. If the tag is an invalid tag, the corresponding hypertext content is replaced with a space, and the space is stored into the text buffer. If the tag is a comment tag, the corresponding hypertext content is discarded.
Step S1016: determine whether all target hypertext tags have been obtained. If not, return to step S1008; if so, go to step S1018.
Step S1018: obtain the sectionlist file to obtain the web object sequence.
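Steps S1004 to S1018 can be sketched end to end as follows. This is an illustrative simplification under several assumptions: the input is a well-formed XML fragment parsed with `xml.etree.ElementTree` (whose default parser drops comments, which stands in for discarding comment tags), only each element's leading text is used as its hypertext content, `img` is treated as the only non-text tag, and any content left in the buffer is flushed at the end. The function name `to_sections` and the dictionary layout of a section are hypothetical.

```python
import xml.etree.ElementTree as ET

INLINE = {"q", "sub", "sup", "code", "label"}   # first type (subset)
BLOCK = {"div", "p", "h1", "h2"}                # second type (subset)
INVALID = {"script", "select", "noscript"}      # invalid tags from the example
NON_TEXT = {"img"}                              # hypothetical non-text tag


def to_sections(fragment):
    root = ET.fromstring(fragment)              # S1006: build the tag hierarchy
    buffer, section_list = [], []               # S1004: empty buffer and list

    def flush():
        """Emit the buffered text as one section and empty the buffer."""
        text = "".join(buffer).strip()
        if text:
            section_list.append({"type": "text", "content": text})
        buffer.clear()

    for node in root.iter():                    # S1008: depth-first document order
        content = node.text or ""
        if node.tag in NON_TEXT:                # S1010 -> S1012
            flush()
            section_list.append({"type": node.tag, "content": node.get("src", "")})
        elif node.tag in INVALID:               # S1014: invalid tag becomes a space
            buffer.append(" ")
        elif node.tag in BLOCK:                 # S1014: second type
            flush()
            buffer.append(content)
        elif node.tag in INLINE:                # S1014: first type
            buffer.append(content)
    flush()                                     # assumed final flush before S1018
    return section_list                         # S1018: the web object sequence


sections = to_sections(
    '<div>Intro <code>x = 1</code><script>alert(1)</script>'
    '<img src="a.png"/><p>Second</p></div>'
)
print(sections)
```

The script content never reaches the output: it is replaced by a single space in the buffer, which is then trimmed when the text section is emitted.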
As shown in Fig. 11, in one embodiment a web data processing device is provided. The web data processing device may be integrated into the computer equipment 120 described above, and may specifically include a to-be-processed file acquisition module 1102, an extraction module 1104, an object generation module 1106, and a sequence composition module 1108.
The to-be-processed file acquisition module 1102 is configured to obtain the hypertext document to be processed corresponding to the web page to be processed.
The extraction module 1104 is configured to extract target content hypertext data from the hypertext document to be processed, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags.
The object generation module 1106 is configured to take each target hypertext tag in the target content hypertext data as the current hypertext tag and generate the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; and when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object.
The sequence composition module 1108 is configured to form a web object sequence from the web objects corresponding to the target hypertext tags.
In one embodiment, the extraction module includes:
A path data acquiring unit, configured to obtain target hypertext path data; and an extraction unit, configured to extract the target content hypertext data from the hypertext document to be processed according to the target hypertext path data.
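The two units of the extraction module can be illustrated with a path-based lookup. The sketch assumes a well-formed document and uses the limited XPath subset of Python's standard `xml.etree.ElementTree`; the sample page, the path `.//div[@id='content']`, and the function name `extract_target` are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<div id="nav">Home</div>
<div id="content"><p>Body text</p></div>
</body></html>"""

# Hypothetical preset path of the body content (ElementTree's XPath subset).
CONTENT_PATH = ".//div[@id='content']"


def extract_target(page, path=CONTENT_PATH):
    """Return the element matched by the preset path, or None if nothing matches."""
    root = ET.fromstring(page)
    return root.find(path)


target = extract_target(PAGE)
print(target.get("id"), target.find("p").text)
```

Storing the path as data rather than code means a different site layout only requires a different path string, which is in line with the path data being "obtained" by a dedicated unit.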
In one embodiment, the object generation module includes:
A to-be-generated object content obtaining unit, configured to, when the current tag type is the first type, obtain the content of the current web object to be generated according to the current hypertext content corresponding to the current hypertext tag.
An object obtaining unit, configured to, when the current tag type is the second type, obtain the content of the current web object to be generated, generate the second web object according to the content of the current web object to be generated, and use the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
In one embodiment, the content of the current web object to be generated is stored in the preset storage region, and the to-be-generated object content obtaining unit is configured to: when the current tag type is the first type, store the current hypertext content corresponding to the current hypertext tag into the preset storage region as content of the current web object to be generated.
The object obtaining unit is configured to: take the content currently stored in the preset storage region as the content of the current web object to be generated, and generate the second web object according to the content currently stored in the preset storage region; then delete the content currently stored in the preset storage region, and store the current hypertext content corresponding to the current hypertext tag into the preset storage region as the content of the next web object to be generated.
In one embodiment, the web data processing device further includes a replacement module, configured to, when the current hypertext tag is an invalid tag, replace the current hypertext content corresponding to the current hypertext tag with a space character and store the space character into the preset storage region.
In one embodiment, the web data processing device further includes a content obtaining module, configured to, when the current data type is a non-text data type, obtain the content of the current web object to be generated and generate a third web object according to the content of the current web object to be generated.
In one embodiment, as shown in Fig. 12, the web data processing device further includes:
A template obtaining module 1202, configured to obtain a web page hypertext template.
A filling module 1204, configured to fill each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document, in which the hypertext tag corresponding to each web object of the web object sequence is a block-level tag.
In one embodiment, as shown in Fig. 13, the web data processing device further includes:
An information obtaining module 1302, configured to obtain dynamic information and/or static information corresponding to the target content from the hypertext document to be processed.
The filling module 1204 is configured to: fill the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext document.
In one embodiment, the object generation module includes:
A level acquiring unit, configured to obtain the hierarchical relationship between the target hypertext tags.
A current tag acquiring unit, configured to obtain the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and the depth-first traversal algorithm.
The sequence composition module 1108 is configured to: form the web object sequence from the web objects corresponding to the target hypertext tags according to the parsing order of the target hypertext tags.
Fig. 14 shows the internal structure of the computer equipment in one embodiment. As shown in Fig. 14, the computer equipment includes a processor, a memory, a network interface, and an input device connected via a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer equipment stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the web data processing method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the web data processing method. The input device of the computer equipment may be a touch layer covering a display screen, a key, trackball, or trackpad provided on the housing of the computer equipment, or an external keyboard, trackpad, mouse, or the like.
Those skilled in the art will understand that the structure shown in Fig. 14 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer equipment to which the solution is applied. A specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the web data processing device provided by the present application may be implemented in the form of a computer program that can run on the computer equipment shown in Fig. 14. The memory of the computer equipment may store the program modules that make up the web data processing device, for example the to-be-processed file acquisition module 1102, the extraction module 1104, the object generation module 1106, and the sequence composition module 1108 shown in Fig. 11. The computer program formed by these program modules causes the processor to perform the steps of the web data processing method of the embodiments of the present application described in this specification.
For example, the computer equipment shown in Fig. 14 may obtain the hypertext document to be processed corresponding to the web page to be processed through the to-be-processed file acquisition module 1102 of the web data processing device shown in Fig. 11. Through the extraction module 1104, it extracts target content hypertext data from the hypertext document to be processed, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags. Through the object generation module 1106, it takes each target hypertext tag in the target content hypertext data as the current hypertext tag and generates the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; and when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object. Through the sequence composition module 1108, it forms a web object sequence from the web objects corresponding to the target hypertext tags.
In one embodiment, computer equipment is provided. The computer equipment includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor performs the following steps: obtaining the hypertext document to be processed corresponding to the web page to be processed; extracting target content hypertext data from the hypertext document to be processed, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; taking each target hypertext tag in the target content hypertext data as the current hypertext tag and generating the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object; and forming a web object sequence from the web objects corresponding to the target hypertext tags.
In one embodiment, the processing, performed by the processor, of the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object includes: when the current tag type is the first type, obtaining the content of the current web object to be generated according to the current hypertext content corresponding to the current hypertext tag; and when the current tag type is the second type, obtaining the content of the current web object to be generated, generating the second web object according to the content of the current web object to be generated, and using the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
In one embodiment, the content of the current web object to be generated is stored in the preset storage region. The step, performed by the processor, of obtaining the content of the current web object to be generated according to the current hypertext content corresponding to the current hypertext tag includes: storing the current hypertext content corresponding to the current hypertext tag into the preset storage region as content of the current web object to be generated. The step of obtaining the content of the current web object to be generated, generating the second web object according to that content, and using the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated includes: taking the content currently stored in the preset storage region as the content of the current web object to be generated, and generating the second web object according to the content currently stored in the preset storage region; and deleting the content currently stored in the preset storage region and storing the current hypertext content corresponding to the current hypertext tag into the preset storage region as the content of the next web object to be generated.
In one embodiment, the computer program further causes the processor to perform the following step: when the current hypertext tag is an invalid tag, replacing the current hypertext content corresponding to the current hypertext tag with a space character and storing the space character into the preset storage region.
In one embodiment, when the current data type is a non-text data type, before the processor obtains the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag, the computer program further causes the processor to perform the following step: obtaining the content of the current web object to be generated, and generating a third web object according to the content of the current web object to be generated.
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining a web page hypertext template; and filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document, in which the hypertext tag corresponding to each web object of the web object sequence is a block-level tag.
In one embodiment, the computer program further causes the processor to perform the following step: obtaining dynamic information and/or static information corresponding to the target content from the hypertext document to be processed. Filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document then includes: filling the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext document.
In one embodiment, the taking, performed by the processor, of each target hypertext tag in the target content hypertext data as the current hypertext tag includes: obtaining the hierarchical relationship between the target hypertext tags; and obtaining the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and the depth-first traversal algorithm. The forming, performed by the processor, of the web object sequence from the web objects corresponding to the target hypertext tags includes: forming the web object sequence from the web objects corresponding to the target hypertext tags according to the parsing order of the target hypertext tags.
In one embodiment, the extracting, performed by the processor, of the target content hypertext data from the hypertext document to be processed includes: obtaining target hypertext path data; and extracting the target content hypertext data from the hypertext document to be processed according to the target hypertext path data.
In one embodiment, a kind of computer readable storage medium is provided, is stored on computer readable storage medium Computer program, when computer program is executed by processor, so that processor executes following steps: it is corresponding to obtain webpage to be processed Hypertext document to be processed;Object content hypertext data, object content hypertext are extracted from hypertext document to be processed Data include one or more target hypertext tags and the corresponding hypertext content of target hypertext tags;By object content It is corresponding to generate each target hypertext tags as current hypertext tags for each target hypertext tags in hypertext data Web object, comprising: the current data type that current hypertext tags indicate is obtained, when current data type is non-textual number When according to type, corresponding first net of current hypertext tags is obtained according to the corresponding current hypertext content of current hypertext tags Page object obtains the corresponding current label type of current hypertext tags, root when current data type is text data type It is handled according to current label type current hypertext content corresponding to current hypertext tags, obtains the second web object; The corresponding web object of each target hypertext tags is formed into web object sequence.
In one embodiment, what processor executed is corresponding to current hypertext tags current according to current label type Hypertext content is handled, and obtaining the second web object includes: when current label type is the first kind, according to current super The corresponding current hypertext content of text label obtains the content of current web object to be generated;When current label type is the When two types, the content of current web object to be generated is obtained, generates the according to the content of current web object to be generated Two web objects, using the corresponding current hypertext content of current hypertext tags as in next web object to be generated Hold.
In one embodiment, the content for the current web object to be generated that processor executes is stored in default memory block Domain, the step of content of current web object to be generated is obtained according to current hypertext tags corresponding current hypertext content It include: by the corresponding current hypertext content storage of current hypertext tags into default storage region, as current to be generated Web object content;The content for obtaining current web object to be generated, according in current web object to be generated Hold and generate the second web object, using the corresponding current hypertext content of current hypertext tags as next webpage to be generated The content of object includes: using storage content currently stored in default storage region as in current web object to be generated Hold, the second web object is generated according to storage content currently stored in default storage region;Delete in default storage region when The storage content of preceding storage, by the corresponding current hypertext content storage of current hypertext tags into default storage region, with Content as next web object to be generated.
In one embodiment, the computer program further causes the processor to perform the following steps: when the current hypertext tag is an invalid tag, replacing the current hypertext content corresponding to the current hypertext tag with a space character, and storing the space character in the default storage region.
In one embodiment, before the step, executed by the processor, of obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag when the current data type is a non-text data type, the computer program further causes the processor to perform the following steps: when the current data type is a non-text data type, obtaining the content of the web object currently to be generated, and generating a third web object according to that content.
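The two behaviors above can be sketched together, purely as an illustration. `INVALID_TAGS` and `NON_TEXT_TAGS` are hypothetical sets chosen for the example, every other tag is simply accumulated, and the flush at the end of the input is an assumption that the text above does not state.

```python
INVALID_TAGS = {"script", "style"}  # assumption: examples of invalid tags
NON_TEXT_TAGS = {"img", "video"}    # assumption: examples of non-text tags


def process(tagged_content):
    """Turn (tag, content) pairs into a list of (kind, content) web objects."""
    buffer = ""
    sequence = []
    for tag, content in tagged_content:
        if tag in INVALID_TAGS:
            # Invalid tag: its content is replaced by a single space.
            buffer += " "
        elif tag in NON_TEXT_TAGS:
            # Pending text is emitted first (the "third web object") ...
            if buffer.strip():
                sequence.append(("text", buffer))
            buffer = ""
            # ... and then the non-text object itself (the "first web object").
            sequence.append((tag, content))
        else:
            buffer += content
    if buffer.strip():
        sequence.append(("text", buffer))  # assumed end-of-input flush
    return sequence
```

Emitting the pending text before the non-text object keeps the sequence in reading order, so an image that interrupts a run of text appears between the text before it and the text after it.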
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining a webpage hypertext template; and filling each web object in the web object sequence into the webpage hypertext template to obtain a corresponding target webpage hypertext document, wherein, in the target webpage hypertext document, the hypertext tag corresponding to each web object in the web object sequence is a block-level tag.
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining, from the hypertext document to be processed, dynamic information and/or static information corresponding to the object content. The step of filling each web object in the web object sequence into the webpage hypertext template to obtain the corresponding target webpage hypertext document includes: filling the dynamic information and/or static information, together with each web object in the web object sequence, into the webpage hypertext template to obtain the target webpage hypertext document.
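A minimal sketch of template filling, under stated assumptions: `PAGE_TEMPLATE`, its `$title` and `$body` placeholders, and the `render` function are hypothetical, and a page title stands in for the extra information extracted from the source document. Each web object is wrapped in a block-level element (`p` or `div`), as the text above requires.

```python
from html import escape
from string import Template

# Hypothetical template; the specification does not give the template's layout.
PAGE_TEMPLATE = Template("<html><body><h1>$title</h1>$body</body></html>")


def render(sequence, title):
    """Fill (kind, content) web objects and a title into the page template."""
    blocks = []
    for kind, content in sequence:
        if kind == "img":
            # Non-text object wrapped in a block-level container.
            blocks.append(f'<div><img src="{escape(content, quote=True)}"></div>')
        else:
            # Text object rendered as a block-level paragraph.
            blocks.append(f"<p>{escape(content)}</p>")
    return PAGE_TEMPLATE.substitute(title=escape(title), body="".join(blocks))
```

Escaping the content before substitution keeps markup characters in extracted text from being interpreted as tags in the generated page.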
In one embodiment, the step, executed by the processor, of taking each target hypertext tag in the object content hypertext data as the current hypertext tag includes: obtaining the hierarchical relationship among the target hypertext tags; and obtaining the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and a depth-first traversal algorithm. The step, executed by the processor, of forming the web object sequence from the web objects corresponding to the target hypertext tags includes: arranging the web objects corresponding to the target hypertext tags into the web object sequence in the order in which the target hypertext tags were parsed.
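The traversal order can be illustrated with a short sketch. It assumes well-formed markup so that Python's standard `xml.etree.ElementTree` parser can build the tag hierarchy; real webpages would generally need an HTML-tolerant parser. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def depth_first(element):
    """Yield an element, then each of its descendants, in pre-order."""
    yield element
    for child in element:
        yield from depth_first(child)


def tag_order(markup):
    """Return tag names in the order a depth-first traversal visits them."""
    return [el.tag for el in depth_first(ET.fromstring(markup))]


order = tag_order(
    "<div><p>one<strong>two</strong></p><img src='a.png'/><p>three</p></div>"
)
```

Here `order` lists `div`, `p`, `strong`, `img`, `p`: a nested tag is visited immediately after its parent and before the parent's next sibling, which matches document reading order.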
In one embodiment, the step, executed by the processor, of extracting the object content hypertext data from the hypertext document to be processed includes: obtaining target hypertext path data; and extracting the object content hypertext data from the hypertext document to be processed according to the target hypertext path data.
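As a hedged illustration, the target hypertext path data could be an XPath-like expression that locates the container of the main content. That interpretation is an assumption, as are the function name `extract_target` and the sample path. The sketch uses the limited XPath subset supported by Python's `xml.etree.ElementTree` and requires well-formed markup.

```python
import xml.etree.ElementTree as ET


def extract_target(markup, path):
    """Return (tag, text) pairs for the children of the node matched by path."""
    root = ET.fromstring(markup)
    node = root.find(path)
    if node is None:
        return []  # the path did not match anything in this document
    return [(child.tag, (child.text or "").strip()) for child in node]


page = (
    "<html><body>"
    "<div id='nav'><a>Home</a></div>"
    "<div id='content'><p>Hi</p><img src='a.png'/></div>"
    "</body></html>"
)
target = extract_target(page, ".//div[@id='content']")
```

Only the children of the matched container are returned, so navigation and other page chrome outside the path are left out of the extracted data.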
It should be understood that although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, these steps are not necessarily executed in that sequence. Unless expressly stated otherwise herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially; they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or an external cache.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments has been described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

Claims (15)

1. A web data processing method, the method comprising:
obtaining a hypertext document to be processed that corresponds to a webpage to be processed;
extracting object content hypertext data from the hypertext document to be processed, the object content hypertext data comprising one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags;
taking each target hypertext tag in the object content hypertext data as a current hypertext tag and generating the web object corresponding to each target hypertext tag, comprising: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining a first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; and when the current data type is a text data type, obtaining the current tag type corresponding to the current hypertext tag, and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object; and
forming a web object sequence from the web objects corresponding to the target hypertext tags.
2. The method according to claim 1, characterized in that processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object comprises:
when the current tag type is a first type, obtaining the content of the web object currently to be generated according to the current hypertext content corresponding to the current hypertext tag; and
when the current tag type is a second type, obtaining the content of the web object currently to be generated, generating the second web object according to that content, and taking the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
3. The method according to claim 2, characterized in that the content of the web object currently to be generated is stored in a default storage region, and the step of obtaining the content of the web object currently to be generated according to the current hypertext content corresponding to the current hypertext tag comprises:
storing the current hypertext content corresponding to the current hypertext tag in the default storage region as the content of the web object currently to be generated;
and the step of obtaining the content of the web object currently to be generated, generating the second web object according to that content, and taking the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated comprises:
taking the storage content currently stored in the default storage region as the content of the web object currently to be generated, and generating the second web object according to the storage content currently stored in the default storage region; and
deleting the storage content currently stored in the default storage region, and storing the current hypertext content corresponding to the current hypertext tag in the default storage region as the content of the next web object to be generated.
4. The method according to claim 3, characterized in that the method further comprises:
when the current hypertext tag is an invalid tag, replacing the current hypertext content corresponding to the current hypertext tag with a space character, and storing the space character in the default storage region.
5. The method according to any one of claims 1 to 4, characterized in that, before obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag when the current data type is a non-text data type, the method further comprises:
when the current data type is a non-text data type, obtaining the content of the web object currently to be generated, and generating a third web object according to that content.
6. The method according to claim 1, characterized in that the method further comprises:
obtaining a webpage hypertext template; and
filling each web object in the web object sequence into the webpage hypertext template to obtain a corresponding target webpage hypertext document, wherein, in the target webpage hypertext document, the hypertext tag corresponding to each web object in the web object sequence is a block-level tag.
7. The method according to claim 6, characterized in that the method further comprises:
obtaining, from the hypertext document to be processed, dynamic information and/or static information corresponding to the object content;
and filling each web object in the web object sequence into the webpage hypertext template to obtain the corresponding target webpage hypertext document comprises:
filling the dynamic information and/or static information, together with each web object in the web object sequence, into the webpage hypertext template to obtain the target webpage hypertext document.
8. The method according to claim 1, characterized in that taking each target hypertext tag in the object content hypertext data as the current hypertext tag comprises:
obtaining the hierarchical relationship among the target hypertext tags; and
obtaining the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and a depth-first traversal algorithm;
and forming the web object sequence from the web objects corresponding to the target hypertext tags comprises:
arranging the web objects corresponding to the target hypertext tags into the web object sequence in the order in which the target hypertext tags were parsed.
9. The method according to claim 1, characterized in that extracting the object content hypertext data from the hypertext document to be processed comprises:
obtaining target hypertext path data; and
extracting the object content hypertext data from the hypertext document to be processed according to the target hypertext path data.
10. A web data processing device, the device comprising:
a to-be-processed file obtaining module, configured to obtain a hypertext document to be processed that corresponds to a webpage to be processed;
an extraction module, configured to extract object content hypertext data from the hypertext document to be processed, the object content hypertext data comprising one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags;
an object generation module, configured to take each target hypertext tag in the object content hypertext data as a current hypertext tag and generate the web object corresponding to each target hypertext tag, comprising: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining a first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; and when the current data type is a text data type, obtaining the current tag type corresponding to the current hypertext tag, and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object; and
a sequence forming module, configured to form a web object sequence from the web objects corresponding to the target hypertext tags.
11. The device according to claim 10, characterized in that the object generation module comprises:
a to-be-generated object content obtaining unit, configured to, when the current tag type is a first type, obtain the content of the web object currently to be generated according to the current hypertext content corresponding to the current hypertext tag; and
an object obtaining unit, configured to, when the current tag type is a second type, obtain the content of the web object currently to be generated, generate the second web object according to that content, and take the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
12. The device according to claim 11, characterized in that the content of the web object currently to be generated is stored in a default storage region, and the to-be-generated object content obtaining unit is configured to:
when the current tag type is the first type, store the current hypertext content corresponding to the current hypertext tag in the default storage region as the content of the web object currently to be generated;
and the object obtaining unit is configured to:
take the storage content currently stored in the default storage region as the content of the web object currently to be generated, and generate the second web object according to the storage content currently stored in the default storage region; and
delete the storage content currently stored in the default storage region, and store the current hypertext content corresponding to the current hypertext tag in the default storage region as the content of the next web object to be generated.
13. The device according to claim 12, characterized in that the device further comprises:
a replacement module, configured to, when the current hypertext tag is an invalid tag, replace the current hypertext content corresponding to the current hypertext tag with a space character, and store the space character in the default storage region.
14. A computer device, characterized by comprising a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, it causes the processor to perform the steps of the web data processing method according to any one of claims 1 to 9.
15. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, it causes the processor to perform the steps of the web data processing method according to any one of claims 1 to 9.
CN201810236011.8A 2018-03-21 2018-03-21 Webpage data processing method, device, computer equipment and storage medium Active CN110309457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810236011.8A CN110309457B (en) 2018-03-21 2018-03-21 Webpage data processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810236011.8A CN110309457B (en) 2018-03-21 2018-03-21 Webpage data processing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110309457A true CN110309457A (en) 2019-10-08
CN110309457B CN110309457B (en) 2023-06-16

Family

ID=68073523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810236011.8A Active CN110309457B (en) 2018-03-21 2018-03-21 Webpage data processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110309457B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111596907A (en) * 2020-05-19 2020-08-28 北京字节跳动网络技术有限公司 File generation method, device, equipment and storage medium
CN111597487A (en) * 2020-05-06 2020-08-28 五八有限公司 Page data acquisition method and device, electronic equipment and storage medium
CN113378515A (en) * 2021-08-16 2021-09-10 宜科(天津)电子有限公司 Text generation system based on production data
CN116661803A (en) * 2023-07-31 2023-08-29 腾讯科技(深圳)有限公司 Processing method and device for multi-mode webpage template and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140359413A1 (en) * 2013-05-28 2014-12-04 Tencent Technology (Shenzhen) Company Limited Apparatuses and methods for webpage content processing
CN106547895A (en) * 2016-11-03 2017-03-29 北京锐安科技有限公司 Method and device for extracting webpage information
CN107153716A (en) * 2017-06-06 2017-09-12 百度在线网络技术(北京)有限公司 Webpage content extracting method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597487A (en) * 2020-05-06 2020-08-28 五八有限公司 Page data acquisition method and device, electronic equipment and storage medium
CN111596907A (en) * 2020-05-19 2020-08-28 北京字节跳动网络技术有限公司 File generation method, device, equipment and storage medium
CN113378515A (en) * 2021-08-16 2021-09-10 宜科(天津)电子有限公司 Text generation system based on production data
CN116661803A (en) * 2023-07-31 2023-08-29 腾讯科技(深圳)有限公司 Processing method and device for multi-mode webpage template and computer equipment
CN116661803B (en) * 2023-07-31 2023-11-17 腾讯科技(深圳)有限公司 Processing method and device for multi-mode webpage template and computer equipment

Also Published As

Publication number Publication date
CN110309457B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
US11294968B2 (en) Combining website characteristics in an automatically generated website
CN100465956C (en) System, web server and method for adding personalized value to web sites
US20160283499A1 (en) Webpage advertisement interception method, device and browser
CN110309457A (en) Web data processing method, device, computer equipment and storage medium
CN108399150B (en) Text processing method and device, computer equipment and storage medium
CN101593186B (en) Visual website editing method and visual website editing system
CN108717437B (en) Search result display method and device and storage medium
CN113515928B (en) Electronic text generation method, device, equipment and medium
CN112100550A (en) Page construction method and device
CN104750851A (en) Webpage content lazy loading method and system
CN109933751B (en) Image-text drawing method and device, computer-readable storage medium and computer equipment
CN108595697A (en) Webpage integrated approach, apparatus and system
US20170109442A1 (en) Customizing a website string content specific to an industry
CN106294885A (en) Data collection and annotation method for heterogeneous webpages
CN114791988A (en) Browser-based PDF file analysis method, system and storage medium
CN109558123B (en) Method for converting webpage into electronic book, electronic equipment and storage medium
CN109948085A (en) Browser kernel initialization method and device, computing equipment and storage medium
CN112433995A (en) File format conversion method, system, computer equipment and storage medium
CN106951429B (en) Method, browser and equipment for enhancing webpage comment display
CN115577683B (en) HTML rich text content conversion method, device, equipment and medium
CN113139145B (en) Page generation method and device, electronic equipment and readable storage medium
KR20210098813A (en) Apparatus of crawling and analyzing text data and method thereof
CN108664511B (en) Method and device for acquiring webpage information
US20210397663A1 (en) Data reduction in a tree data structure for a wireframe
CN115599367A (en) Method for collecting and sorting energy big data and establishing visual platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant