CN110309457A - Web data processing method, device, computer equipment and storage medium - Google Patents
Web data processing method, device, computer equipment and storage medium
- Publication number
- CN110309457A (application CN201810236011.8A)
- Authority
- CN
- China
- Prior art keywords
- hypertext
- current
- content
- tags
- web object
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a web data processing method, a device, computer equipment and a storage medium. The method comprises: obtaining the to-be-processed hypertext file corresponding to a to-be-processed web page; extracting target content hypertext data from the to-be-processed hypertext file, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; taking each target hypertext tag in the target content hypertext data as the current hypertext tag and generating the web object corresponding to each target hypertext tag; and composing the web objects corresponding to the target hypertext tags into a web object sequence. The method can reduce the occupation of computer resources.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a web data processing method, device, computer equipment and storage medium.
Background
With the rapid development of the Internet, web pages have become a carrier for publishing and sharing information, and Internet users can publish all kinds of content on web pages, such as news and product introductions.
At present, besides the content to be published, a web page carries a lot of other information, such as advertisements, navigation and copyright information. Therefore, when the published content is converted for publication on, or saved to, another platform, the data of the entire web page has to be obtained; the data volume is large and occupies computer resources.
Summary of the invention
In view of this, it is necessary to address the above problem by providing a web data processing method, device, computer equipment and storage medium that extract target content hypertext data from the to-be-processed hypertext file corresponding to a to-be-processed web page, process the hypertext content separately according to the data type indicated by each hypertext tag of the target content hypertext data, and, when the data type indicated by a hypertext tag is the text data type, further process the hypertext content according to the tag type to obtain the web object sequence corresponding to the target web page content. The target page content is obtained efficiently, the data volume is reduced, and the occupation of computer resources is reduced.
A web data processing method, the method comprising: obtaining the to-be-processed hypertext file corresponding to a to-be-processed web page; extracting target content hypertext data from the to-be-processed hypertext file, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; taking each target hypertext tag in the target content hypertext data as the current hypertext tag and generating the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object; and composing the web objects corresponding to the target hypertext tags into a web object sequence.
A web data processing device, the device comprising: a to-be-processed file obtaining module for obtaining the to-be-processed hypertext file corresponding to a to-be-processed web page; an extraction module for extracting target content hypertext data from the to-be-processed hypertext file, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; an object generation module for taking each target hypertext tag in the target content hypertext data as the current hypertext tag and generating the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object; and a sequence composing module for composing the web objects corresponding to the target hypertext tags into a web object sequence.
In one embodiment, the device further includes a content obtaining module for, when the current data type is a non-text data type, obtaining the content of the current to-be-generated web object and generating a third web object according to that content.
In one embodiment, the device further includes: a template obtaining module for obtaining a web page hypertext template; and a filling module for filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext file, wherein in the target web page hypertext file the hypertext tag corresponding to each web object in the web object sequence is a block-level tag.
In one embodiment, the device further includes an information obtaining module for obtaining, from the to-be-processed hypertext file, the dynamic information and/or static information corresponding to the target content. The filling module is configured to fill the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext file.
In one embodiment, the object generation module includes: a hierarchy obtaining unit for obtaining the hierarchical relationship between the target hypertext tags; and a current tag obtaining unit for obtaining the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and a depth-first traversal algorithm. The sequence composing module is configured to compose the web objects corresponding to the target hypertext tags into the web object sequence according to the parsing order of the target hypertext tags.
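The depth-first order in which nested tags become the current tag can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the target content is well-formed enough to be parsed by Python's standard xml.etree.ElementTree, and the function name depth_first_tags is hypothetical.

```python
import xml.etree.ElementTree as ET

def depth_first_tags(element):
    """Return tag names in depth-first (document) order, parent before children."""
    order = []
    stack = [element]
    while stack:
        node = stack.pop()
        order.append(node.tag)
        # Push children in reverse so the first child is visited next.
        stack.extend(reversed(list(node)))
    return order

# Hypothetical nested target content.
root = ET.fromstring("<div><p><b/></p><h4/></div>")
parsing_order = depth_first_tags(root)
```

Composing the web objects in this same order keeps them in the order in which the content appears on the page.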
In one embodiment, the extraction module includes: a path data obtaining unit for obtaining target hypertext path data; and an extraction unit for extracting the target content hypertext data from the to-be-processed hypertext file according to the target hypertext path data.
Computer equipment, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the steps of the above web data processing method.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, causes the processor to execute the steps of the above web data processing method.
The above web data processing method, device, computer equipment and storage medium extract target content hypertext data from the to-be-processed hypertext file corresponding to a to-be-processed web page, process the hypertext content separately according to the data type indicated by each hypertext tag of the target content hypertext data, and, when the data type indicated by a hypertext tag is the text data type, further process the hypertext content according to the tag type to obtain the web object sequence corresponding to the target web page content. The target page content is obtained efficiently, the data volume is reduced, and the occupation of computer resources is reduced.
Brief description of the drawings
Fig. 1 is a diagram of the application environment of the web data processing method provided in one embodiment;
Fig. 2 is a schematic diagram of the path configuration interface in one embodiment;
Fig. 3 is a flow chart of the web data processing method in one embodiment;
Fig. 4 is a flow chart of the web data processing method in one embodiment;
Fig. 5 is a flow chart of taking each target hypertext tag in the target content hypertext data as the current hypertext tag in one embodiment;
Fig. 6 is a schematic diagram of the hypertext tag hierarchy in one embodiment;
Fig. 7 is a flow chart of the web data processing method in one embodiment;
Fig. 8 is a flow chart of the web data processing method in one embodiment;
Fig. 9 is a schematic diagram of the target web page in one embodiment;
Fig. 10 is a flow chart of the web data processing method in one embodiment;
Fig. 11 is a structural block diagram of the web data processing device in one embodiment;
Fig. 12 is a structural block diagram of the web data processing device in one embodiment;
Fig. 13 is a structural block diagram of the web data processing device in one embodiment;
Fig. 14 is a block diagram of the internal structure of the computer equipment in one embodiment.
Detailed description
The present invention is further described in detail below with reference to the accompanying drawings and embodiments.
It can be understood that the terms "first", "second" and so on used in this application may be used herein to describe various elements, but unless otherwise stated these elements are not limited by these terms. The terms are only used to distinguish one element from another. For example, without departing from the scope of this application, a first web object may be called a second web object, and similarly a second web object may be called a first web object.
Fig. 1 is a diagram of the application environment of the web data processing method provided in one embodiment. As shown in Fig. 1, the application environment includes a terminal 110 and computer equipment 120. When the target content on a to-be-processed web page is to be obtained, for example when news shown on a desktop web page needs to be converted into news displayed in a mobile phone application, the computer equipment 120 obtains the to-be-processed hypertext file corresponding to the to-be-processed web page and then executes the web data processing method provided by the embodiments of the present invention to obtain the web object sequence corresponding to the target page content. After obtaining the web object sequence, the computer equipment 120 may send it to the terminal 110, and the terminal 110 displays each web object according to the web object sequence. Each web object may serve as one paragraph of the web page displayed on the terminal 110. The computer equipment 120 may be an independent physical server or terminal, a server cluster composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud servers, cloud databases, cloud storage and CDN. The terminal 110 may be, but is not limited to, a smartphone, tablet computer, laptop, desktop computer, smart speaker or smartwatch. The terminal 110 and the computer equipment 120 may be connected by Bluetooth, USB (Universal Serial Bus), a network or another communication connection; the present invention places no limitation on this.
It should be noted that the application environment provided by the embodiments of this application is only an example and does not limit the web data processing method provided by the embodiments of the present invention. The method may also be applied in other application environments; for example, the computer equipment 120 may generate the corresponding target web page directly on the computer equipment 120 according to the obtained web object sequence.
As shown in Fig. 2, in one embodiment a web data processing method is proposed. This embodiment is mainly illustrated by applying the method to the computer equipment 120 in Fig. 1 above. The method may specifically include the following steps:
Step S202: obtain the to-be-processed hypertext file corresponding to the to-be-processed web page.
Specifically, the to-be-processed web page is a web page from which target content, such as body content, needs to be extracted; it is generated from the to-be-processed hypertext file. A hypertext file is a file written in HTML (HyperText Markup Language). For example, to generate a web page, a browser obtains the hypertext file and generates the page from it. The to-be-processed hypertext file may be obtained by crawler software or extracted directly from a server. For example, when a desktop web page needs to be converted into a page displayed in a mobile phone application, the to-be-processed hypertext file may be downloaded from the server that stores it.
Step S204: extract target content hypertext data from the to-be-processed hypertext file, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags.
Specifically, target content hypertext data is the hypertext data corresponding to the target content that needs to be obtained from the to-be-processed web page; which data to obtain can be set according to actual needs. For example, when the body of a news web page needs to be obtained, the target content hypertext data is the hypertext data corresponding to the body. Hypertext tags identify the category or attributes of hypertext content and may include, for example, tt, abbr, acronym, image, fieldset, figcaption and form; the target hypertext tags differ depending on the to-be-processed hypertext file. The hypertext content corresponding to a target hypertext tag is either the content displayed in the to-be-processed web page or data from which the displayed content can be obtained. For an image in the web page, for example, the hypertext content may be the URL (Uniform Resource Locator) address of the image, from which the image can be obtained. For text data in the web page, the hypertext content may be the content displayed in the page. A piece of hypertext content may be marked by one or several pairs of tags. A pair of hypertext tags consists of a start tag and an end tag. A start tag consists of a less-than sign "<", the tag name and a greater-than sign ">". An end tag differs from a start tag in that a slash "/" follows the less-than sign; for example, <div> and </div> are a start tag and an end tag respectively. In "<div>this is hypertext content</div>", "this is hypertext content" is the hypertext content corresponding to the div tag. The number of target hypertext tags is determined by the extracted target content hypertext data and is not specifically limited.
In one embodiment, the target content hypertext data may be extracted from the to-be-processed hypertext file according to a preset path. Extracting the target content hypertext data from the to-be-processed hypertext file includes: obtaining target hypertext path data, and extracting the target content hypertext data from the to-be-processed hypertext file according to the target hypertext path data.
Specifically, the path data may be XPath data. XPath (XML Path Language; XML stands for eXtensible Markup Language) is used here as a language for locating data in an HTML document, and the corresponding data in a hypertext file can be obtained according to an XPath path. The target hypertext path data is determined by the specific web page and the target content to be extracted. An XPath configuration interface may be provided, on which the XPath path of each content item in the to-be-processed web page is set. As shown in Fig. 3, title, publishtime, author, commennum, promoteimage and content in the name column denote the title, publication time, author, number of comments, promotional picture and body content of the to-be-processed web page respectively. The hypertext data corresponding to each of these items can therefore be obtained according to the configured XPath paths. Assuming the body content is the target content, the corresponding target content hypertext data can be obtained according to the path data //*[@id="main_content"]. Here "//" means searching the entire hypertext file, "*" matches any node, and [@id="main_content"] selects the hypertext data whose id attribute is "main_content", which is the target content hypertext data corresponding to the body content.
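As a rough illustration of path-based extraction, the sketch below applies the path from the text to a small page. It is an assumption-laden example, not the patent's implementation: the page markup is hypothetical and well-formed so that Python's standard xml.etree.ElementTree can parse it (real-world HTML would normally need an HTML parser), and ElementTree's limited XPath support requires the relative form .//* rather than //*.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """<html><body>
<div id="nav">Home | News</div>
<div id="main_content"><h1>Title</h1><p>First paragraph.</p><img src="a.png"/></div>
<div id="footer">Copyright</div>
</body></html>"""

def extract_target(page, path=".//*[@id='main_content']"):
    """Return the first element matched by the configured path, or None."""
    root = ET.fromstring(page)
    return root.find(path)

target = extract_target(PAGE)
# The direct children are the candidate target hypertext tags.
target_tags = [child.tag for child in target]
```

Only the subtree under the matched node is kept, so navigation and footer markup never reach the later processing steps.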
Step S206: take each target hypertext tag in the target content hypertext data as the current hypertext tag and generate the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag according to the current hypertext content corresponding to the current hypertext tag; and when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object.
Specifically, a web object is an object displayed on a web page and may represent a complete, independent piece of content in the page. For example, a picture, a video, or the text of one paragraph on the to-be-processed web page may each correspond to one web object. The current data type indicated by the current hypertext tag is the data type of the content that is displayed in the to-be-processed web page according to the current hypertext content; data types include non-text data types and the text data type. For image, audio and video tags, the corresponding hypertext content is displayed in the web page as an image, audio and video respectively, so the data type indicated by these tags is a non-text data type, whereas tags such as div, h4, acronym and abbr indicate the text data type. In one embodiment, the data type indicated by every hypertext tag other than the image, audio and video tags may be treated as the text data type. Tag types may be classified as needed. For example, tags may be divided into block-level tags and inline tags: an inline tag is a tag whose hypertext content can be displayed on the same line as the hypertext content of other tags, and a block-level tag is a tag whose hypertext content must start on a new line. For a non-text data type, the current hypertext content corresponding to the current hypertext tag may be obtained and used as one web object, namely the first web object. For the text data type, a correspondence between tag types and processing modes is preset, so the current tag type corresponding to the current hypertext tag can be further obtained, and the current hypertext content is processed in the processing mode corresponding to the current tag type to obtain the second web object. After the target content hypertext data is obtained, each target hypertext tag is taken as the current hypertext tag in turn and its hypertext content is processed to obtain a web object. The target hypertext tags may be taken as the current hypertext tag in the order in which the tags are arranged; when the hypertext tags are nested, the hierarchical relationship between the target hypertext tags may be obtained first and the tags taken as the current hypertext tag according to that hierarchy.
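The two-level classification described above (data type first, then tag type for text) might be sketched like this. The tag sets are small illustrative subsets chosen for the example, "img" is added as an assumption because it is the element HTML actually uses for images, and the function names are hypothetical.

```python
# Tags whose content is shown as media rather than text (illustrative subset;
# "img" is an assumption added alongside the "image" tag named in the text).
NON_TEXT_TAGS = {"image", "img", "audio", "video"}

# A few block-level tags; any other text tag is treated as inline here.
BLOCK_TAGS = {"div", "p", "section", "h1", "h2", "h3", "h4", "h5", "h6"}

def data_type(tag):
    """Return "non_text" for media tags and "text" for everything else."""
    return "non_text" if tag.lower() in NON_TEXT_TAGS else "text"

def tag_type(tag):
    """Return "block" or "inline" for a tag that indicates the text data type."""
    return "block" if tag.lower() in BLOCK_TAGS else "inline"

def processing_mode(tag):
    """Pick the branch of step S206 that would handle this tag."""
    if data_type(tag) == "non_text":
        return "first_web_object"
    return "second_web_object:" + tag_type(tag)
```

Treating every tag outside the media set as text mirrors the embodiment in which only image, audio and video tags are non-text.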
In one embodiment, it may also be determined whether the current tag is a comment tag; if so, the current hypertext content corresponding to the comment tag may be discarded. For example, a comment tag has the form <!-- -->, with the comment written between the opening "<!--" and the closing "-->". When the current hypertext tag is <!-- this is an abbreviation -->, the corresponding current hypertext content "this is an abbreviation" is discarded.
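Discarding comment content can be illustrated with Python's standard html.parser module. This is only a sketch under the assumption that an event-based parser is used; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TextWithoutComments(HTMLParser):
    """Collect visible text and drop the content of comment tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

    def handle_comment(self, data):
        # Comment content is deliberately discarded.
        pass

def visible_text(markup):
    parser = TextWithoutComments()
    parser.feed(markup)
    parser.close()
    return parser.chunks

chunks = visible_text("<p>Body text<!-- this is an abbreviation --></p>")
```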
In one embodiment, when the current hypertext tag is of a format tag type, for example a font tag, the format information corresponding to the format tag and the corresponding current hypertext content may be obtained, and the correspondence between the format information and the current hypertext content stored. The format information may be, for example, bold, italic or font color.
Step S208: compose the web objects corresponding to the target hypertext tags into a web object sequence.
Specifically, after the web object corresponding to each target hypertext tag is obtained, the web objects are combined to obtain the web object sequence. They may be combined in the parsing order of the tags, that is, in the order in which the target hypertext tags were taken as the current hypertext tag. The web object sequence may be stored as a sectionlist. A sectionlist is a grouped-list component, and one web object corresponds to one section of the sectionlist. The data type indicated by the target hypertext tag may also be stored with each web object in the sequence; the web object and its data type may be stored in JSON format.
In one embodiment, after the web object sequence is obtained, a target web page is obtained according to the web object sequence and displayed. For example, when the to-be-processed web page is a page introducing a target application, each web object may be shown in the introduction interface of the target application in application download software. Each web object may correspond to one paragraph.
In one embodiment, when the correspondence between format information and current hypertext content has been obtained, the corresponding content in the target web page may be formatted according to the format information. For example, when the format information specifies bold text, the corresponding content in the target web page may be made bold.
In one embodiment, other information in the to-be-processed web page may also be obtained, for example at least one of the dynamic information or static information corresponding to the target content, and then displayed on the target web page. Static information is information that does not change over time, and dynamic information is information that can change over time. Static information may include the title, publication time and author of the target content; dynamic information may include the number of reads, comments, likes and video plays of the target content. Taking a news web page as an example, in the obtained web object sequence and static information of the target content, title, author and publishTimes correspond to the title, author and publication time of the target content respectively. In the sectionlist, type indicates the data type, where the non-text data types may be divided into an image type, an audio type and a video type. The content inside one pair of braces corresponds to one web object and the descriptive information of that web object. For example, the content that starts with "type": "image" and ends with "source": "http://www.qq.com/image.png" is one section, containing the web object corresponding to an image tag and the data type it indicates. width and height indicate the width and height of the image, and source indicates the source address of the image.
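A structure of this kind might look like the following when built and serialised in Python. Only the key names title, author, publishTimes, sectionlist, type, width, height and source, and the image address, come from the description; every value, and the "content" key used for text sections, is a hypothetical placeholder.

```python
import json

# Hypothetical values; the key names follow the description above.
page = {
    "title": "Example headline",
    "author": "Example author",
    "publishTimes": "2018-03-21 10:00",
    "sectionlist": [
        # "content" is an assumed key name for the text of a text section.
        {"type": "text", "content": "First paragraph of the body."},
        {
            "type": "image",
            "width": 640,
            "height": 360,
            "source": "http://www.qq.com/image.png",
        },
    ],
}

encoded = json.dumps(page, ensure_ascii=False, indent=2)
decoded = json.loads(encoded)
```

Each dictionary in sectionlist is one section, so a client can render the list entry by entry in order.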
In one embodiment, after the web object sequence is obtained, the text web objects in the sequence, that is, those whose hypertext tags indicate the text data type, may also be obtained and spliced together to obtain the target text content, which serves as the text content corresponding to the to-be-processed web page. The target text content may be used as the summary of the to-be-processed web page in search results during web search, or as the text content of the to-be-processed web page when building an inverted index between web page keywords and web pages.
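Splicing the text web objects could be as simple as the sketch below. It assumes each section is a dictionary with a "type" key and, for text sections, a hypothetical "content" key; the newline separator is likewise an assumption.

```python
def target_text(section_list, separator="\n"):
    """Join the content of all text sections, keeping their original order."""
    return separator.join(
        section["content"]
        for section in section_list
        if section.get("type") == "text"
    )

# Hypothetical web object sequence.
sections = [
    {"type": "text", "content": "Opening paragraph."},
    {"type": "image", "source": "http://www.qq.com/image.png"},
    {"type": "text", "content": "Closing paragraph."},
]
summary_source = target_text(sections)
```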
The above web data processing method, device, computer equipment and storage medium extract target content hypertext data from the to-be-processed hypertext file corresponding to a to-be-processed web page, process the hypertext content separately according to the data type indicated by each hypertext tag of the target content hypertext data, and, when the data type indicated by a hypertext tag is the text data type, further process the hypertext content according to the tag type to obtain the web object sequence corresponding to the target web page content. The target page content is obtained efficiently, the data volume is reduced, and the occupation of computer resources is reduced.
In one embodiment, before the current hypertext content corresponding to the current hypertext tag is processed according to the current tag type, as shown in Fig. 4, the web data processing method includes step S402: determine whether the current hypertext tag is of the first type or the second type. If it is of the first type, proceed to step S404; if it is of the second type, proceed to step S406.
Specifically, which tags belong to the first type and which to the second type can be set according to actual needs. In one embodiment, the first type may be the inline tag type and the second type the block-level tag type. In one embodiment, tags of the first type may include tt, abbr, acronym, cite, code, dfn, kbd, samp, var, bdo, br, map, object, q, sub, sup, button, input, label and textarea, and tags of the second type may include a, address, article, aside, blockquote, canvas, dd, div, dl, fieldset, figcaption, form, hgroup, hr, ol, output, p, pre, section, h1, h2, h3, h4, h5 and h6.
In one embodiment, as shown in Fig. 4, processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object includes the following steps:
Step S404: when the current tag type is the first type, obtain the content of the current to-be-generated web object according to the current hypertext content corresponding to the current hypertext tag.
Specifically, for the current hypertext content corresponding to a current hypertext tag of the first type, the current hypertext content may be taken as content of the current to-be-generated web object; when a web object needs to be generated, it is generated from the content of the current to-be-generated web object.
In one embodiment, a preset storage area, such as a text buffer, may be set up in advance to store the content of the current to-be-generated web object. When the current tag type is the first type, obtaining the content of the current to-be-generated web object according to the current hypertext content corresponding to the current hypertext tag includes: storing the current hypertext content corresponding to the current hypertext tag in the preset storage area as content of the current to-be-generated web object. For example, when the current tag type of the current hypertext tag is inline, the current hypertext content corresponding to the current hypertext tag may be stored in the text buffer, and the next target hypertext tag is then taken as the current hypertext tag. When the current data type indicated by the next target hypertext tag is the text data type and its current tag type is the first type, the hypertext content corresponding to that tag is likewise stored in the text buffer as content of the current to-be-generated web object.
Step S406: when the current tag type is the second type, obtain the content of the current to-be-generated web object, generate the second web object according to that content, and take the current hypertext content corresponding to the current hypertext tag as the content of the next to-be-generated web object.
Specifically, when the current tag type is the second type, the content of the current to-be-generated web object is obtained and combined to produce the second web object. The current hypertext content corresponding to the current hypertext tag is then taken as the content of the next to-be-generated web object.
In one embodiment, when the content of current web object to be generated is stored in default storage region,
Then step S406 may include: using storage content currently stored in default storage region as current web object to be generated
Content, the second web object is generated according to currently stored storage content in default storage region, deletes default storage region
In currently stored storage content, by the corresponding current hypertext content storage of current hypertext tags to default storage region
In, using the content as next web object to be generated.
Specifically, when the type for obtaining current hypertext tags is Second Type, then available default storage region
In currently stored content, generate the second web object.After generating the second web object, delete current in default storage region
The content of storage, and the corresponding current hypertext content storage of current hypertext tags is preset in storage region to this, as
The content of next current web object to be generated, continues to obtain next target hypertext tags as current hypertext mark
Label.
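The buffering behaviour for first-type and second-type tags can be sketched as follows. This is only an illustrative sketch: the class name `TextAccumulator`, the tag sets, and the assumption that the second type means block-level tags are not taken from the text, and any tag not listed as block-level is treated here as inline.

```python
from dataclasses import dataclass, field

# Assumed examples of "first type" (inline) and "second type" (block-level) tags.
INLINE_TAGS = {"span", "a", "em", "strong"}
BLOCK_TAGS = {"p", "div", "h1", "h2", "li"}


@dataclass
class TextAccumulator:
    buffer: list[str] = field(default_factory=list)   # the preset storage region
    objects: list[str] = field(default_factory=list)  # web objects generated so far

    def feed(self, tag: str, content: str) -> None:
        """Process the content of one text-type tag."""
        if tag in BLOCK_TAGS:
            # Second type: emit what is buffered, then start the next object.
            self.flush()
        # First type (and the start of a new block): keep accumulating.
        self.buffer.append(content)

    def flush(self) -> None:
        """Turn the buffered content into one web object and clear the buffer."""
        text = "".join(self.buffer).strip()
        if text:
            self.objects.append(text)
        self.buffer.clear()
```

Feeding the pairs `("p", "Hello ")`, `("span", "world")`, `("p", "Next")`, `("em", "!")` and then calling `flush()` once at the end yields two web objects, one per block-level tag, with the inline content merged into the object it belongs to.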
In one embodiment, before the step of obtaining, when the current data type is a non-text data type, the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag, the method further includes: when the current data type is a non-text data type, obtaining the content of the current web object to be generated, and generating a third web object from that content.
Specifically, when the current hypertext tag indicates a non-text data type, the content of the current web object to be generated may be obtained to generate the third web object. For example, when the content of the current web object to be generated is stored in the preset storage region and the preset storage region holds content, the stored content may be obtained to generate the third web object. The content stored in the preset storage region is then deleted, and the first web object corresponding to the current hypertext tag is obtained from the current hypertext content corresponding to the current hypertext tag. It can be understood that the resulting third web object is also one of the web objects that make up the web object sequence.
In this embodiment of the present invention, the hypertext content corresponding to a current hypertext tag of the first type can be stored in the preset storage region. When the data type indicated by the next target hypertext tag is the text data type and its tag type is the first type, the corresponding hypertext content continues to be stored in the preset storage region. Only when the data type indicated by a subsequent target hypertext tag is a non-text data type, or is the text data type with a tag type of the second type, is the content in the preset storage region obtained to generate a web object. The resulting web objects are therefore complete and independent.
In one embodiment, the web data processing method further includes: when the current hypertext tag is an invalid tag, replacing the current hypertext content corresponding to the current hypertext tag with a space character, and storing the space character in the preset storage region.
Specifically, which tags count as invalid tags can be configured according to actual needs. For example, when a desktop version of a web page needs to be converted to a mobile version, one or more of the script, select and noscript tags may be treated as invalid tags. Once an invalid tag is identified, its corresponding current hypertext content is replaced with a space character, and the space character is then stored in the preset storage region. Replacing the hypertext content corresponding to invalid tags with space characters makes the resulting target content concise and its layout clear.
In one embodiment, as shown in Fig. 5, taking each target hypertext tag in the target content hypertext data as the current hypertext tag in step S206 includes:
Step S502: obtain the hierarchical relationship between the target hypertext tags.
Specifically, the hierarchical relationship between target hypertext tags refers to the level relationship among the target hypertext tags in the target hypertext data. After the target hypertext data is obtained, a DOM (Document Object Model) parser may be used to parse it into a DOM tree structure. The DOM defines a set of platform- and language-independent interfaces that allow programs and scripts to dynamically access and modify the content, structure and style of a document. A DOM parser can parse a hypertext document into a DOM tree according to the order of its tag pairs, which yields the hierarchical relationship between the target hypertext tags. For example, suppose that in the target hypertext data the target hypertext tags appear in the order <a><b><b1></b1><b2></b2></b><c></c></a>. Then the a tag is at the first level, the b and c tags are at the second level, and the b1 and b2 tags are at the level below the b tag. The resulting hierarchical relationship is shown in Fig. 6.
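The level assignment in this example can be reproduced with a short sketch built on Python's standard `html.parser` module. This is a simplification rather than a full DOM parser: it only tracks nesting depth, the names `LevelParser` and `tag_levels` are hypothetical, and it assumes every start tag has a matching end tag (void elements such as br would need special handling).

```python
from html.parser import HTMLParser


class LevelParser(HTMLParser):
    """Records each start tag together with its nesting level (1 = top level)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.levels = []  # (tag, level) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.levels.append((tag, self.depth))

    def handle_endtag(self, tag):
        self.depth -= 1


def tag_levels(markup):
    parser = LevelParser()
    parser.feed(markup)
    parser.close()
    return parser.levels
```

Calling `tag_levels("<a><b><b1></b1><b2></b2></b><c></c></a>")` places a at level 1, b and c at level 2, and b1 and b2 at level 3, matching the hierarchy described above.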
Step S504: obtain the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and a depth-first traversal algorithm.
Specifically, the depth-first traversal algorithm means that, when current hypertext tags are obtained from the target hypertext tags, tags are obtained along one branch of a level until every level beneath it has been visited, and only then does the traversal return to obtain a target hypertext tag on another branch as the current hypertext tag. To obtain the current hypertext tag, the level of the previous current hypertext tag is needed; the first target hypertext tag at the next level below the previous current hypertext tag is then obtained as the current hypertext tag according to the depth-first traversal algorithm. Taking the hierarchical relationship of Fig. 6 as an example, the a tag at the first level is first taken as the current hypertext tag; after the hypertext content corresponding to the a tag has been processed, the b, b1, b2 and c tags are taken as the current hypertext tag in turn.
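The traversal order for the Fig. 6 example can be checked with a small sketch. The nested-tuple representation of the tree and the function name `depth_first` are assumptions made for illustration only.

```python
# The hierarchy of the example: each node is (tag, list of child nodes).
TREE = ("a", [("b", [("b1", []), ("b2", [])]), ("c", [])])


def depth_first(node):
    """Yield tags in pre-order: a node first, then each child branch in turn."""
    tag, children = node
    yield tag
    for child in children:
        yield from depth_first(child)
```

Iterating over `depth_first(TREE)` visits the whole b branch (b, b1, b2) before returning to c, which is the order in which the tags become the current hypertext tag.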
In one embodiment, forming the web objects corresponding to the target hypertext tags into a web object sequence in step S208 includes: forming the web objects corresponding to the target hypertext tags into a web object sequence according to the parsing order of the target hypertext tags.
Specifically, the parsing order of the target hypertext tags is the order in which the target hypertext tags are taken as the current hypertext tag. The web object sequence is formed in that order; that is, the order of the web objects in the sequence follows the parsing order of the target hypertext tags. Taking the hierarchical relationship of Fig. 6 as an example, the web objects in the web object sequence may be, in order, the web objects corresponding to the a, b, b1, b2 and c tags.
In one embodiment, each time a web object is obtained, it may be stored in a sectionlist as one section; once the last web object has been generated and stored in the sectionlist as a section, the web object sequence is obtained.
In one embodiment, as shown in Fig. 7, the web data processing method may further include the following steps:
Step S702: obtain a web page hypertext template.
Specifically, the web page hypertext template is set in advance; it may be, for example, a preset hypertext template for mobile web pages. The template can be configured according to actual needs.
Step S704: fill each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document, in which the hypertext tag corresponding to each web object in the web object sequence is a block-level tag.
Specifically, the filling positions of the web objects in the web page hypertext template may be set in advance. The web objects may be filled in according to their order in the web object sequence, and a corresponding block-level tag may be added before each web object, so that when the target web page is displayed according to the target web page hypertext document, each web object corresponds to one paragraph on the target web page.
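A minimal sketch of the filling step follows, assuming the web objects are plain text strings and using the p tag as the block-level tag. The template text, the placeholder name `sections` and the function name `fill_template` are hypothetical.

```python
from html import escape
from string import Template

# Hypothetical mobile page template with one preset filling position.
PAGE_TEMPLATE = Template("<html><body>\n$sections\n</body></html>")


def fill_template(web_objects):
    """Wrap each web object in a block-level <p> tag, in sequence order."""
    sections = "\n".join(f"<p>{escape(text)}</p>" for text in web_objects)
    return PAGE_TEMPLATE.substitute(sections=sections)
```

Because every web object ends up in its own block-level element, each one is rendered as a separate paragraph when the resulting document is displayed.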
In one embodiment, as shown in Fig. 8, the web data processing method may further include step S802: obtaining dynamic information and/or static information corresponding to the target content from the hypertext document to be processed. In that case step S704, i.e. filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document, includes: filling the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext document.
Specifically, static information is information that does not change over time, and dynamic information is information that may change over time. Static information may include the title, publication time and author of the target content; dynamic information may include the number of reads, comments and likes of the target content, the number of video plays, and so on. The filling positions of the static and/or dynamic information in the web page hypertext template may also be set in advance. Both dynamic and static information may be filled in, or only one of the two. For example, when the web object sequence of the above example and the static information of the target content are filled into the web page hypertext template to obtain the target web page hypertext document, the target web page displayed according to that document may be as shown in Fig. 9.
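Filling static and dynamic information alongside the web objects might look like the sketch below. The field names (`title`, `author`, `published`, `reads`), the template layout and the function name `render` are illustrative assumptions; `safe_substitute` is used so that a missing field leaves its placeholder in place instead of raising an error, which accommodates filling only one kind of information.

```python
from html import escape
from string import Template

# Hypothetical template with preset positions for metadata and body sections.
TEMPLATE = Template(
    "<h1>$title</h1>\n<p>$author, $published</p>\n<p>$reads reads</p>\n$body"
)


def render(static_info, dynamic_info, web_objects):
    """Fill metadata and web objects into the template; either dict may be empty."""
    body = "\n".join(f"<p>{escape(obj)}</p>" for obj in web_objects)
    merged = {**static_info, **dynamic_info}
    fields = {key: escape(str(value)) for key, value in merged.items()}
    return TEMPLATE.safe_substitute(fields, body=body)
```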
The method provided by the embodiments of the present invention is described below, taking as an example the conversion of a web page designed for a computer into a web page for a mobile phone client. It includes the following steps:
Step S1002: obtain, on a server, the hypertext document to be processed corresponding to the web page to be processed. For example, the storage address of the hypertext document to be processed on the server may be obtained, and the hypertext document to be processed is then fetched from that address.
Step S1004: create an empty text buffer for storing the content of the web object to be generated, and a sectionlist file. Here, an empty text buffer is a text buffer in which no content is stored.
Step S1006: obtain the target content hypertext data, and obtain the hierarchical relationship of the target hypertext tags. For example, when the target content is body text, the target content hypertext data corresponding to the body text may be obtained from the hypertext document to be processed according to a preset XPath path for the body text, and the hierarchical relationship of the target hypertext tags is obtained from the DOM tree structure corresponding to the target hypertext data.
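The XPath-based extraction in step S1006 can be illustrated with Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires well-formed markup; real web pages would generally need a more tolerant HTML parser. The function name `extract_target` and the sample path are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def extract_target(document, xpath):
    """Return the first element matching the preset path, or None if absent."""
    root = ET.fromstring(document)
    return root.find(xpath)
```

For a page whose body text sits in a div with id "content", a preset path such as `.//div[@id='content']` selects that subtree, and its child elements then serve as the target hypertext tags.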
Step S1008: obtain the current hypertext tag according to the level of the previous current hypertext tag and the depth-first traversal algorithm. For example, the first time a current hypertext tag is obtained, the target hypertext tag at the first level is taken as the current hypertext tag. The second time, the first target hypertext tag at the second level is taken as the current hypertext tag. The third time, the first hypertext tag among the tags at the level below the first target hypertext tag of the second level is taken as the current hypertext tag, and so on, until every level of that branch has been visited; the traversal then returns to take the second target hypertext tag at the second level as the current hypertext tag.
Step S1010: determine whether the current data type indicated by the current hypertext tag is a non-text data type. If it is, proceed to step S1012; if it is not, proceed to step S1014.
Step S1012: obtain the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag. If the text buffer holds content, a section is generated from the content of the text buffer as the third web object and stored in the sectionlist. The current hypertext content corresponding to the current hypertext tag is then parsed, another section is generated from it as the first web object and stored in the sectionlist, and the method proceeds to step S1016.
Step S1014: obtain the current tag type corresponding to the current hypertext tag, and process the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object. If the tag is of the first type, the current hypertext content corresponding to the current hypertext tag is stored in the text buffer. If it is of the second type and the text buffer holds content, a section is generated from the content of the text buffer and stored in the sectionlist; after the text buffer is emptied, the current hypertext content corresponding to the current hypertext tag is stored in the text buffer as the content of the next web object to be generated. If it is an invalid tag, the corresponding hypertext content is replaced with a space, and the space is stored in the text buffer. If it is a comment tag, the corresponding hypertext content is discarded.
Step S1016: determine whether all target hypertext tags have been obtained. If not, return to step S1008; if so, proceed to step S1018.
Step S1018: obtain the sectionlist file to obtain the web object sequence.
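Steps S1008 to S1018 can be sketched as a single loop over tags that have already been put in traversal order. This is a hedged illustration, not the patented implementation: the tag sets, the `#comment` marker, the dictionary shape of a section and the function name `build_section_list` are all assumptions, any text tag not listed as inline is treated as second type, and the final flush after the loop is an assumption about how the last buffered content becomes a section.

```python
NON_TEXT = {"img", "video"}                  # assumed non-text data tags
INLINE = {"span", "a", "em", "strong"}       # assumed first-type tags
INVALID = {"script", "select", "noscript"}   # examples named in the text
COMMENT = "#comment"                         # hypothetical marker for comment nodes


def build_section_list(tags):
    """tags: iterable of (tag, content) pairs in depth-first order."""
    buffer, section_list = [], []

    def flush():
        # Turn buffered text into one section, then empty the buffer.
        text = "".join(buffer).strip()
        if text:
            section_list.append({"type": "text", "content": text})
        buffer.clear()

    for tag, content in tags:
        if tag in NON_TEXT:
            flush()  # S1012: buffered text becomes the third web object
            section_list.append({"type": tag, "content": content})
        elif tag == COMMENT:
            continue  # S1014: comment content is discarded
        elif tag in INVALID:
            buffer.append(" ")  # S1014: invalid content becomes a space
        elif tag in INLINE:
            buffer.append(content)  # S1014: first type accumulates
        else:
            flush()  # S1014: second type closes the previous object
            buffer.append(content)
    flush()  # assumed: emit whatever remains after the last tag
    return section_list
```

The returned list plays the role of the sectionlist: its entries appear in the same order in which the tags were processed, so it can be used directly as the web object sequence.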
As shown in Fig. 11, in one embodiment a web data processing device is provided. The device may be integrated in the above-mentioned computer equipment 120 and may specifically include a to-be-processed file acquisition module 1102, an extraction module 1104, an object generation module 1106 and a sequence composition module 1108.
The to-be-processed file acquisition module 1102 is configured to obtain the hypertext document to be processed corresponding to the web page to be processed.
The extraction module 1104 is configured to extract target content hypertext data from the hypertext document to be processed, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags.
The object generation module 1106 is configured to take each target hypertext tag in the target content hypertext data as the current hypertext tag and generate the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag; and when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object.
The sequence composition module 1108 is configured to form the web objects corresponding to the target hypertext tags into a web object sequence.
In one embodiment, the extraction module includes:
A path data acquisition unit, configured to obtain target hypertext path data.
An extraction unit, configured to extract the target content hypertext data from the hypertext document to be processed according to the target hypertext path data.
In one embodiment, the object generation module includes:
A to-be-generated object content obtaining unit, configured to, when the current tag type is the first type, obtain the content of the current web object to be generated from the current hypertext content corresponding to the current hypertext tag.
An object obtaining unit, configured to, when the current tag type is the second type, obtain the content of the current web object to be generated, generate the second web object from that content, and take the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
In one embodiment, the content of the current web object to be generated is stored in a preset storage region, and the to-be-generated object content obtaining unit is configured to: when the current tag type is the first type, store the current hypertext content corresponding to the current hypertext tag in the preset storage region as content of the current web object to be generated.
The object obtaining unit is configured to: take the content currently stored in the preset storage region as the content of the current web object to be generated, and generate the second web object from it; delete the content currently stored in the preset storage region; and store the current hypertext content corresponding to the current hypertext tag in the preset storage region as the content of the next web object to be generated.
In one embodiment, the web data processing device further includes a replacement module, configured to, when the current hypertext tag is an invalid tag, replace the current hypertext content corresponding to the current hypertext tag with a space character and store the space character in the preset storage region.
In one embodiment, the web data processing device further includes a content obtaining module, configured to, when the current data type is a non-text data type, obtain the content of the current web object to be generated and generate a third web object from that content.
In one embodiment, as shown in Fig. 12, the web data processing device further includes:
A template obtaining module 1202, configured to obtain a web page hypertext template.
A filling module 1204, configured to fill each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document, in which the hypertext tag corresponding to each web object in the web object sequence is a block-level tag.
In one embodiment, as shown in Fig. 13, the web data processing device further includes:
An information obtaining module 1302, configured to obtain dynamic information and/or static information corresponding to the target content from the hypertext document to be processed.
The filling module 1204 is configured to: fill the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext document.
In one embodiment, the object generation module includes:
A level acquisition unit, configured to obtain the hierarchical relationship between the target hypertext tags.
A current tag acquisition unit, configured to obtain the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and the depth-first traversal algorithm.
The sequence composition module 1108 is configured to: form the web objects corresponding to the target hypertext tags into a web object sequence according to the parsing order of the target hypertext tags.
Fig. 14 shows the internal structure of the computer equipment in one embodiment. As shown in Fig. 14, the computer equipment includes a processor, a memory, a network interface and an input device connected via a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer equipment stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the web data processing method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the web data processing method. The input device of the computer equipment may be a touch layer covering a display screen, a key, trackball or touchpad provided on the housing of the computer equipment, or an external keyboard, touchpad or mouse.
Those skilled in the art will understand that the structure shown in Fig. 14 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer equipment to which the solution is applied; a specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the web data processing device provided by the present application may be implemented in the form of a computer program, and the computer program may run on the computer equipment shown in Fig. 14. The memory of the computer equipment may store the program modules that make up the web data processing device, for example the to-be-processed file acquisition module 1102, the extraction module 1104, the object generation module 1106 and the sequence composition module 1108 shown in Fig. 11. The computer program formed by these program modules causes the processor to perform the steps of the web data processing method of the embodiments of the present application described in this specification.
For example, the computer equipment shown in Fig. 14 may obtain the hypertext document to be processed corresponding to the web page to be processed through the to-be-processed file acquisition module 1102 of the web data processing device shown in Fig. 11. Through the extraction module 1104, it extracts target content hypertext data from the hypertext document to be processed, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags. Through the object generation module 1106, it takes each target hypertext tag in the target content hypertext data as the current hypertext tag and generates the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag; and when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object. Through the sequence composition module 1108, it forms the web objects corresponding to the target hypertext tags into a web object sequence.
In one embodiment, a computer equipment is provided. The computer equipment includes a memory, a processor and a computer program stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer program: obtaining the hypertext document to be processed corresponding to the web page to be processed; extracting target content hypertext data from the hypertext document to be processed, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; taking each target hypertext tag in the target content hypertext data as the current hypertext tag and generating the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag; and when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object; and forming the web objects corresponding to the target hypertext tags into a web object sequence.
In one embodiment, the processing, performed by the processor, of the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object includes: when the current tag type is the first type, obtaining the content of the current web object to be generated from the current hypertext content corresponding to the current hypertext tag; and when the current tag type is the second type, obtaining the content of the current web object to be generated, generating the second web object from that content, and taking the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
In one embodiment, the content of the current web object to be generated is stored in a preset storage region. The step, performed by the processor, of obtaining the content of the current web object to be generated from the current hypertext content corresponding to the current hypertext tag includes: storing the current hypertext content corresponding to the current hypertext tag in the preset storage region as content of the current web object to be generated. The step of obtaining the content of the current web object to be generated, generating the second web object from that content, and taking the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated includes: taking the content currently stored in the preset storage region as the content of the current web object to be generated, and generating the second web object from it; deleting the content currently stored in the preset storage region; and storing the current hypertext content corresponding to the current hypertext tag in the preset storage region as the content of the next web object to be generated.
In one embodiment, the computer program further causes the processor to perform the following step: when the current hypertext tag is an invalid tag, replacing the current hypertext content corresponding to the current hypertext tag with a space character and storing the space character in the preset storage region.
In one embodiment, before the step, performed by the processor, of obtaining, when the current data type is a non-text data type, the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag, the computer program further causes the processor to perform the following step: when the current data type is a non-text data type, obtaining the content of the current web object to be generated and generating a third web object from that content.
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining a web page hypertext template; and filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document, in which the hypertext tag corresponding to each web object in the web object sequence is a block-level tag.
In one embodiment, the computer program further causes the processor to perform the following step: obtaining dynamic information and/or static information corresponding to the target content from the hypertext document to be processed. Filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document then includes: filling the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext document.
In one embodiment, taking each target hypertext tag in the target content hypertext data as the current hypertext tag, as performed by the processor, includes: obtaining the hierarchical relationship between the target hypertext tags; and obtaining the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and the depth-first traversal algorithm. Forming the web objects corresponding to the target hypertext tags into a web object sequence, as performed by the processor, includes: forming the web objects corresponding to the target hypertext tags into a web object sequence according to the parsing order of the target hypertext tags.
In one embodiment, extracting the target content hypertext data from the hypertext document to be processed, as performed by the processor, includes: obtaining target hypertext path data; and extracting the target content hypertext data from the hypertext document to be processed according to the target hypertext path data.
In one embodiment, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium and, when executed by a processor, causes the processor to perform the following steps: obtaining the hypertext document to be processed corresponding to the web page to be processed; extracting target content hypertext data from the hypertext document to be processed, the target content hypertext data including one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags; taking each target hypertext tag in the target content hypertext data as the current hypertext tag and generating the web object corresponding to each target hypertext tag, including: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag; and when the current data type is the text data type, obtaining the current tag type corresponding to the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object; and forming the web objects corresponding to the target hypertext tags into a web object sequence.
In one embodiment, what processor executed is corresponding to current hypertext tags current according to current label type
Hypertext content is handled, and obtaining the second web object includes: when current label type is the first kind, according to current super
The corresponding current hypertext content of text label obtains the content of current web object to be generated;When current label type is the
When two types, the content of current web object to be generated is obtained, generates the according to the content of current web object to be generated
Two web objects, using the corresponding current hypertext content of current hypertext tags as in next web object to be generated
Hold.
In one embodiment, the content of the current web object to be generated is stored in a preset storage area. The step, executed by the processor, of obtaining the content of the current web object to be generated from the current hypertext content corresponding to the current hypertext tag includes: storing the current hypertext content corresponding to the current hypertext tag in the preset storage area as the content of the current web object to be generated. The step of obtaining the content of the current web object to be generated, generating the second web object from that content, and using the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated includes: taking the content currently stored in the preset storage area as the content of the current web object to be generated and generating the second web object from it; and deleting the content currently stored in the preset storage area and storing the current hypertext content corresponding to the current hypertext tag in the preset storage area as the content of the next web object to be generated.
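The storage-area handling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `TextAssembler`, the tag sets, and the dictionary shape of a web object are all hypothetical, and the sketch assumes that first-type content is appended to what is already stored and that an empty storage area produces no object.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical classification; the source only names a "first type" and a "second type".
FIRST_TYPE = {"span", "b", "i", "a"}   # assumed: content joins the pending object
SECOND_TYPE = {"p", "div", "li"}       # assumed: content starts a new object


@dataclass
class TextAssembler:
    buffer: List[str] = field(default_factory=list)            # stands in for the preset storage area
    objects: List[Dict[str, str]] = field(default_factory=list)

    def flush(self) -> None:
        """Turn the stored content into a web object, then clear the storage area."""
        text = "".join(self.buffer).strip()
        if text:  # assumption: nothing is emitted for an empty storage area
            self.objects.append({"type": "text", "content": text})
        self.buffer.clear()

    def feed(self, tag: str, content: str) -> None:
        if tag in FIRST_TYPE:
            # First type: store the content as part of the object being generated.
            self.buffer.append(content)
        elif tag in SECOND_TYPE:
            # Second type: generate an object from what is stored, delete it,
            # then store the new content for the next object.
            self.flush()
            self.buffer.append(content)


assembler = TextAssembler()
for tag, content in [("p", "Hello "), ("b", "world"), ("p", "Next")]:
    assembler.feed(tag, content)
assembler.flush()  # assumption: a final flush emits the last pending object
```

The final `flush()` call is an addition of this sketch; the passage above does not say how the last pending content is emitted.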
In one embodiment, the computer program further causes the processor to execute the following step: when the current hypertext tag is an invalid tag, replacing the current hypertext content corresponding to the current hypertext tag with a space character and storing the space character in the preset storage area.
In one embodiment, before the step of obtaining, when the current data type is a non-text data type, the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to that tag, the computer program further causes the processor to execute the following step: when the current data type is a non-text data type, obtaining the content of the current web object to be generated and generating a third web object from that content.
In one embodiment, the computer program further causes the processor to execute the following steps: obtaining a web page hypertext template; and filling each web object in the web object sequence into the web page hypertext template to obtain a corresponding target web page hypertext document, wherein, in the target web page hypertext document, the hypertext tag corresponding to each web object in the web object sequence is a block-level tag.
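As a rough illustration of the template-filling step, the sketch below wraps every web object in a block-level tag and substitutes the result into a template. The template text, the `render` function, and the choice of `<p>` and `<div>` as the block-level tags are assumptions of this example, not details given in the source.

```python
from html import escape
from string import Template
from typing import Dict, List

# Hypothetical web page hypertext template with a single placeholder.
PAGE_TEMPLATE = Template("<html><body>\n$body\n</body></html>")


def render(objects: List[Dict[str, str]]) -> str:
    """Fill the web object sequence into the template, one block-level tag per object."""
    blocks = []
    for obj in objects:
        if obj["type"] == "img":
            # Non-text object: wrapped in a block-level <div>.
            blocks.append('<div><img src="{}"></div>'.format(escape(obj["content"])))
        else:
            # Text object: emitted as a block-level <p>.
            blocks.append("<p>{}</p>".format(escape(obj["content"])))
    return PAGE_TEMPLATE.substitute(body="\n".join(blocks))


page = render([
    {"type": "text", "content": "a < b"},
    {"type": "img", "content": "x.png"},
])
```

Because every object maps to exactly one block-level element, the output has a flat, predictable structure regardless of how deeply the source tags were nested.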
In one embodiment, the computer program further causes the processor to execute the following step: obtaining dynamic information and/or static information corresponding to the target content from the hypertext document to be processed. Filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document includes: filling the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext document.
In one embodiment, the step, executed by the processor, of taking each target hypertext tag in the target content hypertext data as the current hypertext tag includes: obtaining the hierarchical relationship between the target hypertext tags; and obtaining the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and a depth-first traversal algorithm. The step, executed by the processor, of combining the web objects corresponding to the target hypertext tags into a web object sequence includes: combining the web objects corresponding to the target hypertext tags into the web object sequence in the order in which the target hypertext tags were parsed.
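The traversal order can be illustrated with a small tree walk. The `TagNode` class and `depth_first` generator are hypothetical names, and the sketch assumes a pre-order depth-first traversal (a tag is visited before its children), which the source does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class TagNode:
    """Hypothetical node holding one hypertext tag and its nested tags."""
    name: str
    content: str = ""
    children: List["TagNode"] = field(default_factory=list)


def depth_first(node: TagNode) -> Iterator[TagNode]:
    """Yield the node, then each subtree in document order (pre-order DFS)."""
    yield node
    for child in node.children:
        yield from depth_first(child)


tree = TagNode("div", children=[
    TagNode("p", "a", [TagNode("b", "x")]),
    TagNode("img", "y.png"),
])
order = [n.name for n in depth_first(tree)]  # ['div', 'p', 'b', 'img']
```

Appending each generated web object as its tag is visited yields a sequence in the same order as the parse.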
In one embodiment, the step, executed by the processor, of extracting the target content hypertext data from the hypertext document to be processed includes: obtaining target hypertext path data; and extracting the target content hypertext data from the hypertext document to be processed according to the target hypertext path data.
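One way to picture path-based extraction is shown below. It assumes the target hypertext path data is an XPath-like expression and that the document is well-formed XML, so that the standard-library `xml.etree.ElementTree` parser can read it; real web pages often need a more forgiving HTML parser. The function name `extract_target` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def extract_target(document: str, path: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs for the direct children of the element at `path`."""
    root = ET.fromstring(document)   # assumption: input is well-formed XML
    target = root.find(path)         # ElementTree supports a limited XPath subset
    if target is None:
        return []
    return [(el.tag, (el.text or "").strip()) for el in target]


doc = (
    '<html><body><div id="nav">menu</div>'
    '<div id="article"><p>Hello</p><img src="a.png"/></div></body></html>'
)
tags = extract_target(doc, ".//div[@id='article']")  # [('p', 'Hello'), ('img', '')]
```

Restricting the later processing to the element selected by the path keeps navigation and other surrounding markup out of the web object sequence.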
It should be understood that although the steps in the flowcharts of the embodiments of the present invention are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that sequence. Unless expressly stated otherwise herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; nor are these sub-steps or stages necessarily executed one after another, but they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
Claims (15)
1. A web page data processing method, the method comprising:
obtaining a hypertext document to be processed that corresponds to a web page to be processed;
extracting target content hypertext data from the hypertext document to be processed, the target content hypertext data comprising one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags;
taking each target hypertext tag in the target content hypertext data as a current hypertext tag and generating a web object corresponding to each target hypertext tag, comprising: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining a first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag; and when the current data type is a text data type, obtaining the current tag type of the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object; and
combining the web objects corresponding to the target hypertext tags into a web object sequence.
2. The method according to claim 1, wherein processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain the second web object comprises:
when the current tag type is a first type, obtaining the content of the current web object to be generated from the current hypertext content corresponding to the current hypertext tag; and
when the current tag type is a second type, obtaining the content of the current web object to be generated, generating the second web object from the content of the current web object to be generated, and using the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
3. The method according to claim 2, wherein the content of the current web object to be generated is stored in a preset storage area, and the step of obtaining the content of the current web object to be generated from the current hypertext content corresponding to the current hypertext tag comprises:
storing the current hypertext content corresponding to the current hypertext tag in the preset storage area as the content of the current web object to be generated;
and wherein obtaining the content of the current web object to be generated, generating the second web object from the content of the current web object to be generated, and using the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated comprises:
taking the content currently stored in the preset storage area as the content of the current web object to be generated, and generating the second web object from the content currently stored in the preset storage area; and
deleting the content currently stored in the preset storage area, and storing the current hypertext content corresponding to the current hypertext tag in the preset storage area as the content of the next web object to be generated.
4. The method according to claim 3, wherein the method further comprises:
when the current hypertext tag is an invalid tag, replacing the current hypertext content corresponding to the current hypertext tag with a space character, and storing the space character in the preset storage area.
5. The method according to any one of claims 1 to 4, wherein, before obtaining, when the current data type is a non-text data type, the first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag, the method further comprises:
when the current data type is a non-text data type, obtaining the content of the current web object to be generated, and generating a third web object from the content of the current web object to be generated.
6. The method according to claim 1, wherein the method further comprises:
obtaining a web page hypertext template; and
filling each web object in the web object sequence into the web page hypertext template to obtain a corresponding target web page hypertext document, wherein, in the target web page hypertext document, the hypertext tag corresponding to each web object in the web object sequence is a block-level tag.
7. The method according to claim 6, wherein the method further comprises:
obtaining dynamic information and/or static information corresponding to the target content from the hypertext document to be processed;
and wherein filling each web object in the web object sequence into the web page hypertext template to obtain the corresponding target web page hypertext document comprises:
filling the dynamic information and/or static information, together with each web object in the web object sequence, into the web page hypertext template to obtain the target web page hypertext document.
8. The method according to claim 1, wherein taking each target hypertext tag in the target content hypertext data as the current hypertext tag comprises:
obtaining the hierarchical relationship between the target hypertext tags; and
obtaining the current hypertext tag from the target hypertext tags according to the level of the previous current hypertext tag and a depth-first traversal algorithm;
and wherein combining the web objects corresponding to the target hypertext tags into a web object sequence comprises:
combining the web objects corresponding to the target hypertext tags into the web object sequence in the order in which the target hypertext tags were parsed.
9. The method according to claim 1, wherein extracting the target content hypertext data from the hypertext document to be processed comprises:
obtaining target hypertext path data; and
extracting the target content hypertext data from the hypertext document to be processed according to the target hypertext path data.
10. A web page data processing apparatus, the apparatus comprising:
a to-be-processed document obtaining module, configured to obtain a hypertext document to be processed that corresponds to a web page to be processed;
an extraction module, configured to extract target content hypertext data from the hypertext document to be processed, the target content hypertext data comprising one or more target hypertext tags and the hypertext content corresponding to the target hypertext tags;
an object generation module, configured to take each target hypertext tag in the target content hypertext data as a current hypertext tag and generate a web object corresponding to each target hypertext tag, comprising: obtaining the current data type indicated by the current hypertext tag; when the current data type is a non-text data type, obtaining a first web object corresponding to the current hypertext tag from the current hypertext content corresponding to the current hypertext tag; and when the current data type is a text data type, obtaining the current tag type of the current hypertext tag and processing the current hypertext content corresponding to the current hypertext tag according to the current tag type to obtain a second web object; and
a sequence composing module, configured to combine the web objects corresponding to the target hypertext tags into a web object sequence.
11. The apparatus according to claim 10, wherein the object generation module comprises:
a to-be-generated object content obtaining unit, configured to, when the current tag type is a first type, obtain the content of the current web object to be generated from the current hypertext content corresponding to the current hypertext tag; and
an object obtaining unit, configured to, when the current tag type is a second type, obtain the content of the current web object to be generated, generate the second web object from the content of the current web object to be generated, and use the current hypertext content corresponding to the current hypertext tag as the content of the next web object to be generated.
12. The apparatus according to claim 11, wherein the content of the current web object to be generated is stored in a preset storage area, and the to-be-generated object content obtaining unit is configured to:
when the current tag type is the first type, store the current hypertext content corresponding to the current hypertext tag in the preset storage area as the content of the current web object to be generated;
and the object obtaining unit is configured to:
take the content currently stored in the preset storage area as the content of the current web object to be generated, and generate the second web object from the content currently stored in the preset storage area; and
delete the content currently stored in the preset storage area, and store the current hypertext content corresponding to the current hypertext tag in the preset storage area as the content of the next web object to be generated.
13. The apparatus according to claim 12, wherein the apparatus further comprises:
a replacement module, configured to, when the current hypertext tag is an invalid tag, replace the current hypertext content corresponding to the current hypertext tag with a space character and store the space character in the preset storage area.
14. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the web page data processing method according to any one of claims 1 to 9.
15. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the web page data processing method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810236011.8A CN110309457B (en) | 2018-03-21 | 2018-03-21 | Webpage data processing method, device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110309457A (en) | 2019-10-08 |
CN110309457B (en) | 2023-06-16 |
Family
ID=68073523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810236011.8A Active CN110309457B (en) | 2018-03-21 | 2018-03-21 | Webpage data processing method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110309457B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111596907A (en) * | 2020-05-19 | 2020-08-28 | 北京字节跳动网络技术有限公司 | File generation method, device, equipment and storage medium |
CN111597487A (en) * | 2020-05-06 | 2020-08-28 | 五八有限公司 | Page data acquisition method and device, electronic equipment and storage medium |
CN113378515A (en) * | 2021-08-16 | 2021-09-10 | 宜科(天津)电子有限公司 | Text generation system based on production data |
CN116661803A (en) * | 2023-07-31 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Processing method and device for multi-mode webpage template and computer equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140359413A1 (en) * | 2013-05-28 | 2014-12-04 | Tencent Technology (Shenzhen) Company Limited | Apparatuses and methods for webpage content processing |
CN106547895A (en) * | 2016-11-03 | 2017-03-29 | 北京锐安科技有限公司 | A kind of extracting method and device of info web |
CN107153716A (en) * | 2017-06-06 | 2017-09-12 | 百度在线网络技术(北京)有限公司 | Webpage content extracting method and device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111597487A (en) * | 2020-05-06 | 2020-08-28 | 五八有限公司 | Page data acquisition method and device, electronic equipment and storage medium |
CN111596907A (en) * | 2020-05-19 | 2020-08-28 | 北京字节跳动网络技术有限公司 | File generation method, device, equipment and storage medium |
CN113378515A (en) * | 2021-08-16 | 2021-09-10 | 宜科(天津)电子有限公司 | Text generation system based on production data |
CN116661803A (en) * | 2023-07-31 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Processing method and device for multi-mode webpage template and computer equipment |
CN116661803B (en) * | 2023-07-31 | 2023-11-17 | 腾讯科技(深圳)有限公司 | Processing method and device for multi-mode webpage template and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110309457B (en) | 2023-06-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11294968B2 (en) | Combining website characteristics in an automatically generated website | |
CN100465956C (en) | System, web server and method for adding personalized value to web sites | |
US20160283499A1 (en) | Webpage advertisement interception method, device and browser | |
CN110309457A (en) | Web data processing method, device, computer equipment and storage medium | |
CN108399150B (en) | Text processing method and device, computer equipment and storage medium | |
CN101593186B (en) | Visual website editing method and visual website editing system | |
CN108717437B (en) | Search result display method and device and storage medium | |
CN113515928B (en) | Electronic text generation method, device, equipment and medium | |
CN112100550A (en) | Page construction method and device | |
CN104750851A (en) | Webpage content lazy loading method and system | |
CN109933751B (en) | Image-text drawing method and device, computer-readable storage medium and computer equipment | |
CN108595697A (en) | Webpage integrated approach, apparatus and system | |
US20170109442A1 (en) | Customizing a website string content specific to an industry | |
CN106294885A (en) | A kind of data collection towards isomery webpage and mask method | |
CN114791988A (en) | Browser-based PDF file analysis method, system and storage medium | |
CN109558123B (en) | Method for converting webpage into electronic book, electronic equipment and storage medium | |
CN109948085A (en) | Browser kernel initial method, calculates equipment and storage medium at device | |
CN112433995A (en) | File format conversion method, system, computer equipment and storage medium | |
CN106951429B (en) | Method, browser and equipment for enhancing webpage comment display | |
CN115577683B (en) | HTML rich text content conversion method, device, equipment and medium | |
CN113139145B (en) | Page generation method and device, electronic equipment and readable storage medium | |
KR20210098813A (en) | Apparatus of crawling and analyzing text data and method thereof | |
CN108664511B (en) | Method and device for acquiring webpage information | |
US20210397663A1 (en) | Data reduction in a tree data structure for a wireframe | |
CN115599367A (en) | Method for collecting and sorting energy big data and establishing visual platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |