CN106570044B - Method and device for analyzing webpage codes - Google Patents

Info

Publication number
CN106570044B
Authority
CN
China
Prior art keywords
webpage
coding information
data segment
information
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510670507.2A
Other languages
Chinese (zh)
Other versions
CN106570044A (en)
Inventor
李可欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510670507.2A
Publication of CN106570044A
Application granted
Publication of CN106570044B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for analyzing webpage codes, and relates to the technical field of the internet. It addresses the low efficiency of acquiring webpage information that arises when a crawler system, during webpage analysis, must perform complicated statistical calculation on webpage data to guess the code actually used by a webpage. The method comprises the following steps: reading webpage response data from a webpage response packet; decoding the webpage response data segment by segment using preset coding information, and judging whether webpage coding information is recorded in the current data segment; if so, decoding the current data segment using the webpage coding information and, when the current data segment is completely decoded, decoding the webpage response data using the webpage coding information; if not, decoding another data segment using the preset coding information and judging whether webpage coding information is recorded in that data segment. The method and the device are mainly used by a crawler system to acquire webpage information in real time.

Description

Method and device for analyzing webpage codes
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for analyzing webpage codes.
Background
A web crawler is a program or script that automatically captures webpage information according to certain rules. Webpage data is exchanged in binary form during network transmission, and the party acquiring the data must decode it according to a specific coding rule to obtain a human-readable form. Mainstream network transmission uses the Hypertext Transfer Protocol (HTTP) to encapsulate webpages, and webpages are organized and described in Hypertext Markup Language (HTML). The HTTP protocol provides a coding field that the server can configure, but the protocol places no strict requirement on this field, and developers of some website servers do not keep the code set in the coding field consistent with the code used in the webpage. Similarly, the code used by a webpage is usually identified by the charset attribute of a meta tag in the HTML structure, but some website developers do not fill in the attribute, and even a filled-in attribute may not match the code actually used by the webpage.
Although most codes can correctly decode the code values of English characters, decoding the code values of Chinese characters places strict requirements on the code: only a specific code decodes them correctly. For these reasons, a web crawler acquiring Chinese webpage information cannot determine which code the webpage data should be decoded with. At present, browsers on the market mainly rely on complex statistical algorithms over the webpage data to guess the code actually used, but in web crawler experiments the efficiency of these algorithms is not sufficient for a crawler that must acquire webpage information with high real-time performance.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for analyzing web page codes, and mainly aims to solve the problem of low efficiency of acquiring web page information caused by a process of guessing codes actually used by web pages by performing complex statistical calculation on web page data when a crawler system performs web page analysis.
According to a first aspect of the present invention, the present invention provides a method for parsing a web page code, including:
reading webpage response data from the webpage response packet;
decoding the webpage response data segment through preset coding information, and judging whether the webpage coding information is recorded in the current data segment;
if the judgment result is yes, decoding the current data segment by using the webpage coding information, and decoding the webpage response data by using the webpage coding information when the current data segment is completely decoded;
if the judgment result is negative, decoding the other data segment through the preset coding information, and judging whether the webpage coding information is recorded in the data segment.
According to a second aspect of the present invention, an apparatus for parsing a web page code is provided, comprising:
the acquisition unit is used for reading webpage response data from the webpage response packet;
the judging unit is used for decoding, segment by segment, the webpage response data read by the acquisition unit through the preset coding information, and judging whether the webpage coding information is recorded in the current data segment;
the processing unit is used for decoding the current data segment by using the webpage coding information when the judgment result of the judging unit is yes, and decoding the webpage response data by using the webpage coding information when the current data segment is completely decoded;
the processing unit is also used for decoding another data segment through the preset coding information when the judgment result of the judging unit is negative, and judging whether the webpage coding information is recorded in the data segment through the judging unit.
By means of the above technical scheme, the method and the device for analyzing webpage codes provided by the embodiments of the invention read the webpage response data from the webpage response packet, decode the webpage response data segment by segment using the preset coding information, and judge whether webpage coding information is recorded in the current data segment. When the judgment result is yes, the current data segment is decoded using the webpage coding information; if the current data segment is completely decoded, this indicates that the webpage coding information is the coding information actually used by the webpage, so the webpage response data is decoded using the webpage coding information and the webpage is stored. When the judgment result is negative, another data segment is decoded using the preset coding information, and it is judged whether webpage coding information is recorded in that data segment. Compared with the prior art, which needs to run a complicated statistical algorithm on webpage data to guess the code actually used by the webpage, the invention improves the webpage analysis speed of the crawler system and meets the crawler system's requirement of acquiring webpage information with high real-time performance.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a method for parsing webpage codes according to an embodiment of the present invention;
FIG. 2 is a block diagram illustrating an apparatus for parsing webpage codes according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating an apparatus for parsing webpage codes according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
When a web crawler acquires webpage information, the coding information recorded in the coding field of the HTTP protocol packet and the webpage coding information identified by the meta tag in the HTML data structure may not match the coding information actually used by the webpage. In this case the crawler system usually cannot determine which code should be used to decode the webpage data and can only guess the code actually used by running a complex statistical algorithm, whose efficiency is not sufficient for a web crawler that must acquire webpage information with high real-time performance.
In order to solve the problem of low efficiency of acquiring webpage information caused by the process of guessing codes actually used by webpages by performing complex statistical calculation on webpage data when a crawler system performs webpage analysis, an embodiment of the present invention provides a method for analyzing webpage codes, as shown in fig. 1, the method includes:
101. Reading the webpage response data from the webpage response packet.
Generally, when obtaining page information, a web crawler starts downloading webpage data from a specified Uniform Resource Locator (URL) address. During transmission, the webpage is encapsulated using the Hypertext Transfer Protocol (HTTP), and the webpage data is organized and described in Hypertext Markup Language (HTML). The web crawler sends a webpage request packet (HTTP request packet) to the server and receives the webpage response packet (HTTP response packet) that the server returns in response. The webpage response packet includes a status line, message headers, and response content. The status line reflects the server's response to the webpage request packet sent by the web crawler; the response content is usually the webpage text, that is, the HTML page data returned by the server, which is the webpage response data in step 101.
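The three parts of the response packet named in step 101 can be illustrated with a minimal Python sketch. It assumes a complete HTTP/1.x response already held in memory as bytes, with no chunked transfer coding or compression; the function name split_http_response is hypothetical and not part of the patent.

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP/1.x response into (status_line, headers, body).

    Assumption: the whole response is in memory and the body is neither
    chunked nor compressed.
    """
    # The blank line (CRLF CRLF) separates the header block from the content.
    head, _, body = raw.partition(b"\r\n\r\n")
    # Header bytes are decoded as ISO-8859-1, which never fails on any byte.
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html; charset=GB2312\r\n"
       b"\r\n"
       b"<html><head></head><body>hi</body></html>")
status_line, headers, body = split_http_response(raw)
print(status_line)
print(headers["content-type"])
```

The body is kept as bytes on purpose: it is the webpage response data that the later steps decode.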
102. Decoding the webpage response data segment by segment using preset coding information, and judging whether webpage coding information is recorded in the current data segment.
Because information is exchanged as a binary stream during network transmission, and the party acquiring the data must use a coding rule to decode it into a human-readable form, the webpage response data read from the webpage response packet in step 101 needs to be decoded using preset coding information. Decoding yields either part of the webpage response information (incomplete decoding) or all of it (complete decoding). Since almost any code decodes English characters correctly, the English coding field that records the webpage coding information can be decoded, and it can thus be determined whether webpage coding information is recorded in the webpage response data. If the coding information actually used by the webpage still cannot be determined after this processing, the webpage response data has to be decoded again. To avoid the loss of processing resources caused by repeatedly decoding all of the webpage response data, step 102 of the embodiment of the present invention decodes the webpage response data segment by segment using the preset coding information and judges whether webpage coding information is recorded in the current data segment. When the coding information actually used by the webpage still cannot be determined after the current data segment has been processed, processing continues with another data segment, which avoids the waste of processing resources caused by processing all of the webpage response data again.
103. If the judgment result is yes, decoding the current data segment by using the webpage coding information, and decoding the webpage response data by using the webpage coding information when the current data segment is completely decoded.
After it is judged in step 102 that webpage coding information is recorded in the current data segment, the current data segment is decoded using the webpage coding information to determine whether that coding information can completely decode the segment. If it can, the webpage coding information recorded in the current data segment is taken as the coding information actually used by the webpage; the webpage response data is then decoded using the webpage coding information to obtain human-readable webpage information, which can be stored. If the segment cannot be completely decoded, the webpage coding information recorded in the current data segment is not the coding information actually used by the webpage, and another data segment of the webpage response data must be decoded and analyzed using the preset coding information.
104. If the judgment result is negative, decoding the other data segment through the preset coding information, and judging whether the webpage coding information is recorded in the data segment.
After judging that no webpage coding information is recorded in the current data segment in step 102, decoding another data segment through preset coding information to judge whether webpage coding information is recorded, and if the webpage coding information is recorded, executing step 103; if no webpage coding information is recorded, the process continues to step 104.
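The control flow of steps 102 to 104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: "completely decoded" is taken to mean that strict decoding raises no error, the declaration is found with a hypothetical regular expression META_CHARSET, and the names detect_encoding and decodes_completely are invented for the example. A segment that begins in the middle of a multi-byte character is not handled.

```python
import codecs
import re

# Hypothetical pattern for a charset declaration inside a <meta ...> tag.
META_CHARSET = re.compile(
    r"<meta[^>]*?charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE)


def decodes_completely(segment: bytes, encoding: str) -> bool:
    """Assumption: 'completely decoded' means strict decoding raises no error.

    final=False tolerates a multi-byte character cut off at the segment end.
    """
    try:
        decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
        decoder.decode(segment, final=False)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


def detect_encoding(body: bytes, preset: str, segments: int = 5):
    """Return the verified webpage encoding, or None if none is confirmed."""
    size = max(1, -(-len(body) // segments))  # ceiling division
    for start in range(0, len(body), size):
        segment = body[start:start + size]
        # Step 102: decode with the preset encoding and look for a declaration.
        text = segment.decode(preset, errors="replace")
        match = META_CHARSET.search(text)
        if match is None:
            continue  # Step 104: nothing recorded here, try the next segment.
        declared = match.group(1).lower()
        # Step 103: re-decode the current segment with the declared encoding.
        if decodes_completely(segment, declared):
            return declared
        # Declaration does not fit the bytes: fall through to the next segment.
    return None


page = ('<html><head><meta charset="gb2312"><title>' + "中文网页" * 50
        + "</title></head></html>").encode("gb2312")
print(detect_encoding(page, preset="utf-8"))
```

Once detect_encoding returns a name, the whole body would be decoded once with that encoding; when it returns None, the caller has to decide on a fallback, which the text does not specify.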
The method for analyzing webpage codes provided by the embodiment of the invention reads the webpage response data from the webpage response packet, decodes the webpage response data segment by segment using the preset coding information, and judges whether webpage coding information is recorded in the current data segment. When the judgment result is yes, the current data segment is decoded using the webpage coding information; if the current data segment is completely decoded, the webpage coding information is the coding information actually used by the webpage, so the webpage response data is decoded using the webpage coding information and the webpage is stored. When the judgment result is negative, another data segment is decoded using the preset coding information, and it is judged whether webpage coding information is recorded in that data segment. Compared with the prior art, which needs to run a complicated statistical algorithm on webpage data to guess the code actually used by the webpage, the invention improves the webpage analysis speed of the crawler system and meets the crawler system's requirement of acquiring webpage information with high real-time performance.
Since the web page response data included in the web page response packet, that is, the web page response text, is usually HTML structural data, in order to better understand the method shown in fig. 1, in the embodiment of the present invention, an HTTP response packet is used as the web page response packet, and HTML structural data is used as the web page response data, which is described in detail with respect to each step in fig. 1.
Reading the webpage response data from the webpage response packet means reading the HTML structure data from the HTTP response packet: the web crawler captures the webpage's HTTP response packet and obtains the whole webpage HTML data from it. The binary data of the webpage then needs to be decoded into a character string for display.
Before the HTML structure data is segmented and decoded using the preset coding information, the preset coding information must be acquired. Specifically, the coding field is searched for and read in the HTTP response packet, and it is judged whether the coding field records response packet coding information. If it does, the response packet coding information is used as the preset coding information. The response packet coding information recorded in the coding field of the HTTP response packet is the character set used by the HTML structure data; a character set comprises the commonly used characters and specifies their coding rule. For example, for a Chinese webpage, the coding field "charset=GB2312" in the HTTP response packet shows that the coding information, that is, the character set, used by the webpage data is GB2312, a Chinese character coded character set for information interchange and a Chinese character coding specification. Using the coding information GB2312 recorded in the coding field of the HTTP response packet, the binary webpage data can be decoded into Chinese character strings. In addition, the field "Content-Type: text/html" in the HTTP response packet indicates that the webpage data is HTML structure data in text form. If the coding field in the HTTP response packet does not record response packet coding information, the acquirer of the HTML structure data may use default coding information as the preset coding information, for example the Unicode encoding UTF-8.
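Choosing the preset coding information can be sketched as below. The sketch assumes the coding field is the charset parameter of the Content-Type header and that UTF-8 is the default, as in the example above; the function name preset_encoding and the constant DEFAULT_ENCODING are hypothetical.

```python
import re

DEFAULT_ENCODING = "utf-8"  # assumption: default when no charset is recorded


def preset_encoding(content_type, default=DEFAULT_ENCODING):
    """Return the charset named in a Content-Type value, else the default."""
    if content_type:
        match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)",
                          content_type, re.IGNORECASE)
        if match:
            return match.group(1).lower()
    return default


print(preset_encoding("text/html; charset=GB2312"))
print(preset_encoding("text/html"))
```

The returned name is only a starting point for decoding; the later steps check it against what the page itself declares.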
After the HTML structure data and the preset coding information are obtained, the coding information declared by the webpage file must be obtained from the HTML structure data, so the HTML structure data has to be decoded using the preset coding information. To improve decoding efficiency, and to reduce the loss of processing resources that would result from decoding all of the HTML structure data again whenever decoding has to be repeated, the HTML structure data can first be segmented according to a preset segmentation rule and then decoded segment by segment from the head using the preset coding information. As an optional implementation, the embodiment of the present invention may segment the HTML structure data under a preset rule in which every 20% of the data forms one data segment, and decode segment by segment starting from the first 20% using the preset coding information. Of course, the HTML structure data may also be segmented in other proportions.
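The 20% segmentation rule can be sketched as a plain byte split. The function name split_segments and the percent parameter are assumptions made for illustration; the text does not say how a cut that falls inside a multi-byte character should be treated, so this sketch simply cuts at byte offsets.

```python
def split_segments(data: bytes, percent: int = 20):
    """Split data into consecutive segments of about `percent` % each."""
    if not data:
        return []
    # Integer ceiling of len(data) * percent / 100, at least one byte.
    size = max(1, (len(data) * percent + 99) // 100)
    return [data[i:i + size] for i in range(0, len(data), size)]


segments = split_segments(b"x" * 103)
print([len(s) for s in segments])
```

Joining the segments in order always reproduces the original data, so later steps can process them one at a time from the head.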
When the HTML structure data is segmented according to the preset segmentation rule, the HTML structure data needs to be decoded from the head segment by segment through the preset coding information, and whether the webpage coding information is recorded in the current data segment is judged.
In practice, the meta tag is usually used in the HTML structure data to identify the coding information declared by the webpage document, and this coding information is generally recorded in the first half of the data. The HTML structure data is therefore segmented according to the preset segmentation rule and decoded segment by segment from the head using the preset coding information, to find whether a meta tag is recorded in the current data segment and then to read whether that meta tag identifies webpage coding information. For example, <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> is a meta tag with an http-equiv attribute in HTML structure data, in which the coding field charset records the webpage coding information utf-8. Because the tags and attribute information of HTML structure data are all in English, the tags and attributes can be correctly decoded and parsed to acquire the webpage coding information no matter which code the preset coding information specifies.
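Reading the declaration out of a decoded segment can be sketched with the standard library HTML parser. The class MetaCharsetParser and the function find_declared_encoding are hypothetical names. The sketch decodes with errors="replace" so that non-English bytes never abort the search, and it relies on the preset being an ASCII-compatible encoding (such as UTF-8 or GB2312) so that the English tag text survives decoding.

```python
import re
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Records the first charset declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="...">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: <meta http-equiv="Content-Type" content="...">
            match = re.search(r"charset\s*=\s*([\w.:-]+)",
                              attributes.get("content") or "", re.IGNORECASE)
            if match:
                self.charset = match.group(1).lower()


def find_declared_encoding(segment: bytes, preset: str):
    """Decode one segment with the preset encoding and return the charset
    named by a meta tag, or None when the segment records none."""
    parser = MetaCharsetParser()
    # close() is deliberately not called: the segment may end mid-document.
    parser.feed(segment.decode(preset, errors="replace"))
    return parser.charset


segment = (b'<head><meta http-equiv="Content-Type" '
           b'content="text/html; charset=utf-8" /></head><body>'
           + "中文".encode("utf-8"))
print(find_declared_encoding(segment, preset="gb2312"))
```

The example decodes UTF-8 bytes with a GB2312 preset on purpose: the Chinese text is garbled, but the meta tag is still readable.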
After it is determined in the above manner that webpage coding information is recorded in the current data segment, if the webpage coding information differs from the preset coding information, the current data segment must be decoded again using the webpage coding information; if the current data segment can be completely decoded, the webpage coding information can be taken as the coding information actually used by the webpage. If the webpage coding information is the same as the preset coding information, this alone does not show that the webpage coding information is the coding information actually used by the webpage; to obtain the correct webpage coding information, the current data segment is likewise decoded again using the webpage coding information, and if it can be completely decoded, the webpage coding information can be taken as the coding information actually used by the webpage.
After determining that no webpage coding information is recorded in the current data segment in the above manner, decoding the next data segment adjacent to the current data segment by using the preset coding information to judge whether the webpage coding information is recorded therein, if so, the processing manner is the same as that of the above processing manner after determining that the webpage coding information is recorded in the current data segment until determining the coding information actually used by the webpage.
In addition, after the current data segment is decoded by the preset coding information, if the current data segment is completely decoded by the preset coding information and the webpage coding information recorded in the current data segment is the same as the preset coding information, the preset coding information can be determined as the coding information actually used by the webpage without re-decoding the current data segment by the webpage coding information recorded in the current data segment, and further, other data segments in the webpage response data can be decoded by the preset coding information, and the webpage information obtained after decoding is stored.
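The confirmation logic, including the shortcut for the case where the declared and preset coding information agree, might look like the sketch below. It assumes that "completely decoded" means strict decoding without error and that two labels name the same code when Python resolves them to the same codec; confirm_encoding, try_decode, and same_encoding are hypothetical names.

```python
import codecs


def same_encoding(a: str, b: str) -> bool:
    """True when two labels resolve to the same Python codec."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return False


def try_decode(segment: bytes, encoding: str):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return segment.decode(encoding, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return None


def confirm_encoding(segment: bytes, preset: str, declared: str):
    """Return the encoding to use for the whole page, or None if unconfirmed."""
    preset_text = try_decode(segment, preset)
    if preset_text is not None and same_encoding(preset, declared):
        return preset      # shortcut: the first decode already succeeded
    if try_decode(segment, declared) is not None:
        return declared    # declaration verified by re-decoding the segment
    return None            # declaration does not match the actual bytes


segment = '<meta charset="gb2312">中文'.encode("gb2312")
print(confirm_encoding(segment, preset="utf-8", declared="gb2312"))
```

Returning None corresponds to the case where the recorded webpage coding information is not the code actually used, so the next data segment has to be examined.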
The embodiment of the invention decodes the webpage response data segment by segment using the preset coding information to obtain the webpage coding information recorded in the coding field of the webpage data, and, by comparing the preset coding information with the webpage coding information, finds a balance point between exactly matching and roughly matching the coding information. By sacrificing some matching accuracy, the crawler system obtains webpage information with high real-time performance, and the loss of processing resources caused by repeatedly decoding all of the webpage data when the coding information in the webpage response packet is inconsistent with the coding information declared in the webpage data is avoided.
As an implementation of the method shown in fig. 1, an embodiment of the present invention provides an apparatus for parsing a webpage code. As shown in fig. 2, the apparatus includes an obtaining unit 21, a judging unit 22 and a processing unit 23, wherein:
an obtaining unit 21, configured to read web page response data from the web page response packet;
the judging unit 22 is configured to decode the webpage response data segment read by the obtaining unit 21 through preset encoding information, and judge whether the webpage encoding information is recorded in the current data segment;
a processing unit 23, configured to decode the current data segment by using the web page coding information when the determination result of the determining unit 22 is yes, and decode the web page response data by using the web page coding information when the current data segment is completely decoded;
the processing unit 23 is further configured to decode another data segment by the preset encoding information when the judgment result of the judging unit 22 is no, and judge whether the web page encoding information is recorded therein by the judging unit 22.
Further, as shown in fig. 3, the obtaining unit 21 is further configured to obtain preset encoding information; the obtaining unit 21 includes:
the reading module 211 is configured to read a code field in the webpage response packet;
a judging module 212, configured to judge whether response packet coding information is recorded in the coding field read by the reading module 211;
a determining module 213, configured to determine the response packet coding information as the preset coding information when the determining module 212 determines that the response packet coding information is recorded in the coding field;
the determining module 213 is further configured to determine the default encoding information as the preset encoding information when the determining module 212 determines that no response packet encoding information is recorded in the encoding field.
Further, the judging unit 22 includes:
the segmenting module 221 is configured to segment the web page response data according to a preset segmentation rule;
a decoding module 222, configured to decode the web page response data segment by segment from the head using the preset encoding information.
Further, the judging unit 22 further includes:
the reading module 223 is configured to read a code field in the current data segment, and determine whether the code field records the webpage code information.
The device for analyzing webpage codes provided by the embodiment of the invention reads the webpage response data from the webpage response packet, decodes the webpage response data segment by segment using the preset coding information, and judges whether webpage coding information is recorded in the current data segment. When the judgment result is yes, the current data segment is decoded using the webpage coding information; if the current data segment is completely decoded, the webpage coding information is the coding information actually used by the webpage, so the webpage response data is decoded using the webpage coding information and the webpage is stored. When the judgment result is negative, another data segment is decoded using the preset coding information, and it is judged whether webpage coding information is recorded in that data segment. Compared with the prior art, which needs to run a complicated statistical algorithm on webpage data to guess the code actually used by the webpage, the invention improves the webpage analysis speed of the crawler system and meets the crawler system's requirement of acquiring webpage information with high real-time performance.
In addition, the embodiment of the invention decodes the webpage response data segment by segment using the preset coding information to obtain the webpage coding information recorded in the coding field of the webpage data, and, by comparing the preset coding information with the webpage coding information, finds a balance point between exactly matching and roughly matching the coding information. By sacrificing some matching accuracy, the crawler system obtains webpage information with high real-time performance, which avoids the loss of processing resources caused by repeatedly decoding all of the webpage data when the coding information in the webpage response packet is inconsistent with the coding information declared in the webpage data.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in the title of the invention (e.g., means for determining the level of links within a web site) in accordance with embodiments of the invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.

Claims (10)

1. A method for parsing web page encoding, the method comprising:
reading webpage response data from a webpage response packet;
decoding the webpage response data segment by segment through preset coding information, and judging whether webpage coding information is recorded in the current data segment, wherein the preset coding information can decode English characters;
if the judgment result is yes, decoding the current data segment by using the webpage coding information, and, once the current data segment is completely decoded, decoding the webpage response data by using the webpage coding information;
if the judgment result is no, decoding another data segment through the preset coding information, and judging whether the webpage coding information is recorded in that data segment;
wherein the decoding the webpage response data segment by segment through the preset coding information comprises: segmenting the webpage response data according to a preset segmentation rule, and decoding the webpage response data segment by segment from the beginning through the preset coding information.
2. The method of claim 1, wherein before the decoding the webpage response data segment by segment through the preset coding information, the method further comprises:
acquiring the preset coding information; wherein the acquiring the preset coding information comprises:
reading a coding field in a webpage response packet;
judging whether response packet coding information is recorded in the coding field;
if the response packet coding information is recorded, using the response packet coding information as the preset coding information;
and if the response packet coding information is not recorded, using default coding information as the preset coding information.
3. The method of claim 1, wherein the determining whether the current data segment has the webpage encoding information recorded therein comprises:
reading a coding field in the current data segment;
and judging whether the coding field records webpage coding information or not.
4. The method of claim 1 or 3, wherein the decoding the web page response data through the web page encoding information when the current data segment is fully decoded comprises:
and decoding other data segments through the webpage coding information.
5. The method of claim 1, wherein after the decoding the webpage response data segment by segment through the preset coding information and judging whether the webpage coding information is recorded in the current data segment, the method further comprises:
and if the current data segment is completely decoded by the preset coding information and the webpage coding information recorded in the current data segment is the same as the preset coding information, decoding the webpage response data through the preset coding information.
6. An apparatus for parsing web page encoding, the apparatus comprising:
the acquisition unit is used for reading webpage response data from the webpage response packet;
the judging unit is used for decoding, segment by segment through preset coding information, the webpage response data read by the acquisition unit, and judging whether the webpage coding information is recorded in the current data segment, wherein the preset coding information can decode English characters;
the processing unit is used for decoding the current data segment by using the webpage coding information when the judgment result of the judging unit is yes, and decoding the webpage response data by using the webpage coding information once the current data segment is completely decoded;
the processing unit is further used for decoding another data segment through the preset coding information when the judgment result of the judging unit is no, and judging whether the webpage coding information is recorded in that data segment through the judging unit;
the judging unit includes:
the segmentation module is used for segmenting the webpage response data according to a preset segmentation rule;
and the decoding module is used for decoding the webpage response data segment by segment from the beginning through the preset coding information.
7. The apparatus of claim 6, wherein the acquisition unit is further configured to acquire the preset coding information; the acquisition unit includes:
the reading module is used for reading the coding field in the webpage response packet;
the judging module is used for judging whether the coding field read by the reading module records response packet coding information;
the determining module is used for determining the response packet coding information as the preset coding information when the judging module judges that the response packet coding information is recorded in the coding field;
the determining module is further used for determining default coding information as the preset coding information when the judging module judges that no response packet coding information is recorded in the coding field.
8. The apparatus of claim 6, wherein the judging unit further comprises:
and the reading module is used for reading the coding field in the current data segment and judging whether the coding field records the webpage coding information.
9. A storage medium, comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the method for parsing a web page encoding according to any one of claims 1 to 5.
10. A processor, configured to run a program, wherein the program when running performs the method for parsing a web page encoding according to any one of claims 1 to 5.
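As an illustration of how the preset coding information recited in claim 2 might be acquired, the following minimal Python sketch reads the coding field of the response packet and falls back to a default. The function name `preset_encoding`, the use of the HTTP `Content-Type` header as the coding field, and the default value `utf-8` are assumptions for illustration only; the claims do not specify them.

```python
import codecs
import re
from typing import Optional

# Assumption: default coding information used when the response packet records none.
DEFAULT_ENCODING = "utf-8"

# Assumption: the coding field is the charset parameter of the Content-Type header.
_HEADER_CHARSET_RE = re.compile(r'charset\s*=\s*"?([\w.:-]+)', re.IGNORECASE)


def preset_encoding(content_type: Optional[str],
                    default: str = DEFAULT_ENCODING) -> str:
    """Return the encoding recorded in the response header, else the default."""
    if content_type:
        match = _HEADER_CHARSET_RE.search(content_type)
        if match:
            try:
                # Normalize the name and reject encodings Python does not know.
                return codecs.lookup(match.group(1)).name
            except LookupError:
                pass
    return default
```

For example, `preset_encoding("text/html; charset=GBK")` yields `"gbk"`, while a header without a charset parameter, or no header at all, yields the default.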
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510670507.2A CN106570044B (en) 2015-10-13 2015-10-13 Method and device for analyzing webpage codes

Publications (2)

Publication Number Publication Date
CN106570044A CN106570044A (en) 2017-04-19
CN106570044B true CN106570044B (en) 2019-12-24

Family

ID=58508827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510670507.2A Active CN106570044B (en) 2015-10-13 2015-10-13 Method and device for analyzing webpage codes

Country Status (1)

Country Link
CN (1) CN106570044B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020343B (en) * 2017-09-01 2021-03-30 北京国双科技有限公司 Method and device for determining webpage coding format

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6952425B1 (en) * 2000-11-14 2005-10-04 Cisco Technology, Inc. Packet data analysis with efficient and flexible parsing capabilities
CN101101606A (en) * 2007-08-03 2008-01-09 中兴通讯股份有限公司 Web page coding language automatic identification method and device for embedded type browser
CN101526963A (en) * 2009-04-17 2009-09-09 深圳华为通信技术有限公司 Method for identifying web page coding, device and terminal equipment
CN103207877A (en) * 2012-01-17 2013-07-17 阿里巴巴集团控股有限公司 Decoding method and device
CN103443741A (en) * 2011-02-07 2013-12-11 黑莓有限公司 Method and apparatus for receiving presentation metadata
CN103870487A (en) * 2012-12-13 2014-06-18 腾讯科技(深圳)有限公司 Webpage file processing method and mobile terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
嵌入式浏览器跨平台服务组件研究与设计;王姝文;《中国优秀硕士学位论文全文数据库 信息科技辑》;20111215(第12期);I139-248 *

Also Published As

Publication number Publication date
CN106570044A (en) 2017-04-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant