CN113434797A - Webpage information extraction method and device - Google Patents

Info

Publication number
CN113434797A
Authority
CN
China
Prior art keywords
webpage
visual
extracted
information
visual block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110725319.0A
Other languages
Chinese (zh)
Other versions
CN113434797B (en)
Inventor
李成钢
杨本栋
李忠
李金岭
杜忠田
王彦君
夏海轮
张碧昭
余清华
卜理超
张天正
李凤文
袁福碧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Digital Intelligence Technology Co Ltd
Original Assignee
China Telecom Group System Integration Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Group System Integration Co Ltd
Priority to CN202110725319.0A
Publication of CN113434797A
Application granted
Publication of CN113434797B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/957 - Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 - Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage information extraction method and device, and belongs to the field of information identification. The method comprises the following steps: acquiring webpage data to be identified; blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labeling the webpage visual block to obtain metadata to be extracted; and extracting the metadata to be extracted to obtain target data. The invention provides an article type webpage information extraction method for extracting webpage structural data by fully utilizing webpage context information without depending on a webpage design style, and solves the technical problems of low algorithm efficiency and poor node combination effect caused by webpage design style change in the prior art, and the technical problems of less utilization of context information and insufficient information extraction precision.

Description

Webpage information extraction method and device
Technical Field
The invention belongs to the field of information identification, and particularly relates to a webpage information extraction method and device.
Background
With the continuous development of intelligent technology, people increasingly use intelligent devices in daily life, work and study. These technologies have improved quality of life and made study and work more efficient.
Meanwhile, as the internet continues to expand, the amount of information it holds grows exponentially, and people rely on electronic devices such as computers and mobile phones to obtain information from web pages. However, web pages contain a large amount of content that serves website or commercial promotion and is irrelevant to the page's subject, and this mass of data causes information overload and information redundancy. How to efficiently obtain valuable information from the internet is therefore a current research focus, and how to quickly and efficiently extract article information from web pages is a problem that the prior art urgently needs to solve. In the prior art, the VIPS algorithm has been improved to suit currently popular web design styles, but the improved algorithm is still inefficient and its node merging degrades when web design styles change. The prior art also provides an automatic web information extraction algorithm that combines machine learning with grouping techniques. Information extraction automatically identifies and extracts structured information from unstructured documents; it can quickly and accurately pick out the truly useful information from massive data and improve the efficiency of information acquisition. In the field of web page information extraction, researchers have proposed various extraction algorithms for different types of web pages. As web design specifications and styles change, some of these algorithms are no longer applicable, and the extraction accuracy of the existing algorithms is not high.
Disclosure of Invention
The invention provides a webpage information extraction method and device for article-type web pages. The method does not depend on the webpage design style and makes full use of webpage context information to extract structured webpage data. It addresses the following technical problems of the prior art: low algorithm efficiency, poor node merging caused by changes in webpage design style, little use of context information, and insufficient information extraction precision.
One aspect of the present invention provides a method for extracting web page information, including: acquiring webpage data to be identified; blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labeling the webpage visual block to obtain metadata to be extracted; and extracting the metadata to be extracted to obtain target data.
Further, after acquiring the webpage data to be identified, the method further comprises: preprocessing the webpage data to be identified.
Further, before labeling the webpage visual block to obtain the metadata to be extracted, the method further includes: selecting a webpage main body area according to the webpage visual block.
Further, the extracting the metadata to be extracted to obtain the target data includes: acquiring a random field model; extracting the metadata to be extracted according to the random field model to obtain an extraction result; and outputting the extraction result and generating the target data.
In another aspect of the present invention, an apparatus for extracting web page information is further provided, including: the acquisition module is used for acquiring the webpage data to be identified; the blocking module is used for blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; the marking module is used for marking the webpage visual block to obtain metadata to be extracted; and the extraction module is used for carrying out extraction operation on the metadata to be extracted to obtain target data.
Further, the apparatus further comprises: a preprocessing module, configured to preprocess the webpage data to be identified.
Further, the apparatus further comprises: a selection module, configured to select the main body area of the webpage according to the webpage visual block.
Further, the extraction module comprises: the model unit is used for acquiring a random field model; the extraction unit is used for extracting the metadata to be extracted according to the random field model to obtain an extraction result; and the output unit is used for outputting the extraction result and generating the target data.
In another aspect of the present invention, a non-volatile storage medium is further provided, where the non-volatile storage medium includes a stored program, and the program controls, when running, a device in which the non-volatile storage medium is located to execute a method for extracting web page information.
In another aspect of the present invention, an electronic device is further provided, which includes a processor and a memory; the memory is stored with computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a webpage information extraction method when running.
Compared with the prior art, the invention has the beneficial effects that:
The method of the invention acquires the webpage data to be identified; blocks the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labels the webpage visual block to obtain metadata to be extracted; and extracts the metadata to be extracted to obtain target data. It thereby provides an article-type webpage information extraction method that does not depend on the webpage design style and makes full use of webpage context information to extract structured webpage data. This solves the technical problems of low algorithm efficiency and poor node merging caused by changes in webpage design style in the prior art, as well as the technical problems of little use of context information and insufficient information extraction precision.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a diagram illustrating a method for extracting information from an article-type web page according to an embodiment of the present invention;
FIG. 2 is a flow chart of a web page information extraction algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic input/output diagram of a paging block algorithm according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for extracting web page information according to an embodiment of the present invention;
FIG. 5 is a block diagram of a structure of a web page information extraction apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, a method embodiment of a webpage information extraction method is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.
Example one
Fig. 4 is a flowchart of a method for extracting web page information according to an embodiment of the present invention, as shown in fig. 4, the method includes the following steps:
step S402, acquiring webpage data to be identified.
Specifically, in this embodiment, so that the required information can be extracted through subsequent processing and model-based identification, a webpage acquisition tool first captures and stores the webpage content and the information to be identified for later processing.
Optionally, after the acquiring the webpage data to be identified, the method further includes: and preprocessing the webpage data to be identified.
Specifically, for webpage preprocessing in this embodiment, BeautifulSoup may be used to standardize the webpage. Redundant information is then removed by deleting unnecessary tags and content from the webpage source code, including comments, scripts, styles and interactive tags, using regular expression matching. The regular expressions used are shown in Table 2.
TABLE 2 Webpage preprocessing regular expressions
Regular expression                  Meaning
(?is)<!--[^>]*-->                   Matches HTML comments
(?is)<script.*?>.*?</script>        Matches <script> tags
(?is)<style.*?>.*?</style>          Matches <style> tags
(?is)<\s*link[^>]*>                 Matches <link> tags
(?is)<input.*?>                     Matches <input> tags
(?is)<select.*?>.*?</select>        Matches <select> tags
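The regular-expression cleanup in Table 2 can be sketched in Python as follows. This is a minimal illustration that assumes the patterns are applied one after another with the standard `re` module; the names `PREPROCESS_PATTERNS` and `strip_redundant_markup` are hypothetical and do not come from the patent, and the BeautifulSoup standardization step is left out.

```python
import re

# Patterns copied from Table 2. The leading (?is) makes each pattern
# case-insensitive and lets "." match newlines.
PREPROCESS_PATTERNS = [
    r"(?is)<!--[^>]*-->",             # HTML comments
    r"(?is)<script.*?>.*?</script>",  # <script> tags and their content
    r"(?is)<style.*?>.*?</style>",    # <style> tags and their content
    r"(?is)<\s*link[^>]*>",           # <link> tags
    r"(?is)<input.*?>",               # <input> tags
    r"(?is)<select.*?>.*?</select>",  # <select> tags and their content
]


def strip_redundant_markup(html: str) -> str:
    """Remove every match of each Table 2 pattern from the page source."""
    for pattern in PREPROCESS_PATTERNS:
        html = re.sub(pattern, "", html)
    return html
```

Note that the comment pattern, as given, does not match a comment whose body contains a `>` character; a real pipeline might need a more permissive pattern for that case.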
Step S404, blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block.
Specifically, the main area of the webpage is selected as follows:
[Four formula images from the original publication are not reproduced in this text; the symbols they use are defined below.]
where Lc is the number of text lines; Td is the text density; Pd is the symbol density; Ld is the hyperlink density; height is the node height; n is the number of texts; t and l are text parameters; padding_top is the distance between the element content and the top border of the element in the node; padding_bottom is the distance between the element content and the bottom border of the element in the node; line_height is the line spacing, i.e. the spacing between lines of text in a multi-line element; p_i is the number of symbols in node i; l_i is the length of the hyperlink text in each node; nt_i is the number of text words in node i, i.e. the number of Chinese characters plus the number of English words; it_i is the number of text words of the hyperlinks in node i.
Selecting a visual block:
a) Traverse all the visual blocks and calculate the text density Td, the symbol density Pd, the hyperlink density Ld and the text line count Lc of each visual block.
b) Among all unmarked visual blocks, find the one with the largest text density Td and check whether its text line count Lc is greater than 3. If so, take it as the reference visual block and go to the next step; otherwise, mark the visual block and repeat this step. If the check fails 3 times in a row, the webpage is considered not to be an article webpage.
c) Search upward from the reference visual block for the starting visual block of the webpage main body. Article titles generally have the largest line spacing, so search upward for the visual block with the largest line spacing and a text line count Lc of no more than 2. Starting from that visual block and moving upward, use a sliding window to calculate the average hyperlink density of 3 adjacent visual blocks; stop when the average hyperlink density exceeds 20%, take the visual block at which the search stops as the starting visual block, and go to the next step. If no visual block with the largest line spacing and a text line count Lc of no more than 2 is found, go to the next step.
d) Starting from the reference visual block and moving upward, use a sliding window to calculate the average visual block score S and the average symbol density of 3 adjacent visual blocks, where S is given by formula (1-5). Stop when S is less than 10, and determine the starting visual block using the same average hyperlink density check as in step c).
S = Td * log(1 + NodeNum) * log(Pd)    (1-5)
where NodeNum is the number of nodes.
e) The ending visual block is determined in the same way as the starting visual block in step d).
f) The visual blocks between the starting visual block and the ending visual block are taken as the main body area of the webpage.
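The reference-block selection and the block score of formula (1-5) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Block` record, the function names, the natural logarithm, and the choice to return a score of 0 when Pd is not positive are all hypothetical, since the text does not specify the log base or how undefined logarithms are handled.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    td: float          # text density Td
    pd: float          # symbol density Pd
    ld: float          # hyperlink density Ld
    lc: int            # text line count Lc
    node_num: int = 1  # NodeNum, number of nodes in the block


def block_score(b: Block) -> float:
    """S = Td * log(1 + NodeNum) * log(Pd); natural log is an assumption."""
    if b.pd <= 0:
        return 0.0  # assumption: treat an undefined log(Pd) as score 0
    return b.td * math.log(1 + b.node_num) * math.log(b.pd)


def pick_reference_block(blocks: List[Block], min_lines: int = 3,
                         max_failures: int = 3) -> Optional[int]:
    """Return the index of the reference block, or None if the page is
    judged not to be an article page after max_failures failed checks."""
    marked = set()
    for _ in range(max_failures):
        candidates = [i for i in range(len(blocks)) if i not in marked]
        if not candidates:
            return None
        best = max(candidates, key=lambda i: blocks[i].td)
        if blocks[best].lc > min_lines:
            return best
        marked.add(best)  # mark the block and try the next densest one
    return None
```

For example, a short but dense navigation block is marked and skipped, and the next densest block with more than 3 text lines becomes the reference block.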
Step S406, labeling the webpage visual block to obtain metadata to be extracted.
Optionally, before labeling the webpage visual block to obtain the metadata to be extracted, the method further includes: selecting a webpage main body area according to the webpage visual block.
Step S408, extracting the metadata to be extracted to obtain target data.
Optionally, extracting the metadata to be extracted to obtain the target data includes: acquiring a random field model; extracting the metadata to be extracted according to the random field model to obtain an extraction result; and outputting the extraction result and generating the target data.
Specifically, a flowchart of the article-type webpage information extraction algorithm provided by this embodiment is shown in FIG. 2. First, the webpage is preprocessed: it is standardized and redundant information is removed. The webpage is then blocked according to the visual characteristics of article webpages, dividing the DOM tree nodes into a number of consistent visual blocks. Next, the main body area of the webpage is located using statistical features, which filters out a large amount of noise. Finally, text, visual and dictionary features are selected as the feature set for feature extraction, sequence labeling is performed with a third-order conditional random field model, and information such as the title, body text, author, source, release time, images and attachments is extracted. The method can quickly acquire the large amount of information contained in article-type webpages.
Specifically, in this embodiment, once the webpage data to be blocked has been acquired, webpage blocking is the process of merging nodes: visual information is used to merge several nodes of the same type into one node block, also called a visual block.
It should be noted that the nodes in a Consistent Visual Block (CVB) should satisfy the following conditions: if the consistent visual block contains several nodes, the nodes are adjacent both in the page layout and in the DOM tree; the nodes in the consistent visual block are left-aligned or top-aligned; if the nodes are left-aligned, their widths must be the same; if the nodes are top-aligned, their heights must be the same; all nodes in the consistent visual block have the same font, including typeface, size, color, whether the text is bold and whether it is underlined; and each node that does not contain text is a separate consistent visual block.
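The consistency conditions above can be expressed as a pairwise check between two adjacent nodes. The sketch below is one possible reading, not the patent's implementation: the `Node` fields and the function name `is_consistent` are hypothetical, adjacency in the layout and the DOM tree is assumed to be guaranteed by the caller, and exact equality is used where a real page might need a pixel tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    left: int
    top: int
    width: int
    height: int
    font: tuple            # (typeface, size, color, bold, underline)
    has_text: bool = True


def is_consistent(a: Node, b: Node) -> bool:
    """Check whether two adjacent nodes may share one consistent visual block."""
    if not (a.has_text and b.has_text):
        return False  # a node without text is always its own block
    if a.font != b.font:
        return False  # typeface, size, color, bold, underline must all match
    left_aligned = a.left == b.left
    top_aligned = a.top == b.top
    if not (left_aligned or top_aligned):
        return False
    if left_aligned and a.width != b.width:
        return False  # left-aligned nodes must have the same width
    if top_aligned and a.height != b.height:
        return False  # top-aligned nodes must have the same height
    return True
```

Two stacked paragraphs with the same left edge, width and font pass the check, while a bold heading above them does not.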
The main process is as follows:
a) Traverse the DOM tree to obtain a List of leaf nodes.
b) Check whether each leaf node in the List is a text node or an image node; if it is neither, delete the leaf node.
c) For each leaf node in the List, find its sibling leaf nodes and check whether the adjacent sibling node and the node meet the consistent visual block criteria. If not, the leaf node is an independent consistent visual block. If so, merge the two leaf nodes into one consistent visual block and go on to check the consistency between the next adjacent sibling node and that consistent visual block, and so on until all sibling leaf nodes have been checked.
d) After a leaf node has been merged with its sibling nodes, if a parent node exists, the merged consistent visual block replaces the parent node and each visual block becomes a new leaf node. The nodes in a consistent visual block must keep their original order after the replacement.
e) Repeat the above steps until all leaf nodes are consistent visual block nodes.
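Step c) of the process above, merging a run of adjacent sibling leaf nodes, can be sketched as a single left-to-right pass. This is a simplified illustration: the function name `merge_siblings` is hypothetical, the consistency test is passed in as a callable, and a candidate node is compared only with the last node already in the block, which is an assumption about how "consistency with the block" is evaluated. The parent replacement of step d) is not shown.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def merge_siblings(siblings: Sequence[T],
                   consistent: Callable[[T, T], bool]) -> List[List[T]]:
    """Group adjacent sibling leaf nodes into consistent visual blocks,
    preserving the original node order."""
    blocks: List[List[T]] = []
    for node in siblings:
        if blocks and consistent(blocks[-1][-1], node):
            blocks[-1].append(node)  # extend the current block
        else:
            blocks.append([node])    # start a new independent block
    return blocks
```

With font sizes standing in for nodes and equality as the consistency test, `merge_siblings([14, 14, 20, 14, 14, 14], lambda a, b: a == b)` groups the two leading 14s, the single 20, and the three trailing 14s into separate blocks.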
It should be further noted that CRF-based webpage metadata extraction that labels only the current visual block or node makes insufficient use of context information and has insufficient extraction precision, whereas a conditional random field model can fold context information into its transition features. After the labels and features are determined, a training file is generated. During feature extraction, the features of the current visual block and of the two preceding visual blocks are extracted together: the features of the current position are state features, and the combination of the current position's features with those of the two preceding positions forms the transition features. Determining the feature weights is the process of building the third-order conditional random field model. The features used are shown in Table 3. During training, the L-BFGS algorithm is used with Elastic Net (L1 + L2) regularization to solve for the parameters of the third-order conditional random field model. L-BFGS converges well and computes quickly, and Elastic Net adds an L1 regularization term and an L2 regularization term to the loss function to counter overfitting during model training. During prediction and labeling, Viterbi decoding is used to obtain the optimal sequence.
TABLE 3 Features
[The table is provided as images in the original publication and is not reproduced in this text.]
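The feature extraction described above, which looks at the current visual block together with the two preceding ones, can be sketched as a feature-template function. This is an illustrative sketch only: the function name `build_features`, the key prefixes `cur.` and `ctx.`, the `BOS` and `NA` placeholders and the example feature name `font_size` are all hypothetical. It builds only the observation side of the templates; the feature weights and label transitions would be learned by the CRF trainer.

```python
from typing import Dict, List


def build_features(blocks: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """For each visual block, emit state features for the current position
    and context features combining it with the two preceding positions."""
    feats: List[Dict[str, str]] = []
    for i, cur in enumerate(blocks):
        f = {f"cur.{k}": v for k, v in cur.items()}  # state features
        prev1 = blocks[i - 1] if i >= 1 else None
        prev2 = blocks[i - 2] if i >= 2 else None
        for k, v in cur.items():
            # "BOS" marks a position before the start of the sequence,
            # "NA" a feature the earlier block does not have.
            p1 = prev1.get(k, "NA") if prev1 is not None else "BOS"
            p2 = prev2.get(k, "NA") if prev2 is not None else "BOS"
            f[f"ctx.{k}"] = f"{p2}|{p1}|{v}"         # context features
        feats.append(f)
    return feats
```

Each returned dictionary corresponds to one position in the sequence and could be written out as one line of a training file.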
Through the above embodiment, the following technical problems are solved. In the prior art, although the VIPS algorithm has been improved to suit current webpage design styles, the algorithm is inefficient and its node merging degrades when webpage design styles change. The prior art also trains a machine learning model on features generated from DOM tree node attributes, selects candidate nodes according to the model, filters the noise out of the candidate nodes and recovers the data the candidate nodes miss, but that approach makes little use of context information and its information extraction precision is not high.
Example two
Fig. 5 is a block diagram of a structure of a web page information extraction apparatus according to an embodiment of the present invention, as shown in fig. 5, the apparatus includes:
An obtaining module 50, configured to obtain the webpage data to be identified.
Specifically, in this embodiment, so that the required information can be extracted through subsequent processing and model-based identification, a webpage acquisition tool first captures and stores the webpage content and the information to be identified for later processing.
Optionally, the apparatus further comprises: a preprocessing module, configured to preprocess the webpage data to be identified.
Specifically, for webpage preprocessing in this embodiment, a tool such as BeautifulSoup may be used to standardize the webpage. Redundant information is then removed by deleting unnecessary tags and content from the webpage source code, including comments, scripts, styles and interactive tags, using regular expression matching. The regular expressions used are shown in Table 2.
TABLE 2 Webpage preprocessing regular expressions
Regular expression                  Meaning
(?is)<!--[^>]*-->                   Matches HTML comments
(?is)<script.*?>.*?</script>        Matches <script> tags
(?is)<style.*?>.*?</style>          Matches <style> tags
(?is)<\s*link[^>]*>                 Matches <link> tags
(?is)<input.*?>                     Matches <input> tags
(?is)<select.*?>.*?</select>        Matches <select> tags
A blocking module 52, configured to block the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block.
Optionally, the apparatus further comprises: a selection module, configured to select the main body area of the webpage according to the webpage visual block.
Specifically, the main area of the webpage in this embodiment may be located as follows:
[Four formula images from the original publication are not reproduced in this text; the symbols they use are defined below.]
where Lc is the number of text lines; Td is the text density; Pd is the symbol density; Ld is the hyperlink density; height is the node height; n is the number of texts; t and l are text parameters; padding_top is the distance between the element content and the top border of the element in the node; padding_bottom is the distance between the element content and the bottom border of the element in the node; line_height is the line spacing, i.e. the spacing between lines of text in a multi-line element; p_i is the number of symbols in node i; l_i is the length of the hyperlink text in each node; nt_i is the number of text words in node i, i.e. the number of Chinese characters plus the number of English words; it_i is the number of text words of the hyperlinks in node i.
Selecting a visual block:
a) Traverse all the visual blocks and calculate the text density Td, the symbol density Pd, the hyperlink density Ld and the text line count Lc of each visual block.
b) Among all unmarked visual blocks, find the one with the largest text density Td and check whether its text line count Lc is greater than 3. If so, take it as the reference visual block and go to the next step; otherwise, mark the visual block and repeat this step. If the check fails 3 times in a row, the webpage is considered not to be an article webpage.
c) Search upward from the reference visual block for the starting visual block of the webpage main body. Article titles generally have the largest line spacing, so search upward for the visual block with the largest line spacing and a text line count Lc of no more than 2. Starting from that visual block and moving upward, use a sliding window to calculate the average hyperlink density of 3 adjacent visual blocks; stop when the average hyperlink density exceeds 20%, take the visual block at which the search stops as the starting visual block, and go to the next step. If no visual block with the largest line spacing and a text line count Lc of no more than 2 is found, go to the next step.
d) Starting from the reference visual block and moving upward, use a sliding window to calculate the average visual block score S and the average symbol density of 3 adjacent visual blocks, where S is given by formula (1-5). Stop when S is less than 10, and determine the starting visual block using the same average hyperlink density check as in step c).
S = Td * log(1 + NodeNum) * log(Pd)    (1-5)
where NodeNum is the number of nodes.
e) The ending visual block is determined in the same way as the starting visual block in step d).
f) The visual blocks between the starting visual block and the ending visual block are taken as the main body area of the webpage.
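The upward sliding-window search used in step c) can be sketched as follows. The text leaves several details open, so this sketch makes explicit assumptions: blocks are indexed top to bottom, the window covers the current block and the two blocks above it, the block returned is the lower edge of the window at the moment the 20% threshold is exceeded, and index 0 is returned if the top of the page is reached first. The function name `find_start_block` is hypothetical.

```python
from typing import Sequence


def find_start_block(ld: Sequence[float], ref: int,
                     window: int = 3, threshold: float = 0.20) -> int:
    """Move a window of `window` adjacent blocks upward from index `ref`
    and return the index at which the average hyperlink density first
    exceeds `threshold`."""
    i = ref
    while i - window + 1 >= 0:
        win = ld[i - window + 1 : i + 1]
        if sum(win) / window > threshold:
            return i  # assumption: the lower edge of the window is the stop block
        i -= 1
    return 0  # assumption: no link-heavy region found, body starts at the top
```

Given hyperlink densities `[0.9, 0.8, 0.7, 0.1, 0.0, 0.0, 0.05]` and a reference block at index 6, the window first exceeds 20% when it covers indices 2 to 4, so the search stops at index 4. The same loop, run with increasing indices, could serve for the ending visual block.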
A labeling module 54, configured to label the webpage visual block to obtain metadata to be extracted.
An extraction module 56, configured to perform an extraction operation on the metadata to be extracted to obtain target data.
Optionally, the extraction module includes: a model unit, configured to acquire a random field model; an extraction unit, configured to extract the metadata to be extracted according to the random field model to obtain an extraction result; and an output unit, configured to output the extraction result and generate the target data.
Specifically, a flowchart of the article-type webpage information extraction algorithm provided by this embodiment is shown in FIG. 2. First, the webpage is preprocessed: it is standardized and redundant information is removed. The webpage is then blocked according to the visual characteristics of article webpages, dividing the DOM tree nodes into a number of consistent visual blocks. Next, the main body area of the webpage is located using statistical features, which filters out a large amount of noise. Finally, text, visual and dictionary features are selected as the feature set for feature extraction, sequence labeling is performed with a third-order conditional random field model, and information such as the title, body text, author, source, release time, images and attachments is extracted. The method can quickly acquire the large amount of information contained in article-type webpages.
Specifically, in this embodiment, once the webpage data to be blocked has been acquired, webpage blocking is the process of merging nodes: visual information is used to merge several nodes of the same type into one node block, also called a visual block.
It should be noted that the nodes in a Consistent Visual Block (CVB) should satisfy the following conditions: if the consistent visual block contains several nodes, the nodes are adjacent both in the page layout and in the DOM tree; the nodes in the consistent visual block are left-aligned or top-aligned; if the nodes are left-aligned, their widths must be the same; if the nodes are top-aligned, their heights must be the same; all nodes in the consistent visual block have the same font, including typeface, size, color, whether the text is bold and whether it is underlined; and each node that does not contain text is a separate consistent visual block. The main process is as follows:
a) Traverse the DOM tree to obtain a list (List) of leaf nodes.
b) Determine whether each leaf node in the List is a text node or an image node; if it is neither, delete it.
c) For each leaf node in the List, look up its sibling leaf nodes and determine whether the adjacent sibling and the node satisfy the consistent visual block criteria. If not, the leaf node is an independent consistent visual block. If so, merge the two leaf nodes into one consistent visual block, then check the next adjacent sibling against that block, and so on until all sibling leaf nodes have been checked.
d) After a leaf node has been merged with its siblings, if a parent node exists, replace the parent node with the merged consistent visual block and treat each visual block as a new leaf node; the nodes within the block must keep their original order after the replacement.
e) Repeat the above steps until all leaf nodes are consistent visual block nodes.
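The sibling-merging step c) can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Leaf` structure, its field names, and the helper functions are hypothetical, pixel-exact equality stands in for "aligned" and "same width/height", DOM adjacency is represented only by list order, and the parent replacement of steps d) and e) is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Leaf:
    """A rendered DOM leaf node (hypothetical structure for illustration)."""
    text: Optional[str]  # None for nodes without text, e.g. images
    left: int
    top: int
    width: int
    height: int
    # (typeface, size, color, bold, underline)
    font: Tuple[str, int, str, bool, bool]


def consistent(a: Leaf, b: Leaf) -> bool:
    """Check the consistency conditions for two adjacent sibling leaves."""
    # A node without text always forms its own block.
    if a.text is None or b.text is None:
        return False
    # All font properties must match.
    if a.font != b.font:
        return False
    # Left-aligned nodes need equal widths; top-aligned nodes need equal heights.
    left_aligned = a.left == b.left and a.width == b.width
    top_aligned = a.top == b.top and a.height == b.height
    return left_aligned or top_aligned


def merge_siblings(siblings: List[Leaf]) -> List[List[Leaf]]:
    """Greedily merge adjacent sibling leaves into blocks, keeping their order."""
    blocks: List[List[Leaf]] = []
    for leaf in siblings:
        # Compare the next sibling against every member of the current block.
        if blocks and all(consistent(member, leaf) for member in blocks[-1]):
            blocks[-1].append(leaf)
        else:
            blocks.append([leaf])
    return blocks
```

Two left-aligned paragraphs with the same width and font end up in one block, while an image or a paragraph in a different font starts a new block.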
It should be further noted that, in CRF-based web page metadata extraction, current visual block or node labeling approaches underutilize context information and achieve insufficient extraction precision, whereas a conditional random field model can incorporate context information into its transition features. After the labels and features are determined, a training file is generated. During feature extraction, the features of the current visual block and of the two preceding visual blocks are extracted together: the features at the current position are state features, and the combination of the current-position features with the features of the two preceding positions forms the transition features. Determining the feature weights is the process of building the third-order conditional random field model. The features used are shown in table 1. During training, the L-BFGS algorithm is used together with Elastic Net (L1 + L2) regularization to solve for the parameters of the third-order conditional random field model. L-BFGS converges well and computes quickly, and Elastic Net regularization adds an L1 term and an L2 term to the loss function to counter overfitting during training. During prediction, Viterbi decoding is used to obtain the optimal label sequence.
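For orientation, the model, the regularized training objective, and the decoding rule described above can be written in one common formulation. The notation is an assumption and does not come from the patent: x is the sequence of visual blocks, y the label sequence, f_k the feature functions over the current label and the two preceding labels, \lambda_k their weights, and c_1, c_2 the regularization strengths.

```latex
% Conditional probability of a label sequence given the visual blocks
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-2}, y_{t-1}, y_t, x, t) \Big)

% Elastic Net (L1 + L2) regularized negative log-likelihood over N training sequences
\min_{\lambda} \; -\sum_{i=1}^{N} \log p_\lambda\big(y^{(i)} \mid x^{(i)}\big)
  + c_1 \lVert \lambda \rVert_1 + c_2 \lVert \lambda \rVert_2^2

% Prediction: the most probable label sequence, found by Viterbi decoding
\hat{y} = \arg\max_{y} \; p_\lambda(y \mid x)
```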
Table 3 Features
[Table reproduced as images in the original publication: Figure BDA0003138384310000111, Figure BDA0003138384310000121]
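The following Python sketch shows one way to combine the features of the current visual block with those of the two preceding blocks, as the description above requires. The feature names used in the usage note (`font_size`, `text_len`) and the `BOS` marker for positions before the start of the sequence are hypothetical; the actual feature set is the one listed in the table.

```python
from typing import Dict, List

Features = Dict[str, object]


def block_features(blocks: List[Features], i: int) -> Features:
    """Features for position i: the current block plus the two preceding blocks."""
    # Features of the current visual block (used as state features).
    feats: Features = {f"cur:{name}": value for name, value in blocks[i].items()}
    # Features of the previous two visual blocks (context for transition features).
    for offset in (1, 2):
        j = i - offset
        if j < 0:
            # Hypothetical marker: no block exists this far back in the sequence.
            feats[f"prev{offset}:BOS"] = True
        else:
            for name, value in blocks[j].items():
                feats[f"prev{offset}:{name}"] = value
    return feats


def sequence_features(blocks: List[Features]) -> List[Features]:
    """Build one feature dictionary per visual block of a page."""
    return [block_features(blocks, i) for i in range(len(blocks))]
```

For a page with blocks `[{"font_size": 24}, {"font_size": 12}, {"font_size": 12}]`, the third feature dictionary contains `cur:font_size`, `prev1:font_size`, and `prev2:font_size`, so the labeler sees the large-font block two positions earlier.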
Example three
In another embodiment of the present invention, a non-volatile storage medium is provided, where the non-volatile storage medium includes a stored program, and the program controls a device in which the non-volatile storage medium is located to execute a method for extracting web page information when running.
Specifically, the method comprises the following steps: acquiring webpage data to be identified; blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labeling the webpage visual block to obtain metadata to be extracted; and extracting the metadata to be extracted to obtain target data.
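The four steps above can be wired together as in the following sketch. It only illustrates the order of the steps: the function name, the three callables (`acquire`, `segment`, `label`), and the `"other"` label for blocks that carry no target information are hypothetical stand-ins and are not defined by the patent.

```python
from typing import Callable, Dict, List, Sequence

Block = Dict[str, object]


def extract_web_page(
    source: str,
    acquire: Callable[[str], str],
    segment: Callable[[str], List[Block]],
    label: Callable[[Sequence[Block]], List[str]],
) -> Dict[str, List[Block]]:
    """Run acquisition, blocking, labeling, and extraction in sequence."""
    html = acquire(source)      # acquire the web page data to be identified
    blocks = segment(html)      # partition the page into visual blocks
    labels = label(blocks)      # label each visual block
    if len(labels) != len(blocks):
        raise ValueError("one label is required per visual block")
    target: Dict[str, List[Block]] = {}
    for block, tag in zip(blocks, labels):
        # Keep only blocks labeled as metadata; "other" is a hypothetical noise label.
        if tag != "other":
            target.setdefault(tag, []).append(block)
    return target
```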
Example four
In another embodiment of the present invention, an electronic device is provided, which includes a processor and a memory; the memory is stored with computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a webpage information extraction method when running.
Specifically, the method comprises the following steps: acquiring webpage data to be identified; blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labeling the webpage visual block to obtain metadata to be extracted; and extracting the metadata to be extracted to obtain target data.
The above embodiments solve two technical problems of the prior art. First, although prior-art improvements to the VIPS algorithm take web page design style into account, they suffer from low algorithmic efficiency and poor node merging when the design style changes. Second, prior-art approaches that generate multiple features from DOM tree node attributes to train a machine learning model, select candidate nodes according to the model, filter out noise among the candidates, and recover data missing from the candidates make little use of context information and achieve low information extraction precision.
Effects of the embodiment
In the embodiment of the invention, verification tests were carried out on a test set of 80 article-type web pages from domestic and foreign websites such as Sina News, the Ministry of Education of China, and the Department for Education in England. The test results are shown in table 1. The results show that the information extraction algorithm provided by the invention extracts the target information well.
Table 1 Information extraction results on test set A
Label         Total  Identified  Correct  P      R      F1
Title         80     80          80       1.000  1.000  1.000
Text          473    494         466      0.943  0.985  0.963
Author        50     40          40       0.800  1.000  0.889
Release time  80     91          62       0.775  0.681  0.720
Source        50     52          46       0.920  0.885  0.902
Picture       12     14          12       1.000  0.857  0.923
Attachment    22     21          21       0.955  1.000  0.978
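As a worked example of the metrics, the sketch below computes precision, recall, and F1 from the three count columns. The formulas are the standard definitions and are an assumption here, since the table does not state them: P = correct / identified, R = correct / total, F1 = 2PR / (P + R). With the columns as labelled, these definitions reproduce the Title and Text rows; in the remaining rows the printed P and R equal the two ratios in the opposite order, so the column convention should be checked against the original publication.

```python
from typing import Tuple


def prf(total: int, identified: int, correct: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from the counts of one label."""
    precision = correct / identified if identified else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Text row of the table: 473 in total, 494 identified, 466 correct.
p, r, f1 = prf(total=473, identified=494, correct=466)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.943 R=0.985 F1=0.964
```

The computed F1 for the Text row is about 0.9638, which agrees with the printed 0.963 up to rounding of the last digit.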
The algorithm provided by the invention is compared with existing algorithms that have good extraction performance, as shown in fig. 1, which is a comparison diagram of article-type web page information extraction methods according to the embodiment of the invention. news extractor is the extraction method of document 1; it extracts the title, body text, release time, and source. newsboper is the extraction method used in document 2; it extracts the title, author, release time, keywords, and body text. blockCRF is the method proposed herein; it extracts the title, author, release time, body text, source, pictures, attachments, and so on. In this embodiment, the data items extracted in common by the three algorithms are compared, and in terms of the overall metrics the algorithm provided by this embodiment achieves the best extraction results.
Document 1: Extraction algorithm for key information of news web pages [J]. 2016, 36(08): 2082-2086+2120.
Document 2: Sarr E N, Ousane S, Diallo A. FactExtract: automatic collection and aggregation of images and statistical effects from online news paper [C]//2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS). IEEE, 2018: 336-.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for extracting webpage information is characterized by comprising the following steps:
acquiring webpage data to be identified;
blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block;
labeling the webpage visual block to obtain metadata to be extracted;
and extracting the metadata to be extracted to obtain target data.
2. The method of claim 1, wherein after the obtaining the data of the web page to be identified, the method further comprises:
and preprocessing the webpage data to be identified.
3. The method of claim 1, wherein before labeling the visual block of the web page to obtain metadata to be extracted, the method further comprises:
and selecting a webpage main body area according to the webpage visual block.
4. The method according to claim 1, wherein the extracting the metadata to be extracted to obtain target data comprises:
acquiring a random field model;
extracting the metadata to be extracted according to the random field model to obtain an extraction result;
and outputting the extraction result and generating the target data.
5. A web page information extraction apparatus, comprising:
the acquisition module is used for acquiring the webpage data to be identified;
the blocking module is used for blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block;
the marking module is used for marking the webpage visual block to obtain metadata to be extracted;
and the extraction module is used for carrying out extraction operation on the metadata to be extracted to obtain target data.
6. The apparatus of claim 5, further comprising:
and the preprocessing module is used for preprocessing the webpage data to be identified.
7. The apparatus of claim 5, further comprising:
and the selection module is used for selecting the main body area of the webpage according to the webpage visual block.
8. The apparatus of claim 5, wherein the extraction module comprises:
the model unit is used for acquiring a random field model;
the extraction unit is used for extracting the metadata to be extracted according to the random field model to obtain an extraction result;
and the output unit is used for outputting the extraction result and generating the target data.
9. A non-volatile storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the method of any one of claims 1 to 4.
10. An electronic device comprising a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions when executed perform the method of any one of claims 1 to 4.
CN202110725319.0A 2021-06-29 2021-06-29 Webpage information extraction method and device Active CN113434797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110725319.0A CN113434797B (en) 2021-06-29 2021-06-29 Webpage information extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110725319.0A CN113434797B (en) 2021-06-29 2021-06-29 Webpage information extraction method and device

Publications (2)

Publication Number Publication Date
CN113434797A true CN113434797A (en) 2021-09-24
CN113434797B CN113434797B (en) 2024-05-31

Family

ID=77757552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110725319.0A Active CN113434797B (en) 2021-06-29 2021-06-29 Webpage information extraction method and device

Country Status (1)

Country Link
CN (1) CN113434797B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218515A (en) * 2021-12-21 2022-03-22 北京大学 Web digital object extraction method and system based on content segmentation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254014A (en) * 2011-07-21 2011-11-23 华中科技大学 Adaptive information extraction method for webpage characteristics
CN103473338A (en) * 2013-09-22 2013-12-25 北京奇虎科技有限公司 Webpage content extraction method and webpage content extraction system
WO2017080090A1 (en) * 2015-11-14 2017-05-18 孙燕群 Extraction and comparison method for text of webpage
CN111428444A (en) * 2020-03-27 2020-07-17 新华智云科技有限公司 Automatic extraction method of webpage information
CN112101004A (en) * 2020-09-23 2020-12-18 电子科技大学 General webpage character information extraction method based on conditional random field and syntactic analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
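Yu Hongtao, Yu Haiming, Zhang Fuzhi: "Paper metadata extraction method based on third-order conditional random fields", Journal of Chinese Computer Systems (小型微型计算机系统), no. 3, pages 606 - 609 *
Wang Shaokang, Dong Kejun, Yan Baoping: "Web page body text extraction using feature text density", Computer Engineering and Applications (计算机工程与应用), no. 20, pages 1 - 3 *
Shao Zhenkai: "Web page information extraction technology", Computer Technology and Development (计算机技术与发展), vol. 23, no. 9, pages 36 - 38 *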

Also Published As

Publication number Publication date
CN113434797B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN104881458B (en) A kind of mask method and device of Web page subject
Cai et al. Extracting content structure for web pages based on visual representation
CN107229668B (en) Text extraction method based on keyword matching
CN103514183B (en) Information search method and system based on interactive document clustering
CN102663023B (en) Implementation method for extracting web content
CN104598577B (en) A kind of extracting method of Web page text
CN109543126B (en) Webpage text information extraction method based on block character ratio
CN105912684B (en) The cross-media retrieval method of view-based access control model feature and semantic feature
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN111767725A (en) Data processing method and device based on emotion polarity analysis model
CN113961685A (en) Information extraction method and device
CN105550359B (en) Webpage sorting method and device based on vertical search and server
CN106446072A (en) Webpage content processing method and apparatus
CN103530429A (en) Webpage content extracting method
CN109165373B (en) Data processing method and device
CN112084451B (en) Webpage LOGO extraction system and method based on visual blocking
CN112699232A (en) Text label extraction method, device, equipment and storage medium
US20160283582A1 (en) Device and method for detecting similar text, and application
CN106649308B (en) Word segmentation and word library updating method and system
CN117312711A (en) Search engine optimization method and system based on AI analysis
CN101673263B (en) Method for searching video content
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN113434797B (en) Webpage information extraction method and device
CN108595466B (en) Internet information filtering and internet user information and network card structure analysis method
CN109472020A (en) A kind of feature alignment Chinese word cutting method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant after: China Telecom Digital Intelligence Technology Co.,Ltd.

Address before: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant before: CHINA TELECOM GROUP SYSTEM INTEGRATION Co.,Ltd.

GR01 Patent grant