CN113434797A - Webpage information extraction method and device

Info

Publication number: CN113434797A
Application number: CN202110725319.0A
Authority: CN
Inventors: 李成钢; 杨本栋; 李忠; 李金岭; 杜忠田; 王彦君; 夏海轮; 张碧昭; 余清华; 卜理超; 张天正; 李凤文; 袁福碧
Original assignee: China Telecom Group System Integration Co Ltd
Current assignee: China Telecom Digital Intelligence Technology Co Ltd
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-24
Anticipated expiration: 2041-06-29
Also published as: CN113434797B

Abstract

The invention discloses a webpage information extraction method and device, and belongs to the field of information identification. The method comprises the following steps: acquiring webpage data to be identified; blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labeling the webpage visual block to obtain metadata to be extracted; and extracting the metadata to be extracted to obtain target data. The invention provides an article-type webpage information extraction method that does not depend on the webpage design style and makes full use of webpage context information to extract structured webpage data. It solves the prior-art technical problems of low algorithm efficiency and poor node merging caused by changes in webpage design style, as well as little use of context information and insufficient information extraction precision.

Description

Webpage information extraction method and device

Technical Field

The invention belongs to the field of information identification, and particularly relates to a webpage information extraction method and device.

Background

With the continuous development of intelligent technology, people increasingly use smart devices in daily life, work, and study. These technologies have improved people's quality of life and increased the efficiency of their study and work.

Meanwhile, as the internet continues to expand, the amount of information on it grows at an exponential rate, and people rely on electronic devices such as computers and mobile phones to obtain information from web pages. However, web pages contain a large amount of content that serves website or commercial promotion and is irrelevant to the subject of the page, and this mass of data causes information overload and information redundancy. How to efficiently obtain valuable information from the mass of information on the internet is therefore a current research hotspot, and quickly and efficiently extracting article information from web pages is a problem that the prior art urgently needs to solve. At present, one prior-art approach improves the VIPS algorithm by taking currently popular web design styles into account, but it still suffers from low algorithm efficiency and from poor node merging when the web design style changes. The prior art also provides an automatic web information extraction algorithm that combines machine learning with grouping technology. Information extraction automatically identifies and extracts structured information from unstructured documents, so that truly useful information can be obtained quickly and accurately from mass data and information acquisition becomes more efficient. In the field of web page information extraction, researchers have proposed various extraction algorithms for different types of web pages. As web design specifications and web design styles change, some of these algorithms are no longer applicable, and the extraction accuracy of the existing algorithms is not high.

Disclosure of Invention

The invention provides a webpage information extraction method and device. Specifically, it provides an article-type webpage information extraction method that does not depend on the webpage design style and makes full use of webpage context information to extract structured webpage data, thereby solving the prior-art technical problems of low algorithm efficiency and poor node merging caused by changes in webpage design style, as well as little use of context information and insufficient information extraction precision.

One aspect of the present invention provides a method for extracting web page information, including: acquiring webpage data to be identified; blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labeling the webpage visual block to obtain metadata to be extracted; and extracting the metadata to be extracted to obtain target data.

Further, after acquiring the webpage data to be identified, the method further comprises: preprocessing the webpage data to be identified.

Further, before labeling the webpage visual block to obtain the metadata to be extracted, the method further comprises: selecting a webpage main body area according to the webpage visual block.

Further, the extracting the metadata to be extracted to obtain the target data includes: acquiring a random field model; extracting the metadata to be extracted according to the random field model to obtain an extraction result; and outputting the extraction result and generating the target data.

In another aspect of the present invention, an apparatus for extracting web page information is further provided, including: the acquisition module is used for acquiring the webpage data to be identified; the blocking module is used for blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; the marking module is used for marking the webpage visual block to obtain metadata to be extracted; and the extraction module is used for carrying out extraction operation on the metadata to be extracted to obtain target data.

Further, the apparatus further comprises: a preprocessing module, configured to preprocess the webpage data to be identified.

Further, the apparatus further comprises: a selection module, configured to select the main body area of the webpage according to the webpage visual block.

Further, the extraction module comprises: the model unit is used for acquiring a random field model; the extraction unit is used for extracting the metadata to be extracted according to the random field model to obtain an extraction result; and the output unit is used for outputting the extraction result and generating the target data.

In another aspect of the present invention, a non-volatile storage medium is further provided, where the non-volatile storage medium includes a stored program, and the program controls, when running, a device in which the non-volatile storage medium is located to execute a method for extracting web page information.

In another aspect of the present invention, an electronic device is further provided, which includes a processor and a memory; the memory is stored with computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a webpage information extraction method when running.

Compared with the prior art, the invention has the beneficial effects that:

The method of the invention acquires the webpage data to be identified; blocks the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labels the webpage visual block to obtain metadata to be extracted; and extracts the metadata to be extracted to obtain target data. In this way, it provides an article-type webpage information extraction method that does not depend on the webpage design style and makes full use of webpage context information to extract structured webpage data, and it solves the prior-art technical problems of low algorithm efficiency and poor node merging caused by changes in webpage design style, as well as little use of context information and insufficient information extraction precision.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a diagram illustrating a method for extracting information from an article-type web page according to an embodiment of the present invention;

FIG. 2 is a flow chart of a web page information extraction algorithm according to an embodiment of the present invention;

FIG. 3 is a schematic input/output diagram of a web page blocking algorithm according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for extracting web page information according to an embodiment of the present invention;

FIG. 5 is a block diagram of the structure of a web page information extraction apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In accordance with an embodiment of the present invention, a method embodiment of a web page information extraction method is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from the one given here.

Example one

FIG. 4 is a flowchart of a method for extracting web page information according to an embodiment of the present invention. As shown in FIG. 4, the method includes the following steps:

Step S402, acquiring webpage data to be identified.

Specifically, in the embodiment of the present invention, in order to extract the required information by processing the web page and applying model-based identification, the web page content to be identified first needs to be captured by a web page acquisition tool and stored for subsequent processing.

Optionally, after acquiring the webpage data to be identified, the method further includes: preprocessing the webpage data to be identified.

Specifically, for web page preprocessing in the embodiment of the present invention, BeautifulSoup may be used to standardize the web page. Redundant information is then eliminated by deleting unnecessary tags and content in the web page source code, including comments, scripts, styles, and interactive tags, by means of regular expression matching. The regular expressions used are shown in Table 2.

TABLE 2 Web page preprocessing regular expressions

Regular expression	Meaning
(?is)<!--[^>]*-->	Matches HTML comments
(?is)<script.*?>.*?</script>	Matches <script> tags
(?is)<style.*?>.*?</style>	Matches <style> tags
(?is)<\s*link[^>]*>	Matches <link> tags
(?is)<input.*?>	Matches <input> tags
(?is)<select.*?>.*?</select>	Matches <select> tags
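The regular-expression cleanup described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `strip_noise` is hypothetical, and the `*` quantifiers in several patterns are an assumption, because the extracted table appears to have lost them.

```python
import re

# Patterns following Table 2. The `*` quantifiers are assumed where the
# extracted table dropped them; (?is) enables case-insensitive, dot-all matching.
NOISE_PATTERNS = [
    r"(?is)<!--[^>]*-->",             # HTML comments
    r"(?is)<script.*?>.*?</script>",  # <script> blocks
    r"(?is)<style.*?>.*?</style>",    # <style> blocks
    r"(?is)<\s*link[^>]*>",           # <link> tags
    r"(?is)<input.*?>",               # <input> tags
    r"(?is)<select.*?>.*?</select>",  # <select> blocks
]


def strip_noise(html: str) -> str:
    """Remove comments, scripts, styles, and interactive tags from HTML text."""
    for pattern in NOISE_PATTERNS:
        html = re.sub(pattern, "", html)
    return html


sample = (
    "<html><!-- ad --><script type=\"x\">var a = 1;</script>"
    "<style>p{}</style><link rel=\"x\"><p>Hello</p>"
    "<input type=\"text\"><select><option>1</option></select></html>"
)
print(strip_noise(sample))
```

Note that the comment pattern, taken as written in the table, does not match comments whose body contains a `>` character.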

Step S404, blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block.

Specifically, the method for selecting the main area of the webpage comprises the following steps:

wherein Lc is the number of text lines; Td is the text density; Pd is the symbol density; Ld is the hyperlink density; height is the node height; n is the number of texts; t and l are text parameters; padding_top is the top distance between the element content and the element border in the node; padding_bottom is the bottom distance between the element content and the element border in the node; line_height is the amount of space for multi-line elements, that is, the spacing between lines of text; p_i is the number of symbols in node i; l_i is the length of the hyperlink text in each node; nt_i is the number of text words in node i, that is, the sum of the number of Chinese characters and the number of English words; it_i is the number of text words of the hyperlinks in node i.

Selecting a visual block:

a) Traverse all the visual blocks and calculate the text density Td, the symbol density Pd, the hyperlink density Ld, and the text line count Lc of each visual block.

b) Among all unmarked visual blocks, find the one with the maximum text density Td and check whether its text line count Lc is greater than 3. If so, take it as the reference visual block and execute the next step; otherwise, mark the visual block and execute this step again. If the check fails 3 times in a row, the webpage is considered not to be an article webpage.

c) Search upward from the reference visual block for the starting visual block of the main body of the webpage. The title of an article generally has the largest line spacing, so search upward for the visual block with the largest line spacing and a text line count Lc of no more than 2. From that visual block, compute upward the average hyperlink density of 3 adjacent visual blocks using a sliding window, and stop if the average hyperlink density is greater than 20%; take the visual block at which the search stops as the starting visual block, and execute the next step. If no visual block with the largest line spacing and a text line count Lc of no more than 2 is found, execute the next step.

d) From the reference visual block, compute upward, using a sliding window, the average score S and the average symbol density of 3 adjacent visual blocks, where S is calculated by formula (1-5). Stop when the result S is less than 10, and determine the starting visual block using the same average hyperlink density criterion as in step (c).

S = Td*log(1+NodeNum)*log(Pd) (1-5)

In the formula, NodeNum is the number of nodes.

e) The ending visual block is determined in the same way as the starting visual block in step (d).

f) The visual blocks between the starting visual block and the ending visual block are determined to be the main body area of the webpage.
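As a rough illustration of the reference-block search and of formula (1-5), the following Python sketch may help. The `VisualBlock` class and both function names are hypothetical. The natural logarithm is an assumption, since the text does not state the base, and returning 0.0 when Pd is not positive is likewise an assumption made only to keep the logarithm defined.

```python
import math
from dataclasses import dataclass


@dataclass
class VisualBlock:
    td: float      # text density Td
    pd: float      # symbol density Pd
    ld: float      # hyperlink density Ld
    lc: int        # text line count Lc
    node_num: int  # number of nodes in the block (NodeNum)


def block_score(block: VisualBlock) -> float:
    """Score S = Td * log(1 + NodeNum) * log(Pd), following formula (1-5)."""
    if block.pd <= 0:
        return 0.0  # assumption: a block without symbols gets no score
    return block.td * math.log(1 + block.node_num) * math.log(block.pd)


def find_reference_block(blocks, max_failures=3):
    """Return the index of the reference block, or None for a non-article page."""
    marked = set()
    for _ in range(max_failures):
        candidates = [i for i in range(len(blocks)) if i not in marked]
        if not candidates:
            return None
        best = max(candidates, key=lambda i: blocks[i].td)
        if blocks[best].lc > 3:  # enough text lines to be body text
            return best
        marked.add(best)         # mark the block and try the next densest one
    return None


page = [
    VisualBlock(td=0.9, pd=2.0, ld=0.8, lc=1, node_num=4),
    VisualBlock(td=0.8, pd=6.0, ld=0.0, lc=12, node_num=9),
    VisualBlock(td=0.2, pd=1.5, ld=0.1, lc=5, node_num=2),
]
print(find_reference_block(page))
```

Here the densest block has only one text line, so it is marked and skipped, and the second densest block becomes the reference block.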

Step S406, labeling the webpage visual block to obtain metadata to be extracted.

Optionally, before labeling the webpage visual block to obtain the metadata to be extracted, the method further includes: selecting a webpage main body area according to the webpage visual block.

Step S408, extracting the metadata to be extracted to obtain target data.

Optionally, the extracting the metadata to be extracted to obtain the target data includes: acquiring a random field model; extracting the metadata to be extracted according to the random field model to obtain an extraction result; and outputting the extraction result and generating the target data.

Specifically, a flowchart of the article-type web page information extraction algorithm provided by the embodiment of the present invention is shown in FIG. 2. First, the web page is preprocessed: it is standardized and redundant information is removed. The web page is then partitioned according to the visual characteristics of article web pages, and the DOM tree nodes are divided into a number of consistent visual blocks. Next, the main body area of the web page is located using statistical features, which filters out a large amount of noise. Finally, text, visual, and dictionary features are selected as the feature set for feature extraction, sequence labeling is performed with a third-order conditional random field model, and information such as the title, body text, author, source, release time, images, and attachments is extracted. The method can quickly obtain the large amount of information contained in article-type web pages.

Specifically, in the embodiment of the invention, once the web page information to be blocked has been acquired, web page blocking is a node merging process: using visual information, multiple nodes of the same type are merged into one node block, also called a visual block.

It should be noted that, for the concept of a consistent visual block (CVB), the nodes in a consistent visual block should satisfy the following conditions: if the consistent visual block contains multiple nodes, the nodes are adjacent both in the page layout and in the DOM tree; the nodes in the consistent visual block are left-aligned or top-aligned; if the nodes are left-aligned, their widths must be the same; if the nodes are top-aligned, their heights must be the same; all nodes in the consistent visual block share the same font properties, including typeface, size, color, whether the text is bold, and whether it is underlined; and each node that does not contain text is a separate consistent visual block.
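The consistency conditions above can be expressed as a pairwise check, sketched below in Python. The `Node` class, its field names, and the function `is_consistent` are hypothetical, and the sketch assumes the caller only passes nodes that are already adjacent in the page layout and in the DOM tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    left: int
    top: int
    width: int
    height: int
    font: tuple  # (typeface, size, color, bold, underline)
    has_text: bool = True


def is_consistent(a: Node, b: Node) -> bool:
    """Check whether two adjacent nodes may share one consistent visual block."""
    # A node without text always forms its own block.
    if not (a.has_text and b.has_text):
        return False
    # All font properties must match.
    if a.font != b.font:
        return False
    # Left-aligned nodes need equal widths; top-aligned nodes need equal heights.
    left_aligned = a.left == b.left and a.width == b.width
    top_aligned = a.top == b.top and a.height == b.height
    return left_aligned or top_aligned


body = ("serif", 12, "black", False, False)
heading = ("serif", 18, "black", True, False)
p1 = Node(left=0, top=0, width=100, height=20, font=body)
p2 = Node(left=0, top=20, width=100, height=20, font=body)
h1 = Node(left=0, top=40, width=100, height=20, font=heading)
print(is_consistent(p1, p2), is_consistent(p2, h1))
```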

The main process is as follows:

a) Traverse the DOM tree to obtain a List of leaf nodes.

b) Check whether each leaf node in the List is a text node or an image node, and delete it if it is neither.

c) For each leaf node in the List, look up its sibling leaf nodes and check whether the adjacent sibling node and the node meet the criteria of a consistent visual block. If not, the leaf node is an independent consistent visual block. If so, merge the two leaf nodes into one consistent visual block, then continue checking the consistency between the next adjacent sibling node and that consistent visual block, and so on until all sibling leaf nodes have been checked.

d) After a leaf node has been merged with its sibling nodes, if a parent node exists, the merged consistent visual block replaces the parent node and each visual block becomes a new leaf node; after replacement, the nodes in the consistent visual block must keep their original order.

e) Repeat the above steps until all leaf nodes are consistent visual block nodes.
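The sibling-merging pass in step c) can be sketched for a single list of sibling leaves as follows. This is a simplified illustration under stated assumptions: the function `merge_siblings` is hypothetical, the consistency test is passed in as a callable, each new leaf is compared with the last member of the current block rather than with the block as a whole, and the parent replacement of step d) is not modelled.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def merge_siblings(leaves: List[T], consistent: Callable[[T, T], bool]) -> List[List[T]]:
    """Greedily group adjacent sibling leaves into consistent visual blocks."""
    blocks: List[List[T]] = []
    for leaf in leaves:
        if blocks and consistent(blocks[-1][-1], leaf):
            blocks[-1].append(leaf)  # extend the current block, order preserved
        else:
            blocks.append([leaf])    # start a new independent block
    return blocks


# Toy example: leaves are represented only by their font size,
# and two leaves count as consistent when the sizes are equal.
font_sizes = [12, 12, 18, 12]
print(merge_siblings(font_sizes, lambda a, b: a == b))
```

Because only adjacent leaves are compared, the last 12-point leaf stays in its own block even though it matches the first two.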

It should be further noted that, in CRF-based web page metadata extraction, current visual block or node labeling suffers from underuse of context information and insufficient extraction precision, whereas a conditional random field model can incorporate context information into transition features. After the labels and features are determined, a training file is generated. During feature extraction, the features of the current visual block and of the two preceding visual blocks are extracted together: the features at the current position are state features, and the combination of the current position's features with those of the two preceding positions forms the transition features. Determining the feature weights is the process of building the third-order conditional random field model. The features used are shown in Table 3. During training, the L-BFGS algorithm is used together with Elastic Net (L1 + L2) regularization to solve for the parameters of the third-order conditional random field model. L-BFGS converges well and computes quickly, and Elastic Net adds an L1 regularization term and an L2 regularization term to the loss function to address overfitting during model training. In the prediction (labeling) process, Viterbi decoding is used to obtain the optimal label sequence.

TABLE 3 Features
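To make the use of the two preceding visual blocks concrete, the following Python sketch builds one feature dictionary per block position. The function name `sequence_features`, the key prefixes `0:`, `-1:`, `-2:`, the `BOS` marker, and the feature `font_size` are all hypothetical; the actual feature set is the one listed in the feature table, which is not reproduced here.

```python
def sequence_features(blocks):
    """Combine each block's features with those of the two preceding blocks.

    `blocks` is a list of dicts holding already-computed per-block features.
    Keys prefixed with "0:" are the features of the current position; keys
    prefixed with "-1:" and "-2:" carry the context of the previous positions.
    """
    rows = []
    for i, block in enumerate(blocks):
        row = {f"0:{name}": value for name, value in block.items()}
        for offset in (1, 2):
            j = i - offset
            if j < 0:
                row[f"-{offset}:BOS"] = True  # no such block: beginning of sequence
            else:
                for name, value in blocks[j].items():
                    row[f"-{offset}:{name}"] = value
        rows.append(row)
    return rows


page_blocks = [{"font_size": 24}, {"font_size": 12}, {"font_size": 12}]
for row in sequence_features(page_blocks):
    print(row)
```

Dictionaries of this shape resemble the per-position input that sequence-labeling toolkits commonly accept, although the exact format depends on the toolkit.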

Through this embodiment, two groups of prior-art technical problems are solved. First, although the prior art improves the VIPS algorithm with respect to web design style, the algorithm remains inefficient and node merging is poor when the web design style changes. Second, the prior art generates multiple features from DOM tree node attributes to train a machine learning model, selects candidate nodes according to the model, filters out noise among the candidate nodes, and recovers data missing from the candidate nodes, but it makes little use of context information and its information extraction precision is not high.

Example two

FIG. 5 is a block diagram of the structure of a web page information extraction apparatus according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes:

An obtaining module 50, configured to obtain the webpage data to be identified.

Optionally, the apparatus further comprises: a preprocessing module, configured to preprocess the webpage data to be identified.

Specifically, for web page preprocessing in the embodiment of the present invention, BeautifulSoup may be used to standardize the web page. Redundant information is then eliminated by deleting unnecessary tags and content in the web page source code, including comments, scripts, styles, and interactive tags, by means of regular expression matching. The regular expressions used are shown in Table 2.

A blocking module 52, configured to block the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block.

Optionally, the apparatus further comprises: a selection module, configured to select the main body area of the webpage according to the webpage visual block.

Specifically, the positioning of the main area of the web page in the embodiment of the present invention may be:

Selecting a visual block:

a) Traverse all the visual blocks and calculate the text density Td, the symbol density Pd, the hyperlink density Ld, and the text line count Lc of each visual block.

b) Among all unmarked visual blocks, find the one with the maximum text density Td and check whether its text line count Lc is greater than 3. If so, take it as the reference visual block and execute the next step; otherwise, mark the visual block and execute this step again. If the check fails 3 times in a row, the webpage is considered not to be an article webpage.

c) Search upward from the reference visual block for the starting visual block of the main body of the webpage. The title of an article generally has the largest line spacing, so search upward for the visual block with the largest line spacing and a text line count Lc of no more than 2. From that visual block, compute upward the average hyperlink density of 3 adjacent visual blocks using a sliding window, and stop if the average hyperlink density is greater than 20%; take the visual block at which the search stops as the starting visual block, and execute the next step. If no visual block with the largest line spacing and a text line count Lc of no more than 2 is found, execute the next step.

d) From the reference visual block, compute upward, using a sliding window, the average score S and the average symbol density of 3 adjacent visual blocks, where S is calculated by formula (1-5). Stop when the result S is less than 10, and determine the starting visual block using the same average hyperlink density criterion as in step (c).

S = Td*log(1+NodeNum)*log(Pd) (1-5)

In the formula, NodeNum is the number of nodes.

e) The ending visual block is determined in the same way as the starting visual block in step (d).

f) The visual blocks between the starting visual block and the ending visual block are determined to be the main body area of the webpage.
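The upward sliding-window test on hyperlink density used in step c) could look like the following Python sketch. The function name `find_start_block` and its parameters are hypothetical. Two conventions are assumptions, because the text leaves them open: the window at position i covers block i and the two blocks above it, and the index returned is the position at which the threshold was first exceeded, with the top of the page as the fallback.

```python
def find_start_block(ld, from_idx, window=3, threshold=0.20):
    """Slide a window upward over hyperlink densities to find the start block.

    `ld` lists the hyperlink density Ld of each visual block from top to
    bottom; `from_idx` is the block where the upward search begins.
    """
    for i in range(from_idx, -1, -1):
        top = max(0, i - window + 1)
        values = ld[top:i + 1]             # block i plus up to two blocks above it
        if sum(values) / len(values) > threshold:
            return i                       # assumption: stop position is the start block
    return 0                               # assumption: fall back to the top of the page


# Navigation-heavy blocks at the top, article text below.
densities = [0.9, 0.8, 0.7, 0.1, 0.0, 0.0]
print(find_start_block(densities, from_idx=5))
```

In this example the window first exceeds 20% when it reaches the lowest navigation block, so the search stops at index 4.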

A labeling module 54, configured to label the webpage visual block to obtain metadata to be extracted.

An extracting module 56, configured to perform an extraction operation on the metadata to be extracted to obtain target data.

Optionally, the extracting module includes: the model unit is used for acquiring a random field model; the extraction unit is used for extracting the metadata to be extracted according to the random field model to obtain an extraction result; and the output unit is used for outputting the extraction result and generating the target data.

It should be noted that, for the concept of a consistent visual block (CVB), the nodes in a consistent visual block should satisfy the following conditions: if the consistent visual block contains multiple nodes, the nodes are adjacent both in the page layout and in the DOM tree; the nodes in the consistent visual block are left-aligned or top-aligned; if the nodes are left-aligned, their widths must be the same; if the nodes are top-aligned, their heights must be the same; all nodes in the consistent visual block share the same font properties, including typeface, size, color, whether the text is bold, and whether it is underlined; and each node that does not contain text is a separate consistent visual block. The main process is as follows:

a) Traverse the DOM tree to obtain a List of leaf nodes.

It should be further noted that, in CRF-based web page metadata extraction, current visual block or node labeling suffers from underuse of context information and insufficient extraction precision, whereas a conditional random field model can incorporate context information into transition features. After the labels and features are determined, a training file is generated. During feature extraction, the features of the current visual block and of the two preceding visual blocks are extracted together: the features at the current position are state features, and the combination of the current position's features with those of the two preceding positions forms the transition features. Determining the feature weights is the process of building the third-order conditional random field model. The features used are shown in Table 3. During training, the L-BFGS algorithm is used together with Elastic Net (L1 + L2) regularization to solve for the parameters of the third-order conditional random field model. L-BFGS converges well and computes quickly, and Elastic Net adds an L1 regularization term and an L2 regularization term to the loss function to address overfitting during model training. In the prediction (labeling) process, Viterbi decoding is used to obtain the optimal label sequence.

TABLE 3 Features
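Viterbi decoding, mentioned above for the prediction process, can be illustrated with the short Python sketch below. It is deliberately a first-order simplification and not the third-order model of the embodiment: each label's score depends only on the previous label. The function name, the score dictionaries, and the example labels are hypothetical, scores are treated as additive (log-domain) values, and a non-empty sequence is assumed.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence (first-order sketch).

    emissions:   list of dicts, one per position, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    Missing entries count as 0.0.
    """
    best = [{lab: emissions[0].get(lab, 0.0) for lab in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            # Best previous label for reaching `cur` at position t.
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (best[-1][prev] + transitions.get((prev, cur), 0.0)
                           + emissions[t].get(cur, 0.0))
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["title", "text"]
emissions = [
    {"title": 2.0, "text": 0.0},
    {"title": 0.5, "text": 1.0},
    {"title": 0.0, "text": 1.0},
]
transitions = {("title", "title"): -5.0}  # two titles in a row are penalised
print(viterbi(emissions, transitions, labels))
```

A higher-order model would keep the same dynamic-programming structure but track tuples of preceding labels as the state.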

Example three

In another embodiment of the present invention, a non-volatile storage medium is provided, where the non-volatile storage medium includes a stored program, and the program controls a device in which the non-volatile storage medium is located to execute a method for extracting web page information when running.

Specifically, the method comprises the following steps: acquiring webpage data to be identified; blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block; labeling the webpage visual block to obtain metadata to be extracted; and extracting the metadata to be extracted to obtain target data.

Example four

In another embodiment of the present invention, an electronic device is provided, which includes a processor and a memory; the memory is stored with computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a webpage information extraction method when running.

Effects of the embodiment

In the embodiment of the present invention, verification tests were carried out on a test set of 80 article-type web pages from domestic and foreign websites such as Sina News, the Ministry of Education of China, and the UK Department for Education. The test results are shown in Table 1, and they indicate that the information extraction algorithm provided by the invention achieves a good extraction effect.

TABLE 1 Information extraction results for test set a

Label	Total	Identified	Correct	P	R	F1
Title	80	80	80	1.000	1.000	1.000
Text	473	494	466	0.943	0.985	0.963
Author	50	40	40	0.800	1.000	0.889
Release time	80	91	62	0.775	0.681	0.720
Source	50	52	46	0.920	0.885	0.902
Picture	12	14	12	1.000	0.857	0.923
Attachment	22	21	21	0.955	1.000	0.978
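For readers unfamiliar with the F1 column, the sketch below shows the conventional definition of F1 as the harmonic mean of precision P and recall R. The function name `f1_score` is hypothetical, and the text does not state how its own figures were rounded, so the example only checks the Author row, whose reported P and R reproduce the reported F1.

```python
def f1_score(p: float, r: float) -> float:
    """F1 is the harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Author row of the table: P = 0.800, R = 1.000
print(round(f1_score(0.800, 1.000), 3))
```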

The algorithm provided by the invention is compared with existing algorithms that have good extraction results, as shown in FIG. 1, which is a comparison diagram of the article-type webpage information extraction methods according to the embodiment of the invention. news extractor is the extraction method of document 1, and its extracted content includes the title, body text, release time, and source. newsboper is the extraction method used in document 2, and its extracted content includes the title, author, release time, keywords, and body text. blockCRF is the method proposed herein, and its extracted content includes the title, author, release time, body text, source, pictures, attachments, and so on. In this embodiment, the data items extracted in common by the 3 algorithms are compared, and in terms of the overall metrics the algorithm provided by this embodiment achieves the best extraction effect.

Document 1: Extraction algorithm for key information of news web pages [J]. 2016, 36(08): 2082-2086+2120.

Document 2: Sarr E N, Ousane S, Diallo A. FactExtract: automatic collection and aggregation of images and statistical effects from online news paper [C]//2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS). IEEE, 2018: 336-.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units may be a logical division, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A webpage information extraction method, characterized by comprising the following steps:

acquiring webpage data to be identified;

blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block;

labeling the webpage visual block to obtain metadata to be extracted;

and extracting the metadata to be extracted to obtain target data.

2. The method of claim 1, wherein after the acquiring of the webpage data to be identified, the method further comprises:

and preprocessing the webpage data to be identified.

3. The method of claim 1, wherein before the labeling of the webpage visual block to obtain the metadata to be extracted, the method further comprises:

and selecting a webpage main body area according to the webpage visual block.

4. The method according to claim 1, wherein the extracting the metadata to be extracted to obtain target data comprises:

acquiring a random field model;

extracting the metadata to be extracted according to the random field model to obtain an extraction result;

and outputting the extraction result and generating the target data.

5. A web page information extraction apparatus, comprising:

the acquisition module is used for acquiring the webpage data to be identified;

the blocking module is used for blocking the webpage data to be identified according to a visual information algorithm to obtain a webpage visual block;

the marking module is used for marking the webpage visual block to obtain metadata to be extracted;

and the extraction module is used for carrying out extraction operation on the metadata to be extracted to obtain target data.

6. The apparatus of claim 5, further comprising:

and the preprocessing module is used for preprocessing the webpage data to be identified.

7. The apparatus of claim 5, further comprising:

and the selection module is used for selecting the main body area of the webpage according to the webpage visual block.

8. The apparatus of claim 5, wherein the extraction module comprises:

the model unit is used for acquiring a random field model;

the extraction unit is used for extracting the metadata to be extracted according to the random field model to obtain an extraction result;

and the output unit is used for outputting the extraction result and generating the target data.

9. A non-volatile storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the method of any one of claims 1 to 4.

10. An electronic device comprising a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions when executed perform the method of any one of claims 1 to 4.