CN112559929B

CN112559929B - Method, electronic device and medium for extracting webpage target information

Info

Publication number: CN112559929B
Application number: CN202110207419.4A
Authority: CN
Inventors: 张景龙; 王殿胜; 张乃钊; 薄满辉; 翟性国; 唐红武; 卞磊; 刘宇; 姚远
Original assignee: China Travelsky Mobile Technology Co Ltd
Current assignee: China Travelsky Mobile Technology Co Ltd
Priority date: 2021-02-25
Filing date: 2021-02-25
Publication date: 2021-05-07
Anticipated expiration: 2041-02-25
Also published as: CN112559929A

Abstract

The invention relates to a method, an electronic device and a medium for extracting webpage target information. The method comprises: step S1, acquiring the HTML code of a webpage to be extracted and constructing a tree structure; step S2, traversing the tree structure, obtaining title node text data, and obtaining the characteristic information of each content node; step S3, grouping all content nodes based on their path information; step S4, determining a target group from the groups according to the title node text data and the characteristic information of the content nodes in each group; step S5, taking the content nodes of the target group as the node to be analyzed and determining whether the node to be analyzed includes the target information; if so, the target information is obtained from the node to be analyzed; otherwise, the parent node of the node to be analyzed, together with the group nodes connected to that parent node, is promoted to be the node to be analyzed, until the target information is obtained. The invention improves the accuracy and efficiency of extracting the target information of the webpage.

Description

Method, electronic device and medium for extracting webpage target information

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an electronic device, and a medium for extracting target information of a web page.

Background

A large amount of webpage data is generated on the internet every day. When analyzing this data, target information such as titles, body text and dates needs to be extracted. Most webpage text is presented in HTML; where the text has been collected by a web crawler, part of it is presented in a serialized (JSON) structure. The existing body-text extraction method analyzes the text density of each page block and treats the block with the highest text density as the body text. Its recognition rate is low, however: the extracted result is often mixed with a large amount of useless content or is missing part of the body text. For example, some media platforms support a style editor, which makes the page structure more complicated, and noise such as recommended links and promotional content lowers the text density, so extraction errors occur easily and the accuracy of the extracted information is low. In addition, the existing method traverses the entire webpage source code to extract the target information, so extraction efficiency is low. How to improve the accuracy and efficiency of extracting webpage target information has therefore become a technical problem to be solved urgently.

Disclosure of Invention

The invention aims to provide a method, electronic equipment and medium for extracting webpage target information, and the accuracy and efficiency of extracting the webpage target information are improved.

According to a first aspect of the present invention, there is provided a method for extracting target information of a web page, including:

step S1, acquiring the HTML code of the webpage to be extracted, and constructing a corresponding tree structure based on the HTML code;

step S2, traversing the tree structure, obtaining title node text data according to the title information of the head part of the tree structure, and obtaining the characteristic information of each content node from the tree structure, wherein the content node characteristic information comprises path information, content node text data and text density, and the content nodes are other nodes except the title nodes in the tree structure;

step S3, grouping all content nodes based on the path information of all content nodes;

step S4, determining a target grouping from the grouping according to the title node text data and the characteristic information of the content node in each grouping;

step S5, taking the content nodes of the target group as the node to be analyzed, and determining whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, promoting the parent node of the node to be analyzed, together with the group nodes connected to that parent node, to be the node to be analyzed, until the target information is acquired.

According to a second aspect of the present invention, there is provided an electronic apparatus comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being arranged to perform the method of the first aspect of the invention.

According to a third aspect of the invention, there is provided a computer-readable storage medium storing computer instructions for performing the method of the first aspect of the invention.

Compared with the prior art, the invention has clear advantages and beneficial effects. With the above technical solution, the method, electronic device and medium for extracting webpage target information provided by the invention achieve considerable technical progress and practicability, have broad industrial applicability, and have at least the following advantage:

The invention constructs a tree structure based on the HTML code of the webpage to be extracted, groups the content nodes of the tree structure, determines the optimal group from the groups, and acquires the target information based on the optimal group, thereby improving the accuracy and efficiency of extracting the target information of the webpage.

The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

Drawings

Fig. 1 is a flowchart of a method for extracting target information of a web page according to an embodiment of the present invention.

Detailed Description

To further illustrate the technical means adopted by the invention to achieve its intended objects, and the effects thereof, specific embodiments of the method, electronic device and medium for extracting webpage target information according to the invention are described in detail below with reference to the accompanying drawings and preferred embodiments.

The embodiment of the invention provides a method for extracting webpage target information, which comprises the following steps of:

step S1, acquiring hypertext markup language (HTML) codes of the webpage to be extracted, and constructing a corresponding tree structure based on the HTML codes;

The title node corresponds to the head part of the tree structure, and the content nodes correspond to the body part of the tree structure.
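The split between the head part (title node) and the body part (content nodes) can be illustrated with a minimal sketch. It uses only Python's standard `html.parser` rather than any particular library, and the names `FeatureCollector` and `extract_features` are hypothetical. The sketch also assumes that text density is simply the character count of a node's stripped text, which is one possible reading of "valid character statistic".

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class FeatureCollector(HTMLParser):
    """Collects the head <title> text and the text of each body element by path."""

    def __init__(self):
        super().__init__()
        self.stack = []        # frames of (tag, 1-based index, child tag counts)
        self.title_text = ""
        self.texts = {}        # indexed path -> accumulated text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Count same-tag siblings under the current parent to get the index.
        counts = self.stack[-1][2] if self.stack else {}
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append((tag, counts[tag], {}))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tags = [frame[0] for frame in self.stack]
        if tags[-1] in ("script", "style"):
            return  # code is not counted as text
        if tags[-1] == "title" and "head" in tags:
            self.title_text += text
        elif "body" in tags:
            path = "/" + "/".join(f"{t}[{i}]" for t, i, _ in self.stack)
            self.texts[path] = self.texts.get(path, "") + text


def extract_features(html):
    """Return (title text, list of content-node feature dicts)."""
    collector = FeatureCollector()
    collector.feed(html)
    collector.close()
    nodes = [
        {"path": path, "text": text, "density": len(text)}  # assumed density
        for path, text in collector.texts.items()
    ]
    return collector.title_text, nodes
```

Each returned dictionary carries the three features named in step S2 (path information, content node text data and text density); a production system would more likely build the tree with a dedicated HTML library.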

The target group is the predicted group, i.e. the best group, that is most likely to contain the target information.

Specifically, the iterchildren() method of the lxml library may be used to perform the promotion operation on the child nodes. The target information may specifically include information such as a title, body text, a date, the number of likes, the number of followers, the number of comments, and the like.

According to the embodiment of the invention, the tree structure is constructed based on the HTML code of the webpage to be extracted, the content nodes of the tree structure are grouped, the optimal group is determined from the groups, the target information is obtained based on the optimal group, and the accuracy and the efficiency of extracting the target information of the webpage are improved.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.

Since path information is usually long, obtaining and storing it directly occupies a large amount of memory. As an embodiment, when the path information of each content node is obtained from the tree structure, step S2 therefore further includes: step S21, performing compression encoding on the path information of each content node. Specifically, md5 may be used for the compression encoding. Compression-encoding the path information makes it possible to adjust the grouping granularity, reduce the length of the group path, and save memory.
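As a minimal sketch of step S21, the following uses `hashlib.md5` from the Python standard library. The function name `encode_path` is hypothetical. Note that MD5 is a hash rather than a reversible compression: every path, however long, maps to a fixed 32-character hexadecimal key, which is sufficient when the key is only used to decide group membership.

```python
import hashlib


def encode_path(path: str) -> str:
    """Map a (possibly very long) path string to a fixed-length 32-char key."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    long_path = '//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[1]'
    print(len(long_path), "->", len(encode_path(long_path)))
```

Identical paths always yield the same key, so grouping on the key gives the same groups as grouping on the full path string.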

As an embodiment, the path information is xpath path information of main content in a web page, and the step S3 includes:

step S31, performing fuzzy processing on the subscript information of the path information of each content node;

It will be understood that blurring the subscripts means either replacing every subscript with the same preset character or deleting the subscripts.

step S32, dividing the content nodes whose path information is identical after the blurring processing into the same group.

The following is illustrated with a specific example:

The xpath path information corresponding to the first content node is:

"//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[1]";

Blurring the xpath path information corresponding to the first content node gives:

"//*[@id="root"]/div/div$/div$/div$/div$/div/div$/p$".

The xpath path information corresponding to the second content node is:

"//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[2]";

Blurring the xpath path information corresponding to the second content node gives:

"//*[@id="root"]/div/div$/div$/div$/div$/div/div$/p$".

The path information of the first content node and that of the second content node are therefore identical after the subscripts are blurred, so the first content node and the second content node belong to the same group.
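The example above can be reproduced with a short sketch of steps S31 and S32. It assumes that a "subscript" is a numeric positional predicate such as `[3]`, and the helper names `blur_path` and `group_by_blurred_path` are hypothetical. Attribute predicates such as `[@id="root"]` are left untouched because they contain no bare number.

```python
import re
from collections import defaultdict

# Matches numeric positional subscripts such as [1] or [12], not [@id="root"].
INDEX_PATTERN = re.compile(r"\[\d+\]")


def blur_path(path, placeholder="$"):
    """Step S31: replace every numeric subscript with one preset character."""
    return INDEX_PATTERN.sub(placeholder, path)


def group_by_blurred_path(paths):
    """Step S32: nodes whose blurred paths are identical share a group."""
    groups = defaultdict(list)
    for path in paths:
        groups[blur_path(path)].append(path)
    return dict(groups)


if __name__ == "__main__":
    first = '//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[1]'
    second = '//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[2]'
    for key, members in group_by_blurred_path([first, second]).items():
        print(key, len(members))
```

Passing an empty string as `placeholder` gives the "deleting" variant of the blurring.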

The text density refers to the length of the text and is a statistic of valid characters; specifically, code may be excluded and only text exceeding a certain number of characters may be counted. It should be noted that an element with high text density is not necessarily body text: elements such as the source, time and author may be extracted as body text by mistake. Conversely, an element with low text density is not necessarily outside the body text: on a message-board or forum webpage, for example, a user may share only a single sentence or a link, and such content lowers the text density. In this embodiment of the invention, the extraction of the target information is therefore handled using multi-dimensional feature indicators. Specifically, as an embodiment, step S4 includes:

step S41, acquiring the text density corresponding to each group according to the text data of the content nodes in the group, and sorting the text densities in descending order as P1, P2, …, PN, wherein N represents the total number of groups;

wherein the text density is a statistic of valid characters;

step S42, acquiring the first n text densities P1, P2, …, Pn, wherein n is a preset positive integer greater than or equal to 2, and n is smaller than N;

step S43, acquiring the average difference of the numerical values of P1, P2, …, Pn, comparing the average difference with a preset average difference threshold, and if the average difference is greater than or equal to the average difference threshold, determining the group corresponding to P1 as the target group.

For example, n may be 3, in which case the average difference of P1, P2 and P3 is obtained and compared with the average difference threshold.
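Steps S41 to S43 can be sketched as follows. The text does not define "average difference" precisely; this sketch assumes it means the mean absolute deviation of the top n densities from their own mean, which is only one plausible reading. The function names and the threshold value are hypothetical.

```python
def mean_deviation(values):
    """Mean absolute deviation from the mean (an assumed interpretation)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def p1_is_clear_winner(densities, n=3, threshold=50.0):
    """Steps S41-S43: sort descending, take the top n, compare with threshold.

    Returns True when the densest group stands out enough to be chosen
    directly as the target group.
    """
    ranked = sorted(densities, reverse=True)   # P1 >= P2 >= ... >= PN
    top = ranked[:n]                           # P1 ... Pn
    return mean_deviation(top) >= threshold


if __name__ == "__main__":
    print(p1_is_clear_winner([100, 900, 80, 20]))   # one dominant group
    print(p1_is_clear_winner([100, 98, 96, 10]))    # top groups are close
```

When the function returns False, the top groups are too close to separate by density alone, which is the case the title-similarity steps address.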

Further, if the average difference is smaller than the average difference threshold, step S4 further includes:

step S44, judging whether the title node text is empty, if so, directly determining the grouping corresponding to the P1 as a target grouping, otherwise, executing step S45, wherein the fact that the title node text is empty indicates that the title node cannot be determined;

It should be noted that some webpages have no definite title node, in which case the text density can be used directly to select the target group.

step S45, acquiring the similarity Qx between the text data of the x-th group and the text data of the title node, wherein the text density corresponding to the x-th group is Px, and x ranges from 1 to n or from 1 to N;

step S46, obtaining a first reference value Yx = Px × Qx corresponding to the x-th group, and determining the group with the largest first reference value as the target group.
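A minimal sketch of step S46 follows, assuming the first reference value is the product of density and similarity. The function name `pick_target_group` and the sample figures are hypothetical.

```python
def pick_target_group(candidates):
    """Return the group key with the largest Yx = Px * Qx.

    `candidates` maps a group key to a (density Px, similarity Qx) pair.
    """
    return max(candidates, key=lambda key: candidates[key][0] * candidates[key][1])


if __name__ == "__main__":
    candidates = {
        "sidebar": (500, 0.1),   # dense but unrelated to the title
        "article": (300, 0.6),   # less dense but close to the title
    }
    print(pick_target_group(candidates))
```

Multiplying the two features lets a slightly less dense group win when its text is much closer to the title.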

As an embodiment, the step S45 may specifically include:

step S451, carrying out similarity calculation on the text data of the x-th group and the text data of the title nodes to obtain an initial similarity value a;

Specifically, the initial similarity value may be calculated from the text of the title node and the text of each node in the group using difflib. Alternatively, the grouped data may be scanned and the distance from the text data in the group to the text data of the title node calculated using the Euclidean distance as the similarity measure. The Euclidean distance is a commonly used definition of distance: the true distance between two points in a multi-dimensional space, or the natural length of a vector.

step S452, performing word segmentation on the text data of each content node of the x-th group, traversing the content node texts and the title node text in a nested (double) loop, and calculating the hit ratio b at which the content node text data hits the title text data;

step S453, determining the similarity Qx between the text data of the x-th group and the text data of the title node based on the initial similarity value a, the hit ratio b, and a preset first weight k: Qx = a + k × b.

The value of the first weight is in positive correlation with the influence of the hit ratio on the grouping, and the higher the first weight is set, the greater the influence of the hit ratio on the grouping result is.
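Steps S451 to S453 can be sketched with `difflib.SequenceMatcher` from the Python standard library. Several details are assumptions made for illustration: whitespace splitting stands in for word segmentation (Chinese text would need a real segmenter), the hit ratio is taken to be the fraction of content tokens that also occur in the title, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def initial_similarity(group_text, title_text):
    """Step S451: initial similarity value a, in the range [0, 1]."""
    return SequenceMatcher(None, group_text, title_text).ratio()


def hit_ratio(node_texts, title_text):
    """Step S452: share of content tokens that hit a title token (assumed)."""
    title_tokens = title_text.split()
    content_tokens = [tok for text in node_texts for tok in text.split()]
    if not content_tokens:
        return 0.0
    hits = 0
    for token in content_tokens:        # outer loop over content tokens
        for title_token in title_tokens:  # inner loop over title tokens
            if token == title_token:
                hits += 1
                break
    return hits / len(content_tokens)


def similarity(node_texts, title_text, k=1.0):
    """Step S453: Qx = a + k * b."""
    a = initial_similarity(" ".join(node_texts), title_text)
    b = hit_ratio(node_texts, title_text)
    return a + k * b


if __name__ == "__main__":
    print(similarity(["flight delay notice", "weather"], "flight delay", k=0.5))
```

Raising `k` makes groups that reuse title words rank higher, matching the positive correlation between the first weight and the influence of the hit ratio.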

In some embodiments, the links on a webpage may be noise information such as recommended article links or advertisements, so a filtering operation based on the number of links may be needed to reduce the amount of computation. Specifically, the content node characteristic information further includes the number of links included in the node, and in step S4, before step S41 is performed, the method may further include:

step S40, traversing the nodes of each group, acquiring the number of links of each group, comparing it with a preset link number threshold, and filtering out the group if the threshold is exceeded, thereby filtering noise data out of the webpage.
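A minimal sketch of step S40 follows, assuming each content node is represented as a dictionary with a `links` count. That representation, the function name and the threshold value are all hypothetical.

```python
def filter_link_heavy_groups(groups, max_links):
    """Step S40: drop every group whose total link count exceeds max_links."""
    kept = {}
    for key, nodes in groups.items():
        total_links = sum(node["links"] for node in nodes)
        if total_links <= max_links:
            kept[key] = nodes
    return kept


if __name__ == "__main__":
    groups = {
        "article": [{"links": 0}, {"links": 1}],
        "recommended": [{"links": 6}, {"links": 7}],
    }
    print(list(filter_link_heavy_groups(groups, max_links=5)))
```

Running the filter before step S41 means link-heavy groups never enter the density ranking.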

As an example, step S5 may specifically include:

step S51, taking the content nodes of the target group as the node to be analyzed, and judging whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, executing step S52;

step S52, taking the parent node to which the target group is connected as a first parent node, adding the first parent node and each group node connected to the first parent node to the node to be analyzed, and judging whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, executing step S53;

step S53, taking the parent node of the first parent node as a second parent node, adding the second parent node and each group node connected to the second parent node to the node to be analyzed, and judging whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, executing step S54;

… ("…" means performing according to the rules described above)

step S5(m+2), taking the parent node of the m-th parent node as the (m+1)-th parent node, the (m+1)-th parent node being the common parent node of the target group nodes and the title node, adding the (m+1)-th parent node and each group node connected to the (m+1)-th parent node to the node to be analyzed, and judging whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, ending the process.
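The step-by-step promotion can be sketched as a loop over ancestors. The `Node` class, the `contains_target` callback and the `stop_at` parameter (standing in for the common parent of the target group and the title node) are hypothetical constructs for illustration, not structures named in the text.

```python
class Node:
    """A bare-bones tree node with a parent link and a child list."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def promote_until_found(group_nodes, contains_target, stop_at=None):
    """Widen the set of nodes to analyze one ancestor level at a time.

    Returns the analyzed nodes once contains_target(nodes) is true, or None
    when the ancestors (up to stop_at, if given) are exhausted.
    """
    to_analyze = list(group_nodes)
    parent = group_nodes[0].parent
    while not contains_target(to_analyze):
        if parent is None:
            return None                      # no ancestor left to promote
        for node in (parent, *parent.children):
            if node not in to_analyze:
                to_analyze.append(node)
        if parent is stop_at:                # last allowed level: check, then end
            return to_analyze if contains_target(to_analyze) else None
        parent = parent.parent
    return to_analyze


if __name__ == "__main__":
    root = Node("root")
    article = Node("article", root)
    Node("date", article)
    body = Node("body", article)
    paragraphs = [Node("p1", body), Node("p2", body)]
    found = promote_until_found(
        paragraphs, lambda nodes: any(n.name == "date" for n in nodes)
    )
    print([n.name for n in found])
```

Because the search starts from the target group and only climbs as far as needed, it avoids traversing the whole page source.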

In step S5, the determining whether the node to be analyzed includes the target information may specifically include:

step S50, acquiring the number of text nodes, the number of subtitles and the number of dates from the node to be analyzed, and judging whether the number of text nodes, the number of subtitles and the number of dates are equal; if so, determining that the node to be analyzed includes the target information.
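The check in step S50 reduces to a three-way equality test, sketched below. The function name is hypothetical, and how the three counts are obtained from the nodes is not shown.

```python
def contains_target_info(text_count, subtitle_count, date_count):
    """Step S50: the node qualifies when all three counts are equal."""
    return text_count == subtitle_count == date_count


if __name__ == "__main__":
    print(contains_target_info(3, 3, 3))   # every text has a subtitle and a date
    print(contains_target_info(3, 1, 3))   # subtitles are missing
```

The sketch follows the wording literally, so three zero counts also compare as equal; a practical implementation might want to treat that case separately.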

It should be noted that information such as the title, date and body text is retrieved for each content node of the target group, and webpages are classified according to whether the title, date and body text are present; the title node can also be used as a feature. The node where the target information is located can then be determined based on this classification, as described below with several specific examples:

As an example, if the title node text is not empty, indicating that the webpage source code contains a definite title node, and the number of text nodes retrieved from the target group does not correspond to the numbers of subtitles and dates, the webpage is determined to be of the article type, and the main information needs to be obtained based on the target group and the promoted node.

As an example, if the title node text in the webpage source code is empty, so that the title node is uncertain, but the number of text nodes in the target group corresponds to the numbers of subtitles and dates, the webpage is determined from the classification result to be of the instant messaging type, and its main information is obtained directly from the target group.

As an example, if the webpage source code has a definite title node and the number of text nodes corresponds to the number of dates, but there is no corresponding subtitle in the target group, the webpage can be determined from the classification result to be of a type with social attributes, and the target information can likewise be obtained directly from the target group.

As an example, if the webpage source code has a definite title node, the promoted element contains a plurality of hyperlinks whose number corresponds to the number of element tags in the content node, the dispersion of the text sizes in the content node is lower than a preset dispersion threshold, and the dates do not correspond, the webpage can be determined to be of the article-list or navigation type, and the target information needs to be obtained based on the target group and the promoted node.

An embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions configured to perform a method according to an embodiment of the invention.

An embodiment of the invention also provides a computer-readable storage medium storing computer instructions for executing the method of the embodiment of the invention.

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for extracting target information of a webpage is characterized by comprising the following steps:

the step S4 includes:

step S43, obtaining the numerical mean deviation of P1, P2 and … Pn, comparing the numerical mean deviation with a preset mean deviation threshold value, and if the numerical mean deviation is larger than or equal to the mean deviation threshold value, determining the grouping corresponding to P1 as a target grouping;

2. The method of claim 1,

the path information is xpath path information of main content in a web page, and the step S3 includes:

3. The method of claim 1,

if the numerical mean deviation is smaller than the mean deviation threshold, the step S4 further includes:

4. The method of claim 3,

the step S45 includes:

5. The method of claim 1,

the content node characteristic information further includes the number of links included in the node, and before the step S41 is executed in the step S4, the method further includes:

step S40, traversing the nodes of each grouping, obtaining the link number of each grouping, comparing the link number with a preset link number threshold, and filtering out the grouping if the link number exceeds the link number threshold.

6. The method of claim 1,

step S5 includes:

step S53, taking the father node of the first father node as a second father node, adding the second father node and each grouped node connected with the second father node into the node to be analyzed, judging whether the node to be analyzed comprises target information, if so, acquiring the target information from the node to be analyzed, otherwise, executing step S54;

…

step S5(m+2), taking the father node of the m-th father node as the (m+1)-th father node, wherein the (m+1)-th father node is a father node common to the target grouping node and the title node, adding the (m+1)-th father node and each grouping node connected with the (m+1)-th father node into the node to be analyzed, judging whether the node to be analyzed comprises target information, if so, acquiring the target information from the node to be analyzed, otherwise, ending the process.

7. The method of claim 6,

in step S5, the determining whether the node to be analyzed includes the target information includes:

8. An electronic device, comprising:

at least one processor;

and a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the instructions being arranged to perform the method of any of the preceding claims 1-7.

9. A computer-readable storage medium having stored thereon computer-executable instructions for performing the method of any of the preceding claims 1-7.