CN112287273B - Method, system and storage medium for classifying website list pages

Info

Publication number: CN112287273B
Application number: CN202011161426.7A
Authority: CN
Inventors: 孟剑; 郭岩; 贺广福; 陈银鹏; 史存会; 俞晓明; 刘悦; 程学旗
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2020-10-27
Filing date: 2020-10-27
Publication date: 2022-09-30
Anticipated expiration: 2040-10-27
Also published as: CN112287273A

Abstract

The invention relates to a method for classifying website list pages, wherein the website is based on Hypertext Markup Language (HTML), and the method comprises the following steps: step 100, acquiring a set of website pages, wherein the pages belong to the same website; step 200, extracting Document Object Model (DOM) tree structure features and page text features of each website page to form a DOM tree structure feature space and a page text feature space, respectively; step 300, clustering the DOM tree structure features in the DOM tree structure feature space and the page text features in the page text feature space to obtain structure clusters and text clusters, respectively; step 400, mapping between the structure clusters and the text clusters according to the uniform resource locator (URL) of each website page, when the mapping is many-to-one, selecting the structure cluster or text cluster with the largest intersection, and finding the nearest common parent node of that cluster in the website, wherein the common parent node is the list page.

Description

Method, system and storage medium for classifying website list pages

Technical Field

The invention relates to the technical field of web page classification, and in particular to a method and system for automatically classifying Board pages (list pages) based on heterogeneous space association mapping.

Background

With the development of the internet in recent years, the network has become the largest data source, and internet data collection has long been a focus of attention. One common acquisition mode is customized acquisition: a collector is developed specifically for a particular website or a particular class of websites, the link structure of the website is analyzed, and a data extraction method is then constructed according to the page and network characteristics of that website.

Data on the internet can be divided into different information sources, such as news, forums and blogs, according to how the data is published and interacted with. Each information source has a specific format. For a news source, for example, the data includes the news text, news authors, news topics and news comments, and each news page belongs to a category. A forum is likewise divided into boards, and its data includes the main posts, the reply posts and the like. Developing a customized collector for each information source, or even for each website, inevitably produces collectors that cannot be reused, which wastes development effort. Research on a large number of websites across multiple information sources shows that, although the network data structures of different information sources take different forms, they share certain common characteristics. For example, a website in a news information source, whether organized by content category or by its home page, has list-like pages that directly and explicitly list links to related news articles according to some rule; depending on the number of articles under that rule, such a page may also carry page-turning links that help to obtain more articles. A website in a blog information source may have a similar structure, most prominently a personal home page or personal timeline. Similar structures also exist for websites in forum information sources. Such structures can be generalized as a Board-Article structure, in which the list page is called the Board page and the data page actually to be collected is called the Article page.

Board pages are usually theme-dependent; that is, all the Article page links on a Board page tend to center on a single theme or share strong common features. This property ensures that data on a required theme can be captured through a single Board page, which avoids collecting redundant data. With the Board page as the entry page, the Article pages form a tree structure rather than an open graph structure, so changes in the data can be detected by scanning the Board page. By analyzing the Board page, changes in the data can be obtained easily, and the data can be tracked more efficiently. How to find the Board pages of a website is therefore a problem that customized acquisition must solve.

At present, Board pages are mainly discovered by the following methods:

(1) Manual: the Board pages are screened out of the website by hand. Because web pages are highly diverse, manual screening is very expensive when applied at scale, especially for large websites. Frequent redesigns of a website also make its Board pages unstable, so the Board pages must be screened again at further manual cost.

(2) Rule-based: the experience of manually screening Board pages is converted into rules, and a simulator discovers Board pages from a website based on these rules. Again, because web pages are highly diverse, rule-based methods have the inherent defect of weak generalization capability, and the recall and accuracy of Board page discovery cannot be guaranteed.

Therefore, existing Board page discovery methods depend mainly on people's intuitive understanding of Board pages and cannot make full use of the various features of Board pages, especially hidden regularities. Their generalization capability is weak, the recall and accuracy of Board page discovery cannot be guaranteed, and the data quality of customized acquisition is affected to a great extent.

Disclosure of Invention

In order to solve the above technical problems, an object of the present invention is to provide a method and a system for classifying Board pages based on heterogeneous space association mapping, wherein pages are expressed in different feature spaces according to their different features, and a connection is then established between the different feature spaces, so that the various features are fully utilized to identify the Board pages.

Specifically, the invention discloses a method for classifying website list pages, wherein the website is based on hypertext markup language (HTML), and the method comprises the following steps:

step 100, acquiring a website page set, wherein the pages belong to the same website;

step 200, extracting tree structure characteristics and page text characteristics of a Document Object Model (DOM) of each website page to respectively form a DOM tree structure characteristic space and a page text characteristic space;

step 300, clustering the DOM tree structure features in the DOM tree structure feature space and the page text features in the page text feature space to obtain a structure cluster and a text cluster, respectively;

step 400, mapping between the structure cluster and the text cluster according to the uniform resource locator (URL) of the website page, and when a many-to-one mapping occurs, selecting the structure cluster or the text cluster with the largest intersection, and finding the nearest common parent node of that cluster in the website, wherein the common parent node is the list page.

According to the classification method, the extraction step of the page text features comprises the following steps:

step 210, extracting the title and content information of the website page according to a hypertext markup language (HTML) rule;

step 230, encoding the information by a word embedding method (Word Embedding) to obtain a vector representation of each word, and encoding sentences by combining the concatenated p-mean method and the smooth inverse frequency (SIF) method to obtain the text features of the website page.

According to the classification method, the text characteristics of the website page further include text length, text type, text paragraph number, and sentence number.
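The text feature extraction described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy `embeddings` dictionary, the function names, the choice of power means (arithmetic mean, minimum and maximum) and the order in which SIF weighting and p-means are combined are all assumptions made for illustration. The common-component removal step of the full SIF method is omitted for brevity.

```python
import re
from collections import Counter


def sif_weights(corpus_tokens, a=1e-3):
    """SIF-style weight a / (a + p(w)) for every word in a tokenized corpus."""
    counts = Counter(tok for doc in corpus_tokens for tok in doc)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}


def sentence_vector(tokens, embeddings, weights):
    """Concatenate three power means (mean, min, max) of weighted word vectors."""
    vecs = [[weights.get(t, 1.0) * x for x in embeddings[t]]
            for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * (3 * dim)
    columns = list(zip(*vecs))  # one tuple per embedding dimension
    mean = [sum(c) / len(c) for c in columns]
    return mean + [min(c) for c in columns] + [max(c) for c in columns]


def basic_text_features(text):
    """Text length, paragraph count and sentence count (hypothetical subset)."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    return [float(len(text)), float(len(paragraphs)), float(len(sentences))]


if __name__ == "__main__":
    emb = {"news": [1.0, 0.0], "sport": [0.0, 1.0]}  # toy 2-d embeddings
    w = sif_weights([["news", "sport", "news"]])
    print(sentence_vector(["news", "sport"], emb, w) + basic_text_features("One. Two.\nThree."))
```

Concatenating the sentence encoding with the basic text features gives one fixed-length vector per page, which is the form a clustering step would need.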

According to the classification method, the extraction step of the DOM tree structure features comprises the following steps:

step 220, traversing the DOM tree layer by layer to obtain an HTML sequence of the DOM tree, the number of various nodes in the DOM tree, the number of external links in the DOM tree and the number of nodes of the DOM tree;

and step 240, filling (padding) the features and then combining (concat) the features to form the DOM tree structure features of the page.

According to the classification method, the page text feature clustering step in the page text feature space comprises the following steps:

step 310, clustering the page text features by using a k-means clustering algorithm (Kmeans) to obtain the text cluster;

and step 330, linearly probing different k values, and selecting the optimal k value at the inflection point of the resulting dispersion of the text clusters.

According to the classification method, the DOM tree structure features are clustered by using a spectral clustering method in the DOM tree structure feature space.

To achieve another object of the present invention, there is also provided a system for classifying web site list pages, wherein the web sites are based on hypertext markup language (HTML), the system comprising:

a web page acquisition module, configured to acquire a website page set, wherein the pages belong to the same website;

a feature extraction module, configured to extract tree structure features and page text features of a Document Object Model (DOM) of each website page, and respectively form a DOM tree structure feature space and a page text feature space;

a feature clustering module, configured to cluster the DOM tree structure features in the DOM tree structure feature space and the page text features in the page text feature space to obtain a structure cluster and a text cluster, respectively;

and a mapping classification module, configured to map between the structure cluster and the text cluster according to the uniform resource locator (URL) of the website page, and when the mapping is many-to-one, to select the structure cluster or the text cluster with the largest intersection and find the nearest common parent node of that cluster in the website, wherein the common parent node is the list page.

The classification system, wherein the feature extraction module further comprises:

the first feature extraction submodule is used for extracting the page text features; and

a second feature extraction submodule, configured to extract the DOM tree structure features.

The classification system according to the above, wherein the feature clustering module further comprises:

the first feature clustering submodule is used for clustering the page text features in the page text feature space to obtain the text cluster; and

a second feature clustering submodule, configured to cluster the DOM tree structure features in the DOM tree structure feature space to obtain the structure cluster.

To achieve another object of the present invention, the present invention further provides a computer-readable storage medium, wherein the computer-readable storage medium stores an implementation program of information transmission, and the program, when executed by a processor, implements the steps of the classification method according to any one of the above.

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

Drawings

FIG. 1 is a flowchart illustrating a method for classifying web site list pages according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a DOM feature space and a text feature space formed in an original website page data space in the method for classifying website list pages according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating respective clustering in DOM feature spaces and text feature spaces in the method for classifying web site list pages according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating mapping between the clusters obtained in different feature spaces in the method for classifying website list pages according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating a search for a website list page in the method for classifying website list pages according to an embodiment of the present invention;

FIG. 6 is a block diagram of a system for classifying website list pages according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It is to be understood that the embodiments described below are only a few embodiments of the present invention, and not all embodiments.

In addition, the descriptions related to "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between the embodiments may be combined with each other, but must be based on the realization of the technical solutions by a person skilled in the art, and when the technical solutions are contradictory to each other or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.

The invention is used for solving the problems that the existing Board page discovery method in the prior art cannot fully utilize various characteristics of the Board page and has weak generalization capability, and provides an automatic Board page discovery method and system based on heterogeneous space association mapping.

In the present invention, the following assumptions exist for the website, the page and their relationship to each other:

1. A website is made up of pages, each of which has one or more identifiers, i.e., URLs. Each URL uniquely corresponds to a page.

2. The page itself is composed of HTML, and the HTML contains information about nodes, node attributes, node contents (text), and node styles, wherein URLs of other pages may exist in the node attributes.

3. A page is considered to point to another page when the current page contains that page's URL.

Based on the above assumptions, the website itself forms a directed network, and the nodes in the network have their own characteristics. In the invention, the characteristics of the Board page are abstracted into the following definition of the Board page: the Board page belongs to the website itself and can be regarded as a node in the network formed by the website. The Board page is an aggregation page for the Article pages it points to, so the node corresponding to the Board page has a specific network structure characteristic in the network. The Article pages that a Board page points to have similar content, so this similarity of Article page content can be analyzed using text features.

The Article pages that a Board page points to also share the same page structure, so this structural sameness can be analyzed using the structural features of the Article pages themselves.
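The three assumptions above treat a website as a directed network whose nodes are pages and whose edges are hyperlinks. A minimal Python sketch of building such a network from already-fetched pages is given below for illustration. The function name `build_link_graph`, the `pages` dictionary (URL to HTML source) and the decision to ignore out-of-site links and URL fragments are assumptions, not details from the specification.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_link_graph(pages):
    """pages: dict mapping URL -> HTML source.

    Returns a dict mapping each URL to the set of in-site URLs it points to.
    """
    graph = {url: set() for url in pages}
    for url, html in pages.items():
        collector = _LinkCollector()
        collector.feed(html)
        collector.close()
        for href in collector.hrefs:
            # Resolve relative links and drop "#fragment" suffixes.
            target = urldefrag(urljoin(url, href)).url
            if target in pages and target != url:
                graph[url].add(target)
    return graph
```

In this representation a Board page would appear as a node whose outgoing edges reach many Article pages.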

Existing Board page discovery methods are mainly manual or rule-based. They depend mainly on people's intuitive understanding of Board pages and cannot make full use of the various features of Board pages, especially hidden regularities. Their generalization capability is weak, the recall and accuracy of Board page discovery cannot be guaranteed, and the quality of data obtained by customized acquisition is affected to a great extent.

Aiming at the problems, the invention can express the page in different feature spaces according to different features of the page, and then establishes a connection between the different feature spaces, thereby fully utilizing various features to identify the Board page.

According to the invention, the association is established through mapping of the page between the clustering results of the text space and the structural space, so that the Board page is identified. The details are as follows:

Clustering in the text space: the Article pages pointed to by a Board page have similar content, so this similarity can be used to cluster the pages according to text features and find text clusters.

Clustering in the structural space: the Article pages pointed to by a Board page share the same page structure, so this sameness can be used to cluster the pages according to the DOM (Document Object Model) based structural features of the Article pages themselves and find structure clusters.

Mapping: the clusters obtained in the two spaces are mapped to one another through the page URLs. According to the definition of the Board page, the mapped cluster consists of the Article pages pointed to by a Board page; if the mapping is many-to-one, the largest cluster is selected.

Selection: the nearest common parent node of the Article page set is selected as the Board page.
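The mapping between the two clustering results can be sketched in Python as follows. This is an illustrative simplification: it assumes each clustering is available as a dictionary from page URL to cluster id, and it returns the single structure/text cluster pair whose shared URL set is largest. The function name and the data layout are hypothetical.

```python
def best_matching_cluster(structure_labels, text_labels):
    """Map clusters across the two spaces through shared page URLs.

    Both arguments map URL -> cluster id. Returns a tuple
    (structure_cluster_id, text_cluster_id, shared_urls) for the pair of
    clusters with the largest intersection.
    """
    structure_clusters, text_clusters = {}, {}
    for url, cluster_id in structure_labels.items():
        structure_clusters.setdefault(cluster_id, set()).add(url)
    for url, cluster_id in text_labels.items():
        text_clusters.setdefault(cluster_id, set()).add(url)

    best = (None, None, set())
    for s_id, s_urls in structure_clusters.items():
        for t_id, t_urls in text_clusters.items():
            shared = s_urls & t_urls  # pages that fall in both clusters
            if len(shared) > len(best[2]):
                best = (s_id, t_id, shared)
    return best
```

The returned URL set is the candidate Article page set whose common parent would then be looked up in the site's link network.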

The invention provides a method and a system for automatically discovering a Board page based on heterogeneous space association mapping. The method comprises the steps of respectively extracting the characteristics of a page in a text space and a structure space, then respectively clustering in the two spaces, establishing the characteristic relation of the page between the text space and the structure space through mapping between clustering results, and further identifying the Board page. The method solves the problems that the existing Board page discovery method cannot fully utilize various characteristics of the Board page and has weak generalization capability.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for classifying a website list page (Board page) according to an embodiment of the present invention. The method expresses each original page in two independent feature spaces, namely a DOM feature space and a text feature space, clusters within each space separately, and then maps between the resulting clusters through the page URLs. In this way the text features and the structural features of the page can be fully utilized, and more implicit relationship features can be captured by establishing a mapping between the text space and the structural space. The cluster with the largest intersection across the two spaces is selected. According to the definition of the Board page, the cluster onto which the clustering results of the DOM feature space and the text feature space map should consist of the Article pages pointed to by a Board page; if the mapping is many-to-one, the largest cluster is selected. The nearest common parent node of that cluster in the network is then found, and this parent node is the Board page. This approach improves the learning capability of Board page discovery, removes the dependence on manually labeled data, and avoids the influence of labeling quality on the method. Specifically, as shown in fig. 1, the classification method includes the following steps:

Step 100, acquiring a website page set, wherein the pages belong to the same website. Specifically, in this embodiment, web pages are obtained mainly by fetching the HTML source code of each page according to its link.

Step 200, extracting Document Object Model (DOM) tree structure features and page text features of each website page to form a DOM tree structure feature space and a page text feature space, respectively. As shown in fig. 2, the original web page data space A as a whole is mapped, through feature filtering, to two independent feature spaces, namely a DOM feature space B and a text feature space C.

For the text features, the web page title and content information are extracted through HTML rules. The information is then encoded through Word Embedding. (If a word is regarded as the smallest unit of text, Word Embedding can be understood as a mapping: a word in the text space is mapped, or embedded, into a numerical vector space by some method.) The output of Word Embedding is the vector representation of each word. The sentences formed by the words are then encoded by combining the concatenated p-mean method and the smooth inverse frequency (SIF) method, which yields feature encodings of the page text at the word, sentence and chapter levels. That is, these two methods convert a sentence into a sentence vector by way of its word vectors; the sentence vector carries semantic meaning and supports arithmetic operations. Basic text features such as text length, text type, number of text paragraphs and number of sentences are added, and together these form the text feature encoding of the page.

Extraction of the page DOM structural features focuses on encoding web pages using HTML structural features. The features used include: the flattened sequence obtained by traversing the DOM tree represented by the HTML layer by layer, the number of nodes of each type in the DOM tree, the number of external links in the DOM tree, and the number of nodes in the DOM tree. These features are padded and then concatenated to serve as the structural feature encoding of the page.
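The structural encoding described in this step can be sketched in Python with the standard-library HTML parser. The sketch is illustrative only: the tag vocabulary `tag_vocab`, the fixed sequence length `seq_len`, the use of id 0 for padding and unknown tags, and the rule that a link is external when its host differs from `site_host` are all assumptions rather than details given in the specification.

```python
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urlparse

# Elements that never have children, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TreeBuilder(HTMLParser):
    """Builds a simple nested-dict DOM tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def dom_features(html, site_host, tag_vocab, seq_len=8):
    """Level-order tag sequence (padded) + per-tag counts + link and node counts."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()

    sequence, counts, external = [], Counter(), 0
    queue = deque(builder.root["children"])
    while queue:  # breadth-first, i.e. layer by layer
        node = queue.popleft()
        sequence.append(node["tag"])
        counts[node["tag"]] += 1
        href = node["attrs"].get("href") if node["tag"] == "a" else None
        if href and urlparse(href).netloc not in ("", site_host):
            external += 1
        queue.extend(node["children"])

    ids = [tag_vocab.get(t, 0) for t in sequence][:seq_len]
    ids += [0] * (seq_len - len(ids))  # padding to a fixed length
    per_tag = [counts[t] for t in sorted(tag_vocab)]
    return ids + per_tag + [external, len(sequence)]  # concatenation
```

Truncating or padding the tag sequence to `seq_len` gives every page a vector of the same length, which is what allows the pages to be compared in one feature space.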

Step 300, as shown in fig. 3, clustering the DOM tree structure features in the DOM tree structure feature space B and the page text features in the page text feature space C to obtain structure clusters and text clusters, respectively. The two independent feature spaces, namely the DOM feature space B and the text feature space C, are clustered separately to form a clustered DOM feature space B' and a clustered text feature space C'. The text feature space data are clustered using the K-means method: different K values are probed linearly, and the optimal K value is selected at the inflection point of the resulting cluster dispersion. The DOM feature space data are clustered using a spectral clustering method. Page structures within a website have a certain universality, and the pages of a small website in particular are almost identical in structure. The clustering process therefore does not force the two spaces to have the same number of clusters; instead, the clustering of the page structure space data is taken as a constraint and is associated and matched with the text feature space data, which keeps the method easy to implement.
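The selection of K by linear probing can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not the embodiment's implementation: the farthest-point initialization, the use of within-cluster squared distance as the dispersion measure, and the choice of the largest second difference as the inflection point are assumptions. Spectral clustering of the DOM features is not shown.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means on a list of equal-length tuples; returns (labels, dispersion)."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))

    def assign(cs):
        return [min(range(k), key=lambda j: math.dist(p, cs[j])) for p in points]

    for _ in range(iters):
        labels = assign(centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers

    labels = assign(centers)
    dispersion = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, dispersion


def choose_k(points, k_max):
    """Probe k = 1..k_max and return the k where the dispersion curve bends most."""
    disp = [kmeans(points, k)[1] for k in range(1, k_max + 1)]
    # Second difference of the curve; requires k_max >= 3.
    bends = [disp[i - 1] - 2 * disp[i] + disp[i + 1] for i in range(1, len(disp) - 1)]
    return bends.index(max(bends)) + 2  # bends[0] corresponds to k = 2
```

With three well-separated groups of points, the dispersion drops sharply up to K = 3 and then flattens, so the bend is detected at K = 3.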

Step 400, mapping between the structure clusters of the DOM feature space B' and the text clusters of the text feature space C' according to the URL of each website page; when the mapping is many-to-one, selecting the structure cluster or text cluster with the largest intersection according to the page id and the clustering result; and finding the nearest common parent node of that cluster in the website, wherein the common parent node is the list page. As shown in fig. 4, the clusters are mapped to one another through the page URLs. According to the definition of the Board page, the mapped cluster should be the set of Article pages pointed to by the Board page, because the data in the mapped cluster have both similar text content and the same page structure and therefore belong to the same board. If a many-to-one mapping occurs, the largest cluster is selected. As shown in fig. 5, the nearest common parent node, in the network structure of the website, of the data instances in the cluster obtained by the mapping is selected, and this parent node is the Board page.
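One possible reading of the nearest common parent node search is sketched below in Python. The specification does not define "nearest" for a general directed link network, so this sketch assumes that a common parent is any page from which every Article page in the cluster is reachable, and that the nearest one minimizes the largest number of link hops to any of those Article pages. The function name and the `graph` layout (URL to set of linked URLs) are hypothetical.

```python
from collections import deque


def nearest_common_parent(graph, targets):
    """graph: dict URL -> set of URLs it links to; targets: non-empty set of Article URLs.

    Returns the page that reaches every target with the smallest maximum hop
    count, or None when no page reaches all of them.
    """
    # Reverse the edges so that we can walk from a page to the pages linking to it.
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)

    def ancestor_distances(start):
        dist, seen, queue = {}, {start}, deque([(start, 0)])
        while queue:  # breadth-first search over reversed links
            node, d = queue.popleft()
            for parent in reverse.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    dist[parent] = d + 1
                    queue.append((parent, d + 1))
        return dist

    per_target = [ancestor_distances(t) for t in targets]
    common = set.intersection(*(set(d) for d in per_target)) - set(targets)
    if not common:
        return None
    # Ties are broken by URL so that the result is deterministic.
    return min(common, key=lambda p: (max(d[p] for d in per_target), p))
```

On a small site where the home page links to a board and the board links to two articles, the board is one hop from each article and the home page is two hops away, so the board is returned.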

Based on the same inventive concept, the present invention further provides a system 500 for classifying website list pages, wherein the website is based on Hypertext Markup Language (HTML). As shown in fig. 6, which is a block diagram of a system for classifying website list pages according to an embodiment of the present invention, the system includes:

a web page obtaining module 510, configured to obtain a website page set, where the pages belong to the same website;

a feature extraction module 520, configured to extract tree structure features and page text features of a Document Object Model (DOM) of each website page, so as to respectively form a DOM tree structure feature space and a page text feature space; accordingly, the feature extraction module 520 further comprises: the first feature extraction submodule 521 is configured to extract the page text feature; and a second feature extraction submodule 522 for extracting the DOM tree structure features.

a feature clustering module 530, configured to cluster the DOM tree structure features in the DOM tree structure feature space and the page text features in the page text feature space, so as to obtain structure clusters and text clusters, respectively; accordingly, the feature clustering module further comprises: a first feature clustering submodule 531, configured to cluster the page text features in the page text feature space to obtain the text cluster; and a second feature clustering submodule 532, configured to cluster the DOM tree structure features in the DOM tree structure feature space to obtain the structure cluster.

a mapping classification module 540, configured to map between the structure cluster and the text cluster according to the uniform resource locator (URL) of the website page, and when the mapping is many-to-one, to select the structure cluster or the text cluster with the largest intersection and find the nearest common parent node of that cluster in the website, wherein the common parent node is the list page.

Based on the same inventive concept, the present invention further provides a computer-readable storage medium, on which an implementation program of information transfer is stored, and when the program is executed by a processor, the program implements the steps of any one of the classification methods described above.

The invention makes full use of the text features and the structural features of the page and captures more implicit relationship features by establishing a mapping between the text space and the structural space. In addition, the invention uses unsupervised machine learning in the form of clustering, which improves the learning capability of Board page discovery and removes the dependence on manually labeled data. Manually labeled data suffer from missing labels, wrong labels and a large amount of expired data, so the invention avoids the influence of labeling quality on the method. Compared with existing Board page discovery methods, the invention therefore makes better use of the various features of Board pages, such as text and structure, and relies on unsupervised cluster analysis, so that it has better generalization capability and incurs no manual labeling cost.

The present invention is capable of other embodiments, and various changes and modifications can be made by one skilled in the art without departing from the spirit and scope of the invention.

Claims

1. A method for classifying web site listing pages, said web sites being based on hypertext markup language (HTML), said method comprising:

step 400, mapping between the structure cluster and the text cluster according to a website link (URL) of the website page, and when many-to-one mapping occurs, selecting the structure cluster or the text cluster that is most intersected, and finding a nearest common parent node of the structure cluster or the text cluster that is most intersected in the website, where the common parent node is a list page.

2. The classification method according to claim 1, wherein the extraction step of the page text features comprises:

step 230, encoding the title and the content information by a word embedding method (Word Embedding) to obtain a vector representation of each word, and encoding sentences by combining the concatenated p-mean method and the smooth inverse frequency (SIF) method to obtain the text features of the website page.

3. The classification method according to claim 2, wherein the text features of the website page further include text length, text category, number of text paragraphs, and number of sentences.

4. A classification method according to any one of claims 1 to 3, characterised in that said extraction step of DOM tree structural features comprises:

and step 240, filling (padding) the features and then merging (concat) the features to form the DOM tree structure features of the page.

5. The method of classifying according to claim 4, wherein the step of clustering the page text features of the page text feature space comprises:

6. The classification method according to claim 5, wherein the DOM tree structure features are clustered using a spectral clustering method in the DOM tree structure feature space.

7. A system for classifying web site listing pages, said web sites being based on hypertext markup language (HTML), said system comprising:

8. The classification system of claim 7, wherein the feature extraction module further comprises:

and the second characteristic extraction submodule is used for extracting the DOM tree structure characteristics.

9. The classification system according to claim 7 or 8, wherein the feature clustering module further comprises:

10. A computer-readable storage medium, characterized in that it has stored thereon a program for implementing the transfer of information, which program, when being executed by a processor, implements the steps of the classification method according to any one of claims 1 to 6.