CN112199613B

CN112199613B - Product URL automatic positioning method integrating DOM topology and text attributes

Info

Publication number: CN112199613B
Application number: CN202011099728.6A
Authority: CN
Inventors: 潘丽敏; 郜森; 罗森林; 吴舟婷; 周妍汝; 董勃
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2020-10-13
Filing date: 2020-10-13
Publication date: 2023-03-03
Anticipated expiration: 2040-10-13
Also published as: CN112199613A

Abstract

The invention relates to a method for automatically positioning product URLs that fuses DOM topology and text attributes, and belongs to the technical field of computer and information science. First, a website is converted into a DOM (Document Object Model) parse tree, the text attribute of each node in the parse tree is acquired, and a label attribute is added to each node. Then, a tree graph whose nodes carry product label attributes is constructed by recursively traversing the DOM label tree, and the tree graph is converted into a node vector set w that encodes the topology of the DOM parse tree. The text attribute under each node is converted into a text vector h using doc2vec. Finally, a node classification model is trained on the learned node and text vectors [w, h], which fuse DOM topology information, together with the label attributes, completing automatic URL positioning. By fusing DOM topology and text attributes, the method automatically learns the extraction rules of a page, improves adaptability over existing methods, addresses their poor robustness, low accuracy and heavy workload, and has practical and social value.

Description

Product URL automatic positioning method integrating DOM topology and text attributes

Technical Field

The invention relates to a method for automatically positioning product URLs that fuses DOM topology and text attributes, and belongs to the technical field of computer and information science.

Background

As globalization advances, the supply chain of the information and communication technology (ICT) industry is becoming increasingly complex. Countries have widely recognized the importance of strengthening the security management of the ICT supply chain and have begun to construct supply chain networks for the industry. Constructing such a network requires collecting the product information published on the official websites of ICT enterprises and extracting it in structured form. The key difficulty is positioning the product information on a supplier's official website and discarding irrelevant data. However, because program code is often non-standard and DHTML and Ajax are prevalent, DOM structures are extremely complex, and product information on official websites is difficult to locate accurately. Various methods have been proposed to solve this problem. Existing URL node positioning methods fall into two general types:

1. Node positioning method based on rule judgment

The node positioning method based on rule judgment relies mainly on human experts, who work out information retrieval rules by analyzing the characteristics of the information to be extracted. The whole page is then searched for content matching the rules. However, different retrieval rules must be formulated for different information, so the workload is large; moreover, complete and comprehensive rules are difficult to formulate, so the accuracy is low.

2. Node positioning method based on webpage structure

The positioning method based on webpage structure uses the DOM structure of the webpage and treats the information to be extracted as a child node in that structure. The path from the root node to the child node is then obtained, and the node position can be uniquely identified by this path. However, this method requires the webpage structure of the target website to be fixed and consistent, a condition that is difficult to meet in practice. This leads to low accuracy and efficiency and poor robustness when positioning in batches.

In summary, existing URL node positioning methods rely on manually formulated extraction rules or on matching content under a fixed XPath path, and suffer from low accuracy and efficiency, poor robustness and a large workload. A more efficient, accurate and automatic method for positioning product nodes on official websites is therefore needed. The invention provides a product URL automatic positioning method fusing DOM topology and text attributes.

Disclosure of Invention

The invention aims to solve the problems of low accuracy and efficiency, poor robustness and large workload of the conventional product node positioning method, and provides a product URL automatic positioning method integrating DOM topology and text attributes.

The design principle of the invention is as follows. The full-site content of a supplier website is crawled from an input URL (Uniform Resource Locator) and converted into a DOM (Document Object Model) parse tree; the text attribute of each node in the parse tree is acquired, and a label attribute (1 if the node is a product node, 0 otherwise) is added to the corresponding node of the parse tree. Then, a tree graph whose nodes carry product label attributes is constructed by recursively traversing the DOM label tree, and the tree graph is converted into a node vector set w that encodes the topology of the DOM parse tree. The text attribute under each node is converted into a text vector h using doc2vec. Finally, a node classification model is trained on the learned node and text vectors [w, h], which fuse DOM topology information, together with the label attributes, completing automatic URL positioning and enabling automatic batch information acquisition.

The technical scheme of the invention is realized by the following steps:

Step 1, parse the webpage and obtain its topological structure to generate a label attribute tree.

Step 2, embed the DOM tree attributes.

Step 3, classify and position the product nodes.

Advantageous effects

Compared with URL node positioning methods based on rule judgment or on webpage structure, the method combines the two approaches and thereby improves robustness. It also extracts the DOM structure of a webpage automatically and learns the extraction rules of the corresponding page automatically, which improves efficiency and accuracy.

Drawings

FIG. 1 is a schematic diagram of the product URL automatic positioning method integrating DOM topology and text attributes.

Detailed Description

To better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.

The experimental data are laboratory-labeled data covering a total of 4397 website URLs from 86 technology companies. The data are divided into a test set and a training set in a ratio of [...]:7 (the first term is missing in the source) and imported into the model for training. The experiment evaluates the model by precision (P), recall (R) and the F1 value, which are calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R}$$

where TP is the number of positive-class nodes predicted as positive, FN is the number of positive-class nodes predicted as negative, FP is the number of negative-class nodes predicted as positive, and TN is the number of negative-class nodes predicted as negative.
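The three metrics follow the standard definitions in terms of TP, FP and FN. A minimal sketch of the calculation is given below; the function name and the counts are hypothetical and chosen only for illustration, not taken from the experiment.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from confusion counts; 0.0 when a ratio is undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical counts (not from the patent's experiment).
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

TN does not enter any of the three formulas, which is why the function does not take it as an argument.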

The specific process is as follows:

Step 1, parse the webpage and obtain its topological structure to generate a label attribute tree.

Step 1.1, starting from the input URLs of company official websites, the model crawls the full-site content of the websites of multiple supplier companies, converts the front-end HTML of each website into a DOM parse tree using the Beautiful Soup library, and searches the DOM tree to obtain the names and texts of all nodes;

Step 1.2, add a label attribute (1 if the node is a product, 0 otherwise) to the corresponding nodes of the DOM parse tree, yielding a DOM tag tree graph whose nodes carry product label attributes;

Step 1.3, recursively traverse the DOM tag tree, obtain the descendant nodes of each node to form topological link node pairs, and generate the training data.
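The three sub-steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: Python's standard-library `html.parser` stands in for the parsing library named above, labels come from a hypothetical hand-labeled set `product_urls`, and "link node pairs" is interpreted as parent-child edges recorded while parsing rather than by a separate recursive pass.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag


class DomTreeBuilder(HTMLParser):
    """Collect DOM nodes, their text, a product label, and parent-child edges."""

    def __init__(self, product_urls):
        super().__init__()
        self.product_urls = product_urls  # hypothetical hand-labeled product links
        self.nodes = []    # one dict per element: tag, text, href, label
        self.edges = []    # (parent_id, child_id) topological link pairs
        self._stack = []   # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        href = dict(attrs).get("href")
        label = 1 if href in self.product_urls else 0  # 1 = product node
        self.nodes.append({"tag": tag, "text": "", "href": href, "label": label})
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        if self._stack and self.nodes[self._stack[-1]]["tag"] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and data.strip():
            self.nodes[self._stack[-1]]["text"] += data.strip()


# Toy page; the URLs are invented for the example.
html = (
    "<html><body><ul>"
    "<li><a href='/products/router-x1'>Router X1</a></li>"
    "<li><a href='/about'>About us</a></li>"
    "</ul></body></html>"
)
builder = DomTreeBuilder(product_urls={"/products/router-x1"})
builder.feed(html)
for node_id, node in enumerate(builder.nodes):
    print(node_id, node["tag"], node["label"], repr(node["text"]))
print(builder.edges)
```

The `edges` list is the graph that the embedding step consumes, and `nodes` carries the text and label that the later steps need.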

Step 2, embed the DOM tree attributes.

Step 2.1, initialize the node vector set Φ.

Step 2.2, for each node $v_i$ in the sample set, obtain a node sequence by random walk.

Given a starting node, a node is sampled at random from the neighbors of the current node to serve as the next node visited, and this is repeated until the length of the visit sequence meets a preset condition. Random walk sampling is carried out from every node in the graph to obtain locally associated training data.
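A minimal sketch of this sampling procedure is shown below. It assumes uniform neighbor sampling and a fixed walk length as the "preset condition"; the adjacency dictionary, the function names and the parameter values are hypothetical.

```python
import random


def random_walk(adjacency, start, walk_length, rng):
    """Walk from `start`, picking a uniformly random neighbor at each step."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # isolated node: nothing left to sample
            break
        walk.append(rng.choice(neighbors))
    return walk


def sample_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Start `walks_per_node` walks from every node in the graph."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for node in adjacency:
            walks.append(random_walk(adjacency, node, walk_length, rng))
    return walks


# Undirected adjacency of a toy DOM chain: 0=html, 1=body, 2=ul, 3=li, 4=a
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
walks = sample_walks(adjacency, walks_per_node=2, walk_length=5)
for walk in walks:
    print(walk)
```

Each walk is a sequence of node ids in which consecutive entries are neighbors in the DOM graph, which is what makes the sequences "locally associated".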

Step 2.3, for each node sequence obtained in step 2.2, update the node vectors with skip-gram(Φ, sequence, w). After a sufficient number of node visit sequences has been obtained, skip-gram training is carried out on the sampled data: discrete network nodes are represented as vectors such that node co-occurrence is maximized. Finally, all sequences are input into the skip-gram model to generate the final node embedding vectors w.
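Skip-gram learns from (center, context) pairs of nodes that co-occur within a window along a walk. The sketch below shows only how such pairs are derived from one walk; it does not train the embedding, which in practice would be done by a word-embedding library. The function name and the window size are assumptions made for the example.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for nodes at most `window` steps apart."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs


walk = [0, 1, 2, 3]  # one hypothetical random-walk sequence of node ids
pairs = skipgram_pairs(walk, window=1)
print(pairs)
```

Nodes that are close in the DOM tree appear together in many pairs, so training on these pairs pushes their vectors together.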

Step 2.4, use doc2vec to create a vector representation h of the text under each node, and concatenate the text vector with the node vector to obtain an attribute vector [w, h] containing both topology information and text attribute information.
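The fusion itself is a per-node concatenation. The sketch below assumes the node vectors and text vectors have already been produced and are keyed by node id; the numbers are made up for illustration and are not outputs of any trained model.

```python
def fuse(node_vectors, text_vectors):
    """Concatenate w and h per node id into one attribute vector [w, h]."""
    return {
        node_id: list(node_vectors[node_id]) + list(text_vectors[node_id])
        for node_id in node_vectors
    }


# Hypothetical 2-dimensional node vectors (w) and 3-dimensional text vectors (h).
node_vectors = {4: [0.1, -0.3], 6: [0.2, 0.5]}
text_vectors = {4: [0.7, 0.0, 0.4], 6: [-0.1, 0.9, 0.3]}
fused = fuse(node_vectors, text_vectors)
print(fused[4])
```

The fused vector has as many dimensions as w and h combined, so the classifier sees topology and text features side by side.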

Step 3, classify and position the product nodes.

Step 3.1, train a LinearSVC node classification model on [w, h] to complete automatic positioning of the product nodes among the URLs of supplier official websites.
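To illustrate what a linear support vector classifier does with the fused vectors, the sketch below trains a small linear SVM by full-batch subgradient descent on the L2-regularized hinge loss. It is a dependency-free stand-in written for this example, not the library classifier named in the step, and its loss and optimizer may differ from that library's defaults. The feature values, learning rate, regularization strength and epoch count are all assumptions.

```python
def train_linear_svm(X, y, lr=0.05, lam=0.01, epochs=500):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    signs = [1.0 if label == 1 else -1.0 for label in y]  # map {0, 1} to {-1, +1}
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, si in zip(X, signs):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if si * score < 1.0:  # inside the margin: hinge loss is active
                for j in range(d):
                    grad_w[j] -= si * xi[j] / n
                grad_b -= si / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, X):
    """Label a node 1 (product) when its score is positive, else 0."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]


# Hypothetical 2-dimensional fused vectors with labels (1 = product node).
X = [[2.0, 2.0], [3.0, 2.5], [0.0, 0.5], [0.5, 0.0]]
y = [1, 1, 0, 0]
w, b = train_linear_svm(X, y)
predictions = predict(w, b, X)
print(predictions)
```

Nodes predicted as 1 are the located product nodes; their `href` values are the product URLs the method returns.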

Test results: in the experiment, the product URL automatic positioning method fusing DOM topology and text attributes combines the DOM topology of webpages with machine learning methods, achieves automatic positioning of key nodes, and addresses the low efficiency and poor robustness of traditional methods. On official website data from 60 suppliers, the product link automatic positioning task reaches an F1 value of 78.2%, a P value of 74.2% and an R value of 92.4%, which demonstrates the accuracy and practical value of the method.

The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A product URL automatic positioning method fusing DOM topology and text attributes, characterized by comprising the following steps:

Step 1, a model acquires the DOM (Document Object Model) parse tree structure of a webpage using the Beautiful Soup library, acquires the root node of the webpage, recursively searches all nodes of the parse tree layer by layer, and stores the DOM parse tree structure and the text information under the corresponding nodes;

Step 2, before sampling samples are generated, the type of each node is read, the parse tree is pruned according to node type and layer depth, and weights are assigned to the nodes to obtain a new parse tree; nodes are sampled using a random walk strategy and skip-gram training is performed on the sampled data, so that the DOM parse tree structure of the website is converted into a node vector set w containing the topology of the DOM parse tree; and the text attribute corresponding to each node is converted into a text vector h using doc2vec;

Step 3, before the classification model is trained, the node vectors and text vectors are generated and concatenated to produce attribute vectors containing webpage topology information and text features, and the node classification model is trained on the learned attribute vectors [w, h] together with the label attribute of each node, realizing automatic positioning of URL product nodes.