CN112199613B - Product URL automatic positioning method integrating DOM topology and text attributes - Google Patents

Info

Publication number
CN112199613B
CN112199613B CN202011099728.6A CN202011099728A CN112199613B CN 112199613 B CN112199613 B CN 112199613B CN 202011099728 A CN202011099728 A CN 202011099728A CN 112199613 B CN112199613 B CN 112199613B
Authority
CN
China
Prior art keywords
node
dom
text
vector
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011099728.6A
Other languages
Chinese (zh)
Other versions
CN112199613A (en)
Inventor
潘丽敏
郜森
罗森林
吴舟婷
周妍汝
董勃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-10-13
Filing date
2020-10-13
Publication date
2023-03-03
Application filed by Beijing Institute of Technology BIT
Priority to CN202011099728.6A
Publication of CN112199613A
Application granted
Publication of CN112199613B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a product URL automatic positioning method fusing DOM topology and text attributes, and belongs to the technical field of computers and information science. First, a website is converted into a DOM (Document Object Model) parse tree, the text attributes of each node in the tree are acquired, and a label attribute is added to each node. Next, the DOM label tree is traversed recursively to build a tree graph whose nodes carry product label attributes, and the tree graph is converted into a node vector set w that encodes the topology of the DOM parse tree. The text attributes under each node are converted into a text vector h with doc2vec. Finally, a node classification model is trained on the learned node and text vectors [w, h], which fuse DOM topology information, together with the label attributes, completing automatic URL positioning. By fusing DOM topology and text attributes, the method learns page extraction rules automatically, improves adaptability over existing methods, addresses their poor robustness, low accuracy and heavy workload, and has practical and social value.

Description

Product URL automatic positioning method integrating DOM topology and text attributes
Technical Field
The invention relates to a product URL automatic positioning method fusing DOM topology and text attributes, and belongs to the technical field of computers and information science.
Background
As globalization advances, the supply chain of the information and communication technology (ICT) industry is becoming increasingly complex. Countries have broadly recognized the importance of strengthening security management of the ICT supply chain and have begun to build supply chain networks for the industry. Building such a network requires collecting the product information published on the official websites of ICT enterprises and extracting it as structured information. The key difficulty is locating the product information on a supplier's official website while discarding irrelevant data. However, because program code is often non-standard and DHTML and Ajax are widespread, DOM structures are extremely complex, and product information on official websites is hard to locate accurately. Various methods have been proposed to solve this problem. Existing URL node positioning methods fall into two general types:
1. Node positioning method based on rule judgment
The node positioning method based on rule judgment relies mainly on human experts, who formulate information retrieval rules by analyzing the characteristics of the information to be extracted. The whole page is then searched for data content that matches the rules. However, different retrieval rules must be formulated for different kinds of information, so the workload is large; moreover, complete and comprehensive rules are hard to formulate, so accuracy is low.
2. Node positioning method based on webpage structure
The positioning method based on webpage structure uses the DOM structure of the web page and treats the information to be extracted as a child node in that structure. The path from the root node to the child node is then obtained, and the node position can be uniquely identified by this path. However, this method requires the page structure of the target website to be fixed and consistent, a condition that is hard to meet in practice. As a result, accuracy and efficiency are low and robustness is poor when positioning accurately in batches.
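The brittleness of a fixed root-to-node path can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree`. The two page snippets, the path string and the `locate` helper are invented for illustration and do not come from the patent.

```python
import xml.etree.ElementTree as ET

# Two hypothetical product pages; the second wraps the list in an extra <div>.
PAGE_A = "<html><body><ul><li><a href='/p/1'>Router</a></li></ul></body></html>"
PAGE_B = "<html><body><div><ul><li><a href='/p/2'>Switch</a></li></ul></div></body></html>"

# Fixed root-to-node path written for the layout of PAGE_A only.
FIXED_PATH = "./body/ul/li/a"


def locate(page, path):
    """Return the href of every node reached by following the fixed path."""
    root = ET.fromstring(page)
    return [node.get("href") for node in root.findall(path)]


hits_a = locate(PAGE_A, FIXED_PATH)  # layout matches the path
hits_b = locate(PAGE_B, FIXED_PATH)  # layout differs by one wrapper element
print(hits_a)
print(hits_b)
```

The same path that finds the product link on the first page finds nothing on the second, which is the consistency requirement described above.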
In summary, existing URL node positioning methods rely on manually written extraction rules or on matching content under a fixed XPath path, and suffer from low accuracy and efficiency, poor robustness and heavy workload. A more efficient, accurate and automatic method for positioning product nodes on official websites is therefore needed. The invention provides a product URL automatic positioning method fusing DOM topology and text attributes.
Disclosure of Invention
The invention aims to solve the low accuracy and efficiency, poor robustness and heavy workload of existing product node positioning methods, and provides a product URL automatic positioning method fusing DOM topology and text attributes.
The design principle of the invention is as follows. The entire site content of a supplier website is crawled from an input URL (Uniform Resource Locator) and converted into a DOM (Document Object Model) parse tree; the text attribute of each node in the DOM parse tree is acquired, and a label attribute is added to the corresponding node (1 if the node is a product node, 0 otherwise). Next, the DOM label tree is traversed recursively to build a tree graph whose nodes carry product label attributes, and the tree graph is converted into a node vector set w that encodes the topology of the DOM parse tree. The text attribute under each node is converted into a text vector h with doc2vec. Finally, a node classification model is trained on the learned node and text vectors [w, h], which fuse DOM topology information, together with the label attributes, completing automatic URL positioning and enabling automatic batch information acquisition.
The technical scheme of the invention is realized by the following steps:
Step 1, parse the web page and acquire its topological structure to generate a label attribute tree.
Step 2, embed the DOM tree attributes.
Step 3, classify and position the product nodes.
Advantageous effects
Compared with URL node positioning methods based on rule judgment or on webpage structure, the method combines the two approaches and greatly improves robustness. It also extracts the DOM structure of the web page automatically and learns the extraction rules of the corresponding page automatically, which greatly improves efficiency and accuracy.
Drawings
FIG. 1 is a schematic diagram of the product URL automatic positioning method fusing DOM topology and text attributes.
Detailed Description
To better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.
The experimental data are laboratory-labeled data covering 4397 URLs from 86 technology companies. The data are divided into a test set and a training set at a ratio of …:7 and fed to the model for training. The model is evaluated using the F1 value, P (precision) and R (recall), which are calculated as follows.
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$
Here TP is the number of positive-class nodes predicted as positive, FN is the number of positive-class nodes predicted as negative, FP is the number of negative-class nodes predicted as positive, and TN is the number of negative-class nodes predicted as negative.
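As a small worked sketch, the three metrics can be computed from the counts above using the standard definitions P = TP/(TP+FP), R = TP/(TP+FN) and F1 = 2PR/(P+R). The function name and the example counts are illustrative and are not the experiment's data.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative counts only: 8 product nodes found, 2 false alarms, 2 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
```

TN does not enter any of the three formulas, which is why the function does not take it as an argument.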
The specific process comprises the following steps:
Step 1, parse the web page and acquire its topological structure to generate a label attribute tree.
Step 1.1, from the input official website URLs, the model crawls the entire site content of the websites of multiple supplier companies, converts the front-end HTML of each website into a DOM parse tree using the Beautiful Soup library, searches the nodes of the DOM tree, and obtains the names and texts of all nodes;
Step 1.2, a new label attribute (1 if the node is a product, 0 otherwise) is added to the corresponding nodes of the DOM parse tree, yielding a DOM label tree graph whose nodes carry product label attributes;
Step 1.3, the DOM label tree is traversed recursively, the descendant nodes of each node are obtained to form topological link node pairs, and training data are generated.
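Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses Python's standard `html.parser` in place of the parsing library named in Step 1.1 so that it runs without third-party packages, and the class names, the sample HTML and the hand-picked product node id are all assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag


class DomNode:
    def __init__(self, node_id, name, parent=None):
        self.node_id = node_id
        self.name = name
        self.parent = parent
        self.children = []
        self.text = ""
        self.label = 0  # 1 = product node, 0 = any other node


class DomTreeBuilder(HTMLParser):
    """Builds a simple tree of DomNode objects (Step 1.1)."""

    def __init__(self):
        super().__init__()
        self.root = DomNode(0, "[document]")
        self.nodes = [self.root]
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = DomNode(len(self.nodes), tag, parent=self._current)
        self._current.children.append(node)
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        if self._current.parent is not None and self._current.name == tag:
            self._current = self._current.parent

    def handle_data(self, data):
        self._current.text += data.strip()


def label_products(nodes, product_ids):
    """Attach the label attribute from hand-labelled node ids (Step 1.2)."""
    for node in nodes:
        node.label = 1 if node.node_id in product_ids else 0


def descendant_pairs(root):
    """Recursively pair every node with each of its descendants (Step 1.3)."""
    pairs = []

    def collect(ancestor, current):
        for child in current.children:
            pairs.append((ancestor.node_id, child.node_id))
            collect(ancestor, child)

    def visit(current):
        collect(current, current)
        for child in current.children:
            visit(child)

    visit(root)
    return pairs


HTML = ('<html><body><div><a href="/product/1">Router X</a>'
        '<a href="/about">About</a></div></body></html>')
builder = DomTreeBuilder()
builder.feed(HTML)
builder.close()
label_products(builder.nodes, product_ids={4})  # node 4 is the product link
pairs = descendant_pairs(builder.root)
print([(n.node_id, n.name, n.text, n.label) for n in builder.nodes])
print(pairs)
```

Pairing each node with all of its descendants is one reading of "topological link node pairs"; pairing only parents with direct children would be a sparser alternative.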
Step 2, embed the DOM tree attributes.
Step 2.1, initialize the node vector set Φ.
Step 2.2, for each node $v_i$ in the sample set, obtain a node sequence $W_{v_i}$ by random walk. Given a starting node, a node is sampled at random from the neighbors of the current node and becomes the next node visited; this is repeated until the length of the visited sequence meets a preset condition. A random walk is started from every node in the graph to obtain locally associated training data.
Step 2.3, for each sequence $W_{v_i}$, SkipGram($\Phi$, $W_{v_i}$, $w$) is used to update the node vectors. After enough node visit sequences have been collected, skip-gram training is run on the sampled data: discrete network nodes are represented as vectors so as to maximize node co-occurrence. Finally, all sequences are fed into the skip-gram model to generate the final node embedding vectors w.
Step 2.4, use doc2vec to create a vectorized representation h of the text under each node, and combine the text vector with the node vector to obtain an attribute vector [w, h] containing topology information and text attribute information.
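The sampling side of Steps 2.2 and 2.3 can be sketched in plain Python. This sketch only generates random-walk sequences and the (center, context) pairs a skip-gram model would be trained on; it does not train embeddings, which would normally be delegated to an embedding library. The edge list, walk length and window size are assumed values chosen for illustration.

```python
import random


def build_adjacency(edges):
    """Treat each (parent, child) link as an undirected edge."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, []).append(child)
        adjacency.setdefault(child, []).append(parent)
    return adjacency


def random_walk(adjacency, start, walk_length, rng):
    """Uniformly sample a neighbour of the current node until the walk is full."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency.get(walk[-1], [])
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


def skipgram_pairs(walk, window):
    """Emit (center, context) pairs for nodes within `window` steps in a walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Hypothetical DOM links: node 0 is the root, node 4 is the deepest node.
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
adjacency = build_adjacency(edges)
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One walk of length 5 started from every node in the graph.
walks = [random_walk(adjacency, node, 5, rng) for node in sorted(adjacency)]
training_pairs = [pair for walk in walks for pair in skipgram_pairs(walk, 2)]
print(walks)
print(len(training_pairs))
```

Nodes that often appear close together in walks produce many shared pairs, which is what pushes their learned vectors together during skip-gram training.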
Step 3, classify and position the product nodes.
Step 3.1, finally, train a Linear SVC node classification model on [w, h] to complete automatic positioning of product nodes for the supplier official website URLs.
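The fusion and classification step can be sketched as below. To stay dependency-free, a plain hinge-loss subgradient loop without an intercept stands in for a library Linear SVC; it is not the classifier implementation used in the experiments. The toy vectors, labels and hyperparameters are invented for illustration.

```python
def fuse(node_vec, text_vec):
    """Concatenate the topology vector w and the text vector h into [w, h]."""
    return list(node_vec) + list(text_vec)


def train_linear_svm(samples, labels, epochs=50, lr=0.1, reg=0.01):
    """Minimise L2-regularised hinge loss with per-sample subgradient steps."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            target = 1.0 if label == 1 else -1.0
            score = sum(wj * xj for wj, xj in zip(weights, features))
            # Shrink the weights (regularisation term of the subgradient).
            weights = [wj * (1.0 - lr * reg) for wj in weights]
            # Step towards the sample only if it violates the margin.
            if target * score < 1.0:
                weights = [wj + lr * target * xj
                           for wj, xj in zip(weights, features)]
    return weights


def predict(weights, features):
    """Return 1 (product node) for a positive score, otherwise 0."""
    score = sum(wj * xj for wj, xj in zip(weights, features))
    return 1 if score > 0 else 0


# Toy node vectors w and text vectors h; the first two nodes are product nodes.
node_vectors = [[0.9, 0.7], [0.8, 0.9], [-0.7, -0.8], [-0.9, -0.6]]
text_vectors = [[0.6], [0.8], [-0.5], [-0.7]]
labels = [1, 1, 0, 0]

fused = [fuse(w, h) for w, h in zip(node_vectors, text_vectors)]
weights = train_linear_svm(fused, labels)
predictions = [predict(weights, x) for x in fused]
print(predictions)
```

Concatenation keeps the topology and text features in separate coordinates, so a linear model can weight the two sources independently.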
Test results: the experiments show that the product URL automatic positioning method based on the fusion of DOM topology and text attributes, which combines the DOM topology of the web page with machine learning methods, achieves automatic positioning of key nodes and addresses the low efficiency and poor robustness of traditional methods. On the official website data of 60 suppliers, the product link automatic positioning task reaches an F1 value of 78.2%, a P value of 74.2% and an R value of 92.4%, which indicates that the method is accurate and of practical value.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (1)

1. A product URL automatic positioning method based on fused DOM topology and text attributes, characterized by comprising the following steps:
step 1, a model acquires the DOM (Document Object Model) parse tree structure of a web page using the Beautiful Soup library, acquires the root node of the web page, recursively searches all nodes of the parse tree layer by layer, and stores the DOM parse tree structure and the text information under the corresponding nodes;
step 2, before generating sampling samples, each node type is read, the parse tree is pruned according to node type and number of layers, and weights are assigned to the nodes to obtain a new parse tree; the nodes are sampled with a random walk strategy and skip-gram training is performed on the sampled data, so that the DOM parse tree structure of the website is converted into a node vector set w containing the topology of the DOM parse tree, and the text attributes corresponding to each node are converted into text vectors h using doc2vec;
step 3, before training the classification model, the node vector and the text vector are generated and concatenated into an attribute vector containing web page topology information and text features, and the node classification model is trained by combining the learned attribute vector [w, h] with the label attribute of each node, realizing automatic positioning of URL product nodes.
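The pruning in step 2 of the claim can be sketched as follows. The claim does not state which node types are removed, how many layers are kept, or how weights are assigned, so the tag set, the depth limit and the depth-based weight below are placeholders chosen only to make the sketch concrete.

```python
SKIP_TAGS = {"script", "style", "noscript"}  # assumed node types to discard
MAX_DEPTH = 3                                # assumed limit on tree layers


def prune(node, depth=0):
    """Return a pruned, weighted copy of the subtree, or None if it is dropped."""
    if node["name"] in SKIP_TAGS or depth > MAX_DEPTH:
        return None
    kept = []
    for child in node.get("children", []):
        pruned = prune(child, depth + 1)
        if pruned is not None:
            kept.append(pruned)
    # Placeholder weighting rule: shallower nodes receive larger weights.
    return {"name": node["name"], "weight": 1.0 / (depth + 1), "children": kept}


# Hypothetical parse tree as nested dictionaries.
tree = {"name": "html", "children": [
    {"name": "head", "children": [{"name": "script", "children": []}]},
    {"name": "body", "children": [
        {"name": "div", "children": [
            {"name": "ul", "children": [
                {"name": "li", "children": []}]}]}]}]}

pruned_tree = prune(tree)
print(pruned_tree)
```

In this example the `script` node is removed by type and the `li` node is removed because it sits below the depth limit.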

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011099728.6A CN112199613B (en) 2020-10-13 2020-10-13 Product URL automatic positioning method integrating DOM topology and text attributes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011099728.6A CN112199613B (en) 2020-10-13 2020-10-13 Product URL automatic positioning method integrating DOM topology and text attributes

Publications (2)

Publication Number Publication Date
CN112199613A CN112199613A (en) 2021-01-08
CN112199613B CN112199613B (en) 2023-03-03

Family

ID=74009072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011099728.6A Active CN112199613B (en) 2020-10-13 2020-10-13 Product URL automatic positioning method integrating DOM topology and text attributes

Country Status (1)

Country Link
CN (1) CN112199613B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462540A (en) * 2014-12-24 2015-03-25 中国科学院声学研究所 Webpage information extraction method
CN107451215A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 Feature text abstracting method and device
CN110457579A (en) * 2019-07-30 2019-11-15 四川大学 The Web de-noising method and system to be cooperated based on template and classifier

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577466B (en) * 2012-08-03 2017-02-15 腾讯科技(深圳)有限公司 Method and device for displaying webpage content in browser

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
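Web-based news information extraction (基于Web的新闻信息抽取); 朱永盛 et al.; 《计算机工程》; 2006-05-31; Vol. 32, No. 10; pp. 74-76 *
Research on a web page information extraction method based on text tag attributes (基于文本标签属性的网页信息抽取方法研究); 沈娜; 《武汉职业技术学院学报》; 2016-02-29; Vol. 15, No. 01; pp. 62-65, 73 *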

Also Published As

Publication number Publication date
CN112199613A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN109710701B (en) Automatic construction method for big data knowledge graph in public safety field
CN104899273B (en) A kind of Web Personalization method based on topic and relative entropy
CN103136360B (en) A kind of internet behavior markup engine and to should the behavior mask method of engine
CN112749284B (en) Knowledge graph construction method, device, equipment and storage medium
CN107135092B (en) A kind of Web service clustering method towards global social interaction server net
CN110134613B (en) Software defect data acquisition system based on code semantics and background information
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN104268148A (en) Forum page information auto-extraction method and system based on time strings
CN102591992A (en) Webpage classification identifying system and method based on vertical search and focused crawler technology
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN103559234A (en) System and method for automated semantic annotation of RESTful Web services
CN110059085B (en) Web 2.0-oriented JSON data analysis and modeling method
CN101477571A (en) Method and apparatus for marking network contents semantic structure
CN114090861A (en) Education field search engine construction method based on knowledge graph
CN102508901A (en) Content-based massive image search method and content-based massive image search system
CN114911893A (en) Method and system for automatically constructing knowledge base based on knowledge graph
Wen et al. Heterogeneous Information Network‐Based Scientific Workflow Recommendation for Complex Applications
CN101576933A (en) Fully-automatic grouping method of WEB pages based on title separator
Nethra et al. WEB CONTENT EXTRACTION USING HYBRID APPROACH.
CN112199613B (en) Product URL automatic positioning method integrating DOM topology and text attributes
CN109948015B (en) Meta search list result extraction method and system
CN115905705A (en) Industrial algorithm model recommendation method based on industrial big data
CN113051455B (en) Water affair public opinion identification method based on network text data
CN104281693A (en) Semantic search method and semantic search system
Ramulu et al. A study of semantic web mining: Integrating domain knowledge into web mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant