WO2022134820A1 - Webpage data extraction method and apparatus, electronic device, and storage medium - Google Patents

Info

Publication number
WO2022134820A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
dom tree
label
dom
tree
Prior art date
Application number
PCT/CN2021/125865
Other languages
French (fr)
Chinese (zh)
Inventor
王大伟
周威
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022134820A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/154 Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets

Definitions

  • the present application relates to the technical field of terminals, and in particular, to a method, apparatus, electronic device and storage medium for extracting data from a webpage.
  • the present application provides a method, device, electronic device and storage medium for data extraction from a webpage.
  • a first aspect of the present application provides a method for extracting data from a web page, the method comprising:
  • a new webpage to be extracted is generated according to the DOM tree of the fifth node of all the unordered list tags, and webpage feature data extraction is performed on the new webpage to be extracted.
  • a second aspect of the present application provides an electronic device, the electronic device comprising a memory and a processor, the memory being used to store at least one computer-readable instruction, and the processor being used to execute the at least one computer-readable instruction to implement the following steps:
  • a new webpage to be extracted is generated according to the DOM tree of the fifth node of all the unordered list tags, and webpage feature data extraction is performed on the new webpage to be extracted.
  • a third aspect of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores at least one computer-readable instruction, and when the at least one computer-readable instruction is executed by a processor, implements the following steps:
  • a new webpage to be extracted is generated according to the DOM tree of the fifth node of all the unordered list tags, and webpage feature data extraction is performed on the new webpage to be extracted.
  • a fourth aspect of the present application provides an apparatus for extracting data from a webpage, wherein the apparatus includes:
  • an acquisition module for acquiring the HTML code in the source code of the webpage to be extracted, and parsing the HTML code into a first node DOM tree;
  • a parsing module for parsing the first node DOM tree to obtain all unordered list tags;
  • a traversal module for traversing all the list tags corresponding to each unordered list tag to obtain a traversal result, and selecting the DOM tree corresponding to the list tag with the most child nodes from the traversal result as the second node DOM tree;
  • a matching module for matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result;
  • a generating module for generating the fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
  • an extraction module for generating a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and performing webpage feature data extraction on the new webpage to be extracted.
  • the data extraction method, device, electronic device and storage medium for webpages described in this application improve the accuracy of data extraction.
  • FIG. 1 is a flowchart of a method for extracting data from a webpage according to Embodiment 1 of the present application.
  • FIG. 2 is a structural diagram of an apparatus for extracting data from a webpage according to Embodiment 2 of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device provided in Embodiment 3 of the present application.
  • FIG. 1 is a flowchart of a method for extracting data from a webpage according to Embodiment 1 of the present application.
  • the method for extracting data from a webpage can be applied to an electronic device.
  • the webpage data extraction function provided by the method of the present application can be integrated directly on the electronic device, or run on the electronic device in the form of a Software Development Kit (SDK).
  • the method for extracting data from a webpage specifically includes the following steps. According to different requirements, the order of the steps in the flowchart can be changed, and some can be omitted.
  • S11 Acquire HTML code in the source code of the webpage to be extracted, and parse the HTML code into a first node DOM tree.
  • the link of the webpage to be extracted is received, the source code is downloaded according to the link, the JavaScript and CSS code is deleted from the source code, and the HTML code is retained. An HTML parser is then used to parse the HTML code of the webpage to be extracted into the first node DOM tree according to the tag hierarchy.
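Step S11 can be sketched with Python's standard `html.parser` module. This is only a minimal illustration, not the parser used in the application: `Node`, `TreeBuilder`, and `parse_html` are hypothetical names, the set of void tags is deliberately incomplete, and the page source is assumed to have been downloaded already as a string.

```python
from html.parser import HTMLParser


class Node:
    """A minimal DOM node: tag name, text, ordered children, parent link."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.text = ""
        self.children = []
        self.parent = parent


class TreeBuilder(HTMLParser):
    SKIPPED = {"script", "style"}          # JavaScript and CSS are dropped
    VOID = {"br", "img", "input", "hr", "meta", "link"}  # no closing tag

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node            # descend into the new element

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in self.VOID:
            return
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.text += data.strip()


def parse_html(source):
    """Parse HTML text into a tree of Node objects (the 'first node DOM tree')."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Dropping script and style content during parsing, rather than with a separate text-level pass, keeps the sketch short; a production crawler would more likely rely on a full HTML5 parser.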
  • the above-mentioned webpage to be extracted can also be stored in a node of a blockchain.
  • the unordered list tag refers to a UL tag.
  • the first node DOM tree is parsed to obtain the unordered list tags of the webpage to be extracted, wherein the first node DOM tree may contain multiple unordered list tags.
  • the list tag refers to a child tag at the next level below the unordered list tag.
  • each unordered list tag may contain multiple child tags; every list tag corresponding to each unordered list tag is traversed, and the DOM tree corresponding to the list tag with the most child nodes is selected from the traversal result as the second node DOM tree.
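The selection of the second node DOM tree can be sketched as follows. The sketch makes several assumptions that the text does not state: the markup is well formed enough for `xml.etree.ElementTree`, the list tags are `li` elements directly under each `ul`, and "most child nodes" is read as the largest number of descendant elements (counting only direct children would be an equally plausible reading). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_descendants(elem):
    """Number of elements below elem (iter() also yields elem itself)."""
    return sum(1 for _ in elem.iter()) - 1


def select_seed(ul):
    """Split the list items of one <ul> into (seed, unselected items)."""
    items = [child for child in ul if child.tag == "li"]
    if not items:
        return None, []
    seed = max(items, key=count_descendants)   # first item wins on ties
    others = [li for li in items if li is not seed]
    return seed, others


def seeds_for_page(root):
    """Return (ul, seed, unselected items) for every <ul> in the tree."""
    return [(ul,) + select_seed(ul) for ul in root.iter("ul")]
```

For a page with two list items, one holding `<a>` only and the other holding `<a>` and `<span>`, `seeds_for_page` picks the second item as the seed and returns the first as the only unselected item.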
  • the third node DOM tree of each unselected list tag is matched against the second node DOM tree of the list tag with the most child nodes, and the third node DOM tree is updated so that each resulting fourth node DOM tree is consistent with the second node DOM tree. This avoids data misalignment after the extracted webpage feature data is converted into a two-dimensional table, and improves the accuracy of data extraction.
  • matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result, includes:
  • the first label of the root node of the second node DOM tree is matched against the second label of the root node of the third node DOM tree of each unselected list tag; if the first label is consistent with the second label, the labels of the root nodes of the second node DOM tree and the third node DOM tree are determined to be the same, and it is then judged whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes.
  • a leaf node here means that the root node is a terminal node.
  • when the second node DOM tree is inconsistent with the third node DOM tree, the second node DOM tree and the third node DOM tree are traversed to find the inconsistent nodes, and the second node DOM tree or the third node DOM tree is updated according to the tags corresponding to the inconsistent nodes.
  • generating the fourth node DOM tree of each unselected list tag according to the matching result includes:
  • the fourth label is inserted immediately to the left of the right neighbor node to obtain a new third node DOM tree, and the new third node DOM tree is used as the fourth node DOM tree of each unselected list tag;
  • the fourth label is inserted between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and the new third node DOM tree is used as the fourth node DOM tree of each unselected list tag.
  • the second node DOM tree is updated according to the child nodes corresponding to the other unselected list tags to obtain a new second node DOM tree. Specifically, when the third label of any child node at the level below the root node of the second node DOM tree is inconsistent with the fourth labels of all child nodes at the same level of the third node DOM tree, it is determined that the second node DOM tree needs to be updated.
  • the update process of the second node DOM tree includes:
  • when the left neighbor node is recognized but the right neighbor node is not recognized, the third label is inserted immediately to the right of the left neighbor node to obtain a new second node DOM tree, and the new second node DOM tree is used as the DOM tree corresponding to the list tag with the most child nodes;
  • the third label is inserted immediately to the left of the right neighbor node to obtain a new second node DOM tree, and the new second node DOM tree is used as the DOM tree corresponding to the list tag with the most child nodes;
  • the third label is inserted between the left neighbor node and the right neighbor node to obtain a new second node DOM tree, and the new second node DOM tree is used as the DOM tree corresponding to the list tag with the most child nodes.
  • the second node DOM tree is updated to the new second node DOM tree, which avoids missing fields in the process of extracting webpage feature data and further improves the comprehensiveness of the data in each list tag.
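The neighbor-based insertion described for both the third node DOM tree and the second node DOM tree can be sketched in one helper. This is a simplified illustration under stated assumptions, not the procedure of the application: it compares only the direct children of the two roots, matches nodes by tag name alone, assumes each tag occurs at most once among siblings, and inserts an empty placeholder element. The name `insert_missing` is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def insert_missing(reference, target):
    """Return a copy of target with an empty placeholder for every direct
    child tag of reference that target lacks, positioned by its neighbors."""
    result = copy.deepcopy(target)          # leave the caller's tree untouched
    ref_tags = [child.tag for child in reference]
    for i, tag in enumerate(ref_tags):
        present = [child.tag for child in result]
        if tag in present:
            continue
        placeholder = ET.Element(tag)
        # Nearest reference sibling on each side that the target already has.
        left = next((t for t in reversed(ref_tags[:i]) if t in present), None)
        right = next((t for t in ref_tags[i + 1:] if t in present), None)
        if left is not None:
            # Left neighbor recognized: insert immediately to its right.
            # When the right neighbor is also recognized, this position
            # lies between the two neighbors.
            result.insert(present.index(left) + 1, placeholder)
        elif right is not None:
            # Only the right neighbor recognized: insert just before it.
            result.insert(present.index(right), placeholder)
        else:
            result.append(placeholder)
    return result
```

Calling `insert_missing(seed, item)` pads an unselected list item so that it follows the seed, while `insert_missing(item, seed)` extends the seed with tags that appear only in the item. Returning a copy rather than mutating the input keeps the original trees available for generating the per-list result afterwards.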
  • the method also includes:
  • the third node DOM tree is used as the fourth node DOM tree of each unselected list tag.
  • the method also includes:
  • when the root node of the second node DOM tree is not a leaf node but the root node of the third node DOM tree is a leaf node, traversing all the child nodes of the root node of the second node DOM tree;
  • the method also includes:
  • when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining that the third node DOM tree is the fourth node DOM tree of each unselected list tag.
  • the method also includes:
  • the third node DOM tree is used as the fourth node DOM tree of each unselected list label.
  • the fourth node DOM tree corresponding to each list tag can be quickly determined according to different judgment criteria, which improves the diversity of determining the fourth node DOM tree.
  • S15 Generate a fifth node DOM tree for each unordered list tag according to the second node DOM tree and all fourth node DOM trees.
  • each unordered list tag corresponds to one node DOM tree. The fifth node DOM tree of each unordered list tag is obtained by placing the second node DOM tree and all fourth node DOM trees at their corresponding positions within that unordered list tag, which ensures the consistency of the webpage data to be extracted before and after matching.
  • generating the fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees includes:
  • S16 Generate a new webpage to be extracted according to the DOM tree of the fifth node of all the unordered list tags, and perform webpage feature data extraction on the new webpage to be extracted.
  • the new webpage to be extracted is obtained by parsing the first DOM tree corresponding to the webpage to be extracted, and then matching the DOM tree of each list tag corresponding to each unordered list tag to obtain a new DOM tree.
  • the method also includes:
  • the extracted webpage feature data is converted into a two-dimensional table.
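The conversion to a two-dimensional table can be sketched as below. The sketch assumes that the list items of a `ul` have already been made consistent, so that the child at a given position means the same field in every item; the function name `list_to_table` and the choice to use the first item's tag names as column headers are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def list_to_table(ul):
    """Turn the <li> items of one <ul> into (header, rows).

    Each row holds the text of the item's direct children, in order;
    children without text yield an empty string, so columns stay aligned.
    """
    items = ul.findall("li")
    if not items:
        return [], []
    header = [child.tag for child in items[0]]
    rows = [[(child.text or "").strip() for child in li] for li in items]
    return header, rows
```

With items that were not aligned first, a missing child would shift every later value one column to the left, which is the misalignment the method is meant to avoid.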
  • the fourth node DOM tree of each unselected list tag is generated according to the matching result to ensure the consistency of the DOM trees corresponding to each unordered list tag; a new webpage to be extracted is therefore regenerated according to the new DOM tree, and webpage feature data extraction is performed on the new webpage to be extracted.
  • the method for extracting data from a webpage acquires the HTML code in the source code of the webpage to be extracted and parses the HTML code into a first node DOM tree; parses the first node DOM tree to obtain all unordered list tags; traverses all list tags corresponding to each unordered list tag to obtain the traversal result, and selects the DOM tree corresponding to the list tag with the most child nodes from the traversal result as the second node DOM tree; and matches the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, generating the fourth node DOM tree of each unselected list tag according to the matching result.
  • the DOM tree corresponding to the list tag with the most child nodes is selected as the second node DOM tree, that is, the seed node DOM tree, because the seed node DOM tree contains the most child nodes.
  • the third node DOM tree of each unselected list tag in the traversal result is matched with the second node DOM tree, and the fourth node DOM tree of each unselected list tag is generated according to the matching result to ensure the consistency of the DOM trees corresponding to each unordered list tag. A new webpage to be extracted is therefore regenerated according to the new DOM tree.
  • FIG. 2 is a structural diagram of an apparatus for extracting data from a webpage according to Embodiment 2 of the present application.
  • the data extraction apparatus 20 of the webpage may include a plurality of functional modules composed of program code segments.
  • the program codes of each program segment in the webpage data extraction apparatus 20 may be stored in the memory of the electronic device and executed by the at least one processor to perform (see description in FIG. 1 ) data extraction of the webpage.
  • the data extraction apparatus 20 of the webpage can be divided into a plurality of functional modules according to the functions performed by the data extraction apparatus 20 .
  • the functional modules may include: an acquisition module 201, a parsing module 202, a traversal module 203, a matching module 204, a determination module 205, a generation module 206 and an extraction module 207.
  • a module referred to in this application refers to a series of computer-readable instruction segments that can be executed by at least one processor and can perform fixed functions, and are stored in a memory. In this embodiment, the functions of each module will be described in detail in subsequent embodiments.
  • the acquisition module 201 is configured to acquire the HTML code in the source code of the webpage to be extracted, and parse the HTML code into a first node DOM tree.
  • a parsing module 202 configured to parse the DOM tree of the first node to obtain all unordered list tags.
  • the traversal module 203 is configured to traverse all the list tags corresponding to each unordered list tag to obtain a traversal result, and select the DOM tree corresponding to the list tag with the most child nodes from the traversal result as the second node DOM tree.
  • the matching module 204 is configured to match the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generate the fourth node DOM tree of each unselected list tag according to the matching result.
  • the determining module 205 is configured to, when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determine that the third node DOM tree is the fourth node DOM tree of each unselected list tag.
  • the generating module 206 is configured to generate a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees.
  • the extraction module 207 is configured to generate a new webpage to be extracted according to the DOM tree of the fifth node of all the unordered list tags, and to extract webpage feature data for the new webpage to be extracted.
  • the device for extracting data from a webpage selects the DOM tree corresponding to the list tag with the most child nodes as the second node DOM tree, that is, the seed node DOM tree, because the seed node DOM tree contains the most child nodes.
  • the third node DOM tree of each unselected list tag in the traversal result is matched with the second node DOM tree, and the fourth node DOM tree of each unselected list tag is generated according to the matching result to ensure the consistency of the DOM trees corresponding to each unordered list tag.
  • a new webpage to be extracted is regenerated according to the new DOM tree, and extracting webpage feature data from the new webpage does not produce missing fields. This avoids data misalignment after the extracted webpage feature data is converted into a two-dimensional table and improves the accuracy of data extraction. Finally, in the process of matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, the insertion position is determined precisely by identifying the left neighbor node and the right neighbor node of each inconsistent node, which guarantees the consistency of the fourth node DOM tree of each list tag.
  • the electronic device 3 includes a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
  • the structure of the electronic device shown in FIG. 3 does not constitute a limitation of the embodiments of the present application; it may be a bus-type structure or a star-shaped structure, and the electronic device 3 may also include more or fewer hardware or software components than shown, or a different arrangement of components.
  • the electronic device 3 is an electronic device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits, programmable gate arrays, digital processors and embedded devices.
  • the electronic device 3 may also include a client device, which includes but is not limited to any electronic product that can perform human-computer interaction with a client through a keyboard, a mouse, a remote control, a touchpad, or a voice-activated device, for example, personal computers, tablets, smartphones, digital cameras, etc.
  • the electronic device 3 is only an example, and other existing or future electronic products, if applicable to the present application, should also be included within the protection scope of the present application, and are incorporated herein by reference.
  • the memory 31 is used to store program codes and various data, such as the webpage data extraction apparatus 20 installed in the electronic device 3, and to complete program or data access at high speed and automatically during the operation of the electronic device 3.
  • the memory 31 includes non-volatile memory and volatile memory, such as read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), and compact disc read-only memory (CD-ROM).
  • the at least one processor 32 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same function or different functions, including one or a combination of multiple central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips.
  • the at least one processor 32 is the control core (Control Unit) of the electronic device 3. It uses various interfaces and lines to connect the components of the entire electronic device 3, and performs the various functions of the electronic device 3 and processes data by running or executing the programs or modules stored in the memory 31 and calling the data stored in the memory 31.
  • the at least one communication bus 33 is configured to enable connection communication between the memory 31 and the at least one processor 32 and the like.
  • the electronic device 3 may also include a power source (such as a battery) for supplying power to various components.
  • the power source may be logically connected to the at least one processor 32 through a power management device, so that the power management device implements functions such as managing charging, discharging, and power consumption.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 3 may further include a variety of sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the above-mentioned integrated units implemented in the form of software functional modules may be stored in a computer-readable storage medium.
  • the above-mentioned software function modules are stored in a storage medium, and include several instructions to enable a computer device (which may be a personal computer, an electronic device, or a network device, etc.) or a processor to execute part of the methods described in the various embodiments of the present application.
  • the at least one processor 32 can execute the operating system of the electronic device 3 and the various installed application programs (such as the webpage data extraction apparatus 20), program codes, etc., for example, the various modules described above.
  • Program codes are stored in the memory 31, and the at least one processor 32 can call the program codes stored in the memory 31 to perform related functions.
  • each module described in FIG. 2 is program code stored in the memory 31 and executed by the at least one processor 32, so as to realize the functions of the various modules and achieve the purpose of extracting data from web pages.
  • the memory 31 stores a plurality of computer-readable instructions, and the plurality of computer-readable instructions are executed by the at least one processor 32 to implement the function of data extraction from a web page.
  • the program code may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 31 and executed by the processor 32 to complete the present invention.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program in the electronic device 3 .
  • the program code may be divided into an acquisition module 201, a parsing module 202, a traversal module 203, a matching module 204, a determination module 205, a generation module 206 and an extraction module 207.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function, and the like; the storage data area may store data created through the use of the node, and the like.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain, essentially a decentralized database, is a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, and may be located in one place or distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present application relates to the technical field of terminals, and provides a webpage data extraction method and apparatus, an electronic device, and a storage medium. The method comprises: parsing the HTML code of a webpage to be extracted into a first node DOM tree to obtain unordered list tags; traversing the unordered list tags, and selecting the DOM tree with the most child nodes as a second node DOM tree; matching third node DOM trees of unselected list tags with the second node DOM tree to generate fourth node DOM trees; generating fifth node DOM trees according to the second node DOM tree and the fourth node DOM trees; and generating a new webpage to be extracted according to the fifth node DOM trees for webpage feature data extraction. According to the present application, after the DOM trees of unordered list tags are made consistent, a new webpage to be extracted is re-generated for webpage feature extraction, thereby improving the accuracy of data extraction. The present application also relates to blockchain technologies. Information of the webpage to be extracted is stored in blockchain nodes.

Description

Web page data extraction method, device, electronic device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on December 23, 2020, with application number 202011541079.0 and titled "Data Extraction Method, Device, Electronic Device and Storage Medium for Web Pages", the entire contents of which are incorporated in this application by reference.
Technical field
The present application relates to the technical field of terminals, and in particular, to a method, apparatus, electronic device and storage medium for extracting data from a webpage.
Background
With the rapid development of the Internet, the inventor found that public web pages contain a large amount of valuable information. To obtain this information, the prior art extracts it from web pages by writing network collection programs. However, when collecting list-type data, the data displayed by each list item is not comprehensive enough and different data fields are located at different positions, so the extracted data is prone to misalignment after it is converted into a two-dimensional table.
Summary of the invention
The present application provides a method, device, electronic device and storage medium for data extraction from a webpage.
A first aspect of the present application provides a method for extracting data from a web page, the method comprising:
acquiring the HTML code in the source code of the webpage to be extracted, and parsing the HTML code into a first node DOM tree;
parsing the first node DOM tree to obtain all unordered list tags;
traversing all the list tags corresponding to each unordered list tag to obtain a traversal result, and selecting the DOM tree corresponding to the list tag with the most child nodes from the traversal result as the second node DOM tree;
matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result;
generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
generating a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and performing webpage feature data extraction on the new webpage to be extracted.
A second aspect of the present application provides an electronic device, the electronic device comprising a memory and a processor, the memory being used to store at least one computer-readable instruction, and the processor being used to execute the at least one computer-readable instruction to implement the following steps:
acquiring the HTML code in the source code of the webpage to be extracted, and parsing the HTML code into a first node DOM tree;
parsing the first node DOM tree to obtain all unordered list tags;
traversing all the list tags corresponding to each unordered list tag to obtain a traversal result, and selecting the DOM tree corresponding to the list tag with the most child nodes from the traversal result as the second node DOM tree;
matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result;
generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
generating a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and performing webpage feature data extraction on the new webpage to be extracted.
A third aspect of the present application provides a computer-readable storage medium storing at least one computer-readable instruction which, when executed by a processor, implements the following steps:
obtaining the HTML code in the source code of a web page to be extracted, and parsing the HTML code into a first node DOM tree;
parsing the first node DOM tree to obtain all unordered list tags;
traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting, from the traversal result, the DOM tree corresponding to the list tag with the most child nodes as a second node DOM tree;
matching a third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generating a fourth node DOM tree for each unselected list tag according to the matching result;
generating a fifth node DOM tree for each unordered list tag according to the second node DOM tree and all fourth node DOM trees; and
generating a new web page to be extracted according to the fifth node DOM trees of all the unordered list tags, and extracting web page feature data from the new web page to be extracted.
A fourth aspect of the present application provides an apparatus for extracting data from a web page, the apparatus comprising:
an acquisition module, configured to obtain the HTML code in the source code of a web page to be extracted, and to parse the HTML code into a first node DOM tree;
a parsing module, configured to parse the first node DOM tree to obtain all unordered list tags;
a traversal module, configured to traverse all list tags corresponding to each unordered list tag to obtain a traversal result, and to select, from the traversal result, the DOM tree corresponding to the list tag with the most child nodes as a second node DOM tree;
a matching module, configured to match a third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and to generate a fourth node DOM tree for each unselected list tag according to the matching result;
a generation module, configured to generate a fifth node DOM tree for each unordered list tag according to the second node DOM tree and all fourth node DOM trees; and
an extraction module, configured to generate a new web page to be extracted according to the fifth node DOM trees of all the unordered list tags, and to extract web page feature data from the new web page to be extracted.
The method, apparatus, electronic device, and storage medium for extracting data from a web page described in the present application improve the accuracy of data extraction.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of the method for extracting data from a web page provided in Embodiment 1 of the present application.
FIG. 2 is a structural diagram of the apparatus for extracting data from a web page provided in Embodiment 2 of the present application.
FIG. 3 is a schematic structural diagram of the electronic device provided in Embodiment 3 of the present application.
DETAILED DESCRIPTION
In order that the above objects, features, and advantages of the present application may be understood more clearly, the present application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present application and the features in the embodiments may be combined with one another provided they do not conflict.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the technical field to which the present application belongs. The terms used in the specification of the present application are for the purpose of describing specific embodiments only and are not intended to limit the present application.
Embodiment 1
FIG. 1 is a flowchart of the method for extracting data from a web page provided in Embodiment 1 of the present application.
In this embodiment, the method for extracting data from a web page may be applied to an electronic device. For an electronic device that needs to extract data from web pages, the web page data extraction function provided by the method of the present application may be integrated directly on the electronic device, or may run on the electronic device in the form of a Software Development Kit (SDK).
As shown in FIG. 1, the method for extracting data from a web page specifically includes the following steps. Depending on requirements, the order of the steps in the flowchart may be changed, and some steps may be omitted.
S11: Obtain the HTML code in the source code of the web page to be extracted, and parse the HTML code into a first node DOM tree.
In this embodiment, a link to the web page to be extracted is received, and the source code is downloaded according to the link. The JavaScript and CSS code is deleted from the source code and the HTML code is retained. An HTML parser then parses the HTML code of the web page to be extracted into the first node DOM tree according to the hierarchical relationship of the tags.
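As a minimal sketch of step S11, the following uses Python's standard `html.parser` module to drop `<script>` and `<style>` content and build a tag tree. The `Node` and `DomBuilder` classes and the `parse_html` helper are hypothetical names introduced for illustration; the patent does not name a parser or a data structure, and downloading the page from its link is omitted here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Hypothetical DOM node: a tag name, a parent, ordered children, and text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a tag tree while ignoring JavaScript and CSS content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.text += data.strip()


def parse_html(html):
    """Return the root of the tag tree (the 'first node DOM tree' in the text)."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A production pipeline would more likely rely on a full HTML5 parser; the standard-library parser is used here only to keep the sketch self-contained.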
It should be emphasized that, to further ensure the privacy and security of the web page to be extracted, the web page to be extracted may also be stored in a node of a blockchain.
S12: Parse the first node DOM tree to obtain all unordered list tags.
In this embodiment, an unordered list tag is a UL tag. After the first node DOM tree is obtained, it is parsed to obtain the unordered list tags of the web page to be extracted; the first node DOM tree may contain multiple unordered list tags.
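Step S12 amounts to a tree walk that collects every `ul` node. The sketch below assumes a hypothetical `Node` class with `tag` and `children` attributes; the `find_all` helper is illustrative and not taken from the patent.

```python
class Node:
    """Hypothetical DOM node with a tag name and ordered children."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = list(children or [])


def find_all(node, tag):
    """Collect, in document order, every node in the subtree whose tag matches."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found
```

Nested lists are returned as separate entries, so each `ul` can be processed on its own in the later steps.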
S13: Traverse all list tags corresponding to each unordered list tag to obtain a traversal result, and select, from the traversal result, the DOM tree corresponding to the list tag with the most child nodes as the second node DOM tree.
In this embodiment, a list tag (li tag) is a child tag at the next level below an unordered list tag, and each unordered list tag may contain multiple child tags. Each list tag of each unordered list tag is traversed, and the DOM tree corresponding to the list tag with the most child nodes is selected from the traversal result as the second node DOM tree.
In this embodiment, the DOM tree corresponding to the list tag with the most child nodes is selected as the second node DOM tree, that is, the seed node DOM tree. Because the seed node DOM tree contains the most child nodes, the data in each list tag is kept comprehensive.
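The seed selection in S13 can be sketched as follows. The text does not say whether "child nodes" means direct children or all descendants; this sketch assumes all descendants, and the names `Node`, `count_descendants`, and `select_seed` are hypothetical.

```python
class Node:
    """Hypothetical DOM node with a tag name and ordered children."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = list(children or [])


def count_descendants(node):
    """Number of nodes below `node` (assumption: descendants, not only direct children)."""
    return sum(1 + count_descendants(child) for child in node.children)


def select_seed(ul):
    """Return (seed_li, other_lis) for one unordered list.

    The seed is the <li> with the most descendants; ties go to the first one.
    """
    items = [child for child in ul.children if child.tag == "li"]
    if not items:
        return None, []
    seed = max(items, key=count_descendants)
    others = [li for li in items if li is not seed]
    return seed, others
```

The `others` list corresponds to the unselected list tags whose trees are matched against the seed in S14.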
S14: Match the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generate a fourth node DOM tree for each unselected list tag according to the matching result.
In this embodiment, to keep the data in the DOM tree of each list tag comprehensive, the third node DOM tree of each unselected list tag is matched against the second node DOM tree of the list tag with the most child nodes, and the third node DOM tree is updated so that each fourth node DOM tree is consistent with the second node DOM tree. This avoids data misalignment after the extracted web page feature data is converted into a two-dimensional table, and improves the accuracy of data extraction.
Preferably, matching the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generating the fourth node DOM tree for each unselected list tag according to the matching result, includes:
matching a first tag of the root node of the second node DOM tree against a second tag of the root node of the third node DOM tree of each unselected list tag;
when the first tag is consistent with the second tag, determining whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;
when neither the root node of the second node DOM tree nor the root node of the third node DOM tree is a leaf node, matching third tags of all child nodes at the next level below the root node of the second node DOM tree against fourth tags of all child nodes at the same level of the third node DOM tree;
when the third tags of all child nodes at the next level below the root node of the second node DOM tree are consistent with the fourth tags of all child nodes at the same level of the third node DOM tree, repeating the above process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
In this embodiment, to ensure that each fourth node DOM tree is consistent with the second node DOM tree, the first tag of the root node of the second node DOM tree is first matched against the second tag of the root node of the third node DOM tree of each unselected list tag. If the first tag is consistent with the second tag, the root-node tags of the two trees are determined to be consistent, and it is then determined whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes. Here, a root node being a leaf node means that the root node is a terminal node.
When neither the root node of the second node DOM tree nor the root node of the third node DOM tree is a leaf node, it is further determined whether the third tags of all child nodes at the next level below the root node are consistent with the fourth tags of all child nodes at the same level of the third node DOM tree. If they are consistent, the second node DOM tree is determined to be consistent with the third node DOM tree at that level, and the above process is repeated until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
If they are inconsistent, the second node DOM tree and the third node DOM tree are determined to be inconsistent. Both trees are traversed to find the inconsistent nodes, and the second node DOM tree or the third node DOM tree is updated according to the tags of the inconsistent nodes.
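The level-by-level consistency check described above can be sketched as a recursive comparison of tag names. This is an illustrative reading of the text rather than the patent's own implementation; `Node` and `same_structure` are hypothetical names, and "consistent" is assumed to mean the same tags in the same order at every level.

```python
class Node:
    """Hypothetical DOM node with a tag name and ordered children."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = list(children or [])


def same_structure(seed, other):
    """True when both trees have identical tags, in the same order, at every level."""
    if seed.tag != other.tag:
        return False  # root tags differ
    if len(seed.children) != len(other.children):
        return False  # one side has a child the other lacks
    # Recurse until both sides reach leaf nodes.
    return all(same_structure(s, o) for s, o in zip(seed.children, other.children))
```

When the function returns `False`, the trees contain inconsistent nodes and one of them needs to be updated, as the following paragraphs describe.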
Specifically, when the third tags of all child nodes at the next level below the root node of the second node DOM tree are inconsistent with the fourth tag of any child node at the same level of the third node DOM tree, it is determined that the third node DOM tree needs to be updated.
Specifically, generating the fourth node DOM tree for each unselected list tag according to the matching result includes:
identifying the left neighbor node and the right neighbor node of the fourth tag;
when a left neighbor node is identified but no right neighbor node is identified, inserting the fourth tag at the far right of the left neighbor node to obtain a new third node DOM tree, and using the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when no left neighbor node is identified but a right neighbor node is identified, inserting the fourth tag at the far left of the right neighbor node to obtain a new third node DOM tree, and using the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when both a left neighbor node and a right neighbor node are identified, inserting the fourth tag between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and using the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
In this embodiment, while the third node DOM tree is being updated, the left neighbor node and the right neighbor node of each inconsistent node are identified. This pins down the exact insertion position and keeps the fourth node DOM tree of every list tag consistent.
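One way to picture the neighbor-based insertion is the sketch below, which pads a list item's tree with empty placeholder nodes until its tag structure matches the seed. It assumes that the item's children are an in-order subsequence of the seed's children; the patent does not spell out its alignment rule, so the greedy left-to-right scan and the names `Node`, `empty_copy`, and `pad_to_seed` are assumptions for illustration.

```python
class Node:
    """Hypothetical DOM node with a tag name, ordered children, and text."""

    def __init__(self, tag, children=None, text=""):
        self.tag = tag
        self.children = list(children or [])
        self.text = text


def empty_copy(node):
    """Copy the tag structure only; text stays empty to mark a missing field."""
    return Node(node.tag, [empty_copy(child) for child in node.children])


def pad_to_seed(seed, other):
    """Insert placeholders into `other` (in place) so it mirrors `seed`.

    Assumption: the children of `other` appear in the same order as in `seed`.
    """
    pos = 0
    for seed_child in seed.children:
        if pos < len(other.children) and other.children[pos].tag == seed_child.tag:
            # Tags agree at this position: descend one level.
            pad_to_seed(seed_child, other.children[pos])
        else:
            # Missing tag: it lands right of children[pos - 1] (left neighbor)
            # and left of the old children[pos] (right neighbor), if they exist.
            other.children.insert(pos, empty_copy(seed_child))
        pos += 1
    return other
```

Inserting at index `pos` covers all three cases in the text: with only a left neighbor the placeholder is appended at the right end, with only a right neighbor it goes at the left end, and with both it goes between them.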
In some other embodiments, while the child nodes of the list tags are being matched, the second node DOM tree is updated according to the child nodes of the other, unselected list tags to obtain a new second node DOM tree. Specifically, when the third tag of any child node at the next level below the root node of the second node DOM tree is inconsistent with the fourth tags of all child nodes at the same level of the third node DOM tree, it is determined that the second node DOM tree needs to be updated.
Specifically, the update process of the second node DOM tree includes:
identifying the left neighbor node and the right neighbor node of the third tag;
when a left neighbor node is identified but no right neighbor node is identified, inserting the third tag at the far right of the left neighbor node to obtain a new second node DOM tree, and using the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes; or
when no left neighbor node is identified but a right neighbor node is identified, inserting the third tag at the far left of the right neighbor node to obtain a new second node DOM tree, and using the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes; or
when both a left neighbor node and a right neighbor node are identified, inserting the third tag between the left neighbor node and the right neighbor node to obtain a new second node DOM tree, and using the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes.
In this embodiment, when S14 and S15 are performed, the second node DOM tree is updated to the new second node DOM tree, which avoids missing fields during web page feature data extraction and further improves the comprehensiveness of the data in each list tag.
Further, the method also includes:
when the root node of the second node DOM tree is a leaf node but the root node of the third node DOM tree is not a leaf node, using the third node DOM tree as the fourth node DOM tree of each unselected list tag.
Further, the method also includes:
when the root node of the second node DOM tree is not a leaf node but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree;
inserting the tags of all these child nodes at the corresponding positions of the third node DOM tree to obtain a new third node DOM tree, and using the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
Further, the method also includes:
when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining the third node DOM tree to be the fourth node DOM tree of each unselected list tag.
Further, the method also includes:
when the first tag is inconsistent with the second tag, using the third node DOM tree as the fourth node DOM tree of each unselected list tag.
In this embodiment, the fourth node DOM tree corresponding to each list tag can be determined quickly according to the different judgment criteria above, which increases the variety of ways in which the fourth node DOM tree can be determined.
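The root-level cases listed above can be summarized as a small decision function. This is a sketch of the case analysis as read from the text; the `Node` class, the `classify_case` name, and the returned labels are hypothetical.

```python
class Node:
    """Hypothetical DOM node with a tag name and ordered children."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = list(children or [])


def classify_case(seed, other):
    """Decide how the fourth node DOM tree is derived from `other`.

    Returns one of:
      "keep"               - use the third node DOM tree unchanged
      "copy_seed_children" - insert the seed's child tags into the third tree
      "match_children"     - compare the next level of both trees
    """
    if seed.tag != other.tag:
        return "keep"  # first tag and second tag are inconsistent
    seed_is_leaf = not seed.children
    other_is_leaf = not other.children
    if seed_is_leaf:
        return "keep"  # seed root is a leaf, whether or not the other root is
    if other_is_leaf:
        return "copy_seed_children"  # seed root has children, other root has none
    return "match_children"  # neither root is a leaf
```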
S15: Generate a fifth node DOM tree for each unordered list tag according to the second node DOM tree and all fourth node DOM trees.
In this embodiment, each unordered list corresponds to one node DOM tree. The second node DOM tree and all fourth node DOM trees are placed at their corresponding positions in each unordered list to obtain the fifth node DOM tree of each unordered list tag, which keeps the data of the web page to be extracted consistent before and after matching.
Optionally, generating the fifth node DOM tree for each unordered list tag according to the second node DOM tree and all fourth node DOM trees includes:
placing the second node DOM tree and all fourth node DOM trees at their corresponding positions in each unordered list tag to obtain the fifth node DOM tree of each unordered list tag.
In this embodiment, the list tag corresponding to the second node DOM tree and its position in each unordered list tag are identified; the list tag corresponding to each fourth node DOM tree and its position in each unordered list tag are identified; and the second node DOM tree and all fourth node DOM trees are then placed at their corresponding positions in each unordered list tag to obtain the fifth node DOM tree of each unordered list tag.
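Reassembling one list in S15 can be sketched as putting each tree back at the index its list tag originally had. The `rebuild_list` helper and its index-to-tree mapping are hypothetical; the patent only states that positions are identified and the trees placed accordingly.

```python
class Node:
    """Hypothetical DOM node with a tag name and ordered children."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = list(children or [])


def rebuild_list(seed_index, seed, aligned):
    """Build the fifth node DOM tree for one unordered list.

    seed_index: original position of the seed <li> inside the <ul>
    seed:       the second node DOM tree
    aligned:    dict mapping original <li> index -> fourth node DOM tree
    """
    trees = dict(aligned)
    trees[seed_index] = seed
    # Restore the original order of the list items.
    return Node("ul", [trees[i] for i in sorted(trees)])
```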
S16: Generate a new web page to be extracted according to the fifth node DOM trees of all the unordered list tags, and extract web page feature data from the new web page to be extracted.
In this embodiment, the new web page to be extracted is obtained by parsing the first node DOM tree of the web page to be extracted and then matching the DOM trees of the list tags under each unordered list tag to produce a new DOM tree.
Further, the method also includes:
after the web page feature data is extracted from the new web page to be extracted, converting the extracted web page feature data into a two-dimensional table.
In this embodiment, the third node DOM tree of each unselected list tag in the traversal result is matched against the second node DOM tree, and the fourth node DOM tree of each unselected list tag is generated according to the matching result, so the DOM trees under each unordered list tag are consistent. A new web page to be extracted is therefore regenerated from the new DOM tree, and extracting web page feature data from it does not produce missing fields. This avoids data misalignment after the extracted web page feature data is converted into a two-dimensional table, and improves the accuracy of data extraction.
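To illustrate why aligned trees avoid misalignment, the sketch below flattens each list item into one table row by reading its leaf texts in order. It assumes every item already shares the seed's structure, with empty text marking a field that was missing; `Node`, `leaf_texts`, and `to_table` are hypothetical names.

```python
class Node:
    """Hypothetical DOM node with a tag name, ordered children, and text."""

    def __init__(self, tag, children=None, text=""):
        self.tag = tag
        self.children = list(children or [])
        self.text = text


def leaf_texts(node):
    """Texts of the leaf nodes under `node`, left to right."""
    if not node.children:
        return [node.text]
    cells = []
    for child in node.children:
        cells.extend(leaf_texts(child))
    return cells


def to_table(items):
    """One row per list item; columns line up when all items share one structure."""
    return [leaf_texts(li) for li in items]
```

Because a missing field is represented by an empty placeholder rather than by an absent node, the remaining values stay in their own columns instead of shifting left.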
In summary, the method for extracting data from a web page described in this embodiment obtains the HTML code in the source code of the web page to be extracted and parses the HTML code into a first node DOM tree; parses the first node DOM tree to obtain all unordered list tags; traverses all list tags corresponding to each unordered list tag to obtain a traversal result, and selects from the traversal result the DOM tree corresponding to the list tag with the most child nodes as the second node DOM tree; matches the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generates the fourth node DOM tree of each unselected list tag according to the matching result; generates the fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees; and generates a new web page to be extracted according to the fifth node DOM trees of all the unordered list tags, and extracts web page feature data from the new web page to be extracted.
In this embodiment, first, the DOM tree corresponding to the list tag with the most child nodes is selected as the second node DOM tree, that is, the seed node DOM tree; because the seed node DOM tree contains the most child nodes, the data in each list tag is kept comprehensive. Second, the third node DOM tree of each unselected list tag in the traversal result is matched against the second node DOM tree, and the fourth node DOM tree of each unselected list tag is generated according to the matching result, so the DOM trees under each unordered list tag are consistent; a new web page to be extracted is therefore regenerated from the new DOM tree, and extracting web page feature data from it does not produce missing fields, which avoids data misalignment after the extracted web page feature data is converted into a two-dimensional table and improves the accuracy of data extraction. Finally, while the third node DOM tree of each unselected list tag is matched against the second node DOM tree, the left neighbor node and the right neighbor node of each inconsistent node are identified, which pins down the exact insertion position and keeps the fourth node DOM tree of every list tag consistent.
Embodiment 2
FIG. 2 is a structural diagram of the apparatus for extracting data from a web page provided in Embodiment 2 of the present application.
In some embodiments, the web page data extraction apparatus 20 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the web page data extraction apparatus 20 may be stored in the memory of an electronic device and executed by at least one processor to perform the data extraction from the web page (see the description of FIG. 1 for details).
In this embodiment, the web page data extraction apparatus 20 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an acquisition module 201, a parsing module 202, a traversal module 203, a matching module 204, a determination module 205, a generation module 206, and an extraction module 207. A module, as referred to in the present application, is a series of computer-readable instruction segments that can be executed by at least one processor, can perform a fixed function, and are stored in a memory. The functions of the modules are described in detail below.
获取模块201,用于获取待抽取网页的源代码中的HTML代码,并将所述HTML代码解析为第一节点DOM树。解析模块202,用于解析所述第一节点DOM树得到所有无序列表标签。The obtaining module 201 is configured to obtain the HTML code in the source code of the webpage to be extracted, and parse the HTML code into a first node DOM tree. A parsing module 202, configured to parse the DOM tree of the first node to obtain all unordered list tags.
遍历模块203,用于遍历每个无序列表标签对应的所有列表标签得到遍历结果,从所述遍历结果中选取子节点最多的列表标签对应的DOM树作为第二节点DOM树。The traversal module 203 is configured to traverse all the list tags corresponding to each unordered list tag to obtain a traversal result, and select the DOM tree corresponding to the list tag with the most child nodes from the traversal result as the second node DOM tree.
匹配模块204,用于将所述遍历结果中每个未被选取的列表标签的第三节点DOM树与所述第二节点DOM树进行匹配,根据匹配结果生成所述每个未被选取的列表标签的第四节点DOM树。The matching module 204 is configured to match the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generate the each unselected list according to the matching result The tag's fourth node in the DOM tree.
确定模块205,用于当所述第二节点DOM树的根节点为叶子节点及所述第三节点DOM树的根节点为叶子节点时,确定所述第三节点DOM树为所述每个未被选取的列表标签的第四节点DOM树。The determining module 205 is configured to, when the root node of the DOM tree of the second node is a leaf node and the root node of the DOM tree of the third node is a leaf node, determine that the DOM tree of the third node is the The fourth node DOM tree of the selected list tag.
生成模块206,用于根据所述第二节点DOM树和所有第四节点DOM树生成每个无序列表标签的第五节点DOM树。The generating module 206 is configured to generate a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees.
抽取模块207,用于根据所述所有无序列表标签的第五节点DOM树生成新的待抽取网 页,并对所述新的待抽取网页进行网页特征数据抽取。The extraction module 207 is configured to generate a new webpage to be extracted according to the DOM tree of the fifth node of all the unordered list tags, and to extract webpage feature data for the new webpage to be extracted.
With the webpage data extraction apparatus of this embodiment, first, the DOM tree corresponding to the list tag with the most child nodes is selected as the second node DOM tree, that is, the seed node DOM tree; because the seed node DOM tree contains the most child nodes, the data in each list tag is covered comprehensively. Second, the third node DOM tree of each unselected list tag in the traversal result is matched against the second node DOM tree, and the fourth node DOM tree of each unselected list tag is generated according to the matching result, which keeps the DOM trees corresponding to each unordered list tag consistent. A new webpage to be extracted is therefore regenerated from the new DOM trees, and extracting webpage feature data from that new webpage does not produce missing fields; this avoids data misalignment when the extracted webpage feature data is converted into a two-dimensional table and improves the accuracy of data extraction. Finally, while the third node DOM tree of each unselected list tag in the traversal result is matched against the second node DOM tree, the left neighbor node and right neighbor node of each inconsistent node are identified, which pins down the exact insertion position and ensures the consistency of the fourth node DOM tree of each list tag.
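The matching step summarized above can be illustrated with a short Python sketch. It is one possible interpretation, offered under assumptions rather than as the method of the application: `Node`, `empty_copy`, `align`, and `leaf_texts` are hypothetical names, child tags are matched greedily from left to right, and a tag that the seed has but the list item lacks is filled with an empty placeholder placed after its left neighbor and before its right neighbor.

```python
class Node:
    """Minimal tree node (hypothetical helper): a tag, child nodes, and text."""

    def __init__(self, tag, children=None, text=""):
        self.tag = tag
        self.children = children or []
        self.text = text


def empty_copy(node):
    """Copy the tag structure of a seed subtree, leaving all text empty."""
    return Node(node.tag, [empty_copy(child) for child in node.children], "")


def align(seed, item):
    """Pad `item` (third node DOM tree) so its tag layout follows `seed`.

    Returns the padded tree, playing the role of the fourth node DOM tree.
    """
    if seed.tag != item.tag or not seed.children:
        # Root tags differ, or the seed root is a leaf: keep the item unchanged.
        return item
    aligned = []
    pos = 0  # index of the next unmatched child of `item`
    for seed_child in seed.children:
        match = next(
            (i for i in range(pos, len(item.children))
             if item.children[i].tag == seed_child.tag),
            None,
        )
        if match is None:
            # Missing field: the placeholder lands after the last matched node
            # (left neighbor) and before the next matched node (right neighbor).
            aligned.append(empty_copy(seed_child))
        else:
            aligned.extend(item.children[pos:match])  # keep item-only nodes
            aligned.append(align(seed_child, item.children[match]))
            pos = match + 1
    aligned.extend(item.children[pos:])
    return Node(item.tag, aligned, item.text)


def leaf_texts(node):
    """Flatten a tree into one table row of leaf texts."""
    if not node.children:
        return [node.text]
    return [text for child in node.children for text in leaf_texts(child)]


seed = Node("li", [Node("span", text="Alice"), Node("em", text="VIP"), Node("a", text="link")])
item = Node("li", [Node("span", text="Bob"), Node("a", text="link2")])
print(leaf_texts(seed))               # ['Alice', 'VIP', 'link']
print(leaf_texts(align(seed, item)))  # ['Bob', '', 'link2']
```

Because the missing `em` field becomes an empty cell rather than disappearing, both rows have three columns and the `a` values stay in the same column when the rows are written to a two-dimensional table.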
Embodiment 3
FIG. 3 is a schematic structural diagram of an electronic device according to Embodiment 3 of the present application. In a preferred embodiment of the present application, the electronic device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
Those skilled in the art should understand that the structure of the electronic device shown in FIG. 3 does not limit the embodiments of the present application; it may be a bus-type structure or a star-shaped structure, and the electronic device 3 may also include more or fewer hardware or software components than shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is an electronic device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits, programmable gate arrays, digital processors, and embedded devices. The electronic device 3 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote control, a touchpad, a voice-controlled device, or the like, for example, a personal computer, a tablet computer, a smartphone, or a digital camera.
It should be noted that the electronic device 3 is only an example; other existing or future electronic products, if applicable to the present application, should also fall within the protection scope of the present application and are incorporated herein by reference.
In some embodiments, the memory 31 is used to store program code and various data, such as the webpage data extraction apparatus 20 installed in the electronic device 3, and provides high-speed, automatic access to programs and data while the electronic device 3 is running. The memory 31 includes non-volatile memory and volatile memory, such as Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
In some embodiments, the at least one processor 32 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The at least one processor 32 is the control core (Control Unit) of the electronic device 3. It connects the components of the entire electronic device 3 through various interfaces and lines, and performs the functions of the electronic device 3 and processes data by running or executing the programs or modules stored in the memory 31 and calling the data stored in the memory 31.
In some embodiments, the at least one communication bus 33 is configured to enable connection and communication between the memory 31, the at least one processor 32, and other components.
Although not shown, the electronic device 3 may also include a power source (such as a battery) that supplies power to the components. Optionally, the power source may be logically connected to the at least one processor 32 through a power management device, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management device. The power source may also include one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other such components. The electronic device 3 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described further here.
It should be understood that the embodiments are for illustration only, and the scope of the patent application is not limited by this structure.
The above integrated units, when implemented in the form of software functional modules, may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions that cause a computer device (which may be a personal computer, an electronic device, a network device, or the like) or a processor to execute part of the methods described in the embodiments of the present application.
In a further embodiment, with reference to FIG. 2, the at least one processor 32 can execute the operating system of the electronic device 3 as well as the installed application programs (such as the webpage data extraction apparatus 20), program code, and the like, for example, the modules described above.
Program code is stored in the memory 31, and the at least one processor 32 can call the program code stored in the memory 31 to perform the related functions. For example, the modules shown in FIG. 2 are program code stored in the memory 31 and executed by the at least one processor 32, so that the functions of the modules are realized and webpage data extraction is achieved.
In one embodiment of the present application, the memory 31 stores a plurality of computer-readable instructions, and the plurality of computer-readable instructions are executed by the at least one processor 32 to implement the webpage data extraction function.
Exemplarily, the program code may be divided into one or more modules/units, which are stored in the memory 31 and executed by the processor 32 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments describe the execution process of the computer program in the electronic device 3. For example, the program code may be divided into the acquisition module 201, the parsing module 202, the traversal module 203, the matching module 204, the determination module 205, the generation module 206, and the extraction module 207.
Specifically, for how the at least one processor 32 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
Further, the computer-readable storage medium may be non-volatile or volatile.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created through the use of blockchain nodes, and the like.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be apparent to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that the present application can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in all respects as illustrative and not restrictive. The scope of the application is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced in this application. No reference sign in the claims shall be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units, and the singular does not exclude the plural. A plurality of units or devices stated in this application may also be implemented by a single unit or device through software or hardware. Terms such as "first" and "second" are used as names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present application may be modified or equivalently replaced without departing from the spirit and scope of those technical solutions.

Claims (20)

  1. A webpage data extraction method, wherein the method comprises:
    obtaining the HTML code in the source code of a webpage to be extracted, and parsing the HTML code into a first node DOM tree;
    parsing the first node DOM tree to obtain all unordered list tags;
    traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting, from the traversal result, the DOM tree corresponding to the list tag with the most child nodes as a second node DOM tree;
    matching the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result;
    generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
    generating a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and extracting webpage feature data from the new webpage to be extracted.
  2. The webpage data extraction method according to claim 1, wherein the matching the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result comprises:
    matching a first tag of the root node of the second node DOM tree against a second tag of the root node of the third node DOM tree of each unselected list tag;
    when the first tag is consistent with the second tag, determining whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;
    when the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, matching third tags of all child nodes at the next level below the root node of the second node DOM tree against fourth tags of all child nodes at the same level of the third node DOM tree;
    when the third tags of all child nodes at the next level below the root node of the second node DOM tree are consistent with the fourth tags of all child nodes at the same level of the third node DOM tree, repeating the above process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
  3. The webpage data extraction method according to claim 1, wherein the generating the fourth node DOM tree of each unselected list tag according to the matching result comprises:
    when the third tags of all child nodes at the next level below the root node of the second node DOM tree are inconsistent with the fourth tag of any child node at the same level of the third node DOM tree, identifying a left neighbor node and a right neighbor node of the fourth tag;
    when a left neighbor node is identified but no right neighbor node is identified, inserting the fourth tag at the rightmost position of the left neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
    when no left neighbor node is identified but a right neighbor node is identified, inserting the fourth tag at the leftmost position of the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
    when both a left neighbor node and a right neighbor node are identified, inserting the fourth tag between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
  4. The webpage data extraction method according to claim 2, wherein the method further comprises:
    when the root node of the second node DOM tree is a leaf node but the root node of the third node DOM tree is not a leaf node, taking the third node DOM tree as the fourth node DOM tree of each unselected list tag.
  5. The webpage data extraction method according to claim 2, wherein the method further comprises:
    when the root node of the second node DOM tree is not a leaf node but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree;
    inserting the tags corresponding to all the child nodes at the corresponding positions of the third node DOM tree to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
  6. The webpage data extraction method according to claim 2, wherein the method further comprises:
    when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining that the third node DOM tree is the fourth node DOM tree of each unselected list tag.
  7. The webpage data extraction method according to any one of claims 1 to 6, wherein the generating the fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees comprises:
    placing the second node DOM tree and all fourth node DOM trees at their corresponding positions in each unordered list tag to obtain the fifth node DOM tree of each unordered list tag.
  8. An electronic device, wherein the electronic device comprises a memory and a processor, the memory is configured to store at least one computer-readable instruction, and the processor is configured to execute the at least one computer-readable instruction to implement the following steps:
    obtaining the HTML code in the source code of a webpage to be extracted, and parsing the HTML code into a first node DOM tree;
    parsing the first node DOM tree to obtain all unordered list tags;
    traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting, from the traversal result, the DOM tree corresponding to the list tag with the most child nodes as a second node DOM tree;
    matching the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result;
    generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
    generating a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and extracting webpage feature data from the new webpage to be extracted.
  9. The electronic device according to claim 8, wherein, when the processor executes the at least one computer-readable instruction to implement the matching the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result, the steps specifically comprise:
    matching a first tag of the root node of the second node DOM tree against a second tag of the root node of the third node DOM tree of each unselected list tag;
    when the first tag is consistent with the second tag, determining whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;
    when the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, matching third tags of all child nodes at the next level below the root node of the second node DOM tree against fourth tags of all child nodes at the same level of the third node DOM tree;
    when the third tags of all child nodes at the next level below the root node of the second node DOM tree are consistent with the fourth tags of all child nodes at the same level of the third node DOM tree, repeating the above process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
  10. The electronic device according to claim 8, wherein, when the processor executes the at least one computer-readable instruction to implement the generating the fourth node DOM tree of each unselected list tag according to the matching result, the steps specifically comprise:
    when the third tags of all child nodes at the next level below the root node of the second node DOM tree are inconsistent with the fourth tag of any child node at the same level of the third node DOM tree, identifying a left neighbor node and a right neighbor node of the fourth tag;
    when a left neighbor node is identified but no right neighbor node is identified, inserting the fourth tag at the rightmost position of the left neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
    when no left neighbor node is identified but a right neighbor node is identified, inserting the fourth tag at the leftmost position of the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
    when both a left neighbor node and a right neighbor node are identified, inserting the fourth tag between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
  11. The electronic device according to claim 9, wherein the processor executes the at least one computer-readable instruction to further implement the following step:
    when the root node of the second node DOM tree is a leaf node but the root node of the third node DOM tree is not a leaf node, taking the third node DOM tree as the fourth node DOM tree of each unselected list tag.
  12. The electronic device according to claim 9, wherein the processor executes the at least one computer-readable instruction to further implement the following steps:
    when the root node of the second node DOM tree is not a leaf node but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree;
    inserting the tags corresponding to all the child nodes at the corresponding positions of the third node DOM tree to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
  13. The electronic device according to claim 9, wherein the processor executes the at least one computer-readable instruction to further implement the following step:
    when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining that the third node DOM tree is the fourth node DOM tree of each unselected list tag.
  14. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction, and the at least one computer-readable instruction, when executed by a processor, implements the following steps:
    obtaining the HTML code in the source code of a webpage to be extracted, and parsing the HTML code into a first node DOM tree;
    parsing the first node DOM tree to obtain all unordered list tags;
    traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting, from the traversal result, the DOM tree corresponding to the list tag with the most child nodes as a second node DOM tree;
    matching the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result;
    generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
    generating a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and extracting webpage feature data from the new webpage to be extracted.
  15. The storage medium of claim 14, wherein, when the at least one computer-readable instruction is executed by the processor to implement the matching of the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree and the generating of the fourth node DOM tree of each unselected list tag according to the matching result, the steps specifically comprise:
    matching a first label of the root node of the second node DOM tree against a second label of the root node of the third node DOM tree of each unselected list tag;
    when the first label is consistent with the second label, determining whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;
    when neither the root node of the second node DOM tree nor the root node of the third node DOM tree is a leaf node, matching third labels of all child nodes at the next level below the root node of the second node DOM tree against fourth labels of all child nodes at the same level of the third node DOM tree;
    when the third labels of all child nodes at the next level below the root node of the second node DOM tree are consistent with the fourth labels of all child nodes at the same level of the third node DOM tree, repeating the above process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
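The level-by-level label matching of claim 15 can be sketched as a recursive comparison. This is an illustrative reading only: `Node` and `same_structure` are hypothetical names, "consistent" is taken to mean that the child labels are equal in the same order, and the mixed leaf cases simply return `False` here because the claims treat them separately.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def same_structure(template, other):
    """Compare two trees label by label, one level at a time."""
    # Root labels (the "first" and "second" labels) must agree.
    if template.tag != other.tag:
        return False
    # Both roots are leaves: nothing further to compare.
    if not template.children and not other.children:
        return True
    # Exactly one root is a leaf: not handled by this comparison.
    if not template.children or not other.children:
        return False
    # Child labels (the "third" and "fourth" labels) at this level must agree.
    if [c.tag for c in template.children] != [c.tag for c in other.children]:
        return False
    # Repeat the process one level down until leaf nodes are reached.
    return all(
        same_structure(t, o) for t, o in zip(template.children, other.children)
    )


template = Node("li", [Node("a"), Node("span", [Node("b")])])
same = Node("li", [Node("a"), Node("span", [Node("b")])])
shorter = Node("li", [Node("a")])
print(same_structure(template, same), same_structure(template, shorter))
```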
  16. The storage medium of claim 14, wherein, when the at least one computer-readable instruction is executed by the processor to implement the generating of the fourth node DOM tree of each unselected list tag according to the matching result, the steps specifically comprise:
    when the third labels of all child nodes at the next level below the root node of the second node DOM tree are inconsistent with the fourth label of any child node at the same level of the third node DOM tree, identifying a left neighbor node and a right neighbor node of the fourth label;
    when a left neighbor node is identified but no right neighbor node is identified, inserting the fourth label at the rightmost side of the left neighbor node to obtain a new third node DOM tree, and using the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
    when no left neighbor node is identified but a right neighbor node is identified, inserting the fourth label at the leftmost side of the right neighbor node to obtain a new third node DOM tree, and using the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
    when both a left neighbor node and a right neighbor node are identified, inserting the fourth label between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and using the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
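The neighbor-based insertion of claim 16 can be sketched on a single level of sibling labels. This is one possible reading, not the claimed method itself: the function name `fill_missing_tags` is hypothetical, neighbors are looked up in the template (second node DOM tree) level, sibling labels are assumed to be unique within a level, and the fallback for the case where no neighbor is present is an added assumption the claim does not address.

```python
def fill_missing_tags(template_tags, other_tags):
    """Return a copy of other_tags with labels missing from it inserted next to
    the neighbors they have in template_tags."""
    result = list(other_tags)
    for i, tag in enumerate(template_tags):
        if tag in result:
            continue
        # Neighbors of the missing label at the template level, if any.
        left = template_tags[i - 1] if i > 0 else None
        right = template_tags[i + 1] if i + 1 < len(template_tags) else None
        if left in result:
            # Left neighbor found: insert immediately to its right. When the
            # right neighbor is also present and follows it, the label lands
            # between the two neighbors.
            result.insert(result.index(left) + 1, tag)
        elif right in result:
            # Only a right neighbor found: insert immediately to its left.
            result.insert(result.index(right), tag)
        else:
            # Assumption (not in the claim): no neighbor present, put it first.
            result.insert(0, tag)
    return result


print(fill_missing_tags(["a", "span", "em"], ["a", "em"]))
print(fill_missing_tags(["img", "a"], ["a"]))
```

Working on lists of labels keeps the example short; applying the same idea to real nodes would insert an empty placeholder element rather than a string.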
  17. The storage medium of claim 15, wherein the at least one computer-readable instruction, when executed by the processor, further implements the following step:
    when the root node of the second node DOM tree is a leaf node but the root node of the third node DOM tree is not a leaf node, using the third node DOM tree as the fourth node DOM tree of each unselected list tag.
  18. The storage medium of claim 15, wherein the at least one computer-readable instruction, when executed by the processor, further implements the following steps:
    when the root node of the second node DOM tree is not a leaf node but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree;
    inserting the labels corresponding to all the child nodes into the corresponding positions of the third node DOM tree to obtain a new third node DOM tree, and using the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
  19. The storage medium of claim 15, wherein the at least one computer-readable instruction, when executed by the processor, further implements the following step:
    when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining that the third node DOM tree is the fourth node DOM tree of each unselected list tag.
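The three leaf-node cases of claims 17 to 19 can be summarized in a short sketch. The names `Node` and `resolve_leaf_cases` are hypothetical, and for claim 18 the sketch assumes that only the labels of the template root's direct children are copied, as empty placeholder nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def resolve_leaf_cases(template, other):
    """Return the 'fourth node DOM tree' when at least one root is a leaf,
    or None when neither root is a leaf (the child levels must then be matched)."""
    template_is_leaf = not template.children
    other_is_leaf = not other.children
    if template_is_leaf:
        # Claims 17 and 19: the template adds nothing, keep the other tree as is.
        return other
    if other_is_leaf:
        # Claim 18: copy the labels of the template root's children into the
        # other tree as empty placeholders (assumption: direct children only).
        return Node(other.tag, [Node(child.tag) for child in template.children])
    return None


template = Node("li", [Node("a"), Node("span")])
print(resolve_leaf_cases(template, Node("li")))
```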
  20. A webpage data extraction apparatus, wherein the apparatus comprises:
    an acquisition module, configured to obtain the HTML code in the source code of a webpage to be extracted, and to parse the HTML code into a first node DOM tree;
    a parsing module, configured to parse the first node DOM tree to obtain all unordered list tags;
    a traversal module, configured to traverse all list tags corresponding to each unordered list tag to obtain a traversal result, and to select, from the traversal result, the DOM tree corresponding to the list tag with the most child nodes as a second node DOM tree;
    a matching module, configured to match the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and to generate a fourth node DOM tree of each unselected list tag according to the matching result;
    a generation module, configured to generate a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
    an extraction module, configured to generate a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and to perform webpage feature data extraction on the new webpage to be extracted.
PCT/CN2021/125865 2020-12-23 2021-10-22 Webpage data extraction method and apparatus, electronic device, and storage medium WO2022134820A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011541079.0 2020-12-23
CN202011541079.0A CN112667874A (en) 2020-12-23 2020-12-23 Webpage data extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022134820A1 (en) 2022-06-30

Family

ID=75409158

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/125865 WO2022134820A1 (en) 2020-12-23 2021-10-22 Webpage data extraction method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112667874A (en)
WO (1) WO2022134820A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667874A (en) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Webpage data extraction method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184189A (en) * 2011-04-18 2011-09-14 北京理工大学 Webpage core block determining method based on DOM (Document Object Model) node text density
CN106372232A (en) * 2016-09-09 2017-02-01 北京百度网讯科技有限公司 Method and device for mining information based on artificial intelligence
CN109582886A (en) * 2018-11-02 2019-04-05 北京字节跳动网络技术有限公司 Content of pages extracting method, the generation method of template and device, medium and equipment
CN109726376A (en) * 2018-12-21 2019-05-07 上海众源网络有限公司 A kind of generation method of standard form, device and electronic equipment
CN112667874A (en) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Webpage data extraction method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727461B (en) * 2008-10-13 2012-11-21 中国科学院计算技术研究所 Method for extracting content of web page
CN107943929B (en) * 2017-11-22 2021-09-28 福州大学 Wrapper automatic generation method based on DOM tree abstraction

Also Published As

Publication number Publication date
CN112667874A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN111813963B (en) Knowledge graph construction method and device, electronic equipment and storage medium
US9298680B2 (en) Display of hypertext documents grouped according to their affinity
CN101464905B (en) Web page information extraction system and method
WO2022048211A1 (en) Document directory generation method and apparatus, electronic device and readable storage medium
CN104657402B (en) Method and system for linguistic labelses management
US20160306627A1 (en) Determining errors and warnings corresponding to a source code revision
WO2022048210A1 (en) Named entity recognition method and apparatus, and electronic device and readable storage medium
US11907181B2 (en) Inferring a dataset schema from input files
CN112328677B (en) Lost data recovery method, device, equipment and medium based on table association
CN106662986A (en) Optimized browser rendering process
JP2014199569A (en) Source program analysis system, source program analysis method, and program
CN111931471B (en) Form collection method, form collection device, electronic equipment and storage medium
CN111796809A (en) Interface document generation method and device, electronic equipment and medium
CN103345532A (en) Method and device for extracting webpage information
CN115048111B (en) Code generation method, device, equipment and medium based on metadata
WO2022134820A1 (en) Webpage data extraction method and apparatus, electronic device, and storage medium
CN113139033B (en) Text processing method, device, equipment and storage medium
CN113886204A (en) User behavior data collection method and device, electronic equipment and readable storage medium
CN113139145B (en) Page generation method and device, electronic equipment and readable storage medium
CN113687827B (en) Data list generation method, device and equipment based on widget and storage medium
CN114968725A (en) Task dependency relationship correction method and device, computer equipment and storage medium
CN115114297A (en) Data lightweight storage and search method and device, electronic equipment and storage medium
CN114385167A (en) Front-end page generation method, device, equipment and medium
CN104252355B (en) The method and apparatus of different information between a kind of acquisition Net procedure sets
CN114064033A (en) Front-end component development method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908821

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 311023)

122 Ep: pct application non-entry in european phase

Ref document number: 21908821

Country of ref document: EP

Kind code of ref document: A1