CN112559929B - Method, electronic device and medium for extracting webpage target information - Google Patents

Info

Publication number
CN112559929B
Authority
CN
China
Prior art keywords
node
analyzed
content
nodes
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110207419.4A
Other languages
Chinese (zh)
Other versions
CN112559929A (en)
Inventor
张景龙
王殿胜
张乃钊
薄满辉
翟性国
唐红武
卞磊
刘宇
姚远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Travelsky Mobile Technology Co Ltd
Original Assignee
China Travelsky Mobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Travelsky Mobile Technology Co Ltd filed Critical China Travelsky Mobile Technology Co Ltd
Priority to CN202110207419.4A priority Critical patent/CN112559929B/en
Publication of CN112559929A publication Critical patent/CN112559929A/en
Application granted granted Critical
Publication of CN112559929B publication Critical patent/CN112559929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method, an electronic device and a medium for extracting webpage target information. The method comprises: step S1, acquiring the HTML code of the webpage to be extracted and constructing a tree structure; step S2, traversing the tree structure, obtaining title node text data, and obtaining the characteristic information of each content node; step S3, grouping all content nodes based on their path information; step S4, determining a target group from the groups according to the title node text data and the characteristic information of the content nodes in each group; step S5, taking the content nodes of the target group as the node to be analyzed and determining whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, promoting the parent node of the node to be analyzed, together with the group nodes connected to that parent node, to be the node to be analyzed, and repeating until the target information is acquired. The invention improves the accuracy and efficiency of extracting webpage target information.

Description

Method, electronic device and medium for extracting webpage target information
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an electronic device, and a medium for extracting target information of a web page.
Background
A large amount of webpage data is generated on the internet every day. When this data is analyzed, target information such as the title, the body text and the date needs to be extracted. Most existing webpage text is presented in HTML; when the text is collected by a web crawler, part of it is presented in a serialized (JSON) structure. Existing body-text extraction methods analyze the text density of each page block and treat the block with the highest text density as the body text. Their recognition rate is low, however, because web pages often contain a large amount of useless content or are missing part of the body text. For example, some media platforms support a style editor, which makes the page structure more complicated, and noise such as recommended links and promotional content lowers the text density, which easily causes extraction errors and low accuracy. In addition, existing methods traverse the entire webpage source code to extract the target information, so extraction efficiency is low. How to improve the accuracy and efficiency of extracting webpage target information has therefore become an urgent technical problem.
Disclosure of Invention
The invention aims to provide a method, an electronic device and a medium for extracting webpage target information that improve the accuracy and efficiency of the extraction.
According to a first aspect of the present invention, there is provided a method for extracting target information of a web page, including:
step S1, acquiring the HTML code of the webpage to be extracted, and constructing a corresponding tree structure based on the HTML code;
step S2, traversing the tree structure, obtaining title node text data according to the title information of the head part of the tree structure, and obtaining the characteristic information of each content node from the tree structure, wherein the content node characteristic information comprises path information, content node text data and text density, and the content nodes are other nodes except the title nodes in the tree structure;
step S3, grouping all content nodes based on the path information of all content nodes;
step S4, determining a target group from the groups according to the title node text data and the characteristic information of the content nodes in each group;
step S5, taking the content nodes of the target group as the node to be analyzed, and determining whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, promoting the parent node of the node to be analyzed, together with the group nodes connected to that parent node, to be the node to be analyzed, and repeating until the target information is acquired.
According to a second aspect of the present invention, there is provided an electronic apparatus comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being arranged to perform the method of the first aspect of the invention.
According to a third aspect of the invention, there is provided a computer-readable storage medium storing computer instructions for performing the method of the first aspect of the invention.
Compared with the prior art, the invention has clear advantages and beneficial effects. With the above technical solution, the method, electronic device and medium for extracting webpage target information achieve considerable technical progress and practicability, are of broad industrial value, and have at least the following advantage:
the invention constructs the tree structure based on the HTML code of the webpage to be extracted, groups the content nodes of the tree structure, determines the optimal group from the groups, and acquires the target information based on the optimal group, thereby improving the accuracy and efficiency of extracting webpage target information.
The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a flowchart of a method for extracting target information of a web page according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description will be given to specific embodiments and effects of a method, an electronic device and a medium for extracting target information of a web page according to the present invention, with reference to the accompanying drawings and preferred embodiments.
The embodiment of the invention provides a method for extracting webpage target information, which comprises the following steps of:
step S1, acquiring hypertext markup language (HTML) codes of the webpage to be extracted, and constructing a corresponding tree structure based on the HTML codes;
step S2, traversing the tree structure, obtaining title node text data according to the title information of the head part of the tree structure, and obtaining the characteristic information of each content node from the tree structure, wherein the content node characteristic information comprises path information, content node text data and text density, and the content nodes are other nodes except the title nodes in the tree structure;
the title node corresponds to a head part of the tree structure, and the content node corresponds to a body part of the tree structure.
Step S3, grouping all content nodes based on the path information of all content nodes;
step S4, determining a target group from the groups according to the title node text data and the characteristic information of the content nodes in each group;
wherein the target group is the group predicted to be most likely to contain the target information, i.e. the optimal group.
Step S5, taking the content nodes of the target group as the node to be analyzed, and determining whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, promoting the parent node of the node to be analyzed, together with the group nodes connected to that parent node, to be the node to be analyzed, and repeating until the target information is acquired.
Specifically, the iterchildren() method in the lxml library may be used to iterate over child nodes when performing the promotion operation. The target information may specifically include information such as the title, the body text, the date, the number of likes, the number of followers, the number of comments, and the like.
According to the embodiment of the invention, the tree structure is constructed based on the HTML code of the webpage to be extracted, the content nodes of the tree structure are grouped, the optimal group is determined from the groups, the target information is obtained based on the optimal group, and the accuracy and the efficiency of extracting the target information of the webpage are improved.
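Steps S1 and S2 can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the patented implementation: it uses Python's standard `html.parser` instead of lxml, the names `TreeNode`, `TreeBuilder` and `extract_features` are hypothetical, the path is a simplified positional XPath, and text density is taken to be the count of non-whitespace characters directly inside an element.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeNode:
    """Hypothetical tree node: tag, parent, children and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        self.index = 1  # 1-based position among same-tag siblings

    def path(self):
        """Positional path such as /html[1]/body[1]/div[1]/p[2]."""
        parts = []
        node = self
        while node is not None and node.tag != "#root":
            parts.append(f"{node.tag}[{node.index}]")
            node = node.parent
        return "/" + "/".join(reversed(parts))


class TreeBuilder(HTMLParser):
    """Step S1: build a tree structure from the HTML code."""

    def __init__(self):
        super().__init__()
        self.root = TreeNode("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TreeNode(tag, self.current)
        node.index = 1 + sum(1 for c in self.current.children if c.tag == tag)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract_features(html):
    """Step S2: return (title text, list of content-node feature dicts)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    title, features = "", []
    for node in walk(builder.root):
        if node.tag == "title":
            title = node.text
        elif node.text and node.tag not in ("script", "style"):
            # Assumption: density = number of non-whitespace characters.
            density = sum(1 for ch in node.text if not ch.isspace())
            features.append({"path": node.path(), "text": node.text,
                             "density": density})
    return title, features
```

Calling `extract_features` on a page returns the title text plus one feature record per non-empty content node, which is the input the later grouping and selection steps work on.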
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.
Since the path information is usually long, and obtaining and storing it directly takes up a large amount of memory, as an embodiment, step S2, when obtaining the path information of each content node from the tree structure, further includes: step S21, performing compression encoding on the path information of each content node. Specifically, md5 may be used for the compression encoding. Compression-encoding the path information makes it possible to adjust the group granularity, reduce the length of the group path, and save memory.
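As a rough illustration of step S21, the sketch below maps a path string to a fixed-length md5 digest using Python's standard `hashlib`. The function name `encode_path` is hypothetical. Note that md5 is a one-way hash rather than a reversible compression, so the digest can only serve as a compact grouping key.

```python
import hashlib


def encode_path(path: str) -> str:
    """Return a fixed-length (32 hex characters) key for a path string."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()


key = encode_path('//*[@id="root"]/div/div$/div$/p$')
```

Identical paths always yield the same key, so the digest can replace the full path when nodes are grouped.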
As an embodiment, the path information is xpath path information of main content in a web page, and the step S3 includes:
step S31, performing fuzzy processing on the subscript information of the path information of each content node;
it will be understood that fuzzy processing of the subscripts means replacing every subscript with the same preset character, or deleting the subscripts.
Step S32, placing the content nodes whose path information is identical after the fuzzy processing into the same group.
The following is illustrated with a specific example:
The xpath path information corresponding to the first content node is:
“//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[1]”;
fuzzy processing of the xpath path information corresponding to the first content node gives:
“//*[@id="root"]/div/div$/div$/div$/div$/div/div$/p$”.
The xpath path information corresponding to the second content node is:
“//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[2]”;
fuzzy processing of the xpath path information corresponding to the second content node gives:
“//*[@id="root"]/div/div$/div$/div$/div$/div/div$/p$”.
As a result, the path information of the first content node and that of the second content node are identical after fuzzy processing of the subscripts, so the first content node and the second content node belong to the same group.
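The example above can be reproduced with a short sketch. It assumes that a subscript is a purely numeric predicate such as `[3]` and that the preset replacement character is `$`, as in the example; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches numeric positional subscripts only; attribute predicates such as
# [@id="root"] are left untouched.
_SUBSCRIPT = re.compile(r"\[\d+\]")


def blur_path(path, placeholder="$"):
    """Step S31: replace every numeric subscript with the same character."""
    return _SUBSCRIPT.sub(placeholder, path)


def group_by_blurred_path(paths):
    """Step S32: nodes whose blurred paths are identical share a group."""
    groups = defaultdict(list)
    for path in paths:
        groups[blur_path(path)].append(path)
    return dict(groups)


first = '//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[1]'
second = '//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]/div/div[1]/p[2]'
groups = group_by_blurred_path([first, second])
```

Both paths collapse to the same key, so `groups` contains a single group holding the two nodes.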
The text density refers to the length of the text and is a count of valid characters; specifically, it may exclude code and count only text longer than a certain number of characters. It should be noted that an element with high text density is not necessarily the body text: text elements such as the source, time and author may be mistakenly extracted as the body text. Conversely, an element with low text density is not necessarily not the body text: for example, in message-board and forum webpages, a user may share only a single sentence or a link, and such content lowers the text density. Therefore, in this embodiment of the invention, the extraction of the target information may use multi-dimensional feature indicators. Specifically, as an embodiment, step S4 includes:
step S41, acquiring the text density of each group from the text data of the content nodes in the group, and sorting the densities in descending order as P1, P2, … PN, where N is the total number of groups;
where the text density is a count of valid characters;
step S42, taking the first n text densities P1, P2, … Pn, where n is a preset positive integer greater than or equal to 2 and n is smaller than N;
step S43, computing the mean deviation of the values P1, P2, … Pn and comparing it with a preset mean-deviation threshold; if the mean deviation is greater than or equal to the threshold, determining the group corresponding to P1 as the target group.
Here n may be 3, i.e. the mean deviation of P1, P2 and P3 is computed and compared with the mean-deviation threshold.
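Steps S41 to S43 can be sketched as follows. The text does not define the mean deviation precisely; this sketch assumes it is the mean absolute deviation about the arithmetic mean. The function names and the threshold value are hypothetical.

```python
def mean_deviation(values):
    """Mean absolute deviation about the mean (assumed interpretation)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def pick_by_density(densities, n=3, threshold=50.0):
    """Return the densest group if it stands out clearly, otherwise None.

    densities: dict mapping a group key to that group's text density.
    """
    # S41: sort groups by density, highest first.
    ranked = sorted(densities.items(), key=lambda kv: kv[1], reverse=True)
    # S42: keep the first n densities P1..Pn.
    top = [density for _, density in ranked[:n]]
    # S43: a large spread means P1 clearly dominates the other candidates.
    if len(top) >= 2 and mean_deviation(top) >= threshold:
        return ranked[0][0]
    return None  # fall through to the similarity-based steps
```

A `None` result signals that the top densities are too close together for density alone to decide, which is the case the subsequent steps handle.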
Further, if the mean deviation is smaller than the mean-deviation threshold, step S4 further includes:
step S44, judging whether the title node text is empty; if so, directly determining the group corresponding to P1 as the target group; otherwise, executing step S45, where an empty title node text indicates that the title node cannot be determined;
it should be noted that some web pages have no determinable title node, in which case the text density can be used directly to select the target group.
Step S45, acquiring the similarity Qx between the text data of the x-th group and the title node text data, where the text density corresponding to the x-th group is Px and x ranges from 1 to n or from 1 to N;
step S46, obtaining the first reference value Yx = Px × Qx corresponding to the x-th group, and determining the group with the largest first reference value as the target group.
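Steps S44 to S46 amount to a fallback followed by an arg-max over Yx = Px × Qx. The sketch below assumes the densities and similarities have already been computed per group; the function name and the dictionary-based inputs are hypothetical.

```python
def pick_target_group(densities, similarities, title_text):
    """Select the target group once density alone is inconclusive.

    densities:    dict group key -> text density Px
    similarities: dict group key -> similarity Qx to the title text
    title_text:   title node text; empty means no title could be determined
    """
    if not densities:
        raise ValueError("no groups to choose from")
    # S44: without a usable title, fall back to the densest group (P1).
    if not title_text.strip():
        return max(densities, key=densities.get)
    # S45/S46: maximise the first reference value Yx = Px * Qx.
    return max(densities,
               key=lambda g: densities[g] * similarities.get(g, 0.0))
```

With densities of 100 and 90 and similarities of 0.1 and 0.5, the second group wins (45 against 10) even though it is less dense.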
As an embodiment, the step S45 may specifically include:
step S451, carrying out similarity calculation on the text data of the x-th group and the text data of the title nodes to obtain an initial similarity value a;
specifically, the initial similarity value may be calculated from the text of the title node and the text of each node in the group by using difflib. Alternatively, the grouped data may be scanned and the distance from the text data in the group to the title node text data calculated with the Euclidean distance as the similarity measure; the Euclidean distance is a commonly used definition of distance, namely the true distance between two points in a multi-dimensional space, or the natural length of a vector.
Step S452, performing word segmentation on the text data of each content node of the x-th group, traversing the content node texts and the title node text in a nested double loop, and calculating the hit ratio b with which the content node text data hits the title text data;
step S453, determining the similarity Qx between the text data of the x-th group and the title node text data based on the initial similarity value a, the hit ratio b, and a preset first weight k: Qx = a + k × b.
The value of the first weight is in positive correlation with the influence of the hit ratio on the grouping, and the higher the first weight is set, the greater the influence of the hit ratio on the grouping result is.
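The combination Qx = a + k × b can be sketched with Python's standard `difflib`. Several points are assumptions made for illustration: `difflib.SequenceMatcher.ratio()` stands in for the initial similarity a, whitespace splitting stands in for word segmentation (Chinese text would need a real segmenter), and the hit ratio b is taken as the fraction of title tokens found in at least one content node. The function names and the default weight are hypothetical.

```python
import difflib


def initial_similarity(group_text, title_text):
    """S451: initial similarity a, in the range 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, group_text, title_text).ratio()


def hit_ratio(node_texts, title_text):
    """S452: share of title tokens that occur in some content node."""
    title_tokens = title_text.split()
    if not title_tokens:
        return 0.0
    hits = 0
    for token in title_tokens:          # outer loop: title tokens
        for text in node_texts:         # inner loop: content node texts
            if token in text.split():
                hits += 1
                break
    return hits / len(title_tokens)


def similarity(node_texts, title_text, k=0.5):
    """S453: Qx = a + k * b, with k the preset first weight."""
    a = initial_similarity(" ".join(node_texts), title_text)
    b = hit_ratio(node_texts, title_text)
    return a + k * b
```

Raising `k` makes exact token hits count for more relative to the character-level similarity, which matches the role the text gives to the first weight.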
In some embodiments, links on the web page may be noise information, such as recommended article links or advertisements. A filtering operation based on the number of links may therefore be performed to reduce the amount of computation. Specifically, the content node characteristic information further includes the number of links contained in the node, and in step S4, before step S41 is performed, the method may further include:
step S40, traversing the nodes of each group, acquiring the number of links of each group, comparing it with a preset link-number threshold, and filtering out the group if the threshold is exceeded, thereby filtering noise data out of the webpage.
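A minimal sketch of step S40 follows. It assumes that the link number of a group is the sum of the link counts of its nodes, which the text does not state explicitly; the function name and the threshold value are hypothetical.

```python
def filter_noisy_groups(group_link_counts, max_links):
    """Drop groups whose total link count exceeds the preset threshold.

    group_link_counts: dict group key -> list of per-node link counts
    """
    kept = {}
    for group, per_node_counts in group_link_counts.items():
        total = sum(per_node_counts)  # assumed: group total = sum over nodes
        if total <= max_links:
            kept[group] = total
    return kept


remaining = filter_noisy_groups(
    {"article": [0, 1, 0], "sidebar": [5, 6, 7]}, max_links=3)
```

Here the link-heavy `sidebar` group is removed before any density or similarity computation takes place.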
As an example, step S5 may specifically include:
step S51, taking the content nodes of the target group as the nodes to be analyzed, and judging whether the nodes to be analyzed include the target information; if so, acquiring the target information from the nodes to be analyzed; otherwise, executing step S52;
step S52, taking the parent node to which the target group is connected as the first parent node, adding the first parent node and each group node connected to the first parent node to the nodes to be analyzed, and judging whether the nodes to be analyzed include the target information; if so, acquiring the target information from the nodes to be analyzed; otherwise, executing step S53;
step S53, taking the parent node of the first parent node as the second parent node, adding the second parent node and each group node connected to the second parent node to the nodes to be analyzed, and judging whether the nodes to be analyzed include the target information; if so, acquiring the target information from the nodes to be analyzed; otherwise, executing step S54;
… ("…" means performing according to the rules described above)
Step S5(m+2), taking the parent node of the m-th parent node as the (m+1)-th parent node, where the (m+1)-th parent node is the common parent node of the target group nodes and the title node, adding the (m+1)-th parent node and each group node connected to the (m+1)-th parent node to the nodes to be analyzed, and judging whether the nodes to be analyzed include the target information; if so, acquiring the target information from the nodes to be analyzed; otherwise, ending the process.
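The step-by-step promotion can be sketched as a loop that widens the candidate set one ancestor at a time. This is a simplified illustration: it stops at the tree root or after `max_levels` promotions rather than at the common parent of the target group and the title node, and `TreeNode`, `subtree` and `extract_with_promotion` are hypothetical names. The check for target information is passed in as a function.

```python
class TreeNode:
    """Hypothetical minimal tree node used only for this sketch."""

    def __init__(self, tag, text="", parent=None):
        self.tag = tag
        self.text = text
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def subtree(node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.children:
        yield from subtree(child)


def extract_with_promotion(group_nodes, contains_target, max_levels=10):
    """Return the first candidate set that contains the target, else None."""
    candidates = list(group_nodes)                      # S51: the group itself
    ancestor = group_nodes[0].parent if group_nodes else None
    for _ in range(max_levels + 1):
        if candidates and contains_target(candidates):
            return candidates
        if ancestor is None:                            # nothing left to promote
            return None
        # S52, S53, ...: promote to the next parent and everything below it.
        candidates = list(subtree(ancestor))
        ancestor = ancestor.parent
    return None
```

Each failed check enlarges the region under analysis by one level, so the search stays local to the target group for as long as possible.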
In step S5, the determining whether the node to be analyzed includes the target information may specifically include:
step S50, acquiring the number of body-text nodes, the number of subtitles and the number of dates from the nodes to be analyzed, and judging whether the three numbers are equal; if so, determining that the nodes to be analyzed include the target information.
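The count comparison of step S50 can be sketched as below. How nodes are recognised as body text, subtitles or dates is not specified in the text, so the rules here are assumptions: heading tags count as subtitles, text that fully matches a `YYYY-MM-DD` pattern counts as a date, and any other non-empty text counts as body text. The extra requirement that the counts be non-zero is also an addition of this sketch.

```python
import re
from collections import Counter

SUBTITLE_TAGS = {"h2", "h3", "h4"}             # assumed subtitle markers
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")     # assumed date format


def classify(tag, text):
    """Assign a node to exactly one kind, or None if it carries no text."""
    text = text.strip()
    if tag in SUBTITLE_TAGS:
        return "subtitle"
    if DATE_RE.fullmatch(text):
        return "date"
    if text:
        return "text"
    return None


def contains_target(nodes):
    """nodes: iterable of (tag, text) pairs for the nodes to be analyzed."""
    counts = Counter(classify(tag, text) for tag, text in nodes)
    text_n, sub_n, date_n = counts["text"], counts["subtitle"], counts["date"]
    # Sketch-only guard: three zeros would otherwise compare as equal.
    return text_n > 0 and text_n == sub_n == date_n
```

A list page with one subtitle, one date and one body paragraph per entry passes the check, while a group missing any of the three does not.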
It should be noted that information such as the title, date and body text is retrieved for each content node of the optimal group, and web pages are classified according to the presence or absence of the title, date and body text; the title node can also be used as a feature. The node where the target information is located can also be determined based on this classification, as described below with several specific examples:
as an example, if the title node text is not empty, indicating that the webpage source code has a determined title node, and the numbers of body-text nodes, subtitles and dates retrieved from the target group do not correspond to each other, the webpage is determined to be of the article type, and the main information needs to be obtained based on the target group and the promoted nodes.
As an example, the title node text in the webpage source code is empty, so the title node is undetermined, but the number of body-text nodes in the target group corresponds to the numbers of subtitles and dates; according to the classification result, the webpage is determined to be of the instant messaging type, and its main information is obtained directly from the target group.
As an example, the webpage source code has a determined title node and the number of body-text nodes corresponds to the number of dates, but there is no corresponding subtitle in the target group; according to the classification result, the webpage can be determined to be of a type with social attributes, and the target information can likewise be obtained directly from the target group.
As an example, the webpage source code has a determined title node, the elements after promotion contain a plurality of hyperlinks whose number corresponds to the number of element tags in the content nodes, the dispersion of the text sizes in the content nodes is lower than a preset dispersion threshold, and the dates do not correspond; the webpage can then be determined to be of the article-list or navigation type, and the target information needs to be obtained based on the target group and the promoted nodes.
An embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions configured to perform a method according to an embodiment of the invention.
An embodiment of the invention further provides a computer-readable storage medium storing computer instructions for performing the method according to an embodiment of the invention.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A method for extracting target information of a webpage is characterized by comprising the following steps:
step S1, acquiring the HTML code of the webpage to be extracted, and constructing a corresponding tree structure based on the HTML code;
step S2, traversing the tree structure, obtaining title node text data according to the title information of the head part of the tree structure, and obtaining the characteristic information of each content node from the tree structure, wherein the content node characteristic information comprises path information, content node text data and text density, and the content nodes are other nodes except the title nodes in the tree structure;
step S3, grouping all content nodes based on the path information of all content nodes;
step S4, determining a target group from the groups according to the title node text data and the characteristic information of the content nodes in each group;
the step S4 includes:
step S41, acquiring the text density of each group from the text data of the content nodes in the group, and sorting the densities in descending order as P1, P2, … PN, where N is the total number of groups;
step S42, taking the first n text densities P1, P2, … Pn, where n is a preset positive integer greater than or equal to 2 and n is smaller than N;
step S43, computing the mean deviation of the values P1, P2, … Pn and comparing it with a preset mean-deviation threshold; if the mean deviation is greater than or equal to the threshold, determining the group corresponding to P1 as the target group;
step S5, taking the content nodes of the target group as the node to be analyzed, and determining whether the node to be analyzed includes the target information; if so, acquiring the target information from the node to be analyzed; otherwise, promoting the parent node of the node to be analyzed, together with the group nodes connected to that parent node, to be the node to be analyzed, and repeating until the target information is acquired.
2. The method of claim 1,
the path information is xpath path information of main content in a web page, and the step S3 includes:
step S31, performing fuzzy processing on the subscript information of the path information of each content node;
step S32, placing the content nodes whose path information is identical after the fuzzy processing into the same group.
3. The method of claim 1,
if the mean deviation is smaller than the mean-deviation threshold, the step S4 further includes:
step S44, judging whether the title node text is empty; if so, directly determining the group corresponding to P1 as the target group; otherwise, executing step S45, where an empty title node text indicates that the title node cannot be determined;
step S45, acquiring the similarity Qx between the text data of the x-th group and the title node text data, where the text density corresponding to the x-th group is Px and x ranges from 1 to n or from 1 to N;
step S46, obtaining the first reference value Yx = Px × Qx corresponding to the x-th group, and determining the group with the largest first reference value as the target group.
4. The method of claim 3,
the step S45 includes:
step S451, carrying out similarity calculation on the text data of the x-th group and the text data of the title nodes to obtain an initial similarity value a;
step S452, performing word segmentation on the text data of each content node of the x-th group, traversing the content node texts and the title node text in a nested double loop, and calculating the hit ratio b with which the content node text data hits the title text data;
step S453, determining the similarity Qx between the text data of the x-th group and the title node text data based on the initial similarity value a, the hit ratio b, and a preset first weight k: Qx = a + k × b.
5. The method of claim 1,
the content node characteristic information further includes the number of links included in the node, and before the step S41 is executed in the step S4, the method further includes:
step S40, traversing the nodes of each group, acquiring the number of links of each group, comparing it with a preset link-number threshold, and filtering out the group if the number of links exceeds the threshold.
6. The method of claim 1,
step S5 includes:
step S51, taking the content nodes of the target group as the nodes to be analyzed, and judging whether the nodes to be analyzed include the target information; if so, acquiring the target information from the nodes to be analyzed; otherwise, executing step S52;
step S52, taking the parent node to which the target group is connected as the first parent node, adding the first parent node and each group node connected to the first parent node to the nodes to be analyzed, and judging whether the nodes to be analyzed include the target information; if so, acquiring the target information from the nodes to be analyzed; otherwise, executing step S53;
step S53, taking the parent node of the first parent node as the second parent node, adding the second parent node and each group node connected to the second parent node to the nodes to be analyzed, and judging whether the nodes to be analyzed include the target information; if so, acquiring the target information from the nodes to be analyzed; otherwise, executing step S54;
step S5(m+2), taking the parent node of the m-th parent node as the (m+1)-th parent node, where the (m+1)-th parent node is the common parent node of the target group nodes and the title node, adding the (m+1)-th parent node and each group node connected to the (m+1)-th parent node to the nodes to be analyzed, and judging whether the nodes to be analyzed include the target information; if so, acquiring the target information from the nodes to be analyzed; otherwise, ending the process.
7. The method of claim 6,
in step S5, the determining whether the node to be analyzed includes the target information includes:
step S50, acquiring the number of body-text nodes, the number of subtitles and the number of dates from the nodes to be analyzed, and judging whether the three numbers are equal; if so, determining that the nodes to be analyzed include the target information.
8. An electronic device, comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the instructions being arranged to perform the method of any of the preceding claims 1-7.
9. A computer-readable storage medium having stored thereon computer-executable instructions for performing the method of any of the preceding claims 1-7.
CN202110207419.4A 2021-02-25 2021-02-25 Method, electronic device and medium for extracting webpage target information Active CN112559929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110207419.4A CN112559929B (en) 2021-02-25 2021-02-25 Method, electronic device and medium for extracting webpage target information

Publications (2)

Publication Number Publication Date
CN112559929A CN112559929A (en) 2021-03-26
CN112559929B CN112559929B (en) 2021-05-07

Family

ID=75034663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110207419.4A Active CN112559929B (en) 2021-02-25 2021-02-25 Method, electronic device and medium for extracting webpage target information

Country Status (1)

Country Link
CN (1) CN112559929B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116090375B (en) * 2023-03-01 2024-02-02 上海合见工业软件集团有限公司 System for determining target drive source code based on coverage rate data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184189A (en) * 2011-04-18 2011-09-14 北京理工大学 Webpage core block determining method based on DOM (Document Object Model) node text density
CN104217025A (en) * 2014-09-28 2014-12-17 福州大学 System and method for extracting record items of multi-record web page
CN109582886A (en) * 2018-11-02 2019-04-05 北京字节跳动网络技术有限公司 Content of pages extracting method, the generation method of template and device, medium and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5367869B2 (en) * 2012-04-27 2013-12-11 楽天株式会社 Aggregation device, aggregation program, computer-readable recording medium recording the aggregation program, and aggregation method

Also Published As

Publication number Publication date
CN112559929A (en) 2021-03-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant