CN112667940A - Webpage text extraction method based on deep learning - Google Patents


Info

Publication number
CN112667940A
CN112667940A (application CN202110026891.8A; granted as CN112667940B)
Authority
CN
China
Prior art keywords
label
text
path
tag
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110026891.8A
Other languages
Chinese (zh)
Other versions
CN112667940B (en)
Inventor
陈前华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Publication of CN112667940A
Application granted
Publication of CN112667940B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a deep-learning-based method for extracting the body text of web pages, comprising the following steps: 1) preparing a data set of root-DOM-node-to-leaf-DOM-node paths; 2) constructing the data set; 3) labeling the data in the data set; 4) pre-training and encoding the tags of each path with Fasttext; 5) training an LSTM classification model on the tag-path text; 6) predicting tag-path text with the LSTM model; 7) restoring the extracted web page text. The invention belongs to the technical field of the internet and specifically relates to a deep-learning-based web page text extraction method that improves the accuracy of body-text extraction from resume web pages.

Description

Webpage text extraction method based on deep learning
Technical Field
The invention belongs to the technical field of the internet and specifically relates to a method for extracting web page body text based on deep learning.
Background
The internet hosts a large amount of public information; acquiring it requires a series of crawling and natural-language-processing techniques to fetch and parse web pages, in which body-text extraction is an important research topic. As the world wide web has developed, the functionality and styling of web pages have grown increasingly complex, and pages often contain a large amount of useless information: advertisements, external links, navigation bars, and so on. Generally only the body content of the page is of interest; the "body text" is the content information of interest in a page, including the target characters, pictures, and videos.
An existing extraction method based on differing density distributions assumes that body content is concentrated: since everything other than the tags that make up the HTML is text, the region of the page with the fewest tags is taken to be the body. Based on this assumption the method generates a tag distribution map. It targets pages whose body text is concentrated, such as news pages; the extraction is coarse, and scattered body text may be missed.
HTML tags usually carry symbolic meaning: beyond their display semantics they embody the function of a module within the page, e.g. &lt;p&gt;, &lt;img&gt;, &lt;table&gt;; the DOM (Document Object Model) tree of an HTML document likewise reflects the page's visual layout and logical structure. Many papers therefore apply the DOM tree to body-text extraction: they parse the HTML into a DOM tree and obtain the body through two filtering steps, filtering out content such as tags and advertisements, where the filters are defined on the functions of HTML tags — for example, filtering content rich in keywords such as href and src on the assumption that link-heavy content is likely advertising. For most websites such a method achieves what its designers intend, but with the rise of poorly built sites and ever more complex layouts (large numbers of links may also appear inside genuine body text), rule-based methods require continual manual updating. Indeed, since 2003 many researchers have proposed rule-based web page analysis methods, and the complexity of the rules keeps growing with the evolution of web design. For example, a record-extraction method that clusters jointly over the DOM tree and tag paths exploits the fact that repeated content blocks share a large number of identical segmentation elements; it is an unsupervised method with high stability, used to extract page content consisting of many repeated records, such as the products of a shopping site or a scholar's publication list.
There are also many methods based on visual segmentation, which imitate how a human views a page. Microsoft proposed VIPS, a vision-based page segmentation algorithm that uses 13 rules to define hierarchical blocks and segments a page effectively from the perspective of page semantics. Strictly speaking that work does not extract body text, and it too relies on the HTML DOM structure for analysis. A data-record extraction method built on VIPS has also been proposed: VIPS produces the content structure tree, and record positions are extracted from it under two assumptions — that the data region lies in the horizontal center of the page, and that it occupies a large fraction of the page area.
In recent years many body-text extraction methods based on machine learning and data mining have appeared, some clustering-based and some decision-tree-based. The features used fall into several broad categories: features describing an individual text block (element), features describing the whole HTML file (the list of text blocks plus structural information), features describing the visual information of the whole page, and features describing clusters of text sharing the same characteristics across a site (e.g. the repeated records mentioned above).
The mainstream Python body-text extraction tools, such as Readability and Newspaper3k, work well on news pages. In practical projects, however, these tools prove inadequate for pages with sparse body text, including encyclopedia and resume pages. Moreover, both density-based and vision-based extraction methods presuppose that the body text has distinctive visual features, whereas encyclopedia and resume pages come in many styles whose content is highly dispersed, making those methods hard to apply. The rule-based strategies common in earlier work — whether keyed on visual elements, HTML tag information, or content information — are intuitively ill-suited to increasingly complex and partly irregular page structures, and their rule definitions are tedious to write and costly to implement and maintain.
Disclosure of Invention
To solve the above problems, the invention provides a deep-learning-based web page text extraction method. Starting from the notion of the tag path in a DOM tree, it takes element information and the corresponding DOM structure information as the sources of features and performs extraction by using an RNN (Recurrent Neural Network) to predict, for each element, whether it is body text; this approach can take into account syntactic information such as the position, tag, and HTML structure of the text, as well as its semantic information. Training results show that with simple training the model fits and predicts well on seen page structures and improves the accuracy of body-text extraction from resume pages.
To achieve this purpose, the technical scheme adopted by the invention is as follows. The deep-learning-based web page text extraction method comprises the following steps:
1) Data set preparation for root-DOM-node-to-leaf-DOM-node paths: the extraction task is cast as a text classification task. Each text segment is classified as to-be-retained or not; an LSTM model is trained so that, given a content text block — i.e. the entire path from the root DOM node to a leaf DOM node — it outputs a judgment of the predicted probability that the path should be retained. To encode a text block's membership information, the HTML is parsed into a DOM tree whose every node is a tag, and the path is represented by the tags from the DOM tree root to the target text block's node; this is the tag path.
2) Data set construction for root-DOM-node-to-leaf-DOM-node paths: first all tags in the HTML are found, their positions located by regular-expression matching, and the HTML divided into tags and content. Tags that stand alone or are of no further interest, together with their content, are removed, and the remaining content is assigned to tags as follows: if content lies between a start tag and the next tag, it is assigned to the start tag; if content before an end tag is still unassigned (i.e. the previous tag is also an end tag), it is assigned to that end tag; in the special case where one text tag is nested inside another, content may lie between a previous end tag and the next start tag, and it is assigned to the previous end tag. The HTML is thus parsed into a tag sequence. Tags are then matched pairwise — each start tag with its end tag — by the following rule: prepare an empty stack (the Cache) and traverse the processed tag sequence; on a start tag, or whenever the Cache is empty, push directly. On an end tag, check whether the tag on the stack top is the corresponding start tag, and if not, search downward until the match is found; then push the end tag, record the state of the Cache as the unit for later labeling, and pop the end tag together with its start tag. The traversal finally leaves an empty Cache, and each recorded snapshot is a path; this realizes pairwise tag matching and node construction.
3) Labeling the data in the data set: after obtaining the tag paths in step 2), a category label is attached to each path, according to the text of its leaf node, for training and testing the deep learning model: 1 if the text block is body text, 0 otherwise.
4) Pre-training and encoding the path's tags with Fasttext: pre-training maps vocabulary items into a vector space by training a deep network — the "pre-trained model" — on a large unlabeled text corpus, yielding a set of low-dimensional model parameters as a representation of each item. First, Fasttext is pre-trained on the tags and their class attributes, giving two vector sets of 10 and 50 dimensions respectively. The text-content length and the count of terminal punctuation marks contribute 1 dimension each; on entering the model these last two dimensions pass through a neural layer with a 2-dimensional input and a 10-dimensional output and are concatenated with the preceding 60-dimensional data into a 70-dimensional vector, which is the LSTM's input.
5) Training the LSTM classification model on tag-path text: with the 70-dimensional vectors from step 4), the LSTM sequence length is set to 15; tag-path samples longer than 15 are truncated to 15, and those shorter than 15 are padded to 15 with zero vectors. The processed samples are fed into the LSTM model; the hidden-state vector it outputs passes through a fully connected network and then a softmax layer to give the classification result, and the cross-entropy loss is back-propagated to update the weights of the whole network.
6) Predicting tag-path text with the LSTM model;
7) Restoring the extracted web page text: the DOM tree is rebuilt by the post-order-traversal logic while each tag path is obtained; a list (the Keep list) stores the HTML decided to be retained, and a class Tag is defined so that each tag can reference its corresponding start or end tag. During traversal, a start tag is pushed onto both the Cache and the Keep list; an end tag generates a tag path, which is judged immediately: if it is body text the end tag is pushed into the Keep list, and if not it is not pushed, and the corresponding start tag is located via Tag and removed.
With the above structure the invention obtains the following beneficial effects. The method extracts web page body text with a deep learning LSTM, originating the approach of classifying paths using the information of the whole tag path from root node to leaf node. By adopting pre-training — using Fasttext to learn embedded representations of tag and class from massive web pages — it solves the embedding problem for tag and class, differing from the published word- or character-based pre-trained models; the pre-training greatly reduces the later cost of manually labeling tag paths and also improves the downstream LSTM classification performance. By generating tag paths directly from the page's HTML source, it guarantees that the text of each leaf node is output in the original HTML order, so the extracted text matches the original reading order, does not hurt the user's reading experience and, more importantly, does not disturb subsequent downstream tasks with ordering errors.
Drawings
FIG. 1 is a flow chart of a webpage text extraction method based on deep learning;
FIG. 2 is a diagram of a webpage text extraction method based on deep learning;
FIG. 3 is a sample diagram of a label path of a deep learning-based web page text extraction method;
FIG. 4 is a plot of the FuzzyWuzzy ratio results of text extraction on the test-set web pages for the deep-learning-based web page text extraction method.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention relates to a resume page text extraction method based on deep learning, which comprises the following steps:
1. data set preparation from root DOM node to leaf DOM node
Through detailed analysis of DOM paths, the extraction task is defined as a text classification task: each text segment is classified as to-be-retained or not (decided by the trained deep learning model). By training an LSTM model (with multiple features as input), the expectation is that inputting a content text block — i.e. the entire path from the root DOM node to a leaf DOM node — yields a judgment of the predicted probability that the path's classification is "retain".
Deep learning is a common approach to text classification, but unlike other text classification settings, the invention encodes the membership information of a text block (i.e. which tags it is contained in) rather than its order of appearance in the page. Here the concept of the tag path is introduced: HTML can be parsed into a DOM tree whose every node is a tag, and a path is represented by the tags from the DOM tree root to the target text block's node — this is the tag path.
A concrete application scenario was chosen to verify the algorithm: crawling the published CVs and resumes of staff at universities and research institutions on the Chinese internet. In this verification scenario the extraction target is set to personnel-resume pages; extraction for other scenarios can be achieved by labeling pages of the corresponding topic. The training set used in the invention consists of the tag paths described above: 800 teacher-resume pages were taken from domestic university websites, plus 300 random pages, giving a set of 1100 pages. DOM-tree extraction was then performed page by page to obtain the tag paths of all content blocks; these paths form the data set on which the LSTM model is trained. The data set was split into training, test, and validation sets.
2. Data set construction from root DOM node to leaf DOM node
For each of the 1100 pages, all tags in the whole HTML are first found and their positions located by regular-expression matching, dividing the whole HTML into tags and content. In preparation for the later pairwise matching of tags (start tag with end tag), tags that may stand alone or are of no further interest, together with their content, are removed, for example &lt;meta&gt;, &lt;script&gt;, &lt;iframe&gt;, &lt;img&gt;. The content is then assigned to tags, after which the page can formally be parsed into a tag sequence. The assignment rules are: if content lies between a start tag and the next tag, it is assigned to the start tag; if content before an end tag is still unassigned (i.e. the previous tag is also an end tag), it is assigned to that end tag; in the special case where one text tag is nested inside another (e.g. several &lt;strong&gt; nested in a &lt;p&gt;), content may lie between the previous end tag and the next start tag — since the whole piece of content need not be distinguished in later judgments, it is assigned to the previous end tag. At this point the HTML has been parsed into a tag sequence.
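The splitting and content-assignment rules above can be sketched in Python. The regex, the void-tag filter, and the function name are illustrative assumptions rather than the patent's actual code:

```python
import re

TAG_RE = re.compile(r"<[^>]+?>")
# Simplified filter for stand-alone tags removed before matching
VOID_RE = re.compile(r"</?(meta|link|br|img|hr|input)\b[^>]*>", re.I)

def to_tag_sequence(html):
    """Split HTML into a [(tag, content)] sequence, assigning inter-tag
    text per the rules: a start tag gets the text after it; an end tag
    gets still-unassigned text before it; text between an end tag and the
    next start tag (nested text tags) goes to the previous end tag."""
    html = VOID_RE.sub("", html)
    tags = [(m.group(0), m.start(), m.end()) for m in TAG_RE.finditer(html)]
    seq = [[t, ""] for t, _, _ in tags]
    for i in range(len(tags) - 1):
        text = html[tags[i][2]:tags[i + 1][1]].strip()
        if not text:
            continue
        if not tags[i][0].startswith("</"):      # content after a start tag
            seq[i][1] = text
        elif tags[i + 1][0].startswith("</"):    # unassigned, next is end tag
            seq[i + 1][1] = text
        else:                                    # end tag then start tag
            seq[i][1] = text
    return [tuple(p) for p in seq]
```

For instance, `to_tag_sequence("<p>a<strong>b</strong>c</p>")` assigns `a` to `<p>`, `b` to `<strong>`, and the trailing `c` to `</p>`, matching the nested-text rule.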
When traversing the DOM tree, the goal is to obtain the path information of every leaf node. The tree is post-order traversed while being constructed, the path of each node is returned, and pairwise tag matching — exploiting the rule that a parent's end tag must come after the end tags of all its descendants — realizes node construction. Concretely: prepare an empty stack (the Cache) and traverse the processed HTML tag by tag; on a start tag, or whenever the Cache is empty, push directly. On an end tag, check whether the tag on the current stack top is the corresponding start tag (note that even two tags of the same type may not be a matching pair, because of malformed page syntax); if they do not match, search downward — merging the skipped part into the node's content — until the match is found; then push the end tag, record the state of the Cache as the unit to be labeled later, and pop the end tag, the corresponding start tag, and all content between them. In this way the traversal yields an empty Cache (the last pop being &lt;html&gt;&lt;/html&gt;) and many retained Cache snapshots, which are exactly the required paths.
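The Cache-based matching can be sketched as follows; `tag_name`, the snapshotting, and the mismatch-tolerant pop are illustrative simplifications of the rule described above, not the patent's implementation:

```python
def tag_name(tag):
    """'<div class=x>' -> 'div', '</p>' -> 'p' (illustrative helper)."""
    return tag.strip("</>").split()[0].lower()

def extract_paths(tag_seq):
    """tag_seq: [(tag, content)] pairs in document order. Returns
    [(path, content)] where path is the list of open tag names from the
    root down to (and including) the closing node — its tag path."""
    cache, paths = [], []
    for tag, content in tag_seq:
        if not tag.startswith("</"):
            cache.append((tag, content))        # start tag: push directly
            continue
        node_text = content
        # Tolerate malformed HTML: pop until the matching start tag,
        # merging the skipped units into this node's content.
        while cache and tag_name(cache[-1][0]) != tag_name(tag):
            node_text = cache.pop()[1] + node_text
        path = [tag_name(t) for t, _ in cache]  # Cache snapshot = tag path
        if cache:
            _, start_text = cache.pop()         # pop the matching start tag
            node_text = start_text + node_text
        if node_text.strip():
            paths.append((path, node_text.strip()))
    return paths
```

Feeding it the tag sequence of `<html><body><p>hello</p></body></html>` yields a single labeled unit whose path is `["html", "body", "p"]` with content `"hello"`.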
After the above operations, a set containing many tag paths — the training set for the LSTM model — is obtained.
3. Annotating data in a dataset of root DOM nodes to leaf DOM nodes
The previous step introduced the construction of the tag paths. Once the paths are obtained, a category label is attached to each path, according to the text of its leaf node, for deep learning model training and testing: 1 if the text block is body text, 0 otherwise. A sample tag path is shown in FIG. 3.
4. Pre-training and encoding labels for paths using Fasttext
Pre-training is a technique that maps the items of a vocabulary into a vector space: a deep network structure — the pre-trained model — is trained on a large corpus of unlabeled text to obtain a set of low-dimensional model parameters, which are then applied to other specific tasks. This technique can greatly improve downstream performance and relax the demand for labeled data.
The pre-training in the invention uses the Fasttext model of the Gensim library. First, Fasttext is pre-trained separately on the tags and on their class attributes (only &lt;div&gt; and &lt;table&gt; tags carrying a class are encoded; all others are encoded as 0), yielding two vector sets of 10 and 50 dimensions. In addition, the text-content length and the count of terminal punctuation marks contribute 1 dimension each, for 62 dimensions in total; on entering the model the last two dimensions pass through a neural layer with a 2-dimensional input and a 10-dimensional output, and the result is concatenated with the preceding 60-dimensional data into a 70-dimensional vector, the input to the LSTM.
Fasttext was chosen for pre-training because, in the HTML of Chinese websites, developers may write class names in English, pinyin, English abbreviations, or pinyin abbreviations, so the model will encounter out-of-vocabulary words (words absent from the pre-training set) at application time. Fasttext handles this well: it exploits subword morphology and trains with skip-grams, can be trained quickly on large text collections, and can represent out-of-vocabulary words. In addition, compound words in the training set are split on separators so that each word is as meaningful as possible, reducing the out-of-vocabulary words met in downstream tasks.
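The 62-to-70-dimensional encoding of this step can be sketched with stand-in tables: the random vectors below stand in for the Fasttext-pretrained tag (10-dim) and class (50-dim) embeddings, and a fixed linear map stands in for the trainable 2-in/10-out layer. All concrete values are illustrative assumptions.

```python
import random

random.seed(0)
# Stand-ins for the Fasttext-pretrained embeddings (not real vectors)
TAG_VEC = {t: [random.random() for _ in range(10)]
           for t in ("html", "body", "div", "table", "p")}
CLS_VEC = {c: [random.random() for _ in range(50)]
           for c in ("content", "nav")}

def encode_node(tag, cls, text):
    """Return the 70-dim LSTM input for one node of a tag path:
    10 (tag) + 50 (class) + 10 (projection of length and stop-count)."""
    head = TAG_VEC.get(tag, [0.0] * 10) + CLS_VEC.get(cls, [0.0] * 50)  # 60
    length = float(len(text))                              # raw feature 61
    stops = float(sum(text.count(p) for p in ".!?。！？"))  # raw feature 62
    # Stand-in for the trainable 2-dim -> 10-dim neural layer
    tail = [0.1 * length + 0.2 * stops] * 10               # 10 dims
    return head + tail                                     # 60 + 10 = 70
```

Unknown tags or absent classes fall back to zero vectors, mirroring the "all other codes are 0" rule above.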
5. LSTM classification model for training label path text
With the 70-dimensional vectors of the previous step, the LSTM sequence length is set to 15 in the invention: tag-path samples longer than 15 are truncated to 15, and shorter samples are padded directly to 15 with all-zero vectors. The samples processed in this way can be fed into the LSTM model; its hidden-state output vector is passed through a fully connected network and a softmax layer to obtain the classification result, and the cross-entropy loss is back-propagated to adjust the network weights.
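The truncate-or-pad rule can be sketched directly (the function name and defaults are illustrative):

```python
def fix_length(path_vectors, length=15, dim=70):
    """Force a tag-path sample to exactly `length` time steps of `dim`-dim
    vectors: truncate long samples, zero-pad short ones."""
    path_vectors = path_vectors[:length]                # truncate > length
    pad = [[0.0] * dim] * (length - len(path_vectors))  # zero-pad < length
    return path_vectors + pad
```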
The LSTM hyperparameters are: dropout 0.3, 128 hidden units, 2 LSTM layers, 2 output classes (body text or not), Adam optimizer with learning rate 0.001, cross-entropy loss function, and batch_size 32. Training runs for at least 100 epochs and then stops once 20 consecutive epochs produce no better loss or F1 score.
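Under these hyperparameters, an illustrative PyTorch rendering of the classifier might look as follows; this is a sketch under the stated settings, not the patent's code:

```python
import torch
import torch.nn as nn

class PathClassifier(nn.Module):
    """2-layer LSTM, 128 hidden units, dropout 0.3, 70-dim inputs,
    2 output classes — the hyperparameters listed above."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=70, hidden_size=128,
                            num_layers=2, dropout=0.3, batch_first=True)
        self.fc = nn.Linear(128, 2)

    def forward(self, x):              # x: (batch, 15, 70)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])          # logits; CrossEntropyLoss adds softmax

model = PathClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
logits = model(torch.zeros(32, 15, 70))   # one batch of 32 padded paths
```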
6. Prediction of label path text by LSTM model
After the previous step, a deep learning model capable of extracting page body text is available. Given a new page's HTML, the same processing as in step 2 yields the tag-path set corresponding to that HTML. Each tag path in the set is then encoded exactly as in the encoding part of step 4, converting the set into input tensors adapted to the LSTM model. The forward pass of the LSTM network gives each path's classification result: paths classified 1 are retained and paths classified 0 are removed, leaving the set of retained tag paths.
7. Restoring extracted web page text
The DOM tree is restored via post-order traversal, and the final text is output in order. After the tag-path set of the body text is obtained, directly concatenating the texts causes a problem: with nested content, the descendants are generally traversed before the parent node. Usually only the deepest leaf node carries content, but some site designs, and rendered content such as papers, nest content heavily; restoring the text by direct concatenation then produces out-of-order output.
To solve this problem, the invention restores the order by restoring the HTML, since only the HTML preserves the original text order. The prediction step is moved forward to the time the path set is constructed: the DOM tree is restored by the post-order-traversal logic while each tag path is obtained, and a list Keep_list stores the HTML decided to be retained. In addition, a class Tag is defined for tags, whose purpose is to keep a reference to the corresponding start (or end) tag. During traversal, a start tag is pushed onto both the Cache and the Keep_list; when an end tag is met and a tag path is generated, it is judged immediately — if it is body text the end tag is pushed into Keep_list, otherwise it is not pushed, and the corresponding start tag is found via Tag and removed. The time complexity of this approach is O(n²). After these procedures, the extracted web page text is obtained.
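The order-preserving restoration can be sketched as follows, with `is_body` standing in for the trained LSTM predictor; well-formed input is assumed for brevity, so the Tag back-references and mismatch handling of the full method are omitted:

```python
def restore_text(tag_seq, is_body):
    """Re-walk the [(tag, content)] sequence with a stack; when an end tag
    closes a node, ask the classifier whether to keep its text. Because the
    walk follows the original tag order, kept text comes out in the
    original reading order."""
    cache, keep = [], []
    for tag, content in tag_seq:
        if not tag.startswith("</"):
            cache.append((tag, content))          # start tag: push
            continue
        path = [t for t, _ in cache] + [tag]      # tag path of closing node
        _, text = cache.pop()
        text = (text + content).strip()
        if text and is_body(path, text):
            keep.append(text)                     # retained in document order
    return " ".join(keep)
```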
To better illustrate the objects, technical solutions and advantages of the present invention, the present invention will be further described with reference to the accompanying drawings and specific comparative examples and examples.
Comparative example 1
Web page text extraction based on Readability: Readability is in fact not a library for any particular language but an algorithm; it has been packaged as mercury-parser and as a Chrome plug-in. Text extraction was performed on the 304 test-set pages with the original package's default parameters, unchanged; the result is shown as a dotted line in FIG. 4.
Comparative example 2
Web page text extraction based on Newspaper3k: Newspaper3k is Python-based and likewise uses lxml to parse HTML. With all parameters at the original package's defaults and no modification, text extraction was performed on the 304 test-set pages; the result is shown as a dotted line in FIG. 4.
Example 1
The input tag path is encoded through the pre-trained Fasttext model to obtain the LSTM input, and the LSTM is trained with the tag-path classifications. The parameters are set as follows: maximum tag-path sequence length 15 (for paths longer than 15, only the first 15 tags are input), dropout 0.3, 128 hidden units, 2 LSTM layers, 2 output classes (body text or not), Adam optimizer with learning rate 0.001, cross-entropy loss function, batch size 32, at least 100 epochs, with training stopped once 20 consecutive epochs produce no better loss or F1 score. This yields the text extraction model. A text extraction test on the 304 test-set pages gives the result shown as the solid line in FIG. 4.
For Comparative Examples 1 and 2 and Example 1, the effect is evaluated by fuzzy string matching: the three tools extract the body text of the 300-odd pages in the validation set, and a reference text extracted according to the labels serves as the standard answer. To eliminate errors caused by segmentation, all spaces and line breaks are removed uniformly from the results. Fuzzy string matching is performed with FuzzyWuzzy, a string-similarity tool based on the Levenshtein distance — the minimum number of single-character edits needed to transform one string into another. The metric used relates the Levenshtein distance to the average length of the two strings; a higher score indicates that the two strings are more similar.
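A minimal stand-in for the score as the text states it — Levenshtein distance normalized by the average length of the two strings — can be sketched as follows. Note that FuzzyWuzzy's actual `fuzz.ratio` is computed via difflib-style matching, so its scores differ; this only illustrates the stated metric.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """0-100 score: higher means more similar (illustrative normalization)."""
    avg = (len(a) + len(b)) / 2 or 1
    return round(100 * (1 - levenshtein(a, b) / avg))
```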
The abscissa of FIG. 4 is the page number, and the ordinate is the similarity between the text a given tool extracts from that page and the standard answer; the higher the similarity, the better the tool performs. The two dotted lines are the Readability and Newspaper3k results, and the solid line is the extraction result of the invention's LSTM-based model. The figure shows clearly that the invention's text extraction is effective.
The present invention and its embodiments have been described above, but the description is not limitative, and the actual structure is not limited thereto. In summary, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (1)

1. A webpage text extraction method based on deep learning, characterized by comprising the following steps:
1) data set preparation from root DOM node to leaf DOM node: the text extraction task is defined as a text classification task: for a text segment, classify whether it should be retained or not; an LSTM model is trained whose input is a content text block, namely the whole path from the root DOM node to the leaf DOM node, and whose output is the predicted probability that the path should be retained; to encode the information to which a text block belongs, the HTML is parsed into a DOM tree in which each node is a tag, and the path is represented by the tags from the DOM tree root node to the target text block node, namely the label path;
2) data set construction from root DOM node to leaf DOM node: find all tags of the HTML, locate their positions by regular matching, and split the HTML into tags and contents; remove tags that stand alone and are of no subsequent concern, together with their corresponding content, and assign each content to a tag as follows: if content lies between a start tag and the next tag, the content is assigned to that start tag; if content has not been assigned before an end tag, i.e. the preceding tag is also an end tag, the content is assigned to that end tag; in the special case of a text tag nested inside a text tag, content appears between a preceding end tag and a following start tag, and it is assigned to the preceding end tag; the HTML is thereby parsed into a tag sequence, and the tags are matched pairwise, i.e. each start tag with its end tag, according to the following rules: prepare an empty stack Cache and traverse the processed HTML tag by tag; when a start tag is encountered, push it directly onto the stack; when an end tag is encountered, compare it with the tag on top of the stack to judge whether they correspond, and if not, search down the stack until the matching tag is found; then push the end tag onto the stack, record the state of the Cache as a unit to be labelled later, and pop the end tag together with its corresponding start tag; when the traversal finishes with an empty Cache, a path is obtained, whereby pairwise matching of the tags is realized and the nodes are constructed;
3) labeling data in the data set from root DOM node to leaf DOM node: after the label paths are obtained in step 2), each path is given a category label according to the text corresponding to the leaf node in the path, for deep learning model training and testing; if the text block is body text, the category label is 1, and if it is not body text, the category label is 0;
4) the tags of the path are pre-trained and encoded using fastText: using a pre-training technique that maps the words of a vocabulary into a vector space, a deep network structure, referred to as a "pre-training model", is trained on a large amount of unlabeled text corpora to obtain a low-dimensional vector representation of each word; first, fastText pre-training is performed on the tag and its class to obtain two groups of vectors of 10 dimensions and 50 dimensions respectively; the character content length and the number of terminal punctuation marks are given 1 dimension each; after entering the model, these last two dimensions pass through a neural network with a 2-dimensional input layer and a 10-dimensional output layer, and are spliced with the preceding 60-dimensional data into a 70-dimensional vector as the input of the LSTM;
5) training an LSTM classification model of label-path text: taking the 70-dimensional vectors of step 4) as input, the LSTM sequence length is set to 15; label-path samples longer than 15 are truncated to 15, and samples shorter than 15 are padded to 15 with zero vectors; the processed label-path samples are fed into the LSTM model, the hidden-state vector output by the LSTM is input to a fully-connected network and then to a softmax layer to obtain the final classification result, and the cross-entropy loss is back-propagated to update the weights of the whole neural network;
6) predicting the label path text by the LSTM model;
7) restoring the extracted webpage text: the DOM tree is restored through the logic of post-order traversal, obtaining each tag path at the same time; a list KeepList is maintained to store the HTML determined to be retained, and a class Tag is defined for tags, storing the corresponding start tag or end tag; during traversal, when a start tag is encountered it is pushed to the Cache and the KeepList at the same time; when an end tag is encountered, a label path is generated and judged immediately: if it is body text, the end tag is pushed into the KeepList; if not, it is not pushed, and the corresponding start tag in the KeepList is found.
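The pairwise tag matching and label-path generation of steps 2) and 7) can be illustrated with a much-simplified stack sketch. It ignores the special cases the claim handles (nested text tags, unmatched or self-closing tags, content assigned to end tags), and all names are illustrative, not code from the patent:

```python
import re

def tag_paths(html):
    """Split HTML into tags and content with a regex, keep a stack of
    open tags, and yield (label_path, text) pairs: each text block with
    the chain of tags from the root down to its enclosing node."""
    tokens = re.split(r'(<[^>]+>)', html)
    stack, out = [], []
    for tok in tokens:
        if not tok.strip():
            continue                      # skip empty / whitespace runs
        if tok.startswith('</'):
            if stack:
                stack.pop()               # end tag: close the top node
        elif tok.startswith('<'):
            # start tag: push the bare tag name (attributes stripped)
            stack.append(re.sub(r'[<>/]| .*', '', tok))
        else:
            # content: assign it to the current label path
            out.append(('/'.join(stack), tok.strip()))
    return out
```

For example, `tag_paths('<html><body><p>hi</p></body></html>')` yields one pair whose label path is `html/body/p` — exactly the root-to-leaf tag chain the claim feeds to the classifier.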
CN202110026891.8A 2020-10-15 2021-01-09 Webpage text extraction method based on deep learning Active CN112667940B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011100533 2020-10-15
CN2020111005339 2020-10-15

Publications (2)

Publication Number Publication Date
CN112667940A true CN112667940A (en) 2021-04-16
CN112667940B CN112667940B (en) 2022-02-18

Family

ID=75413914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110026891.8A Active CN112667940B (en) 2020-10-15 2021-01-09 Webpage text extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN112667940B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120102015A1 (en) * 2010-10-21 2012-04-26 Rillip Inc Method and System for Performing a Comparison
CN102436472A (en) * 2011-09-30 2012-05-02 北京航空航天大学 Multi- category WEB object extract method based on relationship mechanism
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
CN105630941A (en) * 2015-12-23 2016-06-01 成都电科心通捷信科技有限公司 Statistics and webpage structure based Wen body text content extraction method
CN105653668A (en) * 2015-12-29 2016-06-08 武汉理工大学 Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN105740370A (en) * 2013-05-10 2016-07-06 合肥工业大学 Online Web news content extraction system
CN106339455A (en) * 2016-08-26 2017-01-18 电子科技大学 Webpage text extracting method based on text tag feature mining
CN107203527A (en) * 2016-03-16 2017-09-26 北大方正集团有限公司 The text extracting method and system of news web page
CN107229668A (en) * 2017-03-07 2017-10-03 桂林电子科技大学 A kind of text extracting method based on Keywords matching
CN107436931A (en) * 2017-07-17 2017-12-05 广州特道信息科技有限公司 web page text extracting method and device
WO2018166457A1 (en) * 2017-03-15 2018-09-20 阿里巴巴集团控股有限公司 Neural network model training method and device, transaction behavior risk identification method and device
CN108763591A (en) * 2018-06-21 2018-11-06 湖南星汉数智科技有限公司 A kind of webpage context extraction method, device, computer installation and computer readable storage medium
CN108920434A (en) * 2018-06-06 2018-11-30 武汉酷犬数据科技有限公司 A kind of general Web page subject method for extracting content and system
CN109857990A (en) * 2018-12-18 2019-06-07 重庆邮电大学 A kind of financial class notice information abstracting method based on file structure and deep learning
CN110188193A (en) * 2019-04-19 2019-08-30 四川大学 A kind of electronic health record entity relation extraction method based on most short interdependent subtree
CN110377796A (en) * 2019-07-25 2019-10-25 中南民族大学 Text extracting method, device, equipment and storage medium based on dom tree
CN111414749A (en) * 2020-03-18 2020-07-14 哈尔滨理工大学 Social text dependency syntactic analysis system based on deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Zhijie et al.: "Research on Web Page Body Text Extraction Based on Text-Line Features", Software Guide (《软件导刊》) *
Shi Ruifang: "A New Method for Web Page Body Text Extraction", Telecom World (《通讯世界》) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326413A (en) * 2021-05-11 2021-08-31 世舶科技(武汉)有限公司 Webpage information extraction method, system, server and storage medium
CN113326413B (en) * 2021-05-11 2023-04-28 世舶科技(武汉)有限公司 Webpage information extraction method, system, server and storage medium
CN114118273A (en) * 2021-11-24 2022-03-01 南开大学 Limit multi-label classification data enhancement method based on label and text block attention mechanism
CN114118273B (en) * 2021-11-24 2024-04-26 南开大学 Limit multi-label classified data enhancement method based on label and text block attention mechanism
WO2023155303A1 (en) * 2022-02-16 2023-08-24 平安科技(深圳)有限公司 Webpage data extraction method and apparatus, computer device, and storage medium
CN114817639A (en) * 2022-05-18 2022-07-29 山东大学 Webpage graph convolution document ordering method and system based on comparison learning
CN114817639B (en) * 2022-05-18 2024-05-10 山东大学 Webpage diagram convolution document ordering method and system based on contrast learning

Also Published As

Publication number Publication date
CN112667940B (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN112667940B (en) Webpage text extraction method based on deep learning
CN110489395B (en) Method for automatically acquiring knowledge of multi-source heterogeneous data
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN103309961B (en) Webpage content extraction method based on Markov random field
CN112633431B (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
CN113515632A (en) Text classification method based on graph path knowledge extraction
Rahman Understanding the logical and semantic structure of large documents
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
Zhu et al. Webpage understanding: an integrated approach
Hasan et al. Bangla font recognition using transfer learning method
CN114265936A (en) Method for realizing text mining of science and technology project
Günther et al. Pre-trained web table embeddings for table discovery
Rezaei et al. Event detection in twitter by deep learning classification and multi label clustering virtual backbone formation
CN110083760B (en) Multi-recording dynamic webpage information extraction method based on visual block
CN117034948A (en) Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion
Chen et al. Toward the understanding of deep text matching models for information retrieval
CN115794998A (en) Professional field term mining method based on comparative learning
KR102214754B1 (en) Method and apparatus for generating product evaluation criteria
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
Yuliang et al. A novel approach for Web page modeling in personal information extraction
Nie et al. Webpage understanding: beyond page-level search
CN108897749A (en) Method for abstracting web page information and system based on syntax tree and text block density

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant