CN110377884B

CN110377884B - Document analysis method and device, computer equipment and storage medium

Info

Publication number: CN110377884B
Application number: CN201910509468.6A
Authority: CN
Inventors: 李双婕; 黄昉; 郝学峰; 史亚冰; 宋勋超; 蒋烨; 张扬; 朱勇
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-06-13
Filing date: 2019-06-13
Publication date: 2023-03-24
Anticipated expiration: 2039-06-13
Also published as: CN110377884A

Abstract

The invention discloses a document parsing method and device, computer equipment, and a storage medium. The method comprises: segmenting a to-be-processed document in a predetermined format into text nodes; and, for each text node of a predetermined type, performing the following processing: acquiring chapter mode information of the text node; determining the hierarchy of the text node according to the chapter mode information; and adding the text node to the document tree under construction according to the hierarchy. By applying the scheme of the invention, hierarchical parsing of documents and the like can be realized.

Description

Document analysis method and device, computer equipment and storage medium

[ technical field ]

The present invention relates to computer application technologies, and in particular, to a method and an apparatus for document parsing, a computer device, and a storage medium.

[ background of the invention ]

In important fields such as finance, public security, and the judiciary, there are a great number of requirements and functions that depend directly or indirectly on the knowledge graph. Examples include intelligent customer service, intelligent investment research, intelligent investment advisory, and risk-control decision making in the financial industry, and intelligent search, legal reasoning, intelligent case judgment, and document composition and review in the legal industry. Meanwhile, companies and organizations in these industries have accumulated a large number of professional documents, and these documents can be used for constructing an industry knowledge graph, so that the related requirements are met.

Typically, professional documents in these industries are stored in doc, docx, pdf, and similar formats, and their contents cannot be stored as structured information in a relational database. However, these professional documents generally have certain hierarchical structural features, which contain information required for constructing a knowledge graph and may assist in constructing Subject-Predicate-Object (SPO) triples of business knowledge and the like. Therefore, the documents need to be hierarchically parsed, but no good implementation currently exists.

[ summary of the invention ]

In view of the above, the invention provides a document parsing method, a document parsing device, a computer device and a storage medium.

The specific technical scheme is as follows:

a document parsing method, comprising:

segmenting a document with a preset format to be processed into text nodes;

aiming at the text nodes of the preset type, respectively carrying out the following processing:

acquiring chapter mode information of the text nodes;

determining the hierarchy of the text node according to the chapter mode information;

and adding the text nodes into the constructed document tree according to the hierarchy.

According to a preferred embodiment of the present invention, the predetermined format includes: a hypertext markup language format;

the method further comprises the following steps: and if the format of the document is not the hypertext markup language format, converting the document into the hypertext markup language format.

According to a preferred embodiment of the present invention, the segmenting the document in the predetermined format to be processed into text nodes includes: and cutting the document into text nodes with paragraph granularity.

According to a preferred embodiment of the present invention, the predetermined type of text node includes: text nodes cut from non-directory pages.

According to a preferred embodiment of the present invention, the acquiring chapter mode information of the text node includes:

if the text node has explicit chapter mode information, analyzing the chapter mode information as the chapter mode information of the text node;

or if the content identical to the text content in the text node exists in the directory page, acquiring chapter mode information corresponding to the text content in the directory page as the chapter mode information of the text node;

or, the hypertext markup language path information of the text node except the <li> tag is used as the chapter mode information of the text node.

According to a preferred embodiment of the invention, the method further comprises: identifying a text node with text content as a directory title, identifying a text node with text content as directory content behind the text node with text content as a directory title, and analyzing chapter mode information from the text node with text content as directory content; the text nodes with the text contents being the directory titles and the text nodes with the text contents being the directory contents are text nodes cut from the directory pages;

the acquiring chapter mode information corresponding to the text content in the directory page includes: and taking chapter mode information corresponding to the directory content containing the text content as chapter mode information corresponding to the text content.

According to a preferred embodiment of the invention, the method further comprises: and acquiring a document title of the document, and setting the document title as a root node of the document tree.

According to a preferred embodiment of the present invention, the obtaining the document title of the document includes:

and if the document has the title tag, taking the content corresponding to the title tag as the document title, otherwise, taking the file name of the document as the document title.

According to a preferred embodiment of the invention, the method further comprises: initializing a global chapter mode sequence, initially setting the global chapter mode sequence to be empty, adding chapter mode information corresponding to the document title into the global chapter mode sequence, and taking the position sequence of the chapter mode information corresponding to the document title in the global chapter mode sequence as the hierarchy of the root node;

the determining the hierarchy of the text node according to the chapter mode information comprises: determining whether chapter mode information for the text node is present in the global chapter mode sequence; if so, taking the position serial number of the chapter mode information of the text node in the global chapter mode sequence as the hierarchy of the text node; if not, adding the chapter mode information of the text node into the global chapter mode sequence, and taking the position serial number of the chapter mode information of the text node in the global chapter mode sequence as the hierarchy of the text node; and the position serial number is a sequence serial number of different chapter mode information added into the global chapter mode sequence.

According to a preferred embodiment of the invention, the method further comprises: and after the hierarchy of the text nodes is determined, deleting the chapter mode information of which the position sequence number is greater than the hierarchy of the text nodes in the global chapter mode sequence.

According to a preferred embodiment of the invention, the method further comprises: setting the root node as a reference tree node in an initial state;

adding the text node into the constructed document tree according to the hierarchy comprises: and comparing the hierarchy of the text node with the hierarchy of the reference tree node, adding the text node into the document tree as a child node or a brother node of the reference tree node according to a comparison result, and setting the text node as the reference tree node.

According to a preferred embodiment of the present invention, the comparing the hierarchy of the text node with the hierarchy of the reference tree node, and adding the text node as a child node or a sibling node to the document tree according to the comparison result comprises:

if the hierarchy of the text node is larger than that of the reference tree node, adding the text node into the document tree as a child node of the reference tree node;

if the hierarchy of the text node is equal to that of the reference tree node, adding the text node into the document tree as a brother node of the reference tree node;

if the hierarchy of the text node is smaller than the hierarchy of the reference tree node, executing the following predetermined processing: and taking the previous level node of the current reference tree node as an updated reference tree node, if the level of the updated reference tree node is smaller than that of the text node, adding the text node as a child node of the updated reference tree node into the document tree, and otherwise, repeatedly executing the preset processing.

A document parsing device, comprising: a segmentation unit and an analysis unit;

the segmentation unit is used for segmenting the document with the preset format to be processed into text nodes;

the analysis unit is used for respectively carrying out the following processing on the text nodes of the preset type:

acquiring chapter mode information of the text nodes;

the device further comprises: and the preprocessing unit is used for converting the document into the hypertext markup language format when the format of the document is not the hypertext markup language format.

According to a preferred embodiment of the present invention, the segmenting unit segments the document into text nodes of paragraph granularity.

According to a preferred embodiment of the present invention, when there is explicit chapter mode information in the text node, the parsing unit parses the chapter mode information as chapter mode information of the text node;

or, when the content identical to the text content in the text node exists in a directory page, the parsing unit acquires chapter mode information corresponding to the text content in the directory page as chapter mode information of the text node;

or, the parsing unit takes the hypertext markup language path information of the text node except the <li> tag as the chapter mode information of the text node.

According to a preferred embodiment of the present invention, the parsing unit is further configured to identify a text node whose text content is a directory title, identify a text node whose text content is a directory content and located after the text node whose text content is a directory title, and parse chapter mode information from the text node whose text content is a directory content; the text nodes with the text contents being the directory titles and the text nodes with the text contents being the directory contents are text nodes cut from the directory pages;

and the analysis unit takes the chapter mode information corresponding to the catalogue content containing the text content as the chapter mode information corresponding to the text content.

According to a preferred embodiment of the present invention, the parsing unit is further configured to obtain a document title of the document, and set the document title as a root node of the document tree.

According to a preferred embodiment of the present invention, if the parsing unit determines that the document has a title tag, the content corresponding to the title tag is used as the document title, otherwise, the file name of the document is used as the document title.

According to a preferred embodiment of the present invention, the parsing unit is further configured to initialize a global chapter mode sequence, where the global chapter mode sequence is initially empty, add chapter mode information corresponding to the document title to the global chapter mode sequence, and use a position order of the chapter mode information corresponding to the document title in the global chapter mode sequence as a level of the root node;

the parsing unit determines whether chapter mode information of the text node exists in the global chapter mode sequence, if so, the position serial number of the chapter mode information of the text node in the global chapter mode sequence is used as the hierarchy of the text node, and if not, the chapter mode information of the text node is added into the global chapter mode sequence, and the position serial number of the chapter mode information of the text node in the global chapter mode sequence is used as the hierarchy of the text node; and the position serial number is a sequence serial number of different chapter mode information added into the global chapter mode sequence.

According to a preferred embodiment of the present invention, the parsing unit is further configured to delete chapter mode information in the global chapter mode sequence, where a position number of the chapter mode information is greater than a hierarchy of the text node, after the hierarchy of the text node is determined.

According to a preferred embodiment of the present invention, the parsing unit is further configured to, in an initial state, set the root node as a reference tree node;

the analysis unit compares the hierarchy of the text node with the hierarchy of a reference tree node, adds the text node as a child node or a brother node of the reference tree node to the document tree according to a comparison result, and sets the text node as the reference tree node.

According to a preferred embodiment of the present invention, the parsing unit adds the text node as a child node of a reference tree node to the document tree if it is determined that the hierarchy of the text node is greater than the hierarchy of the reference tree node, adds the text node as a sibling node of the reference tree node to the document tree if it is determined that the hierarchy of the text node is equal to the hierarchy of the reference tree node, and performs the following predetermined processing if it is determined that the hierarchy of the text node is less than the hierarchy of the reference tree node: and taking the previous level node of the current reference tree node as an updated reference tree node, if the level of the updated reference tree node is smaller than that of the text node, adding the text node as a child node of the updated reference tree node into the document tree, and otherwise, repeatedly executing the preset processing.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.

Based on the introduction, the scheme of the invention can analyze the document into the form of the document tree, thereby realizing the hierarchical analysis of the document, further assisting the construction of the knowledge graph based on the document analysis, and improving the construction efficiency and accuracy and the like.

[ description of the drawings ]

Fig. 1 is a flowchart of a first embodiment of a document parsing method according to the present invention.

FIG. 2 is a schematic diagram of a document tree in a build process.

Fig. 3 is a first schematic diagram of the document tree after adding the text node 6 on the basis of fig. 2.

Fig. 4 is a second schematic diagram of the document tree after the text node 6 is added on the basis of fig. 2.

Fig. 5 is a third schematic diagram of the document tree after the text node 6 is added on the basis of fig. 2.

FIG. 6 is a flowchart illustrating a second embodiment of a document parsing method according to the present invention.

FIG. 7 is a schematic diagram of a composition structure of an embodiment of a document parsing apparatus according to the present invention.

FIG. 8 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.

[ detailed description of the embodiments ]

In order to make the technical scheme of the invention more clear and understood, the scheme of the invention is further explained by referring to the attached drawings and embodiments.

It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

FIG. 1 is a flowchart of a first embodiment of a document parsing method according to the present invention. As shown in fig. 1, the following detailed implementation is included.

In 101, a document in a predetermined format to be processed is segmented into text nodes.

At 102, the processing shown in 103-105 is performed for each text node of the predetermined type.

In 103, chapter mode information of the text node is acquired.

At 104, a hierarchy of text nodes is determined based on the chapter mode information.

At 105, text nodes are added to the constructed document tree according to the determined hierarchy.

Preferably, the predetermined format may be a hypertext Markup Language (HTML) format. If the document to be processed is not in the HTML format, such as doc format or pdf format, the document needs to be converted into the HTML format first.

Documents in doc and pdf formats may be converted using third-party tools. For example, a document in doc format (including doc, docx, etc.) may be converted using an open-source code library. For a document in pdf format, the text portion may be converted using an open-source tool, the picture portion may be converted into text using an Optical Character Recognition (OCR) tool, and the two results may be spliced together according to position information.

For the HTML-format document to be processed, the document title can be parsed out. If the document has a title tag (the HTML <title> tag), the content corresponding to the title tag can be used as the document title; otherwise, the file name of the document can be used as the document title. The document title may be set as the root node of the constructed document tree.
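The title rule above can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`; the names `document_title` and `_TitleParser` are hypothetical, and stripping the file extension in the fallback is an assumption, since the text only says that the file name is used.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element, if any."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def document_title(html_text, file_name):
    """Return the <title> content, or the file name when no title is found."""
    parser = _TitleParser()
    parser.feed(html_text)
    parser.close()
    title = "".join(parser.parts).strip()
    # Assumption: an empty title is treated like a missing one, and the
    # fallback uses the file name without its extension.
    return title or Path(file_name).stem
```

The returned string would then serve as the text of the root node of the document tree.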

For a document in HTML format to be processed, it can also be segmented into text nodes of paragraph granularity. The segmentation can be performed by using h-tags, p-tags, div-tags, etc. according to the existing manner, so as to segment the input document into text nodes of paragraph granularity, and record the HTML tag, style information, etc. of each text node.
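One possible way to perform this segmentation is sketched below with Python's standard `html.parser`. The class name `ParagraphSplitter`, the chosen set of block tags, and the dictionary layout of a text node are all assumptions made for illustration; the text only states that h, p, and div tags are used and that the tag and style information are recorded.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit paragraph-granularity text nodes.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div"}


class ParagraphSplitter(HTMLParser):
    """Splits HTML into text nodes, recording each node's tag and style."""

    def __init__(self):
        super().__init__()
        self.nodes = []     # one dict per text node: tag, style, text
        self._stack = []    # currently open block tags as (tag, style)
        self._buffer = []   # text collected since the last block boundary

    def _flush(self):
        text = "".join(self._buffer).strip()
        self._buffer = []
        # Text outside any block tag (for example the <title>) is dropped.
        if text and self._stack:
            tag, style = self._stack[-1]
            self.nodes.append({"tag": tag, "style": style, "text": text})

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._stack.append((tag, dict(attrs).get("style") or ""))

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack:
            self._flush()
            self._stack.pop()

    def handle_data(self, data):
        self._buffer.append(data)


def split_into_text_nodes(html_text):
    splitter = ParagraphSplitter()
    splitter.feed(html_text)
    splitter.close()
    return splitter.nodes
```

Inline tags such as `<b>` do not start a new node, so a paragraph with mixed formatting stays a single text node. The sketch does not attempt to repair mismatched closing tags.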

The chapter information contained in the document directory can be parsed out based on style information or the like to facilitate subsequent hierarchical parsing. Analyzing the document directory mainly comprises three steps of identifying a directory title node, identifying a directory content node and analyzing chapter information. The directory title nodes and the directory content nodes are text nodes cut from the directory pages.

First, a text node whose text content is a directory title, that is, a directory title node, can be identified. A document containing a directory usually has one text node (possibly one line) that contains the two-character word "directory" and carries a special style such as bold or centered. Accordingly, when identifying the directory title node, if there is a text node whose text content contains the word "directory" and which has a special style such as bold, centered, or a special font, that text node can be regarded as the directory title node.

Then, the text nodes that are located after the directory title node and whose text content is directory content, that is, the directory content nodes, can be identified. In a document containing a directory, a directory content node is typically located after the directory title node and has a specific structure, such as explicit chapter information like "chapter x" at the beginning of the line and page number information (e.g., the starting page of the chapter) at the end of the line. Accordingly, when identifying a directory content node, if a text node is located after the directory title node and its text content starts with chapter information and ends with page number information, that text node can be regarded as a directory content node.

Chapter information, such as the chapter mode information "chapter x", can be parsed from the directory content nodes.
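The three directory-parsing steps can be sketched as below. Everything specific here is an assumption for illustration: the keyword `"Contents"`, the English patterns "Chapter"/"Section" in `ENTRY_RE`, the style test on the substrings `bold` and `center`, and the rule that the first non-matching node ends the directory. A real document would need patterns that match its own language and layout.

```python
import re

# Assumption: a directory entry starts with "Chapter N" or "Section N" and
# ends with a page number, optionally preceded by dot leaders.
ENTRY_RE = re.compile(r"^(Chapter|Section)\s+(\d+)\s+(.*?)[\s.]*(\d+)$")


def parse_directory(nodes, keyword="Contents"):
    """Return {heading text: (chapter mode, chapter number)} from a directory.

    `nodes` is a list of dicts with "text" and "style" keys, in document order.
    """
    entries = {}
    in_directory = False
    for node in nodes:
        text = node["text"].strip()
        if not in_directory:
            # Step 1: directory title node = keyword plus a special style.
            style = node.get("style", "")
            if keyword in text and ("bold" in style or "center" in style):
                in_directory = True
            continue
        # Step 2: directory content nodes follow the directory title node.
        match = ENTRY_RE.match(text)
        if match is None:
            break  # assumption: the first non-matching node ends the directory
        # Step 3: parse the chapter mode ("Chapter x") and the chapter number.
        kind, number, heading = match.group(1), int(match.group(2)), match.group(3)
        entries[heading] = (f"{kind} x", number)
    return entries
```

The resulting mapping can later be consulted when a body heading omits its explicit chapter marker.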

For a text node of a predetermined type, such as a text node cut from a non-directory page, parent-child, sibling relationships and the like of different text nodes on a document tree structure can be respectively analyzed, so that a document tree is constructed. Specifically, for each text node, the following processing may be performed: acquiring chapter mode information of text nodes; determining the hierarchy of the text nodes according to the chapter mode information; and adding the text nodes into the document tree according to the determined hierarchy.

Accordingly, the following processing may be performed in advance: initializing a global chapter mode sequence, initializing a document tree, setting the obtained document title as the root node of the document tree, adding the chapter mode information corresponding to the document title into the global chapter mode sequence, and taking the position serial number of the chapter mode information corresponding to the document title in the global chapter mode sequence as the hierarchy of the root node. In addition, the root node may also be set as the reference tree node. The position serial number is the sequence number in which different chapter mode information is added into the global chapter mode sequence. The chapter mode information corresponding to the document title may be "title".

For each text node cut from a non-directory page, the chapter mode information of the text node may be obtained first, and the obtaining manner may include, but is not limited to, the following.

1) Text mode: if the text node has explicit chapter mode information, the chapter mode information can be analyzed to be used as the chapter mode information of the text node.

For example, if the text node includes the text content "Chapter 17 The fee and tax of the fund", in which the explicit chapter mode information "chapter x" exists, "chapter x" can be parsed out as the chapter mode information of the text node.

2) Whether chapter pattern information in the directory page is hit: if the content identical to the text content in the text node exists in the catalog page, the chapter mode information corresponding to the text content in the catalog page can be acquired as the chapter mode information of the text node.

The obtaining of the chapter mode information corresponding to the text content in the catalog page may be that the chapter mode information corresponding to the catalog content containing the text content is used as the chapter mode information corresponding to the text content.

Sometimes the explicit chapter mode information stated in the document directory is omitted in the body text. In that case, the text content of the text node can be matched against the directory page, and if the same content exists there, the corresponding chapter mode information in the directory can be used.

For example, if the text node includes the text content "The fee and tax of the fund", and the directory contains the entry "Chapter 17 The fee and tax of the fund", then "chapter x" may be used as the chapter mode information of the text node.

3) HTML tag: hypertext markup language path (HTML XPath) information of the text node other than the <li> tag may be used as the chapter mode information of the text node.

The chapter mode information of the text node can be obtained through HTML tags such as <ol>, <ul>, and <li>, and the HTML XPath information except the <li> tag can be used as the chapter mode information.

Both <ol> and <ul> are parent tags of <li> and are used together with <li>. The HTML XPath is the path information of a text node in the HTML, and each text node has an HTML XPath.

For example, if the text node includes the text content "The fee and tax of the fund" and the corresponding HTML XPath is /html/body/div[1]/ol/li[2], then /html/body/div[1]/ol can be used as the chapter mode information of the text node.
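Manner 3) amounts to dropping the `li` steps from the node's path. The helper below is a small sketch of that idea; the function name `xpath_chapter_mode` is hypothetical, and it assumes paths written in the bracketed positional form `li[2]`. It also returns the index of the last `li` step, which is one possible reading of "the number in the <li> tag".

```python
import re

LI_STEP = re.compile(r"li(?:\[(\d+)\])?")


def xpath_chapter_mode(xpath):
    """Return (path without li steps, index of the last li step or None)."""
    steps = [step for step in xpath.split("/") if step]
    kept = []
    number = None
    for step in steps:
        match = LI_STEP.fullmatch(step)
        if match is None:
            kept.append(step)          # ordinary step: part of the chapter mode
        elif match.group(1) is not None:
            number = int(match.group(1))  # li step: remember its position only
    return "/" + "/".join(kept), number
```

Because the `li` index is removed, all items of the same list share one chapter mode and therefore end up at the same level of the document tree.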

Using the manners shown in 1), 2), and 3) above, other chapter information of the text node, such as chapter number information, may also be obtained. For example, the chapter number 17 may be obtained in manners 1) and 2), and the chapter number 2 (i.e., the number in the <li> tag) may be obtained in manner 3).

In practical applications, the specific manner used may be determined according to practical requirements, such as preferentially using manner 1), then manner 2), and finally manner 3).
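The suggested priority (manner 1, then 2, then 3) could be expressed as a simple fallback chain. The sketch below is illustrative only: the function name `chapter_mode`, the English pattern in `EXPLICIT_RE`, and the shape of the `directory` argument (a mapping from directory heading text to a `(mode, number)` pair) are assumptions rather than anything the text prescribes.

```python
import re

# Assumption: explicit chapter markers look like "Chapter 17" or "Section 3".
EXPLICIT_RE = re.compile(r"^(Chapter|Section)\s+\d+\b")
LI_STEP = re.compile(r"li(?:\[\d+\])?")


def chapter_mode(text, xpath, directory):
    """Pick a chapter mode for one text node, trying manners 1), 2), 3) in order."""
    text = text.strip()

    # Manner 1: explicit chapter mode information in the text itself.
    match = EXPLICIT_RE.match(text)
    if match:
        return f"{match.group(1)} x"

    # Manner 2: a directory entry contains the node's text content.
    if text:
        for heading, (mode, _number) in directory.items():
            if text in heading:
                return mode

    # Manner 3: the HTML XPath of the node with its li steps removed.
    steps = [s for s in xpath.split("/") if s and not LI_STEP.fullmatch(s)]
    return "/" + "/".join(steps)
```

Every node receives some chapter mode under this scheme, because manner 3) always yields a path.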

After the chapter mode information of the text node is acquired, the hierarchy of the text node can be determined according to the chapter mode information. For example, it may be determined whether the chapter mode information of the text node exists in the global chapter mode sequence, if so, a position serial number of the chapter mode information of the text node in the global chapter mode sequence may be used as a hierarchy of the text node, and if not, the chapter mode information of the text node may be added to the global chapter mode sequence, and a position serial number of the chapter mode information of the text node in the global chapter mode sequence may be used as a hierarchy of the text node. As described above, the position number is the sequence number in which different chapter mode information is added to the global chapter mode sequence.

For example, suppose the chapter mode information of the text node is "chapter x", and the global chapter mode sequence includes the following chapter mode information: "title" and "chapter x". The comparison shows that the chapter mode information of the text node already exists in the global chapter mode sequence, and therefore the position serial number 2 of the chapter mode information "chapter x" in the global chapter mode sequence can be used as the hierarchy of the text node.

For another example, suppose the chapter mode information of the text node is "section x", and the global chapter mode sequence includes the following chapter mode information: "title" and "chapter x". The comparison shows that the chapter mode information of the text node does not exist in the global chapter mode sequence. Therefore, the chapter mode information "section x" can be added to the global chapter mode sequence, so that the sequence includes "title", "chapter x", and "section x", and the position serial number 3 of "section x" in the global chapter mode sequence can be used as the hierarchy of the text node.
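The lookup-or-append rule can be written in a few lines. In this sketch the global chapter mode sequence is modeled as a plain Python list and the position serial number is the 1-based list index; the function name `level_of` is hypothetical.

```python
def level_of(mode, sequence):
    """Return the 1-based position of `mode` in `sequence`, appending it if absent."""
    if mode not in sequence:
        sequence.append(mode)
    return sequence.index(mode) + 1


# The two examples from the text, starting from the same sequence.
sequence = ["title", "chapter x"]
existing_level = level_of("chapter x", sequence)  # already present
new_level = level_of("section x", sequence)       # appended at the end
```

After these two calls the sequence holds "title", "chapter x", and "section x", matching the second example.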

After the hierarchy of the text node is determined, chapter mode information with the position sequence number larger than the hierarchy of the text node in the global chapter mode sequence can be deleted, namely the length of the global chapter mode sequence is cut to be equal to the hierarchy of the current text node.

For example, suppose the global chapter mode sequence includes the following chapter mode information: "title", "chapter x", and "section x", and the chapter mode information of the current text node is "chapter x". The "section x" entry can then be deleted, so that the global chapter mode sequence records the path from the root node to the current text node.
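A short trace may help show what the truncation achieves. The sketch below again models the global chapter mode sequence as a list; the function name `assign_level` and the sample chapter modes are hypothetical. The last step reflects one reading of the text: because deeper entries are dropped when a new chapter starts, a list that appears directly under that chapter is assigned level 3 rather than level 4.

```python
def assign_level(mode, sequence):
    """Determine the level of `mode`, then cut the sequence to that length."""
    if mode not in sequence:
        sequence.append(mode)
    level = sequence.index(mode) + 1
    # Delete entries whose position serial number is greater than the level.
    del sequence[level:]
    return level


sequence = ["title"]
trace = []
for mode in ["chapter x", "section x", "chapter x", "/html/body/ol"]:
    level = assign_level(mode, sequence)
    trace.append((mode, level, list(sequence)))
```

After the third step the sequence is back to "title", "chapter x", so the list-based mode in the fourth step is appended at position 3.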

The text nodes may then be added to the constructed document tree according to the determined hierarchy of text nodes. Initially, the root node may be set as a reference tree node, but the reference tree node is dynamically changing. For a text node, the hierarchy of the text node may be compared with the hierarchy of the reference tree node, the text node may be added to the document tree as a child node or a sibling node of the reference tree node according to the comparison result, and thereafter, the text node may be set as the reference tree node.

Specifically, when comparing the hierarchy of the text node with the hierarchy of the reference tree node, the manner of adding the text node as a child node or a sibling node to the document tree according to the comparison result may be: if the hierarchy of the text node is larger than that of the reference tree node, the text node can be used as a child node of the reference tree node and added into the document tree; if the hierarchy of the text node is equal to the hierarchy of the reference tree node, the text node can be added into the document tree as a brother node of the reference tree node; if the hierarchy of the text node is smaller than the hierarchy of the reference tree node, the following predetermined process may be performed: and taking the previous level node of the current reference tree node as an updated reference tree node, if the level of the updated reference tree node is smaller than that of the text node, adding the text node as a child node of the updated reference tree node into the document tree, and otherwise, repeatedly executing the preset processing.

FIG. 2 is a schematic diagram of a document tree in a build process. Assuming that there are four levels of nodes, a root node at level 1, a text node 2 at level 2, a text node 3 and a text node 4 at level 3, and a text node 5 at level 4, as shown in fig. 2, the level of the text node 6 may be determined first when a new text node 6 needs to be added to the document tree.

Assuming that the level of the text node 6 is 4 and the text node 5 is the reference tree node, the level of the text node 6 is equal to the level of the reference tree node by comparison, so that the text node 6 can be added to the document tree as a brother node of the reference tree node, as shown in fig. 3, and fig. 3 is a first schematic diagram of the document tree after the text node 6 is added on the basis of fig. 2.

Assuming that the level of the text node 6 is 5 and the text node 5 is a reference tree node, the level of the text node 6 is greater than that of the reference tree node by comparison, so that the text node 6 can be added to the document tree as a child node of the reference tree node, as shown in fig. 4, and fig. 4 is a second schematic diagram of the document tree after the text node 6 is added to the document tree shown in fig. 2.

Assuming that the level of the text node 6 is 3 and the text node 5 is the reference tree node, the comparison shows that the level of the text node 6 is smaller than the level of the reference tree node, so the reference tree node needs to be updated. First, the text node 4, which is the node one level above the current reference tree node, is used as the updated reference tree node. Its level is 3, which is still not smaller than the level of the text node 6, so the reference tree node needs to be updated again: the text node 2, which is the node one level above the text node 4, is used as the updated reference tree node. The level of the text node 2 is smaller than the level of the text node 6, so the text node 6 can be added to the document tree as a child node of the text node 2, as shown in fig. 5, which is a third schematic diagram of the document tree after the text node 6 is added on the basis of fig. 2.
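The three cases can be reproduced with a small tree class. This is a sketch under assumptions: the names `TreeNode` and `add_node` are hypothetical, text node 5 is taken to be a child of text node 4 (as the walk-through implies), and the guard that stops climbing at the root is an addition not mentioned in the text.

```python
class TreeNode:
    def __init__(self, name, level, parent=None):
        self.name = name
        self.level = level
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def add_node(reference, name, level):
    """Attach a new node relative to `reference`; the result is the next reference."""
    if level > reference.level:
        return TreeNode(name, level, parent=reference)         # child node
    if level == reference.level:
        return TreeNode(name, level, parent=reference.parent)  # sibling node
    # Smaller level: move up until an ancestor with a smaller level is found.
    node = reference.parent
    while node.parent is not None and node.level >= level:
        node = node.parent
    return TreeNode(name, level, parent=node)


# The tree of fig. 2, with text node 5 as the reference tree node.
root = TreeNode("root", 1)
node2 = TreeNode("text node 2", 2, root)
node3 = TreeNode("text node 3", 3, node2)
node4 = TreeNode("text node 4", 3, node2)
node5 = TreeNode("text node 5", 4, node4)

as_sibling = add_node(node5, "text node 6", 4)   # fig. 3: sibling of node 5
as_child = add_node(node5, "text node 6", 5)     # fig. 4: child of node 5
as_uncle = add_node(node5, "text node 6", 3)     # fig. 5: child of node 2
```

The three calls are shown against the same tree only for brevity; in practice each new node is added once and then becomes the reference tree node.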

Each text node is processed in this manner and added to the document tree, which completes the construction of the document tree with a hierarchical structure.
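The three cases above can be traced with a minimal Python sketch. The node numbering follows fig. 2, with the root labeled 1 for convenience; the parent of text node 3 is assumed to be text node 2, and the names `LEVEL`, `PARENT` and `find_parent` are illustrative only and do not come from the embodiment.

```python
# Levels and parent links of the tree in fig. 2 (node 1 stands for the root).
# Assumption: text node 3 hangs under text node 2, like text node 4.
LEVEL = {1: 1, 2: 2, 3: 3, 4: 3, 5: 4}
PARENT = {2: 1, 3: 2, 4: 2, 5: 4}


def find_parent(new_level, reference):
    """Return (parent, visited): the node the new text node is attached to,
    and the reference tree nodes that were examined on the way."""
    visited = [reference]
    if new_level > LEVEL[reference]:
        # Greater level: child of the reference tree node (fig. 4).
        return reference, visited
    if new_level == LEVEL[reference]:
        # Equal level: sibling, so it shares the reference node's parent (fig. 3).
        return PARENT[reference], visited
    # Smaller level: move to the previous-level node until one with a
    # strictly smaller level is found (fig. 5).
    while True:
        reference = PARENT[reference]
        visited.append(reference)
        if LEVEL[reference] < new_level:
            return reference, visited


print(find_parent(4, 5))  # sibling case
print(find_parent(5, 5))  # child case
print(find_parent(3, 5))  # reference node updated twice
```

The sketch assumes that every non-root text node has a level of at least 2, so that climbing always stops at the root at the latest.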

In practical applications, the text nodes may be processed sequentially in a predetermined order, for example, in the order in which their text content appears in the document.

Based on the above description, fig. 6 is a flowchart of a second embodiment of the document parsing method according to the present invention. As shown in fig. 6, the method includes the following specific implementation.

At 601, the document to be processed is converted into HTML format.

For example, if the document to be processed is in doc format, it first needs to be converted to HTML format.

At 602, a document title of a document is obtained.

If a title tag exists in the document, the content corresponding to the title tag can be used as the document title; otherwise, the file name of the document can be used as the document title.
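A minimal sketch of this fallback, using only the Python standard library, might look as follows. The helper names are hypothetical, and dropping the file extension when falling back to the file name is an assumption not stated in the embodiment.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element, if any."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def document_title(html, file_path):
    """Use the title tag content if present and non-empty, else the file name."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    title = (parser.title or "").strip()
    # Assumption: the extension is not part of the title.
    return title if title else Path(file_path).stem
```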

At 603, the document is segmented into text nodes of paragraph granularity.

The document can be segmented into a series of text nodes of paragraph granularity by using h tags, p tags, div tags and the like in an existing manner.
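One possible way to perform such segmentation is sketched below with the standard `html.parser` module. The set of block tags, the class name `ParagraphSplitter` and the `(tag, text)` representation of a text node are assumptions made for illustration; the embodiment only states that h, p, div and similar tags are used.

```python
from html.parser import HTMLParser

# Assumed set of tags that delimit paragraph-granularity text nodes.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div"}


class ParagraphSplitter(HTMLParser):
    """Emits one (tag, text) pair per run of text inside the innermost block tag."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # (tag, text) pairs in document order
        self._stack = []   # open block elements as [tag, text fragments]

    def _flush(self):
        # Emit the text collected so far for the innermost open block.
        if self._stack:
            tag, parts = self._stack[-1]
            text = " ".join("".join(parts).split())
            if text:
                self.nodes.append((tag, text))
            parts.clear()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._stack.append([tag, []])

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag not in BLOCK_TAGS or all(t != tag for t, _ in self._stack):
            return
        # Close any unclosed inner blocks, then the matching one.
        while self._stack:
            self._flush()
            if self._stack.pop()[0] == tag:
                break


def split_into_text_nodes(html):
    splitter = ParagraphSplitter()
    splitter.feed(html)
    splitter.close()
    while splitter._stack:   # flush text left in unclosed elements
        splitter._flush()
        splitter._stack.pop()
    return splitter.nodes
```

Inline markup such as bold text stays inside the enclosing paragraph, and text outside any block tag is ignored in this sketch.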

At 604, the document tree is initialized with the document title as the root node of the document tree, with a level of 1.

The global chapter mode sequence can be initialized to be empty, the chapter mode information corresponding to the document title can then be added to the global chapter mode sequence, and the position serial number of that chapter mode information in the global chapter mode sequence can be used as the hierarchy of the root node. The position serial number is the ordinal position at which each distinct piece of chapter mode information was added to the global chapter mode sequence.

The chapter mode information corresponding to the document title may be "title". As can be seen, the position number of the chapter mode information corresponding to the document title in the global chapter mode sequence is 1, and thus, it is possible to determine that the hierarchy of the root node is 1.

At 605, the root node is set as the reference tree node.

At 606, for each text node of the predetermined type, processing is performed as shown at 607-609, respectively.

The predetermined type of text node may refer to a text node cut out from a non-directory page.

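At 607, chapter mode information of the text node is obtained.

The manner of obtaining the chapter mode information may include, but is not limited to, the following:

if explicit chapter mode information exists in the text node, the chapter mode information is parsed out and used as the chapter mode information of the text node; or, if content identical to the text content in the text node exists in the directory page, the chapter mode information corresponding to that text content in the directory page is obtained and used as the chapter mode information of the text node; or, the HTML XPath information of the text node, excluding <li> tags, is used as the chapter mode information of the text node.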

To this end, the document directory may be parsed in advance to obtain the chapter mode information it contains. For example, the text node whose text content is the directory title is identified first, then the text nodes whose text content is directory content and which are located after that node are identified, and the chapter mode information is parsed from the text nodes whose text content is directory content. Both the text node whose text content is the directory title and the text nodes whose text content is directory content are text nodes cut from directory pages.

Accordingly, obtaining the chapter mode information corresponding to the text content in the directory page may refer to taking the chapter mode information of the directory content that contains the text content as the chapter mode information corresponding to the text content.
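The lookup order described for 607 can be sketched as follows. The concrete representation of chapter mode information is not specified in this part of the description, so the sketch assumes, purely for illustration, that an explicit chapter mode is a numbering pattern such as "Chapter N" or "N.N" recognized by regular expressions; the pattern list and all function names are hypothetical.

```python
import re

# Hypothetical explicit patterns: each pairs a regex with a chapter mode label.
EXPLICIT_PATTERNS = [
    (re.compile(r"^Chapter\s+\d+\b"), "Chapter N"),
    (re.compile(r"^\d+\.\d+\.\d+\b"), "N.N.N"),
    (re.compile(r"^\d+\.\d+\b"), "N.N"),
    (re.compile(r"^\(\d+\)"), "(N)"),
]


def explicit_mode(text):
    """Return the label of the first pattern matching the start of the text, or None."""
    for pattern, mode in EXPLICIT_PATTERNS:
        if pattern.match(text):
            return mode
    return None


def parse_directory(entries):
    """Map each directory entry that carries an explicit pattern to its chapter mode."""
    modes = {}
    for entry in entries:
        mode = explicit_mode(entry)
        if mode is not None:
            modes[entry] = mode
    return modes


def path_without_li(tag_path):
    """Join the tag path from the document root, skipping <li> steps."""
    return "/" + "/".join(tag for tag in tag_path if tag != "li")


def chapter_mode(text, tag_path, directory_modes):
    # 1. Explicit chapter mode information in the text node itself.
    mode = explicit_mode(text)
    if mode is not None:
        return mode
    # 2. A directory entry that contains the text content of the node.
    for entry, entry_mode in directory_modes.items():
        if text in entry:
            return entry_mode
    # 3. Fall back to the HTML path of the node, excluding <li> tags.
    return path_without_li(tag_path)
```

The substring test in step 2 is deliberately naive; a practical implementation would probably normalize whitespace and page-number leaders in directory entries before comparing.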

At 608, a hierarchy of text nodes is determined from the chapter mode information.

It may first be determined whether the chapter mode information of the text node exists in the global chapter mode sequence. If so, the position serial number of the chapter mode information of the text node in the global chapter mode sequence can be used as the hierarchy of the text node. If not, the chapter mode information of the text node can be added to the global chapter mode sequence, and its position serial number in the global chapter mode sequence can then be used as the hierarchy of the text node.

In addition, after the hierarchy of the text node is determined, the chapter mode information whose position serial number in the global chapter mode sequence is greater than the hierarchy of the text node can be deleted. This operation need not be performed if the global chapter mode sequence contains no chapter mode information whose position serial number is greater than the hierarchy of the text node.
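The bookkeeping of 604 and 608 can be illustrated with a short sketch in which the global chapter mode sequence is a plain Python list and position serial numbers start at 1. The function names and the sample chapter mode strings are assumptions for illustration only.

```python
def init_sequence():
    """Global chapter mode sequence after 604: only the document title's mode."""
    return ["title"]


def determine_level(mode, sequence):
    """Return the hierarchy of a text node and update the sequence in place."""
    if mode not in sequence:
        sequence.append(mode)          # new mode: goes to the end of the sequence
    level = sequence.index(mode) + 1   # 1-based position serial number
    del sequence[level:]               # drop modes positioned after this level
    return level


sequence = init_sequence()
for mode in ["Chapter N", "N.N", "/html/body/p", "Chapter N", "(N)"]:
    print(mode, determine_level(mode, sequence), sequence)
```

In this run the second occurrence of "Chapter N" truncates the sequence back to two entries, so the subsequent "(N)" mode is assigned level 3 rather than level 5. This is the effect of the deletion step: deeper modes seen in an earlier chapter do not constrain the levels in the next one.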

At 609, the text node is added to the document tree based on the hierarchy of the reference tree node and the hierarchy of the text node, and the reference tree node is updated.

The hierarchy of the text node may be compared with the hierarchy of the reference tree node, the text node may be added to the document tree as a child node or a sibling node of the reference tree node according to a comparison result, and the text node may be set as the reference tree node.

If the hierarchy of the text node is greater than the hierarchy of the reference tree node, the text node can be added to the document tree as a child node of the reference tree node. If the hierarchy of the text node is equal to the hierarchy of the reference tree node, the text node can be added to the document tree as a sibling node of the reference tree node. If the hierarchy of the text node is less than the hierarchy of the reference tree node, the following predetermined processing can be performed: the previous-level node of the current reference tree node is taken as the updated reference tree node; if the hierarchy of the updated reference tree node is less than that of the text node, the text node is added to the document tree as a child node of the updated reference tree node; otherwise, the predetermined processing is repeated.
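Putting 605 and 609 together, a whole document tree can be built from text nodes whose hierarchies are already known. The following is a sketch under the assumption that each tree node keeps a link to its previous-level node; the `Node` class, `add_text_node`, `build_tree` and `outline` are illustrative names, not part of the embodiment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    text: str
    level: int
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


def add_text_node(reference: Node, text: str, level: int) -> Node:
    """Attach a new node relative to the reference tree node and return it,
    so that the caller can use it as the next reference tree node."""
    if level > reference.level:
        parent = reference                    # child of the reference node
    elif level == reference.level:
        parent = reference.parent             # sibling of the reference node
    else:
        parent = reference.parent             # predetermined processing:
        while parent is not None and parent.level >= level:
            parent = parent.parent            # keep moving one level up
    if parent is None:
        raise ValueError("no tree node with a smaller hierarchy was found")
    node = Node(text, level, parent)
    parent.children.append(node)
    return node


def build_tree(title, items):
    """items: (text, level) pairs in the order they appear in the document."""
    root = Node(title, 1)
    reference = root                          # 605: the root is the first reference
    for text, level in items:
        reference = add_text_node(reference, text, level)
    return root


def outline(node, depth=0):
    """Render the tree as indented lines, two spaces per depth step."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = build_tree("Doc", [("Ch1", 2), ("1.1", 3), ("para", 4), ("1.2", 3), ("Ch2", 2)])
print("\n".join(outline(tree)))
```

Returning the newly added node from `add_text_node` mirrors the step in which the text node itself becomes the reference tree node for the next text node.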

It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In short, with the scheme of the above method embodiments, a document can be parsed into the form of a document tree, thereby realizing hierarchical parsing of the document, which can assist knowledge graph construction based on document parsing and improve construction efficiency, accuracy and the like.

The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.

FIG. 7 is a schematic diagram of a composition structure of an embodiment of a document parsing apparatus according to the present invention. As shown in fig. 7, the apparatus includes: a segmentation unit 701 and a parsing unit 702.

The segmentation unit 701 is configured to segment a document in a predetermined format to be processed into text nodes.

The parsing unit 702 is configured to perform the following processing for text nodes of a predetermined type: acquiring chapter mode information of the text node; determining the hierarchy of the text node according to the chapter mode information; and adding the text node into the constructed document tree according to the hierarchy.

The predetermined format may be an HTML format. The apparatus shown in fig. 7 may further include: a preprocessing unit 700, configured to convert the document into the HTML format when the format of the document is not the HTML format. The segmentation unit 701 may segment the HTML-formatted document into text nodes of paragraph granularity.
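The cooperation of the three units can be pictured as a simple pipeline. The sketch below models each unit as a callable supplied by the caller; the class name, field names and signatures are assumptions made for illustration and say nothing about how the units are actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DocumentParsingApparatus:
    preprocess: Callable[[str, str], str]   # unit 700: (content, format) -> HTML
    segment: Callable[[str], List[str]]     # unit 701: HTML -> text nodes
    parse: Callable[[List[str]], object]    # unit 702: text nodes -> document tree

    def run(self, content: str, fmt: str):
        # Unit 700 is only involved when the document is not already HTML.
        html = content if fmt == "html" else self.preprocess(content, fmt)
        return self.parse(self.segment(html))


# Usage with trivial stand-in units.
apparatus = DocumentParsingApparatus(
    preprocess=lambda content, fmt: "<p>" + content + "</p>",
    segment=lambda html: [html],
    parse=lambda nodes: {"root": nodes},
)
print(apparatus.run("hello", "doc"))
```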

For a document in the HTML format, the parsing unit 702 may obtain the document title of the document and set the document title as the root node of the document tree. If a title tag exists in the document, the content corresponding to the title tag may be used as the document title; otherwise, the file name of the document may be used as the document title.

The parsing unit 702 may further parse the chapter mode information included in the document directory to facilitate subsequent hierarchical parsing. For example, the text node whose text content is the directory title may be identified first, then the text nodes whose text content is directory content and which are located after that node may be identified, and the chapter mode information may be parsed from the text nodes whose text content is directory content. Both the text node whose text content is the directory title and the text nodes whose text content is directory content are text nodes cut from directory pages.

For a predetermined type of text node, such as a text node cut from a non-directory page, the parsing unit 702 may parse out the parent-child, sibling relationships, etc. of different text nodes on the document tree structure, respectively, so as to construct a document tree. Specifically, for each text node, the following processing may be performed: acquiring chapter mode information of text nodes; determining the hierarchy of the text nodes according to the chapter mode information; and adding the text nodes into the document tree according to the determined hierarchy.

Before this, the parsing unit 702 may also perform the following processing: initializing the global chapter mode sequence to be empty, initializing the document tree, setting the document title as the root node of the document tree, adding the chapter mode information corresponding to the document title into the global chapter mode sequence, and taking the position serial number of the chapter mode information corresponding to the document title in the global chapter mode sequence as the hierarchy of the root node. The position serial number is the ordinal position at which each distinct piece of chapter mode information was added to the global chapter mode sequence. The chapter mode information corresponding to the document title may be "title".

For each text node cut from a non-directory page, the parsing unit 702 may first obtain the chapter mode information of the text node in a manner that may include, but is not limited to, the following: when explicit chapter mode information exists in the text node, the chapter mode information is parsed out and used as the chapter mode information of the text node; or, when content identical to the text content in the text node exists in the directory page, the chapter mode information corresponding to the text content in the directory page is obtained and used as the chapter mode information of the text node, where the chapter mode information of the directory content containing the text content can be used as the chapter mode information corresponding to the text content; or, the HTML XPath information of the text node, excluding <li> tags, is used as the chapter mode information of the text node.

After the chapter mode information of the text node is obtained, the parsing unit 702 may determine the hierarchy of the text node according to the chapter mode information. Specifically, it may be determined whether the chapter mode information of the text node exists in the global chapter mode sequence, if so, a position serial number of the chapter mode information of the text node in the global chapter mode sequence may be used as a hierarchy of the text node, and if not, the chapter mode information of the text node may be added to the global chapter mode sequence, and a position serial number of the chapter mode information of the text node in the global chapter mode sequence may be used as a hierarchy of the text node. As described above, the position number is the sequence number in which different chapter mode information is added to the global chapter mode sequence.

After determining the hierarchy of the text node, the parsing unit 702 may further delete the chapter mode information in the global chapter mode sequence whose position sequence number is greater than the hierarchy of the text node, i.e., truncate the length of the global chapter mode sequence to be equal to the hierarchy of the current text node.

The parsing unit 702 may add the text node to the constructed document tree according to the determined hierarchy of the text node. Initially, the root node may be set as a reference tree node, but the reference tree node is dynamically changing. For a text node, the hierarchy of the text node may be compared with the hierarchy of the reference tree node, the text node may be added to the document tree as a child node or a sibling node of the reference tree node according to the comparison result, and thereafter, the text node may be set as the reference tree node.

Specifically, the hierarchy of the text node may be compared with the hierarchy of the reference tree node, and the text node may be added to the document tree as a child node or a sibling node according to the comparison result as follows: if the hierarchy of the text node is greater than that of the reference tree node, the text node can be added to the document tree as a child node of the reference tree node; if the hierarchy of the text node is equal to the hierarchy of the reference tree node, the text node can be added to the document tree as a sibling node of the reference tree node; if the hierarchy of the text node is less than the hierarchy of the reference tree node, the following predetermined processing may be performed: the previous-level node of the current reference tree node is taken as the updated reference tree node; if the hierarchy of the updated reference tree node is less than that of the text node, the text node is added to the document tree as a child node of the updated reference tree node; otherwise, the predetermined processing is repeated.

For a specific work flow of the apparatus embodiment shown in fig. 7, reference is made to the related description in the foregoing method embodiment, and details are not repeated.

In short, with the scheme of the above apparatus embodiment, a document can be parsed into the form of a document tree, thereby realizing hierarchical parsing of the document, which can assist knowledge graph construction based on document parsing and improve construction efficiency, accuracy and the like.

FIG. 8 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 8 is only an example and should not be taken to limit the scope of use or the functionality of embodiments of the present invention in any way.

As shown in FIG. 8, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, and commonly referred to as a "hard drive"). Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination of which may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.

The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 8, the network adapter 20 communicates with the other modules of the computer system/server 12 via the bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processor 16 executes various functional applications and data processing by executing programs stored in the memory 28, for example, implementing the methods in the embodiments shown in fig. 1 or fig. 6.

The invention also discloses a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, will carry out the method as in the embodiments of fig. 1 or 6.

Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method, etc., can be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A document parsing method, comprising:

segmenting a document in a predetermined format to be processed into text nodes, including: segmenting the document into text nodes of paragraph granularity; the predetermined format includes: a hypertext markup language format;

aiming at the text nodes cut from the non-directory pages, the following processing is respectively carried out:

acquiring chapter mode information of the text node, including: if the text node has explicit chapter mode information, parsing out the chapter mode information as the chapter mode information of the text node; or, if content identical to the text content in the text node exists in the directory page, acquiring chapter mode information corresponding to the text content in the directory page as the chapter mode information of the text node; or, taking the hypertext markup language path information of the text node except the <li> tag as the chapter mode information of the text node;

determining the hierarchy of the text node according to the chapter mode information, including: determining whether the chapter mode information of the text node exists in a global chapter mode sequence, if so, taking a position serial number of the chapter mode information of the text node in the global chapter mode sequence as a hierarchy of the text node, if not, adding the chapter mode information of the text node into the global chapter mode sequence, taking a position serial number of the chapter mode information of the text node in the global chapter mode sequence as a hierarchy of the text node, wherein the position serial number is a sequence serial number of different chapter mode information added into the global chapter mode sequence, the global chapter mode sequence is initially empty, and during initialization, adding the chapter mode information corresponding to a document title of the document into the global chapter mode sequence, taking a position serial number of the chapter mode information corresponding to the document title in the global chapter mode sequence as a hierarchy of a root node, and setting the document title as the root node;

2. The method of claim 1,

the method further comprises the following steps: and if the format of the document is not in the hypertext markup language format, converting the document into the hypertext markup language format.

3. The method of claim 1,

the method further comprises the following steps: identifying a text node with the text content as a directory title, identifying a text node with the text content as the directory content and positioned behind the text node with the text content as the directory title, and analyzing chapter mode information from the text node with the text content as the directory content; the text nodes with the text contents being the directory titles and the text nodes with the text contents being the directory contents are text nodes cut from the directory pages;

the obtaining of the chapter mode information corresponding to the text content in the directory page includes: and taking the chapter mode information corresponding to the catalog content containing the text content as the chapter mode information corresponding to the text content.

4. The method of claim 1,

the obtaining of the document title of the document comprises:

5. The method of claim 1,

the method further comprises the following steps: and after the hierarchy of the text nodes is determined, deleting the chapter mode information of which the position sequence number is greater than the hierarchy of the text nodes in the global chapter mode sequence.

6. The method of claim 1,

the method further comprises the following steps: setting the root node as a reference tree node in an initial state;

adding the text node into the constructed document tree according to the hierarchy comprises:

comparing the hierarchy of the text node with the hierarchy of the reference tree node, adding the text node as a child node or a sibling node of the reference tree node into the document tree according to a comparison result, and setting the text node as the reference tree node.

7. The method of claim 6,

the comparing the hierarchy of the text node with the hierarchy of a reference tree node, and adding the text node as a child node or a sibling node to the document tree according to the comparison result includes:

8. A document parsing apparatus, comprising: a segmentation unit and a parsing unit;

the segmentation unit is used for segmenting the document with the preset format to be processed into text nodes, and comprises: segmenting the document into text nodes with paragraph granularity; the predetermined format includes: a hypertext markup language format;

the parsing unit is configured to perform the following processing for text nodes cut from a non-directory page:

acquiring chapter mode information of the text node, including: if the text node has explicit chapter mode information, analyzing the chapter mode information as the chapter mode information of the text node; or if the content identical to the text content in the text node exists in the directory page, acquiring chapter mode information corresponding to the text content in the directory page as the chapter mode information of the text node; or, taking the hypertext markup language path information of the text node except the < li > tag as the chapter mode information of the text node;

determining the hierarchy of the text nodes according to the chapter mode information, wherein the method comprises the following steps: determining whether chapter mode information of the text node exists in a global chapter mode sequence, if so, taking a position sequence number of the chapter mode information of the text node in the global chapter mode sequence as a hierarchy of the text node, if not, adding the chapter mode information of the text node into the global chapter mode sequence, and taking a position sequence number of the chapter mode information of the text node in the global chapter mode sequence as a hierarchy of the text node, wherein the position sequence number is a sequence number of adding different chapter mode information into the global chapter mode sequence, the global chapter mode sequence is initially empty, and during initialization, adding the chapter mode information corresponding to a document title of the document into the global chapter mode sequence, and taking a position sequence number of the chapter mode information corresponding to the document title in the global chapter mode sequence as a hierarchy of a root node, and the document title is set as the root node;

9. The apparatus of claim 8,

10. The apparatus of claim 8,

the parsing unit is further configured to identify a text node whose text content is a directory title, identify a text node whose text content is a directory content and located behind the text node whose text content is a directory title, and parse chapter mode information from the text node whose text content is a directory content; the text nodes with the text contents being the directory titles and the text nodes with the text contents being the directory contents are text nodes cut from the directory pages;

11. The apparatus of claim 8,

the parsing unit is further configured to, if it is determined that the document has a title tag, use the content corresponding to the title tag as the document title, and otherwise use the file name of the document as the document title.
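A minimal Python sketch of this title selection follows. It assumes that the title tag is an HTML `<title>` element, that a regular expression is sufficient to locate it, and that the file name is used without its extension; none of these details is stated in the claim, and the function name `document_title` is hypothetical.

```python
import os
import re


def document_title(html: str, file_path: str) -> str:
    """Pick the document title: <title> content if present, else the file name."""
    # Assumption: the "title tag" is an HTML <title> element.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if match and match.group(1).strip():
        return match.group(1).strip()
    # Fallback: file name of the document (extension dropped, by assumption).
    return os.path.splitext(os.path.basename(file_path))[0]
```

For example, a document without a title tag stored as `report_2019.html` would be given the title `report_2019` under these assumptions.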

12. The apparatus of claim 8,

the parsing unit is further configured to, after determining the hierarchy of the text node, delete from the global chapter mode sequence the chapter mode information whose position sequence number is greater than the hierarchy of the text node.
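The deletion step can be sketched in Python as follows. The sketch assumes 0-based position sequence numbers and represents chapter mode information as plain strings; the function name `determine_level_and_prune` and the sample patterns are hypothetical.

```python
from typing import List


def determine_level_and_prune(pattern: str, global_patterns: List[str]) -> int:
    """Determine the hierarchy of a pattern, then drop all deeper patterns."""
    if pattern not in global_patterns:
        global_patterns.append(pattern)
    level = global_patterns.index(pattern)
    # Delete every entry whose position is greater than the hierarchy just found.
    del global_patterns[level + 1:]
    return level


# Hypothetical state after parsing "Chapter 1", "1.1" and "(1)".
patterns = ["TITLE", "Chapter N", "N.N", "(N)"]
level = determine_level_and_prune("Chapter N", patterns)
```

One plausible reading of this step is that, once the document returns to a shallower chapter, deeper patterns recorded earlier are forgotten, so a later chapter may introduce its sub-levels in a different order.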

13. The apparatus of claim 8,

the parsing unit is further configured to, in an initial state, set the root node as a reference tree node;

14. The apparatus of claim 13,

the parsing unit adds the text node to the document tree as a child node of the reference tree node if it is determined that the hierarchy of the text node is greater than the hierarchy of the reference tree node; adds the text node to the document tree as a sibling node of the reference tree node if it is determined that the hierarchy of the text node is equal to the hierarchy of the reference tree node; and performs the following predetermined processing if it is determined that the hierarchy of the text node is less than the hierarchy of the reference tree node: taking the previous-level node of the current reference tree node as an updated reference tree node; if the hierarchy of the updated reference tree node is less than the hierarchy of the text node, adding the text node to the document tree as a child node of the updated reference tree node; and otherwise repeating the predetermined processing.
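The three insertion cases can be illustrated with the following Python sketch. The class `TreeNode` and the function `add_node` are hypothetical names. The sketch further assumes that the newly added node becomes the reference tree node for the next insertion, which this claim does not state, and that no text node has a hierarchy less than or equal to that of the root node.

```python
from typing import List, Optional


class TreeNode:
    """A document tree node holding its text and hierarchy."""

    def __init__(self, text: str, level: int) -> None:
        self.text = text
        self.level = level
        self.parent: Optional["TreeNode"] = None
        self.children: List["TreeNode"] = []


def attach(parent: TreeNode, node: TreeNode) -> None:
    node.parent = parent
    parent.children.append(node)


def add_node(reference: TreeNode, node: TreeNode) -> TreeNode:
    """Add `node` relative to `reference` and return the new reference node."""
    if node.level > reference.level:
        attach(reference, node)             # deeper: child of the reference node
    elif node.level == reference.level:
        attach(reference.parent, node)      # same hierarchy: sibling of the reference node
    else:
        current = reference.parent          # shallower: walk up one level at a time
        while current.level >= node.level:
            current = current.parent
        attach(current, node)               # first ancestor with a smaller hierarchy
    return node                             # assumption: the new node becomes the reference


root = TreeNode("Title", 0)
reference = root                            # initial state: the root is the reference node
for text, level in [("Chapter 1", 1), ("1.1", 2), ("1.2", 2), ("Chapter 2", 1), ("2.1", 2)]:
    reference = add_node(reference, TreeNode(text, level))
```

In this run "Chapter 2" arrives while the reference node is "1.2", so the walk moves up through "Chapter 1" to the root before attaching the new node.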

15. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any of claims 1 to 7.

16. A computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the method of any of claims 1 to 7.