CN116136958A - Document processing method, apparatus, computer program product, and readable storage medium - Google Patents

Info

Publication number
CN116136958A
Authority
CN
China
Prior art keywords
node
document
title
dictionary tree
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310156700.9A
Other languages
Chinese (zh)
Inventor
李斌
谷利峰
谢鸣晓
刘峻杉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202310156700.9A priority Critical patent/CN116136958A/en
Publication of CN116136958A publication Critical patent/CN116136958A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a document processing method, apparatus, computer device, storage medium, and computer program product, applied in the technical field of data processing. The method comprises: acquiring a document to be processed; obtaining a target chapter title dictionary tree according to the document to be processed and regular expressions corresponding to preset title styles, where the target chapter title dictionary tree includes at least one sub-level and each sub-level includes at least one node; obtaining a document tree based on the statistical information and feature information of each node of each sub-level in the target chapter title dictionary tree; and performing pattern mining on each node of each sub-level in the document tree, according to the statistical information and feature information of each node and of its sibling nodes, to obtain the document pattern corresponding to the document to be processed. The method can improve the accuracy of document pattern recognition.

Description

Document processing method, apparatus, computer program product, and readable storage medium
Technical Field
The present application relates to the field of data processing technology, and in particular, to a document processing method, apparatus, computer device, computer readable storage medium, and computer program product.
Background
Extracting and recognizing the content of documents matters to enterprises that want to use unstructured data effectively and raise their level of digitization. At present, the prior art treats the extraction and recognition of a document to be processed as an extraction task in natural language processing: entity extraction and relation extraction models are trained on labeled data, and a number of customized rules are summarized from the specific document content, in order to obtain a document pattern.
However, the recognition accuracy of this approach is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a document processing method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve recognition accuracy.
In a first aspect, the present application provides a document processing method, the method including:
acquiring a document to be processed; obtaining a target chapter title dictionary tree according to the document to be processed and regular expressions corresponding to preset title styles, where the target chapter title dictionary tree includes at least one sub-level and each sub-level includes at least one node; obtaining a document tree based on the statistical information and feature information of each node of each sub-level in the target chapter title dictionary tree; and performing pattern mining on each node of each sub-level in the document tree, according to the statistical information and feature information of each node and of its sibling nodes, to obtain the document pattern corresponding to the document to be processed.
In one embodiment, performing pattern mining on each node of each sub-level in the document tree, according to the statistical information and feature information of each node and of its sibling nodes, to obtain the document pattern corresponding to the document to be processed includes:
determining whether the statistical information and feature information of each node in the document tree are similar to those of its sibling nodes; and, when that similarity indicates that each node in the document tree shares the same pattern with its sibling nodes, obtaining the document pattern corresponding to the document to be processed from the pattern mining results of all nodes in the document tree.
In one embodiment, obtaining the target chapter title dictionary tree according to the document to be processed and the regular expressions corresponding to the preset title styles includes: obtaining candidate titles of the document to be processed from information about at least one paragraph of the document; obtaining a candidate chapter title dictionary tree according to the candidate titles and the regular expressions corresponding to the preset title styles; and preprocessing the candidate chapter title dictionary tree to obtain the target chapter title dictionary tree.
In one embodiment, the candidate title includes a plurality of preset contents, and obtaining the candidate chapter title dictionary tree according to the candidate title and the regular expressions corresponding to the preset title styles includes: determining the target regular expression corresponding to the candidate title according to the preset content that appears first among the plurality of preset contents; and obtaining the candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
In one embodiment, there are a plurality of regular expressions corresponding to the preset title styles, and obtaining the candidate chapter title dictionary tree according to the candidate title and the regular expressions corresponding to the preset title styles includes: when the candidate title matches more than one of the regular expressions, determining the regular expression with the longest match on the candidate title to be the target regular expression corresponding to the candidate title; and obtaining the candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
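The longest-match rule in this embodiment can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the style names and the regular expressions in `STYLE_PATTERNS` are hypothetical, since the text does not list the concrete expressions.

```python
import re

# Hypothetical title-style patterns; the source does not give concrete expressions.
STYLE_PATTERNS = {
    "arabic": re.compile(r"\d+"),
    "arabic_dotted": re.compile(r"\d+(?:\.\d+)+"),
    "parenthesized": re.compile(r"\(\d+\)"),
    "chapter": re.compile(r"Chapter\s+\d+"),
}


def select_target_pattern(candidate_title):
    """Return (style_name, matched_prefix) for the longest prefix match, or None."""
    best = None
    for name, pattern in STYLE_PATTERNS.items():
        match = pattern.match(candidate_title)  # anchored at the start of the title
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best
```

For a title such as "1.2 Functional design", both the `arabic` and `arabic_dotted` patterns match, and the sketch keeps `arabic_dotted` because its match is longer.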
In one embodiment, preprocessing the candidate chapter title dictionary tree to obtain the target chapter title dictionary tree includes: merging nodes with the same occurrence frequency in the candidate chapter title dictionary tree to obtain a first reference chapter title dictionary tree; setting a paragraph range for each node in the first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree; and excluding, from the second reference chapter title dictionary tree, nodes that are unlikely to be chapter titles of the document to be processed at the same level as the other nodes, to obtain the target chapter title dictionary tree.
In one embodiment, setting a paragraph range for each node in the first reference chapter title dictionary tree to obtain the second reference chapter title dictionary tree includes:
determining whether each node in the first reference chapter title dictionary tree has a child node; when a node has no child node, determining that the paragraph range of the node is the range formed by its initial paragraph number and its termination paragraph number, both of which are the paragraph numbers corresponding to the node itself in the document to be processed; when a node has a child node, determining that the paragraph range of the node is the range formed by its initial paragraph number and its termination paragraph number, where the initial paragraph number is the paragraph number corresponding to the first child node of the node in the document to be processed, and the termination paragraph number is the paragraph number corresponding to the last child node of the node in the document to be processed; and determining the first reference chapter title dictionary tree, after a paragraph range has been set for each of its nodes, to be the second reference chapter title dictionary tree.
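The leaf and non-leaf rules for paragraph ranges can be sketched as a short recursive pass. The `TitleNode` class and its field names are hypothetical, and the sketch assumes that each node records a single paragraph number and that children are stored in document order.

```python
class TitleNode:
    """Hypothetical dictionary-tree node; the field names are illustrative only."""

    def __init__(self, title, paragraph_no, children=None):
        self.title = title
        self.paragraph_no = paragraph_no   # paragraph number of this title
        self.children = children or []     # assumed to be in document order
        self.paragraph_range = None        # (start, end), filled in below


def set_paragraph_ranges(node):
    """Assign a paragraph range to the node and to all of its descendants."""
    for child in node.children:
        set_paragraph_ranges(child)
    if node.children:
        # Non-leaf: from the first child's paragraph number to the last child's.
        start = node.children[0].paragraph_no
        end = node.children[-1].paragraph_no
    else:
        # Leaf: start and end are both the node's own paragraph number.
        start = end = node.paragraph_no
    node.paragraph_range = (start, end)
    return node
```

A node "1" at paragraph 2 whose children "(1)" and "(2)" sit at paragraphs 3 and 7 would receive the range (3, 7), while each child receives a single-paragraph range.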
In one embodiment, excluding the nodes that are unlikely to be chapter titles of the document to be processed at the same level as the other nodes in the second reference chapter title dictionary tree, to obtain the target chapter title dictionary tree, includes:
traversing each node in the second reference chapter title dictionary tree; and, when the paragraph range of a node contains the paragraph range of one of its sibling nodes, determining that this sibling node is unlikely to be a chapter title of the document to be processed at the same level as the node, and obtaining the target chapter title dictionary tree accordingly.
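One way to read the containment test is sketched below. The representation of siblings as `(title, (start, end))` tuples is an assumption made for illustration, as is the choice to keep two siblings whose ranges are identical.

```python
def contains(outer, inner):
    """True if the paragraph range `outer` fully covers the range `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def exclude_nested_siblings(siblings):
    """Drop siblings whose paragraph range lies inside another sibling's range.

    `siblings` is a list of (title, (start, end)) tuples (hypothetical layout).
    """
    kept = []
    for i, (title, rng) in enumerate(siblings):
        nested = any(
            j != i and other != rng and contains(other, rng)
            for j, (_, other) in enumerate(siblings)
        )
        if not nested:
            kept.append((title, rng))
    return kept
```

In this sketch a candidate such as "(1)" with range (5, 8) is dropped when a sibling "1" spans (1, 20), because a title nested inside another title's range is unlikely to belong to the same level.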
In a second aspect, the present application provides a document processing apparatus, the apparatus comprising:
a first acquisition module, configured to acquire a document to be processed; a first processing module, configured to obtain a target chapter title dictionary tree according to the document to be processed and regular expressions corresponding to preset title styles, where the target chapter title dictionary tree includes at least one sub-level and each sub-level includes at least one node; a second acquisition module, configured to obtain a document tree based on the statistical information and feature information of each node of each sub-level in the target chapter title dictionary tree; and a second processing module, configured to perform pattern mining on each node of each sub-level in the document tree, according to the statistical information and feature information of each node and of its sibling nodes, to obtain the document pattern corresponding to the document to be processed.
In one embodiment, the second processing module is further configured to: determine whether the statistical information and feature information of each node in the document tree are similar to those of its sibling nodes; and, when that similarity indicates that each node in the document tree shares the same pattern with its sibling nodes, obtain the document pattern corresponding to the document to be processed from the pattern mining results of all nodes in the document tree.
In one embodiment, the first processing module is further configured to: obtain candidate titles of the document to be processed from information about at least one paragraph of the document; obtain a candidate chapter title dictionary tree according to the candidate titles and the regular expressions corresponding to the preset title styles; and preprocess the candidate chapter title dictionary tree to obtain the target chapter title dictionary tree.
In one embodiment, the candidate title includes a plurality of preset contents, and the first processing module is further configured to: determine the target regular expression corresponding to the candidate title according to the preset content that appears first among the plurality of preset contents; and obtain the candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
In one embodiment, there are a plurality of regular expressions corresponding to the preset title styles, and the first processing module is further configured to: when the candidate title matches more than one of the regular expressions, determine the regular expression with the longest match on the candidate title to be the target regular expression corresponding to the candidate title; and obtain the candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
In one embodiment, the first processing module is further configured to: merge nodes with the same occurrence frequency in the candidate chapter title dictionary tree to obtain a first reference chapter title dictionary tree; set a paragraph range for each node in the first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree; and exclude, from the second reference chapter title dictionary tree, nodes that are unlikely to be chapter titles of the document to be processed at the same level as the other nodes, to obtain the target chapter title dictionary tree.
In one embodiment, the first processing module is further configured to: determine whether each node in the first reference chapter title dictionary tree has a child node; when a node has no child node, determine that the paragraph range of the node is the range formed by its initial paragraph number and its termination paragraph number, both of which are the paragraph numbers corresponding to the node itself in the document to be processed; when a node has a child node, determine that the paragraph range of the node is the range formed by its initial paragraph number and its termination paragraph number, where the initial paragraph number is the paragraph number corresponding to the first child node of the node in the document to be processed, and the termination paragraph number is the paragraph number corresponding to the last child node of the node in the document to be processed; and determine the first reference chapter title dictionary tree, after a paragraph range has been set for each of its nodes, to be the second reference chapter title dictionary tree.
In one embodiment, the first processing module is further configured to: traverse each node in the second reference chapter title dictionary tree; and, when the paragraph range of a node contains the paragraph range of one of its sibling nodes, determine that this sibling node is unlikely to be a chapter title of the document to be processed at the same level as the node, and obtain the target chapter title dictionary tree accordingly.
According to the document processing method and apparatus, a document to be processed is acquired, and a hierarchical structure of any depth can be recognized recursively according to the document to be processed and the regular expressions corresponding to the preset title styles, yielding the target chapter title dictionary tree and improving the accuracy of document pattern recognition. The document tree is then constructed from the statistical information and feature information of each node of each sub-level in the target chapter title dictionary tree, so that both the structural information and the text information of the document are taken into account when the document pattern is recognized from the document tree, which further improves recognition accuracy. Finally, pattern mining is performed on each node of each sub-level in the document tree, according to the statistical information and feature information of each node and of its sibling nodes, so that the document pattern obtained for the document to be processed is highly accurate.
Drawings
FIG. 1 is an application environment diagram of a document processing method in one embodiment;
FIG. 2 is a flow diagram of a method of processing a document in one embodiment;
FIG. 3 is a flow diagram of performing pattern mining on each node of each sub-level in the document tree, according to the statistical information and feature information of each node and of its sibling nodes, to obtain the document pattern corresponding to the document to be processed, in one embodiment;
FIG. 4 is a flow diagram of obtaining the target chapter title dictionary tree according to the document to be processed and the regular expressions corresponding to the preset title styles, in one embodiment;
FIG. 5 is a schematic diagram of a candidate chapter title dictionary tree that is not differentiated in one embodiment;
FIG. 6 is a schematic diagram of a candidate chapter title dictionary tree that may be differentiated in one embodiment;
FIG. 7 is a flow diagram of obtaining the candidate chapter title dictionary tree according to the candidate title and the regular expressions corresponding to the preset title styles, in one embodiment;
FIG. 8 is a flow diagram of obtaining the candidate chapter title dictionary tree according to the candidate title and the regular expressions corresponding to the preset title styles, in one embodiment;
FIG. 9 is a flow diagram of preprocessing a candidate chapter title dictionary tree to obtain a target chapter title dictionary tree, under one embodiment;
FIG. 10 is a diagram of the results of counting the number of node occurrences in a candidate chapter title dictionary tree, under one embodiment;
FIG. 11 is a diagram of a first reference chapter title dictionary tree in one embodiment;
FIG. 12 is a flow diagram of a second reference chapter title dictionary tree obtained by setting paragraph ranges for nodes in the first reference chapter title dictionary tree in one embodiment;
FIG. 13 is a flow diagram of a target chapter title dictionary tree obtained by excluding nodes of chapter titles of a document to be processed that are unlikely to be at the same level as nodes in a second reference chapter title dictionary tree in one embodiment;
FIG. 14 is a schematic diagram of a candidate chapter title dictionary tree in one embodiment;
FIG. 15 is a schematic diagram of the result of merging nodes in a candidate chapter title dictionary tree and setting paragraph ranges in one embodiment;
FIG. 16 is a block diagram showing the structure of a document processing apparatus in one embodiment;
FIG. 17 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The document processing method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; in this application, it may store the documents to be processed. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server.
Specifically, after the terminal 102 obtains the document to be processed, the terminal 102 sends the document to be processed to the server 104, and the server 104 obtains a target chapter title dictionary tree according to a regular expression corresponding to the document to be processed and a preset title style, wherein the target chapter title dictionary tree comprises at least one sub-level, and each sub-level comprises at least one node; the server 104 obtains a document tree based on the statistical information and the characteristic information of each node of each sub-level in the target chapter title dictionary tree, so as to perform pattern mining on each node of each sub-level in the document tree according to the statistical information and the characteristic information between each node and the brother node corresponding to each node in the document tree, and obtain a document pattern corresponding to the document to be processed. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in FIG. 2, a document processing method is provided. The method is described as applied to the server 104 in FIG. 1 and includes the following steps:
S202, acquiring a document to be processed.
The documents to be processed can comprise business description documents, business guide documents, regulation documents, various notices, meeting summary documents and other business documents in enterprises, and can also comprise technical documents, office documents, books or articles in books in the technical field.
S204, obtaining a target chapter title dictionary tree according to the document to be processed and the regular expression corresponding to the preset title style; the target chapter title dictionary tree includes at least one sub-hierarchy, each sub-hierarchy including at least one node.
A preset title style represents the structural form of a title, and different preset title styles correspond to different regular expressions. In some embodiments, the preset title styles may include a first title style and a second title style. The first title style is the structural form of a title marked by underlining, italics, bold, or similar formatting. The second title style is the structural form of a title marked by numerals and/or symbols, where the symbols may include Chinese and English brackets, periods, enumeration commas, and the like; for example, the second title style may be "1", "one", "(1)", "one)", and "first", etc.
It will be appreciated that a title is generally no longer than one line, with most titles between 5 and 50 words long, and is composed of a title style and title content; for example, a title may be "first chapter background introduction", "second chapter functional description", "third chapter functional design", and "twelfth chapter reference", etc.
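The length heuristic described above can be sketched as a simple pre-filter for candidate titles. The function name and default bounds are illustrative. The translated text gives the range in "words"; this sketch measures characters instead, which is an assumption, and the source presents the range only as a typical case rather than a hard limit.

```python
def is_candidate_title(paragraph, min_len=5, max_len=50):
    """Heuristic pre-filter: a single line whose length falls in [min_len, max_len].

    Length is measured in characters here, which is an assumption.
    """
    text = paragraph.strip()
    if "\n" in text:
        # Titles are described as generally not exceeding one line.
        return False
    return min_len <= len(text) <= max_len
```

Paragraphs that pass this filter would still need to be checked against the title-style regular expressions before they are treated as titles.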
S206, obtaining a document tree based on the statistical information and the characteristic information of each node of each sub-level in the target chapter title dictionary tree.
It can be understood that the pattern of a document to be processed is implicit in information such as text formats, tables, lists, and text prefixes. In frequent pattern mining, the specific text content is therefore not decisive; what helps extract the document pattern is feature information such as the formats, styles, and fixed prefixes of the content. For this reason, the document tree is obtained from the statistical information and feature information of the text content under each node of each sub-level in the target chapter title dictionary tree, and pattern mining is then performed on the document tree, which can improve the accuracy of pattern mining.
The statistical information of each node of each sub-level in the target chapter title dictionary tree represents the number of target elements contained in the text content corresponding to that node; the target elements may include paragraphs, tables, pictures, attachments, and the like.
The feature information of each node of each sub-level in the target chapter title dictionary tree may include a prefix word vector, a subject word vector, and the like. The prefix word vector is formed from the first N words of each paragraph and is in effect a vectorized representation of the paragraph prefixes; N is a configuration parameter. The subject word vector covers the core words contained under each node, and the value of each dimension may be the word frequency of the corresponding core word. To determine the core words, the text content corresponding to each node is segmented into words and dependency-parsed, and the subjects, objects, and nouns that remain after screening are taken as the core words.
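As a minimal sketch, the statistical information of a node can be held as a fixed-order count vector over the target elements. The element-type labels and the assumption that each content item under a node has already been tagged with its type are illustrative.

```python
from collections import Counter

# Element types named in the text; the string labels themselves are hypothetical.
TARGET_ELEMENTS = ("paragraph", "table", "picture", "attachment")


def node_statistics(element_types):
    """Count the target elements under one node, in the order of TARGET_ELEMENTS."""
    counts = Counter(t for t in element_types if t in TARGET_ELEMENTS)
    return [counts[t] for t in TARGET_ELEMENTS]
```

A fixed ordering keeps the count vectors of different nodes directly comparable.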
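The two kinds of feature information can be sketched with plain counting. This sketch makes several assumptions: words are split on whitespace rather than by a word segmenter, the prefix representation is a count of distinct prefixes rather than a learned vector, and the core words are supplied by the caller because the dependency-parsing step is not shown. All function and parameter names are hypothetical.

```python
from collections import Counter


def prefix_counts(paragraphs, n=2):
    """Count the first-n-word prefix of every non-empty paragraph under a node."""
    return Counter(" ".join(p.split()[:n]) for p in paragraphs if p.strip())


def subject_word_vector(core_words, vocabulary):
    """Word-frequency vector of the core words over a fixed vocabulary order."""
    counts = Counter(core_words)
    return [counts[w] for w in vocabulary]
```

Paragraphs that begin with the same fixed prefix, such as numbered articles in a regulation, end up sharing one key in `prefix_counts`, which is the kind of regularity the prefix feature is meant to expose.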
S208, performing mode mining on each node of each sub-level in the document tree according to the statistical information and the characteristic information between each node and the brother nodes corresponding to each node in the document tree, and obtaining a document mode corresponding to the document to be processed.
It will be appreciated that document pattern mining differs from general frequent subtree pattern mining: the frequent subtrees can be mined level by level, with only two levels of tree nodes considered as a candidate subtree each time. Document pattern mining does not need to treat single nodes, or subtrees of more than two levels, as candidate frequent items, so its space of candidate frequent items is far smaller than that of a general frequent subtree mining task.
It can be understood that, taking business documents and office documents as examples of documents to be processed, the writing formats of such documents do not change frequently within the same organization. After the document pattern corresponding to the document to be processed is obtained, it can therefore be saved and later applied directly to documents of the same type, which simplifies parsing those documents and improves the efficiency of document parsing.
In some embodiments, a new document may be generated as needed by combining the document pattern corresponding to the document to be processed with a document template language, so that corresponding operations can be performed based on the new document.
In some embodiments, the document pattern corresponding to the document to be processed may be displayed on a management console interface so that a user can conveniently maintain the document pattern, correct its defects, or supplement it with other patterns. The document pattern may also be stored in a database that provides query and retrieval functions, so that users can generate new documents from it, thereby improving the efficiency of document generation.
In summary, in the embodiment shown in fig. 2, the document to be processed is acquired, and a hierarchical structure of any depth can be recursively identified according to the document to be processed and the regular expressions corresponding to the preset title style, yielding the target chapter title dictionary tree and improving the accuracy of document mode identification. A document tree is then constructed from the statistical information and feature information of each node of each sub-level in the target chapter title dictionary tree, so that both the structural information and the text information of the document are taken into account when the document mode is identified from the document tree. Finally, pattern mining is performed on each node of each sub-level in the document tree according to the statistical information and feature information between each node and its sibling nodes, so that the document mode obtained for the document to be processed is identified with high accuracy.
In one embodiment, as shown in fig. 3, a flowchart is provided for performing pattern mining on each node of each sub-level in the document tree, according to the statistical information and feature information between each node in the document tree and its sibling nodes, to obtain the document mode corresponding to the document to be processed, including the following steps:
S302, determining whether the statistical information and feature information between each node in the document tree and its sibling nodes are similar.
S304, when it is determined, based on the statistical information and feature information between each node in the document tree and its sibling nodes being similar, that each node has the same pattern as its sibling nodes, obtaining the document mode corresponding to the document to be processed according to the pattern mining results of all nodes in the document tree.
The sibling nodes of a node are the nodes at the same level as that node. In some embodiments, each node may be taken in turn as the target and its subsequent sibling nodes traversed, determining whether the similarity of the statistical information and feature information between the node and each sibling node exceeds a threshold; when the similarity exceeds the threshold, the statistical information and feature information between the node and that sibling node are determined to be similar.
It will be appreciated that when the similarity of the statistical information and the feature information between each node and the sibling node corresponding to each node does not exceed the threshold, that is, the similarity of the statistical information and the feature information between each node and the sibling node corresponding to each node is smaller than the threshold, it is determined that the statistical information and the feature information between each node and the sibling node corresponding to each node are not similar.
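The threshold test described above can be sketched as follows. The application does not specify a similarity measure, so this sketch assumes cosine similarity over numeric feature dictionaries; the feature names, the function names and the default threshold of 0.8 are all hypothetical.

```python
import math
from typing import Dict, List

Features = Dict[str, float]


def cosine_similarity(a: Features, b: Features) -> float:
    """Cosine similarity of two sparse feature vectors (assumed measure)."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def siblings_share_pattern(siblings: List[Features], threshold: float = 0.8) -> bool:
    """Return True when every node is similar to each of its subsequent siblings.

    Following the text, a pair counts as similar only when the similarity
    exceeds the threshold; a value equal to or below it is "not similar".
    """
    for i, node in enumerate(siblings):
        for other in siblings[i + 1:]:
            if cosine_similarity(node, other) <= threshold:
                return False
    return True


chapter_a = {"child_count": 3.0, "title_length": 4.0}
chapter_b = {"child_count": 3.0, "title_length": 5.0}
appendix = {"has_table": 1.0}
```

With these sample values, `chapter_a` and `chapter_b` are judged similar, while adding `appendix` (which shares no feature with them) breaks the shared pattern.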
To sum up, in the embodiment shown in fig. 3, it is determined whether the statistical information and feature information between each node in the document tree and its sibling nodes are similar; when, on that basis, each node is determined to have the same pattern as its sibling nodes, the document mode corresponding to the document to be processed is obtained according to the pattern mining results of all nodes in the document tree. In this way, both the structural information and the text information of the document are considered when pattern mining is performed on the document tree, which can improve the accuracy of document mode identification.
In one embodiment, as shown in fig. 4, there is provided a flowchart of obtaining a target chapter title dictionary tree according to a regular expression corresponding to a document to be processed and a preset title style, including the steps of:
S402, obtaining candidate titles corresponding to the document to be processed according to at least one piece of paragraph information of the document to be processed.
S404, obtaining a candidate chapter title dictionary tree according to the regular expression corresponding to the candidate title and the preset title style.
It may be appreciated that different preset title styles correspond to different regular expressions. Taking as an example the case where the preset title style is the second title style, that is, a structural style in which titles are denoted by numbers and/or symbols, the regular expressions may include the following representations:
The first is expressed as: \([0-9]+\)、; the second is expressed as: \([0-9]+\)\.; the third is expressed as: \([0-9]+\); the fourth is expressed as: \([一二三四五六七八九十]+\)、; the fifth is expressed as: \([一二三四五六七八九十]+\)\.; the sixth is expressed as: \([一二三四五六七八九十]+\); the seventh is expressed as: （[0-9]+）、; the eighth is expressed as: （[0-9]+）\.; the ninth is expressed as: （[0-9]+）; the tenth is expressed as: （[一二三四五六七八九十]+）、; the eleventh is expressed as: （[一二三四五六七八九十]+）\.; the twelfth is expressed as: （[一二三四五六七八九十]+）; the thirteenth is expressed as: [0-9]+\)、; the fourteenth is expressed as: [0-9]+\)\.; the fifteenth is expressed as: [0-9]+\); the sixteenth is expressed as: [一二三四五六七八九十]+\)、; the seventeenth is expressed as: [一二三四五六七八九十]+\)\.; the eighteenth is expressed as: [一二三四五六七八九十]+\); the nineteenth is expressed as: [0-9]+）、; the twentieth is expressed as: [0-9]+）\.; the twenty-first is expressed as: [0-9]+）; the twenty-second is expressed as: [一二三四五六七八九十]+）、; the twenty-third is expressed as: [一二三四五六七八九十]+）\.; the twenty-fourth is expressed as: [一二三四五六七八九十]+）; the twenty-fifth is expressed as: ([0-9]+\.)+[1-9]、; the twenty-sixth is expressed as: ([0-9]+\.)+[1-9]\.; the twenty-seventh is expressed as: ([0-9]+\.)+[1-9]; the twenty-eighth is expressed as: [0-9]+、; the twenty-ninth is expressed as: [0-9]+\.; the thirtieth is expressed as: [0-9]+; the thirty-first is expressed as: [一二三四五六七八九十]+、; the thirty-second is expressed as: [一二三四五六七八九十]+\.; the thirty-third is expressed as: [一二三四五六七八九十]+.
In the list above, the regular expressions are separated by ";". The meaning of the character strings in each representation of the regular expression is shown in Table 1:
TABLE 1
[Table 1 is reproduced as images in the original publication: Figure BDA0004092754140000111 and Figure BDA0004092754140000121.]
Combining the representations listed above with the content of Table 1: for example, the regular expression \([0-9]+\)、 can be used to match sequence numbers such as "(1)、" and "(2)、". In this expression, [0-9]+ denotes one or more digits, and the trailing 、 is an ordinary enumeration comma (pause mark). The parentheses are English half-width parentheses, which have a special meaning in regular expressions, so each of them is preceded by a backslash \ that escapes it and keeps it in the regular expression as a literal character.
For example, with the regular expressions listed above, the title styles "1", "2" and "11" may be replaced with @1@ in the dictionary tree, the title styles "one", "two" and "eleven" may be replaced with @2@, and the title styles "(one)", "(two)" and "(eleven)" may be replaced with @4@. The number between the two @ symbols indicates which representation of the regular expression the title style corresponds to; for example, replacing "one", "two" and "eleven" with @2@ in the dictionary tree means that they are replaced according to the second representation of the regular expression.
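The replacement of a sequence number by an @n@ placeholder can be sketched as below. This is only an illustration under stated assumptions: the two patterns are assumed forms of the first and fourth representations (half-width parentheses followed by an enumeration comma), and the names `TITLE_PATTERNS` and `normalize_title` are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed subset of the numbered expressions; keys are the expression numbers.
TITLE_PATTERNS: Dict[int, str] = {
    1: r"\([0-9]+\)、",
    4: r"\([一二三四五六七八九十]+\)、",
}


def normalize_title(title: str) -> Tuple[Optional[int], str]:
    """Replace a leading sequence number with an @n@ placeholder.

    Returns (expression number, normalized title). When no pattern matches
    at the start of the title, the number is None and the title is unchanged.
    """
    for number, pattern in TITLE_PATTERNS.items():
        match = re.match(pattern, title)
        if match:
            return number, f"@{number}@" + title[match.end():]
    return None, title
```

Titles such as "(1)、Query conditions" and "(11)、Advanced query" both normalize to a string starting with @1@, so they fall on the same branch of the dictionary tree.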
When the preset title style is the second title style, that is, the structural style corresponding to titles represented by numbers and/or symbols, and the dictionary tree is obtained according to the regular expressions corresponding to the second title style, the replaced content is marked in the dictionary tree with "@". To distinguish the two styles, when the dictionary tree is obtained according to the regular expressions corresponding to the first title style, "#" can be used instead. The regular expressions corresponding to the first title style can be adapted from those corresponding to the second title style and are not described here.
In some embodiments, in the same document, different levels of titles may use different preset title styles, so that the title styles of each level in the same document may be replaced according to the regular expression corresponding to the corresponding preset title style.
In some embodiments, titles at the same level in the same document use the same title style, but the numbers within that style obviously differ from title to title. When the document to be processed is processed, the numbers in the candidate titles can be de-differentiated according to the regular expressions corresponding to the preset title style, so as to obtain the corresponding candidate chapter title dictionary tree. For example, suppose the document to be processed includes "first chapter background introduction", "second chapter functional description", "third chapter functional design" and "twelfth chapter reference document"; fig. 5 shows the candidate chapter title dictionary tree obtained without de-differentiation, and fig. 6 shows the candidate chapter title dictionary tree obtained with de-differentiation.
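A minimal sketch of de-differentiation followed by dictionary tree construction is given below. It assumes the four example titles in their presumed original-language form, assumes that the Chinese-numeral expression is the one numbered 33, and treats the placeholder as a single token while every other character is its own token. All names (`tokenize`, `TrieNode`, `build_trie`) are hypothetical.

```python
import re
from typing import Dict, List

CHINESE_NUMERAL = re.compile(r"[一二三四五六七八九十]+")  # assumed expression no. 33
PLACEHOLDER = "@33@"


def tokenize(title: str) -> List[str]:
    """Replace the first run of Chinese numerals with one placeholder token
    and split the rest of the title into single characters."""
    match = CHINESE_NUMERAL.search(title)
    if not match:
        return list(title)
    return list(title[:match.start()]) + [PLACEHOLDER] + list(title[match.end():])


class TrieNode:
    def __init__(self) -> None:
        self.count = 0  # number of titles passing through this node
        self.children: Dict[str, "TrieNode"] = {}


def build_trie(titles: List[str]) -> TrieNode:
    """Insert every tokenized title into a dictionary tree with counts."""
    root = TrieNode()
    for title in titles:
        node = root
        for token in tokenize(title):
            node = node.children.setdefault(token, TrieNode())
            node.count += 1
    return root


titles = ["第一章背景介绍", "第二章功能说明", "第三章功能设计", "第十二章参考文献"]
root = build_trie(titles)
```

Because "一", "二", "三" and "十二" all collapse to the same placeholder, the four titles share one path through the numbering prefix instead of four separate branches.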
S406, preprocessing the candidate chapter title dictionary tree to obtain a target chapter title dictionary tree.
In some embodiments, nodes which cannot be chapter titles of the document to be processed in the candidate chapter title dictionary tree can be eliminated according to paragraph information of the document to be processed, and a target chapter title dictionary tree is obtained.
In some embodiments, nodes in the candidate chapter heading dictionary tree, which cannot be the chapter heading of the document to be processed, may be excluded according to paragraph numbers corresponding to the nodes in the candidate chapter heading dictionary tree, so as to obtain a target chapter heading dictionary tree.
To sum up, in the embodiment shown in fig. 4, candidate titles corresponding to the document to be processed are obtained according to at least one piece of paragraph information of the document; a candidate chapter title dictionary tree is obtained according to the candidate titles and the regular expressions corresponding to the preset title style; and the candidate chapter title dictionary tree is then preprocessed to obtain the target chapter title dictionary tree, which improves the accuracy of chapter title identification in the document to be processed. Because the document tree is obtained from the target chapter title dictionary tree, the identification accuracy of the document mode can be improved when pattern mining is performed on the document tree.
In one embodiment, the candidate title includes a plurality of preset contents, as shown in fig. 7, a flowchart of obtaining a dictionary tree of candidate chapter titles according to a regular expression corresponding to the candidate title and a preset title style is provided, which includes the following steps:
S702, determining a target regular expression corresponding to the candidate title according to the preset content that appears first among the plurality of preset contents.
S704, obtaining a candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
The preset content denotes content in the title that consists of at least one number and/or symbol; the symbols may include Chinese and English brackets, full stops, enumeration commas and the like. For example, when the candidate title is "one, four functions of the product", the candidate title contains the preset contents "one," and "four". Because the preset content "one," appears before the preset content "four", the regular expression matched by "one," is determined as the target regular expression corresponding to the candidate title, and the candidate chapter title dictionary tree is then obtained according to the target regular expression and the text content corresponding to the candidate title.
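Selecting the target regular expression by the first-appearing preset content can be sketched as follows. The two patterns and their numbers (31 and 33) are assumptions, as is the tie-breaking rule: when two patterns match at the same position, this sketch prefers the longer match.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed subset of the numbered expressions (numbering is hypothetical).
PATTERNS: Dict[int, str] = {
    31: r"[一二三四五六七八九十]+、",
    33: r"[一二三四五六七八九十]+",
}


def first_appearing_pattern(title: str) -> Optional[Tuple[int, str]]:
    """Return (expression number, matched text) for the match that starts
    earliest in the title, or None when no pattern matches anywhere."""
    best = None
    for number, pattern in PATTERNS.items():
        match = re.search(pattern, title)
        if match is None:
            continue
        # Earlier start wins; for equal starts the longer match wins.
        key = (match.start(), -len(match.group()))
        if best is None or key < best[0]:
            best = (key, number, match.group())
    return None if best is None else (best[1], best[2])
```

For the title "一、产品的四个功能" (the presumed original form of the example above), the numeral "四" further along the title is ignored because "一、" appears first.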
In one embodiment, the number of regular expressions corresponding to the preset title pattern is multiple, as shown in fig. 8, a flowchart of obtaining a candidate chapter title dictionary tree according to the regular expressions corresponding to the candidate title and the preset title pattern is provided, which includes the following steps:
S802, when it is determined, according to the regular expressions corresponding to the preset title style, that the candidate title matches a plurality of regular expressions, determining the regular expression with the longest match in the candidate title as the target regular expression corresponding to the candidate title.
S804, obtaining a candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
The preset content denotes content in the title that consists of at least one number and/or symbol; the symbols may include Chinese and English brackets, full stops, enumeration commas and the like. For example, suppose the candidate title is "(one), four functions of the product" and the preset title style is the second title style. According to the regular expressions corresponding to the second title style described above, both "(one)" and "(one)," in the candidate title can be matched to corresponding regular expressions, so the candidate title matches a plurality of regular expressions. Because the match for "(one)," is the longest, the regular expression matched by "(one)," is determined as the target regular expression corresponding to the candidate title, and the candidate chapter title dictionary tree is then obtained according to the target regular expression and the text content corresponding to the candidate title.
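The longest-match rule can be sketched as below. The two patterns and their numbers (4 and 6) are assumptions used only for illustration, and matching is anchored at the start of the title, which is also an assumption of this sketch.

```python
import re
from typing import Dict, Optional

# Assumed subset of the numbered expressions (numbering is hypothetical).
PATTERNS: Dict[int, str] = {
    4: r"\([一二三四五六七八九十]+\)、",
    6: r"\([一二三四五六七八九十]+\)",
}


def longest_match_pattern(title: str) -> Optional[int]:
    """Return the number of the expression whose match at the start of the
    title is longest, or None when no expression matches."""
    best_number: Optional[int] = None
    best_length = 0
    for number, pattern in PATTERNS.items():
        match = re.match(pattern, title)
        if match and match.end() > best_length:
            best_number, best_length = number, match.end()
    return best_number
```

For a title beginning with "(一)、", both expressions match, but the one that also consumes the enumeration comma covers more characters and is therefore selected.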
In one embodiment, as shown in fig. 9, a flowchart of preprocessing a candidate chapter title dictionary tree to obtain a target chapter title dictionary tree is provided, including the steps of:
S902, merging nodes with the same number of occurrences in the candidate chapter title dictionary tree to obtain a first reference chapter title dictionary tree.
It can be understood that merging nodes with the same number of occurrences reduces the size of the dictionary tree, which makes the subsequent operations for obtaining the document mode more convenient and improves the efficiency of document mode identification. For example, the number of occurrences of each node in the dictionary tree of fig. 6 can be counted, as shown in fig. 10: "@33@" occurs 4 times, "background introduction" occurs 1 time, "function" occurs 2 times, "description" occurs 1 time, "design" occurs 1 time, and "reference" occurs 1 time. Merging the nodes with the same number of occurrences in the candidate chapter title dictionary tree then yields the first reference chapter title dictionary tree shown in fig. 11.
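One way to read the merging step is as path compression: a node is merged with its only child when both have the same number of occurrences. The sketch below follows that reading, which is an assumption; the `Node` class and the helper `chain` (used only to build demo input) are hypothetical, and the labels are the presumed original-language form of the example titles.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    count: int
    children: List["Node"] = field(default_factory=list)


def merge_equal_counts(node: Node) -> Node:
    """Collapse a node into its only child while both have the same count,
    then apply the same rule to the remaining children."""
    while len(node.children) == 1 and node.children[0].count == node.count:
        child = node.children[0]
        node = Node(node.label + child.label, node.count, child.children)
    return Node(node.label, node.count,
                [merge_equal_counts(child) for child in node.children])


def chain(tokens: List[str], count: int, tail: Optional[List[Node]] = None) -> Node:
    """Demo helper: build a linear chain of nodes that share one count."""
    node = Node(tokens[-1], count, tail or [])
    for token in reversed(tokens[:-1]):
        node = Node(token, count, [node])
    return node


tree = chain(["第", "@33@", "章"], 4, [
    chain(list("背景介绍"), 1),
    chain(list("功能"), 2, [chain(list("说明"), 1), chain(list("设计"), 1)]),
    chain(list("参考文献"), 1),
])
merged = merge_equal_counts(tree)
```

After merging, every remaining non-leaf node has at least two children, which is the property the paragraph range step relies on.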
S904, setting paragraph ranges for all nodes in the first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree.
It can be understood that different types of node have different paragraph ranges. For example, a non-leaf node has two or more child nodes and never exactly one, because a non-leaf node with only one child node would already have been merged with that child node when the first reference chapter title dictionary tree was obtained. Thus, in some embodiments, a paragraph range may be set for each node in the first reference chapter title dictionary tree based on whether the node has child nodes, to obtain the second reference chapter title dictionary tree.
S906, excluding, from the second reference chapter title dictionary tree, nodes that cannot be chapter titles of the document to be processed at the same level as the other nodes, to obtain a target chapter title dictionary tree.
In some embodiments, the nodes that cannot be chapter titles of the document to be processed at the same level as the other nodes in the second reference chapter title dictionary tree may be excluded according to the paragraph information of the document to be processed, to obtain the target chapter title dictionary tree.
In some embodiments, the nodes that cannot be chapter titles of the document to be processed at the same level as the other nodes in the second reference chapter title dictionary tree may be excluded according to the paragraph numbers corresponding to the nodes in the second reference chapter title dictionary tree, to obtain the target chapter title dictionary tree.
In summary, in the embodiment shown in fig. 9, the first reference chapter title dictionary tree is obtained by merging the nodes with the same occurrence number in the candidate chapter title dictionary tree, so as to reduce the size of the dictionary tree and improve the efficiency of document mode recognition; setting paragraph ranges for all nodes in the first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree, and further eliminating nodes of chapter titles of the to-be-processed document which are unlikely to be in the same level with all nodes in the second reference chapter title dictionary tree to obtain a target chapter title dictionary tree so as to improve the recognition accuracy of the chapter titles of the to-be-processed document; thus, the document tree is obtained based on the target chapter title dictionary tree, and the recognition accuracy of the document mode can be improved when the mode mining is performed based on the document tree.
In one embodiment, as shown in fig. 12, there is provided a flowchart of setting a paragraph range for each node in a first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree, including the steps of:
S1202, determining whether each node in the first reference chapter title dictionary tree has a child node.
S1204, when determining that each node in the first reference chapter title dictionary tree has no child node, determining that the paragraph range of each node is a range consisting of a start paragraph number of each node and a stop paragraph number of each node; the start paragraph number and the end paragraph number of each node are paragraph numbers corresponding to each node in the document to be processed.
S1206, when determining that each node in the first reference chapter title dictionary tree has a child node, determining that the paragraph range of each node is a range formed by the beginning paragraph number of each node and the ending paragraph number of each node; the initial paragraph number of each node is the paragraph number corresponding to the first child node of each node in the document to be processed, and the termination paragraph number of each node is the paragraph number corresponding to the last child node of each node in the document to be processed.
S1208, determining the first reference chapter title dictionary tree after setting the paragraph range for each node in the first reference chapter title dictionary tree as the second reference chapter title dictionary tree.
In summary, in the embodiment shown in fig. 12, the paragraph numbers corresponding to the nodes in the document to be processed may be obtained by numbering the paragraphs in full text, so that the second reference chapter title dictionary tree is obtained by setting the paragraph range for each node in the first reference chapter title dictionary tree, and then the nodes of the chapter title of the document to be processed, which are unlikely to be in the same level with each node in the second reference chapter title dictionary tree, are excluded, so as to obtain the target chapter title dictionary tree, so as to improve the accuracy of identifying the chapter title of the document to be processed; thus, the document tree is obtained based on the target chapter title dictionary tree, and the recognition accuracy of the document mode can be improved when the mode mining is performed based on the document tree.
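Steps S1202 to S1208 can be sketched as a recursive pass over the tree. This is a minimal sketch under stated assumptions: each leaf is assumed to carry at least one paragraph number, and a non-leaf node takes the smallest start and largest end among its children, which equals the first and last child when children are in document order. The `Node` class, its field names and the sample paragraph numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str
    # Paragraph numbers at which a title ends on this node (leaves only here).
    paragraphs: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    span: Optional[Tuple[int, int]] = None  # (start paragraph, end paragraph)


def set_paragraph_ranges(node: Node) -> Tuple[int, int]:
    """Set and return the paragraph range of a node and all its descendants."""
    if not node.children:
        # Leaf: the range is bounded by the node's own paragraph numbers.
        node.span = (min(node.paragraphs), max(node.paragraphs))
    else:
        # Non-leaf: the range runs from the first child to the last child.
        child_spans = [set_paragraph_ranges(child) for child in node.children]
        node.span = (min(start for start, _ in child_spans),
                     max(end for _, end in child_spans))
    return node.span


chapter = Node("chapter @33@", children=[
    Node("background introduction", paragraphs=[1]),
    Node("function", children=[
        Node("description", paragraphs=[2]),
        Node("design", paragraphs=[3]),
    ]),
    Node("reference", paragraphs=[10]),
])
set_paragraph_ranges(chapter)
```

The tree annotated in this way plays the role of the second reference chapter title dictionary tree in the steps above.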
In one embodiment, as shown in fig. 13, there is provided a flowchart for excluding nodes of chapter titles of a document to be processed that are unlikely to be at the same level as each node in a second reference chapter title dictionary tree, to obtain a target chapter title dictionary tree, comprising the steps of:
S1302, traversing each node in the second reference chapter title dictionary tree.
S1304, when it is determined, based on the paragraph range of a node including the paragraph range of a sibling node of that node, that the sibling node cannot be a chapter title of the document to be processed at the same level as the node, obtaining the target chapter title dictionary tree.
The sibling nodes of a node are the nodes other than that node; whether a sibling node is at the same level as the node is determined by judging whether the paragraph range of the node includes the paragraph range of the sibling node.
Illustratively, fig. 14 provides a candidate chapter title dictionary tree; after the merging processing and the paragraph range setting processing are performed on its nodes, the result shown in fig. 15 is obtained. For the node "chapter @33@", the number of occurrences is 4 and the covered paragraph range is paragraphs 1 to 10; for the node "@", the number of occurrences is 6 and the covered paragraph range is paragraphs 4 to 7. It can be seen that the paragraph range of the node "chapter @33@" includes the paragraph range of each branch of the node "@". Because the paragraph ranges of the branches of the node "@" intersect the paragraph range of the node "chapter @33@", that is, the paragraph range of the node "chapter @33@" includes them, the node "chapter @33@" is not at the same level as the node "@"; in other words, nodes such as "background introduction" are not at the same level as nodes such as "query function". In fig. 15, a paragraph range displayed on the left that includes a paragraph range displayed on the right indicates that the node corresponding to the right-hand range is a sub-level node of the node corresponding to the left-hand range.
From the regular expressions corresponding to the preset title styles in fig. 14 and fig. 15, it can be known that the document to be processed includes: "first chapter background introduction", "second chapter functional description", "third chapter functional design", and "twelfth chapter reference"; the "third chapter functional design" includes: the "3.1 query function", "3.2 update function", and "3.3 other functions", the "3.1 query function" include: "(1) query conditions", "(2) shows results" and "(3) advanced query".
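The containment test used to decide whether two nodes can be at the same level can be sketched as follows. The function names and the node label "subsection" are hypothetical; the paragraph ranges (1 to 10 and 4 to 7) are the ones given in the example above.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start paragraph, end paragraph), inclusive


def contains(outer: Span, inner: Span) -> bool:
    """True when the inner range lies entirely within the outer range."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def same_level_nodes(spans: Dict[str, Span]) -> List[str]:
    """Keep only nodes whose range is not strictly nested in another node's
    range; nested nodes belong to a lower level and are excluded."""
    kept = []
    for name, span in spans.items():
        nested = any(
            other != name and other_span != span and contains(other_span, span)
            for other, other_span in spans.items()
        )
        if not nested:
            kept.append(name)
    return kept


spans = {"chapter @33@": (1, 10), "subsection": (4, 7)}
top_level = same_level_nodes(spans)
```

Here the range 4 to 7 lies inside the range 1 to 10, so only "chapter @33@" remains as a top-level chapter title; two nodes with disjoint ranges would both be kept.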
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a document processing device for realizing the above-mentioned document processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the document processing device provided below may refer to the limitation of the document processing method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 16, there is provided a document processing apparatus including: a first acquisition module 1602, a first processing module 1604, a second acquisition module 1606, and a second processing module 1608, wherein: a first obtaining module 1602, configured to obtain a document to be processed; the first processing module 1604 is configured to obtain a target chapter title dictionary tree according to a regular expression corresponding to the document to be processed and a preset title style; the target chapter title dictionary tree includes at least one sub-hierarchy, each sub-hierarchy including at least one node; a second obtaining module 1606, configured to obtain a document tree based on the statistical information and the feature information of each node of each sub-level in the target chapter title dictionary tree; and the second processing module 1608 is used for carrying out mode mining on each node of each sub-level in the document tree according to the statistical information and the characteristic information between each node and the brother node corresponding to each node in the document tree, and obtaining the document mode corresponding to the document to be processed.
In one embodiment, the second processing module is further configured to: determining whether statistical information and characteristic information between each node and brother nodes corresponding to each node in a document tree are similar; when it is determined that each node in the document tree has the same mode as the corresponding brother node of each node according to the similarity of statistical information and characteristic information between each node and the corresponding brother node of each node, according to the mode mining results of all nodes in the document tree, a document mode corresponding to the document to be processed is obtained.
In one embodiment, the first processing module is further configured to: according to at least one paragraph information of the document to be processed, obtaining a candidate title corresponding to the document to be processed; obtaining a candidate chapter title dictionary tree according to the regular expressions corresponding to the candidate title and the preset title style; preprocessing the candidate chapter title dictionary tree to obtain a target chapter title dictionary tree.
In one embodiment, the candidate title includes a plurality of preset contents, and the first processing module is further configured to: determining a target regular expression corresponding to the candidate title according to preset contents which appear first in the preset contents; and obtaining a candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
In one embodiment, the number of regular expressions corresponding to the preset header style is a plurality of regular expressions; the first processing module is further configured to: when a plurality of regular expressions are matched with the candidate title according to the regular expressions corresponding to the preset title patterns, determining the regular expression with the longest matching length of the candidate title as a target regular expression corresponding to the candidate title; and obtaining a candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
In one embodiment, the first processing module is further configured to: merging nodes with the same occurrence frequency in the candidate chapter title dictionary tree to obtain a first reference chapter title dictionary tree; setting paragraph ranges for all nodes in the first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree; and eliminating nodes of the chapter titles of the documents to be processed, which are unlikely to be in the same level with each node in the second reference chapter title dictionary tree, so as to obtain a target chapter title dictionary tree.
In one embodiment, the first processing module is further configured to: determining whether each node in the first reference chapter title dictionary tree has a child node; when determining that each node in the first reference chapter title dictionary tree has no child node, determining that the paragraph range of each node is a range formed by the starting paragraph number of each node and the ending paragraph number of each node; the initial paragraph number and the end paragraph number of each node are paragraph numbers corresponding to each node in the document to be processed; when determining that each node in the first reference chapter title dictionary tree has a child node, determining that the paragraph range of each node is a range formed by the starting paragraph number of each node and the ending paragraph number of each node; the initial paragraph number of each node is the paragraph number corresponding to the first child node of each node in the document to be processed, and the termination paragraph number of each node is the paragraph number corresponding to the last child node of each node in the document to be processed; and determining the first reference chapter title dictionary tree after the paragraph range is set for each node in the first reference chapter title dictionary tree as a second reference chapter title dictionary tree.
In one embodiment, the first processing module is further configured to: traverse each node in the second reference chapter title dictionary tree; and when it is determined, based on the paragraph range of a node including the paragraph range of a sibling node corresponding to that node, that the sibling node cannot be a chapter title of the document to be processed at the same level as that node, exclude the sibling node and thereby obtain the target chapter title dictionary tree.
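The containment check described in this embodiment can be sketched as follows. This is an illustrative assumption rather than the method of the present application: `RangedNode`, `encloses`, and `prune_enclosed_siblings` are hypothetical names, excluded nodes are simply dropped from their parent's child list, and siblings with identical ranges are both kept because the text does not say how such ties are handled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RangedNode:
    """Hypothetical tree node that already carries a paragraph range."""
    label: str
    paragraph_range: Tuple[int, int]
    children: List["RangedNode"] = field(default_factory=list)


def encloses(outer: Tuple[int, int], inner: Tuple[int, int]) -> bool:
    """True if `outer` includes `inner` and the two ranges are not identical."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


def prune_enclosed_siblings(node: RangedNode) -> None:
    """Drop children whose range lies inside the range of one of their siblings."""
    kept = [
        candidate
        for candidate in node.children
        if not any(
            other is not candidate
            and encloses(other.paragraph_range, candidate.paragraph_range)
            for other in node.children
        )
    ]
    node.children = kept
    for child in kept:
        prune_enclosed_siblings(child)


# Example: "B" (paragraphs 10-20) lies inside "A" (paragraphs 1-50), so "B"
# cannot be a title at the same level as "A" and is excluded.
tree = RangedNode("root", (1, 100), [
    RangedNode("A", (1, 50)),
    RangedNode("B", (10, 20)),
    RangedNode("C", (60, 80), [
        RangedNode("X", (60, 70)),
        RangedNode("Y", (62, 65)),
    ]),
])
prune_enclosed_siblings(tree)
```

The intuition is that a title whose paragraphs fall entirely inside another title's span is more plausibly a sub-title of that title than a peer of it.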
Each module in the above document processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 17. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment in which the operating system and the computer program in the non-volatile storage medium run. The database of the computer device stores the regular expressions corresponding to the preset title styles. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a document processing method.
It will be appreciated by those skilled in the art that the structure shown in fig. 17 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the computer program may carry out the steps of the method embodiments described above. Any reference to memory, a database, or another medium in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational databases and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above embodiments represent only a few implementations of the present application. They are described in relatively specific detail, but are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and these would fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (19)

1. A document processing method, the method comprising:
acquiring a document to be processed;
obtaining a target chapter title dictionary tree according to the document to be processed and a regular expression corresponding to a preset title style; the target chapter title dictionary tree includes at least one sub-level, each sub-level including at least one node;
obtaining a document tree based on statistical information and feature information of each node of each sub-level in the target chapter title dictionary tree;
and performing pattern mining on each node of each sub-level in the document tree according to statistical information and feature information between each node in the document tree and the sibling nodes corresponding to each node, to obtain a document pattern corresponding to the document to be processed.
2. The method according to claim 1, wherein the performing pattern mining on each node of each sub-level in the document tree according to statistical information and feature information between each node in the document tree and the sibling nodes corresponding to each node, to obtain a document pattern corresponding to the document to be processed, comprises:
determining whether the statistical information and the feature information of each node in the document tree are similar to those of the sibling nodes corresponding to each node;
and when it is determined, from that similarity, that each node in the document tree has the same pattern as its corresponding sibling nodes, obtaining the document pattern corresponding to the document to be processed according to the pattern mining results of all nodes in the document tree.
3. The method according to claim 1, wherein the obtaining a target chapter title dictionary tree according to the document to be processed and a regular expression corresponding to a preset title style comprises:
obtaining a candidate title corresponding to the document to be processed according to at least one piece of paragraph information of the document to be processed;
obtaining a candidate chapter title dictionary tree according to the candidate title and the regular expression corresponding to the preset title style;
and preprocessing the candidate chapter title dictionary tree to obtain the target chapter title dictionary tree.
4. The method of claim 3, wherein the candidate title includes a plurality of preset contents, and the obtaining a candidate chapter title dictionary tree according to the candidate title and the regular expression corresponding to the preset title style comprises:
determining a target regular expression corresponding to the candidate title according to the preset content that appears first among the plurality of preset contents;
and obtaining the candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
5. The method of claim 3, wherein there are a plurality of regular expressions corresponding to the preset title style; the obtaining a candidate chapter title dictionary tree according to the candidate title and the regular expression corresponding to the preset title style comprises:
when it is determined, according to the plurality of regular expressions corresponding to the preset title style, that the candidate title matches the plurality of regular expressions, determining the regular expression with the longest match length against the candidate title as the target regular expression corresponding to the candidate title;
and obtaining the candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
6. The method according to claim 3, wherein the preprocessing the candidate chapter title dictionary tree to obtain the target chapter title dictionary tree comprises:
merging nodes with the same occurrence frequency in the candidate chapter title dictionary tree to obtain a first reference chapter title dictionary tree;
setting a paragraph range for each node in the first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree;
and excluding nodes that cannot be chapter titles of the document to be processed at the same level as each node in the second reference chapter title dictionary tree, to obtain the target chapter title dictionary tree.
7. The method of claim 6, wherein the setting a paragraph range for each node in the first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree comprises:
determining whether each node in the first reference chapter title dictionary tree has a child node;
when a node in the first reference chapter title dictionary tree has no child node, determining that the paragraph range of the node is the range formed by the starting paragraph number of the node and the ending paragraph number of the node; the starting paragraph number and the ending paragraph number of the node are the paragraph numbers corresponding to the node in the document to be processed;
when a node in the first reference chapter title dictionary tree has a child node, determining that the paragraph range of the node is the range formed by the starting paragraph number of the node and the ending paragraph number of the node; the starting paragraph number of the node is the paragraph number corresponding to the first child node of the node in the document to be processed, and the ending paragraph number of the node is the paragraph number corresponding to the last child node of the node in the document to be processed;
and determining the first reference chapter title dictionary tree, after a paragraph range has been set for each of its nodes, as the second reference chapter title dictionary tree.
8. The method of claim 6, wherein the excluding nodes that cannot be chapter titles of the document to be processed at the same level as each node in the second reference chapter title dictionary tree, to obtain the target chapter title dictionary tree, comprises:
traversing each node in the second reference chapter title dictionary tree;
and when it is determined, based on the paragraph range of a node including the paragraph range of a sibling node corresponding to that node, that the sibling node cannot be a chapter title of the document to be processed at the same level as that node, excluding the sibling node to obtain the target chapter title dictionary tree.
9. A document processing apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a document to be processed;
the first processing module is used for obtaining a target chapter title dictionary tree according to the document to be processed and a regular expression corresponding to a preset title style; the target chapter title dictionary tree includes at least one sub-level, each sub-level including at least one node;
the second acquisition module is used for acquiring a document tree based on statistical information and feature information of each node of each sub-level in the target chapter title dictionary tree;
and the second processing module is used for performing pattern mining on each node of each sub-level in the document tree according to statistical information and feature information between each node in the document tree and the sibling nodes corresponding to each node, to obtain a document pattern corresponding to the document to be processed.
10. The apparatus of claim 9, wherein the second processing module is further configured to:
determine whether the statistical information and the feature information of each node in the document tree are similar to those of the sibling nodes corresponding to each node;
and when it is determined, from that similarity, that each node in the document tree has the same pattern as its corresponding sibling nodes, obtain the document pattern corresponding to the document to be processed according to the pattern mining results of all nodes in the document tree.
11. The apparatus of claim 9, wherein the first processing module is further configured to:
obtain a candidate title corresponding to the document to be processed according to at least one piece of paragraph information of the document to be processed;
obtain a candidate chapter title dictionary tree according to the candidate title and the regular expression corresponding to the preset title style;
and preprocess the candidate chapter title dictionary tree to obtain the target chapter title dictionary tree.
12. The apparatus of claim 11, wherein the candidate title comprises a plurality of preset contents, the first processing module further configured to:
determine a target regular expression corresponding to the candidate title according to the preset content that appears first among the plurality of preset contents;
and obtain the candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
13. The apparatus of claim 11, wherein there are a plurality of regular expressions corresponding to the preset title style; the first processing module is further configured to:
when it is determined, according to the plurality of regular expressions corresponding to the preset title style, that the candidate title matches the plurality of regular expressions, determine the regular expression with the longest match length against the candidate title as the target regular expression corresponding to the candidate title;
and obtain the candidate chapter title dictionary tree according to the target regular expression corresponding to the candidate title and the text content corresponding to the candidate title.
14. The apparatus of claim 11, wherein the first processing module is further configured to:
merge nodes with the same occurrence frequency in the candidate chapter title dictionary tree to obtain a first reference chapter title dictionary tree;
set a paragraph range for each node in the first reference chapter title dictionary tree to obtain a second reference chapter title dictionary tree;
and exclude nodes that cannot be chapter titles of the document to be processed at the same level as each node in the second reference chapter title dictionary tree, to obtain the target chapter title dictionary tree.
15. The apparatus of claim 14, wherein the first processing module is further configured to:
determine whether each node in the first reference chapter title dictionary tree has a child node;
when a node in the first reference chapter title dictionary tree has no child node, determine that the paragraph range of the node is the range formed by the starting paragraph number of the node and the ending paragraph number of the node; the starting paragraph number and the ending paragraph number of the node are the paragraph numbers corresponding to the node in the document to be processed;
when a node in the first reference chapter title dictionary tree has a child node, determine that the paragraph range of the node is the range formed by the starting paragraph number of the node and the ending paragraph number of the node; the starting paragraph number of the node is the paragraph number corresponding to the first child node of the node in the document to be processed, and the ending paragraph number of the node is the paragraph number corresponding to the last child node of the node in the document to be processed;
and determine the first reference chapter title dictionary tree, after a paragraph range has been set for each of its nodes, as the second reference chapter title dictionary tree.
16. The apparatus of claim 14, wherein the first processing module is further configured to:
traverse each node in the second reference chapter title dictionary tree;
and when it is determined, based on the paragraph range of a node including the paragraph range of a sibling node corresponding to that node, that the sibling node cannot be a chapter title of the document to be processed at the same level as that node, exclude the sibling node to obtain the target chapter title dictionary tree.
17. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
18. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
19. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
CN202310156700.9A 2023-02-09 2023-02-09 Document processing method, apparatus, computer program product, and readable storage medium Pending CN116136958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310156700.9A CN116136958A (en) 2023-02-09 2023-02-09 Document processing method, apparatus, computer program product, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310156700.9A CN116136958A (en) 2023-02-09 2023-02-09 Document processing method, apparatus, computer program product, and readable storage medium

Publications (1)

Publication Number Publication Date
CN116136958A (en) 2023-05-19

Family

ID=86333499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310156700.9A Pending CN116136958A (en) 2023-02-09 2023-02-09 Document processing method, apparatus, computer program product, and readable storage medium

Country Status (1)

Country Link
CN (1) CN116136958A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination