CN113761840A - Intelligent document processing method, system, computer device and medium - Google Patents

Intelligent document processing method, system, computer device and medium

Info

Publication number
CN113761840A
Authority
CN
China
Prior art keywords
node
data
xml file
target
document
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111048195.3A
Other languages
Chinese (zh)
Inventor
郭春磊
马丽霞
夏义鹏
王骁
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Securities Co Ltd
Original Assignee
China Securities Co Ltd
Application filed by China Securities Co Ltd
Priority to CN202111048195.3A
Publication of CN113761840A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G06F40/143 Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent document processing method, system, computer device and medium. The method comprises the following steps: acquiring an xml file of a data source document, wherein the xml file comprises at least one paragraph node, and the paragraph node comprises at least one target text node; standardizing the xml file to obtain a target xml file, wherein the standardization comprises target text node merging, target text node splitting and node identifier adding, executed in sequence; compressing the target xml file to obtain a standardized data source document; and extracting data from the standardized data source document and establishing a document database according to the data extraction result. By standardizing the data source document and extracting its data, the invention establishes a document database that provides a data basis for intelligent document editing, which helps save labor and time costs and improves the efficiency and accuracy of document editing.

Description

Intelligent document processing method, system, computer device and medium
Technical Field
The present invention relates to the field of document processing technologies, and in particular, to an intelligent document processing method, system, computer device, and medium.
Background
With the improvement of internet security, a service mode for developing financial services based on internet technology has been widely popularized in the financial field.
In the prior art, internet financial services usually exchange documents in an encoded XML format, and an operator manually enters the required data into a report based on the encoded data. As a result, the encoded data formats are not uniform, and report content is filled in, modified and deleted by hand, so that a large amount of time and human resources must be invested in text editing and approval, which is inefficient and costly.
Disclosure of Invention
The invention provides an intelligent document processing method, system, computer device and medium, in which a software program replaces manual work for document editing. The approach is highly automated and convenient, and addresses the high cost and low efficiency of manual editing.
In a first aspect, an embodiment of the present invention provides an intelligent document processing method, including the following steps: acquiring an xml file of a data source document, wherein the xml file comprises at least one paragraph node, and the paragraph node comprises at least one target text node; standardizing the xml file to obtain a target xml file, wherein the standardization comprises target text node merging, target text node splitting and node identifier adding which are sequentially executed; compressing the target xml file to obtain a standardized data source document; and extracting data of the standardized data source document, and establishing a document database according to the data extraction result.
Optionally, standardizing the xml file includes the following steps: traversing all paragraph nodes in the xml file using a recursive algorithm; merging all target text nodes in any paragraph node into a first target text node in the same paragraph node; and determining a first xml file according to the target text node merging result.
Optionally, standardizing the xml file includes the following steps: acquiring the first xml file obtained by merging the target text nodes; traversing all paragraph nodes in the first xml file using a recursive algorithm; splitting the text content in the first target text node based on a preset anchor mark in the first target text node; and determining a second xml file according to the target text node splitting result.
Optionally, standardizing the xml file includes the following steps: acquiring the second xml file obtained by splitting the target text node; traversing all target nodes in the second xml file using a recursive algorithm, wherein the target nodes comprise paragraph nodes and target text nodes; determining node identifiers of the target nodes according to the recursion order, wherein the node identifiers correspond to the target nodes one to one, and the values of the node identifiers increase sequentially in the recursion order; adding each node identifier to the attribute list of the corresponding target node; and determining the target xml file according to the node identifier adding result.
Optionally, the data extraction of the standardized data source document includes the following steps: decompressing the standardized data source document to obtain a target xml file with a node identifier; performing data analysis on node data of all nodes in the target xml file, wherein the node data comprises text content data, node label data and text type data; and determining the target structured data according to the data analysis result.
Optionally, establishing the document database according to the data extraction result includes the following steps: acquiring all directory title nodes in the target structured data based on the node tag data; acquiring the directory title data of each directory title node and the correspondence between the directory title node and the body text; and determining a document directory data set in the document database according to the directory title data and the correspondence.
Optionally, establishing the document database according to the data extraction result includes the following steps: acquiring all paragraph nodes in the target structured data based on the node tag data; traversing the node paragraph data of the target text nodes in each paragraph node; and determining a document paragraph data set in the document database according to the paragraph nodes and the node paragraph data.
Optionally, establishing the document database according to the data extraction result includes the following steps: acquiring all table nodes in the target structured data based on the node tag data; traversing the table row nodes and cell nodes in each table node; determining the coordinate parameters of each cell within the whole table according to the traversal result; and determining a document table data set in the document database according to the table nodes, the cell nodes and the coordinate parameters.
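The cell coordinate step can be sketched as follows. This is a minimal illustration and not the patent's implementation: the tag names `tbl`, `tr` and `tc` are assumptions borrowed from WordprocessingML with the namespace prefix dropped, the function name `cell_coordinates` is hypothetical, and merged cells are ignored.

```python
import xml.etree.ElementTree as ET


def cell_coordinates(table: ET.Element) -> list:
    """Return one record per cell with its (row, col) position in the table."""
    cells = []
    for row_index, row in enumerate(table.findall("tr")):
        for col_index, cell in enumerate(row.findall("tc")):
            # Concatenate the text of every text node inside the cell.
            text = "".join(t.text or "" for t in cell.iter("t"))
            cells.append({"row": row_index, "col": col_index, "text": text})
    return cells


# Hypothetical two-by-two table using the simplified tag names.
SAMPLE_TABLE = (
    "<tbl>"
    "<tr><tc><p><r><t>Item</t></r></p></tc><tc><p><r><t>Amount</t></r></p></tc></tr>"
    "<tr><tc><p><r><t>Cash</t></r></p></tc><tc><p><r><t>100</t></r></p></tc></tr>"
    "</tbl>"
)
cells = cell_coordinates(ET.fromstring(SAMPLE_TABLE))
```

Each record pairs a cell's text with zero-based row and column indices, which is one plausible form for the coordinate parameters stored in the document table data set.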
Optionally, after the document database is established, the intelligent document processing method further includes: and creating a target document according to the document database.
In a second aspect, an embodiment of the present invention further provides an intelligent document processing system, including:
the data source acquisition module is used for acquiring an xml file of a data source document, wherein the xml file comprises at least one paragraph node, and the paragraph node comprises at least one target text node;
the document preprocessing module is used for carrying out standardization processing on the xml file to obtain a target xml file, wherein the standardization processing comprises target text node merging, target text node splitting and node identifier adding which are sequentially executed;
the document compression module is used for compressing the target xml file to obtain a standardized data source document;
and the data extraction module is used for extracting data of the standardized data source document and establishing a document database according to the data extraction result.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the intelligent document processing method when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the above-mentioned intelligent document processing method.
The intelligent document processing system, the computer device and the computer-readable storage medium provided by the embodiments of the present invention execute the intelligent document processing method. The method acquires an xml file of a data source document, wherein the xml file comprises at least one paragraph node, and the paragraph node comprises at least one target text node; standardizes the xml file to obtain a target xml file, wherein the standardization comprises target text node merging, target text node splitting and node identifier adding, executed in sequence; compresses the target xml file to obtain a standardized data source document; and extracts data from the standardized data source document and establishes a document database according to the data extraction result, the document database being used to create a target document. This addresses the low efficiency and high cost of manual document editing in the prior art, provides a data basis for intelligent document editing, makes document processing highly automated and convenient, helps save labor and time costs, and improves the efficiency and accuracy of document editing.
Drawings
FIG. 1 is a flowchart of an intelligent document processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another intelligent document processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another intelligent document processing method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for processing an intelligent document according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for processing an intelligent document according to an embodiment of the present invention;
FIG. 6 is a flowchart of a method for processing an intelligent document according to an embodiment of the present invention;
FIG. 7 is a flowchart of a method for processing an intelligent document according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an intelligent document processing system according to a second embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of an intelligent document processing method according to the first embodiment of the present invention. This embodiment is applicable to scenarios in which documents are filled in, modified and deleted in an internet financial office system (e.g., a web office system). The method may be executed by a specific software program or functional module, and specifically includes the following steps:
Step S1: Acquire an xml file of the data source document, wherein the xml file includes at least one paragraph node, and the paragraph node includes at least one target text node.
XML (Extensible Markup Language) is a markup language designed for the internet. It can be used to mark up data and define data types, and is an important tool for internet data transmission.
The data source document is a Word document that records specific data and serves as the data source for subsequent document editing operations. Typically, the data source document may be financial report data sent by a stock exchange, and the file extension of the Word document may be docx.
The xml file of the data source document is a plain text file, formed by adding node tags to the text content of the data source document, and is suitable for transmission over the internet. A node is the most basic component of an xml file, and each part of the xml file may be referred to as a node; for example, an attribute, a piece of text or a comment is each a node. A node tag is a marker consisting of the symbols "<" and ">" and a tag name.
Typically, the node tags in the xml file include the following three types:
<p> is a paragraph-type node, i.e., a paragraph node, whose tag name is p. It represents an independent paragraph, and each paragraph has a paragraph start tag <p> and a paragraph end tag </p>;
<r> is a text-attribute-type node, i.e., a text attribute node, whose tag name is r. It represents a style string that indicates the display style of the text it contains, for example: bold font, font size 12, Song typeface, and the like;
<t> is a text-type node, i.e., a text node, whose tag name is t. It represents the actual text content, such as "company name", "balance sheet", "profit sheet", "cash flow sheet", etc.
In this step, an automated program may change the file extension of the data source document to zip (a data compression file format) and then decompress the resulting archive with a ZIP decompression method to obtain the xml file of the data source document. The xml file includes at least one paragraph node <p>, each paragraph node <p> includes at least one text attribute node <r>, and each text attribute node <r> includes at least one target text node <t>.
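The decompression in step S1 can be sketched with Python's standard library. This is a hedged illustration only: `read_document_xml` and `make_sample_docx` are hypothetical helper names, `word/document.xml` is the archive member in which Word stores the main document body, and the sample uses the simplified, namespace-free tags p, r and t from the description rather than real WordprocessingML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_document_xml(docx_bytes: bytes, member: str = "word/document.xml") -> ET.Element:
    """Open the .docx as a ZIP archive and parse its main XML part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return ET.fromstring(archive.read(member))


def make_sample_docx() -> bytes:
    """Build a tiny stand-in archive so the sketch runs without a real document."""
    xml = "<document><p><r><t>company name</t></r><r><t>: ACME</t></r></p></document>"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


root = read_document_xml(make_sample_docx())
paragraphs = root.findall("p")
texts = [t.text for t in root.iter("t")]
```

Because `zipfile` reads the archive structure directly, renaming the file to a zip extension is not needed in this sketch; the rename in the description serves the same purpose for tools that choose a decompressor by extension.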
Step S2: Standardize the xml file to obtain a target xml file, wherein the standardization includes target text node merging, target text node splitting and node identifier adding, executed in sequence.
Target text node merging means merging the text content recorded in the target text nodes <t> of all text attribute nodes <r> in the same paragraph node <p> into a single text attribute node <r>. Target text node splitting means splitting the merged target text node in the same paragraph node <p> into a plurality of text attribute nodes <r>. Node identifier adding means adding a one-to-one corresponding identifier to all nodes in the file.
In this step, a recursive algorithm may be used to traverse the nodes in the xml file, and target text node merging, target text node splitting and node identifier adding are performed in sequence to obtain the target xml file with node identifiers.
Step S3: Compress the target xml file to obtain a standardized data source document.
The file extension of the standardized data source document may be docx. The target xml file may be compressed using an xml-based compressed file format, rather than a default file format, to form a standardized data source document with the docx extension, in which the text content is stored in xml format.
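A minimal sketch of the compression in step S3, assuming the standardized document is produced by copying the original archive and swapping in the modified main XML part. The function name `repack_docx` and the sample archive contents are hypothetical.

```python
import io
import zipfile


def repack_docx(original: bytes, new_xml: bytes, member: str = "word/document.xml") -> bytes:
    """Copy every archive member unchanged, except `member`, which is replaced."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(original)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            data = new_xml if info.filename == member else source.read(info.filename)
            target.writestr(info.filename, data)
    return output.getvalue()


# Build a small stand-in archive and replace its main part.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _archive:
    _archive.writestr("[Content_Types].xml", "<Types/>")
    _archive.writestr("word/document.xml", "<document><p/></document>")

NEW_XML = b'<document><p yuxin_uid="0"/></document>'
standardized = repack_docx(_buffer.getvalue(), NEW_XML)

with zipfile.ZipFile(io.BytesIO(standardized)) as _check:
    names = _check.namelist()
    body = _check.read("word/document.xml")
    content_types = _check.read("[Content_Types].xml")
```

Copying the other members unchanged keeps the rest of the package intact, so only the standardized text structure differs from the original document.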
Step S4: Extract data from the standardized data source document, and establish a document database according to the data extraction result.
The document database is a structured data set stored in the internet office system; automated office work can be realized by calling the data in the document database.
Optionally, the data extraction may include extracting directory title data, paragraph data and table data from the document. A person skilled in the art may choose a specific extraction method according to actual needs, and no limitation is imposed here.
Specifically, after a data source document is received, data structure preprocessing is performed on it. The data source document is decompressed to obtain its xml file, which includes at least one paragraph node, each paragraph node including at least one target text node. All nodes in the xml file are traversed with a recursive algorithm, and different nodes are identified by their node tags. Target text node merging, target text node splitting and node identifier adding are then performed on the xml file in sequence to obtain a target xml file with simplified text data, in which each node has a one-to-one corresponding node identifier. Finally, the target xml file with node identifiers is compressed using the xml-based compressed file format to obtain a standardized data source document in docx format.
After the data structure preprocessing is completed, data is extracted from the standardized data source document. A parser may be used to obtain the data stored in all nodes of the standardized data source document, and a recursive algorithm is used to classify, extract and clean the data. All of the data is then integrated and stored in the database, and the resulting document database can be used to create a target document.
Optionally, fig. 2 is a flowchart of another intelligent document processing method provided in the first embodiment of the present invention. On the basis of fig. 1, it exemplarily shows a specific implementation of standardizing the xml file, without limiting the method described above.
Referring to fig. 2, step S2 includes the following steps:
Step S21: Perform target text node merging on the xml file.
Step S22: Perform target text node splitting on the xml file after node merging.
Step S23: Add node identifiers to the xml file after node splitting.
Step S24: Determine the xml file with node identifiers added as the target xml file.
Specifically, the xml file obtained by decompressing the data source document may be defined as the initial xml file, the xml file after node merging as the first xml file, and the xml file after node splitting as the second xml file. In each paragraph node <p> of the initial xml file, all text content in all text attribute nodes <r> is merged into the same text attribute node <r> to obtain the first xml file. Then, following the text order, the text content in each paragraph node <p> of the first xml file is split into a plurality of text attribute nodes <r> to obtain the second xml file. Finally, unique node identifiers are added to all nodes in the second xml file, which unifies the document data structure and facilitates data extraction.
Hereinafter, the xml file preprocessing method of steps S21 to S23 is described in detail with reference to the drawings.
Fig. 3 is a flowchart of another intelligent document processing method according to the first embodiment of the present invention. On the basis of fig. 2, it exemplarily shows a specific implementation of target text node merging, without limiting the node merging method described above.
Referring to fig. 3, performing target text node merging on the xml file specifically includes the following steps:
Step S201: Traverse all paragraph nodes <p> in the xml file using a recursive algorithm.
In computer science, a recursive algorithm solves a problem by repeatedly decomposing it into similar sub-problems.
In this step, the nodes in the xml file are hierarchical, and the node hierarchy is called a document tree. A recursive algorithm is used to traverse all nodes in the xml file, including each paragraph node <p> and the text attribute nodes <r> and target text nodes <t> within it.
Step S202: Merge all target text nodes <t> in any paragraph node <p> into the first target text node of that paragraph node <p>.
The first target text node may be the target text node <t> in any text attribute node <r> of the paragraph node <p>; preferably, it is the target text node <t> in the first text attribute node <r> of the paragraph node <p>.
In this step, after all target text nodes <t> are merged into the first target text node, all target text nodes <t> in the same paragraph node <p> other than the first target text node, together with their child nodes, are deleted.
Step S203: Determine the first xml file according to the target text node merging result, wherein in the first xml file, the text content of each paragraph node <p> is merged into a single target text node <t>.
Specifically, while recursively traversing all nodes in the xml file, all target text nodes <t> in all text attribute nodes <r> of the same paragraph node <p> are merged into the first target text node <t> of that paragraph node <p>, and all other target text nodes <t> in the paragraph node <p>, together with their child nodes, are deleted to obtain the first xml file. In the first xml file, the text content of each paragraph node <p> is merged into a single target text node <t>, which realizes the merging of the document content.
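Steps S201 to S203 can be sketched as below. This is an illustrative reading rather than the patent's code: it uses the simplified tags p, r and t without namespaces, the function name `merge_text_nodes` is hypothetical, and removing a text attribute node once it has lost its only text node is an assumption about what deleting the "child nodes" means.

```python
import xml.etree.ElementTree as ET


def merge_text_nodes(node: ET.Element) -> None:
    """Recursively visit the tree; in each <p>, fold all <t> text into the first <t>."""
    if node.tag == "p":
        texts = [t for r in node.findall("r") for t in r.findall("t")]
        if texts:
            first = texts[0]
            first.text = "".join(t.text or "" for t in texts)
            for run in node.findall("r"):
                had_text = False
                for t in run.findall("t"):
                    had_text = True
                    if t is not first:
                        run.remove(t)
                # Assumption: a run that only carried merged-away text is dropped.
                if had_text and run.find("t") is None:
                    node.remove(run)
        return
    for child in list(node):
        merge_text_nodes(child)


root = ET.fromstring(
    "<doc><p><r><t>Total </t></r><r><t>assets: </t></r><r><t>[A1]</t></r></p></doc>"
)
merge_text_nodes(root)
merged = ET.tostring(root, encoding="unicode")
```

After the call, the paragraph holds a single text attribute node whose text node contains the whole paragraph text, which corresponds to the first xml file in the description.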
Optionally, fig. 4 is a flowchart of another intelligent document processing method provided in the first embodiment of the present invention. On the basis of fig. 2, it exemplarily shows a specific implementation of target text node splitting, without limiting the node splitting method described above.
Referring to fig. 4, performing target text node splitting on the xml file after node merging specifically includes the following steps:
Step S204: Acquire the first xml file obtained by merging the target text nodes, wherein in the first xml file, the text content of each paragraph node <p> is merged into a single target text node <t>.
Step S205: Traverse all paragraph nodes <p> in the first xml file using a recursive algorithm.
In this step, the nodes of the first xml file form a tree structure, and a recursive algorithm is used to traverse all nodes in the xml file, including each paragraph node <p> and the text attribute nodes <r> and target text nodes <t> within it.
Step S206: Split the text content in the first target text node <t> based on the preset anchor marks in the first target text node <t>.
An anchor is a position mark set in the document. A preset anchor mark may be added to the data source document by a document processing program and may be used to point to specific text in the xml file. Typically, the specific text pointed to by a preset anchor mark may include a title, a directory, specific data and other text. Each target text node <t> may include at least one preset anchor mark.
Illustratively, the preset anchor mark may be "[", that is, specific text in the xml file may be marked with the preset anchor mark "[".
Step S207: Determine the second xml file according to the target text node splitting result, wherein in the second xml file, the text content of each paragraph node is split into a plurality of target text nodes <t> according to the preset anchor marks.
Specifically, after node merging, a recursive algorithm is used to traverse all paragraph nodes <p> in the first xml file, in which the text content of each paragraph node <p> has been merged into the first target text node <t>. The text content of the first target text node <t> is split into a plurality of target text nodes <t> according to the preset anchor marks: each preset anchor mark forms a segment of its own, and the text content other than the preset anchor marks is integrated into an independent target text node <t>. All resulting target text nodes <t> are inserted into the paragraph node <p> to obtain the second xml file. Because the text content in the second xml file is split according to the preset anchor marks, specific paragraph data is easier to extract.
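Steps S204 to S207 can be sketched as follows. The anchor syntax is an assumption made only for illustration: the pattern `ANCHOR` treats any bracketed token such as `[A1]` as a preset anchor mark, which the source does not specify. The function name `split_text_nodes` is hypothetical, and each stretch of ordinary text between anchors is kept as its own node in reading order, which is one possible interpretation of the description.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Hypothetical anchor syntax: a bracketed token such as [A1].
ANCHOR = re.compile(r"(\[[^\[\]]*\])")


def split_text_nodes(node: ET.Element) -> None:
    """Recursively visit the tree; in each <p>, split the merged text at anchor marks."""
    if node.tag == "p":
        first_run = node.find("r")
        first_t = first_run.find("t") if first_run is not None else None
        if first_t is None or not first_t.text:
            return
        # The capturing group keeps the anchors themselves as separate pieces.
        pieces = [piece for piece in ANCHOR.split(first_t.text) if piece]
        position = list(node).index(first_run)
        node.remove(first_run)
        for offset, piece in enumerate(pieces):
            run = copy.deepcopy(first_run)  # keep the run's style children
            run.find("t").text = piece
            node.insert(position + offset, run)
        return
    for child in list(node):
        split_text_nodes(child)


root = ET.fromstring("<doc><p><r><t>Total assets: [A1] as of [D1]</t></r></p></doc>")
split_text_nodes(root)
segments = [t.text for t in root.iter("t")]
```

Copying the original text attribute node for every segment is a design choice of this sketch, so that each segment keeps the display style of the merged node.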
Optionally, fig. 5 is a flowchart of another intelligent document processing method provided in the first embodiment of the present invention. On the basis of fig. 2, it exemplarily shows a specific implementation of adding node identifiers, without limiting the method described above.
Referring to fig. 5, adding node identifiers to the xml file after node splitting specifically includes the following steps:
Step S208: Acquire the second xml file obtained by splitting the target text node.
Step S209: Traverse all target nodes in the second xml file using a recursive algorithm, wherein the target nodes include paragraph nodes <p> and target text nodes <t>.
In this step, the nodes of the second xml file form a tree structure, and a recursive algorithm is used to traverse all target nodes in the xml file, including each paragraph node <p>, the text attribute nodes <r> in each paragraph node <p>, and the target text nodes <t> in each text attribute node <r>.
Step S210: Determine the node identifiers of the target nodes according to the recursion order, wherein the node identifiers correspond to the target nodes one to one, and the values of the node identifiers increase sequentially in the recursion order.
The value of the node identifier may be an auto-incrementing number, and each node identifier is unique within an xml file. Typically, the value starts from 0 and increases in recursion order: each time a new target node is traversed, the value is increased by 1.
Step S211: Add the node identifier to the attribute list of the corresponding target node.
The attribute list of a target node is a list that records the attributes of the target node. Typically, the attributes of a node include the node name, the node value and the node type, where the value of a text node is the text itself, and the value of a text attribute node is the value of the attribute.
Illustratively, the key of the newly added attribute may be defined as yuxin_uid.
Step S212: Determine the target xml file according to the node identifier adding result.
Specifically, after node splitting, all target nodes in the second xml file are traversed with a recursive algorithm. The value of the node identifier starts from 0 and is increased by 1 each time a new target node is traversed. The node identifier is written into the attribute list of each target node: the newly added attribute has the key yuxin_uid, and its value is the value of the node identifier. After unique node identifiers have been added to all target nodes, the target xml file is obtained.
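Steps S208 to S212 can be sketched as a depth-first recursion that threads a counter through the tree. The attribute key `yuxin_uid` comes from the description; the function name `add_node_identifiers`, the default choice of p and t as target tags, and the simplified namespace-free tags are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET


def add_node_identifiers(node: ET.Element, targets=("p", "t"),
                         key: str = "yuxin_uid", next_id: int = 0) -> int:
    """Number target nodes in recursion order and return the next unused value."""
    if node.tag in targets:
        node.set(key, str(next_id))  # write the identifier into the attribute list
        next_id += 1
    for child in node:
        next_id = add_node_identifiers(child, targets, key, next_id)
    return next_id


root = ET.fromstring(
    "<doc><p><r><t>a</t></r><r><t>b</t></r></p><p><r><t>c</t></r></p></doc>"
)
total = add_node_identifiers(root)
tagged = [(e.tag, e.get("yuxin_uid")) for e in root.iter() if e.get("yuxin_uid") is not None]
```

Returning the next unused value from each call keeps the counter free of global state, and passing a different `targets` tuple would also number the text attribute nodes.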
Therefore, through steps S201 to S212, the present invention modifies the data structure of the initial xml file to obtain the target xml file, in which the node identifiers of all nodes are unique. The target xml file is then compressed into a standardized data source document in docx format, which is convenient for data transmission.
Optionally, fig. 6 is a flowchart of another intelligent document processing method provided in an embodiment of the present invention, and on the basis of fig. 1, a specific implementation of data extraction is exemplarily shown, but not limited to the data extraction method described above.
Referring to fig. 6, extracting data from a standardized data source document and creating a document database according to the data extraction result includes the following steps:
Step S401: decompressing the standardized data source document to obtain a target xml file with the node identifier.
In this step, the suffix of the docx document may be changed to ZIP, and the resulting compressed package is then decompressed using a ZIP decompression method. The xml file obtained in this way records all the text content of the entire standardized data source document in docx format.
Step S402: performing data analysis on node data of all nodes in the target xml file, wherein the node data comprises text content data (text), node tag data (tag), text type data and a node identifier.
The text content data (text) may be used to define the text content, the node tag data (tag) may be used to define the type of node, and the text type data may be used to define the text type, which may typically be text, table, or picture.
In this embodiment, an lxml or xml parser may be used to parse the decompressed target xml file, and a recursive algorithm is used to read and clean node data of all nodes.
Step S403: determining the target structured data according to the data analysis result.
Illustratively, the target structured data can be two-dimensional structured data.
Step S404: building a document database based on the target structured data.
Specifically, when data is extracted from a standardized data source document, an automation program first changes the suffix of the docx document to ZIP, and a ZIP decompression method is then used to obtain the target xml file. An lxml or xml parser parses the decompressed target xml file, and a recursive algorithm traverses the node data of all nodes in it. The node data is integrated and cleaned to form two-dimensional structured data, data extraction is performed on that two-dimensional structured data, and a document database is established to facilitate classified storage of the data.
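Steps S401 to S403 can be sketched as below. The sketch reads the package directly with Python's zipfile module, which makes renaming the file unnecessary because a docx file is already a ZIP archive, and it uses the standard xml.etree.ElementTree parser rather than lxml. The helper names, the mapping from tags to text types, the cleaning rule (strip whitespace, keep only nodes that carry an identifier) and the sample xml are all assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

DOCUMENT_PART = "word/document.xml"  # main body part inside a docx package
UID_KEY = "yuxin_uid"
TYPE_BY_TAG = {"tbl": "table", "drawing": "picture"}  # assumed tag-to-type mapping


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_target_xml(docx_bytes: bytes) -> ET.Element:
    """Decompress the package in memory and parse its main xml part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return ET.fromstring(package.read(DOCUMENT_PART))


def flatten(node: ET.Element, rows: list, text_type: str = "text") -> list:
    """Recursively collect (identifier, tag, text, text type) rows."""
    tag = local_name(node.tag)
    text_type = TYPE_BY_TAG.get(tag, text_type)  # descendants inherit the type
    uid = node.get(UID_KEY)
    if uid is not None:  # cleaning: skip nodes without an identifier
        rows.append((int(uid), tag, (node.text or "").strip(), text_type))
    for child in node:
        flatten(child, rows, text_type)
    return rows


# Hypothetical standardized document containing one paragraph and one table.
xml = (
    '<document><body>'
    '<p yuxin_uid="0"><r><t yuxin_uid="1"> Revenue </t></r></p>'
    '<tbl><tr><tc><p yuxin_uid="2"><r><t yuxin_uid="3">42</t></r></p></tc></tr></tbl>'
    '</body></document>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as package:
    package.writestr(DOCUMENT_PART, xml)

rows = flatten(read_target_xml(buf.getvalue()), [])
```

Each row has the same four columns, so the result is a simple two-dimensional structure that could be loaded into a database table.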
Optionally, fig. 7 is a flowchart of another intelligent document processing method according to an embodiment of the present invention, and referring to fig. 7, the creating a document database based on target structured data specifically includes the following steps:
Step S701: extracting the directory title data of all nodes in the target xml file.
The directory title data includes text content and text attributes of the directory title, and the text content includes a directory name and a title name.
Optionally, building a document database based on the target structured data comprises the following steps: acquiring all directory title nodes in the target structured data based on the node tag data; acquiring directory title data of each directory title node and a corresponding relation between the directory title nodes and a text; and determining a document directory data set in the document database according to the directory title data and the corresponding relation.
The corresponding relationship between the directory title node and the text includes the structural data of each level of directory and title in the text, for example, the structural data includes a first level directory, a second level directory, a first level title, a second level title, etc.
Specifically, a paragraph node is defined as <p>. All child nodes in each paragraph node <p> are traversed to determine whether an <instrText> node exists among them and whether its text content contains the text "PAGEREF". If both conditions hold, the directory position is quickly located through the "PAGEREF" text, and the directory title data of the corresponding directory title node and the correspondence between the directory title node and the text are obtained. A document directory can then be formed from the directory title data and the correspondence, providing directory data for intelligent office work.
Step S702: extracting node paragraph data of all nodes in the target xml file.
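The "PAGEREF" check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the use of t nodes as the source of the title text, the bookmark name _Toc1 and the sample xml are hypothetical.

```python
import xml.etree.ElementTree as ET

UID_KEY = "yuxin_uid"


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def is_directory_paragraph(p: ET.Element) -> bool:
    """True if the paragraph holds an instrText node whose text contains PAGEREF."""
    return any(
        local_name(n.tag) == "instrText" and "PAGEREF" in (n.text or "")
        for n in p.iter()
    )


def directory_entries(root: ET.Element) -> list:
    """Collect the identifier and title text of every directory paragraph."""
    entries = []
    for p in root.iter():
        if local_name(p.tag) == "p" and is_directory_paragraph(p):
            title = "".join(
                n.text or "" for n in p.iter() if local_name(n.tag) == "t"
            )
            entries.append({"uid": p.get(UID_KEY), "title": title})
    return entries


# Hypothetical body: the first paragraph is a directory entry, the second is not.
root = ET.fromstring(
    '<body>'
    '<p yuxin_uid="0"><r><t>1 Overview</t></r>'
    '<r><instrText> PAGEREF _Toc1 </instrText></r></p>'
    '<p yuxin_uid="1"><r><t>Body text</t></r></p>'
    '</body>'
)
entries = directory_entries(root)
```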
The node paragraph data comprises at least one target text node and specific text content in each node.
Optionally, building a document database based on the target structured data comprises the following steps: acquiring all paragraph nodes in the target structured data based on the node label data; traversing node paragraph data of a target text node in each paragraph node; and determining a document paragraph data set in the document database according to the paragraph nodes and the node paragraph data.
Specifically, the paragraph node is defined as <p>. The node paragraph data of each paragraph node is obtained by traversing the content of each <p> node, and all paragraph nodes and their one-to-one corresponding paragraph data are integrated and stored in the database, providing text content data for intelligent office work.
Step S703: extracting table data of all nodes in the target xml file.
The table data comprises the coordinates of each cell within the whole table and the text content of each cell.
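Paragraph extraction can be sketched as below. The example uses the WordprocessingML namespace to show that the tag comparison works on namespaced elements; the helper names and the sample content are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

UID_KEY = "yuxin_uid"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def paragraph_rows(root: ET.Element) -> list:
    """Return one (paragraph identifier, paragraph text) pair per paragraph node."""
    rows = []
    for p in root.iter():
        if local_name(p.tag) != "p":
            continue
        text = "".join(n.text or "" for n in p.iter() if local_name(n.tag) == "t")
        rows.append((p.get(UID_KEY), text))
    return rows


# Hypothetical body with one two-run paragraph and one empty paragraph.
root = ET.fromstring(
    f'<w:body xmlns:w="{W_NS}">'
    '<w:p yuxin_uid="0"><w:r><w:t>Net profit</w:t></w:r>'
    '<w:r><w:t> rose.</w:t></w:r></w:p>'
    '<w:p yuxin_uid="1"/>'
    '</w:body>'
)
paragraphs = paragraph_rows(root)
```

Keeping the paragraph identifier next to the text preserves the one-to-one relation between paragraph nodes and paragraph data when the rows are stored.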
Optionally, building a document database based on the target structured data comprises the following steps: acquiring all table nodes in the target structured data based on the node label data; traversing table row nodes and cell nodes in each table node; determining the coordinate parameter of each cell in the whole table according to the traversal result; and determining a document table data set in the document database according to the table nodes, the cell nodes and the coordinate parameters.
Specifically, the node label of a table node can be defined as <tbl>. Table nodes are identified by the label <tbl>, the table row nodes <tr> and cell nodes <tc> within each table node <tbl> are traversed, and the coordinate parameter of each cell in the whole table is calculated, providing table data for intelligent office work.
Step S704: classifying and integrating the directory title data, the node paragraph data and the table data to establish a document database.
Specifically, the extracted directory title data, node paragraph data and table data each undergo abnormal-value and missing-value processing and are then classified and stored. This realizes classified storage of document data, provides a data basis for document editing operations in an internet financial office system, and improves document editing efficiency and accuracy.
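The original text does not give the coordinate algorithm. One simple possibility, shown here purely as an assumption, is to use zero-based (row, column) indices obtained by enumerating the tr children of a tbl node and the tc children of each tr node. Merged cells and nested tables are ignored in this sketch, and the helper names and sample table are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def table_cells(tbl: ET.Element) -> list:
    """Return (row, column, text) for every cell of one table node."""
    cells = []
    table_rows = [n for n in tbl if local_name(n.tag) == "tr"]
    for row_index, tr in enumerate(table_rows):
        row_cells = [n for n in tr if local_name(n.tag) == "tc"]
        for col_index, tc in enumerate(row_cells):
            text = "".join(
                n.text or "" for n in tc.iter() if local_name(n.tag) == "t"
            )
            cells.append((row_index, col_index, text))
    return cells


# Hypothetical 2 x 2 table.
tbl = ET.fromstring(
    "<tbl>"
    "<tr><tc><p><r><t>Item</t></r></p></tc><tc><p><r><t>2020</t></r></p></tc></tr>"
    "<tr><tc><p><r><t>Revenue</t></r></p></tc><tc><p><r><t>42</t></r></p></tc></tr>"
    "</tbl>"
)
cells = table_cells(tbl)
```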
Optionally, after the document database is established, the intelligent document processing method further includes: a target document is created from a document database.
The target document refers to a project table created through a web office system, and the project table can be a financial report of any company.
Specifically, the project personnel can call data such as directories, paragraph contents or tables in the document database according to actual business requirements, and perform operations such as modification, filling or deletion on the target document.
Therefore, the invention preprocesses the xml file of the data source document, including target text node merging, target text node splitting and node identifier adding, to obtain the standardized data source document; extracts data from the standardized data source document; establishes the document database according to the data extraction result; and automatically creates the target document by calling data in the document database. This solves the problems of low efficiency and high cost caused by manually editing documents in the prior art and provides a data basis for document editing operations in an internet financial office system. The degree of intelligence is high and document processing is convenient and fast, which helps save labor and time costs and improves document editing efficiency and accuracy.
Example two
The second embodiment of the invention provides an intelligent document processing system, which can execute the intelligent document processing method provided by any embodiment of the invention and has corresponding functional modules and beneficial effects of the execution method.
Fig. 8 is a schematic structural diagram of an intelligent document processing system according to a second embodiment of the present invention.
As shown in fig. 8, the intelligent document processing system 00 includes: the system comprises a data source acquisition module 101, a document preprocessing module 102, a document compression module 103 and a data extraction module 104, wherein the data source acquisition module 101 is used for acquiring an xml file of a data source document, the xml file comprises at least one paragraph node, and the paragraph node comprises at least one target text node; the document preprocessing module 102 is configured to perform standardization processing on the xml file to obtain a target xml file, where the standardization processing includes target text node merging, target text node splitting, and node identifier addition which are performed in sequence; the document compression module 103 is used for compressing the target xml file to obtain a standardized data source document; and the data extraction module 104 is used for extracting data of the standardized data source document and establishing a document database according to the data extraction result.
Optionally, the document preprocessing module 102 is configured to traverse all paragraph nodes in the xml file by using a recursive algorithm; merging all target text nodes in any paragraph node into a first target text node in the same paragraph node; and determining a first xml file according to the target text node merging result.
Optionally, the document preprocessing module 102 is further configured to obtain a first xml file obtained by merging the target text nodes; traversing all paragraph nodes in the first xml file by adopting a recursive algorithm; splitting the text content in the first target text node based on the preset anchor point mark in the first target text node; and determining a second xml file according to the target text node splitting result.
Optionally, the document preprocessing module 102 is further configured to obtain a second xml file obtained by splitting the target text node; traversing all target nodes in the second xml file by adopting a recursive algorithm, wherein the target nodes comprise paragraph nodes and target text nodes; determining node identifiers of target nodes according to a recursion sequence, wherein the node identifiers correspond to the target nodes one to one, and the values of the node identifiers are sequentially increased in an increasing manner according to the recursion sequence; adding the node identifier to the attribute list of the corresponding target node; and determining the target xml file according to the node identifier adding result.
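The target text node merging performed by the document preprocessing module can be sketched as follows. This is an illustrative sketch only: it assumes that the target text nodes are the t elements inside a p element, and it empties the later text nodes rather than removing them, which is a simplification not stated in the original text. The helper names and the sample xml are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def merge_text_nodes(root: ET.Element) -> None:
    """Move the text of every text node in a paragraph into the first one."""
    for p in root.iter():
        if local_name(p.tag) != "p":
            continue
        texts = [n for n in p.iter() if local_name(n.tag) == "t"]
        if len(texts) < 2:
            continue  # nothing to merge
        texts[0].text = "".join(n.text or "" for n in texts)
        for extra in texts[1:]:
            extra.text = ""  # simplification: empty the node instead of deleting it


# Hypothetical body: the first paragraph has its text split across two runs.
root = ET.fromstring(
    "<body><p><r><t>Net </t></r><r><t>profit</t></r></p>"
    "<p><r><t>Cost</t></r></p></body>"
)
merge_text_nodes(root)
merged = [n.text for n in root.iter() if local_name(n.tag) == "t"]
```

Only text values change during the traversal, so the tree structure is never modified while it is being iterated.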
Optionally, the data extraction module 104 is configured to decompress the standardized data source document to obtain a target xml file with a node identifier; performing data analysis on node data of all nodes in the target xml file, wherein the node data comprises text content data, node label data and text type data; and determining the target structured data according to the data analysis result.
Optionally, the data extraction module 104 is further configured to obtain all directory title nodes in the target structured data based on the node tag data; acquiring directory title data of each directory title node and a corresponding relation between the directory title nodes and a text; and determining a document directory data set in the document database according to the directory header data and the corresponding relation.
Optionally, the data extraction module 104 is further configured to obtain all paragraph nodes in the target structured data based on the node tag data; traversing node paragraph data of a target text node in each paragraph node; and determining a document paragraph data set in the document database according to the paragraph nodes and the node paragraph data.
Optionally, the data extraction module 104 is further configured to obtain all table nodes in the target structured data based on the node tag data; traversing table row nodes and cell nodes in each table node; determining the coordinate parameter of each cell in the whole table according to the traversal result; and determining a document table data set in the document database according to the table nodes, the cell nodes and the coordinate parameters.
Optionally, the document database may be used to create the target document.
Therefore, the intelligent document processing system provided by the embodiment of the invention executes the intelligent document processing method. By standardizing the xml file of the data source document, extracting data from the standardized xml file, and establishing the document database according to the data extraction result, the method solves the problems of low efficiency and high cost caused by manually editing documents in the prior art and provides a data basis for intelligent document editing operations. The degree of intelligence is high and document processing is convenient and fast, which helps save labor and time costs and improves document editing efficiency and accuracy.
Example three
Fig. 9 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. FIG. 9 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.
As shown in FIG. 9, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors 16, a memory 28, a bus 18 connecting the various system components (including the memory 28 and the processors 16), and a computer program stored on the memory and executable on the processors, which when executed by the processors implement the intelligent document processing method described above, with corresponding functional blocks and advantages for performing the method.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, and commonly referred to as a "hard drive"). Although not shown in FIG. 9, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16 executes various functional applications and data processing, such as implementing the intelligent document processing method provided by the embodiments of the present invention, by executing programs stored in the memory 28.
Example four
The fourth embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the above-mentioned intelligent document processing method.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

1. An intelligent document processing method is characterized by comprising the following steps:
acquiring an xml file of a data source document, wherein the xml file comprises at least one paragraph node, and the paragraph node comprises at least one target text node;
standardizing the xml file to obtain a target xml file, wherein the standardization comprises target text node merging, target text node splitting and node identifier adding which are sequentially executed;
compressing the target xml file to obtain a standardized data source document;
and extracting data of the standardized data source document, and establishing a document database according to the data extraction result.
2. The intelligent document processing method according to claim 1, wherein the step of standardizing the xml file comprises the steps of:
traversing all paragraph nodes in the xml file by adopting a recursive algorithm;
merging all target text nodes in any paragraph node into a first target text node in the same paragraph node;
and determining a first xml file according to the target text node merging result.
3. The intelligent document processing method according to claim 2, wherein the step of standardizing the xml file comprises the steps of:
acquiring a first xml file obtained by merging the target text nodes;
traversing all paragraph nodes in the first xml file by adopting a recursive algorithm;
splitting text contents in a first target text node based on a preset anchor point mark in the first target text node;
and determining a second xml file according to the target text node splitting result.
4. The intelligent document processing method according to claim 1, wherein the step of standardizing the xml file comprises the steps of:
acquiring a second xml file obtained by splitting the target text node;
traversing all target nodes in the second xml file by adopting a recursive algorithm, wherein the target nodes comprise paragraph nodes and target text nodes;
determining node identifiers of the target nodes according to a recursion sequence, wherein the node identifiers correspond to the target nodes one by one, and the values of the node identifiers are sequentially increased progressively according to the recursion sequence;
adding the node identifier to a list of attributes of the corresponding target node;
and determining the target xml file according to the node identifier adding result.
5. The intelligent document processing method according to claim 1, wherein the data extraction of the standardized data source document comprises the following steps:
decompressing the standardized data source document to obtain a target xml file with a node identifier;
performing data analysis on node data of all nodes in the target xml file, wherein the node data comprises text content data, node label data and text type data;
and determining the target structured data according to the data analysis result.
6. The intelligent document processing method according to claim 5, wherein the creating of the document database according to the data extraction result comprises the steps of:
acquiring all directory title nodes in the target structured data based on the node tag data;
acquiring directory title data of each directory title node and a corresponding relation between the directory title node and a text;
and determining a document directory data set in the document database according to the directory header data and the corresponding relation.
7. The intelligent document processing method according to claim 5, wherein the creating of the document database according to the data extraction result comprises the steps of:
acquiring all paragraph nodes in the target structured data based on the node label data;
traversing node paragraph data of a target text node in each paragraph node;
and determining a document paragraph data set in the document database according to the paragraph nodes and the node paragraph data.
8. The intelligent document processing method according to claim 5, wherein the creating of the document database according to the data extraction result comprises the steps of:
obtaining all table nodes in the target structured data based on the node tag data;
traversing table row nodes and cell nodes in each table node;
determining the coordinate parameter of each cell in the whole table according to the traversal result;
and determining a document table data set in the document database according to the table nodes, the cell nodes and the coordinate parameters.
9. An intelligent document processing system, comprising:
the data source acquisition module is used for acquiring an xml file of a data source document, wherein the xml file comprises at least one paragraph node, and the paragraph node comprises at least one target text node;
the document preprocessing module is used for carrying out standardization processing on the xml file to obtain a target xml file, wherein the standardization processing comprises target text node merging, target text node splitting and node identifier adding which are sequentially executed;
the document compression module is used for compressing the target xml file to obtain a standardized data source document;
and the data extraction module is used for extracting data of the standardized data source document and establishing a document database according to the data extraction result.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the intelligent document processing method according to any one of claims 1-8 when executing the program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the intelligent document processing method according to any one of claims 1 to 8.
CN202111048195.3A 2021-09-08 2021-09-08 Intelligent document processing method, system, computer device and medium Pending CN113761840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111048195.3A CN113761840A (en) 2021-09-08 2021-09-08 Intelligent document processing method, system, computer device and medium

Publications (1)

Publication Number Publication Date
CN113761840A true CN113761840A (en) 2021-12-07

Family

ID=78793829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111048195.3A Pending CN113761840A (en) 2021-09-08 2021-09-08 Intelligent document processing method, system, computer device and medium

Country Status (1)

Country Link
CN (1) CN113761840A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571966A (en) * 2012-01-16 2012-07-11 上海方正数字出版技术有限公司 Network transmission method for large extensible markup language (XML) document
CN105320739A (en) * 2015-09-22 2016-02-10 深圳市永兴元科技有限公司 Information extraction method and apparatus
CN108334481A (en) * 2018-03-01 2018-07-27 四川语言桥信息技术有限公司 Document processing method and device
CN109783554A (en) * 2018-12-13 2019-05-21 重庆金融资产交易所有限责任公司 Excel document analytic method, device and computer readable storage medium
CN110083805A (en) * 2018-01-25 2019-08-02 北京大学 A kind of method and system that Word file is converted to EPUB file
CN110968999A (en) * 2019-11-01 2020-04-07 数地科技(北京)有限公司 Annotating method and system for automatically realizing fine granularity and diversification of docx file
CN112507660A (en) * 2020-12-07 2021-03-16 厦门美亚亿安信息科技有限公司 Method and system for determining homology and displaying difference of compound document
CN112507666A (en) * 2020-12-21 2021-03-16 北京百度网讯科技有限公司 Document conversion method and device, electronic equipment and storage medium
CN112667563A (en) * 2020-12-04 2021-04-16 深圳先进技术研究院 Document management and operation method and system

CN113806556A (en) Method, device, equipment and medium for constructing knowledge graph based on power grid data
CN114968725A (en) Task dependency relationship correction method and device, computer equipment and storage medium
CN113553826A (en) Information input method and device combining RPA and AI and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination