CN114118072A

CN114118072A - Document structuring method and device, electronic equipment and computer readable storage medium

Info

Publication number: CN114118072A
Application number: CN202010905682.6A
Authority: CN
Inventors: 李闯
Original assignee: Shanghai Xiaoi Robot Technology Co Ltd
Current assignee: Shanghai Xiaoi Robot Technology Co Ltd
Priority date: 2020-09-01
Filing date: 2020-09-01
Publication date: 2022-03-01

Abstract

Embodiments of the present application provide a document structuring method and apparatus, an electronic device, and a computer-readable storage medium, which address the low structuring efficiency, poor knowledge extraction precision, and low knowledge extraction coverage of existing document structuring approaches, shortcomings that make those approaches difficult to adapt to the requirements of different intelligent interaction scenarios. The document structuring method comprises: determining a corresponding target data splitting condition and a corresponding target structured processing condition according to the document type of a document to be processed; identifying data attribute information included in the document to be processed based on the target data splitting condition, so as to split from the document to be processed one or more pieces of target data corresponding to the data attribute information; and mapping each piece of target data to a corresponding structured data field based on the target structured processing condition.

Description

Document structuring method and device, electronic equipment and computer readable storage medium

Technical Field

The present application relates to the field of document processing technologies, and in particular, to a document structuring method, an apparatus, an electronic device, and a computer-readable storage medium.

Background

With the continuous development of artificial intelligence technology and the rising expectations people have of interactive experience, intelligent interaction modes such as intelligent question-answering, intelligent search, and intelligent push are gradually replacing some traditional human-computer interaction modes and have become a research hotspot. These intelligent interaction modes, however, rely on a knowledge database: the robot needs to understand the interaction intention of the interaction object and extract usable knowledge content from the knowledge database to complete its response. The knowledge database therefore needs to be established before intelligent interaction can take place.

In many application scenarios, the knowledge content to be used by the robot actually exists as unstructured documents (for example, rules and regulations or operating procedures established within a government body or an enterprise), which cannot meet the requirement for fine-grained control of knowledge content in intelligent interaction scenarios. It is therefore necessary to structure the knowledge content of these unstructured documents.

At present, unstructured documents are structured by having people interpret the semantics manually and then split the content and enter it into the database by hand. This requires a large number of knowledge engineers to split the unstructured documents according to the requirements of the structured data fields, which consumes a great deal of manpower and is inefficient. In addition, because knowledge extraction relies on human understanding of the semantics, neither the precision of the extracted knowledge nor the coverage of knowledge extraction across different document types is high. Moreover, since the document types of the unstructured documents to be consulted differ across intelligent interaction scenarios, as do the field formats of the structured documents to be formed, structured documents obtained through the existing approach can hardly support the differing requirements of different intelligent interaction scenarios.

Disclosure of Invention

In view of this, embodiments of the present application provide a document structuring method, an apparatus, an electronic device, and a computer-readable storage medium, which solve the problems that an existing document structuring method is low in structuring efficiency, poor in knowledge extraction precision, and low in knowledge extraction coverage, and therefore is difficult to adapt to different intelligent interaction scene requirements.

According to an aspect of the present application, a document structuring method provided by an embodiment of the present application includes: acquiring a plurality of data splitting conditions and a plurality of structured processing conditions, wherein different data splitting conditions correspond to different document types, different structured processing conditions correspond to different document types, the data splitting conditions comprise data attribute information of target data to be split, and the structured processing conditions comprise corresponding relations between the split target data and structured data fields; respectively determining corresponding target data splitting conditions and target structured processing conditions according to the document type of the document to be processed; identifying data attribute information included in the document to be processed based on the target data splitting condition so as to split one or more target data corresponding to the data attribute information from the document to be processed; and mapping each of the target data to a corresponding structured data field based on the target structured processing condition.

According to another aspect of the present application, an embodiment of the present application provides a document structuring apparatus including: the condition acquisition module is configured to acquire a plurality of data splitting conditions and a plurality of structured processing conditions, wherein different data splitting conditions correspond to different document types, different structured processing conditions correspond to different document types, the data splitting conditions comprise data attribute information of target data to be split, and the structured processing conditions comprise corresponding relations between the split target data and structured data fields; the condition determining module is configured to respectively determine corresponding target data splitting conditions and target structured processing conditions according to the document type of the document to be processed; the splitting module is configured to identify data attribute information included in the document to be processed based on the target data splitting condition so as to split one or more target data corresponding to the data attribute information from the document to be processed; and a structured processing module configured to map each of the target data to a corresponding structured data field based on the target structured processing condition.

According to another aspect of the present application, an embodiment of the present application provides an electronic device including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform a document structuring method as described in any of the preceding.

According to another aspect of the present application, an embodiment of the present application provides a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform a document structuring method as described in any of the preceding.

According to the document structuring method, the document structuring apparatus, the electronic device, and the computer-readable storage medium provided by the embodiments of the present application, the structuring process is divided into a data splitting process, which identifies target data based on data attribute information, and a structuring processing process, which maps the target data to structured data fields; and different data splitting conditions corresponding to different document types and different structured processing conditions corresponding to different document types are obtained. Throughout the structuring process, the preset data splitting condition and structured processing condition can be determined based on the document type, and the data splitting process and the structuring processing process can be carried out automatically based on preset rules without manual participation, so the efficiency of structuring can be greatly improved. Meanwhile, because differentiated structuring strategies can be applied in a targeted manner to different document types, the limitation of manual experience is removed, the precision of knowledge extraction and the coverage of knowledge extraction across different document types can be effectively ensured, the resulting structured document can effectively support the requirements of various intelligent interaction scenarios, and the human-computer interaction experience can be significantly improved.

Drawings

Fig. 1 is a flowchart illustrating a document structuring method according to an embodiment of the present application.

Fig. 2 is a schematic flowchart illustrating target data splitting in a document structuring method according to an embodiment of the present application.

Fig. 3 is a flowchart illustrating a document structuring method according to another embodiment of the present application.

Fig. 4 is a schematic structural diagram of a document structuring apparatus according to an embodiment of the present application.

Fig. 5 is a schematic structural diagram of a document structuring apparatus according to another embodiment of the present application.

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a flowchart illustrating a document structuring method according to an embodiment of the present application. As shown in fig. 1, the document structuring method includes the steps of:

Step 101: obtaining a plurality of data splitting conditions and a plurality of structured processing conditions, wherein different data splitting conditions correspond to different document types, different structured processing conditions correspond to different document types, the data splitting conditions comprise data attribute information of target data to be split, and the structured processing conditions comprise correspondences between the split target data and structured data fields.

The data splitting condition is a specific rule for guiding the subsequent identification and splitting of target data based on data attribute information, where the data attribute information represents data attribute characteristics of the document content. Different document types have document content in different forms, and the data attribute information they include also differs in form, so the data splitting conditions corresponding to different document types differ in content. For a given document type, the data splitting condition is fixed. Therefore, once the document type of the document to be processed is determined, it is also determined which data attribute information identifies the target data to be split from the document to be processed.

The structured processing condition is a specific rule for guiding which structured data field the split target data corresponds to. After the target data is split from the document to be processed based on the data splitting condition, the split target data needs to be mapped to the corresponding structured data field according to the structured processing condition, so as to complete the structuring process.

Step 102: determining the corresponding target data splitting condition and target structured processing condition according to the document type of the document to be processed.

As described above, different data splitting conditions correspond to different document types, and different structuring processing conditions correspond to different document types, so that the target data splitting condition to be referred to in the subsequent data splitting process and the target structuring processing condition to be referred to in the subsequent structuring processing process can be specified according to the document type of the document to be processed.
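The lookup described in steps 101 and 102 can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes the document type is inferred from the file suffix (using the suffix examples given in this description), and the table names, condition keys, and the function `determine_conditions` are all hypothetical.

```python
from pathlib import Path

# Hypothetical suffix-to-type table, following the suffix examples in the description.
SUFFIX_TO_TYPE = {
    ".doc": "text", ".docx": "text", ".pdf": "text",
    ".csv": "table", ".xls": "table",
    ".vsdx": "flowchart",
}

# Step 101: one data splitting condition per document type
# (here simply the data attribute information to look for).
SPLIT_CONDITIONS = {
    "text": ("html_tag", "preset_character", "preset_punctuation", "preset_text_style"),
    "table": ("preset_table_position",),
    "flowchart": ("flowchart_node_position",),
}

# Step 101: one structured processing condition per document type
# (here a correspondence between split target data and structured data fields).
STRUCTURING_CONDITIONS = {
    "text": {"first": "title", "second": "introduction"},
    "table": {"first": "dynamic_navigation"},
    "flowchart": {"first": "node"},
}


def determine_conditions(filename):
    """Step 102: pick the target conditions for the document's type."""
    suffix = Path(filename).suffix.lower()
    doc_type = SUFFIX_TO_TYPE.get(suffix)
    if doc_type is None:
        raise ValueError(f"unsupported document type: {suffix!r}")
    return SPLIT_CONDITIONS[doc_type], STRUCTURING_CONDITIONS[doc_type]
```

Keeping both conditions in tables keyed by document type means that supporting a new document type only requires adding entries, without touching the splitting or mapping logic.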

Step 103: identifying data attribute information included in the document to be processed based on the target data splitting condition, so as to split from the document to be processed one or more pieces of target data corresponding to the data attribute information.

The target data splitting condition determined based on the document type includes the data attribute information of the target data to be split. Specifically, the data attribute information in the document to be processed may first be identified, and the text content corresponding to the data attribute information may then be split into first target data. For example, when the document to be processed is a text document (for example, with a file suffix of doc, docx, or pdf), the data attribute information specified in the target data splitting condition may include one or more of the following: hypertext markup language tags, preset characters, preset punctuation, and preset text styles. When the document to be processed is a table document (for example, with a file suffix of csv or xls), the data attribute information may be a preset table position, in which case the target data to be identified and split may be the text content at the preset table position (for example, the header of a preset row or column). When the document to be processed is a flow chart document (for example, with a file suffix of vsdx), the data attribute information may be a flow chart node position, and the target data to be identified and split may be the text content at that flow chart node position.
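For the table-document case, a minimal sketch of splitting the text content at a preset table position is given below. It assumes the table is available as CSV text and that the preset position is a header row; the function name `split_table_headers` and the `header_row` parameter are hypothetical.

```python
import csv
import io


def split_table_headers(csv_text, header_row=0):
    """Return the non-empty cells of the preset header row as target data."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if header_row >= len(rows):
        return []  # the preset position does not exist in this table
    return [cell.strip() for cell in rows[header_row] if cell.strip()]
```

A preset column could be handled the same way by collecting `row[column_index]` from each row instead of reading a single row.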

It should be understood that the data attribute information may take further forms depending on the document type and the document content, and the specific content and form of the data attribute information are not specifically limited in the present application. In an embodiment, the data attribute information may include a tag preset by a user, so that when the tag preset by the user is identified, the data content corresponding to the tag is taken as the target data to be split. In another embodiment, the data attribute information may further include a preset entity category, where a preset entity category is a set of pre-classified knowledge content. For example, in a bank intelligent interaction scenario, the knowledge content related to credit card transactions may be pre-classified into one entity category, so that when data content belonging to the preset entity category is found in the document to be processed, that data content may be split. In another embodiment, the data attribute information may further include a preset data digest. In this case, a data digest is generated for a portion of the data content of the document to be processed (for example, by means of semantic analysis), and when the generated data digest matches the preset data digest (for example, a similarity between the generated data digest and the preset data digest is calculated, and the two are considered to match when the similarity is greater than a threshold), that portion of the data content may be split. In another embodiment, the data attribute information may also include a picture attribute; that is, a picture file in the document to be processed may be split and then mapped to a structured data field in the subsequent structuring process. When the structured document formed in this way is displayed to a user, since the structured data field in the structured document corresponds directly to the picture file, the user can extract the picture file directly from the structured document for subsequent processing.
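The threshold comparison in the preset data digest embodiment can be sketched as follows. The application does not specify a similarity measure, so this sketch swaps in the character-level ratio from Python's standard `difflib` as a stand-in; generating the digest by semantic analysis is not shown, and the function name and the default threshold of 0.8 are assumptions.

```python
from difflib import SequenceMatcher


def matches_preset_digest(generated_digest, preset_digests, threshold=0.8):
    """Return the first preset digest whose similarity exceeds the threshold, else None."""
    for preset in preset_digests:
        # Stand-in similarity: ratio of matching characters, in the range [0, 1].
        similarity = SequenceMatcher(None, generated_digest.lower(), preset.lower()).ratio()
        if similarity > threshold:
            return preset
    return None
```

When the function returns a preset digest, the portion of data content from which the digest was generated would be the one to split.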

In an embodiment of the present application, in order to improve the splitting efficiency of the target data, a data attribute identification model for identifying data attribute information may be established through a pre-training process. Specifically, a neural network model is trained using document samples labeled with data attribute information: the recognition result output by the neural network model is compared with the data attribute information label to calculate a loss, and the network parameters of the neural network model are adjusted according to the loss. When the training process reaches the required accuracy, the neural network model serves as the data attribute identification model, which identifies data attribute information from an input document to be processed. In a further embodiment, different data attribute identification models may be established for different types of documents to be processed, so that the data attribute information can be obtained simply by inputting the document to be processed into the data attribute identification model corresponding to its document type. In one embodiment, the data attribute identification model may also be trained to complete the splitting of the target data after completing the identification of the data attribute information. In this case, the output of the data attribute identification model is the split target data.
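The training loop described above (compare the output with the label, calculate a loss, adjust the parameters) can be illustrated with a deliberately small stand-in. The sketch below is not a neural network of the kind the application contemplates: it is a single-layer logistic classifier trained by full-batch gradient descent, and the two input features (whether a text span is bold, whether it ends with a question mark), the toy samples, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability that a text span carries the labeled data attribute."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def mean_loss(weights, bias, samples):
    """Mean cross-entropy between the predictions and the attribute labels."""
    total = 0.0
    for features, label in samples:
        p = min(max(predict(weights, bias, features), 1e-12), 1.0 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(samples)


def train(samples, epochs=2000, lr=0.5):
    """Full-batch gradient descent: adjust the parameters according to the loss."""
    n_features = len(samples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in samples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


# Hypothetical labeled samples: (is_bold, ends_with_question_mark) -> is a title/question.
SAMPLES = [((1, 0), 1), ((0, 1), 1), ((1, 1), 1), ((0, 0), 0)]
```

In practice the stopping criterion would be the accuracy requirement mentioned above rather than a fixed number of epochs, and a separate model could be trained per document type.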

Step 104: each target data is mapped to a corresponding structured data field based on the target structured processing conditions.

The structured processing condition determined based on the document type includes a correspondence between the split target data and the structured data field. As previously mentioned, the form of document content is different for different document types, and the specific form of the structured data fields may also be limited by the form of the document content. For example, when the document to be processed is a text document, since the document contents of the text document are all in a text format, the split target data may correspond to structured data fields corresponding to different text contents, such as a question field, a title field, an introduction field, an answer field, and the like. When the document to be processed is a table document, the content in the table is also in a text format, and the split target data can also correspond to structured data fields of different text contents, such as different dynamic navigation fields. When the document to be processed is a flow chart document, because the data contents in the flow chart are often related, the split target data can also correspond to a node field in another knowledge graph, and the structure of the knowledge graph can be different from that of the original flow chart document.

It should be understood that the specific corresponding rule in the target structured processing condition and the specific content and form of the structured data field can be adjusted by the developer according to the specific requirements of different application scenarios, and the application does not strictly limit the target structured processing condition and the specific content of the structured data field.
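Step 104 can be sketched as a lookup in the correspondence held by the target structured processing condition. In this minimal illustration the condition is assumed to be a dictionary from a target data kind (such as "first" or "second") to a structured data field name; the function name, the record layout, and the decision to skip target data without a defined correspondence are all assumptions.

```python
def map_to_fields(targets, structuring_condition):
    """Map each (kind, text) target to the structured data field defined for its kind."""
    records = []
    for kind, text in targets:
        field = structuring_condition.get(kind)
        if field is None:
            continue  # assumption: target data with no defined correspondence is skipped
        records.append({"field": field, "value": text})
    return records
```

For a text document, a condition such as `{"first": "question", "second": "answer"}` would turn the split targets into question and answer records, while a flow chart document might use `{"first": "node"}`.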

Therefore, in the document structuring method provided by the embodiment of the present application, the structuring process is divided into a data splitting process, which identifies target data based on data attribute information, and a structuring processing process, which maps the target data to structured data fields; and different data splitting conditions corresponding to different document types and different structured processing conditions corresponding to different document types are preset. Throughout the structuring process, the preset data splitting condition and structured processing condition can be determined based on the document type, and the data splitting process and the structuring processing process can be carried out automatically based on preset rules without manual participation, so the efficiency of structuring can be greatly improved. Meanwhile, because differentiated structuring strategies can be applied in a targeted manner to different document types, the limitation of manual experience is removed, the precision of knowledge extraction and the coverage of knowledge extraction across different document types can be effectively ensured, the resulting structured document can effectively support the requirements of various intelligent interaction scenarios, and the human-computer interaction experience can be significantly improved.

Fig. 2 is a schematic flowchart illustrating target data splitting in a document structuring method according to an embodiment of the present application. As shown in fig. 2, the target data splitting process may specifically include the following steps:

Step 201: identifying data attribute information in the document to be processed.

As described above, when the target data splitting condition is determined based on the document type of the document to be processed, it is possible to identify which data attribute information is included in the document content of the document to be processed.

Step 202: splitting the text content corresponding to the data attribute information into first target data.

Because the documents to be processed differ in document type, the data attribute information specified in the target data splitting condition may also differ, as may the correspondence between the split target data and the structured data fields specified in the target structured processing condition. For example, when the document to be processed is a text document, the text document may be copied and pasted into a rich text editor, and the text content corresponding to a hypertext markup language tag that marks title content may be split into first target data; or the text content in a bold text style may be split into first target data; or the text content enclosed in brackets may be split into first target data; or a text paragraph at the beginning of the text content whose data digest, obtained by semantic analysis, is "caliber" or "question" may be split into first target data. The first target data may correspond, in the subsequent structuring process, to a structured data field whose content is a title or a question.

Step 203: splitting the text content between two adjacent first target data into second target data.

When the first target data corresponds to the structured data field of which the content is a title or a question, the content between the first target data is the explanation of the title or the answer to the question, so that the text content between two adjacent first target data can be split into the second target data. This allows the second target data to be mapped to either the introduction field corresponding to the title field or the answer field corresponding to the question field during subsequent structuring processes.

It can be seen that, in this splitting process, the second target data is split not by referring to the data attribute information directly but by referring to the splitting result of the first target data. In this way, the splitting of the second target data is associated with the splitting of the first target data, so that the two can be mapped to structured data fields having a corresponding association relationship in the subsequent structuring process, enabling knowledge extraction with higher precision. Moreover, the data content between the first target data is also covered by the structuring process, so the coverage of knowledge extraction is higher.
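Steps 201 to 203 can be sketched for a text document whose content is available as HTML (for example, after being pasted into a rich text editor). The sketch assumes heading tags and bold tags are the data attribute information that marks first target data, and it treats the text after the last first target data, up to the end of the document, as its second target data; both choices, and the function name, are assumptions rather than details given in the application.

```python
import re

# Assumed data attribute information: heading tags and bold tags.
FIRST_TARGET = re.compile(r"<(h[1-6]|b|strong)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def split_first_and_second(html):
    """Return (first target data, second target data) pairs in document order."""
    matches = list(FIRST_TARGET.finditer(html))
    pairs = []
    for i, match in enumerate(matches):
        # Second target data runs up to the next first target data (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        first = ANY_TAG.sub("", match.group(2)).strip()
        second = ANY_TAG.sub("", html[match.end():end]).strip()
        pairs.append((first, second))
    return pairs
```

Because each second target data is delimited by its neighbouring first target data, every returned pair is already associated, which is what allows a question field to be linked to its answer field later on.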

It should be understood that although the first target data is mapped to the question/title field and the second target data is mapped to the answer/introduction field in the above embodiment, in other embodiments of the present application, the data content of the corresponding answer/introduction field may be split as the first target data, that is, the data content of the corresponding answer/introduction field may be split based on the data attribute information. The first target data may be split and then correspond to a question/title field or an answer/introduction field, which is not limited in the present application.

In an embodiment of the present application, as shown in fig. 3, after the structuring of the document to be processed is completed, the target data corresponding to different structured data fields may also be marked in different formats in the document to be processed (step 301). For example, different structured data fields can be marked as key-value pairs or in different colors so that knowledge management personnel can review them again and make adjustments and modifications more intuitively, further improving the precision and efficiency of the structuring process.

In one embodiment of the present application, as shown in FIG. 3, in addition to mapping each target data to a corresponding structured data field based on the target structured processing conditions, the user-selected target data may also be mapped to the user-selected structured data field (step 302). For example, when the document to be processed is a text document, the document to be processed may be copied and pasted into a rich text editor, and a knowledge manager may select a text field in the document to be processed through the rich text editor and click on a structured data field, thereby mapping the text field to the clicked structured data field. In this way, the knowledge management personnel can select the target data to be mapped to the specific structured data field in a manual mode, and the precision and the efficiency of the structured processing can be further improved.

After the target data is mapped to the structured data fields, the target data can also be stored in a structured form as a structured document. In one embodiment of the present application, as shown in FIG. 3, the user may choose to store the target data in a data structure consistent with that of the document to be processed (step 303). Since the target data has already been split into different data units, a structured document stored in this way supports full-text retrieval of the target data. In another embodiment of the present application, as shown in FIG. 3, the user may also generate a knowledge graph based on the relationships between the structured data fields and store the target data based on the data structure of the knowledge graph (step 304). Since the structured document thus formed embodies the relationships between the target data, semantic retrieval can be realized; for example, the target data corresponding to an answer field can be retrieved based on the target data corresponding to the associated question field. The specific structure of the knowledge graph can be adjusted according to the specific content of the document to be processed. For example, when the document to be processed is a text document, the knowledge graph formed after the structuring processing can be stored according to the data structures "article name + directory navigation", "primary title + secondary title/introduction", "secondary title + standard question/introduction", and "standard question + answer"; when the document to be processed is a table document, the knowledge graph formed after the structuring processing can be stored according to the data structure "information table name + table header field"; and when the document to be processed is a flow chart document, the knowledge graph formed after the structuring processing can be stored according to the data structure "application scenario name + flow node".

In an embodiment of the present application, the user may also improve the precision and efficiency of the structuring process through preprocessing. As shown in fig. 3, before the target data is split from the document to be processed based on the data attribute information included in that document, an original document may be preprocessed based on an operation instruction of the user to obtain the document to be processed (step 305). The preprocessing may include one or a combination of the following operations: setting text paragraphs, adjusting text styles, and setting tags. The data attribute information in the document to be processed obtained after such preprocessing is easier to identify, so the precision and efficiency of the structuring process can be further improved. However, it should be understood that the specific content of the preprocessing operation may also be adjusted according to the specific content of the target data splitting condition and the target structured processing condition in the actual application scenario, and the specific content of the preprocessing operation is not strictly limited in this application.
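The knowledge-graph storage of step 304 and the question-to-answer retrieval it enables can be sketched as follows. This is a minimal in-memory illustration that stores edges keyed by (head, relation); the class, the relation names, and the input layout of `build_text_graph` are hypothetical and only loosely follow the "primary title", "standard question", and "answer" structures named above.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal store of (head, relation) -> list of tails."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, head, relation, tail):
        self._edges[(head, relation)].append(tail)

    def query(self, head, relation):
        return list(self._edges.get((head, relation), []))


def build_text_graph(article_name, sections):
    """sections: list of (title, [(standard question, answer), ...])."""
    graph = KnowledgeGraph()
    for title, qa_pairs in sections:
        graph.add(article_name, "primary_title", title)
        for question, answer in qa_pairs:
            graph.add(title, "standard_question", question)
            graph.add(question, "answer", answer)
    return graph
```

Retrieving an answer then amounts to `graph.query(question, "answer")`, which follows the stored relationship between the question field and the answer field rather than scanning the full text.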

Fig. 4 is a schematic structural diagram of a document structuring apparatus according to an embodiment of the present application. As shown in fig. 4, the document structuring apparatus 40 includes: a condition acquisition module 401, a condition determining module 402, a splitting module 403, and a structuring processing module 404. Specifically, the condition acquisition module 401 is configured to acquire a plurality of data splitting conditions and a plurality of structured processing conditions, where different data splitting conditions correspond to different document types, different structured processing conditions correspond to different document types, the data splitting conditions include data attribute information of target data to be split, and the structured processing conditions include correspondences between the split target data and structured data fields. The condition determining module 402 is configured to determine the corresponding target data splitting condition and target structured processing condition according to the document type of the document to be processed. The splitting module 403 is configured to identify data attribute information included in the document to be processed based on the target data splitting condition, so as to split from the document to be processed one or more pieces of target data corresponding to the data attribute information. The structuring processing module 404 is configured to map each piece of target data to a corresponding structured data field based on the target structured processing condition.
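The cooperation of the four modules can be sketched as one small class. This is an illustrative arrangement only: the class and method names are hypothetical, the data splitting condition is reduced to a single preset character string that marks first target data, and lines are assumed to be supplied already extracted from the document.

```python
class DocumentStructuringDevice:
    def __init__(self, split_conditions, structuring_conditions):
        # Condition acquisition module (401): holds the per-type conditions.
        self.split_conditions = split_conditions
        self.structuring_conditions = structuring_conditions

    def determine(self, doc_type):
        # Condition determining module (402): select the target conditions.
        return self.split_conditions[doc_type], self.structuring_conditions[doc_type]

    def split(self, lines, marker):
        # Splitting module (403): a line starting with the preset characters
        # `marker` is first target data; other non-empty lines are second.
        targets = []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if line.startswith(marker):
                targets.append(("first", line[len(marker):].strip()))
            else:
                targets.append(("second", line))
        return targets

    def structure(self, targets, field_map):
        # Structuring processing module (404): map target data to fields.
        return [(field_map[kind], text) for kind, text in targets]

    def process(self, lines, doc_type):
        marker, field_map = self.determine(doc_type)
        return self.structure(self.split(lines, marker), field_map)
```

For instance, a device constructed with `{"text": "Q:"}` and `{"text": {"first": "question", "second": "answer"}}` would turn the lines `"Q: What is the fee?"` and `"No annual fee."` into a question entry followed by an answer entry.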

In an embodiment of the present application, as shown in fig. 5, the splitting module 403 includes: an identifying unit 4031 configured to identify data attribute information in a document to be processed; and a splitting unit 4032 configured to split the text content corresponding to the data attribute information into first target data.

In an embodiment of the present application, as shown in fig. 5, the structuring processing module 404 is further configured to: when the document to be processed is of a text document type, map the first target data to a question field or a title field; when the document to be processed is of a form document type, map the first target data to a dynamic navigation field; and when the document to be processed is of a flow chart document type, map the first target data to a node field.

In an embodiment of the present application, the splitting unit 4032 is further configured to split the text content between two adjacent first target data into second target data.

In an embodiment of the present application, the structuring processing module 404 is further configured to map the second target data to an answer field or an introduction field when the document to be processed is of a text document type.
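For a text document, the splitting and mapping described in the preceding embodiments could look like the sketch below. It assumes, purely for illustration, that the data attribute information is a preset character: a line starting with `#` is treated as first target data, and the text between two adjacent first target data becomes second target data. The function name `split_text_document` and the `question` and `answer` dictionary keys are hypothetical.

```python
from typing import Dict, List, Optional


def split_text_document(text: str, marker: str = "#") -> List[Dict[str, str]]:
    """Split a text document into question/answer records.

    Lines starting with `marker` (an assumed preset character) are first
    target data; the text up to the next such line is second target data.
    """
    records: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None  # None before the first heading
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(marker):
            # First target data: map to the question field.
            current = {"question": stripped.lstrip(marker).strip(), "answer": ""}
            records.append(current)
        elif current is not None and stripped:
            # Second target data: accumulate into the answer field.
            separator = " " if current["answer"] else ""
            current["answer"] += separator + stripped
    return records
```

Text that appears before the first marked line is ignored in this sketch, since it does not lie between two first target data.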

In an embodiment of the present application, the splitting module 403 is further configured to input the document to be processed into a data attribute recognition model corresponding to the document type of the document to be processed, so as to obtain the data attribute information output by the data attribute recognition model, wherein the data attribute recognition model is established through a pre-learning training process.

In an embodiment of the present application, the data attribute information includes one or more of the following items: hypertext markup language (HTML) tags, preset characters, preset punctuation, preset text styles, preset table positions, flow chart node positions, labels preset by a user, preset entity categories, and preset data summaries.
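As one example of identifying data attribute information, the sketch below uses HTML tags: it collects the text inside heading tags, which could then be split out as first target data. It relies only on `html.parser` from the Python standard library. The choice of `h1` to `h3` as the relevant tags and the names `HeadingFinder` and `find_headings` are assumptions for illustration, not part of the application.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class HeadingFinder(HTMLParser):
    """Collect the text inside heading tags (assumed to mark first target data)."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self) -> None:
        super().__init__()
        self._open_tag: Optional[str] = None  # heading tag currently being read
        self._buffer: List[str] = []
        self.headings: List[Tuple[str, str]] = []  # (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears while a heading tag is open.
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.headings.append((tag, "".join(self._buffer).strip()))
            self._open_tag = None


def find_headings(html: str) -> List[Tuple[str, str]]:
    finder = HeadingFinder()
    finder.feed(html)
    finder.close()
    return finder.headings
```

Other kinds of data attribute information listed above, such as preset punctuation or text styles, would need their own detectors; this sketch covers the HTML tag case only.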

In an embodiment of the present application, as shown in fig. 5, the document structuring device 40 further includes:

an identifying module 405 configured to identify target data corresponding to different structured data fields in the document to be processed into different formats.

In an embodiment of the present application, the structured processing module 404 is further configured to map the user-selected target data to the user-selected structured data fields.

In an embodiment of the present application, the document structuring device 40 further includes a storage module 406 configured to store the target data in a data structure consistent with the document to be processed, or to generate a knowledge graph according to the relation between the structured data fields and store the target data based on the data structure of the knowledge graph.

In an embodiment of the present application, the document structuring device 40 further includes a preprocessing module 407 configured to preprocess an original document based on an operation instruction of a user to obtain the document to be processed, before the target data is split from the document to be processed based on the data attribute information included in the document to be processed. The preprocessing includes one or more of the following operations: setting text paragraphs, adjusting text styles, and setting labels.
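One possible reading of the knowledge graph storage option of the storage module 406 is sketched below: each relation between structured data fields becomes an edge label, and the target data mapped to those fields become the nodes. The adjacency-list representation, the relation name `has_answer`, and the function `build_graph` are assumptions for illustration; the application does not specify a graph schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_graph(
    records: List[Dict[str, str]],
    relations: List[Tuple[str, str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Store target data as an adjacency list keyed by the source node.

    `relations` lists (source_field, relation_name, target_field) entries
    describing how structured data fields relate to each other.
    """
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for record in records:
        for source_field, relation, target_field in relations:
            # Only link fields that are both present in this record.
            if source_field in record and target_field in record:
                graph[record[source_field]].append((relation, record[target_field]))
    return dict(graph)
```

For instance, a record whose question field holds "Where is the invoice?" and whose answer field holds "Under billing." would yield a single `has_answer` edge from the question node to the answer node.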

The detailed functions and operations of the respective modules in the above-described document structuring device 40 have been described in detail in the document structuring method described above with reference to fig. 1 to 3, and therefore, a repetitive description thereof will be omitted herein.

It should be noted that the document structuring device 40 according to the embodiment of the present application may be integrated into the electronic device 60 as a software module and/or a hardware module; in other words, the electronic device 60 may include the document structuring device 40. For example, the document structuring device 40 may be a software module in the operating system of the electronic device 60, or may be an application developed for it; of course, the document structuring device 40 could equally be one of many hardware modules of the electronic device 60.

In another embodiment of the present application, the document structuring device 40 and the electronic device 60 may also be separate devices (e.g., servers), and the document structuring device 40 may be connected to the electronic device 60 via a wired and/or wireless network and transmit the interaction information according to an agreed data format.

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device 60 includes: one or more processors 601 and memory 602; and computer program instructions stored in the memory 602 which, when executed by the processor 601, cause the processor 601 to perform a document structuring method as in any of the embodiments described above.

The processor 601 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.

The memory 602 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory, or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer-readable storage medium and executed by the processor 601 to implement the steps in the document structuring method of the various embodiments of the application above and/or other desired functions. Information such as the data splitting conditions, the structured processing conditions, and the split target data may also be stored in the computer-readable storage medium.

In one example, the electronic device 60 may further include: an input device 603 and an output device 604, which are interconnected by a bus system and/or other form of connection mechanism (not shown in fig. 6).

For example, when the electronic device is a stand-alone device, the input device 603 may be a communication network connector for receiving an acquired input signal from an external removable device. The input device 603 may also include, for example, a keyboard, a mouse, a microphone, and so forth.

The output device 604 may output various information to the outside, and may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.

Of course, for the sake of simplicity, only some of the components of the electronic device 60 that are relevant to the present application are shown in fig. 6, and components such as a bus and input/output interfaces are omitted. In addition, the electronic device 60 may include any other suitable components depending on the particular application.

In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the document structuring method of any of the above-described embodiments.

The computer program product may include program code for carrying out operations for embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the document structuring method of the various embodiments of the present application.

A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing describes the general principles of the present application in conjunction with specific embodiments. However, it is noted that the advantages, effects, and the like mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the specific details disclosed above are provided for the purpose of illustration and ease of understanding only, and are not intended to be exhaustive or to limit the application to the precise details disclosed.

The block diagrams of devices, apparatuses, and systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".

It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modifications, equivalents and the like that are within the spirit and principle of the present application should be included in the scope of the present application.

Claims

1. A method for document structuring, comprising:

acquiring a plurality of data splitting conditions and a plurality of structured processing conditions, wherein different data splitting conditions correspond to different document types, different structured processing conditions correspond to different document types, the data splitting conditions comprise data attribute information of target data to be split, and the structured processing conditions comprise corresponding relations between the split target data and structured data fields;

respectively determining corresponding target data splitting conditions and target structured processing conditions according to the document type of the document to be processed;

identifying data attribute information included in the document to be processed based on the target data splitting condition so as to split one or more target data corresponding to the data attribute information from the document to be processed; and

mapping each of the target data to a corresponding structured data field based on the target structured processing condition.

2. The method of claim 1, wherein the identifying the data attribute information included in the document to be processed based on the target data splitting condition to split one or more target data corresponding to the data attribute information from the document to be processed comprises:

identifying the data attribute information in the document to be processed; and

splitting the text content corresponding to the data attribute information into first target data.

3. The method according to claim 2, wherein the mapping the split target data to corresponding structured data fields based on the target structured processing condition comprises:

when the document to be processed is of a text document type, mapping the first target data to a question field or a title field; when the document to be processed is of a form document type, mapping the first target data to a dynamic navigation field; and when the document to be processed is of a flow chart document type, mapping the first target data to a node field.

4. The method of claim 2, wherein the identifying the data attribute information included in the document to be processed based on the target data splitting condition to split one or more target data corresponding to the data attribute information from the document to be processed further comprises:

splitting the text content between two adjacent first target data into second target data.

5. The method of claim 4, wherein mapping each of the target data to a corresponding structured data field based on the target structured processing condition comprises:

when the document to be processed is of a text document type, mapping the second target data to an answer field or an introduction field.

6. The method of any of claims 1 to 5, further comprising:

identifying the target data corresponding to different structured data fields in the document to be processed in different formats.

7. The method of any of claims 1 to 5, further comprising:

storing the target data by adopting a data structure consistent with the document to be processed; or

generating a knowledge graph according to the relation between the structured data fields, and storing the target data based on the data structure of the knowledge graph.

8. A document structuring apparatus, comprising:

a condition acquisition module configured to acquire a plurality of data splitting conditions and a plurality of structured processing conditions, wherein different data splitting conditions correspond to different document types, different structured processing conditions correspond to different document types, the data splitting conditions comprise data attribute information of target data to be split, and the structured processing conditions comprise corresponding relations between the split target data and structured data fields;

a condition determining module configured to respectively determine corresponding target data splitting conditions and target structured processing conditions according to the document type of the document to be processed;

a splitting module configured to identify data attribute information included in the document to be processed based on the target data splitting condition so as to split one or more target data corresponding to the data attribute information from the document to be processed; and

a structured processing module configured to map each of the target data to a corresponding structured data field based on the target structured processing condition.

9. An electronic device, comprising:

a processor; and

a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform a document structuring method according to any one of claims 1-5.

10. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a document structuring method according to any one of claims 1-5.