CN117786426A - Text structuring processing method and device, readable storage medium and terminal equipment - Google Patents

Publication number
CN117786426A
Authority
CN
China
Prior art keywords
text
target
template
matched
structured
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410079966.2A
Other languages
Chinese (zh)
Inventor
谢安庆
颜艳桃
王飞虎
周泽
赵占胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhonghe Nongxin Agricultural Group Co ltd
Original Assignee
Zhonghe Nongxin Agricultural Group Co ltd
Application filed by Zhonghe Nongxin Agricultural Group Co ltd
Priority to CN202410079966.2A
Publication of CN117786426A

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The application belongs to the technical field of text processing, and particularly relates to a text structuring processing method, a device, a computer readable storage medium and terminal equipment. The method comprises: obtaining a target text to be processed and extracting target text information of the target text; performing template matching on the target text information by using each preset text template; and, if a target matching template corresponding to the target text information is matched, carrying out text structuring processing on the target text according to the target matching template to obtain a target structured text. Because the structuring is driven by the matched template, the target text can be structured efficiently and accurately, and the user experience can be improved.

Description

Text structuring processing method and device, readable storage medium and terminal equipment
Technical Field
The application belongs to the technical field of text processing, and particularly relates to a text structuring processing method, a device, a computer readable storage medium and terminal equipment.
Background
Text structuring refers to the process of converting unstructured or semi-structured text into text with a well-defined structure and semantics, making it easier to understand, analyze and utilize. However, when processing text data of non-form type, conventional text structuring processing methods are generally configured with regular expressions, and their output efficiency and precision are not high, so an efficient and accurate text structuring method is needed.
Disclosure of Invention
In view of this, embodiments of the present application provide a text structuring processing method, apparatus, computer readable storage medium, and terminal device, so as to solve the problem that the output efficiency and accuracy of the text structuring processing method in the prior art are not high.
A first aspect of an embodiment of the present application provides a text structuring processing method, which may include:
acquiring a target text to be processed, and extracting target text information of the target text;
performing template matching on the target text information by using each preset text template;
and if the target text information is matched with the target matching template corresponding to the target text information, carrying out text structuring processing on the target text according to the target matching template to obtain a target structured text.
In a specific implementation manner of the first aspect, the target text information may include respective text coordinates and respective text titles of the target text;
the performing template matching on the target text information by using preset text templates may include:
acquiring each text coordinate and each text title of a target template to be matched; the target template to be matched is any unmatched text template in each text template;
if the text coordinates of the target text are the same as the text coordinates of the target template to be matched and the text titles of the target text are the same as the text titles of the target template to be matched, determining the target template to be matched as the target matching template;
and if the text coordinates of the target text are different from the text coordinates of the target template to be matched or the text titles of the target text are different from the text titles of the target template to be matched, returning to the step of acquiring the text coordinates and the text titles of the target template to be matched and the subsequent steps.
In a specific implementation manner of the first aspect, if the target matching template corresponding to the target text information is matched, performing text structuring processing on the target text according to the target matching template to obtain a target structured text, which may include:
acquiring each line spacing and each format conversion rule of the target matching template;
according to each text coordinate and each text title of the target matching template, carrying out data classification on the target text to obtain each column of text;
carrying out same-line text merging on the target column text according to the line spacing of the target matching template to obtain each same-line text of the target column text; the target column text is any column text which is not subjected to the same-line text combination in all column texts;
according to each text coordinate and each column text of the target text, carrying out data line division on the target text to obtain each line text;
and carrying out format conversion on the target text according to each format conversion rule of the target matching template to obtain the target structured text.
In a specific implementation manner of the first aspect, after the matching to the target matching template corresponding to the target text information, performing text structuring processing on the target text according to the target matching template to obtain a target structured text, the method may further include:
and carrying out data verification on the target structured text to obtain a data verification result corresponding to the target structured text.
In a specific implementation manner of the first aspect, after the matching to the target matching template corresponding to the target text information, performing text structuring processing on the target text according to the target matching template to obtain a target structured text, the method may further include:
mapping the target structured text according to a preset title field mapping relation; and the title field mapping relation is the mapping relation between the text title of the target structured text and the mapped field.
In a specific implementation manner of the first aspect, after the matching to the target matching template corresponding to the target text information, performing text structuring processing on the target text according to the target matching template to obtain a target structured text, the method may further include:
and displaying the text of the target structured text according to a preset display mode.
In a specific implementation manner of the first aspect, the method may further include:
and if the target matching template corresponding to the target text information is not matched, performing exception handling according to a preset exception handling mode.
A second aspect of an embodiment of the present application provides a text structuring processing device, which may include:
the target text information extraction module is used for acquiring target text to be processed and extracting target text information of the target text;
the text template matching module is used for performing template matching on the target text information by utilizing each preset text template;
and the text structuring processing module is used for carrying out text structuring processing on the target text according to the target matching template if the target matching template corresponding to the target text information is matched, so as to obtain the target structured text.
In a specific implementation manner of the second aspect, the target text information may include respective text coordinates and respective text titles of the target text;
the text template matching module may include:
the coordinate and title acquisition sub-module is used for acquiring each text coordinate and each text title of the target template to be matched; the target template to be matched is any unmatched text template in each text template;
the target matching template determining sub-module is used for determining the target template to be matched as the target matching template if each text coordinate of the target text is the same as each text coordinate of the target template to be matched and each text title of the target text is the same as each text title of the target template to be matched;
and the return execution sub-module is used for returning to the step of acquiring the text coordinates and the text titles of the target template to be matched and the subsequent steps if the text coordinates of the target text are different from the text coordinates of the target template to be matched or the text titles of the target text are different from the text titles of the target template to be matched.
In a specific implementation manner of the second aspect, the text structuring processing module may include:
the line spacing and rule submodule is used for acquiring each line spacing and each format conversion rule of the target matching template;
the data classification sub-module is used for classifying the data of the target text according to each text coordinate and each text title of the target matching template to obtain each column of text;
the same-line text merging sub-module is used for merging the same-line texts of the target column texts according to the line spacing of the target matching template to obtain each same-line text of the target column texts; the target column text is any column text which is not subjected to the same-line text combination in all column texts;
the data line dividing sub-module is used for carrying out data line dividing on the target text according to each text coordinate and each column text of the target text to obtain each line text;
and the format conversion sub-module is used for carrying out format conversion on the target text according to each format conversion rule of the target matching template to obtain the target structured text.
In a specific implementation manner of the second aspect, the text structuring processing device may further include:
and the data verification module is used for carrying out data verification on the target structured text to obtain a data verification result corresponding to the target structured text.
In a specific implementation manner of the second aspect, the text structuring processing device may further include:
the text mapping module is used for mapping the target structured text according to a preset title field mapping relation; and the title field mapping relation is the mapping relation between the text title of the target structured text and the mapped field.
In a specific implementation manner of the second aspect, the text structuring processing device may further include:
and the text display module is used for displaying the text of the target structured text according to a preset display mode.
In a specific implementation manner of the second aspect, the text structuring processing device may further include:
and the exception handling module is used for performing exception handling according to a preset exception handling mode if the target matching template corresponding to the target text information is not matched.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the above-described text structuring processing methods.
A fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the above text structuring processing methods when the computer program is executed.
A fifth aspect of the embodiments of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to perform the steps of any one of the text structuring processing methods described above.
Compared with the prior art, the embodiment of the application has the following beneficial effects: the target text to be processed is obtained, and target text information of the target text is extracted; template matching is performed on the target text information by using each preset text template; and, if a target matching template corresponding to the target text information is matched, text structuring processing is carried out on the target text according to the target matching template to obtain a target structured text. Because the structuring is driven by the matched template, the target text can be structured efficiently and accurately, and the user experience can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of one embodiment of a method for text structuring processing in the embodiments of the present application;
FIG. 2 is a schematic diagram of a target text;
FIG. 3 is a schematic diagram of various columns of text;
FIG. 4 is a schematic diagram of respective inline texts of a target column text;
FIG. 5 is a schematic diagram of various lines of text;
FIG. 6 is a schematic diagram of format conversion of target text;
FIG. 7 is a block diagram of one embodiment of a text structuring device in accordance with the embodiments of the present application;
FIG. 8 is a schematic block diagram of a terminal device in an embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the embodiments described below are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted in context as "when" or "upon" or "in response to a determination" or "in response to detection". Similarly, the phrase "if a determination" or "if a [described condition or event] is detected" may be interpreted in context as meaning "upon determination" or "in response to determination" or "upon detection of a [described condition or event]" or "in response to detection of a [described condition or event]".
In addition, in the description of the present application, the terms "first," "second," "third," etc. are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Text structuring refers to the process of converting unstructured or semi-structured text into text with well-defined structures and semantics that make it easier to understand, analyze and utilize.
When a portable document format (PDF) file of a form (table) type is processed, existing text structuring processing methods can read the form line by line and split it by column, and can then structure the form-type PDF text by applying special text handling such as line-break processing. However, when processing text data of non-form type, conventional text structuring processing methods are generally configured with regular expressions, and their output efficiency and precision are not high, so an efficient and accurate text structuring method is needed.
In view of this, embodiments of the present application provide a text structuring processing method, apparatus, computer readable storage medium, and terminal device, so as to solve the problem that the output efficiency and accuracy of the text structuring processing method in the prior art are not high.
It should be noted that the execution body of the method of the present application is a terminal device, which may specifically be a desktop computer, a notebook computer, a palmtop computer, a smart phone, a tablet, or another common computing device.
Referring to fig. 1, an embodiment of a text structuring method in an embodiment of the present application may include:
step S101, acquiring a target text to be processed, and extracting target text information of the target text.
In the embodiment of the present application, the target text to be processed may be a text in PDF format; the target text is acquired so that text structuring processing can be performed on it.
In a specific implementation manner, the target text may be a text stored in a preset storage module of the terminal device, and when the text structuring process is required for the target text, the target text may be read from the storage module.
In another specific implementation manner, the target text may be a text uploaded by the user in real time, and when the text structuring processing is required to be performed on the target text, the user may upload or import the target text into the terminal device, so that the terminal device may obtain the target text.
After the target text is acquired, target text information of the target text can be extracted. In order to save memory and server resources, a preset reading method can be used to extract the target text information from the target text; here, the target text may be read page by page, so that it can be processed while occupying few server resources. In addition, the target text information may also be extracted from the target text using a preset PDF text processing tool, which may include, but is not limited to, PDFBox.
Specifically, the target text information may include respective text coordinates of the target text; here, the text coordinates may include an abscissa and an ordinate of the text, and the text coordinates may be used to represent a specific position of a certain text in the target text.
According to each text coordinate of the target text, a range of text coordinates corresponding to each text title can be obtained. For example, referring to fig. 2, the text titles "name", "age", "gender" and "card number information" in the target text may be extracted, and the range of text coordinates corresponding to the "name", the range of text coordinates corresponding to the "age", the range of text coordinates corresponding to the "gender" and the range of text coordinates corresponding to the "card number information" may be obtained, so that the target text information may be obtained.
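The idea of deriving a coordinate range for each text title can be sketched as follows. This is only an illustrative sketch: the `TextFragment` layout, the `title_ranges` helper, the numeric coordinates, and the assumption that the set of known titles is supplied by the caller are all hypothetical and are not specified in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextFragment:
    """One extracted piece of text with its bounding box (hypothetical layout)."""
    text: str
    x0: float  # left edge (abscissa)
    x1: float  # right edge
    y0: float  # top edge (ordinate)
    y1: float  # bottom edge


def title_ranges(fragments, known_titles):
    """Map each title found in the text to the (x0, x1) range of its fragment."""
    ranges = {}
    for frag in fragments:
        # Keep only the first occurrence of each title (assumption).
        if frag.text in known_titles and frag.text not in ranges:
            ranges[frag.text] = (frag.x0, frag.x1)
    return ranges


fragments = [
    TextFragment("name", 10, 60, 0, 12),
    TextFragment("age", 80, 110, 0, 12),
    TextFragment("Wang Mou", 10, 60, 20, 32),
]
info = title_ranges(fragments, {"name", "age", "gender", "card number information"})
print(info)
```

The resulting mapping from title to coordinate range plays the role of the target text information used in the later matching step.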
Step S102, performing template matching on the target text information by using preset text templates.
In the embodiment of the application, text templates for text structuring processing can be preconfigured, so that text structuring processing can be performed on a variety of different texts. A text template may include text coordinates, text titles, line spacings and format conversion rules, where a line spacing is the spacing between adjacent lines of text and a format conversion rule is a rule for converting text into a specified format.
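One possible in-memory shape for such a template is sketched below. The class name `TextTemplate`, its field names, and the sample values are assumptions made for illustration; the application only states which kinds of information a template may contain.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class TextTemplate:
    """Hypothetical container for the four kinds of template data."""
    name: str
    # Text title -> (min abscissa, max abscissa) of the column under it.
    title_x_ranges: Dict[str, Tuple[float, float]]
    # Largest ordinate difference still treated as the same line.
    line_spacing: float
    # Text title -> function converting the raw string to its target type.
    converters: Dict[str, Callable[[str], object]] = field(default_factory=dict)


template = TextTemplate(
    name="person_card",
    title_x_ranges={"name": (10.0, 60.0), "age": (80.0, 110.0)},
    line_spacing=4.0,
    converters={"age": int},
)
print(template.converters["age"]("20"))
```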
For convenience of description, the template matching process in the present application is described below by taking any one unmatched text template among the text templates (denoted as the target template to be matched) as an example.
Specifically, each text coordinate and each text title of the target template to be matched can be obtained, and if each text coordinate of the target text is the same as each text coordinate of the target template to be matched and each text title of the target text is the same as each text title of the target template to be matched, the target template to be matched can be determined as the target matching template; and if the text coordinates of the target text are different from the text coordinates of the target template to be matched, or the text titles of the target text are different from the text titles of the target template to be matched, returning to the step of acquiring the text coordinates and the text titles of the target template to be matched and the subsequent steps.
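The matching loop described above can be sketched as follows, assuming that both the target text information and each template are reduced to a mapping from text title to abscissa range. The function name `find_matching_template` and the dictionary layout are hypothetical. Exact equality follows the wording of the description; whether a tolerance is applied in practice is not stated.

```python
def find_matching_template(target_info, templates):
    """Try each not-yet-matched template in turn; return the first whose
    titles and coordinates equal those of the target text, else None."""
    for template in templates:
        # Dict equality compares titles (keys) and coordinates (values) together.
        if template["title_x_ranges"] == target_info:
            return template
    return None


templates = [
    {"name": "bank_statement",
     "title_x_ranges": {"trade time": (10.0, 90.0), "balance": (100.0, 160.0)}},
    {"name": "person_card",
     "title_x_ranges": {"name": (10.0, 60.0), "age": (80.0, 110.0)}},
]
target_info = {"name": (10.0, 60.0), "age": (80.0, 110.0)}
matched = find_matching_template(target_info, templates)
print(matched["name"] if matched else "no template matched")
```

Returning `None` when every template has been tried leaves the choice of exception handling to the caller.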
Accordingly, text structuring processing is performed according to the target matching template only when such a template is matched; if the target matching template corresponding to the target text information is matched, step S103 may be performed.
If the target matching template corresponding to the target text information is not matched, exception handling may be performed according to a preset exception handling mode; the exception handling mode may be set according to the actual scenario and needs, which is not limited in this application.
For example, a preset abnormality prompt message can be sent out and an abnormality cause can be recorded, and a user can be informed of text data with possible abnormality in time; for another example, the target text may be ignored, and text structuring may be performed on other text to be processed according to the method described above.
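The two exception handling options mentioned above (issuing a prompt and recording the cause, then skipping the text and continuing with the others) can be combined as in the sketch below. The function `process_all`, the logger name, and the callables `match` and `structure` are hypothetical placeholders, not part of the application.

```python
import logging

logger = logging.getLogger("text_structuring")


def process_all(documents, match, structure):
    """Structure every text that has a template; record the rest as anomalies."""
    results, anomalies = {}, []
    for doc_id, info in documents.items():
        template = match(info)
        if template is None:
            # Prompt message plus a recorded cause for later review.
            logger.warning("no template matched for %s", doc_id)
            anomalies.append((doc_id, "no matching template"))
            continue  # ignore this text and carry on with the others
        results[doc_id] = structure(info, template)
    return results, anomalies


results, anomalies = process_all(
    {"a": 1, "b": 2},
    match=lambda info: "T" if info == 1 else None,
    structure=lambda info, template: (info, template),
)
print(results, anomalies)
```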
Therefore, text structuring processing is performed on the target text only when a target matching template is matched, which reduces the probability of performing text structuring processing on erroneous text and thus reduces the influence of erroneous text on the normal text structuring processing flow.
And step S103, if the target matching template corresponding to the target text information is matched, performing text structuring processing on the target text according to the target matching template to obtain a target structured text.
In the embodiment of the application, each text coordinate, each text title, each line spacing and each format conversion rule of the target matching template can be obtained, and text structuring processing can be performed on the target text according to them.
Specifically, the target text can be subjected to data classification according to each text coordinate and each text title of the target matching template to obtain each column of text; wherein each text title may have a corresponding range of text coordinates. Here, the target text may be classified according to the range of text coordinates corresponding to each text title of the target matching template, so as to obtain each column of text.
For example, referring to fig. 3, if the text abscissa range corresponding to the text title "name" of the target matching template is (x1, x2), the text whose abscissa lies in (x1, x2) in the target text may be grouped into one column to obtain the column text corresponding to the text title "name". Likewise, if the text abscissa range corresponding to the text title "age" is (x3, x4), the text whose abscissa lies in (x3, x4) may be grouped into one column to obtain the column text corresponding to "age"; if the range corresponding to the text title "gender" is (x5, x6), the text whose abscissa lies in (x5, x6) may be grouped into one column to obtain the column text corresponding to "gender"; and if the range corresponding to the text title "card number information" is (x7, x8), the text whose abscissa lies in (x7, x8) may be grouped into one column to obtain the column text corresponding to "card number information".
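A minimal sketch of this column classification is given below. The `(text, x, y)` tuple layout, the function name `classify_columns`, the numeric ranges, the choice to treat range bounds as inclusive, and the choice to drop text that falls in no range are all assumptions for illustration.

```python
def classify_columns(fragments, title_x_ranges):
    """Group (text, x, y) fragments under the title whose x-range contains x."""
    columns = {title: [] for title in title_x_ranges}
    for text, x, y in fragments:
        for title, (x_min, x_max) in title_x_ranges.items():
            if x_min <= x <= x_max:  # bounds treated as inclusive (assumption)
                columns[title].append((text, x, y))
                break  # a fragment belongs to at most one column
    return columns


title_x_ranges = {"name": (10, 60), "age": (80, 110)}
fragments = [
    ("Wang Mou", 12, 20), ("20", 85, 20),
    ("Li Mou", 12, 40), ("22", 85, 40),
]
columns = classify_columns(fragments, title_x_ranges)
print(columns)
```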
Then, same-line text merging is performed, according to the line spacing of the target matching template, on the target text that has been classified into columns. Taking any column text that has not yet undergone same-line text merging (denoted as the target column text) as an example, whether texts in the target column text belong to the same line can be judged according to the line spacing of the target matching template. Specifically, the text ordinates of two texts in the target column text can be subtracted: if the difference is less than or equal to the line spacing of the target matching template, the two texts can be considered to belong to the same line; if the difference is greater than the line spacing, the two texts can be considered not to belong to the same line. Accordingly, same-line text merging can be performed on the target column text to obtain each same-line text of the target column text, and every column text can be traversed in this way to obtain the same-line texts of every column text.
For example, referring to fig. 4, for the column text under the text title "card number information", the difference between the text ordinates of the two texts that make up "A bank-Wang Mou 100001" is smaller than the line spacing of the target matching template, so the two texts can be regarded as same-line text; likewise, the difference between the text ordinates of the two texts that make up "B bank-Li Mou 100021" is smaller than the line spacing of the target matching template, so those two texts can also be regarded as same-line text.
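The same-line merging rule can be sketched as below. Several details are assumptions rather than statements of the application: the `(text, y)` tuple layout, comparing each text only with the previous one after sorting by ordinate, joining merged texts with a space, and the way the sample entry is split into the two fragments "A bank-Wang Mou" and "100001".

```python
def merge_same_line(column, line_spacing):
    """Merge (text, y) items whose ordinate gap to the previous item
    is less than or equal to line_spacing."""
    merged = []
    for text, y in sorted(column, key=lambda item: item[1]):
        if merged and y - merged[-1]["y_max"] <= line_spacing:
            merged[-1]["text"] += " " + text
            merged[-1]["y_max"] = y
        else:
            merged.append({"text": text, "y_min": y, "y_max": y})
    return merged


column = [
    ("A bank-Wang Mou", 20.0), ("100001", 23.0),
    ("B bank-Li Mou", 40.0), ("100021", 43.0),
]
same_line_texts = merge_same_line(column, line_spacing=5.0)
print([item["text"] for item in same_line_texts])
```

Each merged entry keeps the ordinate range it spans, which a later row division step can use.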
Then, data line division is performed on the target text according to each text coordinate and each column text of the target text to obtain each line text. Specifically, for a given same-line text in the reference column, the texts in each related column whose text ordinate range is the same as that of the same-line text can be obtained, yielding one line of text; here, the reference column may be any one column text chosen as the basis for data line division, and the related columns are the column texts other than the reference column.
For example, referring to fig. 5, the column text of the text title "name" may be taken as the reference column, and the column texts of the text titles "age", "gender" and "card number information" may be taken as related columns. Specifically, the text ordinate range of the text "Wang Mou" in the first line under the reference column may be acquired as (y1, y2). For the text title "age", the text "20" whose ordinate range is (y1, y2) can be obtained; for the text title "gender", the text "woman" whose ordinate range is (y1, y2) can be obtained; for the text title "card number information", the text "A bank-Wang Mou 100001" whose ordinate range is (y1, y2) can be obtained. Accordingly, the line text of the first line (excluding the text titles) is obtained. Thereafter, the text ordinate range of the text "Li Mou" in the second line under the reference column may be acquired as (y3, y4). For the text title "age", the text "22" whose ordinate range is (y3, y4) can be obtained; for the text title "gender", the text "man" whose ordinate range is (y3, y4) can be obtained; for the text title "card number information", the text "B bank-Li Mou 100021" whose ordinate range is (y3, y4) can be obtained. In this way, each same-line text under the reference column can be traversed, and for each of them the text lying in the same ordinate range (that is, in the same line) can be obtained from the related columns, so as to obtain each line text.
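The row division by reference column can be sketched as follows. The dictionary layout with `y_min` and `y_max`, the function name `divide_rows`, and the use of ordinate-range overlap (rather than strict equality of ranges) to decide that two texts share a line are assumptions made for this sketch.

```python
def divide_rows(columns, reference):
    """Build one row per reference-column entry by picking, in every
    related column, the entry whose ordinate range overlaps it."""
    rows = []
    for ref in columns[reference]:
        row = {reference: ref["text"]}
        for title, entries in columns.items():
            if title == reference:
                continue
            row[title] = next(
                (e["text"] for e in entries
                 if e["y_min"] <= ref["y_max"] and e["y_max"] >= ref["y_min"]),
                None,  # no text on this line in the related column
            )
        rows.append(row)
    return rows


columns = {
    "name": [{"text": "Wang Mou", "y_min": 20, "y_max": 23},
             {"text": "Li Mou", "y_min": 40, "y_max": 43}],
    "age": [{"text": "20", "y_min": 20, "y_max": 20},
            {"text": "22", "y_min": 40, "y_max": 40}],
    "card number information": [
        {"text": "A bank-Wang Mou 100001", "y_min": 20, "y_max": 23},
        {"text": "B bank-Li Mou 100021", "y_min": 40, "y_max": 43}],
}
rows = divide_rows(columns, reference="name")
print(rows)
```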
Each acquired text is a character string, whereas in practice the target text may contain different types of data. To make the target text easier to interpret, format conversion may be performed on the target text according to each format conversion rule of the target matching template, so as to obtain the target structured text.
For example, referring to fig. 6, the target text may include column texts with the text titles "trade time", "balance" and "trade opponent information". For the column text with the text title "trade time", the corresponding text type may be a character string of the form "XXXX year XX month XX day"; for the column text with the text title "balance", the corresponding text should be a floating point number; for the column text with the text title "trade opponent information", the corresponding text type is a character string. Thus, the column text titled "trade time" can be converted into a character string of the form "XXXX year XX month XX day", and the column text titled "balance" can be converted into floating point numbers. The column text titled "trade opponent information" is already a character string, which is consistent with its format conversion rule, so no format conversion is needed. Accordingly, the target structured text can be obtained.
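The per-title conversion can be sketched as below. The raw input date format `YYYY-MM-DD`, the helper names `to_date_string` and `convert_row`, and the sample values are assumptions; the application specifies only the target forms (a date-shaped string, a floating point number, and an unchanged string).

```python
from datetime import datetime


def to_date_string(raw):
    """Rewrite 'YYYY-MM-DD' (assumed input form) as 'XXXX year XX month XX day'."""
    parsed = datetime.strptime(raw, "%Y-%m-%d")
    return f"{parsed.year:04d} year {parsed.month:02d} month {parsed.day:02d} day"


def convert_row(row, rules):
    """Apply the rule for each title; titles without a rule stay strings."""
    return {title: rules.get(title, str)(value) for title, value in row.items()}


rules = {"trade time": to_date_string, "balance": float}
row = {
    "trade time": "2024-01-19",
    "balance": "1024.50",
    "trade opponent information": "A bank-Wang Mou",
}
converted = convert_row(row, rules)
print(converted)
```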
Optionally, in order to ensure the integrity and accuracy of the data, data verification may further be performed on the target structured text, so as to obtain a data verification result corresponding to the target structured text.
In one specific implementation, the data format of the target structured text may be checked, specifically whether the data under a certain text title conforms to a particular format, e.g., an e-mail format, a date format, an address format or a phone number format. If an abnormal data format exists in the target structured text, it may be determined that the data verification result corresponding to the target structured text is that a data anomaly exists; otherwise, it may be determined that the data verification result corresponding to the target structured text is that no data anomaly exists.
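As a rough illustration of the format check, the sketch below validates one column against a regular expression. The pattern table and the function name are assumptions, and the patterns are deliberately simple; they are not production-grade validators.

```python
import re
from typing import Dict, List

# Deliberately simple, hypothetical patterns.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def find_format_anomalies(rows: List[Dict[str, object]], title: str, kind: str) -> List[int]:
    """Return the indices of rows whose value under `title` does not match the pattern."""
    pattern = PATTERNS[kind]
    return [
        index
        for index, row in enumerate(rows)
        if not pattern.fullmatch(str(row.get(title, "")))
    ]
```

An empty result corresponds to "no data anomaly exists"; a non-empty result identifies the rows to flag.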
In another specific implementation, the data range of the target structured text may also be checked, specifically whether the data under a certain text title is within a reasonable range, e.g., an age should not be negative, and temperature data should be within a reasonable temperature range. If an abnormal data range exists in the target structured text, it may be determined that the data verification result corresponding to the target structured text is that a data anomaly exists; otherwise, it may be determined that the data verification result corresponding to the target structured text is that no data anomaly exists.
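A range check could be sketched as follows. The function name and the choice to treat missing or non-numeric values as anomalies are assumptions made for the illustration.

```python
from typing import Dict, List


def find_range_anomalies(rows: List[Dict[str, object]], title: str, low: float, high: float) -> List[int]:
    """Return the indices of rows whose value under `title` lies outside [low, high]."""
    anomalies = []
    for index, row in enumerate(rows):
        try:
            value = float(row[title])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric data is treated as abnormal in this sketch.
            anomalies.append(index)
            continue
        if not (low <= value <= high):
            anomalies.append(index)
    return anomalies
```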
In another specific implementation, the data logic of the target structured text may also be checked, specifically whether the logical relationship between the data under one or several text titles is correct, e.g., a start date should be earlier than an end date, and an order quantity should be greater than zero. If an abnormal logical relationship exists in the target structured text, it may be determined that the data verification result corresponding to the target structured text is that a data anomaly exists; otherwise, it may be determined that the data verification result corresponding to the target structured text is that no data anomaly exists.
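The two logical relationships named above can be checked with a sketch such as the following. The text titles "start date", "end date" and "order quantity" are hypothetical, and the rows are assumed to have already been converted to `date` and numeric values.

```python
from datetime import date
from typing import Dict, List


def find_logic_anomalies(rows: List[Dict[str, object]]) -> List[int]:
    """Return the indices of rows that violate either logical relationship."""
    anomalies = []
    for index, row in enumerate(rows):
        starts_before_end = row["start date"] < row["end date"]
        positive_quantity = row["order quantity"] > 0
        if not (starts_before_end and positive_quantity):
            anomalies.append(index)
    return anomalies


orders = [
    {"start date": date(2024, 1, 1), "end date": date(2024, 1, 19), "order quantity": 3},
    {"start date": date(2024, 2, 1), "end date": date(2024, 1, 19), "order quantity": 3},
    {"start date": date(2024, 1, 1), "end date": date(2024, 1, 19), "order quantity": 0},
]
```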
In another specific implementation, the data uniqueness of the target structured text may also be checked, specifically whether the data under a certain text title is unique within the target structured text, e.g., whether a user ID is unique, or whether a product code is unique. If repeated data exists in the target structured text, it may be determined that the data verification result corresponding to the target structured text is that a data anomaly exists; otherwise, it may be determined that the data verification result corresponding to the target structured text is that no data anomaly exists.
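A uniqueness check reduces to counting values in one column. The sketch below uses `collections.Counter`; the function name and the text title "user ID" used as a key are assumptions.

```python
from collections import Counter
from typing import Dict, List


def find_duplicates(rows: List[Dict[str, str]], title: str) -> List[str]:
    """Return the values under `title` that occur more than once, in sorted order."""
    counts = Counter(row[title] for row in rows)
    return sorted(value for value, occurrences in counts.items() if occurrences > 1)
```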
In practical applications, data verification may be performed using any one or more of the above data verification modes, and the modes to be used may be set according to actual needs, which is not limited in the present application.
After the data verification result corresponding to the target structured text is obtained, if the data verification result indicates that a data anomaly exists in the target structured text, exception handling may be performed on the target structured text. For example, preset prompt information may be displayed to inform the user that a data anomaly exists in the target structured text; as another example, the abnormal data in the target structured text may be highlighted according to a preset abnormal data display mode, so that the user can check it conveniently.
In a specific implementation manner of the embodiment of the present application, after text structuring processing is performed on the target text to obtain the target structured text, the target structured text may also be stored; for example, the target structured text may be stored in a preset relational database. In order to improve storage efficiency, a mapping relationship between each text title and its mapped field may be preset to obtain a title field mapping relationship, and the target structured text may then be mapped according to the title field mapping relationship. For example, a mapping relationship between the text title "name" and the mapped field "name" and a mapping relationship between the text title "gender" and the mapped field "gender" may be preset to obtain the title field mapping relationship; the text under the text title "name" in the target structured text can then be stored by means of reflective assignment; similarly, the text under the text title "gender" in the target structured text can also be stored by means of reflective assignment.
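Reflective assignment driven by a title field mapping relationship can be illustrated in Python with `setattr`. The `PersonRecord` class, the `TITLE_TO_FIELD` table and the function name are hypothetical; the application does not specify the language or the entity classes it uses, and the database write itself is omitted.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PersonRecord:
    # Hypothetical entity whose attributes mirror the mapped fields.
    name: str = ""
    gender: str = ""


# Hypothetical title field mapping relationship: text title -> mapped field.
TITLE_TO_FIELD = {"name": "name", "gender": "gender"}


def to_record(row: Dict[str, str], mapping: Dict[str, str]) -> PersonRecord:
    """Copy each mapped value onto the record by attribute name."""
    record = PersonRecord()
    for title, value in row.items():
        field_name = mapping.get(title)
        # Titles without a mapping, or without a matching attribute, are skipped.
        if field_name is not None and hasattr(record, field_name):
            setattr(record, field_name, value)
    return record
```

The benefit of this design is that supporting a new column only requires a new mapping entry and attribute, not new assignment code.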
In another specific implementation manner of the embodiment of the present application, after text structuring processing is performed on the target text to obtain the target structured text, the target structured text may also be displayed according to a preset display mode; the display mode may be set according to the actual scenario, which is not limited in the present application. For example, the target structured text may be displayed in a preset font and color for ease of reading and analysis.
In summary, the embodiment of the present application obtains the target text to be processed and extracts the target text information of the target text; performs template matching on the target text information by using each preset text template; and, if a target matching template corresponding to the target text information is matched, performs text structuring processing on the target text according to the target matching template to obtain a target structured text. According to the embodiment of the present application, when the target matching template corresponding to the target text information is matched, text structuring processing can be performed on the target text according to the target matching template, so that the target text can be structured efficiently and accurately, and the user experience can be improved.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation process of the embodiments of the present application in any way.
Corresponding to the text structuring processing method described in the above embodiments, fig. 7 shows a block diagram of an embodiment of a text structuring processing device provided in the embodiments of the present application.
In an embodiment of the present application, a text structuring processing apparatus may include:
a target text information extraction module 701, configured to obtain a target text to be processed, and extract target text information of the target text;
a text template matching module 702, configured to perform template matching on the target text information by using preset text templates;
and the text structuring processing module 703 is configured to perform text structuring processing on the target text according to the target matching template if the target matching template corresponding to the target text information is matched, so as to obtain a target structured text.
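The cooperation of the three modules can be sketched as a single function that takes each module's behavior as a callable. All names and signatures here are assumptions made for illustration; returning `None` stands in for the case in which no target matching template is matched.

```python
from typing import Callable, Optional, Sequence, TypeVar

Info = TypeVar("Info")
Template = TypeVar("Template")
Result = TypeVar("Result")


def structure_text(
    target_text: str,
    extract_info: Callable[[str], Info],         # role of module 701
    templates: Sequence[Template],
    matches: Callable[[Info, Template], bool],   # role of module 702
    structure: Callable[[str, Template], Result],  # role of module 703
) -> Optional[Result]:
    """Extract text information, find the first matching template, then structure."""
    info = extract_info(target_text)
    template = next((t for t in templates if matches(info, t)), None)
    if template is None:
        return None  # no target matching template was matched
    return structure(target_text, template)
```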
In a specific implementation manner of the embodiment of the present application, the target text information may include each text coordinate and each text title of the target text;
The text template matching module may include:
the coordinate and title acquisition sub-module is used for acquiring each text coordinate and each text title of the target template to be matched; the target template to be matched is any unmatched text template in each text template;
the target matching template determining sub-module is used for determining the target template to be matched as the target matching template if each text coordinate of the target text is the same as each text coordinate of the target template to be matched and each text title of the target text is the same as each text title of the target template to be matched;
and the return execution sub-module is used for returning to the step of acquiring the text coordinates and the text titles of the target to-be-matched templates and the subsequent steps if the text coordinates of the target text are different from the text coordinates of the target to-be-matched templates or the text titles of the target text are different from the text titles of the target to-be-matched templates.
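The matching loop formed by these three sub-modules can be sketched as below. The `TextTemplate` class and the tuple representation of coordinates are assumptions; the comparison uses exact equality because the text says the coordinates and titles must be "the same".

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Coordinates = Tuple[Tuple[float, float], ...]


@dataclass(frozen=True)
class TextTemplate:
    # Hypothetical template representation.
    name: str
    coordinates: Coordinates
    titles: Tuple[str, ...]


def match_template(
    coordinates: Coordinates,
    titles: Tuple[str, ...],
    templates: Sequence[TextTemplate],
) -> Optional[TextTemplate]:
    """Try each not-yet-matched template in turn; return the first full match."""
    for template in templates:
        if template.coordinates == coordinates and template.titles == titles:
            return template
        # Otherwise continue with the next template to be matched.
    return None  # no target matching template: the caller may run exception handling
```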
In a specific implementation manner of the embodiment of the present application, the text structuring processing module may include:
The line spacing and rule submodule is used for acquiring each line spacing and each format conversion rule of the target matching template;
the data classification sub-module is used for classifying the data of the target text according to each text coordinate and each text title of the target matching template to obtain each column of text;
the same-line text merging sub-module is used for merging the same-line texts of the target column texts according to the line spacing of the target matching template to obtain each same-line text of the target column texts; the target column text is any column text which is not subjected to the same-line text combination in all column texts;
the data line dividing sub-module is used for carrying out data line dividing on the target text according to each text coordinate and each column text of the target text to obtain each line text;
and the format conversion sub-module is used for carrying out format conversion on the target text according to each format conversion rule of the target matching template to obtain the target structured text.
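Of these sub-modules, the same-line text merging step is the least self-explanatory, so a sketch may help. The merging criterion used here (two fragments of a column belong to the same line when the vertical gap between them is smaller than the template's line spacing) is an assumption about how the line spacing is applied, as are the fragment layout and the function name.

```python
from typing import List, Tuple

# Hypothetical fragment layout: (text, y_top, y_bottom).
Fragment = Tuple[str, float, float]


def merge_same_line(fragments: List[Fragment], line_spacing: float) -> List[Fragment]:
    """Merge vertically adjacent fragments of one column into same-line texts."""
    merged: List[Fragment] = []
    for text, top, bottom in sorted(fragments, key=lambda fragment: fragment[1]):
        if merged and top - merged[-1][2] < line_spacing:
            # Gap smaller than the line spacing: treat as a wrapped continuation.
            previous_text, previous_top, previous_bottom = merged[-1]
            merged[-1] = (previous_text + text, previous_top, max(previous_bottom, bottom))
        else:
            merged.append((text, top, bottom))
    return merged


column = [
    ("A bank-", 10.0, 14.0),
    ("Wang Mou 100001", 15.0, 19.0),
    ("B bank-", 30.0, 34.0),
    ("Li Mou 100021", 35.0, 39.0),
]
same_line_texts = merge_same_line(column, line_spacing=5.0)
```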
In a specific implementation manner of the embodiment of the present application, the text structuring processing device may further include:
and the data verification module is used for carrying out data verification on the target structured text to obtain a data verification result corresponding to the target structured text.
In a specific implementation manner of the embodiment of the present application, the text structuring processing device may further include:
the text mapping module is used for mapping the target structured text according to a preset title field mapping relation; and the title field mapping relation is the mapping relation between the text title of the target structured text and the mapped field.
In a specific implementation manner of the embodiment of the present application, the text structuring processing device may further include:
and the text display module is used for displaying the text of the target structured text according to a preset display mode.
In a specific implementation manner of the embodiment of the present application, the text structuring processing device may further include:
and the exception handling module is used for performing exception handling according to a preset exception handling mode if the target matching template corresponding to the target text information is not matched.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described apparatus, modules and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Fig. 8 shows a schematic block diagram of a terminal device provided in an embodiment of the present application, and for convenience of explanation, only a portion relevant to the embodiment of the present application is shown.
As shown in fig. 8, the terminal device 8 of this embodiment includes: a processor 80, a memory 81 and a computer program 82 stored in the memory 81 and executable on the processor 80. The steps of the respective embodiments of the text structuring method described above, such as steps S101 to S103 shown in fig. 1, are implemented when the processor 80 executes the computer program 82. Alternatively, the processor 80 may perform the functions of the modules/units of the apparatus embodiments described above, such as the functions of the modules 701 to 703 shown in fig. 7, when executing the computer program 82.
By way of example, the computer program 82 may be partitioned into one or more modules/units that are stored in the memory 81 and executed by the processor 80 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specified functions, and the instruction segments are used to describe the execution process of the computer program 82 in the terminal device 8.
It will be appreciated by those skilled in the art that fig. 8 is merely an example of the terminal device 8 and does not constitute a limitation of the terminal device 8; the terminal device 8 may include more or fewer components than illustrated, may combine certain components, or may use different components. For example, the terminal device 8 may also include input-output devices, network access devices, buses, etc.
The processor 80 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSPs), application specific integrated circuits (Application Specific Integrated Circuit, ASICs), field programmable gate arrays (Field-Programmable Gate Array, FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or a memory of the terminal device 8. The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card (Flash Card) provided on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used for storing the computer program as well as other programs and data required by the terminal device 8. The memory 81 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative; the division of the modules or units is merely a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the methods of the above embodiments of the present application may also be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of each method embodiment described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable storage medium may be appropriately increased or decreased as required in a given jurisdiction; for example, in some jurisdictions, the computer readable storage medium does not include electrical carrier signals and telecommunications signals.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A method for text structuring, comprising:
acquiring a target text to be processed, and extracting target text information of the target text;
performing template matching on the target text information by using each preset text template;
and if a target matching template corresponding to the target text information is matched, performing text structuring processing on the target text according to the target matching template to obtain a target structured text.
2. The text structuring processing method according to claim 1, wherein the target text information includes respective text coordinates and respective text titles of the target text;
The performing template matching on the target text information by using preset text templates comprises the following steps:
acquiring each text coordinate and each text title of a target template to be matched; the target template to be matched is any unmatched text template in each text template;
if the text coordinates of the target text are the same as the text coordinates of the target template to be matched and the text titles of the target text are the same as the text titles of the target template to be matched, determining the target template to be matched as the target matching template;
and if the text coordinates of the target text are different from the text coordinates of the target template to be matched or the text titles of the target text are different from the text titles of the target template to be matched, returning to the step of acquiring the text coordinates and the text titles of the target template to be matched and the subsequent steps.
3. The text structuring method according to claim 2, wherein the performing, if a target matching template corresponding to the target text information is matched, text structuring processing on the target text according to the target matching template to obtain a target structured text comprises:
acquiring each line spacing and each format conversion rule of the target matching template;
according to each text coordinate and each text title of the target matching template, carrying out data classification on the target text to obtain each column of text;
carrying out same-line text merging on the target column text according to the line spacing of the target matching template to obtain each same-line text of the target column text; the target column text is any column text which is not subjected to the same-line text combination in all column texts;
according to each text coordinate and each column text of the target text, carrying out data line division on the target text to obtain each line text;
and carrying out format conversion on the target text according to each format conversion rule of the target matching template to obtain the target structured text.
4. The text structuring method according to claim 1, wherein, after the text structuring processing is performed on the target text according to the target matching template to obtain the target structured text if the target matching template corresponding to the target text information is matched, the method further comprises:
carrying out data verification on the target structured text to obtain a data verification result corresponding to the target structured text.
5. The text structuring method according to claim 1, wherein, after the text structuring processing is performed on the target text according to the target matching template to obtain the target structured text if the target matching template corresponding to the target text information is matched, the method further comprises:
mapping the target structured text according to a preset title field mapping relation; and the title field mapping relation is the mapping relation between the text title of the target structured text and the mapped field.
6. The text structuring method according to claim 1, wherein, after the text structuring processing is performed on the target text according to the target matching template to obtain the target structured text if the target matching template corresponding to the target text information is matched, the method further comprises:
and displaying the text of the target structured text according to a preset display mode.
7. The text structuring processing method according to any one of claims 1 to 6, characterized by further comprising:
if the target matching template corresponding to the target text information is not matched, performing exception handling according to a preset exception handling mode.
8. A text structuring processing device, comprising:
the target text information extraction module is used for acquiring target text to be processed and extracting target text information of the target text;
the text template matching module is used for performing template matching on the target text information by utilizing each preset text template;
and the text structuring processing module is used for carrying out text structuring processing on the target text according to the target matching template if the target matching template corresponding to the target text information is matched, so as to obtain the target structured text.
9. A computer-readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the text structuring method of any one of claims 1 to 7.
10. Terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the text structuring method according to any of claims 1 to 7 when the computer program is executed.
CN202410079966.2A 2024-01-19 2024-01-19 Text structuring processing method and device, readable storage medium and terminal equipment Pending CN117786426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410079966.2A CN117786426A (en) 2024-01-19 2024-01-19 Text structuring processing method and device, readable storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410079966.2A CN117786426A (en) 2024-01-19 2024-01-19 Text structuring processing method and device, readable storage medium and terminal equipment

Publications (1)

Publication Number Publication Date
CN117786426A true CN117786426A (en) 2024-03-29

Family

ID=90394595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410079966.2A Pending CN117786426A (en) 2024-01-19 2024-01-19 Text structuring processing method and device, readable storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN117786426A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination