Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Fig. 1 is a schematic flow chart of an implementation of a method for creating a text field of a PDF file according to an embodiment of the present invention, where as shown in the diagram, the method may include the following steps:
Step S101, acquiring all preset objects in a page to be processed, and acquiring the positions of the preset objects in the page to be processed.
In practical applications, there is at least one page to be processed. The pages may be processed one at a time in page-number order, or all pages may be processed simultaneously. If the pages are processed in page-number order, acquiring all preset objects in the page to be processed means acquiring all preset objects in the current page; if all pages are processed simultaneously, it means acquiring all preset objects in each page separately. In other words, the preset objects are grouped by page, and preset objects that are not in the same page to be processed cannot be processed together.
The preset object comprises any one of: a cell, a horizontal line, a radio box, and a check box. It should be noted that the preset objects include, but are not limited to, the objects listed above, and are not specifically limited herein.
Step S102, extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed.
The preset range of the preset object may be preset manually and includes: the inside of the preset object, directly above the preset object, directly in front of the preset object, the left side of the preset object, the upper side of the preset object, directly below the preset object, and directly behind the preset object. It should be noted that the preset range of the preset object includes, but is not limited to, the ranges listed above, and is not specifically limited herein. In practical applications, the preset range needs to be set according to actual conditions; for example, if the preset object is a cell, the preset range may be set to the left side of the cell or the upper side of the cell. It should also be noted that, in order to conform to conventional reading habits and make the PDF file easier for the user to read, the preset range of the preset object may be determined from the viewpoint of a user facing the PDF file.
Step S103, creating a text field for the preset object, and taking the text information as the name of the text field.
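As a rough illustration of steps S101 to S103, the following Python sketch names each preset object after the nearest text to its left. The data classes, the bounding-box convention (x0, y0, x1, y1 with y growing downward) and the choice of "left side" as the preset range are assumptions made for this example only and are not part of the described method.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (x0, y0, x1, y1), y growing downward; this convention is an assumption.
BBox = Tuple[float, float, float, float]


@dataclass
class TextSpan:
    text: str
    bbox: BBox


@dataclass
class PresetObject:
    kind: str  # "cell", "horizontal_line", "radio_box" or "check_box"
    bbox: BBox


@dataclass
class TextField:
    name: str
    bbox: BBox


def overlaps_vertically(a: BBox, b: BBox) -> bool:
    """True if the two boxes share part of the same row."""
    return a[1] < b[3] and b[1] < a[3]


def text_left_of(obj: PresetObject, spans: List[TextSpan]) -> Optional[str]:
    """S102 (one possible preset range): nearest text to the left, same row."""
    candidates = [s for s in spans
                  if s.bbox[2] <= obj.bbox[0]
                  and overlaps_vertically(s.bbox, obj.bbox)]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.bbox[2]).text


def create_text_fields(objects: List[PresetObject],
                       spans: List[TextSpan]) -> List[TextField]:
    """S103: one field per object, named after the extracted text."""
    fields = []
    for obj in objects:
        name = text_left_of(obj, spans)
        if name is not None:
            fields.append(TextField(name=name.strip(), bbox=obj.bbox))
    return fields


# S101 is assumed to have produced these objects and text spans already.
objects = [PresetObject("cell", (100, 10, 200, 30))]
spans = [TextSpan("Name", (20, 12, 60, 28)),
         TextSpan("Date", (20, 50, 60, 68))]
fields = create_text_fields(objects, spans)
```

In this example the cell receives the name "Name" because that span lies on the same row, while "Date" lies on a different row and is ignored.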
Optionally, referring to fig. 2, if the preset object to be obtained is a cell, the obtaining all preset objects in the page to be processed includes:
Step S201, acquiring all lines in the page to be processed, preprocessing all lines in the page to be processed, and dividing tables based on the intersection relationship of the preprocessed lines.
The preprocessing may comprise any one of: classifying, de-duplicating, connecting, and sorting. It should be noted that the preprocessing includes, but is not limited to, the processing methods listed above, and is not specifically limited herein.
In practical applications, all acquired lines are classified, de-duplicated, connected, and sorted in preparation for identifying tables. Dividing tables based on the intersection relationship of the preprocessed lines may mean that lines that intersect directly or are connected indirectly are assigned to the same table; that is, the table is identified.
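The table division in step S201 can be pictured as finding connected components among the lines. The sketch below is one possible way to do this, assuming axis-aligned segments given as (x0, y0, x1, y1) and a hypothetical touch tolerance; it de-duplicates and sorts the lines but omits the "connecting" of collinear fragments for brevity.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, float, float]  # axis-aligned: (x0, y0, x1, y1)


def normalize(seg: Segment) -> Segment:
    """Order the end points so that x0 <= x1 and y0 <= y1."""
    x0, y0, x1, y1 = seg
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def touches(a: Segment, b: Segment, tol: float) -> bool:
    """Axis-aligned segments intersect when their bounding boxes overlap."""
    return (a[0] <= b[2] + tol and b[0] <= a[2] + tol and
            a[1] <= b[3] + tol and b[1] <= a[3] + tol)


def group_into_tables(lines: List[Segment], tol: float = 0.5) -> List[List[Segment]]:
    """Put directly or indirectly connected lines into the same group."""
    segs = sorted(set(normalize(s) for s in lines))  # de-duplicate, then sort
    parent = list(range(len(segs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(segs)):
        for j in range(i + 1, len(segs)):
            if touches(segs[i], segs[j], tol):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Segment]] = {}
    for i, seg in enumerate(segs):
        groups.setdefault(find(i), []).append(seg)
    return sorted(groups.values())


lines = [
    (0, 0, 10, 0), (0, 10, 10, 10),   # top and bottom of a box
    (0, 0, 0, 10), (10, 0, 10, 10),   # left and right of the box
    (10, 0, 0, 0),                    # duplicate of the top line, reversed
    (50, 50, 60, 50),                 # an isolated line elsewhere on the page
]
tables = group_into_tables(lines)
```

Here the four box lines end up in one group and the isolated line in another; the reversed duplicate is removed during normalization.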
Step S202, whether the lines divided into the same table have closed table frame lines or not is determined.
In practical applications, after a table is identified, it is necessary to judge whether the table is a valid table; this can be done by determining whether the lines divided into the same table have closed table frame lines. If the lines divided into the same table have closed table frame lines, the identified table is a valid table; if they do not, the identified table is an invalid table.
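One possible reading of the validity check in step S202 is that the four sides of the group's bounding box must be fully covered by lines. The following sketch implements that reading; the tolerance value and the assumption that segments are axis-aligned and normalized (x0 <= x1, y0 <= y1) are choices made for this example.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float, float]  # axis-aligned, x0 <= x1, y0 <= y1


def covers(intervals: List[Tuple[float, float]], lo: float, hi: float,
           tol: float) -> bool:
    """True if the union of the intervals spans [lo, hi] without gaps."""
    reach = lo
    for a, b in sorted(intervals):
        if a > reach + tol:
            return False
        reach = max(reach, b)
    return reach >= hi - tol


def has_closed_frame(segs: List[Segment], tol: float = 0.5) -> bool:
    """Check that all four sides of the bounding box are drawn."""
    xmin = min(s[0] for s in segs)
    ymin = min(s[1] for s in segs)
    xmax = max(s[2] for s in segs)
    ymax = max(s[3] for s in segs)
    if xmax - xmin <= tol or ymax - ymin <= tol:
        return False  # a single line or a degenerate group is not a frame

    def horizontal_at(y: float) -> List[Tuple[float, float]]:
        return [(s[0], s[2]) for s in segs
                if abs(s[1] - y) <= tol and abs(s[3] - y) <= tol]

    def vertical_at(x: float) -> List[Tuple[float, float]]:
        return [(s[1], s[3]) for s in segs
                if abs(s[0] - x) <= tol and abs(s[2] - x) <= tol]

    return (covers(horizontal_at(ymin), xmin, xmax, tol) and
            covers(horizontal_at(ymax), xmin, xmax, tol) and
            covers(vertical_at(xmin), ymin, ymax, tol) and
            covers(vertical_at(xmax), ymin, ymax, tol))


box = [(0, 0, 10, 0), (0, 10, 10, 10), (0, 0, 0, 10), (10, 0, 10, 10)]
open_box = box[:3]  # right side missing
```

With these inputs, `has_closed_frame(box)` holds and `has_closed_frame(open_box)` does not, so only the first group would be treated as a valid table.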
Step S203, if the lines divided into the same table have closed table frame lines, obtaining the cells of the table.
In practical applications, after the identified table is determined to be a valid table, the cells of the valid table need to be found; cells can be found by determining whether the lines in the valid table include closed table lines. After the cells are found, the row and column span of each cell is determined in order to determine the size of the text field. Note that closed table lines differ from closed table frame lines: closed table frame lines may be the lines that form the border of the table, whereas closed table lines may be the lines that form cells inside the table.
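The cell search and span determination can be sketched as follows. This is only one possible approach, not necessarily the one used in the embodiment: the table is cut into elementary grid squares, and neighbouring squares that are not separated by a line are merged into one cell. It assumes exact, already snapped coordinates and that each edge is covered by a single preprocessed segment.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, float, float]  # axis-aligned, x0 <= x1, y0 <= y1


def find_cells(segs: List[Segment]):
    """Return sorted (bbox, row_span, col_span) tuples for one valid table."""
    xs = sorted({s[0] for s in segs if s[0] == s[2]})  # x of vertical lines
    ys = sorted({s[1] for s in segs if s[1] == s[3]})  # y of horizontal lines

    def v_edge(x, y0, y1):
        return any(s[0] == s[2] == x and s[1] <= y0 and s[3] >= y1 for s in segs)

    def h_edge(y, x0, x1):
        return any(s[1] == s[3] == y and s[0] <= x0 and s[2] >= x1 for s in segs)

    # One union-find node per elementary grid square (column i, row j).
    parent = {(i, j): (i, j)
              for i in range(len(xs) - 1) for j in range(len(ys) - 1)}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for (i, j) in list(parent):
        # Merge with the right neighbour if no vertical line separates them.
        if (i + 1, j) in parent and not v_edge(xs[i + 1], ys[j], ys[j + 1]):
            parent[find((i, j))] = find((i + 1, j))
        # Merge with the lower neighbour if no horizontal line separates them.
        if (i, j + 1) in parent and not h_edge(ys[j + 1], xs[i], xs[i + 1]):
            parent[find((i, j))] = find((i, j + 1))

    members: Dict[tuple, list] = {}
    for k in parent:
        members.setdefault(find(k), []).append(k)

    cells = []
    for squares in members.values():
        cols = [i for i, _ in squares]
        rows = [j for _, j in squares]
        bbox = (xs[min(cols)], ys[min(rows)], xs[max(cols) + 1], ys[max(rows) + 1])
        cells.append((bbox, max(rows) - min(rows) + 1, max(cols) - min(cols) + 1))
    return sorted(cells)


# A 2 x 2 table whose top row is merged into a single cell.
table = [
    (0, 0, 20, 0), (0, 20, 20, 20), (0, 0, 0, 20), (20, 0, 20, 20),  # frame
    (0, 10, 20, 10),    # horizontal line between the two rows
    (10, 10, 10, 20),   # vertical line in the lower row only
]
cells = find_cells(table)
```

The merged top cell is reported with a column span of 2, which is the information needed to size its text field.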
Further, the extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed includes:
judging whether the cell interior contains text information or not;
and if the cell does not contain text information, extracting the text information in the cell adjacent to the cell from the page to be processed.
In practical applications, if a cell contains text information, no text field needs to be created for that cell; if a cell contains no text information, a text field needs to be created for it. Before the text field is created, the name of the text field of the cell needs to be determined; that is, the text information in a cell adjacent to the cell is extracted from the page to be processed. The adjacent cell may be the cell on the left side or the upper side of the cell. After the text information of the adjacent cell is extracted, a text field is created for the cell, and the extracted text information is used as the name of the text field.
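The naming rule for empty cells can be illustrated with a small sketch that treats a table as a grid of cell texts. The grid representation and the preference for the left neighbour over the upper neighbour are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def names_for_empty_cells(grid: List[List[str]]) -> Dict[Tuple[int, int], str]:
    """Map (row, column) of each empty cell to a field name, if one is found."""
    names = {}
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if text.strip():
                continue  # the cell already contains text: no field is created
            left = row[c - 1].strip() if c > 0 else ""
            upper = grid[r - 1][c].strip() if r > 0 else ""
            name = left or upper  # left neighbour first (assumed preference)
            if name:
                names[(r, c)] = name
    return names


grid = [
    ["Name",    "", "Age", ""],
    ["Address", "", "",    ""],
]
names = names_for_empty_cells(grid)
```

In this grid the empty cell at row 1, column 2 has an empty left neighbour, so it falls back to the upper neighbour "Age"; the cell at row 1, column 3 has no named neighbour and receives no field.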
Optionally, if the preset object to be acquired is a horizontal line, after determining whether there is a closed table frame line in the lines divided into the same table, the method further includes:
Step S204, if the lines divided into the same table have closed table frame lines, acquiring the horizontal lines in the table that do not belong to closed table lines.
In practical applications, it is determined whether the lines divided into the same table have closed table frame lines; if they do, the table is a valid table. A valid table may contain cells for which text fields need to be created, and may also contain horizontal lines for which text fields need to be created. If closed table lines exist in the valid table, the valid table contains cells; if unclosed table lines exist in the valid table, horizontal lines may exist among the unclosed table lines. Therefore, after a valid table is identified, horizontal lines that do not belong to closed table lines can also be acquired from the valid table.
Further, the extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed includes:
determining the position right in front of or below the horizontal line according to the position of the horizontal line in the page to be processed;
extracting text information at a position right in front of or below the horizontal line;
the creating a text field for the preset object and using the text information as the name of the text field includes:
creating a text field above the position of the horizontal line in the page to be processed so that the width of the text field is equal to the length of the horizontal line;
taking the text information as the name of the text field;
the horizontal lines include: horizontal path objects and runs of continuous underline characters.
In practice, continuous underline characters can also be regarded as horizontal lines. Accordingly, acquiring all lines in the page to be processed comprises acquiring all lines in the page to be processed and all runs of continuous underline characters in the page to be processed. A run of continuous underline characters consists of several consecutive underline characters; the number of characters may be preset manually and is not limited herein.
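The handling of underline runs can be sketched as follows, assuming the text of one line is available as a plain string. The minimum run length of three characters, the use of the text in front of the run as the name, and the field height are illustrative assumptions; y is taken to grow downward.

```python
import re
from typing import List, Tuple


def find_underline_runs(text: str, min_run: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each run of underscores."""
    pattern = re.compile(r"_{%d,}" % min_run)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def label_underline_fields(text: str, min_run: int = 3) -> List[Tuple[str, int, int]]:
    """Name each run after the text directly in front of it."""
    fields = []
    prev_end = 0
    for start, end in find_underline_runs(text, min_run):
        label = text[prev_end:start].strip().rstrip(":").strip()
        fields.append((label, start, end))
        prev_end = end
    return fields


def field_box_on_line(x0: float, y: float, x1: float,
                      height: float = 12.0) -> Tuple[float, float, float, float]:
    """Place the field above the line; its width equals the line length."""
    return (x0, y - height, x1, y)


labelled = label_underline_fields("Name: ______ Date: ____")
box = field_box_on_line(100.0, 200.0, 250.0)
```

The first function treats an underscore run as a horizontal line, and `field_box_on_line` reflects the requirement that the width of the text field equals the length of the horizontal line.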
In addition, in practical applications, if the acquired preset object is a check box, the check box can be regarded as a small table, so the method for creating the check box text field in the PDF file can refer to the method described in steps S101-S103 and S201-S203.
For the method for creating a digital signature field of a PDF file, reference may be made to the method for creating a text field of a PDF file described in steps S101 to S103 and S201 to S204. The difference is that, in the method for creating a text field, the extracted text information is used as the name of the text field, whereas in the method for creating a digital signature field, only text information containing a preset keyword is used as the name of the text field. For example, in an English PDF file, the preset keyword may be set to "signature", and text information containing "signature" is used as the name of the text field; in a Chinese PDF file, the preset keyword may be set to the Chinese word for "signature", and text information containing that word is used as the name of the text field.
In this embodiment, all preset objects in a page to be processed and their positions in the page are acquired; text information within a preset range of each preset object is extracted according to the position of the preset object in the page; and a text field is created for the preset object with the text information as the name of the text field. Cell text fields and horizontal-line text fields are thereby generated automatically for the PDF file, which solves the prior-art problems of inaccurate field size and position and the heavy workload of adding form fields manually.
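The keyword test that distinguishes a digital signature field from an ordinary text field can be sketched in a few lines. The function name, the return values and the case-insensitive comparison are assumptions made for this example.

```python
from typing import Iterable


def field_kind(label: str, keywords: Iterable[str] = ("signature",)) -> str:
    """Return "signature" if the label contains a preset keyword, else "text"."""
    lowered = label.casefold()
    if any(keyword.casefold() in lowered for keyword in keywords):
        return "signature"
    return "text"


kinds = [field_kind("Applicant Signature"), field_kind("Date")]
```

Additional keywords, for example for other languages, can be passed through the `keywords` argument.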
Fig. 3 is a schematic flow chart of an implementation of a method for creating a text field of a PDF file according to another embodiment of the present invention, where as shown in the figure, if an acquired preset object is a radio box, the acquiring all preset objects in a page to be processed further includes:
Step S301, acquiring all path objects in the page to be processed that are composed of four Bezier curve segments connected end to end.
In practical applications, determining whether a path object is composed of four Bezier curve segments connected end to end may include the following steps: determining whether the number of points forming the path object is 13 (one starting point plus three points for each of the four cubic Bezier segments); if the number of points is 13, determining whether the starting point of the path object carries a Move To flag, whether the remaining points carry a Bezier To flag, and whether the end point carries a Close Figure flag. The Move To flag, Bezier To flag, and Close Figure flag may be instructions in the program.
Step S302, determining whether each Bezier curve segment in the path object is a 1/4 circular arc segment.
Step S303, if each Bezier curve segment of the path object is a 1/4 circular arc segment, defining the path object as a first type radio box, and acquiring the first type radio box.
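The point-count and flag checks described above can be sketched as follows. The `PointFlag` type and the representation of a path as a list of (x, y, flag) tuples are hypothetical; a real PDF or graphics library will expose path points differently.

```python
from enum import Flag, auto
from typing import List, Tuple


class PointFlag(Flag):
    MOVE_TO = auto()
    LINE_TO = auto()
    BEZIER_TO = auto()
    CLOSE_FIGURE = auto()


PathPoint = Tuple[float, float, PointFlag]


def is_four_bezier_closed_path(points: List[PathPoint]) -> bool:
    """1 starting point + 4 segments x 3 points each = 13 points."""
    if len(points) != 13:
        return False
    if not points[0][2] & PointFlag.MOVE_TO:
        return False
    if not all(p[2] & PointFlag.BEZIER_TO for p in points[1:]):
        return False
    return bool(points[-1][2] & PointFlag.CLOSE_FIGURE)


K = 0.5523  # control-point offset commonly used to approximate a unit circle
coords = [(1, 0), (1, K), (K, 1), (0, 1), (-K, 1), (-1, K), (-1, 0),
          (-1, -K), (-K, -1), (0, -1), (K, -1), (1, -K), (1, 0)]
circle = [(x, y, PointFlag.MOVE_TO) if i == 0 else (x, y, PointFlag.BEZIER_TO)
          for i, (x, y) in enumerate(coords)]
# The end point carries both the Bezier To and the Close Figure flag.
circle[-1] = (1, 0, PointFlag.BEZIER_TO | PointFlag.CLOSE_FIGURE)
```

A path that passes this check is only a candidate; whether each of its segments is a 1/4 circular arc is decided separately.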
Further, the extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed includes:
determining the position right behind the first type radio box according to the position of the first type radio box in the page to be processed;
extracting text information at the position right behind the first type radio box;
the creating a text field for the preset object and using the text information as the name of the text field specifically include:
and creating a text field for the first type radio box, and taking the text information as the name of the text field.
Step S304, if a Bezier curve segment that is not a 1/4 circular arc segment exists in the path object, discarding the path object.
Optionally, referring to fig. 4, if the preset object to be acquired is a radio box, the acquiring all preset objects in the page to be processed further includes:
Step S401, acquiring all text objects in the page to be processed;
Step S402, judging whether a preset character exists in the text object;
Step S403, if a preset character exists in the text object, defining the character as a second type radio box, and acquiring the second type radio box.
In practical applications, the preset character may be a character, identified by its Unicode or ASCII code, whose shape is circular or otherwise the same as that of a radio box. In other words, some characters in text objects are circular or have the same shape as a radio box, and such characters can be regarded as radio boxes. The preset characters are not limited to Unicode and ASCII characters, and are not specifically limited herein.
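The source does not specify how a Bezier segment is tested for being a 1/4 circular arc. One plausible test, sketched below under that caveat, derives the only two possible circle centres from the end points of the segment (the radii to the two end points must be equal and perpendicular) and checks that sampled curve points lie on the circle within a relative tolerance. The tolerance and the sample positions are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def bezier_point(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t."""
    u = 1.0 - t
    return tuple(u**3 * a + 3 * u * u * t * b + 3 * u * t * t * c + t**3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))


def is_quarter_arc(p0: Point, p1: Point, p2: Point, p3: Point,
                   rel_tol: float = 0.01) -> bool:
    mx, my = (p0[0] + p3[0]) / 2, (p0[1] + p3[1]) / 2   # chord midpoint
    hx, hy = (p3[0] - p0[0]) / 2, (p3[1] - p0[1]) / 2   # half chord vector
    half = math.hypot(hx, hy)
    if half == 0:
        return False
    radius = half * math.sqrt(2)
    # The centre lies on the perpendicular bisector, half a chord from it.
    for centre in ((mx - hy, my + hx), (mx + hy, my - hx)):
        if all(abs(math.dist(bezier_point(p0, p1, p2, p3, t), centre) - radius)
               <= rel_tol * radius for t in (0.25, 0.5, 0.75)):
            return True
    return False


def is_first_type_radio_box(points: Sequence[Point], rel_tol: float = 0.01) -> bool:
    """S302 to S304: keep the path only if all four segments are 1/4 arcs."""
    if len(points) != 13:
        return False
    return all(is_quarter_arc(*points[3 * i:3 * i + 4], rel_tol=rel_tol)
               for i in range(4))


K = 0.5523  # control-point offset commonly used to approximate a unit circle
circle = [(1, 0), (1, K), (K, 1), (0, 1), (-K, 1), (-1, K), (-1, 0),
          (-1, -K), (-K, -1), (0, -1), (K, -1), (1, -K), (1, 0)]
# Same end points, but control points pulled into the corners: not a circle.
bulged = [(1, 0), (1, 1), (1, 1), (0, 1), (-1, 1), (-1, 1), (-1, 0),
          (-1, -1), (-1, -1), (0, -1), (1, -1), (1, -1), (1, 0)]
```

With these inputs the circular path is accepted as a first type radio box and the bulged path is discarded.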
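Detection of preset characters can be sketched as a simple membership test. The particular set of Unicode circle characters below is an assumption chosen for illustration; ASCII letters such as "O" are deliberately left out of the default set because they would also match ordinary words, but the set can be replaced as needed.

```python
from typing import FrozenSet, List, Tuple

# Assumed default set: WHITE CIRCLE, LARGE CIRCLE, BLACK CIRCLE,
# FISHEYE and BULLSEYE.
RADIO_CHARS: FrozenSet[str] = frozenset(
    {"\u25cb", "\u25ef", "\u25cf", "\u25c9", "\u25ce"})


def find_radio_characters(text: str,
                          preset: FrozenSet[str] = RADIO_CHARS) -> List[Tuple[int, str]]:
    """Return (offset, character) for every preset character in the text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in preset]


hits = find_radio_characters("\u25cb Daily \u25cb Monthly")
```

Each hit would then be treated as a second type radio box located at the position of that character in the page.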
Further, the extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed specifically includes:
extracting text information on the adjacent position of the second type radio box according to the position of the second type radio box in the page to be processed;
the creating a text field for the preset object and using the text information as the name of the text field specifically include:
and creating a text field for the second type radio box, and using the text information as the name of the text field.
The adjacent position of the second type radio box may be preset manually and may include any one of: directly in front of, directly behind, directly above, and directly below the second type radio box. It is not specifically limited herein.
Further, after creating a text field for the preset object and using the text information as a name of the text field, the method includes:
and grouping the radio boxes according to the positions of the radio boxes in the page to be processed.
In practical applications, the radio boxes are grouped to ensure that the names of the text fields of the radio boxes in the same group belong to the same category and/or that the radio options in the same group are mutually exclusive. For example, in a valid table, three radio boxes are arranged in sequence directly behind the position of the text information "payment frequency", and the names of the text fields of the radio boxes are "daily", "monthly" and "yearly" respectively. The three radio boxes can be grouped according to their positions in the page to be processed; the names of their text fields belong to the same category, namely "payment frequency", and the three radio boxes are mutually exclusive.
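The source states only that radio boxes are grouped by position. The sketch below shows one possible rule, assumed for illustration: radio boxes on the same row whose horizontal spacing stays below a threshold form one group. Each radio box is represented as an (x, y, name) tuple, and both thresholds are hypothetical.

```python
from typing import List, Tuple

RadioBox = Tuple[float, float, str]  # (x, y, field name)


def group_radio_boxes(boxes: List[RadioBox], y_tol: float = 2.0,
                      max_gap: float = 80.0) -> List[List[RadioBox]]:
    # First cluster the boxes into rows by their y coordinate.
    rows: List[List[RadioBox]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][-1][1]) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])

    # Then split each row wherever the horizontal gap is too large.
    groups: List[List[RadioBox]] = []
    for row in rows:
        row.sort(key=lambda b: b[0])
        current = [row[0]]
        for box in row[1:]:
            if box[0] - current[-1][0] <= max_gap:
                current.append(box)
            else:
                groups.append(current)
                current = [box]
        groups.append(current)
    return groups


boxes = [(100, 50, "daily"), (220, 50, "yearly"), (160, 50.5, "monthly"),
         (100, 120, "yes"), (160, 120, "no")]
groups = group_radio_boxes(boxes)
group_names = [[name for _, _, name in group] for group in groups]
```

In the "payment frequency" example, "daily", "monthly" and "yearly" fall into one group, so the three options can be made mutually exclusive.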
In this embodiment, all preset objects in a page to be processed and their positions in the page are acquired; text information within a preset range of each preset object is extracted according to the position of the preset object in the page; and a text field is created for the preset object with the text information as the name of the text field. Radio box text fields are thereby generated automatically for the PDF file, which solves the prior-art problems of inaccurate field size and position and the heavy workload of adding form fields manually.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 5 is a schematic diagram of a device for creating a text field of a PDF file according to an embodiment of the present invention, and for convenience of explanation, only the portions related to the embodiment of the present invention are shown.
The device 5 for creating the text field of the PDF file comprises:
an obtaining unit 51, configured to obtain all preset objects in a page to be processed, and obtain positions of the preset objects in the page to be processed;
the extracting unit 52 is configured to extract text information within a preset range of the preset object according to the position of the preset object in the page to be processed;
the creating unit 53 is configured to create a text field for the preset object, and use the text information as a name of the text field.
Optionally, the obtaining unit 51 includes:
the preprocessing module is used for acquiring all lines in the page to be processed, preprocessing all the lines in the page to be processed and dividing a table based on the intersection relation of the preprocessed lines;
the determining module is used for determining whether the lines divided into the same table have closed table frame lines or not;
the cell acquisition module is used for acquiring cells of the table if the lines divided into the same table have closed table frame lines;
further, the extraction unit 52 includes:
the judging module is used for judging whether the interior of the cell contains text information or not;
and the extraction module is used for extracting the text information in the cell adjacent to the cell in the page to be processed if the interior of the cell does not contain the text information.
Optionally, the obtaining unit 51 further includes:
the horizontal line acquiring module is used for, after it is determined whether the lines divided into the same table have closed table frame lines, acquiring the horizontal lines in the table that do not belong to closed table lines if the lines divided into the same table have closed table frame lines;
the horizontal lines include: horizontal path objects and runs of continuous underline characters.
Optionally, the obtaining unit 51 further includes:
the path object acquisition module is used for acquiring all path objects consisting of four segments of Bezier curves which are connected end to end in the page to be processed;
the circular arc segment judging module is used for judging whether each Bezier curve segment in the path object is a 1/4 circular arc segment;
the first definition module is used for defining the path object as a first type radio box and acquiring the first type radio box if each Bezier curve segment of the path object is a 1/4 circular arc segment;
and the discarding module is used for discarding the path object if a Bezier curve segment that is not a 1/4 circular arc segment exists in the path object.
Optionally, the obtaining unit 51 further includes:
the text object acquisition module is used for acquiring all text objects in the page to be processed;
the code value judging module is used for judging whether preset characters exist in the text object or not;
and the second definition module is used for defining the character as a second type radio box and acquiring the second type radio box if the preset character exists in the text object.
Further, the creating device 5 further includes:
and the grouping unit is used for grouping the radio boxes according to the positions of the radio boxes in the page to be processed after creating a text field for the preset object and taking the text information as the name of the text field.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61 and a computer program 62 stored in said memory 61 and executable on said processor 60. The processor 60, when executing the computer program 62, implements the steps in the above-described embodiments of the method for creating a text field of a PDF file, such as steps S101 to S103 shown in fig. 1. Alternatively, the processor 60, when executing the computer program 62, implements the functions of the modules/units in the above-described device embodiments, such as the functions of the units 51 to 53 shown in fig. 5.
Illustratively, the computer program 62 may be partitioned into one or more modules/units that are stored in the memory 61 and executed by the processor 60 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 62 in the terminal device 6. For example, the computer program 62 may be divided into an acquisition unit, an extraction unit, and a creation unit, and each unit has the following specific functions:
the acquisition unit is used for acquiring all preset objects in a page to be processed and acquiring the positions of the preset objects in the page to be processed;
the extraction unit is used for extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed;
and the creating unit is used for creating a text field for the preset object and taking the text information as the name of the text field.
Optionally, the obtaining unit includes:
the preprocessing module is used for acquiring all lines in the page to be processed, preprocessing all the lines in the page to be processed and dividing a table based on the intersection relation of the preprocessed lines;
the determining module is used for determining whether the lines divided into the same table have closed table frame lines or not;
the cell acquisition module is used for acquiring cells of the table if the lines divided into the same table have closed table frame lines;
further, the extraction unit includes:
the judging module is used for judging whether the interior of the cell contains text information or not;
and the extraction module is used for extracting the text information in the cell adjacent to the cell in the page to be processed if the interior of the cell does not contain the text information.
Optionally, the obtaining unit further includes:
the horizontal line acquiring module is used for, after it is determined whether the lines divided into the same table have closed table frame lines, acquiring the horizontal lines in the table that do not belong to closed table lines if the lines divided into the same table have closed table frame lines;
the horizontal lines include: horizontal path objects and runs of continuous underline characters.
Optionally, the obtaining unit further includes:
the path object acquisition module is used for acquiring all path objects consisting of four segments of Bezier curves which are connected end to end in the page to be processed;
the circular arc segment judging module is used for judging whether each Bezier curve segment in the path object is a 1/4 circular arc segment;
the first definition module is used for defining the path object as a first type radio box and acquiring the first type radio box if each Bezier curve segment of the path object is a 1/4 circular arc segment;
and the discarding module is used for discarding the path object if a Bezier curve segment that is not a 1/4 circular arc segment exists in the path object.
Optionally, the obtaining unit further includes:
the text object acquisition module is used for acquiring all text objects in the page to be processed;
the code value judging module is used for judging whether preset characters exist in the text object or not;
and the second definition module is used for defining the character as a second type radio box and acquiring the second type radio box if the preset character exists in the text object.
Further, the creating device further includes:
and the grouping unit is used for grouping the radio boxes according to the positions of the radio boxes in the page to be processed after creating a text field for the preset object and taking the text information as the name of the text field.
The terminal device 6 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will appreciate that fig. 6 is merely an example of the terminal device 6 and does not constitute a limitation of the terminal device 6, which may include more or fewer components than those shown, or combine some components, or have different components; for example, the terminal device may also include input/output devices, network access devices, buses, etc.
The processor 60 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing the computer program and other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.