CN108062297B - PDF file text field creating method and device and terminal equipment - Google Patents

PDF file text field creating method and device and terminal equipment

Info

Publication number
CN108062297B
Authority
CN
China
Prior art keywords
processed
page
preset
lines
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711176252.XA
Other languages
Chinese (zh)
Other versions
CN108062297A (en
Inventor
晏检平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yitu Software Co.,Ltd.
Original Assignee
Shenzhen Yitu Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yitu Software Co ltd filed Critical Shenzhen Yitu Software Co ltd
Priority to CN201711176252.XA priority Critical patent/CN108062297B/en
Publication of CN108062297A publication Critical patent/CN108062297A/en
Application granted granted Critical
Publication of CN108062297B publication Critical patent/CN108062297B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/177Editing, e.g. inserting or deleting of tables; using ruled lines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention is suitable for the technical field of electronic file processing, and provides a method, a device and a terminal device for creating a text field of a PDF file, wherein the method comprises the following steps: acquiring all preset objects in a page to be processed, and acquiring the positions of the preset objects in the page to be processed; extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed; and creating a text field for the preset object, and taking the text information as the name of the text field. The method and the device realize automatic creation of the text field for the PDF file, and solve the problems of inaccurate size and position and large workload of manual addition of the form field in the prior art.

Description

PDF file text field creating method and device and terminal equipment
Technical Field
The invention belongs to the technical field of electronic file processing, and particularly relates to a method and a device for creating a text field of a PDF (portable document format) file and terminal equipment.
Background
PDF (Portable Document Format) is an electronic file Format developed by Adobe Systems for file exchange, and the file Format can be applied to various operating Systems, so that more and more electronic books, product descriptions, company reports, network materials, e-mails, etc. start to use PDF files, and in many cases, in order to pursue file stability and compatibility, users convert Word files into PDF files and then transmit the PDF files.
If a Word file contains a form to be filled in by a user, the form can no longer be filled in after the Word file is converted into a PDF file, unless the user manually adds a corresponding form field for each area to be filled and carefully sizes and positions the fields so that they all appear in the correct place. However, manually added form fields are prone to inaccurate size and position, and the work is time-consuming, laborious and tedious; the workload becomes very large as the number of documents increases.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, and a terminal device for creating a PDF file text field, so as to solve the problems in the prior art that manually adding a form field is inaccurate in size and position, and heavy in workload.
The first aspect of the embodiments of the present invention provides a method for creating a PDF file text field, including:
acquiring all preset objects in a page to be processed, and acquiring the positions of the preset objects in the page to be processed;
extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed;
and creating a text field for the preset object, and taking the text information as the name of the text field.
A second aspect of the embodiments of the present invention provides a device for creating a PDF file text field, including:
an acquisition unit, used for acquiring all preset objects in a page to be processed and acquiring the positions of the preset objects in the page to be processed;
the extraction unit is used for extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed;
and the creating unit is used for creating a text field for the preset object and taking the text information as the name of the text field.
A third aspect of the present embodiment provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method provided in the first aspect of the present embodiment when executing the computer program.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, performs the steps of the method provided by the first aspect of embodiments of the present invention.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
all preset objects in a page to be processed are acquired, together with the positions of the preset objects in the page to be processed; text information in a preset range of the preset object is extracted according to the position of the preset object in the page to be processed; a text field is created for the preset object, and the text information is taken as the name of the text field. This solves the problems in the prior art that manually added form fields are inaccurate in size and position and that the workload is heavy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without inventive effort.
Fig. 1 is a schematic flow chart of an implementation of a method for creating a text field of a PDF file according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating an implementation of a method for acquiring a preset object in a method for creating a PDF file text field according to an embodiment of the present invention;
fig. 3 is a schematic flow chart illustrating an implementation of a method for acquiring a preset object in a method for creating a PDF file text field according to an embodiment of the present invention;
fig. 4 is a schematic flow chart illustrating an implementation process of a method for acquiring a preset object by a method for creating a PDF file text field according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a device for creating a text field of a PDF file according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Fig. 1 is a schematic flow chart of an implementation of a method for creating a text field of a PDF file according to an embodiment of the present invention, where as shown in the diagram, the method may include the following steps:
step S101, all preset objects in a page to be processed are obtained, and the positions of the preset objects in the page to be processed are obtained.
In practical application, at least one page to be processed exists; the pages to be processed can be processed one by one in page number order, or all pages to be processed can be processed simultaneously. If the pages are processed in page number order, acquiring all preset objects in the page to be processed means acquiring all preset objects in the current page to be processed; if all the pages are processed simultaneously, it means acquiring all the preset objects in each page to be processed respectively. In other words, the preset objects need to be grouped by page to be processed, and preset objects that are not in the same page to be processed cannot be processed together.
Wherein the preset object comprises any one of: cells, horizontal lines, radio boxes, and check boxes. It should be noted that the preset objects include, but are not limited to, the objects listed above, and are not specifically limited herein.
And S102, extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed.
Wherein, the preset range of the preset object can be preset manually, and includes: the interior of the preset object, directly above the preset object, directly in front of the preset object, to the left of the preset object, above the preset object, directly below the preset object, and directly behind the preset object. It should be noted that the preset range of the preset object includes, but is not limited to, the ranges listed above, and is not specifically limited herein. In practical applications, the preset range of the preset object needs to be set according to actual conditions; for example, if the preset object is a cell, the preset range can be set to the left side of the cell or the upper side of the cell. It should be noted that, in order to conform to conventional reading habits and make the PDF file easy to read, the preset range of the preset object may be determined from the viewpoint of a user facing the PDF file.
Step S103, creating a text field for the preset object, and taking the text information as the name of the text field.
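Steps S101 to S103 can be sketched as a simple loop. This is a minimal illustration only, not the patented implementation: the `Rect` layout, the `TextField` class and the `extract_text` callback are hypothetical names introduced here, and a real implementation would obtain objects and text from a PDF library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical rectangle layout: (x0, y0, x1, y1) in page coordinates.
Rect = Tuple[float, float, float, float]


@dataclass
class TextField:
    name: str   # taken from the text near the preset object (S102)
    rect: Rect  # position and size of the preset object (S101)


def create_text_fields(
    objects: List[Rect],
    extract_text: Callable[[Rect], str],
) -> List[TextField]:
    """For each preset object on one page, name a new field after nearby text."""
    fields = []
    for rect in objects:                     # S101: objects and their positions
        label = extract_text(rect).strip()   # S102: text in the preset range
        fields.append(TextField(name=label, rect=rect))  # S103: create the field
    return fields
```

The list `objects` is assumed to hold the preset objects of a single page, which matches the requirement that objects be grouped by page.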
Optionally, referring to fig. 2, if the preset object to be obtained is a cell, the obtaining all preset objects in the page to be processed includes:
step S201, acquiring all lines in the page to be processed, preprocessing all lines in the page to be processed, and dividing a table based on the intersection relationship of the preprocessed lines.
Wherein the preprocessing may comprise any one of: classifying, de-duplicating, connecting and sorting. It should be noted that the preprocessing includes, but is not limited to, the processing methods listed above, and is not specifically limited herein.
In practical application, all the obtained lines are classified, deduplicated, connected and sorted, and preparation is made for identifying the table. The table division based on the intersection relationship of the preprocessed lines may be that lines directly intersecting or indirectly connected with each other are divided into the same table, that is, the table is identified.
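The table division can be illustrated with a union-find grouping. The sketch below assumes axis-aligned line segments given as `(x0, y0, x1, y1)` tuples and a hypothetical tolerance `tol`; its preprocessing only de-duplicates and sorts, and omits classification and connection.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned


def touches(a: Segment, b: Segment, tol: float = 0.5) -> bool:
    """True if the bounding boxes of two axis-aligned segments overlap within tol."""
    ax0, ax1 = sorted((a[0], a[2]))
    ay0, ay1 = sorted((a[1], a[3]))
    bx0, bx1 = sorted((b[0], b[2]))
    by0, by1 = sorted((b[1], b[3]))
    return (ax0 - tol <= bx1 and bx0 - tol <= ax1
            and ay0 - tol <= by1 and by0 - tol <= ay1)


def group_into_tables(lines: List[Segment]) -> List[List[Segment]]:
    """Put lines that intersect directly or indirectly into the same group."""
    unique = sorted(set(lines))  # de-duplicate and sort
    parent = list(range(len(unique)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if touches(unique[i], unique[j]):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Segment]] = {}
    for i, seg in enumerate(unique):
        groups.setdefault(find(i), []).append(seg)
    return list(groups.values())
```

Each returned group is a candidate table; whether it is a valid table is decided in a later step.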
Step S202, whether the lines divided into the same table have closed table frame lines or not is determined.
In practical application, after the table is identified, it is necessary to judge whether the table is a valid table; this can be done by determining whether the lines divided into the same table have closed table frame lines. If the lines divided into the same table have closed table frame lines, the identified table is a valid table; if they do not, the identified table is an invalid table.
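One possible way to test for closed table frame lines is to check that all four edges of the group's bounding box are fully covered by lines. This is an assumed criterion for illustration; the source does not specify how closure is determined. Segments are axis-aligned `(x0, y0, x1, y1)` tuples and `tol` is a hypothetical tolerance.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned


def has_closed_frame(lines: List[Segment], tol: float = 0.5) -> bool:
    """True if lines cover every edge of the group's bounding box."""
    xs = [v for l in lines for v in (l[0], l[2])]
    ys = [v for l in lines for v in (l[1], l[3])]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    if x1 - x0 <= tol or y1 - y0 <= tol:
        return False  # a single line or a degenerate box cannot be a frame

    def covered(fixed: float, lo: float, hi: float, horizontal: bool) -> bool:
        # Collect the spans of all lines lying on the edge at coordinate `fixed`.
        spans = []
        for ax, ay, bx, by in lines:
            if horizontal and abs(ay - fixed) <= tol and abs(by - fixed) <= tol:
                spans.append(tuple(sorted((ax, bx))))
            elif not horizontal and abs(ax - fixed) <= tol and abs(bx - fixed) <= tol:
                spans.append(tuple(sorted((ay, by))))
        reach = lo
        for start, end in sorted(spans):
            if start > reach + tol:
                return False  # gap in the edge
            reach = max(reach, end)
        return reach >= hi - tol

    return (covered(y0, x0, x1, True) and covered(y1, x0, x1, True)
            and covered(x0, y0, y1, False) and covered(x1, y0, y1, False))
```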
Step S203, if the lines divided into the same table have closed table frame lines, obtaining the cells of the table.
In practical applications, after the identified table is determined to be a valid table, cells need to be found in the valid table; a cell can be found by confirming whether lines in the valid table form closed table lines. After the cells are found, the row and column span of each cell is determined, so as to determine the size of the text field. Note that closed table lines differ from closed table frame lines: the closed table frame lines are the lines that form the border of the table, whereas the closed table lines are the lines that form cells inside the table.
Further, the extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed includes:
judging whether the cell interior contains text information or not;
and if the cell does not contain text information, extracting the text information in the cell adjacent to the cell from the page to be processed.
In practical application, if a cell contains text information, no text field needs to be created for that cell; if a cell contains no text information, a text field needs to be created for that cell. Before the text field is created, the text field name of the cell needs to be determined, that is, the text information in a cell adjacent to the cell is extracted from the page to be processed. The cell adjacent to the cell may be the cell on its left side or its upper side. After the text information of the adjacent cell is extracted, a text field is created for the cell, and the extracted text information is taken as the name of the text field.
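The cell rule can be sketched as follows. The grid representation (a dictionary keyed by `(row, col)`) and the lookup order (left neighbour first, then the neighbour above) are assumptions made for illustration; the source only says the adjacent cell may be on the left or upper side.

```python
from typing import Dict, Optional, Tuple


def field_name_for_cell(
    grid: Dict[Tuple[int, int], str], row: int, col: int
) -> Optional[str]:
    """Return None if the cell already holds text (no field is needed).

    Otherwise return the text of the left neighbour, falling back to the
    neighbour above, or an empty string if neither holds text.
    """
    if grid.get((row, col), "").strip():
        return None
    for neighbour in ((row, col - 1), (row - 1, col)):
        text = grid.get(neighbour, "").strip()
        if text:
            return text
    return ""
```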
Optionally, if the preset object to be acquired is a horizontal line, after determining whether there is a closed table frame line in the lines divided into the same table, the method further includes:
and step S204, if the lines divided into the same table have closed table frame lines, acquiring horizontal lines which do not belong to the closed table lines in the table.
In practical application, it is determined whether the lines divided into the same table have closed table frame lines; if they do, the table is a valid table. A valid table may contain cells that require a text field, and may also contain horizontal lines that require a text field. If closed table lines exist in the valid table, the valid table has cells; if unclosed table lines exist in the valid table, horizontal lines may exist among the unclosed table lines. Therefore, after a valid table is identified, horizontal lines that do not belong to closed table lines can also be acquired from the valid table.
Further, the extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed includes:
determining the position right in front of or below the horizontal line according to the position of the horizontal line in the page to be processed;
extracting text information at a position right in front of or below the horizontal line;
the creating a text field for the preset object and using the text information as the name of the text field includes:
creating a text field above the position of the horizontal line in the page to be processed so that the width of the text field is equal to the length of the horizontal line;
taking the text information as the name of the text field;
the horizontal lines include: horizontal path object, continuous underlined characters.
In practice, continuous underline characters can also be regarded as horizontal lines. Acquiring all lines in the page to be processed comprises: acquiring all lines in the page to be processed and all continuous underline characters in the page to be processed. The number of consecutive underline characters that constitutes a run of continuous underline characters may be preset manually, and is not limited herein.
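A minimal sketch of the two horizontal-line ideas follows: finding runs of underline characters in extracted text, and placing a field above a line with a width equal to the line length. The minimum run length `min_len`, the field `height`, and the assumption that y grows upward (as in PDF user space) are illustrative choices, not values from the source.

```python
import re
from typing import List, Tuple


def underline_runs(text: str, min_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of runs of at least min_len underscores."""
    pattern = r"_{%d,}" % min_len
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


def field_rect_above_line(
    x0: float, x1: float, y: float, height: float = 12.0
) -> Tuple[float, float, float, float]:
    """Rectangle sitting on the line, as wide as the line (y assumed to grow upward)."""
    left, right = sorted((x0, x1))
    return (left, y, right, y + height)
```

For example, in `"Name: ____ Date: __"` only the first run qualifies with the default `min_len`.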
In addition, in practical applications, if the acquired preset object is a check box, the check box can be regarded as a small table, so the method for creating the check box text field in the PDF file can refer to the method described in steps S101-S103 and S201-S203.
For the method for creating a digital signature field of the PDF file, reference may be made to the method for creating the text field of the PDF file described in steps S101 to S103 and S201 to S204. The difference is that in the method for creating the text field the extracted text information is used directly as the name of the text field, whereas in the method for creating the digital signature field only text information containing a preset keyword is used as the name of the field. For example, in an English PDF file, the preset keyword may be set to "signature", and text information containing "signature" is used as the name of the field; in a Chinese PDF file, the preset keyword may be set to the Chinese word for "signature", and text information containing that word is used as the name of the field.
The method comprises the steps of obtaining all preset objects in a page to be processed and obtaining the positions of the preset objects in the page to be processed; extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed; creating a text field for the preset object, and taking the text information as the name of the text field; the method and the device realize automatic generation of the cell text field and the horizontal line text field for the PDF file, and solve the problems of inaccurate size and position and large workload of manual addition of the form field in the prior art.
Fig. 3 is a schematic flow chart of an implementation of a method for creating a text field of a PDF file according to another embodiment of the present invention, where as shown in the figure, if an acquired preset object is a radio box, the acquiring all preset objects in a page to be processed further includes:
step S301, obtaining all path objects composed of four segments of Bezier curves connected end to end in the page to be processed.
In practical application, the step of judging whether the path object is composed of four segments of bezier curves connected end to end may include the following steps: judging whether the number of points forming the path object is 13 or not; if the number of the points composing the path object is 13, then judge whether the starting point of the path object contains Move To mark, whether the rest points contain Bezier To mark and whether the end point contains Close Figure mark. Wherein, the Move To flag, Bezier To flag, and Close Figure flag can be instructions in the program.
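The 13-point test above can be sketched as below. The flag names are hypothetical string constants standing in for whatever point-type flags the underlying graphics library provides, and the sketch assumes the end point carries both the Bezier To flag and the Close Figure flag.

```python
from typing import List, Set, Tuple

# Hypothetical flag names; a real library would define its own constants.
MOVE_TO, BEZIER_TO, CLOSE_FIGURE = "MoveTo", "BezierTo", "CloseFigure"

PathPoint = Tuple[float, float, Set[str]]  # (x, y, flags)


def is_four_bezier_loop(points: List[PathPoint]) -> bool:
    """True if the path is one start point plus four cubic Bezier segments, closed."""
    if len(points) != 13:  # 1 start point + 4 segments * 3 points each
        return False
    if MOVE_TO not in points[0][2]:
        return False
    if any(BEZIER_TO not in p[2] for p in points[1:]):
        return False
    return CLOSE_FIGURE in points[-1][2]
```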
Step S302, determining whether each Bezier curve in the path object is a 1/4 arc segment.
Step S303, if each segment of the Bezier curve of the path object is a 1/4 arc segment, defining the path object as a first type radio box, and obtaining the first type radio box.
Further, the extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed includes:
determining the position right behind the first type radio box according to the position of the first type radio box in the page to be processed;
extracting text information at the position right behind the first type radio box;
the creating a text field for the preset object and using the text information as the name of the text field specifically include:
and creating a text field for the first type radio box, and taking the text information as the name of the text field.
And step S304, if the Bezier curve which is not the 1/4 arc segment exists in the path object, discarding the path object.
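One way to approximate the 1/4 arc test of steps S302 to S304 is to sample each cubic Bezier segment and check that the samples stay at a constant distance from a centre, with the two endpoints at right angles as seen from that centre. The criterion, the `centre` argument (for instance the mean of the four on-curve endpoints) and the tolerance `rel_tol` are assumptions; the source does not state how the test is performed.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def bezier_point(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    return (b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
            b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1])


def is_quarter_arc(seg: Sequence[Point], centre: Point, rel_tol: float = 0.01) -> bool:
    """True if the segment approximates a quarter circle around centre."""
    p0, _, _, p3 = seg
    r = math.dist(p0, centre)
    if r == 0:
        return False
    v0 = (p0[0] - centre[0], p0[1] - centre[1])
    v3 = (p3[0] - centre[0], p3[1] - centre[1])
    # Endpoints must subtend a right angle at the centre.
    if abs(v0[0] * v3[0] + v0[1] * v3[1]) > rel_tol * r * r:
        return False
    # Sampled points must stay on the circle of radius r.
    return all(
        abs(math.dist(bezier_point(*seg, t), centre) - r) <= rel_tol * r
        for t in (0.0, 0.25, 0.5, 0.75, 1.0)
    )
```

A path whose four segments all pass this test would be kept as a first type radio box; a path with any failing segment would be discarded.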
Optionally, referring to fig. 4, if the obtained preset object is a radio box, the obtaining all preset objects in the page to be processed further includes:
step S401, acquiring all text objects in a page to be processed;
step S402, judging whether preset characters exist in the text object;
step S403, if there is a preset character in the text object, defining the character as a second type radio box, and obtaining the second type radio box.
In practical applications, the preset character may be a character, identified by its Unicode or ASCII code, whose glyph is circular or has the same shape as a radio box. In other words, some characters in text objects are circular or have the same shape as a radio box, and such characters can be regarded as radio boxes. The preset characters are not limited to Unicode and ASCII characters, and are not specifically limited herein.
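A sketch of the preset-character check follows. The particular set of code points (white circle, large circle, medium white circle) is an assumption chosen for illustration; the source does not list the preset characters.

```python
from typing import FrozenSet, List, Tuple

# Assumed preset characters: U+25CB WHITE CIRCLE, U+25EF LARGE CIRCLE,
# U+26AA MEDIUM WHITE CIRCLE. A real configuration may differ.
RADIO_CHARS: FrozenSet[str] = frozenset({"\u25cb", "\u25ef", "\u26aa"})


def find_radio_chars(
    text: str, presets: FrozenSet[str] = RADIO_CHARS
) -> List[Tuple[int, str]]:
    """Return (offset, character) for every preset character in a text object."""
    return [(i, ch) for i, ch in enumerate(text) if ch in presets]
```

Each hit would then be treated as a second type radio box located at the position of that character on the page.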
Further, the extracting text information within a preset range of the preset object according to the position of the preset object in the page to be processed specifically includes:
extracting text information on the adjacent position of the second type radio box according to the position of the second type radio box in the page to be processed;
the creating a text field for the preset object and using the text information as the name of the text field specifically include:
and creating a text field for the second type radio box, and using the text information as the name of the text field.
The adjacent position of the second type radio box may be preset manually, and may include any one of the following: directly in front of, directly behind, directly above and directly below the second type radio box. It is not specifically limited herein.
Further, after creating a text field for the preset object and using the text information as a name of the text field, the method includes:
and grouping the radio boxes according to the positions of the radio boxes in the page to be processed.
In practical applications, the radio boxes are grouped to ensure that the names of the text fields of the radio boxes in the same group belong to the same category and/or that the radio options in the same group are mutually exclusive. For example: in a valid table, 3 radio boxes are arranged in sequence directly behind the position of the text information "payment frequency", and the names of the text fields of the radio boxes are "daily", "monthly" and "yearly" respectively. The 3 radio boxes can be grouped according to their positions in the page to be processed; the names of their text fields belong to the same category, namely "payment frequency", and the 3 radio boxes are mutually exclusive.
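Grouping by position can be sketched by collecting radio boxes that lie on the same row. Treating "same row" as the grouping criterion, the `(name, x, y)` tuple layout and the tolerance `row_tol` are assumptions for illustration; the source only states that grouping is based on position.

```python
from typing import List, Tuple

RadioBox = Tuple[str, float, float]  # (field name, x, y)


def group_radio_boxes(boxes: List[RadioBox], row_tol: float = 2.0) -> List[List[RadioBox]]:
    """Group boxes whose y coordinates are within row_tol of the previous box."""
    groups: List[List[RadioBox]] = []
    for box in sorted(boxes, key=lambda b: (b[2], b[1])):
        if groups and abs(groups[-1][-1][2] - box[2]) <= row_tol:
            groups[-1].append(box)
        else:
            groups.append([box])
    # Within each group, order the boxes left to right.
    return [sorted(g, key=lambda b: b[1]) for g in groups]
```

With the payment frequency example, the three boxes named "daily", "monthly" and "yearly" on one row would fall into a single mutually exclusive group.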
The method comprises the steps of obtaining all preset objects in a page to be processed and obtaining the positions of the preset objects in the page to be processed; extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed; creating a text field for the preset object, and taking the text information as the name of the text field; the method and the device realize automatic generation of the radio box text field for the PDF file, and solve the problems of inaccurate size and position and large workload of manual addition of the form field in the prior art.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 5 is a schematic diagram of a device for creating a text field of a PDF file according to an embodiment of the present invention, and for convenience of explanation, only the portions related to the embodiment of the present invention are shown.
The device 5 for creating the text field of the PDF file comprises:
an obtaining unit 51, configured to obtain all preset objects in a page to be processed, and obtain positions of the preset objects in the page to be processed;
the extracting unit 52 is configured to extract text information within a preset range of the preset object according to the position of the preset object in the page to be processed;
the creating unit 53 is configured to create a text field for the preset object, and use the text information as a name of the text field.
Optionally, the obtaining unit 51 includes:
the preprocessing module is used for acquiring all lines in the page to be processed, preprocessing all the lines in the page to be processed and dividing a table based on the intersection relation of the preprocessed lines;
the determining module is used for determining whether the lines divided into the same table have closed table frame lines or not;
the cell acquisition module is used for acquiring cells of the table if the lines divided into the same table have closed table frame lines;
further, the extraction unit 52 includes:
the judging module is used for judging whether the interior of the cell contains text information or not;
and the extraction module is used for extracting the text information in the cell adjacent to the cell in the page to be processed if the interior of the cell does not contain the text information.
Optionally, the obtaining unit 51 further includes:
the horizontal line acquiring module is used for acquiring, after determining whether the lines divided into the same table have closed table frame lines, a horizontal line which does not belong to a closed table line in the table if the lines divided into the same table have closed table frame lines;
the horizontal lines include: horizontal path object, continuous underlined characters.
Optionally, the obtaining unit 51 further includes:
the path object acquisition module is used for acquiring all path objects consisting of four segments of Bezier curves which are connected end to end in the page to be processed;
the circular arc section judging module is used for judging whether each section of Bezier curve in the path object is a 1/4 circular arc section;
the first definition module is used for defining the path object as a first type radio box and acquiring the first type radio box if each section of the Bezier curve of the path object is a 1/4 arc section;
and the discarding module is used for discarding the path object if the Bezier curve which is not the 1/4 arc segment exists in the path object.
Optionally, the obtaining unit 51 further includes:
the text object acquisition module is used for acquiring all text objects in the page to be processed;
the code value judging module is used for judging whether preset characters exist in the text object or not;
and the second definition module is used for defining the character as a second type radio box and acquiring the second type radio box if the preset character exists in the text object.
Further, the creating device 5 further includes:
and the grouping unit is used for grouping the radio boxes according to the positions of the radio boxes in the page to be processed after creating a text field for the preset object and taking the text information as the name of the text field.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in Fig. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and executable on the processor 60. The processor 60, when executing the computer program 62, implements the steps in the above-described embodiments of the method for creating a text field of a PDF file, such as steps S101 to S107 shown in Fig. 1. Alternatively, the processor 60, when executing the computer program 62, implements the functions of the modules/units in the above-mentioned device embodiments, such as the functions of units 51 to 53 shown in Fig. 5.
Illustratively, the computer program 62 may be partitioned into one or more modules/units that are stored in the memory 61 and executed by the processor 60 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 62 in the terminal device 6. For example, the computer program 62 may be divided into an acquisition unit, an extraction unit, and a creation unit, and each unit has the following specific functions:
the acquisition unit is used for acquiring all preset objects in a page to be processed and acquiring the positions of the preset objects in the page to be processed;
the extraction unit is used for extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed;
and the creating unit is used for creating a text field for the preset object and taking the text information as the name of the text field.
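The division into three units described above can be sketched as a small pipeline. This is a minimal illustration only, not the implementation of the invention: the names `PresetObject`, `TextField`, and `create_text_fields`, and the `kind` labels, are hypothetical, and the text-extraction step is passed in as a callable because its details depend on the type of preset object.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


@dataclass
class PresetObject:
    kind: str   # e.g. "cell", "line", "radio" (these labels are assumptions)
    rect: Rect  # position of the object in the page to be processed


@dataclass
class TextField:
    name: str   # text information used as the field name
    rect: Rect  # where the field is placed


def create_text_fields(
    objects: List[PresetObject],
    extract_text: Callable[[PresetObject], str],
) -> List[TextField]:
    """Create one text field per preset object, named after the nearby text."""
    fields = []
    for obj in objects:                    # output of the acquisition unit
        name = extract_text(obj).strip()   # role of the extraction unit
        fields.append(TextField(name=name, rect=obj.rect))  # creating unit
    return fields
```

Passing `extract_text` as a parameter keeps the sketch independent of how the surrounding text is located for each kind of object.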
Optionally, the obtaining unit includes:
the preprocessing module is used for acquiring all lines in the page to be processed, preprocessing all the lines in the page to be processed and dividing a table based on the intersection relation of the preprocessed lines;
the determining module is used for determining whether the lines divided into the same table have closed table frame lines or not;
the cell acquisition module is used for acquiring cells of the table if the lines divided into the same table have closed table frame lines.
Further, the extraction unit includes:
the judging module is used for judging whether the interior of the cell contains text information or not;
and the extraction module is used for extracting the text information in the cell adjacent to the cell in the page to be processed if the interior of the cell does not contain the text information.
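The table division and the extraction from an adjacent cell described above can be sketched as follows. This is a hedged illustration under several assumptions that the text does not state: lines are already preprocessed into axis-aligned segments, cells are given as rectangles with their text, the y axis grows downward, and the neighbour to the left is preferred over the neighbour above. The names `touches`, `group_lines`, and `label_for_cell` are hypothetical, and the check for a closed table frame is omitted.

```python
from typing import Dict, List, Optional, Tuple

Segment = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned
Rect = Tuple[float, float, float, float]     # (x0, y0, x1, y1), x0 < x1, y0 < y1


def touches(a: Segment, b: Segment, tol: float = 0.5) -> bool:
    """True if the bounding boxes of two segments overlap within tol."""
    ax0, ax1 = sorted((a[0], a[2]))
    ay0, ay1 = sorted((a[1], a[3]))
    bx0, bx1 = sorted((b[0], b[2]))
    by0, by1 = sorted((b[1], b[3]))
    return (ax0 - tol <= bx1 and bx0 - tol <= ax1 and
            ay0 - tol <= by1 and by0 - tol <= ay1)


def group_lines(lines: List[Segment]) -> List[List[Segment]]:
    """Union-find: lines that intersect, directly or transitively, form one table."""
    parent = list(range(len(lines)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            if touches(lines[i], lines[j]):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Segment]] = {}
    for i, seg in enumerate(lines):
        groups.setdefault(find(i), []).append(seg)
    return list(groups.values())


def label_for_cell(cell: Rect, cells: Dict[Rect, str],
                   tol: float = 0.5) -> Optional[str]:
    """For an empty cell, return the text of the cell to its left, else above."""
    if cells.get(cell, "").strip():
        return None  # the cell already contains text, so no field is needed
    x0, y0, x1, y1 = cell
    for other, text in cells.items():  # left neighbour shares the edge x = x0
        if (abs(other[2] - x0) <= tol and other[1] < y1 and other[3] > y0
                and text.strip()):
            return text.strip()
    for other, text in cells.items():  # upper neighbour shares the edge y = y0
        if (abs(other[3] - y0) <= tol and other[0] < x1 and other[2] > x0
                and text.strip()):
            return text.strip()
    return None
```

The pairwise comparison in `group_lines` is quadratic in the number of lines, which is likely acceptable for a single page but could be replaced by a sweep if pages contain very many lines.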
Optionally, the obtaining unit further includes:
the horizontal line acquiring module is used for acquiring a horizontal line which does not belong to a closed table line in the table if the lines divided into the same table have unclosed table frame lines after determining whether the lines divided into the same table have the closed table frame lines;
the horizontal lines include: horizontal path object, continuous underlined characters.
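One of the two kinds of horizontal line listed above can be illustrated with a short sketch. It assumes that "continuous underlined characters" refers to runs of consecutive underscore characters in a text object, which is an interpretation rather than something the text states; the function name `underscore_runs` and the minimum run length are likewise hypothetical.

```python
import re
from typing import List, Tuple


def underscore_runs(text: str, min_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for runs of at least min_len underscores."""
    pattern = re.compile(r"_{%d,}" % min_len)
    return [m.span() for m in pattern.finditer(text)]
```

For example, in `"Name: ______ Age: __ City: ____"` the two-character run after `Age:` is ignored with the default threshold, while the other two runs are reported as candidate fill-in lines.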
Optionally, the obtaining unit further includes:
the path object acquisition module is used for acquiring all path objects consisting of four segments of Bezier curves which are connected end to end in the page to be processed;
the circular arc segment judging module is used for judging whether each Bezier curve in the path object is a 1/4 circular arc segment;
the first definition module is used for defining the path object as a first type radio box and acquiring the first type radio box if each Bezier curve of the path object is a 1/4 circular arc segment;
and the discarding module is used for discarding the path object if the Bezier curve which is not the 1/4 arc segment exists in the path object.
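The 1/4 arc test described above can be sketched numerically. The text does not say how the judgment is made, so the approach here is an assumption: the centre is taken as the mean of the four segment start points, each cubic Bezier curve is sampled at a few parameter values, and a segment is accepted when every sample lies on the circle within a tolerance and its end points subtend about 90 degrees. The names `bezier_point`, `is_quarter_arc`, and `is_radio_box` and the 1% tolerance are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Cubic = Tuple[Point, Point, Point, Point]  # P0, P1, P2, P3


def bezier_point(seg: Cubic, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    w = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    return (sum(wi * p[0] for wi, p in zip(w, seg)),
            sum(wi * p[1] for wi, p in zip(w, seg)))


def is_quarter_arc(seg: Cubic, center: Point, radius: float,
                   tol: float = 0.01) -> bool:
    """True if seg stays on the circle and its end points subtend about 90 degrees."""
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        x, y = bezier_point(seg, t)
        if abs(math.hypot(x - center[0], y - center[1]) - radius) > tol * radius:
            return False
    ax, ay = seg[0][0] - center[0], seg[0][1] - center[1]
    bx, by = seg[3][0] - center[0], seg[3][1] - center[1]
    return abs(ax * bx + ay * by) <= tol * radius * radius  # perpendicular radii


def is_radio_box(path: Sequence[Cubic], tol: float = 0.01) -> bool:
    """True if path is four end-to-end cubic segments that are all quarter arcs."""
    if len(path) != 4:
        return False
    for i, seg in enumerate(path):  # each end point must meet the next start point
        nxt = path[(i + 1) % 4][0]
        if math.hypot(seg[3][0] - nxt[0], seg[3][1] - nxt[1]) > 1e-6:
            return False
    cx = sum(seg[0][0] for seg in path) / 4.0
    cy = sum(seg[0][1] for seg in path) / 4.0
    radius = math.hypot(path[0][0][0] - cx, path[0][0][1] - cy)
    if radius == 0:
        return False  # degenerate path, discard
    return all(is_quarter_arc(seg, (cx, cy), radius, tol) for seg in path)
```

A path that fails the check, for example an ellipse or a rounded rectangle drawn with four curves, would be discarded rather than treated as a first type radio box.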
Optionally, the obtaining unit further includes:
the text object acquisition module is used for acquiring all text objects in the page to be processed;
the code value judging module is used for judging whether preset characters exist in the text object or not;
and the second definition module is used for defining the character as a second type radio box and acquiring the second type radio box if the preset character exists in the text object.
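The search for preset characters can be sketched as a simple scan over the text objects. The text only says "preset characters", so the glyph set below is an assumption chosen for illustration, as are the name `find_radio_chars` and the representation of text objects as plain strings.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Assumed glyph set; the source does not list the preset characters.
PRESET_CHARS: FrozenSet[str] = frozenset("○◯□☐")


def find_radio_chars(
    text_objects: Sequence[str],
    preset: FrozenSet[str] = PRESET_CHARS,
) -> List[Tuple[int, int, str]]:
    """Return (object index, character index, character) for each preset character."""
    hits = []
    for obj_index, text in enumerate(text_objects):
        for char_index, ch in enumerate(text):
            if ch in preset:  # this character would become a second type radio box
                hits.append((obj_index, char_index, ch))
    return hits
```

In a real PDF the position of each hit on the page would come from the glyph's placement within the text object, which this sketch leaves out.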
Further, the creating device further includes:
and the grouping unit is used for grouping the radio boxes according to the positions of the radio boxes in the page to be processed after creating a text field for the preset object and taking the text information as the name of the text field.
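The grouping of radio boxes by position can be sketched as follows. The text does not give the grouping rule, so this sketch assumes that radio boxes whose centres lie on roughly the same horizontal row form one group; the name `group_radio_boxes` and the row tolerance are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # centre (x, y) of a radio box on the page


def group_radio_boxes(centers: List[Point], row_tol: float = 3.0) -> List[List[Point]]:
    """Group radio boxes whose centres lie on roughly the same horizontal row."""
    groups: List[List[Point]] = []
    for c in sorted(centers, key=lambda p: (p[1], p[0])):  # scan rows in y order
        if groups and abs(groups[-1][-1][1] - c[1]) <= row_tol:
            groups[-1].append(c)   # same row as the previous box
        else:
            groups.append([c])     # start a new row
    return [sorted(g) for g in groups]  # left to right within each row
```

Grouping matters because radio buttons in the same group are mutually exclusive, so boxes that answer different questions should not end up in one group.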
The terminal device 6 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, a processor 60 and a memory 61. Those skilled in the art will appreciate that Fig. 6 is merely an example of the terminal device 6 and does not limit it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device may also include input/output devices, network access devices, buses, and the like.
The processor 60 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing the computer program and other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. The above-described apparatus/terminal device embodiments are merely illustrative. For example, the division into modules or units is only a logical division, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (8)

1. A method for creating a text field of a PDF (portable document format) file is characterized by comprising the following steps:
acquiring all preset objects in a page to be processed, and acquiring the positions of the preset objects in the page to be processed;
extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed;
creating a text field for the preset object, and taking the text information as the name of the text field;
the acquiring all preset objects in the page to be processed comprises:
acquiring all lines in the page to be processed, preprocessing all the lines in the page to be processed, and dividing a table based on the intersection relation of the preprocessed lines;
determining whether the lines divided into the same table have closed table frame lines or not;
if the lines divided into the same table have closed table frame lines, obtaining the cells of the table;
the extracting of the text information within the preset range of the preset object according to the position of the preset object in the page to be processed includes:
judging whether the cell interior contains text information or not;
and if the cell does not contain text information, extracting the text information in the cell adjacent to the cell from the page to be processed.
2. The method for creating the text field of the PDF file according to claim 1, wherein after determining whether the lines divided into the same table have closed table frame lines, the method further comprises:
if the lines divided into the same table have closed table frame lines, acquiring horizontal lines which do not belong to the closed table lines in the table;
the horizontal lines include: horizontal path object, continuous underlined characters.
3. The method for creating the text field of the PDF file according to claim 1, wherein the acquiring all the preset objects in the page to be processed further comprises:
acquiring all path objects consisting of four segments of Bezier curves connected end to end in the page to be processed;
judging whether each Bezier curve in the path object is a 1/4 circular arc segment;
if each Bezier curve of the path object is a 1/4 circular arc segment, defining the path object as a first type radio box and acquiring the first type radio box;
and if the Bezier curve which is not the 1/4 circular arc segment exists in the path object, discarding the path object.
4. The method for creating the text field of the PDF file according to claim 1, wherein the acquiring all the preset objects in the page to be processed further comprises:
acquiring all text objects in a page to be processed;
judging whether preset characters exist in the text object or not;
and if the preset characters exist in the text object, defining the characters as a second type radio box, and acquiring the second type radio box.
5. The method for creating the text field of the PDF file as claimed in claim 3 or 4, wherein after creating the text field for the preset object and using the text information as the name of the text field, the method comprises:
and grouping the radio boxes according to the positions of the radio boxes in the page to be processed.
6. An apparatus for creating a text field of a PDF file, comprising:
the acquisition unit is used for acquiring all preset objects in a page to be processed and acquiring the positions of the preset objects in the page to be processed;
the extraction unit is used for extracting text information in a preset range of the preset object according to the position of the preset object in the page to be processed;
the creating unit is used for creating a text field for the preset object and taking the text information as the name of the text field;
the acquisition unit includes:
the preprocessing module is used for acquiring all lines in the page to be processed, preprocessing all the lines in the page to be processed and dividing a table based on the intersection relation of the preprocessed lines;
the determining module is used for determining whether the lines divided into the same table have closed table frame lines or not;
the obtaining module is used for obtaining the cells of the table if the lines divided into the same table have closed table frame lines;
the extraction unit includes:
the judging module is used for judging whether the interior of the cell contains text information or not;
and the extraction module is used for extracting the text information in the cell adjacent to the cell in the page to be processed if the interior of the cell does not contain the text information.
7. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN201711176252.XA 2017-11-22 2017-11-22 PDF file text field creating method and device and terminal equipment Active CN108062297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711176252.XA CN108062297B (en) 2017-11-22 2017-11-22 PDF file text field creating method and device and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711176252.XA CN108062297B (en) 2017-11-22 2017-11-22 PDF file text field creating method and device and terminal equipment

Publications (2)

Publication Number Publication Date
CN108062297A CN108062297A (en) 2018-05-22
CN108062297B true CN108062297B (en) 2021-06-15

Family

ID=62134998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711176252.XA Active CN108062297B (en) 2017-11-22 2017-11-22 PDF file text field creating method and device and terminal equipment

Country Status (1)

Country Link
CN (1) CN108062297B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method
CN104462160A (en) * 2013-09-25 2015-03-25 北大方正集团有限公司 Method and system for editing formula
CN105988996A (en) * 2015-01-27 2016-10-05 腾讯科技(深圳)有限公司 Index file generation method and device
CN107291919A (en) * 2017-06-28 2017-10-24 四川妥妥递科技有限公司 A kind of system and method for add fields online in pdf document

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003014867A2 (en) * 2001-08-03 2003-02-20 John Allen Ananian Personalized interactive digital catalog profiling
US20140195347A1 (en) * 2013-01-08 2014-07-10 American Express Travel Related Services Company, Inc. Method, system, and computer program product for business designation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Logical Labeling of Fixed Layout PDF Documents Using Multiple Contexts";Xin Tao 等;《2014 11th IAPR International Workshop on Document Analysis Systems》;20141231;第1-4页 *
边巴次仁 等." 用Acrobat制作PDF文档格式的科室报告***".《西藏科技》.2010,(第4期), *

Also Published As

Publication number Publication date
CN108062297A (en) 2018-05-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 850000 No.2, floor 6, unit 2, building 8, east of Liuwu building, west of East Ring Road, north of 1-4 Road, south of 1-3 Road, east of Liuwu building, Lhasa City, Tibet Autonomous Region

Applicant after: Wanxing Technology Group Co.,Ltd.

Address before: 850000 No.2, floor 6, unit 2, building 8, east of Liuwu building, west of East Ring Road, north of 1-4 Road, south of 1-3 Road, east of Liuwu building, Lhasa City, Tibet Autonomous Region

Applicant before: WONDERSHARE TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20210415

Address after: 518000 a1204, building 11, Shenzhen Bay science and technology ecological park, No.16, Keji South Road, high tech community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Yitu Software Co.,Ltd.

Address before: 850000 No.2, floor 6, unit 2, building 8, east of Liuwu building, west of East Ring Road, north of 1-4 Road, south of 1-3 Road, east of Liuwu building, Lhasa City, Tibet Autonomous Region

Applicant before: Wanxing Technology Group Co.,Ltd.

GR01 Patent grant