CN110377885A - Method, apparatus, device and computer storage medium for converting a PDF file - Google Patents

Method, apparatus, device and computer storage medium for converting a PDF file

Info

Publication number
CN110377885A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910515512.4A
Other languages
Chinese (zh)
Other versions
CN110377885B (en
Inventor
郝学峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910515512.4A
Publication of CN110377885A
Application granted
Publication of CN110377885B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/1475 Inclination or skew detection or correction of characters or of image to be recognised
    • G06V30/1478 Inclination or skew detection or correction of characters or of image to be recognised of characters or characters lines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Input (AREA)

Abstract

The present invention provides a method, apparatus, device and computer storage medium for converting a PDF file. The method includes: obtaining a PDF file to be converted and obtaining a parsing result of the PDF file, the parsing result containing the text content of each text line in the PDF file and the paragraph relationships between the text lines; performing optical character recognition (OCR) on the PDF file to obtain the text content and attribute information of each text block in the PDF file, the attribute information of a text block including the width of the text block; taking each text block in turn as the current text block and, according to the width of the current text block, obtaining a judgment result indicating whether the current text block and the next text block belong to the same paragraph; and determining the correspondence between the text lines and the text blocks according to text content, correcting the paragraph relationships between the text lines in the PDF file using the judgment results, and outputting the corrected parsing result as the conversion result of the PDF file. The present invention can accurately restore the text paragraphs in a PDF file.

Description

Method, apparatus, device and computer storage medium for converting a PDF file
[Technical Field]
The present invention relates to the field of computer technology, and in particular to a method, apparatus, device and computer storage medium for converting a PDF file.
[Background]
When the prior art converts a PDF file, for example into another file of a preset format (such as HTML), line breaks are placed according to the visual text lines in the PDF file rather than according to its actual text paragraphs. The resulting conversion therefore restores paragraph breaks poorly. In some applications, such as knowledge extraction from PDF files, a conversion result that does not restore the text paragraphs correctly greatly reduces the accuracy of the extraction. A conversion method that can accurately restore the text paragraphs in a PDF file is therefore urgently needed.
[Summary of the Invention]
In view of this, the present invention provides a method, apparatus, device and computer storage medium for converting a PDF file, which are used to accurately restore the text paragraphs in the PDF file.
The technical solution adopted by the present invention to solve the technical problem is to provide a method for converting a PDF file, the method including: obtaining a PDF file to be converted and obtaining a parsing result of the PDF file, the parsing result containing the text content of each text line in the PDF file and the paragraph relationships between the text lines; performing optical character recognition on the PDF file to obtain the text content and attribute information of each text block in the PDF file, the attribute information of a text block including the width of the text block; taking each text block in turn as the current text block and, according to the width of the current text block, obtaining a judgment result indicating whether the current text block and the next text block belong to the same paragraph; and determining the correspondence between the text lines and the text blocks according to text content, correcting the paragraph relationships between the text lines in the PDF file using the judgment results, and outputting the corrected parsing result as the conversion result of the PDF file.
According to a preferred embodiment of the present invention, the attribute information of the text block further includes: the height of the text block, the left spacing of the text block and the top spacing of the text block.
According to a preferred embodiment of the present invention, obtaining, according to the width of the current text block, the judgment result indicating whether the current text block and the next text block belong to the same paragraph includes: obtaining the document width of each page in the PDF file; determining, according to the width of the current text block and the document width of the page on which the current text block is located, whether the current text block reaches the end of the line; and if the current text block reaches the end of the line, obtaining a judgment result that the current text block and the next text block belong to the same paragraph, and otherwise obtaining a judgment result that they do not belong to the same paragraph.
According to a preferred embodiment of the present invention, obtaining, according to the width of the current text block, the judgment result indicating whether the current text block and the next text block belong to the same paragraph may alternatively include: determining whether the width of the current text block exceeds a preset threshold; if so, obtaining a judgment result that the current text block and the next text block belong to the same paragraph, and otherwise obtaining a judgment result that they do not belong to the same paragraph.
According to a preferred embodiment of the present invention, correcting the paragraph relationships between the text lines in the PDF file using the judgment results includes: adjusting any paragraph relationship between text lines in the PDF file that is inconsistent with the judgment results so that it is consistent with the judgment results.
According to a preferred embodiment of the present invention, after the paragraph relationships between the text lines in the PDF file are corrected using the judgment results, the method further includes: determining, according to the widths of the text blocks, the text blocks located in a table; taking the number of determined text blocks whose top spacing meets a preset condition as the number of columns of the table, and taking the number of determined text blocks whose left spacing meets a preset condition as the number of rows of the table; and restoring the table according to the determined numbers of rows and columns, and filling the text content of each text block into the corresponding position of the restored table.
The technical solution adopted by the present invention to solve the technical problem is also to provide an apparatus for converting a PDF file, the apparatus including: an acquiring unit, configured to obtain a PDF file to be converted and obtain a parsing result of the PDF file, the parsing result containing the text content of each text line in the PDF file and the paragraph relationships between the text lines; a recognition unit, configured to perform optical character recognition on the PDF file to obtain the text content and attribute information of each text block in the PDF file, the attribute information of a text block including the width of the text block; a judging unit, configured to take each text block in turn as the current text block and, according to the width of the current text block, obtain a judgment result indicating whether the current text block and the next text block belong to the same paragraph; and an output unit, configured to determine the correspondence between the text lines and the text blocks according to text content, correct the paragraph relationships between the text lines in the PDF file using the judgment results, and output the corrected parsing result as the conversion result of the PDF file.
According to a preferred embodiment of the present invention, the attribute information of the text block further includes: the height of the text block, the left spacing of the text block and the top spacing of the text block.
According to a preferred embodiment of the present invention, when obtaining, according to the width of the current text block, the judgment result indicating whether the current text block and the next text block belong to the same paragraph, the judging unit specifically: obtains the document width of each page in the PDF file; determines, according to the width of the current text block and the document width of the page on which the current text block is located, whether the current text block reaches the end of the line; and if the current text block reaches the end of the line, obtains a judgment result that the current text block and the next text block belong to the same paragraph, and otherwise obtains a judgment result that they do not belong to the same paragraph.
According to a preferred embodiment of the present invention, when obtaining, according to the width of the current text block, the judgment result indicating whether the current text block and the next text block belong to the same paragraph, the judging unit may alternatively: determine whether the width of the current text block exceeds a preset threshold; if so, obtain a judgment result that the current text block and the next text block belong to the same paragraph, and otherwise obtain a judgment result that they do not belong to the same paragraph.
According to a preferred embodiment of the present invention, when correcting the paragraph relationships between the text lines in the PDF file using the judgment results, the output unit specifically: adjusts any paragraph relationship between text lines in the PDF file that is inconsistent with the judgment results so that it is consistent with the judgment results.
According to a preferred embodiment of the present invention, after correcting the paragraph relationships between the text lines in the PDF file using the judgment results, the output unit further: determines, according to the widths of the text blocks, the text blocks located in a table; takes the number of determined text blocks whose top spacing meets a preset condition as the number of columns of the table, and takes the number of determined text blocks whose left spacing meets a preset condition as the number of rows of the table; and restores the table according to the determined numbers of rows and columns, and fills the text content of each text block into the corresponding position of the restored table.
As can be seen from the above technical solutions, the present invention obtains both a parsing result and an OCR result of the PDF file, derives from the OCR result judgment results indicating whether adjacent text blocks in the PDF file belong to the same paragraph, and then uses those judgment results to correct the paragraph relationships between the text lines in the parsing result. This avoids the incorrect paragraph relationships that arise when line breaks are placed according to the visual text lines of the PDF file, so the text paragraphs in the PDF file are restored accurately.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of a method for converting a PDF file provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the OCR result for a page of a PDF file provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of an apparatus for converting a PDF file provided by an embodiment of the present invention;
Fig. 4 is a block diagram of a computer system/server provided by an embodiment of the present invention.
[Detailed Description]
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a method for converting a PDF file provided by an embodiment of the present invention. As shown in Fig. 1, the method includes:
In 101, a PDF file to be converted is obtained, and a parsing result of the PDF file is obtained, the parsing result containing the text content of each text line in the PDF file and the paragraph relationships between the text lines.
In this step, the PDF file to be converted is obtained first and is then parsed to obtain a parsing result containing the text content of each text line in the PDF file and the paragraph relationships between the text lines. The PDF file to be converted may be one that the user selects locally on the terminal, or one that the user selects from the Internet.
This step uses an existing PDF parsing technique to obtain the parsing result. For example, the PDF file may be parsed with the poppler-utils software or with the pdf.js plug-in. The details are not repeated here.
In 102, optical character recognition is performed on the PDF file to obtain the text content and attribute information of each text block in the PDF file, the attribute information of a text block including the width of the text block.
In this step, optical character recognition (OCR) is performed on the PDF file obtained in step 101 to obtain the text content and attribute information of each text block in the PDF file. The attribute information obtained in this step includes the width of the text block, and may further include the height, the left spacing and the top spacing of the text block.
It can be understood that each text block obtained in this step corresponds to one row of text in the PDF file. The text content of each text block is therefore the text content of the corresponding row, and the attribute information of each text block is the attribute information of that row.
This step uses an existing OCR technique to obtain the text content and attribute information of each text block in the PDF file. Fig. 2 is a schematic diagram of the OCR result for a page of a PDF file provided by an embodiment of the present invention. As shown in Fig. 2, each text block on the page corresponds to one row of text on the page, and the recognition result contains the text content and attribute information of each text block. For example, the text content of the first text block is "specific embodiment", and in its attribute information the width is 169, the height is 34, the left spacing is 138 and the top spacing is 24.
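The OCR result described above can be pictured as a list of records, one per text block. The following is a minimal sketch, not the format of any particular OCR engine: the class name `TextBlock`, its field names, the `page` field and the `right` helper are all assumptions introduced for illustration. Only the four attribute values come from the Fig. 2 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """One OCR text block, corresponding to one row of text on a page."""
    text: str    # recognised text content
    width: int   # width of the block
    height: int  # height of the block
    left: int    # left spacing: distance from the left edge of the page
    top: int     # top spacing: distance from the top edge of the page
    page: int = 0  # hypothetical: index of the page the block was found on

    @property
    def right(self) -> int:
        """Hypothetical convenience: x-coordinate of the block's right edge."""
        return self.left + self.width


# The first text block of the Fig. 2 example.
first_block = TextBlock(text="specific embodiment", width=169, height=34, left=138, top=24)
```

Keeping the record immutable makes it safe to share the same blocks between the paragraph judgment and the later table restoration.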
In 103, each text block is taken in turn as the current text block, and according to the width of the current text block, a judgment result is obtained indicating whether the current text block and the next text block belong to the same paragraph.
In this step, based on the text blocks of the PDF file obtained in step 102 and their widths, each text block in the PDF file is taken in turn as the current text block, and the width of the current text block is used to obtain a judgment result indicating whether the current text block and the next text block belong to the same paragraph.
Specifically, the judgment result may be obtained as follows: the document width of each page in the PDF file is obtained, for example with an existing Java function; whether the current text block reaches the end of the line is determined according to the width of the current text block and the document width of the page on which the current text block is located; if the current text block reaches the end of the line, a judgment result that the current text block and the next text block belong to the same paragraph is obtained, and otherwise a judgment result that they do not belong to the same paragraph is obtained.
Whether the current text block reaches the end of the line may be determined as follows: it is determined whether the difference between the width of the current text block and the document width of the page on which the current text block is located is less than or equal to a preset threshold; if so, the current text block is determined to reach the end of the line, and otherwise it is determined not to reach the end of the line.
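The end-of-line test can be sketched in a few lines. This is an illustrative reading of the description, not the patent's implementation: the function names, the parameter `max_gap` (standing in for the preset threshold) and the sample numbers are assumptions. In practice the threshold would presumably also have to absorb the page margins.

```python
def reaches_line_end(block_width: float, page_width: float, max_gap: float) -> bool:
    """Return True when the block is judged to reach the end of its line.

    The block reaches the end of the line when the page width exceeds the
    block width by no more than max_gap (the preset threshold).
    """
    return (page_width - block_width) <= max_gap


def same_paragraph_as_next(block_width: float, page_width: float, max_gap: float) -> bool:
    """A block that reaches the end of its line is judged to continue into the next block."""
    return reaches_line_end(block_width, page_width, max_gap)


# Hypothetical numbers: a page 800 units wide and a threshold of 150 units.
full_row = same_paragraph_as_next(block_width=700, page_width=800, max_gap=150)
short_row = same_paragraph_as_next(block_width=320, page_width=800, max_gap=150)
```

Here `full_row` stands for a row that runs to the right edge and so continues in the next block, while `short_row` stands for the short last row of a paragraph.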
Alternatively, the judgment result may be obtained as follows: it is determined whether the width of the current text block exceeds a preset threshold; if so, a judgment result that the current text block and the next text block belong to the same paragraph is obtained, and otherwise a judgment result that they do not belong to the same paragraph is obtained.
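The width-threshold variant needs no page width. The sketch below applies it to a whole page of blocks and, as an illustration of what the judgment results mean, groups the rows into paragraphs. The function names, `min_width` (the preset threshold) and the sample widths are assumptions; the grouping helper is not described in the text and is included only to make the results concrete.

```python
from typing import List


def continuation_flags(widths: List[float], min_width: float) -> List[bool]:
    """flags[i] is True when block i and block i + 1 are judged to share a paragraph.

    The last block has no next block, so the result has len(widths) - 1 entries.
    """
    return [width > min_width for width in widths[:-1]]


def group_paragraphs(texts: List[str], flags: List[bool]) -> List[List[str]]:
    """Illustrative helper: merge consecutive rows according to the flags."""
    paragraphs = [[texts[0]]] if texts else []
    for text, joined_to_previous in zip(texts[1:], flags):
        if joined_to_previous:
            paragraphs[-1].append(text)
        else:
            paragraphs.append([text])
    return paragraphs


# Hypothetical page: three full or short rows followed by a new paragraph.
rows = ["row 1", "row 2", "row 3", "row 4"]
flags = continuation_flags([700, 705, 320, 690], min_width=600)
paragraphs = group_paragraphs(rows, flags)
```

With these sample widths the third row is short, so the paragraph ends there and the fourth row starts a new one.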
In 104, the correspondence between the text lines and the text blocks is determined according to text content, the paragraph relationships between the text lines in the PDF file are corrected using the judgment results, and the corrected parsing result is output as the conversion result of the PDF file.
In this step, the correspondence between the text lines and the text blocks is first determined from the text content of each text line obtained in step 101 and the text content of each text block obtained in step 102. The judgment results obtained in step 103 are then used to correct the paragraph relationships between the text lines obtained in step 101, and the corrected parsing result is output as the conversion result of the PDF file.
That is, based on the judgment results obtained in step 103, any paragraph relationship between text lines obtained in step 101 that is inconsistent with the judgment results is treated as erroneous and is adjusted to be consistent with the judgment results, so that the output conversion result reflects the paragraph relationships between the text lines in the PDF file more accurately.
For example, suppose the text lines obtained in step 101 are text line 1, text line 2 and text line 3, and the text blocks obtained in step 102 are text block 1, text block 2 and text block 3. If text block 1 has the same text content as text line 1, text block 2 the same as text line 2, and text block 3 the same as text line 3, then text block 1 is determined to correspond to text line 1, text block 2 to text line 2, and text block 3 to text line 3. Suppose further that the judgment results are that text block 1 and text block 2 belong to the same paragraph while text block 2 and text block 3 do not, and that the paragraph relationships obtained in step 101 are that text line 1 and text line 2 belong to the same paragraph and that text line 2 and text line 3 belong to the same paragraph. The paragraph relationship between text line 2 and text line 3 is then corrected to "not in the same paragraph", and the corrected parsing result is output as the conversion result of the PDF file.
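The matching and correction in step 104 can be sketched as follows, under simplifying assumptions that the text does not state: text lines and text blocks are matched by exact string equality, every text is unique, and paragraph relationships are stored as one boolean per adjacent pair. The function name and parameter names are hypothetical. Real OCR output would likely need a more tolerant comparison.

```python
from typing import List


def correct_relations(
    line_texts: List[str],
    parsed_flags: List[bool],
    block_texts: List[str],
    ocr_flags: List[bool],
) -> List[bool]:
    """Return the parsed paragraph relations, overridden by the OCR judgments.

    parsed_flags[i] says whether text lines i and i + 1 share a paragraph;
    ocr_flags[j] says the same for text blocks j and j + 1.
    """
    # Correspondence by identical text content (assumes unique texts).
    block_index = {text: index for index, text in enumerate(block_texts)}
    corrected = list(parsed_flags)
    for i in range(len(line_texts) - 1):
        current = block_index.get(line_texts[i])
        following = block_index.get(line_texts[i + 1])
        # Only override when both lines map to two consecutive blocks.
        if current is not None and following == current + 1:
            corrected[i] = ocr_flags[current]
    return corrected


# The three-line example from the text: the parser joins all three lines,
# while the OCR judgment separates line 2 from line 3.
corrected = correct_relations(
    line_texts=["text 1", "text 2", "text 3"],
    parsed_flags=[True, True],
    block_texts=["text 1", "text 2", "text 3"],
    ocr_flags=[True, False],
)
```

Pairs of lines that cannot be matched to consecutive blocks keep their parsed relationship, which is one possible design choice rather than something the text prescribes.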
In addition, when the prior art parses a PDF file to obtain a parsing result, it cannot effectively identify tables in the PDF file, so PDF files containing tables are restored poorly.
Therefore, after correcting the paragraph relationships between the text lines in the PDF file, this step further includes the following: determining, according to the widths of the text blocks, the text blocks located in a table, for example by treating consecutively obtained text blocks whose widths are all less than a preset threshold as text blocks located in a table; taking the number of determined text blocks whose top spacing meets a preset condition as the number of columns of the table, where text blocks whose top spacing meets the preset condition are text blocks located in the same row of the table; taking the number of determined text blocks whose left spacing meets a preset condition as the number of rows of the table, where text blocks whose left spacing meets the preset condition are text blocks located in the same column of the table; and restoring the table according to the determined numbers of rows and columns, and filling the text content of each text block into the corresponding position of the restored table.
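The first part of this procedure, finding the text blocks that lie in a table, can be sketched as a scan for runs of consecutive narrow blocks. The function name, `max_cell_width` (the preset threshold) and `min_run` are assumptions; in particular, requiring at least two consecutive narrow blocks is a choice made here so that a single short row, such as the last row of a paragraph, is not mistaken for a table.

```python
from typing import List, Tuple


def find_table_runs(
    widths: List[float], max_cell_width: float, min_run: int = 2
) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, of consecutive narrow blocks."""
    runs = []
    start = None
    for index, width in enumerate(widths):
        if width < max_cell_width:
            if start is None:
                start = index  # a new run of narrow blocks begins
        else:
            if start is not None and index - start >= min_run:
                runs.append((start, index))
            start = None
    # Close a run that extends to the last block.
    if start is not None and len(widths) - start >= min_run:
        runs.append((start, len(widths)))
    return runs


# Hypothetical widths: four table cells between body rows, then one short row.
table_runs = find_table_runs([700, 120, 130, 110, 125, 690, 100, 705], max_cell_width=200)
```

Each returned range identifies the blocks that would be passed on to the row and column counting.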
The preset condition on the top spacing may be that the difference between the top spacings of the text blocks is less than or equal to a preset threshold; the preset condition on the left spacing may be that the difference between the left spacings of the text blocks is less than or equal to a preset threshold.
For example, suppose the text blocks determined to be located in a table are text block 1, text block 2, text block 3 and text block 4, where text block 1 has a left spacing of 70 and a top spacing of 477, text block 2 has a left spacing of 497 and a top spacing of 478, text block 3 has a left spacing of 70 and a top spacing of 525, and text block 4 has a left spacing of 497 and a top spacing of 524. If the preset threshold is 2, it can be determined that text block 1 and text block 2 are in the same row, text block 3 and text block 4 are in the same row, text block 1 and text block 3 are in the same column, and text block 2 and text block 4 are in the same column. The table is therefore determined to have 2 rows and 2 columns.
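The worked example can be reproduced with a small sketch that groups top spacings into rows and left spacings into columns, then fills a grid. The function names and the tuple layout `(text, left, top)` are assumptions, and the cell texts "A" to "D" are placeholders. The sketch sizes the grid by the number of row and column groups, which for a complete grid equals the per-row and per-column block counts described in the text.

```python
from typing import List, Tuple


def cluster(values: List[float], tolerance: float) -> List[List[float]]:
    """Group sorted values whose neighbours differ by at most tolerance."""
    clusters: List[List[float]] = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def restore_table(blocks: List[Tuple[str, float, float]], tolerance: float) -> List[List[str]]:
    """Place each (text, left, top) block into a row/column grid."""
    row_groups = cluster([top for _, _, top in blocks], tolerance)
    col_groups = cluster([left for _, left, _ in blocks], tolerance)
    grid = [[""] * len(col_groups) for _ in row_groups]
    for text, left, top in blocks:
        row = next(i for i, group in enumerate(row_groups) if group[0] <= top <= group[-1])
        col = next(i for i, group in enumerate(col_groups) if group[0] <= left <= group[-1])
        grid[row][col] = text
    return grid


# The four blocks of the example, with placeholder cell texts.
table = restore_table(
    [("A", 70, 477), ("B", 497, 478), ("C", 70, 525), ("D", 497, 524)],
    tolerance=2,
)
```

Sorting before grouping means the result does not depend on the order in which the OCR step returns the blocks.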
Fig. 3 is a kind of structure drawing of device for conversion pdf document that one embodiment of the invention provides, as shown in Figure 3, described Device includes: acquiring unit 31, recognition unit 32, judging unit 33 and output unit 34.
Acquiring unit 31 for obtaining pdf document to be converted, and obtains the parsing result of the pdf document, the solution Analyse the paragraph relationship between the content of text and each line of text comprising each line of text in the pdf document in result.
Acquiring unit 31 obtains pdf document to be converted first, then parses to the pdf document, includes to obtain The parsing result of paragraph relationship in the pdf document between the content of text and each line of text of each line of text.Wherein, it obtains single 31 pdf documents that can select user from terminal local of member, can also be by user from internet as pdf document to be converted The middle pdf document selected is as pdf document to be converted.
Wherein, acquiring unit 31 obtains the parsing result of pdf document using existing pdf document analytic technique.For example, Acquiring unit 31 can be used poppler-utils software and parse to pdf document, can also use pdf.js plug-in unit pair Pdf document parses etc..The present invention is to this without repeating.
Recognition unit 32, for carrying out optical character identification to the pdf document, to obtain each text in the pdf document The content of text and attribute information of this block include the width of text block in the attribute information of the text block.
Recognition unit 32 carries out optical character identification (Optical to pdf document acquired in acquiring unit 31 again Character Recognition, OCR), to obtain the content of text and attribute information of each text block in the pdf document. Wherein, include the width of text block in the attribute information of text block acquired in recognition unit 32, text can also be further included The upper spacing of the height of this block, the left spacing of text block and text block.
It is understood that each text block acquired in recognition unit 32 is corresponding with each row text in pdf document, because The content of text of this each text block is the content of text of each row text in pdf document, and the attribute information of each text block is PDF The attribute information of each row text in file.
Wherein, recognition unit 32 obtains the content of text of each text block in pdf document using existing OCR identification technology And attribute information
Judging unit 33 is used for successively using each text block as current text block, and according to the width of the current text block Degree, obtains the current text block and whether next text block belongs to the judgement result of same paragraph.
The width of each text block and each text block of the judging unit 33 in the pdf document according to acquired in recognition unit 32 Degree, it is current to obtain successively using each text block in pdf document as current text block, and according to the width of current text block Whether text block and next text block belong to the judgement result of same paragraph.
Specifically, the judging unit 33 may obtain this determination result as follows: obtain the document width of each page in the PDF file; determine, from the width of the current text block and the document width of the page containing it, whether the current text block reaches the end of the line; if it does, determine that the current text block and the next text block belong to the same paragraph; otherwise, determine that they do not belong to the same paragraph.
To determine whether the current text block reaches the end of the line, the judging unit 33 may check whether the difference between the width of the current text block and the document width of the page containing it is less than or equal to a preset threshold. If so, the current text block reaches the end of the line; otherwise, it does not.
Alternatively, the judging unit 33 may determine whether the width of the current text block exceeds a preset threshold. If so, the current text block and the next text block belong to the same paragraph; otherwise, they do not.
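The two width-based checks can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names, the use of an absolute difference, and the choice to emit one flag per adjacent pair of blocks (so the last block gets no flag) are all assumptions made for the example.

```python
from typing import List


def reaches_line_end(block_width: float, page_width: float, threshold: float) -> bool:
    """True if the block is (almost) as wide as the page, i.e. the line is full."""
    # Assumption: "difference" is taken as an absolute value.
    return abs(page_width - block_width) <= threshold


def same_paragraph_by_line_end(widths: List[float], page_width: float,
                               threshold: float) -> List[bool]:
    """Flag i says whether block i and block i + 1 belong to the same paragraph."""
    return [reaches_line_end(w, page_width, threshold) for w in widths[:-1]]


def same_paragraph_by_min_width(widths: List[float], min_width: float) -> List[bool]:
    """Alternative check: a block wider than a preset threshold continues the paragraph."""
    return [w > min_width for w in widths[:-1]]


if __name__ == "__main__":
    widths = [500, 498, 200, 500, 120]  # hypothetical OCR block widths
    print(same_paragraph_by_line_end(widths, page_width=500, threshold=10))
    print(same_paragraph_by_min_width(widths, min_width=450))
```

With the hypothetical widths shown, both checks give `[True, True, False, True]`: the third block is short, so it ends a paragraph.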
The output unit 34 is configured to determine the correspondence between the lines of text and the text blocks according to text content, correct the paragraph relationships between the lines of text in the PDF file using the determination results, and output the corrected parsing result as the conversion result of the PDF file.
The output unit 34 first determines the correspondence between the lines of text and the text blocks from the text content of each line of text obtained by the acquiring unit 31 and the text content of each text block obtained by the recognition unit 32. It then uses the determination results obtained by the judging unit 33 to correct the paragraph relationships between the lines of text in the PDF file obtained by the acquiring unit 31, and outputs the corrected parsing result as the conversion result of the PDF file.
That is, according to the determination results obtained by the judging unit 33, the output unit 34 treats any paragraph relationship between lines of text obtained by the acquiring unit 31 that is inconsistent with the determination results as an erroneous paragraph relationship, and adjusts it to the paragraph relationship given by the determination results, so that the output conversion result more accurately reflects the paragraph relationships between the lines of text in the PDF file.
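As a rough illustration of this correction step, the sketch below matches parser lines to OCR text blocks by whitespace-normalized text and overrides any parser flag that disagrees with the OCR-based determination. The function names, the list-of-flags representation of paragraph relationships, and the exact-match rule are assumptions for the example; the source does not specify how the correspondence is computed, and duplicate line texts are not handled here.

```python
from typing import List


def _norm(text: str) -> str:
    """Remove all whitespace so parser and OCR text compare more reliably."""
    return "".join(text.split())


def correct_paragraph_flags(lines: List[str], parsed_same: List[bool],
                            blocks: List[str], ocr_same: List[bool]) -> List[bool]:
    """parsed_same[i] / ocr_same[j]: does item i (or j) share a paragraph with the next one?"""
    block_index = {_norm(b): j for j, b in enumerate(blocks)}
    corrected = list(parsed_same)
    for i in range(len(lines) - 1):
        j = block_index.get(_norm(lines[i]))
        k = block_index.get(_norm(lines[i + 1]))
        # Only correct when both lines map to consecutive OCR blocks.
        if j is not None and k == j + 1 and corrected[i] != ocr_same[j]:
            corrected[i] = ocr_same[j]
    return corrected


def merge_lines(lines: List[str], same_flags: List[bool], sep: str = " ") -> List[str]:
    """Join lines into paragraphs according to the (corrected) flags."""
    if not lines:
        return []
    paragraphs = [lines[0]]
    for line, same in zip(lines[1:], same_flags):
        if same:
            paragraphs[-1] += sep + line
        else:
            paragraphs.append(line)
    return paragraphs


if __name__ == "__main__":
    lines = ["The quick brown fox jumps", "over the lazy dog.", "A new paragraph."]
    parsed = [False, False]   # parser mistook the visual line break for a paragraph break
    ocr = [True, False]       # width-based determination from the OCR blocks
    fixed = correct_paragraph_flags(lines, parsed, lines, ocr)
    print(merge_lines(lines, fixed))
```

For text without inter-word spaces, such as Chinese, `sep=""` would be the more natural choice.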
In addition, when parsing a PDF file to obtain a parsing result, the prior art cannot effectively identify the tables in the PDF file, so it performs poorly when restoring PDF files that contain tables.
Therefore, after correcting the paragraph relationships between the lines of text in the PDF file, the output unit 34 further performs the following: determining the text blocks located in a table according to the widths of the text blocks; taking the number of the determined text blocks whose top spacing satisfies a preset condition as the number of columns of the table, where text blocks whose top spacing satisfies the preset condition are text blocks located in the same row of the table; taking the number of the determined text blocks whose left spacing satisfies a preset condition as the number of rows of the table, where text blocks whose left spacing satisfies the preset condition are text blocks located in the same column of the table; and restoring the table according to the determined numbers of rows and columns and filling the text content of each text block into the corresponding position in the restored table.
The preset condition on the top spacing may be that the difference between the top spacings of the text blocks is less than or equal to a preset threshold; the preset condition on the left spacing may be that the difference between the left spacings of the text blocks is less than or equal to a preset threshold.
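The row and column counting can be sketched as below. This is a simplified example, not the patented procedure: the `Block` type, the use of the first block as the reference for both spacings, the `tol` parameter, and the nearest-cluster placement rule are assumptions. It also assumes the blocks passed in have already been identified as table cells, since the source does not detail the width-based selection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    left: float  # left spacing of the block
    top: float   # top spacing of the block


def _cluster(values: List[float], tol: float) -> List[float]:
    """Collapse sorted values into representatives at least tol apart."""
    centers: List[float] = []
    for v in sorted(values):
        if not centers or v - centers[-1] > tol:
            centers.append(v)
    return centers


def restore_table(blocks: List[Block], tol: float = 2.0) -> List[List[str]]:
    if not blocks:
        return []
    ref = blocks[0]
    # Blocks sharing the reference top spacing lie in one row: their count is the column count.
    n_cols = sum(1 for b in blocks if abs(b.top - ref.top) <= tol)
    # Blocks sharing the reference left spacing lie in one column: their count is the row count.
    n_rows = sum(1 for b in blocks if abs(b.left - ref.left) <= tol)
    row_tops = _cluster([b.top for b in blocks], tol)
    col_lefts = _cluster([b.left for b in blocks], tol)
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for b in blocks:
        r = min(range(len(row_tops)), key=lambda i: abs(b.top - row_tops[i]))
        c = min(range(len(col_lefts)), key=lambda i: abs(b.left - col_lefts[i]))
        if r < n_rows and c < n_cols:  # ignore blocks that fall outside the counted grid
            table[r][c] = b.text
    return table


if __name__ == "__main__":
    cells = [
        Block("Name", 10, 100), Block("Age", 110, 100.5), Block("City", 210, 100),
        Block("Ann", 10, 130), Block("30", 110.4, 130), Block("Oslo", 210, 129.8),
    ]
    for row in restore_table(cells):
        print(row)
```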
As shown in Fig. 4, the computer system/server 012 takes the form of a general-purpose computing device. The components of the computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 connecting the different system components (including the system memory 028 and the processing unit 016).
The bus 018 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus structures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 012 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer system/server 012, including volatile and non-volatile media, and removable and non-removable media.
The system memory 028 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 034 can be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 4, commonly called a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical medium) may be provided. In these cases, each drive can be connected to the bus 018 through one or more data media interfaces. The memory 028 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 040 having a set of (at least one) program modules 042 may be stored, for example, in the memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 042 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer system/server 012 may also communicate with one or more external devices 014 (such as a keyboard, a pointing device, a display 024, etc.); in the present invention, the computer system/server 012 communicates with an external radar device. It may also communicate with one or more devices that enable a user to interact with the computer system/server 012, and/or with any device (such as a network card, a modem, etc.) that enables the computer system/server 012 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 022. Furthermore, the computer system/server 012 can communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 020. As shown in the figure, the network adapter 020 communicates with the other modules of the computer system/server 012 through the bus 018. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 016 runs the programs stored in the system memory 028 to execute various functional applications and data processing, for example to implement the method flow provided by the embodiments of the present invention.
With the passage of time and the development of technology, the meaning of "medium" has become increasingly broad, and the distribution of computer programs is no longer limited to tangible media; programs can also, for example, be downloaded directly from a network. Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, the parsing result and the OCR recognition result of a PDF file are obtained; determination results indicating whether text blocks in the PDF file belong to the same paragraph are then obtained from the OCR recognition result; and the paragraph relationships between the lines of text in the parsing result of the PDF file are corrected according to those determination results. This avoids errors in the paragraph relationships between text that are caused by the visual line breaks of the lines of text in the PDF file, so that the text paragraphs in the PDF file are restored accurately.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (14)

1. A method for converting a PDF file, characterized in that the method comprises:
obtaining a PDF file to be converted and obtaining a parsing result of the PDF file, the parsing result including the text content of each line of text in the PDF file and the paragraph relationships between the lines of text;
performing optical character recognition on the PDF file to obtain the text content and attribute information of each text block in the PDF file, the attribute information of a text block including the width of the text block;
taking each text block in turn as a current text block and, according to the width of the current text block, obtaining a determination result indicating whether the current text block and the next text block belong to the same paragraph;
determining the correspondence between the lines of text and the text blocks according to text content, correcting the paragraph relationships between the lines of text in the PDF file using the determination results, and outputting the corrected parsing result as the conversion result of the PDF file.
2. The method according to claim 1, characterized in that the attribute information of the text block further includes: the height of the text block, the left spacing of the text block, and the top spacing of the text block.
3. The method according to claim 1, characterized in that obtaining, according to the width of the current text block, the determination result indicating whether the current text block and the next text block belong to the same paragraph comprises:
obtaining the document width of each page in the PDF file;
determining, according to the width of the current text block and the document width of the page containing the current text block, whether the current text block reaches the end of the line;
if the current text block reaches the end of the line, obtaining a determination result that the current text block and the next text block belong to the same paragraph, and otherwise obtaining a determination result that the current text block and the next text block do not belong to the same paragraph.
4. The method according to claim 1, characterized in that obtaining, according to the width of the current text block, the determination result indicating whether the current text block and the next text block belong to the same paragraph may also comprise:
determining whether the width of the current text block exceeds a preset threshold; if so, obtaining a determination result that the current text block and the next text block belong to the same paragraph, and otherwise obtaining a determination result that the current text block and the next text block do not belong to the same paragraph.
5. The method according to claim 1, characterized in that correcting the paragraph relationships between the lines of text in the PDF file using the determination results comprises:
adjusting any paragraph relationship between the lines of text in the PDF file that is inconsistent with the determination results to the paragraph relationship consistent with the determination results.
6. The method according to claim 1, characterized in that, after correcting the paragraph relationships between the lines of text in the PDF file using the determination results, the method further comprises:
determining the text blocks located in a table according to the widths of the text blocks;
taking the number of the determined text blocks whose top spacing satisfies a preset condition as the number of columns of the table, and taking the number of the determined text blocks whose left spacing satisfies a preset condition as the number of rows of the table;
restoring the table according to the determined numbers of rows and columns, and filling the text content of each text block into the corresponding position in the restored table.
7. A device for converting a PDF file, characterized in that the device comprises:
an acquiring unit, configured to obtain a PDF file to be converted and obtain a parsing result of the PDF file, the parsing result including the text content of each line of text in the PDF file and the paragraph relationships between the lines of text;
a recognition unit, configured to perform optical character recognition on the PDF file to obtain the text content and attribute information of each text block in the PDF file, the attribute information of a text block including the width of the text block;
a judging unit, configured to take each text block in turn as a current text block and, according to the width of the current text block, obtain a determination result indicating whether the current text block and the next text block belong to the same paragraph;
an output unit, configured to determine the correspondence between the lines of text and the text blocks according to text content, correct the paragraph relationships between the lines of text in the PDF file using the determination results, and output the corrected parsing result as the conversion result of the PDF file.
8. The device according to claim 7, characterized in that the attribute information of the text block further includes: the height of the text block, the left spacing of the text block, and the top spacing of the text block.
9. The device according to claim 7, characterized in that, when obtaining, according to the width of the current text block, the determination result indicating whether the current text block and the next text block belong to the same paragraph, the judging unit specifically performs:
obtaining the document width of each page in the PDF file;
determining, according to the width of the current text block and the document width of the page containing the current text block, whether the current text block reaches the end of the line;
if the current text block reaches the end of the line, obtaining a determination result that the current text block and the next text block belong to the same paragraph, and otherwise obtaining a determination result that the current text block and the next text block do not belong to the same paragraph.
10. The device according to claim 7, characterized in that, when obtaining, according to the width of the current text block, the determination result indicating whether the current text block and the next text block belong to the same paragraph, the judging unit specifically performs:
determining whether the width of the current text block exceeds a preset threshold; if so, obtaining a determination result that the current text block and the next text block belong to the same paragraph, and otherwise obtaining a determination result that the current text block and the next text block do not belong to the same paragraph.
11. The device according to claim 7, characterized in that, when correcting the paragraph relationships between the lines of text in the PDF file using the determination results, the output unit specifically performs:
adjusting any paragraph relationship between the lines of text in the PDF file that is inconsistent with the determination results to the paragraph relationship consistent with the determination results.
12. The device according to claim 8, characterized in that, after correcting the paragraph relationships between the lines of text in the PDF file using the determination results, the output unit further performs:
determining the text blocks located in a table according to the widths of the text blocks;
taking the number of the determined text blocks whose top spacing satisfies a preset condition as the number of columns of the table, and taking the number of the determined text blocks whose left spacing satisfies a preset condition as the number of rows of the table;
restoring the table according to the determined numbers of rows and columns, and filling the text content of each text block into the corresponding position in the restored table.
13. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1 to 6.
14. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN201910515512.4A 2019-06-14 2019-06-14 Method, device, equipment and computer storage medium for converting PDF file Active CN110377885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910515512.4A CN110377885B (en) 2019-06-14 2019-06-14 Method, device, equipment and computer storage medium for converting PDF file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910515512.4A CN110377885B (en) 2019-06-14 2019-06-14 Method, device, equipment and computer storage medium for converting PDF file

Publications (2)

Publication Number Publication Date
CN110377885A true CN110377885A (en) 2019-10-25
CN110377885B CN110377885B (en) 2023-09-26

Family

ID=68250299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910515512.4A Active CN110377885B (en) 2019-06-14 2019-06-14 Method, device, equipment and computer storage medium for converting PDF file

Country Status (1)

Country Link
CN (1) CN110377885B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112269872A (en) * 2020-10-19 2021-01-26 北京希瑞亚斯科技有限公司 Resume analysis method and device, electronic equipment and computer storage medium
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN117710997A (en) * 2023-12-18 2024-03-15 合肥大智慧财汇数据科技有限公司 Method, equipment and storage medium for restoring wireless form in PDF (portable document format) file

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1467682A (en) * 2002-06-28 2004-01-14 富士通株式会社 Apparatus and method of analyzing layout of document, and computer product
CN1949210A (en) * 2006-11-03 2007-04-18 上海中标软件有限公司 Method of realizing Tibetan automatically rule composing of in computer files
US20090030671A1 (en) * 2007-07-27 2009-01-29 Electronics And Telecommunications Research Institute Machine translation method for PDF file
CN101667204A (en) * 2008-09-02 2010-03-10 财团法人工业技术研究院 Intelligent patent supervising and cautioning system and method
CN102103587A (en) * 2009-12-17 2011-06-22 北大方正集团有限公司 Method and device for converting form
US8156018B1 (en) * 2006-12-18 2012-04-10 Intuit Inc. Transformation of standard document format electronic documents for electronic filing
JP2012243121A (en) * 2011-05-20 2012-12-10 Sharp Corp Data creation device, data creation program, recording medium and data creation method
CN103959282A (en) * 2011-09-28 2014-07-30 谷歌公司 Selective feedback for text recognition systems
US20140281939A1 (en) * 2013-03-13 2014-09-18 Adobe Systems Inc. Method and apparatus for identifying logical blocks of text in a document
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method
CN104134057A (en) * 2009-01-28 2014-11-05 谷歌公司 Selective display of OCR'ed text and corresponding images from publications on a client device
US20150199314A1 (en) * 2010-10-26 2015-07-16 Google Inc. Editing Application For Synthesized eBooks
CN106326854A (en) * 2016-08-19 2017-01-11 掌阅科技股份有限公司 Open fixed-layout document paragraph identification method
US20170076169A1 (en) * 2011-10-17 2017-03-16 Sharp Laboratories of America (SLA), Inc. System and Method for Scanned Document Correction
CN106980607A (en) * 2017-03-31 2017-07-25 掌阅科技股份有限公司 Paragraph recognition methods, device and terminal device
CN107832676A (en) * 2017-10-16 2018-03-23 平安科技(深圳)有限公司 Form data line feed recognition methods, electronic equipment and computer-readable recording medium
CN108763176A (en) * 2018-04-10 2018-11-06 达而观信息科技(上海)有限公司 A kind of document processing method and device
CN108845993A (en) * 2018-06-06 2018-11-20 中国科学技术信息研究所 Interpretation method, device and the terminal device of text information
CN109492199A (en) * 2018-10-17 2019-03-19 四川译讯信息科技有限公司 A kind of pdf document conversion method judged in advance based on OCR
CN109635120A (en) * 2018-10-30 2019-04-16 百度在线网络技术(北京)有限公司 Construction method, device and the storage medium of knowledge mapping
CN109783810A (en) * 2018-12-26 2019-05-21 北京明略软件***有限公司 A kind of text handling method, device and computer readable storage medium

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1467682A (en) * 2002-06-28 2004-01-14 富士通株式会社 Apparatus and method of analyzing layout of document, and computer product
CN1949210A (en) * 2006-11-03 2007-04-18 上海中标软件有限公司 Method of realizing Tibetan automatically rule composing of in computer files
US8156018B1 (en) * 2006-12-18 2012-04-10 Intuit Inc. Transformation of standard document format electronic documents for electronic filing
US20090030671A1 (en) * 2007-07-27 2009-01-29 Electronics And Telecommunications Research Institute Machine translation method for PDF file
CN101667204A (en) * 2008-09-02 2010-03-10 财团法人工业技术研究院 Intelligent patent supervising and cautioning system and method
CN104134057A (en) * 2009-01-28 2014-11-05 谷歌公司 Selective display of OCR'ed text and corresponding images from publications on a client device
CN102103587A (en) * 2009-12-17 2011-06-22 北大方正集团有限公司 Method and device for converting form
US20150199314A1 (en) * 2010-10-26 2015-07-16 Google Inc. Editing Application For Synthesized eBooks
JP2012243121A (en) * 2011-05-20 2012-12-10 Sharp Corp Data creation device, data creation program, recording medium and data creation method
CN103959282A (en) * 2011-09-28 2014-07-30 谷歌公司 Selective feedback for text recognition systems
US20170076169A1 (en) * 2011-10-17 2017-03-16 Sharp Laboratories of America (SLA), Inc. System and Method for Scanned Document Correction
US20140281939A1 (en) * 2013-03-13 2014-09-18 Adobe Systems Inc. Method and apparatus for identifying logical blocks of text in a document
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method
CN106326854A (en) * 2016-08-19 2017-01-11 掌阅科技股份有限公司 Open fixed-layout document paragraph identification method
CN106980607A (en) * 2017-03-31 2017-07-25 掌阅科技股份有限公司 Paragraph recognition methods, device and terminal device
CN107832676A (en) * 2017-10-16 2018-03-23 平安科技(深圳)有限公司 Form data line feed recognition methods, electronic equipment and computer-readable recording medium
WO2019075970A1 (en) * 2017-10-16 2019-04-25 平安科技(深圳)有限公司 Line wrap recognition method for table information, electronic device, and computer-readable storage medium
CN108763176A (en) * 2018-04-10 2018-11-06 达而观信息科技(上海)有限公司 A kind of document processing method and device
CN108845993A (en) * 2018-06-06 2018-11-20 中国科学技术信息研究所 Interpretation method, device and the terminal device of text information
CN109492199A (en) * 2018-10-17 2019-03-19 四川译讯信息科技有限公司 A kind of pdf document conversion method judged in advance based on OCR
CN109635120A (en) * 2018-10-30 2019-04-16 百度在线网络技术(北京)有限公司 Construction method, device and the storage medium of knowledge mapping
CN109783810A (en) * 2018-12-26 2019-05-21 北京明略软件***有限公司 A kind of text handling method, device and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AMRHEIN: "Supervised OCR Error Detection and Correction Using Statistical and Neural Machine Translation Methods", 《JOURNAL FOR LANGUAGE TECHNOLOGY AND COMPUTATIONAL LINGUISTICS》, vol. 33, no. 1, pages 49 - 76 *
卢玲: "数字内容跨终端出版技术研究", 《传媒》, no. 05, pages 59 - 60 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112269872A (en) * 2020-10-19 2021-01-26 北京希瑞亚斯科技有限公司 Resume analysis method and device, electronic equipment and computer storage medium
CN112269872B (en) * 2020-10-19 2023-12-19 北京希瑞亚斯科技有限公司 Resume analysis method and device, electronic equipment and computer storage medium
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN112861821B (en) * 2021-04-06 2024-04-19 刘羽 Map data reduction method based on PDF file analysis
CN117710997A (en) * 2023-12-18 2024-03-15 合肥大智慧财汇数据科技有限公司 Method, equipment and storage medium for restoring wireless form in PDF (portable document format) file
CN117710997B (en) * 2023-12-18 2024-06-14 合肥大智慧财汇数据科技有限公司 Method, equipment and storage medium for restoring wireless form in PDF (portable document format) file

Also Published As

Publication number Publication date
CN110377885B (en) 2023-09-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant