CN109871516A

CN109871516A - A method for batch-generating Word documents from double-layer PDFs

Info

Publication number: CN109871516A
Application number: CN201711245886.6A
Authority: CN
Inventors: 陈伟; 曹勇; 殷绪成; 王旭
Original assignee: JIANGSU ABEYOND OUTSOURCING CO Ltd
Current assignee: JIANGSU ABEYOND OUTSOURCING CO Ltd
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2019-06-11

Abstract

The invention discloses a method for batch-generating Word documents from double-layer PDFs and relates to the technical field of data entry. The page picture is fragmented: it is cut using OCR technology and the coordinates of each fragment are recorded. The fragments are transcribed on a transcription platform, and the entry results are compared, verified, and spot-checked. The recorded fragment coordinates are processed and converted into a data table, which simplifies further operations on the data. The entry results are matched one by one with the coordinates of the corresponding fragments to obtain complete data. Rules and algorithms are then applied to restore the position and typesetting of the text from the entry results and coordinates and to generate a Word file. The method supports batch processing of large volumes of data, is more efficient, positions text more accurately, and solves the inefficiency of manual copy-and-paste and typesetting. The conversion method is practical and easy to operate and learn.

Description

A method for batch-generating Word documents from double-layer PDFs

Technical field

The present invention relates to a method for batch-generating Word documents from double-layer PDFs, and belongs to the technical field of data entry.

Background art

In the Internet information age, many traditional client application technologies, such as customer relationship management and office management systems, have been moved onto the Internet, and most of them adopt a software-as-a-service design pattern.

At present, electronic documents such as PowerPoint, Word, TXT, and PDF files are usually browsed by installing document-reading software on the user's computer and opening the file in that software. In addition, some free document-sharing websites provide online reading: the document does not need to be downloaded and is read directly in the browser, which is very convenient and has changed earlier ways of operating and reading.

However, some document-sharing websites present most documents as PDF. PDF files are not well suited to extracting key information, so they are usually converted into Word documents before further work.

A double-layer PDF file is one whose content includes both a text layer and an image layer, with the positions of the two layers corresponding one to one. Existing PDF converters require manual operation to convert a PDF file into a Word document, and so cannot solve the prior-art problem of low efficiency when large volumes of documents are converted by hand.

Summary of the invention

The purpose of the present invention is to provide a method for batch-generating Word documents from double-layer PDFs that supports batch processing of large volumes of data with higher efficiency, so as to solve the problems described in the background art above.

To achieve the above object, the invention provides the following technical solution: a method for batch-generating Word documents from double-layer PDFs, comprising the following steps:

Step 1: fragment the picture by cutting it using OCR technology, and record the fragment coordinates;

Step 2: transcribe the fragments on a transcription platform, and compare, verify, and spot-check the entry results;

Step 3: process the recorded fragment coordinates and convert them into a data table that stores the data, which simplifies further operations on the data;

Step 4: match the entry results one by one with the coordinates of the corresponding fragments to obtain complete data;

The data table storing the entry results is matched exactly against the data table obtained in step 3, yielding a new data table that contains both the text and the coordinate information;

Step 5: restore the position of the text by rules and algorithms, and generate the Word file;

Based on the new data table obtained in step 4, the text is sorted and processed algorithmically so that it follows the order of the text in the original manuscript. Algorithms and rules applied to the coordinates restore the data of each column in the original manuscript; finally, the fragment picture names are used to restore the data belonging to each manuscript page, so that Word files can be generated quickly in batches.
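The comparison and spot check in step 2 can be illustrated with a minimal Python sketch. It assumes, hypothetically, that two operators key each fragment independently and that their results are held in dictionaries keyed by fragment name. The function names `compare_entries` and `sample_for_inspection` and the 10% sampling rate are illustrative and are not taken from the source.

```python
import random

def compare_entries(first, second):
    """Split fragments into agreed results and names that need review."""
    agreed, disputed = {}, []
    for name in sorted(first):
        if first[name] == second.get(name):
            agreed[name] = first[name]
        else:
            disputed.append(name)
    return agreed, disputed

def sample_for_inspection(agreed, rate=0.1, seed=0):
    """Pick a reproducible random subset of agreed fragments to spot-check."""
    names = sorted(agreed)
    count = max(1, round(len(names) * rate)) if names else 0
    return random.Random(seed).sample(names, count)

# Hypothetical entry results from two operators (assumed data).
first = {"p1_001": "天", "p1_002": "地", "p1_003": "玄"}
second = {"p1_001": "天", "p1_002": "他", "p1_003": "玄"}

agreed, disputed = compare_entries(first, second)
spot_check = sample_for_inspection(agreed)
```

Fragments on which the two entries disagree would go back for verification, while a sample of the agreed fragments is inspected to estimate the residual error rate.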

Preferably, the OCR technique in step 1 is specifically: first locate the text region, then identify the number of rows and columns of text and determine the rectangular block containing each character; then, with manual intervention, adjust the size and position of the rectangular blocks to obtain more accurate character rectangles; finally, cut the picture into individual fragment pictures.
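As a rough illustration of the rectangle step, the sketch below divides a located text region into a grid and then applies one manual correction. The even grid, the pixel coordinate convention `(left, top, right, bottom)`, and the names `grid_rectangles` and `adjust` are assumptions made for this example; the source only states that rows and columns are identified and that rectangles are adjusted by hand.

```python
def grid_rectangles(region, rows, cols):
    """Divide a text region into rows x cols character rectangles.

    region is (left, top, right, bottom) in pixels. An evenly spaced grid
    is an assumption used as the starting estimate before manual adjustment.
    """
    left, top, right, bottom = region
    cell_w = (right - left) / cols
    cell_h = (bottom - top) / rows
    rects = []
    for r in range(rows):
        for c in range(cols):
            rects.append((
                round(left + c * cell_w),
                round(top + r * cell_h),
                round(left + (c + 1) * cell_w),
                round(top + (r + 1) * cell_h),
            ))
    return rects

def adjust(rect, dx=0, dy=0, dw=0, dh=0):
    """Apply a manual correction: shift by (dx, dy), resize by (dw, dh)."""
    left, top, right, bottom = rect
    return (left + dx, top + dy, right + dx + dw, bottom + dy + dh)

# Hypothetical region with 4 rows and 2 columns of characters.
rects = grid_rectangles((100, 50, 500, 850), rows=4, cols=2)
rects[0] = adjust(rects[0], dx=-5, dw=10)  # operator nudges the first box
```

Each adjusted rectangle could then be cropped out of the page picture to produce one fragment picture. With Pillow, for instance, `Image.crop` accepts a box in this same (left, upper, right, lower) order.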

Preferably, the data table in step 3 is generated as follows: the TXT file recording the coordinate information is read by code, the information read is copied and pasted into Excel, and a series of operations such as column splitting and replacement yields a data table of the main information.
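The column splitting and replacement described here can also be done in code rather than by pasting into Excel. The following sketch is one possible approach, not the method of the source: the line format `<fragment>.jpg: x1,y1,x2,y2` is a hypothetical layout of the TXT file, and `parse_coordinates` is an illustrative name.

```python
import csv
import io

def parse_coordinates(text):
    """Turn raw coordinate lines into rows of a data table.

    Assumed line format (hypothetical): "<fragment>.jpg: x1,y1,x2,y2"
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # "Replacement": drop the extension and normalise the separator.
        line = line.replace(".jpg", "").replace(": ", ",")
        # "Column splitting": break the line into its fields.
        name, x1, y1, x2, y2 = line.split(",")
        rows.append({"fragment": name, "x1": int(x1), "y1": int(y1),
                     "x2": int(x2), "y2": int(y2)})
    return rows

# Hypothetical contents of the coordinate TXT file.
raw = "p1_001.jpg: 95,50,305,250\n\np1_002.jpg: 300,50,500,250\n"
table = parse_coordinates(raw)

# Write the table as CSV, a format that Excel can open directly.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["fragment", "x1", "y1", "x2", "y2"])
writer.writeheader()
writer.writerows(table)
csv_text = buffer.getvalue()
```

Keeping the table as a list of dictionaries makes the later matching and sorting steps straightforward, and the CSV output preserves the option of inspecting the data in a spreadsheet.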

Preferably, the text is sorted in step 5 as follows:

Step 5-1: sort by fragment name and by horizontal and vertical coordinates to obtain the order of all individual characters.

Step 5-2: separate the data by fragment name, and convert the single column of data into complete data line by line.

Step 5-3: finally, under code control, generate multiple Word files in a batch, with one large original picture corresponding to one Word file.
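Steps 5-1 to 5-3 can be sketched as a sort followed by grouping. The example below rests on several assumptions that the source does not state: the page is the part of the fragment name before the first underscore, text is read in horizontal lines from left to right, and characters whose vertical positions differ by no more than `line_tolerance` pixels belong to the same line. The name `rebuild_pages` is illustrative.

```python
from itertools import groupby

def rebuild_pages(records, line_tolerance=10):
    """Return {page: [line, ...]} from per-character records.

    Each record is (fragment_name, x, y, text). The page is assumed to be
    the part of fragment_name before the first underscore.
    """
    def page_of(rec):
        return rec[0].split("_")[0]

    pages = {}
    # Step 5-1: order all characters by page, then by y, then by x.
    ordered = sorted(records, key=lambda r: (page_of(r), r[2], r[1]))
    # Step 5-2: separate by page and fold the single column into lines.
    for page, group in groupby(ordered, key=page_of):
        lines, current, last_y = [], [], None
        for _, x, y, text in group:
            # Start a new line when the vertical position jumps.
            if last_y is not None and y - last_y > line_tolerance:
                lines.append(current)
                current = []
            current.append((x, text))
            last_y = y
        if current:
            lines.append(current)
        # Within a line, order characters by horizontal position.
        pages[page] = ["".join(t for _, t in sorted(line)) for line in lines]
    return pages

# Hypothetical matched records (fragment name, x, y, text).
records = [
    ("p1_003", 100, 252, "玄"),
    ("p1_001", 100, 50, "天"),
    ("p1_002", 300, 52, "地"),
    ("p1_004", 300, 250, "黄"),
    ("p2_001", 100, 50, "宇"),
]
pages = rebuild_pages(records)
```

For step 5-3, each entry in `pages` would be written to its own Word file, for example with a library such as python-docx. For vertically set text, the roles of x and y in the sort and grouping would be swapped.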

Compared with the prior art, the beneficial effects of the present invention are:

1. The position and typesetting of the text are restored from the entry results and the coordinate information to generate Word files, and large volumes of data can be processed in batches; efficiency is higher, positioning is more accurate, and the inefficiency of manual copy-and-paste and typesetting is solved.

2. The text is reproduced quickly and accurately, and large volumes of data are processed in batches with high efficiency and high precision.

3. The conversion method is practical: cutting is accurate, the restored text positions are also quite accurate, and the method is easy to operate and learn.

4. System performance is stable, maintenance is easy, and applicability is high, so the method can be applied widely.

5. The method is general and accessible, and is easy to learn and adopt.

Brief description of the drawings

Fig. 1 is a schematic diagram of the structure of a PDF containing pictures of ancient books in an embodiment of the present invention;

Fig. 2 is a schematic diagram of converting the PDF containing pictures of ancient books into a Word document in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Embodiment: referring to Figs. 1 and 2, a PDF containing pictures of ancient-book text is converted into a Word document. The specific steps are:

(1) Fragment the picture by cutting it using OCR technology, and record the fragment coordinates:

Because the character spacing in ancient books is narrow and the characters are often uncommon, popular OCR software on the market generally recognizes ancient books poorly. OCR is therefore used to locate the text region and then to identify the number of rows and columns of text and determine the rectangular block containing each character; then, with manual intervention, the size and position of the rectangular blocks are adjusted to obtain more accurate character rectangles; finally, the picture is cut into individual fragment pictures.

(2) Transcribe the fragments on the transcription platform, and compare, verify, and spot-check the entry results;

(3) Process the recorded fragment coordinates and convert them into a data table that stores the data, which simplifies further operations on the data:

First call the code that reads TXT files to obtain the TXT file; then open the TXT file, select all, copy, and paste into Excel; finally, filter and process the data and keep the useful data as the data table;

(4) Match the entry results one by one with the coordinates of the corresponding fragments to obtain complete data:

Using the data table obtained in step (3) as the database, match it one by one against the entry results by database lookup;

(5) Restore the position of the text by rules and algorithms, and generate the Word file.

The fragment coordinates are obtained from the cutting, and the text keyed in by the online entry personnel is stored at the corresponding positions. Because the character spacing in ancient books is relatively narrow and there are many rare characters, OCR cannot achieve 100% recognition. Through sorting and algorithmic processing, the typesetting of the original manuscript is restored and the Word file is generated quickly.
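The database lookup in step (4) can be illustrated as a keyed join. This is a minimal sketch under the assumption that both the coordinate table and the entry results are keyed by fragment name; `match_entries` and the field names are hypothetical, and the source does not specify the database or lookup mechanism actually used.

```python
def match_entries(coordinates, entries):
    """Join coordinate rows with keyed-in text by fragment name."""
    # Build the lookup "database" once, keyed by fragment name.
    lookup = {row["fragment"]: row for row in coordinates}
    matched, unmatched = [], []
    for name, text in entries.items():
        row = lookup.get(name)
        if row is None:
            unmatched.append(name)  # no coordinates recorded for this entry
        else:
            matched.append({**row, "text": text})
    return matched, unmatched

# Hypothetical coordinate table and entry results.
coordinates = [
    {"fragment": "p1_001", "x": 95, "y": 50},
    {"fragment": "p1_002", "x": 300, "y": 50},
]
entries = {"p1_001": "天", "p1_002": "地", "p1_009": "玄"}

matched, unmatched = match_entries(coordinates, entries)
```

Reporting unmatched names separately, rather than silently dropping them, makes it easier to catch fragments that were keyed in but never recorded during cutting.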

Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.

Claims

1. A method for batch-generating Word documents from double-layer PDFs, characterized in that the method comprises the following steps:

Step (1): fragment the picture by cutting it using OCR technology, and record the fragment coordinates;

Step (2): transcribe the fragments on a transcription platform, and compare, verify, and spot-check the entry results;

Step (3): process the recorded fragment coordinates and convert them into a data table that stores the data, which simplifies further operations on the data;

Step (4): match the entry results one by one with the coordinates of the corresponding fragments to obtain complete data;

The data table storing the entry results is matched exactly against the data table obtained in step (3), yielding a new data table that contains both the text and the coordinate information;

Step (5): restore the position of the text by rules and algorithms, and generate the Word file;

Based on the new data table obtained in step (4), the text is sorted and processed algorithmically so that it follows the order of the text in the original manuscript; algorithms and rules applied to the coordinates restore the data of each column in the original manuscript; finally, the fragment picture names are used to restore the data belonging to each manuscript page, so that Word files can be generated quickly in batches.

2. the method for bilayer PDF Mass production WORD according to claim 1 a kind of, it is characterised in that: the step (1) OCR technique in method particularly includes: positioning character area first, and then identify the line number and columns of text, determine each text Rectangular block where word；Then under manual intervention, size and the position of rectangular block is adjusted, more accurately text rectangle is obtained Block is finally cut into fragmentation pattern one by one.

3. the method for bilayer PDF Mass production WORD according to claim 1 a kind of, it is characterised in that: the step (3) generating process of tables of data in are as follows: the TXT file that record coordinate information is read by code replicates the information of reading It pastes in Excel, by processes such as a series of point of column and replacements, obtains the tables of data of main information.

4. the method for bilayer PDF Mass production WORD according to claim 1 a kind of, it is characterised in that: the step (5) the step of text is ranked up in are as follows: step (5-1), by the sequence to fragment name and transverse and longitudinal coordinate obtains all lists A text puts in order；

Step (5-2) is distinguish by fragment name again, and a column data is converted into complete data line by line；

Step (5-3) is finally controlled using code again, finally can the multiple word files of Mass production, i.e. a big original image A corresponding word file.