CN109871516A - Method for batch generation of Word documents from double-layer PDF files - Google Patents

Method for batch generation of Word documents from double-layer PDF files

Info

Publication number
CN109871516A
Authority
CN
China
Prior art keywords
data
text
word
fragment
tables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711245886.6A
Other languages
Chinese (zh)
Inventor
陈伟
曹勇
殷绪成
王旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU ABEYOND OUTSOURCING CO Ltd
Original Assignee
JIANGSU ABEYOND OUTSOURCING CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU ABEYOND OUTSOURCING CO Ltd filed Critical JIANGSU ABEYOND OUTSOURCING CO Ltd
Priority to CN201711245886.6A priority Critical patent/CN109871516A/en
Publication of CN109871516A publication Critical patent/CN109871516A/en
Pending legal-status Critical Current

Landscapes

  • Character Input (AREA)

Abstract

The invention discloses a method for batch generation of Word documents from double-layer PDF files and relates to the technical field of data entry. The page image is fragmented: OCR is used to cut the image into fragments, and the coordinates of each fragment are recorded. The fragments are transcribed on a transcription platform, and the entry results are compared, proofread, and spot-checked. The recorded fragment coordinates are processed into a data table, which makes further operations on the data easier. The entry results are matched one by one with the coordinate information of the corresponding fragments to obtain complete records. Rules and algorithms based on the entry results and the coordinates then restore the position and layout of the text and generate the Word file. The method can process large volumes of data in batches, is more efficient and positions text more accurately, and removes the inefficiency of manual copy-paste and typesetting. The conversion method is practical and easy to operate and learn.

Description

Method for batch generation of Word documents from double-layer PDF files
Technical field
The present invention relates to a method for batch generation of Word documents from double-layer PDF files, and belongs to the technical field of data entry.
Background
In the Internet era, many traditional client applications, such as customer relationship management and office management systems, have moved to the Internet, and most of them follow the design pattern of delivering software as a service.
At present, electronic documents such as PowerPoint, Word, TXT, and PDF files are usually browsed by installing document reading software on the user's computer and opening the file in that software. In addition, some free document-sharing websites offer online reading: documents are read directly in the browser without being downloaded, which is convenient and has changed the earlier way of working and reading.
However, most of these document-sharing websites present documents as PDF, and PDF files are poorly suited to extracting key information, so they are usually converted to Word documents before further work.
A double-layer PDF file contains both a text layer and an image layer, whose positions correspond one to one. Existing PDF converters require manual operation to convert a PDF file to a Word document, so they do not solve the low efficiency of manually converting large numbers of documents.
Summary of the invention
The purpose of the present invention is to provide an efficient method for batch generation of Word documents from double-layer PDF files that can process large volumes of data in batches, so as to solve the problems described in the background above.
To achieve this purpose, the invention provides the following technical solution: a method for batch generation of Word documents from double-layer PDF files, comprising the following steps:
Step 1: fragment the image: cut the image using OCR and record the fragment coordinates.
Step 2: transcribe the fragments on a transcription platform, and compare, proofread, and spot-check the entry results.
Step 3: process the recorded fragment coordinates and convert them into a data table, which makes further operations on the data easier.
Step 4: match the entry results one by one with the coordinate information of the corresponding fragments to obtain complete records. The data table that stores the entry results is matched exactly against the data table obtained in step 3, producing a new data table that holds both the text and its coordinates.
Step 5: restore the position of the text by rules and algorithms, and generate the Word file. Based on the new data table obtained in step 4, the text is sorted and processed so that it follows the order of the text in the original manuscript. Coordinate-based rules and algorithms restore the data of each column of the original, and the fragment image names then identify the data belonging to each manuscript page, so that Word files can be generated quickly and in batches.
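The text does not say how the comparison in step 2 is carried out. One common way to do it is double keying: two operators transcribe the same fragments independently, and every disagreement is flagged for proofreading. The sketch below assumes that scheme; the function name `compare_entries` and the fragment names are hypothetical and do not come from the patent.

```python
def compare_entries(first_pass, second_pass):
    """Return the fragment names whose two transcriptions differ or are missing.

    Both arguments map a fragment name to the text keyed for that fragment.
    Double keying is an assumption here, not something the patent specifies.
    """
    flagged = []
    for name in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(name)
        b = second_pass.get(name)
        # A fragment needs proofreading if either pass skipped it
        # or if the two operators keyed different text.
        if a is None or b is None or a != b:
            flagged.append(name)
    return flagged


first = {"p1_001": "天", "p1_002": "地", "p1_003": "玄"}
second = {"p1_001": "天", "p1_002": "池", "p1_004": "黃"}
print(compare_entries(first, second))
```

Fragments that pass the comparison could still be spot-checked by drawing a random sample, which matches the sampling inspection named in step 2.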
Preferably, the OCR procedure in step 1 is as follows: first locate the text region, then identify the number of rows and columns of the text and determine the rectangular block containing each character; then, under manual intervention, adjust the size and position of the rectangular blocks to obtain more accurate character rectangles; finally, cut the image into individual fragment images.
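The block-finding part of this procedure can be sketched as follows, assuming the text region and its row and column counts are already known. The `Rect` type, the function names, and the `page_rNN_cNN` naming scheme are illustrative assumptions; the actual image cropping would need an imaging library and is left out.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    name: str  # hypothetical fragment name: page, row, column
    x: int
    y: int
    w: int
    h: int


def grid_rects(page: str, region: Tuple[int, int, int, int],
               rows: int, cols: int) -> List[Rect]:
    """Split a text region (x, y, w, h) into rows * cols character blocks."""
    rx, ry, rw, rh = region
    rects = []
    for r in range(rows):
        for c in range(cols):
            # Boundaries are computed from the region edge so that the
            # cells tile the region exactly, even when w or h is not a
            # multiple of the column or row count.
            x0 = rx + rw * c // cols
            x1 = rx + rw * (c + 1) // cols
            y0 = ry + rh * r // rows
            y1 = ry + rh * (r + 1) // rows
            name = f"{page}_r{r + 1:02d}_c{c + 1:02d}"
            rects.append(Rect(name, x0, y0, x1 - x0, y1 - y0))
    return rects


def adjust(rect: Rect, dx: int = 0, dy: int = 0,
           dw: int = 0, dh: int = 0) -> Rect:
    """Apply one manual correction to the position and size of a block."""
    return rect._replace(x=rect.x + dx, y=rect.y + dy,
                         w=rect.w + dw, h=rect.h + dh)


rects = grid_rects("page001", (100, 50, 300, 400), rows=4, cols=3)
rects[0] = adjust(rects[0], dx=-2, dw=4)  # operator widens the first block
print(len(rects), rects[0])
```

Keeping the adjustment as a separate step mirrors the text: the grid is a first guess, and the operator's corrections produce the final rectangles whose coordinates are recorded.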
Preferably, the data table in step 3 is generated as follows: code reads the TXT file that records the coordinate information, the content is copied and pasted into Excel, and a series of operations such as splitting into columns and replacing text yields a data table containing the key information.
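The patent does this step with Excel. As a rough scripted equivalent, the sketch below parses coordinate records into table rows and writes them as CSV, which Excel can open. The record layout (fragment name followed by x, y, width, height, separated by commas or spaces) is an assumption, since the TXT format is not given in the text.

```python
import csv
import io

FIELDS = ["name", "x", "y", "w", "h"]


def parse_coordinates(txt):
    """Turn raw coordinate records into a list of table rows (dicts)."""
    rows = []
    for line in txt.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # "Replace" then "split into columns", as in the Excel workflow.
        # A record without exactly five fields raises ValueError.
        name, x, y, w, h = line.replace(",", " ").split()
        rows.append({"name": name, "x": int(x), "y": int(y),
                     "w": int(w), "h": int(h)})
    return rows


def to_csv(rows):
    """Serialize the table as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sample = "page001_0001,300,50,100,100\n\npage001_0002 300 150 100 100\n"
table = parse_coordinates(sample)
print(to_csv(table))
```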
Preferably, the text is sorted in step 5 as follows:
Step 5-1: sort by fragment name and by horizontal and vertical coordinates to obtain the order of all individual characters.
Step 5-2: separate the data by fragment name and convert the single column of data into complete lines of text, line by line.
Step 5-3: finally, under code control, generate multiple Word files in a batch, one Word file per original page image.
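Steps 5-1 to 5-3 can be sketched as below. Several details are assumptions: the page is taken to be the part of the fragment name before the last underscore, text is taken to run in horizontal lines, and characters whose y coordinates differ by at most `tolerance` pixels are treated as one line. For vertically set text the roles of x and y would be swapped.

```python
from collections import defaultdict


def page_of(name):
    """Hypothetical naming rule: 'page001_0003' belongs to 'page001'."""
    return name.rsplit("_", 1)[0]


def rebuild_pages(records, tolerance=10):
    """Group characters by page, then rebuild each page line by line."""
    pages = defaultdict(list)
    for rec in records:
        pages[page_of(rec["name"])].append(rec)

    result = {}
    for page, recs in sorted(pages.items()):
        # Step 5-1: order the characters by their coordinates.
        recs.sort(key=lambda r: (r["y"], r["x"]))
        # Step 5-2: cut the single sorted column of data into lines.
        lines, current, line_y = [], [], None
        for rec in recs:
            if line_y is not None and rec["y"] - line_y > tolerance:
                lines.append(current)
                current, line_y = [], None
            if line_y is None:
                line_y = rec["y"]  # y of the first character in the line
            current.append(rec)
        if current:
            lines.append(current)
        result[page] = [
            "".join(r["text"] for r in sorted(line, key=lambda r: r["x"]))
            for line in lines
        ]
    return result


records = [
    {"name": "page001_0004", "x": 110, "y": 148, "text": "D"},
    {"name": "page002_0001", "x": 10, "y": 50, "text": "E"},
    {"name": "page001_0002", "x": 110, "y": 53, "text": "B"},
    {"name": "page001_0001", "x": 10, "y": 50, "text": "A"},
    {"name": "page001_0003", "x": 10, "y": 150, "text": "C"},
]
print(rebuild_pages(records))
```

For step 5-3, each entry of the result would become one output file per original page image. Writing an actual `.docx` file needs a third-party library and is outside this sketch.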
Compared with the prior art, the beneficial effects of the present invention are:
1. The position and layout of the text are restored from the entry results and the coordinate information to generate the Word file, and large volumes of data can be processed in batches; the method is more efficient and positions text more accurately, and it removes the inefficiency of manual copy-paste and typesetting.
2. Text is reproduced quickly and accurately, and large volumes of data are handled in batches with high efficiency and high precision.
3. The conversion method is practical: cutting is accurate, the restored text positions are also quite accurate, and the method is easy to operate and learn.
4. System performance is stable, maintenance is easy, and the method is widely applicable.
5. The method is general and accessible, and is easy to learn and adopt.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure of a PDF containing images of ancient books in an embodiment of the present invention;
Fig. 2 is a schematic diagram of converting the PDF containing images of ancient books into a Word document in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment: with reference to Figs. 1 and 2, a PDF containing images of ancient-book text is converted into a Word document. The specific steps are:
(1) Fragment the image: cut the image using OCR and record the fragment coordinates:
Because the characters in ancient books are closely spaced and many of them are rare, popular OCR software on the market generally recognizes ancient books poorly. OCR is therefore used to locate the text region, identify the number of rows and columns of the text, and determine the rectangular block containing each character; then, under manual intervention, the size and position of the rectangular blocks are adjusted to obtain more accurate character rectangles, and the image is finally cut into individual fragment images.
(2) Transcribe the fragments on the transcription platform, and compare, proofread, and spot-check the entry results;
(3) Process the recorded fragment coordinates and convert them into a data table, which makes further operations on the data easier:
First call the code that reads TXT files to obtain the TXT file, then open the TXT file, select all, copy, and paste into Excel; finally filter and process the data, and keep the useful data as the data table;
(4) Match the entry results one by one with the coordinate information of the corresponding fragments to obtain complete records:
Using the data table obtained in step (3) as a database, match it one by one against the entry results by database lookup;
(5) Restore the position of the text by rules and algorithms, and generate the Word file:
The fragment coordinates are obtained from the cutting step, and the text keyed by the online transcription staff is stored at the corresponding position. Human entry is used because ancient-book text is closely spaced and contains many rare characters, so OCR cannot achieve 100% recognition. Sorting and algorithmic processing then restore the layout of the original manuscript, and the Word file is generated quickly.
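The lookup in step (4) can be sketched with a dictionary acting as the database, keyed by fragment name. The row layout and the function name `match_entries` are assumptions made for illustration; the patent does not specify them.

```python
def match_entries(coord_rows, entries):
    """Attach keyed text to its coordinate row by fragment name.

    coord_rows: list of dicts with at least a "name" key (step 3 table).
    entries:    dict mapping fragment name to the keyed text (step 2).
    Returns (matched rows with a "text" field, names with no coordinates).
    """
    # Index the coordinate table once so that each lookup is a dict access.
    index = {row["name"]: row for row in coord_rows}
    matched, unmatched = [], []
    for name, text in entries.items():
        row = index.get(name)
        if row is None:
            unmatched.append(name)  # keyed text with no recorded fragment
        else:
            matched.append({**row, "text": text})
    return matched, unmatched


coords = [
    {"name": "page001_0001", "x": 300, "y": 50, "w": 100, "h": 100},
    {"name": "page001_0002", "x": 300, "y": 150, "w": 100, "h": 100},
]
keyed = {"page001_0001": "天", "page001_0003": "玄"}
matched, unmatched = match_entries(coords, keyed)
print(matched, unmatched)
```

Returning the unmatched names separately keeps mismatches visible instead of silently dropping them, which fits the proofreading emphasis of the method.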
Although an embodiment of the present invention has been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to this embodiment without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.

Claims (4)

1. A method for batch generation of Word documents from double-layer PDF files, characterized in that the method comprises the following steps:
Step (1): fragment the image: cut the image using OCR and record the fragment coordinates;
Step (2): transcribe the fragments on a transcription platform, and compare, proofread, and spot-check the entry results;
Step (3): process the recorded fragment coordinates and convert them into a data table, which makes further operations on the data easier;
Step (4): match the entry results one by one with the coordinate information of the corresponding fragments to obtain complete records;
the data table that stores the entry results is matched exactly against the data table obtained in step (3), producing a new data table that holds both the text and its coordinates;
Step (5): restore the position of the text by rules and algorithms, and generate the Word file;
based on the new data table obtained in step (4), the text is sorted and processed so that it follows the order of the text in the original manuscript; coordinate-based rules and algorithms restore the data of each column of the original, and the fragment image names then identify the data belonging to each manuscript page, so that Word files can be generated quickly and in batches.
2. The method for batch generation of Word documents from double-layer PDF files according to claim 1, characterized in that the OCR procedure in step (1) is as follows: first locate the text region, then identify the number of rows and columns of the text and determine the rectangular block containing each character; then, under manual intervention, adjust the size and position of the rectangular blocks to obtain more accurate character rectangles; finally, cut the image into individual fragment images.
3. The method for batch generation of Word documents from double-layer PDF files according to claim 1, characterized in that the data table in step (3) is generated as follows: code reads the TXT file that records the coordinate information, the content is copied and pasted into Excel, and a series of operations such as splitting into columns and replacing text yields a data table containing the key information.
4. The method for batch generation of Word documents from double-layer PDF files according to claim 1, characterized in that the text is sorted in step (5) as follows:
Step (5-1): sort by fragment name and by horizontal and vertical coordinates to obtain the order of all individual characters;
Step (5-2): separate the data by fragment name and convert the single column of data into complete lines of text, line by line;
Step (5-3): finally, under code control, generate multiple Word files in a batch, one Word file per original page image.
CN201711245886.6A 2017-12-01 2017-12-01 A kind of method of bilayer PDF Mass production WORD Pending CN109871516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711245886.6A CN109871516A (en) 2017-12-01 2017-12-01 A kind of method of bilayer PDF Mass production WORD

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711245886.6A CN109871516A (en) 2017-12-01 2017-12-01 A kind of method of bilayer PDF Mass production WORD

Publications (1)

Publication Number Publication Date
CN109871516A true CN109871516A (en) 2019-06-11

Family

ID=66913338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711245886.6A Pending CN109871516A (en) 2017-12-01 2017-12-01 A kind of method of bilayer PDF Mass production WORD

Country Status (1)

Country Link
CN (1) CN109871516A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334710A (en) * 2019-07-10 2019-10-15 深圳市华云中盛科技有限公司 Legal documents recognition methods, device, computer equipment and storage medium
CN111046841A (en) * 2019-12-26 2020-04-21 中孚安全技术有限公司 Character extraction method, system, terminal and storage medium of PowerPoint file
CN112149163A (en) * 2020-09-22 2020-12-29 山东旗帜信息有限公司 Image file security processing method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096845A (en) * 2009-12-10 2011-06-15 黑龙江省森林工程与环境研究所 Knowledge base full text search engine system for classified forest management
CN202795366U (en) * 2012-09-24 2013-03-13 上海理工大学 System capable of generating digital publication
CN106529521A (en) * 2016-10-31 2017-03-22 江苏文心古籍数字产业有限公司 Ancient book character digital recording method
CN106548175A (en) * 2016-10-13 2017-03-29 江苏奥博洋信息技术有限公司 A kind of new character image digitalized processing method

Similar Documents

Publication Publication Date Title
US9129421B2 (en) System and method for displaying complex scripts with a cloud computing architecture
CN101183355B (en) Copy and paste processing method, apparatus
US9507867B2 (en) Discovery engine
CN107608949A (en) A kind of Text Information Extraction method and device based on semantic model
CN111680634B (en) Document file processing method, device, computer equipment and storage medium
US20100011016A1 (en) Dictionary compilations
CN110083805A (en) A kind of method and system that Word file is converted to EPUB file
US7941418B2 (en) Dynamic corpus generation
CN110134780B (en) Method, device, equipment and computer readable storage medium for generating document abstract
CN102043808A (en) Method and equipment for extracting bilingual terms using webpage structure
CN109871516A (en) A kind of method of bilayer PDF Mass production WORD
CN106874240A (en) Digital publishing method and system
CN106372065A (en) Method and system for developing multi-language website
CN107609032B (en) Matching method and electronic equipment
Pingali et al. WebKhoj: Indian language IR from multiple character encodings
US10643022B2 (en) PDF extraction with text-based key
KR100912288B1 (en) Search system using contents information in document file
EP2312473A1 (en) System, apparatus and method for processing content on a computing device
CN109670183A (en) A kind of calculation method, device, equipment and the storage medium of text importance
Patil et al. Design and development of a dictionary based stemmer for Marathi language
CN114610808A (en) Data storage method, data storage device, electronic equipment and medium
CN113065316A (en) Method for dynamically converting formal thumbnail file into html (hypertext markup language) and inputting question bank, selecting questions from question bank and composing draft and generating thumbnail file
Magapu Development and customization of in-house developed OCR and its evaluation
Winarti et al. Improving stemming algorithm using morphological rules
CN109522549B (en) Corpus construction method based on Web collection and text feature balanced distribution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190611
