CN110083805A - Method and system for converting a Word file into an EPUB file
- Publication number
- CN110083805A (application CN201810071710.1A)
- Authority
- CN
- China
- Prior art keywords
- file
- word
- xml
- epub
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a method and system for converting a Word file into an EPUB file. For Word files in .docx format, the method identifies and processes the table of contents of the source Word file, so that the table-of-contents structure of the source document is recognized and an EPUB e-book is generated automatically. The steps are: Word file parsing, XML file parsing, Word file splitting, HTML file generation, and EPUB file generation. By recognizing the table of contents of the source Word file, the method addresses problems of the prior art such as poor conversion quality and the tedious, inefficient process of adding a table of contents by hand. It preserves the completeness of the document content, improves conversion quality, and raises work efficiency.
Description
Technical field
The present invention relates to document processing technology, and in particular to a method and system for converting a Word file into an EPUB (Electronic Publication) file.
Background
In the era of digital publishing and "Internet Plus", with the rapid development of mobile communication and online publishing, e-books have become increasingly widespread and popular. The digital age has changed people's reading habits: fragmented and mobile reading on e-readers, smartphones, and similar devices has become a widely accepted and well-liked way to read. Because devices, platforms, and publishing media differ, many e-book formats have emerged on the market, such as TXT, PDF, EPUB, Mobi, Azw3, CEB/CEBX, CAJ, and PDG. Among the popular e-book formats, EPUB is the official standard of the International Digital Publishing Forum (IDPF); because it supports complex layouts and adapts to the device screen, it is ranked with PDF and Mobi as one of the three mainstream e-book formats. Word and PDF, as the most common office document formats, are the two manuscript formats most used in the publishing industry. Publishing and distributing e-books often requires conversion between e-book formats, and the need to convert between document formats also arises frequently in software development.
Microsoft Office Word is currently the most commonly used electronic document tool. Word files come in the .doc and .docx formats: the former is an MS-Word binary file, while the latter follows OOXML (Office Open XML), an XML-based, ZIP-compressed electronic document specification developed by Microsoft. The usual way to parse a Word file is to extract the relevant information from the decompressed file and convert it into corresponding HTML files for further processing.
The EPUB format uses ZIP compression. A decompressed EPUB file mainly contains three parts: the mimetype file, which declares the file format of the EPUB; the OEBPS folder, which stores the core content files of the e-book such as the OPF, NCX, CSS, and HTML files; and the META-INF folder, which contains several files describing properties of the e-book. Generating an EPUB e-book generally takes four steps: add the mimetype file; gather all resource files; create the core content files such as the opf and ncx files; and finally create the corresponding property files and compress everything into the EPUB format.
Many file format conversion tools exist today, in the form of online services, desktop applications, and API interfaces. Conversion quality is commonly judged by the completeness of content such as text, charts, tags, and the table-of-contents structure; by how well attributes such as headings, fonts, and font sizes are retained; and by the handling of special documents. Existing conversion tools typically cover formats such as Word, PDF, EPUB, and Excel, but relatively few technical solutions convert Word files into EPUB files. In particular, for Word files that contain a table of contents, whether the file carries navigation tags or only a contents page without jump links, the prior art converts poorly and is prone to losing the table-of-contents structure or scrambling the text.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a method and system for converting a Word file into an EPUB file. For Word files in .docx format, it identifies and processes the table of contents of the source Word file, so that the table-of-contents structure of the source file can be extracted and an EPUB e-book generated automatically.
The technical solution of the invention is as follows.
A method for converting a Word file into an EPUB file comprises the following steps:
1) Word file parsing: obtain the Word file to be converted (a .docx file) and decompress it, producing the corresponding resource files and folders, which include several XML documents, pictures, and other files;
2) XML file parsing: parse the XML in the resource files to obtain the table of contents, heading formats, text styles, and similar content of the source Word file, and extract information such as text, paragraphs, fonts, and font sizes for the later generation of HTML files;
3) Word file splitting: based on the parsing result of the source Word file, handle different Word files case by case. For a source file that contains a table-of-contents structure, perform table-of-contents recognition to obtain its table of contents; for a source file without one, perform heading recognition to extract its table of contents. Then split the source Word file into multiple subfiles by chapter according to the table of contents;
4) HTML file generation: based on the XML parsing result, use structural information such as text paragraphs and font sizes to convert each subfile into a corresponding HTML file;
5) EPUB file generation: package the HTML files, together with the pictures, metadata, and other resource files obtained by decompressing the source Word file and the table-of-contents file obtained from parsing, into an EPUB file.
The present invention also provides a system, implemented with the above method, for converting a Word file into an EPUB file, comprising:
1) a Word file parsing module, which obtains the Word file to be converted (a .docx file) and decompresses it, producing the corresponding resource files and folders, which include several XML documents, pictures, and other files;
2) an XML file parsing module, which parses the XML in the resource files to obtain the table of contents, heading formats, text styles, and similar content of the source Word file, and extracts information such as text, paragraphs, fonts, and font sizes for the later generation of HTML files;
3) a Word file splitting module, which handles different Word files case by case based on the parsing result: it performs table-of-contents recognition on source files that contain a table-of-contents structure and heading recognition on source files that do not, in order to extract the table of contents, and then splits the source Word file into multiple subfiles by chapter;
4) an HTML file generation module, which uses structural information such as text paragraphs and font sizes from the XML parsing result to convert each subfile into a corresponding HTML file;
5) an EPUB file generation module, which packages the pictures, metadata, and other resource files obtained by decompressing the source Word file, together with the table-of-contents file obtained from parsing, into an EPUB file.
Each module is described in detail below.
Word file parsing module. Word files from Microsoft Word 2007 and later have the suffix .docx and follow the OOXML electronic document specification, which is based on a ZIP+XML format. After the suffix of the Word file is changed to .zip and the file is decompressed with decompression software, the result is the [Content_Types].xml file (which records the names and types of all files in the package) and three folders: _rels, docProps, and word. The docProps folder mainly records the properties of the Word document: the app.xml file records statistics such as page count and word count, the core.xml file records core properties such as creation time and author, and thumbnail.emf is the document thumbnail. The word folder contains the document.xml file, which records the body of the document; the footnotes.xml file, which records footnotes; the endnotes.xml file, which records endnotes; the styles.xml file, which records style information; the numbering.xml file, which records numbering sequences; and the media folder, which holds picture resources.
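The unpacking step can be sketched in Python with the standard `zipfile` module, which reads a .docx package directly, so renaming the suffix to .zip is not strictly needed. This is a minimal sketch: the helper names `list_docx_parts` and `make_sample_docx` are hypothetical, and the sample package is a stand-in built in memory rather than a real Word file.

```python
import io
import zipfile

def list_docx_parts(docx_bytes: bytes) -> dict:
    """Group the part names of a .docx package by top-level folder."""
    groups = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        for name in package.namelist():
            # Parts at the package root (e.g. [Content_Types].xml) go under "".
            folder = name.split("/", 1)[0] if "/" in name else ""
            groups.setdefault(folder, []).append(name)
    return groups

def make_sample_docx() -> bytes:
    """Build a tiny stand-in package so the sketch runs without a real file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in ("[Content_Types].xml", "_rels/.rels",
                     "docProps/app.xml", "docProps/core.xml",
                     "word/document.xml", "word/styles.xml"):
            package.writestr(name, "<x/>")
    return buffer.getvalue()

parts = list_docx_parts(make_sample_docx())
```

For a real file, the bytes would come from reading the .docx from disk; the grouping then mirrors the _rels, docProps, and word folders described above.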
XML file parsing module. The XML files obtained by decompressing the Word file follow the OOXML standard. Among them, document.xml contains the main content of the source Word file, and its structure consists mainly of elements such as paragraphs and tables. For Word, the XML elements of an OOXML document mainly include paragraphs, text, tables, numbering, sections, styles, fonts, headings, footers, fields, links, and tables of contents. An XML file is itself a tree; typical parsing steps are to divide the data into blocks, parse the blocks in parallel with multiple threads, identify the tags and attribute content, and apply some post-processing. Using open-source tools such as the Open XML SDK, the nested XML structure in files such as document.xml, app.xml, endnotes.xml, footnotes.xml, numbering.xml, and styles.xml is parsed to obtain the Word file content and its styles, in particular the table of contents and heading formats of the file.
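As a rough illustration of this parsing step, the sketch below reads paragraph text and paragraph style ids from a document.xml string using Python's standard `xml.etree.ElementTree` in place of the Open XML SDK named above. The function name `read_paragraphs` and the sample XML are assumptions made for illustration; the style id `Heading1` is what English-language Word templates typically use and may differ in localized documents.

```python
import xml.etree.ElementTree as ET
from typing import Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Volume One</w:t></w:r></w:p>
<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>
</w:body></w:document>"""

def read_paragraphs(document_xml: str) -> list:
    """Return (style id or None, text) for each body paragraph in document.xml."""
    root = ET.fromstring(document_xml)
    result = []
    for para in root.iterfind("w:body/w:p", NS):
        style = para.find("w:pPr/w:pStyle", NS)
        style_id: Optional[str] = style.get(f"{{{W}}}val") if style is not None else None
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        result.append((style_id, text))
    return result

paragraphs = read_paragraphs(SAMPLE)
```

The style id of each paragraph is one of the signals a later step can use to tell headings from body text.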
Word file splitting module. Using the XML parsing result, the table-of-contents structure of the source Word file is extracted, and the source file is split into multiple Word subfiles according to the corresponding chapter structure. Extraction of the table of contents is handled in three cases:
a) The source file carries a navigable table-of-contents structure. A Word document represents its table of contents with a TOC field that contains heading levels and special styles; extracting the content of the corresponding tags allows direct conversion into the table-of-contents structure of the EPUB file;
b) The source file has no table-of-contents structure, but has a contents page consisting of plain text. Such contents pages usually have characteristic layout features, which are used to screen for and identify the contents page. The page is then parsed to extract headings and page numbers, which are finally matched to the corresponding document content to generate the table-of-contents structure;
c) The source file has neither a table-of-contents structure nor a contents page with characteristic layout features. For such files, a classification method such as SVM is applied to the results of analyzing page whitespace, chapter heading fonts, and headers and footers, to extract each heading of the document and its corresponding paragraph content. Because headings of the same level share a consistent style, clustering is then used to extract the hierarchy among the headings and generate the corresponding table of contents.
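The idea in case c) that headings of the same level share a style can be illustrated with a deliberately simplified Python sketch. It does not implement the SVM classification or a real clustering algorithm: it assumes an earlier step has already flagged the candidate headings, and it groups them by exact style signature (font size and bold flag) instead of clustering. The function name `assign_heading_levels` and the input tuple layout are hypothetical.

```python
from collections import defaultdict

def assign_heading_levels(candidates):
    """Map each candidate heading to a level based on its style signature.

    candidates: list of (text, font_size_pt, is_bold) tuples that an earlier
    classification step has already flagged as headings (hypothetical input).
    """
    groups = defaultdict(list)
    for text, size, bold in candidates:
        groups[(size, bold)].append(text)
    # Larger fonts rank higher; bold outranks regular at equal size.
    ordered = sorted(groups, key=lambda sig: (-sig[0], not sig[1]))
    level_of = {sig: depth + 1 for depth, sig in enumerate(ordered)}
    return [(text, level_of[(size, bold)]) for text, size, bold in candidates]

headings = assign_heading_levels([
    ("Volume One", 16.0, True),
    ("Section 1", 14.0, True),
    ("Section 2", 14.0, True),
    ("Volume Two", 16.0, True),
])
```

A real clustering step would tolerate small style variations that exact matching treats as separate levels.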
After the table-of-contents structure of the source file is obtained, the start position of each chapter is located through the XML elements, and the source file is split accordingly. Note that most e-book tables of contents use only two heading levels to divide the chapters of a book; when the Word file is divided into subfiles by its table of contents for conversion to EPUB, only two levels of headings may likewise be considered.
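A minimal Python sketch of the splitting step, assuming the paragraphs are already available as a list of strings and the heading positions as a mapping from paragraph index to heading level (both hypothetical representations). Headings deeper than two levels stay inside the enclosing chapter, matching the two-level convention described above.

```python
def split_into_chapters(paragraphs, heading_levels, max_level=2):
    """Split paragraphs into chapters at headings of level <= max_level.

    paragraphs: list of paragraph strings in document order.
    heading_levels: dict mapping a paragraph index to its heading level.
    """
    chapters = []
    for index, text in enumerate(paragraphs):
        level = heading_levels.get(index)
        if level is not None and level <= max_level:
            # A qualifying heading starts a new chapter.
            chapters.append({"title": text, "level": level, "body": []})
        elif chapters:
            chapters[-1]["body"].append(text)
        # Paragraphs before the first heading (front matter) are skipped here.
    return chapters

chapters = split_into_chapters(
    ["Volume One", "text a", "Part 1", "text b", "Volume Two", "text c"],
    {0: 1, 2: 3, 4: 1},
)
```

In this example the level-3 heading "Part 1" does not start a new subfile; it remains part of the body of "Volume One".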
HTML file generation module. For each Word subfile produced by splitting, a resource index file for the HTML is generated from the XML parsing result; it maps to the resource addresses of the pictures and other files that appear in the subfile. Combined with the Word text content, each subfile is then converted into a corresponding HTML file. The converted HTML files are used mainly to assemble the EPUB file, and their file names serve as the link addresses of the chapters in the table of contents of the converted EPUB file.
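The conversion of one subfile to HTML might look like the following Python sketch. It assumes the chapter has already been reduced to a title, a list of paragraph strings, and a list of image paths; the function name `chapter_to_html` and that simplified input are illustrative assumptions, and styling such as fonts and font sizes is left out.

```python
from html import escape

def chapter_to_html(title, paragraphs, images=()):
    """Render one chapter as a minimal XHTML page."""
    parts = [f"<h1>{escape(title)}</h1>"]
    parts += [f"<p>{escape(text)}</p>" for text in paragraphs]
    # Image paths point at resources copied from the word/media folder.
    parts += [f'<img src="{escape(src, quote=True)}" alt=""/>' for src in images]
    body = "\n".join(parts)
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n</html>\n"
    )

page = chapter_to_html("Volume One", ["Tom & Jerry"], images=["images/fig1.png"])
```

Escaping the text matters because characters such as `&` and `<` in the Word content would otherwise produce malformed markup.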
EPUB file generation module. The EPUB format uses ZIP compression. To generate the EPUB e-book, first add the mimetype file at the target location to declare the EPUB format. Next, according to the table-of-contents structure, create the ncx file of the EPUB and add navigation links identified by the HTML file names, which produces the table of contents of the EPUB file. Then, from the metadata of the source file and the table of contents and file information of the EPUB, create the opf file, and copy the HTML files and their resource files into the OPS folder. From the opf file, create the container.xml file and store it in the META-INF folder. Finally, package the mimetype file, the OPS folder, and the META-INF folder, and delete the intermediate files to produce the final converted EPUB file.
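The packaging order can be sketched in Python with the standard `zipfile` module. The EPUB container format expects the mimetype entry to be the first entry in the archive and to be stored without compression, which the sketch makes explicit. The function name `package_epub` and the file names `content.opf`, `toc.ncx`, and `chapter1.html` are assumptions for illustration, and the OPF and NCX contents are placeholders rather than valid documents.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

def package_epub(files):
    """Zip an EPUB: files maps archive paths (e.g. 'OPS/content.opf') to bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as epub:
        # The mimetype entry must come first and must be stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML,
                      compress_type=zipfile.ZIP_DEFLATED)
        for path, data in files.items():
            epub.writestr(path, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()

epub_bytes = package_epub({
    "OPS/content.opf": b"<package/>",   # placeholder, not a valid OPF
    "OPS/toc.ncx": b"<ncx/>",           # placeholder, not a valid NCX
    "OPS/chapter1.html": b"<html/>",
})
```

Writing directly into the archive also avoids the intermediate files that the module otherwise has to delete at the end.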
Compared with the prior art, the beneficial effects of the invention are as follows.
The invention provides a method and system for converting a Word file into an EPUB file that can be applied widely in common file format conversion scenarios, especially in the publishing industry, where Word and EPUB manuscripts are used in large numbers. By recognizing the table of contents of the source Word file when generating the EPUB e-book automatically, the method addresses problems of the prior art such as poor conversion quality and the tedious, inefficient process of adding a table of contents by hand. It preserves the completeness of the document content, improves conversion quality, and raises work efficiency.
Brief description of the drawings
Fig. 1 is an example of an OOXML document.
Fig. 2 is a screenshot of a Word file with a table-of-contents structure, from an embodiment of the invention.
Fig. 3 is a screenshot of a Word file with a plain-text contents page, from an embodiment of the invention.
Fig. 4 is a screenshot of a Word file without a table of contents, from an embodiment of the invention.
Fig. 5 is a screenshot of the converted EPUB file with its table of contents, from an embodiment of the invention.
Fig. 6 is a flow diagram of the method of the invention.
Detailed description of the embodiments
The invention is further described below through embodiments, with reference to the drawings; the embodiments do not limit the scope of the invention in any way.
The invention provides a method and system for converting a Word file into an EPUB file. For Word files in .docx format, it identifies and processes the table of contents of the source Word file, so that the table-of-contents structure of the source file can be extracted and an EPUB e-book generated automatically. The method mainly comprises five steps: Word file parsing, XML file parsing, Word file splitting, HTML file generation, and EPUB file generation; see Fig. 6.
The implementation is illustrated below by converting a .docx file of "Six Chapters of a Floating Life" (hereinafter Document 1). The specific example is as follows:
1) Obtain and decompress the Word file to be converted to get the resource files, including several XML files.
The suffix of Document 1 is changed to .zip and the file is decompressed with a decompression tool. The result contains the [Content_Types].xml file and the _rels, customXml, docProps, and word folders. Each folder contains several XML files. The word folder holds the core content of Document 1, including the document.xml file, which records the body, and the numbering.xml file, which records numbering sequences.
2) Parse the XML files and extract content such as text, paragraphs, and headings.
The XML files are parsed with a third-party open-source API to obtain each tag name, attribute, and text content. Nested paragraph content is parsed recursively to obtain the corresponding hierarchy. For example, <w:p> corresponds to a paragraph, <w:t> to text, <w:hyperlink> to a link, <w:bookmarkStart> to the start of a bookmark, and <w:bookmarkEnd> to the end of a bookmark.
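The recursive traversal can be sketched in Python as follows, with the standard `xml.etree.ElementTree` standing in for the unnamed third-party API. The function name `outline`, the `KIND` lookup table, and the sample paragraph are illustrative assumptions; only the five tags listed above are reported, and other elements such as `<w:r>` are walked through without being recorded.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
KIND = {"p": "paragraph", "t": "text", "hyperlink": "link",
        "bookmarkStart": "bookmark start", "bookmarkEnd": "bookmark end"}

def outline(element, depth=0):
    """Recursively list (depth, kind) for the WordprocessingML tags of interest."""
    rows = []
    prefix = f"{{{W}}}"
    if element.tag.startswith(prefix):
        local = element.tag[len(prefix):]
        if local in KIND:
            rows.append((depth, KIND[local]))
            depth += 1  # children of a recorded tag sit one level deeper
    for child in element:
        rows.extend(outline(child, depth))
    return rows

SAMPLE = (f'<w:p xmlns:w="{W}"><w:bookmarkStart w:id="0" w:name="_Toc1"/>'
          '<w:hyperlink w:anchor="_Toc1"><w:r><w:t>Volume One</w:t></w:r>'
          '</w:hyperlink><w:bookmarkEnd w:id="0"/></w:p>')
rows = outline(ET.fromstring(SAMPLE))
```

The depth values record the nesting, for instance that the text run sits inside the link, which sits inside the paragraph.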
3) Extract the table-of-contents structure of the source file from the parsed content.
Document 1 itself contains a table of contents with navigation. Searching document.xml for the TOC field and parsing the tags inside it yields each heading in the table of contents. For example, links in document.xml are marked with w:hyperlink elements, and the w:anchor attribute records the link target, such as _Toc502003199. Searching the text for the bookmark whose w:name is _Toc502003199 gives the jump target of the corresponding heading in the table of contents. Processing all tags in the TOC field in a loop extracts the table-of-contents structure of the source file.
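The anchor-to-bookmark lookup described in this step can be sketched in Python with `xml.etree.ElementTree`. The function name `extract_toc` and the sample body are assumptions; the sketch treats every hyperlink whose anchor starts with `_Toc` as a contents entry, which is a simplification of locating the TOC field itself. In a real document the link text usually also carries a tab and a page number, which would need to be stripped from the title.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
Q = f"{{{W}}}"

def extract_toc(body):
    """Return (title, anchor, target paragraph index) for each TOC hyperlink."""
    paragraphs = list(body.iter(Q + "p"))
    # Index every bookmark name by the paragraph that contains it.
    target = {}
    for index, para in enumerate(paragraphs):
        for mark in para.iter(Q + "bookmarkStart"):
            target.setdefault(mark.get(Q + "name"), index)
    entries = []
    for link in body.iter(Q + "hyperlink"):
        anchor = link.get(Q + "anchor")
        if anchor and anchor.startswith("_Toc"):
            title = "".join(t.text or "" for t in link.iter(Q + "t"))
            entries.append((title, anchor, target.get(anchor)))
    return entries

SAMPLE = f"""<w:body xmlns:w="{W}">
<w:p><w:hyperlink w:anchor="_Toc502003199"><w:r><w:t>Volume One</w:t></w:r></w:hyperlink></w:p>
<w:p><w:r><w:t>Preface</w:t></w:r></w:p>
<w:p><w:bookmarkStart w:id="1" w:name="_Toc502003199"/><w:r><w:t>Volume One</w:t></w:r><w:bookmarkEnd w:id="1"/></w:p>
</w:body>"""
toc = extract_toc(ET.fromstring(SAMPLE))
```

The paragraph index returned for each entry is the position at which the later splitting step would start a new chapter.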
4) Split the source file by chapter according to the table-of-contents structure.
The table of contents extracted from Document 1 is stored in a data structure that is convenient to process. Using a document processing tool, the source document is split into multiple subfiles at the paragraph address of each first-level heading in the table of contents. That is, chapters of the document such as "Volume One: Joys of the Boudoir (1)", "Volume Two: Pleasures of Leisure (1)", and "Volume Two: Pleasures of Leisure (2)" are each saved as a separate subfile.
5) Convert each subfile into a corresponding HTML file.
Based on the content of the split subfiles and the related XML parsing results, a document processing tool converts each subfile into a corresponding HTML file, and the HTML files are numbered and named following the hierarchy of the document. For example, the HTML file converted from the chapter "Volume Two: Pleasures of Leisure (1)" can be named "chapter2_1.html".
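The text gives only the single example name "chapter2_1.html" and does not spell out the numbering rule, so the Python sketch below is one plausible scheme rather than the patent's own: a first-level heading starts chapter N, and each second-level heading under it becomes chapterN_M.html. The function name `chapter_filenames` is hypothetical.

```python
def chapter_filenames(levels):
    """Name HTML files from a sequence of heading levels (1 or 2).

    A level-1 heading starts chapter N; each level-2 heading under it
    becomes chapterN_M.html. This numbering rule is an assumption.
    """
    names, chapter, section = [], 0, 0
    for level in levels:
        if level == 1:
            chapter, section = chapter + 1, 0
            names.append(f"chapter{chapter}.html")
        else:
            section += 1
            names.append(f"chapter{chapter}_{section}.html")
    return names

names = chapter_filenames([1, 2, 1, 2, 2])
```

Whatever rule is used, the names must be stable, because the table of contents of the EPUB links to chapters by these file names.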
6) Package the HTML files and the related resource files to generate the EPUB file.
At the target location, first create the mimetype file to declare the EPUB format. Then create the table-of-contents file of the EPUB from the HTML files that correspond to the subfiles: in the ncx file, the <navLabel> tag takes the heading of the file as its text, and the id of the <navPoint> tag corresponds to the HTML file name of the chapter. Next, create the opf and other files from the core content of the document. Finally, package all resource files to generate the final EPUB file.
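A minimal Python sketch of building the ncx navigation map with `xml.etree.ElementTree`, following the convention above that the navPoint id is derived from the HTML file name and the navLabel text is the chapter heading. The function name `build_ncx` is hypothetical, and the output is deliberately minimal: a complete NCX file also needs head metadata such as a unique identifier, which is omitted here.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"

def build_ncx(book_title, chapters):
    """Build a minimal toc.ncx string; chapters is a list of (title, html file)."""
    ET.register_namespace("", NCX_NS)
    ncx = ET.Element(f"{{{NCX_NS}}}ncx", {"version": "2005-1"})
    ET.SubElement(ncx, f"{{{NCX_NS}}}head")
    doc_title = ET.SubElement(ncx, f"{{{NCX_NS}}}docTitle")
    ET.SubElement(doc_title, f"{{{NCX_NS}}}text").text = book_title
    nav_map = ET.SubElement(ncx, f"{{{NCX_NS}}}navMap")
    for order, (title, href) in enumerate(chapters, start=1):
        # The navPoint id is the HTML file name without its extension.
        point = ET.SubElement(nav_map, f"{{{NCX_NS}}}navPoint",
                              {"id": href.rsplit(".", 1)[0], "playOrder": str(order)})
        label = ET.SubElement(point, f"{{{NCX_NS}}}navLabel")
        ET.SubElement(label, f"{{{NCX_NS}}}text").text = title
        ET.SubElement(point, f"{{{NCX_NS}}}content", {"src": href})
    return ET.tostring(ncx, encoding="unicode")

ncx_xml = build_ncx("Six Chapters of a Floating Life",
                    [("Volume One (1)", "chapter1_1.html")])
```

Second-level headings would be represented by nesting further navPoint elements inside their parent, which this flat sketch leaves out.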
Using the above method, the invention implements a system for converting a Word file into an EPUB file, comprising:
1) a Word file parsing module, which obtains the Word file to be converted (a .docx file) and decompresses it, producing the corresponding resource files and folders, which include several XML documents, pictures, and other files;
2) an XML file parsing module, which parses the XML in the resource files to obtain the table of contents, heading formats, text styles, and similar content of the source Word file, and extracts information such as text, paragraphs, fonts, and font sizes for the later generation of HTML files;
Fig. 1 is an example of an OOXML document. For Word, the XML elements of an OOXML document mainly include paragraphs, text, tables, numbering, sections, styles, fonts, headings, footers, fields, links, and tables of contents. An XML file is itself a tree; typical parsing steps are to divide the data into blocks, parse the blocks in parallel with multiple threads, identify the tags and attribute content, and apply some post-processing. Using open-source tools such as the Open XML SDK, the nested XML structure in files such as document.xml, app.xml, endnotes.xml, footnotes.xml, numbering.xml, and styles.xml is parsed to obtain the Word file content and its styles, in particular the table of contents and heading formats of the file.
3) a Word file splitting module, which handles different Word files case by case based on the parsing result of the source Word file: it performs table-of-contents recognition on source files that contain a table-of-contents structure and heading recognition on source files that do not, in order to extract the table of contents, and then splits the source Word file into multiple subfiles by chapter;
Extraction of the Word file's table of contents is handled in three cases:
a) The source file carries a navigable table-of-contents structure, which shows up as a TOC (Table of Contents) field in the parsed document.xml file. A Word document represents its table of contents with a TOC field that contains heading levels and special styles; extracting the content of the corresponding tags allows direct conversion into the table-of-contents structure of the EPUB file. See Fig. 2 for an example;
b) The source file has no table-of-contents structure, but has a contents page consisting of plain text. Such contents pages usually have characteristic layout features: the text contains the word "目录" ("Contents"), there are many dot-leader symbols, there are many line breaks and symbols, and each line is indented and begins with a number. These layout features are used to screen for and identify the contents page. The page is then parsed to extract headings and page numbers, which are finally matched to the corresponding document content to generate the table-of-contents structure. See Fig. 3 for an example;
c) The source file has neither a table-of-contents structure nor a contents page with characteristic layout features. For such files, a classification method such as SVM is applied to the results of analyzing page whitespace, chapter heading fonts, and headers and footers, to extract each heading of the document and its corresponding paragraph content. Because headings of the same level share a consistent style, clustering is then used to extract the hierarchy among the headings and generate the corresponding table of contents. See Fig. 4 for an example.
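The layout features in case b) lend themselves to a simple heuristic, sketched below in Python. The regular expression, the 0.5 threshold, and the function name `parse_toc_page` are illustrative assumptions rather than values from the text: a line counts as a contents entry when a title is followed by dot leaders, a tab, or a wide gap and then a trailing page number, and a page is accepted when it carries a "目录" or "Contents" marker or when most of its lines match.

```python
import re

# Title, then dot leaders or a tab/wide gap, then a trailing page number.
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)\s*(?:[.·…]{3,}|\t|\s{2,})\s*(?P<page>\d+)\s*$")

def parse_toc_page(lines, min_ratio=0.5):
    """Return [(title, page)] if the page looks like a contents page, else []."""
    body = [line for line in lines if line.strip()]
    if not body:
        return []
    hits = [m for m in (TOC_LINE.match(line) for line in body) if m]
    has_marker = any(line.strip() in ("目录", "Contents") for line in body)
    if not has_marker and len(hits) / len(body) < min_ratio:
        return []
    return [(m["title"], int(m["page"])) for m in hits]

entries = parse_toc_page(["目录", "卷一 闺房记乐 ........ 1", "卷二 闲情记趣\t15"])
not_toc = parse_toc_page(["Some ordinary prose.", "More prose here."])
```

The extracted titles would then be matched against paragraph text in the body to find where each chapter starts, since page numbers alone do not map onto positions in document.xml.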
After the table-of-contents structure of the source file is obtained, the start position of each chapter is located through the XML elements, and the source file is split accordingly. Note that most e-book tables of contents use only two heading levels to divide the chapters of a book; when the Word file is divided into subfiles by its table of contents for conversion to EPUB, only two levels of headings may likewise be considered. Fig. 5 shows an example of the converted EPUB document.
4) an HTML file generation module, which uses structural information such as text paragraphs and font sizes from the XML parsing result to convert each subfile into a corresponding HTML file;
5) an EPUB file generation module, which packages the pictures, metadata, and other resource files obtained by decompressing the source Word file, together with the table-of-contents file obtained from parsing, into an EPUB file.
It should be noted that the embodiments are disclosed to aid further understanding of the invention. Those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to what the embodiments disclose; the scope of protection claimed is defined by the claims.
Claims (10)
1. A method for converting a Word file into an EPUB file, comprising the following steps:
1) parsing the Word source file to be converted: the Word source file is parsed to produce the corresponding resource files and folders, including multiple XML files and picture files; the Word source file to be converted is a .docx file that follows the OOXML electronic document specification based on the ZIP+XML format; the XML documents obtained by parsing follow the OOXML electronic document specification;
2) parsing the XML files: the multiple XML files in the resulting resource files are parsed to extract the text, paragraphs, font sizes, and heading information of the Word source file;
3) splitting the Word source file: using the result of parsing the XML in step 2), the table-of-contents structure of the Word source file is extracted, and the Word source file is split into multiple Word subfiles according to the corresponding chapter structure;
4) generating HTML files: each subfile is converted into an HTML file;
5) generating the EPUB file: the HTML files generated in step 4), the related resource index, and the table-of-contents file are packaged into an EPUB file.
2. The method for converting a Word file into an EPUB file according to claim 1, characterized in that step 1), parsing the Word source file to be converted, is specifically: the suffix of the Word source file is changed to .zip and the .zip file is decompressed with decompression software to obtain the [Content_Types].xml file, the docProps folder, and the word folder; wherein the [Content_Types].xml file records the names and types of all files in the package; the docProps folder contains the app.xml file, the core.xml file, and the thumbnail.emf file; and the word folder contains the document.xml file, the footnotes.xml file, the endnotes.xml file, the styles.xml file, the numbering.xml file, and the media folder.
3. The method for converting a Word file into an EPUB file according to claim 2, characterized in that step 2), parsing the XML files, specifically uses an XML document parsing tool to parse the nested XML structure in the multiple XML files in the resource files; the XML elements include paragraphs, text, tables, numbering, sections, styles, fonts, headings, footers, fields, links, and tables of contents; the XML document parsing steps include dividing the data into blocks, parsing the blocks in parallel with multiple threads, identifying data tags, identifying data attribute content, and post-processing; the document content and related styles of the Word source file are thereby obtained.
4. The method for converting a Word file into an EPUB file according to claim 1, characterized in that step 3), splitting the Word source file, covers the following cases:
a) if the Word source file contains a table-of-contents structure, table-of-contents recognition is performed on the Word source file to obtain its table of contents; the parsed document.xml file of the Word source file contains a TOC field, which represents the table-of-contents structure through the heading levels and special styles it contains; the content of the corresponding tags is extracted and converted directly into the table-of-contents structure of the EPUB file;
b) if the Word source file contains no table-of-contents structure but has a contents page consisting of plain text, and the contents page has characteristic layout features, the layout features are used to screen for and identify the contents page; the contents page is then parsed to extract headings and page numbers, which are matched to the corresponding document content to generate the table-of-contents structure;
c) if the Word source file contains neither a table-of-contents structure nor a contents page with layout features, heading recognition is performed on the Word source file: a support vector machine classification method is applied to the results of analyzing page whitespace, chapter heading fonts, and headers and footers to extract each heading of the document and its corresponding paragraph content; and, since headings of the same level share a consistent style, clustering is used to extract the hierarchy among the headings and generate the corresponding table of contents.
5. The method for converting a Word file into an EPUB file according to claim 1, characterized in that step 4), generating HTML files, is specifically: for each Word subfile obtained by splitting, a resource index file for the HTML is generated from the XML parsing result, mapping to the resource addresses of the picture files that appear in the Word subfile; combined with the Word text content, each subfile is converted into a corresponding HTML file, which is used to assemble the EPUB file and corresponds to a table-of-contents link address of the converted EPUB.
6. The method for converting a Word file into an EPUB file according to claim 1, characterized in that step 5), generating the EPUB file, is specifically:
first, a mimetype file is added at the target location to declare the EPUB format;
according to the table-of-contents structure, the ncx file of the EPUB is created and navigation links identified by the HTML file names are added, thereby generating the table of contents of the EPUB;
the opf file and the container.xml file are created, and the HTML files and their corresponding resource files are copied;
finally, the above files are packaged to produce the EPUB file.
7. A system for converting a Word file into an EPUB file, comprising a Word file parsing module, an XML file parsing module, a Word file splitting module, an HTML file generation module, and an EPUB file generation module, wherein:
1) the Word file parsing module decompresses the Word file to be converted and produces the corresponding resource files and folders, including multiple XML files;
2) the XML file parsing module parses the XML of the Word file according to the resource files and extracts the text, paragraphs, font sizes, and heading information;
3) the Word file splitting module performs table-of-contents recognition on Word source files that contain a table-of-contents structure and heading recognition on Word source files that do not, in order to extract the table of contents of the Word source file, and splits the Word source file into multiple subfiles by chapter according to the table of contents;
4) the HTML file generation module converts each subfile into a corresponding HTML file;
5) the EPUB file generation module packages the HTML files, the related resource index, and the table-of-contents file into an EPUB file.
8. The system for converting a Word file to an EPUB file as claimed in claim 7, characterized in that the Word file parsing module specifically: changes the suffix of the Word source file to .zip and decompresses the .zip file using decompression software, obtaining the [Content_Types].xml file, the docProps folder, and the word folder; wherein the [Content_Types].xml file records the names and types of all files included; the docProps folder contains the app.xml file, the core.xml file, and the thumbnail.emf file; and the word folder contains the document.xml file, the footnotes.xml file, the endnotes.xml file, the styles.xml file, the numbering.xml file, and the media folder.
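The decompression step in claim 8 can be sketched as follows. Because a .docx file is a ZIP archive, Python's `zipfile` can open it directly, so this sketch skips the rename to .zip; the helper names and the in-memory stand-in package are illustrative assumptions, not part of the patent.

```python
import io
import zipfile

def list_parts(docx):
    """Return the names of all parts stored in a .docx package."""
    with zipfile.ZipFile(docx) as zf:
        return zf.namelist()

def extract_parts(docx, wanted):
    """Read selected parts into memory as bytes, skipping any that are absent."""
    with zipfile.ZipFile(docx) as zf:
        present = set(zf.namelist())
        return {name: zf.read(name) for name in wanted if name in present}

# Build a tiny stand-in package in memory so the sketch runs without a real file.
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("docProps/core.xml", "<coreProperties/>")
    zf.writestr("word/document.xml", "<document/>")
    zf.writestr("word/styles.xml", "<styles/>")

parts = list_parts(fake_docx)
xml_parts = extract_parts(fake_docx, ["word/document.xml", "word/footnotes.xml"])
```

Optional parts such as `word/footnotes.xml` are only present when the document uses them, which is why `extract_parts` tolerates missing names.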
9. The system for converting a Word file to an EPUB file as claimed in claim 7, characterized in that the XML file parsing module specifically uses an XML document parsing tool to parse the nested XML document structure in the multiple XML files among the resource files; the XML tag elements include paragraph, text, table, numbering, section, style, font, heading, footer, field, hyperlink, and table of contents; the XML document parsing steps include dividing the data into blocks, parsing the data blocks in parallel using multiple threads, identifying data tags, identifying data attribute content, and post-processing; the document content and related styles of the Word source file are thereby obtained.
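A minimal sketch of the tag and attribute identification in claim 9, using the standard `xml.etree.ElementTree` parser, is given below. It reads paragraphs and their style ids from a small sample of WordprocessingML. It runs sequentially and does not reproduce the block division or multi-threaded parsing described in the claim; the sample XML and the style id `Heading1` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def extract_paragraphs(document_xml):
    """Return (style_id, text) for each body paragraph in document.xml."""
    root = ET.fromstring(document_xml)
    result = []
    for p in root.iterfind("w:body/w:p", NS):
        # The paragraph style, if any, is stored as <w:pStyle w:val="..."/>.
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Paragraph text is split across runs; join every <w:t> element.
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        result.append((style_id, text))
    return result

SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Chapter 1</w:t></w:r></w:p>
<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world.</w:t></w:r></w:p>
</w:body></w:document>"""

paragraphs = extract_paragraphs(SAMPLE)
```

Style ids vary with the template and language of the source document, so a real converter would likely resolve them through styles.xml rather than matching a fixed name.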
10. The system for converting a Word file to an EPUB file as claimed in claim 7, characterized in that the Word file splitting module handles the following cases:
a) If the Word source file contains a table-of-contents structure, table-of-contents recognition is performed on the Word source file to obtain its table of contents; the document.xml file obtained by parsing the Word source file contains a TOC, and the table-of-contents structure is represented by TOC fields that include heading levels and specific styles; the content of the corresponding tags is extracted and converted directly into the table-of-contents structure of the EPUB file;
b) If the Word source file does not contain a table-of-contents structure but has a table-of-contents page consisting of plain text, and that page has specific typesetting features, the page is screened for and identified using those typesetting features, then further parsed to extract the headings and page numbers, which are matched to the corresponding document content, thereby generating the table-of-contents structure;
c) If the Word source file contains neither a table-of-contents structure nor a table-of-contents page with typesetting features, heading recognition is performed on the Word source file: using a support vector machine classification method, and according to the results of analyzing page whitespace, chapter heading fonts, and headers and footers, each heading of the document and its corresponding paragraph content are extracted; then, exploiting the style consistency among headings of the same level, a clustering method is used to extract the hierarchy among the headings, thereby generating the corresponding table of contents.
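Case b) above, parsing a plain-text table-of-contents page, can be sketched with a regular expression. The assumed typesetting feature is a heading followed by dot leaders, an ellipsis, or a tab, and then a page number; the pattern, the function names, and exact-text matching against body paragraphs are illustrative assumptions rather than the patented procedure.

```python
import re

# A heading, then dot leaders / ellipsis / tab, then a trailing page number.
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)\s*(?:\.{2,}|…+|\t)\s*(?P<page>\d+)\s*$")

def parse_toc_page(lines):
    """Extract (title, page) pairs from lines that look like TOC entries."""
    entries = []
    for line in lines:
        m = TOC_LINE.match(line)
        if m:
            entries.append((m.group("title"), int(m.group("page"))))
    return entries

def locate(entries, paragraphs):
    """Map each TOC title to the index of the first body paragraph with that text."""
    positions = {}
    for title, _page in entries:
        for i, text in enumerate(paragraphs):
            if text.strip() == title:
                positions[title] = i
                break
    return positions

toc_lines = [
    "Contents",
    "Chapter 1 Introduction ........ 1",
    "1.1 Background\t3",
]
body = ["Chapter 1 Introduction", "Some text.", "1.1 Background", "More text."]
entries = parse_toc_page(toc_lines)
positions = locate(entries, body)
```

The paragraph indices in `positions` give the split points for cutting the document into chapter subfiles.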
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810071710.1A CN110083805B (en) | 2018-01-25 | 2018-01-25 | Method and system for converting Word file into EPUB file |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810071710.1A CN110083805B (en) | 2018-01-25 | 2018-01-25 | Method and system for converting Word file into EPUB file |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110083805A (en) | 2019-08-02 |
CN110083805B (en) | 2020-11-27 |
Family
ID=67411893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810071710.1A Active CN110083805B (en) | 2018-01-25 | 2018-01-25 | Method and system for converting Word file into EPUB file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110083805B (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110532233A (en) * | 2019-08-20 | 2019-12-03 | 武汉鼎森电子科技有限公司 | A kind of epub document generating method and system |
CN110705216A (en) * | 2019-09-19 | 2020-01-17 | 深圳前海环融联易信息科技服务有限公司 | Method and device for converting docx file into xml file based on java and computer equipment |
CN110781672A (en) * | 2019-10-30 | 2020-02-11 | 北京爱学习博乐教育科技有限公司 | Question bank production method and system based on machine intelligence |
CN111062187A (en) * | 2019-11-27 | 2020-04-24 | 北京计算机技术及应用研究所 | Structured parsing method and system for docx format document |
CN111144069A (en) * | 2019-12-30 | 2020-05-12 | 北大方正集团有限公司 | Table-based directory typesetting method and device and storage medium |
CN111581948A (en) * | 2020-04-03 | 2020-08-25 | 北京百度网讯科技有限公司 | Document analysis method, device, equipment and storage medium |
CN111881650A (en) * | 2020-07-20 | 2020-11-03 | 北京百度网讯科技有限公司 | PDF document generation method and device and electronic equipment |
CN112528080A (en) * | 2019-09-03 | 2021-03-19 | 北京国双科技有限公司 | Method and device for extracting text content of docx file |
CN112597422A (en) * | 2020-12-30 | 2021-04-02 | 深圳市世强元件网络有限公司 | PDF file segmentation method and PDF file loading method in webpage |
CN112686000A (en) * | 2020-12-24 | 2021-04-20 | 掌阅科技股份有限公司 | Format conversion method of electronic book document, electronic equipment and storage medium |
CN112699641A (en) * | 2021-03-25 | 2021-04-23 | 南京国睿信维软件有限公司 | Method for quickly converting batch copy of WORD content to DM based on S1000D standard |
CN113268585A (en) * | 2021-04-28 | 2021-08-17 | 企查查科技有限公司 | Report file generation method and device, computer equipment and storage medium |
CN113361256A (en) * | 2021-06-24 | 2021-09-07 | 上海真虹信息科技有限公司 | Rapid Word document parsing method based on Aspose technology |
CN113761840A (en) * | 2021-09-08 | 2021-12-07 | 中信建投证券股份有限公司 | Intelligent document processing method, system, computer device and medium |
CN113779931A (en) * | 2021-08-31 | 2021-12-10 | 民商数字科技(深圳)有限公司 | Knowledge base construction method based on Word and control method thereof |
CN116612491A (en) * | 2023-07-17 | 2023-08-18 | 中国电子科技集团公司第十研究所 | ARM kylin WORD file content extraction method |
CN116861847A (en) * | 2023-06-21 | 2023-10-10 | 三峡高科信息技术有限责任公司 | Online Office file previewing method and system |
CN117421357A (en) * | 2023-06-07 | 2024-01-19 | 广州市公安局***侦查支队 | Method, system, device and storage medium for importing funds stream data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060236215A1 (en) * | 2005-04-14 | 2006-10-19 | Jenn-Sheng Wu | Method and system for automatically creating document |
CN104699714A (en) * | 2013-12-09 | 2015-06-10 | 北大方正集团有限公司 | Method and device for transferring files of book edition format into files of EPUB format |
CN106326194A (en) * | 2015-07-06 | 2017-01-11 | 北大方正集团有限公司 | Directory generation method and apparatus applied to file format conversion scene |
CN106886509A (en) * | 2017-03-06 | 2017-06-23 | 大连理工大学 | A kind of academic dissertation form automatic testing method |
Non-Patent Citations (1)
Title |
---|
房婧等: ""板式电子文档表格自动检测与性能评估"", 《北京大学学报(自然科学版)》 * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110532233A (en) * | 2019-08-20 | 2019-12-03 | 武汉鼎森电子科技有限公司 | A kind of epub document generating method and system |
CN112528080A (en) * | 2019-09-03 | 2021-03-19 | 北京国双科技有限公司 | Method and device for extracting text content of docx file |
CN110705216A (en) * | 2019-09-19 | 2020-01-17 | 深圳前海环融联易信息科技服务有限公司 | Method and device for converting docx file into xml file based on java and computer equipment |
CN110705216B (en) * | 2019-09-19 | 2023-11-03 | 深圳前海环融联易信息科技服务有限公司 | Method and device for converting docx file into xml file based on java and computer equipment |
CN110781672A (en) * | 2019-10-30 | 2020-02-11 | 北京爱学习博乐教育科技有限公司 | Question bank production method and system based on machine intelligence |
CN110781672B (en) * | 2019-10-30 | 2024-01-30 | 北京爱学习博乐教育科技有限公司 | Question bank production method and system based on machine intelligence |
CN111062187A (en) * | 2019-11-27 | 2020-04-24 | 北京计算机技术及应用研究所 | Structured parsing method and system for docx format document |
CN111144069A (en) * | 2019-12-30 | 2020-05-12 | 北大方正集团有限公司 | Table-based directory typesetting method and device and storage medium |
CN111144069B (en) * | 2019-12-30 | 2021-12-03 | 北大方正集团有限公司 | Table-based directory typesetting method and device and storage medium |
CN111581948B (en) * | 2020-04-03 | 2024-02-09 | 北京百度网讯科技有限公司 | Document analysis method, device, equipment and storage medium |
CN111581948A (en) * | 2020-04-03 | 2020-08-25 | 北京百度网讯科技有限公司 | Document analysis method, device, equipment and storage medium |
CN111881650A (en) * | 2020-07-20 | 2020-11-03 | 北京百度网讯科技有限公司 | PDF document generation method and device and electronic equipment |
CN112686000A (en) * | 2020-12-24 | 2021-04-20 | 掌阅科技股份有限公司 | Format conversion method of electronic book document, electronic equipment and storage medium |
CN112597422A (en) * | 2020-12-30 | 2021-04-02 | 深圳市世强元件网络有限公司 | PDF file segmentation method and PDF file loading method in webpage |
CN112699641A (en) * | 2021-03-25 | 2021-04-23 | 南京国睿信维软件有限公司 | Method for quickly converting batch copy of WORD content to DM based on S1000D standard |
CN113268585A (en) * | 2021-04-28 | 2021-08-17 | 企查查科技有限公司 | Report file generation method and device, computer equipment and storage medium |
CN113361256A (en) * | 2021-06-24 | 2021-09-07 | 上海真虹信息科技有限公司 | Rapid Word document parsing method based on Aspose technology |
CN113779931A (en) * | 2021-08-31 | 2021-12-10 | 民商数字科技(深圳)有限公司 | Knowledge base construction method based on Word and control method thereof |
CN113761840A (en) * | 2021-09-08 | 2021-12-07 | 中信建投证券股份有限公司 | Intelligent document processing method, system, computer device and medium |
CN117421357A (en) * | 2023-06-07 | 2024-01-19 | 广州市公安局***侦查支队 | Method, system, device and storage medium for importing funds stream data |
CN116861847A (en) * | 2023-06-21 | 2023-10-10 | 三峡高科信息技术有限责任公司 | Online Office file previewing method and system |
CN116861847B (en) * | 2023-06-21 | 2024-02-13 | 三峡高科信息技术有限责任公司 | Online Office file previewing method and system |
CN116612491A (en) * | 2023-07-17 | 2023-08-18 | 中国电子科技集团公司第十研究所 | ARM kylin WORD file content extraction method |
Also Published As
Publication number | Publication date |
---|---|
CN110083805B (en) | 2020-11-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110083805B (en) | Method and system for converting Word file into EPUB file | |
CN111753499B (en) | Method for merging and displaying electronic form and OFD format file and generating directory | |
US8250469B2 (en) | Document layout extraction | |
US20150033116A1 (en) | Systems, Methods, and Media for Generating Structured Documents | |
CN105446946B (en) | Rearrangement method, system and the electronic reading terminal of format document | |
WO2009000141A1 (en) | Representation method, system and device of layout file logical structure information | |
CN101548273A (en) | Determining fields for presentable files and extensible markup language schemas for bibliographies and citations | |
JP2009524883A (en) | Presenting digital content to the network | |
JP2000148736A (en) | Methods for font acquisition, registration, display, and printing, method for handling document having variant fonts, and recording medium thereof | |
US20100080493A1 (en) | Associating optical character recognition text data with source images | |
WO2013146394A1 (en) | Information processing terminal and method, and information management apparatus and method | |
US9043343B2 (en) | Identifier assigning method, identifier parsing method, and multimedia reading | |
CN103309879A (en) | Method and device for managing marks in WORD document | |
CN112433995B (en) | File format conversion method, system, computer device and storage medium | |
US20120109638A1 (en) | Electronic device and method for extracting component names using the same | |
US9817913B2 (en) | Method and apparatus for collecting, merging and presenting content | |
CN110554996A (en) | method and system for quickly opening epub file | |
US20120192046A1 (en) | Generation of a source complex document to facilitate content access in complex document creation | |
CN107423271B (en) | Document generation method and device | |
US8566366B2 (en) | Format conversion apparatus and file search apparatus capable of searching for a file as based on an attribute provided prior to conversion | |
CN112818687B (en) | Method, device, electronic equipment and storage medium for constructing title recognition model | |
CN105320716A (en) | Automatic labeling method for digital publication | |
CN111401005B (en) | Text conversion method and device and readable storage medium | |
JP5707937B2 (en) | Electronic document conversion apparatus and electronic document conversion method | |
CN111143719A (en) | Online publication method, device and equipment of thesis and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||