CN110083805A - Method and system for converting a Word file to an EPUB file - Google Patents

Method and system for converting a Word file to an EPUB file

Info

Publication number
CN110083805A
Authority
CN
China
Prior art keywords
file
word
xml
epub
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810071710.1A
Other languages
Chinese (zh)
Other versions
CN110083805B (en
Inventor
高良才
陈嘉云
汤帜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201810071710.1A
Publication of CN110083805A
Application granted
Publication of CN110083805B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method and system for converting a Word file into an EPUB file. For Word files in .docx format, the method identifies and processes the table of contents of the source Word file, so that the table-of-contents structure of the source document is recognized and an EPUB e-book is generated automatically. The steps are: Word file parsing, XML file parsing, Word file splitting, HTML file generation, and EPUB file generation. Because it recognizes the table of contents of the source Word file, the method addresses problems of the prior art such as poor conversion quality and the tedious, inefficient process of adding a table of contents by hand; it preserves the integrity of the document content, improves conversion quality, and improves working efficiency.

Description

Method and system for converting a Word file to an EPUB file
Technical field
The present invention relates to document processing technology, and in particular to a method and system for converting a Word file into an EPUB (Electronic Publication) file.
Background art
In the era of digital publishing and "Internet+", and with the rapid development of mobile communication and online publishing, e-books have become increasingly widespread and popular. The digital age has changed people's reading habits: fragmented and mobile reading on e-readers, smartphones, and similar devices has become a widely accepted and well-liked way to read. Because devices, platforms, and publishing media differ, many e-book formats have appeared on the market, such as TXT, PDF, EPUB, Mobi, Azw3, CEB/CEBX, CAJ, and PDG. Among the popular formats, EPUB is the official standard of the International Digital Publishing Forum (IDPF); because it supports a variety of complex layouts and adapts to the device screen, it is ranked alongside PDF and Mobi as one of the three mainstream e-book formats. Word and PDF, as the most common office document formats, are the two manuscript formats most often used in the publishing industry. Publishing and distributing e-books often requires conversion between e-book formats, and the need to convert between document formats also arises frequently in software development.
Microsoft Office Word is currently the most commonly used electronic document tool. Word files come in .doc and .docx formats: the former is the MS-Word binary format, while the latter follows OOXML (Office Open XML), the XML-based, ZIP-compressed electronic document specification developed by Microsoft. The usual way to parse a Word file is to extract the relevant information from the decompressed file and convert it into a corresponding HTML file for further processing.
The EPUB format uses ZIP compression. A decompressed EPUB file consists mainly of three parts: the mimetype file, which declares the file format of the EPUB; the OEBPS folder, which stores the core content files of the e-book, such as the OPF, NCX, CSS, and HTML files; and the META-INF folder, which contains a number of files describing properties of the e-book. Generating an EPUB e-book generally involves four steps: adding the mimetype file; packaging all resource files; creating the core content files such as the opf and ncx files; and finally creating the corresponding property files and compressing everything into the EPUB format.
Many file format conversion tools exist today, in the form of online services, desktop applications, and API interfaces. Conversion quality is commonly judged by the integrity of content such as text, figures, bookmarks, and table-of-contents structure, by how well attributes such as headings, fonts, and font sizes are retained, and by the handling of special documents. Existing conversion tools cover formats such as Word, PDF, EPUB, and Excel, but relatively few technical solutions convert Word files to EPUB files. In particular, for Word files that contain a table of contents, whether a file with navigation bookmarks or a file whose contents page has no jump links, the prior art converts poorly and is prone to losing the table-of-contents structure and scrambling the text.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a method and system for converting a Word file into an EPUB file. For Word files in .docx format, it identifies and processes the table of contents of the source Word file, extracts the table-of-contents structure of the source Word file, and automatically generates an EPUB e-book.
The technical solution of the present invention is as follows:
A method for converting a Word file into an EPUB file, comprising the following steps:
1) Word file parsing: obtain the Word file to be converted (a .docx file) and decompress it to produce the corresponding resource files and folders, which include a number of XML documents, pictures, and other files;
2) XML file parsing: parse the XML in the resource files to obtain the table of contents, heading formats, text styles, and similar content of the source Word file, and extract information such as text, paragraphs, fonts, and font sizes for the subsequent generation of HTML files;
3) Word file splitting: based on the parsing result of the source Word file, handle different Word files case by case: for a source file that contains a table-of-contents structure, perform table-of-contents recognition to obtain its table of contents; for a source file that does not contain a table of contents, perform heading recognition to extract its table of contents; then split the source Word file into multiple subfiles, chapter by chapter, according to the table of contents;
4) HTML file generation: based on the XML parsing result, use structural information such as text paragraphs and font sizes to convert each subfile into a corresponding HTML file;
5) EPUB file generation: package the HTML files, together with the resource files obtained by decompressing the source Word file (pictures, metadata, and so on) and the table-of-contents file obtained from parsing, into an EPUB file.
The present invention also provides a system, implemented with the above method, for converting a Word file into an EPUB file, comprising: 1) a Word file parsing module, which obtains the Word file to be converted (a .docx file) and decompresses it to produce the corresponding resource files and folders, including a number of XML documents, pictures, and other files; 2) an XML file parsing module, which parses the XML in the resource files to obtain the table of contents, heading formats, text styles, and similar content of the source Word file, and extracts information such as text, paragraphs, fonts, and font sizes for the subsequent generation of HTML files; 3) a Word file splitting module, which, based on the parsing result of the source Word file, handles different Word files case by case, that is, performs table-of-contents recognition on a source file that contains a table-of-contents structure and heading recognition on a source file that does not, so as to extract the table of contents, and then splits the source Word file into multiple subfiles, chapter by chapter, according to the table of contents; 4) an HTML file generation module, which, based on the XML parsing result, uses structural information such as text paragraphs and font sizes to convert each subfile into a corresponding HTML file; 5) an EPUB file generation module, which packages the HTML files, the resource files obtained by decompressing the source Word file (pictures, metadata, and so on), and the table-of-contents file obtained from parsing into an EPUB file. Each module is described in detail below.
Word file parsing module. Word files from Microsoft Word 2007 and later have the suffix .docx and follow the OOXML electronic document specification, which is based on a zip+xml format. After the suffix of the Word file is changed to .zip and the file is decompressed with decompression software, the result is the [Content_Types].xml file (which records the names and types of all the files contained) and three folders: _rels, docProps, and word. The docProps folder mainly records the properties of the Word document: app.xml records statistical attributes such as page count and word count, core.xml records core attributes such as the creation time and author, and thumbnail.emf is the document thumbnail. The word folder contains document.xml, which records the body text; footnotes.xml, which records the footnotes; endnotes.xml, which records the endnotes; styles.xml, which records style information; numbering.xml, which records numbering sequences; and the media folder, which stores picture resources.
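The decompression step can be sketched in Python with the standard zipfile module, which reads a .docx package directly without renaming it to .zip. This is a minimal illustration under stated assumptions, not the patented implementation: the function names and the tiny stand-in package built by make_sample_docx are hypothetical, and the stand-in is not a valid Word file.

```python
import io
import zipfile


def list_docx_parts(docx_bytes: bytes) -> list:
    """Return the sorted names of all parts stored inside a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(package.namelist())


def read_docx_part(docx_bytes: bytes, part_name: str) -> bytes:
    """Return the raw bytes of one part, e.g. 'word/document.xml'."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(part_name)


def make_sample_docx() -> bytes:
    """Build a tiny stand-in package for demonstration (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx()
    for name in list_docx_parts(sample):
        print(name)
```

With a real file, the bytes would come from open(path, "rb").read(), and the listing would show the _rels, docProps, and word folders described above.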
XML file parsing module. The XML files obtained by decompressing the Word file follow the OOXML standard. Among them, document.xml holds the main content of the source Word file, and its structure is made up mainly of elements such as paragraphs and tables. For Word, the XML markup elements of an OOXML document mainly include paragraphs, text, tables, numbering, sections, styles, fonts, headings, footers, fields, links, and tables of contents. An XML file is itself a tree structure; the general parsing steps are to divide the file into data blocks, parse the blocks in parallel with multiple threads, identify the data tags and attribute content, and apply some post-processing. Using open-source tools such as OpenXMLSDK, the nested XML structures in document.xml, app.xml, endnotes.xml, footnotes.xml, numbering.xml, styles.xml, and other files are parsed to obtain the content and associated styles of the Word file and, in particular, its table of contents and heading formats.
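As a rough illustration of this parsing step, the sketch below walks document.xml with Python's standard xml.etree.ElementTree instead of OpenXMLSDK (a substitution made here for brevity). It collects, for each w:p paragraph, the paragraph style id and the concatenated w:t text. The function name and the sample XML are assumptions for illustration; the multi-threaded block parsing mentioned above is not reproduced.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# WordprocessingML main namespace used by document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def extract_paragraphs(document_xml: str) -> List[Tuple[Optional[str], str]]:
    """Return (style id or None, text) for every w:p element, in document order."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # A paragraph's text is spread over runs; join every w:t below it.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append((style_id, text))
    return paragraphs


# Hypothetical sample: one heading paragraph and one body paragraph.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:t>Chapter 1</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

if __name__ == "__main__":
    for style_id, text in extract_paragraphs(SAMPLE):
        print(style_id, "|", text)
```

The style id is only a key; resolving it to a font or font size would require a lookup in styles.xml, which this sketch leaves out.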
Word file splitting module. Using the XML parsing result, the table-of-contents structure of the source Word file is extracted, and the source file is split into multiple Word subfiles according to the corresponding chapter structure. Extraction of the table of contents is handled in three cases:
A) The source file carries a navigable table-of-contents structure. A Word document represents the table of contents by TOC fields that carry heading levels and specific styles; extracting the content of the corresponding tags allows direct conversion into the table-of-contents structure of the EPUB file;
B) The source file contains no table-of-contents structure but has a contents page made of plain text. Such a contents page usually has distinctive layout features; these features are used to screen for and confirm the contents page, which is then parsed to extract the titles and page numbers; finally these are matched to the corresponding document content to generate the table-of-contents structure;
C) The source file contains neither a table-of-contents structure nor a contents page with distinctive layout features. For such files, a classification method such as SVM is applied to the results of analyzing page whitespace, chapter heading fonts, and headers and footers, in order to extract each heading of the document and its corresponding paragraph content; because headings of the same level share a consistent style, a clustering method is then used to extract the hierarchy among the headings and generate the corresponding table of contents.
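Case B can be illustrated with a small Python sketch that treats a line as a contents entry when a title is followed by dot leaders or a tab and then a page number. The regular expression, the function names, and the 0.5 threshold are assumptions chosen for illustration; the source does not specify which layout features are used or how they are weighted.

```python
import re

# A title, then dot leaders (at least two dots) or a tab, then a page number.
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)\s*(?:[.·…]{2,}|\t)\s*(?P<page>\d+)\s*$")


def parse_toc_lines(lines):
    """Return (title, page number) for every line that looks like a contents entry."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line)
        if match:
            entries.append((match.group("title"), int(match.group("page"))))
    return entries


def looks_like_toc_page(lines, min_ratio=0.5):
    """Heuristic: a page is a contents page if enough non-empty lines are entries."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return False
    return len(parse_toc_lines(non_empty)) / len(non_empty) >= min_ratio


if __name__ == "__main__":
    page = [
        "Contents",
        "Chapter 1 Introduction ........ 1",
        "Chapter 2 Methods\t15",
        "Some ordinary sentence.",
    ]
    print(looks_like_toc_page(page), parse_toc_lines(page))
```

The extracted titles would still need to be matched against body paragraphs to find where each chapter starts, since page numbers in a Word file do not map directly onto XML positions.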
Once the table-of-contents structure of the source file has been obtained, the starting position of each chapter's paragraphs is located through the XML elements, and the source file is extracted and split accordingly. Notably, most e-book tables of contents use only two levels of headings to divide a book into chapters; accordingly, when a Word file is split into subfiles by its table of contents during conversion to an EPUB e-book, only two levels of headings may be considered.
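The splitting rule can be sketched as follows, assuming the paragraphs have already been reduced to (heading level or None, text) pairs. That input shape, the function name, and the handling of text that precedes the first heading are assumptions made for illustration; the source describes splitting at the XML level rather than on such pairs.

```python
def split_into_chapters(paragraphs, max_level=2):
    """Group paragraphs into chapters, starting a new one at each heading
    whose level is at most max_level; deeper headings stay in the body."""
    chapters = []
    current = None
    for level, text in paragraphs:
        if level is not None and level <= max_level:
            current = {"title": text, "level": level, "body": []}
            chapters.append(current)
        elif current is not None:
            current["body"].append(text)
        else:
            # Text before the first heading becomes an untitled front section.
            current = {"title": "", "level": 0, "body": [text]}
            chapters.append(current)
    return chapters


if __name__ == "__main__":
    sample = [
        (None, "Preface text"),
        (1, "Volume One"),
        (None, "p1"),
        (3, "Minor heading"),
        (None, "p2"),
        (2, "Part 1"),
        (None, "p3"),
    ]
    for chapter in split_into_chapters(sample):
        print(chapter)
```

Setting max_level to 2 mirrors the two-level limit described above: a level-3 heading does not start a new subfile and is kept as ordinary body text.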
HTML file generation module. For each Word subfile produced by splitting, an HTML resource index file is generated from the XML parsing result; it maps the resource addresses of the picture files and other resources that appear in the subfile. Combined with the Word text content, each subfile is finally converted into a corresponding HTML file. The converted HTML files are used mainly to assemble the EPUB file, and each file name corresponds to the link address of a chapter in the table of contents of the converted EPUB file.
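A minimal sketch of the conversion of one subfile into an XHTML page is given below. It assumes the subfile has already been reduced to a title and a list of plain-text paragraphs; the function name is hypothetical, and pictures, fonts, font sizes, and the resource index file are deliberately left out.

```python
from html import escape


def chapter_to_xhtml(title, paragraphs):
    """Render one chapter as a small XHTML document with escaped text."""
    body = "\n".join(f"    <p>{escape(text)}</p>" for text in paragraphs)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"  <head><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        f"    <h1>{escape(title)}</h1>\n"
        f"{body}\n"
        "  </body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(chapter_to_xhtml("A & B", ["x < y", "plain"]))
```

Escaping matters here because EPUB content documents are XML: an unescaped ampersand or angle bracket from the Word text would make the page ill-formed.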
EPUB file generation module. The EPUB format uses ZIP compression. To generate the EPUB e-book, the mimetype file is first added at the target storage location to declare the EPUB format. According to the table-of-contents structure, the ncx file of the EPUB is created, and navigation links identified by the HTML file names are added to it, producing the EPUB table of contents. From the metadata of the source file and the table-of-contents and file information of the EPUB, the opf file is created, and the HTML files and their resource files are copied into the OPS folder. According to the opf file, the container.xml file is created and stored in the META-INF folder. Finally, the mimetype file, the OPS folder, and the META-INF folder are packaged, the intermediate files are deleted, and the final converted EPUB file is produced.
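The packaging order can be sketched with Python's zipfile module. The sketch writes the archive in memory rather than through intermediate files, and it assumes the opf, ncx, and HTML contents are supplied as strings; the names build_epub, content.opf, and toc.ncx are illustrative choices, not taken from the source. The mimetype entry is written first and stored without compression, as the EPUB container format requires.

```python
import io
import zipfile

# container.xml tells a reading system where the package (opf) file lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(opf_xml, ncx_xml, html_files):
    """Package the given parts into EPUB bytes; html_files maps name -> content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as epub:
        # mimetype must be the first entry and must not be compressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML)
        epub.writestr("OPS/content.opf", opf_xml)
        epub.writestr("OPS/toc.ncx", ncx_xml)
        for name, content in html_files.items():
            epub.writestr(f"OPS/{name}", content)
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_epub("<package/>", "<ncx/>", {"chapter1.html": "<html/>"})
    with zipfile.ZipFile(io.BytesIO(data)) as epub:
        print(epub.namelist())
```

The placeholder opf and ncx strings in the usage example are not valid package documents; a real conversion would generate them from the metadata and table of contents described above.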
Compared with the prior art, the beneficial effects of the present invention are:
The present invention provides a method and system for converting a Word file into an EPUB file that can be applied widely wherever common file formats are converted, and especially in the publishing industry, which makes heavy use of Word and EPUB manuscripts. Because the EPUB generation method provided by the invention recognizes the table of contents of the source Word file, it addresses problems of the prior art such as poor conversion quality and the tedious, inefficient process of adding a table of contents by hand; it preserves the integrity of the document content, improves conversion quality, and improves working efficiency.
Brief description of the drawings
Fig. 1 is an example of an OOXML document.
Fig. 2 is a screenshot of a Word file with a table-of-contents structure in an embodiment of the present invention.
Fig. 3 is a screenshot of a Word file with a plain-text contents page in an embodiment of the present invention.
Fig. 4 is a screenshot of a Word file that contains no table of contents in an embodiment of the present invention.
Fig. 5 is a screenshot of the converted EPUB file with its table of contents in an embodiment of the present invention.
Fig. 6 is a flow diagram of the method of the present invention.
Detailed description of the embodiments
The present invention is further described below through embodiments, with reference to the accompanying drawings, without limiting the scope of the invention in any way.
The present invention provides a method and system for converting a Word file into an EPUB file. For Word files in .docx format, it identifies and processes the table of contents of the source Word file, extracts the table-of-contents structure of the source Word file, and automatically generates an EPUB e-book. The method mainly comprises five steps: Word file parsing, XML file parsing, Word file splitting, HTML file generation, and EPUB file generation; see Fig. 6.
The implementation of the invention is illustrated below by converting a .docx file of "Six Chapters of a Floating Life" (hereinafter Document 1). The specific example is as follows:
1) Obtain and decompress the Word file to be converted to obtain a number of XML files and other resource files.
The suffix of Document 1 is changed to .zip, and the file is decompressed with a decompression tool. The resulting content includes the [Content_Types].xml file and the _rels, customXml, docProps, and word folders. Each folder contains a number of XML files. The word folder holds the core content of Document 1, including document.xml, which records the body text, and numbering.xml, which records numbering sequences.
2) Parse the XML files and extract content such as text, paragraphs, and headings.
The XML files are parsed with a third-party open-source API to obtain each tag name, attribute, and text content in the XML. Nested paragraph content is parsed recursively to obtain the corresponding hierarchy. For example, <w:p> corresponds to a paragraph, <w:t> to text, <w:hyperlink> to a link, <w:bookmarkStart> to the start of a bookmark, and <w:bookmarkEnd> to the end of a bookmark.
3) Extract the table-of-contents structure of the source file from this content.
Document 1 itself contains a table of contents with navigation. The TOC fields are located in document.xml, and parsing the tags in which they sit yields each heading in the table of contents. For example, in document.xml a link is marked by a w:hyperlink element whose w:anchor attribute records the link target, such as _Toc502003199; searching the text for the bookmark whose w:name is _Toc502003199 gives the jump target of that heading in the table of contents. Looping over all TOC field tags in the document extracts the table-of-contents structure of the source file.
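The anchor-to-bookmark lookup described in step 3 can be sketched in Python with xml.etree.ElementTree. The sketch assumes that every contents entry is a w:hyperlink whose w:anchor starts with _Toc and that each target w:bookmarkStart sits inside a paragraph; the function name and sample XML are hypothetical. In a real document the link text usually also contains the tab and page number, which this sketch does not strip.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def extract_toc_entries(document_xml):
    """Return (link text, anchor name, index of target paragraph or None)."""
    root = ET.fromstring(document_xml)
    # Map each bookmark name to the index of the paragraph that contains it.
    bookmark_paragraph = {}
    for index, paragraph in enumerate(root.iter(w("p"))):
        for bookmark in paragraph.iter(w("bookmarkStart")):
            bookmark_paragraph[bookmark.get(w("name"))] = index
    entries = []
    for link in root.iter(w("hyperlink")):
        anchor = link.get(w("anchor"))
        if anchor and anchor.startswith("_Toc"):
            text = "".join(t.text or "" for t in link.iter(w("t")))
            entries.append((text, anchor, bookmark_paragraph.get(anchor)))
    return entries


# Hypothetical sample: a contents entry followed by the heading it points to.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:hyperlink w:anchor="_Toc502003199">'
    "<w:r><w:t>Volume One</w:t></w:r></w:hyperlink></w:p>"
    '<w:p><w:bookmarkStart w:id="0" w:name="_Toc502003199"/>'
    '<w:r><w:t>Volume One</w:t></w:r><w:bookmarkEnd w:id="0"/></w:p>'
    "</w:body></w:document>"
)

if __name__ == "__main__":
    print(extract_toc_entries(SAMPLE))
```

The paragraph index returned for each entry is what a later splitting step could use as the starting position of the chapter.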
4) Split the source file chapter by chapter according to the table-of-contents structure.
The table of contents extracted from Document 1 is stored in a data structure that is convenient to process. Using a document processing tool, the source document is split into multiple subfiles according to the paragraph addresses of the headings at each level of the table of contents. That is, chapters in the document such as "Volume 1: The Joys of the Boudoir (1)", "Volume 2: The Pleasures of Leisure (1)", and "Volume 2: The Pleasures of Leisure (2)" are each saved as a separate subfile.
5) Convert the content of each subfile into a corresponding HTML file.
Based on the content of the split subfiles and the relevant XML parsing result, a document processing tool converts each subfile into a corresponding HTML file, and the HTML files are numbered and named according to the hierarchy of the document. For example, the HTML file converted from the chapter "Volume 2: The Pleasures of Leisure (1)" can be named "chapter2_1.html".
6) Package the HTML files and related resource files to generate the EPUB file.
At the target storage location, the mimetype file is created first to declare the EPUB format. The table-of-contents file of the EPUB is created from the HTML files corresponding to the subfiles: in the ncx file, each <navLabel> tag takes the chapter heading as its text, and the id of each <navPoint> tag corresponds to the name of the HTML file of that chapter. The opf and other files are then created from the core content of the document. Finally, all resource files are packaged to generate the final EPUB file.
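A flat ncx table of contents of the kind described in step 6 can be sketched in Python as below. The function name, the use of the file name without its extension as the navPoint id, and the omission of the ncx head metadata (such as the dtb:uid entry a complete ncx file carries) are simplifying assumptions; nested navPoints for second-level headings are also left out.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"


def build_ncx(book_title, chapters):
    """Build a flat NCX document; chapters is a list of (heading, html file name)."""
    ET.register_namespace("", NCX_NS)  # serialize NCX as the default namespace

    def q(tag):
        return f"{{{NCX_NS}}}{tag}"

    ncx = ET.Element(q("ncx"), {"version": "2005-1"})
    doc_title = ET.SubElement(ncx, q("docTitle"))
    ET.SubElement(doc_title, q("text")).text = book_title
    nav_map = ET.SubElement(ncx, q("navMap"))
    for order, (heading, file_name) in enumerate(chapters, start=1):
        # The id is derived from the HTML file name of the chapter.
        point_id = file_name.rsplit(".", 1)[0]
        point = ET.SubElement(nav_map, q("navPoint"),
                              {"id": point_id, "playOrder": str(order)})
        label = ET.SubElement(point, q("navLabel"))
        ET.SubElement(label, q("text")).text = heading
        ET.SubElement(point, q("content"), {"src": file_name})
    return ET.tostring(ncx, encoding="unicode")


if __name__ == "__main__":
    print(build_ncx("Book", [("Volume One", "chapter1.html"),
                             ("Volume Two", "chapter2.html")]))
```

Each content element's src attribute is the link that a reading system follows when the reader selects the entry, which is why the HTML file names must match those written into the package.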
Using the above method, the present invention implements a system for converting a Word file into an EPUB file, comprising:
1) a Word file parsing module, which obtains the Word file to be converted (a .docx file) and decompresses it to produce the corresponding resource files and folders, including a number of XML documents, pictures, and other files;
2) an XML file parsing module, which parses the XML in the resource files to obtain the table of contents, heading formats, text styles, and similar content of the source Word file, and extracts information such as text, paragraphs, fonts, and font sizes for the subsequent generation of HTML files;
Fig. 1 shows an example of an OOXML document. For Word, the XML markup elements of an OOXML document mainly include paragraphs, text, tables, numbering, sections, styles, fonts, headings, footers, fields, links, and tables of contents. An XML file is itself a tree structure; the general parsing steps are to divide the file into data blocks, parse the blocks in parallel with multiple threads, identify the data tags and attribute content, and apply some post-processing. Using open-source tools such as OpenXMLSDK, the nested XML structures in document.xml, app.xml, endnotes.xml, footnotes.xml, numbering.xml, styles.xml, and other files are parsed to obtain the content and associated styles of the Word file and, in particular, its table of contents and heading formats.
3) a Word file splitting module, which, based on the parsing result of the source Word file, handles different Word files case by case: it performs table-of-contents recognition on a source file that contains a table-of-contents structure and heading recognition on a source file that does not, so as to extract the table of contents, and then splits the source Word file into multiple subfiles, chapter by chapter, according to the table of contents;
Extraction of the table of contents of the Word file is handled in three cases:
A) The source file carries a navigable table-of-contents structure, which shows up as a TOC (Table of Contents) field in the parsed document.xml file. A Word document represents the table of contents by TOC fields that carry heading levels and specific styles; extracting the content of the corresponding tags allows direct conversion into the table-of-contents structure of the EPUB file. See Fig. 2 for a specific example;
B) The source file contains no table-of-contents structure but has a contents page made of plain text. Such a contents page usually has distinctive layout features, for example text containing the word "Contents", large numbers of dot symbols, large numbers of line breaks and similar symbols, and lines that are indented or begin with a number. These features are used to screen for and confirm the contents page, which is then parsed to extract the titles and page numbers; finally these are matched to the corresponding document content to generate the table-of-contents structure. See Fig. 3 for a specific example;
C) The source file contains neither a table-of-contents structure nor a contents page with distinctive layout features. For such files, a classification method such as SVM is applied to the results of analyzing page whitespace, chapter heading fonts, and headers and footers, in order to extract each heading of the document and its corresponding paragraph content; because headings of the same level share a consistent style, a clustering method is then used to extract the hierarchy among the headings and generate the corresponding table of contents. See Fig. 4 for a specific example.
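The style-consistency idea in case C can be illustrated with a deliberately simplified Python sketch. It assumes the heading candidates have already been found (the SVM classification step is not shown) and that each carries a font size and a bold flag. Grouping by exact (size, bold) match is used here as a stand-in for the clustering method the text mentions, and larger fonts are assumed to mark higher levels; both are assumptions, as is the function name.

```python
def assign_heading_levels(candidates):
    """candidates: list of (text, font size in points, is_bold).
    Headings with the same style signature get the same level;
    signatures are ranked by font size, then bold before non-bold."""
    signatures = sorted(
        {(size, bold) for _, size, bold in candidates},
        key=lambda signature: (-signature[0], not signature[1]),
    )
    level_of = {signature: rank for rank, signature in enumerate(signatures, start=1)}
    return [(text, level_of[(size, bold)]) for text, size, bold in candidates]


if __name__ == "__main__":
    headings = [
        ("Volume One", 22, True),
        ("Section A", 16, True),
        ("Section B", 16, True),
        ("Volume Two", 22, True),
    ]
    print(assign_heading_levels(headings))
```

A real clustering step would tolerate small style variations (for example 15.5 pt versus 16 pt) that exact matching treats as different levels.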
Once the table-of-contents structure of the source file has been obtained, the starting position of each chapter's paragraphs is located through the XML elements, and the source file is extracted and split accordingly. Notably, most e-book tables of contents use only two levels of headings to divide a book into chapters; accordingly, when a Word file is split into subfiles by its table of contents during conversion to an EPUB e-book, only two levels of headings may be considered. Fig. 5 shows an example of the converted EPUB document.
4) an HTML file generation module, which, based on the XML parsing result, uses structural information such as text paragraphs and font sizes to convert each subfile into a corresponding HTML file;
5) an EPUB file generation module, which packages the HTML files, together with the resource files obtained by decompressing the source Word file (pictures, metadata, and so on) and the table-of-contents file obtained from parsing, into an EPUB file.
It should be noted that the embodiments are disclosed to aid further understanding of the present invention; those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what the embodiments disclose, and the scope of protection claimed is defined by the claims.

Claims (10)

1. A method for converting a Word file into an EPUB file, comprising the following steps:
1) parsing the Word source file to be converted: the Word source file to be converted is parsed to produce the corresponding resource files and folders, including multiple XML files and picture files; the Word source file to be converted is a .docx file and follows the OOXML electronic document specification based on the ZIP+XML format; the XML documents obtained by parsing follow the OOXML electronic document specification;
2) parsing the XML files: XML parsing is performed on the multiple XML files in the resulting resource files to extract the text, paragraph, font size, and heading information of the Word source file;
3) splitting the Word source file: using the XML parsing result of step 2), the table-of-contents structure of the Word source file is extracted, and the Word source file is split into multiple Word subfiles according to the corresponding chapter structure;
4) generating HTML files: each subfile is converted into an HTML file;
5) generating the EPUB file: the HTML files generated in step 4), the related resource index, and the table-of-contents file are packaged to generate an EPUB file.
2. The method for converting a Word file into an EPUB file as claimed in claim 1, characterized in that parsing the Word source file to be converted in step 1) is specifically: the suffix of the Word source file is changed to .zip, and the .zip file is decompressed with decompression software to obtain the [Content_Types].xml file, the docProps folder, and the word folder; wherein the [Content_Types].xml file records the names and types of all the files contained; the docProps folder includes the app.xml file, the core.xml file, and the thumbnail.emf file; and the word folder contains the document.xml file, the footnotes.xml file, the endnotes.xml file, the styles.xml file, the numbering.xml file, and the media folder.
3. The method for converting a Word file into an EPUB file as claimed in claim 2, characterized in that parsing the XML files in step 2) specifically uses an XML document parsing tool to parse the nested XML document structures in the multiple XML files in the resource files; the XML markup elements include paragraphs, text, tables, numbering, sections, styles, fonts, headings, footers, fields, links, and tables of contents; the XML document parsing steps include dividing the file into data blocks, parsing the data blocks in parallel with multiple threads, identifying data tags, identifying data attribute content, and post-processing; the document content and associated styles of the Word source file are thereby obtained.
4. The method for converting a Word file into an EPUB file as claimed in claim 1, characterized in that splitting the Word source file in step 3) covers the following cases:
a) if the Word source file contains a table-of-contents structure, table-of-contents recognition is performed on the Word source file to obtain its table of contents; the document.xml file of the parsed Word source file contains TOC fields, which represent the table-of-contents structure through the heading levels and specific styles they carry; the content of the corresponding tags is extracted and converted directly into the table-of-contents structure of the EPUB file;
b) if the Word source file does not contain a table-of-contents structure but has a contents page made of plain text, the contents page having distinctive layout features, the layout features are used to screen for and confirm the contents page, which is then parsed to extract the titles and page numbers; these are matched to the corresponding document content, thereby generating the table-of-contents structure;
c) if the Word source file contains neither a table-of-contents structure nor a contents page with layout features, heading recognition is performed on the Word source file: a support vector machine classification method is applied to the results of analyzing page whitespace, chapter heading fonts, and headers and footers, to extract each heading of the document and its corresponding paragraph content; and, because headings of the same level share a consistent style, a clustering method is used to extract the hierarchy among the headings and generate the corresponding table of contents.
5. the method that Word file is converted to EPUB file as described in claim 1, characterized in that step 4) generates HTML File is specifically: the resource index file of HTML is generated according to XML parsing result for obtained Word subfile is split, it is right Answer the picture file resource address occurred in Word subfile;In conjunction with Word content of text, each subfile is converted to accordingly Html format file, the EPUB directory link address for synthesizing EPUB formatted file, and after corresponding conversion.
6. The method for converting a Word file into an EPUB file according to claim 1, characterized in that generating the EPUB file in step 5) specifically comprises:
First, adding a mimetype file at the target storage location to declare the EPUB format;
According to the TOC structure, creating the NCX file of the EPUB and adding navigation links identified by the names of the HTML files, thereby generating the file directory of the EPUB;
Creating the OPF file and the container.xml file, and copying the HTML files and their corresponding resource files;
Finally, packaging the above files to generate the EPUB-format file.
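The packaging steps above can be sketched with Python's standard `zipfile` module. This is a minimal EPUB 2 style layout under stated assumptions: the `OEBPS/` folder name, the `content.opf` and `toc.ncx` file names, the `en` language tag, and all function names are illustrative choices, and the output has not been checked with an EPUB validator. The one firm requirement shown is that `mimetype` is written first and stored uncompressed.

```python
import zipfile
from html import escape

CONTAINER_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    '<rootfiles><rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles></container>'
)

def build_ncx(title, book_id, chapters):
    """NCX navigation map: one navPoint per (name, href, xhtml) chapter."""
    points = "".join(
        f'<navPoint id="np{i}" playOrder="{i}">'
        f"<navLabel><text>{escape(name)}</text></navLabel>"
        f'<content src="{escape(href)}"/></navPoint>'
        for i, (name, href, _body) in enumerate(chapters, start=1)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">'
        f'<head><meta name="dtb:uid" content="{escape(book_id)}"/></head>'
        f"<docTitle><text>{escape(title)}</text></docTitle>"
        f"<navMap>{points}</navMap></ncx>"
    )

def build_opf(title, book_id, chapters):
    """OPF package file: metadata, manifest, and spine in reading order."""
    items = "".join(
        f'<item id="c{i}" href="{escape(href)}" media-type="application/xhtml+xml"/>'
        for i, (_name, href, _body) in enumerate(chapters, start=1)
    )
    refs = "".join(f'<itemref idref="c{i}"/>' for i in range(1, len(chapters) + 1))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0" '
        'unique-identifier="bookid">'
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
        f"<dc:title>{escape(title)}</dc:title>"
        f'<dc:identifier id="bookid">{escape(book_id)}</dc:identifier>'
        "<dc:language>en</dc:language></metadata>"  # placeholder language
        '<manifest><item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>'
        f"{items}</manifest>"
        f'<spine toc="ncx">{refs}</spine></package>'
    )

def write_epub(target, title, book_id, chapters):
    """Write the package to a path or file-like object."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", build_opf(title, book_id, chapters))
        zf.writestr("OEBPS/toc.ncx", build_ncx(title, book_id, chapters))
        for _name, href, xhtml in chapters:
            zf.writestr(f"OEBPS/{href}", xhtml)
```

A call such as `write_epub("book.epub", "Demo", "urn:uuid:demo", chapters)` would take `chapters` as a list of `(name, href, xhtml)` tuples; image resources would be added with further `writestr` calls and manifest items.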
7. A system for converting a Word file into an EPUB file, comprising: a Word file parsing module, an XML file parsing module, a Word file splitting module, an HTML file generation module, and an EPUB file generation module;
1) the Word file parsing module, for decompressing the Word file to be converted to generate the corresponding resource files and folders, which include multiple XML files;
2) the XML file parsing module, for performing XML parsing on the Word file according to the resource files, and extracting text, paragraph, font size, and heading information;
3) the Word file splitting module, for performing TOC recognition on a Word source file that contains a TOC structure and heading recognition on a Word source file that contains no table of contents, so as to extract the table of contents of the Word source file, and for splitting the Word source file into multiple sub-files by chapter according to the table of contents;
4) the HTML file generation module, for converting the sub-files into corresponding HTML files;
5) the EPUB file generation module, for packaging the HTML files together with the related resource index and TOC files to generate the EPUB-format file.
8. The system for converting a Word file into an EPUB file according to claim 7, characterized in that the Word file parsing module specifically: changes the suffix of the Word source file to .zip and decompresses the .zip file using decompression software, obtaining the [Content_Types].xml file, the docProps folder, and the word folder; wherein the [Content_Types].xml file records the names and types of all contained files; the docProps folder contains the app.xml, core.xml, and thumbnail.emf files; and the word folder contains the document.xml, footnotes.xml, endnotes.xml, styles.xml, and numbering.xml files and the media files.
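A short Python sketch of the unpacking step follows. Because a .docx file is already a ZIP package, the standard `zipfile` module can open it directly, so the rename to .zip is not needed in this sketch. The function names are hypothetical.

```python
import zipfile
from collections import defaultdict

def list_parts(docx):
    """Group the package's part names by top-level folder ('' for root files)."""
    groups = defaultdict(list)
    with zipfile.ZipFile(docx) as zf:
        for name in zf.namelist():
            top, _, rest = name.partition("/")
            groups[top if rest else ""].append(name)
    return dict(groups)

def unpack(docx, out_dir):
    """Extract every part into out_dir and return the list of part names."""
    with zipfile.ZipFile(docx) as zf:
        zf.extractall(out_dir)
        return zf.namelist()
```

For a typical document, `list_parts("input.docx")` would be expected to show `[Content_Types].xml` at the root plus `docProps` and `word` groups, matching the layout described in the claim; `input.docx` is a placeholder path.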
9. The system for converting a Word file into an EPUB file according to claim 7, characterized in that the XML file parsing module specifically uses an XML document parsing tool to parse the nested XML document structure in the multiple XML files among the resource files; the XML tag elements include paragraph, text, table, numbering, section, style, font, heading, footer, field, link, and table of contents; the XML document parsing steps comprise dividing the data into blocks, parsing the data blocks in parallel with multiple threads, identifying data tags, identifying data attribute content, and post-processing; the document content and related styles of the Word source file are thereby obtained.
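The block-wise parsing described here might look like the following Python sketch. It assumes that a "data block" is a fixed-size run of paragraphs and that only the paragraph style, the first explicit font size, and the text are needed; these choices, the record layout, and the function names are illustrative. In CPython, threads add little speed for CPU-bound XML work, so the thread pool mainly shows the structure of the step.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def paragraph_record(p):
    """Identify tags and attributes for one paragraph element."""
    style = p.find("w:pPr/w:pStyle", NS)
    size = p.find(".//w:rPr/w:sz", NS)  # first explicit size; w:sz is in half-points
    return {
        "style": style.get(f"{{{W}}}val") if style is not None else None,
        "size_pt": int(size.get(f"{{{W}}}val")) / 2 if size is not None else None,
        "text": "".join(t.text or "" for t in p.iter(f"{{{W}}}t")),
    }

def parse_document(document_xml, chunk_size=2, workers=4):
    """Split paragraphs into blocks, parse blocks in parallel, keep document order."""
    root = ET.fromstring(document_xml)
    paragraphs = list(root.iter(f"{{{W}}}p"))
    chunks = [paragraphs[i:i + chunk_size]
              for i in range(0, len(paragraphs), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of the input blocks.
        parsed = pool.map(lambda chunk: [paragraph_record(p) for p in chunk], chunks)
        return [record for chunk in parsed for record in chunk]

# Hand-written fragment standing in for a real document.xml.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:rPr><w:sz w:val="32"/></w:rPr><w:t>Chapter 1</w:t></w:r></w:p>
<w:p><w:r><w:rPr><w:sz w:val="21"/></w:rPr><w:t>Body </w:t></w:r><w:r><w:t>text.</w:t></w:r></w:p>
<w:p><w:r><w:t>No explicit size.</w:t></w:r></w:p>
</w:body></w:document>"""

records = parse_document(SAMPLE)
```

A fuller version would also resolve sizes inherited from styles.xml, which this sketch does not attempt.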
10. The system for converting a Word file into an EPUB file according to claim 7, characterized in that the Word file splitting module handles the following cases:
a) If the Word source file contains a table-of-contents (TOC) structure, TOC recognition is performed on the Word source file to obtain its table of contents: the parsed document.xml file of the Word source file contains a TOC field; the TOC field, which carries heading levels and specific styles, represents the TOC structure; the contents of the corresponding tags are extracted and converted directly into the TOC structure of the EPUB file;
b) If the Word source file contains no TOC structure but has a contents page consisting of plain text, and that contents page has specific typesetting features, the typesetting features are used to screen for and identify the contents page; the contents page is then parsed to extract titles and page numbers, which are matched to the corresponding document content, thereby generating the TOC structure;
c) If the Word source file contains neither a TOC structure nor a contents page with typesetting features, heading recognition is performed on the Word source file: using a support vector machine (SVM) classification method, each heading of the document and its corresponding paragraph content are extracted using the results of analyzing page whitespace, chapter-heading fonts, and headers and footers; then, exploiting the style consistency among headings of the same level, a clustering method is used to extract the hierarchy among the headings, thereby generating the corresponding table of contents.
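The second half of case c), deriving heading levels from style consistency, can be sketched in Python as follows. The SVM heading classifier is not shown: the input is assumed to be a list of already-detected headings. A simple exact-match grouping on `(font size, bold)` stands in for the clustering method, which the claim does not specify; the field names and functions are hypothetical.

```python
def assign_levels(headings):
    """Group headings by identical style and rank groups by size (largest = level 1).

    headings -- list of dicts with 'text', 'size_pt', and 'bold' keys (assumed shape)
    """
    signatures = sorted({(h["size_pt"], h["bold"]) for h in headings}, reverse=True)
    level_of = {sig: i + 1 for i, sig in enumerate(signatures)}
    return [(level_of[(h["size_pt"], h["bold"])], h["text"]) for h in headings]

def build_tree(leveled):
    """Nest (level, title) pairs into a tree of {'title', 'children'} dicts."""
    root = {"title": None, "children": []}
    stack = [(0, root)]  # (level, node) path from the root to the current node
    for level, text in leveled:
        node = {"title": text, "children": []}
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]
```

With real documents, sizes and weights are rarely perfectly uniform, so a tolerance-based or distance-based clustering would likely be needed in place of exact matching.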
CN201810071710.1A 2018-01-25 2018-01-25 Method and system for converting Word file into EPUB file Active CN110083805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810071710.1A CN110083805B (en) 2018-01-25 2018-01-25 Method and system for converting Word file into EPUB file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810071710.1A CN110083805B (en) 2018-01-25 2018-01-25 Method and system for converting Word file into EPUB file

Publications (2)

Publication Number Publication Date
CN110083805A true CN110083805A (en) 2019-08-02
CN110083805B CN110083805B (en) 2020-11-27

Family

ID=67411893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810071710.1A Active CN110083805B (en) 2018-01-25 2018-01-25 Method and system for converting Word file into EPUB file

Country Status (1)

Country Link
CN (1) CN110083805B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532233A (en) * 2019-08-20 2019-12-03 武汉鼎森电子科技有限公司 A kind of epub document generating method and system
CN110705216A (en) * 2019-09-19 2020-01-17 深圳前海环融联易信息科技服务有限公司 Method and device for converting docx file into xml file based on java and computer equipment
CN110781672A (en) * 2019-10-30 2020-02-11 北京爱学习博乐教育科技有限公司 Question bank production method and system based on machine intelligence
CN111062187A (en) * 2019-11-27 2020-04-24 北京计算机技术及应用研究所 Structured parsing method and system for docx format document
CN111144069A (en) * 2019-12-30 2020-05-12 北大方正集团有限公司 Table-based directory typesetting method and device and storage medium
CN111581948A (en) * 2020-04-03 2020-08-25 北京百度网讯科技有限公司 Document analysis method, device, equipment and storage medium
CN111881650A (en) * 2020-07-20 2020-11-03 北京百度网讯科技有限公司 PDF document generation method and device and electronic equipment
CN112528080A (en) * 2019-09-03 2021-03-19 北京国双科技有限公司 Method and device for extracting text content of docx file
CN112597422A (en) * 2020-12-30 2021-04-02 深圳市世强元件网络有限公司 PDF file segmentation method and PDF file loading method in webpage
CN112686000A (en) * 2020-12-24 2021-04-20 掌阅科技股份有限公司 Format conversion method of electronic book document, electronic equipment and storage medium
CN112699641A (en) * 2021-03-25 2021-04-23 南京国睿信维软件有限公司 Method for quickly converting batch copy of WORD content to DM based on S1000D standard
CN113268585A (en) * 2021-04-28 2021-08-17 企查查科技有限公司 Report file generation method and device, computer equipment and storage medium
CN113361256A (en) * 2021-06-24 2021-09-07 上海真虹信息科技有限公司 Rapid Word document parsing method based on Aspose technology
CN113761840A (en) * 2021-09-08 2021-12-07 中信建投证券股份有限公司 Intelligent document processing method, system, computer device and medium
CN113779931A (en) * 2021-08-31 2021-12-10 民商数字科技(深圳)有限公司 Knowledge base construction method based on Word and control method thereof
CN116612491A (en) * 2023-07-17 2023-08-18 中国电子科技集团公司第十研究所 ARM kylin WORD file content extraction method
CN116861847A (en) * 2023-06-21 2023-10-10 三峡高科信息技术有限责任公司 Online Office file previewing method and system
CN117421357A (en) * 2023-06-07 2024-01-19 广州市公安局***侦查支队 Method, system, device and storage medium for importing funds stream data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060236215A1 (en) * 2005-04-14 2006-10-19 Jenn-Sheng Wu Method and system for automatically creating document
CN104699714A (en) * 2013-12-09 2015-06-10 北大方正集团有限公司 Method and device for transferring files of book edition format into files of EPUB format
CN106326194A (en) * 2015-07-06 2017-01-11 北大方正集团有限公司 Directory generation method and apparatus applied to file format conversion scene
CN106886509A (en) * 2017-03-06 2017-06-23 大连理工大学 A kind of academic dissertation form automatic testing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
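Fang Jing et al.: "Automatic table detection and performance evaluation for fixed-layout electronic documents", Acta Scientiarum Naturalium Universitatis Pekinensis (《北京大学学报(自然科学版)》) *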

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532233A (en) * 2019-08-20 2019-12-03 武汉鼎森电子科技有限公司 A kind of epub document generating method and system
CN112528080A (en) * 2019-09-03 2021-03-19 北京国双科技有限公司 Method and device for extracting text content of docx file
CN110705216A (en) * 2019-09-19 2020-01-17 深圳前海环融联易信息科技服务有限公司 Method and device for converting docx file into xml file based on java and computer equipment
CN110705216B (en) * 2019-09-19 2023-11-03 深圳前海环融联易信息科技服务有限公司 Method and device for converting docx file into xml file based on java and computer equipment
CN110781672A (en) * 2019-10-30 2020-02-11 北京爱学习博乐教育科技有限公司 Question bank production method and system based on machine intelligence
CN110781672B (en) * 2019-10-30 2024-01-30 北京爱学习博乐教育科技有限公司 Question bank production method and system based on machine intelligence
CN111062187A (en) * 2019-11-27 2020-04-24 北京计算机技术及应用研究所 Structured parsing method and system for docx format document
CN111144069A (en) * 2019-12-30 2020-05-12 北大方正集团有限公司 Table-based directory typesetting method and device and storage medium
CN111144069B (en) * 2019-12-30 2021-12-03 北大方正集团有限公司 Table-based directory typesetting method and device and storage medium
CN111581948B (en) * 2020-04-03 2024-02-09 北京百度网讯科技有限公司 Document analysis method, device, equipment and storage medium
CN111581948A (en) * 2020-04-03 2020-08-25 北京百度网讯科技有限公司 Document analysis method, device, equipment and storage medium
CN111881650A (en) * 2020-07-20 2020-11-03 北京百度网讯科技有限公司 PDF document generation method and device and electronic equipment
CN112686000A (en) * 2020-12-24 2021-04-20 掌阅科技股份有限公司 Format conversion method of electronic book document, electronic equipment and storage medium
CN112597422A (en) * 2020-12-30 2021-04-02 深圳市世强元件网络有限公司 PDF file segmentation method and PDF file loading method in webpage
CN112699641A (en) * 2021-03-25 2021-04-23 南京国睿信维软件有限公司 Method for quickly converting batch copy of WORD content to DM based on S1000D standard
CN113268585A (en) * 2021-04-28 2021-08-17 企查查科技有限公司 Report file generation method and device, computer equipment and storage medium
CN113361256A (en) * 2021-06-24 2021-09-07 上海真虹信息科技有限公司 Rapid Word document parsing method based on Aspose technology
CN113779931A (en) * 2021-08-31 2021-12-10 民商数字科技(深圳)有限公司 Knowledge base construction method based on Word and control method thereof
CN113761840A (en) * 2021-09-08 2021-12-07 中信建投证券股份有限公司 Intelligent document processing method, system, computer device and medium
CN117421357A (en) * 2023-06-07 2024-01-19 广州市公安局***侦查支队 Method, system, device and storage medium for importing funds stream data
CN116861847A (en) * 2023-06-21 2023-10-10 三峡高科信息技术有限责任公司 Online Office file previewing method and system
CN116861847B (en) * 2023-06-21 2024-02-13 三峡高科信息技术有限责任公司 Online Office file previewing method and system
CN116612491A (en) * 2023-07-17 2023-08-18 中国电子科技集团公司第十研究所 ARM kylin WORD file content extraction method

Also Published As

Publication number Publication date
CN110083805B (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN110083805B (en) Method and system for converting Word file into EPUB file
CN111753499B (en) Method for merging and displaying electronic form and OFD format file and generating directory
US8250469B2 (en) Document layout extraction
US20150033116A1 (en) Systems, Methods, and Media for Generating Structured Documents
CN105446946B (en) Rearrangement method, system and the electronic reading terminal of format document
WO2009000141A1 (en) Representation method, system and device of layout file logical structure information
CN101548273A (en) Determining fields for presentable files and extensible markup language schemas for bibliographies and citations
JP2009524883A (en) Presenting digital content to the network
JP2000148736A (en) Methods for font acquisition, registration, display, and printing, method for handling document having variant fonts, and recording medium thereof
US20100080493A1 (en) Associating optical character recognition text data with source images
WO2013146394A1 (en) Information processing terminal and method, and information management apparatus and method
US9043343B2 (en) Identifier assigning method, identifier parsing method, and multimedia reading
CN103309879A (en) Method and device for managing marks in WORD document
CN112433995B (en) File format conversion method, system, computer device and storage medium
US20120109638A1 (en) Electronic device and method for extracting component names using the same
US9817913B2 (en) Method and apparatus for collecting, merging and presenting content
CN110554996A (en) method and system for quickly opening epub file
US20120192046A1 (en) Generation of a source complex document to facilitate content access in complex document creation
CN107423271B (en) Document generation method and device
US8566366B2 (en) Format conversion apparatus and file search apparatus capable of searching for a file as based on an attribute provided prior to conversion
CN112818687B (en) Method, device, electronic equipment and storage medium for constructing title recognition model
CN105320716A (en) Automatic labeling method for digital publication
CN111401005B (en) Text conversion method and device and readable storage medium
JP5707937B2 (en) Electronic document conversion apparatus and electronic document conversion method
CN111143719A (en) Online publication method, device and equipment of thesis and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant