CN106649271A - Translation-based word document analysis method - Google Patents

Translation-based word document analysis method

Info

Publication number
CN106649271A
Authority
CN
China
Prior art keywords
file
word document
xml format
translation
paragraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611180452.8A
Other languages
Chinese (zh)
Inventor
席斌
李明
王兴强
张马成
彭成超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Excellent Translation Information Technology Ltd By Share Ltd
Original Assignee
Chengdu Excellent Translation Information Technology Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-12-19
Filing date
2016-12-19
Publication date
2017-05-10
Application filed by Chengdu Excellent Translation Information Technology Ltd By Share Ltd
Priority to CN201611180452.8A
Publication of CN106649271A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/47 Machine-assisted translation, e.g. using translation memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a translation-based Word document parsing method. The method comprises the following steps: scanning and parsing a file in XML format using XPath; constructing POI paragraphs from the contents of the w:p tags obtained by the scan; and parsing the source text and styles in each paragraph, recording the position of each w:p tag in the file, and numbering it. Because the XML file is scanned with XPath and paragraphs are identified by their w:p tags, paragraphs inside a nested table are still identified when one table is nested in another, which avoids omitted translations.

Description

A translation-based Word document parsing method
Technical Field
The present invention relates to the field of translation technology, and in particular to a translation-based Word document parsing method.
Background
Computer-aided translation software has by now developed distinct technical principles for translation, memory, and storage. When such software processes a Word document, it first parses the document. Existing Word document parsing methods have the following problem: if the document contains a table nested inside another table, as shown in Fig. 1, the inner table cannot be parsed, and its content is therefore left untranslated.
Summary of the Invention
To solve the above technical problem, the present invention provides a translation-based Word document parsing method.
The present invention is achieved through the following technical solution:
A translation-based Word document parsing method, comprising the following steps:
scanning and parsing an XML file using XPath;
constructing POI paragraphs from the contents of the w:p tags obtained by the scan;
parsing the source text and styles in each paragraph, recording the position of each w:p tag in the file, and numbering it.
The method scans the XML file with XPath. In the XML file, each paragraph is delimited by a w:p tag that marks where the paragraph starts and ends, so identifying the w:p tags identifies the paragraphs. When a Word document contains a table nested inside another table, the paragraphs of the inner table are identified as well, which avoids omitted translations.
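The effect described above can be illustrated with a minimal sketch. It is not the patented implementation: it uses Python's standard `xml.etree.ElementTree` (which supports a subset of XPath) rather than Apache POI, it assumes that the tag in question is the WordprocessingML paragraph element `w:p`, and the sample XML and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the body XML of .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical sample: a table cell that holds a paragraph and a nested table.
SAMPLE = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Top-level paragraph</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc>
      <w:p><w:r><w:t>Outer cell</w:t></w:r></w:p>
      <w:tbl><w:tr><w:tc>
        <w:p><w:r><w:t>Inner cell</w:t></w:r></w:p>
      </w:tc></w:tr></w:tbl>
    </w:tc></w:tr></w:tbl>
  </w:body>
</w:document>
""".strip()


def _text_of(paragraph):
    """Concatenate the text of all w:t elements inside one paragraph."""
    return "".join(t.text or "" for t in paragraph.findall(".//w:t", NS))


def paragraph_texts(xml_text):
    """Return the text of every w:p element, at any nesting depth, in document order."""
    root = ET.fromstring(xml_text)
    # ".//w:p" is a descendant search, so paragraphs inside nested tables match too.
    return [_text_of(p) for p in root.findall(".//w:p", NS)]


def top_level_only(xml_text):
    """For contrast: looking only at direct children of w:body misses all table content."""
    root = ET.fromstring(xml_text)
    return [_text_of(p) for p in root.findall("./w:body/w:p", NS)]
```

Calling `paragraph_texts(SAMPLE)` returns all three paragraphs, including the one in the inner table, while `top_level_only(SAMPLE)` returns only the first. A scan keyed on the paragraph tag does not depend on how deeply the tables are nested.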
Preferably, a text conversion step precedes the scanning and parsing of the XML file. Specifically: if the source is a Word document, it is converted directly into an XML file; if the source is a PDF file, it is first converted into a Word document and then into an XML file.
Further, the Word document is a file of a version later than 2003; if the Word document is of a version before 2003, the method also includes a version conversion step. When a Word document of a version before 2003 is converted into XML, its paragraphs are not marked with w:p tags, so version conversion is required.
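The text does not say how a Word document is turned into an XML file. One plausible reading, offered here only as an assumption, is that the document is an Office Open XML `.docx` package, which is a ZIP archive whose main body is conventionally stored in the part `word/document.xml`. Under that assumption the conversion amounts to reading that part. The helper names below are hypothetical, and the demo package contains only the one part being read, so it is not a complete Word file.

```python
import io
import zipfile


def docx_to_xml(docx_bytes, part="word/document.xml"):
    """Return the XML text of one part of a .docx package (a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        return pkg.read(part).decode("utf-8")


def make_demo_docx(body_xml):
    """Build a tiny in-memory stand-in for a .docx (hypothetical helper for the demo)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("word/document.xml", body_xml)
    return buf.getvalue()


demo = make_demo_docx("<w:document/>")
xml_text = docx_to_xml(demo)
```

Converting a PDF to a Word document is a separate problem that needs an external tool and is not sketched here.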
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The present invention scans and parses the XML file using XPath and identifies paragraphs by their w:p tags. This avoids the situation in which, when one table is nested inside another, the paragraphs of the inner table cannot be identified and are left untranslated.
Brief Description of the Drawings
Fig. 1 is a structure diagram of a table nested inside a table.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. The exemplary embodiments and their descriptions serve only to explain the invention and do not limit it.
Embodiment 1
A translation-based Word document parsing method, comprising the following steps:
scanning and parsing an XML file using XPath to obtain the w:p tags in the XML file, where the scan covers the headers, the footers, and the body text;
constructing POI paragraphs from the contents of the w:p tags obtained by the scan, where POI is Apache POI, a Java API for creating and maintaining files in the various formats based on the Office Open XML standards and Microsoft's OLE 2 Compound Document format;
parsing the source text and styles in each paragraph, recording the position of each w:p tag in the file, and numbering it.
Parsing the source text and styles converts the source into HTML, which serves two purposes:
1) The HTML records the source content and its styles. The content is the text in the file; the styles are the formatting applied to that text, such as bold, underline, and text color. This allows the file's content and styles to be displayed on the translation web page.
2) After the user sets styles on the translation in the web page, the translated content and styles to be saved into the file must be parsed back out of the HTML when the translation is backfilled. Only then are the content and styles saved in the file consistent with what the user set on the web page.
Recording the position of each w:p tag in the file and numbering it yields a segment number for every paragraph, in preparation for backfilling the translation.
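The conversion of a paragraph's text and styles into numbered HTML can be sketched as follows. This is a simplified illustration under stated assumptions, not the method's actual output format. It assumes that the tag in question is the WordprocessingML paragraph element `w:p`. It reads bold, underline, and color from the standard run properties `w:b`, `w:u`, and `w:color`. The `data-seg` attribute carrying the segment number is a hypothetical choice. Python's standard library stands in for Apache POI.

```python
import xml.etree.ElementTree as ET
from html import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
W = "{" + W_NS + "}"  # Clark-notation prefix for namespaced names


def run_to_html(run):
    """Render one w:r run as HTML, keeping bold, underline and color (simplified)."""
    text = escape("".join(t.text or "" for t in run.findall("w:t", NS)))
    props = run.find("w:rPr", NS)
    if props is None:
        return text
    color = props.find("w:color", NS)
    if color is not None:
        value = color.get(W + "val")
        text = f'<span style="color:#{value}">{text}</span>'
    # Simplification: the mere presence of w:u or w:b is treated as "on".
    if props.find("w:u", NS) is not None:
        text = f"<u>{text}</u>"
    if props.find("w:b", NS) is not None:
        text = f"<b>{text}</b>"
    return text


def paragraphs_to_html(xml_text):
    """Return (segment_number, html) for each w:p, numbered in document order from 1."""
    root = ET.fromstring(xml_text)
    segments = []
    for number, p in enumerate(root.iter(W + "p"), start=1):
        body = "".join(run_to_html(r) for r in p.findall("w:r", NS))
        segments.append((number, f'<p data-seg="{number}">{body}</p>'))
    return segments


# Hypothetical sample: one paragraph with a plain run and a bold red run.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:r><w:rPr><w:b/><w:color w:val="FF0000"/></w:rPr><w:t>world</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
```

Because every HTML paragraph carries its segment number, a later backfill step can match an edited translation to the paragraph it came from.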
POI paragraphs provide many operation interfaces that make it convenient to perform various operations on Word paragraphs. If users processed the XML data representing Word paragraphs themselves, errors would be likely and the program would run inefficiently.
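Embodiment 1 states that the scan covers headers, footers, and body text, and that the position of each paragraph is recorded. The sketch below shows one way this could look, assuming a `.docx` package in which these areas are stored in separate parts conventionally named `word/document.xml`, `word/header1.xml`, `word/footer1.xml`, and so on. Matching parts by file name is a simplification. The record layout and helper names are hypothetical, and the tag in question is assumed to be `w:p`.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Assumption: body, header and footer parts follow the conventional naming.
PART_PATTERN = re.compile(r"^word/(document|header\d*|footer\d*)\.xml$")


def scan_package(docx_bytes):
    """Return (part_name, index_in_part, text) for each w:p in body, header and footer parts."""
    records = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        for name in sorted(pkg.namelist()):
            if not PART_PATTERN.match(name):
                continue  # skip styles, settings, media and other parts
            root = ET.fromstring(pkg.read(name))
            for index, p in enumerate(root.iter(W + "p")):
                text = "".join(t.text or "" for t in p.iter(W + "t"))
                records.append((name, index, text))
    return records


def build_demo_package(parts):
    """Build an in-memory ZIP from {part_name: xml_text} (hypothetical demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        for name, xml_text in parts.items():
            pkg.writestr(name, xml_text)
    return buf.getvalue()
```

The pair of part name and index gives each paragraph a stable address, which is the kind of position information a backfill step needs.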
Embodiment 2
The above embodiment applies to Word files of versions later than 2003. If the source is a Word file of a version before 2003 or a PDF file, a version conversion step precedes the steps of the above embodiment. Specifically: the Word document is converted into a Word file of a version later than 2003; if the source is a PDF file, it is first converted into a Word file of a version later than 2003.
The Word file of a version later than 2003 is then converted into an XML file.
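The routing in Embodiment 2 can be sketched as a decision on the input's leading bytes. The sketch rests on assumptions the text does not state: that the older Word versions correspond to the binary OLE 2 `.doc` format, that the newer versions correspond to ZIP-based `.docx` packages, and that the step names returned here are hypothetical labels. A matching signature only suggests a format, since other file types share the OLE 2 and ZIP containers.

```python
# File signatures (leading bytes) of the three input kinds.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE 2 compound file, e.g. legacy .doc
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header, e.g. .docx
PDF_MAGIC = b"%PDF-"


def conversion_plan(head):
    """Map the first bytes of a file to an ordered list of hypothetical conversion steps."""
    if head.startswith(PDF_MAGIC):
        # PDF is first turned into a newer Word file, then into XML.
        return ["pdf_to_word", "word_to_xml"]
    if head.startswith(OLE2_MAGIC):
        # Older Word files are upgraded before the XML conversion.
        return ["upgrade_word_version", "word_to_xml"]
    if head.startswith(ZIP_MAGIC):
        # Newer Word files go straight to XML.
        return ["word_to_xml"]
    raise ValueError("unsupported input format")
```

For example, `conversion_plan(b"%PDF-1.7")` yields the two-step plan for PDF input, and unrecognized input raises `ValueError` instead of being parsed silently.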
The specific embodiments above describe the objects, technical solutions, and beneficial effects of the present invention in further detail. It should be understood that they are only specific embodiments of the invention and do not limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (3)

  1. A translation-based Word document parsing method, characterized by comprising the following steps:
    scanning and parsing an XML file using XPath;
    constructing POI paragraphs from the contents of the w:p tags obtained by the scan;
    parsing the source text and styles in each paragraph, recording the position of each w:p tag in the file, and numbering it.
  2. The translation-based Word document parsing method according to claim 1, characterized in that a text conversion step precedes the scanning and parsing of the XML file, specifically: if the source is a Word document, it is converted directly into an XML file; if the source is a PDF file, it is first converted into a Word document and then into an XML file.
  3. The translation-based Word document parsing method according to claim 2, characterized in that the Word document is a file of a version later than 2003; if the Word document is of a version before 2003, the method further comprises a version conversion step.
CN201611180452.8A 2016-12-19 2016-12-19 Translation-based word document analysis method Pending CN106649271A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611180452.8A CN106649271A (en) 2016-12-19 2016-12-19 Translation-based word document analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611180452.8A CN106649271A (en) 2016-12-19 2016-12-19 Translation-based word document analysis method

Publications (1)

Publication Number Publication Date
CN106649271A (en) 2017-05-10

Family

ID=58835001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611180452.8A Pending CN106649271A (en) 2016-12-19 2016-12-19 Translation-based word document analysis method

Country Status (1)

Country Link
CN (1) CN106649271A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156207A (en) * 2014-07-31 2014-11-19 广州金山网络科技有限公司 File display method and device
CN104714944A (en) * 2015-04-14 2015-06-17 语联网(武汉)信息技术有限公司 Document translation method and document translation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
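朱先远 et al.: "Design and Implementation of an Electronic Homework Grading *** for Mobile Terminals", Journal of Yangtze University (Natural Science Edition) *
杨倩晨 et al.: "XML-Based Automatic Document Typesetting Technology", 2012 2ND INTERNATIONAL CONFERENCE ON APPLIED SOCIAL SCIENCE *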

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110018984A (en) * 2017-10-31 2019-07-16 北京国双科技有限公司 A kind of conversion method and device of file format
CN107885735A (en) * 2017-11-21 2018-04-06 语联网(武汉)信息技术有限公司 A kind of unrelated document translation method and system of form
CN107885735B (en) * 2017-11-21 2021-05-04 语联网(武汉)信息技术有限公司 Format-independent document translation method and system
CN108052496A (en) * 2017-12-19 2018-05-18 国云科技股份有限公司 A kind of word picture and text formatting system and its implementation based on source file
CN111159981A (en) * 2019-12-31 2020-05-15 北京迈迪培尔信息技术有限公司 Method and device for analyzing and translating Excel document
CN111159981B (en) * 2019-12-31 2023-08-08 北京迈迪培尔信息技术有限公司 Method and device for analyzing and translating Excel document
CN111401000A (en) * 2020-04-03 2020-07-10 上海一者信息科技有限公司 Translation real-time preview method for online auxiliary translation
CN111401000B (en) * 2020-04-03 2023-06-20 上海一者信息科技有限公司 Real-time translation previewing method for online auxiliary translation

Similar Documents

Publication Publication Date Title
CN106649271A (en) Translation-based word document analysis method
CN108491199B (en) Method and terminal for automatically generating interface
JP5883557B2 (en) How to add metadata to data
CN102722479B (en) A kind of method of implementation language translation and device
US8260064B2 (en) Image processing apparatus, image processing method, computer-readable medium and computer data signal
US20070208997A1 (en) Xsl transformation and translation
US8155945B2 (en) Image processing apparatus, image processing method, computer-readable medium and computer data signal
US9817887B2 (en) Universal text representation with import/export support for various document formats
CN104699714A (en) Method and device for transferring files of book edition format into files of EPUB format
CN111144070B (en) Document analysis translation method and device
CN110414010B (en) Processing method of internationalized resource file translation text and readable storage medium
CN104484323A (en) Translation processing method based on document segment
CN101866331A (en) Conversion method and device of XML (Extensible Markup Language) documents of different languages
CN105373562A (en) Acquisition method and device of PDF (Portable Document Format) documentation comment
CN107423271B (en) Document generation method and device
CN102209279A (en) Extensible markup language (XML)-based multi-language support method
KR20070062800A (en) Method for transforming of electronic document based on mapping rule and system thereof
CN106021197B (en) The translation system and interpretation method of DWG formatted file
CN106055529B (en) The resolution system and its analytic method of text data to be translated in DWG formatted file
CN111159981B (en) Method and device for analyzing and translating Excel document
CN102521359A (en) Interface data file comparison method and device
CN113296773B (en) Copyright labeling method and system for cascading style sheets
CN111125483A (en) Method and device for generating webpage data extraction template, computer device and computer readable storage medium
Darvishy et al. A flexible software architecture concept for the creation of accessible PDF documents
Bloechle et al. Dolores: An Interactive and Class-Free Approach for Document Logical Restructuring.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20170510)