CN106649271A - Translation-based word document analysis method - Google Patents
Translation-based word document analysis method
- Publication number
- CN106649271A (application number CN201611180452.8A)
- Authority
- CN
- China
- Prior art keywords
- file
- word document
- xml format
- translation
- paragraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/47—Machine-assisted translation, e.g. using translation memory
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a translation-based Word document parsing method. The method comprises the following steps: scanning and parsing a file in XML format using XPath; constructing POI paragraphs from the content of the wp tags obtained by scanning; and parsing the source text and styles in each paragraph, recording the position of each wp tag in the file, and numbering it. Because the XML format file is scanned and parsed using XPath and paragraphs are recognized by identifying wp tags, the method effectively avoids omitted translations caused by paragraphs in a nested table going unrecognized when one table is nested inside another.
Description
Technical field
The present invention relates to the field of translation technology, and in particular to a translation-based Word document parsing method.
Background art
Computer-aided translation software has by now developed a range of technical principles, such as translation, memory, and storage. When processing a Word document, computer-aided translation software first parses the document. Existing Word document parsing methods have the following problem: if the document contains a table nested inside a table, as shown in Fig. 1, the inner table cannot be parsed, which leads to omitted translations.
Summary of the invention
To solve the above technical problem, the present invention provides a translation-based Word document parsing method.
The present invention is achieved through the following technical solution:
A translation-based Word document parsing method, comprising the following steps:
scanning and parsing an XML format file using XPath;
constructing POI paragraphs from the content of the wp tags obtained by scanning;
parsing the source text and styles in each paragraph, recording the position of each wp tag in the file, and numbering it.
The method of this solution scans the XML format file using XPath. Because a paragraph in an XML format file is delimited by wp tags that mark its start and end, identifying the wp tags allows paragraphs to be recognized reliably. When a Word document contains a table nested inside a table, the paragraphs of the inner table are recognized as well, which effectively avoids omitted translations.
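The difference between a descendant scan and a fixed-depth traversal can be sketched as follows. This is a minimal illustration and not the patented implementation: it assumes that the "wp tag" corresponds to the `w:p` paragraph element of WordprocessingML, it uses the Office Open XML namespace URI, and it relies on the limited XPath support in Python's standard library instead of the Java tooling the patent names. The sample document and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: paragraphs are w:p elements in the Office Open XML namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical, simplified document: one body paragraph and a table
# whose cell contains a second, nested table.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc>
      <w:p><w:r><w:t>Outer cell</w:t></w:r></w:p>
      <w:tbl><w:tr><w:tc>
        <w:p><w:r><w:t>Inner cell</w:t></w:r></w:p>
      </w:tc></w:tr></w:tbl>
    </w:tc></w:tr></w:tbl>
  </w:body>
</w:document>"""


def paragraph_text(paragraph):
    """Concatenate the text of every w:t element inside one paragraph."""
    return "".join(t.text or "" for t in paragraph.iterfind(".//w:t", NS))


def scan_all_paragraphs(root):
    """Descendant scan: './/w:p' matches paragraphs at any nesting depth."""
    return [paragraph_text(p) for p in root.iterfind(".//w:p", NS)]


def scan_one_table_level(root):
    """Fixed-depth traversal: body paragraphs plus cells of top-level tables."""
    body = root.find("w:body", NS)
    found = [paragraph_text(p) for p in body.iterfind("w:p", NS)]
    found += [paragraph_text(p) for p in body.iterfind("w:tbl/w:tr/w:tc/w:p", NS)]
    return found


root = ET.fromstring(SAMPLE)
print(scan_all_paragraphs(root))   # includes the paragraph of the inner table
print(scan_one_table_level(root))  # misses the paragraph of the inner table
```

The fixed-depth traversal stands in for any parser that only descends one table level; the descendant query does not depend on how deeply tables are nested.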
Preferably, a text conversion step is also included before the XML format file is scanned and parsed. Specifically: if the original is a Word document, it is converted directly into a file in XML format; if the original is a PDF file, it is first converted into a Word document and then converted into a file in XML format.
Further, the Word document is a file of version 2003 or later; if the Word document is of a version earlier than 2003, a version conversion step is also included. Because paragraphs in a pre-2003 Word file are not marked with wp tags after conversion to XML format, such a file needs to undergo version conversion first.
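The conversion rules above amount to a small decision procedure. The sketch below only plans the order of steps and performs no conversion; the enum members, step descriptions, and function name are hypothetical labels chosen for illustration, and Python is used only for brevity.

```python
from enum import Enum


class SourceKind(Enum):
    """Hypothetical labels for the three kinds of original file."""
    PDF = "pdf"
    WORD_PRE_2003 = "word_pre_2003"
    WORD_2003_OR_LATER = "word_2003_or_later"


def conversion_plan(kind):
    """Return the ordered steps needed before the XPath scan can run."""
    steps = []
    if kind is SourceKind.PDF:
        # A PDF original is first turned into a Word document.
        steps.append("convert PDF to Word (2003 or later)")
    elif kind is SourceKind.WORD_PRE_2003:
        # Older Word files lack wp paragraph tags after XML conversion.
        steps.append("convert pre-2003 Word to Word (2003 or later)")
    # Every path ends with the same two steps.
    steps.append("convert Word to XML")
    steps.append("scan XML with XPath")
    return steps


for kind in SourceKind:
    print(kind.name, "->", conversion_plan(kind))
```

Keeping the plan separate from the converters makes it easy to swap in whichever conversion tool is actually available.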
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The present invention scans and parses the XML format file using XPath and recognizes paragraphs by identifying wp tags. This effectively avoids the situation in which, when a table is nested inside a table, the paragraphs of the inner table cannot be recognized and translations are omitted.
Brief description of the drawings
Fig. 1 is a structural diagram of a table nested inside a table.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. The exemplary embodiments and their descriptions serve only to explain the present invention and are not intended to limit it.
Embodiment 1
A translation-based Word document parsing method, comprising the following steps:
Scanning and parsing the XML format file using XPath to obtain the wp tags in the XML format file, including scanning the header, the footer, and the body text.
Constructing POI paragraphs from the content of the wp tags obtained by scanning. POI refers to Apache POI, a Java API for creating and maintaining files in the various formats based on the Office Open XML standard and Microsoft's OLE 2 compound document format.
Parsing the source text and styles in each paragraph, recording the position of each wp tag in the file, and numbering it. Parsing the source text and styles means converting the original into HTML, which serves two purposes: 1) The HTML records the content of the original and its styles. Content means the text in the file, and style means the formatting applied to that text, such as bold, underline, and text color, so that the file content and styles can be displayed on the translation web page. 2) After the user has set styles on the translation on the web page, the translation content and styles to be saved to the file must likewise be parsed from the HTML when the translation is backfilled. Only then are the translation content and styles saved in the file consistent with the styles the user set on the web page. Recording and numbering the position of each wp tag in the file records the segment number of every paragraph, in preparation for backfilling the translation.
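The parsing step of this embodiment can be illustrated with a short sketch that turns each paragraph into HTML and records a position and a segment number. It is not the patented implementation: Python's standard library stands in for Apache POI, the "wp tag" is assumed to be the `w:p` element of WordprocessingML, only bold, underline, and color are handled, and the record layout (`number`, `position`, `html`) as well as the choice to number only non-empty paragraphs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumption: paragraphs are w:p elements in the Office Open XML namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
W = "{%s}" % W_NS

# Hypothetical sample: a styled paragraph, an empty one, and a colored one.
SAMPLE = f"""<w:document xmlns:w="{W_NS}"><w:body>
<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Bold</w:t></w:r><w:r><w:t> plain</w:t></w:r></w:p>
<w:p/>
<w:p><w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>Red &amp; hot</w:t></w:r></w:p>
</w:body></w:document>"""


def run_to_html(run):
    """Render one w:r run as HTML (simplified: ignores w:val='false' etc.)."""
    text = "".join(t.text or "" for t in run.iterfind("w:t", NS))
    html = escape(text)
    props = run.find("w:rPr", NS)
    if props is None:
        return html
    if props.find("w:b", NS) is not None:
        html = f"<b>{html}</b>"
    if props.find("w:u", NS) is not None:
        html = f"<u>{html}</u>"
    color = props.find("w:color", NS)
    if color is not None:
        html = f'<span style="color:#{color.get(W + "val")}">{html}</span>'
    return html


def parse_paragraphs(root):
    """Return one record per non-empty paragraph, in document order.

    'position' is the index of the w:p element among all w:p elements,
    so the paragraph can be found again at backfill time; 'number' is the
    segment number shown to the translator.
    """
    records = []
    for position, paragraph in enumerate(root.iter(W + "p")):
        html = "".join(run_to_html(r) for r in paragraph.iterfind("w:r", NS))
        if html:
            records.append(
                {"number": len(records) + 1, "position": position, "html": html}
            )
    return records


records = parse_paragraphs(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

Keeping the position separate from the segment number means empty paragraphs can be skipped in the translation view without losing the mapping back to the file.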
POI paragraphs provide many operation interfaces that make it convenient for the user to perform various operations on Word paragraphs. If the user instead processes the XML data representing Word paragraphs directly, errors occur easily and the program executes less efficiently.
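The backfill direction described in Embodiment 1, in which the translated HTML is parsed back into text and styles, can be sketched in the same spirit. The patent performs this through Apache POI paragraph objects; the sketch below manipulates the XML directly with Python's standard library purely for illustration, which is exactly the kind of hand processing the text above advises against in production. The class and function names are hypothetical, and only `<b>` and `<u>` markup is handled.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: paragraphs are w:p elements in the Office Open XML namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


class RunCollector(HTMLParser):
    """Collect (text, bold, underline) triples from simple <b>/<u> markup."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.runs = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened tag of this name, if any.
        for index in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[index] == tag:
                del self.open_tags[index]
                break

    def handle_data(self, data):
        self.runs.append((data, "b" in self.open_tags, "u" in self.open_tags))


def backfill(paragraph, translated_html):
    """Replace the runs of one w:p element with runs built from HTML."""
    collector = RunCollector()
    collector.feed(translated_html)
    collector.close()
    for run in paragraph.findall(W + "r"):
        paragraph.remove(run)
    for text, bold, underline in collector.runs:
        run = ET.SubElement(paragraph, W + "r")
        if bold or underline:
            props = ET.SubElement(run, W + "rPr")
            if bold:
                ET.SubElement(props, W + "b")
            if underline:
                ET.SubElement(props, W + "u").set(W + "val", "single")
        ET.SubElement(run, W + "t").text = text


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W_NS}"><w:r><w:t>Hello world</w:t></w:r></w:p>'
)
backfill(paragraph, "<b>Bonjour</b> le monde")
print(ET.tostring(paragraph, encoding="unicode"))
```

In a real pipeline the paragraph would be located through the position recorded at parse time, so that each translated segment lands in the paragraph it came from.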
Embodiment 2
The above embodiment applies to Word files of version 2003 or later. If the file is a Word file of a version earlier than 2003 or a PDF file, a version conversion step precedes the steps of the above embodiment. Specifically: a Word document is converted into a Word file of version 2003 or later; if the original is a PDF file, it is likewise first converted into a Word file of version 2003 or later.
The Word file of version 2003 or later is then converted into a file in XML format.
The specific embodiments described above explain the objects, technical solutions, and beneficial effects of the present invention in further detail. It should be understood that the foregoing is merely a set of specific embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (3)
- 1. A translation-based Word document parsing method, characterized by comprising the following steps: scanning and parsing an XML format file using XPath; constructing POI paragraphs from the content of the wp tags obtained by scanning; parsing the source text and styles in each paragraph, recording the position of each wp tag in the file, and numbering it.
- 2. The translation-based Word document parsing method according to claim 1, characterized in that a text conversion step is also included before the XML format file is scanned and parsed, the step being specifically: if the original is a Word document, it is converted directly into a file in XML format; if the original is a PDF file, it is first converted into a Word document and then converted into a file in XML format.
- 3. The translation-based Word document parsing method according to claim 2, characterized in that the Word document is a file of version 2003 or later, and if the Word document is of a version earlier than 2003, a version conversion step is also included.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611180452.8A CN106649271A (en) | 2016-12-19 | 2016-12-19 | Translation-based word document analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611180452.8A CN106649271A (en) | 2016-12-19 | 2016-12-19 | Translation-based word document analysis method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106649271A true CN106649271A (en) | 2017-05-10 |
Family
ID=58835001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611180452.8A Pending CN106649271A (en) | 2016-12-19 | 2016-12-19 | Translation-based word document analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649271A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107885735A (en) * | 2017-11-21 | 2018-04-06 | 语联网(武汉)信息技术有限公司 | Format-independent document translation method and system |
CN108052496A (en) * | 2017-12-19 | 2018-05-18 | 国云科技股份有限公司 | Source-file-based Word image and text formatting system and implementation method thereof |
CN110018984A (en) * | 2017-10-31 | 2019-07-16 | 北京国双科技有限公司 | File format conversion method and device |
CN111159981A (en) * | 2019-12-31 | 2020-05-15 | 北京迈迪培尔信息技术有限公司 | Method and device for analyzing and translating Excel document |
CN111401000A (en) * | 2020-04-03 | 2020-07-10 | 上海一者信息科技有限公司 | Translation real-time preview method for online auxiliary translation |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104156207A (en) * | 2014-07-31 | 2014-11-19 | 广州金山网络科技有限公司 | File display method and device |
CN104714944A (en) * | 2015-04-14 | 2015-06-17 | 语联网(武汉)信息技术有限公司 | Document translation method and document translation system |
-
2016
- 2016-12-19 CN CN201611180452.8A patent/CN106649271A/en active Pending
Non-Patent Citations (2)
Title |
---|
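朱先远 et al.: "Design and implementation of a mobile-terminal-oriented electronic homework grading ***", Journal of Yangtze University (Natural Science Edition) * |
杨倩晨 et al.: "XML-based automatic document typesetting technology", 2012 2ND INTERNATIONAL CONFERENCE ON APPLIED SOCIAL SCIENCE * |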
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110018984A (en) * | 2017-10-31 | 2019-07-16 | 北京国双科技有限公司 | File format conversion method and device |
CN107885735A (en) * | 2017-11-21 | 2018-04-06 | 语联网(武汉)信息技术有限公司 | Format-independent document translation method and system |
CN107885735B (en) * | 2017-11-21 | 2021-05-04 | 语联网(武汉)信息技术有限公司 | Format-independent document translation method and system |
CN108052496A (en) * | 2017-12-19 | 2018-05-18 | 国云科技股份有限公司 | Source-file-based Word image and text formatting system and implementation method thereof |
CN111159981A (en) * | 2019-12-31 | 2020-05-15 | 北京迈迪培尔信息技术有限公司 | Method and device for analyzing and translating Excel document |
CN111159981B (en) * | 2019-12-31 | 2023-08-08 | 北京迈迪培尔信息技术有限公司 | Method and device for analyzing and translating Excel document |
CN111401000A (en) * | 2020-04-03 | 2020-07-10 | 上海一者信息科技有限公司 | Translation real-time preview method for online auxiliary translation |
CN111401000B (en) * | 2020-04-03 | 2023-06-20 | 上海一者信息科技有限公司 | Real-time translation previewing method for online auxiliary translation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649271A (en) | Translation-based word document analysis method | |
CN108491199B (en) | Method and terminal for automatically generating interface | |
JP5883557B2 (en) | How to add metadata to data | |
CN102722479B (en) | Method and device for implementing language translation | |
US8260064B2 (en) | Image processing apparatus, image processing method, computer-readable medium and computer data signal | |
US20070208997A1 (en) | Xsl transformation and translation | |
US8155945B2 (en) | Image processing apparatus, image processing method, computer-readable medium and computer data signal | |
US9817887B2 (en) | Universal text representation with import/export support for various document formats | |
CN104699714A (en) | Method and device for transferring files of book edition format into files of EPUB format | |
CN111144070B (en) | Document analysis translation method and device | |
CN110414010B (en) | Processing method of internationalized resource file translation text and readable storage medium | |
CN104484323A (en) | Translation processing method based on document segment | |
CN101866331A (en) | Conversion method and device of XML (Extensible Markup Language) documents of different languages | |
CN105373562A (en) | Acquisition method and device of PDF (Portable Document Format) documentation comment | |
CN107423271B (en) | Document generation method and device | |
CN102209279A (en) | Extensible markup language (XML)-based multi-language support method | |
KR20070062800A (en) | Method for transforming of electronic document based on mapping rule and system thereof | |
CN106021197B (en) | Translation system and translation method for DWG format files | |
CN106055529B (en) | Parsing system and parsing method for text data to be translated in DWG format files | |
CN111159981B (en) | Method and device for analyzing and translating Excel document | |
CN102521359A (en) | Interface data file comparison method and device | |
CN113296773B (en) | Copyright labeling method and system for cascading style sheets | |
CN111125483A (en) | Method and device for generating webpage data extraction template, computer device and computer readable storage medium | |
Darvishy et al. | A flexible software architecture concept for the creation of accessible PDF documents | |
Bloechle et al. | Dolores: An Interactive and Class-Free Approach for Document Logical Restructuring. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |