WO2007129288A2 - Electronic document reformatting - Google Patents

Electronic document reformatting

Info

Publication number
WO2007129288A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
document
electronic document
formatted
printing
Prior art date
Application number
PCT/IE2007/000030
Other languages
French (fr)
Other versions
WO2007129288A3 (en)
Inventor
Seamus Mcgrenery
Brian Mcgrath
Kevin Clarke
Original Assignee
Big River Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big River Limited filed Critical Big River Limited
Publication of WO2007129288A2 publication Critical patent/WO2007129288A2/en
Publication of WO2007129288A3 publication Critical patent/WO2007129288A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition

Definitions

  • the present invention relates to a system and method for reformatting electronic documents. More particularly, the present invention relates to an enhanced system and method for reprocessing electronic documents previously formatted for printing or for display on a specific device, into format-independent electronic documents with a coherent read order.
  • This network-enabled distribution is traditionally accomplished either via electronic mail messaging, wherein the electronic version of the print document may be sent and subsequently received by a recipient as a file attached to an electronic mail message (email), or via downloading the electronic document as a file made available on an Internet page (webpage) via its storage location reference in the network.
  • PDF portable document format
  • doc Microsoft Word document
  • a document designer may choose to convey the meaning that a particular string of text is a heading by rendering it as an image, which the native application will display as if it were large text.
  • the document designer may choose to make text stand out by layering the same string of text twice with one slightly offset.
  • in the first case, the intended meaning will be lost if the text is extracted for use on a different device or application.
  • in the second case, the meaning may be confused by the text being repeated.
  • VDUs computer video display units
  • visual control such as a mouse or pointing device.
  • This combination provides difficulties for users who are trying to access the document on a non-standard VDU screen, and/or need to interact with controls using another method, such as a keyboard. But this disadvantage can equally arise due to characteristics of the document, such as the information contained in the document not being structured according to the intended reading order.
  • a further disadvantage of electronic documents is that they may be compiled from a variety of sources, or on a variety of computer operating systems, resulting in individual text objects being encoded in character sets other than that declared for the document or document section as a whole.
  • the resolution of the VDU of desktop computers has increased dramatically over time, from the Color Graphics Adapter (CGA) standard of 1981, with 160x200 pixels in 16 colours, through to the more recent Ultra Extended Graphics Array (UXGA), with 1600x1200 pixels in 32-bit colour.
  • CGA Color Graphics Adapter
  • UXGA Ultra Extended Graphics Array
  • the VDUs of most modern portable devices, such as mobile phones or PDAs, are still limited to a resolution of approximately 320x200 pixels with 16-bit colour.
  • an apparatus for converting electronic documents formatted for printing, or for display on a specific device, the apparatus comprising processing means, memory means and data input means, the memory means storing instructions, a data structure defining at least one electronic document formatted for printing and having a layout, at least one document layout template and at least one recipient terminal profile.
  • the instructions configure the processing means to compare the document layout defined by the data structure against the at least one document layout template, select metadata for an output electronic document formatted for display in response to the comparison, map alphanumerical data in the data structure to corresponding ASCII character data, optically recognise alphanumerical data in the data structure, compare the optically-recognised alphanumerical data against the mapped alphanumerical data, identify image data in the data structure, and in response to a selection of the at least one recipient terminal profile, optionally rescale the image data and output the electronic document, including the optionally rescaled image data and ASCII character data, wherein the output electronic document is formatted for display according to the selected metadata and profile.
  • the memory means preferably further stores a plurality of recipient terminal profiles and the instructions further configure the processing means to read a selection of a recipient terminal profile by a user and to format the output electronic document for display according to the selected recipient terminal profile.
  • the plurality of terminal profiles may include a mobile telephone handset profile, a personal digital assistant profile and a personal computer profile.
  • the instructions preferably comprise a plurality of functional modules, including a document processing module, a text data processing module, an optical character recognition module and an image data scanning module.
  • the image data preferably comprises picture screen elements and attributes such as an image size in picture screen elements, an image resolution in dots per inch, and wherein every picture screen element is defined by respective red, green, blue and optionally alpha values.
  • Alphanumerical data may include any of ASCII, ANSI, ISO, 16-bit UNICODE or UTF-8 text data.
  • the data structure preferably includes metadata and a layout, the layout including layout data and layer data.
  • the layer data defines which alphanumerical data and image data of the document formatted for printing belong to which layer of the document formatted for printing.
  • the layout data defines where alphanumerical data and image data of the document formatted for printing are physically disposed within the document area.
  • the output document is preferably a compiled HTML help (chm) file.
  • a method for converting an electronic document formatted for printing into an electronic document formatted for display, the method comprising the steps of comparing a document layout defined by a data structure defining at least one electronic document formatted for printing and having a layout against at least one document layout template, selecting metadata for an output electronic document formatted for display in response to the comparison, mapping alphanumerical data in the data structure to corresponding ASCII character data, optically recognising alphanumerical data in the data structure, comparing the optically-recognised alphanumerical data against the mapped alphanumerical data, identifying image data in the data structure, and in response to receiving a selection of at least one recipient terminal profile, optionally rescaling the image data and outputting the electronic document including the optionally rescaled image data and ASCII character data, wherein the output electronic document is formatted for display according to the selected metadata and profile.
  • the method preferably comprises the further steps of storing a plurality of recipient terminal profiles, reading a selection of a recipient terminal profile by a user and formatting the output electronic document for display according to the selected recipient terminal profile.
  • the method also preferably comprises the further steps of scanning and buffering the electronic document formatted for printing and components thereof.
  • the method moreover preferably comprises the further step of identifying print convention data, such as document page numbers, document headers, document cross references and the like, in the electronic document formatted for printing.
  • Hyperlinks may advantageously be generated from the identified print convention data.
  • the method preferably comprises the further step of identifying network resource convention data such as uniform resource locators, email addresses and the like in the electronic document formatted for printing. Hyperlinks may advantageously be generated from the identified network resource convention data.
  • the data structure preferably comprises a layout including layout data and layer data, the layer data defining which alphanumerical data and image data of the document formatted for printing belong to which layer of the document formatted for printing and the layout data defining where alphanumerical data and image data of the document formatted for printing are physically disposed within the document area.
  • the step of outputting preferably comprises the step of generating a table of contents and a file index.
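By way of a hedged illustration of the method summarised above, the following Python sketch strings the claimed steps together: layout comparison against a template, mapping of extracted text to ASCII, cross-checking against OCR output, and rescaling of images for a recipient terminal profile. Every helper, field name and data value here is a hypothetical stand-in for illustration, not the patented implementation.

```python
# Illustrative sketch of the claimed conversion pipeline; every helper here is a
# hypothetical stand-in for the modules described in the embodiment.

def match_layout_template(layout, templates):
    """Pick the template whose column count matches the detected layout."""
    return next((t for t in templates if t["columns"] == layout["columns"]),
                templates[0])

def convert_document(document, templates, profile):
    template = match_layout_template(document["layout"], templates)
    metadata = {"template": template["name"], "profile": profile["name"]}

    # Map extracted alphanumerical data to ASCII and cross-check against OCR.
    mapped = [t.encode("ascii", "replace").decode("ascii")
              for t in document["text_runs"]]
    verified = [m if m == o else o          # prefer the OCR reading where they disagree
                for m, o in zip(mapped, document["ocr_runs"])]

    # Rescale images that exceed the recipient terminal's display size.
    images = [{"id": im["id"],
               "width": min(im["width"], profile["max_width"])}
              for im in document["images"]]

    return {"metadata": metadata, "text": verified, "images": images}

# Minimal usage example with toy data.
doc = {"layout": {"columns": 2},
       "text_runs": ["Résumé", "Page 1"],
       "ocr_runs": ["Resume", "Page 1"],
       "images": [{"id": "img1", "width": 1600}]}
templates = [{"name": "letter", "columns": 1}, {"name": "two-column", "columns": 2}]
profile = {"name": "mobile handset", "max_width": 320}
print(convert_document(doc, templates, profile))
```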
  • a method for converting an electronic document formatted for printing into an electronic document formatted for display, the method comprising the steps of mapping alphanumerical data in a data structure of the electronic document to corresponding ASCII character data; optically recognizing alphanumerical data in the data structure; comparing the optically-recognized alphanumerical data against the mapped alphanumerical data; identifying image data in the data structure; identifying sections of the electronic document content where the intended meaning is conveyed by the use of layered text or text and image data and processing this layered data according to rules to ensure that the intended meaning is preserved in the converted document; identifying additional text included for decorative effect and deleting it or replacing it with appropriate markup; rescaling the image data if the image data exceeds an image data parameter; and outputting the electronic document including the optionally rescaled image data, processed layered data and compared ASCII character data.
  • a method for the detection of the reading order of an electronic document formatted for printing, the method comprising the steps of identifying horizontal spaces in the electronic document; identifying vertical spaces in the electronic document; for each combination of intersecting horizontal spaces and vertical spaces, declaring a document block; ordering the document blocks; and outputting the electronic document reading order according to the document block order.
  • a method for the detection of tabulated data in an electronic document formatted for printing, comprising the steps of identifying blocks of alphanumerical data in the electronic document; mapping alphanumerical data in a data structure of the electronic document to corresponding ASCII character data; determining respective horizontal and vertical positional data of the identified blocks; comparing the respective horizontal positional data of at least two neighbouring identified blocks for detecting a candidate column; comparing the respective vertical positional data of at least two neighbouring identified blocks for detecting a candidate row; and upon detecting at least a predetermined number of rows and columns, storing the corresponding ASCII character data of the identified blocks as a table.
  • a user interface for converting an electronic document formatted for printing into an electronic document formatted for display, the user interface comprising multiple user-configurable panes for guiding a user through a modular sequence of steps according to any of the document converting methods described above.
  • the multiple user-configurable panes preferably comprise at least one pane for representing the electronic document formatted for printing and another pane for representing a converted version of the electronic document formatted for printing.
  • the said another pane is configured to automatically represent the same point in the converted document as the electronic document formatted for printing shown in the at least one pane.
  • the user interface preferably also comprises means for ensuring that converted documents comply with accessibility standards.
  • Figure 1 shows a preferred embodiment of the present invention in an environment comprising a plurality of network-connected user terminals, including those of a sender and of a recipient;
  • Figure 2 provides an example of a sender user terminal shown in Figure 1, which includes memory means, processing means and networking means;
  • Figure 3 details the processing steps according to which the sender user terminal of Figures 1 and 2 operates, including steps of loading a data structure, of processing the data structure for video display and of outputting the processed data structure for video display;
  • Figure 4 illustrates the contents of the memory means shown in Figure 2 further to the loading step shown in Figure 3;
  • Figure 5 illustrates the components of the document shown in Figure 4, including layer data and layout data;
  • Figure 6 shows examples of document layers and layouts according to the data shown in Figure 5;
  • Figure 7 further details the step of processing a data structure for video display performed by the sender user terminal of Figures 1 to 6;
  • Figure 8 further details the step of outputting the processed data structure for video display performed by the sender user terminal of Figures 1 to 6;
  • Figure 9 details steps of detecting and processing document layout in an alternative embodiment of the present invention.
  • Figure 10 details steps of detecting and processing table data in an alternative embodiment of the present invention.
  • Figure 11 provides an illustration of a graphical user interface for the application shown in Figure 4.
  • a preferred embodiment of the present invention is shown in an environment in Figure 1.
  • a plurality of network-connected user terminals is shown, which first includes recipient user devices such as a mobile telephone handset 101.
  • the handset 101 is configured with wireless telecommunication emitting and receiving functionality, such as over a cellular telephone network configured according to either Global System for Mobile Communication ('GSM') or General Packet Radio Service ('GPRS') network industry standards.
  • Handset 101 receives or emits voice and/or data encoded as a digital signal over wireless data transmission 102, wherein the signal is relayed respectively to or from the handset 101 by the geographically-closest communication link relay 103 of a plurality thereof.
  • 'GSM' Global System for Mobile Communication
  • 'GPRS' General Packet Radio Service
  • the plurality of communication link relays allows the digital signal to be routed between handset 101 and its intended recipient or from its remote emitter, in the example sender user terminal 104 of content service provider 105, by means of a remote gateway 106.
  • the gateway 106 is for instance a communication network switch coupling digital signal traffic between wireless telecommunication networks, such as the network within which wireless data transmission 102 takes place, and a wide area network (WAN) 107, an example of which is the Internet, or an Intranet or Extranet.
  • the gateway 106 further provides protocol conversion if required, for instance if handset 101 uses a Wireless Application Protocol (WAP) differing from the Internet Transmission Control Protocol/Internet Protocol (TCP/IP) in order to receive from, and optionally distribute data to, terminal 104, which is itself only connected to the WAN 107 via an Internet Service Provider (ISP) 108.
  • WAP Wireless Application Protocol
  • TCP/IP Internet Transmission Control Protocol/Internet Protocol
  • ISP Internet Service Provider
  • the user of handset 101 may also have the use of another mobile terminal 109, for instance a Personal Digital Assistant (PDA).
  • PDA Personal Digital Assistant
  • PDA 109 is also connected to the WAN 107 and configured to exchange data according to the TCP/IP protocol with the sender user terminal 104 via a different network 110, such as a Local Area Network (LAN) or a Wireless Local Area Network (WLAN) conforming to the IEEE 802.11b or IEEE 802.11g standard, interfaced with WAN 107, although it will be readily understood by those skilled in the art that the present invention is not limited thereto and may indeed include any other such protocol or standard, depending upon the device, its networking means, operating system and processing capacity.
  • LAN Local Area Network
  • WLAN Wireless Local Area Network
  • Terminal 104 is a computer terminal configured with a data processing unit 201, data outputting means such as video display unit (VDU) 202, data inputting means such as a keyboard 203 and a pointing device (mouse) 204 and data inputting/outputting means such as network connection 205, magnetic data-carrying medium reader/writer 206A and optical data-carrying medium reader/writer 207A for, respectively, reading from or writing data to magnetic data-carrying medium (floppy disk or solid state memory card) 206B and reading from or writing data to optical data-carrying medium (CD-RAM, CD-RW, DVD-RAM, DVD-RW, DVD-R and the like) 207B.
  • a central processing unit (CPU) 208 such as an Intel Pentium 4 manufactured by the Intel Corporation, provides task co-ordination and data processing functionality. Instructions and data for the CPU 208 are stored in main memory 209 and a hard disk storage unit 210 facilitates non-volatile storage of data and several software applications.
  • WAN connection 205 is provided by way of a modem 211 as a wired dial-up connection or, alternatively, by way of a Network Interface Card (NIC) 212 as a wired high-bandwidth connection to the ISP 108 and the Internet 107.
  • NIC Network Interface Card
  • a universal serial bus (USB) input/output interface 213 facilitates connection to the keyboard and pointing device 203, 204. All of the above devices are connected to a data input/output bus 214, to which the magnetic data-carrying medium reader/writer 206A and optical data-carrying medium reader/writer 207A are also connected.
  • a video adapter 215 receives CPU instructions over the bus 214 for outputting processed data to VDU 202.
  • data processing unit 201 is of the type generally known as a compatible Personal Computer ('PC'), but may equally be any device configured with processing means, output data display means, memory means, input means and wired or wireless network connectivity. The processing steps according to which the terminal 104 operates according to the present invention are further detailed in Figure 3.
  • the terminal 104 is switched on, whereby a first set of instructions known as the Operating System (OS) is loaded in memory 209 at step 302, to configure the terminal with basic interoperability and connectivity with ISP 108 and the Internet 107.
  • OS Operating System
  • a second set of instructions hereinafter referred to as a converting application is loaded in memory 209, either from storage 210 or from a remote terminal, for instance a remote server accessible via the Internet 107, by way of a data download.
  • the second set of instructions includes a plurality of functional modules further described herein below, at least one of which allows a user to select and load an electronic document configured for printing or display on a specific device at the next step 304.
  • the converting application receives user input to indicate the desired output file format of the electronic document configured for display on multiple devices, and optionally to indicate the type or characteristics of the recipient user terminal, in order to parameterize the processing of the electronic document configured for print by the modules at the next step 306 and the outputting of the processed electronic document now configured for display by the application at the next step 307.
  • the user of terminal 104 may then write the electronic document output at step 307 to a removable media 206B, 207B and/or distribute same to one or a plurality of remote recipient user terminals, such as mobile phone 101, by way of its network connection.
  • a question can be asked at step 308, as to whether another electronic document configured for print should be processed according to steps 304 to 307, and control returns to step 304 for the selection thereof if the question is answered in the affirmative.
  • the question is answered negatively, and at step 309 the user of terminal 104 may then decide to terminate the processing of the application first loaded at step 303, and eventually switch the terminal 104 off at step 310.
  • the contents of the memory means 209 after the steps 303, 304 of loading instructions and an electronic document are illustrated in Figure 4.
  • the OS loaded at step 302 is shown at 401, which in the example is Windows® XP® Professional®, manufactured and distributed by the Microsoft® Corporation of Redmond, Washington, United States, but it will be readily apparent to those skilled in the art that the present invention is not limited thereto and that any other OS may be used to configure terminal 104 with basic functionality, such as Mac OS/X® manufactured and distributed by Apple® Inc. of Cupertino, California, United States, or Linux which is freely distributed.
  • the application loaded at step 303 is shown at 402 and comprises a plurality of modules including a document processing module 403, a text data processing module 404, an Optical Character Recognition (OCR) module 405 and an image data scanning module 406, the respective functionalities of which will be further described herein below.
  • each module apportions a part of memory 209 as a respective buffer, the buffers being shown as 407 in the figure.
  • a set of instructions for interpreting document layout is shown at 408, which is also loaded in memory 209 at step 303 along with the application 402, and which the document processing module 403 preferably uses at step 306, in conjunction with text data processing module 404, for processing a data structure shown at 409 after the loading thereof according to step 304. The data structure 409 is an electronic document formatted for print and, in the example, is formatted according to the Portable Document Format (*.pdf) file format of Adobe Systems Inc. of San Jose, California, United States and output by Adobe applications, notably the well-known Adobe® Acrobat® Reader application.
  • PDF Portable Document Format
  • the document 409 initially comprises metadata 501, which may be better understood as a description or definition of the electronic data contained in the data structure 409.
  • Metadata 501 therefore preferably includes data structure information, such as the document file type, the language of the document, the page format and the number of pages of the document, but can also include information about when the document was created, by whom and what changes have been made to the document since its creation, or even include descriptive HTML tags (also known to those skilled in the art as Meta tags) if the data structure 409 is an HTML document.
  • Document 409 also includes text data 502 encoded in one or more of a plurality of character representations including, but not limited to, ASCII, ANSI, ISO, 16-bit UNICODE or its 8-bit representation UTF-8 text data 503 developed as a universal character set for international use and allowing a wider variety of text characters than ASCII to be used, particularly for representing the written characters of Asian languages.
  • document 409 may include textual content stored in an image file format to convey a particular meaning when displayed in the native application.
  • document 409 will include additional text data 502, the purpose of which is to create a visual effect for either implying a meaning or for artistic effect.
  • Document 409 next includes image n data 504, which in the context of the preferred embodiment is any graphical data in the document which does not convey textual information, such as photographs, artistic or design works, charts and the like.
  • image or graphical component 504 is individually defined within the document 409 in size by a pixel area, in resolution by a dot per inch (dpi) value and in content by individual pixels, each of which comprises respective Red, Green and Blue (RGB) values and, optionally, an Alpha (transparency) value.
  • dpi dot per inch
  • RGB Red, Green and Blue
  • Alpha transparency
  • Image data 504 may further be modified by being overlaid with text or other graphic objects to convey a particular meaning when displayed in the native application.
  • Document 409 is formatted for print and therefore the various visual components 502 to 504 thereof are disposed relative to one another to achieve a particular result, be it in terms of a visual effect imparted upon the reader or a specific order in which those visual components should be read or observed when the document is printed out.
  • Data defining this disposition may take one of two forms, or possibly both, respectively represented as document layer data 505 and document layout data 506. Any or both of the layer data 505 and layout data 506 may be included as part of metadata 501.
  • Layer data 505 comprises data defining which visual components 502 to 504 belong to which layer of the document formatted for printing. Layers may be described as document areas overlaying one another, figuratively "stacked" on top of each other, the aggregation of which results in the final document. A typical example may include a watermark image 504 belonging in a bottom layer 601 and a letter comprising text belonging in a top layer 602, the aggregation 603 of which results in a letter having text 502 or 503 over the watermark 504 when printed. Layer data 505 therefore defines the bottom and top layers 601, 602 with which watermark 504 and text 502 or 503 are associated, respectively. Layer data can be used for artistic effect or to convey meaning. A typical example of layer data being used to convey meaning is where a block of text is drawn to the reader's attention by being layered over a different background colour to that of the remaining text.
  • Layout data 506 comprises data defining where visual components 502 to 504 are physically disposed within the document area, e.g. where the watermark 504 is disposed relative to the total area of the document, for instance centered relative to the four boundaries of the document.
  • the disposition of visual components 502 to 504 according to layout data 506 can vary to a very large extent, which depends on the intended use and/or purpose of the document 409.
  • the single-sided example 603 of a watermarked letter 409 above can for instance be contrasted with another example 604 of a double-sided brochure, based upon an A4 document print format to be folded in thirds 605 to 607, wherein visual components 502 to 504 are laid out according to an intended reading order (605 recto, 605 verso, 606 verso, 607 verso, 607 recto, 606 recto) of the brochure, once printed and folded.
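As a rough illustration of how layer data 505 and layout data 506 might be represented in practice, the following Python sketch models placed components, layers and a page layout; the class and field names, and the watermarked-letter example, are assumptions for illustration only and do not reproduce the internal structures of any particular file format.

```python
# Hypothetical data structures for the layer data 505 and layout data 506
# described above; field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class PlacedComponent:
    component_id: str          # refers to a text block 502/503 or an image 504
    x: float                   # horizontal offset within the page, in points
    y: float                   # vertical offset within the page, in points
    width: float
    height: float

@dataclass
class Layer:
    z_order: int               # 0 = bottom layer (e.g. a watermark), higher = on top
    component_ids: list = field(default_factory=list)

@dataclass
class PageLayout:
    page_size: tuple                                   # e.g. (595, 842) for A4 in points
    placements: list = field(default_factory=list)     # PlacedComponent entries
    layers: list = field(default_factory=list)         # Layer entries

# A watermarked letter: the watermark image sits in the bottom layer, the text above it.
letter = PageLayout(
    page_size=(595, 842),
    placements=[PlacedComponent("watermark-504", 98, 221, 400, 400),
                PlacedComponent("body-text-502", 72, 72, 451, 698)],
    layers=[Layer(0, ["watermark-504"]), Layer(1, ["body-text-502"])],
)
print(len(letter.layers), "layers,", len(letter.placements), "placed components")
```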
  • step 306 according to which the instructions 402 configure the data processing system 201 to process document 409 into an electronic document formatted for display is further detailed in Figure 7.
  • application 402 processes the metadata 501 to identify and temporarily store document data in its respective portion of buffer 407, which is useful to determine characteristics of text data 502, 503, for instance its language which is particularly useful in the case of written language using Unicode or UTF-8 text data 503, and also to determine whether the loaded document 409 incorporates layer data 505 and/or layout data 506.
  • the text processing module 404 of application 402 extracts text data 502 from the document 409, which it also temporarily stores in its respective portion of buffer 407.
  • All text is passed through text processing module 404 using rules which detect whether any text is encoded in a character set other than that declared for the document, or is illegal in the output format. Where such characters are encountered, they are processed using rules and output in the correct character set and character. Where there is more than one possible correct output character, a comparison is made with the text extracted using the OCR module 405.
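A minimal sketch of this kind of rule-based character clean-up is shown below, assuming Unicode input text and an ASCII-only output format; the substitution table is illustrative and does not reproduce the rule set used by the invention, and the OCR cross-check for ambiguous characters is omitted.

```python
# Simplified sketch of rule-based character-set clean-up, assuming the extracted
# text arrives as Unicode and the output format only permits plain ASCII.
import unicodedata

SUBSTITUTIONS = {
    "\u2018": "'", "\u2019": "'",      # curly single quotes
    "\u201c": '"', "\u201d": '"',      # curly double quotes
    "\u2013": "-", "\u2014": "-",      # en and em dashes
    "\u00a0": " ",                     # non-breaking space
}

def to_output_charset(text: str) -> str:
    for bad, good in SUBSTITUTIONS.items():
        text = text.replace(bad, good)
    # Decompose remaining accented characters and drop what ASCII cannot hold;
    # in the invention, ambiguous cases would instead be checked against OCR output.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_output_charset("\u201cCaf\u00e9\u201d \u2013 na\u00efve d\u00e9j\u00e0\u00a0vu"))   # -> "Cafe" - naive deja vu
```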
  • the application detects if the document contains any instances where additional text may have been included for visual effect.
  • additional text is defined as repeated layered text, which can be discarded or replaced by markup.
  • the layers are examined to determine if one is used for visual effect, by having a significantly less prominent colour or a colour close to that of the background. It will be apparent that many other examples exist where a similar technique can be used to impart the intended meaning to the document.
  • Application 402 coordinates the respective data processing steps of each of processing modules 403 to 406. At step 703, application 402 invokes the scanning module 406 to scan the document, in its composite form if it comprises layer data 505 or if an initial reading of the character strings stored at step 702 indicates that there may be more than one possible glyph for any of the characters stored, and then stores the result in its respective buffer portion 407. Application 402 subsequently invokes the OCR processing module 405 to optically recognize characters from the buffered scanned document and temporarily store an OCR version of the document 409 in its respective buffer portion 407.
  • the text processing module 404 performs a comparison of the text data 502, 503 extracted at step 702 with the buffered OCR version of the document output at step 703, in order to detect and correct extraction errors, as well as to convert alphanumerical data featured in document 409 in the form of image data 504 into ASCII text data 502.
  • examples of alphanumerical data in the form of image data 504 include alphanumerical data in charts, graphs or in a stylized form, or data defined by layer data 505 as, or in, a graphic layer 601 as opposed to a text layer 602.
  • the extracted and verified text data is stored in the buffer portion 407 of document processing module 403.
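The comparison of extracted text against the OCR version might look something like the following sketch; the similarity threshold and the policy of preferring the OCR reading are assumptions, as the description only states that the two sources are compared in order to detect and correct errors.

```python
# Illustrative comparison of text extracted from the document structure against
# the OCR of the rendered page.  The 0.8 similarity threshold and the policy of
# preferring the OCR reading are assumptions.
from difflib import SequenceMatcher

def reconcile(extracted: str, ocr: str, threshold: float = 0.8):
    ratio = SequenceMatcher(None, extracted, ocr).ratio()
    if ratio == 1.0:
        return extracted, "agreed"
    if ratio >= threshold:
        return ocr, "corrected"       # assume an extraction error, take the OCR reading
    return ocr, "image-derived"       # text probably only existed as image data 504

for pair in [("Annua1 Report 2OO6", "Annual Report 2006"),
             ("", "Figure 3: Sales by region")]:
    print(reconcile(*pair))
```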
  • a number of sets of instructions for document layout detection 408 are provided, for instance for the more common types of document layout that may be encountered, such as a letter, a magazine article with two or more columns, a broadsheet article with four or more columns, and a double-sided brochure and folded variations thereof as provided in the example 604. It will be readily apparent to those skilled in the art that the above templates are provided by way of example only and are not meant to be limitative.
  • application 402 matches the verified text data with a document layout template 408.
  • application 402 obtains from the performing of steps 701 to 705 that document 409 has an A4 format from step 701, that text data has been rotated by 90 degrees for the purposes of performing the OCR step 703, and that the disposition of text data 502 within the A4 area of document 409 defines three portions 605 to 607.
  • application 402 obtains from performing step 701 that layout data 506 defines three columns corresponding to the three portions 605 to 607.
  • at step 707, application 402 processes the buffered scanned document output by the scanning module at step 703 for identifying respective images 504 n.
  • a question is therefore asked at step 708, as to whether an image n 504 is present in the document 409.
  • An initial question is asked as to whether any layer data is stored above the image. If layer data is stored above the image, the image is recaptured using a 'scan' of the source document display with OCR module 405, and processed at step 709. If the question of step 708 is answered positively, application 402 processes the identified image 504 n for reducing its respective storage requirement, e.g. by down-sampling it.
  • application 402 outputs the image n 504 to VDU 202 for the user to optionally increase or decrease its scale, or change its boundary relative to text or graphic objects, so that for example an adjacent text string can be included or excluded from the image.
  • a question is then asked as to whether there is a text string either overlaying the image, or adjacent to the image which might be used as a text alternative to assist blind users, or users accessing the document on a small screen without image display (or with image display turned off), for determining the meaning of the image.
  • Adjacent text may be selected for inclusion by the use of one or more of a plurality of features including: textual conventions such as the inclusion of a word such as 'caption' 'image' or 'photo'; layout conventions such as the placement of a short text string in isolation and adjacent to an image; and typographical conventions such as a change in font, font style, size or weight.
  • text alternatives will be displayed to the user by application 402 for acceptance, editing or rejection.
  • the identified text alternative is stored in association with image 504 for output.
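The selection of adjacent text as a text alternative could be sketched as a simple scoring of the three kinds of convention named above; the weights, the 40-point adjacency bound and the candidate data are all assumptions for illustration, not parameters given in the description.

```python
# Hedged sketch of scoring nearby text strings as alternative text for an image,
# using the textual, layout and typographical conventions named above.  The
# weights and the notion of "adjacent" are assumptions for illustration.

CAPTION_WORDS = {"caption", "image", "photo", "figure"}

def score_alt_candidate(text: str, distance: float, word_count: int,
                        font_differs: bool) -> float:
    score = 0.0
    if any(w in text.lower() for w in CAPTION_WORDS):
        score += 2.0                    # textual convention
    if distance <= 40 and word_count <= 12:
        score += 1.5                    # layout convention: short, isolated, adjacent
    if font_differs:
        score += 1.0                    # typographical convention
    return score

candidates = [
    ("Photo: the new production line, 2006", 20, 6, True),
    ("The following quarter saw continued growth across all regions.", 20, 9, False),
]
best = max(candidates, key=lambda c: score_alt_candidate(*c))
print("Suggested text alternative:", best[0])
```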
  • A question is then asked at step 710, as to whether another image 504 n+1 is present in the document 409. If the question of step 710 is answered positively, control returns to step 709 for the processing of this next image as previously described, and the image processing cycle continues until all images 504 n have been processed. The question of step 710 is therefore eventually answered negatively, whereby application 402 may then invoke document processing module 403 for outputting the electronic document formatted for display at step 307. Referring to the question of step 708, if the question is answered negatively and there is no image data 504 n to process in document 409, application 402 likewise next invokes document processing module 403 for outputting the converted document at step 307.
  • if the question of step 710 is answered positively, a further question is asked as to whether this image has already been used in the document; if the image has been used before, then this instance is replaced by a reference to the previous image and control returns to step 709.
  • application 402 and data processing modules 403 to 406 thereof process document 409 into text data 502, in ASCII or one of a plurality of character sets, and down-sampled image data 504, substantially reducing the storage space required to store the visual components of the original document.
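A possible down-sampling step is sketched below, using the Pillow imaging library purely as an example (the description does not name any library); the 320x200 bound mirrors the handset resolution quoted in the background section and would in practice come from the selected recipient terminal profile.

```python
# Sketch of the optional image rescaling using the Pillow library (an assumption;
# the patent does not name an imaging library).
from PIL import Image

def rescale_for_profile(src_path: str, dst_path: str,
                        max_size: tuple = (320, 200)) -> None:
    with Image.open(src_path) as img:
        if img.width > max_size[0] or img.height > max_size[1]:
            img.thumbnail(max_size)        # shrinks in place, keeps aspect ratio
        img.save(dst_path)

# Example call with hypothetical file names:
# rescale_for_profile("chart_504.png", "chart_504_mobile.png")
```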
  • the step 307 of outputting the electronic document formatted for display, converted from the electronic document formatted for printing 409 according to the present invention, is further detailed in Figure 8.
  • the document processing module 403 firstly processes the buffered metadata 501 of step 701 to identify document structuring data, for instance layout data 506 comprising structural tags if document 409 is formatted according to the HTML or XML structured languages, or headings if document 409 is formatted according to a format capable of storing such information, as for example the "doc" or "pdf" file formats.
  • a first question is therefore asked at step 801, as to whether the buffered metadata comprises such information.
  • at step 802, the document processing module 403 recursively processes the text data 502 output at step 705 to identify headings, sub-headings and other document subdivisions according to one or more of a plurality of: conventional document sub-dividing conventions, including for instance identifying strings of alphanumerical data 502 such as "chapter" or "part" with sequential numbering and the like; layout conventions such as the placement of a short text string in isolation; and typographical conventions such as a change in font, font style, size or weight.
  • a second question is asked at step 803 further to the identifying attempt of step 802, as to whether a document structure has been identified. If the question of step 803 is answered positively, then at step 804 the document processing module 403 temporarily stores data of the identified structure in buffer 407.
  • if the question of step 803 is answered negatively, the document processing module further processes the text to ask if potential headings can be identified to a lower level of certainty.
  • the potential headings identified are presented to the user for acceptance, editing or rejection.
  • the document processing module 403 recursively processes the text data 502 output at step 705 to identify headings, sub-headings and other document subdivisions by selecting a first string of alphanumerical data 502 and processing the frequency and disposition of said first string throughout the document.
  • module 403 assigns a respective probability of structural heading to the first string 502 based upon the processing of step 805 and temporarily stores said probability in respect of said first string.
  • at step 807, a question is asked as to whether a string of alphanumerical data 502 remains to be processed in the document according to steps 805 and 806 and, if answered positively, control returns to step 805, whereby a next string of alphanumerical data 502 is selected for which a probability is obtained and stored, and so on and so forth until all probable strings 502 have been processed.
  • the question of step 807 is therefore eventually answered negatively, and the document processing module 403 derives a probable document structure from the stored probabilities, which it temporarily stores at step 804.
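Steps 805 to 807 could be approximated by a scoring function of the following kind; the individual features and weights are assumptions, since the description only states that the frequency and disposition of each string are processed into a probability of it being a structural heading.

```python
# Illustrative scoring of candidate headings.  The features and weights are
# assumptions made for the sketch, not values taken from the patent.
import re

def heading_probability(text: str, font_size: float, body_size: float,
                        starts_block: bool) -> float:
    p = 0.0
    if re.match(r"^(chapter|part|section|appendix)\b", text.strip(), re.I):
        p += 0.4                                   # conventional sub-dividing word
    if re.search(r"\b\d+(\.\d+)*\b", text):
        p += 0.2                                   # sequential numbering
    if font_size > body_size:
        p += 0.25                                  # typographical convention
    if starts_block and len(text.split()) <= 10:
        p += 0.15                                  # short, isolated string
    return min(p, 1.0)

print(heading_probability("Chapter 3  Results", 14, 10, True))                   # high
print(heading_probability("the results were broadly positive", 10, 10, False))   # low
```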
  • application 402 processes the extracted text to search for print conventions which indicate a reference to another part of the document.
  • Application 402 seeks to identify printed page numbers. In formats where such numbers are stored directly, such as 'doc' or some versions of 'pdf', the numbers are identified and stored. Where there are no stored numbers, sequential page numbers may be identified by searching for print conventions such as 'page' or 'p' followed by a number, or a number placed in isolation at a consistent location on the page or a mirrored location on facing pages. These are stored as page numbers and mapped by offset to the leaf number or file section representing a page.
  • Application 402 next searches for a 'contents' page, identified by strings of data matching headings adjacent to numbers matching the page number of pages with the same heading. When a contents page is identified, all its entries are hyper-linked. Similarly, references within the text which contain clear identifiers such as 'see page n' are hyperlinked to the section with the appropriate page number.
  • application 402 will search for a reference such as 'see Appendix A' and it will then search for a heading matching text string 'Appendix A' and, having successfully located the heading, provide a hyperlink within the document from the text reference to the heading.
  • conventions such as those identifying footnotes or endnotes will be identified and hyper-linked.
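Hyper-linking of print conventions such as 'see page n' might be sketched as follows; the regular expression and the page-<n>.html anchor scheme are assumptions, while the offset mapping from printed page numbers to file sections follows the description above.

```python
# Sketch of hyper-linking in-text page references such as "see page 12".  The
# anchor naming scheme is an assumption; only the print conventions themselves
# come from the description above.
import re

PAGE_REF = re.compile(r"\bsee\s+(?:page|p\.?)\s*(\d+)\b", re.IGNORECASE)

def link_page_references(html_text: str, page_offset: int = 0) -> str:
    def repl(match: re.Match) -> str:
        printed = int(match.group(1))
        target = printed + page_offset           # printed number -> leaf/file number
        return f'<a href="page-{target}.html">{match.group(0)}</a>'
    return PAGE_REF.sub(repl, html_text)

print(link_page_references("Details are given later (see page 12)."))
```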
  • step 801 If the question of step 801 is answered positively, or as and when document processing module 403 stores structure data at step 804 as a result of identifying conventional headings at step 802 or deriving structural headings according to steps 805 to 807, control proceeds to step 808, at which the document processing module 403 generates a table of contents (TOC) from the structural data, then a file index at step 809, from which TOC and index application 402 can subsequently compile an electronic document formatted for display at step 810.
  • TOC table of contents
  • the electronic document formatted for display is a CHM file, known to those skilled in the art as a 'Help' file, which can be processed for display by any Windows OS 401, compiled from the TOC of step 808 as an HHC file, or HTML Help table of contents, and the Index of step 809 as an HHK file, or HTML Help index file.
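Generating the HHC table of contents from the stored structure data could look like the following sketch; the heading list is toy data and the sitemap markup simply follows the conventional HTML Help contents format, not a format defined by the patent.

```python
# Minimal sketch of generating an HHC (HTML Help table of contents) sitemap from
# stored structure data.  The headings and target file names are toy data.
def build_hhc(headings):
    items = "\n".join(
        '  <LI><OBJECT type="text/sitemap">\n'
        f'    <param name="Name" value="{title}">\n'
        f'    <param name="Local" value="{target}">\n'
        "  </OBJECT>"
        for title, target in headings)
    return f"<HTML><BODY>\n<UL>\n{items}\n</UL>\n</BODY></HTML>\n"

toc = build_hhc([("Introduction", "section1.htm"),
                 ("Appendix A", "appendix-a.htm")])
with open("toc.hhc", "w", encoding="ascii") as fh:
    fh.write(toc)
```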
  • application 402 searches the text for characters which suggest a reference to an external electronic resource such as a URL for a web page or an electronic mailing address. These are deduced by the presence of text such as www.name or name.com, or [email protected], for example. Where such text strings are detected, application 402 first checks the string for characters such as white space or line feeds which should not be present. It then, where an appropriate network connection is available, tests the link to the resource and, where valid, provides an active hyperlink.
  • the user is presented with a prompt, which will allow them to test all links and send test e-mails to all addresses detected in the document.
  • the resource will be hyperlinked.
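Detection and clean-up of these network resource conventions might be sketched as below; the regular expressions are deliberately loose assumptions, and the live test of each link and the test e-mail prompt described above are omitted.

```python
# Simplified sketch of detecting network resource conventions and cleaning them up
# before hyper-linking.  The patterns are loose assumptions for illustration.
import re

URL_LIKE = re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_LIKE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def clean_resource(text: str) -> str:
    # Strip white space and trailing punctuation that the print layout may have added.
    return re.sub(r"\s+", "", text).rstrip(".,;:)")

def find_resources(text: str):
    urls = [clean_resource(m) for m in URL_LIKE.findall(text)]
    emails = [clean_resource(m) for m in EMAIL_LIKE.findall(text)]
    return urls, emails

sample = "Visit www.example.com, or write to info@example.com for details."
print(find_resources(sample))    # (['www.example.com'], ['info@example.com'])
```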
  • the terminal 104 is itself a server, connected to the network, such as the Internet or an Intranet, which a number of users at remote locations can use to convert documents according to the present invention.
  • the remote users can initiate document conversion by such means as transferring an electronic copy of the document described herein below to be converted to a specific location, either directly or by use of an automated system for transferring documents such as an ECM (Electronic Content Management) System or other Document Management System.
  • the remote user could initiate the conversion process by such means as dragging an icon representing the document to an icon representing application 402, or using a command within an application such as Microsoft Word and choosing the conversion process, the remote user being thus able to control the conversion process by means such as pre-set commands, commands loaded into an editable *.ini file, or input by a remote user using an XML interface.
  • the application is based on the server 104 and pre-set to convert all documents delivered to one or more specific locations on a computer network.
  • it can be pre-set to convert a document in response to user input, either directly or indirectly through an application such as an ECM system or by using a command from software on a remote computer.
  • the application 402 further configures terminal 104 to extract and analyse the contents of document 409, in order to deduce a coherent read order and output the contents in correctly ordered blocks.
  • the application 402 achieves this functionality by firstly creating (9A) an electronic representation 901 of a page image. The electronic representation is then analysed to discover (9B) white spaces 902, which cut across the page. These cuts 902, which run parallel to the X axis 903, are designated as Y cuts. The electronic representation is then analysed to discover (9C) white spaces 904, which cut along the page. These cuts 904, which run parallel to the Y axis 905, are designated as X cuts. White spaces which do not intersect are discarded from the analysis process. These cuts 902, 904 are then used to subdivide (9D) the page 901 into the largest possible blocks 906. X cuts are given precedence over Y cuts to ensure that the read order of documents with multiple columns is respected.
  • the application 402 reiterates the discovering steps 9A to 9D through the block 906 to define and order the largest possible sub- block 907.
  • the application 402 reiterates the discovering steps 9A to 9D through the sub-block 907 to define and order the largest possible sub-sub-block 908.
  • the application continues to reiterate the discovering steps 9A to 9D until the smallest possible blocks 908 are discovered, which are then ordered (9E) top to bottom.
  • the application then proceeds to the next block 907, past an X cut with the highest Y value, and continues the sorting procedure 9A to 9E.
  • having found and ordered the component sub-blocks 907, 908 of a block 906, the application 402 then reiterates the discovering steps 9A to 9E to find the next neighbouring block 906B of that block 906A at the higher block level.
  • the component sub-blocks 907, 908 are then ordered. The process is continued until all component blocks are placed in a reading order.
  • the text is then output at step 9F, in one example as flowing text marked up as HTML.
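The Figure 9 procedure amounts to a recursive XY-cut over the page. The following sketch operates on text-block rectangles rather than on a page image, which is a simplification; the gap-finding helper and the toy page are assumptions, while the precedence of X cuts over Y cuts follows the description above.

```python
# Hedged sketch of the Figure 9 procedure as a recursive XY-cut over text-block
# rectangles (x0, y0, x1, y1), with y increasing down the page.  Gap detection on
# real page images is more involved; here a "cut" is any full-width or full-height
# white gap between blocks.

def gaps(intervals, span):
    """Uncovered sub-intervals of `span` given the covered `intervals`."""
    lo, hi = span
    out, cursor = [], lo
    for a, b in sorted(intervals):
        if a > cursor:
            out.append((cursor, a))
        cursor = max(cursor, b)
    if cursor < hi:
        out.append((cursor, hi))
    return out

def read_order(blocks):
    if len(blocks) <= 1:
        return list(blocks)
    x_span = (min(b[0] for b in blocks), max(b[2] for b in blocks))
    y_span = (min(b[1] for b in blocks), max(b[3] for b in blocks))
    # X cuts first: split left-to-right on full-height vertical gaps.
    x_cuts = gaps([(b[0], b[2]) for b in blocks], x_span)
    if x_cuts:
        cut = x_cuts[0][0]
        left = [b for b in blocks if b[2] <= cut]
        right = [b for b in blocks if b[2] > cut]
        return read_order(left) + read_order(right)
    # Then Y cuts: split top-to-bottom on full-width horizontal gaps.
    y_cuts = gaps([(b[1], b[3]) for b in blocks], y_span)
    if y_cuts:
        cut = y_cuts[0][0]
        top = [b for b in blocks if b[3] <= cut]
        bottom = [b for b in blocks if b[3] > cut]
        return read_order(top) + read_order(bottom)
    return sorted(blocks, key=lambda b: (b[1], b[0]))   # smallest blocks: top to bottom

# A headline spanning two columns: headline first, then left column, then right column.
page = [(0, 0, 100, 10), (0, 15, 45, 90), (55, 15, 100, 90)]
print(read_order(page))
```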
  • the application 402 further configures terminal 104 to extract and analyse the contents of document 409, in order to deduce any coherent table formatting and to output the table data content in correctly formed tables.
  • the application 402 achieves this functionality by firstly creating (10A) an electronic representation 1001 of a page image. The electronic representation is then analysed to discover (10B) blocks of text 1002 that are table candidates. Blocks 1002 are analysed based on their spacing from surrounding blocks.
  • line art such as cell boundaries may be treated as white space and discarded or, alternatively, line art which forms a boundary between rows is treated as a special row and line art which forms a boundary between columns is treated as a special column. It is not however a requirement of the invention that line art be present in order to detect and process a table.
  • a minimum of two blocks across and three blocks down is used as a rule to define if text blocks 1002 are table candidates.
  • Each of the text blocks 1002 is then analysed (10C) to determine its coordinates.
  • the vertical coordinates are defined as top, centre and bottom.
  • the horizontal coordinates are defined as left, centre and right.
  • Each block 1002 in the candidate table is then compared (10D) to its neighbours on the same broad vertical position in the source document.
  • these blocks 1002A, 1002B become a candidate row 1003A for the candidate table.
  • the application 402 is designed to take account of atypical common table designs, such as a table featuring some rows and columns with different respective vertical and/or horizontal alignment.
  • the application compares (10F) blocks 1002 with their neighbours on substantially the same horizontal position in the source document.
  • these blocks 1002A, 1002C become a candidate column 1004A for the candidate table.
  • the next set of blocks 1002B, 1002D at an horizontal position to the right is then tested (10G) in the same manner to determine if there is another candidate column 1004B.
  • the analysis is repeated until all blocks of text 1002 have been analysed to determine if they belong to candidate columns 1004.
  • Any blocks 1002 which do not belong to either candidate rows 1003 or columns 1004 are discarded. If the candidate rows and columns 1004 respectively number less than a predefined number of columns and rows, for instance two rows and two columns, then the candidate table is discarded (10H). Alternatively, the candidate rows and columns 1004 are declared (10I) as an actual table and the table data is stored in table cells for output as a structured table.
  • the application 402 may optionally analyse any grouping of blocks 1002 which share an inconsistent vertical or horizontal alignment to define column or row headers.
  • the rows and columns may also be further analysed to detect if a block could better be described as spanning two or more rows or columns. For example, when there is a row heading in every second row of a table, such row headings may be better described as spanning the intervening rows.
  • a textual analysis as previously described in relation to Figure 7 may also be carried out to further refine the processing of table data. For example, numerical data may be separated from textual data for output to a file type, which supports different cell data types, such as the Microsoft Excel "xls" file format of the Microsoft Corporation.
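The Figure 10 table test could be sketched as grouping text blocks by shared vertical and horizontal positions; the alignment tolerance and the grouping by block centres are assumptions, while the two-blocks-across and three-blocks-down rule comes from the description above.

```python
# Illustrative sketch of the Figure 10 table test: text blocks (x0, y0, x1, y1)
# are grouped into candidate rows by shared vertical position and into candidate
# columns by shared horizontal position.  The 3-point tolerance and the use of
# block centres are assumptions.

def centre(lo, hi):
    return (lo + hi) / 2.0

def group(blocks, key, tol=3.0):
    """Group blocks whose key value (row centre or column centre) lies within tol."""
    groups = []
    for b in sorted(blocks, key=key):
        if groups and abs(key(b) - key(groups[-1][-1])) <= tol:
            groups[-1].append(b)
        else:
            groups.append([b])
    return groups

def detect_table(blocks, min_cols=2, min_rows=3):
    rows = [g for g in group(blocks, key=lambda b: centre(b[1], b[3])) if len(g) > 1]
    cols = [g for g in group(blocks, key=lambda b: centre(b[0], b[2])) if len(g) > 1]
    if len(cols) < min_cols or len(rows) < min_rows:
        return None                                     # candidate table discarded
    # Order rows top-to-bottom and cells left-to-right to form the stored table.
    return [[b for b in sorted(row, key=lambda b: b[0])]
            for row in sorted(rows, key=lambda r: centre(r[0][1], r[0][3]))]

# A 3x2 grid of cells laid out on the page.
cells = [(10, 10, 40, 20), (60, 10, 90, 20),
         (10, 30, 40, 40), (60, 30, 90, 40),
         (10, 50, 40, 60), (60, 50, 90, 60)]
table = detect_table(cells)
print(len(table), "rows x", len(table[0]), "columns")
```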
  • the application 402 includes a user interface 1101, which facilitates the insertion of accessibility features to documents by an unskilled user.
  • the interface preferably includes multiple panes sharing a common look and feel, giving the user control over the respective functions of modules 403, 404, 405 and 406.
  • the number and type of panes being presented to the user are user controllable, so that they may be limited to those relevant to the user in a sequence of tasks for converting a document as previously described.
  • the interface 1101 is however configured for guiding the user through the modular, overall sequence of tasks, with appropriate prompting and advice provided at each step of the sequence.
  • the review of documents during conversion according to the present invention is facilitated by a double pane view 1102, 1103 of the document, which displays the original document 409 side by side with a preview of the converted document 1104.
  • the pane 1102 displaying the original document may display a plurality of file types, including Microsoft Word "doc" and Adobe Acrobat "pdf" files, in a single display version.
  • This configuration advantageously eliminates the requirement to install multiple file viewers on terminal 104, and thereby intrinsically limits the storage requirements, as well as the amount of training required by the user for interacting with the said plurality of file viewers.
  • the interface 1101 advantageously adds page break markers to the converted document 1104 previewed in pane 1103.
  • the interface can be set in such a way that when a user of terminal 104 chooses to view a new page in the original document 409, the pane 1103 automatically displaces the representation of converted document 1104 to the same point in the text.
  • This configuration is particularly advantageous when the removal of print-specific document features, such as large margins, fixed type sizes and multiple columns, results in a marked difference in page length between the original and output documents.
  • the interface 1101 further provides a means by which users, unskilled in the art of ensuring documents comply with accessibility standards, can ensure document accessibility.
  • the application 402 performs an automated accessibility test upon the document being converted, and configures interface 1101 to output messages about accessibility issues, which require user input for decision and/or correction.
  • the document 409 displayed in pane 1102 is replaced with a list of prompts on accessibility issues, which require user input for decision and/or correction, and which the user may consult side-by-side with the converted document preview.
  • the interface 1101 is further configured to provide a tool tip 1105 with guidance on how to solve the issue in respect of each prompt, and to open a pane 1106 with controls for providing relevant input.
  • the application 402 configures the interface 1101 to indicate to the user that this text alternative is required for both blind readers and those accessing the document on displays which do not support images, and the indication further comprises a text input control box for adding a text alternative to the image.
  • the interface 1101 is moreover configured with text formatting controls by means of which users, unskilled in either the arts of HTML coding or CSS coding, can provide further input to the automatically generated HTML markup and linked cascading style sheets.
  • the user is presented with text formatting controls, which are familiar to users of standard word processing applications such as Microsoft Word or OpenOffice, and which are linked to both HTML and CSS editing functions for allowing the user to reformat a block of text by one or a combination of: changing the HTML selector used to mark it up, changing the CSS properties of all instances of that selector, or changing an instance of the selector to be a special case marked in both the HTML and CSS as a span, class or id.
  • the user may be presented with an id and class editing control linked to both HTML and CSS, which allows the addition or deletion of an id, class or an instance of a class, or its renaming to be more meaningful in the semantic context of the document.
  • the interface 1101 is moreover configured with table of contents editing controls.
  • the document 409 displayed in pane 1102 is replaced with a table of contents view, which the user may consult side-by-side with the converted document preview for ease of reference.
  • the table of contents can be edited by the user to change the text of a link.
  • the user can also add or delete links from the table of contents as required. Having done so the user can choose to output the document as one or more of: a single HTML file with a linked table of contents, multiple HTML files comprising one for each page of the original document with a linked table of contents, multiple HTML files comprising one for each section of the original document with a linked table of contents.
  • An important feature of interface 1101 is the inclusion of a save control for application 402 to store work in progress as a single container file, with the maximum useable information contained therein.
  • the container file can then be reopened and editing continued at the user's convenience.
  • the container file can be any of a readily available type such as a 7-Zip, WinRAR or Winzip archive, or a proprietary archive, which uses a format such as XML.
  • the source images are also stored prior to the application of any transformation matrix for display in the source file format. All transformation matrices, whether extracted from the original document or created during editing, or a combination of both, are stored.
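Saving work in progress as a single container file could be as simple as the following sketch, which writes the converted HTML, the CSS, the source images and an XML manifest into one ZIP archive; the archive layout and file names are assumptions, and the transformation matrices mentioned above are not modelled here.

```python
# Sketch of the single-container save: work in progress is written into one ZIP
# archive.  The archive layout and file names are illustrative assumptions.
import zipfile

def save_container(path, html, css, images, settings_xml):
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content/document.html", html)
        zf.writestr("content/styles.css", css)
        for name, data in images.items():
            zf.writestr(f"images/{name}", data)     # source images, pre-transformation
        zf.writestr("manifest.xml", settings_xml)

save_container(
    "work-in-progress.zip",
    "<html><body><h1>Draft</h1></body></html>",
    "h1 { font-size: 120%; }",
    {"photo_504.png": b"\x89PNG..."},
    "<settings><profile>mobile handset</profile></settings>",
)
```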
  • interface 1101 is configured to store unique user settings, including data learned by application 402 during the conversion of multiple documents, in a server client environment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)
  • Processing Or Creating Images (AREA)
  • Record Information Processing For Printing (AREA)

Abstract

An apparatus and methods are provided for converting electronic documents formatted for printing into corresponding electronic documents formatted for display. The conversion involves the comparison of the document layout defined by the print format against at least one document layout template, the mapping of alphanumerical data in the electronic document to corresponding ASCII character data, the optical recognition of alphanumerical data in the electronic document, the comparison of the optically-recognized alphanumerical data against the mapped alphanumerical data, the identification of image data in the electronic document, the optional rescaling of the image data if the image data exceeds an image data parameter, and the output of a converted electronic document including the optionally rescaled image data and compared ASCII character data, the output electronic document being formatted for display according to the document layout template.

Description

Title
Electronic document reformatting
Field of the Invention
The present invention relates to a system and method for reformatting electronic documents. More particularly, the present invention relates to an enhanced system and method for reprocessing electronic documents previously formatted for printing or for display on a specific device, into format-independent electronic documents with a coherent read order.
Background of the invention
Many computerised systems and methods used therewith are known, with which users input any combination of alphabetical, numerical and image data for generating electronic versions of print documents. Typically, such documents have been prepared for distribution to recipients in printed form, whereby the finalised document is output by the computer on which it has been generated to a printing device. The ubiquitous development and adoption of the Internet has however simplified such distribution, as any electronic version of a print document, such as a report or a brochure which beforehand needed to be printed and mailed as a hardcopy to a distant recipient, may now simply be broadcast across the Internet to the respective terminals of one or many recipients, irrespective of their often disparate geographical locations, and be received quasi-instantaneously. Similarly, documents may be prepared for display on a particular type of device such as a computer with a video display unit (VDU) screen and a particular type of software.
This network-enabled distribution is traditionally accomplished either via electronic mail messaging, wherein the electronic version of the print document may be sent and subsequently received by a recipient as a file attached to an electronic mail message (email), or via downloading the electronic document as a file made available on an Internet page (webpage) via its storage location reference in the network. Particularly well-known types of such files include the portable document format ("pdf") file format developed by Adobe Systems Inc. of San Jose, California and the Microsoft Word document ("doc") file format developed by the Microsoft Corporation of Redmond, Washington.
Although such files and the respective computer applications used to produce them have immensely facilitated the production and distribution of document-based information since their advent, especially for end-users unskilled in the particular arts of typesetting, page-setting and graphic design, they feature an important disadvantage, inherent to their initial purpose of presenting and reproducing graphic and textual information in print, or on a specific device such as a standard VDU, in that the intended meaning of the content may be lost or confused when they are converted for use or display on a range of devices. This disadvantage arises principally from the software applications used to create and display such documents (native applications) using a combination of layered data and layout data to convey meaning through visual presentation of a combination of text and graphical elements, which can be lost when the text and graphic data is extracted for use or display on another device. Typically, a document designer may choose to convey the meaning that a particular string of text is a heading by rendering it as an image, which the native application will display as if it were large text. Alternatively the document designer may choose to make text stand out by layering the same string of text twice with one slightly offset. In the first of the above examples, the intended meaning will be lost if the text is extracted for use on a different device or application; in the second, the meaning may be confused by the text being repeated.
Additionally, the native applications used to display documents are typically optimised for interaction using a combination of a display device, such as a computer video display unit (VDU), with user interaction by a visual control such as a mouse or pointing device. This combination presents difficulties for users who are trying to access the document on a non-standard VDU screen, and/or need to interact with controls using another method, such as a keyboard. This disadvantage can equally arise due to characteristics of the document, such as the information contained in the document not being structured according to the intended reading order.
A further disadvantage of electronic documents is that they may be compiled from a variety of sources, or on a variety of computer operating systems, resulting in individual text objects being encoded in character sets other than that declared for the document or document section as a whole.
These interrelated disadvantages are increasingly problematic, as there is an ongoing global trend to increase information data distribution from more traditional delivery platforms, such as desktop computers, to mobile platforms in the so-called pervasive computing era. This is evidenced by increasingly frequent announcements by device manufacturers of new personal digital assistants (PDAs), mobile telephone handsets and personal media centres (PMCs) with enhanced functionality. Mobile platforms present their own unique challenges, especially in the domain of graphical information processing and presentation. The devices are typically restricted to having low computational power, small memory storage space, limited-bandwidth network access to the Internet and ever-increasing miniaturisation requirements, particularly of their video display units (VDUs). By way of example, the resolution of the VDU of desktop computers has increased dramatically over time, from the Color Graphics Adapter (CGA) standard in 1981 of 160x200 pixels with 16 colours, through to the more recent Ultra Extended Graphics Array (UXGA) of 1600x1200 pixels with 32-bit colour. Conversely, the VDUs of most modern portable devices, such as mobile phones or PDAs, are still limited to a resolution of approximately 320x200 pixels with 16-bit colour.
Having regard to current attempts at solving the above problems singularly or jointly, further problems such as the inconsistent mapping of printed characters to the data stored in an electronic document, or the implementation of document meaning conveyed through layered data or through a method which will only convey the intended meaning correctly in the native application, disadvantage users who rely on assistive technology to access electronic documents, for example blind users who use text-to-voice interpretation applications. These users may find that words cannot be processed into spoken language because one or more letters are inconsistently mapped to ASCII, or that lines from one paragraph are spoken as part of another paragraph because of the way the document is laid out or saved, and they may have to listen to the whole text of a very large document to find the part that is of interest to them. Existing applications for creating and converting copies of electronic documents do not meet the needs of document publishers in relation to making documents accessible to users of assistive technology.
A number of attempts have been made at automatically deducing the read order of electronic documents, the contents of which are stored in an arbitrary fashion, such as in files of the portable document format ("pdf") file format developed by Adobe Systems Inc. of San Jose, California, or electronic images of documents captured by scanning and OCR technology. US2004/0205568 A1 (Bruel et al.) discloses a method and system for deconstructing and reconstructing a document image layout, and US 5,907,631 (Takashi) discloses a document image processing method and system for determining the reading order of regions of body text. These attempts typically either rely on creating a display representation, which mimics the display representation of the source document without creating a coherent read order, or are limited in their ability to process source documents having more complex layouts. The disadvantage of having an arbitrary read order to text extracted from a source document poses a particular burden for users of devices other than a desktop computer screen to access document content. For example, users accessing content on mobile devices or through assistive technology may be unable to access a coherent version of the document text.
These disadvantages can also make it difficult or impossible to re-purpose the content of such documents for use in content management systems, such as a database or web content management system.
With particular regard to tabulated document data, it is known to detect and capture table data from documents wherein the table formatting is missing. This might apply to file types where there is no specific definition for a table element, such as, for example, a table created in a text ("txt") file by spacing blocks of text. Likewise, this might apply to files of the portable document format ("pdf") file format developed by Adobe Systems Inc. of San Jose, California or electronic images of documents captured by scanning and OCR technology.
Current solutions to the problem of detecting table structure from the visual presentation of the tabulated text are deficient in a number of respects.
Firstly, those methods which rely on bounding boxes will often fail, because some or all of the cell boundaries of a table are often not represented by visible lines.
Secondly, those methods which rely on line positioning of extracted text will be unable to properly form table cells where the table data in a cell spans more than a single line. These methods also typically fail when a table is sparsely populated with data, for instance when there is significant space between lines, or when empty cells are present.
Many alternative solutions for overcoming these difficulties rely in whole or in part on recreating an exact visual representation of the table image in a source document. While this may accurately represent the table visually, it suffers from the disadvantage that the table data is not readily re-usable in applications which rely on table-structured data. This disadvantage will also impact on visually impaired people who rely on assistive technology, which uses the table formatting to present the table data coherently. Further to the above, any unskilled user wishing to convert an electronic copy of a document from a format designed for print, such as a Microsoft Word ("doc") file or Adobe Acrobat ("pdf") file, to a format suitable for delivery to multiple devices, such as HTML, is disadvantaged because of the amount of knowledge and practice required.
Existing software solutions will typically try to maintain the print presentation features of the original document; for example, the Microsoft Word feature 'save to HTML' will output HTML code with the document's print attributes encoded in the markup. Those solutions which save HTML from Adobe Acrobat pdf files, such as Magellan from BCL Computers, use markup techniques such as positional DIVs to recreate the appearance of the print document. In either case, a considerable amount of specialist skill, including skill in the manual editing of HTML code, can be necessary to bring the document to a standard that will be correctly and consistently processed by multiple devices.
Secondly, it may be necessary to address specific accessibility issues in the document. For example, images and tables may require the manual inputting of additional information for users of assistive technology. Again, a considerable degree of expertise, including specialist skill in the manual editing of HTML code, is needed to apply these changes.
Thirdly, it may be necessary to carry out a number of processing steps which, traditionally, have required the use of separate applications. It may be necessary to capture or manipulate an image from the original document, traditionally done using an image manipulation application such as Adobe Photoshop, the use of which requires skill and training. Tables, anchors or other features in the HTML may need to be inserted or edited using an HTML editing application such as Macromedia Dreamweaver, the use of which requires skill and training.
Existing solutions to these problems suffer from the disadvantages of requiring the use of a number of different conversion tools, a high level of skill and large amount of manual effort, and subsequent cost, to convert documents.
Object of the Invention
It is an object of the present invention to provide an improved system for converting an electronic document formatted for printing or display on a computer into an electronic document formatted for display on a variety of devices.
It is another object of the present invention to provide an automatic electronic document conversion system, in which text character glyphs are accurately mapped to electronic representations of text characters, so that the intended character is displayed.
It is yet another object of the present invention to provide an automatic electronic document conversion system, in which image-formatted text data is identified and processed for outputting in corresponding text data.
It is a further object of the present invention to provide an improved method of converting an electronic document formatted for printing or display on a VDU into an electronic document formatted for display on multiple devices, by identifying a meaning conveyed by layering multiple objects for display as a single object in the native application, and rendering these as a single object.
It is a further object of the present invention to provide an improved method of converting an electronic document, whereby additional text introduced into the design to create a visual effect in the native application is identified and removed, or replaced by markup which conveys the same meaning without the use of additional text.
It is a further object of the present invention to provide an improved automatic electronic document conversion system, in which textual conventions for referring readers to another part of the document are augmented by hyperlinks, which the reader can follow in the electronic copy of the document.
It is a further object of the present invention to provide an automatic electronic document conversion system, in which textual conventions for referring a reader to an electronic resource such as a URL or e-mail address are processed into hyperlinks, which the user can follow.
It is a further object of this invention to improve the characteristics of electronic copies of print documents so that they can be accessed by users of devices other than standard computers, such as users of assistive technology and mobile devices.
It is a further object of this invention to extract and analyse the contents from an electronic copy of a print document, in order to deduce a coherent read order and output the contents in correctly ordered blocks.
It is a further object of this invention to extract and analyse the contents from an electronic copy of a print document, in order to deduce any coherent table formatting and output the table data content in correctly formed tables.
It is a further object of this invention to provide a user interface for the automatic electronic document conversion system, which facilitates the insertion of accessibility features to documents by an unskilled user.
Summary of the Invention
According to an aspect of the present invention, an apparatus is provided for converting electronic documents formatted for printing, or for display on a specific device, the apparatus comprising processing means, memory means and data input means, the memory means storing instructions, a data structure defining at least one electronic document formatted for printing and having a layout, at least one document layout template and at least one recipient terminal profile. The instructions configure the processing means to compare the document layout defined by the data structure against the at least one document layout template, select metadata for an output electronic document formatted for display in response to the comparison, map alphanumerical data in the data structure to corresponding ASCII character data, optically recognise alphanumerical data in the data structure, compare the optically-recognised alphanumerical data against the mapped alphanumerical data, identify image data in the data structure, and, in response to a selection of the at least one recipient terminal profile, optionally rescale the image data and output the electronic document, including the optionally rescaled image data and ASCII character data, wherein the output electronic document is formatted for display according to the selected metadata and profile.
The memory means preferably further stores a plurality of recipient terminal profiles and the instructions further configure the processing means to read a selection of a recipient terminal profile by a user and to format the output electronic document for display according to the selected recipient terminal profile. The plurality of terminal profiles may include a mobile telephone handset profile, a personal digital assistant profile and a personal computer profile.
The instructions preferably comprise a plurality of functional modules, including a document processing module, a text data processing module, an optical character recognition module and an image data scanning module.
The image data preferably comprises picture screen elements and attributes such as an image size in picture screen elements and an image resolution in dots per inch, wherein every picture screen element is defined by respective red, green, blue and optionally alpha values.
Alphanumerical data may include any of ASCII, ANSI, ISO, 16-bit UNICODE or UTF-8 text data.
The data structure preferably includes metadata and a layout, the layout including layout data and layer data. The layer data defines which alphanumerical data and image data of the document formatted for printing belong to which layer of the document formatted for printing. The layout data defines where alphanumerical data and image data of the document formatted for printing are physically disposed within the document area.
The output document is preferably a compiled HTML help (chm) file.
According to another aspect of the present invention, a method is provided for converting an electronic document formatted for printing into an electronic document formatted for display, the method comprising the steps of comparing a document layout, defined by a data structure defining at least one electronic document formatted for printing and having a layout, against at least one document layout template, selecting metadata for an output electronic document formatted for display in response to the comparison, mapping alphanumerical data in the data structure to corresponding ASCII character data, optically recognising alphanumerical data in the data structure, comparing the optically-recognised alphanumerical data against the mapped alphanumerical data, identifying image data in the data structure, and, in response to receiving a selection of at least one recipient terminal profile, optionally rescaling the image data and outputting the electronic document including the optionally rescaled image data and ASCII character data, wherein the output electronic document is formatted for display according to the selected metadata and profile.
The method preferably comprises the further steps of storing a plurality of recipient terminal profiles, reading a selection of a recipient terminal profile by a user and formatting the output electronic document for display according to the selected recipient terminal profile.
The method also preferably comprises the further steps of scanning and buffering the electronic document formatted for printing and components thereof.
The method moreover preferably comprises the further step of identifying print convention data, such as document page numbers, document headers, document cross references and the like, in the electronic document formatted for printing. Hyperlinks may advantageously be generated from the identified print convention data.
The method preferably comprises the further step of identifying network resource convention data such as uniform resource locators, email addresses and the like in the electronic document formatted for printing. Hyperlinks may advantageously be generated from the identified network resource convention data.
The data structure preferably comprises a layout including layout data and layer data, the layer data defining which alphanumerical data and image data of the document formatted for printing belong to which layer of the document formatted for printing and the layout data defining where alphanumerical data and image data of the document formatted for printing are physically disposed within the document area.
The step of outputting preferably comprises the step of generating a table of contents and a file index.
According to another aspect of the present invention, a method is provided for converting an electronic document formatted for printing into an electronic document formatted for display, the method comprising the steps of mapping alphanumerical data in a data structure of the electronic document to corresponding ASCII character data; optically recognizing alphanumerical data in the data structure; comparing the optically-recognized alphanumerical data against the mapped alphanumerical data; identifying image data in the data structure; identifying sections of the electronic document content where the intended meaning is conveyed by the use of layered text or text and image data and processing this layered data according to rules to ensure that the intended meaning is preserved in the converted document; identifying additional text included for decorative effect and deleting it or replacing it with appropriate markup; rescaling the image data if the image data exceeds an image data parameter; and outputting the electronic document including the optionally rescaled image data, processed layered data and compared ASCII character data.
According to yet another aspect of the present invention, a method is provided for the detection of the reading order of an electronic document formatted for printing, the method comprising the steps of identifying horizontal spaces in the electronic document; identifying vertical spaces in the electronic document; for each combination of intersecting horizontal spaces and vertical spaces, declaring a document block; ordering the document blocks; and outputting the electronic document reading order according to the document block order.
According to still another aspect of the present invention, a method is provided for the detection of tabulated data in an electronic document formatted for printing, the method comprising the steps of identifying blocks of alphanumerical data in the electronic document; mapping alphanumerical data in a data structure of the electronic document to corresponding ASCII character data; determining respective horizontal and vertical positional data of the identified blocks; comparing the respective horizontal positional data of at least two neighbouring identified blocks for detecting a candidate column; comparing the respective vertical positional data of at least two neighbouring identified blocks for detecting a candidate row; and, upon detecting at least a predetermined number of rows and columns, storing the corresponding ASCII character data of the identified blocks as a table.
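A minimal sketch of this positional comparison, written in Python purely for illustration; the block data structure and the alignment tolerances are assumptions made for the example rather than values from the disclosure.

```python
from collections import defaultdict

def detect_table(blocks, x_tol=5, y_tol=5, min_rows=2, min_cols=2):
    """Group text blocks into candidate rows and columns by aligned positions.

    Each block is assumed to be a dict such as {"text": "42", "x": 120, "y": 300}.
    """
    columns = defaultdict(list)   # blocks sharing (roughly) the same horizontal position
    rows = defaultdict(list)      # blocks sharing (roughly) the same vertical position
    for block in blocks:
        columns[round(block["x"] / x_tol)].append(block)
        rows[round(block["y"] / y_tol)].append(block)
    candidate_cols = [c for c in columns.values() if len(c) > 1]
    candidate_rows = [r for r in rows.values() if len(r) > 1]
    if len(candidate_rows) < min_rows or len(candidate_cols) < min_cols:
        return None               # not enough alignment to declare a table
    # Emit a simple grid ordered top-to-bottom, left-to-right.
    return [[cell["text"] for cell in sorted(rows[key], key=lambda b: b["x"])]
            for key in sorted(rows)]

print(detect_table([
    {"text": "Name", "x": 50, "y": 100}, {"text": "Qty", "x": 150, "y": 100},
    {"text": "Bolt", "x": 50, "y": 120}, {"text": "4",   "x": 150, "y": 120},
]))   # [['Name', 'Qty'], ['Bolt', '4']]
```

Because cells are grouped by approximate alignment rather than by visible lines, a sketch along these lines also copes with tables whose cell boundaries are not drawn.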
According to a further aspect of the present invention, a user interface is provided for converting an electronic document formatted for printing into an electronic document formatted for display, the user interface comprising multiple user-configurable panes for guiding a user through a modular sequence of steps according to any of the document converting methods described above.
The multiple user-configurable panes preferably comprise at least one pane for representing the electronic document formatted for printing and another pane for representing a converted version of the electronic document formatted for printing. The said other pane is configured to automatically represent the same point in the converted document as is shown for the electronic document formatted for printing in the at least one pane.
The user interface preferably also comprises means for ensuring that converted documents comply with accessibility standards.
Brief Description of the Drawings
The invention will be better understood upon consideration of the following detailed description and the accompanying drawings, in which:
Figure 1 shows a preferred embodiment of the present invention in an environment comprising a plurality of network-connected user terminals, including those of a sender and of a recipient;
Figure 2 provides an example of a sender user terminal shown in Figure 1, which includes memory means, processing means and networking means;
Figure 3 details the processing steps according to which the sender user terminal of Figures 1 and 2 operates, including steps of loading a data structure, of processing the data structure for video display and of outputting the processed data structure for video display;
Figure 4 illustrates the contents of the memory means shown in Figure 2 further to the loading step shown in Figure 3;
Figure 5 illustrates the components of the document shown in Figure 4, including layer data and layout data;
Figure 6 shows examples of document layers and layouts according to the data shown in Figure 5;
Figure 7 further details the step of processing a data structure for video display performed by the sender user terminal of Figures 1 to 6;
Figure 8 further details the step of outputting the processed data structure for video display performed by the sender user terminal of Figures 1 to 6;
Figure 9 details steps of detecting and processing document layout in an alternative embodiment of the present invention;
Figure 10 details steps of detecting and processing table data in an alternative embodiment of the present invention; and
Figure 11 provides an illustration of a graphical user interface for the application shown in Figures 3 to 10.
Detailed Description of the Drawings
The words "comprises/comprising" and the words "having/including" when used herein with reference to the present invention are used to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.
A preferred embodiment of the present invention is shown in an environment in Figure 1. A plurality of network-connected user terminals is shown, which first includes recipient user devices such as a mobile telephone handset 101. The handset 101 is configured with wireless telecommunication emitting and receiving functionality, such as over a cellular telephone network configured according to either Global System for Mobile Communication ('GSM') or General Packet Radio Service ('GPRS') network industry standards. Handset 101 receives or emits voice and/or data encoded as a digital signal over wireless data transmission 102, wherein the signal is relayed respectively to or from the handset 101 by the geographically-closest communication link relay 103 of a plurality thereof.
The plurality of communication link relays allows the digital signal to be routed between handset 101 and its intended recipient or from its remote emitter, in the example sender user terminal 104 of content service provider 105, by means of a remote gateway 106. Gateway 106 is for instance a communication network switch coupling digital signal traffic between wireless telecommunication networks, such as the network within which wireless data transmission 102 takes place, and a wide area network (WAN) 107, an example of which is the Internet, or an Intranet or Extranet. The gateway 106 further provides protocol conversion if required, for instance if handset 101 uses a Wireless Application Protocol (WAP) differing from the Internet Transmission Control Protocol/Internet Protocol (TCP/IP) in order to receive from, and optionally distribute data to, terminal 104, which is itself only connected to the WAN 107 via an Internet Service Provider (ISP) 108.
The user of handset 101 may also have the use of another mobile terminal 109, for instance a Personal Digital Assistant (PDA). In the example, PDA 109 is also connected to the WAN 107 and configured to exchange data according to the TCP/IP protocol with the sender user terminal 104 via a different network 110, such as a Local Area Network (LAN) or a Wireless Local Area Network (WLAN) conforming to the IEEE 802.11b or IEEE 802.11g standard, interfaced with WAN 107, although it will be readily understood by those skilled in the art that the present invention is not limited thereto and may indeed include any other such protocol or standard, depending upon the device, its networking means, operating system and processing capacity.
Thus, the potential exists for data exchange between any of handset 101, terminals 104 and 109, by way of wireless data transmission 102 interfaced by gateway 106 or wireless data transmission within network 110 and the Internet 107.
An example of the terminal 104 at the service provider 105 shown in Figure 1 is provided in Figure 2. Terminal 104 is a computer terminal configured with a data processing unit 201, data outputting means such as video display unit (VDU) 202, data inputting means such as a keyboard 203 and a pointing device (mouse) 204 and data inputting/outputting means such as network connection 205, magnetic data-carrying medium reader/writer 206A and optical data-carrying medium reader/writer 207A for, respectively, reading from or writing data to magnetic data-carrying medium (floppy disk or solid state memory card) 206B and reading from or writing data to optical data-carrying medium (CD-ROM, CD-RW, DVD-RAM, DVD-RW, DVD-R and the like) 207B.
Within data processing unit 201, a central processing unit (CPU) 208, such as an Intel Pentium 4 manufactured by the Intel Corporation, provides task co-ordination and data processing functionality. Instructions and data for the CPU 208 are stored in main memory 209 and a hard disk storage unit 210 facilitates non-volatile storage of data and several software applications. WAN connection 205 is provided by way of a modem 211 as a wired dial-up connection or, alternatively, by way of a Network Interface Card (NIC) 212 as a wired high-bandwidth connection to the ISP 108 and the Internet 107.
A universal serial bus (USB) input/output interface 213 facilitates connection to the keyboard and pointing device 203, 204. All of the above devices are connected to a data input/output bus 214, to which the magnetic data-carrying medium reader/writer 206A and optical data-carrying medium reader/writer 207A are also connected. A video adapter 215 receives CPU instructions over the bus 214 for outputting processed data to VDU 202. In the embodiment, data processing unit 201 is of the type generally known as a compatible Personal Computer ('PC'), but may equally be any device configured with processing means, output data display means, memory means, input means and wired or wireless network connectivity.
The processing steps according to which the terminal 104 operates according to the present invention are further detailed in Figure 3. At step 301 the terminal 104 is switched on, whereby a first set of instructions known as the Operating System (OS) is loaded in memory 209 at step 302, to configure the terminal with basic interoperability and connectivity with ISP 108 and the Internet 107. At step 303, a second set of instructions, hereinafter referred to as a converting application, is loaded in memory 209, either from storage 210 or from a remote terminal, for instance a remote server accessible via the Internet 107, by way of a data download.
In the preferred embodiment, the second set of instructions includes a plurality of functional modules further described herein below, at least one of which allows a user to select and load an electronic document configured for printing or display on a specific device at the next step 304. At step 305, the converting application receives user input to indicate the desired output file format of the electronic document configured for display on multiple devices, and optionally to indicate the type or characteristics of the recipient user terminal, in order to parameterize the processing of the electronic document configured for print by the modules at the next step 306 and the outputting of the processed electronic document now configured for display by the application at the next step 307. After step 307 the user of terminal 104 may then write the electronic document output at step 307 to a removable media 206B, 207B and/or distribute same to one or a plurality of remote recipient user terminals, such as mobile phone 101, by way of its network connection.
A question can be asked at step 308, as to whether another electronic document configured for print should be processed according to steps 304 to 307, and control returns to step 304 for the selection thereof if the question is answered in the affirmative. Alternatively, the question is answered negatively, and at step 309 the user of terminal 104 may then decide to terminate the processing of the application first loaded at step 303 at the step 309 and eventually switch the terminal 104 off at step 310.
The contents of the memory means 209 after the steps 303, 304 of loading instructions and an electronic document are illustrated in Figure 4. The OS loaded at step 302 is shown at 401, which in the example is Windows® XP® Professional®, manufactured and distributed by the Microsoft® Corporation of Redmond, Washington, United States, but it will be readily apparent to those skilled in the art that the present invention is not limited thereto and that any other OS may be used to configure terminal 104 with basic functionality, such as Mac OS/X® manufactured and distributed by Apple® Inc. of Cupertino, California, United States, or Linux, which is freely distributed.
The application loaded at step 303 is shown at 402 and comprises a plurality of modules including a document processing module 403, a text data processing module 404, an Optical Character Recognition (OCR) module 405 and an image data scanning module 406, the respective functionalities of which will be further described herein below. In operation, each module apportions a part of memory 209 as a respective buffer, the buffers being shown as 407 in the figure. A set of instructions for interpreting document layout is shown at 408, which is also loaded in memory 209 at step 303 along with the application 402, and which the document processing module 403 preferably uses at step 306 in conjunction with text data processing module 404 for processing a data structure shown at 409 after the loading thereof according to step 304, which is an electronic document formatted for print and, in the example, is formatted according to the Portable Document Format (*.pdf) file format of Adobe Systems Inc. of San Jose, California, United States and output by Adobe applications, notably the well-known Adobe® Acrobat® Reader application. It will be readily apparent to those skilled in the art that the present invention is not limited thereto, however, and a preferred embodiment of the present invention may process many other types of electronic documents formatted for print, such as electronic document formatted according to the Microsoft Word (*.doc) file format of Microsoft® Corporation or documents prepared for display, such as HTML (*.html or .htm) or XML (*.xml) documents.
Various types of data components of the data structure 409 are further described in Figure 5. The document 409 initially comprises metadata 501, which may be better understood as a description or definition of the electronic data contained in the data structure 409. Metadata 501 therefore preferably includes data structure information, such as the document file type, the language of the document, the page format and the number of pages of the document, but can also include information about when the document was created, by whom and what changes have been made to the document since its creation, or even include descriptive HTML tags (also known to those skilled in the art as Meta tags) if the data structure 409 is an HTML document.
Document 409 also includes text data 502 encoded in one or more of a plurality of character representations including, but not limited to, ASCII, ANSI, ISO, 16-bit UNICODE or its 8-bit representation UTF-8 text data 503 developed as a universal character set for international use and allowing a wider variety of text characters than ASCII to be used, particularly for representing the written characters of Asian languages.
In some instances, document 409 may include textual content stored in an image file format to convey a particular meaning when displayed in the native application. In some other instances, document 409 will include additional text data 502, the purpose of which is to create a visual effect for either implying a meaning or for artistic effect.
Document 409 next includes image data 504n, which in the context of the preferred embodiment is any graphical data in the document which does not convey textual information, such as photographs, artistic or design works, charts and the like. Each image or graphical component 504 is individually defined within the document 409 in size by a pixel area, in resolution by a dot per inch (dpi) value and in content by individual pixels, each of which comprises respective Red, Green and Blue (RGB) values and, optionally, an Alpha (transparency) value. Alternatively, graphical content may be defined by reference to a vector object, such as a line of a given width in pixels and colour values in RGB between points A and B. Image data 504 may further be modified by being overlaid with text or other graphic objects to convey a particular meaning when displayed in the native application.
Document 409 is formatted for print and therefore the various visual components 502 to 504 thereof are disposed relative to one another to achieve a particular result, be it in terms of a visual effect imparted upon the reader or a specific order in which those visual components should be read or observed when the document is printed out. Data defining this disposition may take one of two forms, or possibly both, respectively represented as document layer data 505 and document layout data 506. Any or both of the layer data 505 and layout data 506 may be included as part of metadata 501.
Examples of layer and layout data are shown in Figure 6. Layer data 505 comprises data defining which visual components 502 to 504 belong to which layer of the document formatted for printing. Layers may be described as document areas overlaying one another, figuratively "stacked" on top of each other, the aggregation of which results in the final document. A typical example may include a watermark image 504 belonging in a bottom layer 601 and a letter comprising text belonging in a top layer 602, the aggregation 603 of which results in a letter having text 502 or 503 over the watermark 504 when printed. Layer data 505 therefore defines the bottom and top layers 601, 602, with which watermark 504 and text 502 or 503 are associated, respectively. Layer data can be used for artistic effect or to convey meaning. A typical example of layer data being used to convey meaning is where a block of text is drawn to the reader's attention by being layered over a different background colour to that of the remaining text.
Layout data 506 comprises data defining where visual components 502 to 504 are physically disposed within the document area, e.g. where the watermark 504 is disposed relative to the total area of the document, for instance centered relative to the four boundaries of the document. The disposition of visual components 502 to 504 according to layout data 506 can vary to a very large extent, which depends on the intended use and/or purpose of the document 409. The single-sided example 603 of a watermarked letter 409 above can for instance be contrasted with another example 604 of a double-sided brochure, based upon an A4 document print format to be folded in thirds 605 to 607, wherein visual components 502 to 504 are laid out according to an intended reading order (605 recto, 605 verso, 606 verso, 607 verso, 607 recto, 606 recto) of the brochure, once printed and folded. It can be very easily appreciated that unless the example document 604 is physically printed and folded, a user would find reading document 604 presented on a VDU, such as VDU 202, particularly difficult, since layout data 506 specifies in this particular example that image and text data are rotated clockwise by 90 degrees relative to a VDU reading orientation shown as arrow 608 and image and text data are apportioned to their respective brochure recto or verso "face" 605 to 607, again irrespective of VDU reading order 608. In this instance the text reading order of the source document file will need to be changed in order to preserve the intended reading order.
The step 306 according to which the instructions 402 configure the data processing system 201 to process document 409 into an electronic document formatted for display is further detailed in Figure 7. At step 701, application 402 processes the metadata 501 to identify and temporarily store document data in its respective portion of buffer 407, which is useful to determine characteristics of text data 502, 503, for instance its language which is particularly useful in the case of written language using Unicode or UTF-8 text data 503, and also to determine whether the loaded document 409 incorporates layer data 505 and/or layout data 506. At step 702, the text processing module 404 of application 402 extracts text data 502 from the document 409, which it also temporarily stores in its respective portion of buffer 407. All text is passed through text processing module 404 using rules, which detect if there is text encoded in character sets other than that declared for the document, or illegal in the output format. Where such characters are encountered they are processed using rules, and output in the correct character set and character. Where there is more than one possible correct output character a comparison is made with the text extracted using the OCR module 405.
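The character-handling rules of step 702 are not reproduced in the text above, but the idea can be sketched minimally as follows, in Python and purely for illustration: stray characters from undeclared character sets are mapped onto the declared output set, and the OCR reading is used as a tie-breaker where more than one output character is plausible. The substitution table and the fallback logic are assumptions made for this example.

```python
import unicodedata

# Characters that commonly appear in print-oriented documents but are illegal or
# ambiguous in a plain-ASCII output (this mapping is an assumption for the example).
SUBSTITUTIONS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-",    # en and em dashes
    "\u00a0": " ",                   # non-breaking space
}

def normalise_character(ch, ocr_ch=None):
    """Map one extracted character to the declared output character set."""
    if ch in SUBSTITUTIONS:
        return SUBSTITUTIONS[ch]
    if ord(ch) < 128:
        return ch                                      # already plain ASCII
    decomposed = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode()
    if len(decomposed) == 1:
        return decomposed                              # a single unambiguous mapping
    # More than one plausible output character: defer to the OCR reading.
    return ocr_ch if ocr_ch else ch

extracted, ocr = "na\u00efve\u2019s", "naive's"
print("".join(normalise_character(c, o) for c, o in zip(extracted, ocr)))   # naive's
```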
The application detects if the document contains any instances where additional text may have been included for visual effect. In one example, additional text is defined as repeated layered text, which can be discarded or replaced by markup. In another example where layered text is detected, the layers are examined to determine if one is used for visual effect, by having a significantly less prominent colour or a colour close to that of the background. It will be apparent that many other examples exist where a similar technique can be used to impart the intended meaning to the document.
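A minimal sketch of how repeated layered text and low-contrast decorative layers might be recognised, assuming each text object carries its string, position and RGB colour; the offset and contrast thresholds below are illustrative assumptions, not values taken from the disclosure.

```python
def is_layered_duplicate(obj_a, obj_b, offset_px=3):
    """Detect the same string drawn twice with a slight offset (e.g. a drop shadow)."""
    return (obj_a["text"] == obj_b["text"] and
            abs(obj_a["x"] - obj_b["x"]) <= offset_px and
            abs(obj_a["y"] - obj_b["y"]) <= offset_px)

def is_low_contrast(text_rgb, background_rgb, threshold=30):
    """Flag text whose colour is close to the background, i.e. likely purely decorative."""
    return sum(abs(a - b) for a, b in zip(text_rgb, background_rgb)) < threshold

main   = {"text": "Heading", "x": 100, "y": 50, "rgb": (0, 0, 0)}
shadow = {"text": "Heading", "x": 101, "y": 51, "rgb": (235, 235, 235)}

if is_layered_duplicate(main, shadow) and is_low_contrast(shadow["rgb"], (255, 255, 255)):
    print(main["text"])   # keep only the prominent copy in the converted output
```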
Application 402 coordinates the respective data processing steps of each of processing modules 403 to 406, and at step 703, application 402 invokes the scanning module 406 to scan the document, in its composite form if it comprises layer data 505 or if an initial reading of the character strings stored at step 702 indicates that there may be more than one possible glyph for any of the characters stored, and then stores the result in its respective buffer portion 407, and subsequently invokes the OCR processing module 405 to optically recognize characters from the buffered scanned document and temporarily store an OCR version of the document 409 in its respective buffer portion 407.
Thereafter, at step 704, the text processing module 404 performs a comparison of the text data 502, 503 extracted at step 702 with the buffered OCR version of the document output at step 703, in order to detect and correct extraction errors, as well as to convert alphanumerical data featured in document 409 in the form of image data 504 into ASCII text data 502. Examples of such alphanumerical data in the form of image data 504 include alphanumerical data in charts, graphs or in a stylized form, or data defined by layer data 505 as, or in, a graphic layer 601 as opposed to a text layer 602. At step 705, the extracted and verified text data is stored in the buffer portion 407 of document processing module 403.
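By way of illustration only, the comparison of step 704 might be sketched as follows, using a simple sequence alignment and preferring the optically recognised glyph wherever the two readings disagree; this is an assumed reconciliation strategy for the example, not the rules of the disclosure.

```python
import difflib

def reconcile(extracted, ocr):
    """Merge extracted text with an OCR pass over the rendered page (illustrative only)."""
    result = []
    for op, a0, a1, b0, b1 in difflib.SequenceMatcher(None, extracted, ocr).get_opcodes():
        if op == "equal":
            result.append(extracted[a0:a1])
        elif op in ("replace", "insert"):
            result.append(ocr[b0:b1])     # trust the glyph that was actually drawn
        # "delete": characters present only in the extraction (e.g. stray codes) are dropped
    return "".join(result)

# A glyph mis-mapped to the Greek letter nu in the extracted text, read as 'v' by OCR:
print(reconcile("con\u03bdert", "convert"))   # convert
```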
In order to facilitate the document conversion, a number of sets of instructions for document layout detection 408 are provided, for instance for the more common types of document layout that may be encountered, therefore including templates for a letter, a magazine article with two or more columns, a broadsheet article with four or more columns, and a double-sided brochure and folded variations thereof as provided in the example 604. It will be readily apparent to those skilled in the art that the above templates are provided by way of example only and are not meant to be limitative. At the next step 706, application 402 matches the verified text data with a document layout template 408. In the example, application 402 obtains from the performing of steps 701 to 705 that document 409 has an A4 format from step 701, that text data has been rotated by 90 degrees for the purposes of performing the OCR step 703, and that the disposition of text data 502 within the A4 area of document 409 defines three portions 605 to 607. In an alternative embodiment of the present invention, application 402 obtains from performing step 701 that layout data 506 defines three columns corresponding to the three portions 605 to 607.
At step 707, application 402 processes the buffered scanned document output by the scanning module at step 703 for identifying respective images 504n. A question is therefore asked at step 708, as to whether an image 504n is present in the document 409. An initial question is asked as to whether any layer data is stored above the image. If layer data is stored above the image, the image is recaptured using a 'scan' of the source document display with OCR module 405, and processed at step 709. If the question of step 708 is answered positively, application 402 processes the identified image 504n for reducing its respective storage requirement, e.g. its respective size in bytes, at the next step 709, for instance by resampling the image to a lower resolution and subsequently performing a one-pass sharpening of the re-sampled image, and buffers the processed image 504n. In one embodiment of application 402, application 402 outputs the image 504n to VDU 202 for the user to optionally increase or decrease its scale, or change its boundary relative to text or graphic objects, so that for example an adjacent text string can be included or excluded from the image.
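The resampling and one-pass sharpening of step 709 might be sketched as follows; the Pillow imaging library and the 320-pixel target width are assumptions made for this example, standing in for whatever image library and recipient profile value an implementation actually uses.

```python
from PIL import Image, ImageFilter   # Pillow is assumed here purely for illustration

def downscale_for_profile(src_path, dst_path, max_width=320):
    """Resample an extracted image to suit a recipient profile, then sharpen once."""
    with Image.open(src_path) as img:
        if img.width > max_width:
            ratio = max_width / img.width
            img = img.resize((max_width, round(img.height * ratio)), Image.LANCZOS)
        img = img.filter(ImageFilter.SHARPEN)   # single sharpening pass after resampling
        img.save(dst_path)

# downscale_for_profile("figure1.png", "figure1_mobile.png", max_width=320)
```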
In an alternative embodiment, a question is then asked as to whether there is a text string either overlaying the image, or adjacent to the image which might be used as a text alternative to assist blind users, or users accessing the document on a small screen without image display (or with image display turned off), for determining the meaning of the image. Adjacent text may be selected for inclusion by the use of one or more of a plurality of features including: textual conventions such as the inclusion of a word such as 'caption' 'image' or 'photo'; layout conventions such as the placement of a short text string in isolation and adjacent to an image; and typographical conventions such as a change in font, font style, size or weight. In this embodiment, such text alternatives will be displayed to the user by application 402 for acceptance, editing or rejection. The identified text alternative is stored in association with image 504 for output.
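A minimal sketch of the adjacent-text heuristics just described, assuming each candidate string carries a flag recording a change of font, style, size or weight relative to the body text; the marker words and the length limit are assumptions made for the example.

```python
import re

CAPTION_MARKERS = re.compile(r"\b(caption|figure|fig\.|image|photo)\b", re.IGNORECASE)

def candidate_alt_text(adjacent_strings, max_words=30):
    """Pick a short adjacent string that looks like a caption, for the user to accept or edit.

    Each candidate is assumed to be a dict like {"text": "...", "font_delta": True}.
    """
    for candidate in adjacent_strings:
        words = candidate["text"].split()
        if not words or len(words) > max_words:
            continue                                    # too long or empty: not caption-like
        if CAPTION_MARKERS.search(candidate["text"]) or candidate.get("font_delta"):
            return candidate["text"]
    return None

print(candidate_alt_text([{"text": "Photo: the converted brochure on a handset",
                           "font_delta": True}]))
```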
A question is then asked at step 710, as to whether another image 504n+1 is present in the document 409. If the question of step 710 is answered positively, control returns to step 709 for the processing of this next image as previously described, and the image processing cycle continues until all images 504n have been processed. The question of step 710 is therefore eventually answered negatively, whereby application 402 may then invoke document processing module 403 for outputting the electronic document formatted for display at step 307. Referring to the question of step 708, if the question is answered negatively and there is no image data 504n to process in document 409, application 402 likewise next invokes document processing module 403 for outputting the converted document at step 307. In an alternative embodiment, if the question of step 710 is answered positively, a further question is asked as to whether this image has already been used in the document, and if the image has been used before, then this instance is replaced by a reference to a previous image and control returns to step 709.
Therefore, regardless of the format and content of an electronic document formatted for printing 409, application 402 and data processing modules 403 to 406 thereof process document 409 into text data 502, in ASCII or one of a plurality of character sets, and down-sampled image data 504, substantially reducing the storage space required to store the visual components of the original document.
The step 307 of outputting the electronic document formatted for display, converted from the electronic document formatted for printing 409 according to the present invention, is further detailed in Figure 8. The document processing module 403 firstly processes the buffered metadata 501 of step 701 to identify document structuring data, for instance layout data 506 comprising structural tags if document 409 is formatted according to the HTML or XML structured languages, or headings if document 409 is formatted according to a file format capable of storing such information, such as for example the "doc" or "pdf" file formats. A first question is therefore asked at step 801, as to whether the buffered metadata comprises such information. If the question of step 801 is answered negatively, then at step 802 the document processing module 403 recursively processes the text data 502 output at step 705 to identify headings, sub-headings and other document subdivisions according to one or more of a plurality of: conventional document sub-dividing conventions, including for instance identifying strings of alphanumerical data 502 such as "chapter" or "part" followed by sequential numbering and the like; layout conventions such as the placement of a short text string in isolation; and typographical conventions such as a change in font, font style, size or weight. A second question is asked at step 803 further to the identifying attempt of step 802, as to whether a document structure has been identified. If the question of step 803 is answered positively, then at step 804 the document processing module 403 temporarily stores data of the identified structure in buffer 407.
If the question of step 803 is answered negatively, the document processing module further processes the text to ask if potential headings can be identified to a lower level of certainty. In an embodiment of the invention, the potential headings identified are presented to the user for acceptance, editing or rejection. Then at step 805 the document processing module 403 recursively processes the text data 502 output at step 705 to identify headings, sub-headings and other document subdivisions by selecting a first string of alphanumerical data 502 and processing the frequency and disposition of said first string throughout the document. At step 806, module 403 then assigns a respective probability of being a structural heading to the first string 502 based upon the processing of step 805 and temporarily stores said probability in respect of said first string. At step 807, a question is asked as to whether a string of alphanumerical data 502 remains to be processed in the document according to steps 805 and 806 and, if answered positively, control returns to step 805, whereby a next string of alphanumerical data 502 is selected for which a probability is obtained and stored, and so on and so forth until all probable strings 502 have been processed. The question of step 807 is therefore eventually answered negatively, and the document processing module 403 derives a probable document structure from the stored probabilities, which it temporarily stores at step 804.
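A minimal sketch of how such a probability might be scored, assuming each text block records its text, font size and weight; the individual weights are assumptions made for the example rather than values from the disclosure.

```python
import re

HEADING_WORDS = re.compile(r"^\s*(chapter|part|section|appendix)\b", re.IGNORECASE)

def heading_probability(block, body_font_size=11):
    """Assign a rough probability that a text block is a structural heading.

    A block is assumed to be a dict such as {"text": "...", "font_size": 16, "bold": True}.
    """
    score = 0.0
    text = block["text"].strip()
    if HEADING_WORDS.match(text):
        score += 0.4                                   # conventional sub-dividing words
    if len(text.split()) <= 8:
        score += 0.2                                   # short string placed in isolation
    if block.get("font_size", body_font_size) > body_font_size:
        score += 0.2                                   # typographical change: larger type
    if block.get("bold"):
        score += 0.2                                   # typographical change: heavier weight
    return min(score, 1.0)

print(heading_probability({"text": "Chapter 3 Results", "font_size": 16, "bold": True}))   # 1.0
```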
In a further step, application 402 processes the extracted text to search for print conventions which indicate a reference to another part of the document. Application 402 seeks to identify printed page numbers. In formats where such numbers are stored directly, such as 'doc' or some versions of 'pdf', the numbers are identified and stored. Where there are no stored numbers, sequential page numbers may be identified by searching for print conventions such as 'page' or 'p' with a number, or a number placed in isolation at a consistent location on the page or a mirrored location on facing pages. These are stored as page numbers and mapped by offset to the leaf number or file section representing a page. Application 402 next searches for a 'contents' page, identified by strings of data matching headings adjacent to numbers matching the page number of pages with the same heading. When a contents page is identified, all its entries are hyperlinked. Similarly, references within the text which contain clear identifiers such as 'see page n' are hyperlinked to the section with the appropriate page number.
In another example, application 402 will search for a reference such as 'see Appendix A' and it will then search for a heading matching text string 'Appendix A' and, having successfully located the heading, provide a hyperlink within the document from the text reference to the heading. Similarly, conventions such as those identifying footnotes or endnotes will be identified and hyper-linked.
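By way of illustration only, the 'see page n' convention might be turned into hyperlinks roughly as follows; the anchor scheme and the mapping from printed page numbers to element ids are assumptions made for the example.

```python
import re

def link_page_references(html, page_anchors):
    """Turn 'see page n' style references into hyperlinks (illustrative only).

    page_anchors maps a printed page number to the id of the element that starts
    that page in the converted output, e.g. {12: "page-12"}.
    """
    def repl(match):
        anchor = page_anchors.get(int(match.group(1)))
        if anchor is None:
            return match.group(0)                       # leave unresolved references alone
        return f'<a href="#{anchor}">{match.group(0)}</a>'

    return re.sub(r"see page (\d+)", repl, html, flags=re.IGNORECASE)

print(link_page_references("Details are given later (see page 12).", {12: "page-12"}))
```

References of the 'see Appendix A' kind would be handled analogously, with the target anchor resolved against the headings identified earlier rather than against page numbers.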
If the question of step 801 is answered positively, or as and when document processing module 403 stores structure data at step 804 as a result of identifying conventional headings at step 802 or deriving structural headings according to steps 805 to 807, control proceeds to step 808, at which the document processing module 403 generates a table of contents (TOC) from the structural data, then a file index at step 809, from which TOC and index application 402 can subsequently compile an electronic document formatted for display at step 810. In a preferred embodiment of the present invention, the electronic document formatted for display is a CHM file, known to those skilled in the art as a 'Help' file, which can be processed for display by any Windows OS 401, compiled from the TOC of step 808 as an HHC file, or HTML Help table of contents, and the index of step 809 as an HHK file, or HTML Help index file.
In a further embodiment, application 402 searches the text for characters which suggest a reference to an external electronic resource, such as a URL for a web page or an electronic mail address. These are deduced by the presence of text such as www.name or name.com, or [email protected], for example. Where such text strings are detected, application 402 first checks the string for characters, such as white space or line feeds, which should not be present. It then, where an appropriate network connection is available, tests the link to the resource and, where valid, provides an active hyperlink.
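A minimal sketch of this detection and reachability test follows; the regular expressions, the trailing-punctuation handling, the added 'http://' scheme and the timeout are heuristic assumptions, and e-mail verification is omitted.

```python
import re
import urllib.error
import urllib.request

WEB = re.compile(r'(?:https?://|www\.)[^\s<>"]+', re.IGNORECASE)
MAIL = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')

def candidate_resources(text):
    """Return ('url' | 'mailto', cleaned string) pairs found in extracted text."""
    found = []
    for match in WEB.finditer(text):
        url = match.group(0).rstrip('.,;:)')      # printed sentences often end with '.'
        if not url.lower().startswith('http'):
            url = 'http://' + url                 # www.name given without a scheme
        found.append(('url', url))
    for match in MAIL.finditer(text):
        found.append(('mailto', match.group(0)))
    return found

def resource_reachable(url, timeout=5):
    """Crude check that a web resource loads before an active hyperlink is written."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.getcode() < 400
    except (urllib.error.URLError, ValueError):
        return False
```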
In one embodiment of application 402, the user is presented with a prompt which allows them to test all links and send test e-mails to all addresses detected in the document. Where the resource has been tested successfully, either by the successful loading of a web resource, or by the receipt of a confirmation e-mail, or indeed by the e-mails not having been returned as undeliverable, the resource will be hyperlinked.

In another embodiment of the invention, the terminal 104 is itself a server, connected to a network such as the Internet or an intranet, which a number of users at remote locations can use to convert documents according to the present invention. The remote users can initiate document conversion by such means as transferring an electronic copy of the document to be converted to a specific location, either directly or by use of an automated system for transferring documents such as an ECM (Electronic Content Management) system or other document management system. Equally, however, the remote user could initiate the conversion process by such means as dragging an icon representing the document onto an icon representing application 402, or by using a command within an application such as Microsoft Word and choosing the conversion process; the remote user is thus able to control the conversion process by means such as pre-set commands, commands loaded into an editable *.ini file, or input supplied through an XML interface. In this embodiment, the application is based on the server 104 and pre-set to convert all documents delivered to one or more specific locations on a computer network. Alternatively, it can be pre-set to convert a document in response to user input, either directly or indirectly through an application such as an ECM system or by using a command from software on a remote computer.
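The server-side, drop-folder variant could be sketched as a simple polling loop such as the one below; the folder names, polling interval and the 'convert' callable (standing in for an entry point into application 402) are hypothetical.

```python
import time
from pathlib import Path

def watch_and_convert(inbox, outbox, convert, poll_seconds=10):
    """Poll a drop folder and convert every document delivered to it, in the
    spirit of the server-based embodiment above.  This sketch assumes a
    'convert(source, target)' callable supplied by the caller."""
    inbox, outbox = Path(inbox), Path(outbox)
    seen = set()
    while True:
        for source in inbox.glob('*.*'):
            if source.name in seen:
                continue
            target = outbox / (source.stem + '.html')
            convert(source, target)          # hypothetical conversion entry point
            seen.add(source.name)
        time.sleep(poll_seconds)
```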
In an alternative embodiment of the present invention, the application 402 further configures terminal 104 to extract and analyse the contents of document 409, in order to deduce a coherent read order and output the contents in correctly ordered blocks.
The application 402 achieves this functionality by firstly creating (9A) an electronic representation 901 of a page image. The electronic representation is then analysed to discover (9B) white spaces 902 which cut across the page. These cuts 902, which run parallel to the X axis 903, are designated as Y cuts. The electronic representation is then analysed to discover (9C) white spaces 904 which cut along the page. These cuts 904, which run parallel to the Y axis 905, are designated as X cuts. White spaces which do not intersect are discarded from the analysis process. These cuts 902, 904 are then used to subdivide (9D) the page 901 into the largest possible blocks 906. X cuts are given precedence over Y cuts to ensure that the read order of documents with multiple columns is respected.
As each largest possible block 906A, 906B, 906n is defined, the application 402 reiterates the discovering steps 9A to 9D through the block 906 to define and order the largest possible sub-block 907. As each largest possible sub-block 907 is defined, the application 402 reiterates the discovering steps 9A to 9D through the sub-block 907 to define and order the largest possible sub-sub-block 908. The application continues to reiterate the discovering steps 9A to 9D until the smallest possible blocks 908 are discovered, which are then ordered (9E) top to bottom. The application then proceeds to the next block 907, past an X cut with the highest Y value, and continues the sorting procedure 9A to 9E.
Having found and ordered the component sub-blocks 907, 908 of a block 906, the application 402 then reiterates the discovering steps 9A to 9E to find the next neighbouring block 906B of that block 906A at the higher block level. The component sub-blocks 907, 908 are then ordered. The process is continued until all component blocks are placed in a reading order. The text is then output at step 9F, in one example as flowing text marked up as HTML.
It will be clear to anyone skilled in the art that the same process can be used to output other document formats and to analyse pages where the language reading order is other than left to right and top to bottom by changing the order in which X and Y cuts are made and/or the blocks are analysed and ordered.
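A projection-profile approximation of the recursive cutting procedure 9A to 9E, assuming each text fragment has already been extracted with a bounding box, might read as follows. The coordinate convention (origin top-left, y growing downwards), the minimum gap width and the tuple layout are assumptions, and the handling of non-intersecting white spaces is simplified.

```python
def xy_cut(boxes, min_gap=10):
    """Recursively order text boxes (x0, y0, x1, y1, payload).  X cuts (vertical
    white space) take precedence over Y cuts (horizontal white space), so a
    column is read in full before the next column, as described above."""
    if len(boxes) <= 1:
        return list(boxes)

    def gaps(intervals):
        """Midpoints of white-space gaps wider than min_gap between sorted intervals."""
        intervals = sorted(intervals)
        out, reach = [], intervals[0][1]
        for start, end in intervals[1:]:
            if start - reach > min_gap:
                out.append((reach + start) / 2)
            reach = max(reach, end)
        return out

    # X cuts first: vertical white space running down the page.
    x_cuts = gaps([(b[0], b[2]) for b in boxes])
    if x_cuts:
        cut = x_cuts[0]
        left = [b for b in boxes if b[2] <= cut]
        right = [b for b in boxes if b[2] > cut]
        return xy_cut(left, min_gap) + xy_cut(right, min_gap)

    # Then Y cuts: horizontal white space running across the page.
    y_cuts = gaps([(b[1], b[3]) for b in boxes])
    if y_cuts:
        cut = y_cuts[0]
        top = [b for b in boxes if b[3] <= cut]
        bottom = [b for b in boxes if b[3] > cut]
        return xy_cut(top, min_gap) + xy_cut(bottom, min_gap)

    # No cuts left: smallest blocks, ordered top to bottom.
    return sorted(boxes, key=lambda b: (b[1], b[0]))

# Example: two columns, each with two paragraphs.
# page = [(0, 0, 40, 20, 'A1'), (0, 40, 40, 60, 'A2'),
#         (60, 0, 100, 20, 'B1'), (60, 40, 100, 60, 'B2')]
# [b[4] for b in xy_cut(page)] -> ['A1', 'A2', 'B1', 'B2']
```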
In an alternative embodiment of the present invention, the application 402 further configures terminal 104 to extract and analyse the contents of document 409, in order to deduce any coherent table formatting and to output the table data content in correctly formed tables.
The application 402 achieves this functionality by firstly creating (10A) an electronic representation 1001 of a page image. The electronic representation is then analysed to discover (10B) blocks of text 1002 that are table candidates. Blocks 1002 are analysed based on their spacing from surrounding blocks. In this alternative embodiment, line art such as cell boundaries may be treated as white space and discarded or, alternatively, line art which forms a boundary between rows is treated as a special row and line art which forms a boundary between columns is treated as a special column. It is not however a requirement of the invention that line art be present in order to detect and process a table. In an advantageous form of this alternative embodiment, a minimum of two blocks across and three blocks down is used as a rule to define if text blocks 1002 are table candidates.
Each of the text blocks 1002 is then analysed (10C) to determine its coordinates. The vertical coordinates are defined as top, centre and bottom. The horizontal coordinates are defined as left, centre and right. Each block 1002 in the candidate table is then compared (10D) to its neighbours at broadly the same vertical position in the source document. When two or more blocks 1002A, 1002B share the same positional reference, that is their top, centre or bottom have substantially the same vertical position on the page, these blocks 1002A, 1002B become a candidate row 1003A for the candidate table. The application 402 is designed to take account of common atypical table designs, such as a table featuring some rows and columns with different respective vertical and/or horizontal alignment. This is a technique sometimes used to confer visual emphasis on column or row header cells, though it can have other functions. Therefore, allowance is made for one or more blocks of text having inconsistent alignment. The next set of blocks 1002C, 1002D at a lower vertical position is then tested (10E) in the same manner to determine if there is another candidate row 1003B. The analysis is repeated until all blocks of text 1002 have been analysed to determine if they belong to candidate rows 1003.
The application then compares (10F) blocks 1002 with their neighbours at substantially the same horizontal position in the source document. When two or more blocks 1002A, 1002C share the same positional reference, that is their left, centre or right have substantially the same horizontal position on the page, these blocks 1002A, 1002C become a candidate column 1004A for the candidate table. The next set of blocks 1002B, 1002D at a horizontal position to the right is then tested (10G) in the same manner to determine if there is another candidate column 1004B. The analysis is repeated until all blocks of text 1002 have been analysed to determine if they belong to candidate columns 1004.
Any blocks 1002 which do not belong to either candidate rows 1003 or candidate columns 1004 are discarded. If the candidate rows 1003 and columns 1004 respectively number fewer than a predefined number of rows and columns, for instance two rows and two columns, then the candidate table is discarded (10H). Otherwise, the candidate rows and columns are declared (10I) as an actual table and the table data is stored in table cells for output as a structured table.
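For illustration only, the alignment-based grouping of steps 10C to 10I could be sketched as below; the tolerance value, the 2x2 minimum (the text also mentions two blocks across and three down as a candidate rule) and the cell-indexing scheme are assumptions.

```python
def detect_table(blocks, tol=3):
    """Group text blocks (x0, y0, x1, y1, text) into candidate rows and columns
    by shared alignment, returning cell text keyed by (row, column), or None
    if the candidate table is discarded."""
    def refs(block):
        x0, y0, x1, y1, _ = block
        return {'left': x0, 'hcentre': (x0 + x1) / 2, 'right': x1,
                'top': y0, 'vcentre': (y0 + y1) / 2, 'bottom': y1}

    feats = [refs(b) for b in blocks]

    def aligned(i, j, keys):
        return any(abs(feats[i][k] - feats[j][k]) <= tol for k in keys)

    def group(keys, order_key):
        groups = []
        for i in sorted(range(len(blocks)), key=lambda i: feats[i][order_key]):
            for g in groups:
                if aligned(i, g[0], keys):
                    g.append(i)
                    break
            else:
                groups.append([i])
        return [g for g in groups if len(g) >= 2]

    rows = group(('top', 'vcentre', 'bottom'), 'top')        # candidate rows 1003
    cols = group(('left', 'hcentre', 'right'), 'left')       # candidate columns 1004
    if len(rows) < 2 or len(cols) < 2:
        return None                                           # discarded (10H)
    cells = {}                                                # declared (10I)
    for r, row in enumerate(rows):
        for c, col in enumerate(cols):
            for i in set(row) & set(col):
                cells[(r, c)] = blocks[i][4]
    return cells
```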
The application 402 may optionally analyse any grouping of blocks 1002 which share an inconsistent vertical or horizontal alignment to define column or row headers. The rows and columns may also be further analysed to detect if a block could better be described as spanning two or more rows or columns. For example, when there is a row heading in every second row of a table, such row headings may be better described as spanning the intervening rows. A textual analysis as previously described in relation to Figure 7 may also be carried out to further refine the processing of table data. For example, numerical data may be separated from textual data for output to a file type, which supports different cell data types, such as the Microsoft Excel "xls" file format of the Microsoft Corporation.
In a particularly advantageous embodiment of the present invention, the application 402 includes a user interface 1101, which facilitates the insertion of accessibility features into documents by an unskilled user. The interface preferably includes multiple panes sharing a common look and feel, giving the user control over the respective functions of modules 403, 404, 405 and 406. The number and type of panes presented to the user are user-controllable, so that they may be limited to those relevant to the user in the sequence of tasks for converting a document as previously described. Whilst user-configurable, the interface 1101 is nevertheless configured to guide the user through the modular, overall sequence of tasks, with appropriate prompting and advice provided at each step of the sequence.
In the interface, those features of print display which apply to print only, such as large page margins, are discarded. Those features of print display which are page specific, such as the use of specific fonts and type sizes and multiple columns, are replaced by equivalents better suited for display on multiple devices. For example, specific fonts and type sizes are replaced by a CSS (cascading style sheet) reference to font families with relatively sized fonts, and multiple text columns are replaced by a single text column.
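One possible translation of a fixed print font specification into a relative CSS declaration is sketched below; the generic fallback families, the reference body size and the choice of 'em' units are illustrative assumptions rather than features of the embodiment.

```python
def css_for_print_font(family, point_size, body_point_size=11.0):
    """Translate a fixed print font into a relative CSS declaration,
    replacing an absolute point size with a size relative to body text."""
    sans = ('arial', 'helvetica', 'verdana', 'tahoma')
    generic = 'sans-serif' if family.lower().startswith(sans) else 'serif'
    return (f"font-family: '{family}', {generic}; "
            f"font-size: {point_size / body_point_size:.2f}em;")

# css_for_print_font('Arial', 22) -> "font-family: 'Arial', sans-serif; font-size: 2.00em;"
```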
In the interface, the review of documents during conversion according to the present invention is facilitated by a double-pane view 1102, 1103 of the document, which displays the original document 409 side by side with a preview of the converted document 1104.
The pane 1102 displaying the original document may display a plurality of file types, including Microsoft Word "doc" and Adobe Acrobat "pdf" files, in a single display version. This configuration advantageously eliminates the requirement to install multiple file viewers on terminal 104, and thereby intrinsically limits the storage requirements, as well as the amount of training required by the user for interacting with the said plurality of file viewers.
The interface 1101 advantageously adds page break markers to the converted document 1104 previewed in pane 1103. The interface can be set in such a way that when a user of terminal 104 chooses to view a new page in the original document 409, the pane 1103 automatically scrolls the representation of the converted document 1104 to the same point in the text. This configuration is particularly advantageous when the removal of print-specific document features, such as large margins, fixed type sizes and multiple columns, results in a marked difference in page length between the original and output documents.
The interface 1101 further provides a means by which users unskilled in the art of ensuring that documents comply with accessibility standards can ensure document accessibility. The application 402 performs an automated accessibility test upon the document being converted, and configures interface 1101 to output messages about accessibility issues which require user input for decision and/or correction. In a particularly advantageous embodiment of the interface 1101, the document 409 displayed in pane 1102 is replaced with a list of prompts on accessibility issues which require user input for decision and/or correction, and which the user may consult side-by-side with the converted document preview. The interface 1101 is further configured to provide a tool tip 1105 with guidance on how to solve the issue in respect of each prompt, and to open a pane 1106 with controls for providing relevant input. For example, when an image requires a text alternative, the application 402 configures the interface 1101 to indicate to the user that this text alternative is required both for blind readers and for those accessing the document on displays which do not support images, and the indication further comprises a text input control box for adding a text alternative to the image.
The interface 1101 is moreover configured with text formatting controls by means of which users unskilled in the arts of HTML coding or CSS coding can provide further input to the automatically generated HTML markup and linked cascading style sheets. In a particularly advantageous embodiment of the interface 1101, the user is presented with text formatting controls which are familiar to users of standard word processing applications such as Microsoft Word or OpenOffice, and which are linked to both HTML and CSS editing functions, allowing the user to reformat a block of text by one or a combination of: changing the HTML selector used to mark it up, changing the CSS properties of all instances of that selector, or changing an instance of the selector to be a special case marked in both the HTML and CSS as a span, class or id. Likewise, the user may be presented with an id and class editing control linked to both HTML and CSS, which allows the addition or deletion of an id, class or an instance of a class, or its renaming to be more meaningful in the semantic context of the document.
With reference to the description of Figure 8 hereinabove, the interface 1101 is moreover configured with table of contents editing controls. In a particularly advantageous embodiment of the interface 1101, the document 409 displayed in pane 1102 is replaced with a table of contents view, which the user may consult side-by-side with the converted document preview for ease of reference. The table of contents can be edited by the user to change the text of a link. The user can also add or delete links from the table of contents as required. Having done so, the user can choose to output the document as one or more of: a single HTML file with a linked table of contents, multiple HTML files comprising one for each page of the original document with a linked table of contents, or multiple HTML files comprising one for each section of the original document with a linked table of contents.
An important feature of interface 1101 is the inclusion of a save control for application 402 to store work in progress as a single container file, with the maximum usable information contained therein. The container file can then be reopened and editing continued at the user's convenience. In a particularly advantageous embodiment of this feature, the container file can be of any readily available type, such as a 7-Zip, WinRAR or WinZip archive, or a proprietary archive which uses a format such as XML. Moreover, in addition to storing images as rendered in the original document, the source images are also stored prior to the application of any transformation matrix for display in the source file format. All transformation matrices, whether extracted from the original document or created during editing, or a combination of both, are stored. This allows the user to have access to the highest possible image quality at any time in the editing or output process. Other metadata related to the original and edited document, including Dublin Core metadata, user settings and other application status information, is also stored. Advantageously, a further embodiment of interface 1101 is configured to store unique user settings, including data learned by application 402 during the conversion of multiple documents, in a server-client environment.
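A minimal sketch of such a container file, here written as an ordinary zip archive, is shown below; the internal directory layout, file names and the use of JSON for transformation matrices and metadata are assumptions introduced purely for illustration.

```python
import json
import zipfile

def save_work_in_progress(path, html, css, images, matrices, metadata):
    """Store conversion state in a single container file: the edited markup,
    both source and rendered versions of each image, all transformation
    matrices and the document metadata (including Dublin Core fields)."""
    with zipfile.ZipFile(path, 'w', zipfile.ZIP_DEFLATED) as archive:
        archive.writestr('document/index.html', html)
        archive.writestr('document/style.css', css)
        for name, (source_bytes, rendered_bytes) in images.items():
            archive.writestr(f'images/source/{name}', source_bytes)      # pre-transform
            archive.writestr(f'images/rendered/{name}', rendered_bytes)  # as laid out
        archive.writestr('transforms.json', json.dumps(matrices))
        archive.writestr('metadata.json', json.dumps(metadata))
```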

Claims

1. An apparatus for converting an electronic document formatted for printing into an electronic document formatted for display, the apparatus comprising processing means, memory means and data input means, the memory means storing instructions, a data structure defining at least one electronic document formatted for printing and having a layout, and at least one document layout template, the instructions configuring the processing means to: compare the document layout defined by the data structure against the at least one document layout template; map alphanumerical data in the data structure to corresponding ASCII character data; optically recognize alphanumerical data in the data structure; compare the optically-recognized alphanumerical data against the mapped alphanumerical data; identify image data in the data structure; rescale the image data if the image data exceeds an image data parameter; and output the electronic document including the optionally rescaled image data and compared ASCII character data, wherein the output electronic document is formatted for display according to the document layout template.
2. An apparatus according to claim 1, wherein the memory means further stores a plurality of recipient terminal profiles and the instructions further configure the processing means to read a selection of a recipient terminal profile by a user and to format the output electronic document for display according to the selected recipient terminal profile.
3. An apparatus according to claim 2, wherein the plurality of terminal profiles includes a mobile telephone handset profile, a personal digital assistant profile and a personal computer profile.
4. An apparatus according to claim 1, wherein the instructions comprise a plurality of functional modules, including a document processing module, a text data processing module, an optical character recognition module and an image data scanning module.
5. An apparatus according to claim 1, wherein image data comprises picture screen elements and attributes such as an image size in picture screen elements, an image resolution in dots per inch, and wherein every picture screen element is defined by respective red, green, blue and optionally alpha values.
6. An apparatus according to claim 1, wherein alphanumerical data includes any of ASCII, ANSI, ISO, 16-bit UNICODE, UTF-8 text data.
7. An apparatus according to claim 1, wherein the data structure includes metadata and a layout, the layout including layout data and layer data.
8. An apparatus according to claim 7, wherein the layer data defines which alphanumerical data and image data of the document formatted for printing belong to which layer of the document formatted for printing.
9. An apparatus according to claim 7, wherein the layout data defines where alphanumerical data and image data of the document formatted for printing are physically disposed within the document area.
10. An apparatus according to claim 1, wherein the output document is a compiled HTML help (chm) file.
11. A method for converting an electronic document formatted for printing into an electronic document formatted for display, the method comprising the steps of: comparing a document layout defined by the data structure of the electronic document against at least one document layout template; mapping alphanumerical data in the data structure to corresponding ASCII character data; optically recognizing alphanumerical data in the data structure; comparing the optically-recognized alphanumerical data against the mapped alphanumerical data; identifying image data in the data structure and rescaling the image data if the image data exceeds an image data parameter; and outputting the electronic document including the optionally rescaled image data and compared ASCII character data, wherein the output electronic document is formatted for display according to the document layout template.
12. A method according to claim 11, comprising the further steps of storing a plurality of recipient terminal profiles, reading a selection of a recipient terminal profile by a user and formatting the output electronic document for display according to the selected recipient terminal profile.
13. A method according to claim 11, comprising the further steps of scanning and buffering the electronic document formatted for printing and components thereof.
14. A method according to claim 11, comprising the further step of identifying print convention data such as document page numbers, document headers, document cross references and the like in the electronic document formatted for printing.
15. A method according to claim 14, comprising the further step of generating hyperlinks from the identified print convention data.
16. A method according to claim 11, comprising the further step of identifying network resource convention data such as uniform resource locators, email addresses and the like in the electronic document formatted for printing.
17. A method according to claim 16, comprising the further step of generating hyperlinks from the identified network resource convention data.
18. A method according to claim 11, wherein the data structure comprises a layout including layout data and layer data, the layer data defining which alphanumerical data and image data of the document formatted for printing belong to which layer of the document formatted for printing and the layout data defining where alphanumerical data and image data of the document formatted for printing are physically disposed within the document area.
19. A method according to claim 11, wherein the step of outputting further comprises the step of generating a table of contents and a file index.
20. A method according to claim 19, wherein the output document is a compiled HTML help (chm) file.
21. A method for converting an electronic document formatted for printing into an electronic document formatted for display, the method comprising the steps of: mapping alphanumerical data in a data structure of the electronic document to corresponding ASCII character data; optically recognizing alphanumerical data in the data structure; comparing the optically-recognized alphanumerical data against the mapped alphanumerical data; identifying image data in the data structure; identifying sections of the electronic document content where the intended meaning is conveyed by the use of layered text or text and image data and processing this layered data according to rules to ensure that the intended meaning is preserved in the converted document; identifying additional text included for decorative effect and deleting it or replacing it with appropriate markup; rescaling the image data if the image data exceeds an image data parameter; and outputting the electronic document including the optionally rescaled image data, processed layered data and compared ASCII character data.
22. A method of detecting the reading order of an electronic document formatted for printing, the method comprising the steps of: identifying horizontal spaces in the electronic document; identifying vertical spaces in the electronic document; for each combination of intersecting horizontal spaces and vertical spaces, declaring a document block; ordering the document blocks; and outputting the electronic document reading order according to the document block order.
23. A method of detecting tabulated data in an electronic document formatted for printing, the method comprising the steps of: identifying blocks of alphanumerical data in the electronic document; mapping alphanumerical data in a data structure of the electronic document to corresponding ASCII character data; determining respective horizontal and vertical positional data of the identified blocks; comparing the respective horizontal positional data of at least two neighbouring identified blocks for detecting a candidate column; comparing the respective vertical positional data of at least two neighbouring identified blocks for detecting a candidate row; and upon detecting at least a predetermined number of candidate rows and columns, storing the corresponding ASCII character data of the identified blocks as a table.
24. A user interface for converting an electronic document formatted for printing into an electronic document formatted for display, the user interface comprising multiple user-configurable panes for guiding a user through a modular sequence of steps according to the document converting method of any of claims 11 to 23.
25. The user interface of claim 24, wherein the multiple user-configurable panes comprise at least one pane for representing the electronic document formatted for printing and another pane for representing a converted version of the electronic document formatted for printing.
26. The user interface of claim 25, wherein the said another pane is configured to automatically represent the same point in the converted document as the electronic document formatted for printing shown in the at least one pane.
27. The user interface of any of claims 24 to 26, further comprising means for ensuring that converted documents comply with accessibility standards.
28. A system substantially as herein described in relation to and in association with the accompanying drawings.
PCT/IE2007/000030 2006-05-05 2007-03-06 Electronic document reformatting WO2007129288A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IE20060361A IES20060361A2 (en) 2006-05-05 2006-05-05 Electronic document conversion
IES2006/0361 2006-05-05

Publications (2)

Publication Number Publication Date
WO2007129288A2 true WO2007129288A2 (en) 2007-11-15
WO2007129288A3 WO2007129288A3 (en) 2008-05-29

Family

ID=38573300

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IE2007/000030 WO2007129288A2 (en) 2006-05-05 2007-03-06 Electronic document reformatting

Country Status (2)

Country Link
IE (1) IES20060361A2 (en)
WO (1) WO2007129288A2 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1343095A2 (en) * 2002-03-01 2003-09-10 Xerox Corporation Method and system for document image layout deconstruction and redisplay
US20050193327A1 (en) * 2004-02-27 2005-09-01 Hui Chao Method for determining logical components of a document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHAO ET AL.: "PDF Document Study with Page Elements and Bounding Boxes" DOCUMENT LAYOUT INTERPRETATION AND ITS APPLICATIONS (DLIA2001), 9 September 2001 (2001-09-09), pages 1-3, XP002249458 Seattle, WA, US *
LOVEGROVE W S ET AL: "Document analysis of PDF files: methods, results and implications" ELECTRONIC PUBLISHING, WILEY, CHICHESTER, GB, vol. 82, no. 2-3, June 1995 (1995-06), pages 207-220, XP002357644 ISSN: 0894-3982 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101883248A (en) * 2009-05-04 2010-11-10 沈阳爱国者网络科技有限公司 Method for acquiring video file from network
HRP20130700B1 (en) * 2013-07-23 2016-03-11 Microblink D.O.O. System for adaptive detection and extraction of structures from machine-generated documents
CN109635729A (en) * 2018-12-12 2019-04-16 厦门商集网络科技有限责任公司 A kind of Table recognition method and terminal
WO2023028587A1 (en) * 2021-08-27 2023-03-02 Rock Cube Holdings LLC Systems and methods for structure-based automated hyperlinking
US11954422B2 (en) 2021-08-27 2024-04-09 Rock Cube Holdings LLC Systems and methods for structure-based automated hyperlinking

Also Published As

Publication number Publication date
WO2007129288A3 (en) 2008-05-29
IES20060361A2 (en) 2007-10-31

Similar Documents

Publication Publication Date Title
US6336124B1 (en) Conversion data representing a document to other formats for manipulation and display
US8732570B2 (en) Non-symbolic data system for the automated completion of forms
CN1784664B (en) Document data display device, output device, printing device and related method
JP5209124B2 (en) Send and receive electronic business cards
CN102117269B (en) Apparatus and method for digitizing documents
US20090110288A1 (en) Document processing apparatus and document processing method
US6850228B1 (en) Universal file format for digital rich ink data
JP2006350867A (en) Document processing device, method, program, and information storage medium
US20040202352A1 (en) Enhanced readability with flowed bitmaps
US20150363381A1 (en) Template management apparatus, non-transitory computer readable medium, and template management method
CN101008940B (en) Method and device for automatic processing font missing
CN105824788A (en) Method and system for converting PowerPoint file into word file
EP2972991A2 (en) Detection and reconstruction of right-to-left text direction, ligatures and diacritics in a fixed format document
JPH08147446A (en) Electronic filing device
WO2007129288A2 (en) Electronic document reformatting
JP2008129793A (en) Document processing system, apparatus and method, and recording medium with program recorded thereon
US20080046835A1 (en) Object-oriented processing of tab text
US20060095838A1 (en) Object-oriented processing of tab text
JP2008310816A (en) System and method for compact representation of multiple markup page data
JP2006260304A (en) Information management device, server, image forming device, information display processor, information management method, information display processing method, information management program, information display processing program, and computer-readable recording medium
JP7027757B2 (en) Information processing equipment and information processing programs
CN108345577A (en) Information processing equipment and method
JP2006309443A (en) Information processing system, information processor, information processing terminal, information processing method, program for executing the method in computer, and storage medium
US20210182477A1 (en) Information processing apparatus and non-transitory computer readable medium storing program
CN101233494A (en) Plug-in module execution method, browser execution method, mailer execution method, program, terminal device, and computer-readable recording medium containing page data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07713249

Country of ref document: EP

Kind code of ref document: A2

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07713249

Country of ref document: EP

Kind code of ref document: A2