WO2013008103A1 - Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document - Google Patents

Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document

Info

Publication number
WO2013008103A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
regions
region
images
recognizing
Prior art date
Application number
PCT/IB2012/050288
Other languages
French (fr)
Inventor
Nicola BARBUTI
Tommaso CALDAROLA
Original Assignee
Dabimus Srl
Priority date
Filing date
Publication date
Application filed by Dabimus Srl
Publication of WO2013008103A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

Method for recognizing the text provided in a set of digital images, each showing a page of an ancient document, comprising the following macro-steps: a) individuating and connecting in sequence the regions (Rp) containing words in a sub-group (I) of said images; b) structuring a thesaurus of the font characters used in said regions containing words; c) carrying out character recognition on one or more images belonging to the set, and associating a first value of efficacy with the result of such recognition.

Description

METHOD AND APPARATUS FOR RECOGNIZING TEXTS IN DIGITAL PICTURES REPRODUCING PAGES OF AN ANTIQUE DOCUMENT
DESCRIPTION
The current digital libraries containing digital collections of handwritten manuscripts up to 1900 and of ancient and valuable printed documents up to 1850 are still not fully usable. For these specific digital contents, in fact, it has not yet been possible to provide optical and/or intelligent recognition systems for the texts contained in the virtual pages that can guarantee efficient indexation of the data bank contents, both of those already available on the Internet and of those in progress. In fact, none of the most recent digital library projects currently available on the web 2.0 (Europeana, World Digital Library, The European Library, etc.) offers accessibility and usability levels that allow users to consult the textual content of the reproduced digital objects without leafing through them in full. Apart from the classic catalogue-type searches (author, title, bibliography, etc.), in these data banks it is not possible to develop an indexation that allows detailed studies based on the analysis of word occurrences, interference between different texts, and so on.
Such difficulty comes from the very nature of the manuscripts considered. The complexity and diversity of the handwriting, even of the more linear and regular palaeographic hands, and the different kinds of ink used, together with the age of the supports, mostly damaged by the accidents of time and by human negligence, have frustrated all efforts to go beyond the simple digital reproduction of such manuscripts.
Not even the OCR or ICR systems currently available on the market can be applied to solve the problem of text recognition in ancient documents (e.g.: EP0649113A2 - Multifont optical character recognition using a box connectivity approach; EP0369761A2 - Universal character section for multifont).
If this situation seems almost obvious for manuscripts, one would not expect the same for printed texts. Yet, even for this kind of document, and in particular for texts produced by movable type printing (the whole typographic production from 1456 to the early 1800s, four centuries of printing), the situation is entirely similar to that of manuscripts. The problems are in fact always the same, even if less severe: the techniques used to produce the printing matrix, the kinds of ink used, the alignment of the punches within the words and of the words in the line, the difference between the graphic symbols representing some letters and those commonly used today (for example, for ages the "s" was represented by a punch very similar to the "f"), different linguistic rules, and image noisiness of various kinds (refraction of the printing on the page surface, blurred and broken punches, spots of ink and stains of different origin due to time and handling) are all factors which, so far, have frustrated all efforts to index the contents of digitalized volumes by applying OCR and ICR systems. Even the experimental and very expensive Google Books project has run up against the inadequacy of the OCR systems used when applied to volumes dated before the second half of the 19th century.
According to the applicant, the reason why the applications provided so far for optical and/or intelligent recognition of digital images do not work on ancient documents is the wrong methodological approach used in structuring such systems.
The approach used so far is to map the words on the images scanned from the document pages, associating them with an electronic text that has to be entered manually by an operator. This method aims at reconstructing the document text rather than recognizing it. It is obvious, however, that this approach strongly limits the possibility of electronically indexing a large number of historical texts, because it entails considerable costs due to the largely manual work carried out by the operator.
While this method can be useful for some kinds of chronologically more recent works (from the second half of the 19th century onwards), it is not a solution to the problem of making accessible to scholars, and to mankind as a whole, the huge mass of older works, both "major" and "minor", kept in the many historical libraries around the world, a heritage of information and history that is almost completely unused yet very valuable.
There are also products which, although they use more advanced methods, still base recognition on dividing the text regions into words, for example the product "A2iA's Proprietary IWR, Intelligent Word Recognition" by A2iA (http://www.a2ia.com). These systems, however, can work only if interfaced with specific semantic thesauruses structured before the indexation step; otherwise they cannot carry out any text recognition.
The recognition system provided by DABIMUS starts from a new methodological approach geared to the distinctive features and noisiness of ancient manuscripts. The method is no longer aimed at the regions/words of the text (regions containing a word), but at the typographic regions/characters (regions containing a character), associating with each of them the corresponding character extracted from the text, which has to be entered manually in electronic form. The idea is therefore to individuate and recognize the typographic characters used in the ancient book or manuscript. In this way, it is only necessary to transcribe manually a portion of content in which the typographer used the whole set of punches employed to print the volume, and which normally corresponds to a rather limited number of pages of text (up to a maximum of 20). Obviously, the more text is transcribed and connected to the extracted regions, the more precise and correct the reproduced text is.
After the typographic characters have been thesaurized, recognition can be carried out on the whole digital volume with a very high level of precision in the reproduction. Further manual adjustments can be used to correct the unavoidable residual noisiness.
The ICR system thus provided by DABIMUS greatly improves the accessibility of documents. It allows two usability levels, which can also be applied at the same time. The first allows the user to carry out searches inside the document without the need for content indexation; however, this method takes time, since segmentation is carried out concurrently with the search step, so it is efficient for the user only on small documents.
The other level, instead, provides for launching the application in batch mode on the whole document before it is entered in the data bank, followed by indexation of the recognized text content, so that once the document is in the data bank the user can carry out any search by keywords with immediate retrieval. The object of the present invention is therefore to provide a method and an apparatus for recognizing the text contained in a set of digital images showing pages of an ancient document, which solve the above-described problems of the related art. The present invention achieves this aim by a method according to claim 1, an apparatus according to claim 8 and a computer programme which, when loaded in the memory of a computer and executed by it, carries out the method of the present invention.
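The second usability level described above amounts to recognising the text of every page in batch and indexing it before the document enters the data bank. The following Python fragment is only a minimal sketch of such an indexation step, assuming a simple inverted index and whitespace tokenisation; neither the data structure nor the example page texts come from the patent.

    from collections import defaultdict

    def build_index(recognized_pages):
        """Map every word to the set of page numbers on which it occurs."""
        index = defaultdict(set)
        for page_number, text in recognized_pages.items():
            for word in text.lower().split():
                index[word].add(page_number)
        return index

    def search(index, keyword):
        """Return the pages containing the keyword, sorted for display."""
        return sorted(index.get(keyword.lower(), set()))

    # Hypothetical example: page numbers mapped to text recognised in macro-step (c).
    pages = {4: "de claris mulieribus", 5: "claris exemplis ornata"}
    index = build_index(pages)
    print(search(index, "claris"))   # -> [4, 5]

Once such an index is stored in the data bank together with the document, a keyword query reduces to a dictionary lookup, which is why retrieval is immediate at this level.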
The elements and features of the present invention, together with their specific advantages, are disclosed in the following description and represented graphically, by way of non-limiting example, in the appended drawings:
- figure 1 represents the flowchart of the method object of the invention;
- figure 2 represents an apparatus according to an embodiment of the present invention;
- figure 3 represents pages of ancient documents where it is possible to recognize the text according to the present invention.
Throughout the description of the present invention, the following terms are used with the meanings given below:
Ancient document: any record which represents evidence from the past within conventionally fixed chronological limits (for example, in Italy: incunabulum: printed book from 1456 to 1500; ancient book: from 1501 to 1830; modern book: from 1831 to date), regardless of the support on which it is recorded.
Digital image: numerical/digital representation, obtained by means of a process called digitalization, of any physical object.
Text provided in the image: portion of the image which represents the text and/or graphic content of the digitalized object (page, manuscript document, etc.).
Font character: element which is part of a group of typographic characters, characterized by and sharing the same graphic style or intended to carry out a certain function (in ancient typography: movable type characters).
OCR/ICR: OCR: programmes for converting images containing text into digital text modifiable by a normal editor; ICR: programmes for intelligent character recognition, converting a digital image containing text into modifiable text.
Disturbance/noisiness: unwanted signals, of natural or artificial origin, which are not part of the original physical object and which overlap the information transmitted and processed in a system.
Therefore, with reference to the appended drawings, an apparatus (1) is represented for recognizing the text (2) provided in a set of digital images (3), each one showing a page (4) of an ancient document (5).
Such an apparatus (1) comprises:
- a planetary scanner (8) for acquiring said images (3)
- an electronic computer (6) provided with:
- means (7a) for starting a first step (a) in which the regions (Rpi), each containing a word, are individuated and connected in sequence in each image (Ii) belonging to a sub-group (I) of said images (3)
- means (7b) for starting a second step (b) in which a thesaurus of the characters of the fonts used in said regions (Rpi) is structured
- means (7c) for starting a third step (c) in which the recognition of characters is carried out on one or more images belonging to the set (3), using the thesaurus structured at the previous point.
It is important to observe that said means (7a) are able to start the following sub-steps:
a1) selecting said sub-group of images (I) so that all or a portion of their content shows the complete set of font characters provided in the whole set of images (3)
a2) processing each image (Ii) belonging to said sub-group (I) in order to obtain as many modified images (Mi)
a3) individuating in each modified image (Mi) the regions (Rbi) which approximately contain text, but which, at this stage, can also contain disturbances and non-text
a4) removing from said regions (Rbi) the regions which do not contain text, thereby individuating those which really contain text (Rti)
a5) individuating in each of said regions (Rti) each region (Rpi) containing a word
a6) connecting linearly each single region (Rpi) containing a word, so that the sequence thereof is reconstructed in each region (Rri) containing a text line
a7) connecting linearly each single region (Rri) containing a text line, so that the sequence thereof is reconstructed in the regions (Rti) containing text
a8) connecting linearly the single regions (Rti) containing text, so that the sequence thereof is reconstructed on the page.
It is important to observe that said step (a2) comprises the transformation of each image (Ii) into grey scale and the horizontal alignment of said image (see fig. 3), taking as reference:
- the header line (3A) or the line (3B) at the bottom of the page, or
- if there are no header lines or lines at the bottom of the page, a manually indicated line (3C).
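By way of illustration only, the following Python sketch mimics the kind of operations involved in step (a): grey-scale loading, horizontal alignment (step a2) and progressive grouping of ink pixels into word and line regions (steps a3 to a8). The OpenCV calls, kernel sizes, the area threshold used as a noise filter and the file name are assumptions made for readability; this is not the implementation claimed by the patent.

    import cv2
    import numpy as np

    def align_horizontally(gray):
        """Deskew the page (step a2); the dominant ink orientation stands in
        for the header/footer line, and the patent also allows a manually
        indicated reference line (3C)."""
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        if angle > 45:            # recent OpenCV returns angles in (0, 90]
            angle -= 90
        h, w = gray.shape
        matrix = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        return cv2.warpAffine(gray, matrix, (w, h), flags=cv2.INTER_LINEAR,
                              borderMode=cv2.BORDER_REPLICATE)

    def segment_regions(gray, kernel_size, min_area=100):
        """Group ink pixels into rectangular regions: a wide kernel merges
        characters into lines, a narrow one merges them into words."""
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernel_size)
        merged = cv2.dilate(binary, kernel)
        contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours]
        # crude stand-in for step a4: drop tiny regions that are not text
        boxes = [b for b in boxes if b[2] * b[3] > min_area]
        # reading order, top to bottom then left to right (steps a6 to a8)
        return sorted(boxes, key=lambda b: (b[1], b[0]))

    gray = cv2.imread("page_004.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
    aligned = align_horizontally(gray)
    line_regions = segment_regions(aligned, kernel_size=(25, 1))  # approx. Rri
    word_regions = segment_regions(aligned, kernel_size=(5, 1))   # approx. Rpi

Sorting the bounding boxes by vertical and then horizontal position is only a rough substitute for the linear connection of regions into lines, text blocks and pages described in steps a6 to a8.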
It is important to observe that said means (7b) are able to start the following sub-steps:
b1) transcribing manually in electronic form all or a portion of the text shown in the sub-group (I) of said digital images (3)
b2) individuating in each region (Rpi) containing a word each region (Rci) containing a character
b3) associating automatically the corresponding text, input in sub-step b1), with each region (Rpi) containing words
b4) associating automatically the corresponding character with each region (Rci) containing a character
b5) storing said associations between regions representing characters and characters in a structured thesaurus.
Advantageously, for said sub-step (b1), the manually transcribed text is all or a portion of the text provided in each image (Ii). In particular, the transcribed portions must contain the entire set of characters used in the rest of the document.
Advantageously, said step (b2) comprises removing disturbances from said regions (Rpi).
Advantageously, said step (b2) comprises the separation of possibly joined characters, by manually inputting a solid line of white pixels between them.
It is important that, for step (b5), said structured thesaurus allows from zero to many images containing characters to be associated with a character of a specific font.
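As an illustration of macro-step (b), the sketch below builds a structured thesaurus by cutting each word region (Rpi) into character regions (Rci) at blank columns and pairing them with the manually transcribed word. The projection-based cut, the skipping of ambiguous pairings and the use of a Python dictionary keyed by character are assumptions of this sketch, not features prescribed by the description or the claims.

    from collections import defaultdict
    import cv2

    def split_into_characters(word_img):
        """Step b2 (sketch): cut a word region into character regions at
        columns with no ink; a manually drawn column of white pixels, as
        suggested for joined characters, produces the same effect."""
        _, binary = cv2.threshold(word_img, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        ink_per_column = binary.sum(axis=0)
        chars, start = [], None
        for x, ink in enumerate(ink_per_column):
            if ink > 0 and start is None:
                start = x
            elif ink == 0 and start is not None:
                chars.append(word_img[:, start:x])
                start = None
        if start is not None:
            chars.append(word_img[:, start:])
        return chars

    def build_thesaurus(word_images, transcribed_words):
        """Steps b3 to b5 (sketch): pair each word region with its manual
        transcription and each character region with its character, keeping
        zero or more image exemplars per character of the font."""
        thesaurus = defaultdict(list)      # character -> list of image crops
        for word_img, word_text in zip(word_images, transcribed_words):
            char_regions = split_into_characters(word_img)
            if len(char_regions) != len(word_text):
                continue                   # ambiguous pairing: skipped here
            for region, char in zip(char_regions, word_text):
                thesaurus[char].append(region)
        return thesaurus

Because the thesaurus keeps every exemplar, a character of a specific font can indeed be associated with zero, one or many images, as required for step (b5).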
It is important to observe that said means (7c) are able to start the following sub-steps:
c1) individuating each region (Rpi) containing a word, in any image belonging to the set (3), according to sub-step (a5)
c2) individuating in each region (Rpi) containing a word each region (Rci) containing a character
c3) searching in the thesaurus structured in said steps (b4) and (b5) for the character corresponding to each region (Rci)
c4) associating a first value of efficacy (E1) with the result of said step (c).
Advantageously, the previously listed steps (a), (b) and (c) can be repeated until said first value of efficacy (E1) is equal to or greater than a prefixed value of efficacy (E).
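Macro-step (c) can likewise be sketched as matching each character region against the thesaurus exemplars and averaging the match scores into a first value of efficacy (E1). Normalised template matching, the fixed 32x32 resizing and the mean-score definition of (E1) are assumptions chosen for this illustration; the description and claims leave the matching criterion and the efficacy measure open.

    import cv2
    import numpy as np

    def match_character(region, thesaurus, size=(32, 32)):
        """Return the best-matching character and its similarity score."""
        sample = cv2.resize(region, size)
        best_char, best_score = None, -1.0
        for char, exemplars in thesaurus.items():
            for exemplar in exemplars:
                template = cv2.resize(exemplar, size)
                score = cv2.matchTemplate(sample, template,
                                          cv2.TM_CCOEFF_NORMED)[0, 0]
                if score > best_score:
                    best_char, best_score = char, score
        return best_char, best_score

    def recognise_page(char_regions, thesaurus):
        """Steps c1 to c4 (sketch): recognise every character region and
        compute a first value of efficacy E1 as the mean match score."""
        text, scores = [], []
        for region in char_regions:
            char, score = match_character(region, thesaurus)
            text.append(char if char is not None else "?")
            scores.append(max(float(score), 0.0))
        efficacy = float(np.mean(scores)) if scores else 0.0
        return "".join(text), efficacy

    # Repetition envisaged in the paragraph above: enlarge the transcribed
    # portion, rebuild the thesaurus and recognise again until E1 reaches a
    # prefixed efficacy E (E_target here is a hypothetical name).
    # while efficacy < E_target:
    #     thesaurus = build_thesaurus(more_word_images, more_transcriptions)
    #     text, efficacy = recognise_page(char_regions, thesaurus)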

Claims

1. Method for recognizing the text (2) provided in a set of digital images (3), each one showing a page (4) of an ancient document (5), the method comprising the following macro-steps:
a) individuating and connecting in sequence the regions (Rp) containing words in a sub-group (I) of said images (3)
b) structuring a thesaurus of font characters used in said regions containing words
c) carrying out the character recognition on one or more images belonging to the set (3), by associating a first value of efficacy (E1) to the result of such recognition.
2. Method for recognizing the text provided in a set of digital images according to claim 1, said macro-step (a) comprising the following steps:
a1) selecting a sub-group of images (I) of said images (3) so that all or a portion of their content shows the complete set of font characters provided in the whole set of images (3)
a2) processing each image (Ii) belonging to the sub-group (I) in order to obtain as many modified images (Mi) in grey scale and horizontally aligned by taking as reference:
- the header line (3A) or the line (3B) at the bottom of the page, or
- if there are no header lines or lines at the bottom of the page, by taking as reference a manually indicated line (3C)
a3) individuating in each image (Mi) the regions (Rbi) which approximately contain text, but which, at this stage, can also contain disturbances and non-text
a4) removing from said regions (Rbi) the regions which do not contain text, thereby individuating those which really contain text (Rti)
a5) individuating in each of said regions (Rti) each region (Rpi) containing a word
a6) connecting linearly each single region (Rpi) containing a word, so that the sequence thereof is reconstructed in each region (Rri) containing a text line
a7) connecting linearly each single region (Rri) containing a text line, so that the sequence thereof is reconstructed in the regions (Rti) containing text
a8) connecting linearly the single regions (Rti) containing text, so that the sequence thereof is reconstructed on the page.
3. Method for recognizing the text provided in a set of digital images according to claim 1, said macro-step (b) comprising the following steps:
b1) transcribing manually in electronic form all or a portion of the text shown in the sub-group (I) of said digital images (3)
b2) individuating in each region (Rpi) containing a word each region (Rci) containing a character
b3) associating automatically the corresponding text, input in sub-step b1), with each region (Rpi) containing words
b4) associating automatically the corresponding character with each region (Rci) containing a character
b5) storing said associations between regions representing characters and characters in a structured thesaurus.
4. Method for recognizing the text provided in a set of digital images according to claim 1, said macro-step (c) comprising the following steps:
c1) individuating each region (Rpi) containing a word, in any image belonging to the set (3), according to sub-step (a5)
c2) individuating in each region (Rpi) containing a word each region (Rci) containing a character
c3) searching in the thesaurus structured in said steps (b4) and (b5) for the character corresponding to each region (Rci)
c4) associating a first value of efficacy (E1) with the result of said macro-step (c).
5. Method for recognizing the text provided in a set of digital images according to any one of claims 1 to 4, further comprising the step of comparing said value of efficacy (E1) with a prefixed value of efficacy (E).
6. Method for recognizing the text provided in a set of digital images according to claim 5, wherein said method provides for the repetition of macro-steps (a), (b) and (c) until said first value of efficacy (E1) is equal to or greater than said prefixed value of efficacy (E).
7. Computer programme directly loadable in the memory of an electronic computer, comprising code for carrying out the method according to any one of claims 1 to 6 when executed by said electronic computer.
8. Apparatus (1) for recognizing the text (2) provided in a set of digital images (3), each one showing a page (4) of the same ancient document (5), said apparatus comprising:
- a planetary scanner (8) for acquiring said images (3)
- an electronic computer (6) provided with:
- means (7a) for individuating and connecting in sequence each region (Rpi) containing a word, in each image (Ii) belonging to a sub-group (I), according to claim 2
- means (7b) for structuring a thesaurus of the characters of the fonts used in said regions (Rpi) according to claim 3
- means (7c) for carrying out the recognition of a character on one or more images belonging to the set (3) according to claim 4,
characterized in that said means (7a), (7b), (7c) comprise a computer programme according to any one of claims 1 to 6.
9. Apparatus for recognizing the text provided in a set of digital images according to claim 8, wherein the planetary scanner (8) is operationally connected to the computer (6) .
PCT/IB2012/050288 2011-07-10 2012-01-21 Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document WO2013008103A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ITBA2011A000038 2011-07-10
IT000038A ITBA20110038A1 (en) 2011-07-12 2011-07-12 METHOD AND APPARATUS FOR RECOGNIZING TEXT IN DIGITAL IMAGES DEPICTING PAGES OF AN ANCIENT DOCUMENT

Publications (1)

Publication Number Publication Date
WO2013008103A1 (en) 2013-01-17

Family

ID=44511154

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2012/050288 WO2013008103A1 (en) 2011-07-10 2012-01-21 Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document

Country Status (2)

Country Link
IT (1) ITBA20110038A1 (en)
WO (1) WO2013008103A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0369761A2 (en) 1988-11-16 1990-05-23 Ncr International Inc. Character segmentation method
EP0649113A2 (en) 1993-10-19 1995-04-19 Canon Kabushiki Kaisha Multifont optical character recognition using a box connectivity approach

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ALBERT GORDO ET AL: "State: A Multimodal Assisted Text-Transcription System for Ancient Documents", DOCUMENT ANALYSIS SYSTEMS, 2008. DAS '08. THE EIGHTH IAPR INTERNATIONAL WORKSHOP ON, IEEE, PISCATAWAY, NJ, USA, 16 September 2008 (2008-09-16), pages 135 - 142, XP031360482, ISBN: 978-0-7695-3337-7 *
LE BOURGEOIS F ET AL: "DEBORA: Digital AccEss to BOoks of the RenAissance", INTERNATIONAL JOURNAL OF DOCUMENT ANALYSIS AND RECOGNITION (IJDAR), SPRINGER, BERLIN, DE, vol. 9, no. 2-4, 2 February 2007 (2007-02-02), pages 193 - 221, XP019493682, ISSN: 1433-2825, DOI: 10.1007/S10032-006-0030-0 *
LEYDIER Y ET AL: "Towards an omnilingual word retrieval system for ancient manuscripts", PATTERN RECOGNITION, ELSEVIER, GB, vol. 42, no. 9, 1 September 2009 (2009-09-01), pages 2089 - 2105, XP026148596, ISSN: 0031-3203, [retrieved on 20090203], DOI: 10.1016/J.PATCOG.2009.01.026 *
LORIS EYNARD ET AL: "Particular Words Mining and Article Spotting in Old French Gazettes", MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION 2009, 23 July 2009 (2009-07-23), Leibzig, Germany, pages 176 - 188, XP055021915 *
SACHIN RAWAT ET AL: "A Semi-automatic Adaptive OCR for Digital Libraries", 1 January 2006, DOCUMENT ANALYSIS SYSTEMS VII LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER, BERLIN, DE, PAGE(S) 13 - 24, ISBN: 978-3-540-32140-8, XP019027972 *
TOSELLI A H ET AL: "Multimodal interactive transcription of text images", PATTERN RECOGNITION, ELSEVIER, GB, vol. 43, no. 5, 1 May 2010 (2010-05-01), pages 1814 - 1825, XP026892648, ISSN: 0031-3203, [retrieved on 20091127], DOI: 10.1016/J.PATCOG.2009.11.019 *
YANN LEYDIER ET AL: "Textual indexation of ancient documents", PROCEEDINGS OF THE 2005 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, DOCENG '05, 2 November 2005 (2005-11-02), Bristol, UK, pages 111 - 117, XP055021913 *

Also Published As

Publication number Publication date
ITBA20110038A1 (en) 2013-01-13

Similar Documents

Publication Publication Date Title
US20090123071A1 (en) Document processing apparatus, document processing method, and computer program product
CN1841364A (en) Document translation method and document translation device
KR910012986A (en) Document correction device of document reading translation system
US11615635B2 (en) Heuristic method for analyzing content of an electronic document
JP2006276914A (en) Translation processing method, document processing device, and program
CN112801084A (en) Image processing method and device, electronic equipment and storage medium
JP3636490B2 (en) Image processing apparatus and image processing method
KR20060001392A (en) Document image storage method of content retrieval base to use ocr
WO2013008103A1 (en) Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document
Cojocaru et al. Optical Character Recognition Applied to Romanian Printed Texts of the 18th-20th Century
Schoen et al. Optical character recognition (ocr) and medieval manuscripts: Reconsidering transcriptions in the digital age
CN115713063A (en) Document conversion method, device, equipment and storage medium
US20220237397A1 (en) Identifying handwritten signatures in digital images using ocr residues
Kurz There’s More to It Already. Typography and Literature Studies: A Critique of Nina Nørgaard’s ‘The semiotics of typography in literary texts’(2009)
Ojumah et al. A database for handwritten yoruba characters
JP2006252164A (en) Chinese document processing device
US20140111438A1 (en) System, method and apparatus for the transcription of data using human optical character matching (hocm)
KR102313056B1 (en) A Sheet used to providing user-customized fonts, a device for providing user custom fonts, and method for providing the same
Prakash et al. Content extraction studies for multilingual unstructured web documents
Parhami Computers and challenges of writing in Persian: Explorations at the intersection of culture and technology
DE102012216165A1 (en) Method for providing print media content in digital format for mobile display device, involves optimizing print media contents for specific display device by converting portable document format data files
Kosaka et al. An effective and interactive training data collection method for early-modern Japanese printed character recognition
JP6574278B2 (en) How to create learning materials for Kuzushi characters
JP7041103B2 (en) Structured document creation device and its method
González Martínez et al. A new strategy for Arabic OCR based on script analysis and synthesis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12708931

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12708931

Country of ref document: EP

Kind code of ref document: A1