WO2013008103A1 - Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document - Google Patents

Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document

Info

Publication number
WO2013008103A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
regions
region
images
recognizing
Prior art date
Application number
PCT/IB2012/050288
Other languages
French (fr)
Inventor
Nicola BARBUTI
Tommaso CALDAROLA
Original Assignee
Dabimus Srl
Priority date
Filing date
Publication date
Application filed by Dabimus Srl
Publication of WO2013008103A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

Method for recognizing the text provided in a set of digital images, each showing a page of an ancient document, comprising the following macro-steps: a) individuating and connecting in sequence the regions (Rp) containing words in a sub-group (I) of said images; b) structuring a thesaurus of the font characters used in said regions containing words; c) carrying out character recognition on one or more images belonging to the set, and associating a first value of efficacy with the result of such recognition.

Description

METHOD AND APPARATUS FOR RECOGNIZING TEXTS IN DIGITAL PICTURES REPRODUCING PAGES OF AN ANTIQUE DOCUMENT
DESCRIPTION
The current digital libraries containing digital collections of handwritten manuscripts up to 1900 and of ancient and valuable printed documents up to 1850 are still not fully usable. For these specific digital contents, in fact, it has not yet been possible to provide optical and/or intelligent recognition systems for the texts contained in the virtual pages that can guarantee efficient indexation of the data bank contents, both of those already available on the Internet and of those in progress. In fact, none of the most recent digital library projects currently available on the web 2.0 (Europeana, World Digital Library, The European Library, etc.) offers accessibility and usability levels that allow users to consult the textual content of the reproduced digital objects without leafing through them in full. Apart from the classic catalogue-type searches (author, title, bibliography, etc.), in these data banks it is not possible to develop an indexation that allows detailed studies based on the analysis of word occurrences, interference between different texts, and so on.
Such difficulty comes from the very nature of the manuscripts considered. The complexity and diversity of the handwriting, even of the more linear and regular palaeographic hands, and the different kinds of ink used, together with the age of the supports, mostly damaged by the accidents of time and by human negligence, have frustrated all efforts to go beyond the simple digital reproduction of such manuscripts.
Not even the OCR or ICR systems currently available on the market can be applied to solve the problem of text recognition in ancient documents (e.g.: EP0649113A2 - Multifont optical character recognition using a box connectivity approach; EP0369761A2 - Universal character section for multifont).
If this situation seems almost obvious for manuscripts, one would not expect the same for printed texts. Yet, even for this kind of document, and in particular for texts produced by movable type printing (the whole typographic production from 1456 to the early 1800s, four centuries of printing), the situation is entirely similar to that of manuscripts. The problems are in fact always the same, even if less severe: the techniques used to produce the printing matrix, the kinds of ink used, the alignment of the punches within the words and of the words in the line, the difference between the graphic symbols representing some letters and those commonly used today (for example, for ages the "s" was represented by a punch very similar to the "f"), different linguistic rules, and image noisiness of various kinds (refraction of the printing on the page surface, blurred and broken punches, spots of ink and stains of different origin due to time and handling) are all factors which, so far, have frustrated all efforts to index the contents of digitalized volumes by applying OCR and ICR systems. Even the experimental and very expensive Google Books project has run up against the inadequacy of the OCR systems used when applied to volumes dated before the second half of the 19th century.
According to the applicant, the reason why the applications provided so far for optical and/or intelligent recognition of digital images do not work on ancient documents is the wrong methodological approach used in structuring such systems.
The approach used so far is to map the words on the images scanned from the document pages, associating them with an electronic text that has to be entered manually by an operator. This method aims at reconstructing the document text rather than recognizing it. It is obvious, however, that this approach strongly limits the possibility of electronically indexing a large number of historical texts, because it entails considerable costs due to the largely manual work carried out by the operator.
While this method can be useful for some kinds of chronologically more recent works (from the second half of the 19th century onwards), it is not a solution to the problem of making accessible to scholars, and to mankind as a whole, the huge mass of older works, both "major" and "minor", kept in the many historical libraries around the world, a heritage of information and history that is almost completely unused yet very valuable.
There are also products which, although they use more advanced methods, still base recognition on dividing the text regions into words, for example the product "A2iA's Proprietary IWR, Intelligent Word Recognition" by A2iA (http://www.a2ia.com). These systems, however, can work only if interfaced with specific semantic thesauruses structured before the indexation step; otherwise they cannot carry out any text recognition.
The recognition system provided by DABIMUS starts from a new methodological approach geared to the distinctive features and noisiness of ancient manuscripts. The method is no longer aimed at the regions/words of the text (regions containing a word), but at the typographic regions/characters (regions containing a character), associating with each of them the corresponding character extracted from the text, which has to be entered manually in electronic form. The idea is therefore to individuate and recognize the typographic characters used in the ancient book or manuscript. In this way, it is only necessary to transcribe manually a portion of content in which the typographer used the whole set of punches employed to print the volume, and which normally corresponds to a rather limited number of pages of text (up to a maximum of 20). Obviously, the more text is transcribed and connected to the extracted regions, the more precise and correct the reproduced text is.
After the typographic characters have been thesaurized, recognition can be carried out on the whole digital volume with a very high level of precision in the reproduction. Further manual adjustments can be used to correct the unavoidable residual noisiness.
The ICR system thus provided by DABIMUS greatly improves the accessibility of documents. It allows two usability levels, which can also be applied at the same time. The first allows the user to carry out searches inside the document without the need for content indexation; however, this method takes time, since segmentation is carried out concurrently with the search step, so it is efficient for the user only on small documents.
The other level, instead, provides for launching the application in batch mode on the whole document before it is entered in the data bank, followed by indexation of the recognized text content, so that once the document is in the data bank the user can carry out any search by keywords with immediate retrieval. The object of the present invention is therefore to provide a method and an apparatus for recognizing the text contained in a set of digital images showing pages of an ancient document, which solve the above-described problems of the related art. The present invention achieves this aim by a method according to claim 1, an apparatus according to claim 8 and a computer programme which, when loaded in the memory of a computer and executed by it, carries out the method of the present invention.
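The second usability level described above amounts to recognising the text of every page in batch and indexing it before the document enters the data bank. The following Python fragment is only a minimal sketch of such an indexation step, assuming a simple inverted index and whitespace tokenisation; neither the data structure nor the example page texts come from the patent.

    from collections import defaultdict

    def build_index(recognized_pages):
        """Map every word to the set of page numbers on which it occurs."""
        index = defaultdict(set)
        for page_number, text in recognized_pages.items():
            for word in text.lower().split():
                index[word].add(page_number)
        return index

    def search(index, keyword):
        """Return the pages containing the keyword, sorted for display."""
        return sorted(index.get(keyword.lower(), set()))

    # Hypothetical example: page numbers mapped to text recognised in macro-step (c).
    pages = {4: "de claris mulieribus", 5: "claris exemplis ornata"}
    index = build_index(pages)
    print(search(index, "claris"))   # -> [4, 5]

Once such an index is stored in the data bank together with the document, a keyword query reduces to a dictionary lookup, which is why retrieval is immediate at this level.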
The elements and features of the present invention, together with their specific advantages, are disclosed in the following description and represented graphically, by way of non-limiting example, in the appended drawings:
- figure 1 represents the flowchart of the method object of the invention;
- figure 2 represents an apparatus according to an embodiment of the present invention;
- figure 3 represents pages of ancient documents where it is possible to recognize the text according to the present invention.
Throughout the description of the present invention, the following terms are used with the meanings given below:
Ancient document: any record which represents evidence from the past within conventionally fixed chronological limits (for example, in Italy: incunabulum: printed book from 1456 to 1500; ancient book: from 1501 to 1830; modern book: from 1831 to date), regardless of the support on which it is recorded.
Digital image: numerical/digital representation, obtained by means of a process called digitalization, of any physical object.
Text provided in the image: portion of the image which represents the text and/or graphic content of the digitalized object (page, manuscript document, etc.).
Font character: element which is part of a group of typographic characters, characterized by and sharing the same graphic style or intended to carry out a certain function (in ancient typography: movable type characters).
OCR/ICR: OCR: programmes for converting images containing text into digital text modifiable by a normal editor; ICR: programmes for intelligent character recognition, converting a digital image containing text into modifiable text.
Disturbance/noisiness: unwanted signals, of natural or artificial origin, which are not part of the original physical object and which overlap the information transmitted and processed in a system.
Therefore, with reference to the appended drawings, an apparatus (1) is represented for recognizing the text (2) provided in a set of digital images (3), each one showing a page (4) of an ancient document (5).
Such an apparatus (1) comprises:
- a planetary scanner (8) for acquiring said images (3)
- an electronic computer (6) provided with:
- means (7a) for starting a first step (a) in which the regions (Rpi), each containing a word, are individuated and connected in sequence in each image (Ii) belonging to a sub-group (I) of said images (3)
- means (7b) for starting a second step (b) in which a thesaurus of the characters of the fonts used in said regions (Rpi) is structured
- means (7c) for starting a third step (c) in which the recognition of characters is carried out on one or more images belonging to the set (3), using the thesaurus structured at the previous point.
It is important to observe that said means (7a) are able to start the following sub-steps:
a1) selecting said sub-group of images (I) so that all or a portion of their content shows the complete set of font characters provided in the whole set of images (3)
a2) processing each image (Ii) belonging to said sub-group (I) in order to obtain as many modified images (Mi)
a3) individuating in each modified image (Mi) the regions (Rbi) which approximately contain text, but which, at this stage, can also contain disturbances and non-text
a4) removing from said regions (Rbi) the regions which do not contain text, thereby individuating those which really contain text (Rti)
a5) individuating in each of said regions (Rti) each region (Rpi) containing a word
a6) connecting linearly each single region (Rpi) containing a word, so that the sequence thereof is reconstructed in each region (Rri) containing a text line
a7) connecting linearly each single region (Rri) containing a text line, so that the sequence thereof is reconstructed in the regions (Rti) containing text
a8) connecting linearly the single regions (Rti) containing text, so that the sequence thereof is reconstructed on the page.
It is important to observe that said step (a2) comprises the transformation of each image (Ii) into grey scale and the horizontal alignment of said image (see fig. 3), taking as reference:
- the header line (3A) or the line (3B) at the bottom of the page, or
- if there are no header lines or lines at the bottom of the page, a manually indicated line (3C).
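By way of illustration only, the following Python sketch mimics the kind of operations involved in step (a): grey-scale loading, horizontal alignment (step a2) and progressive grouping of ink pixels into word and line regions (steps a3 to a8). The OpenCV calls, kernel sizes, the area threshold used as a noise filter and the file name are assumptions made for readability; this is not the implementation claimed by the patent.

    import cv2
    import numpy as np

    def align_horizontally(gray):
        """Deskew the page (step a2); the dominant ink orientation stands in
        for the header/footer line, and the patent also allows a manually
        indicated reference line (3C)."""
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        if angle > 45:            # recent OpenCV returns angles in (0, 90]
            angle -= 90
        h, w = gray.shape
        matrix = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        return cv2.warpAffine(gray, matrix, (w, h), flags=cv2.INTER_LINEAR,
                              borderMode=cv2.BORDER_REPLICATE)

    def segment_regions(gray, kernel_size, min_area=100):
        """Group ink pixels into rectangular regions: a wide kernel merges
        characters into lines, a narrow one merges them into words."""
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernel_size)
        merged = cv2.dilate(binary, kernel)
        contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours]
        # crude stand-in for step a4: drop tiny regions that are not text
        boxes = [b for b in boxes if b[2] * b[3] > min_area]
        # reading order, top to bottom then left to right (steps a6 to a8)
        return sorted(boxes, key=lambda b: (b[1], b[0]))

    gray = cv2.imread("page_004.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
    aligned = align_horizontally(gray)
    line_regions = segment_regions(aligned, kernel_size=(25, 1))  # approx. Rri
    word_regions = segment_regions(aligned, kernel_size=(5, 1))   # approx. Rpi

Sorting the bounding boxes by vertical and then horizontal position is only a rough substitute for the linear connection of regions into lines, text blocks and pages described in steps a6 to a8.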
It is important to observe that said means (7b) are able to start the following sub-steps:
b1) transcribing manually in electronic form all or a portion of the text shown in the sub-group (I) of said digital images (3)
b2) individuating in each region (Rpi) containing a word each region (Rci) containing a character
b3) associating automatically the corresponding text, input in sub-step b1), with each region (Rpi) containing words
b4) associating automatically the corresponding character with each region (Rci) containing a character
b5) storing said associations between regions representing characters and characters in a structured thesaurus.
Advantageously, for said sub-step (b1), the manually transcribed text is all or a portion of the text provided in each image (Ii). In particular, the transcribed portions must contain the entire set of characters used in the rest of the document.
Advantageously, said step (b2) comprises removing disturbances from said regions (Rpi).
Advantageously, said step (b2) comprises the separation of possibly joined characters, by manually inputting a solid line of white pixels between them.
It is important that, for step (b5), said structured thesaurus allows from zero to many images containing characters to be associated with a character of a specific font.
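As an illustration of macro-step (b), the sketch below builds a structured thesaurus by cutting each word region (Rpi) into character regions (Rci) at blank columns and pairing them with the manually transcribed word. The projection-based cut, the skipping of ambiguous pairings and the use of a Python dictionary keyed by character are assumptions of this sketch, not features prescribed by the description or the claims.

    from collections import defaultdict
    import cv2

    def split_into_characters(word_img):
        """Step b2 (sketch): cut a word region into character regions at
        columns with no ink; a manually drawn column of white pixels, as
        suggested for joined characters, produces the same effect."""
        _, binary = cv2.threshold(word_img, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        ink_per_column = binary.sum(axis=0)
        chars, start = [], None
        for x, ink in enumerate(ink_per_column):
            if ink > 0 and start is None:
                start = x
            elif ink == 0 and start is not None:
                chars.append(word_img[:, start:x])
                start = None
        if start is not None:
            chars.append(word_img[:, start:])
        return chars

    def build_thesaurus(word_images, transcribed_words):
        """Steps b3 to b5 (sketch): pair each word region with its manual
        transcription and each character region with its character, keeping
        zero or more image exemplars per character of the font."""
        thesaurus = defaultdict(list)      # character -> list of image crops
        for word_img, word_text in zip(word_images, transcribed_words):
            char_regions = split_into_characters(word_img)
            if len(char_regions) != len(word_text):
                continue                   # ambiguous pairing: skipped here
            for region, char in zip(char_regions, word_text):
                thesaurus[char].append(region)
        return thesaurus

Because the thesaurus keeps every exemplar, a character of a specific font can indeed be associated with zero, one or many images, as required for step (b5).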
It is important to observe that said means (7c) are able to start the following sub-steps:
c1) individuating each region (Rpi) containing a word, in any image belonging to the set (3), according to sub-step (a5)
c2) individuating in each region (Rpi) containing a word each region (Rci) containing a character
c3) searching in the thesaurus structured in said steps (b4) and (b5) for the character corresponding to each region (Rci)
c4) associating a first value of efficacy (E1) with the result of said step (c).
Advantageously, the previously listed steps (a), (b) and (c) can be repeated until said first value of efficacy (E1) is equal to or greater than a prefixed value of efficacy (E).
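Macro-step (c) can likewise be sketched as matching each character region against the thesaurus exemplars and averaging the match scores into a first value of efficacy (E1). Normalised template matching, the fixed 32x32 resizing and the mean-score definition of (E1) are assumptions chosen for this illustration; the description and claims leave the matching criterion and the efficacy measure open.

    import cv2
    import numpy as np

    def match_character(region, thesaurus, size=(32, 32)):
        """Return the best-matching character and its similarity score."""
        sample = cv2.resize(region, size)
        best_char, best_score = None, -1.0
        for char, exemplars in thesaurus.items():
            for exemplar in exemplars:
                template = cv2.resize(exemplar, size)
                score = cv2.matchTemplate(sample, template,
                                          cv2.TM_CCOEFF_NORMED)[0, 0]
                if score > best_score:
                    best_char, best_score = char, score
        return best_char, best_score

    def recognise_page(char_regions, thesaurus):
        """Steps c1 to c4 (sketch): recognise every character region and
        compute a first value of efficacy E1 as the mean match score."""
        text, scores = [], []
        for region in char_regions:
            char, score = match_character(region, thesaurus)
            text.append(char if char is not None else "?")
            scores.append(max(float(score), 0.0))
        efficacy = float(np.mean(scores)) if scores else 0.0
        return "".join(text), efficacy

    # Repetition envisaged in the paragraph above: enlarge the transcribed
    # portion, rebuild the thesaurus and recognise again until E1 reaches a
    # prefixed efficacy E (E_target here is a hypothetical name).
    # while efficacy < E_target:
    #     thesaurus = build_thesaurus(more_word_images, more_transcriptions)
    #     text, efficacy = recognise_page(char_regions, thesaurus)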

Claims

1. Method for recognizing the text (2) provided in a set of digital images (3), each one showing a page (4) of an ancient document (5), the method comprising the following macro-steps:
a) individuating and connecting in sequence the regions (Rp) containing words in a sub-group (I) of said images (3)
b) structuring a thesaurus of font characters used in said regions containing words
c) carrying out the character recognition on one or more images belonging to the set (3), by associating a first value of efficacy (E1) to the result of such recognition.
2. Method for recognizing the text provided in a set of digital images according to claim 1, said macro-step (a) comprising the following steps:
a1) selecting a sub-group of images (I) of said images (3) so that all or a portion of their content shows the complete set of font characters provided in the whole set of images (3)
a2) processing each image (Ii) belonging to the sub-group (I) in order to obtain as many modified images (Mi) in grey scale and horizontally aligned by taking as reference:
- the header line (3A) or the line (3B) at the bottom of the page, or
- if there are no header lines or lines at the bottom of the page, by taking as reference a manually indicated line (3C)
a3) individuating in each image (Mi) the regions (Rbi) which approximately contain text, but which, at this stage, can also contain disturbances and non-text
a4) removing from said regions (Rbi) the regions which do not contain text, thereby individuating those which really contain text (Rti)
a5) individuating in each of said regions (Rti) each region (Rpi) containing a word
a6) connecting linearly each single region (Rpi) containing a word, so that the sequence thereof is reconstructed in each region (Rri) containing a text line
a7) connecting linearly each single region (Rri) containing a text line, so that the sequence thereof is reconstructed in the regions (Rti) containing text
a8) connecting linearly the single regions (Rti) containing text, so that the sequence thereof is reconstructed on the page.
3. Method for recognizing the text provided in a set of digital images according to claim 1, said macro-step (b) comprising the following steps:
b1) transcribing manually in electronic form all or a portion of the text shown in the sub-group (I) of said digital images (3)
b2) individuating in each region (Rpi) containing a word each region (Rci) containing a character
b3) associating automatically the corresponding text, input in sub-step b1), with each region (Rpi) containing words
b4) associating automatically the corresponding character with each region (Rci) containing a character
b5) storing said associations between regions representing characters and characters in a structured thesaurus.
4. Method for recognizing the text provided in a set of digital images according to claim 1, said macro-step (c) comprising the following steps:
c1) individuating each region (Rpi) containing a word, in any image belonging to the set (3), according to sub-step (a5)
c2) individuating in each region (Rpi) containing a word each region (Rci) containing a character
c3) searching in the thesaurus structured in said steps (b4) and (b5) for the character corresponding to each region (Rci)
c4) associating a first value of efficacy (E1) with the result of said macro-step (c).
5. Method for recognizing the text provided in a set of digital images according to any one of claims 1 to 4, further comprising the step of comparing said value of efficacy (E1) with a prefixed value of efficacy (E).
6. Method for recognizing the text provided in a set of digital images according to claim 5, wherein said method provides for the repetition of macro-steps (a), (b) and (c) until said first value of efficacy (E1) is equal to or greater than said prefixed value of efficacy (E).
7. Computer programme directly loadable in the memory of an electronic computer, comprising code for carrying out the method according to any one of claims 1 to 6 when executed by said electronic computer.
8. Apparatus (1) for recognizing the text (2) provided in a set of digital images (3), each one showing a page (4) of the same ancient document (5), said apparatus comprising:
- a planetary scanner (8) for acquiring said images (3)
- an electronic computer (6) provided with:
- means (7a) for individuating and connecting in sequence each region (Rpi) containing a word, in each image (Ii) belonging to a sub-group (I), according to claim 2
- means (7b) for structuring a thesaurus of the characters of the fonts used in said regions (Rpi) according to claim 3
- means (7c) for carrying out the recognition of a character on one or more images belonging to the set (3) according to claim 4,
characterized in that said means (7a), (7b), (7c) comprise a computer programme according to any one of claims 1 to 6.
9. Apparatus for recognizing the text provided in a set of digital images according to claim 8, wherein the planetary scanner (8) is operationally connected to the computer (6) .
PCT/IB2012/050288 2011-07-10 2012-01-21 Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document WO2013008103A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ITBA2011A000038 2011-07-10
IT000038A ITBA20110038A1 (en) 2011-07-12 2011-07-12 METHOD AND APPARATUS FOR RECOGNIZING TEXT IN DIGITAL IMAGES DEPICTING PAGES OF AN ANCIENT DOCUMENT

Publications (1)

Publication Number Publication Date
WO2013008103A1 (en) 2013-01-17

Family

ID=44511154

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2012/050288 WO2013008103A1 (en) 2011-07-10 2012-01-21 Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document

Country Status (2)

Country Link
IT (1) ITBA20110038A1 (en)
WO (1) WO2013008103A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0369761A2 (en) 1988-11-16 1990-05-23 Ncr International Inc. Character segmentation method
EP0649113A2 (en) 1993-10-19 1995-04-19 Canon Kabushiki Kaisha Multifont optical character recognition using a box connectivity approach

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ALBERT GORDO ET AL: "State: A Multimodal Assisted Text-Transcription System for Ancient Documents", DOCUMENT ANALYSIS SYSTEMS, 2008. DAS '08. THE EIGHTH IAPR INTERNATIONAL WORKSHOP ON, IEEE, PISCATAWAY, NJ, USA, 16 September 2008 (2008-09-16), pages 135 - 142, XP031360482, ISBN: 978-0-7695-3337-7 *
LE BOURGEOIS F ET AL: "DEBORA: Digital AccEss to BOoks of the RenAissance", INTERNATIONAL JOURNAL OF DOCUMENT ANALYSIS AND RECOGNITION (IJDAR), SPRINGER, BERLIN, DE, vol. 9, no. 2-4, 2 February 2007 (2007-02-02), pages 193 - 221, XP019493682, ISSN: 1433-2825, DOI: 10.1007/S10032-006-0030-0 *
LEYDIER Y ET AL: "Towards an omnilingual word retrieval system for ancient manuscripts", PATTERN RECOGNITION, ELSEVIER, GB, vol. 42, no. 9, 1 September 2009 (2009-09-01), pages 2089 - 2105, XP026148596, ISSN: 0031-3203, [retrieved on 20090203], DOI: 10.1016/J.PATCOG.2009.01.026 *
LORIS EYNARD ET AL: "Particular Words Mining and Article Spotting in Old French Gazettes", MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION 2009, 23 July 2009 (2009-07-23), Leibzig, Germany, pages 176 - 188, XP055021915 *
SACHIN RAWAT ET AL: "A Semi-automatic Adaptive OCR for Digital Libraries", 1 January 2006, DOCUMENT ANALYSIS SYSTEMS VII LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER, BERLIN, DE, PAGE(S) 13 - 24, ISBN: 978-3-540-32140-8, XP019027972 *
TOSELLI A H ET AL: "Multimodal interactive transcription of text images", PATTERN RECOGNITION, ELSEVIER, GB, vol. 43, no. 5, 1 May 2010 (2010-05-01), pages 1814 - 1825, XP026892648, ISSN: 0031-3203, [retrieved on 20091127], DOI: 10.1016/J.PATCOG.2009.11.019 *
YANN LEYDIER ET AL: "Textual indexation of ancient documents", PROCEEDINGS OF THE 2005 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, DOCENG '05, 2 November 2005 (2005-11-02), Bristol, UK, pages 111 - 117, XP055021913 *

Also Published As

Publication number Publication date
ITBA20110038A1 (en) 2013-01-13

Similar Documents

Publication Publication Date Title
US20090123071A1 (en) Document processing apparatus, document processing method, and computer program product
CN1841364A (en) Document translation method and document translation device
KR910012986A (en) Document correction device of document reading translation system
US11615635B2 (en) Heuristic method for analyzing content of an electronic document
JP2006276914A (en) Translation processing method, document processing device, and program
CN112801084A (en) Image processing method and device, electronic equipment and storage medium
JP3636490B2 (en) Image processing apparatus and image processing method
KR20060001392A (en) Document image storage method of content retrieval base to use ocr
WO2013008103A1 (en) Method and apparatus for recognizing texts in digital pictures reproducing pages of an antique document
Cojocaru et al. Optical Character Recognition Applied to Romanian Printed Texts of the 18th-20th Century
Schoen et al. Optical character recognition (ocr) and medieval manuscripts: Reconsidering transcriptions in the digital age
CN115713063A (en) Document conversion method, device, equipment and storage medium
US20220237397A1 (en) Identifying handwritten signatures in digital images using ocr residues
Kurz There’s More to It Already. Typography and Literature Studies: A Critique of Nina Nørgaard’s ‘The semiotics of typography in literary texts’(2009)
Ojumah et al. A database for handwritten yoruba characters
JP2006252164A (en) Chinese document processing device
US20140111438A1 (en) System, method and apparatus for the transcription of data using human optical character matching (hocm)
KR102313056B1 (en) A Sheet used to providing user-customized fonts, a device for providing user custom fonts, and method for providing the same
Prakash et al. Content extraction studies for multilingual unstructured web documents
Parhami Computers and challenges of writing in Persian: Explorations at the intersection of culture and technology
DE102012216165A1 (en) Method for providing print media content in digital format for mobile display device, involves optimizing print media contents for specific display device by converting portable document format data files
Kosaka et al. An effective and interactive training data collection method for early-modern Japanese printed character recognition
JP6574278B2 (en) How to create learning materials for Kuzushi characters
JP7041103B2 (en) Structured document creation device and its method
González Martínez et al. A new strategy for Arabic OCR based on script analysis and synthesis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12708931

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12708931

Country of ref document: EP

Kind code of ref document: A1