WO2016033710A1 - Scene text detection system and method - Google Patents

Scene text detection system and method

Info

Publication number
WO2016033710A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
component
components
text components
confident
Prior art date
Application number
PCT/CN2014/000830
Other languages
French (fr)
Inventor
Xiaoou Tang
Weilin Huang
Yu Qiao
Original Assignee
Xiaoou Tang
Priority date
Filing date
Publication date
Application filed by Xiaoou Tang filed Critical Xiaoou Tang
Priority to PCT/CN2014/000830 priority Critical patent/WO2016033710A1/en
Priority to CN201480081759.5A priority patent/CN106796647B/en
Publication of WO2016033710A1 publication Critical patent/WO2016033710A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images

Definitions

  • the conditions for defining erroneously-connected text components comprise: 1) the aspect ratio (width/height) of the text component is larger than 2; 2) the text component has a positive confidence score; and 3) the text component is at an end node of the MSER tree structure or has a larger confidence score than that of all its children nodes in the MSER tree structure.
  • the constructor 400 may further comprise a pairing unit and a merging unit (not shown).
  • the pairing unit may be configured to pair two text components of the selected text components that have similar geometric and heuristic properties.
  • the merging unit may be configured to merge pairs sharing a component and having similar orientations sequentially to construct the final text.
  • Fig. 6 is a schematic flowchart illustrating a scene text detection method 2000 consistent with some disclosed embodiments. Hereafter, the method 2000 is described in detail with reference to Fig. 6.
  • a set of text components is generated from an image.
  • the set of text components is generated from the image by using a Maximally Stable Extremal Region (MSER) detector.
  • the generated text components are ordered into an MSER tree structure.
  • a component confidence score is assigned to each text component in the set of text components.
  • the component confidence score is assigned to each text component by a trained Convolutional Neural Network (CNN) classifier.
  • the Convolutional Neural Network classifier is trained with a predetermined training set to assign the component confidence score.
  • the convolutional neural network classifier comprises at least one convolutional layer, at least one average pooling layer and a support vector machine (SVM) classifier, wherein each convolutional layer is followed by an average pooling layer and has a plurality of filters.
  • the convolutional neural network classifier may comprise two convolutional layers.
  • filters of a first convolutional layer of the two convolutional layers learn from a set of patches extracted from the predetermined training set by using unsupervised K-means to generate a response;
  • filters of a second convolutional layer of the two convolutional layers learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confidence scores of the text components.
  • text components with high component confidence scores are selected from the set of text components.
  • the erroneously-connected text components are defined from the selected text components based on the assigned component confidence scores and the MSER tree structure.
  • the conditions for defining erroneously-connected text components comprise: 1) the aspect ratio (width/height) of the text component is larger than 2; 2) the text component has a positive confidence score; and 3) the text component is at an end node of the MSER tree structure or has a larger confidence score than that of all its children nodes in the MSER tree structure.
  • if the component belongs to the erroneously-connected text components, it is resized to a predetermined size.
  • the resized text components are scanned, for example by a sliding window, to obtain a one-dimensional array of component confidence scores.
  • peak locations of the erroneously-connected text components are identified based on the one-dimensional array, for example by Non-Maximal Suppression (NMS), so that the erroneously-connected text components are split into text components with high component confidence scores based on the peak locations.
  • a final text is constructed with the selected text components.
  • two text components of the selected text components that have similar geometric and heuristic properties are paired, and pairs sharing a component and having similar orientations are merged sequentially to construct the final text.
  • the system of the present application achieves strong robustness and a highly discriminative capability to distinguish texts from a large number of non-text components by incorporating the MSER detector and the trained CNN classifier.
  • a sliding-window model is integrated with the CNN classifier to further improve the capability of the MSER detector to correctly localize challenging text components. The method achieves large improvements over current methods on standard benchmark datasets.
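The three conditions for flagging an erroneously-connected component can be sketched as a predicate over nodes of the MSER tree. The `MSERNode` structure and its field names below are hypothetical, chosen only to make the test concrete:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MSERNode:
    """Hypothetical MSER tree node: bounding-box size, confidence score, children."""
    width: int
    height: int
    score: float                      # component confidence score from the classifier
    children: List["MSERNode"] = field(default_factory=list)

def is_erroneously_connected(node: MSERNode) -> bool:
    # 1) aspect ratio (width/height) larger than 2
    if node.width / node.height <= 2:
        return False
    # 2) positive confidence score
    if node.score <= 0:
        return False
    # 3) end node of the tree, or a higher score than all of its children
    return not node.children or all(node.score > c.score for c in node.children)
```

For example, a wide, positively-scored leaf node satisfies the predicate, while a component whose child scores higher than it does not.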

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a scene text detection system. The system may comprise a Maximally Stable Extremal Region (MSER) detector, a trained Convolutional Neural Network (CNN) classifier, a selector and a constructor. The Maximally Stable Extremal Region (MSER) detector may be configured to generate a set of text components from an image, wherein the generated text components are ordered into an MSER tree structure. The trained Convolutional Neural Network (CNN) classifier may be configured to assign a component confidence score to each text component in the set of text components. The selector may be configured to select text components with high component confidence scores from the set of text components. The constructor may be configured to construct a final text with the selected text components. A scene text detection method is also disclosed.

Description

SCENE TEXT DETECTION SYSTEM AND METHOD

Technical Field
The present application generally relates to a field of image processing, more particularly, to a scene text detection system and a scene text detection method.
Background
With the rapid evolution and popularization of high-performance mobile and wearable devices in recent years, scene text detection and localization have gained increasing attention for their large number of potential applications. Text in an image usually contains important semantic information, so detection and identification of the text are very important for a full understanding of the image.
The challenge for scene text detection comes from the extreme diversity of text patterns, highly complicated background information and serious real-world effects. For example, texts appearing in a natural image can be very small in size or in low contrast against the background color, and even regular texts can be distorted by strong lighting, occlusion or blurring. Furthermore, a large amount of noise and text-like outliers, such as windows, leaves and bricks, can be included in the image background, and often cause many false alarms in the detection process.
There have recently been mainly two groups of methods for scene text detection: sliding-window based and connected-component based methods. The sliding-window based methods detect text information by sliding a sub-window at multiple scales through all locations of an image. Text and non-text information is then distinguished by a trained classifier, which often uses manually designed low-level features extracted from the window, such as SIFT and Histograms of Oriented Gradients. The main challenges lie in the design of local features to handle the large variance of texts, and in the high computational demand of scanning a large number of windows, which may grow to N² for an image with N pixels.
The connected-component based methods first separate text and non-text pixels by running a fast low-level filter and then group the text pixels with similar properties (e.g. intensity, stroke width or color) to construct component candidates. The stroke width transform (SWT) and Maximally Stable Extremal Regions (MSER) are two representative low-level filters applied to scene text detection, with great success achieved recently.
MSER usually generates a large number of non-text components, leading to high ambiguity between text and non-text in MSER components. Robustly separating them has been a key issue for improving the performance of MSER based methods. Efforts have been devoted to handling this problem, but most current methods for MSER pruning focus on developing low-level features, such as heuristic characteristics or geometric properties, to filter out non-text components. These low-level features are not robust or discriminative enough to distinguish true texts from text-like outliers, which often have heuristic or geometric properties similar to those of true texts.
Summary
According to an embodiment of the present application, disclosed is a scene text detection system. The system may comprise a Maximally Stable Extremal Region (MSER) detector, a Convolutional Neural Network (CNN) classifier, a selector and a constructor. The Maximally Stable Extremal Region (MSER) detector may be configured to generate a set of text components from an image, wherein the generated text components are ordered into an MSER tree structure. The Convolutional Neural Network (CNN) classifier may be configured to assign a component confidence score to each text component in the set of text components. The selector may be configured to select text components with high component confidence scores from the set of text components. The constructor may be configured to construct a final text with the selected text components.
According to an embodiment of the present application, disclosed is a scene text detection method, and the method may comprise: generating a set of text components from an image, wherein the generated text components are ordered into a tree structure; assigning a component confidence score to each text component in the set of text components; selecting text components with high component confidence scores from the set of text components; and constructing a final text with the selected text components.
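The four steps of the method above can be sketched as a simple pipeline. Every function passed in below is a hypothetical stand-in for the corresponding module (detector, classifier), not the patent's implementation:

```python
def detect_scene_text(image, mser_detector, cnn_classifier, threshold=0.0):
    """Sketch of the disclosed four-step method with stand-in components."""
    # 1) generate a set of text components, ordered into a tree structure
    components = mser_detector(image)
    # 2) assign a component confidence score to each text component
    scored = [(c, cnn_classifier(c)) for c in components]
    # 3) select text components with high component confidence scores
    selected = [c for c, s in scored if s > threshold]
    # 4) construct the final text with the selected text components
    return selected
```

For example, with a detector returning three candidate components and a classifier scoring one of them negatively, only the positively-scored components survive selection.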
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a scene text detection system consistent with an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating scene text detection system when it is implemented in software, consistent with some disclosed embodiments.
Fig. 3 is a schematic diagram illustrating a convolutional neural network classifier consistent with some disclosed embodiments.
Fig. 4 is a schematic diagram illustrating a selector of the scene text detection system, consistent with some disclosed embodiments.
Fig. 5 is a schematic diagram illustrating a splitting device of the selector consistent with some disclosed embodiments.
Fig. 6 is a schematic flowchart illustrating a scene text detection method consistent with some disclosed embodiments.
Fig. 7 is a schematic flowchart illustrating a process of selecting text components consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts. Fig. 1 is a schematic diagram illustrating an exemplary scene text detection system 1000 consistent with some disclosed embodiments.
Referring to Fig. 1, where the system 1000 is implemented in hardware, it may comprise a Maximally Stable Extremal Region (MSER) detector 100, a Convolutional Neural Network (CNN) classifier 200, a selector 300 and a constructor 400.
It shall be appreciated that the system 1000 may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be embodied in a computer program product on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes. Fig. 2 is a schematic diagram illustrating a scene text detection system 1000 when it is implemented in software, consistent with some disclosed embodiments.
In the case that the system 1000 is implemented with software, the system 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 2, the system 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, and a bus to facilitate information exchange among the various devices of system 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphic processing unit ("GPU"), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods that will be explained in greater detail below.
Memory 112 can include, among other things, a random access memory ("RAM") and a read-only memory ("ROM"). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106. It is noted that although only one block is shown in Fig. 2, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
In the embodiment shown in Fig. 1, the MSER detector may be configured to generate a set of text components from an image, and the generated text components are ordered into an MSER tree structure. MSER defines an extremal region as a connected component of an image whose pixels have an intensity contrast against its boundary pixels. The intensity contrast is measured by increasing intensity values, and controls the region areas. A low contrast value would generate a large number of low-level regions, which are separated by small intensity differences between pixels. When the contrast value increases, a low-level region can be accumulated with current-level pixels or merged with other lower-level regions to construct a higher-level region. Therefore, an extremal region tree can be constructed when the contrast reaches its largest value. An extremal region is defined as an MSER if its variation is lower than that of both its parent and its child. Therefore, an MSER can be considered as a special extremal region whose size remains unchanged over a range of thresholds.
In an embodiment, each individual character of a text in an image can be detected as an extremal region or an MSER by the MSER detector. Two promising properties have made the MSER detector a great success in scene text detection. First, the MSER detector is fast and can be computed in time linear in the number of pixels in an image. Second, it is a powerful detector with a high capability to handle low-quality texts, such as low contrast, low resolution and blurring. With this capability, MSER is able to detect almost all scene texts in a natural image.
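The stability criterion above can be sketched numerically: given the areas of one chain of nested extremal regions across increasing intensity thresholds, a region is maximally stable when its relative area variation is lower than that of both its parent and its child. The exact variation formula differs slightly between MSER implementations; the one below is illustrative only:

```python
def variation(areas, i, delta=1):
    """Relative area change of region i against neighbours delta levels away."""
    lo = max(i - delta, 0)
    hi = min(i + delta, len(areas) - 1)
    return (areas[hi] - areas[lo]) / areas[i]

def msers(areas, delta=1):
    """Indices along a nested-region chain whose variation is lower than that
    of both the parent (next level) and the child (previous level)."""
    v = [variation(areas, i, delta) for i in range(len(areas))]
    return [i for i in range(1, len(areas) - 1)
            if v[i] < v[i - 1] and v[i] < v[i + 1]]
```

On a chain whose areas plateau, e.g. `[10, 50, 52, 53, 55, 200, 400]`, the index inside the plateau is selected, matching the intuition that an MSER's size remains unchanged over a range of thresholds.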
According to an embodiment, a CNN classifier 200 may be configured to assign a component confidence score to each text component in the set of text components. As shown in Fig. 3, the CNN classifier 200 may comprise at least one convolutional layer, at least one average pooling layer and a support vector machine (SVM) classifier. Each convolutional layer is followed by an average pooling layer, and has a plurality of filters. For example, as shown in Fig. 3, the CNN classifier comprises two convolutional layers, and the second layer is stacked upon the first layer. The numbers of filters in the two layers are 96 and 64, respectively.
In an embodiment, the CNN classifier is trained with a predetermined training set to assign the component confidence score. When the CNN classifier is trained, filters of a first convolutional layer of the two convolutional layers are configured to learn from a set of patches extracted from the predetermined training set by using unsupervised K-means to generate a response, and filters of a second convolutional layer of the two convolutional layers are configured to learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confidence scores of the text components. For example, during a training process shown in Fig. 3, the extracted patches have a fixed size of 32×32. Filters of the first convolutional layer are configured to learn from the set of patches by using unsupervised K-means to generate a response. For example, as shown, the first layer is trained in an unsupervised way by using a variant of K-means to learn a set of filters
D ∈ R^(k×n1) from a set of 8×8 patches, where k is the dimension of a patch for convolution (here k = 64 for 8×8 patches) and n1 = 96 is the number of filters in the first layer. The response r of the first layer is computed as
r = max {0, |D^Tx| − θ}          (1)
where x ∈ R^k is an input vector for an 8×8 patch, and θ = 0.5. The resulting first-layer response maps are of size 25×25×96. Then average pooling with a window size of 5×5 is applied to the response maps to obtain reduced maps of size 5×5×96.
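The first-layer computation can be sketched in NumPy: 96 filters of patch dimension k = 64 (8×8) are convolved over a 32×32 component patch via Eq. (1), written here as the soft threshold max{0, |D^Tx| − θ} (the form in which the outer max is non-vacuous), giving 25×25×96 response maps that 5×5 average pooling reduces to 5×5×96. The random filter bank below is only a stand-in for the K-means-learned dictionary D:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n1, theta = 64, 96, 0.5              # patch dim (8*8), filter count, threshold
D = rng.standard_normal((k, n1))        # stand-in for the K-means-learned filters
patch = rng.standard_normal((32, 32))   # a resized 32x32 component patch

# Collect every 8x8 sub-window of the patch as a k-dimensional input vector x
xs = np.stack([patch[i:i + 8, j:j + 8].ravel()
               for i in range(25) for j in range(25)])        # shape (625, 64)

# Eq. (1): r = max{0, |D^T x| - theta}, one response per filter per location
r = np.maximum(0, np.abs(xs @ D) - theta).reshape(25, 25, n1)  # 25x25x96 maps

# 5x5 average pooling (stride 5) reduces the maps to 5x5x96
pooled = r.reshape(5, 5, 5, 5, n1).mean(axis=(1, 3))
```

The shapes reproduce the sizes stated in the text: 32 − 8 + 1 = 25 response positions per side, and 25/5 = 5 pooled positions per side.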
In the embodiment, filters of the second convolutional layer are configured to learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confidence scores of the text components. The final output of the two layers is a 64-dimensional feature vector, which is input to the SVM classifier to obtain the final confidence score of the text component. Parameters in the second layer are fully connected and are trained by back-propagating the SVM classification error.
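The second stage described above can be sketched likewise: the 5×5×96 pooled maps are mapped by a fully connected second layer to a 64-dimensional feature vector, which a linear SVM converts to the component confidence score. All weights below are random stand-ins for parameters that would be trained by back-propagating the SVM classification error:

```python
import numpy as np

rng = np.random.default_rng(1)
pooled = rng.standard_normal((5, 5, 96))        # pooled first-layer maps (5x5x96)

# Fully connected second layer: 5*5*96 = 2400 inputs -> 64-dimensional feature
W2 = rng.standard_normal((5 * 5 * 96, 64)) * 0.01
feature = pooled.ravel() @ W2                   # the 64-dimensional feature vector

# Linear SVM decision value serves as the component confidence score
w_svm = rng.standard_normal(64)
b_svm = 0.0
score = float(feature @ w_svm + b_svm)
```

A positive score would mark the component as text-like, a negative one as non-text, consistent with how the error-connected analysis below interprets the sign of the confidence values.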
Fig. 4 illustrates the selector 300 of the scene text detection system 1000 consistent with some disclosed embodiments. As shown, the selector 300 may comprise a defining device 310 and a splitting device 320. In an embodiment, the defining device 310 may be configured to define erroneously-connected text components from the selected text components based on the assigned component confident score and the MSER tree structure. The splitting device 320 may be configured to split the erroneously-connected text components to text components with a high component confident score.
In an embodiment shown in Fig. 5, the splitting device may further comprise a resizing unit 321, a scanner 322 and an identifying unit 323. The resizing unit 321 may be configured to resize the defined erroneously-connected text components to a predetermined size. The scanner 322 may be configured to scan the resized text components with a sliding window to obtain a one-dimensional array of component confident scores. The identifying unit 323 may be configured to identify peak locations of the erroneously-connected text components based on the one-dimensional array to split the erroneously-connected text components to text components with a high component confident score.
In the embodiment, an error-connected component has three remarkable characteristics. First, it often has a high aspect ratio, with a bounding box much wider than it is tall. Second, differing from other non-text components, such as long horizontal lines or bars, which are generally scored with negative confident values by the CNN classifier 200, the error-connected component actually includes some text information, though not very strong, because the CNN classifier is trained on single-character components. Third, the components at high levels of the MSER trees often include multiple text characters, for example the components at the roots of the trees. Most of these components are already correctly separated by their children components, which often have higher confident scores than their parent components.
Therefore, the conditions for defining erroneously-connected text components comprise: 1) the aspect ratio (width/height) of the text component is larger than 2; 2) the text component has a positive confident score; and 3) the text component is at an end node of the MSER tree structure or has a larger confident score than all of its children nodes in the MSER tree structure. An exemplary algorithm for searching and splitting error-connected components is given as follows.
(Figure: exemplary algorithm for searching and splitting error-connected components.)
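For illustration only, the three conditions above can be sketched as a predicate over MSER tree nodes. This is a minimal sketch; the `Component` class and its attribute names are illustrative stand-ins, not the data structures of the claimed system.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """Minimal stand-in for an MSER tree node (attribute names illustrative)."""
    width: int
    height: int
    score: float            # CNN component confident score
    children: list = field(default_factory=list)

def is_error_connected(c: Component) -> bool:
    """The three conditions: aspect ratio > 2, positive confident score, and
    an end node or a score larger than that of every child node."""
    return (c.width / c.height > 2
            and c.score > 0
            and (not c.children or all(c.score > ch.score for ch in c.children)))
```

Only components satisfying all three conditions are passed on to the sliding-window splitting step.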
According to an embodiment, the constructor 400 may further comprise a pairing unit and a merging unit (not shown). The pairing unit may be configured to pair two text components of the selected text components which have similar geometric and heuristic properties. The merging unit may be configured to sequentially merge pairs having the same component and similar orientation to construct the final text.
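For illustration only, the pairing and merging steps can be sketched as below. The geometric thresholds and dictionary keys are illustrative, not those of the claimed system, and the orientation check is omitted for brevity.

```python
def can_pair(a, b, height_tol=0.5, dist_factor=2.0):
    """Heuristic pairing test on two component bounding boxes
    (dicts with keys x, y, w, h; thresholds are illustrative)."""
    if abs(a['h'] - b['h']) > height_tol * max(a['h'], b['h']):
        return False                      # too different in height
    return abs(a['x'] - b['x']) < dist_factor * max(a['w'], b['w'])

def merge_pairs(pairs):
    """Sequentially merge pairs that share a component into text lines."""
    lines = [set(p) for p in pairs]
    merged = True
    while merged:
        merged = False
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                if lines[i] & lines[j]:   # shared component: same text line
                    lines[i] |= lines.pop(j)
                    merged = True
                    break
            if merged:
                break
    return lines
```

Pairs such as (A, B) and (B, C) share component B and therefore collapse into one text line (A, B, C).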
Fig. 6 is a schematic flowchart illustrating a scene text detection method 2000 consistent with some disclosed embodiments. Hereafter, the method 2000 may be described in detail with respect to Fig. 6.
At step S210, a set of text components is generated from an image. In an embodiment, the set of text components is generated from an image by using a Maximally Stable Extremal Region (MSER) detector. The generated text components are ordered into a MSER tree structure.
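For illustration only, one slice of the work an MSER detector performs can be sketched with a single-threshold connected-component extraction. This is a simplified sketch: a real MSER detector sweeps all intensity thresholds and keeps only regions whose area stays stable across many consecutive thresholds, which yields the nested tree structure described above.

```python
def dark_components(img, t):
    """Connected dark regions (4-connectivity) at a single threshold t,
    i.e. one slice of the intensity sweep an MSER detector performs.
    `img` is a 2-D list of grayscale values."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if img[y][x] <= t and not seen[y][x]:
                seen[y][x] = True
                stack, comp = [(y, x)], []
                while stack:                      # iterative flood fill
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx] and img[ny][nx] <= t):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                comps.append(comp)
    return comps
```

Tracking how each such region grows and merges as the threshold rises is what produces the parent/child relationships of the MSER tree.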
At step S220, a component confident score is assigned to each text component in the set of text components. For example, the component confident score is assigned to each text component by a trained Convolutional Neural Network (CNN) classifier. In an embodiment, the Convolutional Neural Network classifier is trained with a predetermined training set to assign the component confident score.
According to an embodiment, the convolutional neural network classifier comprises at least one convolutional layer, at least one average pooling and a support vector machine (SVM) classifier, and each of the convolutional layers is followed by an average pooling and has a plurality of filters. For example, the convolutional neural network classifier may comprise two convolutional layers. During the training process, a set of patches is extracted from the predetermined training set. Then, filters of a first convolutional layer of the two convolutional layers learn from the set of patches by using an unsupervised K-means to generate a response, and filters of a second convolutional layer of the two convolutional layers learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confident scores of the text components.
At step S230, text components with a high component confident score of the assigned component confident scores are selected from the set of text components. The following is a possible way of selecting text components with a high component confident score. For example, as shown in Fig. 7, the erroneously-connected text components are defined from the selected text components based on the assigned component confident score and the MSER tree structure. The conditions for defining erroneously-connected text components comprise: 1) the aspect ratio (width/height) of the text component is larger than 2; 2) the text component has a positive confident score; and 3) the text component is at an end node of the MSER tree structure or has a larger confident score than all of its children nodes in the MSER tree structure.
If a component belongs to the erroneously-connected text components, it is resized to a predetermined size. The resized text components are scanned, for example by a sliding window, to obtain a one-dimensional array of component confident scores. For example, a Non-Maximal Suppression (NMS) method is applied to the one-dimensional array of component confident scores to estimate multiple character locations. Peak locations of the erroneously-connected text components are identified based on the one-dimensional array, so that the erroneously-connected text components are split to text components with a high component confident score based on the peak locations.
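For illustration only, the peak identification by non-maximal suppression on the one-dimensional score array can be sketched as follows. The suppression radius `min_gap` is an illustrative parameter, not a value from the claimed method.

```python
def peak_locations(scores, min_gap=3):
    """Non-maximal suppression on a 1-D array of sliding-window confident
    scores: keep indices with a positive score that are maximal within
    +/- min_gap positions. An error-connected component is then split
    into single characters at these peak locations."""
    peaks = []
    for i, s in enumerate(scores):
        if s <= 0:
            continue                                   # only positive scores
        lo, hi = max(0, i - min_gap), min(len(scores), i + min_gap + 1)
        if s == max(scores[lo:hi]):                    # local maximum
            peaks.append(i)
    return peaks
```

Each retained peak marks the most likely center of one character inside the erroneously-connected component.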
At step S240, a final text is constructed with the selected text components. When the final text is constructed, two text components of the selected text components which have similar geometric and heuristic properties are paired, and pairs having the same component and similar orientation are merged sequentially to construct the final text.
With the scene text detection system and method of the present application, the high capacity of the deep learning model is effectively leveraged to tackle two main problems of current MSER-based methods for text detection. In addition, the system of the present application achieves strong robustness and a highly discriminative capability to distinguish texts from a large number of non-text components by incorporating the MSER detector and the trained CNN classifier. A sliding-window model is integrated with the CNN classifier to further improve the capability of the MSER detector for correctly localizing challenging text components. Our method has achieved large improvements over current methods on standard benchmark datasets.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be considered as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent technique, they may also fall into the scope of the present invention.

Claims (19)

  1. A scene text detection system, comprising:
    a Maximally Stable Extremal Region (MSER) detector configured to generate a set of text components from an image, wherein the generated text components are ordered into a MSER tree structure;
    a Convolutional Neural Network (CNN) classifier configured to assign a component confident score to each text component in the set of text components;
    a selector configured to select text components with a high component confident score of the assigned component confident scores from the set of text components; and
    a constructor configured to construct a final text with the selected text components.
  2. A scene text detection system according to claim 1, wherein the CNN classifier is trained with a predetermined training set to assign the component confident score.
  3. A scene text detection system according to claim 1, wherein the CNN classifier comprises at least one convolutional layer, at least one average pooling and a support vector machine (SVM) classifier, and wherein each of the convolutional layers is followed by an average pooling and has a plurality of filters.
  4. A scene text detection system according to claim 3, wherein the at least one convolutional layer comprises two convolutional layers.
  5. A scene text detection system according to claim 4, wherein filters of a first convolutional layer of the two convolutional layers are configured to learn from the set of patches extracted from the predetermined training set by using an unsupervised K-means to generate a response, and filters of a second convolutional layer of the two convolutional layers are configured to learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confident scores of the text components.
  6. A scene text detection system according to claim 1, wherein the selector further comprises:
    a defining device configured to define erroneously-connected text components from the selected text components based on the assigned component confident score and the MSER tree structure; and
    a splitting device configured to split the erroneously-connected text components to text components with a high component confident score.
  7. A scene text detection system according to claim 6, wherein the splitting device further comprises:
    a resizing unit configured to resize the defined erroneously-connected text components to a predetermined size;
    a scanner configured to scan the resized text components with a sliding window to obtain a one-dimensional array of component confident scores; and
    an identifying unit configured to identify peak locations of the erroneously-connected text components based on the one-dimensional array to split the erroneously-connected text components to text components with a high component confident score.
  8. A scene text detection system according to claim 6, wherein the conditions for defining erroneously-connected text components comprise:
    an aspect ratio of width/height of the text component is larger than 2;
    the text component has a positive confident score; and
    the text component is at an end node of the MSER tree structure or has a larger confident score than that of all its children nodes in the MSER tree structure.
  9. A scene text detection system according to claim 1, wherein the constructor further comprises:
    a pairing unit configured to pair two text components, which have similar geometric and heuristic properties, of the selected text components; and
    a merging unit configured to sequentially merge pairs having the same component and similar orientation to construct the final text.
  10. A scene text detection method, comprising:
    generating a set of text components from an image, wherein the generated text components are ordered into a tree structure;
    assigning a component confident score to each text component in the set of text components;
    selecting text components with a high component confident score of the assigned component confident scores from the set of text components; and
    constructing a final text with the selected text components.
  11. A scene text detection method according to claim 10, wherein the generating a set of text components from an image further comprises:
    generating the set of text components from the image by using a Maximally Stable Extremal Region (MSER) detector.
  12. A scene text detection method according to claim 10, wherein the assigning a component confident score to each text component further comprises:
    assigning a component confident score to each text component in the set of text components by a trained Convolutional Neural Network (CNN) classifier.
  13. A scene text detection method according to claim 12, further comprising:
    training the Convolutional Neural Network classifier with a predetermined training set to assign the component confident score.
  14. A scene text detection method according to claim 12, wherein the convolutional neural network classifier comprises at least one convolutional layer, at least one average pooling and a support vector machine (SVM) classifier, and wherein each of the convolutional layers is followed by an average pooling and has a plurality of filters.
  15. A scene text detection method according to claim 14, wherein the at least one convolutional layer comprises two convolutional layers, and the training the Convolutional Neural Network classifier with a predetermined training set to assign the component confident score further comprises:
    extracting a set of patches from the predetermined training set;
    learning, by filters of a first convolutional layer of the two convolutional layers, from the set of patches by using an unsupervised K-means to generate a response; and
    learning, by filters of a second convolutional layer of the two convolutional layers, from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confident scores of the text components.
  16. A scene text detection method according to claim 10, wherein the selecting text components with a high component confident score of the assigned component confident scores from the set of text components to construct a final text further comprises:
    defining erroneously-connected text components from the selected text components based on the assigned component confident score and the MSER tree structure; and
    splitting the erroneously-connected text components to text components with a high component confident score.
  17. A scene text detection method according to claim 16, wherein the splitting the erroneously-connected text components to text components with a high component confident score further comprises:
    resizing the defined erroneously-connected text components to a predetermined size;
    scanning the resized text components with a sliding window to obtain a one-dimensional array of component confident scores; and
    identifying peak locations of the erroneously-connected text components based on the one-dimensional array so that the erroneously-connected text components are split to text components with a high component confident score based on the peak locations.
  18. A scene text detection method according to claim 16, wherein conditions for defining erroneously-connected text components comprise:
    an aspect ratio of width/height of the text component is larger than 2;
    the text component has a positive confident score; and
    the text component is at an end node of the MSER tree structure or has a larger confident score than that of all its children nodes in the MSER tree structure.
  19. A scene text detection method according to claim 10, wherein the constructing a final text with the selected text components further comprises:
    pairing two text components, which have similar geometric and heuristic properties, of the selected text components; and
    merging pairs having the same component and similar orientation sequentially to construct the final text.
PCT/CN2014/000830 2014-09-05 2014-09-05 Scene text detection system and method WO2016033710A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/000830 WO2016033710A1 (en) 2014-09-05 2014-09-05 Scene text detection system and method
CN201480081759.5A CN106796647B (en) 2014-09-05 2014-09-05 Scene text detecting system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/000830 WO2016033710A1 (en) 2014-09-05 2014-09-05 Scene text detection system and method

Publications (1)

Publication Number Publication Date
WO2016033710A1 true WO2016033710A1 (en) 2016-03-10

Family

ID=55438963

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/000830 WO2016033710A1 (en) 2014-09-05 2014-09-05 Scene text detection system and method

Country Status (2)

Country Link
CN (1) CN106796647B (en)
WO (1) WO2016033710A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032110B2 (en) 2016-12-13 2018-07-24 Google Llc Performing average pooling in hardware
US10037490B2 (en) 2016-12-13 2018-07-31 Google Llc Performing average pooling in hardware
CN109086663A (en) * 2018-06-27 2018-12-25 大连理工大学 The natural scene Method for text detection of dimension self-adaption based on convolutional neural networks
CN110321893A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text identification network focusing enhancing
CN110348280A (en) * 2019-03-21 2019-10-18 贵州工业职业技术学院 Water book character recognition method based on CNN artificial neural
CN110516554A (en) * 2019-07-31 2019-11-29 杭州电子科技大学 A kind of more scene multi-font Chinese text detection recognition methods
WO2020218111A1 (en) * 2019-04-24 2020-10-29 富士フイルム株式会社 Learning method and device, program, learned model, and text generation device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704509B (en) * 2017-08-31 2021-11-02 北京联合大学 Reordering method combining stable region and deep learning
CN110135446B (en) * 2018-02-09 2021-01-22 北京世纪好未来教育科技有限公司 Text detection method and computer storage medium
FR3079056A1 (en) * 2018-03-19 2019-09-20 Stmicroelectronics (Rousset) Sas METHOD FOR CONTROLLING SCENES DETECTION BY AN APPARATUS, FOR EXAMPLE A WIRELESS COMMUNICATION APPARATUS, AND APPARATUS THEREFOR
CN109816022A (en) * 2019-01-29 2019-05-28 重庆市地理信息中心 A kind of image-recognizing method based on three decisions and CNN
CN112183523A (en) * 2020-12-02 2021-01-05 北京云测信息技术有限公司 Text detection method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014014640A1 (en) * 2012-07-19 2014-01-23 Qualcomm Incorporated Method of handling complex variants of words through prefix-tree based decoding for devanagiri ocr
US20140201126A1 (en) * 2012-09-15 2014-07-17 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102201053B (en) * 2010-12-10 2013-07-24 上海合合信息科技发展有限公司 Method for cutting edge of text image
CN103136523B (en) * 2012-11-29 2016-06-29 浙江大学 Any direction text line detection method in a kind of natural image


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIAO, WEIMIN.: "End-to-End English Text Recognition in Natural Scene Image", CHINESE MASTER'S THESES FULL-TEXT DATABASE INFORMATION SCIENCE AND TECHNOLOGY, 15 August 2014 (2014-08-15), pages 37 - 64 *
OPITZ, MICHAEL ET AL.: "End-to-End Text Recognition Using Local Ternary Patterns, MSER and Deep Convolutional Nets", DOCUMENT ANALYSIS SYSTEMS (DAS), 2014 11TH IAPR INTERNATIONAL WORKSHOP, 10 April 2014 (2014-04-10), pages 186, XP032606124, DOI: doi:10.1109/DAS.2014.29 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032110B2 (en) 2016-12-13 2018-07-24 Google Llc Performing average pooling in hardware
US10037490B2 (en) 2016-12-13 2018-07-31 Google Llc Performing average pooling in hardware
US10679127B2 (en) 2016-12-13 2020-06-09 Google Llc Performing average pooling in hardware
US11232351B2 (en) 2016-12-13 2022-01-25 Google Llc Performing average pooling in hardware
CN109086663A (en) * 2018-06-27 2018-12-25 大连理工大学 The natural scene Method for text detection of dimension self-adaption based on convolutional neural networks
CN109086663B (en) * 2018-06-27 2021-11-05 大连理工大学 Natural scene text detection method based on scale self-adaption of convolutional neural network
CN110348280A (en) * 2019-03-21 2019-10-18 贵州工业职业技术学院 Water book character recognition method based on CNN artificial neural
WO2020218111A1 (en) * 2019-04-24 2020-10-29 富士フイルム株式会社 Learning method and device, program, learned model, and text generation device
JPWO2020218111A1 (en) * 2019-04-24 2020-10-29
JP7152600B2 (en) 2019-04-24 2022-10-12 富士フイルム株式会社 Learning method and device, program, trained model, and text generation device
CN110321893A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text identification network focusing enhancing
CN110516554A (en) * 2019-07-31 2019-11-29 杭州电子科技大学 A kind of more scene multi-font Chinese text detection recognition methods

Also Published As

Publication number Publication date
CN106796647B (en) 2018-09-14
CN106796647A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
WO2016033710A1 (en) Scene text detection system and method
US10395143B2 (en) Systems and methods for identifying a target object in an image
US10896349B2 (en) Text detection method and apparatus, and storage medium
US9367766B2 (en) Text line detection in images
Huang et al. Robust scene text detection with convolution neural network induced mser trees
Zamberletti et al. Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions
US8965127B2 (en) Method for segmenting text words in document images
US10430649B2 (en) Text region detection in digital images using image tag filtering
Yi et al. Feature representations for scene text character recognition: A comparative study
JP6188400B2 (en) Image processing apparatus, program, and image processing method
EP3203417B1 (en) Method for detecting texts included in an image and apparatus using the same
CN112036395A (en) Text classification identification method and device based on target detection
JP6897749B2 (en) Learning methods, learning systems, and learning programs
US9224207B2 (en) Segmentation co-clustering
Gomez et al. A fast hierarchical method for multi-script and arbitrary oriented scene text extraction
Wu et al. Scene text detection using adaptive color reduction, adjacent character model and hybrid verification strategy
Kalyoncu et al. GTCLC: leaf classification method using multiple descriptors
JP2015185033A (en) Character recognition device and identification function generation method
Kim et al. A rule-based method for table detection in website images
Ramirez et al. Automatic recognition of square notation symbols in western plainchant manuscripts
US9858293B2 (en) Image processing apparatus and image processing method
Yasmeen et al. Text detection and classification from low quality natural images
CN117612179A (en) Method and device for recognizing characters in image, electronic equipment and storage medium
CN111553361A (en) Pathological section label identification method
CN108475339B (en) Method and system for classifying objects in an image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14901407

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14901407

Country of ref document: EP

Kind code of ref document: A1