WO2016033710A1 - Scene text detection system and method - Google Patents

Scene text detection system and method

Info

Publication number
WO2016033710A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
component
components
text components
confident
Prior art date
Application number
PCT/CN2014/000830
Other languages
French (fr)
Inventor
Xiaoou Tang
Weilin Huang
Yu Qiao
Original Assignee
Xiaoou Tang
Priority date
Filing date
Publication date
Application filed by Xiaoou Tang filed Critical Xiaoou Tang
Priority to PCT/CN2014/000830 priority Critical patent/WO2016033710A1/en
Priority to CN201480081759.5A priority patent/CN106796647B/en
Publication of WO2016033710A1 publication Critical patent/WO2016033710A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images

Definitions

  • the conditions for defining erroneously-connected text components comprise: 1) the aspect ratio (width/height) of the text component is larger than 2; 2) the text component has a positive confidence score; and 3) the text component is at an end node of the MSER tree structure or has a larger confidence score than that of all its children nodes in the MSER tree structure.
  • the constructor 400 may further comprise a pairing unit and a merging unit (not shown).
  • the pairing unit may be configured to pair two text components of the selected text components that have similar geometric and heuristic properties.
  • the merging unit may be configured to merge pairs sharing a component and having similar orientations sequentially to construct the final text.
  • Fig. 6 is a schematic flowchart illustrating a scene text detection method 2000 consistent with some disclosed embodiments. Hereafter, the method 2000 is described in detail with reference to Fig. 6.
  • a set of text components is generated from an image.
  • the set of text components is generated from the image by using a Maximally Stable Extremal Region (MSER) detector.
  • the generated text components are ordered into an MSER tree structure.
  • a component confidence score is assigned to each text component in the set of text components.
  • the component confidence score is assigned to each text component by a trained Convolutional Neural Network (CNN) classifier.
  • the Convolutional Neural Network classifier is trained with a predetermined training set to assign the component confidence score.
  • the convolutional neural network classifier comprises at least one convolutional layer, at least one average pooling layer and a support vector machine (SVM) classifier, wherein each convolutional layer is followed by an average pooling layer and has a plurality of filters.
  • the convolutional neural network classifier may comprise two convolutional layers.
  • filters of a first convolutional layer of the two convolutional layers learn from a set of patches extracted from the predetermined training set by using unsupervised K-means to generate a response;
  • filters of a second convolutional layer of the two convolutional layers learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confidence scores of the text components.
  • text components with high component confidence scores are selected from the set of text components.
  • the erroneously-connected text components are defined from the selected text components based on the assigned component confidence scores and the MSER tree structure.
  • the conditions for defining erroneously-connected text components comprise: 1) the aspect ratio (width/height) of the text component is larger than 2; 2) the text component has a positive confidence score; and 3) the text component is at an end node of the MSER tree structure or has a larger confidence score than that of all its children nodes in the MSER tree structure.
  • if the component belongs to the erroneously-connected text components, it is resized to a predetermined size.
  • the resized text components are scanned, for example by a sliding window, to obtain a one-dimensional array of component confidence scores.
  • peak locations of the erroneously-connected text components are identified based on the one-dimensional array, for example by Non-Maximal Suppression (NMS), so that the erroneously-connected text components are split into text components with high component confidence scores based on the peak locations.
  • a final text is constructed with the selected text components.
  • two text components of the selected text components that have similar geometric and heuristic properties are paired, and pairs sharing a component and having similar orientations are merged sequentially to construct the final text.
  • the system of the present application achieves strong robustness and a highly discriminative capability to distinguish texts from a large number of non-text components by incorporating the MSER detector and the trained CNN classifier.
  • a sliding-window model is integrated with the CNN classifier to further improve the capability of the MSER detector to correctly localize challenging text components. The method achieves large improvements over current methods on standard benchmark datasets.
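The three conditions for flagging an erroneously-connected component can be sketched as a predicate over nodes of the MSER tree. The `MSERNode` structure and its field names below are hypothetical, chosen only to make the test concrete:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MSERNode:
    """Hypothetical MSER tree node: bounding-box size, confidence score, children."""
    width: int
    height: int
    score: float                      # component confidence score from the classifier
    children: List["MSERNode"] = field(default_factory=list)

def is_erroneously_connected(node: MSERNode) -> bool:
    # 1) aspect ratio (width/height) larger than 2
    if node.width / node.height <= 2:
        return False
    # 2) positive confidence score
    if node.score <= 0:
        return False
    # 3) end node of the tree, or a higher score than all of its children
    return not node.children or all(node.score > c.score for c in node.children)
```

For example, a wide, positively-scored leaf node satisfies the predicate, while a component whose child scores higher than it does not.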

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a scene text detection system. The system may comprise a Maximally Stable Extremal Region (MSER) detector, a trained Convolutional Neural Network (CNN) classifier, a selector and a constructor. The Maximally Stable Extremal Region (MSER) detector may be configured to generate a set of text components from an image, wherein the generated text components are ordered into an MSER tree structure. The trained Convolutional Neural Network (CNN) classifier may be configured to assign a component confidence score to each text component in the set of text components. The selector may be configured to select text components with high component confidence scores from the set of text components. The constructor may be configured to construct a final text with the selected text components. A scene text detection method is also disclosed.

Description

SCENE TEXT DETECTION SYSTEM AND METHOD

Technical Field
The present application generally relates to a field of image processing, more particularly, to a scene text detection system and a scene text detection method.
Background
With the rapid evolution and popularization of high-performance mobile and wearable devices in recent years, scene text detection and localization have gained increasing attention for their large number of potential applications. Text in an image usually contains important semantic information, so detection and identification of the text are very important for a full understanding of the image.
The challenge for scene text detection comes from the extreme diversity of text patterns, highly complicated background information and serious real-world effects. For example, texts appearing in a natural image can be very small in size or in low contrast against the background color, and even regular texts can be distorted by strong lighting, occlusion or blurring. Furthermore, a large amount of noise and text-like outliers, such as windows, leaves and bricks, can be included in the image background, and often cause many false alarms in the detection process.
There have recently been mainly two groups of methods for scene text detection: sliding-window based and connected-component based methods. The sliding-window based methods detect text information by sliding a sub-window at multiple scales through all locations of an image. Text and non-text information is then distinguished by a trained classifier, which often uses manually designed low-level features extracted from the window, such as SIFT and Histograms of Oriented Gradients. The main challenges lie in the design of local features to handle the large variance of texts, and in the high computational demand of scanning a large number of windows, which may grow to N² for an image with N pixels.
The connected-component based methods first separate text and non-text pixels by running a fast low-level filter and then group the text pixels with similar properties (e.g. intensity, stroke width or color) to construct component candidates. The stroke width transform (SWT) and Maximally Stable Extremal Regions (MSER) are two representative low-level filters applied to scene text detection, with great success achieved recently.
MSER usually generates a large number of non-text components, leading to high ambiguity between text and non-text in MSER components. Robustly separating them has been a key issue for improving the performance of MSER based methods. Efforts have been devoted to handling this problem, but most current methods for MSER pruning focus on developing low-level features, such as heuristic characteristics or geometric properties, to filter out non-text components. These low-level features are not robust or discriminative enough to distinguish true texts from text-like outliers, which often have heuristic or geometric properties similar to those of true texts.
Summary
According to an embodiment of the present application, disclosed is a scene text detection system. The system may comprise a Maximally Stable Extremal Region (MSER) detector, a Convolutional Neural Network (CNN) classifier, a selector and a constructor. The Maximally Stable Extremal Region (MSER) detector may be configured to generate a set of text components from an image, wherein the generated text components are ordered into an MSER tree structure. The Convolutional Neural Network (CNN) classifier may be configured to assign a component confidence score to each text component in the set of text components. The selector may be configured to select text components with high component confidence scores from the set of text components. The constructor may be configured to construct a final text with the selected text components.
According to an embodiment of the present application, disclosed is a scene text detection method, and the method may comprise: generating a set of text components from an image, wherein the generated text components are ordered into a tree structure; assigning a component confidence score to each text component in the set of text components; selecting text components with high component confidence scores from the set of text components; and constructing a final text with the selected text components.
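The four steps of the method above can be sketched as a simple pipeline. Every function passed in below is a hypothetical stand-in for the corresponding module (detector, classifier), not the patent's implementation:

```python
def detect_scene_text(image, mser_detector, cnn_classifier, threshold=0.0):
    """Sketch of the disclosed four-step method with stand-in components."""
    # 1) generate a set of text components, ordered into a tree structure
    components = mser_detector(image)
    # 2) assign a component confidence score to each text component
    scored = [(c, cnn_classifier(c)) for c in components]
    # 3) select text components with high component confidence scores
    selected = [c for c, s in scored if s > threshold]
    # 4) construct the final text with the selected text components
    return selected
```

For example, with a detector returning three candidate components and a classifier scoring one of them negatively, only the positively-scored components survive selection.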
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a scene text detection system consistent with an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating scene text detection system when it is implemented in software, consistent with some disclosed embodiments.
Fig. 3 is a schematic diagram illustrating a convolutional neural network classifier consistent with some disclosed embodiments.
Fig. 4 is a schematic diagram illustrating a selector of the scene text detection system, consistent with some disclosed embodiments.
Fig. 5 is a schematic diagram illustrating a splitting device of the selector consistent with some disclosed embodiments.
Fig. 6 is a schematic flowchart illustrating a scene text detection method consistent with some disclosed embodiments.
Fig. 7 is a schematic flowchart illustrating a process of selecting text components consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts. Fig. 1 is a schematic diagram illustrating an exemplary scene text detection system 1000 consistent with some disclosed embodiments.
Referring to Fig. 1, where the system 1000 is implemented in hardware, it may comprise a Maximally Stable Extremal Region (MSER) detector 100, a Convolutional Neural Network (CNN) classifier 200, a selector 300 and a constructor 400.
It shall be appreciated that the system 1000 may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be embodied in a computer program product on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes. Fig. 2 is a schematic diagram illustrating a scene text detection system 1000 when it is implemented in software, consistent with some disclosed embodiments.
In the case that the system 1000 is implemented with software, the system 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 2, the system 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, and a bus to facilitate information exchange among the various devices of system 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphic processing unit ("GPU"), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods that will be explained in greater detail below.
Memory 112 can include, among other things, a random access memory ("RAM") and a read-only memory ("ROM"). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106. It is noted that although only one block is shown in Fig. 2, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
In the embodiment shown in Fig. 1, the MSER detector may be configured to generate a set of text components from an image, and the generated text components are ordered into an MSER tree structure. MSER defines an extremal region as a connected component of an image whose pixels have an intensity contrast against its boundary pixels. The intensity contrast is measured by increasing intensity values, and controls the region areas. A low contrast value would generate a large number of low-level regions, which are separated by small intensity differences between pixels. When the contrast value increases, a low-level region can be accumulated with current-level pixels or merged with other lower-level regions to construct a higher-level region. Therefore, an extremal region tree can be constructed when the contrast reaches its largest value. An extremal region is defined as an MSER if its variation is lower than that of both its parent and its child. Therefore, an MSER can be considered as a special extremal region whose size remains unchanged over a range of thresholds.
In an embodiment, each individual character of a text in an image can be detected as an extremal region or an MSER by the MSER detector. Two promising properties have made the MSER detector a great success in scene text detection. First, the MSER detector is fast and can be computed in time linear in the number of pixels in an image. Second, it is a powerful detector with a high capability to handle low-quality texts, such as low contrast, low resolution and blurring. With this capability, MSER is able to detect almost all scene texts in a natural image.
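The stability criterion above can be sketched numerically: given the areas of one chain of nested extremal regions across increasing intensity thresholds, a region is maximally stable when its relative area variation is lower than that of both its parent and its child. The exact variation formula differs slightly between MSER implementations; the one below is illustrative only:

```python
def variation(areas, i, delta=1):
    """Relative area change of region i against neighbours delta levels away."""
    lo = max(i - delta, 0)
    hi = min(i + delta, len(areas) - 1)
    return (areas[hi] - areas[lo]) / areas[i]

def msers(areas, delta=1):
    """Indices along a nested-region chain whose variation is lower than that
    of both the parent (next level) and the child (previous level)."""
    v = [variation(areas, i, delta) for i in range(len(areas))]
    return [i for i in range(1, len(areas) - 1)
            if v[i] < v[i - 1] and v[i] < v[i + 1]]
```

On a chain whose areas plateau, e.g. `[10, 50, 52, 53, 55, 200, 400]`, the index inside the plateau is selected, matching the intuition that an MSER's size remains unchanged over a range of thresholds.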
According to an embodiment, a CNN classifier 200 may be configured to assign a component confidence score to each text component in the set of text components. As shown in Fig. 3, the CNN classifier 200 may comprise at least one convolutional layer, at least one average pooling layer and a support vector machine (SVM) classifier. Each convolutional layer is followed by an average pooling layer, and has a plurality of filters. For example, as shown in Fig. 3, the CNN classifier comprises two convolutional layers, and the second layer is stacked upon the first layer. The numbers of filters in the two layers are 96 and 64, respectively.
In an embodiment, the CNN classifier is trained with a predetermined training set to assign the component confidence score. When the CNN classifier is trained, filters of a first convolutional layer of the two convolutional layers are configured to learn from a set of patches extracted from the predetermined training set by using unsupervised K-means to generate a response, and filters of a second convolutional layer of the two convolutional layers are configured to learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confidence scores of the text components. For example, during a training process shown in Fig. 3, the extracted patches have a fixed size of 32×32. Filters of the first convolutional layer are configured to learn from the set of patches by using unsupervised K-means to generate a response. For example, as shown, the first layer is trained in an unsupervised way by using a variant of K-means to learn a set of filters
D ∈ R^(k×n1) from a set of 8×8 patches, where k is the dimension of a patch for convolution (here k = 64 for 8×8 patches) and n1 = 96 is the number of filters in the first layer. The response r of the first layer is computed as
r = max {0, |D^Tx| − θ}          (1)
where x ∈ R^k is an input vector for an 8×8 patch, and θ = 0.5. The resulting first-layer response maps are of size 25×25×96. Then average pooling with a window size of 5×5 is applied to the response maps to obtain reduced maps of size 5×5×96.
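The first-layer computation can be sketched in NumPy: 96 filters of patch dimension k = 64 (8×8) are convolved over a 32×32 component patch via Eq. (1), written here as the soft threshold max{0, |D^Tx| − θ} (the form in which the outer max is non-vacuous), giving 25×25×96 response maps that 5×5 average pooling reduces to 5×5×96. The random filter bank below is only a stand-in for the K-means-learned dictionary D:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n1, theta = 64, 96, 0.5              # patch dim (8*8), filter count, threshold
D = rng.standard_normal((k, n1))        # stand-in for the K-means-learned filters
patch = rng.standard_normal((32, 32))   # a resized 32x32 component patch

# Collect every 8x8 sub-window of the patch as a k-dimensional input vector x
xs = np.stack([patch[i:i + 8, j:j + 8].ravel()
               for i in range(25) for j in range(25)])        # shape (625, 64)

# Eq. (1): r = max{0, |D^T x| - theta}, one response per filter per location
r = np.maximum(0, np.abs(xs @ D) - theta).reshape(25, 25, n1)  # 25x25x96 maps

# 5x5 average pooling (stride 5) reduces the maps to 5x5x96
pooled = r.reshape(5, 5, 5, 5, n1).mean(axis=(1, 3))
```

The shapes reproduce the sizes stated in the text: 32 − 8 + 1 = 25 response positions per side, and 25/5 = 5 pooled positions per side.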
In the embodiment, filters of the second convolutional layer are configured to learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confidence scores of the text components. The final output of the two layers is a 64-dimensional feature vector, which is input to the SVM classifier to obtain the final confidence score of the text component. Parameters in the second layer are fully connected and are trained by back-propagating the SVM classification error.
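The second stage described above can be sketched likewise: the 5×5×96 pooled maps are mapped by a fully connected second layer to a 64-dimensional feature vector, which a linear SVM converts to the component confidence score. All weights below are random stand-ins for parameters that would be trained by back-propagating the SVM classification error:

```python
import numpy as np

rng = np.random.default_rng(1)
pooled = rng.standard_normal((5, 5, 96))        # pooled first-layer maps (5x5x96)

# Fully connected second layer: 5*5*96 = 2400 inputs -> 64-dimensional feature
W2 = rng.standard_normal((5 * 5 * 96, 64)) * 0.01
feature = pooled.ravel() @ W2                   # the 64-dimensional feature vector

# Linear SVM decision value serves as the component confidence score
w_svm = rng.standard_normal(64)
b_svm = 0.0
score = float(feature @ w_svm + b_svm)
```

A positive score would mark the component as text-like, a negative one as non-text, consistent with how the error-connected analysis below interprets the sign of the confidence values.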
Fig. 4 illustrates the selector 300 of the scene text detection system 1000 consistent with some disclosed embodiments. As shown, the selector 300 may comprise a defining device 310 and a splitting device 320. In an embodiment, the defining device 310 may be configured to define erroneously-connected text components from the selected text components based on the assigned component confident score and the MSER tree structure. The splitting device 320 may be configured to split the erroneously-connected text components to text components with a high component confident score.
In an embodiment shown in Fig. 5, the splitting device may further comprise a resizing unit 321, a scanner 322 and an identifying unit 323. The resizing unit 321 may be configured to resize the defined erroneously-connected text components to a predetermined size. The scanner 322 may be configured to scan the resized text components with a sliding window to obtain a one-dimensional array of component confident scores. The identifying unit 323 may be configured to identify peak locations of the erroneously-connected text components based on the one-dimensional array to split the erroneously-connected text components to text components with a high component confident score.
In the embodiment, an error-connected component has three remarkable characteristics. First, it often has a high aspect ratio, with a bounding box much wider than it is tall. Second, differing from other non-text components, such as long horizontal lines or bars, which are generally scored with negative confident values by the CNN classifier 200, the error-connected component actually includes some text information, though not very strong, because the CNN classifier is trained on single-character components. Third, the components at high levels of the MSER trees often include multiple text characters, for example the components at the roots of the trees. Most of these components are already correctly separated by their children components, which often have higher confident scores than their parent components.
Therefore, the conditions for defining erroneously-connected text components comprise: 1) the aspect ratio (width/height) of the text component is larger than 2; 2) the text component has a positive confident score; and 3) the text component is at an end node of the MSER tree structure or has a larger confident score than all of its children nodes in the MSER tree structure. An exemplary algorithm for searching and splitting error-connected components is given as follows.
(Figure: exemplary algorithm for searching and splitting error-connected components.)
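For illustration only, the three conditions above can be sketched as a predicate over MSER tree nodes. This is a minimal sketch; the `Component` class and its attribute names are illustrative stand-ins, not the data structures of the claimed system.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """Minimal stand-in for an MSER tree node (attribute names illustrative)."""
    width: int
    height: int
    score: float            # CNN component confident score
    children: list = field(default_factory=list)

def is_error_connected(c: Component) -> bool:
    """The three conditions: aspect ratio > 2, positive confident score, and
    an end node or a score larger than that of every child node."""
    return (c.width / c.height > 2
            and c.score > 0
            and (not c.children or all(c.score > ch.score for ch in c.children)))
```

Only components satisfying all three conditions are passed on to the sliding-window splitting step.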
According to an embodiment, the constructor 400 may further comprise a pairing unit and a merging unit (not shown). The pairing unit may be configured to pair two text components of the selected text components which have similar geometric and heuristic properties. The merging unit may be configured to sequentially merge pairs having the same component and similar orientation to construct the final text.
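For illustration only, the pairing and merging steps can be sketched as below. The geometric thresholds and dictionary keys are illustrative, not those of the claimed system, and the orientation check is omitted for brevity.

```python
def can_pair(a, b, height_tol=0.5, dist_factor=2.0):
    """Heuristic pairing test on two component bounding boxes
    (dicts with keys x, y, w, h; thresholds are illustrative)."""
    if abs(a['h'] - b['h']) > height_tol * max(a['h'], b['h']):
        return False                      # too different in height
    return abs(a['x'] - b['x']) < dist_factor * max(a['w'], b['w'])

def merge_pairs(pairs):
    """Sequentially merge pairs that share a component into text lines."""
    lines = [set(p) for p in pairs]
    merged = True
    while merged:
        merged = False
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                if lines[i] & lines[j]:   # shared component: same text line
                    lines[i] |= lines.pop(j)
                    merged = True
                    break
            if merged:
                break
    return lines
```

Pairs such as (A, B) and (B, C) share component B and therefore collapse into one text line (A, B, C).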
Fig. 6 is a schematic flowchart illustrating a scene text detection method 2000 consistent with some disclosed embodiments. Hereafter, the method 2000 may be described in detail with respect to Fig. 6.
At step S210, a set of text components is generated from an image. In an embodiment, the set of text components is generated from an image by using a Maximally Stable Extremal Region (MSER) detector. The generated text components are ordered into a MSER tree structure.
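For illustration only, one slice of the work an MSER detector performs can be sketched with a single-threshold connected-component extraction. This is a simplified sketch: a real MSER detector sweeps all intensity thresholds and keeps only regions whose area stays stable across many consecutive thresholds, which yields the nested tree structure described above.

```python
def dark_components(img, t):
    """Connected dark regions (4-connectivity) at a single threshold t,
    i.e. one slice of the intensity sweep an MSER detector performs.
    `img` is a 2-D list of grayscale values."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if img[y][x] <= t and not seen[y][x]:
                seen[y][x] = True
                stack, comp = [(y, x)], []
                while stack:                      # iterative flood fill
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx] and img[ny][nx] <= t):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                comps.append(comp)
    return comps
```

Tracking how each such region grows and merges as the threshold rises is what produces the parent/child relationships of the MSER tree.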
At step S220, a component confident score is assigned to each text component in the set of text components. For example, the component confident score is assigned to each text component by a trained Convolutional Neural Network (CNN) classifier. In an embodiment, the Convolutional Neural Network classifier is trained with a predetermined training set to assign the component confident score.
According to an embodiment, the convolutional neural network classifier comprises at least one convolutional layer, at least one average pooling and a support vector machine (SVM) classifier, and each of the convolutional layers is followed by an average pooling and has a plurality of filters. For example, the convolutional neural network classifier may comprise two convolutional layers. During the training process, a set of patches is extracted from the predetermined training set. Then, filters of a first convolutional layer of the two convolutional layers learn from the set of patches by using an unsupervised K-means to generate a response, and filters of a second convolutional layer of the two convolutional layers learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confident scores of the text components.
At step S230, text components with a high component confident score of the assigned component confident scores are selected from the set of text components. The following is a possible way of selecting text components with a high component confident score. For example, as shown in Fig. 7, the erroneously-connected text components are defined from the selected text components based on the assigned component confident score and the MSER tree structure. The conditions for defining erroneously-connected text components comprise: 1) the aspect ratio (width/height) of the text component is larger than 2; 2) the text component has a positive confident score; and 3) the text component is at an end node of the MSER tree structure or has a larger confident score than all of its children nodes in the MSER tree structure.
If a component belongs to the erroneously-connected text components, it is resized to a predetermined size. The resized text components are scanned, for example by a sliding window, to obtain a one-dimensional array of component confident scores. For example, a Non-Maximal Suppression (NMS) method is applied to the one-dimensional array of component confident scores to estimate multiple character locations. Peak locations of the erroneously-connected text components are identified based on the one-dimensional array, so that the erroneously-connected text components are split to text components with a high component confident score based on the peak locations.
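For illustration only, the peak identification by non-maximal suppression on the one-dimensional score array can be sketched as follows. The suppression radius `min_gap` is an illustrative parameter, not a value from the claimed method.

```python
def peak_locations(scores, min_gap=3):
    """Non-maximal suppression on a 1-D array of sliding-window confident
    scores: keep indices with a positive score that are maximal within
    +/- min_gap positions. An error-connected component is then split
    into single characters at these peak locations."""
    peaks = []
    for i, s in enumerate(scores):
        if s <= 0:
            continue                                   # only positive scores
        lo, hi = max(0, i - min_gap), min(len(scores), i + min_gap + 1)
        if s == max(scores[lo:hi]):                    # local maximum
            peaks.append(i)
    return peaks
```

Each retained peak marks the most likely center of one character inside the erroneously-connected component.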
At step S240, a final text is constructed with the selected text components. When the final text is constructed, two text components of the selected text components which have similar geometric and heuristic properties are paired, and pairs having the same component and similar orientation are merged sequentially to construct the final text.
With the scene text detection system and method of the present application, the high capacity of the deep learning model is effectively leveraged to tackle two main problems of current MSER-based methods for text detection. In addition, the system of the present application achieves strong robustness and a highly discriminative capability to distinguish texts from a large number of non-text components by incorporating the MSER detector and the trained CNN classifier. A sliding-window model is integrated with the CNN classifier to further improve the capability of the MSER detector for correctly localizing challenging text components. Our method has achieved large improvements over current methods on standard benchmark datasets.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be considered as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent technique, they may also fall into the scope of the present invention.

Claims (19)

  1. A scene text detection system, comprising:
    a Maximally Stable Extremal Region (MSER) detector configured to generate a set of text components from an image, wherein the generated text components are ordered into a MSER tree structure;
    a Convolutional Neural Network (CNN) classifier configured to assign a component confident score to each text component in the set of text components;
    a selector configured to select text components with a high component confident score of the assigned component confident scores from the set of text components; and
    a constructor configured to construct a final text with the selected text components.
  2. A scene text detection system according to claim 1, wherein the CNN classifier is trained with a predetermined training set to assign the component confident score.
  3. A scene text detection system according to claim 1, wherein the CNN classifier comprises at least one convolutional layer, at least one average pooling and a support vector machine (SVM) classifier, and wherein each of the convolutional layers is followed by an average pooling and has a plurality of filters.
  4. A scene text detection system according to claim 3, wherein the at least one convolutional layer comprises two convolutional layers.
  5. A scene text detection system according to claim 4, wherein filters of a first convolutional layer of the two convolutional layers are configured to learn from the set of patches extracted from the predetermined training set by using an unsupervised K-means to generate a response, and filters of a second convolutional layer of the two convolutional layers are configured to learn from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confident scores of the text components.
  6. A scene text detection system according to claim 1, wherein the selector further comprises:
    a defining device configured to define erroneously-connected text components from the selected text components based on the assigned component confident score and the MSER tree structure; and
    a splitting device configured to split the erroneously-connected text components to text components with a high component confident score.
  7. A scene text detection system according to claim 6, wherein the splitting device further comprises:
    a resizing unit configured to resize the defined erroneously-connected text components to a predetermined size;
    a scanner configured to scan the resized text components with a sliding window to obtain a one-dimensional array of component confident scores; and
    an identifying unit configured to identify peak locations of the erroneously-connected text components based on the one-dimensional array to split the erroneously-connected text components to text components with a high component confident score.
  8. A scene text detection system according to claim 6, wherein the conditions for defining erroneously-connected text components comprise:
    an aspect ratio of width/height of the text component is larger than 2;
    the text component has a positive confident score; and
    the text component is at an end node of the MSER tree structure or has a larger confident score than that of all its children nodes in the MSER tree structure.
  9. A scene text detection system according to claim 1, wherein the constructor further comprises:
    a pairing unit configured to pair two text components, which have similar geometric and heuristic properties, of the selected text components; and
    a merging unit configured to sequentially merge pairs having the same component and similar orientation to construct the final text.
  10. A scene text detection method, comprising:
    generating a set of text components from an image, wherein the generated text components are ordered into a tree structure;
    assigning a component confident score to each text component in the set of text components;
    selecting text components with a high component confident score of the assigned component confident scores from the set of text components; and
    constructing a final text with the selected text components.
  11. A scene text detection method according to claim 10, wherein the generating a set of text components from an image further comprises:
    generating the set of text components from the image by using a Maximally Stable Extremal Region (MSER) detector.
  12. A scene text detection method according to claim 10, wherein the assigning a component confident score to each text component further comprises:
    assigning a component confident score to each text component in the set of text components by a trained Convolutional Neural Network (CNN) classifier.
  13. A scene text detection method according to claim 12, further comprising:
    training the Convolutional Neural Network classifier with a predetermined training set to assign the component confident score.
  14. A scene text detection method according to claim 12, wherein the convolutional neural network classifier comprises at least one convolutional layer, at least one average pooling and a support vector machine (SVM) classifier, and wherein each of the convolutional layers is followed by an average pooling and has a plurality of filters.
  15. A scene text detection method according to claim 14, wherein the at least one convolutional layer comprises two convolutional layers, and the training the Convolutional Neural Network classifier with a predetermined training set to assign the component confident score further comprises:
    extracting a set of patches from the predetermined training set;
    learning, by filters of a first convolutional layer of the two convolutional layers, from the set of patches by using an unsupervised K-means to generate a response; and
    learning, by filters of a second convolutional layer of the two convolutional layers, from the generated response by back-propagating an SVM classification error generated from the SVM classifier to obtain the component confident scores of the text components.
  16. A scene text detection method according to claim 10, wherein the selecting text components with a high component confident score of the assigned component confident scores from the set of text components to construct a final text further comprises:
    defining erroneously-connected text components from the selected text components based on the assigned component confident score and the MSER tree structure; and
    splitting the erroneously-connected text components to text components with a high component confident score.
  17. A scene text detection method according to claim 16, wherein the splitting the erroneously-connected text components to text components with a high component confident score further comprises:
    resizing the defined erroneously-connected text components to a predetermined size;
    scanning the resized text components with a sliding window to obtain a one-dimensional array of component confident scores; and
    identifying peak locations of the erroneously-connected text components based on the one-dimensional array so that the erroneously-connected text components are split to text components with a high component confident score based on the peak locations.
  18. A scene text detection method according to claim 16, wherein conditions for defining erroneously-connected text components comprise:
    an aspect ratio of width/height of the text component is larger than 2;
    the text component has a positive confident score; and
    the text component is at an end node of the MSER tree structure or has a larger confident score than that of all its children nodes in the MSER tree structure.
  19. A scene text detection method according to claim 10, wherein the constructing a final text with the selected text components further comprises:
    pairing two text components, which have similar geometric and heuristic properties, of the selected text components; and
    merging pairs having the same component and similar orientation sequentially to construct the final text.
PCT/CN2014/000830 2014-09-05 2014-09-05 Scene text detection system and method WO2016033710A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/000830 WO2016033710A1 (en) 2014-09-05 2014-09-05 Scene text detection system and method
CN201480081759.5A CN106796647B (en) 2014-09-05 2014-09-05 Scene text detecting system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/000830 WO2016033710A1 (en) 2014-09-05 2014-09-05 Scene text detection system and method

Publications (1)

Publication Number Publication Date
WO2016033710A1 true WO2016033710A1 (en) 2016-03-10

Family

ID=55438963

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/000830 WO2016033710A1 (en) 2014-09-05 2014-09-05 Scene text detection system and method

Country Status (2)

Country Link
CN (1) CN106796647B (en)
WO (1) WO2016033710A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032110B2 (en) 2016-12-13 2018-07-24 Google Llc Performing average pooling in hardware
US10037490B2 (en) 2016-12-13 2018-07-31 Google Llc Performing average pooling in hardware
CN109086663A (en) * 2018-06-27 2018-12-25 大连理工大学 The natural scene Method for text detection of dimension self-adaption based on convolutional neural networks
CN110321893A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text identification network focusing enhancing
CN110348280A (en) * 2019-03-21 2019-10-18 贵州工业职业技术学院 Water book character recognition method based on CNN artificial neural
CN110516554A (en) * 2019-07-31 2019-11-29 杭州电子科技大学 A kind of more scene multi-font Chinese text detection recognition methods
WO2020218111A1 (en) * 2019-04-24 2020-10-29 富士フイルム株式会社 Learning method and device, program, learned model, and text generation device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704509B (en) * 2017-08-31 2021-11-02 北京联合大学 Reordering method combining stable region and deep learning
CN110135446B (en) * 2018-02-09 2021-01-22 北京世纪好未来教育科技有限公司 Text detection method and computer storage medium
FR3079056A1 (en) * 2018-03-19 2019-09-20 Stmicroelectronics (Rousset) Sas METHOD FOR CONTROLLING SCENES DETECTION BY AN APPARATUS, FOR EXAMPLE A WIRELESS COMMUNICATION APPARATUS, AND APPARATUS THEREFOR
CN109816022A (en) * 2019-01-29 2019-05-28 重庆市地理信息中心 A kind of image-recognizing method based on three decisions and CNN
CN112183523A (en) * 2020-12-02 2021-01-05 北京云测信息技术有限公司 Text detection method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014014640A1 (en) * 2012-07-19 2014-01-23 Qualcomm Incorporated Method of handling complex variants of words through prefix-tree based decoding for devanagiri ocr
US20140201126A1 (en) * 2012-09-15 2014-07-17 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102201053B (en) * 2010-12-10 2013-07-24 上海合合信息科技发展有限公司 Method for cutting edge of text image
CN103136523B (en) * 2012-11-29 2016-06-29 浙江大学 Any direction text line detection method in a kind of natural image


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIAO, WEIMIN.: "End-to-End English Text Recognition in Natural Scene Image", CHINESE MASTER'S THESES FULL-TEXT DATABASE INFORMATION SCIENCE AND TECHNOLOGY, 15 August 2014 (2014-08-15), pages 37 - 64 *
OPITZ, MICHAEL ET AL.: "End-to-End Text Recognition Using Local Ternary Patterns, MSER and Deep Convolutional Nets", DOCUMENT ANALYSIS SYSTEMS (DAS), 2014 11TH IAPR INTERNATIONAL WORKSHOP, 10 April 2014 (2014-04-10), pages 186, XP032606124, DOI: doi:10.1109/DAS.2014.29 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032110B2 (en) 2016-12-13 2018-07-24 Google Llc Performing average pooling in hardware
US10037490B2 (en) 2016-12-13 2018-07-31 Google Llc Performing average pooling in hardware
US10679127B2 (en) 2016-12-13 2020-06-09 Google Llc Performing average pooling in hardware
US11232351B2 (en) 2016-12-13 2022-01-25 Google Llc Performing average pooling in hardware
CN109086663A (en) * 2018-06-27 2018-12-25 大连理工大学 The natural scene Method for text detection of dimension self-adaption based on convolutional neural networks
CN109086663B (en) * 2018-06-27 2021-11-05 大连理工大学 Natural scene text detection method based on scale self-adaption of convolutional neural network
CN110348280A (en) * 2019-03-21 2019-10-18 贵州工业职业技术学院 Water book character recognition method based on CNN artificial neural
WO2020218111A1 (en) * 2019-04-24 2020-10-29 富士フイルム株式会社 Learning method and device, program, learned model, and text generation device
JPWO2020218111A1 (en) * 2019-04-24 2020-10-29
JP7152600B2 (en) 2019-04-24 2022-10-12 富士フイルム株式会社 Learning method and device, program, trained model, and text generation device
CN110321893A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of scene text identification network focusing enhancing
CN110516554A (en) * 2019-07-31 2019-11-29 杭州电子科技大学 A kind of more scene multi-font Chinese text detection recognition methods

Also Published As

Publication number Publication date
CN106796647B (en) 2018-09-14
CN106796647A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
WO2016033710A1 (en) Scene text detection system and method
US10395143B2 (en) Systems and methods for identifying a target object in an image
US10896349B2 (en) Text detection method and apparatus, and storage medium
US9367766B2 (en) Text line detection in images
Huang et al. Robust scene text detection with convolution neural network induced mser trees
Zamberletti et al. Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions
US8965127B2 (en) Method for segmenting text words in document images
US10430649B2 (en) Text region detection in digital images using image tag filtering
Yi et al. Feature representations for scene text character recognition: A comparative study
JP6188400B2 (en) Image processing apparatus, program, and image processing method
EP3203417B1 (en) Method for detecting texts included in an image and apparatus using the same
CN112036395A (en) Text classification identification method and device based on target detection
JP6897749B2 (en) Learning methods, learning systems, and learning programs
US9224207B2 (en) Segmentation co-clustering
Gomez et al. A fast hierarchical method for multi-script and arbitrary oriented scene text extraction
Wu et al. Scene text detection using adaptive color reduction, adjacent character model and hybrid verification strategy
Kalyoncu et al. GTCLC: leaf classification method using multiple descriptors
JP2015185033A (en) Character recognition device and identification function generation method
Kim et al. A rule-based method for table detection in website images
Ramirez et al. Automatic recognition of square notation symbols in western plainchant manuscripts
US9858293B2 (en) Image processing apparatus and image processing method
Yasmeen et al. Text detection and classification from low quality natural images
CN117612179A (en) Method and device for recognizing characters in image, electronic equipment and storage medium
CN111553361A (en) Pathological section label identification method
CN108475339B (en) Method and system for classifying objects in an image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14901407

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14901407

Country of ref document: EP

Kind code of ref document: A1