CN112215235A

CN112215235A - Scene text detection method for large character spacing and partial occlusion

Info

Publication number: CN112215235A
Application number: CN202011110021.0A
Authority: CN
Inventors: 高攀; 刘磊; 黄军文; 汤红
Original assignee: Shenzhen Huafu Information Technology Co ltd
Current assignee: Shenzhen Huafu Information Technology Co ltd
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2021-01-12
Anticipated expiration: 2040-10-16
Also published as: CN112215235B

Abstract

The invention belongs to the technical field of optical character recognition, and in particular relates to a scene text detection method for large character spacing and partial occlusion, comprising the following steps: S1, extracting features from the input picture with a fully convolutional neural network and fusing the features of different levels; S2, passing the fused features through a text semantic segmentation network to output a text segmentation map, and through a text instance feature embedding module to output a text instance embedding feature map; and S3, obtaining the text detection result from the text segmentation map and the text instance embedding feature map through a text instance recombination algorithm. The text instance feature embedding module embeds each pixel into a feature space, and the average pixel feature within a text region is taken as the feature of that region. The text instance recombination algorithm then recombines text candidate regions with similar features, so that a text instance split into several regions by large character spacing or partial occlusion can be detected again as a single complete object.

Description

Scene text detection method for large character spacing and partial occlusion

Technical Field

The invention belongs to the technical field of optical character recognition, and in particular relates to a scene text detection method for large character spacing and partial occlusion.

Background

Because text naturally carries rich and accurate semantic information, enabling a computer to read and understand the text in pictures has both academic and practical value. Scene text detection is the task of detecting text in pictures of natural scenes. The difficulty of the task comes mainly from three aspects. First, text is diverse: text in natural scenes appears in a variety of fonts, colors, sizes and artistic styles. Second, the background of a natural scene picture is very complex, and real-world objects such as windows, tiles, fences and grass have structures similar to text. Third, the imaging conditions affect the picture: some pictures suffer from uneven illumination, blur and similar degradations.

The first category of existing methods is based on text box regression and uses general object detection frameworks such as SSD and Faster R-CNN. Because of the limitations of the anchor box mechanism, such methods cannot handle arbitrarily shaped text (such as curved text). In addition, because of the limited receptive field, the regressed bounding boxes of long text are inaccurate.

The second category of existing methods is based on semantic segmentation, which classifies the pixels of the picture into foreground (text regions) and background. This approach can handle arbitrarily shaped text regardless of the shape and size of the text object, but because text boundaries are hard to define, adjacent text lines are not easily separated. In addition, most such methods use connected component analysis to determine text instances, so when the character spacing is large or the text is partially occluded, a single text object yields multiple detection boxes.

Disclosure of Invention

As described in the background, methods based on semantic segmentation produce multiple detection boxes for a single text object when the character spacing is large or the text is partially occluded. To overcome this problem, the invention provides the following technical scheme:

A method for detecting scene text with large character spacing and partial occlusion comprises the following steps:

S1, extracting features from the input picture with a fully convolutional neural network, and fusing the features of different levels;

S2, passing the fused features through a text semantic segmentation network to output a text segmentation map, and through a text instance feature embedding module to output a text instance embedding feature map;

and S3, obtaining the text detection result from the text segmentation map and the text instance embedding feature map through a text instance recombination algorithm.

Further, in S1, a fully convolutional network with a feature pyramid structure is adopted: the input picture first passes through the feature pyramid network to extract features of different levels, and the features of different levels are then fused by element-wise addition and channel concatenation.

Further, in S2, the text instance feature embedding module embeds each pixel into the feature space, and the average pixel feature within a text region is taken as the feature of that region.

Further, in the network structure of the text instance feature embedding module, the fused features pass through two Conv-BN-ReLU layers, then through a 1×1 convolution that reduces the number of channels and the amount of computation, then through a ReLU activation layer, and are finally up-sampled to the original input size.

Further, the text instance feature embedding module is trained by reducing the feature distance between different pixels of the same text instance and increasing the feature distance between different text instances.

Further, the text instance recombination algorithm is a metric-based (distance-based) clustering algorithm.

The scene text detection method for large character spacing and partial occlusion has the following advantages. It is optimized for the false detection of text with large character spacing and of partially occluded text, and provides a text instance feature embedding module and a text instance recombination algorithm. The text instance feature embedding module embeds each pixel into a feature space, and the average pixel feature within a text region is taken as the feature of that region. The text instance recombination algorithm then recombines text candidate regions with similar features. In this way, a text instance split into several regions by large character spacing or partial occlusion can be detected again as a single complete object. The two modules do not depend on specific model details, can easily be combined with any mainstream segmentation-based text detection algorithm, and improve its precision.

Drawings

FIG. 1 is a schematic diagram illustrating steps of a text detection method according to an embodiment of the present invention;

FIG. 2 is a diagram of a full convolution network with a feature pyramid structure according to an embodiment of the present invention;

FIG. 3 is a diagram of a network structure of a text instance feature embedding module in an embodiment of the present invention.

Detailed Description

The present invention is further illustrated by the following embodiments. They represent only some of the embodiments of the invention, serve only to explain it, and do not limit its scope.

As shown in FIG. 1, a method for detecting scene text with large character spacing and partial occlusion includes the following steps:

S1, extracting features from the input picture with a fully convolutional neural network, and fusing the features of different levels.

Any mainstream segmentation-based text detection method can be chosen as the base; this embodiment takes a classical fully convolutional network with a feature pyramid structure (FPN + FCN) as an example. The overall network structure is shown in FIG. 2: the input picture first passes through the feature pyramid network to extract features of different levels, and the features of different levels are then fused by channel concatenation.
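The fusion step can be illustrated with a minimal NumPy sketch. It assumes, beyond what the text states, that each pyramid level is brought to the finest resolution by nearest-neighbour upsampling before the channels are concatenated; the function names, the four-level pyramid and the channel counts are hypothetical.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_by_concatenation(pyramid):
    """Upsample every level to the finest resolution and concatenate along channels.

    `pyramid` is a list of (C_i, H_i, W_i) arrays ordered from finest to coarsest;
    each level is assumed to divide the finest resolution evenly.
    """
    target_h, target_w = pyramid[0].shape[1:]
    resized = []
    for feat in pyramid:
        factor = target_h // feat.shape[1]
        assert feat.shape[1] * factor == target_h and feat.shape[2] * factor == target_w
        resized.append(upsample_nearest(feat, factor))
    return np.concatenate(resized, axis=0)

# Hypothetical pyramid: 4 levels of 8 channels each, spatial sizes 16, 8, 4 and 2
pyramid = [np.random.rand(8, 16 // 2**i, 16 // 2**i) for i in range(4)]
fused = fuse_by_concatenation(pyramid)
print(fused.shape)  # (32, 16, 16)
```

A real network would typically use learned or bilinear upsampling inside the model; the sketch only shows how the channel dimension grows when levels are concatenated.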

S2, passing the fused features through a text semantic segmentation network to output a text segmentation map, and through a text instance feature embedding module to output a text instance embedding feature map.

This section mainly describes how the text instance feature embedding module outputs the text instance embedding feature map. The network structure of the module is shown in FIG. 3: the fused features first pass through two Conv-BN-ReLU layers, then through a 1×1 convolution that reduces the number of channels and the amount of computation, then through a ReLU activation layer, and are finally up-sampled to the original input size. Specifically, the module outputs a feature vector $F_p = (x_1, x_2, x_3, x_4)$ for each pixel $p$ (four dimensions are taken as an example). The feature of a text region is represented by the average feature vector of the pixels in that region, which can be written as $F_R = \frac{1}{|R|} \sum_{p \in R} F_p$, where $R$ denotes the set of pixels of the text region and $|R|$ the number of pixels in it.
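The region feature described above, the mean of the per-pixel embedding vectors inside a region, can be sketched in NumPy as follows. The array layout (channels first) and the names `region_feature`, `embedding` and `region_mask` are assumptions made for illustration.

```python
import numpy as np

def region_feature(embedding, region_mask):
    """Mean embedding vector over the pixels of one text region.

    embedding:   (D, H, W) per-pixel embedding map
    region_mask: (H, W) boolean mask selecting the region's pixels
    """
    pixels = embedding[:, region_mask]   # (D, N) vectors of the N region pixels
    return pixels.mean(axis=1)           # (D,) average feature vector

# Toy example with four-dimensional embeddings and a two-pixel region
embedding = np.zeros((4, 2, 3))
embedding[:, 0, 0] = [1.0, 2.0, 3.0, 4.0]
embedding[:, 0, 1] = [3.0, 2.0, 1.0, 0.0]
mask = np.zeros((2, 3), dtype=bool)
mask[0, :2] = True
print(region_feature(embedding, mask))  # [2. 2. 2. 2.]
```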

Because the labels do not provide a feature vector for each pixel, and the purpose of the module is to learn the similarity between text instances, the method adopts the idea of clustering: during training, the text instance feature embedding module is supervised by reducing the feature distance between different pixels of the same text instance and increasing the feature distance between different text instances. Specifically: (1) Decrease the intra-instance distance: the feature distance between pixels in the same text region should be as small as possible. The distance between each pixel and its text instance is used as a loss, so that pixel features within the same text region become more similar. (2) Increase the inter-instance distance: in contrast to the intra-instance distance, the distance between the feature vectors of different text regions should be as large as possible.
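The text does not give the exact loss formula, so the following is only a sketch of one possible pull/push formulation. The squared distances, the hinge on the inter-instance term and the `push_margin` value are assumptions borrowed from common discriminative embedding losses, not details from the source. NumPy is used to show the computation only; training would need an automatic-differentiation framework.

```python
import numpy as np

def embedding_loss(embedding, instance_labels, push_margin=3.0):
    """Pull/push terms over a (D, H, W) embedding map.

    instance_labels: (H, W) int map, 0 = background, k > 0 = text instance k.
    push_margin is a hypothetical hyperparameter, not taken from the source.
    """
    ids = [k for k in np.unique(instance_labels) if k > 0]
    centers, pull_terms = [], []
    for k in ids:
        pixels = embedding[:, instance_labels == k]      # (D, N_k)
        center = pixels.mean(axis=1, keepdims=True)      # (D, 1) instance feature
        dist = np.linalg.norm(pixels - center, axis=0)   # pixel-to-instance distances
        pull_terms.append(np.mean(dist ** 2))            # (1) shrink intra-instance distance
        centers.append(center[:, 0])
    pull = float(np.mean(pull_terms)) if pull_terms else 0.0

    push_terms = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            gap = np.linalg.norm(centers[i] - centers[j])
            # (2) penalize instance features that are closer than the margin
            push_terms.append(max(push_margin - gap, 0.0) ** 2)
    push = float(np.mean(push_terms)) if push_terms else 0.0
    return pull, push

# Two instances: all-zero embeddings and all-one embeddings (center distance 2.0)
labels = np.array([[1, 1, 0, 2, 2]])
emb = np.zeros((4, 1, 5))
emb[:, 0, 3:] = 1.0
pull, push = embedding_loss(emb, labels)
print(pull, push)  # 0.0 1.0
```

In the toy example the pull term is zero because every pixel coincides with its instance feature, while the push term is positive because the two instance features are closer than the assumed margin.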

S3, obtaining the text detection result from the text segmentation map and the text instance embedding feature map through a text instance recombination algorithm.

The text instance recombination algorithm is a metric-based (distance-based) clustering algorithm. Its main idea is to test whether the distance between the feature vectors of two text candidate regions is smaller than a threshold; if the distance is small enough, the two candidates are considered likely to belong to the same text instance and are merged into one text instance. Besides the feature distance, some logical conditions must also be satisfied, such as the relative positions of the two candidate texts.
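A minimal sketch of such a threshold-based merge is given below, using a union-find structure so that chains of similar candidates end up in one group. The threshold values, the axis-aligned boxes and the horizontal-gap test are all hypothetical; the gap test merely stands in for the relative-position condition, which the text does not specify.

```python
import numpy as np

def recombine(region_features, boxes, dist_threshold=1.0, max_gap=50.0):
    """Group candidate regions whose mean embeddings are close.

    region_features: (K, D) mean embedding of each candidate region
    boxes:           (K, 4) axis-aligned boxes (x0, y0, x1, y1) of the candidates
    dist_threshold and max_gap are hypothetical values.
    Returns a list of K group ids; equal ids mean "same text instance".
    """
    k = len(region_features)
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            feat_dist = np.linalg.norm(region_features[i] - region_features[j])
            # Horizontal gap between the two boxes (0 if they overlap in x)
            gap = max(boxes[i][0] - boxes[j][2], boxes[j][0] - boxes[i][2], 0.0)
            if feat_dist < dist_threshold and gap <= max_gap:
                parent[find(j)] = find(i)
    return [find(i) for i in range(k)]

# Three candidates: the first two have similar features and lie close together
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
boxes = np.array([[0, 0, 10, 10], [30, 0, 40, 10], [60, 0, 70, 10]], dtype=float)
print(recombine(features, boxes))  # [0, 0, 2]
```

The merged groups would then be turned into detection boxes, for example by taking the enclosing box of all regions in a group; that post-processing is not described in the text and is left out here.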

The invention provides a text instance feature embedding module and a text instance recombination algorithm for text with large character spacing and partially occluded text. It detects such text effectively and improves the overall accuracy of the model. The two components are plug-and-play: they do not depend on a specific method and can easily be combined with any mainstream segmentation-based text detection method.

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

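1. A method for detecting scene text with large character spacing and partial occlusion, characterized by comprising the following steps:

S1, extracting features from the input picture with a fully convolutional neural network, and fusing the features of different levels;

S2, passing the fused features through a text semantic segmentation network to output a text segmentation map, and through a text instance feature embedding module to output a text instance embedding feature map;

and S3, obtaining the text detection result from the text segmentation map and the text instance embedding feature map through a text instance recombination algorithm.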

2. The method for detecting scene text with large character spacing and partial occlusion according to claim 1, wherein in S1, a fully convolutional network with a feature pyramid structure is adopted, the input picture first passes through the feature pyramid network to extract features of different levels, and the features of different levels are then fused by element-wise addition and channel concatenation.

3. The method for detecting scene text with large character spacing and partial occlusion according to claim 1 or 2, wherein in S2, the text instance feature embedding module embeds each pixel into the feature space, and the average pixel feature within a text region is taken as the feature of that region.

4. The method for detecting scene text with large character spacing and partial occlusion according to claim 3, wherein in the network structure of the text instance feature embedding module, the fused features pass through two Conv-BN-ReLU layers, then through a 1×1 convolution that reduces the number of channels and the amount of computation, then through a ReLU activation layer, and are then up-sampled to the original input size.

5. The method of claim 4, wherein the text instance feature embedding module performs training by reducing feature distances of different pixels in the same text instance and increasing feature distances between different text instances.

6. The method of claim 5, wherein the text instance recombination algorithm is a metric-based clustering algorithm.