Scene text detection method for large character spacing and partial occlusion
Technical Field
The invention belongs to the technical field of optical character recognition, and particularly relates to a scene text detection method for text with large character spacing and partial occlusion.
Background
Because text naturally carries rich and precise semantic information, enabling a computer to read and understand the text in pictures has both academic and practical value. Scene text detection is the task of detecting text in natural scene pictures. The difficulty of the task comes mainly from three aspects. First, text is diverse: text in natural scenes appears in many fonts, colors, sizes and artistic styles. Second, the background of natural scene pictures is complex, and real scenes contain objects whose structure resembles text, such as windows, bricks and tiles, fences and grass. Third, the imaging environment affects the picture: some pictures suffer from uneven illumination, blur and similar degradations.
One class of existing methods is based on text box regression. These methods use a general object detection framework such as SSD or Faster R-CNN but, limited by the anchor mechanism, cannot handle text of arbitrary shape (such as curved text). In addition, because the receptive field is limited, box regression for long text is inaccurate.
A second class of existing methods is based on semantic segmentation: the pixels of a picture are classified as foreground (text region) or background. This approach can handle text of arbitrary shape regardless of the shape and size of the text object, but because text boundaries are difficult to define, adjacent text lines are not easily separated. In addition, most of these methods use connected component analysis to determine text instances, so when the characters of a text are widely spaced or partially occluded, one text object corresponds to multiple detection boxes.
Disclosure of Invention
In order to solve the problem described in the background, namely that a semantic-segmentation-based method produces multiple detection boxes for one text object when the characters are widely spaced or partially occluded, the invention provides the following technical solution:
A scene text detection method for large character spacing and partial occlusion, comprising the following steps:
S1, extracting features from an input picture through a fully convolutional neural network, and fusing the features of different levels;
S2, passing the fused features through a text semantic segmentation network to output a text segmentation map, and through a text instance feature embedding module to output a text instance embedding feature map;
S3, obtaining a text detection result from the text segmentation map and the text instance embedding feature map through a text instance recombination algorithm.
Further, in S1, a fully convolutional network with a feature pyramid structure is adopted: features of different levels are extracted from the input picture through the feature pyramid network, and the features of different levels are then fused through element-wise addition and channel concatenation.
Further, in S2, the text instance feature embedding module embeds each pixel into a feature space, and the average pixel feature within a text region is taken as the feature of that text region.
Further, in the network structure of the text instance feature embedding module, the fused features pass through two Conv-BN-ReLU layers, a 1×1 convolution then reduces the number of channels to lower the amount of computation, and the result is upsampled to the original input size through a ReLU activation layer.
Further, the text instance feature embedding module is trained by reducing the feature distance between different pixels of the same text instance and increasing the feature distance between different text instances.
Further, the text instance recombination algorithm is a metric-based clustering algorithm.
The beneficial effects of the scene text detection method for large character spacing and partial occlusion are as follows. The method targets the false detection of text with large character spacing or partial occlusion, and provides a text instance feature embedding module and a text instance recombination algorithm. The text instance feature embedding module embeds each pixel into a feature space and takes the average pixel feature within a text region as the feature of that region. The text instance recombination algorithm then recombines text candidate regions with similar features. In this way, a text instance that was split into multiple regions because of large character spacing or partial occlusion can be detected again as one complete object. The two proposed modules do not depend on specific model details, are highly portable, and can be combined with any mainstream segmentation-based text detection algorithm to improve its accuracy.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a text detection method according to an embodiment of the present invention;
FIG. 2 is a structural diagram of the fully convolutional network with a feature pyramid structure in an embodiment of the present invention;
FIG. 3 is a network architecture diagram of the text instance feature embedding module in an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to embodiments. These are only some of the embodiments of the invention; they are intended to illustrate the invention and do not limit its scope in any way.
As shown in FIG. 1, the scene text detection method for large character spacing and partial occlusion comprises the following steps:
S1, extracting features from an input picture through a fully convolutional neural network, and fusing the features of different levels.
Any mainstream segmentation-based text detection method may be selected; this embodiment takes a classical fully convolutional network with a feature pyramid structure (FPN+FCN) as an example. The overall network structure is shown in FIG. 2: features of different levels are first extracted from the input picture through the feature pyramid network, and the features of different levels are then fused through channel concatenation.
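The fusion of S1 by channel concatenation can be illustrated with a minimal standard-library Python sketch. It is not the network of the embodiment: the names `upsample_nearest` and `fuse_by_concat` are hypothetical, feature maps are plain nested lists indexed as [channel][row][column], the first level is assumed to be the finest, and nearest-neighbour upsampling is an assumption, since the specification does not state how the levels are brought to a common size.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def upsample_nearest(fmap: FeatureMap, out_h: int, out_w: int) -> FeatureMap:
    """Resize every channel to (out_h, out_w) by nearest-neighbour sampling."""
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [
        [[ch[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
         for r in range(out_h)]
        for ch in fmap
    ]


def fuse_by_concat(levels: List[FeatureMap]) -> FeatureMap:
    """Bring all levels to the size of levels[0], then stack their channels."""
    out_h, out_w = len(levels[0][0]), len(levels[0][0][0])
    fused: FeatureMap = []
    for fmap in levels:
        fused.extend(upsample_nearest(fmap, out_h, out_w))
    return fused


# Toy pyramid (assumed data): a 1-channel 4x4 level and a 2-channel 2x2 level.
fine = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]
coarse = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
fused = fuse_by_concat([fine, coarse])
print(len(fused), len(fused[0]), len(fused[0][0]))  # channels, height, width
```

The fused map has as many channels as all levels combined, which is why a later layer that reduces the channel count is useful.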
S2, passing the fused features through a text semantic segmentation network to output a text segmentation map, and through a text instance feature embedding module to output a text instance embedding feature map.
This section mainly introduces the text instance feature embedding module, which outputs the text instance embedding feature map. The network structure of the module is shown in FIG. 3: the fused features pass through two Conv-BN-ReLU layers, a 1×1 convolution then reduces the number of channels to lower the amount of computation, and the result is upsampled to the original input size through a ReLU activation layer. Specifically, the text instance feature embedding module outputs a feature vector $F_p = \{x_1, x_2, x_3, x_4\}$ for each pixel $p$ (four dimensions are used as an example). The feature of a text region is represented by the average feature vector of the pixels of that region, which can be defined as
$$F_T = \frac{1}{|T|} \sum_{p \in T} F_p$$
where $T$ denotes the set of pixels of the text region and $|T|$ the number of pixels in it.
Because the labels do not contain a feature vector for each pixel, and the purpose of the module is to learn the similarity between text instances, a clustering idea is used to supervise the learning of the text instance feature embedding module: during training, the feature distance between different pixels of the same text instance is reduced and the feature distance between different text instances is increased. Specifically: (1) Reducing the intra-instance distance: the feature distance between pixels of the same text region should be as small as possible. The distance in feature space between a pixel and its text instance is used as a penalty, so that pixel features within the same text region become more similar. (2) Increasing the inter-instance distance: in contrast to the intra-instance distance, the distance between the feature vectors of different text regions should be as large as possible.
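The two training objectives can be illustrated with a small standard-library Python sketch. The specification does not give the exact loss, so the hinged, squared form below and the parameters `pull_margin` and `push_margin` are assumptions chosen for illustration, and the function name `embedding_loss` is hypothetical. Each text instance is given as a list of its pixel feature vectors, and its region feature is the mean of those vectors.

```python
import math
from typing import Dict, List

Vec = List[float]


def dist(a: Vec, b: Vec) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors: List[Vec]) -> Vec:
    """Region feature: average of the pixel feature vectors of one region."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def embedding_loss(instances: Dict[int, List[Vec]],
                   pull_margin: float = 0.5,   # assumed value
                   push_margin: float = 3.0) -> float:  # assumed value
    centers = {k: mean(v) for k, v in instances.items()}

    # (1) Intra-instance term: penalise pixels lying farther than
    # pull_margin from the feature of their own text instance.
    pull = 0.0
    for k, feats in instances.items():
        pull += sum(max(dist(f, centers[k]) - pull_margin, 0.0) ** 2
                    for f in feats) / len(feats)
    pull /= len(instances)

    # (2) Inter-instance term: penalise pairs of instance features
    # that are closer to each other than push_margin.
    ids = sorted(centers)
    pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    push = 0.0
    if pairs:
        push = sum(max(push_margin - dist(centers[a], centers[b]), 0.0) ** 2
                   for a, b in pairs) / len(pairs)
    return pull + push


# Toy data (assumed): two tight, distant instances versus two overlapping ones.
well_separated = {1: [[0.0, 0.0], [0.2, 0.0]], 2: [[10.0, 0.0], [10.2, 0.0]]}
overlapping = {1: [[0.0, 0.0], [2.0, 0.0]], 2: [[1.0, 0.0], [3.0, 0.0]]}
print(embedding_loss(well_separated), embedding_loss(overlapping))
```

With these margins, embeddings that are already tight and far apart incur no penalty, while overlapping instances are penalised by both terms.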
S3, obtaining a text detection result from the text segmentation map and the text instance embedding feature map through a text instance recombination algorithm.
The text instance recombination algorithm is a clustering algorithm based on a metric (distance). Its main idea is to determine whether the distance between the feature vectors of two text candidate regions is smaller than a threshold; if the feature distance is small enough, the two candidates are considered likely to belong to the same text instance and are recombined into one. In addition to the feature distance, certain logical conditions must also be satisfied, such as conditions on the relative position of the two candidate texts.
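A minimal standard-library Python sketch of such a metric-based recombination follows. It is one possible reading, not the algorithm of the embodiment: the function name `recombine`, the Euclidean metric, the transitive merging via union-find, and the `compatible` callback standing in for the unspecified logical conditions (for example, relative position) are all assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence


def feature_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two region feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recombine(features: Sequence[Sequence[float]],
              threshold: float,
              compatible: Callable[[int, int], bool] = lambda i, j: True
              ) -> List[List[int]]:
    """Group candidate regions whose features are closer than `threshold`.

    `compatible(i, j)` is a hypothetical hook for the extra logical
    conditions; by default every pair is allowed.
    """
    parent = list(range(len(features)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            close = feature_distance(features[i], features[j]) < threshold
            if close and compatible(i, j):
                parent[find(j)] = find(i)  # merge the two candidates

    groups: Dict[int, List[int]] = {}
    for i in range(len(features)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy region features (assumed): candidates 0 and 1 are similar, 2 is not.
candidates = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
print(recombine(candidates, threshold=0.5))
```

Each returned group lists the indices of candidate regions that would be reported as one text instance; passing a stricter `compatible` function prevents merges that the feature distance alone would allow.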
For text with large character spacing or partial occlusion, the invention provides a text instance feature embedding module and a text instance recombination algorithm. Together they can effectively detect such text and improve the overall accuracy of the model. They are plug-and-play, do not depend on a specific method, are highly portable, and can be combined with mainstream segmentation-based text detection methods.
The present invention is not limited to the above embodiments. Any modification, equivalent change or variation made to the above embodiments by those skilled in the art without departing from the scope of the present invention falls within the scope of protection of the present invention.