CN113887282A - Detection system and method for any-shape adjacent text in scene image - Google Patents


Info

Publication number
CN113887282A
CN113887282A (application number CN202111004566.8A)
Authority
CN
China
Prior art keywords
region
text
detection
suggestion
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111004566.8A
Other languages
Chinese (zh)
Inventor
王伟平
过友辉
周宇
秦绪功
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202111004566.8A priority Critical patent/CN113887282A/en
Publication of CN113887282A publication Critical patent/CN113887282A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a detection system and method for arbitrarily-shaped adjacent text in scene images, belonging to the field of image text detection. By applying attention to region-suggestion features, the network is made to focus more on text features; in addition, a one-to-many training strategy is proposed that matches multiple target texts to each candidate box, relieving the confusion in selecting regression targets when texts are adjacent and ultimately improving the ability of a regression-based two-stage model to detect arbitrarily-shaped adjacent text in scene images.

Description

Detection system and method for any-shape adjacent text in scene image
Technical Field
The invention belongs to the field of image text detection, and particularly relates to a detection system and method for arbitrarily-shaped adjacent text in scene images.
Background
Text detection and recognition in scene images have been a research hotspot in recent years. Owing to the complexity of text in scene images, recognizing text directly is difficult, so the text must first be detected, that is, located in the scene image. With the development of deep learning, the field of text detection has advanced rapidly. Inspired by general object detection, current mainstream methods adapt general object-detection frameworks and then make targeted modifications based on the characteristics of text. The targets of text detection have also become more diverse, from early horizontal text, to multi-oriented text, to the arbitrarily-shaped text that has attracted attention in recent years.
Although some methods address arbitrarily-shaped text, segmentation-based methods adapt to arbitrary shapes but are severely affected by segmentation quality, while regression-based methods mostly use features from rectangular anchor boxes, which contain much background noise. Existing methods also handle adjacent texts poorly: segmentation-based methods separate adjacent texts by shrinking the text regions, but this introduces extra properties to predict and an inflexible post-expansion step; regression-based methods select the target text for each candidate box by maximum intersection-over-union (IoU), but because the maximum-IoU target changes from box to box, target confusion arises at test time. Arbitrarily-shaped adjacent text is common in scene images, yet few methods address both arbitrary shapes and adjacency.
Disclosure of Invention
The invention aims to improve the capability of detecting any-shape adjacent texts in a scene image based on a regression two-stage model, and provides a detection system and method for any-shape adjacent texts in the scene image.
In order to achieve the purpose, the invention adopts the following technical scheme:
a detection system for arbitrarily-shaped adjacent texts in scene images comprises:
the feature extraction module, consisting of a 50-layer residual network and a feature pyramid network, wherein the residual network is a convolutional neural network used for extracting visual features of different scales in a bottom-up manner, and the feature pyramid network, formed by lateral connections and a top-down pathway, fuses the visual features of different scales to obtain richer visual features, i.e., the fusion features;
the region suggestion generation module is used for presetting a plurality of different anchor frames at each position of the fusion features, and generating a series of region suggestions through classification and regression;
and the detection head module is used for processing each region suggestion independently, extracting the features corresponding to the region suggestions from the fusion features according to the coordinates in the region suggestions to obtain the region suggestion features, and then classifying and regressing on the basis of the region suggestion features to obtain the text detection result of the scene image.
Preferably, when the detection head module is trained, the region suggestion generation module computes the intersection-over-union between each generated region suggestion and the rectangular box of the corresponding text in the input image, determines positive and negative samples accordingly, and selects region suggestions with a certain ratio of positive to negative samples to train the detection head module.
Preferably, a region suggestion is a positive sample if its intersection-over-union (IoU) is greater than 0.7 and a negative sample if its IoU is less than 0.3; the ratio of positive to negative samples is 3:1.
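As an illustration of this sampling rule (a plain-Python sketch, not the patented implementation), proposals can be labeled from their intersection-over-union with the ground-truth text boxes:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label each proposal 1 (positive), 0 (negative) or -1 (ignored)."""
    labels = []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        labels.append(1 if best > pos_thr else 0 if best < neg_thr else -1)
    return labels

gt = [(0, 0, 10, 10)]
props = [(0, 0, 10, 10), (0, 0, 5, 10), (20, 20, 30, 30)]
print(label_proposals(props, gt))  # [1, -1, 0]
```

Proposals whose best IoU falls between the two thresholds are simply ignored during sampling, which matches the usual two-stage detector convention.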
Preferably, the detection head module includes a proposal feature attention module (PFAM); the PFAM uses a perceptron to generate an attention weight for the current region-suggestion feature, and the attention weight is multiplied element-wise with the region-suggestion feature to obtain an optimized feature that adaptively attends to text features and removes background noise.
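The gating idea can be sketched as follows. This NumPy toy is an illustration only: random matrices stand in for the learned perceptron weights, and the exact perceptron architecture is an assumption, not taken from the patent text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pfam(feat, w1, w2):
    """Sketch of proposal-feature attention: a small perceptron maps the
    proposal feature to same-sized weights w_a in (0, 1), which gate the
    feature element-wise so background positions can be suppressed."""
    hidden = np.maximum(0.0, feat @ w1)  # ReLU hidden layer
    attn = sigmoid(hidden @ w2)          # attention weights w_a
    return attn * feat                   # element-wise gating

d = 16
feat = rng.standard_normal(d)
w1 = rng.standard_normal((d, d // 4))   # random stand-ins for learned weights
w2 = rng.standard_normal((d // 4, d))
out = pfam(feat, w1, w2)
assert out.shape == feat.shape
assert np.all(np.abs(out) <= np.abs(feat) + 1e-9)  # gating never amplifies
```

Because the sigmoid keeps every weight in (0, 1), the module can only attenuate feature positions, which is what lets it suppress background responses without changing the feature's size.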
Preferably, the detection head module comprises two convolutional layers, two fully-connected layers and two PFAMs, arranged in the order of data flow as: two convolutional layers, one fully-connected layer, one PFAM, another fully-connected layer, and another PFAM.
Preferably, the detection head module is trained with a one-to-many training strategy (OMTS), so that each region suggestion learns an optimized target when multiple labeled text instances are present; OMTS adds a detection branch to the detection head module so that each region suggestion yields two detection results, which are then supervised by two matched text instances to train the model.
Preferably, during training, each region suggestion is matched to two text instances according to the intersection-over-union; if a region suggestion can be matched to only one text instance, the other target is matched to the background.
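A minimal sketch of this matching rule might look like the following; note that the 0.5 match threshold is an assumed value for illustration, not a figure stated in the text.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def omts_match(proposal, gt_boxes, match_thr=0.5):
    """One-to-many matching sketch: keep up to two ground-truth texts by
    IoU; any unfilled slot becomes a background target (None)."""
    scored = sorted(((iou(proposal, g), g) for g in gt_boxes), reverse=True)
    targets = [g for s, g in scored[:2] if s >= match_thr]
    while len(targets) < 2:
        targets.append(None)  # supervised as background
    return targets

# a proposal straddling two adjacent texts matches both of them
gt = [(0, 0, 10, 10), (10, 0, 20, 10), (40, 40, 50, 50)]
print(omts_match((0, 0, 20, 10), gt))
```

A proposal far from every text instance gets two background targets, so the same head can still be trained on negative samples.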
A detection method for any-shape adjacent text in a scene image is realized based on the system and comprises the following steps:
extracting visual features of different scales from the scene image by using a feature extraction module, and fusing to obtain fusion features;
presetting several anchor boxes of different sizes at each position of the fusion features by using the region suggestion generation module, and generating a series of region suggestions through classification and regression;
extracting the characteristics corresponding to the region suggestions from the fusion characteristics by using the detection head module according to the coordinates in the region suggestions to obtain region suggestion characteristics, and classifying and regressing on the basis of the region suggestion characteristics;
and processing the scene image of the training data by using the steps, optimizing and training the whole system according to a one-to-many training strategy, detecting the scene image by using the trained system, and acquiring a text detection result in the scene image.
Preferably, the detection head module generates a corresponding attention weight for each region-suggestion feature, and multiplies the region-suggestion feature element-wise by the attention weight to obtain an optimized feature that adaptively attends to text features, removing background noise.
Compared with existing methods, the proposed method is simple and effective: a proposal feature attention module and a one-to-many training strategy are added to the classic general object detector Faster R-CNN, and the two additions have essentially no impact on speed. The attention mechanism adaptively focuses each region-proposal feature on text features, suppressing background noise; the one-to-many training strategy lets each region suggestion regress multiple instances, so that it learns a more appropriate target when multiple text instances are present. Experimental results on several general text-detection datasets show that the method obtains better detection results than previous methods.
Drawings
Fig. 1 is a network structure diagram of an embodiment of a detection system for arbitrarily-shaped neighboring texts in a scene image.
Fig. 2 is a graph of the visualization of the detection of the present invention on different data.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
The embodiment provides a detection system for an arbitrary-shaped adjacent text in a scene image, as shown in fig. 1, the system is a regression-based two-stage detection model, and the whole model is composed of three parts: the system comprises a feature extraction module, a region suggestion generation module and a detection head module.
The feature extraction module comprises a residual network and a feature pyramid network. The residual network is a convolutional neural network that extracts visual features of different scales in a bottom-up manner; the feature pyramid network, formed by lateral connections and a top-down pathway, then fuses the features of different scales so that texts with large scale variations can be handled better, yielding richer visual features, i.e., the fusion features, which are passed to the region suggestion generation module and the detection head module.
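The top-down fusion described above can be sketched in a few lines. The following toy NumPy version is an illustration, not the patent's implementation: it assumes the 1x1 lateral convolutions of a real feature pyramid network have already equalized channel counts, and fuses levels by nearest-neighbour upsampling and addition.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(c_feats):
    """Fuse bottom-up features (finest first) with a top-down pathway.

    Every level is assumed to already share one channel count, standing
    in for the lateral 1x1 convolutions of a real feature pyramid network.
    """
    fused = [c_feats[-1]]                        # start from the coarsest level
    for c in reversed(c_feats[:-1]):             # walk toward finer levels
        fused.append(c + upsample2x(fused[-1]))  # lateral sum + top-down signal
    return fused[::-1]                           # finest first again

# toy pyramid: 8 channels, spatial sizes halving level by level
feats = [np.random.rand(8, 2 ** k, 2 ** k) for k in (5, 4, 3, 2)]
fused = fpn_fuse(feats)
print([f.shape for f in fused])  # each fused level keeps its input shape
```

Each fused level keeps the spatial resolution of its bottom-up input while also carrying coarser-scale context, which is what lets the detector handle large scale variations in text.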
The region suggestion generation module generates a series of region suggestions on the basis of the fusion features obtained by the feature extraction module. First, several anchor boxes of different sizes and aspect ratios are preset at each position of the fusion features; the anchor boxes are then classified and regressed to obtain a preliminary detection result, i.e., the region suggestions. During training, the intersection-over-union (IoU) between each region suggestion and the rectangular boxes of the texts in the input image is computed to determine positive and negative samples: a suggestion is positive if its IoU is greater than 0.7 and negative if its IoU is less than 0.3. A certain number of region suggestions with a positive-to-negative ratio of 3:1 are selected to train the subsequent detection head module; at test time, the region suggestions serve as the input of the detection head module to predict the final text detection result.
The detection head module processes each region suggestion independently. The region suggestions are a preliminary detection result; the features corresponding to each region suggestion are extracted from the fusion features according to the coordinates of the suggestion, yielding the region-suggestion features, on which classification and regression are then performed to obtain the final text detection result.
In the detection head module, the invention first proposes a proposal feature attention module (PFAM) that mines more effective features for each region suggestion, so as to better adapt to arbitrarily-shaped adjacent text instances. Specifically, the module uses a perceptron to generate attention weights w_a for the current region-suggestion feature; the weights have the same size as the region-suggestion feature, and multiplying the two element-wise yields a new optimized feature that adaptively focuses on text features and removes background noise. As shown in fig. 1, a PFAM module can be attached directly after each fully-connected layer (fc). Second, a one-to-many training strategy (OMTS) is designed so that region suggestions learn more appropriate targets when multiple text instances are present, eliminating confusion. Specifically, considering the actual distribution of text instances, each region suggestion is matched to two text instances by intersection-over-union (IoU) during training; if a region suggestion can be matched to only one text instance by IoU, the other target is set to the background. Meanwhile, a detection branch is added to the detection head module so that each region suggestion produces two detection results, which are supervised by the two matched text instances to train the model. With the one-to-many training strategy, when adjacent text instances appear at test time, a region suggestion can better select the target text instance to be regressed and classified.
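The two-branch supervision can be illustrated with a toy loss. This is a hedged sketch: the patent does not specify the loss functions, so smooth-L1 box regression and a two-class (text vs. background) cross-entropy are assumed here for illustration.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth-L1 distance between a predicted and a target box."""
    d = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    return float(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum())

def omts_loss(branch_preds, matched_targets):
    """Supervise the head's two detection branches with the two matched
    targets: box regression plus a 'text' term when a target box exists,
    a 'background' term otherwise (two-class cross-entropy written out)."""
    total = 0.0
    for (box_pred, p_text), target in zip(branch_preds, matched_targets):
        if target is None:                       # background slot
            total += -np.log(1.0 - p_text + 1e-9)
        else:                                    # matched text slot
            total += -np.log(p_text + 1e-9) + smooth_l1(box_pred, target)
    return total

# two branches, two adjacent ground-truth texts
preds = [((0, 0, 10, 10), 0.9), ((9, 0, 21, 10), 0.8)]
targets = [(0, 0, 10, 10), (10, 0, 20, 10)]
print(omts_loss(preds, targets))
```

Each branch is penalized against its own matched target, so a proposal overlapping two adjacent texts contributes a well-defined gradient for both, rather than an ambiguous one.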
The invention also provides a detection method for the adjacent text with any shape in the scene image, which is realized by the system, and the whole process comprises the following steps:
s1: the input picture is extracted by a feature extraction module to fuse visual features of different scales, namely fusion features.
S2: the fusion features are processed by a region suggestion generation module to generate a large number of region suggestions.
S3: and extracting the visual features corresponding to the region suggestions and then processing the visual features into region suggestion features with fixed size and dimensionality.
S4: and performing convolution, full connection layer and region suggestion feature attention module on each region suggestion feature to obtain an optimization feature of the self-adaptive attention text.
S5: and classifying and regressing by using the optimized features, optimally training the whole system model by using a one-to-many training strategy, detecting scene images after training, and acquiring texts.
The method was evaluated with extensive experiments on four mainstream scene-text detection datasets: CTW1500, Total-Text, ICDAR2015 and MSRA-TD500. For fairness, the model was pre-trained on generated data when compared with other methods. CTW1500 has 1000 training images and 500 test images and contains many long curved texts; Total-Text has 1255 training images and 300 test images, containing horizontal, multi-oriented and curved text; ICDAR2015, a multi-oriented text dataset in which most images are of lower quality, has 1000 training images and 500 test images; MSRA-TD500 has 300 training images and 200 test images, mostly text with large aspect ratios, and following prior work the 400 images of HUST-TR400 were added as training images.
Table 1 shows the effect comparison among the modules of the present invention, and the result proves that the performance improvement can be brought by the area recommendation feature attention module and the one-to-many training strategy provided by the present invention, and at the same time, the two modules have complementarity, which together can bring more obvious improvement.
Table 1 comparative experiments on the respective modules
[Table reproduced as an image in the original publication; not shown here.]
Meanwhile, in order to further verify the effectiveness of the one-to-many training strategy on the adjacent text with any shape, the experiment rotates the standard CTW1500 and ICDAR2015 test sets by different angles, and the experimental result is shown in table 2, so that the performance improvement is very obvious when the one-to-many training strategy is used.
Table 2 effects of OMTS on different rotation angle test sets of CTW1500 and ICDAR2015
[Table reproduced as an image in the original publication; not shown here.]
Tables 3 and 4 compare the invention with other mainstream methods on the test datasets. The invention achieves the best performance on multiple datasets, demonstrating its effectiveness, while its speed (FPS) is higher than that of most methods.
TABLE 3 test results for CTW1500 and totaltext data sets
[Table reproduced as an image in the original publication; not shown here.]
Note: table represents pre-training the model using the real dataset.
TABLE 4 detection results of ICDAR2015 and MSRA-TD500 datasets
[Table reproduced as an image in the original publication; not shown here.]
Note: representation Pre-training model Using real dataset
Fig. 2 shows the visualized results of the text detection on different data sets, and it can be seen intuitively that the invention has better detection results on various data sets.
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A detection system for arbitrarily-shaped adjacent texts in scene images is characterized by comprising:
the characteristic extraction module consists of a residual error network with 50 layers and a characteristic pyramid network, wherein the residual error network is a convolutional neural network and is used for extracting visual characteristics with different scales from bottom to top; the characteristic pyramid network is formed by a transverse connection and a top-down connection and is used for fusing visual characteristics of different scales to obtain fused characteristics;
the region suggestion generation module is used for presetting a plurality of different anchor frames at each position of the fusion features, and generating a series of region suggestions through classification and regression;
and the detection head module is used for processing each region suggestion independently, extracting the features corresponding to the region suggestions from the fusion features according to the coordinates in the region suggestions to obtain the region suggestion features, and then classifying and regressing on the basis of the region suggestion features to obtain the text detection result of the scene image.
2. The system of claim 1, wherein when the detection head module is trained, the region suggestion generation module calculates an intersection ratio according to the generated region suggestion and a rectangular frame corresponding to a text in the input image, determines positive and negative samples according to the intersection ratio, and selects the region suggestion with the positive and negative samples in a certain proportion to train the detection head module.
3. The system of claim 2, wherein a region suggestion is a positive sample if the intersection-over-union is greater than 0.7 and a negative sample if the intersection-over-union is less than 0.3.
4. The system of claim 2, wherein the positive to negative sample ratio is 3: 1.
5. The system of claim 1, wherein the detection head module comprises a regional suggestion feature attention module PFAM, the PFAM configured to generate an attention weight corresponding to the current regional suggestion feature using a perceptron module, and the attention weight multiplied by the corresponding position of the regional suggestion to obtain an optimized feature of the adaptive attention text feature for removing the background noise.
6. The system of claim 5, wherein the detection head module comprises two convolutional layers, two fully-connected layers and two PFAMs, arranged in the order of data flow as: two convolutional layers, one fully-connected layer, one PFAM, another fully-connected layer, and another PFAM.
7. The system of claim 1, wherein the detection head module is trained according to a one-to-many training strategy, OMTS, to optimize the learning of the region suggestions in the presence of multiple labeled text instances; the OMTS is a detection branch added in a detection head module, two detection results are suggested for each region, and then the two detection results are supervised by using two matched text instances to carry out model training.
8. The system of claim 7, wherein during training, two text instances are matched for each region suggestion based on a cross-over ratio, and if one region suggestion can only be matched to one text instance based on a cross-over ratio, the other is matched to a background.
9. A detection method for any shape adjacent text in a scene image is realized based on the system of any one of claims 1-8, and is characterized by comprising the following steps:
extracting visual features of different scales from the scene image by using a feature extraction module, and fusing to obtain fusion features;
presetting a plurality of different anchor frames at each fused position by using a region suggestion generation module, and generating a series of region suggestions through classification and regression;
extracting the characteristics corresponding to the region suggestions from the fusion characteristics by using the detection head module according to the coordinates in the region suggestions to obtain region suggestion characteristics, and classifying and regressing on the basis of the region suggestion characteristics;
and processing the scene image of the training data by using the steps, optimizing and training the whole system according to a one-to-many training strategy, detecting the scene image by using the trained system, and acquiring a text detection result in the scene image.
10. The method as claimed in claim 9, wherein the detecting head module is used to generate a corresponding attention weight according to each region suggestion feature, and the region suggestion feature and the position corresponding to the attention weight are multiplied to obtain an optimized feature of the adaptive attention text feature, so as to remove the background noise.
CN202111004566.8A 2021-08-30 2021-08-30 Detection system and method for any-shape adjacent text in scene image Pending CN113887282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111004566.8A CN113887282A (en) 2021-08-30 2021-08-30 Detection system and method for any-shape adjacent text in scene image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111004566.8A CN113887282A (en) 2021-08-30 2021-08-30 Detection system and method for any-shape adjacent text in scene image

Publications (1)

Publication Number Publication Date
CN113887282A true CN113887282A (en) 2022-01-04

Family

ID=79011582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111004566.8A Pending CN113887282A (en) 2021-08-30 2021-08-30 Detection system and method for any-shape adjacent text in scene image

Country Status (1)

Country Link
CN (1) CN113887282A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677691A (en) * 2022-04-06 2022-06-28 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN116258915A (en) * 2023-05-15 2023-06-13 深圳须弥云图空间科技有限公司 Method and device for jointly detecting multiple target parts

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704859A (en) * 2017-11-01 2018-02-16 哈尔滨工业大学深圳研究生院 A kind of character recognition method based on deep learning training framework
WO2018054326A1 (en) * 2016-09-22 2018-03-29 北京市商汤科技开发有限公司 Character detection method and device, and character detection training method and device
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110689012A (en) * 2019-10-08 2020-01-14 山东浪潮人工智能研究院有限公司 End-to-end natural scene text recognition method and system
CN110807422A (en) * 2019-10-31 2020-02-18 华南理工大学 Natural scene text detection method based on deep learning
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection
CN111553347A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Scene text detection method oriented to any angle
WO2020173036A1 (en) * 2019-02-26 2020-09-03 博众精工科技股份有限公司 Localization method and system based on deep learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018054326A1 (en) * 2016-09-22 2018-03-29 北京市商汤科技开发有限公司 Character detection method and device, and character detection training method and device
CN107704859A (en) * 2017-11-01 2018-02-16 哈尔滨工业大学深圳研究生院 A kind of character recognition method based on deep learning training framework
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
WO2020173036A1 (en) * 2019-02-26 2020-09-03 博众精工科技股份有限公司 Localization method and system based on deep learning
CN110689012A (en) * 2019-10-08 2020-01-14 山东浪潮人工智能研究院有限公司 End-to-end natural scene text recognition method and system
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection
CN110807422A (en) * 2019-10-31 2020-02-18 华南理工大学 Natural scene text detection method based on deep learning
CN111553347A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Scene text detection method oriented to any angle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李瑞; 王朝坤; 郑伟; ***; 王伟平: "Near-duplicate text detection based on the MapReduce framework" (基于MapReduce框架的近似复制文本检测), 27th China National Database Conference, 13 October 2010 (2010-10-13) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677691A (en) * 2022-04-06 2022-06-28 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114677691B (en) * 2022-04-06 2023-10-03 北京百度网讯科技有限公司 Text recognition method, device, electronic equipment and storage medium
CN116258915A (en) * 2023-05-15 2023-06-13 深圳须弥云图空间科技有限公司 Method and device for jointly detecting multiple target parts
CN116258915B (en) * 2023-05-15 2023-08-29 深圳须弥云图空间科技有限公司 Method and device for jointly detecting multiple target parts

Similar Documents

Publication Publication Date Title
CN112884064B (en) Target detection and identification method based on neural network
Liu et al. Picanet: Learning pixel-wise contextual attention for saliency detection
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
CN111860171B (en) Method and system for detecting irregular-shaped target in large-scale remote sensing image
CN114445706A (en) Power transmission line target detection and identification method based on feature fusion
CN109614979A (en) A kind of data augmentation method and image classification method based on selection with generation
CN113887282A (en) Detection system and method for any-shape adjacent text in scene image
JP2022174707A (en) Pedestrian re-identification system and method based on space sequence feature learning
CN107256221A (en) Video presentation method based on multi-feature fusion
CN110188654B (en) Video behavior identification method based on mobile uncut network
CN113643228B (en) Nuclear power station equipment surface defect detection method based on improved CenterNet network
CN104794455B (en) A kind of Dongba pictograph recognition methods
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN110009622B (en) Display panel appearance defect detection network and defect detection method thereof
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN107247952A (en) The vision significance detection method for the cyclic convolution neutral net supervised based on deep layer
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
CN116402769A (en) High-precision intelligent detection method for textile flaws considering size targets
CN113658206B (en) Plant leaf segmentation method
Dvoršak et al. Kinship verification from ear images: An explorative study with deep learning models
CN113901924A (en) Document table detection method and device
CN117576038A (en) Fabric flaw detection method and system based on YOLOv8 network
CN115861306B (en) Industrial product abnormality detection method based on self-supervision jigsaw module
CN114821174B (en) Content perception-based transmission line aerial image data cleaning method
CN116205883A (en) PCB surface defect detection method, system, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination