CN112926582B - Text detection method based on adaptive feature selection and scale loss function - Google Patents


Info

Publication number
CN112926582B
Authority
CN
China
Prior art keywords
text
feature
loss function
scale
adaptive
Prior art date
Legal status
Active
Application number
CN202110341740.1A
Other languages
Chinese (zh)
Other versions
CN112926582A (en)
Inventor
吴秦
骆文莉
柴志雷
肖志勇
陈璟
刘登峰
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202110341740.1A priority Critical patent/CN112926582B/en
Publication of CN112926582A publication Critical patent/CN112926582A/en
Application granted granted Critical
Publication of CN112926582B publication Critical patent/CN112926582B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition


Abstract

The invention discloses a text detection method based on adaptive feature selection and a scale loss function, comprising the following steps: acquiring text features in an image with a backbone network and extracting basic features through a feature pyramid network; extracting more representative feature information from the basic features by adaptive feature selection; and segmenting and expanding the representative feature information with a progressive scale expansion algorithm to obtain the final detection result. The method applies deformable convolution to text exhibiting geometric deformation, so that the network can adapt to arbitrary text shapes and detect text of different sizes; richer and more accurate features can be extracted, the problem of large text scale variation within an image is addressed, and false detections are effectively reduced.

Description

Text detection method based on adaptive feature selection and scale loss function
Technical Field
The invention belongs to the technical field of computer vision and text detection, and particularly relates to a text detection method based on adaptive feature selection and a scale loss function.
Background
Characters are an important way of conveying information and are ubiquitous in natural scene images, for example on road signs, car logos, and product names. Compared with other content in a natural scene (such as trees or pedestrians), text conveys richer information, so accurately recognizing text in images aids the analysis and understanding of the scene; text detection, as an essential prerequisite of text recognition, is all the more important.
Important applications of text detection in intelligent transportation systems, guidance for the visually impaired, image/video retrieval, and similar areas have made it a research hotspot in computer vision. At present, most research on text detection is based on deep learning, and the methods mainly fall into two categories: regression-based methods and segmentation-based methods. Regression-based methods are generally improvements of generic object detection methods that localize text with rectangular or quadrilateral bounding boxes. However, such methods perform poorly on curved text. Segmentation-based text detection methods can detect text of arbitrary shape, but their accuracy still needs improvement.
Currently, most text detection algorithms still face three limitations. First, conventional convolution has only a fixed receptive field, cannot adapt to the various shapes of text, and detects arbitrarily shaped text poorly. Second, natural scenes contain many objects similar to text, such as railings and discs, which make the model difficult to fit; many non-text objects are easily detected as text, causing false detections. Third, the diversity of text scales caused by differences in shooting angle and text instance size also makes detection very difficult: undersized and oversized text is hard to detect at the same time.
Therefore, to address the three limitations above, a new text detection model needs to be provided, so that the network can adapt to any text shape and detect text of different sizes, while also extracting richer and more accurate features and effectively reducing false detections.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned problems occurring in the conventional text detection process.
Therefore, the technical problems solved by the invention are as follows: conventional convolution has only a fixed receptive field, cannot adapt to the various shapes of text, and detects arbitrarily shaped text poorly; objects resembling text in natural scenes make the model difficult to fit, and many non-text objects are easily detected as text, causing false detections; and text that is too small or too large cannot be detected at the same time.
In order to solve the above technical problems, the invention provides the following technical scheme: acquiring text features in an image with a backbone network and extracting basic features through a feature pyramid network; extracting more representative feature information from the basic features by adaptive feature selection; and segmenting and expanding the representative feature information with a progressive expansion algorithm to obtain the final detection result.
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein: the method for acquiring the text features in the image by using the backbone network comprises the steps that the backbone network selects a deformable convolution network, and in the last three stages of the network, 3-by-3 deformable convolution is used for replacing general convolution so as to adapt to the geometric shape of an object.
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein: the extracting of the basic features through the feature pyramid network comprises the step of extracting the acquired text features C2、C3、C4、C5Obtaining feature maps P with different scales through a feature pyramid network2、P3、P4、P5And the number of channels per scale feature is 256, and P is3、P4、P5Respectively obtaining sum P by 2, 4 and 8 times of upsampling2And finally, performing a connecting operation on the 4 feature maps with the same size to obtain a basic feature X of subsequent processing, wherein the size of the X is 1/4 of the input original image, and the number of channels of the basic feature is 1024.
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein: the extracting of the more representative feature information comprises the steps of enhancing features related to texts and inhibiting features unrelated to the texts by using a self-adaptive feature selection module, and setting the basic features acquired from the feature pyramid network as follows:
X=[X1,X2,…,XC],X∈RC×H×W
wherein: c is the channel number of the feature map, H, W is the height and width of the feature map, respectively, the average value of the features of all pixels in the feature map of each channel of the channels of the basic features is calculated by using the global average pooling operation, and the value representing the feature map of the corresponding channel is output:
z = [z_1, z_2, …, z_C]^T
Two fully-connected layers are used to capture the weights between different channels; the output after the two fully-connected layers is:

v = σ(W_2 δ(W_1 z))
wherein: δ and σ are the ReLU and Sigmoid operations, respectively, z ∈ R^C, W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r)); to reduce the number of parameters, the channels of the first fully-connected layer are reduced to C/r, with r set to 16. The adaptive feature X̃ is obtained by applying the channel weights to the basic feature X, and the calculation formula of X̃ is:

X̃_c = v_c · X_c
The features X and X̃ are then added to obtain the feature map F, i.e. F = X + X̃.
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein: the obtaining of the final detection result by using the progressive addition algorithm includes, in order to distinguish closely adjacent texts, projecting the representative feature information as an input feature map to generate a plurality of segmentation results, and gradually adding the segmentation result with the smallest scale to a complete shape of the segmentation result with the largest scale by using the progressive addition algorithm to obtain the final detection result.
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein: the method also comprises the step of training the backbone network, the self-adaptive feature selection and the progressive expansion algorithm by using the loss function, and in the training process of the network, different labels are required to be generated to guide the optimization of the loss function.
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein: the generating different labels includes that the progressive scale expansion generates a plurality of segmentation results, and labels corresponding to the segmentation results are needed, and a calculation formula of the labels is as follows:
Figure GDA0003307819830000038
wherein: p_n is the original label polygon of the text region, d_i is the distance by which the original polygon must shrink inward, S(p_n) is the area of polygon p_n, C(p_n) is the perimeter of polygon p_n, and r_i is expressed as:

r_i = 1 − (1 − m) × (n − i) / (n − 1)
wherein: m is the minimum scale ratio, with value in the range (0, 1], and n is the total number of segmentation results. For each shrunken polygon p_i, the pixels inside p_i are set to 1 and the rest to 0; the same operation is performed for all polygons in one image to obtain the label G_i.
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein: the loss function comprises the following steps of utilizing a loss function aiming at the scale perception of a text example to assign different weights to texts with different sizes, and solving the loss problem of a small text example, wherein the calculation formula of the loss function is as follows:
L=(1-α)LS+αLH
wherein: l isSFor loss of reduced segmentation map, LHFor the loss of information about the text scale, α is the balance LSAnd LHThe weight of (c).
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein: the text example further comprises the following steps of designing weights of different texts by using the approximate heights of the text examples, setting different formulas for calculation when calculating the heights of the text examples in different data sets, and calculating the average height B of all the text examples in the same image according to the heights of the text examples as follows:
Figure GDA0003307819830000041
wherein: q is the number of text instances in the same image, HiIs the approximate height of the ith text instance.
As a preferred embodiment of the text detection method based on adaptive feature selection and scale loss function according to the present invention, wherein:
Different text instances have different effects on the loss function, so different text instances T_e need to be assigned different weights μ_e; the calculation formula of the weight μ_e is:

μ_e = B / H_e
For non-text pixels, the weight is set to 0. After the weight of each pixel is determined, the loss L_H reflecting text scale information can be expressed as:

L_H = 1 − Dice(S_n · M · μ, G_n · M · μ)
wherein: m is a training mask selected by "on-line inexperienced mining" (OHEM), μ is a weight matrix for each image, SnIs the largest scale segmentation result, GnIs the corresponding label.
The invention has the following beneficial effects: deformable convolution is applied to text exhibiting geometric deformation, so that the network can adapt to arbitrary text shapes and detect text of different sizes; richer and more accurate features can be extracted, the problem of large text scale variation within an image is addressed, and false detections are effectively reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise. Wherein:
fig. 1 is a schematic basic flow chart of a text detection method based on adaptive feature selection and a scale loss function according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of an overall network structure of a text detection method based on adaptive feature selection and a scale loss function according to a first embodiment of the present invention;
FIG. 3 is a diagram illustrating a general convolution and a deformable convolution of a text detection method based on adaptive feature selection and a scale loss function according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of training label generation of a text detection method based on adaptive feature selection and a scale loss function according to a first embodiment of the present invention;
FIG. 5 is a schematic diagram of a text detection method based on adaptive feature selection and a scale loss function according to a first embodiment of the present invention, in which the text scales in ICDAR2017-MLT data sets are greatly different;
FIG. 6 is a schematic diagram of the calculation of the height of the loss function on the ICDAR2015 and ICDAR2017-MLT data sets according to the text detection method based on adaptive feature selection and scale loss function provided by the first embodiment of the present invention;
fig. 7 is a schematic diagram illustrating calculation of a loss function height on a CTW1500 data set by a text detection method based on adaptive feature selection and a scale loss function according to a first embodiment of the present invention;
fig. 8 is a schematic diagram illustrating that different weights are assigned to text instances of different sizes on a scale loss function in a text detection method based on adaptive feature selection and the scale loss function according to a first embodiment of the present invention;
fig. 9 is a schematic diagram of verification results of a text detection method based on adaptive feature selection and scale loss function according to a second embodiment of the present invention, in which different parameters α are selected in the loss function portion on the ICDAR2015 and CTW1500 data sets;
fig. 10 is a schematic diagram of verification results of selecting two different ways to perform a scale loss function height design on a CTW1500 data set according to a text detection method based on adaptive feature selection and a scale loss function provided in the second embodiment of the present invention;
fig. 11 is a schematic diagram illustrating the effect of detecting on an ICDAR2015 data set of the text detection method based on adaptive feature selection and scale loss function according to the second embodiment of the present invention;
fig. 12 is a schematic diagram illustrating the effect of the text detection method based on adaptive feature selection and scale loss function on the CTW1500 data set according to the second embodiment of the present invention;
FIG. 13 is a schematic diagram of the effect of the text detection method based on adaptive feature selection and scale-loss function on ICDAR2017-MLT data set according to the second embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention will be described in detail with reference to the drawings, wherein the cross-sectional views illustrating the structure of the device are not enlarged partially in general scale for convenience of illustration, and the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three-dimensional dimensions of length, width and depth should be included in the actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Example 1
Referring to fig. 1 to 8, for an embodiment of the present invention, a text detection method based on adaptive feature selection and a scale loss function is provided, including:
s1: and acquiring text features in the image by using the backbone network, and extracting basic features through the feature pyramid network. In which it is to be noted that,
Referring to fig. 2, in order to obtain text features that adapt to geometric changes, the backbone network is a deformable convolutional ResNet50: the features from conv2_x to conv5_x of ResNet50 are fed to a Feature Pyramid Network (FPN) to obtain the basic features, and in the last three stages of the network (conv3_x to conv5_x), 3×3 deformable convolution replaces ordinary convolution to adapt to the geometric shape of objects. Referring to fig. 3, which contrasts ordinary and deformable convolution: fig. 3(a) shows that ordinary convolution uses a fixed-size filter operating on a predefined rectangular sampling grid and cannot adapt to large changes in text geometry; fig. 3(b) shows deformable convolution, in which each grid point is adjusted by a learnable offset and can adapt to the geometry of the object.
The outputs C_2, C_3, C_4, C_5 of ResNet50 pass through the feature pyramid network to obtain feature maps P_2, P_3, P_4, P_5 of different scales, each with 256 channels; P_3, P_4, P_5 are upsampled by 2, 4, and 8 times respectively to obtain feature maps of the same size as P_2; finally, the 4 same-sized feature maps are concatenated to obtain the basic feature X for subsequent processing, where the size of X is 1/4 of the input image and the number of channels of the basic feature is 1024.
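This fusion step can be sketched as follows — a minimal NumPy illustration rather than the network implementation; nearest-neighbour upsampling stands in for the interpolation actually used, and the function names are hypothetical:

```python
import numpy as np

def upsample_nearest(x, factor):
    # Nearest-neighbour upsampling of a (C, H, W) feature map.
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_fpn_features(p2, p3, p4, p5):
    # Upsample P3, P4, P5 to the size of P2 (by 2x, 4x, 8x) and
    # concatenate along the channel axis to form the basic feature X.
    parts = [p2,
             upsample_nearest(p3, 2),
             upsample_nearest(p4, 4),
             upsample_nearest(p5, 8)]
    return np.concatenate(parts, axis=0)

# Toy check: 256-channel maps at strides 4, 8, 16, 32 of a 64x64 input.
p2 = np.zeros((256, 16, 16))
p3 = np.zeros((256, 8, 8))
p4 = np.zeros((256, 4, 4))
p5 = np.zeros((256, 2, 2))
X = fuse_fpn_features(p2, p3, p4, p5)
print(X.shape)  # (1024, 16, 16): 1024 channels at 1/4 of the input size
```

The shapes match the description above: four 256-channel maps concatenated give the 1024-channel basic feature at 1/4 resolution.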
S2: and extracting more representative characteristic information from the basic characteristics by utilizing self-adaptive characteristic selection. In which it is to be noted that,
In order to extract more representative features, an adaptive feature selection module is used to enhance text-related features and suppress text-irrelevant features; the basic feature acquired from the feature pyramid network is set as follows:

X = [X_1, X_2, …, X_C], X ∈ R^(C×H×W)
wherein: c is the channel number of the feature map, H, W is the height and width of the feature map, the average value of the features of all pixels in the feature map of each channel of the basic feature is calculated by using the global average pooling operation, and the output of the C channel of the basic feature is set as zc,zcThe calculation formula of (a) is as follows:
Figure GDA0003307819830000081
wherein: x is the number ofc(i, j) is a feature map X corresponding to the c-th channelcThe characteristic value at the (i, j) position,the values representing the corresponding channel profile are therefore:
z=[z1,z2,…,zC]T
Two fully-connected layers are used to capture the weights between different channels; the output after the two fully-connected layers is:

v = σ(W_2 δ(W_1 z))
wherein: δ and σ are the ReLU and Sigmoid operations, respectively, z ∈ R^C, W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r)); to reduce the number of parameters, the channels of the first fully-connected layer are reduced to C/r, with r set to 16. The adaptive feature X̃ is obtained by applying the channel weights to the basic feature X, and the calculation formula of X̃ is:

X̃_c = v_c · X_c
The features X and X̃ are then added to obtain the feature map F, i.e. F = X + X̃.
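The adaptive feature selection described above can be sketched numerically — a NumPy toy assuming given (rather than learned) weight matrices W1 and W2, with illustrative names throughout:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def adaptive_feature_selection(X, W1, W2):
    # X: (C, H, W) basic feature; W1: (C//r, C); W2: (C, C//r).
    z = X.mean(axis=(1, 2))                  # global average pooling, z in R^C
    v = sigmoid(W2 @ np.maximum(W1 @ z, 0))  # two FC layers: ReLU then Sigmoid
    X_tilde = v[:, None, None] * X           # reweight channels: X~_c = v_c * X_c
    return X + X_tilde                       # F = X + X~

rng = np.random.default_rng(0)
C, H, W, r = 32, 8, 8, 16
X = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
F = adaptive_feature_selection(X, W1, W2)
print(F.shape)  # (32, 8, 8): same shape as X
```

Note that with zero weight matrices v = sigmoid(0) = 0.5 everywhere, so F = 1.5·X — the residual addition guarantees the basic feature always passes through.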
S3: and segmenting and expanding the representative characteristic information by utilizing a progressive expansion algorithm, and obtaining a final detection result. In which it is to be noted that,
generally, the text detection method based on segmentation is difficult to distinguish from the close text, and the invention uses a progressive expansion algorithm to have a generationThe characteristic feature information is projected as an input feature map to n branches to generate a plurality of division results (S)1,S2,…,Sn-1,Sn) In which S is1Is the result of the smallest scale segmentation, SnIs the segmentation result with the largest scale, and utilizes a progressive expansion algorithm to carry out the segmentation on the segmentation result S with the smallest scale1Step-by-step expansion into maximally scaled segmentation results SnTo obtain the final test result.
Further, the loss function is used to train the backbone network, the adaptive feature selection, and the progressive expansion algorithm; during training of the network, different labels need to be generated to guide the optimization of the loss function.
Generating the different labels comprises, referring to FIG. 4: the progressive scale expansion yields a plurality of segmentation results (S_1, S_2, …, S_{n−1}), which require corresponding labels (G_1, G_2, …, G_{n−1}); the calculation formula of the labels is:

d_i = S(p_n) × (1 − r_i²) / C(p_n)
wherein: p_n is the original label polygon of the text region, d_i is the distance by which the original polygon must shrink inward, S(p_n) is the area of polygon p_n, C(p_n) is the perimeter of polygon p_n, and r_i is expressed as:

r_i = 1 − (1 − m) × (n − i) / (n − 1)
wherein: m is the minimum scale ratio, with value in the range (0, 1], and n is the total number of segmentation results. For each shrunken polygon p_i, the pixels inside p_i are set to 1 and the rest to 0; the same operation is performed for all polygons in one image to obtain the label G_i.
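The label-generation formulas above can be checked numerically with a small sketch; in practice the shrinking itself is carried out by a polygon-clipping routine, which is omitted here, and the function name is illustrative:

```python
def shrink_offsets(area, perimeter, m, n):
    # Per-scale shrink distances d_i for a text polygon p_n:
    #   r_i = 1 - (1 - m) * (n - i) / (n - 1)
    #   d_i = area * (1 - r_i**2) / perimeter
    ds = []
    for i in range(1, n + 1):
        r_i = 1.0 - (1.0 - m) * (n - i) / (n - 1)
        ds.append(area * (1.0 - r_i ** 2) / perimeter)
    return ds

# A 40x10 rectangle (area 400, perimeter 100), minimum scale m = 0.4, n = 3.
d = shrink_offsets(area=400.0, perimeter=100.0, m=0.4, n=3)
print([round(x, 2) for x in d])  # [3.36, 2.04, 0.0]
```

As expected, the largest scale (i = n, r_n = 1) shrinks by 0 and reproduces the original polygon, while smaller scales shrink progressively more.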
Regarding the loss function: the loss function is important for training the model. For text detection, a good loss function should not only capture undetected text regions but also take text scale information into account. As noted above, the sizes of text instances may vary widely; for example, the text "50" in fig. 5 is much larger than all the other words in the figure, while the text instance "SAVE UP TO" is very small. When calculating the loss, if the same weight is applied to all positive pixels, text regions occupying a larger proportion of the image obviously contribute more to the loss than smaller ones, which is unfair to small text instances and may cause them to be missed. To address this problem, the invention proposes a novel scale-aware loss function for text instances, which assigns different weights to text of different sizes; the loss function is calculated as follows:
L=(1-α)LS+αLH
wherein: l isSFor loss of reduced segmentation map, LHFor the loss of information about the text scale, α is the balance LSAnd LHWeight of (1), loss function L of the reduced segmentation mapSIs represented as follows:
$L_S = 1 - \dfrac{1}{n-1}\displaystyle\sum_{i=1}^{n-1} \mathrm{Dice}(S_i \cdot P,\; G_i \cdot P)$
wherein: Dice is a loss function used to measure the gap between a segmentation result and its label; $(S_1, S_2, \ldots, S_{n-1})$ are the segmentation maps of the shrunk text regions, $S_n$ is the segmentation result of the complete text region, and P is a pixel mask that ignores the non-text regions on $S_n$. The value of P at pixel (x, y) is calculated by the following equation:
$P_{x,y} = \begin{cases} 1, & \text{if } S_{n,x,y} \geq 0.5 \\ 0, & \text{otherwise} \end{cases}$
wherein $S_{n,x,y}$ is the value of pixel (x, y) in the segmentation map $S_n$.
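The losses above rest on the Dice coefficient. A minimal NumPy sketch follows; the sum-of-squares normalization is an assumption borrowed from common segmentation practice rather than quoted from the patent, and the function names are illustrative:

```python
import numpy as np

def dice(s, g, eps=1e-6):
    """Dice similarity between a prediction map s and a label map g."""
    s, g = s.ravel().astype(float), g.ravel().astype(float)
    inter = (s * g).sum()
    return (2.0 * inter + eps) / ((s * s).sum() + (g * g).sum() + eps)

def loss_s(shrunk_preds, shrunk_labels, mask):
    # L_S: Dice loss averaged over the n-1 shrunk segmentation maps,
    # masked by P so that ignored non-text pixels do not contribute.
    losses = [1.0 - dice(s * mask, g * mask)
              for s, g in zip(shrunk_preds, shrunk_labels)]
    return sum(losses) / len(losses)
```

When prediction and label agree perfectly the Dice coefficient approaches 1 and the loss approaches 0, which matches the role $L_S$ plays in the total loss above.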
Still further, text font size is the most representative feature of a text instance, but it is difficult to obtain the exact size of all characters in a text instance, so the invention uses the approximate height of a text instance to design the weights of different texts. Because different datasets have different label formats, different formulas must be set to estimate the height of a text instance in each dataset. Taking the ICDAR2015 and ICDAR2017-MLT datasets as examples, the label of each text region is a quadrilateral with four vertices $0(x_0,y_0)$, $1(x_1,y_1)$, $2(x_2,y_2)$, $3(x_3,y_3)$, as shown in FIG. 6. The way the height is calculated is also shown in FIG. 6: the ordinate $y_0$ of vertex 0 is subtracted from the ordinate $y_3$ of vertex 3 to obtain $h_0$; the ordinate $y_1$ of vertex 1 is subtracted from the ordinate $y_2$ of vertex 2 to obtain $h_1$; the two values are then averaged, so $H_e$ can be expressed as:
$H_e = \dfrac{h_0 + h_1}{2}$
referring to fig. 7, a text region in the CTW1500 dataset is marked as a polygon having 14 points, 0 (x)0,y0),1(x1,y1),……,12(x12,y12),13(x13,y13) Approximate height is represented as HiThe ordinate y of the vertex 1212And the ordinate y of the vertex 11Are subtracted to obtain h1The ordinate y of the apex 1111And the ordinate y of the vertex 22Are subtracted to obtain h2The ordinate y of the apex 1010And the ordinate y of the vertex 33Are subtracted to obtain h3Vertex 9, ordinate y9And the ordinate y of the vertex 44Are subtracted to obtain h4Vertex 8 ordinate y8And the ordinate y of the vertex 55Are subtracted to obtain h5And the five values are averaged again, then H can be averagedeExpressed as:
$H_e = \dfrac{1}{5}\displaystyle\sum_{i=1}^{5} h_i$
still other methods may estimate the height of the text, e.g., the Euclidean distance, which is expressed as follows:
$H_e = \sqrt{(x_3 - x_0)^2 + (y_3 - y_0)^2}$
However, the height calculated by the Euclidean distance is less accurate. After the approximate height of each text instance is obtained, the average height of all text instances in the same image, denoted B, is:
$B = \dfrac{1}{Q}\displaystyle\sum_{e=1}^{Q} H_e$
wherein: q is the number of text instances in the same image, referring to FIG. 8, the invention is the e-th text instance TeEach pixel in (1) is assigned a weight ue,T1、T2、T3、T4Representing different text instances in an image, u1、u2、u3、u4Representing the respective weights of the four text instances, T1All pixels within a pixel are assigned the same value u1Wherein u iseThe calculation formula of (a) is as follows:
$u_e = \dfrac{B}{H_e}$
it can be found that the weight of the text instance is inversely proportional to the height of the text instance, the smaller the size of the text instance, the greater the weight, the weight is set to 0 for the non-text pixels, and then the weight matrix of each image is calculated and represented as mu, and after the weight of each pixel is determined, the loss L reflecting the text proportion information is reflectedHCan be expressed as:
$L_H = 1 - \mathrm{Dice}(S_n \cdot M \cdot \mu,\; G_n \cdot M \cdot \mu)$
wherein: m is a training mask selected by "on-line hard case mining" (OHEM), by giving different text instances TeDifferent weight mueEach text instance is made to contribute the same to the penalty function.
Example 2
Referring to FIGS. 9 to 13, another embodiment of the present invention verifies and explains the technical effects of the method: comparison tests are performed between conventional technical schemes and the method of the present invention, and the test results are compared by means of scientific demonstration to verify the real effects of the method.
The data sets used in this experiment included ICDAR2015, SCUT-CTW1500, ICDAR2017-MLT data sets, and when experiments were performed, ablation experiments were first performed on data sets ICDAR2015 and SCUT-CTW1500 to verify the effectiveness of the various modules of the present invention; finally, the modules were combined, validated on ICDAR2015, SCUT-CTW1500, ICDAR2017-MLT data sets, and compared to the most advanced methods.
ICDAR 2015: the method is a common English text detection data set, comprises 1000 training pictures and 500 testing pictures, and divides the original 1000 training pictures into 800 training pictures and 200 verification pictures for subsequent comparison experiments; SCUT-CTW 1500: the test method is a challenging curved text detection data set with 1000 training pictures and 500 test pictures, and similarly, the 1000 training pictures of the data set are divided into 800 training pictures and 200 training pictures for testing and verification respectively in a comparison experiment; ICDAR 2017-MLT: the method is a large multi-language text detection data set, the data set comprises 7200 training pictures, 1800 verification pictures and 9000 test pictures, and in the embodiment, 7200 training pictures and 1800 verification pictures are combined for training.
In order to ensure the fairness of the experiments, this embodiment uses the following settings: ResNet50 is used as the backbone network, pre-trained on ImageNet. In the training stage, stochastic gradient descent is used with a weight decay of $5 \times 10^{-4}$ and a momentum of 0.9, and two training schemes are used. For direct training, 600 epochs are trained on each of the three datasets with an initial learning rate of 0.001 and a batch size of 16, and the learning rate is divided by 10 at epochs 200 and 400. For the ICDAR2015 and SCUT-CTW1500 datasets, fine-tuning is performed based on the model pre-trained on the ICDAR2017-MLT dataset, and the learning-rate schedule is adjusted to the "poly" policy:

$lr = lr_0 \times \left(1 - \dfrac{iter}{max\_iter}\right)^{p}$

with initial learning rate $lr_0 = 0.0001$ and $p = 0.9$. To enhance the training data, the following data augmentation methods are used: a scaling factor is selected from [0.5, 1.0, 2.0, 3.0] to randomly scale the image; images are randomly selected and horizontally flipped; images are randomly rotated by an angle between -10° and 10°; the transformed image is randomly cropped to a size of 640 × 640.
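The "poly" learning-rate schedule described in the training settings is a one-liner; this sketch assumes the schedule is applied per training step, and the function name is illustrative:

```python
def poly_lr(lr0, step, max_steps, p=0.9):
    """Polynomial-decay learning rate: lr0 * (1 - step/max_steps)^p."""
    return lr0 * (1 - step / max_steps) ** p
```

The rate starts at $lr_0$, decays smoothly, and reaches 0 at the final step; p = 0.9 makes the decay slightly slower than linear for most of training.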
To demonstrate the effectiveness of the deformable convolution, the adaptive feature selection and the scale-aware loss proposed by the present invention, ablation experiments were performed on the ICDAR2015 and CTW1500 datasets and compared with the baseline method PSENet. To determine the weight $\alpha$ balancing $L_S$ and $L_H$, experiments were conducted on the ICDAR2015 and CTW1500 datasets, with 800 pictures selected as training images and 200 as validation images; $\alpha$ was increased from 0 to 1 in steps of 0.1. Since the F1-measure is very small when $\alpha = 0$ and $\alpha = 1$, only the F1-measure values for $\alpha$ from 0.1 to 0.9 are plotted, so that the effect of $\alpha$ can be clearly displayed.
The experimental results on the ICDAR2015 and CTW1500 datasets are shown in FIG. 9. It can be seen that the parameter of the loss function has a great influence on the results of both datasets, and the best results are obtained when $\alpha = 0.2$ on both; therefore $\alpha$ is set to 0.2, i.e. $L = 0.8L_S + 0.2L_H$ is used as the loss function.
In order to assign different weights to texts of different proportions, the height of the texts in the CTW1500 dataset is estimated using both the method of the present invention and the Euclidean distance; the performance of the two height-estimation methods is shown in FIG. 10, where the solid line is the result obtained by the present invention and the dashed line the result obtained by the Euclidean distance.
In order to verify the beneficial effects of the invention, verification is performed on the deformable convolution, the adaptive feature selection and the scale-related loss function; PSENet with general convolution is selected as the baseline for comparison with the invention, and the results are shown in Table 1 below:
table 1: the experimental results were compared on the CTW1500 and ICDAR2015 data sets.
In Table 1, "P", "R" and "F" stand for precision, recall and F1-measure, respectively; the first row shows the results of PSENet and the second row those of the deformable convolution. It can be seen that the F1-measure is improved by 0.36% on the ICDAR2015 dataset and by 2.57% on the CTW1500 dataset. The improvement on CTW1500 is larger than on ICDAR2015 because CTW1500 is a curved-text dataset in which the deformation of the text is large, so the deformable convolution is more effective on such data.
For the adaptive feature selection module, the third row of Table 1 shows the results of the PSENet-based adaptive feature selection module: compared to PSENet, the invention improves the F1-measure by 0.86% on the ICDAR2015 dataset and by 0.82% on CTW1500.
For the scale-related loss function, from the above description, this embodiment selects $L = 0.8L_S + 0.2L_H$ as the loss function. To verify whether the loss function proposed by the invention is effective, the original loss function in the PSENet method is replaced by the proposed loss function; the result is shown in the fourth row of Table 1. From the experimental results, it can be seen that the scale-related loss function improves the F1-measure by 2.23% on the ICDAR2015 dataset and by 2.79% on the CTW1500 dataset, an obvious improvement.
On the other hand, this embodiment also tested the combination of the deformable convolution and the proposed adaptive feature selection module; the results are shown in the fifth row of Table 1: an improvement of 1.11% on the ICDAR2015 dataset and 2.78% on the CTW1500 dataset compared to PSENet.
Finally, the results of the full method are shown in the last row of Table 1: compared to the baseline, the F1-measure is improved by 3.57% on the ICDAR2015 dataset and by 3.77% on the CTW1500 dataset.
Furthermore, the present embodiment also compares the present invention with existing state-of-the-art methods from three aspects: oriented text, curved text, and multilingual text.
For oriented text detection: to verify the ability of the present invention to detect oriented text, this embodiment experimented on the ICDAR2015 dataset. Following PSENet, the long side of the image is set to 2240 and the short side is adjusted according to the scale of the original image, using a single-scale test. "Pre" indicates whether the model was trained from scratch or fine-tuned on ICDAR2017-MLT. The experimental results are shown in Table 2 below:
table 2: results of experiments on ICDAR 2015.
Method Pre P R F FPS
CTPN N 74.2 51.6 60.9 7.1
EAST N 83.6 73.5 78.2 13.2
PSENet N 81.49 79.68 80.57 1.6
PixelLink N 82.9 81.7 82.3 7.3
ours N 87.36 81.17 84.15 1.3
SegLink Y 73.1 76.8 75.0 -
SSTD Y 80.2 73.9 76.9 7.7
WordSup Y 79.3 77.0 78.2 -
Corner Y 94.1 70.7 80.7 3.6
TextBoxes++ Y 87.2 76.7 81.7 11.6
RRD Y 85.6 79 82.2 6.5
MCN Y 72 80 76 -
TextSnake Y 84.9 80.4 82.6 1.1
PSENet Y 88.26 82.19 85.12 1.6
CRAFT Y 89.8 84.3 86.9 -
SAE Y 88.3 85.0 86.6 -
ours Y 89.87 86.33 88.06 1.3
It can be seen that, without pre-training on ICDAR2017-MLT, the method achieves an F1-measure of 84.15%, 3.58% higher than PSENet and 1.85% higher than PixelLink; after pre-training on ICDAR2017-MLT, it is 2.94% higher than PSENet. The experimental results of the method exceed the other methods whether trained from scratch or fine-tuned on ICDAR2017-MLT. On the test set, the average detection speed on a 1080Ti GPU is 1.3 FPS.
For curved text detection: to demonstrate the shape robustness of the proposed method, this embodiment performed experiments on the CTW1500 dataset. Following the test settings in PSENet, the long side of the input image is set to 1280 and the short side is resized according to the aspect ratio in the test phase. The "Pre" column indicates whether the result was obtained by training from scratch or by fine-tuning on ICDAR2017-MLT. The experimental results are shown in Table 3 below:
table 3: experimental results on CTW 1500.
Method Pre P R F FPS
CTPN N 60.4 53.8 56.9 7.14
EAST N 78.7 49.1 60.4 21.2
SegLink N 42.3 40.4 40.8 10.7
TLOC N 77.4 69.8 73.4 13.3
PSENet N 80.6 75.6 78.0 3.9
ours N 84.01 78.59 81.21 3.1
TextSnake Y 67.9 85.3 75.6 1.1
PSENet Y 84.8 79.7 82.2 3.9
SAE Y 82.7 77.8 80.1 3
ours Y 84.94 82.37 83.63 3.1
It can be seen that the method achieves the best F1-measure compared with the other methods: without pre-training on ICDAR2017-MLT, it reaches 81.21%, 3.21% higher than PSENet; after fine-tuning on ICDAR2017-MLT, it reaches 83.63% on CTW1500, 1.43% higher than PSENet. The detection effect on CTW1500 is good whether trained from scratch or fine-tuned.
For multilingual text detection: the ICDAR2017-MLT dataset is a multilingual dataset in which the resolution of the images is very high and the text is relatively small; this dataset is doubtless a huge challenge for training and testing, because the pictures in the IC17 dataset contain much small text and the image sizes in this dataset vary greatly. During testing, images with a long side less than or equal to 2048 pixels are enlarged by a factor of 2; for images with a long side greater than 2048 pixels, the long side is set to 4480 and the short side is scaled proportionally. This embodiment trains on this dataset without pre-training on any other text detection dataset. The comparison of the inventive method with other methods is shown in Table 4 below:
table 4: ICDAR2017-MLT experiment result.
It can be seen that the method of the invention achieves F1-measure 74.44%, which is improved by 3.54% compared with the traditional method, and the result shows that the method of the invention is very good for detecting multilingual texts.
Referring to fig. 11, 12, and 13, it can be seen from these test pictures that the present invention has a very good effect of detecting curved texts and texts with large scale changes.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (8)

1. A text detection method based on adaptive feature selection and a scale loss function is characterized by comprising the following steps:
acquiring text features in the image by using a backbone network, and extracting basic features through a feature pyramid network;
extracting more representative characteristic information from the basic characteristics by utilizing self-adaptive characteristic selection;
segmenting and expanding the representative characteristic information by utilizing a progressive expansion algorithm, and obtaining a final detection result;
the approximate height of the text example is used for designing the weight of different texts, when the height of the text example in different data sets is calculated, different formulas are required to be set for calculation, and according to the height of the text example, the average height B of all the text examples in the same image is calculated as follows:
$B = \dfrac{1}{Q}\displaystyle\sum_{e=1}^{Q} H_e$
wherein: q is the number of text instances in the same image, HeIs the approximate height of the e-th text instance;
different text instances have different effects on the loss function, so different text instances $T_e$ need to be assigned different weights $u_e$; the weight $u_e$ is calculated as:
$u_e = \dfrac{B}{H_e}$
the weight $u_e$ of text instance $T_e$ is inversely proportional to the height of the text instance;
for non-text pixels, the weight is set to 0; after the weight of each pixel is determined, the loss $L_H$ reflecting text-scale information can be expressed as:
$L_H = 1 - \mathrm{Dice}(S_n \cdot M \cdot \mu,\; G_n \cdot M \cdot \mu)$
wherein: m is a training mask selected by "on-line inexperienced mining" (OHEM), μ is a weight matrix for each image, SnIs the largest scale segmentation result, GnIs the corresponding label.
2. The method for text detection based on adaptive feature selection and scale loss function according to claim 1, wherein: the acquiring of the text feature in the image by using the backbone network includes,
the backbone network is a deformable convolution network, and in the last three stages of the network, 3 × 3 deformable convolution is used for replacing general convolution so as to adapt to the geometric shape of an object.
3. The text detection method based on adaptive feature selection and scale loss function according to claim 1 or 2, characterized in that: the extracting of the basic features through the feature pyramid network comprises,
the obtained text feature C is used2、C3、C4、C5Obtaining feature maps P with different scales through a feature pyramid network2、P3、P4、P5And the number of channels per scale feature is 256, and P is3、P4、P5Respectively obtaining sum P by 2, 4 and 8 times of upsampling2Feature maps with the same size are obtained by connecting the 4 feature maps with the same sizeThe basic feature X is processed, and the size of X is 1/4 of the input original image, and the number of channels of the basic feature is 1024.
4. The method of text detection based on adaptive feature selection and scale loss function of claim 3, wherein: the extracting of more representative feature information using adaptive feature selection includes,
utilizing a self-adaptive feature selection module to enhance the features related to the text, inhibit the features unrelated to the text, and set the basic features acquired from the feature pyramid network as:
$X = [X_1, X_2, \ldots, X_C], \quad X \in \mathbb{R}^{C \times H \times W}$
wherein: c is the channel number of the feature map, H, W is the height and width of the feature map, respectively, the average value of the features of all pixels in the feature map of each channel of the channels of the basic features is calculated by using the global average pooling operation, and the value representing the feature map of the corresponding channel is output:
$z = [z_1, z_2, \ldots, z_C]^T$
two fully-connected layers are used to capture the weights between different channels, and the output after the two fully-connected layers is:

$v = \sigma(W_2\,\delta(W_1 z))$

wherein: $\sigma$ and $\delta$ are the Sigmoid and ReLU operations, respectively, $z \in \mathbb{R}^C$, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$; to reduce the parameters, the channel number of the first fully-connected layer is reduced to $\frac{C}{r}$, with r set to 16; the adaptive feature $\tilde{X}$ is obtained by applying the channel weights v to the basic feature X, and $\tilde{X}$ is calculated as follows:

$\tilde{X}_c = v_c \cdot X_c, \quad c = 1, 2, \ldots, C$

the features X and $\tilde{X}$ are added to obtain the feature map F.
5. The method for text detection based on adaptive feature selection and scale loss function according to any of claims 1, 2 or 4, wherein: the obtaining of the final detection result by using the progressive dilation algorithm includes,
in order to distinguish texts, the representative feature information is used as an input feature map to be projected to generate a plurality of segmentation results, and the segmentation result with the smallest scale is gradually expanded into a complete shape of the segmentation result with the largest scale by using a progressive expansion algorithm to obtain a final detection result.
6. The method for text detection based on adaptive feature selection and scale loss function according to claim 1, wherein: also comprises
And training the backbone network, the self-adaptive feature selection and the progressive expansion algorithm by using the loss function, and generating different labels to guide the optimization of the loss function in the training process of the network.
7. The method of text detection based on adaptive feature selection and scale loss function of claim 6, wherein: the generating of the different label includes generating a different label,
the progressive dilation algorithm generates a plurality of segmentation results, and a label corresponding to the segmentation results is required, wherein the label is calculated by the formula:
$d_i = \dfrac{S(p_n) \times (1 - r_i^2)}{C(p_n)}$
wherein: $p_n$ is the original label polygon of the text region, $d_i$ is the distance by which the original polygon shrinks inward, $S(p_n)$ is the area of polygon $p_n$, $C(p_n)$ is the perimeter of polygon $p_n$, and $r_i$ is the shrink ratio, expressed as:
$r_i = 1 - \dfrac{(1 - m) \times (n - i)}{n - 1}$
wherein: m is the minimum shrink ratio, with value in the range (0, 1]; n is the total number of segmentation results; for each shrunk polygon $p_i$, the pixels inside $p_i$ are set to 1 and the rest are set to 0; the same operation is applied to all polygons in an image, yielding the label $G_i$.
8. The text detection method based on adaptive feature selection and scale loss function according to claim 6 or 7, characterized in that:
the loss function aiming at the scale perception of the text example is utilized to assign different weights to texts with different sizes, so that the problem of the loss of the small text example is solved, and the calculation formula of the loss function is as follows:
L=(1-α)LS+αLH
wherein: l isSFor loss of reduced segmentation map, LHFor the loss of information about the text scale, α is the balance LSAnd LHThe weight of (c).
CN202110341740.1A 2021-03-30 2021-03-30 Text detection method based on adaptive feature selection and scale loss function Active CN112926582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110341740.1A CN112926582B (en) 2021-03-30 2021-03-30 Text detection method based on adaptive feature selection and scale loss function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110341740.1A CN112926582B (en) 2021-03-30 2021-03-30 Text detection method based on adaptive feature selection and scale loss function

Publications (2)

Publication Number Publication Date
CN112926582A CN112926582A (en) 2021-06-08
CN112926582B true CN112926582B (en) 2021-12-07

Family

ID=76176621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110341740.1A Active CN112926582B (en) 2021-03-30 2021-03-30 Text detection method based on adaptive feature selection and scale loss function

Country Status (1)

Country Link
CN (1) CN112926582B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115564766B (en) * 2022-11-09 2023-06-13 浙江振兴阿祥集团有限公司 Preparation method and system of water turbine volute seat ring
CN116935394B (en) * 2023-07-27 2024-01-02 南京邮电大学 Train carriage number positioning method based on PSENT region segmentation

Citations (4)

Publication number Priority date Publication date Assignee Title
CN110008950A (en) * 2019-03-13 2019-07-12 南京大学 The method of text detection in the natural scene of a kind of pair of shape robust
CN110837830A (en) * 2019-10-24 2020-02-25 上海兑观信息科技技术有限公司 Image character recognition method based on space-time convolution neural network
CN111444919A (en) * 2020-04-17 2020-07-24 南京大学 Method for detecting text with any shape in natural scene
CN112446372A (en) * 2020-12-08 2021-03-05 电子科技大学 Text detection method based on channel grouping attention mechanism

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN109086663B (en) * 2018-06-27 2021-11-05 大连理工大学 Natural scene text detection method based on scale self-adaption of convolutional neural network
CN110458011A (en) * 2019-07-05 2019-11-15 北京百度网讯科技有限公司 Character recognition method and device, computer equipment and readable medium end to end
CN110598698B (en) * 2019-08-29 2022-02-15 华中科技大学 Natural scene text detection method and system based on adaptive regional suggestion network

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN110008950A (en) * 2019-03-13 2019-07-12 南京大学 The method of text detection in the natural scene of a kind of pair of shape robust
CN110837830A (en) * 2019-10-24 2020-02-25 上海兑观信息科技技术有限公司 Image character recognition method based on space-time convolution neural network
CN111444919A (en) * 2020-04-17 2020-07-24 南京大学 Method for detecting text with any shape in natural scene
CN112446372A (en) * 2020-12-08 2021-03-05 电子科技大学 Text detection method based on channel grouping attention mechanism

Non-Patent Citations (1)

Title
Dense text detection method in natural scenes; Mou Sen et al.; 《计算机***应用》; Feb. 28, 2021; Vol. 30, No. 2; pp. 171-175 *

Also Published As

Publication number Publication date
CN112926582A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN108960229B (en) Multidirectional character detection method and device
CN110032998B (en) Method, system, device and storage medium for detecting characters of natural scene picture
CN111444919A (en) Method for detecting text with any shape in natural scene
CN110569738B (en) Natural scene text detection method, equipment and medium based on densely connected network
CN112926582B (en) Text detection method based on adaptive feature selection and scale loss function
CN111723585A (en) Style-controllable image text real-time translation and conversion method
CN112528997B (en) Tibetan-Chinese bilingual scene text detection method based on text center region amplification
CN111914838B (en) License plate recognition method based on text line recognition
CN109389121A (en) A kind of nameplate recognition methods and system based on deep learning
CN111753828A (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN110443235B (en) Intelligent paper test paper total score identification method and system
CN112418165B (en) Small-size target detection method and device based on improved cascade neural network
CN113591746B (en) Document table structure detection method and device
CN111666937A (en) Method and system for recognizing text in image
CN112348028A (en) Scene text detection method, correction method, device, electronic equipment and medium
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN113971809A (en) Text recognition method and device based on deep learning and storage medium
CN114519819A (en) Remote sensing image target detection method based on global context awareness
CN110991440A (en) Pixel-driven mobile phone operation interface text detection method
Ubul et al. Off-line Uyghur signature recognition based on modified grid information features
CN109117841B (en) Scene text detection method based on stroke width transformation and convolutional neural network
CN111797704A (en) Action recognition method based on related object perception
CN112001448A (en) Method for detecting small objects with regular shapes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant