CN116453133A - Bezier curve and key point-based banner text detection method and system - Google Patents

Bezier curve and key point-based banner text detection method and system

Info

Publication number
CN116453133A
CN116453133A
Authority
CN
China
Prior art keywords
text
points
coordinate
key
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310714974.5A
Other languages
Chinese (zh)
Other versions
CN116453133B (en)
Inventor
谢红刚
姜迪
侯凯元
马万杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202310714974.5A priority Critical patent/CN116453133B/en
Publication of CN116453133A publication Critical patent/CN116453133A/en
Application granted granted Critical
Publication of CN116453133B publication Critical patent/CN116453133B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Image Generation (AREA)

Abstract

The invention discloses a method and system for detecting banner text based on Bezier curves and key points. An initial text box of each text region is first generated from the image labels; the number of long-side coordinates of the initial text box is then reduced using a fixed threshold, Bezier curves are generated from the reduced long-side coordinates, and two Bezier curves connected end to end form a new text box. The labels of the text box are converted from boundary coordinates of the text box into key point coordinates and key point widths. A banner text detection network model is then constructed and trained, and finally the trained model is used to detect the text in banner images. The method solves the problem that the prior art cannot accurately frame banner text, and improves detection speed while using less data to complete text detection.

Description

Bezier curve and key point-based banner text detection method and system
Technical Field
The invention belongs to the technical field of natural scene text positioning, and particularly relates to a banner text detection method and system based on Bezier curves and key points.
Background
With the continuous development of computer vision, object detection and semantic segmentation are constantly being iterated and improved. Using object detection, semantic segmentation and related techniques, the text and non-text regions in a banner can be separated, realizing detection of the banner text so that the content of the text region can then be recognized.
A study of the text regions of banner images reveals problems such as the large aspect ratio of banner text and text distortion. Although object detection methods have made some progress in processing such text, problems remain, including false detections, missed detections, and large non-text areas inside the detected boxes. Meanwhile, text detection methods based on semantic segmentation also have problems such as complex post-processing, low detection speed and high hardware requirements. A new text detection method is therefore urgently needed to address these issues.
Disclosure of Invention
Aiming at the problem that the prior art cannot accurately frame a banner text, the invention provides a banner text detection method and system based on Bezier curves and key points.
In order to achieve the above object, the present invention provides a method for detecting a banner text based on a bezier curve and a key point, comprising the steps of:
Step 1, generating an initial text box of a text region according to labels of a public text data set, simplifying the number of long-side coordinates of the text box through a fixed threshold value, generating Bezier curves based on simplified long-side coordinate points, connecting two Bezier curves end to form a new text box, and converting the labels of the text box from boundary coordinate points of the text box to key point coordinates and the width of key points;
step 1.1, selecting images with special texts such as long texts, distorted texts and the like in a public text image data set as a data set, and generating an initial text box of a text area according to labels of the public text data set;
step 1.2, judging the bending degree of the long edge of the text box by adopting a fixed threshold value method;
step 1.3, selectively simplifying coordinate points of two long sides of the text box according to the bending degree of the two long sides of the text box;
step 1.4, taking coordinate points on two simplified long sides as control points of a Bezier curve, generating two corresponding Bezier curves, and connecting the two Bezier curves end to obtain a real boundary frame of the text;
step 1.5, converting labels of the public data set from text box boundary coordinate points into key point coordinates and the width of the key points;
Step 2, constructing a banner text detection network model;
step 3, training the banner text detection network model constructed in the step 2 by utilizing the key point data set obtained in the step 1;
and 4, detecting the text in the banner image by using the trained banner text detection network model.
In step 1.1, the labels of the public text data set are several groups of coordinates arranged clockwise; each group gives the boundary point coordinates of the text box framing one text instance, and connecting them clockwise forms a closed polygon, the initial text box. Let the number of boundary points per image in the data set be $2n$; the first $n$ points are selected in order as upper boundary points and the remaining $n$ as lower boundary points, and the polyline through the upper boundary points and the polyline through the lower boundary points are taken as the two long sides of the initial text box.
In step 1.2, for each long side of a text box in the data set, the distance from every other coordinate point on that side to the line connecting the head and tail coordinates is compared with the length of that line through fixed thresholds, and the bending degree of the two long sides of the text box is judged, namely:

$$\text{curve} = \begin{cases} \text{straight}, & 0 \le r < T_1 \\ \text{partially bent}, & T_1 \le r < T_2 \\ \text{fully bent}, & r \ge T_2 \end{cases} \quad (1)$$

where curve denotes the bending degree of the long side and $r$ denotes the ratio of the farthest distance, among the coordinate points on the long side, to the head-tail coordinate line, over the length of that line. When the ratio is greater than or equal to 0 and less than $T_1$, the long side is judged to be a straight line; when it is greater than or equal to $T_1$ and less than $T_2$, partially bent; and when it is greater than or equal to $T_2$, completely bent. $T_1$ and $T_2$ are set thresholds.
In step 1.3, let the distance from a coordinate point on the long side to the line connecting the head and tail coordinate points be $d$, with head and tail coordinate points $P_1$ and $P_n$. The simplification proceeds as follows: when the long side is judged to be a straight line, only its head and tail coordinate points are retained; when it is judged to be partially bent, the coordinate point farthest from the head-tail line and the head and tail points are retained; when it is judged to be completely bent, a threshold $D$ equal to 0.1 times the length of the head-tail line is set, coordinate points with $d > D$ are retained and the others discarded. Let the point with the largest $d$ be $P_k$; $P_k$ divides the curve into the two parts $[P_1, P_k]$ and $[P_k, P_n]$, and the above operation is repeated on each part until no coordinate point lies farther than $D$ from its line.
In step 1.4, the coordinate points on the simplified long sides are used as control points of Bezier curves. A Bezier curve is represented by a parametric curve based on Bernstein polynomials, defined as:

$$B(t) = \sum_{i=0}^{n} b_{i,n}(t)\,P_i, \quad 0 \le t \le 1 \quad (2)$$

$$b_{i,n}(t) = \binom{n}{i}\,t^{i}(1-t)^{n-i}, \quad i = 0, \dots, n \quad (3)$$

where $B(t)$ denotes the set of coordinates of points on the Bezier curve, $n$ denotes the Bezier curve order, $P_i$ denotes the coordinates of the $i$-th control point, $b_{i,n}(t)$ denotes the Bernstein polynomial of the $i$-th control point, $\binom{n}{i}$ denotes the binomial coefficient, and $t$ denotes the time corresponding to each point on the curve. Since $t^{i}(1-t)^{n-i}$ vanishes for all but one term when $t$ is 0 or 1, the first coordinate point on the long side is selected as the position of the Bezier curve at time 0, and the last coordinate point on the long side as its position at time 1.
Two Bezier curves are generated through the formula (2), and a closed polygon formed by connecting the two Bezier curves end to end is used as a real text box of the text example.
In step 1.5, the boundary points on the two long sides are converted into a group of key points to represent the text box, and before the key points are converted, the number of the boundary points on the upper and lower long sides of the text box is ensured to be consistent by adopting an upward compatible mode, and the specific steps are as follows: when the upper edge and the lower edge are respectively straight lines and partially bent, the middle point of the straight line edge is extracted as one boundary point, so that the boundary points of the upper edge and the lower edge are three; when the upper edge and the lower edge are respectively straight lines and completely bent, dividing the straight lines equally according to the number of coordinate points of the completely bent edges, and extracting equally divided coordinate points so that the number of boundary points of the upper edge and the lower edge is consistent; when the upper edge and the lower edge are respectively in partial bending and full bending, dividing the two curves of the partial bending edge equally according to the coordinate point quantity of the full bending edge minus the coordinate point quantity of the partial bending edge, and extracting the equally divided coordinate points, so that the boundary point quantity of the upper edge and the lower edge is consistent. 
After the number of the upper boundary points and the lower boundary points are unified through the operation, the boundary points are converted, coordinates of the upper edge and the lower edge are in one-to-one correspondence from the beginning to the end, the middle point coordinates of the corresponding coordinate points are taken as key point coordinates, one half of the distance of the corresponding coordinate points is taken as the width of the key points, and the labels in the public image text data set are converted into a group of key point coordinates and corresponding widths from the coordinate points of the boundary frames.
The banner text detection network model in step 2 comprises a feature extraction module, a feature fusion module, a regression module and a text box generation module. The feature extraction module extracts feature information of different layers to obtain feature images containing semantic information from low layers to high layers. The feature fusion module combines the feature images of different layers to obtain a fused feature image used for subsequent banner text detection. The regression module regresses the shape of each text instance as well as its key point coordinates and key point widths. The text box generation module generates banner image text boxes based on the key point coordinates and width information in the current image.
The backbone network of the feature extraction module adopts a ResNet-50 model. After an image is input into ResNet-50, four feature images $C_2$, $C_3$, $C_4$ and $C_5$ are obtained in turn through channel increase and downsampling. The channel numbers of the four different-scale feature images obtained in the backbone are unified to obtain $P_2$, $P_3$, $P_4$ and $P_5$. Then, starting from the lowest-scale feature map $P_5$, upsampling is performed and the result is added to the same-scale feature map at the input of the FPN structure to obtain the fused lower-scale feature image $F_4$; $F_4$ is upsampled and added to $P_3$ to obtain the fused low-scale feature image $F_3$; likewise $F_3$ is upsampled and added to $P_2$ to obtain the fused feature image $F_2$. Finally, the fused feature images $F_2$, $F_3$, $F_4$ and $F_5$ (with $F_5 = P_5$) are taken as the output of the FPN.
The feature fusion module combines the fused feature images of different scales to obtain the combined fusion feature image $F$; the specific calculation is:

$$F = C\big(F_2,\ \mathrm{Up}_{\times 2}(F_3),\ \mathrm{Up}_{\times 4}(F_4),\ \mathrm{Up}_{\times 8}(F_5)\big) \quad (4)$$

where $C$ denotes channel concatenation, $\mathrm{Up}_{\times 2}$, $\mathrm{Up}_{\times 4}$ and $\mathrm{Up}_{\times 8}$ denote upsampling by 2, 4 and 8 times respectively, and $F_2$, $F_3$, $F_4$, $F_5$ are the fused feature images.

The fusion feature image $F$ is then upsampled so that it has the same size as the original image.
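The channel-concatenation fusion of Eq. (4) can be sketched at shape level as follows (illustrative only, not from the patent; nearest-neighbour upsampling and the function names are assumptions, since the interpolation mode is not specified):

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(f2, f3, f4, f5):
    """Eq. (4): upsample the smaller maps by 2x/4x/8x to the scale of the
    largest one, then concatenate all four along the channel axis."""
    return np.concatenate(
        [f2, upsample(f3, 2), upsample(f4, 4), upsample(f5, 8)], axis=0)
```

With four 8-channel maps at scales 16, 8, 4 and 2, the fused result is a 32-channel map at the largest scale.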
The regression module comprises two parts, shape regression and key point regression. Shape regression converts the fusion feature map $F$ into a text shape feature map through a convolution layer with an activation function; a threshold is set to binarize the feature map, regions above the threshold being text regions and regions below it being background, giving a text shape binary image in which the text is separated from the background. The text contour shape in the binary image is compared with the text box shape generated from the image key point labels, and the two are matched through their intersection-over-union (IOU). The input of key point regression is the fusion feature map $F$ and the output is the key point coordinates and widths, through two branches. One branch outputs $K$ key-point heat maps, where $K$ is the maximum number of key points of a text instance in the detected image; in each key-point heat map, the $N$ highest-scoring highlighted coordinate points are the key point coordinates corresponding to the key points of the image's text instances, where $N$ is the number of text instances in the detected image. When a text instance has fewer key points, the number of highlighted coordinates is reduced accordingly. The other branch outputs the corresponding key point widths; when the key points of a text instance are fewer, the remaining width information is 0.
The text box generation module takes the key point coordinates and width information output by the regression module as text instance information and uses it to generate a text box. The width of a key point is the distance from the key point to the corresponding long-side coordinate point. The line connecting two adjacent key points is taken as the direction whose normal passes through the key point and the long-side coordinate point; from each key point, extending upwards and downwards along this normal by the key point's width gives the end points, which are the long-side coordinate points. Processing every key point in this way yields two groups of long-side coordinate points equal in number to the key points; two Bezier curves are generated with these long-side coordinate points as control points, and connecting the two curves end to end gives a completely closed curve frame, which is the text box of the text instance. Finally, the image with the framed text is output, realizing text detection for the banner image.
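The recovery of long-side control points from key points and widths can be sketched as follows (illustrative, not from the patent; `keypoints_to_long_sides` and the use of central differences for the local key-point direction are assumptions):

```python
import numpy as np

def keypoints_to_long_sides(keypoints, widths):
    """Recover the two long-side control points from key points and widths:
    each key point is moved along the normal of the local key-point direction
    by +/- its width, giving one upper and one lower long-side point."""
    kp = np.asarray(keypoints, dtype=float)
    w = np.asarray(widths, dtype=float)
    # Direction of the line through adjacent key points (central differences
    # in the interior, one-sided at the ends).
    tang = np.gradient(kp, axis=0)
    tang /= np.linalg.norm(tang, axis=1, keepdims=True)
    # Rotate the tangent by 90 degrees to get the normal direction.
    normal = np.stack([-tang[:, 1], tang[:, 0]], axis=1)
    upper = kp + normal * w[:, None]
    lower = kp - normal * w[:, None]
    return upper, lower
```

The two returned point sets would then serve as control points for the two Bezier curves that close into the final text box.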
In step 3, the key point data set obtained in step 1 is divided into a training set and a test set; the training set is input into the banner text detection network model for iterative training, the parameters of the model are updated to minimize the loss function, the accuracy of the model on the test set is recorded, and the optimal model is saved. The training process is divided into shape detection training and key point detection training, with the corresponding loss function $L$ calculated as:

$$L = L_{shape} + \lambda\, L_{kp} \quad (5)$$

where $L_{shape}$ is the shape loss function, $L_{kp}$ is the key point loss function, and $\lambda$ is a weight factor of the loss function.
The shape loss function $L_{shape}$ is calculated as:

$$L_{shape} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \quad (6)$$

where $IOU$ denotes the intersection-over-union of the regressed text contour shape and the text box generated from the key point labels; $b$ and $b^{gt}$ respectively denote the center point coordinates of the regressed text contour shape and of the text box generated from the key point labels. The center point of the regressed text contour shape is the clockwise middle key point of the contour; when the number of key points is even, the midpoint of the line connecting the two middle key points is selected. The center point of the text box generated from the key point labels is determined in the same way from the generated box's key points. $\rho$ denotes the Euclidean distance between the two center points, $c$ denotes the diagonal length of the minimum closure area that can contain both the regressed text contour shape and the box generated from the key point labels, $\alpha$ is a regulatory factor balancing the weights between overlap area and aspect-ratio similarity, and $v$ is an index measuring aspect-ratio similarity.
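The terms named here (IOU, centre distance over the enclosing-box diagonal, a balanced aspect-ratio term) match the CIoU loss family; the following sketch computes that form for axis-aligned boxes (illustrative only; the patent applies the loss to regressed text contours, not plain axis-aligned boxes):

```python
import numpy as np

def shape_loss(box_a, box_b):
    """CIoU-style shape loss for axis-aligned boxes (x1, y1, x2, y2):
    1 - IOU + rho^2(centres)/c^2 + alpha*v, matching the terms of Eq. (6)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # rho^2: squared Euclidean distance between the two box centres.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    # c^2: squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    # v: aspect-ratio similarity; alpha: its balancing factor, as in CIoU.
    v = (4 / np.pi ** 2) * (np.arctan((ax2 - ax1) / (ay2 - ay1))
                            - np.arctan((bx2 - bx1) / (by2 - by1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

Identical boxes give a loss of zero; the loss grows with centre offset and aspect-ratio mismatch.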
The key point loss function $L_{kp}$ comprises two parts, key point coordinates and width; the specific calculation is:

$$L_{kp} = L_{coord} + \mu\, L_{width} \quad (7)$$

$$L_{coord} = -\frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{H}\sum_{j=1}^{W}\begin{cases}\big(1-\hat{Y}_{kij}\big)^{\alpha}\log\big(\hat{Y}_{kij}\big), & Y_{kij}=1\\[2pt] \big(1-Y_{kij}\big)^{\beta}\,\hat{Y}_{kij}^{\,\alpha}\log\big(1-\hat{Y}_{kij}\big), & \text{otherwise}\end{cases} \quad (8)$$

$$L_{width} = \frac{1}{N}\sum\big|\,w - \hat{w}\,\big| \quad (9)$$

where $L_{coord}$ is the key point coordinate loss function, $L_{width}$ is the key point width loss function, $\mu$ is a weight factor, $N$ is the number of text instances in the image, $K$ denotes the number of channels of the regressed key-point heat map, $H$ and $W$ respectively denote the height and width of the regressed key-point heat map, $\hat{Y}_{kij}$ is the score of the key point in the heat map regressed by the regression module, $Y_{kij}$ denotes the coordinate point score of the real key-point heat map obtained by applying a Gaussian function to the image's key point labels, $\alpha$ and $\beta$ are hyper-parameters controlling the contribution of each key point, with $(1-Y_{kij})^{\beta}$ reducing the penalty for points around the key point coordinates, and $|\cdot|$ denotes the absolute value of the quantity in brackets.
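Equation (8) as described follows the penalty-reduced focal loss commonly used for key-point heat maps in CenterNet-style detectors; a minimal NumPy sketch (illustrative, not from the patent; the normalization by instance count and the default hyper-parameters are assumptions):

```python
import numpy as np

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, n_instances=1):
    """Penalty-reduced focal loss over a key-point heat map, in the spirit of
    Eq. (8): cells where gt == 1 contribute (1-p)^alpha * log(p); all other
    cells are down-weighted by (1-gt)^beta around the Gaussian-splatted peaks."""
    pred = np.clip(np.asarray(pred, dtype=float), 1e-6, 1 - 1e-6)
    gt = np.asarray(gt, dtype=float)
    pos = gt == 1.0
    pos_loss = ((1 - pred) ** alpha * np.log(pred))[pos].sum()
    neg_loss = ((1 - gt) ** beta * pred ** alpha * np.log(1 - pred))[~pos].sum()
    return -(pos_loss + neg_loss) / n_instances
```

A confident, correct heat map yields a loss near zero, while a heat map that fires away from the labeled key point is penalized heavily.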
In order to accelerate model convergence, non-text region coordinate points are not considered in the regression of the key point coordinates, so that the number of negative samples is reduced. After training the banner text detection network model by using training set data, putting the test set into the model, comparing the accuracy and the detection speed of text detection, and extracting an optimal detection model.
The invention also provides a banner text detection system based on the Bezier curve and the key points, which is used for realizing the banner text detection method based on the Bezier curve and the key points.
Further, the system comprises a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the stored instructions in the memory to execute a banner text detection method based on the Bezier curve and the key points.
Compared with the prior art, the invention has the following advantages:
1) A group of key points is used for regression instead of rectangular boxes, and text instances are represented by text boxes generated from key points instead of fixed-shape anchor boxes; meanwhile, the long sides of the text box are replaced by Bezier curves, whose shape variability adapts to different text shapes, solving the problem that fixed anchor boxes cannot accurately represent the shapes of text instances.
2) In the key-point label making stage, an adaptive mode is adopted to reduce the number of long-side coordinate points, reducing the computational pressure of text box coordinate regression; key point coordinates and widths are regressed instead of text box coordinates, which significantly reduces the computational cost of label regression, improving detection speed and using less data while completing text detection.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a block diagram of a banner text detection network in accordance with an embodiment of the present invention.
Detailed Description
The invention provides a method and a system for detecting banner texts based on Bezier curves and key points, and the technical scheme of the invention is further described below with reference to drawings and embodiments.
Example 1
As shown in fig. 1, the invention provides a method for detecting a banner text based on a bezier curve and key points, which comprises the following steps:
step 1, generating an initial text box of a text region according to labels of a public text data set, simplifying the number of long-side coordinates of the text box through a fixed threshold value, generating a Bezier curve based on simplified long-side coordinate points, connecting two Bezier curves end to form a new text box, and converting the labels of the text box from boundary coordinate points of the text box to key point coordinates and the width of key points.
And 1.1, selecting images with special texts such as long texts, distorted texts and the like in the public text image data set as a data set, and generating an initial text box of a text area according to the labels of the public text data set.
The labels of the public text data set are several groups of coordinates arranged clockwise; each group gives the boundary point coordinates of the text box framing one text instance, and connecting them clockwise forms a closed polygon, the initial text box. In this embodiment, the number of boundary points in the CTW-1500 dataset is 14: the first 7 are selected in order as upper boundary points and the last 7 as lower boundary points, and the polyline through the upper boundary points (1-7) and the polyline through the lower boundary points (8-14) are taken as the two long sides of the initial text box.
And 1.2, judging the bending degree of the long edge of the text box by adopting a fixed threshold value method.
The distance from every other coordinate point on a long side to the line connecting the head and tail coordinates of that side is compared with the length of that line through fixed thresholds, and the bending degree of the two long sides of the text box is judged, namely:

$$\text{curve} = \begin{cases} \text{straight}, & 0 \le r < 0.1 \\ \text{partially bent}, & 0.1 \le r < 0.7 \\ \text{fully bent}, & r \ge 0.7 \end{cases} \quad (1)$$

where curve denotes the bending degree of the long side and $r$ denotes the ratio of the farthest distance, among the coordinate points on the long side, to the head-tail coordinate line, over the length of that line. When the ratio is greater than or equal to 0 and less than 0.1, the long side is judged to be a straight line; when it is greater than or equal to 0.1 and less than 0.7, partially bent; and when it is greater than or equal to 0.7, completely bent.
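The threshold test of Eq. (1) can be sketched in code. This is an illustrative NumPy implementation (not from the patent; the function name `classify_bend` and the chord-based distance computation are assumptions), using the embodiment's thresholds 0.1 and 0.7:

```python
import numpy as np

def classify_bend(long_side, t1=0.1, t2=0.7):
    """Judge the bending degree of a long side per Eq. (1): compare the farthest
    point-to-chord distance, relative to the chord length, against fixed
    thresholds (0.1 and 0.7 in the embodiment)."""
    pts = np.asarray(long_side, dtype=float)
    p0, p1 = pts[0], pts[-1]
    chord = p1 - p0
    chord_len = np.linalg.norm(chord)
    rel = pts[1:-1] - p0
    # 2-D cross product magnitude / chord length = point-to-line distance.
    dists = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / chord_len
    ratio = dists.max() / chord_len
    if ratio < t1:
        return "straight"
    if ratio < t2:
        return "partially bent"
    return "fully bent"
```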
And 1.3, selectively simplifying coordinate points of two long sides of the text box according to the bending degree of the two long sides of the text box.
Let the distance from a coordinate point on the long side to the line connecting the head and tail coordinate points be $d$, with head and tail coordinate points $P_1$ and $P_n$. The simplification proceeds as follows: when the long side is judged to be a straight line, only its head and tail coordinate points are retained; when it is judged to be partially bent, the coordinate point farthest from the head-tail line and the head and tail points are retained; when it is judged to be completely bent, following the Douglas-Peucker heuristic a threshold $D$ equal to 0.1 times the length of the head-tail line is set, and coordinate points with $d > D$ are retained while the others are discarded. Let the point with the largest $d$ be $P_k$; $P_k$ divides the curve into the two parts $[P_1, P_k]$ and $[P_k, P_n]$, and the above operation is repeated on each part until no coordinate point lies farther than $D$ from its line.
Through the operation, the coordinate points on the long sides are simplified, the number of labels is reduced, the calculated amount of a banner image text detection model regression module is reduced, and the detection speed is improved.
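As a rough illustration of the simplification in step 1.3 (not from the patent; `simplify_long_side` and its recursive structure are assumptions modeled on the Douglas-Peucker algorithm the embodiment alludes to):

```python
import numpy as np

def _dists(pts, lo, hi):
    """Distances from pts[lo+1:hi] to the line through pts[lo] and pts[hi]."""
    chord = pts[hi] - pts[lo]
    rel = pts[lo + 1:hi] - pts[lo]
    return np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / np.linalg.norm(chord)

def simplify_long_side(points, rel_threshold=0.1):
    """Douglas-Peucker-style simplification: the threshold D is 0.1 times the
    head-tail chord length; the farthest point P_k splits the curve into two
    parts which are processed recursively until no distance exceeds D."""
    pts = np.asarray(points, dtype=float)
    eps = rel_threshold * np.linalg.norm(pts[-1] - pts[0])

    def recurse(lo, hi):
        if hi - lo < 2:
            return []
        d = _dists(pts, lo, hi)
        k = lo + 1 + int(np.argmax(d))
        if d.max() <= eps:
            return []  # all intermediate points are close enough to the chord
        return recurse(lo, k) + [k] + recurse(k, hi)

    keep = [0] + recurse(0, len(pts) - 1) + [len(pts) - 1]
    return pts[keep]
```

A nearly straight side collapses to its two end points, while a bent side keeps its farthest point(s).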
And 1.4, taking coordinate points on two simplified long sides as control points of the Bezier curves, generating two corresponding Bezier curves, and connecting the two Bezier curves end to obtain a real boundary frame of the text.
The coordinate points on the simplified long sides are used as control points of Bezier curves. A Bezier curve is represented by a parametric curve based on Bernstein polynomials, defined as:

$$B(t) = \sum_{i=0}^{n} b_{i,n}(t)\,P_i, \quad 0 \le t \le 1 \quad (2)$$

$$b_{i,n}(t) = \binom{n}{i}\,t^{i}(1-t)^{n-i}, \quad i = 0, \dots, n \quad (3)$$

where $B(t)$ denotes the set of coordinates of points on the Bezier curve, $n$ denotes the Bezier curve order, $P_i$ denotes the coordinates of the $i$-th control point, $b_{i,n}(t)$ denotes the Bernstein polynomial of the $i$-th control point, $\binom{n}{i}$ denotes the binomial coefficient, and $t$ denotes the time corresponding to each point on the curve. Since $t^{i}(1-t)^{n-i}$ vanishes for all but one term when $t$ is 0 or 1, the first coordinate point on the long side is selected as the position of the Bezier curve at time 0, and the last coordinate point on the long side as its position at time 1.
Two Bezier curves are generated through the formula (2), and a closed polygon formed by connecting the two Bezier curves end to end is used as a real text box of the text example.
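A minimal evaluation of Eqs. (2)-(3) might look like this in NumPy (illustrative only; `bezier_points` and the sampling density are assumptions):

```python
import numpy as np
from math import comb

def bezier_points(control_points, num=50):
    """Evaluate Eqs. (2)-(3): B(t) = sum_i C(n, i) t^i (1-t)^(n-i) P_i for
    t in [0, 1], with the simplified long-side points as control points P_i."""
    P = np.asarray(control_points, dtype=float)
    n = len(P) - 1                        # Bezier curve order
    t = np.linspace(0.0, 1.0, num)
    # Bernstein basis b_{i,n}(t), one column per control point.
    basis = np.stack([comb(n, i) * t ** i * (1 - t) ** (n - i)
                      for i in range(n + 1)], axis=1)
    return basis @ P                      # (num, 2) points on the curve
```

As the text notes, the curve starts at the first control point ($t = 0$) and ends at the last ($t = 1$).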
And step 1.5, converting the labels of the public data set from text box boundary coordinate points to key point coordinates and the width of the key points.
In order to further reduce the number of data set labels and improve the detection efficiency, boundary points on two long sides are converted into a group of key points to represent text boxes. Before the text box is converted into key points, the quantity of boundary points of the upper long side and the lower long side of the text box is ensured to be consistent, and the quantity of the boundary points is ensured to be consistent by adopting an upward compatible mode due to the fact that the two long sides have different bending degrees, and the concrete steps are as follows:
when the upper edge and the lower edge are respectively straight lines and partially bent, the middle point of the straight line edge is extracted as one boundary point, so that the boundary points of the upper edge and the lower edge are three; when the upper edge and the lower edge are respectively straight lines and completely bent, dividing the straight lines equally according to the number of coordinate points of the completely bent edges, and extracting equally divided coordinate points so that the number of boundary points of the upper edge and the lower edge is consistent; when the upper edge and the lower edge are respectively in partial bending and full bending, dividing the two curves of the partial bending edge equally according to the coordinate point quantity of the full bending edge minus the coordinate point quantity of the partial bending edge, and extracting the equally divided coordinate points, so that the boundary point quantity of the upper edge and the lower edge is consistent.
After the numbers of upper and lower boundary points have been unified by the above operations, the boundary points are converted: the coordinates on the upper and lower edges are paired one-to-one from head to tail, the midpoint of each pair of corresponding coordinate points is taken as a key point coordinate, and half the distance between the pair is taken as the key point width. The labels in the public image text data set are thus converted from bounding-box coordinate points into a group of key point coordinates with corresponding widths, realizing key-point-based label production.
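The midpoint-and-half-distance conversion can be sketched as follows, assuming the upper and lower boundary points have already been equalized and paired head-to-tail (the function name is hypothetical):

```python
from math import hypot

def boundary_to_keypoints(upper, lower):
    """Convert paired upper/lower boundary points into key point labels.

    upper, lower: equal-length lists of (x, y) points, paired by index.
    Returns (keypoints, widths): the midpoint of each pair and half the
    distance between the paired points.
    """
    assert len(upper) == len(lower)
    keypoints, widths = [], []
    for (ux, uy), (lx, ly) in zip(upper, lower):
        keypoints.append(((ux + lx) / 2.0, (uy + ly) / 2.0))
        widths.append(hypot(ux - lx, uy - ly) / 2.0)
    return keypoints, widths
```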
And 2, constructing a banner text detection network model.
The banner text detection network model comprises a feature extraction module, a feature fusion module, a regression module and a text box generation module. And the feature extraction module is used for extracting feature information of different layers to obtain feature images containing semantic information from a lower layer to a higher layer. And the feature fusion module is used for superposing and combining the feature images of different layers to obtain a fused feature image which is used for detecting the banner text subsequently. And the regression module is used for regressing the shape of the text instance, and the coordinates of the key points and the width of the key points of the text instance. And the text box generation module is used for generating a banner image text box based on the key point coordinates and the width information vector information in the current image.
Firstly, the banner text detection network model extracts four feature images of different scales using ResNet-50 as the backbone network, and combines them in turn with an FPN (feature pyramid network) to obtain four fused feature images of different scales. The fused feature images of different scales are up-sampled by the corresponding factors to obtain four feature images of the same scale, which are superposed into one fused feature image; this fused feature image is up-sampled four-fold to obtain a fused feature image of the same size as the original image. A regression operation on the fused feature image yields two parts of regression data: the shape of the regressed text contour is compared with the shape of the real text box to judge their degree of similarity, while the regressed key point data is sent to the text box generation module, where the key point coordinates and width information yield two groups of long-side control points; the control points are converted into two Bezier curves, which are connected to obtain the final text box of the text instance.
The backbone network of the feature extraction module adopts the ResNet-50 model. After an image is input into ResNet-50, the image is first down-sampled so that its length and width are each reduced to 1/4 of the original, and the number of channels is increased from 3 to 64; a 1×1 convolution kernel then increases the number of channels from 64 to 256 with the length and width of the image unchanged, giving the first feature map C1. Channel increase and down-sampling are then applied to the feature map so that each time the length and width are halved the number of channels is doubled, giving the second feature map C2; repeating this operation yields four feature images C1, C2, C3, C4. ResNet-50 with the fully connected layer removed is combined with the FPN structure, and the four feature images of different scales obtained in the backbone network are used as the input of the FPN structure. Before feature images of different scales are fused, their channel numbers must be processed uniformly, so a 1×1 convolution kernel is added at the input of the FPN structure to reduce the number of channels of each feature image to 256, giving P1, P2, P3, P4.
Starting from the feature map of the smallest scale, P4, two-fold up-sampling is performed by nearest-neighbour interpolation and the result is added to the feature map of the same scale at the FPN input, P3, to obtain the fused lower-scale feature image F3. Nearest-neighbour interpolation is again used to up-sample F3 two-fold, and the result is added to P2 to obtain the fused feature image F2; likewise F2 is up-sampled two-fold and added to P1 to obtain the fused feature image F1. Finally the fused feature images F1, F2, F3, F4 (with F4 = P4) are taken as the output of the FPN.
The feature fusion module combines the fused feature images of different scales to obtain a combined fused feature image F. The specific calculation process is as follows:

F = C(F1, U2(F2), U4(F3), U8(F4))    (4)

where C denotes channel concatenation, U2, U4 and U8 denote 2-fold, 4-fold and 8-fold up-sampling of F2, F3 and F4 respectively, and F1, F2, F3, F4 are the fused feature images.
A 3×3 convolution layer (with BN and ReLU layers to accelerate model convergence and reduce model parameters) is used to reduce the number of channels of F to 256, and the feature image F is then up-sampled four-fold so that F is the same size as the original image.
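The channel concatenation with scale alignment in formula (4) can be sketched in plain Python on nested lists; this is a shape-level illustration only, and a real implementation would use framework tensors and learned interpolation:

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour up-sampling of a feature map given as
    [channel][row][col] nested lists (rows may be shared; read-only)."""
    out = []
    for ch in fmap:
        rows = []
        for r in ch:
            wide = [v for v in r for _ in range(factor)]  # repeat columns
            rows.extend([wide] * factor)                  # repeat rows
        out.append(rows)
    return out

def fuse(f1, f2, f3, f4):
    """Formula (4): concatenate F1 with up-sampled F2, F3, F4 along channels."""
    return f1 + upsample_nearest(f2, 2) + upsample_nearest(f3, 4) + upsample_nearest(f4, 8)
```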
The regression module comprises two parts: shape regression and key point regression. Shape regression converts the fused feature map F into a text shape feature map through a 3×3 convolution layer with a Sigmoid activation function; the feature map is binarized with a threshold of 0.5, regions above the threshold being taken as text regions and regions below it as background regions, so that a text shape binary map separating text from background is obtained. The text contour shape in the binary map is compared with the text box shape generated from the image's key point labels, and the two shapes are matched by their intersection over union (IOU). Because an image may contain several texts, key points could be matched to another text; the function of shape regression is to ensure that the key points lie inside the corresponding text contour shape, avoiding false detection. The input of key point regression is the fused feature map F, which is fed into two different detection branches. One branch detects the key point coordinates and outputs k key point heat maps, where k is the largest number of key points among the text instances in the detected image; in each key point heat map, the n highest-scoring highlighted coordinate points are taken as the key point coordinates of the corresponding key point of each text instance, where n is the number of text instances in the detected image. If a text instance has fewer key points, the number of highlighted coordinates is reduced accordingly.
The other branch detects the key point widths and, through a 3×3 convolution layer and a 1×1 convolution layer, outputs n groups of width information, where n is the number of text instances in the detected image. The width information corresponds one-to-one with the key points; if a text instance has fewer key points, the remaining width entries are 0.
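The selection of the highest-scoring highlighted points per heat map can be sketched as follows, with plain nested lists standing in for heat map tensors (the function name is hypothetical):

```python
def decode_keypoints(heatmaps, num_instances):
    """For each key point heat map (an H x W grid of scores), take the
    num_instances highest-scoring positions as key point coordinates."""
    results = []
    for hm in heatmaps:
        flat = [(score, (x, y))
                for y, row in enumerate(hm)
                for x, score in enumerate(row)]
        flat.sort(key=lambda item: item[0], reverse=True)
        results.append([pt for _, pt in flat[:num_instances]])
    return results
```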
The text box generation module takes the key point coordinates and width information output by the regression module as text instance information and uses them to generate the text box. Specifically, the width of a key point is the distance from the key point to the corresponding long-side coordinate point; each key point is extended upwards and downwards, perpendicular to the line connecting its adjacent key points, by the distance of its corresponding width, and the end point coordinates are the long-side coordinate points. Processing every key point in this way yields two groups of long-side coordinate points equal in number to the key points. Taking the long-side coordinate points as control points of Bezier curves, two Bezier curves are obtained from formula (2); connecting the two Bezier curves end to end gives a completely closed curve frame, which is taken as the text box of the text instance. Finally the image with the framed text is output, realizing text detection for the banner image.
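The recovery of long-side points from key points and widths can be sketched geometrically as follows; treating the extension direction as the unit normal of the chord through the neighbouring key points is a simplifying assumption, and the function name is hypothetical:

```python
from math import hypot

def long_side_points(keypoints, widths):
    """Extend each key point perpendicular to the local key point chain
    by its width, producing upper and lower long-side coordinate points."""
    upper, lower = [], []
    n = len(keypoints)
    for i, ((x, y), w) in enumerate(zip(keypoints, widths)):
        # tangent estimated from neighbouring key points (one-sided at the ends)
        x0, y0 = keypoints[max(i - 1, 0)]
        x1, y1 = keypoints[min(i + 1, n - 1)]
        tx, ty = x1 - x0, y1 - y0
        norm = hypot(tx, ty) or 1.0
        nx, ny = -ty / norm, tx / norm  # unit normal to the chord
        upper.append((x + nx * w, y + ny * w))
        lower.append((x - nx * w, y - ny * w))
    return upper, lower
```

The two returned point lists can then serve directly as the control points of the two Bezier curves.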
And 3, training the banner text detection network model constructed in the step 2 by using the key point data set obtained in the step 1.
The key point data set obtained in step 1 is divided into a training set and a test set; the training set is input into the banner text detection network model for iterative training, the parameters of the banner text detection network model are updated to minimize the loss function, the accuracy of the model on the test set is recorded, and the optimal model is saved. The training process is divided into shape detection training and key point detection training, and the corresponding loss function L is calculated as follows:
L = L_s + λ·L_k    (5)
where L_s is the shape loss function, L_k is the key point loss function, and λ is the weight factor of the loss function, set to a fixed value in this embodiment.
Considering that banner text has an arbitrary shape and a large aspect ratio, the CIOU loss function is used to define L_s; the specific formula is as follows:
L_s = 1 − IOU + ρ²(b, b_gt)/c² + α·v    (6)
where IOU represents the intersection over union of the regressed text contour shape and the text box generated from the key point labels; b and b_gt respectively represent the centre point coordinates of the regressed text contour shape and of the text box generated from the key point labels, each centre point being the coordinate of the middle key point taken clockwise, with the midpoint of the line connecting the two middle key points chosen when the number of key points is even; ρ denotes the Euclidean distance between the two centre points; c denotes the diagonal length of the minimum closure area that can contain both the regressed text contour shape and the text box generated from the key point labels; α is a regulatory factor for balancing the weights between overlap area and aspect-ratio similarity; and v is an index measuring the similarity of the aspect ratios.
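For intuition, a plain-Python sketch of a CIoU-style loss on axis-aligned boxes follows; the patent applies the loss to text contour shapes and key-point-generated boxes, so the axis-aligned form and the choice of the standard v and α terms are simplifying assumptions:

```python
import math

def ciou_loss(box_a, box_b):
    """CIoU-style loss for two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection over union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)
    # squared distance between centre points
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # squared diagonal of the minimum enclosing box
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2
    # aspect-ratio similarity term v and its weight alpha
    v = (4 / math.pi ** 2) * (math.atan((ax2 - ax1) / (ay2 - ay1))
                              - math.atan((bx2 - bx1) / (by2 - by1))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```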
The key point comprises two parts, the key point coordinate and the width, so the key point loss function L_k is calculated as follows:
L_k = L_c + μ·L_w    (7)
where L_c is the key point coordinate loss function, L_w is the key point width loss function, and μ is the weight factor, set to 0.2 in this embodiment.
Considering that during training the number of negative samples of key point coordinates far exceeds the number of positive samples, a variant of the focal loss function is adopted as L_c to address the imbalance between positive and negative samples:
L_c = −(1/N) Σ_{c=1}^{C} Σ_{h=1}^{H} Σ_{w=1}^{W} { (1 − P̂_chw)^α · log(P̂_chw),  if P_chw = 1;  (1 − P_chw)^β · (P̂_chw)^α · log(1 − P̂_chw),  otherwise }    (8)
where P̂_chw is the score of key point (c, h, w) in the key point heat map regressed by the regression module, P_chw represents the coordinate point score of the real key point heat map obtained by Gaussian function calculation on the image with key point labels, N is the number of text instances in the image, C represents the number of channels of the regression key point heat map, H and W respectively represent the height and width of the regression key point heat map, and α and β are hyper-parameters controlling the contribution of each key point, set to fixed values in this embodiment; the term (1 − P_chw)^β is used to reduce the penalty for points around the key point coordinates.
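A sketch of this focal loss variant over key point heat maps, using nested lists in place of tensors; the default α, β values and the normalization by positive locations here are assumptions, not values from the patent:

```python
import math

def keypoint_focal_loss(pred, target, alpha=2.0, beta=4.0):
    """Focal loss variant over key point heat maps.

    pred, target: same-shaped nested lists [C][H][W] of scores in (0, 1);
    target is the Gaussian-rendered ground-truth heat map.
    """
    loss, num_pos = 0.0, 0
    for pc, tc in zip(pred, target):
        for pr, tr in zip(pc, tc):
            for p, t in zip(pr, tr):
                if t == 1.0:  # positive location (a labeled key point)
                    loss -= (1 - p) ** alpha * math.log(p)
                    num_pos += 1
                else:         # (1 - t)^beta down-weights points near a key point
                    loss -= (1 - t) ** beta * p ** alpha * math.log(1 - p)
    return loss / max(num_pos, 1)
```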
Since the width generated for each key point is arbitrary, the L1 loss function is adopted as L_w:
L_w = (1/N) Σ_{i=1}^{N} |W_i − Ŵ_i|    (9)
where N is the number of text instances in the image, |·| denotes the absolute value of the quantity in brackets, W_i represents the labeled key point width obtained from the image with key point labels, and Ŵ_i is the key point width regressed by the regression module.
In order to accelerate model convergence, non-text region coordinate points are not considered in the regression of the key point coordinates, so that the number of negative samples is reduced.
After training the banner text detection network model by using training set data, putting the test set into the model, comparing the accuracy and the detection speed of text detection, and extracting an optimal detection model.
And 4, detecting the text in the banner image by using the trained banner text detection network model.
An image containing banner text is input into the banner text detection network model trained in step 3, and a banner text image with text boxes is obtained through feature extraction, feature fusion, regression and text box generation. The specific process is as follows: the banner text image is input into the banner text detection network model, and four feature images of different scales are obtained through the ResNet-50 + FPN feature extraction network; the four images are up-sampled by different factors so that their scales are identical, and then fused into one feature image, which is up-sampled four-fold so that it is the same size as the original image. Activation mapping of the fused feature image yields the key point heat maps, from which the key point coordinates and widths are obtained. Finally, two groups of long-side coordinate points are calculated from the key point coordinates and width information, two Bezier curves are generated with the long-side coordinate points as control points, the closed curve frame obtained by connecting the two Bezier curves end to end is taken as the text box, and the banner text image with text box marks is output.
Example two
Based on the same inventive concept, the invention also provides a banner text detection system based on the Bezier curve and the key points, which comprises a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the program instructions in the memory to execute the banner text detection method based on the Bezier curve and the key points.
In particular, the method according to the technical solution of the present invention may be implemented by those skilled in the art using computer software technology to implement an automatic operation flow, and a system apparatus for implementing the method, such as a computer readable storage medium storing a corresponding computer program according to the technical solution of the present invention, and a computer device including the operation of the corresponding computer program, should also fall within the protection scope of the present invention.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A banner text detection method based on Bezier curves and key points is characterized by comprising the following steps:
Step 1, generating an initial text box of a text region according to labels of a public text data set, simplifying the number of long-side coordinates of the text box through a fixed threshold value, generating Bezier curves based on simplified long-side coordinate points, connecting two Bezier curves end to form a new text box, and converting the labels of the text box from boundary coordinate points of the text box to key point coordinates and the width of key points;
step 2, constructing a banner text detection network model;
the banner text detection network model comprises a feature extraction module, a feature fusion module, a regression module and a text box generation module; the feature extraction module is used for extracting feature information of different layers to obtain feature images containing semantic information from a lower layer to a higher layer; the feature fusion module is used for combining the feature images of different layers to obtain a fused feature image which is used for detecting the banner text subsequently; the regression module is used for regressing the shape of the text instance, and the coordinates of key points and the width of the key points of the text instance; the text box generation module is used for generating a banner image text box based on the key point coordinates and the width information in the current image;
step 3, training the banner text detection network model constructed in the step 2 by utilizing the key point data set obtained in the step 1;
And 4, detecting the text in the banner image by using the trained banner text detection network model.
2. The method for detecting banner text based on the Bezier curve and key points as claimed in claim 1, characterized in that step 1 comprises the following steps:
step 1.1, selecting images with long texts and distorted texts in a public text image dataset as a dataset, and generating an initial text box of a text region according to labels of the public text dataset;
the labels of the public text data set are a plurality of groups of coordinates arranged clockwise, each group being the text box boundary point coordinates of a framed text; connecting each group of coordinates clockwise forms a closed polygon, giving the initial text box of the text; the number of data set image boundary points is set as 2n, the first n points in order are selected as upper boundary points and the last n points as lower boundary points, and the line connecting the upper boundary points and the line connecting the lower boundary points are taken as the two long sides of the initial text box;
step 1.2, judging the bending degree of the long edge of the text box by adopting a fixed threshold value method;
step 1.3, selectively simplifying coordinate points of two long sides of the text box according to the bending degree of the two long sides of the text box;
Step 1.4, taking coordinate points on two simplified long sides as control points of a Bezier curve, generating two corresponding Bezier curves, and connecting the two Bezier curves end to obtain a real boundary frame of the text;
and step 1.5, converting the labels of the public data set from text box boundary coordinate points to key point coordinates and the width of the key points.
3. The method for detecting banner text based on the Bezier curve and key points as claimed in claim 2, characterized in that: in step 1.2, the distance between the line connecting the head and tail coordinates of each long side of the text box in the data set and the other coordinate points on that long side is compared through a fixed threshold, and the degree of bending of the two long sides of the text box is judged, namely:
Bend = { straight line, 0 ≤ r < T1;  partially bent, T1 ≤ r < T2;  completely bent, r ≥ T2 }    (1)
where Bend indicates the degree of bending of the long side of the text box, and r represents the ratio of the farthest distance from the coordinate points on the long side of the text box in the image data to the line connecting the head and tail coordinates of that long side, to the length of that connecting line; when the ratio is greater than or equal to 0 and less than T1, the long side is judged to be a straight line; when the ratio is greater than or equal to T1 and less than T2, the long side is judged to be partially bent; when the ratio is greater than or equal to T2, the long side is judged to be completely bent; T1 and T2 are set thresholds.
4. The method for detecting banner text based on the Bezier curve and key points as claimed in claim 2, characterized in that: in step 1.3, the distance from a coordinate point on the long side to the line connecting the head and tail coordinate points is denoted d, and the head and tail coordinate points are denoted P_s and P_e respectively; the specific simplification process is as follows: when the long side is judged to be a straight line, only the head and tail coordinate points of the long side are retained; when the long side is judged to be partially bent, the coordinate point farthest from the head-tail line and the head and tail coordinate points are retained; when the long side is judged to be completely bent, a threshold D is set to 0.1 times the length of the head-tail line, and when d is greater than D the corresponding coordinate point is retained while the other coordinate points are discarded; letting the coordinate point with the maximum d be P_m, P_m is used to divide the curve into the two parts (P_s, P_m) and (P_m, P_e), and the above operation is repeated until no coordinate point's distance to its connecting line is greater than D.
5. The method for detecting banner text based on the Bezier curve and key points as claimed in claim 2, characterized in that: in step 1.4, the coordinate points on the simplified long sides are taken as control points of a Bezier curve, and the Bezier curve is expressed by a parametric curve based on the Bernstein polynomial, specifically defined as in the following formulas:
B(t) = Σ_{i=0}^{n} b_{i,n}(t) · P_i,  0 ≤ t ≤ 1    (2)
b_{i,n}(t) = C_n^i · t^i · (1 − t)^(n−i),  i = 0, 1, …, n    (3)
where B(t) denotes the set of coordinates of points on the Bezier curve, n denotes the order of the Bezier curve, P_i denotes the coordinates of the i-th control point, b_{i,n}(t) denotes the Bernstein polynomial of the i-th control point, C_n^i denotes the binomial coefficient, and t denotes the time to which the coordinates of each point on the Bezier curve correspond; since at t = 0 or t = 1 every Bernstein polynomial except the first or the last takes the value 0, when t = 0 the first coordinate point on the long side is selected as the position coordinate of the Bezier curve at time 0, and when t = 1 the last coordinate point on the long side is selected as the position coordinate of the Bezier curve at time 1;
two Bezier curves are generated through the formula (2), and a closed polygon formed by connecting the two Bezier curves end to end is used as a real text box of the text example.
6. The method for detecting banner text based on the Bezier curve and key points as claimed in claim 2, characterized in that: in step 1.5, the boundary points on the two long sides are converted into a group of key points that represent the text box, and before the boundary points are converted into key points an upward-compatible strategy is adopted to ensure that the numbers of boundary points on the upper and lower long sides of the text box are consistent, with the following specific steps: when one of the upper and lower edges is a straight line and the other is partially bent, the midpoint of the straight edge is additionally extracted as a boundary point, so that the upper and lower edges each have three boundary points; when one edge is a straight line and the other is completely bent, the straight edge is divided equally according to the number of coordinate points on the completely bent edge, and the equally divided coordinate points are extracted so that the numbers of boundary points on the upper and lower edges are consistent; when one edge is partially bent and the other completely bent, the two curve segments of the partially bent edge are divided equally according to the number of coordinate points on the completely bent edge minus the number on the partially bent edge, and the equally divided coordinate points are extracted so that the numbers of boundary points on the upper and lower edges are consistent; after the numbers of upper and lower boundary points have been unified by the above operations, the boundary points are converted: the coordinates on the upper and lower edges are paired one-to-one from head to tail, the midpoint of each pair of corresponding coordinate points is taken as a key point coordinate, and half the distance between the pair is taken as the key point width, so that the labels in the public image text data set are converted from bounding-box coordinate points into a group of key point coordinates with corresponding widths.
7. The method for detecting banner text based on the Bezier curve and key points as claimed in claim 1, characterized in that: in step 2, the backbone network of the feature extraction module adopts the ResNet-50 model; after an image is input into the ResNet-50 model, four feature maps C1, C2, C3, C4 are obtained in turn through channel-increasing and down-sampling processing; the channel numbers of the four feature images of different scales obtained in the backbone network are uniformly processed to obtain P1, P2, P3, P4; then, starting from the feature map of the smallest scale, P4, up-sampling is performed and the result is added to the feature map of the same scale at the FPN input, P3, to obtain the fused lower-scale feature image F3; F3 is up-sampled and added to P2 to obtain the fused feature image F2; likewise F2 is up-sampled and added to P1 to obtain the fused feature image F1; finally the fused feature images F1, F2, F3, F4 are taken as the output of the FPN;
the feature fusion module combines the fused feature images of different scales to obtain a combined fused feature image F; the specific calculation process is as follows:

F = C(F1, U2(F2), U4(F3), U8(F4))    (4)

where C denotes channel concatenation, U2, U4 and U8 denote 2-fold, 4-fold and 8-fold up-sampling of F2, F3 and F4 respectively, and F1, F2, F3, F4 are the fused feature images;
the combined fused feature image F is up-sampled so that F is the same size as the original image.
8. The method for detecting banner text based on the Bezier curve and key points as claimed in claim 7, characterized in that: the regression module in step 2 comprises two parts, shape regression and key point regression; shape regression converts the fused feature map F into a text shape feature map through a convolution layer with an activation function, and the feature map is binarized with a set threshold, regions above the threshold being text regions and regions below it being background regions, giving a text shape binary map in which text is separated from the background; the text contour shape in the binary map is compared with the text box shape generated from the image key point labels, and the two shapes are matched by their intersection over union (IOU); the input of key point regression is the fused feature map F and the output is the key point coordinates and widths, obtained through two branches: one branch outputs k key point heat maps, k being the largest number of key points among the text instances of the detected image, and in each heat map the n highest-scoring highlighted coordinate points are taken as the key point coordinates of the corresponding key point of each text instance, n being the number of text instances of the detected image; if a text instance has fewer key points, the number of highlighted coordinates is reduced accordingly; the other branch outputs the key point widths, which correspond one-to-one with the key points, the remaining width entries being 0 when a text instance has fewer key points; the text box generation module takes the key point coordinates and width information output by the regression module as text instance information and uses them to generate the text box; the width of a key point is the distance from the key point to the corresponding long-side coordinate point, each key point is extended upwards and downwards, perpendicular to the line connecting its adjacent key points, by the distance of its corresponding width, and the end point coordinates are the long-side coordinate points; processing every coordinate point in this way yields two groups of long-side coordinate points equal in number to the key points; two Bezier curves are generated with the long-side coordinate points as control points, and connecting the two Bezier curves end to end gives a completely closed curve frame, which is the text box of the text instance; finally the image with the framed text is output, realizing text detection of the banner image.
9. The method for detecting banner text based on the Bezier curve and key points as claimed in claim 1, characterized in that: in step 3, the key point data set obtained in step 1 is divided into a training set and a test set, the training set is input into the banner text detection network model for iterative training, the parameters of the banner text detection network model are updated to minimize the loss function, the accuracy of the model on the test set is recorded, and the optimal model is saved; the training process is divided into shape detection training and key point detection training, and the corresponding loss function L is calculated as follows:
L = L_s + λ·L_k    (5)
where L_s is the shape loss function, L_k is the key point loss function, and λ is the weight factor of the loss function;
the shape loss function L_s is calculated as follows:
L_s = 1 − IOU + ρ²(b, b_gt)/c² + α·v    (6)
where IOU represents the intersection over union of the regressed text contour shape and the text box generated from the key point labels; b and b_gt respectively represent the centre point coordinates of the regressed text contour shape and of the text box generated from the key point labels, each centre point being the coordinate of the middle key point taken clockwise, with the midpoint of the line connecting the two middle key points chosen when the number of key points is even; ρ denotes the Euclidean distance between the two centre points; c denotes the diagonal length of the minimum closure area that can contain both the regressed text contour shape and the text box generated from the key point labels; α is a regulatory factor for balancing the weights between overlap area and aspect-ratio similarity; v is an index measuring the similarity of the aspect ratios;
the key point loss function L_kp comprises a key point coordinate term and a width term, and the specific calculation formulas are as follows:
L_kp = L_coord + μ·L_width        (7)
L_coord = −(1/N) Σ_{k=1..C} Σ_{i=1..H} Σ_{j=1..W} { (1 − Ŷ_kij)^α · log(Ŷ_kij),  if Y_kij = 1;  (1 − Y_kij)^β · (Ŷ_kij)^α · log(1 − Ŷ_kij),  otherwise }        (8)
L_width = (1/N) Σ_{n=1..N} |ŵ_n − w_n|        (9)
where L_coord is the key point coordinate loss function, L_width is the key point width loss function, μ is a weight factor, N is the number of text instances in the image, C denotes the number of channels of the regressed key point heatmap, H and W denote the height and width of the regressed key point heatmap respectively, Ŷ_kij is the score of key point (k, i, j) in the heatmap regressed by the regression module, Y_kij denotes the score of the corresponding point in the ground-truth key point heatmap obtained by applying a Gaussian function to the key-point-labeled image, α and β are hyperparameters controlling the contribution of each key point, with the factor (1 − Y_kij)^β reducing the penalty for points around the key point coordinates, ŵ_n and w_n are the regressed and labeled widths of the n-th text instance, and |·| denotes the absolute value;
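The coordinate term described above follows the CornerNet-style focal loss over a Gaussian-splatted heatmap, and the width term is a per-instance absolute difference. A pure-Python sketch under those assumptions (heatmaps as nested [C][H][W] lists of scores in (0, 1); function names are illustrative, not the patent's):

```python
import math

def keypoint_focal_loss(pred_heat, gt_heat, alpha=2.0, beta=4.0, n_instances=1):
    """Focal loss over a keypoint heatmap.

    gt_heat holds Gaussian-splatted labels with 1.0 exactly at keypoints;
    negatives near a keypoint are down-weighted by (1 - y)**beta.
    """
    loss = 0.0
    for pc, gc in zip(pred_heat, gt_heat):      # channels
        for pr, gr in zip(pc, gc):              # rows
            for y_hat, y in zip(pr, gr):        # columns
                if y == 1.0:  # positive location: the keypoint itself
                    loss -= (1 - y_hat) ** alpha * math.log(y_hat)
                else:         # negative, softened near keypoints
                    loss -= (1 - y) ** beta * y_hat ** alpha * math.log(1 - y_hat)
    return loss / n_instances

def keypoint_width_loss(pred_widths, gt_widths):
    """Mean absolute error between regressed and labeled text widths."""
    return sum(abs(p - g) for p, g in zip(pred_widths, gt_widths)) / len(gt_widths)
```

A confident, correct heatmap yields a much smaller coordinate loss than a confidently wrong one, which is the behavior the hyperparameters α and β tune.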
when regressing the key point coordinates, coordinates in non-text areas are not considered, which reduces the number of negative samples; after the banner text detection network model has been trained on the training set, the test set is fed into the model, the text detection accuracy and detection speed are compared, and the optimal detection model is extracted.
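The checkpoint-selection step described above (train, score on the held-out test set, keep the best) can be sketched generically; `train_epoch` and `evaluate` are hypothetical callables standing in for the patent's training update and accuracy measurement:

```python
def select_best_model(train_epoch, evaluate, epochs):
    """Keep the checkpoint with the highest held-out accuracy.

    train_epoch(epoch) -> model state after one epoch of updates
    evaluate(state)    -> accuracy of that state on the test set
    Both are placeholders for the actual training/evaluation routines.
    """
    best_acc, best_state = float("-inf"), None
    for epoch in range(epochs):
        state = train_epoch(epoch)   # one epoch of loss-minimizing updates
        acc = evaluate(state)        # accuracy on the held-out test set
        if acc > best_acc:           # retain only the best checkpoint
            best_acc, best_state = acc, state
    return best_state, best_acc
```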
10. A Bezier curve and key point based banner text detection system, comprising a processor and a memory, the memory being configured to store program instructions and the processor being configured to invoke the instructions stored in the memory to perform the Bezier curve and key point based banner text detection method according to any one of claims 1-9.
CN202310714974.5A 2023-06-16 2023-06-16 Banner curve and key point-based banner text detection method and system Active CN116453133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310714974.5A CN116453133B (en) 2023-06-16 2023-06-16 Banner curve and key point-based banner text detection method and system

Publications (2)

Publication Number Publication Date
CN116453133A true CN116453133A (en) 2023-07-18
CN116453133B CN116453133B (en) 2023-09-05

Family

ID=87132471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310714974.5A Active CN116453133B (en) 2023-06-16 2023-06-16 Banner curve and key point-based banner text detection method and system

Country Status (1)

Country Link
CN (1) CN116453133B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120294528A1 (en) * 2011-05-19 2012-11-22 Jia Li Method of Detecting and Correcting Digital Images of Books in the Book Spine Area
US20160078660A1 (en) * 2012-06-20 2016-03-17 Amazon Technologies, Inc. Simulating variances in human writing with digital typography
US20170076169A1 (en) * 2011-10-17 2017-03-16 Sharp Laboratories of America (SLA), Inc. System and Method for Scanned Document Correction
CN108564639A (en) * 2018-04-27 2018-09-21 广州视源电子科技股份有限公司 Handwriting storage method and device, intelligent interaction equipment and readable storage medium
CN111414915A (en) * 2020-02-21 2020-07-14 华为技术有限公司 Character recognition method and related equipment
CN112183322A (en) * 2020-09-27 2021-01-05 成都数之联科技有限公司 Text detection and correction method for any shape
US20210166052A1 (en) * 2019-12-03 2021-06-03 Nvidia Corporation Landmark detection using curve fitting for autonomous driving applications
CN113537187A (en) * 2021-01-06 2021-10-22 腾讯科技(深圳)有限公司 Text recognition method and device, electronic equipment and readable storage medium
CN114898379A (en) * 2022-05-10 2022-08-12 度小满科技(北京)有限公司 Method, device and equipment for recognizing curved text and storage medium
CN115731539A (en) * 2022-11-16 2023-03-03 武汉电信实业有限责任公司 Video banner text detection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YULIANG LIU ET AL.: "ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9806-9815 *
YULIANG LIU ET AL.: "ABCNet v2: Adaptive Bezier-Curve Network for Real-Time End-to-End Text Spotting", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pages 8048-8064, XP011921960, DOI: 10.1109/TPAMI.2021.3107437 *


Similar Documents

Publication Publication Date Title
CN111814794B (en) Text detection method and device, electronic equipment and storage medium
EP3635629A1 (en) Fine-grained image recognition
CN111401361A (en) End-to-end lightweight deep license plate recognition method
CN110852349A (en) Image processing method, detection method, related equipment and storage medium
WO2022037170A1 (en) Instance segmentation method and system for enhanced image, and device and medium
CN109447897B (en) Real scene image synthesis method and system
CN112418195B (en) Face key point detection method and device, electronic equipment and storage medium
CN112465801B (en) Instance segmentation method for extracting mask features in scale division mode
CN116645592B (en) Crack detection method based on image processing and storage medium
CN112070037B (en) Road extraction method, device, medium and equipment based on remote sensing image
CN115631112B (en) Building contour correction method and device based on deep learning
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
CN111178363B (en) Character recognition method, character recognition device, electronic equipment and readable storage medium
CN114463335A (en) Weak supervision semantic segmentation method and device, electronic equipment and storage medium
CN111368848B (en) Character detection method under complex scene
CN111340139B (en) Method and device for judging complexity of image content
CN116453133B (en) Banner curve and key point-based banner text detection method and system
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
CN116704206A (en) Image processing method, device, computer equipment and storage medium
WO2022267387A1 (en) Image recognition method and apparatus, electronic device, and storage medium
CN114155540A (en) Character recognition method, device and equipment based on deep learning and storage medium
CN115731539A (en) Video banner text detection method and system
CN114708591A (en) Document image Chinese character detection method based on single character connection
CN111582275B (en) Serial number identification method and device
CN110516669B (en) Multi-level and multi-scale fusion character detection method in complex environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant