CN114898372A - Vietnamese scene character detection method based on edge attention guidance - Google Patents

Vietnamese scene character detection method based on edge attention guidance

Info

Publication number
CN114898372A
CN114898372A
Authority
CN
China
Prior art keywords
information
target
edge
branch
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210628050.9A
Other languages
Chinese (zh)
Inventor
文益民 (Wen Yimin)
王利兵 (Wang Libing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202210628050.9A priority Critical patent/CN114898372A/en
Publication of CN114898372A publication Critical patent/CN114898372A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G06T5/30 Erosion or dilatation, e.g. thinning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of text detection, and in particular to a Vietnamese scene character detection method based on edge attention guidance, comprising the following steps: extracting feature information of a target using ResNet, in which a receptive field residual block RFRB generates rich receptive fields; fusing the feature information using a multi-path fusion feature pyramid network MF-FPN to obtain feature information of the target at different levels; inputting the feature information into an RPN to obtain a number of candidate boxes; after applying RoI Align to the candidate boxes and the feature information, inputting the result into a classification branch and a mask branch to predict the target's category information, bounding box information and mask information, using a Re-Score mechanism in the classification branch to suppress non-text targets, and using an Edge Attention Mechanism (EAM) in both the classification branch and the mask branch to highlight the target's edges.

Description

Vietnamese scene character detection method based on edge attention guidance
Technical Field
The invention relates to the field of text detection, and in particular to a Vietnamese scene character detection method based on edge attention guidance.
Background
Natural scene text detection is a technique for automatically detecting text objects in natural scene images; it is widely used in automatic driving, signboard recognition, scene understanding, and the like, and has attracted the attention and research of numerous researchers. However, most existing methods study non-tonal languages such as English, while scene text detection for tonal languages such as Vietnamese has rarely been researched.
Vietnamese is a tonal language that uses accents or diacritics to represent vowels and tones: three symbols are used to form additional vowels and five symbols represent the tones of Vietnamese, and the five tone symbols determine the meaning of each word. Owing to this unique structure of Vietnamese characters, detecting Vietnamese text in natural scenes presents the following difficulties compared with existing detection techniques, which mainly target English:
1. Richer and more robust feature information is needed to detect as many Vietnamese scene text targets as possible and to extract the features of the diacritical marks;
2. Owing to the presence of diacritics and the interference of background information, some non-text targets are easily mis-detected as text targets, i.e., false positives appear;
3. The diacritics of Vietnamese characters in natural scenes are small compared with the Latin base letters and are easily ignored during detection, so that Vietnamese text targets cannot be completely represented (the diacritics are incompletely detected, hence the text targets are incompletely detected); moreover, two diacritical symbols may appear above some characters;
4. In natural scenes, the scale of Vietnamese scene text targets varies greatly.
Disclosure of Invention
The invention aims to provide a Vietnamese scene character detection method based on edge attention guidance, in order to detect Vietnamese scene text targets of different scales more accurately, in particular their diacritical information, and to effectively eliminate non-text targets.
To achieve the above object, the present invention provides a method for detecting text in Vietnamese scenes based on edge attention guidance, which includes: extracting the target's feature information using ResNet, where a receptive field residual block RFRB in ResNet generates rich receptive fields so as to adapt to Vietnamese scene text targets of different scales;
fusing the feature information using a multi-path fusion feature pyramid network MF-FPN to obtain feature information of the target at different levels, such as the target's spatial position information and diacritic detail information;
inputting the feature information into an RPN to obtain a number of candidate boxes;
and after applying RoI Align to the candidate boxes and the feature information, inputting the result into a classification branch and a mask branch to predict the target's category information, bounding box information and mask information, using a Re-Score mechanism to suppress non-text targets, and at the same time using the edge attention mechanism EAM to highlight the target's edges.
In the Vietnamese scene character detection method, the specific way of generating rich receptive fields with the receptive field residual block RFRB is as follows: first, a 1×1 convolution adjusts the number of feature channels; then the output features of three 3×3 dilated convolutions with dilation rates of 1, 2 and 3 are fused by concat; finally, a 1×1 convolution adjusts the number of channels and fuses the information across channels, thereby generating rich receptive fields.
In the Vietnamese scene character detection method, the multi-path fusion feature pyramid network MF-FPN is a network that performs feature fusion to generate feature maps at different levels containing different information. The specific way in which the MF-FPN extracts the target's feature information at different levels is as follows: the current level's feature from ResNet is passed through a 1×1 convolution with 256 output channels; the previous (lower) level's feature from ResNet is passed through 2×2 average pooling; the higher-level output is upsampled along the top-down path; the three are fused and then passed through a 3×3 convolution with 256 channels at the current level, thereby obtaining feature information at different levels.
In the Vietnamese scene character detection method, after RoI Align the candidate boxes and feature information are input into a classification branch and a mask branch to predict the target's category information, bounding box information and mask information, in the following specific way:
inputting the candidate boxes and the feature information into RoI Align, which maps the target's feature map to a fixed size;
inputting the fixed-size feature map into the classification branch, where accurate category information is obtained through a Re-Score mechanism; meanwhile, an edge branch predicts the target's edge contour probability map, which is multiplied with the intermediate feature information of the class and bounding box prediction branch to form the edge attention EAM, guiding the model to predict accurate bounding box information;
inputting the fixed-size feature map into the mask branch to obtain the target's mask map; meanwhile, the edge branch predicts the target's edge contour probability map, which is multiplied with the intermediate feature information of the mask prediction branch to form the edge attention EAM, guiding the model to predict accurate mask information.
Further, the specific steps by which the Re-Score mechanism obtains accurate category information are as follows:
inputting the candidate box's feature information into a convolutional network to obtain the target's visual category confidence;
inputting the candidate box's feature information into a sequence scoring branch to obtain the target's sequence confidence;
and multiplying each of the two by 0.5 and adding them to obtain the final category confidence, then selecting the category with the highest confidence as the target's category information.
Furthermore, the sequence scoring branch consists of a 1×1 convolutional layer with 1 channel, a Bi-LSTM layer and two fully-connected layers, and performs sequence modeling with the Bi-LSTM along the width dimension of the feature information.
In the Vietnamese scene character detection method, the specific way in which the Re-Score mechanism suppresses non-text targets is as follows: a 1×1 convolution adjusts the number of channels of the fixed-size candidate box features; a Bi-LSTM extracts sequence features; two fully-connected layers predict the target's sequence score; a convolutional network predicts the target's visual classification score; the two scores are each multiplied by 0.5 and added to obtain the final category confidence; and with 0.7 as the threshold, targets whose confidence is below the threshold are rejected.
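The score fusion and filtering just described can be sketched in a few lines of Python. This is an illustrative sketch only: the scores are dummy values, and the function name is an assumption, not from the patent.

```python
# Hypothetical sketch of the Re-Score fusion step: final confidence is the
# equally-weighted sum of the visual and sequence scores, and candidates
# below the 0.7 threshold are rejected.

def re_score(visual_scores, sequence_scores, threshold=0.7):
    """Return (index, confidence) pairs of the candidates that survive."""
    kept = []
    for i, (sv, ss) in enumerate(zip(visual_scores, sequence_scores)):
        s = 0.5 * sv + 0.5 * ss          # the 0.5/0.5 weighting from the text
        if s >= threshold:
            kept.append((i, s))
    return kept

# candidate 0: confidently text; candidate 1: visually text-like, but the
# sequence score is low -- the fused score falls below 0.7 and it is rejected
kept = re_score([0.9, 0.8], [0.8, 0.3])
print(kept)   # keeps only candidate 0
```

Here the sequence branch's low score (0.3) pulls the fused confidence of the second candidate down to 0.55, illustrating how a visually plausible false positive is suppressed.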
In the Vietnamese scene character detection method, the specific way in which the edge attention mechanism EAM highlights the target's edges is as follows: an edge branch formed by several convolutional layers predicts the target's edge contour probability map, in which edge pixels have values greater than or equal to 0.5 and non-edge pixels have values below 0.5; treating the edge probability map as an attention weight multiplied with the feature information forms the edge attention mechanism EAM.
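The attention step can be sketched as a plain element-wise product, as described above; this is a toy illustration with made-up values, not the trained model's output.

```python
import numpy as np

# Minimal sketch of the edge attention mechanism EAM: the predicted edge
# contour probability map is treated as an attention weight and multiplied
# element-wise with the intermediate feature map.

def eam(feature, edge_prob):
    """f' = f * edge_map (element-wise); edge pixels (>= 0.5) are emphasized."""
    return feature * edge_prob

feature = np.array([[1.0, 1.0],
                    [1.0, 1.0]])
edge_prob = np.array([[0.9, 0.1],     # left column: edge pixels (>= 0.5)
                      [0.8, 0.2]])    # right column: non-edge pixels
out = eam(feature, edge_prob)
print(out)   # responses at edge positions stay high, non-edge responses shrink
```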
In the Vietnamese scene character detection method based on edge attention guidance, the receptive field residual block RFRB generates different receptive fields, so the method effectively adapts to Vietnamese scene text targets of different scales; the multi-path fusion feature pyramid network MF-FPN fuses rich low-level features (from relatively low levels), highlighting the positions of Vietnamese scene text targets and the detail information of the diacritical marks; the Re-Score mechanism effectively eliminates non-text targets; and the EAM makes the model more sensitive to the edges of Vietnamese scene text targets, so that Vietnamese scene text targets, including their diacritics, are detected completely, while some non-text targets are eliminated.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of the Vietnamese scene text detection method according to an embodiment.
FIG. 2 is a schematic diagram of the algorithm structure of the Vietnamese scene text detection method according to an embodiment.
FIG. 3 is a structure diagram of the receptive field residual block RFRB according to an embodiment.
FIG. 4 is a structure diagram of the multi-path fusion feature pyramid network MF-FPN according to an embodiment.
FIG. 5 is a structure diagram of the modified stage1 according to an embodiment.
FIG. 6 is a structure diagram of the Re-Score mechanism according to an embodiment.
FIG. 7 is a structure diagram of the classification branch according to an embodiment.
FIG. 8 is a structure diagram of the mask branch according to an embodiment.
FIG. 9 is an example of the Vietnamese scene text detection experimental data according to an embodiment;
wherein (a) is a Vietnamese scene text image sample with bounding box labels; (b) is the binary text segmentation map corresponding to image sample (a); (c) is the text edge contour map corresponding to image sample (a).
FIG. 10 is an example of the Vietnamese scene text pictures used in the experiments of an embodiment.
FIG. 11 is a comparison of feature maps generated by the multi-path fusion feature pyramid network MF-FPN and by other methods according to an embodiment.
FIG. 12 is an F-measure comparison at different IoU thresholds after combining the edge attention EAM, according to an embodiment.
FIG. 13 is a comparison of example detection results of the edge-attention-guided Vietnamese scene text detection method and other methods according to an embodiment;
wherein (a) shows the original Vietnamese scene text images; (b) shows the results of detecting the original images with the baseline algorithm; (c) shows the results of detecting the original images with the improved Mask R-CNN; (d) shows the results of detecting the original images with the detection method of the invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, and the embodiments described by way of reference to the drawings are exemplary and intended to be illustrative of the invention and should not be construed as limiting the invention.
Examples
Referring to FIGS. 1 to 13, a method for detecting text in Vietnamese scenes based on edge attention guidance includes the following steps, with particular reference to FIGS. 1 and 2:
s101, inputting a Vietnam scene character picture, and extracting feature information C of a target by using ResNet 1 ~C 5
Wherein the ResNet is a feature extraction network commonly used in the art and comprises 5 stages 2 ~stage 5 Consists of a certain number of residual blocks, ResNet outputs a group of characteristics at each stage, and 5 characteristics C are obtained in total 1 ~C 5
To enable the method to adapt effectively to Vietnamese scene text targets of different scales, all residual blocks in ResNet are replaced with the receptive field residual block RFRB provided by this embodiment.
The specific structure of the RFRB is shown in FIG. 3. In one branch, the features are first input into a 1×1 convolution, then into three dilated convolutions with dilation rates of 1, 2 and 3 respectively; the three outputs are fused by concat (concatenation) along the channel dimension, after which a 1×1 convolution adjusts the number of channels and fuses the information across channels to produce the branch output. The other path is an identity shortcut, i.e., the input is output directly, or output after a 1×1 convolution that adjusts the number of channels. Finally, the outputs of the two branches are added, and the final output is obtained through a ReLU activation function.
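The channel and receptive-field bookkeeping of the RFRB main branch can be sketched as follows. The mid-channel count (64) and the function names are assumptions for illustration; the receptive-field formula for a dilated convolution is the standard one, k + (k−1)(d−1).

```python
# Hypothetical sketch of the RFRB main-branch bookkeeping; channel counts
# are assumptions, not taken from the patent.

def dilated_rf(kernel: int, dilation: int) -> int:
    """Effective receptive field of one dilated convolution: k + (k-1)(d-1)."""
    return kernel + (kernel - 1) * (dilation - 1)

def rfrb_shapes(in_ch: int, mid_ch: int):
    """Trace the main branch: 1x1 conv -> three parallel 3x3 dilated convs
    (rates 1, 2, 3) -> channel concat -> 1x1 conv back to in_ch for the
    shortcut add."""
    branch_rfs = [dilated_rf(3, d) for d in (1, 2, 3)]
    concat_ch = mid_ch * 3        # concat of the three parallel outputs
    out_ch = in_ch                # trailing 1x1 conv restores the channel count
    return branch_rfs, concat_ch, out_ch

rfs, cat_ch, out_ch = rfrb_shapes(in_ch=256, mid_ch=64)
print(rfs, cat_ch, out_ch)   # three receptive fields of 3, 5, 7 pixels
```

The three dilation rates thus give the block receptive fields of 3, 5 and 7 pixels in parallel, which is the "rich receptive fields" property the text describes.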
S102, fusing the 5 feature maps output by ResNet using the multi-path fusion feature pyramid network MF-FPN to obtain feature information P2~P6 at different levels;
The MF-FPN is an improvement of the FPN, a feature fusion network commonly used in the field.
The specific structure of the MF-FPN is shown in FIG. 4, where the dashed arrows mark the improvement of the invention: when each level's feature map Pk is formed, it fuses not only Ck and the upsampled top-down feature, but also the lower-level feature Ck-1. For example, the specific steps to obtain the P3 feature map are: the C2 feature is passed through 2×2 average pooling (Avgpool), the C3 feature through a 1×1 convolution, and the top-down-path feature through 2× upsampling (upsample); the three outputs are added (add) and then passed through a 3×3 convolution to obtain the P3 feature. The P6 feature map is obtained from the P5 feature map by max pooling (Maxpool).
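The three-way fusion for one pyramid level can be sketched in numpy. To keep the sketch self-contained, the convolutions are replaced by identity maps and nearest-neighbour upsampling is assumed, so only the spatial bookkeeping of the add is shown.

```python
import numpy as np

# Sketch of one MF-FPN fusion step (convs replaced by identity, values toy):
# P_k = conv3x3( avgpool2x2(C_{k-1}) + conv1x1(C_k) + upsample2x(P_{k+1}) )

def avgpool2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2x(x):
    return x.repeat(2, axis=0).repeat(2, axis=1)   # nearest-neighbour (assumed)

def mf_fpn_level(c_lower, c_same, p_higher):
    # all three paths are brought to the current level's resolution, then added
    return avgpool2x2(c_lower) + c_same + upsample2x(p_higher)

c2 = np.ones((8, 8))   # lower-level feature C2 (higher resolution)
c3 = np.ones((4, 4))   # same-level feature C3
p4 = np.ones((2, 2))   # top-down pyramid feature P4
p3 = mf_fpn_level(c2, c3, p4)
print(p3.shape)        # (4, 4): the three paths align at the current resolution
```

The point of the sketch is the extra bottom-up path: the 2×2 average pooling brings the higher-resolution C2 down to the resolution of P3, which the plain FPN does not do.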
Since obtaining the P2 feature requires fusing C1, and the C1 extracted by stage1 of ResNet carries limited information, the invention improves stage1 of ResNet so that the extracted C1 feature information is richer.
FIG. 5 shows the structure of the modified stage1, which comprises two branches. One branch first applies zero padding to the input, then a 7×7 depthwise separable convolution with stride 2 and 32 channels, and then max pooling to obtain its output; the other branch first applies a 1×1 convolution with stride 1 and 32 channels, and then two layers of 3×3 depthwise separable convolution with stride 2 and 32 channels to obtain its output. The outputs of the two branches are then fused by concat, and finally a 1×1 convolution with stride 1 and 64 channels yields the final feature information C1.
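A small sketch can check that the two branches of the modified stage1 reach the same spatial resolution before the concat. The padding values and the 3×3/stride-2 max-pooling parameters are assumptions, since the patent does not state them.

```python
# Spatial-size bookkeeping for the two stage1 branches (parameters assumed).

def conv_out(size: int, kernel: int, stride: int, pad: int) -> int:
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

def branch_a(size: int) -> int:
    # zero padding + 7x7 depthwise separable conv (stride 2), then max pooling
    size = conv_out(size, kernel=7, stride=2, pad=3)
    size = conv_out(size, kernel=3, stride=2, pad=1)   # assumed 3x3/s2 max pool
    return size

def branch_b(size: int) -> int:
    # 1x1 conv (stride 1), then two 3x3 depthwise separable convs (stride 2)
    size = conv_out(size, kernel=1, stride=1, pad=0)
    for _ in range(2):
        size = conv_out(size, kernel=3, stride=2, pad=1)
    return size

a, b = branch_a(224), branch_b(224)
print(a, b)   # both reach the same size, so the concat is well-defined
```

Both branches downsample by a total factor of 4 (224 → 56 under these assumptions), which is what makes the channel-wise concat possible.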
S103, inputting the feature information P2~P6 into the RPN to obtain a number of candidate boxes (Region Proposals);
wherein the RPN is the region proposal network proposed in the well-known Faster R-CNN algorithm; it generates a number of candidate boxes representing the positions in the image that the model considers likely to be text targets.
S104, after applying RoI Align to the candidate boxes (Region Proposals) and the feature information P2~P5, inputting the result into a classification branch and a mask branch to predict the category information S, bounding box information bbox and mask information mask of each candidate box (target);
wherein RoI Align was proposed in the Mask R-CNN algorithm; according to a candidate box's position information and size, it crops out the feature information corresponding to that candidate box.
FIG. 7 shows the structure of the classification branch. The candidate boxes (Region Proposals) are mapped by RoI Align to a 7×7 feature map, which is input into an edge branch and a class and bounding box prediction branch respectively. The edge branch consists of one 3×3 convolutional layer with 256 channels and one 1×1 convolutional layer with 1 channel; after Sigmoid activation it yields an edge contour probability map edge_map_cls, which is binarized to give an edge contour binary map. The class and bounding box prediction branch uses 4 convolutional layers (two 3×3 convolutions with 256 channels, one 7×7 convolution with 1024 channels and one 1×1 convolution with 1024 channels) and 1 fully-connected layer to predict the bounding box bbox and the visual classification confidence S_v. The predicted edge contour probability map edge_map_cls is multiplied element-wise with the feature f_cls output by the first convolutional layer to form the edge attention EAM, as computed in formula (1). Meanwhile, the Re-Score mechanism is fused into the class and bounding box prediction branch to obtain the final predicted category confidence S;

f'_cls = f_cls ⊗ edge_map_cls  (1)

where ⊗ denotes multiplication of corresponding position elements.
The specific steps of fusing the Re-Score mechanism into the class and bounding box prediction branch to obtain the final predicted category confidence S are as follows:
A schematic of the Re-Score mechanism is shown in FIG. 6.
The visual classification score S_v^i of the i-th candidate box is obtained using the class and bounding box prediction branch.
The semantic sequence score S_s^i of the candidate box is obtained using the sequence scoring branch. The sequence scoring branch first adjusts the number of channels with a convolution, then takes the w (width) dimension of the feature as the time steps and performs sequence modeling with a Bi-LSTM; after two fully-connected layers and a Softmax function, the target's sequence score S_s^i is obtained.
The final predicted category confidence S_i of the i-th candidate box is calculated by formula (2);

S_i = 0.5 × S_v^i + 0.5 × S_s^i  (2)
The structure of the mask branch is shown in FIG. 8. The candidate boxes are mapped by RoI Align into a 14×14 feature map and input into an edge branch and a mask prediction branch respectively. The edge branch consists of 4 convolutional layers (three 3×3 convolutions with 256 channels and one 1×1 convolution with 1 channel); after Sigmoid activation it yields an edge contour probability map edge_map_msk, which is binarized to give an edge contour binary map. The mask prediction branch has the same network structure as the mask branch in Mask R-CNN, consisting of four 3×3 convolutional layers with 256 channels, one 2×2 transposed convolutional layer with 256 channels and one 1×1 convolutional layer, with Sigmoid as the activation function, finally yielding the target's mask segmentation map. Meanwhile, the predicted edge contour probability map edge_map_msk is multiplied element-wise with the feature f_msk output by the first convolutional layer to form the edge attention EAM, as computed in formula (3);

f'_msk = f_msk ⊗ edge_map_msk  (3)
The method of the present invention is further described below with reference to an example.
The test environment and experimental results of the Vietnamese scene character detection method based on edge attention guidance provided by the invention are as follows:
1) and (3) testing environment:
System environment: Ubuntu 16.04;
Hardware environment: 256 GB memory; GPU: Tesla V100 × 4; CPU: Intel(R) Xeon(R) E5-2609 @ 1.70 GHz; hard disk: 8 TB;
2) experimental data:
to verify the effectiveness of the present invention, a natural scene multilingual text detection dataset (MLT 2017) was used, using only 7200 pictures of which only latin text types were included, and 200 pictures taken from a real vietnam scene.
The data labels used are shown in FIG. 9: the bounding box position coordinates of a Vietnamese scene text object (FIG. 9(a)), the binary mask map of the text object (FIG. 9(b)), and the edge contour map of the text object (FIG. 9(c)).
3) Implementation details:
the method is trained and tested using the data set. In training the model, the experiment was first pre-trained using the MLT 2017 dataset (80 epochs) and then fine-tuned to the entire model using the Vietnam text dataset (20 epochs). Wherein, the batch-size is set to 8, the optimizer selects SGD, the initial learning rate is 0.001, and the momentum is set to 0.9.
The experiments use five-fold cross validation. The Vietnamese scene text dataset is split into a training set of 160 pictures (each picture is augmented five times during training) and a test set of 40 pictures. Precision, Recall and F-measure are used as evaluation indexes. IoU is calculated as the intersection over union between the mask segmentation matrix obtained from the mask branch and the real target's mask matrix (binary mask segmentation map), rather than in the traditional box-based way, and the IoU threshold is set to 0.7.
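The mask-based IoU described above can be sketched directly on toy 4×4 binary masks:

```python
import numpy as np

# Sketch of the mask-based IoU used for evaluation: intersection over union
# of the predicted binary mask and the ground-truth binary mask, rather than
# the traditional box IoU.

def mask_iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

pred = np.zeros((4, 4), dtype=int); pred[:, :3] = 1   # 12 predicted pixels
gt   = np.zeros((4, 4), dtype=int); gt[:, 1:] = 1     # 12 ground-truth pixels
iou = mask_iou(pred, gt)
print(iou)   # 8 / 16 = 0.5: below the 0.7 threshold, so no match is counted
```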
In the experiments, ablation studies were carried out on the receptive field residual block RFRB, the multi-path fusion feature pyramid network MF-FPN, the Re-Score mechanism and the edge attention mechanism EAM, and the method was compared with other existing methods. The baseline algorithm is Mask R-CNN, on which the invention's improvements are built.
4) The experimental results are as follows:
a) results of the field-of-reception residual block RFRB experiment
As shown in Table 1, after the RFRB module is added, all evaluation indexes improve by nearly 2%. The performance gain comes precisely from the RFRB's ability to fuse different receptive fields and thus adapt more flexibly to Vietnamese scene text targets of different scales.
To highlight the RFRB's adaptability to Vietnamese scene text targets of different scales, the experiment evaluates the baseline and the baseline combined with RFRB according to the COCO definitions of small, medium and large targets, comparing the numbers of real Vietnamese scene text targets (true positives) of different scales each detects. Here a small target (S) has an area smaller than 32², a medium target (M) an area larger than 32² and smaller than 96², and a large target (L) an area larger than 96². The experimental results are shown in Table 2: the number of true positives of every scale detected after adding the RFRB module exceeds that of the baseline algorithm, with 3.6% more small targets, 1.8% more medium targets and 5.5% more large targets detected, further illustrating the RFRB's adaptability to Vietnamese scene text targets of different scales.
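The COCO-style size buckets quoted above can be expressed directly; the bucket labels are the S/M/L names used in the text.

```python
# COCO-style size buckets, following the area thresholds quoted in the text
# (32^2 = 1024 and 96^2 = 9216 square pixels).

def size_bucket(area: float) -> str:
    if area < 32 ** 2:
        return "S"      # small target
    if area < 96 ** 2:
        return "M"      # medium target
    return "L"          # large target

buckets = [size_bucket(a) for a in (30 * 30, 64 * 64, 128 * 128)]
print(buckets)   # a 30x30 box is small, 64x64 medium, 128x128 large
```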
b) Experimental result of multi-path fusion characteristic pyramid network MF-FPN
As shown in Table 1, compared with the baseline, after replacing the FPN in the baseline with MF-FPN (+ modified stage1), the Recall index improves by 2.8%, Precision by 3.8% and F-measure by 3.2%. When further combined with the RFRB, all indexes improve slightly again.
In the experiment, the image shown in FIG. 10 is input into the algorithm and the feature maps output by the FPN and the MF-FPN are visualized separately. As shown in FIG. 11, the feature maps obtained by the MF-FPN show the target position information more clearly than those of the FPN and contain more detail information, which helps in detecting Vietnamese scene text targets; for example, the P2 feature map extracted by the MF-FPN contains feature information of a diacritic (indicated by the arrow in FIG. 11).
To further verify the necessity of improving stage1 and the effectiveness of the MF-FPN, the following ablation experiments were performed: the stage1 of ResNet in the baseline was replaced with the modified stage1 network, and the FPN was replaced with the MF-FPN, each separately. The results are shown in Table 3. Replacing only the stage1 of ResNet with the modified stage1 raises the F-measure slightly, by 0.4%. After replacing the FPN with the MF-FPN, Precision, Recall and F-measure all improve by about 2%, further demonstrating the effectiveness of the MF-FPN. As shown in the last row of Table 3, combining the modified stage1 with the MF-FPN (MF-FPN + modified stage1) improves performance further.
c) Results of Re-Score mechanism experiment
The Re-Score mechanism adds a sequence score to each candidate box, making the category score of the candidate target more accurate and effectively eliminating non-text targets. As can be seen from Table 1, compared with the baseline algorithm the Precision value improves by nearly 8% after combining Re-Score, and by 5.7% when the RFRB and the MF-FPN are also combined. This is sufficient to illustrate the ability of the Re-Score mechanism to suppress false positive targets.
d) EAM experiment result of edge attention mechanism
As shown in Table 1, with an IoU threshold of 0.7, the evaluation indices of the baseline combined with the EAM improve considerably in every fold of the cross-validation experiment, with F-measure improving by 4.3% after combining this module.
To further explore the ability of the EAM to detect diacritics, experiments were run at higher IoU thresholds of 0.7, 0.8 and 0.9. As shown in FIG. 12, at higher IoU thresholds the model combined with the EAM still performs well compared with the baseline, proving that it can effectively detect diacritics and thus segment Vietnamese scene text targets more accurately.
e) Comparison with other methods
As can be seen from Table 4, the invention shows good performance: compared with the other methods, Precision improves by 10.6%, Recall by 0.6% and F-measure by 5.8%. FIG. 13 shows the detection results of the several methods, from which it can be seen that the results of the present invention have finer boundaries and are more accurate, and some false positive targets are effectively eliminated.
TABLE 1 ablation study (MF-FPN results in the Table are combined with the results of the modified stage1)
Table 2 number of real vietnam scene text objects (true positive objects) detected in different scales
TABLE 3 results of ablation study experiments on improved stage1 and MF-FPN
TABLE 4 comparison of the different methods
Reference for the baseline algorithm in Table 4: He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// International Conference on Computer Vision (ICCV). Piscataway, NJ: IEEE, 2017: 2961-2969.
Reference for the other method: Vietnamese scene text detection based on improved Mask R-CNN [J]. Computer Applications, 2021, 41(12): 7.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A Vietnamese scene text detection method based on edge attention guidance, characterized in that
the method comprises the following steps: extracting feature information of a target using ResNet, with a receptive field residual block RFRB in ResNet generating rich receptive fields;
fusing the feature information with a multi-path fusion feature pyramid network MF-FPN to obtain feature information of the target at different levels;
inputting the feature information into an RPN to obtain a number of candidate boxes;
inputting the candidate boxes and the feature information into RoI Align, then feeding the output into a classification branch and a mask branch to predict the class information, bounding box information and mask information of the target, suppressing non-text targets with a Re-Score mechanism, and meanwhile highlighting the edges of the target with an edge attention mechanism EAM.
2. The Vietnamese scene text detection method of claim 1, wherein
the specific method for generating rich receptive fields with the receptive field residual block RFRB is: first adjusting the number of feature channels with a 1×1 convolution; then fusing by concatenation the outputs of three 3×3 dilated convolutions with dilation rates of 1, 2 and 3 respectively; and then adjusting the number of channels with a 1×1 convolution and fusing the information, thereby generating rich receptive fields.
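The structure described in claim 2 can be sketched in PyTorch as follows. This is an illustrative sketch, not the patent's implementation: the channel counts and the residual skip connection (suggested by the block's name but not spelled out in the claim) are assumptions.

```python
import torch
import torch.nn as nn

class RFRB(nn.Module):
    """Sketch of the receptive field residual block of claim 2:
    1x1 conv -> three parallel 3x3 dilated convs (rates 1, 2, 3)
    -> concat -> 1x1 conv. Channel sizes are assumed."""
    def __init__(self, in_ch: int, mid_ch: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.branches = nn.ModuleList([
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=r, dilation=r)
            for r in (1, 2, 3)
        ])
        self.fuse = nn.Conv2d(3 * mid_ch, in_ch, kernel_size=1)

    def forward(self, x):
        y = self.reduce(x)                                   # adjust channels
        y = torch.cat([b(y) for b in self.branches], dim=1)  # concat fusion
        return self.fuse(y) + x  # assumed residual connection

x = torch.randn(1, 256, 32, 32)
out = RFRB(256)(x)
```

The `padding=r, dilation=r` pairing keeps the spatial size unchanged across all three branches, so the concatenation is well defined.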
3. The Vietnamese scene text detection method of claim 1, wherein
the multi-path fusion feature pyramid network MF-FPN extracts feature information of the target at different levels in the following way: passing the current-layer features obtained by ResNet through a 1×1 convolution; applying 2×2 average pooling to the previous-layer features obtained by ResNet; upsampling the pyramid features from the top down; fusing the three outputs; and passing the result through a 3×3 convolution to obtain the feature information at each level.
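One fusion step of the MF-FPN described above might look as follows. This is a hedged sketch: the output channel count, the 1×1 convolution on the previous layer (added here only to match channels), and element-wise addition as the fusion operator are assumptions, not details from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mf_fpn_level(cur, prev, top, out_ch: int = 256):
    """One MF-FPN fusion step (claim 3, sketched): 1x1 conv on the current
    ResNet level, 2x2 average pooling on the previous (higher-resolution)
    level, top-down upsampling of the pyramid feature, fusion, 3x3 conv."""
    lateral = nn.Conv2d(cur.shape[1], out_ch, kernel_size=1)(cur)
    pooled = F.avg_pool2d(                       # bring prev to cur's resolution
        nn.Conv2d(prev.shape[1], out_ch, kernel_size=1)(prev), kernel_size=2)
    upsampled = F.interpolate(top, size=cur.shape[-2:], mode="nearest")
    fused = lateral + pooled + upsampled         # assumed element-wise fusion
    return nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)(fused)

cur = torch.randn(1, 256, 16, 16)   # current ResNet level
prev = torch.randn(1, 128, 32, 32)  # previous, higher-resolution level
top = torch.randn(1, 256, 8, 8)     # top-down pyramid feature
out = mf_fpn_level(cur, prev, top)
```

Layers are constructed inline here for brevity; a trainable module would register them once in `__init__`.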
4. The Vietnamese scene text detection method of claim 1, wherein
after the candidate boxes and the feature information pass through RoI Align, a classification branch and a mask branch are applied to predict the class information, bounding box information and mask information of the target, specifically:
inputting the candidate boxes and the feature information into RoI Align, which maps the feature map of the target to a fixed size;
inputting the fixed-size feature map into the classification branch and obtaining accurate class information through the Re-Score mechanism; meanwhile, an edge branch predicts an edge contour probability map of the target, which is multiplied with the intermediate feature information of the class and bounding box prediction branch to form the edge attention EAM, guiding the model to predict accurate bounding box information;
inputting the fixed-size feature map into the mask branch to obtain the mask map of the target; meanwhile, the edge branch predicts an edge contour probability map of the target, which is multiplied with the intermediate feature information of the mask prediction branch to form the edge attention EAM, guiding the model to predict accurate mask information.
5. The Vietnamese scene text detection method of claim 4, wherein
the specific steps of obtaining accurate class information through the Re-Score mechanism are:
inputting the feature information of the candidate box into a convolutional network to obtain the visual class confidence of the target;
inputting the feature information of the candidate box into a sequence scoring branch to obtain the sequence confidence of the target;
multiplying each of the two by 0.5 and adding them to obtain the final class confidence, and selecting the class with the highest confidence as the class information of the target.
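The score fusion in the steps above reduces to an equally weighted average followed by an arg-max over classes. A minimal sketch (function names are illustrative):

```python
def re_score(visual_conf: float, sequence_conf: float) -> float:
    """Fuse the visual class confidence and the sequence confidence
    with equal 0.5 weights, as described in claim 5."""
    return 0.5 * visual_conf + 0.5 * sequence_conf

def final_class(visual_scores: dict, sequence_scores: dict) -> str:
    """Select the class whose fused confidence is highest."""
    fused = {c: re_score(v, sequence_scores[c]) for c, v in visual_scores.items()}
    return max(fused, key=fused.get)
```

For example, a candidate with a high visual score but a low sequence score (e.g. a texture that merely resembles text) is pulled down toward rejection, which is how the mechanism suppresses non-text targets.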
6. The Vietnamese scene text detection method of claim 5, wherein
the sequence scoring branch consists of one 1×1 convolution layer, one Bi-LSTM layer and two fully connected layers, the Bi-LSTM performing sequence modeling along the width dimension of the feature information.
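A sketch of this branch in PyTorch follows. The channel and hidden sizes, the height pooling used to obtain a width-wise sequence, and the final-timestep readout are all assumptions made for the sketch; the patent only fixes the 1×1 conv / Bi-LSTM / two-FC layout.

```python
import torch
import torch.nn as nn

class SequenceScoringBranch(nn.Module):
    """Sketch of the sequence scoring branch of claim 6: a 1x1 conv,
    a Bi-LSTM run along the width dimension, and two FC layers
    producing a scalar sequence score per RoI."""
    def __init__(self, in_ch: int = 256, hidden: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, hidden, kernel_size=1)
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, x):                    # x: (N, C, H, W)
        y = self.conv(x).mean(dim=2)         # pool height -> (N, hidden, W)
        y = y.permute(0, 2, 1)               # width becomes the sequence axis
        y, _ = self.bilstm(y)                # (N, W, 2*hidden)
        return torch.sigmoid(self.fc(y[:, -1]))  # assumed last-step readout

score = SequenceScoringBranch()(torch.randn(2, 256, 7, 7))
```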
7. The Vietnamese scene text detection method of claim 1, wherein
the specific way in which the Re-Score mechanism suppresses non-text targets is: adjusting the number of channels of the fixed-size candidate box feature information with a 1×1 convolution; extracting sequence features with a Bi-LSTM; predicting the sequence score of the target with two fully connected layers; predicting the visual classification score of the target with a convolutional network; multiplying each of the two scores by 0.5 and adding them to obtain the final class confidence; and, with 0.7 as the threshold, rejecting targets whose confidence is below the threshold.
8. The Vietnamese scene text detection method of claim 1, wherein
the specific way of highlighting the edges of the target with the edge attention mechanism EAM is: inputting the fixed-size features into a fully convolutional network composed of several convolution layers and activating the output with a Sigmoid function to obtain the edge information of the target, in which pixels belonging to an edge have values greater than or equal to 0.5 and non-edge pixels have values less than 0.5.
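A sketch of the EAM follows, combining the edge prediction of claim 8 with the attention multiplication of claim 4. The depth and width of the small fully convolutional network are assumptions; the patent fixes only the FCN-plus-Sigmoid structure and the 0.5 edge threshold.

```python
import torch
import torch.nn as nn

class EAM(nn.Module):
    """Sketch of the edge attention mechanism: a small FCN followed by a
    Sigmoid yields an edge contour probability map; pixels >= 0.5 count as
    edges, and multiplying the map with intermediate branch features forms
    the edge-guided attention."""
    def __init__(self, in_ch: int = 256):
        super().__init__()
        self.fcn = nn.Sequential(                # assumed two-layer FCN
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, 1, kernel_size=3, padding=1),
        )

    def forward(self, feat):
        edge_prob = torch.sigmoid(self.fcn(feat))  # edge contour probability map
        attended = feat * edge_prob                # edge-guided attention (claim 4)
        edge_mask = edge_prob >= 0.5               # binarised edge pixels (claim 8)
        return attended, edge_prob, edge_mask

feat = torch.randn(1, 256, 14, 14)
attended, edge_prob, edge_mask = EAM()(feat)
```

Multiplying by the probability map rather than the hard mask keeps the attention differentiable during training; the 0.5 threshold is only needed when a binary edge map is read out.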
CN202210628050.9A 2022-06-06 2022-06-06 Vietnamese scene character detection method based on edge attention guidance Pending CN114898372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210628050.9A CN114898372A (en) 2022-06-06 2022-06-06 Vietnamese scene character detection method based on edge attention guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210628050.9A CN114898372A (en) 2022-06-06 2022-06-06 Vietnamese scene character detection method based on edge attention guidance

Publications (1)

Publication Number Publication Date
CN114898372A true CN114898372A (en) 2022-08-12

Family

ID=82726070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210628050.9A Pending CN114898372A (en) 2022-06-06 2022-06-06 Vietnamese scene character detection method based on edge attention guidance

Country Status (1)

Country Link
CN (1) CN114898372A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546778A (en) * 2022-10-22 2022-12-30 清华大学 Scene text detection method and system based on multi-task learning
CN115546778B (en) * 2022-10-22 2023-06-13 清华大学 Scene text detection method and system based on multitask learning
CN116958981A (en) * 2023-05-31 2023-10-27 广东南方网络信息科技有限公司 Character recognition method and device
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device
CN117132761A (en) * 2023-08-25 2023-11-28 京东方科技集团股份有限公司 Target detection method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN108549893B (en) End-to-end identification method for scene text with any shape
CN110390251B (en) Image and character semantic segmentation method based on multi-neural-network model fusion processing
CN114898372A (en) Vietnamese scene character detection method based on edge attention guidance
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN103984943B (en) A kind of scene text recognition methods based on Bayesian probability frame
CN111046784A (en) Document layout analysis and identification method and device, electronic equipment and storage medium
CN110502655B (en) Method for generating image natural description sentences embedded with scene character information
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN113537227B (en) Structured text recognition method and system
CN110276351B (en) Multi-language scene text detection and identification method
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
CN114187595A (en) Document layout recognition method and system based on fusion of visual features and semantic features
CN113762257B (en) Identification method and device for mark in make-up brand image
Liu et al. Scene text recognition with CNN classifier and WFST-based word labeling
Verma et al. Script identification in natural scene images: a dataset and texture-feature based performance evaluation
CN111898608B (en) Natural scene multi-language character detection method based on boundary prediction
Al Ghamdi A novel approach to printed Arabic optical character recognition
CN112418207A (en) Weak supervision character detection method based on self-attention distillation
CN116912857A (en) Handwriting and printed text separation method and device
Yamazaki et al. Embedding a mathematical OCR module into OCRopus
CN114241470A (en) Natural scene character detection method based on attention mechanism
CN115004261A (en) Text line detection
CN112035670A (en) Multi-modal rumor detection method based on image emotional tendency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination