CN116311290A - Handwriting and printing text detection method and device based on deep learning

Publication number: CN116311290A
Authority: CN (China)
Legal status: Pending
Application number: CN202310149288.8A
Other languages: Chinese (zh)
Inventors: 刘凯航, 黄宇恒, 张华俊, 徐天适, 金晓峰
Current and original assignee: GRG Banking Equipment Co Ltd
Application filed by GRG Banking Equipment Co Ltd, with priority to CN202310149288.8A

Classifications

    • G06V 30/2455: Discrimination between machine-print, hand-print and cursive writing
    • G06V 30/1918: Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a handwriting and printing text detection method and device based on deep learning, and belongs to the technical field of image processing. The handwriting and printing text detection method based on deep learning comprises the following steps: extracting multiple features of the initial text image to obtain a fusion feature map; acquiring a printed text region probability map, a handwritten text region probability map and a text region self-adaptive threshold probability map based on the fusion feature map; and acquiring a target detection area based on at least two of the printed text area probability map, the handwritten text area probability map and the text area self-adaptive threshold probability map, wherein the target detection area comprises a target text and a text box corresponding to the target text. According to the handwriting and printing text detection method based on deep learning, different types of features can be fused in practical application, and the printing text, the handwriting text and the background area can be effectively distinguished, so that the accuracy of final text recognition and the text detection effect are improved.

Description

Handwriting and printing text detection method and device based on deep learning
Technical Field
The application belongs to the technical field of image processing, and particularly relates to a handwriting and printing text detection method and device based on deep learning.
Background
With the development of artificial intelligence, deep learning is increasingly applied in technical fields such as image recognition and optical character recognition, for example license plate recognition, invoice recognition and scanned-document information extraction. Text detection, as an important branch of optical character recognition, directly affects the final recognition result. Text in images varies in direction, shape, aspect ratio, font, color and background; conventional text detection techniques perform poorly on long and densely packed text, cannot distinguish handwritten from printed text information, and have a limited recognition range, which further affects the accuracy of subsequent character recognition.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the handwriting and printing text detection method and device based on deep learning provided by the application can fuse different types of features in practical application, so that text instances can be completely described, printed text, handwritten text and background regions can be effectively distinguished, and the accuracy of final text recognition and the text detection effect are improved.
In a first aspect, the present application provides a method for detecting handwriting and printed text based on deep learning, the method comprising:
extracting multiple features of the initial text image to obtain a fusion feature map;
acquiring a printed text region probability map, a handwritten text region probability map and a text region self-adaptive threshold probability map based on the fusion feature map;
and acquiring a target detection area based on at least two of the printed text area probability map, the handwritten text area probability map and the text area self-adaptive threshold probability map, wherein the target detection area comprises a target text and a text box corresponding to the target text.
According to the handwriting and printing text detection method based on deep learning, the initial text image is subjected to multi-feature extraction to obtain the fusion feature image, then the printing text region probability image, the handwriting text region probability image and the text region self-adaptive threshold probability image are obtained based on the fusion feature image, and then the target detection region is obtained based on at least two of the printing text region probability image, the handwriting text region probability image and the text region self-adaptive threshold probability image, so that different types of features can be fused in practical application, a text instance can be completely described, printing text, handwriting text and a background region can be effectively distinguished, and finally text recognition precision and text detection effect are improved.
According to one embodiment of the application, a method for detecting handwriting and printing text based on deep learning, the method for acquiring a target detection region based on at least two of the printing text region probability map, the handwriting text region probability map and the text region adaptive threshold probability map includes:
acquiring a printed text region binary image based on the printed text region probability image and the text region self-adaptive threshold probability image;
acquiring a handwritten text region binary image based on the handwritten text region probability image and the text region self-adaptive threshold probability image;
and respectively carrying out at least one of connected domain calculation, circumscribed rectangle calculation and expansion processing on the printed text region binary image and the handwritten text region binary image to obtain the target detection region.
According to one embodiment of the method for detecting handwriting and printing text based on deep learning, the method for extracting multiple features of an initial text image to obtain a fusion feature map comprises the following steps:
extracting multiple features from the initial text image to obtain multiple types of features;
and generating the fusion feature map based on the multiple types of features and attention weights corresponding to the features.
According to one embodiment of the method for detecting handwriting and printing text based on deep learning, the method for extracting multiple features of an initial text image to obtain a fusion feature map comprises the following steps:
inputting the initial text image into a feature extraction model to obtain a plurality of types of features corresponding to the initial text image;
performing convolution operation on the multiple types of features to obtain an intermediate feature sequence;
inputting the intermediate feature sequences to a spatial attention mechanism module, and obtaining attention weight sequences corresponding to the intermediate feature sequences output by the spatial attention mechanism module, wherein the attention weight sequences comprise a plurality of attention weights, and the attention weights are in one-to-one correspondence with the plurality of types of features;
and carrying out weighted calculation on the plurality of types of features based on the plurality of attention weights to acquire the fusion feature map.
According to one embodiment of the application, based on the fusion feature map, the handwriting and printing text detection method based on deep learning obtains a printing text region probability map, a handwriting text region probability map and a text region self-adaptive threshold probability map, and the method comprises the following steps:
inputting the fusion feature map to a first channel of a text detection model, and acquiring the printed text region probability map output by the first channel;
the first channel is obtained by training with a sample feature map serving as a sample and a sample printed text region probability map corresponding to the sample feature map serving as a sample label;
inputting the fusion feature map to a second channel of a text detection model, and obtaining the handwritten text region probability map output by the second channel;
the second channel is obtained by training a sample feature map serving as a sample and a sample handwritten text area probability map corresponding to the sample feature map serving as a sample label;
inputting the fusion feature map to a third channel of a text detection model, and acquiring the text region self-adaptive threshold probability map output by the third channel;
the third channel is obtained by training with a sample feature map serving as a sample and a sample text region self-adaptive threshold probability map corresponding to the sample feature map serving as a sample label.
According to the handwriting and printing text detection method based on deep learning of one embodiment of the application,
the sample printed text region probability map is determined by:
extracting the characteristics of the printed text area from the sample characteristic diagram to obtain a first printed text area;
performing reduction processing on the first printed text region to obtain the sample printed text region probability map;
the sample handwritten text area probability map is determined by:
extracting handwriting text region features of the sample feature map to obtain a first handwriting text region;
and carrying out reduction processing on the first handwritten text area to obtain the probability map of the sample handwritten text area.
In a second aspect, the present application provides a handwriting and printed text detection device based on deep learning, the device comprising:
the first processing module is used for extracting multiple features of the initial text image and acquiring a fusion feature image;
the second processing module is used for acquiring a printed text region probability map, a handwritten text region probability map and a text region self-adaptive threshold probability map based on the fusion feature map;
and the third processing module is used for acquiring a target detection area based on at least two of the printed text area probability map, the handwritten text area probability map and the text area self-adaptive threshold probability map, wherein the target detection area comprises a target text and a text box corresponding to the target text.
According to the handwriting and printing text detection device based on deep learning, through multi-feature extraction of the initial text image, the fusion feature image is obtained, then the printing text region probability image, the handwriting text region probability image and the text region self-adaptive threshold probability image are obtained based on the fusion feature image, and then the target detection region is obtained based on at least two of the printing text region probability image, the handwriting text region probability image and the text region self-adaptive threshold probability image, so that different types of features can be fused in practical application, a text instance can be completely described, printing text, handwriting text and a background region can be effectively distinguished, and finally text recognition precision and text detection effect are improved.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the deep learning based handwriting and printed text detection method according to the first aspect when the processor executes the computer program.
In a fourth aspect, the present application provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the deep learning based handwriting and printed text detection method as described in the first aspect above.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the deep learning based handwriting and printed text detection method as described in the first aspect above.
The above technical solutions in the embodiments of the present application have at least one of the following technical effects:
the method comprises the steps of extracting multiple features from an initial text image to obtain a fusion feature image, then obtaining a printed text region probability image, a handwritten text region probability image and a text region self-adaptive threshold probability image based on the fusion feature image, and then obtaining a target detection region based on at least two of the printed text region probability image, the handwritten text region probability image and the text region self-adaptive threshold probability image.
Further, by inputting the initial text image into the feature extraction model, a plurality of types of features corresponding to the initial text image are obtained, so that feature extraction can be enhanced; then, carrying out convolution operation on the multiple types of features to obtain an intermediate feature sequence, inputting the intermediate feature sequence into a spatial attention mechanism module to obtain attention weight sequences corresponding to the intermediate feature sequences, and then carrying out weighted calculation on the multiple types of features to obtain a fusion feature map, so that features of different scales can be fused dynamically, the characterization of the multi-scale change features is enhanced, and a better feature fusion effect is obtained.
Furthermore, the fused feature images are input into the first channel, the second channel and the third channel of the text detection model to respectively obtain the printed text region probability image output by the first channel, the handwritten text region probability image output by the second channel and the text region self-adaptive threshold probability image output by the third channel, and when the text detection model is constructed, the output layer is set to be three prediction probability images, so that classification and detection of handwritten text lines and printed text lines are realized.
Still further, the target detection area is obtained by acquiring the printing/handwriting text area binary image based on the printing/handwriting text area probability image and the text area self-adaptive threshold probability image, and then performing at least one of connected domain calculation, circumscribed rectangle calculation and expansion processing on the printing text area binary image and the handwriting text area binary image respectively, so that the corresponding text line position can be accurately positioned, the printing and the handwriting text box can be accurately distinguished, and the text detection effect is improved.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, wherein:
fig. 1 is a schematic flow chart of a method for detecting handwriting and printed text based on deep learning according to an embodiment of the present application;
FIG. 2 is a first schematic diagram of the handwriting and printing text detection method based on deep learning according to an embodiment of the present application;
FIG. 3 is a second schematic diagram of the handwriting and printing text detection method based on deep learning according to an embodiment of the present application;
FIG. 4 is a third schematic diagram of the handwriting and printing text detection method based on deep learning according to an embodiment of the present application;
FIG. 5 is a fourth schematic diagram of the handwriting and printing text detection method based on deep learning according to an embodiment of the present application;
FIG. 6 is a second flow chart of a method for detecting handwriting and printed text based on deep learning according to an embodiment of the present application;
FIG. 7 is a third flow chart of a method for detecting handwriting and printed text based on deep learning according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a deep learning based handwriting and printed text detection method according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a handwriting and printed text detection device based on deep learning according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application are within the scope of the protection of the present application.
The terms "first", "second" and the like in the description and in the claims are used to distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. It is to be understood that the data so used may be interchanged, where appropriate, so that embodiments of the present application can be implemented in sequences other than those illustrated or described herein; the objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited, e.g., the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally means that the associated objects are in an "or" relationship.
The handwriting and printed text detection method based on deep learning according to the embodiment of the present application is described below with reference to fig. 1 to 8.
It should be noted that, the execution body of the handwriting and printing text detection method based on deep learning may be a server, or may be a handwriting and printing text detection device based on deep learning, or may also be a terminal of a user, including but not limited to a mobile terminal and a non-mobile terminal.
For example, mobile terminals include, but are not limited to, cell phones, PDA smart terminals, tablet computers, vehicle-mounted smart terminals, and the like; non-mobile terminals include, but are not limited to, personal computers (PCs) and the like.
As shown in fig. 1, the method for detecting handwriting and printed text based on deep learning includes: step 110, step 120 and step 130.
And 110, performing multi-feature extraction on the initial text image to obtain a fusion feature map.
In this step, the initial text image may be an initial image input by a user.
The content of the initial text image may include at least one of content corresponding to the print text and content corresponding to the handwritten text.
In the actual execution process, the initial text image may be downloaded from a server, retrieved from a local or cloud server, or received from an opposite terminal.
For example, in some embodiments, printed text images may be obtained by a web crawler, and the final handwritten text images may be synthesized from the printed text images by stitching, overlaying and similar operations, followed by image degradation and the like.
The fusion feature map is an image obtained after multi-feature extraction is carried out on the initial text image, and comprises image information corresponding to a plurality of features.
In the application process, a plurality of features can be extracted from the initial text image through a network model, and parts are selected from the features to be fused; or may be based on mean fusion to obtain a fused feature map; alternatively, the fused feature map may be obtained based on a weighted fusion, which is not limited herein.
The implementation of step 110 is described below by taking the example of obtaining a fusion feature map based on weighted fusion:
in some embodiments, step 110 may include:
extracting multiple features of the initial text image to obtain multiple types of features;
and generating a fusion feature map based on the multiple types of features and the attention weights corresponding to the features.
In this embodiment, the plurality of types of features may include features such as texture, edges, corner points, and semantic information of the original text image.
The multiple types of features may include 4 sets of multi-channel feature maps of different sizes.
The attention weights are used to characterize the importance of each feature to the fused feature map.
According to the handwriting and printing text detection method based on deep learning, the initial text image is subjected to multi-feature extraction to obtain the multiple types of features, and then the fusion feature map is generated based on the multiple types of features and attention weights corresponding to the features, so that different types of features can be fused in practical application, text examples can be completely described, and the accuracy of final text recognition is improved.
In some embodiments, step 110 may further comprise:
inputting the initial text image into a feature extraction model to obtain a plurality of types of features corresponding to the initial text image;
performing convolution operation on the multiple types of features to obtain an intermediate feature sequence;
inputting the intermediate feature sequences into a spatial attention mechanism module, and obtaining attention weight sequences corresponding to the intermediate feature sequences output by the spatial attention mechanism module, wherein the attention weight sequences comprise a plurality of attention weights, and the attention weights are in one-to-one correspondence with a plurality of types of features;
And based on the plurality of attention weights, carrying out weighted calculation on the plurality of types of features to obtain a fusion feature map.
In this embodiment, the feature extraction model may include a residual network ResNet50, or may include a residual network ResNet50 together with a feature pyramid network FPN (Feature Pyramid Networks), with the outputs of the residual network connected to the inputs of the feature pyramid network; FIG. 3 illustrates a network architecture diagram of the residual network ResNet50 and the feature pyramid network FPN.
The residual network ResNet50 can comprise a plurality of residual blocks as shown in FIG. 2, wherein the residual blocks can comprise one or more of the modules of convolution operation, batch normalization processing, pooling operation, and skip connection.
The residual network ResNet50 may extract features such as texture, edges, corner points, and semantic information from the initial text image.
The feature pyramid network FPN may further extract a plurality of types of features for the image output by the residual network ResNet50 to increase the types of the extracted features, thereby further enhancing the effect of feature extraction.
The feature pyramid network FPN may include one or more of the modules of convolution operation, batch normalization processing, and interpolation up-sampling.
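As a concrete illustration, such a backbone can be assembled with torchvision's ready-made ResNet50+FPN helper; the sketch below is an assumption about one possible realization (torchvision >= 0.13 keyword API), not the patent's exact backbone configuration:

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet50 backbone with an FPN on top; returns an ordered dict of
# multi-channel feature maps of different sizes
backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)

image = torch.randn(1, 3, 640, 640)  # a batch with one initial text image
features = backbone(image)           # keys like '0', '1', '2', '3', 'pool'
for name, feat in features.items():
    print(name, tuple(feat.shape))
```

Each returned feature map corresponds to one of the multi-channel feature maps of different sizes described above.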
The intermediate feature sequence includes a plurality of intermediate features.
The spatial attention mechanism module is used for calculating an attention weight sequence corresponding to the intermediate feature sequence.
The attention weight sequence includes a plurality of attention weights, and the plurality of attention weights are in one-to-one correspondence with a plurality of types of features.
It will be appreciated that the attention weights corresponding to the different types of features may be different, and that during actual execution, the user may adjust parameters of the spatial attention mechanism module to correspondingly adjust the attention weights based on the extraction scene or actual demand.
And carrying out weighted calculation on the multiple types of features, namely multiplying the multiple types of features by attention weights corresponding to the features respectively to obtain a fusion feature map.
In the actual implementation process, the features extracted from the initial text image may be denoted X, where X may include N feature maps, that is:

X = {X_0, X_1, ..., X_{N-1}}, X_i ∈ R^(C×H×W)

where N is the number of feature maps, C is the number of channels, H is the height and W is the width of each feature map, and R^(C×H×W) denotes the common shape to which feature maps of different sizes are upsampled.

N may be user-defined, e.g., N may be 4, 5, or 6, etc.; in this application, N may be set to 4.
As shown in FIG. 4, the scaled N feature maps may be concatenated and then passed through a 3×3 convolution layer to obtain an intermediate feature sequence S ∈ R^(C×H×W). Specifically, the calculation of the intermediate feature sequence can be expressed by the following formula:

S = Conv(concat([X_0, X_1, ..., X_{N-1}]))

where S is the intermediate feature sequence, concat is the concatenation operator, Conv is a 3×3 convolution operator, and X_0, ..., X_{N-1} are the N feature maps;
the intermediate feature sequence is input to the spatial attention mechanism module to obtain the attention weight sequence A ∈ R^(N×H×W) output by the spatial attention mechanism module. Specifically, the calculation of the attention weight sequence can be expressed by the following formula:

A = Spatial_Attention(S)

where A is the attention weight sequence, S is the intermediate feature sequence, and Spatial_Attention is the spatial attention mechanism in the Adaptive Scale Fusion module (ASF) shown in FIG. 4;
Specifically, the implementation flow of the spatial attention mechanism is shown in FIG. 5:
a spatial average pooling operation is applied to the intermediate feature sequence of input dimension C×H×W, reducing it to data of dimension 1×H×W; this data is passed through a Conv+ReLU layer with a 3×3 convolution kernel and then through a 1×1 Conv+Sigmoid layer, yielding data of the same 1×H×W dimension;
the data obtained by the above operations is added to the intermediate feature sequence of original dimension C×H×W, an expand-dim operation is performed on the first dimension, and a 3×3 Conv+Sigmoid layer is applied to obtain the attention weight sequence corresponding to the intermediate feature sequence, with dimension N×1×H×W.
The attention weight sequence is divided into N parts E_0, ..., E_{N-1} along the channel dimension, and the features of each type are weighted by the corresponding attention weights to obtain the fusion feature map F ∈ R^(N×C×H×W). Specifically, the calculation of the fusion feature map can be expressed by the following formula:

F = concat([E_0·X_0, E_1·X_1, ..., E_{N-1}·X_{N-1}])

where F is the fusion feature map, concat is the concatenation operator, X_0, ..., X_{N-1} are the N feature maps, and E_0, ..., E_{N-1} are the attention weights obtained by splitting A.
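Putting the formulas above together, the ASF step can be sketched in PyTorch roughly as follows; the module structure follows FIGS. 4 and 5 as described, but all class names, variable names and channel widths are illustrative assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention branch of FIG. 5 (names and widths are assumptions)."""
    def __init__(self, channels: int, n_maps: int = 4):
        super().__init__()
        # 1xHxW path: 3x3 Conv+ReLU followed by 1x1 Conv+Sigmoid
        self.reduce = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(1, 1, kernel_size=1), nn.Sigmoid(),
        )
        # final 3x3 Conv+Sigmoid producing N attention planes
        self.expand = nn.Sequential(
            nn.Conv2d(channels, n_maps, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, s):
        # s: (B, C, H, W) intermediate feature sequence S
        a = s.mean(dim=1, keepdim=True)   # spatial average pooling -> (B, 1, H, W)
        a = self.reduce(a)                # same 1xHxW dimension
        return self.expand(a + s)         # broadcast add, then -> (B, N, H, W)

class AdaptiveScaleFusion(nn.Module):
    def __init__(self, channels: int, n_maps: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(n_maps * channels, channels, kernel_size=3, padding=1)
        self.attn = SpatialAttention(channels, n_maps)

    def forward(self, xs):
        # xs: list of N feature maps, each (B, C, H, W), already scaled to one size
        s = self.conv(torch.cat(xs, dim=1))   # S = Conv(concat([X_0, ..., X_{N-1}]))
        a = self.attn(s)                      # A = Spatial_Attention(S)
        # F = concat([E_0*X_0, ..., E_{N-1}*X_{N-1}])
        return torch.cat([a[:, i:i + 1] * x for i, x in enumerate(xs)], dim=1)
```

Here a[:, i:i+1] plays the role of E_i: one attention plane per feature map, broadcast over that map's channels.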
The inventors discovered during development that features of different scales have different receptive fields, and that different types of features focus on describing text instances of different scales. For example, shallow or large-size features can perceive the details of small text instances but cannot capture a global view of large text instances, while deep or small-size features are the opposite. The prior art includes semantic segmentation based on feature pyramids or U-Net structures.
In the method, the fusion feature map is acquired based on the adaptive spatial attention scale fusion module ASF, so that features of different scales can be fused dynamically.
According to the handwriting and printing text detection method based on deep learning, which is provided by the embodiment of the application, the initial text image is input into the feature extraction model to obtain a plurality of types of features corresponding to the initial text image, so that feature extraction can be enhanced; then, carrying out convolution operation on the multiple types of features to obtain an intermediate feature sequence, inputting the intermediate feature sequence into a spatial attention mechanism module to obtain attention weight sequences corresponding to the intermediate feature sequences, and then carrying out weighted calculation on the multiple types of features to obtain a fusion feature map, so that features of different scales can be fused dynamically, the characterization of the multi-scale change features is enhanced, and a better feature fusion effect is obtained.
Step 120, based on the fusion feature map, obtaining a printed text region probability map, a handwritten text region probability map and a text region self-adaptive threshold probability map.
In this step, the print text region probability map includes text information of the print text.
The handwritten text area probability map comprises text information of the handwritten text.
The self-adaptive threshold value is used for binarizing the printed text region probability map and the handwritten text region probability map, and the local binarization threshold value corresponding to the printed text region probability map and the handwritten text region probability map can be self-adaptively adjusted according to the characteristic information such as brightness, contrast, texture and the like of the local image region.
In the actual execution process, the printed text region probability map, the handwritten text region probability map and the text region self-adaptive threshold probability map can be respectively acquired through a neural network model.
Referring now to fig. 6, in some embodiments, step 120 may include:
inputting the fusion feature map into a first channel of a text detection model, and acquiring a printed text region probability map output by the first channel;
the first channel is obtained by training a sample feature map serving as a sample and a sample printing text region probability map corresponding to the sample feature map serving as a sample label;
Inputting the fusion feature map into a second channel of the text detection model, and obtaining a handwritten text region probability map output by the second channel;
the second channel is obtained by training a sample feature map serving as a sample and a sample handwritten text area probability map corresponding to the sample feature map serving as a sample label;
inputting the fusion feature map into a third channel of the text detection model, and acquiring a text region self-adaptive threshold probability map output by the third channel;
the third channel is obtained by training a sample feature map serving as a sample and a sample text area self-adaptive threshold probability map corresponding to the sample feature map serving as a sample label.
In this embodiment, the text detection model is used to classify lines of handwritten text and lines of printed text.
The text detection model may include a first channel, a second channel, and a third channel, which may be disposed in parallel and may be processed in parallel in actual execution.
The first channel is used for receiving the fusion feature map, extracting the printing text features of the fusion feature map, and outputting a probability map of the printing text region;
the second channel is used for receiving the fusion feature map, extracting the handwriting text features of the fusion feature map, and outputting a handwriting text region probability map;
The third channel is used for receiving the fusion characteristic map so as to output a text region self-adaptive threshold probability map.
The sample feature map is a fusion feature map obtained after multi-feature extraction of the sample initial text image.
The sample printed text region probability map is a printed text region probability map corresponding to the sample feature map.
The sample handwritten text area probability map is a handwritten text area probability map corresponding to the sample feature map.
The sample text region adaptive threshold probability map is a text region adaptive threshold probability map corresponding to the sample feature map.
In the application process, the first channel, the second channel and the third channel in the text detection model are trained in advance, and the output layer of the text detection model may include convolution layers; specifically, it may include a 3×3 Conv+BN+ReLU layer, a 2×2 ConvTranspose+BN+ReLU layer, and a 2×2 ConvTranspose+BN+Sigmoid layer.
The output of the 3×3 Conv+BN+ReLU layer is connected to the input of the 2×2 ConvTranspose+BN+ReLU layer, and the output of the 2×2 ConvTranspose+BN+ReLU layer is connected to the input of the 2×2 ConvTranspose+BN+Sigmoid layer.
And carrying out the convolution calculation operation on the fusion feature map to respectively obtain a printed text region probability map output by the first channel, a handwritten text region probability map output by the second channel and a text region self-adaptive threshold probability map output by the third channel.
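For illustration, one such output channel could be sketched as below, assuming the layer sequence just described and a 256-channel fusion feature map; the channel widths are assumptions:

```python
import torch.nn as nn

def make_head(in_ch: int = 256) -> nn.Sequential:
    """One prediction channel: 3x3 Conv+BN+ReLU -> 2x2 ConvTranspose+BN+ReLU
    -> 2x2 ConvTranspose+BN+Sigmoid; the two stride-2 transposed convolutions
    take the 1/4-size fusion feature map back to full image resolution."""
    mid = in_ch // 4
    return nn.Sequential(
        nn.Conv2d(in_ch, mid, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(mid, mid, kernel_size=2, stride=2),
        nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(mid, 1, kernel_size=2, stride=2),
        nn.BatchNorm2d(1), nn.Sigmoid(),
    )

# three parallel channels on the fusion feature map: printed-text probability,
# handwritten-text probability, and text region adaptive threshold probability
print_head, hand_head, thresh_head = make_head(), make_head(), make_head()
```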
The training process of the first channel, the second channel and the third channel in the text detection model is specifically described below.
In the training process, the first channel and the second channel may be trained using a binary cross entropy Loss function (BCE Loss) as a target Loss function, respectively, to output a sample printed text region probability map and a sample handwritten text region probability map, where BCE Loss may be represented by the following formula:
L(pt, target) = -w × (target × ln(pt) + (1 - target) × ln(1 - pt))
where L is the final loss value output by the binary cross entropy loss function, pt is the output result after model reasoning, target is the label of the dataset, and w is the weight value, for example, w may be 1.
The third channel may be trained using an L1 distance function (L1 Loss) to output the sample text region self-adaptive threshold probability map, where L1 Loss may be represented by the following formula:

Loss = (1/n) × Σ_{i=1}^{n} |y_i - f(x_i)|

where Loss is the final loss value output by the L1 distance function, x_i is the i-th data sample (i.e. the sample fusion feature map), y_i is the label of the data sample (i.e. the sample text region self-adaptive threshold probability map), n is the total number of data samples, i is the index of the current data sample, and f(x_i) is the predicted value for the i-th data sample.
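A minimal sketch of how these two losses might be applied during training, using PyTorch's built-in BCELoss and L1Loss (the equal weighting and tensor names are assumptions):

```python
import torch.nn as nn

bce = nn.BCELoss()  # binary cross entropy for the two region-probability channels
l1 = nn.L1Loss()    # L1 distance for the adaptive-threshold channel

def detection_loss(print_prob, hand_prob, thresh_pred,
                   print_gt, hand_gt, thresh_gt):
    # all tensors: (B, 1, H, W); probabilities already passed through Sigmoid
    return (bce(print_prob, print_gt)
            + bce(hand_prob, hand_gt)
            + l1(thresh_pred, thresh_gt))
```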
In the training process, the ratio of printed text region probability maps to handwritten text region probability maps in the target data set can be set to 1:1, and the data set is divided into a training set and a test set at a target proportion, which may be, for example, 8:2 or 9:1; this is not limited here.
According to the handwriting and printed text detection method based on deep learning, the fused feature images are input into the first channel, the second channel and the third channel of the text detection model, the printed text region probability image output by the first channel, the handwritten text region probability image output by the second channel and the text region self-adaptive threshold probability image output by the third channel are respectively obtained, and when the text detection model is constructed, the output layer is set to be three prediction probability images, so that classification and detection of handwriting text lines and printed text lines are realized.
In some embodiments, the sample printed text region probability map is determined by:
extracting the characteristics of the printed text region from the sample characteristic diagram to obtain a first printed text region;
performing reduction processing on the first printed text region to obtain a sample printed text region probability map;
the sample handwritten text area probability map is determined by:
extracting handwriting text region features of the sample feature map to obtain a first handwriting text region;
and carrying out reduction processing on the first handwritten text area to obtain a sample handwritten text area probability map.
In this embodiment, the first printed text region is obtained after the printed text region feature extraction of the sample feature map, and includes printed text information therein.
The first handwritten text area is obtained after the handwritten text area feature extraction of the sample feature map, and the first handwritten text area contains handwritten text information.
In the actual implementation process, a binary map of the same size as the sample feature map may be drawn, and the first printed/handwritten text region is shrunk by a user-defined ratio, which may be, for example, 0.4, 0.5 or 0.6; in the present application the ratio may be 0.4. The shrunken region is then drawn onto the binary map to obtain the sample printed/handwritten text region probability map.
According to the handwriting and printed text detection method based on deep learning, the first printing/handwritten text area is obtained by carrying out printing/handwritten text area feature extraction on the sample feature map, and then the first printing/handwritten text area is reduced to obtain the sample printing/handwritten text area probability map, so that text adhesion can be prevented in the actual execution process, the compact text detection effect is improved, and the detection range is widened.
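Label generation of this shrink-and-draw kind is commonly implemented with polygon offsetting. The sketch below follows the DBNet-style shrink rule D = A(1 - r^2)/L with r = 0.4 and assumes the pyclipper and OpenCV packages; it is an illustration, not the patent's exact procedure:

```python
import cv2
import numpy as np
import pyclipper

def shrink_region_map(polygons, h, w, ratio=0.4):
    """Draw each text polygon shrunk inward, giving a {0, 1} region label map.

    The inward offset D = A * (1 - ratio**2) / L uses the polygon's area A and
    perimeter L (the DBNet-style shrink rule, an assumption here).
    """
    gt = np.zeros((h, w), dtype=np.float32)
    for poly in polygons:  # poly: (K, 2) array of vertex coordinates
        pts = np.asarray(poly, dtype=np.float32)
        area = cv2.contourArea(pts)
        length = cv2.arcLength(pts, closed=True)
        if length == 0:
            continue
        d = area * (1 - ratio ** 2) / length
        pco = pyclipper.PyclipperOffset()
        pco.AddPath([tuple(p) for p in poly], pyclipper.JT_ROUND,
                    pyclipper.ET_CLOSEDPOLYGON)
        for shrunk in pco.Execute(-d):  # negative offset shrinks the polygon
            cv2.fillPoly(gt, [np.asarray(shrunk, dtype=np.int32)], 1.0)
    return gt
```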
The training process of the text detection model is specifically described below.
As shown in fig. 6, an initial text image is first input to a residual network res net50, and a plurality of types of features corresponding to the initial text image are output;
inputting the multiple types of features into a feature pyramid network FPN for processing, and extracting the features corresponding to the initial text image again;
then all feature maps are up-sampled to 1/4 of the original image size, the rescaled feature maps are input to the adaptive scale fusion module ASF, and the fusion feature map output by the ASF module is obtained;
convolution and deconvolution operations are then performed on the fusion feature map to obtain a feature map with 3 channels, where the first channel outputs the printed text region probability map, the second channel outputs the handwritten text region probability map, and the third channel outputs the text region self-adaptive threshold probability map.
In the application process, the loss of the training result can be calculated and back-propagated, and the model is corrected based on the updated training parameters, so as to obtain a more accurate model and improve the accuracy of final text detection.
And 130, acquiring a target detection area based on at least two of the printed text area probability map, the handwritten text area probability map and the text area self-adaptive threshold probability map, wherein the target detection area comprises a target text and a text box corresponding to the target text.
In this step, the target detection area is an area that is ultimately used to detect the printed text and the handwritten text.
The target text may include one or more of printed text and handwritten text.
The text box corresponding to the target text is a circumscribed rectangular box of the target text.
During development, the inventors found that one class of related techniques arranges a large number of candidate boxes, uses them as sliding windows to traverse the feature map of the image after the convolution operation, and then maps the located regions on the feature map back to text box regions of the original image;
another class of related techniques obtains a feature map based on connected domain calculation and convolution operations, binarizes the feature map to obtain a binary map, and outputs the final text box region information through post-processing.
In the application, firstly, multi-feature extraction is carried out on an initial text image to obtain a fusion feature image, and different types of features can be fused in practical application, so that a text example is completely described, and the accuracy of final text recognition is improved;
then based on the fusion feature map, acquiring a printed text region probability map, a handwritten text region probability map and a text region self-adaptive threshold probability map, so that a target detection region can be acquired conveniently later, and classification and detection of handwritten text lines and printed text lines are realized;
finally, based on at least two of the printed text region probability map, the handwritten text region probability map and the text region self-adaptive threshold probability map, a target detection region is obtained, self-adaptive threshold map training can be realized, the subjectivity problem is avoided, and the final detection effect is improved; in addition, text boxes corresponding to the printing text area and the handwriting text area can be automatically generated, the image does not need to be traversed, the calculation complexity is remarkably reduced, and the calculation resources are saved; the subjectivity of manually setting the candidate frames is avoided, so that the final text detection precision is improved, and the universality is higher.
According to the handwriting and printing text detection method based on deep learning, the initial text image is subjected to multi-feature extraction to obtain the fusion feature image, then the printing text region probability image, the handwriting text region probability image and the text region self-adaptive threshold probability image are obtained based on the fusion feature image, and then the target detection region is obtained based on at least two of the printing text region probability image, the handwriting text region probability image and the text region self-adaptive threshold probability image, so that different types of features can be fused in practical application, a text instance can be completely described, printing text, handwriting text and a background region can be effectively distinguished, and finally text recognition precision and text detection effect are improved.
Referring now to fig. 7, in some embodiments, step 130 may include:
acquiring a printed text region binary image based on the printed text region probability image and the text region self-adaptive threshold probability image;
acquiring a handwritten text region binary image based on the handwritten text region probability image and the text region self-adaptive threshold probability image;
and respectively carrying out at least one of connected domain calculation, circumscribed rectangle calculation and expansion processing on the printed text region binary image and the handwritten text region binary image to obtain a target detection region.
In this embodiment, the printed text region binary image is a binary image obtained after binarizing the printed text region probability image and the text region adaptive threshold probability image.
The binary image of the handwritten text area is obtained after binarization processing is carried out on the probability image of the handwritten text area and the self-adaptive threshold probability image of the text area.
The connected domain is a set of adjacent pixels with the same pixel value.
The target detection area includes a target text and a text box corresponding to the target text, as shown in fig. 8, A1 and A2 are printed text information, A3 is handwritten text information, and a rectangular box at the periphery of the text information is the text box corresponding to the target text.
In the application process, the pixel value of the printed/handwritten text region in the printed/handwritten text region binary map is 1 and the pixel value of the background region is 0; connected pixels are then assigned the same label based on the connected domain calculation.
And carrying out circumscribed rectangle calculation on the printing/handwriting text area binary diagram based on Matlab, python, C++ and other programming languages so as to obtain a text box corresponding to the printing/handwriting text area in the printing/handwriting text area binary diagram.
The expansion process is used to expand the reduced print/handwritten text area to return to the normal text box size.
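A compact OpenCV sketch of this post-processing chain (expansion, connected domain calculation, circumscribed rectangle calculation); the dilation-based expansion and its kernel size are assumptions:

```python
import cv2
import numpy as np

def boxes_from_binary(binary_map, expand_px=4):
    """binary_map: (H, W) uint8 {0, 1} printed- or handwritten-region binary map."""
    # expansion processing: dilate to undo the shrink applied at label time
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT,
                                       (2 * expand_px + 1, 2 * expand_px + 1))
    dilated = cv2.dilate(binary_map, kernel)
    # connected domain calculation: label each text region (label 0 = background)
    n_labels, labels = cv2.connectedComponents(dilated)
    boxes = []
    for lab in range(1, n_labels):
        ys, xs = np.where(labels == lab)
        pts = np.stack([xs, ys], axis=1).astype(np.int32)
        boxes.append(cv2.boundingRect(pts))  # circumscribed rectangle (x, y, w, h)
    return boxes
```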
In some embodiments, obtaining the printed text region binary map based on the printed text region probability map and the text region self-adaptive threshold probability map may further include: performing a differentiable binarization operation on the printed text region probability map and the text region self-adaptive threshold probability map to obtain a printed text shrink prediction map;
and binarizing the printed text shrink prediction map to obtain the printed text region binary map.
In some embodiments, obtaining the handwritten text area binary map based on the handwritten text area probability map and the text area adaptive threshold probability map may further include:
performing a differentiable binarization operation on the handwritten text region probability map and the text region self-adaptive threshold probability map to obtain a handwritten text shrink prediction map;
and binarizing the handwritten text shrink prediction map to obtain the handwritten text region binary map.
In the method, based on the differentiable binarization operation, the threshold value at each position of the printed text region probability map and the handwritten text region probability map can be adaptively predicted, threshold adaptation of the two probability maps can be realized, and the subjectivity problem is avoided.
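The text does not spell out the differentiable binarization formula; a common formulation from the DB literature, given here as an assumption rather than the patent's exact operation, is the approximate step function B = sigmoid(k(P - T)):

```python
import torch

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate step function B = sigmoid(k * (P - T)).

    prob_map:   printed- or handwritten-text region probability map P
    thresh_map: text region adaptive threshold probability map T
    A large k pushes the output toward hard 0/1 values while keeping the
    operation differentiable for end-to-end training.
    """
    return torch.sigmoid(k * (prob_map - thresh_map))
```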
In some embodiments, performing the differentiable binarization operation on the printed text region probability map and the text region self-adaptive threshold probability map to obtain the printed text shrink prediction map may include:
inputting the printed text region probability map and the text region self-adaptive threshold probability map to a first layer of the target neural network model, and obtaining the printed text shrink prediction map output by the first layer.
In this embodiment, the first layer of the target neural network model is trained in advance.
The first layer of the target neural network can be obtained by training with a sample printed text region probability map and a sample text region self-adaptive threshold probability map as samples and a sample printed text shrink prediction map as a sample label.
In some embodiments, performing the differentiable binarization operation on the handwritten text region probability map and the text region self-adaptive threshold probability map to obtain the handwritten text shrink prediction map may include:
inputting the handwritten text region probability map and the text region self-adaptive threshold probability map to a second layer of the target neural network model, and obtaining the handwritten text shrink prediction map output by the second layer.
In this embodiment, the second layer of the target neural network model is trained in advance.
The second layer of the target neural network can be obtained by training with a sample handwritten text region probability map and a sample text region self-adaptive threshold probability map as samples and a sample handwritten text shrink prediction map as a sample label.
The training process of the first layer and the second layer of the target neural network is specifically described below.
In the training process, the first layer and the second layer of the target neural network may be trained using the Dice loss function (Dice Loss) to output the sample printed text shrink prediction map and the sample handwritten text shrink prediction map, respectively, where Dice Loss may be represented by the following formula:

DiceLoss = 1 - 2|X ∩ Y| / (|X| + |Y|)

where DiceLoss is the final loss value output by the Dice loss function, X is the set of pixels labeled in the ground-truth segmentation (i.e. the sample label), and Y is the set of pixels predicted by the model (i.e. the output corresponding to the first or second layer of the target neural network).
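A soft version of this Dice loss can be sketched as follows (the smoothing constant eps is an assumption added for numerical stability):

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """DiceLoss = 1 - 2|X ∩ Y| / (|X| + |Y|) on soft probability maps."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```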
In the actual application process, the printed text region probability map and the text region self-adaptive threshold probability map can be input to the first layer of the target neural network model to obtain the printed text shrink probability map output by the first layer;
the handwritten text region probability map and the text region self-adaptive threshold probability map can be input to the second layer of the target neural network model to obtain the handwritten text shrink probability map output by the second layer;
and then the printed text shrink probability map and the handwritten text shrink probability map are binarized respectively to obtain the printed text region binary map and the handwritten text region binary map.
According to the handwriting and printed text detection method based on deep learning, the printed/handwritten text region binary image is obtained based on the printed/handwritten text region probability image and the text region self-adaptive threshold probability image, then at least one of connected domain calculation, circumscribed rectangle calculation and expansion processing is carried out on the printed text region binary image and the handwritten text region binary image respectively, a target detection region is obtained, the position of a corresponding text line can be accurately positioned, the printed text box and the handwritten text box can be accurately distinguished, and the text detection effect is improved.
In some embodiments, as shown in fig. 7, the fusion feature map may be input into a first channel, a second channel, and a third channel of the text detection model, respectively, to obtain a printed text region probability map output by the first channel, a handwritten text region probability map output by the second channel, and a text region adaptive threshold probability map output by the third channel;
then the printed text region probability map and the text region self-adaptive threshold probability map are input to the first layer of the target neural network model to obtain the printed text shrink probability map output by the first layer;
the handwritten text region probability map and the text region self-adaptive threshold probability map are input to the second layer of the target neural network model to obtain the handwritten text shrink probability map output by the second layer;
the printed text shrink probability map and the handwritten text shrink probability map are then binarized respectively to obtain the printed text region binary map and the handwritten text region binary map;
and finally, at least one of connected domain calculation, circumscribed rectangle calculation and expansion processing is performed on the printed text region binary map and the handwritten text region binary map to obtain the target detection region.
The deep learning-based handwriting and printing text detection device provided by the application is described below, and the deep learning-based handwriting and printing text detection device described below and the deep learning-based handwriting and printing text detection method described above can be referred to correspondingly.
In the handwriting and printing text detection method based on deep learning provided by the embodiments of the present application, the execution subject may be a handwriting and printing text detection device based on deep learning. In the embodiments of the present application, the device is described by taking, as an example, the case where the handwriting and printing text detection device based on deep learning executes the handwriting and printing text detection method based on deep learning.
An embodiment of the present application further provides a deep learning based handwriting and printed text detection device.
As shown in fig. 9, the deep learning based handwriting and printed text detection device includes: a first processing module 910, a second processing module 920, and a third processing module 930.
A first processing module 910, configured to perform multi-feature extraction on an initial text image to obtain a fusion feature map;
the second processing module 920 is configured to obtain a printed text region probability map, a handwritten text region probability map, and a text region adaptive threshold probability map based on the fused feature map;
a third processing module 930, configured to obtain a target detection area based on at least two of the printed text region probability map, the handwritten text region probability map, and the text region adaptive threshold probability map, where the target detection area includes a target text and a text box corresponding to the target text.
According to the deep learning based handwriting and printed text detection device, multi-feature extraction is performed on the initial text image to obtain the fusion feature map; the printed text region probability map, the handwritten text region probability map and the text region adaptive threshold probability map are then obtained from the fusion feature map; and the target detection region is obtained based on at least two of these maps. Different types of features can therefore be fused in practical applications, a text instance can be completely described, printed text, handwritten text and background regions can be effectively distinguished, and text recognition precision and text detection effect are ultimately improved.
In some embodiments, the third processing module 930 may also be configured to:
acquire a printed text region binary map based on the printed text region probability map and the text region adaptive threshold probability map;
acquire a handwritten text region binary map based on the handwritten text region probability map and the text region adaptive threshold probability map;
and perform at least one of connected domain calculation, circumscribed rectangle calculation and expansion processing on the printed text region binary map and the handwritten text region binary map respectively, to obtain the target detection region.
In some embodiments, the first processing module 910 may also be configured to:
perform multi-feature extraction on the initial text image to obtain multiple types of features;
and generate a fusion feature map based on the multiple types of features and the attention weight corresponding to each feature.
In some embodiments, the first processing module 910 may also be configured to:
input the initial text image into a feature extraction model to obtain multiple types of features corresponding to the initial text image;
perform a convolution operation on the multiple types of features to obtain an intermediate feature sequence;
input the intermediate feature sequence to a spatial attention mechanism module, and obtain the attention weight sequence corresponding to the intermediate feature sequence output by the spatial attention mechanism module, wherein the attention weight sequence includes a plurality of attention weights in one-to-one correspondence with the multiple types of features;
and perform weighted calculation on the multiple types of features based on the plurality of attention weights to obtain the fusion feature map.
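A minimal sketch of this attention-weighted fusion, assuming PyTorch, might look as follows. The module name SpatialAttentionFusion is hypothetical, and it is assumed that the backbone features have already been resized to a common resolution and channel count, details the application leaves open.

```python
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    """Fuse N feature maps of equal shape with per-location attention weights."""

    def __init__(self, num_features: int, channels: int):
        super().__init__()
        # 1x1 convolution producing one attention logit per feature map per pixel
        self.attn = nn.Conv2d(num_features * channels, num_features, kernel_size=1)

    def forward(self, feats):  # feats: list of N tensors, each (B, C, H, W)
        x = torch.cat(feats, dim=1)                   # (B, N*C, H, W)
        weights = torch.softmax(self.attn(x), dim=1)  # (B, N, H, W)
        # weighted sum: each feature map scaled by its spatial attention weight
        fused = sum(w.unsqueeze(1) * f
                    for w, f in zip(weights.unbind(dim=1), feats))
        return fused                                  # (B, C, H, W)
```

Here the attention weight sequence corresponds to the softmax output: one weight per feature type at every spatial location.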
In some embodiments, the second processing module 920 may also be configured to:
input the fusion feature map into a first channel of a text detection model, and acquire the printed text region probability map output by the first channel;
the first channel is obtained through training with a sample feature map as a sample and the sample printed text region probability map corresponding to the sample feature map as a sample label;
input the fusion feature map into a second channel of the text detection model, and acquire the handwritten text region probability map output by the second channel;
the second channel is obtained through training with the sample feature map as a sample and the sample handwritten text region probability map corresponding to the sample feature map as a sample label;
input the fusion feature map into a third channel of the text detection model, and acquire the text region adaptive threshold probability map output by the third channel;
the third channel is obtained through training with the sample feature map as a sample and the sample text region adaptive threshold probability map corresponding to the sample feature map as a sample label.
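Structurally, the three channels can be read as three parallel prediction heads over the shared fusion feature map. The PyTorch sketch below assumes sigmoid-activated single-map heads and a 256-channel fusion feature map; both are assumptions, since the application does not specify the head architecture.

```python
import torch.nn as nn

def make_head(in_channels: int) -> nn.Sequential:
    """One prediction channel: conv block followed by a sigmoid probability map.
    Upsampling back to full image resolution is omitted for brevity."""
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels // 4, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(in_channels // 4),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels // 4, 1, kernel_size=1),
        nn.Sigmoid(),
    )

class ThreeChannelDetectionHead(nn.Module):
    """Predicts the printed text, handwritten text and adaptive threshold
    probability maps from the shared fusion feature map."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.printed_head = make_head(in_channels)      # first channel
        self.handwritten_head = make_head(in_channels)  # second channel
        self.threshold_head = make_head(in_channels)    # third channel

    def forward(self, fused):  # fused: (B, in_channels, H, W)
        return (self.printed_head(fused),
                self.handwritten_head(fused),
                self.threshold_head(fused))
```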
In some embodiments, the deep learning based handwriting and printed text detection apparatus may further include a fourth processing module for:
determining the sample printed text region probability map by:
extracting printed text region features from the sample feature map to obtain a first printed text region;
performing reduction (inward shrinking) processing on the first printed text region to obtain the sample printed text region probability map;
and determining the sample handwritten text region probability map by:
extracting handwritten text region features from the sample feature map to obtain a first handwritten text region;
and performing reduction (inward shrinking) processing on the first handwritten text region to obtain the sample handwritten text region probability map.
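The reduction processing used to build these sample labels amounts to shrinking each annotated text polygon inward. The application does not give a shrink formula, so the sketch below assumes the common offset heuristic D = A * (1 - r^2) / L (polygon area A, perimeter L, shrink ratio r, here the assumed value r = 0.4), implemented with the shapely and pyclipper libraries.

```python
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def shrink_polygon(points: np.ndarray, ratio: float = 0.4) -> np.ndarray:
    """Shrink an annotated text polygon inward to build a probability-map label.

    points -- (N, 2) array of polygon vertices
    ratio  -- shrink ratio r in the offset D = A * (1 - r**2) / L
    """
    poly = Polygon(points)
    distance = poly.area * (1 - ratio ** 2) / poly.length  # .length is the perimeter
    offsetter = pyclipper.PyclipperOffset()
    offsetter.AddPath(points.astype(np.int64).tolist(),
                      pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    shrunk = offsetter.Execute(-distance)  # negative offset shrinks inward
    return np.array(shrunk[0]) if shrunk else points
```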
The deep learning based handwriting and printed text detection device in the embodiments of the present application may be an electronic device, or may be a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. By way of example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a mobile internet device (Mobile Internet Device, MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), etc., and may also be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (PC), a television (TV), a teller machine or a self-service machine, etc.; the embodiments of the present application are not specifically limited in this respect.
The deep learning based handwriting and printed text detection device in the embodiments of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
The deep learning based handwriting and printed text detection device provided in the embodiments of the present application can implement each process implemented by the method embodiments of figs. 1 to 8; to avoid repetition, details are not repeated here.
In some embodiments, as shown in fig. 10, an embodiment of the present application further provides an electronic device 1000, including a processor 1001, a memory 1002, and a computer program stored in the memory 1002 and executable on the processor 1001. When executed by the processor 1001, the program implements each process of the above embodiments of the deep learning based handwriting and printed text detection method and can achieve the same technical effects; to avoid repetition, details are not repeated here.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device described above.
In another aspect, the present application further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute each process of the above embodiments of the deep learning based handwriting and printed text detection method and achieve the same technical effects; to avoid repetition, details are not repeated here.
In still another aspect, the present application further provides a non-transitory computer readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the above embodiments of the deep learning based handwriting and printed text detection method and achieves the same technical effects; to avoid repetition, details are not repeated here.
In still another aspect, an embodiment of the present application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement each process of the above embodiments of the deep learning based handwriting and printed text detection method and achieve the same technical effects; to avoid repetition, details are not repeated here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-level chips, chip systems, or system-on-chip (SoC) chips, etc.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A deep learning based handwriting and printed text detection method, characterized by comprising the following steps:
performing multi-feature extraction on an initial text image to obtain a fusion feature map;
acquiring a printed text region probability map, a handwritten text region probability map and a text region adaptive threshold probability map based on the fusion feature map;
and acquiring a target detection region based on at least two of the printed text region probability map, the handwritten text region probability map and the text region adaptive threshold probability map, wherein the target detection region comprises a target text and a text box corresponding to the target text.
2. The deep learning based handwriting and printed text detection method according to claim 1, wherein the acquiring a target detection region based on at least two of the printed text region probability map, the handwritten text region probability map and the text region adaptive threshold probability map comprises:
acquiring a printed text region binary map based on the printed text region probability map and the text region adaptive threshold probability map;
acquiring a handwritten text region binary map based on the handwritten text region probability map and the text region adaptive threshold probability map;
and performing at least one of connected domain calculation, circumscribed rectangle calculation and expansion processing on the printed text region binary map and the handwritten text region binary map respectively, to obtain the target detection region.
3. The deep learning based handwriting and printed text detection method according to claim 1, wherein the performing multi-feature extraction on an initial text image to obtain a fusion feature map comprises:
performing multi-feature extraction on the initial text image to obtain multiple types of features;
and generating the fusion feature map based on the multiple types of features and the attention weight corresponding to each feature.
4. The deep learning based handwriting and printed text detection method according to any one of claims 1-3, wherein the performing multi-feature extraction on an initial text image to obtain a fusion feature map comprises:
inputting the initial text image into a feature extraction model to obtain multiple types of features corresponding to the initial text image;
performing a convolution operation on the multiple types of features to obtain an intermediate feature sequence;
inputting the intermediate feature sequence to a spatial attention mechanism module, and obtaining an attention weight sequence corresponding to the intermediate feature sequence output by the spatial attention mechanism module, wherein the attention weight sequence comprises a plurality of attention weights in one-to-one correspondence with the multiple types of features;
and performing weighted calculation on the multiple types of features based on the plurality of attention weights to obtain the fusion feature map.
5. The deep learning based handwriting and printed text detection method according to any one of claims 1-3, wherein the acquiring a printed text region probability map, a handwritten text region probability map and a text region adaptive threshold probability map based on the fusion feature map comprises:
inputting the fusion feature map to a first channel of a text detection model, and acquiring the printed text region probability map output by the first channel;
wherein the first channel is obtained through training with a sample feature map as a sample and a sample printed text region probability map corresponding to the sample feature map as a sample label;
inputting the fusion feature map to a second channel of the text detection model, and acquiring the handwritten text region probability map output by the second channel;
wherein the second channel is obtained through training with the sample feature map as a sample and a sample handwritten text region probability map corresponding to the sample feature map as a sample label;
inputting the fusion feature map to a third channel of the text detection model, and acquiring the text region adaptive threshold probability map output by the third channel;
wherein the third channel is obtained through training with the sample feature map as a sample and a sample text region adaptive threshold probability map corresponding to the sample feature map as a sample label.
6. The deep learning based handwriting and printed text detection method according to claim 5, wherein
the sample printed text region probability map is determined by:
extracting printed text region features from the sample feature map to obtain a first printed text region;
performing reduction processing on the first printed text region to obtain the sample printed text region probability map;
and the sample handwritten text region probability map is determined by:
extracting handwritten text region features from the sample feature map to obtain a first handwritten text region;
and performing reduction processing on the first handwritten text region to obtain the sample handwritten text region probability map.
7. A deep learning based handwriting and printed text detection device, characterized by comprising:
a first processing module, configured to perform multi-feature extraction on an initial text image to obtain a fusion feature map;
a second processing module, configured to acquire a printed text region probability map, a handwritten text region probability map and a text region adaptive threshold probability map based on the fusion feature map;
and a third processing module, configured to acquire a target detection region based on at least two of the printed text region probability map, the handwritten text region probability map and the text region adaptive threshold probability map, wherein the target detection region comprises a target text and a text box corresponding to the target text.
8. An electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the deep learning based handwriting and printed text detection method of any one of claims 1-6.
9. A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the deep learning based handwriting and printed text detection method of any one of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the deep learning based handwriting and printed text detection method of any one of claims 1-6.
CN202310149288.8A 2023-02-20 2023-02-20 Handwriting and printing text detection method and device based on deep learning Pending CN116311290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310149288.8A CN116311290A (en) 2023-02-20 2023-02-20 Handwriting and printing text detection method and device based on deep learning

Publications (1)

Publication Number Publication Date
CN116311290A (en) 2023-06-23



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination