CN111666941A - Text detection method and device and electronic equipment

Info

Publication number
CN111666941A
CN111666941A (application CN202010537495.7A; granted publication CN111666941B)
Authority
CN
China
Prior art keywords
text
anchor point
region
text line
category
Prior art date
Legal status
Granted
Application number
CN202010537495.7A
Other languages
Chinese (zh)
Other versions
CN111666941B
Inventor
张水发 (Zhang Shuifa)
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010537495.7A
Publication of CN111666941A
Application granted
Publication of CN111666941B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/24: Aligning, centring, orientation detection or correction of the image
    • G06V10/243: Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Input (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a text detection method and device and electronic equipment. The method comprises the following steps: acquiring an image to be detected; determining a candidate region related to a text line from the image to be detected; determining anchor point features of the candidate region, where the anchor point features comprise two types of feature information, namely the inclination angle and the size features of the candidate region; determining whether the candidate region is a text line by using the anchor point features of the candidate region and a preset correspondence between anchor point data and text line recognition results, where a text line recognition result characterizes whether a region is a text line, and the anchor point data are determined based on a plurality of preset sample anchor point features; and, when the candidate region is a text line, determining the content of the candidate region as the detected text. Compared with the prior art, the scheme provided by the disclosure can improve the accuracy of the text regions detected during OCR detection, and thereby improve the accuracy of the characters obtained from the image.

Description

Text detection method and device and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a text detection method and apparatus, and an electronic device.
Background
OCR (Optical Character Recognition) detection is applied ever more widely in various fields. OCR refers to analyzing and processing the scanned image of text material so as to obtain the characters and layout information in the image.
When performing OCR detection, the text region in which each text line is located in the obtained image may be inclined, owing to the angle of the text material during scanning, physical limitations of the scanning apparatus, and the like.
Therefore, during OCR detection, the various target object detection methods currently in existence cannot detect inclined text regions, so the accuracy of the detected text regions is low, which greatly reduces the accuracy of the characters obtained from the image.
Disclosure of Invention
The disclosure provides a text detection method, a text detection device, an electronic device and a storage medium, to at least solve the problem in the related art that, during OCR detection, inclined text regions cannot be detected, so that the accuracy of the detected text regions is low and the accuracy of the characters obtained from the image is reduced. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a text detection method, including:
acquiring an image to be detected;
determining a candidate region related to a text line from the image to be detected;
determining anchor point features of the candidate region; the anchor point features comprise two types of feature information, namely the inclination angle and the size features of the candidate region;
determining whether the candidate region is a text line by using the anchor point features of the candidate region and a preset correspondence between anchor point data and text line recognition results; a text line recognition result characterizes whether a region is a text line; the anchor point data are determined based on a plurality of preset sample anchor point features;
and when the candidate region is a text line, determining the content of the candidate region as the detected target text.
Optionally, in a specific implementation manner, the correspondence between the anchor point data and the text line recognition results includes: a correspondence between anchor point categories and text line recognition results;
the step of determining whether the candidate region is a text line by using the anchor point features of the candidate region and the preset correspondence between anchor point data and text line recognition results includes:
determining the target anchor point category to which the anchor point features of the candidate region belong;
and determining whether the candidate region is a text line by using the target anchor point category and the preset correspondence between anchor point categories and text line recognition results.
Optionally, in a specific implementation manner, the determining manner of the correspondence between the anchor point category and the text line recognition result includes:
acquiring a feature map of a first sample image and anchor point categories obtained based on the anchor point features of the text lines in a second sample image;
performing category regression and detection-frame regression with respect to text regions based on the feature map and the anchor point categories to obtain a plurality of initial text regions;
cropping, from the feature map, a to-be-regressed text region corresponding to each initial text region;
and performing category regression, detection-frame regression and angle regression on the plurality of cropped to-be-regressed text regions to obtain the correspondence between anchor point categories and text line recognition results.
Optionally, in a specific implementation manner, the determining manner of each anchor point category includes:
determining the inclination angle and the size features of each text line in the second sample image to obtain an angle data set and a size data set;
clustering the angle data set and the size data set respectively to obtain a first number of angle clustering results and a second number of size clustering results;
and repeatedly selecting one cluster center from the first number of angle clustering results and one from the second number of size clustering results to form an anchor point category, thereby obtaining the anchor point categories; wherein no two anchor point categories are identical in both types of feature information.
Optionally, in a specific implementation manner, before the step of performing category regression, detection-frame regression and angle regression on the plurality of cropped to-be-regressed text regions, the method further includes:
extracting the angle feature of each tilted text region among the plurality of cropped to-be-regressed text regions, and rotating each tilted text region based on its angle feature;
the step of performing category regression, detection-frame regression and angle regression on the plurality of cropped to-be-regressed text regions then includes:
performing category regression, detection-frame regression and angle regression on the non-tilted text regions among the plurality of cropped to-be-regressed text regions and on the rotated tilted text regions.
Optionally, in a specific implementation manner, before the step of cropping, from the feature map, the to-be-regressed text region corresponding to each initial text region, the method further includes:
performing a non-maximum suppression (NMS) operation on the initial text regions to obtain a suggested text region for each initial text region;
the step of cropping, from the feature map, the to-be-regressed text regions corresponding to the initial text regions then includes:
cropping, from the feature map, the to-be-regressed text region corresponding to each obtained suggested text region.
According to a second aspect of the embodiments of the present disclosure, there is provided a text detection apparatus including:
an image acquisition module configured to acquire an image to be detected;
a region determination module configured to determine a candidate region related to a text line from the image to be detected;
a feature determination module configured to determine anchor point features of the candidate region; the anchor point features comprise two types of feature information, namely the inclination angle and the size features of the candidate region;
a text line determination module configured to determine whether the candidate region is a text line by using the anchor point features of the candidate region and a preset correspondence between anchor point data and text line recognition results; a text line recognition result characterizes whether a region is a text line; the anchor point data are determined based on a plurality of preset sample anchor point features;
and a text determination module configured to determine, when the candidate region is a text line, the content of the candidate region as the detected target text.
Optionally, in a specific implementation manner, the correspondence between the anchor point data and the text line recognition results includes: a correspondence between anchor point categories and text line recognition results;
the text line determination module is configured to determine the target anchor point category to which the anchor point features of the candidate region belong, and to determine whether the candidate region is a text line by using the target anchor point category and the preset correspondence between anchor point categories and text line recognition results.
Optionally, in a specific implementation manner, the apparatus further includes a relationship determination module for determining the correspondence between anchor point categories and text line recognition results; the relationship determination module includes:
an information acquisition sub-module configured to acquire a feature map of the first sample image and the anchor point categories obtained based on the anchor point features of the text lines in the second sample image;
a region acquisition sub-module configured to perform category regression and detection-frame regression with respect to text regions based on the feature map and the anchor point categories to obtain a plurality of initial text regions;
a region cropping sub-module configured to crop, from the feature map, a to-be-regressed text region corresponding to each initial text region;
and a relationship determination sub-module configured to perform category regression, detection-frame regression and angle regression on the plurality of cropped to-be-regressed text regions to obtain the correspondence between anchor point categories and text line recognition results.
Optionally, in a specific implementation manner, the apparatus further includes an anchor point determination module for determining the anchor point categories;
the anchor point determination module is configured to determine the inclination angle and the size features of each text line in the second sample image to obtain an angle data set and a size data set; to cluster the angle data set and the size data set respectively to obtain a first number of angle clustering results and a second number of size clustering results; and to repeatedly select one cluster center from the first number of angle clustering results and one from the second number of size clustering results to form an anchor point category, thereby obtaining the anchor point categories; wherein no two anchor point categories are identical in both types of feature information.
Optionally, in a specific implementation manner, the relationship determination module further includes:
a region rotation sub-module configured to extract, before category regression, detection-frame regression and angle regression are performed on the plurality of cropped to-be-regressed text regions, the angle feature of each tilted text region among them, and to rotate each tilted text region based on its angle feature;
the relationship determination sub-module is then specifically configured to perform category regression, detection-frame regression and angle regression on the non-tilted text regions among the plurality of cropped to-be-regressed text regions and on the rotated tilted text regions.
Optionally, in a specific implementation manner, the relationship determination module further includes:
a region suppression sub-module configured to perform, before the to-be-regressed text region corresponding to each initial text region is cropped from the feature map, a non-maximum suppression (NMS) operation on the initial text regions to obtain a suggested text region for each initial text region;
the region cropping sub-module is then specifically configured to crop, from the feature map, the to-be-regressed text region corresponding to each obtained suggested text region.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory configured to store instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the steps of any of the text detection methods as provided in the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the steps of any of the text detection methods as provided in the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product which, when run on a computer, causes the computer to perform the steps of any of the text detection methods as provided in the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps of carrying out OCR detection on an image to be detected comprising characters to obtain a text region in the image to be detected, and further, determining an alternative region about a text line from the image to be detected after the image to be detected is obtained when the character content in the image to be detected is identified, so that two types of characteristic information of an inclination angle and a size characteristic of the alternative region are determined and used as an anchor point characteristic of a selected region. Thus, whether the alternative area is a text line can be determined by using the anchor point characteristic and the preset corresponding relation between the anchor point data and the text line identification result. When the candidate region is a text line, the candidate region is a text region, so that the content of the selected region can be determined as the detected target text.
The selected anchor point characteristics comprise the inclination angle of the alternative region, so that the inclination angle of the alternative region can be utilized when the anchor point information of the alternative region is utilized to determine whether the alternative region is a text line. Based on this, in the technical scheme provided by the present disclosure, the oblique text region included in the image to be detected can be detected, so that the accuracy of the detected text region can be improved, and further, when the accuracy of the detected text region is improved, the accuracy of the characters in the identified text region is also improved, that is, the accuracy of the characters in the acquired image to be detected is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a text detection method in accordance with an exemplary embodiment.
Fig. 2 is a flowchart illustrating a specific implementation manner of step S14 in fig. 1 according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a manner of determining correspondence of anchor categories to text line recognition results according to an example embodiment.
FIG. 4 is a flowchart illustrating one particular implementation of steps S32-S34 of FIG. 3, according to an example embodiment.
FIG. 5 is a flow diagram illustrating a text detection method according to another exemplary embodiment.
FIG. 6 is a flow diagram illustrating a manner in which various anchor categories may be derived based on anchor features of various lines of text in a second sample image in accordance with an exemplary embodiment.
FIG. 7 is a flow chart illustrating a manner in which various anchor categories may be derived based on anchor features of various lines of text in a second sample image in accordance with another exemplary embodiment.
Fig. 8(a) is a flowchart illustrating a determination method of correspondence between anchor point categories and text line recognition results according to an exemplary embodiment based on the embodiment illustrated in fig. 3.
Fig. 8(b) is a flowchart illustrating a determination method of correspondence between anchor point categories and text line recognition results according to another exemplary embodiment based on the embodiment illustrated in fig. 3.
Fig. 8(c) is a flowchart illustrating a determination method of correspondence between anchor categories and text line recognition results according to another exemplary embodiment based on the embodiments illustrated in fig. 8(a) and 8 (b).
Fig. 9 is a flowchart illustrating a specific implementation of step S33A in fig. 8(b) and 8(c) according to an exemplary embodiment.
FIG. 10 is a block diagram illustrating a text detection apparatus according to an example embodiment.
FIG. 11 is a block diagram illustrating an electronic device for detecting text in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a text detection method according to an exemplary embodiment. The method may be applied to any electronic device that needs text detection, such as a laptop, a desktop computer, a tablet or a mobile phone; the disclosure is not limited in this respect, and such a device is referred to simply as the electronic device hereinafter.
The detection method provided by the disclosure may be integrated into a functional module of the electronic device, so that the electronic device itself has the function of implementing the detection method; the detection method may also be installed on the electronic device as application software, so that the electronic device implements the detection method through the installed client of that software. Either arrangement is reasonable.
As shown in fig. 1, the present disclosure provides a text detection method, including the following steps.
In step S11, an image to be detected is acquired;
the electronic device may acquire the image to be detected in various ways, and the disclosure is not limited thereto. For example, the electronic device may obtain the stored image to be detected from the local storage area, obtain the image to be detected from other electronic devices, and download the image to be detected from the internet.
In addition, the image to be detected, which includes the text, may reasonably be any type of image, for example a photograph or a picture obtained by scanning text material.
In step S12, from the image to be detected, a candidate region for a text line is determined;
after the image to be detected is obtained, the electronic device can determine at least one alternative area related to the text line in the image to be detected according to the image content of the image to be detected.
Each candidate area may be a text line, or may be an area formed by other contents besides text, for example, a line area formed by transversely arranging a plurality of decorative patterns with smaller sizes, and the like. Also, each of the candidate regions may have a certain inclination angle.
It should be noted that the electronic device may determine the candidate region related to the text line from the image to be detected in a variety of ways, for example, the electronic device may determine the candidate region related to the text line from the image to be detected through various algorithms such as image feature extraction and image detection. In this regard, the present disclosure is not particularly limited.
In step S13, anchor point features of the candidate region are determined;
the anchor point features comprise two types of feature information of the inclination angle and the size features of the candidate region.
After determining each candidate region of the text line in the image to be detected, the electronic device may perform region feature extraction on each candidate region to obtain an inclination angle and a size feature of the candidate region, so as to obtain an anchor point feature of the candidate region.
Optionally, the size characteristic of the candidate region may include only the aspect ratio of the candidate region; the size information of the content included in the alternative area may be included, for example, the size of the included text, the size of the included pattern, and the like; the aspect ratio of the candidate area and the size information of the content included in the candidate area may also be included at the same time. This is all reasonable.
Of course, the size characteristic of the candidate region may also include other information related to the shape and size of the candidate region, for example, the height and width of the candidate region, and the disclosure is not limited thereto.
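For concreteness, the sketch below shows one way such an anchor point feature could be represented in code; the field names and the Python representation are illustrative assumptions, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class AnchorFeature:
    """Anchor point feature of one candidate region (illustrative field names)."""
    angle: float          # inclination angle of the region, in degrees
    aspect_ratio: float   # width / height of the region
    content_size: float   # size of the content, e.g. character height in pixels

# Example: a candidate region tilted by 25 degrees with a 10:1 aspect ratio
# and characters roughly 18 pixels high.
feature = AnchorFeature(angle=25.0, aspect_ratio=10.0, content_size=18.0)
```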
In step S14, determining whether the candidate region is a text line by using the anchor point features of the candidate region and the preset correspondence between anchor point data and text line recognition results;
a text line recognition result characterizes whether a region is a text line; the anchor point data are determined based on a plurality of preset sample anchor point features.
In the technical scheme provided by the embodiments of the disclosure, the correspondence between anchor point data and text line recognition results can be constructed in advance, so that the correspondence characterizes which kinds of anchor point data belong to regions that are text lines.
Optionally, in the above correspondence, the text line recognition results cover both cases; that is, the correspondence indicates not only which kinds of anchor point data mark a region as a text line but also which kinds mark a region as not a text line. After the anchor point features of a candidate region in the image to be detected are obtained, the anchor point data matching those features can be looked up in the correspondence; the text line recognition result corresponding to the anchor point data found is then the text line recognition result of the candidate region, and whether the candidate region is a text line can be determined accordingly.
In this way, when the text line recognition result is yes, the candidate region can be determined to be a text line; correspondingly, when the text line recognition result is no, it can be determined that the candidate region is not a text line.
Optionally, in the above correspondence, the text line recognition results may instead include only the affirmative case; that is, the correspondence characterizes only which kinds of anchor point data mark a region as a text line. After the anchor point features of a candidate region in the image to be detected are obtained, the anchor point data matching those features can be looked up in the correspondence: when matching anchor point data are found, the candidate region can be determined to be a text line; correspondingly, when no matching anchor point data are found, it can be determined that the candidate region is not a text line.
Of course, the correspondence may characterize in other ways which kinds of anchor point data mark a region as a text line, and the way of determining whether a candidate region is a text line from its anchor point features and the correspondence can be adjusted accordingly; this disclosure does not specifically limit it.
It should be noted that the correspondence may be determined locally on the electronic device and stored locally, so that when step S14 is executed the electronic device can read it directly from local storage. The correspondence may instead be determined on another electronic device and sent to the electronic device for local storage, in which case the electronic device can likewise read it locally when executing step S14. Alternatively, the correspondence may be determined on one other electronic device and stored on another, so that when executing step S14 the electronic device must first read the correspondence from that other device; here, the device that determines the correspondence and the device that stores it may be the same device or different devices. All of these are reasonable.
The preset correspondence between anchor point data and text line recognition results may be constructed in various ways, and the disclosure is not limited in this respect. For clarity, the determination of the correspondence is exemplified later in this description.
In step S15, when the candidate region is a text line, the content of the candidate region is determined as the detected target text.
When the candidate region is determined to be a text line, the region is a text region and its content is text. Since the purpose of the present disclosure is to obtain the text regions in the image to be detected, a candidate region determined to be a text line is exactly such a text region, and its content can therefore be determined as the detected target text.
As can be seen from the above, in the technical scheme provided by the embodiments of the present disclosure, OCR detection is performed on an image to be detected that includes characters in order to obtain the text regions in the image and thereby recognize its character content. After the image to be detected is acquired, a candidate region related to a text line is determined from it, and two types of feature information, namely the inclination angle and the size features of the candidate region, are determined as the anchor point features of the candidate region. Whether the candidate region is a text line can then be determined by using these anchor point features together with the preset correspondence between anchor point data and text line recognition results. When the candidate region is a text line, it is a text region, so its content can be determined as the detected target text.
Because the anchor point features include the inclination angle of the candidate region, the inclination angle can be used when determining, from the anchor point information of the candidate region, whether it is a text line. On this basis, the technical scheme provided by the present disclosure can detect the inclined text regions included in an image to be detected, so the accuracy of the detected text regions can be improved; and as the accuracy of the detected text regions improves, the accuracy of the characters recognized in those text regions improves as well, that is, the accuracy of the characters obtained from the image to be detected is improved.
It can be understood that, when the correspondence between anchor point data and text line recognition results is determined, the anchor point data are determined based on a plurality of preset sample anchor point features, and these sample anchor point features can be divided into several categories according to the specific values of the inclination angles and size features they contain.
On this basis, optionally, in a specific implementation manner, the correspondence between anchor point data and text line recognition results may include: a correspondence between anchor point categories and text line recognition results.
Each of the plurality of sample anchor point features comprises two types of feature information, namely the inclination angle and the size features of a region. Each determined anchor point category may accordingly comprise: an inclination angle determined from the clustering result of the inclination angles in the sample anchor point features, and size features determined from the clustering result of the size features in the sample anchor point features. That is, each anchor point category may include the two types of feature information, inclination angle and size features, and each type of feature information in an anchor point category is obtained from the clustering result of that type of feature information across the plurality of sample anchor point features.
It should be noted that, optionally, when the size features include both the aspect ratio of a region and the size information of its content, an anchor point category may include three types of feature information: the inclination angle, the aspect ratio and the size information of the content. That is, each determined anchor point category may include an inclination angle, an aspect ratio and a content size.
Accordingly, in this specific implementation manner, as shown in fig. 2, the step S14 of determining whether the candidate region is a text line by using the anchor point features of the candidate region and the preset correspondence between anchor point data and text line recognition results may include the following steps:
in step S21, determining the target anchor point category to which the anchor point features of the candidate region belong;
in step S22, determining whether the candidate region is a text line by using the target anchor point category and the preset correspondence between anchor point categories and text line recognition results.
In this specific implementation manner, the correspondence between anchor point categories and text line recognition results characterizes which anchor point categories the anchor point data of a text line belong to. Therefore, after the anchor point features of the candidate region are obtained, the target anchor point category to which they belong can be determined first, and whether the candidate region is a text line is then determined by using the target anchor point category and the correspondence between anchor point categories and text line recognition results.
Optionally, in the correspondence between anchor point categories and text line recognition results, the text line recognition results cover both cases; that is, the correspondence indicates not only which anchor point categories mark a region as a text line but also which anchor point categories mark a region as not a text line. After the target anchor point category of the candidate region in the image to be detected is obtained, that category can be looked up in the correspondence; the text line recognition result corresponding to the category found is then the text line recognition result of the candidate region, and whether the candidate region is a text line can be determined accordingly.
In this way, when the text line recognition result is yes, the candidate region can be determined to be a text line; correspondingly, when the text line recognition result is no, it can be determined that the candidate region is not a text line.
Optionally, in the above correspondence, the text line recognition results may instead include only the affirmative case; that is, the correspondence between anchor point categories and text line recognition results characterizes only which anchor point categories mark a region as a text line. After the target anchor point category of the candidate region in the image to be detected is obtained, that category can be looked up in the correspondence: when the target anchor point category is found, the candidate region can be determined to be a text line; correspondingly, when it is not found, it can be determined that the candidate region is not a text line.
Of course, the correspondence between anchor point categories and text line recognition results may characterize in other ways which anchor point categories mark a region as a text line, and the way of determining whether a candidate region is a text line from its target anchor point category and the correspondence can be adjusted accordingly; this disclosure does not specifically limit it.
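As an illustration of steps S21 and S22, the sketch below assumes that each anchor point category is stored as an (angle, aspect ratio) cluster-center pair and that the correspondence is a plain mapping from category index to recognition result; all names and values are hypothetical.

```python
import math

# Assumed representation: each anchor point category is an (angle, aspect_ratio)
# pair taken from cluster centers, and the correspondence maps a category index
# to a text line recognition result (True = "is a text line").
anchor_categories = [(15.0, 5.0), (45.0, 5.0), (75.0, 5.0),
                     (15.0, 10.0), (45.0, 10.0), (75.0, 10.0)]
correspondence = {0: True, 1: True, 2: False, 3: True, 4: False, 5: False}

def target_anchor_category(angle, aspect_ratio):
    """Return the index of the anchor point category closest to the feature (step S21)."""
    def distance(category):
        a, r = category
        return math.hypot(angle - a, aspect_ratio - r)
    return min(range(len(anchor_categories)),
               key=lambda i: distance(anchor_categories[i]))

def is_text_line(angle, aspect_ratio):
    """Look up the recognition result for the target category (step S22)."""
    return correspondence.get(target_anchor_category(angle, aspect_ratio), False)

print(is_text_line(20.0, 9.0))  # -> True in this toy correspondence
```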
Next, the determination of the correspondence between anchor point categories and text line recognition results provided by the present disclosure is exemplified. In the following description, the execution subject of this determination is the electronic device that executes the text detection method provided by the present disclosure; when another electronic device determines the correspondence, the method it uses may be the same as that described below and is not repeated here.
Optionally, in a specific implementation manner, as shown in fig. 3, the determining manner of the correspondence between the anchor point category and the text line recognition result may include the following steps:
in step S31, acquiring a feature map of the first sample image and the anchor point categories obtained based on the anchor point features of the text lines in the second sample image;
in step S32, performing category regression and detection-frame regression with respect to text regions based on the feature map and the anchor point categories to obtain a plurality of initial text regions;
in step S33, cropping, from the feature map, the to-be-regressed text region corresponding to each initial text region;
in step S34, performing category regression, detection-frame regression and angle regression on the plurality of cropped to-be-regressed text regions to obtain the correspondence between anchor point categories and text line recognition results.
The electronic device first acquires the feature map of the first sample image and the anchor point categories obtained based on the anchor point features of the text lines in the second sample image.
The electronic device may acquire the feature map and the anchor point categories in multiple ways, and the disclosure is not limited in this respect. For clarity, the ways of obtaining the feature map and the anchor point categories are exemplified later.
The first sample image and the second sample image may be the same image or different images; either is reasonable.
After the feature map and the anchor point categories are obtained, category regression and detection-frame regression with respect to text regions may be performed, based on them, on each image region present in the feature map to obtain a plurality of initial text regions. The resulting initial text regions may have inclination-angle and size features.
Further, after the plurality of initial text regions are obtained, each initial text region may be used to crop, from the feature map, the to-be-regressed text region corresponding to it.
In this way, category regression, detection-frame regression and angle regression can be performed on each cropped to-be-regressed text region, thereby obtaining the correspondence between anchor point categories and text line recognition results.
Optionally, in a specific implementation manner, steps S32-S34 of the implementation shown in fig. 3 may be realized by training a preset model. Specifically, as shown in fig. 4, steps S32-S34 may include the following steps:
in step S41, inputting the feature map of the first sample image and the text region anchors into the region candidate layer of a preset model for training, to obtain the initial text regions output by the region candidate layer;
in step S42, cropping, from the feature map of the first sample image, the to-be-regressed text region corresponding to each initial text region;
in step S43, inputting the to-be-regressed text regions into the pooling layer of the preset model for training; when a preset completion condition is satisfied, the correspondence between anchor point categories and text line recognition results is obtained.
Specifically, before model training, the electronic device first acquires the feature map of the first sample image and the text region anchors, and constructs the preset model.
The anchors are determined by using the anchor point categories obtained based on the anchor point features of the text lines in the second sample image; that is, after the anchor point categories described above are obtained, the anchors can be determined from them. Moreover, since each anchor point category includes two types of feature information, namely inclination angle and size features, an anchor can reflect the inclination angle and the size features of a text region, i.e., an anchor can represent a text region.
In addition, the preset model may be any of various models such as Faster R-CNN, R-FCN, R-CNN or SSD, and the disclosure is not specifically limited in this respect.
The preset model may include a region candidate layer and a pooling layer. For example, when the preset model is Faster R-CNN, the region candidate layer may be the RPN (Region Proposal Network) and the pooling layer may be the RoI pooling (Region of Interest pooling) layer.
Therefore, after the feature map and the anchors are obtained and the preset model is constructed, the electronic device can input them into the region candidate layer of the preset model for training; when the training of the region candidate layer on the feature map and the anchors satisfies a preset region-candidate-layer training completion condition, the region candidate layer can output the trained initial text regions.
The specific way the region candidate layer trains on the feature map and the anchors is to perform category regression and detection-frame regression, obtaining the regions whose trained category is the text category, i.e., the initial text regions.
The region-candidate-layer training completion condition may be that the number of training iterations reaches a preset value, or that the loss value obtained in training is smaller than a preset loss value.
Further, after the initial text regions output by the region candidate layer are obtained, since each initial text region has inclination-angle and size features, it can be used to crop, from the feature map, the to-be-regressed text region corresponding to it.
The cropped to-be-regressed text regions are then input into the pooling layer of the preset model for training; that is, each cropped to-be-regressed text region is scaled down or up to the size matching the model parameters of the pooling layer, and the pooling layer performs category regression, detection-frame regression and angle regression on the scaled to-be-regressed text regions. When the preset completion condition is satisfied, the trained target model is obtained.
The correspondence between anchor point categories and text line recognition results is established inside the trained target model, so that correspondence is obtained once the trained target model is obtained.
In addition, the preset completion condition may be that the number of training iterations reaches a preset value, or that the loss value obtained in training is smaller than a preset loss value; either is reasonable.
Further, in order to ensure the accuracy of the trained target model, and hence the accuracy of the obtained correspondence between anchor point categories and text line recognition results, the feature maps and the anchors should reach a certain quantity, whose specific value can be set according to the requirements of the practical application.
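The patent describes this training stage only at the above level of detail. Purely as an assumption-laden sketch, a Faster R-CNN-style second-stage head that performs the three regressions on pooled to-be-regressed regions might look as follows (PyTorch; the layer sizes, class count and unweighted loss sum are all assumptions, not taken from the disclosure):

```python
import torch
import torch.nn as nn

class RotatedTextHead(nn.Module):
    """Illustrative second-stage head: category, detection-frame and angle
    regression on pooled text regions (all sizes are assumptions)."""
    def __init__(self, in_features=256 * 7 * 7, num_classes=2):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(in_features, 1024), nn.ReLU())
        self.cls_head = nn.Linear(1024, num_classes)  # text / non-text category
        self.box_head = nn.Linear(1024, 4)            # detection-frame regression
        self.angle_head = nn.Linear(1024, 1)          # angle regression

    def forward(self, pooled):
        x = self.fc(pooled)
        return self.cls_head(x), self.box_head(x), self.angle_head(x)

def total_loss(cls_logits, box_pred, angle_pred, cls_gt, box_gt, angle_gt):
    """Combined loss over the three regression targets (a common formulation)."""
    return (nn.functional.cross_entropy(cls_logits, cls_gt)
            + nn.functional.smooth_l1_loss(box_pred, box_gt)
            + nn.functional.smooth_l1_loss(angle_pred, angle_gt))

# Toy forward/backward pass on random "pooled" regions of size 256x7x7.
head = RotatedTextHead()
pooled = torch.randn(8, 256, 7, 7)
cls_logits, box_pred, angle_pred = head(pooled)
loss = total_loss(cls_logits, box_pred, angle_pred,
                  torch.randint(0, 2, (8,)), torch.randn(8, 4), torch.randn(8, 1))
loss.backward()
```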
Optionally, in a specific implementation manner, on the basis of the implementation shown in fig. 4, a text detection method provided by the present disclosure may, as shown in fig. 5, include the following steps:
in step S51, acquiring an image to be detected;
in step S52, inputting the image to be detected into the trained target model to obtain the target text regions output by the target model;
in step S53, determining the content of the obtained target text regions as the detected target text.
The trained target model is the model trained on the basis of the feature map and the anchors in the implementation shown in fig. 4.
Thus, when the image to be detected is input into the target model, the target model can determine the candidate regions related to text lines in the image and learn the anchor point features of each candidate region, thereby determining the anchor point category to which those features belong. Then, according to the correspondence between anchor point categories and text line recognition results established inside the target model, whether each candidate region is a text line can be determined on the basis of its anchor point category, and the candidate regions determined to be text lines are output as the target text regions. The electronic device thereby obtains the target text regions and can determine their content as the detected target text.
The target text regions may be output by being labeled in the image to be detected, or by outputting the specific numerical values of the inclination angle and the size features of each target text box; either is reasonable.
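For the second output mode, a rotated text box characterized by its center, size and inclination angle can be converted into corner coordinates, for example with OpenCV's boxPoints; the concrete values below are purely illustrative:

```python
import cv2
import numpy as np

# One way to characterize an output text box by the numerical values of its
# inclination angle and size: (center, (width, height), angle), as used by OpenCV.
rect = ((160.0, 120.0), (100.0, 30.0), 20.0)
corners = cv2.boxPoints(rect)  # four corner points of the rotated box
print(np.round(corners, 1))
```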
Optionally, in a specific implementation manner, the feature map of the first sample image may be obtained as follows:
a first sample image is acquired and input into an image feature extraction model. The image feature extraction model learns the image features of the first sample image and thereby generates the feature map corresponding to it. In this way, for each first sample image, the corresponding feature map can be obtained through the image feature extraction model.
Before the first sample image is input into the image feature extraction model, it may be scaled down or up, according to the model parameters of the image feature extraction model, to the size matching those parameters, and the scaled first sample image is then input into the model.
In addition, the image feature extraction model may be any model capable of extracting image features and generating a feature map, for example network models such as VGG16, Inception v1, Inception v2, ResNet or Inception-ResNet; the disclosure is not limited in this respect.
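As an illustration of this step, the sketch below uses the convolutional part of VGG16 from torchvision as the image feature extraction model; any of the other listed backbones could be substituted, and the input size is an assumption:

```python
import torch
from torchvision import models

# Convolutional part of VGG16 as the feature extraction model
# (weights=None requires torchvision >= 0.13; older versions use pretrained=False).
backbone = models.vgg16(weights=None).features.eval()

image = torch.randn(1, 3, 224, 224)  # a first sample image, scaled to a matching size
with torch.no_grad():
    feature_map = backbone(image)
print(feature_map.shape)  # -> torch.Size([1, 512, 7, 7])
```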
Optionally, in a specific implementation manner, as shown in fig. 6, obtaining the anchor point categories based on the anchor point features of the text lines in the second sample image may include the following steps:
in step S61, determining the inclination angle and the size features of each text line in the second sample image to obtain an angle data set and a size data set;
in step S62, clustering the angle data set and the size data set respectively to obtain a first number of angle clustering results and a second number of size clustering results;
in step S63, repeatedly selecting one cluster center from the first number of angle clustering results and one from the second number of size clustering results to form an anchor point category, thereby obtaining the anchor point categories;
wherein no two anchor point categories are identical in both types of feature information.
Specifically, the electronic device first acquires the second sample images, each of which may include at least one text line; naturally, the length of a text line cannot exceed the length or width of the second sample image. It then extracts each text line included in a second sample image to obtain the text region corresponding to each text line, and calculates the inclination angle and the size features of each obtained text region.
It should be noted that the obtained text regions corresponding to the text lines may be the actual text regions included in the second sample images, and the calculated inclination angle and size features of each text region may likewise be the actual data of that text region in the second sample image.
Further, after the inclination angle and the size features of every text region in every second sample image are calculated, an angle data set including all the calculated inclination angles and a size data set including all the calculated size features are obtained.
The electronic device can then cluster the obtained angle data set and size data set according to the preset number of categories into which the preset angle range is divided and the number of categories corresponding to the size features, obtaining a first number of angle clustering results and a second number of size clustering results.
Then, one cluster center can be selected from the angle clustering results and one from the size clustering results, and the selected angle cluster center and size cluster center form an anchor point category; after multiple such selections, a plurality of anchor point categories is obtained. No two anchor point categories are identical in both types of feature information.
It can be understood that, from a first number of angle clustering results and a second number of size clustering results, at most the product of the first number and the second number of anchor point categories can be obtained.
The first number and the second number may be the same or different. The first number is obtained by dividing a preset angle range; for example, the preset angle range may be 0°-90°, divided in 30° steps into the three categories 0°-30°, 30°-60° and 60°-90°, so that the first number, i.e. the number of angle categories, is 3.
Further, the clustering method adopted in step S62 above may include, but is not limited to, k-nearest-neighbour methods, k-means, Gaussian-model-based clustering, and the like.
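A minimal sketch of steps S61-S63 follows, assuming k-means as the clustering method and the aspect ratio as the size feature; the sample data and cluster counts are synthetic illustrations:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed inputs: per-text-line inclination angles (degrees) and aspect ratios
# measured from the second sample images; the values here are synthetic.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 90, size=(200, 1))         # angle data set
aspect_ratios = rng.uniform(2, 20, size=(200, 1))  # size data set

first_number, second_number = 3, 4  # numbers of angle / size clusters
angle_centers = KMeans(n_clusters=first_number, n_init=10).fit(angles).cluster_centers_
size_centers = KMeans(n_clusters=second_number, n_init=10).fit(aspect_ratios).cluster_centers_

# Each anchor point category pairs one angle cluster center with one size
# cluster center; at most first_number * second_number categories result.
anchor_categories = [(float(a), float(s)) for a in angle_centers.ravel()
                                          for s in size_centers.ravel()]
print(len(anchor_categories))  # -> 12
```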
Optionally, in a specific implementation manner, when the size features include the aspect ratio of a region and the size information of its content, obtaining the anchor point categories based on the anchor point features of the text lines in the second sample image may, as shown in fig. 7, include the following steps:
in step S71, determining the inclination angle, the aspect ratio and the content size information of each text line in the second sample image to obtain an angle data set, an aspect-ratio data set and a content-size data set;
in step S72, clustering the angle data set, the aspect-ratio data set and the content-size data set respectively to obtain a first number of angle clustering results, a second number of aspect-ratio clustering results and a third number of content-size clustering results;
in step S73, repeatedly selecting one cluster center from each of the first number of angle clustering results, the second number of aspect-ratio clustering results and the third number of content-size clustering results to form an anchor point category, thereby obtaining the anchor point categories;
wherein no two anchor point categories are identical in all three types of feature information.
The specific content of steps S71-S73 is similar to that of steps S61-S63 and is not repeated here.
Optionally, in a specific implementation manner, on the basis of the embodiment shown in fig. 3 and as shown in fig. 8(a), the determination of the correspondence between anchor point categories and text line recognition results may further include the following step:
in step S34A, extracting the angle feature of each tilted text region among the plurality of cropped to-be-regressed text regions, and rotating each tilted text region based on its angle feature;
accordingly, in this specific implementation manner, step S34 may include the following step:
in step S34B, performing category regression, detection-frame regression and angle regression on the non-tilted text regions among the plurality of cropped to-be-regressed text regions and on the rotated tilted text regions.
When a tilted text region is rotated, its angle changes, so the pixels of the rotated region may correspond to sub-pixel positions in the original; that is, after the rotation the pixel points of the tilted text region become sub-pixel-level points. Therefore, in order to eliminate the influence of these sub-pixel-level points, bilinear interpolation at the sub-pixel level needs to be performed. Here, sub-pixel level means that there may be smaller points between two pixel points of an image.
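As an illustrative sketch of this rotation conversion, OpenCV's warpAffine with bilinear interpolation (INTER_LINEAR) can rotate a tilted region to the horizontal while resampling the resulting sub-pixel positions; the region parametrization below is an assumption:

```python
import cv2
import numpy as np

def straighten_region(image, center, size, angle):
    """Rotate a tilted text region to the horizontal. warpAffine with
    INTER_LINEAR performs the sub-pixel bilinear interpolation described
    above. The (center, size, angle) parametrization is illustrative."""
    rotation = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, rotation, (image.shape[1], image.shape[0]),
                             flags=cv2.INTER_LINEAR)
    w, h = size
    x, y = int(center[0] - w / 2), int(center[1] - h / 2)
    return rotated[y:y + h, x:x + w]

# Toy example: crop a 20-degree-tilted 100x30 region centered at (160, 120).
image = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
patch = straighten_region(image, center=(160.0, 120.0), size=(100, 30), angle=20.0)
print(patch.shape)  # -> (30, 100, 3)
```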
Optionally, in a specific implementation manner, on the basis of the embodiment shown in fig. 3, as shown in fig. 8(b), the manner of determining the correspondence between the anchor point category and the text line recognition result may further include the following steps:
in step S33A, performing a non-maximum suppression (NMS) operation on each initial text region to obtain the suggested detection region of each initial text region;
accordingly, in this specific implementation manner, the step S33 may include the following steps:
in step S33B, intercepting, in the feature map, the text region to be regressed corresponding to each obtained suggested text region.
Here, NMS is the abbreviation of non-maximum suppression.
Specifically, the essence of performing the category regression on the text regions in step S32 is to generate a window of fixed size and, for the same text object, intercept a plurality of regions related to that text object in the feature map of the first sample image by sliding the window, adding a classification score to each intercepted region.
The classification score characterizes the probability that the determined initial text region is a real text region: the higher the classification score, the higher the probability.
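For illustration only, a minimal sketch of this sliding-window scoring follows; the window size, the stride, and the scoring function are assumptions, with the scorer standing in for a learned classifier.

```python
# A minimal sketch of sliding a fixed-size window over a feature map and
# attaching a classification score to each intercepted region; window size,
# stride, and the scoring head are illustrative assumptions.
import numpy as np

def score_region(patch: np.ndarray) -> float:
    """Hypothetical classifier head: higher means more likely real text."""
    return float(patch.mean())  # placeholder for a learned scorer

feature_map = np.random.rand(64, 64)  # feature map of the first sample image
win, stride = 16, 8                   # assumed window size and stride

scored_regions = []                   # [((x, y, w, h), score), ...]
for top in range(0, feature_map.shape[0] - win + 1, stride):
    for left in range(0, feature_map.shape[1] - win + 1, stride):
        patch = feature_map[top:top + win, left:left + win]
        scored_regions.append(((left, top, win, win), score_region(patch)))
```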
Obviously, in order to improve the accuracy of the obtained correspondence between the anchor point category and the text line recognition result, and since the initial text regions obtained in step S32 include a plurality of regions for the same text object, the text region with the highest classification score among those regions may be selected by NMS as the suggested text region for that text object.
Accordingly, in this specific implementation manner, the step S33 may include intercepting, in the feature map, the text region to be regressed corresponding to each suggested text region.
The execution manner of step S33B is similar to the execution manner of step S33, and is not repeated here.
Optionally, in a specific implementation manner combining the implementations shown in fig. 8(a) and fig. 8(b) on the basis of the embodiment shown in fig. 3, as shown in fig. 8(c), the manner of determining the correspondence between the anchor point category and the text line recognition result may further include the following steps:
in step S33A, performing a non-maximum suppression (NMS) operation on each initial text region to obtain the suggested detection region of each initial text region;
accordingly, in this specific implementation manner, the step S33 may include the following steps:
in step S33B, intercepting, in the feature map, the text region to be regressed corresponding to each obtained suggested text region.
Furthermore, in this implementation manner, the manner of determining the correspondence between the anchor point category and the text line recognition result may further include:
in step S34A, extracting the angle features of the tilted text regions among the plurality of intercepted text regions to be regressed, and performing rotation conversion on the tilted text regions based on the angle features;
accordingly, in this specific implementation manner, the step S34 may include the following steps:
in step S34B, performing category regression, detection frame regression, and angle regression on the non-tilted text regions among the plurality of intercepted text regions to be regressed and the rotation-converted tilted text regions.
The steps shown in fig. 8(c) are the same as the corresponding steps shown in fig. 8(a) and fig. 8(b) and are not repeated here.
Optionally, in a specific implementation manner, as shown in fig. 9, the electronic device performing step S33A, that is, performing the non-maximum suppression (NMS) operation on each initial text region to obtain the suggested detection region of each initial text region, may include the following steps:
in step S91, taking the text region with the highest classification score among the initial text regions as the target text region;
wherein the classification score characterizes the probability that each initial text region is a real text region;
in step S92, deleting, from the candidate text regions, each text region whose ratio of overlap area to union area is greater than a preset ratio, to obtain the current initial text regions;
wherein the candidate text regions are the text regions in the initial text regions other than the target text region, the overlap area is the area of the overlapping region of a candidate text region and the target text region, and the union area is the area of the union region of the candidate text region and the target text region;
in step S93, taking the text region with the highest classification score among the remaining text regions as the next target text region, and returning to step S92;
wherein the remaining text regions are the text regions in the current initial text regions that have not been taken as a target text region;
in step S94, taking the retained target text regions as the suggested detection regions.
In this specific implementation, the text region with the highest classification score among the initial text regions is taken as the target text region, and the text regions other than the target text region are taken as candidate text regions.
For each candidate text region, the area of its overlapping region with the target text region and the area of their union region are calculated, and the ratio of the overlap area to the union area (the intersection over union, IoU) is then computed.
Then, for each candidate text region, this ratio is compared with the preset ratio; when the ratio of the overlap area to the union area is greater than the preset ratio, the candidate text region is deleted from the initial text regions.
Thus, after traversing all candidate text regions, the current initial text regions are obtained; they may contain at least one text region fewer than the initial text regions obtained in step S32.
Furthermore, the text regions in the current initial text regions that have not been taken as the target text region may be taken as the remaining text regions, the text region with the highest classification score among them is determined as the next target text region, and the electronic device repeats step S92 to obtain the updated initial text regions.
Then, the text regions in the updated initial text regions that have not been taken as a target text region are taken as the new remaining text regions, the text region with the highest classification score among them is determined as the next target text region, and the process again returns to step S92.
By analogy, when the updated initial text regions contain no text region that has not been taken as a target text region, the electronic device may take the retained target text regions as the suggested text regions. A sketch of this procedure follows.
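For illustration only, the following Python sketch implements the NMS procedure of steps S91-S94 under the simplifying assumption that the regions are axis-aligned boxes (x1, y1, x2, y2); the tilted regions of the disclosure would additionally require a rotated-box overlap computation.

```python
# A minimal sketch of steps S91-S94: repeatedly keep the highest-scoring
# region and delete candidates whose overlap/union ratio with it exceeds
# a preset ratio. Axis-aligned boxes are a simplifying assumption.
def iou(a, b):
    """Ratio of the overlap area to the union area of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - overlap)
    return overlap / union if union > 0 else 0.0

def nms(regions, scores, preset_ratio=0.5):
    """Return the retained target regions, i.e. the suggested detection regions."""
    order = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        target = order.pop(0)        # S91/S93: highest-scoring remaining region
        kept.append(regions[target])
        order = [i for i in order    # S92: delete candidates overlapping too much
                 if iou(regions[target], regions[i]) <= preset_ratio]
    return kept                      # S94: retained target regions
```

With the assumed preset ratio of 0.5, several proposals covering the same text object collapse into the single highest-scoring one.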
FIG. 10 is a block diagram illustrating a text detection apparatus according to an example embodiment. Referring to fig. 10, the apparatus includes an image acquisition module 1010, a region determination module 1020, a feature determination module 1030, a text line determination module 1040, and a text determination module 1050.
The image acquisition module 1010 is configured to acquire an image to be detected;
the region determination module 1020 is configured to determine a candidate region related to a text line from the image to be detected;
the feature determination module 1030 is configured to determine the anchor point features of the candidate region, which comprise two types of feature information: the tilt angle and the size feature of the candidate region;
the text line determination module 1040 is configured to determine whether the candidate region is a text line by using the anchor point features of the candidate region and the preset correspondence between anchor point data and text line recognition results, wherein a text line recognition result characterizes whether a region is a text line, and the anchor point data is determined based on a plurality of preset sample anchor point features;
the text determination module 1050 is configured to determine the content of the candidate region as the detected target text when the candidate region is a text line.
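For illustration only, the following Python skeleton mirrors the module structure shown in fig. 10; all concrete behaviour is stubbed out, and the nearest-centre matcher is a simplifying assumption rather than the disclosed method.

```python
# A minimal skeleton mirroring the modules of fig. 10, assuming the concrete
# detectors are supplied elsewhere; an illustration of the module structure,
# not part of the original disclosure.
class TextDetector:
    def __init__(self, anchor_categories, category_to_result):
        self.anchor_categories = anchor_categories    # preset anchor point data
        self.category_to_result = category_to_result  # anchor category -> is-text-line

    def acquire_image(self, source):                  # image acquisition module 1010
        raise NotImplementedError

    def candidate_regions(self, image):               # region determination module 1020
        raise NotImplementedError

    def anchor_features(self, region):                # feature determination module 1030
        """Return (tilt_angle, size_feature) of the candidate region."""
        raise NotImplementedError

    def match_anchor_category(self, features):
        # Nearest anchor category by squared distance (a simplifying assumption).
        return min(self.anchor_categories,
                   key=lambda c: sum((f - v) ** 2 for f, v in zip(features, c)))

    def is_text_line(self, features):                 # text line determination module 1040
        return self.category_to_result.get(self.match_anchor_category(features), False)

    def detect(self, source):                         # text determination module 1050
        image = self.acquire_image(source)
        return [region for region in self.candidate_regions(image)
                if self.is_text_line(self.anchor_features(region))]
```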
As can be seen from the above, in the technical solution provided by the embodiments of the present disclosure, OCR detection is performed on an image to be detected containing characters to obtain the text regions in the image, after which the character content is recognized. After the image to be detected is acquired, a candidate region related to a text line is determined from it, and two types of feature information, namely the tilt angle and the size feature of the candidate region, are determined as the anchor point feature of the candidate region. Whether the candidate region is a text line can then be determined using this anchor point feature and the preset correspondence between anchor point data and text line recognition results. When the candidate region is a text line, it is a text region, so its content can be determined as the detected target text.
Since the selected anchor point features include the tilt angle of the candidate region, the tilt angle can be used when determining whether the candidate region is a text line. On this basis, the technical solution provided by the present disclosure can detect the tilted text regions contained in the image to be detected, thereby improving the accuracy of the detected text regions and, in turn, the accuracy of the characters recognized in those regions, that is, the accuracy of the characters obtained from the image to be detected.
Optionally, in a specific implementation manner, the correspondence between anchor point data and a text line recognition result includes: the corresponding relation between the anchor point category and the text line recognition result;
The text line determination module 1040 is configured to determine the target anchor point category to which the anchor point features of the candidate region belong, and to determine whether the candidate region is a text line by using the target anchor point category and the preset correspondence between anchor point categories and text line recognition results.
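As a complement to the skeleton above, the following sketch shows one way the target anchor point category could be determined from a candidate region's features; the per-dimension scaling that makes the tilt angle and the size feature comparable is an assumption, not something the disclosure prescribes.

```python
# A minimal sketch of assigning a candidate region's anchor point features to
# the nearest anchor point category; the (angle, size) scaling is an assumed
# normalisation so that the two feature types contribute comparably.
import numpy as np

def target_anchor_category(features, anchor_categories, scale=(90.0, 10.0)):
    """Return the index of the anchor category closest to `features`.

    features          -- (tilt_angle_deg, size_feature) of the candidate region
    anchor_categories -- list of (angle_center, size_center) pairs
    scale             -- assumed per-dimension normalisation before distances
    """
    f = np.asarray(features, dtype=float) / scale
    centers = np.asarray(anchor_categories, dtype=float) / scale
    return int(np.argmin(np.linalg.norm(centers - f, axis=1)))
```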
Optionally, in a specific implementation manner, the apparatus further includes: a relationship determination module for determining the correspondence between the anchor point category and the text line recognition result; the relationship determination module includes:
an information acquisition sub-module configured to: acquiring a feature map of the first sample image and each anchor point category obtained based on the anchor point features of each text line in the second sample image;
the region acquisition sub-module is configured to perform category regression and detection frame regression on the text regions based on the feature map and the anchor point categories to obtain a plurality of initial text regions;
the region intercepting submodule is configured to intercept a to-be-regressed text region corresponding to each initial text region in the feature map;
and the relation determination submodule is configured to perform category regression, detection frame regression and angle regression on the intercepted text regions to be regressed to obtain the corresponding relation between the anchor point category and the text line recognition result.
Optionally, in a specific implementation manner, the apparatus further includes: an anchor determination module for determining each anchor category;
The anchor point determination module is configured to determine the tilt angle and the size feature of each text line in the second sample image to obtain an angle data set and a size data set; to cluster the angle data set and the size data set respectively to obtain a first number of angle clustering results and a second number of size clustering results; and to respectively select one clustering center from the first number of angle clustering results and the second number of size clustering results to form an anchor point category, thereby obtaining each anchor point category, wherein the two types of feature information included in different anchor point categories are not completely the same.
Optionally, in a specific implementation manner, the relationship determining module further includes:
a region rotation sub-module configured to extract the angle features of the tilted text regions among the plurality of intercepted text regions to be regressed before category regression, detection frame regression, and angle regression are performed on them, and to perform rotation conversion on the tilted text regions based on the angle features;
the relation determination submodule is specifically configured to perform category regression, detection frame regression, and angle regression on the non-tilted text regions among the plurality of intercepted text regions to be regressed and the rotation-converted tilted text regions.
Optionally, in a specific implementation manner, the relationship determining module further includes:
the region suppression sub-module is configured to perform a non-maximum suppression (NMS) operation on each initial text region before the text region to be regressed corresponding to each initial text region is intercepted in the feature map, so as to obtain the suggested detection region of each initial text region;
the region intercepting submodule is specifically configured to intercept, in the feature map, the text region to be regressed corresponding to each obtained suggested text region.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 11 is a block diagram illustrating an electronic device 1100 for detecting text in accordance with an exemplary embodiment. For example, the electronic device 1100 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 11, electronic device 1100 may include one or more of the following components: processing component 1102, memory 1104, power component 1106, multimedia component 1108, audio component 1110, input/output (I/O) interface(s) 1112, sensor component 1114, and communications component 1116.
The processing component 1102 generally controls the overall operation of the electronic device 1100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 1102 may include one or more processors 1120 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1102 may include one or more modules that facilitate interaction between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
The memory 1104 is configured to store various types of data to support operations at the electronic device 1100. Examples of such data include instructions for any application or method operating on the electronic device 1100, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1104 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 1106 provides power to the various components of the electronic device 1100. The power components 1106 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 1100.
The multimedia component 1108 includes a screen that provides an output interface between the electronic device 1100 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1108 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 1100 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1110 is configured to output and/or input audio signals. For example, the audio component 1110 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1100 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1104 or transmitted via the communication component 1116. In some embodiments, the audio assembly 1110 further includes a speaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1114 includes one or more sensors for providing various aspects of state assessment for the electronic device 1100. For example, the sensor assembly 1114 may detect the open/closed status of the device 1100 and the relative positioning of components, such as the display and keypad of the electronic device 1100; it may also detect a change in the position of the electronic device 1100 or a component thereof, the presence or absence of user contact with the electronic device 1100, the orientation or acceleration/deceleration of the electronic device 1100, and a change in the temperature of the electronic device 1100. The sensor assembly 1114 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1114 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1116 is configured to facilitate wired or wireless communication between the electronic device 1100 and other devices. The electronic device 1100 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 1116 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1116 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 1104 comprising instructions, executable by the processor 1120 of the electronic device 1100 to perform the method described above is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In yet another embodiment provided by the present disclosure, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the text detection methods of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A text detection method, comprising:
acquiring an image to be detected;
determining a candidate region related to a text line from the image to be detected;
determining the anchor point features of the candidate region, wherein the anchor point features comprise two types of feature information: the tilt angle and the size feature of the candidate region;
determining whether the candidate region is a text line by using the anchor point features of the candidate region and a preset correspondence between anchor point data and text line recognition results, wherein a text line recognition result characterizes whether a region is a text line, and the anchor point data is determined based on a plurality of preset sample anchor point features;
and when the candidate region is a text line, determining the content of the candidate region as the detected target text.
2. The method of claim 1, wherein the correspondence of anchor data to text line recognition results comprises: the corresponding relation between the anchor point category and the text line recognition result;
the step of determining whether the candidate region is a text line by using the anchor point features of the candidate region and the preset correspondence between anchor point data and text line recognition results comprises:
determining the target anchor point category to which the anchor point features of the candidate region belong;
and determining whether the candidate region is a text line by using the target anchor point category and the preset correspondence between anchor point categories and text line recognition results.
3. The method of claim 2, wherein the determining of the correspondence between the anchor point category and the text line recognition result comprises:
acquiring a feature map of the first sample image and each anchor point category obtained based on the anchor point features of each text line in the second sample image;
performing category regression and detection frame regression on the text regions based on the feature map and the anchor point categories to obtain a plurality of initial text regions;
intercepting a text area to be regressed corresponding to each initial text area in the feature map;
and performing category regression, detection frame regression and angle regression on the intercepted text areas to be regressed to obtain the corresponding relation between the anchor point category and the text line recognition result.
4. The method of claim 3, wherein the determining of each anchor point category comprises:
determining the inclination angle and the size characteristic of each text line in the second sample image to obtain an angle data set and a size data set;
clustering the angle data group and the size data group respectively to obtain a first number of angle clustering results and a second number of size clustering results;
respectively selecting one clustering center from the first number of angle clustering results and the second number of size clustering results to form an anchor point category, thereby obtaining each anchor point category; wherein the two types of feature information included in different anchor point categories are not completely the same.
5. The method according to claim 3 or 4, wherein before the step of performing category regression, detection box regression and angle regression on the plurality of intercepted text regions to be regressed, the method further comprises:
extracting the angle features of the tilted text regions among the plurality of intercepted text regions to be regressed, and performing rotation conversion on the tilted text regions based on the angle features;
the step of performing category regression, detection frame regression and angle regression on the plurality of intercepted text regions to be regressed comprises:
performing category regression, detection frame regression and angle regression on the non-tilted text regions among the plurality of intercepted text regions to be regressed and the rotation-converted tilted text regions.
6. The method according to claim 3 or 4, wherein before the step of intercepting the text region to be regressed corresponding to each initial text region in the feature map, the method further comprises:
performing a non-maximum suppression (NMS) operation on each initial text region to obtain the suggested detection region of each initial text region;
the step of intercepting the text regions to be regressed corresponding to the initial text regions in the feature map comprises the following steps:
and intercepting, in the feature map, the text region to be regressed corresponding to each obtained suggested text region.
7. A text detection apparatus, comprising:
the image acquisition module is configured to acquire an image to be detected;
a region determination module configured to determine a candidate region about a text line from the image to be detected;
a feature determination module configured to determine the anchor point features of the candidate region, wherein the anchor point features comprise two types of feature information: the tilt angle and the size feature of the candidate region;
a text line determination module configured to determine whether the candidate region is a text line by using the anchor point features of the candidate region and a preset correspondence between anchor point data and text line recognition results, wherein a text line recognition result characterizes whether a region is a text line, and the anchor point data is determined based on a plurality of preset sample anchor point features;
and a text determination module configured to determine the content of the candidate region as the detected target text when the candidate region is a text line.
8. The apparatus of claim 7, wherein the correspondence of anchor data to text line recognition results comprises: the corresponding relation between the anchor point category and the text line recognition result;
the text line determination module is configured to determine the target anchor point category to which the anchor point features of the candidate region belong, and to determine whether the candidate region is a text line by using the target anchor point category and the preset correspondence between anchor point categories and text line recognition results.
9. An electronic device, comprising:
a processor;
a memory configured to store the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the text detection method of any of claims 1 to 6.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the text detection method of any one of claims 1 to 6.
CN202010537495.7A 2020-06-12 2020-06-12 Text detection method and device and electronic equipment Active CN111666941B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537495.7A CN111666941B (en) 2020-06-12 2020-06-12 Text detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111666941A true CN111666941A (en) 2020-09-15
CN111666941B CN111666941B (en) 2024-03-29

Family

ID=72387583

Country Status (1)

Country Link
CN (1) CN111666941B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160283814A1 (en) * 2015-03-25 2016-09-29 Alibaba Group Holding Limited Method and apparatus for generating text line classifier
CN106980858A (en) * 2017-02-28 2017-07-25 中国科学院信息工程研究所 The language text detection of a kind of language text detection with alignment system and the application system and localization method
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANQI MA et al.: "Arbitrary-Oriented Scene Text Detection via Rotation Proposals", IEEE Transactions on Multimedia, pages 1-6 *
ZHUOYAO ZHONG et al.: "Improved localization accuracy by LocNet for Faster R-CNN based text detection in natural scene images", Pattern Recognition, pages 1-9 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381089A (en) * 2020-11-20 2021-02-19 山西同方知网数字出版技术有限公司 Self-feedback text separation method and system in complex environment
CN112381089B (en) * 2020-11-20 2024-06-07 山西同方知网数字出版技术有限公司 Self-feedback text separation method and system in complex environment
CN112686128A (en) * 2020-12-28 2021-04-20 南京览众智能科技有限公司 Classroom desk detection method based on machine learning
CN113449728A (en) * 2021-07-21 2021-09-28 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof
CN113673516A (en) * 2021-08-20 2021-11-19 平安科技(深圳)有限公司 Detection area merging method, character recognition method, system, electronic device and storage medium
CN113673516B (en) * 2021-08-20 2024-06-07 平安科技(深圳)有限公司 Detection area merging method, character recognition method, system, electronic device and storage medium
CN114387332A (en) * 2022-01-17 2022-04-22 江苏省特种设备安全监督检验研究院 Pipeline thickness measuring method and device

Similar Documents

Publication Publication Date Title
CN106651955B (en) Method and device for positioning target object in picture
CN106557768B (en) Method and device for recognizing characters in picture
CN109446994B (en) Gesture key point detection method and device, electronic equipment and storage medium
CN106228556B (en) image quality analysis method and device
WO2017071063A1 (en) Area identification method and device
CN107025419B (en) Fingerprint template inputting method and device
CN107480665B (en) Character detection method and device and computer readable storage medium
CN111666941B (en) Text detection method and device and electronic equipment
CN106127751B (en) Image detection method, device and system
CN110619350B (en) Image detection method, device and storage medium
CN108062547B (en) Character detection method and device
CN110569835B (en) Image recognition method and device and electronic equipment
CN107038428B (en) Living body identification method and apparatus
CN106354504B (en) Message display method and device
CN107742120A (en) The recognition methods of bank card number and device
CN108717542B (en) Method and device for recognizing character area and computer readable storage medium
CN112927122A (en) Watermark removing method, device and storage medium
CN112200040A (en) Occlusion image detection method, device and medium
US20220222831A1 (en) Method for processing images and electronic device therefor
CN111369575B (en) Screen capturing method and device and storage medium
CN110286813B (en) Icon position determining method and device
CN107292901B (en) Edge detection method and device
CN113920293A (en) Information identification method and device, electronic equipment and storage medium
CN112784858B (en) Image data processing method and device and electronic equipment
CN115641269A (en) Image repairing method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant