CN109460763B - Text region extraction method based on multilevel text component positioning and growth - Google Patents


Info

Publication number
CN109460763B
Authority
CN
China
Prior art keywords
text
pixel
region
value
pixels
Prior art date
Legal status
Active
Application number
CN201811267160.7A
Other languages
Chinese (zh)
Other versions
CN109460763A (en)
Inventor
苏丰
丁文俊
汪洋
王雨阳
王岚
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority claimed from application CN201811267160.7A
Publication of CN109460763A
Application granted
Publication of CN109460763B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63: Scene text, e.g. street names
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/158: Segmentation of character regions using character size, text spacings or pitch estimation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for extracting text regions from natural scene images based on multi-level text component positioning and growth. First, a grayscale or RGB color image is input. The MSER algorithm is run on the input image, and the SWT algorithm is then run inside each MSER, using the MSER boundary as the region edge, to obtain a stroke width value for each pixel of the extremum region. A stroke width histogram is computed within each extremum region, the pixel sets corresponding to the three stroke widths with the largest pixel counts are selected, and the pixels of those sets that pass an edge gradient difference angle feature check are taken as seed pixels. Starting from the seed pixels, a two-level growth process, within characters and between characters, is performed iteratively; the regions obtained after growth are then filtered by several text region features, and the resulting text regions are output. The method balances the precision and the recall of the extraction result, does not depend on a specific machine learning model, and is simple and easy to reproduce.

Description

Text region extraction method based on multi-level text component positioning and growth
Technical Field
The invention belongs to the field of image target detection, and relates to a text region extraction method based on multi-level text component positioning and growth in a natural scene image.
Background
Characters in natural scene images carry rich semantic information that is important for understanding the image and the scene, and they have significant value in applications such as image understanding, retrieval, classification, and annotation. However, because such characters often differ greatly in size, orientation, color, language, style, and other attributes, and are easily affected by illumination, occlusion, background, and other factors, accurately detecting text in natural scene images is a challenging task.
In general, text detection in natural scene images can be divided into two subtasks: first, extracting possible character candidate regions; second, merging candidate regions that belong to the same text line. The first step is crucial to the effective extraction of text: if the possible character candidate regions cannot be extracted accurately and completely, the subsequent merging into text lines can hardly produce a good result.
For the candidate-region extraction step, the two conventional connected-region-analysis algorithms in common use are the Maximum Stable Extremum Region (MSER) algorithm and the Stroke Width Transformation (SWT) algorithm. The MSER algorithm is based on the watershed method and focuses on the relative stability inside an extremum region; it is not designed specifically for the characteristics of text, and its extraction result depends on the specific thresholds chosen for parameters such as the rate of change of pixel gray values inside the region, so it is difficult to obtain both high precision and high recall. The SWT algorithm captures and fully exploits the parallelism of character stroke edges, but its reliability depends heavily on the quality of image edge pixels; when matching stroke edge pixels, it relies on a threshold on the difference between the gradient directions of the two pixels, and different threshold settings change the matching result and thus the final candidate-region extraction result. In practice, both classical methods usually adopt a fixed, single parameter threshold, and the extraction result is very sensitive to it. Since text in natural scene images varies greatly in appearance and quality, a high threshold improves the precision of the result but causes many text regions to be missed, while a low threshold increases recall but lowers the precision of the extracted regions.
A single, non-adaptive processing strategy therefore struggles with the complex and variable text found in natural scene images.
Chinese invention patent CN10756380.A provides a license plate detection and recognition method combining the MSER and SWT algorithms. In its license plate detection part, the input image is first grayed and contrast-enhanced; Canny edge detection and MSER region detection are then performed on the processed image, and the intersection of the dilated Canny edges with the original MSER regions yields candidate license plate regions. An SWT algorithm based on morphological processing is then run on the candidate regions to obtain the stroke widths of the characters, and the candidates are finally screened and aggregated by stroke width to obtain the license plate position. The method detects license plate characters well when the contrast between characters and background is strong and the character edge quality is high, but its heavy reliance on graying, edge detection, morphological operations, and similar processing limits the types of image text to which it applies, and it has difficulty handling the varied text objects and complex backgrounds of natural scene images.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides an adaptive method for extracting text regions from natural scene images based on multi-level text component positioning and growth. It adopts an easy-first, divide-and-conquer strategy that treats text objects with different detection conditions differently: at the granularity of multiple objects, it first extracts the more regular seed text components of the image under relatively strict detection conditions, and then, starting from the obtained seed components and their text features, relaxes the detection conditions to effectively extract the lower-quality text components, so as to obtain an extraction result with better precision and recall.
The invention specifically adopts the following technical scheme:
a method for extracting text regions in natural scene images based on multi-level text component positioning and growth is characterized by comprising the following steps:
the method comprises the following steps: inputting a gray or color natural scene image containing a text object;
step two: extracting text seed pixels in the input image by combining MSER, SWT and a region pixel gray value smoothing algorithm; the method specifically comprises the following steps:
step (21): running an MSER algorithm on the input image, and smoothing the obtained gray value of the pixels in each extremum region;
step (22): running an SWT algorithm in the extremum region to obtain a stroke width value of each pixel in the extremum region;
step (23): calculating a histogram of the stroke width of each pixel in each extreme value area, and selecting three histogram peak values with the largest number of pixels as candidate main stroke widths of the extreme value area;
step (24): calculating the edge gradient difference angle feature of the pixel set corresponding to each candidate main stroke width of each extremum region, and taking the pixel sets whose feature value is smaller than a given edge gradient difference angle feature threshold as the text seed pixels of the extremum region;
step three: based on the extracted text seed pixels, iteratively performing the in-character growth process, specifically: calculating the differences between the gray value, color value, and stroke width value of each pixel adjacent to the text seed pixels and the corresponding values of the text seed pixels; taking the adjacent pixels whose differences are smaller than specific thresholds as new text seed pixels obtained by growth; and iterating the growth process until the region edge is reached or no further growth is possible, obtaining a text pixel connected region;
step four: based on the text pixel connected regions obtained in step three, iteratively performing the inter-character growth process, specifically: selecting two text pixel connected regions whose center distance is smaller than a specific threshold; on the several lines connecting the corresponding quartering points of the sides of their minimum bounding rectangles, searching for a sufficient number of connected pixels whose gray value, color value, and stroke width value each differ from the corresponding parameter means of the two regions by less than specific thresholds; and taking those connected pixels as the text seed pixel set of a new text pixel connected region obtained by growth;
step five: repeating the iterative growth process of the third step and the fourth step based on the new text seed pixel obtained in the fourth step until a new text pixel connected region cannot be obtained;
step six: and filtering the finally obtained text pixel connected region based on the threshold values of the various features to remove non-text regions possibly contained in the text pixel connected region, and taking the filtered text region as a final extraction result.
The invention has the beneficial effects that:
the method for extracting the text region in the natural scene image based on multi-level text component positioning and growth has the following advantages:
1) the self-adaptive extraction strategy can give consideration to the precision and recall rate of the extraction result;
2) the iterative hierarchical growth strategy uses different parameter thresholds in the growth process of each level, thereby ensuring the pertinence, controllability and rationality of the growth steps.
3) The extraction method is independent of a specific machine learning model, and is simple and easy to reproduce;
drawings
FIG. 1: a method flow diagram of the present invention;
FIG. 2: the detection schematic diagram of the invention;
FIG. 3: calculating a schematic diagram of edge gradient difference angle features;
FIG. 4: schematic diagram of the growth of text pixel connected regions.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without any inventive step, are within the scope of the present invention.
As shown in fig. 1 and fig. 2, the method for extracting a text region in a natural scene image based on multi-level text component positioning and growth of the present invention includes the following steps:
step one, inputting a gray level or RGB color natural scene image I containing a text object;
step two, extracting text seed pixels in the input image I, comprising the following steps:
In step (21), the Maximum Stable Extremum Region (MSER) algorithm is run on the input image I, and the gray values of the pixels in each obtained extremum region are smoothed (see Description (1)). This improves the text edge quality, accelerates the computation of the subsequent SWT algorithm, and makes the stroke width values obtained by the SWT algorithm in the next step more stable. The smoothed gray value shared by the largest number of pixels is taken as the main gray value of the extremum region;
step (22) running an SWT algorithm in the area by taking the boundary pixel of the extremum area as an edge to obtain the stroke width value of each pixel in the extremum area;
step (23) calculating a histogram of the stroke width of each pixel in each extreme value area, and selecting three histogram peak values with the largest pixel number as candidate main stroke widths of the area;
Step (24): calculate the edge gradient difference angle feature (see Description (2)) of the pixel set corresponding to each candidate main stroke width in each extremum region; take the pixel sets whose feature value is smaller than a specific edge gradient difference angle feature threshold as the text seed pixels of the extremum region, and take the mean of the qualifying candidate main stroke widths as the main stroke width of the extremum region and of its text seed pixel set.
Step three: perform in-character growth starting from the text seed pixels. Compute the differences between the gray value, color value, and stroke width value of each pixel adjacent to the seed pixels and the corresponding values of the text seed pixel set; take the adjacent pixels whose differences are smaller than specific thresholds as new text seed pixels obtained by growth (see Description (3)); and iterate the growth process until the edge of the extremum region is reached or no further growth is possible. The grown and expanded text seed pixel set is finally taken as the text pixel connected region of the corresponding in-character component. Beyond the pixel stroke width value, this step further exploits the textual attributes reflected by the seed pixels obtained in step two, improving the robustness of the growth method against various kinds of interference, noise, and degradation.
Step four: starting from the text pixel connected regions obtained in step three, iteratively perform the inter-character growth process. Specifically, select two text pixel connected regions whose center distance is smaller than a specific threshold; on the several lines connecting the corresponding quartering points of the vertical sides of their minimum bounding rectangles, find a sufficient number of connected pixels whose gray value, color value, and stroke width value differ by less than specific thresholds from the means of the main gray values, color means, and main stroke widths of the two regions, and take these pixels as the text seed pixel set of a new text pixel connected region grown between the characters (see Description (4)). Compared with growing only along the line joining the center points of the two regions, searching the denser set of lines between the several quartering points covers all possible inter-character text pixel areas and copes better with characters of different sizes and with the in-character growth results, further improving the recall of text region extraction;
step five, repeatedly and iteratively executing the step three and the step four until a new text pixel connected region cannot be obtained;
Step six: filter the text pixel connected regions based on thresholds on several text region features, and output the filtered regions as the text region extraction result (see Description (5) for details).
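Steps one to six can be sketched as a top-level loop. This is a structural sketch only: `mser_regions`, `swt`, and the other callables are injected placeholders standing for the algorithms named above, not part of the patent.

```python
def extract_text_regions(image, mser_regions, swt, extract_seeds,
                         grow_within, grow_between, filter_regions):
    """Top-level flow of the method: seed extraction (step two),
    iterated within-/between-character growth (steps three to five),
    and feature-based filtering (step six). All heavy lifting is
    delegated to the injected callables."""
    seeds = []
    for region in mser_regions(image):          # step (21): MSER + gray smoothing
        widths = swt(image, region)             # step (22): per-pixel stroke width
        seeds.extend(extract_seeds(region, widths))   # steps (23)-(24)
    regions = [grow_within(image, s) for s in seeds]  # step three
    while True:                                 # steps four-five: iterate growth
        new_seeds = grow_between(image, regions)
        if not new_seeds:
            break
        regions.extend(grow_within(image, s) for s in new_seeds)
    return filter_regions(regions)              # step six
```

The callables can be swapped out independently, which mirrors the patent's separation of seed positioning, growth, and filtering.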
Description (1): the smoothing of the pixel gray values in an extremum region in step (21) of the invention proceeds as follows:
Let min be the lowest and max the highest gray value among the pixels of the extremum region. The gray value range of the region is divided into five equal parts, the i-th part covering
[min + (i - 1)(max - min)/5, min + i(max - min)/5], i = 1, ..., 5.
For a pixel P whose gray value falls within the i-th range, the gray value is set to
min + (2i - 1)(max - min)/10,
i.e. the intermediate gray value of the i-th range; the gray value with the largest number of pixels in the updated extremum region is taken as the main gray value of the region.
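The five-bin gray smoothing can be sketched in a few lines of NumPy, assuming five equal sub-ranges of [min, max]; the function name is illustrative, not from the patent.

```python
import numpy as np

def smooth_region_gray(gray_values):
    """Quantize the gray values of an extremum region into five equal
    sub-ranges of [min, max], map each pixel to the midpoint of its
    sub-range, and return the smoothed values plus the main gray value
    (the midpoint shared by the most pixels)."""
    g = np.asarray(gray_values, dtype=np.float64)
    lo, hi = g.min(), g.max()
    if hi == lo:                      # flat region: nothing to smooth
        return g.copy(), lo
    width = (hi - lo) / 5.0
    # bin index i in 0..4; the maximum value falls into the last bin
    idx = np.minimum(((g - lo) / width).astype(int), 4)
    smoothed = lo + (2 * idx + 1) * width / 2.0   # midpoint of each bin
    vals, counts = np.unique(smoothed, return_counts=True)
    main_gray = vals[np.argmax(counts)]
    return smoothed, main_gray
```

For example, with gray values [0, 10, 100] the range [0, 100] splits into bins of width 20; both 0 and 10 map to the first midpoint 10, and 100 maps to the last midpoint 90, so the main gray value is 10.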
Description (2): the edge gradient difference angle feature of the pixel set corresponding to each main stroke width in step (24) of the invention is calculated as follows:
As shown in FIG. 3, for each edge pixel p_i detected by the Canny edge detector within the Stroke Width Transformation (SWT) algorithm, a ray is emitted along the gradient direction of the gray value. If the ray passes through another edge pixel p_j, then the ray, and every pixel p_k between p_i and p_j, is assigned an angle value d_angle(k), equal to the difference in radians between the directions of the rays emitted from p_i and from p_j along their respective gradient directions. Since more than one ray may pass through the location of a pixel p_i in an extremum region, suppose m rays pass through it, the s-th ray has angle value d_s, and that ray contains N_s pixels inside the connected region; the final d_angle(i) value of pixel p_i is then the pixel-count-weighted mean
d_angle(i) = ( Σ_{s=1}^{m} N_s · d_s ) / ( Σ_{s=1}^{m} N_s ).
Further, for a pixel set containing N pixels, its D_angle feature value is the average of the d_angle values of all its pixels:
D_angle = (1/N) Σ_{i=1}^{N} d_angle(i).
In the present invention, the threshold of the D_angle feature value is set to 0.93; if the D_angle value of a pixel set is greater than this threshold, the set is considered a non-text region and discarded.
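The feature computation can be sketched as follows. The original formula image is not recoverable, so the per-pixel value is assumed here to be the pixel-count-weighted mean over the rays through that pixel (an assumption consistent with the d_s and N_s quantities named in the text); the 0.93 threshold is the one stated.

```python
def d_angle_pixel(rays):
    """Weighted mean angle difference for one pixel.
    `rays` is a list of (d_s, N_s) pairs: the radian difference of the
    s-th ray and the number of its pixels inside the connected region.
    Assumption: weighting by N_s; the patent image is not recoverable."""
    total = sum(n for _, n in rays)
    return sum(d * n for d, n in rays) / total

def D_angle(per_pixel_rays):
    """Mean d_angle over all N pixels of a pixel set."""
    vals = [d_angle_pixel(r) for r in per_pixel_rays]
    return sum(vals) / len(vals)

def is_text_candidate(per_pixel_rays, threshold=0.93):
    # sets whose D_angle exceeds the threshold are discarded as non-text
    return D_angle(per_pixel_rays) <= threshold
```

A low D_angle means the paired edge gradients are close to anti-parallel, which is the stroke-edge property the feature is meant to capture.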
Description (3): the in-character growth process from the text seed pixels in step three of the invention proceeds as follows:
Starting from any pixel in the text seed pixel set of an extremum region, search the image for pixels p_i that simultaneously satisfy the following four conditions:
1) pixel p_i is connected to some pixel in the text seed pixel set;
2) the stroke width value of pixel p_i differs from the main stroke width of the text seed pixel set by no more than 40% of the latter;
3) the gray value of pixel p_i differs from the mean gray value of the text seed pixel set by no more than 40% of the latter;
4) for an RGB color natural scene input image, the values of the three RGB color channels of pixel p_i differ from the means of the text seed pixel set over the three RGB channels by no more than 60%.
Each found pixel satisfying these conditions is added to the text seed pixel set as a new text seed pixel.
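The four conditions can be collected into a single per-pixel predicate. The 40%/40%/60% thresholds are the ones stated above; the dict-based pixel and statistics representation and the function name are illustrative.

```python
def can_grow(pixel, seed_stats, color=True):
    """pixel: dict with 'sw' (stroke width), 'gray', and optionally 'rgb'.
    seed_stats: dict with 'main_sw', 'mean_gray', and optionally
    'mean_rgb' for the current text seed pixel set. Connectivity
    (condition 1) is assumed to be checked by the caller during the
    flood-fill style growth."""
    if abs(pixel['sw'] - seed_stats['main_sw']) > 0.40 * seed_stats['main_sw']:
        return False                                    # condition 2
    if abs(pixel['gray'] - seed_stats['mean_gray']) > 0.40 * seed_stats['mean_gray']:
        return False                                    # condition 3
    if color:                                           # condition 4 (RGB input)
        for c, mc in zip(pixel['rgb'], seed_stats['mean_rgb']):
            if abs(c - mc) > 0.60 * mc:
                return False
    return True
```

In the iteration of step three, every neighbour of the current seed set that passes this predicate joins the set, and its own neighbours are examined in turn, until the region edge is reached or no pixel qualifies.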
Description (4): the inter-character growth process between text pixel connected regions in step four of the invention proceeds as follows:
1) As shown in FIG. 4, each text pixel connected region is represented by its minimum bounding rectangle; the maximum d of the length and width of the bounding rectangle is computed, the center of the region is represented by the center of the rectangle, and the quartering points of the vertical sides of the rectangle are marked;
2) other text pixel connected regions are searched for within the circular image range centered at the center of the region with radius 3d;
3) if one or more text pixel connected regions are found, then on the several lines connecting the corresponding quartering points of this region and another region, taken in order from near to far, pixels p_i satisfying the following conditions are searched for:
a) the stroke width value of pixel p_i differs by no more than 40% from the mean of the main stroke widths of the two regions;
b) the gray value of pixel p_i differs by no more than 40% from the mean of the gray values of the two regions;
c) for an RGB color natural scene input image, the values of the three RGB color channels of pixel p_i differ by no more than 60% from the RGB three-channel means of the two regions;
4) isolated pixels are removed from the set of pixels satisfying these conditions, and the remaining pixels are taken as the grown text seed pixel set of the new text pixel connected region.
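The quarter-point geometry can be sketched as below. This is a simplified sketch: the 3d search test is applied to region centers, lines are rasterized by plain sampling rather than a full Bresenham walk, and all helper names are illustrative.

```python
import math

def quarter_points(rect):
    """rect = (x, y, w, h): minimum bounding rectangle.
    Returns the three quartering points (at 1/4, 1/2, 3/4 of the height)
    on each vertical side, as (left_points, right_points)."""
    x, y, w, h = rect
    ys = [y + h * k / 4.0 for k in (1, 2, 3)]
    return [(x, t) for t in ys], [(x + w, t) for t in ys]

def within_search_radius(rect_a, rect_b):
    """Simplification: region B qualifies if its center lies in the
    circle of radius 3*d around A's center, d = max(width, height) of A."""
    xa, ya, wa, ha = rect_a
    xb, yb, wb, hb = rect_b
    ca = (xa + wa / 2.0, ya + ha / 2.0)
    cb = (xb + wb / 2.0, yb + hb / 2.0)
    return math.hypot(ca[0] - cb[0], ca[1] - cb[1]) <= 3 * max(wa, ha)

def line_pixels(p0, p1):
    """Integer pixel positions sampled along the segment p0 -> p1;
    these are the candidates tested against conditions a)-c)."""
    n = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    return [(round(p0[0] + (p1[0] - p0[0]) * t / max(n - 1, 1)),
             round(p0[1] + (p1[1] - p0[1]) * t / max(n - 1, 1)))
            for t in range(n)]
```

Checking the three quarter-point lines instead of only the center line is what lets the growth bridge characters whose bounding boxes differ in height.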
Description (5): in step six of the invention, the text pixel connected regions are filtered with thresholds on the following text region features to improve the precision of the extraction result; the features used for filtering and their thresholds are:
1) aspect ratio: the length-to-width ratio of the text pixel connected region; valid range [0.2, 2.0];
2) symmetry: the minimum bounding rectangle of the region is bisected into two sub-regions, once horizontally and once vertically; the number of text pixels in any horizontal or vertical sub-region must not exceed 80% of the total number of text pixels in the whole region;
3) stability: if three of the four neighbours of a text pixel do not belong to the text pixels, the pixel is judged an unstable noise pixel; the number of noise pixels in any sub-region must not exceed 50% of the total number of text pixels in the whole region;
4) pixel ratio of the main stroke width: the proportion of pixels whose stroke width equals the main stroke width among all pixels of the region; valid range [0.5, 0.9].
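The four filters combine into one predicate. The numeric ranges are the ones stated above; the region representation (a dict of precomputed counts) is illustrative.

```python
def passes_filters(region):
    """region: dict with 'w', 'h' (bounding box), 'n_pixels' (text
    pixels), 'max_half_pixels' (largest text-pixel count among the
    horizontal/vertical half sub-regions), 'n_noise' (pixels with at
    least three of four 4-neighbours outside the text pixels) and
    'n_main_sw' (pixels whose stroke width equals the main stroke width)."""
    aspect = region['w'] / region['h']
    if not 0.2 <= aspect <= 2.0:
        return False                                   # 1) aspect ratio
    if region['max_half_pixels'] > 0.80 * region['n_pixels']:
        return False                                   # 2) symmetry
    if region['n_noise'] > 0.50 * region['n_pixels']:
        return False                                   # 3) stability
    ratio = region['n_main_sw'] / region['n_pixels']
    return 0.5 <= ratio <= 0.9                         # 4) main-stroke ratio
```

Only regions passing all four checks are output as the final text region extraction result.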
The method for extracting text regions in natural scene images based on multi-level text component positioning and growth has the following innovations:
1) Adaptive text region extraction strategy: text in natural scene images is complex and variable in form. The overall strategy of the invention is to first locate relatively regular, high-quality seed text regions (i.e. text seed pixel sets) under relatively strict and sensitive detection conditions, and then, starting from the seed text regions, grow the lower-quality text regions under looser, more robust constraints. Because adaptive matching conditions are used at the different stages of the extraction process, both the precision and the recall of the result are served;
2) Determination of the seed text region: the different features used for locating text regions in the Maximum Stable Extremum Region (MSER) algorithm and the Stroke Width Transformation (SWT) algorithm are combined, and high-quality seed text regions are extracted from the image under relatively strict conditions as the basis of the later growth process; the introduction of region pixel gray value smoothing effectively improves the stability and efficiency of the extraction algorithm;
3) Multi-level iterative growth strategy: the in-character growth process focuses on growing text regions belonging to the same connected component as the text seed pixels, while the inter-character growth process focuses on growing text regions along the several possible paths between text pixel connected components. The two growth processes run iteratively, with text feature similarity as the growth condition, until no new text region grows, which improves the reliability of the method against various noise, interference, and degradation conditions.

Claims (6)

1. A method for extracting text regions in natural scene images based on multi-level text component positioning and growth is characterized by comprising the following steps:
the method comprises the following steps: inputting a gray or color natural scene image containing a text object;
step two: extracting text seed pixels in the input image by combining MSER, SWT and a region pixel gray value smoothing algorithm; the method specifically comprises the following steps:
step (21): running an MSER algorithm on the input image, and smoothing the obtained gray value of the pixels in each extremum region;
step (22): running an SWT algorithm in the extremum region to obtain a stroke width value of each pixel in the extremum region;
step (23): calculating a histogram of the stroke width of each pixel in each extreme value area, and selecting three histogram peak values with the largest number of pixels as candidate main stroke widths of the extreme value area;
step (24): calculating the edge gradient difference angle feature of the pixel set corresponding to each candidate main stroke width of each extremum region, and taking the pixel sets whose feature value is smaller than a given edge gradient difference angle feature threshold as the text seed pixels of the extremum region;
step three: based on the extracted text seed pixels, iteratively performing the in-character growth process, specifically: calculating the differences between the gray value, color value, and stroke width value of each pixel adjacent to the text seed pixels and the corresponding values of the text seed pixels; taking the adjacent pixels whose differences are smaller than specific thresholds as new text seed pixels obtained by growth; and iterating the growth process until the region edge is reached or no further growth is possible, obtaining a text pixel connected region;
step four: based on the text pixel connected regions obtained in step three, iteratively performing the inter-character growth process, specifically: selecting two text pixel connected regions whose center distance is smaller than a specific threshold; on the several lines connecting the corresponding quartering points of the sides of their minimum bounding rectangles, searching for a sufficient number of connected pixels whose gray value, color value, and stroke width value each differ from the corresponding parameter means of the two regions by less than specific thresholds; and taking those connected pixels as the text seed pixel set of a new text pixel connected region obtained by growth;
step five: repeating the iterative growth process of the third step and the fourth step based on the new text seed pixel obtained in the fourth step until a new text pixel connected region cannot be obtained;
step six: and filtering the finally obtained text pixel connected region based on the threshold values of the various features to remove non-text regions possibly contained in the text pixel connected region, and taking the filtered text region as a final extraction result.
2. The method of claim 1, wherein the step (21) of smoothing the gray values of the pixels in the extremum region comprises the steps of: letting min be the lowest and max the highest gray value among the pixels of the extremum region; dividing the gray value range of the extremum region into five equal parts, the i-th part covering
[min + (i - 1)(max - min)/5, min + i(max - min)/5], i = 1, ..., 5;
for a pixel P whose gray value falls within the i-th range, setting its gray value to
min + (2i - 1)(max - min)/10,
i.e. the intermediate gray value of the i-th range; and taking the gray value with the largest number of pixels in the processed extremum region as the main gray value of the extremum region.
3. The method for extracting text regions in natural scene images based on multi-level text component positioning and growing as claimed in claim 1, wherein the intra-character growth process of the text seed pixels in step three comprises the following steps:
starting from any pixel in the text seed pixel set of an extremum region, search the image for pixels p_i satisfying the following four conditions:
1) pixel p_i is connected to some pixel in the text seed pixel set;
2) the stroke width value of pixel p_i differs from the main stroke width of the text seed pixel set by no more than 40% of the latter;
3) the gray value of pixel p_i differs from the mean gray value of the text seed pixel set by no more than 40% of the latter;
4) for an RGB color natural scene input image, the values of pixel p_i on the three RGB color channels differ from the mean values of the text seed pixel set on the three RGB channels by no more than 60% of the latter;
add the found pixels satisfying these conditions to the text seed pixel set as new text seed pixels.
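Conditions 2) to 4) above can be checked per candidate pixel as below. The dict-based interface and all names are illustrative assumptions; connectivity (condition 1) is assumed to be enforced by the caller during the region scan:

```python
def admits_pixel(px, seed_stats, width_tol=0.40, gray_tol=0.40, color_tol=0.60):
    """Check the stroke-width, gray-value, and RGB conditions of claim 3
    for one candidate pixel, each relative to the seed-set statistic."""
    if abs(px["stroke_width"] - seed_stats["main_stroke_width"]) \
            > width_tol * seed_stats["main_stroke_width"]:
        return False                      # condition 2: stroke width
    if abs(px["gray"] - seed_stats["mean_gray"]) \
            > gray_tol * seed_stats["mean_gray"]:
        return False                      # condition 3: gray value
    for c, m in zip(px["rgb"], seed_stats["mean_rgb"]):
        if abs(c - m) > color_tol * m:
            return False                  # condition 4: each RGB channel
    return True
```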
4. The method for extracting text regions in natural scene images based on multi-level text component positioning and growing as claimed in claim 1, wherein the inter-character growth process between text pixel connected regions in step four comprises the following specific steps:
1) for each text pixel connected region, represent the region by its minimum bounding rectangle, compute the maximum value d of the length and width of the bounding rectangle, represent the center of the text pixel connected region by the center of the minimum bounding rectangle, and mark the quartering points of the vertical sides of the minimum bounding rectangle;
2) search for other text pixel connected regions within the circular image range centered at the center of the text pixel connected region with radius 3d;
3) if one or more text pixel connected regions are found, search, from near to far, on the lines connecting the corresponding quartering points of this text pixel connected region and each found region, for pixels p_i satisfying the following conditions:
a) the stroke width value of pixel p_i differs from the mean of the main stroke widths of the two regions by no more than 40% of the latter;
b) the gray value of pixel p_i differs from the mean gray value of the two regions by no more than 40% of the latter;
c) for an RGB color natural scene input image, the values of pixel p_i on the three RGB color channels differ from the mean values of the two regions on the three RGB channels by no more than 60% of the latter;
4) remove isolated pixels from the set of pixels satisfying the above conditions, and regard the remaining pixels as the text seed pixel set of the new text pixel connected region obtained by growth.
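The geometric part of the inter-character step can be illustrated as follows. This sketch assumes axis-aligned (x, y, w, h) rectangles with rect_a to the left of rect_b and only enumerates the integer pixel coordinates on the three quartering-point connecting lines, which would then be tested with the gray, color, and stroke-width conditions; all names are illustrative assumptions:

```python
def quarter_point_lines(rect_a, rect_b):
    """Pair the quartering points of the facing vertical sides of two
    bounding rectangles and return the pixel coordinates along each of
    the three connecting lines."""
    ax, ay, aw, ah = rect_a
    bx, by, bw, bh = rect_b
    # quartering points on rect_a's right side and rect_b's left side
    pts_a = [(ax + aw, ay + ah * k // 4) for k in (1, 2, 3)]
    pts_b = [(bx, by + bh * k // 4) for k in (1, 2, 3)]
    lines = []
    for (x0, y0), (x1, y1) in zip(pts_a, pts_b):
        n = max(abs(x1 - x0), abs(y1 - y0), 1)  # steps along the line
        line = [(x0 + (x1 - x0) * t // n, y0 + (y1 - y0) * t // n)
                for t in range(n + 1)]
        lines.append(line)
    return lines
```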
5. The method as claimed in claim 1, wherein in step six the text regions are filtered based on multiple features of the text region, including the aspect ratio, symmetry, stability, and the pixel ratio of the main stroke width.
6. The method for extracting text regions in natural scene images based on multi-level text component positioning and growing as claimed in claim 5, wherein the features used to filter the text regions are defined as follows:
1) aspect ratio: the ratio of the length to the width of the text pixel connected region; the valid value range is [0.2, 2.0];
2) symmetry: bisect the minimum bounding rectangle of the text pixel connected region into two sub-regions along the horizontal direction and along the vertical direction respectively; the number of text pixels in any sub-region in either direction must not exceed 80% of the total number of text pixels in the whole region;
3) stability: if three of the four neighboring pixels of a text pixel do not belong to text pixels, the pixel is judged to be an unstable noise pixel; the number of noise pixels in any sub-region must not exceed 50% of the total number of text pixels in the whole region;
4) pixel ratio of the main stroke width: the proportion of the pixels corresponding to the main stroke width in the text pixel connected region to the total number of pixels in the region; the valid value range is [0.5, 0.9].
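The four filters of claim 6 combine into a single accept/reject test per region. In this sketch the sub-region text-pixel and noise-pixel counts are assumed to be computed by the caller, and all argument names are illustrative assumptions:

```python
def keep_region(length, width, n_text, text_per_sub, noise_per_sub,
                n_main_stroke, n_total):
    """Apply the four claim-6 features to one text pixel connected
    region and return whether it survives filtering."""
    aspect = length / width
    if not 0.2 <= aspect <= 2.0:
        return False                               # 1) aspect ratio
    if any(c > 0.8 * n_text for c in text_per_sub):
        return False                               # 2) symmetry
    if any(c > 0.5 * n_text for c in noise_per_sub):
        return False                               # 3) stability
    return 0.5 <= n_main_stroke / n_total <= 0.9   # 4) main-stroke ratio
```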
CN201811267160.7A 2018-10-29 2018-10-29 Text region extraction method based on multilevel text component positioning and growth Active CN109460763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811267160.7A CN109460763B (en) 2018-10-29 2018-10-29 Text region extraction method based on multilevel text component positioning and growth


Publications (2)

Publication Number Publication Date
CN109460763A CN109460763A (en) 2019-03-12
CN109460763B true CN109460763B (en) 2022-06-21

Family

ID=65608611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811267160.7A Active CN109460763B (en) 2018-10-29 2018-10-29 Text region extraction method based on multilevel text component positioning and growth

Country Status (1)

Country Link
CN (1) CN109460763B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188622B (en) * 2019-05-09 2021-08-06 新华三信息安全技术有限公司 Character positioning method and device and electronic equipment
CN112699712A (en) * 2019-10-22 2021-04-23 杭州海康威视数字技术股份有限公司 Document image region separation method and device and storage medium

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108345850A (en) * 2018-01-23 2018-07-31 哈尔滨工业大学 The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
SG10201510667SA (en) * 2012-06-27 2016-01-28 Agency Science Tech & Res Text detection devices and text detection methods

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN108345850A (en) * 2018-01-23 2018-07-31 哈尔滨工业大学 The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel

Non-Patent Citations (2)

Title
Wenjun Ding, "Text Detection in Natural Scene Images by Hierarchical Localization and Growing of Textual Components", Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) 2017, 2017-07-14, pp. 775-780. *
Feng Su, "Text Proposals Based on Windowed Maximally Stable Extremal Region for Scene Text Detection", 2017 14th IAPR International Conference on Document Analysis and Recognition, 2018-01-29, pp. 376-381. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant