CN106709494B - Scene character recognition method based on coupling space learning - Google Patents

Scene character recognition method based on coupling space learning

Info

Publication number
CN106709494B
Authority
CN
China
Prior art keywords
dictionary
feature
scene
recognition
dist
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710014236.4A
Other languages
Chinese (zh)
Other versions
CN106709494A (en)
Inventor
张重
王红
刘爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongfang Information Technology Tianjin Co ltd
Original Assignee
Tianjin Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Normal University filed Critical Tianjin Normal University
Priority to CN201710014236.4A priority Critical patent/CN106709494B/en
Publication of CN106709494A publication Critical patent/CN106709494A/en
Application granted granted Critical
Publication of CN106709494B publication Critical patent/CN106709494B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a scene character recognition method based on coupling space learning, which comprises the following steps: preprocessing an input scene character image to obtain a training scene character image; extracting recognition features from the training scene character image to obtain a spatial dictionary; spatially encoding the recognition features of the corresponding image with the spatial dictionary to obtain the corresponding spatial coding vectors; performing maximum extraction on the spatial coding vectors to obtain a feature vector; training a linear support vector machine on the feature vectors to obtain a scene character recognition classification model; and acquiring the feature vector of a test scene character image and inputting it into the scene character recognition classification model to obtain the recognition result of the test scene character image. By creating a spatial dictionary and using it for spatial coding, the invention effectively combines spatial context information into the feature vector, thereby effectively mining the spatial information and improving the accuracy of scene character recognition.

Description

Scene character recognition method based on coupling space learning
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a scene character recognition method based on coupling space learning.
Background
Scene character recognition plays an important role in the field of pattern recognition, and can be directly applied to the fields of image retrieval, intelligent transportation, man-machine interaction and the like. In practical applications, scene text recognition is a very challenging research direction because the scene text is affected by external factors such as uneven illumination, distortion, complex background, and the like.
Scene character recognition has been extensively studied in recent decades. Some early methods applied optical character recognition (OCR) technology to scene characters. However, OCR technology has great limitations, such as requiring a binarization operation on the scene text image. In recent years, a large number of scene character recognition methods have been proposed and have made great progress. Among them, the most representative work is the class of scene character recognition methods based on object recognition. These methods skip the binarization of the scene character image, treat each scene character as a special object, and have achieved a certain success in the field of pattern recognition. For example, Newell et al. use multi-scale HOG (Histogram of Oriented Gradients) features for representation; Zhang et al. extract Histograms of Sparse Codes (HSC) for representation; and Shi et al. consider both local feature information and global structure information. While these approaches have met with some success, they largely ignore spatial context information. To address this problem, Gao et al. propose a stroke bank in the feature representation stage to take spatial context information into account; however, since different characters may contain the same feature information at different positions, this may cause reconstruction errors. The method proposed by Shi et al. is an extension of the method of Gao et al., which uses discriminative multi-scale stroke banks to represent features. Tian et al. propose adding spatial context information by considering co-occurrence relationships between HOG features. In addition, Gao et al. also propose considering spatial context information based on a position-embedded dictionary. Although the above methods have been successful, they consider spatial context information in only a single stage, i.e., either the dictionary learning stage or the encoding stage, and therefore cannot sufficiently retain effective spatial context information.
Disclosure of Invention
The invention aims to solve the technical problem that spatial context information, which has a large influence on scene character recognition results, is not fully exploited by existing methods; therefore, the invention provides a scene character recognition method based on coupling space learning.
In order to achieve the purpose, the scene character recognition method based on the coupling space learning comprises the following steps:
step S1, respectively carrying out preprocessing operation on the N input scene character images to obtain N training scene character images;
step S2, respectively extracting the recognition features of the N training scene character images to obtain N space dictionaries;
step S3, carrying out space coding on the recognition characteristics of each training scene character image by using the space dictionary of the image to obtain a corresponding space coding vector;
step S4, performing maximum extraction on the space coding vector of each training scene character image to obtain a characteristic vector corresponding to the training scene character image;
step S5, training by using a linear support vector machine based on the feature vector of the training scene character image to obtain a scene character recognition classification model;
step S6, acquiring the feature vector of a test scene character image according to steps S1-S4, and inputting it into the scene character recognition classification model to obtain a scene character recognition result.
Optionally, the step S1 includes the following steps:
step S11, converting the input scene text image into a grayscale scene text image;
step S12, the size of the grayscale scene character image is normalized to H × W, and the normalized grayscale scene character image is used as the training scene character image, where H and W represent the height and width of the grayscale scene character image, respectively.
Optionally, the step S2 includes the following steps:
step S21, extracting a recognition feature at position Pi (i = 1, 2, ..., m) of each training scene character image, wherein m is the number of recognition feature extraction positions of each training scene character image;
step S22, for the N training scene character images, clustering all recognition features extracted from position Pi to obtain a sub-dictionary Ci (i = 1, 2, ..., m), and recording the position of the sub-dictionary Ci as Pi;
In step S23, m sub-dictionaries carrying position information are concatenated to obtain a spatial dictionary.
Optionally, the recognition feature is a HOG feature.
Optionally, in step S22, the recognition features are clustered using a k-means clustering algorithm.
Optionally, the spatial dictionary is represented as:
D={C,P}={(C1,P1),(C2,P2),...,(Cm,Pm)},
wherein D represents the spatial dictionary, C = (C1, C2, ..., Cm) is the set of m sub-dictionaries, and P = (P1, P2, ..., Pm) is the set of position information of the sub-dictionary set C.
Optionally, in step S3, the recognition features of the training scene character images are spatially encoded by the objective function shown in the following formula:

min_A Σ_j ( ||fj - C·aj||_2^2 + α·||djF ⊙ aj||_2^2 + β·||djE ⊙ aj||_2^2 ),  subject to 1^T·aj = 1 for every j,

wherein ||·||_2 denotes the l2 norm, ⊙ denotes the element-wise product of corresponding elements of two matrices, fj denotes a recognition feature, aj denotes the spatial coding vector corresponding to fj, and A = [a1, a2, ..., aj, ...] denotes the set of all spatial coding vectors; ||fj - C·aj||_2 represents the error generated by reconstructing the recognition feature with the spatial dictionary; ||djF ⊙ aj||_2 is the local regularization term, which represents the distance constraint between the recognition feature and the codewords of the sub-dictionaries in feature space; ||djE ⊙ aj||_2 is the spatial regularization term, which constrains the positional relation between the recognition feature and the codewords of the sub-dictionaries in Euclidean space; α and β are regularization parameters; the constraint 1^T·aj = 1 means that the sum of all elements of the spatial coding vector aj equals 1; djF represents the distance between the recognition feature and the codewords of the sub-dictionaries in feature space, and djE represents the distance in Euclidean space between the position corresponding to the recognition feature fj and the positions corresponding to the codewords of the sub-dictionaries.
Optionally, the distance djF between the recognition feature and the codewords of the sub-dictionaries in feature space is expressed as:

djF = exp( dist(fj, C) / σF ),

wherein σF is a parameter that adjusts the decay speed of the weights in djF, and dist(fj, C) is defined as:

dist(fj, C) = [dist(fj, C1), dist(fj, C2), ..., dist(fj, Cm)]^T,

wherein dist(fj, Ci) (i = 1, 2, ..., m) represents the Euclidean distances between the feature fj and all codewords in the sub-dictionary Ci.
Optionally, the distance djE in Euclidean space between the position corresponding to the recognition feature fj and the positions corresponding to the codewords of the sub-dictionaries is expressed as:

djE = exp( dist(lj, P) / σE ),

wherein σE is a parameter that adjusts the decay speed of the weights in djE, and dist(lj, P) is defined as:

dist(lj, P) = [dist(lj, P1), ..., dist(lj, P1), dist(lj, P2), ..., dist(lj, P2), ..., dist(lj, Pm), ..., dist(lj, Pm)]^T,

wherein dist(lj, Pi) (i = 1, 2, ..., m) represents the Euclidean distance between the position lj of the recognition feature fj and the position Pi of the sub-dictionary Ci, repeated once for each codeword of Ci so that the length of dist(lj, P) matches the total number of codewords.
Optionally, in step S4, the spatial coding vector of each text image of the training scene is maximally extracted by using the following formula:
a=max{a1,a2,...,aj,...,am},
wherein a represents the feature vector of the training scene character image, and aj (j = 1, 2, ..., m) denotes a spatial coding vector.
The invention has the beneficial effects that: the invention can effectively combine the spatial context information into the feature vector by creating the spatial dictionary and performing spatial coding by using the created spatial dictionary, thereby achieving the purpose of effectively mining the spatial information and improving the accuracy of scene character recognition.
It should be noted that this invention was supported by National Natural Science Foundation of China projects No. 61401309 and No. 61501327, Tianjin Applied Basic and Frontier Technology Research Program Youth Fund project No. 15JCQNJC01700, and Tianjin Normal University Doctoral Fund projects No. 5RL134 and No. 52XB1405.
Drawings
Fig. 1 is a flowchart of a scene character recognition method based on coupled space learning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
Fig. 1 is a flowchart of a scene character recognition method based on coupled space learning according to an embodiment of the present invention, and some implementation flows of the present invention are described below with reference to fig. 1 as an example. The invention relates to a scene character recognition method based on coupling space learning, which comprises the following specific steps:
step S1, respectively carrying out preprocessing operation on the N input scene character images to obtain N training scene character images;
wherein the pre-processing operation comprises the steps of:
step S11, converting the input scene text image into a grayscale scene text image;
step S12, the size of the grayscale scene character image is normalized to H × W, and the normalized grayscale scene character image is used as the training scene character image, where H and W represent the height and width of the grayscale scene character image, respectively.
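For illustration only, a minimal sketch of the preprocessing of steps S11-S12 is given below; the use of OpenCV and the example values H = 64, W = 32 are assumptions of the sketch, not requirements of the invention.

```python
import cv2

def preprocess(image_path, H=64, W=32):
    """Steps S11-S12: grayscale conversion and size normalization to H x W."""
    img = cv2.imread(image_path)                  # input scene character image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # S11: convert to grayscale
    # S12: normalize the size; cv2.resize expects (width, height)
    return cv2.resize(gray, (W, H))

# train_img = preprocess("scene_char.png")        # H x W training scene character image
```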
Step S2, respectively extracting the recognition features of the N training scene character images to obtain N space dictionaries;
further, the step S2 includes the following steps:
step S21, extracting a recognition feature at position Pi (i = 1, 2, ..., m) of each training scene character image, wherein m is the number of feature extraction positions of each training scene character image, so that m recognition features are obtained for each training scene character image;
The recognition feature may be an HOG feature or another discriminative feature.
step S22, for the N training scene character images, clustering all recognition features extracted from position Pi to obtain a sub-dictionary Ci (i = 1, 2, ..., m), and recording the position of the sub-dictionary Ci as Pi; in this way, m sub-dictionaries are obtained for the m feature extraction positions;
The clustering operation may be carried out with a clustering algorithm such as k-means.
In step S23, m sub-dictionaries carrying position information are concatenated to obtain a spatial dictionary.
Wherein the spatial dictionary may be expressed as:
D={C,P}={(C1,P1),(C2,P2),...,(Cm,Pm)},
wherein D represents the spatial dictionary, C = (C1, C2, ..., Cm) is the set of m sub-dictionaries, and the corresponding P = (P1, P2, ..., Pm) is the set of position information of the sub-dictionary set C.
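The following is a hedged sketch of how the spatial dictionary of steps S21-S23 could be built; the use of HOG features from scikit-image, k-means from scikit-learn, and the patch size and codebook size K are illustrative assumptions rather than part of the invention.

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans

def build_spatial_dictionary(train_imgs, positions, patch=16, K=20):
    """Steps S21-S23: one sub-dictionary C_i per extraction position P_i."""
    C, P = [], []
    for (y, x) in positions:                              # positions P_i, i = 1..m
        feats = []
        for img in train_imgs:                            # pool features of all N images at P_i
            patch_img = img[y:y + patch, x:x + patch]
            feats.append(hog(patch_img, pixels_per_cell=(8, 8),
                             cells_per_block=(2, 2)))     # S21: recognition feature (HOG)
        km = KMeans(n_clusters=K, n_init=10).fit(np.array(feats))  # S22: clustering
        C.append(km.cluster_centers_)                     # sub-dictionary C_i (K codewords)
        P.append((y, x))                                  # position information P_i
    # S23: concatenate the m sub-dictionaries carrying position information
    C_matrix = np.vstack(C)                               # all codewords, sub-dictionary by sub-dictionary
    P_code = np.repeat(np.asarray(P), K, axis=0)          # position P_i repeated once per codeword of C_i
    return C_matrix, P_code                               # spatial dictionary D = {C, P}
```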
Step S3, using the space dictionary of each training scene character image to perform space coding on m recognition characteristics of the image to obtain corresponding m space coding vectors;
in step S3, the m recognition features of each training scene character image are spatially encoded with the spatial dictionary through the following objective function:

min_A Σ_j ( ||fj - C·aj||_2^2 + α·||djF ⊙ aj||_2^2 + β·||djE ⊙ aj||_2^2 ),  subject to 1^T·aj = 1 for every j,

wherein ||·||_2 denotes the l2 norm, ⊙ denotes the element-wise product of corresponding elements of two matrices, fj denotes a recognition feature, aj denotes the spatial coding vector corresponding to fj, and A = [a1, a2, ..., aj, ...] denotes the set of all spatial coding vectors; ||fj - C·aj||_2 represents the error generated by reconstructing the recognition feature with the spatial dictionary; ||djF ⊙ aj||_2 is the local regularization term, which represents the distance constraint between the recognition feature and the codewords of the sub-dictionaries in feature space; ||djE ⊙ aj||_2 is the spatial regularization term, which constrains the positional relation between the recognition feature and the codewords of the sub-dictionaries in Euclidean space; α and β are regularization parameters; the constraint 1^T·aj = 1 means that the sum of all elements of the spatial coding vector aj equals 1; djF represents the distance between the recognition feature and the codewords of the sub-dictionaries in feature space, and its specific expression is as follows:
djF = exp( dist(fj, C) / σF ),

wherein σF is a parameter that adjusts the decay speed of the weights in djF, and dist(fj, C) is defined as follows:

dist(fj, C) = [dist(fj, C1), dist(fj, C2), ..., dist(fj, Cm)]^T,

wherein dist(fj, Ci) (i = 1, 2, ..., m) represents the Euclidean distances between the feature fj and all codewords in the sub-dictionary Ci.
djE represents the distance in Euclidean space between the position lj corresponding to the recognition feature fj and P, and its specific expression is as follows:

djE = exp( dist(lj, P) / σE ),

wherein σE is a parameter that adjusts the decay speed of the weights in djE, and dist(lj, P) is defined as follows:

dist(lj, P) = [dist(lj, P1), ..., dist(lj, P1), dist(lj, P2), ..., dist(lj, P2), ..., dist(lj, Pm), ..., dist(lj, Pm)]^T,

wherein dist(lj, Pi) (i = 1, 2, ..., m) represents the Euclidean distance between the position lj of the recognition feature fj and the position Pi of the sub-dictionary Ci, repeated once for each codeword of Ci.
In feature space, the objective function selects a group of codewords to reconstruct the recognition feature through the local regularization term, and simultaneously, in Euclidean space, it constrains the positional relation between the recognition feature and the codewords of the sub-dictionaries through the spatial regularization term.
An analytical solution can be obtained by taking the derivative of the objective function, as follows:

ãj = ( Aj + α·diag(djF ⊙ djF) + β·diag(djE ⊙ djE) )^(-1) · 1,

wherein Aj = (C^T - 1·fj^T)(C^T - 1·fj^T)^T represents a covariance matrix; using the formula aj = ãj / (1^T·ãj), the spatial coding vector aj can be solved by carrying out the normalization operation on ãj.
With this analytical solution, the complex optimization process of directly solving for the spatial coding vector corresponding to each recognition feature can be avoided.
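As an illustration, a minimal sketch of the spatial coding of one recognition feature, under the exponential distance weighting and closed-form solution described above, could look as follows; the parameter values and the variable names (C as the matrix of all codewords, P_code as the matrix of codeword positions, as returned by the dictionary sketch above) are assumptions of the sketch.

```python
import numpy as np

def spatial_encode(f_j, l_j, C, P_code, alpha=1e-2, beta=1e-2,
                   sigma_F=1.0, sigma_E=1.0):
    """Step S3: encode one recognition feature f_j (at position l_j) with the spatial dictionary."""
    M = C.shape[0]
    # local term d_jF: distance in feature space between f_j and every codeword
    d_F = np.exp(np.linalg.norm(C - f_j, axis=1) / sigma_F)
    # spatial term d_jE: Euclidean distance between position l_j and each codeword's position
    d_E = np.exp(np.linalg.norm(P_code - np.asarray(l_j), axis=1) / sigma_E)
    # covariance matrix A_j = (C^T - 1 f_j^T)(C^T - 1 f_j^T)^T, with codewords stored as rows of C
    B = C - f_j                                   # M x d
    A_j = B @ B.T                                 # M x M
    A_j += alpha * np.diag(d_F ** 2) + beta * np.diag(d_E ** 2)
    a = np.linalg.solve(A_j, np.ones(M))          # analytical solution
    return a / a.sum()                            # normalization so that 1^T a_j = 1
```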
Step S4, performing maximum extraction on the space coding vector of each training scene character image to obtain a characteristic vector corresponding to the training scene character image;
in step S4, the spatial coding vector of each text image of the training scene is maximally extracted by using the following formula:
a=max{a1,a2,...,aj,...,am},
wherein aj (j = 1, 2, ..., m) denotes a spatial coding vector and a represents the feature vector of the training scene character image.
Through the above formula, the maximum value of each dimension over the m spatial coding vectors of the training scene character image is taken, yielding the feature vector a of the training scene character image.
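As a brief sketch, the maximum extraction of step S4 amounts to a dimension-wise max pooling over the m spatial coding vectors:

```python
import numpy as np

def max_pool_codes(code_vectors):
    """Step S4: take the maximum of each dimension over the m spatial coding vectors."""
    A = np.stack(code_vectors)        # m x M matrix of spatial coding vectors a_1..a_m
    return A.max(axis=0)              # feature vector a of the scene character image
```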
Step S5, training by using a linear support vector machine based on the feature vector of the training scene character image to obtain a scene character recognition classification model;
step S6, acquiring the feature vector of a test scene character image according to steps S1-S4, and inputting it into the scene character recognition classification model to obtain a scene character recognition result.
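A hedged sketch of steps S5-S6 is given below, assuming scikit-learn's LinearSVC as the linear support vector machine; the function and variable names are illustrative only.

```python
from sklearn.svm import LinearSVC

def train_and_recognize(X_train, y_train, X_test):
    """Steps S5-S6: train a linear SVM on training feature vectors, then classify test vectors.

    X_train: feature vectors a of the N training scene character images (steps S1-S4)
    y_train: corresponding character labels
    X_test:  feature vectors of test scene character images, obtained by the same S1-S4 pipeline
    """
    clf = LinearSVC(C=1.0)            # linear support vector machine
    clf.fit(X_train, y_train)         # S5: scene character recognition classification model
    return clf.predict(X_test)        # S6: scene character recognition result
```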
The validity of the method of the present invention can be verified on scene text image databases published on the Internet. For example, on the ICDAR 2003 database, with H × W = 64 × 32 and number of positions m = 128, the scene character recognition accuracy reaches 83.2%.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (9)

1. A scene character recognition method based on coupled space learning is characterized by comprising the following steps:
step S1, respectively carrying out preprocessing operation on the N input scene character images to obtain N training scene character images;
step S2, respectively extracting the recognition features of the N training scene character images to obtain N space dictionaries;
step S3, carrying out space coding on the recognition characteristics of each training scene character image by using the space dictionary of the image to obtain a corresponding space coding vector;
step S4, performing maximum extraction on the space coding vector of each training scene character image to obtain a characteristic vector corresponding to the training scene character image;
step S5, training by using a linear support vector machine based on the feature vector of the training scene character image to obtain a scene character recognition classification model;
step S6, acquiring the feature vector of a test scene character image according to steps S1-S4, and inputting it into the scene character recognition classification model to obtain a scene character recognition result;
in step S3, the recognition features of the training scene character images are spatially encoded by the objective function shown in the following formula:

min_A Σ_j ( ||fj - C·aj||_2^2 + α·||djF ⊙ aj||_2^2 + β·||djE ⊙ aj||_2^2 ),  subject to 1^T·aj = 1 for every j,

wherein ||·||_2 denotes the l2 norm, ⊙ denotes the element-wise product of corresponding elements of two matrices, fj denotes a recognition feature, aj denotes the spatial coding vector corresponding to fj, and A = [a1, a2, ..., aj, ...] denotes the set of all spatial coding vectors; ||fj - C·aj||_2 represents the error generated by reconstructing the recognition feature with the spatial dictionary; ||djF ⊙ aj||_2 is the local regularization term, which represents the distance constraint between the recognition feature and the codewords of the sub-dictionaries in feature space; ||djE ⊙ aj||_2 is the spatial regularization term, which constrains the positional relation between the recognition feature and the codewords of the sub-dictionaries in Euclidean space; α and β are regularization parameters; the constraint 1^T·aj = 1 means that the sum of all elements of the spatial coding vector aj equals 1; djF represents the distance between the recognition feature and the codewords of the sub-dictionaries in feature space, and djE represents the distance in Euclidean space between the position corresponding to the recognition feature fj and the positions corresponding to the codewords of the sub-dictionaries.
2. The method according to claim 1, wherein the step S1 comprises the steps of:
step S11, converting the input scene text image into a grayscale scene text image;
step S12, the size of the grayscale scene character image is normalized to H × W, and the normalized grayscale scene character image is used as the training scene character image, where H and W represent the height and width of the grayscale scene character image, respectively.
3. The method according to claim 1, wherein the step S2 comprises the steps of:
step S21, extracting a recognition feature at position Pi (i = 1, 2, ..., m) of each training scene character image, wherein m is the number of recognition feature extraction positions of each training scene character image;
step S22, for the N training scene character images, clustering all recognition features extracted from position Pi to obtain a sub-dictionary Ci (i = 1, 2, ..., m), and recording the position of the sub-dictionary Ci as Pi;
In step S23, m sub-dictionaries carrying position information are concatenated to obtain a spatial dictionary.
4. The method of claim 3, wherein the recognition feature is a HOG feature.
5. The method according to claim 3, wherein in step S22, the recognition features are clustered using a k-means clustering algorithm.
6. The method of claim 3, wherein the spatial dictionary is represented as:
D={C,P}={(C1,P1),(C2,P2),...,(Cm,Pm)},
wherein D represents the spatial dictionary, C = (C1, C2, ..., Cm) is the set of m sub-dictionaries, and P = (P1, P2, ..., Pm) is the set of position information of the sub-dictionary set C.
7. The method according to claim 1, wherein the distance djF between the recognition feature and the codewords of the sub-dictionaries in feature space is expressed as:

djF = exp( dist(fj, C) / σF ),

wherein σF is a parameter that adjusts the decay speed of the weights in djF, and dist(fj, C) is defined as:

dist(fj, C) = [dist(fj, C1), dist(fj, C2), ..., dist(fj, Cm)]^T,

wherein dist(fj, Ci) (i = 1, 2, ..., m) represents the Euclidean distances between the feature fj and all codewords in the sub-dictionary Ci.
8. The method of claim 1, wherein the distance djE in Euclidean space between the position corresponding to the recognition feature fj and the positions corresponding to the codewords of the sub-dictionaries is expressed as:

djE = exp( dist(lj, P) / σE ),

wherein σE is a parameter that adjusts the decay speed of the weights in djE, and dist(lj, P) is defined as:

dist(lj, P) = [dist(lj, P1), ..., dist(lj, P1), dist(lj, P2), ..., dist(lj, P2), ..., dist(lj, Pm), ..., dist(lj, Pm)]^T,

wherein dist(lj, Pi) (i = 1, 2, ..., m) represents the Euclidean distance between the position lj of the recognition feature fj and the position Pi of the sub-dictionary Ci.
9. The method according to claim 1, wherein in step S4, the spatial coding vector of each text image of the training scene is maximally extracted by using the following formula:
a=max{a1,a2,...,aj,...,am},
wherein a represents the feature vector of the training scene character image, and aj (j = 1, 2, ..., m) denotes a spatial coding vector.
CN201710014236.4A 2017-01-10 2017-01-10 Scene character recognition method based on coupling space learning Active CN106709494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710014236.4A CN106709494B (en) 2017-01-10 2017-01-10 Scene character recognition method based on coupling space learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710014236.4A CN106709494B (en) 2017-01-10 2017-01-10 Scene character recognition method based on coupling space learning

Publications (2)

Publication Number Publication Date
CN106709494A CN106709494A (en) 2017-05-24
CN106709494B true CN106709494B (en) 2019-12-24

Family

ID=58908090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710014236.4A Active CN106709494B (en) 2017-01-10 2017-01-10 Scene character recognition method based on coupling space learning

Country Status (1)

Country Link
CN (1) CN106709494B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679505B (en) * 2017-10-13 2020-04-21 林辉 Method for realizing rejection of handwritten character
CN108764233B (en) * 2018-05-08 2021-10-15 天津师范大学 Scene character recognition method based on continuous convolution activation


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537362A (en) * 2015-01-16 2015-04-22 中国科学院自动化研究所 Domain-based self-adaptive English scene character recognition method
CN105760821A (en) * 2016-01-31 2016-07-13 中国石油大学(华东) Classification and aggregation sparse representation face identification method based on nuclear space

Also Published As

Publication number Publication date
CN106709494A (en) 2017-05-24


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221130

Address after: 300392 Room 603, Building 1, No. 1, Huixue Road, Xuefu Industrial Zone, Xiqing District, Tianjin

Patentee after: Zhongfang Information Technology (Tianjin) Co.,Ltd.

Address before: 300387 Tianjin city Xiqing District West Binshui Road No. 393

Patentee before: TIANJIN NORMAL University

TR01 Transfer of patent right