CN101311946B - Character identification method - Google Patents


Info

Publication number
CN101311946B
Authority
CN
China
Legal status
Expired - Fee Related
Application number
CN2007101073048A
Other languages
Chinese (zh)
Other versions
CN101311946A (en)
Inventor
蔡文瀚
范圣恩
Current Assignee
Compal Electronics Inc
Original Assignee
Compal Electronics Inc
Priority date
Filing date
Publication date
Application filed by Compal Electronics Inc filed Critical Compal Electronics Inc
Priority to CN2007101073048A priority Critical patent/CN101311946B/en
Publication of CN101311946A publication Critical patent/CN101311946A/en
Application granted granted Critical
Publication of CN101311946B publication Critical patent/CN101311946B/en

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention provides a character recognition method. Before the characters are recognized, the characters to be recognized are classified according to their positions relative to a set of reference tracks. During the actual recognition, each character is compared only with the characters in the character database that belong to its class. The range and number of characters to be compared are thereby reduced, which improves both the accuracy and the speed of character recognition.

Description

Character identification method
Technical field
The invention relates to a character recognition method, and more particularly to a recognition method that uses reference tracks to group characters before they are recognized.
Background of the invention
In this age of information explosion, people often need to read large numbers of books, newspapers, and magazines. When an article or passage is worth keeping or highlighting, it is usually photocopied, clipped and filed, or simply marked. For those who work with text, if the material in an article is needed after reading it, it must be typed (keyed in) into a computer again before the data can be edited or filed, which is both laborious and time-consuming.
To solve this problem, vendors have developed optical recognition technology. A user only needs an ordinary scanner to scan the document to be preserved into an image file; text-recognition software then extracts the text portions of the image and converts them into the corresponding digital characters, producing an electronic file that can easily be stored, edited, or otherwise processed. Optical recognition technology is now applied in a wide range of fields, such as archiving library documents, managing enterprise files, and even recognizing licenses and bills. Accurate recognition saves a great deal of manpower and time that would otherwise be spent comparing and verifying large amounts of data.
Optical character recognition, usually abbreviated as OCR, is mainly used to recognize text in existing printed documents. The document to be recognized is first scanned into an image file with a flatbed or handheld scanner. Because the document itself may not be clean, the characters may be blurred, or the scanner resolution may be limited, the input image may contain noise, all of which affects the accuracy of the subsequent text recognition. OCR software therefore first applies processing such as skew correction, noise removal, and edge sharpening to the scanned image. Next, the software separates text from graphics, isolating all text, figures, and tables in the document, and correctly splits or merges characters whose strokes are disconnected. The software then performs the recognition itself, comparing the character images with a character database and confirming the result with dictionary look-up and context checking, before finally outputting an accurate recognition result. The recognized text can be saved directly as Word, PDF, or plain-text files, which not only reduces the burden of data entry but also increases its speed and correctness.
In the above recognition process, however, every character in the document must be compared one by one with every character in the character database. This consumes a large amount of computing resources and lengthens the recognition time. Moreover, such an undifferentiated comparison ignores the relative positions between characters and between each character and the tracks, so noise is more likely to cause misjudgment, and the recognition rate and recognition speed cannot be improved effectively.
Summary of the invention
The invention provides a character recognition method in which printed characters are classified according to their positions relative to a set of tracks, and each printed character is compared only with the database characters belonging to the same class, so that the accuracy and speed of text recognition are improved.
The invention proposes a character recognition method comprising the following steps: a. scanning a line of printed text, the line comprising a plurality of first characters; b. generating a plurality of tracks from the first characters; c. determining the character class of each first character according to its position relative to the tracks; and d. comparing each first character with a plurality of second characters in a database that belong to the same character class, finding the second character corresponding to each first character, and thereby recognizing the first character, wherein the database records multiple character classes and the second characters belonging to each class.
In an embodiment of the invention, the tracks comprise a top line, an upper line, a base line, and a bottom line; the region between the top line and the upper line is the upper zone, the region between the upper line and the base line is the central zone, and the region between the base line and the bottom line is the lower zone.
In an embodiment of the invention, step c comprises: c1. determining whether each first character is a small character; c2. if it is, performing small-character classification; c3. if it is not, performing non-small-character classification.
In an embodiment of the invention, step c1 comprises: c1-1. calculating the character height of each first character; c1-2. comparing the character height of each first character with a preset height value, and classifying the first characters whose height is smaller than the preset height value as small characters.
In an embodiment of the invention, the following steps are performed after step c1-2: c1-3. taking a center reference point at the center of each remaining first character; c1-4. using the least-squares method to fit a center line through these center reference points; c1-5. determining whether the lower edge of each remaining first character lies above the center line, and classifying those whose lower edge lies above the center line as small characters; and c1-6. determining whether the upper edge of each remaining first character lies below the center line, and classifying those whose upper edge lies below the center line as small characters.
In an embodiment of the invention, before step c1 the tracks are classified into one of a plurality of states according to the number and kind of tracks generated, the states comprising a first state, a second state, a third state, and a fourth state. In the first state the tracks comprise the top line, the upper line, the base line, and the bottom line. In the second state the tracks comprise the base line, the bottom line, and a track formed by merging the top line and the upper line. In the third state the tracks comprise the top line, the upper line, and a track formed by merging the base line and the bottom line. In the fourth state the tracks comprise a track formed by merging the top line and the upper line, and a track formed by merging the base line and the bottom line.
In an embodiment of the invention, step d comprises: d1. calculating a first feature value for each first character; and d2. comparing, for each first character, its first feature value with the second feature values of the second characters belonging to its character class in the database, and taking the second character whose feature value is closest as the recognized character. In addition, a prediction model corresponding to each first character may also be used to recognize the first character.
Because the invention groups printed characters before comparing them, using the position of each printed character relative to the tracks to classify it and, during recognition, comparing it only with the database characters of the corresponding class, the range and number of characters to be compared are reduced, and the accuracy and speed of text recognition are improved.
To make the above features and advantages of the invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the tracks according to a preferred embodiment of the invention.
Fig. 2 is a flowchart of the character recognition method according to a preferred embodiment of the invention.
Fig. 3 is a character classification chart according to a preferred embodiment of the invention.
Fig. 4 is a flowchart of the feature-value calculation method according to a preferred embodiment of the invention.
Fig. 5 is an example of the feature-value calculation method according to a preferred embodiment of the invention.
Fig. 6 is a flowchart of the small-character determination method according to a preferred embodiment of the invention.
Fig. 7A and Fig. 7B are flowcharts of the small-character classification method according to a preferred embodiment of the invention.
Fig. 8 is a flowchart of the non-small-character classification method according to a preferred embodiment of the invention.
S201~S204: steps of the character recognition method of the preferred embodiment of the invention
S401~S406: steps of the feature-value calculation method of the preferred embodiment of the invention
S601~S607: steps of the small-character determination method of the preferred embodiment of the invention
S701~S720: steps of the small-character classification method of the preferred embodiment of the invention
S801~S811: steps of the non-small-character classification method of the preferred embodiment of the invention
Embodiment
When text in an English-type (Latin) script is written, the size and position of the letters follow certain tracks, which are adjusted and aligned according to the characteristics of the typeface. These hidden reference lines are like the four ruled lines printed in an exercise book used when learning to write such letters: as long as the writer follows the lines, the writing comes out neat and clear. Likewise, letters in these scripts follow the same reference lines when they are printed, and these reference lines are what this description calls tracks.
Fig. 1 is a schematic diagram of the tracks according to a preferred embodiment of the invention. Referring to Fig. 1, in this embodiment four tracks are defined from the typefaces of the printed characters in a line of printed text. According to their positions, the tracks are divided into a top line, an upper line, a base line, and a bottom line, and the regions between the tracks are divided into an upper zone, a central zone, and a lower zone. It should be noted that in Fig. 1 each printed character can be regarded as being composed of a plurality of interconnected blocks (connected components, CC), and the blocks that are connected together are called a CC group; one common way of obtaining such components is sketched below.
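The patent does not prescribe how the connected components are extracted. The following Python sketch shows one common way to obtain the blocks (CCs) of a binarized text-line image, assuming SciPy is available; the function name and the bounding-box representation are illustrative only.

```python
import numpy as np
from scipy import ndimage

def connected_components(binary_line):
    """Label 8-connected foreground blocks (CCs) in a binarized text-line image.

    Returns the label image and a list of bounding boxes (top, left, bottom, right),
    one per block. Nearby or overlapping boxes can then be merged into CC groups,
    each group corresponding to one printed character.
    """
    structure = np.ones((3, 3), dtype=int)              # 8-connectivity
    labels, count = ndimage.label(binary_line, structure=structure)
    slices = ndimage.find_objects(labels)               # one (row-slice, col-slice) pair per label
    boxes = [(s[0].start, s[1].start, s[0].stop, s[1].stop) for s in slices]
    return labels, boxes
```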
As described above, because of the characteristics of Latin-script text, every printed character falls between these four tracks, and characters with different characteristics (for example lowercase letters, capital letters, superscript characters, and subscript characters) occupy different regions between the tracks. The invention uses these characteristics to sort all the characters in the character database into different classes; during actual recognition, the character class of the character to be recognized is determined first, and only the database characters of that class are compared, so that a more accurate comparison result is obtained. To make the content of the invention clearer, embodiments that can actually be implemented according to the invention are given below as examples.
Fig. 2 is a flowchart of the character recognition method according to a preferred embodiment of the invention. Referring to Fig. 2, this embodiment recognizes a line of printed text by comparing each printed character in the line with the characters of a particular character class in the database and thereby recognizing each printed character.
First, the line of printed text is scanned, the line comprising a plurality of first characters (step S201). The printed text referred to here may be any line of text captured from a document, and the invention does not limit its scope; the following description deals with the recognition of a single line of printed text.
Next, a plurality of tracks are generated from the first characters (step S202). The kinds of tracks generated include the top line, upper line, base line, and bottom line shown in Fig. 1, and the regions between the tracks are divided into the upper zone, central zone, and lower zone. Depending on the kinds of printed characters present, however, anywhere from two to four tracks may be generated.
Then, the character class of each first character is determined according to its position relative to the tracks (step S203). For example, Fig. 3 is a character classification chart according to a preferred embodiment of the invention. Referring to Fig. 3, in this embodiment the English characters in the database are divided into eight classes according to the regions they occupy between the tracks. The first class (FULL) occupies the upper zone, central zone, and lower zone. The second class (HIGH) occupies the upper zone and central zone. The third class (DEEP) occupies the lower zone. The fourth class (SHORT) occupies the central zone. The fifth class (SUPER) consists of small characters located near the center line. The sixth class (SUBSCRIPT) occupies the region between the center line and the upper line. The seventh class (CENTER) occupies the region between the center line and the base line. The eighth class (UNKNOWN) holds the remaining characters that cannot be classified from the tracks. In other embodiments of the invention the English characters may, of course, be divided into other classes, and the invention does not limit their number.
After the character class of each first character has been determined, each first character is compared with the second characters of the same character class in the database, the second character corresponding to each first character is found, and the first characters are thereby recognized (step S204). For example, if the printed character "N" is to be recognized, it is classified into the second class of Fig. 3, and during the actual recognition only the character set contained in the second class is compared with the printed character "N" one by one. The number of characters to be compared is thus greatly reduced; at the same time, because the characters within one class all have the same or similar positions relative to the tracks, fewer recognition errors occur and the recognition rate is improved, as the sketch below illustrates.
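The following Python sketch illustrates the class-restricted comparison of step S204 under the eight classes of Fig. 3. The CharClass names, the database layout, and the distance measure are assumptions for illustration; the patent does not fix them.

```python
from enum import Enum

class CharClass(Enum):
    FULL = 1       # upper + central + lower zones
    HIGH = 2       # upper + central zones (e.g. "N")
    DEEP = 3       # reaches the lower zone
    SHORT = 4      # central zone only
    SUPER = 5      # small mark near the center line
    SUBSCRIPT = 6  # between center line and upper line
    CENTER = 7     # between center line and base line
    UNKNOWN = 8    # cannot be classified from the tracks

def euclidean(a, b):
    """Simple example distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def recognize(query_feature, char_class, database):
    """Match a character feature only against templates of its own class.

    `database` maps CharClass -> list of (label, feature) pairs, so the
    search space is a fraction of the full character set.
    """
    best_label, best_dist = None, float("inf")
    for label, feature in database.get(char_class, []):
        d = euclidean(query_feature, feature)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```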
It is worth mentioning that the first characters may be recognized, for example, by first calculating a first feature value of the first character, comparing this first feature value with the second feature values of the second characters of the same character class in the database, and taking the second character with the closest feature value as the recognized character.
For example, Fig. 4 is a flowchart of the feature-value calculation method according to a preferred embodiment of the invention, and Fig. 5 is an example of this method. Referring to Fig. 4, in this embodiment a character (that is, a CC group) is divided into several parts, the number of blocks (CCs) contained in each part is counted, and a feature matrix representing the character is obtained.
First, a central vertical line through the center reference point of the character is found, and the coordinates of the reference points located at 1/3 and 2/3 of this central vertical line (points A and B in Fig. 5) are calculated (step S401). A reference radius is then taken (step S402); this reference radius R is, for example, half the diagonal length of the character, that is, R = √(L² + W²) / 2, where L is the character length (height) and W is the character width.
A circle of radius R is then drawn with point A as the center, and the 360 degrees around A are divided into 18 directions; the 18 directions together with the circumference of the circle cut the character into 36 regions. The next step is therefore to count the number of blocks (CCs) contained in each of these 36 regions and record the counts in a matrix A (step S403). Likewise, a circle of radius R drawn around point B also cuts the character into 36 regions, and the number of blocks contained in each of these regions is counted and recorded in a matrix B (step S404).
The two matrices are then normalized (step S405), that is, each element of a matrix is divided by the sum of all elements in that matrix, which eliminates the influence of differences in character size on the feature. Finally, the two matrices are merged into a one-dimensional matrix, which serves as the feature matrix of the character (step S406).
Taking the character "N" illustrated in Fig. 5 as an example, the coordinates of points A and B are (5, 4) and (5, 8) respectively, and the reference radius is R = 7.5. If a circle of radius 7.5 is drawn around (5, 4) and the regions cut along the 18 directions are counted, the following matrix A is obtained:

A[2][18] = 4 3 6 4 3 2 7 3 2 5 4 1 0 0 0 1 3 2
           0 0 0 4 0 4 0 0 0 0 0 0 0 0 0 0 0 0

where the first row of A holds the block counts of the 18 regions inside the circle and the second row holds the block counts of the 18 regions outside the circle. Likewise, if a circle of radius 7.5 is drawn around (5, 8) and the regions cut along the 18 directions are counted, the following matrix B is obtained:

B[2][18] = 4 3 6 1 0 0 3 3 2 4 3 6 5 1 1 3 6 2
           0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0 0

where the first row of B holds the block counts of the 18 regions inside the circle and the second row holds the block counts of the 18 regions outside the circle. Finally, the two matrices are normalized and merged into a one-dimensional matrix, giving the required feature matrix. The sketch below outlines the whole computation.
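The following Python sketch outlines the feature computation of Fig. 4 (steps S401-S406) under a few assumptions: the "blocks" counted per region are taken to be foreground pixels, the 18 directions are 20-degree sectors, and image coordinates with y increasing downward are used. Function and variable names are illustrative only.

```python
import numpy as np

def zone_feature(char_bitmap):
    """Compute the 36-zone feature of Fig. 4/5 for one binarized character bitmap.

    Reference points A and B sit at 1/3 and 2/3 of the character's central
    vertical line, the reference radius is R = sqrt(L^2 + W^2) / 2, and each
    point contributes a 2 x 18 count matrix (inside / outside the circle,
    18 angular sectors), normalized and concatenated into one vector.
    """
    ys, xs = np.nonzero(char_bitmap)
    if len(ys) == 0:
        return np.zeros(72)
    L, W = char_bitmap.shape                        # character length (height) and width
    R = np.hypot(L, W) / 2.0                        # reference radius (step S402)
    cx = W / 2.0                                    # central vertical line
    points = [(cx, L / 3.0), (cx, 2.0 * L / 3.0)]   # points A and B (step S401)

    feature = []
    for px, py in points:
        counts = np.zeros((2, 18))
        for y, x in zip(ys, xs):
            dx, dy = x - px, y - py
            sector = int((np.degrees(np.arctan2(dy, dx)) % 360) // 20)  # 18 sectors of 20 degrees
            ring = 0 if np.hypot(dx, dy) <= R else 1                    # inside / outside the circle
            counts[ring, sector] += 1                                   # steps S403 / S404
        total = counts.sum()
        if total > 0:
            counts /= total                          # normalization (step S405)
        feature.append(counts.ravel())
    return np.concatenate(feature)                   # one-dimensional feature matrix (step S406)
```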
Except the method for above-mentioned eigenwert comparison, the present invention also comprises the pairing forecast model of each first character of use, comes identification first character.Yet the mode of above-mentioned these character comparisons is known art technology person when visual actual needs only for illustrating, and adopts the character comparison mode of other kind.
As can be seen from the above, the core of the invention is to determine the character class of a character according to its position relative to the tracks. The invention divides all characters in the database into two groups, small characters and non-small characters, and formulates a separate class-determination method for each group, described as follows.
It must first be determined whether a character is a small character. Referring to Fig. 6, after a line of printed text has been scanned, this embodiment finds the printed characters that are small or lie off-center and classifies them as small characters.
First, each printed character in the scanned image is marked with a bounding box comprising the top, bottom, left, and right edges of the character, and the character height of each printed character is calculated from these bounding boxes (step S601).
The character heights of the printed characters are then compared with a preset height value to determine whether each character height is smaller than the preset height value (step S602). The preset height value is, for example, half the average height of all the printed characters, and the invention does not limit its value.
If the character height of a printed character is smaller than the preset height value, the character is classified as a small character (step S607). After the printed characters of small height have been removed, a center reference point of each remaining printed character is taken (step S603), and the least-squares method is used to fit a center line through these center reference points (step S604), such that the sum of the distances from the center reference points of the printed characters to the center line is minimized.
Once the center line has been determined, it can be used to judge whether small characters still remain among the other printed characters. For example, it is first judged whether the lower edge of a printed character lies above the center line (step S605); if so, the character is classified as a small character (step S607). Otherwise it is judged whether the upper edge of the printed character lies below the center line (step S606); if so, the character is likewise classified as a small character (step S607). Simply put, the purpose of this embodiment is to find the printed characters that the center line does not pass through; such characters are often noise or punctuation marks, whose outlines are usually smaller than actual letters and which usually do not sit on the center line. A sketch of these steps follows.
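A minimal Python sketch of the small-character determination of Fig. 6 (steps S601-S607), assuming bounding boxes in image coordinates (y grows downward) and using half the average character height as the preset height value; these choices are examples, not requirements of the patent.

```python
import numpy as np

def find_small_characters(boxes):
    """Flag small characters (noise, punctuation) in one text line.

    `boxes` is a list of (top, left, bottom, right) bounding boxes. Characters
    shorter than half the average height are small; for the rest, a least-squares
    line through the box centers is fitted, and any character lying entirely
    above or entirely below that center line is also flagged as small.
    """
    boxes = np.asarray(boxes, dtype=float)
    heights = boxes[:, 2] - boxes[:, 0]
    threshold = heights.mean() / 2.0                   # example preset height value
    small = heights < threshold                        # steps S601-S602, S607

    rest = ~small
    if rest.sum() >= 2:                                # need at least two centers to fit a line
        cx = (boxes[rest, 1] + boxes[rest, 3]) / 2.0
        cy = (boxes[rest, 0] + boxes[rest, 2]) / 2.0
        a, b = np.polyfit(cx, cy, 1)                   # least-squares center line (steps S603-S604)
        for i in np.where(rest)[0]:
            x_mid = (boxes[i, 1] + boxes[i, 3]) / 2.0
            line_y = a * x_mid + b
            # S605-S606: lower edge above, or upper edge below, the center line
            if boxes[i, 2] < line_y or boxes[i, 0] > line_y:
                small[i] = True
    return small
```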
After the small characters in the printed text have been found by the above method, the small-character classification step can be performed. Fig. 7A and Fig. 7B are flowcharts of the small-character classification method according to a preferred embodiment of the invention. Referring first to Fig. 7A, this embodiment determines the state of the tracks according to the number and kind of tracks generated (step S701) and classifies each small character according to this track state. The states comprise, for example, a first state, a second state, a third state, and a fourth state. In the first state the tracks comprise the top line, the upper line, the base line, and the bottom line. In the second state the tracks comprise the base line, the bottom line, and a track formed by merging the top line and the upper line. In the third state the tracks comprise the top line, the upper line, and a track formed by merging the base line and the bottom line. In the fourth state the tracks comprise a track formed by merging the top line and the upper line, and a track formed by merging the base line and the bottom line.
First, a center line between the top line and the upper line is found, and the first, second, third, fourth, and fifth distances from the center reference point of the small character to the top line, the upper line, the base line, and the center line are calculated (step S702). In one embodiment, the y-intercept of the center line is, for example, the average of the y-intercepts of the top line and the upper line, and its slope equals the slope of the upper line; the invention, however, does not limit its scope, and any straight line between the top line and the upper line may be taken as the center line as needed.
Next, the ratio of the height to the width of the small character is calculated (step S703), and this ratio is compared with a first critical value to judge whether it is greater than the first critical value (step S704). The first critical value is, for example, the integer 4, and the invention does not limit its value.
In step S704, if the ratio of the small character is greater than the first critical value, it is then judged whether the track classification belongs to the first state or the second state (step S705). If not, the small character is classified into the eighth class (step S706); if so, the distances from the lower edge of the small character to the top line and to the upper line are calculated, and it is judged whether the lower edge is closer to the upper line (step S707). If so, the small character is classified into the fourth class (step S708); otherwise it is classified into the second class (step S709).
On the other hand, referring to Fig. 7B, if in step S704 the ratio of the small character is not greater than the first critical value, it is then judged whether the fifth distance is smaller than a second critical value, or whether the fifth distance is smaller than both the second distance and the third distance (step S710). If either condition is met, it is then judged whether the third distance is smaller than both the fifth distance and the second distance (step S711). If so, the small character is classified into the sixth class (step S712); otherwise it is judged whether the second distance is smaller than both the fifth distance and the third distance (step S713). If so, the small character is classified into the fifth class (step S714); otherwise it is classified into the seventh class (step S715).
Furthermore, if neither condition in step S710 is met, it is judged whether the center reference point of the small character falls on the center line (step S716). If not, the small character is classified into the sixth class (step S717); if so, the distances from the lower edge of the small character to the upper line and to the base line are calculated, and it is judged whether the lower edge is closer to the upper line (step S718). If so, the small character is classified into the fifth class (step S719); otherwise it is classified into the fourth class (step S720). A sketch of this decision flow is given below.
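The following Python sketch transcribes the small-character classification of Fig. 7A and 7B, reusing the CharClass enum from the earlier sketch. The track dictionary, the two critical values, and the tolerance for "falls on the center line" are illustrative assumptions.

```python
def classify_small_char(box, tracks, state, ratio_limit=4.0, near_limit=2.0):
    """Classify a small character into one of the Fig. 3 classes (steps S701-S720).

    `box` is (top, left, bottom, right) in image coordinates; `tracks` holds the
    y positions of 'top', 'upper', 'baseline' and 'center' at the character's x
    position; `state` is the track state (1-4).
    """
    top, left, bottom, right = box
    cy = (top + bottom) / 2.0                          # center reference point (y)
    height, width = bottom - top, right - left
    d2 = abs(cy - tracks['upper'])                     # second distance: to the upper line
    d3 = abs(cy - tracks['baseline'])                  # third distance: to the base line
    d5 = abs(cy - tracks['center'])                    # fifth distance: to the center line

    if width > 0 and height / width > ratio_limit:     # S703-S704: tall, thin mark
        if state in (1, 2):                            # S705
            # S707: lower edge closer to the upper line than to the top line?
            return (CharClass.SHORT
                    if abs(bottom - tracks['upper']) < abs(bottom - tracks['top'])
                    else CharClass.HIGH)               # S708 / S709
        return CharClass.UNKNOWN                       # S706

    if d5 < near_limit or (d5 < d2 and d5 < d3):       # S710
        if d3 < d5 and d3 < d2:                        # S711
            return CharClass.SUBSCRIPT                 # S712
        if d2 < d5 and d2 < d3:                        # S713
            return CharClass.SUPER                     # S714
        return CharClass.CENTER                        # S715

    if abs(cy - tracks['center']) < 0.5:               # S716: center point on the center line?
        # S718: lower edge closer to the upper line than to the base line?
        return (CharClass.SUPER
                if abs(bottom - tracks['upper']) < abs(bottom - tracks['baseline'])
                else CharClass.SHORT)                  # S719 / S720
    return CharClass.SUBSCRIPT                         # S717
```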
In the small-character classification described above, the first to eighth classes are, for example, the classes listed in Fig. 3, but the invention does not limit their definition. The classes of Fig. 3 are also used in the non-small-character classification, whose detailed steps are introduced below:
Fig. 8 is a flowchart of the non-small-character classification method according to a preferred embodiment of the invention. Referring to Fig. 8, as in the previous embodiment, this embodiment first determines the state of the tracks according to their number and kind (step S801) and classifies each character that is not a small character according to this track state. The states comprise, for example, the first state, second state, third state, and fourth state, defined exactly as in the previous embodiment, and are not described again here.
If the track classification belongs to the first state, it is judged whether the upper edge of the character lies above the top line (step S802). If so, it is then judged whether the lower edge of the character lies above the bottom line (step S803). If so, the character is classified into the first class (step S804); otherwise it is classified into the second class (step S805).
If, in step S802, the upper edge of the character is judged not to lie above the top line, it is then judged whether the lower edge of the character lies above the bottom line (step S806). If so, the character is classified into the third class (step S807); otherwise it is classified into the fourth class (step S808).
On the other hand, if the track classification belongs to the second state, it is judged whether the upper edge of the character lies above the top line (step S809). If so, the character is classified into the second class (step S810); otherwise it is classified into the fourth class (step S811).
Finally, if the track classification belongs to the third state or the fourth state, the first character is classified into the eighth class (step S812), that is, the class of characters that cannot be classified from the tracks. A sketch of this branching is given below.
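A short Python sketch transcribing the non-small-character branching of Fig. 8 (steps S801-S812), again reusing the CharClass enum. The track dictionary and the strict "lies above" comparisons are illustrative assumptions taken directly from the wording of the description.

```python
def classify_regular_char(box, tracks, state):
    """Classify a non-small character, following the branching of Fig. 8 as described.

    `box` is (top, left, bottom, right) in image coordinates (y grows downward),
    `tracks` holds the y positions of the 'top' and 'bottom' lines, and `state`
    is the track state from 1 to 4.
    """
    top, _, bottom, _ = box
    upper_edge_above_top_line = top < tracks['top']            # S802 / S809
    lower_edge_above_bottom_line = bottom < tracks['bottom']   # S803 / S806

    if state == 1:
        if upper_edge_above_top_line:
            return CharClass.FULL if lower_edge_above_bottom_line else CharClass.HIGH   # S804 / S805
        return CharClass.DEEP if lower_edge_above_bottom_line else CharClass.SHORT      # S807 / S808
    if state == 2:
        return CharClass.HIGH if upper_edge_above_top_line else CharClass.SHORT         # S810 / S811
    return CharClass.UNKNOWN                                                            # states 3 and 4, S812
```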
In the non-small-character classification described above, the first to eighth classes are likewise the classes listed in Fig. 3, and the invention does not limit their definition. By combining the small-character and non-small-character classification methods described above, suitable optical character recognition (OCR) data can be selected for recognizing characters of every type, which effectively improves the recognition rate.
In summary, the character recognition method of the invention has at least the following advantages:
1. Before recognition is performed, all characters to be recognized are first classified according to their positions relative to the tracks, and OCR data such as feature values are generated separately for each class.
2. The known positional relationship between a character and the tracks is used to determine which class the character belongs to, and recognition is performed with the OCR data of that class only, so the recognition speed is improved.
Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone with ordinary knowledge in the technical field may make slight changes and modifications without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.

Claims (14)

1. A character recognition method, comprising the following steps:
a. scanning a line of printed text, wherein the line of printed text comprises a plurality of first characters;
b. generating a plurality of tracks from the first characters;
c. determining the character class to which each of the first characters belongs according to the position of each first character relative to the tracks; and
d. comparing each of the first characters with a plurality of second characters belonging to its character class in a database, finding the second character corresponding to each first character, and thereby recognizing the first characters, wherein the database records a plurality of character classes and the second characters belonging to each character class.
2. The character recognition method according to claim 1, wherein the tracks comprise a top line, an upper line, a base line, and a bottom line, the region between the top line and the upper line being an upper zone, the region between the upper line and the base line being a central zone, and the region between the base line and the bottom line being a lower zone.
3. The character recognition method according to claim 2, wherein step c comprises:
c1. determining whether each of the first characters is a small character;
c2. if it is a small character, performing small-character classification; and
c3. if it is not a small character, performing non-small-character classification.
4. The character recognition method according to claim 3, wherein step c1 comprises:
c1-1. calculating a character height of each of the first characters; and
c1-2. comparing the character height of each first character with a preset height value, and classifying the first characters whose character height is smaller than the preset height value as small characters.
5. The character recognition method according to claim 4, further comprising, after step c1-2:
c1-3. taking a center reference point at the center of each remaining first character;
c1-4. using a least-squares method to fit a center line through the center reference points;
c1-5. determining whether a lower edge of each remaining first character lies above the center line, and classifying the remaining first characters whose lower edge lies above the center line as small characters; and
c1-6. determining whether an upper edge of each remaining first character lies below the center line, and classifying the remaining first characters whose upper edge lies below the center line as small characters.
6. The character recognition method according to claim 3, further comprising, before step c1:
classifying the tracks into one of a plurality of states according to the number and kind of tracks generated.
7. The character recognition method according to claim 6, wherein the states comprise:
a first state, in which the tracks comprise the top line, the upper line, the base line, and the bottom line;
a second state, in which the tracks comprise the base line, the bottom line, and a track formed by merging the top line and the upper line;
a third state, in which the tracks comprise the top line, the upper line, and a track formed by merging the base line and the bottom line; and
a fourth state, in which the tracks comprise a track formed by merging the top line and the upper line, and a track formed by merging the base line and the bottom line.
8. The character recognition method according to claim 7, wherein the classification of each small character comprises:
c2-1. finding a center line between the top line and the upper line, and calculating a first distance from a center reference point of the small character to the top line, a second distance from the center reference point to the upper line, a third distance from the center reference point to the base line, and a fifth distance from the center reference point to the center line;
c2-2. calculating a ratio of a height to a width of the small character;
c2-3. determining whether the ratio is greater than a first critical value;
c2-4. if so, determining whether the classification of the tracks belongs to the first state or the second state;
c2-4-1. if so, calculating the distances from a lower edge of the small character to the top line and to the upper line, and determining whether the distance from the lower edge of the small character to the upper line is smaller than the distance from the lower edge to the top line;
c2-4-1-1. if so, classifying the small character into a fourth class;
c2-4-1-2. if not, classifying the small character into a second class;
c2-4-2. if not, classifying the small character into an eighth class;
c2-5. if not, determining whether the fifth distance is smaller than a second critical value, or whether the fifth distance is smaller than both the second distance and the third distance;
c2-5-1. if so, determining whether the third distance is smaller than both the fifth distance and the second distance;
c2-5-1-1. if so, classifying the small character into a sixth class;
c2-5-1-2. if not, determining whether the second distance is smaller than both the fifth distance and the third distance;
c2-5-1-2-1. if so, classifying the small character into a fifth class;
c2-5-1-2-2. if not, classifying the small character into a seventh class;
c2-5-2. if not, determining whether the center reference point falls on the center line;
c2-5-2-1. if so, calculating the distances from the lower edge of the small character to the upper line and to the base line, and determining whether the distance from the lower edge of the small character to the upper line is smaller than the distance from the lower edge to the base line;
c2-5-2-1-1. if so, classifying the small character into the fifth class;
c2-5-2-1-2. if not, classifying the small character into the fourth class; and
c2-5-2-2. if not, classifying the small character into the sixth class.
9. The character recognition method according to claim 8, wherein the non-small-character classification comprises:
c3-1. if the classification of the tracks belongs to the first state, determining whether an upper edge of the first character lies above the top line;
c3-1-1. if so, determining whether a lower edge of the first character lies above the bottom line;
c3-1-1-1. if so, classifying the first character into a first class;
c3-1-1-2. if not, classifying the first character into the second class;
c3-1-2. if not, determining whether the lower edge of the first character lies above the bottom line;
c3-1-2-1. if so, classifying the first character into a third class; and
c3-1-2-2. if not, classifying the first character into the fourth class.
10. The character recognition method according to claim 9, wherein the non-small-character classification further comprises:
c3-2. if the classification of the tracks belongs to the second state, determining whether the upper edge of the first character lies above the top line;
c3-2-1. if so, classifying the first character into the second class; and
c3-2-2. if not, classifying the first character into the fourth class.
11. The character recognition method according to claim 10, wherein the non-small-character classification further comprises:
c3-3. if the classification of the tracks belongs to the third state or the fourth state, classifying the first character into the eighth class.
12. The character recognition method according to claim 11, wherein the character classes comprise:
the first class, which occupies the upper zone, the central zone, and the lower zone;
the second class, which occupies the upper zone and the central zone;
the third class, which occupies the lower zone;
the fourth class, which occupies the central zone;
the fifth class, which consists of the small characters located near the center line;
the sixth class, which occupies the region between the center line and the upper line;
the seventh class, which occupies the region between the center line and the base line; and
the eighth class, which holds the remaining characters that cannot be classified from the tracks.
13. The character recognition method according to claim 1, wherein step d comprises:
d1. calculating a first feature value of each of the first characters; and
d2. for each of the first characters, comparing its first feature value with a second feature value of each second character belonging to its character class in the database, and taking the second character with the closest feature value as the recognized character of the first character.
14. The character recognition method according to claim 1, wherein step d comprises:
using a prediction model corresponding to each of the first characters to recognize the first character.
CN2007101073048A 2007-05-25 2007-05-25 Character identification method Expired - Fee Related CN101311946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101073048A CN101311946B (en) 2007-05-25 2007-05-25 Character identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101073048A CN101311946B (en) 2007-05-25 2007-05-25 Character identification method

Publications (2)

Publication Number Publication Date
CN101311946A CN101311946A (en) 2008-11-26
CN101311946B true CN101311946B (en) 2010-10-27

Family

ID=40100588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101073048A Expired - Fee Related CN101311946B (en) 2007-05-25 2007-05-25 Character identification method

Country Status (1)

Country Link
CN (1) CN101311946B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5073955A (en) * 1989-06-16 1991-12-17 Siemens Aktiengesellschaft Method for recognizing previously localized characters present in digital gray tone images, particularly for recognizing characters struck into metal surfaces
CN1162158A (en) * 1996-04-09 1997-10-15 财团法人工业技术研究院 Method for automatically correcting truncating error of document and device thereof
CN1338671A (en) * 2001-09-26 2002-03-06 倚天资讯股份有限公司 Input device integrating handwriting recognition and input with virtual keyboard input

Also Published As

Publication number Publication date
CN101311946A (en) 2008-11-26

Similar Documents

Publication Publication Date Title
US7801392B2 (en) Image search system, image search method, and storage medium
CN101366020B (en) Table detection in ink notes
CN101356541B (en) Method and apparatus for processing account ticket
US6009196A (en) Method for classifying non-running text in an image
US5889886A (en) Method and apparatus for detecting running text in an image
US8249343B2 (en) Representing documents with runlength histograms
Aradhye A generic method for determining up/down orientation of text in roman and non-roman scripts
JP2575539B2 (en) How to locate and identify money fields on documents
US6816630B1 (en) System and method for creating and processing data forms
US8462394B2 (en) Document type classification for scanned bitmaps
EP0113410A2 (en) Image processors
CN100559387C (en) Image processing apparatus and method, image processing system
US6320983B1 (en) Method and apparatus for character recognition, and computer-readable recording medium with a program making a computer execute the method recorded therein
JP4098845B2 (en) How to compare symbols extracted from binary images of text
US20090257653A1 (en) Image processor and computer readable medium
EP1118959B1 (en) Method and apparatus for determining form sheet type
CN113269101A (en) Bill identification method, device and equipment
CN101311946B (en) Character identification method
JP2004171316A (en) Ocr device, document retrieval system and document retrieval program
US20080137955A1 (en) Method for recognizing characters
US20210240973A1 (en) Extracting data from tables detected in electronic documents
JP3957471B2 (en) Separating string unit
CN101303731B (en) Method for generating printing line
JP3162552B2 (en) Mail address recognition device and address recognition method
Hong et al. Information Extraction and Analysis on Certificates and Medical Receipts

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20101027

Termination date: 20140525