GB2245740A - Character encoding

Info

Publication number: GB2245740A
Application number: GB9014722A
Authority: GB
Inventors: Philip Timothy Wright; Edmund Sparks
Original assignee: Roke Manor Research Ltd
Current assignee: Roke Manor Research Ltd
Priority date: 1990-07-03
Filing date: 1990-07-03
Publication date: 1992-01-08
Anticipated expiration: 2010-07-03
Also published as: GB9014722D0; GB2245740B

Abstract

A hand-written character to be encoded is defined by digital data, typically from a graphics tablet, comprising successive x, y co-ordinate positions. The digital data is encoded (e.g. using x, y encoding) into a series of arbitrary line vectors LV1, LV2, etc and the arbitrary line vectors are subsequently encoded (e.g. using Freeman encoding) into a series of pre-defined line vectors FV1, FV2 etc to afford a character encoded signal. Corresponding ones of the arbitrary line vectors LV1, LV2, etc are compared with the pre-defined line vectors FV1, FV2 etc in order to derive one or more alternative character encoded signals which improve the chances of the character being recognised.

Description

CHARACTER ENCODING USING ELASTIC MATCHING

This invention relates to character encoding systems, and especially to such systems for encoding, in real time, hand-written characters as produced dynamically on, for example, a digitising graphics tablet.

It is known to make use of a digitising graphics tablet, such as a Numonics 2200 (trade mark) data tablet, to provide a digitised output corresponding to one or more characters which are hand-written on the tablet using a suitable pen. Typically the pen tip position is sampled approximately 60 times per second, each sample corresponding to an x,y co-ordinate pair, and a digitised serial output is afforded, typically at speeds up to 19200 baud, in either ASCII or packed binary format. The digitised output from the graphics tablet, corresponding to each hand-written character, is then compared with digitised character information stored in a data base in order to effect recognition of the hand-written character.

In order to effect such character recognition it is usual to encode the digitised output from the graphics tablet to reduce the amount of information which has to be stored in respect of each character.

In our co-pending patent application (F20772) entitled CHARACTER ENCODING SYSTEMS there is described a number of encoding systems which may be referred to as x,y encoding, Freeman encoding and "threshold" encoding.

With such encoding systems each hand-written character is encoded to afford a digitised output which is compared with digitised character information stored in a data base in order to effect recognition of the hand-written character.

In practice it is found that a person's hand-writing is very seldom constant. When the same character is written a number of times at different times, the results may be discerned by the human eye as being the same character, yet when those characters are digitised and encoded, the character encoded signal may well be different for each of them. Thus, a number of different character encoded signals may be obtained for what is ostensibly the same character. In existing systems the data base is provided with character information for each expected character, and it does not take account of the fact that the same character can result in a number of different character encoded signals, only one of which can match up with the character information in the data base. Therefore, the character encoded signals, ostensibly for the same character, which do not match up with the digitised character information in the data base are not recognised.

It is an object of the present invention to provide a character encoding system in which each character which is encoded may be provided with a plurality of alternative character encoded signals which are more likely to match up with the digitised character information in the data base with the result that the character is more likely to be recognised.

According to the present invention there is provided a character encoding system in which a character to be encoded is defined by digital data comprising successive x,y co-ordinate positions corresponding to the shape of said character, and in which the digital data is initially encoded in the form of a series of arbitrary line vectors which are subsequently encoded into a series of pre-defined line vectors to define a character encoded signal, means being provided for comparing corresponding ones of the arbitrary line vectors with the pre-defined line vectors to afford one or more alternative character encoded signals.

In carrying out the invention the digital data may be initially encoded in the form of a predetermined number of arbitrary line vectors which are subsequently encoded into the same predetermined number of pre-defined line vectors.

It may be arranged that the digital data is initially encoded in the form of a series of arbitrary line vectors using, for example, x,y encoding or threshold encoding, and the arbitrary line vectors may be subsequently encoded using, for example, Freeman encoding.

An exemplary embodiment of the invention will now be described, reference being made to the accompanying drawings, in which:
Figure 1 depicts a typical hand-written character "a" with various sampled x,y co-ordinates;
Figure 2 depicts the x,y encoding of the character "a" of Figure 1;
Figure 3 depicts the x,y encoded character corresponding to the character "a" of Figure 1;
Figure 4 depicts typical Freeman encoding vectors; and
Figure 5 depicts the Freeman encoded character corresponding to the x,y encoded character of Figure 3.

One method of encoding the digitised output, which is conveniently referred to as x,y encoding, is described in an article entitled "A Microcomputer System to Recognise Hand-printed Numerals using a Syntactic-Statistic Approach" by Tang, Tzeng and Hsu, published in 7th International Conference on Pattern Recognition, Volume II, pp 106-4, 1984. The basis of x,y encoding is that as a character is being formed, the sampled x and y co-ordinate pairs in the digitised output from the graphics tablet are monitored and each time there is a transition from positive to negative or negative to positive in either the x or y direction, the x,y co-ordinates of such transitions are noted and line vectors are effectively fitted between the x,y co-ordinates where the transitions occur and are used to derive an encoded digitised output corresponding to the character being recognised.
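The transition-detection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the method of the cited article: the function names `sign` and `xy_encode` are hypothetical, samples are assumed to be a list of (x, y) tuples, and the turning point is taken to be the sample immediately before the first step in the reversed direction.

```python
def sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def xy_encode(points):
    """Return the indices of the samples that are joined by line vectors.

    A breakpoint is recorded wherever the direction of travel reverses
    in x or in y. The first and last samples are always breakpoints.
    """
    if len(points) < 2:
        return list(range(len(points)))
    breaks = [0]
    last_sx = last_sy = 0  # last non-zero direction seen in x and in y
    for i in range(1, len(points)):
        sx = sign(points[i][0] - points[i - 1][0])
        sy = sign(points[i][1] - points[i - 1][1])
        x_turn = sx != 0 and last_sx != 0 and sx != last_sx
        y_turn = sy != 0 and last_sy != 0 and sy != last_sy
        if x_turn or y_turn:
            breaks.append(i - 1)  # the reversal happened at the previous sample
        if sx:
            last_sx = sx
        if sy:
            last_sy = sy
    if breaks[-1] != len(points) - 1:
        breaks.append(len(points) - 1)
    return breaks


stroke = [(0, 0), (1, 1), (2, 2), (3, 1), (4, 0), (3, -1)]
print(xy_encode(stroke))  # [0, 2, 4, 5]
```

Each consecutive pair of returned indices defines one arbitrary line vector, so the example stroke yields three vectors.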

Another method of encoding the digitised output, which is conveniently referred to as "Freeman encoding", is described in an article "On the Encoding of Arbitrary Geometric Configurations" by H. Freeman in IRE Transactions on Electronic Computers; pp 260-268, June 1961. The basis of Freeman encoding is that the shape of the character being recognised as it is formed is quantised into a number, e.g. 5, of pre-determined vector directions, and the particular vectors detected are used to derive the encoded digitised output corresponding to the character being recognised.

Freeman vector encoding is advantageous compared with x,y encoding in that the greater the number of vectors used, the more closely the character can be tracked.

However, the number of vectors which are used to define each character determines the storage and processing requirements of the system. For example, if each character is defined using eight vectors, there are approximately 6600000 possible vector permutations. If the number of vectors for each character is reduced to five, the possible vector permutations are reduced to approximately 22000.
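The text does not say how these figures are counted. One counting that comes close, offered purely as an assumption, is eight possible directions with no two consecutive vectors sharing a direction, which gives 8 x 7^(n-1) strings of n vectors. The helper names below are hypothetical.

```python
def strings_of_length(n, directions=8):
    """Strings of n vectors in which adjacent vectors differ in direction."""
    return directions * (directions - 1) ** (n - 1)


def strings_up_to(n, directions=8):
    """Strings of 1 to n vectors under the same no-repeat assumption."""
    return sum(strings_of_length(k, directions) for k in range(1, n + 1))


print(strings_of_length(8))  # 6588344
print(strings_of_length(5))  # 19208
print(strings_up_to(5))      # 22408
```

Under this assumption the eight-vector count is close to the quoted 6600000, and the quoted 22000 is closer to the count of all strings of up to five vectors than to the count of exactly five.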

It is therefore usual to use some form of vector reduction technique in order to reduce the number of vectors used to define each character to a maximum of, say, five.

Various vector reduction techniques are known for doing this.

Yet another method of encoding the digitised output, which is conveniently referred to as "threshold encoding", is described in the aforementioned patent application (F20772). The basis of threshold encoding is that as the character is being formed, the sampled x and y co-ordinate pairs in the digitised output are monitored and software processing is used to effectively construct a succession of line vectors between one of the sampling points and successive sampling points, and each time a line vector is constructed, the perpendicular distance between it and any intermediate sampling point is determined until a resultant line vector is obtained which straddles a maximum number of sampling points with perpendicular distance not exceeding a pre-determined maximum threshold distance. The process is repeated for all of the sampling points of the character in order to derive a set of "resultant" line vectors which correspond to the shape of the character being encoded.
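A minimal sketch of the threshold encoding idea follows. It is an assumption-laden illustration rather than the procedure of the co-pending application: the names `perp_distance` and `threshold_encode` are hypothetical, and the sketch extends each line vector greedily, stopping at the first sample whose inclusion would push any intermediate point beyond the threshold.

```python
import math


def perp_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def threshold_encode(points, max_dist):
    """Return indices of the samples joined by the resultant line vectors."""
    n = len(points)
    if n < 2:
        return list(range(n))
    breaks = [0]
    start = 0
    while start < n - 1:
        end = start + 1
        # Extend the vector while every straddled sample stays within max_dist.
        while end + 1 < n and all(
            perp_distance(points[k], points[start], points[end + 1]) <= max_dist
            for k in range(start + 1, end + 1)
        ):
            end += 1
        breaks.append(end)
        start = end
    return breaks


corner = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(threshold_encode(corner, 0.1))  # [0, 2, 4]
print(threshold_encode(corner, 2.0))  # [0, 4]
```

A small threshold preserves the corner as two vectors, while a large one collapses the stroke into a single vector.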

Again, vector reduction techniques, such as by repeating the threshold encoding procedure with different perpendicular threshold distances, may be used in order to reduce the number of "resultant" line vectors to a required number, e.g. five.
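The repeat-with-a-different-threshold idea might be sketched as below. The function `reduce_vectors` and its `encode` parameter are hypothetical: `encode(points, threshold)` stands for any encoder that returns breakpoint indices, and trying the thresholds in ascending order is an assumption, not something the text specifies.

```python
def reduce_vectors(points, encode, thresholds, max_vectors=5):
    """Re-encode with growing thresholds until few enough vectors remain.

    Returns (threshold, breakpoints) for the first threshold giving at most
    max_vectors line vectors, or (None, last_breakpoints) if none does.
    """
    breaks = None
    for t in sorted(thresholds):
        breaks = encode(points, t)
        if len(breaks) - 1 <= max_vectors:
            return t, breaks
    return None, breaks


# Stand-in encoder for demonstration only: keeps every t-th sample.
def every_nth(points, t):
    return list(range(0, len(points), t))


samples = list(range(13))
print(reduce_vectors(samples, every_nth, [1, 2, 3, 4]))  # (3, [0, 3, 6, 9, 12])
```

Starting from the smallest threshold keeps as much shape detail as the vector limit allows.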

Turning now to the accompanying drawings, in Figure 1 there is depicted the character "a" as it may be hand-written on a graphics tablet as hereinbefore described using an associated pen. Superimposed on the character "a" are typical sampling points referenced 0 to 19, at which sampling points the x,y co-ordinates are determined. Thus at sampling point 0 the x,y co-ordinates will be x0, y0, at sampling point 1 the x,y co-ordinates will be x1, y1, etc.

This data is output from the graphics tablet in digitised serial form and is fed to a suitable processor (not shown) for character encoding and recognition purposes.

In Figures 2 and 3 of the drawings there is depicted how the character "a" of Figure 1 is coded using the x,y encoding technique referred to above. In Figure 2 it will be observed that at sampling point 1 a transition Ty1 occurs from positive to negative in the y-axis. Similarly at sampling point 6 a transition Tx1 occurs from negative to positive in the x-axis. In the same way a transition Ty2 occurs at sampling point 9, and transitions Ty3 and Tx2 both occur at sampling point 14. The sampling points at which transitions occur are then joined by a series of straight line vectors LV1, LV2, LV3, LV4 and LV5 to obtain the encoded character depicted in Figure 3.

One problem with the x,y encoded signal corresponding to the encoded character of Figure 3 is that the line vectors LV1, LV2 etc. have arbitrary directions and consequently require a large character information storage capability and relatively complex processing in order to effect character recognition. This can be overcome by further encoding the line vectors LV1, LV2 etc. of Figure 3 using, for example, Freeman encoding.

In Figure 4 of the drawings is set out a set of eight vectors which form the basis of eight-vector Freeman encoding. The vectors are referenced V1 to V8 and are disposed at 45° to one another. Each of the vectors V1 to V8 is bounded by a quadrant Q1, Q2 etc. which extends 22.5° on either side of the respective vector.
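Assigning an arbitrary line vector to one of eight such vectors amounts to quantising its angle into 45 degree sectors centred on the vectors. The sketch below assumes that V1 points along the positive x-axis and that the vectors are numbered anticlockwise; Figure 4 is not reproduced here, so the actual orientation and numbering may differ. The name `freeman_code` is hypothetical.

```python
import math


def freeman_code(dx, dy, n_dirs=8):
    """Return the 1-based vector number whose sector contains (dx, dy).

    Assumes vector 1 lies at 0 degrees and numbering runs anticlockwise.
    """
    sector = 360.0 / n_dirs
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    # Shift by half a sector so each sector is centred on its vector.
    return int((angle + sector / 2) // sector) % n_dirs + 1


print(freeman_code(1, 0))     # 1
print(freeman_code(1, 1))     # 2
print(freeman_code(-1, 0))    # 5
print(freeman_code(1, -0.1))  # 1
```

The half-sector shift is what gives each vector a band extending 22.5 degrees on either side.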

In Figure 5 of the drawings there is depicted the x,y encoded character "a" of Figure 3 after it has been subsequently encoded using the Freeman vectors shown in Figure 4 of the drawings. In Figure 3, the line vector LV1 falls within the quadrant associated with vector V4 of Figure 4 and is replaced by a Freeman vector FV4 in Figure 5. Similarly, line vector LV2 in Figure 3 falls within the quadrant associated with vector V5 in Figure 4 and is replaced by a Freeman vector FV5 in Figure 5. In this way, the line vectors LV1, LV2, LV3, LV4 and LV5 of the x,y encoded character of Figure 3 are encoded into Freeman vectors FV4, FV5, FV7, FV1 and FV6 of Figure 5.

The Freeman vector string of Figure 5, which may be conveniently referred to as 4:5:7:1:6, is then compared to corresponding character information in a data base in order to effect recognition of the character "a" of Figure 1.

However, as has already been mentioned, when a person writes the character "a" of Figure 1 a number of times, it could happen that although these may all be perceived by the human eye to be the character "a", when they are encoded using, for example, the encoding methods described herein, the different hand-written characters "a" may result in very different character encoded signals. It is usual for the data base to include only one set of character information for each character, and the result may be that a number of the hand-written characters "a" will not be recognised because they are too dissimilar from the "standard" character "a" that the data base character information is based on.

In order to alleviate this problem, it is proposed that corresponding ones of the line vectors LV1, LV2, etc in Figure 3 and the Freeman vectors FV4, FV5 etc. of Figure 5 be compared in order to generate one or more "alternative" character strings which may be used, for example, if the "4:5:7:1:6" character string of Figure 5 is not recognised in the character recognition procedure.

The Freeman encoded character of Figure 5 was derived from the vectors V1, V2, etc of Figure 4 in dependence upon the particular quadrants Q1, Q2 etc. of Figure 4 that the line vectors LV1, LV2 etc of Figure 3 extended in.

However, in accordance with the present invention, in order to take account of the variable nature of hand-written characters, it is proposed that where a line vector LV1, LV2, etc. of Figure 3 differs from the corresponding vector V1, V2, etc. of Figure 4 by more than a predetermined threshold amount, e.g. by more than 11.25° (each quadrant Q1, Q2, etc. being 45°), then it is considered that there is poor correspondence between the two and that the line vector in question could be considered to be in the adjacent quadrant, in which case it would be substituted by the adjacent vector V1, V2 etc.
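The poor-correspondence test can be sketched as follows: find the preferred vector, measure how far the line vector deviates from it, and if the deviation exceeds the threshold, offer the neighbouring vector on that side as an alternative. As before, the orientation (V1 at 0 degrees, numbered anticlockwise) and the name `freeman_with_alternative` are assumptions made for illustration.

```python
import math


def freeman_with_alternative(dx, dy, threshold_deg=11.25, n_dirs=8):
    """Return (preferred, alternative) vector numbers for a line vector.

    alternative is None when the line vector lies within threshold_deg of
    the preferred vector, i.e. when the fit is considered good.
    """
    sector = 360.0 / n_dirs
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    index = int((angle + sector / 2) // sector) % n_dirs
    # Signed deviation from the preferred vector, wrapped to [-180, 180).
    deviation = (angle - index * sector + 180.0) % 360.0 - 180.0
    preferred = index + 1
    if abs(deviation) <= threshold_deg:
        return preferred, None
    step = 1 if deviation > 0 else -1
    return preferred, (index + step) % n_dirs + 1


print(freeman_with_alternative(0.9962, 0.0872))    # about 5 degrees
print(freeman_with_alternative(0.9397, 0.3420))    # about 20 degrees
print(freeman_with_alternative(-0.9659, -0.2588))  # about 195 degrees
```

Under the assumed numbering, the three calls give a good fit to vector 1, vector 1 with alternative 2, and vector 5 with alternative 6 respectively.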

For example, if line vector LV2 of Figure 3 differed from vector V5 in Figure 4 by more than 11.25° (towards Freeman vector V6), there would be poor correspondence between line vector LV2 of Figure 3 and vector V5 of Figure 4, and the correct vector for line vector LV2 may be vector V6 in Figure 4 rather than vector V5, which was the first choice. Thus, the alternative Freeman vector string for the encoded character of Figure 5 would be:

4:6:7:1:6

The alternative Freeman vector string and the preferred Freeman vector string can be represented together as

4:(5/6):7:1:6

where (5/6) denotes the preferred vector 5 with its alternative 6, and which corresponds to one degree of alternative match.

It should be appreciated that the alternative match may apply to any of the Freeman vectors FV4, FV5, FV7, FV1 and FV6 of Figure 5 and may apply to more than one of the Freeman vectors. Thus, the number of alternative character encoding signals increases as follows, each bracketed pair giving the preferred vector followed by its alternative:

4:5:7:1:6 - all good fits
4:(5/6):7:1:6 - one degree of alternative matching
(4/3):(5/6):7:1:6 - two degrees of alternative matching
(4/3):(5/6):(7/6):1:6 - three degrees of alternative matching
(4/3):(5/6):(7/6):(1/2):6 - four degrees of alternative matching
(4/3):(5/6):(7/6):(1/2):(6/…) - five degrees of alternative matching

Thus, up to 32 alternative Freeman vector strings could be produced from the originally encoded character of Figure 5, and these can be compared in turn with the character information stored in the data base in order to effect recognition of the character in question, thereby increasing substantially the chances of a correct match being found.
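Generating the alternative strings and trying them in turn can be sketched as below. With five vectors that each have one alternative there are 2^5 = 32 combinations, which agrees with the figure given. The names `candidate_strings` and `recognise` are hypothetical, and the dictionary standing in for the data base is an assumption made for illustration only.

```python
from itertools import product


def candidate_strings(preferred, alternatives):
    """List every vector string, with the all-preferred string first.

    preferred    : list of vector numbers, e.g. [4, 5, 7, 1, 6]
    alternatives : list of the same length, None where the fit is good
    """
    options = [
        (p,) if a is None else (p, a)
        for p, a in zip(preferred, alternatives)
    ]
    return [":".join(map(str, combo)) for combo in product(*options)]


def recognise(preferred, alternatives, database):
    """Return the first data base entry matched by any candidate string."""
    for candidate in candidate_strings(preferred, alternatives):
        if candidate in database:
            return database[candidate]
    return None


strings = candidate_strings([4, 5, 7, 1, 6], [None, 6, None, None, None])
print(strings)  # ['4:5:7:1:6', '4:6:7:1:6']
print(recognise([4, 5, 7, 1, 6], [None, 6, None, None, None], {"4:6:7:1:6": "a"}))  # a
```

Here the preferred string is always tried first. Ordering the remaining candidates by how many substitutions they contain would be a natural refinement, though the text does not prescribe any particular order.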

Although in the embodiment which has been described x,y encoding has been used to encode the digitised output from the graphics tablet, it should be appreciated that other forms of encoding, e.g. threshold encoding, could be used, and also that although Freeman encoding has been used to encode the x,y encoded character of Figure 3, other forms of encoding may be used.

Claims

1. A character encoding system in which a character to be encoded is defined by digital data comprising successive x,y co-ordinate positions corresponding to the shape of said character, and in which the digital data is initially encoded in the form of a series of arbitrary line vectors which are subsequently encoded into a series of pre-defined line vectors to define a character encoded signal, means being provided for comparing corresponding ones of the arbitrary line vectors with the pre-defined line vectors to afford one or more alternative character encoded signals.

2. A system as claimed in claim 1, in which the digital data is initially encoded in the form of a predetermined number of arbitrary line vectors which are subsequently encoded into the same predetermined number of pre-defined line vectors.

3. A system as claimed in claim 1 or claim 2, in which the digital data is initially encoded using threshold encoding.

5. A system as claimed in any preceding claim, in which the arbitrary line vectors are subsequently encoded using Freeman encoding.

6. A character encoding system substantially as hereinbefore described with reference to the accompanying drawings.