GB2430296A - Modifying threshold employing manually checked data - Google Patents

Modifying threshold employing manually checked data

Info

Publication number
GB2430296A
GB2430296A
Authority
GB
United Kingdom
Prior art keywords
data
data items
processor
threshold
confidence score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0518856A
Other versions
GB0518856D0 (en)
Inventor
Xavier Lagardere
Eric Erickson
Sherif Yacoub
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Development Co LP
Original Assignee
Hewlett Packard Development Co LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Development Co LP filed Critical Hewlett Packard Development Co LP
Priority to GB0518856A priority Critical patent/GB2430296A/en
Publication of GB0518856D0 publication Critical patent/GB0518856D0/en
Publication of GB2430296A publication Critical patent/GB2430296A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/033Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F3/0354Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor with detection of 2D relative movements between the device, or an operating part thereof, and a plane or surface, e.g. 2D mice, trackballs, pens or pucks
    • G06F3/03545Pens or stylus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/142Image acquisition using hand-held instruments; Constructional details of the instruments
    • G06V30/1423Image acquisition using hand-held instruments; Constructional details of the instruments the instrument generating sequences of position coordinates corresponding to handwriting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/60Static or dynamic means for assisting the user to position a body part for biometric acquisition
    • G06V40/67Static or dynamic means for assisting the user to position a body part for biometric acquisition by interactive indications to the user
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/19Image acquisition by sensing codes defining pattern positions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Character Discrimination (AREA)

Abstract

A method of using a computer processor 700 to assess data items 752 to determine which data items need checking in a subsequent checking operation before their accuracy is relied upon, and which data items do not. The method comprises comparing confidence scores ascribed to the data items 754 with a threshold confidence score 755 to determine which data items need checking in a subsequent checking operation. The value of the threshold confidence score is automatically controlled by the processor 700. Data items which are accepted are added to a database 756. If the ascribed confidence score 754 is less than the threshold, the data item is checked either manually by a user or automatically. The checked data item may be accepted 760, recognised, or rejected, and this acceptance or rejection is used to modify the threshold score 762. The data items are determined by analysing handwritten forms, i.e. handwriting recognition. The method may also be used with a voice recognition system.

Description

CHECKING DATA
Field of the Invention
This invention relates to checking data and especially, but not exclusively, to apparatus, methods, and software for processing data by checking its accuracy before it is entered onto a database, and to a method of using a computer processor to check data. In particular, but not exclusively, the invention relates to checking data entered in manuscript onto forms.
Background of the Invention
The invention arose from a consideration of a user writing data into data areas of forms using a digital pen of the Anoto™ kind, and the user-written data being allocated to fields associated with the data entry areas for subsequent processing by storing the data on a database. That is an area of the present invention, but it will be appreciated that the invention has wider applicability: for example to other digital pen systems where the position of a pen is known to a computer, and to non-pen systems such as scanning data capture systems which acquire data by scanning a form, or voice recognition systems which acquire data by analysing a voice profile.
It also applies to non-form related areas.
Prior Art
Data is entered onto forms by a user using a pen (digital or otherwise). (An Anoto-type pen need not be used; for example a form could be completed using a normal pen and then scanned into a computer.) The data is entered (or should be entered) into specific predetermined data entry areas on the form. A digital version of the manually entered data is known to a computer (for example either via a digital pen or via scanning the form), and application software, possibly specific to the field, processes the data (e.g. adds it to a database, or evaluates a parameter derived from the data entered in the field). The above scenario overlooks a factor: it is not always clear exactly how the data should be interpreted before it is processed. For example, if the data is entered in handwritten format onto a form, it may not be apparent to a machine converting the data into a digital format exactly what was intended to be written, if a user has poor handwriting for example. The user may make a spelling mistake while the machine has only a list of correctly spelt alternatives against which to compare the data - in this scenario there will also be difficulties in data interpretation.
Forms processing workflows (e.g. in a commercial forms-processing operation or business) are composed of the following steps: data acquisition, data interpretation, data correction, data storage. Usually, for each field in a form, the output of the data interpretation stage is composed of a hypothesis and a system-assessed likelihood measure for this hypothesis, called the confidence score. The data correction stage is skipped (missed out/not performed) for hypotheses that achieve a confidence score higher than a given confidence threshold. On the other hand, if the confidence score is below threshold, the hypothesis is "rejected" and the machine-converted digital data requires correction (e.g. manual correction).
The confidence threshold should be set accurately because it affects the cost (e.g. in manpower, time of performing the process, or financial cost) of running the system, and has a direct implication on the return on investment for a forms processing system. The ultimate goal of a forms processing system is to automate as much as possible of the workflow and avoid human intervention all along the processing chain, from data acquisition to data commit (commit to the database). On the other hand, automatically committing all interpreted data without allowing some of these data to be manually corrected would lead to invalid data being stored into the system databases. Invalid data storage has a cost which is directly related to the business criticality of the data being stored. This cost can be very high, for instance in the case of badly recognised amounts in an automatic cheque processing application.
The "triage" to discriminate (using a computer) between properly recognised data and badly recognised data relies on the system-assessed measure of the quality of the interpreted output. This likelihood measure is the confidence score. This system-assessed score is commonly available as an output data in commercial forms processing packages. It is a standard output value of commercial handwriting recognition engines (systems that convert handwritten text into ASCII strings).
Accurately setting this critical value is a costly process for the following reasons:
* The confidence threshold is often set independently for each field in the form. Indeed, in many cases there is no simple rule that allows the confidence threshold for a field to be inferred from the one previously set for another field, although thresholds set for previous similar fields can be used, where similarity is a function of the properties of the field (e.g. content type, numeric or alphabetic, etc.).
* The average value of the confidence score for a particular field may vary over the lifetime of the system. For instance it is possible that the confidence score increases as the form fillers (humans) get more used to a new system being put in place (new device, new form etc.). These possible variations make it necessary for the person who is "system administrator" to periodically review and adjust the confidence threshold settings for each field of the whole system.
This can take them a lot of time. System administrators can forget to update/review the confidence thresholds, or simply not update them because they do not like doing it and it is (usually) a non-urgent job.
If the confidence threshold is set too high in an effort to ensure that 100% of the data is correct before being committed to the database, it will generate a disproportionately high number of manual corrections of data interpreted by, for example, a handwriting recognition engine.
This could lead users to begin to disregard the indication that a value needs to be corrected, thereby creating the opposite effect of increasing the amount of erroneous data committed to the database.
These difficulties often result in a poorly set confidence threshold value and increase the total costs of running the system, either by feeding too much data into the manual correction sub-process or by committing too much erroneous data into the databases.
Some commercial forms processing packages that include a correction console, like Neurascript's INDICIUS, provide ways for the system administrator to log and replay a sample of input data in order to test recognition with different threshold levels, and to manually set the best threshold for this sample of data, which is supposed to be representative of the actual data that will be processed later on in the deployment environment.
Summary of the Invention
According to a first aspect of the invention there is provided a method of using a computer processor to assess data items to determine which data items need checking in a subsequent, further, checking operation before their accuracy is to be relied upon, and which data items do not need checking in a subsequent, further, checking operation before their accuracy is to be relied upon, the method comprising using the processor to compare confidence scores ascribed to the data items with a threshold confidence score to determine which data items need further subsequent checking in a subsequent checking operation, and in which the value of the threshold confidence score is automatically controlled by the processor.
According to a second aspect of the invention there is provided a data processing system, for processing data having an ascribed confidence score, comprising a processor arranged to compare the ascribed confidence score to a threshold confidence score and accept or reject the data on the basis of said comparison, wherein the processor is arranged to automatically control the threshold score.
According to a third aspect of the invention there is provided a method of making a database comprising using a processor to select data for entry onto the database by comparing an ascribed confidence score associated with the data with a threshold confidence beyond which data is accepted for entry onto the database and using the processor to control automatically the threshold confidence score.
According to a fourth aspect of the invention there is provided a computer program product encoded with software code which when run on a processor is arranged to enable the processor to process data by accepting or rejecting the data on the basis of comparison of an ascribed confidence score associated with the data with a threshold confidence score, and which is arranged to cause the processor to automatically control the threshold confidence value in response to feedback on the levels of: (i) subsequent correction of rejected data, and/or (ii) the levels of subsequent acceptance of uncorrected rejected data.
Advantageously, when the invention according to these four aspects is used, since the processor automatically controls the threshold value, there is no need for a human operator to manually update the threshold value.
System monitoring and administration needs and therefore costs (e.g. financial and manpower time) are reduced.
It should be appreciated that when an aspect of an invention is claimed as a particular category (e.g. as a method, system, data carrier etc.) then protection is sought for that aspect but expressed as a different category of the claim.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 schematically illustrates a sheet of prior art Anoto digital paper;
Figure 2 schematically illustrates an existing Anoto-type digital pen;
Figure 3 schematically illustrates a prior art form having data entry areas;
Figure 4 schematically illustrates the prior art form of Figure 3 and a data capture area for each of the data entry areas;
Figure 5 schematically illustrates the prior art form of Figure 3 that has been completed by a user;
Figures 6a to 6d schematically illustrate various prior art data entry areas;
Figure 7a schematically illustrates a data processing system according to a first embodiment of the present invention;
Figure 7b is a flow diagram illustrating a method of operation of the system shown in Figure 7a;
Figure 8 is a graph showing a distribution of recognised hypotheses as a function of confidence score values;
Figure 9 is a graph showing the correlation between different hypotheses of the system of Figure 7a and the confidence threshold;
Figure 10 shows schematically a manual checking system which is part of the embodiment of the invention of Figure 7a;
Figure 11 shows schematically a display of the system of Figure 10;
Figure 12a is a graph showing the distribution of confidence score values which the system of Figure 7a is able to measure;
Figure 12b is a graph showing the information of the graph of Figure 12a and, in addition, inferred distribution values;
Figure 13 is a flowchart showing operation of the embodiment of the invention of Figure 7a during a transition period;
Figures 14a to 14d are graphs representing the way in which the embodiment of the invention of Figure 7a constructs distributions showing the relationship between hypotheses and the confidence score;
Figure 15 is a flowchart providing further detail of the steps 1302 and 1304 of Figure 13;
Figure 16 is a flowchart showing operation of the system of Figure 7a when it enters a stable mode;
Figure 17 is a schematic logical diagram of the system of Figure 7a;
Figure 18a is a schematic representation of a data sorting system according to a second embodiment of the present invention;
Figure 18b is a flow diagram showing operation of the system of Figure 18a;
Figure 19a is a schematic representation of a database making system according to a third embodiment of the present invention;
Figure 19b is a flow chart showing operation of the system of Figure 19a; and
Figure 20 is a schematic representation of a data carrier according to a fourth embodiment of the present invention.
Detailed Description
It is convenient to discuss the invention in more detail by referring to the prior art Anoto digital pen and paper system, but it will be appreciated that the invention is not restricted to use with any proprietary system.
The prior art Anoto system is described on their website, anotofunctionality.com. However, since the content of websites can change with time, it is to be made clear that the prior art admitted is that which was published on their website no later than the day before the priority date of this patent application. It is also appropriate to include in this application itself a brief review of the Anoto system.
Figure 1 shows schematically an A4 sheet 10 of Anoto digital paper. This comprises a part of a very large non-repeating pattern 12 of dots 14. The overall pattern is large enough to cover 60,000,000 square kilometres. The pattern 12 is made from the dots, which are printed using infra-red absorbing black ink. The dots 14 are spaced by a nominal spacing of 300 μm, but are offset a little way (about 50 μm) from their nominal position, for example north, south, east or west.
In WO 01/126032, a 4x4 array of dots is described, and also a 6x6 array of dots, to define a cell. Each cell has its dots at a unique combination of positions in the pattern space so as to locate the cell in the pattern space.
The dot pattern of an area of the dot pattern space codes for the position of that area in the overall dot pattern space. The contents of WO 01/126032 are hereby incorporated by reference, with special reference to the dot pattern and the pen.
The sheet 10 has a pale grey appearance due to the dots 14.
Figure 2 schematically shows a digital pen 20 adapted to write human readable ink in non-machine-readable IR transparent ink and to read a position dot pattern in infra-red. The pen 20 has a housing 22, a pen processor 24 with access to a pen memory 26, a removable and replaceable ink nib and cartridge unit 28, a pressure sensor 29 adapted to be able to identify when the nib is pressed against a document, an infra-red LED emitter 30 adapted to emit infra-red light, an infra-red sensitive camera 32 (e.g. a CCD or CMOS sensor), a wireless telecommunications transceiver 34, and a removable and replaceable battery 36. The pen 20 also has a visible wavelength warning light 38 (e.g. a red light) positioned so that a user of the pen can see it when they are using the pen, and a vibration unit 40 adapted to vibrate and to cause a user to be able to feel vibrations through the pen. The pen 20 includes a pen clock 24' adapted to associate a time value with position data acquired by the pen.
Such a pen exists today and is available from Anoto as the Logitech io™ pen.
The pen, when in use writing on a page/marking a page, sees a 6x6 array of dots 14 and its pen processor 24 establishes its position in the dot pattern from that image. In use the LED 30 emits infra-red light which is reflected by the page 10 and detected by the camera 32. The dots 14 absorb the infra-red and so are detectable against the generally reflective background.
Of course, the ink of the dots might be especially reflective in order to distinguish them (and the paper less reflective), or they may fluoresce at a different wavelength from the radiation that excites them, the fluorescent wavelength being detected. The dots 14 are detectable against the
background page.
The pen processor 24 processes data acquired by the camera 32 and the transceiver 34 communicates processed information from the pen processor 24 to a remote complementary transceiver (e.g. to a receiver linked to a PC). Typically that information will include information related to where in the dot pattern the pen is, or has been, and its pattern of movement, and the time at which the tip of the pen was at any particular position: position values are time-stamped.
There are times when the pen processor 24 cannot determine its position in pattern space (the overall virtual space defined by the very large dot pattern). For example, if the pen is moved too fast over the pattern the processor cannot process the images fast enough. Also the pen may not be able to see where it is in the dot pattern. This can happen if the page 10 is marked or defaced by colorants, or the pattern covered up with something, or the field of view of the pattern is obscured. The user putting their finger in the way is a common reason why the processor fails to recognise the position of the pen. In order to alert the user to the fact that the pen is not able to determine its position properly, the pen processor 24 is adapted to illuminate the light 38 and cause the vibrator 40 to vibrate. The user gets visual and tactile feedback that the camera is not seeing the dot pattern properly/that the pen is unable to determine its position properly.
Figures 3 to 6, discussed hereinbelow, relate to an invention which is the subject of a different application which was unpublished at the filing date of the current application. The discussion provides background information which is relevant to the understanding of the current invention. It should be appreciated that embodiments of the current invention can include features described with reference to Figures 3 to 6.
Referring to Figure 3 a form 200 is shown having several data entry areas 210. Referring to Figure 4, each data entry area 210 has associated with it a data capture area 212 that surrounds the data entry area 210. Referring to Figure 5, a data item in the form of written data 214 that falls within the boundary of a data capture area 212 is associated with the digital data entry
field associated with that data capture area 212.
Various types of data entry areas are shown in Figures 6a to 6d. Figure 6(a) illustrates a series of one-character boxes 150; such a series of boxes 150 is useful for entering a data item such as a name, address, postcode/zip code or date of birth, where the number of characters is either known (e.g. date of birth) or falls within a narrow range (e.g. the lines of an address). Figure 6(b) illustrates a "free form" box 152 designed to allow a user to write several words in the box 152; such a box 152 may be useful for a user to express comments in his/her own words. Figure 6(c) illustrates a "comb" style data entry area 154, which is generally used for similar types of data as that for which the arrangement of Figure 6(a) is used. Figure 6(d) illustrates a "baseline" type data entry area 156, this type of data entry area often being used when the written data is to take the form of a signature.
Referring to Figure 7A, a data processing system 700 according to a first embodiment of this invention comprises a processor 702, a memory 704, a display 706 and a user input 708. The processor 702 is able to read from and write to the memory 704. The memory stores a database 704a of data items which have been processed through the system 700 and accepted onto the database once they are deemed to be accurate. The memory also stores forms processing software 704b which can be run on the processor 702 to ascribe a confidence score to data items received by the processor 702 from the digital pen transceiver 34 after the pen writes on the form 200 as previously described. The display 706 is used to display information stored in the memory 704 and the user input 708 is used to manually input information to the processor 702.
Referring to Figure 7b, operation of the system 700 will be described. At step 752, the processor 702 receives information relating to the content of a data entry area 210 from the pen 20 via the transceiver 34. When this information is received the processor 702 runs the forms processing software 704b to convert it into digital format and to ascribe a confidence score to the data item at step 754. In another embodiment the pen processor 24 runs software stored on the pen memory which is arranged to ascribe a confidence score to data items captured by the pen 20 before sending the data items to the processor 702.
At step 755, the ascribed confidence score is then compared to a current confidence threshold score by the processor 702. The current confidence threshold score depends upon many factors, as described in more detail below. If the confidence score is higher than the confidence threshold value, the data item is automatically accepted for addition to the database 704a at step 756 with no further review. If the confidence score is less than the threshold value, the data item is sent to a manual checking system 1000 at step 758. The manual checking system 1000 (described in detail below) returns data items suitable for addition to the database 704a to the processor 702. Therefore at step 756, this data item is also added to the database 704a. After the data item has been added to the database, at step 762, the threshold is updated before further data items are processed through the system 700. The updating of the threshold depends upon various factors as outlined below, including factors relating to the way in which previous data items have been processed. Therefore the system 700 is able to learn from previous results the best way in which to adjust the threshold score to achieve a desired, predefined result as more and more data items are processed through the system 700. In this embodiment, the processor 702 automatically computes an optimal value for the confidence threshold corresponding to each field in the form 200 to be processed. This automatic computing is performed continuously, on-the-fly, throughout the execution of the system.
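Purely as an illustration of the flow just described for Figure 7b, the following Python sketch shows one way the accept/check/update loop could be organised. All names (process_item, manual_check, update_threshold) are hypothetical, and the threshold-update rule is left abstract; this is not the patented implementation, only a paraphrase of steps 755 to 762.

```python
# Hypothetical sketch of the Figure 7b loop; names and structure are
# assumptions, not the patented implementation.

def manual_check(item):
    # Placeholder for the operator console of Figures 10 and 11: returns the
    # (possibly corrected) item and whether the operator changed it.
    return item, False

def update_threshold(threshold, history):
    # Placeholder: the description adjusts this from the FR/CR history and
    # per-field criticality; see the later sketches.
    return threshold

def process_item(item, ascribed_score, threshold, database, history):
    """Accept an interpreted data item or route it for checking (steps 755-762)."""
    if ascribed_score >= threshold:
        database.append(item)                        # step 756: accept with no review
        history.append(("accepted", ascribed_score))
    else:
        corrected, was_changed = manual_check(item)  # step 758: manual checking system
        database.append(corrected)                   # step 756/760: add checked item
        # A changed item was correctly rejected (CR); an unchanged one was
        # falsely rejected (FR).  Both outcomes feed the threshold update.
        history.append(("CR" if was_changed else "FR", ascribed_score))
    return update_threshold(threshold, history)      # step 762
```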
In other embodiments, the automatic computing may be performed at intervals, from time to time, for example at predetermined time intervals (e.g. hourly, every two hours, every three hours, daily (e.g. overnight), weekly), periodically at different time intervals (e.g. once on a first day, twice on a second day, once on a third day etc.), whenever the system is initiated, when the system is on, or at any other time to be specified by the user. The automatic computing may be performed after a predetermined number of data items have been processed, or any other suitable measure (e.g. cost) may be used to determine how frequently the computing is to be performed.
A data item having a confidence score above the threshold value is automatically sent to the database 704a since the system 700 is sufficiently certain that the data item captured in the data entry area 210 of the form 200 has been correctly interpreted. A data item having a confidence score below the threshold value is automatically sent to a manual checking system 1000 since the system 700 is not sufficiently certain that the captured data has been correctly interpreted. The manual checking system 1000 is described in more detail below. In other embodiments the checking system may be an automated checking system. For example, it may comprise a data processing system which is more accurate than the initial processing system but must be limited in its use for some reason, e.g. it is more expensive to run.
When comparing the ascribed confidence score to the threshold score for a given field and a given input, the processor 702 provides an interpreted output - consisting of a hypothesis and its confidence score measure - which can fall into one of the four following categories:
* Correct Acceptance (CA): the hypothesis matches the input; the confidence score is above threshold.
* False Acceptance (FA): the hypothesis does not match the input (wrong interpretation); the confidence score is above threshold.
* Correct Rejection (CR): the hypothesis does not match the input (wrong interpretation); the confidence score is below threshold.
* False Rejection (FR): the hypothesis matches the input; the confidence score is below threshold.
The False Acceptance rate defines the overall forms processing workflow accuracy: it represents the amount of erroneous data which is improperly committed into back-end systems, i.e. the database 704a in this embodiment.
The False Rejection rate represents the amount of noise that will go into the correction process and may distract the operators from correcting real misinterpreted data.
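The four outcome categories can be expressed directly in code. The helper below is a minimal sketch, assuming a ground-truth value is available for comparison, which, as the description notes, is in practice only observable for rejected items that pass through the correction stage.

```python
def classify_outcome(hypothesis, ground_truth, confidence, threshold):
    """Return CA, FA, CR or FR for one interpreted field (hypothetical helper)."""
    accepted = confidence >= threshold
    correct = hypothesis == ground_truth
    if accepted and correct:
        return "CA"   # Correct Acceptance
    if accepted and not correct:
        return "FA"   # False Acceptance: erroneous data committed to the database
    if not accepted and not correct:
        return "CR"   # Correct Rejection: correction was genuinely needed
    return "FR"       # False Rejection: unnecessary correction noise
```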
Figure 8 shows a possible example of the distribution of recognised hypotheses as a function of the confidence score values:
* Wrongly recognised hypotheses (curve 802) are either Correct Rejections (confidence below threshold) or False Acceptances (confidence above threshold).
* Correctly recognised hypotheses (curve 804) are either False Rejections (confidence below threshold) or Correct Acceptance (confidence above threshold).
Figure 9 represents the correlation between both the False Acceptance (FA) and False Rejection (FR) rates and the confidence threshold:
* The FR rate increases when the threshold increases (the higher the threshold, the more data items are manually corrected).
* The FA rate decreases when the threshold increases (the higher the threshold, the fewer errors are committed into the database).
It is not possible to automatically measure FA since all data items having a confidence score above the threshold are automatically committed to the database 704a without further review and there is no opportunity to determine whether or not the data was correctly or incorrectly committed.
It is possible to measure FR and CR since a rejected data item goes through the manual checking system 1000 which is arranged to monitor whether the rejected data item was correct or false i.e. whether or not it was necessary to reject the data item - whether it was correctly or incorrectly rejected.
Referring to Figure 10, the manual checking system 1000 comprises the processor 702, the memory 704, the display 706 and the input 708 of the system 700. When the processor 702 receives a data item which has an ascribed confidence score below the threshold score, the data item is displayed on the display 706, which in this embodiment is in the form of a display screen of a PC. Figure 11 shows schematically an example of a typical display. On a left side 1004 of the screen 706 there is displayed an accurate representation 1004 of the data item obtained from the data entry area 210. In this example, the data item is the word "cow" in handwritten format. On the right side 1006 of the screen 706, there is displayed a list of possible words which the processor 702 considers to be the closest matches to the handwritten word. Possible words are accessed by the processor 702 from a dictionary or other list of words/data stored on the memory 704. A confidence score is associated with each alternative word and this is also displayed as a percentage.
The manual checking system 1000 also includes an operator 1002 who compares the data item on the left hand side 1004 of the screen 706 with the interpreted data on the right hand side 1006. The operator 1002 chooses one of the alternative words and indicates this choice via the user input 708, in the form of a keyboard. In other embodiments, there may be only one word displayed as the interpreted data (the most likely word) and if the operator disagrees with this choice then he/she is required to input a correct word via the input 708.
Alternatively there may be no words present which are offered as a choice to the operator - in this case the operator is required to interpret the captured data without any options and input his/her choice of word through the input 708.
After the operator's choice is inputted into the keyboard 708, the data item is processed and the corrected data item is sent for storage onto the database 704a.
The processor 702 also runs a program to monitor the CR rate, the FR rate, the total number of rejected data items and the total number of accepted data items, as explained further below.
In another embodiment the manual checking system 1000 hardware may be
separate from the data processing system 700 hardware i.e. different processors, memories, inputs and displays may be provided.
The automatic confidence threshold optimization system is designed to optimize the amount of False Rejection or False Acceptance. It is based on:
* monitoring the correction process and feeding back data extracted from this monitoring in order to compute the confidence threshold,
* defining a level of criticality for each of the form elements, and using this level of criticality in order to control the adjustment made on the confidence threshold values.
A history of confidence values for each field is stored in the database of the automatic confidence threshold optimization system as a moving window "snapshot" across all pen users (to account for general improvement as a function of most recent learning and experience). The history of confidence values includes the confidence score values associated with the corrected hypotheses (False Rejections and Correct Rejections), and the confidence score values associated with the hypotheses that were committed into the back-end system, i.e. the database 704a, without going through the correction step (False Acceptance and Correct Acceptance).
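A moving-window history of this kind could be held, per field, as a bounded buffer of (outcome, confidence) observations. The class below is an assumed data structure for illustration only, not the patent's; the window depth would in practice be derived from the transition period described later.

```python
from collections import defaultdict, deque

class ConfidenceHistory:
    """Hypothetical per-field moving window of (outcome, confidence) observations."""

    def __init__(self, window_depth=500):
        # One bounded deque per field: old observations fall out automatically.
        self._windows = defaultdict(lambda: deque(maxlen=window_depth))

    def record(self, field, outcome, confidence):
        """outcome is 'FR', 'CR' or 'accepted' (FA and CA are indistinguishable here)."""
        self._windows[field].append((outcome, confidence))

    def scores(self, field, outcome):
        """Confidence scores observed for one outcome category within the window."""
        return [c for o, c in self._windows[field] if o == outcome]
```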
It is assumed that the operator 1002 will not introduce errors in the review process. This means that if the threshold is set so as to require 100% output accuracy, the operator 1002 will correct everything and 100% output accuracy is achieved.
By monitoring the actions being taken by the corrector - the corrector either being the form filler himself or an operator working on a dedicated correction console, i.e. the input 708 - during the correction stage, the system is able to differentiate the cases in which the data are actually corrected (Correct Rejection) from the cases in which they are committed to the database without correction (the recognised output did not actually need to be corrected, and therefore a False Rejection). This information is stored in the database 704a along with the confidence score values.
Figure 12a shows the distribution of confidence score values that the system is able to monitor and that are being stored in the system. A first curve 1204 is the distribution of confidence scores corresponding to False Rejections, a second curve 1202 is the distribution of confidence scores corresponding to Correct Rejections, and a third curve 1206 is the distribution of confidence scores corresponding to the sum of False and Correct Acceptances.
Based on this information, the system infers the complete distribution of the Wrongly Recognised 1202 and Correctly Recognised 1204 hypotheses, as shown in Figure 12b. Known distributions are shown in solid lines and inferred distributions in broken lines for these curves. The system can dynamically modify the rejection threshold value in order to achieve target FA and/or FR rates defined by the user: for instance if the actual FA rate is higher than the target FA rate, the threshold value is increased automatically by the processor 702 so that fewer errors are committed into the databases. Target FA and FR rates may be, for example, input by a user via the user input 708 when the system is started (or they can be system-set by a systems administrator). They can be changed manually at any time upon entry of a password via the input 708.
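One simple way to realise this adjustment is a feedback step: if the estimated FA rate exceeds its target, raise the threshold; if the FR rate exceeds its target, lower it. The estimator, step size and clamping below are assumptions introduced for illustration, not the patent's method; the rates themselves would come from the monitored CR/FR counts and the inferred acceptance distributions.

```python
def adjust_threshold(threshold, est_fa_rate, est_fr_rate,
                     target_fa, target_fr, step=0.01, lo=0.0, hi=1.0):
    """Hypothetical feedback rule nudging the threshold toward target FA/FR rates."""
    if est_fa_rate > target_fa:
        threshold += step          # too many errors committed: reject more items
    elif est_fr_rate > target_fr:
        threshold -= step          # too much correction noise: accept more items
    return min(hi, max(lo, threshold))
```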
When started, the system 700 enters a transition period, which will be described with reference to Figure 13. The confidence values for the different hypothesis categories are captured at step 1302. The processor 702 then checks whether the inferred distributions are stable (according to a preset degree of precision). If the inferred distribution curves 1202, 1204 are not stable to the required, preset degree of precision, then more confidence scores need to be captured and step 1302 is repeated. In this embodiment, stability is measured by comparing the inferred distribution curves 1202, 1204 to expected theoretical model distribution curves, such as normal distribution curves. If the processor determines that the inferred curves are similar enough to the theoretical models, the curves 1202, 1204 are deemed stable. The required degree of precision when assessing this similarity is preset as a confidence interval associated with a confidence level. The confidence interval is a measure of how close actual samples need to be to the model to accept that the model is approximated by the sample. For example, for a confidence level of 95%, if the confidence interval is set at 5, a minimum sample size of 370 is required to provide a stable result approximating a sample population size of 10,000. A minimum sample size of 384 approximates a population size of 1,000,000,000 in another example.
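The sample sizes quoted above (370 for a population of 10,000 and 384 for a population of 1,000,000,000) are consistent with the standard sample-size formula for a 95% confidence level, a confidence interval of 5 percentage points and maximum variance (p = 0.5), followed by a finite-population correction. A short sketch reproducing them, on the assumption that this is the calculation intended:

```python
def required_sample_size(population, interval_pct=5.0, z=1.96, p=0.5):
    """Cochran sample size with finite-population correction (assumed interpretation)."""
    e = interval_pct / 100.0                 # confidence interval in percentage points
    n0 = (z * z * p * (1 - p)) / (e * e)     # infinite-population size, about 384.16
    n = n0 / (1 + (n0 - 1) / population)     # finite-population correction
    return round(n)

print(required_sample_size(10_000))          # -> 370
print(required_sample_size(1_000_000_000))   # -> 384
```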
In another embodiment, the inferred distribution curves may be deemed stable once a predetermined number of samples have been corrected. In general, as more and more samples are corrected, the inferred distribution curves become more stable.
If the inferred distribution curves 1202, 1204 are sufficiently stable, then the transition period ends and the length of the transition period is noted by the processor 702 at step 1306. The length of this transition period gives the depth of the moving history window for subsequently capturing values into the system database 704a. After this transition period is completed, the system can enter a stable mode.
The system 700 constructs the diagram shown in Figures 14a to 14d during the transition period, i.e. during steps 1302 and 1304 of Figure 13. These steps are expanded and described in more detail with reference to Figure 15. Figures 14a to 14d show the iterations in forming the diagram.
The threshold is set to some initial value, say t1 (the selection of the initial value does not impact the results), at step 1502. As samples are corrected in the system, the number of samples to be checked (suspects), the number of correct samples (which required no correction, CR) and the number of corrected samples (which required operator intervention, FR) are obtained at step 1504. In some embodiments the number of accepted samples is also obtained at this stage. At step 1506, the processor 702 checks whether enough samples have been corrected to provide statistical relevance. This is achieved in a similar manner to the assessment of stability of the inferred distribution curves as described above. Collected data is compared to expected values within a required degree of precision to determine whether or not statistical relevance is present. If sufficient samples have not been corrected, then step 1504 is repeated with the same threshold value as previously to collect more samples. If sufficient samples have been corrected (a sample large enough for statistical relevance), the transition mode reaches step 1508 (which is step 1304 of Figure 13), in which the processor 702 checks whether or not the inferred distributions 1202, 1204 are stable. If they are not, the threshold value is changed to, say, t2 at step 1510 and then step 1504 is repeated to calculate the corresponding CR2 and FR2 values. The process continues (the transition period) to sufficiently cover threshold values until, at step 1508/1304, the processor 702 determines that the inferred distributions are sufficiently stable. The final result is the plot of the first 1204' and second 1202' curves as illustrated in Figure 14d.
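The transition-period procedure of Figures 13 and 15 can be paraphrased as a loop over candidate threshold values, collecting CR and FR counts at each value until the inferred distributions stabilise. The sketch below is an assumed rendering of that flow; collect_corrections, statistically_relevant and distributions_stable stand in for the statistical tests described in the text.

```python
def run_transition_period(candidate_thresholds, collect_corrections,
                          statistically_relevant, distributions_stable):
    """Hypothetical Figure 13/15 loop: gather CR/FR counts per threshold until stable."""
    observations = {}                          # threshold -> (suspects, cr_count, fr_count)
    for t in candidate_thresholds:             # t1, t2, ... (initial value is arbitrary)
        suspects = cr = fr = 0
        while not statistically_relevant(suspects):
            s, c, f = collect_corrections(t)   # step 1504: run more samples at threshold t
            suspects, cr, fr = suspects + s, cr + c, fr + f
        observations[t] = (suspects, cr, fr)
        if distributions_stable(observations): # step 1508 / 1304
            break                              # transition period ends (step 1306)
    return observations
```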
Advantageously the transition mode provides an automated set up for the system 700. There is no need for human operator involvement and thus there is a saving in cost, time and overall efficiency.
Referring to Figure 16, when the system 700 enters the stable mode, initially the processor 702 sets the threshold confidence score at a maximum value at step 1602. This leads to all of the results being manually corrected. This is done so that no errors are initially introduced into the database 704a when the threshold is likely to vary most as it "settles down".
As data is introduced to the database 704a, the FR, CR rates are monitored at each stage and the threshold score is decreased iteratively at step 1604.
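A compact way to express this start-up behaviour is to begin at the maximum score, so that every item is reviewed, and then walk the threshold down while the monitored rates remain within their targets. This is an assumed paraphrase of Figure 16, not the patented procedure; rates_within_targets is a placeholder for the FR/CR monitoring described above.

```python
def stable_mode_start(max_score, step, rates_within_targets, min_score=0.0):
    """Hypothetical Figure 16 start-up: begin fully manual, then relax the threshold."""
    threshold = max_score                    # step 1602: every item goes to correction
    while threshold - step >= min_score and rates_within_targets(threshold - step):
        threshold -= step                    # step 1604: iteratively admit more items
    return threshold
```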
When the system runs in stable mode, the confidence threshold for a given field can be adjusted as a function of:
* the history of confidence distributions recorded in the system database 704a, as described above.
* a level of criticality as defined for this field.
After the initial period, the threshold score may be adjusted to increase, decrease or stay the same.
Figure 17 shows a simplified logical diagram of the automatic confidence threshold optimization system.
The level of criticality is defined by some type of system administrator and can represent the business relevance - and therefore the cost of invalid data - of the element, sub-element or set of elements taken into account.
The criticality of a field can be expressed in terms of target FA and/or FR rate. The confidence threshold value for different fields can be set to different levels.
* High security data (e.g. amounts in a cheque processing application) are assigned a lower target FA rate and a higher FR rate.
* Less critical data are assigned a lower FR rate and a higher FA rate.
Once the target FA and FR rates are assigned, the automatic confidence threshold adjustments that are made during the execution of the system 700 tend to match these desired rates.
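In configuration terms, criticality could be as simple as a per-field pair of target rates. The mapping below is a hypothetical example, not part of the patent, pairing high-security and low-criticality fields with the opposing targets just described.

```python
# Assumed per-field configuration: critical fields tolerate almost no false
# acceptances at the price of more manual correction; low-criticality fields
# accept the reverse trade-off.  Field names and rates are illustrative only.
FIELD_TARGETS = {
    "cheque_amount": {"target_fa": 0.001, "target_fr": 0.20},  # high security
    "free_comments": {"target_fa": 0.050, "target_fr": 0.02},  # less critical
}

def targets_for(field_name):
    """Look up the target FA/FR rates driving the automatic threshold for a field."""
    return FIELD_TARGETS.get(field_name, {"target_fa": 0.01, "target_fr": 0.10})
```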
Criticality can also depend on who is the user e.g. a person filling in the form 200. Assuming a manager has a level of approval he can authorise, and different managers have different levels, a big expense for one manager may need closer inspection than a big expense by a higher level manager.
Criticality can also be a function of the data pre-filled or interpreted for the same field, or another field. For example, a $1M transfer is more critical than a $1 transfer, so the threshold can depend on the data item entered and the recognition results, since a low confidence result for $1 is less of a problem than a low confidence result for $1,000,000.
Other parameters may also be taken into account in order to adjust the threshold. For example, the corrections may be stored and used depending on who completed the form (if known to the system via the device user or device ID) and the threshold modified not just by field, but also by user. User profiles could be computed, and the threshold adjustment may take into account the tendency of a particular user to get his/her writing corrected, or not.
The confidence value distributions may be weighted by a function of the position of the form element within the page. For instance correctors may have a tendency to pay less attention to fields located at the bottom of the pages, and the detection that such a field is not being corrected may be
given less importance than for header fields.
Another parameter which may be taken into consideration is the cost of the correction. For example, a customer may be willing to assign a particular budget and would accept the output as it comes. The setup of the threshold is such that the correction cost is within the limit of the pre-assigned budget. Using the distribution curves shown in Figure 12b, and since the cost associated with correcting one sample is known, the correction process designer can translate the budget into a number of samples that can be corrected given that budget. This number is then looked up against the curve in Figure 12b to obtain the threshold to be used in the stable process.
For example, if a business decides that it wants one person to check for errors as their job, it can set the number of data items sent for checking to be about what one person can reasonably check in their working time. Or it may want the ratio of amended to unamended items to be at a level where it retains the checker's interest.
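The budget-driven variant reduces to two steps: convert the budget into the number of items that can be corrected, then choose the highest threshold whose expected rejection workload stays within that number. The sketch below assumes the rejection count is available as a function of the threshold (derived from the Figure 12b distributions); all names are illustrative rather than taken from the patent.

```python
def threshold_for_budget(budget, cost_per_correction, expected_rejections, thresholds):
    """Pick the largest threshold whose correction workload fits the budget (assumed logic)."""
    max_corrections = int(budget / cost_per_correction)   # budget -> affordable sample count
    affordable = [t for t in sorted(thresholds)
                  if expected_rejections(t) <= max_corrections]
    # Raising the threshold sends more items to correction, so the best
    # affordable choice is the highest threshold still within the budget.
    return max(affordable) if affordable else min(thresholds)
```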
The system 700 provides the advantages of:
* automatic system tuning, which reduces the need for system monitoring and administration.
* higher accuracy of the overall data entry process, because the workflow parameters are kept consistent with the quality of the processed data at all times. Large amounts of unnecessary corrections of correctly recognised results are avoided, thereby avoiding the risk that correctors skip isolated correct error detections.
* Easier system setup thanks to automatic confidence threshold setting, and parameters that are related with the business application.
* Optimized return on investment for the forms processing workflows using this system.
* Automatic adaptation to external changes affecting the system without any input - for example, users becoming familiar with layouts of forms 200 with time and their error rates consequently increasing due to poor handwriting.
Referring to Figures 18a and 18b, in a second embodiment of the invention, a data sorting system 1798 comprising a processor 1800 is arranged to sort data items having an ascribed confidence score for checking. The processor 1800 automatically controls a threshold confidence score at step 1802. The threshold score can depend upon many user-defined factors as previously described in relation to the first embodiment's system 700. These include:
* the proportion of rejected data items which was correctly rejected;
* the proportion of rejected data items which was incorrectly rejected;
* the identity of an inputer of the data items;
* the data item content;
* the position on a page, when the data item has been taken from a page;
* the number of rejected data items;
* the number of accepted data items.
The processor 1800 receives a data item 1799 having an ascribed confidence score from any suitable source, e.g. the data item is read from one or more of a digital pen and paper system, a PDA and stylus system, a tablet PC and stylus system, a touch sensitive screen or a voice recognition system.
At step 1804, the processor 1800 compares the ascribed confidence score to the threshold score in order to determine whether or not the data item needs to be sent for checking. If the ascribed confidence score is greater than the threshold score, the data item does not need to be checked. If the ascribed score is less than the threshold score, the data item needs to be checked.
Referring to Figures 19a and 19b, a database making system 1900 according to a third embodiment of the invention comprises the data sorting system 1798 of the second embodiment. The system 1900 is used to make a database 1902. At a first step 1904, the data sorting system 1798 selects a data item for addition to the database 1902 by comparing confidence values ascribed to the data item with a threshold confidence value which is automatically controlled by the processor 1800 as previously described. At step 1906, the accepted data item is added to the database 1902.
Referring to Figure 20, a data carrier 2000, in the form of a CD, according to a fourth embodiment of the invention carries software which when run on a processor instructs the processor to process a data item after comparison of the threshold score with a confidence score which has been ascribed to the data item. The software on the CD 2000 is also arranged to automatically control the threshold score.
Various modifications can be made to the present invention without departing from its scope. For example, it will be appreciated that the form, of any aspect of the invention, may be a physical sheet, such as a sheet of paper or plastic. The physical sheet may permanently carry the form (e.g. a sheet of paper), or it may not: it may temporarily carry, or display, the form. An example of this is the display/graphical user interface of a tablet computer, or Personal Digital Organiser: when the screen of such a device displays a form with data input areas this constitutes a form. When a user writes on the screen with a stylus, they are creating written data on the form.
It is to be understood that the terms "top", "bottom", "left" and "right" take their normal meaning with respect to a medium (such as a piece of paper, or the stylus-sensitive screen of a Tablet Graphical User Interface (GUI)) when written on by a user.
In addition to pen-type/stylus-type inputs, other hand-manipulated input devices could include a mouse-type device. It may even be possible to consider a user's finger as a "pen" if they write with their finger on a position (e.g. touch) sensitive input screen.
"Writing on a form" is intended to cover all of the above, and more.
A stamp, such as a rubber stamp, may be used to apply data to a form, or a fingerprint may be applied to a form. Both processes apply data to a form (i.e. a fingerprint, possibly for fingerprint analysis, may comprise user-applied markings to a form, which could comprise a checking screen).
Therefore, the marks made by a stamp or a finger are also to be considered to fall within the scope of the term "written data", and the processes of applying a stamp or fingerprint are to be considered to fall within the term "writing on a form".
"A form" may have a series of spaces for answers to questions/areas to input data. On the other hand, "a form" could be a single data entry area - for example a signing in box to identify the user of a computer system (possibly to provide access to functionality, or a system, once the identity of the user has been established). Filling in, e.g. by hand, the user ID area on a GUI is filling in a form.
Also, any known type of data carrier may be used - the skilled person understands that data carriers are not limited to CDs - any suitable storage medium may be used including, but not limited to, floppy disc, hard disc, DVD, memory which is hard wired, memory stick, memory card or any other storage medium.

Claims (16)

1. A method of using a computer processor to assess data items to determine which data items need checking in a subsequent, further, checking operation before their accuracy is to be relied upon, and which data items do not need checking in a subsequent, further, checking operation before their accuracy is to be relied upon, the method comprising using the processor to compare confidence scores ascribed to the data items with a threshold confidence score to determine which data items need further subsequent checking in a subsequent checking operation, and in which the value of the threshold confidence score is automatically controlled by the processor.
2. A method as claimed in Claim 1 in which the processor controls the value of the threshold score in response to receiving feedback information on the level of data which has been sent for further, subsequent checking in the subsequent checking operation that is accepted unchanged in the subsequent checking operation and/or that is changed in the subsequent checking operation.
3. A method according to Claim 1 or Claim 2 wherein the processor, at least from time to time, dynamically controls the value of the threshold score as a historic body of assessed data items is built up by the processor.
4. A method according to any preceding claim wherein the processor causes data items which are assessed as having a confidence score above the threshold to be stored in computer memory.
5. A method according to any preceding claim wherein the processor causes data items which are assessed as having a confidence score below the threshold to be displayed to a human user for human checking, and human-instigated correction if correction is needed.
6. A method as claimed in any preceding claim, in which the processor automatically controls the threshold confidence score dependent upon one or more of the following factors: (I) the proportion of rejected data items which was correctly rejected in the assessment of the data items; (II) the proportion of rejected data items which was incorrectly rejected in the assessment of the data items; (III) the identity of an inputer of the data items; (IV) the data content of the data items; (V) the position of the data item on a page, when the data item has been taken from a page; (VI) the number of rejected data items; (VII) the number of accepted data items.
7. A method as claimed in Claim 6, wherein the processor controls the threshold confidence score by comparing the ascribed confidence score of a data item to an existing threshold confidence score, accepting or rejecting the data item on the basis of said comparison and updating the threshold score to control a desired property or value of one or more of (I) the proportion of rejected data items which was correctly rejected in the assessment of the data items; (II) the proportion of rejected data items which was incorrectly rejected in the assessment of the data items; (III) the identity of an inputer of the data items; (IV) the data content of the data items; (V) the position of the data item on a page, when the data item has been taken from a page; (VI) the number of rejected data items; (VII) the number of accepted data items.
8. A method as claimed in any preceding claim, wherein a step of controlling the threshold confidence score comprises assessing a predetermined sample of data and then updating the threshold score.
9. A method as claimed in Claim 8, wherein the processor controls the threshold score on the basis of more recent data as opposed to older, potentially out of date, data.
10. A method as claimed in Claim 8 or Claim 9, wherein the processor determines the number of data items in the sample of data by calculating the number of data items required to be sampled so as to achieve statistically relevant results.
11. A method as claimed in any preceding claim, in which the processor automatically sets the confidence threshold score when the system is initialised.
12. A method as claimed in any preceding claim, further comprising converting an image of data into digital format and ascribing a confidence score based upon the conversion.
13. A method according to any preceding claim in which either (i) the processor ascribes confidence scores to the data items before determining which data items need subsequent checking; or (ii) the data items already have associated ascribed confidence scores when they are processed by the processor to determine which data items need subsequent checking.
14. A data processing system, for processing data having an ascribed confidence score, comprising a processor arranged to compare the ascribed confidence score to a threshold confidence score and accept or reject the data on the basis of said comparison, wherein the processor is arranged to automatically control the threshold score.
15. A method of making a database comprising using a processor to select data for entry onto the database by comparing an ascribed confidence score associated with the data with a threshold confidence beyond which data is accepted for entry onto the database and using the processor to control automatically the threshold confidence score.
16. A computer program product encoded with software code which when run on a processor is arranged to enable the processor to process data by accepting or rejecting the data on the basis of comparison of an ascribed confidence score associated with the data with a threshold confidence score, and which is arranged to cause the processor to automatically control the threshold confidence value in response to feedback on the levels of: (i) subsequent correction of rejected data, and/or (ii) the levels of subsequent acceptance of uncorrected rejected data.
GB0518856A 2005-09-16 2005-09-16 Modifying threshold employing manually checked data Withdrawn GB2430296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0518856A GB2430296A (en) 2005-09-16 2005-09-16 Modifying threshold employing manually checked data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0518856A GB2430296A (en) 2005-09-16 2005-09-16 Modifying threshold employing manually checked data

Publications (2)

Publication Number Publication Date
GB0518856D0 GB0518856D0 (en) 2005-10-26
GB2430296A true GB2430296A (en) 2007-03-21

Family

ID=35248843

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0518856A Withdrawn GB2430296A (en) 2005-09-16 2005-09-16 Modifying threshold employing manually checked data

Country Status (1)

Country Link
GB (1) GB2430296A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2494976A (en) * 2011-09-23 2013-03-27 Ibm Handwriting recognition where unrecognised input is stored as a prototype

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07319995A (en) * 1994-05-27 1995-12-08 Toshiba Corp Method and device for inputting hand-written information, and method and device for recognizing character
US20020061137 (en) * 1997-12-19 2002-05-23 Kitahiro Kaneda Communication system and control method thereof, and computer-readable memory
US20030218061A1 (en) * 2002-05-23 2003-11-27 Parascript Llc Distributed signature verification with dynamic database of reference signatures

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07319995A (en) * 1994-05-27 1995-12-08 Toshiba Corp Method and device for inputting hand-written information, and method and device for recognizing character
US20020061137 (en) * 1997-12-19 2002-05-23 Kitahiro Kaneda Communication system and control method thereof, and computer-readable memory
US20030218061A1 (en) * 2002-05-23 2003-11-27 Parascript Llc Distributed signature verification with dynamic database of reference signatures

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2494976A (en) * 2011-09-23 2013-03-27 Ibm Handwriting recognition where unrecognised input is stored as a prototype
GB2494976B (en) * 2011-09-23 2015-06-10 Ibm Tactile input handwriting recognition

Also Published As

Publication number Publication date
GB0518856D0 (en) 2005-10-26

Similar Documents

Publication Publication Date Title
JP6528147B2 (en) Accounting data entry support system, method and program
US7742642B2 (en) System and method for automated reading of handwriting
US8897563B1 (en) Systems and methods for automatically processing electronic documents
US5889897A (en) Methodology for OCR error checking through text image regeneration
CN110555403A (en) handwritten character evaluation method and system
US20110249905A1 (en) Systems and methods for automatically extracting data from electronic documents including tables
CN115204110A (en) Extracting searchable information from digitized documents
US8989497B2 (en) Handwritten character input device, remote device, and electronic information terminal
USRE35738E (en) Data entry and error embedding system
US8768241B2 (en) System and method for representing digital assessments
US20070272753A1 (en) Optical mark reader
US20210149931A1 (en) Scalable form matching
JP2021502628A (en) Image processing method and image processing system
US11200450B2 (en) Information processing apparatus and non-transitory computer readable medium for selecting a proper version of a recognition dictionary that is not necessarily a latest version
GB2430296A (en) Modifying threshold employing manually checked data
US6320985B1 (en) Apparatus and method for augmenting data in handwriting recognition system
US20220044048A1 (en) System and method to recognise characters from an image
CN115690815A (en) Paper job processing method, device, equipment and storage medium
JP2011107966A (en) Document processor
US20110019916A1 (en) Interactive document reading
KR102293361B1 (en) Apparatus and method for providing handwriting calibration function
CN111709499A (en) Test paper scoring system and method based on random two-dimensional code
US20220012481A1 (en) Information processing apparatus and non-transitory computer readable medium storing program
Loke et al. A software application for survey form design and processing for scientific use
US20210064867A1 (en) Information processing apparatus and non-transitory computer readable medium

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)