GB2452067A

GB2452067A - Method for prediction and diagnosis of medical conditions

Info

Publication number: GB2452067A
Application number: GB0716460A
Authority: GB
Inventors: Emmanuel C Ifeachor; Viktoriya Stalbovskaya
Original assignee: Plymouth University
Current assignee: Plymouth University
Priority date: 2007-08-23
Filing date: 2007-08-23
Publication date: 2009-02-25
Also published as: WO2009024796A1; GB0716460D0

Abstract

A method of characterising a medical condition (e.g. tumour, ovarian cancer) of a subject comprises the steps of: a) providing a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x_i conducted on the subject; b) calculating a diagnostic coefficient DC_i for each result of an analysis x_i; c) calculating an importance factor J_i for each result of an analysis x_i; d) specifying an acceptable level of error in the characterisation; e) determining the thresholds for the diagnostic coefficients DC_i using the error level specified in step (d) and defining a threshold range therebetween; f) comparing the value of the diagnostic coefficient DC_i of the analysis x_i with the highest importance factor J_i, DC_max, with the thresholds determined in step (e); g) successively summing the diagnostic coefficient DC_max with the diagnostic coefficient DC_i having the next highest importance factor J_i until the value of the sum lies outside the threshold range defined in step (e); and h) identifying the threshold exceeded in step (g). The method preferably comprises the step of converting continuous data in the symptom data set into discrete data such as binary data.

Description

METHOD FOR PREDICTION AND DIAGNOSIS OF MEDICAL CONDITIONS AND APPARATUS FOR PERFORMING THE SAME

The present invention relates to a method for use in assisting a medical practitioner in making a diagnosis of the condition of a subject or patient, in particular in determining the nature of the condition. The present invention further relates to an apparatus for performing the method.

The assessment and diagnosis of the condition of a patient or subject may be divided into two aspects. The first aspect is the correct determination of the particular condition that is ailing the subject. The second aspect is the nature or severity of that condition, in particular whether the condition is malignant or benign.

The proper determination of the second aspect is particularly useful in deciding upon the most appropriate and efficient form of treatment. This is particularly important when a health service or provider is treating a large number of patients with limited resources, as is generally the case.

Cancer is the second-leading cause of death in the UK after heart diseases. Each year, around 130,000 people die from cancer in the UK alone and about 225,000 new cases are diagnosed. These figures are currently increasing by about 1.4% per annum. In 2003, the NHS invested £639m mainly on chemotherapy alone.

However, the earlier a diagnosis of cancer is made, the more optimistic the prognosis can be and the less aggressive the therapy (for example, a conservative operation without adjuvant chemo- and radiotherapy). For example, if ovarian cancer is detected at FIGO stage I, the 5-year disease-free survival rate is over 80%. The use of the optimum treatment strategy can increase the effectiveness of treatment and minimise side effects experienced by the patient, many of which can be severe.

The trend in many clinical areas of the treatment of cancer is towards personalisation of diagnosis and treatment because of the heterogeneity of the disease and differences in individual patients. Conventional prognostic criteria for various types of cancer include histological staging, lymph node status, TNM system, proliferation index, Nottingham prognostic index and risk of malignancy index for ovarian tumours. However, the predictive power of conventional diagnostic and prognostic markers is limited and therefore not adequate for the individualisation of prediction of care and the response to treatment. Tumours with similar histopathological appearance, for example, can follow significantly different clinical courses and patients with similar diagnosis show markedly different responses to treatment.

New and emerging high throughput technologies such as genomics and proteomics have the potential to provide an insight into individual differences in patients and an opportunity to improve diagnosis and care on an individual basis.

Recent studies in genomics/proteomics of cancer have identified potentially useful 'cancer-specific signatures' and biomarkers (Van't Veer et al., 'Gene expression profiling predicts clinical outcome of breast cancer', Nature, 415 (6871), pages 530-6, 2002; Petricoin, E.F., Ardekani, A.M., Hitt et al., 'Use of proteomic patterns in serum to identify ovarian cancer', Lancet, 2002; 359 (9306), pages 572-7; Golub TR, Slonim DK, Tamayo P et al., 'Molecular classification of cancer: class discovery and class prediction by gene expression monitoring', Science, 286 (5439), pages 531-7, 1999; Crijns A.P et al., 'Molecular prognostic markers in ovarian cancer: toward patient-tailored therapy', Int J Gynecol Cancer, Suppl 1, pages 152-65, 2006).

For example, the discovery of the marker HER-2 made it possible to identify subgroups of breast cancer patients who will benefit from adjuvant chemotherapy with Trastuzumab (Romond EH et al., 'Trastuzumab plus adjuvant chemotherapy for operable HER2-positive breast cancer', N Engl J Med., 353 (16), pages 1673-84, 2005).

Potentially up-regulated genes associated with breast cancer have recently been identified and are being tested as potential biomarkers of breast cancer (Piccart MJ, Loi S, Van't Veer L, et al., 'Multi-center external validation study of the Amsterdam 70-gene prognostic signature in node negative untreated breast cancer: are the results still outperforming the clinical-pathological criteria?', Breast Cancer Res. Treat., 88 (suppl 1), Abstr 38, 2004). However, there are enormous research challenges to be addressed to determine whether such methods can satisfy the high expectations (for example the ability to tailor therapy on the basis of biological findings) as well as overcome the relevant biotechnological challenges. Further, in clinical practice, decision-making is still largely based on clinical data alone and a great deal of work remains to understand the information derived from genomic/proteomic data and how to integrate this information with clinical data when appropriate. In some cases, clinical data alone are adequate for diagnosis because of the clinico-pathological signs of the malignancy. However, this is by no means always the case.

There are two key problems to be addressed, which are important prerequisites for successful patient-tailored diagnosis and treatment for cancer. First, there is a need for the development of a methodology for quantifying a patient's health/disease status. Further, it would be most advantageous to have a model for handling the diversity and complexity of the diagnostic problem. In particular, the method should preferably be able to handle a wide range of data obtained from the results of tests and examinations performed on a patient, as well as being able to accommodate data sets that may be incomplete.

Providing the first prerequisite will enable the individualised approach to the care of patients, in particular oncological patients, because, in contrast to a traditional cancer staging system, quantitative assessment of a patient's health status can handle individual peculiarities and allows fine grading of a patient's condition.

The problems associated with the second prerequisite can be illustrated with a simple example. In clinical practice, some cases of cancer have clear clinico-pathological signs of malignancy so that only ultrasound examination might be needed to make a diagnostic decision and a referral for operation; others might require a thorough examination, including measurement of genetic markers and invasive diagnostic procedures. Thus, catering for these differing requirements calls for the development of a flexible model which can utilise a variable number of modalities.

It would be most desirable if a novel method for quantitative assessment of a patient's health status could be provided which combines multimodal data from macro-, micro- and nano-levels. Preferably, the method should integrate patient information from different modalities (clinical, imaging, laboratory, genomics, etc.) to produce a composite index, with an appropriate confidence measure assigned to each modality.

It would be particularly advantageous if such a method could be provided for the assessment of cancers and tumours, for example ovarian tumours. Ovarian tumours are common among women. In Europe and North America the age-adjusted standardised incidence rate of ovarian cancer is over 10 per 100,000 women. Preoperative prediction of malignancy of ovarian tumours is very important, because it can prevent unnecessary surgery for benign functional cysts and, in the case of benign neoplastic lesions, only minimal surgical intervention would be required. On the other hand, patients with malignant forms of tumour require not only surgical operation but also appropriate pre-, peri- and postoperative management. A great deal of effort has been put in by gynaecological oncologists in order to develop preoperative predictive markers of ovarian malignancy. However, prospective testing of these markers has shown either low performance or unbalanced results (i.e. high specificity and low sensitivity). To address the limitations of previous studies the International Ovarian Tumour Analysis (IOTA) Group has established multicentre prospective clinical trials with more than six centres working to the same protocol and collecting data from a total of 1000 patients who have a persistent adnexal mass. For clinical acceptance, a predictive model for discrimination of ovarian tumours should preferably satisfy the following requirements: (i) have reasonably high sensitivity and specificity levels, typically 90% and 75%, respectively; (ii) be interpretable; and (iii) use as few diagnostic techniques/parameters as possible.

In relation to (iii), the range of laboratory and instrumental diagnostic techniques for ovarian cancer is wide and includes transvaginal and transabdominal ultrasonography, serum tumour markers, laparoscopy, computer tomography and magnetic resonance imaging. A key problem is in the choice of necessary procedures taking into account their diagnostic value, cost and invasiveness.

Accordingly, there is a need for a method of assessing tumours and cancerous conditions of a patient in particular, and for assessing other clinical conditions in general, that meets the aforementioned criteria and needs.

In a first aspect, the present invention provides a method of characterising a medical condition of a subject, the method comprising the steps of:
a) provide a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x_i conducted on the subject;
b) calculate a diagnostic coefficient DC_i for each result of an analysis x_i;
c) calculate an importance factor J_i for each result of an analysis x_i;
d) specify an acceptable level of error in the characterisation;
e) determine the thresholds for the diagnostic coefficients DC_i using the error levels specified in step (d) and define a threshold range therebetween;
f) compare the value of the diagnostic coefficient DC_i of the analysis x_i with the highest importance factor J_i, DC_max, with the thresholds determined in step (e);
g) successively sum the diagnostic coefficient DC_max with the diagnostic coefficient DC_i having the next highest importance factor J_i until the value of the sum lies outside the threshold range defined in step (e); and
h) identify the threshold exceeded in step (g).

The method of the first aspect of the present invention allows the modelling of the preoperative diagnosis of conditions of a patient, in particular cancerous conditions and tumours, especially ovarian tumours. The method is based on the Sequential Nonuniform Procedure (SNuP), which meets the requirements above.

SNuP is based on a form of Bayes classification, but with additional restrictions. In particular, consecutive multiplication of likelihood ratios of input variables is interrupted when one of the diagnostic thresholds is reached. Values of thresholds are specified according to an acceptable level of the diagnostic errors. The SNuP operates sequentially on the variables (features) as the cases (observations) are accumulated. This is significant, as it allows the method to provide a personalised differential diagnosis. This is achieved by varying the number of attributes used, ranking the variables according to their discriminative relevance and the specified confidence level.

In the first step of the method a set of data obtained from an analysis and/or examination of the subject is provided. The set of data contains the result of at least one test, analysis, investigation or examination carried out on or in respect of the subject. In many cases, the set of data will contain two or more such results.

Examples of symptom analyses x_i that may be obtained to generate the set of symptom data for a female subject suspected of suffering from ovarian cancer are set out in Table I below. The set of symptom data contains at least one result from an analysis of a symptom x_i.

Table I

ANALYSIS x_i                                  TYPE OF RESULT
Age (Age)                                     cont.
Menopause state (Meno)                        binary
Amount of blood flow (Col score)              nominal
Level of serum CA 125 (CA125)                 cont.
Pulsatility index (PI)                        cont.
Resistance index (RI)                         cont.
Peak systolic velocity (PSV)                  cont.
Time-averaged mean velocity (TAMX)            cont.
Ascites (Asc)                                 binary
Unilocular cyst (Un)                          binary
Unilocular solid (UnSol)                      binary
Multilocular cyst (Mul)                       binary
Multilocular solid (MulSol)                   binary
Solid tumour (Sol)                            binary
Bilateral mass (Bilat)                        binary
Smooth wall (Smooth)                          binary
Irregular wall (Irreg)                        binary
Papillations (Pap)                            binary
Septa > 3 mm (Sept)                           binary
Acoustic shadows (Shadows)                    binary
Anechoic cystic content (Lucent)              binary
Low level echogenicity (Low-level)            binary
Mixed echogenicity (Mixed)                    binary
Ground glass cyst (G.Glass)                   binary
Hemorrhagic cyst (Haem)                       binary

Output variable
Pathology result (Path)                       binary

Indices
Ultrasound score (Morph)                      nominal
Jacobs index (Jacobs)                         nominal
Risk of malignancy index (RMI)                cont.

Transformed variables
Rather strong blood flow (Col3)               binary
Very strong blood flow (Col4)                 binary
CA 125 > 30 U/ml (C CA125)                    binary

In Table I, the form of data for each of the analyses is indicated and may be continuous (cont.), binary (that is assigned 1 or 0) or nominal (that is having a discrete value, such as 0, 1, 2 etc. depending upon the outcome of the analysis).

Similar sets of result data may be compiled for other cancerous and non-cancerous conditions.

In one embodiment, the method of the present invention employs purely discrete data, in particular binary data. In this respect, discrete data is to be considered as being data that have a discrete value, such as may result from a test or investigation that produces an indication that is merely 'low', 'medium' or 'high'.

Similarly, binary data is to be considered as the data resulting from a test, examination or analysis of the subject that can give one of two results. For example, included in the above list is the menopausal state of the female subject, who may either be menopausal or not menopausal. However, many analyses or tests conducted on a subject do not yield a simple binary result, but rather provide a continuous result that may take any value within the result range. The type of result data is specified for each data set in Table I.

In the present method, the reduction of the continuous result data to a discrete data set, in particular a binary data set, may be achieved in a number of ways. First, the analysis or test may be redefined to produce a binary result. For example, the level of serum CA 125, while a continuous result rather than a binary one, may be redefined to specify a minimum level of the serum CA 125 (for example 30 U/ml), allowing the result to be presented as a binary result, that is either above the specified minimum level, or at or below the specified minimum level. However, this manual setting of a specified value, such as a minimum or threshold value, is not preferred and can lead to inefficiencies in the system.

Alternatively, and most preferably, the continuous result data are converted into discrete result data, for example binary data. This conversion may be achieved using mathematical manipulations known in the art. In one embodiment, the conversion of the continuous result data to discrete data is achieved using fuzzy logic. Suitable fuzzy logic techniques are known in the art. One preferred method for converting the continuous result data into discrete data is the use of clustering techniques, in particular univariate and multivariate clustering. A preferred clustering technique is disclosed in J. MacQueen, 'Some methods for classification and analysis of multivariate observations', Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 1967, pages 281 to 297. In particular, continuous result data may be transformed to ordinal data by partitioning the initial input space. Thereafter these variables are analysed by SNuP in the regular way.

Automatic partition of the input space for continuous variables may be performed by applying k-means clustering, as described by J. MacQueen aforementioned, with three clusters. Squared Euclidean distance may be used as a distance measure, initial centroid positions of clusters being selected randomly. If a cluster loses all of its member observations, that cluster is removed. Assignment of continuous variables from the test set to clusters may be made on the basis of minimal squared Euclidean distance to one of the centroids identified at the training stage.
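The partitioning described above might be sketched as follows for a single continuous variable. This is a minimal illustration only, not the implementation used in the invention: the function names `kmeans_1d` and `assign_level`, the iteration limit, the fixed random seed and the example readings are all assumptions introduced here.

```python
import random

def kmeans_1d(values, k=3, iters=100, seed=0):
    """Cluster scalar values with k-means; returns the sorted centroids."""
    rng = random.Random(seed)
    # Initial centroids are picked at random from the distinct observed values.
    distinct = sorted(set(values))
    centroids = rng.sample(distinct, min(k, len(distinct)))
    for _ in range(iters):
        # Assign each observation to its nearest centroid (squared Euclidean distance).
        members = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda c: (v - centroids[c]) ** 2)
            members[nearest].append(v)
        # Recompute centroids; a cluster that lost all of its members is removed.
        updated = [sum(m) / len(m) for m in members if m]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

def assign_level(value, centroids):
    """Map a new (test-set) value to the ordinal index of its nearest centroid."""
    return min(range(len(centroids)), key=lambda c: (value - centroids[c]) ** 2)

# Hypothetical serum CA 125 readings (U/ml), invented purely for illustration.
training = [8, 11, 14, 19, 22, 25, 180, 210, 240, 900, 1100]
centroids = kmeans_1d(training)
levels = [assign_level(v, centroids) for v in [12, 200, 950]]
```

Because the centroids are returned in ascending order, the cluster index can be read as an ordinal level (for example low, medium, high) before the variable is passed on to SNuP.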

Accordingly, in a further aspect, the present invention provides a method of characterising a medical condition of a subject, the method comprising the steps of:
a) provide a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x_i conducted on the subject;
a1) convert continuous result data present in the set of symptom data into discrete result data;
b) calculate a diagnostic coefficient DC_i for each result of an analysis x_i;
c) calculate an importance factor J_i for each result of an analysis x_i;
d) specify an acceptable level of error in the characterisation;
e) determine the thresholds for the diagnostic coefficients DC_i using the error levels specified in step (d) and define a threshold range therebetween;
f) compare the value of the diagnostic coefficient DC_i of the analysis x_i with the highest importance factor J_i, DC_max, with the thresholds determined in step (e);
g) if DC_max does not exceed one of the thresholds determined in step (e), successively sum the diagnostic coefficient DC_max with the diagnostic coefficient DC_i having the next highest importance factor J_i until the value of the sum lies outside the threshold range defined in step (e); and
h) identify the threshold exceeded in step (g).

Once the set of symptom data has been compiled and provided, each result of an analysis x_i is used to calculate a diagnostic coefficient DC_i. The diagnostic coefficient DC_i is derived from the probability of the subject having a particular condition given a specific result from the analysis x_i.

The diagnostic coefficient DC_i may be derived from the result of an analysis x_i as follows.

The key issue in diagnosis is to determine whether a given subject belongs to one of two groups, that is the groups of having a particular condition, or not, given the symptoms expressed by the subject and the laboratory data obtained from one or more tests. In the case of a tumour, the key issue is whether the subject belongs to one of the following two groups: benign or malignant tumour. The task can be viewed as a two-class classification problem (A_k, where k = 1, 2), given a vector of input variables, x.

Denoting P(A_k) as the prior probability of class k, k = 1, ..., n, with n being the number of classes (groups); P(x_i | A_k) is the conditional probability of x_i given A_k, that is the probability of presence of symptom x_i in the group A_k; P(x_i) is the prior probability of symptom x_i. The posterior probability of the subject belonging to group A_k having symptom x_i can be defined using Bayes' theorem as follows:

$$P(A_k \mid x_i) = \frac{P(A_k)\,P(x_i \mid A_k)}{P(x_i)} \tag{1}$$

Sequential non-uniform procedure produces a model of classification into two groups. The ratio of conditional probabilities of the groups is equal to the ratio of the symptom's occurrences in the two groups, as follows:

$$\frac{P(A_1 \mid x_i)}{P(A_2 \mid x_i)} = \frac{P(x_i \mid A_1)}{P(x_i \mid A_2)} \tag{2}$$

where P(A_1 | x_i) / P(A_2 | x_i) is a likelihood ratio of the probability of a group given symptom x_i, and P(x_i | A_1) / P(x_i | A_2) is a likelihood ratio of the probability of the symptom x_i given groups A_k.

Accumulation of the diagnostic information given the presence of independent features/symptoms x_1, x_2, ..., x_n is performed as follows:

$$\frac{P(A_1 \mid x_1, x_2, \ldots, x_n)}{P(A_2 \mid x_1, x_2, \ldots, x_n)} = \prod_{i=1}^{n} \frac{P(x_i \mid A_1)}{P(x_i \mid A_2)} \tag{3}$$

In order to remove the multiplication operation in the right hand part of formula (3), the relationship may be transformed into a summation by taking a logarithm and introducing a scaling factor of 10. The diagnostic coefficient DC_i of symptom x_i is a score value which is defined as follows:

$$DC_i = 10 \log_{10} \frac{P(x_i \mid A_1)}{P(x_i \mid A_2)} \tag{4}$$

When the probability of the symptom x_i is higher in group A_1 than in group A_2, the value of DC_i is greater than 0. When the probability of the symptom x_i is higher in group A_2, the value of DC_i is less than 0.
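As an illustration of formula (4), the diagnostic coefficient might be computed as below. This is a sketch under stated assumptions: the function name `diagnostic_coefficient` and the example frequencies are hypothetical, and both conditional probabilities are assumed to be non-zero.

```python
import math

def diagnostic_coefficient(p_given_a1, p_given_a2):
    """Formula (4): DC = 10 * log10( P(x | A1) / P(x | A2) ).

    Both probabilities are assumed to be strictly positive.
    """
    return 10.0 * math.log10(p_given_a1 / p_given_a2)

# Hypothetical frequencies of a symptom value in the two groups.
dc_favours_a1 = diagnostic_coefficient(0.60, 0.15)  # symptom more common in A1, so DC > 0
dc_favours_a2 = diagnostic_coefficient(0.10, 0.40)  # symptom more common in A2, so DC < 0

# Summing coefficients corresponds to multiplying likelihood ratios, as in formula (3).
combined = dc_favours_a1 + dc_favours_a2
```

With these invented frequencies the two likelihood ratios are 4 and 0.25, so the two coefficients are equal in magnitude and opposite in sign and their sum is approximately zero.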

In the method of the present invention, the next step is to determine an importance factor J_i for each result of the analysis or symptom x_i. The importance factor is required in order to rank the analytical and test results and symptoms, as described hereinafter. The importance factor J_i may be determined as follows.

The feature selection process and ranking of input variables/symptoms is based on the calculation of the symmetrised Kullback-Leibler divergence between two distributions, P and Q, the so-called J-divergence, as described in H. Jeffreys, 'An invariant form for the prior probability in estimation problems', J. R. Statist. Soc., Vol. A, pages 453 to 469, 1946, as follows:

$$J(P, Q) = \frac{D(P \,\|\, Q) + D(Q \,\|\, P)}{2} \tag{5}$$

where D is the Kullback-Leibler divergence, so that

$$D(P \,\|\, Q) = \sum_{j=1}^{m} P_j \log \frac{P_j}{Q_j}; \quad \text{and} \quad D(Q \,\|\, P) = \sum_{j=1}^{m} Q_j \log \frac{Q_j}{P_j}$$

and m is the number of distinct values of the variable (for example, 'low', 'normal', 'high').

The J-divergence for distinct values P_j and Q_j is defined as follows:

$$J(P_j, Q_j) = \frac{P_j \log \frac{P_j}{Q_j} + Q_j \log \frac{Q_j}{P_j}}{2} = \frac{P_j \log \frac{P_j}{Q_j} - Q_j \log \frac{P_j}{Q_j}}{2} = \frac{P_j - Q_j}{2} \log \frac{P_j}{Q_j}$$

Substituting for P_j and Q_j the conditional probabilities P(x_ij | A_1) and P(x_ij | A_2) and, to be consistent with the definition of DC_i, scaling formula (5), the J-divergence of the distinct value of the variable becomes:

$$J(x_{ij}) = \frac{P(x_{ij} \mid A_1) - P(x_{ij} \mid A_2)}{2} \cdot 10 \log_{10} \frac{P(x_{ij} \mid A_1)}{P(x_{ij} \mid A_2)} \tag{6}$$

The importance factor J_i for the symptom or result x_i is the sum of J(x_ij) over all the distinct values of the variable, as follows:

$$J_i = \sum_{j=1}^{m} J(x_{ij}) \tag{7}$$

Once the importance factors J_i have been obtained for each symptom or result x_i, the symptoms or results are ranked according to their importance factors, with the symptom or result having the highest importance factor being ranked highest. The subsequent processing of the diagnostic coefficients DC_i is applied to each symptom or result x_i according to the ranking of its importance factor J_i, as will be described hereinafter.
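The calculation and ranking of importance factors might be sketched as follows. The function names `j_value` and `importance_factor` and the conditional distributions are hypothetical, chosen only to illustrate formula (6) and the summation over the distinct values of each variable; all probabilities are assumed to be non-zero.

```python
import math

def j_value(p1, p2):
    """Formula (6): contribution of one distinct value of a variable.

    p1 = P(x_ij | A1), p2 = P(x_ij | A2); both assumed strictly positive.
    """
    return (p1 - p2) / 2.0 * 10.0 * math.log10(p1 / p2)

def importance_factor(dist_a1, dist_a2):
    """Sum of formula (6) over all distinct values of one variable."""
    return sum(j_value(p1, p2) for p1, p2 in zip(dist_a1, dist_a2))

# Hypothetical distributions over the distinct values of three variables,
# given as (P(. | A1), P(. | A2)); the numbers are invented for illustration.
variables = {
    "Asc": ([0.95, 0.05], [0.60, 0.40]),
    "Meno": ([0.55, 0.45], [0.45, 0.55]),
    "Col score": ([0.50, 0.30, 0.20], [0.10, 0.30, 0.60]),
}

# Rank the variables by decreasing importance factor.
ranking = sorted(
    variables,
    key=lambda name: importance_factor(*variables[name]),
    reverse=True,
)
```

Each term of the sum is non-negative, because the difference of the two probabilities and the logarithm of their ratio always share the same sign, so a variable whose distribution is identical in both groups receives an importance factor of zero.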

The method of the present invention requires the determination of threshold values for the diagnostic coefficients DC_i, that is the threshold values for the ratio set out in formula (3). This in turn requires that the possible errors in the values of the symptom or result data x_i are taken into account. In the case of a subject under investigation for a given condition, two types of error may be identified: first, the subject may be diagnosed as having the given condition, when this is incorrect and the condition is not present; and, second, the subject may be diagnosed as not having the given condition, when the condition is in fact present. These errors may be termed α and β. Thus, in the case of a cancer or tumour, in terms of the 'malignant-benign' classification, α specifies the probability of false assignment of a patient with a malignant tumour into a benign tumour group, and β specifies the probability of false assignment of a patient with a benign tumour into a malignant tumour group.

In terms of classification into groups A_1 and A_2, α is the rate of misclassification into group A_1, and β is the rate of misclassification into group A_2.

The threshold for a diagnostic hypothesis is the minimum acceptable rate of correct diagnoses over incorrect ones. Denoting A+ as a correct diagnosis and A- as an incorrect diagnosis, the probabilities of correct and incorrect diagnoses in the groups are P(A_1+), P(A_1-), P(A_2+), and P(A_2-). Accordingly, the decision rule for group 1 is:

$$\frac{P(A_1 \mid x_1, x_2, \ldots)}{P(A_2 \mid x_1, x_2, \ldots)} \ge \frac{P(A_1^+)}{P(A_1^-)}$$

and for group 2 is:

$$\frac{P(A_1 \mid x_1, x_2, \ldots)}{P(A_2 \mid x_1, x_2, \ldots)} \le \frac{P(A_2^-)}{P(A_2^+)}$$

where P(A_1+)/P(A_1-) and P(A_2-)/P(A_2+) are the levels of acceptable classification errors.

Using types I and II errors, for group 1: P(A_1+) = 1 - α; and P(A_1-) = β.

The ratio of correct to incorrect diagnoses in group 1 is as follows:

$$\frac{P(A_1^+)}{P(A_1^-)} = \frac{1 - \alpha}{\beta}$$

Similarly for group 2: P(A_2-) = α; and P(A_2+) = 1 - β.

The ratio of incorrect to correct diagnoses in group 2 is as follows:

$$\frac{P(A_2^-)}{P(A_2^+)} = \frac{\alpha}{1 - \beta}$$

The threshold for a diagnostic hypothesis is the minimum acceptable rate of correct diagnoses over incorrect ones. Thresholds for the sum of the diagnostic coefficients are defined as follows:

$$DC_{th}(A_1) = 10 \log_{10} \frac{1 - \alpha}{\beta} \tag{8}$$

$$DC_{th}(A_2) = 10 \log_{10} \frac{\alpha}{1 - \beta} \tag{9}$$

The threshold values are assigned to a particular condition or diagnosis. Thus, for example, A_1 may be assigned as the condition of a tumour being benign and A_2 as the condition of the tumour being malignant. This in turn means that the threshold value DC_th(A_1) is the minimum value of the diagnostic coefficients required in order to provide a diagnosis that the tumour is benign. In contrast, the threshold value DC_th(A_2) is the minimum value of the diagnostic coefficients required in order to diagnose the condition as being a malignant tumour.

The threshold values for different levels of α and β are given in Table II below.

Table II

α        β        (1-α)/β   α/(1-β)   DC_th(A_1)   DC_th(A_2)
0.20     0.20     4         0.250     6            -6
0.15     0.15     5.7       0.176     7.6          -7.5
0.10     0.10     9         0.111     9.5          -9.5
0.05     0.05     19        0.053     12.8         -12.8
0.01     0.01     99        0.010     20           -20
0.001    0.001    999       0.001     30           -30
0.20     0.15     5.3       0.235     7.2          -6.3
0.10     0.15     6         0.118     7.8          -9.3
0.05     0.15     6.3       0.059     8            -12.3
0.01     0.15     6.6       0.012     8.2          -19.2
0.001    0.15     6.7       0.001     8.3          -30

It can be seen from the figures in Table II, for example, that for an accuracy of 95% (that is both α and β are 0.05), the threshold values for the diagnostic coefficients are 12.8 and -12.8. Increasing the accuracy of the analysis and symptom data to 99% provides threshold values for the diagnostic coefficients of 20 and -20.
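The threshold values can be reproduced from the acceptable error levels with a few lines of code. The sketch below assumes the thresholds are ten times the base-10 logarithm of the ratios (1 - α)/β and α/(1 - β), as in formulas (8) and (9); the function name `dc_thresholds` is hypothetical.

```python
import math

def dc_thresholds(alpha, beta):
    """Return (DC_th(A1), DC_th(A2)) for acceptable error levels alpha and beta."""
    upper = 10.0 * math.log10((1.0 - alpha) / beta)   # threshold for group A1
    lower = 10.0 * math.log10(alpha / (1.0 - beta))   # threshold for group A2
    return upper, lower

# Both error levels set to 0.05 (95% accuracy).
upper, lower = dc_thresholds(0.05, 0.05)

# Unequal error levels give an asymmetric threshold range.
upper_asym, lower_asym = dc_thresholds(0.20, 0.15)
```

For equal error levels the two thresholds are symmetric about zero; the first call gives values close to 12.8 and -12.8, and the second gives values close to 7.3 and -6.3.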

Considering the task of assignment of the symptom or analysis data x to one of the groups A_1 or A_2, the inference rules for the SNuP are as follows:

If
$$\frac{P(A_1 \mid x_1, x_2, \ldots)}{P(A_2 \mid x_1, x_2, \ldots)} \ge \frac{1 - \alpha}{\beta}$$
then the decision is "x ∈ Group A_1".

If
$$\frac{P(A_1 \mid x_1, x_2, \ldots)}{P(A_2 \mid x_1, x_2, \ldots)} \le \frac{\alpha}{1 - \beta}$$
then the decision is "x ∈ Group A_2".

If
$$\frac{\alpha}{1 - \beta} < \frac{P(A_1 \mid x_1, x_2, \ldots)}{P(A_2 \mid x_1, x_2, \ldots)} < \frac{1 - \alpha}{\beta}$$
then additional information is required to assign x to one of the groups.

If
$$\frac{\alpha}{1 - \beta} < \frac{P(A_1 \mid x_1, x_2, \ldots)}{P(A_2 \mid x_1, x_2, \ldots)} < \frac{1 - \alpha}{\beta}$$
and no more features are available then the decision is "membership of x is undefined".

Thus, in the next step of the method of the present invention, the value of the diagnostic coefficient DC_max with the highest importance factor J_i is compared with the threshold values determined for the diagnostic coefficients. If the value of the diagnostic coefficient DC_max with the highest importance factor J_i exceeds one of the threshold values, the thus exceeded threshold value is identified and the method terminated. If the value of the diagnostic coefficient DC_max with the highest importance factor J_i does not exceed one of the threshold values, the value of DC_max is summed with the value of the diagnostic coefficient DC_i having the next highest importance factor J_i. If the value of this sum exceeds one of the threshold values, the thus exceeded threshold value is identified and the method terminated. If neither threshold value is exceeded, the successive summation of the diagnostic coefficients DC_i in order of decreasing importance factor J_i is continued until the value of the sum exceeds one of the threshold values. At this point, the summation is ceased and the exceeded threshold value identified.

The accumulation of the diagnostic information using the diagnostic coefficients DC_i is performed as a sum, as follows:

$$DC(x_1, x_2, \ldots, x_n) = DC(x_1) + DC(x_2) + \ldots + DC(x_n) \tag{10}$$

The SNuP using diagnostic coefficients DC_i is performed until the following inequality is no longer true:

$$DC_{th}(A_2) < DC(x_1, x_2, \ldots, x_n) < DC_{th}(A_1) \tag{11}$$

As noted above, each threshold value of the diagnostic coefficient is the minimum acceptable rate of correct diagnoses over incorrect ones. Thus, identification of the exceeded threshold in turn allows the correct diagnosis to be made. For example, suppose A_1 is taken to represent the subject having a benign tumour and A_2 is taken to represent the subject having a malignant tumour. Should the value of the diagnostic coefficient DC_max with the highest importance factor J_i or the successive summation of the diagnostic coefficients exceed the threshold value for the group A_1, given as DC_th(A_1) in Table II, then the method indicates that the subject has a benign tumour. In contrast, should DC_max or the successive summation of the diagnostic coefficients exceed the threshold value for the group A_2, given as DC_th(A_2) in Table II, the subject is to be diagnosed with a tumour that is malignant.
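The sequential summation and stopping rule might be sketched as follows. This is an illustrative reading of the procedure, not a definitive implementation: the function name `snup_classify`, the returned tuple and the example coefficients are assumptions, and the coefficients are assumed to have already been sorted by decreasing importance factor.

```python
import math

def snup_classify(ranked_dcs, alpha, beta):
    """Sum diagnostic coefficients in ranked order until a threshold is reached.

    ranked_dcs: DC values already ordered by decreasing importance factor J.
    Returns (decision, number of coefficients used, accumulated sum).
    """
    upper = 10.0 * math.log10((1.0 - alpha) / beta)   # DC_th(A1)
    lower = 10.0 * math.log10(alpha / (1.0 - beta))   # DC_th(A2)
    total = 0.0
    for used, dc in enumerate(ranked_dcs, start=1):
        total += dc
        if total >= upper:
            return "A1", used, total
        if total <= lower:
            return "A2", used, total
    # All features consumed while the sum stayed inside the threshold range.
    return "undefined", len(ranked_dcs), total

# Hypothetical coefficients for one subject, highest-ranked variable first.
decision, used, total = snup_classify([-6.0, -4.5, -3.1, 2.0], alpha=0.05, beta=0.05)
```

In this invented case the running sum reaches the lower threshold after three coefficients, so the fourth analysis is never consulted; this is how the procedure can reduce the number of investigations needed for a given subject.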

In a further aspect, the present invention provides a system for characterising a medical condition of a subject, the system comprising:
a) means for providing a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x_i conducted on the subject;
b) means for calculating a diagnostic coefficient DC_i for each result of an analysis x_i;
c) means for calculating an importance factor J_i for each result of an analysis x_i;
d) means allowing a user to specify an acceptable level of error in the characterisation;
e) means to determine the thresholds for the diagnostic coefficients DC_i using the error levels specified in feature (d) and define a threshold range therebetween;
f) means to compare the value of the diagnostic coefficient DC_i of the analysis x_i with the highest importance factor J_i, DC_max, with the thresholds determined in feature (e);
g) means for successively summing the diagnostic coefficient DC_max with the diagnostic coefficient DC_i having the next highest importance factor J_i until the value of the sum lies outside the threshold range defined in feature (e).

The system of this aspect of the present invention is capable of processing binary analysis data. However, as noted hereinbefore, many tests, examinations and analyses conducted on or in respect of a subject provide continuous data as results.

Accordingly, it is particularly preferred that the system comprises means for converting continuous analysis data into discrete data, for example binary data.

In a still further aspect, the present invention provides a system for characterising a medical condition of a subject, the system comprising:
a) means for providing a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x_i conducted on the subject;
a1) means for converting continuous result data present in the set of symptom data into discrete result data;
b) means for calculating a diagnostic coefficient DC_i for each result of an analysis x_i;
c) means for calculating an importance factor J_i for each result of an analysis x_i;
d) means allowing a user to specify an acceptable level of error in the characterisation;
e) means to determine the thresholds for the diagnostic coefficients DC_i using the error levels specified in feature (d) and define a threshold range therebetween;
f) means to compare the value of the diagnostic coefficient DC_i of the analysis x_i with the highest importance factor J_i, DC_max, with the thresholds determined in feature (e);
g) means for successively summing the diagnostic coefficient DC_max with the diagnostic coefficient DC_i having the next highest importance factor J_i until the value of the sum lies outside the threshold range defined in feature (e).

As discussed hereinbefore, the methods of the present invention may be used to provide an indication, at an early stage in the diagnostic assessment of a subject, of the extent to which some or all of the available tests, analyses and examinations in relation to the subject are required in order to allow the medical practitioner to arrive at a clear diagnosis. In particular, the methods may be applied in order to identify those tests, analyses and examinations that are not required in order to reach a clear diagnosis, thus reducing the amount of time for which the subject undergoes possibly invasive procedures, the overall time taken to carry out the prediagnostic assessments, and the overall cost of the diagnostic procedure.

Accordingly, in a further aspect, the present invention provides a method of identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject, the method comprising the steps of: a) provide a first set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x conducted on the subject; b) calculate a diagnostic coefficient DC for each result of an analysis x; c) calculate an importance factor J for each result of an analysis x; d) specify an acceptable level of error in the characterisation; e) determine the thresholds for the diagnostic coefficients DC using the error levels specified in step (d) and define a threshold range therebetween; f) compare the value of the diagnostic coefficient DC of the analysis x with the highest importance factor J, DCmax, with the thresholds determined in step (e); g) successively sum the diagnostic coefficient DCmax with the diagnostic coefficient DC having the next highest importance factor J until the value of the sum lies outside the threshold range defined in step (e); h) identify the threshold exceeded in step (g); i) if no threshold is exceeded as a result of the successive summation of all the diagnostic coefficients DC in step (g), provide an indication that further symptom data for the subject are required.

The method applies the general process discussed hereinbefore to a first set of symptom data obtained from a first group of tests or analyses. This first set may contain only some of the results of a complete investigation into the condition of the subject. However, the method is applied to this partial data set. If the successive summation of the diagnostic coefficients results in one of the threshold values being exceeded, then a diagnosis of the condition can be made, without the need for conducting further investigations or tests. It is only when the successive summation of all the diagnostic coefficients in the first data set does not provide a sum that exceeds one of the thresholds that further investigations are required. The method may be applied after each analysis, test or investigation into the subject and the procedures continued only until a threshold is exceeded and a diagnosis is possible.

In this way, the method indicates whether a diagnosis of the condition is possible from a selection of tests, analyses or investigations selected according to specified criteria, such as time taken, discomfort or risk to the subject, and/or cost.

As noted hereinbefore, the method may be applied to sets of data that contain only binary values. However, it is most advantageous if the method is also applied to data sets containing continuous values. In this case, the continuous data values are converted into discrete data, as hereinbefore described.

In a further aspect, the present invention provides a method of identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject, the method comprising the steps of: a) provide a first set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x conducted on the subject; a1) convert continuous result data present in the set of symptom data into discrete result data; b) calculate a diagnostic coefficient DC for each result of an analysis x; c) calculate an importance factor J for each result of an analysis x; d) specify an acceptable level of error in the characterisation; e) determine the thresholds for the diagnostic coefficients DC using the error levels specified in step (d) and define a threshold range therebetween; f) compare the value of the diagnostic coefficient DC of the analysis x with the highest importance factor J, DCmax, with the thresholds determined in step (e); g) successively sum the diagnostic coefficient DCmax with the diagnostic coefficient DC having the next highest importance factor J until the value of the sum lies outside the threshold range defined in step (e); h) identify the threshold exceeded in step (g); i) if no threshold is exceeded as a result of the successive summation of all the diagnostic coefficients DC in step (g), provide an indication that further symptom data for the subject are required.

A system for carrying out the method of the foregoing aspects of the invention is also provided. Accordingly, there is provided a system for identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject, the system comprising: a) means to provide a first set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x conducted on the subject; b) means to calculate a diagnostic coefficient DC for each result of an analysis x; c) means for calculating an importance factor J for each result of an analysis x; d) means for specifying an acceptable level of error in the characterisation; e) means to determine the thresholds for the diagnostic coefficients DC using the error levels specified in feature (d) and define a threshold range therebetween; f) means to compare the value of the diagnostic coefficient DC of the analysis x with the highest importance factor J, DCmax, with the thresholds determined in feature (e); g) means to successively sum the diagnostic coefficient DCmax with the diagnostic coefficient DC having the next highest importance factor J until the value of the sum lies outside the threshold range defined in feature (e); and h) means that, if no threshold is exceeded as a result of the successive summation of all the diagnostic coefficients DC in step (g), provide an indication that further symptom data for the subject are required.

The system most preferably further comprises means for converting continuous data in the data set into discrete data. Accordingly, there is also provided a system for identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject, the system comprising: a) means to provide a first set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis x conducted on the subject; a1) means for converting continuous result data present in the set of symptom data into discrete result data; b) means to calculate a diagnostic coefficient DC for each result of an analysis x; c) means for calculating an importance factor J for each result of an analysis x; d) means for specifying an acceptable level of error in the characterisation; e) means to determine the thresholds for the diagnostic coefficients DC using the error levels specified in feature (d) and define a threshold range therebetween; f) means to compare the value of the diagnostic coefficient DC of the analysis x with the highest importance factor J, DCmax, with the thresholds determined in feature (e); g) means to successively sum the diagnostic coefficient DCmax with the diagnostic coefficient DC having the next highest importance factor J until the value of the sum lies outside the threshold range defined in feature (e); and h) means that, if no threshold is exceeded as a result of the successive summation of all the diagnostic coefficients DC in step (g), provide an indication that further symptom data for the subject are required.

The method and apparatus of the present invention will now be further illustrated by the following examples, having reference to the accompanying figures, in which: Figure 1 is a graphical representation of the method of the present invention applied to four different cases of ovarian cancer.

Example 1

Figure 1 illustrates, graphically, the method of the present invention applied to a set of data obtained from the investigation of a subject suspected of suffering from ovarian cancer. The result of the successive summation of DC at every step of the procedure is indicated by an arrow. A1 is the outcome that the tumour is malignant, while A2 is the outcome that the tumour is benign. The thresholds for A1 and A2 are denoted by bold solid lines. The hatched areas in Figure 1 are the determinate zones for A1 and A2 (malignant and benign groups). Cases 1 and 2 demonstrate the sequential non-uniform procedure (SNuP) with definitive variables, that is, where a single variable is enough to reach a threshold. Thus, in both cases 1 and 2, the value of the diagnostic coefficient DCmax having the highest importance factor exceeds one of the threshold values. In case 1, the tumour may be diagnosed as being malignant. In case 2, the subject is suffering from a benign tumour. Case 3 shows the straightforward classification of a benign tumour in three steps, involving the successive summation of the three diagnostic coefficients having the first, second and third highest importance factors. Case 4 is a difficult case of ovarian cancer. It can be seen that the final determination of the method applied to case 4 is that the tumour is malignant. However, it will be noted that the successive summation provides an indication in the third and fourth summations that the tumour may be benign, as indicated by the value of the summation tending towards the threshold value DCth(A2). In a conventional diagnostic procedure, a medical practitioner may have interpreted such a trend in the data as indicating a benign tumour. However, proceeding further with the method of successive summation until a threshold value is exceeded demonstrates that such a diagnosis would be incorrect and that the subject is suffering from a malignant tumour.

In any case, should the result of the summation procedure after all the diagnostic coefficients had been successively summed still lie between the threshold values, this would indicate that further investigative data are required in order to provide a full diagnosis of the condition of the subject.

Example 2

A study was conducted of data obtained from the investigation of 525 patients admitted to the Department of Obstetrics and Gynecology at the University Hospitals Katholieke Universiteit Leuven. All the patients underwent transvaginal ultrasonography with B-mode and colour Doppler imaging. The level of the serum oncomarker CA125 was measured for 432 patients. A summary of the collected data is set out in Table I above. A detailed description of the data acquisition process is set out in D. Timmerman, et al., 'A comparison of methods for preoperative discrimination between malignant and benign adnexal masses: the development of a new logistic regression model', Am. J. Obstet. Gynecol., Vol. 181, No. 1, pages 57 to 65, July 1999.

As part of the ultrasound examination the amount of blood flow was assessed within the septa, cyst walls, solid tumor areas, or ovarian stroma. Depending on whether the amount of blood flow was rather strong or very strong, two new binary variables were added: 'Col3' and 'Col4'. The variable CA125 was transformed to binary values, 0 or 1, depending on a threshold value of 30 U/ml. A value of 1 was assigned if CA125 > 30, and a value of 0 otherwise.

The Risk of Malignancy Index (RMI) was used as a benchmark during the performance evaluation. RMI values were calculated according to the formula RMI = Jacobs × Meno × CA125. Details of the RMI are set out in I. Jacobs, et al., 'A risk of malignancy index incorporating CA 125, ultrasound and menopausal status for the accurate preoperative diagnosis of ovarian cancer', Br. J. Obstet. Gynaecol., Vol. 97, No. 10, pages 922 to 929, October 1990.

The ultrasound score (Morph) was calculated as the sum of scores for the presence of multilocular cyst, evidence of solid areas, evidence of metastases, presence of ascites and bilateral lesions. Jacobs' index was assigned a value of 0 if Morph = 0, a value of 1 if Morph = 1 and a value of 3 if Morph > 1. The menopausal state (Meno) was equal to 1 if premenopausal and equal to 3 if postmenopausal.
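As an illustration only, the RMI benchmark described above could be computed as in the following Python sketch. The function names and argument types are assumptions made for the example; the scoring rules are those stated in the text.

```python
def jacobs_index(morph: int) -> int:
    """Map the ultrasound score Morph to Jacobs' index (0, 1 or 3)."""
    if morph == 0:
        return 0
    if morph == 1:
        return 1
    return 3  # Morph > 1

def meno_score(postmenopausal: bool) -> int:
    """Menopausal state: 1 if premenopausal, 3 if postmenopausal."""
    return 3 if postmenopausal else 1

def rmi(morph: int, postmenopausal: bool, ca125: float) -> float:
    """RMI = Jacobs x Meno x CA125, with CA125 in U/ml."""
    return jacobs_index(morph) * meno_score(postmenopausal) * ca125
```

For example, a postmenopausal subject with Morph = 2 and CA125 = 100 U/ml would receive RMI = 3 × 3 × 100 = 900 under these rules.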

The calculated diagnostic coefficients and J-divergences for all nominal input variables are presented in Table III below. The last column shows the rank of the symptom. The sequential non-uniform procedure for preoperative differential diagnosis between benign and malignant forms of adnexal tumour is recommended to start from the most informative variables, i.e. variables with the highest J rank (e.g. smooth internal wall, strong blood flow, presence of unilocular cyst, level of serum CA125 above 30 U/ml, presence of ascites, etc).

Table III

| No | Variable | Value | Malignant N | Malignant % | Malignant n | Benign N | Benign % | Benign n | DC(xij) | J(xij) | J(xi) | J rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Menopause (Meno) | 1 | 141 | 65.2 | 92 | 384 | 31.3 | 120 | 3.2 | 0.54 | 1.05 | 9 |
| | | 0 | | 34.8 | 49 | | 68.7 | 264 | -3.0 | 0.51 | | |
| 2 | Normal blood flow (Col3) | 1 | 141 | 34 | 48 | 384 | 15.4 | 59 | 3.4 | 0.32 | 0.42 | 14 |
| | | 0 | | 66 | 93 | | 84.6 | 325 | -1.1 | 0.10 | | |
| 3 | Strong blood flow (Col4) | 1 | 141 | 44 | 62 | 384 | 3.4 | 13 | 11.1 | 2.25 | 2.74 | 2 |
| | | 0 | | 56 | 79 | | 96.6 | 371 | -2.4 | 0.49 | | |
| 4 | Ascites (Asc) | 1 | 141 | 60.3 | 85 | 384 | 13.3 | 51 | 6.6 | 1.55 | 2.35 | 5 |
| | | 0 | | 39.7 | 56 | | 86.7 | 333 | -3.4 | 0.80 | | |
| 5 | Unilocular cyst (Un) | 1 | 141 | 4.3 | 6 | 384 | 46.1 | 177 | -10.3 | 2.15 | 2.67 | 3 |
| | | 0 | | 95.7 | 135 | | 53.9 | 207 | 2.5 | 0.52 | | |
| 6 | Unilocular solid (UnSol) | 1 | 141 | 16.3 | 23 | 384 | 6.3 | 24 | 4.1 | 0.21 | 0.24 | 15 |
| | | 0 | | 83.7 | 118 | | 93.7 | 360 | -0.5 | 0.03 | | |
| 7 | Multilocular cyst (Mul) | 1 | 141 | 5.7 | 8 | 384 | 28.6 | 110 | -7.0 | 0.80 | 0.94 | 10 |
| | | 0 | | 94.3 | 133 | | 71.4 | 274 | 1.2 | 0.14 | | |
| 8 | Multilocular solid (MulSol) | 1 | 141 | 36.2 | 51 | 384 | 10.7 | 41 | 5.3 | 0.68 | 0.87 | 11 |
| | | 0 | | 63.8 | 90 | | 89.3 | 343 | -1.5 | 0.19 | | |
| 9 | Solid tumour (Sol) | 1 | 141 | 37.6 | 53 | 384 | 8.3 | 32 | 6.6 | 0.97 | 1.22 | 8 |
| | | 0 | | 62.4 | 88 | | 91.7 | 352 | -1.7 | 0.25 | | |
| 10 | Bilateral mass (Bilat) | 1 | 141 | 39 | 55 | 384 | 13.3 | 51 | 4.7 | 0.60 | 0.79 | 12 |
| | | 0 | | 61 | 86 | | 86.7 | 333 | -1.5 | 0.19 | | |
| 11 | Smooth wall (Smooth) | 1 | 141 | 5.7 | 8 | 384 | 56.8 | 218 | -10.0 | 2.56 | 3.43 | 1 |
| | | 0 | | 94.3 | 133 | | 43.2 | 166 | 3.4 | 0.87 | | |
| 12 | Irregular wall (Irreg) | 1 | 138 | 73.2 | 101 | 373 | 33.8 | 126 | 3.4 | 0.67 | 1.44 | 7 |
| | | 0 | | 26.8 | 37 | | 66.2 | 247 | -3.9 | 0.77 | | |
| 13 | Papillations (Pap) | 1 | 141 | 53.9 | 76 | 384 | 12.2 | 47 | 6.5 | 1.36 | 1.94 | 6 |
| | | 0 | | 46.1 | 65 | | 87.8 | 337 | -2.8 | 0.58 | | |
| 14 | Septa > 3 mm (Sept) | 1 | 141 | 31.2 | 44 | 384 | 13 | 50 | 3.8 | 0.35 | 0.44 | 13 |
| | | 0 | | 68.8 | 97 | | 87 | 334 | -1.0 | 0.09 | | |
| 15 | Acoustic shadows (Shadows) | 1 | 141 | 5.7 | 8 | 384 | 12.2 | 47 | -3.3 | 0.11 | 0.12 | 19 |
| | | 0 | | 94.3 | 133 | | 87.8 | 337 | 0.3 | 0.01 | | |
| 16 | Anechoic cystic content (Lucent) | 1 | 141 | 28.4 | 40 | 384 | 43.5 | 167 | -1.9 | 0.14 | 0.22 | 17 |
| | | 0 | | 71.6 | 101 | | 56.5 | 217 | 1.0 | 0.08 | | |
| 17 | Low level echogenicity (Low-level) | 1 | 141 | 20.6 | 29 | 384 | 11.7 | 45 | 2.5 | 0.11 | 0.13 | 18 |
| | | 0 | | 79.4 | 112 | | 88.3 | 339 | -0.5 | 0.02 | | |
| 18 | Mixed echogenicity (Mixed) | 1 | 141 | 13.5 | 19 | 384 | 20.3 | 78 | -1.8 | 0.06 | 0.07 | 21 |
| | | 0 | | 86.5 | 122 | | 79.7 | 306 | 0.4 | 0.01 | | |
| 19 | Ground glass cyst (G.Glass) | 1 | 141 | 8.5 | 12 | 384 | 19.8 | 76 | -3.7 | 0.21 | 0.24 | 16 |
| | | 0 | | 91.5 | 129 | | 80.2 | 308 | 0.6 | 0.03 | | |
| 20 | Hemorrhagic cyst (Haem) | 1 | 141 | 0.7 | 1 | 384 | 3.6 | 14 | -7.1 | 0.10 | 0.10 | 20 |
| | | 0 | | 99.3 | 140 | | 96.4 | 370 | 0.1 | 0.00 | | |
| 21 | CA125 > 30 (C_CA125) | 1 | 137 | 80.3 | 110 | 295 | 29.2 | 86 | 4.4 | 1.12 | 2.55 | 4 |
| | | 0 | | 19.7 | 27 | | 70.8 | 209 | -5.6 | 1.43 | | |

N - total number of cases in the group; n - number of cases in the group with presence of feature.

In this study only binary variables were considered. A large value of DC means a high discriminative ability of the variable and the importance factor J gives an indication of the reliability of this variable. Features with positive DC values correspond to malignancy, and those with negative values to the benign group.

Accumulation of the diagnostic information was carried out by summation of the diagnostic coefficients and comparing the sum with a specified threshold.
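The tabulated coefficients can be checked numerically. The sketch below assumes that the diagnostic coefficient is ten times the base-10 logarithm of the ratio of the conditional probabilities, and that the J-divergence of a level is half the diagnostic coefficient multiplied by the difference of those probabilities. These forms are assumptions chosen because they reproduce the Table III values for strong blood flow (Col4) to within rounding; formulae (4), (6) and (7) themselves are not restated in this part of the text, and the function names are hypothetical.

```python
import math

def dc(p_mal: float, p_ben: float) -> float:
    """Assumed diagnostic coefficient: 10 * log10(P(x|A1) / P(x|A2))."""
    return 10 * math.log10(p_mal / p_ben)

def j_level(p_mal: float, p_ben: float) -> float:
    """Assumed J-divergence of one level: 0.5 * DC * (P(x|A1) - P(x|A2))."""
    return 0.5 * dc(p_mal, p_ben) * (p_mal - p_ben)

# Strong blood flow (Col4), counts taken from Table III.
p_mal_1, p_ben_1 = 62 / 141, 13 / 384    # value 1
p_mal_0, p_ben_0 = 79 / 141, 371 / 384   # value 0

dc_1 = dc(p_mal_1, p_ben_1)              # about 11.1
dc_0 = dc(p_mal_0, p_ben_0)              # about -2.4
# Assumed importance factor: sum of the J-divergences over all levels.
j_total = j_level(p_mal_1, p_ben_1) + j_level(p_mal_0, p_ben_0)
```

A level that never occurs in one of the groups would give a zero probability and an undefined logarithm; how such levels are handled is not stated here, so the sketch does not guard against it.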

Two cases of ovarian cancer will now be used to demonstrate the application of the method of the present invention.

Case 1

A woman with a benign adnexal mass, age 31, pre-menopausal, strong blood flow, CA125 not raised (9 U/ml), no ascites, unilocular ovarian cyst, smooth internal wall, mixed echogenicity (patient 3 in the database). The acceptable levels of error are α = 0.05 and β = 0.05, that is, assuming 95% confidence for both decisions (benign and malignant). The threshold values for the summation of the diagnostic coefficients may be taken from Table II and are DCth(A1) = 12.8 for malignant and DCth(A2) = -12.8 for benign.

The successive summation of the diagnostic coefficients DC proceeds as follows:

Step 1: Variable 'Smooth' = 1, DC(Smooth = 1) = -10, Sum(DC) = -10. Thresholds are not reached. Conclusion: continue procedure.

Step 2: Variable 'Col4' = 1, DC(Col4 = 1) = 11.1, Sum(DC) = -10 + 11.1 = 1.1. Thresholds are not reached. Conclusion: continue procedure.

Step 3: Variable 'Un' = 1, DC(Un = 1) = -10.3, Sum(DC) = 1.1 + (-10.3) = -9.2. Thresholds are not reached. Conclusion: continue procedure.

Step 4: Variable 'C_CA125' = 0, DC(C_CA125 = 0) = -5.6, Sum(DC) = -9.2 + (-5.6) = -14.8. Sum(DC) < DCth(A2). Threshold has been exceeded. Conclusion: stop SNuP. Decision: benign form of tumour.

For case 1, four variables are enough to make a decision as to a diagnosis with a confidence level of 95%.
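The four steps of Case 1 can be reproduced with a short Python sketch. The diagnostic coefficients are copied from Table III; the dictionary layout and the function name `run_case` are assumptions made for the example.

```python
# Diagnostic coefficients for the four highest-ranked variables (Table III).
DC_TABLE = {
    "Smooth":  {1: -10.0, 0: 3.4},
    "Col4":    {1: 11.1,  0: -2.4},
    "Un":      {1: -10.3, 0: 2.5},
    "C_CA125": {1: 4.4,   0: -5.6},
}
ORDER = ["Smooth", "Col4", "Un", "C_CA125"]  # decreasing importance factor J

def run_case(patient, upper=12.8, lower=-12.8):
    """Return (decision, trace); trace holds the running sum after each step."""
    total = 0.0
    trace = []
    for name in ORDER:
        total += DC_TABLE[name][patient[name]]
        trace.append((name, round(total, 1)))
        if total > upper:
            return "malignant", trace
        if total < lower:
            return "benign", trace
    return "undefined", trace

# Patient 3: smooth wall, strong blood flow, unilocular cyst, CA125 not raised.
decision, trace = run_case({"Smooth": 1, "Col4": 1, "Un": 1, "C_CA125": 0})
```

The running sums in `trace` follow the steps listed above (-10, 1.1, -9.2, -14.8), and the procedure stops at the fourth variable with a benign decision.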

Case 2

This case contains data missing for some variables, indicating that certain tests or analyses were not carried out on the subject.

This is a difficult case of ovarian cancer. It concerns a woman aged 72, post-menopausal, with ascites, multilocular cyst, strong blood flow and smooth internal wall, and with no information on the level of CA125 (patient number 216 in the database). The thresholds for Sum(DC) are 9.8 for malignant and -12.5 for benign (α = 0.05, β = 0.10).

The successive summation of the diagnostic coefficients DC proceeded as follows:

Step 1: Variable 'Smooth' = 1, DC(Smooth = 1) = -10, Sum(DC) = -10. Thresholds are not reached. Conclusion: continue procedure.

Step 2: Variable 'Col4' = 1, DC(Col4 = 1) = 11.1, Sum(DC) = -10 + 11.1 = 1.1. Thresholds are not reached. Conclusion: continue procedure.

Step 3: Variable 'Un' = 0, DC(Un = 0) = 2.5, Sum(DC) = 1.1 + 2.5 = 3.6. Thresholds are not reached. Conclusion: continue procedure.

Step 4: Variable 'C_CA125' value unknown, Sum(DC) = 3.6. Thresholds are not reached. Conclusion: continue procedure.

Step 5: Variable 'Ascites' = 1, DC(Ascites = 1) = 6.6, Sum(DC) = 3.6 + 6.6 = 10.2. Sum(DC) > DCth(A1). Threshold has been exceeded. Conclusion: stop SNuP. Decision: malignant form of tumour.
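The thresholds quoted for the two cases are consistent with a Wald-type sequential test expressed on the same 10·log10 scale as the diagnostic coefficients. The sketch below is written under that assumption: the threshold expressions are not restated in this part of the text and are chosen here because they reproduce the quoted values. The function names are hypothetical, and a missing result is represented by `None` and simply skipped, as in step 4 of Case 2.

```python
import math

def thresholds(alpha: float, beta: float):
    """Assumed form of the thresholds (upper for A1, lower for A2)."""
    upper = 10 * math.log10((1 - alpha) / beta)
    lower = 10 * math.log10(alpha / (1 - beta))
    return upper, lower

def classify(dcs, upper, lower):
    """Sum coefficients in order; None marks a test that was not performed."""
    total = 0.0
    for value in dcs:
        if value is None:      # missing result: the sum is left unchanged
            continue
        total += value
        if total > upper:
            return "malignant", total
        if total < lower:
            return "benign", total
    return "undefined", total

# Case 2: Smooth = 1, Col4 = 1, Un = 0, C_CA125 unknown, Ascites = 1.
upper, lower = thresholds(0.05, 0.10)
decision, total = classify([-10.0, 11.1, 2.5, None, 6.6], upper, lower)
```

With α = 0.05 and β = 0.10 this gives an upper threshold of about 9.8 and a lower threshold of about -12.55, against the quoted 9.8 and -12.5; with α = β = 0.05 it gives ±12.8, as in Case 1.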

Example 3

The performance of the method of the present invention was assessed using ROC analysis and a 3-fold cross validation. A ratio of 1:2 between malignant and benign groups sample sizes was taken from the initial data set.

The SNuP procedure of the present invention was applied to the ovarian tumour data set. The task was to distinguish malignant and benign forms of this kind of neoplasm. The differential diagnosis of these conditions, apart from clinical examination, involves ultrasound methods, tumour markers, CT and MRI. It is important to find a trade-off between the cost and number of the diagnostic procedures and the risk of missing a case in which an urgent surgical operation might be required.

The SNuP showed a high performance on a real data set during cross validation. The method is close to clinical thinking and can be used not only for research but also for educational purposes to demonstrate the inference process.

Example 4

A second study using the method of the present invention was performed on 1066 cases of adnexal masses collected during an international multicentre clinical trial across 14 research centres. The full database includes 1066 cases of ovarian tumours, 266 malignant and 800 benign. Histological diagnoses were used as the gold standard. There were three data modalities:

(i) clinical variables, including family history of ovarian and breast cancer, age, menopausal status, previous hormonal surgery, and surgical history;

(ii) sonographic examination, performed in all cases with gray scale and colour Doppler imaging, with a total of over 40 morphological and blood flow velocity characteristics;

(iii) serum tumour marker CA125, measured for 809 patients.

When intratumoral blood flow velocity waveforms were not detected, the peak systolic velocity (PSV), time averaged maximum velocity (TAMXV), the pulsatility index (PI), and the resistance index (RI) were substituted by 2.0 cm/sec, 1 cm/sec, 3.0, and 1.0, respectively.

At the first stage of analysis some preprocessing procedures were carried out in order to incorporate continuous variables into the model. Continuous variables were transformed into discrete values by automatically partitioning the input space into intervals. A univariate and multivariate k-means clustering procedure in MATLAB was used. For volumetric characteristics, such as the diameter of the lesion (LesD1-3), multivariate clustering was used.

Table IV below demonstrates this approach. First, it is necessary to specify the variables and the desired number of clusters. Three clusters were used by default, considering that the values in these clusters might be described as 'low', 'medium' and 'high'. As a result, a new variable with three (or two) values was obtained. Values of the new variable were assigned by the following rule:

$$\text{value} = \arg\min_{s} \sqrt{\sum_{j=1}^{m} \left(x_j - c_{sj}\right)^2}$$

where the quantity minimised is the Euclidean distance from an observation x to one of the m-dimensional centroids c_s of the clusters.

In the case of a malignant-benign classification using the diameter of the lesion, the number of clusters max(s) = 3, the number of modalities max(j) = m = 3, and the coordinates of the cluster centroids c_s are the triplets {LesD1; LesD2; LesD3}: {51.0; 40.7; 40.0}, {106.8; 85.3; 82.2}, {201.2; 148.0; 142.3}.
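The assignment rule can be illustrated with a short Python sketch that maps a triplet of lesion diameters to the nearest of the three centroids quoted above (the first centroid is read here as 51.0; 40.7; 40.0). The function name and the 1-based cluster numbering are assumptions made for the example; the original clustering was carried out in MATLAB.

```python
import math

# Cluster centroids for (LesD1, LesD2, LesD3), as quoted in the text.
CENTROIDS = [
    (51.0, 40.7, 40.0),
    (106.8, 85.3, 82.2),
    (201.2, 148.0, 142.3),
]

def assign_cluster(x):
    """Return the 1-based index of the centroid nearest to x (Euclidean)."""
    distances = [math.dist(x, c) for c in CENTROIDS]
    return distances.index(min(distances)) + 1
```

For instance, a lesion measuring (60, 45, 42) would be assigned the discrete value 1, and one measuring (190, 150, 140) the value 3.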

Table IV

| Variable | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| LesD1 | 51.0 | 106.8 | 201.2 |
| LesD2 | 40.7 | 85.3 | 148.0 |
| LesD3 | 40.0 | 82.2 | 142.3 |
| LesD (discrete value) | 1 | 2 | 3 |
| lg(CA125) | 1.11 | 1.83 | 3.01 |
| lg(CA125) (discrete value) | 1 | 2 | 3 |

After all input variables were converted to a discrete scale, the successive summation procedure of the present invention was applied to the data. For every distinct value of a variable (e.g. strong blood flow, col score = 4) the following parameters were calculated: the conditional probability of this event in the malignant and benign groups, P(xij | A1,2); the diagnostic coefficient for the distinct value of the variable, DC(xij), using formula (4); and the J-divergence of the symptom's level, J(xij), using formula (6).

The values of J-divergence were then summed across all values to produce an importance factor Ji of the symptom xi, applying formula (7). All the variables were sorted by their importance factor Ji in descending order. The most informative variables (that is, those with an importance factor J of greater than 1.0) are summarised in Table V, where the total J is presented in the third column and all distinct values of the variables are given in the fourth column, followed by the corresponding values of the diagnostic coefficients DC. A large value of DC means a high discriminative ability of the variable and the importance factor Ji gives an indication of the reliability of this value. Features with positive DC values correspond to malignancy, and those with negative values to the benign group. Accumulation of the diagnostic information was carried out by summation of the diagnostic coefficients and comparing the sum with a specified threshold.

Table V

| Rank | Variable | J | Level of symptom | Diagnostic coefficients |
|---|---|---|---|---|
| 1 | Locularity | 5.37 | {1 2 3 4 5 6} | {-15.9; 2.4; -6.1; 3.8; 7.4; 4.7} |
| 2 | ColScore | 3.91 | {1 2 3 4} | {-11.2; -3.6; 2.0; 9.6} |
| 3 | WallRegularity | 2.76 | {0 1} | {-6.4; 4.2} |
| 4 | SolidD | 2.68 | {1 2} | {-2.6; 9.9} |
| 5 | Ascites | 2.64 | {0 1} | {-2.1; 11.9} |
| 6 | PI, RI, PSV, TAMXV | 2.19 | {1 2 3} | {11.6; 5.2; -3.3} |
| 7 | RatioPapLes | 2.05 | {1 2 3} | {16.5; -2.8; 7.5} |
| 8 | NrLocules | 1.99 | {0 1 2 3 4 5 6} | {6.6; -4.0; -0.3; -2.9; -1.5; 1.8; 7.7} |
| 9 | Fluid | 1.79 | {1 2} | {8.9; -1.9} |
| 10 | PapNr | 1.78 | {0 1 2 3 4} | {-1.9; -0.7; 5.3; 2.7; 11.1} |
| 11 | PapFlow | 1.68 | {0 1} | {-1.9; 8.1} |
| 12 | MaxSolid | 1.63 | {1 2 3} | {-1.4; 9.1; 11.5} |
| 13 | age | 1.35 | {1 2 3} | {-4.4; 4.8; 0.9} |
| 14 | OvD | 1.3 | {1 2 3} | {-3.2; 6.3; 1.8} |
| 15 | PapSmth | 1.21 | {0 1} | {-1.7; 6.4} |
| 16 | MaxLes | 1.19 | {1 2 3} | {-3.0; 6.1; 2.1} |
| 17 | lg(CA125) | 1.11 | {1 2 3} | {0.7; 4.8; 13.6} |
| 18 | LesD | 1.05 | {1 2 3} | {-2.8; 2.7; 5.3} |

In order to classify cases, the thresholds DCth(A1,2) had to be specified. β was fixed at 0.05, while α was varied from 0.90 to 0.001. The lower and upper thresholds for the sum of the diagnostic coefficients were calculated using formulae (8) and (9).

The performance of the method during 3-fold cross-validation is presented in Table VI. The last column of the table shows the median number of cases in which the diagnostic decision was undefined. As can be seen, this number increases as the level of acceptable error α decreases. As a result, a value of α = 0.10 was selected as optimal, as it produces a relatively high performance (Se = 86.9%, Sp = 84.3%, Acc = 84.9%) and a low number of undefined cases (10 out of 355).

The interpretation of the diagnostic coefficients of untransformed and univariately transformed variables is straightforward when there is a clear assignment of the diagnostic coefficient DC to the level of the symptom. For instance, low blood flow (ColScore = 1) is highly associated with a benign tumour, DC = -11.2, whereas strong blood flow (ColScore = 4) is a marker of malignancy, DC = 9.6.

Examples of a univariately transformed variable include the log of serum CA125, which was split into three clusters that can be described as 'low', 'medium' and 'high' levels, with an increasing degree of association with malignancy. New variables created in two- or multidimensional space might have increasing values in one dimension and decreasing values in another, which may slightly complicate interpretation and require additional clinical input. Examples of these kinds of variables include the diameter of the solid component (SolidD), the velocity indices (PI, RI, PSV, TAMXV), the diameter of the ovaries (OvD) and the diameter of the lesion (LesD).

Example 5

The method of the present invention was compared with the performance of an expert medical assessment.

Assuming S is the gold standard value (0 - benign tumour, 1 - malignant tumour), M is the model result (0 - benign tumour, 0.5 - undefined, 1 - malignant tumour) and E is the expert opinion (0 - benign tumour, 1 - malignant tumour), there are six possible situations in comparing the model to an expert, as follows:

1) Method is correct. Expert is correct. S = M and S = E.

2) Method is correct. Expert is incorrect. S = M and S ≠ E.

3) Method is incorrect. Expert is correct. S ≠ M and M ≠ 0.5 and S = E.

4) Method is incorrect. Expert is incorrect. S ≠ M and S ≠ E.

5) Method's result is undefined. Expert is correct. M = 0.5 and S = E.

6) Method's result is undefined. Expert is incorrect. M = 0.5 and S ≠ E.

In the above situations, conditions 1 and 4 represent a situation in which the method and the expert agree. The remaining conditions are more interesting.

Conditions 2 and 6 are difficult cases for diagnosis. Condition 2 is true when the expert misses something or reaches a conclusion based on a wrong assumption.

This may also be due to new knowledge discovered by the method of the present invention. When the expert outperforms the method, condition 3 is true. Condition 5 is possible when there is enough information for the expert to arrive at a correct conclusion but not enough for the method to provide a determinative answer.
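The six conditions can be expressed compactly in code. The following Python sketch is illustrative only; the function name `condition` is an assumption, and the encoding of S, M and E follows the definitions given above.

```python
def condition(s: int, m: float, e: int) -> int:
    """Return the condition number (1 to 6) for gold standard s,
    model result m (0, 0.5 or 1) and expert opinion e."""
    expert_correct = (e == s)
    if m == 0.5:                      # method's result is undefined
        return 5 if expert_correct else 6
    if m == s:                        # method is correct
        return 1 if expert_correct else 2
    return 3 if expert_correct else 4  # method is incorrect
```

Checking the undefined result first ensures that conditions 3 and 4 are only reached when the method has actually given a definite but wrong answer.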

Table VII contains a summary of the results of the method of the present invention compared with the performance of an expert in making the same diagnosis.

Table VII

| Conditions | Number of cases: Benign | Number of cases: Malignant |
|---|---|---|
| 1. M correct, E correct | 71 (73.2%) | 35 (79.5%) |
| 2. M correct, E incorrect | 10 (10.3%) | 1 (2.3%) |
| 3. M incorrect, E correct | 6 (6.2%) | 6 (13.6%) |
| 4. M incorrect, E incorrect | 5 (5.1%) | 0 (0.0%) |
| 5. M undefined, E correct | 3 (3.1%) | 2 (4.6%) |
| 6. M undefined, E incorrect | 2 (2.1%) | 0 (0.0%) |

Modelling conditions included a 3-fold cross validation and 141 cases in the test set; α and β were equal to 0.10, and p(Ai) was taken into account. Table VII shows that for malignant tumours agreement was reached in 79.5% of cases (79.5% and 0.0%) and for the benign form the level of agreement was almost the same, 78.3% (73.2% and 5.1%). The method of the present invention was better than the expert in 10 (10.3%) cases of benign and 1 (2.3%) case of malignant tumours. The expert outperformed the method in 9 (9.3%) cases of benign neoplasm and 8 (18.2%) cases of ovarian cancer.

The results of the method-expert comparison show that the method of the present invention compares well with the diagnoses made by a medical expert.

Example 6

Measurement of the tumour marker CA125 in the serum is common in the diagnosis of ovarian cancer as well as during and after treatment. It has been shown that an abnormally raised level of CA125 is associated with malignancy. However, many women with a benign tumour, or even healthy women, might have raised levels of CA125. This in turn results in a high rate of misdiagnosis. On the other hand, 10 to percent of ovarian cancer patients have normal levels of CA125. Analysis of CA125 is expensive and the laboratory results take a considerable time to be produced. Therefore it is important to evaluate the role of CA125 in the preoperative differential diagnosis of adnexal masses and to establish the conditions in which the diagnosis will definitely benefit from CA125, or to identify the conditions in which CA125 can be omitted from the diagnostic procedures.

The method of the present invention was used to analyse a set of test data for a range of subjects and to compare the results obtained when the CA125 data were included in the data set with the results obtained when the CA125 data were omitted. The results are summarised in Table VIII.

Table VIII

| | With CA125 | Without CA125 |
|---|---|---|
| Acc, % | 90.1 ± 1.8 | 90.3 ± 1.8 |
| Se, % | 87.9 ± 1.9 | 86.4 ± 2.0 |
| Sp, % | 91.1 ± 1.7 | 92.1 ± 1.6 |
| PPV, % | 79.4 ± 2.4 | 81.2 ± 2.3 |
| PNV, % | 93.5 ± 1.5 | 92.7 ± 1.5 |
| Undefined (M), rate | 4 - 10 | 6 - 8 |
| Undefined (B), rate | 10 - 17 | 13 - 18 |
| Undefined (Total), rate | 15 - 21 | 21 - 24 |
| Median number of variables | 4 | 4 |

For the analysis summarised in Table VIII, both α and β were set at 0.05.

As can be seen from Table VIII, the CA125 result does not significantly improve the classification performance, but brings more certainty to the decision making process by reducing the total number of undefined cases, although the median number of variables stays the same.

The CA125 test has an importance factor J that ranks it fourth in importance. As demonstrated in Example 1, the method of the present invention may not require four data points in order to reach a diagnosis. In an experiment, CA125 was used in 80 patients out of 141. In 66 (82.5%) cases the absence of CA125 did not change the outcome of the method of the present invention, and the rates per group were 30 (78.9%) benign cases and 36 (85.7%) malignant cases. The exclusion of CA125 produced worse results in 9 (11.3%) cases and better results in 5 (6.3%) cases.

Accordingly, the method of the present invention may be used to provide a diagnosis in a significant number of cases on the basis of a data set that does not contain CA125 data. If the method is applied and the result is indeterminate, that is neither threshold is exceeded, this is an indication that further data are required, which may include CA125 data.
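The decision flow described above (sum the diagnostic coefficients in order of decreasing importance factor, stop when the sum leaves the threshold range, and request further data if it never does) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `sequential_decision`, the tuple layout and all numeric values are hypothetical, the coefficients and importance factors are taken as already computed, and the mapping of each threshold to a particular diagnosis is deliberately left open.

```python
from typing import Sequence, Tuple


def sequential_decision(
    results: Sequence[Tuple[float, float]],
    lower: float,
    upper: float,
) -> Tuple[str, int]:
    """Sum diagnostic coefficients in order of decreasing importance.

    results: (importance factor J, diagnostic coefficient DC) per analysis.
    Returns the outcome and the number of results that were needed.
    """
    # Most important result first.
    ordered = sorted(results, key=lambda r: r[0], reverse=True)
    total = 0.0
    for used, (_, dc) in enumerate(ordered, start=1):
        total += dc
        if total > upper:
            return "upper threshold exceeded", used
        if total < lower:
            return "lower threshold exceeded", used
    # All available results summed without leaving the threshold range.
    return "indeterminate: further data required", len(ordered)


# Hypothetical values, for illustration only.
decided = sequential_decision([(0.2, 0.5), (1.5, 2.0), (0.9, 1.2)], -2.944, 2.944)
undecided = sequential_decision([(1.0, 0.4), (0.5, -0.3)], -2.944, 2.944)
print(decided)
print(undecided)
```

In the first call the sum crosses the upper threshold after two of the three results, so the least important result is never needed; in the second the sum stays inside the range, which corresponds to the case where further data, possibly including CA125, would be requested.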

Claims

1. A method of characterising a medical condition of a subject, the method comprising the steps of: a) provide a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; b) calculate a diagnostic coefficient DCi for each result of an analysis xi; c) calculate an importance factor Ji for each result of an analysis xi; d) specify an acceptable level of error in the characterisation; e) determine the thresholds for the diagnostic coefficients DCi using the error level specified in step (d) and define a threshold range therebetween; f) compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in step (e); g) successively sum the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in step (e); and h) identify the threshold exceeded in step (g).

2. The method according to claim 1, wherein the medical condition is a tumour.

3. The method according to claim 1, wherein the medical condition is ovarian cancer.

4. The method according to any preceding claim, wherein the set of symptom data comprises a plurality of results.

5. The method according to any preceding claim, wherein the data in the set of symptom data are all discrete data.

6. The method according to claim 5, wherein the symptom data are all binary data.

7. The method according to any of claims 1 to 4, wherein the set of symptom data comprises continuous data.

8. The method according to claim 7, further comprising the step of: a1) converting the continuous data in the set of symptom data into discrete data.

9. The method according to claim 8, wherein the conversion of the data is carried out using fuzzy logic.

10. The method according to claim 8, wherein the conversion of the data is carried out using clustering techniques.

11. The method according to claim 10, wherein the clustering techniques comprise univariate and multivariate clustering.

12. The method according to any preceding claim, wherein the specification of an acceptable level of error comprises specifying values for α and β.

13. A method of characterising a medical condition of a subject, the method comprising the steps of: a) provide a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; a1) convert continuous result data present in the set of symptom data into discrete result data; b) calculate a diagnostic coefficient DCi for each result of an analysis xi; c) calculate an importance factor Ji for each result of an analysis xi; d) specify an acceptable level of error in the characterisation; e) determine the thresholds for the diagnostic coefficients DCi using the error levels specified in step (d) and define a threshold range therebetween; f) compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in step (e); g) if DCimax does not exceed one of the thresholds determined in step (e), successively sum the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in step (e); and h) identify the threshold exceeded in step (g).

14. A system for characterising a medical condition of a subject, the system comprising: a) means for providing a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; b) means for calculating a diagnostic coefficient DCi for each result of an analysis xi; c) means for calculating an importance factor Ji for each result of an analysis xi; d) means allowing a user to specify an acceptable level of error in the characterisation; e) means to determine the thresholds for the diagnostic coefficients DCi using the error levels specified in feature (d) and define a threshold range therebetween; f) means to compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in feature (e); g) means for successively summing the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in feature (e).

15. The system according to claim 14, further comprising means for converting continuous data in the set of symptom data into discrete data.

16. The system according to claim 15, wherein the means for converting the continuous data employs fuzzy logic and/or clustering techniques.

17. A system for characterising a medical condition of a subject, the system comprising: a) means for providing a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; a1) means for converting continuous result data present in the set of symptom data into discrete result data; b) means for calculating a diagnostic coefficient DCi for each result of an analysis xi; c) means for calculating an importance factor Ji for each result of an analysis xi; d) means allowing a user to specify an acceptable level of error in the characterisation; e) means to determine the thresholds for the diagnostic coefficients DCi using the error levels specified in feature (d) and define a threshold range therebetween; f) means to compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in feature (e); g) means for successively summing the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in feature (e).

18. A system for characterising a medical condition of a subject, the system comprising: a) means for providing a set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; a1) means for converting continuous result data present in the set of symptom data into discrete result data; b) means for calculating a diagnostic coefficient DCi for each result of an analysis xi; c) means for calculating an importance factor Ji for each result of an analysis xi; d) means allowing a user to specify an acceptable level of error in the characterisation; e) means to determine the thresholds for the diagnostic coefficients DCi using the error levels specified in feature (d) and define a threshold range therebetween; f) means to compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in feature (e); g) means for successively summing the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in feature (e).

19. A method of identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject, the method comprising the steps of: a) provide a first set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; b) calculate a diagnostic coefficient DCi for each result of an analysis xi; c) calculate an importance factor Ji for each result of an analysis xi; d) specify an acceptable level of error in the characterisation; e) determine the thresholds for the diagnostic coefficients DCi using the error levels specified in step (d) and define a threshold range therebetween; f) compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in step (e); g) successively sum the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in step (e); h) identify the threshold exceeded in step (g); i) if no threshold is exceeded as a result of the successive summation of all the diagnostic coefficients DCi in step (g), provide further symptom data for the subject.

20. The method according to claim 19, further comprising the step of: a1) converting continuous data contained in the first set of symptom data into discrete data.

21. A method of identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject, the method comprising the steps of: a) provide a first set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; a1) convert continuous result data present in the set of symptom data into discrete result data; b) calculate a diagnostic coefficient DCi for each result of an analysis xi; c) calculate an importance factor Ji for each result of an analysis xi; d) specify an acceptable level of error in the characterisation; e) determine the thresholds for the diagnostic coefficients DCi using the error levels specified in step (d) and define a threshold range therebetween; f) compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in step (e); g) successively sum the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in step (e); h) identify the threshold exceeded in step (g); i) if no threshold is exceeded as a result of the successive summation of all the diagnostic coefficients DCi in step (g), provide an indication that further symptom data for the subject are required.

22. A system for identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject, the system comprising: a) means to provide a first set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; b) means to calculate a diagnostic coefficient DCi for each result of an analysis xi; c) means for calculating an importance factor Ji for each result of an analysis xi; d) means for specifying an acceptable level of error in the characterisation; e) means to determine the thresholds for the diagnostic coefficients DCi using the error levels specified in feature (d) and define a threshold range therebetween; f) means to compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in feature (e); g) means to successively sum the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in feature (e); and h) means that, if no threshold is exceeded as a result of the successive summation of all the diagnostic coefficients DCi in step (g), provide an indication that further symptom data for the subject is required.

23. A system for identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject, the system comprising: a) means to provide a first set of symptom data for the subject, the set of symptom data comprising at least one result of an analysis xi conducted on the subject; a1) means for converting continuous result data present in the set of symptom data into discrete result data; b) means to calculate a diagnostic coefficient DCi for each result of an analysis xi; c) means for calculating an importance factor Ji for each result of an analysis xi; d) means for specifying an acceptable level of error in the characterisation; e) means to determine the thresholds for the diagnostic coefficients DCi using the error levels specified in feature (d) and define a threshold range therebetween; f) means to compare the value of the diagnostic coefficient DCi of the analysis xi with the highest importance factor Ji, DCimax, with the thresholds determined in feature (e); g) means to successively sum the diagnostic coefficient DCimax with the diagnostic coefficient DCi having the next highest importance factor Ji until the value of the sum lies outside the threshold range defined in feature (e); and h) means that, if no threshold is exceeded as a result of the successive summation of all the diagnostic coefficients DCi in step (g), provide an indication that further symptom data for the subject are required.

24. A method for characterising a condition of a subject substantially as hereinbefore described having reference to the accompanying figure.

25. A system for characterising a condition of a subject substantially as hereinbefore described.

26. A method of identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject substantially as hereinbefore described.

27. A system for identifying the investigative procedures required to reach a diagnosis of a medical condition of a subject substantially as hereinbefore described.
