WO2022006628A1

WO2022006628A1 - Computer-implemented method and system for identifying measurable features for use in a predictive model

Info

Publication number: WO2022006628A1
Application number: PCT/AU2021/050723
Authority: WO
Inventors: George Connaught MAYNE; Damian James HUSSEY
Original assignee: Southern Adelaide Local Health Network Inc.
Priority date: 2020-07-08
Filing date: 2021-07-07
Publication date: 2022-01-13

Abstract

A method and system for identifying a subset of physically measurable features from a number of candidate physically measurable features potentially associated with a physical characteristic is disclosed. The method comprises receiving measured data, the measured data comprising respective datasets of measurements of the number of candidate physically measured features and the associated physical characteristic and in an outer loop iteratively partitioning the measured data into training data and validation data to form multiple outer loop training data sets and associated outer loop validation data sets. For each of the outer loop training data sets in an inner loop the method comprises iteratively generating a respective set of randomly sampled inner loop training data subsets and associated inner loop test data subsets; and generating predictive models based on the respective set of inner loop training data subsets and inner loop test data subsets each having an optimised set of physically measurable features and an associated optimised predictive capacity. A collection of optimised predictive models corresponding to each of the multiple outer loop training data sets and their respective sets of inner loop training and test data subsets is then formed and the subset of physically measurable features is identified by determining the subset of stable physically measurable features from the collection of optimised predictive models generated in the inner loop.

Description

COMPUTER-IMPLEMENTED METHOD AND SYSTEM FOR IDENTIFYING MEASURABLE FEATURES FOR USE IN A PREDICTIVE MODEL

TECHNICAL FIELD

[0001] The present disclosure relates to computer-implemented methods and systems for determining a subset of physically measurable features from a number of candidate physically measurable features for use in a predictive model for detecting a physical characteristic.

BACKGROUND

[0002] Often it is a goal of computer-implemented predictive models to provide a reliable detection or prediction of the occurrence of a physical characteristic based on a number of physically measurable features which, upon measurement, would be fed into the predictive model to provide the result indicating the physical characteristic. In many cases, there are a large number of candidate physically measurable features that could be related to an associated physical characteristic that is being sought to be determined and the problem is to identify what subset of physically measurable features is relevant to this determination.

[0003] In one non-limiting example, the ability to identify biological features (ie, physically measurable features that could be potentially associated with a biological characteristic of a subject) that are indicative of the biological characteristic (eg, a disease condition) remains an important goal in the field of health care.

[0004] For example, it has long been recognised that the identification of biomarkers with clinical utility will greatly benefit human health. While there has been substantial progress over the years with the identification of many biomarkers associated with certain medical conditions, the promise of many other biomarkers has failed to be achieved.

[0005] While there are various reasons why some biomarkers fail to achieve clinical utility, it has become increasingly apparent that the selection and reproducibility of biomarkers has been widely affected by the methods used for data analysis to identify the biomarkers. For example, in some cases reanalysis of large studies of microarray-based cancer prognosis has concluded that the originally reported assessments were overly optimistic, and that only some of the data sets yielded classifiers or predictive models that performed better than chance.

[0006] Furthermore, it has been reported that half of the reported prognostic gene signatures from microarray studies in cancer that were examined were not reproducible due to critical flaws in the data analysis methods. The primary issues were found to be with model overfitting and the incorrect application of statistical techniques.

[0007] Typically, a key approach to improving biomarker-based predictive models is to validate the trained predictive model using a set of samples separate from the samples used for training, where each sample corresponds to the set of measured features (eg, biomarkers) and the associated physical characteristic that is to be determined (eg, presence of disease). However, this approach alone does not maximise the information that can be derived from valuable samples, and for discovery studies, which are often necessarily small, it is prone to error resulting from biological variation.

[0008] Cross validation is a more powerful method used generally for the training of predictive models and in particular in the health sciences to assist with the selection of biological features, but its implementation is not straightforward, and it is often used to compute an error estimate for a predictive model or classifier that has itself been tuned using cross validation with the same data. Importantly, the use of cross validation in and of itself does not assist with determining what subset of the entire range of potential physically measured features should be adopted for the predictive model.

[0009] Accordingly, there is a need for improved methods to identify those physically measurable features that ought to be used in predictive models to be indicative of the associated physical characteristic that is to be determined.

SUMMARY

[0010] In a first aspect, the present disclosure provides a computer-implemented method for identifying a subset of physically measurable features from a number of candidate physically measurable features potentially associated with a physical characteristic, the subset of physically measurable features for use in a predictive model for detecting the physical characteristic based on measurements of the identified subset of physically measurable features, the method comprising: receiving measured data by one or more processors of a computing system, the measured data comprising respective datasets of measurements of the number of candidate physically measured features and the associated physical characteristic; in an outer loop iteratively partitioning by the one or more processors the measured data into training data and validation data to form multiple outer loop training data sets and associated outer loop validation data sets: for each of the outer loop training data sets in an inner loop: iteratively generating by the one or more processors a respective set of randomly sampled inner loop training data subsets and associated inner loop test data subsets; and generating predictive models by the one or more processors based on the respective set of inner loop training data subsets and inner loop test data subsets each having an optimised set of physically measurable features and an associated optimised predictive capacity, forming by the one or more processors a collection of optimised predictive models corresponding to each of the multiple outer loop training data sets and their respective sets of inner loop training and test data subsets; and identifying by the one or more processors the subset of physically measurable features by determining the subset of stable physically measurable features from the collection of optimised predictive models generated in the inner loop.

[0011] In another form, determining the subset of stable physically measurable features comprises: ranking by the one or more processors the number of candidate physically measurable features in prevalence order based on their prevalence in the collection of optimised predictive models to form a ranked list of physically measurable features; forming by the one or more processors successive subsets of physically measurable features, wherein a first subset comprises one or more of the most prevalent physically measurable features from the ranked list and successive subsets are formed by iteratively stepping through the ranked list and adding one or more of the next most prevalent physically measurable features to form each new successive subset; for each successive subset of physically measurable features, generating by the one or more processors a predictive model and an associated predictive capacity within each of the outer loop training sets to together form a group of predictive models for each of the successive subsets of physically measurable features; and determining by the one or more processors the subset of stable physically measurable features by determining an optimum subset of physically measurable features that optimises a group predictive capacity measure determined for each group of predictive models that were generated and tested within each outer loop training set.

[0012] In another form, the method further comprises estimating by the one or more processors the predictive capacity of the stable subset of physically measurable features by generating a predictive model using all of the data in each outer loop training set and using each prediction model to predict the corresponding associated outer loop validation data set.

[0013] In another form, the predictive model is a regularised predictive model.

[0014] In another form, the regularised predictive model is a LASSO regression model.

[0015] In another form, the number of candidate physically measured features comprises biomarkers and the physical characteristic is a biological characteristic.

[0016] In another form, the biomarkers comprise miRNA related features.

[0017] In another form, the miRNA features comprise the value of one or more pairs of concentrations of miRNAs.

[0018] In another form, the biological characteristic is a disease, condition or state in a subject.

[0019] In another form, the method further comprises: configuring a data processor accessible by a user with the predictive model for detecting the physical characteristic; entering physically measured values for the subset of physical measurable features into the data processor; and determining, using the predictive model, whether the physical characteristic is detected or not.

[0020] In a second aspect, the present disclosure provides an electronic data record comprising the subset of physically measurable features identified by a method in accordance with the first aspect.

[0021] In a third aspect, the present disclosure provides a feature identification system for identifying a subset of physically measurable features from a number of candidate physically measurable features potentially associated with a physical characteristic, the subset of physically measurable features for use in a predictive model for detecting the physical characteristic based on measurements of the identified subset of physically measurable features, comprising: one or more processors; memory in electronic communication with the one or more processors; and instructions stored in the memory and operable, when executed by the one or more processors, to cause the system to: receive measured data comprising respective datasets of measurements of the number of candidate physically measured features and the associated physical characteristic; in an outer loop iteratively partition the measured data into training data and validation data to form multiple outer loop training data sets and associated outer loop validation data sets: for each of the outer loop training data sets in an inner loop: iteratively generate a respective set of randomly sampled inner loop training data subsets and associated inner loop test data subsets; and generate predictive models based on the respective set of inner loop training data subsets and inner loop test data subsets each having an optimised set of physically measurable features and an associated optimised predictive capacity, form a collection of optimised predictive models corresponding to each of the multiple outer loop training data sets and their respective sets of inner loop training and test data subsets; and identify the subset of physically measurable features by determining the subset of stable physically measurable features from the collection of optimised predictive models generated in the inner loop.

[0022] In a fourth aspect, the present disclosure provides a feature identification system for identifying a subset of physically measurable features from a number of candidate physically measurable features potentially associated with a physical characteristic, the subset of physically measurable features for use in a predictive model for detecting the physical characteristic based on measurements of the identified subset of physically measurable features, the feature identification system comprising a computer system comprising one or more processors having a computer-readable medium encoded with programming instructions executable by the one or more processors to perform the method in accordance with the first aspect.

BRIEF DESCRIPTION OF DRAWINGS

[0023] Embodiments of the present disclosure will be discussed with reference to the accompanying drawings wherein:

[0024] Figure 1 is a flowchart of a computer-implemented method for identifying a subset of physically measurable features from a number of candidate physically measurable features for use in a predictive model in accordance with an illustrative embodiment;

[0025] Figure 2 is a flowchart of a computer-implemented method for determining the stability of physically measurable features for the method illustrated in Figure 1 in accordance with an illustrative embodiment;

[0026] Figure 3 is a data flow overview diagram of a computer-implemented method for identifying a subset of physically measurable features from a number of candidate physically measurable features in accordance with another illustrative embodiment;

[0027] Figure 4 is a system overview diagram of a feature identification system for identifying a subset of physically measurable features from a number of candidate physically measurable features in accordance with an illustrative embodiment;

[0028] Figure 5 shows the details of selected House Keeping Genes;

[0029] Figure 6 shows ROC curves with 95% confidence intervals for sensitivity and specificity at each threshold level with (A) Standard nested 2-stage cross validation method (optimized lambda LASSO regression), (B) Nested 2-stage cross validation with additive penalization (one-standard-error rule) and (C) Method in accordance with present disclosure (11 miR-ratio logistic regression model);

[0030] Figure 7 shows boxplots of the 11 miRNA ratios in the logistic regression model;

[0031] Figure 8 shows the details of the miRNAs included in the 11-miR-ratio logistic regression model;

[0032] Figure 9 shows (A) cross validated sensitivity vs. specificity estimates from ROC curve analysis using the 11 miR-ratio identified in accordance with an illustrative embodiment and (B) cross validated sensitivity and specificity lower bound estimates at increasing threshold levels using the identified 11 miR-ratio model;

[0033] Figure 10 shows details of all differentially expressed house keeping gene normalized miRNAs (non-cancer vs cancer);

[0034] Figure 11 shows details of non-differentially expressed miRNAs present in the 11 miRNA-ratios logistic regression model. miRNAs normalized with the geometric mean of 15 house keeping genes;

[0035] Figure 12 shows boxplots of the differentially expressed miRNAs in the 11-miRNA-ratio; and

[0036] Figure 13 shows boxplots of the non-differentially expressed miRNAs in the 11-miRNA-ratio logistic regression model.

DESCRIPTION OF EMBODIMENTS

[0037] Referring now to Figure 1, there is shown a flowchart of a method 100 for identifying a subset of physically measurable features from a number of candidate physically measurable features for use in a predictive model according to an illustrative embodiment.

[0038] By way of overview, at step 110 measured data comprising respective datasets of the physically measured values corresponding to the candidate physically measurable features potentially associated with the physical characteristic and the associated physical characteristic are received as an input to method 100.

[0039] At step 120, the method 100 involves in an outer loop iteratively partitioning the measured data into training data and validation data to form multiple outer loop training data sets and associated outer loop validation data sets.
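The outer loop partitioning of step 120 could be sketched as follows. This is a minimal illustration only: the disclosure does not prescribe a partitioning scheme, so the use of k-fold style partitioning, the function name `outer_partitions` and the parameters `n_folds` and `seed` are assumptions made for the example.

```python
import random

def outer_partitions(n_samples, n_folds=5, seed=0):
    """Yield (training_indices, validation_indices), one pair per outer loop iteration.

    Assumption: a k-fold style partition; the disclosure only requires
    that the measured data be iteratively partitioned.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal validation folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validation = sorted(folds[k])
        training = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield training, validation

# Example: 10 samples partitioned into 5 outer loop training/validation pairs.
splits = list(outer_partitions(n_samples=10, n_folds=5))
```

Each sample appears in exactly one outer loop validation data set, so every measurement is eventually used for validation without ever being used to train the model that is validated against it.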

[0040] At step 130, for each of the outer loop training data sets the method then involves in an inner loop iteratively generating a respective set of randomly sampled inner loop training data subsets and associated inner loop test data subsets.
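One way the random sampling of step 130 might look is sketched below. The function name `inner_random_splits`, the number of repeats and the test fraction are all assumptions chosen for illustration; the disclosure specifies only that the inner loop subsets are randomly sampled from each outer loop training data set.

```python
import random

def inner_random_splits(training_indices, n_repeats=100, test_fraction=0.25, seed=0):
    """Yield (inner_training, inner_test) index lists drawn from one outer loop training set.

    Assumption: repeated random hold-out sampling with a fixed test fraction.
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(training_indices) * test_fraction))
    for _ in range(n_repeats):
        shuffled = list(training_indices)
        rng.shuffle(shuffled)
        # The first n_test shuffled indices form the inner test subset.
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])

# Example: three random inner splits of an outer loop training set of 8 samples.
inner = list(inner_random_splits(list(range(8)), n_repeats=3))
```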

[0041] At step 140, the respective set of randomly sampled inner loop training data subsets and associated inner loop test data subsets are then used to generate predictive models that each have an optimised set of physically measurable features and an associated optimised predictive capacity. This optimised set will be those physically measurable features that the predictive model identified as being of relevance for determining the associated physical characteristic for the particular training data subset that was used to generate the predictive model. A collection of optimised predictive models may then be formed that corresponds to each of the multiple outer loop training sets but which is ultimately derived from their respective sets of inner loop training and test data subsets.

[0042] At step 150, the subset of physically measurable features is identified by determining the subset of stable physically measurable features from the collection of predictive models generated in the inner loop. In this manner, the final set of physically measurable features is identified or selected on the basis of those features which are consistently or frequently selected over the collection of optimised predictive models that have been generated in the inner loop.

[0043] Referring now to Figure 2, there is shown a flowchart of a method 200 for determining the subset of physically measurable features according to an illustrative embodiment corresponding in one example to step 150 of method 100 illustrated in Figure 1.

[0044] At step 210, the candidate physically measurable features are ranked in prevalence order based on their prevalence in the collection of predictive models to in turn form a ranked list of physically measurable features arising from the collection of predictive models generated in the inner loop. As an example, if a first physically measurable feature appeared in 73% of the predictive models in the collection of predictive models and another second physically measurable feature only appeared in 45% of the predictive models then the first physically measurable feature would be ranked ahead of the second physically measurable feature.
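The prevalence ranking of step 210 can be illustrated with a short sketch. The representation of each optimised predictive model as a set of feature names, the function name `rank_by_prevalence` and the alphabetical tie-break are assumptions for the example.

```python
from collections import Counter

def rank_by_prevalence(selected_feature_sets):
    """Rank features by the fraction of models in the collection that selected them."""
    n_models = len(selected_feature_sets)
    # Count each feature at most once per model.
    counts = Counter(f for features in selected_feature_sets for f in set(features))
    # Most prevalent first; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(feature, count / n_models) for feature, count in ranked]

# Hypothetical collection of four optimised models and the features each retained.
models = [{"f1", "f2"}, {"f1"}, {"f1", "f3"}, {"f2", "f1"}]
ranking = rank_by_prevalence(models)
# ranking == [("f1", 1.0), ("f2", 0.5), ("f3", 0.25)]
```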

[0045] At step 220, successive subsets of physically measurable features are formed starting initially from a first subset that comprises one or more of the most prevalent physically measurable features from the ranked list and then successive subsets are formed by iteratively stepping through the ranked list and adding one or more of the next most prevalent physically measurable features to form each new successive subset. In one example, the first subset comprises the most prevalent physically measurable feature and the second successive subset then comprises the most prevalent physically measurable feature and the second most prevalent physically measurable feature with the third successive subset comprising the three most prevalent physically measurable features and so on. In another example, the stepwise process may involve adding two or more prevalent physically measurable features to generate each successive subset.
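The formation of successive subsets in step 220 might be sketched as below, where `step` (an assumed parameter name) is the number of features added at each stage, covering both the one-at-a-time example and the two-or-more example.

```python
def successive_subsets(ranked_features, step=1):
    """Return nested subsets of a prevalence-ranked feature list, adding `step` features each time."""
    if step < 1:
        raise ValueError("step must be at least 1")
    # The final slice may add fewer than `step` features if the list runs out.
    return [ranked_features[:end] for end in range(step, len(ranked_features) + step, step)]

# Hypothetical ranked list, most prevalent first.
ranked = ["f1", "f2", "f3", "f4", "f5"]
one_at_a_time = successive_subsets(ranked)           # 5 subsets: sizes 1, 2, 3, 4, 5
two_at_a_time = successive_subsets(ranked, step=2)   # 3 subsets: sizes 2, 4, 5
```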

[0046] Having generated multiple successive subsets of physically measurable features, at step 230 for each of these successive subsets of physically measurable features a group of predictive models and associated predictive capacities is generated where each of the group of predictive models is based on an inner loop training set within each respective outer loop training set. In this manner, as an example, if there were 10 outer loop training sets then each group of predictive models would comprise 10 subgroups of predictive models each corresponding to an outer loop training set but restricted to the physically measurable features of the corresponding subset.

[0047] At step 240, the optimum subset of physically measurable features is determined by optimising on a group predictive capacity measure determined for each group of predictive models that were generated and tested within each outer loop training set. In one example, the group predictive capacity measure is the average of the predictive capacities for the group. In another example, the group predictive capacity measure is the maximum predictive capacity for the group. This optimum subset is then identified as the subset of stable physically measurable features.
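The effect of the choice of group predictive capacity measure in step 240 can be seen in the following sketch. It assumes that a higher predictive capacity is better (for example an accuracy-like score); the function name `select_optimum_subset` and the capacity values are hypothetical and not taken from the disclosure.

```python
from statistics import mean

def select_optimum_subset(capacities_by_subset, measure=mean):
    """Pick the feature subset whose group of models has the best group measure.

    capacities_by_subset maps a feature subset (as a tuple) to the list of
    predictive capacities of the models built for that subset.
    Assumption: larger capacity values are better.
    """
    group_scores = {subset: measure(caps) for subset, caps in capacities_by_subset.items()}
    best = max(group_scores, key=group_scores.get)
    return best, group_scores

# Hypothetical capacities, one value per outer loop training set.
capacities = {
    ("f1",): [0.70, 0.72, 0.68],
    ("f1", "f2"): [0.80, 0.78, 0.82],
    ("f1", "f2", "f3"): [0.75, 0.90, 0.70],
}
best_by_mean, _ = select_optimum_subset(capacities, measure=mean)
best_by_max, _ = select_optimum_subset(capacities, measure=max)
```

With these illustrative numbers the average favours the two-feature subset while the maximum favours the three-feature subset, which shows that the two measures named above need not agree.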

[0048] In another example, method 200 further involves then forming an estimate of the predictive capacity of the stable subset of physically measurable features by generating a predictive model using all the data in each outer loop training set and then using each prediction model to predict the corresponding associated outer loop validation data set.

[0049] Referring now to Figure 3, there is shown a dataflow overview diagram 300 of a computer-implemented method for identifying a subset of physically measurable features from a number of candidate physically measurable features according to another illustrative embodiment.

[0050] Measured data are received 310 in the form of respective datasets comprising measurements of the number of candidate physically measured features and the associated physical characteristic. In an outer loop, this measured data is then iteratively partitioned 320 into training data and validation data to form multiple outer loop training sets 321 and validation data sets 322.

[0051] Turning now to the inner loop, for each of the outer loop training sets 321, there are generated respective randomly sampled 330 inner loop training data subsets 331 and associated inner loop test data subsets 332. Based on these inner loop training and testing data subsets 331, 332, predictive models are generated. In one example, the predictive models are based on a regularised model involving the elimination of physically measurable features in the optimisation process.

[0052] In one illustrative embodiment, and as shown in Figure 3, the regularised model is a least absolute shrinkage and selection operator (LASSO) regression model 340 involving shrinkage parameter λ. In this model, as λ increases more and more coefficients of the regression model corresponding to physically measurable features are set to zero. In this embodiment, parameter λ is optimised, and the optimised value corresponds to an optimised set of physically measurable features for that particular trained model.
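The way an increasing shrinkage parameter removes features can be illustrated with soft thresholding. This sketch relies on a simplifying assumption that is not part of the disclosure: for an orthonormal design and a suitably scaled objective, each LASSO coefficient equals the soft-thresholded least squares coefficient. The names `soft_threshold`, `lam` and the coefficient values are hypothetical.

```python
def soft_threshold(coefficient, lam):
    """Shrink a coefficient towards zero by lam, setting it to exactly zero if |coefficient| <= lam."""
    magnitude = abs(coefficient) - lam
    if magnitude <= 0:
        return 0.0
    return magnitude if coefficient > 0 else -magnitude

# Hypothetical unpenalised coefficients for four candidate features.
least_squares = {"f1": 2.0, "f2": -0.8, "f3": 0.3, "f4": -0.1}

def selected_features(lam):
    """Features whose coefficient remains non-zero at shrinkage level lam."""
    return sorted(name for name, b in least_squares.items() if soft_threshold(b, lam) != 0.0)

# As lam grows, fewer features survive.
for lam in (0.0, 0.2, 0.5, 1.0, 2.5):
    print(lam, selected_features(lam))
```

In a real LASSO fit the design is rarely orthonormal and the coefficients are found numerically, but the qualitative behaviour is the same: the set of features with non-zero coefficients at the optimised shrinkage level is the optimised feature set of that model.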

[0053] In other embodiments, other embedded feature selection predictive models may be used including, but not limited to, those based on Ridge regression, sparse regression, support vector machines, K-nearest neighbours or decision trees (eg, random forests and variants).

[0054] In other example embodiments, other feature selection approaches may be adopted including, but not limited to, filter methods such as information gain, chi-square test, Fisher score, correlation coefficient or variance threshold; or wrapper methods such as recursive feature elimination, sequential feature selection or the use of genetic algorithms.

[0055] In this manner, a collection of optimised predictive models having an associated optimised set of physically measurable features will be generated where each of the optimised predictive models corresponds to an outer loop training set 321. The next step is to determine the subset of stable physically measurable features from the collection of optimised predictive models that have been generated in the inner loop.

[0056] For the illustrative embodiment shown in Figure 3, the physically measurable features that optimised each LASSO regression model are selected 340 and then collated and ordered 350 on overall frequency of selection in the various LASSO regression models to in effect generate a ranked list of physically measurable features in accordance with their prevalence or frequency of occurrence in the collection of optimised predictive models.

[0057] The next step involves an iterative stepwise regression process 360 of the frequently selected features involving adding the next most frequently selected features at each step to then select the optimum subset of physically measurable features that has in this example the least average predictive capacity error across the inner loop.

[0058] In this example, this process involves looping through the following steps comprising:

(i) assemble a subset of the most frequently selected features by stepwise addition from the list of frequently selected features;

(ii) use the current subset of features to build predictive models and test their predictive capacity within each outer loop training set;

(iii) determine the average of the predictive capacity for each outer loop training set;

(iv) determine the average of the predictive capacity across the outer loop training sets; and

(v) record the prediction error for the current subset of frequently selected features.

[0059] This is followed by then determining the subset of the most frequently selected features that minimised the average prediction error.
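Steps (i) to (v) and the final minimisation could be sketched as a single loop. The model building itself is abstracted behind a caller-supplied function `errors_for`, which is a hypothetical interface introduced for this example; the sketch also treats the predictive capacity of steps (iii) and (iv) as a prediction error to be minimised, consistent with step (v).

```python
from statistics import mean

def stepwise_select(ranked_features, outer_training_sets, errors_for, step=1):
    """Return ((best_subset, best_error), all_results) for nested subsets of ranked features.

    errors_for(subset, training_set) is assumed to return the prediction errors
    of the models built and tested within that outer loop training set.
    """
    results = []
    for end in range(step, len(ranked_features) + step, step):
        subset = tuple(ranked_features[:end])                      # step (i)
        per_set = [mean(errors_for(subset, training_set))           # steps (ii) and (iii)
                   for training_set in outer_training_sets]
        results.append((subset, mean(per_set)))                     # steps (iv) and (v)
    return min(results, key=lambda r: r[1]), results

def toy_errors(subset, training_set):
    """Stand-in for model building: error is lowest for two features (illustrative only)."""
    base = 0.2 + 0.1 * abs(len(subset) - 2)
    return [base + 0.01 * training_set, base + 0.02 * training_set]

best, all_results = stepwise_select(["a", "b", "c", "d"], [0, 1, 2], toy_errors)
# best[0] == ("a", "b"), because the toy error is smallest at two features
```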

[0060] This optimised subset of physically measurable features may then be further characterised or validated by using them to generate 370 a predictive model in each of the outer loop training sets 321 to predict each of the corresponding validation data sets 322.
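The validation described in this paragraph might be sketched as follows. The toy threshold classifier, the accuracy score and all names (`estimate_on_validation`, `fit_threshold`, `accuracy`) are assumptions standing in for whatever predictive model and predictive capacity measure an implementation actually uses.

```python
from statistics import mean

def estimate_on_validation(outer_splits, stable_features, fit, score):
    """Refit on each full outer loop training set and score on its paired validation set."""
    estimates = []
    for training_rows, validation_rows in outer_splits:
        model = fit(training_rows, stable_features)
        estimates.append(score(model, validation_rows, stable_features))
    return estimates

def fit_threshold(rows, features):
    """Toy model: a threshold on the summed features, midway between the two class means."""
    totals = {0: [], 1: []}
    for values, label in rows:
        totals[label].append(sum(values[f] for f in features))
    return (mean(totals[0]) + mean(totals[1])) / 2

def accuracy(threshold, rows, features):
    """Fraction of rows whose thresholded prediction matches the recorded label."""
    hits = [int(sum(values[f] for f in features) > threshold) == label for values, label in rows]
    return sum(hits) / len(hits)

# Hypothetical measured data: (feature values, physical characteristic present?).
rows = [({"f1": 0.10, "f2": 0.20}, 0), ({"f1": 0.90, "f2": 0.80}, 1),
        ({"f1": 0.20, "f2": 0.10}, 0), ({"f1": 0.80, "f2": 0.90}, 1),
        ({"f1": 0.15, "f2": 0.25}, 0), ({"f1": 0.85, "f2": 0.75}, 1)]
outer_splits = [(rows[2:], rows[:2]), (rows[:2] + rows[4:], rows[2:4]), (rows[:4], rows[4:])]
estimates = estimate_on_validation(outer_splits, ["f1", "f2"], fit_threshold, accuracy)
```

Because the validation data sets play no part in selecting the features or fitting the models, the resulting estimates are not biased by the selection process in the way an estimate computed on the training data would be.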

[0061] Referring now to Figure 4, there is shown a system overview diagram of a feature identification system 400 for identifying a set of physically measurable features 460 (ie, Feature 1 to Feature M) from a number of candidate physically measurable features potentially associated with a physical characteristic according to an illustrative embodiment. Feature identification system 400 comprises at least a computer system or data processor 410 that may comprise one or more processors and memory 420 in electronic communication with data processor 410 and for storing one or more computer readable instructions for execution by the one or more processors to implement the methods in accordance with the present disclosure.

[0062] As shown in Figure 4, feature identification system 400 receives measured data 450 in the form of respective data sets (eg, Data Set 1 to Data Set N), each data set comprising measurements of the candidate physically measurable features and the associated physical characteristic. Feature identification system 400 is configured to operate, in various embodiments or configurations, the computer-implemented methods variously described with respect to Figures 1, 2 and 3 to identify the subset of physically measurable features to be used in a predictive model for detecting the physical characteristic.

[0063] Data processor 410 of feature identification system 400, or indeed any suitable data processor or computer system, may be configured with the predictive model for detecting the physical characteristic and a user may then enter physically measured values corresponding to the subset of physically measured features determined in accordance with the present disclosure and a determination may be made whether the physical characteristic is detected or not by the predictive model.

[0064] It will be appreciated that the various illustrative logical blocks, modules and algorithm steps described in connection with the embodiments described in the present disclosure may be implemented as electronic hardware, computer software or instructions, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether the disclosed functionality is implemented as hardware or software will depend upon the requirements of the application and design constraints imposed on the overall system. It would also be appreciated that the systems and methods described here may be implemented using multiple components and modules that may be separate or co-located and that the components of the system may be interconnected by any form or medium of digital data communication.

[0065] In one embodiment, methods and systems in accordance with the present disclosure may be adopted for identifying a set of features for use in a predictive model for determining, detecting and/or assessing a physical characteristic in the form of a biological characteristic.

[0066] In certain embodiments, the biological characteristic is the presence, absence or likelihood of a disease, condition or state in a subject. In certain embodiments, the biological characteristic is the presence, absence or likelihood of developing a cancer in a subject.

[0067] Other types of biological characteristics are contemplated, such as prediction of response to a therapy (eg, response to chemotherapy), extent of response to a treatment, likelihood of developing a specific biological characteristic, drug metabolism, extent of a physical or a mental characteristic existing in the subject, detection of residual cancer after treatment, for example after chemo radiotherapy followed by surgery, and detection of cancer micro-metastases that are otherwise undetectable using current imaging modalities.

[0068] In certain embodiments, the set of physically measurable features comprise features associated with a human or animal subject. Examples of such features include clinical features or characteristics of a subject, data or features determined from scanning or imaging of a subject, pharmacological characteristics of a subject, characteristics of one or more biomarkers, or any combination of the aforementioned.

[0069] In certain embodiments, the set of features comprises imaging data, scanning data, clinical features from a subject, pharmacological data, pharmacokinetic data, or data associated with one or more biomarkers. Methods for determining such features are known in the art.

[0070] In certain embodiments, the set of features comprise one or more features associated with one or more biomarkers. Examples of biomarkers include one or more of miRNAs, proteins, mRNAs, metabolites, lipids, carbohydrates, metals, receptors, ligands and genetic markers. Other types of biomarkers and other characteristics are contemplated.

[0071] In certain embodiments, the set of features comprises the level, presence, absence, and/or other characteristics of the biomarkers (eg location, association with other molecules, or signalling properties).

[0072] In certain embodiments, the candidate features comprise features associated with a subject. Examples of such features include clinical features or characteristics of a subject, data or features determined from scanning or imaging of a subject, pharmacological characteristics of a subject, biomarker characteristics, or any combination of the aforementioned.

[0073] In certain embodiments, the candidate features comprise imaging data, scanning data, clinical features from a subject, pharmacological data, pharmacokinetic data, or data associated with one or more biomarkers, or any combination of the aforementioned. Methods for determining such candidate features are known in the biological arts.

[0074] Examples of candidate features comprise the level, presence, absence or other characteristic of one or more of miRNAs, proteins, mRNAs, metabolites, lipids, carbohydrates, metals, receptors, ligands and genetic markers.

[0075] Methods for detecting and determining the characteristics of biomarkers are known in the art.

[0076] In certain embodiments, the set of features comprise characteristics of DNA. In certain embodiments, the set of features comprise the presence or absence of genetic markers, such as the presence or absence of mutations, polymorphisms, insertions, deletions, inversions or rearrangements of DNA, or a combination thereof.

[0077] In certain embodiments, the set of features comprise characteristics of RNA. In certain embodiments, the set of features comprise the presence or absence of mRNAs, miRNAs, rRNAs, tRNAs, piRNAs, other small RNAs, or a combination thereof.

[0078] In certain embodiments, the set of features comprise a characteristic of lipids, carbohydrates, metals, receptors, ligands, or a combination thereof.

[0079] In certain embodiments, the present disclosure provides a set of selected features identified by the method as described herein.

[0080] In certain embodiments the present disclosure provides the use of the selected features as described herein for characterising a biological characteristic, such as for diagnostic and/or prognostic purposes.

[0081] In certain embodiments, the set of features identified are used for identifying a subject suffering from, or susceptible to, a disease, condition or state in a subject.

[0082] In certain embodiments, the methods as described herein comprise obtaining a biological sample and determining a set of candidate features from the biological sample. Methods for determining biological characteristics from a sample are known. Examples of biological samples include biological fluids such as blood, plasma, and urine, tissue samples, and tissue biopsies.

[0083] In certain embodiments, the method comprises processing a sample. Methods for processing samples, for example to determine the presence or concentration of biomarkers in a subject, are known in the art.

[0084] In certain embodiments, the method comprises obtaining a sample from a subject and processing the sample to allow detection of biomarkers. In certain embodiments, the processing of the samples permits the detection of biomarkers in a fraction or part of the sample.

[0085] Methods and products for processing a sample to permit isolation or enrichment of a fraction or part of the sample are known in the art.

[0086] In certain embodiments, the method comprises obtaining a sample from a subject and isolating/enriching a fraction or part from the sample. In certain embodiments, the method comprises obtaining the sample from a subject, and processing the sample to isolate/enrich a fraction or part of the subject.

[0087] Standard techniques may be used for cell culture, recombinant DNA technology, oligonucleotide synthesis, enzyme assays, antibody production, peptide synthesis, tissue culture and transfection. Enzymatic reactions and purification techniques may be performed according to manufacturer's specifications or as commonly accomplished in the art or as described herein. The foregoing techniques and procedures may be generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the present specification.

[0088] The following references provide directions to assist with performing one or more methods described herein, and are hereby incorporated by reference: Molecular Cloning: A Laboratory Manual, 3rd ed., Vols 1, 2 and 3 J.F. Sambrook and D.W. Russell, ed., Cold Spring Harbor Laboratory Press, 2001; Laboratory Manual on Biotechnology (2008); Practical Immunology - A Laboratory Manual, Balakrishnan, Senthilkumar & Karthik, Kaliaperumal & Duraisamy, Senbagam. (2015) 10.13140/RG.2.1.4075.4728; Cell Biology (Third Edition) A Laboratory Handbook, 2006 Edited by Julio E. Celis; ISBN: 978-0-12-164730-8, Academic Press; and “Antibodies: A Laboratory Manual” Edward A. Greenfield; Cold Spring Harbor Laboratory Press, 2014; Lin, M. Z., Martin, J. L. and Baxter, R. C. (2015).

[0089] The present disclosure is further described by the following examples. It is to be understood that the following description is for the purpose of describing particular embodiments only and is not intended to be limiting with respect to the above description.

[0090] In one example embodiment, the biological characteristic is the detection of oropharyngeal squamous cell carcinoma and the candidate physically measured features comprise measured extracellular vesicle microRNAs with the resulting subset of physically measurable features corresponding to a selection of miRNAs that could be adopted as a biomarker signature.

[0091] The following example is based, at least in part, upon the recognition that a standard cross validation approach was unable to produce a biomarker signature with good cross validated predictive capacity, and that a stabilised biomarker selection approach selecting a subset of the most frequently selected miRNAs in accordance with the present disclosure allows for a set of biomarkers with improved specificity and sensitivity.

[0092] EXAMPLE 1 - Cross validated serum small extracellular vesicle microRNAs for the detection of oropharyngeal squamous cell carcinoma

[0093] SUMMARY OF EXAMPLE

[0094] BACKGROUND OF EXAMPLE

[0095] Oropharyngeal squamous cell carcinoma (OPSCC) is often diagnosed at an advanced stage because the disease often causes minimal symptoms other than metastasis to neck lymph nodes. Better tools are required to assist with the early detection of OPSCC. MicroRNAs (miRNAs, miRs) are potential biomarkers for early head and neck squamous cell cancer diagnosis, prognosis, recurrence, and presence of metastatic disease. However, there is no widespread agreement on a panel of miRNAs with clinically meaningful utility for head and neck squamous cell cancers. Whilst this could be due to variations in the collection, storage, pre-processing, and isolation of RNA, several reports have indicated that the selection and reproducibility of biomarkers has been widely affected by the methods used for data analysis. The primary analysis issues appear to be model overfitting and the incorrect application of statistical techniques. The purpose of this study was to develop a robust statistical approach to identify a miRNA signature that can distinguish controls and patients with inflammatory disease from patients with human papilloma virus positive (HPV+) OPSCC.

[0096] Methods:

[0097] Small extracellular vesicles were harvested from the serum of 20 control patients, 20 patients with gastroesophageal reflux disease (GORD), and 40 patients with locally advanced HPV+ OPSCC. MicroRNAs were purified, and expression profiled on OpenArray™.

[0098] Results:

[0099] The Applicants found that a standard cross validation approach was unable to produce a biomarker signature with good cross validated predictive capacity.

[00100] BACKGROUND

[00101] Head and neck cancer is the 6th most common cancer worldwide, with oropharyngeal squamous cell carcinoma (OPSCC) significantly increasing in incidence. Historically the majority of patients presenting with OPSCC have been older with a history of smoking and alcohol consumption. The increasing incidence of OPSCC in the last 20 years amongst younger males, despite a decrease in tobacco and alcohol consumption, has been attributed to human papilloma virus (HPV). Immunohistochemical staining of p16 is used as a surrogate marker for HPV, and is currently the only biomarker used clinically for OPSCC staging. OPSCC is often diagnosed at an advanced stage because the disease often causes minimal symptoms other than metastasis to enlarging lymph nodes in the neck. Better tools would assist with facilitating non-invasive detection of OPSCC for primary care doctors and cancer specialists.

[00102] Biomarkers are biological molecules found in blood, fluid or tissues that can signal either a normal or an abnormal process such as cancer.

[00103] MicroRNAs (miRNAs, miRs) are potential biomarkers for early head and neck squamous cell carcinoma diagnosis, prognosis, recurrence, and presence of metastatic disease. miRNAs are single-stranded noncoding RNA molecules that play a significant role in cancer development. Recent studies have found that miRNAs are dysregulated in head and neck cancer tissue biopsy samples and have potential as diagnostic and prognostic biomarkers. Tissue-based biomarkers, however, require invasive collection and are only available via biopsy or at time of surgery, and thus repeated sampling during the course of the disease, treatment and surveillance is generally not practical. A liquid biopsy, usually blood, can be obtained more easily, and is less invasive than a tissue biopsy. Liquid biopsies can be collected throughout the course of a patient’s disease, and can potentially be used to determine cancer diagnosis, prognosis and recurrence. This allows for real-time changes to treatment plans.

[00104] Circulating miRNAs obtained from blood have been described for head and neck cancer of several anatomical subsites including oral cavity, nasopharynx, larynx, salivary glands and cutaneous malignancies. However, despite widespread efforts to develop clinically significant miRNA biomarker panels, there is a lack of agreement on which specific miRNAs constitute a clinically significant biomarker panel. Recent studies indicate that this may be due in part to differences in detection methodology, as well as biological variability, variations in the collection, storage, pre-processing, and isolation of RNA, as well as poor reporting of detailed methodology, and variation in the methods used for relative quantification and normalisation.

[00105] Several reports have also indicated that the selection and reproducibility of biomarkers has been widely affected by the methods used for data analysis. Studies suggest that previously reported assessments are overly optimistic, and that a few of the earlier data sets yielded classifiers that performed better than chance.

[00106] Furthermore, it has become apparent from microarray studies in cancer that half of the reported prognostic gene signatures that were examined were not reproducible due to critical flaws in the data analysis methods. The primary issues appear to be with model overfitting and the incorrect application of statistical techniques.

[00107] A key approach to improving medical biomarker studies is to validate findings in a separate set of samples. However, this approach alone does not maximise the information that can be derived from valuable samples, and for often necessarily small discovery studies it is prone to error resulting from biological variation. Cross validation is a more powerful method, but its implementation is not straightforward, and it is often used to compute an error estimate for a classifier that has itself been tuned using cross validation with the same data. This method of cross validation has been reported to give biased estimates of classification error.

[00108] METHODS

[00109] Late diagnosis of OPSCC is a significant clinical problem. Primary care doctors and cancer specialists need improved methods for early diagnosis of OPSCC. miRNAs in tumor derived small extracellular vesicles, circulating in blood serum, have excellent potential for this purpose. Our aim was to develop a panel of serum small extracellular vesicle derived miRNAs which show robust cross validation as a diagnostic biomarker for OPSCC.

[00110] Patients:

[00111] Three patient cohorts were included in this study: a ‘control’ patient cohort and a cohort of patients with gastroesophageal reflux disease (GORD) and ulcerative esophagitis formed the non-cancer group, and a cohort of patients with OPSCC formed the cancer group. Blood specimens and related clinical data were accessed with appropriate ethical and governance approvals from the SA ENT Tissuebank (stored by Flinders Medical Centre, Adelaide, South Australia), PROBE-NET (Flinders Medical Centre, Adelaide, South Australia) and Victorian Cancer Biobank from consenting participants. Specimens from cancer patients (n=40) diagnosed with p16 positive advanced stage OPSCC (stage III or IV AJCC 7th Edition - Edge SB, Byrd DR, Carducci MA, Compton CC, Fritz A, Greene F: AJCC cancer staging manual, 7th Edition.: Springer New York; 2010) but no concurrent or previous cancer diagnosis were selected. The diagnosis and AJCC stage were confirmed at a Head and Neck multi-disciplinary team meeting at each respective institution. Specimens from patients without head and neck cancer were selected from a cohort of patients who underwent upper gastrointestinal endoscopy for reasons unrelated to the investigation of any cancer. These patients were recruited via a previously described recruitment process. Patients who had no pathology identified at upper gastrointestinal endoscopy were classified as ‘controls’ (n=20), and a second cohort was determined to have GORD based on the presence of ulcerative esophagitis (any grade) at endoscopy (n=20).

[00112] HPV DNA polymerase chain reaction (PCR):

[00113] Diagnostic tissue blocks were accessed to determine the presence of HPV DNA utilising the method of Antonsson et al. (2015), with minor modification [Antonsson A et al. Cancer Epidemiol 2015, 39:174-181]. The presence of tumor cells in an adjacent section of the tissue block was confirmed by a histopathologist. Tissue sections (3 x 10 μm formalin fixed paraffin embedded) were used to extract DNA using the QIA DNA FFPE Tissue kit (Qiagen, Cat No 56404) with slight modification. Paraffin sections were washed 3x with xylene prior to proteinase K digestion (up to 3.5 hr; after which undigested material was removed via centrifugation). The DNA was eluted in 50 μl ATE buffer from the kit.

[00114] Primers for HPV detection and β-globin were obtained from GeneWorks (Thebarton, South Australia). DNA samples were analysed by PCR for the presence of HPV with the general mucosal HPV primers GP5+ (5’-TTTGTTACTGTGGTAGATACTAC-3’)/GP6+ (5’-GAAAAATAAACTGTAAATCATATTC-3’) essentially as described in Antonsson A et al. Cancer Epidemiol 2015, 39:174-181 and de Roda Husman et al. J Gen Virol 1995, 76 (Pt 4):1057-1062. The PCR reaction mix consisted of GeneAmp 10x buffer II (2.5 μl), 25 mM MgCl2 (3.5 μl), 10 mM dNTP Mix (0.5 μl), 5mM GPT5+ primer (4 μl), 5mM GPT6+ primer (4 μl), 5 U/μl AmpliTaq Gold® DNA Polymerase (0.125 μl), 2.5 μl of eluted DNA and water to make a total volume of 25 μl. PCR thermocycler conditions were 95°C 10 min, 50 cycles of 94°C 90 sec, 55°C 90 sec, 72°C 2 min, followed by 72°C 4 min and 20°C 10 min.

[00115] Ultrapure water was used as a negative control. HeLa cells (HPV 18 positive cervical cancer cell line) were used as a positive control. β-globin PCR with the primers PC03 (5’-CTTCTGACACAACTGTGTTCACTAGC-3’) and PC04 (5’-TCACCACCAACTTCATCCACGTTCACC-3’) was carried out on all samples to ensure they contained enough cells to detect human DNA with the following PCR thermocycler conditions: 95°C 10 min, 50 cycles of 94°C 90 sec, 60°C 90 sec, 72°C 2 min, followed by 72°C 4 min and 20°C 10 min. PCR products were visualised by agarose gel electrophoresis and photographed.

[00116] Blood collection:

[00117] All pre-cancer treatment blood specimens were collected either at time of clinic consultation or at time of endoscopy/surgical procedure (before the administration of any medications). Blood was collected into 8 ml Z Serum Separator Clot Activator tubes Vacuette® (cat# 455078). All blood samples were left at room temperature for a period of 16 hr-24 hr before processing with a standardised protocol established in our laboratory [26].

[00118] Extracellular vesicle isolation and miRNA extraction:

[00119] For small extracellular vesicle isolation, 1 ml aliquots of serum were retrieved, quick thawed, and centrifuged at 16,000 g at 4°C for 30 min to exclude larger microparticles. 250 μl supernatant from each sample was then processed with an ExoQuick™ kit (System Biosciences, CA, United States; EXOQ20A-1) according to the manufacturer’s protocol. Samples were incubated with ExoQuick™ at 4°C for 16 hr. The pellet isolated from each sample was resuspended with 50 μl phosphate buffered saline (PBS). We have previously confirmed that pellets obtained from serum using ExoQuick™ contain particles consistent in size with exosomes (30 - 150 nm), using a Nanosight LM10 Nanoparticle Analysis System and Nanoparticle Tracking Analysis Software (Nanosight Ltd.). We refer to these as small extracellular vesicles, as recommended in the Minimal Information for Studies of Extracellular Vesicles 2018 Guidelines [Théry et al. J Extracell Vesicles 2018, 7:1535750]. Extraction of miRNA from small extracellular vesicles was performed using the commercial miRNeasy Serum/Plasma kit (QIAGEN, #217184) according to the manufacturer’s protocol. Five microlitres (0.1 picomole) of each of the synthetic RNA molecules ath-miR-159a and cel-miR-54 (Shanghai Genepharma Co. Ltd.) were added to the 500 μl QIAzol vesicle lysate before further processing. Twenty four microlitres of RNase-free ultrapure water was used for the final RNA elution step.

[00120] TaqMan OpenArray® miRNA Profiling:

[00121] High throughput QuantStudio™ 12K Flex OpenArray® PCR custom made plates were used for miRNA profiling. These arrays were comprised of a panel of 112 miRNA probes (Table 1) that were selected based upon their abundance in samples from a previous study on serum small extracellular vesicle associated miRNAs.

TABLE 1

Details of 112 miRNAs included on custom OpenArray™

[00122] For each sample, 3.35 μl of RNA was reverse transcribed using a matching Custom OpenArray® miRNA RT pool (Life Technologies cat # A25630) and the TaqMan® microRNA Reverse Transcription Kit (Life Technologies cat # 4366596). cDNA pre-amplifications were carried out with a matching Custom OpenArray® PreAmp pool (Life Technologies cat # 4485255) and TaqMan PreAmp Master Mix (Life Technologies cat # 4488593) on 7.5 μl complementary DNA (cDNA)/sample for each pool. The pre-amplified products (4 μl per sample) were diluted at the recommended 1:40 dilution with 156 μl of RNase-free ultra pure water before mixing with TaqMan OpenArray Real-Time PCR Master Mix (Life Technologies cat # 4462164) and loading onto a 384-well TaqMan OpenArray loading plate. PCR runs were performed using a QuantStudio™ 12K Flex Real-Time PCR System.

[00123] OpenArray® real-time PCR assay data analysis:

[00124] Analyses were performed using R (version 3.4.3), and Microsoft Excel for Mac (version 16).

[00125] The cycle threshold (Ct) value for each PCR assay was determined using the qpcR package v1.4 in R (https://cran.r-project.org/web/packages/qpcR/index.html). Only miRNAs with detectable Cts in at least 50% of samples in one group were considered for the expression analysis. The relative expression of each miRNA was calculated as 2^(40-Ct). Relative expression values for each miRNA were used to derive per patient values for every possible permutation of miRNA ratios.
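By way of illustration only, the relative expression and ratio calculation described above may be sketched as follows. The function names, the miRNA labels and the Ct values are hypothetical and chosen for the example; the handling of undetected assays is not shown.

```python
from itertools import permutations

def relative_expression(ct, max_cycles=40):
    """Relative expression 2^(max_cycles - Ct) for a single PCR assay."""
    return 2.0 ** (max_cycles - ct)

def pairwise_ratios(ct_by_mirna):
    """Ratios for every ordered pair of distinct miRNAs in one sample."""
    rel = {name: relative_expression(ct) for name, ct in ct_by_mirna.items()}
    return {f"{a}/{b}": rel[a] / rel[b] for a, b in permutations(rel, 2)}

# Hypothetical Ct values for a single patient sample (illustrative only).
sample_ct = {"miR-A": 25.0, "miR-B": 27.0, "miR-C": 30.0}
ratios = pairwise_ratios(sample_ct)
```

In this sketch a two-cycle difference in Ct between two miRNAs corresponds to a four-fold ratio, and three miRNAs give six ordered ratios.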

[00126] Selection of miRNA biomarkers:

[00127] The use of gene expression ratios has been shown to provide good sensitivity and specificity in RNA biomarker studies. We therefore calculated the ratio of the relative expression level of each miRNA with every other miRNA. In this example, miRNA ratios with high variation in both of the comparison groups were removed (coefficient of variation > 300%), and the miRNA ratios were then pre-filtered (Mann-Whitney U-test at p<0.05) to remove non-informative ratios [Bourgon et al. Proc Natl Acad Sci USA 2010, 107:9546-9551.]. The remaining ratios were investigated for their capacity to discriminate patients with OPSCC from control patients and patients with GORD and ulcerative oesophagitis. We have previously demonstrated ulceration of the squamous oesophageal mucosa in GORD is associated with an alteration of miRNA expression compared to normal controls.
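A minimal sketch of the two pre-filtering steps described above is given below for illustration. It assumes that a ratio is removed for high variation only when the coefficient of variation exceeds the limit in both groups, and it uses a normal approximation (with tie correction) for the Mann-Whitney U-test rather than any particular statistical package; all function names are hypothetical.

```python
import math
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * stdev(values) / mean(values)

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie corrected)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2, n = len(x), len(y), len(x) + len(y)
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i                             # size of this group of ties
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))

def keep_ratio(cancer, non_cancer, cv_max=300.0, alpha=0.05):
    """Keep a ratio unless it is highly variable in both groups or uninformative."""
    too_variable = (coefficient_of_variation(cancer) > cv_max
                    and coefficient_of_variation(non_cancer) > cv_max)
    return (not too_variable) and mann_whitney_p(cancer, non_cancer) < alpha
```

For small groups an exact test may give a somewhat different p-value from the approximation used in this sketch.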

[00128] Referring once again to Figure 1, in this example the measured data was in the form of respective datasets of the candidate physically measured miRNA ratios and the OPSCC disease condition for the patients in the study (see step 110).

[00129] This measured data was iteratively partitioned in an outer loop of a nested cross validation procedure to form multiple outer loop training data sets and associated outer loop validation data sets (see step 120).

[00130] For each of the outer loop training data sets in an inner loop of the nested cross validation procedure, 50 repeated rounds of 10-fold cross validation were carried out to generate predictive models for each repeat. In this embodiment, the predictive model is a regularised model and in particular a least absolute shrinkage and selection operator (LASSO) regression model. In this case, the regularised model is achieved by making the sum of the absolute values of the regression coefficients of all variables less than a fixed value which is governed by the shrinkage or regularisation parameter λ. The effect of this is that some regression coefficients are set to zero.

[00131] This may be contrasted to a regularised model based on Ridge regression where the coefficients are scaled by a constant factor. In some applications, LASSO regression can perform better than Ridge regression when only a small fraction of the considered variables is associated with the dependent variable (i.e. disease state), or if we are primarily interested in discovering a small set of biomarkers.
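For illustration, the two regularised models may be written in their commonly used penalised form, which is equivalent to the constrained form described above. The notation is an assumption introduced for this sketch and does not appear in the original description: the loss term denotes the unpenalised fitting criterion (for example a residual sum of squares or a negative log-likelihood), p is the number of candidate features and the coefficients are indexed by j.

```latex
\hat{\beta}^{\mathrm{LASSO}}
  = \arg\min_{\beta_0,\,\beta}\;
    \mathcal{L}(\beta_0,\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
\qquad
\hat{\beta}^{\mathrm{Ridge}}
  = \arg\min_{\beta_0,\,\beta}\;
    \mathcal{L}(\beta_0,\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2}
```

The absolute-value penalty can set individual coefficients exactly to zero as the regularisation parameter increases, whereas the squared penalty shrinks coefficients towards zero without generally eliminating them.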

[00132] In this example, the inner loop operates to determine the level of the regularization parameter λ that maximizes the cross validated predictive capacity of the LASSO regression. In this embodiment, a cyclical coordinate descent computed along a regularization path, ie, a range of λ values, was adopted to find a value that produces a model with the least prediction error. In this embodiment, each of the 50 repeats of the 10-fold cross validation in the inner loop consisted of a random split of the samples into 10 folds, so this approach produces 50 λ estimates from each of the outer loop training sets (see steps 130 and 140).
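The partitioning underlying the nested procedure may be sketched as follows, for illustration only. The sketch assumes that the outer loop holds out one sample at a time, which the description does not fix, and it only generates sample indices; fitting a LASSO model and choosing the regularization parameter for each inner repeat is not shown. All names are hypothetical.

```python
import random

def k_fold_indices(indices, k, rng):
    """Randomly split the given indices into k folds of near-equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_partitions(n_samples, n_repeats=50, k=10, seed=0):
    """Yield (held_out, outer_train, inner_repeats) for each outer split.

    Assumption: the outer loop holds out a single sample per iteration.
    """
    rng = random.Random(seed)
    for held_out in range(n_samples):
        outer_train = [i for i in range(n_samples) if i != held_out]
        inner_repeats = [k_fold_indices(outer_train, k, rng)
                         for _ in range(n_repeats)]
        yield held_out, outer_train, inner_repeats

# First outer split for a data set of 78 samples (illustrative size).
held_out, outer_train, inner_repeats = next(nested_partitions(n_samples=78))
```

Each of the 50 inner repeats is a fresh random assignment of the same outer loop training samples to 10 folds, so each repeat can yield its own estimate of the regularization parameter.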

[00133] In accordance with the present disclosure, the next step was to determine the subset of stable miRNA ratios from the collection of optimised predictive models (see step 150).

[00134] In this example, the optimal cut-off value for the percent frequency or prevalence of feature selection across repeated k-fold cross validations, and across the training sets, was identified, in effect identifying the subset of features that are stable against the random fold assignments within each training set, and the sample variance across the training sets. In this example, this involved testing a range of percent cut-offs by an incremental step-down procedure involving ranking according to frequency of selection and then successive stepwise selection at percentile cut-offs to determine the optimum model with the least prediction error. At each step, the subset of stable miRNA-ratios that were selected at or above the cut-off frequency were included in a multivariate logistic regression model which was used to make predictions in the inner loop (see steps outlined in Figure 2).
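The frequency-based step-down selection described above may be sketched as follows. This is an illustrative outline under stated assumptions: the prediction error of a candidate subset is supplied by the caller through a hypothetical `prediction_error` function (standing in for the logistic regression evaluated in the inner loop), and the feature names and cut-off values are invented for the example.

```python
from collections import Counter

def selection_frequencies(selected_sets):
    """Percent of models in which each feature was selected."""
    counts = Counter(f for s in selected_sets for f in set(s))
    return {f: 100.0 * c / len(selected_sets) for f, c in counts.items()}

def stable_subset(freqs, cutoff):
    """Features selected at or above the percent cut-off."""
    return sorted(f for f, pct in freqs.items() if pct >= cutoff)

def best_cutoff(freqs, cutoffs, prediction_error):
    """Step down through the cut-offs and keep the one with the least error."""
    best = None
    for cutoff in sorted(cutoffs, reverse=True):
        features = stable_subset(freqs, cutoff)
        if not features:
            continue                      # nothing is selected this often
        err = prediction_error(features)
        if best is None or err < best[0]:
            best = (err, cutoff, features)
    return (best[1], best[2]) if best else (None, [])

# Features selected by four hypothetical optimised models.
models = [["r1", "r2"], ["r1"], ["r1", "r3"], ["r1", "r2"]]
freqs = selection_frequencies(models)
```

Lowering the cut-off admits progressively less frequently selected features, so the procedure trades stability of the subset against its predictive capacity.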

[00135] The final set of miR-ratios, as determined by the cut-off frequency that produced the lowest prediction error in the inner loop, was used to build a regression model in each outer loop training set, and each model was then used to predict the held-out test sample that was excluded from the model building process.

TABLE 2

List of all LASSO regression miR-ratios selected from the inner cross validation loop.

[00136] Sensitivity and Specificity estimates:

[00137] The outer loop predictions were assessed using Receiver Operating Characteristic (ROC) curve analysis, with 2,000 bootstrap samples to estimate 95% confidence intervals for the sensitivity and specificity at each threshold level [Jiang et al. (2013) Stat Methods Med Res 22:505-518].
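For illustration, bootstrapped confidence intervals for the sensitivity and specificity at a single threshold may be sketched as below. The sketch assumes a percentile bootstrap with cases and non-cases resampled separately, which is one common choice and is not stated in the description; the function names are hypothetical.

```python
import random

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    return tp / n_pos, tn / (len(labels) - n_pos)

def bootstrap_ci(scores, labels, threshold, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap intervals for sensitivity and specificity."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    sens, spec = [], []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))   # resample cases
        boot_neg = rng.choices(neg, k=len(neg))   # resample non-cases
        sens.append(sum(s >= threshold for s in boot_pos) / len(boot_pos))
        spec.append(sum(s < threshold for s in boot_neg) / len(boot_neg))

    def interval(values):
        values = sorted(values)
        lo = values[int((1 - level) / 2 * (len(values) - 1))]
        hi = values[int((1 + level) / 2 * (len(values) - 1))]
        return lo, hi

    return interval(sens), interval(spec)
```

Repeating the calculation over a grid of thresholds gives a confidence band for each point on the ROC curve.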

[00138] Selection of House Keeping Genes:

[00139] For normalisation of the miRNAs we selected 15 miRNAs as House Keeping Genes using the following criteria: (i) they were expressed in all samples and at high levels (median Ct < 30); (ii) they were not statistically different in tissue comparisons (Mann Whitney U test, p > 0.1); (iii) they were not highly variable (coefficient of variation < 2 x standard deviation) and did not contain outliers (samples with levels not within 5-fold of the mean); and (iv) they were correlated at r > 0.7 with the geometric mean of the house keeping genes. The values for these selection criteria for each of the 15 House Keeping Gene miRNAs, plus mature nucleic acid sequences and Accession numbers, are shown in Figure 2.

[00140] Determination of Differential Expression:

[00141] The relative levels of the miRNAs were determined using the formula 2^(40-Ct), and were normalized using the geometric mean of the relative levels of the 15 House Keeping Genes.
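The normalisation step may be sketched as follows, for illustration only. The miRNA labels and relative levels are hypothetical, and only two house keeping miRNAs are used to keep the example short.

```python
from statistics import geometric_mean

def normalise_sample(rel_levels, housekeeping):
    """Divide each relative level by the geometric mean of the house keeping miRNAs."""
    reference = geometric_mean([rel_levels[name] for name in housekeeping])
    return {name: level / reference for name, level in rel_levels.items()}

# Hypothetical relative levels for one sample (illustrative only).
sample = {"hk-1": 4.0, "hk-2": 16.0, "miR-X": 32.0}
normalised = normalise_sample(sample, housekeeping=["hk-1", "hk-2"])
```

The geometric mean is used in this sketch because relative levels are on a multiplicative scale, so a single highly expressed house keeping miRNA does not dominate the reference value.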

[00142] In this example, the normalised miRNAs were pre-filtered using the following criteria: 1) at least 50% of samples amplified in one of the comparison groups, 2) the coefficient of variation was less than 200%, and 3) differential expression was greater than 1.3 fold. Mann Whitney U tests were then used to determine which miRNAs were differentially expressed, and the False Discovery Rate was estimated using the method of Storey (2002) Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64:479-498.

[00143] RESULTS

[00144] Of the 80 RNA samples profiled on OpenArray™, one sample failed to amplify, and data import failed for one other sample. Therefore, the miRNA data available for biomarker discovery was derived from 19 controls, 20 patients with gastroesophageal reflux disease induced ulcerative oesophagitis, and 39 patients with p16 positive OPSCC (27 with confirmed HPV, 12 with tissue unavailable for HPV PCR).

[00145] The clinicopathologic characteristics of the patients included in this analysis are shown in Table 3.

TABLE 3

Clinicopathologic characteristics of the patients included in this analysis.

[00146] Applying the above method for identifying a subset of miRNA ratios produced a regression model in which 11 miR-ratios were identified (see Table 4; boxplots of the 11 miRNA ratios in the logistic regression model are shown in Figure 7; details of the miRNAs included in the 11-miR-ratio logistic regression model are shown in Figure 8).

TABLE 4

[00147] Each row in the table lists the two miRs present in each miR-ratio. The bold highlighted miRNAs were differentially expressed when normalized with selected house keeping genes.

[00148] Referring now to Figure 6, there is shown a series of ROC curves comparing the approach of the present method as described above (see Figure 6C) with LASSO regression in a standard nested 2-stage cross validation (see Figure 6A), where it can be seen that the standard approach produced a multi miR-ratio model with poor predictive capacity for the held-out samples. Figure 6B shows the effect of the nested 2-stage cross validation with additive penalization in accordance with the “one-standard-error rule”, but as can be seen this did not improve the capacity of the resultant LASSO regression model to predict the held-out samples.

[00149] We investigated the potential clinical utility of this model by examining the trade-off between the sensitivity and specificity at different threshold levels from a ROC curve analysis with bootstrapped confidence intervals (see Figures 9A & 9B). When giving equal weight to sensitivity and specificity to determine the model threshold with the maximum predictive capacity (Youden index), the 11-miR-ratio regression model detected OPSCCs with a sensitivity of 90% (95% CI: 79-97%) at a specificity of 79% (95% CI: 67-92%). With a focus on minimising false positives, the 11-miR-ratio model achieved a specificity of 97% (95% CI: 92-100%), and a sensitivity of 54% (95% CI: 38-69%).
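By way of illustration, the threshold giving equal weight to sensitivity and specificity can be found by maximising the Youden index J = sensitivity + specificity - 1 over the candidate thresholds. The sketch below uses hypothetical scores and labels and takes each observed score as a candidate threshold; ties are resolved in favour of the lowest threshold, which is an arbitrary choice made for the example.

```python
def youden_threshold(scores, labels):
    """Return (J, threshold, sensitivity, specificity) at the maximum Youden index."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        sens = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t) / n_pos
        spec = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t) / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best

# Hypothetical model scores (1 = cancer, 0 = non-cancer).
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
j, threshold, sens, spec = youden_threshold(scores, labels)
```

A threshold chosen to minimise false positives would instead be taken further along the curve, accepting a lower sensitivity in exchange for a higher specificity.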

[00150] In order to determine how likely it was to obtain the observed classification performance of the 11-miR-ratio model by chance, we randomly permuted the sample labels 2,000 times in order to estimate the empirical cumulative distribution of the cross validated classification error under the null hypothesis [Golland P, Fischl B: Permutation tests for classification: towards statistical significance in image-based studies. Inf Process Med Imaging 2003, 18:330-341]. The maximum cross validated accuracy achieved from the permutations was 63%. At the threshold corresponding to the Youden index the non-permuted cross validated accuracy was 83%. This suggests that the estimated cross validated prediction accuracy of the 11-miR-ratio model was not due to chance alone.
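The label permutation test may be outlined as follows, for illustration only. In the procedure described above the accuracy for each permutation would be obtained by repeating the cross validated model building with the permuted labels; in this sketch that step is represented by a caller-supplied, hypothetical `accuracy_fn`, and a fixed set of predictions stands in for the model.

```python
import random

def permutation_null(labels, accuracy_fn, n_perm=2000, seed=0):
    """Accuracies obtained after randomly permuting the sample labels."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        permuted = list(labels)
        rng.shuffle(permuted)
        null.append(accuracy_fn(permuted))
    return null

def empirical_p_value(observed, null):
    """Share of permutations scoring at least as well as the observed accuracy."""
    return (1 + sum(a >= observed for a in null)) / (1 + len(null))

# Toy stand-in: fixed predictions scored against permuted labels.
labels = [1] * 10 + [0] * 10
predictions = list(labels)

def toy_accuracy(permuted):
    return sum(p == y for p, y in zip(predictions, permuted)) / len(permuted)

null = permutation_null(labels, toy_accuracy)
p_value = empirical_p_value(toy_accuracy(labels), null)
```

If the observed accuracy exceeds every permuted accuracy, the empirical p-value in this sketch is 1/(number of permutations + 1).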

[00151] We also investigated whether any of the miR-ratios in the model contained individual miRNAs that were significantly differentially expressed when normalised with house keeping gene miRNAs. For this differential expression analysis we estimated a false discovery rate of 18%. All 11 miR-ratios contained at least one differentially expressed house-keeping gene normalised miRNA (Figures 10 to 13).

[00152] The findings from the above study suggest that the serum small extracellular vesicle derived 11-miRNA-ratio signature may be useful for detecting HPV+ OPSCCs. Biomarker discovery studies have historically utilised a single split of patient samples into a discovery cohort and a validation cohort, but it is now known that this is not the most effective use of valuable samples. This is because the development of a predictive model with this approach uses only part (e.g. 50%) of the dataset, so there is the possibility that information about the data will be missed, which can result in bias. Furthermore, a single split of the data may not be able to generate an equitable distribution of all biological or clinical features. These issues can result in overfitting and poor performance in either the validation cohort or in subsequent independent cohorts.

[00153] Cross validation can reduce these effects by training models on many subsets that contain a large proportion of the data, to reduce bias, and then by testing model performance against held out data. However, with cross validation the model that is selected by LASSO regression can differ in each training set.

[00154] Many cancers are associated with a background of chronic inflammation. Patients with GORD and ulcerative esophagitis (a benign inflammatory disease) were included, in order to select against biomarkers associated with non-cancer specific inflammation. This group of patients is associated with inflamed squamous oesophageal epithelium as is the squamous epithelium in HPV associated OPSCC. We have previously demonstrated that chronic inflammatory conditions are associated with miRNA changes compared to healthy controls. miRNAs are potent regulators of immune cell functions involved in inflammatory disease and cancer. The inclusion of an inflammatory non-cancer group as well as a control group is a major strength of this study. Other strengths include incorporating patients with HPV associated OPSCC from three different major head and neck cancer centres, exclusion of patients with concurrent cancers, and the use of serum, rather than plasma, for miRNA profiling.

[00155] Currently, there is no detection test available for primary care physicians to use for patients at risk of HPV associated OPSCC. Usually these patients have non-specific symptoms of a sore throat, or a lump in the throat or neck. These symptoms are not specific for cancer and may be mistakenly diagnosed as infectious or inflammatory. Consequently, some patients are not diagnosed as having HPV associated OPSCC until the cancer is at a more advanced stage. Therefore, a high specificity blood-based biomarker could provide a non-invasive test that could triage patients with HPV associated OPSCC in the primary care setting to receive prompt specialist care.

[00156] The majority of studies examining the role of miRNAs in head and neck cancer have examined their potential role in pathogenesis or prognosis using tissue specimens. Examining the tumor specimen for novel miRNAs is potentially useful for prognosis and treatment, but it does not address the issue of improved detection of head and neck cancer. Few studies have investigated the potential role of circulating miRNAs in the detection of head and neck cancer and none to date have been published for HPV associated OPSCC.

[00157] Another potential area of benefit for a blood-based biomarker is as an adjunct test during the post-treatment surveillance period for the detection of cancer recurrences. Although HPV associated oropharyngeal cancers have a relatively good prognosis, 20-25% of patients develop recurrent disease within 5 years of treatment. Following treatment with curative intent for HPV associated OPSCC, patients are followed up in a clinical surveillance program for signs of recurrence, and to manage post-treatment complications. The primary aim of surveillance is to detect recurrences at an early stage and therefore increase the likelihood of cure with salvage therapy. However, early detection of residual HPV associated OPSCC following treatment can be clinically difficult. Positron emission tomography with 2-deoxy-2-[fluorine-18]fluoro-D-glucose integrated with computed tomography (PET-CT), when available, is the preferred imaging modality for assessment of treatment response, and is utilised in surveillance to aid in the detection of OPSCC recurrences at local, regional and distant sites. However, PET-CT has limited spatial resolution, and tumors or lymph nodes smaller than approximately 1 cm cannot be accurately detected. This limits the sensitivity of PET-CT for detecting small recurrences. In addition, the interpretation of PET-CT following treatment is challenging because treatment-related inflammation and oedema are common causes of tracer uptake that is indistinguishable from residual OPSCC and can result in false positives. PET-CT is therefore not able to be used earlier than 12 weeks post therapy. At a high specificity model threshold the 11-miR-ratio biomarker panel discovered in this current study was able to differentiate HPV associated OPSCCs from control patients and patients with GORD (a benign inflammatory disease) with a cross validated specificity of 97%, at a sensitivity of 54%. The 11-miR-ratio biomarker therefore has the potential to non-invasively detect false positives that result from the use of PET-CT in post-therapy surveillance.
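The trade-off implied by a "high specificity model threshold" can be illustrated with a minimal sketch. It assumes a model that outputs a score per subject, with scores at or above a threshold treated as positive. The scores and labels below are invented for the example and are not data from the study, and the function names are hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def threshold_for_specificity(scores, labels, target):
    """Lowest threshold whose specificity reaches the target.

    Specificity never decreases as the threshold rises, so the lowest
    qualifying threshold is also the one that retains the most sensitivity.
    """
    for t in sorted(set(scores)) + [float("inf")]:
        sens, spec = sensitivity_specificity(scores, labels, t)
        if spec >= target:
            return t, sens, spec

# Invented example: five controls (label 0) and five cases (label 1).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.5, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

threshold, sens, spec = threshold_for_specificity(scores, labels, target=1.0)
print(threshold, sens, spec)  # 0.7 0.6 1.0
```

In this toy data, demanding that no control is called positive pushes the threshold above the highest control score, and the cases scoring below that threshold are missed. This is the same kind of exchange of sensitivity for specificity as described above.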

[00158] The 11-miR-ratio biomarker panel also has the potential to detect recurrences earlier than is currently possible. Currently there are no effective methods for detecting residual cancers within the first six to twelve weeks following treatment. In the most recent study investigating the use of PET-CT for surveillance of HPV associated OPSCCs (i.e. when there was no clinical suspicion of disease recurrence), the positive predictive value was only 13.4%. However, evidence suggests that circulating biomarkers have the potential for detecting early recurrences. Although plasma HPV DNA has the potential to become a highly specific biomarker for HPV associated OPSCCs, it is not applicable for HPV negative OPSCCs or other mucosal head and neck cancers. If a biomarker is able to detect subclinical recurrent disease earlier, then the patient could potentially be salvaged with surgery, radiotherapy or systemic therapies.

[00159] As would be appreciated, systems and methods in accordance with the present disclosure may be used to generate a relevant subset of physically measurable features from a large number of candidate features for use in a trained predictive model in order to determine or detect a physical characteristic. This has important benefits as, for example, in the application of biomarker panels, the number of biomarkers may be reduced to those most relevant for the diagnostic task, thereby reducing the complexity of taking a large number of samples and of making the diagnosis. Reducing the set of physically measurable features also has the important intrinsic benefit that it is likely to improve the computational performance of the predictive model, and it may provide some insight into what physical processes are responsible for the physical characteristic being detected.
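The overall shape of such a feature selection procedure, with an outer partitioning loop and an inner random resampling loop, can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation. A nearest-centroid classifier restricted to the k features with the largest class-mean gap stands in for a regularised model, test-subset accuracy stands in for predictive capacity, and every name and parameter value (`fit`, `accuracy`, `nested_selection`, `n_outer`, `n_inner`, `inner_frac`, `ks`) is hypothetical.

```python
import random
from collections import Counter

def mean(values):
    return sum(values) / len(values) if values else 0.0

def fit(rows, labels, k):
    """Stand-in model: per-feature class centroids plus the k features
    whose class means differ most."""
    centroids = {}
    for j in range(len(rows[0])):
        neg = mean([r[j] for r, y in zip(rows, labels) if y == 0])
        pos = mean([r[j] for r, y in zip(rows, labels) if y == 1])
        centroids[j] = (neg, pos)
    gap = lambda j: abs(centroids[j][1] - centroids[j][0])
    features = sorted(centroids, key=gap, reverse=True)[:k]
    return features, centroids

def accuracy(model, rows, labels):
    """Fraction of samples assigned to the nearer class centroid correctly."""
    features, centroids = model
    correct = 0
    for r, y in zip(rows, labels):
        d = [sum((r[j] - centroids[j][c]) ** 2 for j in features) for c in (0, 1)]
        correct += int((d[1] < d[0]) == (y == 1))
    return correct / len(rows)

def nested_selection(rows, labels, n_outer=5, n_inner=10,
                     inner_frac=0.7, ks=(1, 2, 3), seed=0):
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    collection = []  # one (feature set, score) pair per inner resample
    for o in range(n_outer):
        held_out = set(idx[o::n_outer])  # outer validation set, not used here
        outer_train = [i for i in idx if i not in held_out]
        for _ in range(n_inner):
            shuffled = rng.sample(outer_train, len(outer_train))
            cut = int(inner_frac * len(shuffled))
            tr, te = shuffled[:cut], shuffled[cut:]
            candidates = []
            for k in ks:
                model = fit([rows[i] for i in tr], [labels[i] for i in tr], k)
                score = accuracy(model, [rows[i] for i in te], [labels[i] for i in te])
                candidates.append((score, model[0]))
            # Keep the feature set that scored best on the inner test subset.
            best_score, best_features = max(candidates, key=lambda c: c[0])
            collection.append((frozenset(best_features), best_score))
    return collection

# Synthetic data: features 0 and 1 carry signal, the remainder are noise.
data_rng = random.Random(1)
labels = [i % 2 for i in range(40)]
rows = [[(1.5 * y if j < 2 else 0.0) + data_rng.gauss(0, 1) for j in range(8)]
        for y in labels]

collection = nested_selection(rows, labels)
prevalence = Counter(j for features, _ in collection for j in features)
print(prevalence.most_common())
```

The outer validation sets are deliberately left untouched inside the loops so that they remain available for an unbiased estimate of predictive capacity once a stable feature subset has been chosen.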

[00160] In one non-limiting example directed to the detection or identification of a mineral of interest in an ore sample, the candidate physically measurable features would be mass/charge ratio intensities as measured by a mass spectrometer in the ore sample. The selection method in accordance with the present disclosure would process measured data in the form of measured mass/charge ratio intensities for a number of ore samples, together with whether each ore sample contained the mineral of interest, and determine which mass/charge ratios should be measured and used in a predictive model to determine whether the mineral of interest is present in a newly measured ore sample.

[00161] In another example embodiment, directed to the identification of the presence of a particular animal in a region of interest based on measured acoustic data (e.g. for use in tracking marine species migration), the candidate physically measured features may be spectral intensities over a frequency range from a measured acoustic signal for that region. The selection method in accordance with the present disclosure would process acoustic signals for a number of regions, together with whether each region contained the animal of interest, and determine which frequency values of the frequency range are significant and should be used in a predictive model to determine whether an acoustic signal for a region indicates the presence of that animal in the region.

[00162] Although the present disclosure has been described with reference to particular embodiments, it will be appreciated that the disclosure may be embodied in many other forms. It will also be appreciated that the disclosure described herein is susceptible to variations and modifications other than those specifically described. It is to be understood that the disclosure includes all such variations and modifications. The disclosure also includes all of the steps, features, compositions and compounds referred to, or indicated in this specification, individually or collectively, and any and all combinations of any two or more of the steps or features.

[00163] Also, it is to be noted that, as used herein, the singular forms “a”, “an” and “the” include plural aspects unless the context already dictates otherwise.

[00164] Reference to any prior art in this specification is not, and should not be taken as, an acknowledgment or any form of suggestion that this prior art forms part of the common general knowledge in any country.

[00165] The subject headings used herein are included only for the ease of reference of the reader and should not be used to limit the subject matter found throughout the disclosure or the claims. The subject headings should not be used in construing the scope of the claims or the claim limitations.

[00166] The description provided herein is in relation to several embodiments which may share common characteristics and features. It is to be understood that one or more features of one embodiment may be combinable with one or more features of the other embodiments. In addition, a single feature or combination of features of the embodiments may constitute additional embodiments.

[00167] The methods described herein can be performed in one or more suitable orders unless indicated otherwise herein or clearly contradicted by context. The use of examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate the example embodiments and does not pose a limitation on the scope of the claimed invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential.

[00168] It will be understood that the terms “comprise” and “include” and any of their derivatives (e.g. comprises, comprising, includes, including) as used in this specification, and the claims that follow, are to be taken to be inclusive of features to which the term refers, and are not meant to exclude the presence of any additional features unless otherwise stated or implied.

[00169] In some cases, a single embodiment may, for succinctness and/or to assist in understanding the scope of the disclosure, combine multiple features. It is to be understood that in such a case, these multiple features may be provided separately (in separate embodiments), or in any other suitable combination. Alternatively, where separate features are described in separate embodiments, these separate features may be combined into a single embodiment unless otherwise stated or implied. This also applies to the claims, which can be recombined in any combination. That is, a claim may be amended to include a feature defined in any other claim. Further, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover: a, b, c, a-b, a-c, b-c, and a-b-c.

[00170] It will be appreciated by those skilled in the art that the disclosure is not restricted in its use to the particular application or applications described. Neither is the present disclosure restricted in its preferred embodiment with regard to the particular elements and/or features described or depicted herein. It will be appreciated that the disclosure is not limited to the embodiment or embodiments disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the scope as set forth and defined by the following claims.

Claims

1. A computer-implemented method for identifying a subset of physically measurable features from a number of candidate physically measurable features potentially associated with a physical characteristic, the subset of physically measurable features for use in a predictive model for detecting the physical characteristic based on measurements of the identified subset of physically measurable features, the method comprising: receiving measured data by one or more processors of a computing system, the measured data comprising respective datasets of measurements of the number of candidate physically measured features and the associated physical characteristic; in an outer loop iteratively partitioning by the one or more processors the measured data into training data and validation data to form multiple outer loop training data sets and associated outer loop validation data sets: for each of the outer loop training data sets in an inner loop: iteratively generating by the one or more processors a respective set of randomly sampled inner loop training data subsets and associated inner loop test data subsets; and generating predictive models by the one or more processors based on the respective set of inner loop training data subsets and inner loop test data subsets each having an optimised set of physically measurable features and an associated optimised predictive capacity, forming by the one or more processors a collection of optimised predictive models corresponding to each of the multiple outer loop training data sets and their respective sets of inner loop training and test data subsets; and identifying by the one or more processors the subset of physically measurable features by determining the subset of stable physically measurable features from the collection of optimised predictive models generated in the inner loop.

2. The computer-implemented method of claim 1, wherein determining the subset of stable physically measurable features comprises: ranking by the one or more processors the number of candidate physically measurable features in prevalence order based on their prevalence in the collection of optimised predictive models to form a ranked list of physically measurable features; forming by the one or more processors successive subsets of physically measurable features, wherein a first subset comprises one or more of a most prevalent physically measurable features from the ranked list and successive subsets are formed by iteratively stepping through the ranked list and adding one or more of the next most prevalent physically measurable features to form each new successive subset; for each successive subset of physically measurable features, generating by the one or more processors a predictive model and an associated predictive capacity within each of the outer loop training sets to together form a group of predictive models for each of the successive subsets of physically measurable features; and determining by the one or more processors the subset of stable physically measurable features by determining an optimum subset of physically measurable features that optimises a group predictive capacity measure determined for each group of predictive models that were generated and tested within each outer loop training set.

3. The computer-implemented method of claim 2, further comprising estimating by the one or more processors the predictive capacity of the stable subset of physically measurable features by generating a predictive model using all of the data in each outer loop training set and using each prediction model to predict the corresponding associated outer loop validation data set.

4. The computer-implemented method of any one of the preceding claims wherein the predictive model is a regularised predictive model.

5. The computer-implemented method of claim 4, wherein the regularised predictive model is a LASSO regression model.

6. The computer-implemented method according to any one of the preceding claims, wherein the number of candidate physically measured features comprises biomarkers and the physical characteristic is a biological characteristic.

7. The computer-implemented method according to claim 6, wherein the biomarkers comprise miRNA related features.

8. The computer-implemented method according to claim 3, wherein the miRNA features comprise the value of one or more pairs of concentrations of miRNAs.

9. The computer-implemented method according to any one of claims 6 to 8, wherein the biological characteristic is a disease, condition or state in a subject.

10. The computer-implemented method of any one of the preceding claims, further comprising: configuring a data processor accessible by a user with the predictive model for detecting the physical characteristic; entering physically measured values for the subset of physically measurable features into the data processor; and determining, using the predictive model, whether the physical characteristic is detected or not.

11. An electronic data record comprising the subset of physically measurable features identified by the method according to any one of claims 1 to 9.

12. A feature identification system for identifying a subset of physically measurable features from a number of candidate physically measurable features potentially associated with a physical characteristic, the subset of physically measurable features for use in a predictive model for detecting the physical characteristic based on measurements of the identified subset of physically measurable features, comprising: one or more processors; memory in electronic communication with the one or more processors; and instructions stored in the memory and operable, when executed by the processor, to cause the system to: receive measured data comprising respective datasets of measurements of the number of candidate physically measured features and the associated physical characteristic; in an outer loop iteratively partition the measured data into training data and validation data to form multiple outer loop training data sets and associated outer loop validation data sets: for each of the outer loop training data sets in an inner loop: iteratively generate a respective set of randomly sampled inner loop training data subsets and associated inner loop test data subsets; and generate predictive models based on the respective set of inner loop training data subsets and inner loop test data subsets each having an optimised set of physically measurable features and an associated optimised predictive capacity, form a collection of optimised predictive models corresponding to each of the multiple outer loop training data sets and their respective sets of inner loop training and test data subsets; and identify the subset of physically measurable features by determining the subset of stable physically measurable features from the collection of optimised predictive models generated in the inner loop.

13. A feature identification system for identifying a subset of physically measurable features from a number of candidate physically measurable features potentially associated with a physical characteristic, the subset of physically measurable features for use in a predictive model for detecting the physical characteristic based on measurements of the identified subset of physically measurable features, the feature identification system comprising a computer system comprising one or more processors having a computer-readable medium encoded with programming instructions executable by the one or more processors to perform the method according to any one of claims 1 to 9.
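For illustration only, and without limiting the claims, the ranking and successive-subset steps recited in claim 2 can be sketched as follows. The feature names `f1` to `f3`, the per-training-set scores, and the use of a simple mean as the group predictive capacity measure are all assumptions made for this example.

```python
from collections import Counter

def rank_by_prevalence(feature_sets):
    """Rank features by how many optimised models they appear in."""
    counts = Counter(f for s in feature_sets for f in s)
    # Descending prevalence, then name, so that ties are ordered deterministically.
    return [f for f, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def successive_subsets(ranked, step=1):
    """Grow subsets by adding the next `step` most prevalent features each time."""
    return [ranked[:end] for end in range(step, len(ranked) + step, step)]

def best_subset(subsets, group_scores):
    """Pick the subset with the highest mean score across outer training sets.

    group_scores[i] holds one score per outer loop training set for subsets[i];
    the mean is an assumed choice of group predictive capacity measure.
    """
    means = [sum(s) / len(s) for s in group_scores]
    best = max(range(len(subsets)), key=lambda i: means[i])
    return subsets[best], means[best]

# Hypothetical feature sets taken from a collection of optimised models.
feature_sets = [{"f1", "f2"}, {"f1"}, {"f1", "f3"}, {"f2", "f1"}]
ranked = rank_by_prevalence(feature_sets)
subsets = successive_subsets(ranked)

# Hypothetical scores: one row per subset, one value per outer training set.
group_scores = [[0.70, 0.72, 0.68],
                [0.80, 0.78, 0.82],
                [0.79, 0.77, 0.80]]
chosen, capacity = best_subset(subsets, group_scores)
print(ranked, chosen)  # ['f1', 'f2', 'f3'] ['f1', 'f2']
```

In this toy example the third feature appears in only one model, and adding it does not raise the mean score, so the two-feature subset is retained.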