WO2005081158A2

WO2005081158A2 - Use of feature point pharmacophores (fepops)

Info

Publication number: WO2005081158A2
Application number: PCT/EP2005/001848
Authority: WO
Inventors: John William Davies; Meir Glick; Jeremy Lee Jenkins
Original assignee: Novartis Ag; Novartis Pharma Gmbh
Priority date: 2004-02-23
Filing date: 2005-02-22
Publication date: 2005-09-01
Also published as: WO2005081158A3

Abstract

The invention provides a three-dimensional similarity method for scaffold hopping from known drugs or natural ligands to new chemotypes.

Description

USE OF FEATURE POINT PHARMACOPHORES (FEPOPS)

FIELD OF THE INVENTION

[0001] This invention relates generally to a data processing system or calculating computer designed for or utilized in a measurement system directed to a chemical compound or process in a living system, and more specifically to a three-dimensional similarity method for scaffold hopping from known drugs or natural ligands to new chemotypes.

BACKGROUND OF THE INVENTION

[0002] A primary goal of three-dimensional (3D) similarity searching is to find compounds with similar bioactivity to a reference ligand but with different chemotypes, i.e., "scaffold hopping". However, an adequate description of chemical structures in 3D conformational space is a problem of high-dimensionality.

[0003] The search for compounds with similar bioactivity to a reference ligand but with different molecular frameworks has been variously termed "scaffold hopping" (Schneider G et al., Angewandte Chemie, International Ed. in English 38: 2894-2896 (1999)), "leapfrogging" (Stanton DT et al, Journal of Chemical Information & Computer Science, 39: 21-27 (1999)), and "lead-hopping" (Andrews KM & Cramer RD, Journal of Medicinal Chemistry 43: 1723-1740 (2000)). In silico approaches that seek to systematize this practice have been introduced recently. The ability to move to new scaffolds can be of interest in situations where the natural ligands or substrates of protein targets are known but synthetic inhibitors are not and structural information about the target protein is not available.

[0004] Thus, there is a need in the art for a similarity search method that could use endogenous ligand structures to discover drug-like mimetics in large databases in an automated manner. The stratagem could be applied when lead compounds in-hand have intractable chemistry, "flat" structure-activity relationships, or poor pharmacological properties (e.g., molecular weight, solubility, toxicity, membrane permeability, etc). For the above reasons, it is desirable to establish a structurally diverse profile of leads early to avoid elimination at a future stage in the drug development pipeline.

SUMMARY OF THE INVENTION

[0005] The invention provides a three-dimensional (3D) similarity method for scaffold hopping from known drugs or natural ligands to new chemotypes. The method simplifies whole-molecule chemical similarity searching using clustering techniques to create fuzzy molecular representations. In one embodiment, the invention is an automated method.

[0006] The overall method for generating "feature-point pharmacophore" (FEPOPS) representations from a given molecule may include the following steps. (1) Compounds (e.g., from a database) are pre-processed to generate 3D structures, assign protonation states, enumerate tautomers, and calculate partial charges and atomic logP values. (2) Multiple conformers are generated by systematic rotation of flexible bonds. (3) Ligand atoms are partitioned into a pre-determined number (e.g., four) of k-means clusters. (4) Atom-type pharmacophoric features are assigned into their k-means cluster centroids (from step 3) to create the "feature points". In several embodiments, the features may include partial charge, atomic logP, hydrogen bond donors and acceptors, formal charge, atomic number, atomic refractivity, or other calculable atomic descriptors such as those computed by MOE (Chemical Computing Group, Montreal, Quebec) or Pipeline Pilot software (SciTegic, Inc). Distances between feature points may be recorded after sorting on the basis of quadrupole directionality. (5) Representative FEPOPS conformers are selected among all calculated conformations by k-medoids clustering and stored in a lookup table. (6) The similarity of the query molecule FEPOPS to the FEPOPS of each database compound is calculated using a similarity metric for correlating variables, preferably using Pearson correlation. The rank of the highest-scoring conformer of each compound is saved. Steps 1-5 for generating FEPOPS can apply to both a query molecule and the target database to be searched. Thus, the number of k-means cluster centers (centroids) is pre-determined, with centroids being the average position of the atoms belonging to that cluster and each centroid being a "feature point".

[0007] In one embodiment of the method of the invention, atoms of a ligand molecule are partitioned into four k-means clusters based on their three-dimensional (3D) coordinates and pharmacophoric features of the cluster members are encoded into the cluster centroids. The result is a four "feature-point pharmacophore" (FEPOPS) containing information about regional atomic neighborhoods distributed throughout the molecule. FEPOPS are generated for multiple conformers obtained from a systematic but rapid conformational search to obtain a small number of FEPOPS representatives, which are then selected by k-medoids clustering to obtain a subset of conformations that cover FEPOPS space.

[0008] The representations produced by the method of the invention are useful for flexible 3D similarity searching without a priori knowledge of bioactive conformation. "Regiosimilarity" searching with FEPOPS significantly enriches for actives taken from MDL Drug Data Report (MDDR) activity classes (including, for example, COX-2, 5-HT3, and HIV-RT), while scaffold or ring-system hopping to new chemical frameworks. Further, inhibitors of target proteins (for example, D2 and RAR) are found by FEPOPS given only their associated endogenous ligands (dopamine and retinoic acid). The FEPOPS method recovers more novel scaffold classes than standard topological methods (for example, DAYLIGHT, MACCS, Pipeline Pilot fingerprints) and commercial 3D methods (for example, Pharmacophore Distance Triplets, when given a single query molecule).

[0009] The method of the invention, when used in conjunction with Tanimoto similarity or Pearson correlation analyses and using one or more known active compounds with a query dataset, is useful in providing a ranked list of compounds most similar to the known active compounds. This method is testable, since the method can identify known actives from a database, thus providing an internal control of the method.
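The ranking described in [0009], combined with the rule from step (6) that the highest-scoring conformer represents each compound, can be illustrated with a minimal Python sketch. The descriptor vectors, compound names, and the helper names `pearson` and `rank_database` are hypothetical and chosen only for illustration; any scaling of descriptors before correlation is omitted, and treating a constant vector as uncorrelated is an assumption of this sketch.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / (sd_x * sd_y)

def rank_database(query_conformers, database):
    """Score each compound by its best query/database conformer pair,
    then return (name, score) pairs sorted from most to least similar."""
    scores = []
    for name, conformers in database.items():
        best = max(pearson(q, c) for q in query_conformers for c in conformers)
        scores.append((name, best))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy descriptor vectors (invented numbers, not real FEPOPS values).
query = [[0.1, -0.3, 0.5, 2.0, 3.1, 4.2]]
database = {
    "cmpd_A": [[0.2, -0.6, 1.0, 4.0, 6.2, 8.4], [1.0, 0.0, -1.0, 2.0, 2.0, 2.0]],
    "cmpd_B": [[4.0, 3.0, 2.0, 0.5, -0.2, 0.1]],
}
ranking = rank_database(query, database)
```

Because Pearson correlation is insensitive to uniform scaling, the first conformer of the toy compound `cmpd_A` (twice the query vector) correlates perfectly with the query and places that compound at the top of the list.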

[0010] The method of the invention, when performing automated principal component analysis (PCA) or recursive partitioning analyses on a hit list of active ligands, is also useful in providing a candidate pharmacophore. Thus, the invention is useful for identifying pharmacophores, even pharmacophores having unexpected chemical scaffolds, based on the structure of a known active compound for a particular biological target.

[0011] In one embodiment, the invention is a method of identifying a structurally diverse set of biologically related "hits". The early ability to establish a diverse set of hits in the drug discovery process usefully maintains a range of compounds for lead optimization as structural classes are eliminated during later stage development.

[0012] The invention also relates to a method for identifying at least one compound as being likely to have a desired pharmacophore structure, comprising the steps of: (1) selecting a list of compounds, wherein the compounds have a desired bioactive property, are suspected of having the bioactive property, or are to be analyzed for having the bioactive property; (2) performing recursive component analyses on the list of compounds, wherein the analyses (a) select representative feature point pharmacophore (FEPOPS) conformers by k-medoids clustering for the listed compounds; and (b) enrich for the selected conformers; and (3) identifying the compounds corresponding to the enriched conformers, wherein the compound is likely to have the desired pharmacophore structure.

[0013] The method of the invention is thus useful to report pharmacophore predictions. Given a set of compounds with specific biological activity, the atomic features responsible for activity can be extrapolated from the FEPOPS descriptors. The invention thus relates to a method for predicting a pharmacophore responsible for a specific biological activity, comprising the steps of: (1) selecting a list of compounds with a known specific biological activity; (2) generating FEPOPS conformers of said compounds selected at step (1); (3) performing unsupervised learning methods on the FEPOPS conformers, and identifying atomic features correlating with said biological activity, thereby predicting the pharmacophore responsible for a specific biological activity.

Various statistical methods can be used for that purpose. Among them, one can cite PCA or recursive partitioning. Other possible approaches are self-organized maps (SOM), hierarchical or agglomerative clustering, or a modification of Naïve Bayes classification.

[0014] The method of the invention is also useful to identify biological targets for orphan compounds from annotated databases of chemical structures (target fishing). In one embodiment, an orphan compound is used to search a compound collection for similar molecules with known biological activities. Common targets among the generated FEPOPS hits are an indication that said orphan compound may interact with that target or a similar one. Appropriate annotated databases of compounds with known biological activities include, for example, MDL Drug Database Report (MDDR), Avalon (Novartis corporate collection), World Drug Index (WDI), the Harvard ChemBank (HTTP://chembank.med.harvard.edu/), and WOMBAT (http://www.sunsetmolecular.com).
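The target-fishing idea in [0014] amounts to tallying the annotated targets of the compounds returned for an orphan probe. The following Python sketch illustrates that tally under stated assumptions: the hit list, the annotation dictionary, and the function name `suggest_targets` are hypothetical, and no real database content is reproduced.

```python
from collections import Counter

def suggest_targets(hits, annotations, top_n=3):
    """Count the annotated targets among the hit compounds and
    return the most frequent ones as (target, count) pairs."""
    counts = Counter()
    for compound in hits:
        # Hits without any annotation contribute nothing to the tally.
        counts.update(annotations.get(compound, ()))
    return counts.most_common(top_n)

# Invented annotations standing in for an annotated compound database.
annotations = {
    "hit1": {"D2"},
    "hit2": {"D2", "5-HT3"},
    "hit3": {"COX-2"},
    "hit4": {"D2"},
}
hits = ["hit1", "hit2", "hit4", "hit5"]  # hypothetical top-ranked hits
top = suggest_targets(hits, annotations)
```

In this toy case the target shared by most hits is D2, which would be the first hypothesis to test for the orphan compound.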

[0015] The method of the invention can further be exploited to understand 3D structure-activity relationships. Indeed, when retaining in the FEPOPS output the 3D coordinates of conformers as well as the coordinates of the calculated feature points, the method of the invention may also provide means to visually superimpose any FEPOPS hits obtained from a search with the probe molecule in a PC-based molecular modeling environment. Appropriate PC-based molecular modeling environments include, for example, ViewerLite (Accelrys). Alignment of a probe molecule with a database molecule may be carried out by superimposition of feature points appropriately weighted according to the feature point descriptors shown to be important for biological activity. Aligned models of disparate yet bioactive scaffolds can then provide chemists with structure-activity relationships and interpretable hypotheses for regional bioisosteric replacements.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] FIG. 1 is a schematic showing the creation of FEPOPS representations. The steps include (1) compound pre-processing, (2) conformer generation, (3) k-means clustering of atom coordinates, (4) assignment of features to feature points and sorting by charges, (5) k-medoids clustering of FEPOPS conformers, and storage of representative FEPOPS conformers in a lookup table. FIG. 1 provides a comprehensive schematic of how FEPOPS are calculated.

[0017] FIG. 2 is a set of charts showing the cumulative recall of the 58 active scaffold classes (top) and 18 reduced ring systems (bottom) from the COX-2 inhibitor set found in the highest-ranked 2% of compounds from each method. The points denote the rank of the highest-scoring member of a given scaffold class. At right is the percent of the re-ranked database that would need to be tested to find 70% of the active scaffolds/ring systems.

[0018] FIG. 3 is a set of schematics showing representatives from COX-2 RRS classes found by the similarity methods in the top 1% using the probe SC-558. Only the highest-ranked compound from each RRS is shown. The RRS designation is provided along with the percentile rank in parentheses. RRS classes found uniquely by a method in the top 1% have an asterisk.

[0019] FIG. 4 is a chart showing the cumulative recall of the 44 RRS classes in the top 2% for the 5-HT3A dataset (left). The percent of the re-ranked database that would need to be tested to find 70% of the active scaffolds/ring systems (right).

[0020] FIG. 5 is a set of schematics showing representatives from 5-HT3A RRS classes found by the similarity methods in the top 0.5% using the probe shown (MDDR extreg 194584). The RRS designation and percentile rank are provided. RRS classes found uniquely by a method have an asterisk.

[0021] FIG. 6 is a chart showing the cumulative recall of the 36 RRS classes in the top 2% for the HIV-RT dataset (left). The percent of the re-ranked database that would need to be tested to find 70% of the active scaffolds/ring systems (right).

[0022] FIG. 7 is a set of schematics showing representatives from HIV-RT RRS classes found by the similarity methods in the top 1% using the probe shown (MDDR extreg 236942). The RRS designation and percentile rank are provided. RRS classes found uniquely by a method have an asterisk.

[0023] FIG. 8 is a chart showing the cumulative recall of the 48 RRS classes in the top 2% for the D2 dataset (left). The percent of the re-ranked database that would need to be tested to find 70% of the active scaffolds/ring systems (right).

[0024] FIG. 9 is a set of schematics showing representatives from D2 RRS classes found by the similarity methods in the top 1% using dopamine as a probe. The RRS designation and percentile rank are provided. RRS classes found uniquely by a method have an asterisk.

[0025] FIG. 10 is a chart showing the cumulative recall of the 15 RRS classes in the top 2% for the retinoids dataset (left). The percent of the re-ranked database that would need to be tested to find 70% of the active scaffolds/ring systems (right).

[0026] FIG. 11 is a set of schematics showing representatives from retinoid RRS classes found by the similarity methods in the top 0.5% using all-trans retinoic acid as a probe. The RRS designation and percentile rank are provided. RRS classes found uniquely by a method have an asterisk.

[0027] FIG. 12 is a set of bar graphs of actives, active reduced scaffolds, and active reduced ring systems (RRS) from HTS hit lists by FEPOPS and Pipeline Pilot Functional Class Fingerprints. The recall is shown only for the highest-ranking 1% of compounds. For cases where multiple probes are used, the similarity of a compound is measured to each of the probe molecules and the highest value is used for ranking.

[0028] FIG. 13 is TABLE 3, showing the recovery of validated high throughput screening (HTS) actives, clusters, scaffolds, and ring systems by similarity searching.

[0029] FIG. 14 is a set of charts showing the profile of the "Lipinski properties" of the COX-2 scaffold hops (from FIG. 3) returned by FEPOPS in the top 1% of hits. In comparison to the values for the probe molecule used for similarity searching, the properties are diverse, suggesting that in addition to finding novel active scaffolds, FEPOPS is capable of highly ranking actives that have diverse physical properties (logP, Molecular Weight, number of hydrogen bond donor atoms, number of hydrogen bond acceptor atoms).

[0030] FIG. 15 is a plot of the number of actives recalled by each similarity method versus the average similarity of the actives recalled to the probe molecule. The figure gives a snapshot view of the key parameters by which similarity methods are often judged: quantity and quality of hits recalled (X and Y axis, respectively). On average, for the 5 MDDR test cases, FEPOPS returns the most actives as well as the most diverse actives. The average similarity of the probes to the top 1% of hits returned by FEPOPS is similar to the average similarity of the probes to all actives in the dataset, indicating that FEPOPS samples the full available diversity in the set.

DETAILED DESCRIPTION OF THE INVENTION

[0031] Introduction. The pharmacophore concept is well known in the art of drug discovery and drug-lead optimization. See, U.S. Pat. No. 6,343,257, incorporated herein by reference. A "pharmacophore" is defined as a distinct three dimensional (3D) arrangement of chemical groups essential for biological activity. Since a pharmaceutically active molecule must interact with one or more molecular structures within the body of the subject in order to be effective, and the desired functional properties of the molecule are derived from these interactions, each active compound must contain a distinct arrangement of chemical groups which enable this interaction to occur. The ability to design, or identify from large databases, pharmaceutically useful molecules according to the pharmacophore concept would be highly effective both in the process of drug discovery and in the process of drug lead optimization. The pharmacophore can be constructed either directly or indirectly. In the direct method, pharmacophore descriptor centers are inferred from studying the X-ray or NMR structure of a receptor-ligand complex, or by a shape-complementarity function analysis of the receptor binding site. In the indirect method, the structure of the receptor is unknown and therefore the pharmacophore descriptor centers are inferred by overlaying the 3-D conformations of active compounds and finding the common, overlapping functional groups. The virtually screened databases may be commercially or otherwise publicly available or corporate databases of compounds.

[0032] A well-defined criterion for scaffolds of chemical structures is useful for evaluating the pharmacophoric diversity of a compound set. Several classifications of chemical structures have been reported. Paris CG, Chemical Structure Handling by Computer. In Annual Review of Information Science and Technology (Williams, M. E. Ed.; ASIS&T: Silver Spring, MD, 1997); Gasteiger J & Engel T., Chemoinformatics (Wiley-VCH GmbH & Co. KGaA: Weinheim, 2003). A common assumption rooted in graph theory is that compound structures may be reduced in a hierarchical manner as a connection of ring systems and linkers that form frameworks or scaffolds, which may contain sidechains or functional groups. Bemis GW & Murcko MA, Journal of Medicinal Chemistry 39: 2887-2893 (1999). A topological scaffold may thus be extracted by simple pruning of side chains (Xu J, Journal of Medicinal Chemistry 45: 5311-5320 (2002)), and optionally discarding information concerning heteroatoms and bond orders to uncover the graph framework. Ring systems are a further deconstruction with utility in database searching (Lipkus AH, Journal of Chemical Information & Computer Sciences 39: 582-586 (1999); Nilakantan R, Journal of Chemical Information & Computer Sciences 30: 65-68 (1990)) and for estimating occupied chemical and drug space (Feher M & Schmidt JM, Journal of Chemical Information & Computer Sciences 43: 218-227 (2003); Lee ML & Schneider G, Journal of Combinatorial Chemistry 3: 284-289 (2001); Lewell XQ et al, Journal of Medicinal Chemistry 46: 3257-3274 (2003); Lipkus AH, Journal of Chemical Information & Computer Sciences 41: 430-438 (2001)). While many such objective classifications for scaffolds now exist, the scaffold-hopping ability of 3D-similarity or pharmacophore-based search methods is oftentimes left up to the subjective chemical intuition of the method authors.
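The "simple pruning of side chains" mentioned in [0032] can be pictured as repeatedly deleting terminal atoms from the heavy-atom graph until only ring systems and the linkers between them remain. The Python sketch below illustrates that graph operation only; the adjacency-dictionary input format and the function name `prune_side_chains` are assumptions of this example, and finer conventions of the cited classifications (for instance, how exocyclic atoms are treated) are not modeled.

```python
def prune_side_chains(adjacency):
    """Iteratively strip terminal atoms (degree <= 1) from a molecular graph.

    adjacency maps each heavy-atom id to the set of its bonded neighbors.
    Returns the ids of the atoms that remain (ring systems plus linkers).
    """
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    terminal = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in graph:
            continue  # already removed via another path
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                terminal.append(nbr)  # neighbor has become a chain end
    return set(graph)

# Toy graph: two three-membered rings (0-1-2 and 4-5-6) joined by linker
# atom 3, with a two-atom side chain (7-8) hanging off ring atom 5.
molecule = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4},
    4: {3, 5, 6}, 5: {4, 6, 7}, 6: {4, 5},
    7: {5, 8}, 8: {7},
}
scaffold = prune_side_chains(molecule)
```

The side chain (atoms 7 and 8) is removed, while the linker atom 3 survives because it connects two ring systems. An acyclic molecule prunes to the empty set.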

[0033] Currently, screening data is frequently mined using two-dimensional (2D) methods to eliminate false positives, rescue false negatives and cluster the compounds in a "hit" list. Additional information could be provided by three-dimensional (3D) methods to identify actives with diverse chemical scaffolds and to extract pharmacophores in an automated way. Methods using 3D information can advantageously be orthogonal to 2D and augment the analysis, incorporating multiple ligand conformations (i.e. flexibility) to gain insight into the 3D event of a drug binding to a target, and "scaffold hopping" to identify leads whose structural templates are distinct from known actives. Previous to this invention, however, 3D methods have been computationally intensive, providing too many possible conformers for ease of analysis. Also conformational alignment of 3D structures has previously been difficult, especially for large datasets.

[0034] All similarity methods are based on the assumption that globally similar molecules may have similar activity (Maggiora GM & Johnson MA, Concepts and applications of molecular similarity (John Wiley & Sons, New York, 1990) pp 99-117), although the degree of similarity required for similar activity is a matter of dispute (Martin YC et al., Journal of Medicinal Chemistry 45: 4350-4358 (2002)). In theory, the increase in structural information from 2D to 3D should provide a more accurate basis for finding new compounds with similar bioactivity. However, 3D-similarity approaches are faced with a number of challenges not faced by topological (2D) methods, like the generation of flexible molecular conformations and alignment, along with relatively longer computing times. Numerous 3D-similarity methods have been reported (Willett P et al, Journal of Chemical Information & Computer Sciences 38: 983-996 (1998)), some asserting primacy over 2D methods (Andrews KM & Cramer RD, Journal of Medicinal Chemistry 43: 1723-1740 (2000); Makara GM, Journal of Medicinal Chemistry 44: 3563-3571 (2001)). However, direct comparisons made in the influential study by Brown and Martin (Journal of Chemical Information & Computer Sciences 36: 572-584 (1996)) and in more recent reports (Matter H & Potter T, Journal of Chemical Information & Computer Sciences 39: 1211-1225 (1999); Sheridan RP & Kearsley SK, Drug Discovery Today 7: 903-911 (2002)) maintain that current 3D methods offer no advantage over topological searches in recalling actives or in sampling structural diversity.

[0035] Published methods for 3D searching commonly take the form of queries based on pharmacophores of two, three or four features (hydrophobic group, hydrogen bond donor or acceptor, etc.) separated from other features by binned distance ranges. For example, a large number of "potential pharmacophores" can be generated automatically for multiple ligand conformers and stored in fingerprints or keys; the overlap between pharmacophore sets for molecules being compared is then calculated. Abrahamian E et al, Journal of Chemical Information & Computer Sciences 43: 458-468 (2003); Brown RD & Martin YC, Journal of Chemical Information & Computer Sciences 36: 572-584 (1996); Mason JS et al, Journal of Medicinal Chemistry 42: 3251-3264 (1999). Alternatively, pharmacophore alignments can be carried out at run time. Lemmen C et al, Perspectives in Drug Discovery & Design 20: 43-62 (2000). Other work has incorporated whole-molecule information into pharmacophores on the basis of geometric distances between defined features and all ligand atoms. Makara GM, Journal of Medicinal Chemistry 44: 3563-3571 (2001). Similarity has been measured between 3D topomeric fragments (Andrews KM & Cramer RD, Journal of Medicinal Chemistry 43: 1723-1740 (2000); Cramer RD et al, Journal of Molecular Graphics and Modelling 20: 447-462 (2002)) and between maximal common substructures. Raymond JW & Willett P, Journal of Chemical Information & Computer Sciences 43: 908-916 (2003); Rhodes N et al, Journal of Chemical Information & Computer Sciences 43: 443-448 (2003). In contrast to the use of pharmacophore points, similarity based on physicochemical property descriptors (Kearsley SK et al., Journal of Chemical Information & Computer Sciences 36: 118-127 (1996)) and molecular fields (Ginn CMR et al, Journal of Chemical Information & Computer Sciences 37: 23-37 (1997); Mestres J et al., Journal of Computational Chemistry 18: 934-954 (1997); Thorner DA et al, Journal of Chemical Information & Computer Sciences 36: 900-908 (1996)) has been explored. Finally, pharmacophoric features can be used in the context of 2D similarity searching (Schuffenhauer A et al, Journal of Chemical Information & Computer Sciences 43: 391-405 (2003)) or for classification (Chen X et al, Journal of Chemical Information & Computer Sciences 39: 887-896 (1999); Pirard B & Pickett SD, Journal of Chemical Information & Computer Sciences 40: 1431-1440 (2000)) and diversity analysis (Mount J et al, Journal of Medicinal Chemistry 42: 60-66 (1999)).

[0036] The problems described above can be solved by a rapid automated method that, when given a single known drug or natural ligand, can retrieve actives from massive databases which are more diverse than those recovered by commonly-used 2D methods. In practice, the method of the invention may be used as an orthogonal approach to rescue false negatives missed due to low 2D fingerprint similarity, a concept promoted in data fusion strategies. Ginn CMR et al, Perspectives in Drug Discovery & Design 20: 1-16 (2000). In addition to its use as an in silico screening tool, the method could be used to provide pharmacophore-type information about the highly-ranked molecules to facilitate the transition from hit discovery to lead optimization. The FEPOPS (feature point pharmacophores) method of the invention evaluates the regional correlation of additive physicochemical and pharmacophoric properties, benefiting from the advantages gained from both field-based similarity and pharmacophore-based queries.

[0037] The FEPOPS method of the invention incorporates clustering techniques from the field of data mining, which enables scaling-down of structural information by k-means clustering of atomic coordinates into feature points, followed by the selection of representative conformers by k-medoids clustering. The method is fuzzy in three important ways: (i) the decomposition of atoms into feature point representations; (ii) use of physicochemical descriptors, which are less specific than topological descriptors (Kearsley SK et al, Journal of Chemical Information & Computer Sciences 36: 118-127 (1996)); and (iii) purging of the conformational space covered by a ligand into a small number of representative data points. The fuzzy nature of the FEPOPS method enables it to retrieve strikingly diverse compounds with similar biological activity to reference queries in rigorous test examples (see below) with compounds from the MDDR (MDL Drug Database Report) and from real-life high throughput screening (HTS) datasets. Further, a specific, objective criterion is established for measuring scaffold hopping as well as "ring-system hopping" using molecular equivalence indices. Xu YJ & Johnson M, Journal of Chemical Information & Computer Sciences 41: 181-185 (2001) and 42: 912-926 (2002).
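The selection of representative conformers by k-medoids clustering, mentioned in [0037], differs from k-means in that each cluster is represented by one of its actual members. A minimal Python sketch of an alternating k-medoids scheme follows. It is an illustration only: the naive initialization with the first k points, the Euclidean metric, the function name `k_medoids`, and the toy data are all assumptions of this example rather than details taken from the specification.

```python
from math import dist

def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids; returns the sorted indices of the medoids."""
    medoids = list(range(k))  # assumption: naive deterministic start
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep an emptied medoid unchanged
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(dist(points[c], points[j]) for j in members),
            ))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

# Toy 2D "descriptor vectors" for six conformers forming two groups.
conformers = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
representatives = k_medoids(conformers, k=2)
```

For the toy data, one conformer from each group is returned, so the two retained representatives span both regions of the descriptor space.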

[0038] Overview of FEPOPS calculation. The overall strategy for generating FEPOPS representations from a given molecule may include the following steps: (1) Compounds from a database are pre-processed to generate 3D structures, assign protonation states, enumerate tautomers, and calculate partial charges and atomic logP values. (2) Multiple conformers are generated by systematic rotation of flexible bonds. (3) Ligand atoms are partitioned into four k-means clusters. (4) Atom-type pharmacophoric features (e.g., partial charge, logP, hydrogen bond donors and acceptors, formal charge, atomic number, atomic radius, atomic refractivity) are assigned into their k-means cluster centroids (from step 3) to create the "feature points". Distances between feature points are recorded, preferably after sorting on the basis of quadrupole directionality. (5) Representative FEPOPS conformers are selected among all calculated conformations by k-medoids clustering and stored in a lookup table.

[0039] Compound preprocessing. FEPOPS can be calculated using an automated data-pipelining protocol. The core FEPOPS programs can be implemented in a series of custom scripts and freely-available applications. The scripts are launched from an automated data-pipelining protocol in Pipeline Pilot (SciTegic, San Diego, Calif.) by using batched SOAP (Simple Object Access Protocol) technology. The entire protocol processes FEPOPS representations from input 2D structures at ~1.0 compound per second, i.e., ~600K compounds per week. FEPOPS are calculated once and stored in a lookup table.
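Step (4) of the overview in [0038] can be sketched in Python as follows. Given atoms that have already been assigned to clusters, the sketch computes each cluster centroid, accumulates per-atom properties into it, orders the resulting feature points, and records the six pairwise distances. Several choices here are assumptions made for illustration: properties are summed rather than averaged, ordering is by summed partial charge as a simple stand-in for the sorting step, only two properties are carried, every cluster is assumed to be non-empty, and the function name `feature_points` and all numbers are hypothetical.

```python
from itertools import combinations
from math import dist

def feature_points(coords, labels, charges, logps, k=4):
    """Build k feature points from clustered atoms.

    Returns (points, distances): points is a list of dicts sorted by summed
    charge; distances holds the pairwise centroid distances in that order.
    """
    points = []
    for cluster in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == cluster]
        # Centroid: mean position of the atoms assigned to this cluster.
        centroid = tuple(sum(coords[i][a] for i in idx) / len(idx) for a in range(3))
        points.append({
            "centroid": centroid,
            "charge": sum(charges[i] for i in idx),  # assumption: additive
            "logp": sum(logps[i] for i in idx),      # assumption: additive
        })
    # Assumption: a fixed ordering by charge makes descriptors comparable.
    points.sort(key=lambda p: p["charge"])
    distances = [dist(p["centroid"], q["centroid"]) for p, q in combinations(points, 2)]
    return points, distances

# Toy molecule with five atoms; the first two share cluster 0.
coords = [(0, 0, 0), (2, 0, 0), (4, 0, 0), (0, 3, 0), (0, 0, 5)]
labels = [0, 0, 1, 2, 3]
charges = [0.2, 0.2, -0.5, 0.0, -0.1]
logps = [0.5, 0.5, -1.0, 0.2, 0.1]
points, distances = feature_points(coords, labels, charges, logps)
```

Four feature points yield six inter-point distances, so a descriptor built this way has a fixed length regardless of the number of atoms in the molecule.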

[0040] Initially, a compound library file of type sdf, mol2, or SMILES strings is read into the protocol. Custom filters are applied to remove duplicate compounds, compounds with <4 atoms, >9 rings, or >40 rotatable bonds, and salts or counter-ions associated with compound structures (FIG. 1). Three-dimensional coordinates are generated in Pipeline Pilot, followed by addition of hydrogens and a brief minimization using the Clean force field. The protonation states of ionizable groups are set at pH 7.4 based on either look-up tables of pKa values or partial-least squares models. For each compound, all tautomers are enumerated. Finally, Gasteiger-Marsili partial charges (Gasteiger J & Marsili M, Tetrahedron 36: 3219-3288 (1980)) are computed prior to conversion to mol2 format.
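The numeric filters listed in [0040] can be expressed compactly. The Python sketch below applies the stated thresholds (fewer than 4 atoms, more than 9 rings, more than 40 rotatable bonds) and drops duplicates. The `Compound` record, its `identifier` field used as a duplicate key, and the function names are hypothetical; salt and counter-ion stripping, which requires a cheminformatics toolkit, is not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Compound:
    identifier: str   # hypothetical canonical key used to spot duplicates
    n_atoms: int
    n_rings: int
    n_rotatable: int

def passes_filters(c):
    """Thresholds as stated in the text: reject <4 atoms, >9 rings,
    or >40 rotatable bonds."""
    return c.n_atoms >= 4 and c.n_rings <= 9 and c.n_rotatable <= 40

def filter_library(compounds):
    """Keep the first occurrence of each compound that passes the filters."""
    seen = set()
    kept = []
    for c in compounds:
        if c.identifier in seen or not passes_filters(c):
            continue
        seen.add(c.identifier)
        kept.append(c)
    return kept

# Invented library entries exercising each rule.
library = [
    Compound("A", 20, 2, 5),
    Compound("A", 20, 2, 5),    # duplicate of the first entry
    Compound("B", 3, 0, 0),     # too few atoms
    Compound("C", 60, 10, 12),  # too many rings
    Compound("D", 80, 4, 41),   # too many rotatable bonds
    Compound("E", 4, 9, 40),    # exactly at the limits, so it is kept
]
kept = filter_library(library)
```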

[0041] Atomic logP calculation. logP correlates with hydrophobic binding of receptors, and can be calculated atom-wise for a molecule. A modified version of XlogP, an atom-additive program that predicts octanol/water partition coefficients (Wang R et al, Journal of Chemical Information & Computer Sciences 37: 615-621 (1997)), can be used to calculate atomic logP values for each compound.
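The atom-additive idea in [0041] is that each atom receives a contribution according to its type, so that a per-atom value is available for later regional summation and the molecular value is the sum over atoms. The Python sketch below shows only that bookkeeping. The atom-type labels and contribution values are invented placeholders, not XlogP parameters, and any correction terms a real program may apply are ignored.

```python
# Hypothetical per-atom-type contributions; NOT the XlogP parameter set.
CONTRIBUTION = {
    "C.ar": 0.30,
    "C.3": 0.50,
    "O.3": -0.40,
    "N.3": -0.50,
}

def atomic_logp(atom_types):
    """Return one logP contribution per atom, in input order."""
    return [CONTRIBUTION[t] for t in atom_types]

def molecular_logp(atom_types):
    """Atom-additive estimate: the sum of the per-atom contributions."""
    return sum(atomic_logp(atom_types))

# Toy three-atom fragment typed as two sp3 carbons and one sp3 oxygen.
fragment = ["C.3", "C.3", "O.3"]
per_atom = atomic_logp(fragment)
total = molecular_logp(fragment)
```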

[0042] Conformer generation. Conformational searching and atom clustering in FEPOPS is implemented in a C program. See, Glick M et al, Journal of Medicinal Chemistry 45: 4639-4646 (2002); Glick M et al, Journal of the American Chemical Society 124: 2337-2344 (2002). Flexibility is simulated by systematic rotation of bonds at fixed angle increments, followed by eviction of conformations with van der Waals clashes. For compounds with a "drug-like" number of rotatable bonds (<6), torsional increments between 10° and 120° cover feature space similarly. Glick M et al, Journal of Medicinal Chemistry 45: 4639-4646 (2002). In other words, using smaller angle intervals, which increases calculation time linearly, does not lead to a significant increase in conformational information.

[0043] For compounds with >5 rotatable bonds, increments of 90° gave the optimal trade-off between speed and accuracy; thus 90° intervals are used for the conformational search. Similar angle intervals have been used by other 3D methods. Mason JS & Cheney DL, Pac Symp Biocomput. 4: 456-467 (1999); Pickett SD et al, Journal of Chemical Information & Computer Sciences 38: 144-150 (1998). Indeed, the fuzzy representation of FEPOPS was particularly suited to cover conformational space by sampling at larger intervals: since atoms are ultimately partitioned in 3D space based on their coordinates, the atoms of one conformer must be reasonably distant from the atoms of another conformer to yield a unique k-means clustering result. The conformers for compounds with five rotatable bonds or fewer were generated in an exhaustive manner in 90° intervals (maximal number of 4⁵ = 1024 conformers). For more than five rotatable bonds, 1,025 conformers were sampled at random.

[0044] The objective of this conformational sampling was not to find low energy conformations but to provide a reasonable coverage of the conformational space. Low energy solutions are not necessarily representative of protein-bound ligand geometries. Chen X et al, Journal of Chemical Information & Computer Sciences 39: 887-896 (1999). The similarity method of the invention determines biologically relevant conformations by identifying the conformer with the highest correlation to the "probe" or reference molecules in feature point space.
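The sampling scheme described in [0043] (exhaustive enumeration at 90° increments for five or fewer rotatable bonds, random sampling of 1,025 settings otherwise) can be sketched in Python as below. The sketch only generates tuples of torsion angles. Building 3D coordinates from them and discarding conformations with van der Waals clashes are not modeled, the random samples may contain repeats, and the function name `torsion_settings` and its parameters are hypothetical.

```python
import random
from itertools import product

def torsion_settings(n_rotatable, increment=90, max_random=1025, seed=0):
    """Return a list of torsion-angle tuples, one angle per rotatable bond.

    Five or fewer rotatable bonds: every combination of angles.
    More than five: max_random settings drawn at random.
    """
    angles = range(0, 360, increment)  # 0, 90, 180, 270 for the default
    if n_rotatable <= 5:
        return list(product(angles, repeat=n_rotatable))
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        tuple(rng.choice(angles) for _ in range(n_rotatable))
        for _ in range(max_random)
    ]

exhaustive = torsion_settings(5)  # 4**5 combinations
sampled = torsion_settings(8)     # random subset of the 4**8 possibilities
```

With four angles per bond the exhaustive case grows as 4 to the power of the number of rotatable bonds, which is why a cap on the number of sampled settings is needed for more flexible molecules.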

[0045] k-means clustering of atoms. The k-means algorithm is an iterative descent clustering method. For a discussion of iterative descent clustering, see Hastie T et al, The elements of statistical learning (Springer: New York, 2001). In FEPOPS, the atomic 3D coordinates of each ligand are partitioned into clusters, preferably 4 clusters, and the geometric centers of the clusters (centroids) are retained to represent the compound. This approach has previously been used for identifying ligand binding sites on proteins (Glick M et al, Journal of the American Chemical Society 124: 2337-2344 (2002)) and for ligand docking (Glick M et al, Journal of Medicinal Chemistry 45: 4639-4646 (2002)). See also, PCT patent application WO 03/062468, which is incorporated herein by reference. In one embodiment, four clusters are used, because this number allows a reasonable description of molecules of drug-like size while retaining information about chirality, which is lost in triplet-type representations.

[0046] In one embodiment, the algorithm that generates the initial feature points at various scales using k-means clustering of atoms is that provided in Glick M et al, J. Am. Chem. Soc. 124: 2337-2344 (2002) or in PCT patent application WO 03/062468. However, the use and purpose of these published algorithms differ from the use and purpose of the FEPOPS method of the present invention.

[0047] In the method provided in WO 03/062468, a compound of interest is modeled at various scales by a k-means-clustering algorithm. Hartigan J, Clustering Algorithms (John Wiley & Sons, Inc., New York, 1975) pp. 84-112 and Rollet R et al, Int. J. Remote Sens. 19: 3003-3009 (1998). Using a series of k-means generated clusters, a series of ligand models is obtained, each containing one more feature point than the previous model. In one embodiment, a total of four feature points is calculated.

[0048] These feature points are preferably well distributed, to ensure that each model yields the best possible description of the ligand for the number of points generated. In the implementation of this procedure, n atoms x_1, x_2, ..., x_n fall into k clusters, k<n. Let m_i be the mean position of the atomic coordinates in cluster i. If the clusters are well separated, a minimum-distance classifier is used to separate them. Thus, atom x is in cluster i if the distance between x and m_i is the minimum of all the k distances. Since there is no definite way to initialize the mean values, it is common to use random initial values for the means m_1, m_2, ..., m_k. At the next stage, until there are no changes in any mean, the estimated means are used to classify the atoms into clusters. Thus, for all clusters, m_i is replaced with the mean position of all of the atoms for cluster i.

[0049] The initial cluster is at the mean position of the ligand. All of the atoms in the ligand belong to this initial cluster. To initialize the means for the second cluster, the atom that is furthest away from the initial cluster center is searched for. This atom then becomes the temporary center of the new cluster. All atoms that are closer to this cluster center than to their currently assigned cluster center change identities and are marked as belonging to the new cluster. The positions of the cluster centers are then iterated upon to self-consistency so that each cluster center is positioned at the average position of the atoms that belong to the cluster. This process may be repeated as many times as there are atoms in the ligand, with each iteration generating the next model. In one embodiment, a total of four feature points is calculated.
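By way of illustration only, the stepwise generation of ligand models described above may be sketched as follows. This is a minimal sketch and not the published implementation: the function names are hypothetical, and the seeding rule for the third and later clusters (the atom furthest from its currently assigned center) is an assumption, since the description states the rule explicitly only for the second cluster.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def assign(atoms, centers):
    """Minimum-distance classifier: index of the nearest center for each atom."""
    return [min(range(len(centers)), key=lambda c: math.dist(a, centers[c]))
            for a in atoms]


def refine(atoms, centers, max_iter=100):
    """Alternate assignment and mean update until the assignments stop changing."""
    labels = assign(atoms, centers)
    for _ in range(max_iter):
        new_centers = []
        for c in range(len(centers)):
            members = [a for a, lab in zip(atoms, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(mean(members) if members else centers[c])
        new_labels = assign(atoms, new_centers)
        centers = new_centers
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels


def feature_point_models(atoms, k=4):
    """Return models with 1, 2, ..., k feature points (hypothetical helper)."""
    centers = [mean(atoms)]
    labels = [0] * len(atoms)
    models = [list(centers)]
    while len(centers) < k:
        # Assumed seeding rule: the atom furthest from its assigned center.
        far = max(range(len(atoms)),
                  key=lambda i: math.dist(atoms[i], centers[labels[i]]))
        centers, labels = refine(atoms, centers + [atoms[far]])
        models.append(list(centers))
    return models
```

For example, calling `feature_point_models(atoms, k=4)` on a list of atomic coordinates returns four models, the last of which holds the four centroids used as feature points.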

[0050] As used in the present invention, the algorithm initially guesses the centroid positions. Then, for each atom, the closest centroid is identified, followed by replacement of each centroid with the coordinate-wise average of all atoms closest to it. The algorithm minimizes the sum of squared Euclidean distances from centroids to atom cluster members until convergence is achieved. A centroid in chemical 3D space is the mean of the x, y, and z coordinates of its atom cluster members. In FEPOPS, each of the centroids is assigned any or all of the following pharmacophoric features: (i) the distance to a neighboring centroid (see below) and (ii) atom-type pharmacophoric features. Atom-type pharmacophoric features are preferably selected from the group consisting of the sum of the partial charges of its cluster-member atoms, the sum of atomic logP values of its cluster-member atoms, the sum of atomic numbers of its cluster-member atoms, the sum of atomic radii of its cluster-member atoms, the sum of atomic refractivities of its cluster-member atoms and binary flags that indicate occurrence of hydrogen bond donors, hydrogen bond acceptors, formal positive charges, and formal negative charges. Any other appropriate pharmacophoric features may be used in the method of the invention. For example, one or more atom-based descriptors as described and calculated in MOE (Chemical Computing Group, Montreal) or Pipeline Pilot software (SciTegic, Inc.) can be used in the method of the invention. Additionally, the distances between juxtaposed feature points 1 and 3 and between feature points 2 and 4 are recorded to determine the chirality of the quartet.

[0051] The distribution of electrons in molecules is one of the principal factors in determining their biological, chemical, and physical properties. Gasteiger J & Engel T., Chemoinformatics (Wiley-VCH GmbH & Co. KGaA, Weinheim, 2003). Prior to recording inter-point distances, the feature points are preferably sorted on the basis of partial charges to enforce a quadrupole directionality. For example, with 4 centroids, the most negatively-charged centroid becomes feature point 1, whereas the most positively-charged centroid is assigned to feature point 4. Thus, the descriptor distance may be recorded as follows: the descriptor "Distance 1" ("d1") encoded in feature point 1 will contain the distance between itself, the most negative centroid, and feature point 4, the most positive centroid (FIG. 1). "Distance 2" ("d2") encoded in feature point 2 will contain the distance between feature points 1 and 2, and so on. All FEPOPS configurations are thus "pre-aligned" by charge distribution, rather than using alignment of shape or geometry during the similarity calculations. Thus, the probe molecule descriptors can easily be compared at search time to the descriptors of target compounds (e.g., d1 of the probe is compared to d1 of the target compound, not d2).
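By way of illustration only, the charge-based ordering of feature points and the recording of inter-point distances may be sketched as follows. This is a minimal sketch under stated assumptions: the function name `fepops_distances` and the dictionary keys are hypothetical, and the pairings used for "d3" and "d4" are an assumed reading of "and so on", since only Distance 1 and Distance 2 are defined explicitly above.

```python
import math


def fepops_distances(centroids, charges):
    """Order four centroids by summed partial charge and record distances.

    centroids: four (x, y, z) tuples; charges: summed partial charge per centroid.
    Returns the sort order (original indices, most negative first) and a dict
    of inter-point distances.
    """
    order = sorted(range(4), key=lambda i: charges[i])  # most negative first
    p = [centroids[i] for i in order]                   # feature points 1..4
    distances = {
        "d1": math.dist(p[0], p[3]),   # feature point 1 to feature point 4
        "d2": math.dist(p[0], p[1]),   # feature point 1 to feature point 2
        "d3": math.dist(p[1], p[2]),   # assumed pairing: points 2 and 3
        "d4": math.dist(p[2], p[3]),   # assumed pairing: points 3 and 4
        "d13": math.dist(p[0], p[2]),  # juxtaposed points 1 and 3
        "d24": math.dist(p[1], p[3]),  # juxtaposed points 2 and 4
    }
    return order, distances
```

Because every compound is ordered by the same rule, the descriptor stored under a given key for the probe can be compared directly with the descriptor under the same key for a database compound.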

[0052] In another embodiment, the method can use other "field-based" conformational alignment methods that have been reported previously. Mestres J et al, Journal of Computational Chemistry 18: 934-954 (1997); Apaya RP et al, Journal of Computer-Aided Molecular Design 9: 33-43 (1995); Lemmen C & Lengauer T, Journal of Computer-Aided Molecular Design 14: 215-232 (2000).

[0053] Selection of representative conformers. One common strategy for scaling the vast array of potential molecular configurations down to a manageable size is to cluster and subsequently store a smaller number of explicit conformers. Feher M & Schmidt JM, Journal of Chemical Information & Computer Sciences 43: 810-818 (2003). The creation of fuzzy FEPOPS representations in particular results in a number of conformations that are alike or similar enough to be redundant in terms of describing a given molecule. Glick M et al, Journal of Medicinal Chemistry 45: 4639-4646 (2002). k-medoids is a non-hierarchical crisp clustering method like k-means. In contrast to k-means, which minimizes the sum of squared Euclidean distances from objects to cluster centers, k-medoids minimizes the sum of unsquared dissimilarities of objects to their closest representative object (the medoid). Kaufman L & Rousseeuw PJ, Finding groups in data - An introduction to cluster analysis (John Wiley & Sons, Inc., New York, 1990). The medoid is an actual object or data point that is representative of the structure of the dataset. In the method of the invention, medoids are single representative FEPOPS of conformers of the same compound. Selection of medoids is less influenced by outliers that can skew the selection of centrotypes in a method like k-means; the medoid conformers are actual conformers, not averages. Here, the actual molecular conformers were not clustered, but rather the FEPOPS representations of all calculated conformers for a molecule were clustered (these are the "FEPOPS conformers"). k-medoid clustering can be carried out in the R language and environment (the GNU version of S-PLUS) using the programs PAM (Partitioning Around Medoids) and CLARA (Clustering of LARge Applications), which have been described in detail by Kaufman L & Rousseeuw PJ, Finding groups in data - An introduction to cluster analysis (John Wiley & Sons, Inc., New York, 1990). 
The data format is an n x p matrix, consisting of n molecular conformers by p features (or descriptors) for each compound. The features are first scaled by subtracting the mean and dividing by standard deviation, and k representative conformers are selected using the function PAM for compounds with <50 conformers or CLARA for compounds with >50 conformers.
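By way of illustration only, the scaling of the conformer-by-feature matrix and the selection of representative conformers may be sketched as follows. This is a minimal sketch and not the PAM or CLARA programs named above: it uses a simple alternating k-medoids scheme with naive seeding from the first k rows, and the function names `zscale` and `k_medoids` are hypothetical.

```python
import math
import statistics


def zscale(rows):
    """Scale each column to zero mean and unit (sample) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    # A constant column (standard deviation 0) is mapped to 0.0.
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(r, means, sds)]
            for r in rows]


def k_medoids(rows, k, max_iter=100):
    """Return sorted row indices of k medoids (simple alternating scheme)."""
    n = len(rows)
    dist = [[math.dist(a, b) for b in rows] for a in rows]
    medoids = list(range(k))  # naive seeding: the first k rows
    for _ in range(max_iter):
        # Assign every row to its nearest current medoid.
        labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
        new = []
        for m in medoids:
            members = [i for i in range(n) if labels[i] == m] or [m]
            # New medoid: member with the smallest summed (unsquared) distance.
            new.append(min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if new == medoids:
            break
        medoids = new
    return sorted(medoids)
```

The returned indices identify actual rows of the matrix, so each representative is a real FEPOPS conformer rather than an average. Duplicate rows are assumed to have been removed beforehand.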

[0054] In CLARA, the partitioning is performed on a subset of the data in multiple iterations to speed up calculation. We sampled the lesser of 100 conformers or 50% of the total number of conformers. The number of k clusters designated was seven, based on a thorough analysis of speed and coverage of conformational space. For compounds taken from X-ray structures of protein-ligand complexes, a k value of seven typically yields a set of medoids (conformations) containing at least one conformer that is like the bioactive conformation. Additionally, higher k values resulted in redundant conformers for molecules of drug-like sizes. If a molecule contained fewer than seven conformations, clustering was bypassed and all non-duplicates were saved. Since calculations at multiple k values would be highly impractical for converting large databases, a statistical metric (Feher M & Schmidt JM, Journal of Chemical Information & Computer Sciences 43: 810-818 (2003)) was not used to evaluate the goodness of cluster separations during FEPOPS conversions.

[0055] FEPOPS similarity calculation. The similarity between FEPOPS representations was determined by Pearson correlation. The similarity between FEPOPS representations may however be determined according to any similarity metric for correlating variables. Other calculation methods which can be used with the FEPOPS method include, without limitation, Euclidean distance, Manhattan distance, Hamming distance, the cosine coefficient, and the Tanimoto coefficient. Pearson was suitable for this purpose due to the nature of the continuous, non-binary FEPOPS descriptors stored in a matrix format. The feature descriptors are first scaled by mean centering (offsetting the values so that their sum is zero), and then divided by a factor so that the variance of the scaled data is equal to one. For all FEPOPS conformers of compounds in the test sets, the Pearson correlations to the FEPOPS conformers of the probe were calculated. 
In one embodiment, the similarity is calculated by matching each feature point of a probe conformer to the corresponding feature point of the compound conformer sorted on the basis of quadrupole (four charges) directionality, without further matching with other feature points. In another embodiment, said similarity is calculated by matching each feature point of a probe conformer to each feature point of a compound conformer.
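By way of illustration only, the Pearson correlation between two FEPOPS descriptor vectors, and the retention of the best-correlated pair of conformers as the score for a compound, may be sketched as follows. This is a minimal sketch: the function names are hypothetical, the descriptor vectors are assumed to have been scaled across the dataset beforehand, and feature points are assumed to be matched position by position as in the first embodiment above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_score(probe_conformers, compound_conformers):
    """Highest correlation between any probe conformer and any compound conformer."""
    return max(pearson(p, c)
               for p in probe_conformers
               for c in compound_conformers)
```

Sorting database compounds by `best_score` in descending order gives a ranked list of the kind used for the enrichment analyses.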

[0056] The compound conformer with the highest correlation to any probe conformer was then retained for the similarity score (i.e., the "single nearest-neighbor" method). The target database was thus re-ranked in its entirety by correlation to the probe. No attempt was made to computationally optimize the search procedure due to the short calculation times.

[0057] Pearson correlation is particularly useful for quantifying relative distances between large sets of compounds to a probe. The Pearson coefficient (ranging from -1 to 1) reflects the degree of linear relationship or strength of association between variables; in the present example, the strength of association between test compound features and the reference molecule features with respect to the entire dataset is measured in order to produce a ranked list. Pearson's correlation also serves as a useful benchmark for testing FEPOPS, which incidentally preserves the descriptors in a useful form for visual analysis.

[0058] 2D Similarity Searches. To assess the scaffold-hopping ability of FEPOPS, its performance was compared with three 2D-descriptor methods used routinely for similarity searches: the 166 publicly-available MDL Keys, also known as MACCS (MDL Information Systems, Inc., San Leandro, Calif.); DAYLIGHT fingerprints (Daylight Chemical Information Systems, Inc., Mission Viejo, Calif.); and Pipeline Pilot Functional Class Fingerprints with a neighborhood size of four (FCFP_4) (see, www.scitegic.com). The Pipeline Pilot fingerprints are based on an extension of the Morgan Algorithm. Morgan HL, Journal of Chemical Documentation 5: 107-113 (1965). The Tanimoto coefficient was used to calculate similarity for all 2D methods. The searches as implemented required approximately 2 min calculation time for a 29,197 compound database (see, Chemical datasets below).

[0059] Pharmacophore Distance Triplets. 
In addition to the 2D search methods, we also tested the commercially-available 3D method, Pharmacophore Distance Triplets (PDT) from Sybyl 6.9 (Tripos, Inc., St. Louis, Missouri). The method called "pharmacophore distance triplets" or "PDT" operates in three-dimensional space (like FEPOPS) rather than two-dimensional space. A PDT contains three pharmacophoric macro atoms and the three distances separating them, where the distances are binned. Each possible triplet is represented by a single bit in a binary fingerprint; bits are set to "1" in a PDT fingerprint when a triplet is found in a compound during conformational searching. The automated approach used by PDT allows for a fair comparison to be made with FEPOPS, since neither method uses a priori pharmacophore models. "Hand-built" pharmacophore queries could bias the search results if not suitably constructed. We incorporated three pharmacophoric features in the BinBounds.def configuration file: hydrogen bond acceptors, hydrogen bond donors, and hydrophobes. Distances between features were binned from 3-15 Å at 1.5 Å intervals (9 bins). The default 100 conformations were used for both the probe and database molecules. The above parameters result in 3,754 bit strings for each compound, as the fingerprint is a composite of all generated conformations. A PDT fingerprint was created for each test case probe in advance of searching. The target database was imported into a UNITY database and PDT fingerprints for each database compound were generated at search time. Tanimoto similarity was calculated using the UNIX dbmktriplets utility (evaltype = similar; cutoff = 0) and the hit lists were sorted by similarity scores in Pipeline Pilot. The searches required approximately 8 hours of single-processor calculation time for the 29,197 compound database.

[0060] Chemical datasets. 
Test sets derived from the MDDR (MDL Drug Database Report) were selected as a background set for similarity searching due to the classification of database records according to biological activities. FEPOPS typically converts >95% of compounds successfully (the majority of failures occur during the 3D conversion step). Of ~128,000 MDDR records, 121,948 structures were successfully converted to FEPOPS representations, and stored in 1.32 million explicit FEPOPS conformations with 1.75 tautomers/compound on average. From the entire MDDR set, a diverse subset of 30,000 compounds (hereafter, "MDDRds") was selected using the Pipeline Pilot Functional Class Fingerprints in the "Diverse Molecules" filter component. Of the 30,000 compounds, 29,178 successfully converted to FEPOPS. The MDDRds was considered representative of the whole database and therefore used as a background set because similarity searches using either MDDR or MDDRds yield similar retrieval rates of percent actives returned for each search method.

[0061] MDDR activity classes were selected to cover the four major ligand-target classifications proposed by Schuffenhauer A et al, Journal of Chemical Information & Computer Sciences 42: 947-955 (2002), i.e., enzymes, G-protein coupled-receptors (GPCR), nuclear receptors (NR), and ligand-gated ion channels (LGIC). The ligand targets in the dataset consisted of (i) COX-2 (enzyme), MDDR Activity Index = 78331, 642 records converted; (ii) 5-HT3A (ligand-gated ion channel), MDDR Activity Index = 06233, 788 records converted; (iii) HIV-RT (enzyme), MDDR Activity Index = 71522, 597 records converted; (iv) D2 dopamine receptor (GPCR), 151 agonists (MDDR Activity Index = 11125) and 431 antagonists (MDDR Activity Index = 07701) converted; (v) Retinoic acid receptor (nuclear receptor), MDDR Activity Index = 59505, 331 records converted. Numerous compounds associated with the DA receptor superfamily are present in the MDDR. To avoid false positives due to the various D1, D3, and D4 agonists and antagonists that could resemble dopamine, MDDR records containing dopamine in the biological activity field or as a substructure were eliminated from the MDDRds dataset. In each test case, the actives were "spiked" into the background MDDRds dataset to assess the ability of the similarity methods to enrich for actives above random. As noted previously by others, the assumption that compounds are inactive for a particular biological activity based on activity indices listed in the MDDR is not necessarily valid, since all compounds have not been tested for all activities (Sheridan RP et al, Journal of Chemical Information & Computer Sciences 41: 1395-1406 (2001)). However, the assumption is reasonable enough for comparison of similarity methods.

[0062] We have further selected four datasets from high-throughput screening (HTS) campaigns. 
The HTS targets were growth hormone secretagogue receptor (GHS-R; GPCR), gamma secretase, matrix metalloprotease-13 (MMP-13; enzyme), and a T315I mutant form of Abl kinase (mAbl; enzyme). The background dataset consists of 60,000 randomly selected inactive compounds from our HTS plated compound library, of which 57,017 structures were converted successfully to FEPOPS. We opted to use multiple probes (5, 10, or 20 compounds) for three of these test examples rather than a single ligand to reflect situations where more than one antagonist is known a priori. The probes were selected at random from the total set of validated actives using the Diverse Molecules (structural diversity) filter component in Pipeline Pilot. In the mAbl analysis, the Novartis anti-cancer drug STI-571 (Gleevec®, imatinib) was used by itself as the reference compound.

[0063] Scaffold hopping criteria. In principle, 2D-similarity searches are less capable than 3D methods of finding compounds that have similar activity to a probe yet are chemically dissimilar. Indeed, the very goal of traditional similarity searching is to find hits that are structurally similar. An advantage of 3D methods is the ability to leap from the chemical space describing the probe to distant and diverse chemical spaces. Unbiased criteria for measuring dissimilarity between molecules should be established which are independent of the fingerprints or descriptors used for similarity searching. One useful strategy is to first define the chemical scaffolds of the actives in the test set, and subsequently to assess the enrichment of unique frameworks among the top hits selected by the similarity method. Here, we classified scaffold classes using molecular equivalence indices (Meqi) developed by Xu and Johnson, which are based on their extension of the Morgan naming algorithm to labeled pseudographs. A molecular equivalence class is defined as an exhaustive subset of molecules that each contain a "recognizable structural feature". Xu YJ & Johnson M, Journal of Chemical Information & Computer Sciences 41: 181-185 (2001) and 42: 912-926 (2002). TABLE 1 illustrates how these structural features are extracted stepwise by reduction of compound topology to create the classes.

TABLE 1. Reduction of Compound Topology: "Molecular Equivalence Indices" as Criteria for Scaffold Hopping

[Example 1 and Example 2 columns: structure drawings at each topology level]

                       Number of Compounds at Topology Level
Topology               Cox-2   5-HT3   HIV-RT   D2      RAR
Graph Scaffold         138     232     152      245     52
Reduced Scaffold       58      115     75       106     27
Graph Ring System      45      129     82       108     29
Reduced Ring System    18      44      36       48      15

[0064] Molecular equivalences are particularly useful because the classes are easy to visualize and their memberships do not shift if more compounds are added to the dataset, as in other clustering methods. The total number of compounds used from 5 MDDR activity classes are shown in TABLE 1 as well as the number of structural classes that occur among the datasets at the reduced topology levels. For example, among the 642 COX-2 inhibitors, 58 "reduced scaffolds" and 18 "reduced ring systems" are present (shown in bold). A reduced scaffold is essentially an ordered ring system containing information about ring connectivity and ring edges, but not ring size. In contrast, a reduced ring system (RRS) is an unordered ring system containing information about the number of rings and internal edges, but not the order of connectivity. RRS is the most broadly defined class and produces the smallest number of structural clusters. Reduced scaffolds and RRS representations were chosen in the present study to evaluate scaffold hopping because ring systems are key in determining shape and orienting the functional groups that guide receptor binding, and are important in the pharmacological profiles of drugs. Lipkus AH, Journal of Chemical Information & Computer Sciences 41: 430-438 (2001). Importantly, the Meqi scaffolds and ring systems serve as an impartial and stringent basis on which to evaluate chemotypes, unrelated to the fingerprinting methods herein.

[0065] Active enrichment. FEPOPS was compared to three topological search methods, FCFP_4, MACCS, and DAYLIGHT, as well as a 3D method, PDT. Traditionally, the success of similarity methods is judged by the "quantity", or the recall, of known actives from a database. To an experimental screener, the important question concerning active recall is how many actives can be retrieved above a designated cutoff containing a practical number of compounds for testing? In the current study, however, we also stress the medicinal chemist's perspective by assessing the "diversity" or "quality" of each hit list based on the recall of active scaffold classes (i.e., scaffold hops). The critical question in terms of scaffold hopping is how well do the methods sample among all possible active chemotypes above a designated cutoff? Sheridan RP & Kearsley SK, Drug Discovery Today 7: 903-911 (2002). In other words, does the method find actives dissimilar enough from the probe to be considered novel scaffolds in the eyes of a medicinal chemist? TABLE 2 provides a comprehensive summary of the percentages of actives, scaffolds, and ring systems retrieved by each method using single ligand probes for the five targets. In practice, the number of compounds that can be tested based on a similarity search is limited. On this basis, we selected stringent and practicable cutoffs of the top 0.1%, 0.5%, and 1.0% most similar compounds in order to assess enrichments, corresponding to approximately 30, 148, and 296 compounds that would need to be tested (for the MDDRds test set). It is clear that a majority of the actives do not need to be recalled in order to return a sizeable percentage of the active scaffolds and ring systems. The following results demonstrate that FEPOPS is ideal for capturing the largest number of scaffolds within a reasonable number of "cherry picks".

TABLE 2. Recovery of MDDR Actives, Active Scaffolds, & Active Ring Systems by Single-Ligand Similarity Searching

                                     %Actives Found       %Active Scaffolds    %Active Ring Systems
                                     in Top N% (a,b)      Found in Top N% (a)  Found in Top N% (a)
Target  Ligand Probe       Method    N=1    N=0.5  N=0.1  N=1    N=0.5  N=0.1  N=1    N=0.5  N=0.1
COX-2   SC-558 bioactive   FEPOPS    4.4    2.4    0.3    21     12     3      44     33     11
COX-2   SC-558 flexible    FEPOPS    5.8    3.1    1.6    22     16     7      39     33     22
COX-2   SC-558             FCFP_4    7.2    6.1    1.9    7      7      3      17     17     11
COX-2   SC-558             MACCS     1.6    1.4    0.3    7      7      2      22     22     6
COX-2   SC-558             DAYLIGHT  6.1    5.4    1.9    9      5      2      11     11     6
5-HT3   extreg 194584      FEPOPS    5.3    3.6    0.8    17     13     --     27     18     9
5-HT3   extreg 194584      FCFP_4    8.2    7.1    2.9    14     10     --     20     16     5
5-HT3   extreg 194584      MACCS     6.1    3.0    1.3    18     12     --     23     14     11
5-HT3   extreg 194584      DAYLIGHT  6.0    4.4    2.5    13     10     --     16     11     9
HIV-RT  extreg 236942      FEPOPS    3.5    2.5    1.2    9      7      4      17     14     6
HIV-RT  extreg 236942      FCFP_4    1.3    1.2    1.0    7      3      1      14     6      3
HIV-RT  extreg 236942      MACCS     1.5    1.2    1.0    3      3      1      6      6      3
HIV-RT  extreg 236942      DAYLIGHT  1.2    1.0    1.0    3      1      1      6      3      3
D2      dopamine           FEPOPS    1.8    1.3    0.5    8      --     --     10     8      4
D2      dopamine           FCFP_4    1.0    0.0    0.0    6      --     --     4      0      0
D2      dopamine           MACCS     0.1    0.1    0.0    3      --     --     2      2      0
D2      dopamine           DAYLIGHT  0.7    0.7    0.0    3      --     --     2      2      0
RAR     retinoic acid      FEPOPS    24.0   15.0   3.8    37     33     15     60     60     33
RAR     retinoic acid      FCFP_4    11.0   10.0   4.9    33     30     26     40     40     20
RAR     retinoic acid      MACCS     13.3   9.7    4.5    30     22     15     47     33     27
RAR     retinoic acid      DAYLIGHT  10.4   7.8    3.6    26     22     15     13     13     13

(a) See TABLE 1 for total number of actives, reduced scaffolds, and reduced ring systems for each target. (b) The %Actives possible in the given percentile cutoffs is <100% in all cases.

[0066] In practice, the number of compounds that can be ordered and tested based on a similarity search is limited. Therefore stringent cutoffs of the top 0.1%, 0.5%, and 1.0% most similar compounds were used to assess enrichments, corresponding to approximately 30, 148, and 296 compounds that would need to be tested. The following results demonstrate that FEPOPS is useful for capturing the largest number of scaffolds within a reasonable number of "cherry picks".
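By way of illustration only, the percentages reported for a given cutoff may be computed from a ranked hit list as sketched below. This is a minimal sketch: the function name `recall_at_cutoff` and its arguments are hypothetical, and the class label attached to each active stands for whichever scaffold or ring-system class is being evaluated.

```python
def recall_at_cutoff(ranked_ids, active_class, percent):
    """Percent of actives and percent of active classes found in the top N%.

    ranked_ids: database identifiers sorted from most to least similar.
    active_class: dict mapping each active identifier to its class label.
    percent: cutoff N, e.g. 1.0 for the top 1% of the database.
    """
    n_top = max(1, round(len(ranked_ids) * percent / 100))
    hits = [cid for cid in ranked_ids[:n_top] if cid in active_class]
    pct_actives = 100 * len(hits) / len(active_class)
    all_classes = set(active_class.values())
    found_classes = {active_class[cid] for cid in hits}
    pct_classes = 100 * len(found_classes) / len(all_classes)
    return pct_actives, pct_classes
```

Under this sketch, several actives from the same class count once toward class recall, which is why a method can rank first on actives found and still rank lower on scaffolds or ring systems found.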

[0067] COX-2 inhibitors. COX-2 is an oxidoreductase (E.C. 1.14.99.1) targeted by numerous therapeutics for pain and inflammation. The COX-2 inhibitor SC-558 was selected as the probe since an X-ray co-crystal complex structure was available. Kurumbail RG et al, Nature 384: 644-648 (1996). The 3D coordinates (PDB code: 1CX2) were used to obtain a single FEPOP for the bioactive form of the compound by disabling flexibility during the calculation. The results were compared with the results for FEPOPS calculations allowing incremental flexibility in the 4 rotatable bonds (TABLE 2).

[0068] In the top 1% most similar compounds to SC-558, 4.4% of the 642 COX-2 inhibitors were retrieved using the bioactive form versus 5.8% when flexibility is incorporated. The Pipeline Pilot fingerprints (FCFP_4) retrieved the highest percentage of actives (7.2%) given the SC-558 probe. In contrast, FEPOPS identified the largest number of active scaffolds and reduced ring systems (RRS). In the top 1%, 44% of all active ring systems were sampled by FEPOPS (rigid) versus 17% for FCFP_4, 22% for MACCS, 11% for DAYLIGHT, and 22% for PDT, indicating that the diversity of the hits retrieved by FEPOPS is considerably higher than that of the hits found by 2D similarity. The results were similar for scaffold retrieval. The recovery of COX-2 inhibitor scaffolds and ring systems is shown in FIG. 2, in cumulative recall curves.

[0069] Cumulative recall curves are useful for visual comparison of similarity methods because they are based on ranks rather than scores. Sheridan RP & Kearsley SK, Drug Discovery Today 7: 903-911 (2002). For each ring system recovered in the top 1%, the highest-ranking member along with its actual rank is shown in FIG. 3. Several of the active ring systems retrieved by FEPOPS were strikingly divergent from the probe. In certain examples, substituent differences preserved the overall logP and charge character of the probe, but possess an entirely different "bulkiness", such as the compound from RRS 2 in FIG. 3. Notably, ring systems 4, 6, 2, and 8 found by FEPOPS were not found by the 2D-based searches. The compounds shown from these ring systems (FIG. 3, bioactive probe conformation) have Tanimoto similarities to the probe of 0.13, 0.06, 0.17, and 0.11, respectively, using FCFP_4 descriptors (i.e., they are dissimilar in practical terms). The 2D methods typically find compounds that retain the tricyclic core of SC-558 despite their membership in different RRS classes (see, FIG. 3).

[0070] An important result from the COX-2 test case was that use of the bioactive conformation of the probe molecule rather than a representative set of flexible conformers does not improve the results, and in fact does worse than the flexible approach at scaffold recovery (FIG. 2). We have similarly observed that the bioactive conformation of lisinopril (Natesh R et al, Nature 421: 551-554 (2003)) does not retrieve ACE inhibitors as well as a flexible lisinopril similarity search. There is further unexpected evidence to support that rigid pharmacophore models are limiting. One of skill in the art might anticipate that one or two SC-558 conformers from the flexible similarity search are predominantly most similar to all other COX-2 inhibitors (i.e., the probe conformations that best represent a common pharmacophore), but this is not the case. Each of the probe conformations has a subset of COX-2 inhibitors to which it is the best correlated. Intriguingly, this finding holds true for all of the test cases in this example. These results suggest that a priori knowledge of bioactive conformations does not necessarily improve FEPOPS performance and may be somewhat limiting to scaffold hopping efforts.

[0071] FIG. 2 shows cumulative recall curves for the recovery of COX-2 inhibitor scaffolds (top panels) and ring systems (bottom panels). Cumulative recall curves are useful for visual comparison of similarity methods because they are based on ranks rather than scores. The recovery of scaffolds and ring systems in the top 2% of the database is displayed (FIG. 2, at left). We also assessed what percent of the database would need to be screened in order to find 70% of the active scaffolds and ring systems (FIG. 2, at right). For example, FEPOPS (flexible search) found 70% of the active ring systems in the first 7% of the database, versus 20% of the database for PDT.
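By way of illustration only, the percent of a ranked database that must be screened to recover a target fraction of the active classes may be computed as sketched below. This is a minimal sketch: the function name `percent_screened_for_coverage` and its arguments are hypothetical, and rounding the required number of classes upward is an assumption.

```python
import math


def percent_screened_for_coverage(ranked_ids, active_class, target=0.70):
    """Percent of the ranked list needed to see `target` of the active classes.

    ranked_ids: database identifiers sorted from most to least similar.
    active_class: dict mapping each active identifier to its class label.
    Returns None if the target is never reached.
    """
    # Assumed convention: round the required class count upward.
    needed = math.ceil(target * len(set(active_class.values())))
    seen = set()
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in active_class:
            seen.add(active_class[cid])
            if len(seen) >= needed:
                return 100 * rank / len(ranked_ids)
    return None
```

Recording the size of `seen` at every rank, rather than stopping at the target, would give the points of a cumulative recall curve.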

[0072] For each ring system recovered in the top 1%, the highest-ranking member along with its actual rank is shown in FIG. 3. Several of the active ring systems retrieved by FEPOPS are strikingly divergent from the probe. In certain cases, substituent differences preserve the overall logP and charge character of the probe, but possess an entirely different "bulkiness", such as the compound from RRS 2 in FIG. 3. Notably, ring systems 2, 4, 6, and 8 found by FEPOPS were not found by the 2D-based searches. The compounds shown from these ring systems (FIG. 3, bioactive probe conformation) have Tanimoto similarities to the probe of 0.13, 0.06, 0.17, and 0.11, respectively, using FCFP_4 descriptors (i.e., they are dissimilar in practical terms). The scaffold hops made by FEPOPS could not have been made by following the chemical approach of isosteric replacement. In contrast, the 2D methods typically find compounds that retain the tricyclic core of SC-558 despite their membership in different RRS classes (FIG. 3).

[0073] 5-HT3A. The 5-hydroxytryptamine receptor (3A) is a ligand-gated ion channel involved in neurotransmitter reuptake and a significant pharmacological target of antidepressant medications. The probe for similarity searching was selected at random from the 788 MDDR actives (see structure in FIG. 5). Enrichment for the total number of actives recovered was again highest for FCFP_4, followed by MACCS and DAYLIGHT (TABLE 2). Although FEPOPS showed the lowest active recovery, its identification of active ring systems was the best at the 0.5% and 1.0% cutoffs (FIG. 4, left). Overall, 27% of the active ring systems are recovered in the top 1% by FEPOPS and PDT; however, PDT, MACCS, and FEPOPS all perform competitively at recalling 70% of the active ring systems (FIG. 4, right). For each RRS recovered in the top 0.5%, the highest-ranking member along with its actual rank is shown in FIG. 5. 
(Note that RRS numbers do not correspond to one another between the different target test examples.) The representative structures shown for the 2D methods generally contain substructures that resemble the core bicyclic ring system from the probe. Half of the ring systems retrieved by the 3D methods were not found by the 2D methods (RRS 12, 15, 8, and 10), whereas FCFP_4 and MACCS each found one unique ring system (RRS 35 and RRS 40). This suggests that overall the 3D methods sample the diversity of 5- HT3 A antagonist chemotypes more effectively in the top percentiles. [0074] The 2D methods are capable of scaffold hopping to an extent and different 2D similarity methods return different rankings for actives. This same observation made by others has led to the paradigm that using multiple similarity methods benefits the discovery process. Stanton DT et al, Journal of Chemical Information & Computer Science, 39: 21-27 (1999); Ginn CMR et al, Perspectives in Drug Discovery & Design 20: 1-16 (2000). However, in the present example, the ring systems recovered by the 2D methods as a collective were not so divergent. If hit diversity is more desirable than hit quantity, FEPOPS appears to be a better strategy than employing multiple 2D methods.

[0075] HIV-RT. HIV reverse transcriptase is a nucleotidyl transferase (E.C. 2.7.7.49) and a significant target of anti-retroviral compounds. The probe for similarity searching was selected at random from the 597 MDDR actives. In this example, FEPOPS retrieved the highest percentage of actives, scaffolds, and ring systems at any threshold above the top 1% most similar compounds (TABLE 2 and FIG. 6); 21 actives from seven scaffolds and six ring systems were identified in the top 1% when given the single probe. In contrast to its performance in the early portion of the RRS recall curve, DAYLIGHT recalled 70% of the active ring systems in the smallest percentile of the MDDRds (FIG. 6). In the top 100 hits from each method (top 0.34%), only FEPOPS hops to a ring system that differs from the probe (FIG. 7). DAYLIGHT and MACCS each have one RRS unique to their hits in the top 1%, while FEPOPS and FCFP_4 have two.

[0076] D2 agonists and antagonists. Often, endogenous ligands such as peptides, hormones, or co-factors associated with proteins in vivo are known at the beginning of the lead identification process. Ideally, an effective similarity search method could scaffold hop directly from a given endogenous ligand to candidate lead molecules. Consequently, we posed an especially challenging problem to the similarity methods described above by supplying dopamine as a probe to search for 548 MDDR D2 receptor agonists and antagonists.

[0077] D2 receptors belong to the superfamily of seven transmembrane GPCR dopamine (DA) receptors, which have been implicated in neuropsychiatric, cardiovascular, and renal diseases. Emilian G et al, Pharmacology & Therapeutics 84: 133-156 (1999). For example, D2 agonist activity is involved in the action of antiparkinsonian drugs, whereas D2 antagonist activity is associated with the antipsychotic medications used to treat schizophrenia.

[0078] Numerous compounds associated with the DA receptor superfamily are present in the MDDR. To avoid false negatives due to the various D1, D3, and D4 agonists and antagonists that could resemble dopamine, MDDR records containing dopamine in the biological activity field were eliminated from the background dataset. Further, since the objective is to hop to new chemotypes, MDDR records containing dopamine as a substructure, which could easily be retrieved by a simple substructure query, were filtered out. It is important to stress that a number of compounds in the background MDDR dataset antagonize binding of dopamine-like compounds, such as serotonin and histamine, adding a level of difficulty to this particular example.

[0079] A small number of the total D2 actives were found in the top 1% by the similarity search methods. Only the 3D methods, PDT and FEPOPS, fared consistently better than random at recovering D2 actives given the structure of DA (TABLE 2).

[0080] FEPOPS finds 5 different ring systems (10%) in the top 292 compounds (1%), whereas PDT finds 7 ring systems. Further, FEPOPS recalls 70% of the RRS classes in the top 20% of the database (FIG. 8). Despite the low return on total actives, the enrichment for novel RRS classes by the 3D methods (FIG. 9) may afford a suitable starting point for lead optimization in similar real-life cases.

[0081] Retinoids. Retinoic acid receptors (RAR) are nuclear receptors and transcription factors critical in the differentiation of various cell types. Both RAR agonists and antagonists have been found to have antitumor activities in several cancers. Schapira M et al, BMC Structural Biology 1: 1 (2001). Members of the RAR family are activated by a number of naturally occurring retinoids, one of which is all-trans retinoic acid (at-RA), a carboxylated form of vitamin A. As a launch point to test for scaffold hopping, we selected at-RA as a probe to assess for recovery of 308 retinoid compounds from the MDDR.

[0082] The similarity methods were more successful at recalling retinoids than at active recall in the other test cases. FEPOPS showed 24-fold enrichment of actives in the top 1% and 60-fold enrichment for active ring systems (TABLE 2). Of the fifteen active ring systems, FEPOPS found 9 in the top 0.5%, versus six found by FCFP_4, 5 found by MACCS, and 2 found by DAYLIGHT (FIGS. 10 and 11). FEPOPS exclusively identified RRS classes nine and fifteen as well as all seven others recalled by FCFP_4, MACCS, DAYLIGHT and PDT collectively.
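
The "fold enrichment" figures quoted for the top 1% can be understood as the hit rate in the selected top fraction divided by the hit rate expected from random selection. The following is a minimal sketch under that assumption; the source does not spell out its exact formula, and the function name and the counts in the example are invented toy values, not data from TABLE 2.

```python
def enrichment_factor(hits_in_top, n_top, n_actives, n_database):
    """Fold enrichment of actives in the top-ranked subset over random picking.

    EF = (hits_in_top / n_top) / (n_actives / n_database),
    rearranged so that integer inputs are combined before dividing.
    """
    if min(n_top, n_actives, n_database) <= 0:
        raise ValueError("counts must be positive")
    if hits_in_top > min(n_top, n_actives):
        raise ValueError("more hits than the subset or active set can hold")
    return (hits_in_top * n_database) / (n_top * n_actives)


# Hypothetical example: a 10,000-compound database holding 100 actives.
# Random selection of 100 compounds (1%) would be expected to find 1 active;
# finding 24 actives in the top 1% corresponds to 24-fold enrichment.
toy_ef = enrichment_factor(hits_in_top=24, n_top=100,
                           n_actives=100, n_database=10_000)
```

The same function applies to ring systems or scaffolds if the counts are taken over distinct classes rather than individual active compounds.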

[0083] HTS datasets. The MDDR test examples indicated that the Pipeline Pilot functional class fingerprints and MACCS Keys were more generally useful than DAYLIGHT as benchmarks in comparing 2D similarity to the FEPOPS approach. For the HTS dataset test cases, we have simply compared FEPOPS to FCFP_4 by assessing the recall of actives, scaffolds, and reduced ring systems in the top 1% of their ranked lists. The results are summarized in FIG. 12. In the GHS-R, gamma secretase, and MMP-13 test cases, FEPOPS recalls more actives than FCFP_4 when given 5 probe molecules, as well as more scaffolds and novel ring systems. However, FCFP_4 recovers more inhibitors of GHS-R and gamma secretase than FEPOPS when given 20 reference actives. This may reflect the fundamental nature of the descriptors used by the different methods. For a 3D method such as FEPOPS, an increase in the number of probe structures may not greatly increase the amount of new information, since the molecular representation is largely based on pharmacophores. In contrast, an increase in new probe structures inherently presents more topological information to guide searches with 2D fingerprints.

[0084] The results for mABL are significant from a biological viewpoint because they involve searching for compounds with similarity to Gleevec that are in fact inhibitors of a mutant version of ABL kinase found in vivo in some chronic myelogenous leukemia patients with resistance to Gleevec; thus, the probe molecule itself has dramatically reduced affinity for the target. The challenge is to leap from the ineffective inhibitor to active chemotypes without utilizing structural knowledge about the variant protein target. We found that FEPOPS hopped from Gleevec® to 24% of the mABL inhibitors in the top 1% most similar compounds, while sampling from 54% of the active structural classes, 37% of active scaffolds, and 29% of the active ring systems (FIG. 11). This suggests that drug-induced mutations in target proteins could be countered from a pharmacological standpoint by scaffold hopping from impotent drugs to new potent leads.

[0085] Discussion. The method of the invention conducts flexible 3D similarity searches using atom cluster centroids, or feature points, which contain physicochemical (partial charges and atomic logP) and pharmacophore (hydrogen bond donors and acceptors) properties. "Feature point pharmacophores" are created by electrostatic-based sorting of the four feature points prior to assigning their inter-point distances. As shown herein, the FEPOPS representation is applicable for finding leads with novel scaffolds ranked in the top 0.1-1% of the database in cases where one or only a few "binders" are known. The method is robust in terms of scaffold hopping, and can identify novel scaffolds that cannot be identified by isosteric replacements. The fuzzy representation of FEPOPS ranks actives from a variety of scaffold classes highly, with minimal reduction in total active recall.

[0086] As shown herein, the FEPOPS method of the invention is consistently a strong performer. For example, FEPOPS demonstrated the best ring-system hopping above the top 1% threshold in 4 of the 5 MDDR test cases, while recalling 70% of the active ring systems first in 3 of the 5 test cases. The FEPOPS method performs particularly well compared to topological methods when an endogenous ligand is used as the similarity query. Efforts to leap from natural ligands to lead compounds have been made previously using peptides. Now, natural ligands or substrates may also be used to simply prioritize compounds in a library for testing against a target. Alternatively, FEPOPS may serve as an orthogonal approach to complement 2D methods prior to screening or to recover false negatives missed by HTS, post-screening.

[0087] The observations made by others, that 2D methods are capable of scaffold hopping to some extent and that different 2D similarity methods return different rankings for actives, have led to the paradigm that using multiple similarity methods benefits the discovery process. Nonetheless, in several of the cases presented herein, the ring systems recovered by the 2D methods as a collective are not so divergent. If scaffold recall is more desirable than active recall, then a more advantageous strategy may be to use a reliable 3D method such as FEPOPS in addition to 2D methods, rather than employing consensus 2D methods alone. Additionally, the number of ligands known prior to screening should play a role in the decision of whether to use 2D- or 3D-similarity search algorithms, as a larger number of reference compounds (>5) more readily benefits 2D methods in the current study (TABLE 3).

[0088] In addition to recalling novel active scaffolds, the hits ranked in the top 1% by FEPOPS are diverse in terms of their physical properties, such as the often-cited Lipinski properties: logP, molecular weight, hydrogen bond donors, and hydrogen bond acceptors. FIG. 14 shows the Lipinski properties of the COX-2 scaffold hops returned by FEPOPS in the top 1% of the database. Notably, many of the hits have logP values lower than the probe molecule, suggesting that FEPOPS may be a useful method for finding new lead compounds in cases where the current lead has undesirable pharmacological properties (such as poor aqueous solubility). Further, FEPOPS samples more of the diversity contained inherently within the active datasets. For example, FIG. 15 illustrates that the actives recalled by FEPOPS in the top 1% have an average similarity to the probe molecules that is nearly the same as the average similarity of all actives to the probe molecules. In other words, the highest ranking actives from FEPOPS are a reasonable sampling of the entire set of actives.

[0089] One of the clear advantages of the FEPOPS method of the invention is the fully automated, unbiased generation of pharmacophore-type information. With prior art methods, the construction of a pharmacophore model often required flexible alignment of known active compounds and hand-picking of important features. Typically, pharmacophores were composed of a triplet or quartet of single features (hydrophobic group, hydrogen bond donor or acceptor, etc.). In contrast, each feature point in the FEPOPS method of the invention encodes multiple features, since each atom with membership in a k-means cluster contributes some information. Thus, the indigenous atomic environment of the cluster members is encoded into the centroids. When performing a pharmacophore-based search, consideration of the atom neighbors of features may prevent the selection of compounds in which the features reside in the context of undesirable neighbors.
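
The way each feature point inherits information from every atom in its cluster can be sketched as follows. This is an illustrative sketch only: the deterministic farthest-point seeding, the choice to sum member-atom properties into the centroid, the function names, and the toy eight-atom "molecule" are all assumptions, not details confirmed by the source.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    if len(points) < k:
        raise ValueError("need at least k points")
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        # Seed with the point farthest from its nearest existing centroid.
        centroids.append(tuple(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids))))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels


def feature_points(atoms, k=4):
    """atoms: list of (x, y, z, partial_charge, atomic_logp) tuples.

    Returns k feature points; each carries its centroid position plus the
    summed charge and logP of its member atoms (summation is an assumption).
    """
    coords = [a[:3] for a in atoms]
    centroids, labels = kmeans(coords, k)
    points = []
    for j, centre in enumerate(centroids):
        members = [a for a, lab in zip(atoms, labels) if lab == j]
        points.append({"xyz": centre,
                       "charge": sum(m[3] for m in members),
                       "logp": sum(m[4] for m in members)})
    return points


# Hypothetical toy molecule: four well-separated pairs of atoms.
toy_atoms = [
    (0, 0, 0, -0.40, 0.1), (1, 0, 0, -0.10, 0.2),
    (10, 0, 0, 0.30, 0.5), (11, 0, 0, 0.10, 0.4),
    (0, 10, 0, 0.00, -0.3), (1, 10, 0, 0.05, -0.2),
    (10, 10, 0, 0.02, 0.6), (11, 10, 0, 0.03, 0.7),
]
toy_points = feature_points(toy_atoms, k=4)
```

In this sketch no atom is discarded: every atom contributes its charge and logP to exactly one feature point, which is the sense in which the atomic environment is encoded into the centroids.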

[0090] The scaffold hopping of PDT is somewhat similar to the FEPOPS method of the invention and to the topological methods, especially in the D2 test case among the top 1% of ranked compounds (FIG. 10). Many of the D2 binders retrieved in the top 1% share a common pharmacophore: a donor-acceptor-hydrophobe triplet, where the acceptor to hydrophobe distance is <3 Å, as in dopamine itself. These results exemplify how useful explicit pharmacophore similarity methods can be if critical pharmacophores are known in advance of searching.

[0091] Although FEPOPS models contain only a small number of descriptors per compound, the information encodes a level of complexity that may not be readily apparent to one of ordinary skill in the art. For example, a hypothetical value of >1 for the feature "L1" not only indicates the presence of a hydrophobic region in the molecule, but also reveals that the most negative portion of the molecule (corresponding to feature point 1) is hydrophobic. Conversely, a negative value for "L4" indicates that the most positively charged portion of the molecule is composed of mostly hydrophilic atoms.

[0092] A striking result from the COX-2 test case is that use of the bioactive conformation of the probe molecule rather than a representative set of flexible conformers does not improve the results, and in fact does worse than the flexible approach at scaffold recovery (FIG. 2). We have similarly observed that the bioactive conformation of lisinopril does not retrieve angiotensin-converting enzyme (ACE) inhibitors as well as a flexible lisinopril similarity search. There is further unexpected evidence to support that rigid pharmacophore models based on a single bioactive conformation are limiting. One of skill in the art might anticipate that one or two SC-558 conformers from the flexible similarity search would be predominantly most similar to all other COX-2 inhibitors (i.e., the probe conformations that best represent a common pharmacophore), but this was not the case. Each of the seven probe conformations had a subset of COX-2 inhibitors to which it was the best correlated. In other words, each explicit probe conformation contributed to the overall enrichment of actives. Intriguingly, this finding held true for all of the test cases shown herein. These results suggest that allowing probe flexibility may encourage scaffold hopping, whereas incorporating a priori knowledge of bioactive conformations does not necessarily improve FEPOPS performance and may be somewhat limiting to scaffold hopping efforts.

[0093] In summary, the FEPOPS method of the invention advantageously creates a pharmacophore from all atoms in the ligand, rather than specific feature quartets common to pharmacophore searching. Clustered representation is fuzzy (via k-means clustering), which promotes selection of chemotypes with regional similarities but unique frameworks. Training sets of known actives are not required for conducting similarity searches, nor is knowledge about bioactive conformations. Compound pre-processing includes pKa and tautomeric information often missing from other similarity methods. The method precludes time-intensive molecular alignments or pharmacophore matching by a simple, yet effective presorting of feature points by overall charge before calculating distances between points, which allows for direct statistical correlations of descriptors. FEPOPS finds a small number of diverse, representative conformers that cover FEPOPS space efficiently by selecting conformations with "minimal dissimilarity" to neighboring conformations (k-medoids clustering). Finally, once FEPOPS are pre-computed (~600K compounds/week), the calculation time for similarity searches is nominal (~8 min for a 100K-compound database) and comparable to 2D methods. The representations can optionally be mapped back to parent coordinates; thus, for any test compound, the conformation with the highest correlation to a known active ligand is potentially a bioactive conformation.
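
The alignment-free comparison summarized above (presort the feature points by charge, record the inter-point distances, then correlate the resulting descriptor vectors directly) can be sketched as below. The descriptor layout, the field names, the absence of any descriptor scaling, and the toy conformers are assumptions for illustration; only the charge presorting and the use of Pearson correlation are taken from the text and claims.

```python
import math
from itertools import combinations

FEATURES = ("charge", "logp", "donors", "acceptors")  # hypothetical field names


def descriptor(points):
    """Flatten feature points into one vector, independent of input order."""
    ordered = sorted(points, key=lambda p: p["charge"])  # most negative first
    vec = [p[name] for p in ordered for name in FEATURES]
    # Six inter-point distances for four points, in a fixed pair order.
    vec += [math.dist(a["xyz"], b["xyz"]) for a, b in combinations(ordered, 2)]
    return vec


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("constant vector has no defined correlation")
    return cov / (norm_u * norm_v)


def best_similarity(query_conformers, candidate_conformers):
    """Highest correlation over all query/candidate conformer pairs."""
    return max(pearson(descriptor(q), descriptor(c))
               for q in query_conformers for c in candidate_conformers)


def fp(xyz, charge, logp, donors, acceptors):
    """Helper to build one toy feature point."""
    return {"xyz": xyz, "charge": charge, "logp": logp,
            "donors": donors, "acceptors": acceptors}


# Toy conformers: `shuffled_conf` is `query_conf` with its points reordered.
query_conf = [fp((0, 0, 0), -0.5, 0.3, 0, 2), fp((4, 0, 0), -0.1, 1.2, 0, 0),
              fp((0, 5, 0), 0.2, 0.8, 1, 0), fp((4, 5, 1), 0.4, -0.6, 2, 1)]
shuffled_conf = [query_conf[2], query_conf[0], query_conf[3], query_conf[1]]
other_conf = [fp((0, 0, 0), -0.3, -0.9, 1, 1), fp((2, 0, 0), -0.2, 0.1, 0, 3),
              fp((0, 9, 0), 0.1, 2.0, 0, 0), fp((7, 1, 3), 0.4, 0.4, 1, 0)]
score = best_similarity([query_conf], [shuffled_conf, other_conf])
```

Because the points are sorted before the vector is built, two molecules can be compared element by element without any spatial superposition step, which is what makes the database search inexpensive once the descriptors are stored.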

[0094] All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes. In addition, all GenBank accession numbers, Unigene Cluster numbers and protein accession numbers cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each such number was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.

[0095] The present invention is not to be limited in terms of the particular embodiments described in this application, which are intended as single illustrations of individual aspects of the invention. Many modifications and variations of this invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. Functionally equivalent methods and apparatus within the scope of the invention, in addition to those enumerated herein, will be apparent to those skilled in the art from the foregoing description and accompanying drawings. Such modifications and variations are intended to fall within the scope of the appended claims. The present invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims

CLAIMS We claim:

1. A method for generating a three-dimensional (3D) representation of a molecule, comprising the steps of: (1) selecting a molecule having a known chemical structure; (2) computationally generating multiple conformers of the chemical structure by systematic rotation of flexible bonds; (3) partitioning atoms of the molecule into one or more k-means clusters, each cluster having a centroid; (4) assigning atom-type pharmacophoric features to the k-means cluster centroids from step (3) to create feature points; (5) recording distances between feature points; and (6) selecting representative feature point pharmacophore (FEPOPS) conformers among all calculated conformations by k-medoids clustering.

2. The method of claim 1, wherein four feature points are created and wherein, prior to step (5) of recording distances between feature points, the feature points are sorted on the basis of a quadrupole directionality.

3. The method of claim 1, further comprising the steps of: (7) identifying a bioactive configuration of the molecule.

4. The method of claim 1, wherein the selection of a molecule further comprises a step selected from the group consisting of: (a) assigning protonation states to the atoms of the molecule; (b) enumerating tautomers of the molecule; (c) calculating partial charges of the atoms of the molecule; (d) calculating atomic logP values for the molecule; and (e) combinations thereof.

5. The method of claim 1, wherein the atom-type pharmacophoric features are selected from the group consisting of partial charge, logP, hydrogen bond donors, hydrogen bond acceptors, formal charge, atomic number, atomic radius, atomic refractivity and combinations thereof.

6. The method of claim 1, further comprising the step of storing representative FEPOPS conformers in a lookup table.

7. A method for identifying a molecule having a similar three-dimensional (3D) structure to a query molecule, said method comprising the steps (1)-(6) of Claim 1 to generate FEPOPS conformers of the query molecule and FEPOPS conformers of any candidate molecule to be analysed for having a similar 3D structure, further comprising the steps of: (8) calculating the similarity of the FEPOPS conformers of the query molecule to the FEPOPS conformers of other candidate molecules, wherein the similarity is calculated by matching the feature points according to feature correlation wherein a defined level of similarity is indicative that said candidate molecule has a similar 3D structure to said query molecule.

8. The method of Claim 7, wherein said similarity is calculated using Pearson correlation.

9. The method of Claim 7, wherein said similarity is calculated by matching each feature point of the query molecule to the corresponding feature point of the candidate molecule sorted on the basis of quadrupole directionality, without further matching with the other feature points.

10. The method of Claim 7, wherein the similarity is calculated by matching each feature point of the query molecule to each feature point of the candidate molecule.

11. The method of claim 7, wherein similarity is calculated between the FEPOPS conformers of the query molecule and the FEPOPS conformers of each database compound.

12. The method of Claim 11, wherein the similarity is calculated between the FEPOPS conformers of a query dataset of molecules known to have a specific biological activity and the FEPOPS conformers of each compound of a database.

13. A computer-based method for predicting the structure of a ligand capable of binding to a query molecule, comprising the steps of (1) searching for molecules having similar 3D structure to said query molecule, according to the method of Claim 7; and, (2) predicting the structure of a ligand capable of binding to said query molecule by using information related to the structure of ligands of the molecules identified at step (1).

14. A system for generating a three-dimensional (3D) representation of a molecule, comprising: (a) an interface for receiving atomic information on the molecule; (b) a module for generating a feature-point pharmacophore (FEPOPS) representation of a molecule according to the method of Claim 1, and (c) an interface for outputting information on the FEPOPS representation.

15. The system according to Claim 14, wherein said interface for outputting information also comprises an algorithm to transform the atomic coordinates of the molecule and to overlay the feature points so that any FEPOPS conformer can be visually superimposed with a probe molecule in a molecular modeling environment.

16. A computer program which is capable, when executed by a computer processor, of causing the computer processor to perform a method according to claim 1 or Claim 13.

17. A computer-readable storage medium having recorded thereon a computer program according to claim 16.

18. A method for predicting a pharmacophore structure responsible for a specific biological activity, comprising the steps of: (1) selecting a list of compounds with a known specific biological activity; (2) generating FEPOPS conformers of said compounds selected at step (1), according to the method of Claim 1; and, (3) performing unsupervised learning methods on the FEPOPS conformers, and identifying atomic features correlating with said biological activity, thereby predicting pharmacophore structure responsible for a specific biological activity.