WO2024116201A1 - Classification of lineages and sub-lineages of pathogens - Google Patents

Classification of lineages and sub-lineages of pathogens

Info

Publication number
WO2024116201A1
Authority
WO
WIPO (PCT)
Prior art keywords
strain
lineage
genetic
data
pathogen
Prior art date
Application number
PCT/IN2023/051107
Other languages
English (en)
Inventor
Aniruddh Sharma
Avlokita Tiwari
Original Assignee
Aarogyaai Innovations Pvt. Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aarogyaai Innovations Pvt. Ltd.

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00: ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • a disease, specifically an infectious disease, may be caused by a pathogen, such as a bacterium, a virus, a fungus, or any other micro-organism.
  • the strain of a pathogen evolves and certain genetic variations may get introduced into the genetic code of the pathogen, resulting in the emergence of lineages or sub-lineages.
  • a comprehensive study of each lineage and sub-lineage of a pathogen helps in better monitoring and in predicting whether any new lineage or sub-lineage is emerging within a population. It also helps in assessing whether such an emerging lineage is more or less infectious than previously existing ones, and in obtaining further information about its characteristics.
  • FIG. 1 illustrates a system for training a lineage classification model for classifying the lineages or sub-lineages of a strain of a pathogen, as per an example of the present subject matter.
  • FIG. 2 illustrates a classification system for classifying the lineages or sub-lineages of a strain of a pathogen, as per an example of the present subject matter.
  • FIG. 3 provides a graphical illustration depicting the emergence of lineages or sub-lineages of a pathogen, as per an example of the present subject matter.
  • FIG. 4 illustrates a method for providing data to a training system and thereby classifying the lineages and sub-lineages according to an example of the present subject matter.
  • a successful jump into a novel host species is a three-step process. Firstly, the pathogen must come in contact with the novel host. Secondly, it must successfully infect the novel host, which may involve binding to host cell receptors, entering cells and taking over the cell machinery to replicate, and/or escaping host defenses. Finally, there must be sufficient onwards transmission of the pathogen for its persistence and spread through the novel host species. The steps of infection and transmission may represent such daunting challenges for the pathogen that, in most cases, it fails to establish in the novel host species. A pathogen may, therefore, rely on the host's cellular machinery for survival and replication, and to persist, it needs to avoid the host's immune defenses.
  • any given strain of a pathogen under consideration may undergo variations or mutations during the course of time. Such mutations are evident as variations in the genetic code of the pathogen under consideration.
  • one or more variations may define a certain lineage for a given pathogen.
  • A lineage is a group of genetically related sequences defined by statistical support of their placement in a phylogeny and genetic differences from a common ancestor.
  • a sub-lineage refers to a lineage as it relates to being a direct descendant of a parent lineage.
  • tuberculosis bacteria have seven different lineages, each of which is characterized by defined variations in the genetic structure.
  • identification of the genetic lineages of microbes causing outbreaks of a disease may be useful in determining the relatedness and transmission patterns of cases as well as understanding the particular causes for a given outbreak caused by a given strain of an organism. For example, some variations allow the pathogen to spread more readily.
  • transmission is the passing of a pathogen causing communicable disease from an infected host individual or group to a particular individual or group, regardless of whether the other individual was previously infected.
  • the number of variations in the genetic structure may increase resulting in different strains or lineages.
  • the evolving strain may be a new lineage or may be a sub-lineage within an existing lineage.
  • lineages or sub-lineages may be identified once a large number of occurrences (e.g., by way of infections) are observed within a certain community. Later, existence of lineages or sub-lineages may be confirmed or ascertained based on testing of certain samples. As may be observed, such measures or approaches do not allow proactive monitoring to ascertain whether any new lineages or sub-lineages are emerging with new infections. Furthermore, pathogens may generally acquire mutations over time giving rise to new variants. When an infection due to a pathogen is in its earlier stages, there are fewer opportunities for the mutations to occur in the genome, which therefore results in fewer chances for the occurrence of variants and/or lineages or sublineages.
  • the present subject matter relates to classifying a given strain or a target strain into any one of the plurality of lineages and/or sub-lineages of the base strain (also known as a wild strain of a given pathogen) to which the target strain may correspond.
  • the target strain may be obtained from any one of the environmental samples, such as a sample derived from an individual who may be known to be infected with the given pathogen.
  • the classification of the given strain into any one of the lineages or sub-lineages is based on the genetic variations in the genetic structure of the target strain.
  • the approaches as described herein may classify the target strain into either a new lineage or a new sub-lineage of an existing lineage based on its genetic variations.
  • a training system may be utilized.
  • the training system may be any processor-implemented system which classifies one or more data points based on given criteria or characteristics.
  • the training system may classify the target strain based on the mutations that may be present in the target strain.
  • the training system may classify the target strain into one of the plurality of clusters, with each cluster representing a lineage.
  • the target strain may be classified into a sub-lineage depending on the extent of variation in the target strain vis-a-vis the genetic structure corresponding to the related lineage.
  • the training system may be utilized to classify the target strain into any one of the lineages or sub-lineages.
  • the reference genetic data may correspond to genetic variations of known lineages and sub-lineages of a plurality of strains of a given pathogen under consideration, wherein the genetic variations are determined with respect to the base strain of the pathogen (also known as a wild strain of a given pathogen).
  • the training system may be subjected to the reference data for training.
  • the training system once subjected to the reference genetic data may, considering the genetic variations of the different reference strains, classify instances of the reference genetic data into one or more clusters.
  • the clusters may correspond to either a lineage or a sub-lineage of the pathogen under consideration.
  • the training system may then be utilized for assessing whether a given target strain is to be classified into one of the clusters (corresponding to the lineages or sub-lineages), or whether the genetic variation of the target strain is such that it warrants that the target strain is to be classified into either a new lineage or a new sub-lineage within an existing lineage.
  • the training system may implement classifiers that are trained based on the reference genetic data, as discussed above.
  • the classifiers set to classify the data samples may be based on the genetic variations or mutations within a nucleotide sequence of a strain. Some examples of such genetic variations may include single nucleotide polymorphisms (or SNPs), deletions, insertions or mutations within the nucleotide sequence of the strain under consideration. However, it is crucial to note that the classification may be done upon a variety of parameters such as the binding energy between the molecules of the sample, and more. The examples used herein are only for the purpose of explanation of the present disclosure and should not be considered as a limitation.
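  • As an illustration of how variations such as SNPs, deletions and insertions could be presented to a classifier, the following minimal sketch encodes each strain as a binary presence/absence vector. The variant label format (`SNP:100:A>G` and so on), the strain names and the helper functions are assumptions made for this example and are not taken from the disclosure.

```python
def build_variant_index(samples):
    """Return every variant label seen in any sample, in sorted order."""
    return sorted(set().union(*samples))


def encode(sample, index):
    """Mark each indexed variant as present (1) or absent (0) in one sample."""
    return [1 if variant in sample else 0 for variant in index]


# Hypothetical variant calls, each relative to the base strain.
samples = {
    "strain_1": {"SNP:100:A>G", "DEL:250:3"},
    "strain_2": {"SNP:100:A>G", "INS:400:T"},
    "strain_3": set(),  # identical to the base strain
}

index = build_variant_index(samples.values())
matrix = {name: encode(calls, index) for name, calls in samples.items()}

for name, row in matrix.items():
    print(name, row)
```

  • Other parameters mentioned above, such as binding energy, could be appended to the same vector as additional numeric columns.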
  • the present approaches employ machine learning by way of which trained classifiers, in one example, may be used for ascertaining lineages and sub-lineages of a target strain of a pathogen.
  • the approaches as discussed provide a rapid and comprehensive mechanism for ascertaining the differential lineage or a sub-lineage formed for a target strain of a pathogen.
  • the determination is quick and accurate, which in turn enables classifying different variants/lineages or sub-lineages of a pathogen so that an appropriate drug may be selected, in a timely manner, for a patient undergoing treatment for the target strain of the pathogen.
  • examples of the pathogen include, but are not limited to, viral, bacterial or fungal pathogens.
  • the method includes performing repetitive classification and analysis, for example, by using an Artificial Intelligence (AI) learning model.
  • Other approaches for performing classification may also be utilized, without deviating from the scope of the present subject matter.
  • the data samples pertaining to the patient sample may further be classified depending on the lineage or sub-lineage that the data set represents.
  • a large number of data sets may be used for training one or more learning models, which may then be used to classify into clusters depending upon the categories/types of the lineages and sub-lineages of a pathogen.
  • Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence or absence of those commonalities. With clustering, it is possible to detect outliers in the given data. It may be noted that other approaches may also be adopted without limiting the scope of the current subject matter.
  • FIG. 1 illustrates a system for training a lineage classification model for classifying the lineages or sub-lineages of a strain of a pathogen, as per an example of the present subject matter.
  • the lineage classification model is trained for classifying a given strain or a target strain into any one of the plurality of lineages and/or sub-lineages of the base strain (also known as a wild strain of a given pathogen) to which the target strain may correspond.
  • the target strain may be obtained from any one of the environmental samples, such as a sample derived from an individual who may be known to be infected with the given pathogen.
  • the classification of the given strain into any one of the lineages or sub-lineages is based on the genetic variations in the genetic structure of the target strain.
  • the approaches as described herein may classify the target strain into either a new lineage or a new sublineage of an existing lineage based on its genetic variations.
  • the training system 102 (referred to as the system 102) may be in communication with a predefined repository 104 through a network 106.
  • the network 106 may be a private network or a public network and may be implemented as a wired network, a wireless network, or a combination of a wired and wireless network.
  • the network 106 may also include a collection of individual networks, interconnected with each other and functioning as a single large network, such as the Internet.
  • the system 102 may further include processor(s) 108 which may execute one or more computer executable instructions for training the lineage classification model 110.
  • the processor(s) 108 may be implemented as a single computing entity or may be implemented as a combination of multiple computing entities or processing units.
  • the processor(s) 108 may be coupled to a memory (not shown in FIG. 1).
  • the processor(s) 108 are configured to fetch and execute computer-readable instructions stored in the memory for ascertaining lineages and sub-lineages of a target strain of a pathogen.
  • the system 102 may include a sequencing engine 112 and a training engine 120.
  • the sequencing engine 112 and training engine 120 (collectively referred to as engine(s) 112, 120) may be implemented as a combination of hardware and programming, for example, programmable instructions to implement a variety of functionalities. In examples described herein, such combinations of hardware and programming may be implemented in several different ways.
  • the programming for the engine(s) 112, 120 may be instructions executable by the processor(s) 108. Such instructions may be stored on a non-transitory machine-readable storage medium which may be coupled either directly with the system 102 or indirectly (for example, through networked means).
  • the engine(s) 112, 120 may themselves include a processing resource (not shown in FIG. 1), for example, either a single processor or a combination of multiple processors, to execute such instructions.
  • a non-transitory machine-readable storage medium may store instructions, that when executed by the processing resource, implement engine(s) 112, 120.
  • the engine(s) 112, 120 may be implemented as electronic circuitry.
  • the system 102 may further include training genetic data 114, reference genetic data 116, and clustered lineage information 118.
  • the system 102 may obtain training genetic data 114 from the repository 104.
  • the repository 104 may be implemented as any data storage repository which stores information pertaining to the training genetic data 114.
  • the training genetic data 114 may include nucleotide sequence data corresponding to a training base strain of the pathogen.
  • the training base strain may refer to a strain of the pathogen according to which a lineage or a sublineage of a target base strain may be determined by the lineage classification model 110.
  • the reference genetic data 116 may include nucleotide sequence data pertaining to a reference strain of the pathogen.
  • the reference strain may refer to a naturally-occurring strain of the pathogen that is devoid of any genetic variation or mutation.
  • the training engine 120 may initiate the process of classifying the genetic variations in different strains based on the reference genetic data 116, retrieved from the repository 104. Each of the data points thus classified within a cluster may pertain to similar genetic variations (representative of mutations) that may occur in the different strains. Such similar variations may be grouped together.
  • the training engine 120 may begin operation with an input parameter ‘K’ which represents the number of clusters to identify. Thereafter, the operation places ‘K’ centroids at random locations. Then, using the Euclidean distance between the data points and the centroids, the operation may assign each data point to the cluster closest to it, determining the best values of the centroids through an iterative process.
  • the classification may be initiated for a plurality of strains of pathogens, for which the training engine 120 chooses ‘K’ random points or centroids from the clusters which may pertain to the reference genetic data 116.
  • any points not pertaining to a cluster will be assigned to their closest K-point or centroid by the application of mathematical tools and techniques.
  • the training engine 120 will further classify the data points from the genetic data 116 into different branches and sub-branches pertaining to lineages and sub-lineages of a strain for a given pathogen. It may be noted that the above-described methodology is only indicative and is one of many examples that may be implemented for the purposes of classifying the reference genetic data 116.
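  • The centroid-based procedure described above corresponds to the well-known k-means algorithm. The following is a minimal sketch of it, not the applicant's implementation. It assumes that each strain has already been converted to a numeric feature vector, and it initialises the centroids from randomly chosen data points; the function name, the seed and the sample vectors are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster numeric feature vectors into k groups by Euclidean distance."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)

    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]

    return centroids, assignment


# Hypothetical variation feature vectors for six reference strains.
points = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (9, 9, 0), (9, 8, 0), (8, 9, 0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

  • In practice a library implementation with multiple restarts would usually be preferred, since a single random initialisation can converge to a poor local optimum.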
  • the manner of classifying the data samples into lineages or sub-lineages depends on a threshold parameter defined by the operator in the system 102. For example, depending on whether the genetic variation is greater or less than a certain pre-defined threshold, the data sample may be grouped into clusters. For example, upon forming the clusters, if an attribute pertaining to the genetic variation in a data sample depicts a value greater than a pre-defined threshold, it may be classified as a lineage. Examples of such attributes include, but are not limited to, variation in protein coding genes which may be essential for survival of the pathogen, variation in specific and highly conserved amino acids, for example, tryptophan, number of such mutations accumulated over time indicating an event of divergence, or combinations thereof.
  • any one or more of these attributes provide for a multi-dimensional assessment taking into account time, genetic variation, and the criticality assigned to the genetic variations under consideration. It may be noted that the above examples are only indicative and should not be construed as a limitation. Reliance on other attributes is also possible without deviating from the scope of the present subject matter.
  • if the genetic variation in a data sample depicts a value less than a pre-defined threshold, it may be classified as a sub-lineage.
  • the pre-defined threshold may refer to the extent of relatedness of the genetic variation depicted by the data points as compared to the reference genetic data 116 used for training the lineage classification model 110 and stored as the clustered lineage information 118. This is based on the fact that a lineage is a group of closely related viruses with a common ancestor, whereas a sub-lineage defines a lineage as it relates to being a direct descendant of a parent lineage.
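  • The threshold rule above can be sketched as follows. The disclosure does not specify how the attributes are combined, so the weighted sum, the weights, the attribute names and the threshold value used here are all assumptions for illustration. The treatment of a score exactly equal to the threshold (labelled a sub-lineage here) is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class VariationAttributes:
    """Hypothetical per-sample attributes of the kind listed above."""
    essential_gene_variants: int      # variants in essential protein coding genes
    conserved_residue_variants: int   # variants at highly conserved amino acids
    accumulated_mutations: int        # mutations accumulated over time


def divergence_score(attrs, weights=(3.0, 2.0, 1.0)):
    """Combine the attributes into one score (assumed weighted sum)."""
    w_essential, w_conserved, w_count = weights
    return (
        w_essential * attrs.essential_gene_variants
        + w_conserved * attrs.conserved_residue_variants
        + w_count * attrs.accumulated_mutations
    )


def label(attrs, threshold):
    """Above the operator-defined threshold: lineage; otherwise: sub-lineage."""
    return "lineage" if divergence_score(attrs) > threshold else "sub-lineage"


divergent = VariationAttributes(1, 1, 4)
close = VariationAttributes(0, 1, 2)
print(label(divergent, threshold=8.0), label(close, threshold=8.0))
```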
  • a reiterative process classifies the data sample into lineages or sub-lineages based upon the genetic variation in the data samples collected from a population.
  • the training engine 120 may assess the threshold value of the data sample and accordingly classify the data sample into lineages or sub-lineages. It is to be noted that each cluster has data samples pertaining to some commonalities, which may be based on similar genetic variations in a pathogen.
  • information pertaining to the genetic variation once classified may be stored as clustered lineage information 118 comprising the clustered information within the system 102.
  • the clustered lineage information 118 may include parameters or attributes that depict the different clustering criteria or classifiers based on which the reference genetic data 116 of different strains may have been clustered, with each cluster representing a lineage or a sub-lineage.
  • the clusters may correspond to a lineage or a sub-lineage for the given pathogen.
  • the target strain may be processed to determine its genetic information (which in the present example, is referred to as the target genetic information).
  • the target genetic information thus obtained may be analyzed and processed by the training engine 120.
  • the mutations present within the target strain may be ascertained.
  • the genetic variations in the target strain may be determined by comparing the genetic information of the target strain with the corresponding genetic information of a base strain of the pathogen under consideration. Other approaches for determining the genetic variations in the target strain may be adopted without limiting the scope of the current subject matter.
  • the sequencing engine 112 may compare the training genetic data 114 and the reference genetic data 116 to determine occurrence of a genetic variation in the training base strain of the pathogen. To this end, the sequencing engine 112 may parse the nucleotide sequence of the training base strain to determine start and stop sequences. In certain instances, when the start and/or stop sequences shift relative to each other, the sequencing engine 112 may check and fix any alignment issues that may have occurred between the training base strain and the reference strain. Proceeding further, the sequencing engine 112 may compare the nucleotide sequence of the training base strain and the nucleotide sequence of the reference strain to determine a genetic variation in the training base strain.
  • a genetic variation in the nucleotide sequence of the training base strain may manifest as a different combination of amino acids, for a given sequence location when compared with the same corresponding location in the nucleotide sequence of the reference strain.
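  • A comparison of this kind can be sketched as below: both coding sequences are translated with the standard genetic code and any position where the amino acids differ is reported in the usual reference-residue, position, sample-residue notation. The sketch assumes the two sequences are already aligned, in the same reading frame and of equal length, so it handles substitutions only. The function names and the short example sequences are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the three codon positions.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(sequence):
    """Translate a coding sequence codon by codon; unknown codons become 'X'."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def protein_variants(reference, sample):
    """List amino-acid substitutions of the sample relative to the reference."""
    ref_aa, smp_aa = translate(reference), translate(sample)
    return [
        f"{r}{position}{s}"
        for position, (r, s) in enumerate(zip(ref_aa, smp_aa), start=1)
        if r != s
    ]


# Hypothetical three-codon example: GCT (Ala) becomes TCT (Ser) at codon 2.
variants = protein_variants("ATGGCTTGG", "ATGTCTTGG")
print(variants)
```

  • Insertions and deletions shift the reading frame and would need a proper alignment step before this comparison is meaningful.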
  • the genetic variations may be utilized as the training genetic data 114 for training the lineage classification model 110, as per the examples of the present subject matter.
  • the pathogen may be M. tuberculosis.
  • an example of the genetic variations for a plurality of training base strains of the M. tuberculosis pathogen is provided in Table A below:
  • Table A depicts a gene, along with a start and a stop sequence of a nucleotide sequence of a training base strain A. Furthermore, as per the first row of Table A, the training base strain A may be resistant to a drug, for example, Ciprofloxacin, due to a genetic variation denoted A74S, wherein the mutation is depicted in the last column. In a similar manner, other genetic variations for other reference strains or the same training base strains (identified by the gene column in Table A) may be recorded to provide the clustered lineage information 118. It may be noted that Table A is just one of many possible examples. Other examples would also be within the scope of the present subject matter. Although the present example has been explained with respect to M. tuberculosis, the present approaches may be implemented without limitation for any other pathogen, such as a virus, other bacteria or fungi.
  • the training engine 120 may process the genetic variations of the training base strain based on the clustered lineage information 118 to determine into which of the clusters the training base strain may be classified. In an example, the training engine 120 may also determine that the reference strain would not be classifiable under any one of the clusters indicated in the clustered lineage information 118. In such a case, the training engine 120 may classify the reference strain as a sub-lineage of an existing lineage or into a new lineage altogether.
  • the training engine 120 may adopt the centroid approach as described above to ascertain whether the target strain is classifiable under one of the clusters as indicated in the clustered lineage information 118, or is to be classified into a new sub-lineage or a new lineage.
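  • One way to picture this decision is a nearest-centroid check with a distance cut-off: a target that lies close enough to an existing centroid receives that cluster's label, and one that lies too far from every centroid is flagged as a candidate new lineage or sub-lineage. This is a sketch under assumptions. The cut-off `max_distance`, the centroid coordinates and the cluster names are hypothetical and do not come from the disclosure.

```python
import math


def classify_target(target, centroids, max_distance):
    """Return (label, distance); label is None when no cluster is close enough."""
    nearest = min(centroids, key=lambda name: math.dist(target, centroids[name]))
    distance = math.dist(target, centroids[nearest])
    if distance <= max_distance:
        return nearest, distance
    return None, distance  # candidate new lineage or sub-lineage


# Hypothetical centroids taken from previously clustered reference data.
centroids = {"lineage_1": (0.0, 0.0), "lineage_2": (10.0, 10.0)}

known = classify_target((1.0, 1.0), centroids, max_distance=3.0)
novel = classify_target((5.0, 5.0), centroids, max_distance=3.0)
print(known, novel)
```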
  • the sequencing engine 112 may perform additional assessments, for example checking quality of the training genetic data 114, determining whether the training genetic data 114 corresponds to the entire nucleotide sequence of the training base strain, or whether the training base strain that is under consideration, is in fact that of the pathogen under consideration. Once the additional assessments are completed, the sequencing engine 112 may proceed and compare the training genetic data 114 and the reference genetic data 116 to determine and locate the genetic variations.
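  • Two of these assessments, sequence quality and completeness, could be approximated as in the sketch below. It measures the fraction of ambiguous bases and the length of the sequence relative to the expected genome length. The cut-off values and the function name are assumptions, and the check that the sample belongs to the pathogen under consideration is not covered here.

```python
def quality_report(sequence, expected_length, min_coverage=0.95, max_ambiguous=0.01):
    """Summarise completeness and base-call quality of one nucleotide sequence."""
    seq = sequence.upper()
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    coverage = len(seq) / expected_length
    ambiguous_fraction = ambiguous / len(seq) if seq else 1.0
    return {
        "coverage": coverage,
        "ambiguous_fraction": ambiguous_fraction,
        # Both hypothetical cut-offs must hold for the sequence to pass.
        "passes": coverage >= min_coverage and ambiguous_fraction <= max_ambiguous,
    }


good = quality_report("ACGTACGTAC", expected_length=10)
noisy = quality_report("ACGTNNNNAC", expected_length=10)
partial = quality_report("ACGT", expected_length=10)
print(good["passes"], noisy["passes"], partial["passes"])
```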
  • the sequencing engine 112 may determine the locations within the nucleotide sequence of the training base strain at which the combination of amino acids differs from the combination of amino acids at the corresponding location of the nucleotide sequence of the reference strain. Once a deviation is detected, it is flagged as a genetic variation and the corresponding information is recorded.
  • the genetic variation may be a mutation that may be present within the training base strain.
  • the trained lineage classification model 110 may thereafter be used for determining a lineage and/or a sub-lineage of a target strain of the pathogen.
  • FIG. 2 depicts a classification system 202.
  • the classification system 202 may be in communication with a clinical environment 204 over a communication network 206.
  • the communication network 206 is similar to the network 106.
  • the clinical environment 204 may be an environment which may be testing one or more test samples acquired from one or more patients.
  • the clinical environment 204 may include any facility or institution implementing mechanisms for retrieving and storing target genetic data 212 of a target strain in computer-accessible storage devices.
  • Such storage devices may either be within the premises of the clinical environment 204 or may be remotely accessible by one or more computing devices within the clinical environment 204.
  • the classification system 202 may also be implemented within the clinical environment 204 without deviating from the scope of the present subject matter.
  • the classification system 202 may further include processor(s) 208, a lineage classification engine 210, a lineage classification model 110 (based on the lineage classification model 110 of FIG. 1), genetic data of target strain 212, reference strain data 214, genetic variation data 216, and lineage formation data 218.
  • the lineage classification engine 210 is configured for classifying strains of pathogens into one or more categories representing either a lineage or a sub-lineage.
  • genetic data of the target strain (referred to as target genetic data 212) is obtained.
  • the target genetic data 212 may be based on patient samples that may be collected by the clinical environment 204.
  • the target strain may be processed to determine a nucleotide sequence of the target strain under consideration.
  • the nucleotide sequence information may then be determined and transmitted to the classification system 202.
  • the nucleotide sequence information may then be stored in the classification system 202 as the target genetic data 212.
  • the target genetic data 212 may then be processed by the lineage classification engine 210 based on the lineage classification model 110 and the clustered lineage information 118.
  • the lineage classification engine 210 may initially determine a genetic variation present within the nucleotide sequence of the target strain. To this end, the lineage classification engine 210 may compare the nucleotide sequence of the target strain (available as target genetic data 212) of the pathogen with a nucleotide sequence of a reference strain of the pathogen (available as reference strain data 214).
  • the lineage classification engine 210 may parse the nucleotide sequence of the target strain and the nucleotide sequence of the reference strain to determine their corresponding start and stop sequences. Thereafter, the lineage classification engine 210 may compare the target genetic data 212 and the reference strain data 214 to identify a genetic variation, such as a genetic marker or a genetic mutation, that may be present in the target strain. The determined genetic variation may be stored as genetic variation data 216. It may be noted that, similar to the clustered lineage information 118 which indicates the genetic variations between the base strain and the reference strain, the genetic variation data 216 indicates the genetic variations between the target strain and the base strain.
  • the lineage classification engine 210 may further process the genetic variation data 216.
  • the lineage classification model 110 may be previously trained based on the mutations that may be present in the target strain.
  • the training system may classify the target strain into one of the plurality of clusters, with each cluster representing a lineage.
  • the target strain may be classified into a sub-lineage depending on the extent of variation in the target strain vis-a-vis the genetic structure corresponding to the related lineage.
  • FIG. 3 provides a graphical illustration, by means of a dendrogram, depicting the emergence of lineages or sub-lineages of a pathogen, which in one instance may represent the clustered lineage information 118.
  • the horizontal line 304 represents each node in a cluster branch containing a group of similar data points.
  • Clusters at one level join with clusters in the next level up, using a degree of similarity. The process carries on until all nodes are in the branch, which gives a visual snapshot of the data contained in the whole set. As depicted in FIG. 3, each iteration merges the clusters with the most similar genetic mutations and classifies them into different lineages and sub-lineages. In this manner, the strain of a pathogen may be categorized into lineages or sub-lineages using unsupervised learning initiated on the reference genetic data 116 as shown in FIG. 1.
  • the point 302 represents the emergence of a new lineage or a sublineage pertaining to a pathogen, according to the example of the present subject matter. It may be noted that the graphical depiction 300 is only depicting one example and should not be construed to be a limitation on the claimed subject matter.
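  • The bottom-up merging that a dendrogram records can be sketched as a simple agglomerative clustering. The disclosure does not name a distance measure or linkage rule, so the Hamming distance between binary variation profiles and single linkage (distance between the closest members) used here are assumptions, as are the strain names and profiles.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two variation profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(profiles):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = {name: [name] for name in profiles}  # cluster label -> members
    merges = []

    def linkage(pair):
        # Single linkage: distance between the closest members of two clusters.
        left, right = pair
        return min(
            hamming(profiles[a], profiles[b])
            for a in clusters[left]
            for b in clusters[right]
        )

    while len(clusters) > 1:
        left, right = min(combinations(sorted(clusters), 2), key=linkage)
        merges.append((left, right, linkage((left, right))))
        clusters[f"({left},{right})"] = clusters.pop(left) + clusters.pop(right)
    return merges


# Hypothetical binary variation profiles for four strains.
profiles = {
    "A": (1, 1, 0, 0),
    "B": (1, 1, 1, 0),
    "C": (0, 0, 0, 1),
    "D": (0, 0, 1, 1),
}
merges = agglomerate(profiles)
for step in merges:
    print(step)
```

  • Each tuple in the merge history corresponds to one horizontal join in a dendrogram, with the third element giving the height at which the join occurs.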
  • FIG. 4 illustrates a method 400 to be implemented for determining lineages or sub-lineages of a pathogen, as per an example of the present subject matter.
  • although the method 400 may be implemented in a variety of computing devices, for ease of explanation, the present description of the example method 400 is provided in reference to the above-described system 102.
  • the order in which the various method blocks of method 400 are described, is not intended to be construed as a limitation, and any number of the described method blocks may be combined in any order to implement the method 400, or an alternative method.
  • method 400 pertains to initially training a lineage classification model, such as the model 110, and then subsequently classifying a target strain into lineages or sub-lineages of a pathogen. However, such steps may be performed separately at different instances without limiting the scope of the present subject matter in any manner.
  • the above-mentioned methods may be implemented in suitable hardware, computer-readable instructions, or a combination thereof.
  • the steps of such methods may be performed by either a system under the instruction of machine executable instructions stored on a non-transitory computer readable medium or by dedicated hardware circuits, microcontrollers, or logic circuits.
  • some examples are also intended to cover non-transitory computer readable medium, for example, digital data storage media, which are computer readable and encode computer-executable instructions, where said instructions perform some or all the steps of the above-mentioned methods.
  • training genetic data of a training base strain of a pathogen may be obtained.
  • the system 102 may obtain the training genetic data 114 from the repository 104 (as shown in FIG. 1).
  • the training genetic data 114 includes nucleotide sequence data corresponding to the training base strain of the pathogen.
  • genetic data pertaining to a reference strain is obtained.
  • the system 102 may obtain reference genetic data 116, wherein the reference genetic data 116 includes nucleotide sequence data pertaining to a reference strain of the pathogen.
  • the reference strain may be a strain of the pathogen as it occurs in nature and may be devoid of any genetic variations or mutations.
  • the training genetic data and the reference genetic data are compared.
  • a genetic variation may be determined based on the comparing.
  • the sequencing engine 112 may compare the training genetic data 114 and the reference genetic data 116 to determine occurrence of the genetic variation in the training base strain.
  • the genetic variation may manifest in the training genetic data 114 as a different combination of amino acids when compared with the reference genetic data 116, at a given location.
  • the genetic variation may be stored as clustered lineage information 118.
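The comparison of the training genetic data with the reference genetic data can be sketched as a position-by-position scan. This is a simplified illustration that assumes the two nucleotide sequences are already aligned and of equal length; the function name and the example sequences are hypothetical, and insertions or deletions are not handled.

```python
def find_variations(reference, sample):
    """List substitutions between two pre-aligned nucleotide sequences.

    Returns (position, reference_base, sample_base) tuples with 1-based
    positions. Equal length is an assumption of this sketch.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]

# Hypothetical sequences, for illustration only.
reference_sequence = "ATGGCCATTGTA"
training_sequence = "ATGGCTATTGCA"
variations = find_variations(reference_sequence, training_sequence)
print(variations)
```

The resulting list of substitutions is the kind of record that could then be stored alongside the clustered lineage information.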
  • the genetic variation is associated with one or more lineages or sub-lineages of the pathogen.
  • the sequencing engine 112 may associate the clustered lineage information 118 with the genetic variations to classify the genetic variations into lineages or sub-lineages of the pathogen.
  • the manner of classifying the data samples into lineages or sub-lineages depends on a threshold parameter defined by the operator in the system 102. For example, depending on whether the genetic variation is greater or less than a certain pre-defined threshold, the data sample may be assigned to a cluster. Upon forming the clusters, if an attribute pertaining to the genetic variation in a data sample depicts a value greater than a pre-defined threshold, it may be classified as a lineage.
  • Such attributes include, but are not limited to, variation in protein coding genes which may be essential for survival of the pathogen, variation in specific and highly conserved amino acids, for example, tryptophan, number of such mutations accumulated over time indicating an event of divergence, or combinations thereof. Further, if the genetic variation in a data sample depicts a value less than a pre-defined threshold, it may be classified as a sub-lineage.
  • the pre-defined threshold may refer to the extent of relatedness of the genetic variation depicted by the data points as compared to the reference genetic data 116 used for training the lineage classification model 110 and stored as the clustered lineage information 118. This follows from the fact that a lineage is a group of closely related viruses with a common ancestor, whereas a sub-lineage denotes a lineage that is a direct descendant of a parent lineage. In this manner, an iterative process classifies the data samples into lineages or sub-lineages based upon the genetic variation in the data samples collected from a population.
  • the training engine 120 may assess the threshold value of the data sample and accordingly classify the data sample into lineages or sub-lineages. It is to be noted that each cluster has data samples pertaining to some commonalities, which may be based on similar genetic variations in a pathogen.
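The threshold rule described above can be sketched as a simple comparison. The attribute used here (the fraction of positions that differ from the reference), the threshold value, and the handling of a score exactly equal to the threshold are all assumptions made for illustration; the description leaves these choices to the operator.

```python
def variation_score(n_variations, sequence_length):
    """Fraction of positions that differ from the reference (assumed attribute)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return n_variations / sequence_length

def classify_by_threshold(score, threshold):
    """Above the threshold: lineage; otherwise: sub-lineage.

    Treating score == threshold as a sub-lineage is an assumption of this
    sketch, since the description does not specify the boundary case.
    """
    return "lineage" if score > threshold else "sub-lineage"

THRESHOLD = 0.05  # hypothetical operator-defined value

major = classify_by_threshold(variation_score(12, 100), THRESHOLD)  # score 0.12
minor = classify_by_threshold(variation_score(2, 100), THRESHOLD)   # score 0.02
print(major, minor)
```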
  • a lineage classification model is trained for determining a lineage and/or a sub-lineage of a target strain of the pathogen.
  • For example, the training engine 120 may train the lineage classification model 110 based on the clustered lineage information 118.
  • the lineage classification model 110 may be based on a number of classification or regression-based learning techniques, such as random forest techniques.
  • the lineage classification model 110 may be in the form of a classifier. Once the lineage classification model 110 is trained, it may be utilized for classifying the data points from the genetic data 116 into different branches and subbranches pertaining to lineages and sub-lineages of a strain for a given pathogen.
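To make the idea of a random-forest-style classifier concrete, the following sketch trains an ensemble of bootstrapped one-level decision trees (stumps) over binary mutation features and predicts by majority vote. It is a heavily simplified stand-in for a random forest, not the model 110 itself: the feature encoding, the labels, the number of trees and every function name are assumptions for illustration.

```python
import random
from collections import Counter

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(X, y, features):
    """Pick the candidate feature whose presence/absence split fits (X, y) best."""
    overall = majority(y, None)
    best = None
    for f in features:
        if_present = majority([lab for row, lab in zip(X, y) if row[f]], overall)
        if_absent = majority([lab for row, lab in zip(X, y) if not row[f]], overall)
        correct = sum((if_present if row[f] else if_absent) == lab for row, lab in zip(X, y))
        if best is None or correct > best[0]:
            best = (correct, f, if_present, if_absent)
    return best[1:]  # (feature index, label if present, label if absent)

def train_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each with a random feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    subset_size = max(1, round(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        features = rng.sample(range(n_features), subset_size)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Majority vote over the stumps."""
    votes = [if_present if row[f] else if_absent for f, if_present, if_absent in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical data: each row marks presence (1) or absence (0) of four mutations.
X = [(1, 1, 0, 0), (1, 0, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1), (0, 0, 1, 0), (0, 1, 1, 1)]
y = ["L1", "L1", "L1", "L2", "L2", "L2"]
forest = train_forest(X, y, n_trees=25, seed=1)
print(predict(forest, (1, 1, 0, 0)))
```

In practice, an established library implementation with full-depth trees would normally be preferred over a hand-written ensemble such as this one.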
  • the target genetic data of the target strain may be processed to determine a genetic variation in the target strain.
  • a lineage classification engine 210 may determine a genetic variation within the nucleotide sequence of the target strain.
  • the lineage classification engine 210 may compare the nucleotide sequence of the target strain (available as target genetic data 212) with a nucleotide sequence of a reference strain of the pathogen (available as reference strain data 214). Based on the comparing, the lineage classification engine 210 may identify the genetic variation that may be present in the target strain.
  • the genetic variation once determined may be stored as genetic variation data 216.
  • reference genetic data may be classified.
  • the training engine 120 of the system 102 may commence classifying the data points, i.e., the genetic variations of different strains, as indicated in the reference genetic data 116.
  • Each of the data points thus classified within a cluster may pertain to similar genetic variations (representative of mutations) that may occur in the different strains.
  • Such similar variations may be grouped together.
  • the training engine 120 may begin assessment with an input parameter k, which represents the number of clusters to identify. Thereafter, the operation moves on to placing k centroids in random locations to create clusters into which any one or more instances of the reference genetic data 116 may be grouped.
  • each cluster has data samples pertaining to some commonalities, which may be based on similar genetic variations in a pathogen.
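The clustering step described above, starting from k centroids and grouping similar genetic variations, resembles k-means clustering. The sketch below is one possible reading under stated assumptions: each strain is encoded as a binary vector marking which mutations it carries, the initial centroids are drawn at random from the data points themselves, and squared Euclidean distance is used. None of these choices is specified in the description.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids).

    Initial centroids are k distinct data points picked at random, which is
    an assumption standing in for 'random locations'.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical mutation-presence vectors for four strains.
points = [
    (1, 1, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 0, 0, 1, 1),
]
labels, centroids = k_means(points, k=2, seed=0)
print(labels)
```

Strains that share mutations end up with the same label, and each label can then be interpreted as a candidate lineage or sub-lineage cluster.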
  • information pertaining to the genetic variation once classified may be stored as clustered lineage information 118 within the system 102.
  • the clustered lineage information 118 may depict the different clusters within which the reference genetic data 116 of different strains may have been clustered.
  • the clusters may correspond to a lineage or a sub-lineage for the given pathogen.
  • the information pertaining to the formation of lineages or sub-lineages may be stored for further consideration.
  • the information pertaining to the formation of lineages or sub-lineages of the pathogen may be stored as lineage formation data 218.
  • the lineage formation data 218 may then be communicated to a medical practitioner for prescribing either a single drug or a combination of drugs as a treatment against the pathogen in the patient from whom the test sample (or the target strain) was collected.
  • the lineage classification engine 210 may further generate a report which provides an indication of the drugs to which the target strain under consideration may be resistant or susceptible, to aid in treatment against the pathogen.
  • the report may be stored within a database or may be shared electronically with the patient or any other medical practitioner for further consideration.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Analytical Chemistry (AREA)
  • Physiology (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Animal Behavior & Ethology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present subject matter relates to the classification of a given strain or a target strain into any one of the plurality of lineages and/or sub-lineages of the base strain (also known as the wild-type strain of a given pathogen) to which the target strain may correspond. The target strain may be obtained from any environmental sample, such as a sample derived from an individual who may be known to be infected with the given pathogen. In one example, the classification of the given strain into any one of the lineages or sub-lineages is based on the genetic variations in the genetic structure of the target strain.
PCT/IN2023/051107 2022-11-29 2023-11-28 Classification of lineages and sub-lineages of pathogens WO2024116201A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202241068794 2022-11-29
IN202241068794 2022-11-29

Publications (1)

Publication Number Publication Date
WO2024116201A1 true WO2024116201A1 (fr) 2024-06-06

Family

ID=91323227

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2023/051107 WO2024116201A1 (fr) 2022-11-29 2023-11-28 Classification of lineages and sub-lineages of pathogens

Country Status (1)

Country Link
WO (1) WO2024116201A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021152603A1 (fr) * 2020-02-02 2021-08-05 Technion Research & Development Foundation Limited System and method for classification of strain echocardiograms
WO2022137251A1 (fr) * 2020-12-21 2022-06-30 Aarogyaai Innovations Pvt. Ltd. Drug resistance of target strains of a pathogen
WO2022260466A1 (fr) * 2021-06-11 2022-12-15 한국생명공학연구원 Method and system for selecting individual-specific and lineage-specific classification variations and markers using artificial intelligence
IN202241011598A (fr) * 2022-03-03 2023-09-08

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
O’TOOLE ÁINE, SCHER EMILY, UNDERWOOD ANTHONY, JACKSON BEN, HILL VERITY, MCCRONE JOHN T, COLQUHOUN RACHEL, RUIS CHRIS, ABU-DAHAB KH: "Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool", VIRUS EVOLUTION, OXFORD UNIVERSITY PRESS, vol. 7, no. 2, 16 December 2021 (2021-12-16), XP093179349, ISSN: 2057-1577, DOI: 10.1093/ve/veab064 *

Similar Documents

Publication Publication Date Title
CN111009286A (zh) 对宿主样本进行微生物分析的方法和装置
CN113934895A (zh) 一种辅助建立患者主索引的方法
Gorbalenya et al. Phylogeny of viruses
WO2011145955A1 (fr) Procédé et système de corrélation de séquences
CN111081315A (zh) 一种同源假基因变异检测的方法
CN111599413B (zh) 一种测序数据的分类单元组分计算方法
CN111785328A (zh) 基于门控循环单元神经网络的冠状病毒序列识别方法
CN112199670A (zh) 一种基于深度学习改进iforest对行为异常检测的日志监控方法
WO2022137251A1 (fr) Pharmacorésistance de souches cibles d'un agent pathogène
CN105631464B (zh) 对染色体序列和质粒序列进行分类的方法及装置
CN110610741B (zh) 一种人类病原体的识别方法、装置及电子设备
CN113221960A (zh) 一种高质量漏洞数据收集模型的构建方法及收集方法
Wahid et al. Pneumonia Detection in Chest X‐Ray Images Using Enhanced Restricted Boltzmann Machine
WO2024116201A1 (fr) Classification de lignées et de sous-lignées d'agents pathogènes
JP6356015B2 (ja) 遺伝子発現情報解析装置、遺伝子発現情報解析方法、及びプログラム
CN115729825A (zh) 一种工业协议的模糊测试用例生成方法、装置和电子设备
CN113593698B (zh) 一种基于图注意网络的中医证型识别方法
CN114496196A (zh) 医疗实验室临床生化检验自动审核***
CN114566221A (zh) 遗传病ngs数据自动化分析解读***
Gauthier et al. Hybrids and phylogenetics revisited: a statistical test of hybridization using quartets
CN116469473B (zh) T细胞亚型鉴定的模型训练方法、装置、设备及存储介质
da Silva et al. Silhouette-based feature selection for classification of medical images
CN115910216B (zh) 一种基于机器学习识别基因组序列分类错误的方法和***
CN116797826B (zh) 数据识别方法、模型训练方法、设备及介质
CN117789823B (zh) 病原体基因组协同演化突变簇的识别方法、装置、存储介质及设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23897070

Country of ref document: EP

Kind code of ref document: A1