WO2023113013A1 - Method for selecting gene for use in estimation of possibility of onset of disease, and method for estimating possibility of onset of disease - Google Patents

Method for selecting gene for use in estimation of possibility of onset of disease, and method for estimating possibility of onset of disease

Info

Publication number
WO2023113013A1
Authority
WO
WIPO (PCT)
Prior art keywords
disease
genes
disease type
stem cells
subject
Prior art date
Application number
PCT/JP2022/046394
Other languages
French (fr)
Japanese (ja)
Inventor
航 藤渕
Original Assignee
国立大学法人京都大学
Priority date
Filing date
Publication date
Application filed by 国立大学法人京都大学
Publication of WO2023113013A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 Nucleic acid amplification reactions
    • C12Q1/6851 Quantitative amplification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders

Definitions

  • the present disclosure relates to a method for selecting genes suitable for predicting a specific disease type, a selection program, and a computer-readable recording medium recording the selection program. Additionally, the present disclosure relates to methods of determining the likelihood of developing a disease belonging to a particular disease type.
  • Non-Patent Document 1 reports that deep learning is used to predict the effects of mutations on tissue-specific expression and disease risk from DNA sequence information.
  • In Non-Patent Document 2, the present inventors used a gene network to analyze the known effects (neurotoxicity, genetic carcinogenicity, and non-genetic carcinogenicity) of 20 substances to which embryonic stem (ES) cells were externally exposed. They also reported that a support vector machine (SVM) gave highly accurate predictions for chemical substances whose effects were unknown.
  • In that work, the disease risk is predicted by externally perturbing the gene expression of ES cells, which are pluripotent stem cells that can develop into a human fetus.
  • The purpose of the present disclosure is to provide a method for selecting genes suitable for high-performance prediction of specific disease types. It is also an object of the present disclosure to provide a high-performance method for determining the likelihood of developing a disease belonging to a particular disease type.
  • The present inventors have found that disease types can be predicted by statistical methods and machine learning based on gene expression information of stem cells produced from individuals. In particular, they found that disease types with genetic disease factors can be predicted by ranking genes with a t-test based on the gene expression information of iPS cells generated from individuals and then performing machine learning with a support vector machine. As shown in the Examples, using data from disease-derived iPS cells for five disease types, the method achieved an accuracy of 95.7% (AUC 1.00) for the brain and an accuracy of 82.6% (AUC 1.00) for skeletal muscle. For the brain, skeletal muscle, skin, and metabolic system, the number of characteristic genes that gave the highest prediction rate was 17, 1, 58, and 51, respectively, and these genes are considered to be important for disease type prediction.
  • The present disclosure has been completed through further studies based on these findings. It provides a recording medium, a method for determining the likelihood of developing a disease belonging to a specific disease type, and the like.
  • Item 1. A method of selecting genes suitable for predicting a particular disease type, comprising the following step: (1) applying statistical methods and machine learning to gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type, and selecting genes suitable for predicting the disease type.
  • Item 2. The method according to Item 1, wherein step (1) comprises: (1a) determining a ranking of characteristic genes by applying a statistical method to gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type; and (1b) selecting genes suitable for predicting the disease type by applying machine learning to one or more characteristic genes taken from the top of the ranking.
  • Item 5. The method according to Item 1, further comprising step (0): measuring gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type.
  • Item 6. A program for selecting genes suitable for predicting a particular disease type, the program causing a computer to perform the following step: (1) applying statistical methods and machine learning to gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type, and selecting genes suitable for predicting the likelihood of developing said disease type.
  • Item 7. A computer-readable recording medium recording the selection program according to Item 6.
  • Item 8. A method of determining the likelihood of developing a disease belonging to a particular disease type, comprising the following step: (A) determining the possibility of developing a disease belonging to the disease type based on the expression level of one or more genes described in any one of FIGS. 1 to 4 in subject-derived stem cells.
  • Item 9. The one or more genes described in FIG. 1 are one or more genes selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, EME2, BRD7, SELENBP1, METTL3, OSER1, and FBXO41; the one or more genes described in FIG. 2 is RP2; the one or more genes described in FIG. [...] are one or more genes selected from the group consisting of [...], MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, ACOT7, RNASEH1-AS1, NUP62, CCDC71, LMNB2, SLC39A3, COG3, SGTA, POLR3E, NCAPH2, ZSWIM4, MPV17L2, AGPAT1, BRF1, CCDC14, TEDC2, LONP1, C4orf3, UPF1, AL031708, and PSMA7; and the one or more genes described in FIG. [...]
  • Item 10. The one or more genes described in FIG. 1 are one or more genes selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, and EME2; the one or more genes described in FIG. 2 is RP2; the one or more genes described in FIG. [...] are one or more genes selected from the group consisting of [...], MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, and ACOT7; and the one or more genes described in FIG. [...]
  • step (A) the gene expression level of stem cells derived from a subject who has not developed a disease belonging to the disease type and the gene expression level of stem cells derived from a subject who has developed a disease belonging to the disease type 11.
  • Item 12. The method according to any one of Items 8 to 11, wherein the disease type is a disease of the brain, skeletal muscle, skin, or metabolic system.
  • Item 13. The method according to any one of Items 8 to 12, further comprising step (A0): measuring the expression level of one or more genes described in any one of FIGS. 1 to 4 in the subject-derived stem cells.
  • Item 14. The method according to any one of Items 1 to 5 and 8 to 13, the program according to Item 6, or the recording medium according to Item 7, wherein the stem cells are pluripotent stem cells.
  • Item 15. The method according to any one of Items 1 to 5 and 8 to 13, the program according to Item 6, or the recording medium according to Item 7, wherein the stem cells are induced pluripotent stem (iPS) cells.
  • According to the method of the present disclosure, it is possible to select genes suitable for high-performance prediction of specific disease types. Furthermore, according to the method of the present disclosure, it is possible to make a high-performance determination of the possibility of developing a disease belonging to a specific disease type based on the expression information of a specific gene.
  • In addition, because stem cells can be used in their undifferentiated state, with no need to differentiate them into organs, the possibility of developing a specific disease type can be predicted at low cost and in a short period of time.
  • FIG. 1 is a diagram showing the genes (Ensembl gene IDs and gene names) included in the 17 genes that gave the highest prediction rate for the brain in the Examples, together with the number of times each gene was used in leave-one-out cross-validation.
  • FIG. 2 is a diagram showing the gene (Ensembl gene ID and gene name) that gave the highest prediction rate for skeletal muscle in the Examples, together with the number of times it was used in leave-one-out cross-validation.
  • FIG. 3 is a diagram showing the genes (Ensembl gene IDs and gene names) included in the 58 genes that gave the highest prediction rate for the skin in the Examples, together with the number of times each gene was used in leave-one-out cross-validation.
  • FIG. 4 is a diagram showing the genes (Ensembl gene IDs and gene names) included in the 51 genes that gave the highest prediction rate for the metabolic system in the Examples, together with the number of times each gene was used in leave-one-out cross-validation.
  • FIG. 5 is a flowchart showing the processing procedure of the gene selection method.
  • FIG. 6 is a block diagram of a selection device that executes the gene selection method.
  • A further figure is a graph showing the AUC for diseases of the brain, skeletal muscle, skin, immune system, and metabolic system in the Examples.
  • the method for selecting a gene suitable for predicting a specific disease type of the present disclosure (hereinafter sometimes referred to as the “selection method of the present disclosure”) is characterized by including the following steps.
  • Disease type in the present disclosure means a general term for diseases that develop in each tissue (eg, immune system disease, skin disease, brain disease, etc.).
  • Here, "tissue" refers to the tissues of animals, including humans, and encompasses organs and organ systems.
  • the tissue is not particularly limited, and examples include musculoskeletal system, human skeleton, joints, ligaments, muscular system, tendon, digestive system, mouth, teeth, tongue, salivary gland, parotid gland, submandibular gland, sublingual gland.
  • The present disclosure is a method of identifying genes that can be used to predict a specific disease type. It predicts the likelihood that a disease will develop in a given tissue regardless of the kind of disease; that is, it predicts the likelihood that any disease will develop in a tissue such as the brain or skin. Therefore, the kind of disease is not particularly limited as long as the disease can develop in the tissue, and any disease can be used.
  • The gene selected in the method of the present disclosure may be a single gene or two or more genes. When two or more genes are selected, each may be used alone for prediction, or they may be used in combination.
  • Stem cells other than iPS cells may also be used.
  • Stem cells here are not particularly limited as long as the effects of the present disclosure can be obtained, and include, for example, pluripotent stem cells, tissue stem cells (somatic stem cells), and the like.
  • Although iPS cells are mainly described below, the description applies equally to other stem cells.
  • Pluripotent stem cells are stem cells that have the ability (pluripotency) to differentiate into any of the three germ layers (endoderm, mesoderm, and ectoderm) and are capable of self-renewal.
  • Examples of pluripotent stem cells include embryonic stem (ES) cells, induced pluripotent stem (iPS) cells, embryonic stem (ntES) cells derived from cloned embryos obtained by nuclear transfer, spermatogonial stem cells (GS cells), embryonic germ cells (EG cells), and pluripotent cells derived from cultured fibroblasts or bone marrow stem cells (Muse cells).
  • Tissue stem cells are stem cells that have the ability to differentiate into various cell types (multipotency), although the tissues into which they can differentiate are limited.
  • Examples of tissue stem cells include mesenchymal stem cells, neural stem cells, hematopoietic stem cells, liver stem cells, pancreatic stem cells, germ stem cells, epithelial stem cells, gastrointestinal epithelial stem cells, dental pulp stem cells, retinal stem cells, epidermal stem cells, hair follicle stem cells, and the like.
  • In step (1), statistical methods and machine learning are applied to the gene expression levels in iPS cells derived from a subject who has not developed a disease belonging to the disease type (hereinafter sometimes referred to as “disease-free iPS cells”) and in iPS cells derived from a subject who has developed a disease belonging to the disease type (hereinafter sometimes referred to as “disease-type iPS cells”), and genes suitable for predicting the disease type are selected.
  • iPS cells can be produced by known methods, for example, by introducing reprogramming factors into arbitrary somatic cells.
  • The reprogramming factors include, for example, genes or gene products such as Oct3/4, Sox2, Sox1, Sox3, Sox15, Sox17, Klf4, Klf2, c-Myc, N-Myc, L-Myc, Nanog, Lin28, Fbx15, ERas, ECAT15-2, Tcl1, beta-catenin, Lin28b, Sall1, Sall4, Esrrb, Nr5a2, Tbx3, and Glis1. These reprogramming factors can be used alone or in combination of two or more.
  • WO2009/101084, WO2009/101407, WO2009/102983, WO2009/114949, WO2009/117439, WO2009/126250, WO2009/126251, WO2009/126655, WO2009/157593, WO2010/009015, WO2010/033906, WO2010/033920, WO2010/042800, WO2010/050626, WO2010/056831, WO2010/068955, WO2010/098419, WO2010/102267, WO2010/111409, WO2010/111422, WO2010/115050, WO2010/124290, WO2010/147395, WO2010/147612, Huangfu D et al.
  • the somatic cells are not particularly limited, and include fetal (pup) somatic cells, neonatal (pup) somatic cells, and mature healthy and diseased somatic cells. Also included are cells and cell lines.
  • Specific examples include (1) tissue stem cells (somatic stem cells) such as neural stem cells, hematopoietic stem cells, mesenchymal stem cells, and dental pulp stem cells, (2) tissue progenitor cells, and (3) differentiated cells such as blood cells (peripheral blood cells, umbilical cord blood cells, etc.), lymphocytes, epithelial cells, endothelial cells, muscle cells, fibroblasts (skin cells, etc.), hair cells, hepatocytes, gastric mucosa cells, enterocytes, splenocytes, pancreatic cells (pancreatic exocrine cells, etc.), brain cells, lung cells, renal cells, and adipocytes.
  • The subject is the organism targeted by the selection method of the present disclosure, that is, the organism from which the above tissue is derived, and includes in particular mammals, including humans.
  • To select genes capable of high-performance prediction, it is desirable that the subject who has not developed a disease belonging to the disease type and the subject who has developed a disease belonging to the disease type be the same kind of organism.
  • The method for classifying the disease that a subject has developed into a disease type is not particularly limited, and information from various known disease databases, literature, and the like (for example, MalaCards (https://www.malacards.org/)) can be used.
  • the correspondence between diseases and disease types is not limited to 1:1, and one disease is classified into two or more disease types if it develops in multiple tissues depending on the type of disease.
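The one-to-many correspondence between diseases and disease types can be sketched as a simple lookup. The disease names and tissue annotations below are hypothetical placeholders, not records from MalaCards or from the disclosure, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical disease -> affected-tissue annotations (placeholders only).
DISEASE_TISSUES = {
    "disease_A": ["brain"],
    "disease_B": ["brain", "skeletal muscle"],  # one disease, two disease types
    "disease_C": ["skin"],
}


def diseases_by_type(disease_tissues):
    """Invert a disease -> tissues mapping into disease type -> diseases."""
    by_type = defaultdict(list)
    for disease, tissues in disease_tissues.items():
        for tissue in tissues:
            by_type[tissue].append(disease)
    return dict(by_type)


def label_cell_line(disease, disease_type, disease_tissues=DISEASE_TISSUES):
    """Return 1 if the cell line's disease belongs to the disease type, else 0.

    A disease-free cell line can be passed as disease=None and is labeled 0.
    """
    return int(disease_type in disease_tissues.get(disease, []))
```

With such labels, the same cell line can serve as a positive example for one disease type and a negative example for another.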
  • Gene expression refers to the process of converting the genetic information encoded by a gene into RNA (e.g., mRNA, rRNA, tRNA, snRNA, ncRNA) through transcription of the gene, or to the process of converting mRNA into protein through translation.
  • measuring the gene expression level includes measuring the RNA expression level.
  • RNA isolated from each iPS cell can be used. Total RNA can be isolated from each iPS cell using methods known in the art, or using a commercially available kit (e.g., RNeasy Mini Kit (Qiagen)) according to the manufacturer's instructions.
  • cDNA synthesized from RNA can also be used.
  • For the measurement of gene expression levels in iPS cells, methods known in the art for measuring gene expression levels can be used. Such methods include, for example, the microarray method, the real-time PCR method, the northern blotting method, the EST method, the SAGE (serial analysis of gene expression) method, and sequencing methods using an NGS (next-generation sequencer) or a nanopore sequencer.
  • the gene expression level may be obtained by measuring the amount of total RNA or by measuring the amount of a part of RNA.
  • the data obtained about the gene expression level may be subjected to gene ID conversion, missing value processing, normalization, logarithmic conversion, etc. as preprocessing used for subsequent analysis.
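As a minimal sketch of such preprocessing, the following handles missing values and applies a logarithmic transformation. The imputation rule (per-gene mean of observed values), the pseudocount, and the function name are assumptions made for illustration; the disclosure does not specify them, and gene ID conversion and normalization are omitted.

```python
import math


def preprocess(expression, pseudocount=1.0):
    """Impute missing values and log2-transform expression data.

    expression: dict mapping gene -> list of raw values, one per cell line,
                with None marking a missing value.
    Returns a dict with the same layout holding log2(value + pseudocount).
    Genes with no observed value at all are dropped.
    """
    cleaned = {}
    for gene, values in expression.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue  # nothing to impute from
        gene_mean = sum(observed) / len(observed)
        filled = [gene_mean if v is None else v for v in values]
        cleaned[gene] = [math.log2(v + pseudocount) for v in filled]
    return cleaned
```

The pseudocount keeps zero counts finite after the logarithm; other choices may suit a given measurement platform better.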
  • gene expression level data in diseased and disease-free iPS cells is used to analyze by statistical methods and machine learning to select genes suitable for disease type prediction.
  • Statistical methods and machine learning are not particularly limited as long as they can select genes suitable for prediction of any disease type, and various known methods can be used.
  • Examples of step (1) include the following steps.
  • In step (1a), the ranking of characteristic genes is determined by applying a statistical method to the gene expression levels in disease-type and disease-free iPS cells.
  • A characteristic gene is a gene with a significant statistic in a test of the degree of difference (for example, based on mean and variance) between its expression levels in disease-type and disease-free iPS cells.
  • The ranking is obtained by comparing the gene expression level in disease-type iPS cells with that in disease-free iPS cells and ordering the genes by the degree of difference (for example, the magnitude of the difference in expression level), e.g., in descending order of magnitude.
  • The gene expression level in the disease-type iPS cells may be greater than that in the disease-free iPS cells, or the gene expression level in the disease-free iPS cells may be greater than that in the disease-type iPS cells.
  • Statistical methods are not particularly limited as long as they are capable of ranking the difference between the gene expression levels of diseased-type iPS cells and disease-free iPS cells.
  • For example, a t-test can be used, and a two-sample t-test can be preferably used.
  • The Wilcoxon test, the chi-square test, and the like can also be used.
  • forward selection, backward selection, and exhaustive search methods for finding combinations of feature quantities can also be used.
  • an embedded method that selects features during learning can also be used.
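The ranking in step (1a) can be sketched as follows. This is an illustration under stated assumptions: it uses the unequal-variance (Welch) form of the two-sample t statistic and ranks genes by its absolute value instead of computing p-values, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample (Welch) t statistic for two lists of expression values.

    Each list needs at least two values so the sample variance is defined.
    """
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # No within-group variation: identical groups give 0, otherwise +/-inf.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def rank_genes(disease_type, disease_free):
    """Return gene names ordered from largest to smallest |t|.

    Both arguments map gene -> list of expression values per cell line.
    """
    scores = {g: abs(welch_t(disease_type[g], disease_free[g]))
              for g in disease_type}
    return sorted(scores, key=scores.get, reverse=True)
```

Taking the absolute value treats higher and lower expression in disease-type cells alike, which matches the statement above that the difference may go in either direction.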
  • step (1b) machine learning is used for one or more of the characteristic genes in the order determined in step (1a) to select genes suitable for disease type prediction.
  • The characteristic genes used for machine learning are taken from the top of the ranking, beginning with the gene ranked first (hereinafter, the characteristic gene ranked N-th from the top is sometimes referred to as the N-th characteristic gene).
  • the number of characteristic genes used for machine learning is, for example, 1 to 300, 1 to 200, 1 to 100, etc.
  • The specific number of characteristic genes may be any integer from 1 to 100 (1, 2, 3, ..., 99, 100), and so on.
  • Machine learning is not particularly limited as long as it can be used to determine the possibility of developing a disease belonging to a disease type based on the expression level of one or more characteristic genes. Examples include support vector machines (SVM), random forest, boosting, bagging, neural networks, and deep learning. Among them, support vector machines can be preferably used. When using support vector machines, any kernel function, such as linear, polynomial, radial basis function, or maximum entropy, can be used.
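For reference, three of the kernel functions named above can be written out directly. These are the textbook definitions; the default parameter values (degree, coef0, gamma) are arbitrary choices for illustration and are not taken from the disclosure.

```python
import math


def dot(x, y):
    """Inner product of two equal-length expression vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """k(x, y) = <x, y>"""
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """k(x, y) = (<x, y> + coef0) ** degree"""
    return (dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Here each vector would hold the expression levels of the selected characteristic genes for one cell line.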
  • In machine learning, the expression levels of the characteristic genes in disease-type iPS cells and in disease-free iPS cells are used as inputs, and the model is trained to output the probability of developing a disease belonging to the disease type.
  • The number of characteristic genes that gives the highest prediction rate can be determined, and the genes included therein can be selected as genes suitable for predicting the disease type.
  • For example, machine learning is performed using various numbers of characteristic genes, test data is used to determine the number of characteristic genes (N) that yields the highest prediction rate, and the characteristic genes ranked 1 to N can be regarded as genes suitable for prediction of the disease type.
  • As the prediction rate, one or more evaluation indicators of machine learning, such as accuracy and AUC (Area Under the Curve), can be used. Misclassification rate, balanced accuracy, F1 value, Matthews correlation coefficient, and the like can also be used.
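Several of the indicators listed above follow directly from the confusion matrix of a binary prediction. The sketch below uses the standard definitions; the function names and the convention of returning 0.0 when a denominator is zero are illustrative choices. AUC is omitted because it requires continuous scores, not hard labels.

```python
import math


def confusion(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels where 1 = disease type."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    """Return common evaluation indicators for a binary classifier."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
        "mcc": mcc,
    }
```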
  • Methods for determining the number of characteristic genes that provide the highest prediction rate include, for example, cross-validation, and cross-validation includes holdout validation, k-fold cross-validation, leave-one-out cross-validation, etc. Any kind of cross-validation can be used.
  • Cross-validation may be applied not only to step (1b), but also to both steps (1a) and (1b).
  • In this case, the characteristic genes ranked in step (1a) also change in each round of cross-validation. For example, when leave-one-out cross-validation is used and the number of characteristic genes that yields the highest prediction rate is N, the leave-one-out cross-validation is performed M times and N characteristic genes are determined in each round.
  • As a result, M × N genes are selected as genes suitable for predicting the likelihood that the disease type will develop. The same gene may appear among the M × N genes, anywhere from 1 to M times. Genes that appear more often are therefore considered more suitable for predicting the likelihood of developing the disease type.
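Counting how often each gene appears among the M × N selections can be sketched with a counter. The input layout (one list of N gene names per cross-validation round) and the function name are assumptions for illustration.

```python
from collections import Counter


def count_gene_usage(genes_per_round):
    """Count in how many cross-validation rounds each gene was selected.

    genes_per_round: list of M lists, each holding the N top-ranked genes
                     of one round.
    Returns (gene, count) pairs sorted from most to least frequently used.
    """
    counts = Counter()
    for genes in genes_per_round:
        counts.update(set(genes))  # a gene counts at most once per round
    return counts.most_common()
```

A gene selected in all M rounds receives a count of M, which corresponds to the "number of times of use" reported alongside each gene in the figures.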
  • FIG. 5 is a flowchart showing the processing procedure of the gene selection method according to one embodiment of the present invention.
  • FIG. 6 is a functional block diagram of the selection device 1 that executes the selection method.
  • the selection device 1 can be configured with a general-purpose computer, and includes processors such as CPU and GPU, main storage devices such as DRAM and SRAM, and auxiliary storage devices such as HDD and SSD as hardware configurations. Further, the selection device 1 may be composed of a plurality of computers.
  • The selection device 1 includes, as functional blocks, an acquisition unit 2, a cell line selection unit 3, a ranking determination unit 4, a gene selection unit 5, a learning unit 6, a prediction rate measurement unit 7, and a gene selection unit 8. These units can be realized in software by the processor of the selection device 1 reading the selection program according to the present embodiment into the main storage device and executing it.
  • The selection program may be downloaded to the selection device 1 via a communication network such as the Internet, or may be installed on the selection device 1 from a computer-readable non-transitory recording medium, such as a CD-R, on which the selection program is recorded.
  • In step S1, the acquisition unit 2 acquires the gene expression level data obtained by measuring the gene expression levels of the disease-type iPS cell lines and the disease-free iPS cell lines. In this embodiment, it is assumed that there are 23 iPS cell lines. Gene expression levels can be measured by the method described above. Measured values and logarithmically normalized data are used as the gene expression level data. As the gene expression levels, those of total RNA and of the RNA of some genes suitable for use in analysis are used.
  • In step S5, the ranking determination unit 4 uses a statistical method to determine the ranking of characteristic genes from the learning cell lines. That is, the ranking determination unit 4 applies a statistical method to the gene expression levels of the disease-type iPS cell lines and the disease-free iPS cell lines, for example ranking the characteristic genes by the probability (p-value) obtained from a two-sample t-test between the two groups.
  • As the disease type, the disease type assigned based on the disease information of each iPS cell line, using database information such as MalaCards, is used.
  • In step S6, N is set to 1, and in step S7, the gene selection unit 5 selects N characteristic genes (initially one) from the top of the ranking determined in step S5.
  • In step S8, the learning unit 6 performs machine learning on the relationship between the expression level of the characteristic genes selected in step S7 and the presence or absence of disease occurrence in the learning cell lines, and creates a learned model.
  • In this embodiment, the learning unit 6 performs machine learning using a support vector machine.
  • In step S9, the prediction rate measurement unit 7 makes a prediction for the test cell line. That is, by inputting the expression level data of the test cell line selected in step S3 into the learned model created in step S8, the presence or absence of disease occurrence in the test cell line is predicted.
  • If N is less than 100 (NO in step S10), N is incremented by 1 in step S11, and steps S7 to S9 are repeated. That is, steps S7 to S9 are repeated 100 times, once for each number of characteristic genes.
  • When N reaches 100 (YES in step S10) and X is less than 23 (NO in step S12), X is incremented by 1 in step S13, and steps S3 to S11 are repeated.
  • In step S15, the gene selection unit 8 selects the genes included in the number of characteristic genes determined in step S14 as genes suitable for disease type prediction.
  • For example, when the number of characteristic genes is determined to be 17 as described above, 17 genes are obtained in each of the 23 rounds of leave-one-out cross-validation, so 17 × 23 genes are selected.
  • Processing for removing duplicate genes is performed as appropriate. Because the same gene may appear 1 to 23 times, a gene that appears more often can be judged to be more important for predicting the disease type.
  • genes suitable for prediction are selected for each type.
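The nested loop of the flowchart (an outer leave-one-out loop over cell lines and an inner loop over the number N of top-ranked genes) can be sketched as follows. This is not the disclosed implementation: to stay dependency-free it swaps the support vector machine for a nearest-centroid classifier, ranks by the absolute Welch t statistic, and breaks ties in favor of the smallest N. All names and the data layout are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, variance


def t_score(a, b):
    """Absolute Welch t statistic between two groups of values."""
    diff = abs(mean(a) - mean(b))
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 0.0 if diff == 0 else math.inf
    return diff / se


def rank_genes(samples, labels, genes):
    """Rank genes by t_score between labeled groups (1 = disease type)."""
    scores = {}
    for g in genes:
        pos = [s[g] for s, y in zip(samples, labels) if y == 1]
        neg = [s[g] for s, y in zip(samples, labels) if y == 0]
        scores[g] = t_score(pos, neg)
    return sorted(genes, key=lambda g: scores[g], reverse=True)


def centroid_predict(train, labels, test, genes):
    """Stand-in classifier (NOT an SVM): label of the nearer class centroid."""
    dist = {}
    for cls in (0, 1):
        members = [s for s, y in zip(train, labels) if y == cls]
        dist[cls] = sum((test[g] - mean(m[g] for m in members)) ** 2
                        for g in genes)
    return 1 if dist[1] < dist[0] else 0


def select_genes(samples, labels, max_genes):
    """Leave-one-out search for the gene count with the highest accuracy.

    samples: list of dicts mapping gene -> expression level (one per line).
    labels:  list of 0/1 labels; each class needs at least three lines so
             that two remain for the variance after one is held out.
    Returns (best_n, accuracy, usage counter over all rounds).
    """
    genes = sorted(samples[0])
    correct = [0] * (max_genes + 1)  # correct[n]: hits using the top n genes
    rankings = []
    for x in range(len(samples)):  # hold out cell line x as the test line
        train = samples[:x] + samples[x + 1:]
        train_y = labels[:x] + labels[x + 1:]
        ranking = rank_genes(train, train_y, genes)  # training lines only
        rankings.append(ranking)
        for n in range(1, max_genes + 1):
            pred = centroid_predict(train, train_y, samples[x], ranking[:n])
            correct[n] += int(pred == labels[x])
    best_n = max(range(1, max_genes + 1), key=lambda n: correct[n])
    usage = Counter(g for r in rankings for g in r[:best_n])
    return best_n, correct[best_n] / len(samples), usage
```

Ranking inside each round, using only the training cell lines, keeps the held-out line from influencing which genes are chosen; this corresponds to applying cross-validation to both steps (1a) and (1b).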
  • The method for determining the possibility of developing a disease belonging to a specific disease type is characterized by including the following step.
  • (A) A step of determining the possibility of developing a disease belonging to the disease type based on the expression level of one or more genes described in any one of FIGS. 1 to 4 in iPS cells derived from the subject.
  • the terms such as disease type and subject in the determination method of the present disclosure are the same as those described in the selection method of the present disclosure.
  • Although iPS cells are mainly described below, the description applies equally to other stem cells.
  • In step (A), the possibility of developing a disease belonging to the disease type is determined based on the expression level of one or more genes described in any one of FIGS. 1 to 4.
  • The expression level of one or more genes described in any one of FIGS. 1 to 4 means the expression level of one or more genes described in FIG. 1, the expression level of one or more genes described in FIG. 2, the expression level of one or more genes described in FIG. 3, or the expression level of one or more genes described in FIG. 4.
  • The genes described in FIGS. 1 to 4 can be used to determine the possibility of developing diseases in the brain (FIG. 1), skeletal muscle (FIG. 2), skin (FIG. 3), and metabolic system (FIG. 4), respectively.
  • Genes described in FIG. 1: MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, EME2, BRD7, SELENBP1, METTL3, OSER1, FBXO41, HEATR5B, SGSM2, SETP14, SRSF2, AGAP1, CTR9, BAHD1, MRPS33, PCMTD2, MTCO1P12, EIF1AXP1, AL391058, AGPAT1, CCNL2, HNRNPA1P12, DMD, L3MBTL3, MT-CO3, MT-CO1, MTCO3P12, ADH5, SLC25A29, TMEM120B, ZDHHC23, BTBD9, NPIPA1, GLDC, KDF1, CLCN5, NUDCD2, SNIP1, ZC3H12A, MAGOH,
  • The one or more genes described in FIG. 1 are preferably selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, EME2, BRD7, SELENBP1, METTL3, OSER1, and FBXO41.
  • The gene described in FIG. 2 is preferably RP2.
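  • The one or more genes described in FIG. 4 are one or more genes selected from the group consisting of FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L, C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCEAL5, PIGO, SLC2A4RG, TMEM14C, CASC15, ATRAID, PSME4, GET1, ANKRD54, FKBP5, FAM89B, CLTB, and DGKK.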
  • one or more genes selected from the group consisting of FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L One or more genes selected from the group consisting of C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCE
  • genes listed above were selected as genes suitable for predicting the likelihood of developing a specific disease type using the selection method of the present disclosure.
  • The number of genes used to determine the likelihood of developing a specific disease type is, for example, 1 to 300, 1 to 200, or 1 to 100. Specific numbers of genes include 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, and so on.
  • Subject-derived iPS cells can be produced based on the above-described known method using cells contained in a biological sample collected from the subject.
  • the biological sample is a cell, tissue, body fluid, or the like collected from a subject, and is not particularly limited as long as it contains cells capable of producing iPS cells.
  • Conditions for culturing iPS cells such as medium, culture temperature, culture time, and culture vessel, are not particularly limited, and known conditions can be used as appropriate.
  • Gene expression levels can be measured by methods known in the art. Examples of such methods include the microarray method, real-time PCR, northern blotting, the EST method, SAGE (serial analysis of gene expression), and sequencing using an NGS (next-generation sequencer) or a nanopore sequencer.
  • the gene expression level may be obtained by measuring the amount of total RNA or by measuring the amount of a part of RNA. Furthermore, the data obtained about the gene expression level may be subjected to gene ID conversion, missing value processing, normalization, logarithmic conversion, etc. as preprocessing used for subsequent analysis.
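The preprocessing steps named above (missing value processing, normalization, logarithmic conversion) can be sketched as follows. The concrete choices here are assumptions made for illustration, since the text does not fix them: missing values are replaced by zero, counts are scaled to counts per million, and a pseudocount of 1 is added before taking log2. The function name and sample data are hypothetical.

```python
from math import log2


def preprocess(counts):
    """Map {sample: {gene: raw count or None}} to log2(CPM + 1) values."""
    out = {}
    for sample, genes in counts.items():
        # Missing value processing (assumed rule): treat None as zero.
        filled = {g: (c if c is not None else 0.0) for g, c in genes.items()}
        # Normalization (assumed rule): scale to counts per million.
        total = sum(filled.values())
        scale = 1e6 / total if total > 0 else 0.0
        # Logarithmic conversion with a pseudocount of 1.
        out[sample] = {g: log2(c * scale + 1.0) for g, c in filled.items()}
    return out


# Hypothetical raw counts for one cell line.
raw = {"line_1": {"GENE_A": 3, "GENE_B": 1, "GENE_C": None}}
processed = preprocess(raw)
```

A gene with a missing measurement ends up at exactly 0 after the log transform, which keeps it from dominating later distance or t-statistic calculations.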
  • the possibility of developing a disease belonging to a specific disease type in a subject is determined using the expression level of the above gene as an index.
  • When the expression level of the gene is higher than a preset cutoff value (for a gene whose expression is increased in disease-type iPS cells) or lower than a preset cutoff value (for a gene whose expression is decreased in disease-type iPS cells), it is determined that the subject may develop a disease belonging to the specific disease type.
  • the above cut-off value can be appropriately set by a person skilled in the art, and can be set, for example, from the viewpoint of sensitivity, specificity, positive predictive value, negative predictive value, and the like.
  • the cutoff value can be the average value, percentile value, or maximum value of the expression levels of the above genes in disease-free iPS cells.
  • cut-off values that are standardized from previously obtained data and values that are set by statistical analysis based on analysis of ROC (Receiver Operating Characteristic) curves, etc. can be used.
  • It may be determined that the subject may develop a disease belonging to a specific disease type when one of the above genes is higher or lower than the cutoff value, or when two or more of the above genes are higher or lower than the cutoff value.
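The cutoff-based determination described above can be illustrated with a short sketch. It assumes that each gene has a known direction of change ("up" or "down" in disease-type iPS cells) and derives the cutoff from disease-free iPS cell values by one of three rules: the mean, a percentile, or an extreme value. Mirroring the rule for "down" genes (minimum instead of maximum, 5th instead of 95th percentile) is an assumption of this sketch, as are the function names, gene names, and numbers.

```python
from statistics import mean, quantiles


def cutoff_from_controls(values, rule, direction, pct=95):
    """Derive a cutoff from expression levels in disease-free iPS cells."""
    if rule == "mean":
        return mean(values)
    if rule == "extreme":
        return max(values) if direction == "up" else min(values)
    if rule == "percentile":
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        return cuts[pct - 1] if direction == "up" else cuts[100 - pct - 1]
    raise ValueError(f"unknown rule: {rule}")


def at_risk(subject, controls, direction, rule="extreme", min_genes=1):
    """True if at least min_genes genes cross their cutoff in the expected direction."""
    hits = 0
    for gene, level in subject.items():
        cut = cutoff_from_controls(controls[gene], rule, direction[gene])
        if direction[gene] == "up" and level > cut:
            hits += 1
        elif direction[gene] == "down" and level < cut:
            hits += 1
    return hits >= min_genes


# Hypothetical genes and expression levels.
controls = {"GENE_A": [1.0, 1.2, 0.9, 1.1], "GENE_B": [5.0, 5.5, 4.8, 5.2]}
direction = {"GENE_A": "up", "GENE_B": "down"}
subject_1 = {"GENE_A": 2.0, "GENE_B": 5.1}   # GENE_A exceeds its cutoff
subject_2 = {"GENE_A": 1.0, "GENE_B": 5.0}   # neither gene crosses a cutoff
```

Setting `min_genes=1` corresponds to flagging a subject when any one gene crosses its cutoff, and `min_genes=2` to requiring two or more genes.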
  • The determination can also be made using a model machine-learned with, as learning data, the gene expression levels of iPS cells derived from subjects who have not developed a disease belonging to the disease type and the gene expression levels of iPS cells derived from subjects who have developed a disease belonging to the disease type.
  • The machine learning here is not particularly limited as long as it can be used to determine the possibility of developing a disease belonging to a specific disease type based on the expression level of one or more genes. Examples include support vector machines (SVM), random forests, boosting, bagging, neural networks, and deep learning, among which the support vector machine is preferred.
  • When a support vector machine is used, any kernel function can be used, such as linear, polynomial, radial basis function, or maximum entropy.
  • In the learning, the expression levels of characteristic genes of diseased iPS cells and of disease-free iPS cells are used as input, and the probability of developing a disease belonging to a specific disease type is learned as output.
  • By inputting a predetermined gene expression level into the model obtained by such learning it is possible to determine the possibility that a subject will develop a disease belonging to a specific disease type.
  • genes suitable for high-performance prediction of specific disease types can be selected. Furthermore, according to the determination method of the present disclosure, high-performance determination can be made about the possibility of developing a disease belonging to a specific disease type based on the expression information of a specific gene.
  • Since stem cells can be used in their undifferentiated state and do not need to be differentiated into organs, the possibility of developing a disease belonging to a specific disease type can be predicted at low cost and in a short time.
  • the genome sequence is static data and has a problem of low accuracy in predicting diseases with complex gene interactions. becomes possible.
  • Table 1 shows the disease category, disease name, strain number, age, sex, tissue origin, etc. of the iPS cell lines and ES cell lines used in the Examples.
  • Table 2 shows the classification of each iPS cell line into 5 types of disease onset. MalaCards classifies diseases into subtypes, but RIKEN's disease names are often classified in higher order, so direct correspondence may not be expected. At that time, the first type among the subtypes considered to correspond to the disease names of RIKEN in the MalaCards classification was tentatively used. Using these 23 strains in total, a comprehensive gene expression profile was obtained by the RNA-seq method according to the following procedure.
  • iPS cell lines were prepared using prime ES cell medium (ReproCell) or DMEM/F-12 medium (Thermo Fisher Scientific) on SNL cells (mouse fibroblast STO cell line). Both media were supplemented with 5 ng/ml human basic fibroblast growth factor (FUJIFILM Irvine Scientific). Thereafter, the iPS cell lines were cultured in the feeder-free medium StemFit AK02N (Ajinomoto Co., Inc.) and maintained under feeder-free conditions for at least two passages.
  • ES cells were also maintained in StemFit AK02N supplemented with 10 μM Y-27632 and 0.25 μg/cm² iMatrix-511 for at least two passages. All cell lines were washed with PBS, incubated in 0.5× TrypLE Select (Thermo Fisher Scientific) for 3 minutes at 37°C, detached, harvested, and counted; they were then reseeded in StemFit AK02N (Ajinomoto Co., Inc.) with 10 μM Y-27632 and 0.25 μg/cm² iMatrix-511 (Nippi, Inc.). The medium was changed every 1-2 days.
  • RNA-seq protocol: Total RNA from all cell lines was isolated using the RNeasy Mini Kit (Qiagen), followed by DNase treatment. RNA-seq libraries were prepared in duplicate from 350 ng of total RNA each using the TruSeq Stranded mRNA Library Prep Kit (Illumina). ⁇ 4.2 million 70-bp single-end reads per sample were sequenced on an Illumina HiSeq 2500 sequencer (Illumina). Nucleotide sequences were obtained from the binary RNA-seq data using bcl2fastq v2.20.0.422.
  • The adapter sequences were removed using trim_galore 0.4.4-dev, the reads were mapped to the cDNA and ncRNA sequences of the Ensembl genome GRCh38r100 with M_score ⁇ 1 using bowtie2 2.2.5, and the counts were finally summarized by gene name.
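The final step, summarizing mapped read counts by gene name, might look like the sketch below. It assumes that counts are first available per transcript and that a transcript-to-gene-name table is at hand; the function name, the identifiers, and the rule of skipping transcripts without a gene name are hypothetical.

```python
from collections import defaultdict


def summarize_by_gene(transcript_counts, transcript_to_gene):
    """Sum per-transcript read counts into per-gene-name counts."""
    gene_counts = defaultdict(int)
    for transcript, n in transcript_counts.items():
        gene = transcript_to_gene.get(transcript)
        if gene is not None:          # assumed rule: skip unnamed transcripts
            gene_counts[gene] += n
    return dict(gene_counts)


# Hypothetical transcript-level counts and annotation.
tx_counts = {"TX_A1": 10, "TX_A2": 5, "TX_B1": 7, "TX_unknown": 3}
tx_to_gene = {"TX_A1": "GENE_A", "TX_A2": "GENE_A", "TX_B1": "GENE_B"}
gene_level = summarize_by_gene(tx_counts, tx_to_gene)
```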
  • Analysis of RNA-seq data: Using the logarithmically normalized gene expression data of the 23 strains, prediction was performed for the 5 types of disease-onset tissue that had 4 or more strains each. Characteristic genes used for prediction were ranked by probability using a two-sample t-test between the two groups with and without disease in the tissue. Using leave-one-out cross-validation (LOOCV), the highest prediction rate was measured over the range of the top 1 to 100 feature genes. Since leave-one-out cross-validation yields training data 23 times, the order of characteristic genes was determined from the training data each time using the two-sample t-test before prediction was performed.
  • A support vector machine (SVM) was used for learning, with four types of kernels: linear, polynomial, radial basis function, and maximum entropy. Since 23 predictions were made in leave-one-out cross-validation, these were aggregated to calculate the accuracy and the AUC (area under the receiver operating characteristic curve), and the highest accuracy was obtained. When there were multiple peaks, the one with the highest AUC was recorded. To evaluate this result, uniform random numbers were generated in a 12,499 × 23 matrix, matching the gene expression data of the 23 strains, and the result was compared with the highest prediction rate obtained by the same leave-one-out cross-validation.
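Aggregating the held-out predictions into an accuracy and an AUC can be sketched as follows. The AUC is computed here from its rank interpretation (the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half), which is one standard way to obtain it; the text does not say how it was computed in the Examples. The labels and scores below are hypothetical.

```python
def accuracy(labels, preds):
    """Fraction of held-out samples predicted correctly."""
    return sum(int(l == p) for l, p in zip(labels, preds)) / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical aggregated leave-one-out results for 7 cell lines.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold
acc_value = accuracy(labels, preds)
auc_value = auc(labels, scores)
```

Because the AUC depends only on the ordering of the scores, it can reach 1.0 even when the accuracy at a fixed threshold is below 100%, which is consistent with the skeletal muscle result reported in this document.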
  • Table 3 shows the number of positive and negative data, the highest accuracy and its probability of one-sample t-test results, the AUC and its probability of one-sample t-test results, and the number of feature genes.
  • AUC is shown in FIG.
  • the average value of the highest accuracy obtained from 10 uniform random numbers, the sample standard deviation, and the degree of freedom of 9 were used as the population.
  • the T.DIST.RT function of Excel was used for the test.
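The comparison against the random baseline can be sketched as below. The text does not spell out the test statistic, so this sketch assumes t = (observed value − mean of the random results) / sample standard deviation, with 9 degrees of freedom for 10 random runs. The right-tail probability that Excel's T.DIST.RT returns is approximated here by numerically integrating the Student t density with Simpson's rule. The baseline accuracies are made-up numbers for illustration only.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_dist_rt(t, df, steps=2000):
    """Right-tail probability P(T > t), using Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    s = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = s * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area


# Hypothetical best accuracies from 10 random-number matrices.
random_best = [0.70, 0.74, 0.65, 0.78, 0.70, 0.74, 0.70, 0.65, 0.74, 0.70]
observed = 0.957                       # brain accuracy reported in the text
t_value = (observed - mean(random_best)) / stdev(random_best)  # assumed form
p_value = t_dist_rt(t_value, df=len(random_best) - 1)
```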
  • The maximum accuracy of 95.7% and AUC of 1.0 for the brain, and the AUC of 1.0 for skeletal muscle, were significant at p < 0.05.
  • AUC in the skin and metabolic system also showed a high value of 0.92.
  • The gene names shown in FIGS. 1 to 4 are those assigned by the HUGO (Human Genome Organisation) Gene Nomenclature Committee (HGNC). This indicates that it is possible to predict disease onset with a small number of gene combinations.

Abstract

Disclosed is a method for selecting a gene suitable for the prediction of a specific disease type, the method including the following step: (1) a step for applying the expression amounts of genes in each of a stem cell derived from a subject in whom a disease belonging to the specific disease type is not developed and a stem cell derived from a subject in whom the disease belonging to the specific disease type is developed to a statistical method and a machine learning to select a gene suitable for the prediction of the disease type. Also disclosed is a method for determining the possibility of the onset of a disease belonging to a specific disease type, the method comprising the following step: (A) a step for determining the possibility of the onset of the disease belonging to the disease type on the basis of expression amounts of one or more genes shown in any one of Figs. 1 to 4 in a stem cell derived from a subject.

Description

Method for selecting genes for use in estimating the possibility of disease onset, and method for estimating the possibility of disease onset
The present disclosure relates to a method for selecting genes suitable for predicting a specific disease type, a selection program, and a computer-readable recording medium on which the selection program is recorded. The present disclosure further relates to a method for determining the possibility of developing a disease belonging to a specific disease type.
At present, disease risk prediction using genome sequences has already been commercialized, and such analyses use, for example, SNPs (single nucleotide polymorphisms). However, genome sequences are static data, and problems include low accuracy in predicting diseases that involve complex gene interactions. Non-Patent Document 1 reports using deep learning to predict, from DNA sequence information, the effects of mutations on tissue-specific expression and on disease risk.
In Non-Patent Document 2, the present inventors reported that the known effects (neurotoxicity, genetic carcinogenicity, and non-genetic carcinogenicity) of 20 substances to which embryonic stem (ES) cells were externally exposed could be predicted with high accuracy, including for chemical substances whose effects are unknown, by adopting gene networks and using a support vector machine (SVM). In that document, disease risk is predicted by externally perturbing the gene expression of ES cells, which are pluripotent stem cells capable of developing into a human fetus.
An object of the present disclosure is to provide a method for selecting genes suitable for high-performance prediction of a specific disease type. Another object of the present disclosure is to provide a high-performance method for determining the possibility of developing a disease belonging to a specific disease type.
As a result of intensive research to achieve the above objects, the present inventors found that disease types can be predicted by applying statistical methods and machine learning to gene expression information of stem cells produced from individuals. In particular, they found that disease types with genetic disease factors can be predicted by ranking genes with a t-test based on the gene expression information of iPS cells produced from individuals and then performing machine learning with a support vector machine. As shown in the Examples, using data from disease-derived iPS cells covering 5 types of disease onset, high prediction rates were obtained: an accuracy of 95.7% and an AUC of 1.00 in the brain, and an accuracy of 82.6% and an AUC of 1.00 in skeletal muscle. The numbers of characteristic genes used for the highest prediction rate in the brain, skeletal muscle, skin, and metabolic system were 17, 1, 58, and 51, respectively, and these genes are considered important for predicting the disease type.
The present disclosure was completed through further studies based on these findings, and provides the following: a method for selecting genes suitable for predicting a specific disease type, a selection program, a computer-readable recording medium on which the selection program is recorded, a method for determining the possibility of developing a disease belonging to a specific disease type, and the like.
Item 1. A method for selecting genes suitable for predicting a specific disease type, comprising the following step:
(1) a step of applying the gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type to a statistical method and machine learning, and selecting genes suitable for predicting the disease type.
Item 2. The method according to Item 1, wherein step (1) comprises:
(1a) a step of determining the ranking of characteristic genes by applying a statistical method to the gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type; and
(1b) a step of selecting genes suitable for predicting the disease type by applying machine learning to one or more characteristic genes from the top of the ranking.
Item 3. The method according to Item 2, wherein in step (1a), the genes are ranked according to the degree of difference in expression level when the gene expression level of stem cells derived from a subject who has developed a disease belonging to the disease type is compared with the gene expression level of stem cells derived from a subject who has not developed a disease belonging to the disease type.
Item 4. The method according to Item 2 or 3, wherein in step (1b), the number of characteristic genes that gives the highest prediction rate is determined by machine learning, and the genes included in that number are selected as genes suitable for predicting the disease type.
Item 5. The method according to any one of Items 1 to 4, further comprising:
(0) a step of measuring the gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type.
Item 6. A program for selecting genes suitable for predicting a specific disease type, the program causing a computer to execute the following step:
(1) a step of applying the gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type to a statistical method and machine learning, and selecting genes suitable for predicting the possibility of developing the disease type.
Item 7. A computer-readable recording medium on which the selection program according to Item 6 is recorded.
Item 8. A method for determining the possibility of developing a disease belonging to a specific disease type, comprising the following step:
(A) a step of determining the possibility of developing a disease belonging to the disease type based on the expression level of one or more genes described in any one of FIGS. 1 to 4 in stem cells derived from a subject.
Item 9. The method according to Item 8, wherein in step (A),
the one or more genes described in FIG. 1 are one or more genes selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, EME2, BRD7, SELENBP1, METTL3, OSER1, and FBXO41,
the one or more genes described in FIG. 2 are RP2,
the one or more genes described in FIG. 3 are one or more genes selected from the group consisting of BAG6, KIAA2026, GATAD2A, PPP4C, NTMT1, MAZ, ABL1, YTHDC1, GSK3B, SNX13, PDZD4, ARHGAP23, TMEM250, AC016739, ZNRF1, PUF60, SAMD4B, PPP1R14B, SF3B5, MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, ACOT7, RNASEH1-AS1, NUP62, CCDC71, LMNB2, SLC39A3, COG3, SGTA, POLR3E, NCAPH2, ZSWIM4, MPV17L2, AGPAT1, BRF1, CCDC14, TEDC2, LONP1, C4orf3, UPF1, AL031708, and PSMA7, and
the one or more genes described in FIG. 4 are one or more genes selected from the group consisting of FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L, C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCEAL5, PIGO, SLC2A4RG, TMEM14C, CASC15, ATRAID, PSME4, GET1, ANKRD54, FKBP5, FAM89B, CLTB, and DGKK.
Item 10. The method according to Item 8, wherein in step (A),
the one or more genes described in FIG. 1 are one or more genes selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, and EME2,
the one or more genes described in FIG. 2 are RP2,
the one or more genes described in FIG. 3 are one or more genes selected from the group consisting of BAG6, KIAA2026, GATAD2A, PPP4C, NTMT1, MAZ, ABL1, YTHDC1, GSK3B, SNX13, PDZD4, ARHGAP23, TMEM250, AC016739, ZNRF1, PUF60, SAMD4B, PPP1R14B, SF3B5, MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, and ACOT7, and
the one or more genes described in FIG. 4 are one or more genes selected from the group consisting of FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L, C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCEAL5, PIGO, and SLC2A4RG.
Item 11. The method according to any one of Items 8 to 10, wherein in step (A), the determination is made using a model machine-learned with, as learning data, the gene expression levels of stem cells derived from a subject who has not developed a disease belonging to the disease type and the gene expression levels of stem cells derived from a subject who has developed a disease belonging to the disease type.
Item 12. The method according to any one of Items 8 to 11, wherein the disease type is a disease of the brain, skeletal muscle, skin, or metabolic system.
Item 13. The method according to any one of Items 8 to 12, further comprising:
(A0) a step of measuring the expression level of one or more genes described in any one of FIGS. 1 to 4 in the stem cells derived from the subject.
Item 14. The method according to any one of Items 1 to 5 and 8 to 13, the program according to Item 6, or the recording medium according to Item 7, wherein the stem cells are pluripotent stem cells.
Item 15. The method according to any one of Items 1 to 5 and 8 to 13, the program according to Item 6, or the recording medium according to Item 7, wherein the stem cells are induced pluripotent stem (iPS) cells.
According to the method of the present disclosure, genes suitable for high-performance prediction of a specific disease type can be selected. Furthermore, according to the method of the present disclosure, the possibility of developing a disease belonging to a specific disease type can be determined with high performance based on the expression information of specific genes.
In the present disclosure, stem cells can be used in their undifferentiated state and need not be differentiated into organs or the like, so the possibility of developing a specific disease type can be predicted at low cost and in a short time.
FIG. 1 is a diagram showing the genes (Ensembl gene ID, gene name) included in the 17 genes that gave the highest prediction rate for the brain in the Examples, together with the number of times each was used in leave-one-out cross-validation.
FIG. 2 is a diagram showing the gene (Ensembl gene ID, gene name) included in the 1 gene that gave the highest prediction rate for skeletal muscle in the Examples, together with the number of times it was used in leave-one-out cross-validation.
FIG. 3 (in three parts) is a diagram showing the genes (Ensembl gene ID, gene name) included in the 58 genes that gave the highest prediction rate for the skin in the Examples, together with the number of times each was used in leave-one-out cross-validation.
FIG. 4 (in three parts) is a diagram showing the genes (Ensembl gene ID, gene name) included in the 51 genes that gave the highest prediction rate for the metabolic system in the Examples, together with the number of times each was used in leave-one-out cross-validation.
A flowchart showing the processing procedure of the gene selection method.
A block diagram of a selection device that executes the gene selection method.
A graph showing the AUC for diseases of the brain, skeletal muscle, skin, immune system, and metabolic system in the Examples.
Hereinafter, embodiments of the present invention will be described in detail.
In this specification, the term "comprise" also encompasses the meanings of "consist essentially of" and "consist of".
Method for selecting genes
The method of the present disclosure for selecting genes suitable for predicting a specific disease type (hereinafter sometimes referred to as the "selection method of the present disclosure") is characterized by including the following step.
(1) A step of applying the gene expression levels in iPS cells derived from a subject who has not developed a disease belonging to the disease type and in iPS cells derived from a subject who has developed a disease belonging to the disease type to a statistical method and machine learning, and selecting genes suitable for predicting the disease type.
"Disease type" in the present disclosure means a generic term for the diseases that develop in each tissue (for example, immune system diseases, skin diseases, and brain diseases).
The above tissue refers to tissue of animals including humans, and here the term "tissue" covers organs. The tissue is not particularly limited, and examples include the musculoskeletal system, human skeleton, joints, ligaments, muscular system, tendons, digestive system, mouth, teeth, tongue, salivary glands, parotid gland, submandibular gland, sublingual gland, pharynx, esophagus, stomach, small intestine, duodenum, jejunum, ileum, large intestine, cecum, ascending colon, transverse colon, descending colon, sigmoid colon, rectum, liver, gallbladder, mesentery, pancreas, anal canal, respiratory system, nasal cavity, pharynx, larynx, trachea, bronchi, bronchioles and small airways, lungs, respiratory muscles, urinary system, kidneys, ureters, bladder, urethra, reproductive organs, female reproductive organs, female internal reproductive organs, ovaries, fallopian tubes, uterus, cervix, vagina, female external reproductive organs, vulva, clitoris, placenta, male reproductive organs, male internal reproductive organs, testes, epididymis, vas deferens, seminal vesicles, prostate, bulbourethral glands, male external reproductive organs, penis, scrotum, endocrine system, pituitary gland, pineal gland, thyroid gland, parathyroid glands, adrenal glands, pancreas, circulatory system, heart, arteries, veins, capillaries, lymphatic system, lymphatic vessels, lymph nodes, bone marrow, thymus, spleen, gut-associated lymphoid tissue, tonsils, interstitium, nervous system, brain, cerebrum, cerebral hemispheres, diencephalon, brainstem, midbrain, pons, medulla oblongata, cerebellum, spinal cord, ventricular system, choroid plexus, peripheral nervous system, nerves, cranial nerves, spinal nerves, ganglia, enteric nervous system, sensory organs, eyes, cornea, iris, ciliary body, lens, retina, ears, outer ear, earlobe, eardrum, middle ear, ossicles, inner ear, cochlea, central cavity of the bony labyrinth of the ear, semicircular canals, olfactory epithelium, tongue, taste buds, integumentary system, mammary glands, skin, subcutaneous tissue, skeletal muscle, immune system, and metabolic system. The animals are in particular mammals including humans, such as humans, monkeys, rats, mice, rabbits, goats, sheep, cows, horses, pigs, dogs, and cats.
The present disclosure provides a method for identifying genes that can be used to predict a specific disease type. What is predicted here is the possibility that a disease will develop in a given tissue, regardless of which disease it is; that is, the prediction concerns the possibility that any disease will develop in a tissue such as the brain or skin. Accordingly, the kind of disease is not particularly limited as long as it can develop in the tissue concerned, and it may be any disease.
The number of genes selected by the method of the present disclosure may be one, or may be two or more. When two or more genes are selected, each may be used alone for prediction, or the genes may be used in combination.
In another embodiment of the present disclosure, other types of stem cells can be used instead of iPS cells. The stem cells are not particularly limited as long as the effects of the present disclosure are obtained, and examples include pluripotent stem cells and tissue stem cells (somatic stem cells). The description below mainly concerns iPS cells, but it applies equally to other stem cells.
Pluripotent stem cells are stem cells that have the ability to differentiate into any of the three germ layers (endoderm, mesoderm, and ectoderm), that is, pluripotency, and that are capable of self-renewal. Examples of pluripotent stem cells include embryonic stem (ES) cells, induced pluripotent stem (iPS) cells, embryonic stem cells derived from cloned embryos obtained by nuclear transfer (ntES cells), spermatogonial stem cells (GS cells), embryonic germ cells (EG cells), and pluripotent cells derived from cultured fibroblasts and bone marrow stem cells (Muse cells).
Tissue stem cells are stem cells that can differentiate only into a limited range of tissues but, within that range, have the ability to differentiate into various cell types (multipotency). Examples of tissue stem cells include mesenchymal stem cells, neural stem cells, hematopoietic stem cells, liver stem cells, pancreatic stem cells, germline stem cells, epithelial stem cells, gastrointestinal epithelial stem cells, dental pulp stem cells, retinal stem cells, epidermal stem cells, and hair follicle stem cells.
・Step (1)
In step (1), statistical methods and machine learning are applied to gene expression levels in iPS cells derived from subjects who have not developed a disease belonging to the disease type (hereinafter sometimes referred to as "disease-free-type iPS cells") and in iPS cells derived from subjects who have developed a disease belonging to the disease type (hereinafter sometimes referred to as "diseased-type iPS cells"), and genes suitable for predicting the disease type are selected.
iPS cells can be produced by known methods, for example, by introducing reprogramming factors into any somatic cell. Examples of reprogramming factors include genes or gene products such as Oct3/4, Sox2, Sox1, Sox3, Sox15, Sox17, Klf4, Klf2, c-Myc, N-Myc, L-Myc, Nanog, Lin28, Fbx15, ERas, ECAT15-2, Tcl1, beta-catenin, Lin28b, Sall1, Sall4, Esrrb, Nr5a2, Tbx3, and Glis1; these reprogramming factors can be used alone or in combination of two or more. Combinations of reprogramming factors that can be used include, for example, those described in WO2007/069666, WO2008/118820, WO2009/007852, WO2009/032194, WO2009/058413, WO2009/057831, WO2009/075119, WO2009/079007, WO2009/091659, WO2009/101084, WO2009/101407, WO2009/102983, WO2009/114949, WO2009/117439, WO2009/126250, WO2009/126251, WO2009/126655, WO2009/157593, WO2010/009015, WO2010/033906, WO2010/033920, WO2010/042800, WO2010/050626, WO2010/056831, WO2010/068955, WO2010/098419, WO2010/102267, WO2010/111409, WO2010/111422, WO2010/115050, WO2010/124290, WO2010/147395, WO2010/147612, Huangfu D et al., Nat. Biotechnol., 26:795-797 (2008), Shi Y et al., Cell Stem Cell, 2:525-528 (2008), Eminli S et al., Stem Cells, 26:2467-2474 (2008), Huangfu D et al., Nat. Biotechnol., 26:1269-1275 (2008), Shi Y et al., Cell Stem Cell, 3:568-574 (2008), Zhao Y et al., Cell Stem Cell, 3:475-479 (2008), Marson A, Cell Stem Cell, 3:132-135 (2008), Feng B et al., Nat. Cell Biol., 11:197-203 (2009), Judson RL et al., Nat. Biotechnol., 27:459-461 (2009), Lyssiotis CA et al., Proc Natl Acad Sci U S A, 106:8912-8917 (2009), Kim JB et al., Nature, 461:649-643 (2009), Ichida JK et al., Cell Stem Cell, 5:491-503 (2009), Heng JC et al., Cell Stem Cell, 6:167-174 (2010), Han J et al., Nature, 463:1096-1100 (2010), Mali P et al., Stem Cells, 28:713-720 (2010), and Maekawa M et al., Nature, 474:225-229 (2011).
The somatic cells are not particularly limited and include fetal somatic cells, neonatal somatic cells, and mature somatic cells, whether healthy or diseased; they also include primary cultured cells, passaged cells, and established cell lines. Examples of somatic cells include (1) tissue stem cells (somatic stem cells) such as neural stem cells, hematopoietic stem cells, mesenchymal stem cells, and dental pulp stem cells, (2) tissue progenitor cells, and (3) differentiated cells such as blood cells (peripheral blood cells, umbilical cord blood cells, etc.), lymphocytes, epithelial cells, endothelial cells, muscle cells, fibroblasts (skin cells, etc.), hair cells, hepatocytes, gastric mucosal cells, enterocytes, splenocytes, pancreatic cells (pancreatic exocrine cells, etc.), brain cells, lung cells, renal cells, and adipocytes.
The subject is the organism to which the selection method of the present disclosure is applied. Because the organism from which the above tissue is derived serves as the subject, subjects include, in particular, mammals, including humans. To select genes that enable high-performance prediction, the subjects who have not developed a disease belonging to the disease type and the subjects who have developed such a disease are desirably the same kind of organism (that is, the same species).
For iPS cells derived from a subject who has developed a disease belonging to the disease type, the method of classifying the subject's disease into a disease type is not particularly limited; the classification can be made using information from various known disease databases and the literature (for example, MalaCards (https://www.malacards.org/)). The correspondence between diseases and disease types is not necessarily one-to-one: a disease that can develop in multiple tissues is classified into two or more disease types.
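By way of a non-limiting illustration, the one-to-many correspondence between diseases and disease types might be held as a simple lookup table, as in the Python sketch below. The disease names, the table contents, and the function name `label_cell_lines` are hypothetical placeholders introduced only for this sketch; an actual table would be curated from a resource such as MalaCards.

```python
# Hypothetical disease -> disease-type table (placeholder names, not real data).
# One disease may map to two or more disease types.
DISEASE_TYPES = {
    "disease_A": {"brain"},
    "disease_B": {"brain", "skeletal muscle"},
    "disease_C": {"skin"},
}


def label_cell_lines(line_to_diseases, disease_type):
    """Label each iPS cell line 1 (diseased type) or 0 (disease-free type).

    A line is labeled 1 if any disease of its donor maps to `disease_type`.
    """
    labels = {}
    for line, diseases in line_to_diseases.items():
        affected = any(
            disease_type in DISEASE_TYPES.get(disease, set())
            for disease in diseases
        )
        labels[line] = int(affected)
    return labels
```

With such a table, the same cell line can count as diseased for one disease type and disease-free for another, which is why the labels are computed per disease type.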
In the present disclosure, "gene expression" means the process by which the genetic information encoded in a gene is converted into RNA (for example, mRNA, rRNA, tRNA, snRNA, or ncRNA) through transcription of the gene or, for protein-coding genes, the process by which it is converted into protein through translation of mRNA. In the present disclosure, measuring the expression level of a gene includes measuring the expression level of RNA.
When the expression level of a gene is measured by measuring the expression level of RNA, RNA isolated from each iPS cell can be used. Total RNA can be isolated from each iPS cell by methods known in the art or by using a commercially available kit (for example, RNeasy Mini Kit (Qiagen)) according to the manufacturer's instructions. In this case, cDNA synthesized from the RNA can also be used.
Gene expression levels in iPS cells can be measured by methods known in the art for measuring gene expression levels. Examples of such methods include the microarray method, real-time PCR, Northern blotting, the EST method, SAGE (serial analysis of gene expression), and sequencing using NGS (next-generation sequencers) or nanopore sequencers. The gene expression level may be one obtained by measuring total RNA or one obtained by measuring a portion of the RNA. Furthermore, before being used in subsequent analysis, the gene expression data may be preprocessed by, for example, gene ID conversion, handling of missing values, normalization, or logarithmic transformation.
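As one illustrative possibility, the preprocessing mentioned above (logarithmic transformation, handling of missing values, and normalization) could be sketched in Python for a single gene as follows. The specific choices here, namely a log2 transform with a pseudocount of 1, median imputation, and z-score standardization, are assumptions made for this sketch and are not prescribed by the present disclosure; the function name `preprocess` is likewise hypothetical.

```python
import math
from statistics import mean, median, pstdev


def preprocess(values, pseudocount=1.0):
    """Log-transform, impute, and standardize one gene's expression values.

    `values` holds one measurement per cell line; None marks a missing value.
    At least one value must be present.
    """
    # Log2 transform with a pseudocount so that zero stays finite.
    logged = [None if v is None else math.log2(v + pseudocount) for v in values]
    # Impute missing entries with the median of the observed values.
    observed = [v for v in logged if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in logged]
    # Standardize to zero mean and unit variance; a constant gene becomes all zeros.
    mu, sigma = mean(filled), pstdev(filled)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in filled]
```

In a cross-validated analysis, statistics such as the median and standard deviation would ordinarily be computed on the learning cell lines only, so that the test cell line does not influence its own preprocessing.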
In the present disclosure, gene expression data from diseased-type and disease-free-type iPS cells are analyzed by statistical methods and machine learning to select genes suitable for predicting the disease type. The statistical methods and machine learning are not particularly limited as long as they can select genes suitable for predicting a given disease type, and various known methods can be used.
A specific example of step (1) includes the following steps.
(1a) A step of determining a ranking of characteristic genes by applying a statistical method to gene expression levels in iPS cells derived from subjects who have not developed a disease belonging to the disease type and in iPS cells derived from subjects who have developed a disease belonging to the disease type; and
(1b) A step of selecting genes suitable for predicting the disease type by applying machine learning to one or more characteristic genes taken from the top of the ranking.
In step (1a), the ranking of characteristic genes is determined by applying a statistical method to the gene expression levels in diseased-type and disease-free-type iPS cells. Here, a characteristic gene means a gene for which the statistic testing the degree of difference (for example, based on the mean and variance) between its expression levels in diseased-type and disease-free-type iPS cells is significant. A specific example of the ranking is one in which the genes are ranked by the degree of difference obtained by comparing the gene expression level in diseased-type iPS cells with that in disease-free-type iPS cells (for example, the magnitude of the difference in expression level), for instance in descending order of that degree. The difference may be in either direction: the expression level in diseased-type iPS cells may be higher than that in disease-free-type iPS cells, or the expression level in disease-free-type iPS cells may be higher than that in diseased-type iPS cells.
The statistical method is not particularly limited as long as it can rank genes by the magnitude of the difference between their expression levels in diseased-type and disease-free-type iPS cells. Examples include the t-test, and the two-sample t-test can be suitably used. The Wilcoxon test, the chi-square test, and the like can also be used. In addition to these filter methods, forward selection, backward selection, and exhaustive search, which look for combinations of features, can be used together with them. Embedded methods, in which feature selection is performed during learning, can also be used.
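The ranking of step (1a) with a two-sample t-test could, for example, be sketched in Python as below. This is a minimal sketch under stated assumptions: it uses the pooled-variance (Student) form of the test, it ranks by the absolute value of the t statistic rather than by the p-value (the two orderings coincide here because every gene has the same degrees of freedom), and the names `pooled_t` and `rank_genes` are hypothetical. Each group needs at least two cell lines so that a variance can be computed.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if sp2 == 0:
        # No within-group spread: the statistic is either 0 or unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(sp2 * (1 / na + 1 / nb))


def rank_genes(expr, labels):
    """Rank genes by |t|, largest first.

    expr:   {gene name: [expression level per cell line]}
    labels: [1 for diseased type, 0 for disease-free type], same order.
    """
    scores = {}
    for gene, values in expr.items():
        diseased = [v for v, y in zip(values, labels) if y == 1]
        healthy = [v for v, y in zip(values, labels) if y == 0]
        # abs(): a difference in either direction counts.
        scores[gene] = abs(pooled_t(diseased, healthy))
    return sorted(scores, key=scores.get, reverse=True)
```

Taking the absolute value reflects the statement above that the difference in expression level may be in either direction.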
In step (1b), machine learning is applied to one or more characteristic genes taken from the top of the ranking determined in step (1a), and genes suitable for predicting the disease type are selected. The characteristic genes used for machine learning are the top-ranked characteristic gene alone or a combination of the 1st to N-th ranked characteristic genes (hereinafter, the characteristic gene ranked N-th from the top is sometimes referred to as the "N-th ranked characteristic gene"): for example, the 1st ranked gene, the 1st to 2nd ranked genes, the 1st to 3rd ranked genes, the 1st to 4th ranked genes, and so on.
The number of characteristic genes used for machine learning is, for example, 1 to 300, 1 to 200, or 1 to 100. Specific numbers of characteristic genes include 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, and 100.
The machine learning method is not particularly limited as long as it can be used to determine the possibility of developing a disease belonging to the disease type based on the expression levels of one or more characteristic genes. Examples include support vector machines (SVM), random forests, boosting, bagging, neural networks, and deep learning; among these, support vector machines can be suitably used. When a support vector machine is used, any kernel function can be used, such as a linear, polynomial, radial basis function, or maximum entropy kernel.
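For reference, three of the kernel functions named above can be written out as in the following Python sketch, using their commonly used textbook definitions. The parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions that follow common convention, not values taken from the present disclosure; the maximum entropy kernel is omitted because no particular closed form is assumed here.

```python
import math


def linear_kernel(x, y):
    """k(x, y) = <x, y>"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """k(x, y) = (gamma * <x, y> + coef0) ** degree"""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Here `x` and `y` would each be the vector of expression levels of the selected characteristic genes for one cell line.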
In the machine learning, the expression levels of the characteristic genes in diseased-type iPS cells and in disease-free-type iPS cells are used as the input, and the possibility of developing a disease belonging to the disease type is used as the output. From the results of the machine learning, the number of characteristic genes that gives the highest prediction rate is determined, and the genes included in that number can be selected as genes suitable for predicting the disease type. For example, machine learning is performed with various numbers of characteristic genes, test data are used to determine the number of characteristic genes (N) that gives the highest prediction rate, and the 1st to N-th ranked characteristic genes are taken as the genes suitable for predicting the disease type. As the prediction rate, one or more evaluation metrics used in machine learning (for evaluating the generalization performance of a model) can be used, for example, accuracy or AUC (area under the curve). The misclassification rate, balanced accuracy, F1 score, Matthews correlation coefficient, and the like can also be used.
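The threshold-based metrics listed above can all be derived from the four cells of a confusion matrix, as the following Python sketch illustrates using the standard definitions. The function name `metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this sketch. AUC is not included because it requires continuous prediction scores rather than 0/1 predictions.

```python
import math


def metrics(y_true, y_pred):
    """Evaluation metrics for binary labels (1 = diseased type, 0 = disease-free type)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
        "mcc": mcc,
    }
```

When the diseased-type and disease-free-type groups differ considerably in size, balanced accuracy or the Matthews correlation coefficient may be more informative than plain accuracy.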
An example of a technique for determining the number of characteristic genes that gives the highest prediction rate is cross-validation. Types of cross-validation include holdout validation, k-fold cross-validation, and leave-one-out cross-validation, and any of them can be used. Cross-validation may be applied not only to step (1b) alone but to both step (1a) and step (1b). When it is applied to both steps, the characteristic genes ranked in step (1a) also change from one round of cross-validation to the next. For example, when M iPS cell lines are used and the number of characteristic genes giving the highest prediction rate is N, leave-one-out cross-validation is performed M times and N characteristic genes are determined in each round, so that M×N genes are selected as genes suitable for predicting the possibility of developing the disease type. The same gene may appear more than once among the M×N genes, and a given gene appears between 1 and M times. Accordingly, the more often a gene appears, the more suitable it is considered to be for predicting the possibility of developing the disease type.
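The counting of repeated genes among the M×N selections can be illustrated with the short Python sketch below. The function name `tally_selected_genes` and the alphabetical tie-breaking are assumptions made for this sketch.

```python
from collections import Counter


def tally_selected_genes(per_round_genes):
    """Count in how many cross-validation rounds each gene was selected.

    per_round_genes: M lists, each holding the N genes selected in one round.
    Returns (gene, count) pairs, most frequent first; ties sorted by name.
    """
    counts = Counter()
    for genes in per_round_genes:
        counts.update(set(genes))  # set(): count a gene at most once per round
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

A gene with a count of M was selected in every round, and would therefore be regarded as among the most suitable for prediction.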
・Example of selection procedure
Next, an example of the processing procedure of the gene selection method is described. FIG. 5 is a flowchart showing the processing procedure of a gene selection method according to one embodiment of the present invention, and FIG. 6 is a functional block diagram of a selection device 1 that executes the selection method.
The selection device 1 can be configured as a general-purpose computer and includes, as its hardware configuration, a processor such as a CPU or GPU, a main storage device such as DRAM or SRAM, and an auxiliary storage device such as an HDD or SSD. The selection device 1 may also be composed of a plurality of computers.
The selection device 1 includes, as functional blocks, an acquisition unit 2, a cell line selection unit 3, a ranking determination unit 4, a gene selection unit 5, a learning unit 6, a prediction rate measurement unit 7, and a gene selection unit 8. Each of these units can be implemented in software by the processor of the selection device 1 reading the selection program according to the present embodiment into the main storage device and executing it. The selection program may be downloaded to the selection device 1 via a communication network such as the Internet, or may be installed on the selection device 1 from a computer-readable non-transitory recording medium, such as a CD-R, on which the selection program is recorded.
In the flowchart shown in FIG. 5, in step S1, the acquisition unit 2 acquires gene expression data obtained by measuring the gene expression levels of diseased-type iPS cell lines and disease-free-type iPS cell lines. In this embodiment, the number of iPS cell lines is assumed to be 23. The gene expression levels can be measured by the methods described above. As the gene expression data, the measured values or data processed by, for example, log normalization are used. The expression levels used are those of total RNA or of the RNA of a subset of genes appropriate for the analysis.
In step S2, X is set to 1, and in step S3, the cell line selection unit 3 selects the X-th cell line (initially, X = 1) of the 23 iPS cell lines as the test cell line for cross-validation. Then, in step S4, the cell line selection unit 3 assigns the 22 cell lines other than the test cell line as the learning cell lines for cross-validation.
In step S5, the ranking determination unit 4 determines the ranking of characteristic genes from the learning cell lines using a statistical method. That is, the ranking determination unit 4 applies a statistical method to the gene expression levels of the diseased-type and disease-free-type iPS cell lines and ranks the characteristic genes, for example, by the probability (p-value) obtained from a two-sample t-test between the two groups. The disease type used is the one assigned to each iPS cell line on the basis of its disease information, using information from a database such as MalaCards.
In step S6, N is set to 1, and in step S7, the gene selection unit 5 selects the top N characteristic genes (initially, the single top-ranked gene) from the ranking determined in step S5.
In step S8, the learning unit 6 creates a trained model by machine learning the relationship between the expression levels of the characteristic genes selected in step S7 and the presence or absence of disease in the learning cell lines. In this embodiment, the learning unit 6 performs the machine learning using a support vector machine.
In step S9, the prediction rate measurement unit 7 makes a prediction for the test cell line. That is, the expression data of the test cell line selected in step S3 are input to the trained model created in step S8 to predict the presence or absence of disease in the test cell line.
If N is less than 100 (NO in step S10), N is incremented by 1 in step S11, and steps S7 to S9 are repeated. That is, steps S7 to S9 are repeated 100 times, once for each number of characteristic genes. When N = 100 (YES in step S10) and X is less than 23 (NO in step S12), X is incremented by 1 in step S13, and steps S3 to S11 are repeated.
Thereafter, when X = 23 (YES in step S12), in step S14 the gene selection unit 8 determines the number N that gave the highest prediction accuracy over the 23 test cell lines. That is, for each N, the trained models learned with the 1st to N-th ranked characteristic genes have, in step S9, predicted the presence or absence of disease for a total of 23 test cell lines; the value of N for which the prediction rate aggregated over those 23 predictions (accuracy, AUC, etc.) is highest is obtained. For example, if the prediction rate is highest when N = 17, the gene selection unit 8 determines the number of characteristic genes giving the highest prediction accuracy to be 17.
Subsequently, in step S15, the gene selection unit 8 selects the genes included in the number of characteristic genes determined in step S14 as genes suitable for predicting the disease type. When the number of characteristic genes is determined to be 17 as above, each of the 23 rounds of leave-one-out cross-validation contributes 17 genes, so 17 × 23 genes are selected. Because the same gene may appear more than once among these 17 × 23 genes, duplicates are removed as appropriate. A given gene appears between 1 and 23 times, and the more often it appears, the more important it can be judged to be for predicting the disease type.
By carrying out steps S1 to S15 for each disease type, such as brain or skeletal muscle, genes suitable for prediction are selected for each disease type.
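The overall flow of steps S2 to S15 can be sketched in Python as shown below. This is a simplified, non-authoritative sketch under several assumptions: a nearest-centroid classifier is substituted for the support vector machine of the embodiment so that the example needs only the standard library; genes are ranked by the absolute pooled-variance t statistic, which gives the same order as the p-value within one round; accuracy is used as the prediction rate; ties in accuracy are resolved in favor of the smallest N; and all function and variable names are hypothetical.

```python
import math
from statistics import mean, variance


def t_score(a, b):
    """Absolute pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = abs(mean(a) - mean(b))
    if sp2 == 0:
        return math.inf if diff > 0 else 0.0
    return diff / math.sqrt(sp2 * (1 / na + 1 / nb))


def rank_genes(expr, labels, train):
    """Step S5: rank gene indices using the learning cell lines only."""
    scores = []
    for g in range(len(expr[0])):
        pos = [expr[i][g] for i in train if labels[i] == 1]
        neg = [expr[i][g] for i in train if labels[i] == 0]
        scores.append(t_score(pos, neg))
    return sorted(range(len(scores)), key=lambda g: -scores[g])


def centroid_predict(expr, labels, train, genes, test):
    """Steps S8 and S9, with a nearest-centroid stand-in for the SVM."""
    def centroid(label):
        rows = [expr[i] for i in train if labels[i] == label]
        return [mean(row[g] for row in rows) for g in genes]

    def sq_dist(center):
        return sum((expr[test][g] - c) ** 2 for g, c in zip(genes, center))

    return 1 if sq_dist(centroid(1)) < sq_dist(centroid(0)) else 0


def select_genes(expr, labels, max_n=100):
    """expr[i][g]: expression of gene g in cell line i; labels[i]: 1 or 0."""
    n_lines = len(expr)
    max_n = min(max_n, len(expr[0]))
    correct = [0] * (max_n + 1)        # correct[n]: hits when using the top n genes
    rankings = []
    for test in range(n_lines):        # X loop (S3, S12, S13): leave one line out
        train = [i for i in range(n_lines) if i != test]          # S4
        ranking = rank_genes(expr, labels, train)                 # S5
        rankings.append(ranking)
        for n in range(1, max_n + 1):  # N loop (S7, S10, S11)
            pred = centroid_predict(expr, labels, train, ranking[:n], test)
            correct[n] += int(pred == labels[test])
    best_n = max(range(1, max_n + 1), key=lambda n: correct[n])   # S14
    per_round = [ranking[:best_n] for ranking in rankings]        # S15
    return best_n, correct[best_n] / n_lines, per_round
```

Note that the ranking is recomputed inside the leave-one-out loop, so the test cell line never influences which genes are ranked highly in its own round. The returned per-round gene lists may overlap, and tallying how often each gene recurs across rounds gives the importance measure described above.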
Method for determining the possibility of developing a disease belonging to a specific disease type
The method of the present disclosure for determining the possibility of developing a disease belonging to a specific disease type (hereinafter sometimes referred to as the "determination method of the present disclosure") is characterized by including the following step.
(A) A step of determining the possibility of developing a disease belonging to the disease type based on the expression levels, in iPS cells derived from a subject, of one or more genes listed in any one of FIGS. 1 to 4.
Unless otherwise specified, terms such as disease type and subject in the determination method of the present disclosure have the same meanings as described for the selection method of the present disclosure. In another embodiment of the present disclosure, as in the selection method of the present disclosure, other types of stem cells can be used instead of iPS cells. The description below mainly concerns iPS cells, but it applies equally to other stem cells.
 ・Step (A)
 In step (A), the possibility of developing a disease belonging to the disease type is determined based on the expression level of one or more genes listed in any one of FIGS. 1 to 4 in iPS cells derived from the subject. Here, "the expression level of one or more genes listed in any one of FIGS. 1 to 4" means the expression level of one or more genes listed in FIG. 1, the expression level of one or more genes listed in FIG. 2, the expression level of one or more genes listed in FIG. 3, or the expression level of one or more genes listed in FIG. 4. The genes listed in FIGS. 1 to 4 can be used to determine the possibility of developing a disease in the brain (FIG. 1), skeletal muscle (FIG. 2), skin (FIG. 3), and metabolic system (FIG. 4), respectively. The gene names shown in FIGS. 1 to 4 are those assigned by the HUGO (Human Genome Organisation) Gene Nomenclature Committee (HGNC).
Genes listed in FIG. 1:
MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, EME2, BRD7, SELENBP1, METTL3, OSER1, FBXO41, HEATR5B, SGSM2, SETP14, SRSF2, AGAP1, CTR9, BAHD1, MRPS33, PCMTD2, MTCO1P12, EIF1AXP1, AL391058, AGPAT1, CCNL2, HNRNPA1P12, DMD, L3MBTL3, MT-CO3, MT-CO1, MTCO3P12, ADH5, SLC25A29, TMEM120B, ZDHHC23, BTBD9, NPIPA1, GLDC, KDF1, CLCN5, NUDCD2, SNIP1, ZC3H12A, MAGOH, UTRN, B4GALNT3, PSMA1, MKLN1, SEM1, UQCR11, BCL11A, TGIF2, RAD50, RNF8, UBE4A, INTS2, RHEB, MEST, ZNF14, DMAC2, SYDE1, ZNF106, SUSD6, METTL2A, PREP, VDAC3, KDM7A
Genes listed in FIG. 2:
RP2, POLR2D, FADD, NDUFB11, SLC39A11, PDHA1, MND1, COQ5, YKT6
Genes listed in FIG. 3:
BAG6, KIAA2026, GATAD2A, PPP4C, NTMT1, MAZ, ABL1, YTHDC1, GSK3B, SNX13, PDZD4, ARHGAP23, TMEM250, AC016739, ZNRF1, PUF60, SAMD4B, PPP1R14B, SF3B5, MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, ACOT7, RNASEH1-AS1, NUP62, CCDC71, LMNB2, SLC39A3, COG3, SGTA, POLR3E, NCAPH2, ZSWIM4, MPV17L2, AGPAT1, BRF1, CCDC14, TEDC2, LONP1, C4orf3, UPF1, AL031708, PSMA7, RPS27AP11, ZNF592, SLC22A23, ERP44, OXLD1, ARMCX5, YTHDC1, CTU2, PUSL1, BOLA2B, DTYMK, SSBP2, USP8, Z83844, RNF31, AC079250, RPL39, TSSC4, SSNA1, SURF2, NIBAN2, MGAT1, DHX40, DNAJA3, USP48, CDC14A, AP002784, PCDHGA11, HNRNPA1P54, HSPA1B, NOP9, HGS, COPB2, DDX28, ZNRD2, RNF26, ZNF16, ATP5MPL, PTGES2, SLC7A1, GCOM1, MED6, KDM3A, OSBP, WSB1, LIMK1, RPL18A, SLC7A5, PLTP, SIRT6, C1GALT1C1, AL392086, BCAR1, FNBP4, KCNQ2, ARHGAP23, NTAN1, RAB11FIP3, RPS9, FAM189A1, LINC01578, NBPF10, AC010614, AC107871, PLEKHO2, WDR46, PAM16, HSPA1B, WDR46, INPP5F, GET3, GPX1P1, ZGPAT, ZNF782, PLSCR3, LIN9, PYCR1, COL18A1, SMG1P3, COX5A, CTXN1, JUP, HNRNPA3, C2CD3, C19orf48, TBC1D16, RBPMS2, TKT, NEPRO, FAM102B, POLR3K, SHC1, DEDD2, G6PD, PTPDC1, THY1, DPH7, SH3GL1, DCAF5, TACO1, COQ10A, SWAP70, RAP1GAP2, RFX1, TRAF7, ANKHD1, KIF1A, ELOF1, KIF1C, RNF6, ELK1, TRAP1, DDX39A, P4HA1, ZNF211, KDM5B, ADPRS, SDC1, FAM162A, WWC1, SERINC1, DVL1, DDX49, SLC1A5, AKT2, CD276, TRIP4, ELOB, COTL1, PLA2G15, NME3, PGK1, RANGAP1, BAG6, NUBP2, PSME1, EXOC1, MUL1, STRN4, CHERP, KAT6A, DOP1A, ITGB5, KEAP1, SAR1A, TRAF4, TTC7A, IARS2, CLEC16A, GABARAPL2
Genes listed in FIG. 4:
FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L, C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCEAL5, PIGO, SLC2A4RG, TMEM14C, CASC15, ATRAID, PSME4, GET1, ANKRD54, FKBP5, FAM89B, CLTB, DGKK, AKNA, CYB561D1, ZNF202, MDK, LINC02188, CCPG1, AL031729, CCNL1, NCAPD3, ZSCAN29, SNX10, ARMCX4, DDAH2, HDDC3, STYXL1, ZNF195, ZNF35, ARHGAP26, RELCH, TIMM10B, GTF3C4, GTF2IP1, CYB5D1, SHROOM4, TIAM1, IRF2BPL, TBCB, LYRM1, GAB2, DUS4L-BCAP29, AC019257, MLLT6, ZBTB12, GTF2IP4, NELFE, HSPA5P1, LYRM4, ITGA1, ENTPD7, JRKL, ZBTB38, SLC25A33, LINGO1, HIC2, RPS14, WASF2, ANKRD50, NAF1, SHQ1, ZNF614, CHCHD5, PRPF4B, CORO1C, DHX15, TSPAN13, LIMA1, ARID4A, DEF6, SCMH1, AP001267, KIAA0355, LINC00869, ZNF528-AS1, GTF2I, HNRNPA1P50, FPGT, SNHG3, MRPS17, CKMT1B, ATF6B, EIF3FP3, EEF1A1P14, RPL3P2, DANCR, CCHCR1, MORF4L1P1, CD27-AS1, VPS52, ZNF814, ZNF525, CCDC167, RPL10A, MRPL40, WASHC1, LRRN1, XXYLT1, HEG1, DAB1, EVC2, TADA3, PFKFB3, SIN3A, ATF6B, CIAO2A, SHLD2P1, BORCS5, ANKS6, SENP2, ATF3, CLPB, TIAL1, SACS, VIPAS39, INIP, LACTB2, SLC2A12, MALL, MRPL24, TNFRSF11A, CPNE2, HCN4, ANXA7, CNMD, USP44, THOC2, GOT2, CHD6, TFCP2L1, FAF2, CHKA, CNTNAP1, RIC1, GLA, MTAP, ERGIC2, SNRPA, OSBPL3, COL11A1, AKR7A2, PTCD2, GRAMD1B, ZFP64, MKS1, MYCBP2
 The one or more genes listed in FIG. 1 are preferably one or more genes selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, EME2, BRD7, SELENBP1, METTL3, OSER1, and FBXO41, and more preferably one or more genes selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, and EME2.
 The one or more genes listed in FIG. 2 are preferably RP2.
 The one or more genes listed in FIG. 3 are preferably one or more genes selected from the group consisting of BAG6, KIAA2026, GATAD2A, PPP4C, NTMT1, MAZ, ABL1, YTHDC1, GSK3B, SNX13, PDZD4, ARHGAP23, TMEM250, AC016739, ZNRF1, PUF60, SAMD4B, PPP1R14B, SF3B5, MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, ACOT7, RNASEH1-AS1, NUP62, CCDC71, LMNB2, SLC39A3, COG3, SGTA, POLR3E, NCAPH2, ZSWIM4, MPV17L2, AGPAT1, BRF1, CCDC14, TEDC2, LONP1, C4orf3, UPF1, AL031708, and PSMA7, and more preferably one or more genes selected from the group consisting of BAG6, KIAA2026, GATAD2A, PPP4C, NTMT1, MAZ, ABL1, YTHDC1, GSK3B, SNX13, PDZD4, ARHGAP23, TMEM250, AC016739, ZNRF1, PUF60, SAMD4B, PPP1R14B, SF3B5, MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, and ACOT7.
 The one or more genes listed in FIG. 4 are preferably one or more genes selected from the group consisting of FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L, C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCEAL5, PIGO, SLC2A4RG, TMEM14C, CASC15, ATRAID, PSME4, GET1, ANKRD54, FKBP5, FAM89B, CLTB, and DGKK, and more preferably one or more genes selected from the group consisting of FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L, C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCEAL5, PIGO, and SLC2A4RG.
 The genes listed above were selected, using the selection method of the present disclosure, as genes suitable for predicting the possibility of developing a specific disease type.
 The number of genes used to determine the possibility of developing a specific disease type is, for example, 1 to 300, 1 to 200, or 1 to 100. Specific examples of the number of genes include any integer from 1 to 100, that is, 1, 2, 3, and so on up to 100.
 Note that the determination method of the present disclosure does not exclude using the expression levels of genes other than those listed above in addition to the listed genes; the expression levels of such other genes can also be used to determine the possibility of developing a disease belonging to a specific disease type.
 Subject-derived iPS cells can be produced by the known methods described above using cells contained in a biological sample collected from the subject. Here, the biological sample is a cell, tissue, body fluid, or the like collected from the subject, and is not particularly limited as long as it contains cells from which iPS cells can be produced. The conditions for culturing iPS cells, such as the medium, culture temperature, culture time, and culture vessel, are not particularly limited, and known conditions can be used as appropriate.
 Gene expression levels can be measured by any method known in the art for measuring gene expression. Examples of such methods include microarray analysis, real-time PCR, Northern blotting, EST analysis, SAGE (serial analysis of gene expression), and sequencing using NGS (next-generation sequencers) or nanopore sequencers. The gene expression level may be obtained by measuring the amount of total RNA or the amount of a subset of RNA. In addition, before subsequent analysis, the expression data may be preprocessed by, for example, gene ID conversion, handling of missing values, normalization, and logarithmic transformation.
 In the determination method of the present disclosure, the possibility that a subject will develop a disease belonging to a specific disease type is determined using the expression level of the above genes as an index. For example, the subject is determined to have a possibility of developing a disease belonging to the specific disease type when the expression level of a gene is higher than a preset cutoff value (for genes whose expression is elevated in diseased-type iPS cells) or lower than a preset cutoff value (for genes whose expression is reduced in diseased-type iPS cells). The cutoff value can be set as appropriate by a person skilled in the art, for example from the viewpoint of sensitivity, specificity, positive predictive value, or negative predictive value. The cutoff value can also be the mean, a percentile, or the maximum of the expression levels of the gene in disease-free-type iPS cells. Alternatively, a cutoff value standardized from previously obtained data, or one set by statistical analysis such as analysis of an ROC (receiver operating characteristic) curve, can be used.
 The subject may be determined to have a possibility of developing a disease belonging to the specific disease type when the expression level of any one of the above genes is higher or lower than its cutoff value, or only when the expression levels of two or more of the genes are higher or lower than their cutoff values.
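 The cutoff-based determination described above can be sketched as follows. This is an illustrative sketch only: the function names, the `min_genes` parameter, the placeholder gene names, and all numerical values are assumptions introduced here and do not appear in the disclosure.

```python
import statistics

def cutoff_from_controls(control_levels, method="mean", percentile=95):
    """Derive a cutoff for one gene from its expression in disease-free cells."""
    if method == "mean":
        return statistics.mean(control_levels)
    if method == "max":
        return max(control_levels)
    if method == "percentile":
        # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
        cuts = statistics.quantiles(control_levels, n=100, method="inclusive")
        return cuts[percentile - 1]
    raise ValueError(f"unknown method: {method}")

def has_possibility(levels, cutoffs, directions, min_genes=1):
    """True if at least `min_genes` genes lie beyond their cutoff.

    directions[gene] is "up" for genes elevated in diseased-type cells
    and "down" for genes reduced in them.
    """
    hits = 0
    for gene, cutoff in cutoffs.items():
        if directions[gene] == "up":
            hits += levels[gene] > cutoff
        else:
            hits += levels[gene] < cutoff
    return hits >= min_genes

# Hypothetical disease-free expression levels for two placeholder genes.
controls = {"GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0], "GENE_B": [8.0, 9.0, 10.0]}
cutoffs = {
    "GENE_A": cutoff_from_controls(controls["GENE_A"], method="max"),
    "GENE_B": cutoff_from_controls(controls["GENE_B"], method="mean"),
}
directions = {"GENE_A": "up", "GENE_B": "down"}
subject = {"GENE_A": 6.5, "GENE_B": 9.5}

one_gene_rule = has_possibility(subject, cutoffs, directions, min_genes=1)
two_gene_rule = has_possibility(subject, cutoffs, directions, min_genes=2)
```

 Setting `min_genes=1` corresponds to the one-gene rule and `min_genes=2` to the rule requiring two or more genes; which rule and which cutoff are appropriate would be decided by a person skilled in the art.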
 In step (A), the determination can also be made using a model obtained by machine learning in which the gene expression levels of iPS cells derived from subjects who have not developed a disease belonging to the disease type and the gene expression levels of iPS cells derived from subjects who have developed a disease belonging to the disease type are used as training data. The machine learning method is not particularly limited as long as it can be used to determine the possibility of developing a disease belonging to a specific disease type based on the expression level of one or more genes. Examples include support vector machines (SVM), random forests, boosting, bagging, neural networks, and deep learning, among which a support vector machine can be suitably used. When a support vector machine is used, any kernel function, such as linear, polynomial, radial basis function, or maximum entropy, can be used. In the machine learning, the expression levels of the characteristic genes in diseased-type iPS cells and in disease-free-type iPS cells are used as input, and the possibility of developing a disease belonging to the specific disease type is used as output. By inputting given gene expression levels into the model obtained by this learning, the possibility that a subject will develop a disease belonging to the specific disease type can be determined.
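 As a rough illustration of such a learned model, the following sketch trains a linear soft-margin support vector machine by plain sub-gradient descent on the hinge loss. It is written from scratch only to keep the example self-contained; it is not the implementation used in the disclosure, and in practice an established SVM library would normally be used. The hyperparameters (`lam`, `lr`, `epochs`) and the toy expression values are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=300):
    """Fit w, b minimising lam/2*|w|^2 + mean hinge loss; labels are +1/-1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the sub-gradient
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def decision(w, b, x):
    """Signed score; positive means 'diseased type' in this sketch."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Hypothetical log-expression of two characteristic genes per cell line.
X = [[3.0, 2.0], [3.2, 2.1], [2.9, 1.8],   # diseased-type lines (+1)
     [1.0, 5.0], [1.2, 4.8], [0.9, 5.1]]   # disease-free-type lines (-1)
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
score_new_subject = decision(w, b, [3.1, 2.0])
```

 A positive score for a new subject would be read as a possibility of developing a disease of the type in question; non-linear kernels such as those mentioned above are outside the scope of this sketch.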
 According to the selection method of the present disclosure, genes suitable for high-performance prediction of a specific disease type can be selected. Furthermore, according to the determination method of the present disclosure, a high-performance determination can be made, based on the expression information of specific genes, as to the possibility of developing a disease belonging to a specific disease type.
 In the present disclosure, stem cells can be used in their undifferentiated state and need not be differentiated into organs or the like, so the possibility of developing a disease belonging to a specific disease type can be predicted at low cost and in a short time.
 Genome sequences are static data, and their low accuracy in predicting diseases that involve complex gene interactions is a problem. Because the present disclosure uses gene expression information, higher-performance disease risk prediction becomes possible.
 When producing an iPS cell stock, objectively healthy cells can be obtained by using the present disclosure to predict the possibility of developing a disease belonging to any of the disease types.
 With the determination method of the present disclosure, the presence or absence of tissue in which a disease will develop can be predicted using stem cells of individuals who have not developed the disease, which makes preventive and preemptive medicine possible.
 Examples are given below to describe the present invention in more detail. However, the present invention is in no way limited to these Examples.
 Method
 <Materials>
 Table 1 shows the disease category, disease name, line number, age, sex, tissue of origin, and other details of the iPS cell lines and ES cell lines used in the Examples. Table 2 shows the classification of each iPS cell line into the five disease-onset types. MalaCards classifies diseases down to the subtype level, whereas RIKEN disease names often correspond to a higher-level category, so a direct one-to-one correspondence cannot always be expected. In such cases, the first of the MalaCards subtypes considered to correspond to the RIKEN disease name was provisionally used. Using these 23 lines in total, comprehensive gene expression profiles were obtained by RNA-seq according to the following procedure.
Figure JPOXMLDOC01-appb-T000001
Figure JPOXMLDOC01-appb-T000002
 <Cell culture>
 iPS cell lines were cultured on SNL cells (a mouse fibroblast STO cell line) in primate ES cell medium (ReproCell) or DMEM/F-12 medium (Thermo Fisher Scientific) supplemented with 20% KSR (Thermo Fisher Scientific), 0.1 mM NEAA (Nacalai Tesque, Inc.), and 0.1 mM 2-mercaptoethanol. Both media were further supplemented with 5 ng/ml human basic fibroblast growth factor (FUJIFILM Irvine Scientific). The iPS cell lines were then passaged into the feeder-free medium StemFit AK02N (Ajinomoto Co., Inc.) supplemented with 10 μM Y-27632 (CultureSure (trademark), FUJIFILM Wako Pure Chemical Corporation) and 0.25 μg/cm² iMatrix-511 (Nippi, Inc.), and maintained under feeder-free conditions for at least two passages. ES cells were likewise maintained for at least two passages in StemFit AK02N supplemented with 10 μM Y-27632 and 0.25 μg/cm² iMatrix-511. All cell lines were washed with PBS, incubated in 0.5× TrypLE Select (Thermo Fisher Scientific) at 37°C for 3 minutes, detached, collected, counted, and reseeded in StemFit AK02N (Ajinomoto Co., Inc.) supplemented with 10 μM Y-27632 and 0.25 μg/cm² iMatrix-511 (Nippi, Inc.). The medium was changed every 1 to 2 days.
 <RNA-seq protocol>
 Total RNA was isolated from all cell lines using the RNeasy Mini Kit (Qiagen) and then treated with DNase. RNA-seq libraries were prepared with the TruSeq Stranded mRNA Library Prep Kit (Illumina) from 350 ng of total RNA for each of two replicates. Approximately 4.2 million 70-bp single-end reads per sample were sequenced on an Illumina HiSeq 2500 sequencer (Illumina). Base sequences were obtained from the binary RNA-seq data using bcl2fastq v2.20.0.422. Adapter sequences were removed using trim_galore 0.4.4-dev, the reads were mapped to the cDNA and ncRNA sequences of Ensembl genome GRCh38r100 using bowtie2 2.2.5 with M_score ≥ 1, and the counts were finally summarized by gene name. Each sample was processed twice, from sequencing through generation of mapping data, and the two data sets were merged. Using the edgeR 3.22.1 package in the statistical software R 3.5.0, the merged data were filtered with min.count=30 and min.total.count=0 to retain only the 12,499 expressed genes with sufficient counts, and the genes log-normalized with the voom function of the limma 3.36.1 package were used in subsequent analyses.
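 The idea behind the final filtering and log-normalization step can be sketched as follows. This is a deliberately simplified stand-in and does not reproduce the actual edgeR filtering rule or the voom procedure (which also estimates precision weights); the keep rule, the `min_samples` parameter, the offsets 0.5 and 1, and the toy counts are assumptions made for illustration.

```python
import math

def filter_and_log_cpm(counts, min_count=30, min_samples=1):
    """Filter low-count genes, then return log2 counts-per-million values.

    counts maps gene name -> list of raw read counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total reads per sample, computed before filtering.
    lib_sizes = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    result = {}
    for gene, row in counts.items():
        # Simplified rule: keep a gene if enough samples reach min_count reads.
        if sum(c >= min_count for c in row) < min_samples:
            continue
        # log2-CPM with small offsets so that zero counts stay finite.
        result[gene] = [
            math.log2((c + 0.5) / (lib + 1.0) * 1e6)
            for c, lib in zip(row, lib_sizes)
        ]
    return result

# Hypothetical raw counts for three placeholder genes in two samples.
toy_counts = {"GENE_A": [100, 200], "GENE_B": [0, 5], "GENE_C": [900, 795]}
log_cpm = filter_and_log_cpm(toy_counts, min_count=30, min_samples=2)
```

 In this toy input, the low-count gene is dropped and the remaining genes are expressed on a log2 scale that is comparable across samples with different sequencing depths.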
 <Construction of a prediction system from RNA-seq data>
 Using the log-normalized gene expression data of the 23 lines, prediction was performed for the five types in which the disease-onset tissue was represented by four or more lines. The characteristic genes used for prediction were ranked by the probability obtained from a two-sample t-test between the two groups defined by the presence or absence of the onset tissue. Using leave-one-out cross-validation (LOOCV), the highest prediction rate was measured while varying the number of top-ranked characteristic genes from 1 to 100. Because leave-one-out cross-validation produces 23 training data sets, the ranking of characteristic genes was determined by a two-sample t-test on the training data each time before prediction. A support vector machine (SVM) was used for learning, with four kernels: linear, polynomial, radial basis function, and maximum entropy. Because leave-one-out cross-validation yields 23 predictions, these were aggregated to calculate the accuracy and the AUC (area under the receiver operating characteristic curve), and the highest accuracy was obtained. When several settings gave the same highest accuracy, the one with the largest AUC was recorded. To evaluate this result, the 12,499 × 23 matrix of gene expression data for the 23 lines was filled with uniform random numbers, and the highest prediction rate obtained by the same leave-one-out cross-validation was used for comparison.
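 The key point of the procedure above, namely that the t-test ranking is recomputed from the training data inside every leave-one-out round, can be sketched as follows. Two simplifications are assumptions of this sketch and not part of the Examples: genes are ranked by the absolute pooled-variance t statistic (which orders genes the same way as the two-sided p-value when group sizes are fixed within a round), and a nearest-centroid rule is swapped in for the SVM to keep the code short. All names and the toy data are hypothetical.

```python
import math
import statistics

def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (each group needs >= 2 values)."""
    na, nb = len(a), len(b)
    diff = statistics.mean(a) - statistics.mean(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))

def loocv(X, y, n_genes):
    """Return (accuracy, genes selected in each round); y holds 0/1 labels."""
    correct, selected = 0, []
    for held_out in range(len(X)):
        train = [i for i in range(len(X)) if i != held_out]
        pos = [i for i in train if y[i] == 1]
        neg = [i for i in train if y[i] == 0]
        # Rank genes using the training samples of this round only.
        scores = [(abs(t_statistic([X[i][g] for i in pos],
                                   [X[i][g] for i in neg])), g)
                  for g in range(len(X[0]))]
        top = [g for _, g in sorted(scores, key=lambda s: (-s[0], s[1]))[:n_genes]]
        selected.append(top)
        # Stand-in classifier: nearest class centroid on the selected genes.
        x = [X[held_out][g] for g in top]
        c_pos = [statistics.mean([X[i][g] for i in pos]) for g in top]
        c_neg = [statistics.mean([X[i][g] for i in neg]) for g in top]
        d_pos = sum((u - v) ** 2 for u, v in zip(x, c_pos))
        d_neg = sum((u - v) ** 2 for u, v in zip(x, c_neg))
        correct += (1 if d_pos < d_neg else 0) == y[held_out]
    return correct / len(X), selected

# Hypothetical data: 6 lines x 3 genes; only gene 0 separates the groups.
X = [[5.0, 1.0, 2.0], [5.2, 2.0, 1.0], [4.9, 1.5, 2.5],
     [1.0, 1.2, 2.2], [1.1, 1.9, 1.1], [0.8, 1.4, 2.4]]
y = [1, 1, 1, 0, 0, 0]
accuracy, selected_per_round = loocv(X, y, n_genes=1)
```

 Repeating the call for `n_genes` from 1 to 100 and keeping the setting with the best accuracy would mirror the search over the number of characteristic genes described above.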
 The results are shown in Table 3 and FIG. 7. Table 3 shows the numbers of positive and negative data, the highest accuracy and the probability from its one-sample t-test, the AUC and the probability from its one-sample t-test, and the number of characteristic genes. FIG. 7 shows the AUC. For the one-sample t-test, the mean and sample standard deviation of the highest accuracies obtained from 10 runs with uniform random numbers were used as the population, with 9 degrees of freedom, and a right-tailed one-sided test was performed for the case in which the prediction rate obtained with the 23 cell lines is larger. The T.DIST.RT function of Excel was used for the test. As a result, the highest accuracy of 95.7% and the AUC of 1.0 for brain, and the AUC of 1.0 for skeletal muscle, were significant at p<0.05. Although not significant in this analysis, the AUC was also high, at 0.92, for skin and the metabolic system.
Figure JPOXMLDOC01-appb-T000003
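 For reference, the accuracy and AUC reported above can be computed from the aggregated leave-one-out predictions along the following lines. This is a generic sketch of the two metrics with hypothetical scores and labels; it does not reproduce the values in Table 3.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true 0/1 labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # A positive scored above a negative counts 1; a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for four held-out lines.
labels = [1, 1, 0, 0]
perfect = auc([0.9, 0.8, 0.3, 0.1], labels)
imperfect = auc([0.9, 0.2, 0.3, 0.1], labels)
```

 An AUC of 1.0 means that every diseased-type line received a higher score than every disease-free-type line.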
 <Characteristic genes in brain and skeletal muscle>
 In the predictions for brain, skeletal muscle, skin, and the metabolic system, the numbers of genes giving the highest prediction rate were 17, 1, 58, and 51, respectively. Because these genes are ranked differently in each round of leave-one-out cross-validation, the Ensembl gene numbers used in each round are shown in FIGS. 1 to 4 together with the number of times each was used. The gene names shown in FIGS. 1 to 4 are those assigned by the HUGO (Human Genome Organisation) Gene Nomenclature Committee (HGNC). These results show that disease onset can be predicted with a combination of a small number of genes.
1 Selection device
2 Cell line selection unit
3 Acquisition unit
4 Ranking determination unit
5 Gene selection unit (遺伝子選択部)
6 Learning unit
7 Prediction rate measurement unit
8 Gene selection unit (遺伝子選定部)

Claims (14)

  1.  A method for selecting genes suitable for predicting a specific disease type, comprising the following step:
     (1) a step of applying a statistical method and machine learning to gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type, and selecting genes suitable for predicting the disease type.
  2.  The method according to claim 1, wherein step (1) comprises:
     (1a) a step of determining a ranking of characteristic genes by applying a statistical method to gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type; and
     (1b) a step of selecting genes suitable for predicting the disease type by applying machine learning to the top one or more characteristic genes in the ranking.
  3.  The method according to claim 2, wherein, in step (1a), the characteristic genes are ranked according to the degree of difference in expression level obtained by comparing the gene expression level of stem cells derived from a subject who has developed a disease belonging to the disease type with the gene expression level of stem cells derived from a subject who has not developed a disease belonging to the disease type.
  4.  The method according to claim 2 or 3, wherein, in step (1b), the number of characteristic genes giving the highest prediction rate by machine learning is determined, and the genes included in that number are selected as genes suitable for predicting the disease type.
  5.  A program for selecting genes suitable for predicting a specific disease type, the program causing a computer to execute the following step:
     (1) a step of applying a statistical method and machine learning to gene expression levels in stem cells derived from a subject who has not developed a disease belonging to the disease type and in stem cells derived from a subject who has developed a disease belonging to the disease type, and selecting genes suitable for predicting the disease type.
  6.  A computer-readable recording medium on which the selection program according to claim 5 is recorded.
  7.  A method for determining the possibility of onset of a disease belonging to a specific disease type, comprising the following step:
     (A) a step of determining the possibility of onset of a disease belonging to the disease type based on the expression level, in stem cells derived from a subject, of one or more genes listed in any one of FIGS. 1 to 4.
  8.  The method according to claim 7, wherein in the step (A),
    the one or more genes listed in FIG. 1 are one or more genes selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, EME2, BRD7, SELENBP1, METTL3, OSER1, and FBXO41,
    the one or more genes listed in FIG. 2 are RP2,
    the one or more genes listed in FIG. 3 are one or more genes selected from the group consisting of BAG6, KIAA2026, GATAD2A, PPP4C, NTMT1, MAZ, ABL1, YTHDC1, GSK3B, SNX13, PDZD4, ARHGAP23, TMEM250, AC016739, ZNRF1, PUF60, SAMD4B, PPP1R14B, SF3B5, MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, ACOT7, RNASEH1-AS1, NUP62, CCDC71, LMNB2, SLC39A3, COG3, SGTA, POLR3E, NCAPH2, ZSWIM4, MPV17L2, AGPAT1, BRF1, CCDC14, TEDC2, LONP1, C4orf3, UPF1, AL031708, and PSMA7, and
    the one or more genes listed in FIG. 4 are one or more genes selected from the group consisting of FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L, C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCEAL5, PIGO, SLC2A4RG, TMEM14C, CASC15, ATRAID, PSME4, GET1, ANKRD54, FKBP5, FAM89B, CLTB, and DGKK.
  9.  The method according to claim 7, wherein in the step (A),
    the one or more genes listed in FIG. 1 are one or more genes selected from the group consisting of MYO19, SKA1, TRIM11, WDR47, LENG8, NAB2, KHDRBS3, SYF2, NSUN5P1, and EME2,
    the one or more genes listed in FIG. 2 are RP2,
    the one or more genes listed in FIG. 3 are one or more genes selected from the group consisting of BAG6, KIAA2026, GATAD2A, PPP4C, NTMT1, MAZ, ABL1, YTHDC1, GSK3B, SNX13, PDZD4, ARHGAP23, TMEM250, AC016739, ZNRF1, PUF60, SAMD4B, PPP1R14B, SF3B5, MLST8, ZC3H18, PKN1, LSM10, THAP4, AURKAIP1, CD320, WDR4, N4BP3, RPL7P9, TRAF2, ISOC2, SPOUT1, ATP6V0B, and ACOT7, and
    the one or more genes listed in FIG. 4 are one or more genes selected from the group consisting of FASTKD5, DDAH2, UBE2L3, SIAH2, ICE1, ZFPL1, SFR1, ACSL1, TKFC, CREB3L4, INTS7, SLTM, SLC44A2, ZC3H7A, TCERG1, MTRF1L, C3orf18, TTC38, TUBE1, PATL1, MOAP1, KDR, PRUNE2, ITPRIPL1, TBK1, UBE2Q1, PTRH2, ABCC4, CPEB4, DDAH2, TCEAL5, PIGO, and SLC2A4RG.
  10.  The method according to any one of claims 7 to 9, wherein in the step (A), the determination is made using a model trained by machine learning using, as training data, the gene expression levels of stem cells derived from a subject who has not developed a disease belonging to the disease type and the gene expression levels of stem cells derived from a subject who has developed a disease belonging to the disease type.
  11.  The method according to any one of claims 7 to 9, wherein the disease type is a disease of the brain, skeletal muscle, skin, or metabolic system.
  12.  The method according to any one of claims 7 to 9, further comprising:
     (A0) a step of measuring the expression level of one or more genes listed in any one of FIGS. 1 to 4 in the stem cells derived from the subject.
  13.  The method according to any one of claims 1 to 3 and 7 to 9, wherein the stem cells are pluripotent stem cells.
  14.  The method according to any one of claims 1 to 3 and 7 to 9, wherein the stem cells are induced pluripotent stem (iPS) cells.
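
The two-stage selection recited in claims 2 to 4 (rank genes statistically, then let machine learning decide how many top-ranked genes to keep) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the claims: the ranking statistic (Welch's t), the classifier (nearest centroid), the use of leave-one-out accuracy as the "prediction rate", the function names, and the toy data with the placeholder genes gene_a and gene_b.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (an assumed choice of statistic)."""
    denom = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if denom == 0 else (mean(a) - mean(b)) / denom


def rank_genes(samples):
    """Step (1a): order genes by |t| between onset and non-onset subjects.

    samples: list of (expression, label) pairs, where expression maps a gene
    name to its expression level and label is 1 (onset) or 0 (no onset).
    """
    def score(gene):
        onset = [expr[gene] for expr, label in samples if label == 1]
        no_onset = [expr[gene] for expr, label in samples if label == 0]
        return abs(welch_t(onset, no_onset))
    return sorted(samples[0][0], key=score, reverse=True)


def predict(train, expr, genes):
    """Nearest-centroid classifier restricted to the given genes."""
    best_label, best_dist = None, math.inf
    for label in (0, 1):
        rows = [e for e, y in train if y == label]
        dist = sum((expr[g] - mean([r[g] for r in rows])) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(samples, genes):
    """Leave-one-out accuracy, used here as the 'prediction rate'."""
    hits = 0
    for i, (expr, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        hits += predict(train, expr, genes) == label
    return hits / len(samples)


def select_genes(samples):
    """Step (1b): keep the top-k genes whose k gives the best accuracy."""
    ranking = rank_genes(samples)
    best_k = max(range(1, len(ranking) + 1),
                 key=lambda k: loo_accuracy(samples, ranking[:k]))
    return ranking[:best_k]


# Made-up toy values: gene_a separates the groups, gene_b does not.
toy_samples = [
    ({"gene_a": 1.0, "gene_b": 5.0}, 0),
    ({"gene_a": 1.2, "gene_b": 3.0}, 0),
    ({"gene_a": 0.9, "gene_b": 4.0}, 0),
    ({"gene_a": 3.0, "gene_b": 4.5}, 1),
    ({"gene_a": 3.3, "gene_b": 3.5}, 1),
    ({"gene_a": 2.9, "gene_b": 5.5}, 1),
]

print(select_genes(toy_samples))  # ['gene_a']
```

When several values of k tie for the best accuracy, this sketch keeps the smallest one. It also ranks the genes on all samples before cross-validating, which leaks information into the accuracy estimate; a stricter evaluation would repeat the ranking inside each held-out fold.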
PCT/JP2022/046394 2021-12-17 2022-12-16 Method for selecting gene for use in estimation of possibility of onset of disease, and method for estimating possibility of onset of disease WO2023113013A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021-205495 2021-12-17
JP2021205495 2021-12-17

Publications (1)

Publication Number Publication Date
WO2023113013A1 true WO2023113013A1 (en) 2023-06-22

Family

ID=86774448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/046394 WO2023113013A1 (en) 2021-12-17 2022-12-16 Method for selecting gene for use in estimation of possibility of onset of disease, and method for estimating possibility of onset of disease

Country Status (1)

Country Link
WO (1) WO2023113013A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005323573A (en) * 2004-05-17 2005-11-24 Sumitomo Pharmaceut Co Ltd Method for analyzing gene expression data and, method for screening disease marker gene and its utilization
JP2017501137A (en) * 2013-12-02 2017-01-12 オンコメッド ファーマシューティカルズ インコーポレイテッド Identification of predictive biomarkers associated with WNT pathway inhibitors
JP2021145635A (en) * 2020-03-23 2021-09-27 住友化学株式会社 Determination method of factor giving impact to endocrine system of chemical material

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BONDER MARC JAN; SMAIL CRAIG; GLOUDEMANS MICHAEL J.; FRéSARD LAURE; JAKUBOSKY DAVID; D’ANTONIO MATTEO; LI XIN; FERRARO : "Identification of rare and common regulatory variants in pluripotent cells using population-scale transcriptomics", NATURE GENETICS, NATURE PUBLISHING GROUP US, NEW YORK, vol. 53, no. 3, 1 January 1900 (1900-01-01), New York, pages 313 - 321, XP037414604, ISSN: 1061-4036, DOI: 10.1038/s41588-021-00800-7 *
TAO WENSI, AYALA-HAEDO JUAN A., FIELD MATTHEW G., PELAEZ DANIEL, WESTER SARA T.: "RNA-Sequencing Gene Expression Profiling of Orbital Adipose-Derived Stem Cell Population Implicate HOX Genes and WNT Signaling Dysregulation in the Pathogenesis of Thyroid-Associated Orbitopathy", INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, ASSOCIATION FOR RESEARCH IN VISION AND OPHTHALMOLOGY, US, vol. 58, no. 14, 6 December 2017 (2017-12-06), US, pages 6146, XP093072117, ISSN: 1552-5783, DOI: 10.1167/iovs.17-22237 *
VLASOV IVAN N., ALIEVA ANELYA KH., NOVOSADOVA EKATERINA V., ARSENYEVA ELENA L., ROSINSKAYA ANNA V., PARTEVIAN SUZANNA A., GRIVENNI: "Transcriptome Analysis of Induced Pluripotent Stem Cells and Neuronal Progenitor Cells, Derived from Discordant Monozygotic Twins with Parkinson’s Disease", CELLS, vol. 10, no. 12, 9 December 2021 (2021-12-09), pages 3478, XP093072118, DOI: 10.3390/cells10123478 *
YAO FANG, ZHANG CHI, DU WEI, LIU CHAO, XU YING: "Identification of Gene-Expression Signatures and Protein Markers for Breast Cancer Grading and Staging", PLOS ONE, vol. 10, no. 9, 16 September 2015 (2015-09-16), pages e0138213, XP093072123, DOI: 10.1371/journal.pone.0138213 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22907535

Country of ref document: EP

Kind code of ref document: A1