JP2023052011A

JP2023052011A - Deep learning-based techniques for pre-training deep convolutional neural networks

Info

Publication number: JP2023052011A
Application number: JP2022204685A
Authority: JP
Inventors: Hong Gao; Kai-How FARH; Reddy Padigepati Samskruthi
Original assignee: Illumina Inc
Current assignee: Illumina Inc
Priority date: 2018-10-15
Filing date: 2022-12-21
Publication date: 2023-04-11
Also published as: IL271091A; WO2020081122A1; CN111328419B; IL271091B; AU2021269351A1; JP2021501923A; AU2019272062A1; JP7200294B2; CN111328419A; KR20200044731A; CN113705585A; SG11201911777QA; IL282689A; NZ759665A; SG10202108013QA; KR102165734B1; JP2021152907A; JP6888123B2; AU2019272062B2; AU2021269351B2

Abstract

PROBLEM TO BE SOLVED: To provide systems and methods to reduce overfitting of neural network-implemented models that process sequences of amino acids and accompanying position frequency matrices.

SOLUTION: The system generates supplemental training example sequence pairs, labelled benign, that include an arrangement advancing from a start position to an end position through a target amino acid position. A supplemental sequence pair supplements a pathogenic or benign missense training example sequence pair. It has identical amino acids in a reference and an alternate sequence of amino acids. The system includes logic to input, with each supplemental sequence pair, a supplemental training position frequency matrix (PFM) that is identical to the PFM of the benign or pathogenic missense at the matching start and end positions. The system includes logic to attenuate the training influence of the training PFMs while training a neural network-implemented model by including supplemental training example PFMs in the training example data.

SELECTED DRAWING: Figure 1
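
As a rough illustration of the supplemental examples described in the abstract, the following Python sketch builds a benign-labelled pair whose reference and alternate sequences are identical and which reuses the PFM of the missense example it supplements. The dictionary layout, the field names, the two-column toy PFM, and the `make_supplemental_pair` helper are assumptions made for illustration only; they are not the claimed implementation.

```python
def make_supplemental_pair(missense_example):
    """Build a benign supplemental example from a missense training example.

    `missense_example` is a hypothetical dict with keys:
      ref   - reference amino acid sequence (str)
      alt   - alternate amino acid sequence (str)
      pfm   - position frequency matrix, one row of frequencies per position
      label - "benign" or "pathogenic"
    """
    ref = missense_example["ref"]
    return {
        "ref": ref,
        "alt": ref,  # identical amino acids in reference and alternate
        "pfm": [row[:] for row in missense_example["pfm"]],  # same PFM, copied
        "label": "benign",  # supplemental pairs are always labelled benign
    }


# Toy missense example: five positions, target position (index 2) changes A -> V.
# A real PFM would have one column per amino acid; two columns keep the toy short.
missense = {
    "ref": "MKALT",
    "alt": "MKVLT",
    "pfm": [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.7, 0.3], [0.6, 0.4]],
    "label": "pathogenic",
}
supplemental = make_supplemental_pair(missense)
training_set = [missense, supplemental]
```

Because the supplemental pair carries the same PFM as the missense pair but a benign label and no amino acid change, a model trained on both cannot rely on the PFM alone to separate the two examples.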

Description

Priority Applications
This application claims priority to U.S. continuation-in-part patent application Ser. No. 16/407,149, entitled "DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS", filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US), which is a continuation-in-part of, and claims priority to, the following three PCT applications and three U.S. nonprovisional applications, all filed on October 15, 2018: (1) PCT Patent Application No. PCT/US2018/055840, entitled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS" (Attorney Docket No. ILLM 1000-8/IP-1611-PCT); (2) PCT Patent Application No. PCT/US2018/055878, entitled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION" (Attorney Docket No. ILLM 1000-9/IP-1612-PCT); (3) PCT Patent Application No. PCT/US2018/055881, entitled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS" (Attorney Docket No. ILLM 1000-10/IP-1613-PCT); (4) U.S. Nonprovisional Patent Application No. 16/160,903, entitled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS" (Attorney Docket No. ILLM 1000-5/IP-1611-US); (5) U.S. Nonprovisional Patent Application No. 16/160,986, entitled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION" (Attorney Docket No. ILLM 1000-6/IP-1612-US); and (6) U.S. Nonprovisional Patent Application No. 16/160,968, entitled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS" (Attorney Docket No. ILLM 1000-7/IP-1613-US). All three PCT applications and all three U.S. nonprovisional applications claim priority to and/or the benefit of the four U.S. provisional applications listed below.

U.S. Provisional Patent Application No. 62/573,144, entitled "TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA", filed October 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV).

U.S. Provisional Patent Application No. 62/573,149, entitled "PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNS)", filed October 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV).

U.S. Provisional Patent Application No. 62/573,153, entitled "DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA", filed October 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV).

U.S. Provisional Patent Application No. 62/582,898, entitled "PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs)", filed November 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV).

Citations
The following documents are incorporated by reference for all purposes as if fully set forth herein.

Hong Gao, Kai-How Farh, Laksshman Sundaram, and Jeremy Francis McRae, U.S. Provisional Patent Application No. 62/573,144, entitled "TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA", filed October 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV).

Laksshman Sundaram, Kai-How Farh, Hong Gao, Samskruthi Reddy Padigepati, and Jeremy Francis McRae, U.S. Provisional Patent Application No. 62/573,149, entitled "PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNS)", filed October 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV).

Hong Gao, Kai-How Farh, Laksshman Sundaram, and Jeremy Francis McRae, U.S. Provisional Patent Application No. 62/573,153, entitled "DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA", filed October 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV).

Hong Gao, Kai-How Farh, and Laksshman Sundaram, U.S. Provisional Patent Application No. 62/582,898, entitled "PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs)", filed November 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV).

Hong Gao, Kai-How Farh, Laksshman Sundaram, and Jeremy Francis McRae, International Patent Application No. PCT/US18/55840, entitled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS", filed October 15, 2018 (Attorney Docket No. ILLM 1000-8/IP-1611-PCT).

Laksshman Sundaram, Kai-How Farh, Hong Gao, Samskruthi Reddy Padigepati, and Jeremy Francis McRae, PCT Patent Application No. PCT/US2018/55878, entitled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION", filed October 15, 2018 (Attorney Docket No. ILLM 1000-9/IP-1612-PCT).

Laksshman Sundaram, Kai-How Farh, Hong Gao, and Jeremy Francis McRae, International Patent Application No. PCT/US2018/55881, entitled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS", filed October 15, 2018 (Attorney Docket No. ILLM 1000-10/IP-1613-PCT).

Hong Gao, Kai-How Farh, Laksshman Sundaram, and Jeremy Francis McRae, U.S. Nonprovisional Patent Application No. 16/160,903, entitled "DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS", filed October 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US).

Laksshman Sundaram, Kai-How Farh, Hong Gao, and Jeremy Francis McRae, U.S. Nonprovisional Patent Application No. 16/160,986, entitled "DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION", filed October 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US).

Laksshman Sundaram, Kai-How Farh, Hong Gao, and Jeremy Francis McRae, U.S. Nonprovisional Patent Application No. 16/160,968, entitled "SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS", filed October 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US).

Document 1 - A.V.D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WAVENET: A GENERATIVE MODEL FOR RAW AUDIO", arXiv:1609.03499, 2016

Document 2 - S.O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, "DEEP VOICE: REAL-TIME NEURAL TEXT-TO-SPEECH", arXiv:1702.07825, 2017

Document 3 - F. Yu and V. Koltun, "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS", arXiv:1511.07122, 2016

Document 4 - K. He, X. Zhang, S. Ren, and J. Sun, "DEEP RESIDUAL LEARNING FOR IMAGE RECOGNITION", arXiv:1512.03385, 2015

Document 5 - R.K. Srivastava, K. Greff, and J. Schmidhuber, "HIGHWAY NETWORKS", arXiv:1505.00387, 2015

Document 6 - G. Huang, Z. Liu, L. van der Maaten, and K.Q. Weinberger, "DENSELY CONNECTED CONVOLUTIONAL NETWORKS", arXiv:1608.06993, 2017

Document 7 - C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "GOING DEEPER WITH CONVOLUTIONS", arXiv:1409.4842, 2014

Document 8 - S. Ioffe and C. Szegedy, "BATCH NORMALIZATION: ACCELERATING DEEP NETWORK TRAINING BY REDUCING INTERNAL COVARIATE SHIFT", arXiv:1502.03167, 2015

Document 9 - J.M. Wolterink, T. Leiner, M.A. Viergever, and I. Isgum, "DILATED CONVOLUTIONAL NEURAL NETWORKS FOR CARDIOVASCULAR MR SEGMENTATION IN CONGENITAL HEART DISEASE", arXiv:1704.03669, 2017

Document 10 - L.C. Piqueras, "AUTOREGRESSIVE MODEL BASED ON A DEEP CONVOLUTIONAL NEURAL NETWORK FOR AUDIO GENERATION", Tampere University of Technology, 2016

Document 11 - J. Wu, "Introduction to Convolutional Neural Networks", Nanjing University, 2017

Document 12 - I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "CONVOLUTIONAL NETWORKS", Deep Learning, MIT Press, 2016

Document 13 - J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang, "RECENT ADVANCES IN CONVOLUTIONAL NEURAL NETWORKS", arXiv:1512.07108, 2017

Document 1 describes a deep convolutional neural network architecture that accepts an input sequence and produces an output sequence that scores entries in the input sequence. The architecture uses groups of residual blocks with convolution filters having the same convolution window size, batch normalization layers, rectified linear unit (abbreviated ReLU) layers, dimensionality altering layers, atrous convolution layers with exponentially growing atrous convolution rates, skip connections, and a softmax classification layer. The disclosed technology uses the neural network components and parameters described in Document 1. In one implementation, the disclosed technology modifies the parameters of the neural network components described in Document 1. For example, unlike in Document 1, the atrous convolution rate in the disclosed technology increases non-exponentially from a lower residual block group to a higher residual block group. In another example, unlike in Document 1, the convolution window size in the disclosed technology varies between groups of residual blocks.
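
To make the contrast between exponential and non-exponential dilation schedules concrete, the sketch below computes the receptive field of a stack of stride-1 dilated 1-D convolutions. The window sizes and rate schedules are made-up values chosen for illustration; they are not the parameters of Document 1 or of the disclosed technology.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 1-D convolutions.

    `layers` is a list of (window_size, dilation_rate) tuples.
    Each layer widens the field by (window_size - 1) * dilation_rate.
    """
    field = 1
    for window_size, rate in layers:
        field += (window_size - 1) * rate
    return field


# Hypothetical schedules, four layers each.
exponential = [(3, 1), (3, 2), (3, 4), (3, 8)]      # rate doubles per layer
non_exponential = [(3, 1), (3, 1), (5, 2), (5, 2)]  # rate and window vary by group

print(receptive_field(exponential))      # 31
print(receptive_field(non_exponential))  # 21
```

Both the dilation rate and the window size enter the same product, which is why a schedule can trade one against the other when the rates do not grow exponentially.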

Document 2 describes details of the deep convolutional neural network architecture described in Document 1.

Document 3 describes the atrous convolutions used by the disclosed technology. As used herein, atrous convolutions are also referred to as "dilated convolutions". Atrous/dilated convolutions allow for large receptive fields with few trainable parameters. An atrous/dilated convolution is a convolution in which the kernel is applied over an area larger than its length by skipping input values with a certain step, also called the atrous convolution rate or dilation factor. Atrous/dilated convolutions add spacing between the elements of a convolution filter/kernel so that neighboring input entries (e.g., nucleotides, amino acids) at larger intervals are considered when the convolution operation is performed. This enables incorporation of long-range contextual dependencies in the input. Atrous convolutions conserve partial convolution calculations for reuse as adjacent nucleotides are processed.
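
The skipping behaviour described for Document 3 can be sketched in a few lines of plain Python. This is a minimal, unpadded 1-D version written as a cross-correlation; the function name, the toy signal, and the kernel are assumptions for illustration and do not come from Document 3 or the disclosed technology.

```python
def dilated_conv1d(inputs, kernel, rate):
    """Valid (no padding) 1-D dilated convolution, written as cross-correlation.

    Kernel taps are applied `rate` positions apart, so a kernel of length k
    spans (k - 1) * rate + 1 input entries.
    """
    span = (len(kernel) - 1) * rate + 1
    outputs = []
    for start in range(len(inputs) - span + 1):
        total = 0.0
        for tap, weight in enumerate(kernel):
            total += weight * inputs[start + tap * rate]
        outputs.append(total)
    return outputs


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

print(dilated_conv1d(signal, kernel, rate=1))  # [-2.0, -2.0, -2.0, -2.0, -2.0]
print(dilated_conv1d(signal, kernel, rate=2))  # [-4.0, -4.0, -4.0]
```

With `rate=2` the same three weights cover five input entries instead of three, which is how the receptive field grows without adding trainable parameters.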

Document 4 describes the residual blocks and residual connections used by the disclosed technology.

Document 5 describes the skip connections used by the disclosed technology. As used herein, skip connections are also referred to as "highway networks".
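
As a minimal sketch of the idea behind a residual (identity skip) connection, the block below adds a block's input back onto the output of its transform. It shows the ungated residual form only, not the gated form of highway networks; `toy_transform` is a hypothetical stand-in for the convolutional layers inside a real residual block.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return x + transform(x); the identity path is the skip connection."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input length")
    return [a + b for a, b in zip(x, fx)]


def toy_transform(x):
    """Hypothetical transform: scale each entry by 0.5, then apply ReLU."""
    return relu([0.5 * v for v in x])


print(residual_block([2.0, -4.0, 6.0], toy_transform))  # [3.0, -4.0, 9.0]
```

Where the transform outputs zero (the negative entry here), the input still passes through unchanged along the skip path.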

Document 6 describes the densely connected convolutional network architectures used by the disclosed technology.

Document 7 describes the dimensionality altering convolution layers and module-based processing pipelines used by the disclosed technology. One example of a dimensionality altering convolution is a 1×1 convolution.
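
A 1×1 convolution changes the number of channels at each position without mixing neighbouring positions. The sketch below shows this for a short sequence; the channel counts and weight values are arbitrary assumptions chosen so the arithmetic is easy to follow.

```python
def conv1x1(features, weights):
    """Apply a 1x1 convolution along a sequence.

    features: list of positions, each a list of C_in channel values.
    weights:  C_out rows, each holding C_in weights.
    Every position is mapped independently from C_in to C_out channels.
    """
    return [
        [sum(w * v for w, v in zip(row, position)) for row in weights]
        for position in features
    ]


# Three positions with 4 channels each, reduced to 2 channels.
features = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 2, 2, 2]]
weights = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]

print(conv1x1(features, weights))  # [[1.0, 5.0], [0.0, 1.0], [2.0, 4.0]]
```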

Document 8 describes the batch normalization layers used by the disclosed technology.
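
For orientation, the following sketch normalizes one feature across a batch using the batch mean and variance, then applies a scale and shift. It covers only the training-time computation for a single feature; the default values of `gamma`, `beta`, and `eps` are assumptions, and running statistics for inference are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```

After normalization the batch has a mean close to 0 and a variance close to 1, before `gamma` and `beta` rescale it.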

Document 9 also describes the atrous/dilated convolutions used by the disclosed technology.

Document 10 describes various architectures of deep neural networks that can be used by the disclosed technology, including convolutional neural networks, deep convolutional neural networks, and deep convolutional neural networks with atrous/dilated convolutions.

Document 11 describes details of a convolutional neural network that can be used by the disclosed technology, including algorithms for training a convolutional neural network with subsampling layers (e.g., pooling) and fully connected layers.

Document 12 describes details of various convolution operations that can be used by the disclosed technology.

Document 13 describes various architectures of convolutional neural networks that can be used by the disclosed technology.

Field of the Disclosed Technology
The disclosed technology relates to artificial intelligence type computers and digital data processing systems, and to corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems). It includes systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. Specifically, the disclosed technology relates to using deep learning-based techniques for training deep convolutional neural networks. In particular, it relates to pre-training deep convolutional neural networks to avoid overfitting.

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.

Machine Learning
In machine learning, input variables are used to predict an output variable. The input variables are often called features and are denoted by $X = (X_1, X_2, \ldots, X_k)$, where each $X_i$, $i \in \{1, \ldots, k\}$, is a feature. The output variable is often called the response or dependent variable and is denoted by $Y$. The relationship between $Y$ and the corresponding $X$ can be written in the general form:

$$Y = f(X) + \epsilon$$

In the equation above, $f$ is a function of the features $(X_1, X_2, \ldots, X_k)$ and $\epsilon$ is a random error term. The error term is independent of $X$ and has a mean value of zero.

In practice, the features $X$ are available without having $Y$ or knowing the exact relationship between $X$ and $Y$. Since the error term has a mean value of zero, the goal is to estimate $f$:

$$\hat{Y} = \hat{f}(X)$$

In the equation above, $\hat{f}$ is the estimate of $f$. It is often regarded as a black box, meaning that only the relationship between the input and output of $\hat{f}$ is known, while the question of why it works remains unanswered.

The function $\hat{f}$ is found using learning. Supervised learning and unsupervised learning are two approaches used in machine learning for this task. In supervised learning, labeled data is used for training. By showing the inputs and the corresponding outputs (= labels), the function $\hat{f}$ is optimized so that it approximates the output. In unsupervised learning, the goal is to find hidden structure in unlabeled data. The algorithm has no measure of accuracy on the input data, which distinguishes it from supervised learning.
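
As a small supervised-learning illustration of estimating f from labeled data, the sketch below fits a straight line by ordinary least squares. The ground-truth function f(x) = 2x + 1, the noise level, and the names `fit_line` and `f_hat` are assumptions made for this example; the text itself does not prescribe any particular model.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(100)]
# Assumed ground truth f(x) = 2x + 1 plus a zero-mean error term.
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

slope, intercept = fit_line(xs, ys)


def f_hat(x):
    """Learned estimate of f."""
    return slope * x + intercept
```

Because the error term has zero mean, the fitted slope and intercept land close to the assumed values of 2 and 1, and `f_hat` can then be applied to inputs that have no label.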

Neural Networks
A neural network is a system of interconnected artificial neurons (e.g., $a_1$, $a_2$, $a_3$) that exchange messages with each other. The illustrated neural network has three inputs, two neurons in the hidden layer, and two neurons in the output layer. The hidden layer has an activation function $f(\cdot)$ and the output layer has an activation function $g(\cdot)$. The connections have numeric weights (e.g., $w_{11}$, $w_{21}$, $w_{12}$, $w_{31}$, $w_{22}$, $w_{32}$, $v_{11}$, $v_{22}$) that are tuned during the training process, so that a properly trained network responds correctly when given an image to recognize. The input layer processes the raw input, and the hidden layer processes the output from the input layer based on the weights of the connections between the input layer and the hidden layer. The output layer takes the output from the hidden layer and processes it based on the weights of the connections between the hidden layer and the output layer. The network includes multiple layers of feature-detecting neurons. Each layer has many neurons that respond to different combinations of inputs from the previous layers. The layers are constructed so that the first layer detects a set of primitive patterns in the input image data, the second layer detects patterns of patterns, and the third layer detects patterns of those patterns.
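
The forward pass of the small network described above (three inputs, two hidden neurons, two output neurons) can be sketched as follows. The weight values and the choice of a sigmoid for both f(·) and g(·) are assumptions for illustration; the text does not specify them.

```python
import math


def sigmoid(z):
    """Logistic activation, used here for both f and g as an assumption."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, w, v, f=sigmoid, g=sigmoid):
    """Forward pass through one hidden layer.

    w[i][j] is the weight from input i to hidden neuron j.
    v[j][k] is the weight from hidden neuron j to output neuron k.
    """
    hidden = [
        f(sum(inputs[i] * w[i][j] for i in range(len(inputs))))
        for j in range(len(w[0]))
    ]
    outputs = [
        g(sum(hidden[j] * v[j][k] for j in range(len(hidden))))
        for k in range(len(v[0]))
    ]
    return hidden, outputs


w = [[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]]  # 3 inputs x 2 hidden neurons
v = [[0.7, 0.9], [0.8, 1.0]]              # 2 hidden x 2 output neurons
hidden, outputs = forward([1.0, 0.0, -1.0], w, v)
```

Training would adjust the entries of `w` and `v`; this sketch only evaluates the network for fixed weights.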

A neural network model is trained using training samples before it is used to predict outputs for production samples. The quality of the trained model's predictions is assessed using a test set of samples that are not given as input during training. If the model correctly predicts the outputs for the test samples, it can be used for inference with high confidence. However, if the model does not correctly predict the outputs for the test samples, the model is said to be overfitted to the training data and not to generalize to the unseen test data.
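
The gap between training and test performance can be demonstrated with a deliberately overfitted model. The sketch below uses a 1-nearest-neighbour lookup that memorizes its training labels; the data-generating function, noise level, and all names are assumptions for illustration and are unrelated to the neural network models discussed here.

```python
import random


def mse(predict, xs, ys):
    """Mean squared error of `predict` over paired inputs and labels."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable


def noisy_label(x):
    """Assumed ground truth 2x plus zero-mean Gaussian noise."""
    return 2.0 * x + rng.gauss(0.0, 1.0)


train_x = [float(i) for i in range(10)]
train_y = [noisy_label(x) for x in train_x]
test_x = [i + 0.5 for i in range(9)]  # held out: never seen during training
test_y = [noisy_label(x) for x in test_x]


def memorizer(x):
    """1-nearest-neighbour lookup: reproduces training labels exactly."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


train_error = mse(memorizer, train_x, train_y)
test_error = mse(memorizer, test_x, test_y)
```

The training error is exactly zero because every training label is reproduced, while the error on the held-out test points is larger, which is the signature of overfitting described above.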

An overview of the applications of deep learning in genetics can be found in the following publications:
・ T. Ching et al., Opportunities And Obstacles For Deep Learning In Biology And Medicine, www.biorxiv.org:142760, 2017
・ Angermueller C, Parnamaa T, Parts L, Stegle O, Deep Learning For Computational Biology. Mol Syst Biol. 2016;12:878
・ Park Y, Kellis M, 2015 Deep Learning For Regulatory Genomics. Nat. Biotechnol. 33, 825-826, (doi:10.1038/nbt.3313)
・ Min S, Lee B, and Yoon S, Deep Learning In Bioinformatics. Brief. Bioinform. bbw068 (2016)
・ Leung MK, Delong A, Alipanahi B, et al., Machine Learning In Genomic Medicine: A Review of Computational Problems and Data Sets, 2016
・ Libbrecht MW, Noble WS, Machine Learning Applications In Genetics and Genomics. Nature Reviews Genetics 2015;16(6):321-32

米国仮特許出願第62/573,144号U.S. Provisional Patent Application No. 62/573,144 米国仮特許出願第62/573,149号U.S. Provisional Patent Application No. 62/573,149 米国仮特許出願第62/573,153号U.S. Provisional Patent Application No. 62/573,153 米国仮特許出願第62/582,898号U.S. Provisional Patent Application No. 62/582,898 国際特許出願第PCT/US18/55840号International Patent Application No. PCT/US18/55840 PCT特許出願第PCT/US2018/55878号PCT Patent Application No. PCT/US2018/55878 国際特許出願第PCT/US2018/55881号(代理人整理番号第ILLM 1000-10/IP-1613-PCT)International Patent Application No. PCT/US2018/55881 (Attorney Docket No. ILLM 1000-10/IP-1613-PCT) 米国特許出願第16/160,903号(代理人整理番号第ILLM 1000-5/IP-1611-US)U.S. Patent Application No. 16/160,903 (Attorney Docket No. ILLM 1000-5/IP-1611-US) 米国特許出願第16/160,986号(代理人整理番号第ILLM 1000-6/IP-1612-US)U.S. Patent Application No. 16/160,986 (Attorney Docket No. ILLM 1000-6/IP-1612-US) 米国特許出願第16/160,968号(代理人整理番号第ILLM 1000-7/IP-1613-US)U.S. Patent Application No. 16/160,968 (Attorney Docket No. ILLM 1000-7/IP-1613-US) 国際特許出願公開第WO07010252号International Patent Application Publication No. WO07010252 国際特許出願第PCTGB2007/003798号International Patent Application No. PCTGB2007/003798 米国特許出願公開第2009/0088327号U.S. Patent Application Publication No. 2009/0088327 米国特許出願公開第2016/0085910号U.S. Patent Application Publication No. 2016/0085910 米国特許出願公開第2013/0296175号U.S. Patent Application Publication No. 2013/0296175 国際特許出願公開第WO 04/018497号International Patent Application Publication No. WO 04/018497 米国特許第7057026号U.S. Patent No. 7057026 国際特許出願公開第WO 91/06678号International Patent Application Publication No. WO 91/06678 国際特許出願公開第WO 07/123744号International Patent Application Publication No. WO 07/123744 米国特許第7329492号U.S. Patent No. 7329492 米国特許第7211414号U.S. Patent No. 7211414 米国特許第7315019号U.S. Patent No. 7315019 米国特許第7405281号U.S. Patent No. 7405281 米国特許出願公開第2008/0108082号U.S. Patent Application Publication No. 2008/0108082 米国特許第5641658号U.S. Patent No. 5,641,658 米国特許出願公開第2002/0055100号U.S. Patent Application Publication No. 
2002/0055100 米国特許第7115400号U.S. Patent No. 7115400 米国特許出願公開第2004/0096853号U.S. Patent Application Publication No. 2004/0096853 米国特許出願公開第2004/0002090号U.S. Patent Application Publication No. 2004/0002090 米国特許出願公開第2007/0128624号U.S. Patent Application Publication No. 2007/0128624 米国特許出願公開第2008/0009420号U.S. Patent Application Publication No. 2008/0009420 米国特許出願公開第2007/0099208A1号U.S. Patent Application Publication No. 2007/0099208A1 米国特許出願公開第2007/0166705A1号U.S. Patent Application Publication No. 2007/0166705A1 米国特許出願公開第2008/0280773A1号U.S. Patent Application Publication No. 2008/0280773A1 米国特許出願第13/018255号U.S. Patent Application No. 13/018255

A.V.D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WAVENET: A GENERATIVE MODEL FOR RAW AUDIO," arXiv:1609.03499, 2016
S.O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, "DEEP VOICE: REAL-TIME NEURAL TEXT-TO-SPEECH," arXiv:1702.07825, 2017
F. Yu and V. Koltun, "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS," arXiv:1511.07122, 2016
K. He, X. Zhang, S. Ren, and J. Sun, "DEEP RESIDUAL LEARNING FOR IMAGE RECOGNITION," arXiv:1512.03385, 2015
R.K. Srivastava, K. Greff, and J. Schmidhuber, "HIGHWAY NETWORKS," arXiv:1505.00387, 2015
G. Huang, Z. Liu, L. van der Maaten, and K.Q. Weinberger, "DENSELY CONNECTED CONVOLUTIONAL NETWORKS," arXiv:1608.06993, 2017
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "GOING DEEPER WITH CONVOLUTIONS," arXiv:1409.4842, 2014
S. Ioffe and C. Szegedy, "BATCH NORMALIZATION: ACCELERATING DEEP NETWORK TRAINING BY REDUCING INTERNAL COVARIATE SHIFT," arXiv:1502.03167, 2015
J.M. Wolterink, T. Leiner, M.A. Viergever, and I. Isgum, "DILATED CONVOLUTIONAL NEURAL NETWORKS FOR CARDIOVASCULAR MR SEGMENTATION IN CONGENITAL HEART DISEASE," arXiv:1704.03669, 2017
L.C. Piqueras, "AUTOREGRESSIVE MODEL BASED ON A DEEP CONVOLUTIONAL NEURAL NETWORK FOR AUDIO GENERATION," Tampere University of Technology, 2016
J. Wu, "Introduction to Convolutional Neural Networks," Nanjing University, 2017
I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "CONVOLUTIONAL NETWORKS," Deep Learning, MIT Press, 2016
J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang, "RECENT ADVANCES IN CONVOLUTIONAL NEURAL NETWORKS," arXiv:1512.07108, 2017
T. Ching et al., Opportunities And Obstacles For Deep Learning In Biology And Medicine, www.biorxiv.org:142760, 2017
Angermueller C, Parnamaa T, Parts L, Stegle O, Deep Learning For Computational Biology. Mol Syst Biol. 2016;12:878
Park Y, Kellis M, 2015 Deep Learning For Regulatory Genomics. Nat. Biotechnol. 33, 825-826, (doi:10.1038/nbt.3313)
Min S, Lee B, and Yoon S, Deep Learning In Bioinformatics. Brief. Bioinform. bbw068 (2016)
Leung MK, Delong A, Alipanahi B et al., Machine Learning In Genomic Medicine: A Review of Computational Problems and Data Sets, 2016
Libbrecht MW, Noble WS, Machine Learning Applications In Genetics and Genomics. Nature Reviews Genetics 2015;16(6):321-32
Bentley et al., Nature 456:53-59 (2008)
Lizardi et al., Nat. Genet. 19:225-232 (1998)
Dunn, Tamsen and Berry, Gwenn and Emig-Agius, Dorothea and Jiang, Yu and Iyer, Anita and Udar, Nitin and Stromberg, Michael, 2017, Pisces: An Accurate and Versatile Single Sample Somatic and Germline Variant Caller, 595-595, 10.1145/3107411.3108203

In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale; emphasis is instead generally placed on illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings.

An architectural level schematic of a system in which supplemental training examples are used to reduce overfitting during training of a variant pathogenicity prediction model.
An example architecture of a deep residual network for pathogenicity prediction, referred to herein as "PrimateAI".
A schematic illustration of PrimateAI, the deep learning network architecture for pathogenicity classification.
One implementation of the workings of a convolutional neural network.
A block diagram of training a convolutional neural network in accordance with one implementation of the technology disclosed.
An example missense variant and the corresponding supplemental benign training example.
The disclosed pre-training of the pathogenicity prediction model using supplemental datasets.
Training of the pre-trained pathogenicity prediction model after the pre-training epochs.
Application of the trained pathogenicity prediction model to evaluate unlabeled variants.
A position-specific frequency matrix starting point for an example amino acid sequence, with a pathogenic missense variant and the corresponding supplemental benign training example.
A position-specific frequency matrix starting point for an example amino acid sequence, with a benign missense variant and the corresponding supplemental benign training example.
Construction of position-specific frequency matrices for primate, mammal, and vertebrate amino acid sequences.
An example one-hot encoding of a human reference amino acid sequence and a human alternate amino acid sequence.
Examples of inputs to the variant pathogenicity prediction model.
A simplified block diagram of a computer system that can be used to implement the technology disclosed.

The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

[Introduction]
Sections of this application are repeated from the applications incorporated by reference, to provide background for the improvement disclosed. The prior applications disclosed a deep learning system trained using non-human primate missense variant data, as explained below. Before providing the background, we introduce the improvement disclosed.

The inventors have observed empirically that some patterns of training sometimes cause the deep learning system to overemphasize the position-specific frequency matrix input. Overfitting to the position-specific frequency matrix can diminish the ability of the system to distinguish an amino acid missense that is typically benign, such as R->K, from an amino acid missense that is typically deleterious, such as R->W. Supplementing the training set with specifically chosen training examples can reduce or counteract overfitting and improve training outcomes.

Supplemental training examples, labeled benign, include the same position-specific frequency matrix ("PFM") as a missense training example, which may be unlabeled (and presumed pathogenic), labeled pathogenic, or labeled benign. The intuitive effect of these supplemental benign training examples is to force the backpropagation training to distinguish between benign and pathogenic on a basis other than the position-specific frequency matrix.

A supplemental benign training example is constructed to contrast against a pathogenic or unlabeled example in the training set. A supplemental benign training example can also reinforce a benign missense example. For contrast, the pathogenic missense can be a curated pathogenic missense, or it can be a combinatorially generated example in the training set. The chosen benign variant can be a synonymous variant, which expresses the same amino acid from two different codons, that is, from two different trinucleotide sequences that code for the same amino acid. When a synonymous benign variant is used, it is not randomly constructed; instead, it is selected from synonymous variants observed in a sequenced population. The synonymous variant is likely to be a human variant, because more sequence data is available for humans than for other primates, mammals, or vertebrates. A supplemental benign training example has the same amino acid sequence in both the reference and the alternate amino acid sequence. Alternatively, the chosen benign variant can simply be at the same position as the training example against which it contrasts. This potentially can be as effective in counteracting overfitting as the use of synonymous benign variants.
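The construction described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `TrainingExample` container, the `make_supplemental_benign` helper, the label encoding (0 for benign, 1 for pathogenic or presumed pathogenic), and the toy five-residue window are all assumptions introduced here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TrainingExample:
    """Hypothetical container for one training example."""
    ref_seq: str      # reference amino acid sequence
    alt_seq: str      # alternate amino acid sequence
    pfm: np.ndarray   # position-specific frequency matrix, shape (len(seq), 20)
    label: int        # assumed encoding: 0 = benign, 1 = pathogenic


def make_supplemental_benign(example: TrainingExample) -> TrainingExample:
    """Build a benign example with identical ref/alt sequences and the same PFM."""
    return TrainingExample(
        ref_seq=example.ref_seq,
        alt_seq=example.ref_seq,   # same amino acids in reference and alternate
        pfm=example.pfm.copy(),    # same PFM as the contrasting missense example
        label=0,                   # labeled benign
    )


# Toy R->W missense at the centre of a 5-residue window (illustrative values only).
rng = np.random.default_rng(0)
toy_pfm = rng.dirichlet(np.ones(20), size=5)
missense = TrainingExample("ALRKD", "ALWKD", toy_pfm, label=1)
supplemental = make_supplemental_benign(missense)
```

Because the two examples share a PFM but carry different labels, the PFM alone cannot separate them, which is the intuition the text gives for why such pairs counteract overfitting.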

Use of the supplemental benign training examples can be discontinued after the initial training epochs, or it can continue throughout training, because the examples accurately reflect nature.

[Convolutional Neural Networks]
A convolutional neural network is a special type of neural network. The fundamental difference between a densely connected layer and a convolutional layer is this: dense layers learn global patterns in their input feature space, whereas convolutional layers learn local patterns. In the case of images, the patterns are found in small 2D windows of the input. This key characteristic gives convolutional neural networks two interesting properties: (1) the patterns they learn are translation invariant, and (2) they can learn spatial hierarchies of patterns.

Regarding the first property, after learning a certain pattern in the lower-right corner of a picture, a convolutional layer can recognize it anywhere, for example in the upper-left corner. A densely connected network would have to learn the pattern anew if it appeared at a new location. This makes convolutional neural networks data efficient, because they need fewer training samples to learn representations that have generalization power.

Regarding the second property, a first convolutional layer can learn small local patterns such as edges, a second convolutional layer learns larger patterns made of the features of the first layer, and so on. This allows convolutional neural networks to efficiently learn increasingly complex and abstract visual concepts.

A convolutional neural network learns highly non-linear mappings by interconnecting layers of artificial neurons, arranged in many different layers, with activation functions that make the layers dependent on one another. It includes one or more convolutional layers, interspersed with one or more sub-sampling layers and non-linear layers, which are typically followed by one or more fully connected layers. Each element of the convolutional neural network receives inputs from a set of features in the previous layer. The convolutional neural network learns concurrently because the neurons in the same feature map have identical weights. These local shared weights reduce the complexity of the network, so that when multi-dimensional input data enters the network, the convolutional neural network avoids the complexity of data reconstruction in the feature extraction and regression or classification process.

Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. For a black-and-white picture, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height. Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB input; rather, they stand for filters. Filters encode specific aspects of the input data: at a high level, a single filter could encode the concept "presence of a face in the input," for instance.

For example, the first convolutional layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32): it computes 32 filters over its input. Each of these 32 output channels contains a 26×26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input. That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.

Convolutions are defined by two key parameters: (1) the size of the patches extracted from the inputs, typically 1×1, 3×3, or 5×5, and (2) the depth of the output feature map, that is, the number of filters computed by the convolution. Often these start with a depth of 32, continue to a depth of 64, and terminate with a depth of 128 or 256.

A convolution works by sliding these windows of size 3×3 or 5×5 over the 3D input feature map, stopping at every location, and extracting the 3D patch of surrounding features (shape (window_height, window_width, input_depth)). Each such 3D patch is then transformed (via a tensor product with the same learned weight matrix, called the convolution kernel) into a 1D vector of shape (output_depth,). All of these vectors are then spatially reassembled into a 3D output map of shape (height, width, output_depth). Every spatial location in the output feature map corresponds to the same location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input). For instance, with 3×3 windows, the vector output[i, j, :] comes from the 3D patch input[i-1:i+1, j-1:j+1, :]. The full process is detailed in FIG. 4 (labeled 400).
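The sliding-window procedure can be illustrated with a deliberately naive NumPy sketch. The function name `conv2d_valid`, the absence of padding, the stride of 1, and the random inputs are assumptions made for illustration. Note also that the sketch indexes each window from its top-left corner, so `output[i, j]` covers input rows `i` to `i + 2`, which differs from the centered indexing in the text by a one-position offset.

```python
import numpy as np


def conv2d_valid(feature_map, kernel, bias):
    """Naive convolution with no padding and stride 1.

    feature_map: (height, width, input_depth)
    kernel:      (window_height, window_width, input_depth, output_depth)
    bias:        (output_depth,)
    """
    height, width, in_depth = feature_map.shape
    win_h, win_w, k_in_depth, out_depth = kernel.shape
    assert in_depth == k_in_depth, "kernel depth must match input depth"

    out_h, out_w = height - win_h + 1, width - win_w + 1
    output = np.empty((out_h, out_w, out_depth))
    for i in range(out_h):
        for j in range(out_w):
            # Extract the 3D patch under the window.
            patch = feature_map[i:i + win_h, j:j + win_w, :]
            # Tensor product with the shared kernel gives a vector (output_depth,).
            output[i, j, :] = np.tensordot(patch, kernel, axes=3) + bias
    return output


rng = np.random.default_rng(0)
image = rng.normal(size=(28, 28, 1))       # one-channel 28x28 input
kernel = rng.normal(size=(3, 3, 1, 32))    # 32 filters with 3x3 windows
response = conv2d_valid(image, kernel, np.zeros(32))
print(response.shape)  # (26, 26, 32)
```

The same `kernel` is reused at every window position, which is why the number of learned weights does not grow with the spatial size of the input.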

A convolutional neural network comprises convolutional layers that perform the convolution operation between the input values and convolution filters (matrices of weights) that are learned over many gradient update iterations during training. Let (m, n) be the filter size and W be the matrix of weights; then a convolutional layer performs a convolution of W with the input X by calculating the dot product W · x + b, where x is an instance of X and b is the bias. The step size by which the convolution filters slide across the input is called the stride, and the filter area (m × n) is called the receptive field. The same convolution filter is applied across different positions of the input, which reduces the number of weights learned. It also allows location-invariant learning: if an important pattern exists in the input, the convolution filters learn it no matter where it is in the sequence.
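A small one-dimensional sketch may help make stride, receptive field, and location invariance concrete. The `conv1d` helper, the hand-picked filter, and the toy sequences are assumptions for illustration; in a trained network the filter weights would be learned.

```python
import numpy as np


def conv1d(x, w, b=0.0, stride=1):
    """Slide filter w over sequence x, computing w . window + b at each step."""
    m = len(w)  # receptive field of the filter
    positions = range(0, len(x) - m + 1, stride)
    return np.array([np.dot(w, x[p:p + m]) + b for p in positions])


w = np.array([-1.0, 2.0, -1.0])        # one shared filter, receptive field 3
pattern = np.array([0.0, 1.0, 0.0])    # the pattern the filter responds to

early = np.zeros(12)
early[2:5] = pattern                   # pattern near the start of the sequence
late = np.zeros(12)
late[7:10] = pattern                   # same pattern near the end

resp_early = conv1d(early, w)
resp_late = conv1d(late, w)
resp_strided = conv1d(early, w, stride=2)
```

The peak response has the same value in both sequences and simply moves with the pattern, while a larger stride reduces the number of positions evaluated.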

[Training a Convolutional Neural Network]
As further background, FIG. 5 depicts a block diagram 500 of training a convolutional neural network in accordance with one implementation of the technology disclosed. The convolutional neural network is adjusted or trained so that the input data leads to a specific output estimate. The convolutional neural network is adjusted using backpropagation, based on a comparison of the output estimate and the ground truth, until the output estimate progressively matches or approaches the ground truth.

The convolutional neural network is trained by adjusting the weights between the neurons based on the difference between the ground truth and the actual output. This is mathematically described as:

$$\Delta w_i = x_i \delta$$

where $x_i$ is the input associated with weight $w_i$ and $\delta$ = (ground truth) - (actual output).

In one implementation, the training rule is defined as:

$$w_{nm} \leftarrow w_{nm} + \alpha (t_m - \varphi_m) a_n$$

In the equation above, the arrow indicates an update of the value, $t_m$ is the target value of neuron $m$, $\varphi_m$ is the computed current output of neuron $m$, $a_n$ is input $n$, and $\alpha$ is the learning rate.
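As a hedged illustration of the training rule, the sketch below applies it repeatedly to a single layer. It assumes, beyond what the text states, that the neuron output is the plain weighted sum of its inputs (no non-linearity); the function name `delta_rule_step` and the toy values are likewise hypothetical.

```python
import numpy as np


def delta_rule_step(w, a, t, lr):
    """One update w_nm <- w_nm + lr * (t_m - phi_m) * a_n.

    w: weights, shape (N, M); a: inputs a_n, shape (N,); t: targets t_m, shape (M,).
    Assumption: the current output phi_m is the linear weighted sum a @ w.
    """
    phi = a @ w                           # computed current outputs phi_m
    return w + lr * np.outer(a, t - phi)  # outer product gives every (n, m) update


a = np.array([1.0, 0.5])   # inputs a_n
t = np.array([1.0])        # target t_m
w = np.zeros((2, 1))       # initial weights w_nm
for _ in range(50):
    w = delta_rule_step(w, a, t, lr=0.5)
```

With these values each step shrinks the output error by a constant factor, so the output `a @ w` approaches the target.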

The intermediary step in the training includes generating a feature vector from the input data using the convolutional layers. The gradient with respect to the weights in each layer is calculated, starting at the output. This is referred to as the backward pass, or going backwards. The weights in the network are updated using a combination of the negative gradient and the previous weights.

In one implementation, the convolutional neural network uses a stochastic gradient update algorithm (such as ADAM) that performs backward propagation of errors by means of gradient descent. One example of a sigmoid function based backpropagation algorithm is described below:

$$\varphi = f(h) = \frac{1}{1 + e^{-h}}$$

In the sigmoid function above, $h$ is the weighted sum computed by a neuron. The sigmoid function has the following derivative:

$$\frac{\partial \varphi}{\partial h} = \varphi (1 - \varphi)$$

The algorithm includes computing the activation of all neurons in the network, yielding an output for the forward pass. The activation of neuron $m$ in the hidden layers is described as:

$$\varphi_m = \frac{1}{1 + e^{-h_m}}$$

This is done for all the hidden layers to get the activation described as:

$$\varphi_k = \frac{1}{1 + e^{-h_k}}$$

Then, the error and the correct weights are calculated per layer. The error at the output is computed as:

$$\delta_{ok} = (t_k - \varphi_k) \varphi_k (1 - \varphi_k)$$

The error in the hidden layers is calculated as:

$$\delta_{hm} = \varphi_m (1 - \varphi_m) \sum_k v_{mk} \delta_{ok}$$

The weights of the output layer are updated as:

$$v_{mk} \leftarrow v_{mk} + \alpha \delta_{ok} \varphi_m$$

The weights of the hidden layers are updated using the learning rate $\alpha$ as:

$$w_{nm} \leftarrow w_{nm} + \alpha \delta_{hm} a_n$$
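The forward pass, error terms, and weight updates above can be put together in a short sketch for a network with one hidden layer. The function names, the omission of bias terms, the squared-error loss used only for monitoring, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np


def sigmoid(h):
    """Sigmoid activation of the weighted sum h."""
    return 1.0 / (1.0 + np.exp(-h))


def backprop_step(a, t, w, v, lr):
    """One forward and backward pass for a single example (biases omitted).

    a: inputs (N,), t: targets (K,),
    w: input-to-hidden weights (N, M), v: hidden-to-output weights (M, K).
    """
    phi_m = sigmoid(a @ w)                            # hidden activations
    phi_k = sigmoid(phi_m @ v)                        # output activations
    delta_ok = (t - phi_k) * phi_k * (1 - phi_k)      # error at the output
    delta_hm = phi_m * (1 - phi_m) * (v @ delta_ok)   # error in the hidden layer
    v_new = v + lr * np.outer(phi_m, delta_ok)        # output-layer update
    w_new = w + lr * np.outer(a, delta_hm)            # hidden-layer update
    loss = 0.5 * np.sum((t - phi_k) ** 2)             # loss before the update
    return w_new, v_new, loss


rng = np.random.default_rng(0)
a = np.array([0.5, -1.0, 0.25])
t = np.array([1.0, 0.0])
w = rng.normal(scale=0.5, size=(3, 4))
v = rng.normal(scale=0.5, size=(4, 2))
losses = []
for _ in range(200):
    w, v, loss = backprop_step(a, t, w, v, lr=0.5)
    losses.append(loss)
```

The hidden-layer error is computed from the output-layer weights as they were before the update, which matches the order in which the equations are stated.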

In one implementation, the convolutional neural network uses a gradient descent optimization to compute the error across all the layers. In such an optimization, for an input feature vector $x$ and the predicted output $\hat{y}$, the loss function is defined as $l$ for the cost of predicting $\hat{y}$ when the target is $y$, i.e. $l(\hat{y}, y)$. The predicted output $\hat{y}$ is transformed from the input feature vector $x$ using function $f$. Function $f$ is parameterized by the weights of the convolutional neural network, i.e. $\hat{y} = f_w(x)$. The loss function is described as $l(\hat{y}, y) = l(f_w(x), y)$, or $Q(z, w) = l(f_w(x), y)$, where $z$ is an input and output data pair $(x, y)$. The gradient descent optimization is performed by updating the weights according to:

$$v_{t+1} = \mu v_t - \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t)$$

$$w_{t+1} = w_t + v_{t+1}$$

In the equations above, $\alpha$ is the learning rate and $\mu$ is the momentum. The loss is computed as the average over a set of $n$ data pairs. The computation is terminated when the learning rate $\alpha$ is small enough upon linear convergence. In other implementations, the gradient is calculated using only selected data pairs, fed to Nesterov's accelerated gradient and an adaptive gradient, to gain computational efficiency.

In one implementation, the convolutional neural network uses stochastic gradient descent (SGD) to calculate the cost function. SGD approximates the gradient with respect to the weights in the loss function by computing it from only one randomized data pair, $z_t$, described as:

$$v_{t+1} = \mu v_t - \alpha \nabla_w Q(z_t, w_t)$$

$$w_{t+1} = w_t + v_{t+1}$$

In the equations above, $\alpha$ is the learning rate, $\mu$ is the momentum, and $t$ is the current weight state before updating. The convergence speed of SGD is approximately $O(1/t)$ when the learning rate $\alpha$ is reduced both fast enough and slow enough. In other implementations, the convolutional neural network uses different loss functions, such as Euclidean loss and softmax loss. In a further implementation, an Adam stochastic optimizer is used by the convolutional neural network.
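The momentum form of the SGD update can be sketched on a toy problem. The function name `sgd_momentum`, the one-parameter model y = w·x, the squared loss, the noise-free data, and the hyperparameter values are all assumptions chosen so that the behavior is easy to verify by hand.

```python
import numpy as np


def sgd_momentum(grad_q, data, w0, lr=0.1, momentum=0.5, steps=1000, seed=0):
    """SGD with momentum: v <- mu*v - alpha*grad Q(z_t, w);  w <- w + v."""
    rng = np.random.default_rng(seed)
    w = float(w0)
    v = 0.0
    for _ in range(steps):
        z_t = data[rng.integers(len(data))]      # one randomly chosen data pair
        v = momentum * v - lr * grad_q(z_t, w)   # velocity update
        w = w + v                                # weight update
    return w


def grad_q(z, w):
    """Gradient of Q(z, w) = 0.5 * (w*x - y)**2 with respect to w."""
    x, y = z
    return (w * x - y) * x


# Noise-free pairs generated by y = 3x, so the minimizer is w = 3.
data = [(x, 3.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
w_fit = sgd_momentum(grad_q, data, w0=0.0)
```

Each step uses the gradient from a single data pair rather than the average over all pairs, which is the approximation the text describes.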

Additional disclosure and explanation of convolutional layers, sub-sampling layers, and non-linear layers is found in the applications incorporated by reference, along with convolution examples and an explanation of training by backpropagation. Also covered in the material incorporated by reference are architectural variations on basic CNN technology.

One variation on the iterative balanced sampling described previously is to select the entire elite training set in one or two cycles instead of twenty. There may be enough distinction, learned by semi-supervised training, between known benign training examples and reliably classified predicted pathogenic variants that just one or two training cycles, or three to five training cycles, are sufficient to assemble the elite training set. Modification of the disclosed methods and devices to describe just one or two cycles, or a range of three to five cycles, is hereby disclosed and can readily be accomplished by converting the previously disclosed iteration to one or two, or three to five, cycles.

[Deep Learning in Genomics]
Genetic variations can help explain many diseases. Every human being has a unique genetic code, and there are many genetic variants within a group of individuals. Most deleterious genetic variants have been depleted from genomes by natural selection. It is important to identify which genetic variations are likely to be pathogenic or deleterious. This will help researchers focus on the likely pathogenic genetic variants and accelerate the pace of diagnosis and cure of many diseases.

Modeling the properties and functional effects (for example, pathogenicity) of variants is an important but challenging task in the field of genomics. Despite the rapid advancement of functional genomic sequencing technologies, interpretation of the functional consequences of variants remains a great challenge because of the complexity of cell type-specific transcription regulation systems.

Advances in biochemical technologies over the past decades have given rise to next generation sequencing (NGS) platforms that quickly produce genomic data at much lower cost than ever before. Such overwhelmingly large volumes of sequenced DNA remain difficult to annotate. Supervised machine learning algorithms typically perform well when large amounts of labeled data are available. In bioinformatics and many other data-rich disciplines, the process of labeling instances is costly; unlabeled instances, however, are inexpensive and readily available. In a scenario where the amount of labeled data is relatively small and the amount of unlabeled data is substantially larger, semi-supervised learning represents a cost-effective alternative to manual labeling.

An opportunity arises to use semi-supervised algorithms to construct deep learning-based pathogenicity classifiers that accurately predict the pathogenicity of variants. Databases of pathogenic variants that are free from human ascertainment bias may result.

Regarding pathogenicity classifiers, deep neural networks are a type of artificial neural network that uses multiple non-linear and complex transforming layers to successively model high-level features. Deep neural networks provide feedback via backpropagation, which carries the difference between observed and predicted output to adjust the parameters. Deep neural networks have evolved with the availability of large training datasets, the power of parallel and distributed computing, and sophisticated training algorithms. Deep neural networks have facilitated major advances in numerous domains such as computer vision, speech recognition, and natural language processing.

Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are components of deep neural networks. Convolutional neural networks have succeeded particularly in image recognition, with an architecture that comprises convolutional layers, non-linear layers, and pooling layers. Recurrent neural networks are designed to utilize sequential information in the input data, with cyclic connections among building blocks such as perceptrons, long short-term memory units, and gated recurrent units. In addition, many other emergent deep neural networks have been proposed for limited contexts, such as deep spatio-temporal neural networks, multi-dimensional recurrent neural networks, and convolutional auto-encoders.

The goal of training a deep neural network is to optimize the weight parameters in each layer, which gradually combines simpler features into complex features so that the most suitable hierarchical representations can be learned from the data. A single cycle of the optimization process is organized as follows. First, given a training dataset, the forward pass sequentially computes the output of each layer and propagates the function signals forward through the network. At the final output layer, an objective loss function measures the error between the inferred outputs and the given labels. To minimize the training error, the backward pass applies the chain rule to backpropagate the error signals and compute gradients with respect to all weights throughout the neural network. Finally, the weight parameters are updated using an optimization algorithm based on stochastic gradient descent. Whereas batch gradient descent performs a parameter update for each complete dataset, stochastic gradient descent provides stochastic approximations by performing an update for each small set of data examples. Several optimization algorithms stem from stochastic gradient descent. For example, the Adagrad and Adam training algorithms perform stochastic gradient descent while adaptively modifying the learning rate based on the update frequency and the moments of the gradients for each parameter, respectively.
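The optimization cycle described above can be sketched in a few lines. The following is a minimal illustration only, assuming a one-feature logistic model and a toy dataset; the function names `sigmoid` and `train_sgd`, the learning rate, and the data are hypothetical and do not come from this application.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, lr=0.5, epochs=200, batch_size=2, seed=0):
    """Fit a one-feature logistic model with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                p = sigmoid(w * x + b)  # forward pass
                err = p - y             # d(cross-entropy loss)/dz at the output
                grad_w += err * x       # chain rule back to the weight
                grad_b += err           # chain rule back to the bias
            # Parameter update from the averaged mini-batch gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Toy (hypothetical) data: negative inputs labelled 0, positive inputs labelled 1.
toy = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_sgd(toy)
```

Setting `batch_size=len(toy)` in this sketch turns each epoch into a single batch gradient descent update over the complete dataset, which is the contrast drawn in the paragraph above.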

Another core element in training deep neural networks is regularization, which refers to strategies intended to avoid overfitting and thus achieve good generalization performance. For example, weight decay adds a penalty term to the objective loss function so that the weight parameters converge to smaller absolute values. Dropout randomly removes hidden units from the neural network during training and can be regarded as an ensemble of possible sub-networks. To enhance the capabilities of dropout, a new activation function, maxout, and a variant of dropout for recurrent neural networks called rnnDrop have been proposed. Furthermore, batch normalization provides a new regularization method through normalization of the scalar features for each activation within a mini-batch and learning of each mean and variance as parameters.
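Two of the regularization strategies named above, weight decay and dropout, can be illustrated as follows. This is a hedged sketch rather than the method of this application: the helper names `l2_penalized_loss` and `dropout`, the L2 form of the penalty, and the "inverted" rescaling convention are assumptions chosen for illustration.

```python
import random


def l2_penalized_loss(data_loss, weights, decay=1e-2):
    """Weight decay: add an L2 penalty so that large weights raise the objective."""
    return data_loss + decay * sum(w * w for w in weights)


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale survivors."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


penalized = l2_penalized_loss(1.0, [3.0, 4.0], decay=0.01)
dropped = dropout([1.0] * 1000, 0.5, random.Random(0))
```

Because a different random subset of units is removed on every call during training, each call effectively trains a different sub-network, which is why dropout can be viewed as an ensemble.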

Given that sequenced data are multi-dimensional and high-dimensional, deep neural networks hold great promise for bioinformatics research because of their broad applicability and enhanced predictive power. Convolutional neural networks have been adapted to solve sequence-based problems in genomics, such as motif discovery, pathogenic variant identification, and gene expression inference. Convolutional neural networks use a weight-sharing strategy that is especially useful for studying DNA, because it can capture sequence motifs, which are short, recurring local patterns in DNA that are presumed to have significant biological functions. A hallmark of convolutional neural networks is the use of convolution filters. Unlike traditional classification approaches that are based on elaborately designed, manually crafted features, convolution filters perform adaptive learning of features, analogous to a process of mapping raw input data to an informative representation of knowledge. In this sense, the convolution filters serve as a series of motif scanners, since a set of such filters is capable of recognizing relevant patterns in the input and updating themselves during the training procedure. Recurrent neural networks can capture long-range dependencies in sequential data of varying lengths, such as protein or DNA sequences.
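The "motif scanner" view of a convolution filter can be made concrete with a small sketch. It assumes a one-hot encoding over the bases A, C, G, T and a hand-set filter for the motif TATA; in a trained network the filter weights would be learned rather than fixed, and the names `one_hot` and `scan` are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def scan(seq, filt):
    """Slide a (width x 4) filter along the sequence; return one score per window."""
    x = one_hot(seq)
    width = len(filt)
    scores = []
    for i in range(len(x) - width + 1):
        s = sum(filt[j][k] * x[i + j][k] for j in range(width) for k in range(4))
        scores.append(s)
    return scores


# Hand-set filter whose weights equal the one-hot pattern of the motif itself.
tata_filter = one_hot("TATA")
scores = scan("GGTATACC", tata_filter)
best = max(range(len(scores)), key=scores.__getitem__)
```

The same filter weights are reused at every window position, which is the weight-sharing strategy referred to above; the highest score marks the window where the motif occurs.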

Therefore, a powerful computational model for predicting the pathogenicity of variants can have enormous benefits for both basic science and translational research.

Common polymorphisms represent natural experiments whose fitness has been tested by generations of natural selection. Comparing the allele frequency distributions for human missense and synonymous substitutions, we find that the presence of a missense variant at high allele frequency in a non-human primate species reliably predicts that the variant is also subject to natural selection in the human population. In contrast, common variants in more distantly related species experience negative selection as evolutionary distance increases.

We use common variation from six non-human primate species to train a semi-supervised deep learning network that accurately classifies clinical de novo missense mutations using sequence alone. With over 500 known species, the primate lineage contains sufficient common variation to systematically model the effects of most human variants of unknown significance.

The human reference genome harbors 70 million potential protein-altering missense substitutions, the vast majority of which are rare mutations whose effects on human health have not been characterized. These variants of unknown significance present a challenge for genome interpretation in clinical applications and are an obstacle to the long-term adoption of sequencing for population-wide screening and individualized medicine.

Cataloguing common variation across diverse human populations is an effective strategy for identifying clinically benign variation, but the common variation available in modern-day humans is limited by bottleneck events in the distant past of our species. Humans and chimpanzees share 99% sequence identity, suggesting that natural selection operating on chimpanzee variants has the potential to model the effects of variants that are identical-by-state in humans. Because the mean coalescence time for neutral polymorphisms in the human population is a fraction of the species' divergence time, naturally occurring chimpanzee variation largely explores mutational space that does not overlap with human variation, aside from rare instances of haplotypes maintained by balancing selection.

The recent availability of aggregated exome data from 60,706 humans makes it possible to test this hypothesis by comparing the allele frequency spectra for missense and synonymous mutations. Singleton variants in ExAC closely match the expected 2.2:1 missense:synonymous ratio predicted by de novo mutation after adjusting for mutational rate using trinucleotide context, but at higher allele frequencies the number of observed missense variants decreases because natural selection filters out deleterious variants. The pattern of missense:synonymous ratios across the allele frequency spectrum indicates that a large fraction of missense variants with a population frequency below 0.1% are mildly deleterious, that is, neither pathogenic enough to warrant immediate removal from the population nor neutral enough to be allowed to exist at high allele frequencies, which is consistent with prior observations on more limited population data. These findings support the widespread empirical practice in diagnostic laboratories of filtering out variants with allele frequencies above 0.1% to 1% as likely benign for penetrant genetic disease, aside from a handful of well-documented exceptions caused by balancing selection and founder effects.
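The comparison described above amounts to computing a missense:synonymous ratio within each allele frequency bin. The sketch below shows one way to do this under stated assumptions: the function name `ms_ratio_by_bin`, the bin edges, and the five toy variants are hypothetical and are not data from this application.

```python
from collections import defaultdict


def ms_ratio_by_bin(variants, bin_edges=(0.0, 0.001, 0.01, 1.0)):
    """Return the missense:synonymous ratio per allele-frequency bin (lo, hi].

    `variants` is an iterable of (allele_frequency, consequence) pairs, where
    consequence is "missense" or "synonymous". Bins without synonymous
    variants are skipped to avoid division by zero.
    """
    counts = defaultdict(lambda: {"missense": 0, "synonymous": 0})
    for af, consequence in variants:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo < af <= hi:
                counts[(lo, hi)][consequence] += 1
                break
    return {b: c["missense"] / c["synonymous"]
            for b, c in counts.items() if c["synonymous"]}


# Toy (made-up) variants: the rare bin is richer in missense than the common bin.
toy_variants = [
    (0.0005, "missense"), (0.0005, "missense"), (0.0005, "synonymous"),
    (0.05, "missense"), (0.05, "synonymous"),
]
ratios = ms_ratio_by_bin(toy_variants)
```

A ratio that falls as allele frequency rises is the signature of selection removing deleterious missense variants; a ratio that stays flat across bins suggests the variants in question are largely neutral.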

Repeating this analysis on the subset of human variants that are identical-by-state with common chimpanzee variants (observed more than once in chimpanzee population sequencing), we find that the missense:synonymous ratio is largely constant across the allele frequency spectrum. The high allele frequency of these variants in the chimpanzee population indicates that they have already passed through the sieve of natural selection in chimpanzee, and their neutral impact on fitness in human populations provides compelling evidence that the selective pressures on missense variants are highly concordant in the two species. The lower missense:synonymous ratio observed in chimpanzee is consistent with the larger effective population size in ancestral chimpanzee populations, which enables efficient filtering of mildly deleterious variants.

In contrast, rare chimpanzee variants (observed only once in chimpanzee population sequencing) show a modest decrease in the missense:synonymous ratio at higher allele frequencies. Simulating an identically sized cohort from human variation data, we estimate that only 64% of the variants observed once in a cohort of this size would have an allele frequency greater than 0.1% in the general population, compared to 99.8% for variants seen multiple times in the cohort, indicating that not all of the rare chimpanzee variants have passed through the sieve of selection. Overall, we estimate that 16% of the ascertained chimpanzee missense variants have an allele frequency below 0.1% in the general population and would be subject to negative selection at higher allele frequencies.

We next characterize human variants that are identical-by-state with variation observed in other non-human primate species (bonobo, gorilla, orangutan, rhesus, and marmoset). As with chimpanzee, we observe that the missense:synonymous ratios are roughly equivalent across the allele frequency spectrum, apart from a slight depletion of missense variation at high allele frequencies, which is presumed to result from the inclusion of a small number of rare variants (approximately 5 to 15%). These results suggest that the selective forces on missense variants are largely concordant within the primate lineage, at least out to the New World monkeys, which are estimated to have diverged from the human ancestral lineage approximately 35 million years ago.

Human missense variants that are identical-by-state with variants in other primates are strongly enriched for benign consequence in ClinVar. After excluding variants with unknown or conflicting annotations, we observe that human variants with primate orthologs are approximately 95% likely to be annotated as benign or likely benign in ClinVar, compared to 45% for missense variation in general. The small fraction of ClinVar variants from non-human primates that are classified as pathogenic is comparable to the fraction of pathogenic ClinVar variants that would be observed by ascertaining rare variants from a similarly sized cohort of healthy humans. A substantial fraction of these variants annotated as pathogenic or likely pathogenic received their classifications before the advent of large allele frequency databases and might be evaluated differently today.

The field of human genetics has long relied on model organisms to infer the clinical impact of human mutations, but the long evolutionary distance to most genetically tractable animal models raises concerns about the extent to which these findings generalize to humans. To examine the concordance of natural selection on missense variants in humans and more distantly related species, we extend the analysis beyond the primate lineage to include largely common variation from four additional mammalian species (mouse, pig, goat, cow) and two more distantly related vertebrates (chicken, zebrafish). In contrast to the earlier primate analyses, we observe that missense variation is markedly depleted at common allele frequencies compared to rare allele frequencies, especially at greater evolutionary distances, indicating that a substantial fraction of common missense variation in more distantly related species would experience negative selection in human populations. Nonetheless, the observation of a missense variant in more distantly related vertebrates still increases the likelihood of a benign consequence, because the fraction of common missense variants depleted by natural selection is far lower than the approximately 50% depletion for human missense variants at baseline. Consistent with these results, we find that human missense variants observed in mouse, dog, pig, and cow are approximately 85% likely to be annotated as benign or likely benign in ClinVar, compared to 95% for primate variation and 45% for the ClinVar database as a whole.

The presence of closely related pairs of species at varying evolutionary distances also provides an opportunity to evaluate the functional consequences of fixed missense substitutions in human populations. Within closely related pairs of species (branch length < 0.1) on the mammalian family tree, we observe that fixed missense variation is depleted at common allele frequencies compared to rare allele frequencies, indicating that a substantial fraction of inter-species fixed substitutions would be non-neutral in humans, even within the primate lineage. A comparison of the magnitude of missense depletion indicates that inter-species fixed substitutions are significantly less neutral than within-species polymorphisms. Interestingly, inter-species variation between closely related mammals is not substantially more pathogenic in ClinVar (83% likely to be annotated as benign or likely benign) than within-species common polymorphisms, suggesting that these changes do not abrogate protein function but rather reflect a tuning of protein function that confers species-specific adaptive advantages.

The large number of possible variants of unknown significance, and the crucial importance of accurate variant classification for clinical applications, have inspired many attempts to tackle the problem with machine learning, but these efforts have been largely limited by the insufficient quantity of common human variants and the dubious quality of annotations in curated databases. Variation from the six non-human primates contributes over 300,000 unique missense variants that do not overlap with common human variation and are largely of benign consequence, greatly enlarging the size of the training dataset that can be used for machine learning approaches.

Unlike earlier models, which employ a large number of human-engineered features and meta-classifiers, we apply a simple deep learning residual network that takes as input only the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species. To provide the network with information about protein structure, we train two separate networks to learn secondary structure and solvent accessibility from sequence alone, and incorporate these as sub-networks in the larger deep learning network to predict effects on protein structure. Using sequence as a starting point avoids potential biases in protein structure and functional domain annotation, which may be incompletely ascertained or inconsistently applied.

We use semi-supervised learning to overcome the problem that the training set contains only variants with benign labels, by initially training an ensemble of networks to separate likely benign primate variants from random unknown variants that are matched for mutation rate and sequencing coverage. This ensemble of networks is used to score the complete set of unknown variants and to influence the selection of unknown variants that seed the next iteration of the classifier, by biasing towards unknown variants with a more pathogenic predicted consequence. Gradual steps are taken at each iteration to prevent the model from prematurely converging to a suboptimal result.

Common primate variation also provides a clean validation dataset, completely independent of previously used training data, for evaluating existing methods, which have been difficult to evaluate objectively because of the proliferation of meta-classifiers. We evaluated the performance of our model, along with four other popular classification algorithms (Sift, Polyphen2, CADD, M-CAP), using 10,000 held-out primate common variants. Because roughly 50% of all human missense variants would be removed by natural selection at common allele frequencies, we calculated the 50th-percentile score for each classifier on a set of randomly picked missense variants that were matched to the 10,000 held-out primate common variants by mutational rate, and used that threshold to evaluate the held-out primate common variants. The accuracy of our deep learning model was significantly better than that of the other classifiers on this independent validation dataset, whether using deep learning networks trained only on human common variants or using both human common variants and primate variants.
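The thresholding procedure described above can be sketched as follows. The sketch assumes that a higher classifier score means "more pathogenic" and that scores below the threshold are called benign; the function name `benign_call_rate` and the toy scores are hypothetical, not values from this application.

```python
import statistics


def benign_call_rate(matched_scores, primate_scores):
    """Threshold at the median of matched random variants, then score the primate set.

    Returns (threshold, fraction of primate variants scoring below the threshold).
    Assumes higher scores indicate greater predicted pathogenicity.
    """
    threshold = statistics.median(matched_scores)  # 50th-percentile score
    benign = sum(1 for s in primate_scores if s < threshold)
    return threshold, benign / len(primate_scores)


# Toy (made-up) classifier scores.
threshold, rate = benign_call_rate(
    matched_scores=[0.1, 0.3, 0.5, 0.7, 0.9],
    primate_scores=[0.2, 0.4, 0.6, 0.1],
)
```

Fixing the threshold on the matched random set puts every classifier on the same footing, so the fraction of held-out primate variants called benign can be compared directly across classifiers regardless of their score scales.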

Recent trio sequencing studies have catalogued thousands of de novo mutations in patients with neurodevelopmental disorders and their healthy siblings, enabling assessment of the strength of various classification algorithms in separating de novo missense mutations in cases versus controls. For each of the four classification algorithms, we scored each de novo missense variant in cases versus controls and report the p-value from the Wilcoxon rank-sum test of the difference between the two distributions, showing that the deep learning method trained on primate variants (p ~ 10^-33) performed far better than the other classifiers (p ~ 10^-13 to 10^-19) in this clinical scenario. From the approximately 1.3-fold enrichment of de novo missense variants over expectation previously reported for this cohort, and from prior estimates that approximately 20% of missense variants produce loss-of-function effects, we would expect a perfect classifier to separate the two classes with a p-value of p ~ 10^-40.
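For readers unfamiliar with the test named above, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test using the normal approximation. It is illustrative only: it assigns midranks to ties but applies no tie correction to the variance, the function name `rank_sum_test` is hypothetical, and the toy scores are made up.

```python
import math


def rank_sum_test(cases, controls):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (z, p), where z > 0 means the case scores tend to rank higher.
    """
    pooled = sorted([(v, 0) for v in cases] + [(v, 1) for v in controls])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid = (i + j) / 2 + 1  # midrank shared by tied values (ranks start at 1)
        for k in range(i, j + 1):
            ranks[k] = mid
        i = j + 1
    n1, n2 = len(cases), len(controls)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p


# Toy (made-up) scores: every case outranks every control.
z, p = rank_sum_test(cases=[5, 6, 7, 8], controls=[1, 2, 3, 4])
```

With thousands of de novo variants per group, as in the cohort described above, the normal approximation is generally adequate; for very small samples an exact test would be more appropriate.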

The accuracy of the deep learning classifier scales with the size of the training dataset, and variation data from each of the six primate species independently contributes to increasing the accuracy of the classifier. The large number and diversity of extant non-human primate species, along with evidence showing that the selective pressures on protein-altering variants are largely concordant within the primate lineage, suggest systematic primate population sequencing as an effective strategy for classifying the millions of human variants of unknown significance that currently limit clinical genome interpretation. Of the 504 known non-human primate species, roughly 60% face extinction due to hunting and habitat loss, motivating urgent worldwide conservation efforts that would benefit both these unique and irreplaceable species and our own.

Although aggregate whole-genome data are not available in as large numbers as exome data, which limits the power to detect the impact of natural selection in deep intronic regions, we were also able to calculate the observed versus expected counts of cryptic splice mutations far from exonic regions. Overall, we observe a 60% depletion in cryptic splice mutations at a distance of more than 50 nt from an exon-intron boundary. The attenuated signal is likely due to a combination of the smaller sample size of whole-genome data compared to exome data and the greater difficulty of predicting the impact of deep intronic variants.

[Terminology]
All literature and similar material cited in this application, including, but not limited to, patents, patent applications, articles, books, treatises, and web pages, regardless of the format of such literature and similar material, are expressly incorporated by reference in their entirety. In the event that one or more of the incorporated literature and similar materials differs from or contradicts this application, including but not limited to defined terms, term usage, described techniques, or the like, this application controls.

As used herein, the following terms have the meanings indicated.

A base refers to a nucleotide base or nucleotide, that is, A (adenine), C (cytosine), T (thymine), or G (guanine).

This application uses the terms "protein" and "translated sequence" interchangeably.

This application uses the terms "codon" and "base triplet" interchangeably.

This application uses the terms "amino acid" and "translated unit" interchangeably.

This application uses the phrases "variant pathogenicity classifier", "convolutional neural network-based classifier for variant classification", and "deep convolutional neural network-based classifier for variant classification" interchangeably.

The term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional, internationally recognized individual human genome chromosome numbering system is employed herein.

The term "site" refers to a unique position (for example, chromosome ID, chromosome position, and orientation) on a reference genome. In some implementations, a site may be the position of a residue, a sequence tag, or a segment on a sequence. The term "locus" may be used to refer to the specific location of a nucleic acid sequence or polymorphism on a reference chromosome.

The term "sample" herein refers to a sample, typically derived from a biological fluid, cell, tissue, organ, or organism, that contains a nucleic acid or a mixture of nucleic acids containing at least one nucleic acid sequence that is to be sequenced and/or phased. Such samples include, but are not limited to, sputum/oral fluid, amniotic fluid, blood, a blood fraction, fine needle biopsy samples (for example, surgical biopsy, fine needle biopsy, and the like), urine, peritoneal fluid, pleural fluid, tissue explant, organ culture, and any other tissue or cell preparation, or a fraction or derivative thereof, or material isolated therefrom. Although the sample is often taken from a human subject (for example, a patient), samples can be taken from any organism having chromosomes, including, but not limited to, dogs, cats, horses, goats, sheep, cattle, pigs, and the like. The sample may be used directly as obtained from the biological source or following a pretreatment to modify the character of the sample. For example, such pretreatment may include preparing plasma from blood, diluting mucus, and so forth. Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, and the like.

The term "sequence" includes or represents a strand of nucleotides coupled to each other. The nucleotides may be based on DNA or RNA. It should be understood that one sequence may include multiple sub-sequences. For example, a single sequence (for example, of a PCR amplicon) may have 350 nucleotides. The sample read may include multiple sub-sequences within these 350 nucleotides. For instance, the sample read may include first and second flanking sub-sequences having, for example, 20 to 50 nucleotides. The first and second flanking sub-sequences may be located on either side of a repetitive segment having a corresponding sub-sequence (for example, 40 to 100 nucleotides). Each of the flanking sub-sequences may include (or include portions of) a primer sub-sequence (for example, 10 to 30 nucleotides). For ease of reading, the term "sub-sequence" will be referred to as "sequence", but it should be understood that two sequences are not necessarily separate from each other on a common strand. To differentiate the various sequences described herein, the sequences may be given different labels (for example, target sequence, primer sequence, flanking sequence, reference sequence, and the like). Other terms, such as "allele", may be given different labels to differentiate between like objects.

The term "paired-end sequencing" refers to sequencing methods that sequence both ends of a target fragment. Paired-end sequencing may facilitate detection of genomic rearrangements and repetitive segments, as well as gene fusions and novel transcripts. Methodology for paired-end sequencing is described in International Patent Application Publication No. WO07010252, International Patent Application No. PCTGB2007/003798, and US Patent Application Publication No. 2009/0088327, each of which is incorporated by reference herein. In one example, a series of operations may be performed as follows: (a) generate clusters of nucleic acids; (b) linearize the nucleic acids; (c) hybridize a first sequencing primer and carry out repeated cycles of extension, scanning, and deblocking, as set forth above; (d) "invert" the target nucleic acids on the flow cell surface by synthesizing a complementary copy; (e) linearize the resynthesized strand; and (f) hybridize a second sequencing primer and carry out repeated cycles of extension, scanning, and deblocking, as set forth above. The inversion operation can be carried out by introducing reagents as set forth above for a single cycle of bridge amplification.

The term "reference genome" or "reference sequence" refers to any particular known genome sequence, whether partial or complete, of any organism that may be used to reference identified sequences from a subject. For example, reference genomes used for human subjects as well as many other organisms can be found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov. A "genome" refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences. A genome includes both the genes and the non-coding sequences of the DNA. The reference sequence may be larger than the reads that are aligned to it. For example, it may be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger. In one example, the reference genome sequence is that of a full-length human genome. In another example, the reference genome sequence is limited to a specific human chromosome, such as chromosome 13. In some implementations, a reference chromosome is a chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosome reference sequences, although the term reference genome is intended to cover such sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes and sub-chromosomal regions (such as strands) of any species. In various implementations, the reference genome is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual.

The term "read" refers to a collection of sequence data that describes a fragment of a nucleotide sample or reference. The term "read" may refer to a sample read and/or a reference read. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample or reference. The read may be represented symbolically (in ATCG) by the base pair sequence of the sample or reference fragment. It may be stored in a memory device and processed as appropriate to determine whether the read matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample. In some cases, a read is a DNA sequence of sufficient length (e.g., at least about 25 bp) that it can be used to identify a larger sequence or region, for example one that can be aligned and specifically assigned to a chromosome, genomic region, or gene.

Next-generation sequencing methods include, for example, sequencing by synthesis technology (Illumina), pyrosequencing (454), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), and sequencing by ligation (SOLiD sequencing). Depending on the sequencing method, the length of each read may vary from about 30 bp to more than 10,000 bp. For example, a sequencing method using the SOLiD sequencer generates nucleic acid reads of about 50 bp. In another example, Ion Torrent sequencing generates nucleic acid reads of up to 400 bp, and 454 pyrosequencing generates nucleic acid reads of about 700 bp. In yet another example, single-molecule real-time sequencing methods may generate reads of 10,000 bp to 15,000 bp. Therefore, in certain implementations, the nucleic acid sequence reads have a length of 30-100 bp, 50-200 bp, or 50-400 bp.

The terms "sample read," "sample sequence," or "sample fragment" refer to sequence data for a genomic sequence of interest from a sample. For example, the sample read comprises sequence data from a PCR amplicon having a forward and a reverse primer sequence. The sequence data can be obtained from any selected sequencing methodology. The sample read can be, for example, from a sequencing-by-synthesis (SBS) reaction, a sequencing-by-ligation reaction, or any other suitable sequencing methodology for which it is desired to determine the length and/or identity of a repetitive element. The sample read can be a consensus (e.g., averaged or weighted) sequence derived from multiple sample reads. In certain implementations, providing a reference sequence comprises identifying a locus of interest based upon the primer sequence of the PCR amplicon.

The term "raw fragment" refers to sequence data for a portion of a genomic sequence of interest that at least partially overlaps a designated position or secondary position of interest within a sample read or sample fragment. Non-limiting examples of raw fragments include a duplex stitched fragment, a simplex stitched fragment, a duplex un-stitched fragment, and a simplex un-stitched fragment. The term "raw" is used to indicate that the raw fragment includes sequence data having some relation to the sequence data in a sample read, regardless of whether the raw fragment exhibits a supporting variant that corresponds to, and authenticates or confirms, a potential variant in the sample read. The term "raw fragment" does not indicate that the fragment necessarily includes a supporting variant that validates a variant call in a sample read. For example, when a variant call application determines that a sample read exhibits a first variant, the application may determine that one or more raw fragments lack the corresponding type of "supporting" variant that would otherwise be expected to occur given the variant in the sample read.

The terms "mapping," "aligned," "alignment," or "aligning" refer to the process of comparing a read or tag to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read, the read may be mapped to the reference sequence or, in certain implementations, to a particular location in the reference sequence. In some cases, alignment simply tells whether or not a read is a member of a particular reference sequence (i.e., whether the read is present or absent in the reference sequence). For example, the alignment of a read to the reference sequence for human chromosome 13 will tell whether the read is present in the reference sequence for chromosome 13. A tool that provides this information may be called a set membership tester. In some cases, an alignment additionally indicates a location in the reference sequence where the read or tag maps. For example, if the reference sequence is the whole human genome sequence, an alignment may indicate that a read is present on chromosome 13, and may further indicate that the read is on a particular strand and/or site of chromosome 13.
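The distinction between a set membership tester and a position-reporting alignment can be illustrated with a minimal sketch. It assumes exact string matching only, which is a simplification: practical aligners tolerate mismatches and gaps. The function names `is_member` and `map_positions` are hypothetical and do not come from the specification.

```python
def is_member(read: str, reference: str) -> bool:
    """Set-membership test: is the read present anywhere in the reference?"""
    return read in reference


def map_positions(read: str, reference: str) -> list[int]:
    """Return every 0-based start position at which the read matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue searching one base further to allow overlapping matches.
        start = reference.find(read, start + 1)
    return positions


reference = "ACGTACGTTAGC"  # toy reference sequence (illustrative only)
print(is_member("TTAG", reference))      # membership only
print(map_positions("ACGT", reference))  # membership plus locations
```

The first call answers only the present-or-absent question; the second additionally reports where in the reference the read maps.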

The term "indel" refers to the insertion and/or deletion of bases in the DNA of an organism. A micro-indel represents an indel that results in a net change of 1 to 50 nucleotides. In coding regions of the genome, an indel produces a frameshift mutation unless its length is a multiple of 3. Indels can be contrasted with point mutations. An indel inserts nucleotides into, or deletes nucleotides from, a sequence, whereas a point mutation is a form of substitution that replaces one of the nucleotides without changing the overall number of nucleotides in the DNA. Indels can also be contrasted with a tandem base mutation (TBM), which may be defined as a substitution at adjacent nucleotides (primarily substitutions at two adjacent nucleotides, although substitutions at three adjacent nucleotides have been observed).
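The multiple-of-3 rule and the 1-50 nucleotide micro-indel range can be checked mechanically. The following is a minimal sketch, assuming each variant is given as a reference allele string and an alternate allele string; the function names are hypothetical.

```python
def net_length_change(ref_allele: str, alt_allele: str) -> int:
    """Net number of nucleotides gained (positive) or lost (negative)."""
    return len(alt_allele) - len(ref_allele)


def is_micro_indel(ref_allele: str, alt_allele: str) -> bool:
    """True when the net change is between 1 and 50 nucleotides."""
    return 1 <= abs(net_length_change(ref_allele, alt_allele)) <= 50


def causes_frameshift(ref_allele: str, alt_allele: str) -> bool:
    """In a coding region, an indel shifts the reading frame unless its
    net length is a multiple of 3. A pure substitution never does."""
    change = abs(net_length_change(ref_allele, alt_allele))
    return change != 0 and change % 3 != 0


print(causes_frameshift("A", "ATG"))   # 2-base insertion
print(causes_frameshift("A", "ATGC"))  # 3-base insertion, in frame
print(causes_frameshift("A", "G"))     # point mutation, no length change
```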

The term "variant" refers to a nucleic acid sequence that is different from a nucleic acid reference. Typical nucleic acid sequence variants include, without limitation, single nucleotide polymorphisms (SNPs), short deletion and insertion polymorphisms (indels), copy number variations (CNVs), microsatellite markers or short tandem repeats, and structural variations. Somatic variant calling is the effort to identify variants present at low frequency in a DNA sample. Somatic variant calling is of interest in the context of cancer treatment. Cancer is caused by an accumulation of mutations in DNA. A DNA sample from a tumor is generally heterogeneous, including some normal cells, some cells at an early stage of cancer progression (with fewer mutations), and some late-stage cells (with more mutations). Because of this heterogeneity, when sequencing a tumor (e.g., from an FFPE sample), somatic mutations will often appear at a low frequency. For example, an SNV might be seen in only 10% of the reads covering a given base. A variant that is to be classified as somatic or germline by the variant classifier is also referred to herein as the "variant under test."

The term "noise" refers to a mistaken variant call resulting from one or more errors in the sequencing process and/or in the variant call application.

The term "variant frequency" represents the relative frequency of an allele (variant of a gene) at a particular locus in a population, expressed as a fraction or percentage. For example, the fraction or percentage may be the fraction of all chromosomes in the population that carry that allele. By way of example, sample variant frequency represents the relative frequency of an allele/variant at a particular locus/position along a genomic sequence of interest over a "population" corresponding to the number of reads and/or samples obtained for the genomic sequence of interest from an individual. As another example, a baseline variant frequency represents the relative frequency of an allele/variant at a particular locus/position along one or more baseline genomic sequences, where the "population" corresponding to the number of reads and/or samples is obtained for the one or more baseline genomic sequences from a population of normal individuals.

The term "variant allele frequency (VAF)" refers to the number of sequenced reads observed to match the variant at the target position, divided by the overall coverage at that position and expressed as a percentage. VAF is a measure of the proportion of sequenced reads carrying the variant.
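As a worked illustration of this definition, the sketch below computes VAF from two counts at a single position. The function name and the input validation are assumptions added for illustration.

```python
def variant_allele_frequency(variant_reads: int, total_coverage: int) -> float:
    """VAF as a percentage: reads matching the variant / overall coverage."""
    if total_coverage <= 0:
        raise ValueError("total_coverage must be positive")
    if not 0 <= variant_reads <= total_coverage:
        raise ValueError("variant_reads must lie between 0 and total_coverage")
    return 100.0 * variant_reads / total_coverage


# 10 of 100 reads at the target position carry the variant.
print(variant_allele_frequency(10, 100))
```

With 10 variant reads out of 100 covering reads, the VAF is 10%, which matches the low-frequency somatic SNV scenario described for tumor samples.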

The terms "position," "designated position," and "locus" refer to a location or coordinate of one or more nucleotides within a sequence of nucleotides. These terms also refer to a location or coordinate of one or more base pairs in a sequence of nucleotides.

The term "haplotype" refers to a combination of alleles at adjacent sites on a chromosome that are inherited together. A haplotype may be one locus, several loci, or an entire chromosome, depending on the number of recombination events that have occurred between a given set of loci, if any occurred.

The term "threshold" herein refers to a numeric or non-numeric value that is used as a cutoff to characterize a sample, a nucleic acid, or a portion thereof (e.g., a read). A threshold may be varied based upon empirical analysis. The threshold may be compared to a measured or calculated value to determine whether the source giving rise to such value should be classified in a particular manner. Threshold values can be identified empirically or analytically. The choice of a threshold depends on the level of confidence that the user wishes to have in making the classification. The threshold may be chosen for a particular purpose (e.g., to balance sensitivity and selectivity). As used herein, the term "threshold" indicates a point at which the course of analysis may be changed and/or a point at which an action may be triggered. A threshold is not required to be a predetermined number. Instead, the threshold may be, for instance, a function based on a plurality of factors. The threshold may be adaptive to the circumstances. Moreover, a threshold may indicate an upper limit, a lower limit, or a range between limits.

In some implementations, a metric or score that is based on sequencing data may be compared to the threshold. As used herein, the terms "metric" or "score" may include values or results determined from the sequencing data, or may include functions based on such values or results. Like a threshold, the metric or score may be adaptive to the circumstances. For instance, the metric or score may be a normalized value. As an example of a score or metric, one or more implementations may use count scores when analyzing the data. A count score may be based on the number of sample reads. The sample reads may have undergone one or more filtering stages, so that the sample reads have at least one common characteristic or quality. For example, each of the sample reads used to determine a count score may have been aligned with a reference sequence or may be assigned as a potential allele. The number of sample reads having a common characteristic may be counted to determine a read count. Count scores may be based on the read count. In some implementations, the count score may be a value equal to the read count. In other implementations, the count score may be based on the read count and other information. For example, a count score may be based on the read count for a particular allele of a genetic locus and the total number of reads for the genetic locus. In some implementations, the count score may be based on the read count and previously obtained data for the genetic locus. In some implementations, the count score may be a normalized score between predetermined values. The count score may also be a function of read counts from other loci of the sample, or a function of read counts from other samples that were run concurrently with the sample of interest. For instance, the count score may be a function of the read count of a particular allele and the read counts of other loci in the sample and/or the read counts from other samples. As one example, the read counts from other loci and/or the read counts from other samples may be used to normalize the count score for the particular allele.
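One of the variants described above, a count score based on the read count for a particular allele and the total number of reads for the locus, can be sketched as follows. This is an illustrative assumption about one possible normalization, not the scoring used by any particular variant caller; the function name and input format (one allele label per filtered, aligned read) are hypothetical.

```python
from collections import Counter


def count_scores(read_alleles: list[str]) -> dict[str, tuple[int, float]]:
    """For each allele seen at one locus, return (read count, normalized score).

    read_alleles holds one allele label per sample read that has already
    passed filtering and been aligned to the locus.
    The normalized score is the allele's read count divided by the total
    number of reads at the locus, so it lies between 0 and 1.
    """
    counts = Counter(read_alleles)
    total = sum(counts.values())
    return {allele: (n, n / total) for allele, n in counts.items()}


# Four filtered reads at a locus: three support allele A, one supports G.
print(count_scores(["A", "A", "G", "A"]))
```

Returning both the raw read count and the normalized value reflects the two options in the text: a count score equal to the read count, or one normalized between predetermined values.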

The terms "coverage" or "fragment coverage" refer to a count or other measure of the number of sample reads for the same fragment of a sequence. A read count may represent a count of the number of reads that cover a corresponding fragment. Alternatively, the coverage may be determined by multiplying the read count by a designated factor that is based on historical knowledge, knowledge of the sample, knowledge of the locus, and the like.

The term "read depth" (conventionally a number followed by "×") refers to the number of sequenced reads with overlapping alignment at the target position. This is often expressed as an average, or as a percentage exceeding a cutoff over a set of intervals (such as exons, genes, or panels). For example, a clinical report might say that a panel average coverage is 1,105× with 98% of targeted bases covered at >100×.
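The two summary figures quoted in such a report (an average depth and the percentage of targeted bases above a cutoff) can be computed from per-base depths. The sketch below assumes the depths are already available as a list of integers, one per targeted base; the function name and the toy values are illustrative only.

```python
def depth_summary(depths: list[int], cutoff: int) -> tuple[float, float]:
    """Return (mean depth, percentage of positions with depth > cutoff)."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    # Booleans sum as 0/1, so this counts positions strictly above the cutoff.
    pct_above = 100.0 * sum(d > cutoff for d in depths) / len(depths)
    return mean_depth, pct_above


# Toy panel of four targeted bases, reported against a 100x cutoff.
mean_depth, pct_above = depth_summary([50, 150, 200, 400], cutoff=100)
print(f"mean depth {mean_depth:.0f}x, {pct_above:.0f}% of bases > 100x")
```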

The terms "base call quality score" or "Q score" refer to a PHRED-scaled probability ranging from 0 to 20 that is inversely proportional to the probability that a single sequenced base is correct. For example, a T base call with a Q of 20 is considered likely correct with a confidence P-value of 0.01. Any base call with Q<20 should be considered low quality, and any variant identified where a substantial proportion of the sequenced reads supporting the variant are of low quality should be considered potentially false positive.
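The relationship between a Q score and the 0.01 figure in the example can be made concrete using the standard Phred convention, Q = -10 · log10(P_error), where P_error is the probability that the base call is wrong. That formula is an assumption here, since the text above does not spell it out, but it is consistent with the example: Q = 20 corresponds to an error probability of 0.01. The function names and the `min_q` parameter are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its Phred-scaled Q score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error: float) -> float:
    """Phred-scaled Q score for a given base-call error probability."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in (0, 1]")
    return -10 * math.log10(p_error)


def is_low_quality(q: float, min_q: float = 20) -> bool:
    """Apply the Q<20 low-quality rule described in the text."""
    return q < min_q


print(phred_to_error_prob(20))  # error probability for a Q20 base call
print(is_low_quality(15))
```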

The terms "variant reads" or "variant read number" refer to the number of sequenced reads supporting the presence of the variant.

[Sequencing process]
Implementations set forth herein may be applicable to analyzing nucleic acid sequences to identify sequence variations. Implementations may be used to analyze potential variants/alleles of a genetic position/locus and determine a genotype of the genetic locus or, in other words, provide a genotype call for the locus. By way of example, nucleic acid sequences may be analyzed in accordance with the methods and systems described in US Patent Application Publication No. 2016/0085910 and US Patent Application Publication No. 2013/0296175, the complete subject matter of which is expressly incorporated by reference herein in its entirety.

In one implementation, a sequencing process includes receiving a sample that includes or is suspected of including nucleic acids, such as DNA. The sample may be from a known or unknown source, such as an animal (e.g., human), plant, bacterium, or fungus. The sample may be taken directly from the source. For instance, blood or saliva may be taken directly from an individual. Alternatively, the sample may not be obtained directly from the source. One or more processors then direct the system to prepare the sample for sequencing. The preparation may include removing extraneous material and/or isolating certain material (e.g., DNA). The biological sample may be prepared to include features for a particular assay. For example, the biological sample may be prepared for sequencing-by-synthesis (SBS). In certain implementations, the preparing may include amplification of certain regions of a genome. For instance, the preparing may include amplifying predetermined genetic loci that are known to include STRs and/or SNPs. The genetic loci may be amplified using predetermined primer sequences.

Next, the one or more processors direct the system to sequence the sample. The sequencing may be performed through a variety of known sequencing protocols. In particular implementations, the sequencing includes SBS. In SBS, a plurality of fluorescently labeled nucleotides are used to sequence a plurality of clusters of amplified DNA (possibly millions of clusters) present on the surface of an optical substrate (e.g., a surface that at least partially defines a channel in a flow cell). The flow cell may contain nucleic acid samples for sequencing, where the flow cell is placed within an appropriate flow cell holder.

The nucleic acids can be prepared such that they comprise a known primer sequence adjacent to an unknown target sequence. To initiate the first SBS sequencing cycle, one or more differently labeled nucleotides, DNA polymerase, and so on can be flowed into/through the flow cell by a fluid flow subsystem. Either a single type of nucleotide can be added at a time, or the nucleotides used in the sequencing procedure can be specially designed to possess a reversible terminator property, thus allowing each cycle of the sequencing reaction to occur simultaneously in the presence of several types of labeled nucleotides (e.g., A, C, T, G). The nucleotides can include detectable label moieties such as fluorophores. Where the four nucleotides are mixed together, the polymerase is able to select the correct base to incorporate, and each sequence is extended by a single base. Non-incorporated nucleotides can be washed away by flowing a wash solution through the flow cell. One or more lasers may excite the nucleic acids and induce fluorescence. The fluorescence emitted from the nucleic acids is based upon the fluorophore of the incorporated base, and different fluorophores may emit different wavelengths of emission light. A deblocking reagent can be added to the flow cell to remove reversible terminator groups from the DNA strands that were extended and detected. The deblocking reagent can then be washed away by flowing a wash solution through the flow cell. The flow cell is then ready for a further cycle of sequencing, starting with introduction of a labeled nucleotide as set forth above. The fluidic and detection operations can be repeated multiple times to complete a sequencing run. Exemplary sequencing methods are described, for example, in Bentley et al., Nature 456:53-59 (2008), International Patent Application Publication No. WO 04/018497, US Patent No. 7057026, International Patent Application Publication No. WO 91/06678, International Patent Application Publication No. WO 07/123744, US Patent No. 7329492, US Patent No. 7211414, US Patent No. 7315019, US Patent No. 7405281, and US Patent Application Publication No. 2008/0108082, each of which is incorporated herein by reference.

In some implementations, nucleic acids can be attached to a surface and amplified prior to or during sequencing. For example, amplification can be carried out using bridge amplification to form nucleic acid clusters on a surface. Useful bridge amplification methods are described, for example, in US Patent No. 5641658, US Patent Application Publication No. 2002/0055100, US Patent No. 7115400, US Patent Application Publication No. 2004/0096853, US Patent Application Publication No. 2004/0002090, US Patent Application Publication No. 2007/0128624, and US Patent Application Publication No. 2008/0009420, each of which is incorporated herein by reference in its entirety. Another useful method for amplifying nucleic acids on a surface is rolling circle amplification (RCA), for example as described in Lizardi et al., Nat. Genet. 19:225-232 (1998) and US Patent Application Publication No. 2007/0099208A1, each of which is incorporated herein by reference.

One exemplary SBS protocol exploits modified nucleotides having removable 3' blocks, for example as described in International Patent Application Publication No. WO 04/018497, US Patent Application Publication No. 2007/0166705A1, and US Patent No. 7057026, each of which is incorporated herein by reference. For example, repeated cycles of SBS reagents can be delivered to a flow cell having target nucleic acids attached thereto, for example as a result of the bridge amplification protocol. The nucleic acid clusters can be converted to single-stranded form using a linearization solution. The linearization solution can contain, for example, a restriction endonuclease capable of cleaving one strand of each cluster. Other methods of cleavage can be used as an alternative to restriction enzymes or nicking enzymes, including, among others, chemical cleavage (e.g., cleavage of a diol linkage with periodate); cleavage of abasic sites by cleavage with an endonuclease (for example, "USER," as supplied by NEB, Ipswich, Massachusetts, USA, part number M5505S) or by exposure to heat or alkali; cleavage of ribonucleotides incorporated into amplification products otherwise composed of deoxyribonucleotides; photochemical cleavage; or cleavage of a peptide linker. After the linearization operation, a sequencing primer can be delivered to the flow cell under conditions for hybridization of the sequencing primer to the target nucleic acids that are to be sequenced.

The flow cell can then be contacted with an SBS extension reagent having modified nucleotides with removable 3' blocks and fluorescent labels, under conditions to extend a primer hybridized to each target nucleic acid by the addition of a single nucleotide. Only a single nucleotide is added to each primer because, once the modified nucleotide has been incorporated into the growing polynucleotide chain complementary to the region of the template being sequenced, there is no free 3'-OH group available to direct further sequence extension, and therefore the polymerase cannot add further nucleotides. The SBS extension reagent can be removed and replaced with a scan reagent containing components that protect the sample under excitation with radiation. Exemplary components for the scan reagent are described in U.S. Patent Application Publication No. 2008/0280773A1 and U.S. Patent Application No. 13/018255, each of which is incorporated herein by reference. The extended nucleic acids can then be detected by fluorescence in the presence of the scan reagent. Once the fluorescence has been detected, the 3' block can be removed using a deblock reagent appropriate to the blocking group used. Exemplary deblock reagents that are useful for the respective blocking groups are described in International Patent Application Publication No. WO004018497, U.S. Patent Application Publication No. 2007/0166705A1, and U.S. Patent No. 7,057,026, each of which is incorporated herein by reference. The deblock reagent can be washed away, leaving target nucleic acids hybridized to extended primers having 3'-OH groups that are now competent for the addition of a further nucleotide. Accordingly, the cycles of adding extension reagent, scan reagent, and deblock reagent, with optional washes between one or more of the operations, can be repeated until a desired sequence is obtained. The above cycles can be carried out using a single extension reagent introduction operation per cycle when each of the modified nucleotides has a different label attached to it that is known to correspond to the particular base. The different labels facilitate discrimination between the nucleotides added during each incorporation operation. Alternatively, each cycle can include separate operations of extension reagent introduction followed by separate operations of scan reagent introduction and detection, in which case two or more of the nucleotides can have the same label and can be distinguished based on the known order of introduction.

Although the sequencing operation has been discussed above with respect to a particular SBS protocol, it will be understood that other protocols for sequencing, and any of a variety of other molecular analyses, can be carried out as desired.

One or more processors of the system then receive the sequencing data for subsequent analysis. The sequencing data can be formatted in various ways, such as in a .BAM file. The sequencing data can include, for example, a plurality of sample reads, each having a corresponding sample sequence of nucleotides. Although only one sample read is discussed, it should be understood that the sequencing data can include, for example, hundreds, thousands, hundreds of thousands, or millions of sample reads. Different sample reads can have different numbers of nucleotides. For example, a sample read can range from 10 nucleotides to about 500 nucleotides or more. The sample reads can span the entire genome of the source. As one example, the sample reads are directed toward predetermined genetic loci, such as loci having suspected STRs or suspected SNPs.

Each sample read can include a sequence of nucleotides, which may be referred to as a sample sequence, sample fragment, or target sequence. The sample sequence can include, for example, primer sequences, flanking sequences, and a target sequence. The number of nucleotides within the sample sequence can be 30, 40, 50, 60, 70, 80, 90, 100, or more. In some implementations, one or more of the sample reads (or sample sequences) include at least 150 nucleotides, 200 nucleotides, 300 nucleotides, 400 nucleotides, 500 nucleotides, or more. In some implementations, the sample reads can include more than 1000 nucleotides, more than 2000 nucleotides, or more. The sample reads (or the sample sequences) can include primer sequences at one or both ends.

Next, the one or more processors analyze the sequencing data to obtain potential variant calls and a sample variant frequency of the sample variant calls. This operation may also be referred to as a variant call application or variant caller. Thus, the variant caller identifies or detects variants, and the variant classifier classifies the detected variants as somatic or germline. Alternative variant callers may be utilized in accordance with implementations herein, where different variant callers may be used based on the type of sequencing operation being performed, features of the sample of interest, and the like. One non-limiting example of a variant call application is the Pisces(TM) application by Illumina Inc. (San Diego, California), hosted at https://github.com/Illumina/Pisces and described in the article Dunn, Tamsen & Berry, Gwenn & Emig-Agius, Dorothea & Jiang, Yu & Iyer, Anita & Udar, Nitin & Stromberg, Michael (2017), Pisces: An Accurate and Versatile Single Sample Somatic and Germline Variant Caller, 595-595, 10.1145/3107411.3108203, the complete subject matter of which is incorporated herein by reference in its entirety.

[Generation of benign training set]
Millions of human genomes and exomes have been sequenced, but their clinical applications remain limited by the difficulty of distinguishing disease-causing mutations from benign genetic variation. Here we demonstrate that common missense variants in other primate species are largely clinically benign in humans, which enables pathogenic mutations to be identified systematically by a process of elimination. Using hundreds of thousands of common variants from population sequencing of six non-human primate species, we trained a deep neural network that identifies pathogenic mutations in rare disease patients with 88% accuracy and enables the discovery of 14 new candidate genes in intellectual disability at genome-wide significance. Cataloging common variation from additional primate species would improve the interpretation of millions of variants of uncertain significance and further advance the clinical utility of human genome sequencing.

The clinical actionability of diagnostic sequencing is limited by the difficulty of interpreting rare genetic variants in human populations and inferring their impact on disease risk. Because of their deleterious effects on health, clinically significant genetic variants tend to be extremely rare in the population, and for the vast majority their effects on human health have not been determined. The large number and rarity of these variants of uncertain clinical significance present a formidable obstacle to the adoption of sequencing for individualized medicine and population-wide health screening.

Most penetrant Mendelian diseases have very low prevalence in the population, so the observation of a variant at high frequency in the population is strong evidence in favor of a benign consequence. Assaying common variation across diverse human populations is an effective strategy for cataloging benign variants, but the total amount of common variation in present-day humans is limited by bottleneck events in our species' recent history, during which a large fraction of ancestral diversity was lost. Population studies of present-day humans show a remarkable expansion from an effective population size (N_e) of fewer than 10,000 individuals within the last 15,000 to 65,000 years, and the small pool of common polymorphisms traces back to the limited capacity for variation in a population of this size. Of the 70 million potential protein-altering missense substitutions in the reference genome, only roughly 1 in 1,000 is present at greater than 0.1% overall population allele frequency.

Outside of modern human populations, chimpanzees are the next closest extant species and share 99.4% amino acid sequence identity with humans. The near-identity of the protein-coding sequence in humans and chimpanzees suggests that purifying selection operating on chimpanzee protein-coding variants might also model the health consequences of human mutations that are identical-by-state.

Because the mean time for neutral polymorphisms to persist in the ancestral human lineage (about 4N_e generations) is a fraction of the species' divergence time (about 6 million years ago), naturally occurring chimpanzee variation explores a mutational space that is largely non-overlapping except by chance, aside from rare instances of haplotypes maintained by balancing selection. If polymorphisms that are identical-by-state affect health similarly in the two species, the presence of a variant at high allele frequency in chimpanzee populations should indicate a benign consequence in humans, expanding the catalog of known variants whose benign consequence has been established by purifying selection. Substantial additional detail is given in the applications incorporated by reference.

[Deep learning network architecture]
In one implementation disclosed by the applications incorporated by reference, the pathogenicity prediction network takes as input the length-51 amino acid sequence centered at the variant of interest, together with the outputs of the secondary structure and solvent accessibility networks (FIG. 2 and FIG. 3), with the missense variant substituted in at the central position. Three length-51 position frequency matrices are generated from multiple sequence alignments of 99 vertebrates: one for 11 primates, one for 50 mammals excluding primates, and one for 38 vertebrates excluding primates and mammals.

The secondary structure deep learning network predicts a three-state secondary structure at each amino acid position: alpha helix (H), beta sheet (B), and coil (C). The solvent accessibility network predicts a three-state solvent accessibility at each amino acid position: buried (B), intermediate (I), and exposed (E). Both networks can take only the flanking amino acid sequence as input and can be trained using labels from known non-redundant crystal structures in the Protein DataBank. For the input to the pre-trained three-state secondary structure and three-state solvent accessibility networks, a single position frequency matrix generated from the multiple sequence alignments for all 99 vertebrates can be used, also with length 51 and depth 20. After pre-training the networks on known crystal structures from the Protein DataBank, the final two layers of the secondary structure and solvent models can be removed, and the output of each network can be connected directly to the input of the pathogenicity model. The exemplary test accuracy achieved for the three-state secondary structure prediction model was 79.86%. There was no substantial difference between the predictions of the neural network when using DSSP-annotated structure labels for the approximately 4,000 human proteins that had crystal structures and when using predicted structure labels only.

Both our deep learning network for pathogenicity prediction (PrimateAI) and the deep learning networks for predicting secondary structure and solvent accessibility adopt an architecture of residual blocks. The detailed architecture of PrimateAI is described in FIG. 3.

FIG. 2 shows an exemplary architecture 200 of a deep residual network for pathogenicity prediction, referred to herein as "PrimateAI". In FIG. 2, 1D refers to a one-dimensional convolutional layer. Predicted pathogenicity is on a scale from 0 (benign) to 1 (pathogenic). The network takes as input the human amino acid (AA) reference and alternate sequences (51 AAs) centered at the variant, the position-specific weight matrix (PWM) conservation profiles calculated from 99 vertebrate species, and the outputs of the secondary structure and solvent accessibility prediction deep learning networks, which predict three-state protein secondary structure (helix-H, beta sheet-B, and coil-C) and three-state solvent accessibility (buried-B, intermediate-I, and exposed-E).

FIG. 3 shows a schematic illustration 300 of PrimateAI, the deep learning network architecture for pathogenicity classification. The inputs to the model include 51 amino acids (AA) of flanking sequence for both the reference sequence and the sequence with the variant substituted in, conservation represented by three position-specific weight matrices of length 51 AA from primate, mammal, and vertebrate alignments, and the outputs of the pre-trained secondary structure network and solvent accessibility network (also of length 51 AA).

[Improvement through pre-training]
The present disclosure introduces pre-training of the pathogenicity prediction model to reduce or counteract overfitting and improve training outcomes. The system is described with reference to FIG. 1, which shows an architectural-level schematic 100 of the system in accordance with one implementation. Because FIG. 1 is an architectural diagram, certain details are intentionally omitted to improve the clarity of the description. The discussion of FIG. 1 is organized as follows. First, the elements of the figure are described, followed by their interconnections. Then, the use of the elements in the system is described in greater detail.

This paragraph names the labeled parts of the system illustrated in FIG. 1. The system includes four training datasets: pathogenic missense training examples 121, supplemental benign training examples 131, benign missense training examples 161, and supplemental benign training examples 181. The system further includes a trainer 114, a tester 116, a position-specific frequency matrix (PFM) calculator 184, an input encoder 186, a variant pathogenicity prediction model 157, and a network 155. The supplemental benign training examples 131 correspond to the pathogenic missense training examples 121 and are therefore placed together in a box drawn with broken lines. Similarly, the supplemental benign training examples 181 correspond to the benign missense training examples 161, so both datasets are shown in the same box.

The system is described with PrimateAI as an exemplary variant pathogenicity prediction model 157, which takes as input the amino acid sequence flanking the variant of interest and orthologous sequence alignments in other species. The detailed architecture of the PrimateAI model for pathogenicity prediction is presented above with reference to FIG. 3. The amino acid sequence input includes the variant of interest. The term "variant" refers to an amino acid sequence that differs from an amino acid reference sequence. A tri-nucleotide base sequence (also referred to as a codon) at a specific position in a protein-coding region of the chromosome expresses an amino acid. There are twenty types of amino acids, which can be formed by sixty-one tri-nucleotide sequence combinations. More than one codon or tri-nucleotide sequence combination can result in the same amino acid. For example, the codons "AAA" and "AAG" both represent the amino acid lysine (also indicated by the symbol "K").
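The codon-to-amino-acid mapping described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code; the names `BASES`, `AMINO_ACIDS`, and `CODON_TABLE` are illustrative and do not come from the disclosure.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
# The table layout and all names here are illustrative assumptions.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Sense codons encode an amino acid rather than a stop signal.
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = {CODON_TABLE[c] for c in sense_codons}

print(len(sense_codons), len(amino_acids))     # 61 20
print(CODON_TABLE["AAA"], CODON_TABLE["AAG"])  # K K
```

Of the 64 possible codons, three are stop codons, which leaves the 61 amino-acid-coding combinations mentioned in the text.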

An amino acid sequence variant can be caused by a single nucleotide polymorphism (SNP). A SNP is a variation in a single nucleotide that occurs at a specific locus in a gene and is observed to some appreciable degree within a population (e.g., >1%). The technology disclosed focuses on SNPs that occur in the protein-coding regions of genes, called exons. There are two types of SNPs: synonymous SNPs and missense SNPs. A synonymous SNP is a type of protein-coding SNP that changes a first codon for an amino acid into a second codon for the same amino acid. A missense SNP, on the other hand, changes a first codon for a first amino acid into a second codon for a second amino acid.

FIG. 6 presents an example 600 of "protein sequence pairs" for a missense variant and a corresponding constructed synonymous variant. The phrase "protein sequence pair", or simply "sequence pair", refers to a reference protein sequence and an alternate protein sequence. The reference protein sequence comprises reference amino acids expressed by reference codons or tri-nucleotide bases. The alternate protein sequence comprises alternate amino acids expressed by alternate codons or tri-nucleotide bases, such that the alternate protein sequence results from a variant occurring in a reference codon that expresses a reference amino acid of the reference protein sequence.

In FIG. 6, we present the construction of a supplemental benign synonymous counterpart training example (referred to above as a supplemental benign training example) corresponding to a missense variant. The missense variant can be a pathogenic missense training example or a benign missense training example. Consider the protein sequence pair for a missense variant whose reference amino acid sequence has the codon "TTT" at positions 5, 6, and 7 (i.e., 5:7) in chromosome 1. Now consider that a SNP occurs at position 6 in the same chromosome, so that the alternate sequence has the codon "TCT" at the same positions, i.e., 5:7. The codon "TTT" in the reference sequence results in the amino acid phenylalanine (F), whereas the codon "TCT" in the alternate amino acid sequence results in the amino acid serine (S). To simplify the illustration, FIG. 6 shows only the amino acids and corresponding codons in the sequence pairs at the target position. Flanking amino acids and their respective codons in the sequence pairs are not shown. In the training dataset, the missense variant is labeled as pathogenic (labeled "1").

To reduce overfitting of the model during training, the technology disclosed constructs a counterpart supplemental benign training example for the corresponding missense variant. The reference sequence in the sequence pair for the constructed supplemental benign training example is the same as the reference sequence of the missense variant shown in the left part of FIG. 6. The right part of FIG. 6 shows a supplemental benign training example that is a synonymous counterpart, with the same reference sequence codon "TTT" at positions 5:7 in chromosome 1 as in the reference sequence for the missense variant. The alternate sequence constructed for the synonymous counterpart has a SNP at position 7, which results in the codon "TTC". This codon yields the same amino acid, phenylalanine (F), in the alternate sequence as in the reference sequence at the same position in the same chromosome. Because the two different codons at the same position in the reference and alternate sequences express the same amino acid at the target position, the synonymous counterpart is labeled as benign (labeled "0"). The benign counterpart is not constructed at random; instead, it is selected from synonymous variants observed in a sequenced population. The technology disclosed constructs supplemental benign training examples to contrast with the pathogenic missense training examples and thereby reduce overfitting of the variant pathogenicity prediction model during training.
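The FIG. 6 example can be traced in a short Python sketch. This is an illustration under stated assumptions, not the disclosed implementation: the record layout, the helper names `apply_snp` and `sequence_pair`, and the three-entry codon lookup are hypothetical, and the sketch does not model the selection of the synonymous counterpart from variants observed in a sequenced population.

```python
# Minimal codon lookup covering only the codons in the FIG. 6 example.
CODON_TO_AA = {"TTT": "F", "TCT": "S", "TTC": "F"}

PATHOGENIC, BENIGN = 1, 0


def apply_snp(codon, offset, base):
    """Return the codon with the base at `offset` (0, 1, or 2) replaced."""
    return codon[:offset] + base + codon[offset + 1:]


def sequence_pair(ref_codon, alt_codon, label):
    """Build a (hypothetical) training record for the target position only."""
    ref_aa, alt_aa = CODON_TO_AA[ref_codon], CODON_TO_AA[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "missense"
    return {"ref_codon": ref_codon, "alt_codon": alt_codon,
            "ref_aa": ref_aa, "alt_aa": alt_aa, "kind": kind, "label": label}


# The codon occupies chromosome positions 5:7, so position 6 is offset 1
# and position 7 is offset 2 within the codon.
ref = "TTT"
missense = sequence_pair(ref, apply_snp(ref, 1, "C"), PATHOGENIC)  # TTT -> TCT
counterpart = sequence_pair(ref, apply_snp(ref, 2, "C"), BENIGN)   # TTT -> TTC

print(missense)
print(counterpart)
```

Both records share the reference codon; only the alternate codon and the label differ, which is the contrast the supplemental example is meant to provide.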

A supplemental benign training example need not be synonymous. The technology disclosed can also construct supplemental benign training examples that have the same amino acid in the alternate sequence as in the reference sequence, expressed by identical tri-nucleotide codons. The associated position-specific frequency matrix (PFM) is the same for identical amino acid sequences, regardless of whether the amino acids are expressed by synonymous or identical codons. Therefore, such supplemental training examples reduce overfitting of the variant pathogenicity prediction model during training in the same way as the synonymous counterpart training example presented in FIG. 6.
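A minimal sketch of deriving such a supplemental example from a missense example follows. The `TrainingExample` record and `make_supplemental` helper are hypothetical names chosen for illustration, the sequences are five residues long instead of 51, and the PFM is a toy placeholder.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TrainingExample:
    ref_seq: str                      # reference amino acid sequence
    alt_seq: str                      # alternate amino acid sequence
    pfm: Tuple[Tuple[int, ...], ...]  # stand-in for the PFM input
    label: int                        # 1 = pathogenic, 0 = benign


def make_supplemental(example: TrainingExample) -> TrainingExample:
    """Copy the reference sequence into the alternate slot, keep the PFM
    unchanged, and label the result benign."""
    return replace(example, alt_seq=example.ref_seq, label=0)


toy_pfm = ((3, 0, 1), (0, 3, 2))  # placeholder counts, not real data
missense = TrainingExample("MKFLV", "MKSLV", toy_pfm, label=1)
supplemental = make_supplemental(missense)
```

The supplemental record carries exactly the same PFM as the missense record, so the only signal that separates the two is the amino acid difference between the reference and alternate sequences.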

We now describe the other elements of the system presented in FIG. 1. The trainer 114 uses the four training datasets presented in FIG. 1 to train the variant pathogenicity prediction model. In one implementation, the variant pathogenicity prediction model is implemented as a convolutional neural network (CNN). The training of the CNN is described above with reference to FIG. 5. During training, the CNN is adjusted so that the input data leads to a specific output estimate. Training includes adjusting the CNN using backpropagation, based on a comparison of the output estimate with the ground truth, until the output estimate progressively matches or approaches the ground truth. Following training, the tester 116 uses a test dataset to benchmark the variant pathogenicity prediction model. The input encoder 186 converts categorical input data, such as reference and alternate amino acid sequences, into a form that can be provided as input to the variant pathogenicity prediction model. This is further explained using the exemplary reference and alternate sequences of FIG. 13.
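One common way to turn categorical amino acid sequences into numeric model input is one-hot encoding. The sketch below assumes that encoding and an alphabetical amino acid ordering; this passage does not state which encoding or ordering the input encoder 186 uses, and the toy sequences are hypothetical.

```python
# Assumed alphabet order; the passage above does not fix one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode an amino acid string as a list of 20-element 0/1 rows."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows


ref_encoded = one_hot("MKFLV")  # toy reference sequence (hypothetical)
alt_encoded = one_hot("MKSLV")  # toy alternate sequence with F -> S
```

With this encoding, the reference and alternate inputs differ only in the row for the variant position.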

PFM calculator 184 calculates the position frequency matrix (PFM), also referred to as a position-specific scoring matrix (PSSM) or position weight matrix (PWM). The PFM indicates the frequency of every amino acid (on the vertical axis) at each amino acid position (on the horizontal axis), as shown in FIGS. 10 and 11. The technology disclosed calculates three PFMs, one each for primates, mammals, and vertebrates. The length of the amino acid sequences for each of the three PFMs can be 51, with the target amino acid flanked by at least 25 amino acids upstream and downstream. Each PFM has 20 rows for the amino acids and 51 columns for the positions of amino acids in the sequence. The PFM calculator calculates the first PFM from amino acid sequences for 11 primates, the second PFM from amino acid sequences for 48 mammals, and the third PFM from amino acid sequences for 40 vertebrates. A cell in a PFM is a count of the occurrences of an amino acid at a specific position in the sequence. The amino acid sequences for the three PFMs are aligned. This means that the results of the position-wise calculation of the primate, mammal, and vertebrate PFMs for each amino acid position in the reference or alternate amino acid sequence are stored position by position, in the same order as the amino acid positions occur in the reference or alternate amino acid sequence.
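The counting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the PFM calculator 184 itself: the function name `compute_pfm`, the alphabetical row order in `AMINO_ACIDS`, the handling of gap characters, and the toy alignment are all assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order: 20 standard amino acids


def compute_pfm(aligned_sequences, length=51):
    """Return a 20 x `length` count matrix as a list of rows.

    Cell [r][c] counts how many aligned sequences carry amino acid
    AMINO_ACIDS[r] at 0-based column c. Characters outside the
    20-letter alphabet (for example alignment gaps '-') are skipped.
    """
    row_of = {aa: r for r, aa in enumerate(AMINO_ACIDS)}
    pfm = [[0] * length for _ in AMINO_ACIDS]
    for seq in aligned_sequences:
        if len(seq) != length:
            raise ValueError("all sequences must span the same window")
        for col, aa in enumerate(seq):
            row = row_of.get(aa)
            if row is not None:
                pfm[row][col] += 1
    return pfm


# Hypothetical alignment of 11 primate sequences over a 51-residue window.
# Three carry 'K' at the target position (position 26, index 25); eight carry 'R'.
flank = "A" * 25
primates = [flank + "K" + flank] * 3 + [flank + "R" + flank] * 8
primate_pfm = compute_pfm(primates)

target = 25  # 0-based index of position 26
k_count = primate_pfm[AMINO_ACIDS.index("K")][target]
r_count = primate_pfm[AMINO_ACIDS.index("R")][target]
print(k_count, r_count)  # 3 8
```

Calling the same function once per species group (primates, mammals, vertebrates) would give three matrices of identical shape whose columns line up with the positions of the reference and alternate sequences.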

The technology disclosed uses the supplemental benign training examples 131 and 181 during the initial training epochs, for example 2, 3, 5, 8, or 10 epochs, or 3 to 5, 3 to 8, or 2 to 10 epochs. FIGS. 7, 8, and 9 illustrate the pathogenicity prediction model during the pre-training epochs, the training epochs, and inference. FIG. 7 presents an illustration 700 of pre-training epochs 1 to 5, in which approximately 400,000 supplemental benign training examples 131 are combined with approximately 400,000 pathogenic variants 121 predicted from the deep learning model. Fewer supplemental benign training examples, such as approximately 100,000, 200,000, or 300,000, can be combined with the pathogenic variants. In one implementation, the pathogenic variant dataset is generated in 20 cycles using random samples from the approximately 68 million synthetic variants, as described above. In another implementation, the pathogenic variant dataset can be generated in one cycle from the approximately 68 million synthetic variants. The pathogenic variants 121 and the supplemental benign training examples 131 are given as input to the ensemble of networks in the first five epochs. Similarly, approximately 400,000 supplemental benign training examples 181 are combined with approximately 400,000 benign variants 161 for ensemble training during the pre-training epochs. Fewer supplemental benign training examples, such as approximately 100,000, 200,000, or 300,000, can be combined with the benign variants.

The supplemental benign datasets 131 and 181 are not given as input for the remaining training epochs 6 to n, as shown in example 800 of FIG. 8. Training of the ensemble of networks continues over multiple epochs with the pathogenic variant dataset and the benign variant dataset. Training terminates after a predetermined number of training epochs or when a termination condition is reached. The trained network is used during inference to evaluate the synthetic variants 810, as shown in example 900 of FIG. 9. The trained network predicts a variant as pathogenic or benign.
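The epoch schedule can be illustrated with a short Python sketch. It assumes a simple in-memory list of examples; the function name `epoch_examples`, its parameters, the label convention (1 = pathogenic, 0 = benign), and the toy data are hypothetical and are not taken from the disclosure.

```python
import random


def epoch_examples(epoch, pathogenic, benign, supp_for_pathogenic,
                   supp_for_benign, pretrain_epochs=5, seed=0):
    """Return the shuffled (example, label) list for a 1-based epoch.

    Labels: 1 = pathogenic, 0 = benign. Supplemental examples are
    labeled benign and are included only while epoch <= pretrain_epochs.
    """
    data = [(x, 1) for x in pathogenic] + [(x, 0) for x in benign]
    if epoch <= pretrain_epochs:
        data += [(x, 0) for x in supp_for_pathogenic]
        data += [(x, 0) for x in supp_for_benign]
    random.Random(seed + epoch).shuffle(data)  # deterministic per epoch
    return data


# Toy stand-ins for datasets 121, 161, 131, and 181.
pathogenic = ["p1", "p2"]
benign = ["b1", "b2"]
supp_131 = ["sp1", "sp2"]
supp_181 = ["sb1", "sb2"]

sizes = [
    len(epoch_examples(e, pathogenic, benign, supp_131, supp_181))
    for e in range(1, 8)
]
print(sizes)  # [8, 8, 8, 8, 8, 4, 4]
```

The `pretrain_epochs` parameter stands in for the predetermined number of epochs after which the supplemental examples are dropped; the text gives values such as 3 or 5.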

We now explain the PFM for an example supplemental benign training example 1012 that is constructed as a counterpart of the pathogenic missense variant training example 1002, illustrated in FIG. 10 (referred to by the numeral 1000). A PFM is generated or referenced for a training example. The PFM for a training example depends only on the position of the reference sequence, so training examples 1002 and 1012 have the same PFM. FIG. 10 shows two training examples. The first training example 1002 is a pathogenic/unlabeled variant. The second training example 1012 is the counterpart supplemental benign training example corresponding to training example 1002. Training example 1002 has a reference sequence 1002R and an alternate sequence 1002A. A first PFM is accessed or generated for training example 1002 based only on the position of reference sequence 1002R. Training example 1012 has a reference sequence 1012R and an alternate sequence 1012A. The first PFM for example 1002 can be reused for example 1012. The PFM is calculated using amino acid sequences from multiple species, such as 99 species of primates, mammals, and vertebrates, as an indication of the conservation of the sequence across species. Humans may or may not be among the species represented in the PFM calculation. The cells in this PFM contain counts of the occurrences of amino acids across species at positions in the sequence. PFM 1022 is a starting point for a PFM and illustrates the one-hot encoding of a single sequence in a training example. When the PFM is complete, for the example of 99 species, a position that is completely conserved across species has a value of "99" instead of "1". Partial conservation results in two or more rows in a column having values that sum to 99, in this example. The reference and alternate sequences both have the same PFM because the PFM depends on the overall sequence position and not on the amino acid at the center position of the sequence.

We now explain how the PFM is determined, using positions in the example reference sequence of FIG. 10. The example reference and alternate amino acid sequences for both the pathogenic/unlabeled training example 1002 and the supplemental benign training example 1012, as shown in FIG. 10, have 51 amino acids. The reference amino acid sequence 1002R has the amino acid arginine, represented by "R", at position 26 (also referred to as the target position) in the sequence. At the nucleotide level, any one of six trinucleotide codons (CGT, CGC, CGA, CGG, AGA, and AGG) encodes the amino acid "R". We do not show those codons in this example, to keep the illustration simple and to focus on the calculation of the PFM. Consider an amino acid sequence (not shown) from one of the 99 species that is aligned to the reference sequence and has the amino acid "R" at position 26. This results in a value of "1" in PFM 1022, in the cell at the intersection of row "R" and column "26". Similar values are determined for all columns of the PFM. The two PFMs (that is, the PFM of reference sequence 1002R for the pathogenic missense variant 1002 and the PFM of reference sequence 1012R for the supplemental benign training example 1012) are the same, so only one PFM 1022 is shown for illustration. These two PFMs represent opposing examples of pathogenicity for the associated amino acids: one is labeled pathogenic ("1") and the other is labeled benign ("0"). The technology disclosed therefore reduces overfitting by providing these examples to the model during training.
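The construction of a supplemental benign counterpart can be sketched as follows. The `TrainingExample` record and the `make_supplemental_benign` helper are hypothetical names introduced only for this illustration, and the all-zero matrix is a placeholder for a real conservation matrix.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    ref_seq: str  # reference amino acid sequence (51 residues)
    alt_seq: str  # alternate amino acid sequence (51 residues)
    pfm: list     # conservation matrix for this sequence window
    label: int    # 1 = pathogenic, 0 = benign


def make_supplemental_benign(example):
    """Build the benign counterpart of a missense training example.

    The counterpart covers the same window, so it reuses the same PFM;
    its alternate sequence is identical to its reference sequence.
    """
    return TrainingExample(
        ref_seq=example.ref_seq,
        alt_seq=example.ref_seq,
        pfm=example.pfm,
        label=0,
    )


flank = "A" * 25
placeholder_pfm = [[0] * 51 for _ in range(20)]  # stand-in for a real PFM

# Toy pathogenic missense example: R -> W at the target position 26.
missense = TrainingExample(
    ref_seq=flank + "R" + flank,
    alt_seq=flank + "W" + flank,
    pfm=placeholder_pfm,
    label=1,
)
supplemental = make_supplemental_benign(missense)
print(supplemental.alt_seq == supplemental.ref_seq, supplemental.label)  # True 0
```

Because the two examples share one PFM but carry opposite labels, a model cannot separate them from the conservation data alone and has to attend to the difference between the reference and alternate sequences.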

We construct a second set of supplemental benign training examples 181 that correspond to the benign missense variants 161 in the training dataset. FIG. 11 presents an example 1100 in which two PFMs are calculated for an example benign missense variant 1102 and the corresponding supplemental benign training example 1112. As the example shows, the reference sequences 1102R and 1112R are the same for both the benign missense variant 1102 and the supplemental benign training example 1112. Their respective alternate sequences 1102A and 1112A are also shown in FIG. 11. The two PFMs are generated or referenced for the two reference sequences as described above for the example presented in FIG. 10. Both PFMs are the same, and only one PFM 1122 is shown in FIG. 11 for illustration. Both of these PFMs represent amino acid sequences labeled benign ("0").

The technology disclosed calculates three PFMs, one each for 11 primate sequences, 48 mammal sequences, and 40 vertebrate sequences. FIG. 12 shows the three PFMs 1218, 1228, and 1238, each with 20 rows and 51 columns. In one implementation, the primate sequences do not include the human reference sequence. In another implementation, the primate sequences include the human reference sequence. The cell values in the three PFMs are calculated by counting the occurrences of an amino acid (row label) at a given position (column label) across all the sequences for that PFM. For example, if three primate sequences have the amino acid "K" at position 26, the cell with row label "K" and column label "26" has a value of "3".

One-hot encoding is a process by which categorical variables are converted into a form that can be provided as input to a deep learning model. A categorical variable represents an alphanumeric value for an entry in a dataset. For example, the reference and alternate amino acid sequences each consist of 51 amino acid characters arranged in a sequence. An amino acid character "T" at position "1" in a sequence represents the amino acid threonine at the first position in the sequence. The amino acid sequence is encoded by assigning a value of "1" to the cell with row label "T" and column label "1" in the one-hot encoded representation. The one-hot encoded representation of an amino acid sequence has 0s in all cells except those that represent the amino acid (row label) occurring at a particular position (column label). FIG. 13 shows an example 1300 in which the reference and alternate sequences for a supplemental benign training example are represented in one-hot encoded form. The reference and alternate amino acid sequences are given as input to the variant pathogenicity prediction model in one-hot encoded form. FIG. 14 includes an illustration 1400 showing the inputs to the variant pathogenicity prediction model. The inputs include the human reference and alternate amino acid sequences in one-hot encoded form, together with PFM 1218 for primates, PFM 1228 for mammals, and PFM 1238 for vertebrates. As described above, the PFM for primates can include only non-human primates, or both humans and non-human primates.
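As a minimal sketch of the encoding described above, the following Python function builds a 20 x 51 one-hot matrix for a single sequence. The function name `one_hot`, the alphabetical row order, and the toy sequence are assumptions made for the example; the disclosure does not specify how input encoder 186 orders its rows.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order: 20 standard amino acids


def one_hot(sequence):
    """Encode an amino acid sequence as a 20 x len(sequence) matrix.

    Each column holds a single 1, in the row of the amino acid found at
    that position, and 0 elsewhere. A character outside the 20-letter
    alphabet raises ValueError.
    """
    rows = [[0] * len(sequence) for _ in AMINO_ACIDS]
    for col, aa in enumerate(sequence):
        rows[AMINO_ACIDS.index(aa)][col] = 1
    return rows


# Toy 51-residue reference sequence with threonine ('T') at position 1.
reference = "T" + "A" * 50
encoded = one_hot(reference)

t_row = AMINO_ACIDS.index("T")
print(encoded[t_row][0], encoded[t_row][1])  # 1 0
```

The encoded reference and alternate sequences then have the same 20 x 51 shape as each of the three PFMs, which is what allows the five inputs to be presented to the model position by position.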

Variations of this approach to supplementing the training set apply both to the architectures described in the applications incorporated by reference and to any other architecture that uses PFMs in combination with other data types, especially in combination with sequences of amino acids or nucleotides.

[Results]
The performance of a neural network-based model (for example, PrimateAI, presented above) improves when the pre-training epochs presented above are used. The following table presents example test results. The results in the table are organized under six headings, which we briefly describe before presenting the results. The "Replicate" column presents results for 20 replicate runs. Each run can be an ensemble of eight models with different random seeds. "Accuracy" is the fraction of 10,000 withheld primate benign variants that are classified as benign. "Pvalue_DDD" presents the result of a Wilcoxon rank test that evaluates how well the de novo mutations of affected children with developmental disorders are separated from those of their unaffected siblings. "pvalue_605genes" presents the result of a test similar to pvalue_DDD, except that in this case we used de novo mutations within the 605 disease-related genes. "Corr_RK_RW" presents the correlation of PrimateAI scores between amino acid changes from R to K and from R to W. A smaller value of Corr_RK_RW indicates better performance. "Pvalue_Corr" presents the p-value of the correlation in the previous column, that is, Corr_RK_RW.

These results show that the median accuracy of predicting benign variants, using the median score of unknown variants as the cutoff, is 91.44% across the 20 replicate runs. The log p-value of the Wilcoxon rank-sum test is 29.39 for separating the de novo missense variants of DDD patients from the de novo missense variants of controls. Similarly, the log p-value of the rank-sum test is 16.18 when comparing de novo missense variants within only the 605 disease genes. This metric improves on previously reported results. The correlation between R->K and R->W is significantly reduced, as measured by a Wilcoxon rank-sum test p-value of 3.11e-70.
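The accuracy metric with a median cutoff can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code behind the reported figures: it assumes that a higher score means more likely pathogenic, that a variant scoring strictly below the cutoff is called benign, and the function name `benign_accuracy` and all scores are made up for the example.

```python
from statistics import median


def benign_accuracy(benign_scores, unknown_scores):
    """Fraction of withheld benign variants that are called benign.

    The cutoff is the median score of the unknown variants. Assumption:
    higher scores mean more likely pathogenic, so a variant is called
    benign when its score falls strictly below the cutoff.
    """
    cutoff = median(unknown_scores)
    called_benign = sum(1 for score in benign_scores if score < cutoff)
    return called_benign / len(benign_scores)


# Made-up scores for illustration only.
unknown_scores = [0.1, 0.4, 0.5, 0.8, 0.9]     # median is 0.5
benign_scores = [0.05, 0.2, 0.3, 0.45, 0.7]    # four of five fall below 0.5

accuracy = benign_accuracy(benign_scores, unknown_scores)
print(accuracy)  # 0.8
```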

[Particular Implementations]
We describe systems, methods, and articles of manufacture for pre-training a neural network-implemented model that processes sequences of amino acids and accompanying position frequency matrices (PFMs). One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated forward by reference into each of the following implementations.

A system implementation of the technology disclosed includes one or more processors coupled to memory. The memory is loaded with computer instructions to reduce overfitting of a neural network-implemented model that processes sequences of amino acids and accompanying position frequency matrices (PFMs). The system includes logic to generate benign-labeled supplemental training example sequence pairs that include an arrangement advancing from a start position, through a target amino acid position, to an end position. A supplemental sequence pair matches the start position and the end position of a missense training example sequence pair. It has identical amino acids in the reference and the alternate sequence of amino acids. The system includes logic to input, with each supplemental sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions. The system includes logic to train the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense training example sequence pairs at the matching start and end positions. The training influence of the training PFMs is attenuated during the training.

This system implementation and other systems disclosed optionally include one or more of the following features. The system can also include features described in connection with the methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how the features identified in this section can readily be combined with base features in other statutory classes.

The system can include logic to construct the supplemental sequence pairs such that each supplemental sequence pair matches the start position and the end position of a benign missense training example sequence pair.

The system can include logic to construct the supplemental sequence pairs such that each supplemental sequence pair matches the start position and the end position of a pathogenic missense training example sequence pair.

The system includes logic to modify the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after a predetermined number of training epochs.

The system includes logic to modify the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after three training epochs.

The system includes logic to modify the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after five training epochs.

The ratio of supplemental training example sequence pairs to pathogenic training example sequence pairs can be between 1:1 and 1:8. The system can use different values for this range, for example between 1:1 and 1:12, between 1:1 and 1:16, or between 1:1 and 1:24.

The ratio of supplemental training example sequence pairs to benign training example sequence pairs can be between 1:2 and 1:8. The system can use different values for this range, for example between 1:1 and 1:12, between 1:1 and 1:16, or between 1:1 and 1:24.

The system includes logic to create the supplemental PFMs from amino acid position data for non-human primates and non-primate mammals.

Other implementations may include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform the functions of the system described above. Yet another implementation may include a method performing the functions of the system described above.

A method implementation of the technology disclosed includes generating benign-labeled supplemental training example sequence pairs that include an arrangement advancing from a start position, through a target amino acid position, to an end position. Each supplemental sequence pair matches the start position and the end position of a missense training example sequence pair. It has identical amino acids in the reference and the alternate sequence of amino acids. The method includes inputting, with each supplemental sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions. The method includes training the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense training example sequence pairs at the matching start and end positions. The training influence of the training PFMs is attenuated during the training.

This method implementation and other methods disclosed optionally include one or more of the following features. The method can also include features described in connection with the systems disclosed. The reader will understand how the features identified in this section can readily be combined with base features in other statutory classes.

Other implementations may include a set of one or more non-transitory computer-readable storage media collectively storing computer program instructions executable by one or more processors to reduce overfitting of a neural network-implemented model that processes sequences of amino acids and accompanying position frequency matrices (PFMs). The computer program instructions, when executed on one or more processors, implement a method that includes generating benign-labeled supplemental training example sequence pairs that include an arrangement advancing from a start position, through a target amino acid position, to an end position. Each supplemental sequence pair matches the start position and the end position of a missense training example sequence pair. It has identical amino acids in the reference and the alternate sequence of amino acids. The method includes inputting, with each supplemental sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions. The method includes training the neural network-implemented model using the benign training example sequence pairs and the supplemental training example PFMs, together with the missense training example sequence pairs and the PFMs of the missense training example sequence pairs at the matching start and end positions. The training influence of the training PFMs is attenuated during the training.

A computer-readable media (CRM) implementation of the technology disclosed includes one or more non-transitory computer-readable storage media impressed with computer program instructions that, when executed on one or more processors, implement the method described above. This CRM implementation includes one or more of the following features. The CRM implementation can also include features described in connection with the systems and methods disclosed above.

The preceding description is presented to enable the making and use of the technology disclosed. Various modifications to the disclosed implementations will be apparent, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. The scope of the technology disclosed is defined by the appended claims.

[Computer System]
FIG. 15 is a simplified block diagram 1500 of a computer system that can be used to implement the technology disclosed. The computer system typically includes at least one processor that communicates with a number of peripheral devices via a bus subsystem. These peripheral devices can include a storage subsystem (including, for example, memory devices and a file storage subsystem), user interface input devices, user interface output devices, and a network interface subsystem. The input and output devices allow user interaction with the computer system. The network interface subsystem provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.

In one implementation, the neural networks, such as variant pathogenicity classifier 157, PFM calculator 184, and input encoder 186, are communicably linked to the storage subsystem and the user interface input devices.

User interface input devices can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into the computer system.

User interface output devices can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from the computer system to the user or to another machine or computer system.

The storage subsystem stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by the processor alone or in combination with other processors.

Memory used in the storage subsystem can include a number of memories, including a main random access memory (RAM) for storage of instructions and data during program execution and a read-only memory (ROM) in which fixed instructions are stored. The file storage subsystem can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by the file storage subsystem in the storage subsystem, or in other machines accessible by the processor.

The bus subsystem provides a mechanism for letting the various components and subsystems of the computer system communicate with each other as intended. Although the bus subsystem is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple buses.

The computer system itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of the computer system depicted in FIG. 15 is intended only as a specific example for purposes of illustrating the disclosed technology. Many other configurations of the computer system are possible, having more or fewer components than the computer system depicted in FIG. 15.

The deep learning processors can be GPUs or FPGAs and can be hosted by deep learning cloud platforms such as Google Cloud Platform, Xilinx, and Cirrascale. Examples of deep learning processors include Google's Tensor Processing Unit (TPU), rackmount solutions such as the GX4 Rackmount Series and GX8 Rackmount Series, NVIDIA DGX-1, Microsoft's Stratix V FPGA, Graphcore's Intelligent Processor Unit (IPU), Qualcomm's Zeroth platform with Snapdragon processors, NVIDIA's Volta, NVIDIA's DRIVE PX, NVIDIA's JETSON TX1/TX2 MODULE, Intel's Nirvana, Movidius VPU, Fujitsu DPI, ARM's DynamicIQ, IBM TrueNorth, and others.

114 Trainer
116 Tester
121 Pathogenic missense training examples
131 Supplemental benign training examples
155 Network
157 Variant pathogenicity prediction model
161 Benign missense training examples
181 Supplemental benign training examples
184 Position frequency matrix (PFM) calculator
186 Input encoder
600 Example
700 Illustration
800 Example
1002 Pathogenic missense variant training example
1002A Alternate sequence
1002R Reference sequence
1012 Supplemental benign training example
1012A Alternate sequence
10012R Reference sequence
1022 PFM
1100 Example
1102 Benign missense variant
1102R and 1112R Reference sequences
1112 Supplemental benign training example
1102A and 1112A Alternate sequences
1122 Corresponding supplemental benign training example
1218, 1228, and 1238 PFMs
1300 Example
1400 Illustration

Claims

1. A method for reducing overfitting of a neural network-implemented model that processes sequences of amino acids and accompanying position frequency matrices (PFMs), the method comprising:
generating supplemental training example sequence pairs, labeled benign, that include an arrangement advancing from a start position, through a target amino acid position, to an end position, wherein each supplemental training example sequence pair:
matches the start position and the end position of a missense training example sequence pair; and
has identical amino acids in a reference sequence and an alternate sequence of amino acids;
inputting, with each supplemental training example sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions; and
training the neural network-implemented model using the benign supplemental training example sequence pair, the supplemental training PFM, the missense training example sequence pair, and the PFM of the missense training example sequence pair at the matching start and end positions,
whereby the training influence of the supplemental training PFM is attenuated during the training.
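
The generating and inputting steps of claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the disclosed system: the `TrainingExample` record, its field names, and the `make_supplemental` helper are hypothetical, and the PFM is represented as a plain tuple of rows.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TrainingExample:
    """Hypothetical record for one training example sequence pair."""
    start: int                          # start position of the arrangement
    end: int                            # end position of the arrangement
    target: int                         # target amino acid position
    reference: str                      # reference amino acid sequence
    alternate: str                      # alternate amino acid sequence
    pfm: Tuple[Tuple[float, ...], ...]  # PFM covering start..end
    label: str                          # "benign" or "pathogenic"


def make_supplemental(missense: TrainingExample) -> TrainingExample:
    """Derive a benign-labeled supplemental pair from a missense pair.

    The supplemental pair keeps the same start, end, and target positions
    and the same PFM, but its alternate sequence is set equal to the
    reference sequence, so the pair carries no amino acid change.
    """
    return replace(missense, alternate=missense.reference, label="benign")


# Illustrative values only; they are not taken from the patent.
missense = TrainingExample(
    start=10, end=14, target=12,
    reference="MKVLA", alternate="MKTLA",
    pfm=((0.9, 0.1), (0.2, 0.8), (0.5, 0.5), (0.7, 0.3), (0.4, 0.6)),
    label="pathogenic",
)
supplemental = make_supplemental(missense)
```

Because the supplemental pair shares its PFM with the missense pair but is always labeled benign, the same PFM is seen with both labels during training, which is one way to read the attenuation effect recited in the claim.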

2. The method of claim 1, wherein the supplemental training example sequence pair matches the start position and the end position of a pathogenic missense training example sequence pair.

3. The method of claim 1, wherein the supplemental training example sequence pair matches the start position and the end position of a benign missense training example sequence pair.

4. The method of claim 1, further comprising modifying the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after a predetermined number of training epochs.

5. The method of claim 1, further comprising modifying the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after five training epochs.
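
The epoch cutoff recited above can be sketched as a small selection function. This is a hedged illustration only: the function name `examples_for_epoch`, the `cutoff_epochs` parameter, and the zero-based epoch index are assumptions introduced here, and the examples are stand-in strings rather than real sequence pairs.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def examples_for_epoch(
    epoch: int,
    missense: Sequence[T],
    supplemental: Sequence[T],
    cutoff_epochs: int = 5,
) -> List[T]:
    """Return the training examples to use in a given epoch.

    Epochs are counted from zero (an assumption). Supplemental examples
    are included only while epoch < cutoff_epochs; after that, training
    continues on the missense examples alone.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    selected = list(missense)
    if epoch < cutoff_epochs:
        selected.extend(supplemental)
    return selected


# Stand-in identifiers for illustration.
early = examples_for_epoch(0, ["m1", "m2"], ["s1"])
last_with_supplemental = examples_for_epoch(4, ["m1", "m2"], ["s1"])
after_cutoff = examples_for_epoch(5, ["m1", "m2"], ["s1"])
```

With the default cutoff of five, the supplemental examples take part in the first five epochs (indices 0 through 4) and are dropped from the sixth epoch onward.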

6. The method of claim 2, wherein a ratio of the supplemental training example sequence pairs to the pathogenic missense training example sequence pairs is between 1:1 and 1:8.

7. The method of claim 3, wherein a ratio of the supplemental training example sequence pairs to the benign missense training example sequence pairs is between 1:1 and 1:8.
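
One way to keep the ratio of supplemental pairs to missense pairs within the recited 1:1 to 1:8 range is to pick, at random, which missense pairs receive a supplemental counterpart. The sketch below is an assumption-laden illustration: `choose_supplemental_sources`, its parameters, and the use of uniform random sampling are hypothetical choices, since the claims state only the ratio and not how it is achieved.

```python
import random
from typing import List


def choose_supplemental_sources(
    n_missense: int,
    missense_per_supplemental: int,
    seed: int = 0,
) -> List[int]:
    """Pick indices of missense pairs that get a supplemental counterpart.

    A value k for missense_per_supplemental gives a supplemental-to-missense
    ratio of 1:k, so k must lie between 1 and 8 inclusive.
    """
    if n_missense < 0:
        raise ValueError("n_missense must be non-negative")
    if not 1 <= missense_per_supplemental <= 8:
        raise ValueError("ratio must be between 1:1 and 1:8")
    n_supplemental = n_missense // missense_per_supplemental
    rng = random.Random(seed)  # seeded for reproducible selection
    return sorted(rng.sample(range(n_missense), n_supplemental))


one_to_one = choose_supplemental_sources(16, 1)
one_to_eight = choose_supplemental_sources(16, 8)
```

At 1:1 every missense pair receives a supplemental pair; at 1:8 only one in eight does, which reduces the added training volume.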

8. The method of claim 1, further comprising using, in creating the supplemental training PFMs, amino acid positions from data for non-human primates and non-primate mammals.

9. A system comprising one or more processors coupled to memory, the memory loaded with computer instructions to reduce overfitting of a neural network-implemented model that processes sequences of amino acids and accompanying position frequency matrices (PFMs), the instructions, when executed on the processors, performing actions comprising:
generating supplemental training example sequence pairs, labeled benign, that include an arrangement advancing from a start position, through a target amino acid position, to an end position, wherein each supplemental training example sequence pair:
matches the start position and the end position of a missense training example sequence pair; and
has identical amino acids in a reference sequence and an alternate sequence of amino acids;
inputting, with each supplemental training example sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions; and
training the neural network-implemented model using the benign supplemental training example sequence pair, the supplemental training PFM, the missense training example sequence pair, and the PFM of the missense training example sequence pair at the matching start and end positions,
whereby the training influence of the supplemental training PFM is attenuated or negated during the training.

10. The system of claim 9, wherein the supplemental training example sequence pair matches the start position and the end position of a pathogenic missense training example sequence pair.

11. The system of claim 9, wherein the supplemental training example sequence pair matches the start position and the end position of a benign missense training example sequence pair.

12. The system of claim 9, further implementing actions comprising modifying the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after a predetermined number of training epochs.

13. The system of claim 9, further implementing actions comprising modifying the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after five training epochs.

14. The system of claim 10, wherein a ratio of the supplemental training example sequence pairs to the pathogenic missense training example sequence pairs is between 1:1 and 1:8.

15. The system of claim 11, wherein a ratio of the supplemental training example sequence pairs to the benign missense training example sequence pairs is between 1:1 and 1:8.

16. The system of claim 9, further implementing actions comprising using, in creating the supplemental training PFMs, amino acid positions from data for non-human primates and non-primate mammals.

17. A non-transitory computer-readable storage medium impressed with computer program instructions to reduce overfitting of a neural network-implemented model that processes sequences of amino acids and accompanying position frequency matrices (PFMs), the instructions, when executed on a processor, implementing a method comprising:
generating supplemental training example sequence pairs, labeled benign, that include an arrangement advancing from a start position, through a target amino acid position, to an end position, wherein each supplemental training example sequence pair:
matches the start position and the end position of a missense training example sequence pair; and
has identical amino acids in a reference sequence and an alternate sequence of amino acids;
inputting, with each supplemental training example sequence pair, a supplemental training PFM that is identical to the PFM of the missense training example sequence pair at the matching start and end positions; and
training the neural network-implemented model using the benign supplemental training example sequence pair, the supplemental training PFM, the missense training example sequence pair, and the PFM of the missense training example sequence pair at the matching start and end positions,
whereby the training influence of the supplemental training PFM is attenuated during the training.

18. The non-transitory computer-readable storage medium of claim 17, wherein the supplemental training example sequence pair matches the start position and the end position of a pathogenic missense training example sequence pair.

19. The non-transitory computer-readable storage medium of claim 17, wherein the supplemental training example sequence pair matches the start position and the end position of a benign missense training example sequence pair.

20. The non-transitory computer-readable storage medium of claim 17, implementing the method further comprising modifying the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after a predetermined number of training epochs.

21. The non-transitory computer-readable storage medium of claim 17, implementing the method further comprising modifying the training of the neural network-implemented model to cease using the supplemental training example sequence pairs and the supplemental training PFMs after five training epochs.

22. The non-transitory computer-readable storage medium of claim 18, wherein a ratio of the supplemental training example sequence pairs to the pathogenic missense training example sequence pairs is between 1:1 and 1:8.

23. The non-transitory computer-readable storage medium of claim 19, wherein a ratio of the supplemental training example sequence pairs to the benign missense training example sequence pairs is between 1:1 and 1:8.

24. The non-transitory computer-readable storage medium of claim 17, implementing the method further comprising using, in creating the supplemental training PFMs, amino acid positions from data for non-human primates and non-primate mammals.