CN113870949B

CN113870949B - Deep learning-based nanopore sequencing data base identification method

Info

Publication number: CN113870949B
Application number: CN202111172443.5A
Authority: CN
Inventors: 汪国华; 高文韬; 邹权
Original assignee: Northeast Forestry University; Yangtze River Delta Research Institute of UESTC Huzhou
Current assignee: Northeast Forestry University; Yangtze River Delta Research Institute of UESTC Huzhou
Priority date: 2021-10-08
Filing date: 2021-10-08
Publication date: 2022-05-17
Anticipated expiration: 2041-10-08
Also published as: CN113870949A

Abstract

A deep learning-based nanopore sequencing data base identification method relates to the field of bioinformatics and addresses the low accuracy of nanopore sequencing in the prior art. The method comprises the following steps. Step one: downloading 50 groups of raw nanopore data, including Klebsiella pneumoniae, Enterobacteriaceae and Proteobacteria, as a training set. Step two: performing base recognition on the 50 groups of raw data to obtain base sequences. Step three: acquiring Illumina sequencing sequences with an accuracy above 99%, taking them as the reference genome, taking the reference genome as the ground truth, and correcting the base sequences using the Tombo algorithm. Step four: converting the corrected base sequences into the corresponding electrical signal data using the Re-squiggle method, and then labeling the electrical signal data. Step five: training a neural network with the labeled electrical signal data and the raw data, and performing base recognition with the trained neural network. The method achieves high-accuracy recognition of the base sequence of nanopore sequencing data.

Description

Deep learning-based nanopore sequencing data base identification method

Technical Field

The invention relates to the field of bioinformatics, in particular to a deep learning-based nanopore sequencing data base identification method.

Background

Compared with second-generation sequencers and the third-generation sequencers from PacBio, the third-generation nanopore sequencer from Oxford Nanopore Technologies has the advantages of portability, low cost and long sequencing reads. However, the accuracy of nanopore sequencing is much lower than that of second-generation sequencing technology and PacBio's HiFi sequencing technology. The accuracy of the officially provided base recognition tool is only about 90%, and the tool is not open source.

The nanopore of the nanopore sequencer is essentially a nanoscale protein pore with voltage detection devices on both sides. In operation, primers are used to pull single-stranded DNA/RNA through the nanopore, and different types of nucleotides cause different current changes as they pass through. The sequencer records all changes in current, and the electrical signal is then translated into the corresponding base sequence. Because nanopore sequencing is single-molecule sequencing, noise signals and random errors have a large impact on the accuracy of base recognition.

The output data of the nanopore sequencer are divided into fasta and fast5 files. The fasta file contains the gene sequence obtained with the official base recognition tool (Guppy), with an accuracy of about 90%. The fast5 file contains the raw electrical signal acquired by the sequencer. Taking the official tool Guppy R9.4 as an example, 5 bases pass through the nanopore at a time, so there are 4^5 = 1024 possible gene sequences. Base modifications complicate matters further. A currently known base modification is 5mC; if 5mC is treated as a fifth base signal class in addition to A, C, G and T, the 5 bases passing through the nanopore at one time have 5^5 = 3125 possible sequences.

Moreover, because nucleotides and nanopores are nanoscale molecular structures, the official base recognition tool cannot reliably predict the true base sequence from the electrical signal. This is a major factor limiting the accuracy of nanopore sequencing. It is therefore necessary to construct a model using deep learning methods to make reliable predictions from the raw nanopore sequencing data.
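The sequence counts quoted above follow from simple exponentiation: a window of k bases over an alphabet of n symbols has n^k possible contents. A minimal Python sketch of that arithmetic is given here; the function name is hypothetical and chosen only for illustration.

```python
def kmer_state_count(alphabet_size: int, k: int) -> int:
    """Number of distinct k-base windows over an alphabet of the given size."""
    return alphabet_size ** k


# Four canonical bases (A, C, G, T), five bases in the pore at a time.
canonical = kmer_state_count(4, 5)

# Five signal classes when 5mC is treated as an additional base.
with_5mc = kmer_state_count(5, 5)

print(canonical, with_5mc)
```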

Disclosure of Invention

The purpose of the invention is to provide a deep learning-based nanopore sequencing data base identification method that addresses the low accuracy of nanopore sequencing in the prior art.

The technical scheme adopted by the invention to solve the technical problems is as follows:

The deep learning-based nanopore sequencing data base identification method comprises the following steps:

step one: downloading 50 groups of raw nanopore data, including Klebsiella pneumoniae, Enterobacteriaceae and Proteobacteria, as a training set;

step two: performing base recognition on the 50 groups of raw data to obtain base sequences;

step three: acquiring Illumina sequencing sequences with an accuracy above 99%, taking them as the reference genome, taking the reference genome as the ground truth, and correcting the base sequences using the Tombo algorithm;

step four: converting the corrected base sequences into the corresponding electrical signal data using the Re-squiggle method, and then labeling the electrical signal data;

step five: training a neural network with the labeled electrical signal data and the raw data, and performing base recognition with the trained neural network.

Further, the neural network comprises a first convolutional layer, a second convolutional layer, a BERT module, a fully connected layer and a CTC decoding module;

the first convolutional layer is used for down-sampling the labeled electrical signal data,

the second convolutional layer is used for feature extraction on the down-sampled electrical signal data,

a batch normalization (BN) layer is arranged after each of the first and second convolutional layers and is used for preventing the mean and the variance from saturating,

the BERT module is trained on the extracted features and outputs the base sequence corresponding to the electrical signal data,

the fully connected layer processes the base sequences corresponding to the electrical signal data with a softmax function to obtain the probability of each base sequence corresponding to the raw electrical signal,

the CTC decoding module processes the probability of each base sequence corresponding to the raw electrical signal to obtain the final base sequence,

the convolution kernel in the first convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,

the convolution kernel in the second convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels,

the BERT module comprises 12 Transformer layers, a 768-dimensional embedding hidden layer and a 12-head attention mechanism.
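The kernel size and stride above determine how much the signal is shortened before it reaches the BERT module. The following minimal Python sketch computes the output length of two stacked 1 × 3, stride-2 convolutions. The zero padding default and the example length of 2048 samples are assumptions made for illustration; the source states neither the padding nor the input chunk length.

```python
def conv_output_length(length: int, kernel: int = 3, stride: int = 2,
                       padding: int = 0) -> int:
    """Output length of a 1-D convolution (standard floor formula)."""
    return (length + 2 * padding - kernel) // stride + 1


def downsampled_length(signal_length: int) -> int:
    """Length after the first and second convolutional layers in sequence."""
    after_first = conv_output_length(signal_length)
    return conv_output_length(after_first)


# Hypothetical chunk of 2048 raw signal samples.
print(downsampled_length(2048))
```

Under these assumptions the two layers together shorten the signal by roughly a factor of four, which reduces the sequence length the attention layers must process.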

Further, the features of the labeled electrical signal data are expressed as:

where c represents the sequencing data, x_c represents the corresponding feature of the sequencing data, ω is the weight of the convolution kernel, the kernel size parameter k is set to 3, i and j are position indices in the sequence, T is the length of the sequence, and Σ denotes summation.
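A form consistent with the symbol definitions above, assuming a standard one-dimensional convolution over the signal, would be the following. The exact indexing is an assumption; with the stride of 2 stated for the convolutional layers, the window start i would advance two samples at a time.

```latex
\[
x_c(i) = \sum_{j=0}^{k-1} \omega_j \, c_{i+j},
\qquad 0 \le i \le T - k, \quad k = 3
\]
```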

Further, the BN layer is represented as:

where α, β and ε are model parameters, x_bn is the sequence feature output by the convolutional layer, E is the expectation function, and Var is the variance function.
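With these symbols, the standard batch normalization expression would read as follows. The output symbol y is an assumed name, and the expression is the conventional form rather than a transcription of the original formula.

```latex
\[
y = \alpha \cdot \frac{x_{bn} - E[x_{bn}]}{\sqrt{\mathrm{Var}[x_{bn}] + \epsilon}} + \beta
\]
```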

Further, the softmax function is expressed as:

where z_i is the output value of the i-th node, C is the number of classification categories, e is the base of the natural logarithm and is a mathematical constant, and z_c is the output value of the c-th node.
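Written out with the symbols just defined, the conventional softmax function is:

```latex
\[
\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}
\]
```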

Further, the CTC decoding module specifically performs the following steps:

for the predicted sequence output by the BERT layer, candidate base sequences are first generated iteratively using a beam search algorithm with a beam width of 3; the candidate bases are then scored, blank characters and redundant characters in the base sequence are removed, and the base sequence with the highest score is selected as the final prediction result,

the probability of blank characters existing in the base sequence is:

where x is the output sequence of the BERT layer, π represents the path corresponding to an intermediate result, β⁻¹(l) represents all paths satisfying the condition during the search, l is the output result, and P(l | x) represents the probability of blank characters in the sequence,

the CTC loss function is expressed using the blank-character probability of the base sequence and is equal to the minimized log-domain quantity −ln(P(π | x)); the CTC loss function is expressed as:

where ln() represents the natural logarithm.
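In the usual formulation of connectionist temporal classification, which the definitions above appear to follow, the probability of an output l sums over all paths that collapse to it, and the loss is the negative logarithm of that probability. The per-sequence loss symbol below is an assumed name; the detailed description states the same loss summed over a set S of (x, z) pairs.

```latex
\[
P(l \mid x) = \sum_{\pi \in \beta^{-1}(l)} P(\pi \mid x),
\qquad
L_{\mathrm{CTC}}(x, l) = -\ln P(l \mid x)
\]
```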

Further, the base recognition of the 50 groups of raw data in step two is performed with the base recognition tool Guppy.

The invention has the beneficial effects that:

(1) The invention uses a deep neural network model with better performance and introduces ideas from natural language processing into the base recognition of nanopore sequencing data, performing better than the official base recognition tool.

(2) The invention provides a good basis for genomics research, and the high-accuracy base identification is beneficial to the analysis of downstream genome data.

(3) The model of the invention has better generalization performance and is suitable for the base recognition of the nanopore sequencing data of various species including microorganisms, plants, animals and the like.

The method uses convolutional layers to down-sample the nanopore electrical signal data and extract features, uses a BERT module to predict the base sequence corresponding to the electrical signal, and uses the CTC algorithm to remove redundant data, thereby achieving high-accuracy recognition of the base sequence of nanopore sequencing data.

Drawings

FIG. 1 is a flow chart of a method for base recognition of nanopore sequencing data based on a deep neural network model according to an embodiment of the present application;

FIG. 2 is a diagram illustrating the effect of the deep neural network model of the present application;

FIG. 3 is a graphical representation of the comparison of the accuracy of the present application with official base recognition tools and Guppy-KP on the test set;

FIG. 4 is schematic diagram 1 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 5 is schematic diagram 2 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 6 is schematic diagram 3 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 7 is schematic diagram 4 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 8 is schematic diagram 5 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 9 is schematic diagram 6 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 10 is schematic diagram 7 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 11 is schematic diagram 8 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 12 is schematic diagram 9 comparing the sequence identity indicators of the present application and the official base recognition tools on the test set;

FIG. 13 is a graph showing the comparison of error rates on a test set of 9 species for the present application and the official base recognition tool.

Detailed Description

It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.

Embodiment one: the deep learning-based nanopore sequencing data base identification method of this embodiment is described in detail with reference to FIG. 1.

As shown in FIG. 1, the method comprises the following steps S1-S8:

S1, downloading 50 groups of raw nanopore data, including Klebsiella pneumoniae, Enterobacteriaceae and Proteobacteria, together with sequencing data of 9 other species, to form a data set.

The 50 groups of raw nanopore sequencing data are used as the training set of the model, and the gene sequences of the other 9 species are used as the test set.

S2, performing base recognition on the 50 groups of raw data using the official nanopore base recognition tool Guppy.

Guppy is used to convert the raw nanopore signals of unknown sequence into base sequences so that the corresponding next-generation-sequenced reference genomes can be found.

S3, using the Illumina sequencing sequences as the reference genome, correcting the Guppy-processed base sequences with the Tombo algorithm, and annotating the corrected sequences with a dynamic time warping algorithm.

S4, converting the true DNA sequences into true electrical signals using the Re-squiggle method, and generating labeled data in (base sequence, electrical signal) format.

S5, constructing a neural network model based on a convolutional neural network and a BERT network, the model comprising two convolutional layers, a BERT module, a fully connected layer and a CTC decoding module. Feature extraction is performed on the input sequence by the convolution module. Two convolutional layers are used for preprocessing and feature extraction of the input sequence data, as follows:

S51, the convolution kernel in the first convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels, and is used for down-sampling the data and reducing computational complexity.

S52, the convolution kernel in the second convolutional layer has a size of 1 × 3, a stride of 1 × 2 and 128 output channels, and is used for feature extraction. The input signal vector x is calculated as follows:
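As an illustration of the operation in S51 and S52, a single-channel strided convolution can be sketched in plain Python. This is a minimal sketch, not the patented implementation: it uses one channel instead of 128, no bias and no padding, and the kernel weights and signal values are made-up numbers.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], weights: Sequence[float],
           stride: int = 2) -> List[float]:
    """Slide a kernel over the signal, advancing `stride` samples per step."""
    k = len(weights)
    output = []
    for start in range(0, len(signal) - k + 1, stride):
        # Weighted sum of the k samples under the kernel window.
        window_sum = sum(w * signal[start + j] for j, w in enumerate(weights))
        output.append(window_sum)
    return output


# Hypothetical raw current samples and a size-3 difference kernel.
features = conv1d([1, 2, 3, 4, 5, 6, 7], [1, 0, -1])
print(features)
```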

S53, a batch normalization (BN) layer is arranged after each convolution module to prevent the mean and the variance from saturating and to improve the generalization performance of the model. The calculation formula is as follows:
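The normalization in S53 can be illustrated with a few lines of Python. This sketch assumes the conventional batch normalization form (subtract the mean, divide by the square root of the variance plus ε, then scale by α and shift by β); the default parameter values and the input numbers are assumptions, and a trained model would learn α and β.

```python
import math
from typing import List, Sequence


def batch_norm(values: Sequence[float], alpha: float = 1.0,
               beta: float = 0.0, eps: float = 1e-5) -> List[float]:
    """Normalize values to zero mean and near-unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = alpha / math.sqrt(variance + eps)
    return [scale * (v - mean) + beta for v in values]


# Hypothetical feature values from one convolution channel.
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```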

S6, inputting the extracted features into the BERT module and, after processing by the fully connected layer, outputting the probability of each base sequence corresponding to the raw nanopore electrical signal. The features extracted after down-sampling are input into the BERT module for training. The BERT module contains 12 Transformer layers, a 768-dimensional embedding hidden layer and a 12-head attention mechanism. It is followed by a fully connected layer, and the probability of the base at each position is calculated using the softmax function. i and j denote position indices in the sequence, and x_{i+j} denotes the sequence feature extracted by convolution at each position of the sequence, which serves as the input to the subsequent BERT layer. The calculation formula is as follows:
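The softmax step in S6 turns the fully connected layer's outputs at one position into a probability distribution. A minimal Python sketch follows. The five output classes (A, C, G, T and a CTC blank) and the example scores are assumptions, and subtracting the maximum score is a common numerical-stability measure that is not mentioned in the source.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Shifting by the maximum avoids overflow and leaves the result unchanged.
    exponentials = [math.exp(s - highest) for s in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Hypothetical scores for A, C, G, T and blank at a single position.
probabilities = softmax([1.0, 2.0, 3.0, 0.5, 0.0])
print(probabilities)
```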

S7, removing repeated base sequences and blank sequences using the CTC decoding module, and finally outputting a high-accuracy nanopore base sequence. The high-order feature distribution distance between the raw nanopore electrical signal and the base sequence is calculated using the CTC loss function. The CTC decoder iteratively generates candidate base sequences using a beam search algorithm with a beam width of 3, and then scores the candidate bases. Blank characters in the sequences are removed in the process, and the base sequence with the highest score is selected as the final prediction result. A high-accuracy base sequence is thereby obtained from the raw nanopore signal data. In the softmax function, e is the base of the natural logarithm and is a mathematical constant, and z_c represents the output value of the c-th node. The probability of blank characters existing in the base sequence and the CTC loss function are calculated as follows:

$$L(S) = -\ln \prod_{(x,z) \in S} p(z \mid x) = -\sum_{(x,z) \in S} \ln p(z \mid x)$$

where x represents an input sequence, π represents a base path searched by the beam search, and z represents an output sequence.
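The decoding in S7 can be illustrated with a compact CTC prefix beam search of width 3. This is a sketch of the general technique under stated assumptions, not the patented decoder: the alphabet order, the position of the blank symbol, the helper names and the toy probability frames are all hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"        # assumed base alphabet
BLANK = len(ALPHABET)    # assumed index of the CTC blank symbol


def collapse(path):
    """Merge consecutive repeats, then drop blanks (the CTC mapping)."""
    bases, previous = [], None
    for symbol in path:
        if symbol != previous and symbol != BLANK:
            bases.append(ALPHABET[symbol])
        previous = symbol
    return "".join(bases)


def beam_search(frames, beam_width=3):
    """Return the most probable collapsed sequence and its probability."""
    # Each prefix keeps (probability ending in blank, probability ending in a base).
    beams = {(): (1.0, 0.0)}
    for frame in frames:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_base) in beams.items():
            total = p_blank + p_base
            candidates[prefix][0] += total * frame[BLANK]
            for symbol in range(len(ALPHABET)):
                p = frame[symbol]
                if prefix and prefix[-1] == symbol:
                    # A repeat only extends the prefix if a blank separated it.
                    candidates[prefix][1] += p_base * p
                    candidates[prefix + (symbol,)][1] += p_blank * p
                else:
                    candidates[prefix + (symbol,)][1] += total * p
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best_prefix, best_scores = max(beams.items(), key=lambda kv: sum(kv[1]))
    return "".join(ALPHABET[s] for s in best_prefix), sum(best_scores)


def peaked(index, peak=0.9):
    """Toy frame with most probability mass on one symbol."""
    rest = (1.0 - peak) / len(ALPHABET)
    return [peak if i == index else rest for i in range(len(ALPHABET) + 1)]


frames = [peaked(0), peaked(0), peaked(BLANK), peaked(1)]
sequence, probability = beam_search(frames)
loss = -math.log(probability)  # per-sequence CTC loss, -ln p(z | x)
print(sequence, loss)
```

Tracking the blank-ending and base-ending probabilities separately is what lets the search distinguish a repeated base (for example "AA") from one base held over several signal frames.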

S8, using the trained prediction model to convert the raw electrical signal of nanopore sequencing into a base sequence with higher accuracy than that of the official tool.

The recognition effect of the present invention is further described below with a set of specific experimental examples.

First, to evaluate the performance of the base recognition tools, we carried out a comparative analysis of 4 base recognition tools, including our model, on the same dataset. Here, Our method denotes the deep neural network model of the invention, Guppy and Albacore are the official base recognition tools of Oxford Nanopore, and Guppy-KP is a model retrained on the basis of the official tool.

Table one shows the error rates on the test set for 4 tools including our method.

Here, deletion, insertion and mismatch denote the deletion errors, insertion errors and mismatch errors of the sequencing data, respectively. The base recognition accuracy is defined as follows:

where M represents the number of matched bases, S the number of mismatched bases, I the number of inserted bases and D the number of deleted bases. On the Klebsiella pneumoniae NUH29 dataset, the error rate of the method of the present invention is 11.06%, lower than that of the other base recognition tools. On the Klebsiella pneumoniae KSB2 dataset, the error rates of the method of the invention, Albacore and Guppy are 11.26%, 15.80% and 15.73%, respectively; the method of the invention thus has a lower error rate than the official base recognition tools.
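A common alignment-based definition consistent with the symbols M, S, I and D is accuracy = M / (M + S + I + D), with each error rate being the corresponding count divided by the same total. The Python sketch below assumes that definition; the function name and the counts are hypothetical and are not taken from the reported experiments.

```python
def base_call_rates(matches: int, mismatches: int,
                    insertions: int, deletions: int) -> dict:
    """Accuracy and per-type error rates from alignment column counts."""
    total = matches + mismatches + insertions + deletions
    if total == 0:
        raise ValueError("alignment contains no columns")
    return {
        "accuracy": matches / total,
        "mismatch": mismatches / total,
        "insertion": insertions / total,
        "deletion": deletions / total,
    }


# Hypothetical counts for one read aligned to its reference.
rates = base_call_rates(matches=890, mismatches=40, insertions=30, deletions=40)
print(rates)
```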

Second, we also used genome assembly consistency as an index to evaluate model performance. FIG. 4 shows the consensus sequence identity of the 4 base recognition tools, including the present invention. We used 6 indicators to evaluate model performance: homopolymer insertion errors, other insertion errors, homopolymer deletion errors, other deletion errors, substitution errors and Dcm errors.

Performance evaluation on the test set shows that the base recognition error rate and the genome assembly consistency indicators of the invention are superior to those of the official base recognition tools.

It should be noted that the detailed description is only for explaining and explaining the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. It is intended that all such modifications and variations that fall within the spirit and scope of the invention be limited only by the claims and the description.

Claims

1. The deep learning-based method for identifying the base of the nanopore sequencing data comprises the following steps:

step five: training a neural network by using the marked electric signal data and the original data, and performing base identification by using the trained neural network;

the neural network comprises a first convolutional layer, a second convolutional layer, a BERT module, a fully connected layer and a CTC decoding module;

2. The method for deep learning based nanopore sequencing data base identity recognition according to claim 1 wherein the labeled electrical signal data characteristics are represented as:

3. The deep learning-based nanopore sequencing data base identity method of claim 2, wherein the BN layer is represented as:

4. The deep learning-based nanopore sequencing data base identity method of claim 3, wherein the softmax function is expressed as:

5. The deep learning-based nanopore sequencing data base identity method of claim 4, wherein the CTC decoding module specifically performs the following steps:

the probability of blank characters existing in the base sequence is:

expressing the CTC loss function using the blank-character probability of the base sequence, equal to the minimized log-domain quantity −ln(P(π | x)), as:

where ln () represents the natural logarithm.

6. The method for base recognition based on deep learning nanopore sequencing data of claim 1, wherein the base recognition of the 50 sets of raw data in the second step is performed by a base recognition tool Guppy.