CN112735604B - Novel coronavirus classification method based on deep learning algorithm - Google Patents
Novel coronavirus classification method based on deep learning algorithm
- Publication number
- CN112735604B (application CN202110045563.2A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- virus
- novel coronavirus
- data
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/80—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention provides a novel coronavirus classification method based on a deep learning algorithm, which solves the technical problem of low classification precision in the prior art. The implementation steps are: obtaining the existing available virus sequences and a novel coronavirus data set; preprocessing the virus sequence data set; performing feature extraction on the preprocessed high-dimensional redundant virus sequence features with three cascaded automatic encoders to achieve data dimension reduction and obtain nonlinear abstract features of the virus sequences; obtaining training set data and test set data; obtaining an optimal novel coronavirus sequence classification model; and predicting the labels of novel coronavirus data with the optimal novel coronavirus sequence classification model.
Description
Technical Field
The invention relates to the technical field of novel coronavirus classification, in particular to a novel coronavirus classification method based on a deep learning algorithm.
Background
The novel coronavirus is an enveloped RNA virus with a linear single-stranded positive-sense genome; because the population lacks immunity to the new strain, it is generally susceptible. Given the long latency of the novel coronavirus, there is an urgent need to elucidate and analyze viral genome sequences in order to better understand the new virus and formulate therapeutic regimens in time. Existing methods find sequence similarity by performing similarity comparisons on sequence data. However, such sequence alignment requires gene annotation and a reference database, and analyzing data with alignment software is almost infeasible when thousands of genome sequences must be analyzed simultaneously. Traditional machine learning methods struggle to extract the nonlinear abstract features of a virus sequence: they can extract only low-level features, which mainly describe local information of the virus sequence and cannot describe the full characteristics of the viral genome sequence, and they lack computational efficiency and prediction accuracy in the setting of viral genome big data.
Disclosure of Invention
The invention provides a deep learning-based method for classifying novel coronaviruses. The method can effectively mine potential value when analyzing and processing large-scale virus sequence data, resolves the difficulty that comparative-genomics methods face in classifying novel coronaviruses, learns the nonlinear characteristics of virus sequences layer by layer, and extracts more comprehensive and representative genome data features. It overcomes the inability of traditional machine learning methods to extract high-level (abstract) features, effectively improves the classification performance of the classifier, and realizes deep mining of the internal nonlinear association mechanism of viral genome sequences. The method comprises the following steps:
step 1, acquiring a novel coronavirus data set, and downloading all available virus sequences and a COVID-19 sequence from a virus host database and a GISAID platform;
step 2, preprocessing a virus sequence data set to obtain a feature vector;
step 3, using three cascade automatic encoders to perform feature extraction to map the feature vectors into low-dimensional high-level features as the input of a model;
step 4, training an optimal novel coronavirus classification model according to the input of the model;
step 5, predicting the label of the novel coronavirus data by using the optimal novel coronavirus classification model.
Further, the step 2 preprocesses the virus sequence data set to obtain feature vectors, implemented as follows:
step 2.1, performing preliminary character-sequence coding on the virus sequence to obtain a digital sequence;
step 2.2, performing a fast Fourier transform on the digital sequence to obtain its amplitude;
step 2.3, constructing feature vectors from the amplitudes using the Mahalanobis distance.
further, step 2.1 performs character sequence preliminary coding on the virus sequence to obtain a digital sequence, and the implementation method is as follows:
the virus sequence downloaded from the database is a character sequence, represented by four base symbols of A, T, G and C, and is converted into a digital sequence recognized by the cascade automatic encoder, and the virus sequence set is assumed to be represented as D= { P 1 ,P 2 ,P 3 ,P 4 ,…,P n }, wherein P i E { A, T, G, C }, 1.ltoreq.i.ltoreq.n, for each character sequence P i The code character T/c=1, a/g= -1, and the initially coded numerical sequence is noted as: g= (D 1 ,D 2 ,D 3 …D n ) Wherein D is i Is the sequence P i And the discrete value of (2) represents that i is more than or equal to 1 and n is more than or equal to n.
Further, step 2.2 performs a fast Fourier transform on the digital sequence to obtain its amplitude. Taking the number of points N = 2^M, the implementation steps are as follows:
Step 2.2.1, split the digital signal D_i(n) of each virus sequence into two groups by the parity of n: x_1(r) = D_i(2r), x_2(r) = D_i(2r+1), 0 ≤ r ≤ N/2 − 1; the DFT is then expressed as F_i(k) = X_1(k) + W_N^k · X_2(k) and F_i(k + N/2) = X_1(k) − W_N^k · X_2(k), where W_N = e^(−j2π/N);
Step 2.2.2, split each N/2-point virus subsequence into two N/4-point subsequences: x_1(2s) = x_3(s), x_1(2s+1) = x_4(s); x_2(2s) = x_5(s), x_2(2s+1) = x_6(s), 0 ≤ s ≤ N/4 − 1; the DFT is expressed in the same butterfly form over the shorter subsequences;
Step 2.2.3, recursively repeat step 2.2.2 until, after M levels of decomposition, only 2-point DFT operations on the virus sequence remain; the fast Fourier transform modulus of D_i(n) is |F_i(k)|, denoted H_i(k), where 0 ≤ k ≤ n − 1; H_i(k) is the amplitude;
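The recursive decimation-in-time decomposition described above can be sketched in a few lines of pure Python (a minimal radix-2 Cooley-Tukey implementation; function names are illustrative, and a production system would use a library FFT):

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # x1(r) = D(2r)
    odd = fft(x[1::2])    # x2(r) = D(2r+1)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # W_N^k * X2(k)
        out[k] = even[k] + w            # F(k)       = X1(k) + W_N^k X2(k)
        out[k + n // 2] = even[k] - w   # F(k + N/2) = X1(k) - W_N^k X2(k)
    return out

def amplitude(digital_seq):
    """H(k) = |F(k)|: the magnitude spectrum used as the raw sequence feature."""
    return [abs(v) for v in fft([complex(d) for d in digital_seq])]
```

For instance, `amplitude([1, -1, 1, -1])` gives `[0.0, 0.0, 4.0, 0.0]`, the expected single spike at the alternating frequency.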
Further, step 2.3 constructs feature vectors from the amplitudes using the Mahalanobis distance, implemented as d(H_i, H_j) = sqrt((H_i − H_j)^T Σ^(−1) (H_i − H_j)), where H_i and H_j are the amplitudes of the i-th and j-th digital signals of the virus sequences and Σ is their covariance matrix. For the virus sequence labels, one-hot coding is adopted: after coding, each virus type corresponds to a label value L = [L_1, L_2, L_3, ..., L_k]; if a virus sequence belongs to the i-th virus class, only the i-th position of its label L is 1 and all other positions are 0, and the full label data L is a two-dimensional array.
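The one-hot labeling scheme can be sketched as follows (a minimal pure-Python illustration; names are hypothetical):

```python
def one_hot_label(class_index: int, num_classes: int) -> list:
    """One-hot label L: 1 at the class position, 0 everywhere else."""
    label = [0] * num_classes
    label[class_index] = 1
    return label

def label_matrix(class_indices, num_classes):
    """Stack per-sequence labels into the two-dimensional label array L."""
    return [one_hot_label(i, num_classes) for i in class_indices]
```

For example, with four virus classes, a sequence of class 2 gets the label `[0, 0, 1, 0]`.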
Further, the step 3 uses three cascaded automatic encoders to perform feature extraction, mapping the feature vectors into low-dimensional high-level features used as the input of the model. The implementation steps are as follows:
step 3.1, to learn more robust virus sequence features, randomly corrupting a small portion of each feature vector to obtain a sample; this counteracts irrelevant information introduced during the character-sequence encoding of the previous stage, so that the designed automatic encoder better captures the essential characteristics of the virus sequence;
step 3.2, using L1+L2 norms as penalty items for model improvement, avoiding the excessive fitting of an algorithm to input data, and constructing a loss function;
step 3.3, training a first-stage automatic encoder to obtain a first-layer low-dimensional feature vector;
step 3.4, taking the output of the first-stage automatic encoder as the input of the next-stage encoder and training the second-stage automatic encoder; repeating this process until the third-stage automatic encoder is trained, which yields the abstract high-level feature expression of the virus sequence;
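Steps 3.1 through 3.4 can be sketched with the two primitives below: the random masking corruption and a single sigmoid encoder layer, three of which are stacked in the cascade (pure-Python illustration under assumed names; the patent's actual layer sizes and activation are not specified here):

```python
import math
import random

def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))

def corrupt(x, frac=0.1, rng=random):
    """Step 3.1's masking noise: overwrite a small random fraction of entries with +/-1."""
    y = list(x)
    for i in range(len(y)):
        if rng.random() < frac:
            y[i] = rng.choice([1, -1])
    return y

def encode(x, W, b):
    """One encoder layer y = s(Wx + b); three such layers in cascade give the
    stacked low-dimensional abstract features of steps 3.3-3.4."""
    return [sigmoid(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]
```

A full training loop would alternate `encode` with a mirrored decoder and gradient updates, layer by layer, as step 3.4 describes.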
further, the step 3.2 uses the l1+l2 norm as a penalty term for model improvement, avoids the algorithm from overfitting the input data, and constructs a loss function, which is implemented as follows:
x (i) as the original feature vector of the input, x 1 (i) And after reconstruction, the feature vector w is a weight, and lambda and rho are used for adjusting the weight of the penalty term.
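A minimal sketch of such an L1+L2-penalized reconstruction loss (the exact normalization in the patent may differ; names are illustrative):

```python
def ae_loss(x, x_rec, weights, lam=1e-4, rho=1e-4):
    """Squared reconstruction error plus L1 and L2 weight penalties,
    mirroring E = ||x - x'||^2 + lam*|W|_1 + rho*|W|_2^2."""
    err = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    l1 = sum(abs(w) for row in weights for w in row)
    l2 = sum(w * w for row in weights for w in row)
    return err + lam * l1 + rho * l2
```

Setting both penalty coefficients to zero recovers the plain reconstruction error, which makes the effect of λ and ρ easy to probe empirically.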
Further, the step 3.3 trains the first-stage automatic encoder to obtain the first-layer low-dimensional feature vectors, implemented as follows:
Suppose the n samples of virus digital sequences are X = {x_1, x_2, x_3, x_4, ..., x_n} = {x_i | 1 ≤ i ≤ n}, where each sample x_i ∈ R^M is an M-dimensional feature vector. A small fraction of the entries of each sample is randomly set to 1 or −1, giving the corrupted vector X̃ = {x̃_i | 1 ≤ i ≤ n}, which is input to the first-stage encoder. The encoding function f_θ performs the first-layer encoding of each virus sequence sample, giving the feature vector Y_1 = f_θ^(1)(X̃) = s(WX̃ + b); the decoder reconstructs the input vector as Z_1 = g_θ′^(1)(Y_1) = s(W′Y_1 + b′), where {W, b} are the encoding parameters and {W′, b′} are the decoding parameters. The encoding and decoding parameters are obtained by minimizing the loss function argmin_{θ,θ′} E with a gradient descent algorithm: W ← W − l·∂E/∂W, b ← b − l·∂E/∂b, where l is the learning rate; W′ and b′ are computed by the same method.
Further, the abstract high-level feature expression of the virus sequence output by the third-stage automatic encoder in step 3.4 is implemented as follows:
The final features of the virus sequences can be expressed as x′(i) = f_θ^(3)(f_θ^(2)(f_θ^(1)(x(i)))), where x′(i) is the feature vector output by the third-stage automatic encoder and f_θ is the encoding function.
Further, the step 4 trains an optimal novel coronavirus sequence classification model according to the input of the model, and comprises the following steps:
step 4.1, dividing training set data into K parts, wherein one part is used as a verification set, and the rest K-1 parts are used as training sets;
step 4.2, obtaining optimal super parameters by using Bayes optimization;
step 4.3, training by using a deep learning network according to the optimal super-parameters to obtain an optimized novel coronavirus classification model;
step 4.4, selecting one of the folds that has not yet served as the verification set as the new verification set, taking the remaining data as the training set, and repeating steps 4.2 and 4.3; once every fold has served as the verification set, computing the precision of each of the K optimized novel coronavirus classification models on the test set, and taking the model with the highest precision as the optimal novel coronavirus classification model.
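The K-fold rotation of steps 4.1 and 4.4 can be sketched as an index-partitioning helper (a minimal illustration; names are hypothetical, and the patent does not specify the fold-assignment rule):

```python
def kfold_indices(n_samples: int, k: int):
    """Split sample indices into k folds; each fold serves once as the
    verification set while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for v in range(k):
        val = folds[v]
        train = [i for f in range(k) if f != v for i in folds[f]]
        splits.append((train, val))
    return splits
```

Each of the k `(train, val)` pairs would then drive one Bayesian-tuned training run, and the k resulting models are compared on the test set.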
The invention provides a novel coronavirus classification method based on a deep learning algorithm, which can effectively improve classification accuracy and solve the problem of low classification accuracy of the novel coronavirus.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a block diagram of the feature extraction of a viral sequence according to the present invention;
FIG. 2 is a flowchart of an algorithm of an automatic encoder for viral sequences according to the present invention;
FIG. 3 is a flow chart of the optimization of the epoch parameters of the deep learning model of the present invention;
FIG. 4 is a flowchart of the optimization of the remaining parameters of the deep learning model of the present invention;
FIG. 5 is an overall flow chart of the present invention;
FIG. 6 is a diagram of a classification structure of a currently available viral sequence in a database of the present invention;
FIG. 7 is a diagram showing a structure of a deep learning model prediction virus sequence classification according to the present invention;
FIG. 8 is a specific structural design of the deep learning model network of the present invention;
FIG. 9 is a block diagram of hierarchical modeling classification prediction for a COVID-19 sequence in accordance with the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 5 is an overall flowchart of the present invention, and the present invention provides a novel coronavirus classification method based on a deep learning algorithm, which is characterized by comprising the following steps:
(1) Acquiring existing available viral sequences and a new coronavirus dataset:
(1a) All available virus sequences and COVID-19 sequences are downloaded from the virus-host database and the GISAID platform;
(2) Preprocessing a viral sequence dataset to obtain feature vectors
(2a) Carrying out character sequence coding on the virus sequence to obtain a digital sequence;
(2b) Performing fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence;
(2c) Constructing a feature vector by utilizing a mahalanobis distance according to the amplitude value;
(3) Feature extraction using three concatenated automatic encoders to map the feature vectors into low-dimensional, high-level features as input to a model
(4) Training an optimal novel coronavirus sequence classification model according to the input of the model
(4a) Dividing training set data into K parts, wherein one part is used as a verification set, and the other K-1 parts are used as training sets;
(4b) Obtaining optimal super parameters by using Bayesian optimization;
(4c) Training by using a deep learning network according to the optimal super parameters to obtain an optimized novel coronavirus classification model;
(4d) Selecting one of the data sets which are not divided into verification sets as the verification set, taking the rest data as the training set, repeating the step (4 a), and if all the data are divided into the verification sets, calculating the precision of the K-time optimized novel coronavirus classification model on the test set, wherein the model with the highest precision is used as the optimal novel coronavirus classification model;
(5) Predicting a signature of the new coronavirus data using the optimal new coronavirus classification model;
Step (2a) performs preliminary character-sequence coding on the virus sequence to obtain a digital sequence, implemented as follows:
The virus sequence downloaded from the database is a character sequence represented by the four base symbols A, T, G and C, and must be converted into a digital sequence that the cascaded automatic encoder can process. Suppose the virus sequence set is D = {P_1, P_2, P_3, P_4, ..., P_n}, where P_i ∈ {A, T, G, C}, 1 ≤ i ≤ n. Each character P_i is encoded as T/C = 1 and A/G = −1, and the preliminarily encoded digital sequence is written G = (D_1, D_2, D_3, ..., D_n), where D_i is the discrete-value representation of P_i, 1 ≤ i ≤ n;
Step (2b) performs a fast Fourier transform on the digital sequence to obtain its amplitude. The implementation principle is as follows:
For each digital signal D_i, the amplitude is solved using the fast Fourier transform (FFT). The fast Fourier transform modulus of D_i is |F_i(k)|, denoted H_i(k), where 0 ≤ k ≤ n − 1. Taking the number of points N = 2^M for the virus sequence digital signal D_i, where N is the length of the virus digital sequence, a radix-2 decimation-in-time FFT decomposes the DFT down to 2-point DFT operations, forming an M-level butterfly computation from D_i(n) to F_i(k). The specific implementation steps are:
Step 1, split the digital signal D_i(n) of each virus sequence into two groups by the parity of n: x_1(r) = D_i(2r), x_2(r) = D_i(2r+1), 0 ≤ r ≤ N/2 − 1; the DFT is then expressed as F_i(k) = X_1(k) + W_N^k · X_2(k) and F_i(k + N/2) = X_1(k) − W_N^k · X_2(k), where W_N = e^(−j2π/N);
Step 2, split each N/2-point virus subsequence into two N/4-point subsequences: x_1(2s) = x_3(s), x_1(2s+1) = x_4(s); x_2(2s) = x_5(s), x_2(2s+1) = x_6(s), 0 ≤ s ≤ N/4 − 1; the DFT is expressed in the same butterfly form over the shorter subsequences;
Step 3, recursively repeat step 2 a further M − 2 times until, after M levels of decomposition, only 2-point DFT operations on the virus sequence remain; the fast Fourier transform modulus of D_i(n) is |F_i(k)|, denoted H_i(k), where 0 ≤ k ≤ n − 1; H_i(k) is the amplitude. Thus, through the M-level operation, the amplitude spectrum representation F_i(k) of the virus sequence digital signal D_i(n) is obtained;
Step (2c) constructs feature vectors from the amplitudes using the Mahalanobis distance, implemented as d(H_i, H_j) = sqrt((H_i − H_j)^T Σ^(−1) (H_i − H_j)), where H_i and H_j are the amplitudes of the i-th and j-th digital signals of the virus sequences and Σ is their covariance matrix;
Step (3) performs feature extraction with three cascaded automatic encoders, mapping the feature vectors into low-dimensional high-level features used as the input of the model, implemented as follows:
(3a) To learn more robust virus sequence features, randomly corrupting a small portion of each feature vector; this counteracts irrelevant information introduced during the character-sequence encoding of the previous stage, so that the designed automatic encoder better captures the essential characteristics of the virus sequence;
(3b) The L1+L2 norms are used as penalty items for model improvement, so that the excessive fitting of an algorithm to input data is avoided, and a loss function is constructed;
(3c) Training a first-stage automatic encoder to obtain a first-layer low-dimensional feature vector;
(3d) Taking the output of the first-stage automatic encoder as the input of the next-stage encoder, continuously completing the training of the second-stage automatic encoder, and repeating the step until the training of the third-stage automatic encoder is completed, so as to obtain the high-level feature expression of virus sequence abstraction;
(3e) The loss function of step (3b) is constructed as E = (1/n) Σ_{i=1}^{n} ||x(i) − x′(i)||² + λ Σ|w| + ρ Σ w², where x(i) is the original input feature vector, x′(i) is the reconstructed feature vector, w ranges over the weights, and λ and ρ adjust the weights of the penalty terms;
Step (3c) trains the first-stage automatic encoder to obtain the first-layer low-dimensional feature vectors, implemented as follows:
Suppose the n samples of virus digital sequences are X = {x_1, x_2, x_3, x_4, ..., x_n} = {x_i | 1 ≤ i ≤ n}, where each sample x_i ∈ R^M is an M-dimensional feature vector. A small fraction of the entries of each sample is randomly set to 1 or −1, giving the corrupted vector X̃ = {x̃_i | 1 ≤ i ≤ n}, which is input to the first-stage encoder. The encoding function f_θ performs the first-layer encoding of each virus sequence sample, giving the feature vector Y_1 = f_θ^(1)(X̃) = s(WX̃ + b); the decoder reconstructs the input vector as Z_1 = g_θ′^(1)(Y_1) = s(W′Y_1 + b′), where {W, b} are the encoding parameters and {W′, b′} are the decoding parameters. The encoding and decoding parameters are obtained by minimizing the loss function argmin_{θ,θ′} E with a gradient descent algorithm: W ← W − l·∂E/∂W, b ← b − l·∂E/∂b, where l is the learning rate; W′ and b′ are computed by the same method;
The abstract high-level feature expression of the virus sequence output by the third-stage automatic encoder in step (3d) is implemented as follows:
The final features of the virus sequences can be expressed as x′(i) = f_θ^(3)(f_θ^(2)(f_θ^(1)(x(i)))), where x′(i) is the feature vector output by the third-stage automatic encoder and f_θ is the encoding function;
Specifically, step (3) may be summarized as follows:
Suppose the n samples of virus digital sequences are X = {x_1, x_2, x_3, x_4, ..., x_n} = {x_i | 1 ≤ i ≤ n}, where each sample x_i ∈ R^M is an M-dimensional feature vector. A small fraction of the entries of each sample is randomly set to 1 or −1, and the corrupted vector X̃ = {x̃_i | 1 ≤ i ≤ n} is input to the first-stage encoder. The encoding function f_θ performs the first-layer encoding of each virus sequence sample, giving the feature vector
Y_1 = f_θ^(1)(X̃) = s(WX̃ + b) (3-1)
and the decoder reconstructs the input vector as
Z_1 = g_θ′^(1)(Y_1) = s(W′Y_1 + b′) (3-2)
where {W, b} are the encoding parameters and {W′, b′} are the decoding parameters. A gradient descent algorithm minimizes the loss function argmin_{θ,θ′} E to obtain the encoding and decoding parameters, with update formulas (3-3) and (3-4):
W ← W − l·∂E/∂W (3-3)
b ← b − l·∂E/∂b (3-4)
where l is the learning rate; W′ and b′ are computed by the same method. After the first-layer network is trained, only the encoder part of the first stage is retained, and the abstract low-dimensional feature vector it outputs is taken as the input of the next-stage encoder. The second-layer network is trained in the same way, and so on until the third-stage automatic encoder is trained; its output is the abstract feature expression of the virus DNA sequence. Through three levels of gradual abstraction of the bottom-level features of the virus sequence, the abstract high-level feature expression is finally obtained. The virus sequence feature extraction block diagram is shown in FIG. 1, and the algorithm flow chart of the virus sequence automatic encoder is shown in FIG. 2.
Step (4) can be summarized as:
First, the feature-extracted virus digital sequence passes through a convolution layer for dimension reduction. As the convolution kernel works, it regularly scans the input features, performs element-wise multiplication and summation against them, and superimposes a bias, so the extracted features keep the inherent topology of the input. The convolution is formula (4-1):
y_ij^(l) = f((x^(l−1) * k^(l))_ij + b_j^(l)) (4-1)
where f(·) is an activation function, * is the convolution operation, y_ij^(l) is the element in row i and column j of the layer-l feature map, x^(0) represents the input virus digital sequence, k^(l) is a convolution kernel, and b_j^(l) is a bias term.
After feature extraction by the convolution layer, the output features are passed to the pooling layer for feature selection and information filtering. Maximum pooling is adopted, with the pooling formula (4-2):
p_ij^(l) = max_{(a,b) ∈ u(a,a)} y_ab^(l) (4-2)
where p_ij^(l) is the feature value in row i and column j of the layer-l max-pooling feature map, u(a, a) is a window function, and N is the size of the window.
A basic residual module containing two convolution layers is used in place of stacked convolution layers, alleviating the training difficulty caused by network depth, and a pooling layer is connected after each basic residual module to filter feature information and, to some extent, prevent overfitting. Each of the three residual modules in this step is defined as

y_{l+1} = f( y_l + F(y_l, w_l) )

where y_l and y_{l+1} are the input and output of the layer-l residual module, F(·) is the residual function, w_l are the parameters of the residual block, and f(·) is the activation function; F(y_l, w_l) is the residual map to be learned. A dropout layer is connected after the second max-pooling layer: during forward propagation the activation of a neuron stops working with probability p, avoiding overfitting. The process is computed as (4-3) and (4-4):

a_i^(4) = W_i^(4) ( C^(3) ⊙ k^(3) ) + b_i^(4)   (4-3)

k_i^(4) = f( a_i^(4) )   (4-4)

where W is a weight, b is a bias, k^(3) is the layer-3 output, k^(4) is the layer-4 output, and C^(3) denotes a randomly generated 0/1 vector produced by the Bernoulli function, each random variable obeying a Bernoulli distribution with parameter p. The best value of the dropout ratio p is determined by Bayesian hyperparameter tuning.
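The residual connection and the dropout masking of (4-3)/(4-4) can be sketched as follows. This is a hypothetical minimal version: the residual function F, the ReLU activation, and all numeric values are stand-ins, not the patent's trained layers:

```python
import random

def residual_block(y, F, f=lambda v: [max(0.0, e) for e in v]):
    """y_{l+1} = f(y_l + F(y_l)): add the learned residual map back onto
    the input, then apply the activation f (ReLU here)."""
    r = F(y)
    return f([a + b for a, b in zip(y, r)])

def dropout(k, p, training=True):
    """Eqs. (4-3)/(4-4): multiply activations by a Bernoulli 0/1 mask C,
    silencing each neuron with probability p during training."""
    if not training:
        return list(k)
    mask = [0.0 if random.random() < p else 1.0 for _ in k]
    return [c * v for c, v in zip(mask, k)]

h = residual_block([1.0, -2.0, 3.0], F=lambda y: [0.1 * v for v in y])
d = dropout(h, p=0.5)
```

Note that at inference time (`training=False`) dropout is a no-op; only training passes apply the random mask.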
After global average pooling, the extracted features are nonlinearly combined by a fully connected layer to obtain the output, expressed as (4-5):

y_n = W_{N×C} ⊙ G_c + b   (4-5)

where ⊙ denotes matrix multiplication, W_{N×C} is the weight, b is the bias, G_c is the feature vector of the virus sequence after global average pooling, and N is the number of virus-sequence classes at each level of the hierarchical modeling.
The numbers of neurons in the fully connected layer when the virus sequences are hierarchically modeled are 13, 12, 4, and 4, i.e., the total number of virus-sequence categories at the corresponding level. The patent trains the model with the cross-entropy function as the loss, given by (4-6):

L = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log(p_ik)   (4-6)

where y_ik is the true label of sample i for the k-th virus-sequence class in the label set, p_ik is the model's predicted probability for that class, N is the total number of virus-sequence samples, and K is the total number of classes. With this loss function, the model updates its parameters by gradient descent: each layer first computes its output by feedforward, the error is then back-propagated, and the model descends along the error-gradient direction until the optimal solution is found.
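The softmax-plus-cross-entropy computation of (4-6) can be sketched directly. A minimal example with made-up logits and two toy samples; the class counts here are illustrative, not the 13/12/4/4 of the actual model:

```python
import math

def softmax(logits):
    """Convert raw scores to class probabilities p_ik (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_true, p_pred):
    """Eq. (4-6): L = -(1/N) sum_i sum_k y_ik * log(p_ik)
    over N samples and K classes, with one-hot y_true."""
    n = len(y_true)
    return -sum(y * math.log(p)
                for row_y, row_p in zip(y_true, p_pred)
                for y, p in zip(row_y, row_p)) / n

labels = [[0, 1, 0], [1, 0, 0]]                     # one-hot virus-class labels
probs = [softmax([0.2, 2.0, -1.0]), softmax([1.5, 0.1, 0.1])]
loss = cross_entropy(labels, probs)
```

Since each label row is one-hot, only the log-probability of the true class contributes to the sum, which is why minimizing this loss pushes p_ik toward 1 for the correct class.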
The deep learning model design is shown in fig. 7, and the network specific structure design is shown in fig. 8.
When the model is tuned, the number of epochs is determined by early stopping: if the model's val-loss does not decrease for 30 consecutive evaluation rounds, training stops and that epoch value is output as the model's optimal setting; the epoch optimization scheme is shown in FIG. 3. The optimization schemes for parameters other than epoch are shown in FIG. 4. Three-fold cross validation is used to evaluate the model's predictive performance under different parameter choices, and the mean of the AUC values over the three folds serves as the evaluation index defining the objective function, ensuring the reliability of the model's parameter selection. The tuning range of the network parameters can be shown in the following table:
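The early-stopping rule described above (stop when val-loss fails to improve for 30 rounds, report the best epoch) can be sketched as a simple loop. The synthetic loss curve below is made up purely to exercise the logic:

```python
def train_with_early_stopping(val_losses, patience=30):
    """Scan per-epoch validation losses; stop once `patience` consecutive
    epochs bring no improvement, and return the best epoch found."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch

# Synthetic curve: improves for 50 epochs, then plateaus at a worse value.
losses = [1.0 - 0.01 * e for e in range(50)] + [0.6] * 40
best_epoch = train_with_early_stopping(losses, patience=30)
```

With this curve, training halts 30 epochs into the plateau and epoch 50 is reported as the optimal setting.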
The predictive capability of the deep learning model under different parameter choices is compared by K-fold cross validation. The sample data are shuffled before model training to eliminate bias the sample ordering might introduce. To ensure that the entire data set is used for training, the training data are divided into K parts; each time a different K−1 parts are selected for training and the remaining 1 part for testing, repeated K times, and the average of the K model-evaluation indices is compared to select the optimal model. Accuracy (acc) and precision are used for evaluation; the closer these values are to 1, the better the effect.
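The shuffle-and-split procedure can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for training and scoring the deep model on one fold:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices to remove ordering bias, then deal them
    into k folds; each fold serves once as the held-out set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k, evaluate):
    """Train on k-1 folds, test on the remaining one, k times;
    return the mean of the k evaluation scores."""
    folds = k_fold_indices(n, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(evaluate(train, test))
    return sum(scores) / k

# Stand-in metric: just reports the training fraction, to show the plumbing.
mean_acc = cross_validate(30, 3, evaluate=lambda tr, te: len(tr) / 30)
```

With K = 3 (the three-fold setting used here), each run trains on two thirds of the data and validates on the remaining third.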
When classifying the COVID-19 sequence, since each virus sequence in the training data carries a set of taxonomic names, the classification structure of the virus sequences available in the database is shown, from high rank to low, in FIG. 6, and the search proceeds from high rank to low according to the taxonomic levels. The first range is 11 virus families plus the Riboviria realm; classifying the COVID-19 sequence with the optimized deep learning model determines that it belongs to Riboviria. The second range is the 12 families under Riboviria; using the optimized deep learning model to classify the COVID-19 sequence identifies it as Coronaviridae. The third range is the four genera under Coronaviridae; classification prediction with the optimized deep learning model determines the COVID-19 sequence to belong to the genus Betacoronavirus. The fourth range is the four subgenera under Betacoronavirus; classification prediction with the optimized deep learning model determines it to be lineage B (Sarbecovirus). Thus the classification label of the COVID-19 sequence is finally determined to be lineage B (Sarbecovirus) of the genus Betacoronavirus, and the hierarchical classification prediction structure for the COVID-19 sequence is shown in FIG. 9.
Let the selected optimal model be denoted f(x); for the COVID-19 sequence x, this model outputs its class label: COVID-19_label = f(x).
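The level-by-level search above amounts to a cascade of classifiers, one per taxonomic rank. A minimal sketch in plain Python; the lambda "models" are hypothetical stand-ins that simply return the labels from this example, whereas the real system uses the trained deep networks at each level:

```python
def classify_hierarchically(sequence, level_models):
    """Walk the taxonomy top-down: apply one classifier per rank and
    collect the predicted label at each level into a path."""
    path = []
    for model in level_models:
        path.append(model(sequence))
    return path

# Stand-in classifiers for realm -> family -> genus -> subgenus.
level_models = [
    lambda s: "Riboviria",
    lambda s: "Coronaviridae",
    lambda s: "Betacoronavirus",
    lambda s: "Sarbecovirus",
]
covid19_label = classify_hierarchically("ATGC...", level_models)
```

In the full system the label predicted at one rank restricts which classes (and hence which trained model) are considered at the next, narrower rank.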
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.
Claims (6)
1. The novel coronavirus classification method based on the deep learning algorithm is characterized by comprising the following steps of:
step 1, acquiring a novel coronavirus data set, wherein the data set is all available virus sequences and COVID-19 sequences downloaded from a virus host database and a GISAID platform;
step 2, preprocessing a virus sequence data set to obtain a feature vector;
step 3, using three cascade automatic encoders to perform feature extraction to map the feature vectors into low-dimensional high-level features as the input of a model;
the step 3 uses three cascade automatic encoders to perform feature extraction to map the feature vectors into low-dimensional high-level features, and uses the low-dimensional high-level features as the input of a model, and the implementation steps are as follows:
step 3.1, randomly destroying one tenth of data of the feature vector to obtain a sample for obtaining virus sequence features;
step 3.2, using the L1+L2 norms as penalty terms for model improvement, avoiding overfitting of the algorithm to the input data, and constructing a loss function;
the step 3.2 constructs a loss function, which is implemented as:
the loss function is

L(θ) = (1/n) Σ_{i=1}^{n} ||x^(i) − x̂^(i)||² + λ||w||₁ + ρ||w||₂²

where x^(i) is the original input feature vector, x̂^(i) is the reconstructed feature vector, w is a weight, and λ and ρ adjust the weights of the penalty terms;
step 3.3, training a first-stage automatic encoder to obtain a first-layer low-dimensional feature vector;
step 3.3 trains the first-stage automatic encoder to obtain a first-layer low-dimensional feature vector, which is realized by:
let the n samples of the viral digital sequence be X = {x_1, x_2, x_3, x_4, …, x_n} = {x_i | 1 ≤ i ≤ n}, each sample x_i ∈ R^M being an M-dimensional feature vector; for each sample, one tenth of the entries are randomly selected and set to 1 or −1, and the corrupted vector is further denoted X̃; X̃ is input to the first-stage encoder, whose encoding function Y_1 = f_θ^(1)(X̃) = s(W X̃ + b) gives the first-layer encoding of each virus-sequence sample as the feature vector Y_1; the decoder reconstructs the input vector as Z_1 = g_θ′^(1)(Y_1) = s(W′Y_1 + b′), where {W, b} are the encoding parameters and {W′, b′} are the decoding parameters; the encoding and decoding parameters are obtained by minimizing the loss function argmin_{θ,θ′} E with a gradient descent algorithm:

W ← W − l · ∂E/∂W,  b ← b − l · ∂E/∂b

where l is the learning rate, and W′ and b′ are computed in the same way;
step 3.4, taking the output of the first-stage automatic encoder as the input of the next-stage encoder, continuously completing the training of the second-stage automatic encoder, and repeating the step until the training of the third-stage automatic encoder is completed, so as to obtain the high-level characteristic expression of virus sequence abstraction;
the abstract high-level characteristic expression of the virus sequence output by the third-level automatic encoder in the step 3.4 is realized as follows:
the final characteristics of the obtained viral sequences are expressed as: representing the feature vector, f, of the output of the third-stage automatic encoder θ Is a coding function;
step 4, training an optimal novel coronavirus sequence classification model according to the input of the model;
step 5, predicting the label of the novel coronavirus data by using the optimal novel coronavirus sequence classification model.
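The corruption, penalized loss, and three-stage cascade of claim 1 can be sketched in plain Python. This is a deliberately simplified toy: the scalar weight `w`, the sigmoid choice, and the λ/ρ values are hypothetical stand-ins (a real stage uses full weight matrices learned by gradient descent):

```python
import math
import random

def corrupt(x, fraction=0.1, rng=random.Random(0)):
    """Step 3.1: randomly destroy one tenth of each sample's entries
    (set them to +1/-1) so the encoder learns noise-robust features."""
    x = list(x)
    for i in rng.sample(range(len(x)), max(1, int(fraction * len(x)))):
        x[i] = rng.choice([1.0, -1.0])
    return x

def penalized_loss(x, x_hat, w, lam=1e-3, rho=1e-3):
    """Step 3.2: reconstruction error plus L1 and L2 norms of the weights."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in w)
    l2 = sum(v * v for v in w)
    return mse + lam * l1 + rho * l2

def encode(x, w, b):
    """One encoding stage y = s(w*x + b), element-wise sigmoid s."""
    return [1.0 / (1.0 + math.exp(-(w * v + b))) for v in x]

sample = [1.0, -1.0] * 10          # toy encoded viral sequence sample
noisy = corrupt(sample)            # step 3.1
y1 = encode(noisy, w=0.5, b=0.0)   # step 3.3: first-stage features
y2 = encode(y1, w=0.5, b=0.0)      # step 3.4: second stage takes y1 as input
y3 = encode(y2, w=0.5, b=0.0)      # third stage: abstract high-level features
loss = penalized_loss(sample, noisy, w=[0.5])
```

The cascade in step 3.4 is just function composition: each stage's output becomes the next stage's input, ending in the low-dimensional high-level feature vector fed to the classifier.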
2. The method for classifying coronaviruses based on the deep learning algorithm according to claim 1, wherein the step 2 of preprocessing the viral sequence data set to obtain feature vectors comprises the following implementation steps:
step 2.1, carrying out character sequence coding on the virus sequence to obtain a digital sequence;
step 2.2, performing fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence;
and 2.3, constructing a feature vector by using the Mahalanobis distance according to the amplitude.
3. The method for classifying coronaviruses based on deep learning algorithm according to claim 1, wherein said step 4 is to train an optimal novel coronavirus sequence classification model according to the input of said model, and comprises the steps of:
step 4.1, dividing training set data into K parts, wherein one part is used as a verification set, and the rest K-1 parts are used as training sets;
step 4.2, obtaining optimal super parameters by using Bayes optimization;
step 4.3, training by using a deep learning network according to the optimal super-parameters to obtain an optimized novel coronavirus classification model;
and 4.4, judging whether all data are divided into verification sets, if so, calculating the precision of the K-time optimized novel coronavirus classification model on the test set, taking the model with the highest precision as the optimal novel coronavirus classification model, and if not, re-selecting one of the training set data which is not divided into the verification sets as the verification set, taking the rest K-1 as the training set, and repeating the steps 4.2 and 4.3.
4. The novel coronavirus classification method based on the deep learning algorithm as claimed in claim 2, wherein the step 2.1 is to perform initial encoding of the virus sequence to obtain a digital sequence, and the implementation method is as follows:
the virus sequence downloaded from the database is a character sequence, represented by the four base symbols A, T, G and C, and is converted into a digital sequence recognizable by the cascade automatic encoders; assume the virus sequence set is denoted D = {P_1, P_2, P_3, P_4, …, P_n}, where each character of P_i ∈ {A, T, G, C}, 1 ≤ i ≤ n; each character sequence P_i is encoded as T/C = 1, A/G = −1, and the preliminarily encoded numerical sequences are recorded as G = (D_1, D_2, D_3, …, D_n), where D_i is the discrete-value representation of sequence P_i, 1 ≤ i ≤ n.
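The character-to-number mapping of claim 4 is a one-line lookup. A minimal sketch (the example sequence is made up):

```python
def encode_sequence(seq):
    """Map each base to a discrete value: T/C -> 1, A/G -> -1 (claim 4)."""
    table = {"T": 1, "C": 1, "A": -1, "G": -1}
    return [table[ch] for ch in seq]

digits = encode_sequence("ATGCCT")
```

This purine/pyrimidine-style split yields a ±1 signal that the subsequent fast Fourier transform of claim 5 can treat as a discrete waveform.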
5. The method for classifying coronaviruses based on deep learning algorithm according to claim 2, wherein step 2.2 comprises the steps of:
step 2.2.1, for the digital signal D_i(n) of each viral sequence, split it into two groups by the parity of the index n: D_i(2r) = x_1(r), D_i(2r+1) = x_2(r), r = 0, 1, …, N/2 − 1; the DFT is expressed as F_i(k) = X_1(k) + W_N^k X_2(k), where W_N = e^{−j2π/N} and N is the sequence length;

step 2.2.2, split each N/2-point viral subsequence into two N/4-point subsequences: x_1(2s) = x_3(s), x_1(2s+1) = x_4(s); x_2(2s) = x_5(s), x_2(2s+1) = x_6(s), s = 0, 1, …, N/4 − 1; their DFTs are expressed in the same butterfly form;

step 2.2.3, recursively repeating step 2.2.2 M times yields 2-point DFT operations; after M decompositions, the fast Fourier transform of D_i(n) is expressed as |F_i(k)|, recorded as H_i(k), where 0 ≤ k ≤ n − 1, and H_i(k) is the amplitude.
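The parity split and recursive recombination of steps 2.2.1–2.2.3 are the standard radix-2 decimation-in-time FFT. A compact recursive sketch using only the standard library (the input signal is a made-up ±1 sequence of length 2^M):

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT: split by index parity, recurse to
    trivial DFTs, then combine with twiddle factors W_N^k = e^{-j2*pi*k/N}."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # D_i(2r)   = x1(r)
    odd = fft(x[1::2])    # D_i(2r+1) = x2(r)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w            # F(k)       = X1(k) + W^k X2(k)
        out[k + n // 2] = even[k] - w   # F(k + N/2) = X1(k) - W^k X2(k)
    return out

signal = [1, -1, 1, 1, -1, -1, 1, -1]        # encoded viral sequence, length 2^M
amplitude = [abs(v) for v in fft(signal)]    # H_i(k) = |F_i(k)|
```

The magnitude list `amplitude` corresponds to H_i(k); note the recursion requires the sequence length to be a power of two, so shorter sequences would need padding.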
6. The method according to claim 2, wherein the step 2.3 of constructing feature vectors from the amplitudes using the Mahalanobis distance is implemented as follows:
the Mahalanobis distance between amplitude vectors is d(H_i, H_j) = sqrt( (H_i − H_j)^T S^{−1} (H_i − H_j) ), where H_i and H_j are the amplitudes of the i-th and j-th digital signals of the viral sequences and S is the covariance matrix; one-hot encoding is adopted for the virus-sequence labels, and after encoding each virus type corresponds to a label value L = [L_1, L_2, L_3, …, L_k], L_i ∈ {0, 1}, i = {1, 2, …, k}; specifically, if a virus sequence belongs to the i-th virus, only the i-th position of its label L is 1 and the remaining positions are 0, and all label data L form a two-dimensional array.
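The distance and label encoding of claim 6 can be sketched as follows. For brevity the sketch assumes a diagonal covariance matrix (a hypothetical simplification; the general Mahalanobis form uses the full inverse covariance), and all numeric values are illustrative:

```python
def one_hot(virus_index, k):
    """Label encoding from claim 6: position i is 1 for the i-th virus class,
    all other positions are 0."""
    return [1 if j == virus_index else 0 for j in range(k)]

def mahalanobis_diag(h_i, h_j, variances):
    """Mahalanobis distance between two amplitude vectors H_i, H_j under a
    diagonal covariance: sqrt(sum (a-b)^2 / var)."""
    return sum((a - b) ** 2 / v for a, b, v in zip(h_i, h_j, variances)) ** 0.5

label = one_hot(2, 4)                                   # third of four classes
dist = mahalanobis_diag([1.0, 2.0], [2.0, 4.0], variances=[1.0, 4.0])
```

Dividing by the per-dimension variance means high-variance frequency bins contribute less to the distance than they would under plain Euclidean distance.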
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110045563.2A CN112735604B (en) | 2021-01-13 | 2021-01-13 | Novel coronavirus classification method based on deep learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112735604A CN112735604A (en) | 2021-04-30 |
CN112735604B true CN112735604B (en) | 2024-03-26 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113252640B (en) * | 2021-06-03 | 2021-12-14 | 季华实验室 | Rapid virus screening and detecting method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107665248A (en) * | 2017-09-22 | 2018-02-06 | 齐鲁工业大学 | File classification method and device based on deep learning mixed model |
CN108171232A (en) * | 2017-11-15 | 2018-06-15 | 中山大学 | The sorting technique of bacillary and viral children Streptococcus based on deep learning algorithm |
CN111785328A (en) * | 2020-06-12 | 2020-10-16 | 中国人民解放军军事科学院军事医学研究院 | Coronavirus sequence identification method based on gated cyclic unit neural network |
CN111951975A (en) * | 2020-08-19 | 2020-11-17 | 哈尔滨工业大学 | Sepsis early warning method based on deep learning model GPT-2 |
AU2020102631A4 (en) * | 2020-10-07 | 2020-11-26 | A, Anbuchezian Dr | The Severity Level and Early Prediction of Covid-19 Using CEDCNN Classifier |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140058987A1 (en) * | 2012-08-27 | 2014-02-27 | Almon David Ing | MoRPE: a machine learning method for probabilistic classification based on monotonic regression of a polynomial expansion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||