CN112735604B - Novel coronavirus classification method based on deep learning algorithm - Google Patents

Novel coronavirus classification method based on deep learning algorithm

Info

Publication number
CN112735604B
Authority
CN
China
Prior art keywords
sequence
virus
novel coronavirus
data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110045563.2A
Other languages
Chinese (zh)
Other versions
CN112735604A (en
Inventor
马宝山
张树正
张新宇
高宗江
柴冰洁
侯晓宇
熊桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University
Priority to CN202110045563.2A
Publication of CN112735604A
Application granted
Publication of CN112735604B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141 Discrete Fourier transforms
    • G06F17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Discrete Mathematics (AREA)
  • Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Algebra (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a novel coronavirus classification method based on a deep learning algorithm, which addresses the low classification accuracy of the prior art. The method comprises the following steps: obtaining the currently available virus sequences and a novel coronavirus data set; preprocessing the virus sequence data set; performing feature extraction on the preprocessed high-dimensional, redundant virus sequence features with three cascaded autoencoders to reduce the data dimension and obtain nonlinear abstract features of the virus sequences; obtaining training set data and test set data; obtaining an optimal novel coronavirus sequence classification model; and predicting the label of the novel coronavirus data with the optimal novel coronavirus sequence classification model.

Description

Novel coronavirus classification method based on deep learning algorithm
Technical Field
The invention relates to the technical field of novel coronavirus classification, in particular to a novel coronavirus classification method based on a deep learning algorithm.
Background
The novel coronavirus is an enveloped RNA virus with a linear, positive-sense, single-stranded genome. The population is generally susceptible because it lacks immunity to the new virus strain. Because the novel coronavirus has a long incubation period, there is an urgent need to characterize and analyze its genomic sequence in order to better understand the virus and to formulate treatment regimens in time. Existing methods establish sequence similarity by comparing sequence data. However, such sequence alignment depends on gene annotation and on a reference database, and analyzing the data with alignment software becomes almost impossible when thousands of genomic sequences must be analyzed simultaneously. Traditional machine learning methods struggle to extract nonlinear abstract features of a virus sequence and can extract only low-level features. These low-level features mainly describe local information and cannot fully characterize a virus genome sequence, so when large volumes of virus genome sequence data must be analyzed, such methods lack both computational efficiency and prediction accuracy.
Disclosure of Invention
The invention provides a deep learning-based method for classifying novel coronaviruses. The method can effectively mine latent value when analyzing and processing large volumes of virus sequence data, and it overcomes the difficulty that comparative genomics methods have in classifying the novel coronavirus. It learns the nonlinear features of virus sequences layer by layer and extracts more comprehensive and representative genome data features, which overcomes the inability of traditional machine learning methods to extract high-level (abstract) features, effectively improves the performance of the classifier, and enables deep mining of the internal nonlinear association mechanism of virus genome sequences. The method is characterized by comprising the following steps:
step 1, acquiring a novel coronavirus data set by downloading all available virus sequences and COVID-19 sequences from the virus host database and the GISAID platform;
step 2, preprocessing the virus sequence data set to obtain a feature vector;
step 3, performing feature extraction with three cascaded autoencoders to map the feature vector into low-dimensional high-level features that serve as the input of a model;
step 4, training an optimal novel coronavirus classification model from the input of the model;
step 5, predicting the label of the novel coronavirus data by using the optimal novel coronavirus classification model.
Further, step 2, preprocessing the virus sequence data set to obtain a feature vector, is implemented as follows:
step 2.1, performing preliminary character-sequence coding on the virus sequence to obtain a digital sequence;
step 2.2, performing a fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence;
step 2.3, constructing a feature vector from the amplitude by using a Mahalanobis distance.
Further, step 2.1, performing preliminary character-sequence coding on the virus sequence to obtain a digital sequence, is implemented as follows:
The virus sequence downloaded from the database is a character sequence composed of the four base symbols A, T, G and C, and it is converted into a digital sequence that the cascaded autoencoders can process. Let the set of virus sequences be $D = \{P_1, P_2, P_3, P_4, \ldots, P_n\}$, where $P_i \in \{A, T, G, C\}$, $1 \le i \le n$. For each character sequence $P_i$, the characters are coded as T/C = 1 and A/G = -1, and the preliminarily coded digital sequence is denoted $G = (D_1, D_2, D_3, \ldots, D_n)$, where $D_i$ is the discrete numerical representation of the sequence $P_i$, $1 \le i \le n$.
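The coding rule of step 2.1 can be sketched in a few lines of Python. The function name `encode_sequence` and the `unknown` fallback for symbols other than A, T, G, C are illustrative assumptions; the text only specifies T/C = 1 and A/G = -1.

```python
# Mapping stated in the text: T/C -> 1, A/G -> -1.
BASE_CODE = {"T": 1, "C": 1, "A": -1, "G": -1}


def encode_sequence(seq, unknown=0):
    """Convert a nucleotide string into a list of +1/-1 values.

    `unknown` is an assumption (not from the text): it is used for any
    symbol other than A, T, G, C, such as the ambiguity code N.
    """
    return [BASE_CODE.get(base, unknown) for base in seq.upper()]


print(encode_sequence("ATGCCGTA"))
```

Lower-case input is accepted because the sketch upper-cases the string first; how ambiguous bases should really be handled is not stated in the source.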
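Further, step 2.2, performing a fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence, is implemented as follows:
step 2.2.1, the digital signal $D_i(n)$ of each virus sequence, of length $N = 2^M$, is divided into two groups according to the parity of $n$: $x_1(r) = D_i(2r)$, $x_2(r) = D_i(2r+1)$, $r = 0, 1, \ldots, \frac{N}{2}-1$; the DFT is expressed as $F_i(k) = X_1(k) + W_N^{k} X_2(k)$, $F_i(k + \frac{N}{2}) = X_1(k) - W_N^{k} X_2(k)$, $k = 0, 1, \ldots, \frac{N}{2}-1$, where $X_1$ and $X_2$ are the $\frac{N}{2}$-point DFTs of $x_1$ and $x_2$ and $W_N = e^{-j 2\pi / N}$;
step 2.2.2, each $\frac{N}{2}$-point virus subsequence is further split into two $\frac{N}{4}$-point subsequences: $x_1(2s) = x_3(s)$, $x_1(2s+1) = x_4(s)$; $x_2(2s) = x_5(s)$, $x_2(2s+1) = x_6(s)$, $s = 0, 1, \ldots, \frac{N}{4}-1$; the DFT is expressed as $X_1(k) = X_3(k) + W_{N/2}^{k} X_4(k)$, $X_1(k + \frac{N}{4}) = X_3(k) - W_{N/2}^{k} X_4(k)$, and likewise $X_2(k) = X_5(k) + W_{N/2}^{k} X_6(k)$, $X_2(k + \frac{N}{4}) = X_5(k) - W_{N/2}^{k} X_6(k)$, $k = 0, 1, \ldots, \frac{N}{4}-1$;
step 2.2.3, step 2.2.2 is repeated recursively so that, after $M$ levels of decomposition in total, only 2-point DFT operations on the virus sequence remain; the modulus of the fast Fourier transform of $D_i(n)$ is expressed as $|F_i(k)|$, denoted $H_i(k)$, where $0 \le k \le N-1$, and $H_i(k)$ is the amplitude;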
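In practice the amplitude $H_i(k)$ of step 2.2 can be obtained from a library FFT instead of a hand-written butterfly. The sketch below uses NumPy's `numpy.fft.fft`; zero-padding the signal to the next power of two is an assumption made here so that the length has the form $2^M$, since the text does not say how other lengths are handled.

```python
import numpy as np


def amplitude_spectrum(signal):
    """Return |F(k)| for a non-empty numeric sequence.

    The signal is zero-padded to the next power of two (an assumption,
    chosen so that the length matches the N = 2**M form used in the text).
    """
    x = np.asarray(signal, dtype=float)
    n_fft = 1 << (len(x) - 1).bit_length()  # smallest power of two >= len(x)
    return np.abs(np.fft.fft(x, n=n_fft))


h = amplitude_spectrum([-1, 1, -1, 1, 1, -1, 1, -1])
print(h.shape)
```

For a real-valued input the spectrum is symmetric, so only the first half of `h` carries independent information; whether the patent keeps the full spectrum is not stated.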
Further, step 2.3, constructing a feature vector from the amplitude by using a Mahalanobis distance, is implemented as follows:
$d(H_i, H_j) = \sqrt{(H_i - H_j)^{T} S^{-1} (H_i - H_j)}$
where $H_i$ and $H_j$ are the amplitudes of the $i$-th and $j$-th digital signals of the virus sequences, respectively, and $S$ is the covariance matrix of the amplitudes. The virus sequence labels are one-hot coded, so that each virus type corresponds to one label value, $L = [L_1, L_2, L_3, \ldots, L_k]$: if a virus sequence belongs to the $i$-th virus type, only the $i$-th position $L_i$ of its label $L$ is 1 and the remaining positions are 0. The complete label data $L$ form a two-dimensional array.
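The one-hot label coding described above can be illustrated as follows. The helper name `one_hot_labels` and the alphabetical ordering of classes are assumptions for the sake of a reproducible example; the family names are only sample inputs.

```python
def one_hot_labels(class_names):
    """Return (sorted class list, one-hot rows) for a list of class names.

    Ordering the classes alphabetically is an assumption; any fixed
    ordering gives an equivalent coding.
    """
    classes = sorted(set(class_names))
    index = {name: i for i, name in enumerate(classes)}
    labels = [
        [1 if index[name] == j else 0 for j in range(len(classes))]
        for name in class_names
    ]
    return classes, labels


classes, labels = one_hot_labels(["Coronaviridae", "Flaviviridae", "Coronaviridae"])
print(classes)
print(labels)
```

The result is a two-dimensional array with one row per sequence and one column per virus type, matching the description of the label data.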
Further, step 3, performing feature extraction with three cascaded autoencoders to map the feature vector into low-dimensional high-level features that serve as the input of a model, is implemented as follows:
step 3.1, in order to learn more robust features of the virus sequence, a small part of the data of the feature vector is randomly corrupted to obtain a sample; this limits the influence of irrelevant information introduced when the character sequence was encoded in the previous stage, so that the designed autoencoder better captures the essential features of the virus sequence;
step 3.2, an L1+L2 norm is used as a penalty term to improve the model and keep the algorithm from overfitting the input data, and a loss function is constructed;
step 3.3, the first-stage autoencoder is trained to obtain the first-layer low-dimensional feature vector;
step 3.4, the output of the first-stage autoencoder is taken as the input of the next-stage encoder and the second-stage autoencoder is trained; this step is repeated until the third-stage autoencoder has been trained, yielding the abstract high-level feature representation of the virus sequence;
Further, in step 3.2, the L1+L2 norm is used as a penalty term to improve the model and keep the algorithm from overfitting the input data, and the loss function is constructed with the following quantities:
$x^{(i)}$ is the original input feature vector, $x_1^{(i)}$ is the reconstructed feature vector, $w$ is the weight, and $\lambda$ and $\rho$ adjust the weights of the penalty terms.
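As an illustration of a reconstruction loss with an L1+L2 penalty, the sketch below assumes one common form: mean squared reconstruction error plus $\lambda \lVert w \rVert_1 + \rho \lVert w \rVert_2^2$. This exact form, and the function name, are assumptions; the source names only the ingredients ($x^{(i)}$, its reconstruction, $w$, $\lambda$, $\rho$), not how they are combined.

```python
import numpy as np


def penalised_reconstruction_loss(x, x_rec, w, lam, rho):
    """ASSUMED form: mean squared error + lam * L1(w) + rho * squared L2(w).

    x, x_rec : arrays of shape (n_samples, n_features)
    w        : weight array of any shape
    """
    x, x_rec, w = (np.asarray(a, dtype=float) for a in (x, x_rec, w))
    mse = np.mean(np.sum((x - x_rec) ** 2, axis=1))
    return mse + lam * np.abs(w).sum() + rho * np.square(w).sum()


loss = penalised_reconstruction_loss(
    x=[[1.0, -1.0]], x_rec=[[1.0, 1.0]], w=[[1.0, -2.0]], lam=0.1, rho=0.01
)
print(loss)
```

The L1 term encourages sparse weights and the L2 term discourages large ones, which is the usual motivation for combining the two penalties.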
Further, step 3.3, training the first-stage autoencoder to obtain the first-layer low-dimensional feature vector, is implemented as follows:
Suppose there are $n$ samples of the virus digital sequence, denoted $X = \{x_1, x_2, x_3, x_4, \ldots, x_n\} = \{x_i \mid 1 \le i \le n\}$, where each sample $x_i \in R^M$ is an $M$-dimensional feature vector. A small fraction of the data of each sample is randomly selected and set to 1 or -1, and the resulting vector is denoted $\tilde{X}$. $\tilde{X}$ is input to the first-stage encoder, whose encoding function $f_\theta$ performs the first-layer encoding of each virus sequence sample to give the feature vector $Y_1 = f_\theta^{(1)}(\tilde{X}) = s(W\tilde{X} + b)$. The decoder reconstructs the input vector as $Z_1 = g_\theta^{(1)}(Y_1) = s(W'Y_1 + b')$, where $\{W, b\}$ are the encoding parameters and $\{W', b'\}$ are the decoding parameters. The loss function is minimized with a gradient descent algorithm, $\arg\min_{\theta,\theta'} E$, to obtain the encoding and decoding parameters: $W \leftarrow W - l\,\frac{\partial E}{\partial W}$, $b \leftarrow b - l\,\frac{\partial E}{\partial b}$, where $l$ is the learning rate; $W'$ and $b'$ are computed in the same way.
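The corruption step, in which a small fraction of each sample is overwritten with 1 or -1, can be sketched as below. The function name `corrupt`, the default fraction of 10 % and the fixed random seed are assumptions chosen for illustration.

```python
import numpy as np


def corrupt(x, fraction=0.1, seed=0):
    """Return a copy of x (n_samples, n_features) with a random subset of
    positions in every row overwritten by +1 or -1.

    `fraction` and `seed` are illustrative assumptions; the text only says
    that "a small fraction" of the data is selected.
    """
    rng = np.random.default_rng(seed)
    x_noisy = np.array(x, dtype=float, copy=True)
    n_corrupt = int(round(fraction * x_noisy.shape[1]))
    for row in x_noisy:  # each row is a view, so assignment edits x_noisy
        idx = rng.choice(x_noisy.shape[1], size=n_corrupt, replace=False)
        row[idx] = rng.choice([1.0, -1.0], size=n_corrupt)
    return x_noisy


clean = np.zeros((4, 20))
noisy = corrupt(clean, fraction=0.25)
print(np.count_nonzero(noisy, axis=1))
```

The autoencoder is then trained to reconstruct the clean input from the corrupted copy, which is what pushes it toward features that do not depend on any single position.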
Further, the abstract high-level feature representation of the virus sequence output by the third-stage autoencoder in step 3.4 is implemented as follows:
the final features of the virus sequence are expressed through the encoding function $f_\theta$, where $x_1^{(i)}$ denotes the feature vector output by the third-stage autoencoder.
Further, step 4, training an optimal novel coronavirus sequence classification model from the input of the model, comprises the following steps:
step 4.1, dividing the training set data into K parts, of which one part is used as the validation set and the remaining K-1 parts are used as the training set;
step 4.2, obtaining the optimal hyperparameters by Bayesian optimization;
step 4.3, training a deep learning network with the optimal hyperparameters to obtain an optimized novel coronavirus classification model;
step 4.4, selecting as the validation set one of the parts that has not yet served as the validation set, using the remaining data as the training set, and repeating from step 4.1; once every part has served as the validation set, the accuracy of the K optimized novel coronavirus classification models on the test set is calculated, and the model with the highest accuracy is taken as the optimal novel coronavirus classification model.
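The rotation over K folds in steps 4.1 to 4.4 can be sketched as below. The callback `train_and_score` is a hypothetical stand-in for Bayesian hyperparameter search, network training and scoring on the test set; none of those parts is implemented here.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_best_model(n_samples, k, train_and_score):
    """Use each fold once as validation set and keep the best-scoring model.

    `train_and_score(train_idx, val_idx)` is a hypothetical callback that
    returns (model, accuracy).
    """
    folds = k_fold_indices(n_samples, k)
    best_model, best_score = None, float("-inf")
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model, score = train_and_score(train_idx, val_idx)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score


# Dummy callback: the "model" is just a tag and the score is the fold size.
model, score = select_best_model(10, 3, lambda tr, va: (("model", len(va)), len(va)))
print(model, score)
```

Every sample appears in exactly one validation fold, so after K rounds all of the data has been used both for training and for validation.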
The invention provides a novel coronavirus classification method based on a deep learning algorithm, which can effectively improve classification accuracy and solve the problem of low classification accuracy of the novel coronavirus.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a block diagram of the feature extraction of a viral sequence according to the present invention;
FIG. 2 is a flowchart of the autoencoder algorithm for virus sequences according to the present invention;
FIG. 3 is a flow chart of the optimization of the epoch parameters of the deep learning model of the present invention;
FIG. 4 is a flowchart of the optimization of the remaining parameters of the deep learning model of the present invention;
FIG. 5 is an overall flow chart of the present invention;
FIG. 6 is a diagram of a classification structure of a currently available viral sequence in a database of the present invention;
FIG. 7 is a diagram showing a structure of a deep learning model prediction virus sequence classification according to the present invention;
FIG. 8 is a specific structural design of the deep learning model network of the present invention;
FIG. 9 is a block diagram of hierarchical modeling classification prediction for a COVID-19 sequence in accordance with the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 5 is an overall flowchart of the present invention, and the present invention provides a novel coronavirus classification method based on a deep learning algorithm, which is characterized by comprising the following steps:
(1) Acquiring the existing available virus sequences and a novel coronavirus data set:
(1a) Downloading all available virus sequences and COVID-19 sequences from the virus host database and the GISAID platform;
(2) Preprocessing the virus sequence data set to obtain feature vectors:
(2a) Performing character-sequence coding on the virus sequence to obtain a digital sequence;
(2b) Performing a fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence;
(2c) Constructing a feature vector from the amplitude by using a Mahalanobis distance;
(3) Performing feature extraction with three cascaded autoencoders to map the feature vectors into low-dimensional high-level features that serve as the input of a model;
(4) Training an optimal novel coronavirus sequence classification model from the input of the model:
(4a) Dividing the training set data into K parts, of which one part is used as the validation set and the other K-1 parts are used as the training set;
(4b) Obtaining the optimal hyperparameters by Bayesian optimization;
(4c) Training a deep learning network with the optimal hyperparameters to obtain an optimized novel coronavirus classification model;
(4d) Selecting as the validation set one of the parts that has not yet served as the validation set, using the remaining data as the training set, and repeating from step (4a); once every part has served as the validation set, the accuracy of the K optimized novel coronavirus classification models on the test set is calculated, and the model with the highest accuracy is taken as the optimal novel coronavirus classification model;
(5) Predicting the label of the novel coronavirus data by using the optimal novel coronavirus classification model;
Step (2a), performing preliminary character-sequence coding on the virus sequence to obtain a digital sequence, is implemented as follows:
the virus sequence downloaded from the database is a character sequence composed of the four base symbols A, T, G and C, and it is converted into a digital sequence that the cascaded autoencoders can process. Let the set of virus sequences be $D = \{P_1, P_2, P_3, P_4, \ldots, P_n\}$, where $P_i \in \{A, T, G, C\}$, $1 \le i \le n$. For each character sequence $P_i$, the characters are coded as T/C = 1 and A/G = -1, and the preliminarily coded digital sequence is denoted $G = (D_1, D_2, D_3, \ldots, D_n)$, where $D_i$ is the discrete numerical representation of the sequence $P_i$, $1 \le i \le n$;
Step (2b), performing a fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence, works on the following principle:
The amplitude of each digital signal $D_i$ is solved with the fast Fourier transform (FFT). The modulus of the fast Fourier transform of $D_i$ is expressed as $|F_i(k)|$, denoted $H_i(k)$, where $0 \le k \le N-1$. The transform is implemented for a virus sequence digital signal $D_i$ with $N = 2^M$ points, where $N$ is the length of the virus digital sequence. A radix-2 decimation-in-time FFT decomposes the transform $M$ times into 2-point DFT operations, forming an $M$-level butterfly computation from $D_i(n)$ to $F_i(k)$. Specifically, the digital signal $D_i(n)$ of each virus sequence is first divided into two groups according to the parity of $n$, each $\frac{N}{2}$-point subsequence is then split into two $\frac{N}{4}$-point subsequences, and so on, until after $M$ decompositions only 2-point DFT operations on the virus sequence remain. Through this $M$-level operation, the amplitude spectrum representation $F_i(k)$ of the virus sequence digital signal $D_i(n)$ is obtained. The specific implementation steps are as follows:
Step 1, the digital signal $D_i(n)$ of each virus sequence is divided into two groups according to the parity of $n$: $x_1(r) = D_i(2r)$, $x_2(r) = D_i(2r+1)$, $r = 0, 1, \ldots, \frac{N}{2}-1$; the DFT is expressed as $F_i(k) = X_1(k) + W_N^{k} X_2(k)$, $F_i(k + \frac{N}{2}) = X_1(k) - W_N^{k} X_2(k)$, $k = 0, 1, \ldots, \frac{N}{2}-1$, where $X_1$ and $X_2$ are the $\frac{N}{2}$-point DFTs of $x_1$ and $x_2$ and $W_N = e^{-j 2\pi / N}$;
Step 2, each $\frac{N}{2}$-point virus subsequence is further split into two $\frac{N}{4}$-point subsequences: $x_1(2s) = x_3(s)$, $x_1(2s+1) = x_4(s)$; $x_2(2s) = x_5(s)$, $x_2(2s+1) = x_6(s)$, $s = 0, 1, \ldots, \frac{N}{4}-1$; the DFT is expressed as $X_1(k) = X_3(k) + W_{N/2}^{k} X_4(k)$, $X_1(k + \frac{N}{4}) = X_3(k) - W_{N/2}^{k} X_4(k)$, and likewise $X_2(k) = X_5(k) + W_{N/2}^{k} X_6(k)$, $X_2(k + \frac{N}{4}) = X_5(k) - W_{N/2}^{k} X_6(k)$, $k = 0, 1, \ldots, \frac{N}{4}-1$;
Step 3, Step 2 is repeated recursively a further $M-2$ times, so that after $M$ decompositions only 2-point DFT operations on the virus sequence remain; the modulus of the fast Fourier transform of $D_i(n)$ is expressed as $|F_i(k)|$, denoted $H_i(k)$, where $0 \le k \le N-1$, and $H_i(k)$ is the amplitude;
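The even/odd splitting described in Steps 1 to 3 is the classic recursive radix-2 decimation-in-time FFT. The pure-Python sketch below follows that recursion; it recurses down to 1-point sequences (a 2-point DFT is then just one butterfly of two 1-point results), which is an implementation convenience and not something the text prescribes. The function names are illustrative.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor W_N^k
        out[k] = even[k] + t           # butterfly, upper half
        out[k + n // 2] = even[k] - t  # butterfly, lower half
    return out


def amplitudes(x):
    """Return the modulus |F(k)| of every FFT coefficient."""
    return [abs(v) for v in fft_radix2(x)]


print(amplitudes([1, -1, 1, -1]))
```

A strictly alternating signal concentrates all of its energy at $k = N/2$, which gives a quick sanity check of the butterfly signs.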
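Step (2c), constructing a feature vector from the amplitude by using a Mahalanobis distance, is implemented as follows:
$d(H_i, H_j) = \sqrt{(H_i - H_j)^{T} S^{-1} (H_i - H_j)}$
where $H_i$ and $H_j$ are the amplitudes of the $i$-th and $j$-th digital signals of the virus sequences, respectively, and $S$ is the covariance matrix of the amplitudes;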
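A pairwise Mahalanobis distance between amplitude vectors can be computed as sketched below. Estimating the covariance from the same set of amplitude vectors, using a pseudo-inverse for numerical safety, and treating the resulting distance matrix as the basis of the feature vector are all assumptions; the text does not spell out these details.

```python
import numpy as np


def pairwise_mahalanobis(h):
    """h: (n_sequences, n_features) amplitude vectors -> (n, n) distances.

    The covariance is estimated from h itself and inverted with a
    pseudo-inverse (both are assumptions made for this sketch).
    """
    h = np.asarray(h, dtype=float)
    cov = np.atleast_2d(np.cov(h, rowvar=False))
    cov_inv = np.linalg.pinv(cov)
    diff = h[:, None, :] - h[None, :, :]  # all pairwise differences
    d2 = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))


d = pairwise_mahalanobis([[0, 0], [2, 0], [0, 2], [2, 2]])
print(np.round(d, 3))
```

Unlike the Euclidean distance, the Mahalanobis distance rescales each direction by its variance, so frequency bins with large natural spread do not dominate the comparison.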
Step (3), performing feature extraction with three cascaded autoencoders to map the feature vectors into low-dimensional high-level features that serve as the input of a model, is implemented as follows:
(3a) In order to learn more robust features of the virus sequence, a small part of the data of the feature vector is randomly corrupted; this limits the influence of irrelevant information introduced when the character sequence was encoded in the previous stage, so that the designed autoencoder better captures the essential features of the virus sequence;
(3b) An L1+L2 norm is used as a penalty term to improve the model and keep the algorithm from overfitting the input data, and a loss function is constructed;
(3c) The first-stage autoencoder is trained to obtain the first-layer low-dimensional feature vector;
(3d) The output of the first-stage autoencoder is taken as the input of the next-stage encoder and the second-stage autoencoder is trained; this step is repeated until the third-stage autoencoder has been trained, yielding the abstract high-level feature representation of the virus sequence;
(3e) The loss function of step (3b) is constructed with the following quantities:
$x^{(i)}$ is the original input feature vector, $x_1^{(i)}$ is the reconstructed feature vector, $w$ is the weight, and $\lambda$ and $\rho$ adjust the weights of the penalty terms.
Step (3c), training the first-stage autoencoder to obtain the first-layer low-dimensional feature vector, is implemented as follows:
Suppose there are $n$ samples of the virus digital sequence, denoted $X = \{x_1, x_2, x_3, x_4, \ldots, x_n\} = \{x_i \mid 1 \le i \le n\}$, where each sample $x_i \in R^M$ is an $M$-dimensional feature vector. A small fraction of the data of each sample is randomly selected and set to 1 or -1, and the resulting vector is denoted $\tilde{X}$. $\tilde{X}$ is input to the first-stage encoder, whose encoding function $f_\theta$ performs the first-layer encoding of each virus sequence sample to give the feature vector $Y_1 = f_\theta^{(1)}(\tilde{X}) = s(W\tilde{X} + b)$. The decoder reconstructs the input vector as $Z_1 = g_\theta^{(1)}(Y_1) = s(W'Y_1 + b')$, where $\{W, b\}$ are the encoding parameters and $\{W', b'\}$ are the decoding parameters. The loss function is minimized with a gradient descent algorithm, $\arg\min_{\theta,\theta'} E$, to obtain the encoding and decoding parameters: $W \leftarrow W - l\,\frac{\partial E}{\partial W}$, $b \leftarrow b - l\,\frac{\partial E}{\partial b}$, where $l$ is the learning rate; $W'$ and $b'$ are computed in the same way;
The abstract high-level feature representation of the virus sequence output by the third-stage autoencoder in step (3d) is implemented as follows:
the final features of the virus sequence are expressed through the encoding function $f_\theta$, where $x_1^{(i)}$ denotes the feature vector output by the third-stage autoencoder;
Specifically, step (3) may be summarized as follows:
Suppose there are $n$ samples of the virus digital sequence, denoted $X = \{x_1, x_2, x_3, x_4, \ldots, x_n\} = \{x_i \mid 1 \le i \le n\}$, where each sample $x_i \in R^M$ is an $M$-dimensional feature vector. A small fraction of the data of each sample is randomly selected and set to 1 or -1, and the resulting vector is denoted $\tilde{X}$. $\tilde{X}$ is input to the first-stage encoder, whose encoding function $f_\theta$ performs the first-layer encoding of each virus sequence sample to give the feature vector $Y_1 = \{y_i \mid 1 \le i \le n\}$ of formula (3-1); the decoder reconstructs the input vector as $Z_1 = \{z_i \mid 1 \le i \le n\}$ by formula (3-2), where $\{W, b\}$ are the encoding parameters and $\{W', b'\}$ are the decoding parameters. The loss function is minimized with a gradient descent algorithm, $\arg\min_{\theta,\theta'} E$, to obtain the encoding and decoding parameters according to formulas (3-3) and (3-4), where $l$ is the learning rate; $W'$ and $b'$ are computed in the same way. After the first-layer network has been trained, only the first-stage encoder is retained, and the abstracted low-dimensional feature vector it outputs is used as the input of the next-stage encoder. The second-layer network is trained in the same way, and so on until the third-stage autoencoder has been trained; the output of the encoder at that point is the abstract feature representation of the virus DNA sequence. Through three stages of progressive abstraction, the low-level features of the virus sequence are combined into the abstract high-level feature representation. The virus sequence feature extraction block diagram is shown in FIG. 1, and the flowchart of the virus sequence autoencoder algorithm is shown in FIG. 2.
$Y_1 = f_\theta^{(1)}(\tilde{X}) = s(W\tilde{X} + b)$ (3-1)
$Z_1 = g_\theta^{(1)}(Y_1) = s(W'Y_1 + b')$ (3-2)
$W \leftarrow W - l\,\frac{\partial E}{\partial W}$ (3-3)
$b \leftarrow b - l\,\frac{\partial E}{\partial b}$ (3-4)
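The greedy, stage-by-stage training described above can be sketched with NumPy. Everything below is a simplified illustration under stated assumptions: a sigmoid is used for $s(\cdot)$, the loss is a plain mean squared reconstruction error (no corruption step and no L1+L2 penalty), inputs are assumed to be scaled to [0, 1], and the layer sizes 8, 4, 2 are arbitrary.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_autoencoder(x, n_hidden, lr=0.5, epochs=200, seed=0):
    """Train one encoder/decoder pair by gradient descent.

    Returns the encoder parameters (W, b) and the final loss
    E = mean over samples of 0.5 * ||z - x||^2 (an assumed loss).
    """
    rng = np.random.default_rng(seed)
    n, m = x.shape
    W, b = rng.normal(0, 0.1, (m, n_hidden)), np.zeros(n_hidden)
    W2, b2 = rng.normal(0, 0.1, (n_hidden, m)), np.zeros(m)
    for _ in range(epochs):
        y = sigmoid(x @ W + b)             # encode, cf. (3-1)
        z = sigmoid(y @ W2 + b2)           # decode, cf. (3-2)
        dz = (z - x) * z * (1 - z) / n     # dE/d(decoder pre-activation)
        dy = (dz @ W2.T) * y * (1 - y)     # dE/d(encoder pre-activation)
        W2 -= lr * (y.T @ dz)              # updates, cf. (3-3) and (3-4)
        b2 -= lr * dz.sum(axis=0)
        W -= lr * (x.T @ dy)
        b -= lr * dy.sum(axis=0)
    z = sigmoid(sigmoid(x @ W + b) @ W2 + b2)
    return (W, b), 0.5 * np.mean(np.sum((z - x) ** 2, axis=1))


def stack_encoders(x, layer_sizes):
    """Train the stages one after another, keeping only each encoder."""
    params, h = [], x
    for size in layer_sizes:
        (W, b), _ = train_autoencoder(h, size)
        params.append((W, b))
        h = sigmoid(h @ W + b)             # output feeds the next stage
    return params, h


x = np.random.default_rng(1).random((20, 16))
params, features = stack_encoders(x, [8, 4, 2])
print(features.shape)
```

Each decoder is discarded once its stage has been trained, so the final feature extractor is simply the composition of the three encoders.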
Step (4) can be summarized as follows:
First, the feature-extracted virus digital sequence is reduced in dimension by a convolution layer. In operation, the convolution kernel scans the input features at regular intervals, multiplies the input features element-wise with the kernel, sums the products and adds a bias; the extracted features preserve the inherent topology of the input. The convolution is given by formula (4-1), where $f(\cdot)$ is the activation function, the operator between the features and the kernel is the convolution operation, the feature term is indexed by row $i$ and column $j$ of layer $l$ (in particular, $x^{0}$ denotes the input virus digital sequence), the kernel term is the convolution kernel, and $b_j$ is the bias term.
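The sliding multiply-sum-plus-bias operation of a convolution layer can be illustrated with a small NumPy sketch. A one-dimensional layout, stride 1, no padding and a ReLU activation are assumptions made here; like most deep learning frameworks, the sketch computes a cross-correlation (the kernel is not flipped).

```python
import numpy as np


def conv1d(x, kernels, bias, activation=lambda a: np.maximum(a, 0.0)):
    """Valid 1-D convolution layer with stride 1 (assumed settings).

    x       : (length, in_channels)
    kernels : (width, in_channels, out_channels)
    bias    : (out_channels,)
    """
    width = kernels.shape[0]
    out_len = x.shape[0] - width + 1
    out = np.empty((out_len, kernels.shape[2]))
    for t in range(out_len):
        window = x[t:t + width]  # (width, in_channels)
        # element-wise multiply with every kernel, sum, then add the bias
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return activation(out)


x = np.arange(5.0).reshape(5, 1)
y = conv1d(x, np.ones((2, 1, 1)), np.zeros(1))
print(y.ravel())
```

With an all-ones kernel of width 2 the layer simply adds neighbouring values, which makes the sliding-window behaviour easy to verify by hand.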
After feature extraction by the convolution layer, the output features are passed to the pooling layer for feature selection and information filtering. Max pooling is adopted, as given by formula (4-2), in which the pooled quantity is the feature value at row $i$, column $j$ of the feature map of the $l$-th max-pooling layer, $u(a, a)$ is the window function, and $N$ is the size of the window.
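Max pooling keeps only the largest value inside each window. A minimal one-dimensional sketch follows; using non-overlapping windows (stride equal to the window size) and dropping an incomplete trailing window are assumptions.

```python
def max_pool1d(values, window):
    """Non-overlapping 1-D max pooling; an incomplete last window is dropped."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


print(max_pool1d([1, 3, 2, 5, 4, 0], window=2))
```

Pooling halves the sequence length here while retaining the strongest response in each neighbourhood, which is the "information filtering" role described above.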
Basic residual modules, each containing two convolution layers, replace the stacked convolution layers to mitigate the training difficulty caused by network depth, and a pooling layer follows each basic residual module to filter feature information and, to some extent, prevent overfitting. The three residual modules of this step are defined as $y_{l+1} = f(y_l + F(y_l, w_l))$, where $y_l$ and $y_{l+1}$ are the input and output of the $l$-th residual module, $F(\cdot)$ is the residual function, $F(y_l, w_l)$ is the residual mapping to be learned, $w_l$ are the parameters of the residual block, and $f(\cdot)$ is the activation function. A dropout layer follows the second max-pooling layer: during forward propagation, the activation of each neuron is switched off with a certain probability $P$ to avoid overfitting. The computation is given by (4-3) and (4-4), where $W$ is the weight, $b$ is the bias, $k^{(3)}$ is the output of layer 3, $k^{(4)}$ is the output of layer 4, and $C^{(3)}$ is a random 0/1 vector generated with the Bernoulli function, each element of which follows a Bernoulli distribution with parameter $P$. The optimal value of the discard ratio $P$ is determined by Bayesian hyperparameter tuning.
$a_i^{(4)} = \sum_j W_{ij}\, C_j^{(3)} k_j^{(3)} + b_i$ (4-3)
$k_i^{(4)} = f(a_i^{(4)})$ (4-4)
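The residual connection and the dropout step can be sketched together. The ReLU activation, the interpretation of the dropout parameter as a drop rate, and the rescaling of the surviving activations by 1/(1 - drop rate) (the "inverted dropout" convention common in current frameworks) are assumptions; `residual_fn` is a hypothetical stand-in for the two convolution layers.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def residual_block(y, residual_fn):
    """y_{l+1} = f(y_l + F(y_l, w_l)) with f = ReLU (assumed)."""
    return relu(y + residual_fn(y))


def dropout_dense(k3, W, b, drop_rate, rng, training=True):
    """Mask the layer-3 output with a Bernoulli 0/1 vector, then apply a
    dense layer and the activation, cf. (4-3) and (4-4)."""
    if training and drop_rate > 0.0:
        mask = rng.binomial(1, 1.0 - drop_rate, size=k3.shape)  # C^(3)
        k3 = k3 * mask / (1.0 - drop_rate)  # inverted-dropout scaling (assumed)
    a4 = W @ k3 + b
    return relu(a4)


rng = np.random.default_rng(0)
y_next = residual_block(np.array([1.0, -2.0]), lambda v: -2.0 * v)
k4 = dropout_dense(np.array([1.0, 2.0]), np.eye(2), np.zeros(2), 0.5, rng)
print(y_next, k4)
```

At inference time (`training=False`) the mask is skipped, so the layer becomes an ordinary dense layer.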
After global average pooling, a fully connected layer combines the extracted features nonlinearly to produce the output, as expressed by (4-5), where $\odot$ denotes matrix multiplication, $W_{N \times C}$ is the weight, $b$ is the bias, $G_c$ is the feature vector of the virus sequence after global average pooling, and $N$ is the number of classes at the corresponding level of the hierarchical modeling.
$y_n = W_{N \times C} \odot G_c + b$ (4-5)
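Formula (4-5) amounts to averaging each feature channel over the sequence and applying one affine map. A small NumPy sketch, with the array layout (length by channels) taken as an assumption:

```python
import numpy as np


def global_average_pool(features):
    """features: (length, C) -> G_c with shape (C,)."""
    return np.asarray(features, dtype=float).mean(axis=0)


def fully_connected(g, W, b):
    """y = W g + b with W of shape (N, C), cf. (4-5)."""
    return W @ g + b


g = global_average_pool([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = fully_connected(g, W, np.zeros(3))
print(g, y)
```

Here C = 2 channels are mapped to N = 3 class scores; in the patent, N would be 13, 12, 4 or 4 depending on the level of the hierarchy.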
When the virus sequences are modeled hierarchically, the numbers of neurons in the fully connected layer are 13, 12, 4 and 4 respectively, corresponding to the total number of virus sequence classes at each level. This patent uses the cross-entropy function as the loss function to train the model, as given by (4-6), where $y_{ik}$ is the true label of the $i$-th virus sequence sample for the $k$-th virus sequence class, $p_{ik}$ is the model's predicted probability that sample $i$ belongs to class $k$, $N$ is the total number of virus sequence samples, and $K$ is the total number of classes. Given the loss function, the model updates its parameters by gradient descent: the network first computes the output of each layer by feed-forward propagation, then back-propagates the error and adjusts the parameters following the error gradient until an optimal solution is found.
$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik} \log p_{ik}$ (4-6)
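The cross-entropy loss over one-hot labels can be written in a few lines of standard-library Python. Averaging over the $N$ samples, using a softmax to turn scores into probabilities, and clipping probabilities at a small `eps` to avoid `log(0)` are assumptions of this sketch.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean over samples of -sum_k y_ik * log(p_ik) for one-hot y_true."""
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        for y, p in zip(y_row, p_row):
            total -= y * math.log(max(p, eps))
    return total / n


loss = cross_entropy([[1, 0], [0, 1]], [[0.5, 0.5], [0.25, 0.75]])
print(loss)
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each sample's loss.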
The deep learning model design is shown in fig. 7, and the network specific structure design is shown in fig. 8.
During model optimization, the number of training epochs is determined by monitoring the validation loss: when the model's val-loss has not decreased after 30 rounds of evaluation, the current epoch value is output as the optimal setting for the model. The epoch optimization scheme is shown in Fig. 3, and the optimization scheme for the parameters other than the epoch is shown in Fig. 4. Three-fold cross-validation is used to evaluate the prediction performance of the model under different parameter choices, and the mean AUC over the three folds is used as the evaluation index that defines the objective function, ensuring the reliability of the parameter selection. The tuning range of the network parameters is shown in the following table:
The prediction capability of the deep learning model under different parameter choices is compared by K-fold cross-validation. Before training, the sample data are shuffled to eliminate any bias that the sample order might introduce. To ensure that all of the data are used for training, the training data are divided into K parts; each time, a different set of K-1 parts is used for training and the remaining part for testing. This is repeated K times, and the mean of the K evaluation scores is compared to select the optimal model. Evaluation uses accuracy (acc) and precision; the closer these values are to 1, the better the result.
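The shuffling and K-fold procedure just described can be sketched as follows. The helper names `k_fold_indices` and `cross_validate`, and the `fit_and_score` callback that trains a model on the training part and returns one evaluation score for the held-out part, are hypothetical names introduced for the example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices, split into k folds, yield (train, test) index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # shuffle to remove ordering bias
    folds = np.array_split(order, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def cross_validate(fit_and_score, X, y, k=3, seed=0):
    """Mean score over k folds; fit_and_score(X_tr, y_tr, X_te, y_te) -> float."""
    scores = [fit_and_score(X[tr], y[tr], X[te], y[te])
              for tr, te in k_fold_indices(len(X), k, seed)]
    return float(np.mean(scores))
```

Comparing the value returned by `cross_validate` across candidate parameter settings corresponds to comparing the mean evaluation index over the K repetitions.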
When classifying the COVID-19 sequence, each virus sequence in the training data is labeled with a set of taxonomic names. The taxonomic structure of the available virus sequences in the database, from the highest rank to the lowest, is shown in Fig. 6, and the sequence is classified level by level from the highest rank down. The first level covers 11 virus families and the realm Riboviria; classifying the COVID-19 sequence with the optimized deep learning model assigns it to Riboviria. The second level covers 12 families under Riboviria; the optimized model assigns the COVID-19 sequence to the family Coronaviridae. The third level covers four genera under Coronaviridae; the optimized model assigns the COVID-19 sequence to the genus Betacoronavirus. The fourth level covers four subgenera under Betacoronavirus; the optimized model assigns the COVID-19 sequence to lineage B (Sarbecovirus). The final classification label of the COVID-19 sequence is therefore lineage B (Sarbecovirus) of the genus Betacoronavirus. The hierarchical classification results for the COVID-19 sequence are shown in Fig. 9.
Let the selected optimal model be denoted $f(x)$; for the COVID-19 sequence $x$, this model outputs its class label: COVID-19_label $= f(x)$.
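The level-by-level search can be pictured as a walk down the taxonomy in which each node has its own trained classifier. The sketch below is only an outline of that control flow: the dictionary `models`, the `"root"` key and the stub classifiers (lambdas returning fixed labels) are hypothetical stand-ins for the four optimized models.

```python
def classify_hierarchically(x, models, start="root"):
    """Descend the taxonomy: apply the classifier for the current node until none is left."""
    path = []
    node = start
    while node in models:
        node = models[node](x)   # predicted child taxon at this level
        path.append(node)
    return path

# Stub classifiers standing in for the trained per-level models (illustration only).
models = {
    "root": lambda x: "Riboviria",
    "Riboviria": lambda x: "Coronaviridae",
    "Coronaviridae": lambda x: "Betacoronavirus",
    "Betacoronavirus": lambda x: "Sarbecovirus",
}
label_path = classify_hierarchically("ATGC", models)
```

The last element of the returned path plays the role of COVID-19_label in the text.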
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in those embodiments may be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit of the invention.

Claims (6)

1. A novel coronavirus classification method based on a deep learning algorithm, characterized by comprising the following steps:
step 1, acquiring a novel coronavirus data set, wherein the data set is all available virus sequences and COVID-19 sequences downloaded from a virus host database and a GISAID platform;
step 2, preprocessing a virus sequence data set to obtain a feature vector;
step 3, performing feature extraction with three cascaded autoencoders to map the feature vectors into low-dimensional high-level features, which serve as the input of a model;
wherein step 3 is implemented by the following steps:
step 3.1, randomly corrupting one tenth of the data of the feature vector to obtain samples for learning virus sequence features;
step 3.2, using the L1+L2 norm as a penalty term to improve the model and prevent the algorithm from overfitting the input data, and constructing a loss function;
wherein the loss function constructed in step 3.2 is:
where $x^{(i)}$ is the original input feature vector, which is compared with its reconstructed feature vector, $w$ is the weight, and $\lambda$ and $\rho$ adjust the weights of the penalty terms;
step 3.3, training the first-stage autoencoder to obtain a first-layer low-dimensional feature vector;
wherein step 3.3 is implemented as follows:
let the $n$ samples of the virus digital sequence be $X = \{x_1, x_2, x_3, x_4, \ldots, x_n\} = \{x_i \mid 1 \le i \le n\}$, where each sample $x_i \in R^M$ is an $M$-dimensional feature vector; for each sample, one tenth of the data is randomly selected and set to 1 or -1, and the vector $X$ corrupted in this way is input to the first-stage encoder; the encoding function applies the first-layer encoding to each virus sequence sample and yields the feature vector $Y_1$; the decoder reconstructs the input vector as $Z_1 = g_{\theta}^{(1)}(Y_1) = s(W'Y_1 + b')$, where $\{W, b\}$ are the encoding parameters and $\{W', b'\}$ are the decoding parameters; the loss function is minimized with a gradient descent algorithm, $\arg\min_{\theta,\theta'} E$, to obtain the encoding and decoding parameters, where $l$ is the learning rate used in the updates of $W$ and $b$, and $W'$ and $b'$ are computed in the same way;
step 3.4, taking the output of the first-stage autoencoder as the input of the next-stage encoder and training the second-stage autoencoder, and repeating this step until the third-stage autoencoder has been trained, to obtain an abstract high-level feature representation of the virus sequence;
wherein the abstract high-level feature representation output by the third-stage autoencoder in step 3.4 is obtained as follows:
the final features of the virus sequence are the feature vector output by the third-stage autoencoder, where $f_\theta$ is the encoding function;
step 4, training an optimal novel coronavirus sequence classification model according to the input of the model;
step 5, predicting the label of the novel coronavirus data by using the optimal novel coronavirus sequence classification model.
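Steps 3.1 to 3.4 can be illustrated with a small NumPy sketch of a stacked denoising autoencoder. Several details are assumptions, because the loss formula and some symbols are not legible in this text: a sigmoid is used for $s(\cdot)$, the reconstruction term is a halved mean squared error, `lam` weights the L1 term and `rho` the L2 term, and the penalties apply to both weight matrices. All function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(X, rng, fraction=0.1):
    """Set a random tenth of each sample's entries to 1 or -1 (step 3.1)."""
    Xc = X.copy()
    n, m = X.shape
    n_corrupt = max(1, int(round(fraction * m)))
    for i in range(n):
        idx = rng.choice(m, size=n_corrupt, replace=False)
        Xc[i, idx] = rng.choice([1.0, -1.0], size=n_corrupt)
    return Xc

def penalty(W, lam, rho):
    """Assumed L1 + L2 penalty on one weight matrix (step 3.2)."""
    return lam * np.abs(W).sum() + rho * (W ** 2).sum()

def train_dae(X, hidden, rng, lr=0.5, steps=200, lam=1e-4, rho=1e-4):
    """Train one denoising autoencoder by gradient descent (step 3.3)."""
    n, m = X.shape
    W = 0.1 * rng.standard_normal((hidden, m)); b = np.zeros(hidden)
    W2 = 0.1 * rng.standard_normal((m, hidden)); b2 = np.zeros(m)
    Xc = corrupt(X, rng)
    history = []
    for _ in range(steps):
        Y = sigmoid(Xc @ W.T + b)            # encode corrupted input
        Z = sigmoid(Y @ W2.T + b2)           # decode / reconstruct
        recon = 0.5 * np.mean(np.sum((Z - X) ** 2, axis=1))
        history.append(recon + penalty(W, lam, rho) + penalty(W2, lam, rho))
        dZ = (Z - X) * Z * (1 - Z) / n       # gradient at decoder pre-activation
        dY = (dZ @ W2) * Y * (1 - Y)         # gradient at encoder pre-activation
        W2 -= lr * (dZ.T @ Y + lam * np.sign(W2) + 2 * rho * W2)
        b2 -= lr * dZ.sum(axis=0)
        W -= lr * (dY.T @ Xc + lam * np.sign(W) + 2 * rho * W)
        b -= lr * dY.sum(axis=0)
    return (lambda A: sigmoid(A @ W.T + b)), history

def stack_three(X, sizes, rng):
    """Feed each encoder's output to the next stage (step 3.4)."""
    features = X
    for h in sizes:
        encode, _ = train_dae(features, h, rng)
        features = encode(features)
    return features
```

Calling `stack_three(X, (8, 4, 2), rng)` on an `(n, M)` feature matrix returns an `(n, 2)` matrix of compressed features; the layer sizes here are arbitrary.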
2. The method for classifying coronaviruses based on the deep learning algorithm according to claim 1, wherein the step 2 of preprocessing the viral sequence data set to obtain feature vectors comprises the following implementation steps:
step 2.1, carrying out character sequence coding on the virus sequence to obtain a digital sequence;
step 2.2, performing fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence;
step 2.3, constructing a feature vector using the Mahalanobis distance according to the magnitude.
3. The method for classifying coronaviruses based on deep learning algorithm according to claim 1, wherein said step 4 is to train an optimal novel coronavirus sequence classification model according to the input of said model, and comprises the steps of:
step 4.1, dividing training set data into K parts, wherein one part is used as a verification set, and the rest K-1 parts are used as training sets;
step 4.2, obtaining the optimal hyperparameters by Bayesian optimization;
step 4.3, training a deep learning network with the optimal hyperparameters to obtain an optimized novel coronavirus classification model;
step 4.4, judging whether every part of the data has served as the verification set; if so, calculating the precision of the K optimized novel coronavirus classification models on the test set and taking the model with the highest precision as the optimal novel coronavirus classification model; if not, selecting as the verification set one part of the training set data that has not yet served as the verification set, taking the remaining K-1 parts as the training set, and repeating steps 4.2 and 4.3.
4. The novel coronavirus classification method based on the deep learning algorithm as claimed in claim 2, wherein the step 2.1 is to perform initial encoding of the virus sequence to obtain a digital sequence, and the implementation method is as follows:
the virus sequence downloaded from the database is a character sequence over the four base symbols A, T, G and C, and is converted into a digital sequence that the cascaded autoencoders can process; let the virus sequence set be $D = \{P_1, P_2, P_3, P_4, \ldots, P_n\}$, where $P_i \in \{A, T, G, C\}$ and $1 \le i \le n$; each character $P_i$ is encoded as T/C = 1, A/G = -1, and the preliminarily encoded numerical sequence is written $G = (D_1, D_2, D_3, \ldots, D_n)$, where $D_i$ is the discrete value representing $P_i$ and $1 \le i \le n$.
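The T/C = 1, A/G = -1 mapping of this claim is simple to express in code. The sketch below assumes the input contains only the four base symbols; ambiguity codes such as N are not handled and raise a `KeyError`. The names `BASE_CODE` and `encode_sequence` are illustrative.

```python
import numpy as np

# Pyrimidines (T, C) map to 1 and purines (A, G) to -1, as stated in the claim.
BASE_CODE = {"T": 1, "C": 1, "A": -1, "G": -1}

def encode_sequence(seq):
    """Convert a nucleotide string into a +1/-1 integer sequence."""
    return np.array([BASE_CODE[ch] for ch in seq.upper()], dtype=np.int8)

encoded = encode_sequence("ATGC")
```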
5. The method for classifying coronaviruses based on deep learning algorithm according to claim 2, wherein step 2.2 comprises the steps of:
step 2.2.1, dividing the digital signal $D_i(n)$ of each virus sequence into two groups according to the parity of $n$: $D_i(2r) = x_1(r)$, $D_i(2r+1) = x_2(r)$, the DFT being expressed in terms of the two groups, where $N$ is the sequence length;
step 2.2.2, further splitting each $N/2$-point virus subsequence into two $N/4$-point subsequences: $x_1(2s) = x_3(s)$, $x_1(2s+1) = x_4(s)$; $x_2(2s) = x_5(s)$, $x_2(2s+1) = x_6(s)$, the DFT being expressed in terms of these subsequences;
step 2.2.3, recursively repeating step 2.2.2 $M$ times until the virus sequence has been decomposed into 2-point DFT operations; the modulus of the fast Fourier transform of $D_i(n)$, $|F_i(k)|$, is denoted $H_i(k)$, where $0 \le k \le n-1$, and $H_i(k)$ is the magnitude.
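The even/odd splitting in this claim is the radix-2 decimation-in-time FFT. A minimal recursive sketch follows; it uses the standard butterfly $F(k) = X_1(k) + W_N^k X_2(k)$, $F(k + N/2) = X_1(k) - W_N^k X_2(k)$ with $W_N = e^{-2\pi j/N}$, and assumes the sequence length is a power of two (padding or truncation of real genomes is not addressed here). Function names are illustrative.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT for power-of-two lengths."""
    x = np.asarray(x, dtype=complex)
    n = x.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return x
    even = fft_radix2(x[0::2])   # x_1(r) = D(2r)
    odd = fft_radix2(x[1::2])    # x_2(r) = D(2r+1)
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

def magnitude_spectrum(signal):
    """H(k) = |F(k)|, the magnitude used as input to feature construction."""
    return np.abs(fft_radix2(signal))

H = magnitude_spectrum([-1, 1, -1, 1, 1, 1, -1, -1])
```

In practice `numpy.fft.fft` computes the same transform for arbitrary lengths; the recursion is shown only to mirror the steps of the claim.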
6. The method according to claim 2, wherein step 2.3, constructing feature vectors using the Mahalanobis distance according to the magnitude, is implemented as follows:
wherein $H_i$ and $H_j$ are the magnitudes of the $i$-th and $j$-th digital signals of the virus sequences; one-hot encoding is used for the virus sequence labels, so that after encoding each virus class corresponds to a label value $L = [L_1, L_2, L_3, \ldots, L_k]$, $L_i \in \{0, 1\}$, $i \in \{1, 2, \ldots, k\}$; specifically, if a virus sequence belongs to the $i$-th virus class, only the $i$-th position of its label $L$ is 1 and the remaining positions are 0, and all of the label data $L$ form a two-dimensional array.
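The distance formula of this claim is not reproduced in this text, so the following sketch only shows one plausible reading: pairwise Mahalanobis distances between magnitude spectra, with the covariance estimated from the spectra themselves and inverted by pseudo-inverse, plus the one-hot label array described above. Using row $i$ of the distance matrix as the feature vector of sequence $i$ is an assumption, and the function names are illustrative.

```python
import numpy as np

def one_hot(class_indices, k):
    """Return an (n, k) 0/1 array with a single 1 per row at the class index."""
    labels = np.zeros((len(class_indices), k), dtype=int)
    labels[np.arange(len(class_indices)), class_indices] = 1
    return labels

def mahalanobis_matrix(H):
    """Pairwise Mahalanobis distances between rows of H (n_sequences, n_freq)."""
    cov = np.atleast_2d(np.cov(H, rowvar=False))
    cov_inv = np.linalg.pinv(cov)                  # pseudo-inverse tolerates singular cov
    diff = H[:, None, :] - H[None, :, :]           # (n, n, n_freq) differences H_i - H_j
    d2 = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))            # clip tiny negatives from rounding

labels = one_hot([0, 2, 1], k=3)
```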
CN202110045563.2A 2021-01-13 2021-01-13 Novel coronavirus classification method based on deep learning algorithm Active CN112735604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110045563.2A CN112735604B (en) 2021-01-13 2021-01-13 Novel coronavirus classification method based on deep learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110045563.2A CN112735604B (en) 2021-01-13 2021-01-13 Novel coronavirus classification method based on deep learning algorithm

Publications (2)

Publication Number Publication Date
CN112735604A CN112735604A (en) 2021-04-30
CN112735604B true CN112735604B (en) 2024-03-26

Family

ID=75592903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110045563.2A Active CN112735604B (en) 2021-01-13 2021-01-13 Novel coronavirus classification method based on deep learning algorithm

Country Status (1)

Country Link
CN (1) CN112735604B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113252640B (en) * 2021-06-03 2021-12-14 季华实验室 Rapid virus screening and detecting method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665248A (en) * 2017-09-22 2018-02-06 齐鲁工业大学 File classification method and device based on deep learning mixed model
CN108171232A (en) * 2017-11-15 2018-06-15 中山大学 The sorting technique of bacillary and viral children Streptococcus based on deep learning algorithm
CN111785328A (en) * 2020-06-12 2020-10-16 中国人民解放军军事科学院军事医学研究院 Coronavirus sequence identification method based on gated cyclic unit neural network
CN111951975A (en) * 2020-08-19 2020-11-17 哈尔滨工业大学 Sepsis early warning method based on deep learning model GPT-2
AU2020102631A4 (en) * 2020-10-07 2020-11-26 A, Anbuchezian Dr The Severity Level and Early Prediction of Covid-19 Using CEDCNN Classifier

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140058987A1 (en) * 2012-08-27 2014-02-27 Almon David Ing MoRPE: a machine learning method for probabilistic classification based on monotonic regression of a polynomial expansion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
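Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study; Gurjit S. Randhawa, et al.; PLOS ONE; pp. 1-24 *
Wang Yutao et al. Python Big Data Analysis and Machine Learning: Business Case Practice (《Python大数据分析与机器学习商业案例实战》). China Machine Press, 2020, 1st ed., p. 138. *
Huang Xianglin et al. Principles and Practice of Image Retrieval (《图像检索原理与实践》). Communication University of China Press, 2014, 1st ed., p. 89. *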

Also Published As

Publication number Publication date
CN112735604A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Chen et al. Shallowing deep networks: Layer-wise pruning based on feature representations
CN114169330B (en) Chinese named entity recognition method integrating time sequence convolution and transform encoder
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
US20230052865A1 (en) Molecular graph representation learning method based on contrastive learning
CN112926303B (en) Malicious URL detection method based on BERT-BiGRU
CN109657947A (en) A kind of method for detecting abnormality towards enterprises ' industry classification
CN111785329A (en) Single-cell RNA sequencing clustering method based on confrontation automatic encoder
CN113535953B (en) Meta learning-based few-sample classification method
CN113571125A (en) Drug target interaction prediction method based on multilayer network and graph coding
CN109933682B (en) Image hash retrieval method and system based on combination of semantics and content information
Ma et al. MIDIA: exploring denoising autoencoders for missing data imputation
CN111582506A (en) Multi-label learning method based on global and local label relation
CN111259264B (en) Time sequence scoring prediction method based on generation countermeasure network
CN112735604B (en) Novel coronavirus classification method based on deep learning algorithm
Bhadoria et al. Bunch graph based dimensionality reduction using auto-encoder for character recognition
CN114138971A (en) Genetic algorithm-based maximum multi-label classification method
CN112085245A (en) Protein residue contact prediction method based on deep residual error neural network
CN115424663B (en) RNA modification site prediction method based on attention bidirectional expression model
CN113297385B (en) Multi-label text classification system and method based on improved GraphRNN
CN115795037B (en) Multi-label text classification method based on label perception
CN116628690A (en) SQL injection attack detection method and system
Kai et al. Molecular design method based on novel molecular representation and variational auto-encoder
CN116561004A (en) Code complexity analysis method based on transducer model
El Mhouti et al. A Machine Learning-Based Approach for Meteorological Big Data Analysis to Improve Weather Forecast

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant