CN112735604B - Novel coronavirus classification method based on deep learning algorithm - Google Patents
Novel coronavirus classification method based on deep learning algorithm
- Publication number
- CN112735604B (application CN202110045563.2A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- virus
- novel coronavirus
- data
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/80—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention provides a novel coronavirus classification method based on a deep learning algorithm, which solves the technical problem of low classification precision in the prior art. The implementation steps are: obtaining the existing available virus sequences and a novel coronavirus data set; preprocessing the virus sequence data set; performing feature extraction on the preprocessed high-dimensional redundant virus sequence features with three cascaded automatic encoders to achieve data dimension reduction and obtain nonlinear abstract features of the virus sequences; obtaining training set data and test set data; obtaining an optimal novel coronavirus sequence classification model; and predicting the labels of novel coronavirus data with the optimal novel coronavirus sequence classification model.
Description
Technical Field
The invention relates to the technical field of novel coronavirus classification, in particular to a novel coronavirus classification method based on a deep learning algorithm.
Background
The novel coronavirus is an enveloped RNA virus with a linear single-stranded positive-sense genome; because the population lacks immunity to the new strain, it is generally susceptible. Given the long latency of the novel coronavirus, there is an urgent need to elucidate and analyze viral genome sequences in order to better understand the new virus and formulate therapeutic regimens in time. Existing methods find sequence similarity by performing similarity comparisons on sequence data. However, such sequence alignment requires gene annotation and a reference database, and analyzing data with alignment software is almost infeasible when thousands of genome sequences must be analyzed simultaneously. Traditional machine learning methods struggle to extract the nonlinear abstract features of a virus sequence: they can extract only low-level features, which mainly describe local information of the virus sequence and cannot describe the full characteristics of the viral genome sequence, and they lack computational efficiency and prediction accuracy in the setting of viral genome big data.
Disclosure of Invention
The invention provides a deep learning-based method for classifying novel coronaviruses. The method can effectively mine potential value when analyzing and processing large-scale virus sequence data, resolves the difficulty that comparative-genomics methods face in classifying novel coronaviruses, learns the nonlinear characteristics of virus sequences layer by layer, and extracts more comprehensive and representative genome data features. It overcomes the inability of traditional machine learning methods to extract high-level (abstract) features, effectively improves the classification performance of the classifier, and realizes deep mining of the internal nonlinear association mechanism of viral genome sequences. The method comprises the following steps:
step 1, acquiring a novel coronavirus data set, and downloading all available virus sequences and a COVID-19 sequence from a virus host database and a GISAID platform;
step 2, preprocessing a virus sequence data set to obtain a feature vector;
step 3, using three cascade automatic encoders to perform feature extraction to map the feature vectors into low-dimensional high-level features as the input of a model;
step 4, training an optimal novel coronavirus classification model according to the input of the model;
step 5, predicting the label of the novel coronavirus data by using the optimal novel coronavirus classification model.
Further, the step 2 preprocesses the virus sequence data set to obtain feature vectors, implemented as follows:
step 2.1, performing preliminary character-sequence coding on the virus sequence to obtain a digital sequence;
step 2.2, performing a fast Fourier transform on the digital sequence to obtain its amplitude;
step 2.3, constructing feature vectors from the amplitudes using the Mahalanobis distance.
further, step 2.1 performs character sequence preliminary coding on the virus sequence to obtain a digital sequence, and the implementation method is as follows:
the virus sequence downloaded from the database is a character sequence, represented by four base symbols of A, T, G and C, and is converted into a digital sequence recognized by the cascade automatic encoder, and the virus sequence set is assumed to be represented as D= { P 1 ,P 2 ,P 3 ,P 4 ,…,P n }, wherein P i E { A, T, G, C }, 1.ltoreq.i.ltoreq.n, for each character sequence P i The code character T/c=1, a/g= -1, and the initially coded numerical sequence is noted as: g= (D 1 ,D 2 ,D 3 …D n ) Wherein D is i Is the sequence P i And the discrete value of (2) represents that i is more than or equal to 1 and n is more than or equal to n.
Further, step 2.2 performs a fast Fourier transform on the digital sequence to obtain its amplitude. Taking the number of points N = 2^M, the implementation steps are as follows:
Step 2.2.1, split the digital signal D_i(n) of each virus sequence into two groups by the parity of n: x_1(r) = D_i(2r), x_2(r) = D_i(2r+1), 0 ≤ r ≤ N/2 − 1; the DFT is then expressed as F_i(k) = X_1(k) + W_N^k · X_2(k) and F_i(k + N/2) = X_1(k) − W_N^k · X_2(k), where W_N = e^(−j2π/N);
Step 2.2.2, split each N/2-point virus subsequence into two N/4-point subsequences: x_1(2s) = x_3(s), x_1(2s+1) = x_4(s); x_2(2s) = x_5(s), x_2(2s+1) = x_6(s), 0 ≤ s ≤ N/4 − 1; the DFT is expressed in the same butterfly form over the shorter subsequences;
Step 2.2.3, recursively repeat step 2.2.2 until, after M levels of decomposition, only 2-point DFT operations on the virus sequence remain; the fast Fourier transform modulus of D_i(n) is |F_i(k)|, denoted H_i(k), where 0 ≤ k ≤ n − 1; H_i(k) is the amplitude;
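The recursive decimation-in-time decomposition described above can be sketched in a few lines of pure Python (a minimal radix-2 Cooley-Tukey implementation; function names are illustrative, and a production system would use a library FFT):

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # x1(r) = D(2r)
    odd = fft(x[1::2])    # x2(r) = D(2r+1)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # W_N^k * X2(k)
        out[k] = even[k] + w            # F(k)       = X1(k) + W_N^k X2(k)
        out[k + n // 2] = even[k] - w   # F(k + N/2) = X1(k) - W_N^k X2(k)
    return out

def amplitude(digital_seq):
    """H(k) = |F(k)|: the magnitude spectrum used as the raw sequence feature."""
    return [abs(v) for v in fft([complex(d) for d in digital_seq])]
```

For instance, `amplitude([1, -1, 1, -1])` gives `[0.0, 0.0, 4.0, 0.0]`, the expected single spike at the alternating frequency.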
Further, step 2.3 constructs feature vectors from the amplitudes using the Mahalanobis distance, implemented as d(H_i, H_j) = sqrt((H_i − H_j)^T Σ^(−1) (H_i − H_j)), where H_i and H_j are the amplitudes of the i-th and j-th digital signals of the virus sequences and Σ is their covariance matrix. For the virus sequence labels, one-hot coding is adopted: after coding, each virus type corresponds to a label value L = [L_1, L_2, L_3, ..., L_k]; if a virus sequence belongs to the i-th virus class, only the i-th position of its label L is 1 and all other positions are 0, and the full label data L is a two-dimensional array.
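The one-hot labeling scheme can be sketched as follows (a minimal pure-Python illustration; names are hypothetical):

```python
def one_hot_label(class_index: int, num_classes: int) -> list:
    """One-hot label L: 1 at the class position, 0 everywhere else."""
    label = [0] * num_classes
    label[class_index] = 1
    return label

def label_matrix(class_indices, num_classes):
    """Stack per-sequence labels into the two-dimensional label array L."""
    return [one_hot_label(i, num_classes) for i in class_indices]
```

For example, with four virus classes, a sequence of class 2 gets the label `[0, 0, 1, 0]`.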
Further, the step 3 uses three cascaded automatic encoders to perform feature extraction, mapping the feature vectors into low-dimensional high-level features used as the input of the model. The implementation steps are as follows:
step 3.1, to learn more robust virus sequence features, randomly corrupting a small portion of each feature vector to obtain a sample; this counteracts irrelevant information introduced during the character-sequence encoding of the previous stage, so that the designed automatic encoder better captures the essential characteristics of the virus sequence;
step 3.2, using L1+L2 norms as penalty items for model improvement, avoiding the excessive fitting of an algorithm to input data, and constructing a loss function;
step 3.3, training a first-stage automatic encoder to obtain a first-layer low-dimensional feature vector;
step 3.4, taking the output of the first-stage automatic encoder as the input of the next-stage encoder and training the second-stage automatic encoder; repeating this process until the third-stage automatic encoder is trained, which yields the abstract high-level feature expression of the virus sequence;
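Steps 3.1 through 3.4 can be sketched with the two primitives below: the random masking corruption and a single sigmoid encoder layer, three of which are stacked in the cascade (pure-Python illustration under assumed names; the patent's actual layer sizes and activation are not specified here):

```python
import math
import random

def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))

def corrupt(x, frac=0.1, rng=random):
    """Step 3.1's masking noise: overwrite a small random fraction of entries with +/-1."""
    y = list(x)
    for i in range(len(y)):
        if rng.random() < frac:
            y[i] = rng.choice([1, -1])
    return y

def encode(x, W, b):
    """One encoder layer y = s(Wx + b); three such layers in cascade give the
    stacked low-dimensional abstract features of steps 3.3-3.4."""
    return [sigmoid(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]
```

A full training loop would alternate `encode` with a mirrored decoder and gradient updates, layer by layer, as step 3.4 describes.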
further, the step 3.2 uses the l1+l2 norm as a penalty term for model improvement, avoids the algorithm from overfitting the input data, and constructs a loss function, which is implemented as follows:
x (i) as the original feature vector of the input, x 1 (i) And after reconstruction, the feature vector w is a weight, and lambda and rho are used for adjusting the weight of the penalty term.
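A minimal sketch of such an L1+L2-penalized reconstruction loss (the exact normalization in the patent may differ; names are illustrative):

```python
def ae_loss(x, x_rec, weights, lam=1e-4, rho=1e-4):
    """Squared reconstruction error plus L1 and L2 weight penalties,
    mirroring E = ||x - x'||^2 + lam*|W|_1 + rho*|W|_2^2."""
    err = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    l1 = sum(abs(w) for row in weights for w in row)
    l2 = sum(w * w for row in weights for w in row)
    return err + lam * l1 + rho * l2
```

Setting both penalty coefficients to zero recovers the plain reconstruction error, which makes the effect of λ and ρ easy to probe empirically.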
Further, the step 3.3 trains the first-stage automatic encoder to obtain the first-layer low-dimensional feature vectors, implemented as follows:
Suppose the n samples of virus digital sequences are X = {x_1, x_2, x_3, x_4, ..., x_n} = {x_i | 1 ≤ i ≤ n}, where each sample x_i ∈ R^M is an M-dimensional feature vector. A small fraction of the entries of each sample is randomly set to 1 or −1, giving the corrupted vector X̃ = {x̃_i | 1 ≤ i ≤ n}, which is input to the first-stage encoder. The encoding function f_θ performs the first-layer encoding of each virus sequence sample, giving the feature vector Y_1 = f_θ^(1)(X̃) = s(WX̃ + b); the decoder reconstructs the input vector as Z_1 = g_θ′^(1)(Y_1) = s(W′Y_1 + b′), where {W, b} are the encoding parameters and {W′, b′} are the decoding parameters. The encoding and decoding parameters are obtained by minimizing the loss function argmin_{θ,θ′} E with a gradient descent algorithm: W ← W − l·∂E/∂W, b ← b − l·∂E/∂b, where l is the learning rate; W′ and b′ are computed by the same method.
Further, the abstract high-level feature expression of the virus sequence output by the third-stage automatic encoder in step 3.4 is implemented as follows:
The final features of the virus sequences can be expressed as x′(i) = f_θ^(3)(f_θ^(2)(f_θ^(1)(x(i)))), where x′(i) is the feature vector output by the third-stage automatic encoder and f_θ is the encoding function.
Further, the step 4 trains an optimal novel coronavirus sequence classification model according to the input of the model, and comprises the following steps:
step 4.1, dividing training set data into K parts, wherein one part is used as a verification set, and the rest K-1 parts are used as training sets;
step 4.2, obtaining optimal super parameters by using Bayes optimization;
step 4.3, training by using a deep learning network according to the optimal super-parameters to obtain an optimized novel coronavirus classification model;
step 4.4, selecting one of the folds that has not yet served as the verification set as the new verification set, taking the remaining data as the training set, and repeating steps 4.2 and 4.3; once every fold has served as the verification set, computing the precision of each of the K optimized novel coronavirus classification models on the test set, and taking the model with the highest precision as the optimal novel coronavirus classification model.
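The K-fold rotation of steps 4.1 and 4.4 can be sketched as an index-partitioning helper (a minimal illustration; names are hypothetical, and the patent does not specify the fold-assignment rule):

```python
def kfold_indices(n_samples: int, k: int):
    """Split sample indices into k folds; each fold serves once as the
    verification set while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for v in range(k):
        val = folds[v]
        train = [i for f in range(k) if f != v for i in folds[f]]
        splits.append((train, val))
    return splits
```

Each of the k `(train, val)` pairs would then drive one Bayesian-tuned training run, and the k resulting models are compared on the test set.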
The invention provides a novel coronavirus classification method based on a deep learning algorithm, which can effectively improve classification accuracy and solve the problem of low classification accuracy of the novel coronavirus.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a block diagram of the feature extraction of a viral sequence according to the present invention;
FIG. 2 is a flowchart of an algorithm of an automatic encoder for viral sequences according to the present invention;
FIG. 3 is a flow chart of the optimization of the epoch parameters of the deep learning model of the present invention;
FIG. 4 is a flowchart of the optimization of the remaining parameters of the deep learning model of the present invention;
FIG. 5 is an overall flow chart of the present invention;
FIG. 6 is a diagram of a classification structure of a currently available viral sequence in a database of the present invention;
FIG. 7 is a diagram showing a structure of a deep learning model prediction virus sequence classification according to the present invention;
FIG. 8 is a specific structural design of the deep learning model network of the present invention;
FIG. 9 is a block diagram of hierarchical modeling classification prediction for a COVID-19 sequence in accordance with the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 5 is an overall flowchart of the present invention, and the present invention provides a novel coronavirus classification method based on a deep learning algorithm, which is characterized by comprising the following steps:
(1) Acquiring existing available viral sequences and a new coronavirus dataset:
(1a) All available virus sequences and COVID-19 sequences are downloaded from the virus-host database and the GISAID platform;
(2) Preprocessing a viral sequence dataset to obtain feature vectors
(2a) Carrying out character sequence coding on the virus sequence to obtain a digital sequence;
(2b) Performing fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence;
(2c) Constructing a feature vector by utilizing a mahalanobis distance according to the amplitude value;
(3) Feature extraction using three concatenated automatic encoders to map the feature vectors into low-dimensional, high-level features as input to a model
(4) Training an optimal novel coronavirus sequence classification model according to the input of the model
(4a) Dividing training set data into K parts, wherein one part is used as a verification set, and the other K-1 parts are used as training sets;
(4b) Obtaining optimal super parameters by using Bayesian optimization;
(4c) Training by using a deep learning network according to the optimal super parameters to obtain an optimized novel coronavirus classification model;
(4d) Selecting one of the data sets which are not divided into verification sets as the verification set, taking the rest data as the training set, repeating the step (4 a), and if all the data are divided into the verification sets, calculating the precision of the K-time optimized novel coronavirus classification model on the test set, wherein the model with the highest precision is used as the optimal novel coronavirus classification model;
(5) Predicting a signature of the new coronavirus data using the optimal new coronavirus classification model;
Step (2a) performs preliminary character-sequence coding on the virus sequence to obtain a digital sequence, implemented as follows:
The virus sequence downloaded from the database is a character sequence represented by the four base symbols A, T, G and C, and must be converted into a digital sequence that the cascaded automatic encoder can process. Suppose the virus sequence set is D = {P_1, P_2, P_3, P_4, ..., P_n}, where P_i ∈ {A, T, G, C}, 1 ≤ i ≤ n. Each character P_i is encoded as T/C = 1 and A/G = −1, and the preliminarily encoded digital sequence is written G = (D_1, D_2, D_3, ..., D_n), where D_i is the discrete-value representation of P_i, 1 ≤ i ≤ n;
Step (2b) performs a fast Fourier transform on the digital sequence to obtain its amplitude. The implementation principle is as follows:
For each digital signal D_i, the amplitude is solved using the fast Fourier transform (FFT). The fast Fourier transform modulus of D_i is |F_i(k)|, denoted H_i(k), where 0 ≤ k ≤ n − 1. Taking the number of points N = 2^M for the virus sequence digital signal D_i, where N is the length of the virus digital sequence, a radix-2 decimation-in-time FFT decomposes the DFT down to 2-point DFT operations, forming an M-level butterfly computation from D_i(n) to F_i(k). The specific implementation steps are:
Step 1, split the digital signal D_i(n) of each virus sequence into two groups by the parity of n: x_1(r) = D_i(2r), x_2(r) = D_i(2r+1), 0 ≤ r ≤ N/2 − 1; the DFT is then expressed as F_i(k) = X_1(k) + W_N^k · X_2(k) and F_i(k + N/2) = X_1(k) − W_N^k · X_2(k), where W_N = e^(−j2π/N);
Step 2, split each N/2-point virus subsequence into two N/4-point subsequences: x_1(2s) = x_3(s), x_1(2s+1) = x_4(s); x_2(2s) = x_5(s), x_2(2s+1) = x_6(s), 0 ≤ s ≤ N/4 − 1; the DFT is expressed in the same butterfly form over the shorter subsequences;
Step 3, recursively repeat step 2 a further M − 2 times until, after M levels of decomposition, only 2-point DFT operations on the virus sequence remain; the fast Fourier transform modulus of D_i(n) is |F_i(k)|, denoted H_i(k), where 0 ≤ k ≤ n − 1; H_i(k) is the amplitude. Thus, through the M-level operation, the amplitude spectrum representation F_i(k) of the virus sequence digital signal D_i(n) is obtained;
Step (2c) constructs feature vectors from the amplitudes using the Mahalanobis distance, implemented as d(H_i, H_j) = sqrt((H_i − H_j)^T Σ^(−1) (H_i − H_j)), where H_i and H_j are the amplitudes of the i-th and j-th digital signals of the virus sequences and Σ is their covariance matrix;
Step (3) performs feature extraction with three cascaded automatic encoders, mapping the feature vectors into low-dimensional high-level features used as the input of the model, implemented as follows:
(3a) To learn more robust virus sequence features, randomly corrupting a small portion of each feature vector; this counteracts irrelevant information introduced during the character-sequence encoding of the previous stage, so that the designed automatic encoder better captures the essential characteristics of the virus sequence;
(3b) The L1+L2 norms are used as penalty items for model improvement, so that the excessive fitting of an algorithm to input data is avoided, and a loss function is constructed;
(3c) Training a first-stage automatic encoder to obtain a first-layer low-dimensional feature vector;
(3d) Taking the output of the first-stage automatic encoder as the input of the next-stage encoder, continuously completing the training of the second-stage automatic encoder, and repeating the step until the training of the third-stage automatic encoder is completed, so as to obtain the high-level feature expression of virus sequence abstraction;
(3e) The loss function of step (3b) is constructed as E = (1/n) Σ_{i=1}^{n} ||x(i) − x′(i)||² + λ Σ|w| + ρ Σ w², where x(i) is the original input feature vector, x′(i) is the reconstructed feature vector, w ranges over the weights, and λ and ρ adjust the weights of the penalty terms;
Step (3c) trains the first-stage automatic encoder to obtain the first-layer low-dimensional feature vectors, implemented as follows:
Suppose the n samples of virus digital sequences are X = {x_1, x_2, x_3, x_4, ..., x_n} = {x_i | 1 ≤ i ≤ n}, where each sample x_i ∈ R^M is an M-dimensional feature vector. A small fraction of the entries of each sample is randomly set to 1 or −1, giving the corrupted vector X̃ = {x̃_i | 1 ≤ i ≤ n}, which is input to the first-stage encoder. The encoding function f_θ performs the first-layer encoding of each virus sequence sample, giving the feature vector Y_1 = f_θ^(1)(X̃) = s(WX̃ + b); the decoder reconstructs the input vector as Z_1 = g_θ′^(1)(Y_1) = s(W′Y_1 + b′), where {W, b} are the encoding parameters and {W′, b′} are the decoding parameters. The encoding and decoding parameters are obtained by minimizing the loss function argmin_{θ,θ′} E with a gradient descent algorithm: W ← W − l·∂E/∂W, b ← b − l·∂E/∂b, where l is the learning rate; W′ and b′ are computed by the same method;
The abstract high-level feature expression of the virus sequence output by the third-stage automatic encoder in step (3d) is implemented as follows:
The final features of the virus sequences can be expressed as x′(i) = f_θ^(3)(f_θ^(2)(f_θ^(1)(x(i)))), where x′(i) is the feature vector output by the third-stage automatic encoder and f_θ is the encoding function;
Specifically, step (3) may be summarized as follows:
Suppose the n samples of virus digital sequences are X = {x_1, x_2, x_3, x_4, ..., x_n} = {x_i | 1 ≤ i ≤ n}, where each sample x_i ∈ R^M is an M-dimensional feature vector. A small fraction of the entries of each sample is randomly set to 1 or −1, and the corrupted vector X̃ = {x̃_i | 1 ≤ i ≤ n} is input to the first-stage encoder. The encoding function f_θ performs the first-layer encoding of each virus sequence sample, giving the feature vector
Y_1 = f_θ^(1)(X̃) = s(WX̃ + b) (3-1)
and the decoder reconstructs the input vector as
Z_1 = g_θ′^(1)(Y_1) = s(W′Y_1 + b′) (3-2)
where {W, b} are the encoding parameters and {W′, b′} are the decoding parameters. A gradient descent algorithm minimizes the loss function argmin_{θ,θ′} E to obtain the encoding and decoding parameters, with update formulas (3-3) and (3-4):
W ← W − l·∂E/∂W (3-3)
b ← b − l·∂E/∂b (3-4)
where l is the learning rate; W′ and b′ are computed by the same method. After the first-layer network is trained, only the encoder part of the first stage is retained, and the abstract low-dimensional feature vector it outputs is taken as the input of the next-stage encoder. The second-layer network is trained in the same way, and so on until the third-stage automatic encoder is trained; its output is the abstract feature expression of the virus DNA sequence. Through three levels of gradual abstraction of the bottom-level features of the virus sequence, the abstract high-level feature expression is finally obtained. The virus sequence feature extraction block diagram is shown in FIG. 1, and the algorithm flow chart of the virus sequence automatic encoder is shown in FIG. 2.
Step (4) can be summarized as:
First, the feature-extracted virus digital sequence passes through a convolution layer for dimension reduction. As the convolution kernel works, it regularly scans the input features, performs element-wise multiplication and summation against them, and superimposes a bias, so the extracted features keep the inherent topology of the input. The convolution is formula (4-1):
y_ij^(l) = f((x^(l−1) * k^(l))_ij + b_j^(l)) (4-1)
where f(·) is an activation function, * is the convolution operation, y_ij^(l) is the element in row i and column j of the layer-l feature map, x^(0) represents the input virus digital sequence, k^(l) is a convolution kernel, and b_j^(l) is a bias term.
After feature extraction by the convolution layer, the output features are passed to the pooling layer for feature selection and information filtering. Maximum pooling is adopted, with the pooling formula (4-2):
p_ij^(l) = max_{(a,b) ∈ u(a,a)} y_ab^(l) (4-2)
where p_ij^(l) is the feature value in row i and column j of the layer-l max-pooling feature map, u(a, a) is a window function, and N is the size of the window.
A basic residual module containing two convolution layers is used in place of stacked convolution layers, alleviating the training difficulty caused by network depth, and a pooling layer is connected after each basic residual module to filter feature information and, to some extent, prevent overfitting. Each of the three residual modules in this step is defined as

y_{l+1} = f( y_l + F(y_l, w_l) )

where y_l and y_{l+1} are the input and output of the layer-l residual module, F(·) is the residual function, w_l are the parameters of the residual block, and f(·) is the activation function; F(y_l, w_l) is the residual map to be learned. A dropout layer is connected after the second max-pooling layer: during forward propagation the activation of a neuron stops working with probability p, avoiding overfitting. The process is computed as (4-3) and (4-4):

a_i^(4) = W_i^(4) ( C^(3) ⊙ k^(3) ) + b_i^(4)   (4-3)

k_i^(4) = f( a_i^(4) )   (4-4)

where W is a weight, b is a bias, k^(3) is the layer-3 output, k^(4) is the layer-4 output, and C^(3) denotes a randomly generated 0/1 vector produced by the Bernoulli function, each random variable obeying a Bernoulli distribution with parameter p. The best value of the dropout ratio p is determined by Bayesian hyperparameter tuning.
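The residual connection and the dropout masking of (4-3)/(4-4) can be sketched as follows. This is a hypothetical minimal version: the residual function F, the ReLU activation, and all numeric values are stand-ins, not the patent's trained layers:

```python
import random

def residual_block(y, F, f=lambda v: [max(0.0, e) for e in v]):
    """y_{l+1} = f(y_l + F(y_l)): add the learned residual map back onto
    the input, then apply the activation f (ReLU here)."""
    r = F(y)
    return f([a + b for a, b in zip(y, r)])

def dropout(k, p, training=True):
    """Eqs. (4-3)/(4-4): multiply activations by a Bernoulli 0/1 mask C,
    silencing each neuron with probability p during training."""
    if not training:
        return list(k)
    mask = [0.0 if random.random() < p else 1.0 for _ in k]
    return [c * v for c, v in zip(mask, k)]

h = residual_block([1.0, -2.0, 3.0], F=lambda y: [0.1 * v for v in y])
d = dropout(h, p=0.5)
```

Note that at inference time (`training=False`) dropout is a no-op; only training passes apply the random mask.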
After global average pooling, the extracted features are nonlinearly combined by a fully connected layer to obtain the output, expressed as (4-5):

y_n = W_{N×C} ⊙ G_c + b   (4-5)

where ⊙ denotes matrix multiplication, W_{N×C} is the weight, b is the bias, G_c is the feature vector of the virus sequence after global average pooling, and N is the number of virus-sequence classes at each level of the hierarchical modeling.
The numbers of neurons in the fully connected layer when the virus sequences are hierarchically modeled are 13, 12, 4, and 4, i.e., the total number of virus-sequence categories at the corresponding level. The patent trains the model with the cross-entropy function as the loss, given by (4-6):

L = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log(p_ik)   (4-6)

where y_ik is the true label of sample i for the k-th virus-sequence class in the label set, p_ik is the model's predicted probability for that class, N is the total number of virus-sequence samples, and K is the total number of classes. With this loss function, the model updates its parameters by gradient descent: each layer first computes its output by feedforward, the error is then back-propagated, and the model descends along the error-gradient direction until the optimal solution is found.
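The softmax-plus-cross-entropy computation of (4-6) can be sketched directly. A minimal example with made-up logits and two toy samples; the class counts here are illustrative, not the 13/12/4/4 of the actual model:

```python
import math

def softmax(logits):
    """Convert raw scores to class probabilities p_ik (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_true, p_pred):
    """Eq. (4-6): L = -(1/N) sum_i sum_k y_ik * log(p_ik)
    over N samples and K classes, with one-hot y_true."""
    n = len(y_true)
    return -sum(y * math.log(p)
                for row_y, row_p in zip(y_true, p_pred)
                for y, p in zip(row_y, row_p)) / n

labels = [[0, 1, 0], [1, 0, 0]]                     # one-hot virus-class labels
probs = [softmax([0.2, 2.0, -1.0]), softmax([1.5, 0.1, 0.1])]
loss = cross_entropy(labels, probs)
```

Since each label row is one-hot, only the log-probability of the true class contributes to the sum, which is why minimizing this loss pushes p_ik toward 1 for the correct class.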
The deep learning model design is shown in fig. 7, and the network specific structure design is shown in fig. 8.
When the model is tuned, the number of epochs is determined by early stopping: if the model's val-loss does not decrease for 30 consecutive evaluation rounds, training stops and that epoch value is output as the model's optimal setting; the epoch optimization scheme is shown in FIG. 3. The optimization schemes for parameters other than epoch are shown in FIG. 4. Three-fold cross validation is used to evaluate the model's predictive performance under different parameter choices, and the mean of the AUC values over the three folds serves as the evaluation index defining the objective function, ensuring the reliability of the model's parameter selection. The tuning range of the network parameters can be shown in the following table:
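The early-stopping rule described above (stop when val-loss fails to improve for 30 rounds, report the best epoch) can be sketched as a simple loop. The synthetic loss curve below is made up purely to exercise the logic:

```python
def train_with_early_stopping(val_losses, patience=30):
    """Scan per-epoch validation losses; stop once `patience` consecutive
    epochs bring no improvement, and return the best epoch found."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch

# Synthetic curve: improves for 50 epochs, then plateaus at a worse value.
losses = [1.0 - 0.01 * e for e in range(50)] + [0.6] * 40
best_epoch = train_with_early_stopping(losses, patience=30)
```

With this curve, training halts 30 epochs into the plateau and epoch 50 is reported as the optimal setting.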
The predictive capability of the deep learning model under different parameter choices is compared by K-fold cross validation. The sample data are shuffled before model training to eliminate bias the sample ordering might introduce. To ensure that the entire data set is used for training, the training data are divided into K parts; each time a different K−1 parts are selected for training and the remaining 1 part for testing, repeated K times, and the average of the K model-evaluation indices is compared to select the optimal model. Accuracy (acc) and precision are used for evaluation; the closer these values are to 1, the better the effect.
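The shuffle-and-split procedure can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for training and scoring the deep model on one fold:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices to remove ordering bias, then deal them
    into k folds; each fold serves once as the held-out set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k, evaluate):
    """Train on k-1 folds, test on the remaining one, k times;
    return the mean of the k evaluation scores."""
    folds = k_fold_indices(n, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(evaluate(train, test))
    return sum(scores) / k

# Stand-in metric: just reports the training fraction, to show the plumbing.
mean_acc = cross_validate(30, 3, evaluate=lambda tr, te: len(tr) / 30)
```

With K = 3 (the three-fold setting used here), each run trains on two thirds of the data and validates on the remaining third.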
When classifying the COVID-19 sequence, since each virus sequence in the training data carries a set of taxonomic names, the classification structure of the virus sequences available in the database is shown, from high rank to low, in FIG. 6, and the search proceeds from high rank to low according to the taxonomic levels. The first range is 11 virus families plus the Riboviria realm; classifying the COVID-19 sequence with the optimized deep learning model determines that it belongs to Riboviria. The second range is the 12 families under Riboviria; using the optimized deep learning model to classify the COVID-19 sequence identifies it as Coronaviridae. The third range is the four genera under Coronaviridae; classification prediction with the optimized deep learning model determines the COVID-19 sequence to belong to the genus Betacoronavirus. The fourth range is the four subgenera under Betacoronavirus; classification prediction with the optimized deep learning model determines it to be lineage B (Sarbecovirus). Thus the classification label of the COVID-19 sequence is finally determined to be lineage B (Sarbecovirus) of the genus Betacoronavirus, and the hierarchical classification prediction structure for the COVID-19 sequence is shown in FIG. 9.
Let the selected optimal model be denoted f(x); for the COVID-19 sequence x, this model outputs its class label: COVID-19_label = f(x).
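The level-by-level search above amounts to a cascade of classifiers, one per taxonomic rank. A minimal sketch in plain Python; the lambda "models" are hypothetical stand-ins that simply return the labels from this example, whereas the real system uses the trained deep networks at each level:

```python
def classify_hierarchically(sequence, level_models):
    """Walk the taxonomy top-down: apply one classifier per rank and
    collect the predicted label at each level into a path."""
    path = []
    for model in level_models:
        path.append(model(sequence))
    return path

# Stand-in classifiers for realm -> family -> genus -> subgenus.
level_models = [
    lambda s: "Riboviria",
    lambda s: "Coronaviridae",
    lambda s: "Betacoronavirus",
    lambda s: "Sarbecovirus",
]
covid19_label = classify_hierarchically("ATGC...", level_models)
```

In the full system the label predicted at one rank restricts which classes (and hence which trained model) are considered at the next, narrower rank.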
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.
Claims (6)
1. The novel coronavirus classification method based on the deep learning algorithm is characterized by comprising the following steps of:
step 1, acquiring a novel coronavirus data set, wherein the data set is all available virus sequences and COVID-19 sequences downloaded from a virus host database and a GISAID platform;
step 2, preprocessing a virus sequence data set to obtain a feature vector;
step 3, using three cascade automatic encoders to perform feature extraction to map the feature vectors into low-dimensional high-level features as the input of a model;
the step 3 uses three cascade automatic encoders to perform feature extraction to map the feature vectors into low-dimensional high-level features, and uses the low-dimensional high-level features as the input of a model, and the implementation steps are as follows:
step 3.1, randomly destroying one tenth of data of the feature vector to obtain a sample for obtaining virus sequence features;
step 3.2, using the L1+L2 norms as penalty terms for model improvement, avoiding overfitting of the algorithm to the input data, and constructing a loss function;
the step 3.2 constructs a loss function, which is implemented as:
the loss function is

L(θ) = (1/n) Σ_{i=1}^{n} ||x^(i) − x̂^(i)||² + λ||w||₁ + ρ||w||₂²

where x^(i) is the original input feature vector, x̂^(i) is the reconstructed feature vector, w is a weight, and λ and ρ adjust the weights of the penalty terms;
step 3.3, training a first-stage automatic encoder to obtain a first-layer low-dimensional feature vector;
step 3.3 trains the first-stage automatic encoder to obtain a first-layer low-dimensional feature vector, which is realized by:
let the n samples of the viral digital sequence be X = {x_1, x_2, x_3, x_4, …, x_n} = {x_i | 1 ≤ i ≤ n}, each sample x_i ∈ R^M being an M-dimensional feature vector; for each sample, one tenth of the entries are randomly selected and set to 1 or −1, and the corrupted vector is further denoted X̃; X̃ is input to the first-stage encoder, whose encoding function Y_1 = f_θ^(1)(X̃) = s(W X̃ + b) gives the first-layer encoding of each virus-sequence sample as the feature vector Y_1; the decoder reconstructs the input vector as Z_1 = g_θ′^(1)(Y_1) = s(W′Y_1 + b′), where {W, b} are the encoding parameters and {W′, b′} are the decoding parameters; the encoding and decoding parameters are obtained by minimizing the loss function argmin_{θ,θ′} E with a gradient descent algorithm:

W ← W − l · ∂E/∂W,  b ← b − l · ∂E/∂b

where l is the learning rate, and W′ and b′ are computed in the same way;
step 3.4, taking the output of the first-stage automatic encoder as the input of the next-stage encoder, continuously completing the training of the second-stage automatic encoder, and repeating the step until the training of the third-stage automatic encoder is completed, so as to obtain the high-level characteristic expression of virus sequence abstraction;
the abstract high-level characteristic expression of the virus sequence output by the third-level automatic encoder in the step 3.4 is realized as follows:
the final characteristics of the obtained viral sequences are expressed as: representing the feature vector, f, of the output of the third-stage automatic encoder θ Is a coding function;
step 4, training an optimal novel coronavirus sequence classification model according to the input of the model;
step 5, predicting the label of the novel coronavirus data by using the optimal novel coronavirus sequence classification model.
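The corruption, penalized loss, and three-stage cascade of claim 1 can be sketched in plain Python. This is a deliberately simplified toy: the scalar weight `w`, the sigmoid choice, and the λ/ρ values are hypothetical stand-ins (a real stage uses full weight matrices learned by gradient descent):

```python
import math
import random

def corrupt(x, fraction=0.1, rng=random.Random(0)):
    """Step 3.1: randomly destroy one tenth of each sample's entries
    (set them to +1/-1) so the encoder learns noise-robust features."""
    x = list(x)
    for i in rng.sample(range(len(x)), max(1, int(fraction * len(x)))):
        x[i] = rng.choice([1.0, -1.0])
    return x

def penalized_loss(x, x_hat, w, lam=1e-3, rho=1e-3):
    """Step 3.2: reconstruction error plus L1 and L2 norms of the weights."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in w)
    l2 = sum(v * v for v in w)
    return mse + lam * l1 + rho * l2

def encode(x, w, b):
    """One encoding stage y = s(w*x + b), element-wise sigmoid s."""
    return [1.0 / (1.0 + math.exp(-(w * v + b))) for v in x]

sample = [1.0, -1.0] * 10          # toy encoded viral sequence sample
noisy = corrupt(sample)            # step 3.1
y1 = encode(noisy, w=0.5, b=0.0)   # step 3.3: first-stage features
y2 = encode(y1, w=0.5, b=0.0)      # step 3.4: second stage takes y1 as input
y3 = encode(y2, w=0.5, b=0.0)      # third stage: abstract high-level features
loss = penalized_loss(sample, noisy, w=[0.5])
```

The cascade in step 3.4 is just function composition: each stage's output becomes the next stage's input, ending in the low-dimensional high-level feature vector fed to the classifier.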
2. The method for classifying coronaviruses based on the deep learning algorithm according to claim 1, wherein the step 2 of preprocessing the viral sequence data set to obtain feature vectors comprises the following implementation steps:
step 2.1, carrying out character sequence coding on the virus sequence to obtain a digital sequence;
step 2.2, performing fast Fourier transform on the digital sequence to obtain the amplitude of the digital sequence;
and 2.3, constructing a feature vector by using the Mahalanobis distance according to the amplitude.
3. The method for classifying coronaviruses based on deep learning algorithm according to claim 1, wherein said step 4 is to train an optimal novel coronavirus sequence classification model according to the input of said model, and comprises the steps of:
step 4.1, dividing training set data into K parts, wherein one part is used as a verification set, and the rest K-1 parts are used as training sets;
step 4.2, obtaining optimal super parameters by using Bayes optimization;
step 4.3, training by using a deep learning network according to the optimal super-parameters to obtain an optimized novel coronavirus classification model;
and 4.4, judging whether all data are divided into verification sets, if so, calculating the precision of the K-time optimized novel coronavirus classification model on the test set, taking the model with the highest precision as the optimal novel coronavirus classification model, and if not, re-selecting one of the training set data which is not divided into the verification sets as the verification set, taking the rest K-1 as the training set, and repeating the steps 4.2 and 4.3.
4. The novel coronavirus classification method based on the deep learning algorithm as claimed in claim 2, wherein the step 2.1 is to perform initial encoding of the virus sequence to obtain a digital sequence, and the implementation method is as follows:
the virus sequence downloaded from the database is a character sequence, represented by the four base symbols A, T, G and C, and is converted into a digital sequence recognizable by the cascade automatic encoders; assume the virus sequence set is denoted D = {P_1, P_2, P_3, P_4, …, P_n}, where each character of P_i ∈ {A, T, G, C}, 1 ≤ i ≤ n; each character sequence P_i is encoded as T/C = 1, A/G = −1, and the preliminarily encoded numerical sequences are recorded as G = (D_1, D_2, D_3, …, D_n), where D_i is the discrete-value representation of sequence P_i, 1 ≤ i ≤ n.
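The character-to-number mapping of claim 4 is a one-line lookup. A minimal sketch (the example sequence is made up):

```python
def encode_sequence(seq):
    """Map each base to a discrete value: T/C -> 1, A/G -> -1 (claim 4)."""
    table = {"T": 1, "C": 1, "A": -1, "G": -1}
    return [table[ch] for ch in seq]

digits = encode_sequence("ATGCCT")
```

This purine/pyrimidine-style split yields a ±1 signal that the subsequent fast Fourier transform of claim 5 can treat as a discrete waveform.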
5. The method for classifying coronaviruses based on deep learning algorithm according to claim 2, wherein step 2.2 comprises the steps of:
step 2.2.1, for the digital signal D_i(n) of each viral sequence, split it into two groups by the parity of the index n: D_i(2r) = x_1(r), D_i(2r+1) = x_2(r), r = 0, 1, …, N/2 − 1; the DFT is expressed as F_i(k) = X_1(k) + W_N^k X_2(k), where W_N = e^{−j2π/N} and N is the sequence length;

step 2.2.2, split each N/2-point viral subsequence into two N/4-point subsequences: x_1(2s) = x_3(s), x_1(2s+1) = x_4(s); x_2(2s) = x_5(s), x_2(2s+1) = x_6(s), s = 0, 1, …, N/4 − 1; their DFTs are expressed in the same butterfly form;

step 2.2.3, recursively repeating step 2.2.2 M times yields 2-point DFT operations; after M decompositions, the fast Fourier transform of D_i(n) is expressed as |F_i(k)|, recorded as H_i(k), where 0 ≤ k ≤ n − 1, and H_i(k) is the amplitude.
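The parity split and recursive recombination of steps 2.2.1–2.2.3 are the standard radix-2 decimation-in-time FFT. A compact recursive sketch using only the standard library (the input signal is a made-up ±1 sequence of length 2^M):

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT: split by index parity, recurse to
    trivial DFTs, then combine with twiddle factors W_N^k = e^{-j2*pi*k/N}."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # D_i(2r)   = x1(r)
    odd = fft(x[1::2])    # D_i(2r+1) = x2(r)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w            # F(k)       = X1(k) + W^k X2(k)
        out[k + n // 2] = even[k] - w   # F(k + N/2) = X1(k) - W^k X2(k)
    return out

signal = [1, -1, 1, 1, -1, -1, 1, -1]        # encoded viral sequence, length 2^M
amplitude = [abs(v) for v in fft(signal)]    # H_i(k) = |F_i(k)|
```

The magnitude list `amplitude` corresponds to H_i(k); note the recursion requires the sequence length to be a power of two, so shorter sequences would need padding.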
6. The method according to claim 2, wherein the step 2.3 of constructing feature vectors from the amplitudes using the Mahalanobis distance is implemented as follows:
the Mahalanobis distance between amplitude vectors is d(H_i, H_j) = sqrt( (H_i − H_j)^T S^{−1} (H_i − H_j) ), where H_i and H_j are the amplitudes of the i-th and j-th digital signals of the viral sequences and S is the covariance matrix; one-hot encoding is adopted for the virus-sequence labels, and after encoding each virus type corresponds to a label value L = [L_1, L_2, L_3, …, L_k], L_i ∈ {0, 1}, i = {1, 2, …, k}; specifically, if a virus sequence belongs to the i-th virus, only the i-th position of its label L is 1 and the remaining positions are 0, and all label data L form a two-dimensional array.
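The distance and label encoding of claim 6 can be sketched as follows. For brevity the sketch assumes a diagonal covariance matrix (a hypothetical simplification; the general Mahalanobis form uses the full inverse covariance), and all numeric values are illustrative:

```python
def one_hot(virus_index, k):
    """Label encoding from claim 6: position i is 1 for the i-th virus class,
    all other positions are 0."""
    return [1 if j == virus_index else 0 for j in range(k)]

def mahalanobis_diag(h_i, h_j, variances):
    """Mahalanobis distance between two amplitude vectors H_i, H_j under a
    diagonal covariance: sqrt(sum (a-b)^2 / var)."""
    return sum((a - b) ** 2 / v for a, b, v in zip(h_i, h_j, variances)) ** 0.5

label = one_hot(2, 4)                                   # third of four classes
dist = mahalanobis_diag([1.0, 2.0], [2.0, 4.0], variances=[1.0, 4.0])
```

Dividing by the per-dimension variance means high-variance frequency bins contribute less to the distance than they would under plain Euclidean distance.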
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110045563.2A CN112735604B (en) | 2021-01-13 | 2021-01-13 | Novel coronavirus classification method based on deep learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112735604A CN112735604A (en) | 2021-04-30 |
CN112735604B true CN112735604B (en) | 2024-03-26 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113252640B (en) * | 2021-06-03 | 2021-12-14 | 季华实验室 | Rapid virus screening and detecting method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107665248A (en) * | 2017-09-22 | 2018-02-06 | 齐鲁工业大学 | File classification method and device based on deep learning mixed model |
CN108171232A (en) * | 2017-11-15 | 2018-06-15 | 中山大学 | The sorting technique of bacillary and viral children Streptococcus based on deep learning algorithm |
CN111785328A (en) * | 2020-06-12 | 2020-10-16 | 中国人民解放军军事科学院军事医学研究院 | Coronavirus sequence identification method based on gated cyclic unit neural network |
CN111951975A (en) * | 2020-08-19 | 2020-11-17 | 哈尔滨工业大学 | Sepsis early warning method based on deep learning model GPT-2 |
AU2020102631A4 (en) * | 2020-10-07 | 2020-11-26 | A, Anbuchezian Dr | The Severity Level and Early Prediction of Covid-19 Using CEDCNN Classifier |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140058987A1 (en) * | 2012-08-27 | 2014-02-27 | Almon David Ing | MoRPE: a machine learning method for probabilistic classification based on monotonic regression of a polynomial expansion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||