CN113764043B - Vesicle transport protein identification method and identification equipment based on position specificity scoring matrix

Publication number
CN113764043B
Authority
CN
China
Prior art keywords
data file
sequence data
protein sequence
protein
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111063261.4A
Other languages
Chinese (zh)
Other versions
CN113764043A (en)
Inventor
赵玉茗
汪国华
宫越
邹权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Forestry University
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Northeast Forestry University
Yangtze River Delta Research Institute of UESTC Huzhou
Application filed by Northeast Forestry University, Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111063261.4A
Publication of CN113764043A
Application granted
Publication of CN113764043B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Abstract

The invention relates to a vesicle transport protein identification method and identification equipment based on a position specificity scoring matrix. The invention aims to solve the low efficiency and high cost of existing vesicle transport protein identification methods. The process is as follows: S1, acquiring a protein sequence data file; S2, generating a position specificity score matrix based on S1, and extracting feature vectors from the position specificity score matrix by adopting the AATP algorithm; S3, obtaining processed feature vectors by using an imbalance processing algorithm; S4, obtaining a feature vector set by adopting the MRMD algorithm; S5, adopting XGBoost as the classifier, and carrying out hyper-parameter optimization; S6, obtaining a trained classifier model; and S7, inputting the data set to be identified into the trained classifier model to obtain a classification result, completing the identification of the vesicle transport protein. The invention is used in the field of protein recognition.

Description

Vesicle transport protein identification method and identification equipment based on position specificity scoring matrix
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a vesicle transport protein identification method and identification equipment.
Background
In recent years, research on vesicle transporters has received increasing attention. Vesicle transporters take on the task of moving macromolecules and particles that cannot cross the cell membrane on their own. To date, many studies have shown that aberrant vesicle transporters may cause a variety of diseases that severely compromise human health, such as Hermansky-Pudlak syndrome. In view of the importance of vesicle transporters in eukaryotic cells, researchers in cell biology have developed experimental techniques, such as morpholino knockdown and dissection, that identify vesicle transporters with excellent results. These techniques identify vesicle transporters accurately, but they are often inefficient and expensive, so a time-saving, high-accuracy method for identifying vesicle transporters is needed.
Disclosure of Invention
The invention aims to solve the problems of low efficiency and high cost of the existing vesicle transporter identification method, and provides a vesicle transporter identification method and identification equipment based on a position specificity score matrix.
The vesicle transport protein identification method based on the position specificity score matrix comprises the following specific process:
S1, acquiring a protein sequence data file;
S2, generating a position specificity score matrix based on the protein sequence data file acquired in S1, and extracting feature vectors from the position specificity score matrix by adopting the AATP algorithm;
S3, processing the feature vectors extracted in S2 by using an imbalance processing algorithm to obtain processed feature vectors;
S4, performing feature selection on the processed feature vectors obtained in S3 by adopting the MRMD algorithm to obtain a feature vector set in which features correlate strongly with the categories and redundancy among features is low;
S5, adopting XGBoost as the classifier, and carrying out hyper-parameter optimization;
S6, inputting the feature vector set obtained in S4 into the classifier for classification training to obtain a trained classifier model;
S7, inputting the data set to be identified into the trained classifier model to obtain a classification result, completing the identification of the vesicle transport protein.
The vesicle transporter identification device based on the position specificity score matrix comprises a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to realize the vesicle transporter identification method based on the position specificity score matrix.
The invention has the beneficial effects that:
(1) The invention provides a new vesicle transport protein identification method that identifies vesicle transport proteins accurately from features extracted from the position specificity score matrix, and provides a theoretical basis for corresponding drug development.
(2) The invention applies several imbalance processing algorithms to reduce the imbalance of the data, compares them, and selects the best-performing one. MRMD is then used to reduce the feature dimension, which effectively improves the identification performance of the model.
(3) XGBoost is used as the learner and its hyper-parameters are optimized, which improves the efficiency with which the model processes vesicle transport proteins and reduces the identification cost.
Drawings
FIG. 1 is a flowchart of the vesicle transport protein identification method based on the position specificity score matrix provided in an embodiment of the present invention;
FIG. 2 is a schematic diagram of the recognition effects of different feature extraction methods according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the recognition effects of different imbalance processing methods according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the recognition effects of different parameters in the dimension reduction algorithm according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the recognition effects of different learners according to an embodiment of the present invention.
Detailed Description
The first embodiment: the vesicle transport protein identification method based on the position specificity score matrix comprises the following specific process:
S1, acquiring a protein sequence data file;
S2, generating a position specificity score matrix based on the protein sequence data file acquired in S1, and extracting feature vectors from the position specificity score matrix by adopting the AATP algorithm;
S3, processing the feature vectors extracted in S2 by using an imbalance processing algorithm to obtain processed feature vectors;
the data acquired in S1 contain both vesicle transport proteins and non-vesicle transport proteins; the purpose of the imbalance processing algorithm is to balance the sizes of these two classes by deleting data from the larger class;
an imbalance processing algorithm is used to reduce the imbalance of the feature vector data extracted in S2 (data imbalance arises when all data are divided into two classes, vesicle transport proteins and everything else, and the two classes differ greatly in number; for example, there may be only 2000 vesicle transport proteins but more than 7000 non-vesicle transport proteins; such imbalanced classes need to be processed);
S4, performing feature selection on the processed feature vectors obtained in S3 by adopting the MRMD algorithm to obtain a feature vector set in which features correlate strongly with the categories and redundancy among features is low;
S5, adopting XGBoost as the classifier, and carrying out hyper-parameter optimization;
S6, inputting the feature vector set obtained in S4 into the classifier for classification training to obtain a trained classifier model;
S7, inputting the data set to be identified into the trained classifier model to obtain a classification result, completing the identification of the vesicle transport protein.
The second embodiment: this embodiment differs from the first embodiment in how the protein sequence data file is acquired in S1; the specific process is as follows:
a protein sequence data file is acquired from known websites such as the UniProt database and the Gene Ontology website (the UniProt database is a website dedicated to protein-related data); the protein sequence data file comprises a positive example data set and a negative example data set;
the positive example data set is a sequence data file of vesicle transport protein, and the negative example data set is a sequence data file of non-vesicle transport protein.
Other steps and parameters are the same as those in the first embodiment.
The third embodiment: this embodiment differs from the first or second embodiment in that, in S2, a position specificity score matrix is generated based on the protein sequence data file acquired in S1, and the AATP algorithm is used to extract feature vectors from the position specificity score matrix; the specific process is as follows:
S21, before the position specificity score matrix is generated in S2, the format and content of the protein sequence data file acquired in S1 are checked for errors, because a wrongly formatted file would affect the subsequent steps; a correct protein sequence data file is thereby obtained; the specific process is as follows:
S211, checking the format of the protein sequence data file acquired in S1 for errors to obtain a protein sequence data file with a correct format;
S212, checking the content of the correctly formatted protein sequence data file obtained in S211 for errors to obtain a protein sequence data file with correct format and content;
S22, using the PSI-BLAST program to align the correct protein sequence data file obtained in S21 against NCBI's non-redundant database to obtain a position specificity score matrix;
The position specificity score matrix contains important evolutionary information about the protein, and extracting features from this matrix can effectively improve the performance of the vesicle transport protein identification model.
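Step S22 can be sketched as a PSI-BLAST command line. The snippet below only builds the argument list for the BLAST+ `psiblast` executable; the file names, the database name `nr`, the iteration count and the E-value threshold are illustrative assumptions, since the text does not give run settings.

```python
from pathlib import Path


def build_psiblast_command(query_fasta, pssm_out, db="nr",
                           num_iterations=3, evalue=0.001):
    """Return the argument list for one PSI-BLAST run that writes an ASCII PSSM.

    db, num_iterations and evalue are assumed values; the text only states
    that PSI-BLAST is run against NCBI's non-redundant database.
    """
    return [
        "psiblast",
        "-query", str(Path(query_fasta)),
        "-db", db,
        "-num_iterations", str(num_iterations),
        "-evalue", str(evalue),
        "-out_ascii_pssm", str(Path(pssm_out)),
    ]


# Hypothetical file names for a single-sequence query.
cmd = build_psiblast_command("Q20300.fasta", "Q20300.pssm")
print(" ".join(cmd))
# To actually run it (requires BLAST+ and a local copy of the database):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command separately from running it keeps the sketch usable on machines without BLAST+ installed.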
Feature vectors are extracted from the position specificity score matrix using the feature extraction algorithm AATP.
The AATP feature extraction algorithm consists of two parts, AAC and TPC. AAC is a 20-dimensional feature vector that represents the average score of each amino acid changing to other types of amino acids during the evolution of a protein. TPC is a 400-dimensional feature obtained from a transition probability matrix, and can effectively avoid the loss of information in the sequence.
The AATP algorithm effectively extracts the most important information from the position specificity score matrix, further improving the efficiency and performance of vesicle transport protein identification.
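As a rough illustration of how a 20 + 400 = 420-dimensional AATP vector can be assembled, the sketch below takes an L x 20 matrix as a list of rows. The text does not give the formulas, so the column-mean form of AAC and the row-normalised product form of TPC used here are assumptions based on one common formulation, and the toy matrix is made up purely to exercise the code.

```python
def aatp_features(pssm):
    """Build a 420-dimensional AATP vector from an L x 20 PSSM (list of rows)."""
    length = len(pssm)
    # AAC: mean of each of the 20 PSSM columns over all sequence positions.
    aac = [sum(row[j] for row in pssm) / length for j in range(20)]

    # TPC: sum over consecutive positions k, k+1 of P[k][i] * P[k+1][j],
    # normalised per source amino acid i (assumed formulation).
    raw = [[sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1))
            for j in range(20)] for i in range(20)]
    tpc = []
    for i in range(20):
        row_total = sum(raw[i])
        for j in range(20):
            # Guard against a zero denominator, which raw scores can produce.
            tpc.append(raw[i][j] / row_total if row_total != 0 else 0.0)
    return aac + tpc


# Toy 3 x 20 matrix with positive entries, for illustration only.
toy_pssm = [[(r + c) % 5 + 1 for c in range(20)] for r in range(3)]
vec = aatp_features(toy_pssm)
print(len(vec))  # 420
```

Real PSSM scores can be negative, so a production version would need to decide how to scale the matrix before computing TPC.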
Other steps and parameters are the same as those in the first or second embodiment.
The fourth embodiment: this embodiment differs from one of the first to third embodiments in that, in S211, the format of the protein sequence data file acquired in S1 is checked for errors to obtain a protein sequence data file with a correct format; the specific process is as follows:
when a line of the protein sequence data file acquired in S1 does not begin with the character ">", this line of non-conforming data is deleted;
when a line of the protein sequence data file acquired in S1 begins with the character ">", the data following ">" on this line include information such as the identification number and position of the sequence, and the next line is the sequence text of this protein; a protein sequence data file in the correct format is thus obtained;
A protein sequence data file has many lines, for example:
>Q20300
MMDQILGTNFTYEGAKEVARGLEGFSAKLAVGYIATIFGLKYYMKDRK
>D3ZGS3
MEPRLPIGAQPLACLHMVAGLEMKGPLREPCVLTLARRNGQYELIIQLI
>A2AUC9
MDSQRELAEELRLYQSTLLQDGLKDLLEEKKFIDCTLKAGDKSFPCHRLI
>O18037
MEAANEVVNLFASQATTPSSLDAVTTLETVSTPTFIFPEVSDSQILQLMI
>H2E7T7
MALDLLSSYAPGLVESLLTWKGAAGLAAAVALGYIIISNLPGRQVAKPS
>Q04LE4
MISRFFRHLFEALKSLKRNGWMTVAAVSSVMITLTLVAIFASVIFNTAKI
>G0Y287
MVKLVEVLQHPDEIVPILQMLHKTYRAKRSYKDPGLAFCYGMLQRVSF
">" is followed by the identification number of the protein, such as "Q20300" in the first line, and the line immediately below it is the protein's sequence.
The information following ">" contains at least the identification number; other information is optional, and sometimes two further items, length and type, are present.
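The format check can be sketched as a small parser. It reads the rule as follows: a ">" header paired with the line directly after it forms one record, and any line that is neither a header nor the line following a header is dropped. That reading, and the assumption that each sequence sits on a single line as in the listing above, are interpretations rather than details the text spells out.

```python
def read_fasta_pairs(lines):
    """Keep only well-formed records: a '>' header followed by a sequence line."""
    records = []
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # The first field is the identification number; the rest is optional.
            header = fields[0] if fields else None
        elif header is not None:
            records.append((header, line))
            header = None
        # Any other line has no header directly before it and is dropped.
    return records


sample = [
    "stray text without a header",
    ">Q20300",
    "MMDQILGTNFTYEGAKEVARGLEGFSAKLAVGYIATIFGLKYYMKDRK",
    ">D3ZGS3",
    "MEPRLPIGAQPLACLHMVAGLEMKGPLREPCVLTLARRNGQYELIIQLI",
]
records = read_fasta_pairs(sample)
print([name for name, _ in records])
```

Files that wrap one sequence over several lines would need the sequence lines joined before this check.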
Other steps and parameters are the same as those in one of the first to third embodiments.
The fifth embodiment: this embodiment differs from one of the first to fourth embodiments in that, in S212, the content of the correctly formatted protein sequence data file obtained in S211 is checked for errors to obtain a protein sequence data file with correct format and content; the specific process is as follows:
there are 20 kinds of amino acids, each represented by one of 20 letters, and these 20 letters do not include "B", "J", "O", "U", "X" or "Z";
it is judged whether the character strings of the correctly formatted protein sequence data file obtained in S211 contain "B", "J", "O", "U", "X" or "Z"; if they do not, the protein sequence data file obtained in S211 is reported as correct, and S22 is performed;
if a character string contains "B", "J", "O", "U", "X" or "Z", the protein sequence data file obtained in S211 is reported as containing an error; every occurrence of "B", "J", "O", "U", "X" or "Z" in the file is deleted (there may be several), and S22 is then performed.
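A minimal sketch of this content check, assuming sequences are upper-case strings; the function name and the returned removal count are illustrative choices.

```python
NON_STANDARD = set("BJOUXZ")  # letters that do not denote one of the 20 amino acids


def clean_sequence(seq):
    """Return (cleaned sequence, number of characters removed)."""
    cleaned = "".join(ch for ch in seq if ch not in NON_STANDARD)
    return cleaned, len(seq) - len(cleaned)


cleaned, removed = clean_sequence("MKXLUA")
print(cleaned, removed)  # MKLA 2
```

A removal count of zero corresponds to the case in which the file is reported as correct.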
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth embodiment: this embodiment differs from one of the first to fifth embodiments in that, in S3, the feature vectors extracted in S2 are processed by an imbalance processing algorithm to obtain processed feature vectors and reduce the imbalance of the data; the specific process is as follows:
a tool called imblearn is used, which provides seven imbalance processing algorithms: ClusterCentroids, NearMiss, ENN, RandomUnderSampler, SMOTE, SMOTEENN and SMOTETomek;
the feature vectors extracted in S2 are processed by each of the seven imbalance processing algorithms to reduce the imbalance of the data, the accuracy is evaluated through cross-validation, and the imbalance processing algorithm with the highest accuracy is taken as the finally selected imbalance processing algorithm;
the feature vectors extracted in S2 are then processed by the finally selected imbalance processing algorithm to obtain processed feature vectors with reduced data imbalance;
A cross-validation method is adopted. Cross-validation is a well-known technique: the data are divided into 5 parts, 4 of which are used to train the learner, and the remaining part is used for testing to see how many of its samples are identified correctly. Cross-validation yields several metrics, such as accuracy, sensitivity and recall, and generally the option with the highest accuracy is selected.
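The 5-part split described above can be sketched with the standard library alone. The shuffling seed and the helper names are assumptions, and the learner is left out so that only the splitting and the accuracy metric are shown.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


splits = list(five_fold_indices(10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 8 2
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Each sample lands in exactly one test part, so averaging the five accuracies uses every sample once for testing.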
In this step, all other conditions are kept unchanged and only the imbalance processing algorithm is varied for comparison; the best-performing algorithm is then applied in the subsequent steps.
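To make one of the compared strategies concrete, here is a pure-Python sketch of random undersampling: samples of the larger class are dropped at random until both classes are the same size. It is a stand-in written for illustration, not the imblearn implementation, and the toy data are made up.

```python
import random


def random_undersample(features, labels, seed=0):
    """Drop random majority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for label, rows in by_class.items():
        kept = rng.sample(rows, target)  # sample without replacement
        out_x.extend(kept)
        out_y.extend([label] * target)
    return out_x, out_y


# Toy data: 3 samples of class 1 and 7 samples of class 0.
toy_x = [[i] for i in range(10)]
toy_y = [1] * 3 + [0] * 7
bal_x, bal_y = random_undersample(toy_x, toy_y)
print(len(bal_x), bal_y.count(1), bal_y.count(0))  # 6 3 3
```

Comparing strategies then amounts to running the same cross-validation on the output of each one and keeping the one with the highest accuracy.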
Other steps and parameters are the same as those in one of the first to fifth embodiments.
The seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that, in S4, the MRMD algorithm is used to perform feature selection on the processed feature vectors obtained in S3, to obtain a feature vector set in which features correlate strongly with the categories and redundancy among features is low; the specific process is as follows:
all the processed feature vectors obtained in S3 are sorted using the Hits-a, TrustRank, PageRank, LeaderRank and Hits-h ranking modes respectively, to obtain feature vector sets for the five ranking modes;
for example, if the feature vectors extracted in S2 have 5 features, the 5 features are sorted using each of the Hits-a, TrustRank, PageRank, LeaderRank and Hits-h ranking modes, giving feature vector sets for the five ranking modes (the feature vector set of each ranking mode consists of the same 5 features in a different order);
the MRMD algorithm is then used to select features within the feature vector set of each of the five ranking modes (for example, if the 5 features in the feature vector set of a ranking mode are too many, the MRMD algorithm screens them), to obtain the feature vector sets of the five ranking modes after feature selection;
the MRMD algorithm uses the Pearson correlation coefficient to measure the correlation between the feature subsets of the protein feature set obtained in S3 and the two target classes, vesicle transport protein and non-vesicle transport protein, and uses several distance functions to obtain the redundancy of each feature subset; the feature subset selected by MRMD has low redundancy and strong correlation with the target class.
For example, suppose the feature vector set of each of the 5 ranking modes consists of 5 features in a different order, the 5 features are too many, and the most useful part of them needs to be screened out. For each ranking mode, the MRMD algorithm starts from the first feature in the ranking and adds one feature at a time to the feature subset, to determine how many features must be added for the feature subset to perform best. For instance:
Feature vector: {name, gender, age, height}
After sorting: {age, name, height, gender}
"Name" is one of the features, and the feature subsets containing it include {name}, {name, gender}, {gender, age} and so on; first the $\max(MR_i + MD_i)$ of the first feature subset {name} is calculated, then the $\max(MR_i + MD_i)$ of the second feature subset {name, gender}, and so on; the feature subset with the largest $\max(MR_i + MD_i)$ is taken as the feature vector set after feature selection for that ranking mode; this selection is carried out for the feature vector set of each ranking mode to obtain the feature vector set of each ranking mode after feature selection;
Different feature ranking modes give different ranking results, so the finally selected feature subsets differ.
The role of MRMD is to screen the features in the feature vector set.
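The incremental screening can be sketched as a search over prefixes of the ranked feature list. The scoring function is passed in as a parameter. In the toy usage the per-feature values are invented stand-ins for the MR + MD score and serve only to show the mechanics.

```python
def best_prefix(ranked_features, score_subset):
    """Grow the subset one ranked feature at a time and keep the best-scoring prefix."""
    best_subset, best_score = [], float("-inf")
    for size in range(1, len(ranked_features) + 1):
        subset = ranked_features[:size]
        score = score_subset(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical per-feature contributions standing in for the MR + MD score.
toy_gain = {"age": 0.5, "name": -0.1, "height": 0.3, "gender": -0.4}
ranked = ["age", "name", "height", "gender"]
subset, score = best_prefix(ranked, lambda s: sum(toy_gain[f] for f in s))
print(subset)  # ['age', 'name', 'height']
```

Running the same search on each of the five rankings gives five candidate subsets, which are then compared by cross-validation.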
The distance functions comprise the Euclidean distance function, the cosine distance function and the Tanimoto coefficient function. The three functions are used to calculate the distance between each feature subset and the target class, and the sum of the distances is the redundancy.
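The three functions named here have standard textbook definitions, sketched below for two equal-length numeric vectors. The cosine function returns the usual cosine similarity, which is an assumption about what the text calls the cosine distance; how the three values are scaled before being added is not specified in the text and is left out.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tanimoto_coefficient(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


u, v = [1.0, 0.0], [0.0, 1.0]
print(euclidean_distance(u, v), cosine_similarity(u, v), tanimoto_coefficient(u, v))
```

All three functions assume non-zero vectors; a zero vector would cause a division by zero in the last two.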
The feature vector sets of the five ranking modes obtained after feature selection are compared through cross-validation, and the feature vector set with the highest accuracy is selected.
Other steps and parameters are the same as those in one of the first to sixth embodiments.
The eighth embodiment: this embodiment differs from one of the first to seventh embodiments in that the basis on which the MRMD algorithm selects features in the five feature subset ranking modes is $\max(MR_i + MD_i)$;
where $MR_i$ denotes the Pearson coefficient between the $i$-th protein class and the feature, and $MD_i$ denotes the Euclidean distance between the $i$-th protein class and the feature;
the value of $\max MR_i$ is calculated as follows:
[equation image BDA0003257246100000081, not reproduced in this text]
the value of $\max MD_i$ is calculated as follows:
[equation image BDA0003257246100000082, not reproduced in this text]
where $\mathrm{PCC}(\cdot)$ denotes the Pearson coefficient, $F_i$ denotes the feature vector of the $i$-th protein (vesicle transporter or non-vesicle transporter), $C_i$ denotes the class of the $i$-th protein (vesicle transporter or non-vesicle transporter), $M$ denotes the feature dimension of the protein (vesicle transporter or non-vesicle transporter), $S_{F_i C_i}$ denotes the covariance of all elements in $F_i$ and $C_i$, $S_{F_i}$ denotes the standard deviation of all elements in $F_i$, $S_{C_i}$ denotes the standard deviation of all elements in $C_i$, $f_k$ denotes the $k$-th element of $F_i$, $c_k$ denotes the $k$-th element of $C_i$, $N$ is the number of elements in $F_i$ and $C_i$, $\bar{f}$ is the average of all elements in $F_i$, $\bar{c}$ is the average of all elements in $C_i$, $ED_i$ denotes the Euclidean distance between the $i$-th protein features (vesicle transporter or non-vesicle transporter), $COS_i$ denotes the cosine distance between the $i$-th protein features (vesicle transporter or non-vesicle transporter), and $TC_i$ denotes the Tanimoto coefficient between the $i$-th protein features (vesicle transporter or non-vesicle transporter).
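The two formula images are not available in this text. For orientation only, the standard sample definitions of the Pearson coefficient, written with the symbols listed above, are given below; that the missing images match these forms is an assumption.

```latex
\mathrm{PCC}(F_i, C_i) = \frac{S_{F_i C_i}}{S_{F_i}\, S_{C_i}}, \qquad
S_{F_i C_i} = \frac{1}{N-1} \sum_{k=1}^{N} (f_k - \bar{f})(c_k - \bar{c}),

S_{F_i} = \sqrt{\frac{1}{N-1} \sum_{k=1}^{N} (f_k - \bar{f})^2}, \qquad
S_{C_i} = \sqrt{\frac{1}{N-1} \sum_{k=1}^{N} (c_k - \bar{c})^2},

\bar{f} = \frac{1}{N} \sum_{k=1}^{N} f_k, \qquad
\bar{c} = \frac{1}{N} \sum_{k=1}^{N} c_k .
```

The factor $1/(N-1)$ cancels in the ratio, so using $1/N$ throughout gives the same coefficient.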
Other steps and parameters are the same as those in one of the first to seventh embodiments.
The ninth embodiment: this embodiment differs from one of the first to eighth embodiments in that, in S5, XGBoost is adopted as the classifier and hyper-parameter optimization is performed; the specific process is as follows:
S51, initializing the XGBoost parameters: learning rate learning_rate = 0.1; maximum number of iterations n_estimators = 200; maximum depth max_depth = 5; min_child_weight = 1; gamma = 0; 0.8; colsample_bytree = 0.8;
S52, taking one of the initial parameters as the variable, selecting an adjustment range for it, and keeping the other parameters unchanged; using the cross-validation built into XGBoost to search iteratively for the optimal value of that parameter;
S53, repeating S52 until the optimal value has been found for every parameter, to obtain the optimal XGBoost for use as the classifier.
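Steps S51 to S53 describe a one-parameter-at-a-time (coordinate-wise) search, sketched below with the scoring function passed in as a parameter. In practice `cv_score` would wrap a cross-validated XGBoost run; here it is a made-up toy function so the sketch runs without XGBoost. The candidate ranges are assumptions, the value given without a name in S51 is left out, and the text's "colsample _ byte" is read as XGBoost's `colsample_bytree`.

```python
def tune_one_at_a_time(initial, search_space, cv_score):
    """Optimise each parameter in turn while holding the others fixed."""
    params = dict(initial)
    for name, candidates in search_space.items():
        best_value, best_score = params[name], cv_score(params)
        for value in candidates:
            trial = dict(params, **{name: value})
            score = cv_score(trial)
            if score > best_score:
                best_value, best_score = value, score
        params[name] = best_value  # keep the winner before tuning the next one
    return params


# Initial values from S51 (the unnamed 0.8 entry is omitted).
initial = {"learning_rate": 0.1, "n_estimators": 200, "max_depth": 5,
           "min_child_weight": 1, "gamma": 0, "colsample_bytree": 0.8}
# Assumed candidate ranges, for illustration only.
space = {"max_depth": [3, 4, 5, 6], "min_child_weight": [1, 3, 5]}


def toy_score(p):
    # Toy stand-in for cross-validated accuracy; peaks at depth 4, weight 3.
    return -(abs(p["max_depth"] - 4) + abs(p["min_child_weight"] - 3))


tuned = tune_one_at_a_time(initial, space, toy_score)
print(tuned["max_depth"], tuned["min_child_weight"])  # 4 3
```

This search is cheaper than a full grid over all parameters, at the cost of possibly missing interactions between them.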
Other steps and parameters are the same as those in one of the first to eighth embodiments.
The tenth embodiment: the vesicle transporter identification device based on the position specificity score matrix of this embodiment comprises a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the vesicle transporter identification method based on the position specificity score matrix according to one of the first to ninth embodiments.
The following example is used to demonstrate the beneficial effects of the present invention:
Example one:
exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It is to be understood that the embodiments shown and described in the drawings are merely exemplary and are intended to illustrate the principles and spirit of the invention, not to limit the scope of the invention.
The embodiment of the invention provides a vesicle transport protein identification method based on a position specificity score matrix which, as shown in FIG. 1, comprises the following steps S1 to S7:
S1, downloading a protein sequence data file.
The acquired original protein characteristic data set comprises a positive case data set and a negative case data set, wherein the positive case data set is a vesicle transport protein sequence file, and the negative case data set is a non-vesicle transport protein sequence file.
In the present example there are 2 protein sequence data files: a sequence data file of vesicle transporters (the positive examples, containing 9086 sequences) and a sequence data file of non-vesicle transporters (the negative examples, containing 2533 sequences).
S2, generating a position specificity score matrix based on the protein sequence data file acquired at S1, and extracting feature vectors from the position specificity score matrix by adopting an AATP algorithm; the specific process is as follows:
s21, carrying out error detection on the format and the content of the protein sequence data file obtained in the S1 to obtain a correct protein sequence data file; the specific process is as follows:
s211, carrying out error detection on the format of the protein sequence data file obtained in the S1 to obtain a protein sequence data file with a correct format; the specific method comprises the following steps:
when the line of the protein sequence data file acquired at S1 does not begin with the character ">", deleting this line of non-specification data;
when the line of the protein sequence data file acquired at S1 begins with the character ">", the data subsequent to this line includes information such as the identification number, position, etc. of the sequence, and the data of the next line is the text data of this protein sequence data file, then a protein sequence data file in the correct format is obtained;
s212, error detection is carried out on the content of the protein sequence data file with the correct format obtained in the S211, and the protein sequence data file with the correct format and content is obtained; the specific method comprises the following steps:
the amino acids are 20 kinds, and are respectively represented by 20 letters, and the 20 letters do not contain 'B', 'J', 'O', 'U', 'X' or 'Z';
judging whether the character string of the protein sequence data file with the correct format obtained in S211 contains "B", "J", "O", "U", "X" or "Z", and if the character string does not contain "B", "J", "O", "U", "X" or "Z", prompting that the protein sequence data file obtained in S211 is correct, and performing S22;
if "B", "J", "O", "U", "X", or "Z" is included in the character string, it is suggested that there is an error in the protein sequence data file acquired in S211, and S22 is performed to delete "B", "J", "O", "U", "X", or "Z" (including several deletions) included in the protein sequence data file acquired in S211;
S22, using the PSI-BLAST program to align the correct protein sequence data file obtained in S21 against the NCBI non-redundant database to obtain a position specificity scoring matrix;
the position specificity scoring matrix contains important evolutionary information about the protein, and extracting features from this matrix can effectively improve the performance of the vesicle transport protein recognition model.
Feature vectors are extracted from the position specificity scoring matrix using the feature extraction algorithm AATP.
The feature extraction algorithm AATP consists of two parts, AAC and TPC. AAC is a 20-dimensional feature vector that represents the average score of each amino acid changing into other types of amino acids during the evolution of the protein. TPC is a 400-dimensional feature vector obtained from a transition probability matrix, and it effectively avoids the loss of information in the sequence.
The AATP algorithm effectively extracts the most important information from the position specificity scoring matrix, which further improves the efficiency and performance of vesicle transport protein identification.
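The AATP feature vector can be sketched as follows. The description does not give the formulas, so this sketch assumes a commonly used formulation: AAC is the column-wise mean of the L x 20 matrix, and TPC is the sum of products of scores at adjacent positions, normalised per source amino acid. The function name aatp_features and the zero-denominator guard are assumptions made for illustration.

```python
def aatp_features(pssm):
    """Build a 420-dimensional AATP vector from an L x 20 score matrix.

    pssm is a list of L rows, each a list of 20 numbers (assumed layout).
    """
    length = len(pssm)

    # AAC (assumed form): mean of each of the 20 columns.
    aac = [sum(row[j] for row in pssm) / length for j in range(20)]

    # TPC (assumed form): for each pair (i, j), sum the products of the
    # score for type i at position k and type j at position k + 1,
    # then normalise over j so each block of 20 values sums to 1.
    tpc = []
    for i in range(20):
        raw = [
            sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1))
            for j in range(20)
        ]
        total = sum(raw)
        # Guard against a zero denominator (an assumption of this sketch).
        tpc.extend(v / total if total != 0 else 0.0 for v in raw)

    return aac + tpc


# Toy matrix with three positions and identical scores everywhere.
toy_pssm = [[1.0] * 20 for _ in range(3)]
features = aatp_features(toy_pssm)
```

The first 20 entries of the result are the AAC part and the remaining 400 are the TPC part.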
S3, reducing the imbalance of the data by using an imbalance processing algorithm.
In this step, several imbalance processing methods provided by the Python software package imbalanced-learn are compared, and the algorithm with the best effect is selected. Seven algorithms are used: ClusterCentroids, NearMiss, ENN, RandomUnderSampler, SMOTE, SMOTEENN and SMOTETomek. All other conditions are kept unchanged and only the imbalance processing algorithm is varied for the comparison; the algorithm with the best performance is then applied in the subsequent steps.
S4, sorting all the processed feature vectors obtained in S3 using each of the sorting modes Hits-a, TrustRank, PageRank, LeaderRank and Hits-h to obtain feature subsets for the five sorting modes;
performing feature selection on the feature subsets of the five sorting modes using the MRMD algorithm (for example, if a feature subset contains too many features, the MRMD algorithm can screen it down to a smaller number, such as 5) to obtain the feature subsets of the five sorting modes after feature selection;
comparing the feature subsets of the five sorting modes after feature selection through cross validation, and selecting the feature vector set with the highest accuracy.
The MRMD algorithm uses the Pearson correlation coefficient to measure the correlation between features and target classes, and uses several distance functions to measure the redundancy of each feature. The redundancy between features is characterized by a distance measure related to the Euclidean distance ED, the cosine distance COS and the Tanimoto coefficient TC; the larger the distance, the lower the redundancy between features.
Based on this theory, the MRMD algorithm selects features from the feature set according to $\max(MR_i + MD_i)$, where $MR_i$ denotes the Pearson coefficient between the $i$-th protein class and the feature, and $MD_i$ denotes the Euclidean distance between the $i$-th protein features. The value of $\max MR_i$ is calculated as follows:
[Formula image BDA0003257246100000121, not reproduced in this text]
The value of $\max MD_i$ is calculated as follows:
[Formula image BDA0003257246100000122, not reproduced in this text]
where $PCC(\cdot)$ denotes the Pearson coefficient, $F_i$ denotes the feature vector of the $i$-th protein, $C_i$ denotes the class vector of the $i$-th protein, $M$ denotes the feature dimension of the protein, $S_{F_iC_i}$ denotes the covariance of all elements in $F_i$ and all elements in $C_i$, $S_{F_i}$ denotes the standard deviation of all elements in $F_i$, $S_{C_i}$ denotes the standard deviation of all elements in $C_i$, $f_k$ denotes the $k$-th element of $F_i$, $c_k$ denotes the $k$-th element of $C_i$, $N$ is the number of elements in $F_i$ and $C_i$, $\bar{f}$ is the average of all elements in $F_i$, $\bar{c}$ is the average of all elements in $C_i$, $ED_i$ denotes the Euclidean distance between the $i$-th protein features, $COS_i$ denotes the cosine distance between the $i$-th protein features, and $TC_i$ denotes the Tanimoto coefficient between the $i$-th protein features.
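The selection criterion can be illustrated with the following sketch. Since the formula images are not reproduced in this text, the sketch rests on stated assumptions: the Pearson coefficient is taken in its standard form (covariance divided by the product of the standard deviations), its absolute value is used as the relevance term, each $F_i$ is read as one feature column across all samples, and the distance term uses only the mean Euclidean distance to the other features. The COS and TC terms are omitted because the way they are combined is not recoverable from the text. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Standard Pearson coefficient; the 1/(N-1) factors cancel out."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def euclidean(x, y):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mrmd_scores(columns, labels):
    """Score each feature column as relevance plus distance (assumed form).

    columns: list of M feature columns, each holding N values.
    labels:  list of N numeric class labels.
    """
    scores = []
    for i, column in enumerate(columns):
        relevance = abs(pearson(column, labels))  # assumed: absolute value
        others = [euclidean(column, other)
                  for j, other in enumerate(columns) if j != i]
        distance = sum(others) / len(others) if others else 0.0
        scores.append(relevance + distance)
    return scores


# Two identical, label-aligned features and one uncorrelated feature.
toy_columns = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
toy_labels = [0, 1, 0, 1]
toy_scores = mrmd_scores(toy_columns, toy_labels)
```

Features would then be ranked by score and the top-scoring ones retained; in practice the relevance and distance terms may need to be brought to comparable scales, which this sketch does not attempt.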
S5, adopting XGBoost as the learner and performing hyper-parameter optimization;
step S5 includes the following sub-steps S51 to S54:
S51, initializing the XGBoost parameters:
learning_rate=0.1; n_estimators=200; max_depth=5; min_child_weight=1; gamma=0; subsample=0.8; colsample_bytree=0.8.
S52, taking one of the initial parameters as a variable, selecting an adjustment range for it, and keeping the other parameters unchanged; using the cross validation built into XGBoost to search iteratively for the optimal value of this parameter;
S53, repeating step S52 until the optimal value of every parameter has been found;
S54, obtaining the optimal parameters and training XGBoost with them.
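The one-parameter-at-a-time search of S52 and S53 can be sketched as follows. This is a minimal illustration in which the scoring step is abstracted into a callable; in an actual run that callable would wrap the cross validation built into XGBoost. The names tune_one_at_a_time and toy_score, the candidate grids, and the toy scoring function are hypothetical and serve only to make the sketch runnable.

```python
# Initial values listed in S51.
INITIAL_PARAMS = {
    "learning_rate": 0.1,
    "n_estimators": 200,
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}


def tune_one_at_a_time(initial, grids, score_fn):
    """Vary one parameter over its grid while the others stay fixed.

    grids maps a parameter name to its candidate values; score_fn takes a
    parameter dict and returns a score where higher is better.
    """
    params = dict(initial)
    for name, candidates in grids.items():
        best_value, best_score = params[name], score_fn(params)
        for value in candidates:
            trial = dict(params)
            trial[name] = value
            score = score_fn(trial)
            if score > best_score:
                best_value, best_score = value, score
        params[name] = best_value  # keep the best value before moving on
    return params


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at depth 7, weight 3."""
    return -((params["max_depth"] - 7) ** 2
             + (params["min_child_weight"] - 3) ** 2)


tuned = tune_one_at_a_time(
    INITIAL_PARAMS,
    {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 3, 5]},
    toy_score,
)
```

Parameters that have no grid keep their initial values, which matches the requirement that the other parameters remain unchanged while one is being adjusted.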
S6, inputting the feature vector set obtained in S4 into the classifier for classification training to obtain a trained classifier model;
S7, inputting the data set to be detected into the trained classifier model to obtain a classification result, completing the identification of the vesicle transport protein.
The recognition effect of the present invention is further described below with a set of specific experimental examples.
First, the recognition effect of the AATP algorithm on vesicle transport proteins is compared with that of other feature extraction methods based on the position specificity scoring matrix, as shown in fig. 2. The evaluation indexes are ACC, SN, SP and MCC, calculated as follows:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
$$SN = \frac{TP}{TP + FN}$$
$$SP = \frac{TN}{TN + FP}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
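The four evaluation indexes can be computed from confusion-matrix counts as in the sketch below, which uses the standard definitions of accuracy, sensitivity, specificity and the Matthews correlation coefficient. The function name classification_metrics and the choice to return 0.0 for an empty denominator are assumptions of this illustration.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return ACC, SN, SP and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Illustrative counts only; they are not results from the experiments.
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

MCC is often the most informative of the four on imbalanced data, because it takes all four counts into account.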
As can be seen from fig. 2, the AATP algorithm gives a better classification effect than the other algorithms. The AATP algorithm effectively extracts information from the position specificity scoring matrix, thereby improving the recognition effect for vesicle transport proteins.
The different imbalance processing methods are then compared. The invention uses seven imbalance processing methods in total: ClusterCentroids, NearMiss, ENN, RandomUnderSampler, SMOTE, SMOTEENN and SMOTETomek; the comparison results are shown in fig. 3. As can be seen from fig. 3, ENN is the best algorithm. The ENN algorithm cleans the data of whichever class, positive or negative, has more samples, so as to retain a representative sample set. Subsequent experiments use the ENN algorithm for imbalance processing of the data.
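The cleaning idea behind ENN can be sketched as follows. This is a simplified reading of Edited Nearest Neighbours written in plain Python for illustration, not the imbalanced-learn implementation: a sample of the larger class is dropped when most of its k nearest neighbours belong to another class. The function name, the value k=3 and the majority-vote rule are assumptions.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Remove majority-class samples whose neighbours mostly disagree."""
    majority = Counter(y).most_common(1)[0][0]
    kept_X, kept_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == majority:
            # Indices of the k nearest neighbours, excluding the sample.
            neighbours = sorted(
                (j for j in range(len(X)) if j != i),
                key=lambda j: math.dist(xi, X[j]),
            )[:k]
            disagree = sum(1 for j in neighbours if y[j] != yi)
            if disagree * 2 > len(neighbours):
                continue  # sample lies inside the other class: drop it
        kept_X.append(xi)
        kept_y.append(yi)
    return kept_X, kept_y


# Five samples of class 0 (one sitting among class 1) and three of class 1.
toy_X = [[0.0], [0.1], [0.2], [0.3], [5.05], [5.0], [5.1], [5.2]]
toy_y = [0, 0, 0, 0, 0, 1, 1, 1]
clean_X, clean_y = edited_nearest_neighbours(toy_X, toy_y)
```

Only the majority-class sample that sits among the minority samples is removed, so the class sizes move closer together while the minority class is left untouched.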
Next, the results obtained with different settings of the MRMD3.0 algorithm used in the invention are compared. MRMD3.0 offers five sorting modes for the user to choose from: Hits-a, TrustRank, PageRank, LeaderRank and Hits-h. The comparison results of these five modes are shown in fig. 4. As can be seen from fig. 4, Hits-h is selected because it performs best on the various indexes.
Finally, different learners are compared, including XGBoost, RF, KNN and SVM. The comparison results are shown in fig. 5. As can be seen from fig. 5, XGBoost is the best choice: it achieves higher accuracy while remaining highly efficient, so XGBoost is adopted in the present invention.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (6)

1. A vesicle transport protein identification method based on a position specificity scoring matrix, characterized in that the method comprises the following specific process:
S1, acquiring a protein sequence data file;
S2, generating a position specificity scoring matrix based on the protein sequence data file acquired in S1, and extracting feature vectors from the position specificity scoring matrix using the AATP algorithm;
S3, processing the feature vectors extracted in S2 using an imbalance processing algorithm to obtain processed feature vectors;
S4, performing feature selection on the processed feature vectors obtained in S3 using the MRMD algorithm to obtain a feature vector set with strong correlation between features and categories and low redundancy among the features;
S5, adopting XGBoost as the classifier and performing hyper-parameter optimization;
S6, inputting the feature vector set obtained in S4 into the classifier for classification training to obtain a trained classifier model;
S7, inputting the data set to be detected into the trained classifier model to obtain a classification result, completing the identification of the vesicle transport protein;
in S2, generating a position specificity scoring matrix based on the protein sequence data file acquired in S1, and extracting feature vectors from the position specificity scoring matrix using the AATP algorithm; the specific process is as follows:
S21, performing error detection on the format and the content of the protein sequence data file obtained in S1 to obtain a correct protein sequence data file; the specific process is as follows:
S211, performing error detection on the format of the protein sequence data file obtained in S1 to obtain a protein sequence data file with a correct format;
S212, performing error detection on the content of the correctly formatted protein sequence data file obtained in S211 to obtain a protein sequence data file whose format and content are both correct;
S22, using the PSI-BLAST program to align the correct protein sequence data file obtained in S21 against the NCBI non-redundant database to obtain a position specificity scoring matrix;
extracting feature vectors from the position specificity scoring matrix using the feature extraction algorithm AATP;
in S211, performing error detection on the format of the protein sequence data file obtained in S1 to obtain a protein sequence data file with a correct format; the specific process is as follows:
when a line of the protein sequence data file acquired in S1 does not begin with the character ">", deleting this line of non-conforming data;
when a line of the protein sequence data file acquired in S1 begins with the character ">", the data following this character in the line includes the identification number information of the sequence, and the data of the next line is the text data of the protein sequence data file, the protein sequence data file in the correct format is obtained;
processing the feature vector extracted in the step S2 by using an imbalance processing algorithm in the step S3 to obtain a processed feature vector; the specific process is as follows:
there are seven imbalance processing algorithms in total: ClusterCentroids, NearMiss, ENN, RandomUnderSampler, SMOTE, SMOTEENN and SMOTETomek;
processing the feature vectors extracted in the step S2 by adopting seven imbalance processing algorithms to reduce the imbalance of the data, evaluating the accuracy rate through cross validation, and selecting the imbalance processing algorithm with the highest accuracy rate as the finally selected imbalance processing algorithm;
processing the feature vector extracted in the step S2 by adopting a finally selected imbalance processing algorithm to obtain a processed feature vector;
in the step S4, an MRMD algorithm is adopted to perform feature selection on the processed feature vector obtained in the step S3, so that a feature vector set with strong correlation between features and categories and low redundancy among the features is obtained; the specific process is as follows:
sorting all the processed feature vectors obtained in the step S3 by adopting sorting modes of Hits-a, TrustRank, PageRank, LeaderRank and Hits-h respectively to obtain feature vector sets of five sorting modes;
respectively selecting features in the feature vector sets of the five sorting modes by adopting an MRMD algorithm to obtain the feature vector sets of the five sorting modes after feature selection;
and comparing the obtained feature vector sets of the five sorting modes after feature selection through cross validation, and selecting the feature vector set with the highest accuracy.
2. The method for identifying a vesicle transporter based on a position-specific score matrix according to claim 1, wherein: acquiring a protein sequence data file in the S1; the specific process is as follows:
acquiring a protein sequence data file, wherein the protein sequence data file comprises a positive example data set and a negative example data set;
the positive example data set is a sequence data file of vesicle transport protein, and the negative example data set is a sequence data file of non-vesicle transport protein.
3. The method for identifying a vesicle transporter based on a position-specific score matrix according to claim 2, wherein: in S212, error detection is performed on the content of the protein sequence data file with the correct format obtained in S211, so as to obtain a protein sequence data file with both the correct format and the correct content; the specific process is as follows:
judging whether the character string of the protein sequence data file with the correct format obtained in S211 contains "B", "J", "O", "U", "X" or "Z", and if the character string does not contain "B", "J", "O", "U", "X" or "Z", the protein sequence data file obtained in S211 is correct, and performing S22;
if "B", "J", "O", "U", "X", or "Z" is included in the character string, the protein sequence data file acquired in S211 has an error, and it is necessary to delete "B", "J", "O", "U", "X", or "Z" included in the protein sequence data file acquired in S211 and perform S22.
4. The method for identifying a vesicle transporter based on a position-specific score matrix according to claim 3, wherein: the basis for respectively selecting the features in the five feature subset sorting modes using the MRMD algorithm is $\max(MR_i + MD_i)$;
where $MR_i$ denotes the Pearson coefficient between the $i$-th protein class and the feature, and $MD_i$ denotes the Euclidean distance between the $i$-th protein class and the feature;
the value of $\max MR_i$ is calculated as follows:
[Formula image FDA0003596418720000031, not reproduced in this text]
the value of $\max MD_i$ is calculated as follows:
[Formula image FDA0003596418720000041, not reproduced in this text]
where $PCC(\cdot)$ denotes the Pearson coefficient, $F_i$ denotes the feature vector of the $i$-th protein, $C_i$ denotes the class of the $i$-th protein, $M$ denotes the feature dimension of the protein, $S_{F_iC_i}$ denotes the covariance of all elements in $F_i$ and all elements in $C_i$, $S_{F_i}$ denotes the standard deviation of all elements in $F_i$, $S_{C_i}$ denotes the standard deviation of all elements in $C_i$, $f_k$ denotes the $k$-th element of $F_i$, $c_k$ denotes the $k$-th element of $C_i$, $N$ is the number of elements in $F_i$ and $C_i$, $\bar{f}$ is the average of all elements in $F_i$, $\bar{c}$ is the average of all elements in $C_i$, $ED_i$ denotes the Euclidean distance between the $i$-th protein features, $COS_i$ denotes the cosine distance between the $i$-th protein features, and $TC_i$ denotes the Tanimoto coefficient between the $i$-th protein features.
5. The method for identifying a vesicle transporter based on a position-specific score matrix according to claim 4, wherein: in S5, XGBoost is adopted as the classifier and hyper-parameter optimization is performed; the specific process is as follows:
S51, initializing the XGBoost parameters:
learning rate learning_rate = 0.1; maximum number of iterations n_estimators = 200; maximum depth max_depth = 5; min_child_weight = 1; gamma = 0; subsample = 0.8; colsample_bytree = 0.8;
S52, taking one of the initial parameters as a variable, selecting an adjustment range for it, and keeping the other parameters unchanged; using the cross validation built into XGBoost to search iteratively for the optimal value of this parameter;
S53, repeating step S52 until the optimal value of every parameter has been found, obtaining the optimal values of all the parameters, and obtaining the optimal XGBoost to be used as the classifier.
6. Position-specific score matrix based vesicle transporter identification apparatus, comprising a processor and a memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to implement a position-specific score matrix based vesicle transporter identification method according to one of claims 1 to 5.
CN202111063261.4A 2021-09-10 2021-09-10 Vesicle transport protein identification method and identification equipment based on position specificity scoring matrix Active CN113764043B (en)

Publications (2)

Publication Number Publication Date
CN113764043A (en) 2021-12-07
CN113764043B (en) 2022-05-20

Family

ID=78794854

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Zhao Yuming, Wang Guohua, Gong Yue, Zou Quan
Inventor before: Wang Guohua, Gong Yue, Zou Quan
GR01 Patent grant