CN104463251A - Cancer gene expression profile data identification method based on integration of extreme learning machines - Google Patents

Cancer gene expression profile data identification method based on integration of extreme learning machines

Info

Publication number
CN104463251A
CN104463251A CN201410773130.9A CN201410773130A CN104463251A CN 104463251 A CN104463251 A CN 104463251A CN 201410773130 A CN201410773130 A CN 201410773130A CN 104463251 A CN104463251 A CN 104463251A
Authority
CN
China
Prior art keywords
elm
particle
integrated
value
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410773130.9A
Other languages
Chinese (zh)
Inventor
凌青华
韩飞
叶松林
杨春
崔宝祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology
Priority to CN201410773130.9A
Publication of CN104463251A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 References adjustable by an adaptive method, e.g. learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for identifying cancer gene expression profile data based on the integration of extreme learning machines (ELMs). The method comprises selecting member ELMs and integrating them. Specifically, a cancer gene expression profile data set is preprocessed, the preprocessing comprising gene selection and normalization of the expression profile data; N sample sets are generated by the Bagging method, and each sample set is divided into a training set and a validation set in a given proportion; N ELMs are trained on the N training sets, and the L ELMs (L < N) with the highest recognition rates on their corresponding validation sets are selected to form a candidate member ELM library; K member ELMs (K < L) that make up the integrated system are selected from the L ELMs by the particle swarm optimization algorithm; the voting weights of the K member ELMs are computed by the minimum-norm least-squares method; and the resulting integrated ELM system is used to perform tumor recognition on new cancer gene expression profile samples. The method recognizes cancer gene expression profile data quickly and accurately.

Description

Cancer gene expression profile data identification method based on integrated extreme learning machines
Technical field
The invention belongs to the field of computer analysis of cancer gene expression profile data, and specifically relates to a cancer gene expression profile data identification method based on integrated extreme learning machines.
Background
In the life sciences, DNA microarray technology has brought unprecedented opportunities to biological and medical research, but the complex gene expression profile data it produces pose a serious challenge to existing data analysis and processing methods. First, gene expression profile data have a very high dimensionality (number of genes), and the relationships among these gene dimensions are very complex. Second, the number of samples is small, which is badly out of proportion to the huge number of genes. Third, the data are inherently noisy and highly variable, which makes analysis difficult. Fourth, much of the useful information in the data is hidden. Traditional computer analysis methods cannot meet practical needs when processing gene expression profile data. How to use computer data analysis techniques to identify the class of gene expression profile data (normal or abnormal; different tumor subtypes) quickly and accurately, so that clinical diagnosis becomes more objective and accurate, has become a key technology in cancer gene expression profile data analysis.
In recent years, research at home and abroad on identifying tumors from gene expression profiles with machine learning methods has been very active. It mainly includes the following. (1) Neural networks based on backpropagation (BP) and related gradient algorithms. For example, J. Khan et al. (Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks) used a neural-network-based model to identify the four subtypes of small round blue cell tumors (SRBCT) effectively and to identify the most effective gene subset. However, BP and related gradient algorithms converge slowly and are easily trapped in local extrema, and the network structure is difficult to determine, which leads to low tumor recognition accuracy and a large time overhead. (2) Support vector machines (SVMs). For example, T. S. Furey et al. (SVM classification and validation of cancer tissue samples using microarray expression data) used a standard SVM to identify sample classes in three data sets, ovarian cancer (Ovarian), leukemia (ALL/AML) and colon cancer (Colon), and obtained high recognition rates on all of them. Although SVMs suit high-dimensional, small-sample data, they are good mainly at two-class problems and perform less well on multi-class problems. In addition, selecting SVM parameters is time-consuming, and there is as yet no effective theory to guide the selection. (3) Extreme learning machines (ELMs). For example, F. Han et al. (A Novel Strategy for Gene Selection of Microarray Data Based on Gene-to-Class Sensitivity Information) applied ELM to tumor recognition after gene selection and, on six gene expression profile data sets (Leukemia, Colon, SRBCT, LUNG, Brain cancer and Lymphoma), obtained recognition accuracy better than that of classical methods. ELM obtains the unique weight solution of a single-hidden-layer feedforward neural network analytically, and it has been proved theoretically that this solution yields both the minimum training error and the minimum-norm output weights; the algorithm can therefore reach optimal generalization performance in a very short time, which other learning algorithms cannot match. ELM provides a unified platform for a variety of applications: it can approximate any continuous function and classify any disjoint regions. Although some SVMs (such as MOC-LS-SVM and SVMs based on Bayes' rule) can solve multi-class problems, they increase computing time and complexity. Extensive experiments show that, compared with SVM, ELM has better scalability, similar (regression and two-class problems) or better (multi-class problems) generalization performance, and faster convergence. ELM is therefore a reasonable choice for improving the processing speed and accuracy of gene expression profile data analysis. On the other hand, because ELM selects the input-layer weights of the single-hidden-layer feedforward network at random, the network tends to need more hidden nodes, and the norm of the output-layer weights and the condition number of the hidden-layer output matrix increase, which affects the response time and prediction performance of ELM on the test set.
The analysis above shows that although a single classifier can be used to identify tumors from gene expression profiles, its recognition performance still has considerable room for improvement. An ensemble classifier can compensate well for the shortcomings of a single classifier. Several single classifiers jointly classify a sample, and their recognition results are combined by an integration rule, which effectively improves tumor recognition accuracy. Ensemble learning exploits the performance of each member classifier and the complementarity among them, thereby improving the generalization performance, recognition rate and stability of the system.
Traditional ensemble classifiers are mainly built from multiple BP-based feedforward networks or from multiple SVMs. For example, Zhou Zhihua et al. (Ensembling neural networks: Many could be better than all) proposed the GASEN ensemble method, which first trains each member neural network and then uses a genetic algorithm (GA) to optimize the ensemble weights. Experimental results on several regression and classification data sets show that GASEN outperforms the classical Bagging and Boosting algorithms. Although such ensembles improve classification performance, the problems inherent in BP networks and SVMs still appear to some degree in the integrated system. Because of ELM's good performance, researchers have integrated multiple ELMs to improve the performance of the integrated system further. For example, Cheng Lian et al. (Ensemble of extreme learning machine for landslide displacement prediction based on time series analysis) integrated multiple ELMs and successfully predicted the displacement of the Baishuihe landslide in the Three Gorges reservoir area. However, these integrated ELMs simply combine multiple ELMs; both the selection of member ELMs and the design of the integration rule are too simple, so their performance still has considerable room for improvement. To date, no report has been found on using integrated ELMs for tumor recognition on gene expression profile data.
Summary of the invention
Object of the invention: the object of the invention is to provide a cancer gene expression profile data identification method based on integrated extreme learning machines that identifies tumor classes more quickly and accurately.
Technical solution: a cancer gene expression profile data identification method based on integrated extreme learning machines, comprising the selection of member ELMs and the integration of the member ELMs, and comprising the following steps:
Step 1: preprocess the cancer gene expression profile data set, including gene selection and normalization of the tumor expression profile data;
Step 2: from the data set obtained in step 1, generate N sample sets in a given proportion by the Bagging method, and split each of the N sample sets in a given proportion into a training set and a validation set, giving N training sets and N validation sets;
Step 3: train one extreme learning machine on each of the N training sets from step 2 and, according to the recognition rate of each of the N ELMs on its corresponding validation set, select the L ELMs (L<N) with the highest recognition rates to form a candidate member ELM library;
Step 4: taking the diversity among the member ELMs of the integrated system as the optimization objective, use the standard particle swarm optimization algorithm to select from the L ELMs the K member extreme learning machines (K<L) that make up the integrated system;
Step 5: compute the voting weights of the K member extreme learning machines by the minimum-norm least-squares method;
Step 6: combine the K member extreme learning machines using the K voting weights obtained, yielding an integrated ELM system, and use this system to perform tumor recognition on new cancer gene expression profile samples.
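The core of step 3 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names (`train_elm`, `predict_elm`, `select_top_elms`), the uniform range of the random weights, and the two-class coding of labels as -1 / +1 are assumptions made only for this illustration.

```python
import numpy as np

def _hidden(X, W, b):
    # Sigmoid hidden-layer output matrix H
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def train_elm(X, T, n_hidden, rng):
    """Train one ELM. X: (n_samples, n_genes); T: labels coded -1 / +1 (assumption)."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden-unit thresholds
    beta = np.linalg.pinv(_hidden(X, W, b)) @ T              # minimum-norm least-squares output weights
    return W, b, beta

def predict_elm(model, X):
    """Return predicted labels in {-1, +1}."""
    W, b, beta = model
    return np.where(_hidden(X, W, b) @ beta >= 0.0, 1, -1)

def select_top_elms(models, val_sets, L):
    """Keep the L ELMs with the highest accuracy on their own validation sets."""
    acc = np.array([np.mean(predict_elm(m, Xv) == yv)
                    for m, (Xv, yv) in zip(models, val_sets)])
    keep = np.argsort(-acc, kind="stable")[:L]
    return [models[i] for i in keep]
```

Only the output weights are fitted; the input weights and thresholds stay at their random initial values, which is why training a large pool of N candidate ELMs is cheap.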
Step 4 further comprises the following steps:
Step 4.1: initialize the position and velocity of each particle in the swarm. K ELMs are selected at random from the L ELMs, and their indices serve as the initial position of a particle, so that each particle represents one group of member learning machines; the initial velocity of each particle is drawn at random from (0,1). In the K-dimensional space, the position of the i-th particle is expressed as the vector $x_i=(x_{i1}, x_{i2}, \ldots, x_{iK})$, where $x_{ik}$ ($k=1,2,\ldots,K$) denotes the k-th member learning machine of the i-th integrated system, and the flight velocity of the particle is expressed as the vector $v_i=(v_{i1}, v_{i2}, \ldots, v_{iK})$;
Step 4.2: adjust the current velocity and position of each particle according to the following formulas, where $w$ is the inertia weight, $c_1$ and $c_2$ are the acceleration constants, and $p_{id}$ and $p_{gd}$ are the d-th components of the particle's historical best position $P_i$ and of the global best position $P_g$:
$$v_{id}(t+1) = w \times v_{id}(t) + c_1 \times \mathrm{rand}(t) \times (p_{id}(t) - x_{id}(t)) + c_2 \times \mathrm{rand}(t) \times (p_{gd}(t) - x_{id}(t)) \tag{1}$$
$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1) \tag{2}$$
Step 4.3: compute the fitness value of each particle according to formula (4);
The similarity among members serves as the fitness function in the standard particle swarm optimization. Here, the similarity between any two ELMs is the cosine of the angle between two vectors, each obtained by converting the input-layer weight matrix and the hidden-unit threshold vector of one of the ELMs. A smaller cosine means a larger angle between the two vectors, which indicates a larger difference between the input weight matrices and hidden-unit thresholds of the two ELMs, and hence greater diversity between the two ELMs. The difference between members is thus converted into the cosine of the angle between two vectors, and the diversity among members is computed from this cosine, as follows:
$$\cos\theta = \frac{\alpha \cdot \beta}{|\alpha| \cdot |\beta|} \tag{3}$$
$$\mathrm{fitness}(i) = \sum_{k_1=1}^{K} \sum_{k_2=k_1+1}^{K} \cos\theta_{ik_1k_2} = \sum_{k_1=1}^{K} \sum_{k_2=k_1+1}^{K} \frac{\alpha_{ik_1} \cdot \alpha_{ik_2}^{T}}{|\alpha_{ik_1}| \cdot |\alpha_{ik_2}^{T}|} \tag{4}$$
where
$$Z_{ik} = [WH_{ik};\, B_{ik}], \quad \alpha_{ik} = (Z_{ik}(:))^{T}, \quad B_{ik} = [b_{11}, b_{21}, \cdots, b_{H1}]^{T}_{H \times 1},$$
In formula (3), $\alpha$ and $\beta$ denote two vectors and $\theta$ the angle between them. $WH_{ik}$, $B_{ik}$ and $Z_{ik}$ denote, respectively, the input weight matrix, the hidden-layer threshold vector and their concatenated matrix for the member ELM in the k-th dimension of the i-th particle; $\alpha_{ik}$ denotes the row vector obtained by unfolding $Z_{ik}$ column by column; and $\cos\theta_{ik_1k_2}$ denotes the cosine of the angle between the two ELMs represented by the $k_1$-th and $k_2$-th components of the i-th particle. If the cosine is smaller, i.e., the angle is larger, the similarity between the two ELMs is smaller and their diversity is greater; otherwise their diversity is smaller. fitness(i) is the sum of the pairwise similarity values among all member ELMs in the i-th particle; the smaller fitness(i) is, the smaller the overall similarity among the ELMs in the i-th particle, i.e., the greater the diversity among the members of the integrated system that this particle represents. Because each component of a particle's position is the index of a selected member ELM, the velocity is rounded to an integer after every iteration. In addition, each position component may only take values in [1, L]: a component greater than L is set to L, and a component smaller than 1 is set to 1;
Step 4.4: during the particle swarm optimization, compare the similarity value of each particle with that of the best position $P_i$ it has visited (its historical best position); if the current value is smaller, update the particle's historical best position to its current position;
Step 4.5: during the particle swarm optimization, compare the similarity value of each particle with that of the global best position $P_g$; if the current value is smaller, update the global best particle to the current particle;
Step 4.6: if neither the expected goal (a sufficiently good fitness value for the global best particle) nor the preset maximum number of iterations has been reached, return to step 4.2; otherwise go to step 4.7;
Step 4.7: output the global best particle, which represents the finally selected optimal set of member ELMs.
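The fitness of formula (4) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the patented code: `elm_vector` and `fitness` are hypothetical names, `library` is assumed to be a list of `(W, b)` pairs for the L candidate ELMs, and particle components are 0-based indices here, whereas the text numbers the ELMs from 1 to L.

```python
import numpy as np

def elm_vector(W, b):
    """Stack Z = [W; b] and unfold it column by column into one vector (alpha_ik)."""
    # Assumed layout: W is (n_inputs, H) and b holds the H hidden-unit thresholds.
    Z = np.vstack([W, b.reshape(1, -1)])
    return Z.flatten(order="F")

def fitness(particle, library):
    """Sum of pairwise cosine similarities among the ELMs indexed by one particle."""
    vecs = [elm_vector(*library[idx]) for idx in particle]
    total = 0.0
    for k1 in range(len(vecs)):
        for k2 in range(k1 + 1, len(vecs)):
            u, v = vecs[k1], vecs[k2]
            total += (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return total
```

Because both vectors are unfolded in the same order, the cosine does not depend on whether the unfolding is row-wise or column-wise; only consistency matters. A lower return value means a more diverse candidate group, so the swarm minimizes this quantity.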
Step 5 further comprises the following steps:
Step 5.1: compute the outputs of the K ELMs selected in step 4 on the cancer gene expression profile data set, determine the output class for every sample, and record the class labels. Formula (5) denotes the output class label matrix $YY$ of the K ELMs on the $N_{tr}$ samples of the cancer gene expression profile data set, where $YY_{l,k}$ is the output class of the k-th ELM on the l-th sample;
Step 5.2: obtain the voting weights of the integrated system by the minimum-norm least-squares method, i.e., obtain the minimum-norm least-squares solution of $\beta$ in formula (6) from formula (7);
The optimal voting weight vector $\beta$ of the integrated system should satisfy formula (6), where $T$ is the desired output class vector of the $N_{tr}$ samples of the cancer gene expression profile data set and $\beta_k$ is the voting weight of the k-th ELM in the integrated system:
$$(YY)\beta = T \tag{6}$$
where
$$T = [t_1, t_2, \ldots, t_{N_{tr}}]^{T}, \quad \beta = [\beta_1, \beta_2, \ldots, \beta_K]^{T}$$
The voting weight vector of the integrated system is obtained by the minimum-norm least-squares method as follows:
$$\hat{\beta} = YY^{+} T \tag{7}$$
where $YY^{+}$ is the Moore-Penrose generalized inverse of the output class label matrix of the K ELMs on the $N_{tr}$ samples.
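Formula (7) maps directly onto NumPy's pseudoinverse. The sketch below is illustrative only: `voting_weights` is a hypothetical name, and the small label matrix is made-up toy data with classes coded as -1 / +1.

```python
import numpy as np

def voting_weights(YY, T):
    """Minimum-norm least-squares solution of YY @ beta = T, as in formula (7)."""
    return np.linalg.pinv(YY) @ T

# Toy data (assumption): 5 samples, K = 3 member ELMs, labels coded -1 / +1.
YY = np.array([[ 1,  1, -1],
               [ 1, -1, -1],
               [-1, -1,  1],
               [-1,  1,  1],
               [ 1,  1,  1]], dtype=float)
T = np.array([1, 1, -1, -1, 1], dtype=float)  # desired labels; here equal to column 0
beta_hat = voting_weights(YY, T)
```

In this toy case the first member reproduces the desired labels exactly and the columns are linearly independent, so all of the weight goes to that member. With real data no member is perfect and the weights spread across members according to how well their label columns jointly fit T.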
Beneficial effects: high-dimensional, small-sample cancer gene expression profile data are variable and noisy, and although traditional methods can identify tumors from gene expression profiles, they suffer from low accuracy and excessive time overhead. Starting from the very fast convergence and good convergence accuracy of ELM, the invention builds an integrated ELM tumor recognition model on the basis of gene expression profile data analysis. Taking the gene expression profile data into account, it studies the diversity among member ELMs and proposes a member ELM selection method based on PSO and the similarity among member ELMs, together with an integration rule based on the minimum-norm least-squares method. Compared with existing gene expression profile tumor recognition methods, the invention greatly reduces the learning overhead of the integrated system and substantially improves the accuracy of gene expression profile tumor recognition.
Brief description of the drawings
Fig. 1 is a structural block diagram of the invention;
Fig. 2 is a flowchart of gene selection in the invention;
Fig. 3 is a flowchart of member ELM selection in the invention;
Fig. 4 is a flowchart of obtaining the integration weights of the member ELMs in the invention;
Fig. 5 is a curve of the prediction accuracy for Brain cancer subtypes in the invention;
Fig. 6 shows the relationship between the prediction accuracy for Brain cancer subtypes and the number of member ELMs in the invention;
Fig. 7 is the similarity curve of the member ELMs in the integrated system corresponding to the optimal particle in the invention.
Detailed description of the embodiments
A cancer gene expression profile data identification method based on integrated extreme learning machines comprises the selection of member ELMs and the integration of the member ELMs, where ELM stands for extreme learning machine. The method specifically comprises the following steps:
Step 1: preprocess the cancer gene expression profile data set, including gene selection and normalization of the tumor expression profile data;
Step 2: from the data set obtained in step 1, generate N sample sets in a given proportion by the Bagging method, and split each of the N sample sets in a given proportion into a training set and a validation set, giving N training sets and N validation sets;
Step 3: train one extreme learning machine on each of the N training sets from step 2 and, according to the recognition rate of each of the N ELMs on its corresponding validation set, select the L ELMs (L<N) with the highest recognition rates to form a candidate member ELM library;
Step 4: taking the diversity among the member ELMs of the integrated system as the optimization objective, use the standard particle swarm optimization (PSO) algorithm to select from the L ELMs the K member extreme learning machines (K<L) that make up the integrated system;
Step 5: compute the voting weights of the K member extreme learning machines by the minimum-norm least-squares method;
Step 6: combine the K member extreme learning machines using the K voting weights obtained, yielding an integrated ELM system, and use this system to perform tumor recognition on new cancer gene expression profile samples.
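The data handling in steps 1 and 2 can be sketched as below. The text does not fix the details, so several points are assumptions: normalization is done per gene by min-max scaling to [-1, 1], each bootstrap sample has as many draws as the data set has samples, and the names `normalize_genes` and `bagging_splits` are hypothetical. Gene selection is not shown.

```python
import numpy as np

def normalize_genes(X):
    """Linearly rescale each gene (column) of X to the interval [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero; constant genes map to -1
    return 2.0 * (X - lo) / span - 1.0

def bagging_splits(n_samples, n_sets, train_fraction, rng):
    """Draw n_sets bootstrap samples and split each into (train, validation) index arrays."""
    n_train = int(round(train_fraction * n_samples))
    splits = []
    for _ in range(n_sets):
        idx = rng.integers(0, n_samples, size=n_samples)  # sampling with replacement
        splits.append((idx[:n_train], idx[n_train:]))
    return splits
```

For instance, `bagging_splits(30, 70, 2 / 3, np.random.default_rng(0))` yields 70 index pairs with 20 training and 10 validation draws each, which corresponds to the N=70 and 2:1 settings of the example later in the description.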
Step 4 further comprises the following steps:
Step 4.1: initialize the position and velocity of each particle in the swarm. K ELMs are selected at random from the L ELMs, and their indices serve as the initial position of a particle, so that each particle represents one group of member learning machines; the initial velocity of each particle is drawn at random from (0,1). In the K-dimensional space, the position of the i-th particle is expressed as the vector $x_i=(x_{i1}, x_{i2}, \ldots, x_{iK})$, where $x_{ik}$ ($k=1,2,\ldots,K$) denotes the k-th member learning machine of the i-th integrated system, and the flight velocity of the particle is expressed as the vector $v_i=(v_{i1}, v_{i2}, \ldots, v_{iK})$.
Step 4.2: adjust the current velocity and position of each particle according to formulas (1) and (2), where $w$ is the inertia weight, $c_1$ and $c_2$ are the acceleration constants, and $p_{id}$ and $p_{gd}$ are the d-th components of the particle's historical best position $P_i$ and of the global best position $P_g$.
$$v_{id}(t+1) = w \times v_{id}(t) + c_1 \times \mathrm{rand}(t) \times (p_{id}(t) - x_{id}(t)) + c_2 \times \mathrm{rand}(t) \times (p_{gd}(t) - x_{id}(t)) \tag{1}$$
$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1) \tag{2}$$
Step 4.3: compute the fitness value of each particle according to formula (4).
To increase the diversity among the member ELMs of the integrated system and thereby improve its generalization ability, the invention uses the similarity among members as the fitness function in the standard PSO. Here, the similarity between any two ELMs is the cosine of the angle between two vectors, each obtained by converting the input-layer weight matrix and the hidden-unit threshold vector of one of the ELMs. A smaller cosine means a larger angle between the two vectors, which indicates a larger difference between the input weight matrices and hidden-unit thresholds of the two ELMs, and hence greater diversity between the two ELMs. The invention thus converts the difference between members into the cosine of the angle between two vectors and computes the diversity among members from this cosine, as follows:
$$\cos\theta = \frac{\alpha \cdot \beta}{|\alpha| \cdot |\beta|} \tag{3}$$
$$\mathrm{fitness}(i) = \sum_{k_1=1}^{K} \sum_{k_2=k_1+1}^{K} \cos\theta_{ik_1k_2} = \sum_{k_1=1}^{K} \sum_{k_2=k_1+1}^{K} \frac{\alpha_{ik_1} \cdot \alpha_{ik_2}^{T}}{|\alpha_{ik_1}| \cdot |\alpha_{ik_2}^{T}|} \tag{4}$$
where
$$Z_{ik} = [WH_{ik};\, B_{ik}], \quad \alpha_{ik} = (Z_{ik}(:))^{T}, \quad B_{ik} = [b_{11}, b_{21}, \cdots, b_{H1}]^{T}_{H \times 1},$$
In formula (3), $\alpha$ and $\beta$ denote two vectors and $\theta$ the angle between them. $WH_{ik}$, $B_{ik}$ and $Z_{ik}$ denote, respectively, the input weight matrix, the hidden-layer threshold vector and their concatenated matrix for the member ELM in the k-th dimension of the i-th particle; $\alpha_{ik}$ denotes the row vector obtained by unfolding $Z_{ik}$ column by column; and $\cos\theta_{ik_1k_2}$ denotes the cosine of the angle between the two ELMs represented by the $k_1$-th and $k_2$-th components of the i-th particle. If the cosine is smaller, i.e., the angle is larger, the similarity between the two ELMs is smaller and their diversity is greater; otherwise their diversity is smaller. fitness(i) is the sum of the pairwise similarity values among all member ELMs in the i-th particle; the smaller fitness(i) is, the smaller the overall similarity among the ELMs in the i-th particle, i.e., the greater the diversity among the members of the integrated system that this particle represents.
In the invention, because each component of a particle's position is the index of a selected member ELM, the velocity must be rounded to an integer after every iteration. In addition, each position component may only take values in [1, L]: a component greater than L is set to L, and a component smaller than 1 is set to 1.
Step 4.4: during the particle swarm optimization, compare the similarity value of each particle with that of the best position $P_i$ it has visited (its historical best position); if the current value is smaller, update the particle's historical best position to its current position.
Step 4.5: during the particle swarm optimization, compare the similarity value of each particle with that of the global best position $P_g$; if the current value is smaller, update the global best particle to the current particle.
Step 4.6: if neither the expected goal (a sufficiently good fitness value for the global best particle) nor the preset maximum number of iterations has been reached, return to step 4.2; otherwise go to step 4.7;
Step 4.7: output the global best particle, which represents the finally selected optimal set of member ELMs.
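One iteration of formulas (1) and (2), together with the rounding and clipping rules described above, might look like the following sketch. The name `pso_step` is hypothetical, the two `rand(t)` factors are assumed to be independent uniform draws in [0, 1) per component (the text does not say whether they are shared), and nothing here prevents a particle from containing the same ELM index twice, a case the text does not address.

```python
import numpy as np

def pso_step(x, v, p_best, g_best, L, rng, w, c1, c2):
    """One velocity/position update for integer-coded particles.

    x, v, p_best: arrays of shape (n_particles, K); g_best: array of shape (K,).
    Position components are ELM indices in [1, L], as in the text.
    """
    r1 = rng.random(x.shape)  # assumption: independent random factors per component
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # formula (1)
    v_new = np.rint(v_new)                          # round so positions stay integer indices
    x_new = np.clip(x + v_new, 1, L).astype(int)    # formula (2), then confine to [1, L]
    return x_new, v_new
```

The caller would then evaluate the fitness of each row of `x_new` and refresh `p_best` and `g_best` as in steps 4.4 and 4.5 whenever a smaller similarity value is found.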
Step 5 further comprises the following steps:
Step 5.1: compute the outputs of the K ELMs selected in step 4 on the cancer gene expression profile data set, determine the output class for every sample, and record the class labels. Formula (5) denotes the output class label matrix $YY$ of the K ELMs on the $N_{tr}$ samples of the cancer gene expression profile data set, where $YY_{l,k}$ is the output class of the k-th ELM on the l-th sample;
Step 5.2: obtain the voting weights of the integrated system by the minimum-norm least-squares method, i.e., obtain the minimum-norm least-squares solution of $\beta$ in formula (6) from formula (7).
The optimal voting weight vector $\beta$ of the integrated system should satisfy formula (6), where $T$ is the desired output class vector of the $N_{tr}$ samples of the cancer gene expression profile data set and $\beta_k$ is the voting weight of the k-th ELM in the integrated system:
$$(YY)\beta = T \tag{6}$$
where
$$T = [t_1, t_2, \ldots, t_{N_{tr}}]^{T}, \quad \beta = [\beta_1, \beta_2, \ldots, \beta_K]^{T}$$
In the invention, the voting weight vector of the integrated system is obtained by the minimum-norm least-squares method as follows:
$$\hat{\beta} = YY^{+} T \tag{7}$$
Here $YY^{+}$ is the Moore-Penrose generalized inverse of the output class label matrix of the K ELMs on the $N_{tr}$ samples. This improvement addresses the overly simple integration rules of existing integrated ELMs, such as majority voting: the invention uses the minimum-norm least-squares method to obtain the voting weight of each member ELM, which further improves the generalization performance of the integrated system and hence the accuracy of gene expression profile tumor recognition.
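To make the contrast with majority voting concrete, the sketch below applies a weight vector to the member labels of new samples. The decoding rule is an assumption, since the text does not state how the weighted outputs are turned into a class: labels are taken to be coded -1 / +1 and the sign of the weighted sum decides. The function names and the toy arrays are hypothetical.

```python
import numpy as np

def ensemble_predict(YY_new, beta):
    """Weighted vote: sign of the weighted sum of member labels (coded -1 / +1)."""
    return np.where(YY_new @ beta >= 0.0, 1, -1)

def majority_vote(YY_new):
    """Unweighted baseline: every member ELM casts one equal vote."""
    return np.where(YY_new.sum(axis=1) >= 0, 1, -1)

# Toy labels of K = 3 member ELMs on two new samples (made-up values).
YY_new = np.array([[ 1, -1, -1],
                   [-1,  1,  1]])
beta = np.array([1.0, 0.0, 0.0])  # weights that trust only the first member
weighted = ensemble_predict(YY_new, beta)
unweighted = majority_vote(YY_new)
```

With these toy weights the two rules disagree on both samples: the weighted vote follows the first member, while the majority follows the other two. This is the kind of case in which learned voting weights can change the outcome.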
The implementation of the invention is briefly described below using cancer gene expression profile data as an example. This example uses the brain cancer (Brain cancer) data set, which contains 60 samples in total, each with 7129 genes. The samples fall into two classes: 46 samples from patients with classic tumors and 14 samples from patients with desmoplastic tumors. The data come from http://linus.nci.nih.gov/~brb/DataArchive_New.html. Although the brain cancer data set has only two tumor subtypes, the expression differences of its genes between the two classes are not pronounced, so many classical methods (such as k-nearest neighbors and SVM) achieve low prediction accuracy on it. On this data set, the invention is carried out as follows:
As shown in Figure 1, a cancer gene expression profile data identification method based on integrated extreme learning machines comprises the selection of member ELMs and the construction of the integration rule, which are carried out in the following steps:
(1) Normalize each gene expression level to [-1,1] by a linear transformation. Gene selection is then performed with the KMeans-GCSI-MBPSO-ELM method that we proposed (Han F, Sun W, Ling Q-H (2014) A Novel Strategy for Gene Selection of Microarray Data Based on Gene-to-Class Sensitivity Information. PLoS ONE 9(5): e97530. doi:10.1371/journal.pone.0097530). As shown in Figure 2, the method takes gene-to-class sensitivity information (GCSI) into account and uses K-means clustering combined with binary PSO and ELM to select the gene subsets most relevant to tumor classification. Table 1 lists the 50 genes selected most frequently when gene selection is performed on the Brain cancer expression profile data with the KMeans-GCSI-MBPSO-ELM method.
Table 1. The 50 genes selected most frequently when gene selection is performed on the Brain cancer expression profile data with the KMeans-GCSI-MBPSO-ELM method
(2) Divide the brain cancer expression profile data set into an original sample data set and a new sample data set at a ratio of 1:1, and then, on the original sample data set, generate N (N=70) training sets and validation sets at a ratio of 2:1 by the Bagging method (random sampling with replacement). Train one ELM on each training set (in this embodiment, each ELM has 300 hidden nodes and the hidden-unit activation function is the sigmoid function), and, according to the recognition rate of each ELM on its corresponding validation set, preliminarily select the L (L=20) ELMs with the highest validation accuracy to form the initial candidate member ELM library.
(3) The standard particle swarm optimization (PSO) algorithm is used to select, from the L ELMs, the K member extreme learning machines (K<L) of the integrated system (in the present embodiment, K=9 was determined by comparing multiple experiments). As shown in Figure 3, the concrete steps are as follows:
1. Initialize the position and velocity of each particle in the swarm: K ELMs are randomly selected from the L ELMs, and the indices of these K ELMs are used as the initial position of the particle, so that each particle represents one group of member learning machines. The initial value of each component of the particle velocity $v_i = (v_{i1}, v_{i2}, \ldots, v_{iK})$ is drawn at random from (0, 1). In the K-dimensional space, the position of the i-th particle is expressed as the vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{iK})$, $k = 1, 2, \ldots, K$, where $x_{ik}$ denotes the k-th member learning machine of the i-th integrated system. In the present embodiment, the population size is 30.
2. Adjust the current velocity and position of each particle according to formulas (1) and (2). In the present embodiment, the inertia weight w is set to 2, and the acceleration constants $c_1$ and $c_2$ are 0.9 and 1.7 respectively.
3. Calculate the fitness value of each particle according to formula (4). The fitness function of a particle is described by formulas (3) and (4). In the present embodiment, because each component of a particle position is the index of a selected member ELM, the velocity must be rounded to an integer at every iteration. In addition, particle positions may only take values in [1, L]: if a component of a particle position exceeds L it is set to L, and if it is less than 1 it is set to 1.
4. During the particle swarm optimization, compare the similarity value of each particle with the similarity value of the best position $P_i$ it has visited (its historical best position); if the current value is smaller, update the historical best position of that particle to its current position.
5. During the particle swarm optimization, compare the similarity value of each particle with the similarity value of the global best position $P_g$; if the current value is smaller, update the global best particle to the current particle.
6. If the preset maximum number of iterations (30 in the present embodiment) has not been reached, return to step 2; otherwise output the global best particle, which represents the finally selected optimal set of member ELMs.
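Steps 1 to 6 above can be sketched as the following loop. This is an illustration under assumptions rather than the inventors' implementation: `fitness` is a hypothetical callable that maps a position (K indices in 1..L) to its similarity value, and the random factors in formula (1) are drawn independently for every component, which the source does not specify.

```python
import numpy as np

def pso_select(fitness, L, K, n_particles=30, n_iter=30,
               w=2.0, c1=0.9, c2=1.7, seed=0):
    """Select K ELM indices (1..L) that minimize `fitness` with a standard PSO."""
    rng = np.random.default_rng(seed)
    # Step 1: each position is K distinct indices from 1..L; velocities in (0, 1).
    x = np.array([rng.choice(L, size=K, replace=False) + 1
                  for _ in range(n_particles)])
    v = rng.uniform(0.0, 1.0, size=(n_particles, K))
    p_best = x.copy()
    p_fit = np.array([fitness(p) for p in x], dtype=float)
    g = int(np.argmin(p_fit))
    g_best, g_fit = p_best[g].copy(), p_fit[g]
    for _ in range(n_iter):
        r1 = rng.uniform(size=(n_particles, K))
        r2 = rng.uniform(size=(n_particles, K))
        # Step 2, formula (1): velocity update, rounded because positions are indices.
        v = np.rint(w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x))
        # Step 2, formula (2): position update, clamped to [1, L].
        x = np.clip(x + v, 1, L).astype(int)
        # Step 3: fitness (similarity value) of every particle.
        fit = np.array([fitness(p) for p in x], dtype=float)
        # Step 4: update each particle's historical best position.
        better = fit < p_fit
        p_best[better] = x[better]
        p_fit[better] = fit[better]
        # Step 5: update the global best particle.
        g = int(np.argmin(p_fit))
        if p_fit[g] < g_fit:
            g_best, g_fit = p_best[g].copy(), p_fit[g]
    # Step 6: after the maximum number of iterations, return the global best.
    return g_best, g_fit
```

The default parameter values are those of the embodiment. After an update a position may contain the same index more than once; the source does not say how such duplicates are treated, so the sketch leaves them untouched.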
(4) The minimum-norm least-squares method is used to solve for the integrated voting weights of the K member extreme learning machines. As shown in Figure 4, the concrete steps are as follows:
1. From the outputs on the original sample data set of the K ELMs selected in step (3), determine the output class of every sample in the original sample data set and record the class labels, as shown in formula (5).
2. Obtain the voting weights of the integrated system by the minimum-norm least-squares method, that is, obtain the minimum-norm least-squares solution of β in formula (6) by means of formula (7).
(5) The K member extreme learning machines are combined using the K voting weights obtained in step (4), giving an integrated ELM system, which is then used to perform tumor identification on 30 test samples. Figure 5 shows the tumor identification accuracy over 50 independent runs of the present invention. The average accuracy reaches 93.48%, which is far higher than the recognition accuracy of single classifiers on the Brain cancer data under the same gene selection method (for example, 80.40% for ELM and 80.55% for SVM) and higher than that of a simple integrated ELM on the Brain cancer data (88.17%).
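Steps (4) and (5) can be sketched with NumPy's pseudo-inverse. `voting_weights` follows formulas (6) and (7) directly. The source does not state how the weighted output is converted into a class, so `ensemble_predict` snaps the weighted sum to the nearest valid numeric label; that decision rule, and both function names, are assumptions made for illustration.

```python
import numpy as np

def voting_weights(YY, T):
    """Minimum-norm least-squares solution of YY @ beta = T (formulas (6), (7)).

    YY: (Ntr x K) matrix of class labels output by the K member ELMs.
    T:  (Ntr,) vector of desired class labels.
    """
    return np.linalg.pinv(np.asarray(YY, dtype=float)) @ np.asarray(T, dtype=float)

def ensemble_predict(yy_new, beta, labels):
    """Weighted combination of member labels for new samples.

    Assumption: the weighted sum is mapped to the nearest valid label.
    """
    score = np.asarray(yy_new, dtype=float) @ beta
    labels = np.asarray(labels)
    nearest = np.argmin(np.abs(score[:, None] - labels[None, :]), axis=1)
    return labels[nearest]
```

For example, if one member ELM already reproduces T exactly and the label matrix has full column rank, the solution assigns that member a weight of 1 and the others a weight of 0.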

Claims (3)

1. A method for identifying cancer gene expression profile data based on an integrated (ensemble) extreme learning machine, comprising the selection of member ELMs and the integration of the member ELMs, characterized in that it comprises the following steps:
Step 1: preprocessing the tumor gene expression profile data set, including gene selection and normalization of the tumor expression profile data;
Step 2: generating N sample sets in a given proportion from the data set obtained in step 1 by the Bagging method, and splitting these N sample sets, again in a given proportion, into N training sets and N validation sets;
Step 3: training one extreme learning machine on each of the N training sets of step 2, giving N ELMs, and selecting the L ELMs (L<N) with the highest recognition rate on their corresponding validation sets to form the candidate member ELM pool;
Step 4: taking the diversity among the member ELMs of the integrated system as the optimization objective, using the standard particle swarm optimization algorithm to select from the L ELMs the K member extreme learning machines (K<L) that make up the integrated system;
Step 5: calculating the integrated voting weights of the K member extreme learning machines by the minimum-norm least-squares method;
Step 6: combining the K member extreme learning machines using the K voting weights obtained, giving an integrated ELM system, and using this integrated ELM system to perform tumor identification on newly added tumor gene expression profile samples.
2. The method for identifying cancer gene expression profile data based on an integrated extreme learning machine according to claim 1, characterized in that step 4 further comprises the following steps:
Step 4.1: initializing the position and velocity of each particle in the swarm; K ELMs are randomly selected from the L ELMs and their indices are used as the initial position of the particle, so that each particle represents one group of member learning machines, and the initial velocity of the particle is drawn at random from (0, 1); in the K-dimensional space, the position of the i-th particle is expressed as the vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{iK})$, $k = 1, 2, \ldots, K$, where $x_{ik}$ denotes the k-th member learning machine of the i-th integrated system, and the flight velocity of the particle is expressed as the vector $v_i = (v_{i1}, v_{i2}, \ldots, v_{iK})$;
Step 4.2: adjusting the current velocity and position of each particle according to the following formulas:
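$$v_{id}(t+1) = w \times v_{id}(t) + c_1 \times \mathrm{rand}(t) \times \bigl(p_{id}(t) - x_{id}(t)\bigr) + c_2 \times \mathrm{rand}(t) \times \bigl(p_{gd}(t) - x_{id}(t)\bigr) \tag{1}$$
$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1) \tag{2}$$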
Step 4.3: calculating the fitness value of each particle according to formula (4);
The similarity among members is used as the fitness function of the standard particle swarm optimization. Similarity here means the cosine of the angle between the two vectors obtained by converting the input-layer weight matrix and the hidden-unit threshold vector of any two ELMs. A smaller cosine value means a larger angle between the two vectors, which indicates a larger difference between the input weight matrices and hidden-unit thresholds of the two ELMs and therefore a larger diversity between the two ELMs. The diversity among members is thus expressed through the cosine of the angle between two vectors and is calculated as follows:
$$\cos\theta = \frac{\alpha \cdot \beta}{|\alpha| \cdot |\beta|} \tag{3}$$
$$\mathrm{fitness}(i) = \sum_{k_1=1}^{K} \sum_{k_2=k_1+1}^{K} \cos\theta_{ik_1k_2} = \sum_{k_1=1}^{K} \sum_{k_2=k_1+1}^{K} \frac{\alpha_{ik_1} \cdot \alpha_{ik_2}^{T}}{|\alpha_{ik_1}| \cdot |\alpha_{ik_2}^{T}|} \tag{4}$$
where
$$Z_{ik} = [WH_{ik};\, B_{ik}], \quad \alpha_{ik} = \bigl(Z_{ik}(:)\bigr)^{T}, \quad B_{ik} = [b_{11}, b_{21}, \ldots, b_{H1}]^{T}_{H \times 1},$$
$$WH_{ik} = \begin{bmatrix} wh_{11} & wh_{12} & \cdots & wh_{1n} \\ wh_{21} & wh_{22} & \cdots & wh_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ wh_{H1} & wh_{H2} & \cdots & wh_{Hn} \end{bmatrix}_{H \times n};$$
In formula (3), α and β denote two vectors and θ denotes the angle between them. $WH_{ik}$, $B_{ik}$ and $Z_{ik}$ denote, respectively, the input weight matrix, the hidden-layer threshold vector and their concatenated matrix for the member ELM in the k-th dimension of the i-th particle, and $\alpha_{ik}$ denotes the row vector obtained by flattening $Z_{ik}$. $\cos\theta_{ik_1k_2}$ denotes the cosine of the angle between the two ELMs represented by the $k_1$-th and $k_2$-th components of the i-th particle. A smaller cosine value, that is, a larger angle, means lower similarity and therefore greater diversity between the two ELMs; conversely, a larger cosine value means less diversity between them. fitness(i) denotes the sum of the similarity values over all pairs of member ELMs in the i-th particle. A smaller fitness(i) means that the overall similarity among the ELMs in the i-th particle is lower, that is, the diversity among the members of the integrated system represented by this particle is greater. Each component of a particle position is the index of a selected member ELM, so the velocity is rounded to an integer after every iteration. In addition, particle positions may only take values in [1, L]: if a component of a particle position exceeds L it is set to L, and if it is less than 1 it is set to 1;
Step 4.4: during the particle swarm optimization, comparing the similarity value of each particle with the similarity value of the best position $P_i$ it has visited (its historical best position), and, if the current value is smaller, updating the historical best position of that particle to its current position;
Step 4.5: during the particle swarm optimization, comparing the similarity value of each particle with the similarity value of the global best position $P_g$, and, if the current value is smaller, updating the global best particle to the current particle;
Step 4.6: if neither the expected goal (the global best particle has a sufficiently good fitness value) nor the preset maximum number of iterations has been reached, returning to step 4.2; otherwise going to step 4.7;
Step 4.7: outputting the global best particle, which represents the finally selected optimal set of member ELMs.
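The similarity-based fitness of formulas (3) and (4) can be sketched as follows. The candidate pool is represented here as a list of `(WH, B)` pairs, which is a hypothetical data layout, and the two arrays are flattened and concatenated into one vector per ELM. The flattening order does not affect the cosine as long as every ELM is flattened the same way.

```python
import numpy as np
from itertools import combinations

def elm_vector(WH, B):
    """alpha: input weights (H x n) and hidden thresholds (H,) as one flat vector."""
    return np.concatenate([np.ravel(WH), np.ravel(B)])

def fitness(position, pool):
    """Formula (4): sum of pairwise cosines over the ELMs indexed by `position`.

    position: iterable of 1-based indices into `pool`.
    pool:     list of (WH, B) pairs (hypothetical layout).
    A smaller value means lower similarity, i.e. greater diversity.
    """
    vecs = [elm_vector(*pool[k - 1]) for k in position]
    total = 0.0
    for a, c in combinations(vecs, 2):
        # Formula (3): cosine of the angle between the two ELM vectors.
        total += float(a @ c) / (np.linalg.norm(a) * np.linalg.norm(c))
    return total
```

Two ELMs with orthogonal parameter vectors contribute 0 to the sum, while two identical ELMs contribute 1, so a particle that selects the same ELM twice is penalized.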
3. The method for identifying cancer gene expression profile data based on an integrated extreme learning machine according to claim 1, characterized in that step 5 further comprises the following steps:
Step 5.1: calculating the outputs on the tumor gene expression profile data set of the K ELMs selected in step 4, determining the output class of every sample and recording the class labels; formula (5) represents the output class label matrix of the K ELMs on the Ntr samples of the tumor gene expression profile data set, where $YY_{l,k}$ denotes the output class of the k-th ELM on the l-th sample;
$$YY = \begin{bmatrix} YY_{1,1} & YY_{1,2} & \cdots & YY_{1,K} \\ YY_{2,1} & YY_{2,2} & \cdots & YY_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ YY_{Ntr,1} & YY_{Ntr,2} & \cdots & YY_{Ntr,K} \end{bmatrix} \tag{5}$$
Step 5.2: obtaining the voting weights of the integrated system by the minimum-norm least-squares method, that is, obtaining the minimum-norm least-squares solution of β in formula (6) by means of formula (7);
The optimal voting weight vector β of the integrated system should satisfy formula (6), where T denotes the vector of desired output classes of the Ntr samples of the tumor gene expression profile data set and $\beta_k$ denotes the voting weight of the k-th ELM in the integrated system:
$$(YY)\,\beta = T \tag{6}$$
where
$$T = [t_1, t_2, \ldots, t_{Ntr}]^{T}, \quad \beta = [\beta_1, \beta_2, \ldots, \beta_K]^{T}$$
The voting weight vector of the integrated system is obtained by the minimum-norm least-squares method as follows:
$$\hat{\beta} = YY^{+}\, T \tag{7}$$
where $YY^{+}$ is the Moore-Penrose generalized inverse of the output class label matrix of the K ELMs on the Ntr samples.
CN201410773130.9A 2014-12-15 2014-12-15 Cancer gene expression profile data identification method based on integration of extreme learning machines Pending CN104463251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410773130.9A CN104463251A (en) 2014-12-15 2014-12-15 Cancer gene expression profile data identification method based on integration of extreme learning machines

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410773130.9A CN104463251A (en) 2014-12-15 2014-12-15 Cancer gene expression profile data identification method based on integration of extreme learning machines

Publications (1)

Publication Number Publication Date
CN104463251A true CN104463251A (en) 2015-03-25

Family

ID=52909265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410773130.9A Pending CN104463251A (en) 2014-12-15 2014-12-15 Cancer gene expression profile data identification method based on integration of extreme learning machines

Country Status (1)

Country Link
CN (1) CN104463251A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105117525A (en) * 2015-07-31 2015-12-02 天津工业大学 Bagging extreme learning machine integrated modeling method
CN106951728A (en) * 2017-03-03 2017-07-14 江苏大学 A kind of tumour key gene recognition methods based on particle group optimizing and marking criterion
CN107121407A (en) * 2017-06-02 2017-09-01 中国计量大学 The method that near-infrared spectrum analysis based on PSO RICAELM differentiates Cuiguan pear maturity
CN107908927A (en) * 2017-10-27 2018-04-13 福州大学 Based on the disease lncRNA Relationship Prediction methods for improving PSO and ELM
CN108920900A (en) * 2018-06-21 2018-11-30 福州大学 The unsupervised extreme learning machine Feature Extraction System and method of gene expression profile data
CN110033041A (en) * 2019-04-13 2019-07-19 湖南大学 A kind of gene expression profile distance metric method based on deep learning
CN110310703A (en) * 2019-06-25 2019-10-08 中国人民解放军军事科学院军事医学研究院 Prediction technique, device and the computer equipment of drug

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105117525A (en) * 2015-07-31 2015-12-02 天津工业大学 Bagging extreme learning machine integrated modeling method
CN105117525B (en) * 2015-07-31 2018-05-15 天津工业大学 Bagging extreme learning machine integrated modelling approach
CN106951728A (en) * 2017-03-03 2017-07-14 江苏大学 A kind of tumour key gene recognition methods based on particle group optimizing and marking criterion
CN106951728B (en) * 2017-03-03 2020-08-28 江苏大学 Tumor key gene identification method based on particle swarm optimization and scoring criterion
CN107121407A (en) * 2017-06-02 2017-09-01 中国计量大学 The method that near-infrared spectrum analysis based on PSO RICAELM differentiates Cuiguan pear maturity
CN107908927A (en) * 2017-10-27 2018-04-13 福州大学 Based on the disease lncRNA Relationship Prediction methods for improving PSO and ELM
CN108920900A (en) * 2018-06-21 2018-11-30 福州大学 The unsupervised extreme learning machine Feature Extraction System and method of gene expression profile data
CN110033041A (en) * 2019-04-13 2019-07-19 湖南大学 A kind of gene expression profile distance metric method based on deep learning
CN110033041B (en) * 2019-04-13 2022-05-03 湖南大学 Gene expression spectrum distance measurement method based on deep learning
CN110310703A (en) * 2019-06-25 2019-10-08 中国人民解放军军事科学院军事医学研究院 Prediction technique, device and the computer equipment of drug
CN110310703B (en) * 2019-06-25 2021-09-07 中国人民解放军军事科学院军事医学研究院 Medicine prediction method and device and computer equipment

Similar Documents

Publication Publication Date Title
CN104463251A (en) Cancer gene expression profile data identification method based on integration of extreme learning machines
CN107292350A (en) The method for detecting abnormality of large-scale data
CN1197025C (en) Enhancing knowledge discovery from multiple data sets using multiple support vector machines
CN104751469B (en) The image partition method clustered based on Fuzzy c-means
Shi et al. Multi-label ensemble learning
CN106339416A (en) Grid-based data clustering method for fast researching density peaks
CN104794482A (en) Inter-class maximization clustering algorithm based on improved kernel fuzzy C mean value
CN101833671A (en) Support vector machine-based surface electromyogram signal multi-class pattern recognition method
CN103425994B (en) A kind of feature selection approach for pattern classification
CN106156401A (en) Data-driven system state model on-line identification methods based on many assembled classifiers
CN103489033A (en) Incremental type learning method integrating self-organizing mapping and probability neural network
CN105550715A (en) Affinity propagation clustering-based integrated classifier constructing method
CN106548041A (en) A kind of tumour key gene recognition methods based on prior information and parallel binary particle swarm optimization
CN108171012A (en) A kind of gene sorting method and device
Suo et al. Application of clustering analysis in brain gene data based on deep learning
CN110263834A (en) A kind of detection method of new energy power quality exceptional value
CN104966106A (en) Biological age step-by-step predication method based on support vector machine
CN104200134A (en) Tumor gene expression data feature selection method based on locally linear embedding algorithm
CN109376790A (en) A kind of binary classification method based on Analysis of The Seepage
CN102750545A (en) Pattern recognition method capable of achieving cluster, classification and metric learning simultaneously
Singh et al. A neighborhood search based cat swarm optimization algorithm for clustering problems
CN103793600A (en) Isolated component analysis and linear discriminant analysis combined cancer forecasting method
CN105550711A (en) Firefly algorithm based selective ensemble learning method
Zhang et al. An Algorithm Research for Prediction of Extreme Learning Machines Based on Rough Sets.
Lu et al. Cancer classification through filtering progressive transductive support vector machine based on gene expression data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150325