CN111748634A

CN111748634A - Characteristic lincRNA expression profile combination and early prediction method of colon cancer

Info

Publication number: CN111748634A
Application number: CN202010775531.3A
Authority: CN
Inventors: 贺轲; 向国安; 李文兴; 陈小勋; 黄许森
Original assignee: Guangdong No 2 Peoples Hospital
Current assignee: Guangdong No 2 Peoples Hospital
Priority date: 2020-08-04
Filing date: 2020-08-04
Publication date: 2020-10-09

Abstract

The invention discloses a characteristic lincRNA expression profile combination and an early prediction method of colon cancer, wherein the nucleotide probe sequences of the characteristic lincRNA expression profile combination are shown as SEQ ID NO. 1-15. The method for evaluating the early risk of colon cancer based on the characteristics of the lincRNA expression profile combination has high precision and accuracy (the area under the ROC curve, AUC, is 1.000). Only the relative expression levels of the 15 lincRNAs need to be obtained; the probability of early-stage colon cancer is then calculated by a support vector machine model and can serve as a reference for the early prediction of colon cancer.

Description

Characteristic lincRNA expression profile combination and early prediction method of colon cancer

Technical Field

The invention belongs to the field of biotechnology and medicine, and particularly relates to a characteristic lincRNA expression profile combination and an early prediction method of colon cancer.

Background

Colon cancer is a common malignancy of the digestive tract that occurs in the colon, often at the junction of the rectum and the sigmoid colon. The male-to-female ratio of colon cancer is about 2-3:1, and the incidence is highest in people aged 40-50 years. Patients with chronic colitis or colonic polyps, obese men, and similar groups are susceptible. Early colon cancer has no obvious symptoms, so early diagnosis is difficult. Global Burden of Disease (GBD) data show that more than 9.3 million people worldwide had colorectal cancer in 2017, of whom as many as 2.35 million were in China. Colorectal cancer caused about 900,000 deaths worldwide in 2017, accounting for 1.60% of all deaths. In China, it caused about 190,000 deaths in 2017, accounting for 1.79% of all deaths. Statistics show a continuous increase in global colorectal cancer prevalence and mortality from 1990 to 2017. The prevalence and mortality of colorectal cancer in China were below the global average before 2010, but since 2010 they have risen sharply above the global average.

A Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification in a supervised learning manner; its decision boundary is the maximum-margin hyperplane solved for the learning samples. The SVM model represents instances as points in space, mapped so that instances of the different classes are separated by as wide a margin as possible. New instances are then mapped into the same space, and their class is predicted according to which side of the margin they fall on. When the training data are linearly separable, the SVM learns the classifier by hard-margin maximization. When the training data are not linearly separable, the SVM learns the classifier by using the kernel trick and soft-margin maximization. SVMs are powerful on medium-sized data sets whose features have similar meanings and are also suitable for small data sets. In general, SVMs predict well on data sets with fewer than 10,000 samples. SVMs are widely applied in disease diagnosis, tumor classification, tumor gene recognition, and the like.

Early diagnosis of tumors has long been a difficult problem in the medical community. Most existing early diagnosis methods observe the expression level of a single marker or a single class of markers, and it is difficult to achieve an ideal diagnostic effect. Because the expression distributions of these markers in tumor patients and normal populations partially overlap, it is difficult to define a cut-off that cleanly separates tumor patients from normal populations. Therefore, using a combination of the expression signatures of multiple markers may be an effective method for early diagnosis of tumors. Long intergenic non-coding RNA (lincRNA) is a type of non-coding single-stranded RNA molecule longer than 200 nucleotides that is transcribed from intergenic non-coding sequence. lincRNA has no coding potential and is not conserved between different species. Research shows that lincRNAs are involved in regulating the expression of multiple genes, and that lincRNAs are relatively stably expressed in the human body and easy to detect. Since the expression distributions of individual lincRNA molecules in tumor patients and normal populations overlap, it is difficult to define a critical value for early diagnosis.

Therefore, there is a need to establish a more stable predictive model of a combination of multiple differential lincRNA expression signatures that facilitates early prediction of colon cancer.

Disclosure of Invention

In view of the above, the present invention provides a method for early prediction of colon cancer by using a combination of characteristic lincRNA expression profiles.

In order to solve the technical problems, the invention discloses a characteristic lincRNA expression profile combination, which comprises AC005332.6, AC008124.1, AC090114.2, BAIAP2-DT, HEIH, LINC00294, LINC00476, LINC00667, LINC00847, LINC01559, MIR194-2HG, MIR22HG, PVT1, SNHG15 and TP53TG1, wherein a nucleotide sequence probe of the characteristic lincRNA expression profile combination is shown as SEQ ID NO.1-SEQ ID NO. 15.

The invention also discloses an early prediction method of colon cancer based on the characteristic lincRNA expression profile combination, which comprises the following steps:

step 1, obtaining characteristic lincRNA stably and differentially expressed by a patient with early colon cancer;

step 2, selecting characteristic lincRNA expression data, and carrying out data standardization on each sample;

step 3, constructing an early prediction model for the standardized data by using a support vector machine;

step 4, carrying out early prediction according to the expression level of the characteristic lincRNAs of the patient;

the method is used for purposes other than the diagnosis and treatment of disease.

Optionally, obtaining the characteristic lincRNA stably and differentially expressed in early colon cancer patients in step 1 specifically comprises:

step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of colon cancer patients from the Genomic Data Commons Data Portal database to obtain the read counts (sequencing read values) of the tumor tissue gene expression profiles of the colon cancer patients, and carrying out logarithmic conversion;

step 1.2, selecting lincRNAs with a certain expression abundance, namely lincRNAs whose read counts in all samples are greater than or equal to 10; taking the logarithm of the read counts of all the lincRNAs, setting the total number of samples as n, the total number of screened lincRNAs as m, v as the read counts of a lincRNA, and u as the expression value after taking the logarithm, there is:

$$u_{ij} = \log_2 v_{ij}, \quad i \in \{1, \dots, n\},\ j \in \{1, \dots, m\} \tag{1}$$

wherein i is the sample number, j is the lincRNA number, $u_{ij}$ is the log-transformed expression value of the jth lincRNA in the ith sample, and $v_{ij}$ is the read counts of the jth lincRNA in the ith sample;
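As an illustration of step 1.2, the abundance filter and the logarithmic conversion of formula (1) can be sketched in Python as follows. This is a minimal sketch rather than the inventors' implementation: the function name `filter_and_log` and the dictionary layout (lincRNA name mapped to one read count per sample) are assumptions made for the example.

```python
from math import log2


def filter_and_log(counts, min_count=10):
    """Keep lincRNAs whose read counts are >= min_count in every sample,
    then return their log2-transformed expression values.

    counts: dict mapping a lincRNA name to a list of read counts,
            one entry per sample (hypothetical layout).
    """
    return {
        name: [log2(v) for v in values]
        for name, values in counts.items()
        if all(v >= min_count for v in values)
    }
```

Because every retained count is at least 10, the logarithm is always defined and no pseudocount is needed in this sketch.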

step 1.3, selecting colon cancer patients with disease stages of I and II, and recording the patients as early colon cancer patients, wherein the total number of the early colon cancer patients is n';

step 1.4, selecting the lincRNAs stably expressed in the tumor samples and the normal samples, namely the lincRNAs with a coefficient of variation smaller than 0.2 in the tumor samples and the normal samples; setting μ as the expression mean of a lincRNA in all samples and σ as its standard deviation, the coefficient of variation is calculated according to the formula:

$$c_{vj} = \frac{\sigma_j}{\mu_j} \tag{2}$$

wherein j is the lincRNA number, $c_v$ is the coefficient of variation, $c_{vj}$ is the coefficient of variation of the jth lincRNA, $\sigma_j$ is the standard deviation of the jth lincRNA, and $\mu_j$ is the expression mean of the jth lincRNA; setting $m_1$ as the total number of stably expressed lincRNAs, there is:

$$m_1 = m\{\, c_{vj} < 0.2 \,\}, \quad j \in \{1, \dots, m\} \tag{3}$$
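The stability screen of step 1.4 can be sketched as below. The text does not say whether the standard deviation is the sample or the population form, so the sample form (`statistics.stdev`) is an assumption here, as are the function names and the dictionary layout.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return sigma / mu for the log-transformed expression of one lincRNA."""
    return stdev(values) / mean(values)  # assumption: sample standard deviation


def stable_lincrnas(expression, threshold=0.2):
    """Return the names of lincRNAs whose coefficient of variation is
    below the threshold (0.2 in the text)."""
    return [
        name
        for name, values in expression.items()
        if coefficient_of_variation(values) < threshold
    ]
```

To require stability in both the tumor and the normal samples, the same function could be applied to each group separately and the two resulting lists intersected.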

step 1.5, selecting the lincRNAs differentially expressed between the tumor samples and the normal samples; the log-transformed expression values are used to calculate the log fold change f of each lincRNA between tumor and normal samples, according to the formula:

$$f_j = \mu_{1j} - \mu_{2j} \tag{4}$$

wherein j is the lincRNA number, $f_j$ is the fold change of the jth lincRNA, $\mu_{1j}$ is the expression mean of the jth lincRNA in the tumor samples, and $\mu_{2j}$ is the expression mean of the jth lincRNA in the normal samples;

the difference in lincRNA expression between tumor and normal samples is then compared using an independent-samples t-test, formula (5), wherein $n_1$ is the number of tumor samples, $n_2$ is the number of normal samples, $\mu_1$ is the mean expression of the lincRNA in the tumor samples, $\mu_2$ is the mean expression of the lincRNA in the normal samples, $\sigma_1^2$ is the variance of the lincRNA in the tumor samples, and $\sigma_2^2$ is the variance of the lincRNA in the normal samples;

the p values obtained from all the t-tests are corrected using the False Discovery Rate (FDR); setting q as the FDR-corrected value and r as the rank of a p value among the $m_1$ lincRNAs, there is:

$$q_j = \frac{p_j \cdot m_1}{r_j} \tag{6}$$

wherein j is the lincRNA number, $q_j$ is the FDR-corrected value of the jth lincRNA, $p_j$ is the p value of the t-test of the jth lincRNA, and $r_j$ is the rank of the p value of the jth lincRNA among the $m_1$ lincRNAs;

finally, the lincRNAs whose fold change f has an absolute value greater than or equal to 1 and whose FDR-corrected q value is less than or equal to 0.05 are selected and recorded as characteristic lincRNAs; setting the total number of characteristic lincRNAs as $m_2$, there is:

$$m_2 = m_1\{\, |f_j| \ge 1,\ q_j \le 0.05 \,\}, \quad j \in \{1, \dots, m_1\} \tag{7}$$
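The differential-expression screen of step 1.5 can be sketched as follows. Several points are assumptions rather than details confirmed by the text: the t statistic uses the pooled-variance (Student) form of the independent-samples t-test, the q value is computed as p multiplied by the number of tests and divided by the rank without the monotonicity adjustment that many statistics packages add, and all function names are hypothetical. Converting a t statistic to a p value needs the t distribution, which the Python standard library lacks, so p values are taken as input here (a routine such as SciPy's `scipy.stats.ttest_ind` could supply them).

```python
from math import sqrt
from statistics import mean, variance


def log_fold_change(tumor, normal):
    """Difference of mean log2 expression between tumor and normal samples."""
    return mean(tumor) - mean(normal)


def pooled_t_statistic(tumor, normal):
    """Independent-samples t statistic (assumption: pooled variance)."""
    n1, n2 = len(tumor), len(normal)
    s1, s2 = variance(tumor), variance(normal)  # sample variances
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    return (mean(tumor) - mean(normal)) / sqrt(pooled * (1 / n1 + 1 / n2))


def fdr_q_values(p_values):
    """q_j = p_j * m / r_j, where r_j is the ascending rank of p_j."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    q = [0.0] * m
    for rank, j in enumerate(order, start=1):
        q[j] = min(1.0, p_values[j] * m / rank)  # cap at 1 for readability
    return q


def select_characteristic(fold_changes, q_values, f_min=1.0, q_max=0.05):
    """Indices of lincRNAs with |f| >= f_min and q <= q_max."""
    return [
        j
        for j, (f, q) in enumerate(zip(fold_changes, q_values))
        if abs(f) >= f_min and q <= q_max
    ]
```

Tied p values receive consecutive ranks in input order in this sketch; a production analysis would normally rely on an established statistics library instead.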

Optionally, selecting the characteristic lincRNA expression data and carrying out data standardization on each sample in step 2 follows the formula:

$$u_{ij}' = \frac{u_{ij} - \mu_i}{\sigma_i} \tag{8}$$

wherein i is the sample number and j is the characteristic lincRNA number; $\mu_i$ is the mean expression of all characteristic lincRNAs of the ith sample, $\sigma_i$ is the standard deviation of all characteristic lincRNAs of the ith sample, $u_{ij}$ is the log-transformed characteristic lincRNA expression value, and $u_{ij}'$ is the normalized lincRNA value.
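The per-sample standardization of step 2 can be sketched as follows. Note that the mean and standard deviation are taken across the characteristic lincRNAs within one sample, not across samples. The function name is hypothetical, and the use of the sample standard deviation is an assumption because the text does not specify the form.

```python
from statistics import mean, stdev


def standardize_sample(values):
    """Z-score the characteristic lincRNA values of a single sample.

    values: log-transformed expression of the characteristic lincRNAs
            for one sample, in a fixed lincRNA order.
    """
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    return [(u - mu) / sigma for u in values]
```

Because each sample is rescaled using only its own values, a new sample from another platform can be standardized without reference to the training data, which is the basis of the platform-heterogeneity argument made in the text.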

Optionally, the constructing an early prediction model for the normalized data by using the support vector machine in step 3 specifically includes:

step 3.1, grouping all samples; 80% of all samples are assigned to the training set + validation set, and the remaining 20% are assigned to the test set; the training set + validation set is used for 5-fold cross-validation, that is, it is divided into 5 equal groups, each group is used in turn as the validation set, and the other 4 groups are used as the training set; for given parameters, the training set is used to construct a model, and the validation set is used to check the accuracy of the model;

step 3.2, screening the optimal parameters; the parameter gamma of the SVM controls the width of the Gaussian kernel, and C is a regularization parameter that limits the importance of each point; the parameter grid is set as:

$$\mathrm{gamma} = [0.001, 0.01, 0.1, 1, 10, 100] \tag{9}$$

$$C = [0.001, 0.01, 0.1, 1, 10, 100] \tag{10}$$

in cross-validation, a model is constructed with each combination of the two parameters gamma and C in turn, and the validation set is then used to check the model accuracy; for each parameter combination, each round of the 5-fold cross-validation yields 1 accuracy value, so the 5 rounds yield 5 accuracy values; the parameter combination with the highest average accuracy over the 5 rounds is selected as the optimal parameters;
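The sample grouping of step 3.1 and the parameter screening of step 3.2 can be sketched in plain Python as below. The sketch covers only the bookkeeping (80/20 split, 5 folds, 36 parameter combinations, selection by mean validation accuracy). Training the SVM itself is delegated to `fit_and_score`, a hypothetical callback that would fit an RBF-kernel SVM on the training indices with the given gamma and C and return the accuracy on the validation indices; one possible backend is scikit-learn's `SVC(kernel="rbf")`. All function names and the random seed are assumptions.

```python
import random
from itertools import product
from statistics import mean

GAMMA_GRID = [0.001, 0.01, 0.1, 1, 10, 100]  # formula (9)
C_GRID = [0.001, 0.01, 0.1, 1, 10, 100]      # formula (10)


def split_dev_test(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices; return (training + validation, test) indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def make_folds(indices, k=5):
    """Deal the indices into k folds of (nearly) equal size."""
    return [indices[i::k] for i in range(k)]


def grid_search(dev_indices, fit_and_score, k=5):
    """Return (best mean accuracy, gamma, C) over the parameter grid."""
    folds = make_folds(dev_indices, k)
    best = None
    for gamma, c in product(GAMMA_GRID, C_GRID):
        accuracies = []
        for i, validation in enumerate(folds):
            # The other k - 1 folds together form the training set.
            training = [s for j, fold in enumerate(folds) if j != i for s in fold]
            accuracies.append(fit_and_score(training, validation, gamma, c))
        average = mean(accuracies)
        if best is None or average > best[0]:
            best = (average, gamma, c)
    return best
```

When several combinations tie for the highest mean accuracy, this sketch keeps the first one encountered in grid order; the text does not state a tie-breaking rule.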

step 3.3, constructing a model using the optimal parameters and the data of the training set + validation set, and finally evaluating the model using the test set; the evaluation indexes include accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic (ROC) curve (AUC); in the test set, the number of samples that are actually tumor and predicted as tumor is defined as true positives (TP), the number that are actually normal but predicted as tumor as false positives (FP), the number that are actually tumor but predicted as normal as false negatives (FN), and the number that are actually normal and predicted as normal as true negatives (TN); the above evaluation indexes are calculated by formulas (11) to (17);

the accuracy, precision, recall, specificity, F1 score and AUC among the above evaluation indexes return values in the range (0, 1); the higher the accuracy, the higher the overall prediction performance of the model; higher precision indicates fewer type I errors; higher recall indicates fewer type II errors; high specificity indicates that few negative examples are mixed into the samples predicted as positive; the F1 score is a composite index, the harmonic mean of precision and recall; MCC is the correlation coefficient between the observed and predicted binary classifications and returns a value in the range (-1, 1), where 1 represents perfect prediction, 0 represents prediction no better than random, and -1 represents complete disagreement between prediction and observation; a higher AUC indicates a higher probability that the classifier ranks a positive example above a negative example; therefore, the closer the above indexes are to 1, the better the overall prediction effect of the model;
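The confusion-matrix indexes described above can be computed from TP, FP, FN and TN using their standard textbook definitions, as in the following sketch. The function name and the dictionary keys are assumptions. AUC is omitted because it requires the classifier's continuous scores rather than the four counts.

```python
from math import sqrt


def evaluation_metrics(tp, fp, fn, tn):
    """Standard binary-classification indexes from confusion-matrix counts.

    No guard against empty classes is included: a zero denominator
    raises ZeroDivisionError in this sketch.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

For example, `evaluation_metrics(tp=5, fp=0, fn=0, tn=5)` returns 1.0 for every index, which corresponds to a test set classified without error.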

step 3.4, if the evaluation indexes are all greater than 0.9, the model has a good prediction effect, and the final prediction model is then constructed using all the data and the optimal parameter combination.

Optionally, the early prediction according to the expression level of the characteristic lincRNAs of the patient in step 4 specifically comprises:

step 4.1, standardizing the characteristic lincRNA expression data of the prediction sample; setting u as the characteristic lincRNA expression value of the prediction sample, μ as the mean of the characteristic lincRNA expression of the prediction sample, and σ as the standard deviation of the characteristic lincRNA expression of the prediction sample, there is:

$$u_j' = \frac{u_j - \mu}{\sigma} \tag{18}$$

wherein j is the characteristic lincRNA number and $u_j'$ is the normalized lincRNA value;

step 4.2, substituting the normalized lincRNA values of the prediction sample into the final prediction model for prediction. A prediction of 1 indicates colon cancer, and a prediction of 0 indicates normal.
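How a trained Gaussian-kernel SVM turns a standardized sample into a 1/0 prediction can be sketched as follows, using the usual kernel-SVM decision function (a weighted sum of kernel values against the support vectors plus an intercept). The support vectors, dual coefficients, intercept and function names are hypothetical placeholders; they are not values from the patented model, and the sample standard deviation is again an assumption.

```python
from math import exp
from statistics import mean, stdev


def standardize(values):
    """Z-score one sample across its characteristic lincRNA values."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def rbf_kernel(x, z, gamma):
    """Gaussian kernel exp(-gamma * ||x - z||^2); gamma sets the width."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def predict(expression, support_vectors, dual_coefs, intercept, gamma):
    """Return 1 (colon cancer) if the decision score is positive, else 0."""
    x = standardize(expression)
    score = intercept + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
    return 1 if score > 0 else 0
```

A small gamma makes the kernel wide, so distant support vectors still influence the score; a large gamma makes each support vector act only in its immediate neighborhood.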

Compared with the prior art, the invention can obtain the following technical effects:

1) The prediction is fast: the prediction model constructed by the invention can rapidly predict large-scale samples, and predicting 100 samples takes only a few seconds.

2) The accuracy is high: the prediction model constructed by the invention has high accuracy and precision, both above 90%, and the area under the ROC curve (AUC) is 1.000.

3) The influence of platform heterogeneity is small: lincRNA expression values measured on different analysis platforms differ greatly, but the invention predicts from normalized characteristic lincRNA expression values and is therefore less affected by platform heterogeneity.

Of course, it is not necessary for any one product in which the invention is practiced to achieve all of the above-described technical effects simultaneously.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of data screening and model building according to the present invention;

FIG. 2 is a cross-validation parameter optimization process for a support vector machine model according to the present invention;

FIG. 3 is a diagram of a test set evaluation index for a support vector machine model according to the present invention;

FIG. 4 is a support vector machine model test set ROC curve of the present invention.

Detailed Description

The following embodiments are described in detail with reference to the accompanying drawings, so that how to implement the technical features of the present invention to solve the technical problems and achieve the technical effects can be fully understood and implemented.

The invention discloses an early stage prediction method of colon cancer based on characteristic lincRNA expression profile combination, which can accurately predict colon cancer stage I/II and comprises the following steps:

step 1, obtaining lincRNA (characteristic lincRNA) stably and differentially expressed by a patient with early colon cancer; the method specifically comprises the following steps:

step 1.2, selecting lincRNAs with a certain expression abundance, namely lincRNAs whose read counts in all samples are greater than or equal to 10. Taking the logarithm of the read counts of all the lincRNAs, setting the total number of samples as n, the total number of screened lincRNAs as m, v as the read counts of a lincRNA, and u as the expression value after taking the logarithm, there is:

$$u_{ij} = \log_2 v_{ij}, \quad i \in \{1, \dots, n\},\ j \in \{1, \dots, m\} \tag{1}$$

wherein i is the sample number, j is the lincRNA number, $u_{ij}$ is the log-transformed expression value of the jth lincRNA in the ith sample, and $v_{ij}$ is the read counts of the jth lincRNA in the ith sample.

step 1.5, selecting the lincRNAs differentially expressed between the tumor samples and the normal samples. The log-transformed expression values are used to calculate the log fold change f of each lincRNA between tumor and normal samples, according to the formula:

$$f_j = \mu_{1j} - \mu_{2j} \tag{4}$$

wherein j is the lincRNA number, $f_j$ is the fold change of the jth lincRNA, $\mu_{1j}$ is the expression mean of the jth lincRNA in the tumor samples, and $\mu_{2j}$ is the expression mean of the jth lincRNA in the normal samples.

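In the independent-samples t-test of formula (5), $\sigma_1^2$ is the variance of the lincRNA in the tumor samples, and $\sigma_2^2$ is the variance of the lincRNA in the normal samples.

The p values obtained from the t-tests are corrected using the False Discovery Rate (FDR):

$$q_j = \frac{p_j \cdot m_1}{r_j} \tag{6}$$

wherein j is the lincRNA number, $q_j$ is the FDR-corrected value of the jth lincRNA, $p_j$ is the p value of the t-test of the jth lincRNA, and $r_j$ is the rank of the p value of the jth lincRNA among the $m_1$ lincRNAs.

$$m_2 = m_1\{\, |f_j| \ge 1,\ q_j \le 0.05 \,\}, \quad j \in \{1, \dots, m_1\} \tag{7}$$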

step 2, selecting the characteristic lincRNA expression data and carrying out data standardization on each sample according to the formula:

$$u_{ij}' = \frac{u_{ij} - \mu_i}{\sigma_i} \tag{8}$$

where i is the sample number and j is the characteristic lincRNA number. $\mu_i$ is the mean expression of all characteristic lincRNAs of the ith sample, $\sigma_i$ is the standard deviation of all characteristic lincRNAs of the ith sample, $u_{ij}$ is the log-transformed characteristic lincRNA expression value, and $u_{ij}'$ is the normalized lincRNA value.

Step 3, constructing an early prediction model for the standardized data using a support vector machine; the method specifically comprises the following steps:

Step 3.1, grouping all samples. 80% of all samples are assigned to the training set + validation set, and the remaining 20% are assigned to the test set. The training set + validation set is used for 5-fold cross-validation, that is, it is divided into 5 equal groups, each group is used in turn as the validation set, and the other 4 groups are used as the training set. For given parameters, the training set is used to construct a model, and the validation set is used to check the accuracy of the model.

Step 3.2, screening the optimal parameters. The parameter gamma of the SVM controls the width of the Gaussian kernel, and C is a regularization parameter that limits the importance of each point. The parameter grid is set as:

$$\mathrm{gamma} = [0.001, 0.01, 0.1, 1, 10, 100] \tag{9}$$

$$C = [0.001, 0.01, 0.1, 1, 10, 100] \tag{10}$$

In cross-validation, a model is constructed with each combination of the two parameters gamma and C in turn, and the validation set is then used to check the model accuracy. For each parameter combination, each round of the 5-fold cross-validation yields 1 accuracy value, so the 5 rounds yield 5 accuracy values. The parameter combination with the highest average accuracy over the 5 rounds is selected as the optimal parameters.

Step 3.3, constructing a model using the optimal parameters and the data of the training set + validation set, and finally evaluating the model using the test set. The evaluation indexes include accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic (ROC) curve (AUC). In the test set, the number of samples that are actually tumor and predicted as tumor is defined as true positives (TP), the number that are actually normal but predicted as tumor as false positives (FP), the number that are actually tumor but predicted as normal as false negatives (FN), and the number that are actually normal and predicted as normal as true negatives (TN). The above evaluation indexes are calculated by formulas (11) to (17).

The accuracy, precision, recall, specificity, F1 score and AUC among the above evaluation indexes return values in the range (0, 1). The higher the accuracy, the higher the overall prediction performance of the model; higher precision indicates fewer type I errors; higher recall indicates fewer type II errors; high specificity indicates that few negative examples are mixed into the samples predicted as positive; the F1 score is a composite index, the harmonic mean of precision and recall; MCC is the correlation coefficient between the observed and predicted binary classifications and returns a value in the range (-1, 1), where 1 represents perfect prediction, 0 represents prediction no better than random, and -1 represents complete disagreement between prediction and observation; a higher AUC indicates a higher probability that the classifier ranks a positive example above a negative example. Therefore, the closer the above indexes are to 1, the better the overall prediction effect of the model.

Step 3.4, if the evaluation indexes are all greater than 0.9, the model has a good prediction effect. The final prediction model is then constructed using all the data and the optimal parameter combination.

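Step 4, carrying out early prediction according to the expression level of the characteristic lincRNAs of the patient, specifically comprising the following steps:

the characteristic lincRNA expression data of the prediction sample are standardized according to the formula:

$$u_j' = \frac{u_j - \mu}{\sigma} \tag{18}$$

wherein j is the characteristic lincRNA number, $u_j$ is the characteristic lincRNA expression value of the prediction sample, μ and σ are the mean and standard deviation of the characteristic lincRNA expression of the prediction sample, and $u_j'$ is the normalized lincRNA value.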

Example 1

A method for early prediction of colon cancer based on a combination of characteristic lincRNA expression profiles comprising the steps of:

Step 1, obtaining the lincRNAs stably and differentially expressed in early colon cancer patients (characteristic lincRNAs); the detailed flow chart is shown in FIG. 1.

Step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of colon cancer patients from the Genomic Data Commons Data Portal database, obtaining the read counts of the tumor tissue gene expression profiles of the colon cancer patients, and carrying out logarithmic conversion.

Step 1.2, selecting lincRNAs with a certain expression abundance, namely lincRNAs whose read counts in all samples are greater than or equal to 10; see formula (1) for details.

Step 1.3, selecting colon cancer patients with stage I and stage II disease and recording them as early colon cancer patients.

Step 1.4, selecting the lincRNAs stably expressed in the tumor samples and the normal samples, namely the lincRNAs with a coefficient of variation smaller than 0.2 in the tumor samples and the normal samples; see formulas (2) to (3) for details.

Step 1.5, selecting the lincRNAs differentially expressed between tumor and normal samples, as detailed in formulas (4) to (7), and recording them as characteristic lincRNAs.

Through the above screening, 15 characteristic lincRNAs of colon cancer are finally obtained, as shown in Table 1. The nucleotide probe sequences of the 15 characteristic lincRNAs of colon cancer are shown in Table 2.

TABLE 1. Characteristic lincRNAs of colon cancer

TABLE 2. Nucleotide probe sequences of the characteristic lincRNAs of colon cancer

Step 2, carrying out data standardization on each sample; see formula (8) for details.

Step 3, constructing an early prediction model for the standardized data using a support vector machine.

Step 3.1, grouping all samples. 80% of all samples are assigned to the training set + validation set, and the remaining 20% are assigned to the test set. The training set + validation set is used for 5-fold cross-validation, that is, it is divided into 5 equal groups, each group is used in turn as the validation set, and the other 4 groups are used as the training set. For given parameters, the training set is used to construct a model, and the validation set is used to check the accuracy of the model. See FIG. 1 for details.

Step 3.2, screening the optimal parameters. The SVM parameter grid is set by formulas (9) to (10). In cross-validation, a model is constructed with each combination of the two parameters gamma and C in turn, and the validation set is then used to check the model accuracy. For each parameter combination, each round of the 5-fold cross-validation yields 1 accuracy value, so the 5 rounds yield 5 accuracy values. The parameter combination with the highest average accuracy over the 5 rounds is selected as the optimal parameters. FIG. 2 shows the cross-validation parameter optimization process; the cross-validation accuracy of the model is highest, 1.000, when gamma is 0.001 and C is 100. The optimal parameters of the model are therefore gamma = 0.001 and C = 100.

Step 3.3, constructing a model using the optimal parameters and the data of the training set + validation set, and finally evaluating the model using the test set. The evaluation indexes include accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic (ROC) curve (AUC). The evaluation indexes are described in detail in formulas (11) to (17).

Step 3.4, FIG. 3 shows the accuracy, precision, recall, specificity, F1 score and MCC among the above evaluation indexes; all 6 indexes are 1.0. FIG. 4 shows the ROC curve and the AUC; the AUC on the test set is 1.000. These evaluation indexes show that the model has a good prediction effect. The final prediction model is therefore constructed using all the data and the optimal parameter combination.

Step 4, carrying out early prediction according to the expression level of the characteristic lincRNAs of the patient:

Step 4.1, standardizing the characteristic lincRNA expression data of the prediction sample; see formula (18) for details. In this example, 10 samples are randomly selected for prediction and are excluded when the final prediction model is constructed. The numbers of the 10 selected samples and their normalized characteristic lincRNA values are shown in Table 3.

TABLE 3. Sample numbers and normalized characteristic lincRNA values of the 10 samples

Step 4.2, substituting the normalized lincRNA values of the prediction sample into the final prediction model for prediction. A prediction of 1 indicates colon cancer, and a prediction of 0 indicates normal. The sample numbers of the 10 samples, the corresponding TCGA numbers, the actual states and the predicted results are shown in Table 4. The predicted results of all 10 samples agree with the actual states, which shows that the invention can accurately predict early-stage colon cancer.

TABLE 4. Sample numbers, corresponding TCGA numbers, actual states and predicted states of the 10 samples

In conclusion, the characteristic lincRNA expression profile combination has high prediction accuracy, and can effectively predict colon cancer at an early stage. In addition, the method has no platform dependency, and can predict data from various sources.

While the foregoing description shows and describes several preferred embodiments of the invention, it is to be understood, as noted above, that the invention is not limited to the forms disclosed herein, is not to be construed as excluding other embodiments, and is capable of use in various other combinations, modifications, and environments, and of changes within the scope of the inventive concept expressed herein, commensurate with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the protection of the appended claims.

SEQUENCE LISTING

<110> second people hospital of Guangdong province

<120> a characteristic lincRNA expression profile combination and an early prediction method of colon cancer

<130>2020

<160>15

<170>PatentIn version 3.3

<210>1

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>1

tcgacctccc tgggctcagg tgatcctccc 30

<210>2

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>2

gtttacattt ttatagtaag gtctcttcaa 30

<210>3

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>3

ccgggcagca gccgcctgcg ccgggctcca 30

<210>4

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>4

caccacccca gcagcccggg tcccgggtgg 30

<210>5

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>5

cactccagcc tgggtgacag aacagactgt 30

<210>6

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>6

agaatgtccc taatttagct gaggaaccta 30

<210>7

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>7

ttgcaatcac actgtgagaa actctaccct 30

<210>8

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>8

agttctagat ccattgagac aagctctaga 30

<210>9

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>9

tggaccctga tcttctggtg ggtttaccag 30

<210>10

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>10

gttgtatgtt agccttcagc tgcttaaatg 30

<210>11

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>11

aatcccaacc ctcactgcac aaagctttac 30

<210>12

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>12

ataagcagcc tcaaggacca agaaccatct 30

<210>13

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>13

cccaaaatac agtctttgtg ttgccatctg 30

<210>14

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>14

acctgggccc ttctggtatc tcctgaatga 30

<210>15

<211>30

<212>DNA

<213> Artificial sequence (Artificial sequence)

<400>15

ttcctgcatg atgctgggga gcttggcgcc 30

Claims

1. A combination of characteristic lincRNA expression profiles for predicting early colon cancer comprising AC005332.6, AC008124.1, AC090114.2, BAIAP2-DT, HEIH, LINC00294, LINC00476, LINC00667, LINC00847, LINC01559, MIR194-2HG, MIR22HG, PVT1, SNHG15 and TP53TG1, the nucleotide sequence probes of which are shown in SEQ ID No.1-SEQ ID No. 15.

2. A method for the early prediction of colon cancer based on the combination of characteristic lincRNA expression profiles of claim 1, comprising the steps of:

the method is used for purposes other than the diagnosis and treatment of disease.

3. The prediction method according to claim 2, wherein the obtaining of the characteristic lincRNA stably and differentially expressed in the patient with early colon cancer in the step 1 is specifically as follows:

step 1.1, downloading transcriptome data and clinical data of tumor tissues and para-carcinoma tissues of colon cancer patients from the Genomic Data Commons Data Portal database to obtain the read counts (sequencing read values) of the tumor tissue gene expression profiles of the colon cancer patients, and carrying out logarithmic conversion;

step 1.2, selecting lincRNAs with a certain expression abundance, namely lincRNAs whose read counts in all samples are greater than or equal to 10; taking the logarithm of the read counts of all the lincRNAs, setting the total number of samples as n, the total number of screened lincRNAs as m, v as the read counts of a lincRNA, and u as the expression value after taking the logarithm, there is:

$$u_{ij} = \log_2 v_{ij}, \quad i \in \{1, \dots, n\},\ j \in \{1, \dots, m\} \tag{1}$$

in the independent-samples t-test of formula (5), $\sigma_1^2$ is the variance of the lincRNA in the tumor samples, and $\sigma_2^2$ is the variance of the lincRNA in the normal samples;

$$m_2 = m_1\{\, |f_j| \ge 1,\ q_j \le 0.05 \,\}, \quad j \in \{1, \dots, m_1\} \tag{7}$$

4. The prediction method of claim 2, wherein in step 2 the characteristic lincRNA expression data are selected and data standardization is carried out on each sample according to the formula:

5. The prediction method according to claim 2, wherein the constructing of the early prediction model for the normalized data by using the support vector machine in the step 3 is specifically:

$$\mathrm{gamma} = [0.001, 0.01, 0.1, 1, 10, 100] \tag{9}$$

$$C = [0.001, 0.01, 0.1, 1, 10, 100] \tag{10}$$

6. The prediction method according to claim 2, wherein the early prediction according to the patient characteristic lincRNA expression level in step 4 is specifically: