CN106909767B - System for classifying hepatitis B-related cirrhosis - Google Patents



Publication number
CN106909767B
Authority
CN
China
Prior art keywords
mir
hepatitis
model
classification
cirrhosis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510964983.5A
Other languages
Chinese (zh)
Other versions
CN106909767A (en)
Inventor
李亦学
张卫红
侯婷
靳文静
王振
孙翔英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Contemporaneous Biotechnology Co ltd
Original Assignee
Beijing Quantobio Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Quantobio Biotechnology Co ltd filed Critical Beijing Quantobio Biotechnology Co ltd
Priority to CN201510964983.5A priority Critical patent/CN106909767B/en
Publication of CN106909767A publication Critical patent/CN106909767A/en
Application granted granted Critical
Publication of CN106909767B publication Critical patent/CN106909767B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Abstract

The invention provides a method for classifying hepatitis B-related cirrhosis based on plasma microRNA marker expression levels by using a logistic regression mathematical model, which diagnoses hepatitis B-related cirrhosis simply and accurately from the expression values of a combination of 7 plasma microRNA molecular markers. The technical scheme of the system is as follows: a database module is established by collecting a large number of plasma microRNA expression values from chronic hepatitis B, hepatitis B-related cirrhosis and healthy samples, and stores an original database used as a training set together with subsequent blind test data; a quality control module is established to remove extreme values caused by experimental errors; a model classification module is established, in which a logistic regression model is constructed and optimized by means such as feature selection, the model with the best accuracy is selected through evaluation to establish the final classification method, and blind test samples are classified by a two-layer classification model (healthy versus liver disease (chronic hepatitis B/cirrhosis), then chronic hepatitis B versus hepatitis B-related cirrhosis).

Description

System for classifying hepatitis B-related cirrhosis
Technical Field
The invention relates to a classification method and a classification system for hepatitis B related cirrhosis, in particular to a method and a classification system for hepatitis B related cirrhosis based on plasma microRNA marker expression level by using a logistic regression mathematical model.
Background
China bears a heavy burden of viral hepatitis, and the number of hepatitis B patients is especially large. Hepatitis B carriers account for about 8-10% of the total population; about 25% of carriers develop chronic hepatitis B and hepatitis B-related cirrhosis, and about 10% develop hepatocellular carcinoma (HCC). Hepatitis B virus infection not only seriously endangers health; the costs of treatment and of related diseases also impose a huge economic burden on patients, the country and society.
Current clinical means of diagnosing cirrhosis mainly include histopathological biopsy, FibroScan, color Doppler ultrasound, CT, gastroscopy, serological indexes and the like. However, each of these techniques or indexes has limitations in clinical use, and none can diagnose the progression of liver cirrhosis accurately and in a timely manner. Staged diagnosis of liver cirrhosis therefore still depends on the pathological standard of liver biopsy, and one or a group of convenient, timely and noninvasive indexes for the staged diagnosis of liver fibrosis and liver cirrhosis is urgently needed in clinical practice.
microRNA (miRNA) was first discovered in 1993 and has become a research hotspot in recent years with the development of high-throughput sequencing technology. microRNAs can bind to flanking regions of a target gene's sequence to repress or inhibit translation of the target mRNA, and they are highly conserved, temporally regulated, and tissue-specific. Recent studies have shown that microRNAs are closely related to hepatitis virus infection, chronic hepatitis and cirrhosis, and can affect disease progression by acting on the virus itself or on the immune system. Research shows that the microRNA expression profile of patients with virus-induced liver disease differs markedly from that of healthy human tissue. Researchers have also found that a large number of stable small ribonucleic acid molecules, namely microRNAs, exist in human serum/plasma, which lays a foundation for clinically diagnosing liver cirrhosis by detecting the expression level of microRNA molecules in serum/plasma.
In summary, although researchers have worked in this field, many difficulties and challenges remain, and the degree of progression of cirrhosis still cannot be diagnosed accurately and in a timely manner. Using the expression level of microRNA markers in serum/plasma offers a new approach to the diagnosis and study of liver cirrhosis. However, changes in the expression of microRNA markers for liver cirrhosis, or of combinations thereof, have not been studied in depth. It is still necessary to find microRNA markers or combinations thereof that can effectively identify liver cirrhosis, in particular markers or combinations that can distinguish hepatitis B-related cirrhosis from chronic hepatitis B, and to construct a suitable and accurate classification method and system for hepatitis B-related cirrhosis by applying a mathematical model to the expression levels of the marker combination. A method using microRNA markers or a combination thereof has the advantage of being faster and more accurate than conventional methods for diagnosing cirrhosis and hepatitis B.
Disclosure of Invention
One object of the present invention is to provide a method for classifying hepatitis B related cirrhosis based on plasma microRNA marker expression levels using a logistic regression mathematical model, comprising the steps of:
a) establishing an original database by using the training set data;
b) adopting two layers of classification models for the training set;
c) constructing and optimizing the logistic regression mathematical model by performing feature selection and data optimization on the training set;
d) performing prediction evaluation;
e) selecting an optimal model according to the prediction evaluation result and establishing a final classification method;
f) collecting separate test set samples for model testing and evaluation.
Preferably, the training set comprises sample data based on Ct values of plasma microRNA marker expression and on clinical indicators. The two-layer classification model includes a classification model between liver disease (chronic hepatitis B/cirrhosis) and healthy controls (model DH) and a classification model between chronic hepatitis B and hepatitis B-related cirrhosis (model AB). The feature selection uses an information gain algorithm to rank the features of the training set and selects the features with a high contribution degree as candidate microRNA markers. The data optimization performs quality control and end-value removal on the training set data to remove extreme values caused by experimental errors. The logistic regression mathematical model is then constructed by the logistic regression method, and a plurality of microRNA molecular markers are combined and expressed by the formula:
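$h(x) = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
wherein $x_1, x_2, \dots, x_n$ are the selected n features, and $\theta_0, \theta_1, \theta_2, \dots, \theta_n$ are the coefficients of the individual features obtained from the training set.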
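As a minimal sketch of how the formula above can be evaluated, the following Python computes the linear score h(x) from a coefficient vector and maps it to a probability with the standard logistic function. The function names and the example coefficients are hypothetical illustrations and are not taken from the invention.

```python
import math
from typing import Sequence


def linear_score(theta: Sequence[float], x: Sequence[float]) -> float:
    """Return theta_0 + theta_1*x_1 + ... + theta_n*x_n.

    theta has n + 1 entries (intercept first); x has n entries.
    """
    if len(theta) != len(x) + 1:
        raise ValueError("theta must have exactly one more entry than x")
    return theta[0] + sum(t * v for t, v in zip(theta[1:], x))


def logistic(h: float) -> float:
    """Standard logistic function, written to avoid overflow for large |h|."""
    if h >= 0:
        return 1.0 / (1.0 + math.exp(-h))
    e = math.exp(h)
    return e / (1.0 + e)


# Hypothetical coefficients and feature values, for illustration only.
theta = [0.5, 2.0, -1.0]
x = [1.0, 2.5]
h = linear_score(theta, x)  # 0.5 + 2.0*1.0 - 1.0*2.5
p = logistic(h)             # a score of exactly 0 maps to probability 0.5
```

A positive score corresponds to a probability above 0.5 and a negative score to a probability below 0.5, so a decision can be made on the sign of the score alone.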
A second object of the present invention is to provide a system for classification of liver disease, comprising a database module, a quality control module and a model classification module, wherein: the database module comprises an original database used as a training set and a subsequently collected blind test database; the quality control module removes extreme values caused by experimental errors; the model classification module includes a classification model between liver disease (chronic hepatitis B/cirrhosis) and healthy controls (model DH) and a classification model between chronic hepatitis B and hepatitis B-related cirrhosis (model AB). The liver disease is hepatitis B-related liver disease.
Preferably, the database module comprises an original database of 486 cases used as a training set, together with subsequently collected blind test data, wherein each sample comprises Ct expression values of miR-122-5p, miR-21-5p, miR-146a-5p, miR-29c-3p, miR-381-3p, miR-223 and miR-22-3p, and values of the clinical indicators alanine aminotransferase (ALT), albumin (ALB) and HBV DNA.
The quality control module removes extreme values caused by experimental errors through quality control, and the range of the non-extreme values is defined as: in the model DH, the Ct value range of the marker miR-381-3p is 19.40-32.10, the Ct value range of the marker miR-22-3p is 16.72-26.86, and the Ct value range of the marker miR-146a-5p is 19.32-29.16; in the model AB, the Ct value range of the marker miR-122-5p is 17.61-26.99, the marker miR-21-5p is 16.79-24.47, the marker miR-146a-5p is 19.31-26.64, the marker miR-29c-3p is 18.57-26.18, the marker miR-381-3p is 20.13-27.87, the marker miR-223 is 15.35-24.15, and the marker miR-22-3p is 16.71-23.95.
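The range check described above can be sketched as follows, using the model DH ranges quoted in the text. Treating the range endpoints as inclusive, and rejecting samples with a missing marker, are assumptions of this sketch, since the text does not specify either; the function names and sample values are hypothetical.

```python
from typing import Dict, List, Tuple

# Non-extreme Ct ranges for model DH, as listed in the text.
DH_CT_RANGES: Dict[str, Tuple[float, float]] = {
    "miR-381-3p": (19.40, 32.10),
    "miR-22-3p": (16.72, 26.86),
    "miR-146a-5p": (19.32, 29.16),
}


def passes_quality_control(sample: Dict[str, float],
                           ranges: Dict[str, Tuple[float, float]]) -> bool:
    """True if every marker in `ranges` is present and inside its range.

    Endpoints are treated as inclusive (an assumption of this sketch).
    """
    for marker, (low, high) in ranges.items():
        value = sample.get(marker)
        if value is None or not (low <= value <= high):
            return False
    return True


def filter_samples(samples: List[Dict[str, float]],
                   ranges: Dict[str, Tuple[float, float]]) -> List[Dict[str, float]]:
    """Keep only the samples that pass quality control."""
    return [s for s in samples if passes_quality_control(s, ranges)]


# Hypothetical Ct values, for illustration only.
samples = [
    {"miR-381-3p": 25.0, "miR-22-3p": 20.0, "miR-146a-5p": 24.0},
    {"miR-381-3p": 35.2, "miR-22-3p": 20.0, "miR-146a-5p": 24.0},  # miR-381-3p too high
]
kept = filter_samples(samples, DH_CT_RANGES)
```

The same function can be reused for model AB by passing a second dictionary built from the model AB ranges.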
The modeling algorithm of the model classification module is logistic regression, and a plurality of microRNA molecular markers are combined and expressed by a formula. The formula of the algorithm for distinguishing healthy from liver disease (DH) is:
$h_{DH}(x) = -1.972\,X_{\text{miR-381-3p}} + 0.0079\,X_{\text{miR-22-3p}} - 1.6462\,X_{\text{miR-146a-5p}} + 74.495$
The thresholds determined from maximum-probability classification are:
D, liver disease (chronic hepatitis B/cirrhosis) class: $h_{DH}(x) > 0$;
H, healthy class: $h_{DH}(x) < 0$.
The formula of the algorithm for distinguishing hepatitis B-related cirrhosis from chronic hepatitis B (AB) is:
$h_{AB}(x) = 1.1925\,X_{\text{miR-122-5p}} + 0.3978\,X_{\text{miR-21-5p}} + 0.3726\,X_{\text{miR-146a-5p}} - 1.7062\,X_{\text{miR-29c-3p}} + 0.1303\,X_{\text{miR-223}} + 0.8156\,X_{\text{miR-22-3p}} - 0.1432\,X_{\text{ALB}} - 0.3608\,X_{\text{DNA}} - 0.0041\,X_{\text{ALT}} - 23.9918$
A, hepatitis B-related cirrhosis class: $h_{AB}(x) > 0$;
B, chronic hepatitis B class: $h_{AB}(x) < 0$.
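The two formulas and their thresholds can be chained into a two-layer decision, sketched below with the coefficients quoted in the text. Several points are assumptions of this sketch rather than statements of the invention: the units of ALB, DNA and ALT are not given in the text, the text leaves a score of exactly 0 undefined (here it falls into the H or B branch), and the function names and sample values are hypothetical.

```python
from typing import Dict

# Coefficients and intercepts as quoted in the text.
DH_COEF = {"miR-381-3p": -1.972, "miR-22-3p": 0.0079, "miR-146a-5p": -1.6462}
DH_INTERCEPT = 74.495

AB_COEF = {
    "miR-122-5p": 1.1925, "miR-21-5p": 0.3978, "miR-146a-5p": 0.3726,
    "miR-29c-3p": -1.7062, "miR-223": 0.1303, "miR-22-3p": 0.8156,
    "ALB": -0.1432, "DNA": -0.3608, "ALT": -0.0041,
}
AB_INTERCEPT = -23.9918


def score(sample: Dict[str, float], coef: Dict[str, float], intercept: float) -> float:
    """Linear score: intercept plus the coefficient-weighted feature values."""
    return intercept + sum(c * sample[name] for name, c in coef.items())


def classify(sample: Dict[str, float]) -> str:
    """Return 'H' (healthy), 'A' (cirrhosis) or 'B' (chronic hepatitis B)."""
    # Layer 1 (model DH): healthy versus liver disease.
    if score(sample, DH_COEF, DH_INTERCEPT) <= 0:  # tie at 0 is an assumption
        return "H"
    # Layer 2 (model AB): cirrhosis versus chronic hepatitis B.
    if score(sample, AB_COEF, AB_INTERCEPT) > 0:
        return "A"
    return "B"


# Hypothetical feature values, for illustration only.
healthy_like = {"miR-381-3p": 25.0, "miR-22-3p": 20.0, "miR-146a-5p": 24.0}
disease_like = {
    "miR-381-3p": 20.0, "miR-22-3p": 20.0, "miR-146a-5p": 20.0,
    "miR-122-5p": 22.0, "miR-21-5p": 20.0, "miR-29c-3p": 22.0,
    "miR-223": 20.0, "ALB": 40.0, "DNA": 5.0, "ALT": 50.0,
}
labels = [classify(healthy_like), classify(disease_like)]
```

Only samples that reach the liver disease branch are scored by model AB, so a sample classified as healthy does not need the model AB features.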
The method and the system of the invention for classifying hepatitis B-related cirrhosis based on plasma microRNA marker expression levels by using a logistic regression mathematical model have the advantage of classifying hepatitis B-related cirrhosis and chronic hepatitis B automatically and quickly from microRNA marker Ct expression values and common clinical indexes, using a database, an algorithm and formulas.
Drawings
Fig. 1 shows a flow chart of an embodiment of the present invention for establishing a classification method of hepatitis b related cirrhosis based on the expression level of a combination of 7 plasma microRNA markers using a logistic regression mathematical model.
Fig. 2 shows a flow chart of a single-layer classification model in an embodiment of the classification method and system of the present invention for hepatitis B-related cirrhosis based on the expression level of a combination of 7 plasma microRNA markers using a logistic regression mathematical model.
Fig. 3 shows the cross-validation results of model DH feature selection in the classification method and system of the present invention for hepatitis B-related cirrhosis based on the expression level of a combination of 7 plasma microRNA markers using a logistic regression mathematical model.
Fig. 4 shows the cross-validation results before and after quality control of model AB in the classification method and system of the present invention for hepatitis B-related cirrhosis based on the expression level of a combination of 7 plasma microRNA markers using a logistic regression mathematical model.
Fig. 5 shows a schematic diagram of an embodiment of the system of the present invention for classification of hepatitis B-related cirrhosis based on the expression level of a combination of 7 plasma microRNA markers using a logistic regression mathematical model.
Detailed Description
The technical scheme of the invention is further illustrated below through specific embodiments and the accompanying drawings. Those skilled in the art will understand that the following detailed description and examples are intended to illustrate the invention and should not be construed as limiting the invention in any way.
One aspect of the invention is a novel hepatitis B related cirrhosis classification method based on the combined expression level of 7 plasma microRNA markers by using a logistic regression mathematical model.
In a second aspect, the invention provides a system for classifying hepatitis B associated cirrhosis based on plasma microRNA marker expression levels using a logistic regression mathematical model.
The technical scheme of the invention is as follows: the invention establishes a method for classifying hepatitis B related cirrhosis based on the expression level of a plasma microRNA marker by using a logistic regression mathematical model, which comprises the following steps:
collecting a large number of chronic hepatitis B, hepatitis B-related cirrhosis and healthy sample data and establishing an original database;
adopting two layers of classification models to distinguish, in sequence, healthy from liver disease (chronic hepatitis B/cirrhosis), and hepatitis B-related cirrhosis from chronic hepatitis B;
constructing and optimizing a logistic regression model on the training set by means such as feature selection and data optimization, and selecting the optimal model after evaluation to establish the final classification method;
collecting separate test set samples for model testing and evaluation.
According to an embodiment of the method for classifying hepatitis B related cirrhosis based on the expression level of the plasma microRNA marker by using the logistic regression mathematical model, the classification model adopts two layers, wherein the first layer is a liver disease (chronic hepatitis B/cirrhosis) and health classification model (model DH), and the second layer is a hepatitis B related cirrhosis and chronic hepatitis B classification model (model AB).
According to an embodiment of the method for classifying hepatitis B-related cirrhosis based on the expression level of the plasma microRNA marker by using the logistic regression mathematical model, the logistic regression model is constructed and optimized on the training set by means such as feature selection and data optimization. The feature selection uses an information gain algorithm to rank the features of the training set; the resulting values are the contribution degree index of each feature, and features with a high contribution degree can be used as candidate microRNA markers. The training set data are processed and optimized by means such as quality control and one-sided removal of end values, a logistic regression model is established, and the accuracy of the model is evaluated by cross-validation. This process is repeated until the model with the best accuracy is obtained.
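The logistic regression model formula is:
$h(x) = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
wherein $x_1, x_2, \dots, x_n$ are the selected n features, and $\theta_0, \theta_1, \theta_2, \dots, \theta_n$ are the coefficients of the individual features obtained from the training set.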
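The text does not write out how the score h(x) relates to a class probability. Under the standard logistic regression formulation, which is assumed here, the relation is as follows, and it shows why a maximum-probability decision reduces to checking the sign of the score:

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-h_\theta(x)}}, \qquad
P(y = 1 \mid x) > \tfrac{1}{2}
\iff e^{-h_\theta(x)} < 1
\iff h_\theta(x) > 0 .
```

Here y = 1 denotes the class assigned to positive scores; the symbol y is introduced for this illustration only.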
The invention also discloses a system for liver disease classification based on plasma microRNA marker expression level by using a logistic regression mathematical model, which comprises a database module, a quality control module and a model classification module, wherein:
the database module comprises an original database used as a training set and subsequently collected blind test data;
the quality control module removes extreme values caused by experimental errors;
the model classification module adopts two layers of classification models (healthy versus liver disease (chronic hepatitis B/cirrhosis), then hepatitis B-related cirrhosis versus chronic hepatitis B) to determine the final sample classification.
According to the method for classifying hepatitis B-related cirrhosis based on the expression level of the plasma microRNA marker by using the logistic regression mathematical model, the database module stores an original database of 486 cases used as a training set, comprising 150 hepatitis B-related cirrhosis samples, 150 chronic hepatitis B samples and 186 healthy samples, together with subsequently collected blind test data. Each sample comprises the expression values of 7 microRNAs (miR-122-5p, miR-21-5p, miR-146a-5p, miR-29c-3p, miR-381-3p, miR-223 and miR-22-3p) and the values of three clinical indexes (ALT, ALB and DNA).
According to the method for classifying the hepatitis B related cirrhosis based on the expression level of the plasma microRNA marker by using the logistic regression mathematical model, the extreme value caused by experimental error is removed by the quality control module. The range of non-extremes is defined as: in the model DH, the Ct value range of the marker miR-381-3p is 19.40-32.10, the Ct value range of the marker miR-22-3p is 16.72-26.86, and the Ct value range of the marker miR-146a-5p is 19.32-29.16; in the model AB, the Ct value range of the marker miR-122-5p is 17.61-26.99, the marker miR-21-5p is 16.79-24.47, the marker miR-146a-5p is 19.31-26.64, the marker miR-29c-3p is 18.57-26.18, the marker miR-381-3p is 20.13-27.87, the marker miR-223 is 15.35-24.15, and the marker miR-22-3p is 16.71-23.95.
According to the method for classifying hepatitis B-related cirrhosis based on the plasma microRNA marker expression level by using the logistic regression mathematical model, the model classification module determines the sample classification with two layers of classification models (healthy versus liver disease (chronic hepatitis B/cirrhosis), then hepatitis B-related cirrhosis versus chronic hepatitis B). The modeling algorithm is logistic regression, and a plurality of microRNA molecular markers are combined and expressed by a formula. The formula of the algorithm for distinguishing healthy from liver disease (DH) is:
$h_{DH}(x) = -1.972\,X_{\text{miR-381-3p}} + 0.0079\,X_{\text{miR-22-3p}} - 1.6462\,X_{\text{miR-146a-5p}} + 74.495$
The thresholds determined from maximum-probability classification are:
D, liver disease (chronic hepatitis B/cirrhosis) class: $h_{DH}(x) > 0$;
H, healthy class: $h_{DH}(x) < 0$.
The formula of the algorithm for distinguishing hepatitis B-related cirrhosis from chronic hepatitis B (AB) is:
$h_{AB}(x) = 1.1925\,X_{\text{miR-122-5p}} + 0.3978\,X_{\text{miR-21-5p}} + 0.3726\,X_{\text{miR-146a-5p}} - 1.7062\,X_{\text{miR-29c-3p}} + 0.1303\,X_{\text{miR-223}} + 0.8156\,X_{\text{miR-22-3p}} - 0.1432\,X_{\text{ALB}} - 0.3608\,X_{\text{DNA}} - 0.0041\,X_{\text{ALT}} - 23.9918$
A, hepatitis B-related cirrhosis class: $h_{AB}(x) > 0$;
B, chronic hepatitis B class: $h_{AB}(x) < 0$.
The invention provides a method and a system for classifying hepatitis B-related cirrhosis based on the combined expression level of 7 plasma microRNA markers by using a logistic regression mathematical model. The classification method and system of the present invention are simpler and faster to operate than conventional clinical diagnostic methods. With the advent of the big data era and the development of sequencing technology, the amount of collected health and disease data keeps growing, so the method of the invention can be continuously improved to obtain models with higher accuracy.
The invention is further described below with reference to the figures and examples.
Embodiment of the method for classification of hepatitis B-related cirrhosis based on the expression levels of a combination of 7 plasma microRNA markers using a logistic regression mathematical model
Fig. 1 shows a flow chart of an embodiment of the present invention of a method and system for classification of hepatitis b related cirrhosis based on the expression level of a combination of 7 plasma microRNA markers using a logistic regression mathematical model. Referring to fig. 1, the following is a detailed description of each step in the method of the present embodiment.
Step 1: collecting a large number of chronic hepatitis B, hepatitis B-related cirrhosis and healthy samples and establishing an original database.
In this step, chronic hepatitis B, hepatitis B-related cirrhosis and healthy-person samples collected by various hospitals, together with the related clinical indexes, are compiled. Blood is drawn from the samples and the microRNA expression values of each sample are measured experimentally, so that differentially expressed microRNA molecular markers are screened out. The sample data are divided into three classes: healthy persons (H), hepatitis B-related cirrhosis (A) and chronic hepatitis B (B).
Step 2: adopting two layers of classification models to distinguish, in sequence, healthy from liver disease (chronic hepatitis B/cirrhosis), and hepatitis B-related cirrhosis from chronic hepatitis B.
In this step, the model is established in two layers. The first layer is model DH: hepatitis B-related cirrhosis (A) and chronic hepatitis B (B) are merged into the disease class D, which is modeled against the healthy class H. The second layer is model AB, which is the classification model between hepatitis B-related cirrhosis (A) and chronic hepatitis B (B).
Step 3: constructing and optimizing a logistic regression model on the training set by means such as feature selection and data optimization, and selecting the optimal model after evaluation to establish the final classification method.
In this step, the feature selection method is information gain: the features of the training set are ranked, the resulting values are the contribution degree index of each feature, and features with a high contribution degree can be used as candidate microRNA markers. Extreme values caused by experimental errors are removed by quality control of the training set. The data are further processed and optimized by one-sided removal of end values, which increases the discrimination between samples of different classes. A logistic regression model is established from the processed training set, and cross-validation is performed to evaluate the accuracy of the model. Finally, this process is repeated until the optimal model is obtained.
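The cross-validation step can be sketched as below. The number of folds is not stated in the text, so `k` is a free parameter here; the folds are contiguous and unshuffled for simplicity; and a toy threshold model stands in for logistic regression training, which is omitted. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k contiguous folds; return (train, test) pairs."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


def cross_validate(xs: Sequence[float], ys: Sequence[int],
                   fit: Callable, predict: Callable, k: int) -> float:
    """Return the fraction of samples predicted correctly across all folds."""
    correct = 0
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(model, xs[i]) == ys[i] for i in test)
    return correct / len(xs)


# Toy stand-in for model training: a threshold halfway between the class means.
def fit_threshold(xs: List[float], ys: List[int]) -> float:
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold: float, x: float) -> int:
    return 1 if x > threshold else 0


# Hypothetical one-feature data set, for illustration only.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]
accuracy = cross_validate(xs, ys, fit_threshold, predict_threshold, k=3)
```

In practice the samples would usually be shuffled or stratified by class before splitting, so that every training fold contains all classes.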
For the logistic regression model, the calculation formula is expressed as:
$h(x) = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
wherein $x_1, x_2, \dots, x_n$ are the selected n features, and $\theta_0, \theta_1, \theta_2, \dots, \theta_n$ are the coefficients of the individual features obtained from the training set. n is an integer satisfying $1 < n \le 20$, preferably less than 10.
Step 4: collecting separate test set samples for model testing and evaluation.
In this step, separate test set samples are collected to test and evaluate the model and to confirm that the model is not overfitted.
Fig. 2 shows a flow chart of a single-layer classification model in an embodiment of the classification method and system of the present invention for hepatitis B-related cirrhosis based on the expression level of a combination of 7 plasma microRNA markers using a logistic regression mathematical model, from which the details of the single-layer classification model can be more clearly understood.
For the above steps, the following are four specific examples:
Example 1: collecting a large number of chronic hepatitis B, hepatitis B-related cirrhosis and healthy sample data and establishing an original database
The diagnosis of liver cirrhosis is based on the 2000 viral hepatitis prevention and treatment guidelines of the Chinese Medical Association, specifically: a history of chronic hepatitis virus infection, with imaging indicating diffuse hepatic fibrosis and the formation of regenerative nodules; other manifestations can include splenomegaly, hypersplenism, and esophageal and gastric varices; the gold standard is the finding of regenerative nodules on pathological examination. The diagnosis of hepatitis B is based on the same 2000 guidelines, specifically: a hepatitis course of more than half a year, or a prior history of hepatitis B or HBsAg carriage, with symptoms, signs and liver function abnormalities of hepatitis now recurring due to the same pathogen; such patients without liver cirrhosis can be diagnosed with chronic hepatitis B.
From June 2012 to March 2014, 150 plasma samples from patients meeting the above definition of hepatitis B and 150 plasma samples meeting the above definition of hepatitis B-related cirrhosis were collected from Beijing Youyan Hospital affiliated with Capital Medical University. RNA extraction and microRNA sequencing analysis were carried out by Beijing Kuangbo Biotechnology, Inc. The resulting microRNA expression values for hepatitis B-related cirrhosis and chronic hepatitis B were compiled into a database and form part of the sample data set of the original database.
Example 2: feature selection
The algorithm for feature selection is information gain, a relatively mature algorithm in the field, and is carried out mainly with a software package in Weka. Reference may be made to Mitchell, Tom M. (1997). Machine Learning. The McGraw-Hill Companies, Inc. ISBN 0070428077, pages 55 to 60. Specifically: the microRNAs are used as features and are ranked by the InfoGainAttributeEval (information gain) algorithm in the Weka software package; the resulting values, namely the contribution degree index of each feature, can be used as a reference for feature selection. Feature selection was carried out on multiple groups of microRNA data from hepatitis B-related cirrhosis, chronic hepatitis B and healthy samples sequenced on February 28, 2015. For the first-layer model DH, the model accuracy remains essentially constant as the number of microRNA molecular markers increases. Weighing the number of markers against the accuracy of the model, the three markers ranked highest by contribution degree were finally selected, namely miR-381-3p, miR-22-3p and miR-146a-5p, giving a model accuracy of 0.997 for D and 0.986 for H, with an average accuracy of 0.995. Please refer to fig. 3.
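To illustrate the ranking idea, the sketch below computes the information gain of a numeric feature for a single fixed split threshold and ranks two features by it. This is a simplified illustration, not Weka's InfoGainAttributeEval implementation, which handles the discretization of numeric attributes itself; the marker names, values and threshold are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values: Sequence[float], labels: Sequence[str],
                     threshold: float) -> float:
    """Entropy reduction from splitting the samples at `values <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Hypothetical Ct values for two markers across six samples.
labels = ["D", "D", "D", "H", "H", "H"]
features: Dict[str, List[float]] = {
    "marker_a": [20.0, 21.0, 22.0, 28.0, 29.0, 30.0],  # separates D from H
    "marker_b": [20.0, 28.0, 21.0, 29.0, 22.0, 30.0],  # mixes the classes
}
gains = {name: information_gain(vals, labels, threshold=25.0)
         for name, vals in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
```

A feature that separates the classes perfectly at the threshold attains the full label entropy as its gain, while a feature that leaves both sides mixed scores close to zero.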
Example 3: quality control
Quality control was carried out on multiple groups of microRNA data from hepatitis B-related cirrhosis, chronic hepatitis B and healthy samples sequenced on February 28, 2015, and extreme values caused by experimental errors were removed. Extreme values are defined using the boxplot statistics in R: a value is extreme if it is greater than the boxplot maximum or less than the boxplot minimum. For model AB, classes A and B were combined for preliminary quality control, giving the following Ct value ranges: marker miR-122-5p, 17.61-26.99; marker miR-21-5p, 16.79-24.47; marker miR-146a-5p, 19.31-26.64; marker miR-29c-3p, 18.57-26.18; marker miR-381-3p, 20.13-27.87; marker miR-223, 15.35-24.15; marker miR-22-3p, 16.71-23.95. After quality control, the accuracy on the training set is improved compared with before quality control; see fig. 4.
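One way to read the boxplot rule is that the "minimum" and "maximum" are the whisker ends, that is, the most extreme data points lying within 1.5 times the interquartile range of the quartiles, which is the default convention of R's boxplot. The Python sketch below follows that reading as an assumption. It uses `statistics.quantiles` (Python 3.8+) for the quartiles, which can differ slightly from the hinges R computes, and the Ct values are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def whisker_bounds(values: Sequence[float], coef: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) whisker ends of a Tukey-style boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - coef * iqr, q3 + coef * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return min(inside), max(inside)


def remove_extremes(values: Sequence[float], coef: float = 1.5) -> List[float]:
    """Drop values above the upper whisker or below the lower whisker."""
    low, high = whisker_bounds(values, coef)
    return [v for v in values if low <= v <= high]


# Hypothetical Ct values with one outlying measurement.
ct_values = [20.1, 20.5, 21.0, 21.2, 21.8, 22.0, 22.4, 35.0]
bounds = whisker_bounds(ct_values)
kept = remove_extremes(ct_values)
```

The whisker ends computed from the training data then serve as a fixed per-marker range, in the same form as the ranges listed in the text.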
Example 4: optimal model
The optimal model is determined through repeated optimization and evaluation. The optimal model DH uses 3 microRNA molecular markers (miR-381-3p, miR-22-3p and miR-146a-5p) as features; its cross-validation accuracy is 0.963 for D and 0.939 for H, with an average accuracy of 0.954.
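The text does not say how the average accuracy is computed. One reading that reproduces the reported 0.954 is a sample-weighted average of the per-class accuracies, sketched below. The class sizes are an assumption: 300 for D (150 cirrhosis plus 150 chronic hepatitis B) and 186 for H, with no samples removed by quality control. The function name is hypothetical.

```python
from typing import Dict


def overall_accuracy(per_class_accuracy: Dict[str, float],
                     class_counts: Dict[str, int]) -> float:
    """Sample-weighted average of per-class accuracies."""
    total = sum(class_counts.values())
    correct = sum(per_class_accuracy[c] * class_counts[c] for c in class_counts)
    return correct / total


per_class = {"D": 0.963, "H": 0.939}   # cross-validation accuracies from the text
counts = {"D": 300, "H": 186}          # assumed class sizes (see lead-in)
average = round(overall_accuracy(per_class, counts), 3)
```

Under these assumptions the weighted average rounds to 0.954, whereas the unweighted mean of the two class accuracies would be 0.951.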
Embodiment of the system for classification of hepatitis B-related cirrhosis based on the expression levels of a combination of 7 plasma microRNA markers using a logistic regression mathematical model
FIG. 5 shows a schematic diagram of the composition and connections of the system for classification of hepatitis B associated cirrhosis based on the expression levels of a combination of 7 plasma microRNA markers according to the present invention using a logistic regression mathematical model. Please refer to fig. 5. The system of the embodiment comprises a database module, a quality control module and a model classification module.
The database module is used for storing a raw database used as a training set and blind test data collected subsequently;
the quality control module is a module for removing extreme values caused by experimental errors through quality control;
the model classification module classifies samples with a two-layer classification model, distinguishing healthy from liver disease, and hepatitis B-related cirrhosis from chronic hepatitis B.
The database module in the system of the embodiment stores a raw database comprising 486 cases used as training sets, comprising 150 cases of hepatitis B related cirrhosis samples, 150 cases of chronic hepatitis B samples and 186 cases of health samples; and subsequently collected blind test data. Each sample comprises expression values of 7 microRNAs (miR-122-5p, miR-21-5p, miR-146a-5p, miR-29c-3p, miR-381-3p, miR-223 and miR-22-3p) and three clinical indexes (ALT, ALB and DNA).
In the system of the embodiment, the quality control module removes extreme values caused by experimental errors through quality control. The range of non-extreme values is defined as: in the model DH, the Ct value range of the marker miR-381-3p is 19.40-32.10, the Ct value range of the marker miR-22-3p is 16.72-26.86, and the Ct value range of the marker miR-146a-5p is 19.32-29.16; in the model AB, the Ct value range of the marker miR-122-5p is 17.61-26.99, the marker miR-21-5p is 16.79-24.47, the marker miR-146a-5p is 19.31-26.64, the marker miR-29c-3p is 18.57-26.18, the marker miR-381-3p is 20.13-27.87, the marker miR-223 is 15.35-24.15, and the marker miR-22-3p is 16.71-23.95.
The model classification module in the system of this embodiment classifies samples with a two-layer classification model (healthy versus liver disease, then hepatitis B-related cirrhosis versus chronic hepatitis B). The modeling algorithm is logistic regression, and the microRNA molecular markers are combined into a single formula. The formula for distinguishing liver disease from healthy (DH) is as follows:
h_DH(x) = -1.972 × (miR-381-3p) + 0.0079 × (miR-22-3p) - 1.6462 × (miR-146a-5p) + 74.495
The thresholds determined by maximum-probability classification are:
D, liver disease (chronic hepatitis B/cirrhosis) class: h_DH(x) > 0;
H, healthy class: h_DH(x) < 0;
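As a minimal sketch of how the DH score and its sign rule could be evaluated, the Python below uses the coefficients given above. The function names and the "Healthy" label are hypothetical, and the treatment of a score of exactly 0 is an assumption. In logistic regression the predicted probability is 1 / (1 + e^(-h)), which exceeds 0.5 exactly when h > 0, and this is why the maximum-probability threshold falls at 0.

```python
import math

# Coefficients and intercept of model DH, copied from the formula above.
DH_COEFFICIENTS = {
    "miR-381-3p": -1.972,
    "miR-22-3p": 0.0079,
    "miR-146a-5p": -1.6462,
}
DH_INTERCEPT = 74.495


def h_dh(sample):
    """Linear score h_DH(x): weighted sum of marker values plus intercept."""
    return DH_INTERCEPT + sum(
        weight * sample[marker] for marker, weight in DH_COEFFICIENTS.items()
    )


def classify_dh(sample):
    """Apply the sign rule: h_DH(x) > 0 is liver disease, otherwise healthy.

    Assumption: a score of exactly 0 is reported as healthy; the source
    does not say how ties are handled.
    """
    return "LiverDisease" if h_dh(sample) > 0 else "Healthy"


def probability(score):
    """Logistic function; a score of 0 corresponds to probability 0.5."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical marker values, chosen only to exercise both branches.
example = {"miR-381-3p": 20.0, "miR-22-3p": 20.0, "miR-146a-5p": 20.0}
print(classify_dh(example), round(h_dh(example), 3))
```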
The formula for distinguishing hepatitis B-related cirrhosis from chronic hepatitis B (AB) is as follows:
h_AB(x) = 1.1925 × (miR-122-5p) + 0.3978 × (miR-21-5p) + 0.3726 × (miR-146a-5p) - 1.7062 × (miR-29c-3p) + 0.1303 × (miR-223) + 0.8156 × (miR-22-3p) - 0.1432 × ALB - 0.3608 × DNA - 0.0041 × ALT - 23.9918
A, hepatitis B-related cirrhosis class: h_AB(x) > 0;
B, chronic hepatitis B class: h_AB(x) < 0.
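The second-layer score can be sketched in the same way. The Python below is illustrative only: the function names and class labels are hypothetical, the units of ALB, DNA and ALT are not given in the text (the example values are placeholders), and `two_layer_label` assumes that the second layer is consulted only for samples the first layer assigns to liver disease.

```python
# Coefficients and intercept of model AB, copied from the formula above.
AB_COEFFICIENTS = {
    "miR-122-5p": 1.1925,
    "miR-21-5p": 0.3978,
    "miR-146a-5p": 0.3726,
    "miR-29c-3p": -1.7062,
    "miR-223": 0.1303,
    "miR-22-3p": 0.8156,
    "ALB": -0.1432,
    "DNA": -0.3608,
    "ALT": -0.0041,
}
AB_INTERCEPT = -23.9918


def h_ab(sample):
    """Linear score h_AB(x): weighted sum of the inputs plus intercept."""
    return AB_INTERCEPT + sum(
        weight * sample[name] for name, weight in AB_COEFFICIENTS.items()
    )


def classify_ab(sample):
    """Sign rule: h_AB(x) > 0 is class A (cirrhosis), otherwise class B.

    Assumption: a score of exactly 0 is reported as class B.
    """
    return "Cirrhosis" if h_ab(sample) > 0 else "ChronicHepatitisB"


def two_layer_label(dh_score, sample):
    """Run the second layer only when the first-layer score is positive."""
    if dh_score <= 0:
        return "Healthy"
    return classify_ab(sample)


# Hypothetical input values; the units of ALB, DNA and ALT are assumed.
example = {
    "miR-122-5p": 20.0, "miR-21-5p": 20.0, "miR-146a-5p": 20.0,
    "miR-29c-3p": 20.0, "miR-223": 20.0, "miR-22-3p": 20.0,
    "ALB": 40.0, "DNA": 5.0, "ALT": 50.0,
}
print(classify_ab(example), round(h_ab(example), 4))
```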
Test examples
To verify the performance of the system of the present invention, two sets of blind test data were used for verification and evaluation below.
Blind test data 1
Blind test data 1 is a liver disease sample set sequenced at Beijing You'an Hospital, affiliated with Capital Medical University, on February 20, 2014. It comprises 40 samples: 20 hepatitis B-related cirrhosis samples and 20 chronic hepatitis B samples.
Blind test data 2
Blind test data 2 is a liver disease and healthy sample set sequenced at Beijing You'an Hospital, affiliated with Capital Medical University, on April 1, 2015. It comprises 40 samples: 12 hepatitis B-related cirrhosis samples, 13 chronic hepatitis B samples and 15 healthy samples.
System operational requirements/environments
1. A command-line environment: a DOS command prompt or a Linux command line;
2. The statistical software package R is installed.
Command line input format:
Rscript miRNA.R -i test_DH.txt -type DH -o test_0H_report.txt -e test_DH_poorQC.txt
Rscript miRNA.R -i test_DH.txt -type DH -v
where miRNA.R is the name of the software, -i specifies the input file, -type the data processing format, -o the output file, -e the error file, and -v prints the output directly to the screen.
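For readers who want to mirror this option set outside R, the sketch below declares the same flags with Python's standard `argparse` module. It is not the miRNA.R script: the `dest` names are hypothetical, and restricting `-type` to `DH` and `AB` is an assumption based on the two models described in this document.

```python
import argparse


def build_parser():
    """Declare the five options listed above: -i, -type, -o, -e and -v."""
    parser = argparse.ArgumentParser(prog="miRNA")
    parser.add_argument("-i", dest="input", required=True,
                        help="input file")
    # Assumption: only the two model names used in this document are valid.
    parser.add_argument("-type", dest="model", required=True,
                        choices=["DH", "AB"], help="data processing format")
    parser.add_argument("-o", dest="output", help="output file")
    parser.add_argument("-e", dest="error", help="error file")
    parser.add_argument("-v", dest="to_screen", action="store_true",
                        help="print the output directly to the screen")
    return parser


# The second command line shown above, split into its arguments.
args = build_parser().parse_args(["-i", "test_DH.txt", "-type", "DH", "-v"])
print(args.input, args.model, args.to_screen)
```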
Example 1: input file format
sample_name v o e
1 27.367422 23.918165 20.387817
2 27.591124 24.643553 20.168322
3 28.13521 23.20343 21.219599
4 27.901966 21.143312 20.402287
5 28.58136 20.707237 21.73571
6 24.76316 18.762772 19.222338
7 27.30698 22.417469 23.841616
8 26.368567 19.766613 20.129692
9 28.93138 25.612793 21.301153
10 26.824923 18.512665 19.730814
Output file format
sample_name v o e status
1 27.367422 23.918165 20.387817 LiverDisease
2 27.591124 24.643553 20.168322 LiverDisease
3 28.13521 23.20343 21.219599 LiverDisease
4 27.901966 21.143312 20.402287 LiverDisease
5 28.58136 20.707237 21.73571 LiverDisease
7 27.30698 22.417469 23.841616 LiverDisease
8 26.368567 19.766613 20.129692 LiverDisease
9 28.93138 25.612793 21.301153 LiverDisease
10 26.824923 18.512665 19.730814 LiverDisease
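The two file layouts above differ only by the trailing status column. A minimal Python sketch of reading the input layout and writing the output layout follows; the function names are hypothetical, the files are assumed to be whitespace-delimited with one header row, and the `status_of` callback is a stand-in for whichever classifier is applied.

```python
import io


def read_samples(handle):
    """Parse a whitespace-delimited table whose first row is the header."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = {header[0]: fields[0]}  # sample name stays a string
        # Remaining columns are numeric expression values.
        for name, text in zip(header[1:], fields[1:]):
            row[name] = float(text)
        rows.append(row)
    return header, rows


def write_report(handle, header, rows, status_of):
    """Write the input columns followed by a trailing 'status' column."""
    handle.write(" ".join(header + ["status"]) + "\n")
    for row in rows:
        values = [str(row[name]) for name in header]
        handle.write(" ".join(values + [status_of(row)]) + "\n")


if __name__ == "__main__":
    source = io.StringIO(
        "sample_name v o e\n"
        "1 27.367422 23.918165 20.387817\n"
    )
    header, rows = read_samples(source)
    report = io.StringIO()
    # Stand-in classifier: a real run would compute the status per row.
    write_report(report, header, rows, lambda row: "LiverDisease")
    print(report.getvalue(), end="")
```

In the example above, sample 6 appears in the input but not in the output. Its third value, 19.222338, is below the lower Ct bounds that model DH sets for miR-146a-5p (19.32) and miR-381-3p (19.40), which may be why it was removed by quality control; the column headers do not identify the markers, so this reading is an assumption.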
Results and discussion
Blind test data 1
Blind test data 1 contains only liver disease samples (40 in total), of which 39 remained after quality control. The retained data were used for model AB prediction analysis and evaluation; detailed results are shown in Table 1. The accuracy for class A (hepatitis B-related cirrhosis) is 0.90, the accuracy for class B (chronic hepatitis B) is 0.737, and the average accuracy is 0.821. The area under the ROC (receiver operating characteristic) curve, AUC, reaches 0.884.
Table 1. Blind test data 1: model AB prediction analysis and evaluation results
[Table image: Figure GDA0003137563710000121, Figure GDA0003137563710000131]
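The rates reported for blind test data 1 can be checked with simple arithmetic. The per-class counts in the Python sketch below do not appear in the text, because Table 1 survives only as an image; they are an inferred set of integers that reproduces the reported rates (20 class A and 19 class B samples after quality control), so they should be read as an assumption.

```python
def accuracy_summary(correct_a, total_a, correct_b, total_b):
    """Return (class A accuracy, class B accuracy, overall accuracy)."""
    acc_a = correct_a / total_a
    acc_b = correct_b / total_b
    overall = (correct_a + correct_b) / (total_a + total_b)
    return acc_a, acc_b, overall


# Inferred counts (assumption): 18 of 20 class A and 14 of 19 class B correct.
acc_a, acc_b, overall = accuracy_summary(18, 20, 14, 19)
print(round(acc_a, 3), round(acc_b, 3), round(overall, 3))
# prints: 0.9 0.737 0.821
```

Under these counts the "average accuracy" of 0.821 is the overall fraction of correctly classified samples (32 of 39) rather than the mean of the two class accuracies, which would be about 0.818.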
Blind test data 2
Blind test data 2 contains 40 liver disease and healthy samples. For the first-layer model DH, 31 samples were retained after preliminary quality control, and these retained samples were used for model DH prediction analysis and evaluation; detailed results are shown in Table 2. The accuracy for class D (liver disease) is 0.875, the accuracy for class H (healthy) is 1, and the average accuracy is 0.903. The AUC of the ROC curve is 0.988. Model DH therefore classifies these samples well and agrees closely with clinical diagnosis.
Table 2. Blind test data 2: model DH prediction results
[Table image: Figure GDA0003137563710000132]
Blind test data 2 contains 25 hepatitis B-related cirrhosis and chronic hepatitis B samples, of which 19 remained after quality control. The retained samples were used for model AB prediction analysis and evaluation; detailed results are shown in Table 3. The accuracy for class A (hepatitis B-related cirrhosis) is 0.90, the accuracy for class B (chronic hepatitis B) is 0.889, and the average accuracy is 0.895. The AUC of the ROC curve reaches 0.967.
Table 3. Blind test data 2: model AB prediction results
[Table image: Figure GDA0003137563710000133]
In conclusion, models DH and AB both achieve high prediction accuracy on the blind test data with no sign of overfitting, so the system can be used to diagnose hepatitis B-related cirrhosis in practical disease detection, and it is simple and fast to operate.

Claims (5)

1. A system for liver disease classification, comprising a database module, a quality control module and a model classification module, wherein:
the database module comprises an original database used as a training set and a blind test database collected subsequently;
the quality control module is a module for removing extreme values caused by experimental errors;
the model classification module comprises a classification model DH for distinguishing between liver disease, consisting of chronic hepatitis B/cirrhosis, and healthy controls, and a classification model AB for distinguishing between chronic hepatitis B and hepatitis B-related cirrhosis,
the modeling algorithm of the model classification module is logistic regression, and a plurality of microRNA molecular markers are combined and expressed by a formula, wherein the formula of model DH for distinguishing healthy from liver disease is as follows:
h_DH(x) = -1.972 × (miR-381-3p) + 0.0079 × (miR-22-3p) - 1.6462 × (miR-146a-5p) + 74.495
the thresholds determined by maximum-probability classification are:
D, liver disease class: h_DH(x) > 0;
H, healthy class: h_DH(x) < 0;
the formula of model AB for distinguishing hepatitis B-related cirrhosis from chronic hepatitis B is as follows:
h_AB(x) = 1.1925 × (miR-122-5p) + 0.3978 × (miR-21-5p) + 0.3726 × (miR-146a-5p) - 1.7062 × (miR-29c-3p) + 0.1303 × (miR-223) + 0.8156 × (miR-22-3p) - 0.1432 × ALB - 0.3608 × DNA - 0.0041 × ALT - 23.9918
A, hepatitis B-related cirrhosis class: h_AB(x) > 0,
B, chronic hepatitis B class: h_AB(x) < 0,
wherein ALB is albumin, DNA is HBV viral DNA, and ALT is transaminase (alanine aminotransferase).
2. The system of claim 1, wherein the database module comprises an original database of more than 10 cases used as a training set and a blind test database collected subsequently, wherein the data of each case comprises the Ct values corresponding to miR-122-5p, miR-21-5p, miR-146a-5p, miR-29c-3p, miR-381-3p, miR-223 and miR-22-3p, and clinical indexes such as the transaminase ALT value, the albumin ALB content and the HBV viral DNA content.
3. The system according to claim 2, wherein the database module comprises an original database of more than 50 cases used as a training set and a blind test database collected subsequently.
4. The system of claim 2, wherein the database module comprises an original database of more than 200 cases used as a training set and a blind test database collected subsequently.
5. The system of claim 1, wherein the quality control module removes extreme values due to experimental error by quality control, and the range of non-extreme Ct values is defined as: in the model DH, the Ct value range of the marker miR-381-3p is 19.40-32.10, the Ct value range of the marker miR-22-3p is 16.72-26.86, and the Ct value range of the marker miR-146a-5p is 19.32-29.16; in the model AB, the Ct value range of the marker miR-122-5p is 17.61-26.99, the Ct value range of the marker miR-21-5p is 16.79-24.47, the Ct value range of the marker miR-146a-5p is 19.31-26.64, the Ct value range of the marker miR-29c-3p is 18.57-26.18, the Ct value range of the marker miR-381-3p is 20.13-27.87, the Ct value range of the marker miR-223 is 15.35-24.15, and the Ct value range of the marker miR-22-3p is 16.71-23.95.
CN201510964983.5A 2015-12-21 2015-12-21 System for classifying hepatitis B-related cirrhosis Active CN106909767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510964983.5A CN106909767B (en) 2015-12-21 2015-12-21 System for classifying hepatitis B-related cirrhosis

Publications (2)

Publication Number Publication Date
CN106909767A CN106909767A (en) 2017-06-30
CN106909767B true CN106909767B (en) 2021-11-05

Family

ID=59200700

Country Status (1)

Country Link
CN (1) CN106909767B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102776185A (en) * 2011-05-06 2012-11-14 复旦大学附属中山医院 Liver cancer diagnostic marker composed of blood plasma microRNA (micro ribonucleic acid) and new method for diagnosing liver cancer
CN103345544A (en) * 2013-06-11 2013-10-09 大连理工大学 Predicting organic chemical biodegradability according to logistic regression method
CN104232637A (en) * 2014-04-18 2014-12-24 首都医科大学附属北京佑安医院 microRNA molecular marker of liver cirrhosis and use thereof
CN104794321A (en) * 2014-01-21 2015-07-22 中国科学院上海生命科学研究院 Precancerous disease state detecting device and method
WO2015175642A2 (en) * 2014-05-13 2015-11-19 Sangamo Biosciences, Inc. Methods and compositions for prevention or treatment of a disease
CN105139083A (en) * 2015-08-10 2015-12-09 石庆平 Method and system for reevaluating safety of drug after appearance on market
CN105160182A (en) * 2015-09-07 2015-12-16 向阳 Automatic evaluation system of diagnosis and treatment correctness on the basis of virtual case

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3965111A1 (en) * 2013-08-30 2022-03-09 Personalis, Inc. Methods and systems for genomic analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
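Study of plasma microRNAs as molecular markers of liver cancer and of liver injury in chronic hepatitis B; 高雪 (Gao Xue); China Doctoral Dissertations Full-text Database, Medicine and Health Sciences; 20130215 (No. 2); p. E072-31 *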

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230510

Address after: 102600 3rd floor, No.18 Keyuan Road, economic development zone, Daxing District, Beijing

Patentee after: Beijing contemporaneous Biotechnology Co.,Ltd.

Address before: 100176 floor 3, building 2, aipuyi building, No. 1, Desheng East Road, Yizhuang Economic and Technological Development Zone, Daxing District, Beijing

Patentee before: BEIJING QUANTOBIO BIOTECHNOLOGY CO.,LTD.
