CN116344040A - Construction method of integrated model for intestinal flora detection and detection device thereof - Google Patents

Publication number
CN116344040A
Authority
CN
China
Prior art keywords
intestinal flora
model
flora data
intestinal
data sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310571968.9A
Other languages
Chinese (zh)
Other versions
CN116344040B (en
Inventor
任毅
崔玉涛
李旭
方沁怡
穆煜
李响
任静
张志明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cui Yutao Clinic Co ltd
Beijing Kayudi Biotechnology Co ltd
Original Assignee
Beijing Cui Yutao Clinic Co ltd
Beijing Kayudi Biotechnology Co ltd
Application filed by Beijing Cui Yutao Clinic Co ltd, Beijing Kayudi Biotechnology Co ltd
Priority to CN202310571968.9A
Publication of CN116344040A
Application granted
Publication of CN116344040B
Legal status: Active

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G06N20/20 - Ensemble learning
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a construction method of an integrated model for intestinal flora detection and a detection device thereof. The construction method comprises the following steps: obtaining raw intestinal flora data, including healthy intestinal flora data samples and abnormal intestinal flora data samples; combining a plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples, respectively, into a plurality of intestinal flora data sample sets; training a plurality of single models based on the respective intestinal flora data sample sets to determine model parameters of each single model; and fusing the plurality of single models into an integrated model for intestinal flora detection. The construction method can significantly improve the prediction accuracy of machine learning modeling analysis, reduce the variance across repeated modeling results, reduce random fluctuation of the results, reflect the overall situation of the data set, and improve the accuracy, objectivity and clinical applicability of modeling prediction.

Description

Construction method of integrated model for intestinal flora detection and detection device thereof
Technical Field
The present application relates to the construction and use of machine learning models for intestinal flora data. More particularly, the application relates to a method for constructing an integrated model for intestinal flora detection and a detection device thereof.
Background
The human intestinal flora has a complex composition and plays an important role in maintaining human health and the microecological balance of the body. Intestinal flora data are important biological data and are significant for practical clinical problems such as disease risk prediction, health status assessment and analysis of drug efficacy. Detecting the human intestinal flora helps to evaluate the state of the flora and the health of the individual, so that the composition of the flora can be improved in a targeted manner, its dynamic balance and the individual's physical health can be restored, and the health of people with symptoms such as diarrhea, constipation, flatulence, eczema and poor immunity can be improved.
However, most studies of the intestinal flora remain at the level of community diversity analysis. Clinical applications of intestinal flora detection to the diagnosis and treatment of specific diseases are relatively few, and an effective model construction method for intestinal flora data, an important kind of biological data, is still lacking. How to construct and use machine learning models for intestinal flora data is therefore a challenging problem in practical clinical applications.
Disclosure of Invention
According to one aspect of the present application, there is provided a method of constructing an integrated model for intestinal flora detection, comprising: obtaining raw intestinal flora data, wherein the raw intestinal flora data comprise a healthy intestinal flora data sample and an abnormal intestinal flora data sample; combining a plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples, respectively, into a plurality of intestinal flora data sample sets; training a plurality of single models based on each of the plurality of intestinal flora data sample sets to determine model parameters for each single model; and fusing the plurality of single models into the integrated model for intestinal flora detection.
In some embodiments, combining the plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples, respectively, into the plurality of intestinal flora data sample sets comprises: randomly selecting the different subsets from the healthy intestinal flora data samples and combining each of them with the abnormal intestinal flora data samples to form the plurality of intestinal flora data sample sets.
In some embodiments, prior to combining the plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples into the plurality of intestinal flora data sample sets, respectively, the method further comprises: determining whether a plurality of candidate intestinal bacteria species in the healthy intestinal flora data sample and the abnormal intestinal flora data sample reach a detection threshold; and deleting data samples which do not reach the detection threshold of the candidate intestinal strains from the healthy intestinal flora data samples and the abnormal intestinal flora data samples.
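The filtering step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each sample is assumed to be a dictionary mapping a species code to a measured content value, a sample is assumed to be dropped when any candidate species falls below the threshold, and the names `drop_undetected` and `threshold` as well as the numeric values are hypothetical.

```python
from typing import Dict, List

Sample = Dict[str, float]  # species code -> measured content (assumed layout)


def drop_undetected(samples: List[Sample], species: List[str],
                    threshold: float) -> List[Sample]:
    """Keep only samples in which every candidate species reaches the threshold."""
    kept = []
    for sample in samples:
        # Assumption: a species missing from the record counts as not detected.
        if all(sample.get(name, 0.0) >= threshold for name in species):
            kept.append(sample)
    return kept


# Hypothetical toy data for two candidate species.
species = ["BAC", "ESC"]
healthy = [{"BAC": 5.2, "ESC": 3.1}, {"BAC": 0.0, "ESC": 2.8}]
abnormal = [{"BAC": 4.0, "ESC": 0.9}, {"BAC": 4.4}]

healthy_kept = drop_undetected(healthy, species, threshold=0.5)
abnormal_kept = drop_undetected(abnormal, species, threshold=0.5)
```

The same function is applied to the healthy and the abnormal samples so that both groups are filtered by one rule.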
In some embodiments, prior to combining the plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples into the plurality of intestinal flora data sample sets, respectively, the method further comprises: calculating the feature importance of each candidate intestinal strain of a plurality of candidate intestinal strains in the healthy intestinal flora data sample and the abnormal intestinal flora data sample; selecting a plurality of key intestinal strains from the plurality of candidate intestinal strains according to the feature importance of each candidate intestinal strain; and retaining the content information corresponding to the plurality of key intestinal strains in the healthy intestinal flora data sample and the abnormal intestinal flora data sample.
In some embodiments, the feature importance of each candidate intestinal strain of the plurality of candidate intestinal strains is calculated by one or more of: a decision tree algorithm, a Pearson correlation coefficient algorithm, a mutual information and maximum information coefficient algorithm, and a recursive feature elimination algorithm.
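As one illustration of the options listed above, the sketch below ranks candidate strains by the absolute Pearson correlation between their content and the healthy/abnormal label and keeps the top-ranked ones. The function names, the choice of the absolute value as the importance score, the `top_k` cut-off and the toy numbers are assumptions made for this example.

```python
import math
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_key_species(samples: List[Dict[str, float]], labels: List[int],
                       species: List[str], top_k: int
                       ) -> Tuple[List[str], Dict[str, float]]:
    """Rank species by |correlation with the label| and keep the top_k."""
    importance = {
        name: abs(pearson([s[name] for s in samples], [float(l) for l in labels]))
        for name in species
    }
    ranked = sorted(species, key=lambda name: importance[name], reverse=True)
    return ranked[:top_k], importance


# Hypothetical toy data: label 0 = healthy, 1 = abnormal.
labels = [0, 0, 0, 1, 1, 1]
bac = [5, 6, 5, 1, 2, 1]
esc = [3, 1, 2, 2, 3, 3]
rum = [2, 2, 2, 2, 2, 2]
samples = [{"BAC": b, "ESC": e, "RUM": r} for b, e, r in zip(bac, esc, rum)]

key_species, importance = select_key_species(
    samples, labels, ["BAC", "ESC", "RUM"], top_k=2)
# Retain only the content values of the selected key species.
reduced = [{name: s[name] for name in key_species} for s in samples]
```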
In some embodiments, training the plurality of single models based on each of the plurality of intestinal flora data sample sets to determine model parameters for each single model comprises: for each intestinal flora data sample set: respectively learning the intestinal flora data sample set by utilizing a plurality of machine learning algorithms to obtain a plurality of machine learning models; and taking model parameters of an optimal machine learning model of the plurality of machine learning models as model parameters of a single model determined by training based on the intestinal flora data sample set.
In some embodiments, each of the plurality of intestinal flora data sample sets comprises a training set and a test set. Learning the intestinal flora data sample set with the plurality of machine learning algorithms to obtain the plurality of machine learning models comprises: learning the training set of the intestinal flora data sample set with each of the plurality of machine learning algorithms to obtain the plurality of machine learning models. Taking the model parameters of the optimal machine learning model of the plurality of machine learning models as the model parameters of the single model determined by training based on the intestinal flora data sample set comprises: testing the predictive performance of each of the plurality of machine learning models using the test set of the intestinal flora data sample set, and determining the model parameters of the optimal machine learning model, namely the one having the best predictive performance.
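The selection of an optimal model per sample set can be sketched as below. The two toy learners (a nearest-centroid classifier and a majority-class baseline) merely stand in for the real algorithms, and test-set accuracy is assumed as the performance measure; all names and numbers are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Predictor = Callable[[Vector], int]
Learner = Callable[[List[Vector], List[int]], Predictor]


def centroid_learner(X: List[Vector], y: List[int]) -> Predictor:
    """Toy nearest-centroid classifier (stand-in for a real algorithm)."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0 = mean([x for x, label in zip(X, y) if label == 0])
    c1 = mean([x for x, label in zip(X, y) if label == 1])

    def predict(x: Vector) -> int:
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1 if d1 < d0 else 0
    return predict


def majority_learner(X: List[Vector], y: List[int]) -> Predictor:
    """Baseline that always predicts the most common training label."""
    label = 1 if sum(y) * 2 > len(y) else 0
    return lambda x: label


def accuracy(model: Predictor, X: List[Vector], y: List[int]) -> float:
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)


def train_single_model(learners: Dict[str, Learner],
                       train: Tuple[List[Vector], List[int]],
                       test: Tuple[List[Vector], List[int]]):
    """Fit every learner on the training set, keep the best one on the test set."""
    best = (None, None, -1.0)
    for name, learner in learners.items():
        model = learner(*train)
        score = accuracy(model, *test)
        if score > best[2]:
            best = (name, model, score)
    return best


# Hypothetical sample set: label 0 = healthy, 1 = abnormal.
train = ([[5, 3], [6, 2], [5, 2], [1, 3], [2, 2], [1, 2]], [0, 0, 0, 1, 1, 1])
test = ([[6, 3], [1, 1]], [0, 1])
learners = {"nearest_centroid": centroid_learner, "majority": majority_learner}
best_name, best_model, best_score = train_single_model(learners, train, test)
```

Repeating `train_single_model` for each sample set yields one single model per set, and different sets may end up with different winning algorithms.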
In some embodiments, the plurality of machine learning algorithms includes two or more of: a random forest algorithm, a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm, a naive Bayes algorithm, a K-nearest neighbor algorithm, a K-means algorithm, a linear discriminant analysis algorithm and a linear regression algorithm.
In some embodiments, fusing the plurality of single models into the integrated model for intestinal flora detection comprises: determining the prediction result of the integrated model for an intestinal flora data sample by majority voting over the corresponding prediction results of the plurality of single models for that sample.
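Majority voting over the single models can be sketched as follows. Each single model is represented here by a hypothetical callable that returns 0 (healthy) or 1 (abnormal); the tie-breaking rule, which favours the abnormal label, is an assumption of this example and is not stated in the text.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Model = Callable[[Dict[str, float]], int]


def majority_vote(predictions: Sequence[int]) -> int:
    """Return the most frequent label among the single-model predictions."""
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    # Assumption: on a tie, prefer the larger label (1 = abnormal).
    return max(winners)


def ensemble_predict(models: List[Model], sample: Dict[str, float]) -> int:
    """Integrated-model prediction: vote over all single models."""
    return majority_vote([model(sample) for model in models])


# Hypothetical single models expressed as simple threshold rules.
models: List[Model] = [
    lambda s: int(s["BAC"] < 3.0),
    lambda s: int(s["BAC"] < 2.0),
    lambda s: int(s["ESC"] > 4.0),
]

flagged = ensemble_predict(models, {"BAC": 2.5, "ESC": 5.0})  # votes: 1, 0, 1
cleared = ensemble_predict(models, {"BAC": 4.0, "ESC": 1.0})  # votes: 0, 0, 0
```

Using an odd number of single models avoids ties for a two-class problem, which makes the tie-breaking assumption irrelevant in practice.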
According to another aspect of the present application, there is provided an apparatus for intestinal flora detection using an integrated model, comprising: an input device for receiving an intestinal flora data sample to be detected; a memory for storing model parameters of an integrated model for intestinal flora detection, wherein the integrated model is constructed by: obtaining raw intestinal flora data, wherein the raw intestinal flora data comprise a healthy intestinal flora data sample and an abnormal intestinal flora data sample; combining a plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples, respectively, into a plurality of intestinal flora data sample sets; training a plurality of single models based on each of the plurality of intestinal flora data sample sets, respectively, to determine model parameters for each single model; and fusing the plurality of single models into the integrated model; a processor for running the integrated model to perform intestinal flora detection, so as to detect whether the intestinal flora data sample to be detected is abnormal; and an output device for providing a detection result for the intestinal flora data sample to be detected.
Drawings
These and/or other aspects and advantages of the present application will become more apparent and more readily appreciated from the following detailed description of the embodiments of the present application, taken in conjunction with the accompanying drawings, wherein:
Fig. 1 shows a schematic diagram of a machine learning model constructed using a general processing approach in case of unbalance of intestinal flora data samples.
Fig. 2 shows model performance obtained by constructing a machine learning model in a general processing manner in the case of unbalance of intestinal flora data samples of eczema patients and normal people.
Fig. 3 shows the model performance obtained by constructing a machine learning model using a general processing approach in the case of an imbalance in intestinal flora data samples for diarrhea patients and normal populations.
Fig. 4 shows a flow chart of a method of constructing an integrated model for intestinal flora detection according to an embodiment of the present application.
Fig. 5 shows a schematic diagram of a method of constructing an integrated model for intestinal flora detection according to an embodiment of the present application.
Fig. 6 shows a schematic diagram of the selection of key intestinal species during the construction of a machine learning model of intestinal flora data for eczema patients and healthy people according to an embodiment of the present application.
Fig. 7 shows a schematic diagram of a single model training for each sample set of intestinal flora data for eczema patients and healthy people according to an embodiment of the present application.
Fig. 8 shows a schematic representation of the performance of an integrated model trained on intestinal flora data for eczema patients and healthy people according to an embodiment of the present application.
Fig. 9 shows a schematic diagram of performance comparisons of a single model and an integrated model trained on intestinal flora data for eczema patients and healthy people according to an embodiment of the present application.
Fig. 10 shows another schematic of the performance of an integrated model trained on intestinal flora data for eczema patients and healthy people according to an embodiment of the present application.
Fig. 11 shows a schematic diagram of the selection of key intestinal species during the construction of a machine learning model of intestinal flora data for diarrhea patients and healthy people according to an embodiment of the present application.
Fig. 12 shows a schematic diagram of the performance of an integrated model trained on intestinal flora data for diarrhea patients and healthy people according to an embodiment of the present application.
Fig. 13 shows a schematic diagram of performance comparisons of a single model and an integrated model trained on intestinal flora data for diarrhea patients and healthy people according to an embodiment of the present application.
Fig. 14 shows another schematic representation of the performance of an integrated model trained on intestinal flora data for diarrhea patients and healthy people according to an embodiment of the present application.
Fig. 15 shows a schematic diagram of a process for intestinal flora detection using an integrated model according to an embodiment of the present application.
Fig. 16 shows a block diagram of an apparatus for intestinal flora detection using an integrated model according to an embodiment of the present application.
Fig. 17 shows a block diagram of an apparatus for constructing an integrated model for intestinal flora detection according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the following description will be presented in further detail with reference to the drawings and detailed description.
First, a brief overview of the basic background and main ideas of the construction and use techniques of machine learning models for intestinal flora data in the present application will be presented.
As described above, clinical applications of intestinal flora detection to the diagnosis and treatment of specific diseases are currently relatively few, and an effective model construction method for intestinal flora data is still lacking. In recent years, the performance of machine learning algorithms on problems such as classification and regression has been continuously improved, and clinical research on the correlation between intestinal flora disorders and diseases has advanced; together these offer a new approach to disease diagnosis and treatment, namely developing dedicated machine learning models for intestinal flora data. However, real-world clinical intestinal flora data are often imbalanced: the number of intestinal flora data samples from healthy people far exceeds the number from specific, abnormal populations. Such imbalance can significantly affect the accuracy and the fluctuation of the results of machine learning, modeling analysis and data testing. The general way of handling this problem is to select, during data preprocessing, as many healthy samples as there are abnormal samples, so as to avoid the biased predictions of a model trained directly on imbalanced data samples. With this approach, however, the large number of remaining healthy samples goes unused. Especially when intestinal flora data samples are scarce and limited, a model construction method that keeps the samples balanced by using only a small number of healthy samples cannot reflect the overall information of the data set, so the accuracy and objectivity of modeling prediction cannot be guaranteed. There is therefore a need for an effective model construction method for intestinal flora data that makes effective use of the large number of healthy samples when training machine learning models.
A schematic diagram of the construction of a machine learning model and its corresponding model performance using the above general processing in case of unbalance of intestinal flora data samples will be described below with reference to fig. 1-3.
Fig. 1 shows a schematic diagram of the construction of a machine learning model using the general processing described above in the case of imbalanced intestinal flora data samples. As shown in fig. 1, when modeling analysis is performed on intestinal flora data, raw intestinal flora data samples may be obtained in advance, which include M raw healthy intestinal flora data samples (hereinafter used interchangeably with "healthy samples" or "healthy control samples") and N raw abnormal intestinal flora data samples (hereinafter used interchangeably with "abnormal samples"), where M and N are both positive integers and M is much larger than N. As described above, in order to balance the numbers of the two kinds of data samples involved in training, N healthy control samples, equal in number to the N abnormal samples, are typically selected to construct training data, which is input into a machine learning algorithm for training to obtain model parameters of a single model. However, with this approach only a portion of the healthy control samples take part in machine learning, while a large number of healthy control samples (M-N in this example) are discarded and not used effectively. Moreover, training data constructed in this way cannot reflect the overall level of the raw intestinal flora data, so the evaluation indices of the resulting models fluctuate greatly and the models cannot be applied effectively in real clinical scenarios.
Taking allergic dermatitis (e.g. eczema) and diarrhea (e.g. infantile diarrhea) as examples, the performance of models trained using the general processing described above is presented below. The AUC value (the meaning of which will be described below) of the single model obtained by the machine learning algorithm is used as an exemplary model evaluation index, reflecting the classification prediction accuracy and generalization capability of the single model. It is understood that other evaluation metrics may also be employed in the present application to evaluate the performance of a single model. In addition, it should be noted that a single model in the present application refers to a machine learning model obtained by training on a single training data sample set that includes a certain number of healthy samples and abnormal samples.
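For orientation, the sketch below computes an AUC value under the assumption that AUC here denotes the usual area under the ROC curve. It uses the equivalent pairwise formulation: the fraction of (abnormal, healthy) pairs in which the abnormal sample receives the higher score, with ties counted as one half. The scores and labels are hypothetical.

```python
from typing import List


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 = abnormal (positive), 0 = healthy (negative).
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores: higher means "more likely abnormal".
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
value = auc(scores, labels)  # 5 of 6 pairs are ordered correctly
```

A value of 1.0 means every abnormal sample is scored above every healthy sample, and 0.5 corresponds to random ranking.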
Fig. 2 shows the model performance obtained by constructing a machine learning model using the general processing described above in the case of imbalanced intestinal flora data samples of eczema patients and normal populations. For example, to perform modeling analysis on intestinal flora data of eczema patients, intestinal flora data samples of 31 eczema patients and intestinal flora data samples of 441 normal people can be collected. Next, about 31 samples may be selected from the 441 intestinal flora data samples of the normal population and combined with the intestinal flora data samples of the 31 eczema patients to construct training data, on which a model is trained to obtain model parameters of a single model. As shown in fig. 2, the overall AUC value of the different single models trained by picking as many healthy samples as there are abnormal samples and discarding the remaining healthy samples was 0.88, indicating a clear correlation between eczema and intestinal flora data. However, since each single model cannot reflect the overall level of the raw intestinal flora data, the AUC values fluctuate widely between different single models, the prediction effect is unstable, and the generalization capability is poor. A single model obtained in this way therefore cannot serve as a reliable diagnosis and treatment model in real clinical scenarios.
Fig. 3 shows the model performance obtained by constructing a machine learning model using the general processing described above in the case of imbalanced intestinal flora data samples of diarrhea patients and normal populations. Similar to the description of fig. 2 above, intestinal flora data samples of 51 diarrhea patients may be collected, and about 51 samples may be selected from the 441 intestinal flora data samples of the normal population, so as to construct training data for model training and obtain model parameters of a single model. As shown in fig. 3, the overall AUC value of the different single models obtained by training was 0.76, indicating that diarrhea also has a clear correlation with intestinal flora data. However, a single model obtained in this way likewise fails to reflect the overall level of the raw intestinal flora data and cannot be applied effectively in real clinical scenarios.
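Using the sample counts quoted above (441 healthy samples against 31 eczema samples, and 441 against 51 diarrhea samples), the following short calculation shows how many healthy samples the general processing leaves unused in a single balanced training set. The helper name `discarded_healthy` is hypothetical, and "about N" healthy samples is taken as exactly N for the arithmetic.

```python
from typing import Tuple


def discarded_healthy(m_healthy: int, n_abnormal: int) -> Tuple[int, float]:
    """Return (count, fraction) of healthy samples left out when only
    n_abnormal healthy samples are kept to balance the two classes."""
    discarded = m_healthy - n_abnormal
    return discarded, discarded / m_healthy


eczema_discarded, eczema_fraction = discarded_healthy(441, 31)
diarrhea_discarded, diarrhea_fraction = discarded_healthy(441, 51)

print(f"eczema:   {eczema_discarded} of 441 healthy samples unused "
      f"({eczema_fraction:.1%})")
print(f"diarrhea: {diarrhea_discarded} of 441 healthy samples unused "
      f"({diarrhea_fraction:.1%})")
```

In both scenarios roughly nine out of ten healthy samples never enter training, which is the M-N loss described for fig. 1.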
In addition to the above-described manner of maintaining the balance of the training samples by discarding a large number of normal healthy samples, the abnormal samples may be supplemented by simulating the abnormal samples, for example, by copying the existing abnormal samples and then adding them to the original data set or constructing artificial abnormal samples based on the existing abnormal samples by a method such as SMOTE. However, the above-mentioned method of simulating the abnormal sample may also cause that the simulated data set cannot reflect the real distribution situation of the data.
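To make the oversampling alternative concrete, here is a simplified SMOTE-style sketch: each synthetic abnormal sample is placed at a random point on the line segment between an existing abnormal sample and one of its nearest abnormal neighbours. It is a reduced illustration rather than a full SMOTE implementation, and the function name, the parameters `k` and `seed` and the data are assumptions.

```python
import random
from typing import List


def smote_like(samples: List[List[float]], n_new: int,
               k: int = 2, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic samples by interpolating between neighbours."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Sort the other samples by squared Euclidean distance to the base.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(s, base)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical abnormal samples with two features each.
abnormal = [[1.0, 3.0], [2.0, 2.0], [1.5, 2.5], [3.0, 1.0]]
augmented = abnormal + smote_like(abnormal, n_new=4)
```

Every synthetic point lies between two observed abnormal samples, so the augmented set stays within the range of the observed data and adds no information about the true distribution, which is the limitation noted above.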
In view of this, the present application aims to make full use of precious intestinal flora biological data in model training, especially the healthy sample data that would otherwise be discarded because of sample imbalance, and thereby to improve the accuracy and objectivity of modeling prediction for intestinal flora data. To this end, the present application proposes combining different subsets of the majority healthy samples with the minority abnormal samples, respectively, and applying the machine learning idea of integrated (ensemble) modeling on this basis, so as to effectively obtain a machine learning model for intestinal flora data.
Example 1
Fig. 4 shows a flow chart of a method of constructing an integrated model for intestinal flora detection according to an embodiment of the present application. Fig. 5 shows a schematic diagram of a method of constructing an integrated model for intestinal flora detection according to an embodiment of the present application. The construction method is described below in detail with reference to fig. 4 and 5.
As shown in fig. 4, in step S401, raw intestinal flora data is obtained, the raw intestinal flora data comprising a healthy intestinal flora data sample and an abnormal intestinal flora data sample. As mentioned above, clinical gut flora data in the real world often show data imbalance. In this embodiment, as shown in fig. 5, the obtained original intestinal flora data samples include M original healthy intestinal flora data samples and N original abnormal intestinal flora data samples, where M and N are positive integers and M is much larger than N.
In step S402, a plurality of different subsets of the healthy intestinal flora data samples are combined with the abnormal intestinal flora data samples, respectively, into a plurality of intestinal flora data sample sets (hereinafter interchangeably referred to as "sample sets"). In this embodiment, in order to make full use of the large number of healthy samples rather than simply discarding them, different subsets of the majority healthy samples may each be combined with the minority abnormal samples, resulting in multiple sample sets that are trained on separately. As shown in fig. 5, the N original abnormal samples can be combined with m_1 of the M original healthy samples into sample set 1, the N original abnormal samples can be combined with m_2 of the M original healthy samples into sample set 2, and so on. It will be appreciated that the m_1, m_2, ..., m_P samples in this embodiment correspond to a plurality of mutually different subsets of the M original healthy intestinal flora data samples (as indicated by the differently patterned backgrounds in fig. 5). Each of m_1, m_2, ..., m_P is much smaller than the number M of original healthy samples and is the same or approximately the same as the number N of original abnormal samples (e.g. within a threshold percentage of N, such as 1%, 5% or 10%), so that the large number of healthy samples is fully utilized while the two kinds of samples within each sample set remain balanced.
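Step S402 can be sketched as follows, under two simplifying assumptions that the text does not require: the healthy subsets are made disjoint by shuffling once and slicing, and every subset has exactly N members. The function name and the use of integer IDs in place of real samples are hypothetical; the counts 441 and 31 are the eczema figures quoted earlier.

```python
import random
from typing import Dict, List


def build_sample_sets(healthy: List[int], abnormal: List[int],
                      seed: int = 0) -> List[Dict[str, List[int]]]:
    """Pair each of P disjoint healthy subsets of size N with all N abnormal samples."""
    rng = random.Random(seed)
    shuffled = healthy[:]          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    n = len(abnormal)
    p = len(shuffled) // n         # number of full-size subsets available
    sample_sets = []
    for i in range(p):
        subset = shuffled[i * n:(i + 1) * n]
        sample_sets.append({"healthy": subset, "abnormal": list(abnormal)})
    return sample_sets


healthy_ids = list(range(441))            # M = 441 healthy samples
abnormal_ids = list(range(1000, 1031))    # N = 31 abnormal samples
sample_sets = build_sample_sets(healthy_ids, abnormal_ids)

used = set().union(*(s["healthy"] for s in sample_sets))
print(len(sample_sets), "sample sets,", len(used), "of 441 healthy samples used")
```

With these counts the sketch yields P = 14 balanced sample sets covering 434 of the 441 healthy samples, compared with 31 healthy samples in a single balanced set.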
In step S403, a plurality of single models are trained based on each of the plurality of intestinal flora data sample sets, respectively, to determine model parameters of each single model. In this embodiment, single model 1 may be trained on sample set 1 to determine its model parameters, single model 2 may be trained on sample set 2 to determine its model parameters, and so on. It will be appreciated that in this embodiment a plurality of machine learning algorithms may be used in training to obtain the parameters of each single model, and the algorithms used may be the same or different between the single models.
In step S404, the plurality of single models are fused into the integrated model for intestinal flora detection. As described above, the present embodiment applies the idea of integrated modeling on the basis of combining different subsets of the majority healthy samples with the minority abnormal samples and training on each combination separately, thereby improving the accuracy and objectivity of modeling predictions for intestinal flora data. As shown in fig. 5, the P single models can be fused to obtain an integrated model for intestinal flora detection.
It will be appreciated that the number of samples, the number of sample sets and the number of corresponding single models in the present embodiment are only illustrative examples, and those skilled in the art can adjust the number according to the actual application requirements. In addition, the integrated model for detecting intestinal flora in the embodiment can be applied to prediction diagnosis and treatment of various diseases, and the application is not limited to this. For example, a clinical or scientific staff may perform a correlation study between a disorder of intestinal flora data and a specific disease in advance to determine that the intestinal flora data can be used as an important index reflecting the disease, and further collect the intestinal flora data of the patient population with the disease on the basis of the study for modeling analysis. Specific details of the construction method of the integrated model for intestinal flora detection will be described below for two application scenarios of eczema and diarrhea, respectively.
Example 2
The construction of a machine learning model and its corresponding model performance in an application scenario for eczema diagnosis will be described further with reference to fig. 6-10 based on the descriptions of fig. 4-5. The construction of a machine learning model of intestinal flora data for eczema patients and healthy people will be described from the following stages.
(1) Data preprocessing
In this example, raw intestinal flora data can be obtained from eczema patients and healthy people in a variety of ways, such as 16S sequencing, metagenomic sequencing, quantitative polymerase chain reaction (qPCR), etc. It can be understood that the raw intestinal flora data in the present application may be detection data for several common intestinal strains taken from the diverse strain detection data obtained by 16S or metagenomic sequencing, or may be detection data for several intestinal strains obtained by a targeted detection method such as qPCR; in either case these strains serve as the plurality of candidate intestinal strains in the raw intestinal flora data. As an illustrative example, the present application may take detection data for the following seven intestinal strains from 31 eczema patients and 441 healthy people as the candidate intestinal strains of the raw intestinal flora data: Bacteroides (BAC), Escherichia (ESC), Ruminococcus (RUM), Bifidobacterium (BIF 1), Clostridium tenecum (FAE), Lactobacillus rhamnosus (BIF 2) and Lactobacillus reuteri (DSM), but the present application is not limited thereto.
It can be understood that the raw detection data may have problems such as missing values, irregular data formats, and a large amount of redundant information. If the detection data of the seven intestinal strains were used directly as training data for modeling, machine learning model training would be inefficient, and the trained model might fail to reach the expected level or fail to converge. In view of this, in this embodiment the raw intestinal flora detection data are preprocessed so that they meet the requirements of efficient model training, the training process runs normally, and the robustness of the training process is improved. Three main aspects of preprocessing the raw intestinal flora detection data are described below: deleting invalid data samples, key feature selection, and data filling.
First, among the collected intestinal flora data samples of 31 eczema patients and 441 healthy people, a data sample in which none of the seven strains has detection data may be regarded as missing data and deleted (possible reasons include the content of all seven strains falling below the detection threshold, insufficient detection sensitivity, or the sample not having been tested at all). For example, in this embodiment, it may be determined whether the plurality of candidate intestinal strains (e.g., the seven strains described above) in the originally collected healthy and eczema intestinal flora data samples reach a detection threshold. Data samples in which none of the candidate intestinal strains reaches the detection threshold may then be deleted from the originally collected data samples. Optionally, to facilitate subsequent calculations, the relative abundance of the flora in each intestinal flora data sample is converted by taking its negative logarithm.
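As a non-limiting sketch of the deletion and negative logarithm steps just described, the following Python fragment drops samples with no detection data and transforms the remaining values. The dict-per-sample layout, the use of `None` for "no detection data", the base-10 logarithm, and the function names are all assumptions made for illustration; the application does not specify them.

```python
import math

# Species abbreviations come from the text; the data layout is an assumption.
SPECIES = ["BAC", "ESC", "RUM", "BIF1", "FAE", "BIF2", "DSM"]

def drop_invalid_samples(samples):
    """Keep only samples in which at least one candidate species was detected.

    A value of None (or a missing key) stands for "no detection data".
    """
    return [s for s in samples if any(s.get(sp) is not None for sp in SPECIES)]

def negative_log(samples):
    """Replace each detected relative abundance x (0 < x <= 1) with -log10(x).

    The logarithm base is assumed to be 10; undetected values stay None.
    """
    return [
        {sp: (None if s.get(sp) is None else -math.log10(s[sp])) for sp in SPECIES}
        for s in samples
    ]

# Hypothetical raw samples (relative abundances), for illustration only.
raw = [
    {"BAC": 0.1, "ESC": 0.001},      # two species detected
    {},                              # nothing detected: treated as missing data
    {"RUM": 0.01, "BIF1": None},     # one species detected
]
valid = drop_invalid_samples(raw)
transformed = negative_log(valid)
```

Here the second sample is removed because none of the seven species carries a value, while the other two are kept and transformed.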
Secondly, after the invalid data samples are deleted, the contents of the seven intestinal strains in the detection data could all be used as input features for model training. However, the data may contain redundant or useless features, which makes model training difficult, incurs large storage and computation costs, and makes overfitting likely. A particular difficulty in modeling intestinal flora data is that the mechanism linking the contents of the various intestinal strains to whether a person has eczema may not be clear, so it cannot be determined theoretically whether the contents of all seven strains contribute to the modeling analysis, nor can it be determined in advance which strains are relevant features and which are redundant or useless features. In view of this, the present application proposes to select, based on feature importance calculation, the key intestinal strains that are useful in predicting the target variable from among the plurality of candidate intestinal strains, thereby effectively reducing the dimension of the training features and enhancing the training efficiency and robustness of the model. It will be appreciated that feature importance as referred to in this application is a means of evaluating how useful each of the candidate intestinal strains detected from a human is as an input feature in the modeling analysis, so as to determine which candidate intestinal strains are most relevant and which are least relevant to the classification prediction of eczema, and hence which strain content information to retain and which to delete. An exemplary process for selecting key intestinal strains from a plurality of candidate intestinal strains according to an embodiment of the present application will be described below in connection with fig. 6.
Fig. 6 shows a schematic diagram of the selection of key intestinal strains during the construction of a machine learning model of intestinal flora data for eczema patients and healthy people according to an embodiment of the present application. In embodiments of the present application, the feature importance of each of a plurality of candidate intestinal strains (e.g., the seven intestinal strains described above) in the healthy intestinal flora data samples and the eczema intestinal flora data samples may be calculated. It will be appreciated that the feature importance of the candidate intestinal strains may be calculated according to one or more of the following algorithms: a decision tree algorithm, a Pearson correlation coefficient algorithm, mutual information and maximal information coefficient algorithms, a recursive feature elimination algorithm, and the like. Then, a plurality of key intestinal strains may be selected from the plurality of candidate intestinal strains according to the feature importance of each candidate intestinal strain. Finally, the content information corresponding to the selected key intestinal strains in the healthy intestinal flora data samples and the eczema intestinal flora data samples may be retained. As shown in fig. 6, when the feature importance is calculated using a decision tree algorithm, key strains with a feature importance greater than 0.1 may be selected. The strains that play a key role in the classification prediction of eczema are then the following five: "Bacteroides (BAC): 0.1640", "Escherichia (ESC): 0.2042", "Ruminococcus (RUM): 0.2017", "Bifidobacterium (BIF 1): 0.2064" and "Clostridium tenecum (FAE): 0.1215". The content information of these five strains can be retained in the two types of data samples as input features.
In addition, the two strains with the least influence in feature selection can be deleted: Lactobacillus rhamnosus (BIF 2) and Lactobacillus reuteri (DSM). Both are probiotics detected in low quantities, so their removal is consistent with practical expectations.
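The threshold-based selection described above can be sketched as follows. The five importance values are those quoted for fig. 6; the application gives no values for the two removed strains, so the two entries marked as placeholders are hypothetical, as are the function names.

```python
# Decision-tree feature importances for the eczema scenario, as quoted in the text.
importance = {
    "BAC": 0.1640,
    "ESC": 0.2042,
    "RUM": 0.2017,
    "BIF1": 0.2064,
    "FAE": 0.1215,
    # No values are given for the two removed strains; these are
    # hypothetical placeholders chosen to lie below the 0.1 threshold.
    "BIF2": 0.05,
    "DSM": 0.05,
}

def select_key_species(importance, threshold=0.1):
    """Return the species whose importance exceeds the threshold, highest first."""
    kept = [sp for sp, value in importance.items() if value > threshold]
    return sorted(kept, key=lambda sp: importance[sp], reverse=True)

def keep_features(sample, key_species):
    """Project one sample (dict of species -> value) onto the key species only."""
    return {sp: sample.get(sp) for sp in key_species}

key_species = select_key_species(importance)
reduced = keep_features({"BAC": 7.2, "DSM": 6.1}, key_species)
```

With the 0.1 threshold, the five quoted strains are retained and the content information of the other two is discarded from every sample.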
Finally, after negative logarithm processing and feature importance screening of the intestinal flora data, the contents of the five key intestinal strains should theoretically lie in the numerical range of 6 to 10, where a higher relative content corresponds to a higher value. Considering that the five strains are all common bacteria in the human intestinal tract, if the intestinal flora detection data lack content information for one or more intestinal strains, or the content does not reach the detection baseline of 6, the missing values and the values smaller than 6 can be uniformly filled with the value 6, thereby ensuring the integrity of the training data.
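A minimal sketch of the filling rule follows, assuming that each sample is a dict keyed by strain abbreviation and that a missing value is represented by `None` or an absent key. The sample values are hypothetical.

```python
BASELINE = 6.0  # detection baseline stated in the text

def fill_to_baseline(sample, key_species, baseline=BASELINE):
    """Fill missing values and values below the baseline with the baseline."""
    filled = {}
    for sp in key_species:
        value = sample.get(sp)
        if value is None or value < baseline:
            value = baseline
        filled[sp] = value
    return filled

key_species = ["BAC", "ESC", "RUM", "BIF1", "FAE"]
sample = {"BAC": 8.4, "ESC": 5.2, "RUM": None, "BIF1": 9.1}  # hypothetical values
filled = fill_to_baseline(sample, key_species)
```

After filling, every sample carries a value of at least 6 for each of the five key strains, so no training row is incomplete.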
(2) Sample set composition
In order to balance the intestinal flora samples of healthy people and eczema patients within a sample set while making full use of the large number of healthy samples instead of discarding them, different subsets of the majority healthy samples may each be combined with the minority eczema samples, resulting in a plurality of sample sets. It will be appreciated that the different subsets of the majority healthy samples may be obtained in a variety of ways in this embodiment. As an illustrative example, data sample subsets that do not overlap each other at all may be selected from the healthy samples by sampling without replacement, and each subset is then combined with the same batch of eczema samples to construct the corresponding plurality of sample sets. Preferably, each different subset is selected at random from all healthy intestinal flora data samples (or, if preprocessing has been performed, from all preprocessed healthy intestinal flora data samples). In this way, by independently and randomly selecting the data sample subset a number of times, human intervention in the selection of the different healthy sample subsets can be avoided, so that each constructed sample set reflects real-world data conditions. Optionally, considering that modeling analysis achieves a better training effect when the data are normally distributed, and that the intestinal flora detection data obtained through the above processing do not necessarily follow a normal distribution, the data of each sample set may be transformed in this embodiment to approximate a normal distribution so that they are distributed in a concentrated manner, thereby meeting the model training requirement.
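The preferred random composition of sample sets might be sketched as below. The subset size (equal to the number of eczema samples, to balance the two classes), the fixed seed, and the string stand-ins for samples are assumptions; the application does not state how large each healthy subset is.

```python
import random

def build_sample_sets(healthy, disease, n_sets, subset_size=None, seed=0):
    """Pair the full disease group with n_sets independent random healthy subsets.

    Each subset is drawn without replacement within itself; different subsets
    are drawn independently of each other, so they may overlap.
    """
    rng = random.Random(seed)
    # Assumption: by default each healthy subset matches the disease group size.
    size = len(disease) if subset_size is None else subset_size
    sample_sets = []
    for _ in range(n_sets):
        subset = rng.sample(healthy, size)
        # Label 0 = healthy, 1 = disease.
        labelled = [(x, 0) for x in subset] + [(x, 1) for x in disease]
        rng.shuffle(labelled)
        sample_sets.append(labelled)
    return sample_sets

healthy = [f"H{i}" for i in range(441)]   # stand-ins for healthy samples
eczema = [f"E{i}" for i in range(31)]     # stand-ins for eczema samples
sample_sets = build_sample_sets(healthy, eczema, n_sets=15)
```

Every sample set reuses the same eczema samples, while the healthy half changes from set to set, so that over many sets most healthy samples contribute to some single model.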
(3) Single model training
After constructing a plurality of sample sets, each sample set may be separately input into a machine learning algorithm for training to obtain model parameters for the sample set. It will be appreciated that a single model training may be performed in a variety of ways in this embodiment.
As an illustrative example, in the embodiment of the present application, each sample set may be input into the same type of machine learning algorithm for training, so that the single models obtained by training (for example, single model 1 to single model P shown in fig. 5) are the same type of machine learning model but have different model parameters. For example, the machine learning algorithm may be one of the following: a random forest algorithm, a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm, a naive Bayes algorithm, a K nearest neighbor algorithm, a K-means algorithm, a linear discriminant analysis algorithm, a linear regression algorithm, etc.
As another illustrative example, in an embodiment of the present application, each sample set may be input into a different type of machine learning algorithm for training, and each sample set has a predetermined one-to-one correspondence with the corresponding machine learning algorithm. For example, a first sample set is associated with a first machine learning algorithm (e.g., a random forest algorithm), a second sample set is associated with a second machine learning algorithm (e.g., a support vector machine algorithm), and so on, such that each single model is a different type of machine learning model.
As a preferred embodiment of the present application, any one of a plurality of sample sets is input into a plurality of different types of machine learning algorithms for training, so that each single model may be the same type of machine learning model or different types of machine learning model, depending on which machine learning algorithm has the best model performance for the current sample set. An exemplary process of performing a single model training for each sample set according to an embodiment of the present application will be described below in connection with fig. 7.
Fig. 7 shows a schematic diagram of a single model training for each sample set of intestinal flora data for eczema patients and healthy people according to an embodiment of the present application. For each intestinal flora data sample set, a plurality of machine learning algorithms can be utilized to respectively learn the intestinal flora data sample set to obtain a plurality of machine learning models, and then model parameters of an optimal machine learning model in the plurality of machine learning models are used as model parameters of a single model determined by training based on the intestinal flora data sample set. As shown in fig. 7, for any sample set i among sample sets 1~P, the sample set i may be learned with Q different machine learning algorithms, respectively, to obtain Q corresponding machine learning models, and then an optimal machine learning model (in this example, a machine learning model corresponding to machine learning algorithm 2) among the Q machine learning models may be used as a single model trained for sample set i. In this embodiment, the different machine learning algorithms may include, but are not limited to, a random forest algorithm, a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm, a naive bayes algorithm, a K nearest neighbor algorithm, a K-means algorithm, a linear discriminant analysis algorithm, a linear regression algorithm, and the like.
In the embodiment of the present application, single model training and determination of the optimal machine learning model can be performed in various ways. For example, each intestinal flora data sample set may be divided into two independent parts, a training set and a test set, for example in a ratio of 7:3. The training set may then be used for model training with the Q different machine learning algorithms, while the test set is used to select an optimal machine learning model from the Q machine learning models so trained. Specifically, for each sample set, each of the machine learning algorithms is tuned on the training set (for example, using a grid search that traverses all candidate parameters), and the model with the highest five-fold cross-validation AUC value found during the tuning of each algorithm is taken as the machine learning model obtained under that algorithm. Then, the predictive performance of the resulting machine learning models is tested separately on the test set to determine the optimal model. In this way, the test set is drawn from the same sample set but does not participate in the training process, so it can serve as validation data for objectively evaluating the predictive power and generalization performance of the various machine learning algorithms on data samples outside the training set.
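The split-and-select procedure above can be illustrated with a simplified sketch. Several simplifications are assumptions rather than features of the application: the 7:3 split is performed within each class so that the test set always contains both classes, test-set accuracy stands in for the unspecified test metric, the grid search with five-fold cross-validation is omitted, and the two "algorithms" are toy threshold rules instead of the algorithms listed in the text.

```python
import random

def split_7_3(sample_set, seed=0):
    """Split (features, label) pairs 70% / 30%, separately within each class."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [pair for pair in sample_set if pair[1] == label]
        rng.shuffle(group)
        cut = round(0.7 * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

def accuracy(model, data):
    """Fraction of (features, label) pairs that the model classifies correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def select_best_model(algorithms, train, test):
    """Fit every algorithm on the training set; keep the best one on the test set."""
    best = None
    for name, fit in algorithms.items():
        model = fit(train)
        score = accuracy(model, test)
        if best is None or score > best[2]:   # ties keep the earlier algorithm
            best = (name, model, score)
    return best

def threshold_on(k):
    """Hypothetical stand-in algorithm: threshold feature k midway between class means."""
    def fit(train):
        pos = [x[k] for x, y in train if y == 1]
        neg = [x[k] for x, y in train if y == 0]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        mid = (mean_pos + mean_neg) / 2
        sign = 1 if mean_pos >= mean_neg else -1
        return lambda x: int(sign * (x[k] - mid) > 0)
    return fit

# Toy sample set: feature 0 separates the classes, feature 1 does not.
sample_set = [((i + 100 * (i >= 10), i % 2), int(i >= 10)) for i in range(20)]
train, test = split_7_3(sample_set)
algorithms = {"feature_0": threshold_on(0), "feature_1": threshold_on(1)}
best_name, best_model, best_score = select_best_model(algorithms, train, test)
```

The selected model and its parameters then serve as the single model for that sample set; repeating this for every sample set yields P single models, possibly of different types.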
(4) Integrated model fusion
As described above in connection with fig. 2 and 3, a single model can use only a randomly selected portion of the healthy sample data, which results in large fluctuations of the overall AUC value and unstable prediction results. In view of this, in order to improve the stability and accuracy of the model, the present embodiment fuses a plurality of single models into an integrated model. By repeating the process of sample set composition and single model training multiple times (e.g., P=15 times), different subsets of the healthy samples are randomly selected for modeling analysis many times, so that the majority of the healthy samples are applied to model construction and the trained model can adequately reflect the true distribution of the data and the overall information of the data set. In addition, in each single model training, the model with the best performance is selected from among a plurality of machine learning algorithms (e.g., Q=5), so that 15 single optimal models and their model parameters are obtained and the respective advantages of different machine learning algorithms are combined.
Next, by integrating the plurality of single optimal models into a new model (i.e., fusing them into an integrated model), the prediction results of the single models can be fused into one final prediction result. This avoids the prediction instability and performance fluctuation that may occur with a single model, and yields a machine learning model with good generalization performance that can be effectively applied to clinical detection. In the embodiment of the present application, the final prediction result of the integrated model can be determined by a majority voting method from the prediction results given by the single models for an intestinal flora data sample. For example, newly input intestinal flora sample data to be detected are input into each single model in the integrated model to obtain the classification prediction results of all single models for eczema (i.e., whether the sample corresponds to an eczema patient), and the prediction result given by more than half of the single models is taken as the final prediction result and output. It will be appreciated that in embodiments of the present application, other suitable fusion strategies may be employed to integrate the plurality of single models.
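Majority voting over the single models can be sketched as follows. The single models here are hypothetical threshold rules on one number; in practice each would be a trained single optimal model. With an odd P, such as 15, no tie can occur; treating a tie as a negative result for even P is an assumption of this sketch.

```python
def majority_vote(predictions):
    """Return 1 if more than half of the 0/1 predictions are 1, otherwise 0."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0

def ensemble_predict(single_models, sample):
    """Apply every single model to one sample and fuse the results by voting."""
    return majority_vote([model(sample) for model in single_models])

# Fifteen hypothetical single models, each a different threshold rule.
single_models = [lambda x, t=t: int(x > t) for t in range(15)]

result_high = ensemble_predict(single_models, 8)   # 8 of 15 models vote 1
result_low = ensemble_predict(single_models, 7)    # 7 of 15 models vote 1
```

In the first call eight of the fifteen models vote positive, so the fused result is positive; in the second call only seven do, so the fused result is negative.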
(5) Integrated model performance assessment
In the last stage of integrated model construction, the performance of the integrated model needs to be evaluated and verified so as to determine whether the performance of the integrated model meets the expected index, and the integrated model can be used as a reliable prediction model for diagnosing eczema based on intestinal flora data in clinic.
In the embodiment of the application, the intestinal flora data samples of all eczema patients and the intestinal flora data samples of healthy people are input into a trained integrated model to be predicted, and the prediction accuracy and the fluctuation of the model are verified. The performance of the integrated model constructed for intestinal flora data of eczema patients and healthy people will be described below in connection with fig. 8-10.
First, an index for evaluating the performance of the integrated model will be briefly described. The following concept will be described below taking an eczema sample as a positive sample and a healthy sample as a negative sample as an example.
(A) True positive number (TP): the number of positive samples that are predicted by the model to be positive.
(B) True negative number (TN): the number of negative samples that are predicted by the model to be negative.
(C) False positive number (FP): the number of negative samples that are predicted by the model to be positive.
(D) False negative number (FN): the number of positive samples that are predicted by the model to be negative.
(E) Sensitivity (also called true positive rate, TPR) = TP / (TP + FN) = true positive number / actual positive number. It measures the ability of the integrated model to recognize positive samples; the higher the sensitivity, the lower the missed diagnosis rate.
(F) Specificity (also called true negative rate, TNR) = TN / (TN + FP) = true negative number / actual negative number. It measures the ability of the integrated model to recognize negative samples. In addition, 1 - specificity = false positive rate (FPR) = FP / (FP + TN); the higher the specificity, the lower the false positive probability.
(G) Accuracy = (TP + TN) / (TP + TN + FP + FN), which measures the proportion of correctly predicted samples among all data samples.
(H) Receiver operating characteristic curve (ROC curve): a composite indicator reflecting sensitivity and specificity as continuous variables, with the false positive rate (FPR) on the abscissa and the true positive rate (TPR) on the ordinate.
(I) Area under the curve (AUC): defined as the area under the ROC curve and applicable to binary classification. It represents the probability that, given one randomly chosen positive sample and one randomly chosen negative sample, the model ranks the positive sample ahead of the negative sample. Thus, the greater the AUC value, the better the classification performance of the model.
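The indices (A) to (I) can be computed as in the following sketch, which uses the pairwise-ranking interpretation of AUC given in (I) and counts ties between a positive and a negative score as one half. The labels, predictions, and scores are hypothetical and serve only to exercise the functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with 1 = positive (eczema), 0 = negative (healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels, hard predictions, and scores for seven samples.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.1, 0.2, 0.4, 0.7]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)          # 2 / 3
spec = specificity(tn, fp)          # 3 / 4
acc = accuracy(tp, tn, fp, fn)      # 5 / 7
area = auc(y_true, scores)          # 10 of 12 pairs ranked correctly
```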
Fig. 8 shows a schematic representation of the performance of an integrated model trained on intestinal flora data for eczema patients and healthy people according to an embodiment of the present application. As shown in fig. 8, by repeating the process of sample set composition and single model training a plurality of times (for example, P=15 times), integrated model 1 can be obtained; by repeating the same process of sample set composition and single model training again, integrated model 2 can be obtained; and so on. It can be seen that for each integrated model, the overall AUC value is significantly improved relative to the single model shown in fig. 2, and the AUC value fluctuation between different integrated models is also significantly reduced (i.e., the variance is significantly reduced) relative to the single model, so that the stability and accuracy of the final integrated model are significantly enhanced.
Fig. 9 shows a schematic diagram of performance comparisons of a single model and an integrated model trained on intestinal flora data for eczema patients and healthy people according to an embodiment of the present application. As can be seen more intuitively in fig. 9, the overall AUC values of the single model fluctuate widely, with a variance of about 0.004358481; while the overall AUC value of the integrated model fluctuates less, its variance is only about 0.000876616.
Fig. 10 shows another schematic of performance of an integrated model trained on intestinal flora data for eczema patients and healthy people, showing ROC curves and corresponding AUC values of the integrated model, according to an embodiment of the present application. As can be seen from fig. 10, the overall AUC value has reached 0.984. Through verifying the integrated model, the sensitivity of the integrated model reaches 0.968, the specificity reaches 1.000, the accuracy reaches 0.986, and the integrated model has excellent model effect and can be reliably applied to clinical diagnosis.
According to the embodiment of the application, the health sample subset for modeling analysis is randomly selected for a plurality of times, so that most health samples can be fully applied to model construction, and the problem that the model construction which only uses a small amount of health samples to maintain sample balance cannot reflect the whole information of the data set is solved. In addition, by using the integrated model construction method, the prediction accuracy of machine learning modeling analysis can be obviously improved, the variance of multiple modeling results can be reduced, the random fluctuation of the data results can be reduced, the overall situation of the data set can be intuitively reflected, and the accuracy, objectivity and clinical applicability of modeling prediction can be improved.
Example 3
Similarly to the eczema diagnosis application scenario described with reference to fig. 6-10, the construction of a machine learning model and its corresponding model performance in an application scenario for diarrhea diagnosis will be further described with reference to fig. 11-14 on the basis of the descriptions of fig. 4-5. The construction of a machine learning model of intestinal flora data for diarrhea patients and healthy people will be described below in the same stages. It should be noted that most of the model construction process of Example 3 is the same as that of Example 2; to avoid repetition, the present example is described only briefly and details that are the same are omitted.
(1) Data preprocessing
In this example, the raw intestinal flora data can likewise be obtained from diarrhea patients and healthy people by means of 16S sequencing, metagenomic sequencing, qPCR, etc. For example, detection data for the following seven intestinal strains from 51 diarrhea patients and 441 healthy people can be used as the candidate intestinal strains of the raw intestinal flora data: Bacteroides (BAC), Escherichia (ESC), Ruminococcus (RUM), Bifidobacterium (BIF 1), Clostridium tenecum (FAE), Lactobacillus rhamnosus (BIF 2) and Lactobacillus reuteri (DSM).
Similarly, in this embodiment, the raw intestinal flora detection data may be preprocessed. Specifically, preprocessing may cover three main aspects, namely deletion of invalid data samples, key feature selection and data filling, so that the data meet the requirements of efficient model training. Specific details are not repeated here.
As an illustrative example, for key feature selection in the diarrhea diagnosis and treatment application scenario, the key intestinal strains useful in predicting the target variable may also be selected from the plurality of candidate intestinal strains based on feature importance calculation. Fig. 11 shows a schematic diagram of the selection of key intestinal strains during the construction of a machine learning model of intestinal flora data for diarrhea patients and healthy people according to an embodiment of the present application. As shown in fig. 11, when the feature importance is calculated using the decision tree algorithm, five key strains with a feature importance greater than 0.1 may be selected, namely: "Bacteroides (BAC): 0.1436", "Escherichia (ESC): 0.2570", "Ruminococcus (RUM): 0.1728", "Bifidobacterium (BIF 1): 0.1911" and "Clostridium tenecum (FAE): 0.1245". Similarly, the two strains with the least influence in feature selection may be deleted: Lactobacillus rhamnosus (BIF 2) and Lactobacillus reuteri (DSM).
(2) Sample set composition
For the sample set composition, the same processing as in the eczema diagnosis application scenario of Example 2 can be used: different subsets of the majority healthy samples are each combined with the minority diarrhea samples to obtain a plurality of sample sets. Specific details are not repeated here.
(3) Single model training
For single model training, the same processing as in the eczema diagnosis application scenario of Example 2 can be adopted: each sample set is separately input into a machine learning algorithm for training to obtain the model parameters for that sample set. Specific details are not repeated here.
(4) Integrated model fusion
For integrated model fusion, the same processing as in the eczema diagnosis application scenario of Example 2 can be adopted, i.e., fusion by the majority voting method. Specific details are not repeated here.
(5) Integrated model performance assessment
In the embodiment of the application, the intestinal flora data samples of all diarrhea patients and the intestinal flora data samples of healthy people are input into a trained integrated model for prediction, and the prediction accuracy and the fluctuation of the model are verified. The performance of the integrated model constructed for intestinal flora data of diarrhea patients and healthy people will be described below in connection with fig. 12-14.
Fig. 12 shows a schematic diagram of the performance of an integrated model trained on intestinal flora data for diarrhea patients and healthy people according to an embodiment of the present application. As shown in fig. 12, for each integrated model, the overall AUC value is significantly improved relative to the single model shown in fig. 3, and the AUC value fluctuation between different integrated models is also significantly reduced (i.e., variance is significantly reduced) relative to the single model, so that the stability and accuracy of the final integrated model are significantly enhanced.
Fig. 13 shows a schematic diagram of performance comparisons of a single model and an integrated model trained on intestinal flora data for diarrhea patients and healthy people according to an embodiment of the present application. As can be seen more intuitively in fig. 13, the overall AUC values of the single model fluctuate widely, with a variance of about 0.013396082; while the overall AUC value of the integrated model fluctuates less, its variance is only about 0.005116054.
Fig. 14 shows another schematic representation of performance of an integrated model trained on intestinal flora data for diarrhea patients and healthy people, showing ROC curves and corresponding AUC values of the integrated model, according to an embodiment of the present application. As can be seen from fig. 14, the overall AUC value has reached 0.857. Through verifying the integrated model, the sensitivity of the integrated model reaches 0.843, the specificity reaches 0.870, the accuracy reaches 0.857, and the integrated model has excellent model effect and can be reliably applied to clinical diagnosis.
Example 4
Fig. 15 shows a schematic diagram of a process for intestinal flora detection using an integrated model according to an embodiment of the present application. Fig. 16 shows a block diagram of an apparatus for intestinal flora detection using an integrated model according to an embodiment of the present application. The detection apparatus and the detection method are described below with reference to fig. 15 and 16 in particular. It should be noted that the integrated model may be obtained by model training in the manner of the above-described embodiments 1 to 3, so that it may be used for diagnosing various diseases such as eczema or diarrhea based on the intestinal flora detection data, thereby providing a novel diagnosis and treatment manner.
As shown in fig. 15, an intestinal flora data sample to be detected may be received and input into the integrated model. It will be appreciated that the intestinal flora data sample to be detected may be subjected to preprocessing similar to that described above, thereby facilitating its detection and analysis. Specifically, taking eczema diagnosis and treatment as an example, the intestinal flora data sample to be detected can be input into each single model in the integrated model to obtain the classification prediction results of all single models for eczema, for example, prediction results 1 to P, each of which indicates whether the data sample to be detected is classified as corresponding to an eczema patient. Then, using a majority voting method, the prediction result given by more than half of the single models is taken as the final prediction result, and the detection result is output.
As shown in fig. 16, an apparatus 1600 for intestinal flora detection using an integrated model may include an input device U1601, a memory U1602, a processor U1603, and an output device U1604. These components may respectively perform the steps/functions of the process of intestinal flora detection using the integrated model described above in connection with fig. 15. To avoid repetition, the apparatus is therefore described only briefly below, and details already given above are omitted. Examples of the above-described apparatus include a computer, a server, a workstation, and the like.
The input device U1601 may be used to receive intestinal flora data samples to be detected. For example, the input device U1601 may be any input device capable of receiving the raw or preprocessed data sample to be detected, such as a wired or wireless data interface, a mouse, a keyboard, and the like; the present application is not limited in this respect.
The memory U1602 may be used to store model parameters of an integrated model for intestinal flora detection. For example, memory U1602 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also include other removable/non-removable, volatile/nonvolatile computer system memory, such as hard disk drives, floppy disks, CD-ROMs, DVD-ROMs, or other optical storage media.
Processor U1603 may be configured to run the integrated model for intestinal flora detection to detect whether an abnormality exists in the intestinal flora data sample to be detected. For example, processor U1603 may be any device with processing capabilities that is capable of implementing the functionality of embodiments of the present application, e.g., it may be a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a Field Programmable Gate Array (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. The processor U1603 may load the integrated model stored in the memory U1602 and run it to analyze the sample to be detected delivered by the input device U1601.
The output device U1604 may be configured to provide a detection result of the intestinal flora data sample to be detected. For example, the output device U1604 may include a display, a printer, an image output system, a voice output system, or the like for notifying a detection person of the detection result in a visual or audible manner or the like.
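The division of labor among the four components may be illustrated, purely as a sketch, by the following program. It assumes that the model parameters are stored as a JSON file and that each single model is a one-feature threshold rule; the file name, the parameter layout, the strain names, and the function names are hypothetical and are not taken from the present application.

```python
import json
import os
import tempfile


def save_params(path, params):
    """Persist the model parameters (the role played by memory U1602)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(params, fh)


def load_params(path):
    """Load the stored model parameters before running the model."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def detect(params, sample):
    """Run every single model and majority-vote (the role of processor U1603).

    Assumption: each single model flags a sample as abnormal when one
    strain's content exceeds a stored threshold.
    """
    votes = [int(sample[p["feature"]] > p["threshold"]) for p in params]
    return "abnormal" if sum(votes) * 2 > len(votes) else "normal"


# Hypothetical parameters of three single models.
params = [
    {"feature": "strain_a", "threshold": 0.30},
    {"feature": "strain_b", "threshold": 0.10},
    {"feature": "strain_c", "threshold": 0.05},
]
path = os.path.join(tempfile.mkdtemp(), "integrated_model.json")
save_params(path, params)

# Input (U1601) -> load and run (U1602, U1603) -> output (U1604).
sample = {"strain_a": 0.42, "strain_b": 0.03, "strain_c": 0.08}
print(detect(load_params(path), sample))
```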
Example 5
Fig. 17 shows a block diagram of an apparatus for constructing an integrated model for intestinal flora detection according to an embodiment of the present application. It should be noted that the apparatus may perform model training in the manner described above in examples 1-3 to obtain an integrated model for intestinal flora detection.
As shown in fig. 17, an apparatus 1700 for constructing an integrated model for intestinal flora detection may include a processor U1701 and a memory U1702. Similar to that described above in connection with fig. 16, the processor U1701 may be any device with processing capabilities that is capable of implementing the functionality of the various embodiments of the present application. Memory U1702 may include a computer system readable medium in the form of volatile memory. In this embodiment, computer program instructions are stored in the memory U1702, and the processor U1701 may execute the instructions stored in the memory U1702. The computer program instructions, when executed by the processor, cause the processor to perform the method of constructing an integrated model for intestinal flora detection of embodiments of the present application. The construction method is substantially the same as described above in connection with examples 1-3 and is therefore not repeated here. Examples of the above-described apparatus include a computer, a server, a workstation, and the like.
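As a non-limiting sketch of one step that such program instructions might carry out, the following code combines randomly selected subsets of the healthy samples with all abnormal samples to form several sample sets. The function name `build_sample_sets`, the default of sizing each subset to the number of abnormal samples, and the toy data are assumptions for illustration only.

```python
import random


def build_sample_sets(healthy, abnormal, n_sets, subset_size=None, seed=0):
    """Pair random subsets of the healthy samples with all abnormal samples.

    Assumption: when subset_size is not given, each healthy subset is sized
    to match the number of abnormal samples so that every set is balanced.
    """
    rng = random.Random(seed)
    size = subset_size if subset_size is not None else min(len(abnormal), len(healthy))
    if size > len(healthy):
        raise ValueError("subset_size exceeds the number of healthy samples")
    sample_sets = []
    for _ in range(n_sets):
        subset = rng.sample(healthy, size)  # no repeats within one subset
        labelled = [(x, 0) for x in subset] + [(x, 1) for x in abnormal]
        rng.shuffle(labelled)
        sample_sets.append(labelled)
    return sample_sets


# Toy data: 20 healthy and 4 abnormal one-feature samples.
healthy = [[i / 100] for i in range(20)]
abnormal = [[0.90], [0.80], [0.95], [0.85]]

sets = build_sample_sets(healthy, abnormal, n_sets=5)
print(len(sets), len(sets[0]))  # 5 sets of 8 labelled samples each
```

Random selection does not by itself guarantee that the subsets differ from one another; with few healthy samples, a duplicate check may be added.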
Example 6
The techniques of construction and use of machine learning models for intestinal flora data according to the present application may also be implemented by providing a computer program product comprising program code for implementing the method or device, or by any storage medium storing such a computer program product.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such. In addition, features from one embodiment may be combined with features of another embodiment or embodiments to yield still further embodiments.
The block diagrams of the devices, apparatuses, and systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, the devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to" and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
In addition, as used herein, the use of "or" in a recitation of items beginning with "at least one" indicates a disjunctive recitation, such that a recitation of "at least one of A, B, or C", for example, means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the term "exemplary" does not mean that the described example is preferred or better than other examples.
It is also noted that in the apparatus and methods of the present application, components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered equivalent schemes of the present application.
It will be appreciated by those of ordinary skill in the art that all or any portion of the methods and apparatus of the present application may be implemented in hardware, firmware, software, or combinations thereof in any computing device (including processors, storage media, etc.) or network of computing devices. The hardware may be implemented with general purpose processors, Digital Signal Processors (DSPs), ASICs, Field Programmable Gate Arrays (FPGAs) or other Programmable Logic Devices (PLDs), discrete gate or transistor logic, discrete hardware components, or any combinations thereof, designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. The software may reside in any form of computer readable tangible storage medium. By way of example, and not limitation, such computer-readable tangible storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. As used herein, discs include Compact Discs (CDs), laser discs, optical discs, Digital Versatile Discs (DVDs), floppy discs, and Blu-ray discs.
Various changes, substitutions, and alterations are possible to the techniques described herein without departing from the teachings of the techniques defined by the appended claims. Furthermore, the scope of the claims hereof is not to be limited to the exact aspects of the process, machine, manufacture, composition of matter, means, methods and acts described above. The processes, machines, manufacture, compositions of matter, means, methods, or acts, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or acts.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. This description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (10)

1. A method for constructing an integrated model for intestinal flora detection, comprising:
obtaining raw intestinal flora data, wherein the raw intestinal flora data comprise a healthy intestinal flora data sample and an abnormal intestinal flora data sample;
combining a plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples, respectively, into a plurality of intestinal flora data sample sets;
training a plurality of single models based on each of the plurality of intestinal flora data sample sets to determine model parameters for each single model; and
fusing the plurality of single models into the integrated model for intestinal flora detection.
2. The method of constructing according to claim 1, wherein combining the plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples into a plurality of intestinal flora data sample sets, respectively, comprises:
randomly selecting the plurality of different subsets from the healthy intestinal flora data samples, and combining each of the subsets with the abnormal intestinal flora data samples to form the plurality of intestinal flora data sample sets.
3. The method of constructing according to claim 1, wherein prior to combining the plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples into a plurality of intestinal flora data sample sets, respectively, the method further comprises:
determining whether a plurality of candidate intestinal strains in the healthy intestinal flora data sample and the abnormal intestinal flora data sample reach a detection threshold; and
deleting, from the healthy intestinal flora data samples and the abnormal intestinal flora data samples, data samples in which the candidate intestinal strains do not reach the detection threshold.
4. The method of constructing according to claim 1, wherein prior to combining the plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples into a plurality of intestinal flora data sample sets, respectively, the method further comprises:
calculating the feature importance of each candidate intestinal strain of a plurality of candidate intestinal strains in the healthy intestinal flora data sample and the abnormal intestinal flora data sample;
selecting a plurality of key intestinal strains from the plurality of candidate intestinal strains according to the feature importance of each candidate intestinal strain of the plurality of candidate intestinal strains; and
retaining, in the healthy intestinal flora data sample and the abnormal intestinal flora data sample, the content information corresponding to the plurality of key intestinal strains.
5. The method of claim 4, wherein the feature importance of each of the plurality of candidate intestinal strains is calculated using one or more of: a decision tree algorithm, a Pearson correlation coefficient algorithm, a mutual information and maximal information coefficient algorithm, and a recursive feature elimination algorithm.
6. The method of constructing according to claim 1, wherein training the plurality of single models based on each of the plurality of intestinal flora data sample sets to determine model parameters of each single model comprises:
for each intestinal flora data sample set:
respectively learning the intestinal flora data sample set by utilizing a plurality of machine learning algorithms to obtain a plurality of machine learning models; and
taking model parameters of an optimal machine learning model of the plurality of machine learning models as model parameters of the single model determined by training based on the intestinal flora data sample set.
7. The method of constructing according to claim 6, wherein each of the plurality of intestinal flora data sample sets comprises a training set and a testing set, and
wherein learning the intestinal flora data sample set by using a plurality of machine learning algorithms to obtain a plurality of machine learning models comprises: learning the training set of the intestinal flora data sample set by using each of the plurality of machine learning algorithms to obtain the plurality of machine learning models, and
wherein taking model parameters of an optimal machine learning model of the plurality of machine learning models as model parameters of a single model determined by training based on the intestinal flora data sample set comprises: testing the predictive performance of each of the plurality of machine learning models using the test set of the intestinal flora data sample set, and determining the model parameters of the optimal machine learning model having the best predictive performance.
8. The method of constructing according to claim 6, wherein the plurality of machine learning algorithms comprises two or more of:
a random forest algorithm, a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm, a naive Bayes algorithm, a K-nearest neighbor algorithm, a K-means algorithm, a linear discriminant analysis algorithm, and a linear regression algorithm.
9. The method of constructing according to claim 1, wherein fusing the plurality of single models into the integrated model for intestinal flora detection comprises:
determining a prediction result of the integrated model by majority voting over the plurality of prediction results that the plurality of single models produce for an intestinal flora data sample.
10. An apparatus for intestinal flora detection using an integrated model, comprising:
an input device for receiving an intestinal flora data sample to be detected;
a memory for storing model parameters of an integrated model for intestinal flora detection,
wherein the integrated model is constructed by:
obtaining raw intestinal flora data, wherein the raw intestinal flora data comprise a healthy intestinal flora data sample and an abnormal intestinal flora data sample;
combining a plurality of different subsets of the healthy intestinal flora data samples with the abnormal intestinal flora data samples, respectively, into a plurality of intestinal flora data sample sets;
training a plurality of single models based on each of the plurality of intestinal flora data sample sets, respectively, to determine model parameters for each single model; and
fusing the plurality of single models into the integrated model;
a processor for running the integrated model to perform intestinal flora detection, so as to detect whether the intestinal flora data sample to be detected is abnormal; and
an output device for providing a detection result of the intestinal flora data sample to be detected.
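By way of illustration only, the per-sample-set model selection recited in claims 6 and 7 (learning a training set with several algorithms, then keeping the model that performs best on the test set) may be sketched as follows. This sketch assumes two simplified stand-ins, a K-nearest neighbor classifier and a depth-one decision tree, with accuracy as the performance measure; the function names and the toy data are hypothetical.

```python
from collections import Counter


def train_knn(train, k=3):
    """K-nearest neighbor: 'training' simply memorizes the labelled samples."""
    def predict(x):
        nearest = sorted(
            train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x))
        )[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict


def train_stump(train):
    """Depth-one decision tree: choose the split with the fewest training errors."""
    best = None
    for f in range(len(train[0][0])):
        for threshold in sorted({x[f] for x, _ in train}):
            for above in (0, 1):  # label assigned when the feature exceeds the threshold
                errors = sum(
                    (above if x[f] > threshold else 1 - above) != y for x, y in train
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above)
    _, f, threshold, above = best
    return lambda x: above if x[f] > threshold else 1 - above


def accuracy(model, test):
    return sum(model(x) == y for x, y in test) / len(test)


def select_best(train, test, algorithms):
    """Train every algorithm on the training set; keep the best one on the test set."""
    best_name, best_model, best_score = None, None, -1.0
    for name, fit in algorithms.items():
        model = fit(train)
        score = accuracy(model, test)
        if score > best_score:  # ties keep the earlier algorithm
            best_name, best_model, best_score = name, model, score
    return best_name, best_model, best_score


# Toy sample set: (features, label), with label 1 denoting an abnormal sample.
train = [([0.10, 0.5], 0), ([0.20, 0.4], 0), ([0.15, 0.6], 0),
         ([0.80, 0.5], 1), ([0.90, 0.4], 1), ([0.70, 0.6], 1)]
test = [([0.12, 0.5], 0), ([0.85, 0.5], 1)]

algorithms = {"knn": train_knn, "decision_stump": train_stump}
name, model, score = select_best(train, test, algorithms)
print(name, score)
```

Repeating this selection for every sample set yields one single model per set, and those single models are what the majority vote then fuses.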
CN202310571968.9A 2023-05-22 2023-05-22 Construction method of integrated model for intestinal flora detection and detection device thereof Active CN116344040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310571968.9A CN116344040B (en) 2023-05-22 2023-05-22 Construction method of integrated model for intestinal flora detection and detection device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310571968.9A CN116344040B (en) 2023-05-22 2023-05-22 Construction method of integrated model for intestinal flora detection and detection device thereof

Publications (2)

Publication Number Publication Date
CN116344040A (en) 2023-06-27
CN116344040B (en) 2023-09-22

Family

ID=86889740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310571968.9A Active CN116344040B (en) 2023-05-22 2023-05-22 Construction method of integrated model for intestinal flora detection and detection device thereof

Country Status (1)

Country Link
CN (1) CN116344040B (en)

Also Published As

Publication number Publication date
CN116344040B (en) 2023-09-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant