CN116312745B - Intestinal flora super donor image information detection generation method - Google Patents

Publication number
CN116312745B
Authority
CN
China
Prior art keywords
feature
data
sub
clusters
donor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310568770.5A
Other languages
Chinese (zh)
Other versions
CN116312745A (en
Inventor
董秀山
苗雄伟
张晓燕
高燕飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Bethune Hospital Shanxi Academy Of Medical Sciences Shanxi Hospital Of Tongji Hospital Affiliated To Tongji Medical College Of Huazhong University Of Science And Technology Third Hospital Of Shanxi Medical University And Third Clinical Medical College Of Shanxi Medical University
Shanxi Intelligent Big Data Research Institute Co ltd
Original Assignee
Shanxi Bethune Hospital Shanxi Academy Of Medical Sciences Shanxi Hospital Of Tongji Hospital Affiliated To Tongji Medical College Of Huazhong University Of Science And Technology Third Hospital Of Shanxi Medical University And Third Clinical Medical College Of Shanxi Medical University
Shanxi Intelligent Big Data Research Institute Co ltd
Application filed by Shanxi Bethune Hospital Shanxi Academy Of Medical Sciences Shanxi Hospital Of Tongji Hospital Affiliated To Tongji Medical College Of Huazhong University Of Science And Technology Third Hospital Of Shanxi Medical University And Third Clinical Medical College Of Shanxi Medical University, Shanxi Intelligent Big Data Research Institute Co ltd
Priority to CN202310568770.5A
Publication of CN116312745A
Application granted
Publication of CN116312745B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30: Unsupervised data analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Physiology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for detecting and generating intestinal flora super donor portrait information, belonging to the technical field of intestinal flora super donor portrait generation. The technical problem to be solved is that no existing research on portrait generation methods addresses the intestinal flora super donor. The technical scheme adopted is as follows: first, donor data are collected from five dimensions (physiology, psychology, physical examination, life history and life habit) and preprocessed by deleting unique attributes, extracting keywords with an NLP model, label encoding, data scaling and outlier removal, yielding an original feature set for each dimension; the data of each dimension are then clustered with a clustering algorithm to obtain the preliminary features of the super donor. The method is applied to portrait generation for intestinal flora super donors.

Description

Intestinal flora super donor image information detection generation method
Technical Field
The invention provides a method for detecting and generating intestinal flora super donor image information, and belongs to the technical field of intestinal flora super donor image generation.
Background
Fecal microbiota transplantation (FMT), as currently practiced, transplants the functional intestinal flora in the feces of a healthy person into the intestinal tract of a patient so as to rebuild the patient's intestinal flora, and can treat both intestinal and extraintestinal diseases. The key to the procedure is selecting a suitable donor. Because donors for flora transplantation are mainly other individuals (allogeneic donors), screening mainly proceeds by exclusion: the donor's medication history, disease history, infection history, common pathogen test indicators and the like are screened, and factors that may affect the intestinal tract are ruled out so as to obtain the healthiest possible donor; as soon as a factor adverse to the intestinal flora is found, the candidate is excluded.
As intestinal flora transplantation becomes more widely used, donor and recipient data continue to accumulate, providing the data needed to apply data mining. The intestinal flora super donor portrait proposed here is derived from real donor data: data mining is used to construct the portrait and to convert super donor characteristics into intuitive, highly condensed feature labels, giving clinical practice a better way to screen super donors and a big-data basis for doing so. However, no existing research on portrait generation methods addresses the intestinal flora super donor, so a method for detecting and generating intestinal flora super donor portrait information needs to be developed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art. The technical problem to be solved is that no existing research on portrait generation methods addresses the intestinal flora super donor; to this end, a method for detecting and generating intestinal flora super donor portrait information is provided.
In order to solve the technical problems, the invention adopts the following technical scheme: the intestinal flora super donor image information detection and generation method comprises the following detection and generation steps:
step one: collecting image information of the intestinal flora super donor: the front end of the management system acquires personal basic information, a life history questionnaire, a fecal type questionnaire, a WEXNER constipation score, a gastrointestinal life quality score, a diet nutrition questionnaire, a life habit questionnaire and physical examination report information of a donor through a software program;
the back end of the management system exports, in the five dimensions of physiology, psychology, physical examination, life history and life habit, the data of donors whose transplants cured a disease;
step two: processing the acquired information based on the feature extraction module:
step 2.1: preprocessing data by adopting a first processing module, wherein the preprocessed data comprises unique attributes, classification features and text input type open features, deleting the unique attribute data, extracting feature values in the classification features, and extracting keywords from the text input type open feature data by adopting a trained natural language processing model;
step 2.2: mapping the text characteristic value into integer codes by using tag codes, and converting the acquired information into numerical data without unique attributes;
step 2.3: performing data scaling and outlier processing: performing data scaling on all dimensions of the data, scaling the value between 0-100;
step 2.4: removing outliers by using a box plot: all data points suspected of being noise or outliers are deleted; the distances between data points are calculated using all dimensions, and the outliers are identified from the resulting one-dimensional distance vector;
step 2.5: performing data processing on the original feature set of each dimension of the donor by adopting a DWMB clustering algorithm of the second processing module to obtain the optimal number of sub-cluster groups under the original feature set of each dimension of the donor:
step 2.5.1: in the partitioning phase, the data is partitioned into an optimal number of small sub-cluster groups:
dividing the data across all dimensions: the optimal number of clusters for each dimension is calculated with a K-means algorithm, the whole data set is divided into the optimal number of sub-clusters using the intersection of the per-dimension sub-clusters, the resulting sub-clusters are further divided by the same steps, and finally each sub-cluster is split into two further clusters with the K-means algorithm;
step 2.5.2: in the merging stage, small sub-cluster groups created based on the dividing stage are merged again to form actual cluster groups in the data, and the main steps comprise: projecting the data, estimating the probability density of the projected data, and calculating an overlapping area;
step three: the feature set data is selected and extracted, feature evaluation coefficients are comprehensively generated from the feature effectiveness and the feature redundancy, and an optimal feature subset of each dimension is generated based on the feature evaluation coefficients:
step 3.1: feature effectiveness is measured by the information gain; for a feature Y, the information gain measures how much introducing Y reduces the uncertainty of the random variable X, and is calculated as:
$$IG(X, Y) = A(X) - A(X \mid Y)$$
where $A(X)$ is the information entropy of the random variable X, calculated as:
$$A(X) = -\sum_{i} p(x_i) \log p(x_i)$$
where $x_i$ is a possible value of the random variable X and $p(x_i)$ is the corresponding probability of occurrence;
$A(X \mid Y)$ is the information entropy of the random variable X after the feature Y is added, calculated as:
$$A(X \mid Y) = -\sum_{j} p(y_j) \sum_{i} p(x_i \mid y_j) \log p(x_i \mid y_j)$$
where $y_j$ is a possible value of the feature Y, $p(y_j)$ is the corresponding probability of occurrence, and $p(x_i \mid y_j)$ is the probability that a sample whose feature takes the value $y_j$ belongs to cluster class $x_i$;
step 3.2: feature redundancy is measured by the Spearman correlation coefficient, which measures the degree of similarity between different features and is calculated as:
$$\rho(f_a, f_b) = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $\rho(f_a, f_b)$ is the correlation coefficient of features $f_a$ and $f_b$, $n$ is the number of samples, and $d_i$ is the difference between the ranks of the $i$-th values of $f_a$ and $f_b$ within their respective samples;
step 3.3: for a target feature set S ⊆ T, a feature set evaluation coefficient is constructed by combining the feature effectiveness measured by the information gain with the redundancy measured by the correlation coefficient between features; in this coefficient, H(S) is the information gain index of all the features in the feature set S and B(S) is the redundancy index of all the features in the feature set S, where $|S|$ is the number of features contained in the feature set S and $f_a$, $f_b$ are two different features belonging to the feature set S;
step 3.4: iteratively scoring the cluster feature subsets of each dimension in the second processing module through feature scoring coefficients, selecting the subset with the highest score as an optimal feature subset, and selecting the subset with fewer feature numbers as the optimal feature subset if the scores of the two feature subsets are the same;
step four: based on the optimal feature subset obtained in step three, the optimal features and their feature values in each dimension are obtained, from which the labels of the flora transplantation super donor are derived; the portrait of the flora transplantation super donor is then constructed from these labels in the five dimensions of physiology, psychology, physical examination, life history and life habit.
The specific steps of the merging stage in the step 2.5.2 are as follows:
only two adjacent sub-clusters are evaluated at a time to be combined, data is projected onto the central connecting line of the two sub-clusters being combined, and the data projected onto the central connecting line is helpful for better estimating the proximity and density of the sub-clusters;
the projection data of the two sub-clusters are converted into one-dimensional distance vectors, the density of the two sub-clusters is estimated by using a Gaussian mixture model, and the projection data are converted into single-dimensional projection data, so that the calculation cost can be reduced, and the density distribution can be effectively estimated;
the density distribution of the two sub-clusters is estimated for calculating the overlap between the two sub-clusters, if the overlap exceeds a certain threshold, the sub-clusters are merged together, otherwise the sub-clusters will be regarded as separate clusters and not merged.
The flora transplanting super donor portrait information constructed in the fourth step specifically comprises a table or a distribution pattern.
Compared with the prior art, the invention has the following beneficial effects: it provides a method for analyzing and processing intestinal flora super donor portrait data that can accurately construct the portrait in multiple dimensions. Information is collected in five dimensions (physiology, psychology, physical examination, life history and life habit); the data are preprocessed by deleting unique attributes, extracting keywords with an NLP model, label encoding, data scaling and outlier removal to obtain an original feature set for each dimension; the original feature set of each dimension is clustered with the DWMB clustering algorithm to obtain a feature set that preliminarily reflects the super donor portrait; subsets of that feature set are scored iteratively with the trained evaluation coefficient model to obtain an optimal feature subset; and the intestinal flora super donor portrait is finally generated from the optimal feature subset of each dimension. The method can thus construct a realistic, effective, multidimensional and accurate intestinal flora super donor portrait.
Drawings
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flowchart showing the steps of image generation of an intestinal flora super donor according to the present invention;
FIG. 2 is a flowchart illustrating steps for employing a DWMB algorithm in accordance with the present invention;
FIG. 3 is a flowchart illustrating steps for generating a feature set evaluation model in accordance with an embodiment of the present invention;
FIG. 4 is a flowchart illustrating the steps for solving the optimal feature set in an embodiment of the present invention.
Detailed Description
As shown in fig. 1 to 4, the present invention provides a method for detecting and generating image information of super donor of intestinal flora, which mainly comprises the following steps:
collecting data information of a donor from five dimensions (physiology, psychology, physical examination, life history and life habit);
data preprocessing: the method comprises the processing steps of deleting unique attributes, extracting keywords by an NLP model, encoding labels, scaling data, removing abnormal values and the like, and obtaining an original feature set of each dimension;
clustering the data of each dimension based on a DWMB (Divide Well To Merge Better: A Novel Clustering Algorithm) clustering algorithm to obtain the initial characteristics of the super donor;
generating a feature evaluation coefficient according to the feature validity (information gain) and feature redundancy (Spearman correlation coefficient);
then, obtaining an optimal feature subset of each dimension based on iterative solution of the feature evaluation model;
and finally, generating an intestinal flora super donor image based on the acquired optimal feature subset.
Furthermore, to collect, analyze, process and generate the intestinal flora super donor portrait information, the control module of the management system comprises the following modules:
Data acquisition module:
This module collects donor data. The WeChat mini program 'healthy intestinal bacteria management system' is used to obtain the personal basic information, life history questionnaire, fecal type questionnaire, WEXNER constipation score, gastrointestinal life quality score, diet nutrition questionnaire, life habit questionnaire, physical examination report and other information filled in by the donor. Data of donors whose transplants cured a given disease are exported in the five dimensions of physiology, psychology, physical examination, life history and life habit.
First feature extraction processing module:
This module preprocesses the data: it deletes unique attributes, extracts keywords with the NLP model, applies label encoding, scales the data and removes outliers, yielding an original feature set for each dimension and providing a suitable data set for the DWMB algorithm in the second processing module.
Second feature extraction processing module:
This module clusters the original feature set of each dimension with the DWMB (Divide Well To Merge Better: A Novel Clustering Algorithm) clustering algorithm. The data set consists of donors whose transplants successfully cured disease, so the samples are highly similar and contain few outliers, and the resulting clusters therefore differ widely in their share of the samples. The DWMB algorithm draws on both hierarchical and density-based clustering concepts; it can detect clusters with very similar shapes and densities, is suitable for high-dimensional data, and is well suited to the clustering task in this application. The cluster with the largest share of samples is the class I cluster group, which preliminarily embodies the characteristics of the super donor.
Feature selection module:
Feature selection is an important step in building the intestinal flora super donor portrait, and the suitability of the selected features directly affects the accuracy of the portrait; the ideal outcome is a feature set of reasonable size that accurately describes the key characteristics of the super donor. Feature selection is performed on the class I cluster groups obtained by the feature extraction processing module: a feature evaluation model is generated by combining feature effectiveness and feature redundancy, and the class I cluster group of each dimension is iterated over with this model to obtain the optimal feature subset of each dimension.
Portrait generation module:
The optimal features and their corresponding feature values are obtained from the optimal feature subsets, and the intestinal flora super donor portrait is constructed from them.
Furthermore, to carry out the detection and generation processing of the portrait data, the invention provides an intestinal flora super donor portrait data processing and generation system deployed on a multi-GPU server. The server has four 16-core Intel Xeon E5-2683 V4 processors and 512 GB of memory, together with eight NVIDIA GTX2080 graphics processing units (GPUs) providing 88 GB of graphics memory in total; it runs the CentOS 7.7.1908 operating system; the programming language is Python and the integrated development environment is PyCharm.
Detection and generation of the intestinal flora super donor portrait information relies on the corresponding intestinal flora transplantation management system: the preinstalled 'healthy intestinal bacteria management system' program is used to obtain the personal basic information, life history questionnaire, fecal type questionnaire, WEXNER constipation score, gastrointestinal life quality score, diet nutrition questionnaire, life habit questionnaire, physical examination report and other information filled in by the donor; the back-end management system then exports, in the five dimensions of physiology, psychology, physical examination, life history and life habit, the data of donors whose transplants cured a given disease.
After data acquisition is complete, the feature extraction module extracts the target information; this work is done mainly by the first processing module and the second processing module. The first processing module is a data preprocessing module. The data it receives fall into three types: unique attributes, classification features, and open features entered as free text. A unique attribute is a unique identifier of a user, such as a name or phone number; it cannot be used to characterize the super donor portrait and is deleted. An open feature is user-defined text input, such as the preferred combination of staple foods; keywords are extracted from it with the trained natural language processing model. Classification features are the single-choice or multiple-choice questions in the questionnaires, for example the feature "breakfast frequency" with the feature values "eat every day", "eat often", "eat sometimes" and "never eat".
The DWMB clustering algorithm in the second processing module cannot process text data, so label encoding is used to map text feature values to integer codes. The resulting data are numerical and contain no unique attributes. To let the DWMB algorithm complete the clustering task faster and more accurately, the data preprocessing module also performs data scaling and outlier processing. Data scaling is applied to every dimension of the data, bringing the values into the range 0-100. Outliers strongly affect the clustering result, and this application removes them with an improved box plot: all data points suspected of being noise or outliers are deleted (these points are processed further after DWMB clustering is finished). Whereas a conventional box plot identifies outliers from a single dimension, the improved version uses all dimensions to calculate the distances between data points and identifies outliers from the resulting one-dimensional distance vector.
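The preprocessing steps described above (label encoding, scaling to 0-100, and box-plot outlier removal on a one-dimensional distance vector) can be illustrated with a minimal sketch. Several details are assumptions, not taken from the patent: min-max scaling is used because the text only gives the target range; each point is summarized by its mean Euclidean distance to all other points because the text does not say how the distance vector is built; and the conventional 1.5 x IQR upper fence stands in for the unspecified "improved" box plot. All function names and the sample data are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Sequence, Tuple


def label_encode(values: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct text value to an integer code, in order of first appearance."""
    mapping: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


def scale_0_100(column: Sequence[float]) -> List[float]:
    """Min-max scale one dimension into [0, 100] (min-max is an assumption)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [100.0 * (x - lo) / (hi - lo) for x in column]


def distance_vector(points: Sequence[Sequence[float]]) -> List[float]:
    """One value per point: its mean Euclidean distance to every other point (assumed summary)."""
    n = len(points)
    return [sum(math.dist(p, q) for q in points) / (n - 1) for p in points]


def boxplot_outliers(points: Sequence[Sequence[float]], k: float = 1.5) -> List[int]:
    """Indices of points whose distance value lies above the upper fence Q3 + k * IQR."""
    d = distance_vector(points)
    q1, _, q3 = statistics.quantiles(d, n=4)
    upper = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(d) if v > upper]


# Hypothetical answers for the "breakfast frequency" feature.
answers = ["eat every day", "eat often", "eat sometimes", "never eat", "eat often"]
codes, mapping = label_encode(answers)
scaled = scale_0_100(codes)

# A 3 x 3 grid of ordinary points plus one distant point at index 9.
points = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(100.0, 100.0)]
outliers = boxplot_outliers(points)
```

Only the upper fence is checked here, on the reasoning that an isolated point has an unusually large mean distance to the rest of the data; a point with an unusually small distance is not isolated.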
In addition, the second processing module applies the DWMB clustering algorithm to the original feature set of each dimension to obtain the optimal number of sub-cluster groups for the data set. The clustering algorithm groups the donor features of each dimension so that similarity within a cluster is as high as possible and similarity between different clusters is as low as possible, which facilitates the extraction of super donor features and improves the accuracy of the super donor portrait.
The clustering method used in this application is the DWMB algorithm (Divide Well To Merge Better: A Novel Clustering Algorithm), whose flow is shown in FIG. 2. After data cleaning, the algorithm proceeds in two main stages: (i) a dividing stage and (ii) a merging stage. In the dividing stage, the data are divided into an optimal number of small sub-cluster groups. The division is performed across all dimensions of the data: the optimal number of clusters for each dimension is calculated with an optimized K-means algorithm, the whole data set is divided into the optimal number of sub-clusters using the intersection of the per-dimension sub-clusters, and the resulting sub-clusters are further divided by the same steps. Finally, the K-means algorithm splits each sub-cluster into two further clusters. In the merging stage, the small sub-cluster groups created in the dividing stage are merged back together to form the actual cluster groups in the data. The merging stage rests on three basic steps: (i) projecting the data, (ii) estimating the probability density of the projected data, and (iii) calculating the overlap region.
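The core of the dividing stage described above, clustering each dimension separately and intersecting the per-dimension partitions, can be sketched as follows. This is an illustration under stated assumptions rather than the DWMB implementation: the number of clusters per dimension is passed in by the caller (the patent's "optimized K-means" selection of that number is not specified), the one-dimensional K-means uses a simple deterministic initialization, and the recursive re-division and the final two-way split are omitted. The names `kmeans_1d` and `intersect_partitions` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], k: int, iters: int = 50) -> List[int]:
    """Lloyd's algorithm on a single dimension with evenly spread initial centers."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def intersect_partitions(
    points: Sequence[Sequence[float]], ks: Sequence[int]
) -> Dict[Tuple[int, ...], List[int]]:
    """Group point indices by their tuple of per-dimension cluster labels."""
    per_dim = [kmeans_1d([p[d] for p in points], k) for d, k in enumerate(ks)]
    groups: Dict[Tuple[int, ...], List[int]] = {}
    for idx, key in enumerate(zip(*per_dim)):
        groups.setdefault(key, []).append(idx)
    return groups


# Three visually separate groups in two dimensions, two clusters per dimension.
points = [(0, 0), (1, 1), (0, 10), (1, 11), (10, 0), (11, 1)]
sub_clusters = intersect_partitions(points, ks=(2, 2))
```

Two clusters in each of two dimensions could yield up to four sub-clusters; only the combinations that actually contain points appear in the result, which is how the intersection arrives at a data-driven number of sub-clusters.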
Only two adjacent sub-clusters are evaluated at a time to be combined, data is projected onto the central connecting line of the two sub-clusters being combined, and the data projected onto the central connecting line is helpful for better estimating the proximity and density of the sub-clusters; the projection data of the two sub-clusters are converted into one-dimensional distance vectors, the density of the two sub-clusters is estimated by using a Gaussian mixture model, and the projection data are converted into single-dimensional projection data, so that the calculation cost can be reduced, and the density distribution can be effectively estimated; the density distribution of the two sub-clusters is estimated for calculating the overlap between the two sub-clusters, if the overlap exceeds a certain threshold, the sub-clusters are merged together, otherwise the sub-clusters will be regarded as separate clusters and not merged.
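The three merging steps (projection onto the line joining the two centers, density estimation, and overlap calculation) can be sketched with the Python standard library. One simplification is made and should be read as an assumption: each projected sub-cluster is modeled with a single Gaussian via `statistics.NormalDist` instead of the Gaussian mixture model named in the text, so that `NormalDist.overlap` can supply the overlap area directly. The merge threshold is not given in the patent, so it is a required parameter here, and the helper names are hypothetical.

```python
from statistics import NormalDist, fmean
from typing import List, Sequence

Point = Sequence[float]


def centroid(cluster: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a sub-cluster."""
    dims = len(cluster[0])
    return [fmean(p[d] for p in cluster) for d in range(dims)]


def project(cluster: Sequence[Point], origin: Point, unit: Point) -> List[float]:
    """Scalar position of each point along the line through origin with direction unit."""
    return [sum((x - o) * u for x, o, u in zip(p, origin, unit)) for p in cluster]


def overlap_on_center_line(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Overlap area of the two projected densities, between 0 and 1."""
    ca, cb = centroid(a), centroid(b)
    diff = [y - x for x, y in zip(ca, cb)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0:
        return 1.0  # identical centers: treat as fully overlapping
    unit = [d / norm for d in diff]
    # Single-Gaussian fit per sub-cluster (stand-in for the Gaussian mixture model).
    dist_a = NormalDist.from_samples(project(a, ca, unit))
    dist_b = NormalDist.from_samples(project(b, ca, unit))
    return dist_a.overlap(dist_b)


def should_merge(a: Sequence[Point], b: Sequence[Point], threshold: float) -> bool:
    """Merge when the overlap exceeds the caller-supplied threshold."""
    return overlap_on_center_line(a, b) > threshold


a = [(0, 0), (1, 0), (2, 0), (1, 1), (1, -1)]
b = [(2, 0), (3, 0), (4, 0), (3, 1), (3, -1)]       # adjacent to a
c = [(20, 0), (21, 0), (22, 0), (21, 1), (21, -1)]  # far from a
```

Working on the one-dimensional projections keeps the density fit cheap regardless of how many dimensions the original data have, which is the cost advantage the text points out. Note that `NormalDist.overlap` requires both projected sub-clusters to have non-zero spread along the center line.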
The algorithm draws on two different clustering concepts, hierarchical clustering and density-based clustering. Because it combines them, it can detect clusters with very similar shapes and densities, and it has the following further advantages: it can find both convex and non-convex clusters; it can find clusters of different densities; it can detect and remove outliers and noise in the data; it is easy to implement, and its hyperparameters are easy to tune or fix; and it is applicable to high-dimensional data. The DWMB algorithm was tested on 20 data sets, both synthetic and real, and the results show that it performs the clustering task better and more accurately than existing advanced clustering algorithms.
The original feature set consists of donors whose transplants successfully cured disease, so the samples are highly similar and contain few outliers, and the resulting clusters therefore differ widely in their share of the samples; the DWMB clustering algorithm is well suited to this clustering task. The cluster with the larger share is the class I cluster group, which preliminarily embodies the characteristics of the super donor. For example, if the features of the life habit dimension are clustered with the DWMB algorithm and the class I cluster accounts for 80% of the donors while the class II cluster accounts for 20%, the class I cluster is the feature set shared by 80% of the donors, and its features and feature values can be taken as a preliminary description of the super donor portrait.
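Identifying the class I cluster group from a list of cluster labels amounts to finding the label with the largest share of samples. A minimal sketch reproducing the 80% / 20% example above (the label values and function names are hypothetical):

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def cluster_shares(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of samples assigned to each cluster label."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


def class_one_cluster(labels: Sequence[Hashable]) -> Hashable:
    """Label of the cluster with the largest share of samples."""
    return Counter(labels).most_common(1)[0][0]


# Ten donors: eight fall in cluster 0 and two in cluster 1.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
shares = cluster_shares(labels)
class_one = class_one_cluster(labels)
```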
After feature extraction is complete, the feature selection module performs the subsequent processing; its internal structure is shown in fig. 3. The ideal outcome is a feature set of reasonable size that accurately describes the key characteristics of the super donor. The module therefore generates a feature evaluation coefficient from feature effectiveness and feature redundancy, and generates the optimal feature subset of each dimension based on that coefficient, as follows:
Feature effectiveness is measured by the information gain. For a feature Y, the information gain measures how much introducing Y reduces the uncertainty of the random variable X. The calculation formula is as follows:
$$IG(X, Y) = A(X) - A(X \mid Y)$$
where A(X) is the information entropy of the random variable X, and the calculation formula is as follows:
$$A(X) = -\sum_{i} p(x_i) \log p(x_i)$$
where $x_i$ is a possible value of the random variable X and $p(x_i)$ is the corresponding probability of occurrence.
A(X|Y) is the information entropy of the random variable X after the feature Y is added, and the calculation formula is as follows:
$$A(X \mid Y) = -\sum_{j} p(y_j) \sum_{i} p(x_i \mid y_j) \log p(x_i \mid y_j)$$
where $y_j$ is a possible value of the feature Y, $p(y_j)$ is the corresponding probability of occurrence, and $p(x_i \mid y_j)$ is the probability that a sample whose feature takes the value $y_j$ belongs to cluster class $x_i$.
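As a minimal sketch of the information gain described above, the following computes the entropy of the cluster labels X, the conditional entropy given a feature Y, and their difference. The base-2 logarithm, the function names, and the toy data are assumptions made for illustration; the source does not state the logarithm base.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Information entropy A(X) of a list of cluster labels (log base 2 assumed)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def conditional_entropy(labels, feature_values):
    """A(X|Y): entropy of the labels within each feature value, weighted by p(y)."""
    groups = defaultdict(list)
    for x, y in zip(labels, feature_values):
        groups[y].append(x)
    total = len(labels)
    return sum(len(group) / total * entropy(group) for group in groups.values())

def information_gain(labels, feature_values):
    """Reduction in label uncertainty obtained by introducing the feature."""
    return entropy(labels) - conditional_entropy(labels, feature_values)

clusters = ["I", "I", "II", "II"]
informative = ["a", "a", "b", "b"]    # separates the two clusters perfectly
uninformative = ["a", "b", "a", "b"]  # says nothing about the cluster
print(information_gain(clusters, informative), information_gain(clusters, uninformative))
```

A feature that splits the donors exactly along cluster lines recovers the full label entropy, while a feature distributed identically in both clusters yields a gain of zero.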
Feature redundancy is measured by the Spearman correlation coefficient, which measures the degree of similarity between different features. The calculation formula is as follows:
$$\rho(f_a, f_b) = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $\rho(f_a, f_b)$ is the correlation coefficient between features $f_a$ and $f_b$, $n$ is the number of samples, and $d_i$ is the difference between the ranks of the $i$-th values of $f_a$ and $f_b$ within their respective samples.
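The rank-difference form of the Spearman coefficient can be sketched as follows. The helper names are illustrative, and averaging the ranks of tied values is an assumption: the rank-difference formula is exact only when there are no ties.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman(a, b):
    """Spearman coefficient from squared rank differences: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(a)
    if n < 2 or n != len(b):
        raise ValueError("need two equal-length sequences with at least 2 samples")
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]), spearman([1, 2, 3, 4], [4, 3, 2, 1]))
```

Two features with a coefficient close to 1 or -1 carry largely the same ordering information, which is why a high absolute value is treated as redundancy.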
For a feature set S ⊆ T, a feature set evaluation coefficient is constructed by combining the feature validity measured by the information gain with the redundancy measured by the inter-feature correlation coefficient. The calculation formula is as follows:
where H(S) is the information gain index of all the features in the feature set S, and the calculation formula is as follows:
B(S) is the redundancy index of all the features in the feature set S, and the calculation formula is as follows:
where $|S|$ is the number of features contained in the feature set S, and $f_i$ and $f_j$ are two different features belonging to the feature set S.
The flow for solving the optimal feature set is shown in Fig. 4. The feature set evaluation coefficient is used to iteratively score the class I cluster feature subsets of each dimension produced by the second processing module, and the subset with the highest score is selected as the optimal feature subset. If two feature subsets have the same score, the subset with fewer features is selected as the optimal feature subset.
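The scoring loop and its tie-breaking rule can be sketched as below. The exact evaluation coefficient of the method is not reproduced in this text, so the sketch swaps in the well-known correlation-based feature selection (CFS) merit as a stand-in: it rewards a high mean information gain and penalises a high mean absolute inter-feature correlation. The exhaustive search over subsets, the function names, and the numbers are likewise illustrative assumptions.

```python
import math
from itertools import combinations

def evaluate(subset, gain, corr):
    """Stand-in coefficient (CFS merit), NOT the patent's own formula.

    gain maps feature -> information gain; corr maps frozenset({f_i, f_j}) -> correlation.
    """
    k = len(subset)
    h = sum(gain[f] for f in subset) / k  # mean effectiveness
    pairs = list(combinations(subset, 2))
    b = sum(abs(corr[frozenset(p)]) for p in pairs) / len(pairs) if pairs else 0.0
    return k * h / math.sqrt(k + k * (k - 1) * b)

def best_subset(features, gain, corr):
    """Score every non-empty subset; on equal scores keep the one with fewer features."""
    best, best_score = None, float("-inf")
    for size in range(1, len(features) + 1):  # smaller subsets are visited first
        for subset in combinations(features, size):
            score = evaluate(subset, gain, corr)
            if score > best_score:  # strict comparison preserves the smaller subset on ties
                best, best_score = subset, score
    return best, best_score

# Hypothetical life habit features with made-up gains and correlations.
features = ["sleep", "exercise", "diet"]
gain = {"sleep": 0.6, "exercise": 0.55, "diet": 0.5}
corr = {
    frozenset({"sleep", "exercise"}): 0.9,
    frozenset({"sleep", "diet"}): 0.1,
    frozenset({"exercise", "diet"}): 0.1,
}
chosen, score = best_subset(features, gain, corr)
print(chosen, round(score, 4))
```

In this toy data `sleep` and `exercise` are highly correlated, so the search prefers pairing `sleep` with the weakly correlated `diet` over keeping both redundant features.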
Finally, based on the optimal feature subset obtained by the feature selection module, the optimal features of each dimension and their feature values are obtained, from which the labels of the flora transplantation super donor are derived. The flora transplantation super donor portrait is then constructed from these labels in the five dimensions of physiology, psychology, physical examination, life history and life habit, providing a decision basis for screening intestinal flora super donors in clinical application. The embodiments of the present application do not limit the form in which the intestinal flora super donor portrait is presented; it may be, for example, a table or a distribution chart.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (3)

1. A method for detecting and generating image information of an intestinal flora super donor, characterized in that the method comprises the following detection and generation steps:
step one: collecting image information of the intestinal flora super donor: the front end of the management system acquires personal basic information, a life history questionnaire, a fecal type questionnaire, a WEXNER constipation score, a gastrointestinal life quality score, a diet nutrition questionnaire, a life habit questionnaire and physical examination report information of a donor through a software program;
the background of the management system derives donor data capable of curing diseases from five dimensions of physiology, psychology, physical examination, life history and life habit;
step two: processing the acquired information based on the feature extraction module:
step 2.1: preprocessing data by adopting a first processing module, wherein the preprocessed data comprises unique attributes, classification features and text input type open features, deleting the unique attribute data, extracting feature values in the classification features, and extracting keywords from the text input type open feature data by adopting a trained natural language processing model;
step 2.2: mapping the text characteristic value into integer codes by using tag codes, and converting the acquired information into numerical data without unique attributes;
step 2.3: performing data scaling and outlier processing: performing data scaling on all dimensions of the data, scaling the value between 0-100;
step 2.4: removing outliers by means of a box plot: deleting all data points in the data suspected of being noise or outliers, calculating the distances between the data points using all dimensions, and identifying the outliers from the resulting one-dimensional distance vector;
step 2.5: performing data processing on the original feature set of each dimension of the donor by adopting a DWMB clustering algorithm of the second processing module to obtain the optimal number of sub-cluster groups under the original feature set of each dimension of the donor:
step 2.5.1: in the partitioning phase, the data is partitioned into an optimal number of small sub-cluster groups:
dividing the data in all dimensions, calculating the number of optimal clusters of each dimension of the data by using a K-means algorithm, dividing the whole data into the optimal number of sub-clusters by using sub-cluster intersection of all dimensions, further dividing the optimal number of sub-clusters by using the same steps, and finally further dividing each sub-cluster into two other clusters by using the K-means algorithm;
step 2.5.2: in the merging stage, small sub-cluster groups created based on the dividing stage are merged again to form actual cluster groups in the data, and the main steps comprise: projecting the data, estimating the probability density of the projected data, and calculating an overlapping area;
step three: the feature set data is selected and extracted, feature evaluation coefficients are comprehensively generated from the feature effectiveness and the feature redundancy, and an optimal feature subset of each dimension is generated based on the feature evaluation coefficients:
step 3.1: feature effectiveness is measured by the information gain; for a feature Y, the information gain measures how much introducing the feature Y reduces the uncertainty of the random variable X, and the calculation formula is as follows:
$$IG(X, Y) = A(X) - A(X \mid Y)$$
wherein A(X) is the information entropy of the random variable X, and the calculation formula is as follows:
$$A(X) = -\sum_{i} p(x_i) \log p(x_i)$$
in the formula, $x_i$ is a possible value of the random variable X, and $p(x_i)$ is the corresponding occurrence probability;
wherein A(X|Y) is the information entropy of the random variable X after the feature Y is added, and the calculation formula is as follows:
$$A(X \mid Y) = -\sum_{j} p(y_j) \sum_{i} p(x_i \mid y_j) \log p(x_i \mid y_j)$$
in the formula, $y_j$ is a possible value of the feature Y, $p(y_j)$ is the corresponding occurrence probability, and $p(x_i \mid y_j)$ is the probability that a sample whose feature takes the value $y_j$ belongs to cluster class $x_i$;
step 3.2: feature redundancy is measured by the Spearman correlation coefficient, which measures the degree of similarity between different features, and the calculation formula is as follows:
$$\rho(f_a, f_b) = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
in the formula, $\rho(f_a, f_b)$ is the correlation coefficient between features $f_a$ and $f_b$, $n$ is the number of samples, and $d_i$ is the difference between the ranks of the $i$-th values of $f_a$ and $f_b$ within their respective samples;
step 3.3: for the target feature set S ⊆ T, a feature set evaluation coefficient is constructed by combining the feature validity measured by the information gain with the redundancy measured by the inter-feature correlation coefficient, and the calculation formula is as follows:
in the formula, H(S) is the information gain index of all the features in the feature set S, and the calculation formula is as follows:
wherein B(S) is the redundancy index of all the features in the feature set S, and the calculation formula is as follows:
in the formula, $|S|$ is the number of features contained in the feature set S, and $f_i$ and $f_j$ are two different features belonging to the feature set S;
step 3.4: iteratively scoring the cluster feature subsets of each dimension in the second processing module through feature scoring coefficients, selecting the subset with the highest score as an optimal feature subset, and selecting the subset with fewer feature numbers as the optimal feature subset if the scores of the two feature subsets are the same;
step four: and (3) obtaining the optimal characteristics and characteristic values of the optimal characteristics in each dimension based on the optimal characteristic subset obtained in the step (III), further obtaining the label of the flora transplanting super donor, and constructing the portrait of the flora transplanting super donor based on the label in five dimensions of physiology, psychology, physical examination, life history and life habit.
2. The method for detecting and generating intestinal flora super donor image information according to claim 1, wherein the method comprises the following steps: the specific steps of the merging stage in the step 2.5.2 are as follows:
only two adjacent sub-clusters are evaluated at a time to be combined, data is projected onto the central connecting line of the two sub-clusters being combined, and the data projected onto the central connecting line is helpful for better estimating the proximity and density of the sub-clusters;
the projection data of the two sub-clusters are converted into one-dimensional distance vectors, the density of the two sub-clusters is estimated by using a Gaussian mixture model, and the projection data are converted into single-dimensional projection data, so that the calculation cost can be reduced, and the density distribution can be effectively estimated;
the density distribution of the two sub-clusters is estimated for calculating the overlap between the two sub-clusters, if the overlap exceeds a certain threshold, the sub-clusters are merged together, otherwise the sub-clusters will be regarded as separate clusters and not merged.
3. The method for detecting and generating intestinal flora super donor image information according to claim 1, wherein the method comprises the following steps: the flora transplanting super donor portrait information constructed in the fourth step specifically comprises a table or a distribution pattern.
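Steps 2.2 to 2.4 of claim 1 (label encoding, scaling to the range 0 to 100, and box-plot outlier removal on a one-dimensional distance vector) can be illustrated with a minimal sketch. The claim does not say which distances make up the distance vector, so the distance from each sample to the data centroid is used here as an assumption; the function names and the whisker factor `k=1.5` are also illustrative.

```python
import math
from statistics import quantiles

def label_encode(values):
    """Map each distinct text value to an integer code (codes follow sorted order)."""
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def scale_0_100(column):
    """Min-max scale one numeric column into the range 0..100."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [100.0 * (v - lo) / (hi - lo) for v in column]

def boxplot_outlier_mask(rows, k=1.5):
    """Flag rows whose distance to the centroid lies above the upper box-plot whisker."""
    center = [sum(col) / len(col) for col in zip(*rows)]
    distances = [math.dist(row, center) for row in rows]  # one-dimensional distance vector
    q1, _, q3 = quantiles(distances, n=4)
    upper = q3 + k * (q3 - q1)
    return [d > upper for d in distances]

codes, mapping = label_encode(["never", "daily", "never"])
scaled = scale_0_100([2, 4, 6])
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (50, 50)]
mask = boxplot_outlier_mask(rows)
cleaned = [row for row, is_outlier in zip(rows, mask) if not is_outlier]
print(codes, scaled, cleaned)
```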
CN202310568770.5A 2023-05-19 2023-05-19 Intestinal flora super donor image information detection generation method Active CN116312745B (en)


Publications (2)

Publication Number Publication Date
CN116312745A CN116312745A (en) 2023-06-23
CN116312745B true CN116312745B (en) 2023-08-08






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant