CN118132973A - Method, device, equipment, and medium for determining a feature combination affecting a classification service

Publication number: CN118132973A
Application number: CN202311790534.4A
Authority: CN (China)
Prior art keywords: classification model, item, classification, sample, training
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 宋阳
Current Assignee: BOE Technology Group Co Ltd
Original Assignee: BOE Technology Group Co Ltd
Application filed by BOE Technology Group Co Ltd; priority to CN202311790534.4A
Classification: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, device, equipment, and medium for determining a feature combination affecting a classification service, applied in the field of information technology. With the method of the embodiments of the present application, the feature data of a plurality of biomarker items and a plurality of attribute items are permuted and combined to obtain a plurality of item combinations; each item combination is used to train a classification model; the item combination with the highest accuracy is determined from the accuracies of the classification models trained under the respective item combinations and taken as the target item combination; and each feature data type in the target item combination is taken as a target feature data type affecting the target classification service, thereby completing the screening of feature data types.

Description

Method, device, equipment, and medium for determining a feature combination affecting a classification service
Technical Field
The present application relates to the field of information technology, and in particular, to a method, a device, equipment, and a medium for determining a feature combination affecting a classification service.
Background
A classification model is a model commonly used in machine learning to divide the samples in a dataset into different categories, and it is widely applied in many fields. For example, in the biological field, a classification model can judge the growth condition of a plant, or whether a crop can be harvested, by analyzing information such as temperature, soil mineral content, the hormone content in the plant, and soil humidity; it can also analyze the health condition of an animal from its habitat, body weight, in-vivo hormone content, and so on. However, in a classification service, especially a relatively complex one, many kinds of feature data are input into the classification model, so it is often impossible to effectively determine which kinds of data actually affect the classification service; the classification model then has to analyze and process every kind of feature data, which causes a huge amount of computation and reduces the prediction efficiency of the classification model. Therefore, how to determine the feature data that affect a classification service is a problem to be solved.
Disclosure of Invention
Embodiments of the present application aim to provide a method, a device, equipment, and a medium for determining a feature combination affecting a classification service, so as to determine the feature data that affect the classification service. The specific technical solution is as follows:
In a first aspect of the embodiments of the present application, a method for determining a feature combination affecting a classification service is provided, the method including:
acquiring feature data and label information of a plurality of sample objects, wherein the feature data includes biological information of a plurality of biomarker items and attribute information of a plurality of attribute items, and the label information represents the true value of the classification result of the sample object on a target classification service;
selecting item combinations from the plurality of biomarker items and the plurality of attribute items by permutation and combination to obtain a plurality of groups of item combinations;
acquiring a classification model to be trained;
for each group of item combinations, taking the feature data of the sample objects under the group of item combinations as the input of the classification model and the label information of the sample objects as the true value predicted by the classification model, and training the classification model to obtain a classification model trained under the group of item combinations;
respectively determining a first accuracy of the trained classification model under each item combination, and taking the item combination used for training the classification model with the highest accuracy as the target item combination.
In one possible embodiment, the method further includes:
dividing the plurality of sample objects into a sample training set and a sample test set;
the step of taking the feature data of the sample objects under the group of item combinations as the input of the classification model, taking the label information of the sample objects as the true value predicted by the classification model, and training the classification model to obtain the classification model trained under the group of item combinations includes:
for each group of item combinations, selecting a sample object from the sample training set, and inputting the feature data of the currently selected sample object under the group of item combinations into the classification model to obtain a current classification result;
adjusting parameters of the classification model according to the label information of the currently selected sample object and the current classification result, to obtain the classification model trained under the group of item combinations;
the step of respectively determining the first accuracy of the trained classification model under each item combination includes:
for each group of item combinations, determining the first accuracy of the classification model trained under the item combination by using the feature data and classification labels of each sample object in the sample test set under the group of item combinations.
In one possible embodiment, the sample training set includes n sample training subsets, and the method further includes:
selecting the i-th sample training subset from the n sample training subsets as a hyperparameter validation subset, the other sample training subsets among the n sample training subsets except the current hyperparameter validation subset being hyperparameter training subsets, where the initial value of i is 1;
for each group of item combinations, training the classification model under different hyperparameters by using the feature data and classification labels of each sample object in the current hyperparameter training subsets under the group of item combinations, to obtain classification models under the different hyperparameters of the item combination;
respectively determining a second accuracy of the classification models under the different hyperparameters of the item combination by using the feature data and classification labels of each sample object in the current hyperparameter validation subset under the item combination;
increasing i by 1 and returning to the step of training, for each group of item combinations, the classification model under different hyperparameters by using the feature data and classification labels of each sample object in the current hyperparameter training subsets under the group of item combinations to obtain classification models under the different hyperparameters of the item combination, until i = n;
for each group of item combinations, respectively calculating the mean value of the second accuracies of the classification model under each hyperparameter of the item combination, and selecting the hyperparameters of the classification model with the highest mean value as the hyperparameters of the classification model under the item combination.
In one possible implementation manner, the classification model includes a first classification model, a second classification model and a third classification model, wherein the first classification model adopts a linear classification algorithm classifier, the second classification model adopts a nonlinear classification algorithm classifier, and the third classification model adopts a multi-layer perceptron classifier.
In one possible embodiment, the biological information of the plurality of biomarker items includes biomarker concentrations of the plurality of biomarker items, wherein the biomarker concentrations include at least two of Aβ40, Aβ42, P-tau181, P-tau217, and NfL; the attribute information of the plurality of attribute items includes at least two of gender information, age information, and education level information.
In one possible embodiment, the method further comprises:
for each group of item combinations, performing numerical conversion on the non-continuous attribute items in the group of item combinations, and/or performing ratio calculation on the biomarker items in the group of item combinations, to obtain preprocessed feature data;
wherein the classification model is trained using the preprocessed feature data.
In a second aspect of the embodiments of the present application, a feature combination determining apparatus affecting a classification service is provided, the apparatus including:
an information acquisition module, configured to acquire feature data and label information of a plurality of sample objects, wherein the feature data includes biological information of a plurality of biomarker items and attribute information of a plurality of attribute items, and the label information represents the true value of the classification result of the sample object on a target classification service;
a permutation and combination module, configured to select item combinations from the plurality of biomarker items and the plurality of attribute items by permutation and combination, to obtain a plurality of groups of item combinations;
a classification model acquisition module, configured to acquire a classification model to be trained;
a classification model training module, configured to, for each group of item combinations, take the feature data of the sample objects under the group of item combinations as the input of the classification model and the label information of the sample objects as the true value predicted by the classification model, and train the classification model to obtain a classification model trained under the group of item combinations;
a target feature combination determining module, configured to respectively determine a first accuracy of the trained classification model under each item combination, and take the item combination used for training the classification model with the highest accuracy as the target item combination.
In one possible embodiment, the apparatus further comprises:
the sample object dividing module is used for dividing the plurality of sample objects into a sample training set and a sample testing set;
the classification model training module includes:
a sample object selection sub-module, specifically configured to, for each group of item combinations, select a sample object from the sample training set, and input the feature data of the currently selected sample object under the group of item combinations into the classification model to obtain a current classification result;
a parameter adjustment sub-module, specifically configured to adjust parameters of the classification model according to the label information of the currently selected sample object and the current classification result, to obtain the classification model trained under the group of item combinations;
the target feature combination determination module includes:
The first accuracy computing sub-module is specifically configured to determine, for each set of item combinations, a first accuracy of a classification model trained under the item combination by using feature data and classification labels of each sample object in the sample test set under the set of item combinations.
In one possible embodiment, the sample training set includes n sample training subsets, the apparatus further comprising:
a hyperparameter validation subset selection module, configured to select the i-th sample training subset from the n sample training subsets as a hyperparameter validation subset, the other sample training subsets among the n sample training subsets except the current hyperparameter validation subset being hyperparameter training subsets, where the initial value of i is 1;
a classification model hyperparameter training module, configured to, for each group of item combinations, train the classification model under different hyperparameters by using the feature data and classification labels of each sample object in the current hyperparameter training subsets under the group of item combinations, to obtain classification models under the different hyperparameters of the item combination;
a second accuracy calculation module, configured to respectively determine a second accuracy of the classification models under the different hyperparameters of the item combination by using the feature data and classification labels of each sample object in the current hyperparameter validation subset under the item combination;
a cross-validation module, configured to increase i by 1 and return to the step of training, for each group of item combinations, the classification model under different hyperparameters by using the feature data and classification labels of each sample object in the current hyperparameter training subsets under the group of item combinations to obtain classification models under the different hyperparameters of the item combination, until i = n;
a hyperparameter determining module, configured to, for each group of item combinations, respectively calculate the mean value of the second accuracies of the classification model under each hyperparameter of the item combination, and select the hyperparameters of the classification model with the highest mean value as the hyperparameters of the classification model under the item combination.
In one possible implementation manner, the classification model includes a first classification model, a second classification model and a third classification model, wherein the first classification model adopts a linear classification algorithm classifier, the second classification model adopts a nonlinear classification algorithm classifier, and the third classification model adopts a multi-layer perceptron classifier.
In one possible embodiment, the biological information of the plurality of biomarker items includes biomarker concentrations of the plurality of biomarker items, wherein the biomarker concentrations include at least two of Aβ40, Aβ42, P-tau181, P-tau217, and NfL; the attribute information of the plurality of attribute items includes at least two of gender information, age information, and education level information.
In one possible embodiment, the apparatus further comprises:
a data processing module, configured to, for each group of item combinations, perform numerical conversion on the non-continuous attribute items in the group of item combinations and/or perform ratio calculation on the biomarker items in the group of item combinations, to obtain preprocessed feature data;
wherein the classification model is trained using the preprocessed feature data.
In another aspect of the embodiment of the application, an electronic device is provided, which comprises a processor and a memory;
A memory for storing a computer program;
and a processor, configured to implement the steps of any of the above methods for determining a feature combination affecting a classification service when executing the program stored in the memory.
In another aspect of the embodiments of the present application, a computer-readable storage medium is provided, in which a computer program is stored; when the computer program is executed by a processor, the steps of any of the above methods for determining a feature combination affecting a classification service are implemented.
An embodiment of the present application also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above methods for determining a feature combination affecting a classification service.
The embodiment of the application has the beneficial effects that:
According to the method, device, equipment, and medium for determining a feature combination affecting a classification service provided by the embodiments of the present application, the feature data of the plurality of biomarker items and the plurality of attribute items can be permuted and combined to obtain a plurality of item combinations; each item combination is used to train the classification model; the item combination with the highest accuracy is determined from the accuracies of the classification models trained under the respective item combinations and taken as the target item combination; and accordingly, each feature data type in the target item combination is taken as a target feature data type affecting the target classification service, thereby completing the screening of feature data types.
Of course, it is not necessary for any one product or method of practicing the application to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained by those skilled in the art from these drawings.
FIG. 1 is a flowchart of a method for determining a feature combination affecting a classification service according to an embodiment of the present application;
FIG. 2 is another flowchart of a method for determining a feature combination affecting a classification service according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a feature combination determining apparatus affecting a classification service according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by a person skilled in the art fall within the scope of protection of the present application.
In a first aspect of the embodiments of the present application, a method for determining a feature combination affecting a classification service is provided, the method including the steps shown in FIG. 1:
Step S101: acquiring feature data and label information of a plurality of sample objects, wherein the feature data includes biological information of a plurality of biomarker items and attribute information of a plurality of attribute items, and the label information represents the true value of the classification result of the sample object on a target classification service.
The plurality of sample objects may be objects of the same kind, or may be different states of the same object. The biological information and the attribute information are information that may affect the classification result. The target classification service refers to a service that classifies a sample object, for example, determining which kind the sample object belongs to, or which state it is in.
For example, when the sample object is a plant, the plurality of sample objects may cover the germination period, growth period, and maturity period of the plant; the corresponding feature data may include information such as auxin concentration, organic matter concentration, and inorganic salt concentration, and the attribute information may include information such as the type of the plant, the climate conditions under which the plant is suited to grow, and the distribution area in which the plant grows. The label information may represent a yield classification of the plant, for example, a high-yield classification, a medium-yield classification, or a low-yield classification. The classification model analyzes the biological information and attribute information of the plant to obtain a classification result representing the yield of the plant.
In another example, tumor marker screening is used as a physical examination item, which can help people monitor the content of tumor markers in the body earlier and thus prevent tumors. When the content of a tumor marker reaches a certain range, medical staff can remind the person of the need to prevent the occurrence of a tumor; when the content of the tumor marker exceeds a certain threshold, it may be considered that a tumor is present in the patient's body. Here, the biological information may be the content of tumor markers and related markers, and the attribute information may be the person's gender, weight, age, height, and so on. The classification model analyzes the biological information and attribute information of the target person to obtain a classification result indicating whether the target person belongs to the general population, the population that needs to take precautions against tumors, or the population suffering from tumors.
It should be noted that the method of the embodiments of the present application may be implemented by a terminal device; in one example, the terminal device may be an electronic device such as a computer, a tablet computer, or a server. The method provided by the embodiments of the present application can be applied to many information detection fields, including the field of biological information detection, such as detecting and classifying plants, or detecting whether a driver is drink driving or drunk driving.
Step S102: and selecting the item combinations from the biomarker items and the attribute items in a permutation and combination mode to obtain a plurality of groups of item combinations.
When obtaining the plurality of item combinations, the item combinations may be obtained by permutation and combination according to the total number of the plurality of biomarker items and the plurality of attribute items. For example, the number of item combinations may be calculated as C(N,1) + C(N,2) + ... + C(N,N) = 2^N - 1, where N is the total number of the plurality of biomarker items and the plurality of attribute items. In one example, if the feature data has 2 biomarker items a1, a2 and 2 attribute items b1, b2, i.e. 4 kinds of feature data in total, the plurality of item combinations includes: A1=[a1, a2, b1, b2], A2=[a1, a2, b1], A3=[a1, a2, b2], A4=[a1, b1, b2], A5=[a1, a2], A6=[a1, b1], A7=[a1, b2], A8=[a1], A9=[a2, b1, b2], A10=[a2, b1], A11=[a2, b2], A12=[a2], A13=[b1, b2], A14=[b1], A15=[b2].
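As an illustration (a minimal sketch, not taken from the patent text), the enumeration above can be expressed in Python with itertools; the item names a1, a2, b1, b2 are the hypothetical ones from the example:

# Enumerate all non-empty item combinations of 2 biomarker items and 2 attribute
# items, matching the 2^4 - 1 = 15 combinations listed above.
from itertools import combinations

items = ["a1", "a2", "b1", "b2"]  # hypothetical biomarker and attribute item names

item_combinations = [
    list(combo)
    for k in range(1, len(items) + 1)
    for combo in combinations(items, k)
]

print(len(item_combinations))  # 15
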
Step S103: and obtaining the classification model to be trained.
The classification model includes a feature extraction network and a classification network (classifier); for the specific structure of the classification model, reference may be made to classification model structures in the prior art. In one example, there may be multiple types of classification models, distinguished by the type of classifier. In one possible implementation, the classification models include a first classification model, a second classification model, and a third classification model, where the first classification model adopts a linear classification algorithm classifier, the second classification model adopts a nonlinear classification algorithm classifier, and the third classification model adopts a multi-layer perceptron classifier.
In one example, the first classifier may be a linear classification algorithm classifier such as a support vector machine with a linear kernel or LogisticRegression (logistic regression); the second classifier may be a nonlinear classification algorithm classifier such as K-nearest neighbors, a decision tree, or RandomForest (random forest); the third classifier may be a neural network classifier, for example a multi-layer perceptron (MLP) classifier.
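As a sketch of this implementation, one classifier of each of the three types could be instantiated with scikit-learn; the library choice and the specific parameter values are assumptions for illustration, since the patent only names the algorithm families:

# One classifier of each type mentioned above (illustrative parameter values).
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "linear": LogisticRegression(max_iter=1000),                      # linear classification algorithm
    "nonlinear": RandomForestClassifier(n_estimators=100),            # nonlinear classification algorithm
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),  # multi-layer perceptron
}
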
Step S104: and aiming at each group of item combinations, taking the characteristic data of the sample object under the group of item combinations as the input of a classification model, taking the label information of the sample object as the true value of the classification model prediction, and training the classification model to obtain the classification model trained under the group of item combinations.
In training the classification model, the model may be trained in a supervised learning manner. For example, the model training may be performed using regularization techniques, cross-validation methods, and the like. The cross-validation method is of various types, for example, a leave-one-out cross-validation method, a K-fold cross-validation method, a Monte Carlo cross-validation method, a layered K-fold cross-validation method and the like can be adopted for model training.
When training the model, the feature data in each set of project combination is required to be input into the classification model to be trained for training, so as to obtain a trained classification model corresponding to each project combination. Still taking the example in step S102 as an example, step S104 needs to be performed for each item combination from A1 to a15, resulting in 15 trained classification models.
In the case of a plurality of classification models, the training operation of step S104 is performed for each classification model, for example, there are n sets of item combinations, m classification models, for each set of item combinations, respectively for training of m classification models, i.e., each set of item combinations results in m trained classification models, and n sets of item combinations have a total of n×m trained classification models.
Step S105: and respectively determining the first accuracy of the trained classification model under each item combination, and taking the item combination used for training the classification model with the highest accuracy as a target item combination.
And obtaining the first accuracy of each trained classification model by comparing the prediction result of the trained classification model with the label information of the sample object. Taking the above example as an example, after obtaining 15 trained classification models, calculating the first accuracy of the 15 trained classification models, determining the item combination corresponding to the highest first accuracy, taking the item combination as a target item combination, and considering that three kinds of feature data a 2、b1、b2 have the greatest influence on the target classification business if the first accuracy corresponding to the item combination A9 is the highest in one example, taking the item combination composed of the three kinds of feature data as the target item combination. When the target classification service is carried out later, only three kinds of characteristic data, namely a 2、b1、b2 of the target object, are required to be collected.
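The selection described in steps S104 and S105 can be sketched as follows; this is an illustrative sketch under assumptions: scikit-learn and pandas are used, train_df and test_df are data frames with one column per item plus a "label" column, and item_combinations and classifiers are as in the sketches above.

# Train every classifier on every item combination and keep the combination
# whose best-performing model has the highest (first) accuracy.
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def select_target_combination(train_df, test_df, item_combinations, classifiers):
    results = {}  # (item combination, classifier name) -> first accuracy
    for combo in item_combinations:
        X_tr, y_tr = train_df[list(combo)], train_df["label"]
        X_te, y_te = test_df[list(combo)], test_df["label"]
        for name, clf in classifiers.items():
            model = clone(clf).fit(X_tr, y_tr)               # train under this item combination
            acc = accuracy_score(y_te, model.predict(X_te))  # first accuracy
            results[(tuple(combo), name)] = acc
    (target_combo, best_name), best_acc = max(results.items(), key=lambda kv: kv[1])
    return list(target_combo), best_name, best_acc
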
By applying the method of the embodiments of the present application, a plurality of item combinations can be obtained by permuting and combining the feature data of the plurality of biomarker items and the plurality of attribute items; each item combination is used to train the classification model, and the item combination with the highest accuracy is determined from the accuracies of the classification models trained under the respective item combinations and taken as the target item combination. Accordingly, each feature data type in the target item combination is taken as a target feature data type affecting the target classification service, which completes the screening of feature data types, reduces the amount of computation of the classification model, and improves the prediction efficiency of the classification model while ensuring its accuracy.
In a possible implementation manner, the method of the embodiment of the present application may further include the following steps:
The plurality of sample objects is divided into a sample training set and a sample testing set.
Step S104 may be implemented by:
Step one: and selecting a sample object in the sample training set aiming at each group of item combinations, and inputting the characteristic data of the currently selected sample object under the group of item combinations into a classification model to obtain a current classification result.
Step two: and adjusting parameters of the classification model according to the label information of the currently selected sample object and the current classification result to obtain the classification model trained under the set of item combinations.
Correspondingly, in step S105, the first accuracy of the trained classification model under each item combination is determined, which may be implemented by the following steps:
step four: and for each group of item combinations, determining the first accuracy of the classification model trained under the item combination by utilizing the characteristic data and the classification labels of each sample object in the sample test set under the group of item combinations.
The classification labels are the label information of the sample objects and represent the true values of the classification results of the sample objects. In practice, the numbers of sample objects of different categories are often unequal; for example, when most plants are mature in a certain period, the numbers of plants in the germination and growth periods are significantly smaller than in the maturity period. Therefore, when dividing the plurality of sample objects into the sample training set and the sample test set, the ratio of the numbers of sample objects of each category should remain the same in the sample training set and the sample test set. In one example, if the ratio of the numbers of plants in the germination, growth, and maturity periods is 1:2:3, then this ratio should also be 1:2:3 in both the sample training set and the sample test set.
The numbers of samples in the sample training set and the sample test set may be equal or different. In practical applications, the proportions of the sample training set and the sample test set may be determined based on the total number of sample objects. In one example, when the number of sample objects is sufficient, the sample objects may be divided into the sample training set and the sample test set at a ratio of 5:5 or 6:4; when the total number of samples is small, the sample objects are divided into the sample training set and the sample test set at a ratio of 8:2.
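A minimal sketch of such a stratified split, assuming scikit-learn and a pandas data frame with hypothetical item columns and a "label" column:

# Divide the sample objects into a sample training set and a sample test set
# while keeping the per-category ratios unchanged (stratified split).
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data frame: one row per sample object, item columns plus a label.
df = pd.DataFrame({
    "a1": range(10), "a2": range(10),
    "b1": [0, 1] * 5, "b2": [1, 0] * 5,
    "label": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],   # three categories
})

train_df, test_df = train_test_split(
    df,
    test_size=0.5,            # 5:5 split; 0.4 for 6:4, 0.2 for 8:2
    stratify=df["label"],     # preserve the ratio of each category
    random_state=0,
)
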
In addition, not only the accuracy of the classification model under each item combination can be calculated, but also other evaluation index parameters of each classification model. The evaluation index parameters of a classification model may include accuracy, recall, precision, F1 score, sensitivity, specificity, the ROC curve (receiver operating characteristic curve), AUC (the area under the ROC curve, i.e., the area enclosed by the ROC curve and the coordinate axes), 90% or 95% AUC confidence intervals, cross-validated average AUC, and the like. After each evaluation index parameter is calculated, the trained classification model can be evaluated comprehensively according to these evaluation index parameters.
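Several of these evaluation index parameters can be computed with scikit-learn, as in the following sketch; the labels and predicted probabilities are hypothetical placeholder values:

# A few of the evaluation index parameters mentioned above, for one trained model.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 1, 2, 1, 0, 2]   # hypothetical test labels
y_pred = [0, 1, 2, 0, 0, 2]   # hypothetical predicted labels
y_prob = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7],
          [0.5, 0.4, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]  # predicted probabilities

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="macro"),
    "recall":    recall_score(y_true, y_pred, average="macro"),
    "f1":        f1_score(y_true, y_pred, average="macro"),
    "auc_ovr":   roc_auc_score(y_true, y_prob, multi_class="ovr"),  # area under the ROC curve
}
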
By applying the method provided by the embodiments of the present application, the classification model can be trained with the feature data of the plurality of sample objects in the sample training set to obtain a trained classification model with high accuracy, and the trained classification model is tested with the sample test set, which removes the interference of the sample training set, makes the prediction results more objective, and makes the target item combination determined by the first accuracy more accurate.
In one possible implementation, the hyperparameters of the classification model may be determined by a cross-validation method. In this case the sample training set may include n sample training subsets, and the method of the embodiments of the present application further includes the following steps:
Step 1: selecting the i-th sample training subset from the n sample training subsets as a hyperparameter validation subset, the other sample training subsets among the n sample training subsets except the current hyperparameter validation subset being hyperparameter training subsets, where the initial value of i is 1.
Here, n may be determined based on the size of the sample training set. If the sample training set is large, n may be set to 10, 20, or the like; if the sample training set is small, n may be set to 3 or 5. In practical applications, n is typically not less than 3.
Taking n = 3 as an example, there are three sample training subsets n1, n2, n3 in total; the first sample training subset n1 is selected as the hyperparameter validation subset, and the other two sample training subsets n2 and n3 are hyperparameter training subsets.
Step 2: for each group of item combinations, training the classification model under different hyperparameters by using the feature data and classification labels of each sample object in the current hyperparameter training subsets under the group of item combinations, to obtain classification models under the different hyperparameters of the item combination.
Taking the above example, when n is 3 and i is 1, n2 and n3 are input as hyperparameter training subsets into the classification models under different hyperparameters for each item combination, to obtain the classification models under the different hyperparameters of the item combination. To ensure that the optimal hyperparameters are obtained, a cross-validation method and the GridSearchCV (grid search) function may be used for hyperparameter training. In one example, suppose the classification model has 3 sets of hyperparameters p1, p2, and p3. The hyperparameter training subsets are respectively input into the classification models under hyperparameters p1, p2, and p3 for training, to obtain the respective trained classification models.
Step 3: respectively determining the second accuracy of the classification models under the different hyperparameters of the item combination by using the feature data and classification labels of each sample object in the current hyperparameter validation subset under the item combination.
Taking the above example, when n is 3 and i is 1, after the classification models under the different hyperparameters under each item combination are obtained, n1 is input as the hyperparameter validation subset into each of these classification models for testing, and the second accuracy of each classification model is calculated from the feature data and classification labels under the item combination in n1.
Step 4: increasing i by 1 and returning to step 2 until i = n.
Taking n = 3 as an example, after the second accuracy of the classification model under each hyperparameter under each item combination is obtained for i = 1, i is increased by 1, i.e., the second sample training subset n2 is selected as the hyperparameter validation subset, and the other two sample training subsets n1 and n3 are the hyperparameter training subsets; n1 and n3 are input as hyperparameter training subsets into the classification models under different hyperparameters for training, to obtain the classification models under the different hyperparameters of the item combination, n2 is input as the hyperparameter validation subset into these classification models for testing, and the second accuracy of each classification model is calculated from the feature data and classification labels under the item combination in n2. This continues until i = n.
Step 5: for each group of item combinations, respectively calculating the mean value of the second accuracies of the classification model under each hyperparameter of the item combination, and selecting the hyperparameters of the classification model with the highest mean value as the hyperparameters of the classification model under the item combination.
Similarly, when determining the hyperparameters of the classification model, besides taking the hyperparameters of the classification model with the highest mean second accuracy as the hyperparameters of the classification model under the item combination, the evaluation index parameters of the classification models trained with different hyperparameters can also be considered comprehensively: step 5 may alternatively calculate each evaluation index parameter of the classification models trained with the different hyperparameters, and select the hyperparameters of the classification model with the best comprehensive evaluation of these evaluation index parameters as the hyperparameters of the classification model under the item combination. A code sketch of the hyperparameter selection loop of steps 1 to 5 is given below.
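The following is an illustrative sketch of that loop under assumptions: scikit-learn is used, a RandomForest classifier stands in for the classification model, X_train and y_train are NumPy arrays for one item combination, and the candidate hyperparameter sets are hypothetical.

# n-fold cross-validation over candidate hyperparameter sets, selecting the set
# with the highest mean "second accuracy".
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def select_hyperparameters(X_train, y_train, candidate_params, n_splits=3):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_params, best_mean = None, -1.0
    for params in candidate_params:
        second_accuracies = []
        for train_idx, val_idx in skf.split(X_train, y_train):    # steps 1 to 4
            model = RandomForestClassifier(**params, random_state=0)
            model.fit(X_train[train_idx], y_train[train_idx])     # hyperparameter training subsets
            preds = model.predict(X_train[val_idx])               # hyperparameter validation subset
            second_accuracies.append(accuracy_score(y_train[val_idx], preds))
        mean_acc = float(np.mean(second_accuracies))              # step 5: mean second accuracy
        if mean_acc > best_mean:
            best_params, best_mean = params, mean_acc
    return best_params, best_mean

# Hypothetical candidate hyperparameter sets p1, p2, p3:
candidates = [{"n_estimators": 50}, {"n_estimators": 100}, {"n_estimators": 200}]
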
By applying the method provided by the embodiments of the present application, hyperparameter training can be performed on the classification model through the hyperparameter training subsets and the hyperparameter validation subset, so that the hyperparameters of the classification model are determined, the optimal classification model is selected, and the accuracy of determining the target item combination is improved.
In one possible embodiment, the biomarker concentrations of the plurality of biomarker items include at least two of Aβ40 (β-amyloid 40), Aβ42 (β-amyloid 42), P-tau181 (phosphorylated tau protein 181), P-tau217 (phosphorylated tau protein 217), and NfL (neurofilament light chain); the attribute information of the plurality of attribute items includes at least two of gender information, age information, and education level information.
In one example, the sample objects are humans, the biological information is Aβ40, Aβ42, P-tau181, P-tau217, and NfL, and the attribute information is gender information, age information, and education level information. It should be noted that the biological information and attribute information in the embodiments of the present application are obtained with the knowledge and authorization of the target objects. The biological information can be obtained from the blood plasma of the sample subjects by ELISA (enzyme-linked immunosorbent assay).
By applying the method of the embodiments of the present application, biological information including Aβ40, Aβ42, P-tau181, P-tau217, and NfL and attribute information including gender information, age information, and education level information can be processed, analyzed, and permuted and combined, so that the screening of feature data types is completed through the trained classification models and the target item combination affecting the target classification service is determined.
In a possible implementation manner, the method of the embodiment of the present application may further include:
for each group of item combinations, performing numerical conversion on the non-continuous attribute items in the group of item combinations, and/or performing ratio calculation on the biomarker items in the group of item combinations, to obtain preprocessed feature data;
wherein the classification model is trained using the preprocessed feature data.
Before the feature data are input into the classification model, they may first be processed. In one example, the feature data of continuous-variable attribute items represented as numerical values may be normalized, for example using the StandardScaler normalization function. For a non-continuous attribute item such as education level, primary school, junior high school, senior high school, university, and postgraduate may be represented by 1, 2, 3, 4, and 5, respectively. Feature data with only two possible values may be binarized; for example, for the gender feature, 0 may represent female and 1 may represent male.
In addition, in order to improve the accuracy of the classification model, the kinds of feature data may be artificially increased; for example, the ratio between every two items of biological information may be calculated. Artificially adding feature data enriches the feature data of the sample objects, so that the accuracy of the classification model is improved during training. After new feature data are added, the new feature data still need to be preprocessed.
In one example, when the biological information includes the biomarker concentrations of Aβ40, Aβ42, P-tau181, P-tau217, and NfL, the biological information may also include the ratios Aβ40/Aβ42, P-tau181/Aβ42, P-tau217/Aβ42, and NfL/Aβ42.
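A sketch of this preprocessing, under the assumption that the raw feature data sit in a pandas data frame whose column names (Abeta40, Abeta42, P-tau181, P-tau217, NfL, education, gender, label) are hypothetical:

# Ratio features between biomarker concentrations, numerical coding of the
# non-continuous attribute items, then StandardScaler normalization.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # ratio features between biomarker concentrations (added feature data types)
    out["Abeta40/Abeta42"] = out["Abeta40"] / out["Abeta42"]
    out["P-tau181/Abeta42"] = out["P-tau181"] / out["Abeta42"]
    out["P-tau217/Abeta42"] = out["P-tau217"] / out["Abeta42"]
    out["NfL/Abeta42"] = out["NfL"] / out["Abeta42"]
    # numerical conversion of non-continuous attribute items (hypothetical labels)
    education_map = {"primary": 1, "junior high": 2, "senior high": 3,
                     "university": 4, "postgraduate": 5}
    out["education"] = out["education"].map(education_map)
    out["gender"] = (out["gender"] == "male").astype(int)   # binarization: female 0, male 1
    # standardize all feature columns (assumes every non-label column is numeric here)
    numeric_cols = out.columns.drop("label")
    out[numeric_cols] = StandardScaler().fit_transform(out[numeric_cols])
    return out
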
By applying the method provided by the embodiments of the present application, preprocessing the feature data unifies the format of the feature data, which is convenient for the classification model to process. Increasing the kinds of feature data makes the feature data richer, so that the accuracy of the classification model is higher and the accuracy of determining the target item combination is improved.
In one possible embodiment, when the classification service is to distinguish healthy people, mild cognitive impairment patients, and Alzheimer's disease patients, the method for determining a feature combination affecting the classification service may be implemented by the following steps:
Step I: acquiring feature data and label information of three kinds of sample objects: healthy people, mild cognitive impairment patients, and Alzheimer's disease patients.
The feature data of the three kinds of sample objects include biological information and attribute information, and the label information indicates whether a sample object is a healthy person, a mild cognitive impairment patient, or an Alzheimer's disease patient. In one example, the biological information includes 5 items: Aβ40, Aβ42, P-tau181, P-tau217, and NfL; the attribute information includes gender information, age information, and education level information of the sample object.
Step II: and carrying out data processing on the characteristic data of the sample object.
In one example, the biological information in Aβ40, Aβ42, P-tau181, P-tau217, nfL may be compared every two to obtain a new characteristic data category. The sex information and the education level information are subjected to numerical processing, and are converted into numerical representation by text representation. All feature data that are already numerical values are normalized using a normalization function STANDARDSCALER.
Step III: and according to the category number of the processed feature data, arranging and combining the feature data of each group according to the category number to obtain a plurality of groups of item combinations. Wherein each set of item combinations includes at least one characteristic data category.
Step IV: dividing a data set consisting of characteristic data of each sample object into a sample training set and a sample testing set, dividing the sample training set into n sample training subsets, training a classification model by adopting a cross validation method, and determining the super parameters of the classification model.
The classification model comprises three kinds of classifiers, namely a linear classification algorithm, a nonlinear classification algorithm and a multi-layer perceptron. And selecting a hierarchical sampling method, and dividing the sample training set into sample training subsets. The number ratio of the three populations in each sample training subset, healthy people, mild cognitive impairment patients and Alzheimer's disease patients, is the same as the number ratio of the three populations in the sample training subset. Step iv may be implemented by:
step ⑴: and selecting the ith sample training subset from the n sample training subsets as a super-parameter verification subset, and taking the rest sample training subsets as super-parameter training subsets. Wherein the initial value of i is 1.
Step ⑵: and inputting the characteristic data in the super-parameter training subset corresponding to the item combination into the super-parameter classification model for training aiming at each group of item combination and each group of super-parameters to obtain the classification model under the super-parameters corresponding to the group of item combination.
Step ⑶: and calculating a second accuracy of the classification model of each set of item combinations and each set of super parameters.
Step ⑷: and (3) increasing i by 1, and returning to execute the steps ⑵ and ⑶ to obtain the second accuracy of each classification model of each group of item combination under each group of super parameters.
Step ⑸: and calculating the mean value of the second accuracy under each super parameter aiming at each item combination, and taking the super parameter used by the classification model with the highest mean value as the super parameter of the classification model under the item combination to obtain the classification model to be trained under each item combination.
Step V: and aiming at each item combination, inputting data corresponding to the item combination in the sample training set into a classification model to be trained corresponding to the item combination for training, and obtaining a classification model trained under the item combination.
Step VI: and aiming at each item combination, inputting the characteristic data corresponding to the item combination in the sample test set into the classification model trained under the item combination for testing, and obtaining the classification result of the classification model trained under each item combination.
The classification result is that the sample object is a healthy person or a patient with mild cognitive impairment or a patient with Alzheimer's disease. In one example, the classification result may be a confidence that the sample subject is a healthy person, and a confidence that the sample subject is a mild cognitive impairment patient, and a confidence that the sample subject is a Alzheimer's disease patient.
Step VII: according to the label result of the sample object and the classification result of the classification model trained under each item combination, calculating the first accuracy of the classification model trained under each item combination, selecting the item combination corresponding to the classification model with the highest accuracy as a target item combination, wherein the characteristic data in the target item combination is the characteristic data affecting the distinction of healthy people, mild cognitive impairment patients and Alzheimer's disease patients.
By applying the method of the embodiment of the application, a plurality of item combinations can be obtained by arranging and combining the characteristic data in a plurality of biomarker items and a plurality of attribute items which can affect the distinction of healthy people, mild cognitive impairment patients and Alzheimer's disease patients, each item combination is input into a classification model for training, the item combination with the highest accuracy is determined by the accuracy of the trained classification model corresponding to each item combination in the trained classification model, so that the item combination with the highest accuracy is determined as the target item combination, and each characteristic data type in the target item combination is corresponding to the target characteristic data type affecting the distinction of healthy people, mild cognitive impairment patients and Alzheimer's disease patients, the screening of the characteristic data type is completed, and the characteristic data type with larger influence on the distinction of healthy people, mild cognitive impairment patients and Alzheimer's disease patients is obtained.
The following example specifically explains the method of the embodiments of the present application. Before an ordinary person becomes an Alzheimer's disease (AD) patient, there is an intermediate state, MCI (mild cognitive impairment), which has certain similarities to AD in terms of symptoms and biological information; it is therefore necessary to determine which information has a large influence on identifying Alzheimer's disease patients. In one example, the target feature category may be identified by the steps shown in FIG. 2:
step S201: various biological information and attribute information of three populations are collected.
Here, the three populations are the healthy population, MCI patients, and AD patients. The biological information includes the biomarker concentrations Aβ40, Aβ42, P-tau181, P-tau217, and NfL. The attribute information includes various information that may affect the prevalence of AD, for example, gender information, age information, and education level information. In addition, the label information of each sample population needs to be collected.
Step S202: and preprocessing the sample data to obtain processed sample characteristic data.
The ratio of every two biomarker concentrations is calculated to obtain data such as Aβ40/Aβ42, P-tau181/Aβ42, P-tau217/Aβ42, and NfL/Aβ42. The non-numerical information is digitized; for example, the education levels primary school, junior high school, senior high school, university, and postgraduate are represented by 1, 2, 3, 4, and 5, and the gender information is represented by 0 and 1. Further, the label information may be represented by a one-dimensional array; for example, [1, 0, 0] represents the healthy population, [0, 1, 0] represents an MCI patient, and [0, 0, 1] represents an AD patient. After all sample data are digitized, they are normalized using the StandardScaler function.
Step S203: and arranging and combining the sample characteristic data according to the number of types to obtain a plurality of groups of sample characteristic item combinations.
Step S204: and aiming at each sample characteristic item combination, performing super-parametrization by adopting a plurality of classification models, and determining super-parametrics of the classification models.
Three classifiers, for example the linear classification algorithm LogisticRegression, the nonlinear classification algorithm RandomForest, and the neural network MLP, are selected for modeling, yielding three classification models. The sample feature data are split into a sample training set and a sample test set at a ratio of 5:5 by stratified sampling, and the sample training set is divided into several sample training subsets. A certain number of sample training subsets are input into the classification model for hyperparameter training, and the trained classification model is tested on the hyperparameter validation subset formed by the remaining sample training subsets. In one example, the sample training set includes 3 sample training subsets; then the 1st and 2nd, the 1st and 3rd, and the 2nd and 3rd sample training subsets are respectively selected as hyperparameter training subsets and input into the classification model for hyperparameter training, to obtain a trained classification model corresponding to each set of hyperparameters under each sample feature item combination; correspondingly, the 3rd, the 2nd, and the 1st sample training subsets are respectively used as the hyperparameter validation subsets to test each trained classification model.
The hyperparameter search can be performed on the training set through the GridSearchCV function, for example with cv set to 3 and scoring (the index measuring model performance) set to roc_auc_ovr. In one example, after 5-fold cross-validation of each classification model after hyperparameter training, the average AUC value and range of each such classification model are calculated. These steps are performed for each of the three classification models, to obtain the average AUC value and range of the classification models after hyperparameter training of each type. The stability of a classification model is judged from its average AUC value and range, and the hyperparameters of the classification model with the highest stability are selected as the hyperparameters of the classification model.
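A sketch of such a grid search with scikit-learn's GridSearchCV, using cv=3 and scoring='roc_auc_ovr' as described above; the estimator and parameter grid are illustrative assumptions:

# Grid search over hypothetical hyperparameter candidates on the training set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,                      # 3 sample training subsets
    scoring="roc_auc_ovr",     # one-vs-rest AUC as the model performance index
)
# search.fit(X_train, y_train)   # X_train, y_train assumed from the stratified split
# selected hyperparameters: search.best_params_
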
Step S205: training the classification model by using the sample training set, and evaluating the classification model by using the sample testing set to obtain an optimal classification model.
The classification model is likewise trained using cross-validation. The AUC, accuracy, sensitivity, 90% confidence interval of the AUC, and specificity of each classification model are calculated. The 90% confidence interval of the AUC can be obtained by sampling the prediction probabilities 1000 times with the bootstrap method and calculating the upper and lower bounds of the 90% confidence interval. The AUCs of different models can be compared with the DeLong test for the significance of their difference. The optimal classification model is determined by considering the various performance evaluation indexes comprehensively.
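The bootstrap confidence interval described above can be sketched as follows; the 1000 resamples and the 90% interval follow the description, while the function name and its inputs are assumptions for illustration:

# Bootstrap 90% confidence interval of the multiclass (one-vs-rest) AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_ci_90(y_true, y_prob, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < y_prob.shape[1]:
            continue                                      # skip resamples missing a class
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx], multi_class="ovr"))
    return np.percentile(aucs, 5), np.percentile(aucs, 95)  # 90% confidence interval
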
Step S206: and selecting a sample characteristic item combination corresponding to the optimal classification model as a target item combination.
When the optimal classification model predicts a result, it outputs three probabilities, which are the probabilities that the sample feature data represent a healthy person, an MCI patient, and an AD patient, respectively; the class with the largest probability is taken as the prediction result. The probability that an MCI patient is assigned to the AD class can be used as the risk probability of that patient converting from MCI to AD. According to the prediction results and the label information from step S201, the accuracy corresponding to each group of sample feature item combinations is calculated. After the target feature item combination is determined, the feature data categories contained in the target feature item combination are the target feature data categories.
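A minimal sketch of reading the prediction and the MCI-to-AD risk probability from the three class probabilities; the class order and the values are assumptions for illustration (in practice the order would come from the trained model's classes_ attribute):

# Turn the three class probabilities into a prediction and a conversion risk.
import numpy as np

probs = np.array([0.15, 0.55, 0.30])         # hypothetical output for one sample
classes = ["healthy", "MCI", "AD"]           # assumed class order

prediction = classes[int(np.argmax(probs))]  # "MCI": class with the largest probability
mci_to_ad_risk = probs[classes.index("AD")]  # 0.30: risk probability of converting to AD
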
By applying the method of the embodiment of the application, a plurality of item combinations can be obtained by permuting and combining the feature data of the plurality of biomarker items and the plurality of attribute items. The feature data under each item combination is used to train the classification model, and the accuracy of the trained classification model corresponding to each item combination is used to determine the item combination with the highest accuracy, which is taken as the target item combination. Correspondingly, each feature data category in the target item combination is taken as a target feature data category affecting the target classification service, so that the screening of feature data is completed and the feature data categories that greatly affect Alzheimer's disease patients are determined.
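As a hedged sketch of the overall enumeration of item combinations, assuming hypothetical item names and an assumed helper train_and_score that wraps the training and testing steps above:

from itertools import combinations

# Hypothetical item names standing in for the biomarker and attribute items.
biomarker_items = ["Abeta40", "Abeta42", "P_tau181", "P_tau217", "NfL"]
attribute_items = ["gender", "age", "education"]

def enumerate_item_combinations(items, min_size=2):
    """Yield every item combination containing at least min_size items."""
    for k in range(min_size, len(items) + 1):
        yield from combinations(items, k)

# One classification model is trained per item combination; train_and_score is
# an assumed helper wrapping the training and testing steps sketched above.
# target_combination = max(
#     enumerate_item_combinations(biomarker_items + attribute_items),
#     key=train_and_score)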
In a second aspect of the embodiment of the present application, there is provided a feature combination determining apparatus for influencing classified services, the apparatus including a structure as shown in fig. 3:
The information obtaining module 301 is configured to obtain feature data and tag information of a plurality of sample objects, where the feature data includes biological information of a plurality of biomarker items and attribute information of a plurality of attribute items, and the tag information represents a true value of a classification result of the sample objects on a target classification service;
the permutation and combination module 302 is configured to select, in the multiple biomarker items and the multiple attribute items, an item combination in a permutation and combination manner, so as to obtain multiple groups of item combinations;
A classification model acquisition module 303, configured to acquire a classification model to be trained;
The classification model training module 304 is configured to train, for each set of item combinations, the classification model by using feature data of the sample object under the set of item combinations as input of the classification model and tag information of the sample object as a true value of classification model prediction, to obtain a classification model trained under the set of item combinations;
The target feature combination determining module 305 is configured to determine a first accuracy of the classification model trained under each item combination, and use the item combination used for training the classification model with the highest accuracy as the target item combination.
In one possible implementation manner, the device of the embodiment of the present application further includes:
The sample object dividing module is used for dividing a plurality of sample objects into a sample training set and a sample testing set;
A classification model training module, including:
the sample object selecting sub-module is specifically used for selecting a sample object in a sample training set aiming at each group of item combinations, and inputting characteristic data of the currently selected sample object under the group of item combinations into the classification model to obtain a current classification result;
The parameter adjustment sub-module is specifically used for adjusting parameters of the classification model according to the label information of the currently selected sample object and the current classification result to obtain a classification model trained under the set of item combinations;
A target feature combination determination module comprising:
The first accuracy computing sub-module is specifically configured to determine, for each set of item combinations, a first accuracy of a classification model trained under the item combination by using feature data and classification labels of each sample object in the sample test set under the set of item combinations.
In one possible implementation manner, the sample training set includes n sample training subsets, and the apparatus of the embodiment of the present application further includes:
The super-parameter verification subset selecting module is used for selecting an ith sample training subset from n sample training subsets to obtain a super-parameter verification subset, wherein other sample training subsets except the current super-parameter verification subset in the n sample training subsets are super-parameter training subsets, and the initial value of i is 1;
The classification model super-parameter training module is used for training the classification model under different super-parameters by utilizing the characteristic data and the classification labels of each sample object in the current super-parameter training subset under the set of item combinations aiming at each set of item combinations to obtain the classification model under the different super-parameters of the item combinations;
The second accuracy calculation module is used for respectively determining the second accuracy of the classification model under different super-parameters of the item combination by utilizing the feature data and the classification labels of each sample object in the current super-parameter verification subset under the item combination;
The cross verification module is used for increasing i by 1 and returning to the execution step: aiming at each group of item combinations, training the classification models under different super parameters by utilizing characteristic data and classification labels of each sample object in the current super parameter training subset under the group of item combinations to obtain different super parameter training classification models of the item combinations until i=n;
The super-parameter determining module is used for respectively calculating the average value of the second accuracy of the classification model under each super-parameter of each item combination aiming at each group of item combinations, and selecting the super-parameter of the classification model with the highest average value as the super-parameter of the classification model under the item combination.
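A schematic, non-authoritative sketch of the behaviour described by these modules is given below; it assumes that the n sample training subsets are held in a list named subsets of (features, labels) pairs and that param_candidates maps a name to a hyper-parameter setting, both of which are hypothetical:

import numpy as np
from sklearn.base import clone

def select_hyperparams(estimator, param_candidates, subsets):
    """For each hyper-parameter setting, rotate every sample training subset
    in as the super-parameter verification subset, train on the remaining
    subsets, average the second accuracies and return the best setting."""
    mean_second_accuracy = {}
    for setting_name, params in param_candidates.items():
        accuracies = []
        for i in range(len(subsets)):
            X_val, y_val = subsets[i]
            X_tr = np.vstack([s[0] for j, s in enumerate(subsets) if j != i])
            y_tr = np.concatenate([s[1] for j, s in enumerate(subsets) if j != i])
            model = clone(estimator).set_params(**params).fit(X_tr, y_tr)
            accuracies.append(model.score(X_val, y_val))   # second accuracy
        mean_second_accuracy[setting_name] = float(np.mean(accuracies))
    return max(mean_second_accuracy, key=mean_second_accuracy.get)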
In one possible implementation, the classification model includes a first classification model, a second classification model, and a third classification model, where the first classification model uses a linear classification algorithm classifier, the second classification model uses a nonlinear classification algorithm classifier, and the third classification model uses a multi-layer perceptron classifier.
In one possible embodiment, the biomarker concentrations of the plurality of biomarker items comprise at least two of Aβ40, Aβ42, P-tau181, P-tau217 and NfL; the attribute information of the plurality of attribute items includes at least two of gender information, age information, and educational level information.
In one possible implementation manner, the device of the embodiment of the present application further includes:
the data processing module is used for carrying out numerical conversion on the discontinuous attribute items in the group of item combinations and/or carrying out ratio calculation on the biomarker items in the group of item combinations aiming at each group of item combinations to obtain preprocessed characteristic data;
wherein the classification model is trained using the preprocessed feature data.
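A minimal preprocessing sketch along these lines, assuming a pandas DataFrame with hypothetical column names, is shown below; the Aβ42/Aβ40 ratio is only one possible biomarker ratio and is not mandated by the text:

import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Numerically encode a discontinuous attribute item and add a biomarker ratio."""
    out = df.copy()
    # Numerical conversion of a discontinuous attribute item (hypothetical coding).
    out["gender"] = out["gender"].map({"female": 0, "male": 1})
    # Ratio calculation between two biomarker items (Abeta42 / Abeta40 as an example).
    out["Abeta42_40_ratio"] = out["Abeta42"] / out["Abeta40"]
    return out

sample = pd.DataFrame({"gender": ["male", "female"],
                       "Abeta40": [180.0, 150.0],
                       "Abeta42": [45.0, 60.0]})
print(preprocess(sample))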
By using the device of the embodiment of the application, a plurality of item combinations can be obtained by permuting and combining the feature data of the plurality of biomarker items and the plurality of attribute items. The feature data under each item combination is used to train the classification model, and the accuracy of the trained classification model corresponding to each item combination is used to determine the item combination with the highest accuracy, which is taken as the target item combination. Correspondingly, each feature data category in the target item combination is taken as a target feature data category affecting the target classification service, so that the screening of feature data categories is completed, the computation amount of the classification model is further reduced, and the prediction efficiency of the classification model is improved on the premise of ensuring the accuracy of the classification model.
The embodiment of the application also provides an electronic device, as shown in fig. 4, which comprises a processor 401, a communication interface 402, a memory 403 and a communication bus 404, wherein the processor 401, the communication interface 402 and the memory 403 communicate with each other through the communication bus 404;
A memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
Acquiring characteristic data and label information of a plurality of sample objects, wherein the characteristic data comprises biological information of a plurality of biomarker items and attribute information of a plurality of attribute items, and the label information represents true values of classification results of the sample objects on a target classification service;
selecting item combinations in a plurality of biomarker items and a plurality of attribute items in a permutation and combination mode to obtain a plurality of groups of item combinations;
Obtaining a classification model to be trained;
Aiming at each group of item combinations, taking the characteristic data of the sample object under the group of item combinations as the input of a classification model, taking the label information of the sample object as the true value of the classification model prediction, and training the classification model to obtain a classification model trained under the group of item combinations;
And respectively determining the first accuracy of the trained classification model under each item combination, and taking the item combination used for training the classification model with the highest accuracy as a target item combination.
The communication bus mentioned for the above electronic device may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a random access memory (Random Access Memory, RAM), or may include a non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present application, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the above-described feature combination determination methods for influencing classified services.
In yet another embodiment of the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the feature combination determination methods for influencing classified services in the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital subscriber line (Digital Subscriber Line, DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk (Solid State Disk, SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus, electronic devices, computer readable storage medium embodiments, since they are substantially similar to method embodiments, the description is relatively simple, and relevant references are made to the partial description of method embodiments.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (10)

1. A method for determining a combination of features affecting a classified service, the method comprising:
Acquiring characteristic data and label information of a plurality of sample objects, wherein the characteristic data comprises biological information of a plurality of biomarker items and attribute information of a plurality of attribute items, and the label information represents true values of classification results of the sample objects on a target classification service;
Selecting item combinations in the biomarker items and the attribute items in a permutation and combination mode to obtain a plurality of groups of item combinations;
Obtaining a classification model to be trained;
Aiming at each group of item combinations, taking the characteristic data of the sample object under the group of item combinations as the input of a classification model, taking the label information of the sample object as the true value of the classification model prediction, and training the classification model to obtain a classification model trained under the group of item combinations;
And respectively determining the first accuracy of the trained classification model under each item combination, and taking the item combination used for training the classification model with the highest accuracy as a target item combination.
2. The method according to claim 1, wherein the method further comprises:
dividing the plurality of sample objects into a sample training set and a sample testing set;
The feature data of the sample object under the group of item combinations is used as input of a classification model, the label information of the sample object is used as a true value of classification model prediction, and the classification model is trained to obtain a trained classification model under the group of item combinations; comprising the following steps:
Selecting a sample object in the sample training set aiming at each group of item combinations, and inputting characteristic data of the currently selected sample object under the group of item combinations into a classification model to obtain a current classification result;
According to the label information of the currently selected sample object and the current classification result, parameters of the classification model are adjusted to obtain a classification model trained under the set of item combinations;
the step of determining the first accuracy of the trained classification model under each item combination comprises the following steps:
And for each group of item combinations, determining the first accuracy of the classification model trained under the item combination by utilizing the characteristic data and the classification labels of each sample object in the sample test set under the group of item combinations.
3. The method of claim 2, wherein the training set of samples comprises n training subsets of samples, the method further comprising:
Selecting an ith sample training subset from the n sample training subsets to obtain a super-parameter verification subset, wherein other sample training subsets except the current super-parameter verification subset in the n sample training subsets are super-parameter training subsets, and the initial value of i is 1;
aiming at each group of item combinations, training the classification models under different super parameters by utilizing characteristic data and classification labels of each sample object in the current super parameter training subset under the group of item combinations to obtain the classification models under different super parameters of the item combinations;
Respectively determining the second accuracy of the classification model under different super parameters of the item combination by utilizing the characteristic data and the classification labels of each sample object in the current super parameter verification subset under the item combination;
Increasing i by 1, and returning to the execution step: aiming at each group of item combinations, training the classification models under different super parameters by utilizing characteristic data and classification labels of each sample object in the current super parameter training subset under the group of item combinations to obtain different super parameter training classification models of the item combinations until i=n;
And respectively calculating the average value of the second accuracy of the classification model under each super parameter of each item combination aiming at each group of item combination, and selecting the super parameter of the classification model with the highest average value as the super parameter of the classification model under the item combination.
4. The method of claim 1, wherein the classification model comprises a first classification model, a second classification model, and a third classification model, the first classification model employs a linear classification algorithm classifier, the second classification model employs a nonlinear classification algorithm classifier, and the third classification model employs a multi-layer perceptron classifier.
5. The method of claim 1, wherein the biological information of the plurality of biomarker items comprises biomarker concentrations of the plurality of biomarker items, wherein the biomarker concentrations of the plurality of biomarker items comprise at least two of Aβ40, Aβ42, P-tau181, P-tau217 and NfL; the attribute information of the plurality of attribute items includes at least two of gender information, age information, and educational level information.
6. The method according to claim 1, wherein the method further comprises:
For each group of item combinations, carrying out numerical conversion on discontinuous attribute items in the group of item combinations, and/or carrying out ratio calculation on biomarker items in the group of item combinations to obtain preprocessed characteristic data;
wherein the classification model is trained using the preprocessed feature data.
7. A feature combination determination apparatus that affects classified services, the apparatus comprising:
the information acquisition module is used for acquiring characteristic data and label information of a plurality of sample objects, wherein the characteristic data comprises biological information of a plurality of biomarker items and attribute information of a plurality of attribute items, and the label information represents true values of classification results of the sample objects on a target classification service;
the arrangement and combination module is used for selecting item combinations from the biomarker items and the attribute items in an arrangement and combination mode to obtain a plurality of groups of item combinations;
the classification model acquisition module is used for acquiring a classification model to be trained;
The classification model training module is used for aiming at each group of item combinations, taking the characteristic data of the sample object under the group of item combinations as the input of a classification model, taking the label information of the sample object as the true value of the classification model prediction, and training the classification model to obtain a classification model trained under the group of item combinations;
the target feature combination determining module is used for determining the first accuracy of the trained classification model under each item combination respectively, and taking the item combination used for training the classification model with the highest accuracy as the target item combination.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the sample object dividing module is used for dividing the plurality of sample objects into a sample training set and a sample testing set;
the classification model training module comprises:
The sample object selection sub-module is specifically used for selecting a sample object in the sample training set aiming at each group of item combinations, and inputting characteristic data of the currently selected sample object under the group of item combinations into the classification model to obtain a current classification result;
The parameter adjustment sub-module is specifically used for adjusting parameters of the classification model according to the label information of the currently selected sample object and the current classification result to obtain a classification model trained under the set of item combinations;
the target feature combination determination module includes:
The first accuracy computing sub-module is specifically configured to determine, for each set of item combinations, a first accuracy of a classification model trained under the item combination by using feature data and classification labels of each sample object in the sample test set under the set of item combinations.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
A memory for storing a computer program;
A processor for carrying out the method steps of any one of claims 1-6 when executing a program stored on a memory.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.
CN202311790534.4A 2023-12-22 2023-12-22 Feature combination determining method, device, equipment and medium for influencing classified service Pending CN118132973A (en)

Priority Applications (1)

Application Number: CN202311790534.4A
Priority Date / Filing Date: 2023-12-22
Title: Feature combination determining method, device, equipment and medium for influencing classified service

Publications (1)

Publication Number: CN118132973A
Publication Date: 2024-06-04

Family ID: 91243459

Country Status (1)

CN (1) CN118132973A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination