CN111368924A - Unbalanced data classification method based on active learning - Google Patents
- Publication number
- CN111368924A
- Authority
- CN
- China
- Prior art keywords
- samples
- training
- data
- selecting
- uncertainty
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an unbalanced data classification method based on active learning. The method comprises: randomly sampling samples from the original unlabeled data and labeling them for use as initial training data; performing cost-sensitive training on the initial training data with a general machine learning model; predicting all unlabeled samples in the original data with the trained binary supervised classification model and selecting the N most uncertain samples; calculating, for each of the N samples, the sum of Euclidean distances to the center point of the training data set, and selecting M of the N samples in descending order of distance; labeling the selected M samples and adding them to the training data set; performing cost-sensitive training again on the training data set with the general machine learning model; and repeating these steps until the average uncertainty of the selected M samples is less than a set uncertainty threshold, at which point training stops. The method effectively reduces the number of samples that must be labeled while maintaining the performance of the unbalanced data classifier, thereby saving labeling time and labor cost.
Description
Technical Field
The invention relates to the field of machine learning, in particular to an unbalanced data classification method based on active learning.
Background
Unbalanced data is common in practical applications, for example fraudulent versus non-fraudulent transactions in credit card data. Because the classes in such data are unevenly distributed, conventional machine learning models cannot be applied directly to its classification. Approaches to unbalanced data classification mainly comprise resampling methods and cost-sensitive learning methods. Resampling methods are further divided into undersampling, oversampling, SMOTE, and similar techniques.
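As an illustration of the resampling family mentioned above, the following minimal sketch shows random undersampling and random oversampling for a binary data set. The function names and the convention that label 1 marks the positive class are assumptions made for this example; SMOTE, which synthesizes new minority samples by interpolation, is not shown.

```python
import random

def _split_by_size(labels):
    """Return (minority_indices, majority_indices); label 1 is assumed positive."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def random_undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split_by_size(labels)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]

def random_oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until both classes have equal size.

    Assumes both classes are present in the input.
    """
    rng = random.Random(seed)
    minority, majority = _split_by_size(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    return [samples[i] for i in idx], [labels[i] for i in idx]
```

Undersampling discards information from the majority class, while oversampling repeats minority samples; both change the training distribution rather than the training objective, which is what distinguishes them from cost-sensitive learning.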
Most traditional machine learning training is supervised: all samples of the relevant domain data are given together with their labels, a suitable machine learning model is trained on them, and a final classifier is produced. This approach requires every sample to be labeled, which incurs significant time and labor costs. Against this background, researchers have proposed active learning for training classifiers. Current active learning methods fall mainly into the following categories: (1) uncertainty-based sample selection; (2) committee-based sample selection; (3) generalization error reduction.
The K-means algorithm is an iterative clustering algorithm. First, k initial centroids are chosen at random. Then each point in the data set is assigned to a cluster: for each point, the nearest centroid is found and the point is assigned to the cluster of that centroid. After this step is completed, the centroid of each cluster is updated to the mean of all points in the cluster. The assignment and update steps are repeated until the centroids no longer change.
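The assignment and update steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the patent's implementation: the function name `kmeans`, the choice of drawing the initial centroids from the data points, and the iteration cap are assumptions.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Return k centroids for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` returns one centroid per group of nearby points.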
Disclosure of Invention
The invention aims to provide an unbalanced data classification method based on active learning.
The technical solution for realizing the purpose of the invention is as follows: an unbalanced data classification method based on active learning comprises the following steps:
A certain number of samples are randomly selected from the original unlabeled data and labeled for use as initial training data.
Cost-sensitive training is performed on the initial training data with a general machine learning model. Specifically, the proportions of positive and negative samples in the training data set are calculated; the proportion of negative samples is used as the training weight of the positive samples, and the proportion of positive samples is used as the training weight of the negative samples. This yields a binary supervised classification model for the initial training samples. At the same time, the K-means clustering algorithm is used to calculate the center point of the initial training data.
The trained binary supervised classification model predicts all unlabeled samples in the original data, and the N most uncertain samples are selected. For each of the N samples, the sum of Euclidean distances to the center point of the training data set is calculated, and M samples (M < N) are selected from the N samples in descending order of distance. The selected M samples are labeled and added to the training data set, and cost-sensitive training is performed again on the training data set with the general machine learning model.
These steps are repeated until the average uncertainty of the selected M samples is less than a set uncertainty threshold, at which point training stops.
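The two-stage selection described above (N most uncertain samples, then the M of those farthest from the center) can be sketched as follows. The source does not define the uncertainty measure, so the score `1 - |2p - 1|` on the predicted positive-class probability `p` is an assumption, as are the function names and the treatment of the K-means result as a list of one or more center points.

```python
import math

def uncertainty(p_positive):
    """Assumed measure: 1.0 when p = 0.5 (least certain), 0.0 when p is 0 or 1."""
    return 1.0 - abs(2.0 * p_positive - 1.0)

def select_queries(candidates, probabilities, centers, n, m):
    """Return indices of the m samples to label next.

    candidates    -- unlabeled samples as numeric tuples
    probabilities -- predicted positive-class probability for each candidate
    centers       -- center point(s) of the current training set
    """
    # Stage 1: keep the n candidates with the highest uncertainty.
    most_uncertain = sorted(
        range(len(candidates)),
        key=lambda i: uncertainty(probabilities[i]),
        reverse=True,
    )[:n]

    # Stage 2: of those, keep the m with the largest summed distance to the centers.
    def distance_sum(i):
        return sum(math.dist(candidates[i], c) for c in centers)

    return sorted(most_uncertain, key=distance_sum, reverse=True)[:m]
```

Ranking by distance in descending order favors uncertain samples that lie far from the data already labeled, so each labeling round tends to cover regions the training set has not yet reached.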
Compared with the prior art, the invention has the following notable advantage: it effectively reduces the number of samples that must be labeled while maintaining the performance of the unbalanced data classifier, thereby saving labeling time and labor cost.
Drawings
Fig. 1 is a flowchart of an unbalanced data classification method based on active learning according to an embodiment of the present invention.
Detailed Description
The invention can be used in the fields of credit card fraud transaction detection, information security detection and the like.
The invention relates to an unbalanced data classification method based on active learning, which comprises the following steps:
randomly sampling samples from the original unlabeled data and labeling them for use as initial training data, the original unlabeled data comprising credit card transaction data;
performing cost-sensitive training on the initial training data with a general machine learning model;
predicting all unlabeled samples in the original data with the trained binary supervised classification model, and selecting the N most uncertain samples; calculating, for each of the N samples, the sum of Euclidean distances to the center point of the training data set, and selecting M samples from the N samples in descending order of distance, where M is smaller than N; labeling the selected M samples and adding them to the training data set; performing cost-sensitive training again on the training data set with the general machine learning model;
and repeating these steps until the average uncertainty of the selected M samples is less than a set uncertainty threshold, at which point training stops.
The cost-sensitive training on the initial training data with the general machine learning model is performed as follows: the proportions of positive and negative samples in the training data set are calculated; the proportion of negative samples is used as the training weight of the positive samples, and the proportion of positive samples is used as the training weight of the negative samples; a binary supervised classification model for the initial training samples is thereby obtained. At the same time, the K-means clustering algorithm is used to calculate the center point of the initial training data.
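The weighting rule above amounts to swapping the class proportions. A minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative) and using the hypothetical helper names `class_weights` and `sample_weights`:

```python
def class_weights(labels):
    """Weight each class by the proportion of the other class in the training set."""
    n = len(labels)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    return {1: n_neg / n, 0: n_pos / n}

def sample_weights(labels):
    """Expand the class weights into one weight per training sample."""
    weights = class_weights(labels)
    return [weights[y] for y in labels]
```

With 20 positive samples among 1000, each positive sample receives weight 0.98 and each negative sample weight 0.02, so errors on the rare class dominate the training loss. Per-sample weights of this kind can be passed to learners that accept them, for instance through the `sample_weight` argument of `fit` in many scikit-learn estimators.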
The invention is further described below with reference to the figures and examples.
The unbalanced data classification method based on active learning comprises the following steps, in which M denotes the number of most uncertain candidate samples and N the number of samples selected from them for labeling:
101. Select a certain number of unlabeled samples from the original unlabeled data and label them. Add the labeled data to the training data set.
102. Calculate the proportions of positive and negative samples in the training data set; use the proportion of positive samples as the training weight of the negative samples, and the proportion of negative samples as the training weight of the positive samples. Select a suitable machine learning model and train it.
103. Predict the unlabeled data set with the trained classifier, and select the M most uncertain samples in order of uncertainty.
104. Calculate the center point of the training set with the K-means algorithm, calculate for each of the M most uncertain samples the sum of Euclidean distances to the center point, and select N sample points in order of distance for labeling.
105. Update the set of unlabeled samples, add the N labeled samples to the training set, and update the classification model.
106. Predict the unlabeled samples with the trained classifier, select the M most uncertain samples in order of uncertainty, and calculate the average value T of the M uncertainties.
107. Determine whether T is smaller than the uncertainty threshold. If T is smaller than the uncertainty threshold, stop the training process and output the final classifier. If T is greater than the uncertainty threshold, return to step 104 for the next iteration.
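Steps 101 to 107 can be put together in one loop. The sketch below uses only the Python standard library and rests on several assumptions that are not in the source: a weighted logistic regression stands in for the "general machine learning model", uncertainty is scored as `1 - |2p - 1|`, `oracle` is a hypothetical function that returns the true label (the human annotator), and for brevity the stop test on T is applied at the top of every round, including the first.

```python
import math
import random

def predict_proba(theta, x):
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))  # clamp to avoid overflow

def train_weighted_logreg(X, y, epochs=200, lr=0.1):
    """Step 102: each class is weighted by the proportion of the other class."""
    n_pos = sum(y)
    w_pos, w_neg = (len(y) - n_pos) / len(y), n_pos / len(y)
    theta = [0.0] * (len(X[0]) + 1)  # bias followed by one weight per feature
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            scale = (w_pos if yi == 1 else w_neg) * (predict_proba(theta, xi) - yi)
            grad[0] += scale
            for j, v in enumerate(xi):
                grad[j + 1] += scale * v
        theta = [t - lr * g / len(X) for t, g in zip(theta, grad)]
    return theta

def kmeans_centers(points, k, rng, iterations=20):
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[min(range(k), key=lambda j: math.dist(p, centers[j]))].append(p)
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

def uncertainty(p):
    return 1.0 - abs(2.0 * p - 1.0)  # assumed measure, highest at p = 0.5

def active_learning(pool, oracle, n_init=10, m_candidates=20, n_label=5,
                    k=2, threshold=0.2, max_rounds=50, seed=0):
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    X = [unlabeled.pop() for _ in range(n_init)]           # step 101: initial labels
    y = [oracle(x) for x in X]
    theta = train_weighted_logreg(X, y)                     # step 102
    for _ in range(max_rounds):
        if not unlabeled:
            break
        # Steps 103/106: the M most uncertain samples and their mean uncertainty T.
        ranked = sorted(unlabeled, reverse=True,
                        key=lambda x: uncertainty(predict_proba(theta, x)))[:m_candidates]
        t = sum(uncertainty(predict_proba(theta, x)) for x in ranked) / len(ranked)
        if t < threshold:                                   # step 107: stop test
            break
        centers = kmeans_centers(X, min(k, len(X)), rng)    # step 104
        ranked.sort(key=lambda x: sum(math.dist(x, c) for c in centers), reverse=True)
        for x in ranked[:n_label]:                          # step 105: label and add
            unlabeled.remove(x)
            X.append(x)
            y.append(oracle(x))
        theta = train_weighted_logreg(X, y)                 # step 105: update the model
    return theta, X, y
```

The returned `X` and `y` hold only the samples that were actually labeled, so comparing `len(X)` with the pool size shows how much labeling effort a given threshold saves on a particular data set.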
Claims (4)
1. An unbalanced data classification method based on active learning, characterized by comprising:
randomly sampling samples from the original unlabeled data and labeling them for use as initial training data;
performing cost-sensitive training on the initial training data with a general machine learning model;
predicting all unlabeled samples in the original data with the trained binary supervised classification model, and selecting the N most uncertain samples; calculating, for each of the N samples, the sum of Euclidean distances to the center point of the training data set, and selecting M samples from the N samples in descending order of distance, where M is smaller than N; labeling the selected M samples and adding them to the training data set; performing cost-sensitive training again on the training data set with the general machine learning model;
and repeating these steps until the average uncertainty of the selected M samples is less than a set uncertainty threshold, at which point training stops.
2. The unbalanced data classification method based on active learning according to claim 1, wherein the original unlabeled data comprises credit card transaction data.
3. The unbalanced data classification method based on active learning according to claim 1 or 2, wherein the cost-sensitive training on the initial training data with the general machine learning model is performed as follows: the proportions of positive and negative samples in the training data set are calculated; the proportion of negative samples is used as the training weight of the positive samples, and the proportion of positive samples is used as the training weight of the negative samples; a binary supervised classification model for the initial training samples is thereby obtained; at the same time, the K-means clustering algorithm is used to calculate the center point of the initial training data.
4. The unbalanced data classification method based on active learning according to claim 1, comprising the following steps:
101. selecting a certain number of unlabeled samples from the original unlabeled data and labeling them; adding the labeled data to the training data set;
102. calculating the proportions of positive and negative samples in the training data set, using the proportion of positive samples as the training weight of the negative samples, and using the proportion of negative samples as the training weight of the positive samples; selecting a machine learning model and training it;
103. predicting the unlabeled data set with the trained classifier, and selecting the M most uncertain samples in order of uncertainty;
104. calculating the center point of the training set with the K-means algorithm, calculating for each of the M most uncertain samples the sum of Euclidean distances to the center point, and selecting N sample points in order of distance for labeling;
105. updating the set of unlabeled samples, adding the N labeled samples to the training set, and updating the classification model;
106. predicting the unlabeled samples with the trained classifier, selecting the M most uncertain samples in order of uncertainty, and calculating the average value T of the M uncertainties;
107. determining whether T is smaller than the uncertainty threshold; if T is smaller than the uncertainty threshold, stopping the training process and outputting the final classifier; if T is greater than the uncertainty threshold, returning to step 104 for the next iteration.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010148859.2A CN111368924A (en) | 2020-03-05 | 2020-03-05 | Unbalanced data classification method based on active learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010148859.2A CN111368924A (en) | 2020-03-05 | 2020-03-05 | Unbalanced data classification method based on active learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111368924A (en) | 2020-07-03 |
Family
ID=71211668
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010148859.2A Withdrawn CN111368924A (en) | 2020-03-05 | 2020-03-05 | Unbalanced data classification method based on active learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111368924A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112036491A (en) * | 2020-09-01 | 2020-12-04 | 北京推想科技有限公司 | Method and device for determining training sample and method for training deep learning model |
CN112257767B (en) * | 2020-10-16 | 2023-03-17 | 浙江大学 | Product key part state classification method aiming at class imbalance data |
CN112257767A (en) * | 2020-10-16 | 2021-01-22 | 浙江大学 | Product key part state classification method aiming at class imbalance data |
CN112308139A (en) * | 2020-10-29 | 2021-02-02 | 中国科学院计算技术研究所厦门数据智能研究院 | Sample labeling method based on active learning |
CN112308139B (en) * | 2020-10-29 | 2024-03-22 | 中科(厦门)数据智能研究院 | Sample labeling method based on active learning |
CN112527670A (en) * | 2020-12-18 | 2021-03-19 | 武汉理工大学 | Method for predicting software aging defects in project based on Active Learning |
CN112422590A (en) * | 2021-01-25 | 2021-02-26 | 中国人民解放军国防科技大学 | Network traffic classification method and device based on active learning |
CN112785585B (en) * | 2021-02-03 | 2023-07-28 | 腾讯科技(深圳)有限公司 | Training method and device for image video quality evaluation model based on active learning |
CN112785585A (en) * | 2021-02-03 | 2021-05-11 | 腾讯科技(深圳)有限公司 | Active learning-based training method and device for image video quality evaluation model |
CN113469251A (en) * | 2021-07-02 | 2021-10-01 | 南京邮电大学 | Method for classifying unbalanced data |
CN113469251B (en) * | 2021-07-02 | 2024-07-26 | 南京邮电大学 | Method for classifying unbalanced data |
CN113537630A (en) * | 2021-08-04 | 2021-10-22 | 支付宝(杭州)信息技术有限公司 | Training method and device of business prediction model |
CN113537630B (en) * | 2021-08-04 | 2024-06-14 | 支付宝(杭州)信息技术有限公司 | Training method and device of business prediction model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111368924A (en) | Unbalanced data classification method based on active learning | |
CN111783844B (en) | Deep learning-based target detection model training method, device and storage medium | |
CN111343147B (en) | Network attack detection device and method based on deep learning | |
CN111325248A (en) | Method and system for reducing pre-loan business risk | |
CN108388929A (en) | Client segmentation method and device based on cost-sensitive and semisupervised classification | |
CN111667135B (en) | Load structure analysis method based on typical feature extraction | |
CN112381248A (en) | Power distribution network fault diagnosis method based on deep feature clustering and LSTM | |
CN111796957A (en) | Transaction abnormal root cause analysis method and system based on application log | |
CN112950347B (en) | Resource data processing optimization method and device, storage medium and terminal | |
CN112990294A (en) | Training method and device of behavior discrimination model, electronic equipment and storage medium | |
CN111598187A (en) | Progressive integrated classification method based on kernel width learning system | |
CN114386856A (en) | Method, device and equipment for identifying empty-shell enterprise and computer storage medium | |
CN111582315B (en) | Sample data processing method and device and electronic equipment | |
CN115081025A (en) | Sensitive data management method and device based on digital middlebox and electronic equipment | |
CN111273911A (en) | Software technology debt identification method based on bidirectional LSTM and attention mechanism | |
CN111160526B (en) | Online testing method and device for deep learning system based on MAPE-D annular structure | |
CN113468338A (en) | Big data analysis method for digital cloud service and big data server | |
CN111582313B (en) | Sample data generation method and device and electronic equipment | |
CN110807159B (en) | Data marking method and device, storage medium and electronic equipment | |
CN116401606A (en) | Fraud identification method, device, equipment and medium | |
CN110737812A (en) | search engine user satisfaction evaluation method integrating semi-supervised learning and active learning | |
CN115936003A (en) | Software function point duplicate checking method, device, equipment and medium based on neural network | |
CN115830371A (en) | Deep learning-based rail transit subway steering frame rod member classification detection method | |
CN112580505B (en) | Method and device for identifying network point switch door state, electronic equipment and storage medium | |
CN111798237B (en) | Abnormal transaction diagnosis method and system based on application log |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20200703 |