CN111368924A

CN111368924A - Unbalanced data classification method based on active learning

Info

Publication number: CN111368924A
Application number: CN202010148859.2A
Authority: CN
Inventors: 张静; 董怀龙
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2020-03-05
Filing date: 2020-03-05
Publication date: 2020-07-03

Abstract

The invention discloses an unbalanced data classification method based on active learning. The method comprises: randomly sampling the original unlabeled data and labeling the selected samples to form the initial training data; training a general machine learning model on the initial training data with cost-sensitive learning; using the trained binary supervised classification model to predict all unlabeled samples in the original data and selecting the N most uncertain samples; computing, for each of the N samples, the sum of Euclidean distances to the center point of the training data set, and selecting M of the N samples in descending order of distance; labeling the selected M samples and adding them to the training data set; retraining the general machine learning model on the updated training data set with cost-sensitive learning; and repeating these steps until the average uncertainty of the selected M samples falls below a set uncertainty threshold, at which point training stops. The method can effectively reduce the number of labeled samples while maintaining the performance of the unbalanced data classifier, thereby saving labeling time and labor cost.

Description

Unbalanced data classification method based on active learning

Technical Field

The invention relates to the field of machine learning, in particular to an unbalanced data classification method based on active learning.

Background

Unbalanced data is common in practical applications, for example fraudulent versus non-fraudulent transactions in credit card processing. Because the classes in such data are unevenly distributed, conventional machine learning models cannot be applied directly to classify it. The main approaches to unbalanced data classification are resampling and cost-sensitive learning. Resampling methods are further divided into undersampling, oversampling, and SMOTE.

Most traditional machine learning is supervised: all samples of the relevant domain and their labels are given, a suitable machine learning model is trained on them, and a final classifier is produced. This approach requires every sample to be labeled, which incurs significant time and labor costs. For this reason, researchers have proposed active learning for training classifiers. Current active learning methods fall mainly into three categories: (1) uncertainty-based sample selection; (2) committee-based sample selection; and (3) generalization-error reduction.

The K-means algorithm is an iterative clustering algorithm. First, k initial centroids are chosen at random. Then each point in the data set is assigned to a cluster: the centroid closest to the point is found, and the point is assigned to the cluster of that centroid. Once every point has been assigned, the centroid of each cluster is updated to the mean of all points in that cluster. The assignment and update steps are repeated until the centroids stop changing.
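The K-means procedure described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not code from the patent: the function name `kmeans`, the choice of k data points as initial centroids, the iteration cap, and the handling of empty clusters are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means on a list of equal-length coordinate tuples (illustrative)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids (assumption).
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        # Stop as soon as an update leaves every centroid unchanged.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

With k = 1 the result is simply the mean of all points, which is one way to read the single "center point" the method refers to.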

Disclosure of Invention

The invention aims to provide an unbalanced data classification method based on active learning.

The technical solution for realizing the purpose of the invention is as follows: an unbalanced data classification method based on active learning comprises the following steps:

A certain number of samples are selected from the original unlabeled data by random sampling and labeled to serve as the initial training data.

A general machine learning model is trained on the initial training data with cost-sensitive learning. Specifically, the proportions of positive and negative samples in the training data set are calculated; the proportion of negative samples is used as the training weight of the positive samples, and the proportion of positive samples is used as the training weight of the negative samples. This yields a binary supervised classification model for the initial training samples. At the same time, the K-means clustering algorithm is used to calculate the center point of the initial training data.
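The weighting rule in this step can be illustrated with a short sketch. The function name `inverse_proportion_weights` and the 1/0 label encoding (1 for positive, 0 for negative) are assumptions for the example; the patent does not prescribe an encoding or an API.

```python
def inverse_proportion_weights(labels):
    """Return {label: weight}, weighting each class by the other class's share.

    Labels are assumed to be 1 (positive) or 0 (negative).
    """
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    pos_share = sum(1 for y in labels if y == 1) / n
    neg_share = sum(1 for y in labels if y == 0) / n
    # Positives receive the negative share and negatives the positive share,
    # so the rarer class carries the larger weight during training.
    return {1: neg_share, 0: pos_share}


weights = inverse_proportion_weights([1] + [0] * 9)  # {1: 0.9, 0: 0.1}
```

With one positive among ten samples, a misclassified positive costs nine times as much as a misclassified negative, which offsets the class imbalance.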

The trained binary supervised classification model is used to predict all unlabeled samples in the original data, and the N most uncertain samples are selected according to their uncertainty. For each of the N samples, the sum of Euclidean distances to the center point of the training data set is calculated, and M samples are selected from the N samples in descending order of distance, where M is smaller than N. The selected M samples are labeled and added to the training data set. The general machine learning model is then retrained on the updated training data set with cost-sensitive learning.
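The two-stage selection in this step might look like the sketch below. The patent does not define the uncertainty measure, so the sketch assumes `1 - |2p - 1|`, where p is the predicted probability of the positive class; this peaks when p = 0.5. The function name `select_queries`, its parameters, and the reading of "sum of Euclidean distances" as a sum over one or more center points are likewise assumptions.

```python
import math


def select_queries(probs, features, centers, n, m):
    """Pick m samples to label: most uncertain first, then farthest from the centers.

    probs    -- predicted positive-class probability per unlabeled sample
    features -- coordinate tuple per unlabeled sample
    centers  -- list of center-point tuples of the current training set
    """
    # Assumed uncertainty measure: 1.0 at p = 0.5, falling to 0.0 at p = 0 or 1.
    uncertainty = [1.0 - abs(2.0 * p - 1.0) for p in probs]
    # Stage 1: keep the n most uncertain unlabeled samples.
    ranked = sorted(range(len(probs)), key=lambda i: uncertainty[i], reverse=True)
    most_uncertain = ranked[:n]

    # Stage 2: among those, keep the m with the largest summed distance to the centers.
    def summed_distance(i):
        return sum(math.dist(features[i], c) for c in centers)

    chosen = sorted(most_uncertain, key=summed_distance, reverse=True)[:m]
    return chosen, [uncertainty[i] for i in chosen]
```

Preferring uncertain samples that lie far from the existing training data favors regions the labeled set does not yet cover.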

These steps are repeated iteratively until the average uncertainty of the selected M samples falls below a set uncertainty threshold, at which point training stops.

Compared with the prior art, the invention has the following notable advantage: it can effectively reduce the number of labeled samples while maintaining the performance of the unbalanced data classifier, thereby saving labeling time and labor cost.

Drawings

Fig. 1 is a flowchart of an unbalanced data classification method based on active learning according to an embodiment of the present invention.

Detailed Description

The invention can be used in the fields of credit card fraud transaction detection, information security detection and the like.

The invention relates to an unbalanced data classification method based on active learning, which comprises the following steps:

randomly sampling the original unlabeled data and labeling the selected samples to serve as the initial training data; the original unlabeled data comprises credit card transaction data;

training a general machine learning model on the initial training data with cost-sensitive learning;

using the trained binary supervised classification model to predict all unlabeled samples in the original data, and selecting the N most uncertain samples according to their uncertainty; calculating, for each of the N samples, the sum of Euclidean distances to the center point of the training data set, and selecting M samples from the N samples in descending order of distance, where M is smaller than N; labeling the selected M samples and adding them to the training data set; retraining the general machine learning model on the updated training data set with cost-sensitive learning;

The cost-sensitive training of the general machine learning model on the initial training data proceeds as follows: the proportions of positive and negative samples in the training data set are calculated; the proportion of negative samples is used as the training weight of the positive samples, and the proportion of positive samples is used as the training weight of the negative samples, which yields a binary supervised classification model for the initial training samples. At the same time, the K-means clustering algorithm is used to calculate the center point of the initial training data.

The invention is further described below with reference to the figures and examples.

The unbalanced data classification method based on active learning comprises the following steps:

101. Select a certain number of unlabeled samples from the original unlabeled data and label them. Add the labeled data to the training data set.

102. Calculate the proportions of positive and negative samples in the training data set; use the proportion of positive samples as the training weight of the negative samples, and the proportion of negative samples as the training weight of the positive samples. Select a suitable machine learning model and train it.

103. Use the trained classifier to predict the unlabeled data set, rank the samples by uncertainty, and select the M most uncertain samples.

104. Calculate the center point of the training set with the K-means algorithm, calculate the sum of Euclidean distances between each of the M most uncertain samples and the center point, and select N sample points in descending order of distance for labeling.

105. Update the set of unlabeled samples, add the N newly labeled samples to the training set, and update the classification model.

106. Use the updated classifier to predict the unlabeled samples, rank them by uncertainty, select the M most uncertain samples, and calculate the average T of their M uncertainty values.

107. Determine whether T is smaller than the uncertainty threshold. If T is smaller than the threshold, training stops and the final classifier is output. If T is greater than the threshold, return to step 104 for the next iteration.
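Steps 101 to 107 can be put together in a short, self-contained sketch. It is only one possible reading of the steps, under several assumptions that are not in the patent: the "general machine learning model" is replaced by a small weighted logistic regression; uncertainty is taken as `1 - |2p - 1|`; the center point is the mean of the labeled set (K-means with k = 1); and `oracle` is a hypothetical callback standing in for the human labeler. Here `pool_size` plays the role of M and `batch_size` the role of N in the numbered steps.

```python
import math
import random


def predict(model, x):
    """Probability of the positive class under a logistic model (w, b)."""
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=200, lr=0.5):
    """Step 102: weighted logistic regression; each class is weighted by the other's share."""
    pos_share = sum(y) / len(y)
    class_weight = {1: 1.0 - pos_share, 0: pos_share}
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = class_weight[yi] * (predict((w, b), xi) - yi)  # weighted log-loss gradient
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


def most_uncertain(model, X, unlabeled, pool_size):
    """Steps 103/106: the pool_size most uncertain indices and their mean uncertainty T."""
    if not unlabeled:
        return [], 0.0
    unc = {i: 1.0 - abs(2.0 * predict(model, X[i]) - 1.0) for i in unlabeled}
    pool = sorted(unlabeled, key=unc.get, reverse=True)[:pool_size]
    return pool, sum(unc[i] for i in pool) / len(pool)


def active_learning(X, oracle, init_size, pool_size, batch_size, threshold, seed=0):
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    labeled, unlabeled = order[:init_size], order[init_size:]  # step 101
    labels = {i: oracle(i) for i in labeled}
    model = train([X[i] for i in labeled], [labels[i] for i in labeled])  # step 102
    pool, _ = most_uncertain(model, X, unlabeled, pool_size)  # step 103
    while pool:
        # Step 104: center of the labeled set, then the farthest pool members.
        dim = len(X[0])
        center = tuple(sum(X[i][d] for i in labeled) / len(labeled) for d in range(dim))
        batch = sorted(pool, key=lambda i: math.dist(X[i], center), reverse=True)[:batch_size]
        # Step 105: query labels, grow the training set, retrain.
        for i in batch:
            labels[i] = oracle(i)
        labeled = labeled + batch
        chosen = set(batch)
        unlabeled = [i for i in unlabeled if i not in chosen]
        model = train([X[i] for i in labeled], [labels[i] for i in labeled])
        # Steps 106-107: re-rank and stop once the mean uncertainty T is below the threshold.
        pool, t = most_uncertain(model, X, unlabeled, pool_size)
        if t < threshold:
            break
    return model, labeled
```

A call such as `active_learning(X, lambda i: y[i], init_size=10, pool_size=8, batch_size=4, threshold=0.3)` returns the final model together with the indices that were labeled. Feature vectors are expected as tuples of floats. The sketch also stops when no unlabeled samples remain, a case the numbered steps do not address.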

Claims

1. An unbalanced data classification method based on active learning is characterized in that:

randomly sampling and selecting samples from original label-free data to label the samples as initial training data;

2. The unbalanced data classification method based on active learning of claim 1, wherein: the original unlabeled data includes credit card transaction data.

3. The unbalanced data classification method based on active learning according to claim 1 or 2, wherein the cost-sensitive training of the general machine learning model on the initial training data proceeds as follows: calculating the proportions of positive and negative samples in the training data set, using the proportion of negative samples as the training weight of the positive samples and the proportion of positive samples as the training weight of the negative samples, and obtaining a binary supervised classification model for the initial training samples; at the same time, using the K-means clustering algorithm to calculate the center point of the initial training data.

4. The method for classifying unbalanced data based on active learning according to claim 1, comprising the following steps:

101. selecting a certain number of unlabelled samples from original unlabelled data, and labeling; adding the marked data into a training data set;

102. calculating the quantity proportion of positive and negative samples in the training data set, taking the proportion of the positive samples as the weight of the negative samples during training, and taking the proportion of the negative samples as the weight of the positive samples; selecting a machine learning model for training;

103. predicting the unlabeled data set according to the trained classifier, and selecting M most uncertain samples according to the uncertainty sequence;

104. calculating the center point of the training set according to the K-means algorithm, calculating the sum of Euclidean distances between each of the M most uncertain samples and the center point, and selecting N sample points in order of distance for labeling;

105. updating the unlabeled data samples, adding the labeled N samples into a training set, and updating the classification model;

106. predicting the unlabeled data samples according to the trained classifier, selecting M most uncertain samples according to the uncertainty sequence, and calculating the average value T of the M uncertainties;

107. judging whether T is smaller than an uncertainty threshold value; if T is smaller than the uncertainty threshold, stopping the training process and outputting a final classifier; if T is greater than the uncertainty threshold, return to 104 for the next iteration.