CN111368924A - Unbalanced data classification method based on active learning - Google Patents

Unbalanced data classification method based on active learning

Publication number
CN111368924A
Authority
CN
China
Prior art keywords
samples
training
data
selecting
uncertainty
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010148859.2A
Other languages
Chinese (zh)
Inventor
张静
董怀龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-03-05
Filing date
2020-03-05
Publication date
2020-07-03
Application filed by Nanjing University of Science and Technology
Priority to CN202010148859.2A
Publication of CN111368924A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unbalanced data classification method based on active learning. Samples are randomly selected from the original unlabeled data and labeled to form the initial training data. A general-purpose machine learning model is trained on the initial training data with cost-sensitive learning. The trained binary supervised classification model predicts all unlabeled samples in the original data, and the N most uncertain samples are selected according to their uncertainty. For each of the N samples, the sum of Euclidean distances to the center point of the training data set is calculated, and M samples are selected from the N in descending order of distance. The selected M samples are labeled and added to the training data set, and the general-purpose machine learning model is retrained on the updated training data set with cost-sensitive learning. These steps are repeated until the average uncertainty of the selected M samples falls below a set uncertainty threshold, at which point training stops. The method effectively reduces the number of samples that must be labeled while preserving the performance of the unbalanced data classifier, thereby saving labeling time and labor cost.

Description

Unbalanced data classification method based on active learning
Technical Field
The invention relates to the field of machine learning, in particular to an unbalanced data classification method based on active learning.
Background
Unbalanced data is common in practical applications; one example is fraudulent versus non-fraudulent transactions in credit card data. Because the classes in such data are unevenly distributed, conventional machine learning models cannot be applied directly to classifying it. The main approaches to unbalanced data classification are resampling and cost-sensitive learning. Resampling is further divided into methods such as undersampling, oversampling, and SMOTE.
Most traditional machine learning is supervised: all samples from the relevant domain are provided together with their labels, a suitable machine learning model is trained on them, and a final classifier is produced. This approach requires every sample to be labeled, which incurs significant time and labor costs. To address this, researchers have proposed training classifiers with active learning. Current active learning methods fall mainly into the following categories: (1) uncertainty-based sample selection; (2) committee-based sample selection; (3) generalization error reduction.
The K-means algorithm is an iterative clustering algorithm. First, k initial centroids are chosen at random. Then each point in the data set is assigned to a cluster: for each point, the nearest centroid is found and the point is assigned to that centroid's cluster. After this step, the centroid of each cluster is updated to the mean of all points in the cluster. The assignment and update steps are repeated until the centroids no longer change.
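The K-means procedure described above can be sketched in a few lines of standard-library Python. This is only an illustration of the textbook algorithm, not code from the patent; the function name `kmeans`, the `iterations` cap, and the `seed` parameter are assumptions introduced here.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns the k centroids."""
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = updated
    return centroids
```

For two well-separated groups of points, `kmeans(points, 2)` should return one centroid near the mean of each group.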
Disclosure of Invention
The invention aims to provide an unbalanced data classification method based on active learning.
The technical solution for realizing the purpose of the invention is as follows: an unbalanced data classification method based on active learning comprises the following steps:
Randomly sample a certain number of samples from the original unlabeled data and label them to serve as the initial training data.
Perform cost-sensitive learning on the initial training data using a general-purpose machine learning model. Specifically, calculate the proportions of positive and negative samples in the training data set, use the proportion of negative samples as the training weight of the positive samples and the proportion of positive samples as the training weight of the negative samples, and obtain a binary supervised classification model from the initial training samples. At the same time, use the K-means clustering algorithm to calculate the center point of the initial training data.
Use the trained binary supervised classification model to predict all unlabeled samples in the original data, and select the N most uncertain samples according to their uncertainty. For each of the N samples, calculate the sum of Euclidean distances to the center point of the training data set, and select M samples from the N in descending order of distance, where M is smaller than N. Label the selected M samples and add them to the training data set. Retrain the general-purpose machine learning model on the updated training data set with cost-sensitive learning.
Repeat the above steps iteratively until the average uncertainty of the selected M samples is less than a set uncertainty threshold, then stop training.
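The two-stage sample selection in the steps above (N most uncertain, then the M farthest from the center) might look like the following sketch. The text does not define how uncertainty is measured, so the `uncertainty` function below, which scores a predicted probability by its closeness to 0.5, is an assumption. The names `select_queries`, `pool`, `probs`, and `centers` are also hypothetical, and `centers` is a list so that the sketch covers both a single center point and several K-means centroids.

```python
import math


def uncertainty(p_positive):
    """Assumed measure: 1.0 at p = 0.5 (least confident), 0.0 at p = 0 or 1."""
    return 1.0 - 2.0 * abs(p_positive - 0.5)


def select_queries(pool, probs, centers, n, m):
    """Return indices into `pool` of the m samples to label next.

    pool    : list of feature tuples (unlabeled samples)
    probs   : predicted positive-class probability for each pool sample
    centers : center points of the current training set (e.g. from K-means)
    """
    if not m < n:
        raise ValueError("the method requires M < N")
    # Stage 1: keep the N samples the classifier is least sure about.
    by_uncertainty = sorted(
        range(len(pool)), key=lambda i: uncertainty(probs[i]), reverse=True
    )
    candidates = by_uncertainty[:n]

    # Stage 2: rank those N by summed Euclidean distance to the centers,
    # largest first, and keep the top M.
    def spread(i):
        return sum(math.dist(pool[i], c) for c in centers)

    return sorted(candidates, key=spread, reverse=True)[:m]
```

Preferring distant candidates presumably steers labeling effort toward regions the training set does not yet cover, though the text does not state the rationale explicitly.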
Compared with the prior art, the invention has the following notable advantage: it effectively reduces the number of samples that must be labeled while preserving the performance of the unbalanced data classifier, thereby saving labeling time and labor cost.
Drawings
Fig. 1 is a flowchart of an unbalanced data classification method based on active learning according to an embodiment of the present invention.
Detailed Description
The invention can be applied in fields such as credit card fraud detection and information security detection.
The invention relates to an unbalanced data classification method based on active learning, which comprises the following steps:
randomly sampling samples from the original unlabeled data and labeling them as initial training data, where the original unlabeled data comprises credit card transaction data;
performing cost-sensitive learning on the initial training data using a general-purpose machine learning model;
predicting all unlabeled samples in the original data using the trained binary supervised classification model, and selecting the N most uncertain samples according to their uncertainty; calculating, for each of the N samples, the sum of Euclidean distances to the center point of the training data set, and selecting M samples from the N in descending order of distance, where M is smaller than N; labeling the selected M samples and adding them to the training data set; retraining the general-purpose machine learning model on the updated training data set with cost-sensitive learning;
and repeating the above steps iteratively until the average uncertainty of the selected M samples is less than a set uncertainty threshold, then stopping training.
The cost-sensitive learning on the initial training data proceeds as follows: calculate the proportions of positive and negative samples in the training data set, use the proportion of negative samples as the training weight of the positive samples and the proportion of positive samples as the training weight of the negative samples, and obtain a binary supervised classification model from the initial training samples; at the same time, use the K-means clustering algorithm to calculate the center point of the initial training data.
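The weighting rule above swaps the class proportions, so the rarer class receives the larger weight. A minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the function names `class_weights` and `sample_weights` are introduced here for illustration only.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class by the share of the *other* class (labels are 0 or 1)."""
    counts = Counter(labels)
    total = len(labels)
    positive_share = counts[1] / total
    negative_share = counts[0] / total
    # Positives are weighted by the negative share and vice versa,
    # so the rarer class receives the larger weight.
    return {1: negative_share, 0: positive_share}


def sample_weights(labels):
    """Per-sample weights, the form many training routines accept."""
    weights = class_weights(labels)
    return [weights[y] for y in labels]
```

With one positive among ten samples, each positive sample is weighted 0.9 and each negative 0.1. If the training set contains only one class, every sample receives weight 0 under this rule, so the initial random sample presumably needs to contain both classes.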
The invention is further described below with reference to the figures and examples.
The unbalanced data classification method based on active learning comprises the following steps. In this embodiment, M denotes the number of most uncertain candidate samples and N the number of samples chosen from them for labeling, which reverses the roles that M and N play in the Disclosure of Invention above.
101. Select a certain number of unlabeled samples from the original unlabeled data and label them. Add the labeled data to the training data set.
102. Calculate the proportions of positive and negative samples in the training data set; use the proportion of positive samples as the training weight of the negative samples and the proportion of negative samples as the training weight of the positive samples. Select a suitable machine learning model and train it.
103. Predict the unlabeled data set with the trained classifier and select the M most uncertain samples, ranked by uncertainty.
104. Calculate the center point of the training set with the K-means algorithm, calculate for each of the M most uncertain samples the sum of Euclidean distances to the center point, and select N sample points in order of distance for labeling.
105. Update the set of unlabeled data samples, add the N labeled samples to the training set, and update the classification model.
106. Predict the unlabeled data samples with the updated classifier, select the M most uncertain samples ranked by uncertainty, and calculate the mean T of the M uncertainties.
107. Determine whether T is smaller than the uncertainty threshold. If T is smaller than the uncertainty threshold, stop training and output the final classifier. If T is larger than the uncertainty threshold, return to step 104 for the next iteration.
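Steps 101 to 107 can be assembled into a single loop, sketched below under several assumptions. The classifier, the annotator, and the center-point computation are passed in as the hypothetical callables `fit`, `oracle`, and `centers_of`, because the text leaves the model open. Uncertainty is assumed to be closeness of the predicted probability to 0.5, and "in order of distance" in step 104 is read as largest distance first, as in the Disclosure of Invention. Here `m` is the candidate count and `n` the number labeled per round, following this embodiment's notation.

```python
import math
import random


def active_learning_loop(pool, oracle, fit, centers_of, m, n, threshold,
                         initial=10, seed=0):
    """Steps 101-107. All callables are hypothetical stand-ins:

    oracle(x)            -> true label 0/1 (the human annotator)
    fit(xs, ys, weights) -> function mapping x to P(label = 1)
    centers_of(xs)       -> list of center points (e.g. K-means centroids)
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    # 101: label a random initial batch.
    xs, unlabeled = unlabeled[:initial], unlabeled[initial:]
    ys = [oracle(x) for x in xs]

    def train():
        # 102: each class is weighted by the share of the other class.
        pos = sum(ys) / len(ys)
        return fit(xs, ys, [1.0 - pos if y == 1 else pos for y in ys])

    def most_uncertain(model):
        # 103 / 106: the M pool samples whose prediction is closest to 0.5,
        # together with their mean uncertainty T.
        scored = sorted(((1.0 - 2.0 * abs(model(x) - 0.5), x) for x in unlabeled),
                        reverse=True)[:m]
        if not scored:
            return [], 0.0
        return [x for _, x in scored], sum(u for u, _ in scored) / len(scored)

    model = train()
    candidates, _ = most_uncertain(model)            # 103
    while candidates:
        # 104: keep the N candidates farthest from the training-set centers.
        centers = centers_of(xs)
        candidates.sort(key=lambda x: sum(math.dist(x, c) for c in centers),
                        reverse=True)
        # 105: label them, move them into the training set, retrain.
        for x in candidates[:n]:
            unlabeled.remove(x)
            xs.append(x)
            ys.append(oracle(x))
        model = train()
        candidates, t = most_uncertain(model)        # 106
        if t < threshold:                            # 107: stop training
            break
    return model, xs, ys
```

The loop returns the final classifier together with the labeled set, so the number of labels actually requested is `len(xs)`. Any model that accepts per-sample weights could be wrapped as `fit`.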

Claims (4)

1. An unbalanced data classification method based on active learning, characterized by comprising:
randomly sampling samples from the original unlabeled data and labeling them as initial training data;
performing cost-sensitive learning on the initial training data using a general-purpose machine learning model;
predicting all unlabeled samples in the original data using the trained binary supervised classification model, and selecting the N most uncertain samples according to their uncertainty; calculating, for each of the N samples, the sum of Euclidean distances to the center point of the training data set, and selecting M samples from the N in descending order of distance, where M is smaller than N; labeling the selected M samples and adding them to the training data set; retraining the general-purpose machine learning model on the updated training data set with cost-sensitive learning;
and repeating the above steps iteratively until the average uncertainty of the selected M samples is less than a set uncertainty threshold, then stopping training.
2. The unbalanced data classification method based on active learning according to claim 1, wherein the original unlabeled data comprises credit card transaction data.
3. The unbalanced data classification method based on active learning according to claim 1 or 2, wherein the cost-sensitive learning on the initial training data using the general-purpose machine learning model proceeds as follows: calculating the proportions of positive and negative samples in the training data set, using the proportion of negative samples as the training weight of the positive samples and the proportion of positive samples as the training weight of the negative samples, and obtaining a binary supervised classification model from the initial training samples; at the same time, using the K-means clustering algorithm to calculate the center point of the initial training data.
4. The unbalanced data classification method based on active learning according to claim 1, comprising the following steps:
101. selecting a certain number of unlabeled samples from the original unlabeled data and labeling them; adding the labeled data to the training data set;
102. calculating the proportions of positive and negative samples in the training data set, using the proportion of positive samples as the training weight of the negative samples and the proportion of negative samples as the training weight of the positive samples; selecting a machine learning model and training it;
103. predicting the unlabeled data set with the trained classifier, and selecting the M most uncertain samples ranked by uncertainty;
104. calculating the center point of the training set with the K-means algorithm, calculating for each of the M most uncertain samples the sum of Euclidean distances to the center point, and selecting N sample points in order of distance for labeling;
105. updating the set of unlabeled data samples, adding the N labeled samples to the training set, and updating the classification model;
106. predicting the unlabeled data samples with the updated classifier, selecting the M most uncertain samples ranked by uncertainty, and calculating the mean T of the M uncertainties;
107. determining whether T is smaller than the uncertainty threshold; if T is smaller than the uncertainty threshold, stopping training and outputting the final classifier; if T is larger than the uncertainty threshold, returning to step 104 for the next iteration.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010148859.2A CN111368924A (en) 2020-03-05 2020-03-05 Unbalanced data classification method based on active learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010148859.2A CN111368924A (en) 2020-03-05 2020-03-05 Unbalanced data classification method based on active learning

Publications (1)

Publication Number Publication Date
CN111368924A (en) 2020-07-03

Family

ID=71211668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010148859.2A Withdrawn CN111368924A (en) 2020-03-05 2020-03-05 Unbalanced data classification method based on active learning

Country Status (1)

Country Link
CN (1) CN111368924A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036491A (en) * 2020-09-01 2020-12-04 北京推想科技有限公司 Method and device for determining training sample and method for training deep learning model
CN112257767B (en) * 2020-10-16 2023-03-17 浙江大学 Product key part state classification method aiming at class imbalance data
CN112257767A (en) * 2020-10-16 2021-01-22 浙江大学 Product key part state classification method aiming at class imbalance data
CN112308139A (en) * 2020-10-29 2021-02-02 中国科学院计算技术研究所厦门数据智能研究院 Sample labeling method based on active learning
CN112308139B (en) * 2020-10-29 2024-03-22 中科(厦门)数据智能研究院 Sample labeling method based on active learning
CN112527670A (en) * 2020-12-18 2021-03-19 武汉理工大学 Method for predicting software aging defects in project based on Active Learning
CN112422590A (en) * 2021-01-25 2021-02-26 中国人民解放军国防科技大学 Network traffic classification method and device based on active learning
CN112785585B (en) * 2021-02-03 2023-07-28 腾讯科技(深圳)有限公司 Training method and device for image video quality evaluation model based on active learning
CN112785585A (en) * 2021-02-03 2021-05-11 腾讯科技(深圳)有限公司 Active learning-based training method and device for image video quality evaluation model
CN113469251A (en) * 2021-07-02 2021-10-01 南京邮电大学 Method for classifying unbalanced data
CN113469251B (en) * 2021-07-02 2024-07-26 南京邮电大学 Method for classifying unbalanced data
CN113537630A (en) * 2021-08-04 2021-10-22 支付宝(杭州)信息技术有限公司 Training method and device of business prediction model
CN113537630B (en) * 2021-08-04 2024-06-14 支付宝(杭州)信息技术有限公司 Training method and device of business prediction model

Similar Documents

Publication Publication Date Title
CN111368924A (en) Unbalanced data classification method based on active learning
CN111783844B (en) Deep learning-based target detection model training method, device and storage medium
CN111343147B (en) Network attack detection device and method based on deep learning
CN111325248A (en) Method and system for reducing pre-loan business risk
CN108388929A (en) Client segmentation method and device based on cost-sensitive and semisupervised classification
CN111667135B (en) Load structure analysis method based on typical feature extraction
CN112381248A (en) Power distribution network fault diagnosis method based on deep feature clustering and LSTM
CN111796957A (en) Transaction abnormal root cause analysis method and system based on application log
CN112950347B (en) Resource data processing optimization method and device, storage medium and terminal
CN112990294A (en) Training method and device of behavior discrimination model, electronic equipment and storage medium
CN111598187A (en) Progressive integrated classification method based on kernel width learning system
CN114386856A (en) Method, device and equipment for identifying empty-shell enterprise and computer storage medium
CN111582315B (en) Sample data processing method and device and electronic equipment
CN115081025A (en) Sensitive data management method and device based on digital middlebox and electronic equipment
CN111273911A (en) Software technology debt identification method based on bidirectional LSTM and attention mechanism
CN111160526B (en) Online testing method and device for deep learning system based on MAPE-D annular structure
CN113468338A (en) Big data analysis method for digital cloud service and big data server
CN111582313B (en) Sample data generation method and device and electronic equipment
CN110807159B (en) Data marking method and device, storage medium and electronic equipment
CN116401606A (en) Fraud identification method, device, equipment and medium
CN110737812A (en) search engine user satisfaction evaluation method integrating semi-supervised learning and active learning
CN115936003A (en) Software function point duplicate checking method, device, equipment and medium based on neural network
CN115830371A (en) Deep learning-based rail transit subway steering frame rod member classification detection method
CN112580505B (en) Method and device for identifying network point switch door state, electronic equipment and storage medium
CN111798237B (en) Abnormal transaction diagnosis method and system based on application log

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200703