CN111831822A - Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm - Google Patents


Info

Publication number
CN111831822A
Authority
CN
China
Prior art keywords
data
data set
cluster
clustering
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010646859.5A
Other languages
Chinese (zh)
Inventor
王德志
梁俊艳
陈超
李泽荃
李永飞
顾涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Institute of Science and Technology
Original Assignee
North China Institute of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Institute of Science and Technology filed Critical North China Institute of Science and Technology
Priority to CN202010646859.5A priority Critical patent/CN111831822A/en
Publication of CN111831822A publication Critical patent/CN111831822A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text multi-classification method for unbalanced data sets based on a text multi-classification hybrid equipartition clustering sampling algorithm. A dynamic K-means clustering method based on the silhouette coefficient is introduced into the multi-classification algorithm for unbalanced data sets in order to cluster the unbalanced data set.

Description

Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm
Technical Field
The invention relates to the problem of unbalanced data set classification in data mining, in particular to a text multi-classification method for unbalanced data sets based on a text multi-classification mixed equipartition clustering sampling algorithm.
Background
In natural language processing, text classification based on deep learning neural network models is one of the important research topics. Text classification research is divided into binary classification (e.g. sentiment classification) and multi-classification problems. At present, most of the more successful deep learning text classification algorithms adopt a supervised learning mode. The balance of the training data set has an important influence on the performance of a deep learning algorithm. In practical applications, however, the training data set for text multi-classification is often an unbalanced data set, that is, a data set in which the number of samples of some class is much larger or smaller than the number of samples of the other classes. Moreover, misclassifying minority-class samples is more costly than misclassifying majority-class samples. Unbalanced data set problems are widespread, for example in spam classification, fraud phone-call classification and social media data classification.
How to improve text classification accuracy on unbalanced data sets is a hot issue in current research. Common approaches include: modifying the loss function with label weights during convolutional neural network training on an unbalanced text data set, to reinforce the influence of minority-class samples on the model parameters; in text sentiment classification, a pre-training task selection method based on word vector transfer to distinguish small-class samples and improve their classification accuracy; an unbalanced data weighting method based on hierarchical clustering, which divides the minority samples into several clusters, determines sampling frequency according to density factors and raises the weight of small-class samples; and an algorithm that classifies unbalanced data sets using hyperplane feature maps of a differential twin convolutional neural network and the distances between samples and different hyperplanes.
Current research is mainly focused on the binary classification of texts or on the classification of unbalanced data sets of generally applicable low-dimensional feature vectors. With the development and application of deep learning, text multi-classification schemes based on word vectors have gradually become mainstream. Multi-classification of unbalanced text data sets based on high-dimensional word vectors faces huge challenges, and how to improve the prediction accuracy of high-dimensional small-sample classes has become an urgent problem.
Disclosure of Invention
Aiming at the existing problems, the invention provides a text multi-classification method for unbalanced data sets based on a text multi-classification hybrid equipartition clustering sampling algorithm, namely a mixed sampling method for unbalanced text data sets based on a high-dimensional vector clustering method, so that the classification accuracy of small-sample data is improved while the classification accuracy of large-sample data is maintained.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a text multi-classification method for an unbalanced data set based on a text multi-classification hybrid equipartition clustering sampling algorithm is characterized in that a feature set of unbalanced data is calculated for a microblog disaster data set, and comprises the following steps:
s1: selecting a microblog disaster data set to carry out high-dimensional word segmentation vectorization;
S2: calculating the arithmetic mean of the number of samples in all the different classes of the data set and taking it as the sample data volume average line, the calculation formula being expressed as:

L_avg = (1/N) * Σ_{i=1}^{N} X_i    (1),

wherein N represents the number of classification labels and X_i represents the total number of samples in class i;
S3: partitioning the classes of the data set according to the calculated sample data volume average line, defining the classes whose sample count is greater than the average line as the upper region z_up and the classes whose sample count is less than the average line as the lower region z_dn, and calculating the difference d_i between each class and the sample data volume average line:

d_i = X_i - L_avg    (2),

wherein X_i represents the total number of samples in class i and L_avg is the sample data volume average line;
s4: respectively adopting an undersampling method and an oversampling method based on K-means clustering to the upper area sample data and the lower area sample data in the data set, and carrying out data clustering on each classified sample until the sample data volume of each type is the same as the sample data volume average line;
S5: after the S4 clustering is performed on each class of data in the data set, a mixed sample data set is formed, and the number of samples of the new sample data set is N × L_avg, expressed as follows:

Σ_{i ∈ z_up} (X_i - d_i) + Σ_{i ∈ z_dn} (X_i + |d_i|) = N × L_avg    (3),

wherein z_up denotes the upper region, z_dn denotes the lower region, and d_i represents the difference between each class and the sample data volume average line;
s6: and performing text classification on the formed mixed sample data set to obtain a text classification result.
Further, the specific operation of performing data clustering on the sample data in the data set in step S4 includes:
S41: the distance between nodes in the multidimensional space is calculated using the text cosine distance, with the formula:

Dis(x, y) = 1 - (Σ_{i=1}^{n} x_i * y_i) / (sqrt(Σ_{i=1}^{n} x_i^2) * sqrt(Σ_{i=1}^{n} y_i^2))    (4),

where x, y are two nodes in the multidimensional word vector space, n represents the vector space dimension, and x_i and y_i represent the values of the two vectors in dimension i;
S42: selecting a dynamically adjusted K value based on the silhouette coefficient, and clustering the data of each class of the upper region and the lower region according to the obtained K value;
S43: for each class of sample data in the upper region, an undersampling method is used to reduce the class data sets above the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be reduced in the i-th class data set being expressed as:

Q_{i,j} = |d_i| * M_{i,j} / X_i    (5),

wherein Q_{i,j} represents the amount of data to be reduced in the j-th cluster after clustering of the i-th class of data in the data set, and |d_i| represents the number of data items to be reduced in the i-th class data set above the average line L_avg;
S44: for each class of sample data in the lower region, an oversampling method is used to increase the class data sets below the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be added to the i-th class data set being expressed as:

N_{i,j} = |d_i| * (X_i - M_{i,j}) / (X_i * (K_i - 1))    (6),

wherein N_{i,j} represents the amount of data to be added to the j-th cluster after clustering of the i-th class of data in the data set, X_i represents the total number of samples in the i-th class data set, M_{i,j} represents the amount of data in the j-th cluster after clustering of the i-th class data set, and K_i represents the number of clusters in the clustering of the i-th class data set.
Further, the specific operation of step S42 includes:
S421: initializing the K value and the average silhouette coefficient S;
S422: based on the current K clusters, first calculating the distance between each node i and the cluster centroids and assigning the node to the cluster with the minimum distance, then calculating the average distance between all nodes in a cluster and its centroid according to formula (7) and selecting the node whose distance to the centroid is closest to this average distance as the new centroid for clustering, until the distance between the new and old centroids is less than a given threshold;

D_avg = (1 / |c_i|) * Σ_{p ∈ c_i} Dis(u_i, p)    (7),

wherein u_i represents the old centroid, c_i is the cluster in which node i is located, and p ranges over the other nodes in c_i;
S423: calculating the cohesion degree of each node x_i, i.e. the average distance from node x_i to the other nodes in the same cluster, with the formula:

a_i = (1 / (|c_i| - 1)) * Σ_{p ∈ c_i, p ≠ x_i} Dis(x_i, p)    (8),

wherein c_i is the cluster in which node i is located and p ranges over the other nodes in the same cluster as node i;
S424: calculating the separation degree of each node x_i, i.e. the average distance from node x_i to all nodes in the cluster c_m nearest to it, with the formula:

b_i = (1 / |c_m|) * Σ_{q ∈ c_m} Dis(x_i, q)    (9),

wherein c_m is the cluster nearest to node x_i and q ranges over the nodes in c_m, the nearest cluster c_m being calculated by:

c_m = argmin_{c_k ≠ c_i} (1 / |c_k|) * Σ_{q ∈ c_k} Dis(x_i, q)    (10);
S425: calculating the silhouette coefficient S_i of each node x_i and the average silhouette coefficient S; the larger the value of S, the better the clustering effect, and the calculation formula is:

S_i = (b_i - a_i) / max(a_i, b_i)    (11),

wherein S_i represents the silhouette coefficient of node i, the average silhouette coefficient S is the arithmetic mean of the silhouette coefficients of all nodes, and its value range is S ∈ [-1, 1];
S426: increasing the K value, repeating steps S422-S425 to calculate the average silhouette coefficient corresponding to each K value; after N iterations, the K corresponding to the maximum average silhouette coefficient is selected as the number of clusters, and the clustering result at that point is taken as the final result.
The invention has the beneficial effects that:
firstly, the algorithm of the invention improves the copy proportion of the small feature vector in the small sample due to clustering, thereby improving the prediction accuracy of the small feature.
Secondly, the invention introduces a dynamic K-means clustering method based on the silhouette coefficient to cluster the unbalanced data set, and uses the resulting clusters in a mixed sampling mode to achieve a balanced distribution of the text data set; the performance of the TextCNN model is effectively improved in terms of accuracy, F1 value and other aspects compared with the conventional method.
Drawings
FIG. 1 is a diagram of a TextCNN network model architecture;
FIG. 2 is a parameter diagram of a TextCNN model;
FIG. 3 is a schematic diagram of the arithmetic mean index obtained by four experimental methods of the present invention;
FIG. 4 is a diagram illustrating weighted average index values obtained by four experimental methods according to the present invention;
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following further describes the technical solution of the present invention with reference to the drawings and the embodiments.
In order to solve the imbalance of the text multi-classification training data set, the data set can be preprocessed by undersampling (down-sampling) or oversampling (up-sampling). Undersampling takes the number of samples in the smallest class as the standard and decrementally extracts data from the large classes, so that the large and small classes have the same scale and the total size of the data set decreases. Oversampling is the opposite: taking the number of samples in the largest class as the standard, small-class data is duplicated and increased, so that the small and large classes have the same scale and the total size of the data set increases. However, a text multi-classification data set has multiple classification labels, so processing based only on the smallest and largest classes leads to an unreasonable data set composition and affects the training result of the final deep learning model. Therefore, a clustering-based equipartition hybrid sampling algorithm (HCSA) is proposed herein for handling the multi-classification of text in unbalanced data sets.
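By way of contrast with the HCSA algorithm introduced below, the following Python sketch (assuming numpy; the class sizes are illustrative and not taken from this document) shows the plain random undersampling and oversampling just described, which also serve later as two of the comparison methods:

# Sketch of plain random undersampling / oversampling; numpy and the example
# class sizes are assumptions, not part of the patent.
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(samples, target):
    """Keep `target` randomly chosen items of a majority class."""
    idx = rng.choice(len(samples), size=target, replace=False)
    return [samples[i] for i in idx]

def random_oversample(samples, target):
    """Duplicate randomly chosen items of a minority class until `target` is reached."""
    extra = rng.choice(len(samples), size=target - len(samples), replace=True)
    return list(samples) + [samples[i] for i in extra]

major = ["major_%d" % i for i in range(5000)]
minor = ["minor_%d" % i for i in range(400)]
print(len(random_undersample(major, 400)), len(random_oversample(minor, 5000)))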
a text multi-classification method for an unbalanced data set based on a text multi-classification hybrid equipartition clustering sampling algorithm is used for calculating a feature set of unbalanced data aiming at a microblog disaster data set, and comprises the following steps:
s1: selecting a microblog disaster data set to carry out high-dimensional word segmentation vectorization;
in the text multi-classification processing, high-latitude vectorization representation needs to be carried out on each Word in a text, the typical methods include TF-IDF, skip-gram, CBOW and the like, and the Word vector processing of a text data set can adopt a custom model training mode or a classical Word vector model, such as Word2vec, GloVe, FastText and the like. However, in any way, when there are many text training sample data, the basic vocabulary involved in training is increasing. Especially when non-standard texts are processed (e.g. microblog texts), a large number of new vocabularies are encountered. In order to more accurately represent the relationship between texts, high-dimensional vectorization needs to be performed on words in the texts, each dimension represents a text feature, and only when the vector dimension of the words reaches a certain scale, text classification training sample data with the features having the discrimination can be provided. Google corporation trained a Word2vec Word vector model with 300 dimensions based on a large amount of general news material. Facebook trains the FastText model with 300 dimensions based on the general wiki news material. These high dimensional models provide a solid word vector basis for text multi-classification.
The high-dimensional word vector model only provides support for word segmentation vectorization; text multi-classification also needs a large amount of training sample data. At present, text multi-classification largely uses supervised learning, so the training data must be labeled. Multi-class labeling of a large amount of text data is difficult, and it is basically impossible to obtain completely balanced labeled text classification training data; as the number of classes increases, the balance of the training data set is challenged further. In practical text multi-classification research, unbalanced training data sets are therefore used heavily. The prediction accuracy on classes with few labeled samples plays an important role in the acceptance of a concrete application; for example, fraudulent mail in e-mail classification constitutes a small amount of labeled data relative to ordinary mail and advertising spam. The imbalance of data sets is thus a fundamental property of most text multi-classification tasks.
S2: calculating the arithmetic mean of the number of samples in all the different classes of the data set and taking it as the sample data volume average line, the calculation formula being expressed as:

L_avg = (1/N) * Σ_{i=1}^{N} X_i    (1),

wherein N represents the number of classification labels and X_i represents the total number of samples in class i;
S3: partitioning the classes of the data set according to the calculated sample data volume average line, defining the classes whose sample count is greater than the average line as the upper region z_up and the classes whose sample count is less than the average line as the lower region z_dn, and calculating the difference d_i between each class and the sample data volume average line:

d_i = X_i - L_avg    (2),

wherein X_i represents the total number of samples in class i and L_avg is the sample data volume average line;
s4: respectively adopting an undersampling method and an oversampling method based on K-means clustering to the upper region sample data and the lower region sample data in the data set, and carrying out data clustering on each classified sample until the sample data size of each class is the same as the sample data size average line, thereby realizing the balance of the sample data sizes of different classes;
S5: after the S4 clustering of each class of data in the data set, the data increase and decrease operations are carried out and a balanced mixed sample data set is formed, the number of samples of which is N × L_avg, expressed as follows:

Σ_{i ∈ z_up} (X_i - d_i) + Σ_{i ∈ z_dn} (X_i + |d_i|) = N × L_avg    (3),

wherein z_up denotes the upper region, z_dn denotes the lower region, and d_i represents the difference between each class and the sample data volume average line;
s6: and performing text classification on the formed mixed sample data set to obtain a text classification result.
In the unbalanced data set, the data of classes below the sample data volume average line are small-sample data, and the number of their data samples needs to be increased. The more data a cluster contains, the more similar the feature vectors of its data are, and the less data is added to it; the less data a cluster contains, the more particular its sample features are, and the more data is added to it. Conversely, the data of classes above the sample data volume average line are large-sample data, and the number of their data samples needs to be reduced in order to prevent over-fitting of the classification model and to improve training speed. The large-sample data are clustered with the silhouette-coefficient-based dynamic K-means method, and samples are removed according to the distribution of the clusters within each class: the more data a cluster contains, the more similar the feature vectors of its data are, and the more data is removed from it; the less data a cluster contains, the more particular its sample features are, and the less data is removed.
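By way of illustration, the following Python sketch (assuming numpy; the class counts are illustrative and not taken from Table 1) computes the sample data volume average line, the upper/lower partition and the differences d_i of equations (1)-(3):

# Sketch of equations (1)-(3): the sample data volume average line, the
# upper/lower partition and the per-class difference d_i; class counts are illustrative.
import numpy as np

counts = np.array([5200, 400, 900, 3100, 150, 2800, 1300, 600, 4500])  # X_i for 9 classes
L_avg = counts.mean()              # equation (1)
d = counts - L_avg                 # equation (2)
z_up = np.where(d > 0)[0]          # classes above the average line
z_dn = np.where(d < 0)[0]          # classes below the average line

# After sampling, every class holds L_avg samples, so the mixed data set
# contains N * L_avg samples in total (equation (3)).
print(L_avg, z_up.tolist(), z_dn.tolist(), len(counts) * L_avg)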
Further, the specific operation of performing data clustering on the sample data in the data set in step S4 includes:
S41: the distance between nodes in the multidimensional space is calculated using the text cosine distance, with the formula:

Dis(x, y) = 1 - (Σ_{i=1}^{n} x_i * y_i) / (sqrt(Σ_{i=1}^{n} x_i^2) * sqrt(Σ_{i=1}^{n} y_i^2))    (4),

where x, y are two nodes in the multidimensional word vector space, n represents the vector space dimension, and x_i and y_i represent the values of the two vectors in dimension i. As can be seen from formula (4), for two text vectors, the more similar the texts, the smaller Dis(x, y) and the closer the distance; when the two texts are completely consistent the distance is 0, and when they are completely different the maximum distance is 1, i.e. Dis(x, y) ∈ [0, 1];
S42: selecting a dynamically adjusted K value based on the silhouette coefficient, and clustering the data of each class of the upper region and the lower region according to the obtained K value;
S43: for each class of sample data in the upper region, an undersampling method is used to reduce the class data sets above the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be reduced in the i-th class data set being expressed as:

Q_{i,j} = |d_i| * M_{i,j} / X_i    (5),

wherein Q_{i,j} represents the amount of data to be reduced in the j-th cluster after clustering of the i-th class of data, and |d_i| represents the number of data items to be reduced in the i-th class data set above the sample data volume average line; data within a cluster are removed by random selection, i.e. Q_{i,j} of the M_{i,j} items of the cluster are chosen at random and discarded;
S44: for each class of sample data in the lower region, an oversampling method is used to increase the class data sets below the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be added to the i-th class data set being expressed as:

N_{i,j} = |d_i| * (X_i - M_{i,j}) / (X_i * (K_i - 1))    (6),

wherein N_{i,j} represents the amount of data to be added to the j-th cluster after clustering of the i-th class of data, X_i represents the total number of samples in the i-th class data set, M_{i,j} represents the amount of data in the j-th cluster after clustering of the i-th class data set, and K_i represents the number of clusters in the clustering of the i-th class data set. As can be seen from formula (6), after clustering, the more data a cluster contains, the less data is added to it, and the further a class lies from the average line L_avg, the more data is added overall. Data within a cluster are added by random duplication, i.e. N_{i,j} of the M_{i,j} items of the cluster are duplicated at random (a short sketch of these per-cluster amounts follows this list).
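The following Python sketch (assuming numpy) illustrates the text cosine distance of equation (4) and the per-cluster amounts of steps S43/S44 under the proportional allocation read into equations (5) and (6) above; since those two formulas are reconstructed from the surrounding description, the allocation functions below are an interpretation rather than a verbatim implementation, and the cluster sizes are illustrative:

# Sketch of equation (4) and of the per-cluster amounts of steps S43/S44,
# assuming the proportional allocation reconstructed in equations (5)-(6);
# cluster sizes and rounding are illustrative assumptions.
import numpy as np

def cosine_distance(x, y):
    """Equation (4): 1 - cosine similarity; lies in [0, 1] for non-negative vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - float(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))

def undersample_amounts(cluster_sizes, d_i):
    """Q_{i,j}: larger clusters in a majority class give up proportionally more samples."""
    M = np.asarray(cluster_sizes, dtype=float)
    return np.round(abs(d_i) * M / M.sum()).astype(int)

def oversample_amounts(cluster_sizes, d_i):
    """N_{i,j}: smaller clusters in a minority class are replicated proportionally more."""
    M = np.asarray(cluster_sizes, dtype=float)
    K = len(M)
    return np.round(abs(d_i) * (M.sum() - M) / (M.sum() * (K - 1))).astype(int)

print(cosine_distance([1, 0, 1], [1, 1, 0]))           # 0.5
print(undersample_amounts([1000, 600, 400], d_i=600))  # majority class sheds 600 samples
print(oversample_amounts([90, 40, 20], d_i=-300))      # minority class gains about 300 samples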
Further, in the K-means clustering, the K value represents the number of clusters. Since the clustering belongs to unsupervised learning, the optimal K value cannot be determined in advance, and the size of K directly influences the final clustering effect. Therefore the algorithm adopts dynamically adjusted K value selection and data clustering based on the silhouette coefficient, that is, the specific operation steps of step S42 include:
S421: initializing the K value and the average silhouette coefficient S. In a vector space with M nodes the number of clusters K ∈ [1, M]; the extreme possibilities are that all nodes fall into a single cluster, or that each node forms its own cluster independent of any other node. Therefore the K value is initialized to 2, starting from the smallest meaningful number of clusters, the average silhouette coefficient S is initialized to its minimum value -1, and two nodes are randomly selected as the centroid nodes of the initial clusters;
S422: based on the current K clusters, first calculating the distance between each node i and the cluster centroids and assigning the node to the cluster with the minimum distance; then calculating the average distance between all nodes in a cluster and its centroid according to formula (7), selecting the node whose distance to the centroid is closest to this average distance as the new centroid, and calculating the distance between the new and the old centroid; if this distance is smaller than a given threshold, whose value lies in the range (0, 1), the clustering ends, otherwise a new round of clustering is started with the new centroid as the core;

D_avg = (1 / |c_i|) * Σ_{p ∈ c_i} Dis(u_i, p)    (7),

wherein u_i represents the old centroid, c_i is the cluster in which node i is located, and p ranges over the other nodes in c_i;
S423: after clustering is finished, in order to dynamically optimize the selection of the K value using the silhouette coefficient, first calculating the cohesion degree a_i of each node x_i, i.e. the average distance from node x_i to the other nodes in the same cluster, with the formula:

a_i = (1 / (|c_i| - 1)) * Σ_{p ∈ c_i, p ≠ x_i} Dis(x_i, p)    (8),

wherein c_i is the cluster in which node i is located and p ranges over the other nodes in the same cluster as node i;
S424: calculating the separation degree b_i of each node x_i, i.e. the average distance from node x_i to all nodes in the cluster c_m nearest to it, with the formula:

b_i = (1 / |c_m|) * Σ_{q ∈ c_m} Dis(x_i, q)    (9),

wherein c_m is the cluster nearest to node x_i and q ranges over the nodes in c_m; the nearest cluster c_m is calculated by:

c_m = argmin_{c_k ≠ c_i} (1 / |c_k|) * Σ_{q ∈ c_k} Dis(x_i, q)    (10),

i.e. the average distance from x_i to all nodes of a candidate cluster is used as the measure of the distance from the point to that cluster, and the cluster with the minimum such average distance is selected as the nearest cluster;
It can be seen that the cohesion degree reflects the density within a cluster while the separation degree reflects the distance between clusters, so the higher the cohesion and the larger the distance between clusters, the better the clustering effect; the combination of cohesion and separation therefore forms the silhouette coefficient of a node, and the average silhouette coefficient of all nodes is calculated next;
S425: calculating the average silhouette coefficient S of all nodes, i.e. the arithmetic mean of the silhouette coefficients of all nodes; the larger the value of S, the better the clustering effect, and the calculation formula is:

S_i = (b_i - a_i) / max(a_i, b_i)    (11),

wherein S_i represents the silhouette coefficient of node i, S is the arithmetic mean of the S_i over all nodes, and the value range of the silhouette coefficient is S ∈ [-1, 1];
S426: increasing the K value, repeating steps S422-S425 to calculate the average silhouette coefficient corresponding to each K value; after N iterations, the K corresponding to the maximum average silhouette coefficient is selected as the number of clusters, and the clustering result at that point is taken as the final result.
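For orientation, the following Python sketch selects K by the average silhouette coefficient in the spirit of steps S421-S426, using scikit-learn's KMeans and silhouette_score (with a cosine metric for the score) as stand-ins for the cosine-distance clustering described above; the search range and the random data are assumptions:

# Sketch of the silhouette-driven choice of K, using scikit-learn's KMeans and
# silhouette_score as stand-ins for the cosine-distance clustering described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_max=10):
    """Return (K, average silhouette S, labels) for the K with the largest S."""
    best = (2, -1.0, None)
    for k in range(2, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        s = silhouette_score(X, labels, metric="cosine")  # S of equation (11)
        if s > best[1]:
            best = (k, s, labels)
    return best

X = np.random.default_rng(0).random((200, 300))  # illustrative 300-dimensional word vectors
k, s, _ = best_k(X)
print(k, round(float(s), 3))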
Example:
1. data set selection:
convolutional Neural Networks (CNNs) are a classical neural network model in machine learning and have been successfully applied in a number of fields. For natural language analysis, CNN generally adopts a one-dimensional model structure, which can be modified into parallel text classification convolutional neural network TextCNN, and the model structure is shown in fig. 1. The input text can be processed by a plurality of parallel convolution layers, the maximum pooling layer can adopt the schemes of step length 3, step length 4 and step length 5 to process data, the purpose is to extract the text characteristic information of different word intervals, and finally the characteristic information is summarized by the tiling layer. In order to guarantee the operation efficiency of the model, the TextCNN model designed herein adopts a one-dimensional convolution model structure with 3 parallel convolution layers according to the high-dimensional character of the word vector, as shown in fig. 2. The convolutional layer in the model has an input dimension of (50,300) structure and an output of (50,256). The convolution layer activation function adopts a 'relu' function, the output layer activation function adopts a 'softmax' function, the optimizer adopts 'adam', and the loss function adopts 'coordinated _ cross'.
Text multi-classification algorithms are generally optimized for a particular text data set. The optimization here is aimed at a microblog disaster data set. This data set comes from the CrisisNLP website (https://crisisnlp.qcri.org). It provides about 21,000 disaster-related microblog posts from 2013 to 2015, which were manually labeled with multiple classes. The labeled sample data are shown in Table 1. The labels cover 9 types of information, including injury, death, missing, found, personnel placement, evacuation and the like. The number of samples in the largest class is about 13 times that of the smallest class; 5 classes lie below the data set mean line and 4 classes lie above it, so this is a typical unbalanced text data set. For data set preprocessing, because a microblog post cannot exceed 140 characters, keyword extraction is performed before text vectorization; to ensure that the extracted keywords represent the target class of a post, the average number of words per post, 50, was finally chosen as the parameter through statistical analysis, i.e. the top 50 words by word frequency are used as the keywords of a post.
TABLE 1 microblog disaster dataset calibration
2. Evaluation index
The evaluation indexes of a machine learning algorithm generally adopt the accuracy (Acc), precision (P), recall (R) and F1 values. In text multi-classification, the precision, recall and F1 values can be calculated either as arithmetic means (P_m, R_m and F1_m) or as weighted averages (P_w, R_w and F1_w), as shown in equations (12) and (13):
P_m = (1/N) * Σ_{i=1}^{N} P_i,    R_m = (1/N) * Σ_{i=1}^{N} R_i,    F1_m = (1/N) * Σ_{i=1}^{N} F1_i    (12),
wherein P_i is the precision of class i, i.e. "the number of correct predictions of this class / the number of all predictions of this class"; R_i is the recall of class i, i.e. "the number of correct predictions of this class / the number of all samples of this class"; F1_i is 2(P_i * R_i)/(P_i + R_i); α_i is the proportion of the samples of class i in the total samples; and N is the total number of classes.
P_w = Σ_{i=1}^{N} α_i * P_i,    R_w = Σ_{i=1}^{N} α_i * R_i,    F1_w = Σ_{i=1}^{N} α_i * F1_i    (13).
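These arithmetic-mean (macro) and weighted-average indices can be computed, for example, with scikit-learn, as the following sketch shows (the labels are illustrative):

# Sketch of equations (12)-(13): macro and weighted precision, recall and F1,
# assuming scikit-learn; labels are illustrative.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)
p_m, r_m, f1_m, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
p_w, r_w, f1_w, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(acc, round(f1_m, 3), round(f1_w, 3))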
3. The test steps are as follows:
firstly, performing Word segmentation vectorization on a microblog disaster data set based on a Word2vec model, wherein the dimension of each text is (50,300), 50 represents a keyword in the text, and if the number of the keywords is less than 50, zero padding is performed for processing. 300 represents the dimension of each word, i.e. the word feature vector space is 300 dimensions. 21125 microblog data were used together in the experiment, of which 90% were used for model training and 10% were used for model testing. The output dimension of the TextCNN model data is (9, 1), which means that the data is 9 rows and 1 columns and is classified into 9;
secondly, calculating the arithmetic mean of the number of all different classified samples in the microblog disaster data set, taking the arithmetic mean as a sample data volume average line, partitioning each classified sample in the data set according to the calculated sample data volume average line, and calculating the difference d between each classified sample and the sample data volume average linei
Thirdly, respectively carrying out data clustering on each classified sample by adopting an undersampling method and an oversampling method based on K-means clustering on the upper region sample data and the lower region sample data of the partitions in the data set until the sample amount of each type of sample is the same as the average sample amount line, and finally obtaining a new sample data set;
finally, in order to verify the performance of the algorithm, 4 methods are respectively carried out for comparing experimental data, wherein the first method is a conventional method and under-sampling or over-sampling is not carried out on a data set; the second is a random undersampling mode, the data quantity of the minimum classification data set is taken as a standard, and other classification data sets are subjected to random undersampling; the third is a random oversampling mode, the data quantity of the maximum classification data set is taken as a standard, and other classification data sets are subjected to random replication oversampling; the fourth is the HCSA sampling method proposed herein.
4. Analysis of results
The confusion matrix obtained by the above 4 experimental methods is shown in tables a-d, where table a is the confusion matrix of the results of the conventional method, table b is the confusion matrix of the results of the under-sampling method, table c is the confusion matrix of the results of the over-sampling method, and table d is the confusion matrix of the results of the HCSA method:
TABLE a (conventional method)
TABLE b (undersampling method)
TABLE c (oversampling method)
TABLE d (HCSA method)
The evaluation index values corresponding to the classification data sets calculated based on the confusion matrix of various methods are shown in table 2:
TABLE 2 evaluation index values of the methods
As can be seen from Tables a-d, class 5 is the smallest data set and class 7 is the largest. Combining this with Table 2: for class 5, the smallest data set, the F1 value is largest when the HCSA algorithm is used, i.e. the HCSA algorithm effectively improves the prediction precision and recall of the small-sample class; and for class 7, the largest data set, the F1 value is also largest when the HCSA algorithm is used, and the precision and recall performance are not reduced.
FIG. 3 and FIG. 4 show the arithmetic-mean index values and the weighted-average index values of each method, respectively. As can be seen from the figures, the accuracy and F1 value of the HCSA algorithm are the highest, the performance of the oversampling method is similar to that of the conventional method, and the index values of the undersampling method are the lowest. Undersampling causes a severe performance drop because training sample data are discarded at random. Oversampling adds training data, but random duplication cannot, beyond a certain point, increase the rare feature vectors in the text vector space; moreover, the large amount of duplicated data causes the TextCNN model to over-fit during training. This shows that merely increasing the amount of training data, if done unreasonably, makes the model over-fit and cannot improve its prediction performance. In the HCSA algorithm, because of the clustering, the duplication proportion of rare feature vectors within the small-sample classes is raised, so the prediction accuracy for these rare features can be improved. For large-sample data, training data are discarded in order to prevent over-fitting, but this does not cause the index values to drop as much as undersampling, because the discarding is based on the clustering result: the more data a cluster contains, the larger the proportion discarded from it. In this way, the balanced distribution of the various features in the data set is finally ensured.
The experimental results show that introducing the dynamic clustering method into the HCSA algorithm allows the data set to be further differentiated based on the high-dimensional features of the text, provides a foundation for undersampling and oversampling, and finally achieves a balanced distribution of the feature vectors of the text training data set in the high-dimensional vector space, thereby supporting improved text multi-classification performance. Taking the microblog disaster data set as an example, the performance of the HCSA algorithm on the TextCNN model is verified. Experiments prove that, compared with the conventional method, the oversampling method and the undersampling method, the performance of the algorithm is effectively improved in terms of accuracy, F1 value and other aspects.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (3)

1. A text multi-classification method for an unbalanced data set based on a text multi-classification hybrid equipartition clustering sampling algorithm is characterized in that a feature set of unbalanced data is calculated for a microblog disaster data set, and comprises the following steps:
s1: selecting a microblog disaster data set to carry out high-dimensional word segmentation vectorization;
S2: calculating the arithmetic mean of the number of samples in all the different classes of the data set and taking it as the sample data volume average line, the calculation formula being expressed as:

L_avg = (1/N) * Σ_{i=1}^{N} X_i    (1),

wherein N represents the number of classification labels and X_i represents the total number of samples in class i;
S3: partitioning the classes of the data set according to the calculated sample data volume average line, defining the classes whose sample count is greater than the average line as the upper region z_up and the classes whose sample count is less than the average line as the lower region z_dn, and calculating the difference d_i between each class and the sample data volume average line:

d_i = X_i - L_avg    (2),

wherein X_i represents the total number of samples in class i and L_avg is the sample data volume average line;
s4: respectively adopting an undersampling method and an oversampling method based on K-means clustering to the upper area sample data and the lower area sample data in the data set, and carrying out data clustering on each classified sample until the sample data volume of each type is the same as the sample data volume average line;
S5: after the S4 clustering is performed on each class of data in the data set, a mixed sample data set is formed, and the number of samples of the new sample data set is N × L_avg, expressed as follows:

Σ_{i ∈ z_up} (X_i - d_i) + Σ_{i ∈ z_dn} (X_i + |d_i|) = N × L_avg    (3),

wherein z_up denotes the upper region, z_dn denotes the lower region, and d_i represents the difference between each class and the sample data volume average line;
s6: and performing text classification on the formed mixed sample data set to obtain a text classification result.
2. The text multi-classification method for an unbalanced data set based on a text multi-classification hybrid equipartition clustering sampling algorithm according to claim 1, wherein the specific operation of data clustering on the sample data in the data set in step S4 comprises:
S41: the distance between nodes in the multidimensional space is calculated using the text cosine distance, with the formula:

Dis(x, y) = 1 - (Σ_{i=1}^{n} x_i * y_i) / (sqrt(Σ_{i=1}^{n} x_i^2) * sqrt(Σ_{i=1}^{n} y_i^2))    (4),

where x, y are two nodes in the multidimensional word vector space, n represents the vector space dimension, and x_i and y_i represent the values of the two vectors in dimension i;
S42: selecting a dynamically adjusted K value based on the silhouette coefficient, and clustering the data of each class of the upper region and the lower region according to the obtained K value;
S43: for each class of sample data in the upper region, an undersampling method is used to reduce the class data sets above the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be reduced in the i-th class data set being expressed as:

Q_{i,j} = |d_i| * M_{i,j} / X_i    (5),

wherein Q_{i,j} represents the amount of data to be reduced in the j-th cluster after clustering of the i-th class of data in the data set, and |d_i| represents the number of data items to be reduced in the i-th class data set above the average line L_avg;
S44: for each class of sample data in the lower region, an oversampling method is used to increase the class data sets below the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be added to the i-th class data set being expressed as:

N_{i,j} = |d_i| * (X_i - M_{i,j}) / (X_i * (K_i - 1))    (6),

wherein N_{i,j} represents the amount of data to be added to the j-th cluster after clustering of the i-th class of data in the data set, X_i represents the total number of samples in the i-th class data set, M_{i,j} represents the amount of data in the j-th cluster after clustering of the i-th class data set, and K_i represents the number of clusters in the clustering of the i-th class data set.
3. The method of claim 1, wherein the specific operation of step S42 includes:
S421: initializing the K value and the average silhouette coefficient S;
S422: based on the current K clusters, first calculating the distance between each node i and the cluster centroids and assigning the node to the cluster with the minimum distance, then calculating the average distance between all nodes in a cluster and its centroid according to formula (7) and selecting the node whose distance to the centroid is closest to this average distance as the new centroid for clustering, until the distance between the new and old centroids is less than a given threshold;

D_avg = (1 / |c_i|) * Σ_{p ∈ c_i} Dis(u_i, p)    (7),

wherein u_i represents the old centroid, c_i is the cluster in which node i is located, and p ranges over the other nodes in c_i;
S423: calculating the cohesion degree of each node x_i, i.e. the average distance from node x_i to the other nodes in the same cluster, with the formula:

a_i = (1 / (|c_i| - 1)) * Σ_{p ∈ c_i, p ≠ x_i} Dis(x_i, p)    (8),

wherein c_i is the cluster in which node i is located and p ranges over the other nodes in the same cluster as node i;
S424: calculating the separation degree of each node x_i, i.e. the average distance from node x_i to all nodes in the cluster c_m nearest to it, with the formula:

b_i = (1 / |c_m|) * Σ_{q ∈ c_m} Dis(x_i, q)    (9),

wherein c_m is the cluster nearest to node x_i and q ranges over the nodes in c_m, the nearest cluster c_m being calculated by:

c_m = argmin_{c_k ≠ c_i} (1 / |c_k|) * Σ_{q ∈ c_k} Dis(x_i, q)    (10);
S425: calculating the silhouette coefficient S_i of each node x_i and the average silhouette coefficient S; the larger the value of S, the better the clustering effect, and the calculation formula is:

S_i = (b_i - a_i) / max(a_i, b_i)    (11),

wherein S_i represents the silhouette coefficient of node i, the average silhouette coefficient S is the arithmetic mean of the silhouette coefficients of all nodes, and its value range is S ∈ [-1, 1];
S426: increasing the K value, repeating steps S422-S425 to calculate the average silhouette coefficient corresponding to each K value; after N iterations, the K corresponding to the maximum average silhouette coefficient is selected as the number of clusters, and the clustering result at that point is taken as the final result.
CN202010646859.5A 2020-07-07 2020-07-07 Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm Pending CN111831822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010646859.5A CN111831822A (en) 2020-07-07 2020-07-07 Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010646859.5A CN111831822A (en) 2020-07-07 2020-07-07 Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm

Publications (1)

Publication Number Publication Date
CN111831822A true CN111831822A (en) 2020-10-27

Family

ID=72900432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010646859.5A Pending CN111831822A (en) 2020-07-07 2020-07-07 Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm

Country Status (1)

Country Link
CN (1) CN111831822A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365060A (en) * 2020-11-13 2021-02-12 广东电力信息科技有限公司 Preprocessing method for power grid internet of things perception data
CN112365060B (en) * 2020-11-13 2024-01-26 广东电力信息科技有限公司 Preprocessing method for network Internet of things sensing data
CN116401362A (en) * 2023-01-16 2023-07-07 之江实验室 Unbalanced data set-oriented green molten industry classification method and device based on dynamic clustering
CN117235270A (en) * 2023-11-16 2023-12-15 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment
CN117235270B (en) * 2023-11-16 2024-02-02 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment

Similar Documents

Publication Publication Date Title
CN107944480B (en) Enterprise industry classification method
CN109241530B (en) Chinese text multi-classification method based on N-gram vector and convolutional neural network
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
Jiang et al. An improved K-nearest-neighbor algorithm for text categorization
CN111831822A (en) Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm
CN110929029A (en) Text classification method and system based on graph convolution neural network
WO2022126810A1 (en) Text clustering method
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
CN110795564B (en) Text classification method lacking negative cases
CN109299263B (en) Text classification method and electronic equipment
CN111368891A (en) K-Means text classification method based on immune clone wolf optimization algorithm
CN112231477A (en) Text classification method based on improved capsule network
CN115688024A (en) Network abnormal user prediction method based on user content characteristics and behavior characteristics
Belhaouari et al. Optimized K‐Means Algorithm
CN112417152A (en) Topic detection method and device for case-related public sentiment
Maddumala A Weight Based Feature Extraction Model on Multifaceted Multimedia Bigdata Using Convolutional Neural Network.
CN113626604B (en) Web page text classification system based on maximum interval criterion
CN108268461A (en) A kind of document sorting apparatus based on hybrid classifer
CN112417082B (en) Scientific research achievement data disambiguation filing storage method
Elgeldawi et al. Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN117273134A (en) Zero-sample knowledge graph completion method based on pre-training language model
CN116881451A (en) Text classification method based on machine learning
Sudha Semi supervised multi text classifications for telugu documents
Huang et al. An empirical study on the classification of Chinese news articles by machine learning and deep learning techniques

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201027

RJ01 Rejection of invention patent application after publication