CN111831822A - Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm - Google Patents


Info

Publication number
CN111831822A
Authority
CN
China
Prior art keywords
data
data set
cluster
clustering
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010646859.5A
Other languages
Chinese (zh)
Inventor
王德志
梁俊艳
陈超
李泽荃
李永飞
顾涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Institute of Science and Technology
Original Assignee
North China Institute of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Institute of Science and Technology filed Critical North China Institute of Science and Technology
Priority to CN202010646859.5A priority Critical patent/CN111831822A/en
Publication of CN111831822A publication Critical patent/CN111831822A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text multi-classification method for unbalanced data sets based on a text multi-classification hybrid equipartition clustering sampling algorithm. A dynamic K-means clustering method based on the silhouette coefficient is introduced into the multi-classification algorithm for unbalanced data sets in order to cluster the unbalanced data set.

Description

Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm
Technical Field
The invention relates to the problem of unbalanced data set classification in data mining, in particular to a text multi-classification method for unbalanced data sets based on a text multi-classification mixed equipartition clustering sampling algorithm.
Background
In natural language processing, text classification based on deep learning neural network models is one of the important research topics. Text classification research is divided into binary classification (e.g. sentiment classification) and multi-classification problems. At present, most of the more successful deep learning text classification algorithms adopt a supervised learning mode. The balance of the training data set has an important influence on the performance of a deep learning algorithm. In practical applications, however, the training data set for text multi-classification is often an unbalanced data set, that is, a data set in which the number of samples of some class is much larger or smaller than the number of samples of the other classes. Moreover, misclassifying minority-class samples is more costly than misclassifying majority-class samples. Unbalanced data set problems are widespread, for example in spam classification, fraud phone-call classification and social media data classification.
How to improve text classification accuracy on unbalanced data sets is a hot issue in current research. Common approaches include: modifying the loss function with label weights during convolutional neural network training on an unbalanced text data set, to reinforce the influence of minority-class samples on the model parameters; in text sentiment classification, a pre-training task selection method based on word vector transfer to distinguish small-class samples and improve their classification accuracy; an unbalanced data weighting method based on hierarchical clustering, which divides the minority samples into several clusters, determines sampling frequency according to density factors and raises the weight of small-class samples; and an algorithm that classifies unbalanced data sets using hyperplane feature maps of a differential twin convolutional neural network and the distances between samples and different hyperplanes.
Current research is mainly focused on the binary classification of texts or on the classification of unbalanced data sets of generally applicable low-dimensional feature vectors. With the development and application of deep learning, text multi-classification schemes based on word vectors have gradually become mainstream. Multi-classification of unbalanced text data sets based on high-dimensional word vectors faces huge challenges, and how to improve the prediction accuracy of high-dimensional small-sample classes has become an urgent problem.
Disclosure of Invention
Aiming at the existing problems, the invention provides a text multi-classification method for unbalanced data sets based on a text multi-classification hybrid equipartition clustering sampling algorithm, namely a mixed sampling method for unbalanced text data sets based on a high-dimensional vector clustering method, so that the classification accuracy of small-sample data is improved while the classification accuracy of large-sample data is maintained.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a text multi-classification method for an unbalanced data set based on a text multi-classification hybrid equipartition clustering sampling algorithm is characterized in that a feature set of unbalanced data is calculated for a microblog disaster data set, and comprises the following steps:
s1: selecting a microblog disaster data set to carry out high-dimensional word segmentation vectorization;
S2: calculating the arithmetic mean of the number of samples in all the different classes of the data set and taking it as the sample data volume average line, the calculation formula being expressed as:

L_avg = (1/N) * Σ_{i=1}^{N} X_i    (1),

wherein N represents the number of classification labels and X_i represents the total number of samples in class i;
S3: partitioning the classes of the data set according to the calculated sample data volume average line, defining the classes whose sample count is greater than the average line as the upper region z_up and the classes whose sample count is less than the average line as the lower region z_dn, and calculating the difference d_i between each class and the sample data volume average line:

d_i = X_i - L_avg    (2),

wherein X_i represents the total number of samples in class i and L_avg is the sample data volume average line;
s4: respectively adopting an undersampling method and an oversampling method based on K-means clustering to the upper area sample data and the lower area sample data in the data set, and carrying out data clustering on each classified sample until the sample data volume of each type is the same as the sample data volume average line;
S5: after the S4 clustering is performed on each class of data in the data set, a mixed sample data set is formed, and the number of samples of the new sample data set is N × L_avg, expressed as follows:

Σ_{i ∈ z_up} (X_i - d_i) + Σ_{i ∈ z_dn} (X_i + |d_i|) = N × L_avg    (3),

wherein z_up denotes the upper region, z_dn denotes the lower region, and d_i represents the difference between each class and the sample data volume average line;
s6: and performing text classification on the formed mixed sample data set to obtain a text classification result.
Further, the specific operation of performing data clustering on the sample data in the data set in step S4 includes:
S41: the distance between nodes in the multidimensional space is calculated using the text cosine distance, with the formula:

Dis(x, y) = 1 - (Σ_{i=1}^{n} x_i * y_i) / (sqrt(Σ_{i=1}^{n} x_i^2) * sqrt(Σ_{i=1}^{n} y_i^2))    (4),

where x, y are two nodes in the multidimensional word vector space, n represents the vector space dimension, and x_i and y_i represent the values of the two vectors in dimension i;
S42: selecting a dynamically adjusted K value based on the silhouette coefficient, and clustering the data of each class of the upper region and the lower region according to the obtained K value;
S43: for each class of sample data in the upper region, an undersampling method is used to reduce the class data sets above the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be reduced in the i-th class data set being expressed as:

Q_{i,j} = |d_i| * M_{i,j} / X_i    (5),

wherein Q_{i,j} represents the amount of data to be reduced in the j-th cluster after clustering of the i-th class of data in the data set, and |d_i| represents the number of data items to be reduced in the i-th class data set above the average line L_avg;
S44: for each class of sample data in the lower region, an oversampling method is used to increase the class data sets below the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be added to the i-th class data set being expressed as:

N_{i,j} = |d_i| * (X_i - M_{i,j}) / (X_i * (K_i - 1))    (6),

wherein N_{i,j} represents the amount of data to be added to the j-th cluster after clustering of the i-th class of data in the data set, X_i represents the total number of samples in the i-th class data set, M_{i,j} represents the amount of data in the j-th cluster after clustering of the i-th class data set, and K_i represents the number of clusters in the clustering of the i-th class data set.
Further, the specific operation of step S42 includes:
S421: initializing the K value and the average silhouette coefficient S;
S422: based on the current K clusters, first calculating the distance between each node i and the cluster centroids and assigning the node to the cluster with the minimum distance, then calculating the average distance between all nodes in a cluster and its centroid according to formula (7) and selecting the node whose distance to the centroid is closest to this average distance as the new centroid for clustering, until the distance between the new and old centroids is less than a given threshold;

D_avg = (1 / |c_i|) * Σ_{p ∈ c_i} Dis(u_i, p)    (7),

wherein u_i represents the old centroid, c_i is the cluster in which node i is located, and p ranges over the other nodes in c_i;
S423: calculating the cohesion degree of each node x_i, i.e. the average distance from node x_i to the other nodes in the same cluster, with the formula:

a_i = (1 / (|c_i| - 1)) * Σ_{p ∈ c_i, p ≠ x_i} Dis(x_i, p)    (8),

wherein c_i is the cluster in which node i is located and p ranges over the other nodes in the same cluster as node i;
S424: calculating the separation degree of each node x_i, i.e. the average distance from node x_i to all nodes in the cluster c_m nearest to it, with the formula:

b_i = (1 / |c_m|) * Σ_{q ∈ c_m} Dis(x_i, q)    (9),

wherein c_m is the cluster nearest to node x_i and q ranges over the nodes in c_m, the nearest cluster c_m being calculated by:

c_m = argmin_{c_k ≠ c_i} (1 / |c_k|) * Σ_{q ∈ c_k} Dis(x_i, q)    (10);
S425: calculating the silhouette coefficient S_i of each node x_i and the average silhouette coefficient S; the larger the value of S, the better the clustering effect, and the calculation formula is:

S_i = (b_i - a_i) / max(a_i, b_i)    (11),

wherein S_i represents the silhouette coefficient of node i, the average silhouette coefficient S is the arithmetic mean of the silhouette coefficients of all nodes, and its value range is S ∈ [-1, 1];
S426: increasing the K value, repeating steps S422-S425 to calculate the average silhouette coefficient corresponding to each K value; after N iterations, the K corresponding to the maximum average silhouette coefficient is selected as the number of clusters, and the clustering result at that point is taken as the final result.
The invention has the beneficial effects that:
firstly, the algorithm of the invention improves the copy proportion of the small feature vector in the small sample due to clustering, thereby improving the prediction accuracy of the small feature.
Secondly, the invention introduces a dynamic K-means clustering method based on the silhouette coefficient to cluster the unbalanced data set, and uses the resulting clusters in a mixed sampling mode to achieve a balanced distribution of the text data set; the performance of the TextCNN model is effectively improved in terms of accuracy, F1 value and other aspects compared with the conventional method.
Drawings
FIG. 1 is a diagram of a TextCNN network model architecture;
FIG. 2 is a parameter diagram of a TextCNN model;
FIG. 3 is a schematic diagram of the arithmetic mean index obtained by four experimental methods of the present invention;
FIG. 4 is a diagram illustrating weighted average index values obtained by four experimental methods according to the present invention;
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following further describes the technical solution of the present invention with reference to the drawings and the embodiments.
In order to solve the imbalance of the text multi-classification training data set, the data set can be preprocessed by undersampling (down-sampling) or oversampling (up-sampling). Undersampling takes the number of samples in the smallest class as the standard and decrementally extracts data from the large classes, so that the large and small classes have the same scale and the total size of the data set decreases. Oversampling is the opposite: taking the number of samples in the largest class as the standard, small-class data is duplicated and increased, so that the small and large classes have the same scale and the total size of the data set increases. However, a text multi-classification data set has multiple classification labels, so processing based only on the smallest and largest classes leads to an unreasonable data set composition and affects the training result of the final deep learning model. Therefore, a clustering-based equipartition hybrid sampling algorithm (HCSA) is proposed herein for handling the multi-classification of text in unbalanced data sets.
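By way of contrast with the HCSA algorithm introduced below, the following Python sketch (assuming numpy; the class sizes are illustrative and not taken from this document) shows the plain random undersampling and oversampling just described, which also serve later as two of the comparison methods:

# Sketch of plain random undersampling / oversampling; numpy and the example
# class sizes are assumptions, not part of the patent.
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(samples, target):
    """Keep `target` randomly chosen items of a majority class."""
    idx = rng.choice(len(samples), size=target, replace=False)
    return [samples[i] for i in idx]

def random_oversample(samples, target):
    """Duplicate randomly chosen items of a minority class until `target` is reached."""
    extra = rng.choice(len(samples), size=target - len(samples), replace=True)
    return list(samples) + [samples[i] for i in extra]

major = ["major_%d" % i for i in range(5000)]
minor = ["minor_%d" % i for i in range(400)]
print(len(random_undersample(major, 400)), len(random_oversample(minor, 5000)))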
a text multi-classification method for an unbalanced data set based on a text multi-classification hybrid equipartition clustering sampling algorithm is used for calculating a feature set of unbalanced data aiming at a microblog disaster data set, and comprises the following steps:
s1: selecting a microblog disaster data set to carry out high-dimensional word segmentation vectorization;
in the text multi-classification processing, high-latitude vectorization representation needs to be carried out on each Word in a text, the typical methods include TF-IDF, skip-gram, CBOW and the like, and the Word vector processing of a text data set can adopt a custom model training mode or a classical Word vector model, such as Word2vec, GloVe, FastText and the like. However, in any way, when there are many text training sample data, the basic vocabulary involved in training is increasing. Especially when non-standard texts are processed (e.g. microblog texts), a large number of new vocabularies are encountered. In order to more accurately represent the relationship between texts, high-dimensional vectorization needs to be performed on words in the texts, each dimension represents a text feature, and only when the vector dimension of the words reaches a certain scale, text classification training sample data with the features having the discrimination can be provided. Google corporation trained a Word2vec Word vector model with 300 dimensions based on a large amount of general news material. Facebook trains the FastText model with 300 dimensions based on the general wiki news material. These high dimensional models provide a solid word vector basis for text multi-classification.
The high-dimensional word vector model only provides support for word segmentation vectorization; text multi-classification also needs a large amount of training sample data. At present, text multi-classification largely uses supervised learning, so the training data must be labeled. Multi-class labeling of a large amount of text data is difficult, and it is basically impossible to obtain completely balanced labeled text classification training data; as the number of classes increases, the balance of the training data set is challenged further. In practical text multi-classification research, unbalanced training data sets are therefore used heavily. The prediction accuracy on classes with few labeled samples plays an important role in the acceptance of a concrete application; for example, fraudulent mail in e-mail classification constitutes a small amount of labeled data relative to ordinary mail and advertising spam. The imbalance of data sets is thus a fundamental property of most text multi-classification tasks.
S2: calculating the arithmetic mean of the number of samples in all the different classes of the data set and taking it as the sample data volume average line, the calculation formula being expressed as:

L_avg = (1/N) * Σ_{i=1}^{N} X_i    (1),

wherein N represents the number of classification labels and X_i represents the total number of samples in class i;
S3: partitioning the classes of the data set according to the calculated sample data volume average line, defining the classes whose sample count is greater than the average line as the upper region z_up and the classes whose sample count is less than the average line as the lower region z_dn, and calculating the difference d_i between each class and the sample data volume average line:

d_i = X_i - L_avg    (2),

wherein X_i represents the total number of samples in class i and L_avg is the sample data volume average line;
s4: respectively adopting an undersampling method and an oversampling method based on K-means clustering to the upper region sample data and the lower region sample data in the data set, and carrying out data clustering on each classified sample until the sample data size of each class is the same as the sample data size average line, thereby realizing the balance of the sample data sizes of different classes;
S5: after the S4 clustering of each class of data in the data set, the data increase and decrease operations are carried out and a balanced mixed sample data set is formed, the number of samples of which is N × L_avg, expressed as follows:

Σ_{i ∈ z_up} (X_i - d_i) + Σ_{i ∈ z_dn} (X_i + |d_i|) = N × L_avg    (3),

wherein z_up denotes the upper region, z_dn denotes the lower region, and d_i represents the difference between each class and the sample data volume average line;
s6: and performing text classification on the formed mixed sample data set to obtain a text classification result.
In the unbalanced data set, the data of classes below the sample data volume average line are small-sample data, and the number of their data samples needs to be increased. The more data a cluster contains, the more similar the feature vectors of its data are, and the less data is added to it; the less data a cluster contains, the more particular its sample features are, and the more data is added to it. Conversely, the data of classes above the sample data volume average line are large-sample data, and the number of their data samples needs to be reduced in order to prevent over-fitting of the classification model and to improve training speed. The large-sample data are clustered with the silhouette-coefficient-based dynamic K-means method, and samples are removed according to the distribution of the clusters within each class: the more data a cluster contains, the more similar the feature vectors of its data are, and the more data is removed from it; the less data a cluster contains, the more particular its sample features are, and the less data is removed.
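By way of illustration, the following Python sketch (assuming numpy; the class counts are illustrative and not taken from Table 1) computes the sample data volume average line, the upper/lower partition and the differences d_i of equations (1)-(3):

# Sketch of equations (1)-(3): the sample data volume average line, the
# upper/lower partition and the per-class difference d_i; class counts are illustrative.
import numpy as np

counts = np.array([5200, 400, 900, 3100, 150, 2800, 1300, 600, 4500])  # X_i for 9 classes
L_avg = counts.mean()              # equation (1)
d = counts - L_avg                 # equation (2)
z_up = np.where(d > 0)[0]          # classes above the average line
z_dn = np.where(d < 0)[0]          # classes below the average line

# After sampling, every class holds L_avg samples, so the mixed data set
# contains N * L_avg samples in total (equation (3)).
print(L_avg, z_up.tolist(), z_dn.tolist(), len(counts) * L_avg)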
Further, the specific operation of performing data clustering on the sample data in the data set in step S4 includes:
S41: the distance between nodes in the multidimensional space is calculated using the text cosine distance, with the formula:

Dis(x, y) = 1 - (Σ_{i=1}^{n} x_i * y_i) / (sqrt(Σ_{i=1}^{n} x_i^2) * sqrt(Σ_{i=1}^{n} y_i^2))    (4),

where x, y are two nodes in the multidimensional word vector space, n represents the vector space dimension, and x_i and y_i represent the values of the two vectors in dimension i. As can be seen from formula (4), for two text vectors, the more similar the texts, the smaller Dis(x, y) and the closer the distance; when the two texts are completely consistent the distance is 0, and when they are completely different the maximum distance is 1, i.e. Dis(x, y) ∈ [0, 1];
S42: selecting a dynamically adjusted K value based on the silhouette coefficient, and clustering the data of each class of the upper region and the lower region according to the obtained K value;
S43: for each class of sample data in the upper region, an undersampling method is used to reduce the class data sets above the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be reduced in the i-th class data set being expressed as:

Q_{i,j} = |d_i| * M_{i,j} / X_i    (5),

wherein Q_{i,j} represents the amount of data to be reduced in the j-th cluster after clustering of the i-th class of data, and |d_i| represents the number of data items to be reduced in the i-th class data set above the sample data volume average line; data within a cluster are removed by random selection, i.e. Q_{i,j} of the M_{i,j} items of the cluster are chosen at random and discarded;
S44: for each class of sample data in the lower region, an oversampling method is used to increase the class data sets below the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be added to the i-th class data set being expressed as:

N_{i,j} = |d_i| * (X_i - M_{i,j}) / (X_i * (K_i - 1))    (6),

wherein N_{i,j} represents the amount of data to be added to the j-th cluster after clustering of the i-th class of data, X_i represents the total number of samples in the i-th class data set, M_{i,j} represents the amount of data in the j-th cluster after clustering of the i-th class data set, and K_i represents the number of clusters in the clustering of the i-th class data set. As can be seen from formula (6), after clustering, the more data a cluster contains, the less data is added to it, and the further a class lies from the average line L_avg, the more data is added overall. Data within a cluster are added by random duplication, i.e. N_{i,j} of the M_{i,j} items of the cluster are duplicated at random (a short sketch of these per-cluster amounts follows this list).
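The following Python sketch (assuming numpy) illustrates the text cosine distance of equation (4) and the per-cluster amounts of steps S43/S44 under the proportional allocation read into equations (5) and (6) above; since those two formulas are reconstructed from the surrounding description, the allocation functions below are an interpretation rather than a verbatim implementation, and the cluster sizes are illustrative:

# Sketch of equation (4) and of the per-cluster amounts of steps S43/S44,
# assuming the proportional allocation reconstructed in equations (5)-(6);
# cluster sizes and rounding are illustrative assumptions.
import numpy as np

def cosine_distance(x, y):
    """Equation (4): 1 - cosine similarity; lies in [0, 1] for non-negative vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - float(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))

def undersample_amounts(cluster_sizes, d_i):
    """Q_{i,j}: larger clusters in a majority class give up proportionally more samples."""
    M = np.asarray(cluster_sizes, dtype=float)
    return np.round(abs(d_i) * M / M.sum()).astype(int)

def oversample_amounts(cluster_sizes, d_i):
    """N_{i,j}: smaller clusters in a minority class are replicated proportionally more."""
    M = np.asarray(cluster_sizes, dtype=float)
    K = len(M)
    return np.round(abs(d_i) * (M.sum() - M) / (M.sum() * (K - 1))).astype(int)

print(cosine_distance([1, 0, 1], [1, 1, 0]))           # 0.5
print(undersample_amounts([1000, 600, 400], d_i=600))  # majority class sheds 600 samples
print(oversample_amounts([90, 40, 20], d_i=-300))      # minority class gains about 300 samples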
Further, in the K-means clustering, the K value represents the number of clusters. Since the clustering belongs to unsupervised learning, the optimal K value cannot be determined in advance, and the size of K directly influences the final clustering effect. Therefore the algorithm adopts dynamically adjusted K value selection and data clustering based on the silhouette coefficient, that is, the specific operation steps of step S42 include:
S421: initializing the K value and the average silhouette coefficient S. In a vector space with M nodes the number of clusters K ∈ [1, M]; the extreme possibilities are that all nodes fall into a single cluster, or that each node forms its own cluster independent of any other node. Therefore the K value is initialized to 2, starting from the smallest meaningful number of clusters, the average silhouette coefficient S is initialized to its minimum value -1, and two nodes are randomly selected as the centroid nodes of the initial clusters;
S422: based on the current K clusters, first calculating the distance between each node i and the cluster centroids and assigning the node to the cluster with the minimum distance; then calculating the average distance between all nodes in a cluster and its centroid according to formula (7), selecting the node whose distance to the centroid is closest to this average distance as the new centroid, and calculating the distance between the new and the old centroid; if this distance is smaller than a given threshold, whose value lies in the range (0, 1), the clustering ends, otherwise a new round of clustering is started with the new centroid as the core;

D_avg = (1 / |c_i|) * Σ_{p ∈ c_i} Dis(u_i, p)    (7),

wherein u_i represents the old centroid, c_i is the cluster in which node i is located, and p ranges over the other nodes in c_i;
S423: after clustering is finished, in order to dynamically optimize the selection of the K value using the silhouette coefficient, first calculating the cohesion degree a_i of each node x_i, i.e. the average distance from node x_i to the other nodes in the same cluster, with the formula:

a_i = (1 / (|c_i| - 1)) * Σ_{p ∈ c_i, p ≠ x_i} Dis(x_i, p)    (8),

wherein c_i is the cluster in which node i is located and p ranges over the other nodes in the same cluster as node i;
S424: calculating the separation degree b_i of each node x_i, i.e. the average distance from node x_i to all nodes in the cluster c_m nearest to it, with the formula:

b_i = (1 / |c_m|) * Σ_{q ∈ c_m} Dis(x_i, q)    (9),

wherein c_m is the cluster nearest to node x_i and q ranges over the nodes in c_m; the nearest cluster c_m is calculated by:

c_m = argmin_{c_k ≠ c_i} (1 / |c_k|) * Σ_{q ∈ c_k} Dis(x_i, q)    (10),

i.e. the average distance from x_i to all nodes of a candidate cluster is used as the measure of the distance from the point to that cluster, and the cluster with the minimum such average distance is selected as the nearest cluster;
It can be seen that the cohesion degree reflects the density within a cluster while the separation degree reflects the distance between clusters, so the higher the cohesion and the larger the distance between clusters, the better the clustering effect; the combination of cohesion and separation therefore forms the silhouette coefficient of a node, and the average silhouette coefficient of all nodes is calculated next;
S425: calculating the average silhouette coefficient S of all nodes, i.e. the arithmetic mean of the silhouette coefficients of all nodes; the larger the value of S, the better the clustering effect, and the calculation formula is:

S_i = (b_i - a_i) / max(a_i, b_i)    (11),

wherein S_i represents the silhouette coefficient of node i, S is the arithmetic mean of the S_i over all nodes, and the value range of the silhouette coefficient is S ∈ [-1, 1];
S426: increasing the K value, repeating steps S422-S425 to calculate the average silhouette coefficient corresponding to each K value; after N iterations, the K corresponding to the maximum average silhouette coefficient is selected as the number of clusters, and the clustering result at that point is taken as the final result.
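For orientation, the following Python sketch selects K by the average silhouette coefficient in the spirit of steps S421-S426, using scikit-learn's KMeans and silhouette_score (with a cosine metric for the score) as stand-ins for the cosine-distance clustering described above; the search range and the random data are assumptions:

# Sketch of the silhouette-driven choice of K, using scikit-learn's KMeans and
# silhouette_score as stand-ins for the cosine-distance clustering described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_max=10):
    """Return (K, average silhouette S, labels) for the K with the largest S."""
    best = (2, -1.0, None)
    for k in range(2, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        s = silhouette_score(X, labels, metric="cosine")  # S of equation (11)
        if s > best[1]:
            best = (k, s, labels)
    return best

X = np.random.default_rng(0).random((200, 300))  # illustrative 300-dimensional word vectors
k, s, _ = best_k(X)
print(k, round(float(s), 3))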
Example:
1. data set selection:
convolutional Neural Networks (CNNs) are a classical neural network model in machine learning and have been successfully applied in a number of fields. For natural language analysis, CNN generally adopts a one-dimensional model structure, which can be modified into parallel text classification convolutional neural network TextCNN, and the model structure is shown in fig. 1. The input text can be processed by a plurality of parallel convolution layers, the maximum pooling layer can adopt the schemes of step length 3, step length 4 and step length 5 to process data, the purpose is to extract the text characteristic information of different word intervals, and finally the characteristic information is summarized by the tiling layer. In order to guarantee the operation efficiency of the model, the TextCNN model designed herein adopts a one-dimensional convolution model structure with 3 parallel convolution layers according to the high-dimensional character of the word vector, as shown in fig. 2. The convolutional layer in the model has an input dimension of (50,300) structure and an output of (50,256). The convolution layer activation function adopts a 'relu' function, the output layer activation function adopts a 'softmax' function, the optimizer adopts 'adam', and the loss function adopts 'coordinated _ cross'.
Text multi-classification algorithms are generally optimized for a particular text data set. The optimization here is aimed at a microblog disaster data set. This data set comes from the CrisisNLP website (https://crisisnlp.qcri.org). It provides about 21,000 disaster-related microblog posts from 2013 to 2015, which were manually labeled with multiple classes. The labeled sample data are shown in Table 1. The labels cover 9 types of information, including injury, death, missing, found, personnel placement, evacuation and the like. The number of samples in the largest class is about 13 times that of the smallest class; 5 classes lie below the data set mean line and 4 classes lie above it, so this is a typical unbalanced text data set. For data set preprocessing, because a microblog post cannot exceed 140 characters, keyword extraction is performed before text vectorization; to ensure that the extracted keywords represent the target class of a post, the average number of words per post, 50, was finally chosen as the parameter through statistical analysis, i.e. the top 50 words by word frequency are used as the keywords of a post.
TABLE 1 microblog disaster dataset calibration
2. Evaluation index
The evaluation indexes of a machine learning algorithm generally adopt the accuracy (Acc), precision (P), recall (R) and F1 values. In text multi-classification, the precision, recall and F1 values can be calculated either as arithmetic means (P_m, R_m and F1_m) or as weighted averages (P_w, R_w and F1_w), as shown in equations (12) and (13):
P_m = (1/N) * Σ_{i=1}^{N} P_i,    R_m = (1/N) * Σ_{i=1}^{N} R_i,    F1_m = (1/N) * Σ_{i=1}^{N} F1_i    (12),
wherein P_i is the precision of class i, i.e. "the number of correct predictions of this class / the number of all predictions of this class"; R_i is the recall of class i, i.e. "the number of correct predictions of this class / the number of all samples of this class"; F1_i is 2(P_i * R_i)/(P_i + R_i); α_i is the proportion of the samples of class i in the total samples; and N is the total number of classes.
P_w = Σ_{i=1}^{N} α_i * P_i,    R_w = Σ_{i=1}^{N} α_i * R_i,    F1_w = Σ_{i=1}^{N} α_i * F1_i    (13).
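These arithmetic-mean (macro) and weighted-average indices can be computed, for example, with scikit-learn, as the following sketch shows (the labels are illustrative):

# Sketch of equations (12)-(13): macro and weighted precision, recall and F1,
# assuming scikit-learn; labels are illustrative.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)
p_m, r_m, f1_m, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
p_w, r_w, f1_w, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(acc, round(f1_m, 3), round(f1_w, 3))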
3. The test steps are as follows:
firstly, performing Word segmentation vectorization on a microblog disaster data set based on a Word2vec model, wherein the dimension of each text is (50,300), 50 represents a keyword in the text, and if the number of the keywords is less than 50, zero padding is performed for processing. 300 represents the dimension of each word, i.e. the word feature vector space is 300 dimensions. 21125 microblog data were used together in the experiment, of which 90% were used for model training and 10% were used for model testing. The output dimension of the TextCNN model data is (9, 1), which means that the data is 9 rows and 1 columns and is classified into 9;
secondly, calculating the arithmetic mean of the number of all different classified samples in the microblog disaster data set, taking the arithmetic mean as a sample data volume average line, partitioning each classified sample in the data set according to the calculated sample data volume average line, and calculating the difference d between each classified sample and the sample data volume average linei
Thirdly, respectively carrying out data clustering on each classified sample by adopting an undersampling method and an oversampling method based on K-means clustering on the upper region sample data and the lower region sample data of the partitions in the data set until the sample amount of each type of sample is the same as the average sample amount line, and finally obtaining a new sample data set;
finally, in order to verify the performance of the algorithm, 4 methods are respectively carried out for comparing experimental data, wherein the first method is a conventional method and under-sampling or over-sampling is not carried out on a data set; the second is a random undersampling mode, the data quantity of the minimum classification data set is taken as a standard, and other classification data sets are subjected to random undersampling; the third is a random oversampling mode, the data quantity of the maximum classification data set is taken as a standard, and other classification data sets are subjected to random replication oversampling; the fourth is the HCSA sampling method proposed herein.
4. Analysis of results
The confusion matrix obtained by the above 4 experimental methods is shown in tables a-d, where table a is the confusion matrix of the results of the conventional method, table b is the confusion matrix of the results of the under-sampling method, table c is the confusion matrix of the results of the over-sampling method, and table d is the confusion matrix of the results of the HCSA method:
TABLE a (conventional method)
TABLE b (undersampling method)
TABLE c (oversampling method)
TABLE d (HCSA method)
The evaluation index values corresponding to the classification data sets calculated based on the confusion matrix of various methods are shown in table 2:
TABLE 2 evaluation index values of the methods
As can be seen from Tables a-d, class 5 is the smallest data set and class 7 is the largest. Combining this with Table 2: for class 5, the smallest data set, the F1 value is largest when the HCSA algorithm is used, i.e. the HCSA algorithm effectively improves the prediction precision and recall of the small-sample class; and for class 7, the largest data set, the F1 value is also largest when the HCSA algorithm is used, and the precision and recall performance are not reduced.
FIG. 3 and FIG. 4 show the arithmetic-mean index values and the weighted-average index values of each method, respectively. As can be seen from the figures, the accuracy and F1 value of the HCSA algorithm are the highest, the performance of the oversampling method is similar to that of the conventional method, and the index values of the undersampling method are the lowest. Undersampling causes a severe performance drop because training sample data are discarded at random. Oversampling adds training data, but random duplication cannot, beyond a certain point, increase the rare feature vectors in the text vector space; moreover, the large amount of duplicated data causes the TextCNN model to over-fit during training. This shows that merely increasing the amount of training data, if done unreasonably, makes the model over-fit and cannot improve its prediction performance. In the HCSA algorithm, because of the clustering, the duplication proportion of rare feature vectors within the small-sample classes is raised, so the prediction accuracy for these rare features can be improved. For large-sample data, training data are discarded in order to prevent over-fitting, but this does not cause the index values to drop as much as undersampling, because the discarding is based on the clustering result: the more data a cluster contains, the larger the proportion discarded from it. In this way, the balanced distribution of the various features in the data set is finally ensured.
The experimental results show that introducing the dynamic clustering method into the HCSA algorithm allows the data set to be further differentiated based on the high-dimensional features of the text, provides a foundation for undersampling and oversampling, and finally achieves a balanced distribution of the feature vectors of the text training data set in the high-dimensional vector space, thereby supporting improved text multi-classification performance. Taking the microblog disaster data set as an example, the performance of the HCSA algorithm on the TextCNN model is verified. Experiments prove that, compared with the conventional method, the oversampling method and the undersampling method, the performance of the algorithm is effectively improved in terms of accuracy, F1 value and other aspects.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (3)

1. A text multi-classification method for an unbalanced data set based on a text multi-classification hybrid equipartition clustering sampling algorithm is characterized in that a feature set of unbalanced data is calculated for a microblog disaster data set, and comprises the following steps:
s1: selecting a microblog disaster data set to carry out high-dimensional word segmentation vectorization;
S2: calculating the arithmetic mean of the number of samples in all the different classes of the data set and taking it as the sample data volume average line, the calculation formula being expressed as:

L_avg = (1/N) * Σ_{i=1}^{N} X_i    (1),

wherein N represents the number of classification labels and X_i represents the total number of samples in class i;
S3: partitioning the classes of the data set according to the calculated sample data volume average line, defining the classes whose sample count is greater than the average line as the upper region z_up and the classes whose sample count is less than the average line as the lower region z_dn, and calculating the difference d_i between each class and the sample data volume average line:

d_i = X_i - L_avg    (2),

wherein X_i represents the total number of samples in class i and L_avg is the sample data volume average line;
s4: respectively adopting an undersampling method and an oversampling method based on K-means clustering to the upper area sample data and the lower area sample data in the data set, and carrying out data clustering on each classified sample until the sample data volume of each type is the same as the sample data volume average line;
S5: after the S4 clustering is performed on each class of data in the data set, a mixed sample data set is formed, and the number of samples of the new sample data set is N × L_avg, expressed as follows:

Σ_{i ∈ z_up} (X_i - d_i) + Σ_{i ∈ z_dn} (X_i + |d_i|) = N × L_avg    (3),

wherein z_up denotes the upper region, z_dn denotes the lower region, and d_i represents the difference between each class and the sample data volume average line;
s6: and performing text classification on the formed mixed sample data set to obtain a text classification result.
2. The text multi-classification method for an unbalanced data set based on a text multi-classification hybrid equipartition clustering sampling algorithm according to claim 1, wherein the specific operation of data clustering on the sample data in the data set in step S4 comprises:
S41: the distance between nodes in the multidimensional space is calculated using the text cosine distance, with the formula:

Dis(x, y) = 1 - (Σ_{i=1}^{n} x_i * y_i) / (sqrt(Σ_{i=1}^{n} x_i^2) * sqrt(Σ_{i=1}^{n} y_i^2))    (4),

where x, y are two nodes in the multidimensional word vector space, n represents the vector space dimension, and x_i and y_i represent the values of the two vectors in dimension i;
S42: selecting a dynamically adjusted K value based on the silhouette coefficient, and clustering the data of each class of the upper region and the lower region according to the obtained K value;
S43: for each class of sample data in the upper region, an undersampling method is used to reduce the class data sets above the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be reduced in the i-th class data set being expressed as:

Q_{i,j} = |d_i| * M_{i,j} / X_i    (5),

wherein Q_{i,j} represents the amount of data to be reduced in the j-th cluster after clustering of the i-th class of data in the data set, and |d_i| represents the number of data items to be reduced in the i-th class data set above the average line L_avg;
S44: for each class of sample data in the lower region, an oversampling method is used to increase the class data sets below the sample data volume average line L_avg by |d_i| items, the formula for the amount of data to be added to the i-th class data set being expressed as:

N_{i,j} = |d_i| * (X_i - M_{i,j}) / (X_i * (K_i - 1))    (6),

wherein N_{i,j} represents the amount of data to be added to the j-th cluster after clustering of the i-th class of data in the data set, X_i represents the total number of samples in the i-th class data set, M_{i,j} represents the amount of data in the j-th cluster after clustering of the i-th class data set, and K_i represents the number of clusters in the clustering of the i-th class data set.
3. The method of claim 1, wherein the specific operation of step S42 includes:
S421: initializing the K value and the average silhouette coefficient S;
S422: based on the current K clusters, first calculating the distance between each node i and the cluster centroids and assigning the node to the cluster with the minimum distance, then calculating the average distance between all nodes in a cluster and its centroid according to formula (7) and selecting the node whose distance to the centroid is closest to this average distance as the new centroid for clustering, until the distance between the new and old centroids is less than a given threshold;

D_avg = (1 / |c_i|) * Σ_{p ∈ c_i} Dis(u_i, p)    (7),

wherein u_i represents the old centroid, c_i is the cluster in which node i is located, and p ranges over the other nodes in c_i;
S423: calculating the cohesion degree of each node x_i, i.e. the average distance from node x_i to the other nodes in the same cluster, with the formula:

a_i = (1 / (|c_i| - 1)) * Σ_{p ∈ c_i, p ≠ x_i} Dis(x_i, p)    (8),

wherein c_i is the cluster in which node i is located and p ranges over the other nodes in the same cluster as node i;
S424: calculating the separation degree of each node x_i, i.e. the average distance from node x_i to all nodes in the cluster c_m nearest to it, with the formula:

b_i = (1 / |c_m|) * Σ_{q ∈ c_m} Dis(x_i, q)    (9),

wherein c_m is the cluster nearest to node x_i and q ranges over the nodes in c_m, the nearest cluster c_m being calculated by:

c_m = argmin_{c_k ≠ c_i} (1 / |c_k|) * Σ_{q ∈ c_k} Dis(x_i, q)    (10);
S425: calculating the silhouette coefficient S_i of each node x_i and the average silhouette coefficient S; the larger the value of S, the better the clustering effect, and the calculation formula is:

S_i = (b_i - a_i) / max(a_i, b_i)    (11),

wherein S_i represents the silhouette coefficient of node i, the average silhouette coefficient S is the arithmetic mean of the silhouette coefficients of all nodes, and its value range is S ∈ [-1, 1];
S426: increasing the K value, repeating steps S422-S425 to calculate the average silhouette coefficient corresponding to each K value; after N iterations, the K corresponding to the maximum average silhouette coefficient is selected as the number of clusters, and the clustering result at that point is taken as the final result.
CN202010646859.5A 2020-07-07 2020-07-07 Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm Pending CN111831822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010646859.5A CN111831822A (en) 2020-07-07 2020-07-07 Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010646859.5A CN111831822A (en) 2020-07-07 2020-07-07 Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm

Publications (1)

Publication Number Publication Date
CN111831822A true CN111831822A (en) 2020-10-27

Family

ID=72900432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010646859.5A Pending CN111831822A (en) 2020-07-07 2020-07-07 Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm

Country Status (1)

Country Link
CN (1) CN111831822A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365060A (en) * 2020-11-13 2021-02-12 广东电力信息科技有限公司 Preprocessing method for power grid internet of things perception data
CN112365060B (en) * 2020-11-13 2024-01-26 广东电力信息科技有限公司 Preprocessing method for network Internet of things sensing data
CN116401362A (en) * 2023-01-16 2023-07-07 之江实验室 Unbalanced data set-oriented green molten industry classification method and device based on dynamic clustering
CN117235270A (en) * 2023-11-16 2023-12-15 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment
CN117235270B (en) * 2023-11-16 2024-02-02 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment

Similar Documents

Publication Publication Date Title
CN107944480B (en) Enterprise industry classification method
CN109241530B (en) Chinese text multi-classification method based on N-gram vector and convolutional neural network
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
Jiang et al. An improved K-nearest-neighbor algorithm for text categorization
CN111831822A (en) Text multi-classification method for unbalanced data set based on text multi-classification mixed equipartition clustering sampling algorithm
CN110929029A (en) Text classification method and system based on graph convolution neural network
WO2022126810A1 (en) Text clustering method
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
CN110795564B (en) Text classification method lacking negative cases
CN109299263B (en) Text classification method and electronic equipment
CN111368891A (en) K-Means text classification method based on immune clone wolf optimization algorithm
CN112231477A (en) Text classification method based on improved capsule network
CN115688024A (en) Network abnormal user prediction method based on user content characteristics and behavior characteristics
Belhaouari et al. Optimized K‐Means Algorithm
CN112417152A (en) Topic detection method and device for case-related public sentiment
Maddumala A Weight Based Feature Extraction Model on Multifaceted Multimedia Bigdata Using Convolutional Neural Network.
CN113626604B (en) Web page text classification system based on maximum interval criterion
CN108268461A (en) A kind of document sorting apparatus based on hybrid classifer
CN112417082B (en) Scientific research achievement data disambiguation filing storage method
Elgeldawi et al. Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN117273134A (en) Zero-sample knowledge graph completion method based on pre-training language model
CN116881451A (en) Text classification method based on machine learning
Sudha Semi supervised multi text classifications for telugu documents
Huang et al. An empirical study on the classification of Chinese news articles by machine learning and deep learning techniques

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201027

RJ01 Rejection of invention patent application after publication