CN117235270B

CN117235270B - Text classification method and device based on belief confusion matrix and computer equipment

Info

Publication number: CN117235270B
Application number: CN202311526182.1A
Authority: CN
Inventors: 孙建彬; 姚雪湄; 杨克巍; 李自拓; 姜江; 于海跃; 赵蕊蕊; 剧伦豪; 秦宇琪
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-11-16
Filing date: 2023-11-16
Publication date: 2024-02-02
Anticipated expiration: 2043-11-16
Also published as: CN117235270A

Abstract

The application relates to a text classification method, device and computer equipment based on a confidence (belief) confusion matrix. A first confidence confusion matrix is obtained for each candidate classification algorithm before it is attacked. On the basis of this matrix, the centroid between the correct prediction and the wrong prediction for the same real label is introduced, centroid offset quadrilaterals are formed, and the total area of the centroid offset quadrilaterals in the first confidence confusion matrix is calculated. A second confidence confusion matrix is then obtained after the candidate classification algorithm is attacked, and centroid offset quadrilaterals are constructed in the same way. The difference in adversarial robustness between candidate classification algorithms can be seen intuitively from the difference between the two total areas. Because confidence is introduced, the area of the centroid offset quadrilateral is a more faithful and accurate visual index of adversarial robustness, the classification algorithm with the best adversarial robustness can be screened out quickly and systematically to execute the text classification task, and the stability of the text classification effect can be ensured.

Description

Text classification method and device based on belief confusion matrix and computer equipment

Technical Field

The present disclosure relates to the field of text classification technologies, and in particular, to a text classification method, apparatus, and computer device based on a belief confusion matrix.

Background

Equipment intelligence refers to the organic embedding of artificial intelligence algorithms into an existing equipment system in order to improve its capabilities of recognition, reasoning, judgment, decision-making, control and environmental adaptation. Traditional equipment test and evaluation methods have difficulty measuring how intelligent a piece of intelligent equipment is, and have difficulty analyzing and evaluating its performance and reliability accurately. To evaluate intelligent equipment, the embedded machine learning algorithm needs to be evaluated so as to describe the degree of intelligence of the equipment. At present, research on evaluating the intelligence level of algorithms is mainly conducted along dimensions such as adversarial robustness, data security, fairness and interpretability.

Currently, methods for evaluating adversarial robustness fall mainly into two types: benchmark evaluation and index evaluation. The former obtains a benchmark ranking by performing adversarial training with different attack and defense algorithms and compares the strength of adversarial robustness; the latter focuses on the generation process of adversarial examples and proposes a series of evaluation indexes, so that the measurement of adversarial robustness is more comprehensive and reasonable. According to the stage concerned (model input, training, decision-making and so on), the indexes can be subdivided into model-oriented indexes and data-oriented indexes. For example, indexes such as the classification accuracy on adversarial examples and the average confidence of the adversarial class measure adversarial robustness from the output of the model in an adversarial environment; indexes such as neuron sensitivity and the CLEVER score observe the response of the model to adversarial examples and thereby measure its adversarial robustness; indexes such as k-multisection neuron coverage and average structural similarity measure the quality of the training data from the angles of test sufficiency and visual imperceptibility, and thereby measure the adversarial robustness of the model indirectly. However, these evaluation methods focus either on quantifying adversarial robustness or on visualizing it, and none of them achieves an evaluation that is both quantifiable and visual.

The confusion matrix is a summary matrix of the classification results of an intelligent classification algorithm. Its rows and columns represent the real classes and the predicted classes of the samples respectively, and the value in each cell is the number of samples of the real class that were predicted as the corresponding class. The confusion matrix is therefore widely used for accuracy evaluation of multi-class tasks, performance evaluation of intelligent classification algorithms and the like, and is one of the important means of measuring adversarial robustness from the output of an intelligent classification algorithm.

However, the confusion matrix focuses on quantifying and visualizing the adversarial robustness of a single intelligent classification algorithm. It is limited in visualizing the difference in adversarial robustness between different classification algorithms and cannot describe that difference intuitively and quantitatively. As a result, the performance of several candidate classification algorithms cannot be compared simply and quickly in order to embed the one with the best adversarial robustness into the intelligent equipment, and the stability of the text classification effect is difficult to ensure.

The text sample may be an e-mail, and the classification labels may include important, unimportant and uncertain, or recruitment mail, training-agency mail, business-information mail, advertisement mail, subscription mail and so forth. If a text classification algorithm with poor adversarial robustness is attacked, mail categories may be misjudged, so that the user cannot see the required mail in time. For example, a user who is job hunting cares about recruitment mail; a prior-art method may classify recruitment mail as training-agency mail, so that the user misses recruitment information, which causes great inconvenience. It is therefore necessary to ensure that text systems such as mail systems embed a text classification algorithm with strong adversarial robustness, so as to meet user needs and improve the user experience.

Disclosure of Invention

Based on the above, it is necessary to provide a text classification method, device and computer equipment based on a confidence confusion matrix, which can rapidly screen out the classification algorithm with the best adversarial robustness for text classification, thereby improving the stability of the text classification effect.

A text classification method based on a belief confusion matrix, the method comprising:

Acquiring an evaluation text sample set; the evaluation text samples in the evaluation text sample set all have corresponding real classification labels;

performing prediction classification on the evaluation text sample set with each of a plurality of text classification algorithms, obtaining a first confidence confusion matrix corresponding to each text classification algorithm based on the prediction classification results, calculating a first total area of a plurality of centroid offset quadrilaterals in the first confidence confusion matrix, and screening a preset number of candidate text classification algorithms from the plurality of text classification algorithms according to the size of the first total area; in the confidence confusion matrix, the abscissa of the center point of each square is the real classification label of the evaluation text samples, and the ordinate is the prediction classification label of the evaluation text samples; the weight of the center point of each square is the evaluation text sample count in the corresponding square; the evaluation text sample count is calculated according to the confidence corresponding to the prediction classification label; the number of centroid offset quadrilaterals is $N(N-1)/2$, where $N$ represents the number of classification labels;

wherein the step of calculating the area of a centroid offset quadrilateral comprises: determining the construction area of the current centroid offset quadrilateral according to the squares corresponding to any two real classification labels in the confidence confusion matrix and the two prediction classification labels with the same coordinate values; within the construction area, determining the abscissas of the two offset centroids according to the two real classification labels respectively; determining the ordinates of the two offset centroids according to the ordinates and the weights of the center points of the two squares that correspond to each real classification label and the two prediction classification labels; determining the positions of the two offset centroids according to their abscissas and ordinates; and constructing the centroid offset quadrilateral from the two offset centroids and the center points of the two correctly predicted squares, and calculating its area;

adding deceptive text samples, generated by a preselected attack algorithm, while each candidate text classification algorithm performs prediction classification on the evaluation text sample set, obtaining a second confidence confusion matrix corresponding to each candidate text classification algorithm, and calculating a second total area of a plurality of centroid offset quadrilaterals in the second confidence confusion matrix;

comparing the adversarial robustness of the candidate text classification algorithms according to the difference between the first total area and the second total area of the centroid offset quadrilaterals;

and embedding the candidate classification algorithm with the best adversarial robustness into the intelligent equipment to classify the target text.

A text classification device based on a belief confusion matrix, the device comprising:

the evaluation text sample set acquisition module is used for acquiring an evaluation text sample set; the evaluation text samples in the evaluation text sample set all have corresponding real classification labels;

the first total area calculation module is used for performing prediction classification on the evaluation text sample set with each of a plurality of text classification algorithms, obtaining a first confidence confusion matrix corresponding to each text classification algorithm based on the prediction classification results, calculating a first total area of a plurality of centroid offset quadrilaterals in the first confidence confusion matrix, and screening a preset number of candidate text classification algorithms from the plurality of text classification algorithms according to the size of the first total area; in the confidence confusion matrix, the abscissa of the center point of each square is the real classification label of the evaluation text samples, and the ordinate is the prediction classification label of the evaluation text samples; the weight of the center point of each square is the evaluation text sample count in the corresponding square; the evaluation text sample count is calculated according to the confidence corresponding to the prediction classification label; the number of centroid offset quadrilaterals is $N(N-1)/2$, where $N$ represents the number of classification labels;

the second total area calculation module is used for adding deceptive text samples, generated by a preselected attack algorithm, while each candidate text classification algorithm performs prediction classification on the evaluation text sample set, obtaining a second confidence confusion matrix corresponding to each candidate text classification algorithm, and calculating a second total area of a plurality of centroid offset quadrilaterals in the second confidence confusion matrix;

the adversarial robustness comparison module is used for comparing the adversarial robustness of the candidate text classification algorithms according to the difference between the first total area and the second total area of the centroid offset quadrilaterals;

and the target text classification module is used for embedding the candidate classification algorithm with the best adversarial robustness into the intelligent equipment to classify the target text.

A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:

performing prediction classification on the evaluation text sample set with each of a plurality of text classification algorithms, obtaining a first confidence confusion matrix corresponding to each text classification algorithm based on the prediction classification results, calculating a first total area of a plurality of centroid offset quadrilaterals in the first confidence confusion matrix, and screening a preset number of candidate text classification algorithms from the plurality of text classification algorithms according to the size of the first total area; in the confidence confusion matrix, the abscissa of the center point of each square is the real classification label of the evaluation text samples, and the ordinate is the prediction classification label of the evaluation text samples; the weight of the center point of each square is the evaluation text sample count in the corresponding square; the evaluation text sample count is calculated according to the confidence corresponding to the prediction classification label; the number of centroid offset quadrilaterals is $N(N-1)/2$, where $N$ represents the number of classification labels;

According to the text classification method, device, computer equipment and storage medium based on the confidence confusion matrix, a first confidence confusion matrix corresponding to each text classification algorithm is first obtained. On the basis of this matrix, the centroid between the correct prediction and the incorrect prediction for the same real label is introduced, a centroid offset quadrilateral is formed from the centroids corresponding to two classification labels, and the total area of the centroid offset quadrilaterals in the first confidence confusion matrix is calculated. This total area can be used to judge the classification accuracy of the corresponding text classification algorithm: the higher the classification accuracy, the smaller the evaluation text sample count in the incorrectly predicted squares, the closer each offset centroid is to the center point of the correctly predicted square, and the smaller the area of the corresponding centroid offset quadrilateral. A set of candidate text classification algorithms can thus be screened from the initial set of text classification algorithms, which ensures the accuracy of text classification at the first step. Each candidate classification algorithm is then attacked with an attack algorithm; that is, the attack algorithm generates deceptive text samples, which are input into the candidate text classification algorithm together with the evaluation text sample set to obtain the second confidence confusion matrix corresponding to that candidate. After a classification algorithm is attacked, text samples that were predicted correctly are transferred into incorrectly predicted squares. Centroid offset quadrilaterals are constructed in the same way, so the worse the adversarial robustness of the classification algorithm, the larger the amount of transferred text samples and the larger the difference between the total area of the centroid offset quadrilaterals after the attack and that before the attack. The difference in adversarial robustness between candidate classification algorithms can therefore be seen intuitively from the difference between the total areas before and after the attack, which provides strong visual support for evaluating adversarial robustness. Meanwhile, because classification confidence is introduced into the confusion matrix and into the calculation of the centroid offset quadrilateral area, the area is a more faithful and accurate visual index of adversarial robustness, the classification algorithm with the best adversarial robustness can be screened out quickly and systematically to execute the text classification task, and the stability of the text classification effect can be ensured.

Drawings

FIG. 1 is a flow diagram of a text classification method based on a belief confusion matrix;

FIG. 2 is a schematic diagram of a two-class confusion matrix;

FIG. 3 is an internal block diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

In one embodiment, as shown in fig. 1, there is provided a text classification method based on a confidence confusion matrix, including the steps of:

step 102, acquiring an evaluation text sample set.

Wherein, all the evaluation text samples in the evaluation text sample set have corresponding real classification labels.

Step 104, performing prediction classification on the evaluation text sample set with each of a plurality of text classification algorithms, obtaining a first confidence confusion matrix corresponding to each text classification algorithm based on the prediction classification results, calculating a first total area of a plurality of centroid offset quadrilaterals in the first confidence confusion matrix, and screening a preset number of candidate text classification algorithms from the plurality of text classification algorithms according to the size of the first total area.

In the confidence confusion matrix, the abscissa of the center point of each square is the real classification label of the evaluation text samples, and the ordinate is the prediction classification label of the evaluation text samples.

For $N$-class classification, the confusion matrix is an $N \times N$ matrix used to record the prediction results of the intelligent classification algorithm. Taking two classes as an example, the confusion matrix is a $2 \times 2$ matrix, a schematic diagram of which is shown in FIG. 2. In the confusion matrix, each row represents a real classification label and each column represents a prediction classification label. The four indexes of the two-class confusion matrix are as follows: TP (True Positive) represents samples with a real classification label of 0 and a prediction classification label of 0; FN (False Negative) represents samples with a real classification label of 0 and a prediction classification label of 1; FP (False Positive) represents samples with a real classification label of 1 and a prediction classification label of 0; TN (True Negative) represents samples with a real classification label of 1 and a prediction classification label of 1.
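As a minimal sketch of the ordinary (count-based) confusion matrix described above, the following Python snippet tallies samples per (real label, prediction label) cell. The function name `confusion_matrix` and the sample data are hypothetical and chosen only for illustration; label 0 plays the role of the positive class, as in the two-class example.

```python
from typing import List, Sequence


def confusion_matrix(true_labels: Sequence[int],
                     pred_labels: Sequence[int],
                     n_classes: int) -> List[List[int]]:
    """Count samples per cell; rows are real labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, pred_labels):
        matrix[t][p] += 1
    return matrix


# Hypothetical two-class data (label 0 is treated as the positive class).
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 0]
cm = confusion_matrix(true_labels, pred_labels, n_classes=2)

tp, fn = cm[0]  # real label 0: predicted 0 (TP), predicted 1 (FN)
fp, tn = cm[1]  # real label 1: predicted 0 (FP), predicted 1 (TN)
print(cm, tp, fn, fp, tn)
```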

The weight of the center point of each square is the evaluation text sample count in the corresponding square, and this count is calculated according to the confidence corresponding to the prediction classification label. In practice, the output of some intelligent classification algorithms is a prediction confidence for each category. For example, when a text classification algorithm classifies a three-class data set with labels 0, 1 and 2, the output for a given sample is the set of prediction confidences corresponding to the three categories.
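One possible reading of the confidence-based weighting can be sketched as follows. The text does not spell out the exact rule, so this sketch assumes that each sample contributes the confidence of its predicted label (the label with the highest confidence) to the cell given by its real label and predicted label. The function name `confidence_confusion_matrix` and all numbers are hypothetical.

```python
from typing import List, Sequence


def confidence_confusion_matrix(true_labels: Sequence[int],
                                confidences: Sequence[Sequence[float]],
                                n_classes: int) -> List[List[float]]:
    """Accumulate, per (real, predicted) cell, the confidence of the predicted label.

    Assumption: the predicted label is the arg-max of the confidence vector,
    and the cell weight is the sum of those top confidences.
    """
    matrix = [[0.0] * n_classes for _ in range(n_classes)]
    for t, conf in zip(true_labels, confidences):
        pred = max(range(n_classes), key=lambda k: conf[k])
        matrix[t][pred] += conf[pred]
    return matrix


# Hypothetical three-class outputs for labels 0, 1 and 2.
true_labels = [0, 0, 1, 2]
confidences = [
    [0.7, 0.2, 0.1],  # real 0, predicted 0
    [0.3, 0.6, 0.1],  # real 0, predicted 1
    [0.1, 0.8, 0.1],  # real 1, predicted 1
    [0.2, 0.3, 0.5],  # real 2, predicted 2
]
ccm = confidence_confusion_matrix(true_labels, confidences, n_classes=3)
print(ccm)
```

Under this assumption a confidently wrong prediction weighs more than a hesitant one, which is the kind of information a plain count-based confusion matrix discards.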

It will be appreciated that in a confusion matrix the number of centroid offset quadrilaterals is $N(N-1)/2$, where $N$ represents the number of classification labels. In the confusion matrix of an $N$-class problem, there is exactly one centroid offset quadrilateral between any two different classes, that is, classification labels, so there are $N(N-1)/2$ centroid offset quadrilaterals in total.
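The count of quadrilaterals is simply the number of unordered label pairs, which can be checked with a short Python sketch. The helper name `label_pairs` is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def label_pairs(n_classes: int) -> List[Tuple[int, int]]:
    """Return every unordered pair of distinct class labels (one quadrilateral each)."""
    return list(combinations(range(n_classes), 2))


pairs = label_pairs(4)
print(pairs)       # the six label pairs of a four-class problem
print(len(pairs))  # equals 4 * (4 - 1) / 2
```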

Wherein the step of calculating the area of a centroid offset quadrilateral comprises: determining the construction area of the current centroid offset quadrilateral according to the squares corresponding to any two real classification labels in the confidence confusion matrix and the two prediction classification labels with the same coordinate values; within the construction area, determining the abscissas of the two offset centroids according to the two real classification labels respectively; determining the ordinates of the two offset centroids according to the ordinates and the weights of the center points of the two squares that correspond to each real classification label and the two prediction classification labels; determining the positions of the two offset centroids according to their abscissas and ordinates; and constructing the centroid offset quadrilateral from the two offset centroids and the center points of the two correctly predicted squares, and calculating its area.

It can be understood that the total area of the centroid offset quadrilaterals corresponding to the first confidence confusion matrix can be used to judge the classification accuracy of the corresponding text classification algorithm. The higher the classification accuracy, the smaller the evaluation text sample count in the incorrectly predicted squares and the smaller their weight, the closer the ordinate of each offset centroid is to the ordinate of the center point of the correctly predicted square, and the smaller the area of the corresponding centroid offset quadrilateral. According to this scheme, a set of candidate text classification algorithms can first be screened from the initial set of text classification algorithms, which ensures the accuracy of text classification at the first step.
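The screening step can be sketched as a simple ranking. Since a smaller first total area corresponds to higher classification accuracy, this sketch keeps the algorithms with the smallest areas; the text only says screening is done "according to the size of the first total area", so keeping the smallest is an interpretation. The function name `screen_candidates`, the algorithm names and the area values are hypothetical.

```python
from typing import Dict, List


def screen_candidates(first_total_area: Dict[str, float], k: int) -> List[str]:
    """Keep the k algorithms with the smallest first total area."""
    return sorted(first_total_area, key=first_total_area.get)[:k]


# Hypothetical first total areas for three text classification algorithms.
first_total_area = {"algo_a": 0.30, "algo_b": 0.10, "algo_c": 0.20}
candidates = screen_candidates(first_total_area, k=2)
print(candidates)
```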

Step 106, adding deceptive text samples, generated by a preselected attack algorithm, while each candidate text classification algorithm performs prediction classification on the evaluation text sample set, obtaining a second confidence confusion matrix corresponding to each candidate text classification algorithm, and calculating a second total area of a plurality of centroid offset quadrilaterals in the second confidence confusion matrix.

The attack algorithm generates deceptive text samples, which are input into the candidate text classification algorithm together with the evaluation text sample set. The candidate text classification algorithm may be affected at this point: the worse its adversarial robustness, the more evaluation text samples are transferred between squares of the confidence confusion matrix. This is reflected in a smaller evaluation text sample count and a lower weight in the correctly predicted squares, an offset centroid whose ordinate lies farther from the ordinate of the center point of the correctly predicted square, and a larger area of the corresponding centroid offset quadrilateral. The difference between the total areas of the centroid offset quadrilaterals can therefore reflect how strong the adversarial robustness of the candidate text classification algorithm is, that is, how well it maintains stable text classification accuracy.

Step 108, comparing the adversarial robustness of the candidate text classification algorithms according to the difference between the first total area and the second total area of the centroid offset quadrilaterals.

In this scheme, one or more types of attack algorithm may be preselected. When there is only one attack algorithm, the candidate classification algorithm with the smallest total area difference is directly selected to classify the target text, since a smaller difference indicates better adversarial robustness. When there are several attack algorithms, each pair of candidate text classification algorithm and attack algorithm has a corresponding total area difference; if the robustness of a candidate classification algorithm against all the attack algorithms needs to be considered comprehensively, the weighted sum of its total area differences can be used, with a higher weight set for the attack algorithms that are considered more important.
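The weighted comparison across several attack algorithms can be sketched as follows. The sketch treats a smaller weighted area difference as better adversarial robustness, consistent with the statement that weaker robustness produces a larger difference. The function name `robustness_scores`, the algorithm and attack names, the weights and all area values are hypothetical.

```python
from typing import Dict


def robustness_scores(first_area: Dict[str, float],
                      second_area: Dict[str, Dict[str, float]],
                      attack_weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum, per algorithm, of (area after attack - area before attack)."""
    scores = {}
    for algo, before in first_area.items():
        scores[algo] = sum(
            weight * (second_area[algo][attack] - before)
            for attack, weight in attack_weights.items()
        )
    return scores


# Hypothetical total areas before the attack and after two different attacks.
first_area = {"algo_a": 0.10, "algo_b": 0.12}
second_area = {
    "algo_a": {"attack_1": 0.30, "attack_2": 0.20},
    "algo_b": {"attack_1": 0.20, "attack_2": 0.18},
}
attack_weights = {"attack_1": 0.7, "attack_2": 0.3}  # attack_1 is emphasized more

scores = robustness_scores(first_area, second_area, attack_weights)
best = min(scores, key=scores.get)  # smallest weighted difference
print(scores, best)
```

With a single attack algorithm the same function applies with one weight of 1.0, in which case the score reduces to the plain total area difference.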

It can be understood that the difference between the total areas of the centroid offset quadrilaterals before and after the attack serves as a visual evaluation index of adversarial robustness. It intuitively reflects the difference in adversarial robustness between classification algorithms, so that the classification algorithm with the best adversarial robustness can be rapidly screened out and embedded into intelligent equipment to perform text classification tasks, which ensures the stability of the text classification effect. The same index also intuitively reflects the difference in the adversarial robustness of one classification algorithm against different attack algorithms.

Step 110, embedding the candidate classification algorithm with the best adversarial robustness into the intelligent equipment to classify the target text.

The centroid offset quadrangle provided by the scheme is mainly generated based on four squares formed by two different real classification labels and two corresponding prediction classification labels in the confidence confusion matrix.

In the text classification method based on the confidence confusion matrix described above, a first confidence confusion matrix corresponding to each text classification algorithm is first obtained. On the basis of this matrix, the centroid between the correct prediction and the incorrect prediction for the same real label is introduced, a centroid offset quadrilateral is formed from the centroids corresponding to two classification labels, and the total area of the centroid offset quadrilaterals in the first confidence confusion matrix is calculated. This total area can be used to evaluate the classification accuracy of the corresponding text classification algorithm: the higher the classification accuracy, the smaller the evaluation text sample count in the incorrectly predicted squares, the closer each offset centroid is to the center point of the correctly predicted square, and the smaller the area of the corresponding centroid offset quadrilateral. A set of candidate text classification algorithms can thus be screened from the initial set of text classification algorithms, which ensures the accuracy of text classification at the first step. Each candidate classification algorithm is then attacked with an attack algorithm; that is, the attack algorithm generates deceptive text samples, which are input into the candidate text classification algorithm together with the evaluation text sample set to obtain the second confidence confusion matrix corresponding to that candidate. After a classification algorithm is attacked, text samples that were predicted correctly are transferred into incorrectly predicted squares. Centroid offset quadrilaterals are constructed in the same way, so the worse the adversarial robustness of the classification algorithm, the larger the amount of transferred text samples and the larger the difference between the total area of the centroid offset quadrilaterals after the attack and that before the attack. The difference in adversarial robustness between candidate classification algorithms can therefore be seen intuitively from the difference between the total areas before and after the attack, which provides strong visual support for evaluating adversarial robustness. Meanwhile, because classification confidence is introduced into the confusion matrix and into the calculation of the centroid offset quadrilateral area, the area is a more faithful and accurate visual index of adversarial robustness, the classification algorithm with the best adversarial robustness can be screened out quickly and systematically to execute the text classification task, and the stability of the text classification effect can be ensured.

In one embodiment, in the first confidence confusion matrix, the step of constructing a centroid offset quadrilateral includes:

determining the construction area of the current centroid offset quadrilateral according to the lines connecting the center points of the squares corresponding to any two real classification labels and the two prediction classification labels with the same coordinate values in the first confidence confusion matrix;

within the build region:

taking the abscissas of the center points of the two squares as the abscissas of the two offset centroids respectively:

$$x_{G_i} = x_i, \qquad x_{G_j} = x_j;$$

wherein $x_{G_i}$ represents the abscissa of the offset centroid whose real classification label is $i$, $x_i$ represents the abscissa of the center point of the square whose real classification label is $i$, $x_{G_j}$ represents the abscissa of the offset centroid whose real classification label is $j$, and $x_j$ represents the abscissa of the center point of the square whose real classification label is $j$;

Taking the estimated text sample size in the square as the weight of the center point of the corresponding square;

determining the ordinates of the two offset centroids respectively according to the ordinates and the weights of the center points of the two squares that correspond to each real classification label and the two prediction classification labels:

$$y_{G_i} = \frac{n c_{ii}\, y_i + n c_{ij}\, y_j}{n c_{ii} + n c_{ij}}, \qquad y_{G_j} = \frac{n c_{jj}\, y_j + n c_{ji}\, y_i}{n c_{jj} + n c_{ji}};$$

wherein $y_{G_i}$ represents the ordinate of the offset centroid whose real classification label is $i$, $y_{G_j}$ represents the ordinate of the offset centroid whose real classification label is $j$, $y_i$ represents the ordinate of the center point of the square whose real classification label is $i$, $y_j$ represents the ordinate of the center point of the square whose real classification label is $j$, $n$ represents the total number of evaluation text samples in the squares corresponding to each real classification label (data preprocessing is performed at the beginning so that the number of samples of each real class is equal), $c_{ij}$ represents the confidence corresponding to the square whose real classification label is $i$ and whose prediction classification label is $j$, $c_{ji}$ represents the confidence corresponding to the square whose real classification label is $j$ and whose prediction classification label is $i$, $c_{ii}$ represents the confidence corresponding to the square whose real classification label is $i$ and whose prediction classification label is also $i$, and $c_{jj}$ represents the confidence corresponding to the square whose real classification label is $j$ and whose prediction classification label is also $j$;

and constructing the centroid offset quadrilateral according to the two abscissas and the two ordinates of the offset centroids.
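The construction steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the formula of the application itself: the weight of each square is assumed to be the per-class sample total `n` multiplied by the confidence of that square, and the ordinate of each offset centroid is assumed to be the weighted mean of the ordinates of the correctly predicted square and the wrongly predicted square. The function name `offset_centroids` and all parameter names are hypothetical.

```python
def offset_centroids(x_i, x_j, y_i, y_j, n, p_ii, p_ij, p_jj, p_ji):
    """Return the two offset centroids for the label pair (i, j).

    Assumption (not stated in the source): the weight of a square is
    n * confidence, and the centroid ordinate is the weighted mean of
    the ordinates of the correct square and the wrong square.
    """
    w_ii, w_ij = n * p_ii, n * p_ij  # real label i: predicted i / predicted j
    w_jj, w_ji = n * p_jj, n * p_ji  # real label j: predicted j / predicted i

    # Abscissas are taken directly from the square center points.
    x_ci, x_cj = x_i, x_j

    # Ordinates are pulled from the correct square toward the wrong one.
    y_ci = (w_ii * y_i + w_ij * y_j) / (w_ii + w_ij)
    y_cj = (w_jj * y_j + w_ji * y_i) / (w_jj + w_ji)
    return (x_ci, y_ci), (x_cj, y_cj)


# Illustrative values only: labels i and j sit at coordinates 1 and 2.
c_i, c_j = offset_centroids(x_i=1.0, x_j=2.0, y_i=1.0, y_j=2.0,
                            n=100, p_ii=0.75, p_ij=0.25, p_jj=0.5, p_ji=0.5)
print(c_i, c_j)
```

With these illustrative inputs, the centroid for label `i` moves a quarter of the way toward the wrong square and the centroid for label `j` moves halfway, which matches the intuition that more misclassification produces a larger offset.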

In one embodiment, calculating a first total area of a plurality of centroid-offset quadrilaterals in a first belief confusion matrix includes:

calculating a first total area of a plurality of centroid offset quadrilaterals in a first belief confusion matrix as:

，；

wherein $S_{ij}$ represents the first area of the centroid offset quadrilateral within the 4 squares whose real classification labels are $i$ and $j$ and whose prediction classification labels are also $i$ and $j$, and $S^{0}_{ij}$ represents the area of the quadrilateral formed by the center points of the four squares corresponding to that centroid offset quadrilateral. The denominator $S^{0}_{ij}$ serves to exclude the influence of the positional relationship between the two categories on the centroid offset quadrilateral.
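The area calculation can be sketched in Python as follows. This is an illustrative sketch, assuming that each centroid offset quadrilateral has as vertices the two correctly predicted square centers and the two offset centroids, that its area is divided by the area of the quadrilateral spanned by the four square centers, and that the normalized areas are summed over all label pairs. The helper names `polygon_area` and `total_normalized_area`, and the layout of `centers` and `centroid_y`, are hypothetical.

```python
from itertools import combinations


def polygon_area(points):
    """Shoelace formula for a simple polygon given as ordered (x, y) vertices."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def total_normalized_area(centers, centroid_y):
    """Sum the normalized centroid offset quadrilateral areas over all label pairs.

    centers[k]         -> coordinate of label k on both axes (assumed layout)
    centroid_y[(i, j)] -> ordinate of the offset centroid in column i for pair (i, j)
    """
    total = 0.0
    for i, j in combinations(range(len(centers)), 2):
        ci, cj = centers[i], centers[j]
        # Correct square center, offset centroid, correct square center, offset centroid.
        quad = [(ci, ci), (ci, centroid_y[(i, j)]),
                (cj, cj), (cj, centroid_y[(j, i)])]
        # Quadrilateral spanned by the four square centers of this label pair.
        reference = [(ci, ci), (ci, cj), (cj, cj), (cj, ci)]
        total += polygon_area(quad) / polygon_area(reference)
    return total


# Illustrative values only: two labels at coordinates 1 and 2.
centers = [1.0, 2.0]
centroid_y = {(0, 1): 1.25, (1, 0): 1.5}
total = total_normalized_area(centers, centroid_y)
print(total)  # 0.375
```

Dividing by the reference area makes the contribution of a label pair independent of how far apart the two labels happen to be placed on the axes.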

In one embodiment, in the second confidence confusion matrix, the step of constructing a centroid offset quadrilateral comprises:

determining the construction area of the current centroid offset quadrilateral according to the lines connecting the center points of the squares that correspond, in the first confidence confusion matrix, to any two real classification labels and the two prediction classification labels with the same coordinate values; within the construction area, the abscissas of the center points of the two squares are taken as the abscissas of the two offset centroids, respectively:

$$x'_{c_i} = x_i, \qquad x'_{c_j} = x_j;$$

wherein $x'_{c_i}$ represents the abscissa of the offset centroid for the real classification label $i$, $x'_{c_j}$ represents the abscissa of the offset centroid for the real classification label $j$, and $x_i$ and $x_j$ are the abscissas of the center points of the two squares;

taking the amount of evaluation text samples in each square as the weight of the center point of the corresponding square; in the second confidence confusion matrix, the amount of evaluation text samples in a square is calculated from the amount in the corresponding square of the first confidence confusion matrix and the transfer amount of evaluation text samples whose prediction classification label has changed relative to the first confidence confusion matrix;

；

wherein $y'_{c_i}$ represents the ordinate of the offset centroid for the real classification label $i$, $y'_{c_j}$ represents the ordinate of the offset centroid for the real classification label $j$, $\Delta n_{i \to j}$ represents the transfer amount of evaluation text samples whose real classification label is $i$ and whose prediction classification label changes from $i$ to $j$, and $\Delta n_{j \to i}$ represents the transfer amount of evaluation text samples whose real classification label is $j$ and whose prediction classification label changes from $j$ to $i$;

and constructing and obtaining a centroid offset quadrangle according to the two abscissas and the two ordinates of the offset centroid.
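The effect of the sample transfer amounts on the offset centroids can be sketched in Python. This is a hedged illustration, assuming that the transferred amount is subtracted from the weight of the correctly predicted square and added to the weight of the wrongly predicted square, and that the ordinate remains a weighted mean; the exact formula of the application is not reproduced here. The function name `attacked_centroid_ordinates` and its parameters are hypothetical.

```python
def attacked_centroid_ordinates(y_i, y_j, w_ii, w_ij, w_jj, w_ji, t_ij, t_ji):
    """Ordinates of the two offset centroids after an attack.

    w_* are the evaluation text sample amounts from the first matrix;
    t_ij is the amount with real label i whose prediction moved from i to j,
    t_ji is the amount with real label j whose prediction moved from j to i.
    Assumption: a transfer leaves the correct square and enters the wrong one,
    so the column total stays unchanged.
    """
    y_ci = ((w_ii - t_ij) * y_i + (w_ij + t_ij) * y_j) / (w_ii + w_ij)
    y_cj = ((w_jj - t_ji) * y_j + (w_ji + t_ji) * y_i) / (w_jj + w_ji)
    return y_ci, y_cj


# Illustrative values only: 25 samples of class i and 10 of class j are flipped.
y_ci, y_cj = attacked_centroid_ordinates(y_i=1.0, y_j=2.0,
                                         w_ii=75.0, w_ij=25.0,
                                         w_jj=50.0, w_ji=50.0,
                                         t_ij=25.0, t_ji=10.0)
print(y_ci, y_cj)
```

In this example both centroids move further away from their correctly predicted squares than before the attack, so the quadrilateral they span grows.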

In one embodiment, calculating a second total area of the plurality of centroid-offset quadrilaterals in the second belief confusion matrix includes:

calculating a second total area of a plurality of centroid offset quadrilaterals in a second belief confusion matrix as:

；

wherein $S'_{ij}$ represents the second area of the centroid offset quadrilateral within the 4 squares whose real classification labels are $i$ and $j$ and whose prediction classification labels are also $i$ and $j$.

From the above, it can be seen that the more strongly an intelligent classification algorithm is affected by the attack algorithm, the larger the sample transfer amounts $\Delta n_{i \to j}$ and $\Delta n_{j \to i}$ (from prediction classification label $i$ to $j$ and from $j$ to $i$, respectively) become, the larger the total offset of the centroids from the center points of the correctly predicted squares, the larger the corresponding centroid offset quadrilateral area after the attack, and the larger its difference from the centroid offset quadrilateral area before the attack.

In one embodiment, comparing the adversarial robustness of each candidate text classification algorithm based on the difference between the first total area and the second total area of the centroid offset quadrilaterals comprises:

comparing the adversarial robustness of each candidate text classification algorithm according to the difference between the first total area and the second total area of the centroid offset quadrilaterals, as follows:

；

wherein a smaller difference represents better adversarial robustness.

In one embodiment, embedding the candidate classification algorithm with the best adversarial robustness into an intelligent device for target text classification comprises:

embedding the candidate classification algorithm with the smallest difference between the first total area and the second total area of the centroid offset quadrilaterals into the intelligent device to classify the target text.
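The comparison and selection steps can be sketched in Python. This is a minimal sketch, assuming that the difference is taken as the second total area minus the first total area and that the candidate with the smallest difference is selected; the function name `most_robust`, the algorithm names, and the area values are all hypothetical and serve only as an illustration.

```python
def most_robust(first_areas, second_areas):
    """Return (name, difference) for the candidate with the smallest area growth.

    first_areas / second_areas map a candidate name to its total centroid
    offset quadrilateral area before / after the attack.
    """
    diffs = {name: second_areas[name] - first_areas[name]
             for name in first_areas}
    best = min(diffs, key=diffs.get)
    return best, diffs[best]


# Made-up areas for three hypothetical candidates.
first = {"algo_a": 0.30, "algo_b": 0.25, "algo_c": 0.40}
second = {"algo_a": 0.90, "algo_b": 1.10, "algo_c": 0.70}
best, diff = most_robust(first, second)
print(best, diff)
```

Here `algo_c` is selected even though it has the largest area before the attack, because its area grows the least under attack.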

The scheme can also be used to classify samples in other fields, such as images, audio, and video.

It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in sequence as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, but may be performed in turn or alternately with at least some of the other steps or sub-steps of other steps.

In one embodiment, a text classification device based on a belief confusion matrix is provided, comprising:

the first total area calculation module is used for performing predictive classification on the evaluation text sample set with each of a plurality of text classification algorithms, obtaining a first confidence confusion matrix corresponding to each text classification algorithm based on the results of the predictive classification, calculating a first total area of a plurality of centroid offset quadrilaterals in the first confidence confusion matrix, and screening a preset number of candidate text classification algorithms from the plurality of text classification algorithms according to the size of the first total area; in the belief confusion matrix, the abscissa of the center point of each square is the real classification label of the evaluation text samples, and the ordinate is the prediction classification label of the evaluation text samples; the weight of the center point of each square is the amount of evaluation text samples in the corresponding square; the amount of evaluation text samples is calculated according to the confidence corresponding to the prediction classification label; the number of centroid offset quadrilaterals is $N(N-1)/2$, where $N$ represents the number of classification labels;

wherein the step of calculating the area of a centroid offset quadrilateral comprises: determining the construction area of the current centroid offset quadrilateral according to the squares corresponding, in the belief confusion matrix, to any two real classification labels and the two prediction classification labels with the same coordinate values; determining the abscissas of the two offset centroids according to the two real classification labels, respectively; determining the ordinates of the two offset centroids according to the ordinates and the weights of the center points of the two squares corresponding to each real classification label and the two prediction classification labels, respectively; determining the positions of the two offset centroids according to their abscissas and ordinates; obtaining the centroid offset quadrilateral from the two offset centroids and the center points of the two squares corresponding to correct prediction; and calculating the area of the centroid offset quadrilateral;

the second total area calculation module is used for adding deceptive text samples, by means of a preselected attack algorithm, during the predictive classification of the evaluation text sample set, obtaining a second confidence confusion matrix corresponding to each candidate text classification algorithm, and calculating a second total area of a plurality of centroid offset quadrilaterals in the second confidence confusion matrix;

For specific limitations on the text classification device based on the confidence confusion matrix, reference may be made to the limitations on the text classification method based on the confidence confusion matrix above, which are not repeated here. Each module in the text classification device based on the confidence confusion matrix may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data such as classification algorithms and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text classification method based on a belief confusion matrix.

It will be appreciated by those skilled in the art that the structure shown in fig. 3 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method of the above embodiments when the computer program is executed.

In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of the above embodiments.

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

The above embodiments represent only a few implementations of the present application; they are described specifically and in detail, but are not thereby to be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A method for classifying text based on a belief confusion matrix, the method comprising:

performing predictive classification on the evaluation text sample set with each of a plurality of text classification algorithms, obtaining a first confidence confusion matrix corresponding to each text classification algorithm based on the results of the predictive classification, calculating a first total area of a plurality of centroid offset quadrilaterals in the first confidence confusion matrix, and screening a preset number of candidate text classification algorithms from the plurality of text classification algorithms according to the size of the first total area; in the belief confusion matrix, the abscissa of the center point of each square is the real classification label of the evaluation text samples, and the ordinate is the prediction classification label of the evaluation text samples; the weight of the center point of each square is the amount of evaluation text samples in the corresponding square; the amount of evaluation text samples is calculated according to the confidence corresponding to the prediction classification label; the number of centroid offset quadrilaterals is $N(N-1)/2$, where $N$ represents the number of classification labels;

adding deceptive text samples, by means of a preselected attack algorithm, during the predictive classification of the evaluation text sample set, obtaining a second confidence confusion matrix corresponding to each candidate text classification algorithm, and calculating a second total area of a plurality of centroid offset quadrilaterals in the second confidence confusion matrix;

2. The method of claim 1, wherein in the first belief confusion matrix, the step of constructing a centroid offset quadrilateral comprises:

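$$x_{c_i} = x_i, \qquad x_{c_j} = x_j;$$

wherein $x_{c_i}$ represents the abscissa of the offset centroid for the real classification label $i$, $x_i$ represents the abscissa of the center point of the square whose real classification label is $i$, $x_{c_j}$ represents the abscissa of the offset centroid for the real classification label $j$, and $x_j$ represents the abscissa of the center point of the square whose real classification label is $j$;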

；

wherein $y_{c_i}$ represents the ordinate of the offset centroid for the real classification label $i$, $y_{c_j}$ represents the ordinate of the offset centroid for the real classification label $j$, $y_i$ represents the ordinate of the center point of the square whose real classification label is $i$, $y_j$ represents the ordinate of the center point of the square whose real classification label is $j$, $n$ represents the total amount of evaluation text samples in the squares corresponding to each real classification label, $p_{ij}$ represents the confidence corresponding to the square whose real classification label is $i$ and whose prediction classification label is $j$, $p_{ji}$ represents the confidence corresponding to the square whose real classification label is $j$ and whose prediction classification label is $i$, $p_{ii}$ represents the confidence corresponding to the square whose real classification label is $i$ and whose prediction classification label is also $i$, and $p_{jj}$ represents the confidence corresponding to the square whose real classification label is $j$ and whose prediction classification label is also $j$;

3. The method of claim 2, wherein calculating a first total area of a plurality of centroid-offset quadrilaterals in the first belief confusion matrix comprises:

calculating a first total area of a plurality of centroid offset quadrilaterals in the first belief confusion matrix as:

；

wherein $S_{ij}$ represents the first area of the centroid offset quadrilateral within the 4 squares whose real classification labels are $i$ and $j$ and whose prediction classification labels are also $i$ and $j$.

4. A method according to claim 3, wherein in the second belief confusion matrix, the step of constructing a centroid offset quadrilateral comprises:

；

5. The method of claim 4, wherein calculating a second total area of a plurality of centroid-offset quadrilaterals in the second belief confusion matrix comprises:

calculating a second total area of a plurality of centroid offset quadrilaterals in the second belief confusion matrix as:

；

6. The method of claim 5, wherein comparing the adversarial robustness of each candidate text classification algorithm based on the difference between the first total area and the second total area of the centroid offset quadrilaterals comprises:

；

wherein a smaller difference represents better adversarial robustness.

7. The method of claim 5, wherein embedding the candidate classification algorithm with the best adversarial robustness into the intelligent device for target text classification comprises:

8. A text classification device based on a belief confusion matrix, the device comprising:

the first total area calculation module is used for performing predictive classification on the evaluation text sample set with each of a plurality of text classification algorithms, obtaining a first confidence confusion matrix corresponding to each text classification algorithm based on the results of the predictive classification, calculating a first total area of a plurality of centroid offset quadrilaterals in the first confidence confusion matrix, and screening a preset number of candidate text classification algorithms from the plurality of text classification algorithms according to the size of the first total area; in the belief confusion matrix, the abscissa of the center point of each square is the real classification label of the evaluation text samples, and the ordinate is the prediction classification label of the evaluation text samples; the weight of the center point of each square is the amount of evaluation text samples in the corresponding square; the amount of evaluation text samples is calculated according to the confidence corresponding to the prediction classification label; the number of centroid offset quadrilaterals is $N(N-1)/2$, where $N$ represents the number of classification labels;

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.