CN116204645B - Intelligent text classification method, system, storage medium and electronic equipment - Google Patents

Intelligent text classification method, system, storage medium and electronic equipment

Info

Publication number
CN116204645B
Authority
CN
China
Prior art keywords
reference model
training
interception
preset reference
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310227369.5A
Other languages
Chinese (zh)
Other versions
CN116204645A (en)
Inventor
戴长松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Original Assignee
Shumei Tianxia Beijing Technology Co ltd
Beijing Nextdata Times Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shumei Tianxia Beijing Technology Co ltd, Beijing Nextdata Times Technology Co ltd filed Critical Shumei Tianxia Beijing Technology Co ltd
Priority to CN202310227369.5A
Publication of CN116204645A
Application granted
Publication of CN116204645B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent text classification method, system, storage medium and electronic equipment, and relates to the field of natural language processing. The method comprises the following steps: acquiring interception information of the label information of text data to be classified; and classifying the text data to be classified according to a preset reference model and the interception information to obtain a classification result. Classifying with the preset reference model combined with the interception information reduces the negative influence caused by data problems and improves the label classification effect.

Description

Intelligent text classification method, system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of natural language processing, and in particular, to an intelligent text classification method, system, storage medium, and electronic device.
Background
BERT (Bidirectional Encoder Representations from Transformers) is a breakthrough research advance in natural language processing. It achieves remarkable results on benchmarks such as GLUE and SQuAD and is widely applied to tasks such as text classification, natural language inference, sentiment analysis, semantic similarity, and reading comprehension. BERT obtains contextual representations by constructing a multi-layer bidirectional Transformer model and divides training into two stages, pre-training and fine-tuning: in pre-training, two unsupervised tasks, MLM (masked language modeling) and NSP (next sentence prediction), are used to train on unlabeled data; in the fine-tuning stage, the parameters of the pre-trained model are fine-tuned with the labeled data of the downstream task. In a classification task, if the labeled data suffers from problems such as dirty data or missing labels, the recognition accuracy of the model is reduced to a certain extent.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides an intelligent text classification method, system, storage medium and electronic equipment.
The technical scheme for solving the technical problems is as follows:
an intelligent text classification method, comprising:
acquiring interception information of label information of text data to be classified;
and classifying the text data to be classified according to a preset reference model and the interception information to obtain a classification result.
The beneficial effects of the invention are as follows: classifying the text data to be classified according to the preset reference model combined with the interception information yields the classification result, reduces the negative influence caused by data problems, and improves the label classification effect.
Further, the method further comprises the following steps: replacing the softmax layer of the preset reference model with a sigmoid layer to obtain an optimized preset reference model;
the classifying the text data to be classified according to the preset reference model and the interception information specifically comprises the following steps:
and classifying the text data to be classified according to the optimized preset reference model and the interception information.
Further, the method further comprises the following steps: training the reference model through a downstream task data set to obtain a trained preset reference model;
the classifying the text data to be classified according to the preset reference model and the interception information specifically comprises the following steps:
and classifying the text data to be classified according to the trained preset reference model and the interception information.
Further, the text data to be classified includes: tag information and text content.
The other technical scheme for solving the technical problems is as follows:
an intelligent text classification system, comprising: an interception information acquisition module and a classification module;
the interception information acquisition module is used for acquiring interception information of tag information of text data to be classified;
the classification module is used for classifying the text data to be classified according to a preset reference model and the interception information to obtain a classification result.
The beneficial effects of the invention are as follows: classifying the text data to be classified according to the preset reference model combined with the interception information yields the classification result, reduces the negative influence caused by data problems, and improves the label classification effect.
Further, the method further comprises the following steps: the optimization module is used for replacing the softmax layer of the preset reference model with a sigmoid layer to obtain an optimized preset reference model;
the classification module is specifically configured to classify the text data to be classified according to the optimized preset reference model in combination with the interception information.
Further, the method further comprises the following steps: the training module is used for training the reference model through a downstream task data set to obtain a trained preset reference model;
the classification module is specifically configured to classify the text data to be classified according to the trained preset reference model in combination with the interception information.
Further, the text data to be classified includes: tag information and text content.
The other technical scheme for solving the technical problems is as follows:
a storage medium having instructions stored therein which, when read by a computer, cause the computer to perform an intelligent text classification method according to any of the above aspects.
An electronic device comprising a processor and a storage medium as described above, the processor executing the instructions in the storage medium.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a schematic flow chart of an intelligent text classification method according to an embodiment of the present invention;
Fig. 2 is a block diagram of an intelligent text classification system according to an embodiment of the present invention;
Fig. 3 is a flowchart of an intelligent text classification method based on BERT loss masks according to other embodiments of the present invention;
Fig. 4 is a schematic structural diagram of a preset reference model according to another embodiment of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the illustrated embodiments are provided for illustration only and are not intended to limit the scope of the present invention.
As shown in Fig. 1, the intelligent text classification method provided by the embodiment of the invention includes:
s1, acquiring interception information of tag information of text data to be classified;
s2, classifying the text data to be classified according to a preset reference model and the interception information, and obtaining a classification result. It should be noted that, the construction flow of the model requires: data set, training resources (GPU), training code. The training code adopts a BERT model training code which is based on Tensorflow and is of Huggingface open source, and the code is modified and adapted to a certain extent. The reference model is called because it does not apply any additional techniques and is a basic, comparable, standard model. The modified content mainly comprises two points: 1, replacing a softmax layer of a model classifier with a sigmoid layer; 2, writing the acquired interception relation between the classifier and the label into the model in a code mode, wherein the specific implementation mode is to hide loss of the robbery-dividing label, so that the prediction accuracy of the classifier is improved.
In one embodiment, the specific classification process of the model may include the following steps (a minimal end-to-end sketch is given after this list):
1. Prepare an annotated classification data set (this data set is also referred to as the model's training set; a text classification data set typically contains text content and its labels);
2. Invoke GPU resources and train a BERT model (this is the reference model) on the data set of step 1;
3. Score the training set with the reference model to obtain the interception relationship between the classifiers and the labels (the specific method is explained below);
4. Modify the training code by feeding the interception relation obtained in step 3 into the model and retrain to obtain a new BERT model (the modification is described in detail below);
5. Test the reference model and the new model on the same test set; the new model achieves a better classification effect than the reference model. (This step only verifies the validity of the method and is not necessary for constructing the new model.)
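The end-to-end sketch referenced above is given here; every helper name in it (train_bert, score_training_set, extract_interception) is a hypothetical placeholder standing in for the corresponding step, not an identifier from the patent or from any library.

```python
# Hypothetical outline of the steps above. Each helper is a stub standing in
# for the corresponding step; none of these names come from the patent.
def train_bert(train_set, loss_mask_from=None):
    """Step 2 / step 4: train a BERT classifier (optionally with a loss mask)."""
    raise NotImplementedError

def score_training_set(model, train_set):
    """Step 3: score every training sample on every label with `model`."""
    raise NotImplementedError

def extract_interception(scores, threshold):
    """Step 3: map each classifier to the labels it tends to intercept."""
    raise NotImplementedError

def build_new_model(train_set, threshold=0.1):
    baseline = train_bert(train_set)                    # step 2: reference model
    scores = score_training_set(baseline, train_set)    # step 3: score training set
    relation = extract_interception(scores, threshold)  # step 3: interception relation
    return train_bert(train_set, loss_mask_from=relation)  # step 4: retrain with masked loss
```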
In one embodiment, the functionality of the model generally includes three parts: training, prediction and relation extraction.
Training constructs the model from code, computing resources and an annotated data set; prediction scores text data with unknown labels on the trained model, thereby obtaining the labels predicted by the model; the relation extraction module obtains the label interception tendencies of the classifiers from the scores of the reference BERT model on the training set. The structure of the model is shown in Fig. 4.
In some embodiments, the training process of the model may include:
1. Preparing an annotated training set (24 labels, about 3.01 million samples);
2. Preparing a GPU computing platform (NVIDIA Tesla V100) and a code training and debugging environment (TensorFlow, CUDA, etc.);
3. Preparing the BERT model training code (an open-source version provided by HuggingFace).
After all the above resources are prepared, model training can begin.
The model construction process is the model training process. Taking the invention as an example, it differs from the reference model training in that two steps are added after the reference model is trained:
1. Score the training set with the trained reference model to obtain the interception tendency relation between the classifiers and the labels;
2. Embed the interception tendencies obtained in step 1 into the model training process (the training code needs to be modified) and retrain to obtain a new model.
According to the method, classifying the text data to be classified according to the preset reference model combined with the interception information yields the classification result, reduces the negative influence caused by data problems, and improves the label classification effect.
Optionally, in some embodiments, further comprising: replacing the softmax layer of the preset reference model with a sigmoid layer to obtain an optimized preset reference model;
the classifying the text data to be classified according to the preset reference model and the interception information specifically comprises the following steps:
and classifying the text data to be classified according to the optimized preset reference model and the interception information.
Optionally, in some embodiments, the method further comprises: training the reference model through a downstream task data set to obtain a trained preset reference model. The downstream task data set here refers to a text classification data set, which generally contains label information and text content; for example, the training set in the experiments for this technique contains 24 labels and a total of about 3.01 million samples.
The classifying the text data to be classified according to the preset reference model and the interception information specifically comprises the following steps:
and classifying the text data to be classified according to the trained preset reference model and the interception information.
Optionally, in some embodiments, the text data to be classified includes: tag information and text content.
In one embodiment, an intelligent text classification method based on a BERT loss mask includes:
step 11: training a benchmark model on the downstream task data set using BERT; it should be noted that, constructing the reference model may include: the model needs to be trained on a GPU computing platform supporting parallel computing, and the bottom GPU based on the technical experiment is NVIDIA Tesla V100. Where the downstream task dataset refers to a classified text dataset, the dataset generally contains tag information and text content, for example, the training set in the present technology experiment contains 24 tags for a total of 301 ten thousand pieces of data. The construction flow of the model needs: data set, training resources (GPU), training code. The training code adopts a BERT model training code which is based on Tensorflow and is of Huggingface open source, and the code is modified and adapted to a certain extent. The reference model is called because it does not apply any additional techniques and is a basic, comparable, standard model.
Step 12: predicting the training data with the reference model to obtain the score of each label, then counting the average score of each classifier on each label's data set and setting a threshold t; when the average score is higher than the threshold, the sub-classifier is regarded as having an interception tendency toward the target label. The prediction process may include the following. Prediction is one of the basic functions of a model; here, prediction refers to using a trained model to classify data without any labels. For example, given a new piece of text (without any label information), the model scores it on all labels, thereby giving this text a predicted label.
Suppose the model has three sub-classifiers (classifier A, classifier B and classifier C) and the training set has three labels (label A, label B and label C). Predictive scoring of all samples of the training set with the reference model then yields the average score of each classifier over those samples. For example, if the average score of classifier A over all samples labeled C is S(A,C) = 0.2 and the threshold is set to t = 0.1, then S(A,C) > t, which indicates that classifier A has a tendency to intercept label C.
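A small sketch of how this average-score thresholding could be computed is given below; the array layout and the choice to exclude a classifier's own label from the relation are assumptions made for the example.

```python
# Illustrative computation of the interception relation from baseline scores.
# `scores` is an (N, L) array of the reference model's per-label scores for N
# training samples; `labels` holds each sample's true label id.
import numpy as np

def interception_relation(scores: np.ndarray, labels: np.ndarray, t: float = 0.1):
    num_labels = scores.shape[1]
    relation = {}  # classifier id -> set of label ids it tends to intercept
    for label in range(num_labels):
        sample_mask = labels == label
        if not sample_mask.any():
            continue
        # average score of every classifier over samples carrying this label
        mean_scores = scores[sample_mask].mean(axis=0)
        for clf in range(num_labels):
            if clf != label and mean_scores[clf] > t:
                relation.setdefault(clf, set()).add(label)
    return relation

# Matching the text: if classifier A's average score on label-C samples is 0.2
# and t = 0.1, classifier A is recorded as intercepting label C.
```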
It should be noted that prediction is one of the basic functions of the model, and the classifiers are the main carriers through which the model realizes prediction: if the training set has three labels, the model has three corresponding classifiers whose function is to predict the scores of a sample on those labels, and the model then judges, from these scores, which label the sample finally belongs to.
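The decision rules below are an assumed convention added for illustration; the patent only states that the model judges the final label from the classifier scores.

```python
# Illustrative decision rules for turning per-label scores into a prediction;
# the argmax / threshold conventions are assumptions, not the patent's rule.
import numpy as np

def predict_label(scores: np.ndarray) -> int:
    return int(np.argmax(scores))  # single-label decision: highest-scoring classifier wins

def predict_labels(scores: np.ndarray, threshold: float = 0.5) -> list:
    return [i for i, s in enumerate(scores) if s >= threshold]  # multi-label style decision
```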
Step 13: retraining on the same training set, except that when calculating the loss of each classifier, the loss of labels with an interception tendency is erased. It should be noted that, during training, the label information of a sample is used to construct the loss of each classifier; if classifier A has an interception tendency toward label C, then when the loss is constructed for a sample whose label is C, the loss of classifier A is set to 0 (the loss is erased). This eliminates the influence of classifiers that tend to grab the score of the current label, thereby improving the training effect for that label.
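A sketch of how such a mask could be built for a batch of samples follows; the dictionary representation of the interception relation and the helper name build_loss_mask are assumptions for the example. The resulting mask can then be fed to a per-label loss such as the masked_loss sketch shown earlier.

```python
# Sketch of building the loss mask used for retraining: for a sample whose
# true label is C, every classifier that intercepts C gets a 0 in its mask
# position, which erases that classifier's loss on this sample.
import numpy as np

def build_loss_mask(labels: np.ndarray, relation: dict, num_labels: int) -> np.ndarray:
    mask = np.ones((labels.shape[0], num_labels), dtype=np.float32)
    for i, label in enumerate(labels):
        for clf, intercepted in relation.items():
            if label in intercepted:
                mask[i, clf] = 0.0  # classifier `clf` tends to grab this label: drop its loss
    return mask
```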
Step 14: when the model outputs probabilities, the traditional softmax layer is replaced with a sigmoid layer to obtain a better label classification effect. It should be noted that during prediction the model scores samples through the classifiers, i.e. outputs a value for every label. Softmax is a function that maps these output values, transforming values of different magnitudes into probabilities between 0 and 1; sigmoid is also a function that maps values into probabilities between 0 and 1. The difference is that the probabilities produced by softmax sum to 1, whereas the sum of the probabilities after sigmoid mapping is often greater than 1; this follows from the different definitions of the two functions. The main reason for choosing sigmoid is that, unlike softmax, it imposes no constraint that the probabilities sum to 1, so the score of each classifier can be raised, the recall of the model on samples is improved, and a better classification effect is finally obtained.
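The numeric example below, with made-up logits, illustrates the difference described above.

```python
# Numeric illustration (made-up logits): softmax forces the per-label
# probabilities to compete and sum to 1, while sigmoid scores each label
# independently, so the sum can exceed 1.
import numpy as np

logits = np.array([2.0, 1.5, -1.0])

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax, softmax.sum())   # approx [0.60 0.37 0.03], sums to 1.0
print(sigmoid, sigmoid.sum())   # approx [0.88 0.82 0.27], sum is about 1.97
```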
It should be noted that steps 12, 13, 14 are in time sequence: the purpose of step 12 is to provide support for the training of the subsequent model (step 13) in order to obtain the interception tendency of the classifier; step 13, utilizing the interception tendency of the classifier obtained in the step 12 to change the training codes in a targeted manner and retraining, thereby obtaining a model embedded with priori information; step 14 is to modify the prediction process based on the trained model in step 13, so as to perfect the technical method.
The method can relieve the negative effects brought by problem data and improve the text classification effect. Mitigating the negative impact of problem data may include: for samples whose scores tend to be grabbed by other classifiers, loss masking lets the model focus on training the correct label, relieving the adverse effects caused by problem data; and switching the probability output layer to sigmoid raises the label scores of the model, avoiding the sum-to-1 probability constraint brought by softmax and relieving the score grabbing among labels.
Using sigmoid instead of softmax as the final probability output layer weakens the influence of other labels on the current label and can further improve the recall of the current label;
extracting the interception relations between the sub-classifiers and the labels from the scores of the BERT reference model on the training set, restarting training, and masking, in each sub-classifier's loss, the loss of labels with which it has an interception relation, so that the negative influence caused by data problems is reduced and the label classification effect is improved. It should be noted that extracting the interception relation between the sub-classifiers and the labels may include the following. Suppose the model has three sub-classifiers (classifier A, classifier B and classifier C) and the training set has three labels (label A, label B and label C). Predictive scoring of all samples of the training set with the reference model then yields the average score of each classifier over those samples; for example, if the average score of classifier A over all samples labeled C is S(A,C) = 0.2 and the threshold is set to t = 0.1, then S(A,C) > t, which indicates that classifier A has a tendency to intercept label C. The masking process may include: assuming classifier A has a tendency to intercept label C, when model training builds the loss for a sample of label C, the loss of classifier A is set to 0 (the loss is erased), which is also known as masking.
In one embodiment, as shown in FIG. 2, an intelligent text classification system includes: an interception information acquisition module 1101 and a classification module 1102;
the interception information acquisition module 1101 is configured to acquire interception information of tag information of text data to be classified;
the classification module 1102 is configured to classify the text data to be classified according to a preset reference model in combination with the interception information, so as to obtain a classification result.
According to the system, classifying the text data to be classified according to the preset reference model combined with the interception information yields the classification result, reduces the negative influence caused by data problems, and improves the label classification effect.
Optionally, in some embodiments, further comprising: the optimization module is used for replacing the softmax layer of the preset reference model with a sigmoid layer to obtain an optimized preset reference model;
the classification module is specifically configured to classify the text data to be classified according to the optimized preset reference model in combination with the interception information.
Optionally, in some embodiments, further comprising: the training module is used for training the reference model through a downstream task data set to obtain a trained preset reference model;
the classification module is specifically configured to classify the text data to be classified according to the trained preset reference model in combination with the interception information.
Optionally, in some embodiments, the text data to be classified includes: tag information and text content.
It is to be understood that in some embodiments, some or all of the alternatives described in the various embodiments above may be included.
It should be noted that, the foregoing embodiments are product embodiments corresponding to the previous method embodiments, and the description of each optional implementation manner in the product embodiments may refer to the corresponding description in the foregoing method embodiments, which is not repeated herein.
In one embodiment, a storage medium has instructions stored therein that, when read by a computer, cause the computer to perform a method of intelligent text classification as in any of the embodiments described above.
An electronic device comprising a processor and a storage medium of the above embodiments, the processor executing instructions in the storage medium.
The reader will appreciate that in the description of this specification, reference to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the different embodiments or examples described in this specification, and their features, may be combined by those skilled in the art without contradiction.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the method embodiments described above are merely illustrative, e.g., the division of steps is merely a logical function division, and there may be additional divisions of actual implementation, e.g., multiple steps may be combined or integrated into another step, or some features may be omitted or not performed.
The above-described method, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the present invention, and these modifications and substitutions are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (6)

1. An intelligent text classification method, comprising:
acquiring interception information of label information of text data to be classified;
classifying the text data to be classified according to a preset reference model and the interception information to obtain a classification result;
further comprises: replacing the softmax layer of the preset reference model with a sigmoid layer to obtain an optimized preset reference model;
the classifying the text data to be classified according to the preset reference model and the interception information specifically comprises the following steps:
classifying the text data to be classified according to the optimized preset reference model and the interception information;
further comprises: training the reference model through a downstream task data set to obtain a trained preset reference model, wherein the training process comprises the following steps:
step 11: training a benchmark model on the downstream task data set using BERT;
step 12: predicting training data by using the reference model to obtain the score of each label, counting the average score of each classifier on each label data set, setting a threshold t, and regarding that the classifier has interception tendency on the target label when the average score is higher than the threshold;
step 13: retraining on the same training set, and erasing the label loss with interception tendency when calculating the loss of each classifier, specifically:
extracting interception relations between the classifier and the labels according to BERT reference model scoring on the training set, restarting training, and masking label loss with the interception relations on loss of the classifier;
step 14: and when the model outputs the probability, replacing the softmax layer of the preset reference model with the sigmoid layer.
2. The intelligent text classification method according to claim 1, wherein the text data to be classified comprises: tag information and text content.
3. An intelligent text classification system, comprising: an interception information acquisition module and a classification module;
the interception information acquisition module is used for acquiring interception information of tag information of text data to be classified;
the classification module is used for classifying the text data to be classified according to a preset reference model and the interception information to obtain a classification result;
further comprises: the optimization module is used for replacing the softmax layer of the preset reference model with a sigmoid layer to obtain an optimized preset reference model;
the classification module is specifically used for classifying the text data to be classified according to the optimized preset reference model and the interception information;
further comprises: the training module is used for training the reference model through a downstream task data set to obtain a trained preset reference model, and the training process comprises the following steps:
step 11: training a benchmark model on the downstream task data set using BERT;
step 12: predicting training data by using the reference model to obtain the score of each label, counting the average score of each classifier on each label data set, setting a threshold t, and regarding that the classifier has interception tendency on the target label when the average score is higher than the threshold;
step 13: retraining on the same training set, and erasing the label loss with interception tendency when calculating the loss of each classifier, specifically: extracting the interception relations between the classifiers and the labels according to the BERT reference model's scores on the training set, restarting training, and masking the label loss having the interception relation on the loss of the classifier;
step 14: and when the model outputs the probability, replacing the softmax layer of the preset reference model with the sigmoid layer.
4. An intelligent text classification system according to claim 3 wherein said text data to be classified comprises: tag information and text content.
5. A storage medium having stored therein instructions which, when read by a computer, cause the computer to perform an intelligent text classification method according to claim 1 or 2.
6. An electronic device comprising a processor and the storage medium of claim 5, the processor executing instructions in the storage medium.
CN202310227369.5A 2023-03-02 2023-03-02 Intelligent text classification method, system, storage medium and electronic equipment Active CN116204645B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310227369.5A CN116204645B (en) 2023-03-02 2023-03-02 Intelligent text classification method, system, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN116204645A (en) 2023-06-02
CN116204645B (en) 2024-02-20

Family

ID=86509344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310227369.5A Active CN116204645B (en) 2023-03-02 2023-03-02 Intelligent text classification method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116204645B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308319A (en) * 2018-08-21 2019-02-05 深圳中兴网信科技有限公司 File classification method, document sorting apparatus and computer readable storage medium
CN111428028A (en) * 2020-03-04 2020-07-17 中国平安人寿保险股份有限公司 Information classification method based on deep learning and related equipment
CN114741503A (en) * 2022-03-07 2022-07-12 度小满科技(北京)有限公司 Text classification method, device and equipment and readable storage medium
WO2022227207A1 (en) * 2021-04-30 2022-11-03 平安科技(深圳)有限公司 Text classification method, apparatus, computer device, and storage medium


Also Published As

Publication number Publication date
CN116204645A (en) 2023-06-02

Similar Documents

Publication Publication Date Title
US11734328B2 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
US20180053107A1 (en) Aspect-based sentiment analysis
US20210374347A1 (en) Few-shot named-entity recognition
CN115310425B (en) Policy text analysis method based on policy text classification and key information identification
CN109992664A (en) Mark classification method, device, computer equipment and the storage medium of central issue
CN111191275A (en) Sensitive data identification method, system and device
CN110442859B (en) Labeling corpus generation method, device, equipment and storage medium
US11900250B2 (en) Deep learning model for learning program embeddings
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN112417132B (en) New meaning identification method for screening negative samples by using guest information
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
CN116070632A (en) Informal text entity tag identification method and device
CN114896971B (en) Method, device and storage medium for recognizing specific prefix and suffix negative words
CN113127607A (en) Text data labeling method and device, electronic equipment and readable storage medium
CN113779227B (en) Case fact extraction method, system, device and medium
Lin et al. Radical-based extract and recognition networks for Oracle character recognition
CN117725211A (en) Text classification method and system based on self-constructed prompt template
CN117520561A (en) Entity relation extraction method and system for knowledge graph construction in helicopter assembly field
CN116204645B (en) Intelligent text classification method, system, storage medium and electronic equipment
CN115936003A (en) Software function point duplicate checking method, device, equipment and medium based on neural network
CN115964484A (en) Legal multi-intention identification method and device based on multi-label classification model
CN115827871A (en) Internet enterprise classification method, device and system
CN114741512A (en) Automatic text classification method and system
CN110162629B (en) Text classification method based on multi-base model framework
WO2023035332A1 (en) Date extraction method and apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant