CN113032563A - Regularization text classification fine-tuning method based on manually masked keywords - Google Patents
- Publication number: CN113032563A (application CN202110302636.1A)
- Authority: CN (China)
- Prior art keywords: keywords, keyword, model, regularization, text classification
- Legal status: Granted (status as listed by Google; an assumption, not a legal conclusion)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
- G06F16/3344 — Query execution using natural language analysis
- G06F16/3346 — Query execution using probabilistic model
- G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216 — Parsing using statistical methods
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of text classification, and particularly relates to a regularization text classification fine-tuning method based on manually masked keywords, comprising the following steps: data acquisition and processing, frequency-based keyword selection, attention-value-based keyword selection, masked keyword reconstruction, masked entropy regularization, and performance evaluation. Data acquisition and processing acquires the text data required by the model, labels its category, constructs the dataset required by the model, and pre-trains on that dataset; frequency-based keyword selection uses the relative frequencies of words in the dataset to select keywords; attention-value-based keyword selection uses the model's attention to select keywords. The method regularizes the model so that it reconstructs keywords from the other words and makes low-confidence predictions when the context is insufficient. The method can greatly improve OOD (out-of-distribution) detection and cross-domain generalization without reducing classification accuracy.
Description
Technical Field
The invention relates to the technical field of text classification, in particular to a regularization text classification fine-tuning method based on manually masked keywords.
Background
At present, pre-trained language models achieve state-of-the-art accuracy on a variety of text classification tasks, such as sentiment analysis, natural language inference and semantic textual similarity. However, the reliability of fine-tuned text classifiers is severely underestimated: it has not been possible to build models that can detect out-of-distribution (OOD) samples or that are robust to domain shifts, mainly because models depend excessively on a limited number of keywords rather than attending to the entire context.
Causes of the problem or defect: current research on text classification focuses only on evaluating model accuracy and ignores reliability. Meanwhile, the excessive dependence of traditional methods on keywords can cause problems with out-of-distribution detection and generalization.
Disclosure of Invention
The invention aims to provide a regularization text classification fine-tuning method based on manually masked keywords.
In order to achieve this purpose, the invention provides the following technical scheme. A regularization text classification fine-tuning method based on manually masked keywords comprises the following steps:
S100, data acquisition and processing: acquiring the text data required by the model, labeling its category, constructing the dataset required by the model, and pre-training on the dataset;
S200, frequency-based keyword selection: selecting keywords using the relative frequencies of words in the dataset;
S300, attention-value-based keyword selection: selecting keywords using the model's attention;
S400, masked keyword reconstruction: reconstructing the keywords from the keyword-masked document;
S500, masked entropy regularization: regularizing predictions on context-masked documents by randomly deleting non-keyword words from the context;
S600, performance evaluation: evaluating text classification accuracy.
Further, in the frequency-based keyword selection of step S200, token importance is measured with TF-IDF by comparing a token's frequency in the target document with its frequency in the entire corpus, and the keywords are defined as the words with the highest TF-IDF scores.
Further, in the attention-value-based keyword selection of step S300, the model is trained with the standard cross-entropy loss L_CE, and the model's attention values are used to select the keywords with the highest attention values.
Further, in the attention-value-based keyword selection of step S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the document embedding, where a_i corresponds to input token t_i. The attention-based score of a token t is set as s_att(t) = (1/|I_t|) Σ_{i ∈ I_t} a_i, the average attention over the set of positions I_t at which t occurs.
Further, in the masked keyword reconstruction of step S400, only the keywords of the sentence are regularized. Assume k̃ is a random subset of the full keyword set k, with each element chosen independently with probability p; k̃ is masked from the original document x to obtain the masked document x̃ = x − k̃. Finally, the masked keyword reconstruction loss is L_MKR(x; θ) = (1/|ind(k̃)|) Σ_{i ∈ ind(k̃)} −log p(t_i = v_i | x − k̃).
Further, in the masked entropy regularization of step S500, let c̃ be a randomly selected subset of the full context words c = x − k, with each element selected independently with probability q; c̃ is then masked from the original document x to obtain the context-masked document x − c̃. The masked entropy regularization loss is L_MER(x; θ) = D_KL(U(y) ‖ p(y | x − c̃)). Finally, the overall training objective is set as L = L_CE + λ_MKR · L_MKR + λ_MER · L_MER.
Further, in the performance evaluation of step S600, classification accuracy, OOD detection and cross-domain generalization are mainly evaluated.
Further, steps S200 and S300 may be performed in either order.
Further, steps S400 and S500 may be performed in either order.
The invention has the following technical effects. Aiming at the problems that current text classification research neglects model reliability and that existing methods depend excessively on keywords, the invention provides a method that makes holistic predictions based on the entire context and is therefore more reliable. The method regularizes the model so that it reconstructs keywords from the other words and makes low-confidence predictions when the context is insufficient. Applied to pre-trained language models such as BERT, RoBERTa and ALBERT, this approach can greatly improve OOD detection and cross-domain generalization without reducing classification accuracy.
Drawings
FIG. 1 is a flow chart of the system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
A regularization text classification fine-tuning method based on manually masked keywords, as shown in FIG. 1, comprises the following steps:
S100, data acquisition and processing: acquiring the text data required by the model, labeling its category, constructing the dataset required by the model, and pre-training on the dataset;
S200, frequency-based keyword selection: selecting keywords using the relative frequencies of words in the dataset;
S300, attention-value-based keyword selection: selecting keywords using the model's attention;
S400, masked keyword reconstruction: reconstructing the keywords from the keyword-masked document;
S500, masked entropy regularization: regularizing predictions on context-masked documents by randomly deleting non-keyword words from the context;
S600, performance evaluation: evaluating text classification accuracy.
In the frequency-based keyword selection of step S200, token importance is measured with TF-IDF (term frequency-inverse document frequency): the frequency of a token in the target document is compared with its frequency in the entire corpus, and the keywords are defined as the words with the highest TF-IDF scores. Let X_c be the document obtained by concatenating all documents of class c in the corpus D_c, and let D = [X_1, …, X_C] be the resulting collection over the C classes. The frequency-based keyword score of a token t is s_freq(t) = tf(t, X_c) · idf(t, D),
where tf(t, X) = 0.5 + 0.5 · n_t / max_{t'∈X} n_{t'} (with n_t the count of t in X) and idf(t, D) = log(|D| / |{X ∈ D : t ∈ X}|). Frequency-based selection is model-independent and relatively easy to compute, but it does not directly reflect the contribution of words to the model's predictions.
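As an illustration, the frequency-based score above can be sketched in plain Python. This is a minimal sketch; the tokenization into word lists and the per-class corpus layout are assumptions made for illustration, not details specified by the patent.

```python
import math
from collections import Counter

def tf(t, doc):
    # augmented term frequency: tf(t, X) = 0.5 + 0.5 * n_t / max_{t'} n_{t'}
    counts = Counter(doc)
    return 0.5 + 0.5 * counts[t] / max(counts.values())

def idf(t, corpus):
    # inverse document frequency: idf(t, D) = log(|D| / |{X in D : t in X}|)
    df = sum(1 for doc in corpus if t in doc)
    return math.log(len(corpus) / df)

def freq_score(t, class_doc, corpus):
    # s_freq(t) = tf(t, X_c) * idf(t, D); keywords are the highest-scoring tokens
    return tf(t, class_doc) * idf(t, corpus)
```

Here each X_c is a list of tokens concatenating one class's documents and D is the list of all X_c; a token that occurs in every class document gets idf 0, so class-discriminative words score highest.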
Step S300 selects keywords based on attention values, using the model's attention, since this is a more direct and efficient way to quantify how important a token is to the model's prediction. The model is first trained with the standard cross-entropy loss L_CE, and the model's attention values are then used to select the keywords with the highest attention values.
In the attention-value-based keyword selection of step S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the document embedding, where a_i corresponds to input token t_i. The attention-based score of a token t is set as s_att(t) = (1/|I_t|) Σ_{i ∈ I_t} a_i, the average attention over the set of positions I_t at which t occurs.
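The attention-based selection can be sketched as follows, assuming the per-token attention values have already been collected from the fine-tuned model; the data layout and function names are illustrative assumptions, not part of the patent.

```python
import numpy as np

def attention_scores(occurrences):
    # occurrences: token -> list of attention values a_i observed at the
    # positions where the token appears across the corpus;
    # s_att(t) is the average attention over those positions
    return {t: float(np.mean(vals)) for t, vals in occurrences.items()}

def select_keywords(occurrences, top_m):
    # the keywords are the top_m tokens with the highest attention score
    scores = attention_scores(occurrences)
    return sorted(scores, key=scores.get, reverse=True)[:top_m]
```

Content words that the classifier leans on ("plot", "actor") accumulate higher average attention than function words ("the") and are therefore the ones selected for masking.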
In the masked keyword reconstruction of step S400, the model is forced to reconstruct the keywords from the keyword-masked document, in order to strengthen its understanding of the surrounding context. The principle is similar to the masking mechanism used to pre-train BERT, but this scheme masks only keywords rather than random words. The masked keyword reconstruction regularizes only the keywords of the sentence; the loss on non-keyword tokens is ignored. Formally, assume k̃ is a random subset of the full keyword set k, with each element chosen independently with probability p; k̃ is masked from the original document x to obtain the masked document x̃ = x − k̃. Finally, the masked keyword reconstruction loss is L_MKR(x; θ) = (1/|ind(k̃)|) Σ_{i ∈ ind(k̃)} −log p(t_i = v_i | x − k̃), where ind(k̃) is the set of positions of the masked keywords k̃ in the original document x and v_i is the vocabulary index of the token at position i. Choosing an appropriate keyword selection method matters here: experiments show that attention-based keyword selection performs better than frequency-based or random selection.
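Numerically, the reconstruction loss can be sketched as below. The per-position vocabulary logits would in practice come from a masked-language-model head run over the keyword-masked document; that head and the array shapes are assumptions for illustration.

```python
import numpy as np

def log_softmax(z):
    # numerically stable log of the softmax distribution
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def mkr_loss(logits, masked_positions, target_ids):
    # logits: (T, V) array of vocabulary logits, one row per position of the
    # keyword-masked document x - k~;
    # L_MKR = (1/|ind(k~)|) * sum_{i in ind(k~)} -log p(t_i = v_i | x - k~)
    total = 0.0
    for pos, vid in zip(masked_positions, target_ids):
        total += -log_softmax(logits[pos])[vid]
    return total / len(masked_positions)
```

Only the masked keyword positions contribute to the average, matching the statement above that the loss on non-keyword tokens is ignored.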
In the masked entropy regularization of step S500, the model should not be able to confidently classify a context-masked document, because the original context is no longer present. Formally, let c̃ be a randomly selected subset of the full context words c = x − k, where each element is selected independently with probability q; c̃ is then masked from the original document x to obtain the context-masked document x − c̃. The masked entropy regularization loss is L_MER(x; θ) = D_KL(U(y) ‖ p(y | x − c̃)), where D_KL is the KL divergence and U(y) is the uniform distribution over labels. Masked entropy regularization does not degrade classification accuracy, because it constrains unrealistic, masked sentences rather than complete documents. Finally, the overall training objective is set as L = L_CE + λ_MKR · L_MKR + λ_MER · L_MER, where λ_MKR and λ_MER are hyperparameters weighting the masked keyword reconstruction (MKR) and masked entropy regularization (MER) losses, respectively.
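The entropy regularizer and the combined objective can likewise be sketched; the default λ values below are purely illustrative, not values prescribed by the patent.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mer_loss(class_logits):
    # L_MER = D_KL( U(y) || p(y | x - c~) ): pushes the prediction on a
    # context-masked document toward the uniform label distribution
    p = softmax(class_logits)
    u = 1.0 / len(p)
    return float(np.sum(u * np.log(u / p)))

def total_loss(l_ce, l_mkr, l_mer, lam_mkr=0.1, lam_mer=0.1):
    # L = L_CE + lambda_MKR * L_MKR + lambda_MER * L_MER
    return l_ce + lam_mkr * l_mkr + lam_mer * l_mer
```

mer_loss is zero exactly when the prediction on the context-masked document is uniform, i.e. maximally unconfident, which is the behavior the regularizer rewards.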
In the performance evaluation of step S600, classification accuracy, OOD detection and cross-domain generalization are mainly evaluated. The scheme does not reduce the model's classification accuracy, while the OOD detection and cross-domain generalization indexes are greatly improved.
Steps S200 and S300 may be performed in either order, as may steps S400 and S500.
The invention provides a fine-tuning method based on manually masked keyword regularization, so that predictions can be made holistically on the basis of the entire context. The method regularizes the model so that it reconstructs keywords from the other words and makes low-confidence predictions when the context is insufficient. Run on pre-trained language models such as BERT, RoBERTa and ALBERT, the method shows good reliability and can greatly improve OOD detection and cross-domain generalization without reducing classification accuracy.
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.
Claims (9)
1. A regularization text classification fine-tuning method based on manually masked keywords, characterized by comprising the following steps:
S100, data acquisition and processing: acquiring the text data required by the model, labeling its category, constructing the dataset required by the model, and pre-training on the dataset;
S200, frequency-based keyword selection: selecting keywords using the relative frequencies of words in the dataset;
S300, attention-value-based keyword selection: selecting keywords using the model's attention;
S400, masked keyword reconstruction: reconstructing the keywords from the keyword-masked document;
S500, masked entropy regularization: regularizing predictions on context-masked documents by randomly deleting non-keyword words from the context;
S600, performance evaluation: evaluating text classification accuracy.
2. The method according to claim 1, characterized in that, in the frequency-based keyword selection of step S200, token importance is measured with TF-IDF by comparing a token's frequency in the target document with its frequency in the entire corpus, and the keywords are defined as the words with the highest TF-IDF scores.
3. The method according to claim 1, characterized in that, in the attention-value-based keyword selection of step S300, the model is trained with the standard cross-entropy loss L_CE, and the model's attention values are used to select the keywords with the highest attention values.
4. The method according to claim 1, characterized in that, in the attention-value-based keyword selection of step S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the document embedding, where a_i corresponds to input token t_i; the attention-based score of a token t is set as s_att(t) = (1/|I_t|) Σ_{i ∈ I_t} a_i, the average attention over the set of positions I_t at which t occurs.
5. The method according to claim 1, characterized in that, in the masked keyword reconstruction of step S400, only the keywords of the sentence are regularized; assuming k̃ is a random subset of the full keyword set k, with each element chosen independently with probability p, k̃ is masked from the original document x to obtain the masked document x̃ = x − k̃, and the masked keyword reconstruction loss is L_MKR(x; θ) = (1/|ind(k̃)|) Σ_{i ∈ ind(k̃)} −log p(t_i = v_i | x − k̃).
6. The method according to claim 1, characterized in that, in the masked entropy regularization of step S500, let c̃ be a randomly selected subset of the full context words c = x − k, with each element selected independently with probability q; c̃ is masked from the original document x to obtain the context-masked document x − c̃, the masked entropy regularization loss is L_MER(x; θ) = D_KL(U(y) ‖ p(y | x − c̃)), and the overall training objective is set as L = L_CE + λ_MKR · L_MKR + λ_MER · L_MER.
7. The method according to claim 1, characterized in that the performance evaluation of step S600 mainly evaluates classification accuracy, OOD detection and cross-domain generalization.
8. The method according to claim 1, characterized in that steps S200 and S300 may be performed in either order.
9. The method according to claim 1, characterized in that steps S400 and S500 may be performed in either order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110302636.1A (CN113032563B) | 2021-03-22 | 2021-03-22 | Regularized text classification fine-tuning method based on manually masked keywords |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113032563A (application) | 2021-06-25 |
CN113032563B (grant) | 2023-07-14 |
Family ID: 76472302. Family application: CN202110302636.1A (granted as CN113032563B), filed 2021-03-22 in China (CN), status Active.
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014043519A1 (en) * | 2012-09-14 | 2014-03-20 | Population Diagnostics Inc. | Methods and compositions for diagnosing, prognosing, and treating neurological conditions |
CN110119765A (en) * | 2019-04-18 | 2019-08-13 | 浙江工业大学 | A kind of keyword extracting method based on Seq2seq frame |
CN110222349A (en) * | 2019-06-13 | 2019-09-10 | 成都信息工程大学 | A kind of model and method, computer of the expression of depth dynamic context word |
US20200193500A1 (en) * | 2017-07-04 | 2020-06-18 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Data processing method and apparatus based on electronic commerce |
CN111339278A (en) * | 2020-02-28 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Method and device for generating training speech generating model and method and device for generating answer speech |
CN111444709A (en) * | 2020-03-09 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Text classification method, device, storage medium and equipment |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111488459A (en) * | 2020-04-15 | 2020-08-04 | 焦点科技股份有限公司 | Product classification method based on keywords |
CN111563166A (en) * | 2020-05-28 | 2020-08-21 | 浙江学海教育科技有限公司 | Pre-training model method for mathematical problem classification |
CN111563373A (en) * | 2020-04-13 | 2020-08-21 | 中南大学 | Attribute-level emotion classification method for focused attribute-related text |
CN112115247A (en) * | 2020-09-07 | 2020-12-22 | 中国人民大学 | Personalized dialogue generation method and system based on long-time and short-time memory information |
CN112214599A (en) * | 2020-10-20 | 2021-01-12 | 电子科技大学 | Multi-label text classification method based on statistics and pre-training language model |
CN112256876A (en) * | 2020-10-26 | 2021-01-22 | 南京工业大学 | Aspect-level emotion classification model based on multi-memory attention network |
Non-Patent Citations (4)
Title |
---|
YUN ZHAO et al.: "BERTSurv: BERT-Based Survival Models for Predicting Outcomes of Trauma Patients", arXiv:2103.10928v1 |
Wang Nanzhi: "Research on an Improved Text Representation Model Based on BERT", China Masters' Theses Full-text Database, Information Science and Technology |
Tian Fang: "Research on Fine-Grained Text Sentiment Classification for the Social Internet of Things", China Masters' Theses Full-text Database, Information Science and Technology |
Bo Ceng: "Understanding KL Divergence", HTTPS://WWW.CNBLOGS.COM/BOCENG/P/11519381.HTML |
Legal Events
Date | Code | Title |
---|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |