CN113032563B

CN113032563B - Regularized text classification fine tuning method based on manual masking keywords

Info

Publication number: CN113032563B
Application number: CN202110302636.1A
Authority: CN
Inventors: 潘晓光; 陈亮; 董虎弟; 宋晓晨; 张雅娜
Original assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Current assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date: 2021-03-22
Filing date: 2021-03-22
Publication date: 2023-07-14
Anticipated expiration: 2041-03-22
Also published as: CN113032563A

Abstract

The invention belongs to the technical field of text classification, and particularly relates to a regularized text classification fine tuning method based on manual masking keywords, which comprises the following steps: the method comprises the steps of data acquisition and processing, keyword selection based on frequency, keyword selection based on attention value, masking keyword reconstruction, hidden entropy regularization and performance evaluation, wherein the data acquisition and processing are used for acquiring text data required by a model, marking the category of the text data, constructing a data set required by the model and pre-training the data set; the frequency-based keyword selection uses the relative frequencies of words in the dataset to select keywords; the attention value based keyword selection uses model attention to select keywords. This method regularizes the model, reconstructs keywords from other words, and makes low confidence predictions without sufficient context. The method can greatly improve OOD detection and cross-domain generalization under the condition of not reducing classification precision.

Description

Regularized text classification fine tuning method based on manual masking keywords

Technical Field

The invention relates to the technical field of text classification, in particular to a regularized text classification fine tuning method based on manual masking keywords.

Background

The pre-trained language model achieves the most advanced accuracy in various text classification tasks such as emotion analysis, natural language reasoning and semantic text similarity. However, the reliability of the fine-tuned text classifier is severely underestimated. It has not been possible to build a model that can detect out-of-distribution samples or is robust to domain transfer, mainly because the model is overly dependent on a limited number of keywords, rather than focusing on the entire context.

Causes of problems or defects: current studies on text classification focus only on evaluating the accuracy of models, and neglecting their reliability. Meanwhile, the excessive dependence of the conventional method on the keywords may cause problems of abnormal distribution and generalization of the detection.

Disclosure of Invention

The invention aims to provide a regularized text classification fine tuning method based on manual masking keywords.

In order to achieve the above purpose, the present invention provides the following technical solutions: a regularized text classification fine tuning method based on manual masking keywords comprises the following steps:

s100, data acquisition and processing: collecting text data required by a model, marking the categories of the text data, constructing a data set required by the model, and pre-training the data set;

s200, selecting keywords based on frequency: selecting keywords using the relative frequencies of words in the dataset;

s300, keyword selection based on attention value: selecting a keyword using model attention;

s400, reconstructing a masking keyword: reconstructing keywords from the keyword-masked document;

s500, hidden entropy regularization: regularization of random deletion of context non-key words is performed on predictions of the context mask document;

s600, performance evaluation: and evaluating the text classification accuracy.

Further, in the step S200, the importance of the mark is measured by TF-IDF in the selection of the keyword based on the frequency, and the importance of the mark is measured by comparing the frequency in the target document with the frequency in the whole corpus, and the keyword is defined as the word with the highest TF-IDF score.

Further, in the keyword selection based on the attention value, step S300 trains a model using LCE standard method of cross entropy loss, and selects the keyword of the label having the highest attention value by using the attention value of the model.

Further, in the keyword selection based on the attention value in step S300, a= [ a1, … aT ] ∈ R T is set as the attention value embedded in the document, where ai corresponds to ti in the input symbol, and the attention-based scoring formula of symbol t is set as

Further, in step S400, in the masking keyword reconstruction, the sentence is subjected to the regularization of the keyword, and k is assumed ^～ For a random subset of the full key k, each element is selected with an independent probability p, and k is then masked from the original document x ^～ Obtaining the mask document x ^～＝x-k ^～ Finally, the reconstruction loss formula of the hidden key words is obtained as follows

Further, in the step S500 latent entropy regularization, let c be a subset of the randomly selected full context words c=x-k, where each element is independently selected with probability q, and then masked from the original document xc, and obtaining the context mask document x=x ^～ -c, obtaining the formula of latent entropy regularization as

Finally, the verification formula is set to +.>

Further, in the performance evaluation in step S600, classification accuracy, OOD detection and cross-domain generalization indexes are mainly evaluated.

Further, the step S200 and the step S300 are not sequential.

Further, step S400 and step S500 are not sequential.

The invention has the following technical effects: aiming at the problems that the reliability of a model is ignored, the model is excessively dependent on keywords and the like in the current text classification research method, the invention provides a method capable of carrying out overall prediction based on context and having higher reliability so as to carry out overall prediction based on context. This method regularizes the model, reconstructs keywords from other words, and makes low confidence predictions without sufficient context. Running in pre-trained language models such as BERT, roBERTa, and ALBERT, this approach can greatly improve OOD detection and cross-domain generalization without degrading classification accuracy.

Drawings

FIG. 1 is a flow chart of the system of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Examples

A regularized text classification fine tuning method based on manual masking keywords, as shown in FIG. 1, comprises the following steps:

s600, performance evaluation: and evaluating the text classification accuracy.

In step S200, in the frequency-based keyword selection, the importance of the tag is measured by TF-IDF term frequency-inverse document frequency, and then the importance of the tag is measured by comparing the frequency term frequency in the target document with the frequency-inverse document frequency in the whole corpus, and the keyword is defined as the word having the highest TF-IDF score. XC is a document concatenated with all symbols in the DC corpus, and D is an oversized document with d= [ X1, … XC ], the frequency-based keyword selection score formula for t symbols is

Where tf (t, X) =0.5+0.5·nt, idf (t, D) =log (|D|/|{ X ε D: t ε X } |). The frequency-based selection is model-independent and relatively easy to calculate, but does not directly reflect the contribution of the word to the text prediction.

In the keyword selection based on the attention value, the model attention is used to select keywords in step S300, because this is a more direct and efficient way to scale the importance of the keywords in model prediction. The model is trained using LCE standard methods of cross entropy loss, with the attention value of the model being used to select the labeled keyword with the highest attention value.

In the keyword selection based on the attention value in step S300, a= [ a1, … aT ] ∈ R T is set as the attention value embedded in the document, and the attention-based scoring formula in which ai corresponds to ti in the input symbol and symbol t is set as

Wherein II is an indication function, and II.II is L2 regular.

In step S400 mask keyword reconstruction, to strengthen the model to understand the surrounding context, the model is forced to reconstruct keywords from the keyword mask document. The principle is similar to the masking mechanism in the BERT model, but the scheme only masks keywords and not random words. The hidden keyword reconstruction only regularizes the keywords of the sentences, and ignores the loss of sentences without keywords. Formally, let k ^～ For a random subset of the full key k, each element is selected with an independent probability p, and k is then masked from the original document x ^～ Obtaining the mask document x ^～＝x-k ^～ Finally, the reconstruction loss formula of the hidden key words is obtained as follows

Wherein index (k) ^～ ) Is keyword k ^～ Vi is an index of keywords with respect to the vocabulary set with respect to the index of the original document x. Here too, it is important to select a proper keyword method, and experiments prove that the attention-based keyword selection method performs better than the frequency-based or random keyword selection method.

In step S500 latent entropy regularization, the model should not correctly classify the context mask document because it is not the original context here. Formally, let c be a subset of the randomly selected full context words c=x-k, where each element is independently selected with probability q, then mask c from the original document x, and obtain the context mask document x=x ^～ -c, obtaining the formula of latent entropy regularization as

Where DKL is the KL-distinction and U (y) is uniformly distributed. Latent entropy regularization does not reduce the accuracy of classification because it normalizes unrealistic masked sentences rather than complete documents. Finally, the verification formula is set to +.>

Where λ MKR and λMER are the super parameters lost by MKR mask keyword reconstruction and MER latent entropy regularization, respectively.

In the step S600 performance evaluation, classification accuracy, OOD detection and cross-domain generalization indexes are mainly evaluated. The classification accuracy of the model is not reduced, and the indexes of OOD detection and cross-domain generalization are greatly improved.

Step S200 and step S300 are not sequenced, and step S400 and step S500 are not sequenced.

The invention provides a fine tuning method based on regularization of manual masking keywords so as to conduct overall prediction based on context. This method regularizes the model, reconstructs keywords from other words, and makes low confidence predictions without sufficient context. The method is operated in pre-training language models such as BERT, roBERTa and ALBERT, has good reliability, and can greatly improve OOD detection and cross-domain generalization without reducing classification accuracy.

The preferred embodiments of the present invention have been described in detail, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention, and the various changes are included in the scope of the present invention.

Claims

1. A regularized text classification fine tuning method based on manual masking keywords is characterized by comprising the following steps:

s300, keyword selection based on attention value: selecting a keyword using model attention; in the selecting of the keywords based on the attention value, training a model by using an LCE standard method of cross entropy loss, and selecting the keyword with the mark with the highest attention value by using the attention value of the model; in the keyword selection of S300 based on the attention value, a= [ a1, … aT]E R T is the attention value embedded in the document, and the attention-based scoring formula in which ai corresponds to ti in the input symbol, symbol t, is set to

The II is an indication function, the II is L2 regular, and the ti is ai corresponding to an input symbol;

s400, reconstructing a masking keyword: reconstructing keywords from the keyword-masked document; in the S400 masking keyword reconstruction, keyword regularization is carried out on sentences, k-is assumed to be a random subset of the complete keyword k, the selection of each element is independent probability p, k-is then shielded from the original document x to obtain masking documents x-k-and finally a masking keyword reconstruction loss formula is obtained as follows

The index (k-) is the index of the random subset k-relative to the original document x, and vi is the index of the keyword relative to the vocabulary set;

s500, hidden entropy regularization: regularization of random deletion of context non-key words is performed on predictions of the context mask document; in the S500 hidden entropy regularization, let c be a subset of the randomly selected complete context words c=x-k, wherein each element is independently selected with probability q, then mask c from the original document x, and obtain the context mask documents x=x to-c, and the formula for the hidden entropy regularization is

Finally, the verification formula is set as

The DKL is KL-difference, the U (y) is uniformly distributed, and the lambda MKR and lambda MER are the super parameters lost by MKR mask keyword reconstruction and MER latent entropy regularization respectively;

s600, performance evaluation: and evaluating the text classification accuracy.

2. The regularized text classification fine tuning method based on artificial masking of keywords according to claim 1, wherein in the step S200 of frequency-based keyword selection, the importance of the markers is measured by TF-IDF, and then the importance of the markers is measured by comparing the frequencies in the target document with the frequencies in the whole corpus, and the keywords are defined as words with the highest TF-IDF score.

3. The regularized text classification fine-tuning method based on artificial masking keywords as recited in claim 1, wherein,

in the step S600 performance evaluation, classification accuracy, OOD detection and cross-domain generalization indexes are mainly evaluated.

4. The regularized text classification fine-tuning method based on artificial masking keywords as recited in claim 1, wherein,

step S200 and step S300 are not sequential.

5. The regularized text classification fine-tuning method based on artificial masking keywords as recited in claim 1, wherein,

step S400 and step S500 are not sequential.