CN113032563B - Regularized text classification fine tuning method based on manual masking keywords - Google Patents
- Publication number
- CN113032563B (grant of application CN202110302636.1A)
- Authority
- CN
- China
- Prior art keywords
- keywords
- keyword
- masking
- model
- text classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of text classification, and particularly relates to a regularized text classification fine tuning method based on manual masking keywords, comprising the following steps: data acquisition and processing, frequency-based keyword selection, attention-value-based keyword selection, masking keyword reconstruction, hidden entropy regularization, and performance evaluation. Data acquisition and processing collects the text data required by the model, labels its categories, constructs the data set required by the model, and pre-trains on it; frequency-based keyword selection uses the relative frequencies of words in the data set to select keywords; attention-value-based keyword selection uses model attention to select keywords. The method regularizes the model so that it reconstructs keywords from the other words and makes low-confidence predictions when it lacks sufficient context, and it can greatly improve OOD detection and cross-domain generalization without reducing classification accuracy.
Description
Technical Field
The invention relates to the technical field of text classification, in particular to a regularized text classification fine tuning method based on manual masking keywords.
Background
Pre-trained language models achieve state-of-the-art accuracy on various text classification tasks such as sentiment analysis, natural language inference, and semantic textual similarity. However, the reliability of fine-tuned text classifiers is severely underestimated: it has not been possible to build models that detect out-of-distribution samples or remain robust under domain shift, mainly because such models rely excessively on a limited number of keywords instead of attending to the entire context.
Cause of the defect: current research on text classification focuses only on evaluating model accuracy and neglects reliability. Meanwhile, the excessive dependence of conventional methods on keywords causes problems in out-of-distribution detection and generalization.
Disclosure of Invention
The invention aims to provide a regularized text classification fine tuning method based on manual masking keywords.
In order to achieve the above purpose, the present invention provides the following technical solutions: a regularized text classification fine tuning method based on manual masking keywords comprises the following steps:
s100, data acquisition and processing: collecting text data required by a model, marking the categories of the text data, constructing a data set required by the model, and pre-training the data set;
s200, selecting keywords based on frequency: selecting keywords using the relative frequencies of words in the dataset;
s300, keyword selection based on attention value: selecting a keyword using model attention;
s400, reconstructing a masking keyword: reconstructing keywords from the keyword-masked document;
s500, hidden entropy regularization: applying entropy regularization to predictions on context-masked documents, in which non-keyword context words are randomly deleted;
s600, performance evaluation: and evaluating the text classification accuracy.
Further, in the frequency-based keyword selection of step S200, token importance is measured by TF-IDF, comparing a token's frequency in the target document with its frequency across the whole corpus, and the keywords are defined as the words with the highest TF-IDF scores.
Further, in the attention-value-based keyword selection of step S300, a model is first trained with the standard cross-entropy loss L_CE, and the attention values of the trained model are then used to select the tokens with the highest attention values as keywords.
Further, in the attention-value-based keyword selection of step S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the embedded document, where a_i corresponds to the input token t_i; the attention-based score of a token t is then set as s_att(t) = Σ_{x∈D} Σ_{i=1}^{T} 1[t_i = t] · a_i, where 1[·] is the indicator function.
Further, in the masking keyword reconstruction of step S400, keyword regularization is applied only to sentences containing keywords. Let k̃ be a random subset of the full keyword set k, each element being selected independently with probability p; masking k̃ out of the original document x yields the keyword-masked document x̃ = x − k̃, and the masked keyword reconstruction loss is obtained as L_MKR = −(1/|k̃|) Σ_{i ∈ index(k̃)} log p(t_i = v_i | x̃).
Further, in the hidden entropy regularization of step S500, let c̃ be a random subset of the full context-word set c = x − k, each element being selected independently with probability q; masking c̃ out of the original document x yields the context-masked document x̂ = x − c̃, and the hidden entropy regularization term is obtained as L_MER = D_KL(U(y) ‖ p(y | x̂)). Finally, the training objective is set as L_final = L_CE + λ_MKR · L_MKR + λ_MER · L_MER.
Further, the performance evaluation of step S600 mainly evaluates classification accuracy, out-of-distribution (OOD) detection, and cross-domain generalization.
Further, steps S200 and S300 may be performed in either order.
Further, steps S400 and S500 may be performed in either order.
The invention has the following technical effects: aiming at the problems that current text classification research neglects model reliability and that models depend excessively on a few keywords, the invention provides a method that makes predictions based on the whole context and is therefore more reliable. The method regularizes the model so that it reconstructs keywords from the surrounding words and makes low-confidence predictions when there is insufficient context. Applied to pre-trained language models such as BERT, RoBERTa, and ALBERT, the method greatly improves OOD detection and cross-domain generalization without degrading classification accuracy.
Drawings
FIG. 1 is a flow chart of the system of the present invention.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Examples
A regularized text classification fine tuning method based on manual masking keywords, as shown in FIG. 1, comprises the following steps:
s100, data acquisition and processing: collecting text data required by a model, marking the categories of the text data, constructing a data set required by the model, and pre-training the data set;
s200, selecting keywords based on frequency: selecting keywords using the relative frequencies of words in the dataset;
s300, keyword selection based on attention value: selecting a keyword using model attention;
s400, reconstructing a masking keyword: reconstructing keywords from the keyword-masked document;
s500, hidden entropy regularization: applying entropy regularization to predictions on context-masked documents, in which non-keyword context words are randomly deleted;
s600, performance evaluation: and evaluating the text classification accuracy.
In the frequency-based keyword selection of step S200, token importance is measured by TF-IDF (term frequency–inverse document frequency), which compares a token's frequency in the target document with its frequency across the whole corpus; the keywords are defined as the words with the highest TF-IDF scores. Let X_c be the document formed by concatenating all tokens of class c in the corpus, and let D = [X_1, …, X_C] be the resulting collection of C class documents; the frequency-based keyword score of a token t is then s_freq(t) = tf(t, X) · idf(t, D), where tf(t, X) = 0.5 + 0.5 · n_t / max_{t'} n_{t'} (n_t being the number of occurrences of t in X) and idf(t, D) = log(|D| / |{X ∈ D : t ∈ X}|). Frequency-based selection is model-independent and relatively cheap to compute, but it does not directly reflect a word's contribution to the model's prediction.
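To make the scoring concrete, the TF-IDF selection can be sketched as follows (an illustrative sketch, not part of the patent text; the toy corpus and the max-count normalization in the augmented term frequency are assumptions):

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, corpus):
    """Score each token of one document by TF-IDF against a corpus."""
    counts = Counter(doc_tokens)
    max_count = max(counts.values())
    scores = {}
    for t, n_t in counts.items():
        # Augmented term frequency: tf(t, X) = 0.5 + 0.5 * n_t / max n
        tf = 0.5 + 0.5 * n_t / max_count
        # Inverse document frequency: idf(t, D) = log(|D| / |{X in D : t in X}|)
        df = sum(1 for x in corpus if t in x)
        scores[t] = tf * math.log(len(corpus) / df)
    return scores

# Toy corpus: each "document" stands in for one class's concatenated tokens
corpus = [["great", "movie", "great", "fun"],
          ["bad", "movie", "boring"],
          ["great", "acting", "fun"]]
scores = tfidf_scores(corpus[0], corpus)
keywords = sorted(scores, key=scores.get, reverse=True)
```

Sorting by score puts the most class-indicative tokens first, matching the definition of keywords as the highest-scoring words.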
In the attention-value-based keyword selection of step S300, model attention is used to select keywords, because this more directly and efficiently gauges the importance of each keyword to the model's prediction. The model is trained with the standard cross-entropy loss L_CE, and its attention values are used to select the tokens with the highest attention values as keywords.
In the attention-value-based keyword selection of step S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the embedded document, where a_i corresponds to the input token t_i; the attention-based score of a token t is then set as
s_att(t) = Σ_{x∈D} Σ_{i=1}^{T} 1[t_i = t] · a_i,
where 1[·] is the indicator function.
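The attention-based score s_att(t) can be sketched as follows (illustrative only; the token lists and attention values are made-up stand-ins for a fine-tuned classifier's attention, not the patent's implementation):

```python
from collections import defaultdict

def attention_scores(docs):
    """Aggregate s_att(t) = sum of 1[t_i = t] * a_i over a corpus."""
    scores = defaultdict(float)
    for tokens, attn in docs:  # attn[i] is the attention value on tokens[i]
        for t, a in zip(tokens, attn):
            scores[t] += a     # the indicator 1[t_i = t] selects matching tokens
    return dict(scores)

# Hypothetical per-token attention values from a trained classifier
docs = [(["the", "plot", "was", "brilliant"], [0.05, 0.10, 0.05, 0.80]),
        (["brilliant", "acting", "overall"], [0.70, 0.20, 0.10])]
scores = attention_scores(docs)
top_keyword = max(scores, key=scores.get)
```

Tokens that repeatedly receive high attention across documents accumulate the largest scores and become the selected keywords.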
In the masking keyword reconstruction of step S400, to force the model to understand the surrounding context, it is made to reconstruct the keywords from the keyword-masked document. The principle is similar to the masking mechanism in BERT, except that this scheme masks only keywords rather than random words, and the loss is applied only to sentences containing keywords (sentences without keywords are ignored). Formally, let k̃ be a random subset of the full keyword set k, each element being selected independently with probability p; masking k̃ out of the original document x yields the keyword-masked document x̃ = x − k̃, and the masked keyword reconstruction loss is
L_MKR = −(1/|k̃|) Σ_{i ∈ index(k̃)} log p(t_i = v_i | x̃),
where index(k̃) is the set of positions of the masked keywords k̃ in the original document x, and v_i is the index of the keyword at position i in the vocabulary. The choice of keyword-selection method matters here as well: experiments show that attention-based selection performs better than frequency-based or random selection.
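A minimal sketch of keyword masking and the reconstruction loss (illustrative; `mask_keywords`, the toy tokens, and the predicted distributions are assumptions for demonstration, not the patent's implementation):

```python
import math
import random

def mask_keywords(tokens, keywords, p=0.5, seed=0):
    """Mask each keyword occurrence independently with probability p."""
    rng = random.Random(seed)
    return [("[MASK]" if t in keywords and rng.random() < p else t)
            for t in tokens]

def mkr_loss(token_probs, masked_positions, original_tokens):
    """Mean negative log-likelihood of recovering each masked keyword."""
    losses = [-math.log(token_probs[i][original_tokens[i]])
              for i in masked_positions]
    return sum(losses) / len(losses)

tokens = ["the", "movie", "was", "brilliant"]
masked = mask_keywords(tokens, keywords={"brilliant"}, p=1.0)
# Hypothetical predicted distribution at the masked position only
probs = [None, None, None, {"brilliant": 0.5, "boring": 0.5}]
loss = mkr_loss(probs, masked_positions=[3], original_tokens=tokens)
```

With a predicted probability of 0.5 for the true keyword, the loss is log 2; it shrinks as the model learns to infer the keyword from context alone.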
In the hidden entropy regularization of step S500, the model should not be able to classify a context-masked document confidently, because the original context is absent. Formally, let c̃ be a random subset of the full context-word set c = x − k, each element being selected independently with probability q; masking c̃ out of the original document x yields the context-masked document x̂ = x − c̃, and the hidden entropy regularization term is
L_MER = D_KL(U(y) ‖ p(y | x̂)),
where D_KL is the KL divergence and U(y) is the uniform distribution over labels. This regularization does not reduce classification accuracy, because it normalizes unrealistic masked sentences rather than complete documents. The final training objective is
L_final = L_CE + λ_MKR · L_MKR + λ_MER · L_MER,
where λ_MKR and λ_MER are the hyper-parameters weighting the masked keyword reconstruction (MKR) loss and the masked entropy regularization (MER) loss, respectively.
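The entropy term and the combined objective can be sketched as follows (illustrative; the class probabilities and the λ values are assumed for demonstration):

```python
import math

def mer_loss(pred_probs):
    """D_KL(U(y) || p(y | x_hat)): penalizes confident predictions
    on a context-masked document."""
    u = 1.0 / len(pred_probs)  # uniform distribution over labels
    return sum(u * math.log(u / p) for p in pred_probs)

def total_loss(l_ce, l_mkr, l_mer, lam_mkr=0.001, lam_mer=0.001):
    """L_final = L_CE + lambda_MKR * L_MKR + lambda_MER * L_MER
    (the lambda values here are illustrative, not from the patent)."""
    return l_ce + lam_mkr * l_mkr + lam_mer * l_mer

confident = mer_loss([0.98, 0.01, 0.01])   # strongly penalized
uncertain = mer_loss([0.34, 0.33, 0.33])   # near-uniform, tiny penalty
```

A sharp prediction on a keywords-only input incurs a large KL penalty, while a near-uniform prediction incurs almost none, which is exactly the behavior the regularizer encourages.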
In the performance evaluation of step S600, classification accuracy, OOD detection, and cross-domain generalization indexes are mainly evaluated. The classification accuracy of the model is not reduced, while the OOD detection and cross-domain generalization indexes are greatly improved.
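One common OOD-detection index, AUROC over model confidences, can be sketched as follows (illustrative; the confidence values are hypothetical, not results from the patent):

```python
def auroc(in_scores, ood_scores):
    """Probability that a random in-distribution sample receives a higher
    confidence than a random OOD sample (ties count 0.5)."""
    wins = sum((i > o) + 0.5 * (i == o)
               for i in in_scores for o in ood_scores)
    return wins / (len(in_scores) * len(ood_scores))

# Hypothetical maximum-softmax confidences of a fine-tuned classifier
in_conf = [0.95, 0.90, 0.80]
ood_conf = [0.60, 0.50, 0.85]
score = auroc(in_conf, ood_conf)
```

An AUROC of 1.0 means every in-distribution sample is ranked above every OOD sample; 0.5 means the confidences are uninformative.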
Steps S200 and S300 may be performed in either order, as may steps S400 and S500.
The invention provides a fine tuning method based on regularization of manual masking keywords, so as to make overall predictions based on context. The method regularizes the model so that it reconstructs keywords from the other words and makes low-confidence predictions without sufficient context. Operating on pre-trained language models such as BERT, RoBERTa, and ALBERT, the method is highly reliable and can greatly improve OOD detection and cross-domain generalization without reducing classification accuracy.
The preferred embodiments of the present invention have been described in detail, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention, and the various changes are included in the scope of the present invention.
Claims (5)
1. A regularized text classification fine tuning method based on manual masking keywords is characterized by comprising the following steps:
s100, data acquisition and processing: collecting text data required by a model, marking the categories of the text data, constructing a data set required by the model, and pre-training the data set;
s200, selecting keywords based on frequency: selecting keywords using the relative frequencies of words in the dataset;
s300, keyword selection based on attention value: selecting keywords using model attention; in the attention-value-based keyword selection, a model is trained with the standard cross-entropy loss L_CE, and the attention values of the model are used to select the tokens with the highest attention values as keywords; in the attention-value-based keyword selection of S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the embedded document, where a_i corresponds to the input token t_i, and the attention-based score of a token t is set as s_att(t) = Σ_{x∈D} Σ_{i=1}^{T} 1[t_i = t] · a_i, where 1[·] is the indicator function;
s400, reconstructing a masking keyword: reconstructing keywords from the keyword-masked document; in the S400 masking keyword reconstruction, keyword regularization is applied to sentences containing keywords; let k̃ be a random subset of the full keyword set k, each element being selected independently with probability p; masking k̃ out of the original document x yields the keyword-masked document x̃ = x − k̃, and the masked keyword reconstruction loss is L_MKR = −(1/|k̃|) Σ_{i ∈ index(k̃)} log p(t_i = v_i | x̃), where index(k̃) is the set of positions of k̃ in the original document x and v_i is the index of the keyword at position i in the vocabulary;
s500, hidden entropy regularization: applying entropy regularization to predictions on context-masked documents, in which non-keyword context words are randomly deleted; in the S500 hidden entropy regularization, let c̃ be a random subset of the full context-word set c = x − k, each element being selected independently with probability q; masking c̃ out of the original document x yields the context-masked document x̂ = x − c̃, and the hidden entropy regularization term is L_MER = D_KL(U(y) ‖ p(y | x̂)); finally, the training objective is set as L_final = L_CE + λ_MKR · L_MKR + λ_MER · L_MER, where D_KL is the KL divergence, U(y) is the uniform distribution over labels, and λ_MKR and λ_MER are the hyper-parameters weighting the masked keyword reconstruction (MKR) loss and the masked entropy regularization (MER) loss, respectively;
s600, performance evaluation: and evaluating the text classification accuracy.
2. The regularized text classification fine-tuning method based on artificial masking keywords according to claim 1, wherein in the frequency-based keyword selection of step S200, token importance is measured by TF-IDF, comparing a token's frequency in the target document with its frequency in the whole corpus, and the keywords are defined as the words with the highest TF-IDF scores.
3. The regularized text classification fine-tuning method based on artificial masking keywords as recited in claim 1, wherein,
in the step S600 performance evaluation, classification accuracy, OOD detection and cross-domain generalization indexes are mainly evaluated.
4. The regularized text classification fine-tuning method based on artificial masking keywords as recited in claim 1, wherein,
steps S200 and S300 may be performed in either order.
5. The regularized text classification fine-tuning method based on artificial masking keywords as recited in claim 1, wherein,
steps S400 and S500 may be performed in either order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110302636.1A CN113032563B (en) | 2021-03-22 | 2021-03-22 | Regularized text classification fine tuning method based on manual masking keywords |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113032563A CN113032563A (en) | 2021-06-25 |
CN113032563B true CN113032563B (en) | 2023-07-14 |
Family ID: 76472302
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110302636.1A Active CN113032563B (en) | 2021-03-22 | 2021-03-22 | Regularized text classification fine tuning method based on manual masking keywords |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113032563B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014043519A1 (en) * | 2012-09-14 | 2014-03-20 | Population Diagnostics Inc. | Methods and compositions for diagnosing, prognosing, and treating neurological conditions |
CN110119765A (en) * | 2019-04-18 | 2019-08-13 | 浙江工业大学 | A kind of keyword extracting method based on Seq2seq frame |
CN110222349A (en) * | 2019-06-13 | 2019-09-10 | 成都信息工程大学 | A kind of model and method, computer of the expression of depth dynamic context word |
CN111339278A (en) * | 2020-02-28 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Method and device for generating training speech generating model and method and device for generating answer speech |
CN111444709A (en) * | 2020-03-09 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Text classification method, device, storage medium and equipment |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111488459A (en) * | 2020-04-15 | 2020-08-04 | 焦点科技股份有限公司 | Product classification method based on keywords |
CN111563166A (en) * | 2020-05-28 | 2020-08-21 | 浙江学海教育科技有限公司 | Pre-training model method for mathematical problem classification |
CN111563373A (en) * | 2020-04-13 | 2020-08-21 | 中南大学 | Attribute-level emotion classification method for focused attribute-related text |
CN112115247A (en) * | 2020-09-07 | 2020-12-22 | 中国人民大学 | Personalized dialogue generation method and system based on long-time and short-time memory information |
CN112214599A (en) * | 2020-10-20 | 2021-01-12 | 电子科技大学 | Multi-label text classification method based on statistics and pre-training language model |
CN112256876A (en) * | 2020-10-26 | 2021-01-22 | 南京工业大学 | Aspect-level emotion classification model based on multi-memory attention network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107315823B (en) * | 2017-07-04 | 2020-11-03 | 北京京东尚科信息技术有限公司 | Data processing method and device based on electronic commerce |
- 2021-03-22: Application CN202110302636.1A filed in China; granted as patent CN113032563B (status: Active)
Non-Patent Citations (4)
Title |
---|
BERTSurv: BERT-Based Survival Models for Predicting Outcomes of Trauma Patients;Yun Zhao 等;《arXiv:2103.10928v1》;20210319;1-15 * |
Understanding KL Divergence; Boceng; 《https://www.cnblogs.com/boceng/p/11519381.html》; 20190914; 1-3 *
Research on an Improved Text Representation Model Based on BERT; Wang Nanzhi; 《China Master's Theses Full-text Database, Information Science and Technology》; 20200115 (No. 01); I138-2641 *
Research on Fine-grained Text Sentiment Classification for the Social Internet of Things; Tian Fang; 《China Master's Theses Full-text Database, Information Science and Technology》; 20210315 (No. 03); I136-338 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Diggelmann et al. | Climate-fever: A dataset for verification of real-world climate claims | |
Mei et al. | Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research | |
Wu et al. | Learning to tag | |
CN111125349A (en) | Graph model text abstract generation method based on word frequency and semantics | |
CN108304372A (en) | Entity extraction method and apparatus, computer equipment and storage medium | |
CN111914062B (en) | Long text question-answer pair generation system based on keywords | |
CN110096572B (en) | Sample generation method, device and computer readable medium | |
CN110807324A (en) | Video entity identification method based on IDCNN-crf and knowledge graph | |
CN115309872B (en) | Multi-model entropy weighted retrieval method and system based on Kmeans recall | |
Xie et al. | T2ranking: A large-scale chinese benchmark for passage ranking | |
Hillard et al. | Learning weighted entity lists from web click logs for spoken language understanding | |
CN117271792A (en) | Method for constructing enterprise domain knowledge base based on large model | |
CN116756303A (en) | Automatic generation method and system for multi-topic text abstract | |
CN116862318B (en) | New energy project evaluation method and device based on text semantic feature extraction | |
CN111581365B (en) | Predicate extraction method | |
CN115146021A (en) | Training method and device for text retrieval matching model, electronic equipment and medium | |
CN113032563B (en) | Regularized text classification fine tuning method based on manual masking keywords | |
CN116720498A (en) | Training method and device for text similarity detection model and related medium thereof | |
CN117131383A (en) | Method for improving search precision drainage performance of double-tower model | |
CN110019814B (en) | News information aggregation method based on data mining and deep learning | |
CN109189915A (en) | A kind of information retrieval method based on depth relevant matches model | |
Amini et al. | Incorporating prior knowledge into a transductive ranking algorithm for multi-document summarization | |
Kalmar | Bootstrapping Websites for Classification of Organization Names on Twitter. | |
Sotudeh et al. | Qontsum: On contrasting salient content for query-focused summarization | |
CN111930880A (en) | Text code retrieval method, device and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||