CN113032563A - Regularization text classification fine-tuning method based on manually masked keywords - Google Patents
- Publication number: CN113032563A (application CN202110302636.1A)
- Authority: CN (China)
- Prior art keywords: keywords, keyword, model, regularization, text classification
- Legal status: Granted (status as listed by Google; an assumption, not a legal conclusion)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
- G06F16/3344 — Query execution using natural language analysis
- G06F16/3346 — Query execution using probabilistic model
- G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216 — Parsing using statistical methods
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of text classification, and particularly relates to a regularization text classification fine-tuning method based on manually masked keywords, comprising the following steps: data acquisition and processing, frequency-based keyword selection, attention-value-based keyword selection, masked keyword reconstruction, masked entropy regularization, and performance evaluation. Data acquisition and processing acquires the text data required by the model, labels its category, constructs the dataset required by the model, and pre-trains on that dataset; frequency-based keyword selection uses the relative frequencies of words in the dataset to select keywords; attention-value-based keyword selection uses the model's attention to select keywords. The method regularizes the model so that it reconstructs keywords from the other words and makes low-confidence predictions when the context is insufficient. The method can greatly improve OOD (out-of-distribution) detection and cross-domain generalization without reducing classification accuracy.
Description
Technical Field
The invention relates to the technical field of text classification, in particular to a regularization text classification fine-tuning method based on manually masked keywords.
Background
At present, pre-trained language models achieve state-of-the-art accuracy on a variety of text classification tasks, such as sentiment analysis, natural language inference and semantic textual similarity. However, the reliability of fine-tuned text classifiers is severely underestimated: it has not been possible to build models that can detect out-of-distribution (OOD) samples or that are robust to domain shifts, mainly because models depend excessively on a limited number of keywords rather than attending to the entire context.
Causes of the problem or defect: current research on text classification focuses only on evaluating model accuracy and ignores reliability. Meanwhile, the excessive dependence of traditional methods on keywords can cause problems with out-of-distribution detection and generalization.
Disclosure of Invention
The invention aims to provide a regularization text classification fine-tuning method based on manually masked keywords.
In order to achieve this purpose, the invention provides the following technical scheme. A regularization text classification fine-tuning method based on manually masked keywords comprises the following steps:
S100, data acquisition and processing: acquiring the text data required by the model, labeling its category, constructing the dataset required by the model, and pre-training on the dataset;
S200, frequency-based keyword selection: selecting keywords using the relative frequencies of words in the dataset;
S300, attention-value-based keyword selection: selecting keywords using the model's attention;
S400, masked keyword reconstruction: reconstructing the keywords from the keyword-masked document;
S500, masked entropy regularization: regularizing predictions on context-masked documents by randomly deleting non-keyword words from the context;
S600, performance evaluation: evaluating text classification accuracy.
Further, in the frequency-based keyword selection of step S200, token importance is measured with TF-IDF by comparing a token's frequency in the target document with its frequency in the entire corpus, and the keywords are defined as the words with the highest TF-IDF scores.
Further, in the attention-value-based keyword selection of step S300, the model is trained with the standard cross-entropy loss L_CE, and the model's attention values are used to select the keywords with the highest attention values.
Further, in the attention-value-based keyword selection of step S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the document embedding, where a_i corresponds to input token t_i. The attention-based score of a token t is set as s_att(t) = (1/|I_t|) Σ_{i ∈ I_t} a_i, the average attention over the set of positions I_t at which t occurs.
Further, in the masked keyword reconstruction of step S400, only the keywords of the sentence are regularized. Assume k̃ is a random subset of the full keyword set k, with each element chosen independently with probability p; k̃ is masked from the original document x to obtain the masked document x̃ = x − k̃. Finally, the masked keyword reconstruction loss is L_MKR(x; θ) = (1/|ind(k̃)|) Σ_{i ∈ ind(k̃)} −log p(t_i = v_i | x − k̃).
Further, in the masked entropy regularization of step S500, let c̃ be a randomly selected subset of the full context words c = x − k, with each element selected independently with probability q; c̃ is then masked from the original document x to obtain the context-masked document x − c̃. The masked entropy regularization loss is L_MER(x; θ) = D_KL(U(y) ‖ p(y | x − c̃)). Finally, the overall training objective is set as L = L_CE + λ_MKR · L_MKR + λ_MER · L_MER.
Further, in the performance evaluation of step S600, classification accuracy, OOD detection and cross-domain generalization are mainly evaluated.
Further, steps S200 and S300 may be performed in either order.
Further, steps S400 and S500 may be performed in either order.
The invention has the following technical effects. Aiming at the problems that current text classification research neglects model reliability and that existing methods depend excessively on keywords, the invention provides a method that makes holistic predictions based on the entire context and is therefore more reliable. The method regularizes the model so that it reconstructs keywords from the other words and makes low-confidence predictions when the context is insufficient. Applied to pre-trained language models such as BERT, RoBERTa and ALBERT, this approach can greatly improve OOD detection and cross-domain generalization without reducing classification accuracy.
Drawings
FIG. 1 is a flow chart of the system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
A regularization text classification fine-tuning method based on manually masked keywords, as shown in FIG. 1, comprises the following steps:
S100, data acquisition and processing: acquiring the text data required by the model, labeling its category, constructing the dataset required by the model, and pre-training on the dataset;
S200, frequency-based keyword selection: selecting keywords using the relative frequencies of words in the dataset;
S300, attention-value-based keyword selection: selecting keywords using the model's attention;
S400, masked keyword reconstruction: reconstructing the keywords from the keyword-masked document;
S500, masked entropy regularization: regularizing predictions on context-masked documents by randomly deleting non-keyword words from the context;
S600, performance evaluation: evaluating text classification accuracy.
In the frequency-based keyword selection of step S200, token importance is measured with TF-IDF (term frequency-inverse document frequency): the frequency of a token in the target document is compared with its frequency in the entire corpus, and the keywords are defined as the words with the highest TF-IDF scores. Let X_c be the document obtained by concatenating all documents of class c in the corpus D_c, and let D = [X_1, …, X_C] be the resulting collection over the C classes. The frequency-based keyword score of a token t is s_freq(t) = tf(t, X_c) · idf(t, D),
where tf(t, X) = 0.5 + 0.5 · n_t / max_{t'∈X} n_{t'} (with n_t the count of t in X) and idf(t, D) = log(|D| / |{X ∈ D : t ∈ X}|). Frequency-based selection is model-independent and relatively easy to compute, but it does not directly reflect the contribution of words to the model's predictions.
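As an illustration, the frequency-based score above can be sketched in plain Python. This is a minimal sketch; the tokenization into word lists and the per-class corpus layout are assumptions made for illustration, not details specified by the patent.

```python
import math
from collections import Counter

def tf(t, doc):
    # augmented term frequency: tf(t, X) = 0.5 + 0.5 * n_t / max_{t'} n_{t'}
    counts = Counter(doc)
    return 0.5 + 0.5 * counts[t] / max(counts.values())

def idf(t, corpus):
    # inverse document frequency: idf(t, D) = log(|D| / |{X in D : t in X}|)
    df = sum(1 for doc in corpus if t in doc)
    return math.log(len(corpus) / df)

def freq_score(t, class_doc, corpus):
    # s_freq(t) = tf(t, X_c) * idf(t, D); keywords are the highest-scoring tokens
    return tf(t, class_doc) * idf(t, corpus)
```

Here each X_c is a list of tokens concatenating one class's documents and D is the list of all X_c; a token that occurs in every class document gets idf 0, so class-discriminative words score highest.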
Step S300 selects keywords based on attention values, using the model's attention, since this is a more direct and efficient way to quantify how important a token is to the model's prediction. The model is first trained with the standard cross-entropy loss L_CE, and the model's attention values are then used to select the keywords with the highest attention values.
In the attention-value-based keyword selection of step S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the document embedding, where a_i corresponds to input token t_i. The attention-based score of a token t is set as s_att(t) = (1/|I_t|) Σ_{i ∈ I_t} a_i, the average attention over the set of positions I_t at which t occurs.
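The attention-based selection can be sketched as follows, assuming the per-token attention values have already been collected from the fine-tuned model; the data layout and function names are illustrative assumptions, not part of the patent.

```python
import numpy as np

def attention_scores(occurrences):
    # occurrences: token -> list of attention values a_i observed at the
    # positions where the token appears across the corpus;
    # s_att(t) is the average attention over those positions
    return {t: float(np.mean(vals)) for t, vals in occurrences.items()}

def select_keywords(occurrences, top_m):
    # the keywords are the top_m tokens with the highest attention score
    scores = attention_scores(occurrences)
    return sorted(scores, key=scores.get, reverse=True)[:top_m]
```

Content words that the classifier leans on ("plot", "actor") accumulate higher average attention than function words ("the") and are therefore the ones selected for masking.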
In the masked keyword reconstruction of step S400, the model is forced to reconstruct the keywords from the keyword-masked document, in order to strengthen its understanding of the surrounding context. The principle is similar to the masking mechanism used to pre-train BERT, but this scheme masks only keywords rather than random words. The masked keyword reconstruction regularizes only the keywords of the sentence; the loss on non-keyword tokens is ignored. Formally, assume k̃ is a random subset of the full keyword set k, with each element chosen independently with probability p; k̃ is masked from the original document x to obtain the masked document x̃ = x − k̃. Finally, the masked keyword reconstruction loss is L_MKR(x; θ) = (1/|ind(k̃)|) Σ_{i ∈ ind(k̃)} −log p(t_i = v_i | x − k̃), where ind(k̃) is the set of positions of the masked keywords k̃ in the original document x and v_i is the vocabulary index of the token at position i. Choosing an appropriate keyword selection method matters here: experiments show that attention-based keyword selection performs better than frequency-based or random selection.
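Numerically, the reconstruction loss can be sketched as below. The per-position vocabulary logits would in practice come from a masked-language-model head run over the keyword-masked document; that head and the array shapes are assumptions for illustration.

```python
import numpy as np

def log_softmax(z):
    # numerically stable log of the softmax distribution
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def mkr_loss(logits, masked_positions, target_ids):
    # logits: (T, V) array of vocabulary logits, one row per position of the
    # keyword-masked document x - k~;
    # L_MKR = (1/|ind(k~)|) * sum_{i in ind(k~)} -log p(t_i = v_i | x - k~)
    total = 0.0
    for pos, vid in zip(masked_positions, target_ids):
        total += -log_softmax(logits[pos])[vid]
    return total / len(masked_positions)
```

Only the masked keyword positions contribute to the average, matching the statement above that the loss on non-keyword tokens is ignored.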
In the masked entropy regularization of step S500, the model should not be able to confidently classify a context-masked document, because the original context is no longer present. Formally, let c̃ be a randomly selected subset of the full context words c = x − k, where each element is selected independently with probability q; c̃ is then masked from the original document x to obtain the context-masked document x − c̃. The masked entropy regularization loss is L_MER(x; θ) = D_KL(U(y) ‖ p(y | x − c̃)), where D_KL is the KL divergence and U(y) is the uniform distribution over labels. Masked entropy regularization does not degrade classification accuracy, because it constrains unrealistic, masked sentences rather than complete documents. Finally, the overall training objective is set as L = L_CE + λ_MKR · L_MKR + λ_MER · L_MER, where λ_MKR and λ_MER are hyperparameters weighting the masked keyword reconstruction (MKR) and masked entropy regularization (MER) losses, respectively.
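The entropy regularizer and the combined objective can likewise be sketched; the default λ values below are purely illustrative, not values prescribed by the patent.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mer_loss(class_logits):
    # L_MER = D_KL( U(y) || p(y | x - c~) ): pushes the prediction on a
    # context-masked document toward the uniform label distribution
    p = softmax(class_logits)
    u = 1.0 / len(p)
    return float(np.sum(u * np.log(u / p)))

def total_loss(l_ce, l_mkr, l_mer, lam_mkr=0.1, lam_mer=0.1):
    # L = L_CE + lambda_MKR * L_MKR + lambda_MER * L_MER
    return l_ce + lam_mkr * l_mkr + lam_mer * l_mer
```

mer_loss is zero exactly when the prediction on the context-masked document is uniform, i.e. maximally unconfident, which is the behavior the regularizer rewards.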
In the performance evaluation of step S600, classification accuracy, OOD detection and cross-domain generalization are mainly evaluated. The scheme does not reduce the model's classification accuracy, while the OOD detection and cross-domain generalization indexes are greatly improved.
Steps S200 and S300 may be performed in either order, as may steps S400 and S500.
The invention provides a fine-tuning method based on manually masked keyword regularization, so that predictions can be made holistically on the basis of the entire context. The method regularizes the model so that it reconstructs keywords from the other words and makes low-confidence predictions when the context is insufficient. Run on pre-trained language models such as BERT, RoBERTa and ALBERT, the method shows good reliability and can greatly improve OOD detection and cross-domain generalization without reducing classification accuracy.
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.
Claims (9)
1. A regularization text classification fine-tuning method based on manually masked keywords, characterized by comprising the following steps:
S100, data acquisition and processing: acquiring the text data required by the model, labeling its category, constructing the dataset required by the model, and pre-training on the dataset;
S200, frequency-based keyword selection: selecting keywords using the relative frequencies of words in the dataset;
S300, attention-value-based keyword selection: selecting keywords using the model's attention;
S400, masked keyword reconstruction: reconstructing the keywords from the keyword-masked document;
S500, masked entropy regularization: regularizing predictions on context-masked documents by randomly deleting non-keyword words from the context;
S600, performance evaluation: evaluating text classification accuracy.
2. The method according to claim 1, characterized in that, in the frequency-based keyword selection of step S200, token importance is measured with TF-IDF by comparing a token's frequency in the target document with its frequency in the entire corpus, and the keywords are defined as the words with the highest TF-IDF scores.
3. The method according to claim 1, characterized in that, in the attention-value-based keyword selection of step S300, the model is trained with the standard cross-entropy loss L_CE, and the model's attention values are used to select the keywords with the highest attention values.
4. The method according to claim 1, characterized in that, in the attention-value-based keyword selection of step S300, let a = [a_1, …, a_T] ∈ R^T be the attention values over the document embedding, where a_i corresponds to input token t_i; the attention-based score of a token t is set as s_att(t) = (1/|I_t|) Σ_{i ∈ I_t} a_i, the average attention over the set of positions I_t at which t occurs.
5. The method according to claim 1, characterized in that, in the masked keyword reconstruction of step S400, only the keywords of the sentence are regularized; assuming k̃ is a random subset of the full keyword set k, with each element chosen independently with probability p, k̃ is masked from the original document x to obtain the masked document x̃ = x − k̃, and the masked keyword reconstruction loss is L_MKR(x; θ) = (1/|ind(k̃)|) Σ_{i ∈ ind(k̃)} −log p(t_i = v_i | x − k̃).
6. The method according to claim 1, characterized in that, in the masked entropy regularization of step S500, let c̃ be a randomly selected subset of the full context words c = x − k, with each element selected independently with probability q; c̃ is masked from the original document x to obtain the context-masked document x − c̃, the masked entropy regularization loss is L_MER(x; θ) = D_KL(U(y) ‖ p(y | x − c̃)), and the overall training objective is set as L = L_CE + λ_MKR · L_MKR + λ_MER · L_MER.
7. The method according to claim 1, characterized in that the performance evaluation of step S600 mainly evaluates classification accuracy, OOD detection and cross-domain generalization.
8. The method according to claim 1, characterized in that steps S200 and S300 may be performed in either order.
9. The method according to claim 1, characterized in that steps S400 and S500 may be performed in either order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110302636.1A (CN113032563B) | 2021-03-22 | 2021-03-22 | Regularized text classification fine-tuning method based on manually masked keywords |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113032563A (application) | 2021-06-25 |
CN113032563B (grant) | 2023-07-14 |
Family ID: 76472302. Family application: CN202110302636.1A (granted as CN113032563B), filed 2021-03-22 in China (CN), status Active.
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014043519A1 (en) * | 2012-09-14 | 2014-03-20 | Population Diagnostics Inc. | Methods and compositions for diagnosing, prognosing, and treating neurological conditions |
CN110119765A (en) * | 2019-04-18 | 2019-08-13 | 浙江工业大学 | A kind of keyword extracting method based on Seq2seq frame |
CN110222349A (en) * | 2019-06-13 | 2019-09-10 | 成都信息工程大学 | A kind of model and method, computer of the expression of depth dynamic context word |
US20200193500A1 (en) * | 2017-07-04 | 2020-06-18 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Data processing method and apparatus based on electronic commerce |
CN111339278A (en) * | 2020-02-28 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Method and device for generating training speech generating model and method and device for generating answer speech |
CN111444709A (en) * | 2020-03-09 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Text classification method, device, storage medium and equipment |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN111488459A (en) * | 2020-04-15 | 2020-08-04 | 焦点科技股份有限公司 | Product classification method based on keywords |
CN111563166A (en) * | 2020-05-28 | 2020-08-21 | 浙江学海教育科技有限公司 | Pre-training model method for mathematical problem classification |
CN111563373A (en) * | 2020-04-13 | 2020-08-21 | 中南大学 | Attribute-level emotion classification method for focused attribute-related text |
CN112115247A (en) * | 2020-09-07 | 2020-12-22 | 中国人民大学 | Personalized dialogue generation method and system based on long-time and short-time memory information |
CN112214599A (en) * | 2020-10-20 | 2021-01-12 | 电子科技大学 | Multi-label text classification method based on statistics and pre-training language model |
CN112256876A (en) * | 2020-10-26 | 2021-01-22 | 南京工业大学 | Aspect-level emotion classification model based on multi-memory attention network |
Non-Patent Citations (4)
Title |
---|
YUN ZHAO et al.: "BERTSurv: BERT-Based Survival Models for Predicting Outcomes of Trauma Patients", arXiv:2103.10928v1 |
Wang Nanzhi: "Research on an Improved Text Representation Model Based on BERT", China Masters' Theses Full-text Database, Information Science and Technology |
Tian Fang: "Research on Fine-Grained Text Sentiment Classification for the Social Internet of Things", China Masters' Theses Full-text Database, Information Science and Technology |
Bo Ceng: "Understanding KL Divergence", HTTPS://WWW.CNBLOGS.COM/BOCENG/P/11519381.HTML |
Legal Events
Date | Code | Title |
---|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |