CN112800222B - Multi-task auxiliary limit multi-label short text classification method using co-occurrence information - Google Patents
Info
- Publication number
- CN112800222B CN112800222B CN202110101374.2A CN202110101374A CN112800222B CN 112800222 B CN112800222 B CN 112800222B CN 202110101374 A CN202110101374 A CN 202110101374A CN 112800222 B CN112800222 B CN 112800222B
- Authority
- CN
- China
- Prior art keywords
- task
- label
- microblog
- text classification
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a multi-task-assisted extreme multi-label short text classification method using co-occurrence information, whose main technical steps are: constructing an account-feature file; using the account-feature file to provide additional feature information for each microblog short text and modeling that feature information as explicit co-occurrence information in the model input; constructing a multi-label text classification task and an extreme multi-label text classification task for the microblog short texts; constructing a multi-task learning model; pre-training the multi-task learning model with large-scale microblog short text data; fine-tuning the multi-task learning model; and quantizing the neural network output to finally produce the multi-task prediction result. The method designs a multi-task learning architecture using co-occurrence information, realizes multi-label classification of large-scale short texts, and achieves stable, accurate, real-time multi-label prediction on large-scale short text datasets at a low industrial deployment cost.
Description
Technical Field
The invention belongs to the field of information technology, relates to natural language processing and text classification, and in particular relates to a multi-task-assisted extreme multi-label short text classification method using co-occurrence information.
Background
As text data is produced ever faster and grows increasingly diverse and semantically complex, conventional multi-label text classification methods struggle to meet day-to-day industrial requirements for accuracy and real-time performance, and the demand for extreme multi-label text classification over large-scale label sets keeps growing.
To address these problems, the prior art mostly relies on embedding, multi-classifier, tree, or deep learning methods. Embedding methods have high time complexity, and their effectiveness depends heavily on the clustering quality achieved during preprocessing. Multi-classifier methods ignore inter-label information, treating each label as an independent individual; since every label requires its own classifier, the deployment cost is enormous, making them hard to apply effectively in real business scenarios. Tree methods cannot handle the long-tail problem in the data; they are large in scale, costly, imprecise, and hard to use stably in industrial settings. Existing deep learning methods do not optimize for the long-tail problem but simply increase the number of neurons in the output layer, so their effect is generally worse than that of the other three approaches.
In view of the above, the method provided by the invention avoids these drawbacks: it designs a multi-task learning architecture using co-occurrence information, realizes multi-label classification of large-scale short texts, and achieves stable, accurate, real-time multi-label prediction on large-scale short text datasets at a low industrial deployment cost.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-task-assisted extreme multi-label short text classification method using co-occurrence information that is reasonably designed, accurate in prediction, low in cost, and easy to implement.
The technical problem to be solved by the invention is realized by adopting the following technical scheme:
A multi-task-assisted extreme multi-label short text classification method using co-occurrence information comprises the following steps:
step 1, constructing an account-feature file from the related feature information of microblog posting accounts;
step 2, using the account-feature file to provide additional feature information F for each microblog short text, and modeling the feature information F as explicit model input co-occurrence information C;
step 3, constructing a multi-label text classification task t1 related to the microblog short texts;
step 4, constructing a large-scale extreme multi-label text classification task t2 for the microblog short texts;
step 5, jointly modeling the multi-label text classification task t1 and the extreme multi-label text classification task t2 to obtain a multi-task learning model T;
step 6, pre-training the pre-training model M inside the multi-task learning model T with large-scale microblog short text data;
step 7, fine-tuning the multi-task learning model T with small-scale, accurately labeled microblog short text data;
step 8, quantizing the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combining the quantized probability results of each task according to a joint rule, and finally outputting the multi-task prediction result.
The microblog posting account in step 1 is the author account that publishes the microblog text. The posting account includes the following feature information: commonly used posting tags, the usual posting place, and the account name. Since the related feature information of a posting account may contain errors, the account-feature file is cleaned and updated when it is constructed.
The feature information F obtained in step 2 is:
F = {F1, F2, …, Fn}
where Fi is one item of feature information among the related features of the microblog posting account, and n is the number of related features.
The explicit model input co-occurrence information C is:
C = F1 [SEP] F2 [SEP] … [SEP] Fn
where [SEP] is a special marker in the model input text.
The pre-training model M serves as a shared layer, and task-specific layers above it implement the respective tasks.
In step 6, the pre-training model M inside the multi-task learning model T is pre-trained with large-scale unlabeled microblog short text data, or the multi-task learning model T is pre-trained with large-scale noisily labeled microblog short text data.
The quantization of the last layer of neural network output in the step 8 is realized by SoftMax and normalization operations.
The joint rule is as follows: assume the multi-label text classification task t1 and the extreme multi-label text classification task t2 are correlated to a certain degree. When the probability of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 is considered more reliable, so t1 is taken as the reference and t2's decision is adjusted accordingly; otherwise, t2 is considered more reliable, so t2 is taken as the reference and t1's decision is adjusted accordingly.
The invention has the advantages and positive effects that:
1. The invention applies the multi-task-assisted extreme multi-label short text classification method with co-occurrence information to deep-learning prediction over a large-scale label set, and can predict stably and effectively over that label set at a low industrial deployment cost.
2. The method improves prediction on both high-frequency and low-frequency labels through co-occurrence information, and uses what the multi-label text classification task learns to assist the extreme multi-label text classification task. During training, the pre-training model is first pre-trained with large-scale unlabeled data (or the whole method is pre-trained with large-scale noisily labeled data), and the method is then fine-tuned with accurately labeled data.
3. The co-occurrence information not only assists label prediction to a certain extent; by maintaining the account-feature file, misjudged samples can also be guided toward the correct result.
4. Daily industrial maintenance requires only maintaining the account-feature file, which greatly reduces maintenance cost and effectively addresses the high maintenance cost of deep learning in industrial scenarios.
Drawings
FIG. 1 is a process diagram of the present invention.
FIG. 2 is a diagram of a multi-task learning model architecture according to the present invention.
Detailed Description
The present invention is further described in detail below with reference to the accompanying drawings.
The design idea of the invention is as follows: use co-occurrence information and multi-task learning to assist and improve the prediction of the extreme multi-label text classification task. Inspired by the explicit association between co-occurrence information and the labels in the label set, the method constructs co-occurrence information from the account's related feature information, which effectively improves prediction on both high-frequency and low-frequency labels. Further, inspired by parameter sharing in multi-task learning, and given that a single task can rarely solve practical problems in a complex industrial environment, the method uses what the multi-label text classification task has learned to assist the prediction of the extreme multi-label text classification task. Finally, since deep learning methods are hard to maintain cheaply in practice, the invention controls the account's related features by maintaining the account-feature file, and thereby controls the co-occurrence information fed to the model, achieving low-cost maintenance.
For convenience of explanation, the symbols used in the present invention will now be described:
based on the above reasonable design, the invention provides a multitask auxiliary limit multi-label short text classification method using co-occurrence information, as shown in fig. 1, comprising the following steps:
step 1, an account-feature file is constructed by utilizing related feature information of a microblog text sending account.
The microblog posting account is the author account that publishes the microblog text. The posting account carries feature information related to the target classification task, such as commonly used posting tags, the usual posting place, and the account name.
In this step, an account-feature file is constructed from the microblog posting accounts and their corresponding related feature information. Since the related feature information of a posting account may contain errors, the account-feature file is cleaned and updated.
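The account-feature file described above can be sketched as a simple aggregation over posts. This is a minimal illustrative sketch, not the patent's actual implementation: the field names (`account_id`, `account_name`, `place`, `tags`) and the cleaning heuristic (keep the most frequent place and top-3 tags) are assumptions.

```python
def build_account_features(posts):
    """Aggregate per-account features (common tags, usual place, account name)
    from a list of post records, then "clean" them by keeping only the most
    frequent values. Field names and the cleaning rule are illustrative."""
    profile = {}
    for p in posts:
        acc = profile.setdefault(p["account_id"], {
            "name": p["account_name"], "places": {}, "tags": {},
        })
        place = p.get("place", "")
        acc["places"][place] = acc["places"].get(place, 0) + 1
        for t in p.get("tags", []):
            acc["tags"][t] = acc["tags"].get(t, 0) + 1
    # Cleaning step: keep the most frequent place and the top-3 tags per account.
    cleaned = {}
    for aid, acc in profile.items():
        top_place = max(acc["places"], key=acc["places"].get) if acc["places"] else ""
        top_tags = sorted(acc["tags"], key=acc["tags"].get, reverse=True)[:3]
        cleaned[aid] = {"name": acc["name"], "place": top_place, "tags": top_tags}
    return cleaned
```

Updating the file then amounts to re-running the aggregation over fresh posts, which is what makes the low-cost maintenance claim in the description plausible.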
And 2, providing additional characteristic information F for each microblog short text by using the account-characteristic file, and modeling the characteristic information F as explicit model input co-occurrence information C.
Every microblog short text has a posting account, and the account-feature file yields additional feature information F = {F1, F2, …, Fn} for each microblog short text, where Fi is one item of feature information among the related features of the posting account and n is the number of related features. The feature information F is modeled as explicit model input co-occurrence information C = F1 [SEP] F2 [SEP] … [SEP] Fn, where [SEP] is a special marker in the model input text. The model input text essentially consists of the target text (here, the microblog short text) and the co-occurrence information related to the classification task; its specific form is determined by the model input (here, the pre-training model ERNIE). If the microblog short text content is denoted Content, the model input text may be represented as [CLS] Content C, where [CLS] is a special marker in the model input text.
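The construction of the model input string can be sketched as follows. This is a hedged sketch: whether Content and C are separated by an extra [SEP] (as is conventional for ERNIE/BERT-style inputs) is an assumption, since the description only gives the form [CLS] Content C.

```python
def build_model_input(content, features, sep="[SEP]", cls="[CLS]"):
    """Join the account features F1..Fn into explicit co-occurrence
    information C = F1 [SEP] F2 [SEP] ... [SEP] Fn, then prepend the
    target text, yielding [CLS] Content [SEP] C (the [SEP] between
    Content and C is an assumed convention)."""
    c = f" {sep} ".join(features)
    return f"{cls} {content} {sep} {c}"
```

For example, with features `["tech", "Beijing", "Alice"]` the three features are joined by two [SEP] markers and a third separates them from the content.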
Step 3, constructing a multi-label text classification task t1 related to the microblog short texts.
A multi-label text classification task related to the target text is constructed, for example an emotion classification task t1 on microblog short texts.
Step 4, constructing a large-scale extreme multi-label text classification task t2 for the microblog short texts.
An extreme multi-label text classification task related to the target text is constructed, for example a large-scale label classification task t2 on microblog short texts.
Step 5, jointly modeling the multi-label text classification task t1 and the extreme multi-label text classification task t2 as a multi-task learning model T.
In this step, t1 and t2 are jointly modeled as the multi-task learning model T. The pre-training model M (for example, but not limited to, ERNIE) serves as the shared layer, and the upper layers implement the respective tasks.
The structure of the multi-task learning model T is shown in FIG. 2. [CLS] Content C is the model input, and the pre-trained ERNIE model is the shared parameter layer. The output of ERNIE passes through two fully connected layers: the left fully connected layer feeds t1; the output of the right fully connected layer is concatenated with the output of ERNIE, to make full use of the information learned by the multi-label text classification task, and feeds t2.
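The two-head wiring in FIG. 2 can be sketched in a few lines. This is a toy sketch under stated assumptions: the encoder output `h` stands in for ERNIE's pooled output, the weight shapes are illustrative, and real implementations would use a deep learning framework rather than plain lists.

```python
def linear(x, w, b):
    """Fully connected layer: y = xW + b, with w given as out_dim rows,
    each a list of in_dim weights."""
    return [sum(xi * wij for xi, wij in zip(x, row)) + bi
            for row, bi in zip(w, b)]

def multitask_heads(h, w1, b1, wf, bf, w2, b2):
    """Two heads over a shared encoder output h:
    - left head: one FC layer producing the t1 logits directly;
    - right head: an FC layer whose output is concatenated with h
      (re-using what the multi-label task learned) before the t2 layer."""
    t1_logits = linear(h, w1, b1)
    hidden = linear(h, wf, bf)
    t2_logits = linear(hidden + h, w2, b2)  # list concat models feature concat
    return t1_logits, t2_logits
```

The concatenation `hidden + h` is the step that lets the extreme multi-label head see both the shared encoder representation and the intermediate representation shaped by the auxiliary task.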
Step 6, pre-training the pre-training model M inside the multi-task learning model T with large-scale unlabeled microblog short text data, or pre-training T with large-scale noisily labeled microblog short text data.
Step 7, fine-tuning the multi-task learning model T with small-scale, accurately labeled microblog short text data.
Step 8: quantizing the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combining the quantized probability results of each task according to a joint rule, and finally outputting the multi-task prediction result.
In this step, quantizing the last-layer neural network outputs of tasks t1 and t2 can be realized with SoftMax and normalization operations. Rules are then designed over the quantized probability results of each task, and the multi-task result is finally output.
The rule design takes into account that industrial scenarios are noisy and their requirements complex, so that judgments are difficult not only for the model but even for humans. Assume tasks t1 and t2 are correlated to a certain degree. When the probability of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 can be considered more reliable, so t1 is taken as the reference and t2's decision is adjusted; and vice versa. On top of this rule, further complex rules can be designed in combination with the business scenario to meet business requirements.
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.
Claims (7)
1. A multi-task-assisted extreme multi-label short text classification method using co-occurrence information, characterized by comprising the following steps:
step 1, constructing an account-feature file by using related feature information of a microblog posting account;
step 2, providing additional feature information F for each microblog short text by using the account-feature file, and modeling the feature information F as explicit model input co-occurrence information C;
step 3, constructing a multi-label text classification task t1 related to the microblog short texts;
step 4, constructing a large-scale extreme multi-label text classification task t2 for the microblog short texts;
step 5, jointly modeling the multi-label text classification task t1 and the extreme multi-label text classification task t2 to obtain a multi-task learning model T;
step 6, pre-training the pre-training model M inside the multi-task learning model T with large-scale microblog short text data;
step 7, fine-tuning the multi-task learning model T with small-scale, accurately labeled microblog short text data;
step 8, quantizing the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combining the quantized probability results of each task according to a joint rule, and finally outputting the multi-task prediction result;
the feature information F obtained in step 2 is:
F = {F1, F2, …, Fn}
where Fi is one item of feature information among the related features of the microblog posting account, and n is the number of related features;
the explicit model input co-occurrence information C is:
C = F1 [SEP] F2 [SEP] … [SEP] Fn
where [SEP] is a special marker in the model input text.
2. The method for multi-task-assisted extreme multi-label short text classification with co-occurrence information according to claim 1, characterized in that: in step 6, the pre-training model M inside the multi-task learning model T is pre-trained with large-scale unlabeled microblog short text data, or the multi-task learning model T is pre-trained with large-scale noisily labeled microblog short text data.
3. The method for multi-task-assisted extreme multi-label short text classification with co-occurrence information according to claim 1 or 2, characterized in that: the microblog posting account is the author account that publishes the microblog text; the posting account includes the following feature information: commonly used posting tags, the usual posting place, and the account name.
4. The method for multi-tasking assisted-extreme multi-label short text classification with co-occurrence information according to claim 1 or 2, characterized in that: in the step 1, when the account-feature file is constructed, the account-feature file is cleaned and updated in consideration of errors in related feature information of a microblog posting account.
5. The method for multi-tasking assisted-extreme multi-label short text classification with co-occurrence information according to claim 1 or 2, characterized in that: the pre-training model M is a shared layer, and the upper layer of the pre-training model M implements respective tasks.
6. The method for multi-tasking assisted-extreme multi-label short text classification with co-occurrence information according to claim 1 or 2, characterized in that: the quantization of the last layer of neural network output in the step 8 is realized by SoftMax and normalization operations.
7. The method for multi-task-assisted extreme multi-label short text classification with co-occurrence information according to claim 6, characterized in that: the joint rule is: assume the multi-label text classification task t1 and the extreme multi-label text classification task t2 are correlated to a certain degree; when the probability of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 is considered more reliable, so t1 is taken as the reference and t2's decision is adjusted; otherwise, t2 is considered more reliable, so t2 is taken as the reference and t1's decision is adjusted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110101374.2A CN112800222B (en) | 2021-01-26 | 2021-01-26 | Multi-task auxiliary limit multi-label short text classification method using co-occurrence information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110101374.2A CN112800222B (en) | 2021-01-26 | 2021-01-26 | Multi-task auxiliary limit multi-label short text classification method using co-occurrence information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112800222A CN112800222A (en) | 2021-05-14 |
CN112800222B true CN112800222B (en) | 2022-07-19 |
Family
ID=75811747
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110101374.2A Active CN112800222B (en) | 2021-01-26 | 2021-01-26 | Multi-task auxiliary limit multi-label short text classification method using co-occurrence information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112800222B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20240188895A1 (en) * | 2021-07-27 | 2024-06-13 | Boe Technology Group Co., Ltd. | Model training method, signal recognition method, apparatus, computing and processing device, computer program, and computer-readable medium |
CN114490951B (en) * | 2022-04-13 | 2022-07-08 | 长沙市智为信息技术有限公司 | Multi-label text classification method and model |
CN117033641A (en) * | 2023-10-07 | 2023-11-10 | 江苏微皓智能科技有限公司 | Network structure optimization fine tuning method of large-scale pre-training language model |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577405A (en) * | 2012-07-19 | 2014-02-12 | 中国人民大学 | Interest analysis based micro-blogger community classification method |
WO2014047727A1 (en) * | 2012-09-28 | 2014-04-03 | Alkis Papadopoullos | A method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model |
CN104881689A (en) * | 2015-06-17 | 2015-09-02 | 苏州大学张家港工业技术研究院 | Method and system for multi-label active learning classification |
CN110442723A (en) * | 2019-08-14 | 2019-11-12 | 山东大学 | A method of multi-tag text classification is used for based on the Co-Attention model that multistep differentiates |
CN110442707A (en) * | 2019-06-21 | 2019-11-12 | 电子科技大学 | A kind of multi-tag file classification method based on seq2seq |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577549B (en) * | 2013-10-16 | 2017-02-15 | 复旦大学 | Crowd portrayal system and method based on microblog label |
CN109684478B (en) * | 2018-12-18 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Classification model training method, classification device, classification equipment and medium |
CN111553442B (en) * | 2020-05-12 | 2024-03-12 | 国网智能电网研究院有限公司 | Optimization method and system for classifier chain tag sequence |
CN111709475B (en) * | 2020-06-16 | 2024-03-15 | 全球能源互联网研究院有限公司 | N-gram-based multi-label classification method and device |
CN112199536A (en) * | 2020-10-15 | 2021-01-08 | 华中科技大学 | Cross-modality-based rapid multi-label image classification method and system |
- 2021
- 2021-01-26 CN CN202110101374.2A patent/CN112800222B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577405A (en) * | 2012-07-19 | 2014-02-12 | 中国人民大学 | Interest analysis based micro-blogger community classification method |
WO2014047727A1 (en) * | 2012-09-28 | 2014-04-03 | Alkis Papadopoullos | A method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model |
CN104881689A (en) * | 2015-06-17 | 2015-09-02 | 苏州大学张家港工业技术研究院 | Method and system for multi-label active learning classification |
CN110442707A (en) * | 2019-06-21 | 2019-11-12 | 电子科技大学 | A kind of multi-tag file classification method based on seq2seq |
CN110442723A (en) * | 2019-08-14 | 2019-11-12 | 山东大学 | A method of multi-tag text classification is used for based on the Co-Attention model that multistep differentiates |
Also Published As
Publication number | Publication date |
---|---|
CN112800222A (en) | 2021-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112800222B (en) | Multi-task auxiliary limit multi-label short text classification method using co-occurrence information | |
CN110569508A (en) | Method and system for classifying emotional tendencies by fusing part-of-speech and self-attention mechanism | |
CN107943967A (en) | Algorithm of documents categorization based on multi-angle convolutional neural networks and Recognition with Recurrent Neural Network | |
CN110807328A (en) | Named entity identification method and system oriented to multi-strategy fusion of legal documents | |
CN112163089B (en) | High-technology text classification method and system integrating named entity recognition | |
CN110874411A (en) | Cross-domain emotion classification system based on attention mechanism fusion | |
CN112395417A (en) | Network public opinion evolution simulation method and system based on deep learning | |
CN113051914A (en) | Enterprise hidden label extraction method and device based on multi-feature dynamic portrait | |
CN111581967A (en) | News theme event detection method combining LW2V and triple network | |
US20230289528A1 (en) | Method for constructing sentiment classification model based on metaphor identification | |
CN114239574A (en) | Miner violation knowledge extraction method based on entity and relationship joint learning | |
CN113673254A (en) | Knowledge distillation position detection method based on similarity maintenance | |
CN113869055A (en) | Power grid project characteristic attribute identification method based on deep learning | |
CN114444481B (en) | Sentiment analysis and generation method of news comment | |
CN116663540A (en) | Financial event extraction method based on small sample | |
CN111813939A (en) | Text classification method based on representation enhancement and fusion | |
CN114841151A (en) | Medical text entity relation joint extraction method based on decomposition-recombination strategy | |
CN115062727A (en) | Graph node classification method and system based on multi-order hypergraph convolutional network | |
CN114048314A (en) | Natural language steganalysis method | |
CN111709231B (en) | Class case recommendation method based on self-attention variational self-coding | |
CN113297374A (en) | Text classification method based on BERT and word feature fusion | |
CN117350286A (en) | Natural language intention translation method oriented to intention driving data link network | |
CN112612884A (en) | Entity label automatic labeling method based on public text | |
CN115631504B (en) | Emotion identification method based on bimodal graph network information bottleneck | |
CN116304064A (en) | Text classification method based on extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||