CN112800222B - Multi-task-assisted extreme multi-label short text classification method using co-occurrence information - Google Patents


Info

Publication number
CN112800222B
CN112800222B
Authority
CN
China
Prior art keywords
task
label
microblog
text classification
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110101374.2A
Other languages
Chinese (zh)
Other versions
CN112800222A (en)
Inventor
王嫄
徐涛
王世龙
周宇博
王欢
杨巨成
赵婷婷
陈亚瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Science and Technology
Original Assignee
Tianjin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Science and Technology filed Critical Tianjin University of Science and Technology
Priority to CN202110101374.2A priority Critical patent/CN112800222B/en
Publication of CN112800222A publication Critical patent/CN112800222A/en
Application granted granted Critical
Publication of CN112800222B publication Critical patent/CN112800222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a multi-task-assisted extreme multi-label short text classification method using co-occurrence information, whose main technical steps are: constructing an account-feature file; using the account-feature file to provide additional feature information for each microblog short text, and modeling that feature information as explicit model-input co-occurrence information; constructing a multi-label text classification task and an extreme multi-label text classification task for the microblog short texts; constructing a multi-task learning model; pre-training the multi-task learning model with large-scale microblog short text data; fine-tuning the multi-task learning model; and quantizing the neural network output to produce the final multi-task prediction result. The method uses co-occurrence information to design a multi-task learning architecture, realizes multi-label classification of large-scale short texts, and can deliver stable, accurate, real-time multi-label prediction on large-scale short text data sets at low industrial deployment cost.

Description

Multi-task-assisted extreme multi-label short text classification method using co-occurrence information
Technical Field
The invention belongs to the field of information technology, relates to natural language processing and text classification, and particularly relates to a multi-task-assisted extreme multi-label short text classification method using co-occurrence information.
Background
As text data is produced at ever-increasing speed and grows more diverse and semantically complex, conventional multi-label text classification methods struggle to meet everyday industrial requirements for accuracy and real-time performance, and the demand for extreme multi-label text classification under large-scale label-set scenarios keeps growing.
To address these problems, the prior art mostly relies on embedding, multi-classifier, tree, and deep learning methods. Embedding methods have high time complexity, and their effectiveness depends heavily on the clustering quality achieved during preprocessing. Multi-classifier methods ignore inter-label information by treating labels as independent individuals, and since each label requires its own classifier, deployment costs are enormous, making them hard to apply effectively to real business scenarios. Tree methods cannot handle the long-tail problem in the data, and are large in scale, costly, imprecise, and hard to use stably in industrial settings. Existing deep learning methods do not optimize for the long-tail problem but simply enlarge the output layer, so their results are generally inferior to the other three families of methods.
In view of the above, the method provided by the invention avoids these disadvantages: it uses co-occurrence information to design a multi-task learning architecture, realizes multi-label classification of large-scale short texts, and can deliver stable, accurate, real-time multi-label prediction on large-scale short text data sets at low industrial deployment cost.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-task-assisted extreme multi-label short text classification method using co-occurrence information that is reasonably designed, accurate in prediction, low in cost, and easy to implement.
The technical problem to be solved by the invention is realized by adopting the following technical scheme:
A multi-task-assisted extreme multi-label short text classification method using co-occurrence information comprises the following steps:
step 1, constructing an account-feature file from the related feature information of the microblog posting accounts;
step 2, using the account-feature file to provide additional feature information F for each microblog short text, and modeling the feature information F as explicit model-input co-occurrence information C;
step 3, constructing a multi-label text classification task t1 for the microblog short texts;
step 4, constructing a large-scale extreme multi-label text classification task t2 for the microblog short texts;
step 5, jointly modeling the multi-label text classification task t1 and the extreme multi-label text classification task t2 to obtain a multi-task learning model T;
step 6, pre-training the pre-training model M inside the multi-task learning model T with large-scale microblog short text data;
step 7, fine-tuning the multi-task learning model T with small-scale, accurately labeled microblog short text data;
step 8, quantizing the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combining the quantized probability results of the two tasks according to a joint rule, and outputting the final multi-task prediction result.
The microblog posting account in step 1 is the author account that publishes a microblog text. The posting account carries the following feature information: commonly used posting tags, common posting locations, and the account name. When the account-feature file is constructed, it is cleaned and updated, since the related feature information of a posting account may contain errors.
The feature information F obtained in step 2 is:
F = {F1, F2, …, Fn}
where Fi is one item of feature information among the related features of the microblog posting account, and n is the number of related features of the account.
The explicit model-input co-occurrence information C is:
C = [SEP] F1 [SEP] F2 [SEP] … [SEP] Fn
where [SEP] is a special marker in the model input text.
The pre-training model M serves as a shared layer, and the layers above it implement the respective tasks.
In step 6, the pre-training model M inside the multi-task learning model T is pre-trained with large-scale unlabeled microblog short text data, or the multi-task learning model T is pre-trained with large-scale labeled microblog short text data that may contain errors.
The quantization of the last-layer neural network output in step 8 is realized by SoftMax and normalization operations.
The joint rule is as follows: assume the multi-label text classification task t1 and the extreme multi-label text classification task t2 are correlated to some degree. When the probability value of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 is considered the more reliable task, so t1 is taken as the reference and t2's decision is adjusted accordingly; otherwise, t2 is considered the more reliable task, so t2 is taken as the reference and t1's decision is adjusted accordingly.
The invention has the advantages and positive effects that:
1. The invention applies the multi-task-assisted extreme multi-label short text classification method with co-occurrence information to deep-learning prediction over a large-scale label set, and can predict over the large-scale label set stably and effectively at low industrial deployment cost.
2. The method improves prediction on both high-frequency and low-frequency labels through co-occurrence information, and uses the information learned by the multi-label text classification task to assist the learning of the extreme multi-label text classification task. During training, the pre-training model is first pre-trained with large-scale unlabeled data, or the whole model is pre-trained with large-scale labeled data containing noise; the model is then fine-tuned with accurately labeled data.
3. The co-occurrence information used by the invention not only assists label prediction to a certain extent, but can also steer misjudged samples toward the correct result through maintenance of the account-feature file.
4. Daily industrial maintenance requires only maintaining the account-feature file, which greatly reduces maintenance cost and effectively addresses the high maintenance cost of deep learning in industrial scenarios.
Drawings
FIG. 1 is a process diagram of the present invention;
FIG. 2 is a diagram of a multi-task learning model architecture according to the present invention.
Detailed Description
The present invention is further described in detail below with reference to the accompanying drawings.
The design idea of the invention is as follows: co-occurrence information and multi-task learning are used to assist and improve the prediction of the extreme multi-label text classification task. Inspired by the explicit association between co-occurrence information and the labels in the label set, the method constructs co-occurrence information from the related feature information of the posting account, which effectively improves prediction on both high-frequency and low-frequency labels. Further, inspired by parameter sharing in multi-task learning, and given that in a complex industrial environment a single task can rarely solve the practical problem on its own, the method uses the information learned by the multi-label text classification task to assist the prediction of the extreme multi-label text classification task. In practical application, where deep learning methods are hard to maintain at low cost, the invention controls the related features of each account by maintaining the account-feature file, thereby controlling the co-occurrence information fed into the model and achieving low-cost maintenance.
For convenience of explanation, the principal symbols used in the present invention are: F (the feature information of a posting account), Fi (a single feature), n (the number of features), C (the explicit model-input co-occurrence information), t1 and t2 (the two classification tasks), T (the multi-task learning model), and M (the pre-training model); each is defined where it first appears below. (The symbol table of the original document is an image and is not reproduced here.)
Based on the above design, the invention provides a multi-task-assisted extreme multi-label short text classification method using co-occurrence information, as shown in FIG. 1, comprising the following steps.
Step 1: construct an account-feature file from the related feature information of the microblog posting accounts.
The microblog posting account is the author account that publishes a microblog text. The posting account carries feature information related to the target classification task, such as commonly used posting tags, common posting locations, and the account name.
In this step, the account-feature file is built from the microblog posting accounts and their corresponding related feature information. Since the related feature information of a posting account may contain errors, the account-feature file is cleaned and updated.
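The account-feature file described above can be sketched as a simple mapping from account to feature list, with basic cleaning. The record layout, account IDs, and feature strings here are illustrative assumptions, not specified by the patent:

```python
def build_account_feature_file(records):
    """Build an account -> feature-list mapping from (account, feature) pairs,
    dropping empty values and duplicates as a simple cleaning step."""
    profile = {}
    for account, feature in records:
        feature = feature.strip()
        if not feature:
            continue  # erroneous/empty feature entries are cleaned out
        feats = profile.setdefault(account, [])
        if feature not in feats:
            feats.append(feature)
    return profile

records = [
    ("user_001", "tag:finance"),
    ("user_001", "loc:Tianjin"),
    ("user_001", ""),             # error: removed by cleaning
    ("user_001", "tag:finance"),  # duplicate: removed
    ("user_002", "name:news_bot"),
]
profile = build_account_feature_file(records)
print(profile["user_001"])  # ['tag:finance', 'loc:Tianjin']
```

Periodic "cleaning and updating" then amounts to regenerating or patching this mapping as account information changes.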
Step 2: use the account-feature file to provide additional feature information F for each microblog short text, and model the feature information F as explicit model-input co-occurrence information C.
Every microblog short text has a posting account, so the account-feature file yields additional feature information for each short text:
F = {F1, F2, …, Fn}
where Fi is one item of feature information among the related features of the posting account, and n is the number of related features. Modeling with the feature information F yields the explicit model-input co-occurrence information:
C = [SEP] F1 [SEP] F2 [SEP] … [SEP] Fn
where [SEP] is a special marker in the model input text. The model input text consists essentially of the target text (a microblog short text in this example) and the classification-task-related co-occurrence information; its concrete form is determined by the input format of the model (the pre-training model ERNIE is used as an example). If the content of a microblog short text is denoted Content, the model input text can be represented as [CLS] Content C, where [CLS] is likewise a special marker in the model input text.
Step 3: construct a multi-label text classification task t1 for the microblog short texts.
A multi-label text classification task is constructed for the target texts, for example an emotion classification task t1 on the microblog short texts.
Step 4: construct a large-scale extreme multi-label text classification task t2 for the microblog short texts.
An extreme multi-label text classification task is constructed for the target texts, for example a large-scale label classification task t2 on the microblog short texts.
Step 5: jointly model the multi-label text classification task t1 and the extreme multi-label text classification task t2 as a multi-task learning model T.
In this step, t1 and t2 are jointly modeled as a multi-task learning model T. The pre-training model M (for example, but not limited to, ERNIE) serves as the shared layer, and the layers above it implement the respective tasks.
The structure of the multi-task learning model T is shown in FIG. 2. [CLS] Content C serves as the model input, and the pre-trained ERNIE model serves as the shared parameter layer. The ERNIE output passes through two fully connected layers: the left fully connected layer points to t1; the right fully connected layer is concatenated with the ERNIE output, so as to fully utilize the information learned by the multi-label text classification task, and points to t2.
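The two-branch head of FIG. 2 can be sketched at the shape level with a random matrix standing in for the shared ERNIE encoder. The dimensions, layer sizes, and exact concatenation (here the left branch's output joined with the encoder output, one plausible reading of "concatenated with the ERNIE output") are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_t1, n_t2 = 8, 3, 50  # hidden size and label counts are illustrative

# Stand-in for the shared ERNIE encoder output for one input text.
h = rng.standard_normal(d_model)

# Left branch: fully connected layer pointing to the multi-label task t1.
W1 = rng.standard_normal((n_t1, d_model))
z1 = W1 @ h

# Right branch: its input concatenates the encoder output with the left
# branch's output, so t2 can reuse what the t1 head has learned; a fully
# connected layer then points to the extreme multi-label task t2.
W2 = rng.standard_normal((n_t2, d_model + n_t1))
z2 = W2 @ np.concatenate([h, z1])

print(z1.shape, z2.shape)  # (3,) (50,)
```

In a real implementation both branches would be trained jointly with shared encoder parameters; here only the tensor shapes of the forward pass are demonstrated.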
Step 6: pre-train the pre-training model M inside the multi-task learning model T with large-scale unlabeled microblog short text data, or pre-train T with large-scale labeled microblog short text data that may contain errors.
Step 7: fine-tune the multi-task learning model T with small-scale, accurately labeled microblog short text data.
Step 8: quantize the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combine the quantized probability results of the two tasks according to a joint rule, and output the final multi-task prediction result.
In this step, the quantization of the last-layer neural network outputs of tasks t1 and t2 can be realized by SoftMax and normalization operations. A rule is then designed over the quantized probability results of the two tasks, and the multi-task result is finally output.
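The SoftMax quantization mentioned above can be sketched in pure Python (the sample logits are arbitrary):

```python
import math

def softmax(logits):
    """Quantize raw last-layer outputs into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The resulting probabilities sum to 1, so the maximum-probability neurons of the two tasks become directly comparable in the joint rule.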
In this step, the rule design takes into account that industrial scenarios are noisy and their requirements complex: not only is the model's judgment difficult, but so is human judgment. Assume tasks t1 and t2 are correlated to some degree. When the probability value of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 can be considered the more reliable task, so t1 is taken as the reference and t2's decision is adjusted, and vice versa. On this basis, more complex rules can be designed in combination with the business scenario to meet business requirements.
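A minimal sketch of this joint rule follows. Since the concrete decision adjustment is business-specific, the sketch reduces it to reporting which task serves as the reference (an illustrative simplification, not the patent's full rule):

```python
def joint_rule(p_t1, p_t2):
    """Decide which task's quantized output serves as the reference.

    The task whose maximum probability value is larger is deemed more
    reliable; the other task's decision is adjusted using it as reference."""
    return "t1" if max(p_t1) > max(p_t2) else "t2"

print(joint_rule([0.9, 0.1], [0.4, 0.3, 0.3]))  # t1: t1 is more confident
print(joint_rule([0.5, 0.5], [0.8, 0.1, 0.1]))  # t2: t2 is more confident
```

More elaborate business rules (thresholds, per-label overrides) can be layered on top of this comparison.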
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims (7)

1. A multi-task-assisted extreme multi-label short text classification method using co-occurrence information, characterized in that the method comprises the following steps:
step 1, constructing an account-feature file from the related feature information of the microblog posting accounts;
step 2, using the account-feature file to provide additional feature information F for each microblog short text, and modeling the feature information F as explicit model-input co-occurrence information C;
step 3, constructing a multi-label text classification task t1 for the microblog short texts;
step 4, constructing a large-scale extreme multi-label text classification task t2 for the microblog short texts;
step 5, jointly modeling the multi-label text classification task t1 and the extreme multi-label text classification task t2 to obtain a multi-task learning model T;
step 6, pre-training the pre-training model M inside the multi-task learning model T with large-scale microblog short text data;
step 7, fine-tuning the multi-task learning model T with small-scale, accurately labeled microblog short text data;
step 8, quantizing the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combining the quantized probability results of the two tasks according to a joint rule, and outputting the final multi-task prediction result;
wherein the feature information F obtained in step 2 is:
F = {F1, F2, …, Fn}
where Fi is one item of feature information among the related features of the microblog posting account, and n is the number of related features of the account;
and the explicit model-input co-occurrence information C is:
C = [SEP] F1 [SEP] F2 [SEP] … [SEP] Fn
where [SEP] is a special marker in the model input text.
2. The multi-task-assisted extreme multi-label short text classification method using co-occurrence information according to claim 1, characterized in that: in step 6, the pre-training model M inside the multi-task learning model T is pre-trained with large-scale unlabeled microblog short text data, or the multi-task learning model T is pre-trained with large-scale labeled microblog short text data that may contain errors.
3. The multi-task-assisted extreme multi-label short text classification method using co-occurrence information according to claim 1 or 2, characterized in that: the microblog posting account is the author account that publishes a microblog text; the posting account comprises the following feature information: commonly used posting tags, common posting locations, and the account name.
4. The multi-task-assisted extreme multi-label short text classification method using co-occurrence information according to claim 1 or 2, characterized in that: in step 1, when the account-feature file is constructed, it is cleaned and updated, since the related feature information of a microblog posting account may contain errors.
5. The multi-task-assisted extreme multi-label short text classification method using co-occurrence information according to claim 1 or 2, characterized in that: the pre-training model M is a shared layer, and the layers above it implement the respective tasks.
6. The multi-task-assisted extreme multi-label short text classification method using co-occurrence information according to claim 1 or 2, characterized in that: the quantization of the last-layer neural network output in step 8 is realized by SoftMax and normalization operations.
7. The multi-task-assisted extreme multi-label short text classification method using co-occurrence information according to claim 6, characterized in that the joint rule is: assume the multi-label text classification task t1 and the extreme multi-label text classification task t2 are correlated to some degree; when the probability value of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 is considered the more reliable task, so t1 is taken as the reference and t2's decision is adjusted; otherwise, t2 is considered the more reliable task, so t2 is taken as the reference and t1's decision is adjusted.
CN202110101374.2A 2021-01-26 2021-01-26 Multi-task auxiliary limit multi-label short text classification method using co-occurrence information Active CN112800222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110101374.2A CN112800222B (en) 2021-01-26 2021-01-26 Multi-task auxiliary limit multi-label short text classification method using co-occurrence information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110101374.2A CN112800222B (en) 2021-01-26 2021-01-26 Multi-task auxiliary limit multi-label short text classification method using co-occurrence information

Publications (2)

Publication Number Publication Date
CN112800222A CN112800222A (en) 2021-05-14
CN112800222B (en) 2022-07-19

Family

ID=75811747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110101374.2A Active CN112800222B (en) 2021-01-26 2021-01-26 Multi-task auxiliary limit multi-label short text classification method using co-occurrence information

Country Status (1)

Country Link
CN (1) CN112800222B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240188895A1 (en) * 2021-07-27 2024-06-13 Boe Technology Group Co., Ltd. Model training method, signal recognition method, apparatus, computing and processing device, computer program, and computer-readable medium
CN114490951B (en) * 2022-04-13 2022-07-08 长沙市智为信息技术有限公司 Multi-label text classification method and model
CN117033641A (en) * 2023-10-07 2023-11-10 江苏微皓智能科技有限公司 Network structure optimization fine tuning method of large-scale pre-training language model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577405A (en) * 2012-07-19 2014-02-12 中国人民大学 Interest analysis based micro-blogger community classification method
WO2014047727A1 (en) * 2012-09-28 2014-04-03 Alkis Papadopoullos A method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model
CN104881689A (en) * 2015-06-17 2015-09-02 苏州大学张家港工业技术研究院 Method and system for multi-label active learning classification
CN110442723A (en) * 2019-08-14 2019-11-12 山东大学 A method of multi-tag text classification is used for based on the Co-Attention model that multistep differentiates
CN110442707A (en) * 2019-06-21 2019-11-12 电子科技大学 A kind of multi-tag file classification method based on seq2seq

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577549B (en) * 2013-10-16 2017-02-15 复旦大学 Crowd portrayal system and method based on microblog label
CN109684478B (en) * 2018-12-18 2023-04-07 腾讯科技(深圳)有限公司 Classification model training method, classification device, classification equipment and medium
CN111553442B (en) * 2020-05-12 2024-03-12 国网智能电网研究院有限公司 Optimization method and system for classifier chain tag sequence
CN111709475B (en) * 2020-06-16 2024-03-15 全球能源互联网研究院有限公司 N-gram-based multi-label classification method and device
CN112199536A (en) * 2020-10-15 2021-01-08 华中科技大学 Cross-modality-based rapid multi-label image classification method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577405A (en) * 2012-07-19 2014-02-12 中国人民大学 Interest analysis based micro-blogger community classification method
WO2014047727A1 (en) * 2012-09-28 2014-04-03 Alkis Papadopoullos A method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model
CN104881689A (en) * 2015-06-17 2015-09-02 苏州大学张家港工业技术研究院 Method and system for multi-label active learning classification
CN110442707A (en) * 2019-06-21 2019-11-12 电子科技大学 A kind of multi-tag file classification method based on seq2seq
CN110442723A (en) * 2019-08-14 2019-11-12 山东大学 A method of multi-tag text classification is used for based on the Co-Attention model that multistep differentiates

Also Published As

Publication number Publication date
CN112800222A (en) 2021-05-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant