CN112800222B - Multi-task auxiliary limit multi-label short text classification method using co-occurrence information - Google Patents
Info
- Publication number
- CN112800222B CN112800222B CN202110101374.2A CN202110101374A CN112800222B CN 112800222 B CN112800222 B CN 112800222B CN 202110101374 A CN202110101374 A CN 202110101374A CN 112800222 B CN112800222 B CN 112800222B
- Authority
- CN
- China
- Prior art keywords
- task
- label
- microblog
- text classification
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a multi-task-assisted extreme multi-label short text classification method using co-occurrence information, whose main technical steps are: constructing an account-feature file; using the account-feature file to provide additional feature information for each microblog short text and modeling that feature information as explicit co-occurrence information in the model input; constructing a multi-label text classification task and an extreme multi-label text classification task for the microblog short texts; constructing a multi-task learning model; pre-training the multi-task learning model with large-scale microblog short text data; fine-tuning the multi-task learning model; and quantizing the neural network output to finally produce the multi-task prediction result. The method designs a multi-task learning architecture using co-occurrence information, realizes multi-label classification of large-scale short texts, and achieves stable, accurate, real-time multi-label prediction on large-scale short text datasets at a low industrial deployment cost.
Description
Technical Field
The invention belongs to the field of information technology, relates to natural language processing and text classification, and in particular relates to a multi-task-assisted extreme multi-label short text classification method using co-occurrence information.
Background
As text data is produced ever faster and grows increasingly diverse and semantically complex, conventional multi-label text classification methods struggle to meet day-to-day industrial requirements for accuracy and real-time performance, and the demand for extreme multi-label text classification over large-scale label sets keeps growing.
To address these problems, the prior art mostly relies on embedding, multi-classifier, tree, or deep learning methods. Embedding methods have high time complexity, and their effectiveness depends heavily on the clustering quality achieved during preprocessing. Multi-classifier methods ignore inter-label information, treating each label as an independent individual; since every label requires its own classifier, the deployment cost is enormous, making them hard to apply effectively in real business scenarios. Tree methods cannot handle the long-tail problem in the data; they are large in scale, costly, imprecise, and hard to use stably in industrial settings. Existing deep learning methods do not optimize for the long-tail problem but simply increase the number of neurons in the output layer, so their effect is generally worse than that of the other three approaches.
In view of the above, the method provided by the invention avoids these drawbacks: it designs a multi-task learning architecture using co-occurrence information, realizes multi-label classification of large-scale short texts, and achieves stable, accurate, real-time multi-label prediction on large-scale short text datasets at a low industrial deployment cost.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-task-assisted extreme multi-label short text classification method using co-occurrence information that is reasonably designed, accurate in prediction, low in cost, and easy to implement.
The technical problem to be solved by the invention is realized by adopting the following technical scheme:
A multi-task-assisted extreme multi-label short text classification method using co-occurrence information comprises the following steps:
step 1, constructing an account-feature file from the related feature information of microblog posting accounts;
step 2, using the account-feature file to provide additional feature information F for each microblog short text, and modeling the feature information F as explicit model input co-occurrence information C;
step 3, constructing a multi-label text classification task t1 related to the microblog short texts;
step 4, constructing a large-scale extreme multi-label text classification task t2 for the microblog short texts;
step 5, jointly modeling the multi-label text classification task t1 and the extreme multi-label text classification task t2 to obtain a multi-task learning model T;
step 6, pre-training the pre-training model M inside the multi-task learning model T with large-scale microblog short text data;
step 7, fine-tuning the multi-task learning model T with small-scale, accurately labeled microblog short text data;
step 8, quantizing the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combining the quantized probability results of each task according to a joint rule, and finally outputting the multi-task prediction result.
The microblog posting account in step 1 is the author account that publishes the microblog text. The posting account includes the following feature information: commonly used posting tags, the usual posting place, and the account name. Since the related feature information of a posting account may contain errors, the account-feature file is cleaned and updated when it is constructed.
The feature information F obtained in step 2 is:
F = {F1, F2, …, Fn}
where Fi is one item of feature information among the related features of the microblog posting account, and n is the number of related features.
The explicit model input co-occurrence information C is:
C = F1 [SEP] F2 [SEP] … [SEP] Fn
where [SEP] is a special marker in the model input text.
The pre-training model M serves as a shared layer, and task-specific layers above it implement the respective tasks.
In step 6, the pre-training model M inside the multi-task learning model T is pre-trained with large-scale unlabeled microblog short text data, or the multi-task learning model T is pre-trained with large-scale noisily labeled microblog short text data.
The quantization of the last layer of neural network output in the step 8 is realized by SoftMax and normalization operations.
The joint rule is as follows: assume the multi-label text classification task t1 and the extreme multi-label text classification task t2 are correlated to a certain degree. When the probability of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 is considered more reliable, so t1 is taken as the reference and t2's decision is adjusted accordingly; otherwise, t2 is considered more reliable, so t2 is taken as the reference and t1's decision is adjusted accordingly.
The invention has the advantages and positive effects that:
1. The invention applies the multi-task-assisted extreme multi-label short text classification method with co-occurrence information to deep-learning prediction over a large-scale label set, and can predict stably and effectively over that label set at a low industrial deployment cost.
2. The method improves prediction on both high-frequency and low-frequency labels through co-occurrence information, and uses what the multi-label text classification task learns to assist the extreme multi-label text classification task. During training, the pre-training model is first pre-trained with large-scale unlabeled data (or the whole method is pre-trained with large-scale noisily labeled data), and the method is then fine-tuned with accurately labeled data.
3. The co-occurrence information not only assists label prediction to a certain extent; by maintaining the account-feature file, misjudged samples can also be guided toward the correct result.
4. Daily industrial maintenance requires only maintaining the account-feature file, which greatly reduces maintenance cost and effectively addresses the high maintenance cost of deep learning in industrial scenarios.
Drawings
FIG. 1 is a process diagram of the present invention.
FIG. 2 is a diagram of a multi-task learning model architecture according to the present invention.
Detailed Description
The present invention is further described in detail below with reference to the accompanying drawings.
The design idea of the invention is as follows: use co-occurrence information and multi-task learning to assist and improve the prediction of the extreme multi-label text classification task. Inspired by the explicit association between co-occurrence information and the labels in the label set, the method constructs co-occurrence information from the account's related feature information, which effectively improves prediction on both high-frequency and low-frequency labels. Further, inspired by parameter sharing in multi-task learning, and given that a single task can rarely solve practical problems in a complex industrial environment, the method uses what the multi-label text classification task has learned to assist the prediction of the extreme multi-label text classification task. Finally, since deep learning methods are hard to maintain cheaply in practice, the invention controls the account's related features by maintaining the account-feature file, and thereby controls the co-occurrence information fed to the model, achieving low-cost maintenance.
For convenience of explanation, the symbols used in the present invention will now be described:
based on the above reasonable design, the invention provides a multitask auxiliary limit multi-label short text classification method using co-occurrence information, as shown in fig. 1, comprising the following steps:
step 1, an account-feature file is constructed by utilizing related feature information of a microblog text sending account.
The microblog posting account is the author account that publishes the microblog text. The posting account carries feature information related to the target classification task, such as commonly used posting tags, the usual posting place, and the account name.
In this step, an account-feature file is constructed from the microblog posting accounts and their corresponding related feature information. Since the related feature information of a posting account may contain errors, the account-feature file is cleaned and updated.
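The account-feature file described above can be sketched as a simple aggregation over posts. This is a minimal illustrative sketch, not the patent's actual implementation: the field names (`account_id`, `account_name`, `place`, `tags`) and the cleaning heuristic (keep the most frequent place and top-3 tags) are assumptions.

```python
def build_account_features(posts):
    """Aggregate per-account features (common tags, usual place, account name)
    from a list of post records, then "clean" them by keeping only the most
    frequent values. Field names and the cleaning rule are illustrative."""
    profile = {}
    for p in posts:
        acc = profile.setdefault(p["account_id"], {
            "name": p["account_name"], "places": {}, "tags": {},
        })
        place = p.get("place", "")
        acc["places"][place] = acc["places"].get(place, 0) + 1
        for t in p.get("tags", []):
            acc["tags"][t] = acc["tags"].get(t, 0) + 1
    # Cleaning step: keep the most frequent place and the top-3 tags per account.
    cleaned = {}
    for aid, acc in profile.items():
        top_place = max(acc["places"], key=acc["places"].get) if acc["places"] else ""
        top_tags = sorted(acc["tags"], key=acc["tags"].get, reverse=True)[:3]
        cleaned[aid] = {"name": acc["name"], "place": top_place, "tags": top_tags}
    return cleaned
```

Updating the file then amounts to re-running the aggregation over fresh posts, which is what makes the low-cost maintenance claim in the description plausible.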
And 2, providing additional characteristic information F for each microblog short text by using the account-characteristic file, and modeling the characteristic information F as explicit model input co-occurrence information C.
Every microblog short text has a posting account, and the account-feature file yields additional feature information F = {F1, F2, …, Fn} for each microblog short text, where Fi is one item of feature information among the related features of the posting account and n is the number of related features. The feature information F is modeled as explicit model input co-occurrence information C = F1 [SEP] F2 [SEP] … [SEP] Fn, where [SEP] is a special marker in the model input text. The model input text essentially consists of the target text (here, the microblog short text) and the co-occurrence information related to the classification task; its specific form is determined by the model input (here, the pre-training model ERNIE). If the microblog short text content is denoted Content, the model input text may be represented as [CLS] Content C, where [CLS] is a special marker in the model input text.
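The construction of the model input string can be sketched as follows. This is a hedged sketch: whether Content and C are separated by an extra [SEP] (as is conventional for ERNIE/BERT-style inputs) is an assumption, since the description only gives the form [CLS] Content C.

```python
def build_model_input(content, features, sep="[SEP]", cls="[CLS]"):
    """Join the account features F1..Fn into explicit co-occurrence
    information C = F1 [SEP] F2 [SEP] ... [SEP] Fn, then prepend the
    target text, yielding [CLS] Content [SEP] C (the [SEP] between
    Content and C is an assumed convention)."""
    c = f" {sep} ".join(features)
    return f"{cls} {content} {sep} {c}"
```

For example, with features `["tech", "Beijing", "Alice"]` the three features are joined by two [SEP] markers and a third separates them from the content.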
Step 3, constructing a multi-label text classification task t1 related to the microblog short texts.
A multi-label text classification task related to the target text is constructed, for example an emotion classification task t1 on microblog short texts.
Step 4, constructing a large-scale extreme multi-label text classification task t2 for the microblog short texts.
An extreme multi-label text classification task related to the target text is constructed, for example a large-scale label classification task t2 on microblog short texts.
Step 5, jointly modeling the multi-label text classification task t1 and the extreme multi-label text classification task t2 as a multi-task learning model T.
In this step, t1 and t2 are jointly modeled as the multi-task learning model T. The pre-training model M (for example, but not limited to, ERNIE) serves as the shared layer, and the upper layers implement the respective tasks.
The structure of the multi-task learning model T is shown in FIG. 2. [CLS] Content C is the model input, and the pre-trained ERNIE model is the shared parameter layer. The output of ERNIE passes through two fully connected layers: the left fully connected layer feeds t1; the output of the right fully connected layer is concatenated with the output of ERNIE, to make full use of the information learned by the multi-label text classification task, and feeds t2.
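The two-head wiring in FIG. 2 can be sketched in a few lines. This is a toy sketch under stated assumptions: the encoder output `h` stands in for ERNIE's pooled output, the weight shapes are illustrative, and real implementations would use a deep learning framework rather than plain lists.

```python
def linear(x, w, b):
    """Fully connected layer: y = xW + b, with w given as out_dim rows,
    each a list of in_dim weights."""
    return [sum(xi * wij for xi, wij in zip(x, row)) + bi
            for row, bi in zip(w, b)]

def multitask_heads(h, w1, b1, wf, bf, w2, b2):
    """Two heads over a shared encoder output h:
    - left head: one FC layer producing the t1 logits directly;
    - right head: an FC layer whose output is concatenated with h
      (re-using what the multi-label task learned) before the t2 layer."""
    t1_logits = linear(h, w1, b1)
    hidden = linear(h, wf, bf)
    t2_logits = linear(hidden + h, w2, b2)  # list concat models feature concat
    return t1_logits, t2_logits
```

The concatenation `hidden + h` is the step that lets the extreme multi-label head see both the shared encoder representation and the intermediate representation shaped by the auxiliary task.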
Step 6, pre-training the pre-training model M inside the multi-task learning model T with large-scale unlabeled microblog short text data, or pre-training T with large-scale noisily labeled microblog short text data.
Step 7, fine-tuning the multi-task learning model T with small-scale, accurately labeled microblog short text data.
Step 8: quantizing the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combining the quantized probability results of each task according to a joint rule, and finally outputting the multi-task prediction result.
In this step, quantizing the last-layer neural network outputs of tasks t1 and t2 can be realized with SoftMax and normalization operations. Rules are then designed over the quantized probability results of each task, and the multi-task result is finally output.
The rule design takes into account that industrial scenarios are noisy and their requirements complex, so that judgments are difficult not only for the model but even for humans. Assume tasks t1 and t2 are correlated to a certain degree. When the probability of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 can be considered more reliable, so t1 is taken as the reference and t2's decision is adjusted; and vice versa. On top of this rule, further complex rules can be designed in combination with the business scenario to meet business requirements.
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.
Claims (7)
1. A multi-task-assisted extreme multi-label short text classification method using co-occurrence information, characterized by comprising the following steps:
step 1, constructing an account-feature file by using related feature information of a microblog posting account;
step 2, providing additional feature information F for each microblog short text by using the account-feature file, and modeling the feature information F as explicit model input co-occurrence information C;
step 3, constructing a multi-label text classification task t1 related to the microblog short texts;
step 4, constructing a large-scale extreme multi-label text classification task t2 for the microblog short texts;
step 5, jointly modeling the multi-label text classification task t1 and the extreme multi-label text classification task t2 to obtain a multi-task learning model T;
step 6, pre-training the pre-training model M inside the multi-task learning model T with large-scale microblog short text data;
step 7, fine-tuning the multi-task learning model T with small-scale, accurately labeled microblog short text data;
step 8, quantizing the last-layer neural network outputs of the multi-label text classification task t1 and the extreme multi-label text classification task t2, combining the quantized probability results of each task according to a joint rule, and finally outputting the multi-task prediction result;
the feature information F obtained in step 2 is:
F = {F1, F2, …, Fn}
where Fi is one item of feature information among the related features of the microblog posting account, and n is the number of related features;
the explicit model input co-occurrence information C is:
C = F1 [SEP] F2 [SEP] … [SEP] Fn
where [SEP] is a special marker in the model input text.
2. The method for multi-task-assisted extreme multi-label short text classification with co-occurrence information according to claim 1, characterized in that: in step 6, the pre-training model M inside the multi-task learning model T is pre-trained with large-scale unlabeled microblog short text data, or the multi-task learning model T is pre-trained with large-scale noisily labeled microblog short text data.
3. The method for multi-task-assisted extreme multi-label short text classification with co-occurrence information according to claim 1 or 2, characterized in that: the microblog posting account is the author account that publishes the microblog text; the posting account includes the following feature information: commonly used posting tags, the usual posting place, and the account name.
4. The method for multi-tasking assisted-extreme multi-label short text classification with co-occurrence information according to claim 1 or 2, characterized in that: in the step 1, when the account-feature file is constructed, the account-feature file is cleaned and updated in consideration of errors in related feature information of a microblog posting account.
5. The method for multi-tasking assisted-extreme multi-label short text classification with co-occurrence information according to claim 1 or 2, characterized in that: the pre-training model M is a shared layer, and the upper layer of the pre-training model M implements respective tasks.
6. The method for multi-tasking assisted-extreme multi-label short text classification with co-occurrence information according to claim 1 or 2, characterized in that: the quantization of the last layer of neural network output in the step 8 is realized by SoftMax and normalization operations.
7. The method for multi-task-assisted extreme multi-label short text classification with co-occurrence information according to claim 6, characterized in that: the joint rule is: assume the multi-label text classification task t1 and the extreme multi-label text classification task t2 are correlated to a certain degree; when the probability of t1's maximum-probability neuron is greater than that of t2's maximum-probability neuron, t1 is considered more reliable, so t1 is taken as the reference and t2's decision is adjusted; otherwise, t2 is considered more reliable, so t2 is taken as the reference and t1's decision is adjusted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110101374.2A CN112800222B (en) | 2021-01-26 | 2021-01-26 | Multi-task auxiliary limit multi-label short text classification method using co-occurrence information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110101374.2A CN112800222B (en) | 2021-01-26 | 2021-01-26 | Multi-task auxiliary limit multi-label short text classification method using co-occurrence information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112800222A CN112800222A (en) | 2021-05-14 |
CN112800222B true CN112800222B (en) | 2022-07-19 |
Family
ID=75811747
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110101374.2A Active CN112800222B (en) | 2021-01-26 | 2021-01-26 | Multi-task auxiliary limit multi-label short text classification method using co-occurrence information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112800222B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20240188895A1 (en) * | 2021-07-27 | 2024-06-13 | Boe Technology Group Co., Ltd. | Model training method, signal recognition method, apparatus, computing and processing device, computer program, and computer-readable medium |
CN114490951B (en) * | 2022-04-13 | 2022-07-08 | 长沙市智为信息技术有限公司 | Multi-label text classification method and model |
CN117033641A (en) * | 2023-10-07 | 2023-11-10 | 江苏微皓智能科技有限公司 | Network structure optimization fine tuning method of large-scale pre-training language model |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577405A (en) * | 2012-07-19 | 2014-02-12 | 中国人民大学 | Interest analysis based micro-blogger community classification method |
WO2014047727A1 (en) * | 2012-09-28 | 2014-04-03 | Alkis Papadopoullos | A method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model |
CN104881689A (en) * | 2015-06-17 | 2015-09-02 | 苏州大学张家港工业技术研究院 | Method and system for multi-label active learning classification |
CN110442723A (en) * | 2019-08-14 | 2019-11-12 | 山东大学 | A method of multi-tag text classification is used for based on the Co-Attention model that multistep differentiates |
CN110442707A (en) * | 2019-06-21 | 2019-11-12 | 电子科技大学 | A kind of multi-tag file classification method based on seq2seq |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577549B (en) * | 2013-10-16 | 2017-02-15 | 复旦大学 | Crowd portrayal system and method based on microblog label |
CN109684478B (en) * | 2018-12-18 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Classification model training method, classification device, classification equipment and medium |
CN111553442B (en) * | 2020-05-12 | 2024-03-12 | 国网智能电网研究院有限公司 | Optimization method and system for classifier chain tag sequence |
CN111709475B (en) * | 2020-06-16 | 2024-03-15 | 全球能源互联网研究院有限公司 | N-gram-based multi-label classification method and device |
CN112199536A (en) * | 2020-10-15 | 2021-01-08 | 华中科技大学 | Cross-modality-based rapid multi-label image classification method and system |
- 2021
- 2021-01-26 CN CN202110101374.2A patent/CN112800222B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577405A (en) * | 2012-07-19 | 2014-02-12 | 中国人民大学 | Interest analysis based micro-blogger community classification method |
WO2014047727A1 (en) * | 2012-09-28 | 2014-04-03 | Alkis Papadopoullos | A method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model |
CN104881689A (en) * | 2015-06-17 | 2015-09-02 | 苏州大学张家港工业技术研究院 | Method and system for multi-label active learning classification |
CN110442707A (en) * | 2019-06-21 | 2019-11-12 | 电子科技大学 | A kind of multi-tag file classification method based on seq2seq |
CN110442723A (en) * | 2019-08-14 | 2019-11-12 | 山东大学 | A method of multi-tag text classification is used for based on the Co-Attention model that multistep differentiates |
Also Published As
Publication number | Publication date |
---|---|
CN112800222A (en) | 2021-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112800222B (en) | Multi-task auxiliary limit multi-label short text classification method using co-occurrence information | |
CN110569508A (en) | Method and system for classifying emotional tendencies by fusing part-of-speech and self-attention mechanism | |
CN107943967A (en) | Algorithm of documents categorization based on multi-angle convolutional neural networks and Recognition with Recurrent Neural Network | |
CN110807328A (en) | Named entity identification method and system oriented to multi-strategy fusion of legal documents | |
CN112163089B (en) | High-technology text classification method and system integrating named entity recognition | |
CN110874411A (en) | Cross-domain emotion classification system based on attention mechanism fusion | |
CN112395417A (en) | Network public opinion evolution simulation method and system based on deep learning | |
CN113051914A (en) | Enterprise hidden label extraction method and device based on multi-feature dynamic portrait | |
CN111581967A (en) | News theme event detection method combining LW2V and triple network | |
US20230289528A1 (en) | Method for constructing sentiment classification model based on metaphor identification | |
CN114239574A (en) | Miner violation knowledge extraction method based on entity and relationship joint learning | |
CN113673254A (en) | Knowledge distillation position detection method based on similarity maintenance | |
CN113869055A (en) | Power grid project characteristic attribute identification method based on deep learning | |
CN114444481B (en) | Sentiment analysis and generation method of news comment | |
CN116663540A (en) | Financial event extraction method based on small sample | |
CN111813939A (en) | Text classification method based on representation enhancement and fusion | |
CN114841151A (en) | Medical text entity relation joint extraction method based on decomposition-recombination strategy | |
CN115062727A (en) | Graph node classification method and system based on multi-order hypergraph convolutional network | |
CN114048314A (en) | Natural language steganalysis method | |
CN111709231B (en) | Class case recommendation method based on self-attention variational self-coding | |
CN113297374A (en) | Text classification method based on BERT and word feature fusion | |
CN117350286A (en) | Natural language intention translation method oriented to intention driving data link network | |
CN112612884A (en) | Entity label automatic labeling method based on public text | |
CN115631504B (en) | Emotion identification method based on bimodal graph network information bottleneck | |
CN116304064A (en) | Text classification method based on extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||