CN112966103B - Mixed attention mechanism text title matching method based on multi-task learning


Info

Publication number: CN112966103B
Application number: CN202110190612.1A
Authority: CN (China)
Prior art keywords: text, matrix, data, title, model
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN112966103A
Inventors: 王维宽, 冯翱, 宋馨宇, 张学磊, 张举, 蔡佳志
Current and original assignee: Chengdu University of Information Technology

Application filed by Chengdu University of Information Technology
Priority to CN202110190612.1A
Publication of CN112966103A
Application granted
Publication of CN112966103B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention relates to a mixed-attention text title matching method based on multi-task learning. Multi-task learning is realized by simultaneously performing, on each input text, classification task 1 (predicting the text's original category) and classification task 2 (deciding whether the text is a "headline party", i.e. clickbait, article). The two tasks are trained jointly through a multi-task learning model, one task assisting the other in learning better parameters: back-propagation from classification task 1 adjusts the model parameters so that classification task 2 achieves better performance. The attention mechanism used by the method computes the degree of association between each element and every other element in a single step, with little computation and high efficiency.

Description

Mixed attention mechanism text title matching method based on multi-task learning
Technical Field
The invention relates to the field of text processing, in particular to a text title matching method based on a mixed attention mechanism of multi-task learning.
Background
In the Internet era, "headline party" (clickbait) texts remain prevalent because of the real benefits brought by traffic accumulation and traffic incentives, and they degrade the browsing experience of network users. Platforms where clickbait is rampant lose users, which harms their sustainable development.
"Headline party" is a general term for website editors, reporters, moderators, and netizens who, for purposes such as raising click-through rates or gaining fame, craft eye-catching titles on forums or Internet media to attract readers, only for readers to discover after clicking that the content diverges widely from the title. In short, the typical "headline party" behavior is a strongly exaggerated post title whose body is completely unrelated, or only weakly related, to the title.
Network users can report such worthless junk text after reading it, in the hope that platform moderators will take the article down. Web text publishing platforms usually perform some text screening at publication time, but the detection mechanism is loose. To preserve user stickiness and a smooth publishing experience, and to avoid the annoyance and churn caused by frequent publication failures, platforms currently check strictly only for sensitive words: the text is matched against every word in a sensitive-word dictionary, and publication fails only when such words are found.
Back-end auditing is another means by which platforms improve text quality. A platform randomly samples a portion of successfully published texts for manual review, and texts with a high number of reports may also be included in the review scope. But again out of concern for user stickiness and user experience, back-end staff will not lightly delete a user's text unless the article contains particularly obvious violations. At present, "headline party" detection relies mainly on manual inspection and lacks systematic algorithm and model support.
Manual detection: a network user or platform auditor finds the article manually and deletes it. Because subjective awareness, reading habits, and interests differ, manual judgment standards are not uniform; users' habits in using the software also differ, so a report is not guaranteed. Given the massive volume of network text, platform reporting mechanisms set a report-count threshold, and an article is automatically forwarded to back-end review only when its report count exceeds that threshold.
Simple, loose machine detection: texts are matched against words pre-stored in a dictionary, and publication is rejected on a match. Such a loose detection mechanism cannot effectively detect "headline party" text: most clickbait texts are compliant at the word level, and a sensitive-word dictionary contains only a limited number of words, so in practice this method does not fit the application scenario.
In view of the above, an effective "headline party" text detection method is needed to filter clickbait articles on the network, improve the quality of Internet text, improve the experience of network users, and allow text push platforms of all kinds to develop in a sustained and healthy way.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a mixed attention mechanism text title matching method based on multi-task learning, comprising the following steps:
Step 1: crawling different categories of "headline party" text data and normal text data to form a data set;
Step 2: cleaning the data set, removing interference characters such as web-page tags and network emoticons;
Step 3: labeling the title and the body of each text in the data set to generate classification data, where the labels comprise classification task 1 and classification task 2: classification task 1 labels the title with the data's original category at crawling time, and classification task 2 labels the body as "headline party" text or not;
Step 4: performing word segmentation on the titles and bodies of the classification data obtained in step 3 to obtain text word sequences;
Step 5: processing each text word sequence to a preset fixed length, padding shorter sequences with 0 and truncating longer ones;
Step 6: randomly shuffling the labeled text data so that "headline party" texts and normal texts are fully mixed;
Step 7: dividing the mixed data set into batches of a given batch size;
Step 8: inputting the batch data into the constructed text detection model for training, specifically comprising the following steps:
Step 8.1: inputting the title and the body of the batch data into the same BERT model, obtaining word embedding matrices of the body and the title, T = {t_1, t_2, t_3, …, t_n} and C = {c_1, c_2, c_3, …, c_m}, where T ∈ R^{n×300}, n is the word-sequence length of the body, C ∈ R^{m×300}, m is the word-sequence length of the title, and 300 is the word-vector dimension encoded by the standard BERT model; and simultaneously obtaining the first outputs of the BERT model, C_0 ∈ R^{1×300} for the title and T_0 ∈ R^{1×300} for the body;
Step 8.2: randomly initializing a shared parameter matrix W ∈ R^{300×n} and performing a matrix transformation to obtain a feature matrix M ∈ R^{m×300} mixing body and title information; the matrix transformation is:
M_{m×300} = C_{m×300} × W_{300×n} × T_{n×300}
Step 8.3: performing a matrix transformation on C_0 and T_0 (an outer product of the two global vectors) to obtain a feature matrix F ∈ R^{300×300}; the matrix transformation is:
F_{300×300} = C_0^T × T_0
Step 8.4: taking M as Q and V and F as K, computing a mixed attention matrix A ∈ R^{m×300} as:
A = softmax(Q × K^T / √d_k) × V
where d_k is the second dimension of K;
Step 8.5: fully connecting the mixed attention matrix A to obtain a dimension-reduction matrix D ∈ R^{m×n};
Step 8.6: randomly initializing a first weight matrix W_1 and fully connecting the dimension-reduction matrix D to obtain a one-dimensional matrix, then computing softmax as the output of classification task 1 of step 3; the dimension is R^{1×j}, where j is the number of original categories of the data;
Step 8.7: randomly initializing a second weight matrix W_2 and fully connecting the dimension-reduction matrix D to obtain a one-dimensional matrix, then computing softmax as the output of classification task 2 of step 3; the dimension is R^{1×2}, the two dimensions representing the probability of being or not being a "headline party" article;
Step 8.8: taking the maximum value in the results of step 8.6 and step 8.7 as p_i for the corresponding task, computing the cross entropy of each task, then summing and averaging; the mathematical expression is:
loss = -(1/n) Σ_{i=1}^{n} y_i · log(p_i)
where n is the batch size, y_i is the true label of the i-th piece of data, and p_i is the maximum probability computed by the model for the label of the i-th piece of data;
Step 8.9: back-propagating the result of step 8.8 as the error, for training the model parameters;
Step 8.10: setting an end condition; if the end condition is not met, repeating steps 8.1 to 8.9 until it is met, then stopping training the model.
According to a preferred embodiment, the method further includes testing the trained text detection model, specifically including:
Step 9: for the trained model, executing step 1 to step 8.7, taking the index of the maximum value in the output of classification task 2 in step 8.7 as the final result, and no longer executing steps 8.8 to 8.10.
The invention has the beneficial effects that:
1. The invention provides a mixed attention strategy based on multi-task learning that extracts key information from the body text and matches it against the title, thereby detecting "headline party" articles and significantly improving detection precision and accuracy.
2. An RNN model forgets early input information as the input sequence grows, its computation cost is large, and, being a sequential model, it cannot be parallelized; by contrast, the attention mechanism used by this method computes the degree of association between each element and every other element in a single step, with little computation and high efficiency.
3. Text similarity computed from text embedding vectors can be disturbed by a large amount of noise; the detection method of the invention avoids the influence of such noise, adapts to the changing naming strategies of "headline party" articles, breaks through the limitations of traditional similarity computation, achieves efficient computation and accurate classification, and eliminates a large amount of manual work.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The following detailed description is made with reference to the accompanying drawings.
For a long body and a short title, analyzed purely at the character level, most words of the body are irrelevant to the title, and text similarity computed from text embedding vectors is disturbed by a large amount of noise. This scheme therefore proposes a mixed attention strategy based on multi-task learning that extracts key information from the body and matches it against the title, thereby detecting "headline party" articles. In an RNN-based model, the memory cells forget information input earlier as the input sequence grows, and the computation cost is larger; moreover, an RNN is a sequential model and cannot be computed in parallel. The attention mechanism used by this scheme computes the degree of association between each element and every other element in a single step, with little computation and high efficiency.
The basic idea of the scheme is to identify "headline party" articles through the degree of matching between body and title.
The BERT used in the invention is an auto-encoder whose every output encodes the context of the text, i.e. each element of T and C already contains context information. The first output of BERT encodes full-text information: C_0 encodes the full-text information of the title and T_0 encodes the full-text information of the body. A similar effect could be achieved with an RNN-based bidirectional sequential model, but an RNN-based model requires the word embedding matrix to be obtained beforehand, whereas BERT automatically encodes a directly input text sequence.
The mixed attention strategy mixes the word embedding matrix of the body with the word embedding matrix of the title to compute two different feature matrices, serving respectively as Q and V, and as K, from which a mixed attention matrix is computed.
By comparing each element with every other element, the attention mechanism can flexibly capture global and local relations, and it does so in a single step; that is, it can pick out the emphasis within a long text sequence. Through training, the mixed attention strategy can therefore focus attention on the body content that matches the title.
The multi-task learning of the model is embodied in simultaneously performing, on the input text, classification learning of the original text category (classification task 1) and classification learning of whether the text is a "headline party" article (classification task 2). The model is jointly trained through multi-task learning, one task assisting the other in learning better parameters. The scheme adjusts the model parameters through back-propagation of classification task 1 so that classification task 2 achieves better performance. The embodiments are described in detail below with reference to the accompanying drawings.
Step 1: crawling different categories of "headline party" text data and normal text data to form a data set.
Step 2: cleaning the data set, removing interference characters such as web-page tags and network emoticons.
Step 3: labeling the title and the body of each text in the data set to generate classification data, where classification task 1 labels the title with the data's original category at crawling time, and classification task 2 labels the body as "headline party" text or not.
Step 4: performing word segmentation on the titles and bodies of the classification data obtained in step 3 to obtain text word sequences.
Step 5: processing each text word sequence to a preset fixed length, padding shorter sequences with 0 and truncating longer ones.
Step 6: randomly shuffling the labeled text data so that "headline party" texts and normal texts are fully mixed.
Step 7: dividing the mixed data set into batches of a given batch size; steps 2 to 7 are sketched below.
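As an illustration of steps 2 to 7, the following is a minimal Python sketch (the jieba segmenter, the cleaning patterns, the fixed length of 128, and the batch size of 32 are assumptions for illustration, not values fixed by the invention):

    import random
    import re

    import jieba  # assumed Chinese word segmenter

    FIXED_LEN = 128   # assumed preset fixed length (step 5)
    BATCH_SIZE = 32   # assumed batch size (step 7)

    def clean(text):
        # Step 2: remove web-page tags and emoticon-style interference characters.
        text = re.sub(r"<[^>]+>", "", text)         # web-page (HTML) tags
        text = re.sub(r"\[[^\]]{1,8}\]", "", text)  # bracketed emoticons such as "[doge]"
        return text

    def to_fixed_length(tokens, length=FIXED_LEN):
        # Step 5: pad shorter sequences with 0, truncate longer ones.
        if len(tokens) < length:
            return tokens + [0] * (length - len(tokens))
        return tokens[:length]

    def make_batches(samples, batch_size=BATCH_SIZE):
        # Step 6: shuffle so "headline party" and normal texts are fully mixed;
        # Step 7: split the mixed data set into batch-sized chunks.
        random.shuffle(samples)
        return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

    # Step 4: word segmentation of a cleaned title or body, then step 5.
    tokens = to_fixed_length(jieba.lcut(clean("<p>一个示例正文……</p>")))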
Step 8: inputting the batch data into the constructed text detection model for training, specifically comprising the following steps:
Step 8.1: inputting the title and the body of the batch data into the same BERT model, obtaining word embedding matrices of the body and the title, T = {t_1, t_2, t_3, …, t_n} and C = {c_1, c_2, c_3, …, c_m}, where T ∈ R^{n×300}, n is the word-sequence length of the body, C ∈ R^{m×300}, m is the word-sequence length of the title, and 300 is the word-vector dimension encoded by the standard BERT model; at the same time obtaining the first outputs of the BERT model, C_0 and T_0: by default C_0 encodes the global information of the title and T_0 encodes the global information of the body.
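A minimal sketch of step 8.1 using the Hugging Face transformers library (an assumed tool; note that the public bert-base-chinese checkpoint has hidden size 768, whereas the invention states a 300-dimensional word vector, so a 300-dimensional encoder or an added projection is assumed):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    def encode(text, max_len):
        enc = tokenizer(text, return_tensors="pt", padding="max_length",
                        truncation=True, max_length=max_len)
        out = bert(**enc)
        hidden = out.last_hidden_state[0]  # (max_len, hidden): word embedding matrix
        first = hidden[0:1]                # (1, hidden): first output, global information
        return hidden, first

    C, C0 = encode("一个吸引眼球的标题", max_len=32)        # title matrix C, global vector C0
    T, T0 = encode("与标题关系不大的正文……", max_len=256)  # body matrix T, global vector T0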
Step 8.2: randomly initializing a shared parameter matrix W ∈ R300×nPerforming matrix transformation to obtain a feature matrix M epsilon R mixing text and title informationm×300The mathematical expression of the matrix transformation is as follows:
Mm×300=Cm×300×W300×n×Tn×300
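Step 8.2 amounts to a single bilinear mixing of the two embedding matrices through the shared matrix W; a sketch in PyTorch (the lengths m = 32 and n = 256 are assumptions):

    import torch

    d, n, m = 300, 256, 32                            # stated vector dim; assumed lengths
    W = torch.nn.Parameter(torch.randn(d, n) * 0.02)  # randomly initialized shared matrix

    C = torch.randn(m, d)  # title word embedding matrix
    T = torch.randn(n, d)  # body word embedding matrix
    M = C @ W @ T          # (m,d) @ (d,n) @ (n,d) -> (m,d) mixed feature matrix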
Step 8.3: performing a matrix transformation on C_0 and T_0 (an outer product of the two global vectors) to obtain a feature matrix F ∈ R^{300×300}; the matrix transformation is:
F_{300×300} = C_0^T × T_0
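The original expression for step 8.3 is rendered as an image; dimensionally, the only product of two 1×300 global vectors that yields a 300×300 matrix is an outer product, so the sketch below assumes F = C_0^T × T_0 (the operand order is an assumption):

    import torch

    d = 300
    C0 = torch.randn(1, d)       # title global vector (first BERT output)
    T0 = torch.randn(1, d)       # body global vector
    F = C0.transpose(0, 1) @ T0  # (d,1) @ (1,d) -> (d,d) mixed global feature matrix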
Step 8.4: taking M as Q and V and F as K, computing a mixed attention matrix A ∈ R^{m×300} as:
A = softmax(Q × K^T / √d_k) × V
where d_k is the second dimension of K. The attention mechanism consists of three parts: query (Q), key (K), and value (V).
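Step 8.4 follows the standard scaled dot-product attention form with Q = V = M and K = F. Since the original formula is rendered as an image, the sketch below makes one dimensional assumption: with Q, V ∈ R^{m×300} and K ∈ R^{300×300}, the softmaxed scores are combined with V element-wise, which yields the stated shape A ∈ R^{m×300}:

    import math

    import torch

    def mixed_attention(M, F):
        # Q = V = M with shape (m,d); K = F with shape (d,d); d_k = second dim of K.
        d_k = F.shape[1]
        scores = (M @ F.transpose(0, 1)) / math.sqrt(d_k)  # (m,d) association scores
        weights = torch.softmax(scores, dim=-1)            # computed in one step, in parallel
        return weights * M                                 # (m,d) mixed attention matrix A

    A = mixed_attention(torch.randn(32, 300), torch.randn(300, 300))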
Step 8.5: fully connecting the mixed attention matrix A to obtain a dimension-reduction matrix D ∈ R^{m×n}.
Step 8.6: randomly initializing a first weight matrix W_1 and fully connecting the dimension-reduction matrix D to obtain a one-dimensional matrix, then computing softmax as the output of classification task 1 of step 3; the dimension is R^{1×j}, where j is the number of original categories of the data.
Step 8.7: randomly initializing a second weight matrix W_2 and fully connecting the dimension-reduction matrix D to obtain a one-dimensional matrix, then computing softmax as the output of classification task 2 of step 3; the dimension is R^{1×2}, the two dimensions representing the probability of being or not being a "headline party" article.
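Steps 8.5 to 8.7 are three fully connected layers followed by softmax; a sketch with assumed shapes (the flattening of D before the task-specific layers and the category count j = 10 are assumptions):

    import torch
    import torch.nn as nn

    class TwoTaskHead(nn.Module):
        def __init__(self, m=32, n=256, d=300, j=10):
            super().__init__()
            self.reduce = nn.Linear(d, n)     # step 8.5: A (m,d) -> D (m,n)
            self.task1 = nn.Linear(m * n, j)  # step 8.6: first weight matrix W1
            self.task2 = nn.Linear(m * n, 2)  # step 8.7: second weight matrix W2

        def forward(self, A):
            D = self.reduce(A)                            # dimension-reduction matrix
            flat = D.reshape(-1)                          # one-dimensional matrix
            p1 = torch.softmax(self.task1(flat), dim=-1)  # j-way output of task 1
            p2 = torch.softmax(self.task2(flat), dim=-1)  # is / is not "headline party"
            return p1, p2

    p1, p2 = TwoTaskHead()(torch.randn(32, 300))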
Step 8.8: taking the maximum value in the results of step 8.6 and step 8.7 as p_i for the corresponding task, computing the cross entropy of each task, then summing and averaging; the mathematical expression is:
loss = -(1/n) Σ_{i=1}^{n} y_i · log(p_i)
where n is the batch size, y_i is the true label of the i-th piece of data, and p_i is the maximum probability computed by the model for the label of the i-th piece of data.
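A sketch of the joint loss of step 8.8. The text pairs the true label y_i with the model's maximum probability; the conventional reading of the cross-entropy formula, adopted below, takes p_i as the probability the model assigns to the true label of the i-th piece of data:

    import torch

    def joint_loss(p1, p2, y1, y2):
        # p1: (n, j) and p2: (n, 2) softmax outputs for a batch of size n;
        # y1, y2: (n,) integer labels for classification tasks 1 and 2.
        def task_ce(p, y):
            p_true = p[torch.arange(p.shape[0]), y]  # p_i for each piece of data
            return -torch.log(p_true).mean()         # -(1/n) * sum_i y_i * log(p_i)
        return (task_ce(p1, y1) + task_ce(p2, y2)) / 2  # sum the two tasks and average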
Step 8.9: back-propagating the result of step 8.8 as the error, for training the model parameters.
Step 8.10: setting an end condition, for example that the result has not improved within 1000 rounds; if the end condition is not reached, repeating steps 8.1 to 8.9 until it is met, then stopping training the model.
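The end condition of step 8.10 can be realized as patience-based early stopping; a sketch (model, batches, and train_step are hypothetical stand-ins for the pipeline of steps 8.1 to 8.8, and the Adam optimizer and learning rate are assumptions; the 1000-round patience follows the example in the text):

    import torch

    def train(model, batches, train_step, patience=1000):
        optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # assumed optimizer
        best, stale = float("inf"), 0
        while stale < patience:                  # step 8.10: end condition not yet met
            for batch in batches:
                loss = train_step(model, batch)  # steps 8.1 to 8.8 on one batch
                optimizer.zero_grad()
                loss.backward()                  # step 8.9: back-propagate the error
                optimizer.step()
                if loss.item() < best:
                    best, stale = loss.item(), 0
                else:
                    stale += 1
                if stale >= patience:            # no improvement within 1000 rounds
                    break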
The method of the invention further includes testing the trained text detection model, specifically including:
Step 9: for the trained model, executing step 1 to step 8.7, taking the index of the maximum value in the output of classification task 2 in step 8.7 as the final result, and no longer executing steps 8.8 to 8.10.
It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.

Claims (2)

1. A text title matching method based on a mixed attention mechanism of multi-task learning, characterized by comprising the following steps:
Step 1: crawling different categories of "headline party" text data and normal text data to form a data set;
Step 2: cleaning the data set, removing interference characters such as web-page tags and network emoticons;
Step 3: labeling the title and the body of each text in the data set to generate classification data, where the labels comprise classification task 1 and classification task 2: classification task 1 labels the title with the data's original category at crawling time, and classification task 2 labels the body as "headline party" text or not;
Step 4: performing word segmentation on the titles and bodies of the classification data obtained in step 3 to obtain text word sequences;
Step 5: processing each text word sequence to a preset fixed length, padding shorter sequences with 0 and truncating longer ones;
Step 6: randomly shuffling the labeled text data so that "headline party" texts and normal texts are fully mixed;
Step 7: dividing the mixed data set into batches of a given batch size;
Step 8: inputting the batch data into the constructed text detection model for training, specifically comprising the following steps:
Step 8.1: inputting the title and the body of the batch data into the same BERT model, obtaining word embedding matrices of the body and the title, T = {t_1, t_2, t_3, …, t_n} and C = {c_1, c_2, c_3, …, c_m}, where T ∈ R^{n×300}, n is the word-sequence length of the body, C ∈ R^{m×300}, m is the word-sequence length of the title, and 300 is the word-vector dimension encoded by the standard BERT model; and simultaneously obtaining the first outputs of the BERT model, C_0 for the title and T_0 for the body;
Step 8.2: randomly initializing a shared parameter matrix W ∈ R^{300×n} and performing a matrix transformation to obtain a feature matrix M ∈ R^{m×300} mixing body and title information; the matrix transformation is:
M_{m×300} = C_{m×300} × W_{300×n} × T_{n×300};
Step 8.3: performing a matrix transformation on C_0 and T_0 (an outer product of the two global vectors) to obtain a feature matrix F ∈ R^{300×300}; the matrix transformation is:
F_{300×300} = C_0^T × T_0;
Step 8.4: taking M as Q and V and F as K, computing a mixed attention matrix A ∈ R^{m×300} as:
A = softmax(Q × K^T / √d_k) × V,
where d_k is the second dimension of K;
Step 8.5: fully connecting the mixed attention matrix A to obtain a dimension-reduction matrix D ∈ R^{m×n};
Step 8.6: randomly initializing a first weight matrix W_1 and fully connecting the dimension-reduction matrix D to obtain a one-dimensional matrix, then computing softmax as the output of classification task 1 of step 3; the dimension is R^{1×j}, where j is the number of original categories of the data;
Step 8.7: randomly initializing a second weight matrix W_2 and fully connecting the dimension-reduction matrix D to obtain a one-dimensional matrix, then computing softmax as the output of classification task 2 of step 3; the dimension is R^{1×2}, the two dimensions representing the probability of being or not being a "headline party" article;
Step 8.8: taking the maximum value in the results of step 8.6 and step 8.7 as p_i for the corresponding task, computing the cross entropy of each task, then summing and averaging; the mathematical expression is:
loss = -(1/n) Σ_{i=1}^{n} y_i · log(p_i),
where n is the batch size, y_i is the true label of the i-th piece of data, and p_i is the maximum probability computed by the model for the label of the i-th piece of data;
Step 8.9: back-propagating the result of step 8.8 as the error, for training the model parameters;
Step 8.10: setting an end condition; if the end condition is not met, repeating steps 8.1 to 8.9 until it is met, then stopping training the model.
2. The text title matching method according to claim 1, characterized in that the method further comprises testing the trained text detection model, specifically comprising:
Step 9: for the trained model, executing step 1 to step 8.7, taking the index of the maximum value in the output of classification task 2 in step 8.7 as the final result, and no longer executing steps 8.8 to 8.10.
CN202110190612.1A (priority and filing date 2021-02-05): Mixed attention mechanism text title matching method based on multi-task learning. Status: Active. Granted publication: CN112966103B (en).

Priority Applications (1)

CN202110190612.1A (priority and filing date 2021-02-05): Mixed attention mechanism text title matching method based on multi-task learning (CN112966103B)

Publications (2)

CN112966103A (en), published 2021-06-15
CN112966103B (en), published 2022-04-19

Family

Family ID: 76285176

Family Applications (1)

CN202110190612.1A (Active): CN112966103B (en), Mixed attention mechanism text title matching method based on multi-task learning

Country Status (1)

CN: CN112966103B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688621B * 2021-09-01 2023-04-07 Sichuan University: Text matching method and device for texts of different lengths at different granularities
CN115357720B * 2022-10-20 2023-05-26 Jinan University: BERT-based multi-task news classification method and device

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491436A * 2017-08-21 2017-12-19 Beijing Baidu Netcom Science and Technology Co., Ltd.: Clickbait title recognition method and device, server, and storage medium
CN108304379A * 2018-01-15 2018-07-20 Tencent Technology (Shenzhen) Co., Ltd.: Article recognition method, device, and storage medium
CN108429920A * 2018-02-06 2018-08-21 Beijing Qihoo Technology Co., Ltd.: Method and apparatus for processing clickbait videos
CN108491389A * 2018-03-23 2018-09-04 Hangzhou Langhe Technology Co., Ltd.: Clickbait title corpus recognition model training method and device
CN109614614A * 2018-12-03 2019-04-12 Focus Technology Co., Ltd.: BiLSTM-CRF product-name recognition method based on self-attention
CN109657055A * 2018-11-09 2019-04-19 Sun Yat-sen University: Clickbait article detection method and federated learning strategy based on a hierarchical hybrid network
CN109753567A * 2019-01-31 2019-05-14 Anhui University: Text classification method combining title and body attention mechanisms
CN110210022A * 2019-05-22 2019-09-06 Beijing Baidu Netcom Science and Technology Co., Ltd.: Header identification method and device
CN110598046A * 2019-09-17 2019-12-20 Tencent Technology (Shenzhen) Co., Ltd.: Artificial-intelligence-based clickbait identification method and related device
CN111199155A * 2018-10-30 2020-05-26 Feihu Information Technology (Tianjin) Co., Ltd.: Text classification method and device
CN111813932A * 2020-06-17 2020-10-23 Beijing Xiaomi Pinecone Electronics Co., Ltd.: Text data processing method, text data classification device, and readable storage medium
CN112287105A * 2020-09-30 2021-01-29 Kunming University of Science and Technology: Method for analyzing the relevance of law-related news by fusing bidirectional mutual attention between title and body

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9881059B2 (en) * 2014-08-08 2018-01-30 Yahoo Holdings, Inc. Systems and methods for suggesting headlines


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Feng Ao et al., "Modeling Multi-Targets Sentiment Classification via Graph Convolutional Networks and Auxiliary Relation", Computers, Materials & Continua, 2020-12-31, pp. 909-923 *
Zhang Xiaochun, "Identifying Clickbait Headlines in Online News" (识别网络新闻标题党), Literature Education (Part I), 2018-02-05 (No. 02), pp. 164-165 *

Also Published As

CN112966103A (en), published 2021-06-15


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant