CN114429143A - Cross-language attribute level emotion classification method based on enhanced distillation - Google Patents

Cross-language attribute level emotion classification method based on enhanced distillation

Info

Publication number
CN114429143A
Authority
CN
China
Prior art keywords
attribute
sequence
target
network
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210044125.9A
Other languages
Chinese (zh)
Inventor
吴含前
王志可
吴国威
李露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202210044125.9A
Publication of CN114429143A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/44 - Statistical methods, e.g. probability models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a cross-language attribute-level emotion classification method based on enhanced distillation. A teacher network is trained on source-language corpora, and the attribute emotion information in those corpora is migrated to a target classifier through a knowledge distillation framework. A sequence selector selects attribute-emotion-related information from the target translation sentence sequence according to a specific attribute and provides a denoised sentence sequence representation for the target classifier. A target classifier based on cross-language distillation is constructed with a self-attention layer and models fine-grained interaction between the attribute sequence and the denoised target translation sentence sequence. The method alleviates the domain deviation between the translated corpus and the real corpus, and the target classifier has stronger generalization capability. It can fully utilize the effective attribute emotion information in both the source language and the target translation, while better modeling fine-grained interaction between sentences and attributes. Compared with baseline methods, the emotion classification performance of the method improves to a certain extent.

Description

Cross-language attribute level emotion classification method based on enhanced distillation
Technical Field
The invention relates to a cross-language attribute-level emotion classification method based on enhanced distillation, and belongs to the technical field of cross-language attribute-level emotion classification in semantic recognition.
Background
Most current machine translation tools are trained on parallel corpora from the general domain, so translation noise words are inevitably introduced when highly domain-specific texts are translated; such noise, including abnormal words and translation ambiguities, tends to degrade the quality of the translated corpus. Meanwhile, the exposure-bias problem in the decoding process of translation models produces obvious expression differences between translated language and natural language. These problems are the main causes of the domain gap between the translated corpus and the real corpus.
Unlike traditional cross-language text classification or cross-language sentence-level emotion classification, attribute-level emotion classification does not rely on the overall semantics of the sentence but on a set of words or phrases related to the attribute, which requires modeling fine-grained interactions between the attribute and the sentence. Noise words in machine translation output increase the difficulty of this fine-grained modeling: attention mechanisms such as self-attention are "soft" information selection mechanisms, and redundant noise words scatter the attention weights or even cause them to be assigned incorrectly. On the other hand, since attribute-level emotion classification depends only on the attribute-related words and not on the overall sentence semantics, the attention weights need to concentrate only on the emotion keywords that determine the attribute's sentiment. In addition, attribute emotion polarity depends on the semantics of the descriptive language related to the attribute rather than on the specific linguistic form, so training a model only on the translated target-language corpus while ignoring the rich attribute emotion information in the source-language corpus wastes resources. Intuitively, as a resource-rich and high-quality corpus, the source-language corpus carries attribute emotion information that is more accurate and reliable than the target translation corpus; if this information can assist the training of the target-language classifier, the performance of cross-language attribute-level emotion classification can be further improved.
Disclosure of Invention
To solve these problems, the invention provides a cross-language attribute-level emotion classification method based on enhanced distillation: combining the characteristics of cross-language attribute-level emotion classification, and addressing the problems of corpus migration methods based on machine translation, it proposes a reinforced self-attention mechanism for modeling attribute-level representations.
In order to achieve the purpose, the invention provides the following technical scheme:
a cross-language attribute level emotion classification method based on enhanced distillation comprises the following steps:
Step one, training a teacher network on source-language corpora, and migrating the attribute emotion information in those corpora to the target classifier of the student network through a knowledge distillation framework;
step two, using an attribute-sensitive sequence selector to select attribute-emotion-related information from the target translation sentence sequence according to a specific attribute and, as an intermediate module of the model, to provide a denoised sentence sequence representation for the target classifier;
and step three, constructing a target classifier based on cross-language distillation with a self-attention layer, and modeling fine-grained interaction between the attribute sequence and the denoised target translation sentence sequence.
Further, the sequence selector uses an LSTM network to model the policy network p_π and learns an optimal policy π(a_{1:n}) with a policy-gradient algorithm; the policy network p_π learns the optimal policy by defining a reward and decides with probability p_π(a_i | s_i; θ_r) whether to select x_i.
Further, the state, action and reward of the policy network are defined as follows:
state: the state of the ith time step is defined as si(ii) a Depending on the given attributes, the state needs to provide enough information to decide whether to select x or notiThus state siThe device consists of the following three parts:
Figure BDA0003471461210000021
wherein h isiIs a hidden state representation of the i-th time step of LSTM, viIs the ith word xiVector representation of vAIs an attribute vector representation;
the actions: policy network pπWith probability pπ(ai|si;θr) Performing action aiE {0,1}, and this probability is calculated using a logistic function:
a=[a1,a2,...,an]~pπ(A|S;θr)
wherein theta isrFor strategic network parameters, represent a sampling operation, S represents a state, A represents an action, wrAnd brRepresenting trainable parameters;
return: defining an attribute-sensitive reward R that integrates attribute emotion classification loss and cross-language distillation loss for a training sample<xs,xt,y>The payback is defined as follows:
Figure BDA0003471461210000022
wherein theta issrcAs a teacher's network parameter, θtgtFor student network parameters, γ N'/N is a penalty term to prevent overfitting.
Further, the second step and the third step specifically include the following processes:
for target translated sentence representation
Figure BDA0003471461210000023
Attribute representation
Figure BDA0003471461210000024
Obtaining denoised sentence representation H by a sequence selectorDNamely:
a=[a1,a2,…,aN]=RATS(HS,υA)
HD=HS~a
wherein RATS represents a sequence selector, generates an action sequence a, and represents a sequence from HSIn all the positions aiExtracting and splicing the vectors of 1 into a new sentence sequence representation;
then, fine-grained interaction between the attributes and the denoised sentence representation is modeled by means of the self-attention layer in a cross-language distillation-based target classifier:
H=SelfAttention(HA,HD)
finally, the average pooling layer and the full-link layer are used to calculate the non-normalized probability for each class, i.e., q ═ q1,q2,...,qK]Wherein K represents the number of categories; the probabilities are normalized by softening the softmax layer:
Figure BDA0003471461210000031
where T denotes temperature, and degrades to a softmax function when T is 1.
Further, the policy network in the sequence selector is optimized with the policy-gradient-based REINFORCE algorithm; the optimization objective for the parameters θ_r is to maximize the expected reward
J(θ_r) = E_{a ~ p_π(A|S; θ_r)}[R]
The policy gradient with respect to the parameters θ_r is defined as follows:
∇_{θ_r} J(θ_r) = (1/D) Σ_{i=1}^{D} Σ_{t=1}^{N} R^{(i)} ∇_{θ_r} log p_π(a_t^{(i)} | s_t^{(i)}; θ_r)
where D denotes the dataset size, N denotes the sentence sequence length, a_t^{(i)} denotes the action of the i-th sample at the t-th time step, and s_t^{(i)} denotes the state of the i-th sample at the t-th time step.
Further, the parameters θ_tgt of the target classifier are optimized with the back-propagation algorithm, seeking the parameters θ_tgt that minimize the attribute emotion classification loss of the target classifier:
L(θ_tgt) = L_CE(x_t, y; θ_tgt) + L_KD(x_s, x_t; θ_src, θ_tgt)
where <x_s, x_t> denote a source-language training sample and its target translation, θ_src denotes the teacher network parameters, and θ_tgt denotes the student network parameters; the teacher network parameters are frozen during training under the knowledge distillation framework.
Further, at the beginning of model training, θ_r does not participate in the training process; once the loss of the parameters θ_tgt on the development set begins to converge, θ_r and θ_tgt are trained together.
Furthermore, the teacher network is trained with the official Google BERT-Base model, and the target classifier of the student network uses a multi-head self-attention layer to model the interaction between the attribute and the denoised sentence sequence and is composed of 3 Transformer encoder sub-modules. The maximum sentence sequence length is set to 60, the maximum attribute sequence length to 5, and the word vector dimension to 768; the sentence sequence and the attribute sequence share the target-language encoder.
Further, the model is optimized with an Adam optimizer; the initial learning rate is set to 1e-5 for training the student network, the knowledge distillation temperature T is set to 3, and the penalty parameter γ in the reward is set to 1e-5. In addition, the training batch size is 32 and the number of training iterations is 10 epochs; to reduce the influence of overfitting, the neuron dropout rate is set to 0.3.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method denoises the target translation sentence with the sequence selector: only emotion keywords are selected for subsequent fine-grained interaction modeling with the attribute, while emotion-irrelevant words and noise words are discarded. This denoising reduces the difficulty of modeling attribute-level interaction and makes the subsequent soft attention weight distribution more concentrated; at the same time, it alleviates the domain deviation between the translated corpus and the real corpus.
2. The target classifier is trained under a knowledge distillation framework; once training is complete, it no longer depends on a machine translation tool, so the attribute emotion information in the source-language corpus can be fully utilized and the target classifier has stronger generalization capability.
3. The model can fully utilize the effective attribute emotion information in both the source language and the target translation, while better modeling fine-grained interaction between sentences and attributes.
4. Experiments show that the emotion classification performance of the method improves over the baseline methods in every respect.
Drawings
FIG. 1 is a diagram of emotion keywords for attribute level emotion classification according to the present invention;
FIG. 2 is a diagram of an implementation architecture of the present invention;
FIG. 3 is a schematic diagram of a model sequence selector of the present invention;
FIG. 4 is a schematic diagram of the model object classifier of the present invention.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
The invention provides a cross-language attribute-level emotion classification method based on enhanced distillation, which comprises the following steps:
Step one, training a teacher network on source-language corpora, and migrating the attribute emotion information in those corpora to the target classifier of the student network through a knowledge distillation framework; in this way, the attribute emotion information in the source-language corpus can be fully utilized, and the target classifier has stronger generalization capability.
Step two, using an attribute-sensitive sequence selector to select attribute-emotion-related information from the target translation sentence sequence according to a specific attribute and, as an intermediate module of the model, to provide a denoised sentence sequence representation for the target classifier; the denoising reduces the difficulty of modeling attribute-level interaction while alleviating the domain-gap problem.
And step three, constructing a target classifier based on cross-language distillation with a self-attention layer, and modeling fine-grained interaction between the attribute sequence and the denoised target translation sentence sequence.
FIG. 1 shows a schematic diagram of the emotion keywords in attribute-level emotion classification. The overall architecture implementing the invention, shown in FIG. 2, includes a teacher network and a student network. The teacher network is trained with the official Google BERT-Base model; the target classifier of the student network uses a multi-head self-attention layer to model the interaction between the attribute and the denoised sentence sequence and is composed of 3 Transformer encoder sub-modules. The maximum sentence sequence length is set to 60, the maximum attribute sequence length to 5, and the word vector dimension to 768; the sentence sequence and the attribute sequence share the target-language encoder. The invention centers on two core modules: (1) an attribute-sensitive sequence selector; (2) a target classifier based on cross-language distillation. A configuration sketch for these settings follows.
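As an illustration only, the architecture hyperparameters above can be gathered into a configuration object; the following minimal Python sketch does so. All names are assumptions (the patent discloses no implementation), and the numbers of sentiment classes and attention heads are likewise assumed.

from dataclasses import dataclass

@dataclass
class ReKDConfig:
    # Illustrative hyperparameters matching the architecture described above.
    max_sentence_len: int = 60    # maximum target-translation sentence length
    max_attribute_len: int = 5    # maximum attribute sequence length
    hidden_dim: int = 768         # word-vector dimension, matching BERT-Base
    num_student_layers: int = 3   # Transformer encoder sub-modules in the student
    num_heads: int = 12           # self-attention heads (assumed, as in BERT-Base)
    num_classes: int = 3          # sentiment polarities (assumed)
    temperature: float = 3.0      # knowledge-distillation temperature T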
The invention provides a sequence selector based on a reinforced attention mechanism: it selects attribute-emotion-related information from the target translation sentence sequence according to the specific attribute and, as an intermediate module of the model, provides a denoised sentence sequence representation for the target classifier. The sequence selector is trained with a reinforcement learning algorithm, namely the policy gradient, so that the selector can learn an optimal selection policy. The architecture of the sequence selector is shown in FIG. 3.
To fully account for the context information and the history of previous selections, the sequence selector models the policy network p_π with an LSTM network and learns an optimal policy π(a_{1:n}) with a policy-gradient algorithm. The policy network p_π learns the optimal policy by defining a reward and decides with probability p_π(a_i | s_i; θ_r) whether to select x_i. Here, the state, action and reward are defined as follows:
State: the state at the i-th time step is defined as s_i. Given the attribute, the state must provide enough information to decide whether to select x_i; thus s_i consists of the following three parts:
s_i = [h_i; v_i; v_A]
where h_i is the hidden-state representation of the LSTM at the i-th time step, v_i is the vector representation of the i-th word x_i, and v_A is the attribute vector representation.
Action: the policy network p_π performs action a_i ∈ {0,1} with probability p_π(a_i | s_i; θ_r), and this probability is calculated with a logistic function:
p_π(a_i | s_i; θ_r) = σ(w_r · s_i + b_r)
a = [a_1, a_2, …, a_n] ~ p_π(A | S; θ_r)
where θ_r denotes the policy network parameters, ~ denotes a sampling operation, S denotes states, A denotes actions, and w_r and b_r are trainable parameters.
Reward: to encourage the policy network p_π to make correct decisions, an attribute-sensitive reward R is defined that integrates the attribute emotion classification loss and the cross-language distillation loss. Specifically, for a training sample <x_s, x_t, y>, the reward is defined as follows:
R = -(L_CE(x_t, y; θ_tgt) + L_KD(x_s, x_t; θ_src, θ_tgt)) - γ·N'/N
where θ_src denotes the teacher network parameters, θ_tgt denotes the student network parameters, and γ·N'/N is a penalty term to prevent overfitting.
Given a target translation sentence sequence with an attribute label, s_tgt = x_1, x_2, …, x_n, the sequence selector generates an equal-length binary sequence a = a_1, a_2, …, a_n, where a_i = 1 means x_i is selected and a_i = 0 means x_i is discarded. In this way, the sequence selector plays the role of a hard attention mechanism: according to the specific attribute words, it selects from the translated sentence the group of emotion keywords that determine the attribute's sentiment polarity for attribute emotion classification, and discards the other emotion-irrelevant words and noise words. This reduces the difficulty of modeling fine-grained attribute-level representations and makes the subsequent soft attention weight distribution more concentrated. Discarding the emotion-irrelevant words and translation noise words also alleviates, to a certain extent, the domain deviation between the translated corpus and the real corpus. A minimal sketch of such a selector appears below.
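The sketch is written in Python with PyTorch as an assumed implementation framework; class and variable names are illustrative, not taken from the patent, and the state construction follows the definition s_i = [h_i; v_i; v_A] given above.

import torch
import torch.nn as nn

class SequenceSelector(nn.Module):
    # Hard-attention selector: walks the sentence with an LSTM and samples
    # a binary action a_i for each word from a logistic policy head.
    def __init__(self, dim: int):
        super().__init__()
        self.lstm = nn.LSTMCell(dim, dim)
        self.scorer = nn.Linear(3 * dim, 1)  # plays the role of w_r and b_r

    def forward(self, H_s: torch.Tensor, v_a: torch.Tensor):
        # H_s: (N, d) word vectors of one translated sentence; v_a: (d,) attribute vector.
        h = H_s.new_zeros(1, H_s.size(1))
        c = H_s.new_zeros(1, H_s.size(1))
        actions, log_probs = [], []
        for v_i in H_s:                                   # one time step per word
            h, c = self.lstm(v_i.unsqueeze(0), (h, c))
            s_i = torch.cat([h.squeeze(0), v_i, v_a])     # state s_i = [h_i; v_i; v_A]
            p_keep = torch.sigmoid(self.scorer(s_i))      # logistic function
            a_i = torch.bernoulli(p_keep)                 # sample a_i in {0, 1}
            log_probs.append(torch.log(a_i * p_keep + (1 - a_i) * (1 - p_keep)))
            actions.append(a_i)
        return torch.cat(actions), torch.cat(log_probs)   # each of shape (N,)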
The target translation sentence sequence is thus filtered word by word through the sequence selector to remove noise words; on this basis, the target classifier models fine-grained interaction between the attribute representation and the denoised sentence representation. The architecture of the target classifier is shown in FIG. 4.
For the target translation sentence representation H^S ∈ ℝ^{N×d} and the attribute representation H^A ∈ ℝ^{M×d}, the denoised sentence representation H^D can be obtained through the sequence selector, namely:
a = [a_1, a_2, …, a_N] = RATS(H^S, v_A)
H^D = H^S ⊙ a
where RATS denotes the sequence selector, which generates the action sequence a, and ⊙ denotes extracting the vectors at all positions where a_i = 1 from H^S and concatenating them into a new sentence sequence representation. Then, fine-grained interaction between the attribute and the denoised sentence representation is modeled by the self-attention layer in the target classifier based on cross-language distillation:
H = SelfAttention(H^A, H^D)
Finally, an average pooling layer and a fully connected layer are used to compute the unnormalized probability of each class, i.e., q = [q_1, q_2, …, q_K], where K denotes the number of classes. The probabilities are normalized by a softened softmax layer:
p_k = exp(q_k / T) / Σ_{j=1}^{K} exp(q_j / T)
where T denotes the temperature; when T = 1 this degrades to the standard softmax function. Since a higher temperature yields a probability distribution with larger entropy and richer class information, a higher temperature value is generally chosen under the knowledge distillation framework, so as to maximally exploit the class knowledge in the teacher network to help train the student network.
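Continuing the sketch under the same assumptions, the classifier's forward pass can be expressed as a hard selection followed by self-attention and a temperature-softened softmax. Here attn is assumed to be a torch.nn.MultiheadAttention instance created with batch_first=True, and fc a linear layer mapping to the K classes; concatenating the attribute and denoised sequences before self-attention is one plausible reading of SelfAttention(H^A, H^D), not a detail fixed by the text.

import torch
import torch.nn.functional as F

def classifier_forward(H_a, H_s, actions, attn, fc, T: float = 3.0):
    # H_a: (M, d) attribute representation; H_s: (N, d) sentence representation;
    # actions: (N,) binary vector produced by the sequence selector.
    H_d = H_s[actions.bool()]                       # keep positions with a_i = 1
    x = torch.cat([H_a, H_d], dim=0).unsqueeze(0)   # (1, M + N', d)
    H, _ = attn(x, x, x)                            # multi-head self-attention
    q = fc(H.mean(dim=1))                           # average pooling + linear head
    return F.softmax(q / T, dim=-1)                 # softened class probabilities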
Finally, an optimization strategy is adopted to obtain a model with better performance. The parameters of the enhanced-attention-based cross-language attribute-level emotion classification model are divided into two parts: the policy network parameters θ_r of the sequence selector, and the target classifier parameters θ_tgt. The policy network in the sequence selector is optimized with the policy-gradient-based REINFORCE algorithm. The optimization objective for the parameters θ_r is to maximize the expected reward
J(θ_r) = E_{a ~ p_π(A|S; θ_r)}[R]
The policy gradient with respect to the parameters θ_r is defined as follows:
∇_{θ_r} J(θ_r) = (1/D) Σ_{i=1}^{D} Σ_{t=1}^{N} R^{(i)} ∇_{θ_r} log p_π(a_t^{(i)} | s_t^{(i)}; θ_r)
where D denotes the dataset size, N denotes the sentence sequence length, a_t^{(i)} denotes the action of the i-th sample at the t-th time step, and s_t^{(i)} denotes the state of the i-th sample at the t-th time step.
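Under the same assumptions, this REINFORCE estimate corresponds to the following loss (gradient ascent on J(θ_r) implemented as gradient descent on its negation); log_probs is the per-word log-probability sequence returned by the selector sketch, and the reward is treated as a constant with respect to θ_r.

import torch

def reinforce_loss(log_probs: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    # log_probs: (N,) values of log p_π(a_t | s_t; θ_r) for one sample; reward: scalar R.
    # Only the log-probabilities carry gradient; R is detached.
    return -(reward.detach() * log_probs.sum())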
The target classifier parameters θ_tgt are optimized with the back-propagation algorithm. Specifically, the parameters θ_tgt that minimize the attribute emotion classification loss of the target classifier are sought:
L(θ_tgt) = L_CE(x_t, y; θ_tgt) + L_KD(x_s, x_t; θ_src, θ_tgt)
where <x_s, x_t> denote a source-language training sample and its target translation, θ_src denotes the teacher network parameters, and θ_tgt denotes the student network parameters. The teacher network parameters are frozen during training under the knowledge distillation framework.
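A sketch of this objective follows, assuming the usual form of distillation in which the student's temperature-softened predictions are matched to those of the frozen teacher by a KL term added to the cross-entropy on gold labels; the weighting alpha and the T^2 scaling are assumptions, as the exact combination is not recoverable from the text.

import torch
import torch.nn.functional as F

def student_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Cross-entropy on gold labels plus KL distillation against the frozen teacher.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),  # teacher parameters frozen
        reduction="batchmean",
    ) * (T * T)                                          # usual T^2 gradient scaling
    return (1 - alpha) * ce + alpha * kd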
At the beginning of model training, θ_r does not participate in the training process, which means the sequence selector does not discard any words; once the loss of the parameters θ_tgt on the development set begins to converge, θ_r and θ_tgt are trained together.
To verify the advantages of the invention over other models, a series of comparative experiments was performed. The experiments comprise three stages: first, data preparation; then model training; and finally testing with the trained model to show its effect.
1) Data preparation
The corpus resources used in the experiments come from the attribute-level annotated corpus published by SemEval-2016 and a public review corpus from the Suning.com e-commerce platform. To comprehensively evaluate the performance of the cross-language attribute emotion classification model, the corpus contains review texts from two domains: restaurants and laptops. On the restaurant-domain dataset, cross-language attribute-level emotion classification is studied with English as the source language and Spanish and Russian as target languages; on the laptop-domain dataset, English is the source language and Chinese is the target language. During the experiments, the training and development sets are drawn from the source-language corpus or the target translation corpus, no additional annotated target-language corpus resources are used, and the ratio of the training set to the development set is 8:2. Table 1 shows the corpus statistics before and after translation with the machine translation tool.
TABLE 1
(Table 1 is rendered as an image in the original publication.)
2) Model training
During training, the model is optimized with an Adam optimizer; the initial learning rate is set to 1e-5 for training the student network, the knowledge distillation temperature T is set to 3, and the penalty parameter γ in the reward is set to 1e-5. In addition, the training batch size is 32 and the number of training iterations is 10 epochs; to reduce the influence of overfitting, the neuron dropout rate is set to 0.3.
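Expressed as an illustrative PyTorch snippet (the student module below is a mere placeholder standing in for the selector plus target classifier):

import torch
import torch.nn as nn

student = nn.Linear(768, 3)  # placeholder for the selector + target classifier

optimizer = torch.optim.Adam(student.parameters(), lr=1e-5)
T = 3.0             # knowledge-distillation temperature
gamma = 1e-5        # penalty coefficient γ in the reward
batch_size = 32
num_epochs = 10
dropout_rate = 0.3  # neuron random-deactivation (dropout) rate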
To demonstrate the effectiveness of the proposed enhanced-distillation-based cross-language attribute-level emotion classification method, the following reference models were selected for comparison with the proposed ReKD on the same datasets:
MTDAN: this method translates the source language into the target language. Based on the target translation corpus, a Deep Averaging Network is trained on the translated corpus to generate sentence vector representations for emotion classification. Finally, a target-language emotion classification model is constructed and tested directly on the real target-language corpus.
ATAE-LSTM (S2T): this method translates the source language into the target language. Based on the target translation corpus, attribute representations and sentence representations are modeled with an LSTM network, and the attribute representation is added into the sentence representation through an attention mechanism to generate attribute-level sentence representations for emotion classification. Finally, a target-language emotion classification model is constructed and tested directly on the real target-language corpus.
ATAE-LSTM (T2S): this method is based on the source-language corpus; it models attribute representations and sentence representations with an LSTM network and uses an attention mechanism to generate attribute-level sentence representations for emotion classification. Finally, a source-language emotion classifier is constructed; in the testing stage, the real corpus is translated into the source language with a machine translation tool for testing.
CLDKCNN: this method trains a teacher network on the source-language corpus, uses a convolutional neural network as the language encoder, and migrates emotion information to the student network through cross-language distillation. Finally, a target-language emotion classifier is constructed and tested directly on the real target-language corpus.
mBERT: this method is based on the source-language corpus; attribute-sentence pairs are constructed as model input to model attribute-level sentence representations, and the vector representation obtained through an average pooling layer is used for emotion classification. In the testing stage, the real target-language corpus is tested directly, relying on the cross-language representation capability of mBERT.
mBERTSL: this method is based on a self-learning framework. A source-language attribute emotion classifier is trained with the source-language corpus as model input; then, combining the cross-language representation capability of mBERT with a selection mechanism, the unlabeled target-language corpus is predicted, and the target-language corpus is expanded through continuous iteration to complete target-language attribute-level emotion classification. In the testing stage, the real target-language corpus is tested directly.
DualBERT: this method is based on the source-language corpus and the target translation corpus; in the training stage, the emotion information in the source-language corpus assists the target-language classifier in attribute-level emotion classification. Finally, an emotion classifier taking bilingual text as input is constructed; in the testing stage, the real target-language corpus and its corresponding source-language translation are tested.
TransMatch: this method translates the source language into the target language. Based on the target translation corpus and the unlabeled real target-language corpus, a target-language encoder encodes the attributes, the target translation sentences and the real sentences separately; a domain discriminator is then introduced together with the encoder for adversarial training to generate domain-independent feature representations, and attribute-level emotion classification training is performed with the target translation corpus. In the testing stage, the real target-language corpus is tested directly.
ReKD: this method trains a teacher network on the source-language corpus, denoises target translation sentences with the enhanced distillation method, and models attribute-level sentence representations with a target classifier built from multi-head attention layers; the denoising reduces the difficulty of modeling attribute-level interaction while alleviating the domain-gap problem. Finally, a target-language classifier is constructed and tested directly on the real target-language corpus.
3) Results of the experiment
Applying the prepared data to the above models yields the results shown in Table 2. The results report the accuracy and F1-measure of the trained models on the test set; the larger these evaluation metrics, the better the model.
TABLE 2
(Table 2 is rendered as an image in the original publication.)
Table 2 shows that the proposed enhanced-distillation-based cross-language attribute emotion classification method outperforms the other reference models and achieves the best results. From the experimental results, the MTDAN method translates the source-language corpus into the target language and models sentence-level representations with a deep averaging network; however, it omits the fine-grained interaction between the attributes and the sentence representation, so the resulting sentence representation contains no attribute information, and the method performs worse than the other benchmark models. This also demonstrates the importance of modeling attribute-level interactions in the attribute-level emotion classification task. Comparing ATAE-LSTM (S2T) with ATAE-LSTM (T2S) shows that forward translation performs worse than reverse translation, mainly because the forward method trains on a target translation corpus produced by a machine translation tool: the low quality of the translations degrades the model trained on them, i.e., errors propagate. The reverse method trains a source-language emotion classifier on the source-language corpus and translates the target language into the source language in the testing stage; since the corpus used to train the classifier is of high quality, the classifier performs well and keeps good performance even when lower-quality translations are used at test time.
Second, the CLDKCNN method also achieves strong performance using cross-language distillation, confirming the effectiveness of the cross-language distillation approach. Compared with ReKD, however, it builds the target-language encoder from a convolutional neural network, ignoring both the interaction between the attributes and the sentence representation and the noise problem in translated sentences. The ReKD method uses an attribute-sensitive sequence selector to filter the noise words in target translation sentences and a multi-head self-attention network as the target classifier to model attribute-level sentence representations as the final classification basis, and is thus markedly improved. Furthermore, mBERT achieves good performance without any external corpora or machine translation tools, which illustrates the potential of current cross-language pre-trained models. The mBERTSL method builds on mBERT with a self-learning framework, iteratively enlarging the target-language training set and finally constructing an attribute emotion classification model with target-language classification capability; it improves markedly on mBERT, which also confirms the effectiveness of the self-learning framework in cross-language tasks. The proposed ReKD method outperforms all the other reference models.
The technical means disclosed in the invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also considered to be within the scope of the invention.

Claims (9)

1. A cross-language attribute-level emotion classification method based on enhanced distillation, characterized by comprising the following steps:
step one, training a teacher network on source-language corpora, and migrating the attribute emotion information in those corpora to the target classifier of the student network through a knowledge distillation framework;
step two, using an attribute-sensitive sequence selector to select attribute-emotion-related information from the target translation sentence sequence according to a specific attribute and, as an intermediate module of the model, to provide a denoised sentence sequence representation for the target classifier;
and step three, constructing a target classifier based on cross-language distillation with a self-attention layer, and modeling fine-grained interaction between the attribute sequence and the denoised target translation sentence sequence.
2. The enhanced-distillation-based cross-language attribute-level emotion classification method according to claim 1, wherein the sequence selector uses an LSTM network to model the policy network p_π and learns an optimal policy π(a_{1:n}) with a policy-gradient algorithm; the policy network p_π learns the optimal policy by defining a reward and decides with probability p_π(a_i | s_i; θ_r) whether to select x_i.
3. The enhanced-distillation-based cross-language attribute-level emotion classification method according to claim 2, wherein the state, action and reward of the policy network are defined as follows:
state: the state at the i-th time step is defined as s_i; given the attribute, the state must provide enough information to decide whether to select x_i, so s_i consists of the following three parts:
s_i = [h_i; v_i; v_A]
where h_i is the hidden-state representation of the LSTM at the i-th time step, v_i is the vector representation of the i-th word x_i, and v_A is the attribute vector representation;
action: the policy network p_π performs action a_i ∈ {0,1} with probability p_π(a_i | s_i; θ_r), and this probability is calculated with a logistic function:
p_π(a_i | s_i; θ_r) = σ(w_r · s_i + b_r)
a = [a_1, a_2, …, a_n] ~ p_π(A | S; θ_r)
where θ_r denotes the policy network parameters, ~ denotes a sampling operation, S denotes states, A denotes actions, and w_r and b_r are trainable parameters;
reward: an attribute-sensitive reward R is defined that integrates the attribute emotion classification loss and the cross-language distillation loss; for a training sample <x_s, x_t, y>, the reward is defined as follows:
R = -(L_CE(x_t, y; θ_tgt) + L_KD(x_s, x_t; θ_src, θ_tgt)) - γ·N'/N
where θ_src denotes the teacher network parameters, θ_tgt denotes the student network parameters, and γ·N'/N is a penalty term to prevent overfitting.
4. The enhanced-distillation-based cross-language attribute-level emotion classification method according to claim 1 or 3, wherein the second step and the third step specifically comprise the following processes:
for the target translation sentence representation H^S ∈ ℝ^{N×d} and the attribute representation H^A ∈ ℝ^{M×d}, the denoised sentence representation H^D is obtained through the sequence selector, namely:
a = [a_1, a_2, …, a_N] = RATS(H^S, v_A)
H^D = H^S ⊙ a
where RATS denotes the sequence selector, which generates the action sequence a, and ⊙ denotes extracting the vectors at all positions where a_i = 1 from H^S and concatenating them into a new sentence sequence representation;
then, fine-grained interaction between the attribute and the denoised sentence representation is modeled by the self-attention layer in the target classifier based on cross-language distillation:
H = SelfAttention(H^A, H^D)
finally, an average pooling layer and a fully connected layer are used to compute the unnormalized probability of each class, i.e., q = [q_1, q_2, …, q_K], where K denotes the number of classes; the probabilities are normalized by a softened softmax layer:
p_k = exp(q_k / T) / Σ_{j=1}^{K} exp(q_j / T)
where T denotes the temperature; when T = 1 this degrades to the standard softmax function.
5. The enhanced-distillation-based cross-language attribute-level emotion classification method according to claim 2 or 3, wherein the policy network in the sequence selector is optimized with the policy-gradient-based REINFORCE algorithm; the optimization objective for the parameters θ_r is to maximize the expected reward
J(θ_r) = E_{a ~ p_π(A|S; θ_r)}[R]
The policy gradient with respect to the parameters θ_r is defined as follows:
∇_{θ_r} J(θ_r) = (1/D) Σ_{i=1}^{D} Σ_{t=1}^{N} R^{(i)} ∇_{θ_r} log p_π(a_t^{(i)} | s_t^{(i)}; θ_r)
where D denotes the dataset size, N denotes the sentence sequence length, a_t^{(i)} denotes the action of the i-th sample at the t-th time step, and s_t^{(i)} denotes the state of the i-th sample at the t-th time step.
6. The enhanced-distillation-based cross-language attribute-level emotion classification method according to claim 4, wherein the parameters θ_tgt of the target classifier are optimized with the back-propagation algorithm, seeking the parameters θ_tgt that minimize the attribute emotion classification loss of the target classifier:
L(θ_tgt) = L_CE(x_t, y; θ_tgt) + L_KD(x_s, x_t; θ_src, θ_tgt)
where <x_s, x_t> denote a source-language training sample and its target translation, θ_src denotes the teacher network parameters, and θ_tgt denotes the student network parameters; the teacher network parameters are frozen during training under the knowledge distillation framework.
7. The enhanced-distillation-based cross-language attribute-level emotion classification method according to claim 4 or claim 6, wherein at the beginning of model training θ_r does not participate in the training process; once the loss of the parameters θ_tgt on the development set begins to converge, θ_r and θ_tgt are trained together.
8. The enhanced-distillation-based cross-language attribute-level emotion classification method according to claim 1, wherein the teacher network is trained with the official Google BERT-Base model, and the target classifier of the student network uses a multi-head self-attention layer to model the interaction between the attribute and the denoised sentence sequence and is composed of 3 Transformer encoder sub-modules; the maximum sentence sequence length is set to 60, the maximum attribute sequence length to 5, and the word vector dimension to 768; the sentence sequence and the attribute sequence share the target-language encoder.
9. The enhanced-distillation-based cross-language attribute-level emotion classification method according to claim 1, wherein the model is optimized with an Adam optimizer; the initial learning rate is set to 1e-5 for training the student network, the knowledge distillation temperature T is set to 3, and the penalty parameter γ in the reward is set to 1e-5; in addition, the training batch size is 32, the number of training iterations is 10 epochs, and the neuron dropout rate is set to 0.3.
CN202210044125.9A 2022-01-14 2022-01-14 Cross-language attribute level emotion classification method based on enhanced distillation Pending CN114429143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210044125.9A CN114429143A (en) 2022-01-14 2022-01-14 Cross-language attribute level emotion classification method based on enhanced distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210044125.9A CN114429143A (en) 2022-01-14 2022-01-14 Cross-language attribute level emotion classification method based on enhanced distillation

Publications (1)

Publication Number Publication Date
CN114429143A true CN114429143A (en) 2022-05-03

Family

ID=81311086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210044125.9A Pending CN114429143A (en) 2022-01-14 2022-01-14 Cross-language attribute level emotion classification method based on enhanced distillation

Country Status (1)

Country Link
CN (1) CN114429143A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115203419A (en) * 2022-07-21 2022-10-18 北京百度网讯科技有限公司 Language model training method and device and electronic equipment
CN116468959A (en) * 2023-06-15 2023-07-21 清软微视(杭州)科技有限公司 Industrial defect classification method, device, electronic equipment and storage medium
CN116523031A (en) * 2023-07-05 2023-08-01 深圳须弥云图空间科技有限公司 Training method of language generation model, language generation method and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325112A (en) * 2018-06-27 2019-02-12 北京大学 A kind of across language sentiment analysis method and apparatus based on emoji
CN112884150A (en) * 2021-01-21 2021-06-01 北京航空航天大学 Safety enhancement method for knowledge distillation of pre-training model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325112A (en) * 2018-06-27 2019-02-12 北京大学 A kind of across language sentiment analysis method and apparatus based on emoji
CN112884150A (en) * 2021-01-21 2021-06-01 北京航空航天大学 Safety enhancement method for knowledge distillation of pre-training model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANQIAN WU et al., "Reinforced Transformer with Cross-Lingual Distillation for Cross-Lingual Aspect Sentiment Classification", https://mdpi.longhoe.net/2079-9292/10/3/270, 23 January 2021 (2021-01-23), pages 1-14 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115203419A (en) * 2022-07-21 2022-10-18 北京百度网讯科技有限公司 Language model training method and device and electronic equipment
CN116468959A (en) * 2023-06-15 2023-07-21 清软微视(杭州)科技有限公司 Industrial defect classification method, device, electronic equipment and storage medium
CN116468959B (en) * 2023-06-15 2023-09-08 清软微视(杭州)科技有限公司 Industrial defect classification method, device, electronic equipment and storage medium
CN116523031A (en) * 2023-07-05 2023-08-01 深圳须弥云图空间科技有限公司 Training method of language generation model, language generation method and electronic equipment
CN116523031B (en) * 2023-07-05 2024-05-10 深圳须弥云图空间科技有限公司 Training method of language generation model, language generation method and electronic equipment

Similar Documents

Publication Publication Date Title
Bhutani et al. Learning to answer complex questions over knowledge bases with query composition
CN110287481B (en) Named entity corpus labeling training system
WO2023225858A1 (en) Reading type examination question generation system and method based on commonsense reasoning
CN114429143A (en) Cross-language attribute level emotion classification method based on enhanced distillation
CN114168749A (en) Question generation system based on knowledge graph and question word drive
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
CN113254604B (en) Reference specification-based professional text generation method and device
CN115510814B (en) Chapter-level complex problem generation method based on dual planning
CN114969278A (en) Knowledge enhancement graph neural network-based text question-answering model
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN115510226A (en) Emotion classification method based on graph neural network
Liu et al. Cross-domain slot filling as machine reading comprehension: A new perspective
CN111538838A (en) Question generation method based on article
CN114238636A (en) Translation matching-based cross-language attribute level emotion classification method
Li et al. Approach of intelligence question-answering system based on physical fitness knowledge graph
US20230289528A1 (en) Method for constructing sentiment classification model based on metaphor identification
CN116523402A (en) Multi-mode data-based network learning resource quality assessment method and system
CN114548117A (en) Cause-and-effect relation extraction method based on BERT semantic enhancement
CN115017910A (en) Entity relation joint extraction method, network, equipment and computer readable storage medium based on Chinese electronic medical record
He et al. [Retracted] Application of Grammar Error Detection Method for English Composition Based on Machine Learning
CN114943216A (en) Case microblog attribute-level viewpoint mining method based on graph attention network
CN114692615A (en) Small sample semantic graph recognition method for small languages
CN113360606A (en) Knowledge graph question-answer joint training method based on Filter
Zhang et al. Robust dialog state tracker with contextual-feature augmentation
Guo RETRACTED: An automatic scoring method for Chinese-English spoken translation based on attention LSTM [EAI Endorsed Scal Inf Syst (2022), Online First]

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination