CN109325112B - Emoji-based cross-lingual sentiment analysis method and apparatus - Google Patents

Emoji-based cross-lingual sentiment analysis method and apparatus

Info

Publication number
CN109325112B
Authority
CN
China
Prior art keywords
text
language
emoji
vector
characterization
Prior art date
Legal status
Active
Application number
CN201810678889.7A
Other languages
Chinese (zh)
Other versions
CN109325112A (en)
Inventor
刘譞哲
陈震鹏
沈晟
陆璇
马郓
黄罡
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN201810678889.7A
Publication of CN109325112A
Application granted
Publication of CN109325112B
Active legal status (current)
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

The present invention relates to an emoji-based cross-lingual sentiment analysis method and apparatus. The method comprises: 1) building word vectors from a large collection of unlabeled text in the source and target languages; 2) selecting, based on the word vectors, the unlabeled texts that contain emoji and using them to set up an emoji prediction task, thereby obtaining a sentence characterization model; 3) translating a source-language corpus labeled with sentiment polarity into the target language, using the sentence characterization model to obtain document representations of the original texts and their translations, and training a sentiment classification model on these document representations; 4) applying the trained sentiment classification model to new target-language text to obtain its sentiment polarity. By exploiting emoji-bearing text that is easy to crawl from social platforms, the invention alleviates the scarcity of annotation resources and their imbalance across languages.

Description

Emoji-based cross-lingual sentiment analysis method and apparatus
Technical field
The present invention is an emoji-based cross-lingual sentiment analysis method and apparatus, and belongs to the field of software technology.
Background art
In recent years, with the development of the Internet, a large amount of user-generated text has appeared online, such as blogs, microblogs, forum discussions and reviews. This abundance of user-generated text has drawn researchers' interest in performing automatic sentiment analysis on it. Since the early 2000s, sentiment analysis has become one of the most popular research topics in natural language processing and has been widely applied in research fields such as Web mining, data mining, information retrieval, ubiquitous computing and human-computer interaction. Researchers' enthusiasm for sentiment analysis is largely attributable to its high practical value: sentiment analysis techniques have been applied in many real-world scenarios, including customer feedback and tracking, sales forecasting, product ranking, stock market prediction, opinion aggregation and election prediction, and have produced considerable practical benefit.
However, most sentiment analysis research to date has been conducted on English text. This is largely because early sentiment analysis work was carried out mainly by researchers in English-speaking countries. Those studies provided annotated corpora and benchmark datasets that made subsequent research convenient, so researchers kept focusing on English text and work on sentiment analysis for other languages stagnated. Yet according to statistics, only 25.3% of Internet users use English (https://www.internetworldstats.com/stats7.html). Other languages therefore also have huge user bases, and carrying out sentiment analysis on them is just as important. This situation has prompted a number of researchers to study cross-lingual sentiment analysis, which aims to use the labeled data of a resource-rich language (the source language, usually English) to train a general model that can also perform sentiment classification on text in languages whose labeled resources are scarce (the target language, e.g. Japanese).
The key to cross-lingual sentiment analysis is to find a medium that can bridge the vocabularies of the source and target languages. Most mainstream work chooses parallel text of the source and target languages as this medium, that is, different textual expressions of the same meaning in the two languages. Generating parallel text depends heavily on machine translation, but current translation technology often loses the sentiment information of the original sentence during translation, which makes cross-lingual sentiment analysis difficult. For example, as shown in Fig. 1, "black sheep" in English is often used to refer to a disreputable member of a group, but after translation into Japanese only the literal semantics of the original English (a sheep that is black) remains and the sarcastic connotation is lost. In addition, although the source language (English) has relatively abundant labeled resources compared with other languages, the amount of such data is still very limited for today's deep learning algorithms, which consequently often fail to learn good vector representations of words and sentences. A new learning paradigm is therefore urgently needed that can alleviate both the loss of sentiment during translation and the scarcity of labeled data. One possible solution is distant supervision, in which researchers manually define rules to generate weakly labeled data, and by learning from a large amount of weakly labeled data the model approaches the results obtained by training on genuinely labeled data.
Summary of the invention
In view of the problems existing in the field of cross-lingual sentiment analysis, the purpose of the present invention is to provide a cross-lingual sentiment analysis method and apparatus that exploit the wide use of emoji through a semi-supervised representation learning framework.
For the cross-lingual sentiment classification problem, the chosen weak label needs to satisfy two properties. On the one hand, it should be widely used in every language; on the other hand, it should implicitly reveal sentiment information. Under these criteria, the present invention uses emoji as weak labels. Because emoji have no language barrier and can express different emotions, they are widely used by users of different genders and countries and can serve as weak labels reflecting the real sentiment of text in every language. The invention therefore proposes an emoji-based representation learning method for cross-lingual sentiment analysis, which aims to use the resources of the source language (English) to train a model capable of classifying the sentiment of target-language text.
The technical solution adopted by the invention is as follows:
An emoji-based cross-lingual sentiment analysis method, whose main steps are as follows:
1. Unsupervised learning stage: build word vectors from a large collection of unlabeled text in the source and target languages.
2. Distant supervision stage: based on the word vectors, select the unlabeled texts that contain emoji and use them to set up an emoji prediction task, thereby obtaining a sentence characterization model.
3. Supervised learning stage: translate the source-language corpus labeled with sentiment polarity into the target language, use the sentence characterization model to obtain document representations of the original texts and their translations, and train a sentiment classification model on these document representations.
4. Sentiment classification stage: apply the trained sentiment classification model to new target-language text to obtain its sentiment polarity.
Fig. 2 is the flow chart of the above method. The specific technical solution of each step is as follows:
1. Unsupervised learning stage
At this stage, large-scale tweet text and the Word2Vec method are used to train word vectors. The texts can be collected through the Twitter API (https://developer.twitter.com/). Although the traditional one-hot representation can distinguish individual words, it represents them discretely and cannot establish semantic relations between words, which increases the difficulty of later text-processing tasks. To solve this problem, the present invention uses the Word2Vec mechanism to encode each word into a continuous vector space through model training. This process uses only unlabeled corpora to capture and represent the semantic information of words and is therefore unsupervised. In the concrete implementation, the pretrained word-vector parameters are used to initialize the word-vector layer of the overall framework; these parameters are fine-tuned in the subsequent distant supervision stage and kept fixed in the final supervised learning stage.
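As an illustration, word vectors of this kind could be trained with the gensim implementation of Word2Vec; the toy corpus and the hyperparameter values below are assumptions made for the sketch rather than values fixed by the invention.

```python
from gensim.models import Word2Vec  # gensim >= 4.0

# One token list per preprocessed tweet (toy placeholder data).
tokenized_tweets = [
    ["this", "movie", "is", "cool"],
    ["so", "sad", "today"],
    ["this", "song", "is", "cool"],
]

w2v = Word2Vec(
    sentences=tokenized_tweets,
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word here; a real crawl would use a higher threshold
    workers=4,
)

embedding_for_cool = w2v.wv["cool"]  # vectors like this initialize the word-vector layer
```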
2. Distant supervision stage
Based on the word-level representations (word vectors) created in the unsupervised learning stage, the present invention designs an emoji prediction task to learn a characterization mechanism that captures both the semantics and the sentiment of text. In the emoji prediction task, sentences that use the same emoji are given similar representations in the vector space. Fig. 3 further illustrates the architecture of the emoji prediction model. Two bidirectional LSTM layers and one attention layer perform the sentence-level encoding of the text. A skip-connection mechanism is used so that the input of the attention layer is the word-vector layer plus the outputs of the two LSTM layers, allowing information to flow unimpeded through the whole model. Finally, the output of the attention layer is used for classification by a softmax layer.
The bidirectional LSTM layers, the attention layer and the softmax layer are introduced below.
Bidirectional long short-term memory (Bi-LSTM) layer: each training sample can be expressed as (x, e), where x = [d1, d2, ..., dL] denotes the word-vector sequence of the text with the emoji removed, and e is the emoji originally contained in the text. At step t, the LSTM computes its node states according to the following formulas:
i(t) = δ(U_i x(t) + W_i h(t-1) + b_i),
f(t) = δ(U_f x(t) + W_f h(t-1) + b_f),
o(t) = δ(U_o x(t) + W_o h(t-1) + b_o),
c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ tanh(U_c x(t) + W_c h(t-1) + b_c),
h(t) = o(t) ⊙ tanh(c(t)),
where x(t), i(t), f(t), o(t), c(t) and h(t) respectively denote the input vector, input-gate state, forget-gate state, output-gate state, memory-cell state and hidden state of the LSTM at step t; W, U and b respectively denote the recurrent-connection parameters, the input-connection parameters and the bias terms; δ(·) is the sigmoid function; and ⊙ denotes the element-wise product. The output of the model at each step t yields the word-sequence representation vectors of each sentence.
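For concreteness, one step of this recurrence can be written out directly; the NumPy sketch below (the dictionary layout and toy dimensions are assumptions made for illustration) takes δ to be the sigmoid function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, U, W, b):
    """One step of the recurrence above; U, W, b are dicts keyed by gate name
    ('i', 'f', 'o', 'c') holding input weights, recurrent weights and biases."""
    i = sigmoid(U["i"] @ x_t + W["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(U["f"] @ x_t + W["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(U["o"] @ x_t + W["o"] @ h_prev + b["o"])  # output gate
    c = f * c_prev + i * np.tanh(U["c"] @ x_t + W["c"] @ h_prev + b["c"])
    h = o * np.tanh(c)                                    # hidden state
    return h, c

# Toy dimensions for a quick check.
dim_x, dim_h = 4, 3
rng = np.random.default_rng(0)
U = {g: rng.standard_normal((dim_h, dim_x)) for g in "ifoc"}
W = {g: rng.standard_normal((dim_h, dim_h)) for g in "ifoc"}
b = {g: np.zeros(dim_h) for g in "ifoc"}
h, c = lstm_step(rng.standard_normal(dim_x), np.zeros(dim_h), np.zeros(dim_h), U, W, b)
```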
Further, in order to capture both the preceding and the following context of each word, a bidirectional LSTM is used to encode the word sequence. The representation vectors of the i-th element of the word sequence produced by the forward and backward LSTMs are directly concatenated to obtain the final representation h_i:
h_i = [h_i^f, h_i^b],
where h_i^f and h_i^b are the forward and backward outputs for the i-th element. The representation vector h_i obtained in this way captures both the preceding and the following context of the i-th word.
Attention layer: as mentioned above, the word-vector layer, the forward LSTM layer and the backward LSTM layer are connected by skip-connections to form the input of the attention layer, so the i-th word of an input sentence is represented as u_i:
u_i = [d_i, h_i1, h_i2],
where d_i, h_i1 and h_i2 respectively denote the representations of the i-th word in the word-vector layer, the forward LSTM layer and the backward LSTM layer. Since not every word plays the same role in the emoji prediction task and in sentiment classification, the present invention introduces an attention mechanism to determine the importance of each word in the current representation learning process. The attention score of the i-th word is computed according to the following formula:
a_i = exp(W_a u_i) / Σ_{j=1..L} exp(W_a u_j),
where W_a is the parameter matrix of the attention layer. Each sentence, expressed as a word sequence, is then characterized as the weighted average of the concatenated word representations in the sequence, with the weights being the attention values computed above. Specifically, the representation of each sentence takes the following form, where L is the number of words contained in the sentence:
v = Σ_{i=1..L} a_i u_i.
Softmax layer: the sentence representation obtained from the attention layer is then passed to the softmax layer, which returns a corresponding probability vector Y. Each element of Y represents the probability that the sentence contains a particular emoji. Specifically, the i-th element of the probability vector is computed as:
Y_i = exp(w_i^T v + b_i) / Σ_{k=1..K} exp(w_k^T v + b_k),
where T denotes matrix transposition, w_i is the i-th weight parameter, b_i is the i-th bias term, K is the dimension of the probability vector, and v is the sentence representation. After the probability vector of each sentence has been obtained, cross-entropy is used as the loss function and the parameters are updated by gradient descent to minimize the prediction error of the model. After the parameters have been adjusted through the above distant supervision and unsupervised learning, the vector representation of each sentence can be extracted from the output of the attention layer.
Since the amount of data in the later supervised learning stage is limited, and to avoid over-fitting caused by an excessively large number of model parameters, the sentence characterization model is kept fixed while the final document-level vector representation is trained; its parameters are not adjusted further.
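A compact PyTorch sketch of an encoder with this overall shape is given below. The layer sizes, the single-vector attention scoring and the reading of the skip connection (embedding plus the outputs of the two stacked Bi-LSTM layers) are assumptions made for illustration rather than a reproduction of the patented model.

```python
import torch
import torch.nn as nn

class EmojiEncoder(nn.Module):
    """Sketch in the spirit of Fig. 3: word embeddings, two stacked bidirectional
    LSTMs, a skip-connected attention layer, and a softmax output over emoji."""

    def __init__(self, vocab_size, emb_dim=300, hidden=512, n_emoji=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        att_dim = emb_dim + 4 * hidden                # u_i = [d_i, h_i1, h_i2]
        self.att = nn.Linear(att_dim, 1, bias=False)  # plays the role of W_a
        self.out = nn.Linear(att_dim, n_emoji)        # softmax layer over K emoji

    def forward(self, tokens):                        # tokens: (batch, L) word ids
        d = self.emb(tokens)                          # word-vector layer
        h1, _ = self.lstm1(d)                         # first Bi-LSTM layer
        h2, _ = self.lstm2(h1)                        # second Bi-LSTM layer
        u = torch.cat([d, h1, h2], dim=-1)            # skip connection into attention
        a = torch.softmax(self.att(u).squeeze(-1), dim=-1)  # attention weights a_i
        v = (a.unsqueeze(-1) * u).sum(dim=1)          # sentence representation v
        return self.out(v), v                         # emoji logits and reusable v

# Training minimizes cross-entropy between the logits and the emoji label, e.g.
# loss = nn.CrossEntropyLoss()(logits, emoji_ids).
```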
3. Supervised learning stage
After the distant supervision stage, sentences with the same semantics and sentiment in each language are mapped to nearby points in the representation space. Since the problem ultimately to be solved is cross-lingual sentiment analysis of documents, a compact way of capturing the effective information of a document is still required. Within a document, different sentences contribute with different degrees of importance to the sentiment it expresses, so a document-level attention mechanism is likewise used to aggregate the sentences of each document. Denoting the representation of a document by r and the representations of its sentences by v, r is computed by the following formulas:
β_i = exp(W_b v_i) / Σ_j exp(W_b v_j),
r = Σ_i β_i v_i,
where W_b is the weight matrix of the document-level attention layer and β_i is the attention value of the i-th sentence in the document. Google Translate is used to translate each source-language sample x ∈ L_S into the target language, and the vector representation of the translated text is obtained in the same way. For each labeled English text x_s and its corresponding translated text x_t, let the vectors obtained after the above attention layers be r_s and r_t; in the supervised learning stage they are directly concatenated into r_c = [r_s, r_t], which serves as the input of the final softmax layer, and the cross-entropy loss between the network prediction and the true label is minimized to update the corresponding network parameters.
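Under the same assumptions as the previous sketch, the supervised stage could be written as a small document-level head that consumes sentence vectors from the fixed encoders; the names and sizes below are illustrative.

```python
import torch
import torch.nn as nn

class CrossLingualSentimentHead(nn.Module):
    """Sketch of the supervised stage: document-level attention over sentence
    vectors, then concatenation of the source and translated document vectors."""

    def __init__(self, sent_dim, n_classes=2):
        super().__init__()
        self.att = nn.Linear(sent_dim, 1, bias=False)  # plays the role of W_b
        self.out = nn.Linear(2 * sent_dim, n_classes)

    def doc_vector(self, sent_vecs):                   # sent_vecs: (n_sentences, sent_dim)
        beta = torch.softmax(self.att(sent_vecs).squeeze(-1), dim=0)
        return (beta.unsqueeze(-1) * sent_vecs).sum(dim=0)  # document vector r

    def forward(self, src_sent_vecs, tgt_sent_vecs):
        # Sentence vectors come from the two fixed, language-specific encoders.
        r_c = torch.cat([self.doc_vector(src_sent_vecs),
                         self.doc_vector(tgt_sent_vecs)])
        return self.out(r_c)                           # sentiment logits
```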
Corresponding to the above method, the present invention also provides an emoji-based cross-lingual sentiment analysis apparatus, comprising:
an unsupervised learning module, responsible for building word vectors from a large collection of unlabeled text in the source and target languages;
a distant supervision module, responsible for selecting, based on the word vectors, the unlabeled texts that contain emoji and using the texts containing emoji to set up an emoji prediction task, thereby obtaining a characterization model;
a supervised learning module, responsible for translating the source-language corpus labeled with sentiment polarity into the target language, using the sentence characterization model to obtain document representations of the original texts and the translated texts, and training a sentiment classification model on the document representations;
a sentiment classification module, responsible for applying the trained sentiment classification model to new target-language text to perform sentiment classification and obtain its sentiment polarity.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention uses emoji-bearing text, which is easy to crawl from social platforms, to alleviate the scarcity of annotation resources and their imbalance across languages. Specifically, because distant supervision is used with emoji as weak sentiment labels, the need for manually annotated corpora is reduced. Moreover, because emoji are widely used in every language, using emoji as the weak label for cross-lingual sentiment analysis is universal across languages.
Description of the drawings
Fig. 1 is a schematic diagram of a Google Translate example.
Fig. 2 is the flow chart of the method of the present invention.
Fig. 3 is the architecture diagram of the distant supervision framework.
Fig. 4 is an example of extracting samples from a tweet containing multiple emoji.
Detailed description of embodiments
The method of the invention is further explained and verified below on the classical Amazon review cross-lingual analysis task (https://www.uni-weimar.de/en/media/chairs/computer-science-department/webis/data/corpus-webis-cls-10/). The task uses English as the source language and Japanese, French and German as target languages, and for each language it contains sentiment analysis tasks in three domains: books, DVDs and music. Because of its representativeness, this task has long served the academic community as a benchmark dataset for cross-lingual sentiment analysis. To verify the method of the invention on this dataset, the model is trained according to the following steps.
First, English, Japanese, French and German tweets were crawled from Twitter and preprocessed as follows (a sketch of these steps is given after the list):
1) retweets were removed, to ensure that every sentence appears in its original context;
2) tweets containing URLs were removed, to ensure that the sentiment of every sentence depends only on its own semantics and not on external resources;
3) all tweets were tokenized and converted to lowercase; since Japanese does not separate words with spaces, this embodiment uses the MeCab tokenizer (http://taku910.github.io/mecab) to process Japanese separately;
4) @-mentions and numbers in tweets were replaced with unified special characters;
5) words with redundant letters were restored to their original form, for example both "cooool" and "cooooooool" are converted to "cool".
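A minimal sketch of these rules is shown below; the placeholder tokens and the exact regular expressions are assumptions, since the embodiment specifies the rules only in prose (and Japanese would additionally be segmented with MeCab).

```python
import re

def preprocess_tweet(text):
    """Apply rules 1)-5); returns None for tweets that should be discarded."""
    if text.lower().startswith("rt "):            # 1) drop retweets (simplified check)
        return None
    if re.search(r"https?://\S+", text):          # 2) drop tweets containing URLs
        return None
    text = text.lower()                           # 3) lowercase (tokenization not shown)
    text = re.sub(r"@\w+", "<user>", text)        # 4) unify @-mentions ...
    text = re.sub(r"\d+", "<number>", text)       #    ... and numbers
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)   # 5) "cooool", "cooooooool" -> "cool"
    return text
```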
Word2Vec was then applied to these preprocessed texts to obtain the representation of each word in the source and target languages.
Next, the tweets containing emoji were extracted. For each language, the 64 most frequent emoji in the tweets were identified, and sentences containing none of these emoji were filtered out. A sentence may contain more than one emoji; for each tweet, one sample is created for every kind of emoji it contains. For example, the sentence shown in Fig. 4(a) yields the two samples shown in Fig. 4(b) and Fig. 4(c). The emoji samples obtained in this way are used to set up the emoji prediction task, and a characterization model is trained separately for the source language and for the target languages.
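One way this per-emoji sample construction could be implemented is sketched below; treating emoji as tokens and stripping all of them from the text are assumptions about the details of Fig. 4.

```python
def make_emoji_samples(tweet_tokens, top_emoji):
    """Build one (text, emoji) sample per distinct top-64 emoji the tweet
    contains, with the emoji tokens removed from the text."""
    present = [tok for tok in tweet_tokens if tok in top_emoji]
    if not present:
        return []                                       # filtered out: no frequent emoji
    text = [tok for tok in tweet_tokens if tok not in top_emoji]
    return [(text, e) for e in dict.fromkeys(present)]  # one sample per emoji kind

# A tweet containing two different emoji yields two samples, as in Fig. 4(b) and 4(c).
```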
Finally, the English corpus with sentiment labels was translated into the target languages to form parallel texts. The original and translated texts were fed into the sentence characterization models of the corresponding languages obtained in the previous step to obtain the representation of every sentence, which was then used to train the supervised learning model.
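Tying the sketches together, one supervised training step might look like the following; it reuses the EmojiEncoder and CrossLingualSentimentHead classes assumed above, and batching and the translation call are omitted.

```python
import torch
import torch.nn as nn

def supervised_step(src_encoder, tgt_encoder, head, optimizer,
                    src_sentences, tgt_sentences, label):
    """One gradient step of the supervised stage: the two sentence encoders
    stay frozen and only the document-level head is updated."""
    with torch.no_grad():                              # fixed sentence encoders
        src_vecs = torch.stack([src_encoder(s)[1].squeeze(0) for s in src_sentences])
        tgt_vecs = torch.stack([tgt_encoder(s)[1].squeeze(0) for s in tgt_sentences])
    logits = head(src_vecs, tgt_vecs)
    loss = nn.CrossEntropyLoss()(logits.unsqueeze(0), torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```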
The resulting supervised learning model can be used to classify the sentiment of target-language text. Table 1 below shows the classification accuracy of the method of the invention on the 9 tasks of the Amazon benchmark dataset.
Table 1. Classification accuracy (%) of the method of the invention on the 9 tasks of the Amazon benchmark dataset
In the present invention, the unsupervised learning stage may use other classical algorithms besides Word2Vec to obtain word vectors, such as the GloVe algorithm. In the distant supervision stage, a CNN (convolutional neural network) model may be used instead of the bidirectional LSTM layers to encode text, and the number of bidirectional LSTM layers may also be adjusted.
Another embodiment of the present invention provides an emoji-based cross-lingual sentiment analysis apparatus, comprising:
an unsupervised learning module, responsible for building word vectors from a large collection of unlabeled text in the source and target languages;
a distant supervision module, responsible for selecting, based on the word vectors, the unlabeled texts that contain emoji and using the texts containing emoji to set up an emoji prediction task, thereby obtaining a characterization model;
a supervised learning module, responsible for translating the source-language corpus labeled with sentiment polarity into the target language, using the sentence characterization model to obtain document representations of the original texts and the translated texts, and training a sentiment classification model on the document representations;
a sentiment classification module, responsible for applying the trained sentiment classification model to new target-language text to perform sentiment classification and obtain its sentiment polarity.
The above embodiments are intended merely to illustrate rather than limit the technical solution of the present invention. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention; the protection scope of the present invention shall be defined by the claims.

Claims (10)

1. An emoji-based cross-lingual sentiment analysis method, characterized by comprising the following steps:
1) building word vectors from a large collection of unlabeled text in the source and target languages;
2) selecting, based on the word vectors, the unlabeled texts that contain emoji and using the texts containing emoji to set up an emoji prediction task, thereby obtaining a characterization model;
3) translating the source-language corpus labeled with sentiment polarity into the target language, using the sentence characterization model to obtain document representations of the original texts and the translated texts, and training a sentiment classification model on the document representations;
4) applying the trained sentiment classification model to new target-language text to perform sentiment classification and obtain its sentiment polarity.
2. The method according to claim 1, characterized in that step 1) is an unsupervised learning stage, in which the word vectors are trained on large-scale tweet text using the Word2Vec method.
3. The method according to claim 1, characterized in that step 2) is a distant supervision stage, in which sentences that use the same emoji are given similar representations in the vector space within the emoji prediction task; the emoji prediction task uses two bidirectional LSTM layers and one attention layer to perform the sentence-level encoding of the text, and a skip-connection mechanism is used so that the input of the attention layer is the word-vector layer plus the outputs of the two LSTM layers, allowing information to flow unimpeded through the whole model; finally, the output of the attention layer is used for classification by a softmax layer.
4. The method according to claim 3, characterized in that the bidirectional LSTM layers compute the node states in the network according to the following formulas:
i(t) = δ(U_i x(t) + W_i h(t-1) + b_i),
f(t) = δ(U_f x(t) + W_f h(t-1) + b_f),
o(t) = δ(U_o x(t) + W_o h(t-1) + b_o),
c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ tanh(U_c x(t) + W_c h(t-1) + b_c),
h(t) = o(t) ⊙ tanh(c(t)),
where x(t), i(t), f(t), o(t), c(t) and h(t) respectively denote the input vector, input-gate state, forget-gate state, output-gate state, memory-cell state and hidden state of the LSTM at step t; W, U and b respectively denote the recurrent-connection parameters, the input-connection parameters and the bias terms; δ(·) is the sigmoid function; and ⊙ denotes the element-wise product.
5. The method according to claim 4, characterized in that the bidirectional LSTM layers directly concatenate the representation vectors of the i-th element of the word sequence obtained by the forward and backward LSTMs into the final representation vector h_i, so that h_i captures both the preceding and the following context of the i-th word.
6. The method according to claim 5, characterized in that the attention layer uses an attention mechanism to determine the importance of each word in the current representation learning process, the attention score of the i-th word being computed according to the following formula:
a_i = exp(W_a u_i) / Σ_{j=1..L} exp(W_a u_j),
where W_a is the parameter matrix of the attention layer; u_i is the representation vector of the i-th word in the input sentence of the attention layer, u_i = [d_i, h_i1, h_i2], with d_i, h_i1 and h_i2 respectively denoting the representations of the i-th word in the word-vector layer, the forward LSTM layer and the backward LSTM layer; and L is the number of words contained in the sentence.
7. The method according to claim 6, characterized in that the softmax layer obtains a corresponding probability vector Y from the sentence representation obtained from the attention layer; each element of the probability vector Y represents the probability that the corresponding sentence contains a particular emoji; after the probability vector of each sentence has been obtained, cross-entropy is used as the loss function and the parameters are updated by gradient descent to minimize the prediction error of the model.
8. The method according to claim 7, characterized in that the i-th element of the probability vector Y is computed according to the following formula:
Y_i = exp(w_i^T v + b_i) / Σ_{k=1..K} exp(w_k^T v + b_k),
where T denotes matrix transposition, w_i is the i-th weight parameter, b_i is the i-th bias term, K is the dimension of the probability vector, and v is the representation of each sentence.
9. The method according to claim 1, characterized in that step 3) is a supervised learning stage, in which a document-level attention layer is used to aggregate the different sentences in each document; denoting the representation of a document by r and the representations of its sentences by v, r is computed by the following formulas:
β_i = exp(W_b v_i) / Σ_j exp(W_b v_j),
r = Σ_i β_i v_i,
where W_b is the weight matrix of the attention layer and β_i is the attention value of the i-th sentence in the document; each source-language sample x ∈ L_S is translated into the target language, and the vector representation of the translated text is obtained; for each labeled English text x_s and its corresponding translated text x_t, the vectors obtained after the above attention layers are denoted r_s and r_t, which are directly concatenated in the supervised learning stage into r_c = [r_s, r_t]; r_c serves as the input of the final softmax layer, and the cross-entropy loss between the network prediction and the true label is minimized to update the corresponding network parameters.
10. An emoji-based cross-lingual sentiment analysis apparatus, characterized by comprising:
an unsupervised learning module, responsible for building word vectors from a large collection of unlabeled text in the source and target languages;
a distant supervision module, responsible for selecting, based on the word vectors, the unlabeled texts that contain emoji and using the texts containing emoji to set up an emoji prediction task, thereby obtaining a characterization model;
a supervised learning module, responsible for translating the source-language corpus labeled with sentiment polarity into the target language, using the sentence characterization model to obtain document representations of the original texts and the translated texts, and training a sentiment classification model on the document representations;
a sentiment classification module, responsible for applying the trained sentiment classification model to new target-language text to perform sentiment classification and obtain its sentiment polarity.
CN201810678889.7A 2018-06-27 2018-06-27 Emoji-based cross-lingual sentiment analysis method and apparatus Active CN109325112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810678889.7A CN109325112B (en) 2018-06-27 2018-06-27 Emoji-based cross-lingual sentiment analysis method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810678889.7A CN109325112B (en) 2018-06-27 2018-06-27 Emoji-based cross-lingual sentiment analysis method and apparatus

Publications (2)

Publication Number Publication Date
CN109325112A CN109325112A (en) 2019-02-12
CN109325112B true CN109325112B (en) 2019-08-20

Family

ID=65263553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810678889.7A Active CN109325112B (en) 2018-06-27 2018-06-27 Emoji-based cross-lingual sentiment analysis method and apparatus

Country Status (1)

Country Link
CN (1) CN109325112B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134962A (en) * 2019-05-17 2019-08-16 中山大学 A kind of across language plain text irony recognition methods based on inward attention power
CN112084295A (en) * 2019-05-27 2020-12-15 微软技术许可有限责任公司 Cross-language task training
CN110309268B (en) * 2019-07-12 2021-06-29 中电科大数据研究院有限公司 Cross-language information retrieval method based on concept graph
US11694042B2 (en) * 2020-06-16 2023-07-04 Baidu Usa Llc Cross-lingual unsupervised classification with multi-view transfer learning
CN112348257A (en) * 2020-11-09 2021-02-09 中国石油大学(华东) Election prediction method driven by multi-source data fusion and time sequence analysis
CN113032559B (en) * 2021-03-15 2023-04-28 新疆大学 Language model fine tuning method for low-resource adhesive language text classification
CN112860901A (en) * 2021-03-31 2021-05-28 中国工商银行股份有限公司 Emotion analysis method and device integrating emotion dictionaries
CN113919340A (en) * 2021-08-27 2022-01-11 北京邮电大学 Self-media language emotion analysis method based on unsupervised unknown word recognition
CN113761204B (en) * 2021-09-06 2023-07-28 南京大学 Emoji text emotion analysis method and system based on deep learning
CN113792143B (en) * 2021-09-13 2023-12-12 中国科学院新疆理化技术研究所 Multi-language emotion classification method, device, equipment and storage medium based on capsule network
CN114429143A (en) * 2022-01-14 2022-05-03 东南大学 Cross-language attribute level emotion classification method based on enhanced distillation
CN116108859A (en) * 2023-03-17 2023-05-12 美云智数科技有限公司 Emotional tendency determination, sample construction and model training methods, devices and equipment
CN116561325B (en) * 2023-07-07 2023-10-13 中国传媒大学 Multi-language fused media text emotion analysis method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170030570A (en) * 2014-07-07 2017-03-17 머신 존, 인크. System and method for identifying and suggesting emoticons
US20160132607A1 (en) * 2014-08-04 2016-05-12 Media Group Of America Holdings, Llc Sorting information by relevance to individuals with passive data collection and real-time injection
CN107729320B (en) * 2017-10-19 2021-04-13 西北大学 Emoticon recommendation method based on time sequence analysis of user session emotion trend

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488623A (en) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 Multilingual text data sorting treatment method
CN105068988A (en) * 2015-07-21 2015-11-18 中国科学院自动化研究所 Multi-dimension multi-granularity emotion analysis method
CN107305539A (en) * 2016-04-18 2017-10-31 南京理工大学 A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN106326214A (en) * 2016-08-29 2017-01-11 中译语通科技(北京)有限公司 Method and device for cross-language emotion analysis based on transfer learning

Also Published As

Publication number Publication date
CN109325112A (en) 2019-02-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant