CN112905736B

CN112905736B - Quantum theory-based unsupervised text emotion analysis method

Info

Publication number: CN112905736B
Application number: CN202110113463.9A
Authority: CN
Inventors: 张亚洲; 马军霞; 崔建涛; 李璞; 朱少林
Original assignee: Zhengzhou University of Light Industry
Current assignee: Zhengzhou University of Light Industry
Priority date: 2021-01-27
Filing date: 2021-01-27
Publication date: 2023-09-19
Anticipated expiration: 2041-01-27
Also published as: CN112905736A

Abstract

The invention relates to an unsupervised text emotion analysis method based on quantum theory, comprising the following steps: creating two emotion dictionaries, namely a positive emotion dictionary PSD and a negative emotion dictionary NSD; preprocessing the positive emotion dictionary PSD, the negative emotion dictionary NSD and the texts in the corpus; constructing a quantum text representation model, and extracting features from the preprocessed positive emotion dictionary PSD, the preprocessed negative emotion dictionary NSD and each text to construct a positive emotion dictionary density matrix $\rho_{PSD}$, a negative emotion dictionary density matrix $\rho_{NSD}$ and a text density matrix $\rho_{text}$; and obtaining the emotion classification result of each text by using a quantum relative entropy algorithm.

Description

Quantum theory-based unsupervised text emotion analysis method

Technical Field

The invention relates to the technical field of text emotion classification, in particular to an unsupervised text emotion analysis method.

Background

The Internet has by now penetrated every aspect of society, politics and the economy, and affects people's daily lives. With the arrival of the Internet age, social platforms have sprung up like bamboo shoots after spring rain, breaking through the closed social modes of the past, providing a broader platform for open interaction between users, and bringing much convenience to daily life. Today, more and more users like to publish their attitudes and comments on social platforms (such as microblogs and WeChat), and tens of thousands of terabytes of content emerge on these platforms every day, making them one of the main sources of information in daily life. This information contains not only reports of objective facts but also a large number of subjective emotional expressions. Mining and identifying the emotional information it contains has important scientific and economic value for fields such as public opinion analysis, marketing and investment prediction. The invention focuses on the emotion expressed in tweets and blog posts, the most common texts on social platforms, that is, on text emotion analysis technology.

One core task of text emotion analysis is text representation. Text representation converts the semantic information contained in text strings into real-valued vectors that a computer can process, and these vectors are required to have strong expressive and discriminative power. Vector-based text representation methods, such as one-hot encoding, term frequency-inverse document frequency and word embedding, are therefore the mainstream, and their performance has been thoroughly verified on major data sets. In recent years, a series of notable results based on quantum probability theory have appeared in the field of information retrieval, showing that quantum probability theory can serve as an extended mathematical framework for tasks such as text representation and document ranking. The most representative of these is the quantum language model proposed by Sordoni et al. for classical information retrieval tasks. As an extension of the classical language model, the quantum language model aims to solve the term dependence problem and achieves good results.

The text representation problem in emotion analysis usually concerns long, document-level texts such as movie reviews and product reviews. Texts of this kind typically have complex semantic relationships, frequent interactions between terms and deep contextual dependence, and therefore require a stronger representation learning model than information retrieval tasks do. The standard quantum language model uses one-hot encoding to construct projection operators; on long texts this easily leads to the curse of dimensionality, and the model may fail to converge when training the high-dimensional density matrix. Compared with vector-based representation methods, however, the density matrix of quantum theory can encode more semantic information and captures second-order correlations between word vectors. Developing a novel text representation model that combines quantum theory with the density matrix is therefore a valuable topic.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the shortcomings of the prior art and provide an unsupervised quantum text emotion analysis method. The method constructs a positive and a negative emotion dictionary, builds a density matrix representation of each emotion dictionary and of each subjective document, uses quantum relative entropy to calculate the similarity scores between each subjective document and the positive and negative emotion dictionaries, and obtains the emotion classification result by comparing the similarity scores. The aim of the invention is achieved by the following technical scheme:

an unsupervised text emotion analysis method based on quantum theory comprises the following steps:

(1): creating two emotion dictionaries, namely a positive emotion dictionary PSD and a negative emotion dictionary NSD, wherein the positive emotion dictionary contains words with positive emotion polarities, and the negative emotion dictionary contains words with negative emotion polarities;

(2): preprocessing texts in the positive emotion dictionary PSD, the negative emotion dictionary NSD and the corpus;

(3): constructing a quantum text representation model, and extracting features from the preprocessed positive emotion dictionary PSD, the preprocessed negative emotion dictionary NSD and each text to construct the positive emotion dictionary density matrix $\rho_{PSD}$, the negative emotion dictionary density matrix $\rho_{NSD}$ and the text density matrix $\rho_{text}$, as follows:

the first step: obtaining the word vector $\vec{w}_i$ of each word in the PSD, the NSD and each text, and then normalizing it:

$$|w_i\rangle = \frac{\vec{w}_i}{\lVert \vec{w}_i \rVert_2}$$

the second step: based on the vector outer product operation, constructing the projection matrix of each word in the positive emotion dictionary PSD, the negative emotion dictionary NSD and each text; the projection matrices of all words in the positive emotion dictionary PSD are combined into a positive emotion projection sequence $\Pi_{PSD} = \{\Pi_1, \Pi_2, \ldots, \Pi_r\}$, the projection matrices of all words in the negative emotion dictionary NSD are combined into a negative emotion projection sequence $\Pi_{NSD} = \{\Pi_1, \Pi_2, \ldots, \Pi_k\}$, and the projection matrices of all words in each text are combined into a text projection sequence $\Pi_{text} = \{\Pi_1, \Pi_2, \ldots, \Pi_t\}$, where r represents the number of words of the positive emotion dictionary PSD, k represents the number of words of the negative emotion dictionary NSD, and t represents the number of words contained in each text;

the third step: after obtaining the projection sequences $\Pi_{PSD}$, $\Pi_{NSD}$ and $\Pi_{text}$ of the positive emotion dictionary, the negative emotion dictionary and the text, formulating a likelihood function $\mathcal{L}(\rho)$ using the maximum likelihood estimation (MLE) method, and training the density matrices to obtain the positive emotion dictionary density matrix $\rho_{PSD}$, the negative emotion dictionary density matrix $\rho_{NSD}$ and the text density matrix $\rho_{text}$ respectively;

(4): using the quantum relative entropy algorithm to calculate the positive similarity score $S_p$ between the text density matrix $\rho_{text}$ and the positive emotion dictionary density matrix $\rho_{PSD}$, and the negative similarity score $S_n$ between the text density matrix $\rho_{text}$ and the negative emotion dictionary density matrix $\rho_{NSD}$;

(5): comparing the positive similarity score with the negative similarity score: if $S_p > S_n$, the text is classified as positive; otherwise it is classified as negative; the emotion classification result of each text is thereby obtained.

Further, in the step (1), the method for creating the positive emotion dictionary PSD and the negative emotion dictionary NSD is as follows:

the first step: selecting M groups of seed word pairs with opposite polarities to respectively form an initial positive emotion dictionary PSD and a negative emotion dictionary NSD;

the second step: selecting a corpus and using a part-of-speech tagger based on a hidden Markov model to extract the adjectives and adverbs in the corpus as candidate emotion words $W_{hx}$. The tagger assigns a part of speech $t_i$ to each word $w_i$ of every sentence in the corpus. It assumes that each part of speech $t_i$ depends only on the part of speech $t_{i-1}$ of the preceding word, i.e. $P(t_i \mid t_{i-1})$, and that the probability of each word $w_i$ depends only on its part of speech $t_i$, i.e. $P(w_i \mid t_i)$; the part-of-speech tags that maximize the joint probability distribution are then selected as the parts of speech of the words, where n is the number of words in the sentence:

$$\hat{t}_1, \ldots, \hat{t}_n = \arg\max_{t_1, \ldots, t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

the third step: using the pointwise mutual information-information retrieval (PMI-IR) algorithm to calculate the semantic association degree between each candidate emotion word $W_{hx}$ and all seed words in the positive emotion dictionary PSD and the negative emotion dictionary NSD, which is used as the emotion score of the candidate emotion word;

the fourth step: for a given candidate emotion word $W_{hx}$, if the emotion score $Score(W_{hx})$ is greater than 0, the word is a positive emotion word; if $Score(W_{hx})$ is less than 0, the word is a negative emotion word; the candidate emotion word $W_{hx}$ is added to the corresponding emotion dictionary according to this emotion attribute.

In the third step, the semantic association degree may be calculated as:

$$Score(W_{hx}) = \sum_{seed \in PSD} PMI(W_{hx}, seed) - \sum_{seed \in NSD} PMI(W_{hx}, seed)$$

where $W_{hx}$ represents a candidate emotion word, seed represents a seed word in each emotion dictionary, $PMI(W_{hx}, seed)$ measures the probability that the candidate emotion word $W_{hx}$ co-occurs with the seed word (the larger the probability, the closer the relationship and the higher the degree of association), and $Score(W_{hx})$ is the emotion score of the candidate emotion word.
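The scoring idea can be sketched as follows. This is a minimal illustration, not the patented implementation: PMI-IR is usually described in terms of search-engine hit counts, whereas this sketch substitutes document-level co-occurrence counts in a small local corpus. The function name `pmi_scores`, the base-2 logarithm and the choice to let unseen pairs contribute 0 are all assumptions.

```python
import math
from collections import Counter


def pmi_scores(docs, candidates, pos_seeds, neg_seeds):
    """Score each candidate: sum of PMI with positive seeds minus sum with negative seeds."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    # Document frequency of every word (number of documents containing it).
    df = Counter(w for s in doc_sets for w in s)

    def pmi(a, b):
        joint = sum(1 for s in doc_sets if a in s and b in s)
        if joint == 0 or df[a] == 0 or df[b] == 0:
            return 0.0  # assumption: pairs that never co-occur contribute nothing
        # log2( P(a, b) / (P(a) * P(b)) ) with probabilities estimated by document counts
        return math.log2(joint * n / (df[a] * df[b]))

    return {
        c: sum(pmi(c, s) for s in pos_seeds) - sum(pmi(c, s) for s in neg_seeds)
        for c in candidates
    }


docs = [["good", "awesome"], ["good", "awesome"], ["bad", "dumb"], ["bad", "dumb"]]
scores = pmi_scores(docs, ["awesome", "dumb"], pos_seeds=["good"], neg_seeds=["bad"])
```

A candidate with a score above 0 would go to the PSD and one below 0 to the NSD, following the fourth step above.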

In step (2), preprocessing the positive emotion dictionary PSD, the negative emotion dictionary NSD and the texts in the corpus includes: correcting spelling errors, removing illegal characters from each dictionary and text, and removing useless words, including stop words and punctuation marks, based on a standard English stop word list.

In step (3), the GloVe tool can be used to obtain the word vectors $\vec{w}_i$ of the words in the PSD, the NSD and each text.
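As a minimal sketch of this lookup, assume the vectors are stored as plain text with one word per line followed by its space-separated components. The helper name `load_glove` and the optional `vocab` filter are hypothetical and not part of the GloVe tool itself.

```python
import io

import numpy as np


def load_glove(handle, vocab=None):
    """Read 'word v1 v2 ...' lines into a dict mapping word -> NumPy vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        if vocab is None or word in vocab:
            vectors[word] = np.array([float(x) for x in parts[1:]])
    return vectors


# Tiny in-memory stand-in for a vector file (2 dimensions instead of 300).
sample = io.StringIO("good 0.1 0.2\nbad -0.1 0.3\nthe 0.0 0.5\n")
vecs = load_glove(sample, vocab={"good", "bad"})
```

Restricting the load to the dictionary and corpus vocabulary keeps memory use small when the full vector file is large.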

In the third step of step (3), the positive emotion projection sequence $\Pi_{PSD}$ is trained as follows:

The likelihood function $\mathcal{L}(\rho_{PSD})$ is defined as:

$$\mathcal{L}(\rho_{PSD}) = \prod_{i=1}^{r} \mathrm{tr}(\Pi_i \rho_{PSD})$$

where $\Pi_i$ is the projection matrix of the i-th word in the positive emotion projection sequence $\Pi_{PSD}$, $\rho_{PSD}$ is the density matrix of the positive emotion dictionary, tr is the matrix trace operation, $\mathrm{tr}(\Pi_i \rho_{PSD})$ represents the probability of occurrence of word $w_i$, and the likelihood function $\mathcal{L}(\rho_{PSD})$ represents the joint probability of the co-occurrence of all words in the positive emotion dictionary.

The objective function $F(\rho_{PSD})$ is defined as:

$$F(\rho_{PSD}) = \log \mathcal{L}(\rho_{PSD}) = \sum_{i=1}^{r} \log \mathrm{tr}(\Pi_i \rho_{PSD})$$

Maximizing $F(\rho_{PSD})$ amounts to solving for the maximum value of the joint probability of all words of the positive emotion dictionary.

A global convergence algorithm is used, which defines an iteration direction $D^k$ and iteratively updates $\rho_{PSD}$ and the objective function $F(\rho_{PSD})$ until the maximum value of the objective function $F(\rho_{PSD})$ is obtained, and then outputs the positive emotion dictionary density matrix $\rho_{PSD}$;

Using the same training method, the negative emotion dictionary density matrix $\rho_{NSD}$ and the text density matrix $\rho_{text}$ are obtained.

In step (4), the quantum relative entropy may be calculated as:

$$S_p = \mathrm{tr}\big(\rho_{text}(\log \rho_{text} - \log \rho_{PSD})\big)$$

$$S_n = \mathrm{tr}\big(\rho_{text}(\log \rho_{text} - \log \rho_{NSD})\big)$$

where $S_p, S_n \ge 0$; $S_p = 0$ if and only if $\rho_{text} = \rho_{PSD}$, and $S_n = 0$ if and only if $\rho_{text} = \rho_{NSD}$.
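The two formulas can be evaluated numerically as in the sketch below, which assumes real symmetric density matrices (as produced from real-valued word vectors). The function names and the eigenvalue clipping constant `eps`, used to keep the logarithm finite for rank-deficient matrices, are assumptions of this sketch.

```python
import numpy as np


def matrix_log(rho, eps=1e-10):
    """Logarithm of a symmetric positive semidefinite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(rho)
    # Clip tiny or zero eigenvalues so that log() stays finite (assumption of this sketch).
    log_vals = np.log(np.clip(vals, eps, None))
    return (vecs * log_vals) @ vecs.T


def quantum_relative_entropy(rho, sigma):
    """tr(rho (log rho - log sigma))."""
    return float(np.trace(rho @ (matrix_log(rho) - matrix_log(sigma))))


rho_text = np.diag([0.75, 0.25])
rho_dict = np.diag([0.5, 0.5])
s = quantum_relative_entropy(rho_text, rho_dict)
```

For commuting (here diagonal) matrices the value reduces to the classical Kullback-Leibler divergence of the eigenvalue distributions, which gives a simple way to sanity-check the routine.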

The beneficial effects of the invention are as follows:

(1) Constructing a high-quality positive emotion dictionary and a high-quality negative emotion dictionary, and expressing two basic emotions of human beings;

(2) Based on quantum probability theory, extracting text features, constructing a density matrix, and encoding term semantics and probability distribution information;

(3) Based on quantum relative entropy, similarity between density matrixes is calculated, emotion classification can be completed unsupervised, and the method has the characteristics of quick response, strong field adaptability, high accuracy and the like.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a polarity distribution diagram of emotion words in an emotion dictionary;

FIG. 3 is a quantum text representation model flow diagram;

FIG. 4 is a histogram comparing the experimental results of different emotion analysis methods.

Detailed Description

The technical solution of the present invention will be described in further detail with reference to the accompanying drawings, but the scope of the present invention is not limited to the following description. FIG. 1 shows the flow of the unsupervised text emotion analysis method based on quantum theory; FIG. 2 shows the emotion polarity distribution of the words in the emotion dictionaries; FIG. 3 shows the flow chart of the quantum text representation model; FIG. 4 shows the experimental comparison of emotion classification results across the different methods. The specific steps are as follows:

(1): based on 7 groups of seed words and the English OMD (The Obama-McCain Debate) corpus, two emotion dictionaries are created manually, namely the positive sentiment dictionary (PSD) and the negative sentiment dictionary (NSD), as follows:

the first step: 7 groups of seed word pairs with opposite polarities are selected manually, namely "positive/negative", "good/bad", "love/hate", "excellent/poor", "amazing/shit", "nice/terrible" and "awesome/crap". Thus the initial positive emotion dictionary is PSD = (positive, good, love, excellent, amazing, nice, awesome), and the initial negative emotion dictionary is NSD = (negative, bad, hate, poor, shit, terrible, crap).

the second step: a hidden Markov model (HMM) part-of-speech tagger extracts a total of 855 adjectives and adverbs from the corpus, and these words are used as candidate emotion words, for example awesome, thankful, dirty, dumb and terrible. The calculation is as follows: the HMM part-of-speech tagger assigns a part of speech $t_i$ (e.g., adjective, verb, adverb) to each word $w_i$ in the text. It is assumed that each part of speech $t_i$ depends only on the part of speech $t_{i-1}$ of the preceding word, i.e. $P(t_i \mid t_{i-1})$, and that the probability of each word $w_i$ depends only on its part of speech $t_i$, i.e. $P(w_i \mid t_i)$. The part-of-speech tags that maximize the joint probability distribution are selected as the parts of speech of the words, where n is the number of words in the sentence:

$$\hat{t}_1, \ldots, \hat{t}_n = \arg\max_{t_1, \ldots, t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

The occurrence frequency of each word is counted from the corpus; once the three parameters of the HMM have been obtained, the part of speech corresponding to each word is calculated, which completes the part-of-speech tagging process.

the third step: the pointwise mutual information-information retrieval (PMI-IR) method is used to calculate the semantic association degree between each candidate emotion word $W_{hx}$ and all seed words in the positive emotion dictionary PSD and the negative emotion dictionary NSD, which is used as the emotion score of the candidate emotion word. The semantic association degree is calculated as follows:

$$Score(W_{hx}) = \sum_{seed \in PSD} PMI(W_{hx}, seed) - \sum_{seed \in NSD} PMI(W_{hx}, seed)$$

where $W_{hx}$ represents a candidate emotion word, seed represents a seed word in each emotion dictionary, and $PMI(W_{hx}, seed)$ measures the probability that the candidate emotion word $W_{hx}$ co-occurs with the seed word; the larger the probability, the closer the relationship and the higher the degree of association. $Score(W_{hx})$ is the emotion score of the candidate emotion word.

the fourth step: if the emotion score $Score(W_{hx})$ is greater than 0, the word is a positive emotion word; if $Score(W_{hx})$ is less than 0, the word is a negative emotion word; the positive and negative emotion words are added to the corresponding emotion dictionaries. In the end, the positive emotion dictionary PSD contains 150 positive emotion words, e.g., best, healthy, amazing and beautiful, while the negative emotion dictionary NSD contains 152 negative emotion words, e.g., fake, bloody, weird, offensively and sad.
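The tagging assumption in the second step above (each tag depends only on the previous tag, each word only on its tag) is the setting in which the Viterbi algorithm finds the most probable tag sequence. The sketch below is an illustration only: the tag set, the probability tables and the `floor` value used for unseen events are made-up assumptions, not parameters from the OMD corpus.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the tag sequence maximizing prod_i P(w_i | t_i) * P(t_i | t_{i-1})."""

    def lg(p):
        # Work in log space; unseen events get a small floor probability.
        return math.log(max(p, floor))

    # best[i][t] = (log-probability of the best path ending in tag t at position i, previous tag)
    best = [{t: (lg(start_p.get(t, 0.0)) + lg(emit_p[t].get(words[0], 0.0)), None)
             for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + lg(trans_p[p].get(t, 0.0)))
            score = best[-1][prev][0] + lg(trans_p[prev].get(t, 0.0)) + lg(emit_p[t].get(w, 0.0))
            column[t] = (score, prev)
        best.append(column)

    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


tags = ["ADJ", "NOUN"]
start_p = {"ADJ": 0.6, "NOUN": 0.4}
trans_p = {"ADJ": {"ADJ": 0.2, "NOUN": 0.8}, "NOUN": {"ADJ": 0.3, "NOUN": 0.7}}
emit_p = {"ADJ": {"awesome": 0.9, "debate": 0.1}, "NOUN": {"awesome": 0.1, "debate": 0.9}}
tagged = viterbi(["awesome", "debate"], tags, start_p, trans_p, emit_p)
```

Words tagged as adjectives or adverbs would then be kept as candidate emotion words. The function expects a non-empty word list.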

(2): the positive emotion dictionary PSD, the negative emotion dictionary NSD and the 1928 documents of the OMD text corpus are preprocessed with the Python natural language toolkit: spelling errors are corrected and illegal characters are removed (such as ". The final OMD corpus contains 1906 texts.
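A minimal sketch of this kind of cleaning step is shown below. It uses only the standard library instead of a natural language toolkit so that it runs without downloading resources. The tiny `STOP_WORDS` set is a stand-in for a standard English stop word list, the URL rule is an added assumption, and spelling correction is not attempted.

```python
import re

# Tiny stand-in stop list (assumption); a real run would load a standard English list.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "it", "this", "that"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip URLs and non-letter characters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s']", " ", text)      # drop digits, punctuation and symbols
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in stop_words]


tokens = preprocess("The debate was AWESOME!!! #tcot http://x.co/1")
```

Texts that end up empty after cleaning could be discarded at this point, which is one way a corpus might shrink during preprocessing.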

(3): training the quantum text representation model, extracting features from the positive and negative emotion dictionaries and from each text, and constructing the positive dictionary density matrix $\rho_{PSD}$, the negative dictionary density matrix $\rho_{NSD}$ and the text density matrix $\rho_{text}$, all of which are L × L matrices, where L is the dimension of each word vector. Each dictionary or text is assumed to be represented as $d = \{w_1, w_2, \ldots, w_t\}$, where t is the number of words in the dictionary or text, as shown in FIG. 3. The steps are as follows:

the first step: the GloVe tool is used to obtain the 300-dimensional word vector $\vec{w}_i$ of each word in the positive emotion dictionary PSD, the negative emotion dictionary NSD and each text, which is then normalized to obtain:

$$|w_i\rangle = \frac{\vec{w}_i}{\lVert \vec{w}_i \rVert_2}$$

the second step: based on the vector outer product operation, the following formula is used to construct the projection matrix $\Pi_i$ of each word $w_i$ in the emotion dictionaries and the text:

$$\Pi_i = |w_i\rangle\langle w_i|$$

Each projection matrix $\Pi_i$ is a 300 × 300 matrix.

The projection matrices of all words in the positive emotion dictionary are then combined into a positive emotion projection sequence $\Pi_{PSD} = \{\Pi_1, \Pi_2, \ldots, \Pi_r\}$, the projection matrices of all words in the negative emotion dictionary are combined into a negative emotion projection sequence $\Pi_{NSD} = \{\Pi_1, \Pi_2, \ldots, \Pi_k\}$, and the projection matrices of all words in each text are combined into a text projection sequence $\Pi_{text} = \{\Pi_1, \Pi_2, \ldots, \Pi_t\}$, where r represents the number of words of the positive emotion dictionary, i.e., 150; k represents the number of words of the negative emotion dictionary, i.e., 152; and t represents the number of words each text contains.
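The normalization and outer-product construction can be sketched as follows. The function names are hypothetical, and a 2-dimensional vector stands in for a 300-dimensional word vector so the result is easy to inspect.

```python
import numpy as np


def projector(vec):
    """Normalize a word vector to unit length and return its outer product with itself."""
    v = np.asarray(vec, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    ket = v / norm
    return np.outer(ket, ket)


def projection_sequence(word_vectors):
    """Build the list of projection matrices for a dictionary or a text."""
    return [projector(v) for v in word_vectors]


P = projector([3.0, 4.0])
seq = projection_sequence([[3.0, 4.0], [1.0, 0.0]])
```

Each matrix built this way is symmetric, has trace 1 and satisfies P @ P = P, which is what makes it a rank-one projector.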

the third step: after the projection sequences $\Pi_{PSD}$, $\Pi_{NSD}$ and $\Pi_{text}$ of the positive dictionary, the negative dictionary and the text have been obtained, the maximum likelihood estimation (MLE) method is used to formulate a likelihood function $\mathcal{L}(\rho)$ (the likelihood function is the probability of obtaining the document), and training of the density matrix begins. The likelihood function $\mathcal{L}(\rho)$ is defined as:

$$\mathcal{L}(\rho) = \prod_{i=1}^{\{r,k,t\}} \mathrm{tr}(\Pi_i \rho)$$

where $\Pi_i$ is the projection matrix of the i-th word in each projection sequence $\{\Pi_{PSD}, \Pi_{NSD}, \Pi_{text}\}$, $\{r, k, t\}$ represents the number of words contained in each projection sequence $\{\Pi_{PSD}, \Pi_{NSD}, \Pi_{text}\}$, $\rho$ is the density matrix, $\rho \in \{\rho_{PSD}, \rho_{NSD}, \rho_{text}\}$, and tr is the matrix trace operation. $\mathrm{tr}(\Pi_i \rho)$ represents the probability of occurrence of word $w_i$, and the likelihood function $\mathcal{L}(\rho)$ represents the joint probability of all words in the positive emotion dictionary, the negative emotion dictionary and the text, respectively.

Since the logarithm is monotonic, taking the logarithm of the likelihood function $\mathcal{L}(\rho)$ does not change where its maximum is attained, so the objective function $F(\rho)$ can be defined as:

$$F(\rho) = \log \mathcal{L}(\rho) = \sum_{i=1}^{\{r,k,t\}} \log \mathrm{tr}(\Pi_i \rho)$$

where $\mathrm{tr}(\rho) = 1$, $\rho \ge 0$ and $F(\rho) \in \{F(\rho_{PSD}), F(\rho_{NSD}), F(\rho_{text})\}$; maximizing $F(\rho)$ amounts to solving for the maximum value of the joint probability that all words in the positive emotion dictionary, the negative emotion dictionary and the text co-occur.
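As a short worked step, using only the definitions above and writing $|w_i\rangle$ for the normalized word vector and $\Pi_i = |w_i\rangle\langle w_i|$ for its projection matrix, the trace in the likelihood reduces to a quadratic form. The constraints on $\rho$ then bound it between 0 and 1, which is why each factor can be read as a probability:

```latex
\begin{aligned}
\mathrm{tr}(\Pi_i \rho) &= \mathrm{tr}\big(|w_i\rangle\langle w_i|\,\rho\big) = \langle w_i|\rho|w_i\rangle, \\
0 \le \langle w_i|\rho|w_i\rangle &\le 1
\quad \text{since } \rho \ge 0,\ \mathrm{tr}(\rho) = 1,\ \langle w_i|w_i\rangle = 1 .
\end{aligned}
```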

the fourth step: a global convergence algorithm is applied. By defining an iteration direction $D^k$, the algorithm iteratively updates $\rho$ and the objective function $F(\rho)$ until the maximum value of the objective function $F(\rho)$ is obtained, and then outputs the positive dictionary density matrix $\rho_{PSD}$, the negative dictionary density matrix $\rho_{NSD}$ and the text density matrix $\rho_{text}$, respectively. The update rule for the k-th iteration of the density matrix $\rho$ is $\rho^{k+1} = \rho^k + t_k D^k$, where $t_k \in [0, 1]$ is called the step size and represents the magnitude of the update of the objective function $F(\rho)$ in the k-th iteration; the iteration direction $D^k$ is defined as:

where the two component directions represent two basic directions, vertical and horizontal, and the iteration direction $D^k$ is controlled between vertical and horizontal by both of them simultaneously. $q(t_k)$ represents the overall iteration direction, and $\nabla F(\rho^k)$ represents the gradient direction of the objective function at the k-th iteration.

They are defined as:

The definitions also involve the frequency of each word. To demonstrate the robustness of the global convergence algorithm, a diagonal matrix $\rho^0$ is randomly initialized at the beginning of the iteration; it satisfies all properties of a density matrix, e.g. $\rho^0 \ge 0$. The iteration terminates when the change in the value of the objective function between successive iterations is within 0.0001, giving the final density matrix $\rho \in \{\rho_{PSD}, \rho_{NSD}, \rho_{text}\}$.
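A training loop of this kind can be sketched as below. The sketch is not the exact iteration direction of the method: it assumes a diluted fixed-point update of the form ρ ← (I + tR) ρ (I + tR) / tr(·), with R the gradient of the averaged log-likelihood, a fixed step size instead of a per-iteration $t_k$, and a uniform diagonal starting matrix instead of a random one. The name `train_density_matrix` and all parameter defaults are hypothetical; only the 0.0001 stopping tolerance comes from the text above.

```python
import numpy as np


def train_density_matrix(projectors, step=1.0, tol=1e-4, max_iter=500):
    """Maximum likelihood estimate of a density matrix from a list of projection matrices."""
    dim = projectors[0].shape[0]
    n = len(projectors)
    rho = np.eye(dim) / dim  # valid density matrix: symmetric, PSD, trace 1

    def objective(r):
        # Averaged log-likelihood: (1/n) * sum_i log tr(Pi_i r)
        return sum(np.log(np.trace(p @ r)) for p in projectors) / n

    f_old = objective(rho)
    for _ in range(max_iter):
        # Gradient of the averaged objective: (1/n) * sum_i Pi_i / tr(Pi_i rho)
        grad = sum(p / np.trace(p @ rho) for p in projectors) / n
        m = np.eye(dim) + step * grad
        rho = m @ rho @ m
        rho = rho / np.trace(rho)  # renormalize so that tr(rho) = 1
        f_new = objective(rho)
        if abs(f_new - f_old) < tol:  # stop when the objective barely changes
            break
        f_old = f_new
    return rho


# Toy example: two orthogonal "words", the first occurring three times as often.
P1 = np.diag([1.0, 0.0])
P2 = np.diag([0.0, 1.0])
rho_hat = train_density_matrix([P1, P1, P1, P2])
```

In this toy case the estimate approaches diag(0.75, 0.25), i.e. the relative word frequencies, which is the expected maximum likelihood solution when the projectors are orthogonal.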

(4): the quantum relative entropy algorithm is used to calculate the positive similarity score $S_p$ between the text density matrix $\rho_{text}$ and the positive dictionary density matrix $\rho_{PSD}$, and the negative similarity score $S_n$ between the text density matrix $\rho_{text}$ and the negative dictionary density matrix $\rho_{NSD}$. The quantum relative entropy is defined as:

$$S_p = \mathrm{tr}\big(\rho_{text}(\log \rho_{text} - \log \rho_{PSD})\big)$$

$$S_n = \mathrm{tr}\big(\rho_{text}(\log \rho_{text} - \log \rho_{NSD})\big)$$

(5): the positive similarity score $S_p$ is compared with the negative similarity score $S_n$: if $S_p > S_n$, the text is classified as positive (emotion label +1); otherwise it is classified as negative (emotion label -1); the emotion classification result of each text is thereby obtained.
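The decision rule can be sketched as a small function. It implements the comparison exactly as stated in step (5). Because a relative entropy of zero means the two matrices coincide, an implementation that uses the raw relative entropy values would be treating the larger value as the closer match; the hypothetical flag `higher_is_closer` is therefore included so the opposite reading (smaller value means closer) can be tried. That flag and the function name are assumptions of this sketch.

```python
def classify(s_p, s_n, higher_is_closer=True):
    """Return +1 (positive) or -1 (negative) from the two similarity scores."""
    if not higher_is_closer:
        # Treat the scores as divergences: the smaller one marks the closer dictionary.
        s_p, s_n = -s_p, -s_n
    return 1 if s_p > s_n else -1


label_as_stated = classify(0.8, 0.3)
label_flipped = classify(0.8, 0.3, higher_is_closer=False)
```

Ties fall to the negative class, following the "otherwise" branch of the rule.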

After the emotion classification result of each subjective text is obtained, it is compared with the emotion labels and the classification accuracy is calculated. The method is compared with the bag-of-words model, the sentence embedding model, the pointwise mutual information-information retrieval algorithm and the quantum language model, and the accuracies are compared in a histogram. As shown in FIG. 4, the method of the invention can clearly improve the effectiveness of the text emotion analysis model.

The technical means disclosed by the scheme of the invention is not limited to the technical means disclosed by the embodiment, and also comprises the technical scheme formed by any combination of the technical features. It should be noted that modifications and adaptations to the invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims

1. An unsupervised text emotion analysis method based on quantum theory, comprising the following steps:

the second step: based on the vector outer product operation, constructing the projection matrix of each word in the positive emotion dictionary PSD, the negative emotion dictionary NSD and each text; the projection matrices of all words in the positive emotion dictionary PSD are combined into a positive emotion projection sequence $\Pi_{PSD} = \{\Pi_1, \Pi_2, \ldots, \Pi_r\}$, the projection matrices of all words in the negative emotion dictionary NSD are combined into a negative emotion projection sequence $\Pi_{NSD} = \{\Pi_1, \Pi_2, \ldots, \Pi_k\}$, and the projection matrices of all words in each text are combined into a text projection sequence $\Pi_{text} = \{\Pi_1, \Pi_2, \ldots, \Pi_t\}$, where r represents the number of words of the positive emotion dictionary PSD, k represents the number of words of the negative emotion dictionary NSD, and t represents the number of words contained in each text;

the third step: after obtaining the projection sequences $\Pi_{PSD}$, $\Pi_{NSD}$ and $\Pi_{text}$ of the positive emotion dictionary, the negative emotion dictionary and the text, formulating a likelihood function $\mathcal{L}(\rho)$ using the maximum likelihood estimation (MLE) method, and training the density matrices to obtain the positive emotion dictionary density matrix $\rho_{PSD}$, the negative emotion dictionary density matrix $\rho_{NSD}$ and the text density matrix $\rho_{text}$ respectively;

2. The method of unsupervised text emotion analysis of claim 1, wherein in step (1), the method of creating positive emotion dictionary PSD and negative emotion dictionary NSD is as follows:

the third step: using the pointwise mutual information-information retrieval (PMI-IR) algorithm to calculate the semantic association degree between each candidate emotion word $W_{hx}$ and all seed words in the positive emotion dictionary PSD and the negative emotion dictionary NSD, which is used as the emotion score of the candidate emotion word;

3. The method for unsupervised text emotion analysis according to claim 2, wherein in the third step, the semantic association degree is calculated as follows:

$$Score(W_{hx}) = \sum_{seed \in PSD} PMI(W_{hx}, seed) - \sum_{seed \in NSD} PMI(W_{hx}, seed)$$

wherein $W_{hx}$ represents a candidate emotion word, seed represents a seed word in each emotion dictionary, $PMI(W_{hx}, seed)$ measures the probability that the candidate emotion word $W_{hx}$ co-occurs with the seed word (the larger the probability, the closer the relationship and the higher the degree of association), and $Score(W_{hx})$ is the emotion score of the candidate emotion word.

4. The method of unsupervised text emotion analysis of claim 1, wherein preprocessing the text in the positive emotion dictionary PSD, the negative emotion dictionary NSD and the corpus in step (2) comprises: correcting spelling errors, removing illegal characters of each dictionary and text, and removing useless words including stop words and punctuation marks based on an English standard stop word list.

5. The method of claim 1, wherein in step (3), the GloVe tool is used to obtain the word vectors $\vec{w}_i$ of the words in the PSD, the NSD and each text, respectively.

6. The method of unsupervised text emotion analysis of claim 1, wherein in the third step of step (3), the positive emotion projection sequence $\Pi_{PSD}$ is trained as follows:

the likelihood function $\mathcal{L}(\rho_{PSD})$ is defined as:

$$\mathcal{L}(\rho_{PSD}) = \prod_{i=1}^{r} \mathrm{tr}(\Pi_i \rho_{PSD})$$

wherein $\Pi_i$ is the projection matrix of the i-th word in the positive emotion projection sequence $\Pi_{PSD}$, $\rho_{PSD}$ is the density matrix of the positive emotion dictionary, tr is the matrix trace operation, $\mathrm{tr}(\Pi_i \rho_{PSD})$ represents the probability of occurrence of word $w_i$, and the likelihood function $\mathcal{L}(\rho_{PSD})$ represents the joint probability of the co-occurrence of all words in the positive emotion dictionary;

the objective function $F(\rho_{PSD})$ is defined as:

$$F(\rho_{PSD}) = \log \mathcal{L}(\rho_{PSD}) = \sum_{i=1}^{r} \log \mathrm{tr}(\Pi_i \rho_{PSD})$$

maximizing $F(\rho_{PSD})$ amounts to solving for the maximum value of the joint probability of all words appearing in the positive emotion dictionary;

a global convergence algorithm is used, which defines an iteration direction $D^k$ and iteratively updates $\rho_{PSD}$ and the objective function $F(\rho_{PSD})$ until the maximum value of the objective function $F(\rho_{PSD})$ is obtained, and then outputs the positive emotion dictionary density matrix $\rho_{PSD}$;

7. The method of unsupervised text emotion analysis according to claim 1, wherein in the third step of step (3), the quantum relative entropy calculation process is as follows:

$$S_p = \mathrm{tr}\big(\rho_{text}(\log \rho_{text} - \log \rho_{PSD})\big)$$

$$S_n = \mathrm{tr}\big(\rho_{text}(\log \rho_{text} - \log \rho_{NSD})\big)$$