CN117291190A - User demand calculation method based on emotion dictionary and LDA topic model

Info

Publication number: CN117291190A
Application number: CN202310809452.3A
Authority: CN (China)
Prior art keywords: topic, emotion, word, words, distribution
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李波, 刘婷, 李辉, 曾洪, 王海洋
Current Assignee: University of Electronic Science and Technology of China
Original Assignee: University of Electronic Science and Technology of China (application filed by University of Electronic Science and Technology of China)
Priority date: 2023-07-03
Filing date: 2023-07-03
Publication date: 2023-12-26

Classifications

    • G06F40/30 Handling natural language data; Semantic analysis
    • G06F16/35 Information retrieval of unstructured textual data; Clustering; Classification
    • G06F40/216 Natural language analysis; Parsing using statistical methods
    • G06F40/242 Lexical tools; Dictionaries
    • G06F40/289 Recognition of textual entities; Phrasal analysis, e.g. finite state techniques or chunking
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a user demand calculation method based on an emotion dictionary and an LDA topic model. Based on product comments from the web, the method performs user demand analysis through emotion analysis and attribute extraction, helping enterprises determine the direction in which to improve product design. To address two problems of emotion analysis in user demand analysis, namely that text vectorization ignores emotion feature information and that different vectorization models produce different vector semantic representations, an emotion polarity analysis method for user comments is proposed that fuses text vectors with emotion feature vectors represented by different models. For the problem of product attribute extraction in user demand analysis, the part-of-speech characteristics of product attributes in comments are analyzed, the product attributes under negative emotion polarity are extracted with a part-of-speech-based LDA model, and user demand is determined from the higher-frequency topics or attribute words under negative emotion polarity.

Description

User demand calculation method based on emotion dictionary and LDA topic model
Technical Field
The invention belongs to the field of natural language processing.
Background
User demand analysis is the primary task by which an enterprise determines the design direction of a product, and quickly identifying user demands is important for an enterprise to strengthen the competitive advantage of its products. Traditional ways of obtaining user demands, mainly questionnaires and interviews, are time-consuming, labor-intensive and inefficient. With the development of the internet, more and more users tend to publish their views on products on public websites, so online comments have become an important source of information for enterprises mining user demands.
In recent years, the development of new energy vehicles in China has become a powerful driving force leading the development of the Chinese automobile market, and more and more users are choosing new energy vehicles. Enterprises in this sector can gain an advantage in market competition by designing products around user requirements. The new energy vehicle field is therefore taken as the example for user demand analysis.
The main tasks of demand analysis based on user comment mining are emotion analysis and attribute extraction. Emotion analysis, also called emotion mining or opinion mining, is the process of processing, analyzing, summarizing and reasoning over a text to obtain its emotional coloring. Emotion analysis methods fall into two categories: rule-based and statistical learning. Rule-based methods generally consist of a manually defined rule base and an emotion dictionary and are strongly dependent on manual effort. Statistical learning methods generally vectorize comments with a word vector model; their accuracy is greatly affected by the text vectorization scheme, and vectorizing whole sentences can ignore the importance of particular emotion feature words in the comments.
Attribute extraction mines the attribute feature information of a product from user comments. It can be done, on the one hand, with grammar- and syntax-based methods, which depend on a constructed rule base, and on the other hand with machine learning methods such as topic modeling, which take all words in the comment sentences as the attribute extraction data set and do not consider the part-of-speech characteristics of the product attributes in user comments.
For the above reasons, the following is proposed herein: identify emotion feature words with an emotion dictionary, and fuse text vectors with emotion feature vectors generated by different models to analyze comment emotion polarity; then, combining the part-of-speech characteristics of the product attributes in comments, mine product attribute topics from negative-emotion-polarity evaluations with a part-of-speech-based LDA model to determine user demand.
The invention provides a user demand analysis method based on an emotion dictionary and an LDA topic model, which realizes user demand analysis of product comments.
Disclosure of Invention
Aiming at the defects of the prior art, the invention carries out emotion analysis and attribute extraction of user comments so as to realize user demand analysis.
In order to solve the above problems, the present invention provides a user demand calculation method based on an emotion dictionary and an LDA topic model, comprising the following steps:
step 1: preprocessing user comments;
performing word segmentation processing of comments by using a Chinese word segmentation tool jieba, and constructing a proper noun word library and a stop word list to remove stop words in the comments to obtain text features;
step 2: screening emotion feature words;
taking the HowNet emotion dictionary as the basis, selecting the positive emotion words, negative emotion words and negation words in the comment corpus to screen the emotion words; extracting all item sets whose occurrence count in the screened data set meets the minimum support to obtain the emotion features; the minimum support being defined as:

$$\mathrm{support}(x) = \frac{\sigma_x}{|D|}$$

where $|D|$ is the total number of item sets in the data set and $\sigma_x$ is the number of item sets in the data set that contain $x$;
step 3: fusing the text vector and the emotion feature vector;
step 3.1: vectorizing the text features with Word2vec to obtain the text vector $S_w = (w_1, w_2, \ldots, w_n)$, and vectorizing the emotion features with FastText to obtain the emotion vector $S_e = (e_1, e_2, \ldots, e_m)$;
Step 3.2: vector fusion;
splicing the text vector and the emotion vector by vector concatenation to obtain the spliced vector $S$:

$$S = S_w \oplus S_e = (w_1, w_2, \ldots, w_n, e_1, e_2, \ldots, e_m),$$

where $\oplus$ denotes vector concatenation;
step 4: classifying the comment emotion polarity of the vectors fused in step 3 by adopting a Logistic regression classification model;
step 5: mining LDA comment topics based on parts of speech, wherein LDA denotes Latent Dirichlet Allocation;
step 5.1: part-of-speech analysis;
performing word segmentation and part-of-speech tagging on part of the comments with the jieba word segmentation tool, removing stop words, analyzing the part-of-speech characteristics of the attribute words users employ in comments, and screening the nouns and verbs to form a new word set for cluster analysis;
step 5.2: determining the number of topics n;
determining the number of product attribute topics by topic coherence: based on a sliding window, computing a confirmation degree for the paired words in each topic with normalized pointwise mutual information (NPMI), i.e. quantifying the degree of support between words with probabilities estimated from the corpus; NPMI is computed as:

$$\mathrm{NPMI}(W', W^*) = \frac{\log \dfrac{P(W', W^*) + \epsilon}{P(W')\,P(W^*)}}{-\log\left(P(W', W^*) + \epsilon\right)}$$

where $W'$ and $W^*$ are words from the set of the top n most important words of each topic, $P(W', W^*)$ is the probability that $W'$ and $W^*$ co-occur, and $\epsilon$ is a fixed constant; the score is computed for each candidate value of n, and the n with the largest score is taken as the number of topics;
the stronger the association between two words, the higher the NPMI value, i.e. the higher the confirmation degree; finally, the confirmation degree of each topic's word set is aggregated by the arithmetic mean of the pairwise NPMI values to obtain the final topic coherence score; the higher the topic coherence score, the more interpretable the topics and the more suitable the chosen number of topics;
step 5.3: performing topic extraction on the words filtered in the step 5.1 by adopting an LDA topic model to obtain the product requirement of a user;
the LDA topic model is decomposed into two processes, document-topic and topic-word:
(1) document-topic process: the topic distribution $\vartheta_m$ of each document is sampled from a Dirichlet prior with hyperparameter $\alpha$, and a latent topic is then generated by a multinomial (Multinomial) distribution, so the document-topic process is generated by a Dirichlet-Multinomial structure:

$$\vartheta_m \sim \mathrm{Dirichlet}(\alpha), \qquad z_{m,n} \sim \mathrm{Multinomial}(\vartheta_m)$$

where $\alpha$ is the hyperparameter of the prior distribution of the latent topic variables of each document in the document collection, $z_{m,n}$ is the topic of the $n$-th word in the $m$-th document, $\vartheta_m$ is the topic probability distribution of the $m$-th document, and $\mathrm{Multinomial}(\vartheta_m)$ is the multinomial distribution parameterized by $\vartheta_m$;
(2) topic-word process: the model generates the random word probabilities under each topic from a Dirichlet prior with hyperparameter $\beta$, and then, according to the probability distribution of the latent topic obtained in the first process, randomly selects a feature item $w_{m,n}$ through a multinomial distribution:

$$\varphi_k \sim \mathrm{Dirichlet}(\beta), \qquad w_{m,n} \sim \mathrm{Multinomial}(\varphi_{z_{m,n}})$$

where $W$ is the size of the entire corpus dictionary, $\varphi_k$ is the word distribution of the $k$-th topic, $\beta$ is the hyperparameter of the prior distribution of each topic's word distribution, and $K$ is the number of topics;
estimating the parameters of the LDA model by Gibbs sampling, sampling a topic for each feature item with the formula:

$$p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_{m,\neg i}^{(k)} + \alpha_k\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_{k,\neg i}^{(t)} + \beta_t\right)}$$

where $i$ indexes the $n$-th word of the $m$-th document, $\neg i$ indicates that the topic assignment of the word with subscript $i$ is removed, the first factor is the document-topic distribution and the second the topic-word distribution, $\alpha_k$ and $\beta_t$ are the hyperparameters of the LDA model, $V$ is the number of words in the vocabulary, $K$ is the number of topics, $z_i$ is the topic of the $n$-th word of the $m$-th document, $n_{m,\neg i}^{(k)}$ is the total number of words assigned to topic $k$ in the $m$-th document with word $i$ removed, and $n_{k,\neg i}^{(t)}$ is the number of occurrences of word $t$ in topic $k$ with word $i$ removed;
the topics obtained by sampling yield the product requirements of the user.
Further, the detailed method of the step 4 is as follows:
the Logistic distribution is a continuous probability distribution with distribution function:

$$F(x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

where $\mu$ is a location parameter and $\gamma > 0$ is a shape parameter;
Logistic regression is a logistic regression analysis whose decision boundary is $\omega^T x + b = 0$; given a sample $x$, the probability estimate of the Logistic model is:

$$p(x) = P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\omega^T x + b)}}$$

The Logistic regression model first fits the decision boundary and then establishes the probabilistic link between this boundary and the classification result. The core problem of the Logistic model is solving for the parameter $\omega$, usually by maximum likelihood estimation, i.e. finding a set of parameters that maximizes the likelihood of the data; for the emotion classification problem:

$$P(Y = 1 \mid x) = p(x)$$
$$P(Y = 0 \mid x) = 1 - p(x)$$

The likelihood function is:

$$L(\omega) = \prod_{i=1}^{N} \left[p(x_i)\right]^{y_i} \left[1 - p(x_i)\right]^{1 - y_i}$$

Taking the logarithm of the likelihood function gives:

$$\ln L(\omega) = \sum_{i=1}^{N} \left[y_i \ln p(x_i) + (1 - y_i)\ln\left(1 - p(x_i)\right)\right]$$

Taking the average log-likelihood over the whole data set gives the Logistic loss function:

$$J(\omega) = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \ln p(x_i) + (1 - y_i)\ln\left(1 - p(x_i)\right)\right]$$

The parameters are solved by maximizing the likelihood function, i.e. minimizing the loss function; the solution is obtained with gradient descent or Newton's method, and regularization is used to prevent overfitting of the model.
Further, the specific flow of Gibbs sampling in the step 5 is as follows:
(1) Randomly assigning a topic number z to each verb and noun in the comment sentence;
(2) Traversing verbs and nouns in the document, and updating the topic numbers according to the formula;
(3) Repeating step (2) until the Gibbs sampling model converges;
(4) Counting the topic of each noun and verb in the document, and obtaining the verb-and-noun-based document-topic distribution according to the following formula:

$$\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_m^{(k)} + \alpha_k\right)}$$
the beneficial effects of adopting above-mentioned scheme to produce lie in: the user demand mining analysis method based on the emotion dictionary and the LDA topic model provided by the invention improves the user demand analysis effect in two aspects: 1) In the aspect of emotion analysis, aiming at the problems that emotion characteristic information is ignored in text vectorization of emotion analysis in user demand analysis and difference exists in vector semantic representations of different vectorization models, an emotion polarity analysis method for carrying out user comments by fusing text vectors with emotion characteristic vectors represented by different models is provided, and the classification effect of positive and negative emotion polarity comments is effectively improved; 2) In the aspect of product attribute extraction, the part of the useless information can be filtered by analyzing the part of speech characteristics of the product attributes in the comments and extracting the product attributes under the negative emotion polarity by using an LDA model based on the part of speech, and the user requirements are determined by the theme or attribute words with higher frequency.
Drawings
FIG. 1 is a flowchart of a user demand analysis method based on an emotion dictionary and an LDA topic model;
FIG. 2 is a schematic diagram of a Word2vec model. FIG. 2 (a) is a block diagram of a CBOW model; FIG. 2 (b) is a block diagram of a Skip-gram model;
FIG. 3 is a diagram of FastText word vector storage format;
fig. 4 is a structural diagram of an LDA topic model.
Detailed Description
The following describes the specific embodiments of the present invention in further detail, taking the new energy automobile field as an example with reference to the accompanying drawings. The following is illustrative of the invention and is not intended to limit the scope of the invention.
The user demand analysis method based on the emotion dictionary and the LDA topic model is realized, as shown in fig. 1, and comprises the following steps:
step 1: user comment preprocessing. The word segmentation processing of the comments is carried out by using a Chinese word segmentation tool jieba, and a proper noun word library is constructed for identifying proper nouns such as brand names, component names and the like in the comments, as shown in the following table 1. And improving the comment word segmentation accuracy, and constructing a stop word list to remove stop words in the comments.
TABLE 1 Example data from the proper noun word library
In Chinese word segmentation, every word in a sentence is separated, and the resulting tokens generally include a large number of punctuation marks, interjections and words without practical meaning, which increase the workload of subsequent word vector construction and classification and add interference to the results; the stop words must therefore be removed. On the basis of the Harbin Institute of Technology (HIT) stop word list, words that could affect emotion recognition, such as the negation word "none" and the adjective "general", are deleted, and automobile brand names are added, forming the stop word list used to filter out part of the tokens.
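To make step 1 concrete, the following is a minimal preprocessing sketch with jieba; the dictionary and stop word file paths are illustrative placeholders, not files disclosed by the invention.

```python
import jieba

# Proper noun lexicon (brand names, component names) improves segmentation;
# the path is illustrative.
jieba.load_userdict("proper_nouns.txt")

# Stop word list: HIT list minus emotion-bearing words, plus brand names.
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def preprocess(comment: str) -> list[str]:
    """Segment a comment and drop stop words and whitespace tokens."""
    return [w for w in jieba.lcut(comment) if w not in stopwords and w.strip()]

tokens = preprocess("续航里程比官方宣传的短很多")  # e.g. ['续航', '里程', ...]
```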
Step 2: Emotion feature word screening. The emotion words in the comment corpus are filtered based on the emotion dictionary. In addition, the attributes paired with the emotion words also affect the final result; the attribute words users employ in comments are usually limited in number and relatively frequent, so the frequent word sets in the comment corpus are obtained with the Apriori association mining algorithm, and the emotion words and frequent word sets together form the emotion features.
The emotion words are screened mainly by selecting the positive emotion vocabulary, the negative emotion vocabulary and the negation vocabulary of the HowNet emotion dictionary.
The Apriori algorithm is a data mining algorithm for mining frequent item sets and association rules. It has two main steps: generating frequent item sets and generating rules from the frequent item sets. Because only the common attribute words in the comments are needed here, only the first step of the algorithm is required: extracting all item sets whose occurrence count in the data set meets the minimum support.
The minimum support is defined as:

$$\mathrm{support}(x) = \frac{\sigma_x}{|D|}$$

where $|D|$ is the total number of item sets in the data set and $\sigma_x$ is the number of item sets in the data set that contain $x$.
The minimum support degree is set to be 0.01, and frequent word sets of comments are extracted.
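A minimal sketch of this frequent-word-set extraction follows. For brevity it counts candidate item sets by brute force rather than with Apriori's level-wise pruning; the support definition matches the formula above, counting each comment at most once per item set.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(token_lists, min_support=0.01, max_size=2):
    """First step of Apriori: item sets with support(x) = sigma_x / |D|
    at or above min_support, over comments given as token lists."""
    D = len(token_lists)
    counts = Counter()
    for tokens in token_lists:
        unique = sorted(set(tokens))         # one count per comment
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    return {s: c / D for s, c in counts.items() if c / D >= min_support}
```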
The emotion feature data formed by the emotion words and frequent word sets are shown in Table 2 below.
TABLE 2 emotion feature example data
Step 3: Fusion of the text vector and the emotion feature vector. Word2vec and FastText models are adopted respectively: the comments preprocessed in step 1 are vectorized as text, and the emotion feature words screened in step 2 are vectorized with a model different from the one used for text vectorization, so as to obtain the semantic representations of different vector models on the comment corpus and to emphasize the importance of the emotion feature words; the text vector and the emotion features are then fused by concatenation.
Step 3.1: the text and emotion feature vectorization processing is carried out by using Word2vec and FastText vector models.
Word2vec is a distributed word vector model proposed by Mikolov et al. Word2vec has two implementations: Skip-gram and CBOW (Continuous Bag of Words). The two model structures are shown in FIG. 2.
The principle of CBOW is to predict the target word given its context. If a sentence containing C words is input, the input X is a C×V matrix of one-hot word vectors; passing it through the hidden layer $W \in \mathbb{R}^{V \times N}$ and summing gives a 1×N vector, where V is the number of words in the corpus vocabulary and N is the dimension of the word vector, equal to the number of neurons in the hidden layer. The 1×N vector is then multiplied by the weight matrix $W' \in \mathbb{R}^{N \times V}$ to give a 1×V vector at the output layer, and the Softmax layer yields a probability for each word in the vocabulary, as shown in FIG. 2(a).
The principle of Skip-gram is to predict the context given the target word. The given 1×V target word vector is multiplied by $W \in \mathbb{R}^{V \times N}$ to give a 1×N vector, which is then multiplied by the weight matrix $W' \in \mathbb{R}^{N \times V}$ to give a 1×V vector; after the Softmax layer, the probability distribution of the context words is obtained, as shown in FIG. 2(b).
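As an illustration, gensim's Word2Vec implementation exposes both architectures through the sg flag; the toy corpus below is a placeholder for the preprocessed comments.

```python
from gensim.models import Word2Vec

sentences = [["续航", "里程", "短"], ["内饰", "做工", "不错"]]  # toy corpus

# sg=0 selects CBOW (predict target from context);
# sg=1 selects Skip-gram (predict context from target).
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vec = cbow.wv["续航"]  # the 1 x N word vector (N = vector_size)
```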
FastText is a word vector training and text classification tool proposed by Facebook. The FastText word vector model differs from Word2vec in that FastText uses bag-of-words and n-gram features to characterize sentences, which yields better word vectors for low-frequency words.
Since the bag-of-words model only considers which words occur in a sentence and ignores word order, averaging the word vectors of a text often loses much information and can even change the meaning of the whole sentence. FastText therefore also considers the n-gram features of words. For example, for the sentences "A is more expensive than B" and "B is more expensive than A", the bag-of-words model yields the same features ("A", "is more expensive than", "B") for both, although the two sentences express opposite attitudes; when n-gram features with n = 3 are added, the first sentence has the feature "A is-more-expensive-than B" and the second has "B is-more-expensive-than A", so the two sentences can be distinguished by their different features.
Because it considers sub-word information, FastText stores the word vectors carrying sub-word information in hash buckets: a hash algorithm determines the storage position of the n-gram vectors of the corpus words, and n-grams with equal hash results share one word vector. As shown in FIG. 3, FastText holds not only the word vector of each word but also the n-gram vectors of each word.
Step 3.2: vector fusion. Fusing the text vector and the emotion feature vector of the comment sentence in a vector splicing mode, wherein the S is shown in the following formula w For comment text vector, S e And training the model by using emotion feature vectors generated for emotion features in comment sentences generated by different word vector models as input features of a classifier so as to enhance the emotion features of the text and improve the classification accuracy.
S=S w +S e =(w 1 ,w 2 ...,w n ,e 1 ,e 2 ...,e m )
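A sketch of step 3 under one simplifying assumption: each of $S_w$ and $S_e$ is taken as the average of its word vectors (the patent text does not fix how the sentence-level vectors are formed from individual word vectors), after which the two vectors are concatenated exactly as in the formula above.

```python
import numpy as np
from gensim.models import FastText, Word2Vec

def sentence_vector(model, words):
    """Average the word vectors of a token list (zeros if none are known)."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def fuse_vectors(tokens, emotion_words, w2v: Word2Vec, ft: FastText):
    """Concatenate the Word2vec text vector S_w with the FastText emotion
    vector S_e, giving S = (w_1, ..., w_n, e_1, ..., e_m)."""
    s_w = sentence_vector(w2v, tokens)                               # S_w
    s_e = sentence_vector(ft, [w for w in tokens if w in emotion_words])  # S_e
    return np.concatenate([s_w, s_e])                                # S
```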
Step 4: Emotion polarity classification. A logistic regression classification model is adopted to classify the comment emotion polarity.
Logistic regression, also called logistic regression analysis, belongs to supervised learning; it is a generalized linear regression analysis model mainly used for classification problems. On the premise that the data obey the Logistic distribution, Logistic regression uses maximum likelihood estimation to estimate the parameters of the model's decision boundary.
The Logistic distribution is a continuous probability distribution with distribution function:

$$F(x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

where $\mu$ is a location parameter and $\gamma > 0$ is a shape parameter.
The decision boundary of Logistic regression can be expressed as $\omega^T x + b = 0$; given a sample $x$, the probability estimate of the Logistic model is:

$$p(x) = P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\omega^T x + b)}}$$

The Logistic regression model first fits the decision boundary and then establishes the probabilistic link between this boundary and the classification result. The core problem of the Logistic model is solving for the parameter $\omega$, usually by maximum likelihood estimation, i.e. finding a set of parameters that maximizes the likelihood of the data; for the emotion classification problem:

$$P(Y = 1 \mid x) = p(x)$$
$$P(Y = 0 \mid x) = 1 - p(x)$$

The likelihood function is:

$$L(\omega) = \prod_{i=1}^{N} \left[p(x_i)\right]^{y_i} \left[1 - p(x_i)\right]^{1 - y_i}$$

Taking the logarithm of the likelihood function gives:

$$\ln L(\omega) = \sum_{i=1}^{N} \left[y_i \ln p(x_i) + (1 - y_i)\ln\left(1 - p(x_i)\right)\right]$$

Taking the average log-likelihood over the whole data set gives the Logistic loss function:

$$J(\omega) = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \ln p(x_i) + (1 - y_i)\ln\left(1 - p(x_i)\right)\right]$$

The parameters are solved by maximizing the likelihood function, i.e. minimizing the loss function. The solution can generally be obtained with gradient descent or Newton's method, and regularization is used to prevent overfitting of the model.
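For illustration, scikit-learn's LogisticRegression covers this step; the lbfgs solver is a quasi-Newton method, and the L2 penalty provides the regularization mentioned above. The feature matrix and labels below are random placeholders standing in for the fused vectors S and the polarity annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: fused vectors S from step 3; y: polarity labels (1 positive, 0 negative).
X = np.random.randn(200, 200)          # placeholder features for illustration
y = np.random.randint(0, 2, size=200)  # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# L2 regularization guards against overfitting; lbfgs is a quasi-Newton solver.
clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```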
Step 5: Part-of-speech-based LDA comment topic mining. The topic model is used to extract the product attributes of the negative-emotion-polarity comments obtained by the emotion classification in step 4. A general LDA topic model models all segmented words in a document, but not all parts of speech are equally important: words of different parts of speech contribute differently to semantic expression.
Step 5.1: Part-of-speech analysis. Word segmentation and part-of-speech tagging are performed on part of the comment corpus with the jieba word segmentation tool, and stop words are removed; the result is shown in Table 3 below. When comments are clustered, it is the words related to product attributes that express the meaning of a user comment. As Table 3 shows, most attribute words used in comments are nouns (POS tags n and nr) and a small portion are verbs (POS tag v), while other parts of speech are not directly related to product attributes; nouns and verbs are therefore the main content for analyzing the topics of user comments.
Therefore, in order to eliminate the interference of noise information, when the LDA model is used for subject clustering of user comments, nouns and verbs are firstly screened to form a new word set for cluster analysis.
TABLE 3 comment segmentation example
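A minimal sketch of this noun/verb screening with jieba's part-of-speech module, keeping tags that start with n or v as discussed above:

```python
import jieba.posseg as pseg

def keep_nouns_and_verbs(comment: str, stopwords=frozenset()) -> list[str]:
    """Keep only nouns (tags starting with 'n', e.g. n/nr) and verbs ('v')."""
    return [w for w, flag in pseg.lcut(comment)
            if (flag.startswith("n") or flag.startswith("v"))
            and w not in stopwords]

print(keep_nouns_and_verbs("这款车的续航确实让人失望"))
```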
Step 5.2: Determine the number of topics. The number of topics must be determined before mining the product attributes of the comments; it is determined using topic coherence (Coherence Score).
Topic coherence is an objective measure based on the distributional hypothesis of linguistics: words with similar meanings tend to appear in similar contexts. A topic is considered coherent if most of its words are closely related; in general, the better the LDA model, the higher its topic coherence score.
When computing topic coherence for LDA, a confirmation degree is computed for the paired words in each topic based on a sliding window and normalized pointwise mutual information (NPMI); that is, the degree of support between words is quantified with probabilities estimated from the corpus. NPMI is computed as:

$$\mathrm{NPMI}(W', W^*) = \frac{\log \dfrac{P(W', W^*) + \epsilon}{P(W')\,P(W^*)}}{-\log\left(P(W', W^*) + \epsilon\right)}$$

where $W'$ and $W^*$ are words from the set of the top n most important words of each topic, $P(W', W^*)$ is the probability that $W'$ and $W^*$ co-occur, and $\epsilon$ is a fixed constant.
If two words are related (e.g. they often appear in the same document), the value above is high, i.e. the confirmation degree is high. Finally, the confirmation degree of each topic's word set is aggregated by taking the arithmetic mean of the pairwise NPMI values, giving the final topic coherence score; the higher the topic coherence score, the more interpretable the topics and the more suitable the chosen number of topics.
step 5.3: and extracting the product comment attributes. And (4) extracting product attributes of the words filtered in the step 5.1 by adopting an LDA topic model.
LDA (Latent Dirichlet Allocation) is a document generation model in which the words of a document set D are represented by a set of K topics, each topic being formed by a vocabulary of different words. The basic process of the LDA topic model is to compute, from the text content, the topics corresponding to the text and the word list corresponding to each topic. The structure of the LDA topic model is shown in FIG. 4.
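For orientation, a minimal topic-extraction sketch with gensim's LdaModel follows. Note that gensim estimates the model by variational Bayes rather than the Gibbs sampling described below, and the toy noun/verb token lists are placeholders.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# texts: noun/verb token lists of the negative-polarity comments (step 5.1)
texts = [["续航", "里程", "短"], ["充电", "速度", "慢"], ["续航", "缩水"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", passes=20, random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=5):
    print(topic_id, words)   # high-weight attribute words per topic
```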
The LDA topic model can be broken down into two processes, document-topic, topic-word.
(1) Document-topic process: the topic distribution $\vartheta_m$ of each document is sampled from a Dirichlet prior with hyperparameter $\alpha$, and a latent topic is then generated by a multinomial (Multinomial) distribution, so the document-topic process is generated by a Dirichlet-Multinomial structure:

$$\vartheta_m \sim \mathrm{Dirichlet}(\alpha), \qquad z_{m,n} \sim \mathrm{Multinomial}(\vartheta_m)$$

where $\alpha$ is the hyperparameter of the prior distribution of the latent topic variables of each document, $z_{m,n}$ is the topic of the $n$-th word in the $m$-th document, and $\vartheta_m$ is the topic probability distribution of the $m$-th document.
(2) Topic-word process: the model generates the random word probabilities under each topic from a Dirichlet prior with hyperparameter $\beta$, and then, according to the probability distribution of the latent topic obtained in the first process, randomly selects a feature item $w_{m,n}$ through a Multinomial distribution:

$$\varphi_k \sim \mathrm{Dirichlet}(\beta), \qquad w_{m,n} \sim \mathrm{Multinomial}(\varphi_{z_{m,n}})$$

where $W$ is the size of the entire corpus dictionary, $\varphi_k$ is the word distribution of the $k$-th topic, $\beta$ is the hyperparameter of the prior distribution of each topic's word distribution, and $K$ is the number of topics.
simpler Gibbs sampling is used herein for estimation of LDA model parameters, and subject sampling is performed for each feature term. The sampling formula is
Where i represents the nth word of the mth document,representing a topic distribution after removal of the topic corresponding to the word with subscript i,/>Representing the distribution of the subject words of the document, and alpha represents the proportional relation of the front and the back expressions k And beta t Respectively representing super parameters of the LDA model, V represents the number of words contained in the document, K represents the number of topics, and z i Theme of n-th word of mth document,/->Indicating that the nth word in the mth document is removed,>represents the mthTotal number of words assigned to topic k in the document, < >>Representing the number of words t in the topic k;
the specific flow of Gibbs sampling corresponding to this document is:
(1) A topic number z is randomly assigned to each verb and noun in the comment sentence.
(2) Traversing verbs and nouns in the document and updating the topic numbers according to the formula.
(3) Repeat step (2) until the Gibbs sampling model converges.
(4) Count the topic of each noun and verb in the document, and obtain the verb-and-noun-based document-topic distribution according to:

$$\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_m^{(k)} + \alpha_k\right)}$$

where $n_m^{(k)}$ is the number of nouns and verbs in document $m$ assigned to topic $k$.
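The flow above can be illustrated with a compact collapsed Gibbs sampler. This is a sketch, not the patented implementation: it assumes symmetric hyperparameters alpha and beta, takes documents as lists of vocabulary word ids for the noun/verb tokens, and implements the sampling formula and the document-topic estimate given above.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of vocabulary word ids (noun/verb
    tokens only, per step 5.1). Symmetric hyperparameters are assumed.
    """
    rng = np.random.default_rng(seed)
    n_mk = np.zeros((len(docs), K))  # words in doc m assigned to topic k
    n_kt = np.zeros((K, V))          # count of word t assigned to topic k
    n_k = np.zeros(K)                # total words assigned to topic k
    z = []
    for m, doc in enumerate(docs):   # (1) random initial topic numbers
        zm = rng.integers(K, size=len(doc))
        z.append(zm)
        for t, k in zip(doc, zm):
            n_mk[m, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1
    for _ in range(iters):           # (2)-(3) resample until convergence
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]          # remove word i's current assignment
                n_mk[m, k] -= 1
                n_kt[k, t] -= 1
                n_k[k] -= 1
                # p(z_i = k | z_-i, w): document-topic factor times topic-word
                # factor (the doc-length denominator is constant in k)
                p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[m][i] = k
                n_mk[m, k] += 1
                n_kt[k, t] += 1
                n_k[k] += 1
    # (4) document-topic distribution theta_{m,k}
    return (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
```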
Claims (3)

1. A user demand calculation method based on an emotion dictionary and an LDA topic model, comprising the following steps:
step 1: preprocessing user comments;
performing word segmentation processing of comments by using a Chinese word segmentation tool jieba, and constructing a proper noun word library and a stop word list to remove stop words in the comments to obtain text features;
step 2: screening emotion feature words;
taking the HowNet emotion dictionary as the basis, selecting the positive emotion words, negative emotion words and negation words in the comment corpus to screen the emotion words; extracting all item sets whose occurrence count in the screened data set meets the minimum support to obtain the emotion features; the minimum support being defined as:

$$\mathrm{support}(x) = \frac{\sigma_x}{|D|}$$

where $|D|$ is the total number of item sets in the data set and $\sigma_x$ is the number of item sets in the data set that contain $x$;
step 3: fusing the text vector and the emotion feature vector;
step 3.1: vectorizing the text features with Word2vec to obtain the text vector $S_w = (w_1, w_2, \ldots, w_n)$, and vectorizing the emotion features with FastText to obtain the emotion vector $S_e = (e_1, e_2, \ldots, e_m)$;
Step 3.2: vector fusion;
splicing the text vector and the emotion vector by vector concatenation to obtain the spliced vector $S$:

$$S = S_w \oplus S_e = (w_1, w_2, \ldots, w_n, e_1, e_2, \ldots, e_m),$$

where $\oplus$ denotes vector concatenation;
step 4: classifying the comment emotion polarity of the vectors fused in step 3 by adopting a logistic regression classification model;
step 5: mining LDA comment topics based on parts of speech, wherein LDA denotes Latent Dirichlet Allocation;
step 5.1: part-of-speech analysis;
performing word segmentation and part-of-speech tagging on part of the comments with the jieba word segmentation tool, removing stop words, analyzing the part-of-speech characteristics of the attribute words users employ in comments, and screening the nouns and verbs to form a new word set for cluster analysis;
step 5.2: determining the number of topics n;
determining the number of product attribute topics by topic coherence: based on a sliding window, computing a confirmation degree for the paired words in each topic with normalized pointwise mutual information (NPMI), i.e. quantifying the degree of support between words with probabilities estimated from the corpus, NPMI being computed as:

$$\mathrm{NPMI}(W', W^*) = \frac{\log \dfrac{P(W', W^*) + \epsilon}{P(W')\,P(W^*)}}{-\log\left(P(W', W^*) + \epsilon\right)}$$

where $W'$ and $W^*$ are words from the set of the top n most important words of each topic, $P(W', W^*)$ is the probability that $W'$ and $W^*$ co-occur, and $\epsilon$ is a fixed constant; computing the coherence score for each candidate value of n, the n with the largest score being the number of topics;
step 5.3: performing topic extraction on the words filtered in the step 5.1 by adopting an LDA topic model to obtain the product requirement of a user;
the LDA topic model is decomposed into two processes, document-topic and topic-word;
(1) document-topic process: the topic distribution $\vartheta_m$ of each document is sampled from a Dirichlet prior with hyperparameter $\alpha$, and a latent topic is then generated by a multinomial distribution, so the document-topic process is generated by a Dirichlet-Multinomial structure:

$$\vartheta_m \sim \mathrm{Dirichlet}(\alpha), \qquad z_{m,n} \sim \mathrm{Multinomial}(\vartheta_m)$$

where $\alpha$ is the hyperparameter of the prior distribution of the latent topic variables of each document in the document collection, $z_{m,n}$ is the topic of the $n$-th word in the $m$-th document, $\vartheta_m$ is the topic probability distribution of the $m$-th document, and $\mathrm{Multinomial}(\vartheta_m)$ is the multinomial distribution parameterized by $\vartheta_m$;
(2) topic-word process: the model generates the random word probabilities under each topic from a Dirichlet prior with hyperparameter $\beta$, and then, according to the probability distribution of the latent topic obtained in the first process, randomly selects a feature item $w_{m,n}$ through a multinomial distribution:

$$\varphi_k \sim \mathrm{Dirichlet}(\beta), \qquad w_{m,n} \sim \mathrm{Multinomial}(\varphi_{z_{m,n}})$$

where $W$ is the size of the entire corpus dictionary, $\varphi_k$ is the word distribution of the $k$-th topic, $\beta$ is the hyperparameter of the prior distribution of each topic's word distribution, and $K$ is the number of topics;
estimating the parameters of the LDA model by Gibbs sampling, sampling a topic for each feature item with the formula:

$$p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \propto \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_{m,\neg i}^{(k)} + \alpha_k\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_{k,\neg i}^{(t)} + \beta_t\right)}$$

where $i$ indexes the $n$-th word of the $m$-th document, $\neg i$ indicates that the topic assignment of the word with subscript $i$ is removed, the first factor is the document-topic distribution and the second the topic-word distribution, $\alpha_k$ and $\beta_t$ are the hyperparameters of the LDA model, $V$ is the number of words in the vocabulary, $K$ is the number of topics, $z_i$ is the topic of the $n$-th word of the $m$-th document, $n_{m,\neg i}^{(k)}$ is the total number of words assigned to topic $k$ in the $m$-th document with word $i$ removed, and $n_{k,\neg i}^{(t)}$ is the number of occurrences of word $t$ in topic $k$ with word $i$ removed;
the topics obtained by sampling yield the product requirements of the user.
2. The method for calculating user demand based on emotion dictionary and LDA topic model as set forth in claim 1, wherein the detailed method in step 4 is as follows:
the Logistic distribution is a continuous probability distribution with distribution function:

$$F(x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

where $\mu$ is a location parameter and $\gamma > 0$ is a shape parameter;
Logistic regression is a logistic regression analysis whose decision boundary is $\omega^T x + b = 0$; given a sample $x$, the probability estimate of the Logistic model is:

$$p(x) = P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\omega^T x + b)}}$$

the Logistic regression model first fits the decision boundary and then establishes the probabilistic link between this boundary and the classification result; the core problem of the Logistic model is solving for the parameter $\omega$, usually by maximum likelihood estimation, i.e. finding a set of parameters that maximizes the likelihood of the data; for the emotion classification problem:

$$P(Y = 1 \mid x) = p(x)$$
$$P(Y = 0 \mid x) = 1 - p(x)$$

the likelihood function is:

$$L(\omega) = \prod_{i=1}^{N} \left[p(x_i)\right]^{y_i} \left[1 - p(x_i)\right]^{1 - y_i}$$

taking the logarithm of the likelihood function gives:

$$\ln L(\omega) = \sum_{i=1}^{N} \left[y_i \ln p(x_i) + (1 - y_i)\ln\left(1 - p(x_i)\right)\right]$$

taking the average log-likelihood over the whole data set gives the Logistic loss function:

$$J(\omega) = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \ln p(x_i) + (1 - y_i)\ln\left(1 - p(x_i)\right)\right]$$

the parameters are solved by maximizing the likelihood function, i.e. minimizing the loss function; the solution is obtained with gradient descent or Newton's method, and regularization is used to prevent overfitting of the model.
3. The method for calculating user requirements based on an emotion dictionary and an LDA topic model as claimed in claim 1, wherein the specific flow of Gibbs sampling in step 5 is as follows:
(1) Randomly assigning a topic number z to each verb and noun in the comment sentence;
(2) Traversing verbs and nouns in the document, and updating the topic numbers according to the formula;
(3) Repeating step (2) until the Gibbs sampling model converges;
(4) Counting the topic of each noun and verb in the document, and obtaining the verb-and-noun-based document-topic distribution according to the following formula:

$$\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_m^{(k)} + \alpha_k\right)}$$
Application CN202310809452.3A, filed 2023-07-03 (priority date 2023-07-03): User demand calculation method based on emotion dictionary and LDA topic model. Publication CN117291190A (Pending).

Priority Applications (1)

Application Number: CN202310809452.3A; Priority Date: 2023-07-03; Filing Date: 2023-07-03; Title: User demand calculation method based on emotion dictionary and LDA topic model

Publications (1)

Publication Number: CN117291190A; Publication Date: 2023-12-26

Family

ID=89239726

Family Applications (1)

Application Number: CN202310809452.3A; Title: User demand calculation method based on emotion dictionary and LDA topic model; Priority Date: 2023-07-03; Filing Date: 2023-07-03

Country Status (1)

CN: CN117291190A (en)

Cited By (1)

* Cited by examiner, † Cited by third party

Publication Number: CN117788036A *; Priority Date: 2023-12-29; Publication Date: 2024-03-29; Assignee: 广州伯威逊科技有限公司; Title: Marketing feedback-based production plan management method and system

Similar Documents

Publication Publication Date Title
CN112001185B (en) Emotion classification method combining Chinese syntax and graph convolution neural network
CN112001187B (en) Emotion classification system based on Chinese syntax and graph convolution neural network
CN110222178B (en) Text emotion classification method and device, electronic equipment and readable storage medium
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN109471942B (en) Chinese comment emotion classification method and device based on evidence reasoning rule
Chang et al. Research on detection methods based on Doc2vec abnormal comments
CN110765769B (en) Clause feature-based entity attribute dependency emotion analysis method
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN110750648A (en) Text emotion classification method based on deep learning and feature fusion
Azim et al. Text to emotion extraction using supervised machine learning techniques
CN112395417A (en) Network public opinion evolution simulation method and system based on deep learning
CN111368082A (en) Emotion analysis method for domain adaptive word embedding based on hierarchical network
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN111429184A (en) User portrait extraction method based on text information
Ahanin et al. A multi-label emoji classification method using balanced pointwise mutual information-based feature selection
Hasan et al. Sentiment analysis using out of core learning
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN117291190A (en) User demand calculation method based on emotion dictionary and LDA topic model
CN114942974A (en) E-commerce platform commodity user evaluation emotional tendency classification method
CN114547303A (en) Text multi-feature classification method and device based on Bert-LSTM
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
Baboo et al. Sentiment analysis and automatic emotion detection analysis of twitter using machine learning classifiers
CN111859955A (en) Public opinion data analysis model based on deep learning

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination