CN110929034A - Commodity comment fine-grained emotion classification method based on improved LSTM

Info

Publication number: CN110929034A
Application number: CN201911173494.2A
Authority: CN
Inventors: 金庆雨; 李勇; 蔡圆媛; 张青川
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2019-11-26
Filing date: 2019-11-26
Publication date: 2020-03-27

Abstract

The invention belongs to the field of natural language processing and provides a commodity comment fine-grained emotion classification method based on improved LSTM. The method comprises the following steps: writing a crawler script to capture commodity comment data from an e-commerce website and preprocessing the data; segmenting the cleaned data into words with the jieba word segmentation tool; training word vectors with word2vec from the gensim natural language processing package to obtain the word vectors corresponding to the comment data; using an existing emotion lexicon as a seed lexicon and expanding it according to word vector similarity; extracting subject words and emotion words from the comments; and constructing an emotion classification model, feeding it the word vector sequences corresponding to the subject words and emotion words of each commodity comment, and classifying the emotion of the comment. By applying deep learning, the method fully mines the emotional tendencies in commodity comments and thereby improves the accuracy of their emotion classification.

Description

Commodity comment fine-grained emotion classification method based on improved LSTM

Technical Field

The invention relates to the technical field of natural language processing, in particular to a commodity comment fine-grained emotion classification method based on improved LSTM.

Background

In recent years, with the rapid development of the Internet, large numbers of online users have gathered on social media, forums, and online shopping platforms such as JD.com and Taobao. According to the 44th Statistical Report on Internet Development in China, as of June 2019 the number of online shopping users in China had reached 639 million, an increase of 28.71 million compared with 2018, accounting for 74.8% of all Internet users; online shopping and online payment have become some of the most widely used applications among Internet users. When buying a commodity online, people prefer to learn about it in detail from the objective information in buyers' comments rather than from the merchant's subjective description. Sellers on e-commerce platforms can likewise learn from the comments what buyers think of one or more commodities, identify problems with those commodities, and devise a sound sales strategy. Faced with such a huge volume of comment text, determining the emotional tendency of comments by hand is very time-consuming and labor-intensive. Automatically mining and analyzing the emotional tendency of comment text with artificial intelligence and natural language processing techniques is therefore an important task.

Emotion analysis is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries emotional coloring. Emotion classification divides text into two or more categories, such as positive or negative, according to the meaning and emotional information the text expresses; it is a division of the text by emotional tendency and by viewpoint and attitude. Traditional emotion classification mainly comprises methods based on emotion dictionaries and methods based on machine learning. The dictionary-based method performs semantic analysis with an emotion dictionary such as HowNet and judges the positive or negative tendency of the text from the final score: a positive score indicates positive emotion, and a negative score indicates negative emotion. Its disadvantages are that it depends heavily on the emotion dictionary, that dictionaries differ greatly between fields, and that mature Chinese emotion dictionaries are few and narrow in scope, so portability is poor. Machine-learning-based emotion classification mainly uses the naive Bayes classifier, the maximum entropy algorithm, support vector machines, and the like, but these methods require large labeled datasets, with positive features selected from positive comment data and negative features selected from negative comment data.

Disclosure of Invention

To overcome the shortage of mature emotion dictionaries, the poor portability of emotion classification models, and the need for large manually labeled datasets, the invention provides a commodity comment fine-grained emotion classification method based on an improved LSTM (long short-term memory network). By incorporating a deep neural network model, the method can extract the deep emotional information of a text and thereby improve emotion classification accuracy.

The technical solution adopted by the invention is as follows: text processing techniques from the field of natural language processing are introduced into the emotion classification model and combined with deep learning to improve classification accuracy. Word vectors are trained with the Word2Vec algorithm, and the commodity comment texts are represented by these word vectors, which converts the conceptual space of the text into a computable space; the similarity between two words is obtained by calculating the Euclidean distance between their word vectors. Finally, the word vectors corresponding to the subject words and emotion words are input into an emotion classifier for training, which yields the emotional tendency of each commodity comment.

A commodity comment fine-grained emotion classification method based on improved LSTM comprises the following steps:

Step 1: capturing commodity comment data from an e-commerce website, the data comprising a commodity ID, a commodity category, a commodity name, commodity comment content, and comment time; labeling part of the commodity comment data as positive or negative; and dividing the labeled data into a training set and a test set;

Step 2: cleaning the commodity comment data by deleting punctuation marks that are useless for emotion classification, and segmenting the commodity comments into words;

Step 3: converting each word obtained in step 2 into a word vector, and constructing the word vector matrix corresponding to each word;

Step 4: converting the words in the emotion word seed lexicon and the subject word seed lexicon into word vectors, the vector matrix corresponding to each word serving as the vector matrix of that seed word; calculating the similarity between the seed word vector matrices and the word vector matrices obtained in step 3; where the seed word is a subject word, words whose similarity exceeds a threshold are added to the subject word lexicon, and where the seed word is an emotion word, words whose similarity exceeds the threshold are added to the emotion word lexicon;

Step 5: extracting subject words and emotion words from the commodity comment data, mapping them to word vectors, and splicing the subject word vectors and emotion word vectors, the splicing result serving as the input of the emotion classifier;

Step 6: the emotion classifier comprises a bidirectional long short-term memory network and a softmax function. The word vector splicing result of step 5 is used as the input of the emotion classifier and passes through a two-layer LSTM neural network model, in which an input gate, an output gate, and a forget gate control the flow of the state matrix across time steps during model training; the network updates node information through memory units so as to learn long-range dependencies in the text sequence; an attention layer adjusts the weights of the subject words and the emotion words respectively, computing a weight for each output matrix of the neural network units, and the weighted sum of the output matrices with the attention weights is the feature vector of the commodity comment, giving a more accurate emotion classification result; finally, the emotion category of the commodity comment is output through the softmax function.

Further, in step 1, the commodity comments are collected by writing Python crawler code for the e-commerce website; part of the captured data is manually labeled, with each commodity comment labeled as positive or negative; and the labeled data are finally divided into a training set and a test set.

Further, in step 2, the collected commodity comment data are cleaned by removing punctuation marks that are useless for emotion classification, and a word segmentation tool is used to segment the commodity comment data.

Further, in step 3, each word resulting from segmentation of the commodity comments is mapped to a word vector using Word2Vec, which is trained on the captured commodity comment data, so that feature vectors containing emotion information and semantic information are obtained.

Further, in step 4, each subject word and each emotion word in the subject word seed lexicon and the emotion word seed lexicon is mapped to a word vector using Word2Vec; the similarity between the seed words and the word vectors of the commodity comments obtained in step 3 is calculated; and, according to the results, words with high similarity are added to the subject word lexicon and the emotion word lexicon, respectively, as their expanded lexicons.

Further, in step 5, the effective components for emotion classification of a commodity comment are its subject words and emotion words. All words of the user comment are converted into word vectors; the vector similarity between the words in the comment and the subject word lexicon is calculated to select the subject words in the comment, and the vector similarity between the words in the comment and the emotion word lexicon is calculated to select the emotion words in the comment. The method of selecting the subject words and emotion words is the same as the similarity calculation used to expand the subject word lexicon and the emotion word lexicon in step 4. The selected subject word vectors and emotion word vectors are spliced so that the emotion classification information contained in the comment can be input into the emotion classification model.

Further, in step 6, the subject word vectors and emotion word vectors obtained in step 5 are spliced to serve as the input of the emotion classifier; the commodity comment data are input into the emotion classifier and classified by the emotion classification model. The text vectors input into the model first pass through a two-layer long short-term memory network, which computes the matrix corresponding to the comment; the hidden-layer nodes of the two layers are interconnected, and both layers are connected to the same output layer. To highlight the role of the subject words and emotion words in the comment sentence, an attention mechanism is applied to the output-layer matrix, which is weighted and summed, thereby improving the final accuracy of emotion classification. Finally, the resulting matrix is input into a softmax function to obtain the softmax value, which determines the emotional tendency of the comment.

Beneficial effects:

The method has the advantage that the model can be used across many fields and situations: after labeling only a small amount of corpus, the remaining data in a field can be classified by emotion, and high classification accuracy can be achieved.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a diagram of the emotion classifier of the present invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

As shown in FIG. 1 and FIG. 2, the commodity comment fine-grained emotion classification method based on improved LSTM according to the invention obtains the emotion category of a commodity comment through an emotion classifier and mainly includes the following steps:

Step one: Python crawler code is written to capture commodity comment data from an e-commerce website such as JD Mall. The data include the commodity ID, commodity category, commodity name, commodity comment content, comment time, and the like, and are manually labeled as expressing either a positive or a negative emotional tendency.

After the commodity comment data are labeled, the dataset is divided into a test set and a training set. The training set is used to train the model on the labeled data, and the test set is used to test the trained model; the ratio of test set to training set is 2:8.
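A minimal sketch of the 2:8 split described above, assuming the labeled comments are held as `(text, label)` pairs. The function name `split_labeled`, the fixed random seed, and the sample data are illustrative assumptions and do not come from the original disclosure.

```python
import random

def split_labeled(samples, test_ratio=0.2, seed=42):
    """Shuffle labeled samples and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled comments: (text, label), 1 = positive, 0 = negative.
reviews = [(f"review {i}", i % 2) for i in range(10)]
train_set, test_set = split_labeled(reviews)
print(len(train_set), len(test_set))  # 8 2
```

Shuffling before splitting avoids any ordering bias from the crawl, for example comments grouped by commodity.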

Step two: the captured commodity comment data are cleaned by deleting punctuation marks that are useless for emotion classification, such as commas, periods, and ellipses, and the commodity comments are segmented. A word segmentation tool is used to segment the commodity comment corpus. Stop words are words that generally do not affect the meaning or emotional tendency of the whole comment; they are filtered out of the commodity comments with a stop word list, which improves model training efficiency.
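The cleaning and stop-word filtering of step two might look like the following sketch. The punctuation set, the stop word list, and the function name `preprocess` are assumptions chosen for illustration. To stay dependency-free, the sample comment is already separated by spaces and `str.split` stands in for the segmenter; a real Chinese segmenter could be passed in its place.

```python
import re

# Punctuation treated as useless for emotion classification (illustrative set).
PUNCTUATION = re.compile(r"[，。！？、；：…,.!?;:]+")

def preprocess(comment, segment, stop_words):
    """Remove punctuation, segment the comment, and drop stop words."""
    cleaned = PUNCTUATION.sub(" ", comment)
    return [tok for tok in segment(cleaned) if tok.strip() and tok not in stop_words]

# Stand-in segmenter: the sample is pre-segmented with spaces, so str.split suffices.
# A real segmenter (for example jieba.lcut) could be passed instead.
tokens = preprocess("屏幕 很 清晰，物流 太 慢 了。。。", str.split, stop_words={"了", "的"})
print(tokens)  # ['屏幕', '很', '清晰', '物流', '太', '慢']
```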

Step three: word vectors are trained with the Word2Vec method, and the segmented commodity comment data are converted into word vectors, each word corresponding to one word vector; the similarity between words can then be calculated, for example by calling Python's gensim library. Word2Vec offers two model architectures, the continuous bag-of-words model (CBOW) and the skip-gram model (Skip-Gram), and can train word vectors efficiently on large volumes of data.

Step four: the words in the subject word seed lexicon and the emotion word seed lexicon are converted into word vectors with the Word2Vec method, and their similarity to the word vectors obtained from the commodity comment data in step three is calculated. According to the results, words with high similarity are added to the subject word lexicon and the emotion word lexicon, respectively, as their expanded lexicons.
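The lexicon expansion of step four can be sketched as follows. The disclosure states that similarity is derived from the Euclidean distance between word vectors but does not give the exact mapping or the threshold; the mapping `1 / (1 + distance)`, the threshold value, and the toy two-dimensional vectors below are assumptions made for illustration.

```python
import math

def similarity(u, v):
    """Map Euclidean distance to a similarity in (0, 1]; identical vectors score 1."""
    return 1.0 / (1.0 + math.dist(u, v))

def expand_lexicon(seed_words, vectors, threshold):
    """Add every vocabulary word whose similarity to any seed word exceeds the threshold."""
    expanded = set(seed_words)
    for word, vec in vectors.items():
        if any(s in vectors and similarity(vectors[s], vec) > threshold for s in seed_words):
            expanded.add(word)
    return expanded

# Toy word vectors standing in for trained Word2Vec output.
vectors = {"好": [1.0, 0.0], "不错": [0.9, 0.1], "差": [-1.0, 0.0], "屏幕": [0.0, 1.0]}
emotion_lexicon = expand_lexicon({"好"}, vectors, threshold=0.8)
print(sorted(emotion_lexicon))  # ['不错', '好']
```

The same function would be applied once with the subject word seeds and once with the emotion word seeds.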

Step five: all commodity comment data having been converted into word vectors in step three, the word vectors corresponding to the subject words and emotion words are extracted and spliced to form the input of the emotion classifier.
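One possible reading of the splicing in step five is sketched below: the subject word vectors found in a comment are placed first, followed by the emotion word vectors, giving a single vector sequence. The disclosure does not fix the splicing order or layout, and it selects words by vector similarity rather than by the simple lexicon lookup used here, so the function `build_classifier_input` and its behavior are assumptions.

```python
def build_classifier_input(tokens, subject_lexicon, emotion_lexicon, vectors):
    """Return the subject word vectors followed by the emotion word vectors of one comment."""
    subject_vecs = [vectors[t] for t in tokens if t in subject_lexicon and t in vectors]
    emotion_vecs = [vectors[t] for t in tokens if t in emotion_lexicon and t in vectors]
    return subject_vecs + emotion_vecs   # spliced sequence fed to the classifier

# Toy data: segmented comment, expanded lexicons, and word vectors.
vectors = {"屏幕": [0.0, 1.0], "物流": [0.1, 0.9], "清晰": [1.0, 0.0], "慢": [-1.0, 0.1]}
tokens = ["屏幕", "很", "清晰", "物流", "太", "慢"]
sequence = build_classifier_input(tokens, {"屏幕", "物流"}, {"清晰", "慢"}, vectors)
print(len(sequence))  # 4
```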

Step six: the subject word and emotion word vectors are input into the emotion classifier for model training; within the classifier they are fed into a bidirectional long short-term memory network. The gates of the LSTM memory unit are the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$. Together these gates determine how the current memory cell $c_t$ and the current hidden state $h_t$ are updated.

Given an input sequence $v = (v_1, v_2, \dots, v_L)$, the LSTM computes the hidden vector sequence $H = [h_1, h_2, \dots, h_L]$ and the output matrix sequence $X = [x_1, x_2, \dots, x_L]$.

The forget gate controls whether information is forgotten: in the LSTM it decides, with a certain probability, whether the cell state of the previous time step is forgotten. The forget gate is expressed as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where $f_t$ is the forget gate at time step $t$; $\sigma$ is the activation function, whose output lies in $[0, 1]$; $W_f$ is the weight matrix of the forget gate; $h_{t-1}$ is the hidden-layer data at time step $t-1$; $x_t$ is the input at time step $t$; and $b_f$ is the bias of the forget gate.

The input gate processes the input at the current sequence position and selectively stores new information in the cell state. The input gate is expressed as follows:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

where $i_t$ is the input gate at time step $t$; $\sigma$ is the activation function, whose output lies in $[0, 1]$; $W_i$ is the weight matrix of the input gate; $h_{t-1}$ is the hidden-layer data at time step $t-1$; $x_t$ is the input at time step $t$; and $b_i$ is the bias of the input gate.

$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$

where $\tilde{C}_t$ is the candidate vector; $\tanh$ is the activation function, whose output lies in $[-1, 1]$; $W_c$ is the weight matrix of the candidate vector; $h_{t-1}$ is the hidden-layer data at time step $t-1$; $x_t$ is the input at time step $t$; and $b_c$ is the bias of the candidate vector.

The output gate processes the intermediate-layer information and passes the cell state onward as the output. The output gate is expressed as follows:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

where $o_t$ is the output gate at time step $t$; $\sigma$ is the activation function, whose output lies in $[0, 1]$; $W_o$ is the weight matrix of the output gate; $h_{t-1}$ is the hidden-layer data at time step $t-1$; $x_t$ is the input at time step $t$; and $b_o$ is the bias of the output gate.

The memory comprises two kinds, long-term memory and short-term memory, and is expressed as follows:

$$h_t = o_t \odot \tanh(C_t)$$

where $C_t$ is the updated state at time step $t$ and $h_t$ is the hidden-layer data at time step $t$.
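The gate equations above can be collected into a single LSTM time step, sketched here with NumPy. The text does not state how the updated state $C_t$ is formed from the gates; the sketch assumes the standard LSTM update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, where $\tilde{C}_t$ is the candidate vector. The function names, dimensions, and random initialization are likewise illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation with output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations in the text."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])         # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])         # input gate
    c_hat = np.tanh(p["W_c"] @ z + p["b_c"])       # candidate vector
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])         # output gate
    c_t = f_t * c_prev + i_t * c_hat               # standard cell update (assumed, not in the text)
    h_t = o_t * np.tanh(c_t)                       # hidden state
    return h_t, c_t

def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights and zero biases for the four gates (illustrative)."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "fico":
        p["W_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim + input_dim))
        p["b_" + gate] = np.zeros(hidden_dim)
    return p

params = init_params(input_dim=4, hidden_dim=3)
h_t, c_t = lstm_step(np.ones(4), np.zeros(3), np.zeros(3), params)
print(h_t.shape, c_t.shape)  # (3,) (3,)
```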

For a sentence $s = (w_1, w_2, \dots, w_L)$, $L$ denotes the maximum number of words in a sentence. $\mathrm{LSTM}_l$ and $\mathrm{LSTM}_r$ denote the left and right LSTM units, respectively. $C_l(v_i)$ is the context vector output by the left LSTM unit, $C_r(v_i)$ is the context vector output by the right LSTM unit, and combining $C_l(v_i)$ and $C_r(v_i)$ yields the final state matrix $C$.
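A sketch of how the left and right LSTM outputs might be combined into the state matrix $C$. The text says only that $C_l(v_i)$ and $C_r(v_i)$ are combined; concatenating them per position is an assumption, as are the stacked weight layout, the standard cell update inside `run_lstm`, and all dimensions.

```python
import numpy as np

def run_lstm(inputs, W, b):
    """Run one LSTM direction; W stacks the f, i, candidate, o weights row-wise."""
    hidden = b.shape[0] // 4
    h, c, outputs = np.zeros(hidden), np.zeros(hidden), []
    for x_t in inputs:
        z = W @ np.concatenate([h, x_t]) + b
        f, i, g, o = np.split(z, 4)
        f, i, o = (1.0 / (1.0 + np.exp(-a)) for a in (f, i, o))
        c = f * c + i * np.tanh(g)     # standard cell update (assumed)
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def bilstm(inputs, left_params, right_params):
    """Concatenate left-to-right and right-to-left context vectors per position."""
    c_left = run_lstm(inputs, *left_params)
    c_right = run_lstm(inputs[::-1], *right_params)[::-1]   # re-align to input order
    return np.concatenate([c_left, c_right], axis=1)

rng = np.random.default_rng(0)
L, D, H = 5, 4, 3   # sequence length, input size, hidden size (illustrative)
left = (rng.normal(0.0, 0.1, (4 * H, H + D)), np.zeros(4 * H))
right = (rng.normal(0.0, 0.1, (4 * H, H + D)), np.zeros(4 * H))
v = rng.normal(size=(L, D))
C = bilstm(v, left, right)
print(C.shape)  # (5, 6)
```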

The attention layer calculates the weight of each network node of the emotion classification model during training; the output-layer vectors are then weighted by the attention weights and summed to obtain the feature vector of the comment.
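The weighted summation performed by the attention layer can be illustrated as below. The disclosure does not specify how the attention weights are scored; this sketch assumes a single learned vector `w` applied to the `tanh` of each output row, followed by normalization so the weights sum to one. The names `attention_pool` and `w` and the random inputs are hypothetical.

```python
import numpy as np

def attention_pool(outputs, w):
    """Weight each time step's output vector and sum them into one feature vector."""
    scores = np.tanh(outputs) @ w                 # one score per time step
    weights = np.exp(scores - scores.max())       # numerically stable normalization
    weights /= weights.sum()
    return weights @ outputs, weights

rng = np.random.default_rng(0)
outputs = rng.normal(size=(5, 6))   # stand-in for the output matrix of the LSTM layers
w = rng.normal(size=6)              # stand-in for learned attention parameters
feature, weights = attention_pool(outputs, w)
print(feature.shape, round(float(weights.sum()), 6))  # (6,) 1.0
```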

Finally, the feature vector is passed through a softmax function to obtain the emotional tendency class of the commodity comment.
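A minimal sketch of the final softmax classification, assuming two classes and a linear projection of the feature vector before the softmax. The projection, the label order, and the toy weights are assumptions; the text itself mentions only the softmax function.

```python
import numpy as np

LABELS = ("negative", "positive")   # assumed label order

def softmax(z):
    """Convert scores into probabilities that sum to one."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def classify(feature, W, b):
    """Project the comment feature vector to class scores and pick the most probable class."""
    probs = softmax(W @ feature + b)
    return LABELS[int(np.argmax(probs))], probs

feature = np.array([0.2, 1.5])      # toy feature vector
W, b = np.eye(2), np.zeros(2)       # toy projection parameters
label, probs = classify(feature, W, b)
print(label)  # positive
```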

Although illustrative embodiments of the invention have been described above so that those skilled in the art can understand it, the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and all inventions that make use of the inventive concept are protected, provided they do not depart from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A commodity comment fine-grained emotion classification method based on improved LSTM is characterized by comprising the following steps:

2. The method for classifying the fine-grained emotion of the commodity comments based on the improved LSTM as claimed in claim 1, wherein:

in step 1, the commodity comments are collected by writing Python crawler code for the e-commerce website; part of the captured data is manually labeled, with each commodity comment labeled as positive or negative; and the labeled data are finally divided into a training set and a test set.

3. The method for classifying the fine-grained emotion of the commodity comments based on the improved LSTM as claimed in claim 1, wherein:

in step 2, the collected commodity comment data are cleaned by removing punctuation marks that are useless for emotion classification, and a word segmentation tool is used to segment the commodity comment data.

4. The method for classifying the fine-grained emotion of the commodity comments based on the improved LSTM as claimed in claim 1, wherein:

in step 3, each word resulting from segmentation of the commodity comments is mapped to a word vector using Word2Vec, which is trained on the captured commodity comment data, so that feature vectors containing emotion information and semantic information are obtained.

5. The method for classifying the fine-grained emotion of the commodity comments based on the improved LSTM as claimed in claim 1, wherein:

in step 4, each subject word and each emotion word in the subject word seed lexicon and the emotion word seed lexicon is mapped to a word vector using Word2Vec; the similarity between the seed words and the word vectors of the commodity comments obtained in step 3 is calculated; and, according to the results, words with high similarity are added to the subject word lexicon and the emotion word lexicon, respectively, as their expanded lexicons.

6. The method for classifying the fine-grained emotion of the commodity comments based on the improved LSTM as claimed in claim 1, wherein:

in step 5, the effective components for emotion classification of a commodity comment are its subject words and emotion words; all words of the user comment are converted into word vectors; the vector similarity between the words in the comment and the subject word lexicon is calculated to select the subject words in the comment, and the vector similarity between the words in the comment and the emotion word lexicon is calculated to select the emotion words in the comment; the method of selecting the subject words and emotion words is the same as the similarity calculation used to expand the subject word lexicon and the emotion word lexicon in step 4; and the selected subject word vectors and emotion word vectors are spliced so that the emotion classification information contained in the comment can be input into the emotion classification model.

7. The method for classifying the fine-grained emotion of the commodity comments based on the improved LSTM as claimed in claim 1, wherein:

in step 6, the subject word vectors and emotion word vectors obtained in step 5 are spliced to serve as the input of the emotion classifier; the commodity comment data are input into the emotion classifier and classified by the emotion classification model; the text vectors input into the emotion classification model first pass through a two-layer long short-term memory network, which computes the matrix corresponding to the comment, the hidden-layer nodes of the two layers being interconnected and both layers being connected to the same output layer; to highlight the role of the subject words and emotion words in the comment sentence, an attention mechanism is applied to the output-layer matrix, which is weighted and summed, thereby improving the final accuracy of emotion classification; and finally the resulting matrix is input into a softmax function to obtain the softmax value, which determines the emotional tendency of the comment.