CN113673239A - Hotel comment emotion polarity classification method based on emotion dictionary weighting - Google Patents

Hotel comment emotion polarity classification method based on emotion dictionary weighting

Info

Publication number
CN113673239A
CN113673239A
Authority
CN
China
Prior art keywords
emotion
word
data
hotel
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110752833.3A
Other languages
Chinese (zh)
Inventor
谢晓兰
李新飞
陈灵彬
刘亚荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Technology
Original Assignee
Guilin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Technology filed Critical Guilin University of Technology
Priority to CN202110752833.3A priority Critical patent/CN113673239A/en
Publication of CN113673239A publication Critical patent/CN113673239A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hotel comment emotion polarity classification method based on emotion dictionary weighting. A public Chinese hotel comment corpus is used: the hotel comment text data are cleaned and extracted, then fed into a naive Bayes emotion polarity classifier weighted by an emotion dictionary for training and testing, and the prediction results are evaluated with indexes such as accuracy, recall and F1 value, so that more accurate decision support is provided for consumers and merchants. The method serves consumers who want high-quality hotel stays that match their own interests and the hotel's service quality. It mitigates the effects of the missing hotel-specific corpus and of the independence assumption of the naive Bayes algorithm, supports the correctness of consumers' subjective choices, improves the timeliness and pertinence of the emotion polarity classification model by applying an emotion dictionary, and meets practical application requirements.

Description

Hotel comment emotion polarity classification method based on emotion dictionary weighting
Technical Field
The invention relates to the technical field of emotion polarity classification, in particular to a hotel comment emotion polarity classification method based on emotion dictionary weighting.
Background
At present, the rapid development of the internet has made the network pervade all aspects of life, and various network applications have changed people's way of living. Services such as dining and lodging have moved from offline to online, consumers publish personalized language on network platforms to express their own opinions and evaluations of a service, and these online evaluations in turn influence the intentions of other potential customers. Comment information with subjective emotional tendency can generate immeasurable value and change people's choices: it not only helps consumers make correct subjective selections, but also lets merchants extract valuable data to improve their business and services and, by analyzing the comment information, raise their ratings and revenue. The explosive growth of information data brings rich value and huge potential, but it also brings a serious "information overload" problem, making the data difficult to process and analyze by manual methods alone; and since the quality of hotel services often decides whether a customer chooses to check in, text sentiment analysis technology in the computer field (also called sentiment mining or tendency analysis) has emerged for service-type text data such as hotel comments.
Scholars at home and abroad have done a great deal of research on emotional feature analysis. Methods such as Word2Vec and TF-IDF are commonly applied for feature extraction, and emotion classification models include the Support Vector Machine (SVM), the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN) and chi-square feature selection (CHI); related work also covers classification based on OCC models of emoticons and model building based on emotion dictionaries. The invention discloses a hotel comment emotion polarity classification method based on emotion dictionary weighting. Merchants can extract valuable data from it to improve their business and services; the method applies the emotion dictionary to hotel comment data analysis, analyzes the emotion polarity of consumers' subjective texts, improves the timeliness and pertinence of the emotion classification model, and meets practical application requirements.
Disclosure of Invention
The invention aims to provide a hotel comment emotion polarity classification method based on emotion dictionary weighting, which is used for solving the problems of inaccurate classification precision and the like caused by the lack of a corpus in a hotel comment specific field and the independence assumption of a naive Bayesian algorithm, and providing more accurate decision support for consumers and merchants according to model analysis.
The technical scheme adopted by the invention is as follows: a hotel comment emotion polarity classification method based on emotion dictionary weighting comprises the following steps:
S1: obtaining hotel comment data: firstly, the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences is selected; it mainly comprises two kinds of emotion comments, commendatory and derogatory, and provides research data for emotion analysis.
S2: preprocessing hotel comment data: this comprises removing stop words and word segmentation. The Harbin Institute of Technology (HIT) stop word list is used as the stop word list, supplemented with self-defined high-frequency hotel-noun stop words such as "hotel", "bed" and "guest room", and the full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed.
S3: constructing the word vector matrix with the TFIDF word frequency weighting technique: the feature words are extracted, the document words are normalized, and a feature word vector matrix is constructed by evaluating the importance of certain feature words.
S4: improving the TFIDF word frequency weight algorithm: since the hotel comment data have no fixed emotion dictionary corpus, regression testing is performed on the samples in the test set text whose emotion dictionary feature word vectors are predicted inaccurately; that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary. The Hownet emotion dictionary is adopted as the basic emotion dictionary, and an improved TFIDF word frequency weight algorithm is constructed.
S5: constructing the text emotion polarity classifier with the naive Bayes algorithm: a Chinese text emotion polarity classification model weighted by the emotion dictionary is constructed for predicting hotel comment data.
S6: training the classifier model with the hotel comment corpus public data set: the hotel comment data set contains up to ten thousand pieces of data in four scales; the data set is split into two mutually exclusive sets, and all data are divided into a test set and a training set at a ratio of 1:4.
S7: classifying with the constructed emotion polarity classification model: the test set comments are the partitioned portion of the hotel comment corpus public data set.
The step S2 hotel comment data preprocessing includes removing stop words and word segmentation.
Removing stop words and word segmentation means using the defined stop word list to remove irrelevant words from the data set, including conjunctions, symbols, foreign-language tokens and the like that interfere with word segmentation and with the words used to construct the word vector matrix. This reduces the interference of stop words in the analysis of the Chinese text, and ignoring certain useless characters or words during model training improves the learning efficiency of the model. The full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed, in preparation for constructing the word vector matrix and training the model.
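The preprocessing described above can be sketched as follows; it assumes the jieba segmenter for the full-mode Chinese word segmentation, and the stop-word file name and the extra hotel-noun stop words are illustrative choices rather than values given in the patent.

```python
import jieba

def load_stopwords(path="stopwords.txt"):
    """One stop word per line; high-frequency hotel nouns are added by hand (illustrative)."""
    with open(path, encoding="utf-8") as f:
        stopwords = {line.strip() for line in f if line.strip()}
    stopwords.update({"酒店", "床", "客房"})   # hotel, bed, guest room
    return stopwords

def preprocess(text, stopwords):
    """Full-mode jieba segmentation, then drop stop words and pure whitespace tokens."""
    words = jieba.lcut(text, cut_all=True)      # full mode scans out all candidate words
    return [w for w in words if w.strip() and w not in stopwords]
```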
The step S3 TFIDF word frequency weight technique for constructing the word vector matrix includes feature extraction and TFIDF value calculation.
Feature extraction uses the TfidfVectorizer class to perform word frequency and inverse document frequency analysis on the text, forming a word frequency matrix and generating a feature vocabulary of length |V|. A matrix A of size N × |V| is then generated, where N is the number of hotel review documents and A[i, j] is the tfidf value of the j-th word of the feature vocabulary in the i-th document. A feature word that appears frequently within a document is more important for characterizing that document, but the more widely it appears across all documents, the less it helps to distinguish them; a feature word therefore receives a higher weight if it appears frequently in only a small number of documents.
The tfidf value is obtained by combining the word frequency with the inverse document frequency; it reflects the importance of meaningful features while suppressing meaningless ones.
The TFIDF algorithm used is formulated as follows:
tfidf_ij = tf_ij * idf_j = tf_ij * log(N / n_j)    (1)
tf_ij is the number of times the feature word vector t_j occurs in text d_i, i.e. the word frequency; idf_j is the inverse document frequency of the feature word vector t_j. The total number of texts is N, and the number of texts containing the feature word vector is n_j.
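A minimal sketch of this step with scikit-learn's TfidfVectorizer (the class the text appears to refer to) is given below; the two segmented reviews are illustrative stand-ins for the preprocessed hotel comment corpus, and scikit-learn applies a smoothed variant of the idf term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

segmented_reviews = [
    "房间 干净 整洁 服务 周到",          # "room clean and tidy, attentive service"
    "服务 态度 极差 前台 没有 礼貌",      # "service attitude extremely poor, rude front desk"
]
vectorizer = TfidfVectorizer()                    # tfidf = tf * idf with L2-normalised rows
A = vectorizer.fit_transform(segmented_reviews)   # N x |V| sparse tfidf matrix
vocab = vectorizer.get_feature_names_out()        # feature vocabulary of length |V|
# A[i, j] is the tfidf value of the j-th vocabulary word in the i-th review
```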
In step S4 the TFIDF word frequency weight algorithm is improved. For hotel comment data in this specific field there is no fixed emotion dictionary, stop word list and so on, which harms prediction accuracy. Regression testing is therefore performed on the samples in the test data set text whose emotion dictionary feature word vectors are predicted inaccurately; that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary, and these word vectors are given a weight based on the emotion dictionary so that they carry stronger category features. The weight formula of the improved tfidf is:
tfidf' = tfidf * (1 - tf_i),  t_j ∈ D_senti    (2)
t_j is a misclassified feature word vector in the data set and D_senti is the emotion dictionary; the weighted tfidf value is a weight that is proportional and linear with respect to the original weight value.
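A sketch of this re-weighting follows, under the reading of formula (2) that a feature word found in the emotion dictionary D_senti has its tfidf value scaled by (1 - tf_i); the dense array layout and the function name are assumptions made for illustration.

```python
import numpy as np

def reweight_tfidf(tfidf, tf, vocab, senti_dict):
    """tfidf, tf: dense (N x |V|) arrays; senti_dict: set of emotion dictionary words."""
    weighted = tfidf.copy()
    for j, word in enumerate(vocab):
        if word in senti_dict:                               # t_j ∈ D_senti
            weighted[:, j] = tfidf[:, j] * (1.0 - tf[:, j])  # formula (2)
    return weighted
```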
The step S5 of constructing the text emotion polarity classifier by the naive Bayes algorithm comprises the following steps.
S51: "naive" means that all features are assumed mutually independent. The data to be trained are T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T and x_i^(j) is the j-th feature of the i-th sample, i.e. the j-th feature word of the i-th comment text in the hotel comment data to be trained; x_i^(j) ∈ {a_j1, a_j2, ..., a_jl}, where a_jl is a value the j-th feature word may take; y_i ∈ {c_1, c_2, ..., c_K} has only the positive and negative categories, namely 0 and 1, representing the text categories of negative and positive emotions respectively.
The prior probability and the conditional probability are calculated:
P(Y = c_k) = Σ_{i=1..N} I(y_i = c_k) / N    (3)
P(X^(j) = a_jl | Y = c_k) = Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1..N} I(y_i = c_k)    (4)
S52: for given data x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T, the posterior probability is calculated:
P(Y = c_k | X = x_i) = P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (5)
S53: the probability value of the positive emotion is determined:
P(Y = 1 | X = x_i) = P(Y = 1) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = 1) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (6)
P(Y = c_k) is the probability that a text belongs to a certain category, here the probability that a hotel comment text belongs to that category, and doc(c_k) is the number of text data of category c_k, so the prior can also be written as:
P(Y = c_k) = doc(c_k) / N    (7)
To prevent the conditional probability P(X^(j) = x^(j) | Y = c_k) from taking the value 0 (a single feature with zero probability would invalidate all the other data), Laplace smoothing, i.e. Bayesian estimation, is adopted, where |V| is the number of values the feature X^(j) = a_jl can take; the same Laplace smoothing is applied to the prior probability:
P_λ(X^(j) = a_jl | Y = c_k) = ( Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) + λ ) / ( Σ_{i=1..N} I(y_i = c_k) + |V|*λ )    (8)
P_λ(Y = c_k) = ( Σ_{i=1..N} I(y_i = c_k) + λ ) / ( N + K*λ )    (9)
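The classifier can be sketched with scikit-learn's multinomial naive Bayes, whose alpha = 1 smoothing broadly corresponds to the Laplace correction in (8) and (9); the 4 x 3 weighted-tfidf matrix and the labels below are made-up illustrative values, not data from the corpus.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_train = np.array([[0.8, 0.0, 0.1],
                    [0.7, 0.1, 0.0],
                    [0.0, 0.9, 0.6],
                    [0.1, 0.8, 0.7]])
y_train = np.array([1, 1, 0, 0])        # 1 = positive emotion, 0 = negative emotion

clf = MultinomialNB(alpha=1.0)          # alpha = 1.0 plays the role of the Laplace correction
clf.fit(X_train, y_train)
print(clf.predict_proba([[0.6, 0.1, 0.0]]))   # posterior P(Y | x), columns ordered [0, 1]
```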
In step S6 the classifier model is trained with the hotel comment corpus public data set. In general, to ensure that similar data yield similar results, the test set and the training set are selected from the same data set. The hotel comment data set contains up to ten thousand pieces of data in four scales, and all data are divided into a test set and a training set at a ratio of 1:4. To guarantee the accuracy of the model's recommendations, the model can be evaluated by accuracy, precision, recall and F1 value, and the corresponding parameters are modified to keep the model optimal.
In step S7 the constructed emotion polarity classification model is used for classification. The test set comments are the partitioned portion of the hotel comment corpus public data set; the partitioned test set data are fed into the trained emotion polarity classifier based on emotion dictionary weighting to obtain evaluation indexes such as accuracy, providing support for the decisions of consumers and merchants.
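Steps S6 and S7 can be sketched as a hold-out split at a test:train ratio of 1:4 followed by the evaluation indexes named above; X and y stand for the full weighted-tfidf matrix and the 0/1 emotion labels built earlier, and the random placeholder data below only keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X = rng.random((1000, 50))              # placeholder for the N x |V| weighted tfidf matrix
y = rng.integers(0, 2, size=1000)       # placeholder for the positive/negative labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # test:train = 1:4

clf = MultinomialNB(alpha=1.0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
```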
The invention has the following beneficial effects and advantages:
(1) The classification model is processed, analyzed and trained with the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences, which contains, for example, negative comments such as "the service attitude is extremely poor, the front desk receives guests as if it had never been trained, does not understand basic politeness, and serves several guests at the same time" and positive comments such as "the room is clean and tidy, the service is acceptable, and the price is good". This improves the timeliness and pertinence of the emotion classification model and meets practical application requirements.
(2) The naive Bayes algorithm is a classification method based on Bayes' theorem and the conditional independence assumption of features; it is simple to implement and efficient in learning and prediction.
(3) The improved TFIDF algorithm extracts the text feature word vectors and the weights of the emotion dictionary feature word vectors obtained after regression testing, weakening the influence of the independence assumption; the word vectors are given a weight based on the emotion dictionary so that they carry stronger category features, compensating for the inaccurate classification caused by insufficient corpora in this specific field.
Drawings
FIG. 1 is a schematic diagram of a hotel comment emotion polarity classification method based on emotion dictionary weighting.
FIG. 2 is a flow chart of emotion classification used in the present invention.
Detailed Description
Example (b):
as shown in fig. 1, the technical solution of the present invention comprises seven steps: the method comprises the steps of obtaining hotel comment data S1, preprocessing hotel comment data S2, constructing a word vector matrix S3 by a TFIDF word frequency weighting technology, improving a TFIDF word frequency weighting algorithm S4, constructing a text emotion polarity classifier S5 by a naive Bayes algorithm, training a classifier model S6 by using hotel comment corpus public data sets, and classifying S7 by using the constructed emotion polarity classification model.
In step S1, hotel comment data are obtained: the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences is selected; it mainly comprises two kinds of emotion comments, commendatory and derogatory, and provides research data for emotion analysis.
Step S2 is hotel comment data preprocessing: stop words are removed and the text is segmented; the Harbin Institute of Technology (HIT) stop word list is used as the stop word list, and high-frequency hotel-noun stop words are self-defined. The full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed.
The step S3 TFIDF word frequency weight technique constructs a word vector matrix: and extracting the characteristic words, carrying out normalization processing on the document words, and constructing a characteristic word vector matrix by evaluating the importance of certain characteristic words.
Word frequency and inverse document frequency analysis is performed on the text with the TfidfVectorizer class to form a word frequency matrix and generate a feature vocabulary of length |V|. A matrix A of size N × |V| is then generated, where N is the number of hotel review documents and A[i, j] is the tfidf value of the j-th word of the feature vocabulary in the i-th document. The TFIDF algorithm used is formulated as follows:
tfidf_ij = tf_ij * idf_j = tf_ij * log(N / n_j)    (10)
tf_ij is the number of times the feature word vector t_j occurs in text d_i, i.e. the word frequency; idf_j is the inverse document frequency of the feature word vector t_j. The total number of texts is N, and the number of texts containing the feature word vector is n_j.
Step S4 improves the TFIDF word frequency weight algorithm: the hotel comment data have no fixed emotion dictionary corpus, so regression testing is performed on the samples in the test set text whose emotion dictionary feature word vectors are predicted inaccurately; that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary, and an improved TFIDF word frequency weight algorithm is constructed:
tfidf' = tfidf * (1 - tf_i),  t_j ∈ D_senti    (11)
t_j is a misclassified feature word vector in the data set and D_senti is the emotion dictionary; the weighted tfidf value is a weight that is proportional and linear with respect to the original weight value.
In step S5 the naive Bayes algorithm constructs the text emotion polarity classifier: a Chinese text classification model is built for predicting hotel comment data.
S51: "naive" means that all features are assumed mutually independent. The data to be trained are T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T and x_i^(j) is the j-th feature of the i-th sample, i.e. the j-th feature word of the i-th comment text in the hotel comment data to be trained; x_i^(j) ∈ {a_j1, a_j2, ..., a_jl}, where a_jl is a value the j-th feature word may take; y_i ∈ {c_1, c_2, ..., c_K} has only the positive and negative categories, namely 0 and 1, representing the text categories of negative and positive emotions respectively.
The prior probability and the conditional probability are calculated:
P(Y = c_k) = Σ_{i=1..N} I(y_i = c_k) / N    (12)
P(X^(j) = a_jl | Y = c_k) = Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1..N} I(y_i = c_k)    (13)
S52: for given data x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T, the posterior probability is calculated:
P(Y = c_k | X = x_i) = P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (14)
S53: the probability value of the positive emotion is determined:
P(Y = 1 | X = x_i) = P(Y = 1) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = 1) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (15)
P(Y = c_k) is the probability that a text belongs to a certain category, here the probability that a hotel comment text belongs to that category, and doc(c_k) is the number of text data of category c_k, so the prior can also be written as:
P(Y = c_k) = doc(c_k) / N    (16)
To prevent the conditional probability P(X^(j) = x^(j) | Y = c_k) from taking the value 0 (a single feature with zero probability would invalidate all the other data), Laplace smoothing, i.e. Bayesian estimation, is adopted, where |V| is the number of values the feature X^(j) = a_jl can take; the same Laplace smoothing is applied to the prior probability:
P_λ(X^(j) = a_jl | Y = c_k) = ( Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) + λ ) / ( Σ_{i=1..N} I(y_i = c_k) + |V|*λ )    (17)
P_λ(Y = c_k) = ( Σ_{i=1..N} I(y_i = c_k) + λ ) / ( N + K*λ )    (18)
Step S6 trains the classifier model with the hotel comment corpus public data set: the hotel comment data set contains up to ten thousand pieces of data in four scales; the data set is split into two mutually exclusive sets, and all data are divided into a test set and a training set at a ratio of 1:4.
Step S7 classifies with the constructed emotion polarity classification model: the test set comments are the partitioned portion of the hotel comment corpus public data set.
Referring to fig. 2, the workflow of the hotel comment emotion polarity classification method based on emotion dictionary weighting mainly includes the following steps:
(1) Data set selection: the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences is selected. It provides ten thousand pieces of text data and mainly comprises commendatory and derogatory emotion comments, for example negative comments such as "the service attitude is extremely poor, the front desk receives guests as if it had never been trained, does not understand even basic politeness, and serves several guests at the same time" and positive comments such as "the room is clean and tidy, the service is acceptable, and the price is good", providing research data for emotion analysis.
(2) Text preprocessing and word segmentation: stop words are removed and the text is segmented; the Harbin Institute of Technology (HIT) stop word list is used as the Chinese stop word list, supplemented with self-defined high-frequency hotel-noun stop words such as "hotel", "guest room" and "bed", and the full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed.
(3) Feature extraction: word frequency and inverse document frequency analysis is performed on the text with the improved TFIDF algorithm to form a word frequency matrix and generate a feature vocabulary of length |V|. A matrix A of size N × |V| is then generated, where N is the number of hotel review documents and A[i, j] is the tfidf value of the j-th word of the feature vocabulary in the i-th document.
(4) Preliminary naive Bayes training and learning training based on the emotion dictionary: after preprocessing, the data to be trained are fed into the naive Bayes emotion polarity classifier based on emotion dictionary weighting for preliminary training to form an emotion polarity classifier model. Regression testing is then performed on the samples in the data set text whose emotion dictionary feature word vectors are predicted inaccurately; that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary, these word vectors are given a weight based on the emotion dictionary, learning training is carried out again on the basis of the weighted emotion dictionary, and a well-trained emotion polarity classifier is generated (a sketch of this loop is given after the list).
(5) Category judgment: the test set of the partitioned hotel comment corpus public data set, i.e. the new texts to be classified, is preprocessed and its features are extracted; the data are then fed into the trained classification model for classification judgment, and the judgment results are evaluated with indexes such as accuracy, precision, recall and F1 value, so that more accurate decision support is provided for consumers and merchants.
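One reading of the regression-test loop in workflow step (4) is sketched below: feature words that are frequent in the misclassified reviews are added to the emotion dictionary, the tfidf matrix is re-weighted as in formula (11), and the classifier is trained again. The top-k selection rule, the dense array layout and the function names are assumptions, not the patent's exact procedure.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def retrain_with_dictionary(clf, tfidf, tf, y, vocab, senti_dict, top_k=20):
    """tfidf, tf: dense (N x |V|) arrays; senti_dict: mutable set of emotion dictionary words."""
    wrong = np.flatnonzero(clf.predict(tfidf) != y)          # misclassified samples
    if wrong.size == 0:
        return clf, senti_dict
    freq = tf[wrong].sum(axis=0)                             # word-frequency importance in the errors
    for j in freq.argsort()[::-1][:top_k]:
        senti_dict.add(vocab[j])                             # store the word in the emotion dictionary
    weighted = tfidf.copy()
    for j, word in enumerate(vocab):
        if word in senti_dict:                               # re-weight as in formula (11)
            weighted[:, j] = tfidf[:, j] * (1.0 - tf[:, j])
    clf = MultinomialNB(alpha=1.0).fit(weighted, y)          # learn again on the weighted matrix
    return clf, senti_dict
```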
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited to the above description, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (1)

1. A hotel comment emotion polarity classification method based on emotion dictionary weighting is characterized by comprising the following steps:
S1: obtaining hotel comment data: firstly, the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences is selected; it mainly comprises commendatory and derogatory emotion comments and provides research data for emotion analysis;
S2: preprocessing hotel comment data: stop words are removed and the text is segmented; the Harbin Institute of Technology (HIT) stop word list is used as the stop word list, supplemented with self-defined high-frequency hotel-noun stop words such as "hotel", "bed" and "guest room", and the full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed;
S3: constructing the word vector matrix with the TFIDF word frequency weighting technique: the feature words are extracted and the document words are normalized to evaluate the importance of certain feature words and construct a feature word vector matrix;
S4: improving the TFIDF word frequency weight algorithm: the hotel comment data have no fixed emotion dictionary corpus, so regression testing is performed on the samples in the test set text whose emotion dictionary feature word vectors are predicted inaccurately, that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary; the Hownet emotion dictionary is adopted as the basic emotion dictionary, and an improved TFIDF word frequency weight algorithm is constructed;
S5: constructing the text emotion polarity classifier with the naive Bayes algorithm: a Chinese text emotion polarity classification model weighted by the emotion dictionary is constructed for predicting hotel comment data;
S6: training the classifier model with the hotel comment corpus public data set: the hotel comment data set contains up to ten thousand pieces of data in four scales; the data set is split into two mutually exclusive sets by the hold-out method, and all data are divided into a test set and a training set at a ratio of 1:4;
S7: classifying with the constructed emotion polarity classification model: the test set comments are the partitioned portion of the hotel comment corpus public data set;
the step S2 of preprocessing hotel comment data comprises removing stop words and word segmentation;
removing stop words and word segmentation means using the defined stop word list to remove irrelevant words from the data set, including conjunctions, symbols, foreign-language tokens and the like that interfere with word segmentation and with the words used to construct the word vector matrix; this reduces the interference of stop words in the analysis of the Chinese text, and ignoring certain useless characters or words during model training improves the learning efficiency of the model;
the step S3 of constructing the word vector matrix with the TFIDF word frequency weighting technique comprises feature extraction and tfidf value calculation;
feature extraction uses the TfidfVectorizer class to perform word frequency and inverse document frequency analysis on the text, forming a word frequency matrix and generating a feature vocabulary of length |V|; a matrix A of size N × |V| is then generated, where N is the number of hotel review documents and A[i, j] is the tfidf value of the j-th word of the feature vocabulary in the i-th document; a feature word that appears frequently within a document is more important for characterizing that document, but the more widely it appears across all documents, the less it helps to distinguish them, so a feature word receives a higher weight if it appears frequently in only a small number of documents;
the tfidf value is obtained by combining the word frequency with the inverse document frequency; it reflects the importance of meaningful features while suppressing meaningless ones; the basic TFIDF algorithm formula is:
tfidf_ij = tf_ij * idf_j = tf_ij * log(N / n_j)    (1)
tf_ij is the number of times the feature word vector t_j occurs in text d_i, i.e. the word frequency; idf_j is the inverse document frequency of the feature word vector t_j; the total number of texts is N, and the number of texts containing the feature word vector is n_j;
the step S4 improves the TFIDF word frequency weight algorithm: for hotel comment data in this specific field there is no fixed emotion dictionary, stop word list and so on, which harms prediction accuracy; regression testing is therefore performed on the samples in the test data set text whose emotion dictionary feature word vectors are predicted inaccurately, that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary, and these word vectors are given a weight based on the emotion dictionary so that they carry stronger category features; the weight formula of the improved TFIDF is:
tfidf' = tfidf * (1 - tf_i),  t_j ∈ D_senti    (2)
t_j is a misclassified feature word vector in the data set and D_senti is the emotion dictionary; the weighted tfidf value is a weight that is proportional and linear with respect to the original weight value;
the step S5 of constructing the text emotion polarity classifier with the naive Bayes algorithm comprises the following steps;
S51: "naive" means that all features are assumed mutually independent; the data to be trained are T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T and x_i^(j) is the j-th feature of the i-th sample, i.e. the j-th feature word of the i-th comment text in the hotel comment data to be trained; x_i^(j) ∈ {a_j1, a_j2, ..., a_jl}, where a_jl is a value the j-th feature word may take; y_i ∈ {c_1, c_2, ..., c_K} has only the positive and negative categories, namely 0 and 1, representing the text categories of negative and positive emotions respectively;
calculating the prior probability and the conditional probability:
P(Y = c_k) = Σ_{i=1..N} I(y_i = c_k) / N    (3)
P(X^(j) = a_jl | Y = c_k) = Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1..N} I(y_i = c_k)    (4)
S52: for given data x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T, calculating the posterior probability:
P(Y = c_k | X = x_i) = P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (5)
S53: determining the probability value of the positive emotion:
P(Y = 1 | X = x_i) = P(Y = 1) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = 1) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (6)
said P(Y = c_k) is the probability that a text belongs to a certain category, here the probability that a hotel comment text belongs to that category, and doc(c_k) is the number of text data of category c_k, so the prior can also be written as:
P(Y = c_k) = doc(c_k) / N    (7)
to prevent the conditional probability P(X^(j) = x^(j) | Y = c_k) from taking the value 0 (a single feature with zero probability would invalidate all the other data), Laplace smoothing, i.e. Bayesian estimation, is adopted, where |V| is the number of values the feature X^(j) = a_jl can take, and the same Laplace smoothing is applied to the prior probability:
P_λ(X^(j) = a_jl | Y = c_k) = ( Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) + λ ) / ( Σ_{i=1..N} I(y_i = c_k) + |V|*λ )    (8)
P_λ(Y = c_k) = ( Σ_{i=1..N} I(y_i = c_k) + λ ) / ( N + K*λ )    (9)
the step S6 trains the classifier model with the hotel comment corpus public data set; in general, to ensure that similar data yield similar results, the test set and the training set are selected from the same data set; the hotel comment data set contains up to ten thousand pieces of data in four scales, and all data are divided into a test set and a training set at a ratio of 1:4; the accuracy of the model's recommendations can be evaluated by accuracy, precision, recall and F1 value, and the corresponding parameters are modified to keep the model optimal;
in the step S7, the constructed emotion polarity classification model is used for classification; the test set comments are the partitioned portion of the hotel comment corpus public data set, and the partitioned test set data are fed into the trained emotion polarity classifier based on emotion dictionary weighting to obtain evaluation indexes such as accuracy, providing more accurate decision support for consumers and merchants.
CN202110752833.3A 2021-07-03 2021-07-03 Hotel comment emotion polarity classification method based on emotion dictionary weighting Withdrawn CN113673239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110752833.3A CN113673239A (en) 2021-07-03 2021-07-03 Hotel comment emotion polarity classification method based on emotion dictionary weighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110752833.3A CN113673239A (en) 2021-07-03 2021-07-03 Hotel comment emotion polarity classification method based on emotion dictionary weighting

Publications (1)

Publication Number Publication Date
CN113673239A true CN113673239A (en) 2021-11-19

Family

ID=78538490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110752833.3A Withdrawn CN113673239A (en) 2021-07-03 2021-07-03 Hotel comment emotion polarity classification method based on emotion dictionary weighting

Country Status (1)

Country Link
CN (1) CN113673239A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115099241A (en) * 2022-06-30 2022-09-23 郑州信大先进技术研究院 Massive tourism network text semantic analysis method based on model fusion
CN115099241B (en) * 2022-06-30 2024-04-12 郑州信大先进技术研究院 Massive travel network text semantic analysis method based on model fusion
CN116957740A (en) * 2023-08-01 2023-10-27 哈尔滨商业大学 Agricultural product recommendation system based on word characteristics
CN116957740B (en) * 2023-08-01 2024-01-05 哈尔滨商业大学 Agricultural product recommendation system based on word characteristics
CN117114746A (en) * 2023-08-24 2023-11-24 哈尔滨工业大学 Method for predicting emotion of consumer by sudden public health event
CN118261142A (en) * 2024-05-30 2024-06-28 南京信息工程大学 Machine learning and statistical regression-based hotel text decomposition description method

Similar Documents

Publication Publication Date Title
CN107491531B (en) Chinese network comment sensibility classification method based on integrated study frame
CN111767741B (en) Text emotion analysis method based on deep learning and TFIDF algorithm
CN112001187B (en) Emotion classification system based on Chinese syntax and graph convolution neural network
CN110807320B (en) Short text emotion analysis method based on CNN bidirectional GRU attention mechanism
CN109977413A (en) A kind of sentiment analysis method based on improvement CNN-LDA
CN113673239A (en) Hotel comment emotion polarity classification method based on emotion dictionary weighting
CN109165294B (en) Short text classification method based on Bayesian classification
CN110287320A (en) A kind of deep learning of combination attention mechanism is classified sentiment analysis model more
CN106844632B (en) Product comment emotion classification method and device based on improved support vector machine
CN111143549A (en) Method for public sentiment emotion evolution based on theme
CN108763214B (en) Automatic construction method of emotion dictionary for commodity comments
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN110765769A (en) Entity attribute dependency emotion analysis method based on clause characteristics
CN113360647B (en) 5G mobile service complaint source-tracing analysis method based on clustering
CN111368082A (en) Emotion analysis method for domain adaptive word embedding based on hierarchical network
CN111353044A (en) Comment-based emotion analysis method and system
CN111966944A (en) Model construction method for multi-level user comment security audit
CN113591487A (en) Scenic spot comment emotion analysis method based on deep learning
Ahuja et al. Pragmatic Analysis of Classification Techniques based on Hyper-parameter Tuning for Sentiment Analysis
CN110569495A (en) Emotional tendency classification method and device based on user comments and storage medium
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN114942991A (en) Emotion classification model construction method based on metaphor recognition
CN112989803B (en) Entity link prediction method based on topic vector learning
CN112182227A (en) Text emotion classification system and method based on transD knowledge graph embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211119