CN113673239A - Hotel comment emotion polarity classification method based on emotion dictionary weighting - Google Patents

Hotel comment emotion polarity classification method based on emotion dictionary weighting

Info

Publication number
CN113673239A
CN113673239A
Authority
CN
China
Prior art keywords
emotion
word
data
hotel
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110752833.3A
Other languages
Chinese (zh)
Inventor
谢晓兰
李新飞
陈灵彬
刘亚荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Technology
Original Assignee
Guilin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Technology filed Critical Guilin University of Technology
Priority to CN202110752833.3A priority Critical patent/CN113673239A/en
Publication of CN113673239A publication Critical patent/CN113673239A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hotel comment emotion polarity classification method based on emotion dictionary weighting. A public Chinese hotel comment corpus is used: the hotel comment text data are cleaned and extracted, then fed into a naive Bayes emotion polarity classifier weighted by an emotion dictionary for training and testing, and the prediction results are evaluated with indexes such as accuracy, recall and F1 value, so that more accurate decision support is provided for consumers and merchants. The method serves consumers who want high-quality hotel stays that match their own interests and the hotel's service quality. It mitigates the effects of the missing hotel-specific corpus and of the independence assumption of the naive Bayes algorithm, supports the correctness of consumers' subjective choices, improves the timeliness and pertinence of the emotion polarity classification model by applying an emotion dictionary, and meets practical application requirements.

Description

Hotel comment emotion polarity classification method based on emotion dictionary weighting
Technical Field
The invention relates to the technical field of emotion polarity classification, in particular to a hotel comment emotion polarity classification method based on emotion dictionary weighting.
Background
At present, the rapid development of the internet has made the network pervade all aspects of life, and various network applications have changed people's way of living. Services such as dining and lodging have moved from offline to online, consumers publish personalized language on network platforms to express their own opinions and evaluations of a service, and these online evaluations in turn influence the intentions of other potential customers. Comment information with subjective emotional tendency can generate immeasurable value and change people's choices: it not only helps consumers make correct subjective selections, but also lets merchants extract valuable data to improve their business and services and, by analyzing the comment information, raise their ratings and revenue. The explosive growth of information data brings rich value and huge potential, but it also brings a serious "information overload" problem, making the data difficult to process and analyze by manual methods alone; and since the quality of hotel services often decides whether a customer chooses to check in, text sentiment analysis technology in the computer field (also called sentiment mining or tendency analysis) has emerged for service-type text data such as hotel comments.
Scholars at home and abroad have done a great deal of research on emotional feature analysis. Methods such as Word2Vec and TF-IDF are commonly applied for feature extraction, and emotion classification models include the Support Vector Machine (SVM), the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN) and chi-square feature selection (CHI); related work also covers classification based on OCC models of emoticons and model building based on emotion dictionaries. The invention discloses a hotel comment emotion polarity classification method based on emotion dictionary weighting. Merchants can extract valuable data from it to improve their business and services; the method applies the emotion dictionary to hotel comment data analysis, analyzes the emotion polarity of consumers' subjective texts, improves the timeliness and pertinence of the emotion classification model, and meets practical application requirements.
Disclosure of Invention
The invention aims to provide a hotel comment emotion polarity classification method based on emotion dictionary weighting, which is used for solving the problems of inaccurate classification precision and the like caused by the lack of a corpus in a hotel comment specific field and the independence assumption of a naive Bayesian algorithm, and providing more accurate decision support for consumers and merchants according to model analysis.
The technical scheme adopted by the invention is as follows: a hotel comment emotion polarity classification method based on emotion dictionary weighting comprises the following steps:
S1: obtaining hotel comment data: firstly, the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences is selected; it mainly comprises two kinds of emotion comments, commendatory and derogatory, and provides research data for emotion analysis.
S2: preprocessing hotel comment data: this comprises removing stop words and word segmentation. The Harbin Institute of Technology (HIT) stop word list is used as the stop word list, supplemented with self-defined high-frequency hotel-noun stop words such as "hotel", "bed" and "guest room", and the full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed.
S3: constructing the word vector matrix with the TFIDF word frequency weighting technique: the feature words are extracted, the document words are normalized, and a feature word vector matrix is constructed by evaluating the importance of certain feature words.
S4: improving the TFIDF word frequency weight algorithm: since the hotel comment data have no fixed emotion dictionary corpus, regression testing is performed on the samples in the test set text whose emotion dictionary feature word vectors are predicted inaccurately; that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary. The Hownet emotion dictionary is adopted as the basic emotion dictionary, and an improved TFIDF word frequency weight algorithm is constructed.
S5: constructing the text emotion polarity classifier with the naive Bayes algorithm: a Chinese text emotion polarity classification model weighted by the emotion dictionary is constructed for predicting hotel comment data.
S6: training the classifier model with the hotel comment corpus public data set: the hotel comment data set contains up to ten thousand pieces of data in four scales; the data set is split into two mutually exclusive sets, and all data are divided into a test set and a training set at a ratio of 1:4.
S7: classifying with the constructed emotion polarity classification model: the test set comments are the partitioned portion of the hotel comment corpus public data set.
The step S2 hotel comment data preprocessing includes removing stop words and word segmentation.
Removing stop words and word segmentation means using the defined stop word list to remove irrelevant words from the data set, including conjunctions, symbols, foreign-language tokens and the like that interfere with word segmentation and with the words used to construct the word vector matrix. This reduces the interference of stop words in the analysis of the Chinese text, and ignoring certain useless characters or words during model training improves the learning efficiency of the model. The full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed, in preparation for constructing the word vector matrix and training the model.
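The preprocessing described above can be sketched as follows; it assumes the jieba segmenter for the full-mode Chinese word segmentation, and the stop-word file name and the extra hotel-noun stop words are illustrative choices rather than values given in the patent.

```python
import jieba

def load_stopwords(path="stopwords.txt"):
    """One stop word per line; high-frequency hotel nouns are added by hand (illustrative)."""
    with open(path, encoding="utf-8") as f:
        stopwords = {line.strip() for line in f if line.strip()}
    stopwords.update({"酒店", "床", "客房"})   # hotel, bed, guest room
    return stopwords

def preprocess(text, stopwords):
    """Full-mode jieba segmentation, then drop stop words and pure whitespace tokens."""
    words = jieba.lcut(text, cut_all=True)      # full mode scans out all candidate words
    return [w for w in words if w.strip() and w not in stopwords]
```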
The step S3 TFIDF word frequency weight technique for constructing the word vector matrix includes feature extraction and TFIDF value calculation.
Feature extraction uses the TfidfVectorizer class to perform word frequency and inverse document frequency analysis on the text, forming a word frequency matrix and generating a feature vocabulary of length |V|. A matrix A of size N × |V| is then generated, where N is the number of hotel review documents and A[i, j] is the tfidf value of the j-th word of the feature vocabulary in the i-th document. A feature word that appears frequently within a document is more important for characterizing that document, but the more widely it appears across all documents, the less it helps to distinguish them; a feature word therefore receives a higher weight if it appears frequently in only a small number of documents.
The tfidf value is obtained by combining the word frequency with the inverse document frequency; it reflects the importance of meaningful features while suppressing meaningless ones.
The TFIDF algorithm used is formulated as follows:
tfidf_ij = tf_ij * idf_j = tf_ij * log(N / n_j)    (1)
tf_ij is the number of times the feature word vector t_j occurs in text d_i, i.e. the word frequency; idf_j is the inverse document frequency of the feature word vector t_j. The total number of texts is N, and the number of texts containing the feature word vector is n_j.
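A minimal sketch of this step with scikit-learn's TfidfVectorizer (the class the text appears to refer to) is given below; the two segmented reviews are illustrative stand-ins for the preprocessed hotel comment corpus, and scikit-learn applies a smoothed variant of the idf term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

segmented_reviews = [
    "房间 干净 整洁 服务 周到",          # "room clean and tidy, attentive service"
    "服务 态度 极差 前台 没有 礼貌",      # "service attitude extremely poor, rude front desk"
]
vectorizer = TfidfVectorizer()                    # tfidf = tf * idf with L2-normalised rows
A = vectorizer.fit_transform(segmented_reviews)   # N x |V| sparse tfidf matrix
vocab = vectorizer.get_feature_names_out()        # feature vocabulary of length |V|
# A[i, j] is the tfidf value of the j-th vocabulary word in the i-th review
```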
In step S4 the TFIDF word frequency weight algorithm is improved. For hotel comment data in this specific field there is no fixed emotion dictionary, stop word list and so on, which harms prediction accuracy. Regression testing is therefore performed on the samples in the test data set text whose emotion dictionary feature word vectors are predicted inaccurately; that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary, and these word vectors are given a weight based on the emotion dictionary so that they carry stronger category features. The weight formula of the improved tfidf is:
tfidf' = tfidf * (1 - tf_i),  t_j ∈ D_senti    (2)
t_j is a misclassified feature word vector in the data set and D_senti is the emotion dictionary; the weighted tfidf value is a weight that is proportional and linear with respect to the original weight value.
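A sketch of this re-weighting follows, under the reading of formula (2) that a feature word found in the emotion dictionary D_senti has its tfidf value scaled by (1 - tf_i); the dense array layout and the function name are assumptions made for illustration.

```python
import numpy as np

def reweight_tfidf(tfidf, tf, vocab, senti_dict):
    """tfidf, tf: dense (N x |V|) arrays; senti_dict: set of emotion dictionary words."""
    weighted = tfidf.copy()
    for j, word in enumerate(vocab):
        if word in senti_dict:                               # t_j ∈ D_senti
            weighted[:, j] = tfidf[:, j] * (1.0 - tf[:, j])  # formula (2)
    return weighted
```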
The step S5 of constructing the text emotion polarity classifier by the naive Bayes algorithm comprises the following steps.
S51: "naive" means that all features are assumed mutually independent. The data to be trained are T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T and x_i^(j) is the j-th feature of the i-th sample, i.e. the j-th feature word of the i-th comment text in the hotel comment data to be trained; x_i^(j) ∈ {a_j1, a_j2, ..., a_jl}, where a_jl is a value the j-th feature word may take; y_i ∈ {c_1, c_2, ..., c_K} has only the positive and negative categories, namely 0 and 1, representing the text categories of negative and positive emotions respectively.
The prior probability and the conditional probability are calculated:
P(Y = c_k) = Σ_{i=1..N} I(y_i = c_k) / N    (3)
P(X^(j) = a_jl | Y = c_k) = Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1..N} I(y_i = c_k)    (4)
S52: for given data x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T, the posterior probability is calculated:
P(Y = c_k | X = x_i) = P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (5)
S53: the probability value of the positive emotion is determined:
P(Y = 1 | X = x_i) = P(Y = 1) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = 1) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (6)
P(Y = c_k) is the probability that a text belongs to a certain category, here the probability that a hotel comment text belongs to that category, and doc(c_k) is the number of text data of category c_k, so the prior can also be written as:
P(Y = c_k) = doc(c_k) / N    (7)
To prevent the conditional probability P(X^(j) = x^(j) | Y = c_k) from taking the value 0 (a single feature with zero probability would invalidate all the other data), Laplace smoothing, i.e. Bayesian estimation, is adopted, where |V| is the number of values the feature X^(j) = a_jl can take; the same Laplace smoothing is applied to the prior probability:
P_λ(X^(j) = a_jl | Y = c_k) = ( Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) + λ ) / ( Σ_{i=1..N} I(y_i = c_k) + |V|*λ )    (8)
P_λ(Y = c_k) = ( Σ_{i=1..N} I(y_i = c_k) + λ ) / ( N + K*λ )    (9)
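The classifier can be sketched with scikit-learn's multinomial naive Bayes, whose alpha = 1 smoothing broadly corresponds to the Laplace correction in (8) and (9); the 4 x 3 weighted-tfidf matrix and the labels below are made-up illustrative values, not data from the corpus.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_train = np.array([[0.8, 0.0, 0.1],
                    [0.7, 0.1, 0.0],
                    [0.0, 0.9, 0.6],
                    [0.1, 0.8, 0.7]])
y_train = np.array([1, 1, 0, 0])        # 1 = positive emotion, 0 = negative emotion

clf = MultinomialNB(alpha=1.0)          # alpha = 1.0 plays the role of the Laplace correction
clf.fit(X_train, y_train)
print(clf.predict_proba([[0.6, 0.1, 0.0]]))   # posterior P(Y | x), columns ordered [0, 1]
```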
In step S6 the classifier model is trained with the hotel comment corpus public data set. In general, to ensure that similar data yield similar results, the test set and the training set are selected from the same data set. The hotel comment data set contains up to ten thousand pieces of data in four scales, and all data are divided into a test set and a training set at a ratio of 1:4. To guarantee the accuracy of the model's recommendations, the model can be evaluated by accuracy, precision, recall and F1 value, and the corresponding parameters are modified to keep the model optimal.
In step S7 the constructed emotion polarity classification model is used for classification. The test set comments are the partitioned portion of the hotel comment corpus public data set; the partitioned test set data are fed into the trained emotion polarity classifier based on emotion dictionary weighting to obtain evaluation indexes such as accuracy, providing support for the decisions of consumers and merchants.
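Steps S6 and S7 can be sketched as a hold-out split at a test:train ratio of 1:4 followed by the evaluation indexes named above; X and y stand for the full weighted-tfidf matrix and the 0/1 emotion labels built earlier, and the random placeholder data below only keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X = rng.random((1000, 50))              # placeholder for the N x |V| weighted tfidf matrix
y = rng.integers(0, 2, size=1000)       # placeholder for the positive/negative labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # test:train = 1:4

clf = MultinomialNB(alpha=1.0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
```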
The invention has the following beneficial effects and advantages:
(1) The classification model is processed, analyzed and trained with the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences, which contains, for example, negative comments such as "the service attitude is extremely poor, the front desk receives guests as if it had never been trained, does not understand basic politeness, and serves several guests at the same time" and positive comments such as "the room is clean and tidy, the service is acceptable, and the price is good". This improves the timeliness and pertinence of the emotion classification model and meets practical application requirements.
(2) The naive Bayes algorithm is a classification method based on Bayes' theorem and the conditional independence assumption of features; it is simple to implement and efficient in learning and prediction.
(3) The improved TFIDF algorithm extracts the text feature word vectors and the weights of the emotion dictionary feature word vectors obtained after regression testing, weakening the influence of the independence assumption; the word vectors are given a weight based on the emotion dictionary so that they carry stronger category features, compensating for the inaccurate classification caused by insufficient corpora in this specific field.
Drawings
FIG. 1 is a schematic diagram of a hotel comment emotion polarity classification method based on emotion dictionary weighting.
FIG. 2 is a flow chart of emotion classification used in the present invention.
Detailed Description
Example (b):
as shown in fig. 1, the technical solution of the present invention comprises seven steps: the method comprises the steps of obtaining hotel comment data S1, preprocessing hotel comment data S2, constructing a word vector matrix S3 by a TFIDF word frequency weighting technology, improving a TFIDF word frequency weighting algorithm S4, constructing a text emotion polarity classifier S5 by a naive Bayes algorithm, training a classifier model S6 by using hotel comment corpus public data sets, and classifying S7 by using the constructed emotion polarity classification model.
In step S1, hotel comment data are obtained: the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences is selected; it mainly comprises two kinds of emotion comments, commendatory and derogatory, and provides research data for emotion analysis.
Step S2 is hotel comment data preprocessing: stop words are removed and the text is segmented; the Harbin Institute of Technology (HIT) stop word list is used as the stop word list, and high-frequency hotel-noun stop words are self-defined. The full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed.
The step S3 TFIDF word frequency weight technique constructs a word vector matrix: and extracting the characteristic words, carrying out normalization processing on the document words, and constructing a characteristic word vector matrix by evaluating the importance of certain characteristic words.
Word frequency and inverse document frequency analysis is performed on the text with the TfidfVectorizer class to form a word frequency matrix and generate a feature vocabulary of length |V|. A matrix A of size N × |V| is then generated, where N is the number of hotel review documents and A[i, j] is the tfidf value of the j-th word of the feature vocabulary in the i-th document. The TFIDF algorithm used is formulated as follows:
tfidf_ij = tf_ij * idf_j = tf_ij * log(N / n_j)    (10)
tf_ij is the number of times the feature word vector t_j occurs in text d_i, i.e. the word frequency; idf_j is the inverse document frequency of the feature word vector t_j. The total number of texts is N, and the number of texts containing the feature word vector is n_j.
Step S4 improves the TFIDF word frequency weight algorithm: the hotel comment data have no fixed emotion dictionary corpus, so regression testing is performed on the samples in the test set text whose emotion dictionary feature word vectors are predicted inaccurately; that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary, and an improved TFIDF word frequency weight algorithm is constructed:
tfidf' = tfidf * (1 - tf_i),  t_j ∈ D_senti    (11)
t_j is a misclassified feature word vector in the data set and D_senti is the emotion dictionary; the weighted tfidf value is a weight that is proportional and linear with respect to the original weight value.
In step S5 the naive Bayes algorithm constructs the text emotion polarity classifier: a Chinese text classification model is built for predicting hotel comment data.
S51: "naive" means that all features are assumed mutually independent. The data to be trained are T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T and x_i^(j) is the j-th feature of the i-th sample, i.e. the j-th feature word of the i-th comment text in the hotel comment data to be trained; x_i^(j) ∈ {a_j1, a_j2, ..., a_jl}, where a_jl is a value the j-th feature word may take; y_i ∈ {c_1, c_2, ..., c_K} has only the positive and negative categories, namely 0 and 1, representing the text categories of negative and positive emotions respectively.
The prior probability and the conditional probability are calculated:
P(Y = c_k) = Σ_{i=1..N} I(y_i = c_k) / N    (12)
P(X^(j) = a_jl | Y = c_k) = Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1..N} I(y_i = c_k)    (13)
S52: for given data x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T, the posterior probability is calculated:
P(Y = c_k | X = x_i) = P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (14)
S53: the probability value of the positive emotion is determined:
P(Y = 1 | X = x_i) = P(Y = 1) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = 1) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (15)
P(Y = c_k) is the probability that a text belongs to a certain category, here the probability that a hotel comment text belongs to that category, and doc(c_k) is the number of text data of category c_k, so the prior can also be written as:
P(Y = c_k) = doc(c_k) / N    (16)
To prevent the conditional probability P(X^(j) = x^(j) | Y = c_k) from taking the value 0 (a single feature with zero probability would invalidate all the other data), Laplace smoothing, i.e. Bayesian estimation, is adopted, where |V| is the number of values the feature X^(j) = a_jl can take; the same Laplace smoothing is applied to the prior probability:
P_λ(X^(j) = a_jl | Y = c_k) = ( Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) + λ ) / ( Σ_{i=1..N} I(y_i = c_k) + |V|*λ )    (17)
P_λ(Y = c_k) = ( Σ_{i=1..N} I(y_i = c_k) + λ ) / ( N + K*λ )    (18)
Step S6 trains the classifier model with the hotel comment corpus public data set: the hotel comment data set contains up to ten thousand pieces of data in four scales; the data set is split into two mutually exclusive sets, and all data are divided into a test set and a training set at a ratio of 1:4.
Step S7 classifies with the constructed emotion polarity classification model: the test set comments are the partitioned portion of the hotel comment corpus public data set.
Referring to fig. 2, the workflow of the hotel comment emotion polarity classification method based on emotion dictionary weighting mainly includes the following steps:
(1) Data set selection: the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences is selected. It provides ten thousand pieces of text data and mainly comprises commendatory and derogatory emotion comments, for example negative comments such as "the service attitude is extremely poor, the front desk receives guests as if it had never been trained, does not understand even basic politeness, and serves several guests at the same time" and positive comments such as "the room is clean and tidy, the service is acceptable, and the price is good", providing research data for emotion analysis.
(2) Text preprocessing and word segmentation: stop words are removed and the text is segmented; the Harbin Institute of Technology (HIT) stop word list is used as the Chinese stop word list, supplemented with self-defined high-frequency hotel-noun stop words such as "hotel", "guest room" and "bed", and the full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed.
(3) Feature extraction: word frequency and inverse document frequency analysis is performed on the text with the improved TFIDF algorithm to form a word frequency matrix and generate a feature vocabulary of length |V|. A matrix A of size N × |V| is then generated, where N is the number of hotel review documents and A[i, j] is the tfidf value of the j-th word of the feature vocabulary in the i-th document.
(4) Preliminary naive Bayes training and learning training based on the emotion dictionary: after preprocessing, the data to be trained are fed into the naive Bayes emotion polarity classifier based on emotion dictionary weighting for preliminary training to form an emotion polarity classifier model. Regression testing is then performed on the samples in the data set text whose emotion dictionary feature word vectors are predicted inaccurately; that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary, these word vectors are given a weight based on the emotion dictionary, learning training is carried out again on the basis of the weighted emotion dictionary, and a well-trained emotion polarity classifier is generated (a sketch of this loop is given after the list).
(5) Category judgment: the test set of the partitioned hotel comment corpus public data set, i.e. the new texts to be classified, is preprocessed and its features are extracted; the data are then fed into the trained classification model for classification judgment, and the judgment results are evaluated with indexes such as accuracy, precision, recall and F1 value, so that more accurate decision support is provided for consumers and merchants.
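One reading of the regression-test loop in workflow step (4) is sketched below: feature words that are frequent in the misclassified reviews are added to the emotion dictionary, the tfidf matrix is re-weighted as in formula (11), and the classifier is trained again. The top-k selection rule, the dense array layout and the function names are assumptions, not the patent's exact procedure.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def retrain_with_dictionary(clf, tfidf, tf, y, vocab, senti_dict, top_k=20):
    """tfidf, tf: dense (N x |V|) arrays; senti_dict: mutable set of emotion dictionary words."""
    wrong = np.flatnonzero(clf.predict(tfidf) != y)          # misclassified samples
    if wrong.size == 0:
        return clf, senti_dict
    freq = tf[wrong].sum(axis=0)                             # word-frequency importance in the errors
    for j in freq.argsort()[::-1][:top_k]:
        senti_dict.add(vocab[j])                             # store the word in the emotion dictionary
    weighted = tfidf.copy()
    for j, word in enumerate(vocab):
        if word in senti_dict:                               # re-weight as in formula (11)
            weighted[:, j] = tfidf[:, j] * (1.0 - tf[:, j])
    clf = MultinomialNB(alpha=1.0).fit(weighted, y)          # learn again on the weighted matrix
    return clf, senti_dict
```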
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited to the above description, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (1)

1. A hotel comment emotion polarity classification method based on emotion dictionary weighting is characterized by comprising the following steps:
S1: obtaining hotel comment data: firstly, the large-scale public Chinese hotel comment corpus provided by Dr. Tan Songbo of the Chinese Academy of Sciences is selected; it mainly comprises commendatory and derogatory emotion comments and provides research data for emotion analysis;
S2: preprocessing hotel comment data: stop words are removed and the text is segmented; the Harbin Institute of Technology (HIT) stop word list is used as the stop word list, supplemented with self-defined high-frequency hotel-noun stop words such as "hotel", "bed" and "guest room", and the full mode of the jieba Chinese word segmenter is adopted to quickly scan out the words of the sentences from which the stop words have been removed;
S3: constructing the word vector matrix with the TFIDF word frequency weighting technique: the feature words are extracted and the document words are normalized to evaluate the importance of certain feature words and construct a feature word vector matrix;
S4: improving the TFIDF word frequency weight algorithm: the hotel comment data have no fixed emotion dictionary corpus, so regression testing is performed on the samples in the test set text whose emotion dictionary feature word vectors are predicted inaccurately, that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary; the Hownet emotion dictionary is adopted as the basic emotion dictionary, and an improved TFIDF word frequency weight algorithm is constructed;
S5: constructing the text emotion polarity classifier with the naive Bayes algorithm: a Chinese text emotion polarity classification model weighted by the emotion dictionary is constructed for predicting hotel comment data;
S6: training the classifier model with the hotel comment corpus public data set: the hotel comment data set contains up to ten thousand pieces of data in four scales; the data set is split into two mutually exclusive sets by the hold-out method, and all data are divided into a test set and a training set at a ratio of 1:4;
S7: classifying with the constructed emotion polarity classification model: the test set comments are the partitioned portion of the hotel comment corpus public data set;
the step S2 of preprocessing hotel comment data comprises removing stop words and word segmentation;
removing stop words and word segmentation means using the defined stop word list to remove irrelevant words from the data set, including conjunctions, symbols, foreign-language tokens and the like that interfere with word segmentation and with the words used to construct the word vector matrix; this reduces the interference of stop words in the analysis of the Chinese text, and ignoring certain useless characters or words during model training improves the learning efficiency of the model;
the step S3 of constructing the word vector matrix with the TFIDF word frequency weighting technique comprises feature extraction and tfidf value calculation;
feature extraction uses the TfidfVectorizer class to perform word frequency and inverse document frequency analysis on the text, forming a word frequency matrix and generating a feature vocabulary of length |V|; a matrix A of size N × |V| is then generated, where N is the number of hotel review documents and A[i, j] is the tfidf value of the j-th word of the feature vocabulary in the i-th document; a feature word that appears frequently within a document is more important for characterizing that document, but the more widely it appears across all documents, the less it helps to distinguish them, so a feature word receives a higher weight if it appears frequently in only a small number of documents;
the tfidf value is obtained by combining the word frequency with the inverse document frequency; it reflects the importance of meaningful features while suppressing meaningless ones; the basic TFIDF algorithm formula is:
tfidf_ij = tf_ij * idf_j = tf_ij * log(N / n_j)    (1)
tf_ij is the number of times the feature word vector t_j occurs in text d_i, i.e. the word frequency; idf_j is the inverse document frequency of the feature word vector t_j; the total number of texts is N, and the number of texts containing the feature word vector is n_j;
the step S4 improves the TFIDF word frequency weight algorithm: for hotel comment data in this specific field there is no fixed emotion dictionary, stop word list and so on, which harms prediction accuracy; regression testing is therefore performed on the samples in the test data set text whose emotion dictionary feature word vectors are predicted inaccurately, that is, the emotion feature word vectors in the misclassified samples are regressed again according to their word frequency importance and stored in the emotion dictionary, and these word vectors are given a weight based on the emotion dictionary so that they carry stronger category features; the weight formula of the improved TFIDF is:
tfidf' = tfidf * (1 - tf_i),  t_j ∈ D_senti    (2)
t_j is a misclassified feature word vector in the data set and D_senti is the emotion dictionary; the weighted tfidf value is a weight that is proportional and linear with respect to the original weight value;
the step S5 of constructing the text emotion polarity classifier with the naive Bayes algorithm comprises the following steps;
S51: "naive" means that all features are assumed mutually independent; the data to be trained are T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T and x_i^(j) is the j-th feature of the i-th sample, i.e. the j-th feature word of the i-th comment text in the hotel comment data to be trained; x_i^(j) ∈ {a_j1, a_j2, ..., a_jl}, where a_jl is a value the j-th feature word may take; y_i ∈ {c_1, c_2, ..., c_K} has only the positive and negative categories, namely 0 and 1, representing the text categories of negative and positive emotions respectively;
calculating the prior probability and the conditional probability:
P(Y = c_k) = Σ_{i=1..N} I(y_i = c_k) / N    (3)
P(X^(j) = a_jl | Y = c_k) = Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1..N} I(y_i = c_k)    (4)
S52: for given data x_i = (x_i^(1), x_i^(2), ..., x_i^(n))^T, calculating the posterior probability:
P(Y = c_k | X = x_i) = P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (5)
S53: determining the probability value of the positive emotion:
P(Y = 1 | X = x_i) = P(Y = 1) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = 1) / Σ_{k=1..K} [ P(Y = c_k) Π_{j=1..n} P(X^(j) = x_i^(j) | Y = c_k) ]    (6)
said P(Y = c_k) is the probability that a text belongs to a certain category, here the probability that a hotel comment text belongs to that category, and doc(c_k) is the number of text data of category c_k, so the prior can also be written as:
P(Y = c_k) = doc(c_k) / N    (7)
to prevent the conditional probability P(X^(j) = x^(j) | Y = c_k) from taking the value 0 (a single feature with zero probability would invalidate all the other data), Laplace smoothing, i.e. Bayesian estimation, is adopted, where |V| is the number of values the feature X^(j) = a_jl can take, and the same Laplace smoothing is applied to the prior probability:
P_λ(X^(j) = a_jl | Y = c_k) = ( Σ_{i=1..N} I(x_i^(j) = a_jl, y_i = c_k) + λ ) / ( Σ_{i=1..N} I(y_i = c_k) + |V|*λ )    (8)
P_λ(Y = c_k) = ( Σ_{i=1..N} I(y_i = c_k) + λ ) / ( N + K*λ )    (9)
the step S6 trains the classifier model with the hotel comment corpus public data set; in general, to ensure that similar data yield similar results, the test set and the training set are selected from the same data set; the hotel comment data set contains up to ten thousand pieces of data in four scales, and all data are divided into a test set and a training set at a ratio of 1:4; the accuracy of the model's recommendations can be evaluated by accuracy, precision, recall and F1 value, and the corresponding parameters are modified to keep the model optimal;
in the step S7, the constructed emotion polarity classification model is used for classification; the test set comments are the partitioned portion of the hotel comment corpus public data set, and the partitioned test set data are fed into the trained emotion polarity classifier based on emotion dictionary weighting to obtain evaluation indexes such as accuracy, providing more accurate decision support for consumers and merchants.
CN202110752833.3A 2021-07-03 2021-07-03 Hotel comment emotion polarity classification method based on emotion dictionary weighting Withdrawn CN113673239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110752833.3A CN113673239A (en) 2021-07-03 2021-07-03 Hotel comment emotion polarity classification method based on emotion dictionary weighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110752833.3A CN113673239A (en) 2021-07-03 2021-07-03 Hotel comment emotion polarity classification method based on emotion dictionary weighting

Publications (1)

Publication Number Publication Date
CN113673239A true CN113673239A (en) 2021-11-19

Family

ID=78538490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110752833.3A Withdrawn CN113673239A (en) 2021-07-03 2021-07-03 Hotel comment emotion polarity classification method based on emotion dictionary weighting

Country Status (1)

Country Link
CN (1) CN113673239A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115099241A (en) * 2022-06-30 2022-09-23 郑州信大先进技术研究院 Massive tourism network text semantic analysis method based on model fusion
CN115099241B (en) * 2022-06-30 2024-04-12 郑州信大先进技术研究院 Massive travel network text semantic analysis method based on model fusion
CN116957740A (en) * 2023-08-01 2023-10-27 哈尔滨商业大学 Agricultural product recommendation system based on word characteristics
CN116957740B (en) * 2023-08-01 2024-01-05 哈尔滨商业大学 Agricultural product recommendation system based on word characteristics
CN117114746A (en) * 2023-08-24 2023-11-24 哈尔滨工业大学 Method for predicting emotion of consumer by sudden public health event
CN118261142A (en) * 2024-05-30 2024-06-28 南京信息工程大学 Machine learning and statistical regression-based hotel text decomposition description method

Similar Documents

Publication Publication Date Title
CN107491531B (en) Chinese network comment sensibility classification method based on integrated study frame
CN111767741B (en) Text emotion analysis method based on deep learning and TFIDF algorithm
CN112001187B (en) Emotion classification system based on Chinese syntax and graph convolution neural network
CN110807320B (en) Short text emotion analysis method based on CNN bidirectional GRU attention mechanism
CN109977413A (en) A kind of sentiment analysis method based on improvement CNN-LDA
CN113673239A (en) Hotel comment emotion polarity classification method based on emotion dictionary weighting
CN109165294B (en) Short text classification method based on Bayesian classification
CN110287320A (en) A kind of deep learning of combination attention mechanism is classified sentiment analysis model more
CN106844632B (en) Product comment emotion classification method and device based on improved support vector machine
CN111143549A (en) Method for public sentiment emotion evolution based on theme
CN108763214B (en) Automatic construction method of emotion dictionary for commodity comments
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN110765769A (en) Entity attribute dependency emotion analysis method based on clause characteristics
CN113360647B (en) 5G mobile service complaint source-tracing analysis method based on clustering
CN111368082A (en) Emotion analysis method for domain adaptive word embedding based on hierarchical network
CN111353044A (en) Comment-based emotion analysis method and system
CN111966944A (en) Model construction method for multi-level user comment security audit
CN113591487A (en) Scenic spot comment emotion analysis method based on deep learning
Ahuja et al. Pragmatic Analysis of Classification Techniques based on Hyper-parameter Tuning for Sentiment Analysis
CN110569495A (en) Emotional tendency classification method and device based on user comments and storage medium
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN114942991A (en) Emotion classification model construction method based on metaphor recognition
CN112989803B (en) Entity link prediction method based on topic vector learning
CN112182227A (en) Text emotion classification system and method based on transD knowledge graph embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211119