CN108804416B - Training method for film evaluation emotion tendency analysis based on machine learning - Google Patents

Training method for film evaluation emotion tendency analysis based on machine learning

Info

Publication number
CN108804416B
CN108804416B, CN201810480816.7A, CN201810480816A
Authority
CN
China
Prior art keywords
feature
training set
probability
text
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810480816.7A
Other languages
Chinese (zh)
Other versions
CN108804416A (en)
Inventor
赵丹丹 (Zhao Dandan)
高宠 (Gao Chong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University filed Critical Dalian Minzu University
Priority to CN201810480816.7A priority Critical patent/CN108804416B/en
Publication of CN108804416A publication Critical patent/CN108804416A/en
Application granted granted Critical
Publication of CN108804416B publication Critical patent/CN108804416B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A training method for machine-learning-based sentiment analysis of movie reviews belongs to the field of natural language processing and aims to improve on the movie-review sentiment-analysis accuracy of existing sentiment dictionaries. Its key technical points are: downloading movie reviews; selecting feature words; representing each review as a feature vector over the feature word set; and training a classifier constructed on the naive Bayes idea by randomly dividing the feature-vector text into a training set and attaching a positive or negative label to each training-set feature vector. The overall tendency of reviews containing equal numbers of positive and negative words can thereby be judged more accurately.

Description

Training method for film evaluation emotion tendency analysis based on machine learning
Technical Field
The invention belongs to the field of natural language processing and relates to a training method for machine-learning-based sentiment analysis of movie reviews.
Background
More and more users publish their opinions, attitudes and emotions on forums, shopping sites, review sites, microblogs and the like; if the emotional content of these texts can be analyzed, the comments provide a large amount of useful information, such as reviews of a movie or evaluations of a product. By analyzing subjective, emotionally colored text, a user's attitude can be identified as positive, negative or neutral. Such analysis has many real-world applications: sentiment analysis of microblog users has been used to forecast stock trends, movie box office and election results; it can reveal how users feel about companies and products, guide the improvement of products and services, and expose the strengths and weaknesses of competitors.
In the prior art, sentiment analysis of text is mainly Chinese sentiment analysis based on a sentiment dictionary, whose entries may be single characters or words. According to the polarity of the sentiment words it contains, a sentiment dictionary is divided into a commendatory (positive) dictionary and a derogatory (negative) dictionary; the sentiment score of a whole sentence is computed from the polarity and intensity of its dictionary words, and the sentence's sentiment tendency is finally obtained from that score.
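For concreteness, the following is a minimal sketch of such dictionary-based scoring, i.e., the prior art described above rather than the claimed invention; the toy dictionaries, their intensity weights and the function name are illustrative assumptions:

    # Minimal sketch of dictionary-based sentiment scoring (prior art).
    # The tiny dictionaries and their intensity weights are hypothetical.
    POSITIVE = {"精彩": 2.0, "喜欢": 1.5, "紧凑": 1.0}   # word -> intensity
    NEGATIVE = {"无聊": -2.0, "失望": -1.5}

    def dictionary_score(words):
        """Sum polarity times intensity over dictionary hits; the sign gives the tendency."""
        score = sum(POSITIVE.get(w, 0.0) + NEGATIVE.get(w, 0.0) for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(dictionary_score(["剧情", "紧凑", "精彩"]))    # -> positive

As the sketch suggests, such scoring depends entirely on dictionary coverage, which is the weakness the present method addresses.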
Disclosure of Invention
In order to improve on the movie-review sentiment-analysis accuracy of existing sentiment dictionaries, the invention provides the following technical scheme: a training method for machine-learning-based movie-review sentiment analysis, comprising the following steps:
Step 1: downloading movie reviews;
Step 2: selecting feature words: from the downloaded reviews, extracting a set of meaningful sentiment words as the feature word set, each word in the set being a feature word;
Step 3: for the downloaded reviews, using the feature word set to represent each review as a feature vector, wherein the set of positive feature vectors is the positive feature text and the set of negative feature vectors is the negative feature text, and equal numbers of positive and negative feature vectors are selected to form the feature-vector text;
Step 4: randomly dividing the feature-vector text into a training set, attaching a positive or negative label to each training-set feature vector, and training a classifier constructed on the naive Bayes idea.
Further, the classifier constructed on the naive Bayes idea is trained by the following steps:
dividing the training-set feature vectors into a positive-class feature-vector text and a negative-class feature-vector text, thereby constructing the training set's positive and negative feature-vector texts, and calculating the probability with which each class appears in the training set;
calculating, per class, the probability with which each feature word of the feature word set appears in that class's feature-vector text of the training set;
calculating the probability that each feature word of the feature word set appears in each class of the training set.
Further, the probability with which each class appears in the training set is calculated as follows: the proportions of negative-class and positive-class feature vectors in the training set are calculated, wherein the set of positive-class feature vectors in the training set is the training-set positive feature-vector text and the set of negative-class feature vectors is the training-set negative feature-vector text.
Per class, the probability with which a feature word appears in that class's feature-vector text of the training set: the ratio of the number of occurrences of the feature word in the training-set positive feature-vector text to the total number of feature words in that text is calculated, and likewise the ratio of its occurrences in the training-set negative feature-vector text to the total number of feature words in that text.
The probability that a feature word appears in each class of the training set: the proportion of the feature word's occurrences falling in the training-set positive feature-vector text, relative to its occurrences in the whole training set, is calculated, and likewise the proportion falling in the training-set negative feature-vector text.
Further, each review is represented as a feature vector using the feature word set as follows: for each feature word in the set, whether it appears in the review is judged, marking 1 if it does and 0 otherwise; the resulting array converts the review into a feature representation that serves as its feature vector.
Further, the probability of each class appearing in the training set is calculated as p(C_i), which comprises the negative-class probability and the positive-class probability. Writing N_i for the number of class-C_i feature vectors in the training set:

p(C_i) = N_i / (N_0 + N_1)

Negative-class probability:

p(C_0) = N_0 / (N_0 + N_1)

Positive-class probability:

p(C_1) = N_1 / (N_0 + N_1)

C_i denotes the feature-vector text of the corresponding class, i = 0, 1.
Calculating, per class, the probability with which each feature word appears in that class's feature-vector text of the training set, i.e., p(w_j|C_i), comprising the probabilities of the feature words appearing in the negative and in the positive feature-vector texts of the training set. Writing count(w_j, C_i) for the number of occurrences of feature word w_j in the class-C_i feature-vector text of the training set:

p(w_j|C_i) = count(w_j, C_i) / Σ_k count(w_k, C_i)

Probabilities of the feature words appearing in the negative feature-vector text of the training set:

p(w_j|C_0) = [p(w_0|C_0), p(w_1|C_0), p(w_2|C_0), …, p(w_n|C_0)]

Probabilities of the feature words appearing in the positive feature-vector text of the training set:

p(w_j|C_1) = [p(w_0|C_1), p(w_1|C_1), p(w_2|C_1), …, p(w_n|C_1)]

C_i denotes the feature-vector text of the corresponding class, i = 0, 1; w_j denotes a feature word in the feature word set, j = 1, 2, …, n, where n is the number of feature words in the set.
Further, the probability that each feature word appears in each class's vector text of the training set is calculated as p(C_i|w_j), comprising the probability that a feature word appears in the negative class of the training set and the probability that it appears in the positive class:

p(C_i|w_j) = count(w_j, C_i) / (count(w_j, C_0) + count(w_j, C_1))

Probability that a feature word appears in the negative class of the training set:

p(C_0|w_j) = [p(C_0|w_0), p(C_0|w_1), p(C_0|w_2), …, p(C_0|w_n)]

Probability that a feature word appears in the positive class of the training set:

p(C_1|w_j) = [p(C_1|w_0), p(C_1|w_1), p(C_1|w_2), …, p(C_1|w_n)]

C_i denotes the feature-vector text of the corresponding class, i = 0, 1; w_j denotes a feature word in the feature word set, j = 1, 2, …, n, where n is the number of feature words in the set.
Beneficial effects: the training method is a preliminary step of movie-review sentiment analysis. It establishes the use of machine learning for analyzing review sentiment and, correspondingly, provides a training procedure adapted to movie reviews: each review is represented as a feature vector over the feature word set, the feature-vector text is randomly divided into a training set, and a classifier constructed on the naive Bayes idea is trained.
Drawings
Fig. 1 is a flowchart of the machine-learning-based movie-review sentiment analysis method of embodiment 1;
FIG. 2 is a diagram of the result of sentence-stem extraction with jieba;
FIG. 3 is a graph comparing the classification results of the invention with Bernoulli naive Bayes classification results;
wherein: the solid line is the classification result of the invention and the dotted line the result of Bernoulli naive Bayes classification; the y-axis is accuracy and the x-axis the different test samples;
FIG. 4 is a schematic diagram of classifier construction.
Detailed Description
Example 1:
This embodiment provides a method for judging sentiment tendency, aimed at the sentiment analysis of Chinese movie reviews, and mainly comprising a training method, a testing method and an analysis method.
The technical scheme disclosed by the embodiment is as follows:
A machine-learning-based movie-review sentiment analysis method comprises the following steps:
Step 1: writing a crawler to download Douban movie reviews, the downloaded reviews forming a corpus.
Step (a): obtain from Douban the URLs of the movies whose reviews are to be downloaded.
Step (b): download each movie's reviews together with the movie name, reviewer, rating, review time and other information, and store them in csv format.
Step 2: extracting features to form the corpus's feature set:
From the downloaded reviews (i.e., the reviews in the corpus), meaningful sentiment words are extracted as feature words. A single extraction method misses many valuable feature words, so in one embodiment the feature words are extracted by combining the following two approaches, which improves the yield of valuable feature words.
Step (a): segment all reviews in the corpus with jieba and extract the adjectives, idioms, distinguishing words and verbs as the feature set.
Step (b): extract the sentence stems of all reviews in the corpus with jieba, and add the stem words to the feature set.
Step (c): stop words may be present in the feature set, so they are removed using a stop-word dictionary.
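A minimal sketch of steps (a) and (c) using the jieba library is given below; the POS tag prefixes a, i, b and v correspond to adjectives, idioms, distinguishing words and verbs, the stop-word file name is a hypothetical placeholder, and the stem extraction of step (b) is omitted because the patent does not detail how it is performed:

    import jieba.posseg as pseg

    # POS tag prefixes kept: a = adjective, i = idiom, b = distinguishing word, v = verb
    KEPT_FLAGS = ("a", "i", "b", "v")

    def extract_features(reviews, stopword_path="stopwords.txt"):
        """Build the feature word set from review strings (steps (a) and (c))."""
        with open(stopword_path, encoding="utf-8") as f:
            stopwords = {line.strip() for line in f}
        features = set()
        for review in reviews:
            for pair in pseg.cut(review):       # segmentation with POS tagging
                if pair.flag.startswith(KEPT_FLAGS) and pair.word not in stopwords:
                    features.add(pair.word)
        return sorted(features)                 # fixed order gives stable vector indices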
Step 3: processing the reviews to form the feature-representation text:
Step (a): segment each review in the corpus with jieba; using the feature set obtained in step 2, judge whether each feature word of the set appears in the review, marking 1 if it does and 0 otherwise. The resulting array converts the review into a feature representation; it should be explained that in this invention a review's feature vector refers to this feature-represented text.
Step (b): all comments in the corpus are converted to feature-represented texts by the above step; together these texts form the feature-vector text.
Step (c): feature representations containing no features at all are removed.
Step (d): to reduce the influence of an imbalance between the numbers of positive and negative comments on the analysis result, in one scheme equal numbers of positive and negative feature representations are extracted to form the feature-vector text used in this embodiment; the feature-vector text is randomly divided into a training set, and each feature-represented text in the training set receives a positive or negative label, 1 (true) for positive and 0 (false) for negative.
It should be noted that, because each review is short, this embodiment adopts the idea of the Bernoulli naive Bayes algorithm and counts whether a word appears rather than how many times it appears.
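A minimal sketch of steps (a) through (d) follows, under the assumption that each review arrives pre-segmented as a list of words together with a known polarity, e.g., derived from its star rating; all names are illustrative:

    import random

    def to_feature_vector(review_words, features):
        """Bernoulli-style binary vector: 1 if the feature word occurs in the review."""
        present = set(review_words)
        return [1 if w in present else 0 for w in features]

    def build_dataset(pos_reviews, neg_reviews, features, seed=42):
        """Vectorize, drop empty vectors, balance the classes, label and shuffle."""
        pos = [v for v in (to_feature_vector(r, features) for r in pos_reviews) if any(v)]
        neg = [v for v in (to_feature_vector(r, features) for r in neg_reviews) if any(v)]
        k = min(len(pos), len(neg))             # step (d): equal numbers per class
        data = [(v, 1) for v in pos[:k]] + [(v, 0) for v in neg[:k]]
        random.seed(seed)
        random.shuffle(data)                    # later split into training and test sets
        return data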
Step 4: the classifier is constructed using the naive Bayes idea and improved to be better suited to movie-review text classification.
The classifier is constructed and improved on the naive Bayes idea by the following steps:
Step (a): analyze the naive Bayes classifier; naive Bayes classification is defined as follows:
1. Let X = {a_1, a_2, …, a_m} be an item to be classified, each a_k a feature attribute of X.
2. Let the set of categories be C = {y_1, y_2, …, y_n}.
3. Calculate p(y_1|X), p(y_2|X), …, p(y_n|X).
4. If p(y_k|X) = max{p(y_1|X), p(y_2|X), …, p(y_n|X)}, then X ∈ y_k.
Step (b): Bayesian text classification is based on the formula

p(C_i|w_1, w_2, …, w_n) = p(w_1, w_2, …, w_n|C_i) · p(C_i) / p(w_1, w_2, …, w_n)

where p(C_i) is the probability of occurrence of the i-th text class, p(w_1, w_2, …, w_n|C_i) is the probability of the feature vector (w_1, w_2, …, w_n) occurring given text class C_i, and p(w_1, w_2, …, w_n) is the probability of the feature vector occurring. In this embodiment the feature words are assumed to occur in the text independently of one another, i.e., there is no correlation between words, so the joint probability can be expressed as a product:

p(C_i|w_1, w_2, …, w_n) = p(w_1|C_i) p(w_2|C_i) ⋯ p(w_n|C_i) p(C_i) / (p(w_1) p(w_2) ⋯ p(w_n))

For a fixed training set, p(w_1) p(w_2) ⋯ p(w_n) in the above equation is a fixed constant, so the calculation of the denominator can be omitted when performing the classification calculation, such that:

p(C_i|w_1, w_2, …, w_n) ∝ p(w_1|C_i) p(w_2|C_i) ⋯ p(w_n|C_i) p(C_i)
step (c): classifiers were constructed and improved using naive bayes thought.
The naive Bayes idea is converted into calculation formulas: p(C_i) and p(w_j|C_i) are obtained from a large amount of training text. To prevent the result from underflowing because the factors are too small, logarithms are used, i.e., log(p(C_i)) and log(p(w_j|C_i)) are obtained, and the test data are substituted in to obtain their scores under the different categories, namely:

p(C_i|w_1, w_2, …, w_n) = log(p(C_i)) + Σ_j data_j · log(p(w_j|C_i))

where data_j is the j-th entry (0 or 1) of the test datum's feature representation and the sum runs over the n feature words.
by analyzing the shadow comments, it can be concluded that the probability of positive terms appearing in the positive shadow comments is much higher than the probability of positive terms appearing in the negative shadow comments relative to the terms. In contrast, the probability of negative words appearing in a negative rating is much higher than the probability of negative words appearing in a positive rating. I.e. the probability of a word appearing in a certain type of text is specific, the probability of a word appearing can be used to influence the last p (C) i |w 1 ,w 2 ...w n ) The value is obtained.
Namely:

p(C_i|w_1, w_2, …, w_n) = log(p(C_i)) + Σ_j data_j · log(p(w_j|C_i)) + Σ_j data_j · log(p(C_i|w_j))

Finally, p(C_i|w_1, w_2, …, w_n) is computed under the two categories and the class with the larger value is taken.
Step (d): using the training set above, the values of the parameters p(C_i), p(w_j|C_i) and p(C_i|w_j) are obtained:
Calculating p(C_i), which comprises the negative-class probability and the positive-class probability; writing N_i for the number of class-C_i feature vectors in the training set:

p(C_i) = N_i / (N_0 + N_1)

Negative-class probability:

p(C_0) = N_0 / (N_0 + N_1)

Positive-class probability:

p(C_1) = N_1 / (N_0 + N_1)

C_i denotes the feature-vector text of the corresponding class, i = 0, 1.
Calculating, per class, the probability with which each feature word appears in that class's feature-vector text of the training set, i.e., p(w_j|C_i), comprising the probabilities of the feature words appearing in the negative and in the positive feature-vector texts of the training set; writing count(w_j, C_i) for the number of occurrences of feature word w_j in the class-C_i feature-vector text of the training set:

p(w_j|C_i) = count(w_j, C_i) / Σ_k count(w_k, C_i)

Probabilities of the feature words appearing in the negative feature-vector text of the training set:

p(w_j|C_0) = [p(w_0|C_0), p(w_1|C_0), p(w_2|C_0), …, p(w_n|C_0)]

Probabilities of the feature words appearing in the positive feature-vector text of the training set:

p(w_j|C_1) = [p(w_0|C_1), p(w_1|C_1), p(w_2|C_1), …, p(w_n|C_1)]

C_i denotes the feature-vector text of the corresponding class, i = 0, 1; w_j denotes a feature word in the feature word set, j = 1, 2, …, n, where n is the number of feature words in the set.
Calculating the probability that each feature word appears in each class's vector text of the training set, i.e., p(C_i|w_j), comprising the probability that a feature word appears in the negative class of the training set and the probability that it appears in the positive class:

p(C_i|w_j) = count(w_j, C_i) / (count(w_j, C_0) + count(w_j, C_1))

Probability that a feature word appears in the negative class of the training set:

p(C_0|w_j) = [p(C_0|w_0), p(C_0|w_1), p(C_0|w_2), …, p(C_0|w_n)]

Probability that a feature word appears in the positive class of the training set:

p(C_1|w_j) = [p(C_1|w_0), p(C_1|w_1), p(C_1|w_2), …, p(C_1|w_n)]

C_i denotes the feature-vector text of the corresponding class, i = 0, 1; w_j denotes a feature word in the feature word set, j = 1, 2, …, n, where n is the number of feature words in the set.
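A minimal sketch of the parameter estimation of step (d), operating on the labeled binary vectors built earlier; the names are illustrative, and since the patent describes no smoothing, a small epsilon is added here, purely as an assumption, to keep the logarithms finite:

    import math

    def train_parameters(train, n_features, eps=1e-9):
        """train: list of (binary vector, label) pairs, label 1 = positive, 0 = negative.
        Returns log p(C_i), log p(w_j|C_i) and log p(C_i|w_j) for i in {0, 1}."""
        class_count = [0, 0]                                  # N_0, N_1
        word_count = [[0] * n_features, [0] * n_features]     # count(w_j, C_i)
        for vec, label in train:
            class_count[label] += 1
            for j, present in enumerate(vec):
                word_count[label][j] += present               # presence, not frequency
        log_prior = [math.log(class_count[i] / sum(class_count)) for i in (0, 1)]
        log_w_given_c = [[], []]                              # log p(w_j|C_i)
        log_c_given_w = [[], []]                              # log p(C_i|w_j)
        for i in (0, 1):
            denom = sum(word_count[i])                        # all occurrences in class i
            for j in range(n_features):
                both = word_count[0][j] + word_count[1][j]    # occurrences in both classes
                log_w_given_c[i].append(math.log(word_count[i][j] / denom + eps))
                log_c_given_w[i].append(math.log(word_count[i][j] / (both + eps) + eps))
        return log_prior, log_w_given_c, log_c_given_w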
The above is a detailed disclosure of the training steps.
Step 5: randomly dividing the feature-vector text into a test set; in the test set, no positive or negative label is attached to the feature-represented texts. The test set is used to test the trained model and to tune the parameters:
Step (a): train with the training set to obtain the classification model, test on the test-set data, and classify the unlabeled test-set data.
Step (b): of the three terms in the scoring formula,

log(p(C_i)),   Σ_j data_j · log(p(w_j|C_i)),   Σ_j data_j · log(p(C_i|w_j)),

weight parameters are attached to any two of them to balance the influence of the three terms on the final result (note: the parameters lie between 0 and 1); the comparison test results are analyzed and the parameters adjusted.
Step (c): modify the parameters and test repeatedly to find the optimal parameters, then compare against a plain naive Bayes classifier.
The above is a detailed disclosure of the testing procedure.
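A minimal sketch of the scoring used in step 5, including the balancing weights of step (b); alpha and beta stand for the two hypothetical weight parameters, and train_parameters refers to the sketch given after step (d) above:

    def classify(vec, log_prior, log_w_given_c, log_c_given_w, alpha=1.0, beta=1.0):
        """Score a binary feature vector under both classes; the larger score wins.
        alpha and beta are the 0..1 balancing weights of step 5(b)."""
        scores = []
        for i in (0, 1):
            f = sum(x * lp for x, lp in zip(vec, log_w_given_c[i]))  # word-in-class term
            g = sum(x * lp for x, lp in zip(vec, log_c_given_w[i]))  # class-given-word term
            scores.append(log_prior[i] + alpha * f + beta * g)
        return (1 if scores[1] > scores[0] else 0), scores

Step (c) then amounts to sweeping alpha and beta over a grid on the test set, keeping the pair with the highest accuracy, and comparing that accuracy with a plain naive Bayes classifier.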
In machine-learning-based text tendency analysis generally, high-frequency words are taken from a large number of review texts as features, the review texts are converted into feature representations, and sentiment is classified with learning algorithms such as naive Bayes or support vector machines.
Natural language is complex: a word carries different sentiment polarities in different sentences, and no sentiment dictionary can summarize all the characteristics of sentiment words, so this method improves machine-learning-based movie-review tendency analysis. Where others take the highest-frequency words as features, so that the trained classifier performs quite poorly when data are insufficient, this text extracts features using parts of speech, sentence stems and a small amount of manual intervention, then converts all review texts into feature representations with the obtained features, and on that basis constructs the classifier with the naive Bayes idea. The method makes low demands on computer performance, its selected features are not distorted by word frequency, it is better suited to review classification, and it is fast and accurate.
Example 2:
As a supplement to the technical scheme of embodiment 1, Fig. 1 shows the flow of the analysis method of the invention. In this embodiment jieba is used to segment a large number of texts and select words of specific parts of speech, jieba is used to extract sentence-stem words, the two are combined, and the downloaded reviews are divided by their ratings into positive and negative categories. The review texts are converted into feature representations, a classifier is constructed with the classification algorithm, and the necessary post-processing is performed. The invention is described in detail below with reference to Fig. 1, taking one review from the data set as an example.
Step 1: downloading reviews, i.e., writing a crawler to download Douban movie reviews. One of the reviews, as downloaded, is as follows:
[The downloaded review record appears as an image in the original document and is not reproduced here.]
Step 2: extracting the review's features:
2.1 Segment all reviews with jieba and extract the adjectives, idioms, distinguishing words and verbs as the feature set. The words extracted from the example review appear as an image in the original document and are not reproduced here.
Note: only the retained words were shown; the eliminated words were not listed.
2.2 Extract the stems of all reviews with jieba, and add the stem words to the feature set. The example review after segmentation and stem extraction appears as an image in the original document and is not reproduced here.
2.3 Stop words may be present in the feature set; they are removed with the stop-word dictionary. The resulting feature set appears as an image in the original document and is not reproduced here.
Step 3: processing the reviews, converting each review into a feature representation. Each review is segmented with jieba and represented with the feature word set.
Example review: 'A milestone for domestic genre films; the whole two hours are tightly and clearly paced, and genuinely rousing and thrilling.'
Suppose the feature word set is [very good, like, …, domestic, milestone, hour, rhythm, whole course, clear, hot-blooded, thrilling, …, resonance, boring].
The example review's feature representation is then: [0, 0, …, 1, 1, 1, 1, 1, 1, 1, 1, …, 0, 0].
As in embodiment 1, to reduce the influence of an imbalance between the numbers of positive and negative comments on the analysis result, equal numbers of positive and negative feature representations are extracted to form the feature-vector text, which is randomly divided into a training set; each feature-represented text in the training set receives a positive or negative label, 1 (true) for positive and 0 (false) for negative.
If the example review is randomly assigned to the training set, an identifier is inserted at the first position of its feature representation, 0 for negative and 1 for positive; its feature-represented text is then: [1, 0, 0, …, 1, 1, 1, 1, 1, 1, 1, 1, …, 0, 0].
Step 4: implementing the algorithm. The following three quantities are obtained from the training set.
Calculating p(C_i), which comprises the negative-class probability and the positive-class probability; writing N_i for the number of class-C_i feature vectors in the training set:

p(C_i) = N_i / (N_0 + N_1)

Negative-class probability:

p(C_0) = N_0 / (N_0 + N_1)

Positive-class probability:

p(C_1) = N_1 / (N_0 + N_1)

C_i denotes the feature-vector text of the corresponding class, i = 0, 1.
Calculating, per class, the probability with which each feature word appears in that class's feature-vector text of the training set, i.e., p(w_j|C_i), comprising the probabilities of the feature words appearing in the negative and in the positive feature-vector texts of the training set; writing count(w_j, C_i) for the number of occurrences of feature word w_j in the class-C_i feature-vector text of the training set:

p(w_j|C_i) = count(w_j, C_i) / Σ_k count(w_k, C_i)

Probabilities of the feature words appearing in the negative feature-vector text of the training set:

p(w_j|C_0) = [p(w_0|C_0), p(w_1|C_0), p(w_2|C_0), …, p(w_n|C_0)]

Probabilities of the feature words appearing in the positive feature-vector text of the training set:

p(w_j|C_1) = [p(w_0|C_1), p(w_1|C_1), p(w_2|C_1), …, p(w_n|C_1)]

C_i denotes the feature-vector text of the corresponding class, i = 0, 1; w_j denotes a feature word in the feature word set, j = 1, 2, …, n, where n is the number of feature words in the set.
Calculating the probability that each feature word appears in each class's vector text of the training set, i.e., p(C_i|w_j), comprising the probability that a feature word appears in the negative class of the training set and the probability that it appears in the positive class:

p(C_i|w_j) = count(w_j, C_i) / (count(w_j, C_0) + count(w_j, C_1))

Probability that a feature word appears in the negative class of the training set:

p(C_0|w_j) = [p(C_0|w_0), p(C_0|w_1), p(C_0|w_2), …, p(C_0|w_n)]

Probability that a feature word appears in the positive class of the training set:

p(C_1|w_j) = [p(C_1|w_0), p(C_1|w_1), p(C_1|w_2), …, p(C_1|w_n)]

C_i denotes the feature-vector text of the corresponding class, i = 0, 1; w_j denotes a feature word in the feature word set, j = 1, 2, …, n, where n is the number of feature words in the set.
Step 5: the trained model is tested with the test set: a test set is randomly generated from the feature-vector text; with the classification model obtained, the test-set data are used for testing, the feature-represented review texts of the unlabeled test set are classified, and the test results are compared and analyzed to judge the accuracy of the current training model.
5.1. Obtain the feature-representation array of the review to be classified, i.e., its feature-represented text;
5.2. Compute the probability of each of the review's feature words w_j occurring in each of the two classes of documents.
That is: to keep the result from under- or overflowing, the logarithm of the p(w_j|C_i) array is taken, multiplied element-wise by the review's feature-representation array, and summed, giving a tendency score (reflecting the probability).
Let the resulting negative score be f_0 and the positive score f_1.
5.3. Compute the probability that each of the review's feature words belongs to each of the two classes.
That is: to keep the result from under- or overflowing, the logarithm of the p(C_i|w_j) array is taken, multiplied element-wise by the review's feature-representation array, and summed, giving a tendency score.
Let the resulting negative score be g_0 and the positive score g_1.
5.4. Score merging.
The review's final score for the negative class is:

score_0 = log(p(C_0)) + f_0 + g_0

and its final score for the positive class is:

score_1 = log(p(C_1)) + f_1 + g_1

For the example review, the results are as follows:

Positive score | Negative score | Predicted result | Correct?
-38.352214246565453 | -41.408669267263221 | positive | yes

Whichever class's score is larger, the datum more likely belongs to that class; for example, a datum scoring -28.5338768667 for the positive class and -23.4792674766 for the negative class more likely belongs to the negative class, since -28.5338768667 < -23.4792674766.
The above description is only a preferred embodiment of the present invention, but the scope of the invention is not limited thereto; any substitution or change within the technical scope disclosed herein that a person skilled in the art could readily conceive according to the technical solution and inventive concept of the invention falls within the protection scope of the invention.

Claims (1)

1. A training method for machine-learning-based movie-review sentiment analysis, characterized by comprising the following steps:
Step 1: downloading movie reviews;
Step 2: selecting feature words: from the downloaded reviews, extracting a set of meaningful sentiment words as the feature word set, each word in the set being a feature word;
Step 3: for the downloaded reviews, using the feature word set to represent each review as a feature vector, wherein the set of positive feature vectors is the positive feature text and the set of negative feature vectors is the negative feature text, and equal numbers of positive and negative feature vectors are selected to form the feature-vector text;
Step 4: randomly dividing the feature-vector text into a training set, attaching a positive or negative label to each training-set feature vector, and training a classifier constructed on the naive Bayes idea;
per class, the probability with which each feature word appears in that class's feature-vector text of the training set is calculated as p(w_j|C_i), comprising the probabilities of the feature words appearing in the negative and in the positive feature-vector texts of the training set; writing count(w_j, C_i) for the number of occurrences of feature word w_j in the class-C_i feature-vector text of the training set:

p(w_j|C_i) = count(w_j, C_i) / Σ_k count(w_k, C_i)

the probabilities of the feature words appearing in the negative feature-vector text of the training set:

p(w_j|C_0) = [p(w_0|C_0), p(w_1|C_0), p(w_2|C_0), …, p(w_n|C_0)]

the probabilities of the feature words appearing in the positive feature-vector text of the training set:

p(w_j|C_1) = [p(w_0|C_1), p(w_1|C_1), p(w_2|C_1), …, p(w_n|C_1)]

C_i denotes the feature-vector text of the corresponding class, i = 0, 1; w_j denotes a feature word in the feature word set, j = 1, 2, …, n, where n is the number of feature words in the set;
the probability that each feature word appears in each class's vector text of the training set is calculated as p(C_i|w_j), comprising the probability that a feature word appears in the negative class of the training set and the probability that it appears in the positive class:

p(C_i|w_j) = count(w_j, C_i) / (count(w_j, C_0) + count(w_j, C_1))

the probability that a feature word appears in the negative class of the training set:

p(C_0|w_j) = [p(C_0|w_0), p(C_0|w_1), p(C_0|w_2), …, p(C_0|w_n)]

the probability that a feature word appears in the positive class of the training set:

p(C_1|w_j) = [p(C_1|w_0), p(C_1|w_1), p(C_1|w_2), …, p(C_1|w_n)]

C_i denotes the feature-vector text of the corresponding class, i = 0, 1; w_j denotes a feature word in the feature word set, j = 1, 2, …, n, where n is the number of feature words in the set;
the method for training the classifier constructed by the naive Bayes idea comprises the following steps:
dividing the feature vectors in the training set into a positive type feature vector text and a negative type feature vector text, constructing the positive type feature vector text and the negative type feature vector text of the training set, and calculating the probability of each type appearing in the training set;
calculating the probability of the feature words in the feature word set appearing in the class-like feature vector text of the training set according to the classes;
calculating the probability that the characteristic words in the characteristic word set can appear in each class of the training set respectively;
the probability of each class in the training set is calculated as: calculating the probability of the passive classified feature vectors and the active classified feature vectors in the training set, wherein the set of the active classified feature vectors in the training set is a training set active feature vector text, and the set of the passive classified feature vectors is a training set passive feature vector text;
calculating the probability of the feature words in the feature word set appearing in the class-like feature vector text of the training set according to the classes: calculating the proportion of the number of times of the feature words appearing in the text of the training set positive feature vectors to the number of all the feature words in the text of the training set positive feature vectors, and calculating the proportion of the number of times of the feature words appearing in the text of the training set negative feature vectors to the number of all the feature words in the text of the training set negative feature vectors;
calculating the probability that the characteristic words in the characteristic word set can respectively appear in each class of the training set: calculating the proportion of the characteristic words appearing in the training set positive characteristic vector text to the characteristic words appearing in the training set, and calculating the proportion of the characteristic words appearing in the training set negative characteristic vector text to the characteristic words appearing in the training set;
the method for representing each film evaluation by using the feature word set by using the feature vector comprises the following steps: judging whether each feature word in the feature word set appears in the film comment, if so, marking 1, otherwise, marking 0, forming an array of the film comment, and converting each film comment into a feature representation form as a feature vector;
the classification sentiment tendency of the test data is judged by computing, for each class,

p(C_i|w_1, w_2, …, w_n) = log(p(C_i)) + Σ_j data_j · log(p(w_j|C_i)) + Σ_j data_j · log(p(C_i|w_j))

and taking the maximum, wherein data is the feature representation of the test datum; this score derives from

p(C_i|w_1, w_2, …, w_n) ∝ p(w_1|C_i) p(w_2|C_i) ⋯ p(w_n|C_i) p(C_i)

p(C_i) comprises the negative-class probability and the positive-class probability; writing N_i for the number of class-C_i feature vectors in the training set:

p(C_i) = N_i / (N_0 + N_1)

negative-class probability:

p(C_0) = N_0 / (N_0 + N_1)

positive-class probability:

p(C_1) = N_1 / (N_0 + N_1)

C_i denotes the feature-vector text of the corresponding class, i = 0, 1.
CN201810480816.7A 2018-05-18 2018-05-18 Training method for film evaluation emotion tendency analysis based on machine learning Active CN108804416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810480816.7A CN108804416B (en) 2018-05-18 2018-05-18 Training method for film evaluation emotion tendency analysis based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810480816.7A CN108804416B (en) 2018-05-18 2018-05-18 Training method for film evaluation emotion tendency analysis based on machine learning

Publications (2)

Publication Number Publication Date
CN108804416A CN108804416A (en) 2018-11-13
CN108804416B true CN108804416B (en) 2022-08-09

Family

ID=64092687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810480816.7A Active CN108804416B (en) 2018-05-18 2018-05-18 Training method for film evaluation emotion tendency analysis based on machine learning

Country Status (1)

Country Link
CN (1) CN108804416B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767399B (en) * 2020-06-30 2022-12-06 深圳平安智慧医健科技有限公司 Method, device, equipment and medium for constructing emotion classifier based on unbalanced text set
CN112364170B (en) * 2021-01-13 2021-06-29 北京智慧星光信息技术有限公司 Data emotion analysis method and device, electronic equipment and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301200A (en) * 2017-05-23 2017-10-27 合肥智权信息科技有限公司 A kind of article appraisal procedure and system analyzed based on Sentiment orientation

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020103793A1 (en) * 2000-08-02 2002-08-01 Daphne Koller Method and apparatus for learning probabilistic relational models having attribute and link uncertainty and for performing selectivity estimation using probabilistic relational models
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN103605658B (en) * 2013-10-14 2016-08-10 北京航空航天大学 A kind of search engine system analyzed based on text emotion
CN107077470A (en) * 2014-10-31 2017-08-18 隆沙有限公司 The semantic classification of focusing
US20160189037A1 (en) * 2014-12-24 2016-06-30 Intel Corporation Hybrid technique for sentiment analysis
CN106156004B (en) * 2016-07-04 2019-03-26 中国传媒大学 The sentiment analysis system and method for film comment information based on term vector
CN106776581B (en) * 2017-02-21 2020-01-24 浙江工商大学 Subjective text emotion analysis method based on deep learning
CN107491531B (en) * 2017-08-18 2019-05-17 华南师范大学 Chinese network comment sensibility classification method based on integrated study frame

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301200A (en) * 2017-05-23 2017-10-27 合肥智权信息科技有限公司 A kind of article appraisal procedure and system analyzed based on Sentiment orientation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"社交媒体大数据的朴素贝叶斯分类方法研究";曹守富;《https://d.wanfangdata.com.cn/thesis/ChJUaGVzaXNOZXdTMjAyMTA1MTkSCFkzMDk1MjEyGgh2ejk5cTc3aw%3D%3D》;20170103;正文第15-16、24-25页 *

Also Published As

Publication number Publication date
CN108804416A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108733652B (en) Test method for film evaluation emotion tendency analysis based on machine learning
CN108628833B (en) Method and device for determining summary of original content and method and device for recommending original content
CN107357793B (en) Information recommendation method and device
CN107944911B (en) Recommendation method of recommendation system based on text analysis
CN112861541B (en) Commodity comment sentiment analysis method based on multi-feature fusion
CN112905739B (en) False comment detection model training method, detection method and electronic equipment
WO2019011936A1 (en) Method for evaluating an image
CN110955750A (en) Combined identification method and device for comment area and emotion polarity, and electronic equipment
CN109325120A (en) A kind of text sentiment classification method separating user and product attention mechanism
CN110909116B (en) Entity set expansion method and system for social media
Yamamoto et al. Multidimensional sentiment calculation method for Twitter based on emoticons
Desai Sentiment analysis of Twitter data
CN112115712B (en) Topic-based group emotion analysis method
Vaidhya et al. Personality traits analysis from Facebook data
Biba et al. Sentiment analysis through machine learning: an experimental evaluation for Albanian
CN108804416B (en) Training method for film evaluation emotion tendency analysis based on machine learning
CN114077661A (en) Information processing apparatus, information processing method, and computer readable medium
CN108717450B (en) Analysis algorithm for emotion tendentiousness of film comment
KR101652433B1 (en) Behavioral advertising method according to the emotion that are acquired based on the extracted topics from SNS document
Samuel et al. Textual data distributions: Kullback leibler textual distributions contrasts on gpt-2 generated texts, with supervised, unsupervised learning on vaccine & market topics & sentiment
Mir et al. Online fake review detection using supervised machine learning and BERT model
KR102410715B1 (en) Apparatus and method for analyzing sentiment of text data based on machine learning
Wang et al. Joint Learning on Relevant User Attributes in Micro-blog.
Baboo et al. Sentiment analysis and automatic emotion detection analysis of twitter using machine learning classifiers
CN108763203B (en) Method for expressing film comments by feature vectors by using feature word sets in film comment emotion analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant