CN110825876A - Movie comment viewpoint emotion tendency analysis method

Info

Publication number: CN110825876A
Application number: CN201911082409.1A
Authority: CN
Inventors: 许青青; 谢赟; 韩欣
Original assignee: Shanghai Tak Billiton Information Technology Ltd By Share Ltd
Current assignee: Shanghai Tak Billiton Information Technology Ltd By Share Ltd
Priority date: 2019-11-07
Filing date: 2019-11-07
Publication date: 2020-02-21
Anticipated expiration: 2039-11-07
Also published as: CN110825876B

Abstract

The invention discloses a method for analyzing the emotional tendency of viewpoints in movie comments, comprising the following steps: crawling film description information and comment information for a plurality of films in each category from a film review website; preprocessing the collected film description information and comment information; formulating a plurality of comment viewpoint extraction rules, using them to obtain viewpoint words and sentiment words from each comment sentence of the comment content, and storing all viewpoint words and sentiment words as a comment label word bank and a viewpoint sentiment word bank, respectively; marking each comment sentence with a comment label category and an emotion tendency, either by keyword matching or manually; generating a comment viewpoint emotion analysis model consisting of a comment label classification model and a label emotion classification model; and using the comment viewpoint emotion analysis model to automatically generate a comment label category label and an emotion tendency label for a target film comment. The method can comprehensively and accurately reflect the emotions users express about a film.

Description

Movie comment viewpoint emotion tendency analysis method

Technical Field

The invention relates to the technical field of information extraction and data mining, in particular to a movie comment viewpoint emotion orientation analysis method.

Background

In the era of internet big data, online comments have become a form of word of mouth and are the most direct way and channel for consumers to express their emotional attitudes. For consumers, analyzing comments yields an all-round evaluation of a product, so that the product can be understood along multiple dimensions and decisions are easier to make. For merchants, it reveals the preferences of consumers and of the market, which helps improve service quality and increase customer stickiness. With continuing innovation in internet media technology, the film entertainment industry, including the cinema industry and the home entertainment industry, is developing vigorously. Films have become an everyday entertainment option, and their popularity has produced a large volume of comment information. Extracting subjective viewpoints from public comments and judging whether the public leans positive or negative is an important problem in information extraction and mining within natural language processing. Film comment information also shows its value in areas such as value transmission and the shaping of the film and television environment, and analyzing it contributes to the further development of film and television research. Analyzing the emotional tendency of movie comment viewpoints is therefore significant.

The methods commonly used to extract viewpoints from user comments are mainly unsupervised: rule extraction, clustering algorithms and the like. Rule-based methods extract the viewpoints in comments using rules summarized by hand from syntactic structures, but hand-written rules cannot cover every way of expressing a viewpoint, so these methods extract only a limited number of valid viewpoints. Clustering-based methods are simple but have low accuracy, and they struggle to generate reasonable and accurate comment tags.

At present, the methods commonly used for comment sentiment analysis are dictionary matching, classification algorithms and the like. Methods based on an emotion dictionary depend entirely on that dictionary and are limited by its size. Emotion classification algorithms are supervised: some training sets are built by combining comment information with scores, while others are labeled manually at considerable labor cost.

In addition, comment information in different industries tends to have its own focus, so the approaches to emotion analysis differ slightly. Compared with online reviews in areas such as e-commerce, restaurants and hotels, movie reviews contain relatively complex information about the user's experience and impressions, so current emotion analysis and viewpoint extraction methods cannot be applied to movie review analysis as they stand. Moreover, many studies of online comments treat comment viewpoint extraction and emotion classification as two separate research modules. However, a user's comments on a product or thing are often multidimensional, and the praise or criticism differs from dimension to dimension, so directly judging whether the user's overall emotion is favorable (positive) or unfavorable (negative) is not accurate enough; emotion analysis on the main viewpoint dimensions extracted from the comment is more practical. For example, for the comment "the acting in this movie is superb, but the story is not good", the results (actor, positive) and (plot, negative) are more accurate.

Disclosure of Invention

The invention aims to provide a movie comment viewpoint emotion tendency analysis method which can comprehensively and accurately reflect emotion expression of a user on a movie.

The technical scheme for realizing the purpose is as follows:

a movie comment opinion sentiment tendency analysis method comprises the following steps:

step S1, crawling the film description information and comment information of a plurality of films of each category from the film evaluation website;

step S2, carrying out data preprocessing on the collected film description information and comment information;

step S3, formulating a plurality of comment viewpoint extraction rules, obtaining viewpoint words and sentiment words from each comment sentence of comment content of comment information by using the comment viewpoint extraction rules, and then respectively storing all the viewpoint words and sentiment words as a comment label word bank and a viewpoint sentiment word bank;

step S4, comment label category marking and emotion tendency marking are carried out on each comment sentence through keyword matching marking or manual marking;

step S5, generating a comment viewpoint emotion analysis model consisting of a comment label classification model and a label emotion classification model;

and step S6, automatically generating comment label category labels and emotion tendency labels by using the comment viewpoint emotion analysis model aiming at the target movie comment.

Preferably, in step S1, the categories of the movies include: romance, animation, action, science fiction, horror, comedy, and suspense;

the film description information comprises a film name, a director name, a lead actor name, a type and a total score;

the comment information includes: the commenter's nickname, the number of "useful" votes on the comment, the comment time, the comment content and the score.

Preferably, the data preprocessing comprises:

integrating all the collected comment information to form a comment corpus;

removing repeated data in the comment corpus;

deleting data with missing comment content in the comment corpus;

converting all traditional Chinese characters in the comment corpus into simplified Chinese characters;

and acquiring the film name, the director name and the lead actor name from the collected description information of each film, storing them in a user-defined dictionary, and marking the film names with distinguishing symbols.

Preferably, the step S3 includes:

constructing a plurality of comment viewpoint extraction rules according to the dependency syntax structure, the part of speech among the words and the expression structure of viewpoint words and sentiment words in the comment viewpoints;

sentence segmentation, word segmentation, part of speech tagging and dependency syntactic analysis are carried out on the comment content in the comment corpus to obtain each comment sentence, whether the comment sentences match a certain comment viewpoint extraction rule or not is checked, if matching, viewpoint words and sentiment words are obtained,

and respectively storing all the acquired viewpoint words and emotion words as a comment label word library and a viewpoint emotion word library.

Preferably, the dependency syntax structure includes: a subject-verb structure, a verb-object structure, an attributive structure, an adverbial structure, a verb-complement structure and a coordinate structure;

the part of speech among the words comprises: a subject component, an object or formal object component, an attributive component, and a noun component; a formal object refers to an indirect object or an object-like structure;

the expression structure of the viewpoint words and the emotion words refers to: the subject component is a viewpoint word, and the object or formal object component is an emotion word; the attributive component is an emotion word, and the noun component modified by the attributive component is a viewpoint word.

Preferably, the step S4 includes:

acquiring a label category dictionary and an emotion dictionary;

and performing keyword matching marking on the comment sentences from which viewpoint words and emotion words were extracted in step S3: matching the acquired viewpoint words against the label category dictionary and the acquired emotion words against the emotion dictionary, and, if both matches succeed, marking the comment sentence with a label category label and an emotion tendency label; otherwise, carrying out manual label category marking and emotion tendency marking;

and performing manual label type marking and emotion tendency marking on the comment sentences of which the viewpoint words and the emotion words are not extracted in the step S3.

Preferably, the obtaining of the tag category dictionary includes:

respectively marking the film name, the director name and the actor name in the user-defined dictionary in the comment tag word library as 'film', 'director' and 'actor';

training a word vector model on the comment sentences to obtain a trained word vector model;

expressing the words in the comment label word library by using a trained word vector model, and clustering the words in the comment label word library into k categories by using a k-means clustering algorithm;

manually inducing and screening the popular viewpoints of the movie reviews into 8 dimensions of director, photography, scenario, actor, emotion, audio-visual effect, subject matter and impression, screening words under each cluster, and reserving related words to form a preliminary label category dictionary;

acquiring related words of the labeled category words in the preliminary label category dictionary by using the trained word vector model to expand the label category dictionary, removing repeated words in the dictionary, and generating a final label category dictionary;

the obtaining of the emotion dictionary refers to: firstly, collecting open-source positive and negative emotion dictionaries for sorting and merging, then counting word frequency in the viewpoint emotion word bank, reserving all words larger than a set threshold value, and then manually deleting words irrelevant to movie comment emotion to form an emotion dictionary.

Preferably, the step S5 includes:

respectively training and generating two preliminary comment label classification models and two preliminary label emotion classification models by utilizing the keyword matching marking data set and the manual marking data set;

weighting and fusing the two preliminary comment label classification models to generate a final comment label classification model;

and performing weighted fusion on the two primary label emotion classification models to generate a final label emotion classification model.

Preferably, the step of generating the preliminary comment tag classification model or the preliminary tag emotion classification model includes:

an up-sampling strategy is adopted for the keyword matching marking data set and the manual marking data set to carry out data balance;

dividing the keyword matched marking data set and the manually marked data set after the data balance into a training set and a testing set according to a preset proportion;

performing word segmentation on the corpus in the training set, removing stop words, extracting text features by adopting a TF-IDF algorithm, and calculating chi-square values of the features to perform feature dimension reduction;

and importing the data into a random forest classification model, and performing model training, storage and evaluation.

Preferably, the step S6 includes:

extracting viewpoint words and emotion words, if the viewpoint words and the emotion words can be obtained, performing keyword matching including label category matching and emotion word matching, and if the viewpoint words and the emotion words can be successfully matched, directly outputting label category marks and emotion tendency marks; otherwise, directly calling the comment tag classification model and/or the tag emotion classification model to perform tag class prediction and tag emotion prediction, setting two thresholds T1 and T2, and outputting a tag class mark and an emotion tendency mark if the tag class prediction probability P1 is greater than T1 and the tag emotion prediction probability P2 is greater than T2.

The invention has the beneficial effects that: the method and the device are used for processing text information with complex movie comment contents and emotional tendencies, and analyzing the emotional tendencies of movie comment data in a mode of combining various methods and various strategies, so that the emotional tendencies of audiences to certain aspects of a movie can be captured accurately.

Drawings

FIG. 1 is a flow chart of a movie reviews perspective emotional orientation analysis method of the present invention;

FIG. 2 is a flow chart of keyword matching marking in the present invention;

FIG. 3 is a schematic diagram of a review tag classification model fusion in the present invention;

FIG. 4 is a schematic diagram of label emotion classification model fusion in the present invention;

FIG. 5 is a schematic diagram of a classification model construction process according to the present invention;

FIG. 6 is a flow chart of the automatic generation of comment emotion tags in the present invention.

Detailed Description

The invention will be further explained with reference to the drawings.

Referring to FIG. 1, the movie comment viewpoint emotion tendency analysis method of the present invention extracts comment viewpoints from movie review data and performs label classification and emotion tendency analysis on them, that is, it obtains the comment label category and its emotion tendency. It also constructs a comment viewpoint emotion analysis model that analyzes and classifies new movie review data and attaches category and emotion labels to it. The method comprises the following steps:

Step S1, data crawling: crawl, from a film review website, the film description information of a plurality of films in the romance, animation, action, science fiction, horror, comedy and suspense categories, together with the comment information of each film. The film description information includes information such as film name, director name, genre and overall score. The comment information of a film includes information such as the commenter's nickname, the number of "useful" votes, comment time, comment content and score.

Step S2, performing data preprocessing on the movie description information and the comment information, including:

data integration: integrating all the collected comment information into a comment corpus;

data deduplication: removing duplicate data from the comment corpus;

missing value handling: deleting data whose comment content is missing from the comment corpus;

traditional Chinese handling: converting all traditional Chinese characters in the comment corpus into simplified Chinese characters;

user-defined dictionary: acquiring the film name, the director name and the lead actor name from the collected film description information, storing them in the user-defined dictionary, and marking the film names with distinguishing symbols.
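The cleaning steps of step S2 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the record fields `movie` and `content`, the rule that two records are duplicates when they share the same film and the same text, and the `to_simplified` callback (which stands in for whatever traditional-to-simplified converter is actually used) are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Record = Dict[str, Optional[str]]


def preprocess_comments(
    comments: Iterable[Record],
    to_simplified: Callable[[str], str] = lambda text: text,
) -> List[Record]:
    """Drop empty comments, normalise the script, and remove duplicates."""
    seen: Set[Tuple[str, str]] = set()
    corpus: List[Record] = []
    for record in comments:
        content = (record.get("content") or "").strip()
        if not content:
            continue  # missing value handling: no comment content
        content = to_simplified(content)  # traditional -> simplified (hypothetical hook)
        key = (record.get("movie") or "", content)
        if key in seen:
            continue  # deduplication (assumed rule: same film, same text)
        seen.add(key)
        corpus.append({**record, "content": content})
    return corpus
```

Converting the script before deduplicating means that a traditional-character copy and a simplified-character copy of the same comment collapse into one record.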

Step S3, comment viewpoint extraction: a plurality of general comment viewpoint extraction rules are formulated according to the dependency syntax structures and the parts of speech of words in modern Chinese, combined with the way viewpoint words and emotion words are actually expressed in comment viewpoints. The comment content in the comment corpus undergoes sentence segmentation, word segmentation, part-of-speech tagging and dependency syntactic analysis to obtain individual comment sentences. Each comment sentence is then checked against the comment viewpoint extraction rules; if it matches a rule, a (viewpoint word, emotion word) pair is obtained. Finally, all the acquired viewpoint words and emotion words are stored as the comment label word bank and the viewpoint emotion word bank, respectively.

The comment viewpoint extraction rules fall into two types according to the dependency syntax structure: a rule system with the subject-verb structure (SBV) as its core, and a rule system with the attributive structure (ATT) as its core. The syntactic relationships involved in the extraction rules are shown in Table 1:

Type of relationship	Tag	Description	Example
Subject-verb structure	SBV	subject-verb	I send her a bunch of flowers (I <- send)
Verb-object structure	VOB	verb-object	I send her a bunch of flowers (send -> flowers)
Attributive structure	ATT	attribute	Red apple (red <- apple)
Adverbial structure	ADV	adverbial	Very beautiful (very <- beautiful)
Verb-complement structure	CMP	complement	Completed the homework (do -> complete)
Coordinate structure	COO	coordinate	Mountain and sea (mountain -> sea)

TABLE 1

Further, the SBV-based rule system is mainly classified into 4 categories, as shown in table 2:

TABLE 2

As can be seen from Table 2, the SBV-based rules rely on a noun subject that is directly or indirectly connected to an object or an object-like structure (hereinafter, an indirect object or object-like structure is referred to as a formal object). The extracted subject component is the comment viewpoint word, and the extracted formal object component is the comment viewpoint emotion word.

The rules cover not only the sentence structures listed in Table 2; they also consider whether the subject or the formal object takes part in a coordinate structure, and whether the formal object is modified by an adverb, because negation words change the emotion. For example, from the comment "movie and scenario good", two (viewpoint word, emotion word) pairs can be extracted: (movie, good) and (scenario, good); "subject matter rich and novel" yields the pairs (subject matter, rich) and (subject matter, novel); and "movie is not good" yields (movie, not good).

Further, the rule system with ATT as the core is also classified into 4 types, and the specific rules are shown in table 3.

TABLE 3

Because an attributive modifies, restricts and describes the quality and characteristics of a noun or pronoun, the attributive relation is essential to the comment viewpoint extraction rules. As Table 3 shows, adjectives generally serve as the emotion words of a comment viewpoint, and the nouns they modify, or verbs used as nouns, serve as the viewpoint words. As before, the rules must also consider coordinate structures among noun components and among adjectives, as well as adverb components that modify adjectives. For example, in the example sentence "hard and embarrassed performance" given in Table 3, "hard" is coordinate with "embarrassed", so two pairs can be extracted: (performance, hard) and (performance, embarrassed); from "show not live", (show, not live) can be extracted.
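To make the two rule systems more concrete, the sketch below applies one SBV pattern (noun subject with an adjective predicate) and one ATT pattern (adjective attributive modifying a noun) to hand-written dependency parses, expanding coordinate (COO) structures and keeping adverbial (ADV) modifiers such as negation. It does not reproduce the rules of Tables 2 and 3. The `Token` layout, the part-of-speech codes and the extra tags `HED` and `LAD` in the sample parses are assumptions made for the illustration, and English glosses stand in for the Chinese words.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    word: str
    pos: str    # coarse part of speech: "n" noun, "a" adjective, "d" adverb, ...
    head: int   # idx of the governing token, 0 for the root
    rel: str    # dependency tag, e.g. SBV, ATT, ADV, COO


def children(tokens: List[Token], head: Token, rel: str) -> List[Token]:
    """Tokens attached to `head` by the relation `rel`."""
    return [t for t in tokens if t.head == head.idx and t.rel == rel]


def with_adverbs(tokens: List[Token], word: Token) -> str:
    """Prefix ADV modifiers (e.g. negation) so that polarity is not lost."""
    adverbs = [t.word for t in children(tokens, word, "ADV")]
    return " ".join(adverbs + [word.word])


def extract_pairs(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (viewpoint word, emotion word) pairs for two simplified rules."""
    pairs: List[Tuple[str, str]] = []
    for tok in tokens:
        if tok.head == 0:
            continue
        head = tokens[tok.head - 1]
        if tok.rel == "SBV" and tok.pos == "n" and head.pos == "a":
            viewpoint, emotion = tok, head   # noun subject of an adjective predicate
        elif tok.rel == "ATT" and tok.pos == "a" and head.pos == "n":
            viewpoint, emotion = head, tok   # adjective attributive of a noun
        else:
            continue
        # Expand coordinate structures on both the viewpoint and the emotion side.
        for v in [viewpoint] + children(tokens, viewpoint, "COO"):
            for e in [emotion] + children(tokens, emotion, "COO"):
                pairs.append((v.word, with_adverbs(tokens, e)))
    return pairs


# Hand-written parse of "movie and scenario good".
sentence = [
    Token(1, "movie", "n", 4, "SBV"),
    Token(2, "and", "c", 3, "LAD"),
    Token(3, "scenario", "n", 1, "COO"),
    Token(4, "good", "a", 0, "HED"),
]
print(extract_pairs(sentence))
```

In a real pipeline the parses would come from a Chinese dependency parser rather than being written by hand, and each of the rule categories in Tables 2 and 3 would need its own pattern.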

Step S4, comment label category marking and emotion tendency marking, which is divided into keyword matching marking and manual marking. Keyword matching marking requires a label category dictionary and an emotion dictionary to be acquired before the keywords are matched; the main process is shown in FIG. 2. The label category dictionary is acquired first, as follows:

1) Film proper noun substitution. Film names, director names and actor names in the comment label word bank that also appear in the user-defined dictionary are marked as "film", "director" and "actor" respectively, which classifies part of the words in the comment label word bank. For example, if actor names such as "Zhang San" and "Li Si" appear in the comment label word bank, the machine cannot tell by itself that they are actors; matching them against the actor names in the user-defined dictionary allows "Zhang San" and "Li Si" to be marked as "actor". Director names and film names are marked in the same way.

2) Word vector model training. The comment content in the comment corpus is segmented into words, stop words are removed, and the result is saved to a text file in which each comment sentence occupies one line and words are separated by spaces; a word2vec (word vector) model is then trained on the processed comment content to obtain the word vector model;

3) Word clustering. The words in the comment label word bank are represented with the trained word vector model and clustered into k categories with the k-means clustering algorithm; the value of k is determined by observing the clustering results over multiple trials;

4) Evaluation dimension induction and category dictionary screening. Through manual induction and screening, the popular viewpoints in film reviews are divided into 8 dimensions: director, photography, scenario, actor, emotion, audio-visual effect, subject matter and impression. The words under each cluster are screened, and the related words are retained to form a preliminary label category dictionary;

5) Label category dictionary expansion. The trained word vector model is used to obtain words related to the label category words so as to expand the label category dictionary; repeated words are then removed to generate the final label category dictionary. Related words are obtained by calculating the similarity between words with the word vector model and setting a threshold: words whose similarity exceeds the threshold are considered related. The related-word results are then screened manually to ensure the accuracy of the label category dictionary.
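Steps 3) and 5) above can be illustrated with a small standard-library sketch. Toy two-dimensional vectors stand in for the word2vec vectors, the k-means seeding (the first k words in sorted order) is chosen only to keep the example deterministic, and cosine similarity with a threshold of 0.8 is an assumed choice: the text specifies neither the similarity measure nor the threshold value.

```python
import math
from typing import Dict, Sequence, Set

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(vectors: Dict[str, Vector], k: int, iterations: int = 20) -> Dict[str, int]:
    """Cluster words by their vectors; returns word -> cluster index."""
    words = sorted(vectors)
    centroids = [list(vectors[w]) for w in words[:k]]  # deterministic seeding (assumption)
    assignment: Dict[str, int] = {}
    for _ in range(iterations):
        assignment = {
            w: min(range(k), key=lambda c: math.dist(vectors[w], centroids[c]))
            for w in words
        }
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def expand_dictionary(
    category_dict: Dict[str, Set[str]],
    vectors: Dict[str, Vector],
    threshold: float = 0.8,
) -> Dict[str, Set[str]]:
    """Add every word whose similarity to a seed word exceeds the threshold."""
    expanded = {label: set(words) for label, words in category_dict.items()}
    for label, seeds in category_dict.items():
        for seed in seeds:
            if seed not in vectors:
                continue
            for candidate, vec in vectors.items():
                if cosine(vectors[seed], vec) > threshold:
                    expanded[label].add(candidate)  # sets remove repeats automatically
    return expanded


toy_vectors = {
    "plot": (1.0, 0.0), "story": (0.9, 0.1),
    "actor": (0.0, 1.0), "cast": (0.1, 0.9),
}
print(kmeans(toy_vectors, k=2))
print(expand_dictionary({"scenario": {"plot"}, "actor": {"actor"}}, toy_vectors))
```

The manual screening described in steps 4) and 5) is still needed after this kind of automatic expansion, since a similarity threshold alone will admit some unrelated words.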

An example of the generated label category dictionary is shown in table 4:

TABLE 4

Next, the emotion dictionary is obtained. First, open-source positive and negative emotion dictionaries are collected, sorted and merged, mainly the HowNet dictionary and the open-source emotion dictionary of Taiwan University; only the positive and negative evaluation words are taken from the HowNet dictionary. Then word frequencies in the viewpoint emotion word bank are counted, all words whose frequency exceeds a set threshold are retained, and words irrelevant to movie comment emotion are deleted manually, forming an emotion dictionary with movie-specific characteristics.
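One possible reading of this procedure is sketched below. How the frequency filter interacts with the merged open-source dictionaries is not spelled out in the text, so the sketch assumes that a merged dictionary word is kept only if it is also frequent in the viewpoint emotion word bank; the function name, its parameters and the `irrelevant` set (which stands in for the manual deletion step) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def build_emotion_dictionary(
    positive_words: Iterable[str],
    negative_words: Iterable[str],
    sentiment_word_bank: Iterable[str],
    min_count: int,
    irrelevant: Set[str] = frozenset(),
) -> Dict[str, str]:
    """Merge open-source polarity lists, then keep the words that are frequent
    in the extracted viewpoint emotion word bank (assumed interpretation)."""
    merged = {word: "positive" for word in positive_words}
    # A word present in both lists ends up negative here; a real merge
    # would need an explicit conflict rule.
    merged.update({word: "negative" for word in negative_words})
    frequency = Counter(sentiment_word_bank)
    return {
        word: polarity
        for word, polarity in merged.items()
        if frequency[word] > min_count and word not in irrelevant
    }


bank = ["good", "good", "bad", "bad", "bad", "great", "long", "long", "long"]
print(build_emotion_dictionary(["good", "great"], ["bad"], bank, min_count=1))
```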

Finally, keyword matching is performed. Keyword matching takes the comment sentences from which viewpoint words and emotion words were extracted during comment viewpoint extraction, matches the viewpoint words against the label category dictionary and the emotion words against the emotion dictionary, and, if both matches succeed, marks the comment sentence with (label category, emotion tendency). For example, for a comment stating that the story is not good, comment viewpoint extraction yields the pair (story, not good), and label category and emotion tendency marking then yields the label (scenario, negative).

Manual marking covers two cases: sentences from which no viewpoint words and emotion words were extracted during comment viewpoint extraction, and sentences from which viewpoint words and emotion words were extracted but which could not be marked by keyword matching. In both cases, the label category and the emotion tendency are marked manually.
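The decision between keyword matching marking and manual marking can be sketched as a single lookup function. The dictionary layouts used here (category to word set, emotion word to polarity) and the convention of returning `None` to signal that manual marking is needed are assumptions of this example.

```python
from typing import Dict, Optional, Set, Tuple


def keyword_match_label(
    viewpoint: str,
    emotion: str,
    label_dict: Dict[str, Set[str]],
    emotion_dict: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (label category, emotion tendency), or None if either match fails."""
    category = next(
        (label for label, words in label_dict.items() if viewpoint in words),
        None,
    )
    polarity = emotion_dict.get(emotion)
    if category is None or polarity is None:
        return None  # falls through to manual marking
    return category, polarity


label_dict = {"scenario": {"story", "plot"}, "actor": {"actor", "acting"}}
emotion_dict = {"good": "positive", "not good": "negative"}
print(keyword_match_label("story", "not good", label_dict, emotion_dict))
```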

Step S5, generating the comment viewpoint emotion analysis model, which consists of a comment label classification model and a label emotion classification model. The two classification models differ only in their class labels; the data processing and the classification algorithm are identical. There are two kinds of data set for the classification models: the data set marked by keyword matching and the data set marked manually. Each is used for training separately, producing 2 comment label classification models and 2 label emotion classification models. To improve the accuracy of emotion analysis, the 2 comment label classification models are fused by weighting to generate a new comment label classification model, and the 2 label emotion classification models are fused by weighting to generate a new label emotion classification model, as shown in FIG. 3 and FIG. 4. In this embodiment, the weight of the model generated from the keyword marking data is 0.4 and the weight of the model generated from the manual marking data is 0.6.

The comment opinion sentiment analysis probability calculation formula is as follows:

P_i = 0.4 * P_1i + 0.6 * P_2i

where P_i represents the probability that a given comment in the comment corpus belongs to category i, and P_1i and P_2i are the probability values obtained from the model generated from the keyword marking data and from the model generated from the manual marking data, respectively. For the comment label classification model, i takes the values 0 to 7, representing the 8 categories of director, photography, scenario, actor, emotion, audio-visual effect, subject matter and impression. For the label emotion classification model, i takes the values 0 and 1, where 1 represents positive emotion and 0 represents negative emotion.
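As a small worked example of the fusion formula, the sketch below combines two per-class probability vectors with the weights 0.4 and 0.6 given in this embodiment. The function name `fuse_probabilities` and the example probabilities are made up for illustration.

```python
from typing import List, Sequence


def fuse_probabilities(
    p_keyword: Sequence[float],
    p_manual: Sequence[float],
    w_keyword: float = 0.4,
    w_manual: float = 0.6,
) -> List[float]:
    """P_i = w_keyword * P_1i + w_manual * P_2i for every category i."""
    if len(p_keyword) != len(p_manual):
        raise ValueError("both models must score the same categories")
    return [w_keyword * a + w_manual * b for a, b in zip(p_keyword, p_manual)]


# Emotion model example: index 0 = negative, index 1 = positive.
fused = fuse_probabilities([0.9, 0.1], [0.4, 0.6])
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Because the two weights sum to 1, the fused values remain a probability distribution whenever both inputs are.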

The above-mentioned construction process of the classification model, see fig. 5, involves the following steps:

First, data balancing is performed. The classes in the classification data may be imbalanced, which strongly affects overall classification accuracy. The invention adopts an oversampling (up-sampling) strategy, that is, samples of the minority classes are duplicated multiple times.

Second, the data set is partitioned. The shuffled data set is divided into a training set and a test set at a ratio of 8:2.
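The balancing and partitioning steps might look roughly like this. The sketch assumes random duplication with replacement up to the size of the largest class and a fixed random seed for repeatability; the text only says that minority-class samples are copied several times.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (comment text, class label)


def oversample(samples: Sequence[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    target = max(len(group) for group in by_label.values())
    balanced: List[Sample] = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def shuffle_split(
    samples: Sequence[Sample], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle, then cut into training and test sets (8:2 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


data = [(f"pos {i}", "positive") for i in range(8)] + [("neg 0", "negative"), ("neg 1", "negative")]
train, test = shuffle_split(oversample(data))
print(len(train), len(test))
```

With balancing done before the split, as in the order described here, copies of the same minority sample can land in both the training set and the test set, which is worth keeping in mind when reading the evaluation scores.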

Then, feature extraction is performed. The corpus of the training set is segmented into words, stop words are removed, text features are extracted with the TF-IDF (term frequency-inverse document frequency) algorithm, and the chi-square value (CHI2, or χ²) of each feature is calculated. By setting a threshold K (K is an integer), the K features with the highest chi-square values are kept, which achieves feature dimension reduction.
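The feature extraction step can be illustrated with a dependency-free sketch. It makes several assumptions that the text does not fix: TF-IDF uses an unsmoothed natural-log IDF, the chi-square statistic is computed from the 2x2 table of term presence against class membership, and a term's score is its maximum chi-square over all classes. A library implementation may define these quantities differently.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf(docs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """docs are token lists; returns one {term: weight} mapping per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def chi_square(docs: Sequence[Sequence[str]], labels: Sequence[str], term: str, target: str) -> float:
    """Chi-square of term presence against membership of class `target`."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if present and label == target:
            a += 1      # term present, target class
        elif present:
            b += 1      # term present, other class
        elif label == target:
            c += 1      # term absent, target class
        else:
            d += 1      # term absent, other class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_top_k(docs: Sequence[Sequence[str]], labels: Sequence[str], k: int) -> List[str]:
    """Keep the K terms with the highest chi-square score (max over classes)."""
    vocab = sorted({t for doc in docs for t in doc})
    classes = sorted(set(labels))
    score = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    return sorted(vocab, key=lambda t: -score[t])[:k]


docs = [["plot", "great"], ["plot", "boring"], ["acting", "great"], ["acting", "boring"]]
labels = ["pos", "neg", "pos", "neg"]
keep = set(select_top_k(docs, labels, k=2))
reduced = [{t: w for t, w in vec.items() if t in keep} for vec in tfidf(docs)]
print(sorted(keep), reduced[0])
```

In this toy corpus the emotion-bearing terms separate the two classes perfectly while "plot" and "acting" appear equally in both, so only the former survive the reduction.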

And finally, importing the data into a random forest classification model, and performing model training, storage and evaluation.

Step S6, automatic generation of comment emotion labels. After the comment viewpoint emotion analysis model has been trained, new film comments can be marked automatically; the specific emotion prediction process is shown in FIG. 6. First, comment viewpoint extraction is performed to extract a (viewpoint word, emotion word) pair. If a pair is obtained, keyword matching is performed, including label category matching and emotion word matching, and if both matches succeed the result is output directly. Otherwise, the comment label classification model and/or the label emotion classification model are called directly to predict the label category and the label emotion; two thresholds T1 and T2 are set, and (comment label category mark, emotion tendency mark) is output if the label category prediction probability P1 is greater than T1 and the label emotion prediction probability P2 is greater than T2.
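The control flow of step S6 can be sketched as follows. The four callables are placeholders for the extraction rules, the keyword matcher and the two fused classification models; the default values of 0.5 for T1 and T2 are illustrative only, since the text does not state them; and for simplicity the sketch always calls both models in the fallback branch, whereas the text allows calling either or both.

```python
from typing import Callable, Optional, Tuple

Pair = Tuple[str, str]


def generate_label(
    comment: str,
    extract: Callable[[str], Optional[Pair]],
    keyword_match: Callable[[str, str], Optional[Pair]],
    predict_category: Callable[[str], Tuple[str, float]],
    predict_emotion: Callable[[str], Tuple[str, float]],
    t1: float = 0.5,
    t2: float = 0.5,
) -> Optional[Pair]:
    """Return (label category, emotion tendency), or None if nothing is confident."""
    pair = extract(comment)  # (viewpoint word, emotion word) or None
    if pair is not None:
        matched = keyword_match(*pair)
        if matched is not None:
            return matched  # both dictionary matches succeeded
    # Fall back to the classification models.
    category, p1 = predict_category(comment)
    polarity, p2 = predict_emotion(comment)
    if p1 > t1 and p2 > t2:
        return category, polarity
    return None


# Stub components, for illustration only.
result = generate_label(
    "the soundtrack gave me chills",
    extract=lambda c: None,
    keyword_match=lambda v, e: None,
    predict_category=lambda c: ("audio-visual effect", 0.82),
    predict_emotion=lambda c: ("positive", 0.91),
)
print(result)
```

Trying the dictionaries first keeps the high-precision rule path in front, and the thresholds stop the models from emitting a label when either prediction is uncertain.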

The above embodiments are provided only for illustrating the present invention and not for limiting the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention, and therefore all equivalent technical solutions should also fall within the scope of the present invention, and should be defined by the claims.

Claims

1. A movie comment viewpoint emotion tendentiousness analysis method is characterized by comprising the following steps:

2. The method for analyzing emotional tendency of opinion of movie reviews according to claim 1, wherein in step S1, the categories of movies include: romance, animation, action, science fiction, horror, comedy, and suspense;

3. The method for analyzing emotional tendency of opinion of movie reviews according to claim 1, wherein the data preprocessing comprises:

removing repeated data in the comment corpus;

deleting data with missing comment content in the comment corpus;

4. The method for analyzing emotional tendency of opinion of movie reviews, according to claim 1, wherein said step S3 includes:

5. The method of analyzing emotional tendency of opinion of movie reviews, according to claim 4, wherein the dependency syntax structure comprises: a subject-verb structure, a verb-object structure, an attributive structure, an adverbial structure, a verb-complement structure and a coordinate structure;

6. The method for analyzing emotional tendency of opinion of movie reviews according to claim 3, wherein said step S4 includes:

acquiring a label category dictionary and an emotion dictionary;

7. The method for analyzing emotional tendency of opinion of movie reviews according to claim 6, wherein said obtaining a dictionary of tag categories comprises:

8. The method for analyzing emotional tendency of opinion of movie reviews, according to claim 1, wherein said step S5 includes:

9. The method for analyzing emotion tendentiousness of comment viewpoint of movie as claimed in claim 8, wherein said step of generating preliminary comment label classification model or preliminary label emotion classification model includes:

10. The method for analyzing emotional tendency of opinion of movie reviews according to claim 6, wherein said step S6 includes: