CN107967258B

CN107967258B - Method and system for emotion analysis of text information

Info

Publication number: CN107967258B
Application number: CN201711183201.XA
Authority: CN
Inventors: 张毅; 黄宇
Original assignee: Guangzhou Iimedia Information Consulting Co ltd
Current assignee: Ai Media Consulting (Guangzhou) Co.,Ltd.
Priority date: 2017-11-23
Filing date: 2017-11-23
Publication date: 2021-09-17
Anticipated expiration: 2037-11-23
Also published as: CN107967258A

Abstract

The invention relates to a method and a system for analyzing emotion of text information, wherein keywords and context associated words of the keywords are extracted from acquired text information, the keywords and the context associated words are analyzed through a preset word vector analysis model to obtain first word vectors of the keywords, then second word vectors of the emotion words are obtained, and emotion values of the text information are obtained according to the first word vectors of the keywords and the second word vectors of the emotion words. In the scheme, the word vector analysis model analyzes the keywords and the context related words, the obtained first word vector of the keywords not only expresses the characteristics of the keywords, but also considers the characteristics of the context related words related to the keywords, accurately reflects the emotional characteristics of the keywords in the text information, and can obtain the emotional value as the emotional tendency of the text information by combining the second word vector of the emotional words, so that an accurate basis is provided for the further processing of the text information.

Description

Method and system for emotion analysis of text information

Technical Field

The invention relates to the technical field of data analysis, in particular to a method and a system for emotion analysis of text information.

Background

With the rapid development of the Internet, the network has become the main channel through which people obtain information. The network is flooded with information of every kind, so sorting that information is essential. For example, public comments on social events, trending figures and e-commerce products can be sorted; such comments are extremely varied, and they express the public's attitudes toward the commented objects, which can in turn be expressed as specific emotions.

At present, emotion analysis of information generally analyzes one specific word in the text information and judges the emotion of the whole text from it. Because the same word expresses different emotions in different contexts, analyzing emotion through a single specific word yields low accuracy.

Disclosure of Invention

Therefore, it is necessary to provide a method and a system for emotion analysis of text information to solve the problem that conventional emotion analysis based on a specific word has low accuracy.

A method for emotion analysis of text information comprises the following steps:

extracting keywords and context associated words of the keywords from the text information;

analyzing the keywords and the context associated words according to a preset word vector analysis model to obtain a first word vector of the keywords;

and acquiring the emotion value of the text information according to the first word vector and the second word vector, wherein the second word vector is a pre-stored word vector of the emotion words.

An emotion analysis system for text information, comprising:

the word acquisition unit is used for extracting keywords and context associated words of the keywords from the text information;

the word vector analysis unit is used for analyzing the keywords and the context associated words according to a preset word vector analysis model to obtain first word vectors of the keywords;

and the emotion value acquisition unit is used for acquiring the emotion value of the text information according to the first word vector and the second word vector, wherein the second word vector is a pre-stored word vector of the emotion words.

According to the method and the system for analyzing the emotion of the text information, the keywords and the context associated words of the keywords are extracted from the acquired text information, the keywords and the context associated words are analyzed through a preset word vector analysis model, a first word vector of the keywords is acquired, a second word vector of the emotion words is acquired, and the emotion value of the text information is obtained according to the first word vector of the keywords and the second word vector of the emotion words. In the scheme, the word vector analysis model analyzes the keywords and the context related words, the obtained first word vector of the keywords not only expresses the characteristics of the keywords, but also considers the characteristics of the context related words related to the keywords, accurately reflects the emotional characteristics of the keywords in the text information, and can obtain the emotional value as the emotional tendency of the text information by combining the second word vector of the emotional words, so that an accurate basis is provided for the further processing of the text information.

A readable storage medium, on which an executable program is stored, which when executed by a processor implements the steps of the method for emotion analysis of text information as described above.

An analysis device comprises a memory, a processor and an executable program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the emotion analysis method of the text information.

According to the emotion analysis method of the text information, the invention also provides a readable storage medium and analysis equipment, the keywords and the context relevant words can be analyzed through the word vector analysis model, the obtained first word vector of the keywords not only expresses the characteristics of the keywords, but also considers the characteristics of the context relevant words related to the keywords, the emotion characteristics of the keywords in the text information are accurately reflected, and in combination with the second word vector of the emotion words, the emotion value can be obtained to serve as the emotion tendency of the text information, so that accurate basis is provided for further processing of the text information.

Drawings

FIG. 1 is a flowchart illustrating a method for emotion analysis of text information according to an embodiment;

FIG. 2 is a schematic structural diagram of a system for emotion analysis of text information according to an embodiment;

FIG. 3 is a schematic structural diagram of a system for emotion analysis of text information according to an embodiment;

FIG. 4 is a simplified diagram of a model training process according to one embodiment;

FIG. 5 is a diagram illustrating a process of modifying an intermediate node vector by a Huffman tree according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the invention and do not limit its scope.

Fig. 1 is a schematic flow chart of a method for emotion analysis of text information according to an embodiment of the present invention. The emotion analysis method for the text information in the embodiment comprises the following steps:

step S110: extracting keywords and context associated words of the keywords from the text information;

in this step, the keyword may be a word that can directly express an emotion, or a word that appears in a text with a high frequency, and the context-related word is a word related to the keyword in a text paragraph, may reflect a language environment in which the keyword is located, and is located above or below the keyword in the text;

step S120: analyzing the keywords and the context associated words according to a preset word vector analysis model to obtain a first word vector of the keywords;

in this step, the first word vector is a vector value that can be identified and calculated corresponding to the keyword;

step S130: and acquiring the emotion value of the text information according to the first word vector and the second word vector, wherein the second word vector is a pre-stored word vector of the emotion words.

In this step, the emotion value of the text information can be obtained through the relationship between the first word vector of the keyword and the second word vector of the emotion word.
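
As an illustration of step S110, the following minimal sketch assumes the text has already been segmented into tokens, treats the most frequent tokens as keywords, and collects the tokens within a fixed window before and after each occurrence as context associated words. The function name `extract_keywords_with_context` and the parameters `top_k` and `window` are hypothetical and are not specified by the embodiment.

```python
from collections import Counter

def extract_keywords_with_context(tokens, top_k=2, window=2):
    """Pick the top_k most frequent tokens as keywords (an assumption)
    and gather the tokens located just before and after each occurrence."""
    counts = Counter(tokens)
    keywords = [word for word, _ in counts.most_common(top_k)]
    result = {}
    for keyword in keywords:
        context = []
        for i, token in enumerate(tokens):
            if token == keyword:
                start = max(0, i - window)
                # words above (before) and below (after) the keyword
                context.extend(tokens[start:i] + tokens[i + 1:i + 1 + window])
        result[keyword] = context
    return result

# Hypothetical, already-segmented comment text
tokens = ["match", "great", "team", "played", "great", "match", "fans", "happy"]
result = extract_keywords_with_context(tokens, top_k=2, window=1)
```

A keyword could equally be chosen because it directly expresses an emotion; frequency is used here only because it needs no extra resources.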

In this embodiment, the keywords and the context associated words of the keywords are extracted from the obtained text information, the keywords and the context associated words are analyzed through a preset word vector analysis model, a first word vector of the keywords is obtained, then a second word vector of the emotion words is obtained, and the emotion value of the text information is obtained according to the first word vector of the keywords and the second word vector of the emotion words. In the scheme, the word vector analysis model analyzes the keywords and the context related words, the obtained first word vector of the keywords not only expresses the characteristics of the keywords, but also considers the characteristics of the context related words related to the keywords, accurately reflects the emotional characteristics of the keywords in the text information, and can obtain the emotional value as the emotional tendency of the text information by combining the second word vector of the emotional words, so that an accurate basis is provided for the further processing of the text information.

It should be noted that there may be a plurality of keywords, and when there are a plurality of keywords, an emotion value may be obtained for each keyword, and then emotion values corresponding to all keywords are synthesized, so as to accurately obtain an emotion value of text information.

Further, the process of obtaining the emotion value of the text information according to the first word vector and the second word vector may be to perform vector distance calculation on the first word vector and the second word vector to obtain the emotion value of the text information.
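
The embodiment does not fix a particular distance measure. The sketch below assumes cosine similarity as the vector distance calculation, so that a value closer to 1 means the keyword is closer to the emotion word; the function name and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector carries no emotional signal
    return dot / (norm_u * norm_v)

first_word_vector = [0.2, 0.7, -0.1]   # hypothetical keyword vector
second_word_vector = [0.1, 0.8, 0.0]   # hypothetical emotion-word vector
emotion_value = cosine_similarity(first_word_vector, second_word_vector)
```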

In one embodiment, the step of analyzing the keywords and the context-related words according to a preset word vector analysis model further comprises the following steps:

establishing a binary neural network model, obtaining an information corpus to be trained, training the binary neural network model by taking the information corpus as a training sample, and obtaining a preset word vector analysis model.

In this embodiment, the information corpus is a set including a plurality of words, and the obtained information corpus can be used as a training sample to train the established binary neural network model, so that the binary neural network model continuously learns itself and is converted into a word vector analysis model capable of analyzing words and obtaining word vectors.

Optionally, the binary neural network model may be a CBOW model (Continuous Bag-of-Words Model) or a Skip-gram model (Continuous Skip-gram Model).

In one embodiment, the step of training the binary neural network model by using the information corpus as a training sample comprises the following steps:

selecting a target word and a related word from the information corpus, initializing an original word vector of the target word and the related word, analyzing the original word vector of the related word through a binary neural network model to obtain an error vector of the original word vector of the target word, and correcting the original word vector of the target word according to the error vector of the original word vector of the target word.

In this embodiment, when the information corpus is used as a training sample, a target word and related words may be selected. A related word is a word associated with the target word in a given language environment, and the relationship between the target word and its related words is analogous to the relationship between a keyword and its context associated words in the text information. During training, the initialized original word vectors of the related words are analyzed to obtain an error vector for the original word vector of the target word, and the original word vector of the target word is corrected with that error vector. Repeating this correction over many training passes strengthens the binary neural network model, so that the final word vector analysis model can analyze an input keyword and its context associated words and accurately obtain the first word vector of the keyword.

In one embodiment, the related words are multiple, and the step of analyzing the original word vectors of the related words through the binary neural network model comprises the following steps:

adding the original word vectors of all related words to obtain a sum vector;

constructing a Huffman tree of a binary neural network model by taking the target word and each related word as leaf nodes, acquiring a path from a root node of the Huffman tree to the leaf node corresponding to the target word, and classifying corresponding intermediate nodes according to the sum vector and the vector of the intermediate nodes in the path;

if the classification result of the current intermediate node is different from the trend of the path, correcting the vector of the current intermediate node according to the trend of the path, and acquiring an error vector of the current intermediate node;

and adding the error vectors of all the intermediate nodes to obtain the error vector of the original word vector of the target word.

In this embodiment, there may be a plurality of related words. A Huffman tree of the binary neural network model is constructed with the target word and each related word as leaf nodes, from which the path from the root node to the leaf node corresponding to the target word is obtained. The original word vectors of the related words are added to form a sum vector, and each intermediate node on the path is classified according to the sum vector and the vector of that intermediate node. The vector of an intermediate node is corrected according to its classification result, and the sum of the error vectors produced by correcting the intermediate nodes is the error vector of the original word vector of the target word. Correcting the original word vector of the target word with this error vector makes the word vector of the target word reflect the information of the related words and therefore more accurate.

It should be noted that, when constructing the Huffman tree of the binary neural network model, the vectors of the non-leaf nodes of the Huffman tree may be initialized; optionally, the initial value of each non-leaf node vector may be a zero vector.

Optionally, when the intermediate nodes are classified according to the sum vector of each related word and the vector of the intermediate node in the path, a logistic regression classification method or other types of regression classification methods may be used.
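
The tree construction described above can be sketched as follows. This is a minimal illustration that assumes word frequencies are used as the Huffman weights and that a left branch is coded 1 and a right branch 0, in line with the worked example in the specific embodiment; `build_huffman_paths`, the sample frequencies and the dimension `dim` are hypothetical.

```python
import heapq
import itertools

def build_huffman_paths(freq):
    """Return {word: (code, intermediate_node_ids)} for a frequency table."""
    if not freq:
        return {}
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {"word": w}) for w, f in freq.items()]
    heapq.heapify(heap)
    inner_ids = itertools.count()
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        node = {"id": next(inner_ids), "left": left, "right": right}
        heapq.heappush(heap, (f1 + f2, next(tie), node))
    paths = {}

    def walk(node, code, visited):
        if "word" in node:  # leaf node: record its code and path
            paths[node["word"]] = (code, visited)
        else:
            walk(node["left"], code + "1", visited + [node["id"]])
            walk(node["right"], code + "0", visited + [node["id"]])

    walk(heap[0][2], "", [])
    return paths

freq = {"football": 5, "match": 3, "goal": 2, "team": 1}  # hypothetical counts
paths = build_huffman_paths(freq)
dim = 4
# Non-leaf (intermediate) node vectors start as zero vectors
inner_vectors = {i: [0.0] * dim for i in range(len(freq) - 1)}
```

Each word's entry gives the sequence of left/right decisions and the intermediate nodes whose vectors are classified and corrected during training.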

In one embodiment, the step of obtaining the information corpus to be trained includes the following steps:

and acquiring a network data text, filtering noise information of the network data text, and cutting words to generate an information corpus to be trained.

In this embodiment, words in the network data text can be obtained as the information corpus to be trained. Because the words in such a corpus are highly relevant to the keywords in the text information, the training accuracy of the word vector analysis model can be improved. Filtering noise information from the network data text removes information that is irrelevant to the words used for model training, which facilitates word segmentation and yields an effective information corpus.
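
A minimal sketch of this preprocessing is given below. It assumes that noise information means HTML tags, URLs and user mentions, and it uses a simple regular-expression tokenizer as a stand-in for word cutting; Chinese text would require a dedicated word segmenter instead. All names and patterns are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # HTML remnants
URL_RE = re.compile(r"https?://\S+")     # links
MENTION_RE = re.compile(r"@\w+")         # user mentions
TOKEN_RE = re.compile(r"\w+")            # stand-in for a real word segmenter

def clean_and_segment(raw_text):
    """Strip assumed noise patterns, then cut the text into lowercase tokens."""
    text = TAG_RE.sub(" ", raw_text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return [token.lower() for token in TOKEN_RE.findall(text)]

comment = "<p>Great match! See https://example.com/a @bob</p>"
corpus_tokens = clean_and_segment(comment)
```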

In one embodiment, the emotion analyzing method for text information further comprises the following steps:

and analyzing the emotional words according to the word vector analysis model, acquiring a second word vector of the emotional words and storing the second word vector.

In this embodiment, the second word vector of the emotion word may also be obtained through a word vector analysis model, when the information corpus to be trained is rich enough, the information corpus may also include the emotion word, and the emotion word is used as a target word, and the emotion word may be analyzed to obtain the second word vector of the emotion word. The second word vector of the emotional word can be stored in advance before the keyword and the context associated word are analyzed according to the word vector analysis model.

In one embodiment, the step of obtaining the emotion value of the text message according to the first word vector and the second word vector comprises the following steps:

and respectively acquiring relative values of different emotion words corresponding to the text information according to the first word vector and the second word vectors of different emotion words, and taking the maximum relative value as the emotion value of the text information.

In this embodiment, there may be a plurality of emotion words, a plurality of relative values corresponding to different emotion words may be obtained according to the first word vector of the keyword and the second word vector of different emotion words, and the largest relative value may be selected as the emotion value of the text information, so that the emotion value of the text information matches the characteristics of the text information itself.
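
The selection of the maximum relative value can be sketched as follows, assuming cosine similarity as the relative value and a small in-memory table of pre-stored second word vectors. The emotion labels, vectors and function names are hypothetical.

```python
import math

def relative_value(u, v):
    """Assumed relative value: cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def strongest_emotion(first_word_vector, emotion_vectors):
    """Score the keyword against every emotion word and keep the maximum."""
    scores = {
        word: relative_value(first_word_vector, vector)
        for word, vector in emotion_vectors.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical pre-stored second word vectors
emotion_vectors = {"happy": [1.0, 0.0], "angry": [0.0, 1.0], "sad": [-1.0, 0.0]}
label, value = strongest_emotion([0.9, 0.1], emotion_vectors)
```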

Furthermore, if a plurality of keywords exist, statistical analysis can be performed on the relative values of all the keywords, and a plurality of relative values are selected as the emotion values of the text information according to a preset proportion.

Optionally, the categories of emotion words may be joy, anger, sorrow, happiness, sadness, fear, hatred, surprise, calm, disappointment, excitement, and so on. The emotion words of each category may also have different forms of expression.

The present invention also provides a text information emotion analysis system according to the text information emotion analysis method, and an embodiment of the text information emotion analysis system of the present invention will be described in detail below.

Fig. 2 is a schematic structural diagram of a system for emotion analysis of text information according to an embodiment of the present invention. The emotion analysis system for text information in this embodiment includes:

a word obtaining unit 210 configured to extract a keyword and a context related word of the keyword from text information;

the word vector analysis unit 220 is configured to analyze the keyword and the context associated word according to a preset word vector analysis model to obtain a first word vector of the keyword;

the emotion value obtaining unit 230 is configured to obtain an emotion value of the text information according to the first word vector and a second word vector, where the second word vector is a word vector of a pre-stored emotion word.

In this embodiment, as shown in fig. 3, the emotion analysis system for text information further includes a model establishing unit 240, configured to establish a binary neural network model, obtain an information corpus to be trained, train the binary neural network model by using the information corpus as a training sample, and obtain a preset word vector analysis model.

In one embodiment, the model building unit 240 selects a target word and a related word from the information corpus, initializes an original word vector of the target word and the related word, analyzes the original word vector of the related word through a binary neural network model, obtains an error vector of the original word vector of the target word, and corrects the original word vector of the target word according to the error vector of the original word vector of the target word.

In one embodiment, the related words are multiple, and the model building unit 240 adds the original word vectors of the related words to obtain a sum vector; constructing a Huffman tree of a binary neural network model by taking the target word and each related word as leaf nodes, acquiring a path from a root node of the Huffman tree to the leaf node corresponding to the target word, and performing logistic classification on corresponding intermediate nodes according to the sum vector and the vector of the intermediate nodes in the path; if the classification result of the current intermediate node is different from the trend of the path, correcting the vector of the current intermediate node according to the trend of the path, and acquiring an error vector of the current intermediate node; and adding the error vectors of all the intermediate nodes to be used as the error vector of the original word vector of the target word.

In one embodiment, the model building unit 240 obtains the web data text, performs noise information filtering on the web data text, and cuts words to generate the information corpus to be trained.

In one embodiment, the emotion value obtaining unit 230 analyzes the emotion words according to the word vector analysis model, obtains a second word vector, and stores the second word vector.

In one embodiment, the emotion value acquisition unit 230 acquires relative values of different emotion words corresponding to the text information according to the first word vector and the second word vectors of the different emotion words, and takes the maximum relative value as the emotion value of the text information.

The emotion analysis system for text information and the emotion analysis method for text information correspond to each other one by one, and technical features and beneficial effects thereof described in the embodiment of the emotion analysis method for text information are all applicable to the embodiment of the emotion analysis system for text information.

The terms "first," "second," and the like are used merely to distinguish one element from another, and do not limit the other elements.

According to the emotion analysis method of the text information, the embodiment of the invention also provides a readable storage medium and analysis equipment.

The readable storage medium stores an executable program, and the program realizes the steps of the emotion analysis method of the text information when being executed by a processor; the analysis device comprises a memory, a processor and an executable program which is stored on the memory and can run on the processor, and the processor realizes the steps of the emotion analysis method of the text information when executing the program.

In a specific embodiment, the scheme of the embodiment of the invention can be applied to scenes such as sentiment analysis of network comment information.

To reflect netizens' attitudes toward a trending event, the comments that netizens post under news reports of the event can be selected as the corpus. A dedicated crawler module can collect the comment information for the event and store it in a database; the comment information is then passed to a text processing module, which filters noise information and segments words to generate the corpus to be trained.

In this embodiment, the corpus is processed in the Word2vec manner and trained with a binary neural network model. The training of every word depends on the words in its context, so context information is taken into account, and all words are trained into word vectors in the same space. Using a dedicated emotion lexicon (a lexicon formed by coarsely grouping words that express emotions), the value obtained by vector distance calculation represents the emotion value of the target word, thereby realizing emotion analysis and calculation that take context information into account.

A word vector, as the name implies, uses a vector to express a word. A machine cannot understand the meaning of a word as a human does, so words are converted into word vectors that a machine can recognize and compute with. Word2vec is a scheme for converting text into reasonable word vectors, and the training model it uses may be CBOW (Continuous Bag-of-Words Model) or Skip-gram (Continuous Skip-gram Model). Taking CBOW as an example, the model is based on a Huffman tree, in which the initial value of the intermediate vector stored by each non-leaf node may be a zero vector, and the initialization of the word vector of the word corresponding to each leaf node is related to the position and occurrence frequency of the word in the text information. The training process is shown in FIG. 4:

There are three main layers: the input layer (input), the mapping layer (projection) and the output layer (output). The input layer holds the word vectors of the n-1 words surrounding a word A. If n is 5, the two words before and the two words after word A (denoted w(t)) are w(t-2), w(t-1), w(t+1) and w(t+2), and the word vectors of these 4 words are denoted v(w(t-2)), v(w(t-1)), v(w(t+1)) and v(w(t+2)). Going from the input layer to the mapping layer is simple: the n-1 word vectors are added together. From the mapping layer to the output layer, a Huffman tree is constructed. Starting from the root node, the value of the mapping layer is repeatedly subjected to Logistic classification along the Huffman tree, and each intermediate vector and the word vector are corrected along the way.

Taking FIG. 5 as an example, in the Huffman tree the middle word is w(t), and the input from the mapping layer is pro(t) = v(w(t-2)) + v(w(t-1)) + v(w(t+1)) + v(w(t+2)).

If the word at this time is "football", that is, w(t) = "football", and its Huffman code is d(t) = "1001", then the path from the root node to the leaf node is "left, right, right, left": starting from the root node, the path first turns left, then turns right twice, and finally turns left.

The intermediate vector of each node on the path is then corrected in order from top to bottom. At the first node, Logistic classification is performed on the node's intermediate vector θ(t,1) and pro(t). If the classification result is 0, the classification is wrong (the path should turn left, i.e., the result should be 1), so θ(t,1) is corrected and the error amount is recorded.

The second node is then processed in the same way: θ(t,2) is corrected and the error amount is accumulated. Subsequent nodes are handled likewise.

After all nodes have been processed and the leaf node has been reached, the word vector v(w(t)) is corrected according to the accumulated error.

Thus, the processing of one word w(t) is complete. If a text contains N words, the above process is repeated N times, from w(0) to w(N-1). After training, a vector is obtained for each word, and the resulting model can perform word vector analysis on input words and is applied to analyzing the emotion value of text information.
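
The processing of a single word w(t) described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a sigmoid as the Logistic classifier, a gradient-style correction scaled by a hypothetical learning rate `rate`, and a path code in which 1 means turning left. Note that the commonly used reference implementation of Word2vec applies the accumulated error to the context word vectors, whereas this sketch follows the text and corrects v(w(t)).

```python
import math

def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))

def cbow_step(context_vectors, target_vector, path_vectors, code, rate=0.025):
    """One pass for a word w(t): classify at each intermediate node on the
    path, correct the node vector, and accumulate the error for v(w(t))."""
    dim = len(target_vector)
    # Mapping layer: pro(t) is the sum of the context word vectors
    pro = [sum(vec[k] for vec in context_vectors) for k in range(dim)]
    error = [0.0] * dim
    for theta, bit in zip(path_vectors, code):
        predicted = sigmoid(sum(p * t for p, t in zip(pro, theta)))
        step = rate * (bit - predicted)   # zero when the prediction is exact
        for k in range(dim):
            error[k] += step * theta[k]   # accumulate before theta changes
            theta[k] += step * pro[k]     # correct the intermediate vector
    for k in range(dim):
        target_vector[k] += error[k]      # correct v(w(t)) at the leaf
    return error

# Hypothetical 2-dimensional vectors for w(t-2), w(t-1), w(t+1), w(t+2)
context = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
target = [0.0, 0.0]                        # v(w(t))
thetas = [[0.0, 0.0] for _ in range(4)]    # zero-initialized intermediate vectors
code = [1, 0, 0, 1]                        # d(t) = "1001"
first_error = cbow_step(context, target, thetas, code)
second_error = cbow_step(context, target, thetas, code)
```

With zero-initialized intermediate vectors, the first pass only moves the intermediate vectors; the word vector begins to change from the second pass onward.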

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

Those skilled in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a readable storage medium and, when executed, performs the steps of the method described above. The storage medium includes ROM/RAM, a magnetic disk, an optical disk, and the like.

The above embodiments express only several implementations of the present invention, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

1. A method for emotion analysis of text information is characterized by comprising the following steps:

extracting keywords and context associated words of the keywords from text information;

analyzing the keywords and the context associated words according to a preset word vector analysis model to obtain first word vectors of the keywords;

acquiring an emotion value of the text information according to the first word vector and a second word vector, wherein the second word vector is a word vector of a prestored emotion word;

the step of analyzing the keywords and the context associated words according to a preset word vector analysis model further comprises the following steps of:

establishing a binary neural network model, acquiring an information corpus to be trained, and training the binary neural network model by taking the information corpus as a training sample to obtain the preset word vector analysis model;

the step of training the binary neural network model by using the information corpus as a training sample comprises the following steps:

selecting a target word and a related word from the information corpus, initializing an original word vector of the target word and the related word, analyzing the original word vector of the related word through the binary neural network model to obtain an error vector of the original word vector of the target word, and correcting the original word vector of the target word according to the error vector of the original word vector of the target word;

the related words are multiple, and the step of analyzing the original word vectors of the related words through the binary neural network model comprises the following steps:

adding the original word vectors of the related words to obtain a sum vector;

constructing a Huffman tree of the binary neural network model by taking the target word and each related word as leaf nodes, acquiring a path from a root node of the Huffman tree to the leaf node corresponding to the target word, and performing Logistic classification on corresponding intermediate nodes according to the sum vector and vectors of the intermediate nodes in the path;

and adding the error vectors of all the intermediate nodes to be used as the error vector of the original word vector of the target word.

2. The emotion analysis method for text information according to claim 1, wherein the step of obtaining the corpus of information to be trained includes the steps of:

and acquiring a network data text, filtering noise information of the network data text, and cutting words to generate the information corpus to be trained.

3. The emotion analysis method for text information according to claim 1, further comprising the steps of:

and analyzing the emotional words according to the preset word vector analysis model to obtain and store the second word vector.

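4. The emotion analysis method for text information according to any one of claims 1 to 3, wherein the step of obtaining the emotion value of the text information according to the first word vector and the second word vector comprises the following step:

respectively acquiring, according to the first word vector and the second word vectors of different emotion words, relative values of the different emotion words corresponding to the text information, and taking the maximum relative value as the emotion value of the text information.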

5. An emotion analysis system for text information, comprising:

the word acquisition unit is used for extracting keywords and context associated words of the keywords from text information;

the emotion value acquisition unit is used for acquiring the emotion value of the text information according to the first word vector and a second word vector, wherein the second word vector is a pre-stored word vector of emotion words;

the model establishing unit is used for establishing a binary neural network model, acquiring an information corpus to be trained, and training the binary neural network model by using the information corpus as a training sample to obtain the preset word vector analysis model;

the model establishing unit is further configured to select a target word and a related word from the information corpus, initialize original word vectors of the target word and the related word, analyze the original word vectors of the related word through the binary neural network model, obtain an error vector of the original word vector of the target word, and correct the original word vector of the target word according to the error vector of the original word vector of the target word;

the model establishing unit is further configured to add the original word vectors of the related words to obtain a sum vector; construct a Huffman tree of the binary neural network model by taking the target word and each related word as leaf nodes, acquire a path from a root node of the Huffman tree to the leaf node corresponding to the target word, and perform Logistic classification on corresponding intermediate nodes according to the sum vector and vectors of the intermediate nodes in the path; if the classification result of the current intermediate node is different from the trend of the path, correct the vector of the current intermediate node according to the trend of the path, and acquire an error vector of the current intermediate node; and add the error vectors of all the intermediate nodes to be used as the error vector of the original word vector of the target word.

6. The emotion analysis system for text information according to claim 5, wherein the model establishing unit is further configured to acquire a network data text, filter noise information from the network data text, and perform word segmentation to generate the information corpus to be trained.

7. The emotion analysis system for text information according to claim 5, wherein the emotion value acquisition unit is further configured to analyze the emotion words according to the preset word vector analysis model, and to acquire and store the second word vector.

8. The emotion analysis system for text information according to any one of claims 5 to 7, wherein the emotion value acquisition unit is further configured to respectively acquire, according to the first word vector and the second word vectors of different emotion words, relative values of the different emotion words corresponding to the text information, and to use the maximum relative value as the emotion value of the text information.
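
The selection of the maximum relative value can be sketched in Python as follows. This is only an illustrative sketch: the claims do not define how a relative value is computed, so cosine similarity is assumed here, and averaging the first word vectors of several keywords into one text vector is likewise an assumption. The names `cosine` and `emotion_value` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity, used here as the assumed 'relative value'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def emotion_value(first_vectors, emotion_vectors):
    """Return (best emotion word, its relative value, all relative values).

    first_vectors: list of first word vectors, one per keyword.
    emotion_vectors: dict mapping each emotion word to its stored second word vector.
    """
    dim = len(first_vectors[0])
    # Assumption: represent the text by the mean of its keyword vectors.
    text_vec = [sum(v[i] for v in first_vectors) / len(first_vectors)
                for i in range(dim)]
    relative = {w: cosine(text_vec, vec) for w, vec in emotion_vectors.items()}
    best = max(relative, key=relative.get)  # maximum relative value wins
    return best, relative[best], relative
```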

9. A readable storage medium on which an executable program is stored, characterized in that the program, when executed by a processor, carries out the steps of the emotion analysis method for text information according to any one of claims 1 to 4.

10. An analysis device comprising a memory, a processor and an executable program stored on the memory and operable on the processor, wherein the processor, when executing the program, implements the steps of the emotion analysis method for text information according to any one of claims 1 to 4.