CN110175325B - Comment analysis method based on word vector and syntactic characteristics and visual interaction interface - Google Patents
- Publication number
- CN110175325B (application number CN201910343337.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- words
- emotion
- evaluation
- dependency
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9532—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Development Economics (AREA)
- Strategic Management (AREA)
- Economics (AREA)
- General Business, Economics & Management (AREA)
- Mathematical Physics (AREA)
- Marketing (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a comment analysis method based on word vectors and syntactic features, in the field of data analysis, comprising the following steps: acquiring comment data from commodity pages of an e-commerce website; preprocessing the acquired target data set; extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary; training word vectors on the preprocessed data set with the Word2Vec tool; building a probability transition matrix from the semantic similarity matrix; processing the commodity comment text with core sentence rules; preprocessing the resulting redundancy-free text; extracting <commodity attribute, negation word, degree word, emotion word> evaluation matching pairs from the resulting dependency relation pairs according to part of speech; and combining the evaluation matching pairs with the emotion dictionary to calculate positive and negative polarity values for the evaluation objects and rank them from best to worst. The method is finally delivered through a visual interactive interface, enabling accurate, real-time, automatic and convenient processing and analysis of commodity comment data, and is applicable to e-commerce platforms.
Description
Technical Field
The invention belongs to the technical field of data analysis. It relates in particular to an emotion dictionary suited to commodity comments and an attribute recognition algorithm, both built on word vectors trained by a neural network model, and to a comment analysis system based on word vectors and syntactic features.
Background
With the popularization of the Internet and the development of electronic commerce, e-commerce websites such as JD.com (Jingdong) and Taobao have grown rapidly, and more and more consumers choose to shop online. These websites host a massive number of commodities and a large user base, so they generate huge volumes of comment data. The comments left by consumers often carry their subjective feelings about a purchase, including their liking for the purchased goods and their satisfaction with the merchant's service. For consumers, these comment texts help them learn more objectively about the relevant goods or services and so make a more suitable choice. For merchants, the experience information fed back by users helps them improve their services or commodity quality in a targeted manner and so gain more customers and profit. However, with the explosive growth of data volume, the cost for a user to obtain useful information from massive comment data also increases. Processing and analyzing user comment text rapidly and effectively, and extracting valuable information from it, therefore has important application value and research significance.
At present, a large amount of comment data cannot be fully utilized, and it is difficult for consumers to obtain valuable information from it. A comment analysis system based on word vectors and syntactic features is therefore studied: the satisfaction of users with each attribute of a commodity is obtained from the analysis results, the advantages and disadvantages of the commodity are summarized, and the analysis results are then visualized.
Disclosure of Invention
The technical problem to be solved by the invention is how to process and analyze commodity comment data accurately, in real time, automatically and conveniently. To overcome the defects of the prior art, the invention provides a comment analysis method based on word vectors and syntactic features.
The invention provides a comment analysis method based on word vectors and syntactic features, which comprises the following steps:
1) Acquiring comment data of commodity pages of an e-commerce website;
2) Preprocessing the obtained target data set, and constructing a candidate emotion word set;
3) Extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary;
4) Training word vectors on the preprocessed data set with the Word2Vec tool to obtain word vectors, and generating a semantic similarity matrix;
5) Building a probability transition matrix from the semantic similarity matrix, combining it with a seed word set, applying the LPA label propagation algorithm, and generating the final emotion dictionary after checking the result against the basic emotion dictionary;
6) Processing the obtained commodity comment text with core sentence rules to obtain comment text with redundancy removed;
7) Preprocessing the redundancy-free text, forming a dependency tree for the resulting word segmentation data set based on dependency relations and syntactic features, and generating SBV, VOB, ATT, CMP and COO dependency relation pairs;
8) Extracting <commodity attribute, negation word, degree word, emotion word> evaluation matching pairs from the dependency relation pairs according to part of speech;
9) Combining the evaluation matching pairs with the emotion dictionary, calculating positive and negative polarity values for the evaluation objects and ranking them from best to worst, and finally delivering the method through a visual interaction interface.
As a further definition of the invention, step 2) specifically comprises:
2-1) removing illegal characters using a character matching algorithm;
2-2) performing word segmentation and part-of-speech tagging on the original data set using LTP;
2-3) extracting the words with the required part of speech and de-duplicating them to form candidate emotion word set 1;
2-4) performing word segmentation and part-of-speech tagging on the original data set using NLPIR;
2-5) extracting the words with the required part of speech and de-duplicating them to form candidate emotion word set 2;
2-6) merging candidate emotion word set 1 and candidate emotion word set 2 and de-duplicating the result to obtain the candidate emotion word set.
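Steps 2-3) to 2-6) can be sketched as follows. This is a minimal illustration that assumes the two segmenters have already produced (word, tag) pairs; it calls no LTP or NLPIR API. The helper names `candidate_words` and `merge_candidates`, the tag value "a" for adjectives, and the English stand-in words are assumptions made for the example.

```python
def candidate_words(tagged_tokens, wanted_tags=("a",)):
    """Return the de-duplicated words whose POS tag is in wanted_tags."""
    return {word for word, tag in tagged_tokens if tag in wanted_tags}


def merge_candidates(tagged_by_tool_1, tagged_by_tool_2):
    """Union of the candidate sets produced by two independent taggers."""
    set_1 = candidate_words(tagged_by_tool_1)   # candidate emotion word set 1
    set_2 = candidate_words(tagged_by_tool_2)   # candidate emotion word set 2
    return set_1 | set_2                        # set union removes duplicates


# Hypothetical tagger outputs for the same comment (English stand-ins).
ltp_like = [("screen", "n"), ("clear", "a"), ("good", "a")]
nlpir_like = [("screen", "n"), ("good", "a"), ("fast", "a")]
candidates = merge_candidates(ltp_like, nlpir_like)
print(sorted(candidates))
```

Using two taggers and taking the union trades some precision for recall: a word missed by one segmenter can still enter the candidate set through the other.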
As a further definition of the invention, step 3) specifically comprises: extracting the commendatory and derogatory words from the HowNet emotion dictionary and the NTU evaluation word dictionary respectively, then merging and de-duplicating them to form the basic emotion dictionary.
As a further definition of the invention, step 4) specifically comprises:
4-1) training Word2Vec on the data set to obtain the word vector of each word;
4-2) using the candidate emotion word set, calculating the semantic similarity between words with the following formula:
4-3) for two n-dimensional word vectors a(x_{11}, x_{12}, …, x_{1n}) and b(x_{21}, x_{22}, …, x_{2n}), the semantic similarity is calculated as:
$$SIM(a, b) = \frac{\sum_{k=1}^{n} x_{1k}\, x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\ \sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$
where SIM(a, b) is the semantic similarity value; x_{1k} is the k-th dimension value of word vector a; x_{2k} is the k-th dimension value of word vector b;
4-4) constructing the semantic similarity matrix from the calculated semantic similarities.
As a further definition of the invention, step 5) specifically comprises:
5-1) regarding each word as a node of a graph, where the weight of the edge between two nodes is the semantic similarity between the words they represent;
5-2) establishing a probability transition matrix P according to the following formula:
$$P[i][j] = \frac{SIM(w_i, w_j)}{\sum_{k=1}^{m} SIM(w_i, w_k)}$$
where P[i][j] is the similarity transition probability from word i to word j, SIM(w_i, w_j) is the similarity of words i and j, and m is the number of words with the highest semantic similarity to word i, over which the sum runs;
5-3) counting the word frequency, in the original comment data, of every emotion word in the candidate emotion word set, and selecting the N words with the highest frequency to form seed word set 1; using the emotion vocabulary ontology library to select the words in the candidate emotion word set whose ontology intensity is greater than m to form seed word set 2; merging seed word set 1 and seed word set 2, de-duplicating the result to form the seed word set, and labeling its emotion manually;
5-4) building an L×C label matrix Y_L from the small number of manually labeled seed words, where L is the number of seed words and C is the number of classes, here 3: commendatory, derogatory and neutral;
5-5) building a U×C label matrix Y_U from the unlabeled sample words, where U is the number of unlabeled sample words and C is the number of classes, here 3: commendatory, derogatory and neutral;
5-6) finally, labeling the sample words with the LPA label propagation algorithm, and forming the final emotion dictionary after checking the result against the basic emotion dictionary.
As a further definition of the invention, step 6) specifically comprises:
Core sentence processing deletes redundancy and retains the trunk components relevant to evaluation collocations; a sentence that matches no rule is left unchanged. Core sentences are used to improve the accuracy of syntactic dependency analysis of the evaluation text. The rules are:
Rule 1: deleting sentence-initial components such as the sequences "… advantage", "… disadvantage", "… deficiency", "… advantage", "… benefit";
Rule 2: deleting expressions with a hypothetical tendency, such as "if …", "hope …", "if …", "wish …", "suggestion …";
Rule 3: deleting sequences such as "exactly", "naturally", "particularly", "still further", "particularly";
Rule 4: deleting opinion-claim words such as "feel" and "consider";
Rule 5: deleting every punctuation mark in a run of consecutive punctuation marks except the first, together with abnormal characters such as emoticons, kaomoji and brackets.
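The five rules can be sketched with regular expressions. This is only an illustration: the trigger phrases below are English stand-ins for the Chinese originals, the patterns and the name `core_sentence` are assumptions, and rule 2 is applied the way the worked example in the detailed description applies it (only the trigger word is deleted, not the whole clause).

```python
import re

# Illustrative English stand-ins for the Chinese trigger phrases (assumptions).
RULE_PATTERNS = [
    r"\bthe only (?:advantage|disadvantage|deficiency|benefit) is (?:that )?",  # rule 1
    r"\b(?:if|hope|wish|suggest)\s+",                                           # rule 2
    r"\b(?:exactly|naturally|especially|particularly)\s+",                      # rule 3
    r"\bI (?:feel|think)\s+",                                                   # rule 4
]


def core_sentence(text):
    """Apply rules 1-4 in order, then rule 5; unmatched text is left unchanged."""
    for pattern in RULE_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Rule 5a: an opening bracket becomes a clause break, a closing bracket is dropped.
    text = re.sub(r"\s*[(\[]\s*", ", ", text)
    text = re.sub(r"\s*[)\]]", "", text)
    # Rule 5b: keep only the first mark of a run of punctuation marks.
    text = re.sub(r"([,.!?;])(?:\s*[,.!?;])+", r"\1", text)
    return text.strip()


comment = ("The phone arrived, the pixels and sound quality are both good, "
           "especially the delivery is great (arrived the next day), "
           "the only disadvantage is that the packaging is not very good, "
           "hope the store can improve. . .")
print(core_sentence(comment))
```

Applying the rules before parsing keeps hedging and framing words out of the dependency tree, so the parser sees plain attribute-plus-evaluation clauses.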
As a further definition of the invention, step 7) specifically comprises:
five axioms of dependency syntax:
(1) A sentence has one and only one independent component;
(2) Every other component in a sentence must depend on some component;
(3) Any component in a sentence cannot depend on two or more components at the same time;
(4) If component a depends directly on component b and component c is located between components a and b in the sentence, then component c depends on a or b or other components between a and b;
(5) The components on the left and right sides of the central component have no dependency relationship with each other;
the dependency tree is characterized by:
(1) Nodes in the tree are served by the individual components in the sentence;
(2) The root node of the tree is the center component of the whole sentence;
(3) Edges formed among nodes in the tree have directionality, reflecting asymmetric dependency relationships among components;
(4) Five axioms of the dependency syntax are satisfied;
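The axioms and tree properties above can be checked mechanically. The sketch below assumes a sentence is stored as a list `heads`, where `heads[i]` is the index of the word that word i depends on and -1 marks the independent (root) component; this representation and the function name are assumptions. Axiom 3 holds by construction because each word stores a single head, and axiom 5 is not checked.

```python
def is_valid_dependency_tree(heads):
    """Check axioms 1, 2 and 4 plus acyclicity for a head-index list."""
    n = len(heads)
    # Axiom 1: exactly one independent component (the root).
    if sum(1 for h in heads if h == -1) != 1:
        return False
    # Axiom 2: every other component depends on an existing component.
    if any(h != -1 and not (0 <= h < n) for h in heads):
        return False
    # Tree property: following heads from any word must reach the root (no cycles).
    for start in range(n):
        seen = set()
        node = start
        while node != -1:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    # Axiom 4: a word lying between a dependent and its head must itself
    # depend on one of the two or on another word between them.
    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        for mid in range(lo + 1, hi):
            if not (lo <= heads[mid] <= hi):
                return False
    return True


# "pixels and sound-quality both good": pixels->good, and->sound-quality,
# sound-quality->pixels (coordination), both->good, good is the root.
print(is_valid_dependency_tree([4, 2, 0, 4, -1]))
```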
Most sentence dependency relations in comments fall into five categories: subject-verb (SBV), verb-object (VOB/FOB), attribute (ATT), verb-complement (CMP) and coordination (COO). Dependency syntax analysis can be carried out with the LTP dependency parser, and the dependency relation pairs are extracted in combination with a COO algorithm that identifies coordinated evaluation objects and coordinated evaluation words. The COO algorithm specifically comprises the following steps:
traversing, in the dependency syntax tree, all words between and to the left and right of the two nodes of each SBV, VOB, ATT or CMP dependency relation pair obtained from the dependency relations and syntactic features;
judging whether any of the traversed words stands in a COO relation;
expanding the pairs with the coordinated evaluation objects and evaluation words found through the COO relations.
As a further definition of the invention, step 8) specifically comprises:
8-1) according to the characteristics of the Chinese language, most evaluation objects are nouns or verbs, and most evaluation words are adjectives or verbs;
8-2) extracting the evaluation object and the evaluation word, namely the commodity attribute and the emotion word, according to part of speech;
8-3) traversing the dependency syntax tree to check whether negation words occur between the evaluation object and the evaluation word, incrementing the negation-word count by 1 for each one found; when the traversal is finished, judging the parity of the count: if it is odd, the corresponding negation value private is assigned -1, and if it is even, private is assigned +1;
8-4) traversing the dependency syntax tree to check whether degree words occur between the evaluation object and the evaluation word, and accumulating their number to obtain the degree-word count of the collocation pair;
8-5) finally forming the <commodity attribute, negation word, degree word, emotion word> evaluation matching pair.
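Steps 8-3) to 8-5) reduce to a parity test and a count. The sketch below assumes the words lying between the evaluation object and the evaluation word are already available as a flat list (the patent walks the dependency syntax tree instead); the word lists and the name `match_pair` are illustrative assumptions.

```python
NEGATION_WORDS = {"not", "no", "never"}          # illustrative stand-ins
DEGREE_WORDS = {"very", "extremely", "quite"}    # illustrative stand-ins


def match_pair(attribute, emotion_word, words_between):
    """Build <attribute, private, degree, emotion word> for one collocation."""
    negations = sum(1 for w in words_between if w in NEGATION_WORDS)
    # An odd number of negations flips the polarity; an even number cancels out.
    private = -1 if negations % 2 == 1 else 1
    degree = sum(1 for w in words_between if w in DEGREE_WORDS)
    return (attribute, private, degree, emotion_word)


print(match_pair("package", "good", ["is", "not", "very"]))
print(match_pair("package", "good", ["is", "not", "not"]))
```

The parity rule means a double negation such as "not not good" is treated as positive, which is why the count rather than mere presence of a negation word is tracked.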
As a further definition of the invention, step 9) specifically comprises:
For a commodity attribute a that appears n times, the polarity value a.Score is calculated over its n occurrences,
where a.Score is the emotion value of commodity attribute a, i indexes the occurrences of the commodity attribute, private is the negation value (-1 or +1) obtained for the i-th occurrence, and degree is the number of degree adverbs for the i-th occurrence. The emotion value is calculated for each commodity attribute, and the values of identical evaluation objects are accumulated;
all extracted evaluation objects are sorted into two categories, commendatory and derogatory, and the final results are ordered with bubble sort.
A visual interaction interface executes all of the steps above, displays the emotion values as a bar chart, and adds several user-friendly interactive functions, including loading, login, logout, password modification and display of the user's login status.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
The invention acquires and preprocesses comment data from commodity pages of an e-commerce website and constructs a basic emotion dictionary. Word vectors are trained on the preprocessed data set with the Word2Vec tool, a semantic similarity matrix is generated, a probability transition matrix is built from it, and the final emotion dictionary is generated with the LPA label propagation algorithm in combination with a seed word set. The commodity comment text is processed with core sentence rules to obtain comment text with redundancy removed. The redundancy-free text is preprocessed, a dependency tree is formed for the resulting word segmentation data set based on dependency relations and syntactic features, SBV, VOB, ATT, CMP and COO dependency relation pairs are generated, and <commodity attribute, negation word, degree word, emotion word> evaluation matching pairs are extracted. Positive and negative polarity values are then calculated for the commodity attributes in combination with the emotion dictionary, and the method is delivered through a visual interactive interface. Comment data can thus be analyzed accurately, in real time, automatically and conveniently.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
In this technical scheme, word vectors trained by a neural network model are combined with the LPA label propagation algorithm to construct an emotion dictionary suited to commodity comments; a commodity attribute identification and extraction algorithm is designed based on core sentence rules, dependency relations and syntactic features; and these are combined into a comment analysis system based on word vectors and syntactic features. From the analysis results, the satisfaction of users with each attribute of a commodity is obtained, the advantages and disadvantages of the commodity are summarized, and the analysis results are then visualized.
Referring to fig. 1, the invention implements a comment analysis method based on word vectors and syntactic features, and the implementation steps are as follows:
Step S101: acquiring comment data from commodity pages of the e-commerce website.
In a specific implementation, a comment data crawling algorithm is designed to acquire comment data for various commodities on an e-commerce website and generate the original comment data set.
Step S102: preprocessing the obtained target data set and constructing a candidate emotion word set.
In a specific implementation, illegal characters are removed from the original data set using a character matching algorithm. First, LTP is used for word segmentation and part-of-speech tagging, the words tagged "a" (adjective) are extracted, and de-duplication yields candidate emotion word set 1. Then NLPIR is used for word segmentation and part-of-speech tagging, the words tagged "a" (adjective) are extracted, and de-duplication yields candidate emotion word set 2. Candidate emotion word set 1 and candidate emotion word set 2 are merged and de-duplicated to form the final candidate emotion word set.
Step S103: extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary.
In a specific implementation, the commendatory and derogatory words are extracted from the HowNet emotion dictionary and the NTU evaluation word dictionary respectively, and are merged to form the basic emotion dictionary.
Step S104: training word vectors on the preprocessed data set with the Word2Vec tool to obtain word vectors, and generating a semantic similarity matrix.
In a specific implementation, Word2Vec is trained on the data set with the training parameters size=100, window=5, sg=0 and min_count=0 to obtain the word vector of each word.
Using the candidate emotion word set, the semantic similarity between words is calculated with the following formula.
For two n-dimensional word vectors a(x_{11}, x_{12}, …, x_{1n}) and b(x_{21}, x_{22}, …, x_{2n}), the semantic similarity is calculated as:
$$SIM(a, b) = \frac{\sum_{k=1}^{n} x_{1k}\, x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\ \sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$
where SIM(a, b) is the semantic similarity value; x_{1k} is the k-th dimension value of word vector a; x_{2k} is the k-th dimension value of word vector b.
All emotion words in the candidate emotion word set are traversed in turn: one emotion word is fixed and its similarity to every other emotion word is calculated. With m candidate emotion words, an m×m semantic similarity matrix is obtained through m×m calculations.
In order to facilitate the following operation, it is prescribed that the similarity between identical emotion words is 0.
And constructing a semantic similarity matrix according to the calculated semantic similarity.
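The matrix construction can be sketched in plain Python, assuming the similarity measure is the cosine of the two word vectors (the usual choice for Word2Vec embeddings). The toy vectors and the function names are assumptions; the zero diagonal follows the rule that the similarity between identical emotion words is set to 0.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Return (words, m x m matrix) with the diagonal fixed at 0."""
    words = list(vectors)
    m = len(words)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:  # identical words keep similarity 0 by convention
                sim[i][j] = cosine_similarity(vectors[words[i]], vectors[words[j]])
    return words, sim


# Toy 2-dimensional "word vectors" (hypothetical values).
toy_vectors = {"good": [1.0, 0.0], "great": [2.0, 0.0], "bad": [0.0, 1.0]}
words, sim = similarity_matrix(toy_vectors)
print(words)
print(sim)
```

Zeroing the diagonal keeps a word from propagating a label to itself in the later label propagation step.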
Step S105: building a probability transition matrix from the semantic similarity matrix, combining it with a seed word set, applying the LPA label propagation algorithm, and generating the final emotion dictionary after checking the result against the basic emotion dictionary.
In particular implementations, each word is considered as a node of the graph, and the weights of edges between two nodes are represented by semantic similarity between the words they represent.
The probability transition matrix P is established according to the following formula:
$$P[i][j] = \frac{SIM(w_i, w_j)}{\sum_{k=1}^{m} SIM(w_i, w_k)}$$
where P[i][j] is the similarity transition probability from word i to word j, SIM(w_i, w_j) is the similarity of words i and j, and m is the manually set number of words with the highest semantic similarity to word i, over which the sum runs.
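One way to read the definition of P is as a row normalization over the m most similar words. The sketch below follows that reading and is an assumption rather than the patent's exact formula: entries outside the top m are set to zero, similarities are taken to be non-negative, and the name `transition_matrix` is hypothetical.

```python
def transition_matrix(sim, m):
    """Row-normalize each word's m largest similarities into probabilities."""
    n = len(sim)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the m words most similar to word i (excluding i itself).
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )[:m]
        total = sum(sim[i][j] for j in neighbours)
        if total > 0:
            for j in neighbours:
                P[i][j] = sim[i][j] / total
    return P


toy_sim = [
    [0.0, 0.8, 0.2, 0.1],
    [0.8, 0.0, 0.4, 0.0],
    [0.2, 0.4, 0.0, 0.6],
    [0.1, 0.0, 0.6, 0.0],
]
P = transition_matrix(toy_sim, m=2)
for row in P:
    print(row)
```

Restricting each row to its top m neighbours keeps weakly related words from diluting the propagated labels.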
The word frequency, in the original comment data, of every emotion word in the candidate emotion word set is counted, and the 100 words with the highest frequency are selected to form seed word set 1. The emotion vocabulary ontology library of Dalian University of Technology is used to select the words in the candidate emotion word set whose ontology intensity is greater than 7, forming seed word set 2. Seed word set 1 and seed word set 2 are merged and de-duplicated to form the seed word set, which is then labeled with emotion manually.
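The seed selection can be sketched as follows, assuming the comments are already tokenized and the ontology intensities are available as a word-to-number mapping. The function name, the argument names and the toy data are assumptions; the defaults mirror the values 100 and 7 given above.

```python
from collections import Counter


def select_seed_words(comment_tokens, candidates, intensity, top_n=100, min_intensity=7):
    """Union of the top_n most frequent candidates and the high-intensity candidates."""
    freq = Counter(w for w in comment_tokens if w in candidates)
    seed_1 = {w for w, _ in freq.most_common(top_n)}                          # seed word set 1
    seed_2 = {w for w in candidates if intensity.get(w, 0) > min_intensity}   # seed word set 2
    return seed_1 | seed_2


tokens = ["good", "good", "bad", "fine", "good", "bad"]
candidate_set = {"good", "bad", "fine", "awful"}
ontology_intensity = {"awful": 9, "fine": 5}   # hypothetical intensities
seeds = select_seed_words(tokens, candidate_set, ontology_intensity, top_n=2)
print(sorted(seeds))
```

The resulting set is what would then be labeled by hand as commendatory, derogatory or neutral.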
Then an L×C label matrix Y_L is built from the small number of manually labeled seed words, where L is the number of seed words and C is the number of classes, typically 3 (commendatory, derogatory, neutral). At the same time, a U×C label matrix Y_U is built from the unlabeled sample words, where U is the number of unlabeled sample words and C is the number of classes, typically 3 (commendatory, derogatory, neutral). The two label matrices are combined into an N×C soft label matrix F = [Y_L; Y_U].
The label propagation algorithm is then executed as follows: 1) propagate: F = PF; 2) reset the labels of the labeled samples in F: F_L = Y_L; 3) repeat steps 1) and 2) until F converges.
The purpose of step 1) is to pass the label (emotion attribute) of each node (emotion word) to the other nodes with the probabilities given by the probability transition matrix: the more similar two nodes are, the larger the propagation probability. The purpose of step 2) is to reset the labels of the labeled seed words to their labeled values, undoing any change introduced by step 1). Convergence in step 3) is judged by computing the similarity between the latest F and F_0, the result of the previous iteration; F is considered to have converged once this similarity no longer changes.
Finally, the three values in each row of the matrix F are the attribute propagation values of the corresponding emotion word; the largest value is selected, and its class is taken as the attribute of that emotion word.
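The propagate, reset and repeat loop can be sketched in plain Python. Several choices here are assumptions: unlabeled rows start at zero, convergence is tested with the largest absolute change between iterations rather than a similarity measure, classes are indexed 0 (commendatory), 1 (derogatory) and 2 (neutral), and the function name is hypothetical.

```python
def label_propagation(P, seed_labels, n_classes=3, tol=1e-6, max_iter=1000):
    """Return the class index with the largest propagated value for every word.

    P is an n x n transition matrix; seed_labels maps word index -> class index.
    """
    n = len(P)
    F = [[0.0] * n_classes for _ in range(n)]

    def clamp(matrix):
        # Step 2: reset the rows of the labeled seed words to their one-hot labels.
        for i, c in seed_labels.items():
            matrix[i] = [1.0 if k == c else 0.0 for k in range(n_classes)]

    clamp(F)
    for _ in range(max_iter):
        # Step 1: propagate, F = P F.
        new_F = [
            [sum(P[i][j] * F[j][k] for j in range(n)) for k in range(n_classes)]
            for i in range(n)
        ]
        clamp(new_F)
        delta = max(abs(new_F[i][k] - F[i][k]) for i in range(n) for k in range(n_classes))
        F = new_F
        if delta < tol:  # Step 3: stop once F no longer changes.
            break
    return [max(range(n_classes), key=lambda k: F[i][k]) for i in range(n)]


# Words 0 and 1 are seeds; word 2 is close to word 0, word 3 is close to word 1.
toy_P = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]
labels = label_propagation(toy_P, seed_labels={0: 0, 1: 1})
print(labels)
```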
The emotion words with confirmed attributes are exported to form emotion dictionary 1. All emotion words in emotion dictionary 1 are then traversed: if a word is also contained in the basic emotion dictionary of step S103 and its attribute contradicts the one recorded there, the attribute is changed to that of the basic emotion dictionary; otherwise the attribute is left unchanged.
After the above steps are finished, the modified emotion dictionary 1 is the final emotion dictionary.
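The correction against the basic emotion dictionary amounts to a guarded overwrite. The sketch below assumes both dictionaries map a word to a polarity label; the label strings and the function name are illustrative.

```python
def finalize_dictionary(propagated, basic):
    """Override propagated labels that contradict the basic emotion dictionary."""
    final = dict(propagated)
    for word, label in propagated.items():
        # Only words already in dictionary 1 are checked; nothing new is added.
        if word in basic and basic[word] != label:
            final[word] = basic[word]
    return final


dictionary_1 = {"cheap": "positive", "slow": "positive", "sturdy": "positive"}
basic_dictionary = {"slow": "negative", "good": "positive"}
final_dictionary = finalize_dictionary(dictionary_1, basic_dictionary)
print(final_dictionary)
```

Treating the curated dictionary as authoritative limits the damage when propagation mislabels a common word.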
Step S106: processing the obtained commodity comment text with core sentence rules to obtain comment text with redundancy removed.
In a specific implementation, the commodity URL is entered on the interactive web interface of the system; a web crawler in the back end crawls that commodity's comment data from the e-commerce platform, and the system is set to crawl the first 1000 high-quality comments for the commodity.
Redundancy is removed from the obtained commodity comment data according to the core sentence rules, and the trunk components relevant to evaluation collocations are retained. For example, the comment "The phone has arrived, the pixels and sound quality are both good, especially the express delivery is great (arrived the next day), the only disadvantage is that the packaging is not very good, hope the store can improve. . ." is processed as follows:
(1) Rule 1 matches on "… disadvantage"; after processing the sentence becomes "The phone has arrived, the pixels and sound quality are both good, especially the express delivery is great (arrived the next day), the packaging is not very good, hope the store can improve. . .";
(2) Rule 2 matches on "hope"; after processing the sentence becomes "The phone has arrived, the pixels and sound quality are both good, especially the express delivery is great (arrived the next day), the packaging is not very good, the store can improve. . .";
(3) Rule 3 matches on "especially"; after processing the sentence becomes "The phone has arrived, the pixels and sound quality are both good, the express delivery is great (arrived the next day), the packaging is not very good, the store can improve. . .";
(4) Rule 5 matches and deletes the consecutive punctuation marks and the brackets; the final core sentence is "The phone has arrived, the pixels and sound quality are both good, the express delivery is great, arrived the next day, the packaging is not very good, the store can improve." This result is referred to below as the example sentence.
Step S107: preprocessing the redundancy-free text, forming a dependency tree for the resulting word segmentation data set based on dependency relations and syntactic features, and generating SBV, VOB, ATT, CMP and COO dependency relation pairs.
In a specific implementation, the redundancy-free text obtained in step S106 is preprocessed and split into clauses at punctuation marks, giving 6 clauses. Each clause is segmented and part-of-speech tagged with the LTP tool, and a dependency tree is formed based on dependency relations and syntactic features. The dependency relations obtained are SBV <phone, arrived>, SBV <pixels, good>, COO <sound quality, pixels>, SBV <express delivery, great>, SBV <packaging, good>, SBV <store, improve>.
For example, when the clause "the pixels and sound quality are both good" is processed by the above steps and the dependency relation pairs are extracted again with the COO algorithm for identifying coordinated evaluation objects and evaluation words, the resulting pairs are <pixels, good> and <sound quality, good>.
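The COO expansion can be sketched as a small lookup over coordination relations. The sketch assumes the SBV/VOB/ATT/CMP pairs and the COO relations have already been read off the parser output as word tuples; it does not traverse a dependency tree, it expands only one level of coordination, and the function name is hypothetical.

```python
def expand_coordinates(pairs, coo_relations):
    """Copy each <object, evaluation> pair onto words coordinated with either side."""
    partners = {}
    for a, b in coo_relations:
        partners.setdefault(a, []).append(b)
        partners.setdefault(b, []).append(a)

    expanded = []
    for obj, evaluation in pairs:
        objects = [obj] + partners.get(obj, [])
        evaluations = [evaluation] + partners.get(evaluation, [])
        for o in objects:
            for e in evaluations:
                if (o, e) not in expanded:   # keep first-seen order, no duplicates
                    expanded.append((o, e))
    return expanded


pairs = [("pixels", "good")]
coo = [("sound quality", "pixels")]
expanded_pairs = expand_coordinates(pairs, coo)
print(expanded_pairs)
```

Without this step, "sound quality" would be lost as an evaluation object because the parser attaches it to "pixels" rather than directly to "good".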
Step S108: extracting <commodity attribute, negation word, degree word, emotion word> evaluation matching pairs from the dependency relation pairs according to part of speech.
In a specific implementation, each extracted relation pair is traversed to check whether negation words occur between the evaluation object and the evaluation word, and their number is counted. The parity of that number gives the sign of the negation value: if the number of negation words is odd, the corresponding private is assigned -1; if it is even, private is assigned +1. The pair is then traversed to check whether degree words occur between the evaluation object and the evaluation word, and their number is counted. Finally, the <commodity attribute, private, emotion> evaluation matching pair is formed. In the example sentence of step S106, the negation word "not" is identified within the relation pair <package, good>, so the corresponding private value is -1; traversing the degree adverbs between "package" and "good" identifies "very", so the corresponding degree value is 1. The evaluation matching pair extracted from this clause is <package, -1, good>.
Step S109: combining the obtained evaluation collocation pairs with the emotion dictionary, calculating commendatory/derogatory values for the evaluation objects and ranking them from best to worst, and finally implementing the method through a visual interaction interface.
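In a specific implementation, the commendatory or derogatory polarity of the emotion word in each extracted evaluation collocation pair is obtained from the emotion dictionary. The commendatory/derogatory value of each commodity attribute is then calculated according to the following formula:

$$a.\mathrm{score} = x_i.\mathrm{privative} \times \left(x_i.\mathrm{score} + x_i.\mathrm{degree} \times x_i.\mathrm{score} \times 0.5\right), \quad 0 < i \le n$$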
For the evaluation collocation pair <package, -1, good> obtained in step S107, the commodity attribute is "package", and the commendatory/derogatory value calculation is applied to it to obtain its emotion value.
All comment data of the commodity are traversed and processed with the above steps, and the values of identical evaluation objects are accumulated. All commodity attributes of the commodity are thereby extracted, divided into two classes (commendatory and derogatory), and ordered with bubble sort to give the final result. Finally, the method is implemented on a web page through a front end, a back end, and the visual interactive interface.
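The scoring, accumulation, and ordering described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the emotion dictionary is assumed to map each emotion word to +1 (commendatory) or -1 (derogatory), attributes whose total is zero are placed in the commendatory class, and the names `occurrence_score`, `bubble_sort_desc`, and `rank_attributes` are hypothetical.

```python
from collections import defaultdict

def occurrence_score(privative, degree, emotion_score):
    """Score of one occurrence: privative * (score + degree * score * 0.5)."""
    return privative * (emotion_score + degree * emotion_score * 0.5)

def bubble_sort_desc(items):
    """Bubble sort (attribute, score) tuples by score, highest first."""
    items = list(items)
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if items[j][1] < items[j + 1][1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # already ordered, stop early
            break
    return items

def rank_attributes(pairs, emotion_dict):
    """Accumulate scores per attribute, then split into two sorted classes."""
    totals = defaultdict(float)
    for attribute, privative, degree, word in pairs:
        # Words missing from the (assumed) dictionary contribute nothing.
        totals[attribute] += occurrence_score(privative, degree, emotion_dict.get(word, 0.0))
    commendatory = bubble_sort_desc((a, s) for a, s in totals.items() if s >= 0)
    derogatory = bubble_sort_desc((a, s) for a, s in totals.items() if s < 0)
    return commendatory, derogatory

pairs = [("package", -1, 1, "good"), ("pixel", 1, 0, "good"), ("sound quality", 1, 1, "good")]
print(rank_attributes(pairs, {"good": 1.0}))
```

Under the assumed dictionary value of +1 for "good", the pair with privative -1 and degree 1 contributes -1 × (1 + 1 × 1 × 0.5) = -1.5 to "package".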
The foregoing merely illustrates embodiments of the present invention, and the scope of the present invention is not limited thereto. Modifications and substitutions readily conceivable by any person skilled in the art fall within the scope of the present invention, which is defined by the appended claims.
Claims (8)
1. A comment analysis method based on word vectors and syntactic features, characterized by comprising the following steps:
1) Acquiring comment data from commodity pages of an e-commerce website;
2) Preprocessing the obtained target data set and constructing a candidate emotion word set;
3) Extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary;
4) Training word vectors on the preprocessed data set with the Word2Vec tool to obtain word vectors and generate a semantic similarity matrix, wherein step 4) specifically comprises:
4-1) training on the data set with Word2Vec to obtain the word vector of each word;
4-2) in combination with the candidate emotion word set, calculating the semantic similarity between words using the following formula:
4-3) for example, for two n-dimensional word vectors $a = (x_{11}, x_{12}, \ldots, x_{1n})$ and $b = (x_{21}, x_{22}, \ldots, x_{2n})$, the semantic similarity is calculated as:

$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k}\, x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\;\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

wherein $\cos\theta$ represents the semantic similarity value; $x_{1k}$ represents the k-th dimension value of word vector a; $x_{2k}$ represents the k-th dimension value of word vector b;
4-4) constructing a semantic similarity matrix according to the calculated semantic similarity;
5) Establishing a probability transition matrix from the semantic similarity matrix and, in combination with a seed word set, generating the final emotion dictionary through the LPA label propagation algorithm and a check against the basic emotion dictionary, wherein step 5) specifically comprises:
5-1) regarding each word as a node of a graph, wherein the weight of the edge between two nodes is the semantic similarity between the words they represent;
5-2) establishing a probability transition matrix P according to the following formula:
wherein $P[i][j]$ represents the probability of transition from word i to word j, $\mathrm{SIM}(w_i, w_j)$ represents the similarity between words i and j, and m represents the number of words with the highest semantic similarity to word i;
5-3) counting the word frequency, in the original comment data, of every emotion word in the candidate emotion word set, and selecting the N words with the highest frequency to form seed word set 1; using the emotion vocabulary ontology library to select the words in the candidate emotion word set whose emotion vocabulary ontology intensity is greater than m to form seed word set 2; merging seed word set 1 and seed word set 2, removing duplicates to form the seed word set, and labeling its emotions manually;
5-4) building an L×C label matrix $Y_L$ from the small number of manually labeled seed words, wherein L represents the number of seed words and C represents the number of classes, there being 3 classes: commendatory, derogatory, and neutral;
5-5) likewise building a U×C label matrix $Y_U$ from the unlabeled sample words, wherein U represents the number of unlabeled sample words and C represents the number of classes, there being 3 classes: commendatory, derogatory, and neutral;
5-6) finally, labeling the polarity of the sample words with the LPA label propagation algorithm and, after a check against the basic emotion dictionary, forming the final emotion dictionary;
6) Processing the obtained commodity comment text with the core sentence rules to obtain comment text with redundancy removed;
7) Preprocessing the redundancy-free text, building a dependency tree for the segmented data set based on the dependency relations and syntactic features, and generating SBV, VOB, ATT, CMP, and COO dependency pairs;
8) Extracting <commodity attribute, negation word, degree word, emotion word> evaluation collocation pairs from the obtained dependency pairs according to part of speech;
9) Combining the obtained evaluation collocation pairs with the emotion dictionary, calculating commendatory/derogatory values for the evaluation objects and ranking them from best to worst, and finally implementing the method through a visual interaction interface.
2. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 2) specifically includes:
2-1) removing illegal characters using a character matching algorithm;
2-2) performing word segmentation and part-of-speech tagging on the original data set using LTP;
2-3) extracting the words with qualifying parts of speech and removing duplicates to form candidate emotion word set 1;
2-4) performing word segmentation and part-of-speech tagging on the original data set using NLPIR;
2-5) extracting the words with qualifying parts of speech and removing duplicates to form candidate emotion word set 2;
2-6) merging candidate emotion word set 1 and candidate emotion word set 2 and removing duplicates to obtain the candidate emotion word set.
3. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 3) specifically includes: extracting the commendatory and derogatory words from the HowNet emotion dictionary and the NTU evaluation word dictionary respectively, merging them, and removing duplicates to form the basic emotion dictionary.
4. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 6) specifically includes:
core sentence processing mainly means deleting redundancy while retaining the main components relevant to the evaluation collocation; if the original sentence matches none of the rules, it is kept unchanged; the method uses core sentences to improve the accuracy of syntactic dependency analysis of the evaluation text, and the rules comprise:
rule 1: deleting sentence-initial components such as "… advantage", "… disadvantage", "… deficiency", "… advantage", "… benefit";
rule 2: deleting sentences with a hypothetical tendency, such as "if …", "hope …", "if …", "wish …", "suggestion …";
rule 3: deleting the sequences "exactly", "naturally", "particularly", "still further", "particularly";
rule 4: deleting opinion-claiming words such as "feel" and "consider";
rule 5: deleting consecutive punctuation marks except the first one, as well as abnormal characters such as emoticons, kaomoji, and brackets.
5. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 7) specifically includes:
five axioms of dependency syntax:
(1) A sentence has one and only one independent component;
(2) Every other component in a sentence must depend on some component;
(3) Any component in a sentence cannot depend on two or more components at the same time;
(4) If component a depends directly on component b and component c is located between components a and b in the sentence, then component c depends on a or b or other components between a and b;
(5) The components on the left and right sides of the central component have no dependency relationship with each other;
the dependency tree is characterized by:
(1) The nodes of the tree are the individual components of the sentence;
(2) The root node of the tree is the center component of the whole sentence;
(3) Edges formed among nodes in the tree have directionality, reflecting asymmetric dependency relationships among components;
(4) Five axioms of the dependency syntax are satisfied;
most dependency relations in comment sentences fall into five types: the subject-verb relation (SBV), the verb-object relation (VOB), the attribute relation (ATT), the complement relation (CMP), and the parallel relation (COO); dependency syntax analysis can be carried out with the LTP dependency parser, and dependency pairs are extracted in combination with a parallel relation algorithm for identifying parallel evaluation objects and parallel evaluation words; the parallel relation algorithm specifically comprises the following steps:
traversing, in the dependency syntax tree, all words between the two nodes of a dependency pair and the words related to the two nodes on their left and right, wherein the dependency pair is a subject-verb, verb-object, attribute, or complement relation obtained from the dependency relations and syntactic features;
judging whether any of the traversed words stands in a parallel relation;
and expanding the pairs with the parallel evaluation objects and evaluation words found through the parallel relation.
6. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 8) specifically includes:
8-1) according to the characteristics of Chinese language, most evaluation objects are nouns or verbs, and most evaluation words are adjectives or verbs;
8-2) extracting an evaluation object and an evaluation word, namely commodity attributes and emotion words according to the part of speech;
8-3) traversing, according to the dependency syntax tree, the words between the obtained evaluation object and evaluation word to check for negation words, incrementing the negation word count by 1 for each one found, and judging the parity of the count when the traversal ends; if the count is odd, the corresponding privative is assigned -1, and if it is even, the privative is assigned +1;
8-4) traversing, according to the dependency syntax tree, the words between the obtained evaluation object and evaluation word to check for degree words, and accumulating their number to obtain the degree word count of the collocation pair;
8-5) finally forming the <commodity attribute, negation word, degree word, emotion word> evaluation collocation pair.
7. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 9) specifically includes:
for a commodity attribute a that appears n times, the commendatory/derogatory value is calculated as:

$$a.\mathrm{score} = x_i.\mathrm{privative} \times \left(x_i.\mathrm{score} + x_i.\mathrm{degree} \times x_i.\mathrm{score} \times 0.5\right), \quad 0 < i \le n$$

where a.score is the emotion value of commodity attribute a, $x_i$ is the i-th occurrence of the commodity attribute, privative is the negation value (-1 or +1) corresponding to the i-th occurrence, and degree is the number of degree adverbs corresponding to the i-th occurrence; the emotion value of each commodity attribute is calculated, and the values of identical evaluation objects are accumulated;
all extracted evaluation objects are divided into two categories, commendatory and derogatory, and the final results are ordered using bubble sort.
8. A visual interactive interface, characterized in that the comment analysis method of claims 1 to 7 can be performed.
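The dictionary-construction steps of claim 1 (cosine similarity, probability transition matrix, label propagation from seed words) can be illustrated with a small pure-Python sketch. Several points are assumptions rather than statements of the claimed method: the transition probability is taken to be the similarity normalized over the m most similar neighbours of each word, since the formula itself is not reproduced in the text; seed labels are clamped after every iteration; and class indices 0, 1, 2 stand for commendatory, derogatory, and neutral. All function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def transition_matrix(vectors, m):
    """Row-normalize similarities over each word's m most similar neighbours (assumed form)."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        top = sorted(range(n), key=lambda j: sim[i][j], reverse=True)[:m]
        top = [j for j in top if sim[i][j] > 0]  # ignore non-positive similarities
        total = sum(sim[i][j] for j in top)
        for j in top:
            P[i][j] = sim[i][j] / total
    return P

def propagate(P, seed_labels, n_classes=3, iterations=50):
    """Spread seed labels through P; seed_labels maps word index -> class index."""
    n = len(P)
    Y = [[0.0] * n_classes for _ in range(n)]
    for i, c in seed_labels.items():
        Y[i][c] = 1.0
    for _ in range(iterations):
        new = [[sum(P[i][j] * Y[j][c] for j in range(n)) for c in range(n_classes)]
               for i in range(n)]
        for i, c in seed_labels.items():  # clamp the manually labeled seeds
            new[i] = [1.0 if k == c else 0.0 for k in range(n_classes)]
        Y = new
    return [max(range(n_classes), key=lambda c: Y[i][c]) for i in range(n)]

# Toy 2-dimensional "word vectors": two seeds followed by two unlabeled words.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
P = transition_matrix(vectors, m=2)
print(propagate(P, {0: 0, 1: 1}))
```

A fixed iteration count keeps the sketch short; a fuller version would iterate until the label matrix stops changing and would then check the result against the basic emotion dictionary, as step 5-6) describes.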
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910343337.5A CN110175325B (en) | 2019-04-26 | 2019-04-26 | Comment analysis method based on word vector and syntactic characteristics and visual interaction interface |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110175325A | 2019-08-27 |
CN110175325B | 2023-07-11 |
Family
ID=67690209
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910343337.5A (CN110175325B, Active) | 2019-04-26 | 2019-04-26 | Comment analysis method based on word vector and syntactic characteristics and visual interaction interface |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175325B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133282A (en) * | 2017-04-17 | 2017-09-05 | 华南理工大学 | A kind of improved evaluation object recognition methods based on two-way propagation |
CN107247702A (en) * | 2017-05-05 | 2017-10-13 | 桂林电子科技大学 | A kind of text emotion analysis and processing method and system |
Non-Patent Citations (2)
Title |
---|
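Analysis of commodity review tendency based on a sentiment dictionary expanded with word2vec (基于word2vec扩充情感词典的商品评论倾向分析); Lu Feng (陆峰); Computer Knowledge and Technology (《电脑知识与技术》); 2017-02-27; vol. 13, no. 05; pp. 143-145, 159 * |
Research on sentiment word recognition based on syntactic dependency rules and part-of-speech features (基于句法依赖规则和词性特征的情感词识别研究); Deng Shuqing et al. (邓淑卿 等); Information Studies: Theory & Application (《情报理论与实践》); 2018-12-31; vol. 41, no. 05; pp. 137-142 * |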
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||