CN110175325B - Comment analysis method based on word vector and syntactic characteristics and visual interaction interface - Google Patents

Publication number
CN110175325B
Authority
CN
China
Prior art keywords
word
words
emotion
evaluation
dependency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910343337.5A
Other languages
Chinese (zh)
Other versions
CN110175325A (en
Inventor
吕奇
沈楠楠
胡新春
陈可佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910343337.5A
Publication of CN110175325A
Application granted
Publication of CN110175325B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a comment analysis method based on word vectors and syntactic features, in the field of data analysis, comprising the following steps: acquiring comment data from commodity pages of an e-commerce website; preprocessing the acquired target data set; extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary; training word vectors on the preprocessed data set with the Word2Vec tool; establishing a probability transition matrix from the semantic similarity matrix; processing the acquired commodity comment text according to core sentence rules; preprocessing the resulting redundancy-free text; extracting < commodity attribute, negation word, degree word, emotion word > evaluation matching pairs from the obtained dependency relationship pairs by part of speech; and combining the evaluation matching pairs with the emotion dictionary to compute a positive or negative sentiment value for each evaluation object and to rank the objects from best to worst. The method is finally delivered through a visual interactive interface, which enables accurate, real-time, automatic, and convenient processing and analysis of commodity comment data and makes the method suitable for e-commerce platforms.

Description

Comment analysis method based on word vector and syntactic characteristics and visual interaction interface
Technical Field
The invention belongs to the technical field of data analysis, and particularly relates to an emotion dictionary suitable for commodity comments that is constructed from word vectors trained by a neural network model, to an attribute recognition algorithm, and to a comment analysis system based on word vectors and syntactic features.
Background
With the popularization of the Internet and the development of electronic commerce, e-commerce websites such as Jingdong (JD.com) and Taobao have developed rapidly, and more and more consumers have begun to shop online. These websites carry a massive range of commodities and serve a large user base, so they generate a huge volume of comment data. The comments left by consumers often carry the user's subjective feelings about a purchase, including preference for the purchased goods and satisfaction with the merchant's service. For consumers, these comment texts give a more objective picture of the relevant goods or services and so support a better-informed choice; for merchants, the experience that users feed back about goods or services helps them improve service or commodity quality in a targeted way and thereby win more customers and profit. However, as the volume of data grows explosively, the cost to a user of finding useful information in massive comment data grows with it. How to process and analyze users' comment text quickly and effectively and extract valuable information from it therefore has important application value and research significance.
Currently, large volumes of comment data are not fully utilized, and consumers find it difficult to obtain valuable information from them. A comment analysis system based on word vectors and syntactic features is therefore studied: from the analysis results it derives user satisfaction with each attribute of a commodity, summarizes the commodity's advantages and disadvantages, and then visualizes the results.
Disclosure of Invention
The technical problem to be solved by the invention is how to process and analyze commodity comment data accurately, in real time, automatically, and conveniently; to overcome the defects of the prior art, the invention provides a comment analysis method based on word vectors and syntactic features.
The invention provides a comment analysis method based on word vectors and syntactic features, which comprises the following steps:
1) Acquiring comment data of commodity pages of an e-commerce website;
2) Preprocessing the obtained target data set, and constructing a candidate emotion word set;
3) Extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary;
4) Training word vectors on the preprocessed data set with the Word2Vec tool, and generating a semantic similarity matrix from the word vectors;
5) Establishing a probability transition matrix from the semantic similarity matrix, combining it with a seed word set, running the label propagation algorithm (LPA), and generating the final emotion dictionary after checking the result against the basic emotion dictionary;
6) Processing the acquired commodity comment text according to the core sentence rules to obtain comment text with redundancy removed;
7) Preprocessing the redundancy-free text, forming a dependency tree from the resulting word-segmented data set based on dependency relations and syntactic features, and generating SBV, VOB, ATT, CMP, and COO dependency pairs;
8) Extracting < commodity attribute, negation word, degree word, emotion word > evaluation matching pairs from the obtained dependency pairs by part of speech;
9) Combining the evaluation matching pairs with the emotion dictionary, computing a positive or negative sentiment value for each evaluation object, ranking the objects from best to worst, and finally delivering the method through a visual interactive interface.
As a further definition of the invention, step 2) specifically comprises:
2-1) removing illegal characters using a character matching algorithm;
2-2) performing word segmentation and part-of-speech tagging on the original data set using LTP;
2-3) extracting the words with the required part of speech and removing duplicates to form candidate emotion word set 1;
2-4) performing word segmentation and part-of-speech tagging on the original data set using NLPIR;
2-5) extracting the words with the required part of speech and removing duplicates to form candidate emotion word set 2;
2-6) merging candidate emotion word set 1 and candidate emotion word set 2 and removing duplicates to obtain the candidate emotion word set.
As a further definition of the invention, step 3) specifically comprises: extracting the commendatory and derogatory words from the HowNet emotion dictionary and the NTU evaluation word dictionary respectively, merging them, and removing duplicates to form the basic emotion dictionary.
As a further definition of the invention, step 4) specifically comprises:
4-1) training on the data set with Word2Vec to obtain the word vectors of the words;
4-2) calculating the semantic similarity between the words of the candidate emotion word set by the following formula:
4-3) for example, for two n-dimensional word vectors a(x_11, x_12, …, x_1n) and b(x_21, x_22, …, x_2n), the semantic similarity calculation formula is as follows:
[Formula image not reproduced: semantic similarity of word vectors a and b]
where the result of the formula represents the semantic similarity value, x_1k represents the k-th dimension value of word vector a, and x_2k represents the k-th dimension value of word vector b;
4-4) constructing a semantic similarity matrix according to the calculated semantic similarity.
As a further definition of the invention, step 5) specifically comprises:
5-1) regarding each word as a node of a graph, wherein the weight of the edge between two nodes is the semantic similarity between the words that the nodes represent;
5-2) establishing a probability transition matrix P according to the following formula:
[Formula image not reproduced: definition of P[i][j]]
where P[i][j] represents the transition probability from word i to word j, SIM(w_i, w_j) represents the similarity of words i and j, and m represents the number of words with the highest semantic similarity to word i;
5-3) counting the word frequency, in the original comment data, of every emotion word in the candidate emotion word set and selecting the N words with the highest frequency to form seed word set 1; using the emotion vocabulary ontology library to select the words in the candidate emotion word set whose ontology intensity is > m to form seed word set 2; merging seed word set 1 and seed word set 2, removing duplicates to form the seed word set, and labeling its emotion manually;
5-4) establishing an L×C label matrix Y_L from the small number of manually labeled seed words, where L represents the number of seed words and C represents the number of classes, of which there are 3: commendatory, derogatory, and neutral;
5-5) likewise establishing a U×C label matrix Y_U from the unlabeled sample words, where U represents the number of unlabeled sample words and C represents the number of classes, of which there are 3: commendatory, derogatory, and neutral;
5-6) finally, labeling the emotion attribute of the sample words with the label propagation algorithm (LPA), and forming the final emotion dictionary after checking the result against the basic emotion dictionary.
As a further definition of the invention, step 6) specifically comprises:
the core sentence rules delete redundancy and retain the main components relevant to evaluation matching; if the original sentence matches none of the rules, it is kept unchanged. The method uses core sentences to improve the accuracy of syntactic dependency analysis of the evaluation text. The rules are as follows:
rule 1: deleting sentence-initial sequences such as "… advantage", "… disadvantage", "… deficiency", "… advantage", "… benefit";
rule 2: deleting sentences with a hypothetical tendency, such as "if …", "hope …", "if …", "wish …", "suggestion …";
rule 3: deleting the sequences "exactly", "naturally", "particularly", "still further", "particularly" where they occur in a sentence;
rule 4: deleting opinion-stating words such as "feel" and "consider";
rule 5: in a run of consecutive punctuation marks, deleting all but the first, and deleting abnormal characters such as emoticons, kaomoji, and brackets.
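The rule-based redundancy removal can be illustrated with a short sketch. It is only an illustration under stated assumptions: the English trigger phrases are stand-ins for the Chinese phrases the rules actually target, the regular expressions and the function name `to_core_sentence` are hypothetical, and rule 2 is not covered.

```python
import re

# English stand-ins for the Chinese trigger phrases (assumed for illustration).
RULE_1 = re.compile(
    r"\b(?:the\s+)?(?:only\s+)?(?:advantage|disadvantage|deficiency|benefit)s?"
    r"\s+(?:is|are)(?:\s+that)?\s+", re.IGNORECASE)
RULE_3 = re.compile(
    r"\b(?:exactly|naturally|particularly|still further)\b,?\s*", re.IGNORECASE)
RULE_4 = re.compile(r"\b(?:I\s+)?(?:feel|consider)(?:\s+that)?\s+", re.IGNORECASE)
BRACKETS = re.compile(r"[()\[\]（）]")
PUNCT_RUN = re.compile(r"([.,!?;:])(?:\s*[.,!?;:])+")

def to_core_sentence(text):
    """Apply rules 1, 3, 4 and 5; text that matches no rule is returned unchanged."""
    text = RULE_1.sub("", text)          # rule 1: "... advantage/disadvantage" lead-ins
    text = RULE_3.sub("", text)          # rule 3: filler adverbs
    text = RULE_4.sub("", text)          # rule 4: opinion-stating words
    text = BRACKETS.sub("", text)        # rule 5: bracket characters
    text = PUNCT_RUN.sub(r"\1", text)    # rule 5: keep the first mark of a run
    return re.sub(r"\s{2,}", " ", text).strip()
```

Applying the rules in a fixed order keeps the sketch simple; a sentence that triggers none of the patterns passes through untouched, which matches the "kept unchanged" behavior described above.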
As a further definition of the invention, step 7) specifically comprises:
five axioms of dependency syntax:
(1) A sentence has one and only one independent component;
(2) Every other component in the sentence must depend directly on some component;
(3) Any component in a sentence cannot depend on two or more components at the same time;
(4) If component a depends directly on component b and component c is located between components a and b in the sentence, then component c depends on a or b or other components between a and b;
(5) The components on the left and right sides of the central component have no dependency relationship with each other;
the dependency tree is characterized by:
(1) The nodes of the tree are the individual components of the sentence;
(2) The root node of the tree is the center component of the whole sentence;
(3) Edges formed among nodes in the tree have directionality, reflecting asymmetric dependency relationships among components;
(4) Five axioms of the dependency syntax are satisfied;
most dependency relations in comment sentences fall into five categories: the subject-verb relation (SBV), the verb-object relation (VOB/FOB), the attribute relation (ATT), the verb-complement relation (CMP), and the coordinate relation (COO). Dependency parsing can be performed with the LTP dependency parser, and dependency pairs are extracted in combination with a COO algorithm that identifies coordinate evaluation objects and coordinate evaluation words. The COO algorithm specifically comprises the following steps:
for each SBV, VOB, ATT, or CMP dependency pair obtained from the dependency relations and syntactic features, traversing all words between its two nodes and the related words to their left and right in the dependency syntax tree;
judging whether any of the traversed words stands in a COO relation;
and expanding the pair with the coordinate evaluation objects and evaluation words found through the COO relations.
As a further definition of the invention, step 8) specifically comprises:
8-1) given the characteristics of the Chinese language, most evaluation objects are nouns or verbs, and most evaluation words are adjectives or verbs;
8-2) extracting the evaluation objects and evaluation words, namely the commodity attributes and emotion words, according to part of speech;
8-3) traversing the dependency syntax tree to check whether negation words exist between the evaluation object and the evaluation word, and adding 1 to the negation-word count for each one found; when the traversal ends, judging the parity of the count: if it is odd, the corresponding negation value private is assigned -1, and if it is even, private is assigned +1;
8-4) traversing the dependency syntax tree to check whether degree words exist between the evaluation object and the evaluation word, and counting them to obtain the number of degree words of the matching pair;
8-5) finally forming the < commodity attribute, negation word, degree word, emotion word > evaluation matching pair.
As a further definition of the invention, step 9) specifically comprises:
for a commodity attribute a that appears n times, the sentiment value is calculated by the following formula:
[Formula image not reproduced: definition of a.score]
where a.score is the emotion value of commodity attribute a, i indexes the i-th occurrence of the commodity attribute, private is the negation value (-1 or +1) obtained for the i-th occurrence, and degree is the number of degree adverbs for the i-th occurrence; the emotion values of the commodity attributes are calculated, and the values for identical evaluation objects are accumulated;
all extracted evaluation objects are then divided into two categories, commendatory and derogatory, and the final results are ordered using bubble sort.
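The accumulation and ranking step can be sketched as follows. The sketch takes the per-occurrence scores as given, since the scoring formula itself is not reproduced in this text; the function names, the `(attribute, score)` input shape, and the decision to leave zero-valued attributes out of both categories are assumptions made for illustration.

```python
def accumulate(occurrence_scores):
    """Sum the per-occurrence scores of identical evaluation objects."""
    totals = {}
    for attribute, score in occurrence_scores:
        totals[attribute] = totals.get(attribute, 0) + score
    return totals

def bubble_sort(items, key, reverse=False):
    """Plain bubble sort with early exit; returns a new list."""
    items = list(items)
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            a, b = key(items[j]), key(items[j + 1])
            if (a < b) if reverse else (a > b):
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:
            break
    return items

def rank(occurrence_scores):
    """Split attributes into commendatory and derogatory lists, strongest first."""
    totals = accumulate(occurrence_scores)
    positive = [(a, s) for a, s in totals.items() if s > 0]
    negative = [(a, s) for a, s in totals.items() if s < 0]
    return (bubble_sort(positive, key=lambda p: p[1], reverse=True),
            bubble_sort(negative, key=lambda p: p[1]))
```

Python's built-in `sorted` would give the same ordering; bubble sort is used here only because the text names it.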
A visual interaction interface can execute all the steps of the claims and displays the emotion values clearly as a bar chart; it also adds several user-friendly interactive functions, including loading, logging in, logging out, modifying passwords, and showing the user's login status.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
the invention acquires and preprocesses comment data from commodity pages of an e-commerce website and constructs a basic emotion dictionary; trains word vectors on the preprocessed data set with the Word2Vec tool, generates a semantic similarity matrix, establishes a probability transition matrix from it, and, in combination with a seed word set, generates the final emotion dictionary through the label propagation algorithm (LPA); processes the commodity comment text according to the core sentence rules to obtain comment text with redundancy removed; preprocesses that text, forms a dependency tree from the word-segmented data set based on dependency relations and syntactic features, generates SBV, VOB, ATT, CMP, and COO dependency pairs, and extracts < commodity attribute, negation word, degree word, emotion word > evaluation matching pairs; and computes positive and negative values for the commodity attributes in combination with the emotion dictionary, with the whole method delivered through a visual interactive interface. Comment data can thus be analyzed accurately, in real time, automatically, and conveniently.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
the technical scheme uses word vectors trained by a neural network model, combined with the label propagation algorithm (LPA), to construct an emotion dictionary suitable for commodity comments; designs a commodity attribute recognition and extraction algorithm based on core sentence rules, dependency relations, and syntactic features; and, combining these, builds a comment analysis system based on word vectors and syntactic features that derives user satisfaction with each attribute of a commodity from the analysis results, summarizes the commodity's advantages and disadvantages, and then visualizes the analysis results.
Referring to FIG. 1, the comment analysis method based on word vectors and syntactic features is implemented in the following steps:
Step S101: Acquiring comment data from commodity pages of an e-commerce website.
In a specific implementation, a comment-data crawling algorithm is designed to acquire comment data for various commodities on an e-commerce website and generate the original comment data set.
Step S102: Preprocessing the acquired target data set and constructing a candidate emotion word set.
In a specific implementation, illegal characters are removed from the original data set using a character matching algorithm. LTP is first used for word segmentation and part-of-speech tagging; words tagged "a" (adjective) are extracted and de-duplicated to form candidate emotion word set 1. NLPIR is then used for word segmentation and part-of-speech tagging; words tagged "a" (adjective) are extracted and de-duplicated to form candidate emotion word set 2. Candidate emotion word sets 1 and 2 are merged and de-duplicated to form the final candidate emotion word set.
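The candidate-set construction of step S102 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the regular expression that defines "illegal characters", the function names, and the pre-tagged `(word, tag)` lists (standing in for LTP and NLPIR output) are all assumptions.

```python
import re

# Assumption: "illegal characters" are anything other than CJK characters,
# ASCII letters and digits, whitespace, and common punctuation.
_ILLEGAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s,.!?;:()，。！？；：（）]")

def remove_illegal_chars(text):
    """Strip characters that fall outside the allowed set."""
    return _ILLEGAL.sub("", text)

def candidate_words(tagged_tokens, wanted_tag="a"):
    """Collect the de-duplicated words whose part-of-speech tag is `wanted_tag`."""
    return {word for word, tag in tagged_tokens if tag == wanted_tag}

def merge_candidates(set_1, set_2):
    """Union the two candidate sets; set union removes duplicates."""
    return set_1 | set_2

# Hypothetical tagger outputs for the same comment (stand-ins for LTP and NLPIR).
ltp_tokens = [("手机", "n"), ("很", "d"), ("好", "a"), ("清晰", "a")]
nlpir_tokens = [("手机", "n"), ("好", "a"), ("给力", "a")]
candidates = merge_candidates(candidate_words(ltp_tokens),
                              candidate_words(nlpir_tokens))
```

Running two segmenters and taking the union trades some noise for recall: an adjective missed by one tool can still enter the candidate set through the other.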
Step S103: Extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary.
In a specific implementation, the commendatory and derogatory words are extracted from the HowNet emotion dictionary and the NTU evaluation word dictionary respectively and then merged to form the basic emotion dictionary.
Step S104: Training word vectors on the preprocessed data set with the Word2Vec tool, and generating a semantic similarity matrix from the word vectors.
In a specific implementation, Word2Vec is trained on the data set with the training parameters size=100, window=5, sg=0, and min_count=0, yielding the word vectors of the words.
The semantic similarity between the words of the candidate emotion word set is then calculated by the following formula.
For example, for two n-dimensional word vectors a(x_11, x_12, …, x_1n) and b(x_21, x_22, …, x_2n), the semantic similarity calculation formula is as follows:
[Formula image not reproduced: semantic similarity of word vectors a and b]
where the result of the formula represents the semantic similarity value, x_1k represents the k-th dimension value of word vector a, and x_2k represents the k-th dimension value of word vector b.
All emotion words in the candidate emotion word set are traversed in turn: each emotion word is fixed and its similarity to every other emotion word is calculated. With m candidate emotion words, m×m calculations yield an m×m semantic similarity matrix.
To simplify the subsequent operations, the similarity of an emotion word to itself is defined as 0.
The semantic similarity matrix is constructed from the calculated similarities.
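The construction of the similarity matrix can be sketched as follows. Because the formula image is not reproduced in this text, the sketch assumes cosine similarity, which is the measure commonly used with Word2Vec vectors; that choice and the function names are assumptions, not something the text confirms.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Build the m x m matrix; the diagonal stays 0 as stipulated above."""
    m = len(vectors)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                sim[i][j] = cosine_similarity(vectors[i], vectors[j])
    return sim
```

Keeping the diagonal at 0 means a word never counts itself among its own most similar neighbours when the transition matrix is built later.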
Step S105: Establishing a probability transition matrix from the semantic similarity matrix, combining it with a seed word set, running the label propagation algorithm (LPA), and generating the final emotion dictionary after checking the result against the basic emotion dictionary.
In a specific implementation, each word is regarded as a node of a graph, and the weight of the edge between two nodes is the semantic similarity between the words they represent.
The probability transition matrix P is established according to the following formula:
[Formula image not reproduced: definition of P[i][j]]
where P[i][j] represents the transition probability from word i to word j, SIM(w_i, w_j) represents the similarity of words i and j, and m represents the (manually set) number of words with the highest semantic similarity to word i.
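One way to build such a matrix is sketched below. The formula image is not reproduced in this text, so the sketch assumes the common row-normalized form: P[i][j] is SIM(w_i, w_j) divided by the sum of the similarities between word i and its m most similar words, and 0 for words outside that neighbourhood. Clipping negative similarities to 0 and the function name are further assumptions.

```python
def transition_matrix(sim, m):
    """Row-normalize each word's similarities over its m most similar words."""
    n = len(sim)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # The m words with the highest similarity to word i.
        top = sorted(others, key=lambda j: sim[i][j], reverse=True)[:m]
        # Assumption: negative similarities contribute nothing.
        total = sum(max(sim[i][j], 0.0) for j in top)
        if total > 0:
            for j in top:
                P[i][j] = max(sim[i][j], 0.0) / total
    return P
```

Under this assumption every row with at least one positive neighbour sums to 1, so each row can be read as a probability distribution over the words that word i propagates to.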
The word frequency, in the original comment data, of every emotion word in the candidate emotion word set is counted, and the 100 words with the highest frequency are selected to form seed word set 1. Using the emotion vocabulary ontology library of Dalian University of Technology, the words in the candidate emotion word set whose ontology intensity is greater than 7 are selected to form seed word set 2. Seed word sets 1 and 2 are merged and de-duplicated to form the seed word set, which is then labeled manually for emotion.
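The seed-word selection can be sketched as follows. The function name, the flat token list, and the `intensity` dictionary (standing in for a lookup in the emotion vocabulary ontology library) are hypothetical; the defaults of 100 words and intensity above 7 follow the text.

```python
from collections import Counter

def build_seed_words(comment_tokens, candidates, intensity,
                     top_n=100, min_intensity=7):
    """Union of the most frequent candidates and the high-intensity candidates."""
    # Seed word set 1: the top_n most frequent candidate emotion words.
    freq = Counter(t for t in comment_tokens if t in candidates)
    seed_1 = {word for word, _ in freq.most_common(top_n)}
    # Seed word set 2: candidates whose ontology intensity exceeds the threshold.
    seed_2 = {w for w in candidates if intensity.get(w, 0) > min_intensity}
    return seed_1 | seed_2
```

The returned set still needs manual emotion labels before it can serve as Y_L in the propagation step.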
Next, an L×C label matrix Y_L is established from the small number of manually labeled seed words, where L represents the number of seed words and C represents the number of classes, typically 3 (commendatory, derogatory, neutral). A U×C label matrix Y_U is likewise established from the unlabeled sample words, where U represents the number of unlabeled sample words and C is as above. Combining the two label matrices gives the N×C soft label matrix F = [Y_L; Y_U].
The label propagation algorithm is then executed as follows: 1) propagate: F = PF; 2) reset the labels of the labeled samples in F: F_L = Y_L; 3) repeat steps 1) and 2) until F converges.
The purpose of step 1) is to propagate the label (emotion attribute) of each node (emotion word) to the other nodes with the probabilities given by the probability transition matrix: the more similar two nodes are, the greater the propagation probability. The purpose of step 2) is to reset the labels of the labeled seed words to their labeled values, undoing any change introduced by step 1). In step 3), convergence is judged by computing the similarity between the latest F and F_0, the F from the previous iteration; F is considered to have converged when this similarity no longer changes.
Finally, the three values in each row of matrix F are the propagated attribute values of the corresponding emotion word; the largest value is selected, and its class is taken as the attribute of the emotion word.
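The propagation loop and the final class selection can be sketched as follows. Several details are assumptions made for illustration: the seed words occupy the first L rows of P and F, Y_U starts as all zeros, and convergence is tested by the largest absolute change between successive F matrices falling below a tolerance.

```python
def label_propagation(P, Y_L, num_unlabeled, num_classes=3,
                      tol=1e-6, max_iter=1000):
    """Iterate F <- P F with the seed rows clamped to Y_L until F stops changing."""
    L = len(Y_L)
    N = L + num_unlabeled
    # F = [Y_L; Y_U]; assumption: unlabeled rows start at zero.
    F = [row[:] for row in Y_L] + [[0.0] * num_classes for _ in range(num_unlabeled)]
    for _ in range(max_iter):
        # 1) propagate: F <- P F
        new_F = [[sum(P[i][k] * F[k][c] for k in range(N))
                  for c in range(num_classes)] for i in range(N)]
        # 2) reset the labeled rows: F_L <- Y_L
        for i in range(L):
            new_F[i] = Y_L[i][:]
        # 3) stop once the largest entry-wise change is below the tolerance
        delta = max(abs(new_F[i][c] - F[i][c])
                    for i in range(N) for c in range(num_classes))
        F = new_F
        if delta < tol:
            break
    return F

def predicted_classes(F):
    """Index of the largest value in each row (e.g. 0, 1, 2 for the three classes)."""
    return [row.index(max(row)) for row in F]
```

Which column corresponds to commendatory, derogatory, or neutral is fixed by how Y_L is laid out; the sketch only returns the column index.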
The emotion words with determined attributes are exported to form emotion dictionary 1. All emotion words in emotion dictionary 1 are then traversed: if a word also appears in the basic emotion dictionary from step S103 and its attribute contradicts the one given there, the attribute is changed to the one in the basic emotion dictionary, which takes precedence; otherwise the attribute is left unchanged.
After these steps, the modified emotion dictionary 1 is the final emotion dictionary.
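The check against the basic emotion dictionary amounts to a simple override, sketched below. Representing both dictionaries as word-to-label mappings and the function name `reconcile` are assumptions.

```python
def reconcile(propagated, base):
    """Keep each propagated label unless the basic dictionary disagrees."""
    # base.get(word, label) returns the basic dictionary's label when the word
    # is present there, and the propagated label otherwise.
    return {word: base.get(word, label) for word, label in propagated.items()}
```

Words that appear only in the basic dictionary are not added by this sketch; it only corrects words that label propagation has already classified.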
Step S106: Processing the acquired commodity comment text according to the core sentence rules to obtain comment text with redundancy removed.
In a specific implementation, a commodity URL is entered on the interactive web interface of the system, and the comment data for that commodity on the e-commerce platform is crawled by a web crawler running in the background; the system is configured to crawl the first 1000 high-quality comments for the commodity.
The acquired commodity comment data is then stripped of redundancy according to the core sentence rules, retaining the main components relevant to evaluation matching. Take, for example: "The mobile phone has been received, the pixels and sound quality are both good, and in particular the express delivery is very powerful (arrived the next day); the only disadvantage is that the package is not very good, and I hope the store can improve. . ." It is processed as follows:
(1) Rule 1 matches "… disadvantage" in the example sentence, which becomes: "The mobile phone has been received, the pixels and sound quality are both good, and in particular the express delivery is very powerful (arrived the next day), the package is not very good, and I hope the store can improve. . .";
(2) Rule 2 matches "hope", and the sentence becomes: "The mobile phone has been received, the pixels and sound quality are both good, and in particular the express delivery is very powerful (arrived the next day), the package is not very good, and the store can improve. . .";
(3) Rule 3 matches "in particular", and the sentence becomes: "The mobile phone has been received, the pixels and sound quality are both good, the express delivery is very powerful (arrived the next day), the package is not very good, and the store can improve. . .";
(4) Rule 5 deletes the consecutive punctuation marks, and the final core sentence is: "The mobile phone has been received, the pixels and sound quality are both good, the express delivery is very powerful, arrived the next day, the package is not very good, and the store can improve." This example is referred to as example sentence sendees.
Step S107: preprocess the redundancy-removed text, build a dependency tree from the resulting word-segmentation data set based on dependency relations and syntactic features, and generate SBV, VOB, ATT, CMP, and COO dependency pairs.
In a specific implementation, the redundancy-removed text obtained in step S106 is preprocessed and split at punctuation marks into 6 clauses. Each clause is segmented and part-of-speech tagged with the LTP tool, and a dependency tree is built based on dependency relations and syntactic features. The dependency relations obtained are SBV <mobile phone, received>, SBV <pixel, good>, COO <sound quality, pixel>, SBV <express delivery, powerful>, SBV <package, good>, and SBV <store, improve>.
For example, when the clause "the pixels and the sound quality are both good" is processed by the above steps and the dependency pairs are extracted again using the COO algorithm for identifying parallel evaluation objects and parallel evaluation words, the resulting dependency pairs are <pixel, good> and <sound quality, good>.
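The COO expansion in this example can be sketched as follows. This is a minimal illustration, not the patented algorithm itself: the function name `expand_coo` and the plain tuple representation of dependency pairs are assumptions, and only directly coordinated words are expanded (no transitive chains).

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (evaluation object, evaluation word)


def expand_coo(sbv_pairs: List[Pair], coo_pairs: List[Pair]) -> List[Pair]:
    """Copy each <object, word> pair onto words linked by a COO relation."""
    # Treat COO links as undirected: "sound quality" COO "pixel" works both ways.
    coordinated: Dict[str, Set[str]] = {}
    for a, b in coo_pairs:
        coordinated.setdefault(a, set()).add(b)
        coordinated.setdefault(b, set()).add(a)

    expanded: List[Pair] = []
    for obj, word in sbv_pairs:
        # The original member first, then its coordinated partners.
        objs = [obj] + sorted(coordinated.get(obj, ()))
        words = [word] + sorted(coordinated.get(word, ()))
        for o in objs:
            for w in words:
                if (o, w) not in expanded:
                    expanded.append((o, w))
    return expanded


pairs = expand_coo([("pixel", "good")], [("sound quality", "pixel")])
print(pairs)
```

With the pairs from the example sentence as input, the sketch yields one pair per coordinated evaluation object, each sharing the evaluation word "good".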
Step S108: extract <commodity attribute, negation word, degree word, emotion word> evaluation match pairs from the obtained dependency pairs according to part of speech.
In a specific implementation, for each extracted relation pair, the words between the evaluation object and the evaluation word are traversed for negation words, and the number of negation words is counted. The parity of this count determines the sign of the negation value: if the number of negation words is odd, the corresponding privative is assigned -1; if it is even, the corresponding privative is assigned +1. The words between the evaluation object and the evaluation word are then traversed for degree words, and the number of degree words is counted. Finally, the <commodity attribute, privative, emotion word> evaluation match pair is formed. In the example sentence sendees of step S106, one negation word, "not", is identified within the relation pair <package, good>, so the corresponding privative value is -1; traversing the degree adverbs between "package" and "good" identifies "very", so the corresponding degree value is 1. The evaluation match pair extracted for this clause is <package, -1, good>.
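A minimal sketch of the parity rule described above, under stated assumptions: the negation and degree word lists are hypothetical English stand-ins, the clause is a flat token list rather than a dependency tree, and the function name `build_match_pair` is illustrative. The negation value is named `privative`, as in the score formula of claim 7, and the degree count is returned alongside it.

```python
from typing import List, Tuple

# Hypothetical stand-in word lists; a real system would use Chinese lexicons.
NEGATION_WORDS = {"not", "no", "never"}
DEGREE_WORDS = {"very", "quite", "extremely"}


def build_match_pair(tokens: List[str], obj: str, word: str) -> Tuple[str, int, int, str]:
    """Return (attribute, privative, degree, emotion word) for one relation pair."""
    start, end = sorted((tokens.index(obj), tokens.index(word)))
    between = tokens[start + 1:end]  # tokens strictly between object and word
    negations = sum(1 for t in between if t in NEGATION_WORDS)
    degree = sum(1 for t in between if t in DEGREE_WORDS)
    # Odd number of negations flips the polarity; an even number cancels out.
    privative = -1 if negations % 2 == 1 else 1
    return obj, privative, degree, word


print(build_match_pair(["package", "not", "very", "good"], "package", "good"))
```

For the clause "package not very good" this gives a privative of -1 and a degree count of 1, matching the values stated for the example sentence.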
Step S109: combine the obtained evaluation match pairs with the emotion dictionary, calculate the commendatory/derogatory value of each evaluation object, rank the evaluation objects from good to bad, and finally implement the method through a visual interaction interface.
In a specific implementation, the commendatory or derogatory polarity of the emotion word in each extracted evaluation match pair is obtained from the emotion dictionary. The commendatory/derogatory value of each commodity attribute is then calculated according to the following formula:
$$a.\mathrm{score} = x_i.\mathrm{privative} \times (x_i.\mathrm{score} + x_i.\mathrm{degree} \times x_i.\mathrm{score} \times 0.5), \quad 0 < i \le n$$
For the evaluation match pair <package, -1, good> obtained in step S108, the commodity attribute is "package". Substituting privative = -1 and degree = 1 into the formula gives the emotion value
$$a.\mathrm{score} = -1 \times (x.\mathrm{score} + 1 \times x.\mathrm{score} \times 0.5) = -1.5 \times x.\mathrm{score}$$
where x.score denotes the score of the emotion word "good".
All comment data of the commodity is traversed and processed by the above steps, and the values of identical evaluation objects are accumulated, so that all attributes of the commodity are extracted. The commodity attributes are then divided into two classes, commendatory and derogatory, and the final result is obtained by ordering them with bubble sort. Finally, the method is implemented on a web page, with a front end and a back end, as a visual interaction interface.
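The accumulation and ranking can be sketched as below. The scoring line follows the formula given in claim 7; everything else is an assumption made for illustration: the lexicon scores, the 4-tuple layout of a match pair, the function names, and the choice to treat a non-negative total as commendatory.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Match = Tuple[str, int, int, str]  # (attribute, privative, degree, emotion word)


def attribute_scores(matches: List[Match], lexicon: Dict[str, float]) -> Dict[str, float]:
    """Accumulate privative * (score + degree * score * 0.5) per attribute."""
    totals: Dict[str, float] = defaultdict(float)
    for attribute, privative, degree, word in matches:
        score = lexicon.get(word, 0.0)  # unknown words contribute nothing
        totals[attribute] += privative * (score + degree * score * 0.5)
    return dict(totals)


def bubble_sort_desc(items: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Bubble sort (attribute, score) pairs from highest to lowest score."""
    items = list(items)
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if items[j][1] < items[j + 1][1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # already ordered, stop early
            break
    return items


lexicon = {"good": 1.0, "powerful": 1.0}  # hypothetical dictionary scores
matches = [("package", -1, 1, "good"), ("pixel", 1, 0, "good"),
           ("express delivery", 1, 1, "powerful")]
scores = attribute_scores(matches, lexicon)
commendatory = bubble_sort_desc([(a, s) for a, s in scores.items() if s >= 0])
derogatory = bubble_sort_desc([(a, s) for a, s in scores.items() if s < 0])
print(commendatory, derogatory)
```

Bubble sort is used only because the text names it; for large attribute sets the built-in `sorted` would give the same ordering more efficiently.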
The foregoing merely illustrates embodiments of the present invention, and the scope of protection of the present invention is not limited thereto. Modifications and substitutions that would readily occur to a person skilled in the art fall within the scope of the present invention, which is defined by the appended claims.

Claims (8)

1. A comment analysis method based on word vectors and syntactic features is characterized by comprising the following steps:
1) Acquiring comment data of commodity pages of an e-commerce website;
2) Preprocessing the obtained target data set and constructing a candidate emotion word set;
3) Extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary;
4) Training word vectors on the preprocessed data set with the Word2Vec tool to obtain word vectors and generate a semantic similarity matrix, wherein step 4) specifically comprises:
4-1) training on the data set with Word2Vec to obtain the word vector of each word;
4-2) in combination with the candidate emotion word set, calculating the semantic similarity between words using the following formula:
4-3) for two n-dimensional word vectors a = (x_{11}, x_{12}, …, x_{1n}) and b = (x_{21}, x_{22}, …, x_{2n}), the semantic similarity calculation formula is as follows:
$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k}\, x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\ \sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$
wherein cos θ represents the semantic similarity value; x_{1k} represents the k-th dimension value of word vector a; x_{2k} represents the k-th dimension value of word vector b;
4-4) constructing a semantic similarity matrix according to the calculated semantic similarity;
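As an illustration of steps 4-3) and 4-4), the sketch below computes cosine similarity and a pairwise similarity matrix in plain Python. The toy two-dimensional vectors are invented for the example; in practice the vectors would come from Word2Vec training, which is not shown. Returning 0.0 for a zero vector is an assumption, since the text does not cover that case.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(words: List[str], vectors: Dict[str, Sequence[float]]) -> List[List[float]]:
    """sim[i][j] is the cosine similarity of words[i] and words[j]."""
    return [[cosine_similarity(vectors[wi], vectors[wj]) for wj in words]
            for wi in words]


# Toy vectors, purely illustrative.
vectors = {"good": [1.0, 0.0], "great": [1.0, 1.0], "bad": [-1.0, 0.0]}
words = ["good", "great", "bad"]
sim = similarity_matrix(words, vectors)
print(sim[0])
```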
5) Establishing a probability transition matrix from the semantic similarity matrix and, in combination with a seed word set, generating a final emotion dictionary through the LPA label propagation algorithm and a check against the basic emotion dictionary, wherein step 5) specifically comprises:
5-1) regarding each word as a node of a graph, wherein the weight of the edge between two nodes is the semantic similarity between the words that the nodes represent;
5-2) establishing a probability transition matrix P according to the following formula:
$$P[i][j] = \frac{\mathrm{SIM}(w_i, w_j)}{\sum_{k=1}^{m} \mathrm{SIM}(w_i, w_k)}$$
wherein P[i][j] represents the similarity transition probability from word i to word j, SIM(w_i, w_j) represents the similarity of words i and j, and m represents the number of words with the highest semantic similarity to word i;
5-3) counting the word frequency, in the original comment data, of every emotion word in the candidate emotion word set, and selecting the N words with the highest word frequency to form seed word set 1; using the emotion vocabulary ontology library to select the words in the candidate emotion word set whose emotion vocabulary ontology intensity is > m to form seed word set 2; merging seed word set 1 and seed word set 2 and removing duplicates to form the seed word set, which is then manually labeled with emotion;
5-4) establishing an L×C label matrix Y_L from the small number of manually labeled seed words, wherein L represents the number of seed words and C represents the number of classes, there being 3 classes: commendatory, derogatory, and neutral;
5-5) likewise establishing a U×C label matrix Y_U from the unlabeled sample words, wherein U represents the number of unlabeled sample words and C represents the number of classes, there being 3 classes: commendatory, derogatory, and neutral;
5-6) finally, labeling the parts of speech of the sample words with the LPA label propagation algorithm, and forming the final emotion dictionary after a check against the basic emotion dictionary;
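Steps 5-2) to 5-6) can be sketched roughly as follows. This is one common formulation of label propagation (iterate F ← P·F while clamping the seed rows), offered as an assumption rather than the exact procedure of the patent. The row normalisation over the m most similar words, the clipping of negative similarities to zero, the hand-written similarity matrix, and all function names are illustrative choices; the final check against the basic emotion dictionary is omitted.

```python
from typing import Dict, List


def transition_matrix(sim: List[List[float]], m: int) -> List[List[float]]:
    """Row-normalise each word's similarities to its m most similar words."""
    n = len(sim)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: sim[i][j], reverse=True)[:m]
        weights = {j: max(sim[i][j], 0.0) for j in neighbours}  # assumption: clip negatives
        total = sum(weights.values())
        if total > 0:
            for j, w in weights.items():
                P[i][j] = w / total
    return P


def label_propagation(P: List[List[float]], seeds: Dict[int, int], num_classes: int,
                      default_class: int, iterations: int = 100,
                      tol: float = 1e-6) -> List[int]:
    """Propagate seed labels over the graph; seeds maps word index to class index."""
    n = len(P)
    F = [[0.0] * num_classes for _ in range(n)]
    for i, c in seeds.items():
        F[i][c] = 1.0
    for _ in range(iterations):
        new_F = [[sum(P[i][k] * F[k][c] for k in range(n)) for c in range(num_classes)]
                 for i in range(n)]
        for i, c in seeds.items():  # clamp the manually labeled seed words
            new_F[i] = [1.0 if k == c else 0.0 for k in range(num_classes)]
        change = max(abs(new_F[i][c] - F[i][c])
                     for i in range(n) for c in range(num_classes))
        F = new_F
        if change < tol:
            break
    # Words that received no label mass fall back to default_class.
    return [max(range(num_classes), key=lambda c: F[i][c]) if any(F[i]) else default_class
            for i in range(n)]


# Classes: 0 = commendatory, 1 = derogatory, 2 = neutral.
words = ["good", "great", "bad", "awful"]
sim = [[1.0, 0.9, 0.1, 0.0],
       [0.9, 1.0, 0.0, 0.1],
       [0.1, 0.0, 1.0, 0.8],
       [0.0, 0.1, 0.8, 1.0]]
P = transition_matrix(sim, m=2)
labels = label_propagation(P, seeds={0: 0, 2: 1}, num_classes=3, default_class=2)
print(dict(zip(words, labels)))
```

Here "good" and "bad" act as the labeled seed words, and the two unlabeled words inherit the class of the seed they are most similar to.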
6) Processing the obtained commodity comment text based on the core sentence rules to obtain comment text with redundancy removed;
7) Preprocessing the redundancy-removed text, building a dependency tree from the resulting word-segmentation data set based on dependency relations and syntactic features, and generating SBV, VOB, ATT, CMP, and COO dependency pairs;
8) Extracting <commodity attribute, negation word, degree word, emotion word> evaluation match pairs from the obtained dependency pairs according to part of speech;
9) Combining the obtained evaluation match pairs with the emotion dictionary, calculating the commendatory/derogatory value of each evaluation object, ranking the evaluation objects from good to bad, and finally implementing the method through a visual interaction interface.
2. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 2) specifically includes:
2-1) removing illegal characters using a character matching algorithm;
2-2) performing word segmentation and part-of-speech tagging on the original data set using LTP;
2-3) extracting the words of the required parts of speech and removing duplicates to form candidate emotion word set 1;
2-4) performing word segmentation and part-of-speech tagging on the original data set using NLPIR;
2-5) extracting the words of the required parts of speech and removing duplicates to form candidate emotion word set 2;
2-6) merging candidate emotion word set 1 and candidate emotion word set 2 and removing duplicates to obtain the candidate emotion word set.
3. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 3) specifically includes: extracting the commendatory and derogatory words from the HowNet emotion dictionary and from the NTU evaluation word dictionary respectively, merging them, and removing duplicates to form the basic emotion dictionary.
4. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 6) specifically includes:
the core sentence is obtained mainly by deleting redundancy and retaining the trunk components related to evaluation collocations; if the original sentence does not match any rule, it is kept unchanged; the method uses core sentences in order to improve the accuracy of syntactic dependency analysis of the evaluation text, and the rules comprise:
rule 1: deleting the leading component of the sentence, such as a "… advantage", "… disadvantage", "… deficiency", "… advantage", or "… benefit" sequence;
rule 2: deleting sentences with hypothetical tendencies, such as "if …", "hope …", "if …", "wish …", "suggestion …";
rule 3: deleting a sequence whose period is "exactly," "naturally," "particularly," "still further," "particularly";
rule 4: deleting opinion-claiming words such as "feel" and "consider";
rule 5: deleting consecutive punctuation marks other than the first one, as well as abnormal characters such as emoticons, kaomoji, and brackets.
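The rule cascade can be sketched with regular expressions as below. This is only an illustration under stated assumptions: the patterns are English stand-ins for the Chinese trigger phrases, the names `RULES` and `to_core_sentence` are hypothetical, and rules 2 to 4 remove only the trigger word itself, which is how the worked example in the description behaves.

```python
import re

# (pattern, replacement) pairs applied in order; all patterns are English stand-ins.
RULES = [
    # Rule 1: leading "... advantage / disadvantage / deficiency / benefit" component.
    (re.compile(r"^[^,]*\b(?:advantage|disadvantage|deficiency|benefit)\b[^,]*,\s*", re.I), ""),
    # Rule 2: hypothetical markers.
    (re.compile(r"\b(?:if|hope|wish|suggest)\s+", re.I), ""),
    # Rule 3: emphasis markers.
    (re.compile(r"\b(?:especially|in particular|naturally|exactly|furthermore)\s+", re.I), ""),
    # Rule 4: opinion-claiming words.
    (re.compile(r"\b(?:feel|consider)\s+", re.I), ""),
    # Rule 5a: keep only the first of a run of punctuation marks.
    (re.compile(r"([,.!?;])[,.!?;]+"), r"\1"),
    # Rule 5b: drop bracket characters.
    (re.compile(r"[()]"), ""),
]


def to_core_sentence(text: str) -> str:
    """Apply every rule in order; text matching no rule is returned unchanged."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


raw = ("The disadvantage is, phone received, pixels and sound are good, "
       "especially delivery is fast (arrived next day), "
       "package is not very good, hope the store can improve...")
print(to_core_sentence(raw))
```

Keeping the rules as data makes it easy to add or reorder patterns without touching the function.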
5. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 7) specifically includes:
the five axioms of dependency syntax are:
(1) A sentence has one and only one independent component;
(2) Every other component in the sentence must depend on some component;
(3) No component in the sentence may depend on two or more components at the same time;
(4) If component a depends directly on component b and component c is located between components a and b in the sentence, then component c depends on a, or on b, or on another component between a and b;
(5) The components on the left and right sides of the central component have no dependency relationship with each other;
the dependency tree is characterized by:
(1) The nodes of the tree are the individual components of the sentence;
(2) The root node of the tree is the central component of the whole sentence;
(3) The edges between nodes in the tree are directed, reflecting the asymmetric dependency relationships among components;
(4) The five axioms of dependency syntax are satisfied;
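The axioms can be checked mechanically on a parsed clause. The sketch below assumes a hypothetical encoding in which `heads[i]` is the index of the component that token `i` depends on and `-1` marks the independent (central) component; with that encoding axiom 3 holds by construction, so the code checks axioms 1, 2, 4, and 5 plus the absence of cycles.

```python
from typing import List


def is_valid_dependency_tree(heads: List[int]) -> bool:
    """Check the dependency axioms for a head-index encoding (-1 marks the root)."""
    n = len(heads)
    roots = [i for i, h in enumerate(heads) if h == -1]
    if len(roots) != 1:  # axiom 1: exactly one independent component
        return False
    root = roots[0]
    for d, h in enumerate(heads):
        if d != root and (not 0 <= h < n or h == d):  # axiom 2: a real, distinct head
            return False
    for d in range(n):  # every component must reach the root without looping
        seen = set()
        node = d
        while node != root:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    for d, h in enumerate(heads):
        if d == root:
            continue
        lo, hi = min(d, h), max(d, h)
        for k in range(lo + 1, hi):  # axiom 4: components in between stay inside the span
            if not lo <= heads[k] <= hi:
                return False
        if h != root and (d < root) != (h < root):  # axiom 5: no link across the centre
            return False
    return True


# Hypothetical parse of "package not very good": every word depends on "good".
print(is_valid_dependency_tree([3, 3, 3, -1]))
```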
most sentence dependency relations in comments fall into five types: the subject-verb relation (SBV), the verb-object relation (VOB), the attribute relation (ATT), the complement relation (CMP), and the coordinate relation (COO); dependency syntactic analysis is performed with the LTP dependency parser, and dependency pairs are extracted in combination with a coordinate-relation algorithm for identifying parallel evaluation objects and parallel evaluation words; the coordinate-relation algorithm for identifying parallel evaluation objects and parallel evaluation words specifically comprises:
traversing, in the dependency syntax tree, all words located between the two nodes of a dependency pair and all words related to the two nodes on their left and right, wherein the dependency pairs are the subject-verb, verb-object, attribute, and complement relations obtained based on dependency relations and syntactic features;
judging whether any of the traversed words stands in a coordinate relation;
and expanding the parallel evaluation objects and evaluation words found through the coordinate relation.
6. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 8) specifically includes:
8-1) according to the characteristics of the Chinese language, most evaluation objects are nouns or verbs, and most evaluation words are adjectives or verbs;
8-2) extracting the evaluation objects and evaluation words, namely the commodity attributes and emotion words, according to part of speech;
8-3) traversing the dependency syntax tree to determine whether negation words exist between the obtained evaluation object and evaluation word, incrementing the negation-word count by 1 for each negation word found, and judging the parity of the count once the traversal is finished; if the count is odd, the corresponding negation value privative is assigned -1, and if it is even, the corresponding privative is assigned +1;
8-4) traversing the dependency syntax tree to determine whether degree words exist between the obtained evaluation object and evaluation word, and if so, accumulating their number to obtain the number of degree words of the match pair;
8-5) finally forming the <commodity attribute, negation word, degree word, emotion word> evaluation match pair.
7. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 9) specifically includes:
for a commodity attribute a that appears n times, the commendatory/derogatory value calculation formula is as follows:
$$a.\mathrm{score} = x_i.\mathrm{privative} \times (x_i.\mathrm{score} + x_i.\mathrm{degree} \times x_i.\mathrm{score} \times 0.5), \quad 0 < i \le n$$
where a.score is the emotion value of commodity attribute a; x_i is the i-th occurrence of the commodity attribute; privative is the negation value, -1 or +1, corresponding to the i-th occurrence; and degree is the number of degree adverbs corresponding to the i-th occurrence; the emotion values of the commodity attributes are calculated, and the values of identical evaluation objects are accumulated;
all extracted evaluation objects are divided into two categories, commendatory and derogatory, and the final results are ordered using bubble sort.
8. A visual interactive interface, characterized in that the comment analysis method of claims 1 to 7 can be performed.
CN201910343337.5A 2019-04-26 2019-04-26 Comment analysis method based on word vector and syntactic characteristics and visual interaction interface Active CN110175325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910343337.5A CN110175325B (en) 2019-04-26 2019-04-26 Comment analysis method based on word vector and syntactic characteristics and visual interaction interface

Publications (2)

Publication Number Publication Date
CN110175325A CN110175325A (en) 2019-08-27
CN110175325B true CN110175325B (en) 2023-07-11

Family

ID=67690209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910343337.5A Active CN110175325B (en) 2019-04-26 2019-04-26 Comment analysis method based on word vector and syntactic characteristics and visual interaction interface

Country Status (1)

Country Link
CN (1) CN110175325B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705266B (en) * 2019-09-09 2023-05-12 创新奇智(南京)科技有限公司 Emotion analysis method and device
CN110717654B (en) * 2019-09-17 2022-05-06 合肥工业大学 Product quality evaluation method and system based on user comments
CN110659828B (en) * 2019-09-23 2022-03-08 上海海事大学 Software feature evaluation method based on comment data
CN110706028A (en) * 2019-09-26 2020-01-17 四川长虹电器股份有限公司 Commodity evaluation emotion analysis system based on attribute characteristics
CN110750646B (en) * 2019-10-16 2022-12-06 乐山师范学院 Attribute description extracting method for hotel comment text
CN111259661B (en) * 2020-02-11 2023-07-25 安徽理工大学 New emotion word extraction method based on commodity comments
CN111414753A (en) * 2020-03-09 2020-07-14 中国美术学院 Method and system for extracting perceptual image vocabulary of product
CN111523300B (en) * 2020-04-14 2021-03-05 北京精准沟通传媒科技股份有限公司 Vehicle comprehensive evaluation method and device and electronic equipment
CN111930941A (en) * 2020-07-31 2020-11-13 腾讯音乐娱乐科技(深圳)有限公司 Method and device for identifying abuse content and server
CN112069312B (en) * 2020-08-12 2023-06-20 中国科学院信息工程研究所 Text classification method based on entity recognition and electronic device
CN111898928B (en) * 2020-08-18 2021-08-31 哈尔滨工业大学 Multi-party service value-quality-capability index alignment method facing space-time boundary
CN112115700B (en) * 2020-08-19 2024-03-12 北京交通大学 Aspect-level emotion analysis method based on dependency syntax tree and deep learning
CN113535901B (en) * 2021-07-08 2023-08-18 北京航空航天大学 Method for constructing user side commodity knowledge graph based on e-commerce comments
CN113327140B (en) * 2021-08-02 2021-10-29 深圳小蝉文化传媒股份有限公司 Video advertisement putting effect intelligent analysis management system based on big data analysis
CN115309867A (en) * 2022-08-16 2022-11-08 中国第一汽车股份有限公司 Text processing method, device, equipment and medium
CN117436446B (en) * 2023-12-21 2024-03-22 江西农业大学 Weak supervision-based agricultural social sales service user evaluation data analysis method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133282A (en) * 2017-04-17 2017-09-05 华南理工大学 A kind of improved evaluation object recognition methods based on two-way propagation
CN107247702A (en) * 2017-05-05 2017-10-13 桂林电子科技大学 A kind of text emotion analysis and processing method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于word2vec扩充情感词典的商品评论倾向分析;陆峰;《电脑知识与技术》;20170227;第13卷(第05期);第143-145、159页 *
基于句法依赖规则和词性特征的情感词识别研究;邓淑卿 等;《情报理论与实践》;20181231;第41卷(第05期);第137-142页 *

Also Published As

Publication number Publication date
CN110175325A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN110175325B (en) Comment analysis method based on word vector and syntactic characteristics and visual interaction interface
US10235624B2 (en) Information processing method and apparatus
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN107657056B (en) Method and device for displaying comment information based on artificial intelligence
CN106649603B (en) Designated information pushing method based on emotion classification of webpage text data
CN103678564B (en) Internet product research system based on data mining
CN107220386A (en) Information-pushing method and device
CN109376251A (en) A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN107301163B (en) Formula-containing text semantic parsing method and device
CN108984554B (en) Method and device for determining keywords
CN111767725A (en) Data processing method and device based on emotion polarity analysis model
CN105740382A (en) Aspect classification method for short comment texts
CN108596637B (en) Automatic E-commerce service problem discovery system
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
KR102325022B1 (en) On-line image and review integrated analysis method and system using deep learning-based hybrid analysis method
CN110706028A (en) Commodity evaluation emotion analysis system based on attribute characteristics
CN107818173B (en) Vector space model-based Chinese false comment filtering method
Braz et al. Document classification using a Bi-LSTM to unclog Brazil's supreme court
CN107436916A (en) The method and device of intelligent prompt answer
Siddharth et al. Sentiment analysis on twitter data using machine learning algorithms in python
Stewart et al. Seq2kg: an end-to-end neural model for domain agnostic knowledge graph (not text graph) construction from text
Khemani et al. A review on reddit news headlines with nltk tool
CN113590809A (en) Method and device for automatically generating referee document abstract

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant