CN109460508B

CN109460508B - Efficient spam comment user group detection method

Info

Publication number: CN109460508B
Application number: CN201811177783.5A
Authority: CN
Inventors: 张小旭; 邓水光; 李莹; 吴健; 尹建伟; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-10-10
Filing date: 2018-10-10
Publication date: 2021-10-15
Anticipated expiration: 2038-10-10
Also published as: CN109460508A

Abstract

The invention discloses an efficient method for detecting spam comment user groups. Candidate groups (each group must consist of at least 2 users who have jointly commented on at least 3 products) are obtained from comment data of products on an e-commerce website. For each product node, user node and group node in a heterogeneous network, the node's own basic spam features and the relationship-based spam influence features are extracted. The spam information of each node, including the group nodes, is then obtained by combining the node's basic spam information with the influence of the other two types of nodes, and groups whose spam information exceeds a certain threshold are identified as spam comment user groups. In addition, an optimized GroupRank algorithm is adopted, so that accuracy and performance are higher.

Description

Efficient spam comment user group detection method

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to an efficient spam comment user group detection method.

Background

With the arrival of the mobile internet era and the development of the internet of things, online shopping has gradually become an important new mode of consumption. More and more user-generated content appears in network applications, and most consumers publish their shopping experiences and their thoughts and opinions about products on an e-commerce platform after shopping. On the one hand, according to the 2011 Cone survey report (United States), 64% of users refer to existing user comments before making a purchase, so user comments influence the consumption behavior of potential consumers and have commercial value. On the other hand, user comments contain a great deal of information about how consumers evaluate and prefer various aspects of products; this information helps enterprises understand consumer preferences and requirements, find problems such as product quality defects, and learn the strengths and weaknesses of their products. For an e-commerce service platform, comment information reveals the product characteristics that consumers care about most, which allows the platform to guide consumers toward more comprehensive evaluations in the comment system and thereby improve the quality of comment information and the reputation of the website.

Comments on an e-commerce website are supposed to derive from a user's real experience with a specific product or service. They directly influence the purchasing decisions of future customers, and positive comments can bring huge economic benefits to organizations and individuals. Consumers comparing similar goods tend to visit stores with high sales volume, many comments and many positive comments. Driven by these interests, merchants often hire paid posters (the so-called "water army") to carry out fake transactions and inflate the number of comments, or run promotions such as cash back for a positive comment of at least N words. Others post false comments in an attempt to evaluate products unfairly, for example writing positive comments to promote a product or deliberately writing negative comments to damage a product's reputation. Consumers are misled as a result, and such false comments are called spam comments.

Unlike other types of spam (e.g., spam email), spam comments are very difficult to find, mainly because spam comment users can easily disguise themselves. Algorithms that rely on low-level textual language features and behavioral characteristics have therefore hit a bottleneck and struggle to identify spam comments and spam comment users. At present, research in this field largely revolves around detecting spam comments and spam comment users. A spam comment user group, however, is more destructive: because many members of a group write false comments, the group can completely control the comment sentiment of a product. It has also been found that spam comment user groups are easier to detect than individual spam comments or spam comment users, so detecting spam comment user groups is of greater significance.

Since the user behavior features proposed in existing spam comment detection methods are not sufficient to capture spam comment user groups, a more sophisticated and complementary framework is needed.

Disclosure of Invention

In view of the above, the invention provides an efficient spam comment user group detection method. The method captures the relationships among comment user groups, comment users and products, and mines the product comment data of an e-commerce platform with GroupRank, an iterative algorithm on a heterogeneous network. The spam information of each group is obtained by considering both the influence of product nodes and user nodes on the group nodes and the basic spam information of the group itself, and spam comment user groups are identified from it.

An efficient spam comment user group detection method comprises the following steps:

(1) preprocessing comment data of the E-commerce product;

(2) extracting basic features of groups, products and users based on the preprocessed comment data and integrating the basic features into vector form, wherein each group is composed of at least 2 users who have jointly commented on at least 3 products;

(3) extracting three groups of corresponding relation characteristics among the group, the product and the user and integrating the three groups of relation characteristics into a matrix form;

(4) calculating the spam scores of all groups through the GroupRank algorithm according to the basic features and the relationship features;

(5) setting a proper threshold value, and classifying all groups into spam groups and non-spam groups by comparing each spam score with the threshold value.

Further, the preprocessing of the comment data in step (1) comprises filtering out low-frequency users, low-quality comments and low-sales commodities. For low-quality comments, comments whose information content is too low, i.e. below a set threshold value, are screened out according to comment length and richness, and then spam comments and advertisement comments are removed by manual judgment and machine learning. For low-frequency users, i.e. users whose number of valid comments is below a set threshold value and who therefore cannot be identified accurately, the comments posted by those users are removed. For low-sales commodities, i.e. commodities whose number of comments is below a set threshold value, the commodity is considered to have essentially no comment-brushing behavior, and all comments on the commodity are removed.
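The three threshold-based filters described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the record keys (`user`, `product`, `text`), the function name `preprocess`, the use of distinct-word count as a stand-in for "richness", and all threshold values are assumptions, and the manual and machine-learning removal of spam and advertisement comments is not covered.

```python
from collections import Counter


def preprocess(reviews, min_len=10, min_words=3,
               min_user_reviews=3, min_product_reviews=5):
    """Filter low-quality comments, low-frequency users and low-sales products.

    `reviews` is a list of dicts with the hypothetical keys
    "user", "product" and "text". All thresholds are illustrative.
    """
    # 1. Low-quality comments: too short, or too few distinct words.
    kept = [r for r in reviews
            if len(r["text"]) >= min_len
            and len(set(r["text"].split())) >= min_words]

    # 2. Low-frequency users: too few remaining valid comments.
    per_user = Counter(r["user"] for r in kept)
    kept = [r for r in kept if per_user[r["user"]] >= min_user_reviews]

    # 3. Low-sales products: too few remaining comments.
    per_product = Counter(r["product"] for r in kept)
    return [r for r in kept if per_product[r["product"]] >= min_product_reviews]
```

The filters run in the order given in the text, so a user's comment count is measured after low-quality comments have been dropped.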

Further, the basic features of the group, the product and the user are extracted in step (2). The basic feature of a group is the similarity of the behavior information of the group members on all products they have commented on together (the higher the similarity, the more likely it is that the group commented on those products as an assigned task); the behavior information comprises comment text, rating, time window and comment position. The basic feature of a user is the consistency of the user's overall comments (spam comment users generally post comments deliberately to inflate counts, so their comment consistency is higher); this consistency is expressed in user rating, comment time and comment position. The basic feature of a product is the degree of deviation of its comments: since a product may receive deliberately positive or deliberately negative comments, comments that depart from the real situation produce a deviation within the overall comments, including deviation in rating and in comment time.

Further, in step (3), in order to quantify the magnitude of spam influence among the three types of entities, namely group, product and user, relationship features appropriate to each relationship are extracted from the three relationships group-product, user-product and group-user.

Further, the GroupRank algorithm in step (4) is as follows:

wherein: $S_1$ to $S_5$ are all intermediate variables; $B_U$, $B_P$ and $B_G$ are the basic feature vectors of users, products and groups, respectively; $A_{PG}$, $A_{UP}$ and $A_{GU}$ are the group-product, user-product and group-user relationship feature matrices, respectively;

$S_G^{(t-1)}$ and $S_G^{(t)}$ are the group spam vectors in the (t-1)-th and t-th iterations, respectively, each containing the spam scores of all groups; α is a set weight coefficient, $^T$ denotes matrix transposition, t is a natural number greater than 0, and $\|\cdot\|_1$ is the 1-norm. $S_G$ is initialized to the basic feature vector $B_G$. When the difference between $S_G^{(t)}$ and $S_G^{(t-1)}$ is less than a threshold δ, the algorithm converges and outputs $S_G^{(t)}$.

Preferably, since the iterative process uses a large number of matrix operations and consumes substantial computing resources, $S_1$ to $S_5$ and the group spam vector are normalized in advance, and the 6 formulas of the GroupRank algorithm are transformed and combined into a single update in which the group spam vector is multiplied by W in each iteration, where W is the transformation matrix obtained by this transformation and integration.

The method obtains candidate groups (each group must consist of at least 2 users who have jointly commented on at least 3 products) from comment data of products on an e-commerce website, and extracts the basic spam features and the relationship-based spam influence features of each product node, user node and group node in a heterogeneous network. The spam information of each node, including the group nodes, is obtained by combining the node's basic spam information with the influence of the other two types of nodes, and groups above a certain threshold value are identified as spam comment user groups. The invention therefore has the following beneficial technical effects:

(1) Based on the idea from PageRank that webpage nodes propagate authority values through a network structure, the invention provides a heterogeneous network model consisting of three types of nodes to capture the more complex relationships among comment user groups, comment users and products, and it analyzes the possible factors more comprehensively than other spam comment user group identification algorithms.

(2) The invention performs iterative computation on the heterogeneous network model. Based on the relationships among the 3 types of nodes, it derives how the spam information of one node affects the spam information of another through spam influence. It also takes the basic spam information of each node into account and weights it with an adjusting factor, which sets the proportion between the basic spam information and the relationship-based spam influence.

(3) The optimized GroupRank algorithm was tested on a data set of 1.2 million product comments from the Amazon China e-commerce website and compared with other identification methods commonly used in the field; the experiment shows that the algorithm achieves higher accuracy and performance.

Drawings

FIG. 1 is a schematic overall flow chart of the method of the present invention.

FIG. 2 is a flow chart of the GroupRank algorithm.

FIG. 3 is a schematic diagram of accelerated optimization of the GroupRank algorithm.

Detailed Description

In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.

The spam comment user group detection method disclosed by the invention is mainly divided into six parts as shown in FIG. 1: comment reading, data preprocessing and analysis, basic feature extraction, relation feature extraction, GroupRank iterative computation and algorithm acceleration, and spam group identification, wherein:

Comment reading: this part mainly defines an interface for data input and output; by overriding the interface functions, data of different types can be read from or written to different channels at any time.

Data preprocessing and analysis: the target data is comment data of e-commerce products, which is relatively complex and non-standard and must be cleaned to improve detection accuracy. The data preprocessing part therefore screens comment content, users and groups, filtering out low-frequency users, low-quality comments and low-sales commodities to obtain the main data. It comprises four steps. First, for low-quality comments, comments with obviously low information content are screened out according to comment length and richness, and spam comments and advertisement comments are then removed by simple manual screening and machine learning. Second, low-frequency users are screened out, because users who have posted too few valid comments cannot be identified accurately. Third, commodities are screened: if a commodity has very few comments, it is considered to have essentially no comment-brushing behavior. Finally, candidate high-similarity user groups are mined from the screened data with a machine learning method based on frequent pattern mining.
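The final step, mining candidate groups, can be illustrated with a brute-force sketch. It enumerates user combinations directly instead of using a real frequent-itemset miner, so it is only practical for tiny inputs; the function name `candidate_groups`, the `(user, product)` pair format and the `max_users` cap are assumptions made for illustration.

```python
from itertools import combinations


def candidate_groups(reviews, min_users=2, min_products=3, max_users=3):
    """Return {members: co-reviewed products} for every qualifying group.

    `reviews` is an iterable of (user, product) pairs. A group qualifies
    when it has at least `min_users` members who all reviewed at least
    `min_products` common products.
    """
    # Map each user to the set of products they reviewed.
    products_of = {}
    for user, product in reviews:
        products_of.setdefault(user, set()).add(product)

    groups = {}
    users = sorted(products_of)
    for size in range(min_users, max_users + 1):
        for members in combinations(users, size):
            # Products reviewed by every member of this combination.
            common = set.intersection(*(products_of[u] for u in members))
            if len(common) >= min_products:
                groups[members] = common
    return groups
```

A production system would replace the exhaustive enumeration with a frequent-itemset algorithm that treats each product's reviewer set as a transaction.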

Basic feature extraction: the basic features of groups, products and users are based on the overall characteristics of each group, user and product. The overall similarity, consistency or degree of deviation is extracted as the basic spam feature, and these indicators reflect the spam information of the node to a certain degree. The core of the group basic feature is the similarity of the group's behavior on all products its members have commented on together: the higher the similarity, the more likely it is that the group posted spam comments on those products as an assigned task. The behaviors include comment text, rating, time window, comment position and the like. The core of the user basic feature is the consistency of the user's overall comments: spam comment users generally post comments deliberately to inflate counts, so their comment consistency is higher. This consistency shows in user rating, comment time and comment position. The core of the product basic feature is the degree of deviation of its comments: because a product may receive deliberately positive or deliberately negative comments, those comments depart from the comments of real users and produce a deviation within the overall comments, including deviation in product rating, comment time and the like.
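The source does not give formulas for these features, so the following is only a hypothetical sketch of the rating dimension of two of them: a group similarity score and a product deviation score. The function names, the 1-5 star scale and the specific formulas (rating spread and median absolute deviation) are all assumptions chosen for illustration.

```python
from statistics import median


def group_rating_similarity(ratings_by_product, scale=4.0):
    """Average agreement of group members' ratings across co-reviewed products.

    `ratings_by_product` maps product -> list of member ratings (1-5 stars).
    Returns 1.0 when all members always give identical ratings.
    """
    sims = [1.0 - (max(r) - min(r)) / scale
            for r in ratings_by_product.values()]
    return sum(sims) / len(sims)


def product_rating_deviation(ratings, scale=4.0):
    """Mean absolute deviation of a product's ratings from their median,
    scaled to [0, 1]. Higher values suggest more outlying ratings."""
    mid = median(ratings)
    return sum(abs(r - mid) for r in ratings) / len(ratings) / scale
```

Analogous scores for comment text, time window and comment position would be combined with these to form the basic feature vectors.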

Relationship feature extraction: because the degree of influence between entities differs, the amount of spam information transmitted between entities also differs, so relationship features are extracted from the different entity relationships to represent the different spam influences. To quantify the magnitude of spam influence among entities, the relationship feature extraction part extracts relationship features appropriate to each of the 3 relationships: comment user group-product, comment user-product and comment user group-comment user. For example, the same group behaves with different degrees of suspiciousness on different products, and the same comment user posts comments with different degrees of spam on different products; the greater the strength of a relationship, the greater the spam influence between the two entities it connects.
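As a hedged illustration of storing one such relationship in matrix form, the sketch below builds a sparse group-product matrix. The weight used (the share of a group's members who reviewed the product) is an assumed stand-in, since the source does not specify the relationship features; `group_product_matrix` and its input formats are likewise hypothetical.

```python
def group_product_matrix(groups, reviews):
    """Return a sparse matrix as {(product, group_id): weight}.

    `groups` maps group_id -> iterable of member user ids.
    `reviews` is an iterable of (user, product) pairs.
    Only non-zero entries are stored.
    """
    # Map each product to the set of users who reviewed it.
    reviewers = {}
    for user, product in reviews:
        reviewers.setdefault(product, set()).add(user)

    matrix = {}
    for gid, members in groups.items():
        member_set = set(members)
        for product, users in reviewers.items():
            # Assumed weight: fraction of the group that reviewed this product.
            share = len(users & member_set) / len(member_set)
            if share > 0:
                matrix[(product, gid)] = share
    return matrix
```

Storing only non-zero entries reflects the observation that review data is sparse: most groups have no connection to most products.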

GroupRank iterative computation and algorithm acceleration: because spam information propagates between nodes, the spam information of each node can be derived from the spam information of the other two types of nodes. The group spam information is therefore initialized to the basic spam information and the iterative process is started; it runs until the difference between the $S_G$ values of two successive updates is less than a threshold δ, at which point the iteration has converged to a stable $S_G$ and the spam scores of all groups are obtained. Because the iteration uses a large number of matrix operations and consumes substantial computing resources, acceleration is achieved by computing the matrix W in advance of the iteration.

The process of the GroupRank algorithm is shown in FIG. 2. At the start of the algorithm, the spam information of all groups is initialized to their basic feature values, and the iterative process begins. In the first step, the product spam information $S_P$ is derived from the current value of $S_G$; in the second step, the new $S_P$ is used to derive the user spam information $S_U$; in the third step, $S_U$ is used to derive the group spam information, which completes one forward derivation loop. The reverse loop starts at the fourth step: the user spam information $S_U$ is derived from $S_G$, the product spam information $S_P$ is then derived from the updated $S_U$, and finally, in the sixth step, the group spam information $S_G$ is derived. This completes one round of iteration. The spam degree is normalized at every step, and the group spam degree is updated continually until the difference between the group spam information $S_G$ of two successive updates is less than a threshold δ, at which point the iteration has converged to a stable $S_G$. $S_G$ contains the spam scores of all groups, so the actual spam score of each group can be obtained through this iterative process.
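The six-step round described above can be sketched in plain Python. The exact update formulas are not reproduced in this text, so the form used here (an α-weighted mix of a node type's basic feature vector and the normalized influence arriving from the previous node type, followed by 1-norm normalization) is an assumption modelled on PageRank-style damping, not the patented formulas. Matrix shapes are also assumed: `a_pg` is products × groups, `a_up` is users × products and `a_gu` is groups × users, all as lists of rows.

```python
def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def l1_normalize(v):
    total = sum(abs(x) for x in v)
    return [x / total for x in v] if total else list(v)


def step(a, s, b, alpha):
    """Assumed update: mix basic features `b` with influence `a @ s`."""
    influence = l1_normalize(matvec(a, s))
    mixed = [alpha * bi + (1 - alpha) * ii for bi, ii in zip(b, influence)]
    return l1_normalize(mixed)


def group_rank(a_pg, a_up, a_gu, b_p, b_u, b_g,
               alpha=0.5, delta=1e-9, max_iter=1000):
    s_g = l1_normalize(b_g)  # initialise with (normalized) basic group features
    for _ in range(max_iter):
        s1 = step(a_pg, s_g, b_p, alpha)              # 1: groups -> products
        s2 = step(a_up, s1, b_u, alpha)               # 2: products -> users
        s3 = step(a_gu, s2, b_g, alpha)               # 3: users -> groups
        s4 = step(transpose(a_gu), s3, b_u, alpha)    # 4: groups -> users
        s5 = step(transpose(a_up), s4, b_p, alpha)    # 5: users -> products
        new = step(transpose(a_pg), s5, b_g, alpha)   # 6: products -> groups
        diff = sum(abs(x - y) for x, y in zip(new, s_g))
        s_g = new
        if diff < delta:  # converged: successive S_G differ by less than delta
            break
    return s_g
```

Because every step ends with 1-norm normalization, the returned scores sum to 1 and are best read as a relative ranking of groups.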

The algorithm acceleration part achieves its speed-up by computing the matrix W in advance of the iteration. As shown in FIG. 3, $S_P$, $S_U$ and $S_G$ are normalized in advance, and the 6 formulas by which the three types of nodes are derived from one another are then transformed and combined. Let $W = Z_{PG} Z_{UP} Z_{GU} Z_{UG} Z_{PU} Z_{GP}$; the calculation process can then be converted into a single update in which $S_G$ is multiplied by W.

One iteration is then equivalent to this calculation plus a normalization and a convergence check. GroupRank needs to perform the matrix multiplication between W and $S_G$ in each iteration, and its time complexity can be taken as $O(t|E|)$, where $|E|$ is the number of non-zero elements in A and t is the total number of iterations. The cost of generating the matrix W can be ignored in comparison with the iterative calculation, because it needs only a few matrix multiplications. The actual computation is rather fast because the relationship matrices are very sparse and follow the distribution of comments, so a large share of the multiplications involve zero entries and can be skipped. In addition, GroupRank is an algorithm to which power iteration can be applied, and acceleration is achieved by precomputing the matrix W, so the optimized GroupRank is an efficient spam comment user group identification method.
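The idea of folding the six per-round multiplications into one precomputed matrix can be shown with a small power-iteration sketch. The contents of the Z matrices are not given in this text, so they are taken as inputs here; the helper names and the omission of any α weighting are assumptions made to keep the example focused on the precomputation.

```python
from functools import reduce


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def l1_normalize(v):
    total = sum(abs(x) for x in v)
    return [x / total for x in v] if total else list(v)


def precompute_w(chain):
    """Multiply the matrices in `chain` left to right, once, before iterating."""
    return reduce(matmul, chain)


def power_iterate(w, s0, delta=1e-9, max_iter=1000):
    """Repeat S <- normalize(W S) until successive vectors differ by < delta."""
    s = l1_normalize(s0)
    for _ in range(max_iter):
        new = l1_normalize(matvec(w, s))
        if sum(abs(x - y) for x, y in zip(new, s)) < delta:
            return new
        s = new
    return s
```

With W built once, each round costs a single matrix-vector product instead of six, which is where the saving comes from when many iterations are needed.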

Spam group identification: a proper threshold value is set, and the candidate groups are divided into spam groups and non-spam groups. This part identifies the candidate groups whose spam score is higher than a certain threshold γ as spam comment user groups. To identify spam comment user groups more accurately, samples can be labeled manually and the ROC curve computed for different scenarios, so that a better threshold can be selected and a better classification result achieved.
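A minimal sketch of this step follows. The text only says that the ROC can be computed on manually labeled data to choose a threshold; picking the threshold that maximizes the true-positive rate minus the false-positive rate (Youden's J) is one common way to do that and is an assumption here, as are the function names and the dictionary inputs.

```python
def classify(scores, gamma):
    """Return the ids of groups whose spam score is higher than gamma."""
    return {g for g, s in scores.items() if s > gamma}


def best_threshold(scores, labels):
    """Pick the observed score that maximizes TPR - FPR on labeled groups.

    `scores` maps group id -> spam score; `labels` maps group id -> bool
    (True for manually confirmed spam groups).
    """
    positives = {g for g in scores if labels[g]}
    negatives = set(scores) - positives
    best_gamma, best_j = None, float("-inf")
    for gamma in sorted(set(scores.values())):
        flagged = classify(scores, gamma)
        tpr = len(flagged & positives) / len(positives) if positives else 0.0
        fpr = len(flagged & negatives) / len(negatives) if negatives else 0.0
        if tpr - fpr > best_j:
            best_gamma, best_j = gamma, tpr - fpr
    return best_gamma
```

For example, with scores 0.9 and 0.8 for two labeled spam groups and 0.4 and 0.2 for two legitimate ones, the chosen threshold is 0.4, which flags exactly the two spam groups.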

Using the GroupRank algorithm, a classification experiment was carried out on 1.2 million product comments crawled from Amazon China and compared with other spam group identification methods. The F-score of GroupRank (with γ = 0.67) is the highest, reaching 0.917, while the F-scores of the other identification algorithms SVM, KNN and GC are 0.834, 0.908 and 0.916, respectively. The experiment therefore shows that the GroupRank algorithm provided by the invention identifies spam comment user groups with high accuracy and outperforms the other reference methods, and that the optimized GroupRank algorithm is also an efficient spam comment user group identification method.
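For reference, an F-score of this kind can be computed from predicted and manually labeled spam groups as below. The text does not say which F-score variant was used, so the balanced F1 (harmonic mean of precision and recall) is assumed, and the function name `f_score` is hypothetical.

```python
def f_score(predicted, actual):
    """Balanced F1 score of a predicted set of spam groups against labels."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

Both arguments are sets of group ids, so the score depends only on which groups were flagged, not on their raw spam scores.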

The embodiments described above are presented to enable a person having ordinary skill in the art to make and use the invention. It will be readily apparent to those skilled in the art that various modifications may be made to the above-described embodiments, and that the generic principles defined herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments, and improvements and modifications made by those skilled in the art based on the disclosure of the present invention shall fall within the protection scope of the present invention.

Claims

1. An efficient spam comment user group detection method comprises the following steps:

(1) preprocessing comment data of the E-commerce product;

2. The spam comment user group detection method of claim 1, wherein: preprocessing the comment data in step (1) comprises filtering out low-frequency users, low-quality comments and low-sales commodities; for low-quality comments, comments whose information content is too low, i.e. below a set threshold value, are screened out according to comment length and richness, and then spam comments and advertisement comments are removed by manual judgment and machine learning; for low-frequency users, i.e. users whose number of valid comments is below a set threshold value and who therefore cannot be identified accurately, the comments posted by those users are removed; for low-sales commodities, i.e. commodities whose number of comments is below a set threshold value, the commodity is considered to have no comment-brushing behavior, and all comments on the commodity are removed.

3. The spam comment user group detection method of claim 1, wherein: the basic features of the group, the product and the user are extracted in step (2); the basic feature of a group is the similarity of the behavior information of the group members on all products they have commented on together, the behavior information comprising comment text, rating, time window and comment position; the basic feature of a user is the consistency of the user's overall comments, expressed in user rating, comment time and comment position; the basic feature of a product is the degree of deviation of its comments: since a product may receive deliberately positive or deliberately negative comments, comments that depart from the real situation produce a deviation within the overall comments, including deviation in rating and in comment time.

4. The spam comment user group detection method of claim 1, wherein: in step (3), in order to quantify the magnitude of spam influence among the three types of entities, namely group, product and user, relationship features appropriate to each relationship are extracted from the three relationships group-product, user-product and group-user.

5. The spam comment user group detection method of claim 1 wherein: the GroupRank algorithm in step (4) is as follows:

wherein: $S_1$ to $S_5$ are all intermediate variables; $B_U$, $B_P$ and $B_G$ are the basic feature vectors of users, products and groups, respectively; $A_{PG}$, $A_{UP}$ and $A_{GU}$ are the group-product, user-product and group-user relationship feature matrices, respectively;

$S_G^{(t-1)}$ and $S_G^{(t)}$ are the group spam vectors in the (t-1)-th and t-th iterations, respectively, each containing the spam scores of all groups; α is a set weight coefficient, $^T$ denotes matrix transposition, t is a natural number greater than 0, and $\|\cdot\|_1$ is the 1-norm. $S_G$ is initialized to the basic feature vector $B_G$. When the difference between $S_G^{(t)}$ and $S_G^{(t-1)}$ is less than a threshold δ, the algorithm converges and outputs $S_G^{(t)}$.

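6. The spam comment user group detection method of claim 5, wherein: $S_1$ to $S_5$ and the group spam vector are normalized in advance, and the 6 formulas of the GroupRank algorithm are transformed and combined into a single update in which the group spam vector is multiplied by W in each iteration, where W is the transformation matrix obtained by this transformation and integration.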