CN114742564A - False reviewer group detection method fusing complex relationships - Google Patents

False reviewer group detection method fusing complex relationships

Info

Publication number
CN114742564A
Authority
CN
China
Prior art keywords
node
model
false
training
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210449853.8A
Other languages
Chinese (zh)
Inventor
于硕
李世豪
雷启航
夏锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202210449853.8A
Publication of CN114742564A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of artificial intelligence and provides a false reviewer group detection method fusing complex relationships, used to detect false reviewer groups on online trading platforms. The method comprises three stages: node representation updating, model training, and false reviewer group detection. Applied to a real data set, the trained model can identify false reviewers and clearly distinguish false reviewer groups from normal reviewers. The method is based on the complex relationship features of the nodes, makes full use of the valuable relationship information among reviewers, and integrates the embedding process with the clustering and detection process to obtain a goal-directed false reviewer group detection model, overcoming the poor generality and low detection performance of existing group detection methods.

Description

False reviewer group detection method fusing complex relationships
Technical Field
The invention relates to the field of artificial intelligence, in particular to a false reviewer group detection method fusing complex relationships.
Background
The rapid spread of online review systems has made reviews an important basis for purchasing decisions: more and more people read the reviews on a platform before buying a product and post their own evaluation afterwards. These reviews give customers useful information and first-hand product experience, so the quality of online reviews is particularly important. False reviews that do not match the facts about a product can damage its reputation and mislead buyers.
Most existing false review detection techniques rely on big data and artificial intelligence methods. Traditional detection techniques classify reviewers using hand-crafted features: behavioral features, linguistic features of the review text, and relationship features between users captured by constructing graphs. Earlier research focused mainly on detecting individual false reviewers. However, false reviewer groups often do more harm to an online review system, and they are harder to find: a review written by a group member may look like a normal review when examined on its own, so individual-level detection techniques struggle. In addition, relationships between false reviewers are difficult to establish, even though such complex relationships would let a model capture the connections between reviewers within a group and thereby assist false reviewer group detection.
Current false reviewer group detection methods fall into the following categories:
Detection methods based on clustering algorithms. These methods generally learn node embeddings with an algorithm such as a graph neural network, cluster the nodes with a clustering algorithm, and finally detect false reviewer groups among the clusters. Common clustering algorithms include the partition-based algorithm KMeans and the density-based algorithm DBSCAN.
(1) The KMeans clustering algorithm divides all points in a sample space into K groups, usually measuring similarity by Euclidean distance. The main flow of the algorithm is as follows: K centroids are placed at random, one per cluster. The distance from each point to every centroid is calculated, and each data point is assigned to its nearest centroid, forming clusters. The positions of the K centroids are then recalculated, and the assignment and recalculation steps are repeated iteratively.
(2) The DBSCAN clustering algorithm first determines the type of each point in the data set, which may be a core point or a boundary point. A data point is a core point if at least M points lie within a specified radius R of it. It is a boundary point if fewer than M points lie within radius R of it but it can be reached from a core point, i.e. it is within distance R of a core point. Neighboring core points are connected and placed in the same cluster, and each boundary point is assigned to the cluster of a core point from which it can be reached.
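The two clustering procedures described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: the function names `kmeans` and `dbscan_point_types`, the parameters `radius` and `min_pts` (standing in for R and M), the choice to count a point as part of its own neighborhood, and the extra "noise" type for points that are neither core nor boundary points are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain K-Means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def dbscan_point_types(points, radius, min_pts):
    """Label each point 'core', 'border' or 'noise' (point typing only, no cluster ids)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    within = dists <= radius                  # the neighborhood includes the point itself
    is_core = within.sum(axis=1) >= min_pts
    # A non-core point is a border point if some core point lies within the radius.
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))
```

In a detection setting the core points returned by the second function would play the role of group centers, while points typed as noise are the natural outlier candidates.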
Graph-based detection methods. These methods start from subgraphs and judge how suspicious a group is from the attributes of its nodes or of the subgraph, which drives the whole detection process. Some methods aggregate relationships from differences in graph topology, time, and ratings, and use joint probabilities to detect false reviewer groups. Such methods ignore the structural features of the nodes and do not consider the complex relationships among them. Other methods target several main characteristics of a group, such as synchronicity, mildness and dispersion, and detect group anomalies by calculating certain indices. These methods lack generality in practice: specific indices must be designed for each network or data set to complete the detection task well, and detection accuracy drops sharply when a method is transferred to other data. In addition, such methods consider only features within the group and still fail to account for the complex relationships between reviewers.
Disclosure of Invention
In existing false reviewer group detection methods, the embedding process is separated from the subsequent clustering and detection process, so training lacks target guidance: if the learned representation is not suited to detection, the resulting detection is poor. In addition, the complex relationships in the review network are ignored, and the valuable relationship information among reviewers in a group cannot be used.
To address these problems in the prior art, the invention provides a false reviewer group detection method fusing complex relationships, used to detect false reviewer groups on online selling platforms. The method is goal-directed: starting from the complex relationship features of the nodes, it learns a complex relationship representation of each node and reconstructs the topological information of the graph with an autoencoder. To integrate the embedding process with the clustering and detection process, the method trains the model in a self-supervised manner and uses the clustering and detection results to guide the optimization of the model.
To achieve this purpose, the invention adopts the following technical scheme: a false reviewer group detection method fusing complex relationships, which uses a graph neural network based on an attention mechanism to update the representations of the nodes in the review network; designs a graph reconstruction loss and a self-supervised distribution loss to train the model; and, after the optimal model is obtained, applies it to false reviewer group detection to identify the false reviewer groups in the review network. The specific steps are as follows:
First, the node representation is updated to obtain a reconstructed graph. The model extracts the adjacency matrix and attribute matrix of the review network and derives the complex relation matrix from the adjacency matrix. The attention encoder then fuses the complex relationships into the message-passing process, effectively encoding the high-order structure information and node attribute information of the network, and updates the node representations. A graph neural network based on an attention mechanism is used as the encoder; the initial features of the nodes are taken as their initial embeddings, and the complex relationships of the nodes are fused into the attention-based graph neural network so that the node representations express high-order structural features and attribute features at the same time;
1.1) calculating node similarity; to simplify the calculation and reduce the number of model parameters, the calculation is restricted to the first-order neighbors of the central node, and the formula is as follows:
$$c_{ij} = a(Wh_i, Wh_j) \tag{1}$$
where $c_{ij}$ represents the importance of node j to node i; W represents a weight matrix; $h_i$ and $h_j$ represent the feature vectors of node i and node j respectively; and a represents a function that computes node similarity;
1.2) calculating a complex relation matrix; the review network has complex structural relationships, and the complex relationships among its nodes contain valuable information. The complex relation matrix is obtained by considering the high-order neighbors of each node:
$$M = (B + B^2 + \cdots + B^t)/t \tag{2}$$
where B is the transition matrix: $B_{ij} = 1/d_i$ when an edge exists between node i and node j, where $d_i$ is the degree of node i, and $B_{ij} = 0$ when there is no edge between them; M is the complex relation matrix, and $M_{ij}$ is the complex relationship between node i and node j up to order t;
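The complex relation matrix can be computed directly from an adjacency matrix by accumulating powers of the transition matrix. The sketch below is only illustrative: the function name `complex_relation_matrix` and the handling of isolated nodes (their row of B is left at zero) are assumptions, not details given in the source.

```python
import numpy as np

def complex_relation_matrix(adj, t):
    """Average of the first t powers of the row-normalized transition matrix B."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    # B_ij = 1/d_i on edges and 0 elsewhere; isolated nodes keep an all-zero row.
    B = np.divide(adj, deg[:, None], out=np.zeros_like(adj), where=deg[:, None] > 0)
    M = np.zeros_like(B)
    power = np.eye(len(B))
    for _ in range(t):
        power = power @ B      # B, B^2, ..., B^t in turn
        M += power
    return M / t
```

Because every power of a row-stochastic matrix is row-stochastic, each row of M sums to 1 for nodes that have at least one neighbor, so M can be read as an averaged multi-step reachability profile.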
1.3) fusing complex relationships; a single-layer feedforward neural network is used as the calculation mode, and the complex relation matrix M is fused with the attention-based graph neural network. Specifically, the complex relation matrix is multiplied with the node similarity, so that the similarity between two nodes reflects not only the similarity between their representations but also the influence of the complex relationship between them; LeakyReLU is selected as the activation function to add nonlinearity to the model and thereby enhance its feature expression capability. After the complex relationships are fused, the importance of node j to node i is rewritten as follows:
$$c_{ij} = \mathrm{LeakyReLU}\big(M_{ij} \cdot a(Wh_i, Wh_j)\big) \tag{3}$$
1.4) updating the node representation; the softmax function normalizes the importance of the neighbor nodes so that the importance of each first-order neighbor to the central node lies in [0, 1], and the features of the neighbor nodes are aggregated to update the node representation;
$$\alpha_{ij} = \mathrm{softmax}_j(c_{ij}) = \frac{\exp(c_{ij})}{\sum_{k \in N_i} \exp(c_{ik})} \tag{4}$$
$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W h_j^{(l)}\Big) \tag{5}$$
In formula (4), $\alpha_{ij}$ represents the normalized attention coefficient and $N_i$ represents the first-order neighbor set of node i;
in formula (5), $h_j^{(l)}$ is the representation at layer l of neighbor node j of node i, $h_i^{(l+1)}$ is the representation of node i at layer l+1, and σ is an activation function; the final representation of a node is obtained by multi-layer aggregation;
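Steps 1.1 to 1.4 can be put together as a single attention layer. The following NumPy sketch makes several assumptions that the source does not state: the similarity function a is taken to be the usual graph-attention form (a weight vector `a` applied to the concatenation of the two projected feature vectors), the output activation is `tanh`, and every node is assumed to have at least one neighbor. The name `attention_layer` is hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_layer(H, adj, M, W, a):
    """One aggregation layer: relation-weighted attention over first-order neighbors."""
    Z = H @ W                                   # projected features W h_i, shape (n, d_out)
    d = Z.shape[1]
    # a^T [W h_i || W h_j], split into a part for node i and a part for node j.
    raw = (Z @ a[:d])[:, None] + (Z @ a[d:])[None, :]
    # Fuse the complex relation M_ij with the similarity. Since M_ij >= 0, scaling
    # inside or outside LeakyReLU gives the same value.
    scores = leaky_relu(M * raw)
    # Softmax restricted to first-order neighbors (assumes each node has a neighbor).
    scores = np.where(adj > 0, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    alpha = weights / weights.sum(axis=1, keepdims=True)
    # Aggregate neighbor features; tanh is an assumed activation.
    return alpha, np.tanh(alpha @ Z)
```

Stacking several such layers, each fed with the previous layer's output, corresponds to the multi-layer aggregation mentioned in the text.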
Second, the model is trained. The model first uses the topology information reconstructed by the encoder to calculate the first part of the loss, which is the difference between the original and the reconstructed adjacency matrices. The second part of the loss is obtained by self-supervised training: the model determines core points in the review network with the DBSCAN clustering algorithm, calculates the distances between all nodes and the core points, and uses the KL divergence as the loss. The final loss function combines the two parts and is used to train the model jointly. After the loss is calculated, the model parameters are updated with a gradient descent algorithm to complete training.
A graph reconstruction loss function and a self-supervised distribution loss function are designed, and the parameters of the attention-based graph neural network model are updated to complete training. The specific steps are as follows:
2.1) calculating a graph reconstruction loss function; the difference between the adjacency matrices is calculated from the graph topology reconstructed by the encoder, giving the reconstruction loss between the reconstructed graph and the original graph; the formula is as follows:
$$\hat{A} = \sigma(HH^{\top}) \tag{6}$$
where $\hat{A}$ is the reconstructed adjacency matrix; H is the updated node representation matrix; σ is an activation function;
in the training process, cross entropy is adopted as the loss function:
$$\mathrm{loss}(y, \hat{y}) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big) \tag{7}$$
where y represents the value of an element in the adjacency matrix and $\hat{y}$ represents the corresponding element in the reconstructed adjacency matrix. This part of the training requires minimizing the reconstruction loss, which is defined as follows:
$$L_r = \sum_{i,j} \mathrm{loss}(y_{ij}, \hat{y}_{ij}) \tag{8}$$
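The reconstruction step can be sketched as an inner-product decoder followed by a binary cross-entropy. In this illustration the activation is assumed to be the logistic sigmoid, and the loss is averaged over all matrix entries rather than summed, which only rescales it; the function names are hypothetical.

```python
import numpy as np

def reconstruct_adjacency(H):
    """Inner-product decoder: sigmoid(H H^T), one probability per node pair."""
    return 1.0 / (1.0 + np.exp(-(H @ H.T)))

def reconstruction_loss(A, A_hat, eps=1e-12):
    """Mean binary cross-entropy between original and reconstructed adjacency matrices."""
    A_hat = np.clip(A_hat, eps, 1.0 - eps)      # avoid log(0)
    return float(-np.mean(A * np.log(A_hat) + (1.0 - A) * np.log(1.0 - A_hat)))
```

An all-zero embedding reconstructs every entry as 0.5, so its loss equals ln 2; any embedding that places connected nodes close together and unconnected nodes apart should score below that baseline.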
2.2) calculating a self-supervised distribution loss function; one challenge of false review detection is training a model without label guidance; the model is therefore trained in a self-supervised manner, using pseudo-labels to optimize the node embeddings; the nodes are first clustered, and the model uses the K-Means algorithm for this:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 \tag{9}$$
where $\mu_i$ is the mean of all nodes in cluster $S_i$ and k is the number of clusters.
After all the groups are obtained, the DBSCAN clustering algorithm is used to determine the core points in the review network, and the distance distribution between each node and the core points is calculated;
during training, the distribution of the data must be learned continuously to distinguish normal nodes from abnormal nodes; $p_{iu}$ represents the pseudo-labels calculated by the model, and $q_{iu}$ represents the distance distribution between the representations of all nodes and the core points detected by DBSCAN. $q_{iu}$ is defined as follows:
$$q_{iu} = \frac{(1 + \lVert z_i - u_u \rVert^2)^{-1}}{\sum_{k} (1 + \lVert z_i - u_k \rVert^2)^{-1}} \tag{10}$$
where $u_u$ is the representation of a core point detected by DBSCAN; $z_i$ is the representation of the node currently being processed; and $u_k$ is the representation of the core point of the k-th class. The formula measures the distance between the representation of a node and that of a core point: if the node is close enough to the core point, it is considered to belong to that group and to be a normal node; if the node is far from the core points, it is regarded as an outlier, i.e. a member of a false reviewer group. The node label is obtained by the following formula:
$$S_i = \underset{u}{\arg\max}\; q_{iu} \tag{11}$$
the KL divergence is used as the loss function to measure the difference between the node-to-core-point distance distribution and its pseudo-labels;
the KL divergence measures the difference between a probability distribution Q and a reference probability distribution P. Unlike the label obtained in formula (11), the target distribution $p_{iu}$ is treated as the true label. It is calculated from Q during training, so P depends on the current Q and is updated in stages; within each stage, P serves as the self-supervised label. The main function of the target distribution is to supervise the learning of the model and guide the updating of Q. P is calculated as follows:
$$p_{iu} = \frac{q_{iu}^2 / \sum_i q_{iu}}{\sum_k \big(q_{ik}^2 / \sum_i q_{ik}\big)} \tag{12}$$
where $q_{ik}$ represents the distance distribution between the representations of all nodes and the core point of the k-th class. The loss function for the self-supervised embedding optimization is as follows:
$$L_c = KL(P \,\|\, Q) = \sum_i \sum_u p_{iu} \log \frac{p_{iu}}{q_{iu}} \tag{13}$$
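The self-supervised part of the loss can be illustrated as follows. This is a minimal sketch under stated assumptions: Q is taken to be a Student's t kernel over squared distances between node embeddings and core-point embeddings, and P is obtained by squaring and renormalizing Q, which is the form commonly used in deep embedded clustering. The function names and the `centers` argument (the core-point embeddings) are hypothetical.

```python
import numpy as np

def soft_assignment(Z, centers):
    """q_iu: similarity of embedding z_i to each core-point embedding, rows sum to 1."""
    sq_dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    kernel = 1.0 / (1.0 + sq_dist)
    return kernel / kernel.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """p_iu: sharpened version of Q used as the self-supervised target."""
    weight = Q ** 2 / Q.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_loss(P, Q, eps=1e-12):
    """KL(P || Q), summed over all nodes and core points."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

def node_labels(Q):
    """Hard label per node: index of the core point with the largest q_iu."""
    return Q.argmax(axis=1)
```

In a training loop P would typically be recomputed from Q only every few iterations and held fixed in between, so that it can act as a stable target while the embeddings are updated.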
2.3) calculating a joint loss function; the joint loss function expression is:
$$L = L_r + \beta L_c \tag{14}$$
where $L_r$ is the graph reconstruction loss function, $L_c$ is the self-supervised distribution loss function, and β is the weight between the two loss functions;
2.4) model training; the initial parameters of the attention-based graph neural network model are set, and the training process is iterated on the joint loss function to obtain the optimal parameters of the model;
Third, false reviewer groups are detected; the attention-based graph neural network model obtained in the second step is applied to the real review network, and the detection results are saved.
The graph reconstruction loss function adopts a cross entropy loss function; the clustering algorithm for clustering the nodes adopts a KMeans clustering algorithm.
The specific method for the model training in 2.4) is as follows:
the initial parameters of the attention-based graph neural network model are set, including the number of aggregation layers, the node embedding dimension, the number of clusters for the KMeans clustering algorithm, the number of training iterations, and so on;
the parameters are adjusted continually during model training, and the optimal parameters are determined from the decrease of the joint loss function during training or from the final detection results of the model;
specifically: the review network and its adjacency matrix are input into the model, the model is run and trained, and its detection performance is recorded after training; training is repeated several times under the same set of hyper-parameters, and the average detection precision is taken as the final detection precision; after training under one set of parameters is complete, the parameters are adjusted by the control variable method: one parameter is adjusted in the direction that increases the average precision while the others are kept unchanged; the adjustment is repeated, the parameter setting that gives the highest average detection precision is retained, and model training is complete.
The beneficial effects of the invention are as follows: the method can identify false reviewers and clearly distinguish false reviewer groups from normal reviewers. It is based on the complex relationship features of the nodes, makes full use of the valuable relationship information among reviewers, and integrates the embedding process with the clustering and detection process to obtain a goal-directed false reviewer group detection model, overcoming the poor generality and low detection performance of existing group detection methods.
Drawings
FIG. 1 is a basic framework diagram of the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a graph of recall rate changes during training in accordance with an embodiment of the present invention;
FIG. 4 is a graph of the variation of the loss function during training according to an embodiment of the present invention;
FIG. 5 is a visualization diagram of the population detection result according to an embodiment of the present invention.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
A false reviewer group detection method fusing complex relationships comprises three stages: updating the node representation; training the model; and detecting false reviewer groups.
In the first step, the node representation is updated. In this stage, the attention-based graph neural network is used as the encoder, the initial features of the nodes serve as their initial embeddings, and the complex relationships of the nodes are fused into the attention-based graph neural network, so that the node representations can express both high-order structural features and attribute features.
1.1) calculating node similarity. To simplify the calculation and reduce the number of model parameters, the calculation is restricted to the one-hop neighbors of the central node, and the formula is as follows:
$$c_{ij} = a(Wh_i, Wh_j) \tag{1}$$
where $c_{ij}$ represents the importance of node j to node i; W represents a weight matrix; $h_i$ and $h_j$ represent the feature vectors of node i and node j respectively; and a represents a function that computes node similarity.
1.2) calculating a complex relation matrix. The review network has complex structural relationships, and the complex relationships among its nodes contain valuable information. By considering the higher-order neighbors of a node, the complex relation matrix can be obtained:
$$M = (B + B^2 + \cdots + B^t)/t \tag{2}$$
where B is the transition matrix: $B_{ij} = 1/d_i$ if there is an edge between node i and node j, where $d_i$ is the degree of node i, and $B_{ij} = 0$ when there is no edge between them. M is the complex relation matrix, and $M_{ij}$ is the complex relationship between node i and node j up to order t.
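As a small worked illustration of formula (2), consider a hypothetical path graph with three nodes and edges 1-2 and 2-3 (this graph is an example chosen here, not one from the source). With degrees $d_1 = d_3 = 1$ and $d_2 = 2$, and taking t = 2:

```latex
B = \begin{pmatrix} 0 & 1 & 0 \\ \tfrac12 & 0 & \tfrac12 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
B^2 = \begin{pmatrix} \tfrac12 & 0 & \tfrac12 \\ 0 & 1 & 0 \\ \tfrac12 & 0 & \tfrac12 \end{pmatrix},
\qquad
M = \frac{B + B^2}{2}
  = \begin{pmatrix} \tfrac14 & \tfrac12 & \tfrac14 \\ \tfrac14 & \tfrac12 & \tfrac14 \\ \tfrac14 & \tfrac12 & \tfrac14 \end{pmatrix}
```

Nodes 1 and 3 share no edge, so $B_{13} = 0$, yet $M_{13} = 1/4$: the second-order term records that they are connected through node 2, which is exactly the kind of higher-order relationship the matrix is meant to capture.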
1.3) fusing complex relationships. A single-layer feedforward neural network is selected as the calculation mode, and the complex relation matrix M is fused with the graph attention network: the complex relation matrix is multiplied by the similarity between nodes, so that the similarity between two nodes reflects not only the similarity between their representations but also the influence of the complex relationship between them. Finally, LeakyReLU is selected as the activation function to add nonlinearity to the model and thereby enhance its feature expression capability. After the complex relationships are fused, the importance of node j to node i is rewritten as follows:
$$c_{ij} = \mathrm{LeakyReLU}\big(M_{ij} \cdot a(Wh_i, Wh_j)\big) \tag{3}$$
1.4) updating the node representation. So that the importance of each neighbor node to the central node lies in [0, 1], the importance of the neighbor nodes is normalized with the softmax function, and the features of the neighbor nodes are aggregated to update the node representation.
$$\alpha_{ij} = \mathrm{softmax}_j(c_{ij}) = \frac{\exp(c_{ij})}{\sum_{k \in N_i} \exp(c_{ik})} \tag{4}$$
$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W h_j^{(l)}\Big) \tag{5}$$
In formula (4), $\alpha_{ij}$ represents the normalized attention coefficient and $N_i$ represents the first-order neighbor set of node i. In formula (5), $h_j^{(l)}$ is the representation at layer l of neighbor node j of node i, $h_i^{(l+1)}$ is the representation of node i at layer l+1, and σ is an activation function. The final representation of a node is obtained by multi-layer aggregation.
In the second step, the model is trained. A loss function is designed first; after the loss is calculated with it, the model parameters are updated to complete training. The model first reconstructs the original network with a decoder and calculates the adjacency matrix difference loss between the original network and the reconstructed network. Because the nodes in the false reviewer group detection task have no labels, the embedding is optimized by self-supervised training: the core points in the review network are generated with the DBSCAN clustering algorithm, the distances between the core points and the other nodes are measured, and the KL divergence is used to calculate the difference between the pseudo-labels and the learned embedding distribution. After the loss calculation is complete, the gradient descent algorithm is used to update the model parameters, completing the training.
2.1) calculating the graph reconstruction loss. The original graph is reconstructed by an inner product, and the reconstruction formula is as follows:
$$\hat{A} = \sigma(HH^{\top}) \tag{6}$$
where H is the learned node embedding matrix, σ is an activation function, and $\hat{A}$ is the adjacency matrix of the reconstructed graph. The goal is to make the reconstructed adjacency matrix $\hat{A}$ as similar as possible to the input adjacency matrix. In the training process, cross entropy is adopted as the loss function:
$$\mathrm{loss}(y, \hat{y}) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big) \tag{7}$$
where y represents the value of an element in the adjacency matrix and $\hat{y}$ represents the corresponding element in the reconstructed adjacency matrix. This part of the training requires minimizing the reconstruction loss, which is defined as follows:
$$L_r = \sum_{i,j} \mathrm{loss}(y_{ij}, \hat{y}_{ij}) \tag{8}$$
2.2) calculating the distribution loss. One challenge of false review detection is training a model without label guidance. The model is therefore trained in a self-supervised manner and uses pseudo-labels to optimize the node embeddings. Because the nodes in the graph are independent, all the nodes are clustered first during training, and the model uses the K-Means algorithm for this:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 \tag{9}$$
where $\mu_i$ is the mean of all nodes in cluster $S_i$ and k is the number of clusters. After all the groups are obtained, the DBSCAN algorithm is used to detect abnormal groups. The DBSCAN algorithm first distinguishes core points from boundary points in the graph; the detected core points are used as the core points in the training model, and the distances between the representations of the other nodes and the representations of the core points are calculated. During training, the distribution of the data must be learned continuously to distinguish normal nodes from abnormal nodes; $p_{iu}$ represents the pseudo-labels calculated by the model, and $q_{iu}$ represents the distance distribution between the representations of all nodes and the core points detected by DBSCAN. $q_{iu}$ is defined as follows:
$$q_{iu} = \frac{(1 + \lVert z_i - u_u \rVert^2)^{-1}}{\sum_{k} (1 + \lVert z_i - u_k \rVert^2)^{-1}} \tag{10}$$
where $u_u$ is the representation of a core point detected by DBSCAN, $z_i$ is the representation of the node currently being processed, and $u_k$ is the representation of the core point of the k-th class. The formula measures the distance between the representation of a node and that of a core point: if the node is close enough to the core point, it is considered to belong to that group and to be a normal node; if the node is far from the core points, it is regarded as an outlier, i.e. a member of a false reviewer group. The node label is obtained by the following formula:
$$S_i = \underset{u}{\arg\max}\; q_{iu} \tag{11}$$
the model adopts KL divergence to measure the difference between the pseudo-marker and the learned distribution, and the KL divergence mainly measures the difference between the probability distribution Q and the reference probability distribution P. Unlike the label obtained in equation (11), the target distribution piuConsidered as a true label, is calculated by Q in the training process, piuThe P distribution is relied on and updated according to the phase, and the P distribution is regarded as an automatic supervision label in the phase. The main function of the target distribution is to supervise the learning of the model and guide the updating of the distribution Q. The formula for P is as follows:
Figure BDA0003618158730000102
The loss function used to optimize the embedding in a self-supervised manner is as follows:
Figure BDA0003618158730000103
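The relationship between Q, the target distribution P, and the KL-divergence loss can be illustrated with a short sketch. The formulas for P and for the loss are only available as images, so the form below is an assumption: it uses the sharpening rule common in deep embedded clustering, $p_{iu} \propto q_{iu}^2 / \sum_i q_{iu}$, and the loss $\sum_i \sum_u p_{iu} \log(p_{iu}/q_{iu})$. Function names and the toy values of `q` are hypothetical.

```python
import math


def target_distribution(q):
    """Compute P from Q (assumed form: square q, divide by the soft
    cluster frequency, then renormalize each row)."""
    n_clusters = len(q[0])
    freq = [sum(row[u] for row in q) for u in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[u] ** 2 / freq[u] for u in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all nodes and core points."""
    return sum(
        pi * math.log(pi / qi)
        for p_row, q_row in zip(p, q)
        for pi, qi in zip(p_row, q_row)
        if pi > 0
    )


# Hypothetical soft assignments of three nodes to two core points.
q = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
p = target_distribution(q)
loss_c = kl_divergence(p, q)
```

Under this assumed form, P puts more weight than Q on each node's most likely group, so minimizing the loss pulls Q toward more confident assignments.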
2.3) Calculating the joint loss function. The overall loss function of the model consists of the graph reconstruction loss function and the self-supervised distribution loss function. The final loss function is:
$$L = L_r + \beta L_c \tag{14}$$
In the formula, $L_r$ is the reconstruction loss, $L_c$ is the distribution loss, and $\beta$ controls the relative weight of the two losses.
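A small sketch of how the two terms of formula (14) might be combined. The cross-entropy form of the reconstruction loss follows claim 2; the inner-product decoder $\sigma(HH^T)$ with a sigmoid is an assumption, as are the function names, the toy embeddings, and the values chosen for $\beta$ and the distribution loss.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reconstruct_adjacency(h):
    """Assumed decoder: entry (i, j) is sigmoid(h_i . h_j)."""
    return [[sigmoid(sum(a * b for a, b in zip(hi, hj))) for hj in h]
            for hi in h]


def reconstruction_loss(adj, adj_hat, eps=1e-12):
    """Mean binary cross-entropy between original and reconstructed graph."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(adj_hat[i][j], eps), 1.0 - eps)
            total -= adj[i][j] * math.log(p) + (1 - adj[i][j]) * math.log(1.0 - p)
    return total / (n * n)


def joint_loss(loss_r, loss_c, beta):
    """Formula (14): L = L_r + beta * L_c."""
    return loss_r + beta * loss_c


# Two connected nodes; embeddings that agree with the edge vs. ones that do not.
adj = [[0, 1], [1, 0]]
h_good = [[1.0, 1.0], [1.0, 1.0]]
h_bad = [[1.0, 1.0], [-1.0, -1.0]]
loss_good = reconstruction_loss(adj, reconstruct_adjacency(h_good))
loss_bad = reconstruction_loss(adj, reconstruct_adjacency(h_bad))
total = joint_loss(loss_good, loss_c=0.05, beta=10.0)  # hypothetical values
```

Embeddings that place connected nodes close together give the lower reconstruction loss, and $\beta$ scales how strongly the distribution loss contributes to the total.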
2.4) Model training. The model is trained as follows. First, the initial hyper-parameters are set, including the number of aggregation layers of the graph attention network, the node embedding dimension, the number of clusters for the KMeans clustering algorithm, and the number of training iterations.
During training, the hyper-parameters are adjusted manually so that the detection effect of the model is optimal. In general, they are chosen according to how the loss function decreases during training or according to the final detection result of the model. After the hyper-parameters are set, the comment network and its adjacency matrix are input into the model, the model is run until training finishes, and the detection performance after training is recorded. This process is repeated several times under the same set of hyper-parameters, and the average detection precision is taken as the final detection precision. After training under one set of hyper-parameters is completed, the hyper-parameters are adjusted by the controlled-variable method: one hyper-parameter is adjusted in the direction that increases the average precision while the others are kept unchanged. The adjustment is repeated, the set of hyper-parameters that gives the highest average detection precision is retained, and the model training is finished.
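The tuning procedure described above (repeat each setting several times, average the result, change one hyper-parameter at a time) can be sketched as a simple search loop. This automates what the text describes as a manual process, so it is only an illustration; `fake_train_and_evaluate` is a hypothetical stand-in for a real training run, and the parameter names and candidate values are assumptions.

```python
import random
from statistics import mean


def average_score(train_and_evaluate, params, repeats=3):
    """Run training `repeats` times with the same hyper-parameters and
    return the mean detection score."""
    return mean(train_and_evaluate(params, seed) for seed in range(repeats))


def controlled_variable_search(train_and_evaluate, init_params, candidates,
                               repeats=3):
    """Change one hyper-parameter at a time, keeping the others fixed, and
    keep a change only if it raises the average score."""
    best = dict(init_params)
    best_score = average_score(train_and_evaluate, best, repeats)
    improved = True
    while improved:
        improved = False
        for name, values in candidates.items():
            for value in values:
                if value == best[name]:
                    continue
                trial = dict(best, **{name: value})
                score = average_score(train_and_evaluate, trial, repeats)
                if score > best_score:
                    best, best_score, improved = trial, score, True
    return best, best_score


def fake_train_and_evaluate(params, seed):
    """Hypothetical stand-in for one training run; the score peaks at
    layers=2, dim=16, plus a small seed-dependent perturbation."""
    noise = random.Random(seed).uniform(-0.001, 0.001)
    return (1.0
            - 0.1 * abs(params["layers"] - 2)
            - 0.01 * abs(params["dim"] - 16) / 16
            + noise)


best, best_score = controlled_variable_search(
    fake_train_and_evaluate,
    init_params={"layers": 1, "dim": 8},
    candidates={"layers": [1, 2, 3], "dim": [8, 16, 32]},
)
```

The loop stops once a full pass over all hyper-parameters yields no improvement in the averaged score.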
Third, detecting false comment groups. The real comment network is examined with the model and hyper-parameters obtained in the previous step, and the detection results of the model on the comment network are stored.
Table 1. Algorithm procedure
Figure BDA0003618158730000111
Based on the technical scheme of the present invention, the experimental analysis is as follows:
The invention verifies the detection of false comment groups on an Amazon dataset preprocessed by researchers; the basic information of the dataset is shown in Table 2. In the table, the relationship type U-P-U means that two users have reviewed at least one common product, U-S-U means that two reviewers gave the same rating within one week, and U-V-U means that two reviewers wrote similar comments. The experiment was performed on four datasets: one for each of the three relationships and one that combines all three. They are named Amazon_p, Amazon_s, Amazon_v, and Amazon, respectively.
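As an illustration of how a relationship such as U-P-U defines edges between users, the sketch below links every pair of users who reviewed at least one common product. The record format `(user, product)` and the function name are hypothetical; the actual preprocessing of the Amazon dataset is not described in this text.

```python
from collections import defaultdict
from itertools import combinations


def build_upu_edges(reviews):
    """Return the set of user pairs that share at least one reviewed product.

    reviews: iterable of (user_id, product_id) pairs (assumed format).
    """
    users_by_product = defaultdict(set)
    for user, product in reviews:
        users_by_product[product].add(user)

    edges = set()
    for users in users_by_product.values():
        # Every pair of users of the same product gets one undirected edge.
        for u, v in combinations(sorted(users), 2):
            edges.add((u, v))
    return edges


reviews = [("u1", "p1"), ("u2", "p1"), ("u3", "p2"), ("u1", "p2")]
edges = build_upu_edges(reviews)
```

The U-S-U and U-V-U relationships could be built the same way by grouping on (rating, week) or on a comment-similarity test instead of on the product.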
Table 2. Basic information of the false comment datasets used in the experiment
Figure BDA0003618158730000121
The experimental analysis of the false reviewer group detection method fusing complex relationships is divided into two parts. First, the method is compared with existing false comment group detection methods, using the recall rate as the evaluation index to verify its superiority. Second, the training process and the detection results are visualized, so that the rationality of the model design and the effectiveness of the detection can be analyzed more intuitively.
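For reference, the recall rate used as the evaluation index is the fraction of truly false reviewers that the method detects. A minimal sketch, assuming binary labels where 1 marks a false reviewer (the label encoding and function name are assumptions):

```python
def recall(true_labels, predicted_labels, positive=1):
    """Recall = TP / (TP + FN) for the positive (false reviewer) class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Three false reviewers in the ground truth, two of them detected.
score = recall([1, 1, 0, 1], [1, 0, 0, 1])
```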
(1) Test result comparison experiment
Several false comment group detection methods proposed by researchers are compared with the present method. Graph-developer uses a graph-based approach to find target items, detects groups of false reviewers on that basis, and solves the group detection problem on a 2-hop graph. Collueage uses a Markov random field to detect colluding false reviewers and false comment activity. DeFrauder detects candidate fraud groups using the product review graph combined with behavioral signals, maps the candidate groups into an embedding space, assigns a score to each group, and finally determines the false reviewer groups according to the scores. In addition to these comparison methods, two decoupled detection methods are added to the experiment to verify the effectiveness of the modules in the present method: GCN + KMeans + DBSCAN and GAT + KMeans + DBSCAN. The first embeds the initial dataset with GCN and the second with GAT; in both cases, once the embeddings are obtained, they are processed by the KMeans clustering method and the DBSCAN method.
The experimental results of the present method and the comparison methods are shown in Table 3. Comparing the results vertically, across methods, the present method clearly outperforms the others and greatly improves the detection effect. GAT + KMeans + DBSCAN outperforms GCN + KMeans + DBSCAN, which demonstrates the effectiveness of GAT as the graph encoder: compared with GCN, GAT aggregates neighbor features according to the similarity between the central node and its neighbors, so the characterization of a normal node does not absorb a large amount of information from false nodes. Comparing the results horizontally, across datasets, the present method obtains the best result under each of the three relationships and under all relationships combined. This indicates that fusing the KMeans clustering algorithm into the deep learning model and iteratively updating the core points during training yields more accurate detection results.
Table 3. Experimental results
Figure BDA0003618158730000131
(2) Visual experiment of training process and detection result
The visualization experiment demonstrates the rationality of the method's design by analyzing how the loss and the recall rate change during training, and shows the effectiveness of the detection intuitively by visualizing the detection results.
Fig. 3 shows the change of the recall rate during training. Overall, the recall rate of the detection results improves continuously as training proceeds, which verifies the rationality of the model design.
The change of the loss function during training is shown in Fig. 4. Taking Fig. 3 and Fig. 4 together, the recall rate improves as the loss function decreases, which indicates that the representation learned by the method remains suitable for false reviewer group detection while it is continuously updated. Conversely, the loss function designed in the method feeds the loss back to the model effectively and supervises its learning, which addresses the problem that a learned representation may not suit the detection method.
Fig. 5 shows the clustering result of the model on the Amazon dataset; the method performs well on the false reviewer group detection problem. Black points represent the false comment groups, concentrated mainly in the lower left, and gray points represent normal comment nodes, concentrated mainly in the upper right.
The above embodiments only express implementations of the present invention and should not be construed as limiting the scope of the patent. It should be noted that those skilled in the art can make many variations and modifications without departing from the concept of the present invention, and all of these fall within the protection scope of the present invention.

Claims (3)

1. A false reviewer group detection method fusing complex relationships, characterized in that the method uses a graph neural network based on an attention mechanism to update the representations of comment nodes in a comment network; a graph reconstruction loss and a self-supervised distribution loss are designed for model training to obtain an optimal model, and the optimal model is applied to false reviewer group detection to identify the false reviewer groups in the comment network; the specific steps are as follows:
first, updating the node representations to obtain a reconstructed graph; a graph neural network based on an attention mechanism is used as the encoder; the initial features of the nodes are taken as their initial embeddings, and the complex relationships of the nodes are fused into the attention-based graph neural network, so that the node representations express high-order structural features and attribute features at the same time;
1.1) calculating node similarity; the calculation is restricted to the first-order neighbor nodes of the central node, and the formula is:
$$c_{ij} = a(Wh_i, Wh_j) \tag{1}$$
in the formula, $c_{ij}$ represents the importance of node $j$ to node $i$; $W$ represents a weight matrix; $h_i$ and $h_j$ represent the feature vectors of node $i$ and node $j$, respectively; $a$ represents a function for computing node similarity;
1.2) calculating the complex relation matrix; the complex relation matrix of the nodes is obtained by considering their high-order neighbor nodes:
$$M = (B + B^2 + \dots + B^t)/t \tag{2}$$
where $B$ is the transition matrix: $B_{ij} = 1/d_i$ when an edge exists between node $i$ and node $j$, where $d_i$ is the degree of node $i$, and $B_{ij} = 0$ when there is no edge between node $i$ and node $j$; the matrix $M$ is the complex relation matrix, and $M_{ij}$ is the complex relationship between node $i$ and node $j$ up to order $t$;
1.3) fusing complex relationships; fusing a complex relation matrix M and a graph neural network based on an attention mechanism by taking a single-layer feedforward neural network as a calculation mode, specifically multiplying the complex relation matrix by the similarity of nodes; and selecting LeakyReLU as an activation function, fusing the complex relationship, and rewriting the importance expression of the node j to the node i into:
Figure FDA0003618158720000011
1.4) updating the node representations; the softmax function is used to normalize the importance of the neighbor nodes, so that the importance of each first-order neighbor node to the central node lies in [0, 1], and the features of the neighbor nodes are aggregated to update the node representation;
Figure FDA0003618158720000021
Figure FDA0003618158720000022
in formula (4), $\alpha_{ij}$ represents the normalized attention coefficient; $N_i$ represents the first-order neighbor set of node $i$;
in formula (5), $h_j^{(l)}$ is the representation of neighbor node $j$ of node $i$ at layer $l$, and $h_i^{(l+1)}$ is the representation of node $i$ at layer $l+1$; the final representation of a node is obtained by multi-layer aggregation;
second, training the model; a graph reconstruction loss function and a self-supervised distribution loss function are designed, the parameters of the attention-based graph neural network model are updated, and the training is completed, specifically as follows:
2.1) calculating the graph reconstruction loss function; the difference between the adjacency matrices is calculated from the topology of the graph reconstructed by the encoder, giving the reconstruction loss between the reconstructed graph and the original graph; the formula is:
Figure FDA0003618158720000025
in the formula, the symbol shown in Figure FDA0003618158720000026 is the adjacency matrix; $H$ is the updated node characterization matrix; $\sigma$ is an activation function;
2.2) calculating the self-supervised distribution loss function; a self-supervised training mode is adopted, and pseudo-labels are used to optimize the node embeddings; the nodes are clustered with a clustering algorithm, the core points in the comment network are determined with the DBSCAN clustering algorithm, and the distance distribution between each node and the core points is calculated; the KL divergence is used as the loss function to measure the difference between the node-to-core-point distance distribution and its pseudo-labels;
2.3) calculating the joint loss function; the joint loss function is:
$$L = L_r + \beta L_c \tag{7}$$
in the formula, $L_r$ is the graph reconstruction loss function, $L_c$ is the self-supervised distribution loss function, and $\beta$ is the weight between the two loss functions;
2.4) model training; the initial parameters of the attention-based graph neural network model are set, and the training process is iterated on the joint loss function to obtain the optimal parameters of the attention-based graph neural network model;
thirdly, detecting a false comment group; and detecting the real comment network by adopting the attention-based graph neural network model obtained in the second step, and storing the detection result.
2. The false reviewer group detection method fusing complex relationships according to claim 1, wherein the graph reconstruction loss function is a cross-entropy loss function, and the clustering algorithm used to cluster the nodes is the KMeans clustering algorithm.
3. The false reviewer group detection method fusing complex relationships according to claim 2, wherein the model training in 2.4) is as follows:
setting the initial parameters of the attention-based graph neural network model, including the number of aggregation layers, the node embedding dimension, the number of clusters of the KMeans clustering algorithm, and the number of training iterations;
continuously adjusting the parameters during model training, and determining the optimal parameters according to the decrease of the joint loss function during training or the final detection result of the model;
specifically: inputting the comment network and its adjacency matrix into the model, running and training the model, and recording the detection performance after training; repeating the training several times under the same set of hyper-parameters and taking the average detection precision as the final detection precision; after training under one set of parameters is completed, adjusting the parameters by the controlled-variable method, in which one parameter is adjusted in the direction that increases the average precision while the other parameters are kept unchanged; and repeating the parameter adjustment, retaining the set of parameters that gives the highest average detection precision, and finishing the model training.
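Steps 1.2) to 1.4) of claim 1 can be illustrated with a small pure-Python sketch. Formula (2) is implemented as written. Formulas (3) to (5) are only available as images, so the attention score used here, LeakyReLU applied to $M_{ij}$ times a single-layer feed-forward score $a^T[Wh_i \,\|\, Wh_j]$, and the aggregation without an output nonlinearity are assumptions. All names, the LeakyReLU slope, and the toy values are hypothetical.

```python
import math


def transition_matrix(adj):
    """B[i][j] = 1/d_i if an edge (i, j) exists, else 0."""
    b = []
    for row in adj:
        degree = sum(1 for x in row if x)
        b.append([1.0 / degree if x else 0.0 for x in row])
    return b


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def complex_relation_matrix(adj, t):
    """Formula (2): M = (B + B^2 + ... + B^t) / t."""
    b = transition_matrix(adj)
    power, total = b, [row[:] for row in b]
    for _ in range(t - 1):
        power = matmul(power, b)
        total = [[s + p for s, p in zip(rs, rp)] for rs, rp in zip(total, power)]
    return [[v / t for v in row] for row in total]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_layer(h, adj, m, w, a):
    """One aggregation layer over first-order neighbors (assumed form)."""
    wh = [[sum(wr[k] * hi[k] for k in range(len(hi))) for wr in w] for hi in h]
    out = []
    for i in range(len(h)):
        nbrs = [j for j, x in enumerate(adj[i]) if x]
        if not nbrs:                       # isolated node: keep its own features
            out.append(wh[i])
            continue
        # Assumed formula (3): LeakyReLU(M_ij * a^T [Wh_i || Wh_j]).
        scores = [leaky_relu(m[i][j] * sum(ak * v for ak, v in zip(a, wh[i] + wh[j])))
                  for j in nbrs]
        # Formula (4): softmax over the first-order neighbors of node i.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        alphas = [e / z for e in exps]
        # Formula (5): attention-weighted sum of neighbor features.
        out.append([sum(al * wh[j][k] for al, j in zip(alphas, nbrs))
                    for k in range(len(wh[i]))])
    return out


adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]          # path graph 0 - 1 - 2
m = complex_relation_matrix(adj, t=2)
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]                     # identity weights for clarity
a = [0.5, 0.5, 0.5, 0.5]
h_next = attention_layer(h, adj, m, w, a)
```

With $t = 2$, nodes 0 and 2 obtain a non-zero entry in $M$ even though they share no edge, which is how second-order neighbors enter the complex relation matrix; stacking several such layers gives the multi-layer aggregation mentioned in the claim.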
CN202210449853.8A 2022-04-27 2022-04-27 False reviewer group detection method fusing complex relationships Pending CN114742564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210449853.8A CN114742564A (en) 2022-04-27 2022-04-27 False reviewer group detection method fusing complex relationships

Publications (1)

Publication Number Publication Date
CN114742564A true CN114742564A (en) 2022-07-12

Family

ID=82282704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210449853.8A Pending CN114742564A (en) 2022-04-27 2022-04-27 False reviewer group detection method fusing complex relationships

Country Status (1)

Country Link
CN (1) CN114742564A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116737934A (en) * 2023-06-20 2023-09-12 合肥工业大学 Naval false comment detection algorithm based on semi-supervised graph neural network
CN116737934B (en) * 2023-06-20 2024-03-22 合肥工业大学 Naval false comment detection algorithm based on semi-supervised graph neural network
CN116993433A (en) * 2023-07-14 2023-11-03 重庆邮电大学 Internet E-commerce abnormal user detection method based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination