CN104463601A

CN104463601A - Method for detecting users who score maliciously in online social media system

Info

Publication number: CN104463601A
Application number: CN201410638173.6A
Authority: CN
Inventors: 尚明生; 蔡世民; 高见; 董宇蔚
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2014-11-13
Filing date: 2014-11-13
Publication date: 2015-03-25

Abstract

The invention discloses a method for detecting users who score maliciously in an online social media system. The method is aimed at systems with rating feedback. First, users are clustered according to the scores they give to products, and a normalized user confidence is calculated. Second, the reliability of each user's ratings is calculated from the user confidence to obtain a candidate list of users who score maliciously. Finally, the candidate users are sorted by the deviation between their ratings and product quality to obtain the final list of users who score maliciously. The method has advantages in both accuracy and computational efficiency and can be applied to large-scale online social media websites.

Description

A method for detecting users who score maliciously in an online social media system

Technical field

The present invention relates to methods for detecting users who give malicious evaluations in online social media systems, and in particular to a method for detecting users who score maliciously (malicious raters) in a social media system with rating feedback.

Background art

The Internet, as a carrier of commerce, has become an indispensable tool for acquiring, transmitting and exchanging information, and the arrival of the information age has injected new vitality into Internet-based IT services. Social media attracts particular attention: it is recognized as a new economic model and a catalyst of the 21st-century world landscape, and has been called a "sunrise industry" and a "green industry". Social media is a new, networked form of economic activity that is developing at unprecedented speed and has become an effective means for countries to strengthen their competitiveness and gain an advantage in the global allocation of resources. Through social media, people no longer trade face to face, inspecting physical goods and settling with paper documents (including cash). Instead they trade over the network, relying on rich product information, a mature logistics and distribution system, and a convenient and secure payment system. With tens of thousands of online merchants and hundreds of millions of consumers in social media, establishing an effective reputation evaluation mechanism, building an environment of orderly competition, and guiding consumers toward rational consumption are especially important.

Most current reputation evaluation systems are based on users' reviews or ratings of products. By commenting on a purchased product or giving it a satisfaction score, a user expresses an opinion of, and a degree of satisfaction with, that product. This review information is a valuable resource for both producers and potential consumers. By analyzing it, producers can keep track of market conditions and consumer feedback, and potential consumers can use it as an important reference when buying. When a potential consumer decides whether to buy a product, the most direct and most important reference is usually the score the product has received and the quality of its reviews. On large social media trading platforms, most recommender systems that recommend products to potential users are likewise based on users' historical ratings and review content. If most reviews of a product are positive, a user is very likely to buy it; if most are negative, the product is hardly bought at all. In practice, some dishonest merchants, to increase their own profit, hire groups of people to post malicious reviews of certain products. The content of these reviews does not match the actual value of the products: it is either malicious praise or malicious slander. Malicious ratings and reviews reduce the reference value of review information, seriously mislead consumers' choices, and weaken the significance of normal users' ratings and reviews. Consumers gradually lose trust in the product evaluation system, which ultimately harms the entire social media industry. The authenticity and usefulness of the ratings and reviews in a reputation evaluation system are therefore of great importance to healthy competition in social media, and the importance of screening out malicious raters is self-evident.

At present there are two main ways to detect users who post fraudulent reviews or score maliciously:

The first is manual labeling. A human observes a user's ratings, review content and other reviewing behavior and judges whether the user is a fraudulent reviewer. This approach is highly subjective, and because the volume of data to be processed is large, manual methods are hard to apply to the detection of malicious evaluators in large-scale social media systems.

The second is automatic identification by computer. Typical fraudulent reviewers are labeled first, and unlabeled users are then classified by a machine learning algorithm. Two approaches are typical: one judges the similarity of users' review text in evaluations that contain text reviews, and the other calculates how far a user's ratings deviate from the true quality of the product.

For example, an article published in 2011 (A robust ranking algorithm to spamming. EPL, 94 (2011), 48002) proposes a correlation-based user reputation ranking algorithm for detecting malicious raters. The algorithm iteratively calculates user reputation values and product averages at the same time, and finally detects malicious raters according to the reputation ranking of users. In essence it computes product quality as a reputation-weighted average of users' ratings and then detects users by how far their scores deviate from the product's true quality: the larger the deviation, the more likely the user is a malicious rater. Although the method is simple, the true quality of a product is itself not measurable, and different users' satisfaction with the same product varies. In general, representing product quality by the average of all ratings the product receives carries some objective error, which lowers detection accuracy. In addition, the algorithm is robust when the proportion of malicious raters is very large, but performs poorly on real rating systems in which both the proportion of malicious raters and the proportion of fraudulent ratings are small.

As another example, a paper at the 2012 WWW conference (Spotting Fake Reviewer Groups in Consumer Reviews. WWW '12, 2012, pp. 191-200) proposes a method for detecting malicious raters based on the similarity of review content. The method detects fraudulent reviewers by analyzing the similarity of review text: if two reviews are very similar, the users who posted them are more likely to be fraudulent reviewers. Although this method can detect fraudulent reviewers effectively, it requires text analysis of the review content of the entire social media system, so the data volume is large and processing efficiency is low. Moreover, in many social media systems users do not actively write reviews, and those who do often write only a few words, so analysis based on review content cannot be used normally in many systems. By contrast, nearly all current systems support ratings. Because rating costs users little, many users participate, yet detection methods based on review text cannot be used in such systems.

With the development of social networks, US Patent No. 8176057, granted on August 5, 2012, discloses a social-network-based user reputation detection method in which reputation values are propagated through the feedback of high-reputation users, thereby detecting low-reputation users. Although the method can calculate user reputation values effectively, it is mainly used to identify users with higher reputation, and its accuracy in detecting malicious raters is not high.

In summary, existing methods cannot meet the practical needs of most social media websites: they are either inaccurate in identification, or too inefficient for practical detection, or unsuitable for some evaluation systems.

Summary of the invention

The object of the present invention is to provide an effective method for detecting malicious raters in an online social media system. The invention targets social media systems with rating feedback and detects malicious raters by analyzing users' score values. It avoids the very large computational cost of analyzing and processing review text, improving detection efficiency while keeping accuracy high.

The technical solution adopted by the present invention is a method for detecting malicious raters in an online social media system, comprising the following steps:

Step 1: extract the user rating data from the system and preprocess it to obtain normalized user rating data comprising the user ID, the product ID and the user's score for the product; store these three items as triples (u, p, v);

Step 2: cluster the user ratings and calculate the confidence vector of each user's ratings;

Step 2-1: cluster the users who gave the same score to the same product into one group;

Step 2-2: calculate the confidence vector of each user, where each component of the vector represents the user's credibility value for one product; this credibility value is the ratio of the size of the group the user belongs to for that product to the total number of users who rated the product, and is defined as the conformity ratio;

Step 3: from the user confidence vectors calculated in step 2, calculate the reliability of each user's ratings, treat the N least reliable users as malicious raters, and generate a candidate list of malicious raters, where N is set according to factors such as the user rating ratio of the actual system and the required detection accuracy;

Step 4: re-sort the candidate list of malicious raters by the degree to which each candidate's scores deviate from the true quality of the products, and select the M users with the largest deviation as the final malicious raters, where M is set according to factors such as the user rating ratio of the actual system and the required detection accuracy.

The concrete steps of step 1 are:

Step 1-1: remove the users whose number of ratings is below a threshold K, where K can be adjusted according to the rating activity of the system and the desired granularity of detection;

Step 1-2: round every non-integer score to an integer (rounding half up) so that scores are discrete;

Step 1-3: store the user ID, the product ID and the user's score for the product as triples (u, p, v).

In step 1, K is usually 8.

In step 2, the confidence vectors of different users have different dimensions and are stored in an XML file.

The concrete steps of step 3 are:

Step 3-1: calculate the mean and the variance of each user's confidence vector, then divide the mean by the variance to obtain the user's reliability;

Step 3-2: sort all users in ascending order of reliability and select the first N users to generate the candidate list of malicious raters.

The concrete steps of step 4 are:

Step 4-1: calculate the mean score of each product and take this mean as the true quality of the product;

Step 4-2: for each user in the candidate list obtained in step 3, calculate the deviation from true quality for each product, that is, the difference between the user's score for the product and the product's true quality;

Step 4-3: take the absolute value of each user's deviation for each product and average these values to obtain the user's rating deviation;

Step 4-4: sort the users in descending order of rating deviation and select the first M users as the final malicious raters, generating the list of malicious raters.

The present invention detects malicious raters from user scores alone. On the one hand, this removes the complex work of processing text, improves detection efficiency, and makes the method applicable to nearly all evaluation systems. On the other hand, the method first detects a candidate set of malicious raters and then performs a second detection on the users in that set, which greatly improves identification accuracy. Detection is especially effective in realistic evaluation systems where each user rates only a small fraction of all products and malicious raters are a small fraction of all users.

Brief description of the drawings

Fig. 1 is a flowchart of the method provided by the present invention for detecting malicious raters in a large-scale social media system.

Fig. 2 is a flowchart of the generation of user confidence vectors provided by the present invention.

Fig. 3 is a flowchart of the user reliability calculation and the generation of the candidate list of malicious raters provided by the present invention.

Fig. 4 is a flowchart, provided by the present invention, of re-sorting the candidate list by the deviation between the candidates' scores and true product quality to obtain the final malicious raters.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that the examples described are intended only to aid understanding of the invention and do not limit it in any way.

The overall flow of the proposed method, which detects malicious raters in social media by clustering rating behavior, is shown in Fig. 1.

Step 1 is the data preprocessing module. This module preprocesses the raw data input to the system, filters out noise data and discretizes the scores to integers. The preprocessed data are the input to the feature extraction operation in step S2.

Step 2 is the user confidence calculation module. This module clusters the ratings in the data preprocessed in step S1 and calculates the user confidence vectors from the conformity ratio of the cluster sizes. The user confidence is the input to the secondary feature extraction in step S3.

Step 3 calculates the reliability of user ratings and generates the candidate list of malicious raters. Based on the user confidence vectors, this module extracts the mean and variance of each user's confidence, takes the ratio of mean to variance as the user's reliability, sorts users by reliability, and finally generates the candidate list.

Step 4 re-sorts the candidate list and generates the final malicious raters. This module calculates the true quality of each product and, starting from the candidate list, uses the deviation between user scores and true product quality to perform a second-stage detection on the preliminary result, producing the final detection result.

Each key step is described in detail below:

1. Input the system's raw evaluation data, preprocess the input data, and store the preprocessed result (step 1).

Preprocessing has two main parts: filtering noise data and rounding scores to integers. First, the user rating data are separated from the input raw data, and users who rated fewer than 8 times are filtered out together with their ratings. If a user's score is not an integer, it is rounded to an integer (rounding half up). Because the noise data consist of users with few ratings and their scores, removing them has little effect on the system as a whole but effectively improves computational efficiency. Discretizing scores to integers reduces the complexity of clustering and makes the method easier to apply in real systems.
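The preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess` and the parameter `min_ratings` (standing in for the threshold K) are assumptions, and the input is assumed to be an in-memory list of (user, product, score) records.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP


def preprocess(raw_ratings, min_ratings=8):
    """Return normalized (user, product, score) triples.

    raw_ratings: iterable of (user_id, product_id, score) with numeric scores.
    min_ratings: hypothetical name for threshold K; users with fewer
                 ratings are treated as noise and dropped.
    """
    raw_ratings = list(raw_ratings)
    counts = Counter(user for user, _, _ in raw_ratings)
    triples = []
    for user, product, score in raw_ratings:
        if counts[user] < min_ratings:
            continue  # noise: this user rated too few products
        # Round half up (3.5 -> 4). Python's built-in round() rounds
        # half to even, so Decimal is used instead.
        rounded = int(Decimal(str(score)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
        triples.append((user, product, rounded))
    return triples
```

With the usual K of 8, `preprocess(raw_ratings)` keeps only users with at least eight ratings; a smaller `min_ratings` is convenient for small test data.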

2. Cluster the user ratings and calculate the confidence vector of each user's ratings.

Step 2 calculates the user confidence vectors. Its workflow is shown in Fig. 2 and comprises clustering of rating behavior, calculation of the conformity ratio, and generation and storage of the user confidence vectors.

In step 2-1, rating behavior is clustered as follows: users who gave the same score to the same product are clustered into one group. This clustering is performed for every product in the system. If a user has rated N products, the user's confidence is an N-dimensional vector, each component being the credibility value obtained from one rating. Because scores are discrete after preprocessing, clustering yields a fixed number of groups.

$$G_j^{(r)} = \{\, U_i \mid r_{i,j} = r,\ r \in \mathrm{Rate} \,\}$$

Here $G_j^{(r)}$ is the group formed by clustering the ratings of product $O_j$ at score $r$, $r_{i,j}$ is the score that user $U_i$ gave product $O_j$, and $\mathrm{Rate}$ is the set of discrete integer scores for products.
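The grouping can be sketched as a dictionary keyed by (product, score). This is one possible representation, not one prescribed by the text; the function name `cluster_by_score` is hypothetical and the input is assumed to be the (u, p, v) triples from step 1.

```python
from collections import defaultdict


def cluster_by_score(triples):
    """Map (product, score) -> set of users who gave that product that score.

    Each value corresponds to one group G_j^(r) in the notation above.
    """
    groups = defaultdict(set)
    for user, product, score in triples:
        groups[(product, score)].add(user)
    return dict(groups)
```

Because scores are discrete, each product contributes at most one group per score level, so the number of groups is bounded by the number of products times the number of score levels.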

Step 2-2 calculates, for each user, the ratio of the size of the user's group to the total number of users who rated the product; the larger the ratio, the more strongly the user belongs to the majority. This ratio reflects how far the user's evaluation behavior departs from that of most people. If the user belongs to a small group, the ratio is small and the user's evaluation departs substantially from the popular one. Conversely, if the user belongs to a large group, the evaluation agrees with that of most people, the departure is small, and the rating is credible. The system computes this conformity strength by normalizing the cluster size. User confidence is then assigned according to the group the user belongs to and that group's conformity ratio, so that the confidence characterizes the conformity strength of the user's group. Users who gave the same score to the same product and were clustered into one group receive the same confidence. This yields each user's confidence for every product the user rated, and hence the user's confidence vector. Finally, the confidence vector of each user is stored.
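A minimal sketch of the conformity-ratio calculation follows. It assumes each user rates a given product at most once and keeps the vectors in memory as dictionaries rather than in an XML file; the function name `confidence_vectors` is hypothetical.

```python
from collections import Counter, defaultdict


def confidence_vectors(triples):
    """Return {user: {product: confidence}} from (user, product, score) triples.

    confidence = (users who gave this product this score)
                 / (users who rated this product)
    """
    group_size = Counter((p, s) for _, p, s in triples)  # size of each score group
    raters = Counter(p for _, p, _ in triples)            # raters per product
    vectors = defaultdict(dict)
    for user, product, score in triples:
        vectors[user][product] = group_size[(product, score)] / raters[product]
    return dict(vectors)
```

Keeping the product ID as the key makes the differing vector lengths explicit: a user's vector has one entry per product that user rated.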

3. Calculate the reliability of user ratings and generate the candidate list of malicious raters (step S3).

Step 3 calculates user reliability from the user confidence vectors generated in step S2, sorts users by reliability, and adds the top K percent of the ranking to the candidate list of malicious raters. The flowchart of step S3 is shown in Fig. 3 and comprises calculation of the confidence mean and variance (step S31), calculation of user reliability (step S32), and generation and storage of the candidate malicious raters.

In step 3-1, the mean and the variance of all confidence values of each user are extracted. The mean reflects the average level of the user's reliability, and the variance reflects its degree of fluctuation. User reliability is the final measure of trustworthiness, calculated on the basis of the user's confidence. The user's average confidence and its degree of fluctuation are combined to give the reliability, as in the following formula:

$$\mathrm{Score}_i = \frac{Rs_i}{Ps_i},$$

where

$$Rs_i = \frac{\sum_{j \in O_i} rp_{i,j}}{\dim \vec{rp}_i}, \qquad Ps_i = \frac{\sum_{j \in O_i} \left(rp_{i,j} - Rs_i\right)^2}{\dim \vec{rp}_i}.$$

Here $\mathrm{Score}_i$ is the reliability of user $U_i$, that is, the user's final degree of trustworthiness; $rp_{i,j}$ is the confidence of user $U_i$ for product $O_j$, $O_i$ is the set of products rated by $U_i$, and $\dim \vec{rp}_i$ is the dimension of the user's confidence vector. $Rs_i$ is the mean of the user's confidence and represents the average level of the user's reliability; $Ps_i$ is the variance of the user's confidence and represents the degree of fluctuation of the user's reliability. When the mean is large and the variance is small, the user consistently obtains high rating credibility; such a user has high reliability and is trustworthy.
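The reliability formula can be sketched with the standard library. The function name `reliability` is hypothetical, and the handling of zero variance (returning infinity for a perfectly stable user) is an assumption, since the text does not say what happens in that case.

```python
from statistics import fmean, pvariance


def reliability(confidences):
    """Score_i = Rs_i / Ps_i for one user's list of confidence values."""
    rs = fmean(confidences)      # Rs_i: mean confidence
    ps = pvariance(confidences)  # Ps_i: variance, dividing by the vector dimension
    if ps == 0:
        # Assumption: a user whose confidence never fluctuates is
        # treated as maximally reliable.
        return float("inf")
    return rs / ps
```

For example, a user with confidences 0.2 and 0.4 has mean 0.3 and variance 0.01, giving a reliability of about 30.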

In step 3-2, the candidate list of malicious raters is generated from the user reliabilities calculated in step S32. The reliabilities are sorted in ascending order, and the top K percent of users in the list are added to the candidate set of malicious raters, completing the preliminary detection. Sorting uses the well-established quicksort algorithm, which is not a focus of the present invention; when the data volume is large, the sort can readily be distributed to improve efficiency.
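Candidate selection might look like the sketch below. Two choices here are assumptions: Python's built-in `sorted` (Timsort) is used in place of quicksort, and the cut-off is rounded up with `math.ceil` because the text does not say how a fractional count is handled. The names `candidate_list` and `top_percent` are hypothetical.

```python
import math


def candidate_list(reliability_by_user, top_percent):
    """Return the least reliable users, lowest reliability first."""
    # Ascending sort: the least reliable users come first.
    ranked = sorted(reliability_by_user, key=reliability_by_user.get)
    n = math.ceil(len(ranked) * top_percent / 100)
    return ranked[:n]
```

Since only the first part of the ranking is needed, a partial selection would also work on large data; a full sort is kept here because it mirrors the description.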

4. Re-sort the candidate list of malicious raters by the deviation between the candidates' scores and true product quality to obtain the final malicious raters.

The overall flow of step 4 is shown in Fig. 4. It comprises extracting true product quality (step S41), calculating rating deviation (step S42), sorting by the mean deviation between user scores and true product quality (step S43), and finally generating the malicious-rater detection result.

In step 4-1, the true quality of a product is measured by the average of all its ratings. True product quality is itself not directly measurable and is usually estimated by some algorithm. The present invention takes the arithmetic mean of the ratings each product receives as its true quality: the larger the mean, the better the product's quality, and vice versa.

Calculating the rating deviation in step 4-2 is a process that can be performed offline. The absolute value of the difference between a user's score and the product's true quality is taken as the user's rating deviation for that product.

In step 4-3, the same processing is applied to all products the user has rated to obtain the user's rating deviation vector, and the mean of this vector is taken as the user's final rating deviation. The same is done for every user in the candidate set generated in step 3. "User rating deviation" below always refers to the mean of the absolute deviations of the user's scores.

In step 4-4, the users in the candidate set are sorted in descending order of rating deviation. The higher a user ranks, the larger the rating deviation and the more likely the user is a malicious rater. The top K users in the ranking form the final list of malicious raters. K can be adjusted according to factors such as the user rating ratio of the actual system and the required detection accuracy. The final ranking is the malicious-rater detection result.
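Steps 4-1 to 4-4 can be sketched in one function. The function name `rerank_by_deviation` and the parameter `top_k` are hypothetical; product quality is computed from all ratings, including those of the candidates themselves, which is how the description reads but is not stated explicitly.

```python
from collections import defaultdict
from statistics import fmean


def rerank_by_deviation(triples, candidates, top_k):
    """Return the top_k candidates with the largest mean absolute deviation."""
    scores_by_product = defaultdict(list)
    for _, product, score in triples:
        scores_by_product[product].append(score)
    # Step 4-1: product quality is the arithmetic mean of all its ratings.
    quality = {p: fmean(s) for p, s in scores_by_product.items()}

    deviations = defaultdict(list)
    for user, product, score in triples:
        if user in candidates:
            # Steps 4-2 and 4-3: absolute deviation from the product's quality.
            deviations[user].append(abs(score - quality[product]))
    user_deviation = {u: fmean(d) for u, d in deviations.items()}

    # Step 4-4: descending order, largest deviation first.
    ranked = sorted(user_deviation, key=user_deviation.get, reverse=True)
    return ranked[:top_k]
```

The product means depend only on the rating data, so they can be computed once offline and reused whenever the candidate list changes.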

The execution of the present invention is illustrated below with a concrete example.

To simplify the illustration, this example uses the raw evaluations of 5 products by 10 users in a social media website, with scores from 1 to 5, five rating levels in total. In Table 1, rows represent users (U) and columns represent products (O); the value in each cell is the user's score for the product. An empty cell (-) means the user has not bought the product and there is no score. This forms the user-product rating matrix R of Table 1.

User	O1	O2	O3	O4	O5
U1	4	5	3	4	-
U2	-	4	4	2	5
U3	3	4	-	5	3
U4	5	-	-	4	3
U5	3	4	5	-	3
U6	2	4	3	5	3
U7	-	3	1	5	3
U8	1	-	3	3	4
U9	5	2	2	5	-
U10	5	-	2	1	4

Table 1

To simplify the illustration, only one implementation based on rating-behavior clustering is described here. Clustering is carried out as described in step S21, giving the rating groups after clustering. In Table 2, rows correspond to the 5 products of Table 1 and columns to the scores 1 to 5; each cell corresponds to the group formed by rating-behavior clustering, and its value is the size of that group. An empty cell (-) means no user gave that score.

Table 2
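The body of Table 2 is not reproduced in this text. As a hedged illustration, the group sizes it describes can be recomputed from Table 1; the variable names below are assumptions, and the printed layout (products as rows, scores 1 to 5 as columns) follows the description above.

```python
from collections import Counter

PRODUCTS = ["O1", "O2", "O3", "O4", "O5"]
# Ratings copied from Table 1; None marks a product the user did not rate.
TABLE_1 = {
    "U1": [4, 5, 3, 4, None],
    "U2": [None, 4, 4, 2, 5],
    "U3": [3, 4, None, 5, 3],
    "U4": [5, None, None, 4, 3],
    "U5": [3, 4, 5, None, 3],
    "U6": [2, 4, 3, 5, 3],
    "U7": [None, 3, 1, 5, 3],
    "U8": [1, None, 3, 3, 4],
    "U9": [5, 2, 2, 5, None],
    "U10": [5, None, 2, 1, 4],
}

# Size of each (product, score) group after clustering.
group_sizes = Counter(
    (product, score)
    for row in TABLE_1.values()
    for product, score in zip(PRODUCTS, row)
    if score is not None
)

# One row per product, one column per score from 1 to 5.
for product in PRODUCTS:
    cells = [str(group_sizes.get((product, r), "-")) for r in range(1, 6)]
    print(product, *cells, sep="\t")
```

For instance, three users (U4, U9 and U10) gave O1 a score of 5, so the O1 row has 3 in the last column.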

User confidence is obtained by normalizing the size of each group by the number of users who rated the product; users who gave the same score to the same product and were clustered into one group receive the same confidence. In Table 3, rows represent the 10 users and columns the 5 products; each cell is the confidence of the user's evaluation of the product. An empty cell (-) means the user did not rate the product.

User	O1	O2	O3	O4	O5
U1	0.125	0.143	0.375	0.222	-
U2	-	0.571	0.125	0.111	0.125
U3	0.250	0.571	-	0.444	0.625
U4	0.375	-	-	0.222	0.625
U5	0.250	0.571	0.125	-	0.625
U6	0.125	0.571	0.375	0.444	0.625
U7	-	0.143	0.125	0.444	0.625
U8	0.125	-	0.375	0.111	0.250
U9	0.375	0.143	0.250	0.444	-
U10	0.375	-	0.250	0.111	0.250

Table 3
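As a check on Table 3, the O1 column can be worked by hand from Table 1. Eight users rated O1, and the groups below are read off the O1 column of Table 1. The symbols $G_1^{(r)}$ and $rp_{i,1}$ follow the notation of the group definition and the reliability formula.

```latex
\begin{aligned}
G_1^{(5)} &= \{U_4, U_9, U_{10}\}, & rp_{4,1} = rp_{9,1} = rp_{10,1} &= \tfrac{3}{8} = 0.375,\\
G_1^{(3)} &= \{U_3, U_5\}, & rp_{3,1} = rp_{5,1} &= \tfrac{2}{8} = 0.250,\\
G_1^{(4)} &= \{U_1\},\quad G_1^{(2)} = \{U_6\},\quad G_1^{(1)} = \{U_8\}, & rp_{1,1} = rp_{6,1} = rp_{8,1} &= \tfrac{1}{8} = 0.125.
\end{aligned}
```

These values agree with the O1 column of Table 3.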

The Gaussian-distribution statistics of each user's confidence vector, namely the mean and the variance, are calculated according to step S31. The final reliability of the user is the ratio of the mean to the variance. In Table 4, rows represent users and columns give the mean, variance and final reliability of each user's confidence.

User	Mean	Variance	Reliability
U1	0.2163	0.1139	1.899
U2	0.2330	0.2254	1.034
U3	0.4725	0.1666	2.836
U4	0.4073	0.2034	2.002
U5	0.3927	0.2434	1.613
U6	0.4280	0.1963	2.180
U7	0.3342	0.2429	1.376
U8	0.2153	0.1235	1.743
U9	0.3030	0.1335	2.269
U10	0.2465	0.1079	2.285

Table 4
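The figures in Table 4 can be recomputed from the confidence values of Table 3. One observation is worth flagging: the numbers in the variance column appear to match the sample standard deviation (dividing by n - 1 and taking the square root) rather than the variance as defined in the reliability formula. The sketch below therefore uses `statistics.stdev`, which is an assumption made only to reproduce the table; the variable names are also assumptions.

```python
from statistics import fmean, stdev

# Confidence vectors copied from Table 3 (unrated products omitted).
TABLE_3 = {
    "U1": [0.125, 0.143, 0.375, 0.222],
    "U2": [0.571, 0.125, 0.111, 0.125],
    "U3": [0.250, 0.571, 0.444, 0.625],
    "U4": [0.375, 0.222, 0.625],
    "U5": [0.250, 0.571, 0.125, 0.625],
    "U6": [0.125, 0.571, 0.375, 0.444, 0.625],
    "U7": [0.143, 0.125, 0.444, 0.625],
    "U8": [0.125, 0.375, 0.111, 0.250],
    "U9": [0.375, 0.143, 0.250, 0.444],
    "U10": [0.375, 0.250, 0.111, 0.250],
}

reliability = {}
for user, conf in TABLE_3.items():
    mean = fmean(conf)
    spread = stdev(conf)  # assumption: sample standard deviation (n - 1)
    reliability[user] = mean / spread
    print(f"{user}\t{mean:.4f}\t{spread:.4f}\t{reliability[user]:.3f}")

# Ascending reliability: least reliable users first.
ranking = sorted(reliability, key=reliability.get)
print(ranking)
```

Under this assumption the computed values agree with Table 4 to within rounding, and the ascending ranking begins with U2, U7, U5, U8.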

Sorting the reliabilities in Table 4 in ascending order gives: U2, U7, U5, U8, U1, U4, U6, U9, U10, U3. The first 40% of users in the list are added to the candidate set of malicious raters, giving the candidate set {U2, U7, U5, U8}.

According to step S41, the mean of all ratings each product received is calculated as the product's true quality, and the candidate set of malicious raters is re-sorted by the deviation between user scores and true product quality. In Table 5, rows represent users and columns products, 4 users and 5 products in total; each cell is the deviation between the user's score and the product's true quality.

Table 5
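The body of Table 5 is not reproduced in this text. As a hedged illustration of how its entries arise, the product means and the rating deviation of one candidate, U2, can be worked from Table 1. The symbols $q_j$ for the mean score of product $O_j$ and $d_{U_2}$ for the user's rating deviation are notation introduced here for convenience.

```latex
\begin{aligned}
q_1 &= \tfrac{28}{8} = 3.5, \quad
q_2 = \tfrac{26}{7} \approx 3.714, \quad
q_3 = \tfrac{23}{8} = 2.875, \quad
q_4 = \tfrac{34}{9} \approx 3.778, \quad
q_5 = \tfrac{28}{8} = 3.5,\\
d_{U_2} &= \tfrac{1}{4}\bigl(|4 - q_2| + |4 - q_3| + |2 - q_4| + |5 - q_5|\bigr)
\approx \tfrac{1}{4}\,(0.286 + 1.125 + 1.778 + 1.500) \approx 1.172.
\end{aligned}
```

U2 did not rate O1, so only four products enter the average.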

Sorting the mean deviations calculated in Table 5 in descending order gives the list of malicious raters: U2, U7, U8, U5. Compared with the order in the candidate set, U8's scores deviate more than U5's, so U8 is more likely to be a malicious rater. Detection is now complete and the list of malicious raters is obtained; the higher a user ranks, the more likely the user is a malicious rater.
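The whole worked example can be replayed end to end from Table 1. This is a sketch under stated assumptions, not the patent's implementation: reliability is taken as the mean divided by the sample standard deviation (which is what the figures in Table 4 appear to use), the 40% cut-off is rounded up, and all variable names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from statistics import fmean, stdev

PRODUCTS = ["O1", "O2", "O3", "O4", "O5"]
# Ratings copied from Table 1; None marks a product the user did not rate.
TABLE_1 = {
    "U1": [4, 5, 3, 4, None],
    "U2": [None, 4, 4, 2, 5],
    "U3": [3, 4, None, 5, 3],
    "U4": [5, None, None, 4, 3],
    "U5": [3, 4, 5, None, 3],
    "U6": [2, 4, 3, 5, 3],
    "U7": [None, 3, 1, 5, 3],
    "U8": [1, None, 3, 3, 4],
    "U9": [5, 2, 2, 5, None],
    "U10": [5, None, 2, 1, 4],
}

# Step 1: (user, product, score) triples.
triples = [
    (user, product, score)
    for user, row in TABLE_1.items()
    for product, score in zip(PRODUCTS, row)
    if score is not None
]

# Step 2: confidence = size of the user's score group / raters of the product.
group_size = Counter((p, s) for _, p, s in triples)
raters = Counter(p for _, p, _ in triples)
confidence = defaultdict(list)
for user, product, score in triples:
    confidence[user].append(group_size[(product, score)] / raters[product])

# Step 3: reliability (assumed mean / sample std), least reliable 40% as candidates.
reliability = {u: fmean(c) / stdev(c) for u, c in confidence.items()}
ranked = sorted(reliability, key=reliability.get)
candidates = ranked[: math.ceil(len(ranked) * 40 / 100)]

# Step 4: mean absolute deviation from each product's average score.
scores = defaultdict(list)
for _, product, score in triples:
    scores[product].append(score)
quality = {p: fmean(s) for p, s in scores.items()}
deviation = {
    u: fmean([abs(s - quality[p]) for user, p, s in triples if user == u])
    for u in candidates
}
final = sorted(candidates, key=deviation.get, reverse=True)

print(candidates)
print(final)
```

Under these assumptions the candidate set and the final order match the lists given in the example: U2, U7, U5, U8 after step 3, and U2, U7, U8, U5 after step 4.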

The illustrative embodiments of the present invention are described above so that those skilled in the art can understand the invention. It should be noted, however, that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims

1. A method for detecting users who score maliciously in an online social media system, the method comprising:

Step 2-2: calculating the confidence vector of each user, where each component of the vector represents the user's credibility value for one product, and this credibility value is the conformity ratio, that is, the ratio of the size of the group the user belongs to for that product to the total number of users who rated the product;

Step 3: from the user confidence vectors calculated in step 2, calculating the reliability of each user's ratings, treating the N least reliable users as malicious raters, and generating a candidate list of malicious raters, where N is set according to factors such as the user rating ratio of the actual system and the required detection accuracy;

2. The method for detecting users who score maliciously in an online social media system according to claim 1, characterized in that the concrete steps of step 1 are:

3. The method for detecting users who score maliciously in an online social media system according to claim 2, characterized in that in step 1-1, K is usually 8.

4. The method for detecting users who score maliciously in an online social media system according to claim 1, characterized in that in step 2 the confidence vectors of different users have different dimensions and are stored in an XML file.

5. The method for detecting users who score maliciously in an online social media system according to claim 1, characterized in that the concrete steps of step 3 are:

6. The method for detecting users who score maliciously in an online social media system according to claim 1, characterized in that the concrete steps of step 4 are: