CN110648173B - Unsupervised abnormal commodity data detection method based on good evaluation and poor evaluation rates of commodities - Google Patents


Info

Publication number
CN110648173B
Authority
CN
China
Prior art keywords
commodity
commodities
abnormal
good
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910887119.8A
Other languages
Chinese (zh)
Other versions
CN110648173A (en
Inventor
刘静
侯志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201910887119.8A
Publication of CN110648173A
Application granted
Publication of CN110648173B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data

Landscapes

  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an unsupervised method for detecting abnormal commodity data based on the good evaluation rate and poor evaluation rate of commodities, and mainly addresses the low accuracy of abnormal commodity data detection in online shopping malls. The scheme is as follows: determine the type of abnormal commodity data to be detected; to detect abnormal high-score commodities, first calculate the good evaluation rate of each commodity, then apply a difference operator to obtain the differential good evaluation rate of each commodity, and finally determine the abnormal high-score commodity; to detect abnormal low-score commodities, first calculate the poor evaluation rate of each commodity, then apply a scaling operator to obtain the scaled poor evaluation rate of each commodity, and finally determine the abnormal low-score commodity. The invention provides two indexes and two operators for the two detection scenarios, detects abnormal commodities more accurately, helps system maintenance personnel find problematic commodities early and delete abnormal data in time, and can be used to detect abnormal commodity data in online shopping malls and to maintain system stability.

Description

Unsupervised abnormal commodity data detection method based on good evaluation and poor evaluation rates of commodities
Technical Field
The invention belongs to the field of detection technology and particularly relates to a method for detecting abnormal commodity data, which an online mall can use to detect abnormal commodity data and maintain the stability of its system.
Background
With the rapid development of information technology and the internet, online shopping has become the first choice of more and more people for purchasing goods. To increase the exposure and sales of their commodities, some merchants induce users through cashback, rewards, and similar incentives to give their commodities good reviews, that is, high scores; to suppress competitors, some merchants even hire users to maliciously give competitors' commodities bad reviews, that is, low scores. Such abnormal data have been found in well-known e-commerce systems, for example the domestic online shopping website Taobao, Douban (a community website providing recommendations and reviews of books, movies, and music), and the foreign online shopping website eBay. Abnormal commodity data greatly affect the stability of the system, degrade the user experience, and may even cause users to abandon the system. It is therefore very important to detect abnormal commodity data in the system promptly and effectively, so that system maintenance personnel can find problematic commodities as early as possible, delete the abnormal data in time, and maintain the stability of the system.
According to "Robust Collaborative Recommendation" by Robin Burke et al. (Recommender Systems Handbook, pp. 805-835, 2015), the clustering-based KNN method and the decision-tree-based C4.5 method are currently two classical and commonly used methods for detecting abnormal data. The clustering-based KNN method clusters the original data directly, so that abnormal and non-abnormal data fall into different clusters, which completes the detection. Because it is unsupervised, it requires no pre-training and is simple and effective. However, it uses the rating information of commodities directly without quantitatively analyzing how often a commodity receives the highest and lowest scores, so its detection accuracy for abnormal commodity data is not high. The decision-tree-based C4.5 method builds a decision tree directly from the data in order to distinguish and detect abnormal data. Although its detection accuracy is higher than that of the clustering-based KNN method, it is a supervised model: a certain amount of false data must be constructed artificially in advance to train the model before detection is possible. Artificially constructed data often differ considerably from real data and can hardly simulate the complex conditions of a real system, which limits the use of this method in real systems.
Disclosure of Invention
The invention aims to provide an unsupervised abnormal commodity data detection method based on the good evaluation rate and poor evaluation rate of commodities, in order to overcome two problems of the prior art: low detection accuracy caused by the lack of quantitative analysis of commodity ratings, and the need to construct a certain amount of false data artificially before detection can be carried out.
The technical idea of the invention is as follows. When detecting abnormal high-score commodities, a good evaluation rate index is defined to quantify how often a commodity receives high scores, and a difference operator is defined to eliminate noise in the data and highlight the good evaluation rate of abnormal high-score commodities, which improves detection accuracy. When detecting abnormal low-score commodities, a poor evaluation rate index is defined to quantify how often a commodity receives low scores, and a scaling operator is defined to overcome the power-law distribution of the data and highlight the poor evaluation rate of abnormal low-score commodities, which likewise improves detection accuracy. The implementation steps are as follows:
(1) Entering data:
According to users' rating records for commodities on the e-commerce website, extract the rating data of each commodity; form a commodity set $O = \{o_1, o_2, \ldots, o_i, \ldots, o_m\}$ from all commodities in the extracted data and a user set $U = \{u_1, u_2, \ldots, u_j, \ldots, u_n\}$ from all users in the extracted data, where $o_i$ denotes the $i$-th commodity, $i$ ranges from 1 to $m$, $m$ is the total number of commodities, $u_j$ denotes the $j$-th user, $j$ ranges from 1 to $n$, and $n$ is the total number of users;
(2) Determining whether this detection targets abnormal high-score commodities: if so, execute step (3); if not, the detection targets abnormal low-score commodities, so jump to step (6);
(3) Calculating the good evaluation rate of each commodity:
(3a) For each commodity $o_i$ in the commodity set $O$, count the number $r_i$ of users who have rated $o_i$;
(3b) For each commodity $o_i$ in the commodity set $O$, calculate its good evaluation rate $H_i$:
$$H_i = \frac{r_{i\_max}}{r_i}$$
where $r_{i\_max}$ is the number of ratings of commodity $o_i$ that equal the highest score of the system; if the allowed rating range of the current system is 1 to 5, $r_{i\_max}$ is the number of ratings of $o_i$ equal to 5;
(4) Calculating the differential good evaluation rate of each commodity:
(4a) Sort the good evaluation rates $H_i$ of the commodities in descending order of the number of ratings $r_i$ each commodity has;
(4b) On the basis of this ordering by number of ratings $r_i$, for each commodity $o_i$, take its position in the sorted sequence as the center and select $l/2$ commodities before it and $l/2$ commodities after it to construct the neighbor commodity set $\Gamma_i = \{g_1, g_2, \ldots, g_k, \ldots, g_l\}$ of $o_i$, where $g_k$ denotes the $k$-th neighbor commodity of $o_i$, $k$ ranges from 1 to $l$, and $l$ is the total number of neighbor commodities of $o_i$;
(4c) From the good evaluation rate of each commodity, calculate the differential good evaluation rate $D_i$ obtained after differencing:
[Equation image BDA0002207641510000031: definition of $D_i$]
where $H_k$ is the good evaluation rate of the $k$-th neighbor commodity of commodity $o_i$;
(5) Select the commodities in the commodity set $O$ whose number of ratings $r_i$ is greater than 1% of the total number of users $n$ to form an abnormal commodity candidate set, and output the commodity $o_i$ with the largest differential good evaluation rate $D_i$ in the candidate set as the detection result;
(6) Calculating the poor evaluation rate of each commodity:
(6a) For each commodity $o_i$ in the commodity set $O$, count the number $r_i$ of users who have rated $o_i$;
(6b) For each commodity $o_i$ in the commodity set $O$, calculate its poor evaluation rate $C_i$:
$$C_i = \frac{r_{i\_min}}{r_i}$$
where $r_{i\_min}$ is the number of ratings of commodity $o_i$ that equal the lowest score of the system; if the allowed rating range of the current system is 1 to 5, $r_{i\_min}$ is the number of ratings of $o_i$ equal to 1;
(7) From the poor evaluation rate of each commodity, calculate the scaled poor evaluation rate $S_i$ obtained after scaling:
[Equation image BDA0002207641510000033: definition of $S_i$]
where $\bar{r}$ is the average of the numbers of ratings $r_i$ over all commodities in the commodity set $O$;
(8) Select the commodities in the commodity set $O$ whose number of ratings $r_i$ is greater than 1% of the total number of users $n$ to form an abnormal commodity candidate set, and output the commodity $o_i$ with the largest scaled poor evaluation rate $S_i$ in the candidate set as the detection result.
Compared with the prior art, the invention has the following advantages:
First, the invention defines two statistical indexes, the good evaluation rate and the poor evaluation rate of a commodity, and uses them to quantify how often a commodity receives high and low scores. Compared with numerical analysis performed directly on all ratings of a commodity, the two indexes reflect the distinctive features of abnormal commodity data more intuitively, so abnormal commodities are detected better.
Second, commodities with similar numbers of ratings have similar good evaluation rates, whereas the good evaluation rate of an abnormal high-score commodity differs greatly from those of commodities with similar numbers of ratings. Based on this distribution characteristic, the invention defines a difference operator that smooths the noise in the good evaluation rate values, amplifies the differences between the good evaluation rates of commodities, and highlights the abnormal values of abnormal high-score commodities, which further improves the detection accuracy for abnormal high-score commodities.
Third, because the poor evaluation rates of commodities follow a power-law distribution, the invention defines a scaling operator after which the poor evaluation rates of commodities lie essentially on the same baseline. The poor evaluation rate of an abnormal low-score commodity then forms an obvious peak relative to normal commodities, which further improves the detection accuracy for abnormal low-score commodities.
Fourth, the detection method is based on statistical indexes of the data and does not require artificially constructed data to train a model in advance; it is an unsupervised detection method with a wider range of application.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 shows simulation plots of how the good evaluation rate and the differential good evaluation rate defined by the invention separate the values before and after malicious high scoring is applied to each commodity;
FIG. 3 shows simulation plots of how the poor evaluation rate and the scaled poor evaluation rate defined by the invention separate the values before and after malicious low scoring is applied to each commodity;
FIG. 4 is a simulation plot of the results of detecting abnormal high-score commodities with the present invention;
FIG. 5 is a simulation plot of the results of detecting abnormal low-score commodities with the present invention.
Detailed Description
The embodiments and effects of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to FIG. 1, the specific implementation steps of the present invention are as follows:
Step 1, inputting data:
1.1) According to users' rating records for commodities on the e-commerce website, extract the specific rating data given by users to each commodity on the website;
1.2) Form a commodity set $O = \{o_1, o_2, \ldots, o_i, \ldots, o_m\}$ from all commodities in the extracted data, where $o_i$ denotes the $i$-th commodity, $i$ ranges from 1 to $m$, and $m$ is the total number of commodities;
1.3) Form a user set $U = \{u_1, u_2, \ldots, u_j, \ldots, u_n\}$ from all users in the extracted data, where $u_j$ denotes the $j$-th user, $j$ ranges from 1 to $n$, and $n$ is the total number of users.
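Step 1 can be illustrated with a minimal Python sketch. The record layout `(user_id, item_id, score)` and the function name `build_sets` are assumptions made for illustration; the patent does not prescribe a data format.

```python
from collections import defaultdict

def build_sets(records):
    """Build the commodity set O, the user set U, and per-commodity ratings.

    records: iterable of (user_id, item_id, score) tuples (assumed layout).
    """
    items = set()
    users = set()
    scores_by_item = defaultdict(dict)
    for user_id, item_id, score in records:
        items.add(item_id)
        users.add(user_id)
        # Assumption: one rating per (user, commodity); a later record overwrites an earlier one.
        scores_by_item[item_id][user_id] = score
    return sorted(items), sorted(users), dict(scores_by_item)

records = [("u1", "o1", 5), ("u2", "o1", 3), ("u1", "o2", 1)]
O, U, scores_by_item = build_sets(records)
m, n = len(O), len(U)  # total number of commodities and of users
```

Keeping the ratings keyed by commodity makes the per-commodity counts used in the later steps a simple `len()` call.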
Step 2, determining whether this detection targets abnormal high-score commodities.
In general, abnormal commodity data detection falls into two cases: detection of abnormal high-score commodities and detection of abnormal low-score commodities. The type of this detection is determined according to actual requirements: if abnormal high-score commodity data are to be detected, execute step 3; otherwise, abnormal low-score commodities are to be detected, so jump to step 6.
Step 3, calculating the good evaluation rate of each commodity:
3.1) For each commodity $o_i$ in the commodity set $O$, count the number $r_i$ of users who have rated $o_i$;
3.2) For each commodity $o_i$ in the commodity set $O$, calculate its good evaluation rate $H_i$:
$$H_i = \frac{r_{i\_max}}{r_i}$$
where $r_{i\_max}$ is the number of ratings of commodity $o_i$ that equal the highest score of the system; if the allowed rating range of the current system is 1 to 5, $r_{i\_max}$ is the number of ratings of $o_i$ equal to 5.
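As a small worked example of step 3, the sketch below assumes that the good evaluation rate is the fraction of a commodity's ratings that equal the highest score, as the definitions of $r_{i\_max}$ and $r_i$ suggest. The function name `good_rate` and the handling of unrated commodities are illustrative assumptions.

```python
def good_rate(scores, max_score=5):
    """Fraction of ratings equal to the highest allowed score.

    scores: list of ratings received by one commodity.
    max_score: highest score of the system (5 for a 1-to-5 scale).
    """
    r_i = len(scores)
    if r_i == 0:
        # Assumption: a commodity without ratings gets a rate of 0.
        return 0.0
    r_i_max = sum(1 for s in scores if s == max_score)
    return r_i_max / r_i

# 3 of the 6 ratings equal 5, so the rate is 0.5.
example_rate = good_rate([5, 5, 4, 3, 5, 1])
```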
Step 4, calculating the differential good evaluation rate of each commodity.
4.1) Sort the good evaluation rates $H_i$ of the commodities in descending order of the number of ratings $r_i$ each commodity has;
4.2) On the basis of this ordering by number of ratings $r_i$, for each commodity $o_i$, take its position in the sorted sequence as the center and select $l/2$ commodities before it and $l/2$ commodities after it to construct the neighbor commodity set $\Gamma_i = \{g_1, g_2, \ldots, g_k, \ldots, g_l\}$ of $o_i$, where $g_k$ denotes the $k$-th neighbor commodity of $o_i$, $k$ ranges from 1 to $l$, and $l$ is the total number of neighbor commodities of $o_i$; in this example, $l$ equals 1% of the number of users $n$;
4.3) From the good evaluation rate $H_i$ of each commodity, calculate the differential good evaluation rate $D_i$ obtained after differencing:
[Equation image BDA0002207641510000052: definition of $D_i$]
where $H_k$ is the good evaluation rate of the $k$-th neighbor commodity of commodity $o_i$.
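The neighbor construction of step 4 can be sketched in Python as follows. The sorting and windowing follow the text; the differencing itself is an assumption, because the formula for the differential good evaluation rate is given only as an equation image. Here it is taken to be the commodity's own rate minus the mean rate of its neighbors, which is one plausible reading and not necessarily the patented operator. The function name and the truncation of windows at the ends of the sequence are also assumptions.

```python
def differential_good_rates(H, r, l):
    """Sketch of step 4 under an ASSUMED difference operator.

    H: dict mapping commodity -> good evaluation rate.
    r: dict mapping commodity -> number of ratings.
    l: size of the neighbor set (l/2 before and l/2 after).
    """
    # 4.1) order commodities by number of ratings, descending
    order = sorted(H, key=lambda item: r[item], reverse=True)
    half = l // 2
    D = {}
    for pos, item in enumerate(order):
        # 4.2) neighbors: up to l/2 commodities on each side of the position
        # (assumption: windows are simply truncated at the sequence ends)
        before = order[max(0, pos - half):pos]
        after = order[pos + 1:pos + 1 + half]
        neighbors = before + after
        if not neighbors:
            D[item] = 0.0
            continue
        neighbor_mean = sum(H[g] for g in neighbors) / len(neighbors)
        # 4.3) ASSUMPTION: difference = own rate minus neighbor average
        D[item] = H[item] - neighbor_mean
    return D

H = {"a": 0.2, "b": 0.9, "c": 0.2, "d": 0.3, "e": 0.1}
r = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10}
D = differential_good_rates(H, r, l=2)
```

In this toy input, commodity `b` has a much higher rate than its two neighbors in the ordering, so it receives the largest difference value.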
Step 5, determining the abnormal high-score commodity according to the calculated differential good evaluation rates.
Select the commodities in the commodity set $O$ whose number of ratings $r_i$ is greater than 1% of the total number of users $n$ to form an abnormal commodity candidate set, and output the commodity with the largest differential good evaluation rate in the candidate set as the detection result.
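The selection rule can be sketched as a short Python function. The same rule applies to step 8 with the scaled poor evaluation rate as the score. The function name `detect_abnormal_item` and the behavior on an empty candidate set (returning `None`) are illustrative assumptions.

```python
def detect_abnormal_item(score, r, n_users, min_fraction=0.01):
    """Return the candidate commodity with the largest score.

    score: dict mapping commodity -> detection score (for example D_i).
    r: dict mapping commodity -> number of ratings.
    n_users: total number of users n.
    min_fraction: candidates need more than this fraction of n ratings.
    """
    threshold = min_fraction * n_users
    candidates = [item for item in score if r[item] > threshold]
    if not candidates:
        return None  # assumption: nothing is reported without candidates
    return max(candidates, key=lambda item: score[item])

score = {"a": 0.9, "b": 0.5, "c": 0.7}
r = {"a": 5, "b": 20, "c": 30}
# With n = 943 the threshold is 9.43 ratings, so "a" is excluded.
detected = detect_abnormal_item(score, r, n_users=943)
```

The threshold keeps commodities with very few ratings, whose rates are dominated by noise, out of the candidate set.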
Step 6, calculating the poor evaluation rate of each commodity:
6.1) For each commodity $o_i$ in the commodity set $O$, count the number $r_i$ of users who have rated $o_i$;
6.2) For each commodity $o_i$ in the commodity set $O$, calculate its poor evaluation rate $C_i$:
$$C_i = \frac{r_{i\_min}}{r_i}$$
where $r_{i\_min}$ is the number of ratings of commodity $o_i$ that equal the lowest score of the system; if the allowed rating range of the current system is 1 to 5, $r_{i\_min}$ is the number of ratings of $o_i$ equal to 1.
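Step 6 can be sketched for a whole commodity set at once. The sketch assumes that the poor evaluation rate is the fraction of a commodity's ratings that equal the lowest score, as the definitions of $r_{i\_min}$ and $r_i$ suggest; the function name `poor_rates` and the input layout are hypothetical.

```python
def poor_rates(scores_by_item, min_score=1):
    """Poor evaluation rate for every commodity.

    scores_by_item: dict mapping commodity -> list of ratings (assumed layout).
    min_score: lowest score of the system (1 for a 1-to-5 scale).
    """
    C = {}
    for item, scores in scores_by_item.items():
        r_i = len(scores)
        r_i_min = sum(1 for s in scores if s == min_score)
        # Assumption: a commodity without ratings gets a rate of 0.
        C[item] = r_i_min / r_i if r_i else 0.0
    return C

C = poor_rates({"o1": [1, 1, 5, 4], "o2": [3, 4], "o3": []})
```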
Step 7, from the poor evaluation rate $C_i$ of each commodity, calculating the scaled poor evaluation rate $S_i$ obtained after scaling:
[Equation image BDA0002207641510000062: definition of $S_i$]
where $\bar{r}$ is the average of the numbers of ratings $r_i$ over all commodities in the commodity set $O$.
Step 8, determining the abnormal low-score commodity according to the calculated scaled poor evaluation rates.
Select the commodities in the commodity set $O$ whose number of ratings $r_i$ is greater than 1% of the total number of users $n$ to form an abnormal commodity candidate set, and output the commodity with the largest scaled poor evaluation rate in the candidate set as the detection result.
The effect of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions:
The simulation experiments of the invention use MovieLens-100K, a data set commonly used in the field of electronic commerce, which contains 100,000 ratings given by 943 users to 1,682 commodities, with ratings ranging from 1 to 5.
2. Simulation content and result analysis:
Simulation 1: illustrating how well the good evaluation rate and the difference operator defined by the invention separate the data of abnormal high-score commodities.
First, with the MovieLens-100K data set as input, the original good evaluation rate and the original differential good evaluation rate of each commodity are calculated on the original data using the invention's method for detecting abnormal high-score commodities;
Next, malicious high scoring is simulated for each commodity: for each commodity, users who have not rated it are randomly selected, numbering 3% of the 943 system users, and a rating of the commodity equal to the highest score of the system, namely 5, is added for each of these users;
Third, the good evaluation rate and the differential good evaluation rate of each commodity after the malicious high scoring are calculated using the same method;
Finally, the good evaluation rates and differential good evaluation rates before and after the malicious high scoring are plotted for comparison; the results are shown in FIG. 2(a) and FIG. 2(b), where:
FIG. 2(a) shows the distribution of the good evaluation rates of commodities before and after malicious high scoring is applied to each commodity by randomly selected users who have not rated it, numbering 3% of the system users. The abscissa is the number of ratings the commodity has in the original data set and the ordinate is the good evaluation rate of the commodity; the gray line is the distribution curve before the malicious high scoring and the black line is the distribution curve after it.
FIG. 2(b) shows the distribution of the differential good evaluation rates of commodities before and after the same malicious high scoring. The abscissa is the number of ratings the commodity has in the original data set and the ordinate is the differential good evaluation rate of the commodity; the gray line is the distribution curve before the malicious high scoring and the black line is the distribution curve after it.
As FIG. 2(a) shows, commodities with similar numbers of ratings have similar good evaluation rates, whereas after malicious high scoring the good evaluation rate of an abnormal high-score commodity differs greatly from those of commodities with similar numbers of ratings. This shows that the good evaluation rate index quantifies how often a commodity receives high scores and separates the data well when malicious high scoring is present. Comparing the curves in FIG. 2(a) and FIG. 2(b) shows that the difference operator smooths the noise in the good evaluation rate values, amplifies the differences between the good evaluation rates of commodities, and highlights the abnormal values of abnormal high-score commodities. In practice, malicious high scoring usually targets commodities with poor scores or few reviews, that is, the second half of the curves in FIG. 2. For this part of the data, FIG. 2 clearly shows that the good evaluation rate index and the difference operator defined by the invention separate abnormal high-score commodity data from normal commodity data well.
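The injection of malicious ratings used in Simulation 1 can be sketched in Python as below. The function name `inject_scores`, the dict layout, and the rounding of 3% of 943 users to 28 users are assumptions; the text states only that 3% of the system users, drawn from those who have not rated the commodity, each add a rating of 5.

```python
import random

def inject_scores(scores_by_item, item, users, fraction=0.03, score=5, seed=None):
    """Return the ratings of `item` after a simulated malicious scoring attack.

    scores_by_item: dict mapping commodity -> {user: rating} (assumed layout).
    users: list of all system users.
    fraction: share of all users taking part in the attack.
    score: injected rating (5 for high scoring, 1 for low scoring).
    """
    rng = random.Random(seed)
    n_attackers = round(fraction * len(users))  # assumption: round to nearest
    existing = scores_by_item.get(item, {})
    # Only users who have not yet rated the commodity may be selected.
    eligible = [u for u in users if u not in existing]
    chosen = rng.sample(eligible, min(n_attackers, len(eligible)))
    updated = dict(existing)  # the original ratings are left untouched
    for u in chosen:
        updated[u] = score
    return updated

users = list(range(943))
original = {"o1": {0: 4, 1: 2}}
attacked = inject_scores(original, "o1", users, seed=0)
```

Passing `score=1` gives the malicious low scoring used for the poor evaluation rate experiments.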
Simulation 2: illustrating how well the poor evaluation rate and the scaling operator defined by the invention separate the data of abnormal low-score commodities.
First, with the MovieLens-100K data set as input, the original poor evaluation rate and the original scaled poor evaluation rate of each commodity are calculated on the original data using the invention's method for detecting abnormal low-score commodities;
Next, malicious low scoring is simulated for each commodity: for each commodity, users who have not rated it are randomly selected, numbering 3% of the 943 system users, and a rating of the commodity equal to the lowest score of the system, namely 1, is added for each of these users;
Third, the poor evaluation rate and the scaled poor evaluation rate of each commodity after the malicious low scoring are calculated using the same method;
Finally, the poor evaluation rates and scaled poor evaluation rates before and after the malicious low scoring are plotted for comparison; the results are shown in FIG. 3(a) and FIG. 3(b), where:
FIG. 3(a) shows the distribution of the poor evaluation rates of commodities before and after malicious low scoring is applied to each commodity by randomly selected users who have not rated it, numbering 3% of the system users. The abscissa is the number of ratings the commodity has in the original data set and the ordinate is the poor evaluation rate of the commodity; the gray line is the distribution curve before the malicious low scoring and the black line is the distribution curve after it.
FIG. 3(b) shows the distribution of the scaled poor evaluation rates of commodities before and after the same malicious low scoring. The abscissa is the number of ratings the commodity has in the original data set and the ordinate is the scaled poor evaluation rate of the commodity; the gray line is the distribution curve before the malicious low scoring and the black line is the distribution curve after it.
As FIG. 3(a) shows, the poor evaluation rate of a commodity changes markedly after malicious low scoring, which shows that the poor evaluation rate index defined by the invention quantifies how often a commodity receives low scores and separates abnormal low-score commodities well. FIG. 3(a) also shows that the more reviews a commodity has, the lower its poor evaluation rate, and the fewer reviews it has, the higher its poor evaluation rate; that is, the poor evaluation rates of commodities follow a power-law distribution. Comparing the curves in FIG. 3(a) and FIG. 3(b) shows that, after the scaling operator is applied, the poor evaluation rates of commodities lie essentially on the same baseline, and the scaled poor evaluation rate of an abnormal low-score commodity forms an obvious peak relative to normal commodities. The scaling operator defined by the invention therefore removes the influence of the power-law distribution of poor evaluation rates on the detection result and further highlights the data abnormality of abnormal low-score commodities. In practice, malicious low scoring usually targets commodities with high scores or many reviews, that is, the first half of the curves in FIG. 3. For this part of the data, FIG. 3 clearly shows that the poor evaluation rate index and the scaling operator defined by the invention separate abnormal low-score commodity data from normal commodity data well.
Simulation 3: comparing the method of the invention with the cluster-based detection method KNN and the decision-tree-based detection method C4.5 in detecting abnormal high-score commodity data.
First, with the MovieLens-100K data set as input, 50 commodities are randomly selected from the commodity set as the set of commodities to be subjected to malicious high scoring. In each detection run, one commodity is taken from this set as the target of malicious high scoring; according to the specified number of users participating in the malicious high scoring, users who have not rated the commodity are randomly selected from the user set, and a rating of the commodity equal to 5, namely a good review, is added for each of them;
Then, the modified data set is examined with the method of the invention to obtain a detection result;
Finally, the abnormal commodity output by the method is compared with the commodity selected in the previous step as the target of malicious high scoring: if they match, the run is marked 1, a correct detection; otherwise it is marked 0, a wrong detection. The proportion of correct detections over the 50 commodities is then obtained; the higher the proportion, the more accurate the detection.
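The accuracy measure described above can be sketched in Python. The callable `detect` is a hypothetical stand-in for one full run (inject malicious ratings for the target commodity, run the detection method, return the reported commodity); it is not part of the source.

```python
def detection_accuracy(targets, detect):
    """Proportion of targets for which the detector reports the target itself.

    targets: list of attacked commodities (50 in the simulation).
    detect: hypothetical callable mapping a target commodity to the
        commodity reported as abnormal after the attack on that target.
    """
    # Mark each run 1 for a correct detection and 0 for a wrong one.
    marks = [1 if detect(target) == target else 0 for target in targets]
    return sum(marks) / len(marks)

# Toy detector that is right only for even-numbered commodities.
toy_detect = lambda target: target if target % 2 == 0 else -1
accuracy = detection_accuracy(list(range(50)), toy_detect)
```

Repeating this for attack sizes from 1% to 10% of the users yields one accuracy value per attack size, which is the quantity plotted against the attack size in the simulation results.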
In this simulation, the detection accuracy of each method is measured as the number of users participating in the malicious high scoring increases from 1% to 10% of the number of system users in steps of 1%; the results are shown in FIG. 4, where:
The abscissa is the proportion of users participating in the malicious high scoring relative to the total number of system users, increasing from 1% to 10% in steps of 1%, and the ordinate is the accuracy of detecting abnormal high-score commodities. The curve marked with circles is the detection accuracy of the cluster-based detection method KNN, the curve marked with triangles is the detection accuracy of the decision-tree-based detection method C4.5, and the curve marked with squares is the detection accuracy of the method of the invention.
As FIG. 4 shows, the accuracy curve of the method of the invention always lies above those of the cluster-based detection method KNN and the decision-tree-based detection method C4.5, which indicates that the method of the invention detects abnormal high-score commodity data more accurately. Moreover, when few users participate in the malicious high scoring, for example 1% to 2% of the total number of system users, the accuracy of the method of the invention is far higher than that of KNN and C4.5. This shows that the method is more sensitive to the data of abnormal high-score commodities and detects abnormalities in the data earlier, which further demonstrates its effectiveness.
Simulation 4: the effect of detecting abnormally low-scored commodity data with the method of the invention, the cluster-based detection method KNN and the decision-tree-based detection method C4.5 is further explained.
Firstly, taking the MovieLens-100K data set as input, 50 commodities are randomly selected from the commodity set as the set of commodities to be subjected to malicious low scoring. One commodity is taken from this set as the commodity subjected to malicious low scoring in the current trial; according to the specified number of users participating in the malicious low scoring, users who have not scored this commodity are randomly selected from the user set, and a score of 1 (that is, a poor score) for this commodity is added for each of these users;
then, the modified data set is processed with the method of the invention to obtain a detection result;
and finally, the abnormal commodity output by the method is compared with the commodity selected in the previous step for malicious low scoring. If they are the same, the trial is marked 1 (correct detection); otherwise it is marked 0 (wrong detection). The detection accuracy of the method of the invention over the 50 commodities is then obtained; the higher the accuracy, the more accurate the detection.
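The three steps of one trial (inject scores of 1 from users who have not yet scored the commodity, run a detector, compare) can be sketched as below. This is only an illustration under stated assumptions: the rating store is a plain dictionary, and the names `inject_scores`, `run_trial` and the `detector` callable are hypothetical, not taken from the patent.

```python
import random
from typing import Callable, Dict, Hashable, Sequence

# Hypothetical in-memory layout: commodity -> {user: score}
Ratings = Dict[Hashable, Dict[Hashable, int]]


def inject_scores(ratings: Ratings, users: Sequence[Hashable], item: Hashable,
                  n_attackers: int, score: int, rng: random.Random) -> Ratings:
    """Return a copy of `ratings` in which `n_attackers` randomly chosen users
    who have not scored `item` each give it `score`."""
    already = ratings.get(item, {})
    candidates = [u for u in users if u not in already]
    if n_attackers > len(candidates):
        raise ValueError("not enough users who have not scored this commodity")
    attacked = {i: dict(r) for i, r in ratings.items()}  # leave the input intact
    for u in rng.sample(candidates, n_attackers):
        attacked.setdefault(item, {})[u] = score
    return attacked


def run_trial(ratings: Ratings, users: Sequence[Hashable], item: Hashable,
              n_attackers: int, detector: Callable[[Ratings], Hashable],
              rng: random.Random, score: int = 1) -> int:
    """One trial: inject, detect, and mark 1 (correct) or 0 (wrong)."""
    attacked = inject_scores(ratings, users, item, n_attackers, score, rng)
    return 1 if detector(attacked) == item else 0
```

Passing `score=5` instead of the default 1 gives the corresponding high-scoring trial on a 1-to-5 scale.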
In the simulation experiment, the detection accuracy of the method is tested as the proportion of system users participating in the malicious low scoring increases from 1% to 10% in steps of 1%. The result is shown in Fig. 5, in which:
the abscissa represents the proportion of users participating in the malicious low scoring relative to the total number of users of the system, increasing from 1% to 10% in steps of 1%; the ordinate represents the accuracy of detecting abnormally low-scored commodities; the curve marked with circles is the detection accuracy curve of the cluster-based detection method KNN, the curve marked with triangles is the detection accuracy curve of the decision-tree-based detection method C4.5, and the curve marked with squares is the detection accuracy curve of the method of the invention.
As can be seen from Fig. 5, the accuracy curve of the method of the invention always lies above the accuracy curves of the cluster-based detection method KNN and the decision-tree-based detection method C4.5, which indicates that the method of the invention detects abnormally low-scored commodity data more accurately. Moreover, when few users participate in the malicious low scoring, for example 1% to 2% of the total number of users of the system, the accuracy of the method is far higher than that of KNN and C4.5. This shows that the method is more sensitive to abnormally low-scored commodities and can detect abnormalities in the data early, which further illustrates its effectiveness.
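The sweep over attacker proportions that produces one accuracy curve can be sketched as follows. The function `accuracy_curve` and the `trial` callable are hypothetical names; rounding the attacker count to the nearest integer is an assumption, since the text does not say how fractional counts are handled.

```python
from typing import Callable, Dict, Hashable, Iterable, Sequence


def accuracy_curve(
    targets: Sequence[Hashable],
    n_users: int,
    trial: Callable[[Hashable, int], int],
    percents: Iterable[int] = range(1, 11),
) -> Dict[int, float]:
    """Detection accuracy for each attacker proportion (in percent).

    trial(target, n_attackers) must return 1 for a correct detection and
    0 for a wrong one (hypothetical interface for this sketch).
    """
    curve: Dict[int, float] = {}
    for pct in percents:
        # Assumption: the attacker count is n_users * pct%, rounded.
        n_attackers = round(n_users * pct / 100)
        marks = [trial(t, n_attackers) for t in targets]
        curve[pct] = sum(marks) / len(marks)
    return curve
```

Running this once per detector would give the three curves that the figure compares.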

Claims (1)

1. An unsupervised abnormal commodity data detection method based on good evaluation and poor evaluation rates of commodities is characterized by comprising the following steps:
(1) Entering data:
according to the scoring records of commodities by users in an e-commerce website, the scoring data of each commodity is extracted; all commodities in the extracted data form a commodity set $O = \{o_1, o_2, \ldots, o_i, \ldots, o_m\}$, and all users in the extracted data form a user set $U = \{u_1, u_2, \ldots, u_j, \ldots, u_n\}$, where $o_i$ represents the i-th commodity, i runs from 1 to m, m is the total number of commodities, $u_j$ represents the j-th user, j runs from 1 to n, and n is the total number of users;
(2) Determining whether the current detection targets abnormally high-scored commodities: if yes, executing step (3); if not, the detection targets abnormally low-scored commodities, and jumping to step (6);
(3) Calculating the good evaluation rate of each commodity:
(3a) For each commodity $o_i$ in the commodity set $O$, counting the number $r_i$ of users who have scored commodity $o_i$;
(3b) For each commodity $o_i$ in the commodity set $O$, calculating the good evaluation rate $H_i$ of the commodity:
$$H_i = \frac{r_{i\_max}}{r_i}$$
where $r_{i\_max}$ is the number of scores of commodity $o_i$ that equal the highest score of the system; for example, if the allowed score range of the current system is 1 to 5, $r_{i\_max}$ is the number of scores of commodity $o_i$ equal to 5;
(4) Calculating the differential good evaluation rate of each commodity:
(4a) Sorting the good evaluation rates $H_i$ of the commodities in descending order of the number of scores $r_i$ owned by each commodity;
(4b) On the basis of this ordering by number of scores $r_i$, for each commodity $o_i$, taking its position in the ordered sequence as the center and selecting $l/2$ commodities before it and $l/2$ commodities after it to construct the neighbor commodity set $\Gamma_i = \{g_1, g_2, \ldots, g_k, \ldots, g_l\}$ of commodity $o_i$, where $g_k$ represents the k-th neighbor commodity of $o_i$, k runs from 1 to $l$, and $l$ is the total number of neighbor commodities of $o_i$;
(4c) For the good evaluation rate of each commodity, calculating the differential good evaluation rate $D_i$ obtained after differencing:
Figure FDA0002207641500000021
where $H_k$ is the good evaluation rate of the k-th neighbor commodity of commodity $o_i$;
(5) Selecting from the commodity set $O$ the commodities whose number of scores $r_i$ is greater than 1% of the total number of users $n$ to form an abnormal commodity candidate set, and outputting, as the detection result, the commodity $o_i$ with the largest differential good evaluation rate $D_i$ in the abnormal commodity candidate set;
(6) Calculating the poor evaluation rate of each commodity:
(6a) For each commodity $o_i$ in the commodity set $O$, counting the number $r_i$ of users who have scored commodity $o_i$;
(6b) For each commodity $o_i$ in the commodity set $O$, calculating the poor evaluation rate $C_i$ of the commodity:
$$C_i = \frac{r_{i\_min}}{r_i}$$
where $r_{i\_min}$ is the number of scores of commodity $o_i$ that equal the lowest score of the system; for example, if the allowed score range of the current system is 1 to 5, $r_{i\_min}$ is the number of scores of commodity $o_i$ equal to 1;
(7) For the poor evaluation rate of each commodity, calculating the scaled poor evaluation rate $S_i$ obtained after scaling:
Figure FDA0002207641500000023
where the symbol shown in Figure FDA0002207641500000024 is the average value of the numbers of scores $r_i$ owned by the commodities in the commodity set $O$;
(8) Selecting from the commodity set $O$ the commodities whose number of scores $r_i$ is greater than 1% of the total number of users $n$ to form an abnormal commodity candidate set, and outputting, as the detection result, the commodity $o_i$ with the largest scaled poor evaluation rate $S_i$ in the abnormal commodity candidate set.
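The two branches of the claimed method can be sketched in Python as below. This is a rough illustration under several labeled assumptions, not the patented implementation. The formulas for $D_i$ and $S_i$ survive in this text only as figure references, so the forms used here (own rate minus the mean rate of the neighbors, and poor rate multiplied by $r_i$ over the mean number of scores) are guesses made purely for illustration. The good and poor evaluation rates are taken as simple ratios, inferred from the definitions of $r_{i\_max}$, $r_{i\_min}$ and $r_i$. Truncating the neighbor window at the ends of the ordering, the default `l=4`, and all function names are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Record = Tuple[Hashable, Hashable, int]  # (user, commodity, score)
ByItem = Dict[Hashable, Dict[Hashable, int]]


def load(records: Iterable[Record]) -> Tuple[ByItem, Set[Hashable]]:
    """Step (1): group scores by commodity and collect the user set."""
    by_item: ByItem = defaultdict(dict)
    users: Set[Hashable] = set()
    for user, item, score in records:
        by_item[item][user] = score
        users.add(user)
    return dict(by_item), users


def rate_of(by_item: ByItem, value: int) -> Dict[Hashable, float]:
    """Share of a commodity's scores equal to `value` (inferred ratio form)."""
    return {i: sum(s == value for s in r.values()) / len(r)
            for i, r in by_item.items()}


def candidates(by_item: ByItem, n_users: int) -> List[Hashable]:
    """Steps (5)/(8): commodities scored by more than 1% of all users."""
    return [i for i, r in by_item.items() if len(r) > 0.01 * n_users]


def detect_high(records: Iterable[Record], max_score: int = 5, l: int = 4) -> Hashable:
    """Steps (3)-(5). Raises ValueError if the candidate set is empty."""
    by_item, users = load(records)
    H = rate_of(by_item, max_score)
    # (4a) order commodities by number of scores, descending
    order = sorted(by_item, key=lambda i: len(by_item[i]), reverse=True)
    half = l // 2
    D = {}
    for pos, item in enumerate(order):
        # (4b) up to l/2 neighbors on each side (window truncated at the ends)
        neigh = order[max(0, pos - half):pos] + order[pos + 1:pos + 1 + half]
        mean_h = sum(H[g] for g in neigh) / len(neigh) if neigh else 0.0
        # ASSUMPTION: D_i = H_i - mean of neighbor H_k (original formula not available)
        D[item] = H[item] - mean_h
    return max(candidates(by_item, len(users)), key=D.__getitem__)


def detect_low(records: Iterable[Record], min_score: int = 1) -> Hashable:
    """Steps (6)-(8). Raises ValueError if the candidate set is empty."""
    by_item, users = load(records)
    C = rate_of(by_item, min_score)
    mean_r = sum(len(r) for r in by_item.values()) / len(by_item)
    # ASSUMPTION: S_i = C_i * r_i / mean(r) (original formula not available)
    S = {i: C[i] * len(by_item[i]) / mean_r for i in by_item}
    return max(candidates(by_item, len(users)), key=S.__getitem__)
```

Comparing each commodity only with neighbors of similar popularity, as in step (4b), keeps a heavily scored commodity from being judged against sparsely scored ones whose rates are much noisier.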

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910887119.8A CN110648173B (en) 2019-09-19 2019-09-19 Unsupervised abnormal commodity data detection method based on good evaluation and poor evaluation rates of commodities


Publications (2)

Publication Number Publication Date
CN110648173A (en) 2020-01-03
CN110648173B (en) 2023-04-07

Family

ID=68992011


Country Status (1)

Country Link
CN (1) CN110648173B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant