CN109191191A - Advertisement click fraud detection method based on cost-sensitive convolutional neural network

Publication number
CN109191191A
Authority
CN
China
Prior art keywords
cost
layer
sample
data set
publisher
Legal status
Granted
Application number
CN201810951569.4A
Other languages
Chinese (zh)
Other versions
CN109191191B (en
Inventor
张欣
刘学军
毛宇佳
李斌
徐新艳
Current Assignee
Nanjing Tech University
Original Assignee
Nanjing Tech University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0248 Avoiding fraud
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities

Abstract

The invention discloses an advertisement click fraud detection method based on a cost-sensitive convolutional neural network and belongs to the technical field of data processing. The method comprises: obtaining a one-dimensional feature data set, in which raw data comprising a click data set and a publisher data set is analyzed and features of advertisement publishers are extracted by statistical methods to obtain a one-dimensional feature data set for each publisher; constructing a feature matrix, in which the one-dimensional features are converted into a feature matrix using multi-granularity time windows; and classification prediction training, in which the feature matrix data set is taken as input and a convolutional neural network structure is selected for classification prediction training, a cost-sensitive mechanism is introduced at the output layer, and backpropagation is carried out using threshold moving. The method is well suited to present-day advertisement click fraud detection, achieves good detection capability, improves detection accuracy and model training efficiency, and is highly practical.

Description

Ad click fraud detection method based on cost-sensitive convolutional neural networks
Technical field
The invention belongs to the technical field of data processing, and in particular relates to an advertisement click fraud detection method based on a cost-sensitive convolutional neural network.
Background art
With the spread of smart devices, the advertising market has grown rapidly. According to a McKinsey Global report, digital advertising spending already accounts for 51% of the entire digital market and is expected to reach 70% in 2019. Pay per click (Cost Per Click, CPC) is currently the main charging model for mobile advertising; Baidu, Google, and Yahoo, for example, all use it. However, the CPC bid-ranking business model also objectively encourages click fraud by advertisement publishers: many network companies, acting as publishers, repeatedly click the advertisements published on their own websites to increase their advertising income. Almost 30% of mobile advertising revenue is consumed by various kinds of advertising fraud. One survey showed that in 2014 click fraud accounted for almost 14.6% of total online advertising expenditure, and 75% of advertisement publishers reported being affected by it. Click fraud causes serious losses of advertising budget and could plunge the entire online advertising market into crisis.
The pay-per-click payment model has by now formed a complete set of technical and business models, and it is unrealistic to expect advertisers and publishers to abandon it. An effective and stable click fraud detection and prevention mechanism is therefore the key issue affecting the number of valid clicks on Internet advertisements. Techniques that inflate click volume through automated clicking, such as malicious code and botnets, are very common; they can be monitored by professional third-party monitoring agencies or blocked with graphical CAPTCHAs. However, the methods and models used by third parties are not public, so their effectiveness is not transparent enough. Meanwhile, fraudsters keep devising new click fraud techniques in pursuit of higher profits, and online advertising systems need more accurate and effective detection methods. Researchers in China and abroad have sought various ways to identify and prevent click fraud in advertising, including fraud detection techniques such as knowledge discovery and expert systems. Most, however, detect fraud from massive Internet data on the basis of users' click behavior. Click fraud has complex patterns, and manual fraudulent clicks closely resemble normal clicks, so the features of fraudulent clicks exhibit a certain weak randomness and are hard to detect. Ensemble methods can improve an algorithm's generalization ability and are increasingly common in detection, but they can make the model overly complex and may cause overfitting.
Summary of the invention
The object of the present invention is to provide an advertisement click fraud detection method based on a cost-sensitive convolutional neural network. It applies a convolutional neural network structure to mobile advertisement click fraud detection, which can effectively address click fraud, reduce model complexity, avoid overfitting, and support multi-class classification without increasing model complexity.
Specifically, in one aspect, the present invention is realized by the following technical solution, comprising:
A step of obtaining a one-dimensional feature data set: raw data comprising a click data set and a publisher data set is analyzed, and features of advertisement publishers are extracted by statistical methods to obtain a one-dimensional feature data set for each publisher;
A step of constructing a feature matrix: the one-dimensional features are converted into a feature matrix using multi-granularity time windows;
A step of classification prediction training: the feature matrix data set is taken as input, and a convolutional neural network structure is selected for classification prediction training; a cost-sensitive mechanism is introduced at the output layer, and backpropagation is carried out using threshold moving;
Advertisement click fraud detection is performed using the trained classification prediction.
Further, the step of constructing the feature matrix comprises:
The click data set and publisher data set in the raw data are divided by click time into several fine-grained time window lengths, and a corresponding one-dimensional feature data set based on the several time windows is constructed;
The one-dimensional feature data set based on the several time windows is converted into a two-dimensional feature matrix according to the time windows.
Further, the convolutional neural network structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, and fully connected layer.
The convolutional layer computes on its input with different convolution kernels. The convolution is computed as:
$$x_j^{l} = f\Big(\sum_{i} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$
where $*$ is the convolution operator, $x_j^{l}$ and $x_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1, respectively, $k_{ij}^{l}$ is the convolution kernel from $x_i^{l-1}$ to $x_j^{l}$, $b_j^{l}$ is the bias term, and $f(\cdot)$ is a nonlinear activation function;
The pooling of the pooling layer is expressed as:
$$x_j^{l} = \mathrm{pool}\big(x_j^{l-1}\big)$$
where $x_j^{l}$ and $x_j^{l-1}$ are the output and input of the pooling layer and $\mathrm{pool}(\cdot)$ is the pooling function; the pooling layer uses max pooling to fully extract the salient features of the different convolutional feature maps;
The fully connected layer maps the feature maps produced by the convolutional, pooling, and activation layers to a fixed-length feature vector;
The fixed-length feature vector is processed by the output layer that follows the fully connected layer, completing the classification prediction.
Further, introducing a cost-sensitive mechanism at the output layer and carrying out backpropagation using threshold moving means:
The output layer performs classification prediction with the Softmax function. Threshold moving shifts the decision boundary of the non-cost-sensitive neural network toward the class with the lower cost, changing the probability with which a sample is classified, and backpropagation is carried out with minimizing the sample misclassification cost as the objective. The sample misclassification cost is calculated as:
$$O_j^{*} = \eta \sum_{c=1}^{C} a_j \,\mathrm{Cost}[j, c]$$
where $O_j^{*}$ is the misclassification cost of classifying sample x as class j, $\eta$ is a normalization term chosen so that $\sum_{j} O_j^{*} = 1$, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, $\mathrm{Cost}[j, c]$ is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.
In another aspect, the present invention may also adopt the following technical solution, comprising:
A step of obtaining a one-dimensional feature data set: raw data comprising a click data set and a publisher data set is analyzed, and features of advertisement publishers are extracted by statistical methods to obtain a one-dimensional feature data set for each publisher;
A step of constructing a feature matrix: the one-dimensional features are converted into a feature matrix using multi-granularity time windows;
A step of constructing training samples: the data set is balanced using a cost-based oversampling method;
A step of classification prediction training: the feature matrix data set is taken as input, and a convolutional neural network structure is selected for classification prediction training; a cost-sensitive mechanism is introduced at the output layer, and backpropagation is carried out using threshold moving;
Advertisement click fraud detection is performed using the trained classification prediction.
Further, extracting the features of advertisement publishers by statistical methods includes extracting click entropy features.
Further, the step of constructing the feature matrix comprises:
The click data set and publisher data set in the raw data are divided by click time into several fine-grained time window lengths, and a corresponding one-dimensional feature data set based on the several time windows is constructed;
The one-dimensional feature data set based on the several time windows is converted into a two-dimensional feature matrix according to the time windows.
Further, the step of constructing training samples comprises:
3-1) Calculate the cost of each fraudulent publisher sample, using the following formula:
where $\mathrm{Cost}_i$ is the cost of the i-th fraudulent publisher sample and $d_{ij}$ is the distance between the i-th fraudulent publisher sample and the j-th publisher sample; the neighborhood of the i-th fraudulent publisher is limited by a function and a cutoff value C;
3-2) Divide the fraudulent publishers into several clusters using the k-means algorithm, select two fraudulent samples with higher cost, $\mathrm{Pub}_1$ and $\mathrm{Pub}_2$, from a cluster according to their cost values as seed samples, and construct a new fraudulent publisher sample $\mathrm{Pub}_{new}$, calculated as:
$$\mathrm{Pub}_{new} = \mathrm{Pub}_2 + \mathrm{rand}(0, 1) \times (\mathrm{Pub}_1 - \mathrm{Pub}_2)$$
where rand(0, 1) is a random function that generates a random number in the interval (0, 1).
Further, the convolutional neural network structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, and fully connected layer;
The convolutional layer computes on its input with different convolution kernels. The convolution is computed as:
$$x_j^{l} = f\Big(\sum_{i} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$
where $*$ is the convolution operator, $x_j^{l}$ and $x_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1, respectively, $k_{ij}^{l}$ is the convolution kernel from $x_i^{l-1}$ to $x_j^{l}$, $b_j^{l}$ is the bias term, and $f(\cdot)$ is a nonlinear activation function;
The pooling of the pooling layer is expressed as:
$$x_j^{l} = \mathrm{pool}\big(x_j^{l-1}\big)$$
where $x_j^{l}$ and $x_j^{l-1}$ are the output and input of the pooling layer and $\mathrm{pool}(\cdot)$ is the pooling function; the pooling layer uses max pooling to fully extract the salient features of the different convolutional feature maps;
The fully connected layer maps the feature maps produced by the convolutional, pooling, and activation layers to a fixed-length feature vector;
The fixed-length feature vector is processed by the output layer that follows the fully connected layer, completing the classification prediction.
Further, introducing a cost-sensitive mechanism at the output layer and carrying out backpropagation using threshold moving means:
The output layer performs classification prediction with the Softmax function. Threshold moving shifts the decision boundary of the non-cost-sensitive neural network toward the class with the lower cost, changing the probability with which a sample is classified, and backpropagation is carried out with minimizing the sample misclassification cost as the objective. The sample misclassification cost is calculated as:
$$O_j^{*} = \eta \sum_{c=1}^{C} a_j \,\mathrm{Cost}[j, c]$$
where $O_j^{*}$ is the misclassification cost of classifying sample x as class j, $\eta$ is a normalization term chosen so that $\sum_{j} O_j^{*} = 1$, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, $\mathrm{Cost}[j, c]$ is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.
The beneficial effects of the present invention are as follows: the advertisement click fraud detection method based on a cost-sensitive convolutional neural network is well suited to present-day advertisement click fraud detection, achieves good detection capability, improves detection accuracy and model training efficiency, and is highly practical.
Brief description of the drawings
Fig. 1 is a schematic diagram of the system framework of Embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the feature matrix construction process of Embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of the cost-sensitive convolutional neural network structure of Embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of the system framework of Embodiment 2 of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
Embodiment 1:
One embodiment of the present invention is an advertisement click fraud detection method based on a cost-sensitive convolutional neural network; its implementation process is shown in Fig. 1.
Step 1: obtain the one-dimensional feature data set.
This embodiment uses the real mobile advertisement click data set provided by a company for the FDMA2012 contest (The International Workshop on Fraud Detection in Mobile Advertising 2012; see http://palanteer.sis.smu.edu.sg/Fdma2012/). The data set has two parts, a click data set and a publisher data set. Each record in the click data set is the click log entry of a network user, and each record in the publisher data set is the labeled profile of a publisher.
The raw data is analyzed and, with reference to known click fraud features, statistical methods are used to extract features of the advertisement publishers from multiple angles, yielding a one-dimensional feature data set for each publisher.
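The statistical feature extraction in Step 1 can be illustrated with a minimal Python sketch. The record layout `(publisher_id, timestamp_seconds, device)` and the four features chosen here are assumptions made for illustration; the patent does not list the exact features it extracts.

```python
from collections import defaultdict
from statistics import mean, pstdev

def publisher_features(clicks):
    """Aggregate click records into one feature row per publisher.

    clicks: iterable of (publisher_id, timestamp_seconds, device) tuples.
    The tuple layout and feature names are hypothetical.
    """
    by_pub = defaultdict(list)
    for pub, ts, device in clicks:
        by_pub[pub].append((ts, device))

    features = {}
    for pub, rows in by_pub.items():
        times = sorted(ts for ts, _ in rows)
        # Gaps between consecutive clicks; a single click yields one zero gap.
        gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
        features[pub] = {
            "total_clicks": len(rows),
            "distinct_devices": len({d for _, d in rows}),
            "mean_gap": mean(gaps),
            "std_gap": pstdev(gaps),
        }
    return features

sample_clicks = [("p1", 0, "a"), ("p1", 10, "a"), ("p1", 30, "b"), ("p2", 5, "c")]
feats = publisher_features(sample_clicks)
```

Each dictionary in `feats` plays the role of one publisher's one-dimensional feature vector; a real pipeline would add many more aggregates.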
Step 2: construct the feature matrix.
To better suit the convolutional neural network model, the one-dimensional features need to be converted into a feature matrix. The feature matrix is constructed using multi-granularity time windows and resembles a two-dimensional image input. As shown in Fig. 2, the feature matrix is constructed on the basis of time windows of multiple granularities; the specific steps are:
2-1) Analyze the click data set and publisher data set in the raw data, and divide them by click time into multiple fine-grained time window lengths T (e.g. first_15min (0-14), second_15min (15-29), third_15min (30-44), last_15min (45-59), night (0:00-5:59), morning (6:00-11:59), afternoon (12:00-17:59), evening (18:00-23:59), 3days, per min, per 5min, per 15min, per hour, per 3hours, per 6hours, per day) to better capture the temporal dynamics of click fraud, thereby constructing multiple one-dimensional features based on the time windows T.
2-2) Convert the constructed one-dimensional features into a two-dimensional feature matrix according to the time windows T; the matrix resembles a two-dimensional image and serves as input to the convolutional neural network.
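As a rough illustration of steps 2-1) and 2-2), the sketch below builds one matrix row per window granularity from a publisher's click timestamps. The four window widths and the three per-window statistics (mean, standard deviation, maximum of the click counts) are assumptions chosen for brevity, not the patent's actual feature set.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical granularities: per min, per 5 min, per 15 min, per hour.
WINDOW_SECONDS = [60, 300, 900, 3600]

def window_row(timestamps, width):
    """Statistics of click counts over the non-empty windows of one width."""
    counts = Counter(ts // width for ts in timestamps)
    per_window = list(counts.values())
    return [mean(per_window), pstdev(per_window), max(per_window)]

def feature_matrix(timestamps):
    """Stack one row per time-window granularity into a 2-D matrix."""
    return [window_row(timestamps, w) for w in WINDOW_SECONDS]

matrix = feature_matrix([0, 10, 20, 70, 400, 4000])
```

Here `matrix` has one row per granularity and one column per statistic, so it can be treated like a small single-channel image by a convolutional network.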
Step 3: classification prediction training.
The feature matrix data set obtained after feature conversion is taken as input, and a suitable convolutional neural network structure is selected for training. In view of the imbalance of the data set, threshold moving is introduced at the output layer to build a cost-sensitive classifier model.
As shown in Fig. 3, the convolutional neural network is applied to the fraud detection scenario. Different network structures are tested experimentally with respect to convolution kernel size and number, pooling strategy, activation function, and so on, to determine the network structure. In addition, since data set imbalance and unequal misclassification costs coexist, a cost-sensitive mechanism is introduced in the classification training stage. The specific steps are:
3-1) Determine the structure of the convolutional neural network. In general the structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, fully connected layer, and finally the output. The convolutional layer computes on its input with different convolution kernels; the convolution is computed as:
$$x_j^{l} = f\Big(\sum_{i} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$
where $*$ is the convolution operator, $x_j^{l}$ and $x_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1, respectively, $k_{ij}^{l}$ is the convolution kernel from $x_i^{l-1}$ to $x_j^{l}$, $b_j^{l}$ is the bias term, and $f(\cdot)$ is a nonlinear activation function.
All hidden layers of the present invention use the rectified linear unit (ReLU) as the activation function. Smaller filters allow more nonlinear mappings and yield more abstract features, so the extracted features are more expressive; they also introduce fewer parameters. The present invention therefore stacks several convolutional layers with small kernels before each pooling layer. To preserve boundary information and keep the dimensions the same before and after convolution, zero padding is applied before each convolution. The pooling layer performs downsampling: each pooling leaves the depth of the feature maps unchanged and reduces dimensionality by discarding unimportant features. The general expression for pooling is:
$$x_j^{l} = \mathrm{pool}\big(x_j^{l-1}\big)$$
where $x_j^{l}$ and $x_j^{l-1}$ are the output and input of the pooling layer and $\mathrm{pool}(\cdot)$ is the pooling function; max pooling and mean pooling are currently the most common. The pooling layers of the present invention use max pooling to fully extract the salient features of the different convolutional feature maps, and the fully connected layers (FC1, FC2, FC3) map the feature maps produced by the convolutional, pooling, and activation layers to a fixed-length feature vector. In addition, to improve the generalization ability of the model and reduce overfitting, a further layer is added after the fully connected layers. At the output layer, the fixed-length feature vector is fed to the Softmax function for classification prediction.
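The convolution, ReLU, zero padding, and max pooling described in step 3-1) can be sketched in plain Python for a single feature map and a single kernel. This is an illustrative toy under stated assumptions: the 4×4 input, the identity kernel, and the 2×2 pooling size are hypothetical, and, as in most deep learning libraries, the "convolution" is implemented as cross-correlation.

```python
def conv2d_same(image, kernel, bias=0.0):
    """Zero-padded 'same' cross-correlation (odd-sized kernel) followed by ReLU."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    # Positions outside the image count as zeros (zero padding).
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = max(0.0, acc)  # ReLU activation
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, w - size + 1, size)]
            for r in range(0, h - size + 1, size)]

image = [[1, 2, 0, 1], [0, 1, 3, 1], [2, 0, 1, 0], [1, 1, 0, 2]]
kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # identity kernel, for easy checking
fmap = conv2d_same(image, kernel)
pooled = max_pool(fmap)
```

Because of the zero padding, `fmap` keeps the 4×4 shape of the input, and pooling halves each spatial dimension while keeping the largest activation in each window.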
3-2) Introduce a cost-sensitive mechanism when performing classification prediction with the Softmax function. Threshold moving shifts the decision boundary of the non-cost-sensitive neural network toward the class with the lower cost, so that samples of the higher-cost class are less likely to be misclassified; that is, the probability that sample x is classified as class j is changed, and backpropagation is carried out with minimizing the misclassification cost as the objective. The misclassification cost $O_j^{*}$ of classifying sample x as class j is calculated as:
$$O_j^{*} = \eta \sum_{c=1}^{C} a_j \,\mathrm{Cost}[j, c]$$
where $\eta$ is a normalization term chosen so that $\sum_{j} O_j^{*} = 1$, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, $\mathrm{Cost}[j, c]$ is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.
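A small numerical sketch may help show how threshold moving changes a decision. It reweights Softmax probabilities $a_j$ by the row sums of a cost matrix and renormalizes, following the description in step 3-2). The two-class cost matrix below (misclassifying fraud costs five times as much as misclassifying a legitimate publisher) is a hypothetical example, and the sketch covers only the output adjustment, not the backpropagation step.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_moved(probs, cost):
    """Scale a_j by sum_c Cost[j][c], then normalize so the outputs sum to 1."""
    raw = [a * sum(cost[j]) for j, a in enumerate(probs)]
    eta = 1.0 / sum(raw)  # normalization term
    return [eta * r for r in raw]

# Hypothetical costs: COST[j][c] is the cost of misclassifying class j as class c.
COST = [[0.0, 1.0],   # class 0 = legitimate publisher
        [5.0, 0.0]]   # class 1 = fraudulent publisher
probs = softmax([1.0, 0.0])          # plain network favors class 0
adjusted = threshold_moved(probs, COST)  # cost-adjusted outputs favor class 1
```

In this example the unadjusted probabilities favor the legitimate class, while the adjusted outputs favor the fraud class, which is the boundary shift toward the lower-cost class that the text describes.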
Once classification prediction training is complete, the trained classifier is used to detect advertisement click fraud. A new data set can be used as the test set, and performance is evaluated with several evaluation metrics.
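The patent does not name the evaluation metrics it uses. As one plausible choice for an imbalanced fraud data set, the sketch below computes precision, recall, and F1 for the fraud class; the labels (1 = fraud, 0 = legitimate) and the sample predictions are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

Plain accuracy is usually a poor guide when fraudulent publishers are rare, which is why class-specific metrics like these are a common choice.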
Embodiment 2:
Another embodiment of the present invention works on essentially the same principle as Embodiment 1 and uses essentially the same steps for converting the one-dimensional features into a two-dimensional feature matrix and for the classification training stage. The main differences are that entropy is introduced in the step of obtaining the one-dimensional feature data set, and that a sample construction stage with a cost-sensitive mechanism is added before the classification training stage. As shown in Fig. 4, this embodiment comprises the following steps:
Step 1: obtain the one-dimensional feature data set.
This embodiment uses the same data set as Embodiment 1, comprising a click data set and a publisher data set. Each record in the click data set is the click log entry of a network user, and each record in the publisher data set is the labeled profile of a publisher.
The raw data cannot reflect the overall click behavior and cannot be used directly to build the model. To capture the intrinsic characteristics of click-fraud publishers and build a better classifier, aggregate attributes must be derived after the data is organized, to serve as classification features for training the classifier and for the test set. The raw data is analyzed and, with reference to known click fraud features, statistical methods are used to extract features of the advertisement publishers from multiple angles, yielding a one-dimensional feature data set for each publisher. In extracting publisher features, besides referring to known click fraud features and applying simple statistics such as the mean and standard deviation, entropy is also introduced. The click entropy of the device type attribute is calculated as:
$$p_i = \frac{\mathrm{amount\_clicks\_deviceua}_i}{\mathrm{Total\_amount\_Clicks}}, \qquad \mathrm{Entropy\_deviceua} = -\sum_{i=1}^{l} p_i \log p_i$$
where amount_clicks_deviceua_i is the total number of clicks from the i-th device type, Total_amount_Clicks is the total number of clicks on the advertisements published by the same publisher over a past period, and $p_i$ is the probability that a network user clicks an advertisement of that publisher using device i. If the same publisher involves l different device types, Entropy_deviceua is the click entropy over those l devices. Entropy is largest when the outcomes are equally probable, i.e. when the samples are uniformly distributed. Fraudulent clicking concentrates clicks on certain devices, making the distribution markedly non-uniform and the entropy smaller.
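The click entropy is straightforward to compute from the list of device types seen in a publisher's clicks. The sketch below uses the natural logarithm, which is an assumption since the text does not state the base; the device labels are placeholders.

```python
import math
from collections import Counter

def click_entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Clicks spread evenly over four device types versus concentrated on one.
uniform = click_entropy(["a", "b", "c", "d"])
skewed = click_entropy(["a"] * 97 + ["b", "c", "d"])
```

The evenly spread publisher reaches the maximum entropy for four device types, log 4, while the publisher whose clicks come almost entirely from one device type scores much lower, matching the intuition that concentrated clicks are suspicious. The same function applies unchanged to other attributes such as country or advertisement identifier.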
Click entropy features based on attributes such as the clicker's country, the click URL, and the identifier of the clicked advertisement are defined in the same way.
Step 2: construct the feature matrix.
To better suit the convolutional neural network model, the one-dimensional features need to be converted into a feature matrix. The feature matrix is constructed using multi-granularity time windows and resembles a two-dimensional image input. As shown in Fig. 2, the feature matrix is constructed on the basis of time windows of multiple granularities; the specific steps are:
2-1) Analyze the click data set and publisher data set in the raw data, and divide them by click time into multiple fine-grained time window lengths T (e.g. first_15min (0-14), second_15min (15-29), third_15min (30-44), last_15min (45-59), night (0:00-5:59), morning (6:00-11:59), afternoon (12:00-17:59), evening (18:00-23:59), 3days, per min, per 5min, per 15min, per hour, per 3hours, per 6hours, per day) to better capture the temporal dynamics of click fraud, thereby constructing multiple one-dimensional features based on the time windows T.
2-2) Convert the constructed one-dimensional features into a two-dimensional feature matrix according to the time windows T; the matrix resembles a two-dimensional image and serves as input to the convolutional neural network.
Because fraud detection data sets are all extremely imbalanced, and data set imbalance coexists with unequal misclassification costs, this embodiment introduces a cost-sensitive mechanism in both the sample construction stage and the classification prediction training stage.
Step 3: construct training samples.
To address the imbalance of the click data, cost-based oversampling is performed in the sample construction stage to balance the data samples, which also expands the data set. The specific steps are:
3-1) Calculate the cost of each fraudulent publisher sample.
Near the classification decision boundary, fraudulent click samples have a higher probability of generating more synthetic fraud samples. The fraud training samples with higher cost can therefore be replicated. The cost of each fraudulent publisher sample is thus calculated before sample construction, using the following formula:
where $\mathrm{Cost}_i$ is the cost of the i-th fraudulent publisher sample and $d_{ij}$ is the distance between the i-th fraudulent publisher sample and the j-th publisher sample. The neighborhood of the i-th fraudulent publisher is limited by a function and a cutoff value C.
3-2) Generate new samples using the SMOTE method. First divide the fraudulent publishers into several clusters using the k-means algorithm, select two fraudulent samples with higher cost, $\mathrm{Pub}_1$ and $\mathrm{Pub}_2$, from a cluster according to their cost values as seed samples, and construct a new fraudulent publisher sample $\mathrm{Pub}_{new}$, calculated as:
$$\mathrm{Pub}_{new} = \mathrm{Pub}_2 + \mathrm{rand}(0, 1) \times (\mathrm{Pub}_1 - \mathrm{Pub}_2)$$
where rand(0, 1) is a random function that generates a random number in the interval (0, 1).
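The interpolation formula for the new sample can be sketched directly. The code below covers only the final interpolation between two already-chosen seed samples; the k-means clustering and the cost-based seed selection are omitted because the cost formula is not reproduced in this text. The two-feature seed vectors are hypothetical, and Python's `random()` draws from [0, 1) rather than the open interval (0, 1).

```python
import random

def synthesize(pub1, pub2, rng=random):
    """Return Pub_new = Pub_2 + r * (Pub_1 - Pub_2) for one random r in [0, 1)."""
    r = rng.random()
    return [b + r * (a - b) for a, b in zip(pub1, pub2)]

rng = random.Random(0)     # seeded generator so the sketch is repeatable
pub1 = [4.0, 10.0]         # hypothetical high-cost fraud seed sample
pub2 = [2.0, 6.0]          # hypothetical high-cost fraud seed sample
pub_new = synthesize(pub1, pub2, rng)
```

Because a single random factor is shared by all features, `pub_new` lies on the line segment between the two seeds, so the synthetic publisher stays inside the region of feature space already occupied by the cluster.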
Step 4: classification prediction training.
The balanced feature-matrix data set produced in the sample construction stage is used as input, and a suitable convolutional neural network structure is selected for training. Because the data set is imbalanced and different misclassifications carry different costs, threshold moving is introduced at the output layer to build a cost-sensitive classifier model.
As shown in Figure 3, a convolutional neural network is applied to the fraud detection scenario. Different network structures are tested experimentally with respect to convolution kernel size and number, pooling strategy, activation function, and so on, in order to determine the network structure. Because data set imbalance and unequal misclassification costs exist at the same time, a cost-sensitive mechanism is introduced in the classification training stage. The specific steps include:
4-1) Determine the structure of the convolutional neural network. In general the structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, fully connected layers, and finally the output. A convolutional layer computes on its input with different convolution kernels. The convolution computation is:
where the operator symbol denotes convolution, $F_j^l$ and $F_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1 respectively, the kernel is the one connecting $F_i^{l-1}$ to $F_j^l$, and the remaining terms are a bias term and a nonlinear activation function.
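A standard form of this computation, consistent with the definitions above, can be written as follows. Only $F_j^l$ and $F_i^{l-1}$ are taken from the text; the kernel symbol $k_{ij}^l$, the bias symbol $b_j^l$, the activation symbol $f$, and the operator symbol $\otimes$ are assumed notation.

```latex
F_j^{l} = f\Big( \sum_{i} F_i^{l-1} \otimes k_{ij}^{l} + b_j^{l} \Big)
```

Under this reading, each output feature map sums the convolved input feature maps of the previous layer, adds a bias, and applies the activation function.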
All hidden layers of the present invention use ReLU as the activation function. Smaller filters allow more nonlinear mappings and yield more abstract features, so the extracted features are more expressive; they also reduce the number of parameters introduced. The present invention therefore uses a structure in which several convolutional layers with small kernels are stacked and then followed by pooling. To preserve boundary information, zero padding is applied before each convolution so that the dimensions are the same before and after convolution. The pooling layer performs down-sampling: the depth of the feature maps is unchanged by each pooling step, and dimensionality is reduced by discarding unimportant features. The general expression of pooling is:
where the two terms are the input and output of the pooling layer and the function is the pooling function. Max pooling and mean pooling are currently the most common. The pooling layers of the present invention use max pooling to fully extract the salient features of the different convolutional feature maps. The fully connected layers (FC1, FC2, FC3) map the feature maps produced by the convolutional, pooling, and activation layers to a fixed-length feature vector. In addition, to improve the generalization ability of the model and reduce over-fitting, an output layer is added after the fully connected layers, in which the fixed-length feature vector is fed into the Softmax function for classification prediction.
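The convolution, activation, and pooling steps described above can be sketched in plain Python. This is an illustrative sketch only: the 3x3 kernel, stride of 1, 2x2 pooling window, and the input values are assumptions, and the function names are hypothetical rather than taken from the patent.

```python
def conv2d_same(image, kernel):
    """Cross-correlate a 2-D list with an odd-sized square kernel, zero-padding the border."""
    h, w, k = len(image), len(image[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in range(k):
                for dc in range(k):
                    rr, cc = r + dr - pad, c + dc - pad
                    if 0 <= rr < h and 0 <= cc < w:  # positions outside the image count as zero
                        total += image[rr][cc] * kernel[dr][dc]
            out[r][c] = total
    return out

def relu(fmap):
    """Apply ReLU element-wise."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(w)] for r in range(h)]

# Hypothetical 4x4 feature matrix and 3x3 kernel.
x = [[1, 2, 0, 1],
     [0, 1, 3, 1],
     [2, 0, 1, 0],
     [1, 1, 0, 2]]
k = [[0, 0, 0],
     [0, 1, 0],
     [0, 0, -1]]

conv = conv2d_same(x, k)        # same height and width as x, thanks to zero padding
pooled = max_pool(relu(conv))   # each spatial dimension is halved
```

Zero padding keeps the 4x4 shape through the convolution, and the 2x2 max pooling then reduces it to 2x2 while keeping the largest activation in each window.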
4-2) A cost-sensitive mechanism is introduced when the Softmax function is used for classification prediction. The threshold-moving method shifts the decision boundary of the non-cost-sensitive neural network toward the class with the lower cost, so that samples of the higher-cost class are less likely to be misclassified. That is, the probability that sample x is assigned to class j is modified, and backpropagation is performed with minimum misclassification cost as the objective. The misclassification cost of sample x being classified as class j is computed as:
where η is a normalization term, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, Cost[j, c] is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.
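One common form of threshold moving, consistent with the definitions above, rescales each Softmax probability $a_j$ by $\sum_c Cost[j, c]$ and then renormalizes with η so that the adjusted values sum to 1. The sketch below assumes that form; it is not confirmed as the exact formula of the patent, it covers only the output adjustment and not the backpropagation step, and the function names, logits, and cost matrix are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def threshold_move(probs, cost):
    """Scale each a_j by sum_c cost[j][c], then renormalize so the result sums to 1."""
    weighted = [a * sum(cost[j]) for j, a in enumerate(probs)]
    eta = 1.0 / sum(weighted)  # normalization term
    return [eta * w for w in weighted]

# Hypothetical two-class setup: class 0 = normal publisher, class 1 = fraud publisher.
# cost[j][c] is the cost of misclassifying a class-j sample as class c (diagonal is 0).
cost = [[0.0, 1.0],
        [5.0, 0.0]]

probs = softmax([1.0, 0.0])             # plain Softmax favours class 0
adjusted = threshold_move(probs, cost)  # cost weighting favours the costly class 1

pred_plain = probs.index(max(probs))
pred_cost = adjusted.index(max(adjusted))
```

In this example the plain Softmax output picks the normal class, while the cost-adjusted output picks the fraud class, because missing a fraud publisher is assumed to be five times as costly as a false alarm.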
After classification prediction training is complete, the trained classifier is used for advertisement click fraud detection. A new data set can be used as the test set, and performance is evaluated with several evaluation metrics.
Although the present invention has been disclosed above by way of preferred embodiments, the embodiments are not intended to limit the invention. Any equivalent change or modification made without departing from the spirit and scope of the present invention also falls within the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the claims of this application.

Claims (10)

1. An advertisement click fraud detection method based on a cost-sensitive convolutional neural network, characterized by comprising:
a step of obtaining a one-dimensional feature data set: analyzing raw data comprising a click data set and a publisher data set, extracting features of advertisement publishers by statistical methods, and obtaining a one-dimensional feature data set for each publisher;
a step of constructing a feature matrix: converting the one-dimensional features into a feature matrix using multi-granularity time windows;
a step of classification prediction training: taking the feature-matrix data set as input and selecting a convolutional neural network structure for classification prediction training; introducing a cost-sensitive mechanism at the output layer and performing backpropagation using threshold moving;
performing advertisement click fraud detection using the trained classification prediction.
2. The advertisement click fraud detection method based on a cost-sensitive convolutional neural network according to claim 1, characterized in that the step of constructing a feature matrix comprises:
dividing the click data set and the publisher data set in the raw data set into several fine-grained time-window lengths according to the click time attribute, and constructing the corresponding one-dimensional feature data sets based on the several time windows;
converting the one-dimensional feature data sets based on the several time windows into a two-dimensional feature matrix according to the time windows.
3. The advertisement click fraud detection method based on a cost-sensitive convolutional neural network according to claim 1, characterized in that the convolutional neural network structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, and fully connected layers;
the convolutional layer computes on its input with different convolution kernels, the convolution computation being:
where the operator symbol denotes convolution, $F_j^l$ and $F_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1 respectively, the kernel is the one connecting $F_i^{l-1}$ to $F_j^l$, and the remaining terms are a bias term and a nonlinear activation function;
the pooling of the pooling layer is expressed as:
where the two terms are the input and output of the pooling layer and the function is the pooling function; the pooling layer uses max pooling to fully extract the salient features of the different convolutional feature maps;
the fully connected layers map the feature maps produced by the convolutional, pooling, and activation layers to a fixed-length feature vector;
the fixed-length feature vector is processed by the output layer following the fully connected layers to complete the classification prediction.
4. The advertisement click fraud detection method based on a cost-sensitive convolutional neural network according to claim 1, characterized in that introducing a cost-sensitive mechanism at the output layer and performing backpropagation using threshold moving means:
the output layer performs classification prediction using the Softmax function; the threshold-moving method shifts the decision boundary of the non-cost-sensitive neural network toward the class with the lower cost, changing the probability with which a sample is misclassified, and backpropagation is performed with minimum sample misclassification cost as the objective; the sample misclassification cost is computed as:
where the computed quantity is the misclassification cost of sample x being classified as class j, η is a normalization term, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, Cost[j, c] is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.
5. An advertisement click fraud detection method based on a cost-sensitive convolutional neural network, characterized by comprising:
a step of obtaining a one-dimensional feature data set: analyzing raw data comprising a click data set and a publisher data set, extracting features of advertisement publishers by statistical methods, and obtaining a one-dimensional feature data set for each publisher;
a step of constructing a feature matrix: converting the one-dimensional features into a feature matrix using multi-granularity time windows;
a step of constructing training samples: balancing the data set using a cost-based oversampling method;
a step of classification prediction training: taking the feature-matrix data set as input and selecting a convolutional neural network structure for classification prediction training; introducing a cost-sensitive mechanism at the output layer and performing backpropagation using threshold moving;
performing advertisement click fraud detection using the trained classification prediction.
6. The advertisement click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that extracting features of advertisement publishers by statistical methods includes extracting a click entropy feature.
7. The advertisement click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that the step of constructing a feature matrix comprises:
dividing the click data set and the publisher data set in the raw data set into several fine-grained time-window lengths according to the click time attribute, and constructing the corresponding one-dimensional feature data sets based on the several time windows;
converting the one-dimensional feature data sets based on the several time windows into a two-dimensional feature matrix according to the time windows.
8. The advertisement click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that the step of constructing training samples comprises:
3-1) computing the cost of each fraud publisher sample, the specific formula being:
where $Cost_i$ is the cost of each fraud publisher sample, $d_{ij}$ is the distance between the i-th fraud publisher sample and the j-th publisher sample, and the neighborhood of the i-th fraud publisher is delimited by a function and the cutoff value C;
3-2) dividing the fraud publishers into several clusters using the k-means algorithm, selecting within a cluster, according to their cost values, two fraud samples with higher cost, $Pub_1$ and $Pub_2$, as seed samples, and constructing a new fraud publisher sample $Pub_{new}$, where $Pub_{new}$ is computed as:
$Pub_{new} = Pub_2 + \mathrm{rand}(0,1) \times (Pub_1 - Pub_2)$
where rand(0,1) is a function that generates a random number between 0 and 1.
9. The advertisement click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that the convolutional neural network structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, and fully connected layers;
the convolutional layer computes on its input with different convolution kernels, the convolution computation being:
where the operator symbol denotes convolution, $F_j^l$ and $F_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1 respectively, the kernel is the one connecting $F_i^{l-1}$ to $F_j^l$, and the remaining terms are a bias term and a nonlinear activation function;
the pooling of the pooling layer is expressed as:
where the two terms are the input and output of the pooling layer and the function is the pooling function; the pooling layer uses max pooling to fully extract the salient features of the different convolutional feature maps;
the fully connected layers map the feature maps produced by the convolutional, pooling, and activation layers to a fixed-length feature vector;
the fixed-length feature vector is processed by the output layer following the fully connected layers to complete the classification prediction.
10. The advertisement click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that introducing a cost-sensitive mechanism at the output layer and performing backpropagation using threshold moving means:
the output layer performs classification prediction using the Softmax function; the threshold-moving method shifts the decision boundary of the non-cost-sensitive neural network toward the class with the lower cost, changing the probability with which a sample is misclassified, and backpropagation is performed with minimum sample misclassification cost as the objective; the sample misclassification cost is computed as:
where the computed quantity is the misclassification cost of sample x being classified as class j, η is a normalization term, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, Cost[j, c] is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.
CN201810951569.4A 2018-08-20 2018-08-20 Advertisement click fraud detection method based on cost-sensitive convolutional neural network Active CN109191191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810951569.4A CN109191191B (en) 2018-08-20 2018-08-20 Advertisement click fraud detection method based on cost-sensitive convolutional neural network

Publications (2)

Publication Number Publication Date
CN109191191A (en) 2019-01-11
CN109191191B (en) 2022-04-26

Family

ID=64918635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810951569.4A Active CN109191191B (en) 2018-08-20 2018-08-20 Advertisement click fraud detection method based on cost-sensitive convolutional neural network

Country Status (1)

Country Link
CN (1) CN109191191B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103996088A (en) * 2014-06-10 2014-08-20 苏州工业职业技术学院 Advertisement click-through rate prediction method based on multi-dimensional feature combination logical regression
CN107886344A (en) * 2016-09-30 2018-04-06 北京金山安全软件有限公司 Convolutional neural network-based cheating advertisement page identification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUONAN LIU et al.: "Dislocated Time Series Convolutional Neural Architecture: An Intelligent Fault Diagnosis Approach for Electric Machine", IEEE Transactions on Industrial Informatics *
PAN ZHIHUI et al.: "Research on warning classification based on cost-sensitive neural networks", Computer Engineering & Science *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110570217A (en) * 2019-09-10 2019-12-13 北京百度网讯科技有限公司 cheating detection method and device
CN110570217B (en) * 2019-09-10 2022-10-14 北京百度网讯科技有限公司 Cheating detection method and device
CN110956209A (en) * 2019-11-28 2020-04-03 上海风秩科技有限公司 Model training and predicting method, device, electronic equipment and storage medium
CN110956209B (en) * 2019-11-28 2024-03-26 上海秒针网络科技有限公司 Model training and predicting method and device, electronic equipment and storage medium
CN111325579A (en) * 2020-02-25 2020-06-23 华南师范大学 Advertisement click rate prediction method
CN111429215A (en) * 2020-03-18 2020-07-17 北京互金新融科技有限公司 Data processing method and device
CN111429215B (en) * 2020-03-18 2023-10-31 北京互金新融科技有限公司 Data processing method and device
CN111612531A (en) * 2020-05-13 2020-09-01 宁波财经学院 Click fraud detection method and system
CN111612531B (en) * 2020-05-13 2024-05-10 宁波财经学院 Click fraud detection method and system
CN112258254A (en) * 2020-12-21 2021-01-22 中国人民解放军国防科技大学 Internet advertisement risk monitoring method and system based on big data architecture
CN113191809A (en) * 2021-05-06 2021-07-30 上海交通大学 Mobile advertisement click fraud detection method, system, terminal and medium
CN113191809B (en) * 2021-05-06 2022-08-09 上海交通大学 Mobile advertisement click fraud detection method, system, terminal and medium

Also Published As

Publication number Publication date
CN109191191B (en) 2022-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant