CN109191191A

CN109191191A - Advertisement click fraud detection method based on cost-sensitive convolutional neural network

Info

Publication number: CN109191191A
Application number: CN201810951569.4A
Authority: CN
Inventors: 张欣; 刘学军; 毛宇佳; 李斌; 徐新艳
Original assignee: Nanjing Tech University
Current assignee: Nanjing Tech University
Priority date: 2018-08-20
Filing date: 2018-08-20
Publication date: 2019-01-11
Anticipated expiration: 2038-08-20
Also published as: CN109191191B

Abstract

The invention discloses an advertisement click fraud detection method based on a cost-sensitive convolutional neural network, and belongs to the technical field of data processing. The method comprises: obtaining a one-dimensional feature data set, in which raw data comprising a click data set and a publisher data set are analyzed and features of advertisement publishers are extracted by statistical methods to obtain a one-dimensional feature data set for each publisher; constructing a feature matrix, in which the one-dimensional features are converted into a feature matrix using multi-granularity time windows; and classification prediction training, in which the feature matrix data set is taken as input, a convolutional neural network structure is selected for classification prediction training, a cost-sensitive mechanism is introduced in the output layer, and back-propagation is carried out using threshold moving. The method is well suited to current advertisement click fraud detection, achieves good detection capability, improves detection accuracy and model training efficiency, and is highly practical.

Description

Ad click fraud detection method based on cost-sensitive convolutional neural networks

Technical field

The invention belongs to the technical field of data processing, and in particular relates to an ad click fraud detection method based on a cost-sensitive convolutional neural network.

Background technique

With the spread of smart devices, the advertising market has grown rapidly. According to a McKinsey Global report, digital advertising spending already accounts for 51% of the overall digital market and is expected to reach 70% in 2019. Pay per click (Cost Per Click, CPC) is currently the main charging model for mobile advertising; Baidu, Google, Yahoo and others all adopt the CPC model. However, the CPC bid-ranking business model has also objectively encouraged click fraud by ad publishers: many network companies, that is, ad publishers, repeatedly click on the ads published on their own websites to bring themselves more advertising revenue. Almost 30% of mobile advertising revenue is consumed by various kinds of ad fraud. One survey showed that during 2014 click fraud accounted for almost 14.6% of total online advertising spending, and 75% of ad publishers stated that they had been affected by click fraud. Click fraud causes serious losses of advertising budgets and may put the entire online advertising market in crisis.

The pay-per-click advertising payment model has already formed a complete technical model and business model, and it is unrealistic to expect online advertisers and publishers to abandon it. An effective and stable click fraud detection and prevention mechanism is therefore the key issue affecting the number of valid clicks on Internet advertising. Techniques that inflate click volume through automated clicking, such as malicious code and botnets, are very common; they can be monitored by professional third-party monitoring agencies or blocked with graphical CAPTCHAs. However, the methods and models used by third parties are not public, and their effectiveness is not sufficiently transparent. Meanwhile, fraudsters keep devising new click fraud techniques in pursuit of higher profits, so online advertising systems need more accurate and effective click fraud detection methods. Researchers at home and abroad have sought various methods to identify and prevent click fraud in advertising, including fraud detection techniques such as knowledge discovery and expert systems. More commonly, however, detection methods identify fraudulent behavior from massive Internet data based on user click behavior. Click fraud has complex patterns, and manual fraudulent clicks resemble normal clicks, so the features of fraudulent clicks exhibit a certain weak randomness and are difficult to detect. Ensemble methods can improve the generalization ability of an algorithm and are increasingly common in the detection field, but they can make the model overly complex and may cause overfitting.

Summary of the invention

The object of the present invention is to provide an ad click fraud detection method based on a cost-sensitive convolutional neural network. The method applies a convolutional neural network structure to the mobile ad click fraud detection scenario; it can effectively address click fraud, reduce model complexity, avoid model overfitting, and achieve multi-class classification without increasing model complexity.

Specifically, in one aspect, the present invention is realized by the following technical solution, comprising:

A step of obtaining a one-dimensional feature data set: raw data comprising a click data set and a publisher data set are analyzed, and features of ad publishers are extracted using statistical methods to obtain a one-dimensional feature data set for each publisher;

A step of constructing a feature matrix: the one-dimensional features are converted into a feature matrix using multi-granularity time windows;

A step of classification prediction training: the feature matrix data set is taken as input, and a convolutional neural network structure is selected for classification prediction training; a cost-sensitive mechanism is introduced in the output layer, and back-propagation is carried out using threshold moving;

Performing ad click fraud detection using the trained classification prediction.

Further, the step of constructing the feature matrix comprises:

Dividing the click data set and the publisher data set in the raw data set into several fine-grained time window lengths according to the click time attribute, and constructing the corresponding one-dimensional feature data sets based on the several time windows;

Converting the one-dimensional feature data sets based on the several time windows into a two-dimensional feature matrix according to the time windows.

Further, the convolutional neural network structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, and fully connected layer.

The convolutional layer computes on its input with different convolution kernels; the convolution is computed as:

$$F_j^l = f\left(\sum_i F_i^{l-1} \otimes w_{ij}^l + b_j^l\right)$$

where $\otimes$ is the convolution operator, $F_j^l$ and $F_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1 respectively, $w_{ij}^l$ is the convolution kernel from $F_i^{l-1}$ to $F_j^l$, $b_j^l$ is the bias term, and $f(\cdot)$ is a nonlinear activation function;

The pooling of the pooling layer is expressed as:

$$y = \mathrm{pool}(x)$$

where $x$ and $y$ are the input and output of the pooling layer and $\mathrm{pool}(\cdot)$ is the pooling function; the pooling layer uses max pooling to fully extract the salient features of the different convolutional feature maps;

The fully connected layer maps the feature maps produced by the convolutional, pooling and activation function layers to a fixed-length feature vector;

The fixed-length feature vector is processed by the output layer after the fully connected layer to complete the classification prediction.

Further, introducing the cost-sensitive mechanism in the output layer and carrying out back-propagation using threshold moving means:

The output layer performs classification prediction using the Softmax function; threshold moving shifts the decision boundary of the non-cost-sensitive neural network toward the class of samples with lower cost, changing the probability that a sample is misclassified, and back-propagation is carried out with minimizing the sample misclassification cost as the objective; the sample misclassification cost is calculated as:

$$a_j^* = \eta \sum_{c=1}^{C} a_j \,\mathrm{Cost}[j, c]$$

where $a_j^*$ is the misclassification cost of classifying sample x as class j, $\eta$ is a normalization term such that $\sum_{j=1}^{C} a_j^* = 1$, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, $\mathrm{Cost}[j, c]$ is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.

In another aspect, the present invention may also adopt the following technical solution, comprising:

A step of constructing training samples: the data set is balanced using a cost-based oversampling method;

Performing ad click fraud detection using the trained classification prediction.

Further, extracting the features of ad publishers using statistical methods includes extracting click entropy features.

Further, the step of constructing the feature matrix comprises:

Converting the one-dimensional feature data sets based on the several time windows into a two-dimensional feature matrix according to the time windows.

Further, the step of constructing training samples comprises:

3-1) Calculating the cost of each fraud publisher sample, using the following formula:

where Cost_i is the cost of each fraud publisher sample, d_ij is the distance between the i-th fraud publisher sample and the j-th publisher sample, and the neighborhood of the i-th fraud publisher is bounded by a function and a cutoff value C;

3-2) Dividing the fraud publishers into several clusters using the k-means algorithm, selecting from one cluster, according to the different cost values, the higher-cost fraud samples Pub₁ and Pub₂ as seed samples, and constructing a new fraud publisher sample Pub_new, calculated as:

Pub_new = Pub₂ + rand(0,1) × (Pub₁ - Pub₂)

where rand(0,1) is a function that generates a random number in the interval (0,1).

Further, the convolutional neural network structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, and fully connected layer;

The pooling of the pooling layer is expressed as:

The fully connected layer maps the feature maps produced by the convolutional, pooling and activation function layers to a fixed-length feature vector;

The beneficial effects of the present invention are as follows: the ad click fraud detection method based on a cost-sensitive convolutional neural network is well suited to current ad click fraud detection, achieves good detection capability, improves detection accuracy and model training efficiency, and is highly practical.

Brief description of the drawings

Fig. 1 is a schematic diagram of the system framework of Embodiment 1 of the present invention.

Fig. 2 is a schematic diagram of the feature matrix construction process of Embodiment 1 of the present invention.

Fig. 3 is a schematic diagram of the cost-sensitive convolutional neural network structure of Embodiment 1 of the present invention.

Fig. 4 is a schematic diagram of the system framework of Embodiment 2 of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the embodiments and the accompanying drawings.

Embodiment 1:

One embodiment of the present invention is an ad click fraud detection method based on a cost-sensitive convolutional neural network; its implementation process is shown in Fig. 1.

Step 1: Obtain the one-dimensional feature data set.

This embodiment uses the real mobile advertising click data set provided by a company for the FDMA2012 competition (The International Workshop on Fraud Detection in Mobile Advertising 2012; see http://palanteer.sis.smu.edu.sg/Fdma2012/). The data set comprises two parts: a click data set and a publisher data set. Each record in the click data set is the click log of a network user, and each record in the publisher data set is the profile information of a labeled publisher.

The raw data are analyzed and, with reference to existing ad click fraud features, features of ad publishers are extracted from multiple perspectives using statistical methods, yielding the one-dimensional feature data set of each publisher.

Step 2: Construct the feature matrix.

To better suit the convolutional neural network model, the one-dimensional features need to be converted into a feature matrix. The feature matrix is constructed using multi-granularity time windows and resembles a two-dimensional image input. As shown in Fig. 2, the feature matrix is constructed from time windows of multiple granularities; the specific steps include:

2-1) Analyze the click data set and the publisher data set in the raw data set, and divide them according to the click time attribute into multiple fine-grained time window lengths T (for example first_15min (0-14), second_15min (15-29), third_15min (30-44), last_15min (45-59), night (0:00-5:59), morning (6:00-11:59), afternoon (12:00-17:59), evening (18:00-23:59), 3days, per min, per 5min, per 15min, per hour, per 3hours, per 6hours, per day) to better capture the temporal dynamic distribution of click fraud behavior, thereby constructing multiple one-dimensional features based on the time windows T.

2-2) Convert the constructed one-dimensional features into a two-dimensional feature matrix according to the time windows T; the matrix resembles a two-dimensional image and serves as the input to the convolutional neural network.
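Steps 2-1) and 2-2) can be sketched roughly as follows. This is a minimal illustration, not the patent's actual feature set: it assumes click records arrive as (publisher_id, timestamp) pairs, uses only two of the window groups listed above (quarter of the hour and six-hour period of the day), and counts clicks as the only feature. The names `WINDOW_GROUPS`, `N_BINS` and `build_feature_matrices` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

import numpy as np

N_BINS = 4
# Hypothetical window groups; each function maps a timestamp to a bin index.
WINDOW_GROUPS = [
    lambda t: t.minute // 15,  # first_15min, second_15min, third_15min, last_15min
    lambda t: t.hour // 6,     # night, morning, afternoon, evening
]


def build_feature_matrices(clicks):
    """Return {publisher_id: 2-D array} with one row per window group."""
    matrices = defaultdict(lambda: np.zeros((len(WINDOW_GROUPS), N_BINS)))
    for publisher_id, timestamp in clicks:
        for row, bin_of in enumerate(WINDOW_GROUPS):
            # Assumed feature: the click count falling into each window.
            matrices[publisher_id][row, bin_of(timestamp)] += 1
    return dict(matrices)


clicks = [
    ("pub_a", datetime(2012, 3, 1, 0, 5)),
    ("pub_a", datetime(2012, 3, 1, 0, 20)),
    ("pub_a", datetime(2012, 3, 1, 13, 50)),
]
matrix = build_feature_matrices(clicks)["pub_a"]
```

Each row is one one-dimensional, window-based feature, and stacking the rows gives the image-like two-dimensional input. A fuller version would add one row per statistic and per window granularity.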

Step 3: Classification prediction training.

The feature matrix data set obtained after feature conversion is taken as input, and a suitable convolutional neural network structure is selected for training. In view of the imbalance of the data set, threshold moving is introduced in the output layer to construct a cost-sensitive classifier model.

As shown in Fig. 3, the convolutional neural network is applied to the fraud detection scenario. Different neural network structures are tested experimentally in terms of convolution kernel size and number, pooling layer strategy, activation function and so on, to determine the network structure. In addition, since data set imbalance and different misclassification costs exist at the same time, a cost-sensitive mechanism is introduced in the classification training stage; the specific steps include:

3-1) Determine the structure of the convolutional neural network. In general, the structure of a convolutional neural network is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, fully connected layer, and finally the output. The convolutional layer computes on its input with different convolution kernels; the convolution is computed as:

$$F_j^l = f\left(\sum_i F_i^{l-1} \otimes w_{ij}^l + b_j^l\right)$$

where $\otimes$ is the convolution operator, $F_j^l$ and $F_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1 respectively, $w_{ij}^l$ is the convolution kernel from $F_i^{l-1}$ to $F_j^l$, $b_j^l$ is the bias term, and $f(\cdot)$ is a nonlinear activation function.

All hidden layers in the present invention use the rectified linear unit (ReLU) as the activation function. Smaller filters allow more nonlinear mappings and yield more abstract features, so the extracted features are more expressive; they also reduce the number of parameters introduced. The present invention therefore uses a structure in which several convolutional layers with small kernels are stacked and then followed by pooling. To preserve boundary information and keep the dimensions the same before and after convolution, zero padding is applied before each convolution. The pooling layer performs down-sampling: each pooling leaves the depth of the feature maps unchanged and reduces dimensionality by removing unimportant features. The general expression of pooling is:

$$y = \mathrm{pool}(x)$$

where $x$ and $y$ are the input and output of the pooling layer and $\mathrm{pool}(\cdot)$ is the pooling function; max pooling and mean pooling are currently the most commonly used. The pooling layer of the present invention uses max pooling to fully extract the salient features of the different convolutional feature maps, and the fully connected layers (FC1, FC2, FC3) map the feature maps produced by the convolutional, pooling and activation function layers to a fixed-length feature vector. In addition, to improve the generalization ability of the model and reduce overfitting, an output layer is added after the fully connected layers; in the output layer, the fixed-length feature vector is fed into the Softmax function for classification prediction.
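The convolution, zero padding, ReLU and max pooling operations described above can be illustrated with a small NumPy sketch. It handles a single channel only, uses the cross-correlation form that most CNN frameworks call convolution, and assumes an odd kernel size and a 2x2 pooling window; the function names and the example kernel are illustrative and are not taken from the patent.

```python
import numpy as np


def conv2d_same(feature_map, kernel, bias=0.0):
    """Single-channel 2-D convolution (cross-correlation form) with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Zero padding keeps the output the same size as the input (odd kernels).
    padded = np.pad(feature_map, ((ph, ph), (pw, pw)))
    out = np.zeros(feature_map.shape, dtype=float)
    for r in range(feature_map.shape[0]):
        for c in range(feature_map.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel) + bias
    return out


def relu(x):
    """Rectified linear unit: negative activations are set to zero."""
    return np.maximum(x, 0.0)


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3)) / 9.0          # illustrative 3x3 averaging kernel
activated = relu(conv2d_same(x, k))
pooled = max_pool(activated)
```

The 4x4 input stays 4x4 after the padded convolution and becomes 2x2 after pooling, which is the down-sampling behaviour the text describes. A real network would learn many kernels per layer and sum over input feature maps.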

3-2) A cost-sensitive mechanism is introduced when performing classification prediction with the Softmax function. Threshold moving shifts the decision boundary of the non-cost-sensitive neural network toward the class of samples with lower cost, so that the risk of misclassifying samples of the higher-cost class is reduced; that is, the probability that sample x is classified as class j is changed, and back-propagation is carried out with minimizing the misclassification cost as the objective. The misclassification cost $a_j^*$ of classifying sample x as class j is calculated as:

$$a_j^* = \eta \sum_{c=1}^{C} a_j \,\mathrm{Cost}[j, c]$$

where $\eta$ is a normalization term such that $\sum_{j=1}^{C} a_j^* = 1$, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, $\mathrm{Cost}[j, c]$ is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.
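As a rough illustration of threshold moving, the sketch below assumes one common form of the technique: each Softmax probability is multiplied by the total cost of misclassifying that class and the result is renormalized. The cost matrix, the logits and the function names are made-up assumptions, and the sketch only adjusts the outputs at prediction time; it does not cover the cost-based back-propagation the text also mentions.

```python
import numpy as np


def softmax(logits):
    """Numerically stable Softmax over a 1-D array of logits."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()


def threshold_moving(probs, cost):
    """Rescale class probabilities by total misclassification cost, then renormalize.

    cost[j, c] is the cost of misclassifying a class-j sample as class c
    (the diagonal is assumed to be zero).
    """
    weighted = probs * cost.sum(axis=1)   # a_j * sum_c Cost[j, c]
    return weighted / weighted.sum()      # normalization so the outputs sum to 1


# Illustrative numbers only: class 0 = normal publisher, class 1 = fraud publisher.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])
probs = softmax(np.array([0.85, 0.0]))    # roughly [0.70, 0.30]
adjusted = threshold_moving(probs, cost)
```

With these numbers the unadjusted prediction is class 0, while the adjusted outputs favour class 1, because missing a fraud publisher is assumed to be five times as costly as a false alarm.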

After classification prediction training is completed, the trained classification prediction is used for ad click fraud detection. A new data set can be used as the test set, and performance can be evaluated with a variety of evaluation metrics.

Embodiment 2:

Another embodiment of the present invention is in principle essentially the same as Embodiment 1. It uses essentially the same steps as Embodiment 1 for converting the one-dimensional features into a two-dimensional feature matrix and for the classification training stage. The main differences are that entropy is introduced in the step of obtaining the one-dimensional feature data set, and that a sample construction stage incorporating a cost-sensitive mechanism is added before the classification training stage. As shown in Fig. 4, this embodiment comprises the following steps:

Step 1: Obtain the one-dimensional feature data set.

As in Embodiment 1, this embodiment uses a data set comprising two parts: a click data set and a publisher data set. Each record in the click data set is the click log of a network user, and each record in the publisher data set is the profile information of a labeled publisher.

The raw data cannot reflect overall click behavior and cannot be used directly to build the model. To capture the intrinsic characteristics of click-fraud publishers and build a better classifier model, aggregate attributes need to be derived after the data are organized; these serve as the classification features for classifier training and for the test set. The raw data are analyzed and, with reference to existing ad click fraud features, features of ad publishers are extracted from multiple perspectives using statistical methods to derive the one-dimensional feature data set of each publisher. When extracting ad publisher features, in addition to referring to existing ad click fraud features and using simple statistics such as the mean and standard deviation, the concept of entropy is also introduced. The click entropy of the device type attribute is calculated as:

$$p_i = \frac{\mathrm{amount\_clicks\_deviceua}_i}{\mathrm{Total\_amount\_clicks}}$$

$$\mathrm{Entropy\_deviceua} = -\sum_{i=1}^{l} p_i \log p_i$$

where amount_clicks_deviceua_i is the total number of clicks from the i-th device type, Total_amount_clicks is the total number of clicks on the ads published by the same publisher over a past period of time, and p_i is the probability that a network user uses device i to click an ad published by the same publisher. If the same publisher involves l different devices, Entropy_deviceua is the click entropy over those l devices. When the samples occur with equal probability, that is, when the sample distribution is uniform, the entropy is largest. Fraudulent clicking makes the click volume of certain devices disproportionately large, so the sample distribution becomes markedly non-uniform and the entropy is smaller.

Click entropy features based on attributes such as the clicker's country, the click URL, and the identifier of the clicked ad are defined in the same way.
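The click entropy feature can be computed with a few lines of Python. The sketch below assumes the clicks of one publisher are given as a list of attribute values (for example device types) and uses the natural logarithm, since the log base is not stated in the text; `click_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def click_entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Clicks spread evenly over two device types versus clicks dominated by one type.
uniform = click_entropy(["ios", "android", "ios", "android"])
skewed = click_entropy(["ios"] * 9 + ["android"])
```

The evenly spread clicks give the maximum entropy for two device types, while the skewed distribution gives a lower value, matching the observation that fraudulent traffic concentrated on a few devices lowers the entropy. The same helper applies unchanged to country, URL or ad identifier attributes.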

Step 2: Construct the feature matrix.

Since fraud detection data sets are all extremely imbalanced, and data set imbalance and different misclassification costs exist at the same time, this embodiment introduces a cost-sensitive mechanism in both the sample construction stage and the classification prediction training stage.

Step 3: Construct training samples.

To address the imbalance of the click data, a cost-based oversampling method is applied in the sample construction stage to balance the data samples, which also serves to expand the data set; the specific steps include:

3-1) Calculate the cost of each fraud publisher sample.

Fraudulent-click samples near the classification decision boundary have a higher probability of being used to generate more synthetic fraud samples, so fraud training samples with higher cost can be replicated. The cost of each fraud publisher sample is therefore calculated before sample construction, using the following formula:

where Cost_i is the cost of each fraud publisher sample and d_ij is the distance between the i-th fraud publisher sample and the j-th publisher sample. The neighborhood of the i-th fraud publisher is bounded by a function and a cutoff value C.

3-2) Generate new samples using the SMOTE method. First, the fraud publishers are divided into several clusters using the k-means algorithm; according to the different cost values, the higher-cost fraud samples Pub₁ and Pub₂ in a cluster are selected as seed samples, and a new fraud publisher sample Pub_new is constructed, calculated as:

Pub_new = Pub₂ + rand(0,1) × (Pub₁ - Pub₂)
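The interpolation formula for Pub_new can be sketched as below. The seed vectors and the function name `synthesize_sample` are made up for illustration; selecting the seeds by k-means clustering and sample cost is assumed to have happened beforehand and is not shown.

```python
import numpy as np


def synthesize_sample(pub1, pub2, rng):
    """Pub_new = Pub2 + rand(0, 1) * (Pub1 - Pub2): a random point between the two seeds."""
    return pub2 + rng.random() * (pub1 - pub2)


rng = np.random.default_rng(0)        # fixed seed, for repeatability only
pub1 = np.array([10.0, 0.2, 3.0])     # hypothetical feature vectors of two
pub2 = np.array([20.0, 0.6, 1.0])     # high-cost fraud publishers in one cluster
pub_new = synthesize_sample(pub1, pub2, rng)
```

Because a single random factor scales the whole difference vector, the new sample lies on the line segment between the two seed samples, so every feature of Pub_new stays within the range spanned by Pub₁ and Pub₂.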

Step 4: Classification prediction training.

The feature matrix data set that was balanced in the sample construction stage is taken as input, and a suitable convolutional neural network structure is selected for training. Since data set imbalance and different misclassification costs exist at the same time, threshold moving is introduced in the output layer to construct a cost-sensitive classifier model.

As shown in Fig. 3, the convolutional neural network is applied to the fraud detection scenario. Different neural network structures are tested experimentally in terms of convolution kernel size and number, pooling layer strategy, activation function and so on, to determine the network structure. Since data set imbalance and different misclassification costs exist at the same time, a cost-sensitive mechanism is introduced in the classification training stage; the specific steps include:

4-1) Determine the structure of the convolutional neural network. In general, the structure of a convolutional neural network is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, fully connected layer, and finally the output. The convolutional layer computes on its input with different convolution kernels; as in Embodiment 1, the convolution is computed as:

$$F_j^l = f\left(\sum_i F_i^{l-1} \otimes w_{ij}^l + b_j^l\right)$$

All hidden layers in the present invention use ReLU as the activation function. Smaller filters allow more nonlinear mappings and yield more abstract features, so the extracted features are more expressive; they also reduce the number of parameters introduced. The present invention therefore uses a structure in which several convolutional layers with small kernels are stacked and then followed by pooling. To preserve boundary information and keep the dimensions the same before and after convolution, zero padding is applied before each convolution. The pooling layer performs down-sampling: each pooling leaves the depth of the feature maps unchanged and reduces dimensionality by removing unimportant features. The general expression of pooling is:

$$y = \mathrm{pool}(x)$$

where $x$ and $y$ are the input and output of the pooling layer and $\mathrm{pool}(\cdot)$ is the pooling function. Max pooling and mean pooling are currently the most commonly used. The pooling layer of the present invention uses max pooling to fully extract the salient features of the different convolutional feature maps, and the fully connected layers (FC1, FC2, FC3) map the feature maps produced by the convolutional, pooling and activation function layers to a fixed-length feature vector. In addition, to improve the generalization ability of the model and reduce overfitting, an output layer is added after the fully connected layers; in the output layer, the fixed-length feature vector is fed into the Softmax function for classification prediction.

4-2) A cost-sensitive mechanism is introduced when performing classification prediction with the Softmax function. Threshold moving shifts the decision boundary of the non-cost-sensitive neural network toward the class of samples with lower cost, so that the risk of misclassifying samples of the higher-cost class is reduced; that is, the probability that sample x is classified as class j is changed, and back-propagation is carried out with minimizing the misclassification cost as the objective. As in Embodiment 1, the misclassification cost $a_j^*$ of classifying sample x as class j is calculated as:

$$a_j^* = \eta \sum_{c=1}^{C} a_j \,\mathrm{Cost}[j, c]$$

Although the present invention has been disclosed above by way of preferred embodiments, the embodiments are not intended to limit the invention. Any equivalent changes or modifications made without departing from the spirit and scope of the present invention also fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be as defined by the claims of this application.

Claims

1. An ad click fraud detection method based on a cost-sensitive convolutional neural network, characterized by comprising:

A step of obtaining a one-dimensional feature data set: raw data comprising a click data set and a publisher data set are analyzed, and features of ad publishers are extracted using statistical methods to obtain a one-dimensional feature data set for each publisher;

A step of classification prediction training: the feature matrix data set is taken as input, and a convolutional neural network structure is selected for classification prediction training; a cost-sensitive mechanism is introduced in the output layer, and back-propagation is carried out using threshold moving;

Performing ad click fraud detection using the trained classification prediction.

2. The ad click fraud detection method based on a cost-sensitive convolutional neural network according to claim 1, characterized in that the step of constructing the feature matrix comprises:

Dividing the click data set and the publisher data set in the raw data set into several fine-grained time window lengths according to the click time attribute, and constructing the corresponding one-dimensional feature data sets based on the several time windows;

3. The ad click fraud detection method based on a cost-sensitive convolutional neural network according to claim 1, characterized in that the convolutional neural network structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, and fully connected layer;

where $\otimes$ is the convolution operator, $F_j^l$ and $F_i^{l-1}$ are the j-th feature map of layer l and the i-th feature map of layer l-1 respectively, $w_{ij}^l$ is the convolution kernel from $F_i^{l-1}$ to $F_j^l$, $b_j^l$ is the bias term, and $f(\cdot)$ is a nonlinear activation function;

The pooling of the pooling layer is expressed as:

$$y = \mathrm{pool}(x)$$

where $x$ and $y$ are the input and output of the pooling layer and $\mathrm{pool}(\cdot)$ is the pooling function; the pooling layer uses max pooling to fully extract the salient features of the different convolutional feature maps;

The fully connected layer maps the feature maps produced by the convolutional, pooling and activation function layers to a fixed-length feature vector;

4. The ad click fraud detection method based on a cost-sensitive convolutional neural network according to claim 1, characterized in that introducing the cost-sensitive mechanism in the output layer and carrying out back-propagation using threshold moving means:

The output layer performs classification prediction using the Softmax function; threshold moving shifts the decision boundary of the non-cost-sensitive neural network toward the class of samples with lower cost, changing the probability that a sample is misclassified, and back-propagation is carried out with minimizing the sample misclassification cost as the objective; the sample misclassification cost is calculated as:

$$a_j^* = \eta \sum_{c=1}^{C} a_j \,\mathrm{Cost}[j, c]$$

where $a_j^*$ is the misclassification cost of classifying sample x as class j, $\eta$ is a normalization term such that $\sum_{j=1}^{C} a_j^* = 1$, $a_j = P(j \mid x)$ is the probability that sample x is classified as class j, $\mathrm{Cost}[j, c]$ is the cost of misclassifying a class-j sample as class c, and C is the total number of classes.

5. An ad click fraud detection method based on a cost-sensitive convolutional neural network, characterized by comprising:

Performing ad click fraud detection using the trained classification prediction.

6. The ad click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that extracting the features of ad publishers using statistical methods includes extracting click entropy features.

7. The ad click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that the step of constructing the feature matrix comprises:

8. The ad click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that the step of constructing training samples comprises:

where Cost_i is the cost of each fraud publisher sample, d_ij is the distance between the i-th fraud publisher sample and the j-th publisher sample, and the neighborhood of the i-th fraud publisher is bounded by a function and a cutoff value C;

3-2) Dividing the fraud publishers into several clusters using the k-means algorithm, selecting from one cluster, according to the different cost values, the higher-cost fraud samples Pub₁ and Pub₂ as seed samples, and constructing a new fraud publisher sample Pub_new, calculated as:

Pub_new = Pub₂ + rand(0,1) × (Pub₁ - Pub₂)

9. The ad click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that the convolutional neural network structure is: input layer, convolutional layer, pooling layer, repeated convolutional and pooling layers, and fully connected layer;

The pooling of the pooling layer is expressed as:

10. The ad click fraud detection method based on a cost-sensitive convolutional neural network according to claim 5, characterized in that introducing the cost-sensitive mechanism in the output layer and carrying out back-propagation using threshold moving means: