CN116310728A - Browser identification method based on CNN-Linformer model - Google Patents

Publication number
CN116310728A
Authority
CN
China
Prior art keywords
fingerprint
browser
cnn
linformer
fingerprints
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310311808.0A
Other languages
Chinese (zh)
Inventor
李小勇
谭韵
袁开国
高雅丽
李灵慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202310311808.0A
Publication of CN116310728A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12 Fingerprints or palmprints

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a browser identification method based on a CNN-Linformer model. The CNN extracts time-series features from the input data, the linear attention of the Linformer focuses on the important features to improve model accuracy, and multi-head attention processes the data in parallel to speed up training. The Linformer is simplified by removing its decoder, input embedding and positional encoding parts, which reduces model complexity and further improves computation speed. In addition, because prior-art methods lose accuracy by not comprehensively considering the various pseudo-fingerprint techniques, the invention applies two such techniques, randomizing attribute values and adding noise, to enhance the original dataset, and then trains the CNN-Linformer model on the enhanced dataset containing both normal fingerprints and pseudo-fingerprints. The method offers high accuracy and high recognition speed, and is robust in pseudo-fingerprint scenarios.

Description

Browser identification method based on CNN-Linformer model
Technical Field
The invention relates to the technical field of network information security, in particular to a browser identification method based on a CNN-Linformer model.
Background
Websites often need to track users' browsers for business reasons, including personalized advertising, personalized content and network security. Servers and browsers typically use cookies to carry information for a website's subsequent services. In recent years, however, cookie technology has raised more and more problems. For example, cookies are stored on the user's local machine, which can easily lead to loss of user information. As cookies become less effective, browser fingerprinting has increasingly become the new mainstream technology for browser tracking.
A browser fingerprint is the collection of all feature information about a browser that can be gathered through the browser. It includes information such as browser version, screen resolution, browser plug-ins, system fonts and time zone, which can be used to identify the user's browser. Browser fingerprinting relies on the uniqueness of the browsing device's fingerprint. However, browsing devices change rapidly and frequently, so a new fingerprint may differ considerably from previous ones. The uniqueness of a fingerprint alone is therefore insufficient for tracking, and accurately identifying a browser's fingerprint before and after a change becomes important.
Because a browser fingerprint is built from client information, it changes whenever the client configuration changes or the browser version is updated. To track a user over a long period, the user's constantly changing fingerprints must therefore be identified. Besides changes with objective causes such as browser updates, some information is deliberately altered by anti-fingerprint-tracking software, so the accuracy of browser fingerprint identification needs to be improved.
Traditional dynamic browser fingerprint identification methods use statistics and traditional machine learning algorithms, such as random forests, but their accuracy is not high. There are also dynamic browser fingerprinting methods based on deep learning algorithms, such as the recurrent neural network (RNN) and long short-term memory (LSTM). These algorithms, however, can only process input data sequentially and have difficulty with high-dimensional, sparse input such as browser fingerprints containing a large number of features; sequential processing also reduces the efficiency of the algorithm.
Disclosure of Invention
In order to identify fingerprints of a user with constantly changing browsers, improve the accuracy and efficiency of a fingerprint identification algorithm and solve the problem of generating pseudo fingerprints by anti-browser fingerprint tracking software, the invention provides a browser identification method based on a CNN-Linformer model, which expands an original data set by generating the pseudo fingerprints to form an enhanced data set to train the model, so that the robustness of the fingerprint identification algorithm is improved.
In order to achieve the above object, the present invention provides the following technical solutions:
The invention provides a browser identification method based on a CNN-Linformer model, which comprises the following steps:
For an unknown fingerprint $f_u$, each fingerprint chain $c_k = \langle f_{k,t}, f_{k,t-1}, f_{k,t-2}, f_{k,t-3} \rangle$ in the fingerprint chain set $C$ is first traversed. The unknown fingerprint $f_u$ and the fingerprint chain $c_k$ are passed to the input-vector generation algorithm, which produces an input vector $V_{input} = [I_t, I_{t-1}, \ldots, I_{t-i+1}]$, where $i$ is the length of the fingerprint chain, $t$ is the current timestamp, and $I_t = \mathrm{diff}\langle f_{t-1}, f_t \rangle$ represents the comparison of two fingerprints; diff yields a single feature vector $I = \langle x_1, x_2, \ldots, x_M \rangle$ consisting of $M$ features, where $x_n$ is the comparison result of the $n$-th feature of the two fingerprints. A fingerprint chain is formed by linking fingerprints with the same ID in order of acquisition time; each browser fingerprint has an ID, which associates it with the browser instance having the same ID, and each browser instance has one fingerprint chain. The fingerprint chain set $C$ contains all fingerprints linked to browser instances;
Then $V_{input}$ is input into the CNN-Linformer model: the CNN extracts time-series features of the input data, the linear attention of the Linformer focuses on important features, and multi-head attention computes on the data in parallel, yielding the probability $p$ that the unknown fingerprint $f_u$ belongs to the fingerprint chain $c_k$. If $p > \lambda$, the unknown fingerprint $f_u$ belongs to the fingerprint chain $c_k$: the ID of $c_k$ is assigned to $f_u$, $f_u$ is inserted at the end of $c_k$, and the fingerprint at the head of $c_k$ is deleted. If $p \le \lambda$, a new ID is generated and assigned to $f_u$, and a new fingerprint chain $\langle f_u \rangle$ is generated and added to the fingerprint chain set $C$, where $\lambda$ is a set probability threshold. The CNN-Linformer model comprises a reshaping layer, Bridge, which reshapes the output of the CNN to match the input size of the Linformer; the Linformer layer uses only 1 encoder block, and the output of the Linformer finally passes through a linear layer activated by Softmax to produce a binary classification result.
Further, in the CNN-Linformer model, the CNN module uses two convolution layers to extract time-series features of the input data, and the output of the last convolution layer is reshaped by the Bridge module into an input $V_{new}$ matching the Linformer module. The Linformer module uses only the encoder part of the Linformer and only 1 encoder block.
Further, the construction steps of the CNN-Linformer model are as follows:
(1) Selecting a dataset: a browser fingerprint dataset is taken as the original dataset $F_{raw}$;
(2) Data enhancement: pseudo-fingerprint techniques are used to enhance the original dataset $F_{raw}$, giving an enhanced dataset $F_{aug}$;
(3) Data preprocessing: the fingerprints in the enhanced dataset $F_{aug}$ are screened by rule to obtain a browser fingerprint dataset $F_{temp}$, and feature regularization is performed on $F_{temp}$;
(4) Feature selection: features are selected according to the information entropy of the fingerprint features, and the selected fingerprint features, together with the features obtained during data enhancement, form a plurality of fingerprint comparison vectors with a time-series relationship;
(5) Generating an input vector: the fingerprint comparison vectors with a time-series relationship are converted into a two-dimensional matrix $V_{input} = [I_t, I_{t-1}, \ldots, I_{t-i+1}]$ as the input vector of the CNN-Linformer model, where $i$ is the length of the fingerprint chain, $t$ is the current timestamp, and $I_t = \mathrm{diff}\langle f_{t-1}, f_t \rangle$ represents the comparison of two fingerprints; diff yields a single feature vector $I = \langle x_1, x_2, \ldots, x_M \rangle$ consisting of $M$ features, where $x_n$ is the comparison result of the $n$-th feature of the two fingerprints, and the diff method computes the absolute value of the difference between the two features;
(6) Training the CNN-Linformer model: positive and negative samples are constructed from the dataset $F_{temp}$ to form a new dataset $F$ for subsequent training and testing; 20% of the data in $F$ is randomly selected as the training set and the remaining 80% is used as the test set. The model is trained on the training data and the trained model is saved; the model then classifies the test data, and the classification results are analyzed.
Further, in step (2), the original dataset $F_{raw}$ is enhanced using pseudo-fingerprint techniques that randomize attribute values or add noise.
Further, in the step (3), the method for screening the fingerprints is as follows:
For the enhanced dataset $F_{aug}$, each browser fingerprint $f$ in $F_{aug}$ has a browser ID linking it to the browser instance to which it belongs; all fingerprints of that instance constitute a set $F_{id}$, where id is the ID of the browser instance. For each
Figure BDA0004148742250000042
if $F_{id}$ does not meet rule 1 and rule 2, all fingerprints $f \in F_{id}$ are deleted from $F_{aug}$, finally giving a new browser fingerprint dataset $F_{temp}$.
Wherein:
rule 1: if the number of all fingerprints associated with one browser instance is less than 6, deleting all fingerprints of the browser instance;
rule 2: if the operating system type attribute is not the same in all fingerprints associated with one browser instance, deleting all fingerprints of the browser instance.
Further, in the step (3), the feature regularization method is as follows:
for numerical features, min-max normalization is applied;
for Boolean features, a binary representation of 0 and 1 is used;
for string features, the string is converted into a numerical value and then processed with min-max normalization; strings that cannot be converted directly are mapped to numerical values by a hash algorithm from the hashlib library in Python and then processed with min-max normalization;
for canvases, the string is mapped to a numerical value by a hash algorithm from the hashlib library in Python and then processed with min-max normalization.
Further, in step (5), the steps of the input-vector generation algorithm are as follows:
First, an empty $V_{input}$ is generated. Second, each fingerprint $f_{k,i}$ in $c_k = \langle f_{k,t}, f_{k,t-1}, f_{k,t-2}, f_{k,t-3} \rangle$ is traversed, and $f_u$ is compared with $f_{k,i}$ to obtain the comparison result $I_i = \mathrm{diff}\langle f_u, f_{k,i} \rangle = \langle x_1, x_2, \ldots, x_M \rangle$, which is merged into $V_{input}$. If the number of fingerprints in the fingerprint chain $c_k$ is less than 4, zero vectors $V_{zero} = \langle 0, 0, \ldots, 0 \rangle$ are generated as padding. The finally generated two-dimensional input matrix $V_{input}$ is expressed as:
Figure BDA0004148742250000041
Further, in the Linformer module of the CNN-Linformer model, a Dropout module is placed after the last layer.
Compared with the prior art, the invention has the following beneficial effects:
(1) The algorithms used in existing browser fingerprint identification methods, such as random forests, RNNs and LSTM, are all sequential processing algorithms and can only process data in sequence. The invention designs a new browser identification model that combines a convolutional neural network (CNN) with a Linformer, and provides a browser fingerprint identification algorithm based on the CNN-Linformer model. The model calculates the probability that an unknown fingerprint belongs to a given fingerprint chain, can process multiple inputs at once, and is well suited to high-dimensional sparse input data. A new input-matrix construction algorithm converts multiple inputs into a two-dimensional matrix so that they can be passed into the model together for training or testing. The CNN extracts features from the time-series data, and the Linformer performs further computation to make the prediction. The core of the Linformer is a linear attention mechanism, which captures complex relationships and interactions in the data more efficiently. In addition, the Linformer is optimized by removing its decoder, input embedding and positional encoding parts; the reshaped CNN output is fed directly to the Linformer encoder, which reduces model complexity and further improves computation speed, and a Dropout layer is added after the Linformer encoder to prevent overfitting.
(2) Some anonymous browsers evade browser fingerprint tracking by changing their fingerprints with pseudo-fingerprint generation techniques, which lowers the success rate of fingerprint identification. To address this, the invention comprehensively analyzes common pseudo-fingerprint techniques and tools and uses them to enhance the original dataset. The enhanced dataset, containing both normal fingerprints and pseudo-fingerprints, is used to train the CNN-Linformer model, improving the robustness of browser fingerprint identification in pseudo-fingerprint scenarios.
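The linear attention referred to in (1) can be illustrated with a minimal single-head sketch. It assumes the commonly described Linformer formulation, in which keys and values are projected along the sequence axis from length n to a smaller length k before ordinary scaled dot-product attention. The projection matrices `e_proj` and `f_proj`, the helper names and the toy sizes are illustrative assumptions, not values taken from the patent.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def linformer_attention(q, k, v, e_proj, f_proj):
    """Single-head Linformer-style attention (illustrative sketch).

    q, k, v: n x d matrices; e_proj, f_proj: k_dim x n projection matrices
    (hypothetical names). Projecting K and V down to k_dim rows makes the
    score matrix n x k_dim instead of n x n.
    """
    d = len(q[0])
    k_small = matmul(e_proj, k)              # k_dim x d
    v_small = matmul(f_proj, v)              # k_dim x d
    scores = matmul(q, transpose(k_small))   # n x k_dim
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v_small)          # n x d

# Toy usage: n = 4 rows (one per comparison vector), d = 2, k_dim = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
proj = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
out = linformer_attention(q, q, q, proj, proj)
print(len(out), len(out[0]))
```

The cost of the score matrix grows with n × k_dim rather than n × n, which is the sense in which the attention is linear in the sequence length when k_dim is fixed.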
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to these drawings for a person having ordinary skill in the art.
Fig. 1 is a flowchart of a browser fingerprint identification method based on a CNN-Linformer model according to an embodiment of the present invention.
Fig. 2 is a flowchart of a browser fingerprint identification process according to an embodiment of the present invention.
Fig. 3 is a block diagram of a CNN-Linformer model according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to the browser identification method based on the CNN-Linformer model, as shown in Fig. 1, the original dataset is first processed and feature selection is performed, after which the data are divided into a training set and a test set. The model is trained on the training data and the trained model is saved; the model then classifies the test data, and the classification results are analyzed.
The specific browser fingerprint identification process is shown in Fig. 2. In a practical environment, every browser connected to a server is assigned an ID that uniquely identifies it; a browser with an ID is called a browser instance. In this method, each browser fingerprint has an ID that associates it with the browser instance having the same ID. Fingerprints with the same ID are linked in order of acquisition time to form a fingerprint chain, and each browser instance has one fingerprint chain. A set of fingerprint chains is maintained that contains all fingerprints linked to browser instances. For an unknown fingerprint $f_u$, the fingerprint matching (Fingerprint Matching) algorithm that identifies it is shown in Algorithm 1:
Figure BDA0004148742250000071
where $C$ is the set of fingerprint chains and $\lambda$ is the set probability threshold; in this method $\lambda = 0.5$. In Algorithm 1, each fingerprint chain $c_k = \langle f_{k,t}, f_{k,t-1}, f_{k,t-2}, f_{k,t-3} \rangle$ in $C$ is first traversed; $f_u$ and $c_k$ are passed to Algorithm 2 (see the part on generating the input vector for details) to generate an input vector $V_{input}$, and the CNN-Linformer model gives the probability $p$ that $f_u$ belongs to $c_k$. If $p > \lambda$, $f_u$ belongs to $c_k$: the ID of $c_k$ is assigned to $f_u$, $f_u$ is inserted at the end of $c_k$, and the fingerprint at the head of $c_k$ (i.e., the earliest one added) is deleted. If $p \le \lambda$, a new ID is generated and assigned to $f_u$, and a new fingerprint chain $\langle f_u \rangle$ is generated and added to $C$.
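The matching flow described for Algorithm 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `predict` callable is a hypothetical stand-in for the trained CNN-Linformer model, chains are stored oldest-first, the first chain whose probability exceeds λ is accepted, and the head is dropped only once a chain holds more than four fingerprints. The last two points are assumptions, since Algorithm 1 itself is only available as an image.

```python
from typing import Callable, Dict, List, Sequence

Fingerprint = Sequence[float]
CHAIN_LENGTH = 4   # chains hold four fingerprints, as in c_k
LAMBDA = 0.5       # probability threshold stated for this method

def match_fingerprint(
    f_u: Fingerprint,
    chains: Dict[int, List[Fingerprint]],
    predict: Callable[[Fingerprint, List[Fingerprint]], float],
    lam: float = LAMBDA,
) -> int:
    """Return the ID assigned to f_u, updating `chains` in place."""
    for chain_id, chain in chains.items():
        p = predict(f_u, chain)          # stand-in for the CNN-Linformer model
        if p > lam:
            chain.append(f_u)            # insert at the end of c_k
            if len(chain) > CHAIN_LENGTH:
                chain.pop(0)             # drop the earliest fingerprint (head)
            return chain_id
    new_id = max(chains, default=-1) + 1  # assumption: IDs are small integers
    chains[new_id] = [f_u]                # new chain <f_u> added to C
    return new_id

def toy_predict(f_u: Fingerprint, chain: List[Fingerprint]) -> float:
    """Hypothetical scorer: close to the newest fingerprint means 'same browser'."""
    latest = chain[-1]
    distance = sum(abs(a - b) for a, b in zip(f_u, latest)) / len(f_u)
    return 1.0 if distance < 0.1 else 0.0

chains = {0: [[0.0, 0.0]]}
print(match_fingerprint([0.05, 0.0], chains, toy_predict))  # joins chain 0
print(match_fingerprint([0.9, 0.9], chains, toy_predict))   # starts chain 1
```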
Compared with other browser identification methods, the browser fingerprint identification process designed here discards the rule-based pre-identification stage. Such rules are manually formulated and coarse-grained, so their recognition error rate is high. Moreover, pseudo-fingerprint techniques often falsify fingerprint features with high information entropy, which the rules usually assume to be unchanged, so pseudo-fingerprints are misidentified. The method therefore takes a deep learning algorithm as its core and removes the coarse-grained rules, which simplifies the fingerprint identification flow and improves identification accuracy.
The steps for constructing the CNN-Linformer model are described in detail below:
(1) Using the dataset: the method uses the browser fingerprint dataset published by the Spirals team in the FP-Stalker project on GitHub. The dataset contains 15000 browser fingerprints, which the method takes as the original dataset $F_{raw}$.
(2) Data enhancement: to evade browser fingerprint tracking, some anonymous browsers change their browser fingerprints using pseudo-fingerprint generation techniques, which lowers the success rate of fingerprint identification. To improve the robustness of browser fingerprint identification in pseudo-fingerprint scenarios, the measures that pseudo-fingerprint techniques take to prevent users from being tracked need to be simulated. We carried out a comprehensive analysis of common pseudo-fingerprint techniques and tools and used them to enhance the original dataset $F_{raw}$.
Common pseudo-fingerprint techniques fall into two classes: (I) tools such as FPGuard and PriVaricator make the information collected each time different by randomizing certain fingerprint feature values; (II) tools such as FPRandom make the information collected each time different by adding noise to image information such as WebGL and Canvas. For these two pseudo-fingerprint techniques, randomizing attribute values and adding noise, the data enhancement method in the embodiment of the invention can use the following means:
a. Tools such as FPGuard prevent Flash-based font enumeration by disabling Flash, but the absence of Flash can itself serve as fingerprint information. Flash-related attribute values, such as Flash fonts, Flash resolution and Flash language, can therefore be added as fingerprint features during feature screening. For ease of feature processing, however, only the easily digitized attribute Flash resolution can be selected as a fingerprint feature.
b. Tools such as DCB use real fingerprints as pseudo-fingerprints. Basic fingerprint features with relatively high information entropy, such as User-Agent, language and fonts, can be forged: real values are taken from the fingerprint features in the current fingerprint database and substituted for the fingerprint's own feature values to form a pseudo-fingerprint.
c. Tools such as PriVaricator return a random subset of the actual plug-in list by filtering individual entries of the navigator.plugins attribute. The union of the plugin attributes of all fingerprints in a fingerprint chain can be taken as the plug-in set of that chain, and some plug-ins randomly drawn from this set to form a subset, whose number of plug-ins must be smaller than that of the plug-ins of all the fingerprints. Finally, the subset replaces the plugin attribute of some fingerprints in the chain, thereby constructing pseudo-fingerprints.
d. Tools such as FireGloves limit the number of fonts that can be retrieved. A font subset can be constructed and the fingerprint forged in the same way as in c.
e. Tools such as FireGloves need to access browser functions indirectly through JavaScript functions, so whether such a pseudo-fingerprint tool is in use can be judged by whether a JavaScript object is used. Attribute values related to JavaScript objects, such as localStorage, can be added as fingerprint features during feature screening.
f. Tools such as FPRandom add noise to Canvas and WebGL fingerprints. While one browser fingerprint is being collected, two Canvas (WebGL) images can be created in succession and the generated image data checked for differences to determine whether Canvas noise is present. Because Canvas and WebGL images are too large, the difference can be computed at collection time and the result used as one of the browser fingerprint features. The result is named CanvasDiff (WebGLDiff) and takes the value 0 or 1, where 0 means the images are the same and 1 means they differ. When constructing pseudo-fingerprints, some browser instances are randomly selected and the CanvasDiff (WebGLDiff) of their fingerprints is set to 1; the rest are set to 0.
g. Fingerprint features such as Screen resolution, Timezone and Content-Encoding can be changed according to real data in the fingerprint library, but when a pseudo-fingerprint is constructed, these features cannot all be replaced at the same time within the same pseudo-fingerprint.
h. Whenever a pseudo-fingerprint is constructed, no more than 3 feature values may be replaced in any one pseudo-fingerprint.
Finally, we expand the original dataset $F_{raw}$ to 20000 fingerprints as the enhanced dataset $F_{aug}$.
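Two of the enhancement means above, the random plug-in subset of item c and the replacement limit of items g and h, can be sketched as follows. This is a minimal sketch under stated assumptions: fingerprints are plain dictionaries, the function and field names are hypothetical, and "smaller than the plug-ins of all the fingerprints" is read here as "strictly smaller than the union of the chain's plug-ins".

```python
import random

def forge_plugin_subset(chain_plugins, rng):
    """Item c: draw a random strict subset of the union of a chain's plug-ins.

    chain_plugins: list of sets of plug-in names, one set per fingerprint.
    """
    union = sorted(set().union(*chain_plugins))
    if not union:
        return set()
    size = rng.randint(0, len(union) - 1)   # strictly smaller than the union
    return set(rng.sample(union, size))

def forge_by_replacement(fingerprint, library, candidates, rng, max_replaced=3):
    """Items g and h: copy real values from other fingerprints into at most
    `max_replaced` of the candidate features; the original is left untouched."""
    forged = dict(fingerprint)
    count = rng.randint(1, min(max_replaced, len(candidates)))
    for name in rng.sample(candidates, count):
        donor = rng.choice(library)          # real data from the fingerprint library
        forged[name] = donor[name]
    return forged

rng = random.Random(0)
plugins = [{"pdf", "flash"}, {"pdf", "java"}, {"pdf"}]
print(forge_plugin_subset(plugins, rng))

library = [
    {"screen": "1920x1080", "timezone": -60, "encoding": "gzip", "language": "fr"},
    {"screen": "1366x768", "timezone": 480, "encoding": "br", "language": "zh"},
]
original = {"screen": "1440x900", "timezone": 0, "encoding": "deflate", "language": "en"}
print(forge_by_replacement(original, library, list(original), rng))
```

Passing a seeded `random.Random` instance keeps the generated pseudo-fingerprints reproducible, which helps when the enhanced dataset has to be rebuilt.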
(3) Data preprocessing:
a. Screening fingerprints. To train the CNN-Linformer model better, the fingerprints in the dataset need to be restricted according to rules. For the enhanced dataset $F_{aug}$, each browser fingerprint $f$ in $F_{aug}$ has a browser ID linking it to the browser instance to which it belongs; all fingerprints of that instance constitute a set $F_{id}$, where id is the ID of the browser instance. For each
Figure BDA0004148742250000091
if $F_{id}$ does not meet rule 1 and rule 2, all fingerprints $f \in F_{id}$ are deleted from $F_{aug}$. Finally, a new browser fingerprint dataset $F_{temp}$ is obtained. Rule 1 and rule 2 are as follows:
rule 1. If the number of all fingerprints associated with a browser instance is less than 6, then all fingerprints for that browser instance are deleted.
Rule 2. If the operating system type attribute is not the same in all fingerprints associated with a browser instance, then all fingerprints for that browser instance are deleted.
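The screening step with rule 1 and rule 2 can be sketched as follows. It is a minimal illustration that assumes each fingerprint is a dictionary with hypothetical `id` and `os` fields; the real dataset's field names are not given in the text.

```python
from collections import defaultdict

MIN_FINGERPRINTS = 6  # rule 1 threshold from the text

def screen_fingerprints(f_aug):
    """Return F_temp: fingerprints of browser instances that survive both rules."""
    by_id = defaultdict(list)
    for fp in f_aug:
        by_id[fp["id"]].append(fp)          # F_id: all fingerprints of one instance

    f_temp = []
    for fingerprints in by_id.values():
        too_few = len(fingerprints) < MIN_FINGERPRINTS            # rule 1
        mixed_os = len({fp["os"] for fp in fingerprints}) > 1     # rule 2
        if not (too_few or mixed_os):
            f_temp.extend(fingerprints)
    return f_temp

f_aug = (
    [{"id": 1, "os": "Linux"} for _ in range(6)]       # kept
    + [{"id": 2, "os": "Windows"} for _ in range(5)]   # removed by rule 1
    + [{"id": 3, "os": "Linux"} for _ in range(5)]
    + [{"id": 3, "os": "Mac"}]                         # instance 3 removed by rule 2
)
print(len(screen_fingerprints(f_aug)))
```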
b. Feature regularization. Browser fingerprints contain several types of features: 1. numerical type, 2. Boolean type, 3. string type, 4. canvas.
1) For the numerical type, min-max normalization is used. For a numerical feature value $x$ in a browser fingerprint, traversing all $x$ in $F_{temp}$ gives a sequence $x_1, x_2, x_3, \ldots, x_n$, where $n$ is the number of fingerprints in $F_{temp}$. The sequence $x_1, x_2, x_3, \ldots, x_n$ is transformed as:
$$y_i = \frac{x_i - \min(x_1, \ldots, x_n)}{\max(x_1, \ldots, x_n) - \min(x_1, \ldots, x_n)}$$
giving a new sequence $y_1, y_2, y_3, \ldots, y_n \in [0, 1]$.
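The min-max normalization of 1) can be sketched in a few lines. The handling of a constant feature, where the maximum equals the minimum, is not specified in the text; mapping it to 0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale a sequence x_1..x_n into [0, 1] with min-max normalization."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a feature with a single distinct value carries no
        # information, so every entry is mapped to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```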
2) For the boolean type, a binary representation of 0,1 is used.
3) For the string type: values in date form, such as creationDate, are converted into a numerical value and then regularized by the method in 1). Values that cannot be converted directly can be mapped to an 8-bit value by a hash algorithm such as SHA256 from the hashlib library in Python and then regularized by the method in 1).
4) For canvas, the SHA256 hash algorithm is first applied to map the string to an 8-bit value, which is then regularized by the method in 1).
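The hash mapping in 3) and 4) can be sketched with `hashlib`. The text does not say how the 256-bit SHA256 digest is reduced to an 8-bit value; keeping only the first byte of the digest is an assumption made for this illustration, and the function name is hypothetical.

```python
import hashlib

def string_to_8bit(value: str) -> int:
    """Map a string (e.g. a canvas data string) to an integer in 0..255.

    Assumption: the 8-bit value is the first byte of the SHA-256 digest.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return digest[0]

# The same string always maps to the same value, so unchanged features
# compare as equal after regularization.
print(string_to_8bit("abc"))
print(string_to_8bit("Mozilla/5.0 (X11; Linux x86_64)"))
```

The resulting integers can then be passed through the min-max normalization of 1) like any other numerical feature.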
Compared with the technology of processing the fingerprint characteristics of the browser in other methods, the method is simpler and is easy to operate. For browser fingerprint identification, the speed of fingerprint identification is an important index for measuring the merits of algorithms, so that the simple and efficient data preprocessing method can reduce the data processing time and further improve the fingerprint identification speed.
(4) Feature selection: fingerprints collected by a browser fingerprint collection script have many features, many of which are redundant. Too many features do not guarantee better results when training the CNN-Linformer model and may even lead to overfitting. To decide which attributes make up the feature vector, the method selects features according to the information entropy of the fingerprint features. Gómez-Boix et al. (A. Gómez-Boix, P. Laperdrix, and B. Baudry, "Hiding in the Crowd: an Analysis of the Effectiveness of Browser Fingerprinting at Large Scale," World Wide Web Conference 2018) collected 1,816,776 browser fingerprints as a dataset and calculated the information entropy of different browser fingerprint features; the results are shown in Table 1. We select the 16 fingerprint features in Table 1 together with the 4 features added during data enhancement (Flashresolution, localStorage, CanvasDiff and WebGLDiff), for a total of 20 features that make up the feature vectors used to train the CNN-Linformer model.
Table 1 browser fingerprint feature entropy values
Figure BDA0004148742250000111
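As an illustration of entropy-based feature selection, the sketch below computes the Shannon entropy of each feature's observed values and keeps the highest-ranked ones. The function names and the "top n" cut-off are assumptions for this example; the text only states that selection follows the information entropy of the features.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the observed values of one feature."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_features(columns, top_n):
    """Return the names of the top_n features ranked by entropy.

    columns maps a feature name to the list of values seen for it.
    """
    ranked = sorted(columns, key=lambda name: shannon_entropy(columns[name]), reverse=True)
    return ranked[:top_n]
```

A feature that takes the same value in every fingerprint has zero entropy and carries no identifying information, which is why low-entropy features are the first candidates for removal.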
(5) Generating an input vector: to feed data into the CNN, we convert several time-ordered fingerprint comparison vectors into an image-like two-dimensional matrix. The two-dimensional matrix $V_{input} = [I_t, I_{t-1}, \ldots, I_{t-i+1}]$ is used as the input vector, where $i$ is the length of the fingerprint chain and $t$ is the current timestamp. $I_t = \mathrm{diff}\langle f_{t-1}, f_t \rangle$ represents the comparison of two fingerprints; diff yields a single feature vector $I = \langle x_1, x_2, \ldots, x_M \rangle$ consisting of $M$ features, where $x_n$ is the result of comparing the $n$-th features of the two fingerprints. In the present method, since the features of the browser fingerprint have all been converted into numerical values and regularized, diff computes the absolute value of the difference between the two features. The specific steps of the algorithm for generating the input vector are shown in Algorithm 2.
Figure BDA0004148742250000121
First, an empty $V_{input}$ is generated. Second, the fingerprints $f_{k,i}$ in $c_k = \langle f_{k,t}, f_{k,t-1}, f_{k,t-2}, f_{k,t-3} \rangle$ are traversed, and $f_u$ is compared with each $f_{k,i}$ to obtain the comparison result $I_i = \mathrm{diff}\langle f_u, f_{k,i} \rangle = \langle x_1, x_2, \ldots, x_M \rangle$, which is appended to $V_{input}$. If $|c_k|$ (i.e., the number of fingerprints in the fingerprint chain) is less than 4, we generate $V_{zero} = \langle 0, 0, \ldots, 0 \rangle$ as padding. Finally, the two-dimensional input matrix $V_{input}$ generated by Algorithm 2 is as follows:
Figure BDA0004148742250000122
In this method, the number of fingerprints associated with one browser instance is at least 4 and each fingerprint has 20 features, so $V_{input}$ is a 4 × 20 matrix.
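The input-generation steps described above can be sketched in a few lines. This is an illustrative reading of the description, not a transcription of Algorithm 2: fingerprints are assumed to be already regularized lists of numbers, and the function names are hypothetical.

```python
def diff(f_a, f_b):
    """Element-wise absolute difference of two regularized fingerprints."""
    return [abs(a - b) for a, b in zip(f_a, f_b)]


def build_input(f_u, chain, chain_len=4):
    """Compare f_u with each fingerprint in the chain; zero-pad short chains.

    Returns a chain_len x M matrix as a list of rows.
    """
    num_features = len(f_u)
    v_input = [diff(f_u, f_k) for f_k in chain[:chain_len]]
    while len(v_input) < chain_len:
        v_input.append([0.0] * num_features)  # V_zero padding row
    return v_input
```

With 20 features per fingerprint and a chain length of 4, `build_input` always returns a 4 × 20 matrix, regardless of how many fingerprints the chain currently holds.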
Other browser fingerprinting methods all use sequential processing algorithms, such as random forests, RNNs, and LSTM. In this method, multiple inputs are converted into a two-dimensional matrix so that they can be passed into the model simultaneously for training or testing, which improves fingerprint identification efficiency.
(6) Training a CNN-Linformer model: positive and negative samples are constructed from the data set $F_{temp}$ by the method in the prior art (Li, Xiaoyun, et al. "Constructing browser fingerprint tracking chain based on LSTM model." 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). IEEE, 2018), yielding 136,914 positive and 136,041 negative samples, for a total of 272,955 samples that form a new data set $F$ for subsequent training and testing. From the new data set $F$, 20% of the data is randomly selected as the training set and the remaining 80% is used as the test set.
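A random 20%/80% split of this kind might look like the sketch below. The function name, the fixed seed, and the use of Python's `random` module are assumptions for illustration; the text does not say how the random selection was implemented.

```python
import random


def split_dataset(samples, train_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Sample counts quoted in the text.
total_samples = 136914 + 136041
```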
The CNN-Linformer model constructed in accordance with the present invention is shown in Fig. 3. The model combines a CNN and a Linformer and is designed for binary classification tasks. The input is a two-dimensional matrix $V_{input}$, in which one dimension is the time series and the other is the features of the browser fingerprint. The CNN extracts features from the time-series data, and the Linformer then processes these features to make predictions. The model includes a reshaping layer, Bridge, which reshapes the output of the CNN to match the input size of the Linformer. We denote the data obtained after $V_{input}$ passes through the CNN module and the Bridge module as $V_{new}$. $V_{new}$ is then passed to the Linformer layer, which processes the data and produces an output. The output of the Linformer finally passes through a linear layer activated by Softmax to produce a binary classification result.
The following describes the model from both CNN and Linformer.
a. CNN module: a CNN performs convolution and pooling operations on input data and extracts important features from it, helping to capture spatial and temporal information relevant to the classification task. A CNN can process data in parallel on multiple GPUs or CPUs, which suits large-scale training on large data sets and improves the training efficiency of the model. The specific architecture of the CNN of the present invention is shown in Table 1, in which two convolution layers extract the time-series features of the input data. To keep the time dimension of the input unchanged so that the subsequent Linformer can extract further features, we discard the pooling layer and reshape the output of the last convolution layer through the Bridge module in Fig. 3, turning it into an input $V_{new}$ usable by the Linformer module. $V_{new}$ is a 4 × 320 matrix.
Table 1 CNN architecture
| Layer | Type | Output shape | Activation function | Parameters |
| --- | --- | --- | --- | --- |
| conv2d | Conv2D | (None, 4, 20, 10) | ReLU | 50 |
| conv2d_1 | Conv2D | (None, 4, 20, 20) | ReLU | 820 |
| reshape | Reshape | (None, 4, 400, 1) | - | 0 |
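The parameter counts in the CNN architecture table can be checked by hand. The sketch below uses the standard Conv2D parameter formula and assumes a single-channel 4 × 20 input and a kernel with four weights per input channel (written here as 2 × 2; the kernel size is not stated in the text, so this is an inference from the counts). Under those assumptions the counts 50 and 820 are reproduced, and flattening 20 feature columns × 20 channels gives the width of 400 listed for the reshape layer. Note that the text quotes 4 × 320 for $V_{new}$, while the reshape row of the table gives 400; the sketch follows the table.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Trainable parameters of a Conv2D layer: weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


# A 2 x 2 kernel is an assumption; any kernel with 4 weights per input
# channel (for example 1 x 4) yields the same counts.
conv1 = conv2d_params(2, 2, in_channels=1, out_channels=10)
conv2 = conv2d_params(2, 2, in_channels=10, out_channels=20)

# Bridge/Reshape: merge the 20 feature columns with the 20 output channels.
reshaped_width = 20 * 20
```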
b. Linformer module: in the browser fingerprinting task, the input is typically a series of data points with different characteristics, which may be scattered across different positions of the input sequence. By employing a linear attention mechanism, the Linformer model can better capture this scattered information and use it to improve the accuracy and robustness of fingerprint recognition. In particular, the linear attention mechanism lets the Linformer model weight different positions of the input so as to focus on the information that matters most for the task. This mechanism has lower time and space complexity than the conventional self-attention mechanism and can therefore improve the efficiency of fingerprint recognition. Multi-head linear attention (Multi-head Linear Attention) enables the Linformer to process input data in parallel.
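To make the linear attention idea concrete, the sketch below implements a single head in plain Python. It follows the general Linformer recipe of projecting the keys and values along the sequence axis before computing softmax attention; the helper names, the projection matrices `e_proj` and `f_proj`, and their sizes are illustrative assumptions and are not taken from the patent.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def linear_attention(q, k, v, e_proj, f_proj):
    """Single-head Linformer-style attention.

    q, k, v are n x d matrices. e_proj and f_proj are k_dim x n matrices
    that shrink the sequence axis of the keys and values, so the score
    matrix is n x k_dim instead of n x n.
    """
    d = len(q[0])
    k_small = matmul(e_proj, k)              # k_dim x d
    v_small = matmul(f_proj, v)              # k_dim x d
    scores = matmul(q, transpose(k_small))   # n x k_dim
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)           # each row sums to 1
    return matmul(weights, v_small), weights
```

Because the score matrix has `k_dim` columns rather than `n`, the cost of attention grows linearly with the sequence length when `k_dim` is fixed.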
To suit the browser fingerprint identification task, the invention modifies the Linformer model. To reduce the complexity of the model, we remove the decoder part of the Linformer and use only the encoder part, with only 1 encoder block (the dashed part of the Linformer module in Fig. 3, N = 1). Because the output of the CNN part can be fed directly into the Linformer encoder without further processing, we also remove the input embedding and positional encoding parts of the Linformer. The structure of the modified Linformer model is shown in the Linformer section of Fig. 3, and the associated parameter settings are shown in Table 2.
Table 2 Parameter settings of the Linformer model

| Parameter | Value |
| --- | --- |
| Number of encoders | 1 |
| Number of attention heads | 5 |
| Feed-forward network layer dimension | 64 |
c. Dropout layer: to prevent overfitting, a Dropout module is placed after the last layer normalization module. In the present method, dropout=0.1.
d. Linear layer: the linear layer uses the ReLU function. ReLU is a nonlinear activation function; compared with a linear activation function, it can express more complex classification boundaries, is closer to the signal excitation principle of neurons, and can improve the performance of the model. In addition, ReLU helps mitigate the vanishing gradient problem.
e. Output layer activation function: Sigmoid. Sigmoid is often used as the activation function in classification problems. It maps the output to [0, 1], which can be read directly as a probability and so readily reflects the probability of the classification result. The output layer finally outputs the probability $p$ that $f_u$ belongs to a given fingerprint chain.
f. Loss function: binary cross entropy (Binary Cross Entropy). The binary cross entropy loss function is commonly used for binary classification problems. It is calculated as follows:
Figure BDA0004148742250000141
where $y$ is the binary label (0 or 1) and $p(y)$ is the predicted probability that the output belongs to label $y$.
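For reference, the sigmoid output and the loss can be sketched as below. The code uses the standard mean binary cross entropy, $-\frac{1}{N}\sum_i \left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$, which is assumed here to match the formula in the figure; the clipping constant `eps` is an implementation detail added for numerical safety.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross entropy over paired labels and probabilities.

    eps clips probabilities away from 0 and 1 to avoid log(0).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A confident correct prediction gives a loss near 0, while a confident wrong prediction is penalized heavily, which is what pushes the model toward well-separated probabilities.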
g. An optimizer: adam. We used Adam as the optimization algorithm in building the model. Adam is a first order optimization algorithm that can replace the traditional random gradient descent process and that can iteratively update neural network weights based on training data.
The invention provides a browser fingerprint identification algorithm based on a CNN-Linformer model, which calculates the probability that an unknown fingerprint belongs to a given fingerprint chain. The CNN-Linformer model first extracts time-series features of the input data through the CNN and then focuses on important features through the linear attention of the Linformer, which improves the accuracy of the model; multi-head attention processes the data in parallel, which improves the training speed. The decoder, input embedding, and positional encoding parts of the Linformer are removed to simplify it, which reduces the complexity of the model and further improves computation speed. With the CNN-Linformer model, the accuracy of browser fingerprint identification reaches 99.78%. Under the same experimental conditions, the algorithm takes 2830 s to identify 2000 browser fingerprints, which is 22.57% less than the 3655 s of LSTM, 36.86% less than the 4482 s of BiGRU, and 65.8% less than the 8275 s of FPStalker. The results show that the method offers both high accuracy and high recognition speed.
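The quoted reductions can be reproduced with a short calculation, treating all four timings as being expressed in the same unit (an assumption; the percentages only agree when they are).

```python
def reduction_percent(baseline, ours):
    """Relative time saving of `ours` against `baseline`, in percent."""
    return (baseline - ours) / baseline * 100.0


ours = 2830
baselines = {"LSTM": 3655, "BiGRU": 4482, "FPStalker": 8275}
reductions = {name: round(reduction_percent(t, ours), 2) for name, t in baselines.items()}
```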
Meanwhile, other existing methods do not comprehensively consider the various pseudo-fingerprint techniques, which reduces their accuracy. To address this, the invention uses two pseudo-fingerprint techniques, randomizing attribute values and adding noise, to perform data enhancement on the original data set, and then trains the CNN-Linformer model on the enhanced data set containing both normal and pseudo fingerprints. The method not only uses pseudo-fingerprint techniques for data enhancement but also uses four added features, Flash resolution, localStorage, canvasDiff, and webGLDiff, as fingerprint features. These added features can reveal which pseudo-fingerprint techniques the user has applied, and that information is itself part of the fingerprint. Finally, the accuracy of the CNN-Linformer model reaches 99.73%, the recall 99.83%, the F1 value 99.78%, and the MCC value 98.89%. These results indicate that the method is robust in pseudo-fingerprint scenarios.
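One way to sanity-check the reported figures: F1 is the harmonic mean of precision and recall. The sketch below assumes, which the text does not state, that the 99.73% figure plays the role of precision; under that reading the reported F1 value follows from the other two numbers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# 99.73 is treated as precision here; that reading is an assumption.
f1 = round(f1_score(99.73, 99.83), 2)
```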
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be replaced with others, which may not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A browser identification method based on a CNN-Linformer model, characterized by comprising the following steps:
for an unknown fingerprint $f_u$, first traversing each fingerprint chain $c_k = \langle f_{k,t}, f_{k,t-1}, f_{k,t-2}, f_{k,t-3} \rangle$ in the fingerprint chain set $C$, and generating, from the unknown fingerprint $f_u$ and the fingerprint chain $c_k$, an input vector $V_{input} = [I_t, I_{t-1}, \ldots, I_{t-i+1}]$ by the input vector generation algorithm, where $i$ is the length of the fingerprint chain, $t$ is the current timestamp, $I_t = \mathrm{diff}\langle f_{t-1}, f_t \rangle$ represents the comparison of two fingerprints, diff yields a single feature vector $I = \langle x_1, x_2, \ldots, x_M \rangle$ consisting of $M$ features, and $x_n$ is the comparison result of the $n$-th features of the two fingerprints; the fingerprint chain is formed by linking fingerprints with the same ID in order of fingerprint acquisition time, each browser fingerprint has an ID, the ID is associated with the browser instance having the same ID, and each browser instance has one fingerprint chain; the fingerprint chain set $C$ contains all fingerprints linked to browser instances;
then inputting $V_{input}$ into the CNN-Linformer model, extracting time-series features of the input data through the CNN, focusing on important features through the linear attention of the Linformer, and computing on the data in parallel through multi-head attention to obtain the probability $p$ that the unknown fingerprint $f_u$ belongs to the fingerprint chain $c_k$; if $p > \lambda$, the unknown fingerprint $f_u$ belongs to the fingerprint chain $c_k$, the ID of the fingerprint chain $c_k$ is assigned to the unknown fingerprint $f_u$, $f_u$ is inserted at the end of $c_k$, and the fingerprint at the head of $c_k$ is deleted; if $p \le \lambda$, a new ID is generated and assigned to $f_u$, and a new fingerprint chain $\langle f_u \rangle$ is generated and added to the fingerprint chain set $C$, where $\lambda$ is a set probability threshold; the CNN-Linformer model comprises a reshaping layer, Bridge, which reshapes the output of the CNN to match the input size of the Linformer; the Linformer layer uses only 1 encoder block, and the output of the Linformer finally passes through a linear layer activated by Softmax to produce a binary classification result.
2. The browser identification method based on a CNN-Linformer model according to claim 1, wherein in the CNN-Linformer model, the CNN module uses two convolution layers to extract time-series features of the input data, and the Bridge module reshapes the output of the last convolution layer into an input $V_{new}$ matching the Linformer module; the Linformer module uses only the encoder portion of the Linformer and only 1 encoder block.
3. The browser identification method based on a CNN-Linformer model according to claim 1, wherein the steps of constructing the CNN-Linformer model are as follows:
(1) Selecting a data set: taking a browser fingerprint data set as the original data set $F_{raw}$;
(2) Data enhancement: performing data enhancement on the original data set $F_{raw}$ using pseudo-fingerprint techniques to obtain an enhanced data set $F_{aug}$;
(3) Data preprocessing: screening the fingerprints in the enhanced data set $F_{aug}$ by rules to obtain a browser fingerprint data set $F_{temp}$, and performing feature regularization on $F_{temp}$;
(4) Feature selection: selecting features according to the information entropy of the fingerprint features, and forming a plurality of fingerprint comparison vectors with time sequence relations by the selected fingerprint features and features obtained during data enhancement;
(5) Generating an input vector: converting multiple fingerprint comparison vectors with a time-series relation into a two-dimensional matrix $V_{input} = [I_t, I_{t-1}, \ldots, I_{t-i+1}]$ as the input vector of the CNN-Linformer model, where $i$ is the length of the fingerprint chain, $t$ is the current timestamp, $I_t = \mathrm{diff}\langle f_{t-1}, f_t \rangle$ represents the comparison of two fingerprints, diff yields a single feature vector $I = \langle x_1, x_2, \ldots, x_M \rangle$ consisting of $M$ features, $x_n$ is the comparison result of the $n$-th features of the two fingerprints, and diff computes the absolute value of the difference between the two features;
(6) Training a CNN-Linformer model: constructing positive and negative samples from the data set $F_{temp}$ as a new data set $F$ for subsequent training and testing; randomly selecting 20% of the data in the new data set $F$ as the training set and using the remaining 80% as the test set; training the model with the training data to obtain and store a trained model; and then classifying the test data with the model to obtain classification results and analyze them.
4. The browser identification method based on a CNN-Linformer model according to claim 3, wherein in step (2), data enhancement is performed on the original data set $F_{raw}$ by a pseudo-fingerprint technique of randomizing attribute values or adding noise.
5. The browser identification method based on a CNN-Linformer model according to claim 3, wherein in step (3), the method for screening fingerprints is as follows:
for the enhanced data set $F_{aug}$, each browser fingerprint $f$ in $F_{aug}$ has a browser ID linking it to the browser instance to which it belongs, and all the fingerprints of that instance constitute a set $F_{id}$, where id is the ID of the browser instance; for each
Figure FDA0004148742230000021
if $F_{id}$ does not satisfy rule 1 and rule 2, all fingerprints $f \in F_{id}$ are deleted from $F_{aug}$, finally yielding a new browser fingerprint data set $F_{temp}$;
Wherein:
rule 1: if the number of all fingerprints associated with one browser instance is less than 6, deleting all fingerprints of the browser instance;
rule 2: if the operating system type attribute is not the same in all fingerprints associated with one browser instance, deleting all fingerprints of the browser instance.
6. The browser identification method based on a CNN-Linformer model according to claim 3, wherein in step (3), the feature regularization method is as follows:
for the numerical value type characteristics, processing by adopting minimum and maximum standardization;
for boolean type features, a binary representation of 0, 1;
for character string type characteristics, converting the character string type characteristics into a numerical value and then processing the numerical value by minimum and maximum standardization; for character strings which cannot be directly converted into numerical values, mapping the character strings into the numerical values through a hash algorithm in a hashlib library in python, and processing the numerical values by using minimum and maximum standardization;
for canvases, the string is mapped to a numerical value by a hash algorithm in the hashlib library in python, and then processed with minimum and maximum normalization.
7. The browser identification method based on a CNN-Linformer model according to claim 3, wherein in step (5), the steps of the input vector generation algorithm are as follows:
first, an empty $V_{input}$ is generated; second, the fingerprints $f_{k,i}$ in $c_k = \langle f_{k,t}, f_{k,t-1}, f_{k,t-2}, f_{k,t-3} \rangle$ are traversed, and $f_u$ is compared with $f_{k,i}$ to obtain the comparison result $I_i = \mathrm{diff}\langle f_u, f_{k,i} \rangle = \langle x_1, x_2, \ldots, x_M \rangle$, which is incorporated into $V_{input}$; if $|c_k|$, i.e. the number of fingerprints in the fingerprint chain, is less than 4, $V_{zero} = \langle 0, \ldots, 0 \rangle$ is generated as padding; the finally generated two-dimensional input matrix $V_{input}$ is expressed as:
Figure FDA0004148742230000031
8. The browser identification method based on a CNN-Linformer model according to claim 1, wherein a Dropout module is placed after the last layer.
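The linking procedure recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: `predict` is a hypothetical stand-in for the trained CNN-Linformer model, chains are stored oldest first, IDs are integers, the threshold default of 0.5 is a placeholder for $\lambda$, the first chain exceeding the threshold wins, and the head is only dropped once the chain is longer than `chain_len`.

```python
def link_fingerprint(f_u, chains, predict, threshold=0.5, chain_len=4):
    """Assign f_u to an existing chain or start a new one; return the ID used.

    chains maps a browser ID to its list of fingerprints (oldest first).
    predict(f_u, chain) returns the probability p that f_u belongs to chain.
    """
    for chain_id, chain in chains.items():
        if predict(f_u, chain) > threshold:
            chain.append(f_u)          # insert at the end of the chain
            if len(chain) > chain_len:
                chain.pop(0)           # delete the fingerprint at the head
            return chain_id
    new_id = max(chains, default=0) + 1  # assumes integer IDs
    chains[new_id] = [f_u]               # new chain containing only f_u
    return new_id
```

Keeping each chain at a fixed length means the model always compares the unknown fingerprint with the most recent fingerprints of a browser instance.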
CN202310311808.0A 2023-03-28 2023-03-28 Browser identification method based on CNN-Linformer model Pending CN116310728A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310311808.0A CN116310728A (en) 2023-03-28 2023-03-28 Browser identification method based on CNN-Linformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310311808.0A CN116310728A (en) 2023-03-28 2023-03-28 Browser identification method based on CNN-Linformer model

Publications (1)

Publication Number Publication Date
CN116310728A true CN116310728A (en) 2023-06-23

Family

ID=86813065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310311808.0A Pending CN116310728A (en) 2023-03-28 2023-03-28 Browser identification method based on CNN-Linformer model

Country Status (1)

Country Link
CN (1) CN116310728A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117544322A (en) * 2024-01-10 2024-02-09 北京雪诺科技有限公司 Browser identification method, device, equipment and storage medium
CN117544322B (en) * 2024-01-10 2024-03-22 北京雪诺科技有限公司 Browser identification method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination