CN110889133B - Anti-network tracking privacy protection method and system based on identity behavior confusion

Info

Publication number: CN110889133B
Application number: CN201911081354.2A
Authority: CN
Inventors: 彭佳; 李敏; 张逸飞; 高能
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-11-07
Filing date: 2019-11-07
Publication date: 2022-03-15
Anticipated expiration: 2039-11-07
Also published as: CN110889133A

Abstract

The invention discloses an anti-network tracking privacy protection method and system based on identity behavior confusion. The method comprises the following steps: 1) constructing a plurality of virtual identities for a user; 2) a behavior splitting module assigns, in sequence, a corresponding virtual identity to each web service operation request issued in the user's real behavior chain; 3) an identity switching module switches the user identity to the currently assigned virtual identity and executes the assigned web service operation request, while a behavior generation module generates virtual behaviors for the current virtual identity, where a virtual behavior simulates a behavior that the real user did not perform and sends a virtual request to the web service so that the behavior chain of the virtual identity is complete; 4) the results returned to the user's virtual identities are passed to a result fusion module for fusion, and the fused result is returned to the client where the user is located. Because the service is accessed under a plurality of identities, the user's private information is effectively protected.

Description

Anti-network tracking privacy protection method and system based on identity behavior confusion

Technical Field

The invention belongs to the field of cyberspace security and privacy protection, and particularly relates to an anti-network tracking privacy protection method and system based on identity behavior confusion.

Background

With the rapid growth of internet services, users generate more and more online behavior. Web services pay increasing attention to this behavior in order to provide more accurate recommendations and advertisements. Whether or not a user logs into a website, the web service identifies the user and records the user's behavior. Based on this identity and behavior data, web services can build powerful advertising services. Although recommendations and advertisements bring some convenience to people's online life, they also threaten user privacy. First, the identity service provider is not completely reliable: data leakage incidents occur every year and expose users' personal identity information. Second, as analysis technology develops, an adversary can use continuous behavior data to infer a user's private attributes, such as gender, age and occupation, through methods such as big data analysis, trajectory analysis and targeted advertising.

Existing defenses fall into two groups: active hiding of the user's behavior, and passive protection of the user's personal identity information through rules governing how the identity service provider stores and uses that information.

Technologies that protect how an identity service provider stores and uses personal information include de-identification techniques such as differential privacy, k-anonymity and generalization. The core idea of de-identification is to reduce how strongly the data discriminates between individuals and to break the association between the data and the person it describes, so that even if the data is leaked it is difficult to link the described entity to a real user. However, the identity service provider still obtains the user's full information. Protection rests on the provider's self-discipline and technical completeness and depends on third-party supervision, so this is a passive protection mode.

Active protection methods, which hide the user behavior itself, include: (1) actively modifying the user behavior record, for example changing the user's rating of a movie or adding noise to the behavior record with local differential privacy; (2) reducing the collection of user history, for example using user-associated information for recommendation or deleting part of the user's behavior records. Because behavior records are stored on the server and are difficult to modify once sent, these two methods can only be analyzed theoretically and cannot be applied in real scenarios; (3) fully anonymous methods, for example hiding personal behavior behind an anonymizing browser such as Tor Browser. These can impair the usability of the service, and such anonymous proxies may be detected or even banned by some regulators, so they struggle to meet users' requirements for openness and security of network services.

Disclosure of Invention

In view of the situation described above, the present invention provides an anti-network tracking privacy protection method and system based on identity behavior confusion. The invention splits the user's behavior sequence, accesses the service under a plurality of identities, and fuses the recommendation results of all the identities. As a result, the service provider can obtain only part of the user's information for each identity and cannot use the full information to infer private attributes. The fused result is returned to the user so that a certain level of usability is retained.

In order to achieve this purpose, the invention adopts the following scheme:

An anti-network tracking privacy protection method based on identity behavior confusion comprises the following steps:

1) building a plurality of virtual identities for a user;

2) the behavior splitting module distributes a corresponding virtual identity to each web service operation request sent in the real behavior chain of the user in sequence;

3) the identity switching module switches the user identity to the currently distributed virtual identity and executes the distributed web service operation request; and the behavior generation module generates a virtual behavior for the current virtual identity; the virtual behavior means simulating the behavior which is not executed by the real user, and sending a virtual request to the web service to ensure that the behavior chain of the virtual identity is complete;

4) the results returned to the plurality of virtual identities assigned to the user are passed to the result fusion module for fusion, and the fused result is then returned to the client where the user is located.

Further, a client of a user constructs a plurality of virtual identities for the user by requesting a plurality of cookies or constructing different browser fingerprints; and the identity switching module switches the user identity to the currently allocated virtual identity by switching cookies or browser fingerprints.

Further, the client deletes the cookie in the user request, or replaces it with a virtual cookie, to generate the virtual identity of the user; the virtual cookie is a cookie genuinely issued by the web server and collected when the target service is requested.

Further, the client maintains a virtual identity list, and the record information in the virtual identity list includes the virtual identity, the corresponding keyword of the target service accessed by the virtual identity, and the identification information.

Further, according to a training set, the behavior generation module selects, as the virtual behavior of the current virtual identity, an unexecuted behavior that is similar to the real behaviors assigned to the current virtual identity.

Further, the behavior splitting module assigns a corresponding virtual identity to each web service operation request as follows: the behavior splitting module calculates a vector representation for each web service operation request according to the training set, and calculates an overall vector for each virtual identity from that identity's existing behaviors; it then calculates the similarity between the vector of the current web service operation request and the vector of each virtual identity, and assigns the current web service operation request to the virtual identity with the maximum similarity.

Further, the returned results comprise accurate results and recommendation results; an accurate result is the result of a request that uses the web service normally and does not reveal private information, and the result fusion module returns accurate results directly to the user; a recommendation result is an ordered list of items provided by the server based on the user's historical behavior.

Further, the result fusion module fuses the returned recommendation results with a heuristic method: it first selects all virtual identities that have executed real user behaviors, then sorts their recommendation results in descending order of the number of real behavior records owned by each selected virtual identity, and then fuses the recommendation results returned by the virtual identities with either return order or frequency as the priority.

An anti-network tracking privacy protection system based on identity behavior confusion, characterized by comprising a behavior splitting module, an identity switching module, a behavior generation module and a result fusion module, wherein:

the behavior splitting module is used for sequentially distributing a corresponding virtual identity to each web service operation request sent in the real behavior chain of the user; wherein the user has a plurality of virtual identities;

the identity switching module is used for switching the user identity to the currently distributed virtual identity and executing the distributed web service operation request;

the behavior generation module is used for generating virtual behaviors for the current virtual identity; the virtual behavior means simulating the behavior which is not executed by the real user, and sending a virtual request to the web service to ensure that the behavior chain of the virtual identity is complete;

and the result fusion module is used for fusing the returned results of the virtual identities and then returning the fused results to the client of the user.

The invention also provides a client which is characterized by comprising a behavior splitting module, an identity switching module, a behavior generating module and a result fusing module; the behavior splitting module is used for sequentially distributing a corresponding virtual identity to each web service operation request sent in a user real behavior chain; wherein the user has a plurality of virtual identities; the identity switching module is used for switching the user identity to the currently distributed virtual identity and executing the distributed web service operation request; the behavior generation module is used for generating virtual behaviors for the current virtual identity; the virtual behavior means simulating the behavior which is not executed by the real user, and sending a virtual request to the web service to ensure that the behavior chain of the virtual identity is complete; and the result fusion module is used for fusing the returned results of the virtual identities and then returning the fused results to the user.

In scenarios where the user is not logged into a website, almost all web services currently use cookies or browser fingerprints to identify the user. To protect the user's privacy, the method constructs realistic-looking false identities (namely virtual identities) for the user by requesting a plurality of cookies and constructing different browser fingerprints at the client, and requests the web service under these different false identities by switching cookies and browser fingerprints. The invention is divided into four modules: behavior splitting, identity switching, behavior generation and result fusion. The behavior splitting module assigns, in sequence, a corresponding virtual identity to each web service operation request issued in the user's real behavior chain. Before the current behavior is executed, the identity switching module switches the user's identity and re-sends the request under the new identity. Meanwhile, the behavior generation module selects and adds false behaviors similar to the real behaviors, so that the behavior chain of each false identity looks complete and real. In scenarios where the user is logged into the website, the invention only adds virtual behaviors to the real behavior sequence to deceive the web service, so that an accurate user profile is difficult to infer. The result fusion module then fuses the recommendation results corresponding to the virtual identities constructed for the same user and returns the fused result to the real user. The service provider can therefore obtain only part of the user's information for each identity and cannot use the full information to infer private attributes, while a certain level of usability is retained.

The virtual identities constructed for the user at the client confuse the user identity against two marking mechanisms: cookies and browser fingerprints. When the user is not logged into the website, web services commonly mark the user with a cookie; when the user disables cookies, the browser fingerprint serves as the medium for marking the user identity and recording user data.

Further, a virtual cookie is forged by using a cookie genuinely issued by the web server and collected when the target service is requested. A web service sets a cookie for a user it sees visiting the website for the first time, so the invention obtains and stores a plurality of cookies by visiting the website several times without a cookie. The invention may delete the cookie in a real user request, or replace it with a virtual cookie, to generate a false identity.

Further, a browser fingerprint is forged by replacing, within a valid range, part of the key browser or hardware identification information commonly used for browser fingerprinting, so as to form virtual fingerprint information that looks real.
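The cookie collection and replacement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `collect_virtual_cookies` and `apply_virtual_cookie` are hypothetical, and `fetch` stands in for a cookie-less first visit to the target service that returns whatever cookies the server issued.

```python
from typing import Callable, Dict, List, Optional


def collect_virtual_cookies(
    fetch: Callable[[], Dict[str, str]], count: int
) -> List[Dict[str, str]]:
    """Visit the service repeatedly without a cookie and store each issued cookie set.

    `fetch` is a hypothetical callable that performs one cookie-less visit and
    returns the cookies set by the server. At most 3 * count visits are made.
    """
    pool: List[Dict[str, str]] = []
    attempts = 0
    while len(pool) < count and attempts < count * 3:
        attempts += 1
        cookies = fetch()
        if cookies and cookies not in pool:  # keep only distinct, non-empty sets
            pool.append(cookies)
    return pool


def apply_virtual_cookie(
    headers: Dict[str, str], cookies: Optional[Dict[str, str]]
) -> Dict[str, str]:
    """Return a copy of the request headers with the Cookie header replaced.

    Passing cookies=None deletes the Cookie header instead of replacing it.
    """
    out = {k: v for k, v in headers.items() if k.lower() != "cookie"}
    if cookies:
        out["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return out
```

Injecting `fetch` keeps the sketch independent of any particular HTTP library; a real client would perform the visit with its own networking code.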
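As a rough illustration of fingerprint replacement, the sketch below swaps selected fingerprint fields for values drawn from a pool of plausible ones. The field names and the values in `COMMON_VALUES` are illustrative assumptions, not values taken from the patent; fields without a pool entry are left unchanged.

```python
import random
from typing import Dict, List, Optional

# Hypothetical pools of plausible values for commonly fingerprinted fields.
COMMON_VALUES: Dict[str, List[str]] = {
    "user_agent": [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    ],
    "screen": ["1920x1080", "1366x768", "1440x900"],
    "language": ["en-US", "zh-CN", "en-GB"],
}


def make_virtual_fingerprint(
    real: Dict[str, str], rng: Optional[random.Random] = None
) -> Dict[str, str]:
    """Return a copy of `real` with known fields replaced by plausible values."""
    rng = rng or random.Random()
    fake = dict(real)  # never mutate the real fingerprint
    for field_name, choices in COMMON_VALUES.items():
        if field_name in fake:
            fake[field_name] = rng.choice(choices)
    return fake
```

Drawing replacements from a fixed pool of valid values is one way to keep the forged fingerprint "within a valid range"; how the pool is chosen is left open here.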

Further, the virtual identity (sub-id) maintenance is that the invention maintains a virtual identity list at the client, the virtual identity list includes a plurality of records, and each record information includes a virtual identity, a corresponding keyword of a target service accessed by the virtual identity and identification information. Virtual identities are not shared between different users, i.e. each virtual identity belongs to only one user.
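One possible shape for the virtual identity list is sketched below. The class and field names (`VirtualIdentity`, `VirtualIdentityList`, `service_keyword`, `identification`) are assumptions chosen for illustration; the text only specifies that each record holds a virtual identity, a keyword of the target service, and identification information.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VirtualIdentity:
    sub_id: str                      # the virtual identity (sub-id)
    service_keyword: str             # keyword of the target service it accesses
    identification: Dict[str, str] = field(default_factory=dict)  # cookie / fingerprint data


class VirtualIdentityList:
    """Client-side list of virtual identities belonging to exactly one user."""

    def __init__(self) -> None:
        self._records: List[VirtualIdentity] = []

    def add(self, identity: VirtualIdentity) -> None:
        if any(r.sub_id == identity.sub_id for r in self._records):
            raise ValueError(f"duplicate sub-id: {identity.sub_id}")
        self._records.append(identity)

    def for_service(self, keyword: str) -> List[VirtualIdentity]:
        """Return the identities usable for the given target service."""
        return [r for r in self._records if r.service_keyword == keyword]

    def delete(self, sub_id: str) -> bool:
        """Drop one identity; returns True if something was removed."""
        before = len(self._records)
        self._records = [r for r in self._records if r.sub_id != sub_id]
        return len(self._records) < before
```

Keeping one list per user on the client matches the statement that virtual identities are not shared between users.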

The virtual behavior means simulating the behavior which is not executed by the real user, and sending a virtual request to the web service, so that the behavior chain of the false identity looks complete and real, and the confusion of the behavior layer is realized. Even if all false identities of a user are identified, the true behavioral stream of the user cannot be restored.

Further, the virtual behavior is generated by the behavior generation module: according to the training set, an unexecuted behavior similar to the real behaviors assigned to the current virtual identity (sub-id) is selected as the virtual behavior. The training set contains pre-collected real behavior chains of a large number of users.
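A minimal sketch of this selection step is given below, assuming each behavior already has an embedding vector learned from the training set. Averaging the identity's real behaviors and ranking candidates by cosine similarity are assumptions made for illustration; the function name `pick_virtual_behavior` is hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_virtual_behavior(
    assigned_real: List[str],
    executed_by_user: Set[str],
    embeddings: Dict[str, List[float]],
) -> Optional[str]:
    """Pick the unexecuted behavior closest to the identity's real behaviors."""
    vectors = [embeddings[b] for b in assigned_real if b in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    # Assumption: the identity is summarised by the mean of its behavior vectors.
    centre = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    candidates = [b for b in embeddings if b not in executed_by_user]
    if not candidates:
        return None
    return max(candidates, key=lambda b: cosine(embeddings[b], centre))
```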

While the user's click behaviors are being executed, the real user's behaviors are classified by a certain logic and distributed to different virtual identities for execution.

Further, the logical classification is based on behavior similarity. A vector representation can be calculated for each behavior from the training set with a representation learning (word vector) algorithm, and an overall vector is calculated for each virtual identity from its existing behaviors. When a new behavior occurs, the invention automatically determines the virtual identity whose behavior is most similar, assigns the behavior to that virtual identity, and then updates the vector representation of that identity.

Further, when a user action produces a new request, for example clicking a link, entering a search, or scrolling to load more content, any of which may cause the browser to send a request to the server, the behavior splitting module intercepts the request and selects the closest virtual identity according to information such as the link text, the input content and the current page content. The identity switching module then replaces the identity and re-sends the request under the new identity. The virtual identities that are not selected generate and execute virtual behaviors in the background at certain time intervals.

In some situations a login is required, such as purchasing a product in an e-commerce service, and the service cannot be deceived because the payment information and delivery address cannot be disguised. Because of this limitation, the invention provides an "exit" option that switches directly back to the logged-in state; although the payment is exposed, the previous browsing history remains protected. In addition, the user can simply delete the identity information of any sub-id, so that the attributes embodied by that sub-id are easily discarded.

Request results are returned according to the behavior type and are divided into accurate results and recommendation results. Accurate results must be returned to the user exactly, since they fulfil the main purpose of the user's operation and meet the user's needs.

An accurate result is the result of a request that uses the web service normally and does not reveal private information, such as opening a home page or scrolling a page.

The recommendation result is an ordered list of items provided by the server according to the user's historical behavior, wherein the items can be commodities, movies, restaurants, and the like in different scenes.

Recommendation result fusion means fusing the recommendation results returned for the requests issued by the virtual identities of the same user, so that the recommendations the user receives keep the service usable.

Further, the fusion adopts a heuristic method: it selects all virtual identities that contain real user behaviors and sorts their recommendation results in descending order of the number of real behavior records each virtual identity owns, so that virtual identities with more real behaviors have higher priority in the subsequent fusion. Fusion is then performed with either return order or frequency as the priority.

Further, fusion with return order as the priority follows the order of the items in the recommendation result obtained by each virtual identity. In fig. 1, the letters A, B, C represent different types of items and the numbers 1, 2, 3 represent different items. Items are taken position by position from the recommendation lists, skipping duplicates, to obtain the fused recommendation result. The advantage of this method is that the recommendation order is preserved: the highest-ranked items remain near the top after fusion, and the recommendation result of each virtual identity is respected.

Further, fusion with frequency as the priority follows how often each item appears in the recommendation results obtained by all the virtual identities. In fig. 2, the letters A, B, C represent different types of items and the numbers 1, 2, 3 represent different items. The occurrences of each item across all recommendation results are counted, and the items are arranged in descending order of frequency. The advantage of this method is that the recommendation results of all virtual identities are considered together, so the overall recommendation is captured more effectively.

Further, the two fusion methods can be combined: when the occurrences of an item are counted, each occurrence is weighted by the reciprocal of the item's rank in its recommendation list, and the items are then arranged in descending order of weighted frequency. The combined method takes both recommendation order and frequency into account and achieved better results in experiments. FIG. 3 depicts a flow chart of the present invention.
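The order-priority fusion can be sketched as a round-robin over list positions. This is an illustrative sketch under assumptions: the function name `fuse_by_order` and the dictionary inputs are hypothetical, and item labels such as "A1" only mimic the letter-and-number notation of fig. 1.

```python
from typing import Dict, List


def fuse_by_order(
    results_by_identity: Dict[str, List[str]], real_counts: Dict[str, int]
) -> List[str]:
    """Fuse recommendation lists position by position, skipping duplicates.

    Only identities that executed at least one real behavior take part, and
    identities with more real behavior records are visited first.
    """
    ids = [i for i in results_by_identity if real_counts.get(i, 0) > 0]
    ids.sort(key=lambda i: real_counts[i], reverse=True)

    fused: List[str] = []
    seen = set()
    longest = max((len(results_by_identity[i]) for i in ids), default=0)
    for pos in range(longest):          # position 0 of every list, then 1, ...
        for i in ids:
            items = results_by_identity[i]
            if pos < len(items) and items[pos] not in seen:
                seen.add(items[pos])
                fused.append(items[pos])
    return fused
```

For example, with lists `["A1", "B1", "C1"]` (5 real behaviors) and `["A1", "A2", "B1"]` (3 real behaviors), the fused list is `["A1", "B1", "A2", "C1"]`.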
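A short sketch of frequency-priority fusion follows. The name `fuse_by_frequency` is hypothetical, and two details are assumptions not stated in the text: an item is counted once per recommendation list, and ties are broken by first appearance.

```python
from collections import Counter
from typing import Dict, List


def fuse_by_frequency(result_lists: List[List[str]]) -> List[str]:
    """Order items by how many recommendation lists contain them, most frequent first."""
    counts: Counter = Counter()
    first_seen: Dict[str, int] = {}
    for items in result_lists:
        for item in dict.fromkeys(items):   # de-duplicate within one list, keep order
            counts[item] += 1
            first_seen.setdefault(item, len(first_seen))
    # Assumption: ties are broken by the order in which items were first seen.
    return sorted(counts, key=lambda it: (-counts[it], first_seen[it]))
```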
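The combined method can be read as scoring each item by the sum of reciprocal ranks over all lists in which it appears. The sketch below follows that reading; the function name `fuse_weighted` and the tie-breaking rule are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def fuse_weighted(result_lists: List[List[str]]) -> List[str]:
    """Rank items by the sum of 1/rank over every list that recommends them."""
    scores: Dict[str, float] = defaultdict(float)
    first_seen: Dict[str, int] = {}
    for items in result_lists:
        for rank, item in enumerate(items, start=1):
            scores[item] += 1.0 / rank       # earlier positions weigh more
            first_seen.setdefault(item, len(first_seen))
    # Assumption: ties are broken by the order in which items were first seen.
    return sorted(scores, key=lambda it: (-scores[it], first_seen[it]))
```

With lists `["A1", "B1", "C1"]` and `["B1", "A2", "A1"]`, item B1 scores 1/2 + 1 = 1.5 and A1 scores 1 + 1/3, so B1 is ranked ahead of A1 even though A1 tops the first list.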

Compared with the prior art, the invention has the following positive effects:

The computation of the invention is lightweight, saves computing resources and can run in real time; the confusion process in the non-login scenario does not involve fabricating behaviors that the user has never produced, which prevents unpredictable deviation of the recommended content.

The invention provides a method, based on artificial intelligence technology, for actively protecting user privacy and defending against deep attribute inference attacks. By creating virtual identities, it prevents the behavior chain in a user session from being acquired in full while keeping the service available. The method supports selection of the virtual identity, seamless switching between virtual and real identities, and fusion of recommendation results. It greatly reduces the average accuracy of attribute inference attacks with only a small loss in recommendation quality, balancing service loss against privacy protection.

Drawings

FIG. 1 is an in-order fusion flow diagram;

FIG. 2 is a flow chart of fusion by frequency;

FIG. 3 is a flow chart of the present invention.

Detailed Description

In order to make the objects, schemes and advantages of the present invention more apparent, the present invention is further described in detail by taking an experiment performed on a real data set as an example. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Taking the MovieLens 1M data set as an example, a specific implementation step for reducing the risk of the user being attacked by privacy attribute inference is described.

The MovieLens 1M dataset is a dataset of user movie ratings. It includes the private attributes (gender, age, occupation) of about 6000 users, the attributes (title, category) of about 4000 movies, and a total of one million ratings with their timestamps. This example preprocesses the MovieLens 1M dataset by arranging each user's movie ratings in time order and treating the corresponding movie sequence as the sequence in which the user accessed the items.

In this embodiment, the task of the invention is to split the user behavior sequence and fuse the recommendations. In addition, for comparison, two tasks are run both before and after splitting: sequence-based recommendation and sequence-based privacy attribute inference. They evaluate, respectively, how well the invention preserves recommendation accuracy and how well it protects privacy.

The user behavior sequence splitting task is described first. This embodiment generates a plurality of virtual identities for a user. For services that track through cookies, the cookie is generated and provided by the service on first access, so the user's local cookie information only needs to be cleared each time a virtual identity is generated, and the returned cookie is recorded after the access. For services that track through browser fingerprints, the invention prevents the tracker from acquiring the parts of the fingerprint that are difficult to confuse (such as the canvas fingerprint) and replaces other fingerprint information (such as the user-agent) with common values, thereby obtaining a plurality of virtual identities that are difficult to recognize.

This embodiment uses the APIs provided by mainstream browsers to replace the identity. For example, in Chrome-based browsers, cookie information may be directly obtained and replaced by Chrome.
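The preprocessing step can be sketched as follows, assuming the rating records use the `UserID::MovieID::Rating::Timestamp` line format of the MovieLens 1M `ratings.dat` file. The function name `build_sequences` is hypothetical, and breaking timestamp ties by movie ID is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_sequences(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Turn rating records into one time-ordered movie sequence per user."""
    per_user: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, movie, _rating, timestamp = line.split("::")
        per_user[int(user)].append((int(timestamp), int(movie)))
    # Sort each user's events by timestamp (ties broken by movie ID).
    return {u: [m for _, m in sorted(events)] for u, events in per_user.items()}
```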

In this embodiment, a list of movies is maintained for each virtual identity. Each movie is converted to a vector representation by a word vector (word2vec) algorithm trained on the data in the training set. Behavior sequence splitting is a dynamic process: as time goes on and a new movie rating is produced, the similarity between the movie and each virtual identity is calculated, the movie is assigned to the most similar identity, and the data of that identity is updated. When all similarities are below a certain threshold, an identity is assigned at random.

When a user rates a movie, the method uses the movie representation vectors obtained from the training set to select the virtual identity closest to the movie, replaces the identity, and sends the rating request under the replaced identity. Each virtual identity thus shows its own distinct preference for certain types of movies, which protects the user's private attributes.

After each virtual identity executes a behavior, the returned recommendation result is recorded and fused with the recommendation results of the other identities. The fusion process is shown in fig. 1 and fig. 2 and ensures that the recommendation result is not significantly affected.

One evaluation task of this embodiment is the sequence-based recommendation task. It is essentially a sequence prediction task that aims to predict the next item a user may access, given the sequence of items already accessed. This example randomly takes 20% of the sequences as the test set and the remaining 80% as the training set. The experiment was run 5 times, and the average of the results was taken as the final result. This embodiment uses a recurrent neural network for sequence-based recommendation. The movie each user rated last is the target, and all previous movies are input in time order. Evaluation uses the TOP20, TOP5 and MRR20 metrics. The other evaluation task of this embodiment is sequence-based privacy attribute inference, which is essentially a sequence classification task. A recurrent neural network is also used for privacy inference. The sequence of all movies each user rated is the input, and the user's private attributes are the output. Evaluation uses the macro- and micro-averages of the accuracy, recall and F1 values of the multi-class task.

After the method is applied, recommendation accuracy is not noticeably affected while the accuracy of user privacy attribute inference is greatly reduced, which shows that the method protects privacy effectively and keeps the recommendation system usable.
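The dynamic splitting process can be sketched as below, assuming each movie already has an embedding vector. Summarising an identity by the mean of its movie vectors and comparing by cosine similarity are assumptions for illustration, as are the names `IdentityProfile` and `assign_behavior`; the random fallback below the threshold follows the text.

```python
import math
import random
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class IdentityProfile:
    """Running summary of the behaviors assigned to one virtual identity."""

    def __init__(self, sub_id: str, dim: int) -> None:
        self.sub_id = sub_id
        self.total = [0.0] * dim
        self.count = 0

    def vector(self) -> List[float]:
        # Assumption: the overall vector is the mean of the assigned vectors.
        return [t / self.count for t in self.total] if self.count else list(self.total)

    def add(self, vec: Sequence[float]) -> None:
        self.total = [t + v for t, v in zip(self.total, vec)]
        self.count += 1


def assign_behavior(
    vec: Sequence[float],
    profiles: List[IdentityProfile],
    threshold: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Assign a behavior to the most similar identity and update that identity."""
    rng = rng or random.Random()
    best = max(profiles, key=lambda p: cosine(vec, p.vector()))
    if cosine(vec, best.vector()) < threshold:
        best = rng.choice(profiles)   # no identity is similar enough: pick at random
    best.add(vec)
    return best.sub_id
```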
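The recommendation metrics can be sketched as below, reading TOP-k as the fraction of test sequences whose target item appears among the first k recommendations and MRR20 as the mean reciprocal rank truncated at 20. That reading, and the function names `hit_rate_at_k` and `mrr_at_k`, are assumptions; the patent does not define the metrics further. Both functions expect non-empty inputs.

```python
from typing import List, Sequence


def hit_rate_at_k(ranked_lists: Sequence[List[int]], targets: Sequence[int], k: int) -> float:
    """Fraction of cases where the target is among the top-k recommendations."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)


def mrr_at_k(ranked_lists: Sequence[List[int]], targets: Sequence[int], k: int) -> float:
    """Mean reciprocal rank of the target, counting 0 when it is outside the top k."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        top = ranked[:k]
        if t in top:
            total += 1.0 / (top.index(t) + 1)
    return total / len(targets)
```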

The above description is intended to be illustrative of the present invention and is not to be construed as limiting the invention, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. An anti-network tracking privacy protection method based on identity behavior confusion, comprising the following steps:

1) a client of a user constructs a plurality of virtual identities for the user by requesting a plurality of cookies or constructing different browser fingerprints; the identity switching module switches the user identity to the currently allocated virtual identity by switching cookie or browser fingerprint;

2. The method of claim 1, wherein the client deletes the cookie in the user request, or replaces it with a virtual cookie, to generate a virtual identity for the user; the virtual cookie is a cookie genuinely issued by the web server and collected when the target service is requested.

3. The method of claim 1, wherein the client maintains a virtual identity list, and the information recorded in the virtual identity list includes a virtual identity, a corresponding keyword of a target service accessed by the virtual identity, and identification information.

4. The method of claim 1, wherein the behavior generation module selects an unexecuted behavior similar to a real behavior assigned to a current virtual identity execution as the virtual behavior of the current virtual identity based on a training set.

5. The method of claim 1, wherein the behavior splitting module assigns a corresponding virtual identity for each web service operation request by: the behavior splitting module calculates a vector representation for each web service operation request according to the training set, and calculates an integral vector according to the existing behavior of each virtual identity; and then, carrying out similarity calculation on the vector of the current web service operation request and the vector of each virtual identity, and then distributing the current web service operation request to the virtual identity corresponding to the maximum similarity value.

6. The method of claim 1, wherein the returned results include an accurate result and a recommended result; the accurate result is a request result which normally uses the web service and does not generate the privacy information, and the result fusion module directly returns the accurate result to the user; the recommendation is an ordered list of items provided by the server based on the user's historical behavior.

7. The method of claim 6, wherein the result fusion module fuses the returned recommendation results by a heuristic method, and first selects all virtual identities for executing real user behaviors, then arranges the recommendation results in reverse order according to the number of real behavior records owned by the selected virtual identities, and then fuses the recommendation results returned by the virtual identities with the return order as priority or with the frequency as priority.

8. An anti-network tracking privacy protection system based on identity behavior confusion, characterized by comprising a behavior splitting module, an identity switching module, a behavior generation module and a result fusion module, wherein:

the behavior splitting module is used for sequentially distributing a corresponding virtual identity to each web service operation request sent in the real behavior chain of the user; wherein the user has a plurality of virtual identities; a client of a user creates a plurality of virtual identities for the user by requesting a plurality of cookies or constructing different browser fingerprints; the identity switching module switches the user identity to the currently allocated virtual identity by switching cookie or browser fingerprint;

9. A client, characterized by comprising a behavior splitting module, an identity switching module, a behavior generation module and a result fusion module, wherein:

and the result fusion module is used for fusing the returned results of the virtual identities and then returning the fused results to the user.