CN108171074A - Automatic Web tracking detection method based on content association - Google Patents
- Publication number
- CN108171074A (application number CN201711282970.5A)
- Authority
- CN
- China
- Prior art keywords
- user
- web
- content
- page
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6263—Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Medical Informatics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses an automatic Web tracking detection method based on content association. It relates to the field of Web user privacy protection and mainly addresses the problem that some Web sites collect and leak sensitive user information without the user's knowledge. Implemented as a browser extension, the invention collects the user's operations on a Web page together with page element information, then uses techniques such as text analysis and image recognition to compare the page content of successive visits and analyze its association with the user's operations, thereby judging whether the Web site is collecting user information. Because ever-evolving Web tracking techniques can evade traditional detection methods, the invention starts from the effect of Web tracking. It can not only detect user privacy leakage effectively but also help researchers discover novel tracking techniques.
Description
Technical field
The present invention relates to methods for protecting the privacy of Web users, and in particular to an automatic Web tracking detection method based on page content association.
Background
With the rapid spread of Web technologies and services, more and more users cannot do without the Web. At the same time, Web sites and advertising service providers want to use device identification to deliver effective content recommendations and more precisely targeted advertising. However, some advertisers "cooperate" with one another and trade users' private information in order to link users across domains and then analyze their habits and interests, which largely violates users' wishes for privacy protection. At present, Web-based device identification relies mainly on cookies and browser fingerprints. A cookie is text information stored by a Web server in the user's browser. It can contain information about the user and the device, and when the user visits a Web site the server can read the cookie to obtain the user's browsing history and behavior. A browser fingerprint is composed of multiple attributes related to the browser, operating system, and device hardware, such as the UserAgent, fonts, and plug-ins. It does not depend on any single feature and is therefore fairly robust.
To counter the privacy leakage threats posed by Web tracking, researchers have proposed detection and defense methods. Cookies can be evaded directly: the user can disable them in the browser or delete them periodically. Browser fingerprinting, however, collects user information entirely without the user's knowledge, and at present it can be detected only by monitoring calls to sensitive JavaScript APIs. That approach presupposes a comprehensive understanding of the tracking techniques in use. If a Web site uses a new attribute that has not yet been discovered, it can evade detection.
Summary of the invention
Object of the invention: To address the deficiencies of the prior art, the present invention exploits the correlation between a Web site's intelligent recommendations and the user's operations, and proposes an automatic Web tracking detection method based on content association that detects whether a user is being tracked by starting from the effect of tracking.
Technical solution: The automatic Web tracking detection method based on content association according to the present invention comprises the following steps, in order:
1) Collection of page elements and user operation information: When the user visits a Web site, a browser extension obtains the page element information (the text description corresponding to every link, and the image link URLs) and the user operation information (the search content entered, the text description corresponding to each clicked link, and the link URL corresponding to each clicked picture), and writes them to a file and a database.
2) Analysis of page content association: Page content association comprises text association and picture association. Text association: keywords are extracted from the text descriptions in the page element information and in the user operation information, and text matching is used to analyze the degree of association between the two. Picture association: the pictures in the page element information and in the user operation information are downloaded, and image recognition is used to analyze the degree of association between the two.
3) Implementation of the automated workflow: A browser automation testing tool starts and configures the browser and simulates user operations, and scripts automate the workflow, achieving automatic Web tracking detection.
Beneficial effects: Compared with the prior art, the present invention has the following advantages:
1. The invention starts from the effect of Web tracking: it judges whether a Web site is collecting user information through tracking techniques by analyzing the association between the user's operations and the content of the Web site on two successive visits. Even as Web tracking techniques are continually updated, a Web site can be detected by the invention as long as it uses them to recommend advertisements related to the user. This avoids the prior-art problem of having to keep updating prior knowledge of Web tracking techniques, and, combined with manual code analysis, it also helps discover novel Web tracking techniques.
2. The invention uses a browser automation testing tool and automation scripts to automate the whole workflow (starting and configuring the browser, visiting Web sites, simulating user operations, and collecting page and user operation information). Web tracking is detected automatically without manual involvement, which facilitates large-scale Web tracking detection for measuring and analyzing how Web tracking techniques are used in practice.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed description
The technical solution of the present invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, the automatic Web tracking detection workflow based on content association is divided into three main steps: collection of page element information and user operation information, analysis of page content association, and implementation of the automated workflow. Investigation shows that when a user visits a Web site, a browser extension can record page elements and user operation information. The invention compares this information to analyze the association between page content and user operations and thereby judges whether the Web site is tracking the user. This not only avoids the dependence of existing methods on prior knowledge of tracking techniques, but also helps discover novel Web tracking techniques. The implementation is as follows:
Step 1: Collection of page element information and user operation information
11) Acquisition of page element information
The page element information here comprises the text description corresponding to every link and the image link URLs. Obtaining the page elements means obtaining the page's HTML source, which is available through the JavaScript API document.getElementsByTagName('html')[0].innerHTML. Because some Web sites load content dynamically, the complete HTML source cannot be obtained at the moment the user opens the Web page. The invention uses JavaScript (window.scrollTo) to simulate mouse wheel scrolling so that the page loads completely.
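A minimal sketch of this acquisition step, in JavaScript as it might run inside the extension's page script. The function names `scrollToBottom` and `collectPageElements` are hypothetical, and passing `win` and `doc` as parameters is an assumption made so the logic can be exercised outside a browser. Only `window.scrollTo`, `getElementsByTagName`, and `innerHTML` come from the text.

```javascript
// Scroll to the current bottom of the page to trigger dynamic loading.
// Returns the height that was scrolled to, so a caller can repeat the
// scroll until this value stops growing.
function scrollToBottom(win, doc) {
  const height = doc.documentElement.scrollHeight;
  win.scrollTo(0, height);
  return height;
}

// Collect the page element information named in the text: the HTML source,
// the text description of every link, and every image link URL.
function collectPageElements(doc) {
  const html = doc.getElementsByTagName('html')[0].innerHTML;
  const linkTexts = Array.from(doc.getElementsByTagName('a'))
    .map((a) => (a.innerText || '').trim())
    .filter((text) => text.length > 0);
  const imageUrls = Array.from(doc.getElementsByTagName('img'))
    .map((img) => img.src)
    .filter((src) => Boolean(src));
  return { html, linkTexts, imageUrls };
}
```

In a browser these would be called as `scrollToBottom(window, document)` and `collectPageElements(document)`. On a page with lazy loading, the scroll would probably need to be repeated with a short delay until the returned height stops changing.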
12) Acquisition of user operation information
User operation information comprises the search content entered by the user, the text description of each clicked link, and the text description and link URL of each clicked picture.
The search content entered by the user is obtained by adding a listener that dynamically checks for real-time changes to input elements.
The text descriptions of clicked links and pictures, and the URLs of clicked pictures, are obtained by monitoring the user's click behavior and reading the link and contextual text description of the clicked object. Because the clicked object usually corresponds to an <img> or <a> tag, the invention reads only the useful attributes of an <img> tag (src, alt, title) and the text of an <a> tag (obtained through innerText).
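The two listeners described above could look roughly like the following sketch. It is not the patent's own listing. The names `describeClickTarget`, `attachListeners`, and the `record` callback are hypothetical, and the shape of the recorded objects is an assumption. The `input` and `click` events and the `src`, `alt`, `title`, and `innerText` properties are standard DOM features.

```javascript
// Extract the fields the text names for a clicked element:
// src/alt/title for <img>, innerText for <a>. Other tags are ignored.
function describeClickTarget(el) {
  const tag = (el.tagName || '').toUpperCase();
  if (tag === 'IMG') {
    return { type: 'image', src: el.src || '', alt: el.alt || '', title: el.title || '' };
  }
  if (tag === 'A') {
    return { type: 'link', text: (el.innerText || '').trim() };
  }
  return null;
}

// Attach capturing listeners that record search input and clicks.
// `record` is a hypothetical callback that persists one event.
function attachListeners(doc, record) {
  // The 'input' event fires whenever the value of an <input> changes.
  doc.addEventListener('input', (event) => {
    const target = event.target;
    if (target && (target.tagName || '').toUpperCase() === 'INPUT') {
      record({ type: 'search', value: target.value });
    }
  }, true);

  // Clicks on <img> and <a> elements are recorded; all others are skipped.
  doc.addEventListener('click', (event) => {
    const info = describeClickTarget(event.target);
    if (info) record(info);
  }, true);
}
```

A click that lands on an element nested inside an `<a>` (for example a `<span>`) is ignored by this sketch. Walking up to the enclosing link would be a natural extension.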
Step 2: Analysis of page content association
21) The page content association is calculated from the page element information and the user operation information.
The analysis has two parts: text association and picture association. Text association is expressed as a text match value, computed as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in Step 1; the number of occurrences of each keyword in the page element information is then counted, and the sum of these counts is the text match value. When extracting keywords, the invention keeps only words that carry substantive meaning, such as nouns, verbs, and adjectives, and ignores unimportant words such as prepositions, numerals, and measure words. Word segmentation further splits the long Chinese words that were extracted. For example, the Chinese word for "jeans" is split again into "cowboy" (denim) and "trousers", which improves matching accuracy.
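The counting part of this computation can be sketched as follows. Keyword extraction and word segmentation are assumed to have been done already by a separate text analysis tool, which the text does not name, so the function takes a ready-made keyword list. The names `countOccurrences` and `textMatchValue` are hypothetical, and the choice of case-sensitive, non-overlapping substring matching is an assumption.

```javascript
// Count non-overlapping occurrences of `needle` in `haystack`.
function countOccurrences(haystack, needle) {
  if (!needle) return 0; // an empty keyword matches nothing
  let count = 0;
  let pos = haystack.indexOf(needle);
  while (pos !== -1) {
    count += 1;
    pos = haystack.indexOf(needle, pos + needle.length);
  }
  return count;
}

// Text match value: for each keyword taken from the user operation
// information, count its occurrences in the page text, then sum the counts.
function textMatchValue(keywords, pageText) {
  return keywords.reduce((sum, kw) => sum + countOccurrences(pageText, kw), 0);
}
```

For instance, with the keywords `['denim', 'trousers']` and the page text `'denim trousers sale, denim jacket, linen trousers'`, each keyword occurs twice, so the text match value is 4.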
Picture association is expressed as a picture match value, computed as follows: techniques such as image recognition and machine learning algorithms identify the pictures the user clicked and all pictures on the page, yielding two sets of picture categories, $S_1$ and $S_2$; the number of times each element of $S_1$ appears in $S_2$ is then counted, and the sum of these counts is the picture match value. The final content association is the sum of the text match value $\mathrm{MatchText}_{US}$ and the picture match value $\mathrm{MatchImage}_{US}$:
$\mathrm{Match}_{US} = \mathrm{MatchText}_{US} + \mathrm{MatchImage}_{US}$
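A short sketch of the picture match value and the combined content association. The image recognition step itself is not shown: the functions assume that category labels have already been produced by some classifier. Treating the clicked-picture categories as a set of unique labels and the page-picture categories as a list with one label per picture is an interpretation of the text, and the function names are hypothetical.

```javascript
// Picture match value: for each distinct category among the clicked
// pictures (S1), count how many page pictures (S2) carry that category,
// then sum the counts.
function pictureMatchValue(clickedCategories, pageCategories) {
  const unique = new Set(clickedCategories);
  let total = 0;
  for (const category of unique) {
    total += pageCategories.filter((c) => c === category).length;
  }
  return total;
}

// Content association: text match value plus picture match value.
function contentAssociation(textMatch, pictureMatch) {
  return textMatch + pictureMatch;
}
```

With clicked categories `['jeans', 'shoes']` and page categories `['jeans', 'jeans', 'hat', 'shoes']`, the picture match value is 2 + 1 = 3.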
22) Whether a Web site is tracking the user is judged from the difference in association between two successive visits to the site.
The specific steps are as follows. When the user visits Web site A, the page element information $S_1$ of site A and the user operation information $U$ are recorded. When the user visits Web site A a second time, its page element information $S_2$ is recorded. (Here $S_1$ and $S_2$ denote page element information, not the picture category sets used above.) The page content association is calculated for each of the two visits. When the difference between the two associations of page information with user operation behavior exceeds a threshold, the Web site is considered able to recommend specific advertisements to the user and is therefore tracking the user, i.e.:
$\mathrm{Match}_{US_2} - \mathrm{Match}_{US_1} > \mathit{threshhold}$
where $\mathrm{Match}_{US_2}$ is the association between the page information of the second visit and the user's operations, $\mathrm{Match}_{US_1}$ is the association between the page information of the first visit and the user's operations, and threshhold is the specified threshold, which is set to 5 in the present invention.
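The decision rule can be written as a one-line check. This is a sketch with a hypothetical function name. The default of 5 is the threshold value stated in the text (where the symbol is written `threshhold`), and the strict "greater than" comparison follows the wording "exceeds".

```javascript
// A site is flagged as tracking when the content association of the second
// visit exceeds that of the first visit by more than the threshold.
function isTracking(matchFirstVisit, matchSecondVisit, threshold = 5) {
  return matchSecondVisit - matchFirstVisit > threshold;
}
```

For example, association values of 2 on the first visit and 9 on the second give a difference of 7, which flags the site. Values of 2 and 7 give a difference of exactly 5, which does not.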
Step 3: Implementation of the automated workflow
The invention uses a browser automation testing tool to start the browser, install the browser extension, and simulate user operations, together with a multi-process automation script, to automate the workflow. As shown in Fig. 1, the task manager controls the number of concurrent processes in the whole workflow and distributes tasks (the URL of the Web site to visit, the browser configuration, and so on) to each process, that is, to each instance of the browser automation testing tool. Each process is responsible for configuring and starting a browser and simulating user operations. For a URL set S of Web sites to be detected, the steps are:
(1) Choose a URL from S and visit the Web site; simulate a mouse click on a blank area of the page (to dismiss any floating login window); simulate mouse wheel scrolling to the bottom of the page (so that the page loads completely); record the page source.
(2) Locate the search box on the home page (that is, the <input> tag); simulate the user entering a search item (the items cover various categories and the list is extensible); simulate pressing Enter; randomly select and click 3 picture links on the resulting page; record the search item.
(3) Extract the text links and image links on the home page, randomly click 3 of each, and record the content associated with the clicked links.
(4) Close all windows. If S still contains Web site URLs to be detected, repeat from (1); otherwise go to step (5).
(5) The Web site data recorded in the steps above constitute the simulated user operations and page information. Applying the association analysis method of Step 2 to them yields the set of Web sites that track the user.
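Two small pieces of this workflow are sketched below: how a task manager might split the URL set among a fixed number of worker processes, and how 3 links might be chosen at random for clicking. Both function names are hypothetical, the round-robin split is only one possible distribution policy, and the injectable `rng` parameter is an assumption added so the selection can be made reproducible. Driving the browser itself is left to whichever automation tool is used and is not shown.

```javascript
// Split the URL set among `workerCount` workers in round-robin order.
function distributeTasks(urls, workerCount) {
  if (!Number.isInteger(workerCount) || workerCount < 1) {
    throw new RangeError('workerCount must be a positive integer');
  }
  const queues = Array.from({ length: workerCount }, () => []);
  urls.forEach((url, i) => queues[i % workerCount].push(url));
  return queues;
}

// Pick up to `n` distinct items at random using a partial
// Fisher-Yates shuffle. `rng` must return a number in [0, 1).
function pickRandom(items, n, rng = Math.random) {
  const pool = items.slice(); // leave the caller's array untouched
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i += 1) {
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}
```

A worker would then loop over its own queue, and for each page call something like `pickRandom(imageLinks, 3)` to decide which links to click.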
As the above embodiment shows, the present invention implements an automatic Web tracking detection method based on content association and can effectively guard against user privacy leakage. The invention starts from the effect of Web tracking and avoids the need of existing methods to keep updating prior knowledge of tracking techniques. In addition, the invention automates the detection workflow, which facilitates large-scale analysis of how Web tracking is used in practice and also helps discover novel Web tracking techniques.
Claims (5)
1. An automatic Web tracking detection method based on content association, characterized by comprising the following steps:
(1) collecting Web page elements and user operation information by means of a browser extension;
(2) analyzing the page content association based on the Web page elements and the user operation information, and judging whether the Web site is tracking the user;
(3) implementing automatic Web tracking detection using a browser automation testing tool.
2. The automatic Web tracking detection method based on content association according to claim 1, characterized in that in step (1) the page elements comprise all text descriptions and image links in the page, and the user operation information comprises the search content entered by the user, the text description of each clicked link, and the text description and link URL of each clicked picture.
3. The automatic Web tracking detection method based on content association according to claim 2, characterized in that in step (2) the page content association comprises text association and picture association, wherein:
text association is expressed as a text match value, computed as follows: a text analysis tool performs keyword extraction and word segmentation on the user operation information obtained in step (1); the number of occurrences of each keyword in the page element information is then counted, and the sum is taken as the text match value $\mathrm{MatchText}_{US}$;
picture association is expressed as a picture match value, computed as follows: image recognition and machine learning algorithms identify the pictures the user clicked and all pictures on the page, yielding two sets of picture categories, $S_1$ and $S_2$; the number of times each element of $S_1$ appears in $S_2$ is then counted, and the sum is taken as the picture match value $\mathrm{MatchImage}_{US}$;
the page content association is: $\mathrm{Match}_{US} = \mathrm{MatchText}_{US} + \mathrm{MatchImage}_{US}$.
4. The automatic Web tracking detection method based on content association according to claim 3, characterized in that step (2) comprises:
recording the page element information and the user operation information for two successive visits by the user to a Web site, and calculating the page content association for each of the two visits, $\mathrm{Match}_{US_1}$ and $\mathrm{Match}_{US_2}$; when the difference between the page associations of the two visits is greater than the specified threshold threshhold, the Web site is considered to be tracking the user.
5. The automatic Web tracking detection method based on content association according to claim 1, characterized in that step (3) comprises: using a browser automation testing tool to start and configure the browser and visit Web sites, and writing automation scripts that simulate user operations such as clicking and entering text; obtaining the simulated user operations and page information; and obtaining, through the analysis in step (2), the set of Web sites that track the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711282970.5A CN108171074B (en) | 2017-12-07 | 2017-12-07 | Web tracking automatic detection method based on content association |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711282970.5A CN108171074B (en) | 2017-12-07 | 2017-12-07 | Web tracking automatic detection method based on content association |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108171074A true CN108171074A (en) | 2018-06-15 |
CN108171074B CN108171074B (en) | 2021-03-26 |
Family
ID=62524462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711282970.5A Active CN108171074B (en) | 2017-12-07 | 2017-12-07 | Web tracking automatic detection method based on content association |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108171074B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109309664A (en) * | 2018-08-14 | 2019-02-05 | 中国科学院数据与通信保护研究教育中心 | A kind of browser fingerprint detection behavior monitoring method |
WO2020231988A1 (en) * | 2019-05-14 | 2020-11-19 | Google Llc | Automatically detecting unauthorized re-identification |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100162393A1 (en) * | 2008-12-18 | 2010-06-24 | Symantec Corporation | Methods and Systems for Detecting Man-in-the-Browser Attacks |
CN106650382A (en) * | 2016-12-30 | 2017-05-10 | 北京工业大学 | Browser-based high-performance user tracking method |
CN107239491A (en) * | 2017-04-25 | 2017-10-10 | 广州阿里巴巴文学信息技术有限公司 | For realizing method, equipment, browser and electronic equipment that user behavior is followed the trail of |
Non-Patent Citations (3)
Title |
---|
NATALIIA BIELOVA et al.: "《24th ACM-SIGSAC Conference on Computer and Communications Security (ACM CCS)》", 3 November 2017 * |
XIAOFENG LIU et al.: "《1st IEEE International Conference on Data Science in Cyberspace (DSC)》", 16 June 2016 * |
JIANG JUN et al.: "Research on browser fingerprint detection and identification technology", 《Secrecy Science and Technology》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109309664A (en) * | 2018-08-14 | 2019-02-05 | 中国科学院数据与通信保护研究教育中心 | A kind of browser fingerprint detection behavior monitoring method |
CN109309664B (en) * | 2018-08-14 | 2021-03-23 | 中国科学院数据与通信保护研究教育中心 | Browser fingerprint detection behavior monitoring method |
WO2020231988A1 (en) * | 2019-05-14 | 2020-11-19 | Google Llc | Automatically detecting unauthorized re-identification |
US11093644B2 (en) | 2019-05-14 | 2021-08-17 | Google Llc | Automatically detecting unauthorized re-identification |
CN113287143A (en) * | 2019-05-14 | 2021-08-20 | 谷歌有限责任公司 | Automatic detection of unauthorized re-identification |
CN113287143B (en) * | 2019-05-14 | 2022-12-16 | 谷歌有限责任公司 | Automatic detection of unauthorized re-identification |
US11720710B2 (en) | 2019-05-14 | 2023-08-08 | Google Llc | Automatically detecting unauthorized re-identification |
Also Published As
Publication number | Publication date |
---|---|
CN108171074B (en) | 2021-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Vishwakarma et al. | Detection and veracity analysis of fake news via scrapping and authenticating the web search | |
US10032081B2 (en) | Content-based video representation | |
CN103559235B (en) | A kind of online social networks malicious web pages detection recognition methods | |
CN109815386B (en) | User portrait-based construction method and device and storage medium | |
CN101534306A (en) | Detecting method and a device for fishing website | |
CN102436564A (en) | Method and device for identifying falsified webpage | |
WO2017084205A1 (en) | Network user identity authentication method and system | |
CN103020123A (en) | Method for searching bad video website | |
CN104036190A (en) | Method and device for detecting page tampering | |
CN108694325B (en) | Method and device for identifying specified type of website | |
CN108171074A (en) | One kind is based on the associated Web trackings automatic testing method of content | |
CN104036189A (en) | Page distortion detecting method and black link database generating method | |
CN111199172A (en) | Terminal screen recording-based processing method and device and storage medium | |
CN103729354B (en) | web information processing method and device | |
CN105447148B (en) | A kind of Cookie mark correlating method and device | |
CN108595453B (en) | URL (Uniform resource locator) identifier mapping obtaining method and device | |
CN108614849A (en) | A kind of web advertisement detection method based on dynamic pitching pile and static more script page feature extractions | |
CN109165264B (en) | Webpage analysis method and device based on diversified thermodynamic diagrams | |
CN110866170A (en) | Importance evaluation method, search method and system for Tor darknet service based on site quality | |
CN104978431B (en) | Web data fusion method and device | |
CN114780891A (en) | Website key resource analysis method and device based on page rendering contribution degree | |
CN104063491B (en) | A kind of method and device that the detection page is distorted | |
KR101277300B1 (en) | Method and apparatus for presenting personalized advertisements | |
Prasad et al. | Face-Based Alumni Tracking on Social Media Using Deep Learning | |
CN106227858B (en) | A kind of accurate extracting method of mobile Internet webpage or media platform article content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||