CN109617915B - Abnormal user mining method based on page access topology - Google Patents

Abnormal user mining method based on page access topology

Info

Publication number
CN109617915B
CN109617915B CN201910035793.3A CN201910035793A CN109617915B CN 109617915 B CN109617915 B CN 109617915B CN 201910035793 A CN201910035793 A CN 201910035793A CN 109617915 B CN109617915 B CN 109617915B
Authority
CN
China
Prior art keywords
search subset
url
access
abnormal
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910035793.3A
Other languages
Chinese (zh)
Other versions
CN109617915A (en)
Inventor
李建聪
邓金城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Knownsec Information Technology Co ltd
Original Assignee
Chengdu Knownsec Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Knownsec Information Technology Co ltd
Priority to CN201910035793.3A
Publication of CN109617915A
Application granted
Publication of CN109617915B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an abnormal user mining method based on page access topology, comprising the following steps. Step 1: from the information extracted from the log, sort the URLs requested by each IP in ascending order of time and construct an access topology database. Step 2: extract training samples. Step 3: for the training samples, index the IPs by URL. Step 4: take the IPs in the training samples as the initial search subset, use the reverse index to remove from the search subset the IPs that did not access the URL, and update the search subset; repeat until the search subset is empty or can no longer be updated; calculate a score from the number of cycles. Step 5: sort the IPs in ascending order of score, set a threshold, and treat the IPs whose score is below the threshold as abnormal users. The invention uses the topological structure to characterize the behavior of normal users and to judge how abnormal a user is, and it offers better adaptive capability and a lower miss rate.

Description

Abnormal user mining method based on page access topology
Technical Field
The invention relates to an abnormal user mining method, in particular to an abnormal user mining method based on page access topology.
Background
Existing network security products generally use a set of rules or policies to describe the boundary of user behavior, and trigger a handling action when some characteristic of a user exceeds the threshold for normal users. For example, conventional abnormal user detection mostly relies on static feature matching, such as regular expressions that match SQL injection or XSS attacks.
The conventional methods have the following disadvantages: first, static features are easy to bypass, so the miss rate is high; second, static features can match only the anomalies already present in the feature library, so unknown anomalies cannot be detected.
Disclosure of Invention
The invention provides an abnormal user mining method based on page access topology, which starts from users and their access topology, uses the access topology to characterize user behavior, improves detection accuracy, and reduces the miss rate.
The technical scheme adopted by the invention is as follows: an abnormal user mining method based on page access topology, comprising the following steps:
Step 1: from the information extracted from the log, sort the URLs requested by each IP in ascending order of time and construct an access topology database;
Step 2: extract training samples from the database obtained in step 1;
Step 3: for the training samples from step 2, index the IPs by URL;
Step 4: take the IPs in the training samples as the initial search subset, use the reverse index to remove from the search subset the IPs that did not access the URL, and update the search subset; repeat until the search subset is empty or can no longer be updated; calculate a score from the number of cycles;
Step 5: sort the IPs in ascending order of score, set a threshold, and treat the IPs whose score is below the threshold as abnormal users.
Further, the information extracted from the log in step 1 includes the target URL, the request source IP, and the request time, and the request parameters of the URL are removed.
Further, the score calculation process in step 4 is as follows:
[Equation BDA0001945866550000011: score s as a function of the number of cycles and c(N); image not recovered]
In the formula, s is the score, N is the number of training samples in step 2, and c(N) is the average number of steps of a failed binary tree search:
$$c(N) = 2H(N-1) - \frac{2(N-1)}{N}$$
In the formula, H(N-1) is the harmonic number, a function of N-1.
The invention has the beneficial effects that:
(1) starting from the user and the access topology, the method uses the access topology to characterize user behavior and detects users with abnormal access behavior;
(2) the invention effectively uses information such as the repetition of page visits, the page type, and the order of page access, which improves detection accuracy, reduces the miss rate, and provides a degree of adaptivity;
(3) the invention uses the topological structure formed by the accessed web pages to characterize the behavior of normal users and to judge how abnormal a user is, and has better adaptive capability and a lower miss rate.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic view of the score calculation process of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
As shown in FIG. 1 and FIG. 2, an abnormal user mining method based on page access topology includes the following steps:
Step 1: from the information extracted from the log, sort the URLs requested by each IP in ascending order of time and construct an access topology database.
The target URL, the request source IP, and the request time are extracted from the log, and the parameters of the URL are removed.
The URLs in an access log often contain request parameters, which are irrelevant to the method of the invention and need to be removed; for example:
http://www.target.com?sreach_words=a,b,c,d
After the parameters are removed, the URL becomes:
http://www.target.com
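The parameter removal above could be sketched as follows. This is a minimal illustration rather than the patent's own implementation; the function name `strip_request_parameters` is hypothetical, and dropping the fragment along with the query string is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_request_parameters(url: str) -> str:
    """Return the URL without its query string (and, by assumption, its fragment)."""
    parts = urlsplit(url)
    # Keep scheme, host, and path; blank out query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_request_parameters("http://www.target.com?sreach_words=a,b,c,d"))
```

Normalizing URLs this way means that requests differing only in their parameters map to the same node of the access topology.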
The URLs are then sorted in ascending order of time, taking a single IP as the unit, to construct the access topology database.
For example, the information corresponding to one IP in the log is as follows:
IP    URL    Time of access
xxx   url1   2018-06-11:15:35:28
xxx   url2   2018-06-11:15:36:05
xxx   url3   2018-06-11:15:38:25
xxx   url4   2018-06-11:15:40:50
The request topology obtained for this IP is: 1 → 2 → 3 → 4.
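A minimal sketch of how such per-IP topologies might be built from log records is given below. The record layout `(ip, url, time)`, the function name `build_access_topologies`, and the use of an in-memory dictionary in place of a database are all assumptions made for illustration; the time format follows the example rows.

```python
from collections import defaultdict
from datetime import datetime

# Time format taken from the example rows, e.g. "2018-06-11:15:35:28".
TIME_FORMAT = "%Y-%m-%d:%H:%M:%S"


def build_access_topologies(records):
    """Group (ip, url, time) records by IP and order each IP's URLs by request time."""
    visits_by_ip = defaultdict(list)
    for ip, url, time_text in records:
        visits_by_ip[ip].append((datetime.strptime(time_text, TIME_FORMAT), url))
    return {
        ip: [url for _, url in sorted(visits, key=lambda visit: visit[0])]
        for ip, visits in visits_by_ip.items()
    }


# Records deliberately out of order to show the sort.
records = [
    ("xxx", "url3", "2018-06-11:15:38:25"),
    ("xxx", "url1", "2018-06-11:15:35:28"),
    ("xxx", "url4", "2018-06-11:15:40:50"),
    ("xxx", "url2", "2018-06-11:15:36:05"),
]
topologies = build_access_topologies(records)
print(topologies["xxx"])
```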
Step 2: extract training samples from the database obtained in step 1.
Step 3: for the training samples from step 2, index the IPs by URL.
Step 4: take the IPs in the training samples as the initial search subset, use the reverse index to remove from the search subset the IPs that did not access the URL, and update the search subset; repeat until the search subset is empty or can no longer be updated; calculate a score from the number of cycles.
Step 5: sort the IPs in ascending order of score, set a threshold, and treat the IPs whose score is below the threshold as abnormal users.
Steps 2 to 4 can serve as a normal score calculation module, which calculates the normal score of a specified IP in the access topology database. For convenience of explanation, assume that the IP to be detected and its access topology are:
IP3:1→2→3→4→7→8→10
First, a certain number of samples are drawn without replacement from the constructed access topology database as training samples. This number is a model parameter that can be adjusted as needed to achieve the best detection effect; the invention uses 256 samples.
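Drawing the training samples without replacement could look like the sketch below. The function name `draw_training_sample`, the `seed` parameter, and the fallback to the whole database when it holds fewer than 256 IPs are assumptions, not details stated in the source.

```python
import random


def draw_training_sample(topologies, sample_size=256, seed=None):
    """Pick up to `sample_size` distinct IPs (no replacement) with their topologies."""
    rng = random.Random(seed)
    ips = sorted(topologies)          # stable order so a fixed seed is reproducible
    k = min(sample_size, len(ips))    # assumption: use everything if the database is small
    return {ip: topologies[ip] for ip in rng.sample(ips, k)}


# Hypothetical database of 1000 IPs, each with a short topology.
database = {f"ip{i}": [1, 2, 3] for i in range(1000)}
sample = draw_training_sample(database, seed=0)
print(len(sample))
```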
Then a reverse (inverted) index is built over the selected training samples; a reverse index uses the content as the index and the former index as the content. In the invention this means indexing the IPs by URL. For example, the data before indexing is:
IP0:URL1→URL2→URL3→URL4→URL7→URL8→URL10
IP1:URL1→URL2→URL3→URL4→URL8→URL10
IP2:URL1→URL2→URL3→URL7→URL8→URL10
After reverse indexing:
URL1:IP0,IP1,IP2
URL2:IP0,IP1,IP2
URL3:IP0,IP1,IP2
URL4:IP0,IP1
URL7:IP0,IP2
URL8:IP0,IP1,IP2
URL10:IP0,IP1,IP2
Constructing the reverse index speeds up the calculation.
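One way to build such a reverse index is sketched below, using the three sample topologies listed before indexing. The function name `build_inverted_index` and the dictionary-of-sets representation are assumptions; the sets printed for each URL follow directly from those three topologies.

```python
from collections import defaultdict


def build_inverted_index(training_sample):
    """Map each URL to the set of IPs whose access topology contains it."""
    index = defaultdict(set)
    for ip, urls in training_sample.items():
        for url in urls:
            index[url].add(ip)
    return dict(index)


training_sample = {
    "IP0": ["URL1", "URL2", "URL3", "URL4", "URL7", "URL8", "URL10"],
    "IP1": ["URL1", "URL2", "URL3", "URL4", "URL8", "URL10"],
    "IP2": ["URL1", "URL2", "URL3", "URL7", "URL8", "URL10"],
}
inverted_index = build_inverted_index(training_sample)
for url, ips in inverted_index.items():
    print(url, sorted(ips))
```

With the index in place, finding every training IP that visited a given URL is a single dictionary lookup instead of a scan over all topologies.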
Next, a URL is randomly selected from the access topology of the IP to be detected, and the IPs in the training samples are taken as the initial search subset.
Finally, the reverse index is used to remove from the search subset the IPs that did not access the selected URL, and the search subset is updated with the result; this is repeated until the search subset is empty or can no longer be updated. The number of cycles is recorded, and the normal score is obtained from it.
Assume that the selected training samples are:
IP0:1→6→9→11→12→13→14
IP1:1→2→3→4→5→8→10
IP2:1→2→3→4→8→10
IP4:1→2→3→4→7→10
IP5:1→2→3→4→5→8→10
IP6:1→2→3→7→8→10
IP7:1→2→3→7→10
IP8:1→2→3→4→5→8→7→4→5→8→7→4→5→8→10
The initial search subset is {IP0, IP1, IP2, IP4, …, IP8}.
Select URL3 and remove from the search subset the IPs that did not visit URL3; the new search subset is {IP1, IP2, IP4, …, IP8}.
Select URL7 and remove from the search subset the IPs that did not visit URL7; the new search subset is {IP4, IP6, IP7, IP8}.
Select URL8 and remove from the search subset the IPs that did not visit URL8; the new search subset is {IP6, IP8}.
At this point, since IP6 and IP8 in the search subset have accessed each URL that the sample to be detected, IP3, has accessed, the search subset cannot be updated and the loop stops.
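The walkthrough above can be replayed with the short sketch below. The function names `build_inverted_index` and `replay_selections` are hypothetical, and the random URL choice is replaced by an explicit list of selections so that the three steps of the example are reproduced exactly. One observation from the listed data: IP6 (1→2→3→7→8→10) never visited URL4, so if URL4 were selected next the subset would shrink further to {IP8}; the sketch only replays the three selections given in the text.

```python
def build_inverted_index(samples):
    """Map each URL to the set of training IPs that visited it."""
    index = {}
    for ip, urls in samples.items():
        for url in urls:
            index.setdefault(url, set()).add(ip)
    return index


def replay_selections(index, initial_subset, selected_urls):
    """Narrow the search subset once per selected URL; stop early if it becomes empty."""
    subset = set(initial_subset)
    steps = []
    for url in selected_urls:
        subset = subset & index.get(url, set())  # keep only IPs that visited this URL
        steps.append((url, sorted(subset)))
        if not subset:
            break
    return steps


samples = {
    "IP0": [1, 6, 9, 11, 12, 13, 14],
    "IP1": [1, 2, 3, 4, 5, 8, 10],
    "IP2": [1, 2, 3, 4, 8, 10],
    "IP4": [1, 2, 3, 4, 7, 10],
    "IP5": [1, 2, 3, 4, 5, 8, 10],
    "IP6": [1, 2, 3, 7, 8, 10],
    "IP7": [1, 2, 3, 7, 10],
    "IP8": [1, 2, 3, 4, 5, 8, 7, 4, 5, 8, 7, 4, 5, 8, 10],
}
index = build_inverted_index(samples)
steps = replay_selections(index, samples, [3, 7, 8])
for url, subset in steps:
    print(f"URL{url}: {subset}")
```

The number of entries in `steps` corresponds to the number of cycles performed for the chosen selections.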
The normal score is calculated from the number of cycles.
Let t be the number of cycles; the score s is:
[Equation BDA0001945866550000041: score s as a function of t and c(N); image not recovered]
In the formula, s is the score, N is the number of training samples in step 2, and c(N) is the average number of steps of a failed binary tree search:
$$c(N) = 2H(N-1) - \frac{2(N-1)}{N}$$
In the formula, H(N-1) is the harmonic number, a function of N-1,
where $H(N-1) \approx \ln(N-1) + 0.5772156649$.
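The normalizing term c(N) can be evaluated as in the sketch below. The expression used, 2H(N-1) - 2(N-1)/N, is the standard average path length of an unsuccessful binary search tree lookup; that it matches the equation image of the source, which is not reproduced in this text, is an assumption. The function names are hypothetical, and the score s itself is deliberately not computed because its formula is not available here.

```python
import math

EULER_GAMMA = 0.5772156649  # constant quoted in the text


def harmonic_approx(n: int) -> float:
    """Approximate the harmonic number H(n) as ln(n) + Euler's constant."""
    return math.log(n) + EULER_GAMMA


def average_failed_search_length(n_samples: int) -> float:
    """c(N) under the assumed form 2*H(N-1) - 2*(N-1)/N."""
    if n_samples < 2:
        raise ValueError("c(N) needs at least two training samples")
    return 2 * harmonic_approx(n_samples - 1) - 2 * (n_samples - 1) / n_samples


c_256 = average_failed_search_length(256)
print(round(c_256, 4))
```

For the 256 training samples mentioned in the text, this gives a value of roughly 10.24 under the stated assumption.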
The normal score of each IP in the access topology database is calculated by the normal score module; the IPs are sorted in ascending order of score, a threshold is set, and the IPs whose score is below the threshold are determined to be abnormal users.
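The final ranking and thresholding step could be sketched as follows. The function name `flag_abnormal_ips`, the example scores, and the threshold value 0.4 are all hypothetical and chosen only to illustrate the comparison; the source does not state a threshold.

```python
def flag_abnormal_ips(normal_scores, threshold):
    """Sort IPs by ascending normal score and return those strictly below the threshold."""
    ranked = sorted(normal_scores.items(), key=lambda item: item[1])
    return [ip for ip, score in ranked if score < threshold]


# Hypothetical normal scores for four IPs.
normal_scores = {"IP3": 0.62, "IP9": 0.18, "IP10": 0.55, "IP11": 0.31}
abnormal = flag_abnormal_ips(normal_scores, threshold=0.4)
print(abnormal)
```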
Because abnormal users often have behavior characteristics different from those of normal users, and the access topology is the most intuitive expression of access behavior, abnormal users can be detected through the access topology. The access topology of an abnormal user tends to have one or more of the following characteristics:
some URLs in the access topology are rarely visited by normal users;
the access topology contains a large number of repeated URLs;
the access topology is much longer than that of normal users.
Using the method of the invention, behavior with the above access characteristics, and the corresponding users, can therefore be selected from the log.
Traditional behavior-based abnormal user detection usually characterizes user behavior through features such as access frequency per unit time and total number of accesses. These features do not use the order in which the user accesses pages, and they ignore information such as the type and repetition of the pages accessed. They therefore often fail to characterize user behavior well, which limits the detection effect.
The invention starts from the user and the access topology, uses the access topology to characterize user behavior, and effectively uses information such as the repetition of page visits, the page type, and the order of page access; this improves detection accuracy, reduces the miss rate, and provides a degree of adaptivity.
The abnormal user in the invention refers to a user whose requests differ greatly from normal user behavior, such as a web crawler or a user issuing attack requests; the access topology refers to the topological structure formed by the URLs corresponding to a given IP.

Claims (2)

1. An abnormal user mining method based on page access topology, characterized by comprising the following steps:
Step 1: from the information extracted from the log, sort the URLs requested by each IP in ascending order of time and construct an access topology database;
Step 2: extract training samples from the database obtained in step 1;
Step 3: for the training samples from step 2, index the IPs by URL;
Step 4: take the IPs in the training samples as the initial search subset, use the reverse index to remove from the search subset the IPs that did not access the URL, and update the search subset; repeat until the search subset is empty or can no longer be updated; calculate a score from the number of cycles; the calculation process is as follows:
[Equation FDA0002755261450000011: score s as a function of t and c(N); image not recovered]
in the formula: s is the score, t is the number of cycles, N is the number of training samples in step 2, and c(N) is the average number of steps of a failed binary tree search,
$$c(N) = 2H(N-1) - \frac{2(N-1)}{N}$$
in the formula: H(N-1) is the harmonic number, a function of N-1;
Step 5: sort the IPs in ascending order of score, set a threshold, and treat the IPs whose score is below the threshold as abnormal users.
2. The abnormal user mining method based on page access topology as claimed in claim 1, wherein the information extracted from the log in step 1 includes the target URL, the request source IP, and the request time, and the request parameters of the URL are removed.
CN201910035793.3A 2019-01-15 2019-01-15 Abnormal user mining method based on page access topology Active CN109617915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910035793.3A CN109617915B (en) 2019-01-15 2019-01-15 Abnormal user mining method based on page access topology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910035793.3A CN109617915B (en) 2019-01-15 2019-01-15 Abnormal user mining method based on page access topology

Publications (2)

Publication Number Publication Date
CN109617915A CN109617915A (en) 2019-04-12
CN109617915B true CN109617915B (en) 2020-12-15

Family

ID=66017344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910035793.3A Active CN109617915B (en) 2019-01-15 2019-01-15 Abnormal user mining method based on page access topology

Country Status (1)

Country Link
CN (1) CN109617915B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631830A (en) * 2012-08-29 2014-03-12 华为技术有限公司 Method and device for detecting web spiders
CN103823883A (en) * 2014-03-06 2014-05-28 焦点科技股份有限公司 Analysis method and system for website user access path
CN104980446A (en) * 2015-06-30 2015-10-14 百度在线网络技术(北京)有限公司 Detection method and system for malicious behavior
CN105357054A (en) * 2015-11-26 2016-02-24 上海晶赞科技发展有限公司 Website traffic analysis method and apparatus, and electronic equipment
CN107204958A (en) * 2016-03-16 2017-09-26 阿里巴巴集团控股有限公司 The detection method and device of web page resources element, terminal device
CN107438079A (en) * 2017-08-18 2017-12-05 杭州安恒信息技术有限公司 A kind of detection method of the unknown abnormal behaviour in website
CN108881194A (en) * 2018-06-07 2018-11-23 郑州信大先进技术研究院 Enterprises user anomaly detection method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9270693B2 (en) * 2013-09-19 2016-02-23 The Boeing Company Detection of infected network devices and fast-flux networks by tracking URL and DNS resolution changes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
一种基于网页元数据的用户访问行为建模方法;杜瑾等;《西安交通大学学报》;20080229;第42卷(第2期);全文 *

Also Published As

Publication number Publication date
CN109617915A (en) 2019-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 9/F, Block C, No. 28 Tianfu Avenue North Section, Chengdu High tech Zone, China (Sichuan) Pilot Free Trade Zone, Chengdu City, Sichuan Province, 610000

Patentee after: CHENGDU KNOWNSEC INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 610000, 11th floor, building 2, No. 219, Tianfu Third Street, hi tech Zone, Chengdu, Sichuan Province

Patentee before: CHENGDU KNOWNSEC INFORMATION TECHNOLOGY Co.,Ltd.