CN113395268A

CN113395268A - Online and offline fusion-based web crawler interception method

Info

Publication number: CN113395268A
Application number: CN202110616355.3A
Authority: CN
Inventors: 罗笑南; 张家伟
Original assignee: Guilin Xiaowei Hotel Management Co ltd; Guilin University of Electronic Technology
Current assignee: Guilin Xiaowei Hotel Management Co ltd; Guilin University of Electronic Technology
Priority date: 2021-06-03
Filing date: 2021-06-03
Publication date: 2021-09-14

Abstract

The invention provides a web crawler interception method based on online and offline fusion. False data is set, and whether a visitor is a crawler is judged online by analyzing the visitor's access behavior; an id that shows obvious real-user behavior is passed directly, and a verification code is presented to any id that the online analysis identifies as a crawler, which reduces misjudgment. The weblog is analyzed offline to identify better-concealed crawlers that bypass online detection, and these crawlers are added to a crawler list library. By combining online and offline identification, the invention saves server resources and improves accuracy.

Description

Online and offline fusion-based web crawler interception method

Technical Field

The invention relates to a web crawler intercepting method, in particular to a web crawler intercepting method under the combined action of online real-time analysis and offline analysis.

Background

A web crawler is mainly used to collect required content automatically from various web pages and to store and process it after collection. Crawler technology plays a critical role on the internet, improving the efficiency and diversity of information acquisition, but web crawlers also bring many negative effects. Some people use crawlers to collect data beyond the permitted scope, gathering unnecessary and irrelevant data and stealing private user data in bulk, which leaks personal data and harms social security. In addition, a multi-threaded crawler that crawls a website heavily can occupy a large share of the website's bandwidth, so that normal users cannot access the website and normal public access is disrupted. Detecting, intercepting and banning malicious web crawlers is therefore of great significance for maintaining overall network security and protecting enterprise interests.

Disclosure of Invention

Current web crawler detection technology has poor real-time performance and consumes a large amount of server resources while detecting crawler scraping. Misjudgment often occurs because detection algorithms differ, so that normal users cannot access the website and user experience suffers. Because crawler types are diverse, no uniform algorithm exists for detecting them, so detection speed and detection accuracy cannot be improved at the same time. The method provided by the invention identifies the access behavior of crawlers and, by combining real-time and offline analysis, reduces the misjudgment rate while improving speed, effectively preventing malicious crawlers from crawling resources; the values for which user behavior and crawler behavior differ most clearly are selected as the criteria on which the detection algorithm judges. In addition, the invention provides an effective detection method for crawlers that simulate the operations of real users. The technical scheme adopted by the invention is as follows:

A modularized web crawler interception method based on a sliding window comprises an online detection method and an offline detection method.

The online detection method comprises the following steps:

1) Setting false data for numeric data, so that the data acquired by a crawler differs from the real data.

2) Setting a queue space, storing the access behavior data of each visitor in a queue, and further judging whether the visitor is a crawler.

3) For the crawlers identified by the above two methods, avoiding misjudgment by means of verification codes.

Specifically,

in step 1), the data displayed on the page is parsed by CSS before being rendered to the page, and the parsing rule is custom-defined. A user obtains the real data when visiting, whereas a crawler captures the false data as it exists before parsing.

The storing of the access data of each visitor in the queue in step 2) specifically includes:

extracting key fields including id, access time, reference field and access type from the request of the visitor;

maintaining a queue for each user, and storing n records recently accessed by the user, wherein n is the size of the queue;

for each request, first extracting the key fields; if no queue exists for the user, first establishing a queue for the user; if the user's queue is full, popping the record at the head of the queue, namely the earliest record; then storing the latest request information in the queue;
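The queue maintenance described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `AccessRecord`, `QUEUE_SIZE`, `queues` and `record_request` are hypothetical, the value n = 5 is arbitrary, and a Python `deque` with `maxlen` is used so that the earliest record is discarded automatically when a full queue receives a new one.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AccessRecord:
    """Key fields extracted from one request (field names are illustrative)."""
    visitor_id: str     # user account, or IP address when no account exists
    access_time: float  # time of the request, in seconds
    reference: str      # the "reference field" of the request
    access_type: str    # request method, e.g. "GET" or "HEAD"


QUEUE_SIZE = 5  # n, the size of each queue (arbitrary value for the sketch)

queues = {}  # visitor id -> deque holding the n most recent records


def record_request(rec):
    """Store one request in the visitor's queue, creating the queue if needed."""
    q = queues.get(rec.visitor_id)
    if q is None:
        # No queue exists for this visitor yet: establish one.
        q = deque(maxlen=QUEUE_SIZE)
        queues[rec.visitor_id] = q
    # When the deque is full, append() drops the record at the head
    # (the earliest one) before storing the latest request.
    q.append(rec)
    return q
```

After seven calls for the same visitor with n = 5, the queue holds only the five most recent records.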

further, in step 2), judging whether the visitor is a crawler specifically includes:

setting a characteristic value and a weight for each access behavior in the access behaviors of the visitor, weighting and summing the characteristic values to obtain an overall value, and judging the user with the overall value exceeding a threshold value as a crawler;

each access behavior comprises:

the percentage of error responses in the queue is used as a characteristic value;

acquiring the request modes of all requests in the queue, wherein the occupation ratio of the HEAD request is used as a characteristic value;

in the queue, the requested resources are classified and the number of accesses to each type of resource is counted; the access proportion of each resource type is then squared, and the squares are added to serve as a characteristic value;

the visit time interval feature is used as a characteristic value; the larger this value, the higher the probability that the visitor is a crawler.

Further, the visit time interval feature requires a visit interval score, calculated as follows:

(1) calculating all adjacent access time intervals in the queue to obtain a time interval sequence, each value of which is denoted time;

(2) initializing score to 0; setting a minimum duration and a maximum duration; traversing the time interval sequence, and adding a value to score when time is less than the set minimum duration, leaving score unchanged when time is between the set minimum duration and the set maximum duration, and subtracting the value from score when time is greater than the set maximum duration;

(3) after the whole sequence has been traversed, taking the final score as the characteristic value.
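Steps (1) to (3) can be sketched as a short function. The thresholds `min_gap` and `max_gap` and the increment `step` are illustrative assumptions; the text does not fix their values, and the function name `interval_score` is hypothetical.

```python
def interval_score(access_times, min_gap=1.0, max_gap=30.0, step=1.0):
    """Score a visitor's request spacing; a larger score suggests a crawler.

    access_times: list of request times in ascending order (seconds).
    min_gap, max_gap, step: assumed values, not taken from the text.
    """
    score = 0.0
    # Step (1): adjacent differences form the time interval sequence.
    for earlier, later in zip(access_times, access_times[1:]):
        gap = later - earlier
        # Step (2): adjust the score according to each interval.
        if gap < min_gap:
            score += step   # very rapid request
        elif gap > max_gap:
            score -= step   # long pause
        # intervals between min_gap and max_gap leave the score unchanged
    # Step (3): the final score is the characteristic value.
    return score
```

With these assumed thresholds, four requests spaced 0.2 seconds apart yield a score of 3.0, while a visitor who pauses for more than 30 seconds between requests is pulled towards a negative score.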

The offline identification method comprises the following steps, the data source for offline identification being the weblog, which is analyzed:

(1) Extracting key fields, including id, url and page turning condition; the extracted url is used to analyze the type of resource requested by the user.

(2) Counting the requests in a statistical period, including the total number of requests and the number of requests of each type, and taking the proportion of HEAD requests among the access types as a characteristic value; counting page turning, namely taking as a characteristic value the proportion of visits within the statistical period whose number of page turns exceeds a set number; counting how many times each specific resource type is requested within the statistical period; counting the number of accesses to each resource; then squaring the access counts of the resources and adding them to serve as a characteristic value.

(3) For each id, assigning a corresponding weight to each characteristic value and computing the weighted average; if the weighted average exceeds a set score threshold, determining the id to be a crawler.
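Step (3) of the offline method can be sketched as below, assuming the characteristic values of each id have already been computed. The feature names, weights and threshold are hypothetical placeholders; the text does not specify them.

```python
def weighted_average(features, weights):
    """Weighted average of one id's characteristic values.

    features: dict mapping a feature name to its value.
    weights:  dict mapping the same names to their weights (assumed values).
    """
    total_weight = sum(weights[name] for name in features)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(weights[name] * value for name, value in features.items())
    return weighted_sum / total_weight


def offline_crawlers(features_by_id, weights, threshold):
    """Return the set of ids whose weighted average exceeds the threshold."""
    return {
        visitor_id
        for visitor_id, features in features_by_id.items()
        if weighted_average(features, weights) > threshold
    }
```

The ids returned by `offline_crawlers` would then be the ones added to the crawler list library.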

The main advantages of the invention are as follows. The online method quickly detects most crawlers in real time, which improves the real-time performance of crawler detection, and adding verification codes greatly reduces misjudgment. The offline method improves the accuracy of crawler identification by analyzing a large amount of data, and its results can be fed back to the online analysis module in order to adjust it.

Description of the drawings:

FIG. 1 is a general flow chart of the algorithm of the present invention.

FIG. 2 is a flow chart of an on-line analysis of the present invention.

FIG. 3 is a flow chart of an offline analysis of the present invention.

Detailed Description

In order to identify crawling behavior accurately and in real time, the algorithm is divided into two parts: an online identification method and an offline identification method.

The online identification method identifies suspicious scraping behavior and comprises three parts: setting false data, access behavior analysis, and verification code verification;

1. setting false data:

Whether a visitor is a crawler is preliminarily judged from the difference between how normal users and crawlers access the page: normal users browse data that has been rendered by CSS, whereas the data crawled by a crawler has not been rendered. The digits are parsed and displayed on the page according to a rule; for example, if the real data is 1234, a rule that moves each digit down by one may be made, so that the four digits 0123 are rendered to the page.
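The digit rule in the example can be illustrated with a small sketch. It rests on one reading of the example, stated here as an assumption: the shifted value 0123 is what is placed in the page source, and the rendering rule reverses the shift so that a normal user sees 1234. The text does not say how the CSS layer applies the rule, so Python stands in only for the mapping itself; the function names and the wrap-around from 0 to 9 are also assumptions.

```python
DIGITS = "0123456789"


def encode_for_source(real, shift=1):
    """Move every digit down by `shift` (wrapping 0 -> 9); keep other characters."""
    return "".join(
        DIGITS[(DIGITS.index(ch) - shift) % 10] if ch in DIGITS else ch
        for ch in real
    )


def decode_for_display(fake, shift=1):
    """Reverse the rule, recovering the value a normal user is meant to see."""
    return "".join(
        DIGITS[(DIGITS.index(ch) + shift) % 10] if ch in DIGITS else ch
        for ch in fake
    )
```

Under this reading, `encode_for_source("1234")` gives `"0123"`, which is what a crawler reading the raw source would capture, and `decode_for_display("0123")` gives back `"1234"`.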

2. And (3) access behavior analysis:

first, access information is maintained for each visitor: key fields, including id, access time and access type, are extracted from the visitor's access request and retained. The id may be a user account or an IP address; the user account is used as the id when one exists, and the user's IP address is used when the user has not registered an account;

because the server's access volume is large, a queue is maintained for each id, and the queue records the n most recent access records of the visitor, where n is the size of the queue;

when a request comes, firstly analyzing the request and extracting key fields;

if the queue is full, deleting the earliest record in the queue;

storing the newly requested information into a queue;

in addition, because the number of ordinary users is large, a user often sends no further request for a long time after making several requests. Keeping queues for such users wastes considerable resources, although their requests can obviously be passed directly without further examination. It is therefore necessary to scan all queues periodically and clean up those that are no longer needed. A queue whose latest request is too far from the current time can be deleted directly; for example, a time threshold is set, and any queue whose latest request is older than the current time by more than the threshold is deleted.
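The periodic cleanup can be sketched as follows. For brevity the sketch assumes each queue stores only access times, with the most recent at the tail; the function name `purge_idle_queues` and the parameter `idle_threshold` are hypothetical.

```python
import time


def purge_idle_queues(queues, idle_threshold, now=None):
    """Delete queues whose latest request is older than idle_threshold seconds.

    queues: dict mapping visitor id -> sequence of access times (latest last).
    Returns the list of ids whose queues were removed.
    """
    now = time.time() if now is None else now
    # Collect stale ids first so the dict is not modified while iterating.
    stale = [
        visitor_id
        for visitor_id, q in queues.items()
        if not q or now - q[-1] > idle_threshold
    ]
    for visitor_id in stale:
        del queues[visitor_id]
    return stale
```

A scheduler or background thread would call this function at a fixed period; how that is arranged is left open here.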

The simplest existing method for identifying crawlers is syntactic analysis, which includes robots protocol access detection and user-agent detection; robots protocol access detection can detect some well-behaved crawlers. Under the robots protocol, a robots.txt file exists on the server and lists the content that crawlers should not crawl; a well-behaved crawler accesses the robots.txt file and does not access the files indicated in it. However, the robots protocol is not mandatory, and some malicious crawlers do not access this file, so this strategy is not advisable. The specific algorithm for access behavior analysis therefore adopts communication pattern analysis: the feature vectors of the access behavior are extracted, the characteristic values are weighted and summed to obtain an evaluation score, and if the evaluation score exceeds a set threshold, the visitor is judged to be a suspected crawler.
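To make the robots protocol concrete, the sketch below shows the check a well-behaved crawler performs before fetching a URL, using Python's standard `urllib.robotparser`. The rules, the user agent `ExampleBot` and the `example.com` URLs are invented for illustration. A malicious crawler simply skips this check, which is why the text does not rely on this strategy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content served by the site.
ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /private/",
]

rules = RobotFileParser()
rules.parse(ROBOTS_TXT)  # parse the rules directly instead of downloading them


def may_fetch(user_agent, url):
    """Return True if the parsed rules allow this user agent to fetch url."""
    return rules.can_fetch(user_agent, url)
```

Here `may_fetch("ExampleBot", "https://example.com/private/data.html")` is refused, while a URL outside `/private/` is allowed.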

The characteristic values are as follows:

The percentage of error responses in the queue is used as a characteristic value. The number of error responses is closely related to whether the visitor is a crawler; a normal user produces relatively few errors.

The proportion of HEAD requests among the access types in the queue is used as a characteristic value. HEAD and GET are essentially the same, except that the response to a HEAD request contains no body data, only the HTTP header information. HEAD requests are commonly used to test the validity of links and are therefore common in crawlers, so the proportion of HEAD requests among all requests is a good indication of whether a visitor is a suspected crawler.

In the queue, the requested resources are classified and the number of accesses to each type of resource is counted; the access proportion of each resource type is then squared, and the squares are added to serve as a characteristic value. For example, if there are 20 accesses within the sliding window, a crawler's access proportions over 8 types of resources may be 0/20, 1/20, 18/20, 0/20, 1/20, 0/20, 0/20, 0/20; squaring these proportions and adding them gives a large score (326/400 in this example). For an ordinary user, the access proportions over the 8 types of resources may be 2/20, 3/20, 3/20, 2/20, 2/20, 3/20, 2/20 and 3/20; squaring and adding them gives a small score (52/400 in this example).

The visit time interval feature is used as a characteristic value; the higher it is, the higher the probability that the visitor is a suspected crawler. This characteristic value requires a time interval score. A crawler usually crawls at a high frequency in order to collect data quickly, so whether the visitor is a crawler can be judged from the time intervals. When a time interval is less than the set minimum duration, a value is added to score; when it is between the set minimum and maximum durations, score is unchanged; and when it is greater than the set maximum duration, the value is subtracted from score. The larger the final score, the larger this characteristic value.
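The first three characteristic values can be computed from the contents of a queue as sketched below. Treating any status code of 400 or above as an error response is an assumption, since the text does not define the term; the function names are hypothetical. Exact fractions are used so that the worked figures 326/400 and 52/400 can be reproduced without rounding.

```python
from collections import Counter
from fractions import Fraction


def error_ratio(statuses):
    """Proportion of error responses (assumed here: HTTP status >= 400)."""
    statuses = list(statuses)
    if not statuses:
        return Fraction(0)
    errors = sum(1 for status in statuses if status >= 400)
    return Fraction(errors, len(statuses))


def head_ratio(methods):
    """Proportion of HEAD requests among all requests in the queue."""
    methods = list(methods)
    if not methods:
        return Fraction(0)
    heads = sum(1 for method in methods if method.upper() == "HEAD")
    return Fraction(heads, len(methods))


def concentration(resource_types):
    """Sum of squared access proportions over the requested resource types."""
    resource_types = list(resource_types)
    total = len(resource_types)
    if total == 0:
        return Fraction(0)
    counts = Counter(resource_types)
    return sum(Fraction(count, total) ** 2 for count in counts.values())


# The two 20-request windows from the example (type labels are arbitrary).
crawler_window = ["B"] + ["C"] * 18 + ["E"]
user_window = [
    label
    for label, count in zip("ABCDEFGH", [2, 3, 3, 2, 2, 3, 2, 3])
    for _ in range(count)
]
```

`concentration(crawler_window)` equals 326/400 and `concentration(user_window)` equals 52/400, matching the figures in the text: requests concentrated on one resource type push the value towards 1, while evenly spread requests keep it small.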

Finally, the suspected crawlers identified in parts 1 and 2 are verified.

3. Verification of the verification code:

Verification code verification is used to confirm whether the crawler detection in the previous steps is correct. It adopts typical CAPTCHA detection: the server generates a verification page to test the user and requires the user to enter the character combination shown in a generated picture. Verifying the identified suspected crawlers with a verification code reduces misjudgment. If verification fails, the visitor is refused further access to the server and is added to the confirmed list library, which stores the blacklist of ids determined to be crawlers. However, if a visitor performs an operation that is obviously not that of a crawler, for example if an id performs an operation such as 'payment' that is obviously a normal user operation, the id is put into a 'real user list' and no crawler detection is performed on it, which reduces resource usage.
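The online decision flow, from the weighted score through verification to the two lists, might be organized as in the following sketch. All names are hypothetical, including `handle_request` and the `passes_captcha` callback, which stands in for whatever CAPTCHA mechanism the server uses. Checking the blacklist before anything else is an assumption that follows from refusing a failed visitor further access.

```python
def handle_request(visitor_id, features, weights, threshold,
                   real_users, blacklist, passes_captcha):
    """Decide whether a request is "allowed" or "blocked".

    features, weights: dicts keyed by feature name (values are assumptions).
    real_users, blacklist: sets of ids; blacklist is updated in place.
    passes_captcha: callable taking the id, True if verification succeeds.
    """
    if visitor_id in blacklist:
        return "blocked"   # already determined to be a crawler
    if visitor_id in real_users:
        return "allowed"   # obvious real user: detection is skipped
    # Weighted sum of the characteristic values gives the evaluation score.
    score = sum(weights[name] * value for name, value in features.items())
    if score <= threshold:
        return "allowed"
    # Suspected crawler: verify before banning, to reduce misjudgment.
    if passes_captcha(visitor_id):
        return "allowed"
    blacklist.add(visitor_id)
    return "blocked"
```

An id that performs an operation such as payment would be added to `real_users` elsewhere in the application, after which this function lets it through without computing a score.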

The offline identification method:

The data source for offline identification is the weblog, and identification is performed by analyzing the weblog as follows:

1. Extracting key fields, including id, url and page turning condition; the extracted url is used to analyze the type of resource requested by the user. A logged-in user and a non-logged-in user use the user account and the IP address as the id, respectively;

2. Counting the requests in a statistical period, including the total number of requests and the number of requests of each type, and taking the proportion of HEAD requests among the access types as a characteristic value; counting page turning, namely taking as a characteristic value the proportion of visits within the statistical period whose number of page turns exceeds a set number (for example, 10); for example, if 20 out of 100 visits have more than 10 page turns, 20/100 is used as the characteristic value; counting the number of accesses to each resource; then squaring the access counts of the resources and adding them to serve as a characteristic value.

3. The access counts of the resources are added to obtain a characteristic value; if the characteristic value exceeds a set threshold, the visitor is determined to be a crawler.
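Two of the offline quantities can be sketched briefly: deriving a resource type from the extracted url, and the page turning ratio. Classifying by file extension is an assumption, as the text says only that the url is used to analyze the resource type; the mapping `RESOURCE_TYPES` and both function names are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical mapping from file extension to resource type.
RESOURCE_TYPES = {
    ".html": "page", ".htm": "page",
    ".jpg": "image", ".png": "image",
    ".css": "style", ".js": "script",
}


def resource_type(url):
    """Classify a requested url by the extension of its path (an assumption)."""
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    return RESOURCE_TYPES.get(extension, "other")


def page_turn_ratio(page_turns_per_visit, limit=10):
    """Proportion of visits whose number of page turns exceeds `limit`."""
    if not page_turns_per_visit:
        return 0.0
    deep_visits = sum(1 for turns in page_turns_per_visit if turns > limit)
    return deep_visits / len(page_turns_per_visit)
```

For the example in the text, 100 visits of which 20 have more than 10 page turns give `page_turn_ratio` a value of 0.2.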

Claims

1. A web crawler interception method based on online and offline fusion, characterized by at least comprising an online identification method and an offline identification method, wherein the online identification method comprises the following steps:

(1) setting false data, and preliminarily intercepting crawlers that crawl numeric information;

(2) setting a queue space, storing the access behavior data of the visitor in the queue, and further judging whether the visitor is a crawler;

(3) for behavior identified as that of a suspected crawler in (1) and (2) above, finally determining by verification-code verification whether it is a crawler, and adding it to the list library.

2. The web crawler intercepting method based on the online-offline fusion as claimed in claim 1, wherein:

in the step (2), the queue stores the access data of each visitor, and the method specifically includes:

for each request, first extracting key fields;

if no queue exists for the user, first establishing a queue for the user, and if the user's queue is full, popping the record at the head of the queue, namely the earliest record;

and storing the latest request information in the queue.

3. The web crawler intercepting method based on online-offline fusion as claimed in claim 2, wherein:

in the step (2), further judging whether the visitor is a crawler, specifically comprising:

each access behavior comprises:

4. The web crawler intercepting method based on the online-offline fusion as recited in claim 3, wherein:

(2) initializing score to 0; setting a minimum duration and a maximum duration; traversing the time interval sequence, and adding a value to score when time is less than the set minimum duration, leaving score unchanged when time is between the set minimum duration and the set maximum duration, and subtracting the value from score when time is greater than the set maximum duration;

5. A web crawler interception method based on online and offline fusion, characterized by at least comprising an offline identification method, which comprises the following steps, the data source for offline identification being the weblog, which is analyzed:

(2) counting the requests in a statistical period, including the total number of requests and the number of requests of each type, and taking the proportion of HEAD requests among the access types as a characteristic value;

counting page turning, namely taking as a characteristic value the proportion of visits within the statistical period whose number of page turns exceeds a set number; counting how many times each specific resource type is requested within the statistical period;

counting the access times of each resource;

then, the access times of the resources are squared and added to be used as a characteristic value.

6. The web crawler intercepting method based on the online-offline fusion as claimed in claim 1, wherein:

in step (1), the data displayed on the page is parsed by CSS before being rendered to the page, and the parsing rule is custom-defined. A user obtains the real data when visiting, whereas a crawler captures the false data as it exists before parsing.