CN108337255A

CN108337255A - A phishing website detection method based on web automated testing and broad learning

Info

Publication number: CN108337255A
Application number: CN201810088364.8A
Authority: CN
Inventors: 袁巍; 聂依凡; 李浩鹏; 贾昂; 蔡明辉; 姜源
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2018-01-30
Filing date: 2018-01-30
Publication date: 2018-07-27
Anticipated expiration: 2038-01-30
Also published as: CN108337255B

Abstract

The invention discloses a phishing website detection method based on web automated testing and broad learning, belonging to the technical field of computer network security. The method first extracts traditional features from the URL and the HTML page, then uses web automated testing techniques to extract interactive features, and finally trains a broad learning model on the preprocessed training samples obtained from feature extraction, so that phishing websites are identified and detected quickly and accurately and the public's network information and property are protected.

Description

A phishing website detection method based on web automated testing and broad learning

Technical field

The invention belongs to the technical field of computer network security, and more particularly relates to a phishing website detection method based on web automated testing and broad learning.

Background

Phishing is an attack that steals sensitive information such as users' personal identity data and financial accounts by sending large volumes of deceptive spam claiming to come from banks or other well-known institutions, by publishing fake web advertisements, and by similar means. The most typical phishing attack lures the user to a carefully designed phishing website that closely resembles the target organization's website, and then captures the personal sensitive information the user enters there or tricks the user into remitting money. Because victims are rarely alerted during such an attack, phishing websites have become one of the most serious forms of cybercrime today, and phishing website detection has become one of the most active research directions in network security.

In 2016, the National Engineering Laboratory for Internet Domain Name Management Technology, established under the leadership of CNNIC, together with the international Anti-Phishing Working Group (APWG) and the Anti-Phishing Alliance of China (APAC), jointly issued the "Global Chinese Phishing Website Status Statistical Analysis Report (2016)" (hereinafter the "Report"). The data show that the number of Chinese phishing websites increased by 150.96% year on year in 2016. The main impersonated targets were Taobao, China Mobile and the major banks, and the domain names used were mainly .COM, .CC, .PW and .NET.

In the third quarter of 2017, 360 Mobile Guard intercepted phishing websites 790 million times for mobile phone users nationwide, an increase of 102.6% over the third quarter of 2016. Among the phishing websites intercepted on mobile devices, gambling and lottery sites accounted for 80.2% of the total, followed in decreasing order by fake shopping, fake recruitment, financial instruments, counterfeit drugs and phishing advertisements.

Although the number of interceptions is large, most of the intercepted websites have already existed for a long time, and the newest phishing websites are difficult to capture and block. The average life cycle of a phishing website is only 4.684 days, whereas the average time for one to be reported is 13.327 days. Phishing websites must therefore be identified and intercepted within a very short time; otherwise they threaten the public's property.

At present, phishing website identification and interception are performed by antivirus software and by the browser itself. The techniques fall into the following classes:

1. Blacklist filtering: phishing websites found by manual inspection or reported by the public are added to a blacklist. When an accessed URL (Uniform Resource Locator) is on the blacklist, the access is intercepted and a warning is issued. This approach cannot identify the newest phishing websites and requires manual verification.

2. URL feature extraction: features such as the domain name are extracted from the accessed URL. This kind of judgment is unreliable because the URL does not carry the key attributes of a phishing website, so the false-positive and missed-detection rates of such methods are relatively high.

3. Detection using various page elements as features: such methods are more accurate than the second class, but because obtaining page features takes time, their execution speed and efficiency are low.

Summary of the invention

In view of the above defects or improvement needs of the prior art, the invention provides a phishing website detection method based on web automated testing and broad learning. Its purpose is to extract traditional features from the URL and the HTML page, extract interactive features with web automated testing techniques, and train a broad learning model on the preprocessed training samples obtained from feature extraction, so that phishing websites are identified and detected quickly and accurately and the public's network information and property are protected.

To achieve the above purpose, according to one aspect of the invention, a phishing website detection method based on web automated testing and broad learning is provided, comprising the following steps:

(1) On a PC (Personal Computer), perform static feature extraction, dynamic feature extraction and interactive feature extraction on the large number of phishing websites and normal websites in a data set to form a feature vector set;

The data set consists of phishing websites and normal websites collected from the network, or is obtained directly from a network security company;

(2) Divide the feature vector set from step (1) into a training set and a validation set using k-fold cross-validation;

(3) Train the broad learning model on the training set and test it on the validation set for comparison, building a base model and optimizing the performance of the classifier;

The classifier is the model trained by the broad learning algorithm; in use, a website address is input to the classifier and the output indicates whether it is a phishing website. The performance of the classifier refers to its accuracy in identifying phishing websites; optimizing the performance means improving the recognition accuracy;

(4) Collect misclassified websites and newly included websites as a new feature vector set, and perform incremental learning with added input on the model to optimize it.

Preferably, step (1) is specifically:

(1.1) perform static feature extraction on the URL;

(1.2) use web automated testing techniques to simulate a headless browser and access the URLs in the data set;

(1.3) perform dynamic feature extraction on the page reached through the URL;

(1.4) have the simulated browser interactively click through and browse the page, and return the interactive features.

Preferably, the static features in step (1.1) include:

1. whether the URL contains an IP address;

2. whether the part of the URL's domain name from its start to the first dot is purely numeric;

3. whether the URL contains a sensitive character, such as @;

4. whether the URL's port is port 80;

5. whether the URL is shorter than 23 characters;

6. whether the URL contains keywords related to shopping or financial accounts, such as account, banking or taobao;

The above six static features are denoted <F1,F2,F3,F4,F5,F6>.

Preferably, the dynamic features in step (1.3) include:

1. whether the HTML title contains sensitive words, such as 'lottery', 'overseas gambling' or 'prize-winning';

2. whether a form is present;

3. whether the image resources are on the same domain as the original URL;

4. whether the href of each link is on the same domain as the URL; href is short for Hypertext Reference and specifies the URL of the hyperlink target;

The above four dynamic features are denoted <F7,F8,F9,F10>.

Preferably, the interactive features in step (1.4) include:

1. whether the form is strict;

2. on clicking a link, whether the link is empty;

3. on clicking a link, whether a URL redirect occurs;

The above three interactive features are denoted <F11,F12,F13>.

Preferably, step (2) is specifically:

(2.1) set the value of k;

(2.2) use k-fold cross-validation to divide the data set from step (1) into a training set and a validation set.

Preferably, step (3) is specifically:

(3.1) train the broad learning model with the feature vector set of the webpage samples in the training set from step (2) and test the classifier performance; the webpage samples are the phishing website addresses and normal website addresses in the training set;

(3.2) keep adjusting the network architecture by adding feature nodes and enhancement nodes, training and testing until the classifier reaches the expected performance, then obtain the weights of each layer and save the model.

Preferably, step (3.1) is specifically:

(3.11) initialize the number of feature windows N2, the number of feature nodes per window N1 and the number of enhancement nodes N3; randomly initialize the classifier's feature node weight matrix, and refine the feature node weights with a sparse autoencoder;

(3.12) multiply the feature vector set of the webpage samples by the weight matrix from step (3.11) to obtain the feature node matrix;

(3.13) randomly initialize the enhancement node weight matrix;

(3.14) multiply the feature node matrix from step (3.12) by the weight matrix from step (3.13) to obtain the enhancement node matrix;

(3.15) horizontally concatenate the feature node matrix from step (3.12) and the enhancement node matrix from step (3.14) to obtain the input matrix;

(3.16) compute the Moore-Penrose pseudoinverse of the input matrix from step (3.15) and multiply it by <Y> to obtain the weight matrix; <Y> is the matrix formed from the labels of the webpage samples; the label of a webpage sample indicates whether or not it is a phishing website, for example label 1 for a phishing website and label 0 for a website that is not;

(3.17) because step (2) uses k-fold cross-validation, repeat step (3.1) k times and average the k accuracies;

(3.18) gradually increase the values of N1, N2 and N3, observe whether the accuracy of the broad learning model improves, and find the optimal parameters.

Preferably, step (3.2) is specifically:

(3.21) adjust and test the model obtained in step (3.1) using the incremental learning algorithm that increases the number of feature nodes and enhancement nodes;

(3.22) repeat step (3.21) a set number of times and record the resulting test accuracy, compare the results to determine the optimal number of feature nodes and enhancement nodes, and save this optimal model.

Preferably, step (4) is specifically: collect misclassified websites and newly included websites as a new feature vector set, perform incremental learning with added input on the model, and obtain the adjusted weight matrix, thereby optimizing the model;

Preferably, in step (1) the static features of the URL are extracted with regular expressions, and PhantomJS is used to simulate a headless browser for headless UI automated testing; PhantomJS is a WebKit-based headless browser with a JavaScript API.

The invention combines URL static feature extraction, HTML dynamic feature extraction and interactive feature extraction through web automated testing. It first extracts the static features of the URL. It then performs headless UI automated testing, in which a simulated browser accesses the URL, the page source is retrieved, and dynamic features are extracted from the HTML. It then simulates link clicks and operations such as entering an account into a form and logging in, skipping page rendering so that interactive features are extracted quickly. For a new phishing website that is not on a blacklist, simulated link clicks test whether links are empty, a simulated form login tests whether the form is properly validated, and simulated link clicks test whether URL redirection occurs. These new interactive features allow phishing websites to be detected quickly and accurately.

The invention uses a broad learning model to improve recognition of new phishing websites. Broad learning is a recent machine learning method. Unlike deep learning, its architecture is shallow and its demand on computing resources is low. In addition, a deep learning model must be retrained as a whole when new samples arrive, which takes a long time, whereas a broad learning algorithm does not need to retrain the original model: it only needs to extract features from the newly added phishing website samples and use them to adjust and supplement the existing model, so detection accuracy keeps improving as the model updates itself.

In general, compared with the prior art, the above technical solution conceived by the invention can achieve the following beneficial effects:

1. According to the method provided by the invention, static features are first extracted from the URL, dynamic features are then extracted from the HTML page, and a browser simulated with web automated testing techniques then extracts interactive features from the page. The features progress from static to dynamic to interactive feedback, so feature mining goes from shallow to deep, which ensures both the quantity and the quality of the features. Finally, because a broad learning model needs few resources, trains quickly and supports incremental learning, an accurate, fast and adaptive phishing website identification technique is achieved;

2. Combining interactive features with URL static features and HTML dynamic features greatly improves the accuracy of phishing website identification and also applies to the newest phishing websites, so phishing websites can be detected and identified quickly and accurately;

3. With broad learning, features are extracted from the newly added phishing website samples and used to adjust and supplement the existing model, so the time needed to update the model is greatly reduced and the demand on computing resources is low; at the same time, the detection accuracy of broad learning keeps improving as the model updates itself.

Description of the drawings

Fig. 1 is the overall flow chart of a phishing website detection method based on web automated testing and broad learning in a preferred embodiment of the invention;

Fig. 2 is a schematic diagram of phishing website feature extraction in a preferred embodiment of the invention;

Fig. 3 is a schematic diagram of preparing the k-fold cross-validation data set for broad learning in a preferred embodiment of the invention;

Fig. 4 is a schematic diagram of training and optimizing the broad learning model in a preferred embodiment of the invention.

Detailed description of the embodiments

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.

The invention provides an intelligent phishing website detection method based on web automated testing and broad learning. Fig. 1 is the main flow chart of the invention and shows the overall flow and the relationship between the steps. The embodiment of each step is described below:

(1) Step 1: on the PC, perform static feature extraction, dynamic feature extraction and interactive click-through access on the large number of phishing websites and normal websites in the data set.

As shown in Fig. 2, step 1 is as follows:

Step 1.1: perform static feature extraction on the URL itself, covering the following six features:

1. whether the URL contains an IP address: an IP address can be used to evade domain name registration and user inspection;

2. whether the part of the URL's domain name from its start to the first dot is purely numeric: legitimate addresses rarely use purely numeric domain names; compare the legitimate Baidu address https://www.***.com/ with the phishing website http://www.030033.com/;

3. whether the URL contains a sensitive character such as @: what precedes the @ character is treated as account information and only what follows it is the real address, a trick phishing websites often use;

4. whether the URL's port is port 80: legitimate URLs are accessed through port 80, so a port other than 80 raises suspicion of a phishing website;

5. whether the URL is shorter than 23 characters: statistically, legitimate URLs usually do not exceed 23 characters;

6. whether the URL contains keywords such as account, banking or taobao: anything involving shopping or banking warrants caution;

The above six features are extracted by matching the URL against six regular expressions; for example, re.match(r'.*//(.*)/.*', url) combined with a split call extracts the domain name from the URL. The other features are extracted similarly. Each feature returns 1 if present and 0 otherwise, giving a set of six features <F1,F2,F3,F4,F5,F6>.
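The six static checks of step 1.1 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name static_features, the keyword tuple, stripping a leading "www." before testing the first label (so that the www.030033.com example fires), and treating a missing port as the scheme default are all assumptions made here. It uses urllib.parse in place of the single regular expression quoted above.

```python
import re
from typing import List
from urllib.parse import urlsplit

# Illustrative keyword list; the source names only these three examples.
KEYWORDS = ("account", "banking", "taobao")
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
DEFAULT_PORTS = {"http": 80, "https": 443}


def static_features(url: str) -> List[int]:
    """Return [F1, ..., F6] as 0/1 flags for one URL (1 = feature present)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: ignore a leading "www." when looking at the first label.
    bare = host[4:] if host.startswith("www.") else host
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)

    f1 = int(bool(IPV4.match(host)))               # F1: host is an IP address
    f2 = int(bare.split(".")[0].isdigit())         # F2: first label is purely numeric
    f3 = int("@" in url)                           # F3: contains the '@' character
    f4 = int(port == 80)                           # F4: port is 80
    f5 = int(len(url) < 23)                        # F5: shorter than 23 characters
    f6 = int(any(k in url.lower() for k in KEYWORDS))  # F6: sensitive keyword
    return [f1, f2, f3, f4, f5, f6]
```

Calling static_features on each URL of the data set yields the first six columns of the feature vector; the polarity of F4 and F5 follows the literal wording "return 1 if the feature is present".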

Step 1.2: access the URL with a headless browser driven by code, which skips rendering the browser page and saves time for the subsequent feature extraction. The code is as follows:

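from selenium import webdriver
# PhantomJS is only available in Selenium releases that still include its driver
driver = webdriver.PhantomJS()
driver.get('http://www.douyu.com/directory/all')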

Step 1.3: perform dynamic feature extraction on the page reached through the URL, using its HTML (HyperText Markup Language) source, covering the following four features:

1. whether the HTML title contains sensitive words such as 'lottery', 'overseas gambling' or 'prize-winning': some illegal phishing websites exploit users' desire for gain;

2. whether a form is present: a form is also a sensitive feature, because the ultimate purpose of a phishing website is to steal accounts and passwords;

3. whether the image resources are on the same domain as the original URL: phishing websites often steal images from legitimate websites;

4. whether the href of each link is on the same domain as the URL: phishing websites do not write their own content, so most of their links point elsewhere; href is short for Hypertext Reference and specifies the URL of the hyperlink target.

The above four features are parsed from the page source with the driver object created in step 1.2; for example, driver.find_element_by_tag_name("title").text extracts the content of the title. The other features are extracted similarly. Each feature returns 1 if present and 0 otherwise, giving a set of four features <F7,F8,F9,F10>.
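The four page-level checks can also be run on the HTML source alone, for instance on driver.page_source. The sketch below is a hedged illustration using only the Python standard library; the names dynamic_features and SENSITIVE_WORDS are hypothetical, the English word list merely stands in for the source's examples, and a page with no images or no links is counted as "same domain".

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlsplit

SENSITIVE_WORDS = ("lottery", "gambling", "prize")  # illustrative stand-ins


class _PageScan(HTMLParser):
    """Collects the title text, form presence, image sources and link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.in_title = False
        self.has_form = False
        self.img_srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def _same_host(page_url: str, targets: List[str]) -> bool:
    """True when every target, resolved against the page URL, stays on its host."""
    host = urlsplit(page_url).hostname
    return all(urlsplit(urljoin(page_url, t)).hostname == host for t in targets)


def dynamic_features(page_url: str, html: str) -> List[int]:
    """Return [F7, F8, F9, F10] as 0/1 flags for one page."""
    scan = _PageScan()
    scan.feed(html)
    f7 = int(any(w in scan.title.lower() for w in SENSITIVE_WORDS))
    f8 = int(scan.has_form)
    f9 = int(_same_host(page_url, scan.img_srcs))
    f10 = int(_same_host(page_url, scan.hrefs))
    return [f7, f8, f9, f10]
```

Resolving each src and href with urljoin before comparing hosts keeps relative links such as /about from being counted as off-domain.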

Step 1.4: interactively click through the page itself, covering the following three features:

1. whether the form is strict, that is, whether a random account and password can log in: a phishing website usually has no real database behind it, so in general any account and password will be accepted;

2. on clicking links, whether they are empty: if most links are empty links, the website is more likely to be a phishing website;

3. on clicking links, whether a URL redirect occurs: phishing websites are short-lived and often use URL redirects to jump to an IP address that has not yet been blocked;

The above three features are obtained by using the driver to simulate a browser clicking on the page in real time and extracting features from the returned values. For example, elem.send_keys(u'randomly generated account') fills the account into the form, and the password is filled in likewise; if clicking login then succeeds, the address is very likely a phishing address. The other features are extracted similarly. Each feature returns 1 if present and 0 otherwise, giving a set of three features <F11,F12,F13>;
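The decision logic of step 1.4 can be separated from the browser automation itself. The sketch below assumes a caller has already driven the browser and supplies two things: a login callable that submits a username and password and reports whether the login was accepted, and a list of click observations. ClickResult, interactive_features, the EMPTY_HREFS tuple, the "more than half" threshold and the choice that F11 = 1 means random credentials were accepted are all hypothetical choices for illustration, not taken from the source.

```python
import secrets
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import urljoin, urlsplit

# Assumed markers of an "empty" link.
EMPTY_HREFS = ("", "#", "javascript:;", "javascript:void(0)")


@dataclass
class ClickResult:
    href: str         # href attribute of the clicked link
    landed_url: str   # URL the browser was on after the click


def _is_empty(click: ClickResult) -> bool:
    return click.href.strip().lower() in EMPTY_HREFS


def interactive_features(page_url: str,
                         login: Callable[[str, str], bool],
                         clicks: List[ClickResult]) -> List[int]:
    """Return [F11, F12, F13] as 0/1 flags for one page."""
    # F11: a randomly generated account and password are accepted by the form.
    user, password = secrets.token_hex(4), secrets.token_hex(8)
    f11 = int(bool(login(user, password)))

    # F12: most clicked links are empty links.
    n_empty = sum(1 for c in clicks if _is_empty(c))
    f12 = int(len(clicks) > 0 and 2 * n_empty > len(clicks))

    # F13: some non-empty link landed on a host other than the one it names.
    f13 = int(any(
        urlsplit(urljoin(page_url, c.href)).hostname != urlsplit(c.landed_url).hostname
        for c in clicks if not _is_empty(c)
    ))
    return [f11, f12, f13]
```

In practice the login callable and the ClickResult list would be produced by driver calls such as send_keys and click; keeping them behind plain Python values makes the feature logic easy to check without a browser.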

(2) Step 2: divide the data into a training set and a validation set by k-fold cross-validation. As shown in Fig. 3, step 2 is as follows:

Step 2.1: set the value of k. The parameter k is the number of times training and testing are repeated, and the resources consumed are proportional to it;

Step 2.2: use k-fold cross-validation to separate the training set and the validation set for training the model. The original data set is divided into k parts, k-1 parts serve as the training set and 1 part as the validation set, and this is repeated k times, which improves the generalization ability of the model.
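The split of step 2.2 can be sketched with the standard library alone. The helper name k_fold_indices and the fixed seed are assumptions for illustration; the function yields index lists, so the same split can be applied to the feature matrix and to the labels.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)          # shuffle the samples once
    folds = [order[i::k] for i in range(k)]     # k nearly equal parts
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid
```

Each sample appears in exactly one validation fold, so averaging the k validation accuracies uses every sample once for testing.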

Fig. 4 is a schematic diagram of training and optimizing the broad learning model; see steps 3 and 4 below for details:

(3) Step 3: train the broad learning model and optimize it. Step 3 is as follows:

Step 3.1: train the broad learning model with the phishing website sample feature vector set obtained in step 1 and test the classifier performance, through the following steps:

Step 3.1.1: initialize the number of feature windows N2, the number of feature nodes per window N1 and the number of enhancement nodes N3. Based on repeated experiments, initialize N1*N2 = samples/600 and N3 = samples/10, where samples is the number of samples. Randomly initialize the classifier's feature node weight matrix We, and refine the feature node weights with a sparse autoencoder;

Step 3.1.2: multiply the phishing website training sample feature set X by the weight matrix We from step 3.1.1 to obtain the feature node matrix Z;

Step 3.1.3: randomly initialize the enhancement node weight matrix Wh;

Step 3.1.4: multiply the feature node matrix Z from step 3.1.2 by the weight matrix Wh from step 3.1.3 to obtain the enhancement node matrix H;

Step 3.1.5: horizontally concatenate the feature node matrix Z from step 3.1.2 and the enhancement node matrix H from step 3.1.4 to obtain the input matrix A;

Step 3.1.6: compute the Moore-Penrose pseudoinverse of the input matrix A from step 3.1.5 and multiply it by the training sample label set to obtain the weight matrix W. The code is as follows:

import numpy as np
# Ridge-regularized pseudoinverse solution; lamda is the regularization coefficient
W = np.linalg.inv(A.T.dot(A) + lamda * np.eye(A.T.shape[0])).dot(A.T.dot(train_y))

Step 3.1.7: because step 2 uses k-fold cross-validation, repeat step 3.1 k times and average the k accuracies, which strengthens the generalization ability of the classifier;

Step 3.1.8: tune the parameters N1, N2 and N3 based on experimental experience, take the highest accuracy at the peak, and save the values of these three parameters.
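Steps 3.1.1 to 3.1.6 can be put together in a few lines of NumPy. The sketch below is a simplified illustration, not the patent's code: it skips the sparse autoencoder refinement of We and uses a plain random projection, it appends a bias column to X, and it applies tanh to the enhancement nodes (a common choice in broad learning that the text above does not state). The function names bls_fit and bls_predict and the default sizes are hypothetical. np.linalg.solve is used in place of an explicit inverse; it solves the same regularized system as the expression in step 3.1.6.

```python
import numpy as np


def bls_fit(X, y, n1=5, n2=4, n3=20, lam=1e-3, seed=0):
    """Fit a minimal broad learning classifier and return (We, Wh, W)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])       # assumption: bias column
    We = rng.standard_normal((Xb.shape[1], n1 * n2))    # N2 windows of N1 feature nodes
    Z = Xb @ We                                         # feature node matrix Z
    Wh = rng.standard_normal((Z.shape[1], n3))          # enhancement node weights
    H = np.tanh(Z @ Wh)                                 # enhancement node matrix H
    A = np.hstack([Z, H])                               # input matrix A = [Z | H]
    Y = np.asarray(y, dtype=float).reshape(-1, 1)       # labels: 1 phishing, 0 normal
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return We, Wh, W


def bls_predict(params, X):
    """Return a boolean array, True where a sample is classified as phishing."""
    We, Wh, W = params
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Z = Xb @ We
    A = np.hstack([Z, np.tanh(Z @ Wh)])
    return (A @ W).ravel() >= 0.5
```

For step 3.1.7, bls_fit would be called once per fold on the training indices and bls_predict on the validation indices, with the k accuracies averaged; step 3.1.8 then repeats this for several values of n1, n2 and n3.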

Step 3.2: adjust the network architecture by increasing the number of nodes, training and testing until the classifier reaches the expected performance or the number of adjustments reaches a set limit; obtain the weights of each layer in the best case and save them;

Step 3.2 specifically includes the following processing:

Step 3.2.1: adjust and test the model obtained in step 3.1 using the incremental learning algorithm that increases the number of feature nodes and enhancement nodes;

Step 3.2.2: repeat step 3.2.1 a set number of times and record the resulting test accuracy, compare the results to determine the optimal number of feature nodes and enhancement nodes, and save this optimal model.

(4) Step 4: collect misclassified websites and newly included websites to obtain a new feature vector set, and promptly perform incremental learning with added input on the model;

Step 4 specifically includes the following processing:

4.1: extract and save features for the examples that were misclassified in step 3;

4.2: extract features for new websites and store them in a local file;

4.3: when the number of stored samples reaches a preset value, feed them together into incremental learning and adjust the weight matrix W, thereby updating the model;
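One simple way to add input samples without revisiting old data is to keep the two sums that the regularized solution depends on. The sketch below swaps in that accumulation technique; it is not the pseudoinverse update formula of the broad learning literature, although for a fixed We and Wh it gives the same W as re-solving on all samples. The class name IncrementalRidge is hypothetical, and the rows of A_new are assumed to be built from the new feature vectors with the saved We and Wh.

```python
import numpy as np


class IncrementalRidge:
    """Keeps A^T A + lam*I and A^T Y so new rows can be added without old data."""

    def __init__(self, n_columns: int, lam: float = 1e-3):
        self.gram = lam * np.eye(n_columns)      # running A^T A + lam * I
        self.rhs = np.zeros((n_columns, 1))      # running A^T Y
        self.W = np.zeros((n_columns, 1))

    def add_samples(self, A_new, y_new):
        """Fold a batch of new rows into the sums and return the updated W."""
        A_new = np.asarray(A_new, dtype=float)
        y_new = np.asarray(y_new, dtype=float).reshape(-1, 1)
        self.gram += A_new.T @ A_new
        self.rhs += A_new.T @ y_new
        self.W = np.linalg.solve(self.gram, self.rhs)
        return self.W
```

Batching the new samples until a preset count is reached, as in 4.3, means add_samples is called once per batch rather than once per website.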

In conclusion using technical solution provided by the invention, the Fishing net based on web automatic tests and width study Traditional characteristic is combined by intelligent detecting method of standing with interactive feature, and using width learning training model, consuming resource is smaller, It realizes quick self-adapted update, while ensure that the accuracy of model, can be realized in the extremely short life cycle of fishing website It precisely intercepts and hits.

The k-fold cross-validation described in the invention shuffles the samples, divides them evenly into k parts, uses k-1 parts in turn for training and the remaining part for validation, computes the sum of squared prediction errors for each round, and finally averages the k sums as the basis for selecting the optimal model structure. If there are N samples and k is set to N, this is the special case of leave-one-out.

If A is an m*n matrix, the Moore-Penrose pseudoinverse of A referred to in the invention is (A'A)^(-1)A', where A' denotes the transpose of A.
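The relation between this definition and the regularized expression used in step 3.1.6 can be written out as follows. The symbols follow the text (A for the input matrix, Y for the label matrix, W for the output weights); the regularization coefficient lambda corresponds to the variable lamda in the code, and the limit form is the standard ridge characterization of the pseudoinverse rather than something stated in the source.

```latex
A^{+} = (A^{\top}A)^{-1}A^{\top} \quad \text{(when } A^{\top}A \text{ is invertible)}, \qquad
A^{+} = \lim_{\lambda \to 0^{+}} (A^{\top}A + \lambda I)^{-1}A^{\top}, \qquad
W = (A^{\top}A + \lambda I)^{-1}A^{\top}Y .
```

A small positive lambda keeps the matrix being inverted well conditioned even when the columns of A are nearly dependent, which is common once many feature and enhancement nodes are concatenated.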

The weights referred to in the invention are the parameters of the broad learning model.

The sparse autoencoder (Sparse Autoencoder) is a technique that automatically learns features from unlabeled data and gives a feature description better than the raw data.

Those skilled in the art will readily understand that the above are only preferred embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A phishing website detection method based on web automated testing and broad learning, characterized by comprising the following steps:

(1) on a PC, performing static feature extraction, dynamic feature extraction and interactive feature extraction on the large number of phishing websites and normal websites in a website data set to form a feature vector set;

(3) training the broad learning model on the training set and testing it on the validation set for comparison, building a base model and optimizing the performance of the classifier; the performance of the classifier refers to its accuracy in identifying phishing websites;

(4) collecting misclassified websites and newly included websites as a new feature vector set, and performing incremental learning with added input on the model to optimize it.

2. The phishing website detection method based on web automated testing and broad learning according to claim 1, characterized in that step (1) is specifically:

(1.1) performing static feature extraction on the URL;

3. The phishing website detection method based on web automated testing and broad learning according to claim 2, characterized in that the static features in step (1.1) include:

1. whether the URL contains an IP address;

3. whether the URL contains a sensitive character, such as @;

4. whether the URL's port is port 80;

5. whether the URL is shorter than 23 characters;

6. whether the URL contains keywords related to shopping or financial accounts, such as account, banking or taobao.

4. The phishing website detection method based on web automated testing and broad learning according to claim 2, characterized in that the dynamic features in step (1.3) include:

1. whether the HTML title contains sensitive words, such as 'lottery', 'overseas gambling' or 'prize-winning';

2. whether a form is present;

3. whether the image resources are on the same domain as the original URL;

4. whether the href of each link is on the same domain as the URL; href is short for Hypertext Reference and specifies the URL of the hyperlink target.

5. The phishing website detection method based on web automated testing and broad learning according to claim 2, characterized in that the interactive features in step (1.4) include:

1. whether the form is strict;

2. on clicking a link, whether the link is empty;

3. on clicking a link, whether a URL redirect occurs.

6. The phishing website detection method based on web automated testing and broad learning according to claim 1, characterized in that step (2) is specifically:

(2.1) setting the value of k;

7. The phishing website detection method based on web automated testing and broad learning according to claim 1, characterized in that step (3) is specifically:

(3.1) training the broad learning model with the feature vector set of the webpage samples in the training set from step (2) and testing the classifier performance;

(3.2) adjusting the network architecture repeatedly by adding feature nodes and enhancement nodes, training and testing until the classifier reaches the expected performance, then obtaining the weights of each layer and saving the model.

8. The phishing website detection method based on web automated testing and broad learning according to claim 7, characterized in that step (3.1) is specifically:

(3.11) initializing the number of feature windows N2, the number of feature nodes per window N1 and the number of enhancement nodes N3; randomly initializing the classifier's feature node weight matrix, and refining the feature node weights with a sparse autoencoder;

(3.12) multiplying the feature vector set of the webpage samples by the weight matrix from step (3.11) to obtain the feature node matrix;

(3.13) randomly initializing the enhancement node weight matrix;

(3.14) multiplying the feature node matrix from step (3.12) by the weight matrix from step (3.13) to obtain the enhancement node matrix;

(3.15) horizontally concatenating the feature node matrix from step (3.12) and the enhancement node matrix from step (3.14) to obtain the input matrix;

(3.16) computing the Moore-Penrose pseudoinverse of the input matrix from step (3.15) and multiplying it by <Y> to obtain the weight matrix; <Y> is the matrix formed from the labels of the webpage samples; the label of a webpage sample indicates whether or not it is a phishing website;

9. The phishing website detection method based on web automated testing and broad learning according to claim 7, characterized in that step (3.2) is specifically:

(3.21) adjusting and testing the model obtained in step (3.1) using the incremental learning algorithm that increases the number of feature nodes and enhancement nodes;

(3.22) repeating step (3.21) a set number of times and recording the resulting test accuracy, comparing the results to determine the optimal number of feature nodes and enhancement nodes, and saving this optimal model.

10. The phishing website detection method based on web automated testing and broad learning according to claim 1, characterized in that step (4) is specifically: collecting misclassified websites and newly included websites as a new feature vector set, performing incremental learning with added input on the model, and obtaining the adjusted weight matrix, thereby optimizing the model.