CN108337255A - A phishing website detection method based on web automated testing and broad learning - Google Patents

A phishing website detection method based on web automated testing and broad learning

Info

Publication number
CN108337255A
Authority
CN
China
Prior art keywords
url
width
matrix
detection method
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810088364.8A
Other languages
Chinese (zh)
Other versions
CN108337255B (en
Inventor
袁巍
聂依凡
李浩鹏
贾昂
蔡明辉
姜源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201810088364.8A priority Critical patent/CN108337255B/en
Publication of CN108337255A publication Critical patent/CN108337255A/en
Application granted granted Critical
Publication of CN108337255B publication Critical patent/CN108337255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/30 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H04L63/306 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information intercepting packet switched data communications, e.g. Web, Internet or IMS communications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Technology Law (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a phishing website detection method based on web automated testing and broad learning, belonging to the technical field of computer network security. The invention first performs conventional feature extraction on the URL and the HTML page, then uses web automated testing techniques to extract interactive features, and finally trains a broad learning model on the preprocessed training samples obtained from feature extraction, so that phishing websites are identified and detected quickly and accurately and the public's network information and property are protected.

Description

A phishing website detection method based on web automated testing and broad learning
Technical field
The invention belongs to the technical field of computer network security and, more particularly, relates to a phishing website detection method based on web automated testing and broad learning.
Background
Phishing is an attack that steals sensitive information such as a user's personal identity data and financial accounts by sending large volumes of deceptive spam that claims to come from a bank or other well-known institution, by false web advertising, and by similar means. The most typical phishing attack lures the user to a carefully designed phishing website that closely resembles the target organization's website, then captures the personal sensitive information the user enters there or tricks the user into transferring money. Because victims rarely become suspicious during such an attack, phishing websites have become one of the most serious forms of cybercrime, and phishing website detection has become one of the most active research directions in network security.
In 2016, the National Engineering Laboratory for Internet Domain Name Management Technology, established under the lead of CNNIC, together with the Anti-Phishing Working Group (APWG) and the Anti-Phishing Alliance of China (APAC), jointly issued the "Global Chinese Phishing Website Status Statistical Analysis Report (2016)" (hereinafter the "Report"). The data show that the number of phishing websites in China increased by 150.96% year on year in 2016. The main impersonated targets were Taobao, China Mobile, and the major banks, and the domains used were mainly .COM, .CC, .PW, and .NET.
In the third quarter of 2017, 360 Mobile Guard intercepted phishing websites 790 million times for mobile phone users nationwide, an increase of 102.6% over the third quarter of 2016. Among the intercepted mobile phishing websites, gambling and lottery sites accounted for 80.2% of the total, followed in decreasing order by fake shopping, fake recruitment, financial instruments, counterfeit medicine, and phishing advertising.
Although many sites are intercepted, most of them have existed for a long time, and the newest phishing websites are hard to catch and block. The average life cycle of a phishing website is only 4.684 days, whereas the average reporting period is 13.327 days. Phishing websites must therefore be identified and intercepted within a very short time, or they threaten the public's property.
At present, phishing websites are identified and intercepted by antivirus software and by the browser itself. The techniques fall into the following categories:
1. Blacklist filtering: phishing websites found by manual inspection or reported by the public are added to a blacklist. When an accessed URL (Uniform Resource Locator) is on the blacklist, access is intercepted and a warning is issued. This approach cannot identify the newest phishing websites and requires manual verification.
2. URL feature extraction: features such as the domain name are extracted from the accessed URL. This kind of judgment is unreliable because the URL itself does not carry the key attributes of a phishing website, so such methods have high false-positive and miss rates.
3. Detection using page elements as features: these methods are more accurate than the second category, but obtaining page features takes time, so their execution speed and efficiency are low.
Summary of the invention
In view of the above shortcomings of, or needs for improvement in, the prior art, the invention provides a phishing website detection method based on web automated testing and broad learning. Its purpose is to perform conventional feature extraction on the URL and the HTML page, to extract interactive features using web automated testing techniques, and to train a broad learning model on the preprocessed training samples obtained from feature extraction, so that phishing websites are identified and detected quickly and accurately and the public's network information and property are protected.
To achieve the above object, according to one aspect of the invention, a phishing website detection method based on web automated testing and broad learning is provided, comprising the following steps:
(1) On a PC (Personal Computer), perform static feature extraction, dynamic feature extraction, and interactive feature extraction on a large number of phishing websites and normal websites in a data set to form a feature vector set;
The data set consists of phishing websites and normal websites collected from the network, or is obtained directly from a network security company;
(2) Divide the feature vector set from step (1) into a training set and a validation set using k-fold cross-validation;
(3) Train the broad learning model on the training set and test it against the validation set, building a base model and optimizing the performance of the classifier;
The classifier is the model trained by the broad learning algorithm; in use, a web address is input to the classifier, which outputs whether the address is a phishing website. The performance of the classifier is the accuracy with which it identifies phishing websites; optimizing performance means improving that accuracy;
(4) Collect misclassified websites and newly recorded websites as a new feature vector set and perform incremental learning with added inputs on the model to optimize it.
Preferably, step (1) is specifically:
(1.1) Perform static feature extraction on the URL;
(1.2) Simulate a headless browser using web automated testing techniques and access each URL in the data set;
(1.3) Perform dynamic feature extraction on the page reached through the URL;
(1.4) Have the simulated browser click through the page interactively and return the interactive features.
Preferably, the static features in step (1.1) include:
1. whether the URL contains an IP address;
2. whether the part of the URL's domain name from its start to the first dot is purely numeric;
3. whether the URL contains a sensitive character, such as @;
4. whether the URL's port is port 80;
5. whether the length of the URL is less than 23 characters;
6. whether the URL contains keywords related to shopping or financial accounts, such as account, banking, or taobao;
The above six static features are denoted <F1,F2,F3,F4,F5,F6>.
Preferably, the dynamic features in step (1.3) include:
1. whether the HTML title contains sensitive words, such as 'lottery', 'overseas gambling', or 'prize-winning';
2. whether a form is present;
3. whether the image resources are under the same domain name as the original URL;
4. whether the href of each link is under the same domain name as the URL; href is short for Hypertext Reference and specifies the URL of the hyperlink target;
The above four dynamic features are denoted <F7,F8,F9,F10>.
Preferably, the interactive features in step (1.4) include:
1. whether the form is rigorous;
2. when links are clicked, whether they are empty;
3. when links are clicked, whether a URL redirect occurs;
The above three interactive features are denoted <F11,F12,F13>.
Preferably, step (2) is specifically:
(2.1) Set the value of k;
(2.2) Divide the data set from step (1) into a training set and a validation set using k-fold cross-validation.
Preferably, step (3) is specifically:
(3.1) Train the broad learning model on the feature vector set of the web page samples in the training set from step (2) and test the classifier's performance; the web page samples are the phishing website addresses and normal website addresses in the training set;
(3.2) Adjust the network architecture repeatedly by adding feature nodes and enhancement nodes, training and testing until the classifier reaches the expected performance; obtain the weights of each layer and save the model.
Preferably, step (3.1) is specifically:
(3.11) Initialize the number of feature windows N2, the number of feature nodes per window N1, and the number of enhancement nodes N3; randomly initialize the feature node weight matrix of the classifier model and process the feature node weights with a sparse autoencoder;
(3.12) Multiply the feature vector set of the web page samples by the weight matrix obtained in step (3.11) to obtain the feature node matrix;
(3.13) Randomly initialize the enhancement node weight matrix;
(3.14) Multiply the feature node matrix obtained in step (3.12) by the weight matrix obtained in step (3.13) to obtain the enhancement node matrix;
(3.15) Concatenate the feature node matrix obtained in step (3.12) and the enhancement node matrix obtained in step (3.14) horizontally, row by row, to obtain the input matrix;
(3.16) Compute the Moore-Penrose pseudoinverse of the input matrix obtained in step (3.15) and multiply it by <Y> to obtain the weight matrix; <Y> is the matrix formed from the labels of the web page samples; a label indicates whether the sample is a phishing website, for example, label 1 means it is a phishing website and label 0 means it is not;
(3.17) Because step (2) uses k-fold cross-validation, repeat step (3.1) k times and average the k accuracy values;
(3.18) Gradually increase the values of N1, N2, and N3, observe whether the accuracy of the broad learning model improves, and find the optimal parameters.
Preferably, step (3.2) is specifically:
(3.21) Adjust and test the model obtained in step (3.1) using the incremental learning algorithm that adds feature nodes and enhancement nodes;
(3.22) Repeat step (3.21) a set number of times, record the test accuracy obtained, determine the optimal numbers of feature nodes and enhancement nodes by comparison, and save this optimal model.
Preferably, step (4) is specifically: collect misclassified websites and newly recorded websites as a new feature vector set and perform incremental learning with added inputs on the model to obtain an adjusted weight matrix, thereby optimizing the model;
Preferably, in step (1) the static features of the URL are extracted with regular expressions, and PhantomJS is used to simulate a headless browser for headless UI automated testing; PhantomJS is a WebKit-based JavaScript API and a headless browser.
The invention combines URL static feature extraction, HTML dynamic feature extraction, and interactive feature extraction based on web automated testing. It first extracts the static features of the URL. It then runs headless UI automated testing in which a simulated browser accesses the URL and retrieves the page source, and the dynamic features are extracted from the HTML. It simulates link clicks and simulates operations such as entering an account in a form and logging in, skipping page rendering so that interactive features are extracted quickly. For new phishing websites that are not on a blacklist, it simulates clicking links to test whether they are empty, simulates submitting the login form to test whether the form is properly validated, and simulates clicking links to test whether a URL redirect occurs. These new interactive features allow phishing websites to be detected quickly and accurately.
The invention uses a broad learning model to improve recognition of new phishing websites. Broad learning is a new machine learning method and approach; unlike deep learning, its architecture has few layers and its demand for computing resources is low. In addition, when deep learning receives new samples the whole model must be retrained, which takes a great deal of time, whereas the broad learning algorithm does not need to retrain the original model: it only needs to extract features from the newly added phishing website samples and use them to adjust and supplement the existing model, so detection accuracy keeps improving as the model updates itself.
In general, compared with the prior art, the above technical solutions conceived by the invention can achieve the following beneficial effects:
1. According to the method provided by the invention, static features are first extracted from the URL, dynamic features are then extracted from the HTML page, and interactive features are then extracted from the page by a simulated browser using web automated testing techniques. The features progress from static to dynamic to interactive feedback, so feature mining goes from shallow to deep, which ensures the quantity and quality of the features. Finally, because a broad learning model needs few resources, trains quickly, and supports incremental learning, the method achieves accurate, fast, and adaptive identification of phishing websites;
2. Combining interactive features with URL static features and HTML dynamic features greatly improves the accuracy of phishing website identification, applies to the newest phishing websites, and allows phishing websites to be detected and identified quickly and accurately;
3. Broad learning extracts features from newly added phishing website samples and uses them to adjust and supplement the existing model, which greatly reduces the time needed to update the model and keeps the demand for computing resources low; at the same time, the detection accuracy of broad learning keeps improving as the model updates itself.
Description of the drawings
Fig. 1 is the overall flowchart of a phishing website detection method based on web automated testing and broad learning in a preferred embodiment of the invention;
Fig. 2 is a schematic diagram of phishing website feature extraction in a preferred embodiment of the invention;
Fig. 3 is a schematic diagram of preparing the k-fold cross-validation data set for broad learning in a preferred embodiment of the invention;
Fig. 4 is a schematic diagram of training and optimizing the broad learning model in a preferred embodiment of the invention.
Detailed description
To make the purpose, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.
The invention provides an intelligent phishing website detection method based on web automated testing and broad learning. Fig. 1 is the main flowchart of the invention and shows the overall flow and the relationships between the steps. The embodiment of each step is described in detail below:
(1) Step 1: on a PC, perform static feature extraction, dynamic feature extraction, and interactive click-through access on a large number of phishing websites and normal websites in the data set.
As shown in Fig. 2, step 1 is as follows:
Step 1.1: perform static feature extraction on the URL itself, covering the following six features:
1. Whether the URL contains an IP address: an IP address can be used to avoid domain name registration and inspection by the user;
2. Whether the part of the URL's domain name from its start to the first dot is purely numeric: legitimate addresses rarely use pure digits in the domain name; compare the legitimate address of Baidu, https://www.***.com/, with the phishing website http://www.030033.com/;
3. Whether the URL contains a sensitive character such as @: the part before the @ character is an account, and only the part after it is the real address; phishing websites often use this trick;
4. Whether the URL's port is port 80: legitimate URLs are accessed through port 80, and a port other than 80 suggests a phishing website;
5. Whether the length of the URL is less than 23 characters: according to statistics, legitimate URLs usually do not exceed 23 characters;
6. Whether the URL contains keywords such as account, banking, or taobao: URLs that involve shopping or banking call for vigilance;
The above six features are extracted by matching the URL against six regular expressions; for example, re.match(r'.*//(.*)/.*', url) combined with split(' ') extracts the domain name from the URL. The other features are extracted in the same way. If a feature is present, 1 is returned, otherwise 0, giving six features in total, <F1,F2,F3,F4,F5,F6>.
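The six static checks of step 1.1 can be sketched with the Python standard library. This is a minimal illustration, not the patent's own implementation: it parses the URL with urllib.parse instead of six separate regular expressions, the function name static_features and the keyword tuple are hypothetical, and it makes three assumptions that the text leaves open: each flag is 1 when the condition holds exactly as worded in the list above, a leading "www" label is skipped for F2 (consistent with the www.030033.com example), and a URL with no explicit port counts as port 80.

```python
import re
from urllib.parse import urlsplit

# Hypothetical keyword list; the text names only these three examples.
SENSITIVE_KEYWORDS = ("account", "banking", "taobao")
IPV4_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def static_features(url):
    """Return [F1..F6] as 0/1 flags, each worded as in step 1.1."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Assumption: skip a leading "www" so www.030033.com is judged on "030033".
    first = labels[1] if labels[0] == "www" and len(labels) > 1 else labels[0]
    # Assumption: no explicit port is treated as port 80.
    port = parts.port if parts.port is not None else 80

    f1 = int(bool(IPV4_RE.fullmatch(host)))      # host is a raw IPv4 address
    f2 = int(first.isdigit())                    # leading label is purely numeric
    f3 = int("@" in url)                         # sensitive character '@' present
    f4 = int(port == 80)                         # port is 80
    f5 = int(len(url) < 23)                      # shorter than 23 characters
    f6 = int(any(k in url.lower() for k in SENSITIVE_KEYWORDS))
    return [f1, f2, f3, f4, f5, f6]
```

Because the classifier learns a weight for every flag, the polarity of an individual flag matters less than applying the same convention to every sample.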
Step 1.2: access the URL in code through a headless browser, skipping the rendering of the browser page to save time for the feature extraction that follows. The code is as follows:
from selenium import webdriver
# webdriver.PhantomJS is available in Selenium 3.x; Selenium 4 removed it
driver = webdriver.PhantomJS()
driver.get('http://www.douyu.com/directory/all')
Step 1.3: from the HTML (HyperText Markup Language) source, perform dynamic feature extraction on the page reached through the URL, covering the following four features:
1. Whether the HTML title contains sensitive words such as 'lottery', 'overseas gambling', or 'prize-winning': some illegal phishing websites exploit users' desire for gain;
2. Whether a form is present: a form is also a sensitive feature, because the ultimate aim of a phishing website is to steal accounts and passwords;
3. Whether the image resources are under the same domain name as the original URL: phishing websites often steal images from legitimate websites;
4. Whether the href of each link is under the same domain name as the URL: phishing websites do not write their own content, so most of their links are external; href is short for Hypertext Reference and specifies the URL of the hyperlink target.
The above four features are parsed from the page source with the driver object from step 1.2; for example, driver.find_element_by_tag_name("title").text extracts the content of the title. The other features are extracted in the same way. If a feature is present, 1 is returned, otherwise 0, giving four features in total, <F7,F8,F9,F10>.
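The four dynamic checks of step 1.3 can also be computed from the page source alone, for example from the string a browser driver returns as its page source. The sketch below uses the standard library's html.parser rather than driver element lookups, so it is a substitute technique, not the patent's code. The names PageScanner and dynamic_features are hypothetical, the English title words stand in for the sensitive words quoted above, and F9 and F10 are set to 1 when every image or link stays on the page's own host, following the wording "whether ... same domain name".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hypothetical English stand-ins for the sensitive title words.
SENSITIVE_TITLE_WORDS = ("lottery", "gambling", "prize")


class PageScanner(HTMLParser):
    """Collect the title text, form presence, image sources and link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.has_form = False
        self.img_srcs = []
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def dynamic_features(page_url, html):
    """Return [F7..F10] as 0/1 flags for one page."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    host = urlsplit(page_url).hostname

    def all_same_host(refs):
        # Relative references are resolved against the page URL first.
        return int(all(urlsplit(urljoin(page_url, r)).hostname == host for r in refs))

    f7 = int(any(w in scanner.title.lower() for w in SENSITIVE_TITLE_WORDS))
    f8 = int(scanner.has_form)
    f9 = all_same_host(scanner.img_srcs)
    f10 = all_same_host(scanner.hrefs)
    return [f7, f8, f9, f10]
```

A page with no images or links yields 1 for F9 and F10 under this convention, since no reference leaves the host.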
Step 1.4: interactively click through the page itself, covering the following three features:
1. Whether the form is rigorous, that is, whether a random account and password can log in: a phishing website usually has no real user database, so in general any account and password will be accepted;
2. When links are clicked, whether they are empty: if most links are empty links, the website is more likely to be a phishing website;
3. When links are clicked, whether a URL redirect occurs: phishing websites are short-lived and often use URL redirects to jump to an IP address that has not yet been blocked;
The above three features are extracted from return values while the driver simulates a browser clicking through the page in real time; for example, elem.send_keys(u'randomly generated account') fills in the account field of the form, and the password is filled in the same way. If clicking the login button then succeeds, the address is very likely a phishing website. The other features are extracted in the same way. If a feature is present, 1 is returned, otherwise 0, giving three features in total, <F11,F12,F13>;
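Once the simulated browser has recorded what happened during the clicks, turning those observations into F11 to F13 is a small pure function. The sketch below assumes the recording has already been done elsewhere. ClickResult, interactive_features, and the empty_ratio threshold are hypothetical names, hrefs are assumed to be absolute, "most links are empty" is read as more than half, and a redirect is read as landing on a host other than the one the href names.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class ClickResult:
    href: str         # absolute href of the clicked link ("" or "#" when empty)
    landed_url: str   # address the browser shows after the click


def is_empty_link(href):
    h = href.strip().lower()
    return h in ("", "#") or h.startswith("javascript:")


def interactive_features(login_rejected, clicks, empty_ratio=0.5):
    """Return [F11, F12, F13] from recorded interaction outcomes.

    login_rejected: True when random credentials were refused (rigorous form).
    clicks: list of ClickResult, one per simulated link click.
    """
    empty = [c for c in clicks if is_empty_link(c.href)]
    real = [c for c in clicks if not is_empty_link(c.href)]

    f11 = int(login_rejected)
    f12 = int(len(clicks) > 0 and len(empty) / len(clicks) > empty_ratio)
    f13 = int(any(
        urlsplit(c.href).hostname != urlsplit(c.landed_url).hostname
        for c in real
    ))
    return [f11, f12, f13]
```

Keeping the browser automation separate from the scoring makes the scoring easy to test without a live page.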
(2) Step 2: assign the training set and validation set by k-fold cross-validation. As shown in Fig. 3, step 2 is as follows:
Step 2.1: set the value of k. The parameter k is the number of times training and testing are repeated, and the resources consumed are proportional to it;
Step 2.2: use k-fold cross-validation to separate the training set and the validation set for training the model. The original data set is divided into k parts, of which k-1 parts serve as the training set and 1 part as the validation set; this is repeated k times to improve the generalization ability of the model.
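The split in step 2.2 can be illustrated with a short generator. This is a sketch using only the standard library; the function name k_fold_splits and the seed parameter are hypothetical, and libraries such as scikit-learn provide equivalent utilities.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield k (train_indices, validation_indices) pairs.

    The indices are shuffled once and cut into k near-equal folds; each fold
    serves as the validation set exactly once while the other k-1 folds form
    the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation
```

Training once per pair and averaging the k accuracy values gives the figure used later when the model parameters are tuned.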
Fig. 4 is a schematic diagram of training and optimizing the broad learning model; see steps 3 and 4 below for details:
(3) Step 3: train the broad learning model and optimize it. Step 3 is as follows:
Step 3.1: train the broad learning model on the phishing website sample feature vector set obtained in step 1 and test the classifier's performance, through the following steps:
Step 3.1.1: initialize the number of feature windows N2, the number of feature nodes per window N1, and the number of enhancement nodes N3. Based on repeated experiments, initialize N1*N2 = samples/600 and N3 = samples/10, where samples is the number of samples. Randomly initialize the feature node weight matrix We of the classifier model and process the feature node weights with a sparse autoencoder;
Step 3.1.2: multiply the phishing website training sample feature set X by the weight matrix We obtained in step 3.1.1 to obtain the feature node matrix Z;
Step 3.1.3: randomly initialize the enhancement node weight matrix Wh;
Step 3.1.4: multiply the feature node matrix Z obtained in step 3.1.2 by the weight matrix Wh obtained in step 3.1.3 to obtain the enhancement node matrix H;
Step 3.1.5: concatenate the feature node matrix Z obtained in step 3.1.2 and the enhancement node matrix H obtained in step 3.1.4 horizontally, row by row, to obtain the input matrix A;
Step 3.1.6: compute the Moore-Penrose pseudoinverse of the input matrix A obtained in step 3.1.5 and multiply it by the training sample label set to obtain the weight matrix W. The code is as follows:
import numpy as np
# Regularized pseudoinverse: W = (A^T A + lamda*I)^(-1) A^T train_y
W = np.linalg.inv(A.T.dot(A) + lamda * np.eye(A.T.shape[0])).dot(A.T.dot(train_y))
Step 3.1.7: because step 2 uses k-fold cross-validation, repeat step 3.1 k times and average the k accuracy values, which strengthens the generalization ability of the classifier;
Step 3.1.8: tune the parameters N1, N2, and N3 based on experimental experience, take the highest accuracy at the peak, and save the values of these three parameters.
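Steps 3.1.2 to 3.1.6 can be summarized in matrix form. The block below is a sketch under stated assumptions: n is the number of training samples, 13 is the feature count F1 to F13, λ stands for the lamda regularization term that appears in the code of step 3.1.6, and the products are written without activation functions because the text describes plain matrix multiplication (broad learning implementations commonly apply a nonlinearity to the enhancement nodes).

```latex
\begin{aligned}
Z &= X W_e, & X &\in \mathbb{R}^{n \times 13},\quad W_e \in \mathbb{R}^{13 \times N_1 N_2} \\
H &= Z W_h, & W_h &\in \mathbb{R}^{N_1 N_2 \times N_3} \\
A &= [\, Z \mid H \,], & A &\in \mathbb{R}^{n \times (N_1 N_2 + N_3)} \\
W &= A^{+} Y \approx (A^{\top} A + \lambda I)^{-1} A^{\top} Y, & Y &\in \mathbb{R}^{n \times 1} \\
\hat{Y} &= A W
\end{aligned}
```

Only W is solved for; W_e and W_h stay fixed after initialization, which is why training reduces to one linear solve.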
Step 3.2: adjust the network architecture by increasing the number of nodes in the matrices, training and testing until the classifier reaches the expected performance or the number of adjustments reaches a set limit; obtain the weights of each layer in the best case and save them;
Step 3.2 comprises the following:
Step 3.2.1: adjust and test the model obtained in step 3.1 using the incremental learning algorithm that adds feature nodes and enhancement nodes;
Step 3.2.2: repeat step 3.2.1 a set number of times, record the test accuracy obtained, determine the optimal numbers of feature nodes and enhancement nodes by comparison, and save this optimal model.
(4) Step 4: collect misclassified websites and newly recorded websites as a new feature vector set and promptly perform incremental learning with added inputs on the model;
Step 4 comprises the following:
4.1: extract and save the features of the examples that were misclassified in step 3;
4.2: extract the features of new websites and save them to a local file;
4.3: when the number of saved samples reaches a preset value, feed them together into incremental learning and adjust the weight matrix W, thereby updating the model;
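The batching behaviour of steps 4.1 to 4.3 can be sketched as a small buffer that collects feature vectors and hands them to an update routine once a preset count is reached. The class name IncrementalBuffer and the update_fn callback are hypothetical; the callback stands in for whatever routine adjusts the weight matrix W, which this sketch does not implement.

```python
class IncrementalBuffer:
    """Collect new (features, label) pairs and release them in batches."""

    def __init__(self, threshold, update_fn):
        self.threshold = threshold   # preset sample count from step 4.3
        self.update_fn = update_fn   # hypothetical callback that adjusts W
        self.samples = []

    def add(self, features, label):
        """Store one sample; return True if this call triggered an update."""
        self.samples.append((features, label))
        if len(self.samples) >= self.threshold:
            batch, self.samples = self.samples, []
            self.update_fn(batch)
            return True
        return False
```

Misclassified examples from step 4.1 and new websites from step 4.2 would both be passed to add, so a single threshold governs when the model is refreshed.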
In summary, with the technical solution provided by the invention, the intelligent phishing website detection method based on web automated testing and broad learning combines conventional features with interactive features and trains the model by broad learning. It consumes few resources, updates quickly and adaptively, and maintains the accuracy of the model, so that phishing websites can be intercepted precisely within their very short life cycle.
The k-fold cross-validation used in the invention shuffles the samples, divides them evenly into k parts, and in turn trains on k-1 parts and validates on the remaining part, computing the sum of squared prediction errors each time; finally, the k sums of squared prediction errors are averaged and used as the basis for selecting the optimal model structure. If there are N samples and k is set to N, the method becomes leave-one-out.
If A is an m*n matrix, the Moore-Penrose pseudoinverse of A in the invention refers to (A'A)^(-1) A', where A' denotes the transpose of A.
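The reason a pseudoinverse yields the output weights is that it solves the least-squares fit of the labels. A short derivation, assuming A has full column rank so that the inverse exists, and writing Y for the label matrix and W for the weight matrix:

```latex
\min_{W} \lVert A W - Y \rVert_2^2
\;\Longrightarrow\; A^{\top} A \, W = A^{\top} Y
\;\Longrightarrow\; W = (A^{\top} A)^{-1} A^{\top} Y = A^{+} Y
```

When A^T A is ill-conditioned, adding a small multiple of the identity before inverting, as the code in the embodiment does with its lamda term, keeps the solve numerically stable.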
The weights in the invention are the parameters of the broad learning model.
The sparse autoencoder (Sparse Autoencoder) is a technique that learns features automatically from unlabeled data and produces a better feature description than the raw data.
Those skilled in the art will readily understand that the above are only preferred embodiments of the invention and do not limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A phishing website detection method based on web automated testing and width learning, characterized by comprising the following steps:
(1) on the PC side, performing static feature extraction, dynamic feature extraction, and interactive feature extraction on the large number of phishing websites and normal websites in a website data set to form a feature vector set;
(2) dividing the feature vector set of step (1) into a training set and a validation set using the k-fold cross-validation method;
(3) performing width learning training with the training set and test comparison with the validation set, building a basic model and optimizing the performance of the classifier, where the performance of the classifier refers to the accuracy with which the classifier identifies phishing websites;
(4) collecting misjudged websites and newly included websites as a new feature vector set, and performing incremental learning with added input on the model to optimize the model.
2. The phishing website detection method based on web automated testing and width learning according to claim 1, characterized in that step (1) specifically comprises:
(1.1) performing static feature extraction on the url;
(1.2) using web automated testing technology to simulate a headless browser and access the urls in the data set;
(1.3) performing dynamic feature extraction on the page reached through the url;
(1.4) having the simulated browser click through and browse the page interactively, and returning the interactive features.
3. The phishing website detection method based on web automated testing and width learning according to claim 2, characterized in that the static features of step (1.1) comprise:
1. whether the url contains an ip address;
2. whether the part of the url's domain name from its start to the first dot is purely numeric;
3. whether the url contains sensitive characters, such as;
4. whether the url port is port 80;
5. whether the length of the url is less than 23 characters;
6. whether the url contains keywords related to shopping or property accounts, such as account, banking, taobao.
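The six static checks above could be computed from a url string roughly as follows. This is a hedged sketch using only the Python standard library. The function name `static_url_features` is hypothetical. The set of sensitive characters is not given in the text, so it is left as a caller-supplied parameter, and the keyword list contains only the three examples named in the claim. Treating "contains an ip address" as "the host is a dotted IPv4 address" is also an assumption.

```python
import re
from urllib.parse import urlsplit

# Only the three example keywords named in the claim; a real list would be longer.
EXAMPLE_KEYWORDS = ("account", "banking", "taobao")

def static_url_features(url, sensitive_chars, keywords=EXAMPLE_KEYWORDS):
    """Return the six static features as a list of 0/1 values.

    sensitive_chars is supplied by the caller because the source text
    does not list the characters it has in mind.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    first_label = host.split(".")[0]
    try:
        port = parts.port
    except ValueError:          # malformed port in the url
        port = None
    if port is None:            # fall back to the scheme's default port
        port = {"http": 80, "https": 443}.get(parts.scheme)
    return [
        int(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None),  # 1. ip address as host
        int(first_label.isdigit()),                                     # 2. numeric first label
        int(any(c in url for c in sensitive_chars)),                    # 3. sensitive characters
        int(port == 80),                                                # 4. port 80
        int(len(url) < 23),                                             # 5. shorter than 23 characters
        int(any(k in url.lower() for k in keywords)),                   # 6. shopping/account keywords
    ]
```

Each url then contributes six binary entries to its feature vector; the dynamic and interactive features would be appended to these.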
4. The phishing website detection method based on web automated testing and width learning according to claim 2, characterized in that the dynamic features of step (1.3) comprise:
1. whether the html title contains sensitive words, such as 'lottery ticket', 'overseas gambling', 'prize-winning';
2. whether a form exists;
3. whether the image resources are under the same domain name as the original url;
4. whether the href of each link is under the same domain name as the url; href is short for Hypertext Reference and specifies the url of the hyperlink target.
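The four dynamic checks above can be illustrated on an already-fetched page with the standard-library HTML parser. This is a sketch under stated assumptions: the names `dynamic_features` and `_PageScanner` are hypothetical, "same domain name" is simplified to an exact host match, and the page source is passed in as a string rather than obtained from the headless browser of step (1.2).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class _PageScanner(HTMLParser):
    """Collect the title text, form presence, image sources and link targets."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.has_form = False
        self.img_srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def _same_host(page_url, links):
    """True if every link resolves to the page's own host (vacuously true if there are none)."""
    host = urlsplit(page_url).hostname
    return all(urlsplit(urljoin(page_url, link)).hostname == host for link in links)

def dynamic_features(page_url, html, sensitive_words):
    """Return the four dynamic features as a list of 0/1 values."""
    scanner = _PageScanner()
    scanner.feed(html)
    return [
        int(any(w in scanner.title for w in sensitive_words)),  # 1. sensitive title
        int(scanner.has_form),                                  # 2. form present
        int(_same_host(page_url, scanner.img_srcs)),            # 3. images on same host
        int(_same_host(page_url, scanner.hrefs)),               # 4. links on same host
    ]
```

Relative links are resolved against the page url before comparison, so `/logo.png` counts as same-host. Non-http targets such as `javascript:` links have no host and are counted as off-domain in this simplification.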
5. The phishing website detection method based on web automated testing and width learning according to claim 2, characterized in that the interactive features of step (1.4) comprise:
1. whether the form is rigorous;
2. when a link is clicked, whether the link is empty;
3. when a link is clicked, whether the url is redirected.
6. The phishing website detection method based on web automated testing and width learning according to claim 1, characterized in that step (2) specifically comprises:
(2.1) setting the value of k;
(2.2) dividing the data set of step (1) into a training set and a validation set using the k-fold cross-validation method.
7. The phishing website detection method based on web automated testing and width learning according to claim 1, characterized in that step (3) specifically comprises:
(3.1) training the width learning model with the feature vector set of the webpage samples in the training set of step (2) and testing the classifier performance;
(3.2) continually adjusting the network structure by adding feature nodes and enhancement nodes, then training and testing until the classifier reaches the expected performance, obtaining the weight information of each layer and saving the model.
8. The phishing website detection method based on web automated testing and width learning according to claim 7, characterized in that step (3.1) specifically comprises:
(3.11) initializing the number of feature windows N2, the number of feature nodes per window N1, and the number of enhancement nodes N3; randomly initializing the feature node weight matrix of the classifier model, and processing the feature node weights with a sparse autoencoder;
(3.12) matrix-multiplying the feature vector set of the webpage samples by the weight matrix obtained in step (3.11) to obtain the feature node matrix;
(3.13) randomly initializing the enhancement node weight matrix;
(3.14) multiplying the feature node matrix obtained in step (3.12) by the weight matrix obtained in step (3.13) to obtain the enhancement node matrix;
(3.15) concatenating the feature node matrix obtained in step (3.12) and the enhancement node matrix obtained in step (3.14) horizontally, column by column, to obtain the input matrix;
(3.16) computing the plus-sign generalized inverse of the input matrix obtained in step (3.15) and matrix-multiplying it by <Y> to obtain the weight matrix, where <Y> is the matrix formed from the labels of the webpage samples, and the label of a webpage sample indicates whether or not it is a phishing website;
(3.17) since step (2) uses k-fold cross-validation, repeating step (3.1) k times and averaging the k accuracies;
(3.18) gradually increasing the values of N1, N2, and N3, observing whether the accuracy of the width model improves, and finding the optimal parameters.
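Steps (3.11) to (3.16) can be sketched in NumPy as below. This is an illustrative simplification, not the patented implementation: the sparse-autoencoder refinement of the feature node weights in step (3.11) is omitted, the bias columns and the tanh activation on the enhancement nodes are assumptions borrowed from the usual broad learning formulation, and the names `train_bls`, `predict_bls`, and `_nodes` are hypothetical.

```python
import numpy as np

def _nodes(X, W_feat, W_enh):
    """Build the input matrix [Z | H] from samples X and the two random weight matrices."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # bias column (assumption)
    Z = Xb @ W_feat                                 # (3.12) feature node matrix
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])
    H = np.tanh(Zb @ W_enh)                         # (3.14) enhancement nodes, tanh assumed
    return np.hstack([Z, H])                        # (3.15) horizontal concatenation

def train_bls(X, Y, n1, n2, n3, seed=0):
    """n1: feature nodes per window, n2: feature windows, n3: enhancement nodes."""
    rng = np.random.default_rng(seed)
    # (3.11) random feature node weights; sparse-autoencoder processing omitted here
    W_feat = rng.standard_normal((X.shape[1] + 1, n1 * n2))
    # (3.13) random enhancement node weights
    W_enh = rng.standard_normal((n1 * n2 + 1, n3))
    A = _nodes(X, W_feat, W_enh)
    # (3.16) output weights = pseudo-inverse of the input matrix times the label matrix
    W_out = np.linalg.pinv(A) @ Y
    return {"W_feat": W_feat, "W_enh": W_enh, "W_out": W_out}

def predict_bls(model, X):
    """Return one score per class; the largest score gives the predicted label."""
    return _nodes(X, model["W_feat"], model["W_enh"]) @ model["W_out"]
```

With one-hot labels, `predict_bls(model, X).argmax(axis=1)` gives the predicted class. Steps (3.17) and (3.18) would wrap `train_bls` in the k-fold loop and repeat it for increasing values of N1, N2, and N3, keeping the setting with the best averaged accuracy.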
9. The phishing website detection method based on web automated testing and width learning according to claim 7, characterized in that step (3.2) specifically comprises:
(3.21) adjusting and testing the model obtained in step (3.1) using the incremental learning algorithm that increases the number of feature nodes and the number of enhancement nodes;
(3.22) repeating step (3.21) a set number of times and recording the resulting test accuracies, comparing them to determine the optimal number of feature nodes and number of enhancement nodes, and saving this optimal model.
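One way to add nodes without re-solving from scratch is to update the pseudo-inverse block-wise when new node columns are appended to the input matrix. The sketch below shows that update. The text does not spell out its update formula, so this is an assumption based on the standard partitioned pseudo-inverse identity, and it further assumes the new columns are linearly independent of the existing ones. The function name `add_columns` is hypothetical.

```python
import numpy as np

def add_columns(A, A_pinv, W, Y, H_new):
    """Append node columns H_new to the input matrix A and update pinv(A) and the weights.

    Assumes W == A_pinv @ Y and that the columns of H_new are linearly
    independent of the columns of A.
    """
    D = A_pinv @ H_new
    C = H_new - A @ D                   # part of H_new outside the column space of A
    B_T = np.linalg.pinv(C)             # small pseudo-inverse of the new part only
    A_pinv_new = np.vstack([A_pinv - D @ B_T, B_T])
    W_new = np.vstack([W - D @ (B_T @ Y), B_T @ Y])
    return np.hstack([A, H_new]), A_pinv_new, W_new
```

Step (3.22) would then call such an update repeatedly, record the test accuracy after each addition, and keep the node counts that scored best.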
10. The phishing website detection method based on web automated testing and width learning according to claim 1, characterized in that step (4) specifically comprises: collecting misjudged websites and newly included websites as a new feature vector set, and performing incremental learning with added input on the model to obtain the adjusted weight matrix, thereby optimizing the model.
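The incremental learning with added input in step (4) can be illustrated by a row-wise update of the pseudo-inverse and the output weights when new samples arrive. This is a hedged sketch, not the patent's stated formula: it covers only the case where the existing input matrix has full column rank (more samples than nodes), in which the update reduces to recursive least squares. The function name `add_samples` is hypothetical, and `A_rows` stands for the node matrix computed from the new feature vectors with the saved random weights.

```python
import numpy as np

def add_samples(A_pinv, W, A_rows, Y_rows):
    """Update pinv(A) and the weights after appending the rows A_rows with labels Y_rows.

    Assumes the original input matrix A has full column rank and W == A_pinv @ Y.
    """
    D_T = A_rows @ A_pinv                                   # shape (p, N)
    S = np.eye(A_rows.shape[0]) + D_T @ D_T.T               # shape (p, p), symmetric
    B = np.linalg.solve(S, D_T @ A_pinv.T).T                # equals A_pinv @ D_T.T @ inv(S)
    W_new = W + B @ (Y_rows - A_rows @ W)                   # correct by the new residual
    A_pinv_new = np.hstack([A_pinv - B @ D_T, B])
    return A_pinv_new, W_new
```

Only a p*p system is solved for p new samples, so misjudged and newly collected websites can be folded into the model without retraining on the full data set.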
CN201810088364.8A 2018-01-30 2018-01-30 Phishing website detection method based on web automatic test and width learning Active CN108337255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810088364.8A CN108337255B (en) 2018-01-30 2018-01-30 Phishing website detection method based on web automatic test and width learning

Publications (2)

Publication Number Publication Date
CN108337255A true CN108337255A (en) 2018-07-27
CN108337255B CN108337255B (en) 2020-08-04

Family

ID=62926122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810088364.8A Active CN108337255B (en) 2018-01-30 2018-01-30 Phishing website detection method based on web automatic test and width learning

Country Status (1)

Country Link
CN (1) CN108337255B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant