CN108337255A - Phishing website detection method based on web automated testing and broad (width) learning - Google Patents
- Publication number
- CN108337255A (application number CN201810088364.8A)
- Authority
- CN
- China
- Prior art keywords
- url
- width
- matrix
- detection method
- website
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/30—Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
- H04L63/306—Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information intercepting packet switched data communications, e.g. Web, Internet or IMS communications
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Technology Law (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a phishing website detection method based on web automated testing and broad learning (also rendered as width learning), belonging to the field of computer network security. The method first extracts conventional features from the URL and the HTML page, then uses web automated testing techniques to extract interactive features, and finally trains a broad learning model on the preprocessed training samples built from the extracted features, so that phishing websites are identified and detected quickly and accurately and the public's network information and property are protected.
Description
Technical field
The invention belongs to the field of computer network security, and more particularly relates to a phishing website detection method based on web automated testing and broad learning.
Background art
Phishing is an attack that steals sensitive information, such as a user's personal identity data and financial account details, by sending large volumes of deceptive spam or fake web advertisements that claim to come from a bank or another well-known institution. In the most typical phishing attack, the user is lured to a carefully designed phishing website that closely resembles the target organization's website, where the attacker captures the personal sensitive information the user enters or tricks the user into transferring money. Because victims rarely become suspicious during such an attack, phishing websites have become one of the most serious forms of cybercrime today, and phishing website detection has become one of the most active research directions in network security.
In 2016, the National Engineering Laboratory for Internet Domain Name Management Technology, established under the leadership of CNNIC, together with the international Anti-Phishing Working Group (APWG) and the Anti-Phishing Alliance of China (APAC), jointly released the "Global Chinese Phishing Website Status Statistical Analysis Report (2016)" (hereinafter "the Report"). According to the Report, the number of phishing websites in China increased by 150.96% year on year in 2016. The main impersonated targets were Taobao, China Mobile and the major banks, and the domains used were mainly .COM, .CC, .PW and .NET.
In the third quarter of 2017, 360 Mobile Guard blocked a total of 790 million phishing website visits for mobile users nationwide, an increase of 102.6% over the third quarter of 2016. Among the phishing websites blocked on mobile devices, gambling and lottery sites accounted for 80.2% of the total, followed in decreasing order by fake shopping, fake recruitment, financial instruments, counterfeit drugs and phishing advertisements.
Although many sites are blocked, most of them have already existed for a long time, and the newest phishing websites are difficult to capture and block. The average life cycle of a phishing website is only 4.684 days, whereas the average reporting period is 13.327 days. Phishing websites must therefore be identified and blocked within a very short time, otherwise they threaten the public's property.
At present, phishing websites are identified and blocked by antivirus software and by the browser itself. The techniques fall into the following categories:
1. Blacklist filtering: phishing websites found by manual inspection or reported by the public are added to a blacklist. When an accessed URL (Uniform Resource Locator) is on the blacklist, access is blocked and a warning is issued. This approach cannot identify the newest phishing websites and requires manual verification.
2. URL feature extraction: features such as the domain name are extracted from the accessed URL. This approach is unreliable because the URL does not carry the key attributes of a phishing website, so its false-positive and missed-detection rates are high.
3. Detection using various page elements as features: this approach is more accurate than the second category, but obtaining page features takes time, so its execution speed and efficiency are low.
Summary of the invention
In view of the above defects of the prior art and the need for improvement, the invention provides a phishing website detection method based on web automated testing and broad learning. Its purpose is to extract conventional features from the URL and the HTML page, extract interactive features using web automated testing techniques, and train a broad learning model on the preprocessed training samples built from the extracted features, so that phishing websites are identified and detected quickly and accurately and the public's network information and property are protected.
To achieve the above purpose, according to one aspect of the invention, a phishing website detection method based on web automated testing and broad learning is provided, comprising the following steps:
(1) On a PC (Personal Computer), perform static feature extraction, dynamic feature extraction and interactive feature extraction on a large number of phishing websites and normal websites in a data set to form a feature vector set. The data set consists of phishing websites and normal websites collected from the network, or is obtained directly from a network security company;
(2) Divide the feature vector set from step (1) into a training set and a validation set using k-fold cross-validation;
(3) Train the broad learning model on the training set and test it against the validation set, building a base model and optimizing the performance of the classifier. The classifier is the model trained by the broad learning algorithm; in use, a web address is input to the classifier and the output indicates whether it is a phishing website. The performance of the classifier refers to the accuracy with which it identifies phishing websites, and optimizing the performance means improving this accuracy;
(4) Collect misclassified websites and newly recorded websites as a new feature vector set, and perform incremental learning with added input on the model to optimize it.
Preferably, step (1) specifically comprises:
(1.1) performing static feature extraction on the URL;
(1.2) using web automated testing techniques to drive a headless browser that accesses the URLs in the data set;
(1.3) performing dynamic feature extraction on the page reached through the URL;
(1.4) having the simulated browser click through the page interactively and return the interactive features.
Preferably, the static features in step (1.1) include:
1. whether the URL contains an IP address;
2. whether the part of the domain name from its start to the first dot is purely numeric;
3. whether the URL contains a sensitive character, such as @;
4. whether the URL port is port 80;
5. whether the length of the URL is less than 23 characters;
6. whether the URL contains keywords related to shopping or financial accounts, such as account, banking or taobao;
The above six static features are denoted <F1,F2,F3,F4,F5,F6>.
Preferably, the dynamic features in step (1.3) include:
1. whether the HTML title contains sensitive words, such as 'lottery', 'overseas gambling' or 'prize-winning';
2. whether a form exists;
3. whether image resources are on the same domain as the original URL;
4. whether link hrefs are on the same domain as the URL; href is short for Hypertext Reference and specifies the URL of the hyperlink target;
The above four dynamic features are denoted <F7,F8,F9,F10>.
Preferably, the interactive features in step (1.4) include:
1. whether the form is rigorous;
2. whether a clicked link is empty;
3. whether clicking a link triggers a URL redirect;
The above three interactive features are denoted <F11,F12,F13>.
Preferably, step (2) specifically comprises:
(2.1) setting the value of k;
(2.2) using k-fold cross-validation to divide the data set from step (1) into a training set and a validation set.
Preferably, step (3) specifically comprises:
(3.1) training the broad learning model with the feature vector set of the web page samples in the training set from step (2) and testing the classifier's performance; the web page samples are the phishing and normal website addresses in the training set;
(3.2) repeatedly adjusting the network structure by adding feature nodes and enhancement nodes, then training and testing until the classifier reaches the expected performance, obtaining the weights of each layer and saving the model.
Preferably, step (3.1) specifically comprises:
(3.11) initializing the number of feature windows N2, the number of feature nodes per window N1 and the number of enhancement nodes N3; randomly initializing the classifier's feature-node weight matrix and processing the feature-node weights with a sparse autoencoder;
(3.12) multiplying the feature vector set of the web page samples by the weight matrix obtained in step (3.11) to obtain the feature-node matrix;
(3.13) randomly initializing the enhancement-node weight matrix;
(3.14) multiplying the feature-node matrix obtained in step (3.12) by the weight matrix obtained in step (3.13) to obtain the enhancement-node matrix;
(3.15) concatenating the feature-node matrix from step (3.12) and the enhancement-node matrix from step (3.14) horizontally, row by row, to obtain the input matrix;
(3.16) computing the pseudoinverse (Moore-Penrose generalized inverse) of the input matrix from step (3.15) and multiplying it by <Y> to obtain the weight matrix; <Y> is the matrix formed from the labels of the web page samples; a label indicates whether a sample is a phishing website, for example label 1 means phishing and label 0 means not phishing;
(3.17) because step (2) uses k-fold cross-validation, repeating step (3.1) k times and averaging the k accuracies;
(3.18) gradually increasing N1, N2 and N3, observing whether the accuracy of the broad learning model improves, and finding the optimal parameters.
Preferably, step (3.2) specifically comprises:
(3.21) adjusting and testing the model from step (3.1) using the incremental learning algorithm that adds feature nodes and enhancement nodes;
(3.22) repeating step (3.21) a set number of times and recording the test accuracy each time, comparing the results to determine the optimal numbers of feature nodes and enhancement nodes, and saving the optimal model.
Preferably, step (4) specifically comprises: collecting misclassified websites and newly recorded websites as a new feature vector set and performing incremental learning with added input on the model to obtain an adjusted weight matrix, thereby optimizing the model.
Preferably, in step (1) the static features of the URL are extracted with regular expressions, and PhantomJS is used as the headless browser for headless UI automated testing; PhantomJS is a WebKit-based headless browser with a JavaScript API.
The invention combines URL static feature extraction, HTML dynamic feature extraction and interactive feature extraction based on web automated testing. The static features of the URL are extracted first. Headless UI automated testing is then performed: a simulated browser accesses the URL, the page source is retrieved, and dynamic features are extracted from the HTML. Link clicks, form account input, login and similar operations are simulated without rendering the page, so interactive features are extracted quickly. For new phishing websites that are not yet on a blacklist, simulated link clicks test whether links are empty, simulated form logins test whether the form is legitimate, and simulated link clicks test whether URL redirection occurs. These new interactive features allow phishing websites to be detected quickly and accurately.
The invention uses a broad learning model (also rendered as width learning) to improve recognition of new phishing websites. Broad learning is a new machine learning method. Unlike deep learning, its architecture has few layers and it requires fewer computing resources. In addition, a deep learning model must be retrained as a whole when new samples arrive, which takes a long time, whereas the broad learning algorithm does not need to retrain the original model: it only needs to extract features from the newly added phishing website samples and use them to adjust and supplement the existing model, so detection accuracy keeps improving as the model updates itself.
In general, compared with the prior art, the above technical solution contemplated by the invention achieves the following beneficial effects:
1. In the method provided by the invention, static features are first extracted from the URL, dynamic features are then extracted from the HTML page, and a simulated browser driven by web automated testing techniques then extracts interactive features from the page. The features progress from static to dynamic to interactive feedback, mining features from shallow to deep and ensuring both their quantity and quality. Finally, the low resource requirements, fast training and incremental learning of the broad learning model are used to achieve accurate, fast and adaptive phishing website identification;
2. Combining interactive features with URL static features and HTML dynamic features greatly improves the accuracy of phishing website identification, applies to the newest phishing websites, and allows phishing websites to be detected and identified quickly and accurately;
3. Broad learning extracts features from newly added phishing website samples and uses them to adjust and supplement the existing model, which greatly reduces the time needed to update the model and keeps computing resource requirements low; at the same time, the detection accuracy of broad learning keeps improving as the model updates itself.
Brief description of the drawings
Fig. 1 is the overall flow chart of a phishing website detection method based on web automated testing and broad learning in a preferred embodiment of the invention;
Fig. 2 is a schematic diagram of phishing website feature extraction in the preferred embodiment of the invention;
Fig. 3 is a schematic diagram of preparing the k-fold cross-validation data sets for broad learning in the preferred embodiment of the invention;
Fig. 4 is a schematic diagram of training and optimizing the broad learning model in the preferred embodiment of the invention.
Detailed description
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the invention and do not limit it. In addition, the technical features of the embodiments described below can be combined with one another as long as they do not conflict.
The invention provides an intelligent phishing website detection method based on web automated testing and broad learning. Fig. 1 is the main flow chart of the invention and shows the overall flow and the relationship between the steps. The steps are implemented as follows:
(1) Step 1: on a PC, perform static feature extraction, dynamic feature extraction and interactive click-through access on a large number of phishing websites and normal websites in the data set.
As shown in Fig. 2, step 1 is as follows:
Step 1.1: perform static feature extraction on the URL itself, covering the following six features:
1. Whether the URL contains an IP address: an IP address can be used to avoid domain name registration and user scrutiny;
2. Whether the part of the domain name from its start to the first dot is purely numeric: legitimate sites rarely use purely numeric domain names; compare the legitimate site Baidu, https://www.***.com/, with the phishing site http://www.030033.com/;
3. Whether the URL contains a sensitive character such as @: the part before the @ is treated as account information and only the part after it is the real address, a trick that phishing websites often use;
4. Whether the URL port is port 80: legitimate URLs are accessed through port 80, and a non-80 port is suspicious;
5. Whether the length of the URL is less than 23 characters: according to statistics, legitimate URLs are usually no longer than 23 characters;
6. Whether the URL contains keywords such as account, banking or taobao: URLs that refer to shopping or banking warrant caution;
Each of the six features is extracted by matching the URL against its own regular expression. For example, re.match(r'.*//(.*)/.*', url) combined with split(' ') extracts the domain name from the URL. The other features are extracted in the same way. Each feature returns 1 if present and 0 otherwise, giving a set of six features <F1,F2,F3,F4,F5,F6>.
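The six static features of step 1.1 could be sketched as follows. This is a minimal illustration rather than the patent's own code: it assumes Python's standard urllib.parse for host and port parsing in place of six separate regular expressions, and the function name static_features, the keyword tuple, the skipping of a leading "www." label and the treatment of a URL without an explicit port as port 80 are all illustrative assumptions.

```python
import re
from urllib.parse import urlsplit

# Example keywords named in the text; a real list would be longer (assumption).
SENSITIVE_KEYWORDS = ("account", "banking", "taobao")
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def static_features(url):
    """Return the flags [F1, ..., F6] (1 = feature present) for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # F1: the host is a dotted IPv4 address.
    f1 = int(bool(IPV4_RE.match(host)))

    # F2: the first domain label is purely numeric.
    labels = host.split(".")
    if labels[0] == "www" and len(labels) > 1:  # assumption: ignore a leading "www."
        labels = labels[1:]
    f2 = int(labels[0].isdigit())

    # F3: the URL contains the sensitive character @.
    f3 = int("@" in url)

    # F4: the port is 80 (assumption: a URL with no explicit port counts as 80).
    port = parts.port if parts.port is not None else 80
    f4 = int(port == 80)

    # F5: the URL is shorter than 23 characters.
    f5 = int(len(url) < 23)

    # F6: the URL contains a shopping or account keyword.
    f6 = int(any(k in url.lower() for k in SENSITIVE_KEYWORDS))

    return [f1, f2, f3, f4, f5, f6]
```

The returned list can be concatenated with the dynamic and interactive flags to form the 13-element feature vector for one website.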
Step 1.2: the headless browser's access to the URL is implemented in code; skipping the rendering of the browser page saves time for the feature extraction that follows. The code is as follows:
from selenium import webdriver
# Start a headless PhantomJS session and load the page without rendering a window
driver = webdriver.PhantomJS()
driver.get('http://www.douyu.com/directory/all')
Step 1.3: perform dynamic feature extraction on the page reached through the URL, using its HTML (HyperText Markup Language) source, covering the following four features:
1. Whether the HTML title contains sensitive words such as 'lottery', 'overseas gambling' or 'prize-winning': some illegal phishing websites exploit users' greed;
2. Whether a form exists: a form is also a sensitive feature, because the ultimate goal of a phishing website is to steal account names and passwords;
3. Whether image resources are on the same domain as the original URL: phishing websites often take images from legitimate websites;
4. Whether link hrefs are on the same domain as the URL: phishing websites do not write their own content, and most of their links point elsewhere. href is short for Hypertext Reference and specifies the URL of the hyperlink target.
These four features are parsed from the page source using the driver object from step 1.2. For example, driver.find_element_by_tag_name("title").text extracts the content of the title. The other features are extracted in the same way. Each feature returns 1 if present and 0 otherwise, giving a set of four features <F7,F8,F9,F10>.
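The four dynamic features can also be computed from the page source alone. The sketch below is a hedged illustration that assumes the HTML has already been fetched as a string (for example from the driver's page_source) and parses it with Python's standard html.parser. The names dynamic_features and _PageScan and the English keyword tuple are hypothetical, and a page with no images or links is treated as "same domain" by assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Illustrative stand-ins for the sensitive title words named in the text.
SENSITIVE_TITLE_WORDS = ("lottery", "gambling", "prize")


class _PageScan(HTMLParser):
    """Collect the title text, form presence, image sources and link hrefs."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.has_form = False
        self.img_srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def _same_host(page_url, resource_urls):
    """True if every resource resolves to the same host as the page."""
    page_host = urlsplit(page_url).hostname
    return all(
        urlsplit(urljoin(page_url, u)).hostname == page_host for u in resource_urls
    )


def dynamic_features(page_url, html):
    """Return the flags [F7, F8, F9, F10] for one page."""
    scan = _PageScan()
    scan.feed(html)
    title = scan.title.lower()
    f7 = int(any(w in title for w in SENSITIVE_TITLE_WORDS))  # sensitive title
    f8 = int(scan.has_form)                                   # form present
    f9 = int(_same_host(page_url, scan.img_srcs))             # images on same domain
    f10 = int(_same_host(page_url, scan.hrefs))               # links on same domain
    return [f7, f8, f9, f10]
```

Relative URLs are resolved against the page URL before the hosts are compared, so a link such as /about counts as being on the same domain.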
Step 1.4: interact with the page itself by clicking through it, covering the following three features:
1. Whether the form is rigorous, that is, whether a random account name and password can log in: a phishing website usually has no real user database, so in general any account name and password is accepted;
2. Whether a clicked link is empty: if most links are empty, the site is more likely to be a phishing website;
3. Whether clicking a link triggers a URL redirect: phishing websites are short-lived and often use URL redirection to jump to an IP address that has not yet been blocked;
These three features are extracted by using the driver to simulate a browser clicking through the page in real time and reading the results. For example, elem.send_keys(u'randomly generated account') fills in the account field of a form, and the password is filled in the same way; if clicking the login button then succeeds, the address is very likely a phishing site. The other features are extracted in the same way. Each feature returns 1 if present and 0 otherwise, giving a set of three features <F11,F12,F13>;
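Once the simulated browser has recorded what happened during the click-through, turning those observations into the three flags is straightforward. The sketch below covers only that last mapping and assumes the browser automation has already produced the inputs. The function name, the set of href values treated as empty, the 0.5 ratio for "most links" and the choice to encode F11 as 1 when a random login is accepted are all assumptions, since the text does not fix them.

```python
from urllib.parse import urlsplit

# href values treated as "empty" links (assumption; the text does not list them).
EMPTY_HREFS = {"", "#", "javascript:;", "javascript:void(0)", "javascript:void(0);"}


def interactive_features(random_login_accepted, hrefs, clicks, empty_ratio=0.5):
    """Map observations from a simulated browsing session to [F11, F12, F13].

    random_login_accepted: True if a random account and password logged in.
    hrefs: href strings of the links found on the page.
    clicks: (clicked_url, landed_url) pairs recorded after each simulated click.
    """
    # F11: the form accepted random credentials, so it is not rigorous.
    f11 = int(random_login_accepted)

    # F12: more than empty_ratio of the links are empty.
    empty = sum(1 for h in hrefs if h.strip().lower() in EMPTY_HREFS)
    f12 = int(bool(hrefs) and empty / len(hrefs) > empty_ratio)

    # F13: at least one click landed on a different host than the one clicked.
    f13 = int(any(urlsplit(a).hostname != urlsplit(b).hostname for a, b in clicks))

    return [f11, f12, f13]
```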
(2) Step 2: assign the training set and validation set by k-fold cross-validation. As shown in Fig. 3, step 2 is as follows:
Step 2.1: set the value of k. The parameter k is the number of times training and testing are repeated, and the resources consumed are proportional to it;
Step 2.2: use k-fold cross-validation to separate the training set and validation set for training the model. The original data set is divided into k parts; k-1 parts serve as the training set and 1 part as the validation set, and the process is repeated k times so that each part is used for validation once, to improve the generalization ability of the model.
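The split described in step 2.2 can be sketched in a few lines of standard-library Python. The function name k_fold_splits and the fixed shuffle seed are illustrative assumptions; the indices it yields would be used to select rows of the feature vector set.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")

    # Shuffle the sample indices once, then deal them into k roughly equal folds.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    # Each fold serves as the validation set exactly once.
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

Averaging the accuracy measured on the k validation folds gives the figure used later when the node counts are tuned.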
Fig. 4 is a schematic diagram of training and optimizing the broad learning model; see steps 3 and 4 below for details:
(3) Step 3: train the broad learning model and optimize it. Step 3 is as follows:
Step 3.1: train the broad learning model with the phishing website sample feature vector set obtained in step 1 and test the classifier's performance. This comprises the following steps:
Step 3.1.1: initialize the number of feature windows N2, the number of feature nodes per window N1 and the number of enhancement nodes N3. Based on repeated experiments, initialize N1*N2 = samples/600 and N3 = samples/10, where samples is the number of samples. Randomly initialize the classifier's feature-node weight matrix We and process the feature-node weights with a sparse autoencoder;
Step 3.1.2: multiply the phishing website training sample feature set X by the weight matrix We obtained in step 3.1.1 to obtain the feature-node matrix Z;
Step 3.1.3: randomly initialize the enhancement-node weight matrix Wh;
Step 3.1.4: multiply the feature-node matrix Z obtained in step 3.1.2 by the weight matrix Wh obtained in step 3.1.3 to obtain the enhancement-node matrix H;
Step 3.1.5: concatenate the feature-node matrix Z from step 3.1.2 and the enhancement-node matrix H from step 3.1.4 horizontally, row by row, to obtain the input matrix A;
Step 3.1.6: compute the pseudoinverse of the input matrix A from step 3.1.5 and multiply it by the training sample label set to obtain the weight matrix W. The code is as follows:
import numpy as np
# Regularized pseudoinverse solution: W = (A^T A + lamda*I)^-1 A^T train_y
W = np.linalg.inv(A.T.dot(A) + lamda * np.eye(A.T.shape[0])).dot(A.T.dot(train_y))
Step 3.1.7: because step 2 uses k-fold cross-validation, repeat step 3.1 k times and average the k accuracies, to strengthen the generalization ability of the classifier;
Step 3.1.8: tune the parameters N1, N2 and N3 based on experimental experience, take the highest accuracy at the peak, and save the values of these three parameters.
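Steps 3.1.1 to 3.1.6 can be put together in a compact NumPy sketch. It is a simplified illustration under stated assumptions, not the patent's implementation: the feature windows are collapsed into a single block of n_feature_nodes columns, the sparse autoencoder step is omitted, a tanh nonlinearity is applied to the enhancement nodes (the text only mentions a matrix multiplication), and np.linalg.solve replaces the explicit inverse. The names train_bls, predict_bls and lam are hypothetical.

```python
import numpy as np


def train_bls(X, Y, n_feature_nodes, n_enhance_nodes, lam=1e-3, seed=0):
    """Fit a single-block broad learning model and return (We, Wh, W)."""
    rng = np.random.default_rng(seed)

    # Steps 3.1.1-3.1.2: random feature-node weights and feature-node matrix Z.
    We = rng.standard_normal((X.shape[1], n_feature_nodes))
    Z = X @ We

    # Steps 3.1.3-3.1.4: random enhancement weights and enhancement matrix H.
    Wh = rng.standard_normal((n_feature_nodes, n_enhance_nodes))
    H = np.tanh(Z @ Wh)  # assumption: tanh activation on the enhancement nodes

    # Step 3.1.5: horizontal concatenation gives the input matrix A.
    A = np.hstack([Z, H])

    # Step 3.1.6: regularized least-squares solution for the output weights W.
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return We, Wh, W


def predict_bls(X, We, Wh, W):
    """Return 0/1 labels (1 = phishing) for the rows of X."""
    Z = X @ We
    A = np.hstack([Z, np.tanh(Z @ Wh)])
    return (A @ W >= 0.5).astype(int)
```

Calling train_bls once per fold and averaging the validation accuracy of predict_bls corresponds to step 3.1.7, and repeating that for several node counts corresponds to step 3.1.8.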
Step 3.2: adjust the network structure by increasing the number of nodes, then train and test until the classifier reaches the expected performance or the number of adjustments reaches a set limit, and obtain and save the weights of each layer for the best configuration;
Step 3.2 comprises the following:
Step 3.2.1: adjust and test the model from step 3.1 using the incremental learning algorithm that adds feature nodes and enhancement nodes;
Step 3.2.2: repeat step 3.2.1 a set number of times and record the test accuracy each time, compare the results to determine the optimal numbers of feature nodes and enhancement nodes, and save the optimal model.
(4) Step 4: collect misclassified websites and newly recorded websites to obtain a new feature vector set, and promptly perform incremental learning with added input on the model;
Step 4 comprises the following:
4.1: extract and save the features of the examples that were misclassified in step 3;
4.2: extract the features of new websites and store them in a local file;
4.3: when the number of stored samples reaches a preset value, feed them into incremental learning as a batch and adjust the weight matrix W to update the model;
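One simple way to adjust W for a new batch without revisiting old samples is to keep running sums of A^T A and A^T Y. This is a substitute technique chosen for brevity, not the pseudoinverse update that broad learning normally uses for added inputs: with the node weights We and Wh held fixed, it gives the same regularized solution as retraining on all data seen so far. The class name IncrementalReadout and the parameter lam are hypothetical.

```python
import numpy as np


class IncrementalReadout:
    """Output weights W updated batch by batch from running sums."""

    def __init__(self, n_columns, n_outputs=1, lam=1e-3):
        self.gram = lam * np.eye(n_columns)            # running A^T A + lam*I
        self.cross = np.zeros((n_columns, n_outputs))  # running A^T Y

    def update(self, A_new, Y_new):
        """Add a batch of input-matrix rows and labels, return the new W."""
        self.gram += A_new.T @ A_new
        self.cross += A_new.T @ Y_new
        return np.linalg.solve(self.gram, self.cross)
```

Here A_new is the input matrix built from the new feature vectors with the saved We and Wh, and Y_new holds their labels; update would be called once the stored samples reach the preset count from step 4.3.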
In conclusion using technical solution provided by the invention, the Fishing net based on web automatic tests and width study
Traditional characteristic is combined by intelligent detecting method of standing with interactive feature, and using width learning training model, consuming resource is smaller,
It realizes quick self-adapted update, while ensure that the accuracy of model, can be realized in the extremely short life cycle of fishing website
It precisely intercepts and hits.
The k-fold cross-validation used in the invention shuffles the samples, divides them evenly into k parts, and in turn selects k-1 parts for training and the remaining part for validation. The sum of squared prediction errors is computed for each round, and the average of the k sums is used as the basis for selecting the optimal model structure. If there are N samples and k is set to N, the method becomes leave-one-out cross-validation.
If A is an m*n matrix, the pseudoinverse (Moore-Penrose generalized inverse) of A in the invention refers to (A'A)^(-1)A', where A' denotes the transpose of A.
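The code in step 3.1.6 adds a regularization term lamda before inverting. The relation between that formula and the generalized inverse defined above can be written as follows; the symbols \lambda, I and Y correspond to lamda, np.eye and train_y in the code, and the closed form in the middle assumes A has full column rank.

```latex
A^{+} = \lim_{\lambda \to 0^{+}} \left(A^{\top}A + \lambda I\right)^{-1} A^{\top},
\qquad
A^{+} = \left(A^{\top}A\right)^{-1} A^{\top} \quad \text{if } A \text{ has full column rank},
\qquad
W = \left(A^{\top}A + \lambda I\right)^{-1} A^{\top} Y .
```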
The weights in the invention are the parameters of the broad learning model.
Sparse autoencoding (Sparse Autoencoder) refers to a technique that automatically learns features from unlabeled data and gives a better feature description than the raw data.
Those skilled in the art will readily understand that the above is only a preferred embodiment of the invention and is not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (10)
1. A phishing website detection method based on web automated testing and broad learning, characterized by comprising the following steps:
(1) on a PC, performing static feature extraction, dynamic feature extraction and interactive feature extraction on a large number of phishing websites and normal websites in a website data set to form a feature vector set;
(2) dividing the feature vector set from step (1) into a training set and a validation set using k-fold cross-validation;
(3) training the broad learning model on the training set and testing it against the validation set, building a base model and optimizing the performance of the classifier; the performance of the classifier refers to the accuracy with which the classifier identifies phishing websites;
(4) collecting misclassified websites and newly recorded websites as a new feature vector set, and performing incremental learning with added input on the model to optimize the model.
2. a kind of detection method for phishing site learnt based on web automatic tests and width as described in claim 1, special
Sign is that step (1) is specially:
(1.1) static nature extraction is carried out for url;
(1.2) it utilizes web automatization testing techniques to simulate without interface browser, accesses to the url of data set;
(1.3) behavioral characteristics extraction is carried out for the page that url is accessed;
(1.4) simulation browser interacts formula to the page and clicks browsing, and returns to interactive feature.
3. The phishing website detection method based on web automated testing and width learning according to claim 2, characterized in that the static features of step (1.1) include:
1. whether the URL contains an IP address;
2. whether the part of the URL's domain name from its start to the first dot is purely numeric;
3. whether the URL contains sensitive characters;
4. whether the port of the URL is port 80;
5. whether the length of the URL is less than 23 characters;
6. whether the URL contains keywords related to shopping or financial accounts, such as account, banking, taobao.
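The six static checks of claim 3 can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function name `static_features`, the 0/1 encoding, the treatment of a missing port as the scheme default, and the stand-in sensitive character `@` are all hypothetical, since the claim does not name specific sensitive characters.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-in: the claim does not list its sensitive characters.
SENSITIVE_CHARS = ("@",)
# Example keywords named in the claim.
KEYWORDS = ("account", "banking", "taobao")
# Assumption: a URL without an explicit port uses its scheme's default port.
DEFAULT_PORTS = {"http": 80, "https": 443}
IPV4_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def static_features(url):
    """Return the six static features as a list of 0/1 integers."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    first_label = host.split(".")[0]
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    lowered = url.lower()
    return [
        int(IPV4_RE.fullmatch(host) is not None),      # 1. host is an IP address
        int(first_label.isdigit()),                    # 2. first domain label is numeric
        int(any(c in url for c in SENSITIVE_CHARS)),   # 3. sensitive characters present
        int(port == 80),                               # 4. port is 80
        int(len(url) < 23),                            # 5. shorter than 23 characters
        int(any(k in lowered for k in KEYWORDS)),      # 6. shopping/account keywords
    ]
```

Each URL in the dataset would contribute one such vector, which is later concatenated with the dynamic and interactive features.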
4. The phishing website detection method based on web automated testing and width learning according to claim 2, characterized in that the dynamic features of step (1.3) include:
1. whether the HTML title contains sensitive words, such as 'lottery', 'overseas gambling', 'prize-winning';
2. whether a form is present;
3. whether image resources share the same domain name as the original URL;
4. whether the href of each link shares the same domain name as the URL, where href is short for Hypertext Reference and specifies the URL of the hyperlink target.
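The four dynamic checks of claim 4 can be sketched on an already-fetched page using only the standard library. The names `PageScanner` and `dynamic_features` are hypothetical, the English word list is a stand-in for the claim's examples, and "same domain name" is assumed here to mean an exact hostname match; a real system might compare registered domains instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Stand-ins for the claim's example title words.
SENSITIVE_WORDS = ("lottery", "gambling", "prize")


class PageScanner(HTMLParser):
    """Collects the title, form presence, and off-host images and links."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlsplit(page_url).hostname
        self.in_title = False
        self.title = ""
        self.has_form = False
        self.foreign_images = 0
        self.foreign_links = 0

    def _is_foreign(self, ref):
        # Resolve relative references against the page URL before comparing hosts.
        host = urlsplit(urljoin(self.page_url, ref)).hostname
        return host is not None and host != self.page_host

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "img" and attrs.get("src"):
            self.foreign_images += int(self._is_foreign(attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.foreign_links += int(self._is_foreign(attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def dynamic_features(page_url, html):
    """Return the four dynamic features as a list of 0/1 integers."""
    scanner = PageScanner(page_url)
    scanner.feed(html)
    title = scanner.title.lower()
    return [
        int(any(w in title for w in SENSITIVE_WORDS)),  # 1. sensitive title
        int(scanner.has_form),                          # 2. form present
        int(scanner.foreign_images == 0),               # 3. images on the same host
        int(scanner.foreign_links == 0),                # 4. links on the same host
    ]
```

In the claimed method the HTML would come from the headless browser of step (1.2); here it is simply passed in as a string.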
5. The phishing website detection method based on web automated testing and width learning according to claim 2, characterized in that the interactive features of step (1.4) include:
1. whether the form is rigorous;
2. on clicking a link, whether the link is empty;
3. on clicking a link, whether a URL redirect occurs.
6. The phishing website detection method based on web automated testing and width learning according to claim 1, characterized in that step (2) specifically comprises:
(2.1) setting the value of k;
(2.2) dividing the dataset of step (1) into a training set and a validation set using k-fold cross-validation.
7. The phishing website detection method based on web automated testing and width learning according to claim 1, characterized in that step (3) specifically comprises:
(3.1) training the width learning model with the feature vector set of the web page samples in the training set of step (2), and testing the classifier's performance;
(3.2) repeatedly adjusting the network structure by adding feature nodes and enhancement nodes, training and testing until the classifier reaches the expected performance, then obtaining the weight information of each layer and saving the model.
8. The phishing website detection method based on web automated testing and width learning according to claim 7, characterized in that step (3.1) specifically comprises:
(3.11) initializing the number of feature windows N2, the number of feature nodes per window N1, and the number of enhancement nodes N3; randomly initializing the feature node weight matrix of the classifier model, and processing the feature node weights with a sparse autoencoder;
(3.12) multiplying the feature vector set of the web page samples by the weight matrix obtained in step (3.11) to obtain the feature node matrix;
(3.13) randomly initializing the enhancement node weight matrix;
(3.14) multiplying the feature node matrix obtained in step (3.12) by the weight matrix obtained in step (3.13) to obtain the enhancement node matrix;
(3.15) horizontally concatenating, column-wise, the feature node matrix obtained in step (3.12) and the enhancement node matrix obtained in step (3.14) to obtain the input matrix;
(3.16) computing the Moore-Penrose pseudoinverse of the input matrix obtained in step (3.15) and multiplying it by Y to obtain the weight matrix, where Y is the matrix formed from the labels of the web page samples, and the label of a web page sample indicates whether or not it is a phishing website;
(3.17) since step (2) uses k-fold cross-validation, repeating step (3.1) k times and averaging the k accuracies;
(3.18) gradually increasing the values of N1, N2, and N3, observing whether the accuracy of the width model improves, and finding the optimal parameters.
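Steps (3.11) through (3.16) can be sketched with NumPy as follows. This is a simplified illustration, not the patented implementation: the names `train_bls` and `predict_bls` are hypothetical, the sparse-autoencoder refinement of step (3.11) is omitted, a `tanh` activation on the enhancement nodes is assumed (the claim only mentions a matrix product), and an explicit `rcond` tolerance is passed to the pseudoinverse for numerical stability.

```python
import numpy as np


def train_bls(X, Y, n1=10, n2=5, n3=40, seed=0):
    """Fit a basic width learning model; returns (Wf, We, W)."""
    rng = np.random.default_rng(seed)
    # (3.11) random feature-node weights: n2 windows of n1 nodes each.
    Wf = rng.standard_normal((X.shape[1], n1 * n2))
    # (3.12) feature node matrix.
    Z = X @ Wf
    # (3.13) random enhancement-node weights.
    We = rng.standard_normal((Z.shape[1], n3))
    # (3.14) enhancement node matrix (tanh is an assumption of this sketch).
    H = np.tanh(Z @ We)
    # (3.15) column-wise concatenation gives the input matrix.
    A = np.hstack([Z, H])
    # (3.16) output weights from the Moore-Penrose pseudoinverse.
    W = np.linalg.pinv(A, rcond=1e-10) @ Y
    return Wf, We, W


def predict_bls(model, X):
    """Map samples through the frozen random weights and the learned readout."""
    Wf, We, W = model
    Z = X @ Wf
    A = np.hstack([Z, np.tanh(Z @ We)])
    return A @ W
```

With 0/1 labels, thresholding the prediction at 0.5 gives a phishing or normal decision. Steps (3.17) and (3.18) would wrap these two functions in the k-fold loop and a search over N1, N2, and N3.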
9. The phishing website detection method based on web automated testing and width learning according to claim 7, characterized in that step (3.2) specifically comprises:
(3.21) adjusting and testing the model obtained in step (3.1) using an incremental learning algorithm that increases the number of feature nodes and the number of enhancement nodes;
(3.22) repeating step (3.21) a set number of times and recording the resulting test accuracies, comparing them to determine the optimal number of feature nodes and number of enhancement nodes, and saving this optimal model.
10. The phishing website detection method based on web automated testing and width learning according to claim 1, characterized in that step (4) specifically comprises: collecting misclassified websites and newly included websites as a new feature vector set, and performing incremental learning with added input on the model to obtain an adjusted weight matrix, thereby optimizing the model.
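One way to absorb new samples without retraining from scratch is sketched below. It deliberately swaps in a different technique from the claim wording: instead of updating a pseudoinverse, it keeps running sums of the ridge-regularized normal equations, which yields the same weights as refitting on all data seen so far. The class name `IncrementalReadout`, the regularization value `lam`, and the assumption that `A_new` is the node matrix of the new samples (computed with the frozen random weights of the existing model) are all illustrative.

```python
import numpy as np


class IncrementalReadout:
    """Output weights that can absorb new rows without revisiting old ones."""

    def __init__(self, n_nodes, n_outputs, lam=1e-3):
        self.lam = lam
        # Running A^T A + lam * I and A^T Y over all samples seen so far.
        self.G = lam * np.eye(n_nodes)
        self.B = np.zeros((n_nodes, n_outputs))
        self.W = np.zeros((n_nodes, n_outputs))

    def update(self, A_new, Y_new):
        """Add the node matrix and labels of new samples, then re-solve for W."""
        self.G += A_new.T @ A_new
        self.B += A_new.T @ Y_new
        self.W = np.linalg.solve(self.G, self.B)
        return self.W
```

Misclassified and newly collected websites would be converted to feature vectors, mapped to node matrices, and passed to `update`; memory use stays fixed at the size of the node dimension regardless of how many samples have been absorbed.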
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810088364.8A CN108337255B (en) | 2018-01-30 | 2018-01-30 | Phishing website detection method based on web automatic test and width learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108337255A (en) | 2018-07-27 |
CN108337255B (en) | 2020-08-04 |
Family
ID=62926122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810088364.8A (Active; CN108337255B (en)) | Phishing website detection method based on web automatic test and width learning | 2018-01-30 | 2018-01-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108337255B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102523202A (en) * | 2011-12-01 | 2012-06-27 | 华北电力大学 | Deep learning intelligent detection method for fishing webpages |
CN103605794A (en) * | 2013-12-05 | 2014-02-26 | 国家计算机网络与信息安全管理中心 | Website classifying method |
US20160337401A1 (en) * | 2015-05-13 | 2016-11-17 | Google Inc. | Identifying phishing communications using templates |
CN105323248A (en) * | 2015-10-23 | 2016-02-10 | 绵阳师范学院 | Rule based interactive Chinese spam filtering method |
CN106789888A (en) * | 2016-11-18 | 2017-05-31 | 重庆邮电大学 | A kind of fishing webpage detection method of multiple features fusion |
CN107392025A (en) * | 2017-08-28 | 2017-11-24 | 刘龙 | Malice Android application program detection method based on deep learning |
Non-Patent Citations (5)
Title |
---|
ANKIT KUMAR JAIN AND B.B.GUPTA: ""Comparative Analysis of Features Based Machine Learning Approaches for Phishing Detection"", 《2016 3RD INTERNATIONAL CONFERENCE ON COMPUTING FOR SUSTAINABLE GLOBAL DEVELOPMENT》 * |
C. L. PHILIP CHEN ET AL: ""Broad Learning System: An Effective and Efficient Incremental Learning System Without the Need for Deep Architecture"", 《IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS》 * |
何高辉: ""防网络钓鱼的安全域名服务器研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
冯庆 等: ""基于集成学习的钓鱼网页深度检测***"", 《计算机***应用》 * |
徐欢潇 等: ""多特征分类识别算法融合的网络钓鱼识别技术"", 《计算机应用研究》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522838A (en) * | 2018-11-09 | 2019-03-26 | 大连海事大学 | A kind of safety cap image recognition algorithm based on width study |
CN110213741A (en) * | 2019-05-23 | 2019-09-06 | 青岛智能产业技术研究院 | A kind of vehicle based on width study sends the real-time detection method of information true or false |
CN110213741B (en) * | 2019-05-23 | 2022-02-08 | 青岛智能产业技术研究院 | Method for detecting authenticity of vehicle sending information in real time based on width learning |
CN110287124A (en) * | 2019-07-03 | 2019-09-27 | 大连海事大学 | A kind of automatic marker software error reporting simultaneously carries out seriousness and knows method for distinguishing |
CN110365691A (en) * | 2019-07-22 | 2019-10-22 | 云南财经大学 | Fishing website method of discrimination and device based on deep learning |
CN110365691B (en) * | 2019-07-22 | 2021-12-28 | 云南财经大学 | Phishing website distinguishing method and device based on deep learning |
CN110749793A (en) * | 2019-10-31 | 2020-02-04 | 杭州中恒云能源互联网技术有限公司 | Dry-type transformer health management method and system based on width learning and storage medium |
CN111854732A (en) * | 2020-07-27 | 2020-10-30 | 天津大学 | Indoor fingerprint positioning method based on data fusion and width learning |
CN111854732B (en) * | 2020-07-27 | 2024-02-13 | 天津大学 | Indoor fingerprint positioning method based on data fusion and width learning |
CN113098887A (en) * | 2021-04-14 | 2021-07-09 | 西安工业大学 | Phishing website detection method based on website joint characteristics |
CN113591653A (en) * | 2021-07-22 | 2021-11-02 | 中南大学 | Incremental zinc flotation working condition discrimination method based on width learning system |
Also Published As
Publication number | Publication date |
---|---|
CN108337255B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108337255A (en) | A kind of detection method for phishing site learnt based on web automatic tests and width | |
CN104077396B (en) | Method and device for detecting phishing website | |
Feng et al. | The application of a novel neural network in the detection of phishing websites | |
CN101826105B (en) | Phishing webpage detection method based on Hungary matching algorithm | |
Dupont et al. | Population closure and the bias‐precision trade‐off in spatial capture–recapture | |
CN103559235B (en) | A kind of online social networks malicious web pages detection recognition methods | |
CN104881608B (en) | A kind of XSS leak detection methods based on simulation browser behavior | |
CN108111478A (en) | A kind of phishing recognition methods and device based on semantic understanding | |
US11762990B2 (en) | Unstructured text classification | |
CN109873810B (en) | Network fishing detection method based on goblet sea squirt group algorithm support vector machine | |
CN105718577B (en) | Method and system for automatically detecting phishing aiming at newly added domain name | |
CN108134784A (en) | web page classification method and device, storage medium and electronic equipment | |
CN107944274A (en) | A kind of Android platform malicious application off-line checking method based on width study | |
CN109657470A (en) | Malicious web pages detection model training method, malicious web pages detection method and system | |
Liu et al. | An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment | |
CN112685739A (en) | Malicious code detection method, data interaction method and related equipment | |
CN107046586A (en) | A kind of algorithm generation domain name detection method based on natural language feature | |
CN102999638A (en) | Phishing website detection method excavated based on network group | |
CN107818132A (en) | A kind of webpage agent discovery method based on machine learning | |
Sanglerdsinlapachai et al. | Web phishing detection using classifier ensemble | |
Liu et al. | Multi-scale semantic deep fusion models for phishing website detection | |
Ojewumi et al. | Performance evaluation of machine learning tools for detection of phishing attacks on web pages | |
Abunadi et al. | Feature extraction process: A phishing detection approach | |
CN111967503A (en) | Method for constructing multi-type abnormal webpage classification model and abnormal webpage detection method | |
Shyni et al. | Phishing detection in websites using parse tree validation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||