CN110427579B - Dangerous webpage identification method based on chrome plug-in - Google Patents


Info

Publication number
CN110427579B
Authority
CN
China
Prior art keywords
domain name
class
webpage
risk
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910720615.4A
Other languages
Chinese (zh)
Other versions
CN110427579A (en
Inventor
成卫青
刁健峰
褚佳乐
蔡晨阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910720615.4A priority Critical patent/CN110427579B/en
Publication of CN110427579A publication Critical patent/CN110427579A/en
Application granted granted Critical
Publication of CN110427579B publication Critical patent/CN110427579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a dangerous webpage identification method based on a chrome plug-in. The method extracts first-dimension data for a support vector machine from the URLs of all external links in a webpage, extracts second-dimension data for the support vector machine from the JavaScript code embedded in or referenced by all <script> tags in the page's HTML code, and solves the support vector machine on the extracted first-dimension and second-dimension data. The output is the parameters w and b of the separating hyperplane and a classification decision function.

Description

Dangerous webpage identification method based on chrome plug-in
Technical Field
The invention relates to a dangerous webpage identification method based on a chrome plug-in, and belongs to the field of Internet information security.
Background
Most existing malicious webpage identification systems target one specific application, so their system structures and implementations differ somewhat. The basic framework of a malicious webpage identification system has three main parts:
(1) Webpage collection. This part collects, deduplicates, and filters webpages on the Internet. Collection can be active or passive. Active collection captures a targeted set of webpages from the Internet, mainly with web crawler technology. Passive collection mainly gathers the access traffic passing through a gateway or a client honeypot. Traffic filtering then discards content that clearly does not belong to a malicious webpage, based on information such as the webpage's suffix and type, to improve the operating efficiency of the identification system.
(2) Feature extraction. Feature extraction derives, from the characteristics of webpages, the information that serves as the basis for identifying malicious webpages, according to the practical requirements of each identification method and each type of malicious webpage.
These features include, but are not limited to, URL lexical features, host information features, webpage content features, URL (DNS) blacklists, link relationships, and redirect relationships. Researchers have proposed many features of malicious webpages from different angles for the different classes of malicious webpages. Commonly used identification features can be classified by source as static or dynamic.
Static features come mainly from the static information of a webpage; they are numerous in kind, but relatively simple to extract. Common static features include host information, URL information, and webpage content. Dynamic features come mainly from the dynamic behavior of a webpage; they are fewer in kind, but relatively complex to extract. Common dynamic features include browser behavior, webpage redirect relationships, and registry and folder changes. These features often require lengthy manual analysis of the suspect webpage. When dynamic features are used, honeynet and virtualization technologies are often combined to assist identification.
(3) Webpage judgment. The currently common judgment methods are blacklist filtering, rule matching, machine learning, and identification based on interactive host behavior.
Disclosure of Invention
Purpose of the invention: to accurately identify the numerous and varied types of malicious webpages, and to make the identification system both usable and extensible, the invention provides a dangerous webpage identification method based on a chrome plug-in.
Technical scheme: to achieve this purpose, the invention adopts the following technical scheme:
A dangerous webpage identification method based on a chrome plug-in analyzes dangerous webpages from the perspective of source code with a two-dimensional support vector machine algorithm and provides a scheme for webpage security classification. The two dimensions of data come from the following sources:
1. Obtain the webpage's Alexa ranking from the webpage URL and analyze the webpage's popularity. Collect counterfeit websites, analyze their domain names, and split out their second-level or third-level domain names to form a domain name library. Split out the second-level or third-level domain name of the current webpage URL with a domain name splitting technique, and check whether the domain name length exceeds the average for suspicious webpages. Extract the href attributes of all <a> tags in the webpage source code, parse the URLs of all hyperlinks on the page, and filter them to obtain the external links. Using character-by-character comparison of the split domain names and the Dice distance algorithm, analyze how similar the current webpage's domain name and its external links' domain names are to the domain names in the domain name library, and combine this with the domain name length to compute the webpage security evaluation for this dimension.
2. First extract the addresses that the src attributes of all <script> tags in the webpage source code point to, together with all inline js code. Compare each src address with the webpage's domain name; if the two are unrelated, compare the src address with known external script providers. Then analyze all inline <script> code, and all code referenced by a <script> src other than code from known open-source external script providers, searching the javascript code for suspicious code segments, and judge the security of the referenced scripts in combination with the position at which each script is embedded in the HTML code.
The data obtained from the two dimensions are fed into the support vector machine, whose output is the parameters w and b of the separating hyperplane and a classification decision function.
The following concepts are used in the present invention:
(1) Domain name: the name of a computer or group of computers on the Internet, consisting of a string of names separated by dots. Domain names are divided into top-level domain names and lower-level domain names; the second-level domain name is the field immediately to the left of the top-level domain name. From right to left come the top-level domain name, second-level domain name, third-level domain name, fourth-level domain name, and so on. Typically, the second-level or third-level domain name is the website name.
(2) URL: uniform resource locator, a compact representation of the location of, and access method for, a resource available on the Internet; it is the address of a standard resource on the Internet.
(3) External link: a hyperlink to another website.
(4) javascript script: a scripting language developed from Netscape's LiveScript that can be interpreted and executed by a browser.
(5) Support vector machine: a generalized linear classifier that performs binary classification of data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for the learning samples.
The method specifically comprises the following steps:
step 1) extracting first dimension data of a support vector machine according to URLs of all external links in a webpage, wherein the steps comprise:
step 1.1) extracting URLs of all external links in a webpage;
step 1.2) accessing http://data.alexa.com and obtaining the Alexa ranking of the website hosting the webpage from the domain name in the webpage URL; if the website ranks within the top 1000, directly regarding the webpage as safe; if the website ranks outside the top 1000 or the ranking cannot be obtained, setting risk factor zero, danger0, to 1;
step 1.3) parsing the domain names at every level in the current webpage URL and in the URLs of the webpage's external links, and taking the longest single level of domain name; if the length of this longest domain name segment is greater than 18, setting risk factor three, danger3, to 1, and otherwise to 0; the URL is split as follows: first split the URL by '/' and take the domain name section, then split the domain name section by '.', and add each level of domain name to an array as a character string;
step 1.4) splitting the current webpage URL and the URLs of its external links again and extracting information: if the domain name ends with ".com.cn", extracting the third-level domain name; otherwise, extracting the second-level domain name;
step 1.5) comparing the domain name extracted from each external link URL one by one with the domain names in a known domain name database and calculating the similarity rate; taking the highest similarity rate below 1 as p, and recording as dname the domain name, extracted from an external link, whose similarity rate with some domain name in the known domain name database is p;
step 1.6) if the length of dname is greater than 6 and p is greater than 0.8, or the length of dname is less than 6 and p is greater than 0.54, setting risk factor one, danger1, to 1; otherwise setting danger1 to 0;
step 1.7) comparing the domain name extracted from the current webpage URL one by one with the domain names in the known domain name database and calculating the similarity rate; if the length of the extracted domain name is greater than 6 and the similarity rate is greater than 0.8, or the length of the extracted domain name is less than 6 and the similarity rate is greater than 0.54, setting risk factor two, danger2, to 1; otherwise setting danger2 to 0;
step 1.8) if danger1 is 1 and the other risk factors are 0, outputting "The web page is not facial!"; and letting danger = danger0 + danger1 + danger2 + danger3, outputting "The web page is safe!", "The web page is bright subspecious!", "The web is subspecious!", or "The web page is Dangerous!" when danger is 0, 1, 2, or 3 respectively;
step 2) extracting second dimension data of the support vector machine according to all < script > tags in the html code of the page, and the specific steps comprise:
step 2.1) obtaining the contents of all <script> tags in the page's html code, extracting everything their src attributes point to into an array X, and counting the total number of array elements, denoted x;
step 2.2) matching against the page URL: extracting the second-level domain name of the current page URL, or the third-level domain name if the URL is of a type such as com.cn, and classifying the array elements that contain this second-level or third-level domain name into group X1, whose element count is denoted x1; matching the remaining elements of array X against well-known domestic public libraries of open-source static resources, classifying the matched elements into group X2, whose count is denoted x2; matching the remaining elements of array X against well-known foreign public libraries of open-source static resources, classifying the matched elements into group X3, whose count is denoted x3; finally classifying the unmatched elements into group X4, whose count is denoted x4; these satisfy the relationship x1 + x2 + x3 + x4 = x;
step 2.3) if x4/x > 0.1, judging that the script references carry class A risk; if x2 and x3 are not both 0 and x4 is not 0, judging class B risk; if x4 is 0 and one of the following three conditions is satisfied: x1/x > 0.8, (x1 + x2)/x > 0.9, (x1 + x3)/x > 0.9, judging class D risk; in all remaining cases, judging class C risk;
step 2.4) obtaining all inline js scripts and all external js scripts in the html code, combining all js scripts into an array Y, and counting the number of array elements, denoted y; matching the following three common js trojan-injection ("horse hanging") patterns:
1.document.write(“<iframe
2.document.body.innerHTML
3.open(***,”NewWindow”,”toolbar
counting the occurrences of these three statements in each script, and counting the number of characters in each js script;
letting an array element contain a occurrences of the common trojan patterns and have a script length of b characters, comparing $\lg(b^2)$ with a; the elements of array Y that satisfy $\lg(b^2) > a$ form an array K, whose element count is denoted k; then examining the relation between each element of K and the elements of the earlier array X: if any element of K originates from an X4 element, judging the js code danger level to be class A; denoting by k1 the number of elements of K that originate from X1 elements, if k/y > 0.2 and k1/k > 0.5, judging the danger level to be class B; if k/y > 0.1 and k1/k > 0.3, judging the danger level to be class C; and in all other cases class D;
step 2.5) after the two judgments, scoring grades A, B, C, and D as 4, 3, 2, and 1 respectively in each judgment, and multiplying the two scores to obtain the webpage's final second-dimension judgment score;
step 3) solving the support vector machine on the extracted first-dimension and second-dimension data; the output is the parameters w* and b* of the separating hyperplane and a classification decision function, where w* is the normal vector of the hyperplane (here, a two-dimensional plane), b* is the intercept of the hyperplane (here, a two-dimensional plane), and the classification decision function means: if the feature space containing the input data has a hyperplane serving as the decision boundary, the learning targets are separated into a positive class and a negative class, and the distance from any sample point to the plane is greater than or equal to 1.
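For orientation, the output described in step 3) matches the standard hard-margin support vector machine formulation, sketched below. The notation is an assumption introduced here and does not come from the patent: $z_i$ is the two-dimensional feature vector of sample $i$ (first-dimension score, second-dimension score), $t_i \in \{+1, -1\}$ is its class label, and $w^*$, $b^*$ denote the optimal parameters.

```latex
\begin{aligned}
&\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad t_i\,(w \cdot z_i + b) \ge 1, \qquad i = 1, \dots, N, \\[4pt]
&\text{separating hyperplane: } w^{*} \cdot z + b^{*} = 0, \\[4pt]
&\text{classification decision function: } f(z) = \operatorname{sign}\!\left(w^{*} \cdot z + b^{*}\right).
\end{aligned}
```

In this formulation the constraint $t_i\,(w \cdot z_i + b) \ge 1$ is the "greater than or equal to 1" condition mentioned in the text, expressed as a functional margin.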
The similarity rate in step 1.5) is calculated as follows: first, each extracted domain name is compared one by one with the known domain names of the same length L in the database, the number s of identical letters is found, and the identical-letter ratio pr is calculated by the formula:
pr = s/L (equation 1)
The maximum value of pr below 1 is recorded and assigned to the variable percent, and the domain name for which pr equals percent is recorded; if pr is 1, the link associated with the extracted domain name is directly regarded as safe;
then each extracted domain name is compared one by one with the known domain names of different lengths in the database; treating each domain name as a set of characters, the similarity between domain names is calculated with the Dice coefficient:
$s = \dfrac{2\,|A \cap B|}{|A| + |B|}$
where s is the similarity between the domain names, A is the currently accessed domain name, and B is a domain name in the database; the maximum value of s below 1 is recorded and assigned to the variable dpercent, and the domain name for which s equals dpercent is recorded;
finally, percent and dpercent are compared and the larger value is assigned to p; the domain name in the known domain name database whose similarity rate is p is recorded as dname.
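The similarity calculation above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names and the sample domain list are hypothetical, and "the same number s of letters" is read as the number of positions at which the two equal-length names carry the same character.

```python
def same_letter_ratio(name: str, known: str) -> float:
    """Equation 1: pr = s / L for two names of equal length L."""
    if len(name) != len(known):
        raise ValueError("equation 1 is only applied to names of equal length")
    # s = number of positions holding the same character (assumed reading)
    s = sum(1 for a, b in zip(name, known) if a == b)
    return s / len(name)


def dice_coefficient(name: str, known: str) -> float:
    """Dice coefficient with each name treated as a set of characters."""
    a, b = set(name), set(known)
    return 2 * len(a & b) / (len(a) + len(b))


def best_match(name: str, known_names: list[str]):
    """Return (p, matched known name); p == 1.0 signals an exact match."""
    p, matched = 0.0, None
    for known in known_names:
        if len(known) == len(name):
            score = same_letter_ratio(name, known)
            if score == 1:
                return 1.0, known  # the text treats such a link as safe
        else:
            score = dice_coefficient(name, known)
        if p < score < 1:  # only values below 1 are candidates for p
            p, matched = score, known
    return p, matched


# Hypothetical known-domain list; "paypa1" imitates "paypal".
p, dname = best_match("paypa1", ["taobao", "google", "paypal", "qq"])
print(p, dname)
```

Taking the running maximum over both comparisons is equivalent to computing percent and dpercent separately and keeping the larger of the two.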
The support vector machine in step 3) is solved as follows:
step 3.1) taking 100 newly added, still-accessible dangerous webpages from the malwaredomainlist library, and adding 50 well-known Alexa-ranked webpages and safe webpages used in daily life to form a test library of 200 webpages; testing the 200 webpages and viewing the final fitted curve;
step 3.2) perturbing the original value of each obtained first-dimension and second-dimension datum with a random number, so that the data fed into the support vector machine are 90% to 110% of the original data;
step 3.3) deleting the points that affect linear separability with the noise-reduction part of the support vector machine to obtain linearly separable data; fitting the remaining linearly separable points, the fitted plane S is finally obtained as y = w × x + b, where w = 2.320 and b = 12.329; w is the normal vector of the hyperplane (here, a two-dimensional plane), b is the intercept of the hyperplane (here, a two-dimensional plane), x is the first-dimension score of the data, and y is the second-dimension score of the data.
In step 2.2), the well-known domestic public libraries of open-source static resources are Baidu or the Bootstrap Chinese network, and the well-known foreign public libraries of open-source static resources are *** or Microsoft Ajax.
In step 2.3), class A risk is the high-risk class, class B risk is the possibly dangerous class, class D risk is the safe class, and class C risk is the class in which no risk has yet been found.
In step 2.4), class A is the high-risk class, class B is the possibly dangerous class, class C is the low-probability risk class, and class D is the safe class.
Compared with the prior art, the invention has the following beneficial effects:
1. From the analyzed basic characteristics of domain names commonly used by dangerous webpages (especially phishing webpages), namely that they are usually overly long and that they imitate or counterfeit known webpage domain names, the method judges the likelihood that a webpage is dangerous by extracting the domain names in the webpage URL and in the URLs of the webpage's external links and comparing the similarity of the characters in those domain names.
2. The invention provides a method for identifying dangerous webpages through code analysis; after model optimization, a DNS server could analyze webpages directly to perform active defense.
3. The method effectively solves the problems of low accuracy and low universality of the existing webpage security identification method.
Drawings
FIG. 1 is a graph of obfuscated data
FIG. 2 is a graph of linearly separable data after noise reduction
FIG. 3 is a flow chart of the first dimension data generation of the SVM
FIG. 4 is a flow chart of the second dimension data generation of the support vector machine
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.
A dangerous webpage identification method based on a chrome plug-in is used for identifying dangerous webpages and comprises the following steps:
step 1) support vector machine first dimension data extraction, as shown in fig. 3, the specific steps include:
step 1.1) extracting URLs of all external links in a webpage;
step 1.2) accessing http://data.alexa.com and obtaining the Alexa ranking of the website hosting the webpage from the domain name in the webpage URL; if the website ranks within the top 1000, directly regarding the webpage as safe; if the website ranks outside the top 1000 or the ranking cannot be obtained, setting risk factor zero (danger0) to 1;
step 1.3) parsing the domain names at every level in the current webpage URL and in the URLs of the webpage's external links, and taking the longest single level of domain name; if the length of this longest domain name is greater than 18, setting risk factor three (danger3) to 1, and otherwise to 0. The URL is split as follows: first split the URL by '/' and take the domain name section, then split the domain name section by '.', and add each level of domain name to an array as a character string;
step 1.4) splitting the current webpage URL and the URLs of its external links again and extracting information: if the domain name ends with ".com.cn", extracting the third-level domain name; otherwise, extracting the second-level domain name;
step 1.5) comparing the domain name extracted from each external link URL one by one with the domain names in the known domain name database and calculating the similarity rate; taking the highest similarity rate below 1 as p, and recording as dname the domain name, extracted from an external link, whose similarity rate with some domain name in the known domain name database is p.
The comparison and similarity rate calculation are as follows: first, each extracted domain name is compared one by one with the known domain names of the same length (L) in the database, the number s of identical letters is found, and the identical-letter ratio (pr) is calculated by the formula:
pr = s/L (equation 1)
The maximum value of pr below 1 is recorded and assigned to the variable percent (if pr is 1, the link associated with the extracted domain name is directly regarded as safe), and the domain name for which pr equals percent is recorded.
Then each extracted domain name is compared one by one with the known domain names of different lengths in the database; treating each domain name as a set of characters, the similarity between domain names is calculated with the Dice coefficient:
$s = \dfrac{2\,|A \cap B|}{|A| + |B|}$
The maximum value of the similarity rate s below 1 is recorded and assigned to the variable dpercent, and the domain name for which s equals dpercent is recorded.
The variables percent and dpercent are compared and the larger value is assigned to p; the domain name in the known domain name database whose similarity rate is p is recorded as dname.
Step 1.6) if the length of dname is greater than 6 and p is greater than 0.8, or the length of dname is less than 6 and p is greater than 0.54, setting risk factor one (danger1) to 1; otherwise setting risk factor one to 0;
Step 1.7) comparing the domain name extracted from the current webpage URL one by one with the domain names in the known domain name database and calculating the similarity rate by the method of step 1.5); if the length of the extracted domain name is greater than 6 and the similarity rate is greater than 0.8, or the length of the extracted domain name is less than 6 and the similarity rate is greater than 0.54, setting risk factor two (danger2) to 1; otherwise setting risk factor two to 0.
Step 1.8) if danger1 is 1 and the other risk factors are 0, outputting "The web page is not facial!"; and letting danger = danger0 + danger1 + danger2 + danger3, outputting "The web page is safe!", "The web page is bright subspecious!", "The web is subspecious!", or "The web page is Dangerous!" when danger is 0, 1, 2, 3 or 4 respectively.
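The domain splitting and threshold tests of Step 1 can be sketched in Python as below. This is an illustrative sketch only: every function name is hypothetical, the standard-library `urlsplit` stands in for the manual split by '/', and the Alexa lookup and known-domain database are left out, so danger0 and the similarity rate p are passed in as plain values.

```python
from urllib.parse import urlsplit


def domain_labels(url: str) -> list[str]:
    """Split the host part of a URL into its domain-name levels."""
    host = urlsplit(url).hostname or ""
    return [label for label in host.split(".") if label]


def site_name(url: str) -> str:
    """Step 1.4: third-level name for '.com.cn' hosts, else second-level."""
    labels = domain_labels(url)
    index = 3 if ".".join(labels).endswith(".com.cn") else 2
    return labels[-index] if len(labels) >= index else ""


def danger3(urls: list[str]) -> int:
    """Step 1.3: 1 if the longest single level exceeds 18 characters."""
    longest = max((len(l) for u in urls for l in domain_labels(u)), default=0)
    return 1 if longest > 18 else 0


def similarity_flag(name_length: int, p: float) -> int:
    """Steps 1.6 and 1.7: the length/similarity thresholds as stated.

    A length of exactly 6 is covered by neither branch of the text,
    so this sketch returns 0 for it.
    """
    if name_length > 6 and p > 0.8:
        return 1
    if name_length < 6 and p > 0.54:
        return 1
    return 0


page = "https://login.secure-paypal.com/signin"
factors = {
    "danger0": 1,  # assumed: the Alexa ranking could not be obtained
    "danger1": similarity_flag(len("paypa1"), 5 / 6),
    "danger2": similarity_flag(len(site_name(page)), 0.5),
    "danger3": danger3([page]),
}
danger = sum(factors.values())
print(site_name(page), factors, danger)
```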
Step 2) support vector machine second dimension data extraction, as shown in fig. 4, the specific steps include:
step 2.1) obtaining the contents of all <script> tags in the page's html code, extracting everything their src attributes point to into an array X, and counting the total number of array elements, denoted x;
step 2.2) matching against the page URL: extracting the second-level domain name of the current page URL (or the third-level domain name if the URL is of a type such as com.cn), and classifying the array elements that contain this second-level (third-level) domain name into group X1, whose element count is denoted x1; matching the remaining elements of array X against well-known domestic public libraries of open-source static resources (such as Baidu and the Bootstrap Chinese network), classifying the matched elements into group X2, whose count is denoted x2; matching the remaining elements of array X against well-known foreign public libraries of open-source static resources (such as *** and Microsoft Ajax), classifying the matched elements into group X3, whose count is denoted x3; finally classifying the unmatched elements into group X4, whose count is denoted x4. These satisfy the relationship x1 + x2 + x3 + x4 = x;
step 2.3) if x4/x > 0.1, judging that the script references carry class A risk (high-risk class); if x2 and x3 are not both 0 and x4 is not 0, judging class B risk (possibly dangerous class); if x4 is 0 and one of the following three conditions is satisfied: x1/x > 0.8, (x1 + x2)/x > 0.9, (x1 + x3)/x > 0.9, judging class D risk (safe class); in all remaining cases, judging class C risk (no risk found yet).
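The decision rule of step 2.3) can be sketched in Python as follows. Two readings are assumptions of this sketch rather than statements of the patent: "x2, x3 are not 0 at the same time" is taken to mean "not both 0", and the class D test is taken to require x4 = 0 together with at least one of the three ratio conditions. The function name and the rejection of x = 0 are likewise hypothetical.

```python
def script_reference_class(x1: int, x2: int, x3: int, x4: int) -> str:
    """Classify script references from the group sizes of step 2.2)."""
    x = x1 + x2 + x3 + x4
    if x == 0:
        # The text does not cover pages without external scripts.
        raise ValueError("no external script references to classify")
    if x4 / x > 0.1:
        return "A"  # high-risk class
    if not (x2 == 0 and x3 == 0) and x4 != 0:
        return "B"  # possibly dangerous class
    if x4 == 0 and (
        x1 / x > 0.8 or (x1 + x2) / x > 0.9 or (x1 + x3) / x > 0.9
    ):
        return "D"  # safe class
    return "C"  # no risk found yet


for groups in [(10, 0, 0, 0), (5, 2, 1, 2), (95, 4, 0, 1), (5, 3, 2, 0)]:
    print(groups, script_reference_class(*groups))
```

The conditions are tested in the order the text lists them, so a page that meets both the class A and the class B condition is reported as class A.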
Step 2.4) obtaining all inline js scripts and all external js scripts in the html code, combining all js scripts into an array Y, and counting the number of array elements, denoted y. The following three common js trojan-injection ("horse hanging") patterns are matched:
1.document.write(“<iframe
2.document.body.innerHTML
3.open(***,”NewWindow”,”toolbar
The occurrences of these three statements in each script are counted, as is the number of characters in each js script.
Let an array element contain a occurrences of the common trojan patterns and have a script length of b characters, and compare $\lg(b^2)$ with a. The elements of array Y that satisfy $\lg(b^2) > a$ form an array K, whose element count is denoted k. Then the relation between each element of K and the elements of the earlier array X is examined: if any element of K originates from an X4 element, the js code danger level is judged to be class A (high-risk class); the number of elements of K that originate from X1 is denoted k1. If k/y > 0.2 and k1/k > 0.5, the danger level is judged to be class B (possibly dangerous class); if k/y > 0.1 and k1/k > 0.3, the danger level is judged to be class C (low-probability risk class); all other cases are the safe class D.
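The per-script quantities a and b used in step 2.4) can be computed as in the Python sketch below. The regular expressions are assumptions: the `***` in the third pattern is treated as a wildcard for the first argument of `open`, and both straight and curly double quotes are accepted. The function names are hypothetical, and the membership test for K applies the inequality exactly as the text states it.

```python
import math
import re

# The three statements listed in step 2.4), expressed as regular expressions.
PATTERNS = [
    re.compile(r'document\.write\(\s*["“”]<iframe'),
    re.compile(r'document\.body\.innerHTML'),
    re.compile(r'open\([^,]*,\s*["“”]NewWindow["“”]\s*,\s*["“”]toolbar'),
]


def script_features(code: str) -> tuple[int, int]:
    """Return (a, b): pattern occurrences and character count of a script."""
    a = sum(len(pattern.findall(code)) for pattern in PATTERNS)
    b = len(code)
    return a, b


def belongs_to_k(code: str) -> bool:
    """Membership test for array K: lg(b^2) > a, as stated in the text."""
    a, b = script_features(code)
    return b > 0 and math.log10(b ** 2) > a


sample = 'document.write("<iframe src=x>"); window.open(u,"NewWindow","toolbar=no");'
print(script_features(sample), belongs_to_k(sample))
```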
Step 2.5) after the two judgments, grades A, B, C, and D are scored as 4, 3, 2, and 1 respectively in each judgment. The two scores are multiplied to obtain the webpage's final second-dimension judgment score.
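As a small worked example of Step 2.5), the Python sketch below maps the two grades to scores and multiplies them. The names are hypothetical; only the mapping 4, 3, 2, 1 and the multiplication come from the text.

```python
GRADE_SCORE = {"A": 4, "B": 3, "C": 2, "D": 1}


def second_dimension_score(reference_grade: str, js_code_grade: str) -> int:
    """Multiply the step 2.3) score by the step 2.4) score."""
    return GRADE_SCORE[reference_grade] * GRADE_SCORE[js_code_grade]


print(second_dimension_score("A", "B"))
```

Because both factors lie between 1 and 4, the second-dimension score ranges from 1 (both judgments class D) to 16 (both class A).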
Step 3) solving support vector machine
Step 3.1) taking 100 newly added dangerous webpages available for access in a malewalomainnlist (https:// www.malwaredomainlist.com/mdl. php) library, adding 50 ALEXA famous webpages and safe webpages used in daily life to form 200 webpage test libraries (100 safe 100 unsafe), testing the 200 points, and viewing a final fitting curve;
and 3.2) because the data of two dimensions are integers, a plurality of coincident points appear during fitting, and considering that the data volume is larger than that of a general two-dimensional support vector machine, certain confusion is determined in the data, the original data value of each obtained data is changed by using a random number, and the data injected into the vector machine is changed into 90% -110% of the original data. Pre-injection obfuscated data is shown in fig. 1.
Step 3.3) The noise reduction part of the support vector machine is used to delete the points that affect linear separability, giving the linearly separable data shown in Fig. 2. The remaining linearly separable points are fitted, and the final results are w = -2.320 and b = 12.329, that is, the plane S is: y = -2.320x + 12.329.
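As a small illustration of how the fitted line could be applied to a new page, the Python sketch below evaluates which side of S a point (x, y) falls on, where x is the first-dimension score and y the second-dimension score. The function names are hypothetical, and treating the higher-score side of the line as the dangerous class is an assumption: the text gives the line but does not state the labelling explicitly.

```python
# Fitted parameters reported in step 3.3.
W = -2.320
B = 12.329


def margin(x, y):
    """Signed vertical offset of the point (x, y) from the line y = W*x + B."""
    return y - (W * x + B)


def is_dangerous(x, y):
    """Classify a page by the side of the line it falls on.

    Assumption: points above the line (higher scores) are the dangerous class.
    """
    return margin(x, y) > 0
```

Under this reading, a page with scores (3, 16) lies well above the line, while a page with scores (0, 1) lies below it.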
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (3)

1. A dangerous webpage identification method based on a chrome plug-in is characterized by comprising the following steps:
step 1) extracting first dimension data of a support vector machine according to URLs of all external links in a webpage, wherein the steps comprise:
step 1.1) extracting URLs of all external links in a webpage;
step 1.2) accessing http://data.alexa.com and obtaining the Alexa ranking of the website where the webpage is located according to the domain name in the webpage URL; if the website ranks within the top 1000, directly regarding the webpage as safe; if the website ranks outside the top 1000 or the ranking cannot be obtained, setting risk factor zero danger0 to 1;
step 1.3) analyzing the domain names at all levels in the current webpage URL and in the URLs of the webpage's external links, and taking the longest single level of domain name; if the length of this longest level is greater than 18, setting risk factor three danger3 to 1, otherwise setting it to 0; the URL is divided as follows: first dividing the URL by '/' and taking the domain name section of the URL, then dividing the domain name section by '.', and adding the domain names at all levels to an array as character strings;
step 1.4) splitting the URL of the current webpage and the URLs of the webpage's external links again and extracting information: if the domain name ends with ".com.cn", extracting the third-level domain name; otherwise, extracting the second-level domain name;
step 1.5) comparing the domain name extracted from each external link URL with the domain names in a known domain name database one by one and calculating the similarity rate; taking the highest similarity rate smaller than 1 as p, and denoting as dname the domain name extracted from the external link whose similarity rate with some domain name in the known domain name database is p;
step 1.6) if the length of dname is greater than 6 and p is greater than 0.8, or the length of dname is less than 6 and p is greater than 0.54, setting risk factor one danger1 to 1; otherwise, setting danger1 to 0;
step 1.7) comparing the domain name extracted from the current webpage URL with the domain names in the known domain name database one by one and calculating the similarity rate; if the length of the extracted domain name is greater than 6 and the similarity rate is greater than 0.8, or the length of the extracted domain name is less than 6 and the similarity rate is greater than 0.54, setting risk factor two danger2 to 1; otherwise, setting danger2 to 0;
step 1.8) if danger1 is 1 and the other risk factors are 0, outputting "The web page is not official!"; and letting danger = danger0 + danger1 + danger2 + danger3 and, when danger is 0, 1, 2 or 3 respectively, outputting "The web page is safe!", "The web page is slightly suspicious!", "The web page is suspicious!" or "The web page is dangerous!";
step 2) extracting second dimension data of the support vector machine according to all <script> tags in the html code of the page, the specific steps comprising:
step 2.1) obtaining the contents of all <script> tags in the html code of the page, extracting all targets that src attributes point to in the html code to form an array X, and counting the total number of array elements, denoted x;
step 2.2) matching the page URL: extracting the second-level domain name of the current page URL, or the third-level domain name if the URL is of a type such as com.cn, and classifying the array elements containing that second-level or third-level domain name into group X1, the number of which is denoted x1; matching the remaining elements of array X against public libraries of well-known domestic open-source static resources, the matched elements being classified into group X2, with their number denoted x2; matching the remaining elements of array X against public libraries of well-known foreign open-source static resources, the matched elements being classified into group X3, with their number denoted x3; finally, classifying the unmatched elements into group X4, with their number denoted x4; the relationship x1 + x2 + x3 + x4 = x is satisfied; wherein the domestic public library of well-known open-source static resources is Baidu or the Bootstrap Chinese network, and the foreign public library of well-known open-source static resources is *** or Microsoft Ajax;
step 2.3) if x4/x is greater than 0.1, judging that the script references carry a class A risk; if x2 and x3 are not 0 at the same time and x4 is not 0, judging that the script references carry a class B risk; if x4 is 0 and one of the following three conditions is satisfied: x1/x > 0.8, (x1 + x2)/x > 0.9, (x1 + x3)/x > 0.9, judging that the script references carry a class D risk; in all remaining cases, judging that the script references carry a class C risk; wherein class A is the high risk class, class B is the possible risk class, class D is the safe class, and class C is the class in which no risk has been found;
step 2.4) obtaining all internal js scripts and all external js scripts in the html code, combining all js scripts into an array Y, and counting the number of array elements, denoted y; matching the following three common js trojan-planting statements:
1. document.write("<iframe
2. document.body.innerHTML
3. open(***,"NewWindow","toolbar
counting the number of occurrences of the three statements in each script; counting the number of characters of each js script;
letting a be the number of common trojan statements that an array element contains and b its script length in characters, and comparing lg(b^2) with a; the elements of array Y that satisfy lg(b^2) > a form an array K, whose number of elements is denoted k; then examining how each element of K relates to the elements of the earlier array X, and if any element of K originates from an X4 element, judging that the js code danger level is class A; denoting by k1 the number of elements of K that originate from X1; if k/y > 0.2 and k1/k > 0.5, judging the danger level to be class B; if k/y > 0.1 and k1/k > 0.3, judging the danger level to be class C; in all other cases, judging the danger level to be class D; wherein class A is the high risk class, class B is the possible risk class, class C is the low-probability risk class, and class D is the safe class;
step 2.5) after the two judgments, scoring grades A, B, C and D of the script reference judgment as 4, 3, 2 and 1 in turn, and scoring classes A, B, C and D of the js code judgment as 4, 3, 2 and 1 in turn; multiplying the two scores to obtain the final second-dimension judgment score of the webpage;
step 3) solving the support vector machine according to the extracted first dimension data and second dimension data, and outputting the parameters w* and b* of the separating hyperplane and a classification decision function, wherein w* represents the normal vector of the hyperplane, b* represents the intercept of the hyperplane, and the classification decision function represents: if the feature space in which the input data lie has a hyperplane serving as a decision boundary, the learning targets are separated into a positive class and a negative class, and the distance from any sample point to the plane is greater than or equal to 1.
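The URL handling in steps 1.3 and 1.4 of claim 1 can be sketched in Python as follows. This is a minimal sketch rather than the claimed implementation: it uses the standard library function urllib.parse.urlsplit instead of splitting by '/' by hand, it assumes every URL carries a scheme such as https://, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def domain_labels(url):
    """Return the domain name levels of a URL, e.g. ['www', 'example', 'com']."""
    host = urlsplit(url).hostname or ""
    return [label for label in host.split(".") if label]


def danger3(urls, limit=18):
    """Step 1.3: 1 if the longest domain name level across the URLs exceeds the limit."""
    longest = max((len(label) for url in urls for label in domain_labels(url)), default=0)
    return 1 if longest > limit else 0


def key_label(url):
    """Step 1.4: third-level name for hosts ending in .com.cn, else second-level name."""
    labels = domain_labels(url)
    depth = 3 if labels[-2:] == ["com", "cn"] else 2
    return labels[-depth] if len(labels) >= depth else ""
```

With these helpers, key_label("https://www.example.com.cn/a") and key_label("https://mail.example.org/x") both yield "example", which is the string later compared against the known domain name database.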
2. The chrome plug-in based dangerous webpage identification method as claimed in claim 1, wherein the similarity rate in step 1.5) is calculated as follows: first, comparing each extracted domain name one by one with the known domain names of the same length L in the database, finding the number s of identical letters in the domain names, and calculating the identical-letter ratio pr by the formula:
pr = s/L (equation 1)
recording the maximum value of the identical-letter ratio pr that is less than 1 and assigning it to a variable percent; if pr is 1, directly regarding the link to which the extracted domain name belongs as safe; and recording the domain name whose pr equals percent;
then, comparing each extracted domain name one by one with the known domain names of different lengths in the database, treating each domain name as a character set, and calculating the similarity between the domain names using the Dice coefficient:
s = 2|A ∩ B| / (|A| + |B|)
wherein s represents the similarity between domain names, A represents the currently accessed domain name, and B represents a domain name in the database;
recording the maximum value of the similarity s that is less than 1 and assigning it to a variable dpercent, and recording the domain name whose s equals dpercent;
comparing the values of percent and dpercent and assigning the larger value to p; and recording the domain name in the known domain name database whose similarity rate is p, denoted dname.
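The two similarity measures of claim 2 can be illustrated with the Python sketch below. Several points are assumptions rather than statements of the claim: the "identical letters" of equation 1 are counted position by position, the Dice coefficient is taken in its standard form 2|A ∩ B| / (|A| + |B|) over character sets, and all function names are hypothetical.

```python
def same_letter_ratio(name, known):
    """Equation 1: pr = s / L for two names of equal length L."""
    if len(name) != len(known) or not name:
        raise ValueError("names must be non-empty and of equal length")
    # Assumption: letters are compared position by position.
    s = sum(a == b for a, b in zip(name, known))
    return s / len(name)


def dice(name, known):
    """Dice coefficient of the two names treated as sets of characters."""
    a, b = set(name), set(known)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def best_match(name, known_names):
    """Return (p, known name); (1.0, known) signals an exact match, treated as safe."""
    best_p, best_name = 0.0, None
    for known in known_names:
        if not name or not known:
            continue
        if len(known) == len(name):
            score = same_letter_ratio(name, known)
            if score == 1:
                return 1.0, known
        else:
            score = dice(name, known)
        # Only similarities strictly below 1 compete for p.
        if score < 1 and score > best_p:
            best_p, best_name = score, known
    return best_p, best_name
```

One property worth noting in this sketch: because the Dice coefficient works on character sets, "google" and "gogle" score exactly 1.0, so such a pair would be excluded by the "less than 1" filter rather than flagged.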
3. The chrome plug-in based dangerous webpage identification method as claimed in claim 2, wherein the support vector machine in step 3) is solved as follows:
step 3.1) taking 100 newly added, still accessible dangerous webpages from the malwaredomainlist library, adding 50 well-known Alexa webpages and safe webpages in everyday use to form a test library of 200 webpages, testing the 200 webpages, and examining the final fitted curve;
step 3.2) altering the original value of each obtained first dimension datum and second dimension datum with a random number, so that the data fed into the support vector machine are 90%-110% of the original values;
step 3.3) deleting the points that affect linear separability by using the noise reduction part of the support vector machine to obtain linearly separable data; fitting the remaining linearly separable points, the fitted classification hyperplane S finally obtained being: y = w* x + b*, wherein w* = -2.320 and b* = 12.329; w* is the normal vector of the hyperplane, b* is the intercept of the hyperplane, x represents the first-dimension score of the data, and y represents the second-dimension score of the data.
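The perturbation of step 3.2) can be sketched in Python as follows. This is a sketch under assumptions: the claim only states that the data fed in are 90%-110% of the original values, so drawing an independent uniform factor for each coordinate is an assumption, and the function name is hypothetical.

```python
import random


def perturb(points, rng, low=0.9, high=1.1):
    """Scale each coordinate by an independent random factor in [low, high]."""
    return [
        (x * rng.uniform(low, high), y * rng.uniform(low, high))
        for x, y in points
    ]


# Two coincident integer points no longer coincide after perturbation.
rng = random.Random(0)
scores = [(1, 4), (1, 4), (3, 16)]
perturbed = perturb(scores, rng)
```

Because the factor is multiplicative, a coordinate equal to 0 stays 0 in this sketch, so points that coincide at a zero score are only separated along the other dimension.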
CN201910720615.4A 2019-08-06 2019-08-06 Dangerous webpage identification method based on chrome plug-in Active CN110427579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910720615.4A CN110427579B (en) 2019-08-06 2019-08-06 Dangerous webpage identification method based on chrome plug-in

Publications (2)

Publication Number Publication Date
CN110427579A CN110427579A (en) 2019-11-08
CN110427579B true CN110427579B (en) 2020-12-01

Family

ID=68414326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910720615.4A Active CN110427579B (en) 2019-08-06 2019-08-06 Dangerous webpage identification method based on chrome plug-in

Country Status (1)

Country Link
CN (1) CN110427579B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101692639A (en) * 2009-09-15 2010-04-07 西安交通大学 Bad webpage recognition method based on URL
CN102332028A (en) * 2011-10-15 2012-01-25 西安交通大学 Webpage-oriented unhealthy Web content identifying method

Also Published As

Publication number Publication date
CN110427579A (en) 2019-11-08

Similar Documents

Publication Publication Date Title
US11463476B2 (en) Character string classification method and system, and character string classification device
CN110808968B (en) Network attack detection method and device, electronic equipment and readable storage medium
CN107707545B (en) Abnormal webpage access fragment detection method, device, equipment and storage medium
Mahajan et al. Phishing website detection using machine learning algorithms
CN107204960B (en) Webpage identification method and device and server
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
Buber et al. NLP based phishing attack detection from URLs
Zhang et al. Boosting the phishing detection performance by semantic analysis
US7565350B2 (en) Identifying a web page as belonging to a blog
CN104077396A (en) Method and device for detecting phishing website
CN104156490A (en) Method and device for detecting suspicious fishing webpage based on character recognition
CN105279277A (en) Knowledge data processing method and device
CN110191096B (en) Word vector webpage intrusion detection method based on semantic analysis
CN111581355A (en) Method, device and computer storage medium for detecting subject of threat intelligence
CN111967503B (en) Construction method of multi-type abnormal webpage classification model and abnormal webpage detection method
CN110572359A (en) Phishing webpage detection method based on machine learning
CN102446255A (en) Method and device for detecting page tamper
CN103617213A (en) Method and system for identifying newspage attributive characters
WO2020082763A1 (en) Decision trees-based method and apparatus for detecting phishing website, and computer device
CN104239582A (en) Method and device for identifying phishing webpage based on feature vector model
CN112528294A (en) Vulnerability matching method and device, computer equipment and readable storage medium
CN112948725A (en) Phishing website URL detection method and system based on machine learning
Rajalakshmi et al. DLRG@ HASOC 2019: An Enhanced Ensemble Classifier for Hate and Offensive Content Identification.
CN115442075A (en) Malicious domain name detection method and system based on heterogeneous graph propagation network
CN109064067B (en) Financial risk operation subject determination method and device based on Internet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210000, 66 new model street, Gulou District, Jiangsu, Nanjing

Applicant after: NANJING University OF POSTS AND TELECOMMUNICATIONS

Address before: 210000 Nanjing, Jiangsu Province, Yuhuatai District, software Avenue, No. 186

Applicant before: NANJING University OF POSTS AND TELECOMMUNICATIONS

GR01 Patent grant