CN115085952A - Phishing website processing method and device, storage medium and electronic equipment

Info

Publication number: CN115085952A
Application number: CN202110262136.XA
Authority: CN
Inventors: 刘紫千; 张敏; 常力元; 佟欣哲; 陈林; 刘长波; 余启明; 孙福兴; 王大伟; 张咏; 李齐
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2021-03-10
Filing date: 2021-03-10
Publication date: 2022-09-20

Abstract

The disclosure provides a phishing website processing method, a phishing website processing device, a computer readable storage medium and electronic equipment, and relates to the technical field of network security. The phishing website processing method comprises the following steps: acquiring a suspected phishing website corresponding to the customer requirement from a multi-source data set; determining a first approximate probability of a suspected phishing website to a customer website; determining a second approximate probability of the suspected phishing website under the condition that the first approximate probability is larger than a first threshold value; when the second approximate probability is larger than a second threshold value, determining the suspected phishing website as a phishing website; and after the suspected phishing website is determined to be the phishing website, processing the phishing website according to the requirement of the customer. The present disclosure provides a scheme for accurately and timely discovering phishing websites.

Description

Phishing website processing method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of network security technologies, and in particular, to a phishing website processing method, a phishing website processing apparatus, a computer-readable storage medium, and an electronic device.

Background

Phishing is a common form of cyber attack. The attacker sends deceptive messages purporting to come from banks or other well-known organizations to lure recipients into giving out sensitive information, and then uses that information to commit crimes.

In the attack and defense confrontation around phishing events, the propagation paths of phishing websites are highly concealed, their pages are convincingly disguised, and their survival periods are short, all of which make it harder to discover phishing websites accurately and promptly.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The invention provides a phishing website processing method, a phishing website processing device, a computer readable storage medium and electronic equipment, and further provides a scheme for accurately and timely finding a phishing website.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.

According to a first aspect of the present disclosure, there is provided a phishing website processing method comprising:

acquiring a suspected phishing website corresponding to the customer requirement from a multi-source data set;

determining a first approximate probability of the suspected phishing website and a customer website;

determining a second approximate probability of the suspected phishing website if the first approximate probability is greater than a first threshold;

when the second approximate probability is larger than a second threshold value, determining that the suspected phishing website is a phishing website;

and after the suspected phishing website is determined to be a phishing website, processing the phishing website according to the customer requirements.

Optionally, obtaining a suspected phishing website corresponding to the customer requirement from the multi-source data set includes:

and acquiring a suspected phishing website corresponding to the customer requirement from PDNS log data, a mobile phone short message suspected URL data source, a third party counterfeit URL data source and complaint data.

Optionally, determining the first approximate probability between the suspected phishing website and the client website includes:

determining the first approximate probability according to the domain name character string of the suspected phishing website and the domain name character string of the client website.

Optionally, determining the first approximate probability according to the domain name character string of the suspected phishing website and the domain name character string of the client website includes:

and determining the first approximate probability according to the approximate distance between the domain name character string of the suspected phishing website and the domain name character string of the client website.

Optionally, determining the first approximate probability according to the approximate distance between the domain name character string of the suspected phishing website and the domain name character string of the client website includes:

segmenting the domain name character strings of the suspected phishing website according to a preset length to obtain a plurality of first sub-character strings;

segmenting the domain name character strings of the client website according to the preset length to obtain a plurality of second sub-character strings;

determining a common substring according to the plurality of first substrings and the plurality of second substrings;

determining the approximate distance according to the number of the first substrings, the number of the second substrings and the number of the common substrings;

and determining the first approximate probability according to the approximate distance.

Optionally, determining the approximate distance according to the number of the first substring, the number of the second substring, and the number of the common substrings includes:

determining the number of the first substrings according to the length of the domain name string of the suspected phishing website and the preset length;

determining the number of the second substrings according to the domain name string length of the client website and the preset length;

and summing the number of the first substrings and the number of the second substrings, and subtracting a value obtained by twice the number of the common substrings to determine the approximate distance.

Optionally, determining the first approximate probability according to the approximate distance includes:

and determining the first approximate probability according to the proportion of the approximate distance to the sum of the number of the first substrings and the number of the second substrings.

Optionally, the determining the second approximate probability of the suspected phishing website includes:

determining the similarity probability of the webpage contents and the key information of the suspected phishing website and the client website;

determining the similarity probability of the suspected phishing website and the webpage templates of the pre-stored phishing websites;

and determining the second approximate probability according to the webpage content similar probability, the webpage template similar probability and the key information similar probability.

Optionally, determining the probability of similarity between the suspected phishing website and the webpage content of the client website includes:

calculating a first Hamming distance between the client website and the webpage content of the suspected phishing website;

and determining the similarity probability of the webpage contents according to the first Hamming distance and a first preset conversion formula.

Optionally, determining the probability of similarity between the suspected phishing website and the webpage templates of the pre-stored phishing websites includes:

calculating a second Hamming distance between the suspected phishing website and the pre-stored phishing website;

and determining the similarity probability of the webpage template according to the second Hamming distance and a second preset conversion formula.
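As an illustration of the two Hamming-distance steps above, the following minimal Python sketch assumes that each web page has already been reduced to a fixed-length binary fingerprint (64 bits here). The fingerprinting step, the function names and the linear conversion `1 - distance / bits` are assumptions made for illustration only; the preset conversion formulas themselves are not specified at this point in the text.

```python
def hamming_distance(fingerprint_a: int, fingerprint_b: int) -> int:
    """Count the bit positions in which two integer fingerprints differ."""
    return bin(fingerprint_a ^ fingerprint_b).count("1")


def hamming_to_similarity(distance: int, bits: int = 64) -> float:
    """Hypothetical conversion formula: map a Hamming distance to [0, 1].

    A distance of 0 gives similarity 1.0; a distance of `bits` gives 0.0.
    """
    return 1.0 - distance / bits


# Example with two made-up 64-bit fingerprints that differ in 16 bits.
page_a = 0xFFFF_0000_FFFF_0000
page_b = 0xFFFF_0000_FFFF_FFFF
similarity = hamming_to_similarity(hamming_distance(page_a, page_b))
```

The same pair of helpers could serve both the webpage content comparison and the webpage template comparison, with a different conversion formula supplied for each if required.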

Optionally, determining the similarity probability between the suspected phishing website and the key information of the client website includes:

calculating the similarity probability of the suspected phishing website and the small icon of the client website;

determining a filing probability according to whether the suspected phishing website has a filing record;

determining an attribution probability according to the attribution of the IP address corresponding to the suspected phishing website;

and determining the key information similarity probability according to the small icon similarity probability, the filing probability and the attribution probability.

Optionally, calculating the probability of similarity between the suspected phishing website and the small icon of the client website includes:

and determining the small icon similarity probability of the suspected phishing website and the client website according to an image similarity algorithm.

Optionally, determining the probability of filing according to whether the suspected phishing website is filed comprises:

in the case that the suspected phishing website has a filing record, determining that the filing probability is 0;

in the case that it has no filing record, determining that the filing probability is 1.

Optionally, determining the attribution probability according to the IP address attribution corresponding to the suspected phishing website includes:

in the case that the IP address is attributed to the mainland region, determining that the attribution probability is 0;

in the case that the IP address is not attributed to the mainland region, determining that the attribution probability is 1.

Optionally, determining the key information similarity probability according to the small icon similarity probability, the filing probability and the attribution probability includes:

and determining the key information similarity probability according to the weighted average of the small icon similarity probability, the filing probability and the attribution probability.
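The key information steps above can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation: the function names, the boolean inputs and the equal default weights are assumptions, and the small icon similarity is taken as an already computed number in [0, 1].

```python
def filing_probability(has_filing_record: bool) -> float:
    """0 when the site has a filing record, 1 when it has none."""
    return 0.0 if has_filing_record else 1.0


def attribution_probability(ip_in_mainland: bool) -> float:
    """0 when the IP is attributed to the mainland region, 1 otherwise."""
    return 0.0 if ip_in_mainland else 1.0


def key_info_similarity(
    icon_similarity: float,
    has_filing_record: bool,
    ip_in_mainland: bool,
    weights: tuple = (1.0, 1.0, 1.0),  # assumed weights, not given in the text
) -> float:
    """Weighted average of icon, filing and attribution probabilities."""
    values = (
        icon_similarity,
        filing_probability(has_filing_record),
        attribution_probability(ip_in_mainland),
    )
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

With these definitions, a site with no filing record and an IP address outside the mainland region scores high even when its icon only partly resembles the client's icon, which is the effect the two binary probabilities are meant to have.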

Optionally, determining the second approximate probability according to the web page content similarity probability, the web page template similarity probability, and the key information similarity probability includes:

determining the second approximate probability according to the weighted average of the webpage content similarity probability, the webpage template similarity probability and the key information similarity probability;

or determining the second approximate probability according to the maximum value among the webpage content similarity probability, the webpage template similarity probability and the key information similarity probability.
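The two alternatives for combining the three similarity probabilities into the second approximate probability can be sketched in Python as below. The function name, the `use_max` switch and the equal default weights are assumptions for illustration; the text does not give concrete weights.

```python
from typing import Optional, Sequence


def second_approximate_probability(
    content_similarity: float,
    template_similarity: float,
    key_info_similarity: float,
    weights: Optional[Sequence[float]] = None,
    use_max: bool = False,
) -> float:
    """Combine three similarity probabilities into one score.

    use_max=True  -> take the largest of the three values.
    use_max=False -> take their weighted average (equal weights by default).
    """
    scores = (content_similarity, template_similarity, key_info_similarity)
    if use_max:
        return max(scores)
    if weights is None:
        weights = (1.0, 1.0, 1.0)  # assumed default
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

The maximum variant flags a site as soon as any single signal is strong, while the weighted average requires the signals to agree; which is preferable presumably depends on how many false positives the customer tolerates.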

Optionally, the first threshold is greater than 70%, and the second threshold is greater than 80%.

According to a second aspect of the present disclosure, there is provided a phishing website processing apparatus comprising:

the website acquisition module is used for acquiring suspected phishing websites corresponding to customer requirements from a multi-source data set;

the first approximate probability determination module is used for determining a first approximate probability of the domain name character string of the suspected phishing website and the domain name character string of the client website;

a second approximate distance determination module, configured to determine a second approximate probability of the suspected phishing website if the first approximate probability is greater than a first threshold;

a phishing website determining module, configured to determine that the suspected phishing website is a phishing website when the second approximate probability is greater than a second threshold;

and the phishing website processing module is used for processing the phishing website according to the customer requirements after the suspected phishing website is determined to be a phishing website.

According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the phishing website processing method described above.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to execute the phishing website processing method described above via execution of the executable instructions.

The technical scheme of the disclosure has the following beneficial effects:

according to the phishing website processing method in the exemplary embodiment of the disclosure, on one hand, a suspected phishing website corresponding to a customer requirement is obtained from a multi-source data set, and a first approximate probability of the suspected phishing website and the customer website is determined, so that the suspected phishing website more similar to the customer website can be determined, and therefore preliminary screening of the suspected phishing website is achieved, the efficiency of determining the phishing website is improved, and a precondition is provided for timely finding the phishing website; in another aspect, after the preliminary screening, a second approximate probability is determined from the preliminarily screened suspected phishing websites, the phishing websites are determined according to the second approximate probability, and the screening standard of the phishing websites is further increased, so that the accuracy of determining the phishing websites is improved, and the purpose of accurately and timely finding the phishing websites is achieved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be obtained from those drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart diagram illustrating a phishing website processing method in an exemplary embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating the operation of a phishing website processing method provided by an exemplary embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating an operation of determining a first approximate probability in a phishing website processing method provided in an exemplary embodiment of the present disclosure;

FIG. 4 is a flowchart illustrating an operation of determining a second approximate probability in a phishing website processing method provided in an exemplary embodiment of the present disclosure;

FIG. 5 is a flowchart illustrating an operation of determining a similarity probability of key information in a phishing website processing method according to an exemplary embodiment of the present disclosure;

FIG. 6 shows a block diagram of a phishing website processing apparatus provided by an exemplary embodiment of the present disclosure;

FIG. 7 shows a schematic block diagram of an electronic device in an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

As a form of network attack, phishing lures a user into visiting a carefully designed phishing website that closely resembles the website of a target organization and thereby acquires the user's sensitive information, such as a user name, password, account number or credit card details. Generally, a phishing website imitates the domain name and page content of a legitimate website, or an attacker exploits a vulnerability in a real website's server program to insert malicious JavaScript code into some of its pages, in order to defraud users of private data such as bank or credit card account numbers and passwords. Users suffer economic loss as a result, and online transaction systems, financial platforms and the like are seriously harmed. Therefore, forensics on and disposal of phishing websites have become an important research topic in the field of network security.

In terms of propagation, phishing websites now spread through more diverse paths than traditional channels such as email and bulk text messages. Common chat tools such as QQ and WeChat, websites (for example, personal blogs and webpage pop-ups), group purchases, and even some results returned by search websites have become propagation paths for phishing websites.

In actual attack and defense, a phishing website usually survives only a short time. When phishing websites are discovered and blocked, attackers often use mature website templates to quickly regenerate and redeploy them under new domain names and servers, so that they reappear online to defraud internet users. Therefore, the key to effective defense is to discover phishing websites accurately and promptly and to connect the discovery, evidence collection and disposal stages, thereby automating the whole process.

Based on this, an exemplary embodiment of the present disclosure provides a phishing website processing method, which, referring to fig. 1, may include the steps of:

step S110, obtaining a suspected phishing website corresponding to the customer requirement from a multi-source data set;

step S120, determining a first approximate probability of the suspected phishing website and a client website;

step S130, determining a second approximate probability of the suspected phishing website under the condition that the first approximate probability is larger than a first threshold value;

step S140, when the second approximate probability is larger than a second threshold value, determining that the suspected phishing website is a phishing website;

and S150, processing the phishing website according to the customer requirement after the suspected phishing website is determined to be a phishing website.
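The two-stage screening in steps S120 to S140 can be sketched as a short Python loop. This is a simplified illustration under stated assumptions: the two scoring functions are passed in as callables, the names are hypothetical, and the default thresholds of 0.75 and 0.85 are example values chosen only to be above the 70% and 80% figures mentioned elsewhere in the text.

```python
from typing import Callable, Iterable, List

Scorer = Callable[[str, str], float]  # (suspect_domain, customer_domain) -> probability


def screen_candidates(
    candidates: Iterable[str],
    customer_domain: str,
    first_probability: Scorer,
    second_probability: Scorer,
    first_threshold: float = 0.75,   # assumed example value
    second_threshold: float = 0.85,  # assumed example value
) -> List[str]:
    """Return the candidates that pass both screening stages."""
    confirmed = []
    for domain in candidates:
        # Stage 1 (S120/S130): cheap domain-string comparison.
        if first_probability(domain, customer_domain) <= first_threshold:
            continue  # not analysed further
        # Stage 2 (S130/S140): costlier page-level comparison.
        if second_probability(domain, customer_domain) > second_threshold:
            confirmed.append(domain)
    return confirmed
```

Because the second scorer is only called for candidates that pass the first threshold, the more expensive page-level analysis runs on a small fraction of the input.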

The phishing website processing method in the exemplary embodiment of the disclosure has the following effects. First, obtaining the suspected phishing websites corresponding to the customer requirements from a multi-source data set draws candidates from multiple channels, which reduces the probability of missing a phishing website and improves the comprehensiveness and accuracy of acquisition. Second, determining the first approximate probability between each suspected phishing website and the client website identifies the candidates most similar to the client website; this preliminary screening removes most non-phishing websites, improves the efficiency of determining phishing websites, and is a precondition for finding them in time. Third, after the preliminary screening, a second approximate probability is determined for the remaining suspected phishing websites and phishing websites are determined according to it; this stricter screening standard improves the accuracy of the determination, so that phishing websites are found accurately and in time. Fourth, processing the phishing websites according to the customer requirements yields a complete automated method that starts from the customer requirements and covers discovering, tracking and handling phishing websites in a customized way, which provides customized anti-phishing services and improves the customer experience.

Hereinafter, a phishing website processing method in an exemplary embodiment of the present disclosure will be further described:

In step S110, a suspected phishing website corresponding to the customer's requirement is obtained from the multi-source data set.

In practical application, a customer generally provides related requirements on acquisition and processing of a phishing website, and related phishing website information which the customer wants to acquire or attack can be determined according to specific customer requirements.

In an exemplary embodiment of the present disclosure, the customer requirements may include at least the customer name, the customer website address, the disposal requirement, and a contact. For example, for China Telecom, the customer name may be China Telecom, the customer website address may be www.10000.com, the disposal requirement may be entrusted disposal or non-entrusted disposal, and the name and telephone number of the contact are also included. In practical applications, the customer requirements are not limited to the above information, and the actual requirement information may be obtained from the customer as needed, which is not particularly limited in this exemplary embodiment.

In the exemplary embodiment of the present disclosure, after the customer requirements are obtained, a suspected phishing website corresponding to the customer requirements may be obtained from the multi-source data set. The multi-source data set includes at least PDNS (Passive Domain Name System) log data, a mobile phone short message suspicious URL (Uniform Resource Locator) data source, a third-party counterfeit URL data source, complaint data and other data sets. PDNS passively records the data packets entering and leaving the DNS system without interfering with it.

As an example, taking hourly granularity, the PDNS log data includes: time (e.g., 20210101 00:00:00), domain name (e.g., ***.com), IP address corresponding to the domain name (e.g., 220.181.38.148), number of requests (e.g., 1121223), etc.

The mobile phone short message suspicious URL data source may include: discovery time (e.g., 20210101 00:00:00), suspicious URL address (e.g., 1oo00.com), etc.

The third-party counterfeit URL data source may include: discovery time (e.g., 20210101 00:00:00), suspicious URL address (e.g., 1oo00.com), type (e.g., phishing website), object (e.g., China Telecom), etc.

The complaint data may include: serial number (e.g., 121232232), complaint time (e.g., 20210101 00:00:00), complaint channel (e.g., telecommunications customer telephone call), complaint content details (e.g., password stolen, receipt of a short message prompting a password reset, etc.), complaint type (e.g., fraud), etc.

When actually processing the information in the multi-source data set, attention can be focused on the domain name information or URL information it contains, which provides data support for the preliminary screening of suspected phishing websites. In addition, when acquiring URL information from the complaint data, the URL information may be extracted from the text of the complaint content details by regular-expression matching; the specific extraction method is not described in detail in this exemplary embodiment.
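Since the extraction method is left open, the following is only one possible sketch of pulling URL-like strings out of free-text complaint details with Python's standard `re` module. The pattern, the function name and the sample complaint text are assumptions for illustration; a production pattern would need to handle more cases, such as trailing punctuation and internationalized domain names.

```python
import re

# Assumed pattern: optional http(s) scheme, one or more dot-separated labels,
# an alphabetic top-level label, and an optional path without whitespace.
URL_PATTERN = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)


def extract_urls(complaint_text: str) -> list:
    """Return every URL-like substring found in the complaint text."""
    return URL_PATTERN.findall(complaint_text)


# Hypothetical complaint detail text.
sample = "Got an SMS asking me to reset my password at http://1oo00.com/login please help"
found = extract_urls(sample)
```

Each extracted URL can then be reduced to its domain name and fed into the same screening as candidates from the other data sources.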

In the exemplary embodiment of the present disclosure, domain names whose access volume exceeds 100,000 or 1,000,000, that is, domain names whose number of requests in the PDNS log data is larger than 100,000 or 1,000,000, may also be removed. In the example above, the number of requests for ***.com is 1,121,223, which is larger than 1,000,000, so ***.com may be removed and not considered a suspected phishing website. This reduces the number of suspected phishing websites to be examined, which speeds up the determination of phishing websites and improves the timeliness of processing them.
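The request-count filter described above can be sketched in Python as follows. The record layout mirrors the PDNS fields listed earlier (time, domain name, IP address, number of requests), but the class name, field names, function name and sample records are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PdnsRecord:
    time: str      # e.g. "20210101 00:00:00"
    domain: str
    ip: str
    requests: int  # number of requests in the logging interval


def drop_popular_domains(
    records: Iterable[PdnsRecord],
    max_requests: int = 1_000_000,  # could also be 100_000, per the text
) -> List[PdnsRecord]:
    """Keep only records whose request count does not exceed the limit."""
    return [r for r in records if r.requests <= max_requests]


# Hypothetical records; "popular.example" stands in for a heavily visited domain.
records = [
    PdnsRecord("20210101 00:00:00", "popular.example", "220.181.38.148", 1_121_223),
    PdnsRecord("20210101 00:00:00", "1oo00.com", "192.0.2.1", 42),
]
remaining = drop_popular_domains(records)
```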

In addition, according to the phishing website processing method provided by the exemplary embodiment of the disclosure, the suspected phishing websites corresponding to the customer requirements are acquired from the multi-source data set, so that the suspected phishing websites can be acquired through multiple ways, the acquiring ways of the suspected phishing websites are more diversified, the missing probability of the phishing websites is reduced, and the comprehensiveness and accuracy of the phishing website acquisition are improved. Under the condition of not interfering the DNS system, the data packets entering and exiting the DNS system are recorded, so that the independence and the real-time property of the discovery of suspected phishing websites are realized, and the efficiency of phishing website processing is further improved.

In step S120, a first approximate probability of the suspected phishing website to the customer website is determined.

In an exemplary embodiment of the present disclosure, the first approximate probability may be determined according to actual conditions. For example, if the third-party counterfeit URL data source identifies the domain name of the suspected phishing website as targeting the customer, the first approximate probability may be directly determined as 1.

In addition, the first approximate probability can also be determined by the approximate distance between the domain name character string of the suspected phishing website and the domain name character string of the client website, and in the process of determining the approximate distance, the domain name character string of the suspected phishing website and the domain name character string of the client website need to be cut into a plurality of sub-character strings respectively.

Specifically, the domain name string of the suspected phishing website can be segmented according to a preset length to obtain a plurality of first substrings, and the domain name string of the client website can be segmented according to the same preset length to obtain a plurality of second substrings. The preset length N is the number of characters per substring: N is 1 when segmenting by 1 character, 2 when segmenting by 2 characters, and 3 when segmenting by 3 characters. The specific preset length may be determined according to the actual situation and is not particularly limited in the exemplary embodiment of the present disclosure.

Taking the domain name of the client website as 10000.com as an example, if the preset length N is 3, that is, each second substring is 3 characters long, then "10000.com" can be segmented into the 7 second substrings "100", "000", "000", "00.", "0.c", ".co" and "com", that is, the number of second substrings is $Lz_2 = 7$. The number of second substrings $Lz_2$ can be obtained from the domain name string length $L_2$ of the client website and the preset length N, i.e. $Lz_2 = L_2 - N + 1 = 9 - 3 + 1 = 7$, where the domain name string length $L_2$ of the client website is the number of characters in the domain name of the client website.

Similarly, if the domain name of the suspected phishing website is 1oooo.com and the preset length N is 3, that is, each first substring is 3 characters long, then "1oooo.com" can be segmented into the 7 first substrings "1oo", "ooo", "ooo", "oo.", "o.c", ".co" and "com", that is, the number of first substrings is $Lz_1 = 7$. The number of first substrings $Lz_1$ can likewise be obtained from the domain name string length $L_1$ of the suspected phishing website and the preset length N, i.e. $Lz_1 = L_1 - N + 1 = 9 - 3 + 1 = 7$, where the domain name string length $L_1$ of the suspected phishing website is the number of characters in the domain name of the suspected phishing website.

After obtaining the plurality of first substrings and the plurality of second substrings, a common substring may be determined from the plurality of first substrings and the plurality of second substrings.

It should be noted that, in the exemplary embodiment of the present disclosure, when determining the common substrings, the second substrings or the first substrings need to be approximately replaced according to approximate-character comparison information (for example, 0-o, o-0, etc.); for example, each 0 in the second substrings is replaced by o, or each o in the first substrings is replaced by 0. After this replacement, every substring of "10000.com" and "1oooo.com" in the above example is a common substring, that is, the number S of common substrings is 7.
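The segmentation and common-substring counting described above can be sketched in Python as follows. Several choices here are assumptions rather than part of the disclosure: the look-alike table contains only the 0/o pair named in the text, both strings are normalized to one canonical form before comparison, and repeated substrings are counted as a multiset (which reproduces S = 7 for a nine-character domain pair such as 10000.com and 1oooo.com).

```python
from collections import Counter
from typing import List

# Assumed look-alike table; the text only names the 0/o pair.
CONFUSABLES = {"o": "0"}


def normalize(domain: str) -> str:
    """Map visually similar characters to a single canonical character."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in domain.lower())


def segment(domain: str, n: int) -> List[str]:
    """Slide a window of length n over the string: len(domain) - n + 1 pieces."""
    return [domain[i:i + n] for i in range(len(domain) - n + 1)]


def common_substring_count(suspect: str, customer: str, n: int = 3) -> int:
    """Count substrings shared by both domains after normalization."""
    first = Counter(segment(normalize(suspect), n))
    second = Counter(segment(normalize(customer), n))
    # Counter intersection keeps the minimum count of each shared substring.
    return sum((first & second).values())


shared = common_substring_count("1oooo.com", "10000.com", n=3)
```

Normalizing both strings, rather than only one, keeps substrings such as "com" comparable after the o-to-0 replacement.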

According to the number $Lz_1$ of first substrings, the number $Lz_2$ of second substrings and the number S of common substrings, the approximate distance can be determined as $L_N = (L_1 - N + 1) + (L_2 - N + 1) - 2S = Lz_1 + Lz_2 - 2S = 7 + 7 - 2 \times 7 = 0$; that is, the approximate distance is the sum of the number $Lz_1$ of first substrings and the number $Lz_2$ of second substrings, minus twice the number S of common substrings.

After the approximate distance L_N is determined, the first approximate probability P₁ can be determined. In an exemplary embodiment of the present disclosure, the first approximate probability P₁ can be determined based on the ratio of the approximate distance L_N to the sum of the number Lz₁ of first substrings and the number Lz₂ of second substrings. Specifically, the first approximate probability P₁ can be expressed by equation (1):

$$P_1 = 1 - \frac{L_N}{Lz_1 + Lz_2} \tag{1}$$

In an exemplary embodiment of the present disclosure, the first approximate probability P₁ is thus obtained by subtracting the above ratio from 1. In the above example, P₁ = 1 - 0/(7 + 7) = 100%.
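The arithmetic for the approximate distance and the first approximate probability can be checked with a short sketch. The function names are hypothetical, and the second call uses an invented value of S purely to show a non-trivial result.

```python
def approximate_distance(len_first: int, len_second: int, n: int, common: int) -> int:
    """L_N = (L1 - N + 1) + (L2 - N + 1) - 2*S."""
    lz1 = len_first - n + 1
    lz2 = len_second - n + 1
    return lz1 + lz2 - 2 * common


def first_probability(len_first: int, len_second: int, n: int, common: int) -> float:
    """P1 = 1 - L_N / (Lz1 + Lz2)."""
    lz1 = len_first - n + 1
    lz2 = len_second - n + 1
    if lz1 + lz2 <= 0:
        raise ValueError("domain names are shorter than the preset length")
    distance = approximate_distance(len_first, len_second, n, common)
    return 1 - distance / (lz1 + lz2)


# Values from the worked example: L1 = L2 = 9, N = 3, S = 7.
print(approximate_distance(9, 9, 3, 7))  # 0
print(first_probability(9, 9, 3, 7))     # 1.0
# Hypothetical case with only 5 common substrings: distance 4, P1 = 1 - 4/14.
print(round(first_probability(9, 9, 3, 5), 4))
```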

In step S130, if the first approximate probability is greater than a first threshold, a second approximate probability of the suspected phishing website is determined.

In practical applications, the size of the first threshold may be directly obtained from the customer requirement information, or may be determined according to the customer requirement, for example, the first threshold may be a value greater than 70%, which is not limited in this exemplary embodiment of the present disclosure.

As shown above, the first approximate probability P₁ calculated in the above example is 100%, which is greater than the first threshold, so the suspected phishing website can be analyzed further to obtain the second approximate probability.

It should be noted that if the first approximate probability is less than or equal to the first threshold, it may be determined that the suspected phishing website is not a phishing website or is not a phishing website within the range of the customer's requirement, and the other suspected phishing websites may be processed without further analysis processing on the suspected phishing website, or the process may be ended.

In an exemplary embodiment of the disclosure, determining the second approximate probability P₂ of the suspected phishing website specifically includes: determining the webpage content similarity probability and the key information similarity probability of the suspected phishing website and the client website; determining the webpage template similarity probability of the suspected phishing website and the pre-stored phishing websites; and then determining the second approximate probability P₂ according to the webpage content similarity probability, the webpage template similarity probability and the key information similarity probability.

Specifically, the webpage content similarity probability Pcontent of the suspected phishing website and the client website refers to the similarity probability of the webpage text content, and in practical application, the webpage content similarity probability can be determined by referring to the existing text content similarity calculation method.

The exemplary embodiment of the present disclosure takes a Simhash algorithm as an example to calculate the probability of similarity between the web page contents of the suspected phishing website and the client website. The Simhash algorithm maps high-dimensional feature vectors into low-dimensional feature vectors through the idea of dimension reduction, and determines the similarity of text contents through the Hamming distance between the two feature vectors. The hamming distance is obtained by comparing the number of different characters at corresponding positions of two character strings with equal length, that is, the hamming distance is the number of characters to be replaced by converting one character string into another character string, which is equivalent to the number of 1's obtained by performing exclusive or operation on corresponding bits in two binary feature vectors.
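The equivalence between the Hamming distance and the number of 1 bits in the exclusive-or of two binary vectors can be illustrated with a small sketch (the function name is hypothetical, and the two 6-bit values are used only as sample inputs):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two equal-width binary vectors differ."""
    return bin(a ^ b).count("1")


# 101011 XOR 100101 = 001110, which contains three 1 bits.
print(hamming_distance(0b101011, 0b100101))  # 3
```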

The feature vectors of a sentence in the text may be obtained, for example, by performing word segmentation on the sentence to obtain the valid features, and then assigning each feature a weight on 5 levels, from 1 to 5 (given a text, the features may be the words in the text, and the weight may be the number of times the word occurs). For example, given the sentence "the author July of the CSDN blog 'the method of structure, the way of algorithm'", word segmentation yields the features "CSDN / blog / structure / of / method / algorithm / of / way / 's / author / July", and each feature is then assigned a weight: CSDN (4), blog (5), structure (3), of (1), method (2), algorithm (3), of (1), way (2), 's (1), author (5), July (5), where the number in brackets represents the importance of the word in the whole sentence; the larger the number, the more important the word.

Then, the Hash value of each feature is calculated through a Hash function, where the Hash value is an N-bit signature consisting of the binary digits 0 and 1. For example, the Hash value Hash(CSDN) of "CSDN" is "100101", and the Hash value Hash(blog) of "blog" is "101011". In this way, each character string becomes a series of digits.

On the basis of the Hash values, all the features are weighted, that is, W = Hash * weight: where a bit of the Hash value is 1, the weight is taken as positive, and where a bit is 0, the weight is taken as negative. For example, weighting the Hash value "100101" of "CSDN" (weight 4) yields W(CSDN) = "4 -4 -4 4 -4 4", and weighting the Hash value "101011" of "blog" (weight 5) yields W(blog) = "5 -5 5 -5 5 5"; the remaining features are processed similarly.

The weighted results of all the features are then accumulated position by position to form a single sequence. Taking the first two features as an example, adding "4 -4 -4 4 -4 4" of "CSDN" and "5 -5 5 -5 5 5" of "blog" gives "4+5, -4+(-5), -4+5, 4+(-5), -4+5, 4+5", that is, the high-dimensional feature vector "9 -9 1 -1 1 9".

Dimensionality reduction is then performed on the high-dimensional feature vector: each position greater than 0 is set to 1, and each other position is set to 0, thereby obtaining the Simhash value of the sentence, namely the low-dimensional feature vector. For example, reducing the above-calculated "9 -9 1 -1 1 9" (a position greater than 0 is recorded as 1, and a position less than 0 is recorded as 0) gives the low-dimensional 0/1 feature vector "101011".
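The weighting, accumulation and dimensionality reduction steps can be reproduced on the two features of the worked example. This is a sketch only: the function names are hypothetical, and the 6-bit Hash values are taken from the example rather than computed by a real Hash function.

```python
def accumulate(features: list[tuple[str, int]]) -> list[int]:
    """Add +weight where a hash bit is 1 and -weight where it is 0, per position."""
    width = len(features[0][0])
    totals = [0] * width
    for bits, weight in features:
        for i, bit in enumerate(bits):
            totals[i] += weight if bit == "1" else -weight
    return totals


def reduce_to_bits(totals: list[int]) -> str:
    """Positions greater than 0 become 1; all other positions become 0."""
    return "".join("1" if t > 0 else "0" for t in totals)


# (hash bits, weight) for "CSDN" and "blog" as given in the example.
example = [("100101", 4), ("101011", 5)]
totals = accumulate(example)
print(totals)                  # [9, -9, 1, -1, 1, 9]
print(reduce_to_bits(totals))  # 101011
```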

The Simhash values of the texts in the web page contents of the suspected phishing website and of the client website are calculated respectively according to the above Simhash algorithm. The two Simhash values are then compared, that is, the two low-dimensional feature vectors corresponding to the suspected phishing website and the client website are compared bit by bit, and the number of differing bits is determined as the first Hamming distance between the web page contents of the client website and the suspected phishing website.

In an exemplary embodiment of the present disclosure, the webpage content similarity probability P_content may be determined according to the first Hamming distance and a first preset conversion formula. The first preset conversion formula may be determined according to the actual situation, for example as given by formula (2).

Therefore, according to the first Hamming distance and the first preset conversion formula, the webpage content similarity probability P_content of the suspected phishing website and the client website can be obtained.
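Because the concrete conversion formula depends on the actual situation, the following sketch assumes a simple linear mapping from Hamming distance to similarity, P = 1 - d / bits. This mapping, the function name and the 64-bit signature width are all assumptions made for illustration and should not be read as the formula (2) of the disclosure.

```python
def similarity_from_hamming(distance: int, bits: int = 64) -> float:
    """Assumed linear conversion: identical signatures give 1.0, opposite give 0.0."""
    if not 0 <= distance <= bits:
        raise ValueError("distance must lie between 0 and the signature width")
    return 1 - distance / bits


print(similarity_from_hamming(0))       # 1.0
print(similarity_from_hamming(16, 64))  # 0.75
```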

In practical applications, known phishing websites and their webpage templates are usually pre-stored. In the process of judging a phishing website, the suspected phishing website can be compared with the webpage templates of the pre-stored phishing websites to determine the webpage template similarity probability P_template of the suspected phishing website and the pre-stored phishing websites.

In practical applications, there are various methods for determining the webpage template similarity probability of the suspected phishing website and a pre-stored phishing website. Since a webpage template is in text form once converted into HTML (HyperText Markup Language) format, the Hamming distance between the suspected phishing website and the pre-stored phishing website, namely the second Hamming distance, can be calculated by referring to the above Simhash algorithm; the webpage template similarity probability P_template is then determined according to the second Hamming distance and a second preset conversion formula.
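One way to apply the Simhash idea to a webpage template is to treat the sequence of HTML tag names as the text to be fingerprinted. The sketch below is an illustration under several assumptions: the tag-name features, the uniform weight of 1 per tag, the 64-bit width and the use of MD5 as the Hash function are all choices made here, not details taken from the disclosure.

```python
import hashlib
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start-tag names, ignoring text content and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list[str]:
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def simhash64(tokens: list[str]) -> int:
    """64-bit Simhash with every token weighted 1 (assumed weighting)."""
    totals = [0] * 64
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(64):
            totals[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i, t in enumerate(totals) if t > 0)


def template_distance(html_a: str, html_b: str) -> int:
    """Second Hamming distance between the tag-structure fingerprints."""
    a = simhash64(tag_sequence(html_a))
    b = simhash64(tag_sequence(html_b))
    return bin(a ^ b).count("1")


stored = "<html><body><form><input><input></form></body></html>"
suspect = "<html><body><form>Login<input>Password<input></form></body></html>"
print(template_distance(stored, suspect))  # 0: same tag structure, different text
```

Because only tag names are hashed, two pages built from the same template but carrying different visible text produce the same fingerprint in this sketch.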

The process of determining the second hamming distance may be specifically performed with reference to the process of determining the first hamming distance, and is not described herein again. In addition, the second preset conversion formula may be the same as or different from the first preset conversion formula, and this is not particularly limited in the exemplary embodiment of the present disclosure.

In an exemplary embodiment of the present disclosure, the key information of the suspected phishing website and the client website mainly includes the page small icon, the filing (record) status of the website, the IP address attribution, and the like. Determining the key information similarity probability P_others of the suspected phishing website and the client website may include: calculating the small icon similarity probability P_icon of the suspected phishing website and the client website; determining the filing probability P_icp according to whether the suspected phishing website has been filed; determining the attribution (home) probability P_area according to the attribution of the IP address corresponding to the suspected phishing website; and then determining the key information similarity probability P_others according to the small icon similarity probability P_icon, the filing probability P_icp and the attribution probability P_area.

The small icon similarity probability P_icon of the suspected phishing website and the client website may be calculated according to an image similarity algorithm.

In practical applications, various image similarity algorithms may be used, for example the SSIM (Structural SIMilarity) algorithm, a locality-sensitive hashing algorithm, a histogram algorithm, and the like. Any algorithm capable of measuring the similarity of two pictures can be used to calculate the small icon similarity probability P_icon, and the exemplary embodiments of the present disclosure are not particularly limited in this regard.
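As one possible instance of the histogram approach, the sketch below compares two icons by the intersection of their normalized grayscale histograms. The function names, the 8-bin histogram and the representation of an icon as a flat list of 0-255 gray values are assumptions for illustration; decoding real icon files is outside the scope of the sketch.

```python
def histogram(pixels: list[int], bins: int = 8) -> list[float]:
    """Normalized histogram of 0-255 gray values."""
    if not pixels:
        raise ValueError("an icon must contain at least one pixel")
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // 256] += 1
    return [c / len(pixels) for c in counts]


def icon_similarity(pixels_a: list[int], pixels_b: list[int], bins: int = 8) -> float:
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    hist_a = histogram(pixels_a, bins)
    hist_b = histogram(pixels_b, bins)
    return sum(min(x, y) for x, y in zip(hist_a, hist_b))


client_icon = [0, 255, 128, 128]
print(icon_similarity(client_icon, [0, 255, 128, 128]))  # identical distribution
print(icon_similarity([0, 0], [255, 255]))               # no overlap
```

A histogram ignores pixel positions, so it is a coarse measure; SSIM or a perceptual hash would be more discriminating at the cost of extra dependencies.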

In the exemplary embodiment of the present disclosure, the filing probability P_icp may be determined according to whether the suspected phishing website has been filed, that is, whether filing information for the suspected phishing website exists: if filing information exists, the filing probability is determined as P_icp = 0; if no filing information exists, the filing probability is determined as P_icp = 1.

In the exemplary embodiment of the present disclosure, the attribution (home) probability P_area can be determined based on the attribution of the IP address corresponding to the suspected phishing website; for example, P_area can be determined according to whether the IP address is attributed overseas or to another area with a high incidence of phishing websites. Specifically, P_area may be determined according to whether the IP address is attributed to the mainland area: when the IP address is attributed to the mainland area, P_area = 0; when the IP address is not attributed to the mainland area, P_area = 1.

After the small icon similarity probability P_icon, the filing probability P_icp and the attribution probability P_area are obtained, the key information similarity probability P_others can be determined from them. The specific determination method may vary; for example, P_others may be determined as a weighted sum of the small icon similarity probability P_icon, the filing probability P_icp and the attribution probability P_area, i.e. P_others = αP_icon + βP_icp + γP_area, where α + β + γ = 1.
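The three key-information components and their weighted combination can be sketched as follows. The function names and the default weights (0.5, 0.25, 0.25) are hypothetical; the text only requires that the three weights sum to 1.

```python
def filing_probability(has_filing_record: bool) -> float:
    """P_icp: 0 when filing information exists, 1 when it does not."""
    return 0.0 if has_filing_record else 1.0


def attribution_probability(is_mainland: bool) -> float:
    """P_area: 0 when the IP address is attributed to the mainland area, else 1."""
    return 0.0 if is_mainland else 1.0


def key_info_probability(p_icon: float, p_icp: float, p_area: float,
                         alpha: float = 0.5, beta: float = 0.25,
                         gamma: float = 0.25) -> float:
    """P_others = alpha*P_icon + beta*P_icp + gamma*P_area with weights summing to 1."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * p_icon + beta * p_icp + gamma * p_area


# Hypothetical site: near-identical icon, no filing record, overseas IP address.
p_others = key_info_probability(0.9, filing_probability(False),
                                attribution_probability(False))
print(round(p_others, 2))  # 0.95
```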

In the exemplary embodiment of the present disclosure, after the webpage content similarity probability P_content, the webpage template similarity probability P_template and the key information similarity probability P_others are obtained, the second approximate probability P₂ can be determined from them. The specific determination method may take several forms. For example, P₂ may be determined as a weighted sum of the webpage content similarity probability P_content, the webpage template similarity probability P_template and the key information similarity probability P_others, i.e. P₂ = δP_content + ηP_template + θP_others, where δ + η + θ = 1. Alternatively, P₂ may be determined as the maximum of the three probabilities, i.e. P₂ = MAX{P_content, P_template, P_others}.
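Both ways of combining the three probabilities into P₂ can be sketched in one function. The function name, the optional-weights interface and the sample values are assumptions for illustration.

```python
from typing import Optional


def second_probability(p_content: float, p_template: float, p_others: float,
                       weights: Optional[tuple[float, float, float]] = None) -> float:
    """Weighted sum when weights (delta, eta, theta) are given, otherwise the maximum."""
    if weights is None:
        return max(p_content, p_template, p_others)
    delta, eta, theta = weights
    if abs(delta + eta + theta - 1.0) > 1e-9:
        raise ValueError("delta + eta + theta must equal 1")
    return delta * p_content + eta * p_template + theta * p_others


# Hypothetical component values.
print(round(second_probability(0.8, 0.6, 0.95, weights=(0.4, 0.3, 0.3)), 3))  # 0.785
print(second_probability(0.8, 0.6, 0.95))                                     # 0.95
```

The maximum variant flags a site as soon as any single signal is strong, whereas the weighted sum requires agreement across signals; which is preferable depends on the customer's tolerance for false positives.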

In step S140, when the second approximate probability is greater than a second threshold, it is determined that the suspected phishing website is a phishing website.

In practical applications, the size of the second threshold may be directly obtained from the customer requirement information, or may be determined according to the customer requirement, for example, the second threshold may be a value greater than 80%, which is not limited in this exemplary embodiment of the present disclosure.

When the second approximate probability of the suspected phishing website is larger than the second threshold value, the suspected phishing website can be determined to be a phishing website. If the second approximate probability is smaller than or equal to the second threshold value, the suspected phishing website can be determined to be not a phishing website or not a phishing website within the range of the customer requirement, the suspected phishing website does not need to be further analyzed and processed, other suspected phishing websites can be processed, or the process is finished.

In step S150, after the suspected phishing website is determined to be a phishing website, the phishing website is processed according to the customer requirement.

In practical applications, some customers require managed (hosted) handling; that is, after the suspected phishing website is determined to be a phishing website, the phishing website can be handled directly, for example by shutting down its domain name or deleting the phishing link, so that the phishing website no longer survives. Other customers do not require managed handling; in that case, only the warning information of the phishing website, together with the basis for the analysis and judgment, needs to be sent to the customer, and the phishing website is then processed or left unprocessed according to the customer's instruction. The exemplary embodiment of the present disclosure does not particularly limit this.

According to the phishing website processing method provided by the exemplary embodiment of the disclosure, from the customer's perspective, a series of automated operations such as active discovery, evidence collection and disposal is performed on suspected phishing websites according to the customer's requirements, so that the whole service flow is fully automated, the timeliness of phishing website interception is improved, the harm caused by phishing websites is reduced, and the customer's real-time disposal requirements can be met.

Referring to fig. 2, an operation flowchart of a phishing website processing method provided in an exemplary embodiment of the present disclosure is shown, which may specifically include:

In step S201, a suspected phishing website is obtained according to the customer requirement. In step S202, the first approximate probability P₁ between the suspected phishing website and the client website is determined. In step S203, judgment condition 1 is evaluated, i.e. whether the first approximate probability P₁ is greater than the first threshold: if not, i.e. P₁ is less than or equal to the first threshold, the process ends; if so, i.e. P₁ is greater than the first threshold, step S204 is executed to determine the second approximate probability P₂. In step S205, judgment condition 2 is evaluated, i.e. whether the second approximate probability P₂ is greater than the second threshold: if not, i.e. P₂ is less than or equal to the second threshold, the process ends; if so, i.e. P₂ is greater than the second threshold, step S206 is executed, i.e. the suspected phishing website is determined to be a phishing website (phishing website confirmation for short). Finally, in step S207, the phishing website is processed according to the customer requirement (phishing website processing for short).
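The two-stage judgment in steps S203 to S206 can be sketched as a small function. The function name, the string results and the default thresholds of 0.7 and 0.8 are assumptions for illustration; in the disclosure the thresholds come from the customer requirement. The second probability is passed as a callable so that the more expensive page analysis runs only when the first condition is met.

```python
from typing import Callable


def judge(p1: float, compute_p2: Callable[[], float],
          first_threshold: float = 0.7, second_threshold: float = 0.8) -> str:
    """Return "phishing" only if both probabilities exceed their thresholds."""
    if p1 <= first_threshold:
        return "not phishing"   # condition 1 fails: end without computing P2
    p2 = compute_p2()           # step S204, evaluated lazily
    if p2 <= second_threshold:
        return "not phishing"   # condition 2 fails: end
    return "phishing"           # step S206: phishing website confirmation


print(judge(1.0, lambda: 0.95))  # phishing
print(judge(0.5, lambda: 0.95))  # not phishing
print(judge(0.9, lambda: 0.8))   # not phishing
```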

Referring to fig. 3, in determining the first approximate probability P₁ at step S202, the phishing website processing method provided by the exemplary embodiment of the present disclosure further includes: step S301, segmenting the domain name character string of the suspected phishing website according to a preset length to obtain a plurality of first substrings; step S302, segmenting the domain name character string of the client website according to the preset length to obtain a plurality of second substrings; step S303, determining the common substrings according to the plurality of first substrings and the plurality of second substrings; step S304, determining the approximate distance according to the number of first substrings, the number of second substrings and the number of common substrings; and step S305, determining the first approximate probability P₁ according to the approximate distance.

Referring to fig. 4, the process of determining the second approximate probability P₂ of a suspected phishing website may include: step S401, determining the webpage content similarity probability P_content of the suspected phishing website and the client website; step S402, determining the webpage template similarity probability P_template of the suspected phishing website and the pre-stored phishing websites; step S403, determining the key information similarity probability P_others of the suspected phishing website and the client website; and step S404, determining the second approximate probability P₂ according to the webpage content similarity probability P_content, the webpage template similarity probability P_template and the key information similarity probability P_others.

Referring to fig. 5, the step of determining the key information similarity probability P_others of a suspected phishing website and a client website may include: step S501, determining the small icon similarity probability P_icon of the suspected phishing website and the client website; step S502, determining the filing probability P_icp according to whether the suspected phishing website has been filed; step S503, determining the attribution probability P_area according to the attribution of the IP address corresponding to the suspected phishing website; and step S504, determining the key information similarity probability P_others according to the small icon similarity probability P_icon, the filing probability P_icp and the attribution probability P_area.

It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Further, a phishing website processing device is further provided in the example embodiment of the disclosure.

Fig. 6 schematically shows a block diagram of a phishing website processing apparatus of an exemplary embodiment of the present disclosure. Referring to fig. 6, the phishing website processing apparatus 600 according to an exemplary embodiment of the present disclosure may include a website acquisition module 610, a first approximate probability determination module 620, a second approximate distance determination module 630, a phishing website determination module 640, and a phishing website processing module 650:

Specifically, the website acquisition module 610 is configured to acquire a suspected phishing website corresponding to a customer demand from a multi-source data set;

a first approximate probability determination module 620, configured to determine a first approximate probability between the suspected phishing website and the client website; a second approximate distance determination module 630, configured to determine a second approximate probability of the suspected phishing website if the first approximate probability is greater than a first threshold; a phishing website determination module 640, configured to determine that the suspected phishing website is a phishing website when the second approximate probability is greater than a second threshold; and the phishing website processing module 650 is configured to, after determining that the suspected phishing website is a phishing website, process the phishing website according to the customer requirements.

Since each functional module of the phishing website processing apparatus of the embodiment of the present disclosure is the same as that in the above-described method embodiment, no further description is provided herein.

Furthermore, the above-described drawings are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."

An electronic device 700 according to this embodiment of the invention is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 7, electronic device 700 is in the form of a general purpose computing device. The components of the electronic device 700 may include, but are not limited to: the at least one processing unit 710, the at least one memory unit 720, a bus 730 connecting different system components (including the memory unit 720 and the processing unit 710), and a display unit 740.

Wherein the storage unit 720 stores program code that can be executed by the processing unit 710 to cause the processing unit 710 to perform the steps according to various exemplary embodiments of the present invention described in the above section "exemplary method" of the present specification. For example, the processing unit 710 may execute step S110 shown in fig. 1, obtaining a suspected phishing website corresponding to the customer requirement from a multi-source data set; step S120, determining a first approximate probability of the suspected phishing website and a client website; step S130, determining a second approximate probability of the suspected phishing website under the condition that the first approximate probability is larger than a first threshold value; step S140, when the second approximate probability is larger than a second threshold value, determining that the suspected phishing website is a phishing website; and S150, processing the phishing website according to the customer requirement after the suspected phishing website is determined to be a phishing website.

The storage unit 720 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 7201 and/or a cache memory unit 7202, and may further include a read only memory unit (ROM) 7203.

The storage unit 720 may also include a program/utility 7204 having a set (at least one) of program modules 7205, such program modules 7205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 730 may be any representation of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 700 may also communicate with one or more external devices 770 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interfaces 750. Also, the electronic device 700 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 760. As shown, the network adapter 760 communicates with the other modules of the electronic device 700 via the bus 730. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.

The program product for implementing the above method according to the embodiment of the present invention may employ a portable compact disc read only memory (CD-ROM) and include program codes, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims

1. A phishing website processing method, characterized by comprising the following steps:

2. A phishing website processing method as recited in claim 1 wherein determining a first approximate probability of the suspected phishing website to a customer website comprises:

determining the first approximate probability according to the domain name character string of the suspected phishing website and the domain name character string of the client website.

3. A phishing website processing method as claimed in claim 2, wherein determining the first approximate probability based on the domain name string of the suspected phishing website and the domain name string of the client website comprises:

4. A phishing website processing method as claimed in claim 3, wherein determining the first approximate probability based on the approximate distance of the domain name string of the suspected phishing website to the domain name string of the client website comprises:

5. A phishing website processing method as claimed in claim 4, wherein determining the approximate distance based on the number of the first sub-character string, the number of the second sub-character string and the number of the common sub-character string comprises:

6. A phishing website processing method as claimed in claim 4 wherein determining said first approximate probability based on said approximate distance comprises:

7. A phishing website processing method as claimed in any one of claims 1 to 6 wherein determining a second approximate probability of the suspected phishing website comprises:

determining the webpage content similarity probability and the key information similarity probability of the suspected phishing website and the client website;

8. A phishing website processing method as claimed in claim 7, wherein determining the probability that the suspected phishing website is similar to the web page content of the client website comprises:

9. A phishing website processing method as claimed in claim 7, wherein determining the probability of similarity of the web page templates of the suspected phishing website and the pre-stored phishing website comprises:

10. A phishing website processing method as claimed in claim 7, wherein determining the probability of similarity of key information of the suspected phishing website and the client website comprises:

calculating the small icon similarity probability of the suspected phishing website and the client website;

determining the attribution probability according to the attribution of the IP address corresponding to the suspected phishing website;

11. A phishing website processing apparatus, comprising:

the first approximate probability determining module is used for determining a first approximate probability of the suspected phishing website and a client website;

a phishing website determination module for determining the suspected phishing website as a phishing website when the second approximate probability is greater than a second threshold;

12. A computer-readable storage medium on which a computer program is stored, the computer program implementing the phishing website processing method of any one of claims 1 to 10 when executed by a processor.

13. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the phishing website processing method of any one of claims 1-10 via execution of the executable instructions.
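
By way of illustration only, the two-stage screening recited above (a first approximate probability computed from the domain name character strings, followed by a second approximate probability that is evaluated only when the first exceeds a first threshold) may be sketched as follows. The sketch is not the claimed implementation. The function names, the use of two-character sub-character strings, the Jaccard-style distance formula, the mapping from distance to probability, and both threshold values are assumptions introduced for this example; the second approximate probability is supplied by the caller as a hypothetical function.

```python
from typing import Callable, Set


def ngrams(s: str, n: int = 2) -> Set[str]:
    """Return the set of length-n sub-character strings of s.

    Assumption: n = 2; the claims do not fix a sub-string length.
    """
    s = s.lower()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def approximate_distance(suspect: str, customer: str, n: int = 2) -> float:
    """Distance built from the counts of first, second and common sub-strings.

    Assumption: a Jaccard-style formula, 1 - common / (first + second - common).
    """
    first = ngrams(suspect, n)
    second = ngrams(customer, n)
    common = first & second
    union = len(first) + len(second) - len(common)
    if union == 0:
        return 0.0
    return 1.0 - len(common) / union


def first_approximate_probability(suspect: str, customer: str) -> float:
    """Assumed mapping: a smaller distance gives a higher probability."""
    return 1.0 - approximate_distance(suspect, customer)


def is_phishing(
    suspect_domain: str,
    customer_domain: str,
    second_probability: Callable[[str], float],
    first_threshold: float = 0.5,   # placeholder value, not from the source
    second_threshold: float = 0.8,  # placeholder value, not from the source
) -> bool:
    """Apply the two thresholds in sequence.

    The second approximate probability is computed only when the first
    approximate probability is larger than the first threshold.
    """
    p1 = first_approximate_probability(suspect_domain, customer_domain)
    if p1 <= first_threshold:
        return False
    return second_probability(suspect_domain) > second_threshold
```

Passing the second-stage computation as a function keeps the more expensive page-content and key-information comparison from running for domains that already fail the cheaper string check.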