WO2016173200A1 - Method and system for detecting malicious web addresses - Google Patents
Method and system for detecting malicious web addresses
- Publication number
- WO2016173200A1 (PCT/CN2015/090648)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- url
- server
- http request
- information
- user
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
- H04L63/107—Network architectures or network communication protocols for network security for controlling access to devices or network resources wherein the security policies are location-dependent, e.g. entities privileges depend on current location or allowing specific operations only from locally connected terminals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/21—Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/2101—Auditing as a secondary aspect
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/21—Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/2103—Challenge-response
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/21—Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/2119—Authenticating web pages, e.g. with suspicious links
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1433—Vulnerability analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/52—Network services specially adapted for the location of the user terminal
Definitions
- The present disclosure relates generally to the field of computer technologies, in particular to network information security, and more specifically to a method and system for detecting malicious web addresses.
- Methods for detecting malicious web pages based on web page text content are by now relatively mature.
- In response, black-industry (underground) webmasters no longer put large amounts of text into their web pages; instead they process malicious pages with encryption algorithms and image-based rendering, while increasing their dependence on web page jumps.
- A web page jump means that a downstream page in a complete page request depends on information from the upstream page, such as the Referer header or cookies. As a result, the page obtained by the detection engine lacks text content features, and detection capability drops sharply.
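- The dependence of a downstream page on upstream information can be illustrated with a toy handler. This is a minimal sketch, not taken from the application: the function `serve_downstream`, the `/landing` path, and the `visited=1` cookie are all hypothetical.

```python
def serve_downstream(headers):
    """Toy downstream page: reveals its content only with upstream context."""
    referer = headers.get("Referer", "")
    cookie = headers.get("Cookie", "")
    # Hypothetical rule: the page expects to be reached from /landing
    # with a cookie that the landing page would have set.
    if referer.endswith("/landing") and "visited=1" in cookie:
        return "<html>full page content</html>"
    return "<html></html>"  # nothing useful for a text-based detector


# A crawler that fetches the downstream URL in isolation sends no context.
isolated = serve_downstream({})
# A browser-like session carries the Referer and cookie from the upstream page.
in_session = serve_downstream(
    {"Referer": "http://example.test/landing", "Cookie": "visited=1"}
)
```

- In this sketch the isolated fetch receives an empty page while the in-session fetch receives the full content, which is the inconsistency the text describes.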
- Web content is generally crawled by a static crawler.
- The principle of a static crawler is similar to that of Wget.
- The name Wget combines "World Wide Web" and "get". It is a free tool for automatically downloading files from the network, and it supports HTTP (Hypertext Transfer Protocol), HTTPS (Hypertext Transfer Protocol Secure), and FTP (File Transfer Protocol).
- TCP/IP: Transmission Control Protocol/Internet Protocol
- The detection engine can then identify malicious web pages only by relying on certain fixed components of the page. These fixed components, however, must be summarized manually from prior knowledge, which is time-consuming and laborious, and the detection results are poor.
- Dynamic crawlers, in contrast, can render web pages.
- The rendered web page content is then output for analysis by the detection engine.
- An embodiment of the present application provides a method for detecting a malicious web address, including: receiving a uniform resource locator (URL) reported by a user; acquiring a Hypertext Transfer Protocol (HTTP) request chain associated with the URL, the HTTP request chain being a time-series list containing the request-response interaction information of multiple HTTP requests made when accessing the URL; and analyzing the HTTP request chain to determine whether the URL is a malicious web address.
- An embodiment of the present application further provides a system for detecting a malicious web address, including a crawler subsystem and a detection subsystem.
- The crawler subsystem includes a crawler scheduling server and one or more dynamic crawler servers.
- The crawler scheduling server is configured to receive a URL reported by a user and to schedule a dynamic crawler server.
- The dynamic crawler server is configured to obtain, as scheduled by the crawler scheduling server, an HTTP request chain associated with the URL; the HTTP request chain is a time-series list containing the information of multiple HTTP request-response interactions made when accessing the URL.
- The detection subsystem includes an analysis unit configured to analyze the HTTP request chain to determine whether the URL is a malicious web address.
- By acquiring the HTTP request chain associated with a URL, the scheme provided by the embodiments of the present application obtains more comprehensive web page content associated with the URL, which enables accurate detection of malicious web addresses.
- The detection results are accurate, various new kinds of malicious web addresses can be detected, and the scheme is user-friendly: users only need to upload the URL, without providing any further information.
- FIG. 1 illustrates an exemplary system architecture in which embodiments of the present application may be applied
- FIG. 2 illustrates an exemplary flowchart of a method for detecting a malicious web address according to an embodiment of the present application
- FIG. 3 shows an exemplary screenshot of an HTTP request chain
- FIG. 4 illustrates an exemplary abstract representation of an HTTP request chain
- FIG. 5 illustrates an exemplary flowchart of a method for obtaining an HTTP request chain according to an embodiment of the present application
- FIG. 6 illustrates an exemplary flowchart of a method for analyzing an HTTP request chain according to an embodiment of the present application
- FIG. 7 illustrates an exemplary flowchart of a method for detecting a malicious web address according to another embodiment of the present application
- FIG. 8 shows a screenshot of a page at a malicious URL that spoofs the QQ login page
- FIG. 9 shows a screenshot of the official website page
- FIG. 10 shows the HTTP request chain information when accessing the official website
- FIG. 12 and FIG. 13 respectively show part of the HTTP request chain information when accessing the above malicious URL that counterfeits the QQ login page
- FIG. 14 illustrates an exemplary structural block diagram of a system for detecting a malicious web address according to an embodiment of the present application
- FIG. 15 is a block diagram showing the structure of a computer system suitable for implementing the server of the embodiments of the present application.
- FIG. 1 illustrates an exemplary system architecture 100 in which embodiments of the present application may be applied.
- System architecture 100 can include terminal devices 101, 102, network 103, and servers 104, 105, 106, and 107.
- The network 103 provides the medium for communication links between the terminal devices 101, 102 and the servers 104, 105, 106, 107.
- Network 103 may include various types of connections, such as wired or wireless communication links, fiber optic cables, and the like.
- The user 110 can use the terminal devices 101, 102 to interact with the servers 104, 105, 106, 107 over the network 103 in order to access various services, such as browsing web pages or downloading data.
- Various client applications, such as applications that can access a URL cloud service (including but not limited to browsers and security applications), may be installed on the terminal devices 101 and 102.
- The terminal devices 101, 102 can be various electronic devices including, but not limited to, personal computers, smartphones, smart televisions, tablets, personal digital assistants, and e-book readers.
- The servers 104, 105, 106, 107 may be servers that provide various services.
- A server can provide services in response to a user's service request. One server can provide one or more services, and the same service can also be provided by multiple servers.
- The servers involved may include, but are not limited to, a crawler scheduling server, a dynamic crawler server, a web server, a detection server, an image recognition server, and a semantic analysis server.
- The number of terminal devices, networks, and servers in FIG. 1 is merely illustrative. Depending on implementation needs, there can be any number of terminal devices, networks, and servers.
- The prior art generally crawls web content with a static crawler.
- The static crawler scheme treats each URL in isolation and ignores the complete HTTP request session. Context information is lost and the final rendering result of the page cannot be obtained, so the content obtained by the detection engine is inconsistent with what an ordinary user sees, which in turn leads to inaccurate detection results.
- Moreover, feature rules in web pages are difficult to find, and even when some are found, the false positive rate of detection is high.
- The dynamic crawler scheme, by contrast, cares only about the final result of the web page and ignores the intermediate process.
- In addition, the dynamic crawler scheme focuses on the content of the web page itself (the body part) and ignores external description information such as the header part, so web page description information is missed.
- Even when part of the header information is used, classification is performed with a manually set rule set (for example, if-else statements), which is time-consuming, laborious, and of low accuracy.
- The embodiments of the present application provide a malicious URL detection scheme based on an HTTP request chain.
- The HTTP request chain is a time-series list containing multiple HTTP request-response interactions made when accessing a URL.
- From it, rich information including context information can be obtained, so that whether the URL to be detected is a malicious web address can be checked effectively.
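- One way to picture an HTTP request chain is as a time-ordered list of request-response records. The sketch below is only an illustration under that assumption; the class `Interaction`, its field names, and `build_request_chain` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Interaction:
    """One HTTP request-response exchange (field names are illustrative)."""
    timestamp: float          # when the request was sent, in seconds
    method: str
    url: str
    status: int
    request_headers: Dict[str, str] = field(default_factory=dict)
    response_headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def build_request_chain(interactions: List[Interaction]) -> List[Interaction]:
    """Return the interactions ordered by time, i.e. the request chain."""
    return sorted(interactions, key=lambda i: i.timestamp)


chain = build_request_chain([
    Interaction(2.0, "GET", "http://example.test/b.js", 200),
    Interaction(1.0, "GET", "http://example.test/", 302),
])
```

- Keeping headers and status alongside the body is what preserves the context information that a body-only view would lose.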
- Referring to FIG. 2, an exemplary flow diagram of a method for detecting a malicious web address in accordance with one embodiment of the present application is shown.
- The method shown in FIG. 2 can be performed on the server side of FIG. 1.
- In step 210, the uniform resource locator (URL) reported by the user is received.
- In step 220, an HTTP request chain associated with the reported URL is obtained, the HTTP request chain being a time-series list containing the information of multiple HTTP request-response interactions made when accessing the URL.
- A client (for example, a browser) sends a block of data to the web server, which is the request message.
- The web server returns a block of data to the client, which is the response message.
- The HTTP request message and the HTTP response message contain various information related to the accessed web page, such as external description information, context information, and web page content. Therefore, information for detecting a malicious web address can be obtained from the HTTP request and response messages.
- Both the HTTP request message and the HTTP response message consist of three parts: the start line, the headers, and the entity body.
- In structure, the request message and the response message differ only in the start line.
- Different contents are specified for each part of the request message and the response message.
- The start line (or request line) of a request message contains a method and a request URL. The method describes what the server should do, and the request URL describes which resource the method should be executed on.
- The request line also contains the HTTP version, which tells the server which HTTP version the client is using.
- Methods in a request message include, for example, GET (get a document from the server), HEAD (get only the headers of a document from the server), POST (send data to the server for processing), PUT (store the body of the request on the server), TRACE (trace a message that may pass through proxy servers on its way to the server), OPTIONS (determine which methods can be executed on the server), and DELETE (delete a document from the server).
- The start line (or status line) of a response message also contains the HTTP version.
- The start line of a response message also contains a status code and a reason phrase.
- The status code is a three-digit number describing what happened during the request. The first digit of each status code describes the general category of the status ("success", "error", and so on).
- Commonly used status codes include, for example: 1xx, informational status codes, such as 100 and 101; 2xx, success status codes, such as 200 OK; 3xx, redirection status codes, such as 301 (permanent redirect) and 302 (temporary redirect); 4xx, client error status codes, such as 404 Not Found (the requested URL resource does not exist); and 5xx, server error status codes, such as 500 (internal server error).
- The reason phrase is a human-readable version of the numeric status code, that is, a short textual description of it. The reason phrase merely describes the status code; the client still uses the status code to determine whether the request succeeded.
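- The rule that the first digit gives the general category can be written in a few lines. This is a minimal sketch; the function name `status_class` and the label strings are illustrative choices.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Map a three-digit status code to its general category."""
    return STATUS_CLASSES.get(code // 100, "unknown")
```

- For example, `status_class(302)` falls in the redirection class and `status_class(404)` in the client error class.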
- Headers add additional information to request and response messages and take the form of name-value pairs. A message can have zero or more headers.
- The entity body is the payload of the HTTP message, that is, the content that HTTP is meant to transport.
- The entity body contains a block of arbitrary data and can carry many types of digital data, such as images, videos, HTML documents, software applications, credit card transactions, and email. Not all messages contain an entity body; a GET request, for example, usually carries none.
- The HTTP request and response messages have been briefly described above.
- Further details about HTTP messages can be found in the HTTP protocol specification by those skilled in the art as needed and are not repeated here.
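- The three-part message layout described above can be sketched as a small parser. This is a simplified illustration for HTTP/1.x text messages, not a complete implementation: the function names are hypothetical, and repeated header names simply overwrite each other here.

```python
def parse_http_message(raw: bytes):
    """Split a raw HTTP/1.x message into start line, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise them.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


def parse_status_line(start_line: str):
    """Split a response start line into version, status code, reason phrase."""
    parts = start_line.split(" ", 2)
    reason = parts[2] if len(parts) > 2 else ""
    return parts[0], int(parts[1]), reason


raw = b"HTTP/1.1 404 Not Found\r\nServer: nginx\r\nContent-Length: 5\r\n\r\nhello"
start, headers, body = parse_http_message(raw)
version, code, reason = parse_status_line(start)
```

- The parsed `Server` header and status code are the kind of external description information that later feature extraction can draw on.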
- FIG. 3 shows an exemplary screenshot of an HTTP request chain.
- When the web page www.gruogo.com is accessed, the client sends multiple HTTP requests to the server, which can be sorted by time to form an HTTP request chain. A total of 89 requests were sent while accessing this page; only the first few are shown in the screenshot of FIG. 3.
- The screenshot records information about each request-response interaction, including, for example, status, method, file or path name, domain name, type, size, and wait time.
- For each request, the HTTP request message and response message can also be viewed.
- FIG. 4 illustrates an exemplary abstract representation of an HTTP request chain.
- The URLs are arranged by parent-child relationship, and the details of each URL are listed in the box next to it, including referrer, time, status, size, and so on.
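- A parent-child arrangement of this kind can be sketched by grouping chain entries under the URL that led to them. Using the Referer as the parent link is an assumption made for illustration, as are the function names; the sketch also assumes the links contain no cycles.

```python
from collections import defaultdict


def build_tree(entries):
    """Group chain entries by parent URL.

    `entries` is a time-ordered list of (url, referer) pairs; using the
    Referer as the parent link is an assumption made for illustration.
    """
    children = defaultdict(list)
    for url, referer in entries:
        children[referer].append(url)
    return children


def render(children, parent=None, depth=0):
    """Return indented lines showing the parent-child arrangement."""
    lines = []
    for url in children.get(parent, []):
        lines.append("  " * depth + url)
        lines.extend(render(children, url, depth + 1))
    return lines


entries = [
    ("http://a.test/", None),
    ("http://a.test/app.js", "http://a.test/"),
    ("http://b.test/frame", "http://a.test/app.js"),
    ("http://a.test/logo.png", "http://a.test/"),
]
tree_lines = render(build_tree(entries))
```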
- In step 230, the HTTP request chain is analyzed to determine whether the URL is a malicious web address.
- Because the HTTP request chain contains a wealth of information, it is possible to determine whether the URL is a malicious web address based on the meaning conveyed by that information.
- FIG. 5 illustrates an exemplary flow diagram of a method for obtaining an HTTP request chain in accordance with one embodiment of the present application.
- The HTTP request chain is obtained using a distributed dynamic crawler subsystem, based on the user's geographic location. That is, the method shown in FIG. 5 can be performed by a distributed dynamic crawler subsystem on the server side.
- The distributed dynamic crawler subsystem includes a crawler scheduling server and one or more dynamic crawler servers distributed across different geographical locations.
- In step 510, the geographic location and network environment information of the user who reported the URL are determined.
- The crawler scheduling server can obtain the Internet Protocol (IP) address of the user who reported the URL. Based on the IP address, the crawler scheduling server can determine the user's geographic location (e.g., country-province-city-cell) and the network operator used (e.g., China Telecom or China Unicom). Further, the user's network environment information can be determined from the network operator information; the network environment information includes at least the network bandwidth.
- The crawler scheduling server then dispatches the reported URL to a dynamic crawler server whose geographic location and network environment are close to the user's. For example, the crawler scheduling server can schedule the reported URL onto the dynamic crawler server that is geographically closest to the user and whose bandwidth environment is the same as (or closest to) the user's, and the web page content is downloaded there.
- Some black-industry webmasters apply an anti-crawling policy on their websites: for example, by storing crawler servers' IP addresses, network exits, and so on in advance, they block crawling by those servers, for instance by redirecting the crawler server to another URL such as a legitimate one.
- As a result, the web page content captured by the crawler server is inconsistent with the content the user accesses, leading to inaccurate detection results.
- With the above scheduling, the user's real access environment is simulated on the server side as closely as possible, so that the web page content downloaded by the crawler is consistent with what the user sees.
- In addition, the distributed crawler subsystem includes a large number of dynamic crawler servers, and their locations and configurations can change constantly (for example, servers can be retired and new ones added), so the subsystem is not easily blocked by malicious websites.
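- The scheduling decision, choosing a dynamic crawler server whose location and network environment are close to the user's, can be sketched as a cost minimisation. This is only an illustration: the `Crawler` record, the haversine distance, and the cost weights are assumptions, not details from the application.

```python
import math
from dataclasses import dataclass


@dataclass
class Crawler:
    name: str
    lat: float
    lon: float
    operator: str
    bandwidth_mbps: float


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def pick_crawler(crawlers, user_lat, user_lon, user_operator, user_bandwidth):
    """Choose the crawler whose location and network environment best match.

    The cost weights below are arbitrary illustration values.
    """
    def cost(c):
        d = distance_km(c.lat, c.lon, user_lat, user_lon)
        operator_penalty = 0.0 if c.operator == user_operator else 500.0
        bandwidth_gap = abs(c.bandwidth_mbps - user_bandwidth)
        return d + operator_penalty + 10.0 * bandwidth_gap
    return min(crawlers, key=cost)


crawlers = [
    Crawler("beijing", 39.9, 116.4, "telecom", 100.0),
    Crawler("shanghai", 31.2, 121.5, "unicom", 20.0),
    Crawler("guangzhou", 23.1, 113.3, "telecom", 20.0),
]
chosen = pick_crawler(crawlers, 31.0, 121.0, "unicom", 20.0)
```

- In practice the user's coordinates and operator would come from an IP lookup, which this sketch takes as given inputs.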
- In step 530, the web page content associated with the URL is downloaded on the scheduled dynamic crawler server to obtain the HTTP request chain.
- In addition to fetching web page content in the normal way, the dynamic crawler server also captures the content reached through jumps and saves the intermediate results.
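- Capturing jumped content while saving intermediate results can be sketched with a redirect-following loop. The in-memory `FAKE_WEB` table and the `fetch` callback are hypothetical stand-ins for real network access, and only 301/302 jumps are handled here.

```python
# Hypothetical in-memory "web": URL -> (status, headers, body).
FAKE_WEB = {
    "http://a.test/": (302, {"Location": "http://b.test/step"}, ""),
    "http://b.test/step": (302, {"Location": "http://c.test/final"}, ""),
    "http://c.test/final": (200, {}, "<html>final page</html>"),
}


def crawl_with_jumps(start_url, fetch, max_hops=10):
    """Follow redirects from start_url, keeping every intermediate result."""
    chain = []
    url = start_url
    for _ in range(max_hops):
        status, headers, body = fetch(url)
        chain.append({"url": url, "status": status, "body": body})
        if status in (301, 302) and "Location" in headers:
            url = headers["Location"]
        else:
            break
    return chain


chain = crawl_with_jumps("http://a.test/", lambda u: FAKE_WEB[u])
```

- The hop limit guards against redirect loops; every hop stays in the returned chain rather than only the final page.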
- An iframe element in an HTML document creates an inline frame that contains another document.
- The HTML DOM (Document Object Model) tree can be rendered using a browser layout engine to capture web content that is reached through iframe tags in the HTML document.
- The layout engine can be, for example and without limitation, WebKit or Gecko.
- For example, the dynamic crawler can use the open-source WebKit engine to render the HTML DOM tree, allowing the iframe to be loaded and thereby capturing the web page content reached through the iframe.
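- As a much simpler stand-in for rendering, the iframe targets written directly in the markup can be listed with a static parser. This sketch does not render anything and would miss iframes inserted by scripts, which is precisely why the text relies on a layout engine; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html: str):
    """Return the iframe src values found in the given HTML text."""
    finder = IframeFinder()
    finder.feed(html)
    return finder.sources


found = iframe_sources(
    '<html><body><iframe src="http://x.test/inner"></iframe><p>hi</p></body></html>'
)
```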
- JavaScript is the most popular scripting language on the Internet; it can be inserted into an HTML page and is then executed by the browser. JavaScript is used by millions of web pages to improve design, validate forms, detect browsers, create cookies, and more, and it can change the content of an HTML page. For web content that jumps by means of JavaScript, the JavaScript code can be executed by an open-source JavaScript engine (such as Google's V8) to capture the content reached through the JavaScript code.
- Flash is a multimedia format.
- SWF files, which the Flash player plays, can be created with Adobe Flash, Adobe Flex, or other software or third-party tools. The format combines bitmap and vector graphics, is programmed in the ActionScript scripting language, and supports two-way video and audio streaming. Flash is suited to developing rich Internet applications and streaming video and audio. Flash Player uses vector graphics to minimize file size, producing files that save network bandwidth and download time. Flash is therefore a popular format for small games, animations, advertisements, and graphical user interfaces embedded in web pages.
- By compiling in the Flash player plug-in, the dynamic crawler gains the ability to execute Flash; because it also maintains session state, it can execute Flash to capture web content reached through Flash-based jumps.
- Referring to FIG. 6, an exemplary flow diagram of a method for analyzing an HTTP request chain in accordance with one embodiment of the present application is shown.
- The method shown in FIG. 6 can be performed by the server-side detection subsystem.
- In step 610, features are extracted from the acquired HTTP request chain.
- To save costs, black-industry webmasters generally rent virtual hosts and do not use CDN (Content Distribution Network) technology.
- IIS: Internet Information Services, a web service component
- Programs run by an IIS web server are generally written in ASP (Active Server Pages), because ASP is a scripting language with a low barrier to entry that is easy to use.
- Many virtual hosting providers offer such an integrated environment directly. A black-industry webmaster only has to upload the malicious code to start defrauding users, which is very convenient.
- Some black-industry webmasters also use web servers such as NetBox and Kangle. These servers are upgraded versions of IIS, similar in principle but more powerful. Large companies generally do not use them.
- Black-industry webmasters generally rent virtual hosts located overseas or in Hong Kong.
- The IP address is therefore either overseas or in Hong Kong. Such hosts are chosen because they do not need to be filed with the Ministry of Industry and Information Technology, a process that involves many review procedures.
- In terms of web page authoring, black-industry webmasters generally make their pages very complicated, with many dependent jumps, so that web crawlers have difficulty obtaining the final page. Moreover, requests for downstream sub-HTML pages are often initiated internally by upstream JavaScript code. Black-industry webmasters also tend to encrypt web page content.
- By contrast, the backend interfaces of regular websites are generally written in languages such as C, C++, or Java, because programs written in these languages perform relatively well.
- PHP (Hypertext Preprocessor) may be used where performance requirements are not very high, but ASP is almost never used: using ASP requires buying Windows Server and the accompanying IIS facilities, which is restrictive, so large and medium-sized companies generally avoid it.
- Regular websites generally use the Linux operating system, because most Linux distributions are open source and free.
- Their server IP addresses are mostly domestic; they generally use nginx or Apache as the web server; access latency is low; and resource loading rarely runs into HTTP 404 (resource not found) errors.
- Regular websites generally do not perform multiple jumps and do not encrypt web content.
- Regular websites generally have filing information with the Ministry of Industry and Information Technology.
- Feature extraction can be performed along at least one of the following dimensions: upstream and downstream information, server dimension, web programming language dimension, time dimension, and web page description information.
- The upstream and downstream information may include at least one of the following: the number of 302 jumps (e.g., whether it exceeds a predetermined threshold, such as more than 5), the percentage of 404 pages (e.g., whether it exceeds a predetermined ratio, such as more than 50%), whether a sub-URL contains an advertising network link, whether a sub-URL contains a malicious sub-link, and whether a sub-URL contains a small-website statistics tool.
- The server dimension may include at least one of the following: whether the IP address is overseas, whether the server is Windows IIS, whether CDN technology is used, whether the server is a Kangle server, a NetBox server, an nginx server, or an Apache server, and whether the content is multimedia video.
- The web programming language dimension may include at least one of the following: whether the page is written in ASP, and whether it is written in PHP.
- The time dimension may include at least one of the following: whether the time is a hot-spot time (for example, May Day, National Day, Double Eleven, or the Spring Festival), and whether it is a weekend.
- Hot-spot times and weekends see more web page views, so malicious websites generally choose these times to increase their chances of being visited.
- The web page description information may include at least one of the following: web page size, loading time of a single URL, whether the website is filed, whether the content is encrypted, and whether the domain is a free second-level domain name.
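- A few of the listed features can be computed directly from a request chain. The sketch below covers only a subset and uses the example thresholds from the text (more than 5 jumps, more than 50%); the dictionary keys `status` and `server` and the feature names are hypothetical.

```python
from datetime import datetime


def extract_features(chain, first_request_time: datetime):
    """Turn a request chain into a flat feature dictionary.

    `chain` is a list of dicts with the hypothetical keys "status" and
    "server" (the value of the Server response header, possibly empty).
    """
    total = len(chain)
    n_302 = sum(1 for e in chain if e["status"] == 302)
    n_404 = sum(1 for e in chain if e["status"] == 404)
    servers = " ".join(e.get("server", "").lower() for e in chain)
    return {
        "many_302_jumps": n_302 > 5,                 # threshold from the text
        "high_404_ratio": total > 0 and n_404 / total > 0.5,
        "is_iis": "iis" in servers,
        "is_nginx": "nginx" in servers,
        "is_apache": "apache" in servers,
        "is_weekend": first_request_time.weekday() >= 5,
    }


example_chain = [{"status": 302, "server": "Microsoft-IIS/7.5"}] * 6 + [
    {"status": 200, "server": "Microsoft-IIS/7.5"}
]
features = extract_features(example_chain, datetime(2015, 5, 2))
```

- Boolean features like these can be converted to 0/1 values before being passed to a classifier.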
- Next, the URL is determined to be a normal URL or a suspicious malicious URL based on the extracted features.
- Machine learning is a method of automatically analyzing data to obtain rules and using those rules to make predictions about unknown data.
- Training a classification model refers to the process of adjusting the model's parameters, using a set of samples of known categories, until the required performance is achieved.
- Modeling and machine learning can be performed with a variety of algorithms, such as decision trees, linear discriminant analysis, nearest neighbor methods, and support vector machines.
- The features extracted in step 610 are modeled using a GBDT (Gradient Boosted Decision Tree) to determine whether a URL is a normal URL (also referred to as gray) or a suspicious malicious URL (also referred to as suspicious black).
- The specific modeling process is known to those skilled in the art, and a detailed description is omitted here.
- In this way, the URL reported by the user is classified, by means of classification modeling, based on the features extracted from the HTTP request chain. Compared with the prior-art approach of manually setting an if-else rule set, the embodiments of the present application can greatly improve detection efficiency and achieve high accuracy.
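- The idea behind gradient boosting with decision trees can be shown with a deliberately tiny version: depth-1 trees (stumps) fitted to residuals under squared loss. This is not the model used in the application, whose details are not given; the feature layout, training data, and hyperparameters below are made up for illustration.

```python
def fit_stump(X, residuals):
    """Find the one-feature threshold split minimising squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None


def fit_gbdt(X, y, n_trees=20, lr=0.5):
    """Boost depth-1 regression trees on squared loss (labels 0 or 1)."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, lm, rm = stump
        stumps.append(stump)
        pred = [p + lr * (lm if row[j] <= t else rm) for p, row in zip(pred, X)]
    return base, lr, stumps


def predict(model, row):
    """Label a feature row as "suspicious" or "normal"."""
    base, lr, stumps = model
    score = base + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)
    return "suspicious" if score >= 0.5 else "normal"


# Made-up rows: [number of 302 jumps, 404 ratio, is IIS]; label 1 = suspicious.
X = [[7, 0.6, 1], [6, 0.1, 1], [8, 0.7, 0], [0, 0.0, 0], [1, 0.1, 0], [0, 0.05, 1]]
y = [1, 1, 1, 0, 0, 0]
model = fit_gbdt(X, y)
```

- Production GBDT libraries use deeper trees, other loss functions, and regularisation; the sketch only shows the residual-fitting loop that replaces a hand-written if-else rule set.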
- FIG. 7 illustrates an exemplary flow chart of a method for detecting a malicious web address in accordance with another embodiment of the present application.
- Steps 710-730 are the same as steps 210-230 in FIG. 2, and details are not described herein again.
- If the result in step 730 indicates that the URL is a normal web address, the result can be returned to the client (not shown).
- If the result indicates that the URL is a suspicious malicious web address, then in step 740 the web page content associated with the URL is rendered into a picture, and the web page text content is extracted using optical character recognition (OCR) technology.
- Web crawlers have the ability to render web content into images. By using the OCR technology to identify and extract the image content, the webpage text content can be obtained.
- OCR techniques are well known to those skilled in the art, and the present application can use any OCR technology now known or later developed to identify web page content, and the application is not limited in this respect.
- In step 750, topic determination is performed on the extracted webpage text content using a latent semantic model.
- LDA (Latent Dirichlet Allocation)
- LDA is a topic model that gives the topics of each document in a document set as a probability distribution. It is also an unsupervised learning algorithm: it does not require a manually labeled training set, only the document set and the specified number of topics. Another advantage of LDA is that, for each topic, it can find a set of words that describe it. LDA is currently applied in text mining, including text topic recognition, text categorization, and text similarity calculation. This application may use any topic determination technique that is now known or developed in the future, and the application is not limited in this respect.
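The two inputs named above (a document set and a topic count) and the two outputs (a per-document topic distribution and descriptive words per topic) can be illustrated with a compact collapsed Gibbs sampler. This is a teaching sketch under assumed hyperparameters (`alpha`, `beta`, `n_iter` are arbitrary choices), not the implementation used in the application.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]        # topic counts per document
    topic_word = [[0] * V for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics
    assign = []
    for d, doc in enumerate(docs):                    # random initial assignment
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][wid[w]] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z, v = assign[d][i], wid[w]
                # remove the token, then resample its topic from the conditional
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][v] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    # smoothed per-document topic distribution
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # up to three most frequent words per topic
    top_words = [
        [vocab[v] for v in sorted(range(V), key=lambda v: -topic_word[k][v])[:3]]
        for k in range(n_topics)
    ]
    return theta, top_words
```

Each row of `theta` is the topic distribution of one document; deciding which topics count as malicious would be a separate, application-specific step.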
- In step 760, false-positive removal is performed on the topic determination result.
- the false-positive removal may be performed by at least one of the following: determining, according to a whitelist, whether a URL determined to be a malicious web address is a false positive; querying access information related to the URL to determine whether it is a false positive; querying the Internet content provider (ICP) filing information of the URL to determine whether it is a false positive; and querying qualification data related to the URL to determine whether it is a false positive.
- a list of URLs that have been explicitly confirmed not to be malicious can be saved in the whitelist. By checking against the whitelist, it is therefore possible to determine whether a URL that the topic determination marked as malicious is a false positive.
- the access information related to the URL may include, but is not limited to, the following: the number of external links of the site, the number of sub-URLs under the site's domain name, the site's most recent search popularity index, and the like. Using this access information, false positives for some popular sites can be avoided. A threshold can be set for each kind of access information, and when the threshold is exceeded the result can be considered a false positive. The thresholds can be set empirically.
- the Internet content provider (ICP) filing information can indicate whether the site is filed by an enterprise or by an institution.
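The false-positive checks described above can be sketched as a single predicate. The parameter names, the `filer_type` key, and the rule that an enterprise or institution filing counts as a false positive are all assumptions made for illustration; the text only says that this information is consulted.

```python
from urllib.parse import urlsplit


def is_false_positive(url, whitelist, access_info, icp_record, thresholds):
    """Return True if a URL flagged as malicious looks like a false positive.

    whitelist   : set of host names confirmed not to be malicious
    access_info : dict such as {"external_links": 12000, "sub_urls": 50000}
    icp_record  : dict with a hypothetical "filer_type" key, or None
    thresholds  : dict mapping access_info keys to empirically chosen limits
    """
    host = urlsplit(url).hostname or ""
    if host in whitelist:
        return True
    # A popular site exceeds at least one access-information threshold.
    for key, limit in thresholds.items():
        if access_info.get(key, 0) > limit:
            return True
    # Assumed rule: a filing by an enterprise or institution clears the URL.
    if icp_record is not None and icp_record.get("filer_type") in {"enterprise", "institution"}:
        return True
    return False
```

Checking qualification data would follow the same pattern with one more lookup.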
- in this embodiment, suspicious URLs are further examined using OCR and topic determination techniques, which improves detection accuracy.
- a malicious web address detecting method according to an embodiment of the present application is described below in conjunction with a specific example.
- Figure 8 shows a screenshot of a malicious URL for a fake QQ login.
- Figure 9 is a screenshot of the official website page.
- Figure 10 shows the HTTP request chain information when accessing the official website.
- The code of QQ's official website is very clear and loads quickly (about 4 seconds). Its IP address is 140.207.69.100, which belongs to Shanghai Unicom, and it uses Apache as the web server.
- the source code of the official website is not encrypted.
- the detection can be performed as follows.
- the distributed dynamic crawler subsystem tracks the jump process of the web page to obtain intermediate results and final results.
- the obtained result may include, for example, the following code:
- the JavaScript content in the third page contains string splicing to achieve iframe jump. Therefore, the dynamic crawler according to the embodiment of the present application finds an iframe tag in the process of parsing the webpage, and continues to load the content of the iframe.
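A minimal sketch of the iframe discovery step follows. It assumes the JavaScript has already been executed by a browser engine, so that `rendered_html` is the serialized DOM that contains the spliced iframe tag; plain HTML parsing alone would not see an iframe that only exists after string splicing. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect absolute iframe targets from already rendered HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(urljoin(self.base_url, src))


def find_iframe_targets(rendered_html, base_url):
    """Return the URLs the crawler should load next, in document order."""
    collector = IframeCollector(base_url)
    collector.feed(rendered_html)
    return collector.sources
```

Each returned URL would be loaded in turn and its request-response exchange appended to the HTTP request chain as an intermediate result.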
- the feature extraction tool is used to extract the predefined features, which are then input into the classification model (for example, a GBDT model) for classification.
- In the third step, if the output of the GBDT model is a suspicious malicious URL, the crawler first renders the webpage into a picture and then uses OCR technology to extract the text content of the webpage.
- the semantic model is used for topic determination.
- false-positive removal is performed on the topic determination result.
- the final classification result is output to determine that the suspicious URL is a malicious web address.
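The staged flow of this example can be summarized as a small orchestration function. The four callables are placeholders for the components described in the text (classification model, rendering plus OCR, topic determination, false-positive removal); their names and return conventions are assumptions for this sketch, as is treating a non-malicious topic as a normal result.

```python
def detect(url, chain, classify, render_and_ocr, judge_topic, is_false_positive):
    """Run the staged detection flow and return "normal" or "malicious".

    classify(chain)        -> "normal" or "suspicious"
    render_and_ocr(url)    -> text extracted from the rendered page
    judge_topic(text)      -> True if the topic looks malicious
    is_false_positive(url) -> True if the verdict should be withdrawn
    """
    # Stage 1: cheap classification on features of the HTTP request chain.
    if classify(chain) == "normal":
        return "normal"
    # Stage 2: only suspicious URLs pay the cost of rendering and OCR.
    text = render_and_ocr(url)
    # Stage 3: topic determination on the extracted text.
    if not judge_topic(text):
        return "normal"
    # Stage 4: false-positive removal before the final verdict.
    if is_false_positive(url):
        return "normal"
    return "malicious"
```

Ordering the stages this way means the expensive rendering and OCR work runs only for URLs the classifier already considers suspicious.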
- FIG. 14 shows an exemplary block diagram of a system for detecting a malicious web address in accordance with one embodiment of the present application.
- a system 1400 for detecting a malicious web address can include a crawler subsystem 1410 and a detection subsystem 1420.
- the crawler subsystem 1410 includes a crawler scheduling server 1411 and one or more dynamic crawler servers 1412-1414.
- the crawler scheduling server 1411 is configured to receive the uniform resource locator URL reported by the user, and to schedule the dynamic crawler servers 1412-1414.
- the dynamic crawler servers 1412-1414 are configured to acquire an HTTP request chain associated with the URL reported by the user according to the schedule of the crawler scheduling server 1411.
- the HTTP request chain is a time-series list of multiple HTTP request-response interactions that access the URL.
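One way to picture such a chain is as a list of exchange records kept in timestamp order. The class and field names below are hypothetical; the application only specifies that the chain is time-ordered and holds the request-response interactions made while accessing the URL.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HttpExchange:
    """One HTTP request-response interaction (field names are illustrative)."""
    timestamp: float
    url: str
    status: int
    redirect_to: Optional[str] = None


@dataclass
class HttpRequestChain:
    """Time-ordered record of every exchange made while visiting a URL."""
    reported_url: str
    exchanges: List[HttpExchange] = field(default_factory=list)

    def record(self, exchange: HttpExchange) -> None:
        # Keep the list ordered by time even if results arrive out of order.
        self.exchanges.append(exchange)
        self.exchanges.sort(key=lambda e: e.timestamp)

    def final_url(self) -> str:
        # The last exchange is where the chain of jumps ended up.
        return self.exchanges[-1].url if self.exchanges else self.reported_url
```

Keeping the intermediate exchanges, not just the final page, is what lets later stages count redirects or inspect each hop.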
- the crawler scheduling server 1411 may be configured to schedule a dynamic crawler server by: determining the geographic location and network environment information of the user; and scheduling the URL reported by the user to a dynamic crawler server whose geographic location and network environment information are close to those of the user.
- the crawler scheduling server 1411 may be configured to determine the geographic location and network environment information of the user by: determining the user's geographic location and the network operator information used, based on the Internet Protocol (IP) address from which the user reported the URL; and determining the user's network environment information based on the network operator information, where the network environment information includes at least network bandwidth.
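The scheduling idea can be sketched as a lookup followed by a nearest-match selection. Everything concrete here is assumed for illustration: the prefix table, the bandwidth figures, the server names, and the ordering of the closeness criteria (region first, then operator, then bandwidth difference).

```python
from dataclasses import dataclass


@dataclass
class CrawlerServer:
    name: str
    region: str
    operator: str
    bandwidth_mbps: float


def lookup_user_network(ip, ip_table):
    """Map a reporting IP to (region, operator, bandwidth_mbps).

    `ip_table` is a hypothetical prefix table; a real deployment would use
    an IP geolocation database instead of string prefixes.
    """
    for prefix, info in ip_table.items():
        if ip.startswith(prefix):
            return info
    return None


def pick_crawler(user_info, servers):
    """Choose the server whose location and network are closest to the user's."""
    region, operator, bandwidth = user_info

    def distance(server):
        # Tuples compare left to right, and False sorts before True.
        return (
            server.region != region,
            server.operator != operator,
            abs(server.bandwidth_mbps - bandwidth),
        )

    return min(servers, key=distance)
```

Crawling from a server that resembles the user's environment makes it more likely that the crawler is served the same page the user saw.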
- obtaining an HTTP request chain by the dynamic crawler servers 1412-1414 may include downloading the webpage content associated with the URL to obtain the HTTP request chain.
- the dynamic crawler servers 1412-1414 can be configured to crawl redirected webpage content and save the intermediate results by at least one of the following: rendering the Hypertext Markup Language Document Object Model (HTML DOM) tree using a browser layout engine, to capture webpage content that is redirected through an inline frame (iframe) tag in the HTML document; executing JavaScript code through a JavaScript engine, to capture webpage content that is redirected via the JavaScript code; and executing Flash through a Flash Player plug-in, to capture webpage content that is redirected via Flash.
- the detection subsystem 1420 includes an analysis unit 1421 configured to analyze the HTTP request chain obtained by the crawler subsystem 1410 to determine if the URL is a malicious web address.
- the analysis unit 1421 can include: a feature extraction sub-unit 1422 configured to extract features of at least one of the following dimensions from the HTTP request chain: upstream and downstream information, server dimension, web programming language dimension, time dimension, and webpage self-description information; and a classification sub-unit 1423 configured to use the established, machine-learned classification model to determine, based on the extracted features, whether the URL is a normal web address or a suspicious malicious web address.
- the detection subsystem 1420 may further include: an image recognition unit 1424 configured to, for a URL that the classification sub-unit 1423 has determined to be a suspicious malicious web address, extract webpage text content, using optical character recognition (OCR) technology, from the webpage content associated with the URL that has been rendered into a picture; and a semantic parsing unit 1425 configured to perform topic determination on the webpage text content using a latent semantic model to determine whether the URL is a malicious web address.
- the detection subsystem 1420 may further include: a false-positive removal unit 1426 configured to perform false-positive removal on the result of the topic determination.
- the subsystems and units of system 1400 correspond to the various steps in the methods described with reference to Figures 2-7.
- the operations and features described above for the method are equally applicable to the system 1400 and the units contained therein, and are not repeated here.
- FIG. 15 shows a block diagram of a computer system 1500 suitable for implementing the server of the embodiments of the present application.
- computer system 1500 includes a central processing unit (CPU) 1501 that can perform various appropriate actions and processes according to a program stored in read only memory (ROM) 1502 or a program loaded from a storage portion 1508 into random access memory (RAM) 1503. The RAM 1503 also stores various programs and data required for the operation of the system 1500.
- the CPU 1501, the ROM 1502, and the RAM 1503 are connected to each other through a bus 1504.
- An input/output (I/O) interface 1505 is also coupled to bus 1504.
- the following components are connected to the I/O interface 1505: an input portion 1506 including a keyboard, a mouse, and the like; an output portion 1507 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, and a speaker; a storage portion 1508 including a hard disk or the like; and a communication portion 1509 including a network interface card such as a LAN card, a modem, or the like.
- the communication portion 1509 performs communication processing via a network such as the Internet.
- A drive 1510 is also coupled to the I/O interface 1505 as needed.
- a removable medium 1511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like is mounted on the drive 1510 as needed so that a computer program read therefrom is installed into the storage portion 1508 as needed.
- an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine readable medium, the computer program comprising program code for performing the methods of Figs. 2-7.
- the computer program can be downloaded and installed from the network via the communication portion 1509, and/or installed from the removable medium 1511.
- each block of the flowcharts or block diagrams can represent a module, a program segment, or a portion of code that includes one or more executable instructions for implementing the specified logical functions.
- in some alternative implementations, the functions noted in the blocks can also occur in a different order than that illustrated in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
- the units or modules described in the embodiments of the present application can be implemented in software or in hardware.
- the described unit or module may also be provided in the processor, for example, as a processor including a crawler unit and a detection unit.
- the names of these units or modules do not in any way constitute a limitation on the unit or module itself.
- the present application further provides a computer readable storage medium, which may be the computer readable storage medium included in the apparatus described in the foregoing embodiment, or may be a computer readable storage medium that exists separately and is not assembled into the device.
- the computer readable storage medium stores one or more programs that are executed by one or more processors to perform the malicious web address detection methods described in this application.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Virology (AREA)
- Data Mining & Analysis (AREA)
- Information Transfer Between Computers (AREA)
- Computer And Data Communications (AREA)
Claims (22)
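- A method for detecting a malicious web address, comprising: receiving a uniform resource locator (URL) reported by a user; acquiring a Hypertext Transfer Protocol (HTTP) request chain associated with the URL, the HTTP request chain being a time-ordered linked list containing information on multiple HTTP request-response interactions for accessing the URL; and analyzing the HTTP request chain to determine whether the URL is a malicious web address.
- The method according to claim 1, wherein acquiring the HTTP request chain comprises: acquiring the HTTP request chain using a distributed dynamic crawler subsystem based on the user's geographic location.
- The method according to claim 2, wherein acquiring the HTTP request chain using the distributed dynamic crawler subsystem based on the user's geographic location comprises: determining the geographic location and network environment information of the user; scheduling the URL to a dynamic crawler server whose geographic location and network environment information are close to those of the user; and downloading, at the dynamic crawler server, webpage content associated with the URL to obtain the HTTP request chain.
- The method according to claim 3, wherein determining the geographic location and network environment information of the user comprises: determining the user's geographic location and the network operator information used, based on the Internet Protocol (IP) address from which the user reported the URL; and determining the user's network environment information based on the network operator information, wherein the network environment information includes at least network bandwidth.
- The method according to claim 3 or 4, wherein downloading the webpage content associated with the URL to obtain the HTTP request chain comprises: crawling redirected webpage content and saving intermediate results.
- The method according to claim 5, wherein crawling the redirected webpage content comprises at least one of the following: rendering the Hypertext Markup Language Document Object Model (HTML DOM) tree using a browser layout engine, to crawl webpage content redirected through an inline frame (iframe) tag in the HTML document; executing JavaScript code through a JavaScript engine, to crawl webpage content redirected through the JavaScript code; and executing Flash through a Flash player plug-in, to crawl webpage content redirected through Flash.
- The method according to any one of claims 1-6, wherein analyzing the HTTP request chain to determine whether the URL is a malicious web address comprises: extracting features of at least one of the following dimensions from the HTTP request chain: upstream and downstream information, server dimension, web programming language dimension, time dimension, and webpage self-description information; and using an established, machine-learned classification model to determine, based on the extracted features, whether the URL is a normal web address or a suspicious malicious web address.
- The method according to claim 7, wherein the upstream and downstream information includes at least one of the following: the number of 302 redirects, the proportion of 404 pages, whether the sub-URLs contain advertising alliance links, whether the sub-URLs contain malicious sub-links, and whether the sub-URLs contain small website statistics tools; the server dimension includes at least one of the following: whether the Internet Protocol (IP) address is overseas, whether the server is Windows IIS, whether content delivery network (CDN) technology is used, whether the server is a kangle server, whether it is a netbox server, whether it is an nginx server, whether it is an apache server, and whether it is multimedia video; the web programming language dimension includes at least one of the following: whether the page is written in the Active Server Pages (ASP) language, and whether it is written in the Hypertext Preprocessor (PHP) language; the time dimension includes at least one of the following: whether it is a hot-spot time, and whether it is a weekend; and the webpage self-description information includes at least one of the following: webpage size, loading time of a single URL, whether the website has a filing record, whether it is encrypted, and whether it is a free second-level domain name.
- The method according to claim 7 or 8, wherein the method further comprises: in response to determining that the URL is a suspicious malicious web address, rendering the webpage content associated with the URL into a picture and extracting webpage text content using optical character recognition (OCR) technology; performing topic determination on the webpage text content using a latent semantic model; and determining whether the URL is a malicious web address based on the topic determination result.
- The method according to claim 9, wherein the method further comprises: performing false-positive removal on the topic determination result.
- The method according to claim 10, wherein the false-positive removal comprises at least one of the following: determining, according to a whitelist, whether a URL determined to be a malicious web address is a false positive; querying access information related to the URL to determine whether it is a false positive; querying the Internet content provider (ICP) filing information of the URL to determine whether it is a false positive; and querying qualification data related to the URL to determine whether it is a false positive.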
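- A system for detecting a malicious web address, comprising a crawler subsystem and a detection subsystem, wherein the crawler subsystem includes a crawler scheduling server and one or more dynamic crawler servers; the crawler scheduling server is configured to receive a uniform resource locator (URL) reported by a user and to schedule the dynamic crawler servers; the dynamic crawler server is configured to acquire, according to the scheduling of the crawler scheduling server, a Hypertext Transfer Protocol (HTTP) request chain associated with the URL, the HTTP request chain being a time-ordered linked list containing information on multiple HTTP request-response interactions for accessing the URL; and the detection subsystem includes an analysis unit configured to analyze the HTTP request chain to determine whether the URL is a malicious web address.
- The system according to claim 12, wherein the crawler scheduling server is configured to schedule the dynamic crawler server by: determining the geographic location and network environment information of the user; and scheduling the URL to a dynamic crawler server whose geographic location and network environment information are close to those of the user.
- The system according to claim 13, wherein the crawler scheduling server is configured to determine the geographic location and network environment information of the user by: determining the user's geographic location and the network operator information used, based on the Internet Protocol (IP) address from which the user reported the URL; and determining the user's network environment information based on the network operator information, wherein the network environment information includes at least network bandwidth.
- The system according to any one of claims 12-14, wherein acquiring the HTTP request chain by the dynamic crawler server comprises: the dynamic crawler server downloading webpage content associated with the URL to obtain the HTTP request chain.
- The system according to claim 15, wherein the dynamic crawler server is configured to crawl redirected webpage content and save intermediate results by at least one of the following: rendering the Hypertext Markup Language Document Object Model (HTML DOM) tree using a browser layout engine, to crawl webpage content redirected through an inline frame (iframe) tag in the HTML document; executing JavaScript code through a JavaScript engine, to crawl webpage content redirected through the JavaScript code; and executing Flash through a Flash player plug-in, to crawl webpage content redirected through Flash.
- The system according to any one of claims 12-16, wherein the analysis unit comprises: a feature extraction sub-unit configured to extract features of at least one of the following dimensions from the HTTP request chain: upstream and downstream information, server dimension, web programming language dimension, time dimension, and webpage self-description information; and a classification sub-unit configured to use an established, machine-learned classification model to determine, based on the extracted features, whether the URL is a normal web address or a suspicious malicious web address.
- The system according to claim 17, wherein the upstream and downstream information includes at least one of the following: the number of 302 redirects, the proportion of 404 pages, whether the sub-URLs contain advertising alliance links, whether the sub-URLs contain malicious sub-links, and whether the sub-URLs contain small website statistics tools; the server dimension includes at least one of the following: whether the Internet Protocol (IP) address is overseas, whether the server is Windows IIS, whether content delivery network (CDN) technology is used, whether the server is a kangle server, whether it is a netbox server, whether it is an nginx server, whether it is an apache server, and whether it is multimedia video; the web programming language dimension includes at least one of the following: whether the page is written in the Active Server Pages (ASP) language, and whether it is written in the Hypertext Preprocessor (PHP) language; the time dimension includes at least one of the following: whether it is a hot-spot time, and whether it is a weekend; and the webpage self-description information includes at least one of the following: webpage size, loading time of a single URL, whether the website has a filing record, whether it is encrypted, and whether it is a free second-level domain name.
- The system according to claim 17 or 18, wherein the detection subsystem further comprises: an image recognition unit configured to, for a URL that the classification sub-unit has determined to be a suspicious malicious web address, extract webpage text content, using optical character recognition (OCR) technology, from the webpage content associated with the URL that has been rendered into a picture; and a semantic parsing unit configured to perform topic determination on the webpage text content using a latent semantic model to determine whether the URL is a malicious web address.
- The system according to claim 19, wherein the detection subsystem further comprises: a false-positive removal unit configured to perform false-positive removal on the result of the topic determination.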
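- A device, characterized by comprising: one or more processors; a memory; and one or more programs stored in the memory that, when executed by the one or more processors: receive a uniform resource locator (URL) reported by a user; acquire a Hypertext Transfer Protocol (HTTP) request chain associated with the URL, the HTTP request chain being a time-ordered linked list containing information on multiple HTTP request-response interactions for accessing the URL; and analyze the HTTP request chain to determine whether the URL is a malicious web address.
- A non-volatile computer storage medium storing one or more programs that, when executed by a device, cause the device to: receive a uniform resource locator (URL) reported by a user; acquire a Hypertext Transfer Protocol (HTTP) request chain associated with the URL, the HTTP request chain being a time-ordered linked list containing information on multiple HTTP request-response interactions for accessing the URL; and analyze the HTTP request chain to determine whether the URL is a malicious web address.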
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/504,276 US10567407B2 (en) | 2015-04-30 | 2015-09-25 | Method and system for detecting malicious web addresses |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510219801.1A CN104766014B (zh) | 2015-04-30 | 2015-04-30 | Method and system for detecting malicious web addresses |
CN201510219801.1 | 2015-04-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016173200A1 true WO2016173200A1 (zh) | 2016-11-03 |
Family
ID=53647836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/090648 WO2016173200A1 (zh) | Method and system for detecting malicious web addresses | 2015-04-30 | 2015-09-25 |
Country Status (3)
Country | Link |
---|---|
US (1) | US10567407B2 (zh) |
CN (1) | CN104766014B (zh) |
WO (1) | WO2016173200A1 (zh) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
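CN103634317A (zh) * | 2013-11-28 | 2014-03-12 | 北京奇虎科技有限公司 | Method and system for security authentication of malicious URL information based on cloud security |
CN103701779A (zh) * | 2013-12-13 | 2014-04-02 | 北京神州绿盟信息安全科技股份有限公司 | Method and apparatus for accessing a website a second time, and firewall device |
CN103870329A (zh) * | 2014-03-03 | 2014-06-18 | 同济大学 | Distributed crawler task scheduling method based on a weighted round-robin algorithm |
CN103902889A (zh) * | 2012-12-26 | 2014-07-02 | 腾讯科技(深圳)有限公司 | Cloud detection method and server for malicious messages |
CN104021343A (zh) * | 2014-05-06 | 2014-09-03 | 南京大学 | Method and system for monitoring malicious programs based on heap access patterns |
CN104766014A (zh) * | 2015-04-30 | 2015-07-08 | 安一恒通(北京)科技有限公司 | Method and system for detecting malicious web addresses |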
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6553310B1 (en) * | 2000-11-14 | 2003-04-22 | Hewlett-Packard Company | Method of and apparatus for topologically based retrieval of information |
US8307276B2 (en) * | 2006-05-19 | 2012-11-06 | Symantec Corporation | Distributed content verification and indexing |
US9654495B2 (en) * | 2006-12-01 | 2017-05-16 | Websense, Llc | System and method of analyzing web addresses |
US8707441B1 (en) * | 2010-08-17 | 2014-04-22 | Symantec Corporation | Techniques for identifying optimized malicious search engine results |
CN102624761A (zh) * | 2011-01-27 | 2012-08-01 | 腾讯科技(深圳)有限公司 | Apparatus, system, and method for acquiring image-text information |
WO2013184653A1 (en) * | 2012-06-04 | 2013-12-12 | Board Of Regents, The University Of Texas System | Method and system for resilient and adaptive detection of malicious websites |
US9531736B1 (en) * | 2012-12-24 | 2016-12-27 | Narus, Inc. | Detecting malicious HTTP redirections using user browsing activity trees |
US8997232B2 (en) * | 2013-04-22 | 2015-03-31 | Imperva, Inc. | Iterative automatic generation of attribute values for rules of a web application layer attack detector |
-
2015
- 2015-04-30 CN CN201510219801.1A patent/CN104766014B/zh active Active
- 2015-09-25 WO PCT/CN2015/090648 patent/WO2016173200A1/zh active Application Filing
- 2015-09-25 US US15/504,276 patent/US10567407B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN104766014B (zh) | 2017-12-01 |
US20180041530A1 (en) | 2018-02-08 |
US10567407B2 (en) | 2020-02-18 |
CN104766014A (zh) | 2015-07-08 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15890575; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 15504276; Country of ref document: US |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28/02/2018) |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 15890575; Country of ref document: EP; Kind code of ref document: A1 |