CN113810400A - Website parasite detection method, device, equipment and medium - Google Patents

Info

Publication number
CN113810400A
Authority
CN
China
Prior art keywords
detected
website
webpage
abnormal
webpages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111068745.8A
Other languages
Chinese (zh)
Inventor
陈由之
刘伟
杨国强
余文利
王鹏
张博
林赛群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111068745.8A
Publication of CN113810400A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1466 Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The disclosure provides a website parasite detection method and relates to the technical field of computers, in particular to computer network technology and search engine technology. The implementation is as follows: determining a website to be detected, wherein the website to be detected comprises at least one webpage to be detected; for each of the at least one webpage to be detected, acquiring a webpage title and an anchor text of the webpage, inputting the webpage title and the anchor text into a trained predictive neural network, and determining whether the webpage is an abnormal webpage; and determining, based on the number of abnormal webpages among the at least one webpage to be detected, whether the website to be detected is parasitized by a website parasite.

Description

Website parasite detection method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computer technologies, in particular to computer network technology and search engine technology, and more specifically to a method and an apparatus for detecting website parasites, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Search engine optimization (SEO) is a technique that analyzes the ranking rules of search engines to understand how the various search engines search, how they crawl internet pages, and how they rank the search results for specific keywords, and then uses those rules to improve the natural ranking of a website in the relevant search engines.
Black-hat SEO promotes irrelevant webpages or websites that mainly serve commercial purposes by deceiving and abusing search algorithms, so that the websites quickly rise in the rankings, gain more exposure, and profit from it. Common black-hat SEO techniques include link farms, webpage hijacking, hidden links, and the like.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for website parasite detection.
According to an aspect of the present disclosure, a website parasite detection method is provided. The website parasite detection method comprises the following steps: determining a website to be detected, wherein the website to be detected comprises at least one webpage to be detected; for each webpage to be detected in at least one webpage to be detected, acquiring a webpage title and an anchor text of the webpage to be detected, inputting the webpage title and the anchor text of the webpage to be detected into a trained predictive neural network, and determining whether the webpage to be detected is an abnormal webpage or not; and determining whether the website to be detected is parasitized by the website parasite or not based on the number of abnormal webpages in the at least one webpage to be detected.
According to another aspect of the present disclosure, a website parasite detection apparatus is provided. The website parasite detection apparatus includes: a first determination unit configured to determine a website to be detected, wherein the website to be detected comprises at least one webpage to be detected; a detection unit configured to, for each of the at least one webpage to be detected, acquire a webpage title and an anchor text of the webpage, input the webpage title and the anchor text into a trained predictive neural network, and determine whether the webpage is an abnormal webpage; and a second determination unit configured to determine, based on the number of abnormal webpages among the at least one webpage to be detected, whether the website to be detected is parasitized by a website parasite.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the website parasite detection method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to execute the above-described website parasite detection method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described website parasite detection method.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a flow diagram of a method of website parasite detection, according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a method of website parasite detection, according to an embodiment of the present disclosure;
FIG. 3 shows a block diagram of an apparatus for website parasite detection, according to an embodiment of the present disclosure;
FIG. 4 shows a block diagram of an apparatus for website parasite detection, according to an embodiment of the present disclosure; and
FIG. 5 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
A website parasite is a black-hat SEO technique in which an attacker intrudes into other websites and implants a parasite program that automatically generates illegal pages carrying content such as pornography, lotteries, private game servers, invoices, and credit card cash-out. In the related art, there is no effective detection method for website parasites.
To solve this problem, the present disclosure obtains the webpage titles and anchor texts of the webpages of a website for predictive analysis, identifies whether each webpage is abnormal, and determines whether the website is parasitized by a website parasite based on the number of abnormal webpages, thereby achieving efficient detection of website parasites.
According to an aspect of the present disclosure, a website parasite detection method is provided. As shown in FIG. 1, the website parasite detection method includes: step S101, determining a website to be detected, wherein the website to be detected comprises at least one webpage to be detected; step S102, for each of the at least one webpage to be detected, acquiring a webpage title and an anchor text of the webpage, inputting the webpage title and the anchor text into a trained predictive neural network, and determining whether the webpage is an abnormal webpage; and step S103, determining, based on the number of abnormal webpages among the at least one webpage to be detected, whether the website to be detected is parasitized by a website parasite.
In this way, the webpage titles and anchor texts of the webpages of a website are acquired and analyzed by prediction to identify whether each webpage is abnormal, and whether the website is parasitized by a website parasite is determined based on the number of abnormal webpages, so that website parasites are detected efficiently.
According to some embodiments, in step S101, a website to be detected is determined, where the website to be detected includes at least one webpage to be detected. The website to be detected may be any website indexed by a search engine. In some preferred embodiments, the website to be detected may include at least one of: official websites, record-filed (registered) websites, and websites with high click and impression counts.
In some embodiments, official websites, record-filed websites, and websites with high click and impression counts have high authority, and their owners would not normally implant website parasites themselves in order to raise the website's ranking in a search engine and gain more exposure. Moreover, search engines often give preferential treatment to official websites, so that such websites have an advantage in indexing and ranking, which makes it easier for hackers to profit by implanting website parasites in them. At the same time, such websites generally have weak management and maintenance capabilities, which makes them more likely to be selected and attacked by hackers. Websites of this type may therefore be screened for website parasites with priority. It can be understood that a person skilled in the art can set the conditions for screening websites to be detected as needed; screening the websites to be detected in this way improves the efficiency of website parasite detection.
According to some embodiments, in step S102, for each of the at least one webpage to be detected, a webpage title and an anchor text of the webpage are acquired and input into a trained predictive neural network, and whether the webpage is an abnormal webpage is determined. Anchor text, also called an anchor text link, is a form of link in which a keyword serves as the link and points to another webpage. After a website parasite is implanted in a website, the parasite usually profits by mounting cheating content. The cheating content usually takes the form of undesirable internet information such as pornography, lotteries, private game servers, invoices, and credit card cash-out, and this information is usually reflected in the webpage title and the anchor text. Neural network prediction can therefore be performed on the webpage title and the anchor text to identify whether the webpage to be detected belongs to one of these categories of abnormal webpages.
According to some embodiments, obtaining the webpage title and the anchor text of the webpage to be detected may include, for example: crawling the webpage to be detected, rendering it, parsing the rendered webpage, and extracting the webpage title and the anchor text information from it. It is understood that those skilled in the art may obtain the webpage title and the anchor text of the webpage to be detected in other ways, which are not limited herein.
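The parsing and extraction part of this step can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: it assumes the webpage has already been crawled and rendered into an HTML string, it uses Python's standard `html.parser` module, and the names `TitleAnchorExtractor` and `extract_title_and_anchors` are hypothetical.

```python
from html.parser import HTMLParser


class TitleAnchorExtractor(HTMLParser):
    """Collects the <title> text and the visible text of every <a href=...> link."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.anchor_texts = []
        self._in_title = False
        self._in_anchor = False
        self._current_anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a" and dict(attrs).get("href"):
            # Only <a> elements that actually link somewhere count as anchor text.
            self._in_anchor = True
            self._current_anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._in_anchor:
            self._in_anchor = False
            # Join text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._current_anchor).split())
            if text:
                self.anchor_texts.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_anchor:
            self._current_anchor.append(data)


def extract_title_and_anchors(html):
    """Returns (title, list of anchor texts) for an already rendered HTML string."""
    parser = TitleAnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.anchor_texts
```

Crawling and rendering (for example with a headless browser) are deliberately left out of the sketch, since the text does not prescribe a particular tool for them.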
According to some embodiments, the webpage title and the anchor text of the webpage to be detected are input into the trained predictive neural network, which identifies whether the webpage belongs to one of the categories of abnormal webpages, such as pornography, lotteries, private game servers, invoices, and credit card cash-out. The neural network used may be any one of the following: CNN, TextCNN, and Transformer.
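How the title and anchor text might be handed to the network can be sketched as below. Everything here is an assumption made for illustration: the `[SEP]` separator, the `max_anchors` cap, the 0.5 decision threshold, and the function names are not from the disclosure, and `keyword_stub` is only a stand-in for a trained CNN, TextCNN, or Transformer so that the sketch runs without a model.

```python
def build_model_input(title, anchor_texts, max_anchors=20):
    """Joins the title and up to max_anchors anchor texts into one input string."""
    parts = [title.strip()]
    parts += [a.strip() for a in anchor_texts[:max_anchors]]
    # Drop empty fragments so the separator never appears twice in a row.
    return " [SEP] ".join(p for p in parts if p)


def is_abnormal_page(title, anchor_texts, predict_proba, threshold=0.5):
    """Flags a page as abnormal when the model's score reaches the threshold.

    predict_proba is any callable mapping a string to a score in [0, 1];
    in practice it would wrap the trained predictive neural network.
    """
    return predict_proba(build_model_input(title, anchor_texts)) >= threshold


def keyword_stub(text):
    """Placeholder scorer (NOT a neural network), used only to make the sketch runnable."""
    hits = sum(word in text.lower() for word in ("lottery", "casino", "invoice"))
    return min(1.0, hits / 2)
```

Passing the scorer in as a callable keeps the page-level decision independent of which network architecture is chosen.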
According to some embodiments, in step S103, whether the website to be detected is parasitized by a website parasite is determined based on the number of abnormal webpages among the at least one webpage to be detected. By examining all the webpages of the website to be detected and counting the abnormal webpages among them, website parasites in the website can be detected.
According to some embodiments, determining whether the website to be detected is parasitized by a website parasite based on the number of abnormal webpages among the at least one webpage to be detected may include: determining whether the website to be detected is subject to whole-site parasitism based on the number of abnormal webpages among the at least one webpage to be detected. Whole-site parasitism means that the website is completely controlled by a hacker, which manifests as cheating content being mounted across the entire website. Whether the website to be detected is subject to whole-site parasitism can be judged from the aggregation degree of the abnormal webpages. In some exemplary embodiments, the aggregation degree of the abnormal webpages may be expressed as the ratio of the number of abnormal webpages to the number of the at least one webpage to be detected, or as the number of abnormal webpages among those webpages to be detected that were generated within a first time threshold, or both.
According to some embodiments, determining whether the website to be detected is subject to whole-site parasitism based on the number of abnormal webpages among the at least one webpage to be detected may include: in response to determining that a first ratio of the number of abnormal webpages to the number of the at least one webpage to be detected is greater than or equal to a first ratio threshold, determining that the website to be detected is subject to whole-site parasitism. The number of abnormal webpages of the website identified in step S102 is counted and compared with the total number of webpages of the website indexed by the search engine, so as to measure the aggregation degree of the abnormal webpages. In some exemplary embodiments, the comparison may consist of calculating a first ratio of the number of abnormal webpages of the website to the total number of webpages of the website indexed by the search engine; when the first ratio is greater than or equal to the first ratio threshold, it is determined that the website is subject to whole-site parasitism, which may be expressed by the following formula:
$$
\text{is\_parasitized} =
\begin{cases}
\text{true}, & \dfrac{count_{unusual}}{count_{site}} \ge threshold_{site} \\
\text{false}, & \text{otherwise}
\end{cases}
$$
where is_parasitized indicates whether the website is subject to whole-site parasitism, count_{unusual} denotes the number of abnormal webpages of the website, count_{site} denotes the total number of webpages of the website indexed by the search engine, and threshold_{site} denotes the first ratio threshold. It is understood that a person skilled in the art may set the first ratio threshold empirically to identify whether a website is subject to whole-site parasitism, which is not limited herein.
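The first-ratio check can be written as a few lines of Python. This is a minimal sketch: the default threshold of 0.8 is an arbitrary illustrative value (the disclosure leaves the threshold to the practitioner), and returning False for a website with no indexed pages is an assumption made to avoid a division by zero.

```python
def is_site_parasitized(count_unusual, count_site, threshold_site=0.8):
    """Whole-site check: abnormal pages / indexed pages >= first ratio threshold.

    count_unusual:  number of abnormal webpages identified for the website
    count_site:     total number of webpages of the website indexed by the search engine
    threshold_site: first ratio threshold (0.8 is only an illustrative default)
    """
    if count_site <= 0:
        # Assumption: a website with no indexed pages is not flagged.
        return False
    return count_unusual / count_site >= threshold_site
```

For example, under the illustrative threshold a website with 90 abnormal pages out of 100 indexed pages is flagged, while one with 10 abnormal pages out of 100 is not.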
Alternatively or additionally, according to some embodiments, determining whether the website to be detected is subject to whole-site parasitism based on the number of abnormal webpages among the at least one webpage to be detected may include: in response to determining that all the webpages generated within a first time threshold among the at least one webpage to be detected are abnormal webpages, determining that the website to be detected is subject to whole-site parasitism. That is, it is checked whether the webpages newly generated by the website to be detected within the first time threshold are all abnormal webpages; when they are, it is determined that the website is subject to whole-site parasitism. It is understood that a person skilled in the art may set the first time threshold empirically to identify whether a website is subject to whole-site parasitism, which is not limited herein.
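The time-window variant can be sketched as follows. The representation of a page as a `(created_at, is_abnormal)` pair, the function name, and the choice to return False when no page falls inside the window are assumptions of this sketch, not details given in the disclosure.

```python
from datetime import datetime, timedelta


def all_recent_pages_abnormal(pages, now, window):
    """True when every page generated within `window` before `now` is abnormal.

    pages:  iterable of (created_at: datetime, is_abnormal: bool) pairs
    window: the time threshold, as a timedelta
    """
    recent = [is_abnormal for created_at, is_abnormal in pages
              if now - created_at <= window]
    # Assumption: with no recently generated pages there is nothing to flag.
    return bool(recent) and all(recent)
```

A usage example: with `window=timedelta(days=7)`, a website whose only pages from the last week are abnormal is flagged even if older pages are normal.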
According to some embodiments, the website parasite detection method further comprises: in response to determining that the website to be detected is subject to whole-site parasitism, deleting the data of the website to be detected and adding the website to a website blacklist. When a website to be detected is identified as subject to whole-site parasitism, it can generally be assumed that no administrator will carry out routine maintenance and management of the website in the short term. The data of the website can therefore be deleted from the database and the website added to a website blacklist, so that the search engine no longer indexes or crawls it.
According to some embodiments, the website parasite detection method further comprises: determining whether some of the directories of the website to be detected are parasitized based on the number of abnormal webpages among the at least one webpage to be detected. Partial-directory parasitism means that some directories of the website are controlled by a hacker, which manifests as abnormal webpages generated by the parasite existing in some directories of the website, while normal webpages generated by the website still exist in other directories. Whether some directories of the website to be detected are parasitized can be judged by checking whether any parasitized directory exists, and whether a directory is parasitized can be judged from the aggregation degree of the abnormal webpages under that directory to be detected.
In some exemplary embodiments, the aggregation degree of the abnormal webpages under a directory to be detected may be expressed as the ratio of the number of abnormal webpages under that directory to the number of webpages to be detected under that directory, or as the number of abnormal webpages among those webpages to be detected under that directory that were generated within a second time threshold, or both.
According to some embodiments, determining whether some directories of the website to be detected are parasitized based on the number of abnormal webpages among the at least one webpage to be detected may include: in response to determining that at least one directory to be detected in the website to be detected satisfies the following condition, determining that some directories of the website to be detected are parasitized: a second ratio of the number of abnormal webpages under the directory to the number of webpages to be detected under the directory is greater than or equal to a second ratio threshold.
The number of abnormal webpages under the corresponding directory to be detected, as identified in step S102, is counted and compared with the total number of webpages under that directory indexed by the search engine, so as to measure the aggregation degree of the abnormal webpages. In some exemplary embodiments, the comparison may consist of calculating a second ratio of the number of abnormal webpages under the directory to be detected to the total number of webpages under that directory; when the second ratio is greater than or equal to the second ratio threshold, it is determined that the directory to be detected is parasitized and, accordingly, that some directories of the website to be detected are parasitized, which may be expressed by the following formula:
$$
\text{is\_pattern\_parasitized} =
\begin{cases}
\text{true}, & \dfrac{count_{pattern\_unusual}}{count_{pattern}} \ge threshold_{pattern} \\
\text{false}, & \text{otherwise}
\end{cases}
$$
where is_pattern_parasitized indicates whether some directories of the website are parasitized, count_{pattern\_unusual} denotes the number of abnormal webpages under the directory to be detected, count_{pattern} denotes the total number of webpages under the directory to be detected indexed by the search engine, and threshold_{pattern} denotes the second ratio threshold. It is understood that a person skilled in the art may set the second ratio threshold empirically to identify whether some directories of a website are parasitized, which is not limited herein.
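The per-directory ratio check can be sketched as below. Several points are assumptions of this sketch rather than details from the disclosure: a "directory" is taken to be the first path segment of the URL, the per-directory total is counted from the inspected pages instead of from a search engine index, and 0.8 is an arbitrary illustrative threshold.

```python
from collections import Counter
from urllib.parse import urlsplit


def directory_of(url):
    """Maps a URL to its top-level directory, e.g. '/news/'; root-level pages map to '/'.

    Simplification: only the first path segment is used as the directory.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return "/" + segments[0] + "/" if len(segments) > 1 else "/"


def parasitized_directories(pages, threshold_pattern=0.8):
    """Returns the directories whose abnormal-page ratio reaches the second ratio threshold.

    pages: iterable of (url: str, is_abnormal: bool) pairs
    """
    total = Counter()
    unusual = Counter()
    for url, is_abnormal in pages:
        directory = directory_of(url)
        total[directory] += 1
        if is_abnormal:
            unusual[directory] += 1
    return sorted(d for d in total if unusual[d] / total[d] >= threshold_pattern)
```

Under this sketch, some directories of the website count as parasitized whenever `parasitized_directories(pages)` is non-empty, and the returned list is the kind of directory information that could accompany an alarm.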
Alternatively or additionally, according to some embodiments, determining whether some directories of the website to be detected are parasitized based on the number of abnormal webpages among the at least one webpage to be detected may include: in response to determining that at least one directory to be detected in the website to be detected satisfies the following condition, determining that some directories of the website to be detected are parasitized: all the webpages to be detected under the directory that were generated within a second time threshold are abnormal webpages. That is, it is checked whether all the webpages generated by the website under the corresponding directory to be detected within the second time threshold are abnormal webpages; when all the newly generated webpages under the directory are abnormal webpages, it is determined that some directories of the website are parasitized. It is understood that a person skilled in the art may set the second time threshold empirically to identify whether some directories of a website are parasitized, which is not limited herein.
According to some embodiments, the website parasite detection method further comprises: in response to determining that some directories of the website to be detected are parasitized, sending alarm information to the website to be detected. When a website to be detected is identified as having some of its directories parasitized, it can generally be assumed that the website's administrator still carries out routine maintenance and management, and that the directories were parasitized only because of a weakness such as a missing firewall; alarm information therefore needs to be sent to the website. In some embodiments, the alarm information may include information on the parasitized directories, information on the abnormal webpages, and the like, so that the administrator can use it to delete the corresponding abnormal webpages and clean up the website parasite program in the back end.
According to some embodiments, the website parasite detection method further comprises: in response to determining that no directory of the website to be detected is parasitized and that the number of abnormal webpages among the at least one webpage to be detected is greater than zero, determining that some webpages of the website to be detected are parasitized. Partial-page parasitism means that only some webpages of the website to be detected are controlled by a hacker, which manifests as some webpages of the website being abnormal webpages while the website is subject to neither whole-site parasitism nor partial-directory parasitism.
According to some embodiments, as shown in fig. 2, the website parasite detection method further comprises: step S208, in response to determining that partial webpages of the website to be detected are parasitized, deleting the data of the abnormal webpages among the webpages to be detected. When the website to be detected is identified as having only partial webpages parasitized, the daily maintenance and management of the website can be considered to be in good condition, so the search engine can clean up the abnormal webpages directly. In some embodiments, the cleaning by the search engine may include deleting the data of the abnormal webpages, adding the abnormal webpages to a blacklist, and the like, so that the abnormal webpages are no longer crawled. Steps S201 to S207 in fig. 2 are similar to the corresponding steps in the above embodiments and are not described herein again.
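As a non-limiting illustration, the tiered determination and the corresponding handling described in the above embodiments may be sketched as follows. The enumeration, the `CLEAN` level, the action strings, and the function signature are assumptions introduced for illustration; the two Boolean inputs stand for the outcomes of the whole-website rules and the directory rules, which are evaluated elsewhere.

```python
from enum import Enum


class Level(Enum):
    WHOLE_SITE = "whole website parasitized"
    PARTIAL_DIRECTORY = "partial directory parasitized"
    PARTIAL_PAGES = "partial webpages parasitized"
    CLEAN = "not parasitized"  # assumed label for the remaining case


# Handling per level, paraphrasing the embodiments above.
ACTIONS = {
    Level.WHOLE_SITE: "delete website data and blacklist the website",
    Level.PARTIAL_DIRECTORY: "send alarm information to the website",
    Level.PARTIAL_PAGES: "delete abnormal webpage data",
    Level.CLEAN: "no action",
}


def classify_site(whole_site_rule, abnormal_dir_found, abnormal_count):
    """Check the most severe level first, then fall through to milder ones."""
    if whole_site_rule:
        return Level.WHOLE_SITE
    if abnormal_dir_found:
        return Level.PARTIAL_DIRECTORY
    if abnormal_count > 0:
        return Level.PARTIAL_PAGES
    return Level.CLEAN


level = classify_site(whole_site_rule=False, abnormal_dir_found=False, abnormal_count=3)
action = ACTIONS[level]
```

The ordering reflects the text: the partial-webpage case is reached only when neither the whole website nor any directory is determined to be parasitized.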
According to another aspect of the present disclosure, a website parasite detection apparatus is also provided. As shown in fig. 3, the website parasite detection apparatus 300 includes: a first determining unit 310 configured to determine a website to be detected, where the website to be detected includes at least one webpage to be detected; a detecting unit 320 configured to, for each webpage to be detected of the at least one webpage to be detected, acquire a webpage title and an anchor text of the webpage to be detected, input the webpage title and the anchor text into a trained predictive neural network, and determine whether the webpage to be detected is an abnormal webpage; and a second determining unit 330 configured to determine whether the website to be detected is parasitized by a website parasite based on the number of abnormal webpages in the at least one webpage to be detected.
The operations of the units 310-330 of the website parasite detection apparatus 300 are similar to the operations of the steps S101-S103 of the website parasite detection method, and are not described herein again.
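As a non-limiting illustration, the data flow through the detecting unit may be sketched as follows. The trained predictive neural network is not reproduced here; `keyword_stub` is a hypothetical stand-in that only marks where the network's prediction would be obtained, and the `WebPage` record and the example terms are likewise assumptions.

```python
from typing import Callable, Iterable, NamedTuple, Tuple


class WebPage(NamedTuple):
    title: str
    anchor_texts: Tuple[str, ...]  # anchor texts of the links on the webpage


def keyword_stub(title: str, anchor_text: str) -> bool:
    """Hypothetical stand-in for the trained predictive neural network."""
    spam_terms = ("casino", "lottery")  # illustrative terms only
    text = f"{title} {anchor_text}".lower()
    return any(term in text for term in spam_terms)


def count_abnormal(pages: Iterable[WebPage],
                   classify: Callable[[str, str], bool]) -> int:
    """Classify each webpage from its title and anchor text and count
    the webpages judged abnormal."""
    return sum(1 for p in pages if classify(p.title, " ".join(p.anchor_texts)))


site = [
    WebPage("City library opening hours", ("Contact", "Catalogue")),
    WebPage("Best online casino bonus", ("casino login", "lottery results")),
    WebPage("Annual report", ("Download PDF",)),
]
abnormal_count = count_abnormal(site, keyword_stub)
```

Passing the classifier as a callable keeps the counting step independent of how the per-page prediction is produced, so a real model could be substituted without changing the rest of the flow.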
According to some embodiments, the second determining unit is further configured to determine that the whole website to be detected has been parasitized in response to at least one of the following rules being satisfied: a first ratio of the number of the abnormal webpages to the number of the at least one webpage to be detected is greater than or equal to a first ratio threshold; and all the webpages generated within the first time threshold among the at least one webpage to be detected are abnormal webpages.
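As a non-limiting illustration, the two whole-website rules may be evaluated as follows. The function name, the parallel-list inputs, the example threshold of 0.7, and the seven-day window are assumptions introduced for illustration, as is the choice not to flag a website that has no recently generated webpages under the time rule.

```python
from datetime import datetime, timedelta


def whole_site_parasitized(flags, created, now, ratio_threshold, window):
    """flags[i] is True if webpage i is abnormal; created[i] is the time
    at which it was generated. Either rule being satisfied is sufficient."""
    if not flags:
        return False
    # Rule 1: share of abnormal webpages reaches the first ratio threshold.
    rule_ratio = sum(flags) / len(flags) >= ratio_threshold
    # Rule 2: every webpage generated within the window is abnormal.
    recent = [bad for bad, t in zip(flags, created) if now - t <= window]
    rule_recent = bool(recent) and all(recent)
    return rule_ratio or rule_recent


now = datetime(2021, 9, 13)
old, new = now - timedelta(days=60), now - timedelta(days=1)
week = timedelta(days=7)

by_ratio = whole_site_parasitized([True, True, True, False], [old] * 4, now, 0.7, week)
by_recency = whole_site_parasitized([False, False, False, True], [old, old, old, new], now, 0.7, week)
neither = whole_site_parasitized([False, True], [new, new], now, 0.7, week)
```

The second call shows why the time rule is useful: the overall ratio is only 0.25, yet the website is flagged because everything generated recently is abnormal.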
According to some embodiments, the website parasite detection apparatus may further comprise: a first processing unit configured to, in response to determining that the whole website to be detected is parasitized, delete the data of the website to be detected and add the website to be detected to a website blacklist.
According to some embodiments, the website parasite detection apparatus may further comprise: a third determining unit configured to determine whether the at least one directory to be detected includes at least one abnormal directory to be detected that satisfies at least one of the following rules: a second ratio of the number of the abnormal webpages under the corresponding directory to be detected to the number of the at least one webpage to be detected is greater than or equal to a second ratio threshold; and under the corresponding directory to be detected, all the webpages generated within the second time threshold among the at least one webpage to be detected are abnormal webpages.
According to some embodiments, the website parasite detection apparatus may further comprise: a fourth determining unit configured to determine that a partial directory of the website to be detected is parasitized in response to determining that the at least one directory to be detected includes at least one abnormal directory to be detected.
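As a non-limiting illustration, the second-ratio rule may be evaluated per directory as follows. This sketch reads the denominator literally as the total number of webpages to be detected; using the number of webpages under the directory itself would be an alternative reading. The function name, the tuple input format, and the example threshold are assumptions introduced for illustration.

```python
from collections import Counter
from posixpath import dirname


def abnormal_directories(pages, ratio_threshold):
    """pages: list of (url_path, is_abnormal) tuples for the whole website.
    Returns the sorted directories whose abnormal-webpage count, divided by
    the total number of webpages to be detected, reaches the threshold."""
    total = len(pages)
    if total == 0:
        return []
    abnormal_per_dir = Counter(dirname(path) for path, bad in pages if bad)
    return sorted(
        d for d, n in abnormal_per_dir.items() if n / total >= ratio_threshold
    )


pages = [
    ("/news/a.html", True),
    ("/news/b.html", True),
    ("/blog/c.html", True),
    ("/index.html", False),
    ("/about/d.html", False),
]
flagged = abnormal_directories(pages, ratio_threshold=0.3)
partial_directory_parasitized = len(flagged) > 0
```

Here `/news` holds two of the five webpages as abnormal (0.4) and is flagged, while `/blog` (0.2) is not; finding at least one flagged directory corresponds to the determination made by the fourth determining unit.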
According to some embodiments, the website parasite detection apparatus may further comprise: a second processing unit configured to send alarm information to the website to be detected in response to determining that a partial directory of the website to be detected is parasitized.
According to some embodiments, the website parasite detection apparatus may further comprise: a fifth determining unit configured to determine that partial webpages of the website to be detected are parasitized in response to determining that no directory of the website to be detected is parasitized and the number of abnormal webpages in the at least one webpage to be detected is greater than zero.
According to some embodiments, as shown in fig. 4, the website parasite detection apparatus 400 may further include: a third processing unit 490 configured to delete the data of the abnormal webpages among the webpages to be detected in response to determining that partial webpages of the website to be detected are parasitized. The operations of the units 410 to 480 in fig. 4 are similar to those of the corresponding units described above and are not described herein again.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to fig. 5, a block diagram of the structure of an electronic device 500 will now be described. The electronic device 500 may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501 which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506, an output unit 507, a storage unit 508, and a communication unit 509. The input unit 506 may be any type of device capable of inputting information to the device 500. The input unit 506 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 507 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 508 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the various methods and processes described above, such as the website parasite detection method. For example, in some embodiments, the website parasite detection method can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by the computing unit 501, one or more steps of the website parasite detection method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the website parasite detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the defects of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (17)

1. A website parasite detection method, comprising:
determining a website to be detected, wherein the website to be detected comprises at least one webpage to be detected;
for each webpage to be detected in the at least one webpage to be detected:
acquiring a webpage title and an anchor text of the webpage to be detected;
inputting the webpage title and the anchor text into a trained predictive neural network, and determining whether the webpage to be detected is an abnormal webpage; and
determining whether the website to be detected is parasitized by a website parasite based on the number of abnormal webpages in the at least one webpage to be detected.
2. The method of claim 1, wherein the determining whether the website to be detected is parasitized by a website parasite based on the number of abnormal webpages in the at least one webpage to be detected comprises:
determining that the whole website to be detected has been parasitized in response to at least one of the following rules being satisfied:
a first ratio of the number of the abnormal webpages to the number of the at least one webpage to be detected is greater than or equal to a first ratio threshold; and
all the webpages generated within a first time threshold among the at least one webpage to be detected are abnormal webpages.
3. The method of claim 2, further comprising:
in response to determining that the whole website to be detected is parasitized, deleting the data of the website to be detected, and adding the website to be detected to a website blacklist.
4. The method according to claim 2 or 3, wherein the website to be detected comprises at least one directory to be detected, and wherein the determining whether the website to be detected is parasitized by a website parasite based on the number of abnormal webpages in the at least one webpage to be detected further comprises:
in response to determining that the whole website to be detected is not parasitized, determining whether the at least one directory to be detected includes at least one abnormal directory to be detected that satisfies at least one of the following rules:
a second ratio of the number of the abnormal webpages under the corresponding directory to be detected to the number of the at least one webpage to be detected is greater than or equal to a second ratio threshold; and
under the corresponding directory to be detected, all the webpages generated within a second time threshold among the at least one webpage to be detected are abnormal webpages; and
in response to determining that the at least one directory to be detected includes at least one abnormal directory to be detected, determining that a partial directory of the website to be detected is parasitized.
5. The method of claim 4, further comprising:
sending alarm information to the website to be detected in response to determining that a partial directory of the website to be detected is parasitized.
6. The method according to claim 4 or 5, wherein the determining whether the website to be detected is parasitized by a website parasite based on the number of abnormal webpages in the at least one webpage to be detected further comprises:
in response to determining that no directory of the website to be detected is parasitized and the number of abnormal webpages in the at least one webpage to be detected is greater than zero, determining that partial webpages of the website to be detected are parasitized.
7. The method of claim 6, further comprising:
in response to determining that partial webpages of the website to be detected are parasitized, deleting the data of the abnormal webpages among the webpages to be detected.
8. A website parasite detection device, comprising:
a first determination unit configured to determine a website to be detected, wherein the website to be detected comprises at least one webpage to be detected;
a detection unit configured to, for each webpage to be detected in the at least one webpage to be detected:
acquire a webpage title and an anchor text of the webpage to be detected;
input the webpage title and the anchor text into a trained predictive neural network, and determine whether the webpage to be detected is an abnormal webpage; and
a second determination unit configured to determine whether the website to be detected is parasitized by a website parasite based on the number of abnormal webpages in the at least one webpage to be detected.
9. The apparatus of claim 8, wherein the second determining unit is further configured to:
determine that the whole website to be detected has been parasitized in response to at least one of the following rules being satisfied:
a first ratio of the number of the abnormal webpages to the number of the at least one webpage to be detected is greater than or equal to a first ratio threshold; and
all the webpages generated within a first time threshold among the at least one webpage to be detected are abnormal webpages.
10. The apparatus of claim 9, further comprising:
a first processing unit configured to, in response to determining that the whole website to be detected is parasitized, delete the data of the website to be detected and add the website to be detected to a website blacklist.
11. The apparatus according to claim 9 or 10, wherein the website to be detected comprises at least one directory to be detected, and wherein the apparatus further comprises:
a third determining unit configured to, in response to determining that the whole website to be detected is not parasitized, determine whether the at least one directory to be detected includes at least one abnormal directory to be detected that satisfies at least one of the following rules:
a second ratio of the number of the abnormal webpages under the corresponding directory to be detected to the number of the at least one webpage to be detected is greater than or equal to a second ratio threshold; and
under the corresponding directory to be detected, all the webpages generated within a second time threshold among the at least one webpage to be detected are abnormal webpages; and
a fourth determining unit configured to determine that a partial directory of the website to be detected is parasitized in response to determining that the at least one directory to be detected includes at least one abnormal directory to be detected.
12. The apparatus of claim 11, further comprising:
a second processing unit configured to send alarm information to the website to be detected in response to determining that a partial directory of the website to be detected is parasitized.
13. The apparatus of claim 11 or 12, further comprising:
a fifth determining unit configured to determine that partial webpages of the website to be detected are parasitized in response to determining that no directory of the website to be detected is parasitized and the number of abnormal webpages in the at least one webpage to be detected is greater than zero.
14. The apparatus of claim 13, further comprising:
a third processing unit configured to delete the data of the abnormal webpages among the webpages to be detected in response to determining that partial webpages of the website to be detected are parasitized.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program, wherein the computer program implements the method of any one of claims 1-7 when executed by a processor.
CN202111068745.8A 2021-09-13 2021-09-13 Website parasite detection method, device, equipment and medium Pending CN113810400A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111068745.8A CN113810400A (en) 2021-09-13 2021-09-13 Website parasite detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113810400A true CN113810400A (en) 2021-12-17

Family

ID=78940978

Country Status (1)

Country Link
CN (1) CN113810400A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239485A (en) * 2014-09-05 2014-12-24 中国科学院计算机网络信息中心 Statistical machine learning-based internet hidden link detection method
CN106844685A (en) * 2017-01-26 2017-06-13 百度在线网络技术(北京)有限公司 Method, device and server for recognizing website
CN110232277A (en) * 2019-04-23 2019-09-13 平安科技(深圳)有限公司 Detection method, device and the computer equipment at webpage back door
CN110929257A (en) * 2019-10-30 2020-03-27 武汉绿色网络信息服务有限责任公司 Method and device for detecting malicious codes carried in webpage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211217