CN106844685B - Method, device and server for identifying website

Info

Publication number: CN106844685B
Application number: CN201710057271.4A
Authority: CN
Inventors: 邹红建; 方高林; 付立波
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2017-01-26
Filing date: 2017-01-26
Publication date: 2020-07-28
Anticipated expiration: 2037-01-26
Also published as: CN106844685A

Abstract

The application discloses a method, a device and a server for identifying a website. One embodiment of the method comprises: acquiring a webpage set of a website to be identified; identifying abnormal web pages in the web page set, wherein the correlation degree of the picture information and the text information in the abnormal web pages is smaller than a correlation degree threshold value; determining the ratio of the identified abnormal web pages in the web page set; and determining whether the website to be identified is a spam website or not according to the determined ratio. The embodiment improves the efficiency of identifying the spam websites.

Description

Method, device and server for identifying website

Technical Field

The present application relates to the field of computer technologies, in particular to the field of internet technologies, and more particularly to a method, an apparatus, and a server for identifying a website.

Background

A spam website generally refers to a website that exploits defects in a search engine's ranking algorithm and uses cheating techniques aimed at the search engine to obtain a ranking higher than the quality of its information warrants. Spam websites placed near the top of a search result list trick users into clicking on them, which makes information retrieval harder and reduces retrieval efficiency.

However, existing methods for identifying spam websites usually calculate the importance of pages based on the link relationships between websites. This approach requires a large amount of calculation, so the efficiency of identifying spam websites is low.

Disclosure of Invention

The present application is directed to an improved method, apparatus and server for identifying a website, so as to solve the technical problems mentioned in the above background.

In a first aspect, the present application provides a method for identifying a website, the method comprising: acquiring a webpage set of a website to be identified; identifying abnormal webpages in the webpage set, wherein the correlation degree of the picture information and the text information in the abnormal webpages is smaller than a correlation degree threshold value; determining the ratio of the identified abnormal web pages in the web page set; and determining whether the website to be identified is a spam website or not according to the determined ratio.

In a second aspect, the present application provides an apparatus for identifying a website, the apparatus comprising: an acquisition unit, configured to acquire a webpage set of a website to be identified; an identification unit, configured to identify abnormal webpages in the webpage set, wherein the correlation degree of the picture information and the text information in the abnormal webpages is smaller than a correlation degree threshold value; a ratio determining unit, configured to determine a ratio of the identified abnormal webpages in the webpage set; and a spam website determining unit, configured to determine whether the website to be identified is a spam website according to the determined ratio.

In a third aspect, the present application provides a server, comprising: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for identifying a website of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for identifying a website of the first aspect described above.

The method, apparatus and server for identifying a website provided by the present application acquire a webpage set of a website to be identified; identify abnormal webpages in the webpage set, wherein the correlation degree of the picture information and the text information in the abnormal webpages is smaller than a correlation degree threshold value; determine the ratio of the identified abnormal webpages in the webpage set; and determine whether the website to be identified is a spam website according to the determined ratio. Because spam websites usually pile up pictures that are irrelevant to the accompanying text, this characteristic allows spam websites to be identified efficiently.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow diagram of a first embodiment of a method for identifying a website according to the present application;

FIG. 3 is a schematic illustration of an application scenario of a method for identifying a website according to the present application;

FIG. 4 is a flow chart of a second embodiment of a method for identifying a web site according to the present application;

FIG. 5 is a flow diagram of an alternative implementation of steps in a method for identifying websites according to the present application;

FIG. 6 is a flow diagram of an alternative implementation of steps in a method for identifying websites according to the present application;

FIG. 7 is a flow diagram of an alternative implementation of steps in a method for identifying websites according to the present application;

FIG. 8 is a flow chart of a third embodiment of a method for identifying a website according to the present application;

FIG. 9 is a schematic diagram illustrating the structure of one embodiment of an apparatus for identifying web sites according to the present application;

FIG. 10 is a block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for identifying a website or apparatus for identifying a website of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include website servers 101, 102, 103, a network 104, and a network monitoring server 105. The network 104 serves as a medium for providing a communication link between the website servers 101, 102, 103 and the network monitoring server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The website servers 101, 102, 103 may be servers that provide support for various websites, and the website servers may generate various web pages that can be displayed on a terminal device.

The network monitoring server 105 may capture web pages through a web crawler and cluster the captured web pages by website domain name to obtain the web pages of a given website. It may then identify spam websites using the web pages obtained for that website. It should be noted that the website servers and the network monitoring server need not communicate directly; instead, communication arises indirectly when the network monitoring server captures web pages generated by the website servers.

It should be noted that the method for identifying a website provided in the embodiment of the present application is generally performed by the network monitoring server 105, and accordingly, the apparatus for identifying a website is generally disposed in the network monitoring server 105.

It should be understood that the number of website servers, networks, and network monitoring servers in fig. 1 is merely illustrative. There may be any number of website servers, networks, and network monitoring servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for identifying a website in accordance with the present application is shown. The method for identifying the website comprises the following steps:

step 201, acquiring a webpage set of a website to be identified.

In this embodiment, an electronic device (for example, the network monitoring server shown in fig. 1) on which the method for identifying a website operates may acquire the set of webpages of the website to be identified in various ways.

In some optional implementation manners of this embodiment, the electronic device may obtain a pre-stored set of web pages of the website to be identified.

In some optional implementation manners of this embodiment, the electronic device may cluster the pages captured by a web crawler according to website domain name to obtain a web page set for each of multiple websites. It may then select one website as the website to be identified and acquire the web page set of that website.

In some optional implementation manners of this embodiment, the web page set may be the set of all web pages of the website to be identified at a certain time point; the web page set may also be the set of web pages newly added to or updated on the website to be identified within a preset time period.
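The grouping of crawled pages by domain name described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `group_pages_by_domain` and the sample URLs are assumptions, and the full host name is used as a stand-in for the "website domain name".

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_pages_by_domain(urls):
    """Group crawled page URLs into one web page set per website host name."""
    page_sets = defaultdict(list)
    for url in urls:
        # urlsplit(...).hostname is lower-cased and excludes any port number.
        host = urlsplit(url).hostname
        if host:  # skip malformed URLs that carry no host
            page_sets[host].append(url)
    return dict(page_sets)


# Hypothetical crawl output used only for illustration.
crawled = [
    "http://site-a.example/index.html",
    "http://site-b.example/news/1",
    "http://SITE-A.example/cars/2",
]
page_sets = group_pages_by_domain(crawled)
```

Any key of `page_sets` can then be picked as the website to be identified, with its list of URLs serving as the web page set.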

Step 202, identifying abnormal web pages in the web page set.

In this embodiment, an electronic device (for example, the network monitoring server shown in fig. 1) on which the method for identifying a website operates may identify an abnormal webpage in the above-mentioned webpage set by various methods. Here, the abnormal web page refers to a web page in which the correlation between the picture information and the text information in the web page is smaller than a preset correlation threshold.

In some optional implementations of the present embodiment, identifying the abnormal web page in the web page set may be implemented by: for each webpage in the webpage set, determining the correlation degree between the picture information in the webpage and the corresponding text of the webpage by using a pre-established identification model, and identifying whether the webpage is an abnormal webpage or not based on the determined correlation degree.
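As a rough sketch of this per-page check, the snippet below flags pages whose picture-text correlation falls below a threshold. The source does not specify the identification model, so it is passed in as a callable; `toy_correlation`, the page dictionary layout, and the threshold value are all hypothetical placeholders.

```python
def identify_abnormal_pages(pages, correlation_fn, correlation_threshold):
    """Return the pages whose picture/text correlation is below the threshold."""
    return [
        page
        for page in pages
        if correlation_fn(page["pictures"], page["text"]) < correlation_threshold
    ]


def toy_correlation(picture_labels, text):
    """Placeholder model: fraction of picture labels that occur in the text."""
    if not picture_labels:
        return 1.0  # no pictures, so nothing can be unrelated to the text
    words = set(text.lower().split())
    hits = sum(1 for label in picture_labels if label in words)
    return hits / len(picture_labels)


# Hypothetical pages; picture labels are assumed to come from an image tagger.
pages = [
    {"url": "a", "pictures": ["car"], "text": "New car review"},
    {"url": "b", "pictures": ["car"], "text": "lottery draw tonight"},
]
abnormal = identify_abnormal_pages(pages, toy_correlation, 0.5)
```

In a real system `correlation_fn` would wrap the pre-established identification model; the control flow around it stays the same.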

In some optional implementations of the present embodiment, identifying the abnormal web page in the web page set may be implemented by the following steps. First, a large number of web pages of other websites are obtained and, together with the web pages in the web page set, form a to-be-clustered web page set. The web pages in the to-be-clustered web page set are then clustered based on picture content to obtain one or more web page clusters. For a web page cluster, the corresponding texts of all the web pages in the cluster are acquired to generate a corresponding text set. A corresponding text whose topic differs substantially from that of the other corresponding texts in the corresponding text set is identified as an abnormal corresponding text. Finally, whether the web page to which the abnormal corresponding text belongs or corresponds is an abnormal web page is identified based on the abnormal corresponding text.

It should be noted that the clustering and classification algorithms referred to in the present application, and how to perform operations with them, are well known to those skilled in the art and are not described in detail in the present application.

In some optional implementation manners of this embodiment, the corresponding text of the web page may be all the text in the web page, a main text extracted from all the text, a sub-text in each text field of the web page, or a search text. Here, the search text may be generated by first acquiring the search expressions for which the web page is presented as a search result, and then parsing the acquired search expressions and extracting keywords.

In some optional implementation manners of this embodiment, the abnormal web pages identified in the web page cluster may be intersected with the web page set to obtain the abnormal web pages in the web page set.

In some optional implementation manners of this embodiment, in response to the identified abnormal corresponding text belonging or corresponding to a web page in the web page set, that web page may be determined as an abnormal web page in the web page set.

In some optional implementations of the present embodiment, identifying the abnormal web page in the web page set may be implemented by: acquiring a preset picture text feature vector set, wherein each preset picture text feature vector is generated by parsing an already identified abnormal web page of a spam website and extracting and splicing its picture features and text features; for each web page in the web page set, extracting and splicing the picture features and the text features of the web page to generate a to-be-identified picture text feature vector; and for each web page in the web page set, determining that the web page is an abnormal web page in response to the similarity between the to-be-identified picture text feature vector of the web page and at least one preset picture text feature vector in the preset picture text feature vector set being greater than a picture text threshold value.
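A minimal sketch of this matching step is given below, assuming that "splicing" means concatenating the picture feature vector and the text feature vector, and that similarity is measured by cosine similarity. Neither choice is stated in the source; the function names and the default threshold are hypothetical.

```python
import math


def splice(picture_features, text_features):
    """Concatenate picture and text features into one picture text vector."""
    return list(picture_features) + list(text_features)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches_known_spam(candidate, preset_vectors, picture_text_threshold=0.9):
    """True if the candidate is similar enough to at least one preset vector."""
    return any(
        cosine_similarity(candidate, preset) > picture_text_threshold
        for preset in preset_vectors
    )


# Toy 2-d features: index 0 = "car" topic, index 1 = "gambling" topic.
preset = [splice([1, 0], [0, 1])]            # car picture + gambling text
suspect = splice([0.9, 0.1], [0.1, 0.9])     # nearly the same combination
benign = splice([1, 0], [1, 0])              # car picture + car text
```

Here `matches_known_spam(suspect, preset)` holds while `matches_known_spam(benign, preset)` does not, because only the first vector reproduces the mismatched picture and text combination seen on known spam pages.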

Step 203, determining the ratio of the identified abnormal web pages in the web page set.

In this embodiment, an electronic device (for example, the network monitoring server shown in fig. 1) on which the method for identifying a website operates may determine a ratio of identified abnormal webpages in the webpage set. Here, the ratio may be obtained by dividing the number of identified abnormal web pages in the web page set by the total number of web pages in the web page set.

Step 204, determining whether the website to be identified is a spam website or not according to the determined ratio.

In this embodiment, an electronic device (e.g., a network monitoring server shown in fig. 1) on which the method for identifying websites operates may determine whether a website to be identified is a spam website according to the determined ratio.

In some optional implementation manners of this embodiment, when the determined ratio is greater than a preset abnormal webpage ratio threshold, it may be determined that the website to be identified is a spam website.
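Steps 203 and 204 amount to a division followed by a threshold comparison, which can be sketched as follows. The function names and the default threshold of 0.5 are illustrative assumptions; the source only states that the threshold is preset.

```python
def abnormal_ratio(abnormal_pages, page_set):
    """Number of abnormal pages in the page set divided by the set's size."""
    page_set = set(page_set)
    if not page_set:
        return 0.0  # an empty page set yields no evidence of spam
    # Only abnormal pages that belong to the page set are counted.
    return len(set(abnormal_pages) & page_set) / len(page_set)


def is_spam_website(abnormal_pages, page_set, ratio_threshold=0.5):
    """Spam if the abnormal page ratio is strictly greater than the threshold."""
    return abnormal_ratio(abnormal_pages, page_set) > ratio_threshold


# Hypothetical page identifiers; "x" is abnormal but belongs to another site.
site_pages = ["p1", "p2", "p3", "p4"]
flagged = ["p1", "p2", "x"]
ratio = abnormal_ratio(flagged, site_pages)
```

With these inputs the ratio is 2 / 4, so the site is classified as spam for a threshold of 0.4 but not for a threshold of 0.5, since the comparison is strict.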

In some optional implementation manners of this embodiment, after determining that the website to be identified is a spam website, the pictures, or the picture information and text information, of the determined spam website may be used to find and identify other spam websites.

In some optional implementation manners of this embodiment, finding and identifying other spam websites by using the pictures of the determined spam website can be implemented by the following steps: acquiring abnormal web pages of the determined spam website and a to-be-identified web page set, wherein the to-be-identified web page set comprises web pages of other preset websites except the spam website; parsing the abnormal web pages and each web page in the to-be-identified web page set, and extracting picture features of the pictures in the web pages to generate picture feature vectors; applying a clustering or classification algorithm to the obtained picture feature vectors to cluster or classify the abnormal web pages and the web pages in the to-be-identified web page set, obtaining at least one sub-web page set; determining the to-be-identified web pages in a sub-web page set that includes the abnormal web pages as spam web pages; and determining the websites to which the spam web pages belong as spam websites.

As an example, if the pictures in the abnormal web pages of the determined spam website are likely to have gambling as their subject, then the pictures in the to-be-identified web pages that fall in the same sub-web page set as the abnormal web pages are also likely to have gambling as their subject. Those to-be-identified web pages are therefore determined to be spam web pages, and the websites to which they belong are determined to be spam websites. In this way, a large number of spam websites can be identified quickly.
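The propagation from known abnormal pages to other websites can be sketched as below. The sketch assumes the clustering or classification has already been run and that its result is available as a mapping from page URL to sub-web page set label; `propagate_spam` and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def propagate_spam(cluster_of, known_abnormal):
    """Flag to-be-identified pages that share a cluster with known abnormal pages.

    cluster_of: dict mapping page URL -> label of its sub-web page set.
    known_abnormal: URLs of abnormal pages from the already determined spam site.
    Returns (spam page URLs, host names of the websites they belong to).
    """
    known_abnormal = set(known_abnormal)
    spam_clusters = {cluster_of[u] for u in known_abnormal if u in cluster_of}
    spam_pages = {
        url
        for url, label in cluster_of.items()
        if label in spam_clusters and url not in known_abnormal
    }
    spam_sites = {urlsplit(url).hostname for url in spam_pages}
    return spam_pages, spam_sites


# Hypothetical clustering result: label 0 holds gambling-themed pictures.
cluster_of = {
    "http://spam.example/a": 0,
    "http://other.example/x": 0,
    "http://news.example/y": 1,
}
spam_pages, spam_sites = propagate_spam(cluster_of, ["http://spam.example/a"])
```

The page on `other.example` lands in the same sub-web page set as the known abnormal page, so that page and its website are flagged, while `news.example` is left alone.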

In some optional implementation manners of this embodiment, finding and identifying other spam websites by using the picture information and text information of the determined spam website may be implemented by the following steps: acquiring abnormal web pages of the determined spam website and a to-be-identified web page set, wherein the to-be-identified web page set comprises web pages of other preset websites except the spam website; parsing the abnormal web pages and each web page in the to-be-identified web page set, and extracting and splicing the picture features and text features of each web page to generate a picture text feature vector; applying a clustering or classification algorithm to the obtained picture text feature vectors to cluster or classify the abnormal web pages and the web pages in the to-be-identified web page set, obtaining at least one sub-web page set; determining the to-be-identified web pages in a sub-web page set that includes the abnormal web pages as spam web pages; and determining the websites to which the spam web pages belong as spam websites.

As an example, the correlation between the picture information and the text information in an abnormal web page of the determined spam website is small; for instance, the picture text feature vector of the abnormal web page may be obtained by splicing a picture feature vector with a car theme and a text feature vector with a gambling theme. The picture text feature vector of a to-be-identified web page in the same sub-web page set as the abnormal web page is then also likely to combine a car-themed picture feature vector with a gambling-themed text feature vector. The picture information and text information of that to-be-identified web page are therefore determined to have a low correlation, the to-be-identified web page is determined to be a spam web page, and the website to which the spam web page belongs is determined to be a spam website. In this way, a large number of spam websites can be identified quickly.

An application scenario of the method for identifying a website according to the present embodiment is given below. First, the network monitoring server acquires a web page set of a website to be identified. Next, the network monitoring server identifies abnormal web pages in the web page set, an abnormal web page being a web page in which the correlation between the picture information and the text information is smaller than a preset correlation threshold. For example, in the web page shown in fig. 3, the subject of the picture is an automobile, whereas the subject of the text, a lottery slogan along the lines of "lottery drawing, buy now and win; an opportunity not to be missed", is the lottery, and the relevance between automobiles and the lottery is low. The network monitoring server then counts the ratio of abnormal web pages in the web page set. Finally, the network monitoring server determines whether the website to be identified is a spam website according to the ratio of abnormal web pages in the web page set.

The method provided by the embodiment of the application acquires the web page set of the website to be identified; identifies abnormal web pages in the web page set, wherein the correlation degree of the picture information and the text information in the abnormal web pages is smaller than a correlation degree threshold value; determines the ratio of the identified abnormal web pages in the web page set; and determines whether the website to be identified is a spam website according to the determined ratio. Because spam websites usually pile up pictures that are irrelevant to the accompanying text, this characteristic allows spam websites to be identified efficiently.

With continued reference to FIG. 4, a flow 400 of another embodiment of a method for identifying a website in accordance with the present application is shown. The method for identifying the website comprises the following steps:

step 401, acquiring a web page set of a website to be identified.

In this embodiment, an electronic device (for example, the network monitoring server shown in fig. 1) on which the method for identifying a website operates may acquire the web page set of the website to be identified in various ways.

Step 402, determining a webpage set to be clustered, and clustering pictures in webpages in the webpage set to be clustered by using a clustering algorithm to obtain a picture cluster.

In this embodiment, the electronic device may first determine a to-be-clustered web page set, and then cluster pictures in web pages in the to-be-clustered web page set by using a clustering algorithm to obtain one or more picture clusters. Here, the set of web pages to be clustered includes web pages in the set of web pages and web pages of other preset web sites except the web site to be identified.

It will be appreciated that the purpose of this step is to obtain a large collection of web pages to assist in identifying the web pages in the collection of web pages. The number of the obtained picture clusters may be one or more, and for convenience of description, the following steps are described with respect to one picture cluster.

Step 403, determining the webpage to which the pictures in the picture cluster belong.

In this embodiment, the electronic device may determine the web page to which each picture in the picture cluster determined in step 402 belongs. For a certain determined picture cluster, the webpage to which the pictures in the picture cluster belong can be determined.
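Step 403 is essentially a lookup from pictures back to the pages that contain them. A minimal sketch, assuming each page is recorded with the identifiers of its pictures (the data layout and names are hypothetical):

```python
def pages_of_picture_cluster(picture_cluster, pictures_by_page):
    """Return the pages that contain at least one picture of the cluster.

    picture_cluster: iterable of picture identifiers in one picture cluster.
    pictures_by_page: dict mapping page URL -> list of its picture identifiers.
    """
    cluster = set(picture_cluster)
    return sorted(
        page
        for page, pictures in pictures_by_page.items()
        if cluster.intersection(pictures)
    )


# Hypothetical picture identifiers produced by the clustering in step 402.
pictures_by_page = {
    "site-a.example/1": ["img1", "img2"],
    "site-b.example/7": ["img3"],
    "site-c.example/4": ["img4"],
}
car_cluster = ["img1", "img3"]
car_pages = pages_of_picture_cluster(car_cluster, pictures_by_page)
```

The resulting pages may come from the website to be identified or from the other preset websites; the texts of all of them feed into step 404.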

Step 404, identifying the abnormal web page based on the abnormal corresponding text in the corresponding text set.

In this embodiment, the electronic device may identify the abnormal web page based on the abnormal corresponding text in the corresponding text set. Here, the corresponding text set includes texts corresponding to web pages to which the pictures in the picture clusters belong, and semantic similarity between the abnormal corresponding text and other corresponding texts in the corresponding text set except the abnormal corresponding text is smaller than a preset first semantic similarity threshold.

In this embodiment, the corresponding text of the web page may be all the text in the web page, a main text extracted from all the text, a sub-text in each text field of the web page, or a search text. Here, the search text may be generated by first acquiring the search expressions for which the web page is presented as a search result, and then parsing the acquired search expressions and extracting keywords.

In some optional implementations of this embodiment, step 404 may be implemented by the flow 500 shown in fig. 5:

step 501, extracting texts in a webpage to which each picture in the picture cluster belongs, and generating a text set.

In this implementation manner, the electronic device may extract the text in the webpage to which each picture in the picture cluster obtained in step 403 belongs, and generate a text set.

In this implementation manner, the extracted text of the web page may be all the text of the web page, or may be a main text that reflects the topic of the web page. As an example, advertisement text may be removed from the whole text of the web page to obtain the main text.

Step 502, identifying abnormal texts in the text set.

In this implementation, the electronic device may identify an abnormal text in the text set generated in step 501. Here, the semantic similarity between the abnormal text and the text other than the abnormal text in the text set is smaller than a second semantic similarity threshold.

Optionally, identifying abnormal text in the text set may be achieved by: based on the semantic similarity among the texts in the text set, performing a clustering operation on the text feature vectors corresponding to the texts to obtain a clustering center; determining the text feature vectors whose distance from the clustering center is greater than a preset distance threshold; and identifying the texts corresponding to the determined text feature vectors as abnormal texts. How to apply a clustering operation to the text feature vectors is well known to those skilled in the art and is not described here again.
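The distance-from-center test can be sketched as follows. As a simplification, the clustering center is taken to be the mean of all text feature vectors (a single cluster) and distance is Euclidean; the source fixes neither choice, and the toy vectors and threshold are assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def find_abnormal_texts(vectors_by_text, distance_threshold):
    """Return the ids of texts whose vector lies far from the clustering center."""
    center = centroid(list(vectors_by_text.values()))
    return [
        text_id
        for text_id, vector in vectors_by_text.items()
        if math.dist(vector, center) > distance_threshold
    ]


# Toy 2-d text features: index 0 = "car" topic, index 1 = "lottery" topic.
vectors_by_text = {
    "page-a": [1.0, 0.0],
    "page-b": [1.0, 0.1],
    "page-c": [0.9, 0.0],
    "page-d": [0.0, 1.0],  # lottery text next to car pictures
}
abnormal_texts = find_abnormal_texts(vectors_by_text, 0.8)
```

Only `page-d` lies more than 0.8 away from the center, so its text is the one flagged. The same routine applies unchanged to the sub-text and search text feature vectors used in the other flows.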

Step 503, in response to the identified abnormal text being extracted from the web pages in the web page set, identifying the web page to which the abnormal text belongs as an abnormal web page.

In this implementation manner, in response to the identified abnormal text having been extracted from a web page in the web page set, the electronic device may identify the web page to which the abnormal text belongs as an abnormal web page in the web page set.

It is understood that the identified abnormal text may instead have been extracted from a web page of one of the other preset websites rather than the website to be identified. In this case, the web page to which the identified abnormal text belongs is also an abnormal web page, but it is not an abnormal web page in the above-mentioned web page set and is not counted in the statistics of abnormal web pages in the method of this embodiment.

Optionally, step 404 may also be implemented by the flow 600 shown in fig. 6:

step 601, for each of the web pages to which the pictures belong, parsing the web page and extracting the sub-text in each text field of the web page.

In this implementation manner, the electronic device may parse each of the web pages to which the pictures in the picture cluster belong, and extract the sub-text in each text field of the web page.

In this implementation, a text field may be a region for placing text at a particular position of the page, for example, a page title text field, a picture title text field, a navigation bar text field, and the like. Accordingly, the sub-texts in the respective text fields may be a page title sub-text, a picture title sub-text, a navigation bar sub-text, and the like. It should be noted that a sub-text here is defined relative to the entire text of the page. By way of example, the page title sub-text is the entire text in the page title text field; it is a sub-text relative to the entire text of the page, not a portion of the text in the page title text field.

Step 602, dividing the extracted sub-texts according to the text fields, and generating a plurality of sub-text sets associated with the text fields.

In this implementation manner, the electronic device may divide the sub-texts extracted in step 601 according to text field to generate a plurality of sub-text sets, where each sub-text set is associated with one text field. As an example, all the sub-texts in the page title sub-text set obtained by the division are page title sub-texts, and all the sub-texts in the picture title sub-text set obtained by the division are picture title sub-texts.

Step 603, for each sub-text set in the plurality of sub-text sets, identifying abnormal sub-text in the sub-text set.

In this implementation manner, the electronic device may identify, for each sub-text set in the plurality of sub-text sets, the abnormal sub-texts in that sub-text set. Here, the semantic similarity between an abnormal sub-text and the other sub-texts in the sub-text set is smaller than a second semantic similarity threshold. As an example, if most of the sub-texts in a sub-text set have cars as their subject while a few have animals as their subject, the sub-texts whose subject is animals are identified as abnormal sub-texts.

Optionally, identifying the abnormal sub-text in the sub-text set may be implemented by: based on semantic similarity among the sub texts in the sub text set, carrying out clustering operation on the sub text feature vectors corresponding to the sub texts to obtain a clustering center; determining a sub-text characteristic vector with the distance from the clustering center greater than a preset distance threshold; and identifying the sub-text corresponding to the determined sub-text feature vector as the abnormal sub-text.

Step 604, for each web page in the web page set, determining a ratio of the number of abnormal sub-texts in the web page to the total number of text fields of the web page, and identifying the web page as an abnormal web page in response to the ratio being greater than an abnormal text field threshold.

In this implementation manner, the electronic device may determine, for each web page in the web page set, a ratio of the number of abnormal sub-texts in the web page to the total number of text fields of the web page.

Optionally, the abnormal sub-texts identified in step 603 may be intersected with the sub-texts of the web pages in the web page set to determine which sub-texts of those web pages are abnormal, after which the number of abnormal sub-texts in each web page in the web page set is determined.
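Steps 602 and 604 can be sketched as below. The sketch assumes each page has already been parsed into a mapping from text field name to sub-text and that step 603 produced a set of (page, text field) pairs marking the abnormal sub-texts; the field names, data layout and threshold are hypothetical.

```python
from collections import defaultdict


def group_by_text_field(pages):
    """Step 602: regroup {page: {field: sub_text}} into {field: {page: sub_text}}."""
    sub_text_sets = defaultdict(dict)
    for page, fields in pages.items():
        for field, sub_text in fields.items():
            sub_text_sets[field][page] = sub_text
    return dict(sub_text_sets)


def abnormal_pages_by_field_ratio(pages, abnormal, page_set, field_threshold):
    """Step 604: flag pages whose share of abnormal text fields exceeds the threshold.

    abnormal: set of (page, field) pairs identified as abnormal in step 603.
    page_set: pages of the website to be identified (other pages are ignored).
    """
    flagged = []
    for page in page_set:
        fields = pages.get(page, {})
        if not fields:
            continue  # no text fields, so no ratio can be computed
        bad = sum(1 for field in fields if (page, field) in abnormal)
        if bad / len(fields) > field_threshold:
            flagged.append(page)
    return flagged


pages = {
    "u1": {"page_title": "cheap cars", "picture_title": "lottery", "nav": "win now"},
    "u2": {"page_title": "car review", "picture_title": "sedan", "nav": "win now"},
}
abnormal = {("u1", "picture_title"), ("u1", "nav"), ("u2", "nav")}
flagged = abnormal_pages_by_field_ratio(pages, abnormal, ["u1", "u2"], 0.5)
```

Page `u1` has two abnormal sub-texts out of three text fields and exceeds the 0.5 threshold, whereas `u2` has only one out of three and is not flagged.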

Alternatively, step 404 may be implemented by the flow 700 shown in FIG. 7:

step 701, for each determined web page, obtaining the search expressions for which the web page is presented as a search result, parsing the obtained search expressions and extracting keywords to generate a search text.

In this implementation manner, for each web page determined in step 403, the electronic device may obtain a search expression when the web page is presented as a search result, and parse the obtained search expression to extract a keyword to generate a search text.

As an example, suppose a certain web page reports that a tiger injured a person in a zoo. The search expressions for which this web page is presented as a search result may be "tiger hurting person", "zoo accident", and "tiger", and the search text generated by parsing these search expressions and extracting keywords may be "zoo, tiger, hurting person, accident".
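A simplified sketch of generating a search text from search expressions is shown below. It splits on whitespace and removes duplicates while preserving first-seen order; a production system would more likely use a word segmenter and phrase detection, so the keyword list differs from the hand-written example above. The function name and the optional stop-word set are hypothetical.

```python
def build_search_text(search_expressions, stop_words=frozenset()):
    """Extract de-duplicated keywords from the search expressions of one page."""
    keywords = []
    seen = set()
    for expression in search_expressions:
        for word in expression.lower().split():
            if word not in seen and word not in stop_words:
                seen.add(word)
                keywords.append(word)
    return keywords


# Search expressions taken from the example above.
expressions = ["tiger hurting person", "zoo accident", "tiger"]
search_text = build_search_text(expressions)
```

The keyword list can then be converted into a search text feature vector and compared with the search texts of the other pages in the same picture cluster.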

Step 702, for a search text set including search texts of web pages to which each picture belongs, identifying abnormal search texts in the search text set.

In this implementation manner, the electronic device may identify an abnormal search text in the search text set. The search text set comprises search texts of web pages to which the pictures belong.

Alternatively, identifying abnormal search text in the set of search text may be accomplished by: based on semantic similarity among all search texts in the search text set, carrying out clustering operation on search text feature vectors corresponding to all the search texts to obtain a clustering center; determining a search text characteristic vector with the distance from the clustering center greater than a preset distance threshold; and identifying the search text corresponding to the determined search text feature vector as an abnormal search text.

Step 703, in response to the web page corresponding to the identified abnormal search text being a web page in the web page set, determining that web page to be an abnormal web page.

In this implementation manner, the electronic device may, in response to the web page corresponding to the identified abnormal search text being a web page in the web page set, determine that web page to be the abnormal web page.

Step 405, determining the ratio of the identified abnormal web pages in the web page set.

Step 406, determining whether the website to be identified is a spam website according to the determined ratio.

The implementation of step 405 and step 406 can refer to the description in step 203 and step 204, and will not be described herein again.

As can be seen from FIG. 4, compared with the embodiment corresponding to FIG. 2, the flow 400 of the method for identifying a website in this embodiment highlights clustering a large number of web pages based on the content of their pictures and then examining the texts corresponding to the web pages in each cluster to determine abnormal texts. This embodiment identifies abnormal web pages by relying on two observations: the texts of web pages that contain similar pictures should have similar topics, and the picture information and text information in web pages of spam websites often differ greatly. The scheme described in this embodiment can therefore efficiently identify abnormal web pages and, in turn, spam websites.

With continued reference to FIG. 8, a flow 800 of one embodiment of a method for identifying a website in accordance with the present application is shown. The method for identifying the website comprises the following steps:

Step 801, acquiring a webpage set of a website to be identified.

Step 802, determining a text corresponding to each web page in the web page set, and respectively extracting picture features of pictures in the web pages and text features of the corresponding texts to generate picture feature vectors and corresponding text feature vectors.

In this embodiment, the electronic device may determine a corresponding text of each web page in the web page set, and respectively extract an image feature of an image in the web page and a text feature of the corresponding text, so as to generate an image feature vector and a corresponding text feature vector.

Step 803, for each webpage in the webpage set, importing the generated picture feature vector and the corresponding text feature vector into a pre-established recognition model, and determining the correlation between the imported picture feature vector and the corresponding text feature vector.

In this embodiment, the electronic device may import the generated picture feature vector and the corresponding text feature vector into a pre-established recognition model for each web page in the web page set, and determine a correlation between the imported picture feature vector and the corresponding text feature vector.

In this embodiment, the recognition model may be trained on manually labeled samples or on samples mined from user behavior logs. Here, each sample may be a picture-text pair.

In some optional implementations of this embodiment, the corresponding text may include text in each webpage in the set of webpages, and step 803 may include: and for each webpage in the webpage set, importing the generated picture characteristic vector and the text characteristic vector into a pre-established identification model, and determining the correlation degree between the imported picture characteristic vector and the text characteristic vector.

In some optional implementations of this embodiment, the corresponding text may include a search text in each web page in the web page set, and step 803 may include: and for each webpage in the webpage set, importing the generated picture characteristic vector and the search text characteristic vector into a pre-established identification model, and determining the correlation degree between the imported picture characteristic vector and the search text characteristic vector.

In some optional implementations of this embodiment, the corresponding text includes a sub-text in at least one text field in the web page, and step 803 may include: analyzing each webpage in the webpage set, and respectively extracting picture features and sub-text features in the webpage to generate a picture feature vector and a sub-text feature vector set, wherein the sub-text is a text in at least one text field in the webpage; and for each sub-text feature vector in the sub-text feature vector set, importing the sub-text feature vector and the picture feature vector into a pre-established identification model, and determining the correlation degree between the sub-text feature vector and the picture feature vector.

Step 804, determining whether each web page in the web page set is the abnormal web page according to the determined correlation.

In this embodiment, the electronic device may determine, for each web page in the web page set, whether the web page is the abnormal web page according to the determined relevance.

In some optional implementation manners of this embodiment, the corresponding text includes text or search text in each webpage in the set of webpages, and step 804 may be implemented by: and for each webpage in the webpage set, determining the webpage to be the abnormal webpage in response to the fact that the determined relevance is smaller than a relevance threshold.
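The thresholding in step 804 can be sketched as below. The application leaves the recognition model unspecified, so this sketch substitutes cosine similarity as a stand-in, which further assumes that picture and text feature vectors live in the same embedding space. The names `cosine_similarity` and `find_abnormal_pages` and all numeric values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally sized vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_abnormal_pages(pages, relevance_model, relevance_threshold):
    """Return ids of pages whose picture/text relevance is below the threshold.

    pages: dict mapping page id -> (picture_vector, text_vector).
    relevance_model: callable scoring a (picture_vector, text_vector) pair;
        here a stand-in for the pre-established recognition model.
    """
    return [
        page_id
        for page_id, (picture_vector, text_vector) in pages.items()
        if relevance_model(picture_vector, text_vector) < relevance_threshold
    ]


pages = {
    "p1": ([1.0, 0.0], [0.9, 0.1]),  # picture and text agree
    "p2": ([1.0, 0.0], [0.0, 1.0]),  # picture and text are unrelated
}
abnormal_pages = find_abnormal_pages(pages, cosine_similarity, relevance_threshold=0.5)
print(abnormal_pages)  # ['p2']
```

Passing the model as a callable keeps the thresholding logic independent of how the relevance score is actually produced.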

In some optional implementations of this embodiment, the corresponding text includes a sub-text in at least one text field in the web page, and step 804 may be implemented by: determining the sub-text as abnormal sub-text in response to the determined degree of correlation being less than a degree of correlation threshold; and for each webpage in the webpage set, determining the ratio of the number of abnormal sub-texts in the webpage to the total number of text fields of the webpage, and identifying the webpage as an abnormal webpage in response to the ratio being greater than an abnormal text field threshold value.
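The sub-text variant of step 804 amounts to two thresholds applied in sequence, which the following sketch makes concrete. The function name `is_abnormal_page`, the parameter names, and the sample values are assumptions chosen for illustration; the per-sub-text relevance scores are taken as already computed by the identification model.

```python
def is_abnormal_page(subtext_relevances, total_text_fields,
                     relevance_threshold, abnormal_field_threshold):
    """Decide whether a page is abnormal from its per-sub-text relevance scores.

    A sub-text is abnormal when its relevance to the page's picture is below
    relevance_threshold; the page is abnormal when the share of abnormal
    sub-texts among all text fields exceeds abnormal_field_threshold.
    """
    if total_text_fields == 0:
        return False  # assumption: a page without text fields is not flagged
    abnormal_count = sum(1 for r in subtext_relevances if r < relevance_threshold)
    return abnormal_count / total_text_fields > abnormal_field_threshold


# Three of four fields (title, body, anchor text, ...) disagree with the picture.
flagged = is_abnormal_page([0.9, 0.2, 0.1, 0.3], total_text_fields=4,
                           relevance_threshold=0.5, abnormal_field_threshold=0.5)
# Only one of four fields disagrees.
not_flagged = is_abnormal_page([0.9, 0.8, 0.1, 0.7], total_text_fields=4,
                               relevance_threshold=0.5, abnormal_field_threshold=0.5)
print(flagged, not_flagged)  # True False
```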

Step 805, determining the ratio of the identified abnormal web pages in the web page set.

Step 806, determining whether the website to be identified is a spam website according to the determined ratio.

The implementation of step 805 and step 806 can refer to the description in step 203 and step 204, and will not be described herein again.

As can be seen from fig. 8, compared with the embodiment corresponding to fig. 2, the process 800 of the method for identifying a website in this embodiment highlights that the correlation between the pictures and the corresponding texts in the web page is determined by using the pre-established identification model, so as to quickly identify the abnormal web page and the spam website.

With further reference to fig. 9, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for identifying a website, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.

As shown in FIG. 9, the apparatus 900 for identifying a website of this embodiment includes an acquiring unit 901, an identifying unit 902, a ratio determining unit 903, and a spam website determining unit 904. The acquiring unit 901 is configured to acquire a web page set of a website to be identified; the identifying unit 902 is configured to identify abnormal web pages in the web page set, where the correlation between picture information and text information in an abnormal web page is smaller than a correlation threshold; the ratio determining unit 903 is configured to determine the ratio of the identified abnormal web pages in the web page set; and the spam website determining unit 904 is configured to determine whether the website to be identified is a spam website according to the determined ratio.

In this embodiment, the acquiring unit 901 of the apparatus 900 for identifying a website may acquire the web page set of the website to be identified in various ways.

In this embodiment, the identifying unit 902 of the apparatus 900 for identifying a website may identify the abnormal web pages in the web page set by various methods.

In this embodiment, the ratio determining unit 903 of the apparatus 900 for identifying a website may determine the ratio of the identified abnormal web pages in the web page set.

In this embodiment, the spam website determining unit 904 of the apparatus 900 for identifying a website may determine whether the website to be identified is a spam website according to the determined ratio.

In some optional implementations of the present embodiment, the identification unit includes a first identification subunit (not shown) configured to: determining a webpage set to be clustered, and clustering pictures in webpages in the webpage set to be clustered by using a clustering algorithm to obtain a picture cluster, wherein the webpage set to be clustered comprises the webpages in the webpage set and webpages of other preset websites except the website to be identified; determining a webpage to which the pictures in the picture cluster belong; and identifying the abnormal web page based on an abnormal corresponding text in a corresponding text set, wherein the corresponding text set comprises a text corresponding to a web page to which the picture in the picture cluster belongs, and the semantic similarity between the abnormal corresponding text and other corresponding texts in the corresponding text set except the abnormal corresponding text is smaller than a first semantic similarity threshold.

In some optional implementations of this embodiment, the first identifying subunit is further configured to: extracting texts in the webpage to which each picture in the picture cluster belongs to generate a text set; identifying abnormal texts in the text set, wherein the semantic similarity between the abnormal texts and other texts in the text set except the abnormal texts is smaller than a second semantic similarity threshold; and responding to the identified abnormal text extracted from the web pages in the web page set, and identifying the web page to which the abnormal text belongs as the abnormal web page.

In some optional implementations of this embodiment, the first identifying subunit is further configured to: based on semantic similarity among texts in the text set, performing clustering operation on text feature vectors corresponding to the texts to obtain a clustering center; determining a text characteristic vector with a distance from the clustering center greater than a preset distance threshold; and identifying the text corresponding to the determined text feature vector as abnormal text.

In some optional implementations of this embodiment, the first identifying subunit is further configured to: for each determined webpage, analyzing the webpage and extracting sub-texts in each text field of the webpage; dividing the extracted sub-texts according to the text fields to generate a plurality of sub-text sets associated with the text fields; for each sub-text set in the plurality of sub-text sets, identifying abnormal sub-text in the sub-text set, wherein the semantic similarity between the abnormal sub-text and other sub-text in the sub-text set except the abnormal sub-text is smaller than a third semantic similarity threshold.

In some optional implementations of this embodiment, the first identifying subunit is further configured to: and for each webpage in the webpage set, determining the ratio of the number of abnormal sub-texts in the webpage to the total number of text fields of the webpage, and identifying the webpage as an abnormal webpage in response to the ratio being greater than an abnormal text field threshold value.

In some optional implementations of this embodiment, the first identifying subunit is further configured to: for each determined webpage, acquiring the search expression used when the webpage is presented as a search result, parsing the acquired search expression and extracting keywords to generate a search text; for a search text set comprising the search texts of the webpages to which the pictures belong, identifying abnormal search texts in the search text set, wherein the semantic similarity between the abnormal search texts and other search texts in the search text set except the abnormal search texts is smaller than a fourth semantic similarity threshold; and in response to the webpage corresponding to the identified abnormal search text being a webpage in the webpage set, determining the webpage corresponding to the identified abnormal search text as the abnormal webpage.

In some optional implementations of this embodiment, the identification unit includes a second identification subunit (not shown) configured to: determining a corresponding text of each webpage in the webpage set, and respectively extracting picture features of pictures in the webpage and text features of the corresponding text to generate a picture feature vector and a corresponding text feature vector; for each webpage in the webpage set, importing the generated picture feature vector and the corresponding text feature vector into a pre-established identification model, and determining the correlation degree between the imported picture feature vector and the corresponding text feature vector, wherein the identification model is used for representing the correspondence between a pair consisting of a picture feature vector and a corresponding text feature vector and the correlation degree between the two; and determining whether each webpage in the webpage set is the abnormal webpage according to the determined correlation degree.

In some optional implementations of this embodiment, the corresponding text includes a text or a search text in each web page in the web page set, where the search text is generated by parsing the search expression used when the web page is presented as a search result and extracting keywords; and the second identification subunit is further configured to: for each webpage in the webpage set, determining the webpage to be the abnormal webpage in response to the determined relevance being smaller than a relevance threshold.

In some optional implementations of this embodiment, the corresponding text includes a sub-text in at least one text field in the web page; and the second identification subunit is further configured to: determining the sub-text as abnormal sub-text in response to the determined degree of correlation being less than a degree of correlation threshold; and for each webpage in the webpage set, determining the ratio of the number of abnormal sub-texts in the webpage to the total number of text fields of the webpage, and identifying the webpage as an abnormal webpage in response to the ratio being greater than an abnormal text field threshold value.

In some optional implementations of this embodiment, the identification unit includes a third identification subunit (not shown) configured to: acquiring a preset picture text feature vector set, wherein each preset picture text feature vector is generated by analyzing an identified abnormal web page of a spam website and extracting and splicing its picture features and text features; respectively extracting and splicing the picture features and the text features of each webpage in the webpage set to generate a picture text feature vector to be identified; and for each webpage in the webpage set, determining that the webpage is an abnormal webpage in response to the similarity between the picture text feature vector to be identified of the webpage and at least one preset picture text feature vector in the preset picture text feature vector set being greater than a picture text threshold value.
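The matching against preset picture text feature vectors can be sketched as follows. The application does not say how features are spliced or which similarity measure is used; this sketch assumes splicing means concatenation and uses cosine similarity. The names `splice` and `matches_known_spam` and the sample vectors are hypothetical.

```python
import math


def splice(picture_features, text_features):
    """Concatenate picture and text features into one picture text feature vector."""
    return list(picture_features) + list(text_features)


def cosine_similarity(a, b):
    """Cosine similarity of two equally sized vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches_known_spam(page_vector, preset_vectors, picture_text_threshold):
    """True if the page vector is similar enough to at least one preset vector."""
    return any(
        cosine_similarity(page_vector, preset) > picture_text_threshold
        for preset in preset_vectors
    )


# Preset vectors built from abnormal pages of already identified spam websites.
presets = [splice([1.0, 0.0], [0.0, 1.0])]

near_copy = splice([0.9, 0.1], [0.1, 0.9])
unrelated = splice([0.0, 1.0], [1.0, 0.0])
print(matches_known_spam(near_copy, presets, 0.9))  # True
print(matches_known_spam(unrelated, presets, 0.9))  # False
```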

In some optional implementation manners of this embodiment, the spam website determining unit is further configured to: and when the determined ratio is larger than the abnormal webpage ratio threshold value, determining the website to be identified as a spam website.
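The site-level decision reduces to a ratio and a comparison, sketched below. The function names `abnormal_ratio` and `is_spam_website` and the threshold value 0.6 are illustrative assumptions; the application does not give a concrete threshold.

```python
def abnormal_ratio(page_ids, abnormal_page_ids):
    """Share of the site's pages that were identified as abnormal."""
    pages = set(page_ids)
    if not pages:
        return 0.0  # assumption: an empty page set yields ratio 0
    return len(pages & set(abnormal_page_ids)) / len(pages)


def is_spam_website(page_ids, abnormal_page_ids, abnormal_ratio_threshold):
    """The site is spam when the abnormal-page ratio exceeds the threshold."""
    return abnormal_ratio(page_ids, abnormal_page_ids) > abnormal_ratio_threshold


site_pages = ["a", "b", "c", "d"]
# "x" belongs to another site and is ignored by the intersection.
abnormal_pages = ["b", "c", "d", "x"]

ratio = abnormal_ratio(site_pages, abnormal_pages)
print(ratio)                                            # 0.75
print(is_spam_website(site_pages, abnormal_pages, 0.6))  # True
```

Intersecting with the site's own page set matters when abnormal pages were found by clustering together with pages of other websites.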

In some optional implementations of this embodiment, the apparatus further includes a first search unit (not shown) configured to: acquiring abnormal webpages of the determined spam websites and a set of webpages to be identified, wherein the set of webpages to be identified comprises webpages of other preset websites except the spam websites; analyzing the abnormal webpages and each webpage in the webpage set to be identified, and extracting picture features of pictures in the webpages to generate picture feature vectors; applying a clustering or classification algorithm to the generated picture feature vectors so as to cluster or classify the pictures, obtaining at least one webpage subset; determining the webpages to be identified in a webpage subset that includes the abnormal webpages as spam webpages; and determining the website to which a spam webpage belongs as a spam website.
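The idea of spreading a spam label from known abnormal pages to other pages with similar pictures can be sketched as below. The application only says "a clustering or classifying algorithm"; this sketch swaps in a simple single-link grouping by Euclidean distance, implemented with union-find. The names `group_by_distance` and `find_spam_pages`, the `radius` parameter, and the sample vectors are all assumptions.

```python
import math


def group_by_distance(vectors, radius):
    """Single-link grouping: ids whose vectors lie within `radius` of each
    other (directly or through a chain) end up in the same group."""
    ids = list(vectors)
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for index, a in enumerate(ids):
        for b in ids[index + 1:]:
            if math.dist(vectors[a], vectors[b]) <= radius:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())


def find_spam_pages(abnormal_vectors, candidate_vectors, radius):
    """Candidates that share a group with at least one known abnormal page."""
    abnormal_ids = set(abnormal_vectors)
    candidate_ids = set(candidate_vectors)
    spam = set()
    for group in group_by_distance({**abnormal_vectors, **candidate_vectors}, radius):
        if group & abnormal_ids:
            spam |= group & candidate_ids
    return spam


known_abnormal = {"spam1": [0.0, 0.0]}
candidates = {"c1": [0.1, 0.0], "c2": [5.0, 5.0], "c3": [0.15, 0.05]}
spam_pages = find_spam_pages(known_abnormal, candidates, radius=0.2)
print(sorted(spam_pages))  # ['c1', 'c3']
```

The websites hosting the returned pages would then be the spam website candidates; the pairwise loop is quadratic, so a production system would need an approximate nearest-neighbor index or a scalable clustering method.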

In some optional implementations of this embodiment, the apparatus further includes a second search unit (not shown) configured to: acquiring abnormal webpages of the determined spam websites and a set of webpages to be identified, wherein the set of webpages to be identified comprises webpages of other preset websites except the spam websites; analyzing the abnormal webpages and each webpage in the webpage set to be identified, and extracting and splicing picture features and text features of each webpage to generate a picture text feature vector; applying a clustering or classification algorithm to the generated picture text feature vectors so as to cluster or classify the webpages, obtaining at least one webpage subset; determining the webpages to be identified in a webpage subset that includes the abnormal webpages as spam webpages; and determining the website to which a spam webpage belongs as a spam website.

The implementation details and technical effects of each unit and each subunit in this embodiment may refer to the descriptions in other embodiments of this application, and are not described herein again.

Referring now to FIG. 10, a block diagram of a computer system 1000 suitable for use in implementing a server according to embodiments of the present application is shown. The server shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU)1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the system 1000 are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.

The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, and the like. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage section 1008 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. The above-described functions defined in the method of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 1001. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an acquiring unit, an identifying unit, a ratio determining unit, and a spam website determining unit. The names of the units do not in some cases limit the units themselves; for example, the acquiring unit may also be described as a "unit for acquiring a set of web pages of a website to be identified".

As another aspect, the present application also provides a computer-readable medium, which may be contained in the server described in the above embodiments, or may exist separately without being assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: acquire a webpage set of a website to be identified; identify abnormal webpages in the webpage set, wherein the correlation degree of the picture information and the text information in the abnormal webpages is smaller than a correlation degree threshold value; determine the ratio of the identified abnormal web pages in the web page set; and determine whether the website to be identified is a spam website according to the determined ratio.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method for identifying a website, the method comprising:

acquiring a webpage set of a website to be identified;

identifying abnormal webpages in the webpage set based on image-text information in the webpages, wherein the correlation degree of the image information and the text information in the abnormal webpages is smaller than a correlation degree threshold value;

determining a ratio of the identified abnormal web pages in the web page set;

and determining whether the website to be identified is a spam website or not according to the determined ratio.

2. The method of claim 1, wherein identifying abnormal web pages in the set of web pages based on the image-text information in the web pages comprises:

determining a webpage set to be clustered, and clustering pictures in webpages in the webpage set to be clustered by using a clustering algorithm to obtain a picture cluster, wherein the webpage set to be clustered comprises webpages in the webpage set and webpages of other preset websites except the website to be identified;

determining a webpage to which the pictures in the picture cluster belong;

and identifying the abnormal web page based on an abnormal corresponding text in a corresponding text set, wherein the corresponding text set comprises a text corresponding to a web page to which the picture in the picture cluster belongs, and the semantic similarity between the abnormal corresponding text and other corresponding texts in the corresponding text set except the abnormal corresponding text is smaller than a first semantic similarity threshold.

3. The method of claim 2, wherein identifying the abnormal web page based on abnormal text in the corresponding text set comprises:

extracting texts in the webpage to which each picture in the picture cluster belongs to generate a text set;

identifying abnormal texts in the text set, wherein the semantic similarity of the abnormal texts and other texts in the text set except the abnormal texts is smaller than a second semantic similarity threshold value;

and in response to the identified abnormal text being extracted from the web pages in the web page set, identifying the web page to which the abnormal text belongs as the abnormal web page.

4. The method of claim 3, wherein the identifying the abnormal text in the set of texts comprises:

based on semantic similarity among texts in the text set, performing clustering operation on text feature vectors corresponding to the texts to obtain a clustering center;

determining a text characteristic vector with a distance from the clustering center greater than a preset distance threshold;

and identifying the text corresponding to the determined text feature vector as abnormal text.

5. The method of claim 2, wherein identifying the abnormal web page based on abnormal text in the corresponding text set comprises:

for each determined webpage, analyzing the webpage and extracting sub-texts in each text field of the webpage;

dividing the extracted sub-texts according to the text fields to generate a plurality of sub-text sets associated with the text fields;

for each sub-text set in the plurality of sub-text sets, identifying abnormal sub-text in the sub-text set, wherein the semantic similarity between the abnormal sub-text and other sub-text in the sub-text set except the abnormal sub-text is smaller than a third semantic similarity threshold.

6. The method of claim 5, wherein identifying the abnormal web page based on abnormal text in the corresponding text set further comprises:

and for each webpage in the webpage set, determining the ratio of the number of abnormal sub-texts in the webpage to the total number of text fields of the webpage, and identifying the webpage as an abnormal webpage in response to the ratio being greater than an abnormal text field threshold value.

7. The method of claim 2, wherein identifying the abnormal web page based on abnormal text in the corresponding text set comprises:

for each determined webpage, acquiring a search formula when the webpage is presented as a search result, analyzing the acquired search formula and extracting keywords to generate a search text;

identifying abnormal search texts in a search text set for the search text set comprising the search texts of the webpages to which the pictures belong, wherein the semantic similarity between the abnormal search texts and other search texts except the abnormal search texts in the search text set is smaller than a fourth semantic similarity threshold;

and responding to the identified webpage corresponding to the abnormal search text as the webpage in the webpage set, and determining the identified webpage corresponding to the abnormal search text as the abnormal webpage.

8. The method of claim 1, wherein identifying abnormal web pages in the set of web pages based on the image-text information in the web pages comprises:

determining a corresponding text of each webpage in the webpage set, and respectively extracting picture features of pictures in the webpage and text features of the corresponding text to generate picture feature vectors and corresponding text feature vectors;

for each webpage in the webpage set, importing the generated picture feature vector and the corresponding text feature vector into a pre-established identification model, and determining the correlation degree between the imported picture feature vector and the corresponding text feature vector, wherein the identification model is used for representing the correspondence between a pair consisting of a picture feature vector and a corresponding text feature vector and the correlation degree between the two;

and for each webpage in the webpage set, determining whether the webpage is the abnormal webpage or not according to the determined correlation.

9. The method of claim 8, wherein the corresponding text comprises text in a web page or search text, wherein the search text is generated by parsing a search expression of the web page when presented as a search result and extracting keywords; and

determining, for each web page in the web page set, whether the web page is the abnormal web page according to the determined correlation degree comprises:

for each web page in the web page set, in response to the determined relevance being less than a relevance threshold, determining that the web page is the abnormal web page.
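The thresholding of claims 8 and 9 can be sketched as follows. The claims rely on a pre-established identification model that is not specified further; as a loudly labelled assumption, cosine similarity between a picture feature vector and a text feature vector that share one embedding space stands in for that model here. The function names and the default threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Stand-in for the identification model (an assumption, not the
    patented model): cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_abnormal_pages(pages, relevance_threshold=0.5, model=cosine):
    """pages maps a URL to (picture_vector, text_vector).

    A page is abnormal when the model's correlation degree for its
    picture and text vectors is below the threshold.
    """
    return [
        url
        for url, (pic_vec, txt_vec) in pages.items()
        if model(pic_vec, txt_vec) < relevance_threshold
    ]


pages = {
    "ok": ([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]),   # picture and text agree
    "bad": ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # picture and text unrelated
}
abnormal = find_abnormal_pages(pages)
```

Passing the model as a parameter keeps the thresholding step independent of how the correlation degree is actually produced, so a trained model could be substituted for `cosine` without changing the rest of the sketch.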

10. The method of claim 8, wherein the corresponding text comprises a sub-text in at least one text field of the web page; and

determining the sub-text as abnormal sub-text in response to the determined degree of correlation being less than a degree of correlation threshold;

11. The method of claim 1, wherein identifying abnormal web pages in the set of web pages based on the image-text information in the web pages comprises:

acquiring a preset picture text feature vector set, wherein the preset picture text feature vector is generated by analyzing the identified abnormal web pages of the spam website, extracting and splicing picture features and text features;

respectively extracting and splicing picture features and text features of each webpage in the webpage set to generate a picture text feature vector to be identified;

and for each webpage in the webpage set, determining that the webpage is an abnormal webpage in response to the similarity between the picture text feature vector to be identified of the webpage and at least one preset picture text feature vector in the preset picture text feature vector set being greater than a picture text threshold value.
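The matching of claim 11 can be sketched as follows, under stated assumptions: splicing is taken to mean simple concatenation of the picture features and the text features, and cosine similarity is assumed as the similarity measure, since the claim names neither. All function names and the default threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Assumed similarity measure between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def splice(picture_features, text_features):
    """Concatenate picture and text features into one picture-text vector."""
    return list(picture_features) + list(text_features)


def matches_known_spam(picture_features, text_features, preset_vectors,
                       picture_text_threshold=0.9):
    """True if the page's spliced vector is similar enough to at least one
    preset vector taken from abnormal pages of known spam websites."""
    candidate = splice(picture_features, text_features)
    return any(
        cosine(candidate, preset) > picture_text_threshold
        for preset in preset_vectors
    )


# Preset set built from one previously identified abnormal page.
preset = [splice([1.0, 0.0], [0.0, 1.0])]
hit = matches_known_spam([1.0, 0.0], [0.0, 1.0], preset)
miss = matches_known_spam([0.0, 1.0], [1.0, 0.0], preset)
```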

12. The method according to any one of claims 1-11, wherein determining whether the website to be identified is a spam website according to the determined ratio comprises:

and when the determined ratio is larger than the abnormal webpage ratio threshold value, determining that the website to be identified is a spam website.
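The site-level decision of claim 12 reduces to a ratio and a comparison, sketched below. The function names and the threshold value of 0.3 are hypothetical; the claims do not give a concrete value.

```python
def abnormal_ratio(page_flags):
    """page_flags holds one boolean per page in the webpage set,
    True when the page was identified as abnormal."""
    if not page_flags:
        return 0.0
    return sum(page_flags) / len(page_flags)


def is_spam_website(page_flags, ratio_threshold=0.3):
    """The website is treated as spam when the share of abnormal pages
    exceeds the abnormal webpage ratio threshold."""
    return abnormal_ratio(page_flags) > ratio_threshold


# 4 abnormal pages out of 10 gives a ratio of 0.4, above the 0.3 threshold.
flags = [True] * 4 + [False] * 6
verdict = is_spam_website(flags)
```

Because only a per-page flag and one division are required, this step avoids the link-graph computation that the background section identifies as costly.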

13. The method of claim 12, wherein after determining that the website to be identified is a spam website when the determined ratio is greater than an abnormal web page ratio threshold, the method further comprises:

acquiring abnormal webpages of the determined spam websites and a set of webpages to be identified, wherein the set of webpages to be identified comprises webpages of other preset websites except the spam websites;

analyzing the abnormal webpage and each webpage in the webpage set to be identified, and extracting picture features of pictures in the webpage to generate picture feature vectors;

processing each obtained picture feature vector with a clustering or classification algorithm, so as to cluster or classify the abnormal webpage and the webpages in the webpage set to be identified into at least one sub-webpage set;

determining that each webpage to be identified in a sub-webpage set that comprises the abnormal webpage is a spam webpage; and determining that the website to which the spam webpage belongs is a spam website.
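The propagation step of claim 13 can be sketched as follows. The claim leaves the clustering or classification algorithm open, so a simple single-link clustering over cosine similarity (implemented with union-find) is assumed here purely for illustration; the function names, the similarity floor, and the example URLs are hypothetical.

```python
import math
from urllib.parse import urlparse


def cosine(u, v):
    """Assumed similarity measure between two picture feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, min_similarity=0.9):
    """Single-link clustering: pages whose vectors are at least
    min_similarity alike end up in the same group."""
    urls = list(vectors)
    parent = {u: u for u in urls}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    for i, a in enumerate(urls):
        for b in urls[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= min_similarity:
                parent[find(a)] = find(b)

    groups = {}
    for u in urls:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())


def propagate_spam(abnormal, candidates, min_similarity=0.9):
    """abnormal and candidates map URL -> picture feature vector.
    Candidates grouped with a known abnormal page are marked as spam
    pages, and their hosts as spam websites."""
    known = set(abnormal)
    spam_pages = set()
    for group in cluster({**abnormal, **candidates}, min_similarity):
        if group & known:
            spam_pages |= group - known
    spam_sites = {urlparse(u).netloc for u in spam_pages}
    return spam_pages, spam_sites


abnormal = {"http://spam.example/a": [1.0, 0.0, 0.0]}
candidates = {
    "http://mirror.example/x": [0.99, 0.05, 0.0],  # near-duplicate picture
    "http://news.example/y": [0.0, 1.0, 0.0],      # unrelated picture
}
spam_pages, spam_sites = propagate_spam(abnormal, candidates)
```

The pairwise comparison is quadratic in the number of pages, which is acceptable for a sketch; the same routine applies unchanged when the vectors are spliced picture-text feature vectors, as in claim 14.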

14. The method of claim 12, wherein after determining that the website to be identified is a spam website when the determined ratio is greater than an abnormal web page ratio threshold, the method further comprises:

analyzing the abnormal webpage and each webpage in the webpage set to be identified, extracting and splicing picture features and text features of the webpage, and generating a picture text feature vector;

processing the obtained picture text feature vectors with a clustering or classification algorithm, so as to cluster or classify the abnormal webpage and the webpages in the webpage set to be identified into at least one sub-webpage set;

determining that each webpage to be identified in a sub-webpage set that comprises the abnormal webpage is a spam webpage;

and determining that the website to which the spam webpage belongs is a spam website.

15. An apparatus for identifying a website, the apparatus comprising:

the acquisition unit is used for acquiring a webpage set of a website to be identified;

the identification unit is used for identifying abnormal webpages in the webpage set based on image-text information in the webpages, wherein the correlation degree of the image information and the text information in the abnormal webpages is smaller than a correlation degree threshold value;

a ratio determination unit for determining the ratio of the identified abnormal web pages in the web page set;

and the spam website determining unit is used for determining whether the website to be identified is a spam website or not according to the determined ratio.

16. A server, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-14.

17. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 14.