CN113157998A

CN113157998A - Method, system, device and medium for polling website and judging website type through IP

Info

Publication number: CN113157998A
Application number: CN202110222311.2A
Authority: CN
Inventors: 张乐平; 顾明娟; 吴一超; 卞豪
Original assignee: Jiangsu Jiangsuan Tiancheng Information Technology Co ltd
Current assignee: Jiangsu Jiangsuan Tiancheng Information Technology Co ltd
Priority date: 2021-02-28
Filing date: 2021-02-28
Publication date: 2021-07-23

Abstract

The invention relates to a method, a system, a device and a medium for polling websites by IP and judging website categories, wherein the method comprises the following steps: capturing the webpage content of a target website; extracting the effective text and pictures in the webpage; classifying and labeling the extracted effective text and pictures; constructing and training network models for the text and picture data; feeding the pictures and text crawled from the webpages of a website into their respective models to obtain classification prediction results for the pictures and text in the webpages, and setting weights for the picture classification result and the text classification result; counting the prediction results of all pictures and text in the website to generate a picture classification distribution and a text classification distribution; and calculating scores to obtain the final classification result. The invention simulates a real person browsing webpages, uses artificial intelligence to analyze the concrete content of a website directly, covers website information such as videos, pictures and text, and combines them into a judgment of the website content.

Description

Method, system, device and medium for polling website and judging website type through IP

Technical Field

The invention relates to the field of computer image processing, in particular to a method, a system, equipment and a medium for polling a website and judging the category of the website through IP.

Background

At present, the following methods are mainly used in the market for website classification:

1) based on the web page text;

A. building a website classification dictionary and matching the effective words of the webpage under examination against it to judge the type of the website;

B. measuring the similarity between texts with deep learning algorithms such as CNN;

C. classifying the texts with machine learning methods such as logistic regression and Bayes.

2) The classification is made based on the structural features of the web site.

3) The classification is made based on the website log data.

However, these methods extract only partial features of the website, such as its text features and HTML structural features, and cannot comprehensively and mathematically characterize the content of the web page. This leads to low classification accuracy and, in turn, to a large amount of manual correction after machine classification.

Disclosure of Invention

In order to solve the problem of low classification accuracy of the classification methods, and in consideration of the fact that images and characters are the most direct embodiment of website content classification, the invention provides a method, a system, equipment and a medium for inspecting websites through IP and judging website types, and the classification accuracy can be improved to over 85%.

The technical scheme for realizing the purpose of the invention is as follows: a method for polling websites and judging the category of the websites through IP comprises the following steps:

inputting an IP list, starting crawler scanning, and capturing webpage content of a target website;

judging whether a certain website is accessible or not, and recording the result to a database;

judging whether an ICP record number exists in the webpage content and whether the record number can be verified, and recording the result to a database;

extracting effective characters and pictures in the webpage;

classifying and labeling the extracted effective characters and pictures;

constructing and training a network model aiming at the character and picture data, and writing model parameters into a model library after the training is finished;

feeding the pictures and text crawled from the webpages of a website into their respective models to obtain classification prediction results for the pictures and text in the webpages, and setting weights for the picture classification result and the text classification result; counting the prediction results of all pictures and text in the website to generate a picture classification distribution and a text classification distribution; and calculating scores to obtain the final classification result.

Furthermore, the webpage content of the target website is captured by the Python crawler framework Scrapy in combination with the JavaScript rendering service Splash.
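As a minimal sketch of the rendering step, the following shows how one page could be requested through the HTTP interface of Splash (its `render.html` endpoint with the `url`, `wait` and `timeout` arguments). The endpoint address, the helper names and the use of plain `urllib` are assumptions made to keep the sketch free of third-party dependencies; in a Scrapy project the same request would normally be issued through a Scrapy spider instead.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is listening locally on its default port 8050.
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def splash_render_url(target_url: str, wait: float = 1.0, timeout: float = 30.0) -> str:
    """Build the Splash render.html request URL for one target page."""
    query = urlencode({"url": target_url, "wait": wait, "timeout": timeout})
    return f"{SPLASH_ENDPOINT}?{query}"


def fetch_rendered_html(target_url: str) -> str:
    """Return the JavaScript-rendered HTML of target_url (needs a running Splash)."""
    with urlopen(splash_render_url(target_url), timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical target taken from an input IP list (documentation address range).
request_url = splash_render_url("http://192.0.2.10/")
```

Only `splash_render_url` runs without a network; `fetch_rendered_html` is shown to indicate where the rendered page would be obtained before text and pictures are extracted.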

Further, classifying and labeling the extracted effective characters and pictures, specifically comprising: the webpage is used as a grouping dimension, and the pictures and the characters are combined and labeled together and labeled into a certain category or a plurality of categories in a preset classification list.
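The grouping described above can be pictured as one record per webpage whose labels are shared by every picture and text fragment on that page. The sketch below is illustrative only: the class name, field names and the contents of the preset classification list are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterator, List, Set, Tuple

# Hypothetical preset classification list; the source does not enumerate it here.
CATEGORIES = ("film and television", "forum", "game", "novel")


@dataclass
class LabeledPage:
    """All pictures and text of one webpage, labeled together as one group."""
    url: str
    picture_paths: List[str] = field(default_factory=list)
    texts: List[str] = field(default_factory=list)
    labels: Set[str] = field(default_factory=set)

    def add_label(self, category: str) -> None:
        """Attach one category from the preset list; several may be attached."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        self.labels.add(category)

    def samples(self) -> Iterator[Tuple[str, str, FrozenSet[str]]]:
        """Yield (kind, item, labels) training samples that share the page labels."""
        for path in self.picture_paths:
            yield ("picture", path, frozenset(self.labels))
        for text in self.texts:
            yield ("text", text, frozenset(self.labels))


page = LabeledPage(url="http://192.0.2.10/", picture_paths=["a.jpg"], texts=["match report"])
page.add_label("forum")
page.add_label("game")
training_samples = list(page.samples())
```

Labeling at page level means a picture and a paragraph from the same page always receive the same label set, which is what allows the two models to be trained from a single annotation pass.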

Further, for the picture data, a VGGNET model is used; for the text data, a textCNN model is used, with ReLU as the activation function and convolution kernel sizes of 14, 15 and 16.
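To make the text branch more concrete, the sketch below implements the core textCNN operation in plain Python: a kernel slides over the word-vector matrix, ReLU is applied, and the maximum over all positions is kept. It assumes that the stated kernel sizes 14, 15 and 16 are kernel heights (the number of consecutive words covered) and uses random weights; the final classification layer and training are omitted, and the function names are hypothetical.

```python
import random


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def conv_max_pool(matrix, kernel, bias=0.0):
    """Slide one (h, d) kernel down an (n, d) matrix, apply ReLU, max-pool over positions."""
    h, n = len(kernel), len(matrix)
    if n < h:
        raise ValueError("text shorter than kernel height; pad it first")
    best = 0.0  # ReLU outputs are never negative
    for start in range(n - h + 1):
        total = bias
        for i in range(h):
            total += sum(w * x for w, x in zip(kernel[i], matrix[start + i]))
        best = max(best, relu(total))
    return best


def textcnn_features(matrix, kernels_by_height):
    """Return one pooled feature per kernel, ordered by kernel height."""
    return [conv_max_pool(matrix, kernel)
            for height in sorted(kernels_by_height)
            for kernel in kernels_by_height[height]]


rng = random.Random(0)
# 20 words, each a 9-dimensional vector (random stand-ins for word vectors).
text = [[rng.uniform(-1, 1) for _ in range(9)] for _ in range(20)]
# Two random kernels for each of the heights 14, 15 and 16.
kernels = {h: [[[rng.uniform(-1, 1) for _ in range(9)] for _ in range(h)]
               for _ in range(2)] for h in (14, 15, 16)}
features = textcnn_features(text, kernels)
```

In a full model these pooled features would be concatenated and passed to a classification layer; a deep learning framework would replace the explicit loops.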

Further, picture prediction is optimized before the pictures are input into the model: the input pictures are resized and padded into n pictures that form a batch, batch prediction is performed, the output of the second-to-last layer is then taken as the basis for judging the result, n tensors of shape (C, J, K) are generated, and the pmap of a given category is taken for comprehensive scoring;

the final pmap activation map matrix is

P＝(P1+P2+...+Pn)/n

Then the connected bright regions of the matrix P are computed; if the area of the connected bright region of a given category is larger than 50% of the whole area, the picture is judged to belong to that category.
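The averaging and the 50% area test can be sketched as follows. The source does not state how a cell is judged "bright" or which connectivity is used, so the brightness threshold of 0.5 and the 4-neighbour connectivity are assumptions, as are the function names.

```python
from collections import deque


def average_maps(maps):
    """Element-wise mean P = (P1 + P2 + ... + Pn) / n of equally sized 2-D maps."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)] for r in range(rows)]


def largest_bright_region(p, threshold=0.5):
    """Size of the largest 4-connected region whose values exceed threshold."""
    rows, cols = len(p), len(p[0])
    seen = [[False] * cols for _ in range(rows)]
    best = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or p[r0][c0] <= threshold:
                continue
            seen[r0][c0] = True
            queue, size = deque([(r0, c0)]), 0
            while queue:  # breadth-first flood fill of one region
                r, c = queue.popleft()
                size += 1
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and p[nr][nc] > threshold):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            best = max(best, size)
    return best


def is_category_picture(maps, threshold=0.5, min_fraction=0.5):
    """True if the largest bright region covers more than min_fraction of the map."""
    p = average_maps(maps)
    area = len(p) * len(p[0])
    return largest_bright_region(p, threshold) > min_fraction * area


# Two hypothetical 2 x 3 pmaps for one category.
pmaps = [[[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
         [[0.8, 0.6, 0.2], [1.0, 1.0, 0.4]]]
decision = is_category_picture(pmaps)
```

Here four of the six averaged cells are bright and mutually connected, so the region covers more than half of the map and the picture would be assigned to the category.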

Further, during network model training, preprocessing the picture: the original image is expanded into 8 images, corresponding two-dimensional (r, g, b) three-channel vectors are extracted, the height and the width of the images are 224 and 224 respectively, and a tensor with the shape of (3,224,224) is obtained;

preprocessing the text: the collected text is converted into word vectors by word2vec, each word being represented by a 9-dimensional word vector, forming an n × 9 matrix.

Further, the model training method is as follows:

inputting the picture matrix in the data set into a model for gradient descent training, and writing model parameters of VGGNET into a model library after the training is finished;

and inputting the character matrix in the data set into textCNN for gradient descent training, and writing the model parameters into a model warehouse after the training is finished.

Further, the weight of the picture classification result is set to a and the weight of the text classification result is set to b, with a + b = 1; the prediction results of all pictures and text in a website are counted to generate a picture classification distribution and a text classification distribution; the category Y_n1 with the highest picture classification count in the classification list is found, its count being C_n1; the category Y_n2 with the highest text classification count in the classification list is found, its count being C_n2; the final calculated scores are:

r_p＝C_n1·a

r_t＝C_n2·b

wherein r_p and r_t are the scores of the pictures and the text, respectively;

of the categories Y_n1 and Y_n2, the one with the higher score is the final classification result.

A system for inspecting websites and judging website types through IP comprises a user interaction system, a crawler management system, a prediction service system and an AI platform;

the AI platform consists of a data marking tool, a model version management subsystem and a task flow scheduling subsystem and is used for carrying out model training;

the prediction service subsystem is used for classified prediction of characters or pictures;

the crawler management system is used for crawler task allocation, crawler task scheduling, specific crawler extraction logic setting and webpage character and picture extraction;

the user interaction system is used for customizing a website warehouse to be scanned and for periodically scanning and classifying the websites in the website warehouse by placing orders.

An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method when executing the program.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method.

Compared with the prior art, the invention has the following beneficial effects: (1) the method simulates a real person browsing webpages and uses deep learning to analyze the concrete content of a website directly; webpage content such as videos, pictures and text is extracted, two different neural networks perform classification prediction on the pictures and the text respectively, and the distributions of the prediction results are then weighted so as to characterize the webpage content accurately, which greatly improves the accuracy of classification prediction; specifically, the improved algorithm uses VGGNET to predict the pictures in a webpage and TextCNN to predict the text, and finally weights the distributions of the picture and text classification results to form the judgment of the website content; (2) the invention can greatly improve the efficiency of cultural law enforcement officers, who can quickly open a case and investigate on the basis of the classification result of the invention.

Drawings

FIG. 1 is a schematic block diagram of a system for polling a website and determining the category of the website according to IP.

Fig. 2 is a flow chart of an IP patrol process.

Fig. 3 is a schematic diagram of processing optimization before picture prediction is input into a model.

Detailed Description

With reference to fig. 1, the invention periodically captures the web page content of the monitored IP website by using an artificial intelligence machine learning technique, and intelligently analyzes the characters, images and videos in the web page based on artificial intelligence techniques such as image recognition and semantic recognition to classify the web page. The classification method mainly comprises the following steps:

step 1, capturing webpage contents of n target websites through Scrapy + Splash;

step 2, extracting effective characters and pictures in the webpage;

step 3, classifying and labeling the extracted effective characters and pictures; the specific method comprises the following steps:

the webpage is used as a grouping dimension, and the pictures and the characters are combined and labeled to be labeled into a certain category or a plurality of categories in a preset classification list;

step 4, training

Step 4.1 network model design

1) For picture data, an ALEXNET model with the last layer (FCN layer) removed is used, and the output of the second-to-last layer is used as the model output;

2) for text data, a textCNN model is used, with ReLU as the activation function and convolution kernel sizes of 14, 15 and 16;

step 4.2 Picture preprocessing

The original image is expanded into 8 images, corresponding two-dimensional (r, g, b) three-channel vectors are extracted, the height and the width of the images are 224 and 224 respectively, and a tensor with the shape of (3,224,224) is obtained;
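The source does not say how one original image becomes 8 images. One common way to obtain exactly 8 variants is to combine the four 90-degree rotations with an optional horizontal flip; the sketch below assumes that scheme and is not confirmed by the source. It also shows the conversion of an (H, W) grid of (r, g, b) pixels into a channels-first (3, H, W) layout. Resizing to 224 × 224 is assumed to have been done beforehand, and all function names are hypothetical.

```python
def rotate90(img):
    """Rotate a row-major image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror each row of the image."""
    return [row[::-1] for row in img]


def expand_to_eight(img):
    """Return 8 variants: 4 rotations, each with and without a horizontal flip."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


def to_channels_first(img):
    """Convert an (H, W) grid of (r, g, b) pixels into a (3, H, W) nested list."""
    return [[[pixel[ch] for pixel in row] for row in img] for ch in range(3)]


# A uniform 224 x 224 stand-in image; real input would be a resized crawl result.
image = [[(10, 20, 30) for _ in range(224)] for _ in range(224)]
tensors = [to_channels_first(v) for v in expand_to_eight(image)]
```

Each element of `tensors` then has the (3, 224, 224) layout described above.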

step 4.3 text preprocessing

Converting the collected text into word vectors through word2vec, wherein each word is represented by a 9-dimensional word vector to form a matrix of (n, 9);
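The conversion to an (n, 9) matrix can be sketched as a table lookup. The sketch assumes the text has already been segmented into words and that 9-dimensional vectors have already been trained; the embedding values, the zero vector for unknown words and the optional padding are illustrative assumptions.

```python
EMBEDDING_DIM = 9  # each word is represented by a 9-dimensional vector


def text_to_matrix(tokens, embeddings, min_rows=0):
    """Map tokens to an (n, 9) matrix; unknown words and padding rows are zeros."""
    zero = [0.0] * EMBEDDING_DIM
    rows = [list(embeddings.get(token, zero)) for token in tokens]
    while len(rows) < min_rows:  # pad short texts so that tall kernels still fit
        rows.append(list(zero))
    return rows


# Hypothetical pre-trained vectors; real values would come from word2vec training.
embeddings = {"forum": [0.2] * 9, "basketball": [0.1] * 9}
matrix = text_to_matrix(["forum", "basketball", "unseen"], embeddings, min_rows=16)
```

Padding to at least 16 rows is a design choice made here so that a kernel spanning 16 words has at least one valid position.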

step 4.4 training the model

Inputting the image matrix in the data set into a model for gradient descent training, and writing model parameters into a model warehouse after the training is finished;

inputting a character matrix in the data set into textCNN for gradient descent training, and writing model parameters into a model warehouse after training is finished;
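The train-then-store pattern of step 4.4 can be illustrated with a deliberately small stand-in. The sketch below trains a one-feature logistic model by batch gradient descent and writes its parameters to a versioned JSON file; the logistic model replaces the picture and text networks purely for brevity, and the directory layout, file naming and function names are assumptions rather than the structure of the model warehouse described in the source.

```python
import json
import math
import tempfile
from pathlib import Path


def train_logistic(samples, labels, lr=0.5, epochs=200):
    """Batch gradient descent on the mean log-loss of a logistic model."""
    dim = len(samples[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(samples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y  # derivative of the loss w.r.t. z
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return {"weights": weights, "bias": bias}


def save_to_warehouse(params, warehouse_dir, name, version):
    """Write parameters to <warehouse_dir>/<name>-v<version>.json and return the path."""
    path = Path(warehouse_dir) / f"{name}-v{version}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(params))
    return path


with tempfile.TemporaryDirectory() as warehouse:
    params = train_logistic([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    saved = save_to_warehouse(params, warehouse, "text-classifier", 1)
    restored = json.loads(saved.read_text())
```

Storing a version in the file name lets the prediction service load a specific set of parameters, which is one plausible reading of the model version management mentioned for the AI platform.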

step 5, use of model

The pictures and text crawled from the web pages of a website are fed into their respective models to obtain the classification prediction results Y_n1 and Y_n2 of the pictures and the text in the web pages. The weight of the picture classification result is set to 0.6, and the weight of the text classification result is set to 0.4. Finally, the prediction results of all pictures and text in the website are counted to generate the picture classification distribution and the text classification distribution, as in the following table:

Input X	Classification result Y
Web page 1-FIG. 1	Film and television
Web page 1-text 1	Forum/basketball
Web pages 1-2	Film and television
Web page 2-FIG. 1	Film and television
Web page 2-text 1	Forum/football
Web page 3-FIG. 1	Film and television/documentary
......	......

The category Y_n1 with the highest picture classification count in the classification list is found, its count being C_n1. The category Y_n2 with the highest text classification count in the classification list is found, its count being C_n2. The final calculated scores are:

r_p＝C_n1·0.6

r_t＝C_n2·0.4

where r_p and r_t are the scores of the pictures and the text, respectively.

If category Y_n1 has the higher score, Y_n1 is the final classification result; otherwise Y_n2 is.
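The counting and scoring above can be sketched in a few lines. The prediction lists are hypothetical, the weights 0.6 and 0.4 are those given in the text, and the tie-breaking rule (ties go to the picture category) is an assumption because the source does not specify one.

```python
from collections import Counter


def classify_site(picture_predictions, text_predictions, a=0.6, b=0.4):
    """Return (final_category, r_p, r_t) from the per-item predictions of one website."""
    y_n1, c_n1 = Counter(picture_predictions).most_common(1)[0]
    y_n2, c_n2 = Counter(text_predictions).most_common(1)[0]
    r_p = c_n1 * a  # picture score
    r_t = c_n2 * b  # text score
    # Assumption: a tie goes to the picture category; the source does not say.
    final = y_n1 if r_p >= r_t else y_n2
    return final, r_p, r_t


# Hypothetical predictions for all pictures and text fragments of one website.
pictures = ["film and television"] * 3 + ["game"]
texts = ["forum", "forum", "film and television"]
final_category, r_p, r_t = classify_site(pictures, texts)
```

With these inputs the most frequent picture category occurs three times and the most frequent text category twice, so the picture score (about 1.8) exceeds the text score (0.8) and the picture category is chosen. Both input lists are assumed to be non-empty.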

The picture prediction is optimized before the pictures are input into the model, as shown in fig. 3: the input pictures are resized and padded into n pictures that form a batch, batch prediction is performed, and the output of the second-to-last layer is then taken as the basis for judging the result, generating n tensors of shape (C, J, K), where C indexes the classification categories and J × K is the number of 224 × 224 regions contained in the input picture; the pmap of a given category is taken for comprehensive scoring. The pmap activation map is shown in fig. 3.

The final pmap activation map matrix is:

P＝(P1+P2+...+Pn)/n

Then the connected bright regions of the matrix P are computed; if the area of the connected bright region of a given category is larger than 50% of the whole area, the picture is judged to belong to that category. The batch prediction method improves the overall prediction speed and accuracy.

Given an IP range, the invention uses the tool to determine whether each IP hosts a website, and classifies the accessible websites in detail into categories such as pornography, gambling, reactionary content, literature/household-intrigue novels, video and audio, film and animation, and the like. Furthermore, evidence of website infringement can be produced from the text and picture features of a number of copyrighted works, and a list of suspect audio-visual works is given.

As shown in fig. 2, the specific IP polling process is as follows:

1) a user inputs an IP list;

2) starting crawler scanning;

3) judging whether a certain website is accessible or not, and recording the result to a database;

4) judging whether an ICP record number exists in the webpage content and whether the record number can be verified, and recording the result to a database;

5) inputting the extracted characters and pictures into a prediction system;

6) recording the classification result of the website corresponding to each IP into a database.
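The polling flow above can be sketched with an in-memory SQLite database. The `fetch` and `classify` callables are hypothetical stand-ins for the crawler and the prediction system, the table layout is an assumption, and the regular expression reflects the usual printed form of an ICP record number (a province character, "ICP备", digits, "号" and an optional site suffix). Verifying the number against an official registry, the second half of step 4, is not shown.

```python
import re
import sqlite3

# Assumed pattern for the printed form of an ICP record number.
ICP_PATTERN = re.compile(r"[\u4e00-\u9fa5]ICP备\d+号(?:-\d+)?")


def find_icp_number(html):
    """Return the first ICP record number found in the page, or None."""
    match = ICP_PATTERN.search(html)
    return match.group(0) if match else None


def patrol(ip_list, fetch, classify, db):
    """Run steps 2 to 6 for each IP; fetch(ip) returns HTML or raises OSError."""
    db.execute("CREATE TABLE IF NOT EXISTS patrol "
               "(ip TEXT PRIMARY KEY, accessible INTEGER, icp TEXT, category TEXT)")
    for ip in ip_list:
        try:
            html = fetch(ip)
        except OSError:  # step 3: the site is not accessible
            db.execute("INSERT OR REPLACE INTO patrol VALUES (?, 0, NULL, NULL)", (ip,))
            continue
        db.execute("INSERT OR REPLACE INTO patrol VALUES (?, 1, ?, ?)",
                   (ip, find_icp_number(html), classify(html)))
    db.commit()


# Hypothetical pages keyed by IP (documentation address range).
pages = {"192.0.2.1": "<html>苏ICP备12345678号-1 forum posts</html>"}


def fake_fetch(ip):
    if ip not in pages:
        raise OSError("unreachable")
    return pages[ip]


db = sqlite3.connect(":memory:")
patrol(["192.0.2.1", "192.0.2.2"], fake_fetch, lambda html: "forum", db)
rows = db.execute("SELECT ip, accessible, icp, category FROM patrol ORDER BY ip").fetchall()
```

Passing the crawler and the classifier in as callables keeps the flow testable without network access or trained models.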

The invention also provides a system for inspecting the website and judging the website category through the IP, which comprises a user interaction system, a crawler management system, a prediction service system and an AI platform;

the AI platform consists of subsystems such as a data marking tool, model version management, task flow scheduling and the like, and solves the problem of how to train the model;

the prediction service subsystem provides classified prediction service of characters or pictures, and solves the problem of how to use the model;

the crawler management system provides the abilities of crawler task allocation, crawler task scheduling and specific crawler extraction logic setting, and solves the problem of how to extract characters and pictures of a webpage quickly and accurately;

the user interaction system enables a user to customize the website warehouse to be scanned, lets the user have the websites scanned and classified periodically by placing orders, and points out possible content problems (in pictures or text) of the websites, such as pornographic pictures that law enforcement needs to check.

Further, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the classification method when executing the computer program.

The invention further provides a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the above-mentioned classification method.

The technical solutions in the embodiments of the present invention are described in detail below with reference to the accompanying drawings of the embodiments of the present invention, and it should be noted that only some embodiments, but not all embodiments, are provided in the implementation of the present invention.

Examples

A total of 1089793 IPs within the range of a certain city were scanned to detect whether a website is hosted on each; for each IP hosting a website, the first two pages of the website were accessed, the presence of an ICP record number was checked, keywords were extracted, and the website was classified by attribute, with a focus on three categories of website: games, audio-visual (films and music) and novels. The overall scan results are shown in the table below.

The table shows that the method can help law enforcement departments patrol the legal websites in their jurisdictions and classify them accurately, with a classification accuracy of 99.79%.

In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. Each functional unit may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for polling websites and judging the category of the websites through IP is characterized by comprising the following steps:

judging whether the website is accessible or not, and recording the result to a database;

extracting effective characters and pictures in the webpage;

classifying and labeling the extracted effective characters and pictures;

2. The method of claim 1, wherein the web page content of the target website is crawled by the Python crawler framework Scrapy in combination with the JavaScript rendering service Splash;

classifying and labeling the extracted effective characters and pictures, specifically comprising the following steps: the webpage is used as a grouping dimension, and the pictures and the characters are combined and labeled together and labeled into a certain category or a plurality of categories in a preset classification list.

3. The method according to claim 1, wherein for picture data a VGGNET model is used; for text data, a textCNN model is used, with ReLU as the activation function and convolution kernel sizes of 14, 15 and 16.

4. The method of claim 1 or 3, wherein picture prediction is optimized before the pictures are input into the model: the input pictures are resized and padded into n pictures that form a batch, batch prediction is performed, the output of the second-to-last layer is then taken as the basis for judging the result, n tensors of shape (C, J, K) are generated, and the pmap of a given category is taken for comprehensive scoring;

the final pmap activation map matrix is

P＝(P1+P2+...+Pn)/n

Then the connected bright regions of the matrix P are computed; if the area of the connected bright region of a given category is larger than 50% of the whole area, the picture is judged to belong to that category.

5. The method of claim 4, wherein the network model is trained by preprocessing the pictures: the original image is expanded into 8 images, corresponding two-dimensional (r, g, b) three-channel vectors are extracted, the height and the width of the images are 224 and 224 respectively, and a tensor with the shape of (3,224,224) is obtained;

6. The method of claim 1, wherein the model training method is as follows:

inputting the picture matrix in the data set into a model for gradient descent training, and writing model parameters of VGG NET into a model library after the training is finished;

and inputting the character matrix in the data set into textCNN for gradient descent training, and writing the model parameters into a model library after the training is finished.

7. The method according to claim 1, wherein the weight of the picture classification result is set to a, the weight of the text classification result is set to b, and a + b = 1; the prediction results of all pictures and text in a website are counted to generate a picture classification distribution and a text classification distribution; the category Y_n1 with the highest picture classification count in the classification list is found, its count being C_n1; the category Y_n2 with the highest text classification count in the classification list is found, its count being C_n2; the final calculated scores are:

r_p＝C_n1·a

r_t＝C_n2·b

wherein r_p and r_t are the scores of the pictures and the text, respectively;

8. A system for inspecting websites and judging website types through IP is characterized by comprising a user interaction system, a crawler management system, a prediction service system and an AI platform;

the user interaction system is used for customizing a website library to be scanned and periodically scanning and classifying the websites in the website library by an order placing mode.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.