CN112884053A - Website classification method, system, equipment and medium based on image-text mixed characteristics

Info

Publication number: CN112884053A
Application number: CN202110222323.5A
Authority: CN
Inventors: 张乐平; 顾明娟; 吴一超; 卞豪
Original assignee: Jiangsu Jiangsuan Tiancheng Information Technology Co ltd
Current assignee: Changzhou Jiangsuan Tiancheng Information Technology Co.,Ltd.
Priority date: 2021-02-28
Filing date: 2021-02-28
Publication date: 2021-06-01
Anticipated expiration: 2041-02-28
Also published as: CN112884053B

Abstract

The invention relates to a website classification method, system, device and medium based on image-text mixed features. The classification method comprises the following steps: converting any block of text into a paragraph vector through a distributed-memory paragraph vector (PV-DM) model; inputting the image matrix into a ResNet model and converting the output of the penultimate layer into a feature vector used as input; and associating the image and text feature information matrices, inputting them into an LSTM model for training, and generating the final webpage classification network. In this method the webpage content is represented by an LSTM model based on image-text mixed features, and those features are represented by the associated sequence of the predicted values of the PV-DM model and the ResNet model. The model describes the content information of the webpage, and the sequence model also describes the order in which a human reads an article, so the identification accuracy is greatly improved.

Description

Website classification method, system, equipment and medium based on image-text mixed characteristics

Technical Field

The invention relates to the field of computer image processing, in particular to a method, a system, equipment and a medium for classifying websites based on image-text mixed characteristics.

Background

With the popularization of the internet, the threshold for setting up a website has become lower, and illegal websites of all kinds have proliferated: websites without ICP filing, pornographic websites, gambling websites, pirated movie websites, pirated fiction websites and others. These websites have a very bad influence on social development and have become a hotbed for online crime. They also have a great impact on the copyright market and are very unfavorable to copyright protection. Cultural supervision departments therefore need to classify such privately established websites accurately in order to improve law enforcement efficiency.

Existing methods for classifying websites by machine learning mainly include the following:

1) methods based on webpage text:

A. using deep learning algorithms such as CNNs to model the similarity between texts;

B. classifying the text with machine learning methods such as logistic regression and Bayes classifiers;

C. using structural attributes of the webpage, such as HTML tags, CSS and various attributes, as input and predicting with an SVM;

2) methods that classify based on website log data.

However, none of these methods solves the problem of low classification accuracy; their accuracy does not exceed 80%.

Disclosure of Invention

Aiming at the problem of high error rate of the current website classification, the invention provides a website classification method, a system, equipment and a medium based on image-text mixed characteristics.

The technical solution for achieving the purpose of the invention is as follows: a website classification method based on image-text mixed features, comprising the following steps:

sequentially extracting the texts and pictures in the webpage;

converting a block of text into a paragraph vector through a distributed-memory paragraph vector (PV-DM) model;

inputting the image matrix into a ResNet model, taking the tensor of shape (1, c, x, y) output by the penultimate layer, and converting that tensor into an image classification vector;

converting the paragraph vectors and the image classification vectors into one-dimensional sequences respectively and associating them, inputting them into an LSTM model for training, and generating the final webpage classification network.

Further, a block of text is converted into a paragraph vector through the distributed-memory paragraph vector model as follows:

training the distributed-memory paragraph vector model on an existing paragraph corpus, inputting the text extracted from the webpage into the model, and taking the model output as the paragraph vector of the target text.
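As an illustration of what a PV-DM model computes, the following minimal sketch shows the forward pass of the averaging variant: the paragraph vector is averaged with the vectors of the context words, and the result is used to score every vocabulary word. The function names, the 2-dimensional vectors and the 3-word vocabulary are hypothetical toy values, not taken from the patent, and training (updating the vectors by gradient descent) is omitted.

```python
import math

def pvdm_hidden(paragraph_vec, context_word_vecs):
    """Average the paragraph vector with the context word vectors."""
    vectors = [paragraph_vec] + list(context_word_vecs)
    dim = len(paragraph_vec)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next_word(paragraph_vec, context_word_vecs, output_weights):
    """Return one probability per vocabulary word given paragraph and context."""
    hidden = pvdm_hidden(paragraph_vec, context_word_vecs)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(scores)

# Toy values (assumptions for illustration only).
paragraph_vec = [0.2, 0.4]
context = [[1.0, 0.0], [0.0, 1.0]]
output_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per vocabulary word
probs = predict_next_word(paragraph_vec, context, output_weights)
```

In a trained model, the paragraph vector for a new text is obtained by keeping the word vectors and output weights fixed and fitting only the paragraph vector; that fitted vector is what the method passes on as the text feature.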

Further, after the image matrix is input into the model, the tensor of shape (1, c, x, y) output by the penultimate layer is obtained as follows:

training a multi-class ResNet model on an existing labeled picture training set; inputting the pictures extracted from the webpage training set into the model, extracting the tensor of shape (1, c, x, y) output by the penultimate layer of the model, and converting the tensor into an image classification vector.

Further, the paragraph vectors and the image classification vectors are respectively converted into one-dimensional sequences and associated, and are input into an LSTM model for training to generate the final webpage classification network, specifically as follows:

taking a group of texts and pictures from the webpage-classification text and picture training set as input, inputting the text into the PV-DM model to output a predicted paragraph vector, and converting it into a one-dimensional sequence as input; if the group contains no text, the paragraph vector is an all-zero one-dimensional sequence;

inputting the pictures in the group into the trained picture classification ResNet model, and converting the tensor of shape (1, c, x, y) output by the penultimate layer of the model into a one-dimensional sequence as input, where 1 is the length of the picture sequence-number vector, c is the length of the classification vector, and x × y is the number of (224, 224) small regions contained in the image;

when x = 1 and y = 1, that is, when the input image is a small image, the length-c classification vector is used directly as the result;

when x > 1 or y > 1, that is, when the input image is a large image, the (1, c, x, y) tensor is first converted into a (c, x, y) matrix, each row is then taken as a feature vector, and the feature vectors are summed and averaged for the purpose of clustering, finally outputting the image classification vector;

if the group contains no picture, the picture classification vector is an all-zero one-dimensional sequence;

sequentially splicing the one-dimensional sequences, inputting them into the LSTM model for training, and generating the final webpage classification network.
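The reduction of the (1, c, x, y) tensor to an image classification vector can be sketched as follows. This is one possible reading, stated as an assumption: each of the x × y regions contributes a length-c vector and these are averaged, which is the same as averaging each channel over its regions. The function name and the nested-list representation are hypothetical and chosen only to keep the example free of dependencies.

```python
def image_classification_vector(tensor):
    """Reduce a (1, c, x, y) nested-list tensor to a length-c vector."""
    if len(tensor) != 1:
        raise ValueError("expected a leading dimension of 1 (one picture)")
    channels = tensor[0]  # shape (c, x, y)
    vector = []
    for channel in channels:
        # Flatten the x * y regional responses of this channel.
        values = [v for row in channel for v in row]
        # Averaging is the identity when x == y == 1 (small image).
        vector.append(sum(values) / len(values))
    return vector

# Small image: x = y = 1, the classification vector is used directly.
small = [[[[0.1]], [[0.9]]]]            # shape (1, 2, 1, 1)
# Large image: two regions per channel, averaged into one value each.
large = [[[[1.0, 3.0]], [[2.0, 4.0]]]]  # shape (1, 2, 1, 2)

small_vec = image_classification_vector(small)
large_vec = image_classification_vector(large)
```

Handling both cases with one averaging step keeps the output length fixed at c regardless of image size, which is what a downstream sequence model needs.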

The invention also provides a website classification system based on image-text mixed features, comprising:

a paragraph vector conversion module, which converts any block of text into a paragraph vector through the distributed-memory paragraph vector model;

an image classification vector generation module, which inputs the image matrix into a ResNet model, takes the tensor of shape (1, c, x, y) output by the penultimate layer, and converts that tensor into an image classification vector;

a webpage classification module, which converts the paragraph vectors and the image classification vectors into one-dimensional sequences respectively, associates them, inputs them into an LSTM model for training, and generates the final webpage classification network.

An electronic device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the above website classification method based on image-text mixed features.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the above website classification method based on image-text mixed features.

Compared with the prior art, the invention has the following beneficial effects: the webpage content is represented by an LSTM model based on image-text mixed features, and those features are represented by the associated sequence of the predicted values of the PV-DM model and the ResNet model. The model describes the content information of the webpage, and the sequence model also describes the order in which a human reads an article, so the identification accuracy is greatly improved, reaching 91.3% on the existing 50M-scale webpage test set.

Drawings

FIG. 1 is a flowchart of the website classification method based on image-text mixed features.

FIG. 2 is a schematic diagram of extracting the image and text topic features of a page through the PV-DM and ResNet models.

FIG. 3 is a block diagram of the process of training the webpage classification LSTM model and predicting with it.

Detailed Description

The invention provides a website classification method based on image-text mixed features which, as shown in FIG. 1, comprises the following steps:

In the first step, any block of text is converted into a paragraph vector through the Distributed Memory Model of Paragraph Vectors (PV-DM). PV-DM is better suited to describing the semantics of a long text paragraph.

In the second step, the image matrix is input into a 50-layer ResNet model, and the features of the penultimate layer are taken as input.

In the third step, the image and text feature information matrices are associated and input into the LSTM model for training, generating the final webpage classification network.

As shown in FIG. 2, any webpage has a central body portion. The body can be expressed as a sequence: if it is pure text, it can be expressed as a text feature alone; if it is a pure picture, it can be expressed as a picture feature; and if it combines text and pictures, it can be expressed as a combination of the two feature vectors. The specific conversion is as follows:

1) sequentially extracting the texts and pictures in the webpage;

2) taking a group of texts and pictures from the webpage-classification text and picture training set as input, inputting the text into the PV-DM model to output a predicted paragraph vector, and converting it into a one-dimensional sequence as input; if the group contains no text, the paragraph vector is an all-zero one-dimensional sequence;

3) inputting the pictures in the group into the trained picture classification ResNet model, and converting the tensor of shape (1, c, x, y) output by the penultimate layer of the model into a one-dimensional sequence as input; if the group contains no picture, the picture classification vector is an all-zero one-dimensional sequence;

where 1 is the length of the picture sequence-number vector, c is the length of the classification vector, and x × y is the number of (224, 224) small regions contained in the image;

when x = 1 and y = 1, the length-c classification vector is used directly as the result;

when x > 1 or y > 1, the (1, c, x, y) tensor is first converted into a (c, x, y) matrix, each row is then taken as a feature vector, the feature vectors are summed and averaged, and the image classification vector is finally output;

4) sequentially splicing the one-dimensional sequences, inputting them into the LSTM model for training, and generating the final webpage classification network.
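The conversion steps above can be sketched as a small routine that builds the input sequence for one webpage. It assumes that each group yields one spliced vector (paragraph vector followed by image classification vector) and that the groups stay in page order; the vector lengths, the dictionary layout and the function names are hypothetical, and the feature values are placeholders rather than real model outputs.

```python
TEXT_DIM = 4   # hypothetical paragraph-vector length
IMAGE_DIM = 3  # hypothetical image-classification-vector length (c)

def group_features(paragraph_vec=None, image_vec=None):
    """Splice one group's text and picture vectors, zero-filling a missing part."""
    text_part = list(paragraph_vec) if paragraph_vec is not None else [0.0] * TEXT_DIM
    image_part = list(image_vec) if image_vec is not None else [0.0] * IMAGE_DIM
    if len(text_part) != TEXT_DIM or len(image_part) != IMAGE_DIM:
        raise ValueError("unexpected feature length")
    return text_part + image_part

def page_sequence(groups):
    """Turn the page's groups, kept in reading order, into the LSTM input sequence."""
    return [group_features(g.get("text"), g.get("image")) for g in groups]

page = [
    {"text": [0.1, 0.2, 0.3, 0.4], "image": [0.7, 0.2, 0.1]},
    {"text": [0.5, 0.5, 0.0, 0.0]},  # group with no picture
    {"image": [0.0, 0.1, 0.9]},      # group with no text
]
sequence = page_sequence(page)
```

Zero-filling keeps every time step the same length, so pages made only of text, only of pictures, or of both can share one sequence model.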

As shown in FIG. 3, the LSTM model is trained and used for prediction as follows:

1) sequentially obtaining the image-text feature sequence of a webpage;

2) inputting this context sequence into an LSTM (long short-term memory network) for training, to obtain a sequential webpage classification network model that can represent how a human reads an article;

3) obtaining the image-text feature sequence of the input webpage and, after model prediction, obtaining the webpage classification vector.

The model expresses the order in which humans read an article, and the sequence model reflects the logic by which humans judge the type of a webpage from contextual relationships.
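To make the prediction step concrete, the following minimal sketch runs a standard LSTM cell over an image-text feature sequence and maps the final hidden state to class probabilities with a softmax layer. It is a forward pass only: the weights are random placeholders, training is omitted, and the sizes, names and the single-layer architecture are assumptions rather than details given in the patent.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vector):
    """Compute weights @ vector + bias for nested-list weights."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def lstm_step(x, h, c, params):
    """One LSTM time step; params maps each gate name to (W, b) acting on [x; h]."""
    xh = x + h  # concatenate the input with the previous hidden state
    i = [sigmoid(z) for z in affine(*params["input"], xh)]
    f = [sigmoid(z) for z in affine(*params["forget"], xh)]
    o = [sigmoid(z) for z in affine(*params["output"], xh)]
    g = [math.tanh(z) for z in affine(*params["cell"], xh)]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def classify_page(sequence, params, head, hidden_size):
    """Read the sequence in order and return one probability per webpage class."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(list(x), h, c, params)
    scores = affine(*head, h)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(rows, cols, rng):
    """Placeholder weights; a real model would learn these during training."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return weights, [0.0] * rows

rng = random.Random(0)
INPUT_SIZE, HIDDEN_SIZE, NUM_CLASSES = 7, 5, 3  # hypothetical sizes
params = {gate: random_layer(HIDDEN_SIZE, INPUT_SIZE + HIDDEN_SIZE, rng)
          for gate in ("input", "forget", "output", "cell")}
head = random_layer(NUM_CLASSES, HIDDEN_SIZE, rng)
sequence = [[0.1] * INPUT_SIZE, [0.0] * INPUT_SIZE, [0.3] * INPUT_SIZE]
class_probabilities = classify_page(sequence, params, head, HIDDEN_SIZE)
```

Because the cell state is carried from one step to the next, the class scores depend on the order of the groups, which is how a sequence model can reflect reading order.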

It should be noted that the implementation of each module in the system is described in detail in the section on the website classification method and is not repeated here.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the website classification method based on image-text mixed features.

Further, the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above website classification method based on image-text mixed features.

To test the error rate of the new model, the following three models were trained, each on its own training set:

a CNN model based on text features, trained on the text training set taken from the training set;

an SVM model based on webpage structure features, trained on the webpage structure feature set extracted from the training set;

an LSTM model based on image-text mixed features, trained on the image-text mixed labeled training set taken from the training set.

The test results obtained on the same 50M webpage test data set are shown in the following table:

TABLE 1

Model	Error rate
LSTM model based on image-text mixed features	9.7%
CNN model based on text features	26.7%
SVM model based on webpage structure features	41.9%

As can be seen from the table, by representing the type of a webpage through the original image-text mixed features, the method greatly improves the accuracy of webpage classification.
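The comparison in Table 1 can also be read as a relative error reduction. The short calculation below uses only the three error rates from the table; the dictionary keys are shortened labels introduced here for illustration.

```python
# Error rates in percent, copied from Table 1.
error_rates = {
    "LSTM (image-text mixed features)": 9.7,
    "CNN (text features)": 26.7,
    "SVM (webpage structure features)": 41.9,
}

proposed = error_rates["LSTM (image-text mixed features)"]

# Relative error reduction of the LSTM model against each baseline, in percent.
relative_reduction = {
    name: (rate - proposed) / rate * 100
    for name, rate in error_rates.items()
    if rate != proposed
}

for name, reduction in relative_reduction.items():
    print(f"vs {name}: {reduction:.1f}% fewer errors")
```

On these figures the LSTM model makes roughly 64% fewer errors than the CNN baseline and roughly 77% fewer than the SVM baseline.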

In the embodiments provided in the present application, it should be understood that the disclosed method, apparatus and device may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into modules is only a logical division, and in actual implementation other divisions are possible; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A website classification method based on image-text mixed characteristics is characterized by comprising the following steps:

sequentially extracting texts and pictures in the webpage;

2. The website classification method based on image-text mixed features according to claim 1, characterized in that a block of text is converted into a paragraph vector through the distributed-memory paragraph vector model as follows:

training the distributed-memory paragraph vector model on a paragraph corpus, inputting the text extracted from the webpage into the model, and taking the model output as the paragraph vector of the target text.

3. The website classification method based on image-text mixed features according to claim 1, characterized in that a multi-class ResNet model is trained on an existing labeled picture training set; and pictures extracted from the webpage training set are input into the model, and the tensor of shape (1, c, x, y) output by the penultimate layer of the model is extracted.

4. The website classification method based on image-text mixed features according to claim 1, characterized in that the paragraph vectors and the image classification vectors are respectively converted into one-dimensional sequences and associated, and are input into an LSTM model for training to generate the final webpage classification network, specifically as follows:

inputting the pictures in the group into the trained picture classification ResNet model, and converting the tensor of shape (1, c, x, y) output by the penultimate layer of the model into a one-dimensional sequence as input; if the group contains no picture, the picture classification vector is an all-zero one-dimensional sequence;

where 1 is the length of the picture sequence-number vector, c is the length of the classification vector, and x × y is the number of (224, 224) small regions contained in the image;

when x = 1 and y = 1, the length-c classification vector is used directly as the result;

5. A website classification system based on image-text mixed features, characterized by comprising:

a paragraph vector conversion module, which converts a block of text into a paragraph vector through the distributed-memory paragraph vector model;

6. The website classification system based on image-text mixed features according to claim 5, characterized in that a block of text is converted into a paragraph vector through the distributed-memory paragraph vector model as follows:

training the distributed-memory paragraph vector model on an existing paragraph corpus, inputting the text extracted from the webpage into the model, and taking the model output as the paragraph vector of the target text.

7. The website classification system based on image-text mixed features according to claim 5, characterized in that, after the image matrix is input into the model, the tensor of shape (1, c, x, y) output by the penultimate layer is obtained as follows:

8. The website classification system based on image-text mixed features according to claim 5, characterized in that the paragraph vectors and the image classification vectors are respectively converted into one-dimensional sequences and associated, and are input into an LSTM model for training to generate the final webpage classification network, specifically as follows:

9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the website classification method based on image-text mixed features according to any one of claims 1-4.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the website classification method based on image-text mixed features according to any one of claims 1-4.