CN104361059A

CN104361059A - Harmful information identification and web page classification method based on multi-instance learning

Info

Publication number: CN104361059A
Application number: CN201410609728.4A
Authority: CN
Inventors: 胡卫明; 胡瑞光
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Renmin Zhongke Beijing Intelligent Technology Co ltd
Priority date: 2014-11-03
Filing date: 2014-11-03
Publication date: 2015-02-18
Anticipated expiration: 2034-11-03
Also published as: CN104361059B

Abstract

The invention discloses a web page classification method based on multi-instance learning. The method comprises: extracting the effective images in a web page with a purpose-designed Forward Comparison of Relative Sizes Sorting (FOCARSS) method, and extracting the related text of each effective image according to the web page tree structure; treating each effective image together with its related text as one instance in the web page bag, generating descriptions of the effective image and of its related text with an image bag-of-words model and a text bag-of-words model respectively, and concatenating the two descriptions as the instance description; and classifying drug-related web pages with a multi-instance kernel. Because the images contained in a web page and their related texts serve as the instances in the web page bag, the algorithm conforms more closely to the actual distribution of web page content, makes full use of the effective information in the page, exploits the complementarity between image information and text information, and ultimately achieves better results than classification using single-modality information.

Description

Harmful information identification and web page classification method based on multi-instance learning

Technical field

The present invention relates to the field of network content security, and more specifically to a harmful information identification and web page classification method based on multi-instance learning.

Background art

While promoting social progress and development, the Internet also makes it very easy to spread all kinds of harmful information. Such information increasingly endangers normal social activity and healthy value systems, and is particularly detrimental to the healthy development of teenagers. Maximizing the positive role of the Internet while suppressing or eliminating its negative effects helps to clean up the online environment, promote social progress, and protect the healthy development of teenagers. Harmful information on the Internet includes pornography, drugs, violence, terrorism, reactionary content and so on; among these, the harm of drug-related information is even greater than that of the other kinds.

On the Internet, web pages exist as HyperText Markup Language (HTML) files, and an HTML file is essentially text. Common web page classification methods therefore rely mainly on text information, the most important of them being the bag-of-words model. The bag-of-words model works as follows: first, a number of keywords are selected to form a text dictionary; then the frequency of each keyword in a document or web page is counted to form a vector; finally, a suitable classifier is used to classify this vector.

With the widespread adoption of digital devices, web pages contain more and more images and less and less text, so classifying web pages from text information alone no longer matches the actual form of web pages well. It is therefore necessary to use image information and text information together to improve classification performance on real web pages.

As an example, Fig. 1 shows two drug-related web pages: the left one sells drug paraphernalia and the right one sells cannabis. Both pages contain a large number of images and a small amount of text, and the images and text are arranged in an orderly layout. In such cases, text information alone cannot classify the pages well. In addition, very few patents or publications currently address the processing of drug-related information on the Internet. A method for identifying harmful information such as drug-related content is urgently needed, so that governments can strengthen their supervision of the Internet and protect people from the temptation of such information.

Summary of the invention

In view of this, the object of the invention is to propose a web page classification method and a harmful information identification method that match the actual distribution of image and text quantities in web pages, thereby solving the technical problem of identifying and automatically classifying harmful information in web pages.

To achieve the above object, as one aspect of the present invention, the invention proposes a web page classification method comprising the following steps:

Step 1: extract the effective images in a selected web page, and extract the related text of each effective image;

Step 2: take each effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;

Step 3: compute a multi-instance kernel over the obtained instances, and classify the selected web page according to the result of the computation.

Wherein, in step 1, the Forward Comparison of Relative Sizes Sorting method is used to extract the effective images in the web page, and

the related text of each effective image is extracted according to the web page tree structure.

Wherein, step 2 comprises the following steps:

Step 2a: build a web page training set, extract the RGB-SIFT features of the effective images in the training set, cluster them to generate a visual dictionary, and generate the feature vector of each effective image with an image bag-of-words model that combines hard coding with sum pooling;

Step 2b: using a text dictionary, generate the feature vector of the related text with a text bag-of-words model;

Step 2c: combine the feature vector of the effective image with the feature vector of the related text as the instance description.

Wherein, the clustering that generates the visual dictionary in step 2a uses the K-means clustering method and yields a visual dictionary of 1500 visual words.

Wherein, the text dictionary in step 2b comprises 100 keywords representative of the target classification topic and 100 keywords entirely unrelated to the target classification topic;

The step of generating the feature vector of the related text with the text bag-of-words model comprises:

For the related text, generating its 100-dimensional feature vector from keyword statistics over the text dictionary;

The step of combining the feature vector of the effective image with the feature vector of the related text in step 2c comprises:

Directly concatenating the 1500-dimensional feature vector of the effective image with the 100-dimensional feature vector of the related text to obtain a 1600-dimensional feature vector; and

If a web page has no effective image, combining a 1500-dimensional zero vector with the feature vector of the related text.

Wherein, step 3 comprises:

Step 3a: compute a multi-instance kernel over the obtained instances;

Step 3b: combine the multi-instance kernel obtained in the preceding step with a support vector machine to classify the selected web page.

Wherein, step 3a comprises:

Taking each instance generated in step 2 for an effective image as an instance in a bag, and each web page as a bag, let step 2 generate the bags B_i = \{x_{i1}, \ldots, x_{in_i}\} and B_j = \{x_{j1}, \ldots, x_{jn_j}\}, where x denotes the description of the corresponding instance; the similarity between bag B_i and bag B_j is measured as follows:

K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})

Wherein, K_{MI}(\cdot, \cdot) is the multi-instance kernel, K(\cdot, \cdot) is a conventional kernel, and p is a positive integer.

Wherein, step 3a further comprises the following step:

The multi-instance kernel is normalized according to the following formula:

K_{NMI}(B_i, B_j) = \frac{K_{MI}(B_i, B_j)}{\sqrt{K_{MI}(B_i, B_i) \, K_{MI}(B_j, B_j)}},

Wherein, K_{NMI}(\cdot, \cdot) is the normalized multi-instance kernel.

Wherein, step 3b further comprises:

Combining K_{NMI}(B_i, B_j) with a support vector machine to classify the selected web page, wherein the discriminant function of the support vector machine is as follows:

f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b

Wherein, SV is the index set of the support vectors, y_i (+1 or -1) is the class label of feature vector x_i, \alpha_i is the corresponding weight, and b is the bias; the values of \alpha_i and b are all obtained by training; K(\cdot, \cdot) is a conventional kernel; and

After replacing K(\cdot, \cdot) with K_{NMI}(\cdot, \cdot), we obtain:

f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b

As another aspect of the present invention, the invention proposes a method for identifying harmful information in web pages, comprising the following steps:

Step 1: extract the effective images in a web page, and extract the related text of each effective image;

Step 3:

K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})

Wherein, K_{MI}(\cdot, \cdot) is the multi-instance kernel, K(\cdot, \cdot) is a conventional kernel, and p is a positive integer;

Combining K_{NMI}(B_i, B_j) with a support vector machine to identify the harmful information in the selected web page, wherein the discriminant function of the support vector machine is as follows:

f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b

Wherein, SV is the index set of the support vectors, y_i (+1 or -1) is the class label of feature vector x_i, \alpha_i is the corresponding weight, and b is the bias; the values of \alpha_i and b are all obtained by training; and

After replacing K(\cdot, \cdot) with K_{NMI}(\cdot, \cdot), we obtain:

f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b

By treating the images contained in a web page and their related texts as the instances in the web page bag, the web page classification method based on multi-instance learning proposed by the invention conforms more closely to the actual distribution of web page content, makes full use of the effective information in the page, exploits the complementarity between image information and text information, and ultimately achieves better classification results than methods that use only single-modality information.

Brief description of the drawings

Fig. 1 shows screenshots of two drug-related web pages used as examples;

Fig. 2 is a schematic diagram of the Matlab-style pseudocode of the FOCARSS algorithm of the present invention;

Fig. 3 is a schematic diagram of a screenshot of an effective image and its related text;

Fig. 4 is a flow chart of how the description of an instance is generated in the present invention;

Fig. 5 is the complete keyword list of the text dictionary in a specific embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

The method of the present invention is not restricted to any particular hardware or programming language and can be implemented in any language. As an example, the invention was implemented in Matlab on a computer with a 2.83 GHz central processing unit and 2 GB of memory.

The basic procedure of the web page classification method based on multi-instance learning of the present invention is:

Step 1: first extract the effective information: use the Forward Comparison of Relative Sizes Sorting method to extract the effective images in the web page, and extract the related text of each effective image according to the web page tree structure;

Step 2: following the arrangement of effective images and related text in the web page, take each effective image and its related text as one instance in the web page bag, generate the descriptions of the effective image and of its related text with an image bag-of-words model and a text bag-of-words model respectively, and combine the two as the description of the instance;

Step 3: classify the web page with a multi-instance kernel.

Each step of the invention is described in detail below with reference to the accompanying drawings, using drug-related web pages as the example.

Step 1 comprises the following steps:

Step 1a: the Forward Comparison of Relative Sizes Sorting (FOCARSS) method is used to extract the effective images in the web page. Matlab-style pseudocode of the FOCARSS algorithm is shown in Fig. 2. FOCARSS is an algorithm original to the present invention; it ranks images by relative size rather than absolute size. The algorithm first sorts the image sizes in descending order and computes a ratio matrix; it then uses a threshold β to determine the candidate set of effective images; finally it applies a threshold γ to analyze the candidate set more finely and determine the effective images in the web page. Both β and γ are empirical thresholds; analysis of a large number of web pages shows that β = 0.5 and γ = 0.95 give satisfactory extraction results.
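The Fig. 2 pseudocode is not reproduced in this text, so the following Python sketch is only one possible reading of the sort, ratio-matrix and two-threshold procedure described in step 1a, not the FOCARSS algorithm itself. The function name `select_effective_images`, the use of a single number (for example pixel area) as the image size, and the exact meaning given to the two thresholds (a coarse cut relative to the largest image, then a forward extension to images nearly as large as their predecessor) are all assumptions made for illustration.

```python
def select_effective_images(sizes, beta=0.5, gamma=0.95):
    """Return the indices (into `sizes`) of images judged effective.

    Assumed interpretation of step 1a; expects 0 < beta <= 1.
    """
    if not sizes:
        return []
    # Sort image indices by size, largest first.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    s = [sizes[i] for i in order]
    n = len(s)
    if s[0] <= 0:
        return []
    # Ratio matrix: ratio[i][j] = s[j] / s[i] for j >= i, i.e. each image is
    # compared "forward" with the smaller images that follow it.
    ratio = [[s[j] / s[i] if j >= i and s[i] > 0 else 0.0 for j in range(n)]
             for i in range(n)]
    # Coarse pass (assumed): keep images at least beta times the largest one.
    last = max(j for j in range(n) if ratio[0][j] >= beta)
    # Fine pass (assumed): keep extending the set while the next image is
    # almost the same size (ratio >= gamma) as the last one kept.
    while last + 1 < n and ratio[last][last + 1] >= gamma:
        last += 1
    return sorted(order[: last + 1])
```

Under this reading, a grid of equally sized product pictures survives the fine pass together, while small icons and decorations are dropped because their relative size falls sharply.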

Step 1b: the related text of each effective image is extracted according to the web page tree structure. For the HTML file of a web page, tags are extracted and matched, and the corresponding tree structure is generated from the parent-child relationships between tags. For an effective image, its node in the tree is located by its name, and the surrounding text is searched by local traversal, with 200 words as the termination condition of the traversal. The text surrounding the effective image and its tag text are combined as the related text of that image. Fig. 3 shows a screenshot of an effective image and its related text.
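As a rough illustration of step 1b, the sketch below builds a tag tree with Python's standard `html.parser`, locates an image node by the file name in its `src` attribute, and widens the enclosing subtree until about 200 words have been collected. Treating the `alt` attribute as the image's "tag text", the particular climbing strategy, and all names (`Node`, `TreeBuilder`, `related_text`) are assumptions; the patent does not specify these details.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # tags without a close


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Tag matching: climb to the nearest open element with this name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def subtree_words(node):
    words = []
    for child in node.children:
        words.extend(child.split() if isinstance(child, str) else subtree_words(child))
    return words


def find_image(node, name):
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == "img" and (child.attrs.get("src") or "").endswith(name):
                return child
            found = find_image(child, name)
            if found is not None:
                return found
    return None


def related_text(html, image_name, max_words=200):
    builder = TreeBuilder()
    builder.feed(html)
    img = find_image(builder.root, image_name)
    if img is None:
        return ""
    label = (img.attrs.get("alt") or "").split()  # the image's own tag text
    around, node = [], img.parent
    # Local traversal: widen the enclosing subtree until enough words are found.
    while node is not None:
        around = subtree_words(node)
        if len(around) >= max_words:
            break
        node = node.parent
    return " ".join(label + around[:max_words])
```

The word limit keeps the traversal local, so text from distant parts of the page is only pulled in when the image's immediate neighborhood is nearly empty.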

Step 2, shown in Fig. 4, comprises the following steps:

Step 2a: the feature vector of each effective image is generated with an image bag-of-words model. A training web page set is built; in a preferred embodiment it contains 2243 web pages in total, drawn evenly from a number of shopping websites and news websites. All 6219 effective images in the training web pages are used to generate the visual dictionary: the RGB-SIFT descriptors of each effective image are extracted (dense sampling, sampling interval 16), and K-means clustering is applied to all RGB-SIFT descriptors to obtain 1500 cluster centers. Taking each cluster center as one visual word yields a visual dictionary of 1500 visual words. For every effective image (whether from a training page or a test page), the RGB-SIFT descriptors of the image are first extracted (dense sampling, sampling interval 16), and its feature vector is then generated from the visual dictionary by hard coding combined with sum pooling. Specifically, hard coding means that an RGB-SIFT descriptor responds only on the visual word nearest to it, with response 1, and has response 0 on all other visual words; sum pooling means that, after all RGB-SIFT descriptors of an effective image have been coded, the responses on each visual word are added up to give the final response on that word. Hard coding followed by sum pooling yields a 1500-dimensional feature vector for each effective image. As a special case, if a web page has no effective image, a 1500-dimensional zero vector is used as the image feature vector of that page.
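The hard coding and sum pooling of step 2a can be sketched as follows. The code assumes the descriptors have already been extracted and the visual dictionary (cluster centers) already learned; the function names and the tiny 2-dimensional toy vectors in the usage note are illustrative only.

```python
def nearest_word(descriptor, dictionary):
    """Index of the visual word closest (squared Euclidean) to the descriptor."""
    best, best_dist = 0, float("inf")
    for k, word in enumerate(dictionary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best, best_dist = k, dist
    return best


def image_bow(descriptors, dictionary):
    """Hard coding + sum pooling: one vote per descriptor on its nearest word."""
    hist = [0] * len(dictionary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, dictionary)] += 1  # response 1 on one word
    return hist
```

For example, with the toy dictionary `[[0, 0], [10, 10], [0, 10]]` and descriptors `[[1, 1], [9, 9], [0, 1], [1, 9]]`, `image_bow` returns `[2, 1, 1]`. With the 1500-word dictionary of the embodiment the result would have 1500 entries, and an image with no descriptors maps to the zero vector.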

Step 2b: the feature vector of the related text of each effective image is generated with a text bag-of-words model. From harmful and non-harmful web pages, for example drug-related and non-drug-related web pages, 100 representative keywords are carefully selected to form the text dictionary, as shown in Fig. 5. The selection principle is that a keyword should occur many times in drug-related web pages and rarely or never in non-drug-related web pages, which makes the text dictionary highly representative. For the related text of each effective image, a 100-dimensional feature vector is generated from keyword statistics over the text dictionary. As a special case, if a web page has no effective image, its body text is extracted and its feature vector is generated from the same text dictionary statistics.
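A minimal sketch of the keyword statistics in step 2b is given below. The keyword list of Fig. 5 is not available in this text, so the usage note uses neutral placeholder keywords; the lowercase word tokenization and the function name `text_bow` are likewise assumptions.

```python
import re


def text_bow(text, keywords):
    """Count how often each dictionary keyword occurs in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # assumed tokenization
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # One dimension per keyword, in dictionary order.
    return [counts.get(keyword.lower(), 0) for keyword in keywords]
```

With the placeholder dictionary `["alpha", "beta", "gamma"]`, the call `text_bow("Alpha beta, alpha!", ["alpha", "beta", "gamma"])` returns `[2, 1, 0]`; a 100-keyword dictionary gives the 100-dimensional vector described above.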

Step 2c: for each instance in a web page, its 1500-dimensional image feature vector and its 100-dimensional text feature vector are directly concatenated to obtain the 1600-dimensional feature vector of the instance; if a web page contains N (N > 0) instances, N feature vectors of 1600 dimensions are obtained. As a special case, if a web page has no effective image, a 1500-dimensional zero vector is combined with the feature vector of the body text, which also yields a 1600-dimensional feature vector. This vector is taken as the instance of the web page, and the page then has only this one instance.
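The concatenation in step 2c, including the special case of a page without effective images, might look like the sketch below. The dimensions 1500 and 100 come from the text; the function names and the list-based representation are assumptions.

```python
IMAGE_DIM, TEXT_DIM = 1500, 100


def make_instance(image_vec, text_vec):
    """Concatenate image and text features into one 1600-dimensional instance."""
    if image_vec is None:  # no effective image: use a zero vector
        image_vec = [0] * IMAGE_DIM
    if len(image_vec) != IMAGE_DIM or len(text_vec) != TEXT_DIM:
        raise ValueError("unexpected feature dimension")
    return list(image_vec) + list(text_vec)


def make_bag(image_vecs, text_vecs, body_text_vec):
    """Build the bag (list of instances) that represents one web page."""
    if not image_vecs:
        # Special case: a single instance built from the page's body text.
        return [make_instance(None, body_text_vec)]
    return [make_instance(i, t) for i, t in zip(image_vecs, text_vecs)]
```

A page with N effective images thus becomes a bag of N vectors of length 1600, and a page without any becomes a bag with exactly one vector.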

Step 3 takes the instances computed in step 2 as input, computes the multi-instance kernel and performs the final classification task. It comprises the following steps:

Step 3a: compute the multi-instance kernel (Multi-Instance Kernel, MIK).

The multi-instance kernel measures the similarity between bags. Let there be bags B_i = \{x_{i1}, \ldots, x_{in_i}\} and B_j = \{x_{j1}, \ldots, x_{jn_j}\}, where x denotes the description of the corresponding instance. The MIK measures the similarity between bag B_i and bag B_j as follows:

K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})

Wherein, K_{MI}(\cdot, \cdot) is the multi-instance kernel, K(\cdot, \cdot) is some conventional kernel, and p is a positive integer. Because the p-th power of an RBF kernel is still an RBF kernel, this method selects the radial basis function kernel (RBF kernel) as K(\cdot, \cdot); the RBF kernel is widely used and performs well. As with kernel methods in general, the MIK also needs to be normalized:

K_{NMI}(B_i, B_j) = \frac{K_{MI}(B_i, B_j)}{\sqrt{K_{MI}(B_i, B_i) \, K_{MI}(B_j, B_j)}},

Taking a web page as a bag and the feature vectors of the effective images in that page as the instances in the bag, the above formulas can be applied directly.
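The two kernel formulas of step 3a can be written out as the short sketch below. It assumes the common RBF form K(x, y) = exp(-γ‖x − y‖²); under that form the p-th power is again an RBF kernel with width pγ, since exp(-γd²)^p = exp(-pγd²). The parameter name `gamma` here is the RBF width, unrelated to the FOCARSS threshold γ, and the function names are illustrative.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) between two instance vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mi_kernel(bag_i, bag_j, gamma=1.0, p=1):
    """K_MI: sum of K^p over every pair of instances from the two bags."""
    return sum(rbf(x, y, gamma) ** p for x in bag_i for y in bag_j)


def normalized_mi_kernel(bag_i, bag_j, gamma=1.0, p=1):
    """K_NMI: K_MI divided by the geometric mean of the self-similarities."""
    k_ij = mi_kernel(bag_i, bag_j, gamma, p)
    k_ii = mi_kernel(bag_i, bag_i, gamma, p)
    k_jj = mi_kernel(bag_j, bag_j, gamma, p)
    return k_ij / math.sqrt(k_ii * k_jj)
```

Normalization makes the similarity of any bag with itself equal to 1, so pages with many instances do not dominate pages with few simply because their double sum has more terms.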

Step 3b: K_{NMI}(B_i, B_j) is combined with a support vector machine to classify drug-related web pages. The support vector machine is a classifier with good performance and a wide range of applications; its discriminant function is as follows:

f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b

Wherein, SV is the index set of the support vectors, y_i (+1 or -1) is the class label of feature vector x_i, \alpha_i is the corresponding weight, K(\cdot, \cdot) is some conventional kernel, and b is the bias; following the basic principle of support vector machines, the values of \alpha_i and b are all obtained by training. Replacing K(\cdot, \cdot) with K_{NMI}(\cdot, \cdot) gives:

f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b

A support vector machine can thus be applied to web pages in a natural way: during classification, if the output label of a bag is +1, the web page represented by that bag is a drug-related web page; otherwise it is a normal web page.
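The discriminant f(B) can be evaluated from precomputed kernel values, as in the following sketch. It assumes the weights, labels and bias come from an already trained support vector machine (training itself is not shown), and that `kernel_row[i]` holds K_NMI(B_i, B) for the i-th support bag; the numbers in the usage note are made up for illustration. Mapping f(B) = 0 to the label -1 is also an assumption.

```python
def decision_value(kernel_row, alphas, labels, bias):
    """f(B) = sum_i alpha_i * y_i * K_NMI(B_i, B) + b over the support bags."""
    if not (len(kernel_row) == len(alphas) == len(labels)):
        raise ValueError("one kernel value, weight and label per support bag")
    return sum(a * y * k for a, y, k in zip(alphas, labels, kernel_row)) + bias


def classify(kernel_row, alphas, labels, bias):
    """Return +1 (harmful page) if f(B) > 0, otherwise -1 (normal page)."""
    return 1 if decision_value(kernel_row, alphas, labels, bias) > 0 else -1
```

For instance, with two support bags, `kernel_row = [0.9, 0.2]`, `alphas = [1.0, 1.0]`, `labels = [1, -1]` and `bias = 0.0`, the decision value is 0.9 - 0.2 = 0.7 (up to floating-point rounding), so the page is labeled +1.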

As another aspect of the present invention, the invention also provides a method for identifying harmful information in web pages based on multi-instance learning which, on the same principle as the classification method above, identifies and marks web pages containing harmful information. The specific steps comprise:

Step 3:

K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})

f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b

After replacing K(\cdot, \cdot) with K_{NMI}(\cdot, \cdot), we obtain:

f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b .

As the above description of the technical solution shows, the method of the present invention makes full use of the effective information in web pages and achieves better identification and classification results than methods that use only single-modality information. Tests on web pages from real websites show that the method is accurate and fast, and achieves good practical results.

The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the present invention. It should be understood that the foregoing are merely specific embodiments of the invention and do not limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A web page classification method, comprising the following steps:

2. The web page classification method according to claim 1, wherein in step 1 the Forward Comparison of Relative Sizes Sorting method is used to extract the effective images in the web page, and

3. The web page classification method according to claim 1, wherein step 2 comprises the following steps:

4. The web page classification method according to claim 3, wherein the clustering that generates the visual dictionary in step 2a uses the K-means clustering method and yields a visual dictionary of 1500 visual words.

5. The web page classification method according to claim 3, wherein the text dictionary in step 2b comprises 100 keywords representative of the target classification topic and 100 keywords entirely unrelated to the target classification topic;

6. The web page classification method according to claim 1, wherein step 3 comprises:

7. The web page classification method according to claim 6, wherein step 3a comprises:

K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})

8. The web page classification method according to claim 7, wherein step 3a further comprises the following step:

The multi-instance kernel is normalized according to the following formula:

K_{NMI}(B_i, B_j) = \frac{K_{MI}(B_i, B_j)}{\sqrt{K_{MI}(B_i, B_i) \, K_{MI}(B_j, B_j)}},

Wherein, K_{NMI}(\cdot, \cdot) is the normalized multi-instance kernel.

9. The web page classification method according to claim 6, wherein step 3b further comprises:

f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b

After replacing K(\cdot, \cdot) with K_{NMI}(\cdot, \cdot), we obtain:

f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b

10. A method for identifying harmful information in web pages, comprising the following steps:

Step 3:

K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})

f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b

After replacing K(\cdot, \cdot) with K_{NMI}(\cdot, \cdot), we obtain:

f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b