CN104361059B - Harmful information identification and web page classification method based on multi-instance learning

Info

Publication number: CN104361059B
Application number: CN201410609728.4A
Authority: CN
Inventors: 胡卫明; 胡瑞光
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Renmin Zhongke Beijing Intelligent Technology Co ltd
Priority date: 2014-11-03
Filing date: 2014-11-03
Publication date: 2018-03-27
Anticipated expiration: 2034-11-03
Also published as: CN104361059A

Abstract

The invention discloses a web page classification method based on multi-instance learning. The method comprises: designing a Forward Comparison of Relative Sizes Sorting method to extract the effective images in a web page, and extracting the related text of each effective image according to the web page tree structure; taking each effective image together with its related text as one instance in the web page bag, generating descriptions of the effective image and of its related text with an image bag-of-words model and a text bag-of-words model respectively, and combining the two as the description of the instance; and classifying drug-related web pages using a multi-instance kernel. By treating the images contained in a web page and their related text as instances in a web page bag, the method matches the actual distribution of web page content more closely, makes full use of the effective information in the web page, and exploits the complementarity of image and text information, ultimately achieving better classification than methods that use single-modality information alone.

Description

A harmful information identification and web page classification method based on multi-instance learning

Technical field

The present invention relates to the field of network content security, and more specifically to a harmful information identification and web page classification method based on multi-instance learning.

Background

While promoting social progress and development, the Internet has also greatly facilitated the spread of various kinds of harmful information. Such information increasingly endangers normal social activity and healthy value systems, and is especially detrimental to the healthy growth of teenagers. Maximizing the positive role of the Internet while suppressing or eliminating its negative effects helps to clean up the Internet environment, promote social progress, and protect the healthy growth of teenagers. Harmful information on the Internet includes pornography, drugs, violence, terrorism, reactionary content, and so on; among these, drug-related information is even more harmful than the other kinds.

On the Internet, web pages exist as HyperText Markup Language (HTML) files, which are essentially text. Common web page classification methods therefore rely mainly on text information, the most important being the bag-of-words model. The bag-of-words model works as follows: first, a number of keywords are selected to form a text dictionary; then the frequency of each keyword in a document or web page is counted to form a vector; finally, the vector is classified with a suitable classifier.

With the spread of digital devices, web pages contain more and more images and less and less text, so classifying web pages using text information alone no longer matches their actual form. It is therefore highly desirable to combine image information with text information to improve real-world web page classification performance.

As an example, Fig. 1 shows two drug-related web pages: the left one sells drug-taking paraphernalia and the right one sells cannabis. Both pages contain a large number of images and a small amount of text, with the images and text laid out in an orderly arrangement. In such cases, text information alone cannot classify the page well. In addition, there are currently very few patents or publications that address drug-related information on the Internet, and there is an urgent need for a method of identifying and processing harmful information such as drug-related content, so that governments can strengthen their supervision of the Internet and protect people from the lure of such information.

Summary of the invention

In view of this, an object of the present invention is to provide a web page classification method and a harmful information identification method that match the actual distribution of images and text in web pages, thereby solving the technical problem of automatically identifying and classifying harmful information in web pages.

To achieve the above object, as one aspect of the present invention, the present invention proposes a web page classification method comprising the following steps:

Step 1: Extract the effective images in a selected web page, and extract the related text of each effective image;

Step 2: Take each effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;

Step 3: Compute a multi-instance kernel on the obtained instances, and classify the selected web page according to the result of the computation.

Wherein, in step 1, the effective images in the web page are extracted using the Forward Comparison of Relative Sizes Sorting method, and

the related text of each effective image is extracted according to the web page tree structure.

Wherein step 2 comprises the following steps:

Step 2a: Build a web page training set, extract the RGB-SIFT features of the effective images in the training set, cluster them to generate a visual dictionary, and generate the feature vector of each effective image with the image bag-of-words model, using hard coding combined with sum pooling;

Step 2b: Using a text dictionary, generate the feature vector of the related text with the text bag-of-words model;

Step 2c: Combine the feature vector of the effective image with the feature vector of the related text as the description of the instance.

Wherein the clustering in step 2a that generates the visual dictionary uses the K-means clustering method and yields a visual dictionary containing 1500 visual words.

Wherein the text dictionary in step 2b includes 100 keywords representative of the required classification topic and 100 keywords completely unrelated to the required classification topic;

the step of generating the feature vector of the related text with the text bag-of-words model includes:

for the related text, counting keyword occurrences according to the text dictionary to generate its 100-dimensional feature vector;

the step in step 2c of combining the feature vector of the effective image with the feature vector of the related text includes:

directly concatenating the 1500-dimensional feature vector of the effective image with the 100-dimensional feature vector of the related text to obtain a 1600-dimensional feature vector; and

if a web page has no effective image, combining a 1500-dimensional zero vector with the feature vector of the related text.

Wherein step 3 includes:

Step 3a: Compute a multi-instance kernel on the obtained instances;

Step 3b: Combine the multi-instance kernel obtained in the preceding step with a support vector machine to classify the selected web page.

Wherein step 3a includes:

taking the instance generated in step 2 for each effective image as an instance in a bag and each web page as a bag; for bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$ generated in step 2, where x denotes an instance description, the similarity between bag B_i and bag B_j is measured as follows:

$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb})$$

wherein K_MI(·) is the multi-instance kernel, K(·) is a conventional kernel, and p is a positive integer.

Wherein step 3a further comprises the following step:

normalizing the multi-instance kernel,

wherein K_NMI(·) is the normalized multi-instance kernel.

Wherein step 3b further comprises:

combining K_NMI(B_i, B_j) with the support vector machine to classify the selected web page, wherein the discriminant function of the support vector machine is:

$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$

wherein SV is the index set of the support vectors, y_i (+1 or -1) is the class label of feature vector x_i, α_i is the corresponding weight, and b is the bias, the values of α_i and b both being obtained by training; K(·) is a conventional kernel; and

after K(·) is replaced with K_NMI(·), the following is obtained:

$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b$$

As another aspect of the present invention, the present invention proposes a method for identifying harmful information in web pages, comprising the following steps:

Step 1: Extract the effective images in a web page, and extract the related text of each effective image;

Step 3:

taking the instance generated in step 2 for each effective image as an instance in a bag and each web page as a bag; for bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$ generated in step 2, where x denotes an instance description, the similarity between bag B_i and bag B_j is measured as follows:

$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb})$$

wherein K_MI(·) is the multi-instance kernel, K(·) is a conventional kernel, and p is a positive integer;

combining K_NMI(B_i, B_j) with a support vector machine to identify the harmful information in the selected web page, wherein the discriminant function of the support vector machine is:

$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$

wherein SV is the index set of the support vectors, y_i (+1 or -1) is the class label of feature vector x_i, α_i is the corresponding weight, and b is the bias, the values of α_i and b both being obtained by training; and

after K(·) is replaced with K_NMI(·), the following is obtained:

$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b$$

The web page classification method based on multi-instance learning proposed by the present invention treats the images contained in a web page and their related text as instances in a web page bag. This makes the algorithm match the actual distribution of web page content more closely, makes full use of the effective information in the web page, and exploits the complementarity of image and text information, ultimately achieving better classification than methods that use single-modality information alone.

Brief description of the drawings

Fig. 1 shows screenshots of two drug-related web pages used as a demonstration;

Fig. 2 shows Matlab-style pseudocode for the FOCARSS algorithm of the present invention;

Fig. 3 shows a screenshot of an effective image and its related text;

Fig. 4 is a flow chart of how the description of an instance is generated in the present invention;

Fig. 5 is the complete keyword list of the text dictionary in a specific embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

The method of the present invention is not restricted to particular hardware or a particular programming language and can be implemented in any language. As an example, the method was implemented in Matlab on a computer with a 2.83 GHz central processing unit and 2 GB of memory.

The basic procedure of the web page classification method based on multi-instance learning of the present invention is:

Step 1: Extract the effective information: extract the effective images in the web page using the Forward Comparison of Relative Sizes Sorting method, and extract the related text of each effective image according to the web page tree structure;

Step 2: Following the arrangement of effective images and related text in the web page, take each effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text with the image bag-of-words model and the text bag-of-words model respectively, and combine the two as the description of the instance;

Step 3: Classify the web page using the multi-instance kernel.

Each step of the present invention is described in detail below with reference to the accompanying drawings, using drug-related web pages as the demonstration.

Step 1 comprises the following steps:

Step 1a: Extract the effective images in the web page using the Forward Comparison of Relative Sizes Sorting (FOCARSS) method. Matlab-style pseudocode for the FOCARSS algorithm is shown in Fig. 2. FOCARSS is an algorithm created by the present invention; it ranks images by their relative size rather than their absolute size. The algorithm first sorts the image sizes in descending order and computes a ratio matrix; it then uses a threshold β to determine a candidate set of effective images; finally, it performs a fine analysis of the candidate set using a threshold γ to determine the effective images in the web page. Both β and γ are empirical thresholds; analysis of a large number of web pages shows that β = 0.5 and γ = 0.95 give satisfactory extraction results.

Step 1b: Extract the related text of each effective image according to the web page tree structure. For the HTML file of a web page, the tags are extracted and matched, and the corresponding tree structure is generated from the parent-child relationships between tags. For each effective image, its node in the tree is located by the image name, and the text around it is found by local traversal, with 200 words used as the convergence condition of the traversal. The surrounding text of the effective image and its label text are merged to form the related text of the effective image. Fig. 3 shows a screenshot of an effective image and its related text.
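The local traversal of step 1b can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the class and function names (`Node`, `TreeBuilder`, `related_text`), the use of the `alt` and `title` attributes as the label text, and the reading of "convergence" as "stop widening the neighbourhood once 200 words have been collected" are all assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}  # tags with no closing tag

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children = []   # child Node objects
        self.text = ""       # text found directly inside this node

class TreeBuilder(HTMLParser):
    """Builds a simple tag tree (assumes reasonably well-formed HTML)."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Match the closing tag against the nearest open ancestor of the same name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += " " + data

def find_image(node, name):
    """Depth-first search for the <img> node whose src ends with the image name."""
    if node.tag == "img" and (node.attrs.get("src") or "").endswith(name):
        return node
    for child in node.children:
        found = find_image(child, name)
        if found is not None:
            return found
    return None

def subtree_words(node):
    words = node.text.split()
    for child in node.children:
        words += subtree_words(child)
    return words

def related_text(root, image_name, max_words=200):
    img = find_image(root, image_name)
    if img is None:
        return ""
    # Assumption: the image's "label text" is its alt/title attributes.
    label = [img.attrs[k] for k in ("alt", "title") if img.attrs.get(k)]
    words, node = [], img.parent
    # Widen the neighbourhood one ancestor at a time until enough words are found.
    while node is not None:
        words = subtree_words(node)
        if len(words) >= max_words:
            break
        node = node.parent
    return " ".join(label + words[:max_words])

builder = TreeBuilder()
builder.feed('<html><body><div><img src="pics/a.jpg" alt="pipe">'
             '<p>glass pipe for sale</p></div>'
             '<div><p>unrelated footer text</p></div></body></html>')
print(related_text(builder.root, "a.jpg", max_words=4))
```

With a small word limit the text stays local to the image's own container; with the limit of 200 words used in the embodiment, the traversal climbs to larger ancestors on sparse pages.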

Step 2, shown in Fig. 4, comprises the following steps:

Step 2a: Generate the feature vector of each effective image using the image bag-of-words model. A training web page set is built; in a preferred embodiment it contains 2243 web pages in total, drawn evenly from a number of shopping websites and news websites. All 6219 effective images in the training web pages are used to generate the visual dictionary: the RGB-SIFT descriptors of each effective image are extracted (dense sampling, sampling interval 16), and K-means clustering is performed on all RGB-SIFT descriptors to obtain 1500 cluster centres; each cluster centre is taken as one visual word, giving a visual dictionary of 1500 visual words. For every effective image (whether from a training web page or a test web page), the RGB-SIFT descriptors of the image are first extracted (dense sampling, sampling interval 16), and its feature vector is then generated from the visual dictionary using hard coding combined with sum pooling. Specifically, hard coding means that an RGB-SIFT descriptor responds only on the visual word nearest to it, with response 1, and has response 0 on all other visual words; sum pooling means that, after all RGB-SIFT descriptors of an effective image have been coded, the responses on each visual word are summed to give the final response on that word. Hard coding and sum pooling together yield a 1500-dimensional feature vector for each effective image. As a special case, if a web page has no effective image, a 1500-dimensional zero vector is used as the image feature vector of the web page.
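Hard coding followed by sum pooling amounts to counting, for each visual word, how many descriptors have it as their nearest word. The following Python sketch shows this on toy two-dimensional data; the function names are hypothetical, and real RGB-SIFT descriptors and a 1500-word dictionary would simply replace the toy lists.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode_image(descriptors, visual_words):
    """Hard coding + sum pooling: one accumulated response per visual word."""
    histogram = [0] * len(visual_words)
    for d in descriptors:
        # Hard coding: the descriptor responds only on its nearest visual word.
        nearest = min(range(len(visual_words)),
                      key=lambda k: squared_distance(d, visual_words[k]))
        # Sum pooling: responses on each word are accumulated over the image.
        histogram[nearest] += 1
    return histogram

# Toy data (assumed for illustration): 3 visual words, 4 descriptors.
visual_words = [[0, 0], [10, 10], [0, 10]]
descriptors = [[1, 0], [9, 9], [0, 1], [1, 9]]
print(encode_image(descriptors, visual_words))
```

An image with no descriptors yields the all-zero vector, which matches the zero-vector convention used for web pages without effective images.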

Step 2b: Generate the feature vector of the related text of each effective image using the text bag-of-words model. 100 representative keywords are carefully selected from harmful and non-harmful web pages, for example drug-related and non-drug-related web pages, to form the text dictionary shown in Fig. 5. The selection principle is that a keyword should occur many times in drug-related web pages and rarely, or not at all, in non-drug-related web pages, which makes the text dictionary highly representative. For the related text of each effective image, a 100-dimensional feature vector is generated by counting keyword occurrences according to the text dictionary. As a special case, if a web page has no effective image, its body text is extracted and its feature vector is generated from the text dictionary in the same way.
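The keyword counting of step 2b can be sketched in a few lines of Python. The three-word dictionary below is a hypothetical stand-in for the 100 keywords of Fig. 5, and the whitespace-and-punctuation tokenizer is an assumption that suits English text only; Chinese text would need a word segmenter instead.

```python
import re

def text_feature(text, dictionary):
    """Counts how often each dictionary keyword occurs in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return [tokens.count(word) for word in dictionary]

# Hypothetical miniature dictionary; the embodiment uses 100 keywords.
dictionary = ["cannabis", "pipe", "weather"]
print(text_feature("Glass pipe, cannabis pipe!", dictionary))
```

The output vector has one dimension per keyword, so a 100-keyword dictionary gives the 100-dimensional text feature vector described above.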

Step 2c: For each instance in a web page, its 1500-dimensional image feature vector and 100-dimensional text feature vector are directly concatenated to give the 1600-dimensional feature vector of the instance; if a web page contains N (N > 0) instances, N 1600-dimensional feature vectors are obtained. As a special case, if a web page has no effective image, a 1500-dimensional zero vector is combined with the feature vector of the body text, which also gives a 1600-dimensional feature vector. This vector serves as the instance of the web page, and the web page has only this one instance.
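As a minimal illustration of step 2c, the following Python sketch builds the bag of 1600-dimensional instances for one web page, including the special case of a page without effective images. The function name `build_bag` and its argument layout are assumptions made for this example.

```python
IMAGE_DIM, TEXT_DIM = 1500, 100

def build_bag(instances, body_text_vector):
    """instances: list of (image_vector, text_vector) pairs, one per effective image.

    body_text_vector is only used when the page has no effective image.
    """
    if not instances:
        # Special case: a zero image vector joined with the body-text vector.
        return [[0] * IMAGE_DIM + list(body_text_vector)]
    # Regular case: concatenate image and text vectors of every instance.
    return [list(img) + list(txt) for img, txt in instances]

empty_page_bag = build_bag([], [1] * TEXT_DIM)
three_image_bag = build_bag([([1] * IMAGE_DIM, [2] * TEXT_DIM)] * 3, [0] * TEXT_DIM)
print(len(empty_page_bag), len(three_image_bag), len(three_image_bag[0]))
```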

Step 3 takes the instances computed in step 2 as input, computes the multi-instance kernel, and performs the final classification. It comprises the following steps:

Step 3a: Compute the multi-instance kernel (Multi-Instance Kernel, MIK).

The multi-instance kernel measures the similarity between bags. Given bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$, where x denotes an instance description, MIK measures the similarity between bag B_i and bag B_j as follows:

$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb})$$

where K_MI(·) is the multi-instance kernel, K(·) is a conventional kernel, and p is a positive integer. Because the p-th power of an RBF kernel is still an RBF kernel, this method selects the radial basis function (RBF) kernel as K(·); the RBF kernel is widely used and performs well. As in kernel methods generally, MIK also needs to be normalized, which gives the normalized multi-instance kernel K_NMI(·).
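The remark that the p-th power of an RBF kernel is again an RBF kernel can be checked in one line. The bandwidth symbol σ is an assumption introduced here for illustration, since the text does not name the RBF parameter:

$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \;\Rightarrow\; K^p(x, y) = \exp\left(-\frac{p\,\|x - y\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\|x - y\|^2}{2(\sigma/\sqrt{p})^2}\right)$$

so raising the kernel to the power p only rescales the bandwidth from σ to σ/√p. As for the normalization, assuming it is the usual cosine-style kernel normalization (an assumption; other normalizations exist), it would read:

$$K_{NMI}(B_i, B_j) = \frac{K_{MI}(B_i, B_j)}{\sqrt{K_{MI}(B_i, B_i)\, K_{MI}(B_j, B_j)}}$$

Under that assumption every bag has similarity 1 with itself, which removes the dependence of the kernel value on the number of instances in a bag.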

Taking each web page as a bag and the feature vectors of the effective images in the web page as the instances in the bag, the above formulas can be applied directly.
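A minimal Python sketch of the multi-instance kernel with an RBF base kernel follows. The parameter name `gamma` for the RBF width and the cosine-style normalization in `normalized_mi_kernel` are assumptions made for illustration; only the double sum over instance pairs is taken directly from the formula for K_MI.

```python
import math

def rbf(x, y, gamma):
    """RBF base kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mi_kernel(bag_i, bag_j, gamma=1.0, p=1):
    """K_MI: sum of K^p over all pairs of instances from the two bags."""
    return sum(rbf(x, y, gamma) ** p for x in bag_i for y in bag_j)

def normalized_mi_kernel(bag_i, bag_j, gamma=1.0, p=1):
    """Assumed normalization: K_MI(Bi,Bj) / sqrt(K_MI(Bi,Bi) * K_MI(Bj,Bj))."""
    k_ij = mi_kernel(bag_i, bag_j, gamma, p)
    k_ii = mi_kernel(bag_i, bag_i, gamma, p)
    k_jj = mi_kernel(bag_j, bag_j, gamma, p)
    return k_ij / math.sqrt(k_ii * k_jj)

# Toy bags with two-dimensional instances (real instances are 1600-dimensional).
bag_a = [[0.0, 0.0], [1.0, 0.0]]
bag_b = [[0.0, 1.0]]
print(normalized_mi_kernel(bag_a, bag_b, gamma=0.5, p=2))
```

Because the p-th power of an RBF kernel equals an RBF kernel with `gamma * p`, the exponent p and the width could be folded into a single parameter in practice.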

Step 3b: Combine K_NMI(B_i, B_j) with a support vector machine (SVM) to classify drug-related web pages. The SVM is a well-performing classifier with a wide range of applications; its discriminant function is:

$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$

where SV is the index set of the support vectors, y_i (+1 or -1) is the class label of feature vector x_i, α_i is the corresponding weight, K(·) is a conventional kernel, and b is the bias; following the basic principle of the SVM, the values of α_i and b are both obtained by training. Replacing K(·) with K_NMI(·) gives:

$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b$$

Web pages can thus be classified directly with the SVM: at classification time, if the output label of a bag is +1, the web page represented by the bag is a drug-related web page; otherwise it is a normal web page.
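The decision rule can be written as a short Python sketch. It assumes that training has already produced the support bags, the weights α_i, the labels y_i, and the bias b; the function name `classify_bag` and the trivial stand-in kernel in the demonstration are hypothetical, and in practice the normalized multi-instance kernel would be passed as `kernel`.

```python
def classify_bag(bag, support_bags, alphas, labels, bias, kernel):
    """Evaluates f(B) = sum_i alpha_i * y_i * kernel(B_i, B) + b and returns +1 or -1."""
    score = sum(a * y * kernel(b_i, bag)
                for a, y, b_i in zip(alphas, labels, support_bags)) + bias
    # +1: drug-related web page; anything else is treated as a normal web page.
    return 1 if score > 0 else -1

# Toy demonstration with a stand-in kernel that only recognises identical bags.
toy_kernel = lambda bag_i, bag_j: 1.0 if bag_i == bag_j else 0.0
support_bags = [[[0.0]], [[1.0]]]
alphas, labels, bias = [1.0, 1.0], [+1, -1], 0.0
print(classify_bag([[0.0]], support_bags, alphas, labels, bias, toy_kernel))
```

Passing the kernel as an argument keeps the decision rule independent of how bag similarity is computed, which mirrors the substitution of K_NMI(·) for K(·) in the discriminant function.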

As another aspect of the present invention, the present invention also provides a method for identifying harmful information in web pages based on multi-instance learning. Based on the same principle as the classification method above, web pages containing harmful information are identified and marked. The specific steps include:

Step 3:

After K(·) is replaced with K_NMI(·), the following is obtained:

$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b$$

As the above description of the technical solution shows, the method of the present invention makes full use of the effective information in web pages and achieves better identification and classification than methods that use single-modality information. Tests on a number of web pages from real websites show that the method is accurate and fast and achieves good practical results.

The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A web page classification method, comprising the following steps:

Step 2: Take each effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;

Step 3: Compute a multi-instance kernel on the obtained instances and classify the selected web page according to the result of the computation, step 3 specifically comprising the following steps:

Step 3a: Compute a multi-instance kernel on the obtained instances, step 3a specifically comprising:

taking the instance generated in step 2 for each effective image as an instance in a bag and each web page as a bag; for bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$ generated in step 2, where x denotes an instance description, measuring the similarity between bag B_i and bag B_j as follows:

$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb});$$

wherein K_NMI(·) is the normalized multi-instance kernel;

Step 3b: Combine the multi-instance kernel obtained in the preceding step with a support vector machine to classify the selected web page, step 3b further comprising:

combining K_NMI(B_i, B_j) with the support vector machine to classify the selected web page, wherein the discriminant function of the support vector machine is:

$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b;$$

wherein SV is the index set of the support vectors, y_i is the class label of feature vector x_i, α_i is the corresponding weight, and b is the bias, the values of α_i and b both being obtained by training; K(·) is a conventional kernel; and

after K(·) is replaced with K_NMI(·), the following is obtained:

$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b.$$

2. The web page classification method according to claim 1, wherein in step 1 the effective images in the web page are extracted using the Forward Comparison of Relative Sizes Sorting method, and

the related text of each effective image is extracted according to the web page tree structure;

wherein the Forward Comparison of Relative Sizes Sorting method includes: first sorting the image sizes in descending order and computing a ratio matrix; then determining a candidate set of effective images using a threshold β; and then performing a fine analysis of the candidate set using a threshold γ to determine the effective images in the web page, wherein the threshold β and the threshold γ are two empirical thresholds.

3. The web page classification method according to claim 1, wherein step 2 comprises the following steps:

Step 2a: Build a web page training set, extract the RGB-SIFT features of the effective images in the training set, cluster them to generate a visual dictionary, and generate the feature vector of each effective image with the image bag-of-words model, using hard coding combined with sum pooling;

Step 2c: Combine the feature vector of the effective image with the feature vector of the related text as the description of the instance.

4. The web page classification method according to claim 3, wherein the clustering in step 2a that generates the visual dictionary uses the K-means clustering method and yields a visual dictionary containing 1500 visual words.

5. The web page classification method according to claim 3, wherein the text dictionary in step 2b includes 100 keywords representative of the required classification topic and 100 keywords completely unrelated to the required classification topic;

the step in step 2c of combining the feature vector of the effective image with the feature vector of the related text includes:

directly concatenating the 1500-dimensional feature vector of the effective image with the 100-dimensional feature vector of the related text to obtain a 1600-dimensional feature vector; and

if a web page has no effective image, combining a 1500-dimensional zero vector with the feature vector of the related text.

6. A method for identifying harmful information in web pages, comprising the following steps:

Step 1: Extract the effective images in a selected web page, and extract the related text of each effective image;

Step 3: Take the instance generated in step 2 for each effective image as an instance in a bag and each web page as a bag; for bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$ generated in step 2, where x denotes an instance description, measure the similarity between bag B_i and bag B_j as follows:

$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb});$$

wherein K_NMI(·) is the normalized multi-instance kernel;

combining K_NMI(B_i, B_j) with a support vector machine to identify the harmful information in the selected web page, wherein the discriminant function of the support vector machine is:

$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b;$$

wherein SV is the index set of the support vectors, y_i is the class label of feature vector x_i, α_i is the corresponding weight, and b is the bias, the values of α_i and b both being obtained by training; and

after K(·) is replaced with K_NMI(·), the following is obtained:

$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b.$$