CN104361059B - A harmful information identification and web page classification method based on multi-instance learning


Publication number
CN104361059B
CN104361059B CN201410609728.4A CN201410609728A CN104361059B CN 104361059 B CN104361059 B CN 104361059B CN 201410609728 A CN201410609728 A CN 201410609728A CN 104361059 B CN104361059 B CN 104361059B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410609728.4A
Other languages
Chinese (zh)
Other versions
CN104361059A (en
Inventor
胡卫明
胡瑞光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin Zhongke Beijing Intelligent Technology Co ltd
Original Assignee
Institute of Automation of Chinese Academy of Science


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a web page classification method based on multi-instance learning. The method includes: extracting the effective images in a web page with a purpose-designed forward comparison of relative sizes sorting method, and extracting the related text of each effective image according to the web page tree structure; treating each effective image together with its related text as one instance in the web page bag, generating descriptions of the effective image and of its related text with an image bag-of-words model and a text bag-of-words model respectively, and combining the two as the description of the instance; and classifying drug-related web pages with a multi-instance kernel. By treating the images contained in a web page and their related text as instances in a web page bag, the method matches the actual distribution of web page content more closely, makes full use of the effective information in the page, and exploits the complementarity of image information and text information, so that it ultimately classifies better than methods that use single-modality information alone.

Description

A harmful information identification and web page classification method based on multi-instance learning
Technical field
The present invention relates to the field of network content security, and more specifically to a harmful information identification and web page classification method based on multi-instance learning.
Background
While promoting social progress and development, the Internet has also made it far easier for all kinds of harmful information to spread. Such information increasingly endangers normal social activity and healthy value systems, and is especially detrimental to the healthy development of teenagers. Making the most of the Internet's positive role while suppressing or eliminating its negative effects helps to clean up the online environment, promote social progress, and protect the healthy development of teenagers. Harmful information on the Internet includes pornography, drugs, violence, terrorism, reactionary content and the like; among these, drug-related information is even more damaging than the other kinds.
On the Internet, web pages exist as HTML (HyperText Markup Language) files, and an HTML file is essentially text. Common web page classification methods therefore rely mainly on text information, and the most important of them is the bag-of-words model. The model works as follows: first, a set of keywords is selected to form a text dictionary; next, the frequency of each keyword in the document or web page is counted to form a vector; finally, a suitable classifier classifies that vector.
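The counting step of the bag-of-words model can be illustrated with a minimal Python sketch. This is not code from the patent: the function name `bow_vector`, the simple lowercase tokenizer and the three-keyword dictionary are all assumptions chosen for illustration, and a real system for Chinese pages would need a different tokenizer.

```python
import re
from collections import Counter

def bow_vector(text, dictionary):
    """Count how often each dictionary keyword occurs in the text."""
    # Assumed tokenizer: lowercase runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # One vector component per keyword, in dictionary order.
    return [counts[word] for word in dictionary]

# Hypothetical three-keyword dictionary, for illustration only.
dictionary = ["price", "shipping", "weather"]
vector = bow_vector("Price list: low price, free shipping.", dictionary)
print(vector)  # [2, 1, 0]
```

The resulting vector has one component per dictionary keyword and can be passed to any classifier that accepts fixed-length vectors.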
With the wide availability of digital devices, web pages contain more and more images and less and less text, so classifying pages by text information alone no longer matches their actual form. It is therefore highly desirable to use image information together with text information to improve classification performance on real web pages.
As an example, Fig. 1 shows two drug-related web pages: the page on the left sells drug-use paraphernalia and the page on the right sells marijuana. Both pages contain a large number of images and only a small amount of text, with the images and text laid out in a regular arrangement. In such cases, text information alone cannot classify the pages well. In addition, very few patents or publications currently address the handling of drug-related information on the Internet. A method for identifying harmful information such as drug-related content is urgently needed, so that governments can strengthen their supervision of the Internet and protect people from the temptation of such information.
Summary of the invention
In view of this, an object of the present invention is to provide a web page classification method and a harmful information identification method that match the actual distribution of image and text quantities in web pages, thereby solving the technical problem of automatically identifying and classifying harmful information in web pages.
To achieve the above object, as one aspect of the present invention, the present invention proposes a web page classification method comprising the following steps:
Step 1: extract the effective images in a selected web page, and extract the related text of each effective image;
Step 2: treat one effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3: compute a multi-instance kernel on the resulting instances, and classify the selected web page according to the result.
Wherein, in step 1, the effective images in the web page are extracted with the forward comparison of relative sizes sorting method, and
the related text of each effective image is extracted according to the web page tree structure.
Wherein, step 2 comprises the following steps:
Step 2a: build a web page training set, extract RGB-SIFT features from the effective images in the training set, cluster them to generate a visual dictionary, and generate the feature vector of each effective image through the image bag-of-words model using hard coding and sum pooling;
Step 2b: using a text dictionary, generate the feature vector of the related text with the text bag-of-words model;
Step 2c: combine the feature vector of the effective image with the feature vector of the related text as the description of the instance.
Wherein, the step of clustering to generate the visual dictionary in step 2a uses K-means clustering and yields a visual dictionary containing 1500 visual words.
Wherein, the text dictionary in step 2b includes 100 keywords that are representative of the required classification topic and 100 keywords that are completely unrelated to the required classification topic;
the step of generating the feature vector of the related text with the text bag-of-words model includes:
for the related text, generating its 100-dimensional feature vector by counting keyword occurrences according to the text dictionary;
the step of combining the feature vector of the effective image with the feature vector of the related text in step 2c includes:
directly concatenating the 1500-dimensional feature vector of the effective image with the 100-dimensional feature vector of the related text to obtain a 1600-dimensional feature vector; and
if a web page has no effective image, combining a 1500-dimensional zero vector with the feature vector of the related text.
Wherein, step 3 includes:
Step 3a: compute the multi-instance kernel on the resulting instances;
Step 3b: combine the multi-instance kernel obtained in the previous step with a support vector machine (SVM) to classify the selected web page.
Wherein, step 3a includes:
taking the description generated in step 2 for each effective image as one instance in a bag and each web page as one bag; for bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$ generated in step 2, where each $x$ is the description of the corresponding instance, measuring the similarity between bag $B_i$ and bag $B_j$ as follows:
$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb})$$
where $K_{MI}(\cdot)$ is the multi-instance kernel, $K(\cdot)$ is a conventional kernel, and $p$ is a positive integer.
Wherein, step 3a further comprises the following step:
normalizing the multi-instance kernel according to the following formula:
$$K_{NMI}(B_i, B_j) = \frac{K_{MI}(B_i, B_j)}{\sqrt{K_{MI}(B_i, B_i)\, K_{MI}(B_j, B_j)}}$$
where $K_{NMI}(\cdot)$ is the normalized multi-instance kernel.
Wherein, step 3b further comprises:
combining $K_{NMI}(B_i, B_j)$ with a support vector machine to classify the selected web page, wherein the discriminant function of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are both obtained by training; $K(\cdot)$ is a conventional kernel; and
replacing $K(\cdot)$ with $K_{NMI}(\cdot)$ gives:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b$$
As another aspect of the present invention, the present invention proposes a method for identifying harmful information in web pages, comprising the following steps:
Step 1: extract the effective images in a web page, and extract the related text of each effective image;
Step 2: treat one effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3:
take the description generated in step 2 for each effective image as one instance in a bag and each web page as one bag; for bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$ generated in step 2, where each $x$ is the description of the corresponding instance, measure the similarity between bag $B_i$ and bag $B_j$ as follows:
$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb})$$
where $K_{MI}(\cdot)$ is the multi-instance kernel, $K(\cdot)$ is a conventional kernel, and $p$ is a positive integer;
combine the normalized multi-instance kernel $K_{NMI}(B_i, B_j)$ with a support vector machine to identify the harmful information in the selected web page, wherein the discriminant function of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are both obtained by training; and
replacing $K(\cdot)$ with $K_{NMI}(\cdot)$ gives:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b$$
By treating the images contained in a web page and their related text as instances in a web page bag, the web page classification method based on multi-instance learning proposed by the present invention matches the actual distribution of web page content more closely, makes full use of the effective information in the page, and exploits the complementarity of image information and text information, so that it ultimately classifies better than methods that use single-modality information alone.
Brief description of the drawings
Fig. 1 shows screenshots of two drug-related web pages used for demonstration;
Fig. 2 shows the Matlab-style pseudocode of the FOCARSS algorithm of the present invention;
Fig. 3 is a schematic diagram of a screenshot of one effective image and its related text;
Fig. 4 is a flow chart of how the description of an instance is generated in the present invention;
Fig. 5 is the full keyword list of the text dictionary in one specific embodiment of the present invention.
Detailed description of embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The method of the present invention is not tied to particular hardware or to a particular programming language, and can be implemented in any language. As an example, the present invention was implemented in Matlab on a computer with a 2.83 GHz central processing unit and 2 GB of memory.
The basic procedure of the web page classification method based on multi-instance learning of the present invention is:
Step 1: perform effective information extraction: extract the effective images in the web page with the forward comparison of relative sizes sorting method, and extract the related text of each effective image according to the web page tree structure;
Step 2: following the way effective images and related text are arranged in web pages, treat one effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text with the image bag-of-words model and the text bag-of-words model respectively, and combine the two as the description of the instance;
Step 3: classify the web page using the multi-instance kernel.
Each step of the present invention is described in detail below with reference to the accompanying drawings, using drug-related web pages as the demonstration case.
Step 1 comprises the following steps:
Step 1a: extract the effective images in the web page with the forward comparison of relative sizes sorting method (FOrward CompArison of Relative Sizes Sorting, FOCARSS). Matlab-style pseudocode for the FOCARSS algorithm is shown in Fig. 2. FOCARSS is an algorithm created by the present invention; it ranks images by their relative size rather than their absolute size. It first sorts the image sizes in descending order and computes a ratio matrix; it then uses a threshold β to determine a candidate set of effective images; finally it uses a threshold γ to analyse the candidate set more finely and determine the effective images in the page. β and γ are empirical thresholds; analysis of a large number of web pages shows that β = 0.5 and γ = 0.95 give satisfactory extraction results.
Step 1b: extract the related text of each effective image according to the web page tree structure. For the HTML file of a web page, tags are extracted and matched, and the corresponding tree structure is generated from the parent-child relationships between tags. For each effective image, its node in the tree is located by its name, and the text around it is found by local traversal, with 200 words as the termination condition of the traversal. The surrounding text of the effective image and its tag text are merged to form the related text of that image. Fig. 3 shows a screenshot of one effective image and a schematic of its related text.
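One possible reading of step 1b is sketched below in Python with the standard-library `html.parser`. The patent does not give the traversal in detail, so several choices here are assumptions: the image is located by its `src` attribute, the "tag text" is taken to be the `alt` attribute, and the local traversal widens the neighbourhood one ancestor at a time until it holds at least 200 words. All class and function names are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, attrs, parent):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []  # child Node objects and text strings, in document order

class TreeBuilder(HTMLParser):
    """Build a simple tree from matched start and end tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("root", [], None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

def find_image(node, src):
    """Depth-first search for the <img> node whose src matches."""
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == "img" and child.attrs.get("src") == src:
                return child
            found = find_image(child, src)
            if found is not None:
                return found
    return None

def subtree_words(node):
    """Collect the words under a node in document order."""
    words = []
    for child in node.children:
        if isinstance(child, str):
            words.extend(child.split())
        else:
            words.extend(subtree_words(child))
    return words

def related_text(html, src, max_words=200):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    image = find_image(builder.root, src)
    if image is None:
        return ""
    # Local traversal (assumed form): widen the scope one ancestor at a time
    # and stop once it holds max_words words or the root is reached.
    scope = image.parent
    words = subtree_words(scope)
    while len(words) < max_words and scope.parent is not None:
        scope = scope.parent
        words = subtree_words(scope)
    label = (image.attrs.get("alt") or "").split()
    # Words are taken in document order from the final scope.
    return " ".join(label + words[:max_words])
```

For example, `related_text(page_html, "a.jpg")` would return the `alt` text of that image followed by up to 200 words from the enclosing elements. A production version would need to handle malformed markup and to rank surrounding words by their distance from the image.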
Step 2, shown in Fig. 4, comprises the following steps:
Step 2a: generate the feature vector of each effective image with the image bag-of-words model. A training set of web pages is built; in a preferred embodiment it contains 2243 web pages in total, drawn evenly from a number of shopping websites and news websites. The 6219 effective images found in all training pages are all used to generate the visual dictionary: RGB-SIFT descriptors are extracted from each effective image (dense sampling, sampling interval 16), and K-means clustering is applied to all the RGB-SIFT descriptors to obtain 1500 cluster centres. Each cluster centre is taken as one visual word, giving a visual dictionary of 1500 visual words. For every effective image (whether from a training page or a test page), the RGB-SIFT descriptors of the image are first extracted (dense sampling, sampling interval 16), and its feature vector is then generated from the visual dictionary by hard coding and sum pooling. Specifically, hard coding means that an RGB-SIFT descriptor has a response only on the visual word nearest to it, with response 1, and response 0 on all other visual words; sum pooling means that after all RGB-SIFT descriptors of an effective image have been coded, the responses on each visual word are added up to give the final response on that word. Hard coding and sum pooling together yield the 1500-dimensional feature vector of an effective image. As a special case, if a web page has no effective image, a 1500-dimensional zero vector is used as the image feature vector of the page.
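Hard coding followed by sum pooling amounts to a nearest-centre histogram. The following Python sketch illustrates this on toy data; it is an illustration rather than the patent's Matlab implementation, the names `encode_image` and `squared_distance` are hypothetical, and the 2-D points merely stand in for RGB-SIFT descriptors and K-means cluster centres.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode_image(descriptors, visual_words):
    """Hard-assign each descriptor to its nearest visual word, then sum the responses."""
    histogram = [0] * len(visual_words)
    for d in descriptors:
        nearest = min(range(len(visual_words)),
                      key=lambda k: squared_distance(d, visual_words[k]))
        histogram[nearest] += 1  # response 1 on the nearest word, 0 on all others
    return histogram

# Toy 2-D data standing in for RGB-SIFT descriptors and K-means centres.
visual_words = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
descriptors = [[1.0, 1.0], [9.0, 9.5], [0.5, 0.2], [1.0, 9.0]]
print(encode_image(descriptors, visual_words))  # [2, 1, 1]
```

With a 1500-word dictionary the same function would return a 1500-dimensional vector, and an image with no descriptors would map to the zero vector.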
Step 2b: generate the feature vector of the related text of each effective image with the text bag-of-words model. From harmful-information web pages and non-harmful web pages, for example drug-related and non-drug-related web pages, 100 representative keywords are carefully selected to form the text dictionary, as shown in Fig. 5. The selection principle is that a keyword should occur many times in drug-related web pages and rarely or never in non-drug-related web pages, which makes the text dictionary highly representative. For the related text of each effective image, its 100-dimensional feature vector is generated by counting keyword occurrences according to the text dictionary. As a special case, if a web page has no effective image, its body text is extracted and its feature vector is generated by counting according to the same text dictionary.
Step 2c: for each instance in a web page, its 1500-dimensional image feature vector and its 100-dimensional text feature vector are directly concatenated to obtain the 1600-dimensional feature vector of the instance; if a web page contains N (N > 0) instances, N feature vectors of 1600 dimensions are obtained. As a special case, if a web page has no effective image, a 1500-dimensional zero vector is combined with the feature vector of the body text, which also gives a 1600-dimensional feature vector. This vector serves as the instance of the web page, and it is the page's only instance.
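The assembly of a web page bag in step 2c, including the no-image special case, can be sketched as follows. The dimensions 1500 and 100 come from the text; the function names `instance_vector` and `page_bag` and the plain-list representation are assumptions made for illustration.

```python
IMAGE_DIM, TEXT_DIM = 1500, 100

def instance_vector(image_vec, text_vec):
    """Concatenate an image vector and a text vector into one 1600-dimensional instance."""
    if image_vec is None:
        image_vec = [0] * IMAGE_DIM  # zero vector when there is no effective image
    if len(image_vec) != IMAGE_DIM or len(text_vec) != TEXT_DIM:
        raise ValueError("unexpected feature dimension")
    return list(image_vec) + list(text_vec)

def page_bag(instances, body_text_vec):
    """Build the bag for one page.

    instances: list of (image_vec, text_vec) pairs, one per effective image.
    body_text_vec: text vector of the page body, used only when the list is empty.
    """
    if not instances:
        return [instance_vector(None, body_text_vec)]
    return [instance_vector(img, txt) for img, txt in instances]
```

A page with N effective images therefore yields a bag of N vectors, and a page without effective images yields a bag containing a single vector whose first 1500 components are zero.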
Step 3 takes the instances computed in step 2 as input, computes the multi-instance kernel and performs the final classification task. It comprises the following steps:
Step 3a: compute the multi-instance kernel (Multi-Instance Kernel, MIK).
The multi-instance kernel measures the similarity between bags. Given bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$, where each $x$ is the description of the corresponding instance, MIK measures the similarity between bag $B_i$ and bag $B_j$ as follows:
$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb})$$
where $K_{MI}(\cdot)$ is the multi-instance kernel, $K(\cdot)$ is some conventional kernel, and $p$ is a positive integer. Because the $p$-th power of an RBF kernel is still an RBF kernel (for $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, $K^p(x, y) = \exp(-p\|x - y\|^2 / 2\sigma^2)$), this method selects the RBF kernel as $K(\cdot)$; the RBF kernel is widely used and performs well. As with kernel methods in general, MIK also needs to be normalized:
$$K_{NMI}(B_i, B_j) = \frac{K_{MI}(B_i, B_j)}{\sqrt{K_{MI}(B_i, B_i)\, K_{MI}(B_j, B_j)}}$$
Taking a web page as one bag, and the feature vectors generated for the effective images in that page as the instances in the bag, the above formulas can be applied directly.
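The two formulas of step 3a translate almost line by line into code. The sketch below is a pure-Python illustration with an RBF base kernel; the parameter name `rbf_width`, its default value and the function names are assumptions, since the patent does not state the kernel parameters it used.

```python
import math

def rbf(x, y, rbf_width=0.5):
    """RBF kernel exp(-rbf_width * ||x - y||^2); rbf_width is an assumed parameter."""
    return math.exp(-rbf_width * sum((a - b) ** 2 for a, b in zip(x, y)))

def mi_kernel(bag_i, bag_j, p=1, rbf_width=0.5):
    """K_MI: sum of K^p over all pairs of instances drawn from the two bags."""
    return sum(rbf(x, y, rbf_width) ** p for x in bag_i for y in bag_j)

def normalized_mi_kernel(bag_i, bag_j, p=1, rbf_width=0.5):
    """K_NMI: K_MI(B_i, B_j) divided by sqrt(K_MI(B_i, B_i) * K_MI(B_j, B_j))."""
    k_ij = mi_kernel(bag_i, bag_j, p, rbf_width)
    k_ii = mi_kernel(bag_i, bag_i, p, rbf_width)
    k_jj = mi_kernel(bag_j, bag_j, p, rbf_width)
    return k_ij / math.sqrt(k_ii * k_jj)
```

Each bag is a list of instance vectors, so two pages with different numbers of effective images can still be compared. After normalization a bag has similarity 1 with itself, which keeps large bags from dominating merely because they contain more instance pairs.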
Step 3b: combine $K_{NMI}(B_i, B_j)$ with a support vector machine to classify drug-related web pages. The support vector machine is a well-performing classifier with a very wide range of applications; its discriminant function is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, $K(\cdot)$ is some conventional kernel, and $b$ is the bias. Following the basic principle of the support vector machine, the values of $\alpha_i$ and $b$ are both obtained by training. Replacing $K(\cdot)$ with $K_{NMI}(\cdot)$ gives:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b$$
The support vector machine can thus be applied directly to web page classification: at classification time, if the output label of a bag is +1, the web page represented by that bag is a drug-related web page; otherwise it is a normal web page.
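Evaluating the bag-level discriminant function can be sketched as below. The support bags, weights and bias in the example are made-up values, not trained parameters, and the RBF width is again an assumption; training itself (finding the weights and bias) is left to an SVM solver that accepts a precomputed kernel matrix.

```python
import math

def nmi_kernel(bag_a, bag_b, p=1, width=0.5):
    """Normalized multi-instance kernel with an RBF base kernel (width is assumed)."""
    def mi(u, v):
        return sum(math.exp(-width * sum((s - t) ** 2 for s, t in zip(x, y))) ** p
                   for x in u for y in v)
    return mi(bag_a, bag_b) / math.sqrt(mi(bag_a, bag_a) * mi(bag_b, bag_b))

def decision_value(bag, support_bags, alphas, labels, bias, kernel=nmi_kernel):
    """f(B) = sum_i alpha_i * y_i * K_NMI(B_i, B) + b over the support bags."""
    return sum(a * y * kernel(b_i, bag)
               for a, y, b_i in zip(alphas, labels, support_bags)) + bias

def classify(bag, support_bags, alphas, labels, bias):
    """Return +1 (harmful page) when f(B) is positive, otherwise -1 (normal page)."""
    return 1 if decision_value(bag, support_bags, alphas, labels, bias) > 0 else -1

# Made-up parameters; in practice alpha_i and b come from SVM training.
support_bags = [[[0.0]], [[5.0]]]
alphas, labels, bias = [1.0, 1.0], [+1, -1], 0.0
print(classify([[0.2], [0.1]], support_bags, alphas, labels, bias))  # bag near the +1 support bag
print(classify([[4.9]], support_bags, alphas, labels, bias))         # bag near the -1 support bag
```

Because the kernel is computed between whole bags, the classifier never needs a fixed number of instances per page.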
As another aspect of the present invention, the present invention also provides a method for identifying harmful information in web pages based on multi-instance learning. Using the same principle as the classification method above, it identifies and marks web pages that contain harmful information. The specific steps are:
Step 1: extract the effective images in a web page, and extract the related text of each effective image;
Step 2: treat one effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3:
take the description generated in step 2 for each effective image as one instance in a bag and each web page as one bag; for bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$ generated in step 2, where each $x$ is the description of the corresponding instance, measure the similarity between bag $B_i$ and bag $B_j$ as follows:
$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb})$$
where $K_{MI}(\cdot)$ is the multi-instance kernel, $K(\cdot)$ is a conventional kernel, and $p$ is a positive integer;
combine the normalized multi-instance kernel $K_{NMI}(B_i, B_j)$ with a support vector machine to identify the harmful information in the selected web page, wherein the discriminant function of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are both obtained by training; and
replacing $K(\cdot)$ with $K_{NMI}(\cdot)$ gives:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b$$
As the above description of the technical solution shows, the method of the present invention makes full use of the effective information in web pages and achieves better identification and classification than methods that use single-modality information. Tests on a number of web pages from real websites show that the method is accurate and fast, and achieves a good practical effect.
The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. A web page classification method, comprising the following steps:
Step 1: extract the effective images in a selected web page, and extract the related text of each effective image;
Step 2: treat one effective image and its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3: compute a multi-instance kernel on the resulting instances, and classify the selected web page according to the result, wherein step 3 specifically comprises the following steps:
Step 3a: compute the multi-instance kernel on the resulting instances, wherein step 3a specifically comprises:
taking the description generated in step 2 for each effective image as one instance in a bag and each web page as one bag; for bags $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ and $B_j = \{x_{j1}, \ldots, x_{jn_j}\}$ generated in step 2, where each $x$ is the description of the corresponding instance, measuring the similarity between bag $B_i$ and bag $B_j$ as follows:
$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb});$$
where $K_{MI}(\cdot)$ is the multi-instance kernel, $K(\cdot)$ is a conventional kernel, and $p$ is a positive integer;
normalizing the multi-instance kernel according to the following formula:
$$K_{NMI}(B_i, B_j) = \frac{K_{MI}(B_i, B_j)}{\sqrt{K_{MI}(B_i, B_i)\, K_{MI}(B_j, B_j)}},$$
where $K_{NMI}(\cdot)$ is the normalized multi-instance kernel;
Step 3b: combine the multi-instance kernel obtained in the previous step with a support vector machine to classify the selected web page, wherein step 3b further comprises:
combining $K_{NMI}(B_i, B_j)$ with a support vector machine to classify the selected web page, wherein the discriminant function of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b;$$
where $SV$ is the index set of the support vectors, $y_i$ is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are both obtained by training; $K(\cdot)$ is a conventional kernel; and
replacing $K(\cdot)$ with $K_{NMI}(\cdot)$ gives:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b.$$
2. The web page classification method according to claim 1, wherein in step 1 the effective images in the web page are extracted with the forward comparison of relative sizes sorting method, and
the related text of each effective image is extracted according to the web page tree structure;
wherein the forward comparison of relative sizes sorting method includes: first sorting the image sizes in descending order and computing a ratio matrix; then determining a candidate set of effective images using a threshold β; then analysing the candidate set more finely using a threshold γ to determine the effective images in the web page, wherein threshold β and threshold γ are two empirical thresholds.
3. The web page classification method according to claim 1, wherein step 2 comprises the following steps:
Step 2a: build a web page training set, extract RGB-SIFT features from the effective images in the training set, cluster them to generate a visual dictionary, and generate the feature vector of each effective image through the image bag-of-words model using hard coding and sum pooling;
Step 2b: using a text dictionary, generate the feature vector of the related text with the text bag-of-words model;
Step 2c: combine the feature vector of the effective image with the feature vector of the related text as the description of the instance.
4. The web page classification method according to claim 3, wherein the step of clustering to generate the visual dictionary in step 2a uses K-means clustering and yields a visual dictionary containing 1500 visual words.
5. The web page classification method according to claim 3, wherein the text dictionary in step 2b includes 100 keywords that are representative of the required classification topic and 100 keywords that are completely unrelated to the required classification topic;
the step of generating the feature vector of the related text with the text bag-of-words model includes:
for the related text, generating its 100-dimensional feature vector by counting keyword occurrences according to the text dictionary;
the step of combining the feature vector of the effective image with the feature vector of the related text in step 2c includes:
directly concatenating the 1500-dimensional feature vector of the effective image with the 100-dimensional feature vector of the related text to obtain a 1600-dimensional feature vector; and
if a web page has no effective image, combining a 1500-dimensional zero vector with the feature vector of the related text.
6. a kind of webpage harmful information recognition methods, comprises the following steps:
Step 1:The effective image in a selected webpage is extracted, and extracts the related text of the effective image;
Step 2:Using a width effective image and its related text as an example in webpage bag, generate the effective image and The description of its related text, and the two is combined to the description as example;
Step 3:Make the example of the width effective image generated in step 2 as an example in a bag, a webpage Wrapped for one, for the bag generated in step 2And bagWherein x is corresponding example Statement, bag B is measured in the following wayiWith bag BjBetween similitude:
$$K_{MI}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^p(x_{ia}, x_{jb});$$
where $K_{MI}(\cdot,\cdot)$ is the multi-instance kernel, $K(\cdot,\cdot)$ is a conventional kernel, and $p$ is a positive integer;
The multi-instance kernel is normalized according to the following formula:
$$K_{NMI}(B_i, B_j) = \frac{K_{MI}(B_i, B_j)}{\sqrt{K_{MI}(B_i, B_i)\, K_{MI}(B_j, B_j)}},$$
where $K_{NMI}(\cdot,\cdot)$ is the normalized multi-instance kernel;
Combining $K_{NMI}(B_i, B_j)$ with a support vector machine to identify the harmful information in the selected webpage, wherein the discriminant function of the support vector machine is as follows:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b;$$
where $SV$ is the index set of the support vectors, $y_i$ is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are both obtained by training; and
After replacing $K(\cdot,\cdot)$ with $K_{NMI}(\cdot,\cdot)$, the following is obtained:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{NMI}(B_i, B) + b.$$
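The kernel and discriminant of claim 6 can be sketched in a few lines. This is only an illustration under stated assumptions: the claim leaves the conventional kernel $K$ unspecified, so a Gaussian (RBF) kernel with a hypothetical width `gamma` is used here, and the function names, toy bags, weights and bias are invented for the example rather than taken from a trained model.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Assumed base kernel K: Gaussian RBF (the claim does not fix K)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def mi_kernel(bag_i, bag_j, p=1):
    """K_MI: sum of K^p over all pairs of instances from the two bags."""
    return sum(rbf_kernel(x, z) ** p for x in bag_i for z in bag_j)


def normalized_mi_kernel(bag_i, bag_j, p=1):
    """K_NMI: K_MI divided by the geometric mean of the self-similarities."""
    denom = math.sqrt(mi_kernel(bag_i, bag_i, p) * mi_kernel(bag_j, bag_j, p))
    return mi_kernel(bag_i, bag_j, p) / denom


def decision_value(bag, support_bags, alphas, labels, bias, p=1):
    """f(B) = sum_i alpha_i * y_i * K_NMI(B_i, B) + b over the support bags."""
    return sum(
        alpha * y * normalized_mi_kernel(sv_bag, bag, p)
        for sv_bag, alpha, y in zip(support_bags, alphas, labels)
    ) + bias


# Toy bags: each bag is a webpage, each inner list an instance description.
bag_a = [[0.0, 0.0], [1.0, 0.0]]
bag_b = [[0.0, 1.0]]
score = decision_value(bag_a, [bag_a, bag_b], [1.0, 0.5], [1, -1], 0.0)
```

With an RBF base kernel every bag has a strictly positive self-similarity, so the normalization never divides by zero; a different base kernel might need an explicit guard. The sign of `score` would give the predicted class of the webpage.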
CN201410609728.4A 2014-11-03 2014-11-03 A kind of harmful information identification and Web page classification method based on multi-instance learning Active CN104361059B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410609728.4A CN104361059B (en) 2014-11-03 2014-11-03 A kind of harmful information identification and Web page classification method based on multi-instance learning

Publications (2)

Publication Number Publication Date
CN104361059A CN104361059A (en) 2015-02-18
CN104361059B true CN104361059B (en) 2018-03-27

Family

ID=52528320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410609728.4A Active CN104361059B (en) 2014-11-03 2014-11-03 A kind of harmful information identification and Web page classification method based on multi-instance learning

Country Status (1)

Country Link
CN (1) CN104361059B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021442B (en) * 2016-05-16 2019-10-01 江苏大学 A kind of Internet news summary extracting method
CN106055705B (en) * 2016-06-21 2019-07-05 广东工业大学 Web page classification method based on maximum spacing multitask multi-instance learning
CN106250924B (en) * 2016-07-27 2019-07-16 南京大学 A kind of newly-increased category detection method based on multi-instance learning
CN109241379A (en) * 2017-07-11 2019-01-18 北京交通大学 A method of across Modal detection network navy
CN107480289B (en) * 2017-08-24 2020-06-30 成都澳海川科技有限公司 User attribute acquisition method and device
CN111259237B (en) * 2020-01-13 2021-02-09 中国搜索信息科技股份有限公司 Method for identifying public harmful information
CN113254636A (en) * 2021-04-27 2021-08-13 上海大学 Remote supervision entity relationship classification method based on example weight dispersion
CN116992035B (en) * 2023-09-27 2023-12-08 湖南正宇软件技术开发有限公司 Intelligent classification method, device, computer equipment and medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101281521A (en) * 2007-04-05 2008-10-08 中国科学院自动化研究所 Method and system for filtering sensitive web page based on multiple classifier amalgamation
CN103218608A (en) * 2013-04-19 2013-07-24 中国科学院自动化研究所 Network violent video identification method
CN103605794A (en) * 2013-12-05 2014-02-26 国家计算机网络与信息安全管理中心 Website classifying method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN102831129B (en) * 2011-06-16 2015-03-04 富士通株式会社 Retrieval method and system based on multi-instance learning

Non-Patent Citations (1)

Title
DRUG-TAKING INSTRUMENTS RECOGNITION;Ruiguang Hu等;《The First Asian Conference on Pattern Recognition》;20111128;90-94 *

Also Published As

Publication number Publication date
CN104361059A (en) 2015-02-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191204

Address after: 250101 2F, Hanyu Jingu new media building, high tech Zone, Jinan City, Shandong Province

Patentee after: Renmin Zhongke (Shandong) Intelligent Technology Co.,Ltd.

Address before: 100190 No. 95 Zhongguancun East Road, Beijing

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right

Effective date of registration: 20200311

Address after: Room 201, 2 / F, Hanyu Jingu new media building, no.7000, Jingshi Road, Jinan City, Shandong Province, 250000

Patentee after: Renmin Zhongke (Jinan) Intelligent Technology Co.,Ltd.

Address before: 250101 2F, Hanyu Jingu new media building, high tech Zone, Jinan City, Shandong Province

Patentee before: Renmin Zhongke (Shandong) Intelligent Technology Co.,Ltd.

CP03 Change of name, title or address

Address after: 100176 1401, 14th floor, building 8, No. 8 courtyard, No. 1 KEGU street, Beijing Economic and Technological Development Zone, Daxing District, Beijing (Yizhuang group, high-end industrial area, Beijing Pilot Free Trade Zone)

Patentee after: Renmin Zhongke (Beijing) Intelligent Technology Co.,Ltd.

Address before: Room 201, 2 / F, Hanyu Jingu new media building, no.7000, Jingshi Road, Jinan City, Shandong Province

Patentee before: Renmin Zhongke (Jinan) Intelligent Technology Co.,Ltd.