CN104361059A - Harmful information identification and web page classification method based on multi-instance learning - Google Patents

Publication number
CN104361059A
Authority
CN
China
Prior art keywords
webpage
effective image
bag
text
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410609728.4A
Other languages
Chinese (zh)
Other versions
CN104361059B (en)
Inventor
胡卫明
胡瑞光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin Zhongke Beijing Intelligent Technology Co ltd
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201410609728.4A priority Critical patent/CN104361059B/en
Publication of CN104361059A publication Critical patent/CN104361059A/en
Application granted granted Critical
Publication of CN104361059B publication Critical patent/CN104361059B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a web page classification method based on multi-instance learning. The method comprises: designing a relative-size-sorting forward comparison method to extract the effective images in a web page, and extracting the related text of each effective image according to the tree structure of the web page; treating each effective image together with its related text as one instance in the web page bag; generating the description of the effective image with an image bag-of-words model and the description of its related text with a text bag-of-words model, and combining the two as the description of the instance; and classifying the web page with a multi-instance kernel so as to identify harmful web pages. Because the images contained in a web page and their related text serve as the instances in the web page bag, the algorithm better matches the actual distribution of web page content, makes full use of the effective information in the web page, exploits the complementarity between image information and text information, and ultimately achieves better results than classification using single-modality information.

Description

A harmful information identification and web page classification method based on multi-instance learning
Technical field
The present invention relates to the field of network content security, and more specifically to a harmful information identification and web page classification method based on multi-instance learning.
Background
While the internet promotes social progress and development, it also makes the spread of all kinds of harmful information much easier. Such information increasingly endangers normal social activity and healthy values, and is particularly damaging to the healthy development of young people. Maximizing the positive role of the internet while suppressing or eliminating its negative effects helps to clean up the online environment, promote social progress, and protect the healthy development of young people. Harmful information on the internet includes pornography, drugs, violence, terrorism, reactionary content and the like; among these, drug-related information is even more damaging than the other kinds.
On the internet, web pages exist as HyperText Markup Language (HTML) files, and an HTML file is essentially text. Common web page classification methods therefore rely mainly on text information, the most important of them being the bag-of-words model. The bag-of-words model works as follows: first, a number of keywords are selected to form a text dictionary; then the frequency of each keyword in a document or web page is counted to form a vector; finally, a suitable classifier is applied to this vector.
With the spread of digital devices, web pages contain more and more images and less and less text, so classifying web pages using text information alone no longer matches the actual form of web pages. It is therefore necessary to use image information and text information together to improve web page classification performance in practice.
As an example, Fig. 1 shows two drug-related web pages: the left one sells drug-use paraphernalia and the right one sells cannabis. Both pages contain a large number of images and a small amount of text, and the images and text are arranged in a regular layout. In this situation, text information alone cannot classify the pages well. In addition, there are currently very few patents or publications on processing drug-related information on the internet, so a method for identifying harmful information such as drug-related content is urgently needed, to help governments strengthen internet supervision and protect people from the temptation of such information.
Summary of the invention
In view of this, the object of the present invention is to propose a web page classification method and a harmful information identification method that match the actual distribution of image and text quantities in web pages, and to solve the technical problem of identifying and automatically classifying harmful information in web pages.
To achieve the above object, as one aspect of the present invention, a web page classification method is proposed, comprising the following steps:
Step 1: extract the effective images in a selected web page, and extract the related text of each effective image;
Step 2: treat each effective image together with its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3: compute a multi-instance kernel on the resulting instances, and classify the selected web page according to the result of the computation.
In step 1, the effective images in the web page are extracted using the relative-size-sorting forward comparison method, and
the related text of each effective image is extracted according to the tree structure of the web page.
Step 2 comprises the following steps:
Step 2a: build a training set of web pages, extract RGB-SIFT features from the effective images in the training set, cluster them to generate a visual dictionary, and generate the feature vector of each effective image with an image bag-of-words model using hard coding combined with sum pooling;
Step 2b: using a text dictionary, generate the feature vector of the related text with a text bag-of-words model;
Step 2c: combine the feature vector of the effective image and the feature vector of the related text as the description of the instance.
The clustering in step 2a that generates the visual dictionary uses the K-means clustering method and yields a visual dictionary of 1500 visual words.
The text dictionary in step 2b comprises 100 keywords representative of the required classification topic and 100 keywords completely unrelated to the required classification topic;
The step of generating the feature vector of the related text with the text bag-of-words model comprises:
for the related text, generating its 100-dimensional feature vector by counting against the text dictionary;
The step in step 2c of combining the feature vector of the effective image and the feature vector of the related text comprises:
directly concatenating the 1500-dimensional feature vector of the effective image and the 100-dimensional feature vector of the related text to obtain a 1600-dimensional feature vector; and
if a web page has no effective image, combining a 1500-dimensional zero vector with the feature vector of the related text.
Step 3 comprises:
Step 3a: compute the multi-instance kernel on the resulting instances;
Step 3b: combine the multi-instance kernel obtained in the preceding step with a support vector machine to classify the selected web page.
Step 3a comprises:
Taking each effective-image instance generated in step 2 as an instance in a bag and each web page as a bag, let $B_i = \{x_{i1}, \dots, x_{in_i}\}$ and $B_j = \{x_{j1}, \dots, x_{jn_j}\}$ be two bags generated in step 2, where each $x$ is the description of the corresponding instance. The similarity between bag $B_i$ and bag $B_j$ is measured as follows:
$$K_{\mathrm{MI}}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})$$
where $K_{\mathrm{MI}}(\cdot,\cdot)$ is the multi-instance kernel, $K(\cdot,\cdot)$ is a conventional kernel, and $p$ is a positive integer.
Step 3a further comprises the following step:
The multi-instance kernel is normalized according to the following formula:
$$K_{\mathrm{NMI}}(B_i, B_j) = \frac{K_{\mathrm{MI}}(B_i, B_j)}{\sqrt{K_{\mathrm{MI}}(B_i, B_i)\, K_{\mathrm{MI}}(B_j, B_j)}},$$
where $K_{\mathrm{NMI}}(\cdot,\cdot)$ is the normalized multi-instance kernel.
Step 3b further comprises:
Combining $K_{\mathrm{NMI}}(B_i, B_j)$ with a support vector machine to classify the selected web page, where the discriminant of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are obtained by training; $K(\cdot,\cdot)$ is a conventional kernel; and
after replacing $K(\cdot,\cdot)$ with $K_{\mathrm{NMI}}(\cdot,\cdot)$, the discriminant becomes:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{\mathrm{NMI}}(B_i, B) + b$$
As another aspect of the present invention, a method for identifying harmful information in web pages is proposed, comprising the following steps:
Step 1: extract the effective images in a web page, and extract the related text of each effective image;
Step 2: treat each effective image together with its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3:
Taking each effective-image instance generated in step 2 as an instance in a bag and each web page as a bag, let $B_i = \{x_{i1}, \dots, x_{in_i}\}$ and $B_j = \{x_{j1}, \dots, x_{jn_j}\}$ be two bags generated in step 2, where each $x$ is the description of the corresponding instance. The similarity between bag $B_i$ and bag $B_j$ is measured as follows:
$$K_{\mathrm{MI}}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})$$
where $K_{\mathrm{MI}}(\cdot,\cdot)$ is the multi-instance kernel, $K(\cdot,\cdot)$ is a conventional kernel, and $p$ is a positive integer;
Combining the normalized multi-instance kernel $K_{\mathrm{NMI}}(B_i, B_j)$ with a support vector machine to identify the harmful information in the selected web page, where the discriminant of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are obtained by training; and
after replacing $K(\cdot,\cdot)$ with $K_{\mathrm{NMI}}(\cdot,\cdot)$, the discriminant becomes:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{\mathrm{NMI}}(B_i, B) + b$$
The web page classification method based on multi-instance learning proposed by the present invention treats the images contained in a web page and their related text as the instances in the web page bag. This makes the algorithm better match the actual distribution of web page content, makes full use of the effective information in the web page, exploits the complementarity of image information and text information, and ultimately achieves better results than classification using single-modality information alone.
Brief description of the drawings
Fig. 1 shows screenshots of two drug-related web pages used as examples;
Fig. 2 is a schematic of the Matlab-style pseudocode of the FOCARSS algorithm of the present invention;
Fig. 3 is a schematic of a screenshot of an effective image and its related text;
Fig. 4 is a flow chart of how the description of an instance is generated in the present invention;
Fig. 5 is the full keyword list of the text dictionary in a specific embodiment of the present invention.
Detailed description of the embodiments
To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The method of the present invention is not limited to particular hardware or a particular programming language and can be implemented in any language. As an example, the invention was implemented in Matlab on a computer with a 2.83 GHz central processing unit and 2 GB of memory.
The basic flow of the web page classification method based on multi-instance learning of the present invention is:
Step 1: perform effective information extraction: extract the effective images in the web page using the relative-size-sorting forward comparison method, and extract the related text of each effective image according to the tree structure of the web page;
Step 2: following the way effective images and related text are arranged in web pages, treat each effective image together with its related text as one instance in the web page bag, generate the descriptions of the effective image and of its related text with an image bag-of-words model and a text bag-of-words model respectively, and combine the two as the description of the instance;
Step 3: classify the web page using a multi-instance kernel.
Each step of the present invention is described in detail below with reference to the drawings, using drug-related web pages as the example.
Step 1 comprises the following steps:
Step 1a: extract the effective images in the web page using the relative-size-sorting forward comparison method (FOrward CompArison of Relative Sizes Sorting, FOCARSS). Matlab-style pseudocode of the FOCARSS algorithm is shown in Fig. 2. FOCARSS is an algorithm created by the present invention; it sorts images by relative size rather than absolute size. The algorithm first sorts the image sizes in descending order and computes a ratio matrix; it then uses a threshold β to determine the candidate set of effective images; finally it uses a threshold γ to refine the candidate set and determine the effective images in the web page. β and γ are both empirical thresholds; analysis of a large number of web pages shows that β = 0.5 and γ = 0.95 give satisfactory extraction results.
Step 1b: extract the related text of each effective image according to the tree structure of the web page. For the HTML file of a web page, tags are extracted and matched, and the corresponding tree structure is generated from the parent-child relationships between tags. For each effective image, its node in the tree structure is located by the image's name, and the text around it is gathered by a local traversal, with 200 words serving as the stopping condition of the traversal. The text surrounding the effective image and its tag text are combined as the related text of that image. Fig. 3 shows a screenshot of an effective image and a schematic of its related text.
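Step 1b can be illustrated with a minimal sketch. It is not the patented implementation: the tree builder, the widening strategy (climb to ancestors until enough words are found), the use of the `alt` and `title` attributes as "tag text", and all function names are assumptions made for illustration, using only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, attrs, parent):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []  # child Node objects and text strings, in document order

class TreeBuilder(HTMLParser):
    """Builds a simple tree from the parent-child nesting of tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("root", [], None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:      # void tags never receive children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current           # climb to the nearest open tag of this name
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

def find_image(node, name):
    """Depth-first search for the <img> whose src ends with the given name."""
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == "img" and (child.attrs.get("src") or "").endswith(name):
                return child
            found = find_image(child, name)
            if found is not None:
                return found
    return None

def subtree_words(node):
    words = []
    for child in node.children:
        words.extend(child.split() if isinstance(child, str) else subtree_words(child))
    return words

def related_text(html, image_name, max_words=200):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    img = find_image(builder.root, image_name)
    if img is None:
        return ""
    # Assumption: "tag text" means the alt/title attributes of the image tag.
    tag_words = " ".join(v for k, v in img.attrs.items()
                         if k in ("alt", "title") and v).split()
    # Local traversal: widen to ancestors until max_words surrounding words are found.
    node, around = img.parent, []
    while node is not None:
        around = subtree_words(node)
        if len(around) >= max_words:
            break
        node = node.parent
    return " ".join(tag_words + around[:max_words])
```

With a small `max_words`, the traversal stops at the image's immediate container; with the 200-word limit from the text, it keeps widening until the limit or the document root is reached.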
Step 2, shown in Fig. 4, comprises the following steps:
Step 2a: generate the feature vector of each effective image with the image bag-of-words model. A training set of web pages is built; in a preferred embodiment it contains 2243 web pages in total, taken evenly from a number of shopping websites and news websites. All 6219 effective images in the training web pages are used to generate the visual dictionary: RGB-SIFT descriptors are extracted from each effective image (dense sampling, with a sampling interval of 16), and K-means clustering is applied to all RGB-SIFT descriptors to obtain 1500 cluster centres. Each cluster centre is taken as one visual word, giving a visual dictionary of 1500 visual words. For every effective image (whether from a training page or a test page), the RGB-SIFT descriptors of the image are first extracted (dense sampling, with a sampling interval of 16), and its feature vector is generated against the visual dictionary using hard coding combined with sum pooling. Specifically, hard coding means that an RGB-SIFT descriptor responds only on the visual word nearest to it, with a response of 1, and has a response of 0 on all other visual words; sum pooling means that, after all RGB-SIFT descriptors of an effective image have been coded, the responses on each visual word are summed to give the final response on that word. Hard coding and sum pooling together yield the 1500-dimensional feature vector of an effective image. As a special case, if a web page has no effective image, a 1500-dimensional zero vector is used as the image feature vector of the page.
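The hard coding and sum pooling of step 2a amount to building a histogram of nearest visual words. The sketch below shows this on toy 2-dimensional data; the function names are hypothetical, and the RGB-SIFT extraction and K-means clustering that would produce the real descriptors and the 1500-word dictionary are assumed to have been done elsewhere.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode_image(descriptors, visual_words):
    """Hard coding + sum pooling: count, per visual word, the descriptors nearest to it."""
    histogram = [0] * len(visual_words)
    for d in descriptors:
        nearest = min(range(len(visual_words)),
                      key=lambda k: squared_distance(d, visual_words[k]))
        histogram[nearest] += 1  # response 1 on the nearest word, 0 on all others
    return histogram

# Toy dictionary with 3 "visual words" (the text uses 1500 cluster centres).
toy_words = [[0, 0], [10, 10], [0, 10]]
toy_descriptors = [[1, 0], [9, 9], [0, 1], [1, 9]]
print(encode_image(toy_descriptors, toy_words))  # -> [2, 1, 1]
```

An image with no descriptors maps to the all-zero histogram, which matches the zero-vector convention the text uses for pages without effective images.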
Step 2b: generate the feature vector of the related text of each effective image with the text bag-of-words model. From harmful-information web pages and non-harmful web pages, for example drug-related and non-drug web pages, 100 representative keywords are carefully selected to form the text dictionary, as shown in Fig. 5. The selection principle is that a keyword occurs many times in drug-related web pages and rarely, or even never, in non-drug web pages; this makes the text dictionary highly representative. For the related text of each effective image, its 100-dimensional feature vector is generated by counting against the text dictionary. As a special case, if a web page has no effective image, its body text is extracted and its feature vector is generated by counting against the same text dictionary.
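The counting in step 2b can be sketched as follows. The three-word dictionary is a placeholder (the actual 100 keywords are listed in Fig. 5, which is not reproduced here), and the tokenization by word characters is an assumption; text in languages without word spacing would need a proper word segmenter.

```python
import re

def text_feature(text, dictionary):
    """Counts how often each dictionary keyword occurs in the text."""
    counts = {}
    for token in re.findall(r"\w+", text.lower()):
        counts[token] = counts.get(token, 0) + 1
    return [counts.get(keyword.lower(), 0) for keyword in dictionary]

# Placeholder dictionary; the text uses 100 keywords, giving a 100-dimensional vector.
toy_dictionary = ["alpha", "beta", "gamma"]
print(text_feature("Alpha beta, alpha! delta", toy_dictionary))  # -> [2, 1, 0]
```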
Step 2c: for each instance in a web page, its 1500-dimensional image feature vector and 100-dimensional text feature vector are directly concatenated to obtain the 1600-dimensional feature vector of the instance. If a web page contains N (N > 0) instances, N feature vectors of 1600 dimensions are obtained. As a special case, if a web page has no effective image, a 1500-dimensional zero vector is combined with the feature vector of the body text, which also gives a 1600-dimensional feature vector. This vector is taken as the instance of the web page, and the page has only this one instance.
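A short sketch of how a page becomes a bag of 1600-dimensional instances, including the fallback for pages without effective images. The function and parameter names are illustrative, not taken from the patent.

```python
IMAGE_DIM, TEXT_DIM = 1500, 100

def page_to_bag(instances, body_text_vector=None):
    """instances: list of (image_vector, text_vector) pairs, one per effective image."""
    if instances:
        return [list(image) + list(text) for image, text in instances]
    # No effective image: a single instance made of a zero image part
    # followed by the feature vector of the page's body text.
    return [[0] * IMAGE_DIM + list(body_text_vector)]

bag = page_to_bag([([1] * IMAGE_DIM, [2] * TEXT_DIM), ([3] * IMAGE_DIM, [4] * TEXT_DIM)])
print(len(bag), len(bag[0]))  # -> 2 1600
```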
Step 3 takes the instances computed in step 2 as input, computes the multi-instance kernel and performs the final classification. It comprises the following steps:
Step 3a: compute the multi-instance kernel (Multi-Instance Kernel, MIK).
The multi-instance kernel measures the similarity between bags. Let $B_i = \{x_{i1}, \dots, x_{in_i}\}$ and $B_j = \{x_{j1}, \dots, x_{jn_j}\}$ be two bags, where each $x$ is the description of the corresponding instance. The MIK measures the similarity between bag $B_i$ and bag $B_j$ as follows:
$$K_{\mathrm{MI}}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})$$
where $K_{\mathrm{MI}}(\cdot,\cdot)$ is the multi-instance kernel, $K(\cdot,\cdot)$ is some conventional kernel, and $p$ is a positive integer. This method chooses the radial basis function (RBF) kernel as $K(\cdot,\cdot)$, because the RBF kernel is widely used and performs well, and because the $p$-th power of an RBF kernel is still an RBF kernel: writing the RBF kernel as $K(x, y) = \exp(-\sigma \lVert x - y \rVert^2)$, for example, gives $K^{p}(x, y) = \exp(-p\sigma \lVert x - y \rVert^2)$, so $p$ can be absorbed into the kernel width. As with kernel methods in general, the MIK also needs to be normalized:
$$K_{\mathrm{NMI}}(B_i, B_j) = \frac{K_{\mathrm{MI}}(B_i, B_j)}{\sqrt{K_{\mathrm{MI}}(B_i, B_i)\, K_{\mathrm{MI}}(B_j, B_j)}},$$
Treating a web page as a bag and the feature vectors of the effective images in the page as the instances in the bag, the above formulas can be applied directly.
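The multi-instance kernel and its normalization can be sketched in a few lines. This is an illustrative sketch, not the patent's Matlab code: the RBF parametrization `exp(-width * ||x - y||^2)`, the default `width` value, and the square-root form of the normalization (the usual kernel normalization) are assumptions.

```python
import math

def rbf(x, y, width=0.5):
    """RBF kernel exp(-width * ||x - y||^2); the width value is an arbitrary choice."""
    return math.exp(-width * sum((a - b) ** 2 for a, b in zip(x, y)))

def mi_kernel(bag_i, bag_j, width=0.5, p=1):
    """K_MI: sum of K^p over every pair of instances drawn from the two bags."""
    return sum(rbf(x, y, width) ** p for x in bag_i for y in bag_j)

def normalized_mi_kernel(bag_i, bag_j, width=0.5, p=1):
    """K_NMI: K_MI divided by the geometric mean of the two self-similarities."""
    k_ij = mi_kernel(bag_i, bag_j, width, p)
    k_ii = mi_kernel(bag_i, bag_i, width, p)
    k_jj = mi_kernel(bag_j, bag_j, width, p)
    return k_ij / math.sqrt(k_ii * k_jj)

# Toy bags with 2-dimensional instances (real instances are 1600-dimensional).
page_a = [[0.0, 0.0], [1.0, 0.0]]
page_b = [[0.0, 1.0]]
print(normalized_mi_kernel(page_a, page_a))  # self-similarity after normalization
print(normalized_mi_kernel(page_a, page_b))
```

After normalization every bag has self-similarity 1 regardless of how many instances it holds, so pages with many images do not dominate the kernel values simply because of their size.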
Step 3b: combine $K_{\mathrm{NMI}}(B_i, B_j)$ with a support vector machine to classify drug-related web pages. The support vector machine is a well-performing classifier with a wide range of applications; its discriminant is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, $K(\cdot,\cdot)$ is some conventional kernel, and $b$ is the bias; following the basic principle of the support vector machine, the values of $\alpha_i$ and $b$ are obtained by training. Replacing $K(\cdot,\cdot)$ with $K_{\mathrm{NMI}}(\cdot,\cdot)$ gives:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{\mathrm{NMI}}(B_i, B) + b$$
The support vector machine can thus be applied to web pages directly: at classification time, if the output label of a bag is +1, the web page it represents is a drug-related web page; otherwise it is a normal web page.
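The bag-level discriminant can be evaluated as in the sketch below. The support bags, weights and bias here are made-up toy values standing in for the output of SVM training, which is not shown; the kernel is a compact copy of the normalized multi-instance kernel with an assumed RBF width.

```python
import math

def nmi_kernel(bag_i, bag_j, width=0.5):
    """Normalized multi-instance kernel with an RBF base kernel (assumed width)."""
    def k_mi(u, v):
        return sum(math.exp(-width * sum((a - b) ** 2 for a, b in zip(x, y)))
                   for x in u for y in v)
    return k_mi(bag_i, bag_j) / math.sqrt(k_mi(bag_i, bag_i) * k_mi(bag_j, bag_j))

def classify_bag(bag, support_bags, alphas, labels, bias, kernel=nmi_kernel):
    """Evaluates f(B) = sum_i alpha_i * y_i * K(B_i, B) + b and returns (label, score)."""
    score = sum(a * y * kernel(b_i, bag)
                for a, y, b_i in zip(alphas, labels, support_bags)) + bias
    return (1 if score > 0 else -1), score

# Hypothetical trained model: one positive and one negative support bag.
support_bags = [[[0.0, 0.0]], [[3.0, 3.0]]]
alphas, labels, bias = [1.0, 1.0], [1, -1], 0.0

label, score = classify_bag([[0.1, 0.0]], support_bags, alphas, labels, bias)
print(label)  # -> 1 (the query bag lies next to the positive support bag)
```

In practice the weights and bias would come from training a support vector machine on the precomputed matrix of normalized kernel values between training bags.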
As another aspect of the present invention, a method for identifying harmful information in web pages based on multi-instance learning is also provided. Based on the same principle as the classification method above, it identifies and marks web pages that contain harmful information. The specific steps are:
Step 1: extract the effective images in a web page, and extract the related text of each effective image;
Step 2: treat each effective image together with its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3:
Taking each effective-image instance generated in step 2 as an instance in a bag and each web page as a bag, let $B_i = \{x_{i1}, \dots, x_{in_i}\}$ and $B_j = \{x_{j1}, \dots, x_{jn_j}\}$ be two bags generated in step 2, where each $x$ is the description of the corresponding instance. The similarity between bag $B_i$ and bag $B_j$ is measured as follows:
$$K_{\mathrm{MI}}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})$$
where $K_{\mathrm{MI}}(\cdot,\cdot)$ is the multi-instance kernel, $K(\cdot,\cdot)$ is a conventional kernel, and $p$ is a positive integer;
Combining the normalized multi-instance kernel $K_{\mathrm{NMI}}(B_i, B_j)$ with a support vector machine to identify the harmful information in the selected web page, where the discriminant of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are obtained by training; and
after replacing $K(\cdot,\cdot)$ with $K_{\mathrm{NMI}}(\cdot,\cdot)$, the discriminant becomes:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{\mathrm{NMI}}(B_i, B) + b.$$
As the above description of the technical solution shows, the method of the present invention makes full use of the effective information in web pages and achieves better identification and classification results than using single-modality information alone. Actual tests on a number of web pages from real websites show that the method has high accuracy and fast recognition speed and achieves good practical results.
The specific embodiments described above further explain the object, technical solution and beneficial effects of the present invention. It should be understood that the foregoing are only specific embodiments of the present invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A web page classification method, comprising the following steps:
Step 1: extract the effective images in a selected web page, and extract the related text of each effective image;
Step 2: treat each effective image together with its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3: compute a multi-instance kernel on the resulting instances, and classify the selected web page according to the result of the computation.
2. The web page classification method according to claim 1, wherein in step 1 the effective images in the web page are extracted using the relative-size-sorting forward comparison method, and
the related text of each effective image is extracted according to the tree structure of the web page.
3. The web page classification method according to claim 1, wherein step 2 comprises the following steps:
Step 2a: build a training set of web pages, extract RGB-SIFT features from the effective images in the training set, cluster them to generate a visual dictionary, and generate the feature vector of each effective image with an image bag-of-words model using hard coding combined with sum pooling;
Step 2b: using a text dictionary, generate the feature vector of the related text with a text bag-of-words model;
Step 2c: combine the feature vector of the effective image and the feature vector of the related text as the description of the instance.
4. The web page classification method according to claim 3, wherein the clustering in step 2a that generates the visual dictionary uses the K-means clustering method and yields a visual dictionary of 1500 visual words.
5. The web page classification method according to claim 3, wherein the text dictionary in step 2b comprises 100 keywords representative of the required classification topic and 100 keywords completely unrelated to the required classification topic;
the step of generating the feature vector of the related text with the text bag-of-words model comprises:
for the related text, generating its 100-dimensional feature vector by counting against the text dictionary;
the step in step 2c of combining the feature vector of the effective image and the feature vector of the related text comprises:
directly concatenating the 1500-dimensional feature vector of the effective image and the 100-dimensional feature vector of the related text to obtain a 1600-dimensional feature vector; and
if a web page has no effective image, combining a 1500-dimensional zero vector with the feature vector of the related text.
6. The web page classification method according to claim 1, wherein step 3 comprises:
Step 3a: compute the multi-instance kernel on the resulting instances;
Step 3b: combine the multi-instance kernel obtained in the preceding step with a support vector machine to classify the selected web page.
7. The web page classification method according to claim 6, wherein step 3a comprises:
Taking each effective-image instance generated in step 2 as an instance in a bag and each web page as a bag, let $B_i = \{x_{i1}, \dots, x_{in_i}\}$ and $B_j = \{x_{j1}, \dots, x_{jn_j}\}$ be two bags generated in step 2, where each $x$ is the description of the corresponding instance. The similarity between bag $B_i$ and bag $B_j$ is measured as follows:
$$K_{\mathrm{MI}}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})$$
where $K_{\mathrm{MI}}(\cdot,\cdot)$ is the multi-instance kernel, $K(\cdot,\cdot)$ is a conventional kernel, and $p$ is a positive integer.
8. The web page classification method according to claim 7, wherein step 3a further comprises the following step:
The multi-instance kernel is normalized according to the following formula:
$$K_{\mathrm{NMI}}(B_i, B_j) = \frac{K_{\mathrm{MI}}(B_i, B_j)}{\sqrt{K_{\mathrm{MI}}(B_i, B_i)\, K_{\mathrm{MI}}(B_j, B_j)}},$$
where $K_{\mathrm{NMI}}(\cdot,\cdot)$ is the normalized multi-instance kernel.
9. The web page classification method according to claim 6, wherein step 3b further comprises:
Combining $K_{\mathrm{NMI}}(B_i, B_j)$ with a support vector machine to classify the selected web page, where the discriminant of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are obtained by training; $K(\cdot,\cdot)$ is a conventional kernel; and
after replacing $K(\cdot,\cdot)$ with $K_{\mathrm{NMI}}(\cdot,\cdot)$, the discriminant becomes:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{\mathrm{NMI}}(B_i, B) + b$$
10. A method for identifying harmful information in web pages, comprising the following steps:
Step 1: extract the effective images in a web page, and extract the related text of each effective image;
Step 2: treat each effective image together with its related text as one instance in the web page bag, generate descriptions of the effective image and of its related text, and combine the two as the description of the instance;
Step 3:
Taking each effective-image instance generated in step 2 as an instance in a bag and each web page as a bag, let $B_i = \{x_{i1}, \dots, x_{in_i}\}$ and $B_j = \{x_{j1}, \dots, x_{jn_j}\}$ be two bags generated in step 2, where each $x$ is the description of the corresponding instance. The similarity between bag $B_i$ and bag $B_j$ is measured as follows:
$$K_{\mathrm{MI}}(B_i, B_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K^{p}(x_{ia}, x_{jb})$$
where $K_{\mathrm{MI}}(\cdot,\cdot)$ is the multi-instance kernel, $K(\cdot,\cdot)$ is a conventional kernel, and $p$ is a positive integer;
Combining the normalized multi-instance kernel $K_{\mathrm{NMI}}(B_i, B_j)$ with a support vector machine to identify the harmful information in the selected web page, where the discriminant of the support vector machine is:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
where $SV$ is the index set of the support vectors, $y_i$ (+1 or -1) is the class label of feature vector $x_i$, $\alpha_i$ is the corresponding weight, and $b$ is the bias; the values of $\alpha_i$ and $b$ are obtained by training; and
after replacing $K(\cdot,\cdot)$ with $K_{\mathrm{NMI}}(\cdot,\cdot)$, the discriminant becomes:
$$f(B) = \sum_{i \in SV} \alpha_i y_i K_{\mathrm{NMI}}(B_i, B) + b$$

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410609728.4A CN104361059B (en) 2014-11-03 2014-11-03 Harmful information identification and web page classification method based on multi-instance learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410609728.4A CN104361059B (en) 2014-11-03 2014-11-03 Harmful information identification and web page classification method based on multi-instance learning

Publications (2)

Publication Number Publication Date
CN104361059A CN104361059A (en) 2015-02-18
CN104361059B CN104361059B (en) 2018-03-27

Family

ID=52528320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410609728.4A Active CN104361059B (en) 2014-11-03 2014-11-03 Harmful information identification and web page classification method based on multi-instance learning

Country Status (1)

Country Link
CN (1) CN104361059B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281521A (en) * 2007-04-05 2008-10-08 中国科学院自动化研究所 Method and system for filtering sensitive web page based on multiple classifier amalgamation
JP2013004093A (en) * 2011-06-16 2013-01-07 Fujitsu Ltd Search method and system by multi-instance learning
CN103218608A (en) * 2013-04-19 2013-07-24 中国科学院自动化研究所 Network violent video identification method
CN103605794A (en) * 2013-12-05 2014-02-26 国家计算机网络与信息安全管理中心 Website classifying method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RUIGUANG HU等: "DRUG-TAKING INSTRUMENTS RECOGNITION", 《THE FIRST ASIAN CONFERENCE ON PATTERN RECOGNITION》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021442B (en) * 2016-05-16 2019-10-01 江苏大学 A kind of Internet news summary extracting method
CN106021442A (en) * 2016-05-16 2016-10-12 江苏大学 Network news outline extraction method
CN106055705A (en) * 2016-06-21 2016-10-26 广东工业大学 Web page classification method for multi-task and multi-example learning based on maximum distance
CN106055705B (en) * 2016-06-21 2019-07-05 广东工业大学 Web page classification method based on maximum spacing multitask multi-instance learning
CN106250924A (en) * 2016-07-27 2016-12-21 南京大学 A kind of newly-increased category detection method based on multi-instance learning
CN106250924B (en) * 2016-07-27 2019-07-16 南京大学 A kind of newly-increased category detection method based on multi-instance learning
CN109241379A (en) * 2017-07-11 2019-01-18 北京交通大学 A method of across Modal detection network navy
CN107480289A (en) * 2017-08-24 2017-12-15 成都澳海川科技有限公司 User property acquisition methods and device
CN107480289B (en) * 2017-08-24 2020-06-30 成都澳海川科技有限公司 User attribute acquisition method and device
CN111259237A (en) * 2020-01-13 2020-06-09 中国搜索信息科技股份有限公司 Method for identifying public harmful information
CN113254636A (en) * 2021-04-27 2021-08-13 上海大学 Remote supervision entity relationship classification method based on example weight dispersion
CN116992035A (en) * 2023-09-27 2023-11-03 湖南正宇软件技术开发有限公司 Intelligent classification method, device, computer equipment and medium
CN116992035B (en) * 2023-09-27 2023-12-08 湖南正宇软件技术开发有限公司 Intelligent classification method, device, computer equipment and medium

Also Published As

Publication number Publication date
CN104361059B (en) 2018-03-27

Similar Documents

Publication Publication Date Title
CN104361059A (en) Harmful information identification and web page classification method based on multi-instance learning
CN103218444B (en) Based on semantic method of Tibetan language webpage text classification
CN101430695B (en) System and method for computing difference affinities of word
CN107133213A (en) A kind of text snippet extraction method and system based on algorithm
US20070294223A1 (en) Text Categorization Using External Knowledge
CN104951548A (en) Method and system for calculating negative public opinion index
CN107992542A (en) A kind of similar article based on topic model recommends method
CN103617157A (en) Text similarity calculation method based on semantics
CN107291723A (en) The method and apparatus of web page text classification, the method and apparatus of web page text identification
CN103559199B (en) Method for abstracting web page information and device
CN105653668A (en) Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN104615593A (en) Method and device for automatic detection of microblog hot topics
CN103246644B (en) Method and device for processing Internet public opinion information
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
EP2041669A2 (en) Text categorization using external knowledge
CN102945244A (en) Chinese web page repeated document detection and filtration method based on full stop characteristic word string
CN104239485A (en) Statistical machine learning-based internet hidden link detection method
CN106126502A (en) A kind of emotional semantic classification system and method based on support vector machine
Chen et al. Learning to predict charges for judgment with legal graph
CN103530316A (en) Science subject extraction method based on multi-view learning
CN104537280B (en) Protein interactive relation recognition methods based on text relation similitude
Hassan et al. Automatic document topic identification using wikipedia hierarchical ontology
CN103699568B (en) A kind of from Wiki, extract the method for hyponymy between field term
Croce et al. Semantic convolution kernels over dependency trees: smoothed partial tree kernel
de Silva SAFS3 algorithm: Frequency statistic and semantic similarity based semantic classification use case

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191204

Address after: 250101 2F, Hanyu Jingu new media building, high tech Zone, Jinan City, Shandong Province

Patentee after: Renmin Zhongke (Shandong) Intelligent Technology Co.,Ltd.

Address before: 100190 No. 95 Zhongguancun East Road, Beijing

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right

Effective date of registration: 20200311

Address after: Room 201, 2 / F, Hanyu Jingu new media building, no.7000, Jingshi Road, Jinan City, Shandong Province, 250000

Patentee after: Renmin Zhongke (Jinan) Intelligent Technology Co.,Ltd.

Address before: 250101 2F, Hanyu Jingu new media building, high tech Zone, Jinan City, Shandong Province

Patentee before: Renmin Zhongke (Shandong) Intelligent Technology Co.,Ltd.

CP03 Change of name, title or address

Address after: 100176 1401, 14th floor, building 8, No. 8 courtyard, No. 1 KEGU street, Beijing Economic and Technological Development Zone, Daxing District, Beijing (Yizhuang group, high-end industrial area, Beijing Pilot Free Trade Zone)

Patentee after: Renmin Zhongke (Beijing) Intelligent Technology Co.,Ltd.

Address before: Room 201, 2 / F, Hangu Jinggu new media building, 7000 Jingshi Road, Jinan City, Shandong Province

Patentee before: Renmin Zhongke (Jinan) Intelligent Technology Co.,Ltd.