CN106126512A - Ensemble-learning web page classification method and device

Info

Publication number: CN106126512A
Application number: CN201610227852.3A
Authority: CN
Inventors: 任艳萍; 潘季明; 崔雨雷; 孟庆飞
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2016-04-13
Filing date: 2016-04-13
Publication date: 2016-11-16

Abstract

The present invention provides an ensemble-learning web page classification method and device. Web page Uniform Resource Locators (URLs) are input, and the input URLs are deduplicated and checked for validity to obtain a web page URL set. A distributed crawler crawls the web page content corresponding to the URL set, and the crawled content is preprocessed to generate a raw corpus. The raw corpus is segmented into words to obtain the corpus to be classified. At least two classification algorithm models are created and run in parallel on the vectorized document of the corpus to be classified to predict the category of each web page URL; the URLs whose category predictions agree are grouped into one category and stored by category in the URL classification library. Because web page URLs are classified by at least two classification algorithms and only fully consistent results are admitted to the library, the accuracy of the web page URL classification library is improved.

Description

An ensemble-learning web page classification method and device

Technical field

The present invention relates to the technical field of network information, and in particular to an ensemble-learning web page classification method and device.

Background art

With the development of the network information industry, the Internet and the security of network information have become hot topics. Vulnerabilities in network systems, defects in the associated software and hardware, and weaknesses in system administration introduce many security risks, and many serious network security problems have occurred. The openness and sharing inherent to the Internet also pose a severe challenge to information security.

To prevent important information from leaking, behavior control must be applied to outbound access, and websites of certain categories must be blocked according to policy requirements. The identification of website or web page content, and the related techniques, have therefore become the core technology of this field.

At present, web page classification is mostly performed offline. Offline classification first obtains a large number of web page source files with a web crawler, performs information extraction and denoising on the source files, and then performs Chinese word segmentation. Based on the segmentation results, machine learning methods such as SVM (Support Vector Machine) and Bayes are used to classify the web pages, and the web pages and their categories are stored in a database. In actual use, a web page is matched against the records in the database that stores web pages and their categories to obtain the category of the web page.

Commonly used web page information extraction methods include, for example, methods based on the document tree. Word segmentation methods include string-matching segmentation, understanding-based segmentation, and statistical segmentation. There are two main kinds of web page type identification methods: the first is based on manual rules and policies and mainly uses domain expert knowledge to organize and classify pages; the second is text classification, such as naive Bayes and SVM.

The existing web page information extraction methods described above suffer from low extraction accuracy. The related segmentation techniques suffer from inaccurate segmentation, which in turn lowers web page classification accuracy. Classification based on manual rules and policies scales poorly and consumes a great deal of manpower and time. Web page classification based on text classification requires less manual intervention and can guarantee a certain coverage and accuracy, but it is computationally intensive and time-consuming, so it is difficult to meet the needs of systems with high real-time requirements. In addition, both kinds of classification methods are limited in coverage and computational cost.

Summary of the invention

The technical problem to be solved by the present invention is to provide an ensemble-learning web page classification method and device that overcome the low accuracy and low computational efficiency of web page URL classification in the prior art.

The technical solution adopted by the present invention is an ensemble-learning web page classification method, comprising:

Step 1: input web page Uniform Resource Locators (URLs), and deduplicate the input web page URLs and check their validity to obtain a web page URL set;

Step 2: crawl the web page content corresponding to the web page URL set with a distributed crawler, and preprocess the crawled web page content to generate a raw corpus;

Step 3: perform word segmentation on the raw corpus to obtain the corpus to be classified;

Step 4: use at least two classification algorithm models in parallel to predict web page URL categories from the vectorized document of the corpus to be classified, group the web page URLs whose category predictions are fully consistent into one category, and store them by category in the URL classification library.

Further, in step 1, inputting the web page URLs specifically comprises:

inputting the web page URLs by external input or internal extraction;

wherein external input means that the user enters the web page URLs manually or imports them in text form;

and internal extraction means extracting web page URLs that meet set conditions from a database.

Further, in step 1, the validity check specifically comprises:

judging according to the last-modification-time field and the validity-period field of each deduplicated web page URL: if the time elapsed since the last modification exceeds the validity period, the web page URL that has exceeded its validity period is crawled again by the crawler.
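As an illustration of the deduplication and validity check described above, the following Python sketch may help. It is not taken from the patent: the function names, the `records` mapping, and the choice to treat URLs without a stored record as needing a crawl are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def dedupe(urls):
    """Remove duplicate URLs while keeping first-seen order."""
    return list(dict.fromkeys(u.strip() for u in urls if u.strip()))

def needs_recrawl(last_modified, validity_days, now):
    """True when the time since the last modification exceeds the validity period."""
    return now - last_modified > timedelta(days=validity_days)

def build_url_set(urls, records, now):
    """Return (url_set, stale_urls).

    `records` (hypothetical) maps url -> (last_modified, validity_days).
    URLs with no record are assumed never crawled and are treated as stale.
    """
    url_set = dedupe(urls)
    stale = [u for u in url_set
             if u not in records or needs_recrawl(*records[u], now)]
    return url_set, stale
```

The `stale` list would then be handed to the crawler, and the refreshed entries would replace the expired ones in the URL set.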

Further, in step 2, preprocessing the crawled web page content to generate the raw corpus specifically comprises:

removing web page tags and garbled characters from the crawled web pages, filtering out web page content that contains foreign languages, and generating a raw corpus in a preset format.
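A minimal sketch of this preprocessing step, using only the Python standard library, is given below. The patent does not specify how foreign-language pages are detected or what the preset format is; the CJK-ratio threshold `min_cjk_ratio`, the treatment of U+FFFD as garbled text, and the function name `to_raw_corpus` are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

CJK = re.compile(r"[\u4e00-\u9fff]")

def to_raw_corpus(html, min_cjk_ratio=0.5):
    """Strip tags and decoding debris; return None for mostly non-Chinese pages."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = " ".join(parser.parts)
    text = text.replace("\ufffd", "")          # assumed marker of garbled characters
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    ratio = sum(1 for ch in letters if CJK.match(ch)) / len(letters)
    return text if ratio >= min_cjk_ratio else None
```

Pages for which the function returns `None` would simply be left out of the raw corpus set.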

Further, in step 2, crawling the web page content corresponding to the web page URL set with a distributed crawler specifically comprises:

using a web crawler framework based on a Hadoop distributed cluster to crawl, in batches and with a breadth-first search algorithm, the web page content corresponding to the web page URL set from the Internet;

wherein the web page content is the content corresponding to the web page URLs in the set whose URL/domain name level is level five or below.

Further, step 3 specifically comprises:

performing word segmentation on the raw corpus with an open-source word segmentation system that has an extensible segmentation dictionary, to obtain the corpus to be classified.

Further, step 4 specifically comprises:

initializing the corpus to be classified, and loading into memory the feature file and the to-be-classified corpus set composed of the corpus to be classified;

generating, according to the feature words in the feature file, an N-dimensional column vector for each item in the to-be-classified corpus set, and storing the N-dimensional column vectors in the vectorized document of the corpus to be classified;

selecting at least two classification algorithm models and using them in parallel to predict web page URL categories from the vectorized document of the corpus to be classified;

comparing and matching the web page URL category predictions of the selected classification algorithm models;

if the web page URL category predictions of all the selected classification algorithm models are consistent, grouping the web page URLs with consistent classification results into one category and storing them by category in the web page URL classification library.
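The compare-and-match rule in the steps above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: each model is represented as a plain callable from a vector to a category label, the name `classify_by_consensus` is hypothetical, and a thread pool stands in for whatever parallel execution the real system uses (for CPU-bound models a process pool would be a more natural fit).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def classify_by_consensus(models, vectors):
    """Run every model over every vector and keep only unanimous results.

    models:  list of callables, vector -> category label (at least two)
    vectors: dict mapping url -> N-dimensional vector
    Returns (library, rejected): {category: [urls]} and the URLs on which
    the models disagreed.
    """
    if len(models) < 2:
        raise ValueError("at least two classification models are required")
    urls = list(vectors)

    def run(model):
        return [model(vectors[u]) for u in urls]

    # One worker per model, so the models predict side by side.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        predictions = list(pool.map(run, models))

    library, rejected = defaultdict(list), []
    for i, url in enumerate(urls):
        labels = {pred[i] for pred in predictions}
        if len(labels) == 1:            # every model agrees
            library[labels.pop()].append(url)
        else:
            rejected.append(url)
    return dict(library), rejected
```

Requiring unanimity trades coverage for precision: URLs on which the models disagree are not stored, which is consistent with the stated goal of improving the accuracy of the classification library.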

Further, in step 4, the at least two classification algorithm models are chosen arbitrarily from the following classification algorithm models: the Bayes classification algorithm model, the support vector machine (SVM) classification algorithm model, the maximum entropy classification algorithm model, the k-nearest-neighbor (KNN) classification algorithm model, and the neural network classification algorithm model.

Further, the process of obtaining the Bayes classification algorithm model, the SVM classification algorithm model, the maximum entropy classification algorithm model, the KNN classification algorithm model, or the neural network classification algorithm model comprises:

initializing the training corpus, and loading the training corpus set composed of the training corpus into memory;

extracting feature words from the training corpus with the expected cross entropy algorithm, and assigning feature word weights to the extracted training corpus feature words with the same algorithm; specifying the feature word dimension of the training corpus by a preset parameter N, sorting the training corpus feature words in descending order of feature word weight, and saving the top N training corpus feature words to the feature file;

wherein N is an integer not less than the preset number of web page URL categories;
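The patent names the expected cross entropy algorithm but does not give its formula. A commonly used form for text feature selection is ECE(t) = P(t) · Σ_c P(c|t) · log(P(c|t) / P(c)), and the sketch below assumes that form with document-frequency estimates of the probabilities. The function names and the input layout (a list of `(category, words)` pairs) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def expected_cross_entropy(docs):
    """docs: list of (category, list_of_words). Returns {word: ECE weight}."""
    n_docs = len(docs)
    class_count = Counter(cat for cat, _ in docs)
    doc_freq = Counter()                      # documents containing the word
    doc_freq_by_class = defaultdict(Counter)  # same, split by category
    for cat, words in docs:
        for w in set(words):
            doc_freq[w] += 1
            doc_freq_by_class[w][cat] += 1

    weights = {}
    for w, df in doc_freq.items():
        p_t = df / n_docs
        total = 0.0
        for cat, n_ct in doc_freq_by_class[w].items():
            p_c_given_t = n_ct / df
            p_c = class_count[cat] / n_docs
            total += p_c_given_t * math.log(p_c_given_t / p_c)
        weights[w] = p_t * total
    return weights

def select_features(docs, n):
    """Return the top-n (word, weight) pairs in descending weight order."""
    n_categories = len({cat for cat, _ in docs})
    if n < n_categories:
        raise ValueError("N must not be less than the number of categories")
    weights = expected_cross_entropy(docs)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]
```

A word that appears equally often in every category (such as a generic term found on all pages) gets a weight of zero under this formula, so it is ranked below words that are concentrated in one category.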

generating, according to the feature words in the feature file, an N-dimensional column vector for each training corpus item in the training corpus set, and storing the N-dimensional column vectors in the vectorized document of the training corpus;
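The vectorization step can be illustrated with a short sketch. The patent states only that each corpus item becomes an N-dimensional vector built from the feature words; using raw term frequency as the component value is an assumption here, as is the function name `vectorize`.

```python
from collections import Counter

def vectorize(words, feature_words):
    """Map one segmented document to an N-dimensional term-frequency vector.

    `feature_words` is the ordered list of N feature words read from the
    feature file; component j counts occurrences of feature word j.
    """
    counts = Counter(words)
    return [counts[f] for f in feature_words]
```

The same function would be applied to training corpus items and to items awaiting classification, so that both share one feature space; each resulting vector could be written as one line of the vectorized document.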

taking the vectorized document of the training corpus as input, performing training and parameter selection with the Bayes classification algorithm, the SVM classification algorithm, the maximum entropy classification algorithm, the KNN classification algorithm, or the neural network classification algorithm respectively, and generating the corresponding Bayes classification algorithm model, SVM classification algorithm model, maximum entropy classification algorithm model, KNN classification algorithm model, or neural network classification algorithm model.

Further, in step 4, the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model are used in parallel to predict web page URL categories from the vectorized document of the corpus to be classified, which specifically comprises:

comparing and matching the web page URL category prediction of the Bayes classification algorithm model, that of the SVM classification algorithm model, and that of the maximum entropy classification algorithm model;

if the web page URL category predictions of the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model are consistent, grouping the web page URLs with consistent classification results into one category and storing them by category in the web page URL classification library.

Further, after step 4, the method also comprises:

publishing the web page URL classification library, and generating a web page URL classification report;

wherein the web page URL classification report includes the web page URL categories and, for each category, the number and proportion of web page URLs.
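The contents of such a report can be sketched in a few lines. The layout of the classification library as a `{category: [urls]}` dictionary and the field names `count` and `share` are assumptions for illustration only.

```python
def classification_report(library):
    """Summarize a {category: [urls]} library as per-category count and share."""
    total = sum(len(urls) for urls in library.values())
    return {
        category: {
            "count": len(urls),
            "share": len(urls) / total if total else 0.0,
        }
        for category, urls in library.items()
    }
```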

The present invention also provides an ensemble-learning web page classification device, comprising:

an input module, configured to input web page URLs, and to deduplicate the input web page URLs and check their validity to obtain a web page URL set;

a crawler module, configured to crawl the web page content corresponding to the web page URL set with a distributed crawler, and to preprocess the crawled web page content to generate a raw corpus;

a word segmentation module, configured to perform word segmentation on the raw corpus to obtain the corpus to be classified;

a classification module, configured to use at least two classification algorithm models in parallel to predict web page URL categories from the vectorized document of the corpus to be classified, to group the web page URLs whose category predictions are fully consistent into one category, and to store them by category in the URL classification library.

Further, the input module is specifically configured to:

input the web page URLs by external input or internal extraction;

Further, the input module is specifically configured to:

Further, the crawler module is specifically configured to:

Further, the word segmentation module is specifically configured to:

Further, the classification module is specifically configured to:

Further, the classification module is also configured to:

choose the at least two classification algorithm models from the following classification algorithm models: the Bayes classification algorithm model, the SVM classification algorithm model, the maximum entropy classification algorithm model, the KNN classification algorithm model, and the neural network classification algorithm model.

Further, the classification module is also configured to:

obtain the Bayes classification algorithm model, the SVM classification algorithm model, the maximum entropy classification algorithm model, the KNN classification algorithm model, or the neural network classification algorithm model as follows:

taking the vectorized document of the training corpus as input, performing training and parameter selection with the Bayes classification algorithm, the support vector machine (SVM) classification algorithm, the maximum entropy classification algorithm, the k-nearest-neighbor (KNN) classification algorithm, or the neural network classification algorithm respectively, and generating the corresponding Bayes classification algorithm model, SVM classification algorithm model, maximum entropy classification algorithm model, KNN classification algorithm model, or neural network classification algorithm model.

Further, the classification module is specifically configured to:

use the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model in parallel to predict web page URL categories from the vectorized document of the corpus to be classified;

Further, the device also comprises:

a publishing module, configured to publish the web page URL classification library and to generate a web page URL classification report;

With the above technical solution, the present invention has at least the following advantages:

The ensemble-learning web page classification method and device of the present invention are based on a distributed crawler framework, which effectively improves the efficiency of crawling the whole web; no manual participation is needed, the tasks submitted by users are executed by automatic polling, and the web crawling process is automated. Web page URLs are classified by two or more classification algorithm models, and only fully consistent classification results are admitted to the library, which improves the accuracy of web page URL classification and the quality of the web page URL classification library. A fully automated, integrated web page URL classification system is realized, which saves a great deal of manpower and time and greatly improves the efficiency of web page URL classification.

Brief description of the drawings

Fig. 1 is a flow chart of the ensemble-learning web page classification method of the first embodiment of the present invention;

Fig. 2 is a schematic diagram of the structure of the ensemble-learning web page classification device of the third embodiment of the present invention;

Fig. 3 is a flow chart of the ensemble-learning web page classification method of the fifth embodiment of the present invention;

Fig. 4 is a flow chart of the ensemble-learning web page classification algorithm of the fifth embodiment of the present invention.

Detailed description of the embodiments

To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, the present invention is described in detail below with reference to the accompanying drawings and preferred embodiments.

The first embodiment of the present invention is an ensemble-learning web page classification method which, as shown in Fig. 1, includes the following specific steps:

Step S101: input web page URLs (Uniform Resource Locators), and deduplicate the input web page URLs and check their validity to obtain a web page URL set.

Specifically, step S101 includes:

Step A1: input the web page URLs by external input or internal extraction.

External input means that the user enters the web page URLs manually or imports them in text form.

Internal extraction means extracting web page URLs that meet set conditions from a database.

The database is a MySQL database that stores information related to web page URLs.

Step A2: deduplicate the input web page URLs, judge from the last-modification-time field and the validity-period field of each deduplicated URL whether the URL needs to be crawled again, and update the URLs that need to be crawled again to obtain the web page URL set.

Specifically, step A2 includes:

Step B1: delete duplicate web page URLs from the input.

Step B2: judge from the last-modification-time field and the validity-period field of each deduplicated URL whether the URL needs to be crawled again, and update the URLs that need to be crawled again to obtain the web page URL set.

Specifically, the judgment uses the last-modification-time field and the validity-period field of each deduplicated web page URL: if the time elapsed since the last modification exceeds the validity period, the expired web page URL is crawled again by the distributed crawler, and the re-crawled web page URL replaces the original URL that was judged to need re-crawling, yielding the web page URL set.

Step S102: crawl the web page content corresponding to the web page URL set with a distributed crawler, and preprocess the crawled web page content to obtain a raw corpus.

Specifically, step S102 includes:

Step C1: crawl the web page content corresponding to the web page URL set with the crawler.

A web crawler framework based on a Hadoop distributed cluster uses the BFS (Breadth First Search) algorithm to crawl, concurrently on multiple nodes, the web page content corresponding to the web page URL set from the Internet. Because the processing capacity of the Hadoop cluster is limited, the amount of data the cluster can effectively process at one time is limited; the web page content is therefore crawled from the Internet in batches, so that the total amount of web page content can reach the tens of millions.

The crawled web page content is the content corresponding to the web page URLs in the set whose URL/domain name level is level five or below.
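The batched breadth-first crawl in step C1 can be sketched on a single machine as follows. This is a simplified stand-in and not the Hadoop-based framework itself: `fetch` is a caller-supplied (hypothetical) function returning a page's content and its outgoing links, and "level" is interpreted here as link depth from the seed URLs, which is only one possible reading of the "URL/domain name level" in the text.

```python
from collections import deque

def bfs_crawl(seeds, fetch, max_level=5, batch_size=1000):
    """Yield batches of (url, content) pairs in breadth-first order.

    fetch(url) -> (content, out_links) is supplied by the caller.
    Pages deeper than `max_level` (seeds are level 1) are not visited.
    """
    seeds = list(dict.fromkeys(seeds))      # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 1) for url in seeds)
    batch = []
    while queue:
        url, level = queue.popleft()
        content, links = fetch(url)
        batch.append((url, content))
        if len(batch) == batch_size:        # hand over one bounded batch
            yield batch
            batch = []
        if level < max_level:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    if batch:
        yield batch
```

Yielding fixed-size batches mirrors the reason given in the text: each batch stays within what the cluster can process at once, while the total crawled volume can keep growing across batches.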

Step C2: preprocess the crawled web page content.

Formatting such as the removal of web page tags and garbled characters is applied to the crawled web page content, web pages containing foreign languages are filtered out, a raw corpus in a preset format is generated, and the raw corpus is stored in the raw corpus set.

Step S103: perform word segmentation on the raw corpus to obtain the corpus to be classified.

The raw corpus is segmented by an open-source word segmentation system with an extensible segmentation dictionary to obtain the corpus to be classified. Being open source and having an independently extensible segmentation dictionary, the system better meets the segmentation needs of the raw corpus, and the added part-of-speech tagging and word frequency statistics functions ensure the quality of the corpus to be classified more effectively. This technique is prior art and is not described further here.
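The patent does not name the segmentation system it uses. Purely to illustrate what an extensible segmentation dictionary and word frequency statistics mean, the sketch below substitutes a simple forward-maximum-matching segmenter (a string-matching technique, not the system referred to in the text); the class name `DictSegmenter` and its methods are hypothetical, and part-of-speech tagging is omitted.

```python
from collections import Counter

class DictSegmenter:
    """Forward maximum matching over a user-extensible word dictionary."""

    def __init__(self, words):
        self.words = set(words)
        self.max_len = max((len(w) for w in self.words), default=1)

    def add_words(self, words):
        """Extend the dictionary, for example with domain-specific terms."""
        self.words.update(words)
        self.max_len = max((len(w) for w in self.words), default=1)

    def cut(self, text):
        """Greedily take the longest dictionary word at each position;
        characters that match nothing become single-character tokens."""
        tokens, i = [], 0
        while i < len(text):
            for size in range(min(self.max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in self.words:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    def word_frequencies(self, text):
        """Word frequency statistics for one document."""
        return Counter(self.cut(text))
```

Extending the dictionary changes the segmentation of the same text, which is the practical benefit of an extensible dictionary: new terms can be recognized as whole words without changing the segmenter itself.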

Step S104: use at least two classification algorithm models in parallel to predict web page URL categories from the vectorized document of the corpus to be classified, group the web page URLs whose category predictions are fully consistent into one category, and store them by category in the URL classification library.

Specifically, step S104 includes:

Step D1: create the Bayes classification algorithm model, the SVM classification algorithm model, the maximum entropy classification algorithm model, the KNN (K-Nearest Neighbor) classification algorithm model, or the neural network classification algorithm model.

Initialize the training corpus, load the training corpus set composed of the training corpus into memory, and so on.

Extract feature words from the training corpus with the expected cross entropy algorithm, and assign weights to the extracted training corpus feature words with the same algorithm. Generate the feature dimension file from the extracted feature words and their weights.

The feature word dimension of the training corpus is specified by a preset parameter N; all training corpus feature words are sorted by feature word weight, and the top N feature words are saved to the feature file.

Here N is an integer not less than the preset number of web page URL categories.

Vectorize the training corpus to generate the vectorized document.

Vectorizing the training corpus includes: according to the feature words in the feature file, generating an N-dimensional column vector for each training corpus item in the training corpus set, and storing the N-dimensional column vectors in the vectorized document of the training corpus. This technique is prior art and is not described further here.

Taking the vectorized document of the training corpus as input, training and parameter selection are performed with the Bayes classification algorithm, the SVM classification algorithm, the maximum entropy classification algorithm, the KNN classification algorithm, and the neural network classification algorithm respectively, generating the corresponding Bayes classification algorithm model, SVM classification algorithm model, maximum entropy classification algorithm model, KNN classification algorithm model, or neural network classification algorithm model.
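As one concrete illustration of training a model from the vectorized training corpus, the sketch below implements a small multinomial naive Bayes classifier with Laplace smoothing. The patent does not state which Bayes variant or which parameters are used; the variant, the smoothing parameter `alpha`, and the function names are assumptions. In this sketch, "parameter selection" would amount to trying several values of `alpha` on held-out data.

```python
import math
from collections import defaultdict

def train_naive_bayes(vectors, labels, alpha=1.0):
    """Train on term-frequency vectors; return {label: (log_prior, log_likelihoods)}."""
    n_features = len(vectors[0])
    class_docs = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for vec, label in zip(vectors, labels):
        class_docs[label] += 1
        sums = feature_sums[label]
        for j, x in enumerate(vec):
            sums[j] += x

    model = {}
    for label, n in class_docs.items():
        sums = feature_sums[label]
        denom = sum(sums) + alpha * n_features   # Laplace-smoothed total
        log_like = [math.log((s + alpha) / denom) for s in sums]
        model[label] = (math.log(n / len(vectors)), log_like)
    return model

def predict(model, vec):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(x * l for x, l in zip(vec, log_like))
    return max(sorted(model), key=score)   # sorted() makes ties deterministic
```

A trained model can be wrapped as `lambda vec: predict(model, vec)` so that it presents the same vector-to-label interface as any other classifier taking part in the agreement check.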

Step D2: select and combine at least two of the Bayes classification algorithm model, the SVM classification algorithm model, the maximum entropy classification algorithm model, the KNN classification algorithm model, and the neural network classification algorithm model, and use them in parallel to predict web page URL categories from the vectorized document of the corpus to be classified.

Initialize the corpus to be classified, load into memory the feature file and the to-be-classified corpus set composed of the corpus to be classified, and so on.

Vectorize the corpus to be classified to generate the vectorized document of the corpus to be classified.

Vectorizing the corpus to be classified includes: according to the feature words in the feature file, generating an N-dimensional column vector for each item in the to-be-classified corpus set, and storing the N-dimensional column vectors in the vectorized document of the corpus to be classified. This technique is prior art and is not described further here.

Select and combine at least two of the Bayes classification algorithm model, the SVM classification algorithm model, the maximum entropy classification algorithm model, the KNN classification algorithm model, and the neural network classification algorithm model, and use them in parallel to predict web page URL categories from the vectorized document of the corpus to be classified.

Compare and match the web page URL category predictions of the at least two selected classification algorithm models.

If the web page URL category predictions of the at least two selected classification algorithm models are consistent, the web page URLs with consistent classification results are grouped into one category and stored by category in the web page URL classification library.

Step S105: publish the web page URL classification library.

Publish the web page URL classification library and generate the web page URL classification report.

The web page URL classification report includes the web page URL categories and, for each category, the number and proportion of web page URLs.

The second embodiment of the present invention is an ensemble-learning web page classification method. The method of this embodiment is roughly the same as that of the first embodiment; the difference is that three classification algorithm models are used to predict web page URL categories from the vectorized document of the corpus to be classified. The method of this embodiment also includes the following specific steps:

Step S104: use the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model in parallel to predict web page URL categories from the vectorized document of the corpus to be classified, group the web page URLs whose category predictions are consistent into one category, and store them by category in the URL classification library.

Specifically, step S104 includes:

Step Z1: create the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model.

The feature word dimension of the training corpus is specified by a preset parameter N; all training corpus feature words are sorted in descending order of feature word weight, and the top N feature words are saved to the feature file.

Vectorize the training corpus to generate the vectorized document.

Taking the vectorized document of the training corpus as input, training and parameter selection are performed with the Bayes classification algorithm, the SVM classification algorithm, and the maximum entropy classification algorithm respectively, generating the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model respectively.

Step Z2: use the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model in parallel to predict web page URL categories from the vectorized document of the corpus to be classified.

Initialize the corpus to be classified, and load into memory the feature file and the to-be-classified corpus set composed of the corpus to be classified.

Use the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model in parallel to predict web page URL categories from the vectorized document of the corpus to be classified.

Compare and match the web page URL category prediction of the Bayes classification algorithm model, that of the SVM classification algorithm model, and that of the maximum entropy classification algorithm model.

If the web page URL category predictions of the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model are consistent, the web page URLs with consistent classification results are grouped into one category and stored by category in the web page URL classification library.

Third embodiment of the invention, corresponding with first embodiment, the present embodiment introduces the Web page classifying of a kind of integrated study Device, as in figure 2 it is shown, include consisting of part:

Input module 100, is used for inputting webpage uniform resource position mark URL, and the webpage URL of input is carried out duplicate removal with true After protecting effectiveness process, obtain webpage set of URL and close.

Concrete, input module 100, it is used for:

Inputted by outside or internal extraction input webpage URL；

Internal extraction is to extract the webpage URL imposed a condition from data base.

Reptile crawls module 200, climbs for webpage set of URL being closed corresponding web page contents by distributed reptile Take, and the web page contents crawled is carried out pretreatment, generate original language material.

For web crawlers framework based on sea dupp distributed type assemblies, concurrent by breadth-first search algorithm multinode Crawl webpage set of URL by batch from Internet and close corresponding web page contents.

Wherein, web page contents is that during webpage set of URL closes, webpage URL/domain name rank is the webpage of below Pyatyi and Pyatyi correspondence Content.

Word-dividing mode 300, obtains language material to be sorted for original language material is carried out word segmentation processing.

For original language material is carried out word segmentation processing by the Words partition system of that increase income and extendible dictionary for word segmentation, treated Classification language material.

Sort module 400, for by least two sorting algorithm model parallel to language material vectorization document to be sorted Carrying out webpage URL classification prediction, the webpage URL predicting the outcome the most consistent by webpage URL classification is classified as a class and is stored in URL by class Class library.

Concrete, sort module 400, it is used for:

Corpus is carried out vectorization process, generates vectorization document.

Using corpus vectorization document as input, classified by Bayes sorting algorithm, svm classifier algorithm, maximum entropy Algorithm, KNN sorting algorithm or neural network classification algorithm are trained selecting with parameter respectively, generate correspondence respectively Bayes sorting algorithm model, svm classifier algorithm model, maximum entropy sorting algorithm model, KNN sorting algorithm model or nerve Meshsort algorithm model.

Initialize language material to be sorted, load the language material set to be sorted and tag file being made up of language material to be sorted to interior Deposit.

Choose Bayes sorting algorithm model, svm classifier algorithm model, maximum entropy sorting algorithm model, KNN sorting algorithm At least two sorting algorithm model in model and neural network classification algorithm model is combined, by least two chosen Sorting algorithm model carries out webpage URL classification prediction to language material vectorization document to be sorted respectively parallel.

Release module 500, configured to publish the web page URL classification library and to generate a web page URL classification report.

Third embodiment of the present invention: a web page classification device based on ensemble learning. The device of this embodiment is roughly the same as that of the preceding embodiment; the difference is that classification module 400 uses the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model to perform web page URL classification prediction on the vectorized document of the corpus to be classified in parallel. In the device of this embodiment, classification module 400 is specifically configured as follows:

Classification module 400 is configured to create the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model; to perform web page URL classification prediction on the vectorized document of the corpus to be classified using these three models in parallel; and to group the web page URLs whose classification predictions are consistent into one class and store them by class in the URL classification library.

Specifically, classification module 400 is configured to:

Create the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model.

Vectorize the training corpus to generate a vectorized document.

If the web page URL classification prediction of the Bayes classification algorithm model, that of the SVM classification algorithm model, and that of the maximum entropy classification algorithm model are consistent, group the web page URLs with consistent classification results into one class and store them by class in the web page URL classification library.

Fourth embodiment of the present invention: a web page classification method based on ensemble learning, as shown in Figs. 3 and 4, comprising the following specific steps:

Step S301: input web page URLs (URL, Uniform Resource Locator).

Specifically, step S301 includes:

(1) External input: URLs are entered manually by the user or imported in text form;

(2) Internal extraction: web page URLs that satisfy a database query condition are extracted from the MySQL database and used as the web page URL input source.

For example, web page URLs are extracted from the MySQL database by a specified news category, by a specified time period, or in a similar way, and used as the web page URL input source.
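The internal extraction of step S301 can be sketched as a parameterized query. The fragment below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the table name web_url and the columns url, category, and created_at are hypothetical and are not taken from this embodiment.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_url (url TEXT, category TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO web_url VALUES (?, ?, ?)",
    [
        ("http://example.com/a", "news", "2013-12-01 08:00:00"),
        ("http://example.com/b", "sports", "2013-12-05 09:30:00"),
        ("http://example.com/c", "news", "2014-01-10 10:00:00"),
    ],
)

def extract_urls(conn, category=None, start=None, end=None):
    """Extract URLs by news category and/or by time period."""
    sql = "SELECT url FROM web_url WHERE 1=1"
    params = []
    if category is not None:
        sql += " AND category = ?"
        params.append(category)
    if start is not None and end is not None:
        # ISO-formatted timestamps compare correctly as text.
        sql += " AND created_at BETWEEN ? AND ?"
        params.extend([start, end])
    sql += " ORDER BY url"
    return [row[0] for row in conn.execute(sql, params)]

news_urls = extract_urls(conn, category="news")
december_urls = extract_urls(
    conn, start="2013-12-01 00:00:00", end="2013-12-31 23:59:59"
)
```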

Step S302: deduplicate the input web page URLs; based on the last-modified-time field and the validity-period field of each deduplicated web page URL, judge whether the URL needs to be crawled again; update the URLs that need to be crawled again; and generate the web page URL set.

Specifically, step S302 includes:

Step E1: delete duplicate web page URLs from the input.

Step E2: based on the last-modified-time field and the validity-period field of each deduplicated web page URL, judge whether the URL needs to be crawled again; update the URLs that need to be crawled again; and generate the web page URL set.

Specifically, the judgment uses the last-modified-time field and the validity-period field of the deduplicated web page URL: if the time elapsed since the last modification exceeds the validity period, the web page URL is crawled again by the distributed crawler, and the re-crawled web page URL replaces the original web page URL that was judged to need re-crawling.

For example, the last-modified-time field lastmodify of a deduplicated web page URL is 2013-12-09 10:12:33 and the validity-period field usefullife is 80 days. If more than 80 days have passed since the last modification time 2013-12-09 10:12:33, the web page URL is crawled again by the distributed crawler, the re-crawled web page URL replaces the original web page URL that was judged to need re-crawling, and the web page URL set is generated.
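Steps E1 and E2 can be illustrated with a short Python sketch. The field names lastmodify and usefullife and the example values come from the text above; the helper names dedupe and needs_recrawl, and the chosen "current time" values, are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def dedupe(urls):
    """Step E1: drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))

def needs_recrawl(lastmodify, usefullife_days, now):
    """Step E2: true when more than usefullife_days have passed since lastmodify."""
    modified = datetime.strptime(lastmodify, "%Y-%m-%d %H:%M:%S")
    return now - modified > timedelta(days=usefullife_days)

urls = dedupe([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
])

# Hypothetical evaluation times: about 96 days and about 23 days after lastmodify.
expired = needs_recrawl("2013-12-09 10:12:33", 80, datetime(2014, 3, 15))
still_valid = needs_recrawl("2013-12-09 10:12:33", 80, datetime(2014, 1, 1))
```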

Step S303: create a web page URL classification task from the web page URL set.

The web page URL classification task exists as one record in the MySQL database.

The web page URL classification task includes: a task ID, a task name, a task status, the web page URL set contained in the task, and so on.

Step S304: initialize the preparation work for the web page URL classification task.

Specifically, step S304 includes:

Step F1: create the directory structure of the web page URL classification task.

The directory structure includes a task directory.

Task directory: includes a to-be-crawled directory, a crawler result directory, a word segmentation result directory, a classification result directory, and so on.

The task directory is named after the task name.

Step F2: read the web page URLs to be crawled from the MySQL database and put them into the to-be-crawled directory.

Step F3: set the task status of the web page URL classification task to "in progress".
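Steps S303 and S304 (F1 to F3) can be sketched as follows. The task record is held in a plain dictionary rather than a MySQL row, and the subdirectory names, the file name urls.txt, and the status strings are hypothetical choices, not values given in this embodiment.

```python
import tempfile
from pathlib import Path

# Hypothetical names for the four subdirectories of the task directory.
SUBDIRS = ("to_crawl", "crawl_result", "segmentation_result", "classification_result")

def init_task(root, task):
    """F1-F3: build the task directory, stage the URLs, mark the task as running."""
    task_dir = Path(root) / task["task_name"]  # directory named after the task
    for sub in SUBDIRS:
        (task_dir / sub).mkdir(parents=True, exist_ok=True)
    url_file = task_dir / "to_crawl" / "urls.txt"
    url_file.write_text("\n".join(task["urls"]), encoding="utf-8")
    task["task_status"] = "running"
    return task_dir

task = {
    "task_id": 1,
    "task_name": "news_batch_01",
    "task_status": "created",
    "urls": ["http://example.com/a", "http://example.com/b"],
}
task_dir = init_task(tempfile.mkdtemp(), task)
```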

Step S305: crawl, with the distributed crawler, the web page content corresponding to the to-be-crawled directory in the task directory.

Specifically, step S305 includes:

Using a web crawler framework built on a Hadoop distributed cluster, multiple nodes concurrently crawl from the Internet, by the BFS (Breadth First Search) algorithm, the web page content corresponding to the to-be-crawled directory in the task directory. Because the overall capacity of the Hadoop cluster is limited, the amount of data the cluster can process effectively at one time is limited; the web page content corresponding to the web page URLs obtained in step 2 is therefore crawled from the Internet in batches, so that the volume of web page content can reach the ten-million scale.

Here, the crawled web page content is the content corresponding to those web page URLs in the to-be-crawled directory of the task directory whose URL/domain-name level is level five or below.
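The breadth-first, batch-wise crawl of step S305 can be sketched in a single process. The fragment below is not a Hadoop job: it walks an in-memory link table instead of the Internet, and it interprets "level" as breadth-first depth from the seed URL, which is an assumption because the embodiment does not define how the URL/domain-name level is computed. The names bfs_crawl, fetch_links, and LINKS are hypothetical.

```python
from collections import deque

def bfs_crawl(seeds, fetch_links, max_level=5, batch_size=2):
    """Yield batches of (url, level) in breadth-first order, up to max_level."""
    seen = set(seeds)
    queue = deque((url, 1) for url in seeds)
    batch = []
    while queue:
        url, level = queue.popleft()
        batch.append((url, level))
        if len(batch) == batch_size:
            yield batch
            batch = []
        if level < max_level:  # do not expand pages already at the deepest level
            for nxt in fetch_links(url):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, level + 1))
    if batch:
        yield batch

# Toy link table standing in for real pages.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": ["f"], "f": ["g"]}
batches = list(bfs_crawl(["a"], lambda u: LINKS.get(u, []), max_level=5, batch_size=2))
visited = [url for batch in batches for url, _ in batch]
```

Page "g" would sit at level six and is therefore never fetched; the batch size plays the role of the per-run limit that the text attributes to cluster capacity.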

Step S306: preprocess the crawled web page content.

Specifically, step S306 includes:

Formatting is applied to the crawled web page content, such as removing web page tags and removing garbled characters; web page content containing foreign languages is filtered out; an original corpus in a preset format is generated; and the original corpus is stored in the crawler result directory of the task directory.
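A minimal sketch of the preprocessing in step S306 follows. It strips tags with regular expressions, drops the Unicode replacement character as a simple proxy for garbled text, and keeps only pages that are mostly Chinese. Treating non-Chinese text as "foreign", the 0.5 threshold, and all function names are assumptions; the embodiment does not state how foreign-language pages are detected.

```python
import re
from html import unescape

# Matches whole <script>/<style> elements first, then any remaining tag.
TAG_RE = re.compile(r"<(script|style)\b.*?</\1>|<[^>]+>", re.S | re.I)
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def clean_html(html):
    """Remove tags, decode entities, drop U+FFFD debris, collapse whitespace."""
    text = unescape(TAG_RE.sub(" ", html))
    text = text.replace("\ufffd", "")
    return re.sub(r"\s+", " ", text).strip()

def is_mostly_chinese(text, threshold=0.5):
    """True when at least `threshold` of the letters are CJK ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cjk = sum(1 for ch in letters if CJK_RE.match(ch))
    return cjk / len(letters) >= threshold

def preprocess(pages):
    """Return cleaned text for the pages that pass the language filter."""
    cleaned = (clean_html(page) for page in pages)
    return [text for text in cleaned if is_mostly_chinese(text)]

pages = [
    "<p>今日新闻</p><script>var x=1;</script><p>报道</p>",
    "<p>This page is entirely English.</p>",
]
original_corpus = preprocess(pages)
```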

Step S307: perform word segmentation on the original corpus in the crawler result directory.

Specifically, step S307 includes:

The original corpus in the crawler result directory is segmented by an open-source word segmentation system with an extensible segmentation dictionary to obtain the corpus to be tested, and the corpus to be tested is stored in the word segmentation result directory. This technique is prior art and is not described further here.

For example, the original corpus in the crawler result directory of crawled web page content is segmented by the Chinese Academy of Sciences word segmentation system to obtain the corpus to be tested.

Because the system is open source and its segmentation dictionary can be extended independently, the segmentation needs of the original corpus are met much more fully; adding part-of-speech tagging and word-frequency statistics ensures corpus quality more effectively.
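To show what an extensible segmentation dictionary buys, the sketch below implements forward maximum matching over a small word set and counts word frequencies. It is a deliberately simple stand-in and is not the Chinese Academy of Sciences segmentation system named above; the class DictSegmenter and its methods are hypothetical, and part-of-speech tagging is not modeled.

```python
from collections import Counter

class DictSegmenter:
    """Forward maximum matching over an extensible word dictionary."""

    def __init__(self, words):
        self.words = set(words)
        self.max_len = max((len(w) for w in self.words), default=1)

    def add_word(self, word):
        """Extend the dictionary at run time."""
        self.words.add(word)
        self.max_len = max(self.max_len, len(word))

    def cut(self, text):
        tokens, i = [], 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(self.max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in self.words:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

seg = DictSegmenter(["网页", "分类"])
before = seg.cut("网页分类方法")
seg.add_word("方法")  # dictionary extension
after = seg.cut("网页分类方法")
freq = Counter(after)  # word-frequency statistics
```

Before the extension, the unknown word is split into single characters; after it, the word is kept whole.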

Step S308: create the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model, and perform web page URL classification on the original corpus according to these three models.

As shown in Fig. 4, step S308 specifically includes:

Step H1: initialize the training corpus and the corpus to be tested.

Specifically, step H1 includes:

Initialize the training corpus and the corpus to be tested in preparation for creating the web page URL classification algorithm models, for example by loading the training corpus set and the corpus to be tested into memory.

Step H2: extract feature words from the training corpus by the expected cross-entropy algorithm.

Step H3: compute and assign weights for the extracted training-corpus feature words by the expected cross-entropy algorithm.

Step H4: generate the feature dimension file from the extracted feature words and their weights.

The feature-word dimension of the training corpus is specified by a preset parameter N: all training-corpus feature words are sorted by feature-word weight, and the top N feature words, as given by parameter N, are saved to the feature dimension file.
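Steps H2 to H4 can be sketched with document-level estimates of one commonly cited form of the expected cross-entropy score, ECE(t) = P(t) * sum over classes c of P(c|t) * log(P(c|t) / P(c)). The embodiment does not give a formula, so this form, the use of the score itself as the feature-word weight, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict

def expected_cross_entropy(docs):
    """docs: list of (tokens, label). Returns {term: ECE weight}."""
    n_docs = len(docs)
    class_count = Counter(label for _, label in docs)
    term_count = Counter()                 # documents containing the term
    term_class_count = defaultdict(Counter)
    for tokens, label in docs:
        for term in set(tokens):
            term_count[term] += 1
            term_class_count[term][label] += 1
    weights = {}
    for term, df in term_count.items():
        p_t = df / n_docs
        total = 0.0
        for label, n_c in class_count.items():
            p_c_given_t = term_class_count[term][label] / df
            if p_c_given_t > 0:
                total += p_c_given_t * math.log(p_c_given_t / (n_c / n_docs))
        weights[term] = p_t * total
    return weights

def top_n_features(weights, n):
    """Sort by weight (descending, ties by term) and keep the first n terms."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

docs = [
    (["球", "比赛"], "sports"),
    (["球", "进球"], "sports"),
    (["股票", "比赛"], "finance"),
    (["股票", "市场"], "finance"),
]
weights = expected_cross_entropy(docs)
feature_words = top_n_features(weights, 2)
```

In this toy corpus the word that appears equally in both classes receives a weight of zero, so it is never selected ahead of class-specific words.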

Step H5: vectorize the training corpus and the corpus to be tested.

Specifically, step H5 includes:

Training-corpus vectorization generates, according to the feature words in the feature dimension file, a 1 × N vector for each web page content sample in the training corpus set and puts the 1 × N vector into the training-corpus vectorized document, where N is the number of training-corpus feature words.

Vectorization of the corpus to be tested generates, according to the feature words in the feature dimension file, a 1 × N vector for each web page content sample of the original corpus and puts the 1 × N vector into the original-corpus vectorized document.

This technique is prior art and is not described further here.
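A short sketch of the vectorization in step H5: each segmented sample becomes a list of length N aligned with the feature words. Using raw occurrence counts as the vector entries is an assumption, since the embodiment does not say which values the vector holds; the function name vectorize is hypothetical.

```python
def vectorize(tokens, feature_words):
    """Return a 1 x N list holding the count of each feature word in tokens."""
    index = {word: i for i, word in enumerate(feature_words)}
    vector = [0] * len(feature_words)
    for tok in tokens:
        pos = index.get(tok)
        if pos is not None:  # words outside the feature file are ignored
            vector[pos] += 1
    return vector

feature_words = ["球", "股票"]
sample_vector = vectorize(["球", "比赛", "球"], feature_words)
```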

Step H6: taking the training-corpus vectorized document as input, perform training and parameter selection for the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model respectively.

Step H7: according to the training and parameter selection results of step H6, generate the Bayes classification algorithm model file, the SVM classification algorithm model file, and the maximum entropy classification algorithm model file.
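Steps H6 and H7 can be illustrated for one of the three models. The sketch below trains a small multinomial naive Bayes classifier, selects its smoothing value on a held-out set, and serializes the result with pickle as a stand-in for the model file. The candidate smoothing values, the hold-out scheme, and all names are assumptions; the SVM and maximum entropy models are not shown.

```python
import math
import pickle

def train_nb(vectors, labels, alpha):
    """Multinomial naive Bayes with add-alpha smoothing."""
    classes = sorted(set(labels))
    model = {"alpha": alpha, "classes": classes, "log_prior": {}, "log_like": {}}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        model["log_prior"][c] = math.log(len(rows) / len(vectors))
        counts = [sum(col) + alpha for col in zip(*rows)]
        total = sum(counts)
        model["log_like"][c] = [math.log(x / total) for x in counts]
    return model

def predict_nb(model, vector):
    def score(c):
        return model["log_prior"][c] + sum(
            x * w for x, w in zip(vector, model["log_like"][c]))
    return max(model["classes"], key=score)

def select_alpha(train, valid, candidates=(0.1, 1.0, 10.0)):
    """Parameter selection: keep the alpha with the best validation accuracy."""
    def accuracy(alpha):
        model = train_nb(*train, alpha)
        hits = sum(predict_nb(model, v) == y for v, y in zip(*valid))
        return hits / len(valid[0])
    return max(candidates, key=accuracy)

train = ([[2, 0], [3, 0], [0, 2], [0, 1]], ["sports", "sports", "finance", "finance"])
valid = ([[1, 0], [0, 1]], ["sports", "finance"])
best_alpha = select_alpha(train, valid)
model_file = pickle.dumps(train_nb(*train, best_alpha))  # stands in for the model file
restored = pickle.loads(model_file)
```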

Step H8: using the Bayes classification algorithm model file, the SVM classification algorithm model file, and the maximum entropy classification algorithm model file, perform web page URL classification prediction on the vectorized document of the corpus to be tested, each model running in parallel.

Step H9: generate in parallel the web page URL classification result of the Bayes classification algorithm model, that of the SVM classification algorithm model, and that of the maximum entropy classification algorithm model.

Step H10: perform voting-based screening on the web page URL classification results of the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model.

Specifically, step H10 includes:

The web page URLs for which the classification results of the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model are consistent are stored in the web page URL classification result directory.
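The screening of step H10 keeps a URL only when every model assigns it the same class. A minimal sketch, assuming the per-model results are available as equal-length label lists aligned with the URL list (the function name and the example data are hypothetical):

```python
def unanimous_results(per_model_labels, urls):
    """Keep URLs on which all models agree, grouped by the agreed class."""
    library = {}
    columns = list(per_model_labels.values())
    for url, labels in zip(urls, zip(*columns)):
        if len(set(labels)) == 1:  # every model predicted the same class
            library.setdefault(labels[0], []).append(url)
    return library

per_model = {
    "bayes": ["news", "sports", "news"],
    "svm": ["news", "sports", "sports"],
    "maxent": ["news", "sports", "news"],
}
urls = ["http://example.com/1", "http://example.com/2", "http://example.com/3"]
result_library = unanimous_results(per_model, urls)
```

The third URL is dropped because one model disagrees; requiring full agreement trades coverage for accuracy of the stored classes.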

Step S309: web page URL classification post-processing: update the MySQL database according to the web page URL classification result directory.

Step S310: judge, based on whether the web page URLs in the web page URL set have become invalid, whether their classification has changed, and so on, whether to restart the web page URL classification task.

Specifically, step S310 includes:

If a web page URL in the web page URL set is judged to be invalid or to have changed classification, the web page URL classification task is restarted.

Whether a web page URL is invalid is judged from its last-modified-time field and its validity-period field: if the time elapsed since the last modification exceeds the validity period, the web page URL is judged to be invalid.

For example, the last-modified-time field lastmodify of a web page URL is 2013-12-09 10:12:33 and the validity-period field usefullife is 80 days. If more than 80 days have passed since the last modification time 2013-12-09 10:12:33, the web page URL is judged to be invalid.

Step S311: publish the web page URL classification library.

The web page URL classification library is packaged and published automatically with one click, and a web page URL classification report is generated.
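The classification report can be sketched as a per-category summary. The claims describe the report as listing the categories together with the number and proportion of web page URLs in each; the dictionary layout and the function name below are assumptions.

```python
def classification_report(library):
    """Return {category: {"count": n, "share": n / total}} for a URL library."""
    total = sum(len(urls) for urls in library.values())
    return {
        category: {
            "count": len(urls),
            "share": len(urls) / total if total else 0.0,
        }
        for category, urls in library.items()
    }

library = {
    "news": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "sports": ["http://example.com/d"],
}
report = classification_report(library)
```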

Through the description of the specific embodiments, the technical means adopted by the present invention to achieve its intended purpose, and their effects, should be understood more deeply and concretely; the accompanying drawings, however, are provided only for reference and illustration and are not intended to limit the present invention.

Claims

1. A web page classification method based on ensemble learning, characterized by comprising:

Step one: input web page Uniform Resource Locators (URLs), and obtain a web page URL set after deduplicating the input web page URLs and performing validity-assurance processing on them;

Step two: crawl, with a distributed crawler, the web page content corresponding to the web page URL set, and preprocess the crawled web page content to generate an original corpus;

Step three: perform word segmentation on the original corpus to obtain a corpus to be classified;

Step four: perform web page URL classification prediction on the vectorized document of the corpus to be classified using at least two classification algorithm models in parallel, group the web page URLs whose classification predictions are all consistent into one class, and store them by class in a URL classification library.

2. The web page classification method based on ensemble learning according to claim 1, characterized in that, in step one, inputting the web page URLs specifically includes:

Inputting the web page URLs by external input or by internal extraction;

3. The web page classification method based on ensemble learning according to claim 1, characterized in that, in step one, the validity-assurance processing specifically includes:

Judging according to the last-modified-time field and the validity-period field of the deduplicated web page URLs: if the time elapsed since the last modification exceeds the validity period, crawling again, with the crawler, the web page URLs that have exceeded the validity period.

4. The web page classification method based on ensemble learning according to claim 1, characterized in that, in step two, preprocessing the crawled web page content to generate the original corpus specifically includes:

Removing web page tags and garbled characters from the crawled web pages, filtering out web page content containing foreign languages, and generating an original corpus in a preset format.

5. The web page classification method based on ensemble learning according to claim 1, characterized in that, in step two, crawling with the distributed crawler the web page content corresponding to the web page URL set specifically includes:

Using a web crawler framework built on a Hadoop distributed cluster, crawling from the Internet in batches, by the breadth-first search algorithm, the web page content corresponding to the web page URL set;

Wherein the web page content is the content of those pages whose web page URL/domain-name level, among the URLs in the web page URL set, is level five or below.

6. The web page classification method based on ensemble learning according to claim 1, characterized in that step three specifically includes:

Performing word segmentation on the original corpus with an open-source word segmentation system whose segmentation dictionary is extensible, obtaining the corpus to be classified.

7. The web page classification method based on ensemble learning according to claim 1, characterized in that step four specifically includes:

Initializing the corpus to be classified, and loading the set of corpora to be classified (made up of the corpus to be classified) and the feature file into memory;

According to the feature words in the feature file, generating an N-dimensional vector for each corpus to be classified in the set of corpora to be classified, and storing the N-dimensional vector in the vectorized document of the corpus to be classified;

Selecting at least two classification algorithm models and performing web page URL classification prediction on the vectorized document of the corpus to be classified in parallel;

If the web page URL classification predictions of all the selected classification algorithm models are consistent, grouping the web page URLs with consistent classification results into one class and storing them by class in the web page URL classification library.

8. The web page classification method based on ensemble learning according to claim 1 or 7, characterized in that, in step four, the at least two classification algorithm models are chosen arbitrarily from the following classification algorithm models: a Bayes classification algorithm model, a support vector machine (SVM) classification algorithm model, a maximum entropy classification algorithm model, a nearest-neighbor (KNN) classification algorithm model, and a neural network classification algorithm model.

9. The web page classification method based on ensemble learning according to claim 8, characterized in that the process of obtaining the Bayes classification algorithm model, the SVM classification algorithm model, the maximum entropy classification algorithm model, the KNN classification algorithm model, or the neural network classification algorithm model includes:

Extracting feature words from the training corpus by the expected cross-entropy algorithm, and assigning feature-word weights to the extracted training-corpus feature words by the expected cross-entropy algorithm; specifying the training-corpus feature-word dimension according to a preset parameter N, sorting the training-corpus feature words in descending order of feature-word weight, and selecting the top N training-corpus feature words and saving them to a feature file;

According to the feature words in the feature file, generating an N-dimensional vector for each training corpus in the training corpus set, and storing the N-dimensional vector in the training-corpus vectorized document;

Taking the training-corpus vectorized document as input, performing training and parameter selection with the Bayes classification algorithm, the SVM classification algorithm, the maximum entropy classification algorithm, the KNN classification algorithm, or the neural network classification algorithm respectively, and generating the corresponding Bayes classification algorithm model, SVM classification algorithm model, maximum entropy classification algorithm model, KNN classification algorithm model, or neural network classification algorithm model.

10. The web page classification method based on ensemble learning according to claim 8, characterized in that, in step four, performing web page URL classification prediction on the vectorized document of the corpus to be classified with the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model, each in parallel, specifically includes:

Comparing and matching the web page URL classification prediction of the Bayes classification algorithm model, that of the SVM classification algorithm model, and that of the maximum entropy classification algorithm model;

If the web page URL classification prediction of the Bayes classification algorithm model, that of the SVM classification algorithm model, and that of the maximum entropy classification algorithm model are consistent, grouping the web page URLs with consistent classification results into one class and storing them by class in the web page URL classification library.

11. The web page classification method based on ensemble learning according to claim 1, characterized in that the method, after step four, further includes:

The web page URL classification report includes: the web page URL categories, and the number and proportion of web page URLs in each web page URL category.

12. A web page classification device based on ensemble learning, characterized by comprising:

Input module, configured to input web page URLs and to obtain a web page URL set after deduplicating the input web page URLs and performing validity-assurance processing on them;

Crawler module, configured to crawl, with a distributed crawler, the web page content corresponding to the web page URL set, and to preprocess the crawled web page content to generate an original corpus;

Classification module, configured to perform web page URL classification prediction on the vectorized document of the corpus to be classified using at least two classification algorithm models in parallel, to group the web page URLs whose classification predictions are all consistent into one class, and to store them by class in a URL classification library.

13. The web page classification device based on ensemble learning according to claim 12, characterized in that the input module is specifically configured to:

Input the web page URLs by external input or by internal extraction;

14. The web page classification device based on ensemble learning according to claim 12, characterized in that the input module is specifically configured to:

15. The web page classification device based on ensemble learning according to claim 12, characterized in that the crawler module is specifically configured to:

16. The web page classification device based on ensemble learning according to claim 12, characterized in that the crawler module is specifically configured to:

17. The web page classification device based on ensemble learning according to claim 12, characterized in that the word segmentation module is specifically configured to:

18. The web page classification device based on ensemble learning according to claim 12, characterized in that the classification module is specifically configured to:

19. The web page classification device based on ensemble learning according to claim 12 or 18, characterized in that the classification module is further configured to:

Choose the at least two classification algorithm models from the following classification algorithm models: a Bayes classification algorithm model, an SVM classification algorithm model, a maximum entropy classification algorithm model, a KNN classification algorithm model, and a neural network classification algorithm model.

20. The web page classification device based on ensemble learning according to claim 19, characterized in that the classification module is further configured to:

Obtain the Bayes classification algorithm model, the SVM classification algorithm model, the maximum entropy classification algorithm model, the KNN classification algorithm model, or the neural network classification algorithm model as follows:

Taking the training-corpus vectorized document as input, perform training and parameter selection with the Bayes classification algorithm, the support vector machine (SVM) classification algorithm, the maximum entropy classification algorithm, the nearest-neighbor (KNN) classification algorithm, or the neural network classification algorithm respectively, and generate the corresponding Bayes classification algorithm model, SVM classification algorithm model, maximum entropy classification algorithm model, KNN classification algorithm model, or neural network classification algorithm model.

21. The web page classification device based on ensemble learning according to claim 19, characterized in that the classification module is specifically configured to:

Perform web page URL classification prediction on the vectorized document of the corpus to be classified with the Bayes classification algorithm model, the SVM classification algorithm model, and the maximum entropy classification algorithm model, each in parallel;

22. The web page classification device based on ensemble learning according to claim 12, characterized in that the device further comprises: