CN105447018B - Method and device for verifying a web page classification model - Google Patents

Method and device for verifying a web page classification model

Info

Publication number
CN105447018B
Authority
CN
China
Prior art keywords
web page
crawl
page classifying
webpage
classifying model
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410411722.6A
Other languages
Chinese (zh)
Other versions
CN105447018A (en)
Inventor
刘晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Network Technology Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201410411722.6A
Publication of CN105447018A
Application granted
Publication of CN105447018B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a method and device for verifying a web page classification model. The method includes: crawling the seed sites to which the web page classification model to be verified applies, and classifying the crawled web pages with the web page classification model to obtain the positive example ratio of the crawl, where the positive example ratio of a crawl is the ratio of the number of crawled web pages classified as the type to which the model belongs to the total number of web pages crawled; and determining, from the positive example ratio of the crawl, whether the web page classification model has failed. The technical solution of the present invention makes it possible to verify the validity of a web page classification model.

Description

Method and device for verifying a web page classification model
[Technical field]
The present invention relates to the field of Internet technology, and in particular to a method and device for verifying a web page classification model.
[Background]
With the rapid development of the Internet, the information available online keeps growing. Web page classification lets users find the information they need quickly and easily, so it is widely used.
At present, web page classification is commonly done as follows: a batch of web pages is labeled, features are extracted from those pages, the extracted features are used to train a web page classification model with an algorithm such as machine learning, and web pages are then classified with the resulting model.
Over time, such a web page classification model may stop being applicable to current web pages, which causes classification errors. A method is therefore needed to verify effectively whether a web page classification model is still valid.
[Summary of the invention]
Aspects of the present invention provide a method and device for verifying a web page classification model, so as to verify the validity of the model.
One aspect of the present invention provides a method for verifying a web page classification model, comprising:
crawling the seed sites to which the web page classification model to be verified applies, and classifying the crawled web pages with the web page classification model to obtain the positive example ratio of the crawl, where the positive example ratio of the crawl is the ratio of the number of crawled web pages classified as the type to which the web page classification model belongs to the total number of web pages crawled;
determining, from the positive example ratio of the crawl, whether the web page classification model has failed.
Another aspect of the present invention provides a device for verifying a web page classification model, comprising:
a crawling module, configured to crawl the seed sites to which the web page classification model to be verified applies;
an obtaining module, configured to classify the crawled web pages with the web page classification model and obtain the positive example ratio of the crawl, where the positive example ratio of the crawl is the ratio of the number of crawled web pages classified as the type to which the web page classification model belongs to the total number of web pages crawled;
a first determining module, configured to determine, from the positive example ratio of the crawl, whether the web page classification model has failed.
In the technical solution of the present invention, the seed sites to which the web page classification model to be verified applies are crawled, the crawled web pages are classified with the model, the positive example ratio of the crawl is obtained, and that ratio is used to determine whether the model has failed. The positive example ratio of a crawl is the ratio of the number of crawled web pages classified as the type to which the model belongs to the total number of web pages crawled. This ratio characterizes how accurately the model classifies different web pages, and the accuracy of the model's classification results on the seed sites indicates whether the model has failed, which achieves verification of the model's validity.
[Brief description of the drawings]
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1a is a flow diagram of the method for verifying a web page classification model provided by one embodiment of the present invention;
Fig. 1b is a flow diagram of the method for verifying a web page classification model provided by another embodiment of the present invention;
Fig. 2 is a flow diagram of the method for verifying a web page classification model provided by a further embodiment of the present invention;
Fig. 3 is a flow diagram of the method for verifying a web page classification model provided by a further embodiment of the present invention;
Fig. 4 is a structural diagram of the device for verifying a web page classification model provided by one embodiment of the present invention;
Fig. 5 is a structural diagram of the device for verifying a web page classification model provided by another embodiment of the present invention.
[Detailed description of the embodiments]
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1a is a flow diagram of the method for verifying a web page classification model provided by one embodiment of the present invention. As shown in Fig. 1a, the method includes:
101. Crawl the seed sites to which the web page classification model to be verified applies, and classify the crawled web pages with the web page classification model to obtain the positive example ratio of the crawl, where the positive example ratio of the crawl is the ratio of the number of crawled web pages classified as the type to which the model belongs to the total number of web pages crawled.
102. Determine, from the positive example ratio of the crawl, whether the web page classification model has failed.
The method of this embodiment may be executed by any device capable of performing it. Such a device may be called a device for verifying a web page classification model, referred to below as the verifying device.
The method of this embodiment can be used to verify the validity of any web page classification model, and the verification process is the same for every model. Different classification models belong to different types. For example, a model mainly used to classify news web pages belongs to the news type, a model mainly used to classify forum web pages belongs to the forum type, a model mainly used to classify music web pages belongs to the music type, and so on.
Different web page classification models apply to different seed sites. For example, a news model mainly applies to seed sites that publish news pages, a music model mainly applies to seed sites that publish music pages, and so on.
Once a web page classification model has been generated, it can be used to classify the web pages on the seed sites it applies to. Over time, however, the content or structure of the pages on a seed site may change, for example through a redesign or a new layout. Some of these changes cause the model to fail, so that a new model has to be trained. For example, suppose the original web page is a forum page at www.xxblog.com. When the web page classification model was trained, the page had the following feature: the site part of its uniform resource locator (URL) contains "blog" as the central word. A model trained on that page therefore classifies the pages under the URL as forum pages. If the site is redesigned, or its content changes entirely to novels, while the URL stays the same, the old model still classifies the pages under the URL as forum pages even though they are now novel pages. The original model has failed and needs to be retrained.
To verify whether a web page classification model is still valid, the verifying device can crawl the seed sites to which the model to be verified applies, classify the crawled web pages with the model, and obtain the positive example ratio of the crawl. It then determines from that ratio whether the model has failed.
Note that one web page classification model can be used for many seed sites. In this embodiment, the seed sites to which the model applies may be all of the seed sites on which the model can be used, or a subset chosen from them.
One crawl of the seed sites to which the web page classification model applies is the process in which a web crawler fetches web pages from each of those seed sites. The positive example ratio of a crawl is the ratio of the number of crawled web pages classified as the type to which the model belongs to the total number of web pages crawled. The ratio may be expressed as a percentage, but is not limited to this. For example, suppose one crawl fetches 1000 web pages in total and the model classifies 890 of them as the type to which it belongs. The positive example ratio of that crawl is 890/1000*100% = 89%. Suppose another crawl fetches 1200 web pages in total and the model classifies 1080 of them as the type to which it belongs. The positive example ratio of that crawl is 1080/1200*100% = 90%. Specifically, the verifying device classifies the web pages fetched by each crawl with the model, counts the web pages classified as the type to which the model belongs, and divides that count by the total number of web pages fetched by the crawl to obtain the positive example ratio of the crawl. To some extent, the positive example ratio of each crawl reflects how accurately the web pages from the period of that crawl are classified. Here, a web page being classified as the type to which the model belongs means that the classification result for that web page is accurate.
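The positive example ratio described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `positive_example_ratio`, the `classify` callable, and the stand-in page data are all hypothetical.

```python
from typing import Callable, Sequence


def positive_example_ratio(
    pages: Sequence[str],
    classify: Callable[[str], str],
    model_type: str,
) -> float:
    """Share of crawled pages that the model assigns to its own type."""
    if not pages:
        raise ValueError("the crawl returned no pages")
    positives = sum(1 for page in pages if classify(page) == model_type)
    return positives / len(pages)


# Stand-in data reproducing the first worked example: 890 of 1000 pages.
# Each "page" is just its would-be classification result (an assumption).
pages = ["news"] * 890 + ["other"] * 110
ratio = positive_example_ratio(pages, classify=lambda page: page, model_type="news")
print(f"{ratio:.0%}")
```

Raising an error on an empty crawl is a design choice of this sketch; the source does not say how an empty crawl should be handled.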
In this embodiment, the positive example ratio characterizes how accurately the web page classification model classifies different web pages. The accuracy of the model's classification results on the seed sites indicates whether the model has failed, which achieves verification of the model's validity.
In an optional implementation, step 101 includes: determining, among the seed sites to which the web page classification model applies, the seed sites whose weight value is greater than a preset weight threshold; crawling those seed sites; and classifying the crawled web pages with the classification model to obtain the positive example ratio of the crawl.
The weight value of a seed site may be determined by the site's ranking; for example, the higher the ranking, the larger the weight value. Alternatively, the weight value may be determined by the number of web pages on the seed site; for example, the more web pages, the larger the weight value. The weight value may also be determined jointly by several factors, such as the site's ranking and the number of web pages on the site. The larger the weight value of a seed site, the more authoritative the site. Crawling the more authoritative seed sites, as this embodiment does, helps ensure the reliability of the verification result.
In addition, relying on the more authoritative seed sites can reduce the number of crawls needed. For example, the seed sites whose weight value is greater than the preset weight threshold can be crawled once, the web pages fetched by that crawl can be classified with the classification model to obtain the positive example ratio of the crawl, and that ratio can be compared with a preset threshold. If the ratio is greater than the threshold, the web page classification model is still valid; if the ratio is not greater than the threshold, the model has failed.
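The seed site selection and the single-crawl decision described above can be sketched as follows. This is only an illustration: the function names, the example host names, and the numeric weights and thresholds are assumptions, since the source does not specify how weights are computed or what threshold values to use.

```python
from typing import Dict, List


def select_seed_sites(weights: Dict[str, float], weight_threshold: float) -> List[str]:
    """Keep only seed sites whose weight value exceeds the preset weight threshold."""
    return sorted(site for site, weight in weights.items() if weight > weight_threshold)


def model_is_valid(positive_ratio: float, ratio_threshold: float) -> bool:
    """Single-crawl decision: valid only if the ratio is greater than the threshold."""
    return positive_ratio > ratio_threshold


# Hypothetical weights, e.g. derived from site ranking and page count.
weights = {"news-a.example": 0.9, "news-b.example": 0.4, "news-c.example": 0.75}
selected = select_seed_sites(weights, weight_threshold=0.7)
print(selected, model_is_valid(0.89, ratio_threshold=0.8))
```

A ratio exactly equal to the threshold counts as a failure here, matching the "not greater than" wording of the text.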
In another optional implementation, each crawl takes a certain amount of time, so different crawls of the seed sites to which the web page classification model applies are necessarily crawls carried out at different times. Accordingly, the web pages fetched by different crawls are web pages from different times. On this basis, another embodiment of the present invention provides a method for verifying a web page classification model. As shown in Fig. 1b, the method includes:
10a. Crawl the seed sites to which the web page classification model to be verified applies at least twice, and classify the web pages fetched by each crawl with the web page classification model to obtain the positive example ratio of each crawl, where the positive example ratio of each crawl is the ratio of the number of web pages fetched by that crawl and classified as the type to which the model belongs to the total number of web pages fetched by that crawl.
10b. Determine, from the positive example ratios of the at least two crawls, whether the web page classification model has failed.
Because the web pages fetched by different crawls are web pages from different times, classifying the web pages fetched by each crawl and obtaining the positive example ratio of each crawl amounts to classifying web pages from different times and obtaining the accuracy of the classification results for each of those times.
Based on the above, the verifying device can use the positive example ratios of the at least two crawls to judge how much the accuracy of the model's classification results fluctuates across web pages from different times, and can determine from the size of the fluctuation whether the model has failed. Specifically, if the difference between the positive example ratios of different crawls is greater than a preset ratio threshold, the accuracy of the model's classification results for web pages on the seed sites fluctuates considerably over time. This means that the model no longer applies to the seed sites, and the model is determined to have failed. Conversely, if the difference between the positive example ratios of different crawls is not greater than (that is, less than or equal to) the preset ratio threshold, the accuracy fluctuates little over time, the model still applies to the seed sites, and the model is determined to be valid.
In summary, this embodiment crawls the seed sites to which the web page classification model to be verified applies at least twice, classifies the web pages fetched by each crawl with the model, and obtains the positive example ratio of each crawl. Because the web pages fetched by different crawls come from different times, the positive example ratios of different crawls characterize the accuracy of the model's classification results at different times. These accuracy values show how much the classification accuracy on the seed sites fluctuates over time, and the degree of fluctuation indicates whether the model has failed, which achieves verification of the model's validity.
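The fluctuation check for two or more crawls can be sketched as below. The source only speaks of "the difference" between the ratios of different crawls; measuring fluctuation as the largest ratio minus the smallest is an assumption of this sketch, as are the function name and the 0.05 threshold.

```python
from typing import Sequence


def model_failed(positive_ratios: Sequence[float], ratio_threshold: float) -> bool:
    """Return True when the ratios of different crawls fluctuate too much."""
    if len(positive_ratios) < 2:
        raise ValueError("at least two crawls are required")
    # Assumption: fluctuation is the spread between the best and worst crawl.
    fluctuation = max(positive_ratios) - min(positive_ratios)
    return fluctuation > ratio_threshold


print(model_failed([0.89, 0.90], ratio_threshold=0.05))
print(model_failed([0.89, 0.90, 0.60], ratio_threshold=0.05))
```

With exactly two crawls the spread reduces to the absolute difference of the two ratios, which is the case the text describes directly.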
In an optional implementation, crawling the seed sites to which the web page classification model to be verified applies at least twice specifically means crawling the seed sites twice, at different times. The verification flow for two crawls is shown in Fig. 2, which is a flow diagram of the method for verifying a web page classification model provided by a further embodiment of the present invention. As shown in Fig. 2, the method includes:
201. At a first time, perform a first crawl of the seed sites to which the web page classification model to be verified applies, and classify the web pages fetched by the first crawl with the web page classification model to obtain a first positive example ratio.
202. At a second time, perform a second crawl of the seed sites, and classify the web pages fetched by the second crawl with the web page classification model to obtain a second positive example ratio.
203. Compare the first positive example ratio with the second positive example ratio.
204. Judge whether the result of the comparison is greater than a preset ratio threshold. If so, go to step 205; if not, go to step 206.
205. Determine that the web page classification model has failed.
206. Determine that the web page classification model is valid.
The positive example ratio in this embodiment is the ratio of the number of crawled web pages classified as the type to which the web page classification model belongs to the total number of web pages crawled. For example, the first positive example ratio is the ratio of the number of web pages crawled at the first time and classified as the type to which the model belongs to the total number of web pages crawled at the first time. Suppose 1000 web pages are crawled at the first time and the model classifies 890 of them as the type to which it belongs; the first positive example ratio is then 890/1000*100% = 89%. Correspondingly, the second positive example ratio is the ratio of the number of web pages crawled at the second time and classified as the type to which the model belongs to the total number of web pages crawled at the second time. Suppose 1200 web pages are crawled at the second time and the model classifies 1080 of them as the type to which it belongs; the second positive example ratio is then 1080/1200*100% = 90%.
Specifically, when the validity of a web page classification model needs to be verified, the seed sites to which the model applies can be crawled twice, and the two positive example ratios obtained by classifying the web pages of each crawl with the model are used to determine whether the model is valid, which achieves verification of the model's validity. Because the seed sites only need to be crawled twice, this helps speed up the verification.
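Steps 203 to 206 can be sketched with the worked numbers above. The function name, the `Counts` alias, the use of an absolute difference as the comparison result, and the 0.05 threshold are assumptions made for illustration; the source does not fix a threshold value.

```python
from typing import Tuple

# (pages classified as the model's type, total pages crawled)
Counts = Tuple[int, int]


def compare_two_crawls(first: Counts, second: Counts, ratio_threshold: float) -> str:
    """Compare the two positive example ratios and report the verdict."""
    first_ratio = first[0] / first[1]
    second_ratio = second[0] / second[1]
    difference = abs(first_ratio - second_ratio)
    return "failed" if difference > ratio_threshold else "valid"


# 89% at the first time versus 90% at the second time.
result = compare_two_crawls((890, 1000), (1080, 1200), ratio_threshold=0.05)
print(result)
```

With these numbers the two ratios differ by about one percentage point, well inside the assumed threshold.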
Fig. 3 is a flow diagram of the method for verifying a web page classification model provided by a further embodiment of the present invention. As shown in Fig. 3, the method includes:
301. Periodically crawl the seed sites to which the web page classification model to be verified applies.
302. Classify the web pages fetched by the current crawl with the web page classification model to obtain the positive example ratio of the current crawl.
303. Compare the difference between the positive example ratio of the current crawl and the positive example ratio of the previous crawl with a preset ratio threshold.
304. Judge whether the difference is greater than the ratio threshold. If so, go to step 305; if not, go to step 306.
305. Determine that the web page classification model has failed.
306. Determine that the web page classification model is valid, and return to step 301.
In this embodiment, the verifying device periodically crawls the seed sites, classifies the web pages fetched by the current crawl with the web page classification model, and obtains the positive example ratio of the current crawl. It then compares the difference between the positive example ratio of the current crawl and that of the previous crawl with a preset ratio threshold, and determines from the result whether the model has failed. Crawling the seed sites periodically helps detect a failed model in time.
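The periodic flow of steps 301 to 306 can be sketched as a loop over one positive example ratio per period. The crawling and classification steps are abstracted away; the function name `monitor`, the sample ratios, and the 0.05 threshold are assumptions of this sketch.

```python
from typing import Iterable, Optional


def monitor(ratios: Iterable[float], ratio_threshold: float) -> Optional[int]:
    """Return the index of the first period in which the model is judged
    to have failed, or None if every period stays within the threshold."""
    previous = None
    for period, current in enumerate(ratios):
        # Steps 303 and 304: compare with the previous crawl, if there is one.
        if previous is not None and abs(current - previous) > ratio_threshold:
            return period  # step 305: the model has failed
        previous = current  # step 306: still valid, wait for the next period
    return None


failed_at = monitor([0.89, 0.90, 0.62, 0.61], ratio_threshold=0.05)
print(failed_at)
```

The first period has no previous crawl to compare against, so this sketch treats it as a baseline only.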
A web page classification model may also appear to fail because the program that uses the model to classify input web pages has changed. The influence of that program on the validity of the model therefore also needs to be ruled out.
To account for the classification program, in an optional implementation the method steps of the embodiments shown in Figs. 1a to 3 are preceded by a step of determining that the program that uses the web page classification model to classify input web pages has not changed. Specifically, the verifying device can judge whether the program has changed by means such as its version number or its MD5 value. If the program has not changed, it does not affect the validity of the model; that is, as far as the program is concerned, the model is still valid. The validity of the model then needs to be verified further with the methods provided by the embodiments shown in Figs. 1a to 3.
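A change check based on the MD5 value can be sketched with Python's standard `hashlib` module. The helper names and the idea of keeping a previously recorded digest are assumptions for illustration; in practice the bytes would be read from the classification program's file.

```python
import hashlib


def md5_digest(program_bytes: bytes) -> str:
    """Hex MD5 digest of the classification program's contents."""
    return hashlib.md5(program_bytes).hexdigest()


def program_changed(recorded_digest: str, program_bytes: bytes) -> bool:
    """Compare the current digest with the one recorded at the last check."""
    return md5_digest(program_bytes) != recorded_digest


# Stand-in byte strings instead of real program files (an assumption).
recorded = md5_digest(b"classifier program, build 1")
print(program_changed(recorded, b"classifier program, build 1"))
print(program_changed(recorded, b"classifier program, build 2"))
```

MD5 is used here only as a change detector, as in the text; it is not suitable where resistance to deliberate tampering is required.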
To account for the program that uses the web page classification model to classify input web pages, in another optional implementation the method steps of the embodiments shown in Figs. 1a to 3 are preceded by: determining that the program that uses the model to classify input web pages has changed, and determining that the results of classifying web pages of labeled type with the model are consistent with the labeled types of those web pages.
Specifically, the verifying device can judge whether the program has changed by means such as its version number or its MD5 value. If the program has changed, it may affect the validity of the model. In that case the model is used to classify web pages of labeled type, and the classification results are compared with the labeled types of those web pages to judge whether the model is still valid.
Specifically, the verifying device can extract web pages of labeled type from seed sites whose weight value is greater than a preset weight threshold, classify those web pages with the model, and compare the classification results with the labeled types. If the classification results are consistent with the labeled types, the program has not affected the validity of the model and the model is still valid. If the classification results are inconsistent with the labeled types, then as far as the program is concerned the model is no longer valid. When the classification results are consistent with the labeled types, the model is still valid with respect to the program, so its validity needs to be verified further with the methods provided by the embodiments shown in Figs. 1a to 3. When the classification results are inconsistent with the labeled types, the model has failed and the verification process can end; that is, the method steps provided by the embodiments shown in Figs. 1a to 3 do not need to be executed.
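The labeled-page check can be sketched as follows. The source only says the classification results must be "consistent" with the labeled types; requiring every labeled page to match is an assumption of this sketch, and the function name, URLs, and toy classifier are hypothetical.

```python
from typing import Callable, Dict


def consistent_with_labels(
    labeled_pages: Dict[str, str],
    classify: Callable[[str], str],
) -> bool:
    """True only if every labeled page is classified as its labeled type."""
    return all(classify(page) == label for page, label in labeled_pages.items())


# Hypothetical labeled pages from high-weight seed sites: URL -> labeled type.
labeled_pages = {
    "news-a.example/story-1": "news",
    "news-a.example/story-2": "news",
}


def toy_classifier(url: str) -> str:
    """Stand-in for the classification program under test."""
    return "news" if "story" in url else "other"


print(consistent_with_labels(labeled_pages, toy_classifier))
```

A tolerance, such as accepting a small share of mismatches, would be a reasonable variation but is not described in the source.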
Note that the weight value of a seed site may be determined by the site's ranking; for example, the higher the ranking, the larger the weight value. Alternatively, the weight value may be determined by the number of web pages on the seed site; for example, the more web pages, the larger the weight value. The weight value may also be determined jointly by several factors, such as the site's ranking and the number of web pages on the site. The larger the weight value of a seed site, the more authoritative the site. Using web pages of labeled type from the more authoritative seed sites to verify the validity of the web page classification model, as this embodiment does, helps ensure the reliability of the verification result.
In view of the above factor, namely the program that classifies input web pages using the web page classification model, in another optional implementation, on the basis of the method provided by the embodiments shown in Fig. 1a to Fig. 3, if the web page classification model is determined to be valid in those method embodiments, the method may further include: determining whether the program that classifies input web pages using the web page classification model has changed; when the program has changed, classifying web pages with labeled types according to the web page classification model; if the classification results are consistent with the labeled types of the web pages, determining that the web page classification model is valid; and if the classification results are inconsistent with the labeled types of the web pages, determining that the web page classification model has failed. Specifically, the verification device may extract web pages with labeled types from seed sites whose weight value is greater than a preset weight threshold, classify these labeled web pages using the web page classification model, and compare the classification results with the labeled types. If the classification results are consistent with the labeled types, the web page classification model is still valid; if they are inconsistent, the web page classification model has failed.
It should be noted here that, in this implementation as well, the weight value of a seed site may be determined by the site ranking of the seed site (the higher the ranking, the larger the weight value), by the number of offline pages of the seed site (the more web pages, the larger the weight value), or jointly by multiple factors such as the site ranking and the number of offline pages. The larger the weight value of a seed site, the more authoritative the seed site. Selecting web pages with labeled types from the more authoritative seed sites to verify the validity of the web page classification model helps ensure the reliability of the verification result.
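The comparison of classification results with labeled types can be sketched as below. This is a minimal example under stated assumptions: `classify` is a hypothetical callable standing in for the web page classification model, and "consistent" is interpreted strictly, meaning every prediction must equal its label; the description does not state whether a tolerance is allowed.

```python
from typing import Callable, Iterable, Tuple


def model_valid_on_labeled_pages(
    classify: Callable[[str], str],
    labeled_pages: Iterable[Tuple[str, str]],
) -> bool:
    """Return True only if every predicted type equals the labeled type.

    labeled_pages yields (page_content, labeled_type) pairs. An empty
    input yields True, so callers should make sure pages were extracted.
    """
    return all(classify(page) == label for page, label in labeled_pages)
```

A deployment that tolerates a small number of mismatches could instead compare an accuracy figure against a threshold; that variant is likewise an assumption and is not taken from the description.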
It is worth noting that, in the above embodiments, if the web page classification model is determined to have failed, the model needs to be updated, for example by retraining it. The retraining process may follow the prior-art process for obtaining a web page classification model.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is presented as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
Fig. 4 is a schematic structural diagram of a device for verifying a web page classification model provided by an embodiment of the present invention. As shown in Fig. 4, the device includes a crawling module 41, an obtaining module 42, and a first determining module 43.
The crawling module 41 is configured to perform crawl processing on the seed sites to which the web page classification model to be verified applies.
The obtaining module 42 is connected to the crawling module 41 and is configured to classify, according to the web page classification model, the web pages crawled by the crawling module 41, and to obtain the positive example ratio corresponding to the crawl processing. The positive example ratio corresponding to the crawl processing is the ratio of the number of crawled web pages that are classified as the type to which the web page classification model belongs to the total number of web pages crawled in that crawl processing.
The first determining module 43 is connected to the obtaining module 42 and is configured to determine, according to the positive example ratio obtained by the obtaining module 42, whether the web page classification model has failed.
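The positive example ratio defined above can be written as a short function. The names `positive_ratio`, `classify`, and `model_type` are illustrative assumptions; `classify` stands in for the web page classification model.

```python
from typing import Callable, Iterable


def positive_ratio(
    pages: Iterable[str],
    classify: Callable[[str], str],
    model_type: str,
) -> float:
    """Share of crawled pages that the model assigns to its own type."""
    pages = list(pages)
    if not pages:
        raise ValueError("the crawl returned no pages, so the ratio is undefined")
    # Count the pages classified as the type the model belongs to.
    positives = sum(1 for page in pages if classify(page) == model_type)
    return positives / len(pages)
```

For instance, if two of four crawled pages are classified as the model's type, the function returns 0.5.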
In an optional implementation, the crawling module 41 may be specifically configured to perform a first crawl processing on the seed sites at a first time and a second crawl processing on the seed sites at a second time. Correspondingly, the obtaining module 42 may be specifically configured to classify, according to the web page classification model, the web pages crawled by the crawling module 41 in the first crawl processing to obtain a first positive example ratio, and to classify, according to the web page classification model, the web pages crawled by the crawling module 41 in the second crawl processing to obtain a second positive example ratio. Correspondingly, the first determining module 43 may be specifically configured to compare the difference between the first positive example ratio and the second positive example ratio obtained by the obtaining module 42 with a preset ratio threshold; if the difference is greater than the ratio threshold, the web page classification model is determined to have failed, and if the difference is less than or equal to the ratio threshold, the web page classification model is determined to be valid.
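The threshold comparison performed by the first determining module can be sketched as follows. Taking the absolute value of the difference is an assumption of this example; the description speaks only of "the difference" between the two ratios.

```python
def model_failed(first_ratio: float, second_ratio: float, ratio_threshold: float) -> bool:
    """True if the two positive example ratios differ by more than the threshold."""
    # Assumption: the magnitude of the change matters, not its direction.
    return abs(first_ratio - second_ratio) > ratio_threshold
```

A difference exactly equal to the threshold is treated as valid, matching the "less than or equal to" condition in the text.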
In another optional implementation, the crawling module 41 may be specifically configured to perform crawl processing on the seed sites periodically. Correspondingly, the obtaining module 42 may be specifically configured to classify, according to the web page classification model, the web pages crawled by the crawling module 41 in the current crawl processing, and to obtain the positive example ratio corresponding to the current crawl processing. Correspondingly, the first determining module 43 may be specifically configured to compare the difference between the positive example ratio corresponding to the current crawl processing and the positive example ratio corresponding to the previous crawl processing with a preset ratio threshold; if the difference is greater than the ratio threshold, the web page classification model is determined to have failed, and if the difference is less than or equal to the ratio threshold, the web page classification model is determined to be valid.
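For the periodic variant, a small stateful helper can keep the previous ratio between crawls. The class name `PeriodicVerifier`, the use of the absolute difference, and the `None` result for the first crawl (when no previous ratio exists) are assumptions made for this sketch.

```python
from typing import Optional


class PeriodicVerifier:
    """Compares each crawl's positive example ratio with the previous one."""

    def __init__(self, ratio_threshold: float) -> None:
        self.ratio_threshold = ratio_threshold
        self.previous_ratio: Optional[float] = None

    def update(self, current_ratio: float) -> Optional[bool]:
        """Return True if the model is judged failed, False if valid,
        or None on the first crawl, when there is nothing to compare."""
        previous = self.previous_ratio
        self.previous_ratio = current_ratio
        if previous is None:
            return None
        return abs(current_ratio - previous) > self.ratio_threshold
```

Each scheduled crawl would call `update` once with the ratio it measured.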
In yet another optional implementation, the crawling module 41 may be specifically configured to determine, from the seed sites to which the web page classification model applies, the seed sites whose weight value is greater than a preset weight threshold, and to perform crawl processing on the seed sites so determined.
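Selecting the seed sites to crawl then reduces to a filter over the weight values. The mapping from site to weight and the function name `select_seed_sites` are assumptions for illustration.

```python
from typing import Dict, List


def select_seed_sites(weights: Dict[str, float], weight_threshold: float) -> List[str]:
    """Keep only the seed sites whose weight is strictly above the threshold."""
    return [site for site, weight in weights.items() if weight > weight_threshold]
```

The comparison is strict because the text requires a weight value greater than the preset threshold.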
Further, as shown in Fig. 5, the device also includes a second determining module 51.
The second determining module 51 is connected to the crawling module 41 and is configured to determine, before the crawling module 41 performs crawl processing on the seed sites to which the web page classification model to be verified applies, that the program that classifies input web pages using the web page classification model has not changed; or, before the crawling module 41 performs that crawl processing and when it is determined that the program has changed, to determine that the results of classifying web pages with labeled types using the web page classification model are consistent with the labeled types of those web pages.
Further, as shown in Fig. 5, the device also includes a third determining module 52 and a fourth determining module 53.
The third determining module 52 is connected to the first determining module 43 and is configured to determine, after the first determining module 43 determines that the web page classification model is valid, whether the program that classifies input web pages using the web page classification model has changed.
The fourth determining module 53 is connected to the third determining module 52 and is configured to, when the third determining module 52 determines that the program that classifies input web pages using the web page classification model has changed, classify web pages with labeled types according to the web page classification model, and to determine that the web page classification model is valid when the classification results are consistent with the labeled types of the web pages, or that the web page classification model has failed when the classification results are inconsistent with the labeled types of the web pages.
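The description does not say how a change to the classification program is detected. One possible approach, assumed here purely for illustration, is to compare a content hash of the program against a previously stored one; the function names below are hypothetical.

```python
import hashlib


def program_fingerprint(program_bytes: bytes) -> str:
    """SHA-256 digest of the program's contents, as a hex string."""
    return hashlib.sha256(program_bytes).hexdigest()


def program_changed(previous_fingerprint: str, program_bytes: bytes) -> bool:
    """True if the program's current contents differ from the stored fingerprint."""
    return program_fingerprint(program_bytes) != previous_fingerprint
```

Only when such a check reports a change would the labeled web pages need to be classified again.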
Further, as shown in Fig. 5, the device also includes an extraction module 54.
The extraction module 54 is connected to the fourth determining module 53 and is configured to extract the web pages with labeled types from seed sites whose weight value is greater than the preset weight threshold, so as to provide those web pages to the fourth determining module 53.
The device for verifying a web page classification model provided by this embodiment performs crawl processing on the seed sites to which the web page classification model applies, classifies the crawled web pages using the web page classification model, obtains the positive example ratio corresponding to the crawl processing, and determines, according to that positive example ratio, whether the web page classification model has failed. The positive example ratio corresponding to the crawl processing is the ratio of the number of crawled web pages classified as the type to which the web page classification model belongs to the total number of web pages crawled. The device provided by this embodiment can therefore use the positive example ratio to characterize the accuracy of the model's classification results for different web pages, and can determine from the accuracy of the model's classification results for different web pages on the seed sites whether the model has failed, thereby verifying the validity of the web page classification model.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for verifying a web page classification model, characterized by comprising:
performing crawl processing at least twice on seed sites to which the web page classification model to be verified applies, classifying the web pages crawled in each crawl processing according to the web page classification model, and obtaining a positive example ratio corresponding to each crawl processing, wherein the positive example ratio corresponding to each crawl processing is the ratio of the number of web pages, among those crawled in that crawl processing, that are classified as the type to which the web page classification model belongs to the total number of web pages crawled in that crawl processing; and
determining, according to the difference between the positive example ratios corresponding to the at least two crawl processings, whether the web page classification model has failed.
2. The method according to claim 1, characterized in that performing crawl processing at least twice on the seed sites to which the web page classification model to be verified applies, classifying the web pages crawled in each crawl processing according to the web page classification model, and obtaining the positive example ratio corresponding to each crawl processing comprises:
performing a first crawl processing on the seed sites at a first time, and classifying the web pages crawled in the first crawl processing according to the web page classification model to obtain a first positive example ratio; and
performing a second crawl processing on the seed sites at a second time, and classifying the web pages crawled in the second crawl processing according to the web page classification model to obtain a second positive example ratio;
and in that determining, according to the difference between the positive example ratios corresponding to the at least two crawl processings, whether the web page classification model has failed comprises:
comparing the difference between the first positive example ratio and the second positive example ratio with a preset ratio threshold;
if the difference is greater than the ratio threshold, determining that the web page classification model has failed; and
if the difference is less than or equal to the ratio threshold, determining that the web page classification model is valid.
3. The method according to claim 1, characterized in that performing crawl processing at least twice on the seed sites to which the web page classification model to be verified applies, classifying the web pages crawled in each crawl processing according to the web page classification model, and obtaining the positive example ratio corresponding to each crawl processing comprises:
periodically performing crawl processing on the seed sites, classifying the web pages crawled in the current crawl processing according to the web page classification model, and obtaining the positive example ratio corresponding to the current crawl processing;
and in that determining, according to the difference between the positive example ratios corresponding to the at least two crawl processings, whether the web page classification model has failed comprises:
comparing the difference between the positive example ratio corresponding to the current crawl processing and the positive example ratio corresponding to the previous crawl processing with a preset ratio threshold;
if the difference is greater than the ratio threshold, determining that the web page classification model has failed; and
if the difference is less than or equal to the ratio threshold, determining that the web page classification model is valid.
4. The method according to claim 1, 2, or 3, characterized in that, before performing crawl processing at least twice on the seed sites to which the web page classification model to be verified applies, the method further comprises:
determining that the program that classifies input web pages using the web page classification model has not changed; or
when it is determined that the program that classifies input web pages using the web page classification model has changed, determining that the results of classifying web pages with labeled types using the web page classification model are consistent with the labeled types of those web pages.
5. The method according to claim 1, 2, or 3, characterized by further comprising:
after determining that the web page classification model is valid, determining whether the program that classifies input web pages using the web page classification model has changed;
when it is determined that the program that classifies input web pages using the web page classification model has changed, classifying web pages with labeled types according to the web page classification model;
if the classification results are consistent with the labeled types of the web pages, determining that the web page classification model is valid; and
if the classification results are inconsistent with the labeled types of the web pages, determining that the web page classification model has failed.
6. A device for verifying a web page classification model, characterized by comprising:
a crawling module, configured to perform crawl processing at least twice on seed sites to which the web page classification model to be verified applies;
an obtaining module, configured to classify the web pages crawled in each crawl processing according to the web page classification model and to obtain a positive example ratio corresponding to each crawl processing, wherein the positive example ratio corresponding to each crawl processing is the ratio of the number of web pages, among those crawled in that crawl processing, that are classified as the type to which the web page classification model belongs to the total number of web pages crawled in that crawl processing; and
a first determining module, configured to determine, according to the difference between the positive example ratios corresponding to the at least two crawl processings, whether the web page classification model has failed.
7. The device according to claim 6, characterized in that the crawling module is specifically configured to perform a first crawl processing on the seed sites at a first time and a second crawl processing on the seed sites at a second time;
the obtaining module is specifically configured to classify the web pages crawled in the first crawl processing according to the web page classification model to obtain a first positive example ratio, and to classify the web pages crawled in the second crawl processing according to the web page classification model to obtain a second positive example ratio; and
the first determining module is specifically configured to compare the difference between the first positive example ratio and the second positive example ratio with a preset ratio threshold, to determine that the web page classification model has failed if the difference is greater than the ratio threshold, and to determine that the web page classification model is valid if the difference is less than or equal to the ratio threshold.
8. The device according to claim 6, characterized in that the crawling module is specifically configured to perform crawl processing on the seed sites periodically;
the obtaining module is specifically configured to classify the web pages crawled in the current crawl processing according to the web page classification model and to obtain the positive example ratio corresponding to the current crawl processing; and
the first determining module is specifically configured to compare the difference between the positive example ratio corresponding to the current crawl processing and the positive example ratio corresponding to the previous crawl processing with a preset ratio threshold, to determine that the web page classification model has failed if the difference is greater than the ratio threshold, and to determine that the web page classification model is valid if the difference is less than or equal to the ratio threshold.
9. The device according to claim 6, 7, or 8, characterized by further comprising:
a second determining module, configured to determine that the program that classifies input web pages using the web page classification model has not changed; or, when it is determined that the program that classifies input web pages using the web page classification model has changed, to determine that the results of classifying web pages with labeled types using the web page classification model are consistent with the labeled types of those web pages.
10. The device according to claim 6, 7, or 8, characterized by further comprising:
a third determining module, configured to determine, after the first determining module determines that the web page classification model is valid, whether the program that classifies input web pages using the web page classification model has changed; and
a fourth determining module, configured to, when the third determining module determines that the program that classifies input web pages using the web page classification model has changed, classify web pages with labeled types according to the web page classification model, and to determine that the web page classification model is valid when the classification results are consistent with the labeled types of the web pages, or that the web page classification model has failed when the classification results are inconsistent with the labeled types of the web pages.
CN201410411722.6A 2014-08-20 2014-08-20 Verify the method and device of Web page classifying model Active CN105447018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410411722.6A CN105447018B (en) 2014-08-20 2014-08-20 Verify the method and device of Web page classifying model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410411722.6A CN105447018B (en) 2014-08-20 2014-08-20 Verify the method and device of Web page classifying model

Publications (2)

Publication Number Publication Date
CN105447018A CN105447018A (en) 2016-03-30
CN105447018B true CN105447018B (en) 2019-06-28

Family

ID=55557213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410411722.6A Active CN105447018B (en) 2014-08-20 2014-08-20 Verify the method and device of Web page classifying model

Country Status (1)

Country Link
CN (1) CN105447018B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106790593B (en) * 2016-12-28 2020-11-06 北京奇虎科技有限公司 Page processing method and device
CN108133027A (en) * 2017-12-28 2018-06-08 中译语通科技(青岛)有限公司 A kind of machine automatic classification method based on web crawlers

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320393A (en) * 2008-07-23 2008-12-10 腾讯科技(深圳)有限公司 Web page classifying indication method and system
CN102033965A (en) * 2011-01-17 2011-04-27 安徽海汇金融投资集团有限公司 Method and system for classifying data based on classification model
CN103744981A (en) * 2014-01-14 2014-04-23 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070094267A1 (en) * 2005-10-20 2007-04-26 Glogood Inc. Method and system for website navigation

Also Published As

Publication number Publication date
CN105447018A (en) 2016-03-30

Similar Documents

Publication Publication Date Title
CN104486461B (en) Domain name classification method and device, domain name recognition methods and system
CN105224959B (en) The training method and device of order models
CN104077402B (en) Data processing method and data handling system
CN105653701B (en) Model generating method and device, word assign power method and device
CN102339296A (en) Method and device for sorting query results
CN104796300B (en) A kind of packet feature extracting method and device
CN104699837B (en) Method, device and server for selecting illustrated pictures of web pages
CN109977327A (en) A kind of Web page classification method and device
US20200394448A1 (en) Methods for more effectively moderating one or more images and devices thereof
US9002832B1 (en) Classifying sites as low quality sites
CN104484449B (en) The context extraction method and device of Webpage
CN109324960A (en) Automatic test approach and terminal device based on big data analysis
CN109558983A (en) Network courses dropping rate prediction technique and device
CN105447018B (en) Verify the method and device of Web page classifying model
CN109871770A (en) Property ownership certificate recognition methods, device, equipment and storage medium
CN103455491B (en) To the method and device of query word classification
CN106599291B (en) Data grouping method and device
CN106874340A (en) A kind of web page address sorting technique and device
CN103617262A (en) Picture content attribute identification method and system
CN103902447A (en) Distributed system testing method and device
CN103617261A (en) Picture content attribute identification method and system
CN105608183B (en) A kind of method and apparatus that polymeric type is provided and is answered
CN102929948B (en) list page identification system and method
CN103631832B (en) Business object sort method, business object searching method and relevant apparatus
CN110516258A (en) Data verification method and device, storage medium, electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211123

Address after: No. 699, Wangshang Road, Binjiang District, Hangzhou, Zhejiang

Patentee after: Alibaba (China) Network Technology Co.,Ltd.

Address before: Box 847, four, Grand Cayman capital, Cayman Islands, UK

Patentee before: ALIBABA GROUP HOLDING Ltd.