CN102955810B - Web page classification method and device - Google Patents

Web page classification method and device

Info

Publication number
CN102955810B
Authority
CN
China
Prior art keywords
url
class library
classification
prediction
last layer
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110249270.2A
Other languages
Chinese (zh)
Other versions
CN102955810A (en)
Inventor
徐萌
何洪凌
胡珉
罗治国
孙少陵
陶涛
陈婷
张新访
李成华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chellona Mobile Communications Corp Cmcc
China Mobile Communications Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Application filed by China Mobile Communications Group Co Ltd
Priority to CN201110249270.2A
Publication of CN102955810A
Application granted
Publication of CN102955810B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Web page classification method and device. The method uses the records in an existing URL classification library to create virtual upper-level URLs and to predict a category for each of them. When a web page needs to be classified, the URL classification library is queried with the URL of that web page; if no matching URL is found, the library is queried with the upper-level URL of that URL, and when a matching URL is found, the category of the web page is determined from the predicted category of the matched URL. The invention improves the efficiency and success rate of Web page classification.

Description

A Web page classification method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a Web page classification method and device.
Background
With the rapid development of mobile Internet technology, the number of mobile Internet users keeps growing, and behavior analysis of mobile Internet users has therefore gradually become a research focus.
In the prior art, user behavior is usually analyzed from the access logs of mobile Internet users. Specifically, these access logs are stored in a WAP (Wireless Application Protocol) gateway and record the URL (Uniform Resource Locator) of each web page a user accesses. Querying a URL classification library with these URLs yields the categories of the web pages the user accessed, and from these the behavioral preferences of the user.
An existing Web page classification method may include the following steps:
1. A crawler fetches the content of the web page;
2. The web page content is parsed to obtain the corresponding text;
3. The text is analyzed to obtain keywords;
4. The page is classified with an algorithm model, for example a text classification algorithm such as naive Bayes or an SVM model; the algorithm model is usually trained in advance on a training set.
With the above method, the web pages accessed by users (or the URLs corresponding to those web pages) can be classified, and a URL classification library can then be built. A prior-art URL classification library may be as shown in Table 1.
Table 1
In the course of implementing the present invention, the inventors found at least the following problems in the prior art:
In the prior art, the URL classification library is a simple flat data table with no relationship between its entries. To look up the category of a web page accessed by a user accurately, the library must store a large amount of data and must be updated in real time. Because the Internet develops rapidly and new web pages appear extremely quickly, the library cannot hold the categories of all web pages even if it is updated once a day. The available fallback is to crawl and predict in real time, but predicting the category of a single web page may take on the order of tens of minutes; batch prediction can be parallelized, but it still takes a long time, at least on the order of hours.
Summary of the invention
Embodiments of the present invention provide a Web page classification method and device, so as to improve the efficiency and success rate of determining the category of a web page.
To achieve the above object, an embodiment of the present invention provides a Web page classification method applied to a Web page classification process based on a URL classification library. The URL classification library records URLs of each level and the predicted category of each URL, where, of two URLs at adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. The method comprises:
querying the URL classification library with the URL of the web page to be classified;
if no matching URL is found, querying the URL classification library with the upper-level URL of that URL, and, when a matching URL is found, determining the category of the web page to be classified from the predicted category of the matched URL.
An embodiment of the present invention also provides a Web page classification device applied to a Web page classification process based on a Uniform Resource Locator (URL) classification library. The URL classification library records URLs of each level and the predicted category of each URL, where, of two URLs at adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. The device comprises:
an upper-level URL generation module, configured to generate the upper-level URL of the URL of the web page to be classified;
a query module, configured to query the URL classification library with the URL of the web page to be classified and, if no matching URL is found, to query the URL classification library with the upper-level URL of that URL;
a determination module, configured to determine, when the query module finds a matching URL, the category of the web page to be classified from the predicted category of the matched URL.
Compared with the prior art, the embodiments of the present invention divide URLs into levels, record URLs of each level in the URL classification library, and record the predicted category of each URL alongside it. When the category of a web page needs to be determined, the URL of the web page is obtained and the library is queried for it. If the same URL is not recorded in the library, the category of the web page is determined from the predicted category of the upper-level URL of that URL, which improves the efficiency and success rate of determining the category of a web page.
Brief description of the drawings
Fig. 1 is a schematic flowchart of generating the URL classification library according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the Web page classification method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the Web page classification device according to an embodiment of the present invention.
Detailed description of the embodiments
To address the defects of the prior art, the embodiments of the present invention propose a technical solution for Web page classification. In this solution, URLs are divided into levels by truncation: of two URLs at adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. Records for upper-level URLs, together with their predicted categories, are added to the existing URL classification library (that is, in the embodiments of the present invention the library records each URL, the predicted category of the URL, and the adjacent upper-level URL of the URL). When a web page needs to be classified, the URL classification library is queried with the URL of the web page to be classified. If no matching URL is found, the library is queried with the upper-level URL of that URL, and when a matching URL is found, the category of the web page is determined from the predicted category of the matched URL. In other words, when the URL of the web page to be classified is not recorded in the library, the record for its upper-level URL is looked up and the predicted category of that upper-level URL is used as the predicted category of the web page, which improves the efficiency and success rate of determining the category of a web page.
Dividing URLs into levels by truncation can be implemented as follows:
The URL is divided into levels according to the separator "/". Starting from the end of the URL and moving toward the front, occurrences of "/" are located in turn, and the part of the URL that ends at the "/" reached after a predetermined number of separators counted from the end (for example the first one) is taken as the adjacent upper-level URL of this URL, that is, its parent-level URL (the URL one level up).
For example, for the URL http://3g.sina.com.cn:80/3g/static/sina.gif?t1=1252192802, the URL http://3g.sina.com.cn:80/3g/static/sina.gif?t1=1252192802 itself is the first level, http://3g.sina.com.cn:80/3g/static/ is the second level, and http://3g.sina.com.cn:80/3g/ is the third level. Accordingly, http://3g.sina.com.cn:80/3g/static/ is the parent-level URL of the original URL, and http://3g.sina.com.cn:80/3g/ is the parent-level URL of http://3g.sina.com.cn:80/3g/static/.
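The truncation rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the predetermined number of separators is taken to be 1, the scheme prefix (for example http://) is assumed never to be cut, and the function name parent_url is hypothetical and does not appear in the patent.

```python
from typing import Optional


def parent_url(url: str) -> Optional[str]:
    """Return the URL one level up, or None if `url` is already the top level."""
    # Skip the "scheme://" prefix so that its slashes are never treated as separators.
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    # Ignore a trailing "/" so that a directory URL moves up one level.
    body = url[start:].rstrip("/")
    cut = body.rfind("/")
    if cut == -1:
        return None  # only the host is left: no upper level exists
    # Keep everything up to and including the last remaining "/".
    return url[:start] + body[: cut + 1]


if __name__ == "__main__":
    level = "http://3g.sina.com.cn:80/3g/static/sina.gif?t1=1252192802"
    while level is not None:
        print(level)
        level = parent_url(level)
```

Applied repeatedly to the example URL, the function yields the second and third levels in turn and returns None once only the host remains. A "/" inside a query string would also be treated as a separator, which this sketch does not try to handle.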
It should be understood that, in the technical solution proposed by the embodiments of the present invention, the way of determining the parent-level URL is not limited to the above and may be some other way.
The technical solutions in the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative effort fall within the scope of protection of the present application.
Fig. 1 is a schematic diagram of the process of building the URL classification library proposed by an embodiment of the present invention. For ease of description, the following assumes that the URL information in the URL classification library is stored as a data table in which each URL has one entry. The process may include the following steps:
Step 101: record the entries corresponding to the lowest-level URLs in the URL classification library. The entry for a URL records the URL, the predicted category of the URL, and the parent-level URL of the URL.
Specifically, the URLs of web pages accessed by users over a past period (for example one month) may be used as the lowest-level URLs in the library, with the predicted category of each URL obtained by an existing Web page classification method. Alternatively, the URLs of some well-known websites may be used as seeds, a certain number of URLs obtained by crawling, and the URLs so obtained used as the lowest-level URLs, again with their predicted categories obtained by an existing Web page classification method. After the lowest-level URLs and their predicted categories have been obtained, the parent-level URL of each lowest-level URL is obtained, and the corresponding information (URL, predicted category, parent-level URL) is recorded in the entry for that URL in the URL classification library.
Step 102: select an entry from the URL classification library and obtain the parent-level URL of the URL recorded in that entry.
Specifically, the entries in the URL classification library are traversed and selected in order, and the parent-level URL in the selected entry is obtained.
Step 103: determine whether the URL classification library already stores an entry for this parent-level URL. If so, go to step 102; otherwise, go to step 104.
Specifically, if the URL classification library stores an entry for this parent-level URL, another entry is selected; if it does not, an entry for this parent-level URL needs to be created.
Step 104: determine the predicted category of this parent-level URL and the parent-level URL of this parent-level URL, and record them in the URL classification library.
Specifically, the entries in the URL classification library are traversed to obtain those entries that share this parent-level URL, and the predicted category of the parent-level URL is determined from the predicted categories of the URLs in those entries.
The predicted category of a parent-level URL may be determined as follows:
Obtain from the URL classification library all URLs whose parent-level URL is the URL whose category is to be predicted; count, among the URLs obtained, the number of URLs in each predicted category; and take the predicted category with the largest number of URLs as the predicted category of the URL to be predicted.
For example, take the following four URLs:
http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464, predicted category: history
http://www.chinaweekly.cn/bencandy.php?fid=48&id=5463, predicted category: history
http://www.chinaweekly.cn/bencandy.php?fid=48&id=5344, predicted category: history
http://www.chinaweekly.cn/bencandy.php?fid=49&id=5449, predicted category: news commentary
These four URLs have the same parent-level URL, http://www.chinaweekly.cn/. Of the adjacent lower-level URLs of this upper-level URL, three have the predicted category history and one has the predicted category news commentary, so the predicted category of this upper-level URL is history.
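The majority vote in this example can be sketched in a few lines of Python. The function name majority_category is hypothetical, and the tie-breaking behavior is an assumption, since the text does not say what happens when two categories have the same count.

```python
from collections import Counter
from typing import Iterable


def majority_category(child_categories: Iterable[str]) -> str:
    """Return the category that occurs most often among the child URLs.

    Assumption: on a tie, the category encountered first wins, which is how
    Counter.most_common orders equal counts. An empty input raises IndexError.
    """
    counts = Counter(child_categories)
    category, _count = counts.most_common(1)[0]
    return category


if __name__ == "__main__":
    # Predicted categories of the four child URLs of http://www.chinaweekly.cn/
    children = ["history", "history", "history", "news commentary"]
    print(majority_category(children))  # prints "history"
```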
It should be noted that, in the technical solution provided by the embodiments of the present invention, the URL classification library may also record a prediction probability for each URL. In that case, the entry for a URL in the library comprises the URL, the predicted category of the URL, its prediction probability, and the parent-level URL of the URL. For a lowest-level URL, the predicted category and prediction probability are determined by an existing Web page classification method; for URLs at the other levels, they are determined from the predicted categories and prediction probabilities of the URLs one level below.
Specifically, the predicted category and prediction probability of a parent-level URL may be determined from the predicted categories and prediction probabilities of the URLs one level below it as follows:
Obtain from the URL classification library all URLs whose parent-level URL is the URL whose category is to be predicted, together with their probabilities; for each predicted category, calculate the weighted mean of the prediction probabilities of the URLs in that category; take the predicted category with the highest weighted mean as the predicted category of the URL to be predicted, and take the mean of the prediction probabilities of the URLs in that category as the prediction probability of the URL to be predicted.
Taking the above four URLs again, suppose their prediction probabilities are 80%, 79%, 81% and 80% respectively. Among these four URLs, the weighted mean of the prediction probabilities of the URLs whose predicted category is history is 60% ((80% + 79% + 81%) / (3 + 1)), and the weighted mean for the URLs whose predicted category is news commentary is 20% (80% / (3 + 1)). Therefore, the predicted category of the adjacent upper-level URL of these four URLs is history, and its prediction probability is 60%.
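The arithmetic of this worked example can be reproduced with the short Python sketch below. It rests on one interpretive assumption: the "weighted mean" of a category is taken to be the sum of that category's probabilities divided by the total number of child URLs, as in the (3 + 1) denominator above, and that same value is returned as the probability of the upper-level URL, matching the 60% in the example. The function name aggregate is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate(children: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Combine (category, probability) pairs of child URLs into one prediction."""
    totals: Dict[str, float] = defaultdict(float)
    for category, probability in children:
        totals[category] += probability
    n = len(children)  # total number of child URLs, e.g. 3 + 1 = 4
    # Assumed reading of "weighted mean": per-category sum over all children.
    scores = {category: total / n for category, total in totals.items()}
    best = max(scores, key=lambda category: scores[category])
    return best, scores[best]


if __name__ == "__main__":
    children = [
        ("history", 0.80),
        ("history", 0.79),
        ("history", 0.81),
        ("news commentary", 0.80),
    ]
    category, probability = aggregate(children)
    print(category, round(probability, 2))
```

With the four example URLs, the sketch selects history with a score of about 0.60, while news commentary scores about 0.20. Dividing by the total count rather than by the per-category count means that a category backed by many children outweighs a single confident outlier.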
The above process may be implemented by a computer program; the URL classification library may also be configured manually according to the same principles.
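As a rough illustration of such a program, the Python sketch below builds the hierarchy from a set of already classified lowest-level URLs using the majority vote without probabilities. It deviates from steps 102 to 104 in one respect, which is an assumption of this sketch: entries are processed deepest level first, so that every new upper-level entry sees all of its children when the vote is taken. All function names are hypothetical, the field names url_label and faurlevel follow Table 2, and the sports URL in the usage example is invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def parent_url(url: str) -> Optional[str]:
    """Truncate at the last "/" (hypothetical helper); None at the top level."""
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    body = url[start:].rstrip("/")
    cut = body.rfind("/")
    return None if cut == -1 else url[:start] + body[: cut + 1]


def depth(url: str) -> int:
    """Number of "/" separators in the URL, ignoring a trailing one."""
    return url.rstrip("/").count("/")


def build_library(leaves: Dict[str, str]) -> Dict[str, dict]:
    """Map every URL (lowest-level and generated) to its entry."""
    # Step 101: one entry per lowest-level URL.
    library = {
        url: {"url_label": label, "faurlevel": parent_url(url)}
        for url, label in leaves.items()
    }
    level = max((depth(url) for url in library), default=0)
    while level > 0:
        # Step 102: collect the parents of all entries at this level.
        labels_by_parent: Dict[str, List[str]] = defaultdict(list)
        for url, entry in library.items():
            if depth(url) == level and entry["faurlevel"] is not None:
                labels_by_parent[entry["faurlevel"]].append(entry["url_label"])
        for parent, labels in labels_by_parent.items():
            if parent in library:  # Step 103: the entry already exists.
                continue
            # Step 104: majority vote over the children, then record the entry.
            label = Counter(labels).most_common(1)[0][0]
            library[parent] = {"url_label": label, "faurlevel": parent_url(parent)}
        level -= 1
    return library


if __name__ == "__main__":
    leaves = {
        "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464": "history",
        "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5463": "history",
        "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5344": "history",
        "http://www.chinaweekly.cn/bencandy.php?fid=49&id=5449": "news commentary",
        "http://sports.sina.com.cn/k/2011-05-17/1.shtml": "sports",  # invented URL
    }
    for url, entry in build_library(leaves).items():
        print(url, entry)
```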
It should be understood that, in the technical solution of the embodiments of the present invention, when the URL of the web page to be classified is not recorded in the URL classification library, the category is not necessarily determined by querying the parent-level URLs of this URL level by level; it may also be determined by directly querying the predicted category of the parent-level URL of its parent-level URL, or of some other upper-level URL of this URL. In addition, the method of determining the predicted category of a parent-level URL is not limited to that described in the above process and may be some other method.
Through the above process, the upper-level URLs of the URLs recorded in the existing URL classification library can be determined and the entries for these upper-level URLs stored in the library, so that the entries in the library form a hierarchy. The data structure of the URL information in the updated URL classification library may be as shown in Table 2:
Table 2
Name          Description
url           URL
url_label     Predicted category
prediction    Prediction probability
faurlevel     Parent-level URL (the URL one level up)
The fields have the following meanings:
url: the URL of the web page (String, UTF-8)
url_label: the predicted category of the URL (String, UTF-8)
prediction: the prediction probability of the URL (Double)
faurlevel: the parent-level URL, that is, the URL one level up (String, UTF-8)
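One possible in-memory representation of such an entry is sketched below in Python. The field names are taken from Table 2; the class name UrlEntry and the use of None for a URL that has no upper level are assumptions made for illustration. The values come from the chinaweekly example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UrlEntry:
    """One entry of the URL classification library (fields as in Table 2)."""

    url: str                  # URL of the web page
    url_label: str            # predicted category
    prediction: float         # prediction probability, e.g. 0.80 for 80%
    faurlevel: Optional[str]  # URL one level up; None at the top level (assumption)


leaf = UrlEntry(
    url="http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464",
    url_label="history",
    prediction=0.80,
    faurlevel="http://www.chinaweekly.cn/",
)
root = UrlEntry(
    url="http://www.chinaweekly.cn/",
    url_label="history",
    prediction=0.60,
    faurlevel=None,
)
```

Because every entry stores its own faurlevel, the flat table doubles as a tree: following faurlevel from any entry walks toward the top level.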
Based on the above URL classification library, an embodiment of the present invention provides a Web page classification method. Fig. 2 is a schematic flowchart of this method, which may include the following steps:
Step 201: obtain the URL of the web page to be classified and query whether this URL is recorded in the URL classification library.
Step 202: if the query finds that the same URL is recorded in the URL classification library, go to step 204; otherwise, go to step 203.
Step 203: generate the parent-level URL of this URL, query whether this parent-level URL is recorded in the URL classification library, and go to step 202.
Step 204: take the predicted category of the matched URL as the category of the web page to be classified.
Specifically, in the prior-art scheme an exact-match query is performed directly in the URL classification library with the URL: if a corresponding entry is found, the predicted category of the URL is returned; if not, a null value is returned.
In the technical solution provided by the embodiments of the present invention, by contrast, URLs are divided into levels and the entries for upper-level URLs are stored in the URL classification library. When a web page needs to be classified, an exact match is first attempted in the library with the URL of the web page to be classified. If the library stores no entry for that URL, the parent-level URL of the URL is generated, the corresponding entry is looked up in the library with this parent-level URL, and the predicted category of the parent-level URL found is used as the predicted category of the URL of the web page to be classified.
For example, suppose the URL of the web page to be classified is http://sports.sina.com.cn/k/2011-05-18/09415581512.shtml and this URL is not recorded in the current URL classification library. The parent-level URL of this URL, http://sports.sina.com.cn/k/2011-05-18/, is then generated and the entry for it is looked up in the library. If the library stores an entry for this parent-level URL, its predicted category (for example sports) is obtained by querying the library and is used as the predicted category of the URL of the web page to be classified.
It should be noted that if the search reaches the highest-level URL corresponding to the URL of the web page to be classified and the same URL is still not found in the URL classification library, a query-failure response is returned.
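Steps 201 to 204, including the failure case, can be sketched in Python as below. The library is modeled as a plain dictionary from URL to predicted category, the query-failure response is modeled as None, and the names classify and parent_url are hypothetical; none of these choices is prescribed by the patent.

```python
from typing import Dict, Optional


def parent_url(url: str) -> Optional[str]:
    """Truncate at the last "/" (hypothetical helper); None at the top level."""
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    body = url[start:].rstrip("/")
    cut = body.rfind("/")
    return None if cut == -1 else url[:start] + body[: cut + 1]


def classify(url: str, library: Dict[str, str]) -> Optional[str]:
    """Return the predicted category for `url`, or None if the lookup fails."""
    current: Optional[str] = url         # step 201: start with the page URL
    while current is not None:
        if current in library:           # step 202: exact match found?
            return library[current]      # step 204: use its predicted category
        current = parent_url(current)    # step 203: move one level up
    return None                          # top level reached without a match


if __name__ == "__main__":
    library = {"http://sports.sina.com.cn/k/2011-05-18/": "sports"}
    page = "http://sports.sina.com.cn/k/2011-05-18/09415581512.shtml"
    print(classify(page, library))  # prints "sports"
```

Each lookup costs at most one dictionary access per level of the URL, so the fallback stays cheap compared with crawling and classifying the page in real time.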
In the embodiments of the present invention, when new lowest-level URLs are added to the URL classification library, an update of the classifications in the library may be started by an event trigger or manually. Specifically, the lowest-level URLs stored in the library may be traversed again and divided into levels, so that the corresponding upper-level URLs and their predicted categories are obtained anew. Alternatively, only the predicted categories of the upper-level URLs related to the newly added lowest-level URLs may be updated. The specific implementation is not described further here.
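Since the text leaves the implementation open, the following Python sketch shows only one conceivable form of the second option, in which just the upper-level URLs related to a newly added URL are updated. It assumes the majority vote without probabilities, recomputes each ancestor from its direct children, and overwrites an ancestor's category even if that ancestor was originally classified directly; the names add_url and parent_url are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def parent_url(url: str) -> Optional[str]:
    """Truncate at the last "/" (hypothetical helper); None at the top level."""
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    body = url[start:].rstrip("/")
    cut = body.rfind("/")
    return None if cut == -1 else url[:start] + body[: cut + 1]


def add_url(library: Dict[str, dict], url: str, category: str) -> None:
    """Record a new lowest-level URL and refresh only its ancestors."""
    library[url] = {"url_label": category, "faurlevel": parent_url(url)}
    ancestor = library[url]["faurlevel"]
    while ancestor is not None:
        # Re-run the majority vote over the direct children of this ancestor.
        # A linear scan keeps the sketch short; a real system would index children.
        votes = Counter(
            entry["url_label"]
            for entry in library.values()
            if entry["faurlevel"] == ancestor
        )
        library[ancestor] = {
            "url_label": votes.most_common(1)[0][0],
            "faurlevel": parent_url(ancestor),
        }
        ancestor = library[ancestor]["faurlevel"]


if __name__ == "__main__":
    library: Dict[str, dict] = {}
    add_url(library, "http://www.chinaweekly.cn/bencandy.php?fid=49&id=5449", "news commentary")
    add_url(library, "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464", "history")
    add_url(library, "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5463", "history")
    print(library["http://www.chinaweekly.cn/"]["url_label"])  # prints "history"
```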
Based on the same technical concept as the above Web page classification method, an embodiment of the present invention also provides a Web page classification device, which can be applied to the above Web page classification method based on a URL classification library. The URL classification library records URLs of each level, where, of two URLs at adjacent levels, the upper-level URL is obtained by truncating the lower-level URL, and a predicted category is recorded for each URL.
Fig. 3 is a schematic structural diagram of the Web page classification device provided by an embodiment of the present invention. The device may include:
an upper-level URL generation module 31, configured to generate the upper-level URL of the URL of the web page to be classified;
a query module 32, configured to query the URL classification library with the URL of the web page to be classified and, if no matching URL is found, to query the URL classification library with the upper-level URL of that URL;
a determination module 33, configured to determine, when the query module 32 finds a matching URL, the category of the web page to be classified from the predicted category of the matched URL.
The upper-level URL generation module 31 is specifically configured to generate the parent-level URL of the URL when the query module 32 finds no matching URL.
The query module 32 queries the predicted category of the upper-level URL of the URL of the web page to be classified through the following process:
Step A: obtain the parent-level URL of the URL and query whether this parent-level URL is recorded in the URL classification library;
Step B: if the same URL is found in the URL classification library, go to step C; otherwise, go to step A;
Step C: obtain the predicted category of the matched URL.
The determination module 33 is specifically configured to take the predicted category of the URL found by the query module 32 as the category of the web page to be classified.
The determination module 33 is further configured to return a query-failure response when the query module 32 has queried up to the highest-level URL corresponding to the URL of the web page to be classified and still has not found the same URL in the URL classification library.
The Web page classification device further comprises a URL classification library maintenance module 34.
The upper-level URL generation module 31 is specifically configured to traverse the URLs in the URL classification library and, on reaching a URL, to select it from the library and generate its parent-level URL.
The query module 32 is specifically configured to query the URL classification library with the parent-level URL generated by the upper-level URL generation module 31.
The URL classification library maintenance module 34 is configured to determine, when the query module 32 finds no matching URL, the predicted category of this parent-level URL and to record the parent-level URL and its predicted category in the URL classification library.
The URL classification library maintenance module 34 is specifically configured to determine the predicted category of each URL at a level other than the lowest from the predicted categories of the URLs one level below it.
The URL classification library maintenance module 34 is specifically configured to obtain from the URL classification library all URLs whose parent-level URL is the URL whose category is to be predicted; to count, among the URLs obtained, the number of URLs in each predicted category; and to take the predicted category with the largest number of URLs as the predicted category of the URL to be predicted.
Each URL in the URL classification library may also have a corresponding prediction probability.
The URL classification library maintenance module 34 is specifically configured to determine the predicted category and prediction probability of each URL at a level other than the lowest from the predicted categories and prediction probabilities of the URLs one level below it.
The URL classification library maintenance module 34 is specifically configured to obtain from the URL classification library all URLs whose parent-level URL is the URL whose category is to be predicted, together with their probabilities; to calculate, for each predicted category, the weighted mean of the prediction probabilities of the URLs in that category; and to take the predicted category with the highest weighted mean as the predicted category of the URL to be predicted and the mean of the prediction probabilities of the URLs in that category as the prediction probability of the URL to be predicted.
When a new URL is added to the URL classification library:
the upper-level URL generation module 31 is further configured to generate the upper-level URL of this URL;
the query module 32 is specifically configured to query the URL classification library with the upper-level URL of the URL;
the URL classification library maintenance module 34 is specifically configured to update the predicted category of the upper-level URL if the query module 32 finds a matching URL, and to record this upper-level URL and its predicted category in the URL classification library if the query module 32 finds no matching URL.
The upper-level URL generation module 31 is specifically configured to divide the URL into levels according to the separators in the URL and to take the part of the URL that ends at the separator reached after a predetermined number of separators counted from the end as the parent-level URL of the URL.
From the description of the above embodiments, a person skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
A person skilled in the art will understand that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the modules or processes in the drawings are not necessarily required for implementing the present invention.
A person skilled in the art will understand that the modules of the device in an embodiment may be distributed in the device as described in the embodiment, or may be changed accordingly and located in one or more devices different from that of the embodiment. The modules of the above embodiments may be combined into one module or further split into several sub-modules.
The sequence numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
The above are only a few specific embodiments of the present invention, but the present invention is not limited to them; any variation that a person skilled in the art can conceive of shall fall within the scope of protection of the present invention.

Claims (14)

1. a Web page classification method, it is characterized in that, be applied to the Web page classifying flow process realized based on uniform resource position mark URL class library, the prediction classification of each level URL and each URL is recorded in described URL class library, wherein, upper strata URL in the URL of adjacent level intercepts to obtain on the basis of lower floor URL, and the method comprises:
URL according to webpage to be sorted inquires about URL class library;
If do not inquire the URL of coupling, then inquire about URL class library according to the upper strata URL of this URL, and when inquiring the URL of coupling, determine the classification of webpage to be sorted according to the prediction classification of the URL inquired;
Wherein, the generative process of described URL class library, comprising:
Travel through the URL in described URL class library, and when traversing a URL, from described URL class library, select this URL, and generate the last layer level URL of this URL according to the URL selected;
Judge that whether Already in the last layer level URL generated in described URL class library, and when there is not this last layer level URL in described URL class library, determine the prediction classification of this last layer level URL, and this last layer level URL and prediction classification thereof are recorded in described URL class library.
2. The method as claimed in claim 1, characterized in that querying the URL class library according to the upper-level URL of this URL comprises:
Step A: generating the immediate upper-level URL of this URL, and querying whether this immediate upper-level URL is recorded in the URL class library;
Step B: if the query finds that an identical URL is recorded in the URL class library, going to step C; otherwise returning to step A;
Step C: obtaining the predicted classification of the URL found.
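Steps A to C amount to walking up the URL one level at a time until a recorded URL is found. A minimal Python sketch follows, assuming a dictionary-based library and scheme-less URLs separated by `/`; the function names are hypothetical. The claim does not say what happens when no level matches, so returning `None` once no upper level remains is an assumption added here to guarantee termination.

```python
from typing import Dict, Optional


def upper_level_url(url: str) -> Optional[str]:
    """Truncate the last "/"-separated field; None when no upper level is left."""
    head, slash, _ = url.rstrip("/").rpartition("/")
    return head if slash else None


def classify(url: str, library: Dict[str, str]) -> Optional[str]:
    """Return the predicted classification of the URL or of its nearest recorded upper level."""
    if url in library:                      # direct query with the page's own URL
        return library[url]
    current = upper_level_url(url)          # step A: generate the upper-level URL
    while current is not None:
        if current in library:              # step B: is it recorded in the library?
            return library[current]         # step C: take its predicted classification
        current = upper_level_url(current)  # otherwise go back to step A, one level up
    return None                             # assumed outcome when nothing matches
```

With a library that holds only `example.com/sports`, a page at `example.com/sports/nba/2011/game.html` is resolved on the third pass through step A.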
3. The method as claimed in claim 1 or 2, characterized in that, except for URLs of the lowest level, the predicted classification of a URL of any other level is determined according to the predicted classifications of the URLs one level below this URL.
4. The method as claimed in claim 3, characterized in that determining the predicted classification of an upper-level URL according to the predicted classifications of the URLs one level below it specifically comprises:
Obtaining from the URL class library all URLs whose immediate upper-level URL is the URL whose classification is to be predicted;
Determining, among the URLs obtained, the number of URLs in each predicted classification;
Determining the predicted classification with the largest number of URLs as the predicted classification of the URL whose classification is to be predicted.
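The counting rule of claim 4 is a majority vote over the URLs one level below. A minimal Python sketch is given here, assuming a dictionary-based library and scheme-less URLs separated by `/`; `predict_by_majority` and `upper_level_url` are hypothetical names. The claim does not define a tie-break, so this sketch simply keeps whichever classification `Counter.most_common` lists first.

```python
from collections import Counter
from typing import Dict, Optional


def upper_level_url(url: str) -> Optional[str]:
    """Truncate the last "/"-separated field; None when no upper level is left."""
    head, slash, _ = url.rstrip("/").rpartition("/")
    return head if slash else None


def predict_by_majority(target: str, library: Dict[str, str]) -> Optional[str]:
    """Pick the classification held by most URLs directly below `target`."""
    votes = Counter(
        category
        for url, category in library.items()
        if upper_level_url(url) == target   # only URLs exactly one level below
    )
    if not votes:
        return None                         # no lower-level URL is recorded
    return votes.most_common(1)[0][0]
```

Only direct lower-level URLs vote; URLs two or more levels below influence the result indirectly, through the prediction already recorded for the level in between.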
5. The method as claimed in claim 3, characterized in that each URL in the URL class library also has a corresponding prediction probability;
Determining the predicted classification and prediction probability of an upper-level URL according to the predicted classifications and prediction probabilities of the URLs one level below it specifically comprises:
Obtaining from the URL class library all URLs whose immediate upper-level URL is the URL whose classification is to be predicted, together with their prediction probabilities;
For the URLs of each predicted classification, calculating the weighted mean of the prediction probabilities of the URLs in this predicted classification;
Determining the predicted classification with the highest weighted mean as the predicted classification of the URL to be predicted, and determining the mean of the prediction probabilities of the URLs of this predicted classification as the prediction probability of the URL to be predicted.
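The probability-based rule of claim 5 can be sketched as follows. The claim does not state where the weights of the weighted mean come from, so this Python sketch takes them as an optional, hypothetical `weights` mapping (URL to positive weight) and falls back to equal weights. The library layout (URL mapped to a classification and probability pair) and all function names are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, float]  # (predicted classification, prediction probability)


def upper_level_url(url: str) -> Optional[str]:
    """Truncate the last "/"-separated field; None when no upper level is left."""
    head, slash, _ = url.rstrip("/").rpartition("/")
    return head if slash else None


def predict_with_probability(
    target: str,
    library: Dict[str, Entry],
    weights: Optional[Dict[str, float]] = None,  # hypothetical; assumed positive
) -> Optional[Entry]:
    """Choose the classification with the highest weighted mean probability."""
    groups: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for url, (category, prob) in library.items():
        if upper_level_url(url) == target:
            weight = 1.0 if weights is None else weights.get(url, 1.0)
            groups[category].append((prob, weight))
    if not groups:
        return None                              # no lower-level URL is recorded

    def weighted_mean(items: List[Tuple[float, float]]) -> float:
        return sum(p * w for p, w in items) / sum(w for _, w in items)

    best = max(groups, key=lambda c: weighted_mean(groups[c]))
    probs = [p for p, _ in groups[best]]
    # The recorded probability is the plain mean within the winning classification.
    return best, sum(probs) / len(probs)
```

With equal weights the weighted mean and the plain mean coincide; they differ only when a weighting scheme is supplied.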
6. The method as claimed in claim 1, characterized in that, when a new URL is added to the URL class library, the upper-level URL of this URL is generated and the URL class library is queried according to this upper-level URL; if a matching URL is found, the predicted classification of this upper-level URL is updated; if no matching URL is found, this upper-level URL and its corresponding predicted classification are recorded in the URL class library.
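The incremental maintenance of claim 6 can be sketched in Python as below. The sketch assumes a dictionary-based library, scheme-less URLs separated by `/`, and a majority vote over direct lower-level URLs as the prediction rule; all names are hypothetical. Repeating the update for every level above the new URL is also an assumption, since the claim only mentions the upper-level URL.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def upper_level_url(url: str) -> Optional[str]:
    """Truncate the last "/"-separated field; None when no upper level is left."""
    head, slash, _ = url.rstrip("/").rpartition("/")
    return head if slash else None


def majority_category(target: str, library: Dict[str, str]) -> str:
    """Most common classification among the URLs directly below `target`."""
    votes = Counter(c for u, c in library.items() if upper_level_url(u) == target)
    return votes.most_common(1)[0][0]


def add_url(library: Dict[str, str], url: str, category: str) -> List[Tuple[str, str]]:
    """Add a classified URL and refresh or create its upper-level URLs."""
    library[url] = category
    changes: List[Tuple[str, str]] = []
    parent = upper_level_url(url)
    while parent is not None:
        # "updated" when the upper-level URL was already recorded, else "recorded".
        action = "updated" if parent in library else "recorded"
        library[parent] = majority_category(parent, library)
        changes.append((parent, action))
        parent = upper_level_url(parent)
    return changes
```

The returned list shows which branch of the claim applied at each level. The sketch does not distinguish predicted entries from directly classified ones, so a real library would probably need a flag to avoid overwriting the latter.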
7. The method as claimed in claim 1, characterized in that determining the immediate upper-level URL of a URL specifically comprises:
Dividing the URL into levels according to the separators in the URL, and taking the portion of this URL that precedes the separator at a predetermined position, counted backward from the end of the URL, as the immediate upper-level URL of this URL.
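The truncation described in claim 7 can be sketched in Python as follows. The separator `/`, the parameter `n` standing for the predetermined number, the handling of an optional `scheme://` prefix, and the stripping of a trailing separator are all assumptions of this sketch; query strings and fragments are not treated specially.

```python
from typing import Optional


def upper_level_url(url: str, n: int = 1, separator: str = "/") -> Optional[str]:
    """Return the part of `url` before the n-th separator counted from the end."""
    scheme, marker, rest = url.partition("://")
    if not marker:                         # no scheme in the input
        scheme, rest = "", url
    fields = rest.rstrip(separator).split(separator)
    if n < 1 or len(fields) <= n:
        return None                        # not enough levels left to truncate
    return scheme + marker + separator.join(fields[:-n])
```

For example, `upper_level_url("http://news.example.com/sports/nba/game.html")` yields `http://news.example.com/sports/nba`, and the same call with `n=2` yields `http://news.example.com/sports`.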
8. Web page classification equipment, characterized in that it is applied to a web page classification process implemented on the basis of a uniform resource locator (URL) class library, wherein URLs of each level and the predicted classification of each URL are recorded in the URL class library, and, between URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL, the equipment comprising:
An upper-level URL generation module, configured to generate, according to the URL of a web page to be classified, the upper-level URL of this URL;
A query module, configured to query the URL class library according to the URL of the web page to be classified, and, if no matching URL is found, to query the URL class library according to the upper-level URL of this URL;
A determination module, configured to determine, when the query module finds a matching URL, the classification of the web page to be classified according to the predicted classification of the URL found;
Wherein the equipment further comprises a URL class library maintenance module;
The upper-level URL generation module is specifically configured to traverse the URLs in the URL class library, and, each time a URL is reached, to select this URL from the URL class library and generate the immediate upper-level URL of this URL from the selected URL;
The query module is specifically configured to query the URL class library according to the immediate upper-level URL generated by the upper-level URL generation module;
The URL class library maintenance module is configured to determine, when the query module finds no matching URL, the predicted classification of this immediate upper-level URL, and to record this immediate upper-level URL and its predicted classification in the URL class library.
9. The equipment as claimed in claim 8, characterized in that:
The upper-level URL generation module is specifically configured to generate the immediate upper-level URL of this URL when the query module finds no matching URL;
The query module queries the predicted classification of the upper-level URL of the URL of the web page to be classified specifically through the following flow:
Step A: obtaining the immediate upper-level URL of this URL, and querying whether this immediate upper-level URL is recorded in the URL class library;
Step B: if the query finds that an identical URL is recorded in the URL class library, going to step C; otherwise returning to step A;
Step C: obtaining the predicted classification of the URL found;
The determination module is specifically configured to determine the predicted classification of the URL found by the query module as the classification of the web page to be classified.
10. The equipment as claimed in claim 8 or 9, characterized in that the URL class library maintenance module is specifically configured to determine the predicted classification of a URL of any level other than the lowest level according to the predicted classifications of the URLs one level below this URL.
11. The equipment as claimed in claim 10, characterized in that the URL class library maintenance module is specifically configured to: obtain from the URL class library all URLs whose immediate upper-level URL is the URL whose classification is to be predicted; determine, among the URLs obtained, the number of URLs in each predicted classification; and determine the predicted classification with the largest number of URLs as the predicted classification of the URL whose classification is to be predicted.
12. The equipment as claimed in claim 10, characterized in that each URL in the URL class library also has a corresponding prediction probability;
The URL class library maintenance module is specifically configured to: obtain from the URL class library all URLs whose immediate upper-level URL is the URL whose classification is to be predicted, together with their prediction probabilities; for the URLs of each predicted classification, calculate the weighted mean of the prediction probabilities of the URLs in this predicted classification; and determine the predicted classification with the highest weighted mean as the predicted classification of the URL to be predicted, and the mean of the prediction probabilities of the URLs of this predicted classification as the prediction probability of the URL to be predicted.
13. The equipment as claimed in claim 10, wherein, when a new URL is added to the URL class library:
The upper-level URL generation module is further configured to generate the upper-level URL of this URL;
The query module is specifically configured to query the URL class library according to the upper-level URL of this URL;
The URL class library maintenance module is specifically configured to update the predicted classification of the upper-level URL if the query module finds a matching URL, and to record this upper-level URL and its corresponding predicted classification in the URL class library if the query module finds no matching URL.
14. The equipment as claimed in claim 8, characterized in that the upper-level URL generation module is specifically configured to divide the URL into levels according to the separators in the URL, and to take the portion of this URL that precedes the separator at a predetermined position, counted backward from the end of the URL, as the immediate upper-level URL of this URL.
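The division of labor among the modules named in claims 8 and 9 can be pictured with a small Python sketch. It is one possible decomposition under stated assumptions, not the claimed equipment itself: the class names mirror the claim wording but are hypothetical, the library is a plain dictionary, URLs are scheme-less and separated by `/`, and the maintenance module is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class UpperUrlGenerationModule:
    """Generates the immediate upper-level URL by truncating the last field."""

    def generate(self, url: str) -> Optional[str]:
        head, slash, _ = url.rstrip("/").rpartition("/")
        return head if slash else None


@dataclass
class QueryModule:
    """Queries the library, climbing one level each time nothing matches."""

    library: Dict[str, str]
    generator: UpperUrlGenerationModule

    def find_prediction(self, url: str) -> Optional[str]:
        current: Optional[str] = url
        while current is not None:
            if current in self.library:
                return self.library[current]
            current = self.generator.generate(current)  # ask for the next level up
        return None


@dataclass
class DeterminationModule:
    """Turns the prediction found by the query module into the page's classification."""

    query_module: QueryModule
    default: Optional[str] = None  # assumed fallback when nothing matches

    def classify(self, url: str) -> Optional[str]:
        prediction = self.query_module.find_prediction(url)
        return prediction if prediction is not None else self.default
```

Keeping URL generation separate from querying lets the same generator be reused by a library maintenance component, which is how the claims share it between classification and library upkeep.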
CN201110249270.2A 2011-08-26 2011-08-26 A kind of Web page classification method and equipment Active CN102955810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110249270.2A CN102955810B (en) 2011-08-26 2011-08-26 A kind of Web page classification method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110249270.2A CN102955810B (en) 2011-08-26 2011-08-26 A kind of Web page classification method and equipment

Publications (2)

Publication Number Publication Date
CN102955810A CN102955810A (en) 2013-03-06
CN102955810B true CN102955810B (en) 2015-12-02

Family

ID=47764622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110249270.2A Active CN102955810B (en) 2011-08-26 2011-08-26 A kind of Web page classification method and equipment

Country Status (1)

Country Link
CN (1) CN102955810B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776825A (en) * 2016-11-24 2017-05-31 竹间智能科技(上海)有限公司 User preference entity classification method and system based on level mapping

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646119A (en) * 2013-12-26 2014-03-19 北京西塔网络科技股份有限公司 Method and device for generating user behavior record
CN103914534B (en) * 2014-03-31 2017-03-15 郭磊 Content of text sorting technique based on specialist system URL classification knowledge base
CN106294443A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 The URL classification recognition methods in a kind of knowledge based storehouse and system
CN106294442A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 A kind of internet information classifying identification method based on URL and system
CN106528556B (en) * 2015-09-10 2019-07-30 北京国双科技有限公司 The analysis method and device of website visitation data
CN105912736A (en) * 2016-06-28 2016-08-31 迈普通信技术股份有限公司 URL classifying method and device
CN107545020A (en) * 2017-05-10 2018-01-05 新华三信息安全技术有限公司 A kind of determination method and device of Web page classifying
CN110472125B (en) * 2019-08-23 2022-04-01 厦门商集网络科技有限责任公司 Multistage page cascading crawling method and equipment based on web crawler

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1592229B (en) * 2003-08-25 2010-10-06 微软公司 Electronic communications and web pages filtering based on URL
US7565350B2 (en) * 2006-06-19 2009-07-21 Microsoft Corporation Identifying a web page as belonging to a blog
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage
CN101630330A (en) * 2009-08-14 2010-01-20 苏州锐创通信有限责任公司 Method for webpage classification

Also Published As

Publication number Publication date
CN102955810A (en) 2013-03-06

Similar Documents

Publication Publication Date Title
CN102955810B (en) A kind of Web page classification method and equipment
US7779001B2 (en) Web page ranking with hierarchical considerations
US9449271B2 (en) Classifying resources using a deep network
KR101114023B1 (en) Content propagation for enhanced document retrieval
CN101364239B (en) Method for auto constructing classified catalogue and relevant system
US9317613B2 (en) Large scale entity-specific resource classification
US8645369B2 (en) Classifying documents using implicit feedback and query patterns
CN103329151A (en) Recommendations based on topic clusters
RU2720954C1 (en) Search index construction method and system using machine learning algorithm
CN1702654A (en) Method and system for calculating importance of a block within a display page
CN103177090A (en) Topic detection method and device based on big data
US20120233096A1 (en) Optimizing an index of web documents
CN101211368B (en) Method for classifying search term, device and search engine system
Saranya et al. A personalized online news recommendation system
KR100954842B1 (en) Method and System of classifying web page using category tag information and Recording medium using by the same
Wagh et al. Enhanced web personalization for improved browsing experience
Yen The design and evaluation of accessibility on web navigation
CN108446296A (en) A kind of information processing method and device
Wang et al. Crawling ranked deep web data sources
Huang et al. Location-aware query recommendation for search engines at scale
Chen et al. COWES: Web user clustering based on evolutionary web sessions
Lai et al. Question routing by modeling user expertise and activity in cQA services
CN109388649B (en) Land intelligent recommendation method and system
Sulaiman et al. An implementation of rough set in optimizing mobile Web caching performance
Caramia et al. Mining relevant information on the Web: a clique-based approach

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20170223

Address after: No. 78 Kolding Road, High-tech Zone, Suzhou City, Jiangsu Province, 215163

Patentee after: CHINA MOBILE (SUZHOU) SOFTWARE TECHNOLOGY CO., LTD.

Patentee after: China Mobile Communications Co., Ltd.

Patentee after: Chellona Mobile Communications Corporation Cmcc

Address before: No. 29 Finance Street, Xicheng District, Beijing 100032

Patentee before: Chellona Mobile Communications Corporation Cmcc