CN102890717A - System and method for building webpage category knowledge base - Google Patents

Info

Publication number
CN102890717A
CN102890717A CN2012103763814A CN201210376381A CN102890717A CN 102890717 A CN102890717 A CN 102890717A CN 2012103763814 A CN2012103763814 A CN 2012103763814A CN 201210376381 A CN201210376381 A CN 201210376381A CN 102890717 A CN102890717 A CN 102890717A
Authority
CN
China
Prior art keywords
webpage
page
sample
knowledge base
page framework
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012103763814A
Other languages
Chinese (zh)
Other versions
CN102890717B (en)
Inventor
卢宏林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201210376381.4A
Publication of CN102890717A
Application granted
Publication of CN102890717B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a system for building a webpage category knowledge base and belongs to the field of Internet technology. The system comprises: a sample page framework ID (identity) computation module, adapted to extract the page framework of a sample webpage and compute the page framework ID of the sample webpage; a mode accumulation module, adapted to compute the page framework mode of the sample webpage when the accumulated number of page frameworks with the same ID reaches a threshold; and a knowledge base building module, adapted to build the mapping between sample webpage categories and page framework modes so as to generate the webpage category knowledge base. The invention also discloses a method for building the webpage category knowledge base. With the system and method, a knowledge base capable of identifying webpage categories can be built, so that webpage categories are identified quickly. This solves the problem that whole-web search cannot distinguish webpage categories.

Description

System and method for building a webpage category knowledge base
Technical field
The present invention relates to the field of Internet technology, and in particular to a system and method for building a webpage category knowledge base.
Background art
Search technology falls broadly into two classes. The first takes the entire Internet as its object: it crawls all webpages (at present the crawl depth within a site may be limited, JavaScript (js) is generally not processed, and only some dynamic pages are handled) and then processes and analyzes them. This is web search, i.e. whole-web search. The second is vertical search, which crawls, analyzes and processes only a certain class of pages, for example image search, video search, blog search, forum search and news search. Most vertical search is currently based on seeds (also called list pages). Vertical search processing can be divided into two parts: first, finding seeds; second, finding specific product pages from the seed pages, that is, pages of different categories (images, videos, news, etc.), and then processing those product pages.
Existing whole-web search essentially ignores the needs of vertical search: it cannot classify different products, i.e. it cannot distinguish webpage categories, and can only mine some useful information to assist vertical search. As for existing vertical search, its analysis and processing differ from those of web search, and the systems are independent of each other: pages that whole-web search has already downloaded, analyzed and processed are downloaded, analyzed and processed again by vertical search. Resources cannot be shared, and the two cannot be integrated so that vertical search shares the resources of whole-web search. Building a knowledge base that can automatically identify webpage categories is therefore a problem in urgent need of a solution.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a system and method for building a webpage category knowledge base that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a system for building a webpage category knowledge base is provided, comprising:
a sample page framework ID computation module, adapted to extract the page framework of a sample webpage and compute the page framework ID of the sample webpage;
a mode accumulation module, adapted to compute the page framework mode of the sample webpage when the accumulated number of page frameworks with the same ID reaches a threshold;
a knowledge base building module, adapted to build the mapping between the category of the sample webpage and the page framework mode so as to generate the webpage category knowledge base.
Optionally, the knowledge base building module further comprises:
a weight setting module, adapted to assign, according to the categories of the different sample webpages, a preset weight to each webpage feature in the page framework mode of each category;
a mapping table building module, adapted to build a mapping table relating the category of the sample webpage to each webpage feature of that category and its weight, so as to generate the webpage category knowledge base.
Optionally, the page framework ID computation module further comprises: a page framework extraction module, adapted to extract the page framework of the sample webpage according to the HTML tags in the source code of the sample webpage.
Optionally, the page framework ID computation module further comprises: a page framework extraction module, adapted to identify the body text of the sample webpage by its punctuation and remove the body text to obtain the page framework of the sample webpage.
Optionally, the mode accumulation module further comprises:
a pending list page identification module, adapted to judge whether there are links that sit in a fixed-position block of the page and persist for a certain time, and if so, to mark the sample webpage as a pending list page;
a list page framework mode determination module, adapted to schedule the pending list page once at set intervals and, if the links are continually updated to new links, to set the page framework mode of the sample webpage as a list page framework mode.
According to another aspect of the present invention, a method for building a webpage category knowledge base is provided, comprising the following steps:
extracting the page framework of a sample webpage and computing the page framework ID of the sample webpage;
when the accumulated number of page frameworks with the same ID reaches a threshold, computing the page framework mode of the sample webpage;
building the mapping between the category of the sample webpage and the page framework mode so as to generate the webpage category knowledge base.
Optionally, building the mapping between the category of the sample webpage and the page framework mode so as to generate the webpage category knowledge base specifically comprises:
according to the categories of the different sample webpages, assigning a preset weight to each webpage feature in the page framework mode of each category;
building a mapping table relating the category of the sample webpage to each webpage feature of that category and its weight, so as to generate the webpage category knowledge base.
Optionally, the page framework of the sample webpage is extracted according to the HTML tags in the source code of the sample webpage.
Optionally, the page framework of the sample webpage is extracted by identifying the body text of the sample webpage by its punctuation and removing the body text.
Optionally, the list page framework mode is computed as follows:
judging whether there are links that sit in a fixed-position block of the page and persist for a certain time, and if so, marking the sample webpage as a pending list page;
scheduling the pending list page once at set intervals and, if the links are continually updated to new links, setting the page framework mode of the sample webpage as a list page framework mode.
The system and method for building a webpage category knowledge base according to the present invention can build a knowledge base that identifies webpage categories quickly, which solves the problem that whole-web search cannot distinguish webpage categories.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for building a webpage category knowledge base according to an embodiment of the invention;
Fig. 2 shows a detailed flowchart of step S130 in Fig. 1;
Fig. 3 shows a schematic structural diagram of a system for building a webpage category knowledge base according to an embodiment of the invention;
Fig. 4 shows a schematic diagram of the detailed structure of the knowledge base building module in Fig. 3.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
As shown in Fig. 1, the method for building a webpage category knowledge base in this embodiment comprises:
Step S110: extract the page framework of the sample webpage and compute the page framework ID of the sample webpage. A sample webpage is a webpage selected in advance whose category is already known. The page framework is extracted according to the HTML tags in the page source: only frame-class tags such as frame and table are kept, together with their id, name and class attributes, and all other attributes are removed. Alternatively, the body text of the page can be identified by its punctuation and removed, leaving the page framework. After extraction, a hash value of the page framework is computed with a hash algorithm such as MD5 or FNV: the frame-class tags (for example frame and table) and their id, name and class attributes are fed to the hash algorithm, and the resulting value is the page framework ID of the sample webpage. Because the same hash function is used throughout, identical page frameworks yield identical page framework IDs.
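Step S110 can be illustrated with a minimal sketch. It is not the implementation of the invention: the set of tags treated as frame-class tags (beyond frame and table, which the text names), the serialization of the kept tags, and the function names extract_framework and framework_id are all assumptions made for illustration. MD5 is used because the text mentions it as one option.

```python
import hashlib
from html.parser import HTMLParser

# Assumption: the text names only "frame" and "table" as frame-class tags;
# the other entries are illustrative additions.
FRAME_TAGS = {"frame", "frameset", "iframe", "table", "div"}
KEPT_ATTRS = ("id", "name", "class")  # attributes the text says are kept


class FrameworkExtractor(HTMLParser):
    """Collects frame-class tags with only their id/name/class attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in FRAME_TAGS:
            # Sort kept attributes so attribute order does not change the ID.
            kept = sorted((k, v or "") for k, v in attrs if k in KEPT_ATTRS)
            attr_text = "".join(f' {k}="{v}"' for k, v in kept)
            self.tokens.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in FRAME_TAGS:
            self.tokens.append(f"</{tag}>")


def extract_framework(html):
    """Return the page framework: frame-class tags only, text and other tags dropped."""
    parser = FrameworkExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.tokens)


def framework_id(html):
    """Hash the page framework; identical frameworks give identical IDs."""
    return hashlib.md5(extract_framework(html).encode("utf-8")).hexdigest()


page_a = '<table id="main" class="list" width="90%"><tr><td>Hello</td></tr></table>'
page_b = '<table class="list" id="main" border="1"><tr><td>Other text</td></tr></table>'
page_c = '<table id="main" class="detail"><tr><td>Hello</td></tr></table>'
same_id = framework_id(page_a) == framework_id(page_b)
different_id = framework_id(page_a) != framework_id(page_c)
```

In this sketch page_a and page_b differ only in text and in attributes that are discarded, so they share a page framework ID, while page_c has a different class attribute and gets a different ID.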
Step S120: when the accumulated number of page frameworks with the same ID reaches a threshold, compute the page framework mode of the sample webpage. The title, time, body text and other parts are computed separately. The computation can use automatic machine learning, for example a support vector machine (SVM). For learning, the sample webpage is converted into HTML source and the key HTML tags are extracted to obtain the page framework; this is done in step S110. The page framework is then input to the SVM for learning, that is, the key HTML tags of the page frameworks are matched against each other. The key HTML tags of page frameworks that share an ID can be matched completely, so once the SVM has learned the threshold number of page frameworks for an ID, it outputs the page framework mode of that page framework. Before learning, the page framework also needs the following processing: the title is matched against the variable content inside title or anchor elements; the time is computed according to time formats; and the body text must meet a certain variable ratio and length requirement, so that junk content such as advertisements can be rejected.
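The pre-learning checks on time and body text can be sketched as follows. This is only one possible reading of the text: the time formats, the definition of "variable ratio" as the share of distinct values among pages with the same framework ID, the numeric limits, and the names looks_like_time and looks_like_body_text are assumptions, since the source does not specify them.

```python
import re

# Assumption: a few common date formats; the source does not list the formats.
TIME_PATTERNS = [
    re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"),
    re.compile(r"\d{4}/\d{1,2}/\d{1,2}"),
    re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),
]


def looks_like_time(value):
    """True if the whole value matches one of the assumed time formats."""
    value = value.strip()
    return any(p.fullmatch(value) is not None for p in TIME_PATTERNS)


def looks_like_body_text(values, min_length=200, min_variable_ratio=0.8):
    """values: content of the same slot across pages sharing a framework ID.

    Body text is expected to differ from page to page and to be long;
    repeated short blocks (for example advertisements) fail both checks.
    The limits are hypothetical defaults.
    """
    if not values:
        return False
    variable_ratio = len(set(values)) / len(values)
    average_length = sum(len(v) for v in values) / len(values)
    return variable_ratio >= min_variable_ratio and average_length >= min_length


articles = ["a" * 300, "b" * 300, "c" * 300]
adverts = ["Buy now!"] * 3
is_time = looks_like_time("2012-09-29 10:30")
is_article_slot = looks_like_body_text(articles)
is_advert_slot = looks_like_body_text(adverts)
```

Here the repeated advertisement slot is rejected because its content neither varies across pages nor meets the length requirement.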
To prevent some sample webpages from going unprocessed for a long time, it is judged whether the accumulated number of page frameworks of sample webpages with the same ID reaches the threshold within a predetermined time; if not, the threshold for that ID is decreased by a certain step. The threshold is preferably 23.
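The accumulation and threshold-decay logic can be sketched as a small counter class. The initial threshold of 23 comes from the text; the step size, the lower bound, the class name FrameworkAccumulator and the way a timeout is signalled (an explicit on_timeout call) are assumptions.

```python
from collections import defaultdict

DEFAULT_THRESHOLD = 23  # preferred value given in the text
DECAY_STEP = 5          # assumption: the text says only "a certain step"
MIN_THRESHOLD = 1       # assumption: never decay below one sample


class FrameworkAccumulator:
    """Counts sample pages per page framework ID and decays stale thresholds."""

    def __init__(self, threshold=DEFAULT_THRESHOLD, step=DECAY_STEP):
        self.default_threshold = threshold
        self.step = step
        self.counts = defaultdict(int)
        self.thresholds = {}

    def add(self, framework_id):
        """Count one sample page; return True once the ID reaches its threshold."""
        self.counts[framework_id] += 1
        self.thresholds.setdefault(framework_id, self.default_threshold)
        return self.counts[framework_id] >= self.thresholds[framework_id]

    def on_timeout(self, framework_id):
        """Call when the predetermined time elapses; lowers the threshold if unmet."""
        current = self.thresholds.get(framework_id, self.default_threshold)
        if self.counts[framework_id] < current:
            current = max(MIN_THRESHOLD, current - self.step)
            self.thresholds[framework_id] = current
        return self.counts[framework_id] >= current


acc = FrameworkAccumulator(threshold=23, step=10)
reached = [acc.add("id-1") for _ in range(5)]  # 5 samples, threshold 23
after_first_timeout = acc.on_timeout("id-1")   # threshold drops to 13
after_second_timeout = acc.on_timeout("id-1")  # threshold drops to 3
```

With five samples collected, the ID becomes ready for mode computation only after the second timeout, when its threshold has decayed from 23 to 3.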
Step S130: build the mapping between the category of the sample webpage and its page framework mode so as to generate the webpage category knowledge base. As shown in Fig. 2, this comprises:
Step S210: according to the categories of the different sample webpages, assign a preset weight to each webpage feature in the page framework mode of each category.
Step S220: build a mapping table relating the category of the sample webpage to each webpage feature of that category and its weight, so as to generate the webpage category knowledge base.
The sample categories include webpage categories such as image, video, blog, forum (bbs) and news. The page framework mode of the sample webpages of each category has several distinct webpage features, and a set of such features characterizes one page framework mode, i.e. one category of webpage. Of course, webpages of two different categories may share one or more (but not all) webpage features, possibly with different weights; for example, forum (bbs) and news pages both have the feature "title, time, text". The knowledge base generated by the above steps takes the form of a mapping table of webpage features and weights under the page framework mode corresponding to each webpage category, as shown in Table 1 below:
Table 1. Webpage features and weights under the page framework mode corresponding to each webpage category
[Table 1 is an image in the original publication (BDA00002221549300051, BDA00002221549300061) and is not reproduced here.]
The table above lists only part of the information and is intended to illustrate the mapping between webpage categories and the webpage features and weights under their page framework modes. As the table shows, the page framework mode of a news webpage has two webpage features: (1) the URL contains the keyword news, and (2) the page mode has a title, time and text. Their weights are 50 and 30 respectively. Having a title, time and text in the page mode can also be a webpage feature of the page framework mode of a bbs (forum) webpage, with a weight of 20. A bbs page has a further feature: the URL contains bbs or forum, with a weight of 50. The webpage features of a list page include: containing the "more" keyword, a navigation bar pattern, and the webpage's domain being a top-level domain; their weights are set to 30, 50 and 60 respectively.
When the webpage category knowledge base is used to identify the category of a target page framework mode, the target page framework mode is scored according to the weights of the different categories in the table. For example, if the URL contains bbs or forum, 50 points are added for bbs; if the URL contains news, 50 points are added for news. If the page mode has a title, time and text, 30 points are added for news and 20 points for bbs. If there is information such as floor numbers or reply counts, further points are added for bbs, and so on. If, after all features of the target page framework mode have been matched, the news category has the highest score, the page framework mode is classified as news.
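The mapping table and the scoring procedure can be sketched as a dictionary lookup. The weights are those quoted in the text for news, bbs and list pages; the feature key names, the KNOWLEDGE_BASE layout and the functions score and classify are hypothetical, and feature detection itself is assumed to have happened elsewhere.

```python
# Category -> {feature: weight}. Weights come from the text; key names are
# illustrative assumptions.
KNOWLEDGE_BASE = {
    "news": {"url_has_news": 50, "has_title_time_text": 30},
    "bbs": {"url_has_bbs_or_forum": 50, "has_title_time_text": 20},
    "list": {"has_more_keyword": 30, "has_nav_bar": 50, "is_top_level_domain": 60},
}


def score(features, kb=KNOWLEDGE_BASE):
    """Sum, per category, the weights of the features found on the target page."""
    return {
        category: sum(w for feature, w in weights.items() if feature in features)
        for category, weights in kb.items()
    }


def classify(features, kb=KNOWLEDGE_BASE):
    """Return the highest-scoring category, or None if no feature matched.

    Ties are resolved by dictionary order, which is an arbitrary choice here.
    """
    scores = score(features, kb)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


news_page = {"url_has_news", "has_title_time_text"}
forum_page = {"url_has_bbs_or_forum", "has_title_time_text"}
news_scores = score(news_page)
news_label = classify(news_page)
forum_label = classify(forum_page)
```

For news_page the scores are 80 for news and 20 for bbs, so the page is classified as news; the shared "title, time, text" feature contributes to both categories with different weights, as the text describes.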
For a list page, the page framework mode can be computed with the SVM learning method of step S120. However, list pages have distinctive webpage features: the domain name of the webpage is a top-level domain; a navigation bar pattern; the "more" keyword; and so on. List pages can therefore also be identified directly in step S120 as follows:
Judge whether the domain name of the webpage is a top-level domain; if so, mark the webpage as a list page. If not, identify list pages as follows: judge whether there are links that sit in a fixed-position block of the page and persist for a certain time, and if so, mark the webpage as a pending list page; schedule the pending list page once at set intervals, and if the links are continually updated to new links, set the page framework mode of the webpage as a list page framework mode, i.e. the webpage is a list page. For example, the navigation bar at the top of a webpage and the parts of the page frame that carry the word "more" are usually links located in fixed blocks of the page, so a webpage containing a navigation bar and "more" is a list page.
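The direct list-page check can be sketched as below. Several points are assumptions rather than statements of the source: "top-level domain" is read here as a site root URL (bare host with no path), the link block is represented as a list of link sets captured at successive crawls, "persists" is read as the block being non-empty in the most recent crawls, and the function names are hypothetical.

```python
from urllib.parse import urlparse


def is_site_root(url):
    """Assumed reading of "top-level domain": a URL with a host and no path or query."""
    parts = urlparse(url)
    return bool(parts.netloc) and parts.path in ("", "/") and not parts.query


def is_pending_list_page(snapshots, min_stable=2):
    """snapshots: link sets from the same fixed-position block, oldest first.

    The block counts as persisting if it held links in each of the last
    min_stable crawls.
    """
    return len(snapshots) >= min_stable and all(snapshots[-min_stable:])


def links_keep_updating(snapshots):
    """True if every re-crawl brought at least one link absent from the previous one."""
    pairs = zip(snapshots, snapshots[1:])
    return len(snapshots) >= 2 and all(curr - prev for prev, curr in pairs)


def is_list_page(url, snapshots):
    """Combine the checks in the order given in the text."""
    if is_site_root(url):
        return True
    return is_pending_list_page(snapshots) and links_keep_updating(snapshots)


rolling = [{"/a", "/b"}, {"/b", "/c"}, {"/c", "/d"}]  # block refreshed each crawl
static = [{"/about"}, {"/about"}]                      # block never changes
root_result = is_list_page("http://example.com/", [])
rolling_result = is_list_page("http://example.com/channel/tech.html", rolling)
static_result = is_list_page("http://example.com/article/1.html", static)
```

A page whose fixed block keeps receiving new links is confirmed as a list page, while a page with an unchanging block stays unconfirmed.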
The method of this embodiment builds a knowledge base that can quickly identify webpage categories, solves the problem that whole-web search cannot distinguish webpage categories, and lays a foundation for integrating vertical search with whole-web search.
The present invention also provides a system for building a webpage category knowledge base. As shown in Fig. 3, it comprises: a sample page framework ID computation module 310, a mode accumulation module 320 and a knowledge base building module 330.
The sample page framework ID computation module 310 is adapted to extract the page framework of a sample webpage and compute the page framework ID of the sample webpage. The sample page framework ID computation module 310 further comprises: a page framework extraction module, adapted to extract the page framework of the sample webpage according to the HTML tags in the source code of the sample webpage, and also adapted to identify the body text of the sample webpage by its punctuation and remove the body text to obtain the page framework of the sample webpage.
The mode accumulation module 320 is adapted to compute the page framework mode of the sample webpage when the accumulated number of page frameworks with the same ID reaches a threshold. The mode accumulation module further comprises: a threshold adjustment module, adapted to judge whether the accumulated number of page frameworks of sample webpages with the same ID reaches the threshold within a predetermined time and, if not, to decrease the threshold for that ID by a certain step.
The mode accumulation module 320 further comprises: a domain name identification module, adapted to judge whether the domain name of the webpage is a top-level domain and, if so, to mark the webpage as a list page. The mode accumulation module 320 also further comprises: a pending list page identification module, adapted to judge whether there are links that sit in a fixed-position block of the page and persist for a certain time, and if so, to mark the webpage as a pending list page; and a list page framework mode determination module, adapted to schedule the pending list page once at set intervals and, if the links are continually updated to new links, to set the page framework mode of the webpage as a list page framework mode.
The knowledge base building module 330 is adapted to build the mapping between the category of the sample webpage and the page framework mode so as to generate the webpage category knowledge base. The detailed structure of the knowledge base building module 330 is shown in Fig. 4; it further comprises:
a weight setting module 410, adapted to assign, according to the categories of the different sample webpages, a preset weight to each webpage feature in the page framework mode of each category;
a mapping table building module 420, adapted to build a mapping table relating the category of the sample webpage to each webpage feature of that category and its weight, so as to generate the webpage category knowledge base.
The algorithms and displays provided here are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described here can be implemented in various programming languages, and that the description above of a specific language is given to disclose the best mode of the invention.
The specification provided here describes numerous specific details. It will be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into it, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be changed adaptively and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment can be combined into one module, unit or component, and can furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described here include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention can be implemented in hardware, in software modules running on one or more processors, or in a combination of these. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the system for building a webpage category knowledge base according to embodiments of the invention. The invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference symbol placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A system for building a webpage category knowledge base, comprising:
a sample page framework ID computation module, adapted to extract the page framework of a sample webpage and compute the page framework ID of the sample webpage;
a mode accumulation module, adapted to compute the page framework mode of the sample webpage when the accumulated number of page frameworks with the same ID reaches a threshold;
a knowledge base building module, adapted to build the mapping between the category of the sample webpage and the page framework mode so as to generate the webpage category knowledge base.
2. The system for building a webpage category knowledge base according to claim 1, characterized in that the knowledge base building module further comprises:
a weight setting module, adapted to assign, according to the categories of the different sample webpages, a preset weight to each webpage feature in the page framework mode of each category;
a mapping table building module, adapted to build a mapping table relating the category of the sample webpage to each webpage feature of that category and its weight, so as to generate the webpage category knowledge base.
3. The system for building a webpage category knowledge base according to claim 1 or 2, characterized in that the page framework ID computation module further comprises: a page framework extraction module, adapted to extract the page framework of the sample webpage according to the HTML tags in the source code of the sample webpage.
4. The system for building a webpage category knowledge base according to any one of claims 1 to 3, characterized in that the page framework ID computation module further comprises: a page framework extraction module, adapted to identify the body text of the sample webpage by its punctuation and remove the body text to obtain the page framework of the sample webpage.
5. The system for building a webpage category knowledge base according to any one of claims 1 to 4, characterized in that the mode accumulation module further comprises:
a pending list page identification module, adapted to judge whether there are links that sit in a fixed-position block of the page and persist for a certain time, and if so, to mark the sample webpage as a pending list page;
a list page framework mode determination module, adapted to schedule the pending list page once at set intervals and, if the links are continually updated to new links, to set the page framework mode of the sample webpage as a list page framework mode.
6. the method for building up of a webpage classification knowledge base may further comprise the steps:
The page framework of sample drawn webpage, the page framework ID of calculating sample webpage;
When the page framework quantity of the identical ID of accumulative total reaches threshold value, calculate the page framework mode of sample webpage;
Set up the classification of sample webpage and the mapping relations of described page framework mode, with generating web page classification knowledge base.
7. The method for building a webpage category knowledge base according to claim 6, characterized in that building the mapping relationship between the category of the sample webpage and the page framework mode so as to generate the webpage category knowledge base specifically comprises:
According to the categories of the different sample webpages, assigning a preset weight to each webpage feature in the page framework mode of each category;
Building a relation mapping table between the category of the sample webpage and each webpage feature of that category together with its weight, so as to generate the webpage category knowledge base.
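As a rough illustration of the relation mapping table in claim 7, the sketch below stores, for each category, a weight per webpage feature. The feature names, the `preset_weights` input, the `default_weight` fallback and the function name `build_mapping_table` are all hypothetical; the claim only requires that each feature receives a preset weight.

```python
def build_mapping_table(features_by_category, preset_weights, default_weight=1.0):
    """Return {category: {feature: weight}}.

    features_by_category: {category: iterable of feature names found in
        that category's page framework mode}
    preset_weights: {category: {feature: weight}} chosen in advance
    default_weight: assumed fallback for features without a preset weight
    """
    table = {}
    for category, features in features_by_category.items():
        weights = preset_weights.get(category, {})
        table[category] = {
            feature: weights.get(feature, default_weight) for feature in features
        }
    return table
```

A nested dictionary is used only for brevity; any structure that lets a category be looked up with its features and weights would fit the claim equally well.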
8. The method for building a webpage category knowledge base according to claim 6 or 7, characterized in that the page framework of the sample webpage is extracted as follows: the page framework of the sample webpage is extracted according to the HTML tags in the source code of the sample webpage.
9. The method for building a webpage category knowledge base according to any one of claims 6 to 8, characterized in that the page framework of the sample webpage is extracted as follows: the body text of the sample webpage is identified by its punctuation marks, and the body text is removed to obtain the page framework of the sample webpage.
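The two extraction approaches of claims 8 and 9 can be combined in a short sketch built on Python's standard `html.parser`. The details are assumptions rather than the patent's method: the framework is represented as the sequence of tag names, text that contains punctuation is treated as body text and dropped, and punctuation-free text is kept as a `#text` placeholder. The names `FrameworkExtractor`, `extract_framework` and `PUNCTUATION` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed punctuation set (Western and Chinese) used to recognise body text.
PUNCTUATION = re.compile(r"[,.;:!?，。；：！？]")


class FrameworkExtractor(HTMLParser):
    """Collect the tag skeleton of a page and discard its body text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        # Text with punctuation is assumed to be body text and is removed;
        # other non-empty text is reduced to a placeholder.
        if text and not PUNCTUATION.search(text):
            self.parts.append("#text")


def extract_framework(html: str) -> str:
    parser = FrameworkExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Two pages generated from the same template but carrying different articles yield the same framework string under these assumptions, which is the property an ID computed from the framework relies on.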
10. The method for building a webpage category knowledge base according to any one of claims 6 to 9, characterized in that a list page framework mode is calculated as follows:
Judging whether the page contains a link that is located in a fixed-position block of the page and has stably existed for a certain period of time, and if so, setting the sample webpage as a candidate list page;
Scheduling a visit to the candidate list page at regular intervals and, if the link is continually updated to a new link, setting the page framework mode of the sample webpage as the list page framework mode.
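The list page check of claim 10 can be sketched as a comparison of the links found in the same fixed-position block on successive scheduled visits. This is only an illustration: representing each visit as a set of link URLs, counting a visit as an "update" when it shows at least one link not seen before, and the `min_updates` cut-off are all assumptions, as are the function names.

```python
def count_link_updates(snapshots):
    """Count the visits on which the block showed at least one new link.

    snapshots: list of link collections, one per scheduled visit, all taken
    from the same fixed-position block of the candidate list page.
    """
    if not snapshots:
        return 0
    seen = set(snapshots[0])
    updates = 0
    for links in snapshots[1:]:
        if set(links) - seen:  # some link has not been seen before
            updates += 1
        seen |= set(links)
    return updates


def is_list_page(snapshots, min_updates=2):
    """Treat the candidate as a list page if its links keep being renewed."""
    return count_link_updates(snapshots) >= min_updates
```

A page whose block always shows the same links is never promoted under this sketch, whereas a block that gains new links on each visit is.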
CN201210376381.4A 2012-09-29 2012-09-29 Webpage category knowledge base set up system and method Active CN102890717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210376381.4A CN102890717B (en) 2012-09-29 2012-09-29 Webpage category knowledge base set up system and method

Publications (2)

Publication Number Publication Date
CN102890717A true CN102890717A (en) 2013-01-23
CN102890717B CN102890717B (en) 2016-09-28

Family

ID=47534219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210376381.4A Active CN102890717B (en) 2012-09-29 2012-09-29 Webpage category knowledge base set up system and method

Country Status (1)

Country Link
CN (1) CN102890717B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102298614A (en) * 2011-07-29 2011-12-28 百度在线网络技术(北京)有限公司 Method for determining collection category of page collection information and device and equipment
CN102902793A (en) * 2012-09-29 2013-01-30 北京奇虎科技有限公司 Creation system and method of webpage classification knowledge base

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Xiaohua et al.: "Research on an N-Gram-based text deduplication method", Journal of Hangzhou Dianzi University, vol. 30, no. 2, 30 April 2010 (2010-04-30) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902793A (en) * 2012-09-29 2013-01-30 北京奇虎科技有限公司 Creation system and method of webpage classification knowledge base
CN102902793B (en) * 2012-09-29 2016-12-21 北京奇虎科技有限公司 Webpage category knowledge base set up system and method
CN103336786A (en) * 2013-06-05 2013-10-02 腾讯科技(深圳)有限公司 Data processing method and device
CN103336786B (en) * 2013-06-05 2017-05-24 腾讯科技(深圳)有限公司 Data processing method and device
CN111914201A (en) * 2020-08-07 2020-11-10 腾讯科技(深圳)有限公司 Network page processing method and device
CN111914201B (en) * 2020-08-07 2023-11-07 腾讯科技(深圳)有限公司 Processing method and device of network page
CN114706793A (en) * 2022-05-16 2022-07-05 北京百度网讯科技有限公司 Webpage testing method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN102890717B (en) 2016-09-28

Similar Documents

Publication Publication Date Title
CN104077388A (en) Summary information extraction method and device based on search engine and search engine
CN109190049B (en) Keyword recommendation method, system, electronic device and computer readable medium
US8321396B2 (en) Automatically extracting by-line information
CN109145216A (en) Network public-opinion monitoring method, device and storage medium
JP2015204103A (en) Interactive search and recommendation method and device thereof
US8572087B1 (en) Content identification
CN102508859A (en) Advertisement classification method and device based on webpage characteristic
RU2010151913A (en) SHOW OF ADVERTISING ANNOUNCEMENTS BASED ON INTERACTION WITH WEB PAGE
CN102298614A (en) Method for determining collection category of page collection information and device and equipment
US10860792B2 (en) Detecting compatible layouts for content-based native ads
CN103902889A (en) Malicious message cloud detection method and server
CN102722563A (en) Method and device for displaying page
JP2003330948A (en) Device and method for evaluating web page
CN103488786A (en) Method and client terminal for providing information search
CN107153716B (en) Webpage content extraction method and device
CN102902794A (en) Web page classification system and method
CN102902790A (en) Web page classification system and method
CN102902792B (en) list page identification system and method
WO2014000130A1 (en) Method or system for automated extraction of hyper-local events from one or more web pages
CN103034707A (en) Website navigation method, device and browser client
CN102902784A (en) Web page classification storage system and method
CN102833233A (en) Method and device for recognizing web pages
CN102982118A (en) Searching method and device based on favorites
CN102890717A (en) System and method for building webpage category knowledge base
CN105630937A (en) Method and device for searching answers to exam questions

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220711

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co., Ltd
