CN101404031A - Method and system for recognizing concept type web pages - Google Patents

Method and system for recognizing concept type web pages

Info

Publication number
CN101404031A
Authority
CN
China
Prior art keywords
concept type
web pages
catalogue
uri
type web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008102257626A
Other languages
Chinese (zh)
Other versions
CN101404031B (en)
Inventor
刘琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN2008102257626A
Publication of CN101404031A
Application granted
Publication of CN101404031B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for identifying a conceptual web page and a system thereof. The method comprises the following steps: a plurality of conceptual web pages are acquired from a web page database; the URI amount of the conceptual web pages under all levels of directories of each website domain name is compared with a first threshold, the directory under which the URI amount of the conceptual web pages is greater than the first threshold is determined as a conceptual directory; and the URI of the web page to be identified is matched with each conceptual directory, if matched, the web page to be identified is determined as the conceptual web page. The method can quickly and comprehensively distinguish whether the web page is a conceptual web page and the class thereof. The method increases the identification rate and obviously improves the coverage rate in regard to identifying a conceptual document from mass web page data.

Description

Method and system for recognizing concept-type web pages
Technical field
The present invention relates to the field of network information processing and, more specifically, to a method and system for recognizing concept-type web pages.
Background art
With the rapid growth of text and multimedia content on the Internet and in other data networks and systems, the volume of network information is increasing sharply. How to help users obtain the information they need from this mass of network information as quickly and accurately as possible has therefore become a central problem in the field of network information processing.
A "concept" usually refers to a unit of knowledge (or a general semantic unit) formed by a unique combination of features. A concept-type document usually takes the explanation of a concept as its theme and describes the intension and extension of that single concept.
The prior art has proposed various technical schemes for analyzing and processing network information in order to satisfy users' information needs. Among them, the patent "A recognition method and system for concept-type documents" (publication number CN101004753A, hereinafter "Invention 1") observes that, in user search behavior, a concept-type document is usually the best answer among documents that match the same query keywords. It is therefore necessary to analyze and identify this type of document within a collection of network documents. At the same time, under traditional ranking methods concept-type documents usually sit fairly low in the search result list, which conflicts with their being the user's preferred answer and so reduces the efficiency of obtaining information. It is therefore necessary to recognize such documents specifically and efficiently.
Invention 1 provides an independent, automatic, and efficient means of recognizing concept-type documents, but it has the following problems in practice. (1) The method of Invention 1 has a certain probability of error. Rhetorical devices such as metaphor and personification in literary works can cause a document to be misidentified as a concept-type document. For example, compare the sentence "The people's army is our family, our great wall of steel." with the sentence "A sunspot is a kind of solar activity phenomenon." From the textual form alone, it is hard for an automatic program to tell whether a sentence is describing a concept: apart from the noun that serves as the subject, the two sentences are phrased in exactly the same way. In theory, adjustments made only within the system of Invention 1 cannot give it the ability to distinguish the two. (2) The method of Invention 1 misses some documents. That is, although Invention 1 ensures relatively accurate and efficient recognition, it does not guarantee that all concept-type documents are covered. On the Internet in particular, the description in some concept-type pages may not contain the concept word itself, which instead appears only as the page title; some concept-type documents are also very brief and do not provide enough evidence for the method of Invention 1 to make a judgment. Whether because of the limitations of the method or because of the diversity of documents, the method of Invention 1 therefore has inherent defects in both precision and recall. In addition, even when concept-type documents are recognized accurately, analysis of user behavior shows that a user's perception of a website's authority affects which search results the user selects. Users are more inclined to trust and select results from a website that offers a large number of concept-type documents, whereas a website with only a few concept-type documents is less well recognized by users and its results are harder to trust. Therefore, although the method of Invention 1 technically provides a fast and accurate means of recognizing concept-type documents, it cannot fully satisfy the needs of user search.
A solution for recognizing concept-type web pages is therefore needed to solve the above problems in the related art.
Summary of the invention
The present invention aims to provide a method and system for recognizing concept-type web pages that can improve search quality.
According to one aspect of the present invention, a method for recognizing concept-type web pages is provided, comprising the following steps: obtaining a plurality of concept-type web pages from a web page database; comparing the number of URIs of concept-type web pages under the directories at each level of each website domain name with a first threshold, and determining each directory under which the number of URIs of concept-type web pages is greater than the first threshold to be a concept-type directory; and matching the URI of a web page to be recognized against each concept-type directory and, if it matches, determining the web page to be recognized to be a concept-type web page.
After the step of determining the concept-type directories, the method further comprises: classifying all web pages under a concept-type directory; and determining the category that is shared by the web pages under the concept-type directory and that accounts for a predetermined ratio of those web pages to be the category of the concept-type directory.
After the step of determining the web page to be recognized to be a concept-type web page, the method further comprises: adding the web page determined to be a concept-type web page to that category.
After the step of determining the concept-type directories, the method further comprises: counting the number of URIs of non-concept-type web pages in the directories at each level under a concept-type directory and, when the number of URIs of non-concept-type web pages is greater than a second threshold, determining the directory containing those non-concept-type web pages to be a non-concept-type directory.
The step of determining the web page to be recognized to be a concept-type web page further comprises: matching the web page to be recognized against each non-concept-type directory; if it matches none of the non-concept-type directories, determining it to be a concept-type web page; otherwise, determining it to be a non-concept-type web page.
The step of determining the concept-type directories comprises: counting the number of URIs of concept-type web pages under each website domain name directory, comparing it with the first threshold and, when it is greater than the first threshold, determining the website domain name directory to be a concept-type directory; counting the number of URIs of concept-type web pages in each subdirectory of the website domain name, comparing it with the first threshold and, when it is greater than the first threshold, determining the subdirectory to be a concept-type directory; and repeating this operation until the number of URIs of concept-type web pages in the directory being counted is not greater than the first threshold or that directory has no subdirectory.
The step of obtaining a plurality of concept-type web pages comprises: obtaining the plurality of concept-type web pages using a concept-type web page recognition algorithm.
According to another aspect of the present invention, a system for recognizing concept-type web pages is provided, comprising: an acquisition module for obtaining a plurality of concept-type web pages from a web page database; a concept-type directory determination module for comparing the number of URIs of concept-type web pages under the directories at each level of each website domain name with a first threshold, and determining each directory under which the number of URIs of concept-type web pages is greater than the first threshold to be a concept-type directory; and a matching and determination module for matching the URI of a web page to be recognized against each concept-type directory and, if it matches, determining the web page to be recognized to be a concept-type web page.
The system further comprises: a category determination module for classifying all web pages under a concept-type directory and determining the category that is shared by the web pages under the concept-type directory and that accounts for a predetermined ratio of those web pages to be the category of the concept-type directory; and an adding module for adding the web page determined to be a concept-type web page to that category.
The system further comprises: a non-concept-type directory determination module for counting the number of URIs of non-concept-type web pages in the directories at each level under a concept-type directory and, when the number of URIs of non-concept-type web pages is greater than a second threshold, determining the directory containing those non-concept-type web pages to be a non-concept-type directory.
The matching and determination module is further used to match the web page to be recognized against the non-concept-type directories; if the page matches none of the non-concept-type directories, it is determined to be a concept-type web page; otherwise, it is determined to be a non-concept-type web page.
The acquisition module uses a concept-type web page recognition algorithm to obtain the plurality of concept-type web pages.
The concept-type directory determination module comprises: a first counting module for counting the number of URIs of concept-type web pages under the directories at each level of each website domain name; a first comparison module for comparing the number of URIs of concept-type web pages with the first threshold; and a first determination module for determining each directory under which the number of URIs of concept-type web pages is greater than the first threshold to be a concept-type directory.
The non-concept-type directory determination module comprises: a second counting module for counting the number of URIs of non-concept-type web pages in the directories at each level under a concept-type directory; a second comparison module for comparing the number of URIs of non-concept-type web pages with the second threshold; and a second determination module for determining, when the number of URIs of non-concept-type web pages is greater than the second threshold, the directory containing those non-concept-type web pages to be a non-concept-type directory.
The present invention uses the distribution characteristics of concept-type web pages to filter out websites that contain very few concept-type pages, so that websites in which concept-type pages are concentrated and numerous are more readily selected as concept-type directories than websites containing fewer concept-type web pages. In this way, pages of lower credibility are not recognized as concept-type pages, and better search results can be presented.
The present invention can quickly and comprehensively recognize whether a web page is a concept-type web page and determine its category. For recognizing concept-type documents in large-scale web data, it not only improves recognition speed but also markedly improves coverage.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the written description, the claims, and the drawings.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for recognizing concept-type web pages according to a first embodiment of the invention;
Fig. 2 is a block diagram of the system for recognizing concept-type web pages according to a second embodiment of the invention;
Fig. 3A, Fig. 3B, and Fig. 3C are flowcharts of the steps of the method for recognizing concept-type web pages according to a third embodiment of the invention; and
Fig. 4 is a flowchart of a method, according to a fourth embodiment of the invention, for obtaining categorized concept-type websites and same-category concept words using pre-classified concept words.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. Throughout, the same reference numerals denote the same elements.
Fig. 1 is a flowchart of the method for recognizing concept-type web pages according to the first embodiment of the invention.
Referring to Fig. 1, the method for recognizing concept-type web pages according to this embodiment comprises the following steps:
Step S102: obtain a plurality of concept-type web pages from a web page database;
Step S104: compare the number of URIs of concept-type web pages under the directories at each level of each website domain name with a first threshold, and determine each directory under which the number of URIs of concept-type web pages is greater than the first threshold to be a concept-type directory; and
Step S106: match the URI of a web page to be recognized against each concept-type directory and, if it matches, determine the web page to be recognized to be a concept-type web page.
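Steps S104 and S106 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names are hypothetical, the input is assumed to be a list of URIs already recognized as concept-type pages in step S102, and the last path segment of each URI is assumed to be the page itself rather than a directory.

```python
from collections import Counter
from urllib.parse import urlsplit

def directory_prefixes(uri):
    """Yield the domain and every directory level above the page, e.g.
    http://a.com/z/h/1.html -> a.com/, a.com/z/, a.com/z/h/."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    prefix = parts.netloc + "/"
    yield prefix
    for name in segments[:-1]:  # assumption: the last segment is the page
        prefix += name + "/"
        yield prefix

def concept_directories(concept_uris, first_threshold):
    """Step S104: keep every directory holding more than first_threshold concept-page URIs."""
    counts = Counter(p for uri in concept_uris for p in directory_prefixes(uri))
    return {d for d, n in counts.items() if n > first_threshold}

def is_concept_page(uri, concept_dirs):
    """Step S106: a page matches if any of its directory levels is a concept-type directory."""
    return any(p in concept_dirs for p in directory_prefixes(uri))
```

Counting every prefix level at once is one possible reading of "the directories at each level"; the level-by-level refinement described for the third embodiment is a stricter variant.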
After step S104, the method further comprises: classifying all web pages under the concept-type directory, and determining the category that is shared by the web pages under the concept-type directory and that accounts for a predetermined ratio of those web pages to be the category of the concept-type directory. For example, for a concept-type directory whose concept-type web pages explain heart disease, coronary heart disease, diabetes, and other diseases, the category of the directory, judged from the categories of its web pages, should be "disease and health".
After step S106, the method further comprises: adding the web page determined to be a concept-type web page to that category. If a web page about pneumonia is determined by this method to be a concept-type web page, it belongs to the "disease and health" category mentioned above.
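The category determination can be sketched as follows. This is an illustrative assumption rather than the patent's implementation: `directory_category` is a hypothetical name, each page is assumed to already carry a single category label from some classifier, and "reaching" the predetermined ratio is read as greater than or equal to it.

```python
from collections import Counter

def directory_category(page_categories, predetermined_ratio):
    """Return the most common page category if its share of the pages under the
    directory reaches predetermined_ratio; otherwise return None."""
    if not page_categories:
        return None
    category, count = Counter(page_categories).most_common(1)[0]
    if count / len(page_categories) >= predetermined_ratio:
        return category
    return None
```

A page later recognized under that directory, such as the pneumonia page in the example, would then simply inherit the returned label.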
After step S104, the method further comprises: counting the number of URIs of non-concept-type web pages in the directories at each level under the concept-type directory and, when the number of URIs of non-concept-type web pages is greater than a second threshold, determining the directory containing those non-concept-type web pages to be a non-concept-type directory.
Step S106 further comprises: matching the web page to be recognized against each non-concept-type directory; if it matches none of the non-concept-type directories, determining it to be a concept-type web page; otherwise, determining it to be a non-concept-type web page.
Step S104 comprises: counting the number of URIs of concept-type web pages under each website domain name directory, comparing it with the first threshold and, when it is greater than the first threshold, determining the website domain name directory to be a concept-type directory; counting the number of URIs of concept-type web pages in each subdirectory of the website domain name, comparing it with the first threshold and, when it is greater than the first threshold, determining the subdirectory to be a concept-type directory; and repeating this operation until the number of URIs of concept-type web pages in the directory being counted is not greater than the first threshold or that directory has no subdirectory.
Step S102 comprises: obtaining the plurality of concept-type web pages using a concept-type web page recognition algorithm.
Fig. 2 is a block diagram of the system for recognizing concept-type web pages according to the second embodiment of the invention.
Referring to Fig. 2, the system 200 for recognizing concept-type web pages comprises: an acquisition module 202 for obtaining a plurality of concept-type web pages from a web page database; a concept-type directory determination module 204 for comparing the number of URIs of concept-type web pages under the directories at each level of each website domain name with a first threshold, and determining each directory under which the number of URIs of concept-type web pages is greater than the first threshold to be a concept-type directory; and a matching and determination module 206 for matching the URI of a web page to be recognized against each concept-type directory and, if it matches, determining the web page to be recognized to be a concept-type web page.
The system further comprises: a category determination module 208 for classifying all web pages under a concept-type directory and determining the category that is shared by the web pages under the concept-type directory and that accounts for a predetermined ratio of those web pages to be the category of the concept-type directory; and an adding module 210 for adding the web page determined to be a concept-type web page to that category.
The system further comprises: a non-concept-type directory determination module 212 for counting the number of URIs of non-concept-type web pages in the directories at each level under a concept-type directory and, when the number of URIs of non-concept-type web pages is greater than a second threshold, determining the directory containing those non-concept-type web pages to be a non-concept-type directory.
The matching and determination module 206 is further used to match the web page to be recognized against the non-concept-type directories; if the page matches none of the non-concept-type directories, it is determined to be a concept-type web page; otherwise, it is determined to be a non-concept-type web page.
The acquisition module 202 uses a concept-type web page recognition algorithm to obtain the plurality of concept-type web pages.
The concept-type directory determination module 204 comprises: a first counting module for counting the number of URIs of concept-type web pages under the directories at each level of each website domain name; a first comparison module for comparing the number of URIs of concept-type web pages with the first threshold; and a first determination module for determining each directory under which the number of URIs of concept-type web pages is greater than the first threshold to be a concept-type directory.
The non-concept-type directory determination module 212 comprises: a second counting module for counting the number of URIs of non-concept-type web pages in the directories at each level under a concept-type directory; a second comparison module for comparing the number of URIs of non-concept-type web pages with the second threshold; and a second determination module for determining, when the number of URIs of non-concept-type web pages is greater than the second threshold, the directory containing those non-concept-type web pages to be a non-concept-type directory.
Fig. 3A, Fig. 3B, and Fig. 3C are flowcharts of the steps of the method, according to the third embodiment of the invention, for recognizing concept-type web pages from the distribution of concept-type websites.
Referring to Fig. 3A, the method for discovering concept-type websites comprises the following steps:
Step S302a: process a collection of web pages with the concept-type document recognition algorithm of Invention 1 to obtain a set of concept-type web pages; and
Step S304a: perform statistical processing on the URIs in the set of concept-type web pages.
In step S302a, each element of the web page collection is a single web document. The collection is identical or approximately identical to the collection of web pages that user searches need to query. The data source of the collection does not matter, but each web page must retain its original URI as its unique identifier.
Step S304a comprises: counting in turn the total number of concept-type web page URIs under each website domain name, and recording each website domain name (for example, A.com) whose total exceeds a predetermined threshold N together with the total number of concept-type web page URIs under that domain name; for each website selected in this way, counting the total number of URIs under each subdirectory and, if the total under a subdirectory still exceeds the predetermined threshold N, recording the URI of that subdirectory (for example, A.com/Z/) in place of the previously recorded parent domain name (A.com); and continuing this analysis directory by directory until no subdirectory exceeds the threshold N or there is no next-level subdirectory left to analyze.
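The level-by-level refinement of step S304a can be sketched in Python as below. The function name and data layout are hypothetical, and the sketch makes two interpretive assumptions: when several subdirectories of one parent exceed N, all of them replace the parent, and the last path segment of a URI is the page rather than a directory.

```python
from collections import Counter
from urllib.parse import urlsplit

def refine_concept_directories(concept_uris, threshold_n):
    """Record the deepest directories whose concept-page URI count still exceeds threshold_n."""
    counts = Counter()   # directory prefix -> number of concept-page URIs below it
    children = {}        # directory prefix -> set of immediate subdirectories
    domains = set()
    for uri in concept_uris:
        parts = urlsplit(uri)
        prefix = parts.netloc + "/"
        domains.add(prefix)
        counts[prefix] += 1
        for name in [s for s in parts.path.split("/") if s][:-1]:
            child = prefix + name + "/"
            children.setdefault(prefix, set()).add(child)
            counts[child] += 1
            prefix = child
    recorded = set()
    pending = [d for d in domains if counts[d] > threshold_n]
    while pending:
        directory = pending.pop()
        dense = [c for c in children.get(directory, ()) if counts[c] > threshold_n]
        if dense:
            pending.extend(dense)    # dense subdirectories replace the parent record
        else:
            recorded.add(directory)  # no denser level below: keep this directory
    return recorded
```

Replacing a parent by its dense subdirectories narrows the later URI matching to the parts of a site where concept-type pages actually cluster.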
Non-concept-type directories within concept-type websites are discovered by building on the discovery of concept-type websites and further analyzing how non-concept-type pages are distributed within the set of concept-type directories. Referring to Fig. 3B, the steps for discovering non-concept-type directories within concept-type websites comprise:
Step S302b: using website matching, match the set of concept-type directories generated in step S304a against the URIs of the web pages in the web page collection, and add the pages that match successfully to "concept-type web page set*";
Step S304b: run the concept-type document recognition algorithm of Invention 1 on "concept-type web page set*", and add the web pages recognized as non-concept-type documents to the set of non-concept-type web pages; and
Step S306b: based on the URIs in the set of concept-type directories (for example, A.com/Z), count how the URIs of the web pages in the set of non-concept-type web pages are distributed across the directory levels below each concept-type directory; if the ratio of non-concept-type web pages under a directory at some level exceeds a predetermined threshold K, record that directory and stop analyzing it further.
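Step S306b can be illustrated with the following Python sketch. It rests on stated assumptions, not on the patent text: the names are hypothetical, URIs are given without a scheme, each page carries a boolean label from the recognition algorithm, and the ratio is taken over all labelled pages below a directory.

```python
from collections import defaultdict

def dir_levels(uri):
    """'a.com/z/m/2.html' -> ['a.com/', 'a.com/z/', 'a.com/z/m/'] (scheme-less URIs assumed)."""
    segments = uri.split("/")[:-1]
    return ["/".join(segments[:i]) + "/" for i in range(1, len(segments) + 1)]

def non_concept_directories(concept_dir, labelled_pages, threshold_k):
    """labelled_pages maps a URI to True (concept page) or False (non-concept page)."""
    total = defaultdict(int)
    non_concept = defaultdict(int)
    for uri, is_concept in labelled_pages.items():
        for d in dir_levels(uri):
            # Only directories strictly below the concept-type directory are analysed.
            if d.startswith(concept_dir) and d != concept_dir:
                total[d] += 1
                if not is_concept:
                    non_concept[d] += 1
    found = []
    for d in sorted(total, key=len):  # a parent is always shorter than its descendants
        if any(d.startswith(f) for f in found):
            continue  # stop analysing below a directory that is already recorded
        if non_concept[d] / total[d] > threshold_k:
            found.append(d)
    return found
```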
Fig. 3C shows a method that uses the set of concept-type directories generated by the steps in Fig. 3A and the set of non-concept-type directories generated by the steps in Fig. 3B to recognize concept-type web pages with a simplified concept-type document analysis algorithm (hereinafter "Method 2").
Step S302c: match the URI of each input web page in turn against the set of concept-type websites. If matching against the set of concept-type directories fails, the page is not recognized as a concept-type web page. If it succeeds, similar matching continues against the set of non-concept-type directories: if that matching succeeds, the page is not recognized as a concept-type web page; if it fails, the page is recognized as a concept-type page and its concept word is extracted.
An example of step S302c is as follows:
The steps in Fig. 3A yield the set of concept-type directories {XXX://A.com/Z, XXX://A.com/Y, XXX://B.com/W}
The steps in Fig. 3B yield the set of non-concept-type directories {XXX://A.com/Z/M, XXX://A.com/Y/N/P, XXX://B.com/W/Q}
The steps in Fig. 3C then give the following results for these URIs:
XXX://A.com/Z/H/1.html: concept-type (matches XXX://A.com/Z)
XXX://A.com/Z/M/2.html: non-concept-type (matches XXX://A.com/Z and matches XXX://A.com/Z/M)
XXX://A.com/Y/N/R/3.html: concept-type (matches XXX://A.com/Y)
XXX://B.com/4.html: non-concept-type
XXX://C.com/5.html: non-concept-type
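The example above can be reproduced with a short Python sketch of the two-stage matching in step S302c. The helper names are hypothetical, and matching is assumed to be a prefix match on whole path segments, so that a directory such as `A.com/Z` does not match an unrelated path that merely starts with the same letters.

```python
def under(uri, directory):
    """True if uri lies in directory, matching on whole path segments."""
    return uri == directory or uri.startswith(directory.rstrip("/") + "/")

def classify(uri, concept_dirs, non_concept_dirs):
    """Step S302c: match the concept-type set first, then the non-concept-type set."""
    if not any(under(uri, d) for d in concept_dirs):
        return "non-concept"
    if any(under(uri, d) for d in non_concept_dirs):
        return "non-concept"
    return "concept"

# Directory sets from the example; "XXX" stands for the URI scheme as in the text.
concept_dirs = ["XXX://A.com/Z", "XXX://A.com/Y", "XXX://B.com/W"]
non_concept_dirs = ["XXX://A.com/Z/M", "XXX://A.com/Y/N/P", "XXX://B.com/W/Q"]

for uri in ["XXX://A.com/Z/H/1.html", "XXX://A.com/Z/M/2.html",
            "XXX://A.com/Y/N/R/3.html", "XXX://B.com/4.html", "XXX://C.com/5.html"]:
    print(uri, classify(uri, concept_dirs, non_concept_dirs))
```

Both stages are plain string comparisons against precomputed sets, which is why recognition at this stage needs no content analysis of the page.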
Of the three procedures in Fig. 3A, Fig. 3B, and Fig. 3C, those in Fig. 3A and Fig. 3B analyze the distribution of concept-type websites on the basis of the method of Invention 1. However, the method of Invention 1 need not be used; that is, the "concept-type recognition algorithm" can be replaced by any other effective concept-type document recognition algorithm. The procedure in Fig. 3C applies the recognized distribution of concept-type websites: it uses the statistics of that distribution to recognize only specific data as concept-type web pages. Because the distribution of concept-type websites has been analyzed in advance, recognition is very efficient. Because concept-type web pages are densely distributed on the Internet, recognizing concept-type documents through the distribution of concept-type websites can effectively compensate for the defects of the recognition algorithm. Although data from some websites may be lost, the total number of concept-type documents recognized increases appreciably, and because concept-type data are densely distributed within these websites, the credibility of the search results for users also improves, so the effect can be better than that of recognizing concept-type documents independently.
Fig. 4 is a flowchart of a method, according to the fourth embodiment of the invention, for obtaining categorized concept-type websites and same-category concept words using pre-classified concept words.
Because concept-type documents are concentrated in certain specialized websites, some known concept words of particular categories can be used for matching after the steps in Fig. 3C have analyzed a concept-type document and extracted its concept word. After a successful match, the set of concept-type directories that the document hit during recognition is recorded as the set of concept-type directories corresponding to that category of concept word. In the example for Fig. 3C, if the concept word extracted from XXX://A.com/Z/H/1.html matches a concept word in category A, the concept-type directory XXX://A.com/Z is recorded in the set of concept-type directories corresponding to category A.
Referring to Fig. 4, the method according to the fourth embodiment for obtaining categorized concept-type websites and same-category concept words using pre-classified concept words comprises the following step:
Step S402: after the sets of concept-type directories for a number of categories have been obtained, count the category information corresponding to each concept-type directory. Provided that the category concept words supplied in advance for matching are sufficiently comprehensive, if a concept-type directory corresponds to only one or a limited few categories, or if the words matching one or a limited few categories account for a proportion exceeding a predetermined threshold Q of all pre-supplied category words matched by that directory while the total number of concept words matching that one or those few categories exceeds a predetermined threshold P, then the documents under that concept-type directory can all be considered to belong to that one or those few categories, and the concept words extracted from the documents under that directory also essentially belong to them. That is, the category distribution of the concept words under a concept-type directory can be used to identify likely categorized concept-type directories and to mine further same-category concept words.
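The thresholds Q and P in step S402 can be illustrated with a simplified Python sketch. It is an assumption-laden reading rather than the patent's procedure: the function name is hypothetical, only a single dominant category is considered (the text also allows "a limited few"), and both thresholds are applied as strict inequalities.

```python
from collections import Counter

def categorize_directory(extracted_words, preclassified, threshold_q, threshold_p):
    """extracted_words: concept words extracted from pages under one concept-type directory.
    preclassified: mapping from known concept word to its category.
    Returns (category, candidate same-category words), or (None, set()) if no category dominates."""
    matched = [preclassified[w] for w in extracted_words if w in preclassified]
    if not matched:
        return None, set()
    category, hits = Counter(matched).most_common(1)[0]
    # Q bounds the dominant category's share of matched words; P bounds its absolute count.
    if hits / len(matched) > threshold_q and hits > threshold_p:
        candidates = {w for w in extracted_words if w not in preclassified}
        return category, candidates
    return None, set()
```

In this reading, the unmatched words under a directory whose matched words overwhelmingly share one category are the "same-category concept words" that the step mines.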
The present invention utilizes the distribution characteristics of concept type web pages, filter out some and comprise the seldom website of the concept type page, and make that distribution is comparatively concentrated, a fairly large number of website of the concept type page is selected out as the concept type catalogue than the website that comprises less concept type web pages is easier.Like this, just can not be identified as the concept type page by the page that confidence level is lower by the present invention, thereby can represent better Search Results.
The present invention can quickly and comprehensively identify whether a web page is a concept type web page and which category it belongs to. For identifying concept type documents in large-scale web data, it not only improves recognition speed but also markedly improves coverage.
The above are merely preferred embodiments of the present invention and are not intended to limit it; for a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall be included within its scope of protection.

Claims (14)

1. A method for identifying concept type web pages, characterized in that it comprises the following steps:
obtaining a plurality of concept type web pages from a web database;
comparing the URI quantity of the concept type web pages under the catalogues at each level of each website domain name with a first threshold, and determining as a concept type catalogue each catalogue under which the URI quantity of said concept type web pages is greater than said first threshold; and
matching the URI of a web page to be identified against each said concept type catalogue and, if it matches, determining the web page to be identified to be a concept type web page.
2. The method according to claim 1, characterized in that, after the step of determining the concept type catalogue, it further comprises:
classifying all web pages under said concept type catalogue; and
determining, as the category of said concept type catalogue, the category that is shared by web pages under said concept type catalogue amounting to a predetermined ratio of the web pages under it.
3. The method according to claim 2, characterized in that, after the step of determining the web page to be identified to be a concept type web page, it further comprises:
adding the web page determined to be said concept type web page to said category.
4. The method according to claim 1, characterized in that, after the step of determining the concept type catalogue, it further comprises:
counting the URI quantity of non-concept type web pages in the catalogues at each level under said concept type catalogue and, where the URI quantity of said non-concept type web pages is greater than a second threshold, determining the catalogue containing said non-concept type web pages to be a non-concept type catalogue.
5. The method according to claim 4, characterized in that the step of determining the web page to be identified to be a concept type web page further comprises:
matching said web page to be identified against each said non-concept type catalogue; if it matches none of said non-concept type catalogues, determining said web page to be identified to be said concept type web page; otherwise, determining said web page to be identified to be a non-concept type web page.
6. The method according to claim 1, characterized in that the step of determining the concept type catalogue comprises:
counting the URI quantity of said concept type web pages under each said website domain name catalogue and comparing it with said first threshold, and, when the URI quantity of said concept type web pages under said website domain name catalogue is greater than said first threshold, determining said website domain name catalogue to be said concept type catalogue; and
counting the URI quantity of the concept type web pages in a subdirectory of said website domain name and comparing it with said first threshold, and, when the URI quantity of the concept type web pages in said subdirectory is greater than said first threshold, determining said subdirectory to be said concept type catalogue; and repeating this operation until the URI quantity of said concept type web pages in the catalogue being counted is not greater than said first threshold or the catalogue being counted has no subdirectory.
7. The method according to claim 6, characterized in that the step of obtaining a plurality of concept type web pages comprises:
obtaining the plurality of said concept type web pages by means of a concept type web page recognizer.
8. A system for identifying concept type web pages, characterized in that it comprises:
an acquisition module, configured to obtain a plurality of concept type web pages from a web database;
a concept type catalogue determination module, configured to compare the URI quantity of the concept type web pages under the catalogues at each level of each website domain name with a first threshold, and to determine as a concept type catalogue each catalogue under which the URI quantity of said concept type web pages is greater than said first threshold; and
a matching and determination module, configured to match the URI of a web page to be identified against each said concept type catalogue and, if it matches, to determine the web page to be identified to be a concept type web page.
9. The system according to claim 8, characterized in that it further comprises: a category determination module, configured to classify all web pages under said concept type catalogue and to determine, as the category of said concept type catalogue, the category that is shared by web pages under said concept type catalogue amounting to a predetermined ratio of the web pages under it; and an adding module, configured to add the web page determined to be said concept type web page to said category.
10. The system according to claim 8, characterized in that it further comprises: a non-concept type catalogue determination module, configured to count the URI quantity of non-concept type web pages in the catalogues at each level under said concept type catalogue and, where the URI quantity of said non-concept type web pages is greater than a second threshold, to determine the catalogue containing said non-concept type web pages to be a non-concept type catalogue.
11. The system according to claim 10, characterized in that said matching and determination module is further configured to match said web page to be identified against said non-concept type catalogues; if it matches none of said non-concept type catalogues, said web page to be identified is determined to be said concept type web page; otherwise, said web page to be identified is determined to be said non-concept type web page.
12. The system according to claim 8, characterized in that said acquisition module uses a concept type web page recognizer to obtain the plurality of said concept type web pages.
13. The system according to claim 8, characterized in that said concept type catalogue determination module comprises:
a first statistical module, configured to count the URI quantity of said concept type web pages under the catalogues at each level of each said website domain name;
a first comparison module, configured to compare the URI quantity of said concept type web pages with said first threshold; and
a first determination module, configured to determine as said concept type catalogue each catalogue under which the URI quantity of said concept type web pages is greater than said first threshold.
14. The system according to claim 10, characterized in that said non-concept type catalogue determination module comprises:
a second statistical module, configured to count the URI quantity of the non-concept type web pages in the catalogues at each level under said concept type catalogue;
a second comparison module, configured to compare the URI quantity of said non-concept type web pages with said second threshold; and
a second determination module, configured to determine, where the URI quantity of said non-concept type web pages is greater than said second threshold, the catalogue containing said non-concept type web pages to be a non-concept type catalogue.
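As an informal illustration of the matching recited in claims 1, 5, 8 and 11, the following sketch treats a catalogue as a URI directory prefix. The function names and sample URIs are assumptions introduced here and do not appear in the claims.

```python
def is_under(uri, directory):
    """True if the URI lies inside the given catalogue (prefix match on a path boundary)."""
    return uri.startswith(directory.rstrip("/") + "/")


def is_concept_page(uri, concept_dirs, non_concept_dirs):
    """A page is concept type if it matches a concept type catalogue
    and matches none of the non-concept type catalogues."""
    if not any(is_under(uri, d) for d in concept_dirs):
        return False
    return not any(is_under(uri, d) for d in non_concept_dirs)


# Hypothetical catalogues for illustration only.
concept_dirs = {"http://A.com/Z"}
non_concept_dirs = {"http://A.com/Z/ads"}
```

Matching on a path boundary (the trailing "/") keeps a catalogue such as `http://A.com/Z` from accidentally matching a different host or a sibling directory whose name merely begins with the same characters.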
CN2008102257626A 2008-11-12 2008-11-12 Method and system for recognizing concept type web pages Active CN101404031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102257626A CN101404031B (en) 2008-11-12 2008-11-12 Method and system for recognizing concept type web pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008102257626A CN101404031B (en) 2008-11-12 2008-11-12 Method and system for recognizing concept type web pages

Publications (2)

Publication Number Publication Date
CN101404031A true CN101404031A (en) 2009-04-08
CN101404031B CN101404031B (en) 2012-05-30

Family

ID=40538043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102257626A Active CN101404031B (en) 2008-11-12 2008-11-12 Method and system for recognizing concept type web pages

Country Status (1)

Country Link
CN (1) CN101404031B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6167370A (en) * 1998-09-09 2000-12-26 Invention Machine Corporation Document semantic analysis/selection with knowledge creativity capability utilizing subject-action-object (SAO) structures
JP4825669B2 (en) * 2003-07-30 2011-11-30 グーグル・インク Method and system for determining the meaning of a document and matching the document with the content
CN101004753B (en) * 2007-01-25 2010-08-11 北京搜狗科技发展有限公司 Method and system for recognizing conception type files

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999576A (en) * 2012-11-13 2013-03-27 北京百度网讯科技有限公司 Method and equipment for confirming page description information corresponding to target pages
CN102999576B (en) * 2012-11-13 2016-08-17 北京百度网讯科技有限公司 For the method and apparatus determining the page-describing information corresponding to target pages

Also Published As

Publication number Publication date
CN101404031B (en) 2012-05-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant