CA2308592A1

CA2308592A1 - Method and system for determining reputation of interlinked pages

Info

Publication number: CA2308592A1
Application number: CA002308592A
Authority: CA
Inventors: Alberto O. Mendelzon; Davood Rafiei
Original assignee: Individual
Current assignee: Individual
Priority date: 2000-05-12
Filing date: 2000-05-12
Publication date: 2001-11-12

Abstract

The invention relates to a method and system for providing a reputation of a page. Reputation is one measure of the degree of preference to be accorded to the page. The page is one of a collection of such electronic pages on a distributed network.
A probabilistic model of a topic search strategy for the pages is used to determine the reputation for the page relating to a topic.

Description

Method And System For Determining Reputation Of Interlinked Electronic Pages FIELD OF THE INVENTION
This invention concerns determining the reputation of a web page.
BACKGROUND OF THE INVENTION
The use of search engines on the World Wide Web (the "Web") such as those provided by YAHOO.COM, GOOGLE.COM, and ALTAVISTA.COM has provided access to a comprehensive collection of information on the Web. Information provided through these search engines primarily consists of pages encoded in hypertext markup language ("web pages") referenced in search engine outputs, which essentially consists of the universal resource locator (URL) of each page.
Ordering the pages returned in response to a search engine query is of growing importance. In the absence of such an ordering, the user may face the prospect of wading through hundred, if not thousands, of resources. For example, a query to the ALTAVISTA.COM search facility yields references to over 1.2 million web pages for the search phrase "search engine"; even the human-maintained YAHOO.COM returned 960 web sites for the same phrase. In practice, it is unlikely that a user will examine in excess of a hundred returned resources; empirical results even indicate that users will not examine more than five to ten pages. Therefore, there is a need for a way to order web pages as returned from a query on a particular topic.
Furthermore, search engines such as YAHOO.COM are, in reality, a hierarchical directory. These search engines require categorization of web pages under the classification scheme of its directory. Such directories employ large contingents of persons to manually categorize new web pages. This can be inefficient, inaccurate and expensive process. The need to automatically classify web pages is obvious.
A number of approaches to solve these problems are in existence. One approach is a recursive method for ranking the importance of a Web page based on the importance of its incoming links. (See S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine", In Proceedings of the 7th International World Wide Web Conference, pages 107--117, Brisbane, Australia, April 1998. Elsevier Science) The ranking is based on simulating the behavior of a "random surfer" who either selects an outgoing link uniformly at random, or jumps to a new page chosen uniformly at random from the entire collection of pages. The rank of a page corresponds to the number of visits the "random surfer" makes to the page. However the disadvantage of this approach is that it ranks all the pages collectively without regard for the topic, which is not helpful to a person seeking information on a particular subject.
Another method is for the search engine to record for each and every query typed by a user the number of subsequent visits made by the user to individual web pages consequent upon such a query. For example, if "java" is entered and the search engine yields and displays results in a particular order; depending on which page is clicked on subsequently, the ranks of these 10 pages are adjusted. When a subsequent user queries "Java", the order of presentation of the 10 web pages may be different.
The main problem with this scheme is that there are millions of pages that can match a particular query, and there is no way to present all of them to the user. The search engine needs to be selective; this makes the ranking more biased toward those pages shown first without real justification. A further problem has to do with the learning mechanism of the engine.
Since the system learns from the previous hits, the system might be useful for its frequent users. However, such users often want to see new pages as well. In addition, the learning mechanism may not be useful for new users and getting a few bad results are enough for a new user to switch to another search engine.
A further method finds authorities on specified topics based on a hubs and authorities model with the assumption that authority is conferred on a page by a set of hub pages, which is recursively defined as a set of pages with links to many relevant authority pages. The weights of these authority and hub pages are then determined by matrix computation.
This approach does not yield a measure of the topics for which a web page is relevant.
Therefore, there exists a need to provide a method to addressing the problems of weighting topics for a chosen web page, and automating classification of web pages.

2 SUMMARY OF THE INVENTION
The invention includes a method for providing a reputation of a page among a set of electronic pages on a distributed network comprising determining a reputation term for the page from a probabilistic model of a topical term search strategy for the pages.
One embodiment of the invention relates to a method to provide a user with a page having a reputation relating to a topical term, the page among a set of electronic pages on a distributed network , the method comprising:
(a) obtaining a topical term from the user;
(b) determining a reputation term for each page from a probabilistic model of the topical term search strategy for the pages; and (c) providing the user with a reputation page class output.
Another embodiment of the invention includes a method to provide a user with a distributed network search output from a search engine, comprising:
a) obtaining a topical term input from the user;
b) providing a search engine output display in response to the topical term input; and c) providing a reputation term display in response to the topical term input, wherein the reputation term display is displayed proximate to the search engine output display and wherein the reputation term for each page of the search engine output display is obtained from a probabilistic model of the topical term search strategy for the pages.
Another embodiment of the invention includes a method to provide a user with a topical term class output of topical terms relating to a page in a set of electronic pages, comprising:
(a) determining the set of topical terms related to the pages of the set of electronic pages.
(b) determining the reputation term for the page for each of the set of topical terms, the reputation term being determined from a probabilistic model of the topical term search strategy for the pages; and (c) providing a topical term class output.

3 Another embodiment of the invention includes a method where the reputation term of the page is determined by a method comprising the following steps:
(a) forming a reputation term matrix based on a plurality of pages and a plurality of topical terms, each element including the random likelihood of the page containing the topical term;
(b) updating the reputation term matrix, so that each element includes the likelihood of visiting the related page in search of the specific topical term after taking a step in a Markovian random walk process;
(c) recursively iterating step (b), one further Markovian step at a time, until the probability term converges; and (d) forming the reputation term as the related element of the converged reputation term matrix.
Another embodiment of the invention includes a method to provide a user with a page having a reputation relating to a topical term, the page among a set of electronic pages on a distributed network , the method comprising:
(a) obtaining a topical term from the user;
(b) determining a referential term for each page from a probabilistic model of the topical term search strategy for the pages; and (c) providing the user with a referential page class output.
Another embodiment of the invention includes where the referential term of the page is determined by a method comprising the following steps:
a) forming a reputation term matrix based on a plurality of pages and a plurality of topical terms, each element being the random likelihood of the page containing the topical term;
b) updating the reputation term matrix, so that each element includes the likelihood of visiting the related page in search of the specific topical term after taking a step in a Markovian random walk process;
c) recursively iterating step (b), one further Markovian step at a time, until the probability term converges; and d) forming the referential term as the related element of the converged reputation term matrix.

The invention includes a system for providing a reputation of a page among a set of electronic pages on a distributed network comprising a) means for determining a reputation term for the page from a probabilistic model of a topical term search strategy for the pages and b) means for providing the user with a reputation term output.
One embodiment of the invention relates to a system to provide a user with a page having a reputation relating to a topical term, the page among a set of electronic pages on a distributed network , the system comprising:
(a) means for obtaining a topical term from the user;
(b) means for determining a reputation term for each page from a probabilistic model of the topical term search strategy for the pages; and (c) means for providing the user with a reputation page class output.
Another embodiment of the invention includes a system to provide a user with a distributed network search output from a search engine, comprising:
(a) means for obtaining a topical term input from the user;
(b) means for providing a search engine output display in response to the topical term input;
and (c) means for providing a reputation term display in response to the topical term input, wherein the reputation term display is displayed proximate to the search engine output display and wherein the reputation term for each page of the search engine output display is obtained from a probabilistic model of the topical term search strategy for the pages.
Another embodiment of the invention includes a system to provide a user with a topical term class output of topical terms relating to a page in a set of electronic pages, comprising:
(a) means for determining the set of topical terms related to the pages of the set of electronic pages.
(b) means for determining the reputation term for the page for each of the set of topical terms, the reputation term being determined from a probabilistic model of the topical term search strategy for the pages; and (c) means for providing a topical term class output.
Another embodiment of the invention includes a system where the system further comprises:
(a) means for forming a reputation term matrix based on a plurality of pages and a plurality of topical terms, each element including the random likelihood of the page containing the topical term;
(b) means for updating the reputation term matrix, so that each element includes the likelihood of visiting the related page in search of the specific topical term after taking a step in a Markovian random walk process;
(c) means for recursively iterating step (b), one further Markovian step at a time, until the probability term converges; and (e) means for forming the reputation term as a related element of the converged reputation term matrix.
Another embodiment of the invention includes a system to provide a user with a page having a reputation relating to a topical term, the page among a set of electronic pages on a distributed network , the system comprising:
(a) means for obtaining a topical term from the user;
(b) means for determining a referential term for each page from a probabilistic model of the topical term search strategy for the pages; and (c) means for providing the user with a referential page class output.
Another embodiment of the invention includes a system comprising:
a) means for forming a reputation term matrix based on a plurality of pages and a plurality of topical terms, each element being the random likelihood of the page containing the topical term;
b) means for updating the reputation term matrix, so that each element includes the likelihood of visiting the related page in search of the specific topical term after taking a step in a Markovian random walk process;

c) means for recursively iterating step (b), one further Markovian step at a time, until the probability term converges; and d) means for forming the referential term as the related element of the converged reputation term matrix.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be described by way of example and with reference to the drawings in which:
FIG. 1 is diagram of preferred embodiments of the present invention.
FIG. 2 is diagram of a preferred embodiment of the present invention involving search engine output.
FIG. 3 is diagram of preferred embodiments of the present invention for determining a topical term class output.
FIG. 4 is diagram of a preferred embodiment of the present invention to determine a reference output class.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The invention involves the notion of the reputation of a page: pages with a good reputation should be given preferential treatment when reporting the results of a search; and that the hyperlink structure of web pages can be mined to extract such reputation measures, on the assumption that a link from page p to page q is, to some degree, an endorsement of the contents of q by the creator of p.
However, there are some difficulties in formalizing the concept of 'reputation' effectively.
The assumption that links are endorsements suggests that the number of incoming links of a page indicates its reputation. But in practice, links represent a wide variety of relationships such as navigation, subsumption, relatedness, refutation, justification, etc. In addition, the focus of this invention concerns not just the overall reputation of a page, but its reputation on certain topics.

This invention involves two methods for determining the reputation of a page, each based on a Markovian random walk model. A random walk on the Web is in the form of navigation between pages, where each page represents a possible state, and each link represents a possible transition. A random walk is Markovian if the transition at each step is independent of the previous steps and it only depends on the current state.
The first method is based on one-level weight propagation where the reputation of a page on a topic is proportional to the sum of the reputation weights of pages pointing to it on the same topic. In other words, links emanating from pages with high reputations are weighted more. For example, a page can acquire a high reputation on a topic because the page is pointed to by many pages on that topic, or because the page is pointed to by some high reputation pages on that topic.
Consider a 'random surfer' who wanders the Web, searching for pages on topic t. At each step, the surfer either jumps into a page uniformly chosen at random from the set of pages that contain the term t, or follows a link uniformly chosen at random from the set of outgoing links of the current page. If the random surfer continues this walk forever, then the number of visits he or she makes to a page is its reputation on t. Pages with relatively high reputations on a topic are more likely to be visited by the random surfer searching for that topic.
A justification for this is that the reputation of a page on a topic naturally depends both on the number of pages on the same topic that point to it, and on the reputations of these pages on the same topic as well. The number of visits the surfer makes to a page depends on the same two factors.
At each step, with probability d the random surfer jumps into a page uniformly chosen at random from the set of pages that contain the term t, and with probability (1-d) he or she follows an outgoing link from the current page. Let Nt denote the total number of pages on the Web that contain the term t. The probability that the surfer at each step visits page p in a random jump is d I N, if page p contains term t and zero otherwise.
To the person skilled in the art, assuming that every web page has at least one outgoing hyperlink, d > 0 and Nt > 0, the above formulation allows determination of two probability values: (1) the probability R"(p,,t) that the surfer visits page p, at step n, and (2) the transitional probability from page p, to page p~. The latter leads to a solution for the equilibrium probability R"(p,,t) for visiting p, for topic t as the number of random steps n approaches infinity by determining the eigenvector of the stochastic matrix associated with eigenvalue 1. Each element of this eigenvector is the stationary probability of a surfer visiting a page p, for topic t.
The reputation (or reputation rank or reputation term) of a web page p associated with a topic t using this first Markovian random walk method is defined as this stationary probability.
In the setting of the Web, the assumption that every page has at least one outgoing link may not be true; there are often pages that have no outgoing link, or the outgoing links may not be valid. A reasonable solution to accommodate these web pages is to implicitly add links from every such page to all pages in the base set, i.e. the set of pages that contain the term. The interpretation is that when the surfer reaches a dead end, he or she jumps to a page in the base set chosen uniformly at random.
The probability d that the random surfer jumps into a page uniformly chosen at random from the set of pages that contain the term t is a value that may be varied according to the prevailing conditions of the web crawling. The value (1-d) is proportional to the average walk length or to the average weight propagation radius. If d is chosen close to the value of 0, the reputation rank of a page on any topic t mainly depends on its linkage structure and is much independent of the topic t. If a value for d close to 1 is selected, only pages that explicitly contain t will have a non-zero reputation rank associated with topic t.
Choosing d between 0 and 1 results in the reputation rank depending on both the linkage structure and the textual content, and the fractions of each varies with d.
The second Markovian random walk approach is based on two-level weight propagation, generalizing the Hubs and Authorities model. In this model, a page is deemed an authority on a topic if it is pointed to by good hubs on the topic, and a good hub is one that points to good authorities.
In order to clearly explain the method, define a transition as one of (a) jump to a page on topic t chosen uniformly at random from the whole collection, or (b) follow an outgoing link of the current page chosen uniformly at random. When the current page is p, the surfer has two choices: either make a transition out of page p, or randomly pick any page q that has a link into page p and make a transition out of page q. The intuitive justification is this: when the surfer reaches a page p that seems useful for topic t, this does not mean that p is a good source of further links; but it does mean that pages q that point to p may be good sources of links, since they already led to page p.
The surfer follows links both forward (out of page p) and backward (into page q). The walk alternates strictly between forward and backward steps, except that after option (a) is chosen, the direction of the next step is picked at random.
If the random surfer continues the walk forever, then the number of forward visits he or she makes to a page is its authority reputation and the number of backward visits he or she makes to a page is its hub reputation. Clearly pages with relatively high authority reputations on a topic are more likely to be visited through their incoming links, and pages with relatively high hub reputations on a topic are more likely to be visited through their outgoing links. The authority reputation of a page p on topic t depends not only on the number of pages on topic t that point to p, but on the hub reputations of these pages on topic t as well.
Similarly, the hub reputation of a page p on topic t depends not only on the number of pages on topic t that page p points to, but on the authority reputations of these pages on topic t as well.
Define the authority reputation of a page p on a topic t as the probability that the random surfer looking for topic t makes a forward visit to page p and the hub reputation of a page p on topic t as the probability that the random surfer looking for topic t makes a backward visit to page p. Under two level influence propagation, the reputation of a web page constitutes the authority reputation and the hub authority.
Suppose at each step, with probability d the random surfer picks a direction and jumps into a page uniformly chosen at random from the set of pages on topic t, and with probability (1 - d) the surfer follows a link. The probability that at each step the surfer makes a forward visit (and similarly a backward visit) to page p in a random jump is d/2N, if page p contains term t and zero otherwise.
To the person versed in the art, assuming that every page has an outgoing and an incoming hyperlink, d > 0 and N~ > 0, this formulation allows determination of a number of probabilistic values: (1) the probabilities A"(p,,t) that the surfer makes a forward visit and H"-'(p,,t) that the surfer makes a backward visit, to page p at step n, and (2) the transitional ID

probabilities for a forward and a backward visit from page page p, to page p~.
The latter lead to a solution for the equilibrium probabilities A"(p,,t) of visiting p, in the case of a forward visit and H"(p,,t) of a backward visit for topic t as the number of steps n approaches infinity by determining the eigenvector of the stochastic matrix associated with eigenvalue 1. The elements of this eigenvector are the stationary probabilities of a surfer visiting the pages p, for topic t. The authority reputation (or authority reputation rank) and hub reputation (or hub reputation rank) of a web page p associated with a topic t using this second Markovian random walk method is defined as these stationary probabilities.
In the setting of the Web, the assumption that a page has at least one incoming link and one outgoing link may not hold. One solution to accommodate such pages is to assign a hub (or authority) rank of zero for every page with no outgoing (incoming) links.
The random walk process accordingly can be modified by restricting random jumps only to pages with at least either one incoming link or one outgoing link. Another solution is to add a link from every page to itself. This ensures that every page will acquire a fixed minimum authority and hub rank on topics of the page independent from its links.
The probability d that the random surfer jumps into a page uniformly chosen at random from the set of pages -- whether forward or backward -- that contain the term t is a value that may be varied according to the prevailing conditions of the web crawling. The earlier discussion on this matter in the case of one level influence propagation applies equally.
The above methods allowing determination of the reputation for all linked web pages on the world wide web and all topics. In reality, this may be a computationally unfeasible and objectively unnecessary task. According to two preferred embodiments of this invention, the reputations of a collection of web pages, possibly the contents of a directory of web pages, for all terms that appear in the content of the pages may be determined by using the following examples of approaches associated with the one-level and two-level influence propagation models (Figure 1 ). The user, whether a human person or not as in the case of an automatic process, starts the process embodying the invention and enters as input a topical term. The output consists of a display of the pages, possibly each page next to its reputation term for the topical term. The display may be visual in form, such as screen or printed output, or it may be non-visual, such as that for an automatic process. When displayed, the display may be by way of its URL or another representation, such as an icon or an annotated hyperlink. Similarly, a representation other than a numerical value may indicate the reputation term, for example, a number of icons (whether identical or not), an adjective, or a noun. These extended definitions apply for all embodiments of this invention.
The reputation ranks in the one-level influence propagation model are in the form of a sparse matrix, say R, with rows representing Web pages and columns denoting each term or phrase that appears in some document (after removing stop words, etc.) The process involves initializing R and repeatedly updating it until convergence.
Example 1: Determining one-level Reputation Ranks For every page p and term t, Initialize R(p, t) =1 / N, if t appears in page p; otherwise R( p, t) = 0 .
While R has not converged, Set R'(p,t) = 0 for every page p and term t, For every link q ~ p , R'( p,t) = R'( p,t) + R(q, t) l O(q) R( p, t) _ (1- d )R'( p, t) for every page p and term t, R( p, t) = R( p, t) + d l N~ if term t appears in page p.
Each column of R converges to the principal eigenvector of the matrix of transition probabilities for a term t and eigenvalue one. The principal eigenvector associated to each term is the stationary distribution of pages in the random walk process.
The reputation in the two-level influence propagation model can be represented in the form of two sparse matrixes, say H and A, respectively denoting the hub and the authority reputations of pages.
Example 2: Determining Two-Level Reputation Ranks For every page p and term t, Initialize H( p, t) = A( p, t) =1 / 2N, if t appears in page p; otherwise H( p, t) = A( p, t) = 0 .
While both H and A have not converged, Set H'( p, t) = A'( p, t) = 0 for every page p and term t, For every link q -~ p , H'(q,t) = H'(q,t)+A(p,t)lI(p) A'( p, t) = A'( p, t) + H(q, t) l O(q) H( p, t) _ (1- d )H'( p, t) and A( p, t) _ (1- d )A'( p, t) for every page p and term t, H( p, t) = H( p, t) + d l 2N, and A( p, t) = A( p, t) + d l 2Nt if term t appears in page p.

Again, the processing for each term is guaranteed to converge to the principal eigenvector of the matrix of transition probabilities for that term and eigenvalue one. The principal eigenvector is the stationary distribution provided every page has at least one incoming and one outgoing link.
Using the above methods, it is theoretically possible to determine the reputation for all linked web pages on the world wide web (or a large collection of the web pages) and all topics (or a large subset thereof). However in reality, this may be a computationally unfeasible and objectively unnecessary task. According to another four preferred embodiments with the same functional objectives as the four embodiments just outlined (two models with two ways of processing each), approximations to the true reputation ranks may be determined. Given a page p and a parameter d > 0, and the reputations of the page within the one-level influence propagation model is sought, if the page acquires a high rank on an arbitrarily chosen term t carrying out the full process described by Example 1, then at least one of the following must hold:
(1) term t appears in page p, (2) many pages on topic t point to p, or (3) there are pages with high reputations on t that point to p.
This observation provides a practical way of identifying the candidate terms.
Start from page p and collect all terms that appear in it. Then examine at the incoming links of the page and collect all possible terms from those pages. Continue this process until either there is no incoming link or the incoming links have very small effects on the reputations of page p.
Denoting the maximum number of iterations by k, the method can be expressed as follows:
Example 3: Approximating One-Level Reputation R(p,t) = d lNr for every term t that appears in p For 1=1,2,...,k d' = d if l < k , 1 otherwise For every path q' -~ ~ ~ ~ ~ q' ~ p of length 1 and every term t in page q, , R( p, t) = 0 if term t has not been seen before R( p, t) = R( p, t ) + ((1- d )' / II l=, O(q' ))(d' l Nr ) The parameter k can be chosen such that (1 - d)k becomes very close to zero;
i.e. there t3 is no need to look at a page if the terms that appear in the page have little or no effect in the reputations of page p.
Similarly, the hub and the authority reputations of a page can be approximated within the two-level influence propagation model as follows:
Example 4: Approximating Two-Level Reputation H( p, t) = A( p, t) = d l 2Nr for every term t that appears in p For l =1,2,...,k d' = d if l < k , 1 otherwise If I is odd For every path qr -~ qr_1 <-- qr_Z ~ ~ ~ ~ p of length I and every term t in page q, , A( p, t) = 0 if term t has not been seen before A(P~ t) = A(P~ t) + ((1- d ) r l (~(qr )1 (qr-i ) ~ . . p(qi )))d ~ l(2Nr ) For every path p -~ ~ ~ ~ qr_x ~ qr-~ ~ 9l of length I and every term t in page qr , H( p, t) = 0 if term t has not been seen before H(P~ t) = H(P~ t ) ~- ((1- d )r l(I (qr )~(qr-i ) ~ . . I(q~ )))d ~ l(2Nr ) else For every path qr <- qr_1 ~ ~ ~ ~ -~ p of length I and every term t in page gr , A( p, t) = 0 if term t has not been seen before A(P~ t) = A(P~ t) + ((i d )r l(1 (9r )~(qr-i ) ~ . .))d~ /(2Nr ) For every path p -~ ~ ~ ~ -> qr_1 E- qr of length I and every term t in page qr , H(p,t) = 0 if term t has not been seen before H(P~ t) = H(P~ t) + ((l d ) r l (~(qr )I (qr-~ ) ~ . .))d ~ /(2Nr ) Examples 3 and 4 are approximations of Examples 1 and 2 in the following respects.
First, for a given page p, instead of examining all pages from which a user can reach p within a random walk, only pages that have significant influences are examined. The claim here is that a highly-ranked page q which is far away from p cannot have a significant influence on p as long as (1 - d)k is chosen close to zero. To put it in another way, Examples 3 and

4 simulate Examples 1 and 2 just for k iterations. If Examples 1 and 2 converge within k iterations, both approximate and exact computations will give the same results. Secondly, instead of computing the ranks for all pages, the rank is only computed for a given page p.
In both Examples 3 and 4, a breadth-first search of the pages that can affect the reputations of a page p is adopted, i.e. all pages within depth I are visited before any page in depth I+1. A benefit of this ordering is that the user can stop the search at any point and be sure that pages that are expected to have a high influence on p are visited.
This may happen, for example, if the search takes longer than expected. However, it should be noted that the number of outgoing or incoming links for each page being visited must be remembered, if this information is not already stored. An alternative to a breadth-first search is to conduct a depth-first search, if it can be assumed that always the same set of pages with the same ordering are visited in each level along the paths. The only benefit of such a search is that only the current path needs to be remembered. However, this assumption may not hold, for example, when pages are retrieved from real search engines. In addition, there is the danger of spending most of the time on pages that have a very small effect on the reputations of page p before visiting more important pages.
To the person skilled in the art, the approaches outlined in Examples Three and Four, in terms of the how the reputation term for a topical term of a page is generated, are easily adaptable to be used for determining reputation terms for a set of pages relating to a topical term as in the first four preferred embodiments. The topical term t is held constant during the actual processing of a modified version of the Examples.
According to another four preferred embodiments of the invention the topics for which a web page p has highest reputation rank are identified (Figure 3). Thus the user, whether human or non-human, enters the URL of the web page p, and the output of the embodiments is a list of topics that the web page has high reputation for, i.e. the topical term class output One of Examples One to Four is utilized under each of the embodiment to generate the converged reputation matrix or matrices, the columns of which represent the stationary probability distribution corresponding to the topics (or an approximation thereof), i.e. the reputation ranks of all the pages for the topics. The preferred embodiment goes on to determine and indicate the web pages corresponding to the highest reputation rank values in the matrix or matrices for the specified topic t.
In terms of rank determinations, the earlier-mentioned embodiments corresponding to Examples One and Two determine the reputations of every page p on every topic t. Therefore, the highly-weighted pages for a given topic can be easily identified. In practice, however, full processing for every possible term may be very expensive; or an approximate solution might be ~5 as good as an exact solution. The earlier-mentioned embodiments adopting Examples Three and Four approximately find the topics on which a page has a high reputation.
Two preferred embodiments to determine approximately web pages with relatively high reputations on a given topic are now described.
Given a topic t, an arbitrarily chosen page p can potentially acquire a relatively high rank, within the one-level influence propagation model, on topic t if at least one of the following hold:
(1) term t appears in page p, (2) many pages on topic t point to p, or (3) there are pages with high reputations on t that point to p.
Thus, a page with high reputation on topic t must either contain term t or be reachable within a few steps from a large set of pages on topic t. An approximate way of determining the one-level reputation ranks of pages on topic t is as follows:
Example 5: Approximating One-Level Reputation (a) identify pages that are either on topic t or reachable within a short distance from a page on topic t;
(b) construct the matrix U of stochastic transition probabilities for the resulting set of pages; and (c) compute the principal eigenvector of UT for eigenvalue 1.
The principal eigenvector will give the approximate reputation ranks of pages that are expected to have high reputations, i.e. every page which is not identified in Step (a) is assumed to have a rank of zero.
For the two-level influence propagation model, given a topic t, an arbitrarily chosen page p can acquire a relatively high rank on topic t if either term t appears in page p or it is reachable within a short path of alternating forward and backward links (or vice versa) from a large set of pages on topic t. An approximate way of determining the two-level reputation ranks of pages on topic t is as follows:
Example 6: Approximating Two-Level Reputation (a) identify pages that are either on topic t or reachable within few steps from a page on topic t, alternately following links forward and backward or vice versa;
(b) construct the matrix U of transition probabilities for the resulting set of pages;
(c) compute the principal eigenvector of UT for eigenvalue 1.
The principal eigenvector will give the approximate reputation ranks of pages that are expected to have high reputations. Again, every page which is not identified in Step (a) is assumed to have a rank of zero.
With another embodiment of the present invention, the categorization process of web pages, which is otherwise performed manually, may be automated by adopting one of the above-mentioned approaches to determining the list of topics that a web page has high reputation for. This list of topics, once produced, may be used in the categorization process, either as a sole means of classification modifying the directory without human intervention, or as an aid to further processing whether manual or automatic. Variations predicated on the particular process of determination as shown in the prior Examples are clear to the skilled person in the art, as for the following preferred embodiments.
According to another embodiment of the invention, this invention may be used as a means of ranking web pages corresponding to a chosen topic (Figure 2). As discussed earlier, there are a number of such ranking methodologies in the market used by search engines.
Given a particular topic, the reputation rank of a page is proportional to the degree of preference which should be accorded to the page. A page with a higher reputation rank on the topic should thus be placed higher in the list of web pages presented to the searcher. The embodiment includes one component, implementing one-level or two-level influence propagation, to determine the reputation of all pages for the chosen topic, preferably that which produces approximately the reputation ranks of the web pages for the topic.
Another component displays to the user the distributed network search output, pages returned by the search engine for the specified topical term and the reputation values.
Instead of replacing a ranking scheme for web pages, a preferred embodiment uses the reputation of web pages to complement or supplement a pre-existing ranking scheme. Thus two or more schemes may be used to offer the searcher either a broader selection of web pages ,'1 than that obtainable with only one scheme, or multiple measures of the relevance of a web page presented. In either case, it involves first the determination of the reputation rank of the pages corresponding to the chosen topic. This is followed by combining (removing redundancies) and presenting (ordered by either schemes) to the searcher these web pages with the set of pages produced by another scheme. In the case where ranking by reputation is to complement an existing or an alternative scheme, the reputation ranks of web pages are preferably displayed along with whatever numerical ranking values are produced by the other scheme. This offers to the searcher two or more guides to relevance.
According to another preferred embodiment, the owner or operator of a web page is enabled to perform a more comprehensive assessment of its web services.
Organizations routinely expend a great deal of effort and money in determining how they are perceived by the public; evaluating the reputation of their Web site on specific topics, or determining those topics on which its reputation is highest (or abnormally low) could be a valuable part of this self-evaluation. A variation of this embodiment involves an assessment of a web page by a third party for the purpose of market analysis and comparison.
Another preferred embodiment determines the referential term of a page (or pages) using hub reputation to rank the value of the page in providing links to web pages on a topic (Figure 4). Since hub reputation is the probabilistic measure of a surfer seeking content on a topic of making a backward link visit to a web page, it is reflective of the worth of a web page as a source of links to relevant pages on the topic. This is of use to web pages which aim to provide hyperlinks to specified topics, as for example web pages on niche subject matters, as a means of assessing the quality of the service. The user enters the topical term, which initiates the processing of the page or pages, resulting in the display of a referential output class.
Although the above preferred embodiments refer to a single web page, it is clear that one or more web pages taken collectively, such as those comprising the pages of a web site, may be the subject matter of a reputation rank or topics determination. The methodology or methodologies listed above may determine reputation ranks using, instead of the links which emerge from a single web page, all links from the pages that go out of the web site; hyperlinks coming to any one of the web pages are taken collectively; and content of any one is attributed to the collective. Consequently, the reputation of the collection is assessed en semble. Of course, this can be artificial and misleading in the case of a massive collection of unrelated web Ig pages; however, for a small number of related web pages taken as a whole, such as those of a web site, this can be useful.
The preferred embodiments of this inventions further include a software product to provide a reputation of a page among a set of electronic pages on a distributed network by determining a reputation term for the page from a probabilistic model of a topical term search strategy for the pages. The software product is preferably executed on a computer to implement the methods of the invention as set out above.
Additional preferred embodiments take the form of computer-readable media such as diskettes, CD-ROM's, tapes, DVD disks, ROM chips, etc., containing a computer-executable software product implementing a method of the invention as described earlier.
Other possible embodiments of this invention need not be solely addressed to pages encoded in hypertext markup language available on the world wide web. Provided that a holder of content is uniquely addressable and possessing of links to other such content holders, the invention may be employed to determine the reputation of the content holder for a topic in a finite collection of such content holders.
For example, the content holder may be other Internet resources, such as FTP
files, Usenet entries, and even email messages. All of these are of unique addresses (the URL), and are capable of hyperlinking to other such Internet resources. Another possible embodiment of the invention is in the context of intranets and extranets, where Internet-capable resources such as hypertext pages and FTP files are made available in a network of one or more computers employing TCP/IP network communication protocol. In fact, any computer-based entity, such as a database object, may be made the subject matter of reputation determination by using this invention, whether the computer be stand-alone or existing as one node in a distributed network. Tangible documents may also be subject to analysis by way of this invention, such as published journal articles or judicial case reports, which make references to other such documents. In the case of the latter, a possible use of the reputation scheme is to gauge the authoritative or persuasive nature of cases on a particular subject matter or in general.
The description of preliminary evaluations of this invention follows. A
simplified version of Example 3 (and also part of Example 4 that computes the authority reputation of a page) 1 °I

where k was set to 1, d to 0.10 and O(q; ) for every page q, to 7.2, the estimated average number of outgoing links of a page (R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, "Extracting large-scale knowledge bases from the Web", In Proceedings of fihe 25th International Conference on Very Large Databases, pages 639--650, September 1999.). The best value for parameter d needs to be determined empirically. Further details of the implementation are as follows:
(1) Only a limited number of incoming links are examined; at most 500 incoming links of a page are obtained, but the number of links returned by the search engine, currently ALTAVISTA.COM, can be less than that.
(2) For each incoming link, terms are extracted from the 'snippet' returned by the search engine, rather than the page itself. A justification for this is that the snippet of a page, to some degree, represents the topic of the page. In addition, the number of distinct terms and as a result the number of count queries needed to be sent to the search engine are dramatically reduced.
(3) Internal links and duplicate snippets are removed.
(4) Stop words and every term t with N< < (1 + r x L) are removed, where L is the number of incoming links collected and r is the near-duplicate ratio of the search engine, currently set to 0.01. This reduces the number of count queries and also removes unusual terms such as 'AAATT' that rarely appear in any page but might acquire high weights.
Experiments with the prototype, called TOPIC, are set out.
In the first experiment, a set of known authoritative pages on queries (Java) and (+censorship +net), as reported by Kleinberg's HITS algorithm (J. M.
Kleinberg, "Authoritative sources in a hyperlinked environment", In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, pages 668--677, January 1998) was picked, and TOPIC computed the topics that each page was an authority on. As shown in Table 1, the term 'java' is the most frequent term among pages that point to an authority on Java. There are other frequent terms such as 'search' or'Microsoft' which have nothing to do with the topic; their high frequency represents the fact that authorities on Java are frequently cocited with search engines or Microsoft. This usually happens in cases where the number of links examined is much less than the number of links available. However, the highly-weighted terms for each page in both Tables 1 and 2 largely describe the topics that the page is an authority on consistently with the results of HITS.
URL : java.sun.com -- 500 links examined (out of 128653 available) Highly weighted terms: Developers, JavaSoft, JDK, Java applets, Sun Microsystems, API, Programming, Solaris, tutorial Frequent terms: Java, Software, Computer, Programming, Sun, Development, Microsoft, Search URL : sunsite.unc.edu~avafaq~avafaq.html Highly weighted terms: Java FAQ Java, comp.lang.java FAQ, Java Tutorials, Java Stuff, Applets, IBM Java, Javasoft, Java Resources, API Java, Learning Java Frequent terms: FAQ, Sun, Computer, Language, Tutorial, Java FAQ, Software Table 1. Authorities on (Java) URL : ~.eff.org -- 500 links examined (out of 181899 available) Highly weighted terms: Anti-Censorship, Join the Blue Ribbon, Blue Ribbon Campaign, Electronic Frontier Foundation, Free Speech URL : www.cdt.org --500 links examined (out of 12922 available) Highly weighted terms: Center for Democracy and Technology, Communications Decency Act, Censorship, Free speech, Blue Ribbon, Syllabus, encryption URL : www.vtw.org -- 500 links examined (out of 7948 available) Highly weighted terms: decision is near in the fight to overturn the Communication Decency Act, Blue Ribbon Campaign, Censorship, American Civil Liberties Union, free speech URL : www.aclu.org -- 500 links examined (out of 22087 available) Highly weighted terms: ACLU, American Civil Liberties Union, Communications Decency Act, Amendment, CDA, Criminal Law, Censorship Table 2. Authorities on (+censorship +net) In another experiment, Inquirus is used (S. Lawrence and C.L. Giles, "Context and page analysis for improved {Web} search", IEEE Internet Computing, 2(4):38--46, 1998), the NECI
meta-search engine, which computes authorities using an unspecified algorithm.
Inquirus was provided with the query ('data warehousing') and set the number of hits to its maximum, which was 1,000, to get the best authorities, as suggested by the system. The top four authorities returned by Inquirus was picked and used with the TOPIC system to compute the topics those pages have high reputations on. The result, as shown in Table 3, is again consistent with the 2~

judgments of Inquirus.
URL : www.dw-institute.com -- 390 links examined (out of 785 available) Highly weighted terms: TDWI, Data Warehousing Information Center, www.dw-institute.com, Data Warehousing Institute, data warehouse URL : pwp.starnetinc.comllarryg -- 500 links examined (out of 1017 available) Highly weighted terms: Data Warehousing Information Center, OLAP and Data, Analytical Processing, Data Mining, data warehouse, Decision Support Systems URL : www.datawarehousing.com -- 188 links examined (out of 229 available) Highly weighted terms: Data Warehousing Information, OLAP, Data Mining URL : www.dmreview.com -- 270 links examined (out of 1258 available) Highly weighted terms: Data Warehouse 100, Powel Publishing, Review Magazine, data Warehousing, Business Intelligence, Cognos, Data Mining, Product Review Table 3. Authorities on ('data warehousing') In another experiment, a set of personal home pages was selected and TOPIC
used to find the high reputation topics for each page. This was expected to describe in some way the reputation of the owner of the page. The results are shown in Table 4 are revealing. Tim Berners-Lee's reputation on the 'History of the Internet,' Don Knuth's fame on 'TeX' and 'Latex' and Jeff Ullman's reputation on 'database systems' and 'programming languages' are to be expected. The humour site Dilbert Zone seems to be frequently cited by Don Knuth's fans.
Alberto Mendelzon's high reputation on 'data warehousing,' on the other hand, is mainly due to an online research bibliography he maintains on data warehousing and OLAP in his home page, and not to any merits of his own.
URL : www.w3.orglPeoplelBerners-Lee -- 500 links examined (out of 933 available) Highly weighted terms: History Of The Internet, Tim Berners-Lee, Internet History, URL : www cs-faculty.stanford.edul knuth -- 500 links examined (out of 1733 available) Highly weighted terms: Don Knuth, Donald E Knuth, TeX, Dilbert Zone, Latex, ACM
URL : www db.stanford.edul ullman -- 238 links examined (out of 466 available) Highly weighted terms: Jeffrey D Ullman, Database Systems, Database Management, Data Mining, Programming Languages, Computer Science, Stanford University URL : www.cs.toronto.edul mendel -- 139 links examined (out of 259 available) Highly weighted terms: Alberto Mendelzon, Data Warehousing and OLAP, SIGMOD, DBMS
Table 4. Personal home pages In the last experiment, the home pages of a number of Computer Science Departments on the Web are selected. The main characteristic of these pages is that the sites are unregulated, in the sense that users store any documents they desire in their own pages. The results are shown in Table 5. The Computer Science Department at the University of Toronto has a high reputation on 'Russia' and 'Images,' mainly because a Russian graduate student of the department has put online a large collection of images of Russia, and many pages on Russia link to it. The high reputation on 'hockey' is due to a former studentwho used to play on the Canadian national women's hockey team. The Faculty of Mathematics, Computer Science, Physics and Astronomy at the University of Amsterdam (www.wins.uva.nl) has a high reputation on 'Solaris 2 FAQ' because the site maintains a FAQ on the Solaris operating system. It also has a high reputation on the musician Frank Zappa because it has a set of pages dedicated to him and the FAQ of the alt.fan.frank-zappa newsgroup. The Computer Science Department of the University of Helsinki (www.cs.helsinki.fi) has a high reputation on Linux because of the many pages on Linux that point to Linus Torvalds's page.
URL : wvvw.cs.toronto.edu - 500 links examined (out of 7814 available) Highly weighted terms: Russia, Computer Vision, Linux, Images, Orthodox, Hockey URL : www.wins.uva.nl - 500 links examined (out of 6174 available) Highly weighted terms: Solaris 2 FAQ, Wiskunde, Frank Zappa, FreeBSD, Recipes URL : www.cs.helsinki.fi - 500 links examined (out of 9664 available) Highly weighted terms: Linux Applications, Linux Gazette, Linux Software, Knowledge Discovery, Linus Torvalds, Data Mining Table 5. Computer science departments It will be appreciated that the above description relates to the preferred embodiments by way of example only. Many variations on the apparatus for delivering the invention will be obvious to those knowledgeable in the field, and such obvious variations are within the scope of the invention as described and claimed, whether or not expressly described.
All publications referred to in this paper are incorporated by reference in their entirety.
z3

Claims

We claim:

1. A method for providing a reputation of a page among a set of electronic pages on a distributed network comprising determining a reputation term for the page from a probabilistic model of a topical term search strategy for the pages.

2. A method of providing a user with a page having a reputation relating to a topical term, the page among a set of electronic pages on a distributed network, the method comprising:
(a) obtaining a topical term from the user;
(b) determining a reputation term for each page from a probabilistic model of the topical term search strategy for the pages; and (c) providing the user with a reputation page class output.

3. The method of claim 2, wherein the reputation page class output display comprises each page proximate to the reputation term for each page, wherein the pages and reputation terms are displayed sequentially according to the values of the reputation term.

4. The method of claim 3, wherein the order of the pages in the reputation page class output display is descending so that pages having higher reputation term values are displayed to the user before pages having lower reputation term values.

5. A method of providing a user with a distributed network search output from a search engine, comprising:
(a) obtaining a topical term input from the user;
(b) providing a search engine output display in response to the topical term input;
and (c) providing a reputation term display in response to the topical term input, wherein the reputation term display is displayed proximate to the search engine output display and wherein the reputation term for each page of the search engine output display is obtained from a probabilistic model of the topical term search strategy for the pages.

6. A method of providing a user with a topical term class output of topical terms relating to a page in a set of electronic pages, comprising:

(a) determining the set of topical terms related to the pages of the set of electronic pages.
(b) determining the reputation term for the page for each of the set of topical terms, the reputation term being determined from a probabilistic model of the topical term search strategy for the pages; and (c) providing a topical term class output.

7. The method of claim 6, wherein the topical term class output comprises a plurality of topical terms associated with the highest values for the reputation term, the number being pre-determined by the user.

8. The method of any of claims 1 to 7, wherein the probabilistic model comprises a Markov model.

9. The method of claim 8, wherein the Markov model determines the reputation term associated with one-level influence propagation.

10. The method of claim 9, wherein the Markov model determines the reputation term as the authority reputation term associated with two-level influence propagation.

11. The method of any of claims 1 to 10, wherein the reputation term of the page is determined by a method comprising the following steps:
(a) forming a reputation term matrix based on a plurality of pages and a plurality of topical terms, each element including the random likelihood of the page containing the topical term;
(b) updating the reputation term matrix, so that each element includes the likelihood of visiting the related page in search of the specific topical term after taking a step in a Markovian random walk process;
(c) recursively iterating step (b), one further Markovian step at a time, until the probability term converges; and (d) forming the reputation term as the related element of the converged reputation term matrix.

12. The method of any of claims 2 to 10, wherein the reputation term of the page is determined by the following steps:
(a) forming a transition probability matrix from a plurality of pages.

(b) determining the principal eigenvector of the transition probability matrix; and (c) forming the reputation term as a related element of the eigenvector.

13. The method of any of claims 2 to 10, wherein the reputation term of the page is determined by the following steps:
(a) forming a probability term including the random likelihood of the page containing the topical term.
(b) updating the probability term for all pages which are proximate to the page, based on the likelihood of the user visiting the page in search of the topical term;
and (c) forming the reputation term as the probability term.

14. The method of any of claims 2 to 10, wherein the reputation term of the page is determined by the following steps:
(a) forming a stochastic matrix based on the number of pages which contain the topical term or are proximate to the pages;
(b) determining a principal eigenvector of the transition probability matrix;
and (c) forming the reputation term as a related element of the eigenvector.

15. A method of providing a user with a page having a reputation relating to a topical term, the page among a set of electronic pages on a distributed network, the method comprising:
(a) obtaining a topical term from the user;
(b) determining a referential term for each page from a probabilistic model of the topical term search strategy for the pages; and (c) providing the user with a referential page class output.

16. The method of claim 15, wherein the referential class output display comprises the pages proximate to the referential term for each page, wherein the pages and referential terms are displayed sequentially according to the values of the referential term.

17. The method of claim 16, wherein the order of the pages in the referential page class output display is descending so that pages having higher referential term values are displayed to the user before pages having lower referential term values.

18. The method of any of claims 15 to 17, wherein the probabilistic model comprises a Markov model.

19. The method claims 18, wherein the Markov model determines the referential term as the hub reputation term associated with two-level influence propagation.

20. The method of claim 18 or 19, wherein the referential term of the page is determined by a method comprising the following steps:
(a) forming a reputation term matrix based on a plurality of pages and a plurality of topical terms, each element being the random likelihood of the page containing the topical term;
(b) updating the reputation term matrix, so that each element includes the likelihood of visiting the related page in search of the specific topical term after taking a step in a Markovian random walk process;
(c) recursively iterating step (b), one further Markovian step at a time, until the probability term converges; and (d) forming the referential term as the related element of the converged reputation term matrix.

21. The method of claim 18 or 19, wherein the referential term of the page is determined by the following steps:
(a) forming a transition probability matrix from a plurality of pages;
(b) determining the principal eigenvector of the transition probability matrix; and (c) forming the referential term as an element of the eigenvector related to the page.

22. The method of claims 18 or 19, wherein the referential term of the page is determined by the following steps:
(a) forming a probability term including the random likelihood of the page containing the topical term.

(b) updating the probability term for all pages which are proximate to the page, based on the likelihood of the user visiting the page in search of the topical term;
and (c) forming the referential term as the probability term.

23. The method of claim 18 or 19, wherein the referential term of the page is approximated by the following steps:
(a) forming a stochastic matrix based on the number of pages which contain the topical term or are proximate to the pages;
(b) determining the principal eigenvector of the transition probability matrix; and (c) forming the referential term as a related element of the eigenvector.

24. A method as in any one of claims 1 to 23, wherein the page is from a plurality of pages from a web site and the reputation term is determined for the web site, wherein a link to any one of the plurality of pages comprises a links to the page, and a link from any of the page to a page not of the plurality comprises a link from the page.

25. The method as in any one of claims 1 to 24, wherein the reputation term is determined with a computer.

26. A system for providing a reputation of a page among a set of electronic pages on a distributed network comprising a) means for determining a reputation term for the page from a probabilistic model of a topical term search strategy for the pages and b) means for providing the user with a reputation term output.

27. A system for providing a user with a page having a reputation relating to a topical term, the page among a set of electronic pages on a distributed network, the system comprising:
(a) means for obtaining a topical term from the user;
(b) means for determining a reputation term for each page from a probabilistic model of the topical term search strategy for the pages; and (c) means for providing the user with a reputation page class output.

28. The system of claim 27, wherein the reputation page class output display comprises each page proximate to the reputation term for each page, wherein the pages and reputation terms are displayed sequentially according to the values of the reputation term.

29. The system of claim 28, wherein the order of the pages in the reputation page class output display is descending so that pages having higher reputation term values are displayed to the user before pages having lower reputation term values.

30. A system for providing a user with a distributed network search output from a search engine, comprising:
(a) means for obtaining a topical term input from the user;
(b) means for providing a search engine output display in response to the topical term input; and (c) means for providing a reputation term display in response to the topical term input, wherein the reputation term display is displayed proximate to the search engine output display and wherein the reputation term for each page of the search engine output display is obtained from a probabilistic model of the topical term search strategy for the pages.

31. A system for providing a user with a topical term class output of topical terms relating to a page in a set of electronic pages, comprising:
(a) means for determining the set of topical terms related to the pages of the set of electronic pages.
(b) means for determining the reputation term for the page for each of the set of topical terms, the reputation term being determined from a probabilistic model of the topical term search strategy for the pages; and (c) means for providing a topical term class output.

32. The system of claim 31, wherein the topical term class output comprises a plurality of topical terms associated with the highest values for the reputation term, the number being pre-determined by the user.

33. The system of any of claims 26 to 32, wherein the probabilistic model comprises a Markov model.

34. The system of claim 33, wherein the Markov model determines the reputation term associated with one-level influence propagation.

35. The system of claim 34, wherein the Markov model determines the reputation term as the authority reputation term associated with two-level influence propagation.

36. The system of any of claims 25 to 35, wherein the system further comprises:
(a) means for forming a reputation term matrix based on a plurality of pages and a plurality of topical terms, each element including the random likelihood of the page containing the topical term;
(b) means for updating the reputation term matrix, so that each element includes the likelihood of visiting the related page in search of the specific topical term after taking a step in a Markovian random walk process;
(c) means for recursively iterating step (b), one further Markovian step at a time, until the probability term converges; and (d) means for forming the reputation term as a related element of the converged reputation term matrix.

37. The system of any of claims 27 to 35, wherein the system further comprises:
(a) means for forming a transition probability matrix from a plurality of pages.
(b) means for determining the principal eigenvector of the transition probability matrix; and (c) means for forming the reputation term as a related element of the eigenvector.

38. The system of any of claims 27 to 35, wherein the system further comprises:
(a) means for forming a probability term including the random likelihood of the page containing the topical term.
(b) means for updating the probability term for all pages which are proximate to the page, based on the likelihood of the user visiting the page in search of the topical term; and (c) means for forming the reputation term as the probability term.

39. The system of any of claims 27 to 35, wherein the system further comprises:
(a) means for forming a stochastic matrix based on the number of pages which contain the topical term or are proximate to the pages;

(b) means for determining a principal eigenvector of the transition probability matrix;
and (c) means for forming the reputation term as a related element of the eigenvector.

40. A system for providing a user with a page having a reputation relating to a topical term, the page among a set of electronic pages on a distributed network , the system comprising:
(a) means for obtaining a topical term from the user;
(b) means for determining a referential term for each page from a probabilistic model of the topical term search strategy for the pages; and (c) means for providing the user with a referential page class output.

41. The system of claim 40, wherein the referential class output display comprises the pages proximate to the referential term for each page, wherein the pages and referential terms are displayed sequentially according to the values of the referential term.

42. The system of claim 41, wherein the order of the pages in the referential page class output display is descending so that pages having higher referential term values are displayed to the user before pages having lower referential term values.

43. The system of any of claims 40 to 42, wherein the probabilistic model comprises a Markov model.

44. The system claims 43, wherein the Markov model determines the referential term as the hub reputation term associated with two-level influence propagation.

45. The system of claim 43 or 44, further comprising:
(a) means for forming a reputation term matrix based on a plurality of pages and a plurality of topical terms, each element being the random likelihood of the page containing the topical term;
(b) means for updating the reputation term matrix, so that each element includes the likelihood of visiting the related page in search of the specific topical term after taking a step in a Markovian random walk process;
(c) means for recursively iterating step (b), one further Markovian step at a time, until the probability term converges; and (d) means for forming the referential term as the related element of the converged reputation term matrix.

46. The system of claim 43 or 44, further comprising:
(a) means for forming a transition probability matrix from a plurality of pages;
(b) means for determining the principal eigenvector of the transition probability matrix; and (c) means for forming the referential term as an element of the eigenvector related to the page.

47. The system of claims 43 or 44, further comprising:
(a) means for forming a probability term including the random likelihood of the page containing the topical term.
(b) means for updating the probability term for all pages which are proximate to the page, based on the likelihood of the user visiting the page in search of the topical term; and (c) means for forming the referential term as the probability term.

48. The system of claim 43 or 44, further comprising:
(a) means for forming a stochastic matrix based on the number of pages which contain the topical term or are proximate to the pages;
(b) means for determining the principal eigenvector of the transition probability matrix; and (c) means for forming the referential term as a related element of the eigenvector.

49. A system as in one of claims 26 to 48, wherein the page is from a plurality of pages from a web site and the reputation term is determined for the web site, wherein a link to any one of the plurality of pages comprises a links to the page, and a link from any of the page to a page not of the plurality comprises a link from the page.