GB2386712A - Selective updating of an index of webpages by a search engine - Google Patents

Selective updating of an index of webpages by a search engine

Info

Publication number
GB2386712A
GB2386712A GB0206626A GB0206626A GB2386712A GB 2386712 A GB2386712 A GB 2386712A GB 0206626 A GB0206626 A GB 0206626A GB 0206626 A GB0206626 A GB 0206626A GB 2386712 A GB2386712 A GB 2386712A
Authority
GB
United Kingdom
Prior art keywords
documents
links
index
pages
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0206626A
Other versions
GB0206626D0 (en)
Inventor
Barry David Ottley Adams
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MAGUS RES Ltd
Original Assignee
MAGUS RES Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MAGUS RES Ltd filed Critical MAGUS RES Ltd
Priority to GB0206626A priority Critical patent/GB2386712A/en
Publication of GB0206626D0 publication Critical patent/GB0206626D0/en
Priority to PCT/GB2003/001121 priority patent/WO2003081462A2/en
Priority to AU2003212535A priority patent/AU2003212535A1/en
Publication of GB2386712A publication Critical patent/GB2386712A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 - Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 - Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9538 - Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

To save on computational resources, a search engine is configured to perform selective updating of its index rather than full indexing. Selective updating operates on a previous index, by classifying the indexed pages as leaf and branch pages. Branch pages are those which include links to other pages deeper in the website, while leaf pages do not include such links. The selective updating procedure updates branch pages and new leaf pages more regularly than existing leaf pages.

Description

Selective Updating
The present invention relates to the field of search engines, particularly but not exclusively to a method of selectively updating an index of web pages for use by a search engine in performing searches.
The World Wide Web, or simply 'web', is based on hypertext, which can be thought of as text that is not constrained to be sequential. The web can handle much more than just text, so the more general term hypermedia is used to cover all types of content, including but not limited to pictures, graphics, sound and video. While the primary language for representing hypermedia content on the web is HTML, other markup languages are constantly developing, including, for example, XML. The term hypermedia as used herein is therefore not intended to be limited to any particular web language, nor indeed to the web, but should be interpreted as a general term that can also refer to content on public or private networks which operate according to HyperText Transfer Protocol (HTTP) or other similar protocols.

As mentioned above, HTML is a document mark-up language that is the primary language for creating documents on the web. It defines the structure and layout of a web document by reference to a number of pre-defined tags with associated attributes. The tags and attributes are interpreted and the web page is accordingly displayed by a client application running on a computer, commonly referred to as a browser.
As a result of the vast amount of information available on the web, search engine technology is well established, with a large number of different search engines being available, including those with well-known names such as Google, AltaVista and Excite. A search engine is a system that can search for specific words and phrases in a set of electronic documents, particularly HTML documents on the web, although the term is not confined to use on the web.
The majority of search engines work on similar principles. Web content is hosted by a very large number of remote web servers. A computer program known as a 'spider' or 'robot' crawls through the content on the web that is to be indexed and stores information about each page it finds in a searchable index. The index therefore comprises a complete database of information about a predetermined list of web pages.
Each page or document in the index is given a ranking, indicating its relevance with reference to some word or phrase. Every search engine typically uses a different algorithm, although based on the same principle, namely that the ranking is determined by a combination of the number of occurrences of each keyword or phrase in the document, the total word count of the document and whether the keyword/phrase occurred in a particularly significant location within the document, such as in the title or in HTML tags known as meta-tags, or was in some other way highlighted in the document.
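A toy illustration of this principle, combining keyword frequency, total word count and a boost for a hit in a significant location, might look like the following. This is purely illustrative and not the patent's (or any real engine's) ranking algorithm; the boost factor and helper names are assumptions.

```python
import re

def rank(document: str, title: str, keyword: str) -> float:
    """Toy relevance score: keyword frequency normalised by total word
    count, boosted when the keyword appears in a significant location
    (here, the title). Illustrative only."""
    words = re.findall(r"\w+", document.lower())
    if not words:
        return 0.0
    occurrences = words.count(keyword.lower())
    score = occurrences / len(words)
    if keyword.lower() in title.lower():
        score *= 2.0  # arbitrary boost for a title hit (an assumption)
    return score

# A document mentioning the keyword more often, or in its title, ranks higher.
print(rank("news about sport and more sport", "Sport today", "sport"))
```

Real ranking functions weigh many more signals, but the shape of the computation is the same: per-document term statistics combined with positional significance.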
When a user performs a search for some keyword or phrase, the search engine consults its index and returns a set of search results ranked by relevance.
Clearly, it is important from a user's point of view to be looking at the latest data available. In turn, search engines need to ensure that their index is kept as up to date as possible, since it is in the nature of the Internet that the content being indexed is likely to be constantly changing.
However, the creation of the index takes substantial computational and bandwidth resources. Furthermore, the speed at which updates can be performed is limited by the number of pages that the remote web servers can deliver in unit time.
One possible way of saving computational time is to record the creation time of a remote document during a first indexing operation and subsequently only refresh the information about this document if the creation time has changed. However, since many web servers do not deliver accurate information on document dates, this is not a particularly good solution in practice.
The present invention aims to address the above problems.
According to the invention, there is provided a method of selectively updating an index of documents, some of said documents including links to other of said documents, the method comprising the step of updating only those of said documents which include said links.
The method can further comprise incorporating into the index documents not previously indexed and which are linked to by updated documents and some or all of said incorporated documents may not include said links.
A selected set of the documents in the index can comprise a document hierarchy, for example a website, in which each document is associated with a depth, the depth representing the number of said links required to reach the document from an entry document, wherein said links comprise links between documents at different depths.
The documents can comprise hypertext documents such as web pages and the links can be hyperlinks.
According to the invention, there is further provided a method of selectively updating an index of documents, including the steps of preparing the index for selective updating and selectively updating the index, wherein the step of preparing the index for selective updating includes classifying the documents in the index as leaf and non-leaf pages, wherein the leaf pages do not include links to other documents in the index and the non-leaf pages include links to other documents in the index.
The method can further comprise updating the non-leaf pages more frequently than the leaf pages and/or adding new leaf pages to the index more frequently than updating existing leaf pages.
According to the invention, there is also provided a search engine including means for selectively updating an index of documents, some of said documents including links to other of said documents, said means being configured to update only those of said documents which include said links.
Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of a system according to the invention;
Figure 2 illustrates the structure of a typical website;
Figure 3 is a flow diagram illustrating search engine indexing operation in accordance with the invention, including full indexing and selective updating;
Figure 4 is a flow diagram illustrating the full indexing procedure referred to in Figure 3;
Figure 5 is a flow diagram illustrating the retrieval of a list of non-leaf pages, which requires the retrieval of a list of leaf pages;
Figure 6 is a flow diagram illustrating the retrieval of a list of leaf pages for use in the flow diagram of Figure 5;
Figure 7 is a flow diagram illustrating the selective updating procedure referred to in Figure 3; and
Figure 8 is a flow diagram illustrating the deletion of orphaned pages following selective updating.
Referring to Figure 1, a system according to the invention comprises a search engine program 1, written, for example, in Java, running on and executable by the processor 2 of a web server machine 3. The search engine program 1 communicates with a database 4 which stores the raw data that forms the basis of the information to be presented to a user carrying out a search.
The user accesses the web server 3 via a communications network 5, such as the Internet, using browser software 6 running on a personal computer 7. The browser software 6, for example, Microsoft Internet Explorer or Netscape Navigator, interfaces with the search engine program 1 via web server software 8 running on the web server machine 3. The browser software 6 communicates with the web server software 8 using the HTTP protocol, in a way that is well known.
It will be understood that the web server machine 3 and the personal computer 7 are conventional computers equipped with all of the hardware and software necessary to carry out their respective tasks.
The search engine program 1 includes a web spider program 9, the function of which is to trawl the Internet to provide the raw data that will be processed and used to respond to subsequent user queries. The functionality of the search engine program 1 according to the invention will be described and illustrated below with reference to Figure 3.
Referring first to Figure 2, a typical website comprises an entry page or start page P, also referred to as the home page, a number of more specific pages Q, R, S and finally, specific items of particular interest T, U, V. For example, a news website has a home page listing the most current articles, a number of pages listing articles in specific sections, for example UK news and world news, and finally the news articles themselves. The pages are interlinked by hyperlinks 10, indicated in bold in Figure 2. A website may have more than one entry page.
The depth of a web page in a website is defined for the purpose of this application as the minimum number of hyperlinks a user or computer program must traverse from an entry page in order to arrive at that page. A branch page is then defined as any page, other than an entry page, which contains one or more hyperlinks deeper into the website, while a leaf page is defined as any page that does not link deeper into the website. A non-leaf page is an entry page or a branch page.
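These definitions can be sketched compactly. The link structure below is a hypothetical stand-in for Figure 2, and the two helper functions are illustrative only; the patent's own Figure 6 procedure computes the same classification in a single traversal.

```python
from collections import deque

# Hypothetical link structure mirroring Figure 2: each page maps to the
# pages it links to.
LINKS_TO = {
    "P": ["Q", "R", "S"], "Q": ["T", "U", "V"], "R": ["S"],
    "S": ["V"], "T": [], "U": [], "V": ["R"],
}

def page_depths(links_to, entry_pages):
    """Depth = minimum number of hyperlinks from an entry page (BFS)."""
    depths = {p: 0 for p in entry_pages}
    queue = deque(entry_pages)
    while queue:
        page = queue.popleft()
        for target in links_to.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def classify(links_to, entry_pages):
    """Leaf pages have no link to a strictly deeper page; every other
    reachable page (entry or branch page) is a non-leaf page."""
    depths = page_depths(links_to, entry_pages)
    leaves = {p for p, d in depths.items()
              if not any(depths.get(t, -1) > d for t in links_to.get(p, []))}
    return leaves, set(depths) - leaves

leaves, non_leaves = classify(LINKS_TO, ["P"])
print(sorted(leaves))      # → ['R', 'T', 'U', 'V']
print(sorted(non_leaves))  # → ['P', 'Q', 'S']
```

Note that R and V count as leaf pages even though they contain links, because those links point sideways or back up the hierarchy rather than deeper into the website.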
Referring to Figure 3, the indexing operation of the search engine program 1 begins at step s0. The program 1 first determines whether a previous index exists (step s1).
If it does not, for example, because this is the first time that the indexing operation is being performed, then a full index is generated (step s2) and the process then terminates (step s3).
The procedure for generating a full index, which is also the way in which a conventional web spider program functions, is described in detail below with reference to Figure 4.
Referring to Figure 4, in a full indexing process, the web spider program 9 periodically downloads documents from the Internet in accordance with predetermined indexing criteria, which, for example, specify the coverage that the search engine is attempting to achieve (step s20). The type of document downloaded is classified as either an HTML document or 'other' document type, i.e. all documents which are not HTML documents (step s21). In both cases, a document parser corresponding to the document type parses the document (steps s22, s23) and stores data about the document in the database 4 (step s24). This data includes, for example, information relating to the positions and frequency of every word, with the option of excluding the most common words, in each document.
In addition, the HTML parser retrieves the URLs of any hyperlinks within the document being parsed (step s22) and checks whether each of these hyperlinks is new and meets the indexing criteria (step s25). The indexing criteria can comprise a pattern to be matched by the hyperlink, for example, that the hyperlink must have the same root as the source page. If this is the case, the hyperlink is added to the download list stack (step s26) and the next document is downloaded from the list stack (step s20). If none of the hyperlinks found are to be added to the list stack, the spider program 9 determines if the list stack is empty (step s27). If it is, the spider program terminates (step s28). If not, control returns to the document downloader to download the next document from the list stack (step s20).
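The full-indexing loop of Figure 4 can be sketched as follows. The fetch, parse_links and store callables are hypothetical stand-ins for the real HTTP, HTML-parsing and database machinery.

```python
# Skeleton of the full-indexing spider of Figure 4; helper callables are
# stand-ins, not the patent's actual implementation.
def full_index(entry_urls, meets_criteria, fetch, parse_links, store):
    stack = list(entry_urls)              # the download list stack
    seen = set(entry_urls)
    while stack:                          # step s27: stop when stack empty
        url = stack.pop()
        document = fetch(url)             # step s20: download the document
        store(url, document)              # steps s21-s24: parse and index it
        for link in parse_links(document):             # step s22
            if link not in seen and meets_criteria(link):   # step s25
                seen.add(link)
                stack.append(link)        # step s26: queue for download

# Toy in-memory 'web' standing in for remote servers; each document is
# just its list of outgoing links.
web = {"a": ["b", "c"], "b": [], "c": ["a"]}
stored = {}
full_index(["a"], lambda u: True, lambda u: web[u],
           lambda doc: doc, lambda u, doc: stored.__setitem__(u, doc))
print(sorted(stored))  # → ['a', 'b', 'c']
```

Every page reachable from the entry URL and matching the criteria is downloaded exactly once, which is the property the selective procedure later restricts.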
Returning to Figure 3, even if a previous index exists, the program determines whether it is time to perform a full re-indexing process (step s4). This process is carried out at intervals to ensure that a completely fresh index is periodically generated. For example, a full re-indexing process is carried out weekly, while selective indexing is carried out daily. The full re-indexing process involves carrying out the procedure set out above in relation to step s2.
If there is no need for full re-indexing and a previous index is available, the program examines the existing search index to generate tables, for example in the form of hashtables, of links between the pages (step s5). The tables include a forward set ('links to') and a reverse set ('links from') of link information. For the 'links to' table, a source URL maps to an array of link destination URLs, whereas for the 'links from' table, the destination URL maps to an array of URLs of pages having the destination link. For example, for the structure shown in Figure 2, the forward table includes the information that page P links to pages Q, R and S, and that page Q links to pages T, U and V, while the reverse table holds the information that page R has links from pages P and V, that page S has links from pages P and R and that page V has links from pages Q and S. Corresponding information is held for all the other pages.
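Building the two tables from per-page outgoing links is straightforward; the sketch below uses the Figure 2 structure as an assumed in-memory example.

```python
from collections import defaultdict

def build_link_tables(page_links):
    """Build the forward ('links to') and reverse ('links from') tables
    of step s5 from a mapping of each page to its outgoing links."""
    links_to = {page: list(targets) for page, targets in page_links.items()}
    links_from = defaultdict(list)
    for page, targets in page_links.items():
        for target in targets:
            links_from[target].append(page)  # reverse the edge
    return links_to, dict(links_from)

# Hypothetical link structure of Figure 2.
links_to, links_from = build_link_tables({
    "P": ["Q", "R", "S"], "Q": ["T", "U", "V"], "R": ["S"],
    "S": ["V"], "T": [], "U": [], "V": ["R"],
})
print(links_to["P"])    # page P links to pages Q, R and S
print(links_from["V"])  # page V has links from pages Q and S
```

The reverse table is what later makes orphan detection cheap: a page with no 'links from' entry is unreachable.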
The next step carried out by the program is to obtain a list of all of the non-leaf pages (step s6). The procedure for doing this is set out in detail below and illustrated with reference to Figure 5.
Referring to Figure 5, the subroutine begins at step s60. Essentially, to obtain the set of non-leaf pages, the program first determines the set of leaf pages for a given set of entry pages (step s61). Pages which are not leaf pages are by definition non-leaf pages. The set of leaf pages is determined by the program using the subroutine illustrated in Figure 6.
Referring to Figure 6, the subroutine begins at step s6100. The first step is to initialise an empty set of leaf pages, to which leaf pages will be added as they are found, an empty set of pages called 'this_level', an empty set of pages called 'done_links', a variable called 'depth' which is initially set to 0 and an empty hashtable referred to herein as the depth hashtable, which maps URLs to the depth variable (step s6101). The 'this_level' set will contain all the pages at a particular depth, i.e. starting with all the entry pages, then all the pages at depth 1 and so on.
The 'done_links' set will contain all the links that have already been considered.
After initialisation, all of the entry pages are added to the 'this_level' page set (step s6102). The program then determines whether there are pages in the 'this_level' page set (step s6103) and executes a first loop while there are such pages. Initially, therefore, all the entry pages are in the 'this_level' page set. The program then initialises an empty set of pages called 'next_level' (step s6104) for the first entry page. It then determines whether there are more entries in the this_level set (step s6105). If there are, then the program retrieves, from the index, the list of links for the current page, sets a boolean variable called allold to TRUE and adds a depth hashtable entry for the page equal to the depth (step s6106).
At the next stage, the program determines whether there are any more links on the page (step s6107). For each link which exists, the program determines whether the link is already present in the this_level set or the depth hashtable (step s6108). If it is present, control returns to step s6107 and the program looks at the next link. If the link does not already exist in the this_level set or the depth hashtable, the boolean variable allold is set to FALSE (step s6109) and the program determines whether the link exists in the done_links set (step s6110). If it does, control again returns to step s6107 without further action. If it does not, an entry for the link URL is added to the done_links set and to the next_level set (step s6111). Control then returns to step s6107 and the program performs the same procedure for the next link.
If the program at step s6107 determines that there are no more links, then the program tests the state of the allold variable (step s6112). If this is set to TRUE, the page is added to the set of leaf pages (step s6113) and control passes back to step s6105. This will only be the case if the page has no links that go deeper into the document hierarchy. If the allold variable is set to FALSE, control passes back to step s6105 without the page being added to the set of leaf pages. At step s6105, the program determines whether there are further entries in the this_level set. In the absence of further entries, the this_level set is set to hold the contents of the next_level set, and the depth is increased by 1 (step s6114). Control passes back to step s6103, which re-runs the steps described above for the next page in the this_level set.
Only when there are no further pages to be processed does the program exit with a completed list of leaf pages (step s6115).
An example of the operation of the algorithm above is now described with reference to Figure 2. On the first pass, page P is the only entry page, so this is loaded into the this_level set (step s6102). Since this page is in the this_level set (step s6103), the next_level set is initialised (step s6104) and the list of links for page P is retrieved from the links_to hashtable (step s6106). These links are the URLs for pages Q, R and S. The allold variable is set to TRUE and an entry is added to the depth hashtable for page P specifying a depth of 0 (step s6106).
Pages Q, R and S do not exist in the this_level set or the depth hashtable (steps s6107 and s6108). Therefore, they are new pages, so allold is set to FALSE (step s6109). Pages Q, R and S do not exist in the done_links set either, so the URLs for each of pages Q, R and S are added to the done_links set and to the next_level set (steps s6107 to s6111). It will be understood that, although these pages are described above, for the purpose of clarity and brevity, as if the program treated them together, each page is in fact processed by the program 1 on separate passes, for example on the basis of a 'for' loop over all of the links in the list. On the fourth pass through the loop, the program 1 determines that there are no more links for page P (step s6107). A test of the allold variable determines that it is set to FALSE (step s6112), so control passes to step s6105. Since there are no more entry pages at depth 0, the this_level set is set to the contents of the next_level set, i.e. to contain the URLs for pages Q, R and S, and the depth variable is incremented (step s6114). The process described above then repeats from step s6103 for each of pages Q, R and S.

For example, an empty next_level set is initialised for page Q (step s6104) and the list of links for page Q is obtained. This contains pages T, U and V. allold is set to TRUE and the depth hashtable entry for page Q is set to 1 (step s6106).
The links to pages T, U and V are not in the this_level or done_links sets nor in the depth hashtable (steps s6108, s6110), so allold is set to FALSE (step s6109) and pages T, U and V are added to the done_links and next_level sets. Control returns to step s6105.

The next page at this level is page R. Page R contains only a single link to page S. At step s6106, allold is set to TRUE and a depth hashtable entry is made for page R, depth = 1. Program flow proceeds to step s6108. Since page S is in the this_level set, program flow returns to step s6107.
Since allold is set to TRUE, page R is added to the set of leaf pages (step s6113).
Although it links to another page, that page is not one which is deeper within the website. Control now returns to step s6105.

The next page at this level is page S. The list of links for page S contains page V only. allold is set to TRUE and a depth hashtable entry is made for page S, depth = 1 (step s6106). Since the link to page V is not in the this_level set nor in the depth hashtable (step s6108), allold is set to FALSE (step s6109). However, link V is in the set of done_links, so control passes back to step s6107 without adding link V to the next_level set (step s6110). Since there are no further links and allold is set to FALSE, control passes to step s6112 and then to step s6105 without adding page S to the set of leaf pages. Since there are no further pages at this level, the this_level set is set to the contents of next_level, i.e. pages T, U and V, and the depth variable is incremented, so that depth = 2 (step s6114). On the next pass, an empty next_level set is initialised (step s6104) and control passes through step s6105 to step s6106.
At step s6106, for each of pages T and U, allold is set to TRUE and hashtable entries are made for each of pages T and U with depth = 2. Since pages T and U do not include any links, control passes from step s6107 to step s6112 and, since allold has just been set to TRUE, pages T and U are each added to the set of leaf pages on sequential passes through the flowchart.
For the final page in the website, page V, the list of links contains a link to page R. allold is set to TRUE and a hashtable entry is made for page V, depth = 2 (step s6106).
Program flow proceeds to step s6108, where the program determines that the link exists in the depth hashtable. An entry was previously made in the depth hashtable for page R, depth = 1. Therefore, program flow moves back to step s6107. There are no further links and the program determines that allold is set to TRUE (step s6112), so the current page V is added to the set of leaf pages.
Control returns to step s6105. There are no more pages at level 2, so this_level is set to the contents of next_level and the depth is incremented (step s6114). However, no links were added to the next_level set on the previous pass, so the this_level set is empty. The program determines that this is the case (step s6103) and exits with a complete set of leaf pages, comprising pages R, T, U and V (step s6115).
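The level-by-level traversal of Figure 6 can be condensed into the following sketch. A plain dictionary of per-page links stands in for the index's links_to hashtable; variable names follow the description above.

```python
def find_leaf_pages(links_to, entry_pages):
    """Leaf detection following Figure 6: a page joins the leaf set only
    if every one of its links is 'old', i.e. points sideways or back up
    the hierarchy, never deeper into the website."""
    leaf_pages = set()
    done_links = set()        # links already considered
    depth_table = {}          # URL -> depth: the 'depth hashtable'
    this_level = set(entry_pages)
    depth = 0
    while this_level:                        # step s6103
        next_level = set()                   # step s6104
        for page in this_level:              # step s6105
            allold = True
            depth_table[page] = depth        # step s6106
            for link in links_to.get(page, []):              # step s6107
                if link in this_level or link in depth_table:  # step s6108
                    continue
                allold = False               # step s6109: a deeper link
                if link not in done_links:   # step s6110
                    done_links.add(link)     # step s6111
                    next_level.add(link)
            if allold:                       # step s6112
                leaf_pages.add(page)         # step s6113
        this_level = next_level              # step s6114
        depth += 1
    return leaf_pages                        # step s6115

# Hypothetical link structure of Figure 2.
LINKS_TO = {"P": ["Q", "R", "S"], "Q": ["T", "U", "V"], "R": ["S"],
            "S": ["V"], "T": [], "U": [], "V": ["R"]}
print(sorted(find_leaf_pages(LINKS_TO, ["P"])))  # → ['R', 'T', 'U', 'V']
```

Running the sketch on the Figure 2 structure reproduces the result of the worked example: pages R, T, U and V are leaves, leaving P, Q and S as the non-leaf pages.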
Referring again to Figure 5, after obtaining the list of leaf pages as described in detail above, the program gets the complete list of pages in the search index and initialises an empty list of non-leaf pages (step s62). For every page in the search index (step s63), the program determines whether the page is in the list of leaf pages (step s64). If so, control returns to step s63 and the next page is looked at. If a page is not a leaf page, it is classified as a non-leaf page and added to the list of non-leaf pages (step s65). When the program determines that there are no further pages to be considered (step s63), it returns the complete list of non-leaf pages (step s66).

In the example given above in relation to Figure 2, the non-leaf pages are pages P, Q and S only.
Referring again to Figure 3, the non-leaf pages are then deleted from the search index (step s7). This involves the deletion of all the information about each non-leaf page that exists in the index as a result of previous indexing of the page.
Information outside the index, such as the URL of the page and the information in the links_to and links_from tables, is unaffected at this stage. As the index contains an inverse table, which for each word lists the pages and locations of that word, it is almost as quick to delete many pages at the same time as it is to delete a single page.
For each website present in the search index, the URL of its entry page is retrieved, together with the URLs of the existing leaf pages. A selective re-indexing procedure is then performed (step s8), as described in more detail in Figure 7. In a multi-threading environment, the indexing of a number of websites can run concurrently.
Referring to Figure 7, the selective re-indexing process according to the invention operates in a very similar way to the conventional indexing program illustrated in Figure 4. For ease of reference, the steps that are the same are indicated by the same reference numerals. The difference between this process and the spidering operation illustrated in Figure 4 lies in the fact that the existing leaf pages are ignored. At step s22, the HTML document parser retrieves the URLs of any hyperlinks within a web page. As in the conventional spider program, the selective indexing spider determines whether the hyperlink is new and meets the criteria for indexing (step s25). If it is, it determines whether the hyperlink is in the list of leaf pages for the given entry page (step s80). If it is in the list, the hyperlink is not added to the download list stack for subsequent downloading and control returns to the document downloader (step s20). If it is not in the list, it is added to the download list stack in the usual way (step s26) and control is returned to the document downloader (step s20).
Based on the example of Figure 2, it is evident that of the seven pages which would require updating in a full indexing procedure, four pages R, T, U and V no longer require indexing according to the selective indexing procedure, thereby achieving a significant saving in computational resources.
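The selective spider of Figure 7 differs from the full spider by one extra test before queueing a hyperlink. The sketch below shows that shape; the helper callables are hypothetical stand-ins, as before.

```python
# Sketch of the selective spider of Figure 7: like full indexing, except
# that hyperlinks already classified as leaf pages are never queued for
# download (step s80).
def selective_index(entry_urls, leaf_pages, meets_criteria,
                    fetch, parse_links, store):
    stack = list(entry_urls)
    seen = set(entry_urls)
    while stack:
        url = stack.pop()
        document = fetch(url)
        store(url, document)
        for link in parse_links(document):
            if link in seen or not meets_criteria(link):  # step s25
                continue
            if link in leaf_pages:    # step s80: existing leaf page,
                continue              # do not queue for download
            seen.add(link)
            stack.append(link)        # step s26

# The website of Figure 2, with R, T, U and V already known to be leaves.
web = {"P": ["Q", "R", "S"], "Q": ["T", "U", "V"], "R": ["S"],
       "S": ["V"], "T": [], "U": [], "V": ["R"]}
visited = {}
selective_index(["P"], {"R", "T", "U", "V"}, lambda u: True,
                lambda u: web[u], lambda doc: doc,
                lambda u, doc: visited.__setitem__(u, doc))
print(sorted(visited))  # → ['P', 'Q', 'S']
```

Only the three non-leaf pages are re-downloaded, matching the saving described above.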
Referring again to Figure 3, after performing selective re-indexing, a list of links to and from the pages in the new index is again extracted, as described in relation to step s5 above (step s9). A list of all orphaned pages is then obtained and can be deleted from the new index (step s10). Orphaned pages are pre-existing leaf pages which are no longer linked to by other pages. They may or may not exist on a targeted website but, since they are no longer accessible on the live website, they should not be found by the search engine.
The process of orphaned page deletion is now explained in detail with reference to Figure 8. The process begins at step s100. An empty list of unlinked documents is initialised and the complete list of pages in the search index retrieved (step s101). For every page in the search index (step s102), the program 1 determines whether an entry exists in the linked_from table created at step s9 in Figure 3 (step s103). If an entry exists, control passes back to step s102 and the process is repeated for the next page. If there is no entry in the linked_from table (step s103), indicating that the page in question is unreachable from any other page, the current page is added to the list of unlinked documents (step s104). Control then passes back to step s102. Once all the pages have been processed, control passes to step s105, at which all the pages listed in the list of unlinked documents are deleted and the subroutine terminates (step s106).
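The orphan-deletion pass of Figure 8 reduces to a scan over the reverse link table. In the sketch below, exempting entry pages from deletion is an assumption made for illustration; the patent's flow simply checks the linked_from table for every page.

```python
def delete_orphans(index_pages, links_from, entry_pages):
    """Figure 8 sketch: drop pages that no other page links to any more.
    links_from is the reverse table rebuilt after selective re-indexing.
    Exempting entry pages is an assumption for this illustration."""
    unlinked = [p for p in index_pages
                if p not in entry_pages
                and not links_from.get(p)]            # steps s103-s104
    kept = [p for p in index_pages if p not in unlinked]  # step s105
    return kept, unlinked

# After selective re-indexing, hypothetical page X has lost its last
# incoming link, so it is orphaned.
kept, unlinked = delete_orphans(
    ["P", "Q", "S", "X"], {"Q": ["P"], "S": ["P"]}, {"P"})
print(unlinked)  # → ['X']
```

Because the reverse table was already built at step s9, this pass costs a single dictionary lookup per page.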

Claims (25)

Claims
1. A method of selectively updating an index of documents, some of said documents including links to other of said documents, the method comprising the step of updating only those of said documents which include said links.
2. A method according to claim 1, further comprising incorporating into the index documents not previously indexed and which are linked to by updated documents.
3. A method according to claim 2, wherein some of said incorporated documents do not include said links.
4. A method according to any one of the preceding claims, including the step of analysing the index to produce a set of documents which link to other documents and a set of documents which are linked to by other documents.
5. A method according to claim 4, including determining, from said sets, a set of documents which do not include links to other documents.
6. A method according to claim 5, including determining from the documents in the index and the set of documents which do not include links, the set of documents which include links.
7. A method according to claim 6, further comprising deleting the set of documents which include links from the index.
8. A method according to any one of claims 5 to 7, comprising updating the index ignoring the set of documents which do not include links.
9. A method according to claim 4, including determining the depth of a document from said sets.
10. A method according to claim 9, including determining a first set of documents which do not link to documents having a greater depth than the depth of the documents in the first set.
11. A method according to claim 10, including determining from said first set of documents a second set of documents which include links to documents having a greater depth than the depth of the documents in the second set.
12. A method according to claim 11, including deleting the entries in the index for said second set of documents.
13. A method according to claim 12, including forming a new index of documents which do not include links to other documents deeper than the documents in the new index.
14. A method according to any one of claims 1 to 8, wherein a selected set of the documents in the index comprise a document hierarchy in which each document is associated with a depth, the depth representing the number of said links required to reach the document from an entry document, wherein said links comprise links between documents at different depths.
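The depth of a document, defined in claims 9, 14 and 20 as the minimum number of links that must be followed from an entry document, is naturally computed with a breadth-first traversal. A minimal sketch, assuming the link structure is available as a mapping from each page to its outgoing links (names are illustrative):

```python
from collections import deque

def page_depths(entry, out_links):
    """Compute each page's depth: the minimum number of links that must be
    followed from the entry page to reach it (claims 9 and 20)."""
    depths = {entry: 0}
    queue = deque([entry])
    while queue:
        page = queue.popleft()
        for target in out_links.get(page, ()):
            if target not in depths:  # first visit yields the minimum depth
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

With the depths known, the sets of claims 10 and 11 (documents which do or do not link to deeper documents) can be derived by comparing each document's depth with the depths of its link targets.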
15. A method according to any one of the preceding claims, wherein the documents comprise hypertext documents and the links are hyperlinks.
16. A method according to claim 15, wherein the documents comprise web pages.
17. A method according to claim 16, when dependent on claim 14, wherein the document hierarchy comprises a website.
18. A method according to any one of the preceding claims, further comprising deleting documents which do not include links to any other documents.
19. A method of selectively updating an index of documents, including the steps of: preparing the index for selective updating; and selectively updating the index, wherein the step of preparing the index for selective updating includes classifying the documents in the index as leaf and non-leaf pages, wherein the leaf pages do not include links to other documents in the index and the non-leaf pages include links to other documents in the index.
20. A method according to claim 19, wherein said links comprise links to documents having a greater depth, the depth of a document being measured as the minimum number of links that must be followed from an entry page to reach the document.
21. A method according to claim 19 or 20, further comprising updating the non-leaf pages more frequently than the leaf pages.
22. A method according to any one of claims 19 to 21, further comprising adding new leaf pages to the index more frequently than updating existing leaf pages.
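The leaf/non-leaf classification of claim 19 can be sketched as a simple partition of the index, again assuming an out-link mapping and illustrative names; a page is non-leaf only if it links to another document that is itself in the index:

```python
def split_leaf_pages(indexed, out_links):
    """Partition indexed pages into leaf pages (no links to other indexed
    documents) and non-leaf pages (claim 19)."""
    non_leaf = {page for page in indexed
                if any(target in indexed for target in out_links.get(page, ()))}
    leaf = indexed - non_leaf
    return leaf, non_leaf
```

An update scheduler following claims 21 and 22 would revisit the `non_leaf` set on a shorter cycle than the `leaf` set, while adding newly discovered leaf pages as they are found.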
23. A computer program which, when executed on a computer, is configured to perform the method of any one of claims 1 to 22.
24. A search engine including means for selectively updating an index of documents, some of said documents including links to other of said documents, said means being configured to update only those of said documents which include said links.
25. A search engine according to claim 24, wherein a selected set of the documents in the index comprise a document hierarchy in which each document is associated with a depth, the depth representing the number of said links required to reach the document from an entry document, wherein said links comprise links between documents at different depths.
GB0206626A 2002-03-20 2002-03-20 Selective updating of an index of webpages by a search engine Withdrawn GB2386712A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB0206626A GB2386712A (en) 2002-03-20 2002-03-20 Selective updating of an index of webpages by a search engine
PCT/GB2003/001121 WO2003081462A2 (en) 2002-03-20 2003-03-18 Selective updating of index in a search engine
AU2003212535A AU2003212535A1 (en) 2002-03-20 2003-03-18 Selective updating of index in a search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0206626A GB2386712A (en) 2002-03-20 2002-03-20 Selective updating of an index of webpages by a search engine

Publications (2)

Publication Number Publication Date
GB0206626D0 GB0206626D0 (en) 2002-05-01
GB2386712A true GB2386712A (en) 2003-09-24

Family

ID=9933397

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0206626A Withdrawn GB2386712A (en) 2002-03-20 2002-03-20 Selective updating of an index of webpages by a search engine

Country Status (3)

Country Link
AU (1) AU2003212535A1 (en)
GB (1) GB2386712A (en)
WO (1) WO2003081462A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2403559A (en) * 2003-07-02 2005-01-05 Sony Uk Ltd Index updating system employing self organising maps

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
RU2733482C2 (en) 2018-11-16 2020-10-01 Общество С Ограниченной Ответственностью "Яндекс" Method and system for updating search index database

Citations (4)

Publication number Priority date Publication date Assignee Title
JPH1115851A (en) * 1997-06-27 1999-01-22 Hitachi Inf Syst Ltd Www page link control system and recording medium recording control processing program for the system
US5864863A (en) * 1996-08-09 1999-01-26 Digital Equipment Corporation Method for parsing, indexing and searching world-wide-web pages
US5920859A (en) * 1997-02-05 1999-07-06 Idd Enterprises, L.P. Hypertext document retrieval system and method
US6192375B1 (en) * 1998-07-09 2001-02-20 Intel Corporation Method and apparatus for managing files in a storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US5748954A (en) * 1995-06-05 1998-05-05 Carnegie Mellon University Method for searching a queued and ranked constructed catalog of files stored on a network


Non-Patent Citations (1)

Title
http://home.snafu.de/tilman/xenulink.html [Dec 2001] *


Also Published As

Publication number Publication date
WO2003081462A3 (en) 2003-11-06
GB0206626D0 (en) 2002-05-01
WO2003081462A2 (en) 2003-10-02
AU2003212535A8 (en) 2003-10-08
AU2003212535A1 (en) 2003-10-08


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)