GB2366877A - A system for categorising and indexing documents - Google Patents

Info

Publication number
GB2366877A
Authority
GB
United Kingdom
Prior art keywords
rule
category
score
combination
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0005091A
Other versions
GB0005091D0 (en
Inventor
Colin Dick
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JURA TECH Ltd
Original Assignee
JURA TECH Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JURA TECH Ltd filed Critical JURA TECH Ltd
Priority to GB0005091A priority Critical patent/GB2366877A/en
Publication of GB0005091D0 publication Critical patent/GB0005091D0/en
Publication of GB2366877A publication Critical patent/GB2366877A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer system that allows a user an intuitive method of conducting a search. A categorisation engine is used to compare a certain document against a plurality of programmable rule sets, where each rule set defines a combination. The rules are weighted and a statistical algorithm determines a score for a combination. A category contains a plurality of combinations and the same algorithm is used to provide an overall category score, which is classified as falling into a certain range defined by predetermined threshold values. A search engine indexes category value information as well as associated combination/rule information for the relevant categories that scored higher than a predetermined threshold value. The combination/rule information is used by the system at run-time to allow the user to refine his search using the humanly readable data that resides in the relevant combination/rule information.

Description

A SYSTEM FOR CATEGORISING AND INDEXING DOCUMENTS
The invention proposes a method and computer system for categorising documents containing text.
The constantly developing World Wide Web has dramatically revolutionised communications and is able to present an extremely large amount of information to online users. Smaller intra-company computer networks, i.e. intranets, also provide a common pool of information relating to the company and are readily accessible to the company's connected employees. Intranets are used to promote knowledge-sharing in any organisation. It is possible for all the various departments that constitute an organisation to access each other's data. The problem with any network, whether the Internet or a smaller intranet, is that it is constantly growing and a user can be flooded with information.
One of the biggest challenges in the information age is being able to disseminate the overwhelming amount of information that is available. It is necessary to be able to access the required information as rapidly and efficiently as possible. At present, search engines provide the gateway to finding information on the Internet. Various methods of searching are used. According to one popular technique the online user inputs a number of key words that are felt to best describe the topic he wishes to obtain information on. However, due to the massive number of results that may be returned by the search engine, the user may waste a great deal of time reviewing irrelevant documents before arriving at something useful. The user is often encouraged to conduct a refined search whereupon he may have to give some thought to redoing a search by inputting Boolean expressions or other key word combinations.
Other search engines present a more intuitive user interface by presenting the results in a categorised format as opposed to a list of all the articles matching the
input criteria. The idea is that the user is then able to have a reasonable overview of the information available and may drill down into the most suitable category and obtain the required information. An example of such a search engine is Yahoo™. However, there are limitations imposed on such a search engine by the manner in which documents are categorised in the first place and the subsequent constraints placed on a user using only category-based searching.
It is an aim of the present invention to provide an improved method of presenting information related to categorised documents.
According to one aspect of the invention there is provided a system for categorising and indexing documents comprising: means for comparing textual content of a document with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string and a data name displayable in humanly readable form, said comparing means being organised to trigger a rule when the alphanumeric string of that rule is located in the textual part, wherein each rule is associated with a weighting factor; means for generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; a data structure arranged to hold a category identifier and a group of data names displayable in humanly readable form, said data names being derived from the rules that triggered for any category for which the category score is above a predetermined threshold.
According to another aspect of the present invention there is provided a method for categorising and indexing documents comprising: comparing textual content of a document with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric
string and a data name displayable in humanly readable form, wherein a rule is triggered when the alphanumeric string of that rule is located in the textual part, and wherein each rule is associated with a weighting factor; generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; and creating a data structure holding a category identifier and a group of data names displayable in humanly readable form, said data names being derived from the rules that triggered in any category for which the category score is greater than a predetermined threshold.
Another aspect of the invention provides a search engine for searching documents containing textual content, the search engine comprising: a category index holding for each document a plurality of key:value pairs, each key:value pair comprising a key identifying the category and a value denoting a category score representing an importance weighting for that category, said importance weighting having been derived from rules which triggered in a categorisation process; a data structure arranged to hold said key and a group of data names displayable in humanly readable form, said data names having been derived from the rules which triggered in the categorisation process for each category having a category score above a predetermined threshold; and means for cooperating with a user interface so that when a search is run according to certain search criteria and a category is selected, said data names are displayable.
Thus, the invention uses a set of rule bases that are designed to supply information pertinent to an online user conducting an information search to allow the search to be made in a more intuitive manner.
One particular use of the invention is to allow a user to further limit search criteria to reduce the number of selected documents based on the displayed humanly
readable data names. Other possible uses include alerting systems and market research tools.
The predetermined threshold can be set at any suitable level. It is possible for categories to be indexed according to three score levels, high, medium and low. This may determine whether or not a document is categorised in a particular category according to the setting of the indexing engine. In the described embodiment however the predetermined threshold which governs whether or not the data names are held in the data structure is the low threshold. That is, any documents which are categorised as low, medium or high will have humanly readable data names associated with them, at least in the preferred embodiment.
The categorisation technique can have an extra, intermediate level in which each rule set defining a category is divided into combinations. Each combination comprises a plurality of rules and the data structure then holds a plurality of combination identifiers, each identifying a combination and being associated with a combination name in humanly readable form together with the group of data names.
The invention can be implemented in any suitable environment. In particular it is suitable for Internet or Intranet applications. The navigational tool may be implemented on a normal Web browser or a wireless Web browser.
The present invention will now be described by way of an example with reference to the accompanying drawings, in which:
Figure 1 shows the architecture of the present invention;
Figure 2 shows the hierarchical structure of the examples of the rule-bases in text format for six combinations;
Figure 3 shows an example of the RETAIL FINANCIAL SERVICES category;
Figures 4a-f show screenshots of rule-bases in text format for six combinations;
Figure 5 shows how category KEY:VALUE pairs are created by the categorisation engine;
Figure 6 shows how the search engine uses the KEY:VALUE pairs produced by the categorisation engine; and
Figures 7a-d show screenshots from a user's perspective in narrowing a search.
Figure 1 illustrates the architecture of an indexing and searching system. To categorise documents, a categorisation engine is used with a plurality of rule-bases 6 in binary format. Each rule-base is in the form of a database (for example Excel™) stored in RAM. The block marked 6 in figure 1 represents a number of rule-bases in a binary format. The blocks marked 2 and 4 denote the rule-bases in a text format and intermediate format discussed later. These are interim stages to the rule-bases in binary format 6. A search engine 14 receives and collates the results from the categorisation engine 10. The search engine 14 is responsible for indexing the documents. A navigation hierarchy 16 allows a user to interrogate the search engine to call up the documents he is interested in. In Figure 1, reference numeral 8 denotes a document which, of course, does not form part of the architecture itself.
The rule-bases are created by an information scientist in such a manner as to suit the requirements of a particular organisation. For example, the creator of the rule-bases for the intranet of a particular company would tailor and/or weight the rules in such a manner that would be relevant to the company's interests. More generally, construction of the rule-bases depends on the nature of the information to be searched and the expected scope of searches.
An information scientist can create the rule-bases in a simple textual manner without requiring any computer-language programming skills. This is done in the text format 2 of the rule-bases. It can be converted into a binary format 6 for storage and automated comparison purposes via an intermediate format 4 such as XML.
Before describing the rule-bases, it is important to understand the basic hierarchical structure governing the indexing of documents. Figure 2 provides a basic overview of the hierarchy. At the highest level is a category 20. A category 20 is the broadest heading into which documents are grouped. Each category contains multiple "sub-categories" that shall be referred to herein as combinations 22. As will be described later, each combination 22 is defined by a set of rules 24. However, it needs to be understood that the system has a recursive structure. Therefore, in practice, there may be combinations occurring under combinations. So a sub-combination will combine its scores up to a parent combination and then that score will be combined with other scores at that level and so on up to the top category level. Rules are the base items with no children. As a result of the recursive structure, the same statistical algorithm and scoring logic may be applied at any level, as will be described later. A summary of the more complete structure is shown below:
Category -> Rule OR Combination
Combination -> Combination OR Rule
Rule -> nothing
An example relating to the banking sector is used to describe a particular embodiment of the present invention. It should be understood that this is a non-limiting example and that the system described herein may be applied to any topic so desired by a person skilled in the art.
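The recursive Category / Combination / Rule structure summarised above can be sketched in Python as follows. This is only an illustration: the class names, the `count_rules` helper, the identifiers 1 to 3 and the nested sub-combination are hypothetical, while the category key 9120 and the weightings 20 and 5 are taken from the example discussed in this description.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Rule:
    """Leaf node: a rule has no children."""
    rule_id: int
    text: str
    score: int


@dataclass
class Combination:
    """A combination may hold rules and/or further combinations."""
    comb_id: int
    name: str
    children: List[Union["Combination", Rule]] = field(default_factory=list)


@dataclass
class Category:
    """Top-level node: holds rules and/or combinations."""
    cat_id: int
    name: str
    children: List[Union[Combination, Rule]] = field(default_factory=list)


def count_rules(node) -> int:
    """Recursively count the leaf rules beneath any node of the hierarchy."""
    if isinstance(node, Rule):
        return 1
    return sum(count_rules(child) for child in node.children)


# Identifiers 1..3 and the nested sub-combination are hypothetical.
retail = Category(9120, "RETAIL FINANCIAL SERVICES", [
    Combination(1, "GENERAL", [Rule(1, "barclaycard", 20)]),
    Combination(2, "PRODUCTS", [
        Rule(2, "barclaycard", 5),
        Combination(3, "HYPOTHETICAL SUB-COMBINATION", [Rule(3, "placeholder", 10)]),
    ]),
])

print(count_rules(retail))
```

Because every level is either a leaf or a container of children, one recursive function can walk the whole tree, which mirrors the statement that the same logic may be applied at any level.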
Figure 3 shows the RETAIL FINANCIAL SERVICES category, which contains six combinations; i.e. GENERAL, TITLE, PRODUCTS, COMPETITORS, CUSTOMERS and ALLIANCES. The rules, combinations and categories are all uniquely identified by a number.
Figures 4a-f provide examples of the text format 2 of the rule-bases. Each of these figures represents a screen shot for a particular combination indicated by the combination indicator 30. For example, figure 4c illustrates the PRODUCTS combination. The spreadsheet format, in this case Excel™, allows the information scientist to edit the rule-bases by creating rules and input criteria for their associated fields. The text format for the rule-bases comprises a predetermined set of fields as follows:
ID, field 32 - This is a number that uniquely identifies each rule within a combination.
RULES, field 34 - This is the rule itself, which the document is fired against. It comprises free-text which may be a single word (for example rules 10, 17, 18 and 20 in Figure 4a), a phrase (for example rule 19) or a Boolean combination of words (for example rules 1, 2, 5 to 9 etc.).
FIELD, field 36 - This stipulates the location in the document where the text defined in a rule may occur for it to trigger. If "TITLE" is input into this field then the criterion is that the relevant rule must be matched within the title of the document in question. Alternatively, by inserting "ANY", the rule is triggered if it is matched anywhere within the document.
SCOPE, field 38 - This stipulates the criteria for the proximity of free-text portions defined in a Boolean combination. For example, under the PRODUCTS combination rule number 5 is "barclays AND internet". In this case, "sentence" is inserted in the scope field. Therefore, for the rule to trigger both the words "barclays" and "internet" need to appear within the same sentence. Alternatively, if "paragraph" was inserted then the words could appear anywhere within the same paragraph for the rule to trigger.
DATA, field 40 - In this field, the information scientist inserts the data that would be readable or output to the user and which defines the rule in a humanly understandable manner. Different rules can have the same data field and in fact this is very useful in searching, as will be discussed later.
SCORE, field 42 - This is a weighting indicator as to the importance of a rule within a particular combination. It is a reflection of the interest that a rule has within a combination. In the current example, the score has a maximum value of 100. A score of 0 is possible.
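The six fields listed above, and the effect of FIELD and SCOPE on whether a rule triggers, can be illustrated with a small Python sketch. It handles only simple AND rules and uses naive paragraph and sentence splitting; the class and function names, the "ANY" field value chosen for rule 5, and the data name and score shown for it are assumptions made for illustration, not values taken from the figures.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    rule_id: int      # ID, field 32
    terms: List[str]  # RULES, field 34: words joined by AND (simplified)
    field: str        # FIELD, field 36: "TITLE" or "ANY"
    scope: str        # SCOPE, field 38: "sentence" or "paragraph"
    data: str         # DATA, field 40: humanly readable data name
    score: int        # SCORE, field 42: weighting from 0 to 100


def _units(text: str, scope: str) -> List[str]:
    """Split text into paragraphs or sentences (naive splitting, an assumption)."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if scope == "paragraph":
        return paragraphs
    sentences: List[str] = []
    for paragraph in paragraphs:
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip())
    return sentences


def triggers(rule: Rule, title: str, body: str) -> bool:
    """True if all terms of the rule occur together within one unit of the allowed region."""
    regions = [title] if rule.field == "TITLE" else [title, body]
    for region in regions:
        for unit in _units(region, rule.scope):
            words = set(re.findall(r"[a-z0-9]+", unit.lower()))
            if all(term.lower() in words for term in rule.terms):
                return True
    return False


# Rule 5 of the PRODUCTS combination; field, data and score values are hypothetical.
rule5 = Rule(5, ["barclays", "internet"], "ANY", "sentence", "Internet Banking", 10)
same_sentence = "Barclays launched a new internet service. Customers welcomed it."
split_sentences = "Barclays reported profits. The internet is growing."

print(triggers(rule5, "", same_sentence))
print(triggers(rule5, "", split_sentences))
```

With the scope changed to "paragraph", the second text would also trigger the rule, since both words then only need to share a paragraph.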
Annex 1 shows an example of the rule format 4 for the RETAIL FINANCIAL SERVICES category using the intermediate (XML) format.
Each combination in the category RETAIL FINANCIAL SERVICES is defined as:
<combination name="[name]" score="100" data="" id="RETAIL FINANCIAL SERVICES|[name]">
where [name] is the combination indicator 30. Each rule is defined with the following structure:
<rule id="$C$[#]" field="[insert]" score="[insert]">[rule name]</rule>
where [#] is a unique rule number and the field and score are inserted as decided with reference to the rule-bases of figures 4a, b, c, d, e and f.
Categorisation Engine
The categorisation engine 10 takes a particular document 8 and fires the text contained within such a document at rule-bases 6. The rule-bases are translated from an intermediate format 4 into a binary format 6. The binary format 6 is the format that is presented as an input 7 to the categorisation engine 10. The other input to the engine 10 is the document 8 to be categorised. The categorisation engine compares the binary format of the rule-bases to that of the document and determines which of the rules are triggered. At this binary level, each byte pattern that constitutes the input text of the document (in ASCII or some other suitable encoding) is compared against the byte patterns of the free-text defined in each RULE field 34. For each rule ID, if there is a match according to any Boolean logical combination defined in the rule, and subject to any restrictions in the FIELD or SCOPE fields of the rule, then that particular rule is triggered. The score associated with that rule is stored against the rule ID in a designated storage area. Each rule is checked against the document so that a list of scores is built up for each combination heading 30.
The text is fired against all the rule bases simultaneously. For Boolean logic rules, a 'hit' on part of the free-text defined in the rule leaves that part to be held in a waiting zone until the other part or parts are located in the document, in which case the rule triggers. If not, the waiting zones are simply emptied when the end of the document is reached.
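The waiting-zone behaviour for Boolean rules can be sketched as a single pass over the document. This is a simplified illustration under stated assumptions: only AND rules are handled, FIELD and SCOPE restrictions are ignored, rule words are given in lower case, and the function name `fire_document` is hypothetical.

```python
from typing import Dict, Iterable, List, Set


def fire_document(tokens: Iterable[str], rules: Dict[int, List[str]]) -> Set[int]:
    """Single pass over the document; returns the ids of the rules that triggered.

    `rules` maps a rule id to the lower-case words joined by AND in that rule.
    """
    waiting: Dict[int, Set[str]] = {rule_id: set() for rule_id in rules}
    triggered: Set[int] = set()
    for token in tokens:
        word = token.lower()
        for rule_id, parts in rules.items():
            if rule_id in triggered:
                continue
            if word in parts:
                waiting[rule_id].add(word)  # partial hit is held in the waiting zone
                if len(waiting[rule_id]) == len(set(parts)):
                    triggered.add(rule_id)  # every part has now been located
    waiting.clear()  # end of document: unmatched partial hits are discarded
    return triggered


rules = {5: ["barclays", "internet"], 61: ["barclays", "credit"]}
print(fire_document("Barclays expands internet banking".split(), rules))
```

In this run the word "barclays" is held for both rules, but only rule 5 completes; the partial hit for rule 61 is dropped when the document ends.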
After the document has been fired against the rule-bases, a statistical algorithm is performed on the scores, which constitute each of the relevant combination lists.
The first step is to arrange the scores that constitute each combination list into descending order. Thus, the highest score is placed first on the list and would be saved in a storage location as a variable A. The statistical algorithm then takes each subsequent score X and applies the algorithm: X × (100 − A) / 100.
The result of this mathematical manipulation is then stored in another storage location. The contents of the two storage locations are added and the result is stored in an accumulator. The process is iterated until all the scores in the list have been manipulated by the algorithm, wherein after the first iteration has been performed each subsequent iteration uses the value stored in the accumulator register for the previous iteration as the A variable in the algorithm. Once all the iterations have been performed, a total score for each combination list is calculated and stored. It is then possible to arrange all of the combination total scores into a separate list in descending order. The same statistical algorithm is applied at this categorisation level, producing a category total score reflecting the relevance of a particular category for the document.
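The iteration just described can be sketched in Python as follows. The update step, ADD = X × (100 − A) / 100, is an assumption inferred from the worked figures given in this description (for example, A = 50 and X = 30 giving 15, and A = 65 and X = 10 giving 3.5); the function name `combine_scores` is hypothetical.

```python
from typing import Iterable


def combine_scores(scores: Iterable[float]) -> int:
    """Combine a list of triggered scores into one truncated total.

    Assumed update step (inferred from the worked examples):
        ADD = X * (100 - A) / 100, then A = A + ADD
    """
    ordered = sorted(scores, reverse=True)  # descending order, highest first
    if not ordered:
        return 0
    accumulator = float(ordered[0])         # variable A starts as the highest score
    for x in ordered[1:]:
        add = x * (100 - accumulator) / 100
        accumulator += add                  # accumulator becomes A for the next pass
    return int(accumulator)                 # truncate the final total


print(combine_scores([50, 30, 10, 10, 5]))
```

For the five scores 50, 30, 10, 10 and 5 the accumulator passes through 65, 68.5, 71.65 and approximately 73.0675, which truncates to 73. Each additional score contributes less as the accumulator approaches 100, so the total can never exceed 100.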
Table 1 shows an example of how the statistical algorithm would be carried out on a list comprising the five scores 50, 30, 10, 10 and 5. For the first iteration, the values A=50 and X=30 are used to calculate a value of 15. The first value 50 is then added to the calculated 15 giving an accumulated total of 65. For the second iteration, A=65 and X=10 where the algorithm produces a result of 3.5, which is accumulated to give a value of 68.5 in the accumulator. The process is
iterated, until all the scores have been manipulated. The final result in the accumulator (i.e. 73.0675) is truncated to supply a total score of 73 for the list in question. In summary, the statistical algorithm may be applied at the combination level as well as at the category level.
SCORE   ADD      ACCUMULATOR
50      15       65
30      3.5      68.5
10      3.15     71.65
10      1.4175   73.0675
5       0        73.0675
Table 1
Once the algorithm has been applied at the category level, the category total is then analysed and classified as falling into a range based on defined threshold values. An example of the classification of the ranges is shown in Table 2. A total score greater than 60 falls into the HIGH range, while scores of 40-60 and 20-40 fall into the MEDIUM and LOW ranges respectively. A score below 20 is discarded.
Threshold Total   Range/Value
> 60              High
40-60             Medium
20-40             Low
< 20              Discard
Table 2
Figure 5 illustrates more clearly the operation of the categorisation engine as generally described above. A particular document 8 is shown to consist of a title 50 and a body of text 52. The same document is used in Annex 2, which indicates the output of the categorisation engine as will now be described. The rule-bases are compared against the document and certain rules are triggered. This figure shows how the rules are triggered for two categories: RETAIL FINANCIAL SERVICES 54 and BARCLAYS 55. In Annex 2, additional categories are exemplified.
Firstly, for the RETAIL FINANCIAL SERVICES 54 category, it is evident that rules have triggered under three combinations, i.e. GENERAL 56, TITLE 62 and PRODUCTS 66. For the GENERAL combination the rule "Barclaycard" has triggered twice. The criterion in the FIELD field 36 was "ANY" (see figure 4a) and thus the rule triggered once because it was found in the title 50 and a second time because it occurred in the text 52. Each rule has a weighting of 20 and the scores are placed in designated storage areas 74 and 76. Similarly, the PRODUCTS combination (see figure 4c) has "ANY" in the FIELD field 36 and thus the rule triggers twice. However, for this combination each rule has a weighting of 5, whose scores are stored in areas 84 and 86. The TITLE combination (see figure 4b) has "title" in the FIELD field and therefore only triggers once, i.e. in the title 50 of the document. The rule under this combination is given a weighting of 50 and is stored in a storage area 80.
In this manner a list of scores is built up under each combination depending on the corresponding rules that trigger. In the case of the GENERAL combination, the scores stored in areas 74 and 76 are used by the statistical algorithm 88 to produce a total combination score in storage area 72. Similarly, for the TITLE and PRODUCTS combinations, the results from the algorithm are stored in areas 78 and 82 respectively.
It will readily be appreciated that in practice the storage areas may be implemented as addressable memory locations in random access memory (RAM) or in any other suitable manner.
Moreover, the algorithm can be implemented by suitable logic or by a suitably programmed processor.
The iterations through the algorithm are shown below:
For the GENERAL combination:
SCORE   ADD     ACCUMULATOR
20      16      36
20      0       36
For the TITLE combination:
SCORE   ADD     ACCUMULATOR
50      0       50
For the PRODUCTS combination:
SCORE   ADD     ACCUMULATOR
5       4.75    9.75
5       0       9.75
The final accumulator scores for the combinations are truncated to 36, 50 and 9 and are stored in locations 72, 78 and 82 respectively.
Now, the total combination scores stored in 72, 78 and 82 constitute a new input list 79 to the algorithm 88, which produces a truncated RETAIL FINANCIAL SERVICES category score of 70. The iterations of the algorithm are shown below:
SCORE   ADD     ACCUMULATOR
50      18      68
36      2.88    70.88
9       0       70.88
The score for the RETAIL FINANCIAL SERVICES category is fed 96 into a look-up table 97 (see Table 2), which classifies the overall score of 70 as fitting into the HIGH range. Each category is associated with a unique key 94. The value 98 is combined with the key 94, which uniquely identifies each category, to produce a KEY:VALUE pair 99. For example, for RETAIL FINANCIAL SERVICES the KEY:VALUE pair is HIGH:9120.
A similar method of obtaining the value for the BARCLAYS category 55 would be used. The document is fired against the rule-bases and it is found that two rules under the TITLE combination 57 triggered. In this case, both rules under the TITLE combination are triggered within the title 50 of the document. The rule 59 triggers because "barclays" appears in the title and has a weighting of 25 that is stored in area 65. The rule 61 is a Boolean combination, where the words "barclays" AND "credit" must appear in the title 50. Rule 61 has a weighting of 50 that is stored in area 67. The statistical algorithm 88 operates on the scores and produces a truncated value of 62 that is stored in area 63. The iterations are shown below:
SCORE   ADD     ACCUMULATOR
50      12.5    62.5
25      0       62.5
For the BARCLAYS category 55, there is only one combination and therefore the statistical algorithm produces an overall score of 62 that is input 96 to the lookup table 97. The value of 62 fits into the HIGH range and this value 98 forms part of the KEY:VALUE pair 99 relating to the BARCLAYS category for this document: HIGH:6688.
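The two worked examples of Figure 5 can be reproduced end to end with a short Python sketch: rule scores are combined per combination, the combination totals are combined into a category score, and the score is classified and joined to the category key. The update formula is an assumption inferred from the worked iterations in this description, the handling of scores lying exactly on a range boundary (20, 40 or 60) is likewise an assumption, and all function names are hypothetical.

```python
from typing import Dict, List, Optional


def combine(scores: List[float]) -> int:
    """Fold scores with the assumed step ADD = X * (100 - A) / 100, then truncate."""
    ordered = sorted(scores, reverse=True)
    accumulator = float(ordered[0]) if ordered else 0.0
    for x in ordered[1:]:
        accumulator += x * (100 - accumulator) / 100
    return int(accumulator)


def classify(score: int) -> Optional[str]:
    """Map a category score to a range; boundary handling is an assumption."""
    if score > 60:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"
    if score >= 20:
        return "LOW"
    return None  # below 20: discarded


def category_pair(key: int, combinations: Dict[str, List[float]]) -> Optional[str]:
    """Return the pair in the range:key form used in the text, or None if discarded."""
    combination_totals = [combine(scores) for scores in combinations.values()]
    value = classify(combine(combination_totals))
    return f"{value}:{key}" if value else None


# Triggered rule scores per combination, as described for Figure 5.
retail = {"GENERAL": [20, 20], "TITLE": [50], "PRODUCTS": [5, 5]}
barclays = {"TITLE": [50, 25]}

print(category_pair(9120, retail))
print(category_pair(6688, barclays))
```

The intermediate totals are 36, 50 and 9 for the three RETAIL FINANCIAL SERVICES combinations, giving a category score of 70, and 62 for the single BARCLAYS combination, so both categories fall into the HIGH range.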
Therefore, for a particular document, the categorisation engine will produce a set of KEY:VALUE pairs 99 (one for each category) that are supplied to the search engine. In the example of figure 5, the creation of two KEY:VALUE pairs, i.e. HIGH:9120 and HIGH:6688, was shown. Therefore if a user wishes to access information that is a lot about RETAIL FINANCIAL SERVICES or a lot about BARCLAYS, then the search engine will present a list of documents that includes the document of the example. In fact, a synopsis of all the KEY:VALUE pairs for the example is shown in Table 3.
It should be noted that the idea behind the statistical algorithm is that if a document triggers a lot of rules in one combination, it will not necessarily be sufficient to push the document over the normal categorisation threshold. Categorisation evidence is stronger if the evidence comes from a variety of combinations.
CATEGORY                     KEY           VALUE
Barclays                     6688          high
Retail Financial Services    9120          high
Credit Cards                 25536         high
Finance                      [illegible]   low
Currency                     26144
Interest Rates               [illegible]   low
Bank of England              1824
[illegible]                  [illegible]
Inflation                    [illegible]
Chemicals and Gases          [illegible]   low
Telecomms                    [illegible]
Hardware and Technology      [illegible]
[illegible] Cable            [illegible]
Telecoms Competition         32224
Table 3
Table 3 shows that the document is a lot about BARCLAYS, RETAIL FINANCIAL SERVICES and CREDIT CARDS and a little about FINANCE, INTEREST RATES and CHEMICALS AND GASES. Although the remaining categories had combinations and rules that triggered, their overall total category scores were lower than the threshold of 20 (Table 2) and thus they were regarded as being insignificant and were therefore discarded. This is confirmed by the fact that no range value is submitted for these categories to the search engine.
In the RETAIL FINANCIAL SERVICES category, it can be seen that the "barclaycard" rule of the PRODUCTS combination has triggered twice, once
because it occurs in the title and once because it occurs in the text. This is because the FIELD specification in the rule-bases was "ANY". Therefore, although the rule has a weighted score of 5 points for each time it triggers, the statistical algorithm provides a total score of 9 for the PRODUCTS combination. Alternatively, a single rule, subject to certain conditions, may trigger multiple times within a document and the score for the rule is incremented under each combination. For example, in the document 8 exemplified in Annex 2 there are multiple references to the word "barclaycard". To detect this, a new rule (as shown below) can be defined under the PRODUCTS combination. In this case, the total combinational score would be increased. This is due to the list now containing multiple scores of the rule weighting, where the exact number of these scores corresponds to the number of times "barclaycard" is found in the document.
Id   Rules                   Field   Scope   Data           Score
11   MULTIPLE(barclaycard)   ANY             Barclay Card
Data Mining and The Search Engine
Figure 6 shows the collated results 12 sent by the categorisation engine 10 to the search engine 14 as a first set of category KEY:VALUE pairs 99 and a second set of combination/rule key:value pairs 102. To minimise confusion, capitals will be used for category KEY:VALUE pairs and lower case for combination/rule key:value pairs.
The formation of the KEY:VALUE pairs at the category level has already been described. However, the categorisation engine also produces a set of key:value pairs at the combination/rule level which shall be referred to herein as the combination/rule key:value pairs. Combination/rule key:value pairs contain information pertaining to the rules that triggered under relevant combinations.
The relevant combinations are those that formed part of a category that had been classified with a HIGH, MEDIUM or LOW value. The notation adopted for the combination/rule key:value pairs is that the key is represented by the ID 30 of the relevant combination, while the value is represented by the contents of the DATA field of the rule that triggered under the relevant combination. In the implementation, the category KEY:VALUE pairs are represented in a "VALUE:KEY" form (for example HIGH:9120), when added to the search engine. Therefore, when a search is performed it is impossible to confuse the category value (i.e. HIGH) with the combination IDs of a combination/rule key:value pair. In the example of figure 3, it was noted that the RETAIL FINANCIAL SERVICES category was rated with a HIGH VALUE and contained three combinations GENERAL, TITLE and PRODUCTS. The following combination/rule key:value pairs 102 are sent to the search engine 14 and held in an index 104: "26488:Barclay Card", "27104:Barclay Card" and "27720:Barclay Card". If at run-time the user does a search on one of the combination/rule key:value pairs, for example, "27104 = Barclay Card", this document will be returned.
Similarly, for the category KEY:VALUE pairs, all categories scoring HIGH, MEDIUM or LOW for a particular document are sent to the search engine 14. In the example, the following pairs will be added to the index 104 of the search engine: "HIGH:6688", "HIGH:9120", "HIGH:25536", "LOW:23712", "LOW:28576" and "LOW:12768". Therefore, if someone searches on "HIGH:9120" or any of the other pairs, this document will be among those returned.
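How the index 104 might hold both kinds of pair and answer such searches can be sketched with a toy inverted index. This is an illustration only: the class `PairIndex`, its methods and the document identifier "doc-8" are hypothetical, and the actual index structure of the search engine is not specified in this description.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class PairIndex:
    """Toy inverted index from pair strings to the documents that carry them."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, pairs: Iterable[str]) -> None:
        """Register every pair produced for a document."""
        for pair in pairs:
            self._postings[pair].add(doc_id)

    def search(self, *pairs: str) -> Set[str]:
        """Return the documents that carry every requested pair."""
        if not pairs:
            return set()
        result = set(self._postings.get(pairs[0], set()))
        for pair in pairs[1:]:
            result &= self._postings.get(pair, set())
        return result


index = PairIndex()
index.add("doc-8", [
    # category pairs listed in the text
    "HIGH:6688", "HIGH:9120", "HIGH:25536",
    "LOW:23712", "LOW:28576", "LOW:12768",
    # combination/rule pairs listed in the text
    "26488:Barclay Card", "27104:Barclay Card", "27720:Barclay Card",
])

print(index.search("HIGH:9120"))
print(index.search("HIGH:9120", "27104:Barclay Card"))
print(index.search("MEDIUM:9120"))
```

Because category pairs start with a range word and combination/rule pairs start with a numeric id, both can share one index without colliding, which is the point made about the "VALUE:KEY" form.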
The search engine takes the results 12 of the categorisation engine 10 and indexes them using an indexing process 200. The indexing process 200 also performs an important step in that it not only adds the combination/rule key:value information to the index 104, but also creates a data-set 106, which contains this information, for the document in question. Typically, each document will have an
<Desc/Clms Page number 18>
associated data-set. The use of this data set to facilitate searching is discussed in the following. The data set 106 provides a data structure which holds a category id, a set of combinations ids together with the combination names in humanly readable form and one or more readable data name associated with each id. This allows the list discussed later with reference to Figure 7c to be generated.
In particular, the indexing process 200 collates the readable data names from their initial format into their final format, "combination id:HRDI" (HRDI stands for human readable data item), which is discussed later. The process takes categorisation results, for example:
    <combination id="123" name="product">
      <rule id="222" phrase="Barclays AND credit" data="Barclaycard"/>
      <rule id="234" phrase="Barclaycard" data="Barclaycard"/>
    </combination>
and merges the two instances to produce a final format for indexing and for adding to the data structure associated with the document in the index. The final format in this example would be "123:Barclaycard".
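The merge step described above can be illustrated with a short sketch. It assumes the categorisation result is available as well-formed XML with `id` and `data` attributes; the function name `combination_terms` is hypothetical.

```python
import xml.etree.ElementTree as ET


def combination_terms(xml_fragment):
    """Collapse the triggered rules of one combination into distinct
    "combination id:HRDI" terms, keeping first-seen order."""
    combination = ET.fromstring(xml_fragment)
    comb_id = combination.get("id")
    data_names = []
    for rule in combination.findall("rule"):
        data = (rule.get("data") or "").strip()
        # Rules without a DATA field contribute nothing; repeats are merged.
        if data and data not in data_names:
            data_names.append(data)
    return [f"{comb_id}:{name}" for name in data_names]


SAMPLE = """
<combination id="123" name="product">
  <rule id="222" phrase="Barclays AND credit" data="Barclaycard"/>
  <rule id="234" phrase="Barclaycard" data="Barclaycard"/>
</combination>
"""
merged = combination_terms(SAMPLE)
```

Both rules carry the same DATA name, so the two instances collapse into a single term for the combination.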
At run-time, i.e. to perform a search, the user is able to interface with the system using the navigation hierarchy element 16 (Figure 1). In practice, this would be a GUI (Graphical User Interface) similar to the example shown in the screen shots of Figures 7a, b, c and d. Each level of the hierarchy has an associated searching function that is translated for the search engine, when a user selects a level. This is best illustrated in the example.
Therefore, the on-line user may conduct his search from any level he chooses and in a flexible manner. He can use intuitive mouse-clicking operations to narrow down the documents or perform a standard free-text search by entering words in the text box 220. At the highest category level, Figure 7a shows the initial display, where a user may select one of the broad categories listed under the DIRECTORY heading 201. Figure 7b shows that the user has selected the BARCLAYS category; a list of relevant documents 22$ is displayed together with a list of associated categories 226.
The user decides to further narrow his search by selecting the RETAIL FINANCIAL SERVICES category. The navigational hierarchy translates this into a search on the KEY:VALUE pairs: "HIGH = 9120" OR "MEDIUM = 9120" OR "LOW = 9120". The search engine uses its index to check which documents correspond to the supplied KEY:VALUE pairs. Figure 7c shows that a new list (say 20) of documents 240 is returned as a result of the narrowed search criteria. At this stage, however, the user may still be faced with a large result set (e.g. 10,000 documents). He can now decide to use the DATA MINING option shown in the drop-down menu 250.
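The translation of a category selection into an OR search can be sketched as below. The index is modelled as a plain dictionary from term to a set of document ids, and the terms use the "VALUE:KEY" colon form in which they are stored; both choices, along with the function names and document ids, are assumptions for illustration.

```python
RATINGS = ("HIGH", "MEDIUM", "LOW")


def category_terms(category_id):
    """Terms that select a category at any indexed rating."""
    return [f"{rating}:{category_id}" for rating in RATINGS]


def search_any(index, terms):
    """Union of the document sets of all terms (logical OR)."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


# Hypothetical index: term -> set of document ids.
index = {
    "HIGH:9120": {"doc-145"},
    "LOW:9120": {"doc-200"},
    "HIGH:6688": {"doc-145", "doc-300"},
}
result = search_any(index, category_terms(9120))
```

A term with no entry in the index (here "MEDIUM:9120") simply contributes no documents.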
The search engine has gone through all the documents in this category and extracted relevant information which is held in the data set 106 as described above. The navigation hierarchy returns to the front end interface a list of the key:value pairs (i.e. COMBINATION ID:DATA) for the rules that fired in that category.
That list is made up of COMBINATIONS and DATA from ALL documents in the result set for that category - not just the 20, say, that the user is faced with on the first page. That list has the following format:
    <category id="9120">
      <combination id="27720" name="Products">
        <data name="Barclay Card"/>
        <data name="Barclays ISA"/>
        <data name="Barclay Loan"/>
      </combination>
      <combination id="28952" name="Customers">
        <data name="Small Businesses"/>
        <data name="Students"/>
      </combination>
      <combination id="29668" name="Alliances">
        <data name="Dell Corporation"/>
      </combination>
    </category>
The human readable free text data name is displayed on the screen as at 243 in Figure 7c. The DATA MINING list provides the user with an opportunity to refine his search by clicking on a relevant combination or rule that best suits the topic he requires.
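One way to assemble such a DATA MINING list from the per-document data-sets is sketched below. The nested dictionary layout (document id, then category id, then combination id) and the function name `data_mining_list` are hypothetical; the point illustrated is that the list is aggregated over every document in the result set, with each data name listed once per combination.

```python
def data_mining_list(data_sets, result_docs, category_id):
    """Merge the data-sets of all documents in the result set into one
    list of combinations and distinct data names for a category."""
    merged = {}
    for doc_id in result_docs:
        combinations = data_sets.get(doc_id, {}).get(category_id, {})
        for comb_id, comb in combinations.items():
            entry = merged.setdefault(comb_id, {"name": comb["name"], "data": []})
            for data_name in comb["data"]:
                if data_name not in entry["data"]:
                    entry["data"].append(data_name)
    return merged


# Hypothetical data-sets: doc id -> category id -> combination id -> details.
data_sets = {
    "doc-1": {9120: {27720: {"name": "Products", "data": ["Barclay Card"]},
                     29668: {"name": "Alliances", "data": ["Dell Corporation"]}}},
    "doc-2": {9120: {27720: {"name": "Products",
                             "data": ["Barclay Card", "Barclays ISA"]}}},
}
listing = data_mining_list(data_sets, ["doc-1", "doc-2"], 9120)
```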
In the example of Figure 7c the user decides to select the ALLIANCES combination. The system then uses the information in the data-sets to provide the set of rules that have triggered in all documents pertaining to the RETAIL FINANCIAL SERVICES category and the ALLIANCES combination. In the example, the user then selects "Dell Computer Corporation", which is the DATA field for rule 1 of the ALLIANCES combination (see Figure 4f). More generally, by selecting the ALLIANCES combination, a list of rules that triggered for the 53 listed documents is supplied; it corresponds to the text programmed into the DATA field of the rule-bases for the relevant rules. In this manner, the user is presented with a more intuitive method of rapidly finding the exact information required.
In Figure 7c, if the user clicks on ALLIANCES/DELL COMPUTER CORPORATION in the DATA MINING list, this is translated into a narrowed search for: ("HIGH:9120" OR "MEDIUM:9120" OR "LOW:9120") AND "29668:Dell Corporation". The refined search produces a narrowed document list whose documents should be substantially about both RETAIL FINANCIAL SERVICES and DELL CORPORATION. The result 260 is shown in Figure 7d, where only 1 relevant document is listed.
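The narrowed search can be read as an intersection: the documents in the category at any rating, restricted to those that also carry the selected combination/rule pair. The sketch below makes that grouping explicit; the dictionary-based index, the function name `narrowed_search` and the document ids are assumptions for illustration.

```python
def narrowed_search(index, category_id, combination_id, data_name):
    """(HIGH OR MEDIUM OR LOW for the category) AND the combination/rule pair."""
    in_category = set()
    for rating in ("HIGH", "MEDIUM", "LOW"):
        in_category |= index.get(f"{rating}:{category_id}", set())
    with_data = index.get(f"{combination_id}:{data_name}", set())
    # Set intersection implements the AND between the two groups.
    return in_category & with_data


# Hypothetical index: term -> set of document ids.
index = {
    "HIGH:9120": {"doc-1", "doc-2"},
    "LOW:9120": {"doc-3"},
    "29668:Dell Corporation": {"doc-2", "doc-4"},
}
narrowed = narrowed_search(index, 9120, 29668, "Dell Corporation")
```

Here "doc-4" mentions the alliance but is not rated in the category, so it is excluded from the narrowed list.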
Typically, the document list 260 may still be fairly large. If this is the case, the search engine will again repeat the process of extracting information from the associated data-sets of the new list 260. This enables the user to drill down further using the refreshed DATA MINING lists.
It should be noted that, in searching, numbers are used rather than names because they provide unique descriptors. However, at the point when the user is using the DATA MINING list and searching with combination keys, names are used rather than numbers. The reason is that at this stage it is the particular DATA name of a rule in humanly readable form (i.e. "Barclay Card") that is important, rather than an instance of a rule. The DATA name "Barclay Card" can be triggered by more than one rule within a combination, which provides a further opportunity for improving the interface with a human. See for example Figure 4c, where the free-text portions of rules 5 and 6 (barclays AND internet; barclays AND web) both have the DATA name Barclays Internet, which will be displayed to a user in the drop-down process.
Another way in which the present invention may be adapted is for the creator of the rule-bases to produce a rule like "small businesses" under the CUSTOMERS combination and give it a tiny score. It is evident that although the small weighted rule will have minimal categorisation value, it will appear in the DATA MINING list and therefore is useful in figuring out what documents are about.
It should be realised that the banking example presented in the description is intended to be non-limiting, and it should be appreciated that the rule-bases would be adapted according to the nature of the information. It should also be realised that the particular formats described in the example, i.e. XML for the intermediate format and Excel for the textual format, are non-limiting and may be replaced by other suitable tools known to a person skilled in the art.
Rule Input Format to Category Engine

<xml>
  <category name="Retail Financial Services" id="\barclays\Retail Financial Services\Retail Financial Services.xls" data="">
    <combination name="general" score="100" data="" id="Retail Financial Services/general">
      <rule id="$C$7" field="ANY" score="10">barclays AND nationwide</rule>
      <rule id="$C$8" field="ANY" score="10">barclays AND cash</rule>
      <rule id="$C$9" field="ANY" score="10">barclays AND it</rule>
      <rule id="$C$10" field="ANY" score="20">barclayloan/-/'s</rule>
      <rule id="$C$11" field="ANY" score="10">barclays AND internet</rule>
      <rule id="$C$12" field="ANY" score="10">barclays AND mortgage</rule>
      <rule id="$C$13" field="ANY" score="10">barclays AND student</rule>
      <rule id="$C$14" field="ANY" score="10">barclays AND charges</rule>
      <rule id="$C$15" field="ANY" score="10">barclays AND holiday/-/s/makers</rule>
      <rule id="$C$16" field="ANY" score="20">barclaycard</rule>
      <rule id="$C$17" field="ANY" score="10">barclays AND pension</rule>
      <rule id="$C$18" field="ANY" score="10">barclays AND small business/-/es</rule>
      <rule id="$C$19" field="ANY" score="10">barclays AND graduate</rule>
      <rule id="$C$20" field="ANY" score="10">barclays AND wealth management</rule>
      <rule id="$C$21" field="ANY" score="10">barclays AND isa</rule>
      <rule id="$C$22" field="ANY" score="10">barclays AND telephone banking</rule>
      <rule id="$C$23" field="ANY" score="20">barclaycall</rule>
      <rule id="$C$24" field="ANY" score="20">barclaysconnect</rule>
      <rule id="$C$25" field="ANY" score="20">barclays merchant services</rule>
      <rule id="$C$26" field="ANY" score="20">barclaysquare</rule>
    </combination>
    <combination name="title" score="100" data="" id="Retail Financial Services/title">
      <rule id="$C$7" field="title" score="25">barclays AND nationwide</rule>
      <rule id="$C$8" field="title" score="50">barclays AND cash</rule>
      <rule id="$C$9" field="title" score="50">barclays AND it</rule>
      <rule id="$C$10" field="title" score="50">barclayloan/-/'s</rule>
      <rule id="$C$11" field="title" score="50">barclays AND internet</rule>
      <rule id="$C$12" field="title" score="50">barclays AND mortgage</rule>
      <rule id="$C$13" field="title" score="50">barclays AND student</rule>
      <rule id="$C$14" field="title" score="50">barclays AND charges</rule>
      <rule id="$C$15" field="title" score="50">barclays AND holiday/-/s/makers</rule>
      <rule id="$C$16" field="title" score="50">barclaycard</rule>
      <rule id="$C$17" field="title" score="50">barclays AND pension</rule>
      <rule id="$C$18" field="title" score="50">barclays AND small business/-/es</rule>
      <rule id="$C$19" field="title" score="50">barclays AND graduate</rule>
      <rule id="$C$20" field="title" score="50">barclays AND wealth management</rule>
      <rule id="$C$21" field="title" score="50">barclays AND isa</rule>
      <rule id="$C$22" field="title" score="50">barclays AND telephone banking</rule>
      <rule id="$C$23" field="title" score="50">barclaycall</rule>
      <rule id="$C$24" field="title" score="50">barclaysconnect</rule>
      <rule id="$C$25" field="title" score="50">barclays merchant services</rule>
      <rule id="$C$26" field="title" score="50">barclaysquare</rule>
    </combination>
    <combination name="products" score="100" data="" id="Retail Financial Services/products">
      <rule id="$C$7" field="ANY" data="Barclay Card" score="5">barclaycard</rule>
      <rule id="$C$8" field="ANY" data="Barclay Loan" score="5">barclayloan</rule>
      <rule id="$C$9" field="ANY" data="Barclay Call" score="5">barclaycall</rule>
      <rule id="$C$10" field="ANY" data="Barclays Connect" score="5">barclaysconnect</rule>
      <rule id="$C$11" field="ANY" scope="sentence" data="Barclays Internet" score="5">barclays AND internet</rule>
      <rule id="$C$12" field="ANY" scope="sentence" data="Barclays Internet" score="5">barclays AND web</rule>
      <rule id="$C$13" field="ANY" scope="sentence" data="Barclays Insurance" score="5">barclays insurance</rule>
      <rule id="$C$14" field="ANY" scope="sentence" data="Barclays Mortgages" score="5">barclays AND mortgage/-/s</rule>
      <rule id="$C$15" data="Barclays Internet" score="20">barclays.net</rule>
      <rule id="$C$16" field="ANY" data="Barclays ISA" score="1">isa</rule>
    </combination>
    <combination name="competitors" score="100" data="" id="Retail Financial Services/competitors">
      <rule id="$C$7" field="ANY" data="Nationwide Building Society" score="10">nationwide</rule>
      <rule id="$C$8" field="ANY" data="National Westminster Bank" score="10">natwest</rule>
      <rule id="$C$9" field="ANY" data="Midlands Bank" score="10">midlands</rule>
      <rule id="$C$10" field="ANY" data="Hong Kong Shanghai Banking Corporation" score="10">hsbc</rule>
      <rule id="$C$11" field="ANY" data="Royal Bank of Scotland" score="10">royal bank of scotland</rule>
      <rule id="$C$12" field="ANY" data="Lloyds-TSB Bank" score="10">lloyds</rule>
      <rule id="$C$13" field="ANY" data="Lloyds-TSB Bank" score="10">tsb</rule>
      <rule id="$C$14" field="ANY" data="Lloyds-TSB Bank" score="10">lloyds-tsb</rule>
      <rule id="$C$15" field="ANY" data="Bank of Scotland" score="10">bank of scotland</rule>
      <rule id="$C$16" field="ANY" data="Clydesdale Bank" score="10">clydesdale bank</rule>
    </combination>
    <combination name="customers" score="100" data="" id="Retail Financial Services/customers">
      <rule id="$C$7" field="ANY" data="Small Business" score="1">small business/-/es</rule>
      <rule id="$C$8" field="ANY" data="Students" score="1">graduate</rule>
      <rule id="$C$9" field="ANY" data="Students" score="1">student AND loan/-/s</rule>
      <rule id="$C$10" field="ANY" data="Students" score="1">student AND grant/-/s</rule>
      <rule id="$C$11" field="ANY" data="Small Business" score="1">small corporations</rule>
    </combination>
    <combination name="alliances" score="100" data="" id="Retail Financial Services/alliances">
      <rule id="$C$7" field="ANY" data="Dell Computer Corporation" score="1">dell</rule>
      <rule id="$C$8" field="ANY" data="BBC Worldwide" score="1">bbc AND worldwide</rule>
      <rule id="$C$9" field="ANY" data="Microsoft" score="1">microsoft</rule>
      <rule id="$C$10" field="ANY" data="Microsoft" score="1">webtv</rule>
      <rule id="$C$11" field="ANY" data="Cellnet" score="1">cellnet</rule>
      <rule id="$C$12" field="ANY" data="Link" score="1">link</rule>
      <rule id="$C$13" field="ANY" data="Royal National Institute for Deaf People" score="1">rnid</rule>
      <rule id="$C$14" field="ANY" data="Royal National Institute for Deaf People" score="1">institute for deaf</rule>
      <rule id="$C$15" field="ANY" data="Blind In Business" score="1">bib</rule>
      <rule id="$C$15" field="ANY" data="Blind In Business" score="1">Blind In Business</rule>
      <rule id="$C$17" field="ANY" data="Eastern Group" score="1">eastern group</rule>
      <rule id="$C$18" field="ANY" data="Unisys" score="10">unisys</rule>
    </combination>
  </category>
</xml>
Output of categorization engine for one document.

<document title="Barclays Newsroom. News Releases: Barclaycard sets the standards for the UK credit card market" url="http://www.newsroom.barcllays.co.uk/news/data/145.html">
  <title>Barclays Newsroom. News Releases: Barclaycard sets the standards for the UK credit card market</title>
  <category id="6688" name="Barclays" score="62" range="high" source="/barclays/barclays.xls">
    <combination id="20944" name="Title" bubble="0" score="62" source="Barclays/title">
      <rule id="66588" score="25" phrase="barclays" source="$C$7" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="0" le="8" rs="0" re="0" />
      <rule id="67084" score="50" phrase="barclays AND credit" source="$C$11" field="FIELD_TITLE" scope="ANY" operator="AND" ls="0" le="8" rs="76" re="82" />
    </combination>
  </category>
  <category id="9120" name="Retail Financial Services" score="70" range="high" source="/barclays/Retail Financial Services/Retail Financial Services.xls">
    <combination id="26488" name="General" bubble="0" score="36" source="Retail Financial Services/general">
      <rule id="75888" score="20" phrase="barclaycard" source="$C$16" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
      <rule id="75888" score="20" phrase="barclaycard" source="$C$16" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="22" le="33" rs="0" re="0" />
    </combination>
    <combination id="27104" name="Title" bubble="0" score="50" source="Retail Financial Services/title">
      <rule id="78368" score="50" phrase="barclaycard" source="$C$16" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
    </combination>
    <combination id="27720" name="Products" bubble="1" score="9" source="Retail Financial Services/products">
      <rule id="79732" score="5" phrase="barclaycard" source="$C$7" data="Barclay Card" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
      <rule id="79732" score="5" phrase="barclaycard" source="$C$7" data="Barclay Card" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="22" le="33" rs="0" re="0" />
    </combination>
  </category>
  <category id="25536" name="Credit Cards" score="75" range="high" source="/Finance/credit cards/credit cards.xls">
    <combination id="87472" name="Groups" bubble="1" score="48" source="credit cards/groups">
      <rule id="255192" score="10" phrase="barclaycard" source="$C$11" data="BarclayCard" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
      <rule id="255316" score="25" phrase="barclaycard" source="$C$12" data="BarclayCard" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
      <rule id="255192" score="10" phrase="barclaycard" source="$C$11" data="BarclayCard" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="22" le="33" rs="0" re="0" />
      <rule id="255440" score="15" phrase="credit card company" source="$C$13" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="1446" le="1465" rs="0" re="0" />
    </combination>
    <combination id="88704" name="General" bubble="0" score="53" source="credit cards/general">
      <rule id="260400" score="15" phrase="credit card" source="$C$12" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="260524" score="25" phrase="credit card" source="$C$13" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="260400" score="15" phrase="credit card" source="$C$12" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="64" le="75" rs="0" re="0" />
      <rule id="261888" score="10" phrase="platinum card" source="$C$23" data="Platinum Card" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="411" le="424" rs="0" re="0" />
      <rule id="260896" score="5" phrase="credit limit" source="$C$16" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="3953" le="3965" rs="0" re="0" />
    </combination>
  </category>
  <category id="23712" name="Finance" score="22" range="low" source="/Finance/finance.xls">
    <combination id="82544" name="General" bubble="0" score="18" source="finance/general">
      <rule id="230392" score="5" phrase="credit card" source="$C$7" data="Credit Cards" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="230516" score="10" phrase="credit card" source="$C$8" data="Credit Cards" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="230392" score="5" phrase="credit card" source="$C$7" data="Credit Cards" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="64" le="75" rs="0" re="0" />
    </combination>
    <combination id="81928" name="Indicators" bubble="1" score="5" source="finance/indicators">
      <rule id="229896" score="5" phrase="interest rates" source="$C$9" data="Interest Rates" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="26144" name="Currency" score="9" source="/Finance/currency/currency.xls">
    <combination id="90552" name="Folding" bubble="1" score="9" source="currency/folding">
      <rule id="267344" score="5" phrase="credit card" source="$C$16" data="Credit Cards" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="267344" score="5" phrase="credit card" source="$C$16" data="Credit Cards" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="64" le="75" rs="0" re="0" />
    </combination>
  </category>
  <text>Barclaycard sets the standards for the UK credit card market7th April 1999 Almost seven million customers will benefit from a powerful package of new initiatives unveiled today by Barclaycard - the UK's largest credit card. The package based on the results of intensive customer research includes the introduction of a new rewards scheme, free extended warranty for all customers, a new platinum card and a cut in interest rates. "These initiatives will further strengthen Barclaycard's grip on the competitive credit card market. No other credit card can offer this combination of additional benefits, powerful rewards, attractive interest rates and no hidden charges," comments John Eaton, managing director at Barclaycard. Barclaycard Rewards scheme: The new Barclaycard Rewards scheme enables customers to make unique savings on gas, electricity and telephone bills, from May 1 this year. For example, Barclaycard customers can already save 20 per cent on home telephone calls, but by redeeming Barclaycard Rewards points, cardholders can boost their savings by an extra 10 per cent. This represents a saving of up to 56 for the average customer spending 200 on household calls through BT standard charges.
New deals on travel insurance (up to 1 per cent off) and AA cover (40 per cent off) are also being launched through the Barclaycard Rewards scheme. Extended warranty: Barclaycard is to become the first major credit card company to offer all cardholders free extended warranty. From April 15, 1999 customers purchasing new household appliances with their Barclaycard will benefit from one year's free extended warranty. The service applies to most new household appliances costing more than 25. To take advantage of the offer customers simply need to registe a purchase within 90 days. Platinum card: Barclaycard Platinum is a top of the range extension to Barclaycard offering customers an unrivalled range of benefits. The highlight of the benefits of Barclaycard Platinum will be two year's free extended warranty on most household appliances paid for with the card. Together with the initial manufacturer's warranty period this provides a total warranty period - absolutely free of charge - of three years for Barclaycard Platinum customers. Additionally, Platinum customers will receive
favourable rates ranging from 14.9 to 17.9 per cent. Cardholders spending 6,000 or more in a year on their card will also receive a full rebate of the annual fee. By the end of this year Barclaycard expects to have issued up to 500,000 Platinum cards. Lowest Barclaycard rates: Barclaycard has also announced a reduction in interest rates of one per cent to 19.9 per cent - the lowest rate ever. Barclaycard's interest rates have fallen by three per cent in the last six months. The new standard Barclaycard APRs will range from 16.9 per cent to 19.9 per cent, depending on the amount a cardholder spends each month. In addition half of Barclaycard's customers will benefit from the fee rebate thresholds being lowered from 5,000 to 2,000. Barclaycard is also offering both new and existing cardholders the opportunity to transfer balances from other cards to their Barclaycard at an APR of 9.9 per cent for six months. "Customers need to look at the complete picture when choosing a credit card. They should look not just at the initial APR, but at the interest free period -ours is 56 days- any hidden charges, reward schemes and the range of additional benefits available. It is therefore no wonder we attracted half a million new customers last year alone," comments John Eaton. Research* conducted by Barclaycard indicates that a staggering number of cardholders in the.UK are still not aware of the hidden charges imposed on them by issuers. Credit cardholders who do not have a Barclaycard could end up paying for at least one of the following hidden charges - up to 20 for a late payment, up to 15 for exceeding a credit limit, up to 10 for a duplicate statement, up to 15 for a direct debit or 5 for a copy voucher. Cardholders incurring these costs once or twice a year will find that they cancel out the benefits of limited APR special offers. "The introduction of this new package offers more value to our cardholders than ever before. 
These additions will ensure that Barclaycard extends its lead as the UK's number one credit card," concludes John Eaton. Notes to editors: 1. The research was conducted in March 1999 by Audience Selection on behalf of Barclaycard. 1013 telephone interviews were undertaken. 2. For every 10 spent cardholders will receive one Reward point. All profiles points will automatically be converted into Reward points of equivalent value. For further information, journalists should contact the relevant press office</text>
  <category id="28576" name="Interest Rates" score="23" range="low" source="/Finance/interest rates/interest rates.xls">
    <combination id="99792" name="Higher" bubble="1" score="10" source="interest rates/higher">
      <rule id="303800" score="10" phrase="cut in interest rates" source="$C$7" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="431" le="452" rs="0" re="0" />
    </combination>
    <combination id="101024" name="General" bubble="0" score="15" source="interest rates/general">
      <rule id="311364" score="15" phrase="interest rates" source="$C$12" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="1824" name="Bank Of england" score="5" source="/Banking/bank of england/bank of england.xls">
    <combination id="2464" name="Policy" bubble="1" score="5" source="bank of england/policy">
      <rule id="8804" score="5" phrase="interest rates" source="$C$16" data="Interest Rates" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="18848" name="Erm" score="5" source="/economy/euro/erm/erm.xls">
    <combination id="69608" name="Indicators" bubble="1" score="5" source="erm/indicators">
      <rule id="200260" score="5" phrase="interest rates" source="$C$8" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="27968" name="Inflation" score="5" source="inflation/inflation">
    <combination id="99176" name="General" bubble="0" score="5" source="inflation/general">
      <rule id="299708" score="5" phrase="interest rates" source="$C$9" data="Interest Rates" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="12768" name="Chemicals and Gases" score="20" range="low" source="/environment/chemicals and gases/chemicals and gases.xls">
    <combination id="47432" name="General" bubble="0" score="20" source="chemicals and gases/general">
      <rule id="135408" score="20" phrase="gas" source="$C$30" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="856" le="859" rs="0" re="0" />
    </combination>
  </category>
  <category id="30400" name="Telecoms" score="11" source="/telecoms/telecoms.xls">
    <combination id="115808" name="General" bubble="0" score="5" source="telecoms/general">
      <rule id="373488" score="5" phrase="telephone" source="$C$21" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="877" le="886" rs="0" re="0" />
    </combination>
    <combination id="113960" name="Companies" bubble="1" score="7" source="telecoms/companies">
      <rule id="364064" score="7" phrase="bt" source="$C$10" data="British Telecom" field="FIELD_TEXT" scope="ANY" operator="OR" ls="1215" le="1217" rs="0" re="0" />
    </combination>
  </category>
  <category id="34656" name="Hardware and Technology" score="5" source="/telecoms/technology and hardware/technology & hardware.xls">
    <combination id="139216" name="Phones" bubble="1" score="5" source="hardware and technology/phones">
      <rule id="456692" score="5" phrase="telephone" source="$C$25" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="877" le="886" rs="0" re="0" />
    </combination>
  </category>
  <category id="26752" name="Debt" score="5" source="/Finance/debt/debt.xls">
    <combination id="93632" name="Personal" bubble="1" score="5" source="debt/personal">
      <rule id="274908" score="5" phrase="bills" source="$C$9" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="887" le="892" rs="0" re="0" />
    </combination>
  </category>
  <category id="31008" name="Cable Telecoms" score="5" source="/telecoms/cable comms/cable comms.xls">
    <combination id="118272" name="Players" bubble="1" score="5" source="cable telecoms/players">
      <rule id="382292" score="5" phrase="bt" source="$C$14" data="BT Cable" field="FIELD_TEXT" scope="ANY" operator="OR" ls="1215" le="1217" rs="0" re="0" />
    </combination>
  </category>
  <category id="32224" name="Competition" score="3" source="/telecoms/competition/competition.xls">
    <combination id="123200" name="Names" bubble="1" score="3" source="competition/names">
      <rule id="400520" score="3" phrase="bt" source="$C$9" data="bt OR british telecom" field="FIELD_TEXT" scope="ANY" operator="OR" ls="1215" le="1217" rs="0" re="0" />
    </combination>
  </category>
</document>
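The scoring algorithm itself is not given in this part of the description. As an observation only, the combination and category scores in the sample output above are numerically consistent with treating each score as a percentage, combining the scores as 1 minus the product of (1 minus score/100), and truncating to a whole number, applied first to the triggered rules and then to the resulting combination scores. The sketch below checks that reading against a few printed values; the function name `combine` and the formula are inferences made for illustration, not a statement of the actual algorithm.

```python
from fractions import Fraction


def combine(scores):
    """Combine percentage scores as 100 * (1 - prod(1 - s/100)), truncated.

    Fractions keep the arithmetic exact, so truncation is not disturbed by
    floating-point rounding. This formula is an inference from the sample
    output, not taken from the description.
    """
    remaining = Fraction(1)
    for score in scores:
        remaining *= 1 - Fraction(score, 100)
    return int((1 - remaining) * 100)  # int() truncates toward zero


# First pass: rule scores that triggered in each combination of category 9120.
general = combine([20, 20])
title = combine([50])
products = combine([5, 5])
# Second pass: the (already truncated) combination scores give the category score.
retail_financial_services = combine([general, title, products])
```

Under this reading the second pass operates on the truncated combination scores rather than on the unrounded values.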

Claims (13)

  1. A system for categorising and indexing documents comprising: means for comparing textual content of a document with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string and a data name displayable in humanly readable form, said comparing means being organised to trigger a rule when the alphanumeric string of that rule is located in the textual part, wherein each rule is associated with a weighting factor; means for generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; and a data structure arranged to hold a category identifier and a group of data names displayable in humanly readable form, said data names being derived from the rules that triggered for any category for which the category score is above a predetermined threshold.
  2. A system according to claim 1, wherein each rule set defining a category is divided into combinations, each combination comprising a plurality of rules wherein the data structure holds a plurality of combination identifiers each identifying a combination and being associated with a combination name in humanly readable form together with said group of data names.
  3. A system according to claim 2, wherein said means for generating a category score is arranged to generate combination scores and comprises a processor programmed to run an algorithm which receives the weighting factors for the rules which triggered in a particular combination and which generates a combination score based on said weighting factors.
  4. A system according to claim 3, wherein the category score is generated using said algorithm having as its inputs the combination scores generated by the first pass of the algorithm.
  5. A system according to any preceding claim, which includes at least one rule base comprising a store holding said plurality of rule sets defining respective categories.
  6. A system according to claim 5, wherein the rule base holds in association with each rule a field identifying the location in the document where said at least one alphanumeric string is to be located for the rule to trigger.
  7. A system according to any preceding claim, wherein at least some of said rules are constituted by a Boolean combination of alphanumeric strings which both have to be located in the document to trigger the rule.
  8. A system according to any preceding claim, wherein at least some of said data names are shared by different rules.
  9. A system according to any preceding claim, which comprises means for executing a search according to certain criteria wherein said data names are displayed to a user to allow further limitation of said search criteria to reduce the number of selected documents.
  10. 10. A search engine for searching documents containing textual content, the search engine comprising: a category index holding for each document a plurality of key:value pairs, each key-.value pair comprising a key identifying the category and a value denoting a category score representing an importance weighting for that
    <Desc/Clms Page number 35>
    category, said importance weighting having been derived from rules which triggered in a categorisation process; a data structure arranged to hold said key and a group of data names displayable in humanly readable form, said data names having been derived from the rules which triggered in the categorisation process for each category having a category score above a predetermined threshold; and means for cooperating with a user interface so that when a search is run according to certain search criteria and a category is selected, said data names are displayable.
  11. 11. A method of categorising and indexing documents comprising: comparing textual content of a document with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string and a data name displayable in humanly readable form, wherein a rule is triggered when the alphanumeric string of that rule is located in the textural part wherein each rule is associated with a weighting factor; generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; and creating a data structure holding a category identifier and a group of data names displayable in humanly readable form, said data names being derived from the rules that triggered in any category for which the category score is greater than a predetermined threshold.
  12. 12. A method according to claim 11, wherein when a search is run according to search criteria and a category is selected, said data names are displayed to a user to allow further limitation of said search criteria to reduce the number of selected documents.
    <Desc/Clms Page number 36>
  13. 13. A method according to claim 11 or 12, which comprises setting up a plurality of rule bases containing the rule sets with which each document is to be compared.
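
The scoring flow recited in claims 1 to 4 can be illustrated with a minimal Python sketch. The claims do not disclose the scoring algorithm itself, so the noisy-OR formula in `combine` is purely an assumption, as are the names `Rule`, `combine` and `categorise`, the example rule sets, the weighting factors and the threshold of 0.7. Matching is simplified to a case-insensitive substring search, and a rule with several strings triggers only when all of them are present, in the spirit of claim 7.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    strings: tuple[str, ...]  # every string must be found for the rule to trigger
    data_name: str            # human-readable name shown to the user
    weight: float             # weighting factor, assumed here to lie in [0, 1]

    def triggers(self, text: str) -> bool:
        lowered = text.lower()
        return all(s.lower() in lowered for s in self.strings)


def combine(weights):
    """Assumed scoring algorithm (noisy-OR); the claims do not specify one."""
    remaining = 1.0
    for w in weights:
        remaining *= 1.0 - w
    return 1.0 - remaining


def categorise(text, rule_sets, threshold):
    """rule_sets maps category id -> {combination name: [Rule, ...]}."""
    index = {}
    for category_id, combinations in rule_sets.items():
        combination_scores = {}
        data_names = []
        for combination_name, rules in combinations.items():
            fired = [r for r in rules if r.triggers(text)]
            if not fired:
                continue
            # First pass: rule weights -> combination score.
            combination_scores[combination_name] = combine(r.weight for r in fired)
            for r in fired:
                if r.data_name not in data_names:  # data names may be shared by rules
                    data_names.append(r.data_name)
        # Second pass: the same algorithm applied to the combination scores.
        category_score = combine(combination_scores.values())
        if category_score > threshold:
            index[category_id] = {
                "score": category_score,
                "combinations": combination_scores,
                "data_names": data_names,
            }
    return index


# Hypothetical rule base and document, for illustration only.
rule_sets = {
    "contract_law": {
        "formation": [
            Rule(("offer",), "Offer", 0.5),
            Rule(("acceptance",), "Acceptance", 0.5),
        ],
        "remedies": [Rule(("damages", "breach"), "Damages for breach", 0.6)],
    },
    "criminal_law": {"offences": [Rule(("theft",), "Theft", 0.9)]},
}
text = "The offer and its acceptance preceded the breach; damages were awarded."
index = categorise(text, rule_sets, threshold=0.7)
print(index)
```

In this sketch only the hypothetical `contract_law` category clears the threshold, so the resulting structure holds that category identifier together with the data names of the rules that triggered. Those names are what a search interface of the kind described in claims 9 and 12 could display to let a user narrow the search.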
GB0005091A 2000-03-02 2000-03-02 A system for categorising and indexing documents Withdrawn GB2366877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0005091A GB2366877A (en) 2000-03-02 2000-03-02 A system for categorising and indexing documents

Publications (2)

Publication Number Publication Date
GB0005091D0 GB0005091D0 (en) 2000-04-26
GB2366877A GB2366877A (en) 2002-03-20

Family

ID=9886858

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
GB0005091A Withdrawn GB2366877A (en) 2000-03-02 2000-03-02 A system for categorising and indexing documents

Country Status (1)

Country Link
GB (1) GB2366877A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2385159A (en) * 2002-02-07 2003-08-13 3G Lab Ltd Classification of access objects for content provision to a mobile
US7065532B2 (en) * 2002-10-31 2006-06-20 International Business Machines Corporation System and method for evaluating information aggregates by visualizing associated categories
US7363035B2 (en) 2002-02-07 2008-04-22 Qualcomm Incorporated Method and apparatus for providing content to a mobile terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5642502A (en) * 1994-12-06 1997-06-24 University Of Central Florida Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text
US5943670A (en) * 1997-11-21 1999-08-24 International Business Machines Corporation System and method for categorizing objects in combined categories
GB2336700A (en) * 1998-04-24 1999-10-27 Dialog Corp Plc The Generating machine readable association files

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2385159A (en) * 2002-02-07 2003-08-13 3G Lab Ltd Classification of access objects for content provision to a mobile
GB2385159B (en) * 2002-02-07 2004-03-24 3G Lab Ltd Providing content to a mobile terminal
US7363035B2 (en) 2002-02-07 2008-04-22 Qualcomm Incorporated Method and apparatus for providing content to a mobile terminal
US8428564B2 (en) 2002-02-07 2013-04-23 Qualcomm Incorporated Method and apparatus for providing updated content data to a mobile terminal
US7065532B2 (en) * 2002-10-31 2006-06-20 International Business Machines Corporation System and method for evaluating information aggregates by visualizing associated categories

Also Published As

Publication number Publication date
GB0005091D0 (en) 2000-04-26

Similar Documents

Publication Publication Date Title
US7373669B2 (en) Method and system for determining presence of probable error or fraud in a data set by linking common data values or elements
US8429167B2 (en) User-context-based search engine
CN102456075B (en) Respond the method and system from the inquiry of user
US7266537B2 (en) Predictive selection of content transformation in predictive modeling systems
US20110313854A1 (en) Online advertising valuation apparatus and method
CN109255586B (en) Online personalized recommendation method for e-government affairs handling
JP2000511671A (en) Automatic document classification system
Fitzpatrick et al. How can lenders prosper? Comparing machine learning approaches to identify profitable peer-to-peer loan investments
US20080215614A1 (en) Pyramid Information Quantification or PIQ or Pyramid Database or Pyramided Database or Pyramided or Selective Pressure Database Management System
JP5552582B2 (en) Content search device
KR100926118B1 (en) Method on Providing Trademark Information
Safier Between Big Brother and the Bottom Line: Privacy in Cyberspace
Caid et al. Context vector-based text retrieval
US7257568B2 (en) Process and system for matching products and markets
Badhe et al. Vague set theory for profit pattern and decision making in uncertain data
Evans et al. CLARIT TREC design, experiments, and results
Ezeife et al. The use of smart tokens in cleaning integrated warehouse data
GB2366877A (en) A system for categorising and indexing documents
Han et al. Centroid-based document classification algorithms: Analysis & experimental results
Du et al. Identifying high-impact opioid products and key sellers in dark net marketplaces: An interpretable text analytics approach
Dieijen et al. What say they about their mandate? a textual assessment of federal reserve speeches
Shinde et al. Deceptive opinion spam detection using bidirectional long short-term memory with capsule neural network
GB2366008A (en) Document selection
Van den Poel et al. Purchase prediction in database marketing with the ProbRough system
Krishna et al. Novel approach to museums development & emergence of text mining

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)