WO2007062215A2 - Method, system and code for retrieving texts - Google Patents

Method, system and code for retrieving texts Download PDF

Info

Publication number
WO2007062215A2
Authority
WO
WIPO (PCT)
Prior art keywords
word
query
texts
words
collection
Prior art date
Application number
PCT/US2006/045397
Other languages
French (fr)
Other versions
WO2007062215A3 (en)
Inventor
Peter J. Dehlinger
Original Assignee
Word Data Corp.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Word Data Corp. filed Critical Word Data Corp.
Publication of WO2007062215A2 publication Critical patent/WO2007062215A2/en
Publication of WO2007062215A3 publication Critical patent/WO2007062215A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing

Definitions

  • the present invention relates to a method, system, and code for retrieving one or more texts from a collection of texts.
  • a user first classifies a text of interest according to a field or class that the text is likely to be located in. For example, in the legal field, one might initially specify a class of appellate cases relating to a specific area of the law or in a specific jurisdiction. In a technical or patent search, one might confine the search to a particular area of technology or patent class or subclass. This initial classification narrows the search to the areas of interest or most likely text matches.
  • whether or not the user first specifies a search subset, actual texts are found by matching words or word pairs in a search query with words or word pairs present in the text of interest.
  • in a Boolean search, the user inputs a small number of key words which may be joined by one or more Boolean operators.
  • the search may be either inclusive or ranked. The inclusive approach will find only those texts that meet all of the constraints of the search query. For example, if the search query is "(a or b) and (c or d)," the search will find only those documents containing at least one of a or b and at least one of c or d.
  • This type of search tends to be either overly limited, in which case relatively few texts may be found and many texts of potential interest may be missed, or overly inclusive, in which case many texts of only marginal interest may be retrieved.
  • the user may have to carry out a number of overly inclusive searches, then combine two or more of these initial queries in an effort to find a search-query "intersection" that is neither overly inclusive nor overly limited.
  • a search query is converted to a search vector in which each term, e.g., word, may be assigned an independent weighting factor, such as inverse document frequency, that provides a rough estimate of the word frequency or word importance in a collection of documents.
  • the search algorithm finds and ranks documents having the highest vector-term score.
  • the search-vector approach, though more flexible than an inclusive-type Boolean search, is nonetheless limited by the challenge of assigning weights to the query words in a meaningful way, and by the tendency of the retrieved texts to cluster around certain groups of query words.
  • the present invention includes a computer-assisted method for retrieving one or more selected texts from a collection of texts.
  • the method includes the steps of: (a) for each of a plurality of input query words Q, determining the conditional probabilities of finding that word, given each of the other query words, in a set of texts, (b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words, (c) for each query word, searching a word index of the collection of texts to locate texts containing that query word, (d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and (e) identifying those one or more texts having the highest summed weighting factors.
  • the collection of texts may be substantially overlapping with the set of texts, or may be merely representative of the set.
  • the collection of texts may include full-length documents, paragraphs from a library of documents, or statements associated with bibliographic citations in a library of citation-rich documents.
  • Step (a) in the method may include accessing a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, to determine, for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn.
  • Step (b) in the method may include, for each query word Wqm, calculating a coefficient related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
  • the method includes a computer-readable code for use with an electronic computer for retrieving one or more selected texts from a collection of texts.
  • the code is operable on the computer to carry out the above method steps (a)-(e).
  • a system for use in retrieving one or more selected texts from a collection of texts comprises: (1) a computer, (2) accessible by the computer, a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, (3) operatively connected to the computer, a user input device, and (4) computer-readable code that operates on the computer to: (a) access the word-affinity matrix to determine, for each of a plurality of input query words Q, the conditional probabilities of finding that word, given each of the other query words, in a set of texts, (b) use the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words, (c) for each query word, search a word index of the collection of texts to locate texts containing that query word, (d) assign to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and (e) identify those one or more texts having the highest summed weighting factors.
  • the invention includes a search vector representing a multi-word search query.
  • the vector has a plurality of vector terms, where each term contains a query word Wqm and a coefficient for that query word that is related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
  • Fig. 1 shows hardware and software components of the system of the invention
  • Fig. 2 shows the relationship between texts and database tables generated from the texts
  • Figs. 3A-3C show representative table entries for a document-ID table (3A), a text-ID table (3B) and a word index of text records (3C);
  • Fig. 4 shows a portion of a conditional-probability matrix used in the method of the invention
  • Fig. 5 gives an overview of the operation of the system of the invention in flow diagram form
  • Fig. 6 is a flow diagram of steps for generating a word index of text records in accordance with an embodiment of the invention
  • Fig. 7 is a flow diagram of steps for generating a conditional probability matrix of paired words in a set of texts.
  • Fig. 8 is a flow diagram of steps for retrieving texts based on a word query, in accordance with the invention.
  • a "text” refers to a digitally stored text, typically a natural-language text.
  • Typical texts include: scientific or technical articles or abstracts, and legal documents, such as case-law decisions, opinions, briefs, patent applications, and patents, and/or abstracts or claims thereof.
  • One exemplary type of text in the invention includes paragraphs contained in multi-paragraph documents and statements or phrases extracted from citation-rich documents.
  • a "collection of texts” refers to a library of texts, typically containing a large number of texts, e.g., hundreds to millions of texts, containing one or more texts of interest for retrieval purposes.
  • a "set of texts" refers to a library of texts that may be substantially the same as the collection of texts, i.e., substantially overlapping, or the set of texts may be substantially non-overlapping with the collection of texts, but is preferably made up of similar types of documents, or similar subject matter or breadth of subject matter.
  • the set of texts is used in determining conditional probabilities, which in turn are used in calculating search-vector coefficients, in accordance with an aspect of the invention.
  • a set of texts is typically a subset of a collection of texts.
  • the collection of texts is a library of texts that are being added to over time, and the set of texts, a subset of these that are fixed as of a given date.
  • the collection may include texts from several fields, e.g., technical fields, with the one or more sets being calculated from texts within a given field, as illustrated in the Example below.
  • a "search query" refers to a list of words, or a multiple-word sentence or sentence fragment, that is descriptive of the content of texts to be retrieved.
  • a “search vector” refers to a list of query words, each having a separately calculated coefficient.
  • a "verb-root word” is a word or phrase that has a verb root.
  • the word “light” or “lights” (the noun), "light” (the adjective), “lightly” (the adverb) and various forms of "light” (the verb), such as light, lighted, lighting, lit, lights, to light, has been lighted, etc., are all verb-root words with the same verb root form "light,” where the verb root form selected is typically the present-tense singular (infinitive) form of the verb.
  • "Generic words" refers to words in a natural-language passage that are not descriptive of, or only non-specifically descriptive of, the subject matter of the passage. Examples include prepositions, conjunctions, and pronouns, as well as certain nouns, verbs, adverbs, and adjectives that occur frequently in passages from many different fields. "Non-generic words" are those words in a passage remaining after generic words are removed.
  • a "text identifier" or "TID" identifies a particular digitally encoded or processed text, e.g., a document in a database of texts, e.g., by a text number, i.e., a computer-readable alphanumeric code.
  • a “database” refers to a database of tables containing information about texts and/or other text-related information.
  • a database typically includes two or more tables, each containing locators by which information in one table can be used to access information in another table or tables.
  • the probabilities of finding the word Wm and the word Wn in a set of texts are expressed as P(Wm) and P(Wn), respectively.
  • the probability of finding Wm in a text that also contains Wn is the conditional probability of finding Wm, given Wn, and is expressed as P(Wm|Wn).
  • P(Wm|Wn) is calculated as P(Wm*Wn)/P(Wn), where P(Wm*Wn) is the probability of finding both words in a text in the set, and P(Wn) is the probability of finding Wn alone in a text in the set.
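A numeric illustration of this definition may help (the counts below are invented for illustration, not taken from the patent's example):

```python
# Toy illustration of P(Wm|Wn) = P(Wm*Wn)/P(Wn), with invented counts:
# a set of 200 texts, of which 40 contain Wn and 10 contain both Wm and Wn.
total_texts = 200
texts_with_wn = 40
texts_with_both = 10

p_wn = texts_with_wn / total_texts        # P(Wn)    = 0.2
p_both = texts_with_both / total_texts    # P(Wm*Wn) = 0.05
p_m_given_n = p_both / p_wn               # P(Wm|Wn) = 0.25

# The total-text count cancels, so the same value follows directly
# from the raw counts: 10 / 40 = 0.25.
print(round(p_m_given_n, 6))  # 0.25
```

Note that the cancellation of the total-text count is what lets the later counting algorithm work entirely from TID lists.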
  • Fig. 1 shows the basic components of a system 40 for use in retrieving one or more stored texts.
  • a computer or processor 42 in the system may be a standalone computer or a central computer or server that communicates with a user's personal computer.
  • the computer has an input device 44, such as a keyboard, modem, and/or disc reader, by which the user can enter query words or a query statement, as will be described.
  • a display or monitor 46 displays a user interface by which the user communicates with the system in retrieving one or more texts.
  • Computer 42 in the system is typically one of many user terminal computers, each of which communicates with a central server or processor 41 on which the main program activity in the system takes place.
  • a database in the system, typically running on processor 41, includes a document-ID table 48, a text-ID table 50, and a word index of text records 52, as will be described below with respect to Figs. 3A-3C, respectively.
  • the word-index of text records is used in generating a conditional-probability matrix 54, as described below with respect to Figs. 4 and 7.
  • the database also includes a database tool that operates on the server to access and act on information contained in the database tables, in accordance with the program steps described below.
  • One exemplary database tool is MySQL database tool, which can be accessed at www.mysql.com.
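A schema along these lines could be declared in any relational database tool such as the one mentioned; here is a sketch using Python's built-in sqlite3 module (all table and column names are invented, since the patent does not specify a schema):

```python
import sqlite3

# Hypothetical schema for the document-ID table (48), text-ID table (50),
# and word index (52); table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doc_id_table  (did TEXT, tid TEXT);           -- table 48
CREATE TABLE text_id_table (tid TEXT PRIMARY KEY,
                            paragraph TEXT, did TEXT);     -- table 50
CREATE TABLE word_index    (word TEXT, tid TEXT);          -- table 52
CREATE INDEX idx_word ON word_index (word);
""")
conn.execute("INSERT INTO word_index VALUES ('stent', 'TID1')")
rows = conn.execute(
    "SELECT tid FROM word_index WHERE word = 'stent'").fetchall()
print(rows)  # [('TID1',)]
```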
  • Fig. 2 shows in flow diagram form, the relationship between text-based documents, paragraph texts contained in the documents, and various database tables constructed from the texts.
  • the collection of documents refers to a library or collection of digitally stored documents, such as scientific or technical articles, and legal documents, such as case-law decisions, opinions, briefs, patent applications, and patents. Each document is assigned a document identifier or document-ID or DID. In the embodiment illustrated, the documents are processed to yield the paragraphs making up the documents.
  • These paragraphs form a collection of texts 62, each with an assigned text identifier (TID), such that some known group of TIDs is associated with each document identifier (DID). It will be understood that the texts themselves may be whole documents, paragraphs making up the documents, or statements or sentences within paragraphs.
  • the DIDs and associated TIDs are assembled into document-ID table 48.
  • the key locator for this table is a DID, such as DID1.
  • the information stored with each DID includes the associated TIDs, such as TIDa, TIDb, ... TIDn.
  • each TID and its associated paragraph text and DID are assembled into text-ID table 50. Representative table entries for this database table are shown in Fig. 3B.
  • the key locator for this database table is a TID, such as TID1.
  • the information stored with each TID is the associated paragraph text, e.g., text1, and the document ID, e.g., DID1.
  • tables 48, 50 allow a system user to reconstruct selected portions of any document. For example, to generate a string of paragraphs containing a selected paragraph, the user can first access table 50 to identify the DID associated with the selected paragraph TID. Then, going to table 48, the user can readily identify a string of paragraphs in that document, identified by TIDs, that contain the selected paragraph at a known string position.
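The two-table lookup just described can be sketched as follows (a minimal in-memory stand-in; the TIDs, DIDs, and text are invented):

```python
# Hypothetical in-memory stand-ins for the document-ID table (48)
# and the text-ID table (50). Keys and values are invented.
doc_id_table = {               # DID -> ordered list of TIDs in that document
    "DID1": ["TID_a", "TID_b", "TID_c", "TID_d"],
}
text_id_table = {              # TID -> (paragraph text, parent DID)
    "TID_b": ("An implantable stent ...", "DID1"),
}

def surrounding_paragraphs(tid, window=1):
    """Return the TIDs of the paragraphs around `tid` in its own document."""
    _, did = text_id_table[tid]            # table 50: TID -> DID
    tids = doc_id_table[did]               # table 48: DID -> TID string
    i = tids.index(tid)
    return tids[max(0, i - window): i + window + 1]

print(surrounding_paragraphs("TID_b"))  # ['TID_a', 'TID_b', 'TID_c']
```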
  • the non-generic words in the paragraph texts are extracted at 62 and the associated TIDs are processed, as will be described with reference to Fig. 6, to generate a word index of text records 52.
  • This table contains a list of all of the non-generic words (the key locators) contained in the paragraph records, and for each word, a list of all TIDs containing that word (the text record for each word), as shown in Fig. 3C.
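A word index of this kind is, in effect, an inverted index; a minimal sketch follows (the generic-word list and sample texts are invented for illustration):

```python
# Invented stop-word list; the patent leaves the generic-word set open.
GENERIC = {"a", "an", "the", "of", "and", "is", "in", "for"}

def build_word_index(texts):
    """texts: dict TID -> text. Returns word -> sorted list of TIDs."""
    index = {}
    for tid, text in texts.items():
        words = {w for w in text.lower().split() if w not in GENERIC}
        for w in words:
            index.setdefault(w, set()).add(tid)
    return {w: sorted(tids) for w, tids in index.items()}

texts = {
    "TID1": "an implantable vascular stent",
    "TID2": "the stent is expandable",
}
index = build_word_index(texts)
print(index["stent"])        # ['TID1', 'TID2']
print(index["expandable"])   # ['TID2']
```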
  • word2 is contained in records TIDm, TIDn
  • the word index just described, which is generated from the collection of texts to be searched, may be used in generating a conditional-probability matrix 54, according to the method described below with reference to Fig. 7.
  • the values for the conditional probability matrix are generated from a set of texts other than the collection of texts
  • the set is first used to generate an index of word records, as just described.
  • the collection of texts to be searched is a library of texts that expands over time, as new texts appropriate for that collection become available, whereas the set of texts used in generating conditional-probability values is fixed in time and typically represents some subset of the collection of documents. Accordingly, the conditional probabilities, once calculated, do not have to be recalculated as new texts become available, or can be recalculated only occasionally.
  • the conditional-probability matrix, a portion of which is shown at 54 in Fig. 4, is an N x N matrix of N row words, such as words W1, W2, W3, where the value of each matrix entry for a pair of words Wm, Wn is the conditional probability of finding word Wm, given word Wn, in the set of texts used in generating the probability values.
  • P(Wm|Wn) is defined as P(Wm*Wn)/P(Wn), that is, the probability of finding both words Wm and Wn in the same text in the set, divided by the probability of finding Wn alone. As will be described further below, this can be calculated as the number of texts in the set containing both Wm and Wn, divided by the total text occurrence of Wn in the same set of texts.
  • the diagonal entries P(Wn|Wn) are set to zero.
  • the matrix is non-symmetrical, meaning that P(Wm|Wn) = P(Wm*Wn)/P(Wn) is typically not equal to P(Wn|Wm) = P(Wm*Wn)/P(Wm).
  • Fig. 5 is an overview of the method of the invention for retrieving a text contained in a collection of texts.
  • the user enters a search query (box 60) consisting of a plurality of query words Q.
  • the query may be a simple list of query words, or a phrase or sentence, or even a full paragraph.
  • the program processes the query to remove generic words, and may further process the query to convert verb-root words into their common roots, and to standardize verb tense and plural endings, as will be described below.
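Query preprocessing of this sort might look like the sketch below (the generic-word list and verb-root map are illustrative assumptions; the patent leaves their exact contents open):

```python
# Invented generic-word list and verb-root map, for illustration only.
GENERIC = {"a", "an", "the", "of", "and", "or", "to", "is", "for"}
VERB_ROOTS = {"lighted": "light", "lighting": "light",
              "lit": "light", "lights": "light"}

def normalize_query(query):
    """Remove generic words and map verb forms to their common root."""
    words = []
    for w in query.lower().split():
        if w in GENERIC:
            continue
        words.append(VERB_ROOTS.get(w, w))
    return words

print(normalize_query("the lamp is lighted for reading"))
# ['lamp', 'light', 'reading']
```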
  • for each query word Wqm, e.g., Wq1, the program accesses conditional-probability matrix 54 (box 62) to retrieve the matrix row for that word.
  • the retrieved matrix row is then scanned to find the conditional-probability values P(Wqm|Wqn) for each of the other query words Wqn.
  • the program then calculates query-word coefficients that tend to equalize the probabilities of finding any one query word, given all other query words. In one embodiment, indicated at 68, this is done by first calculating, for each query word Wqm, the sum of all conditional-probability values P(Wqm|Wqn), for all other query words Wqn, and then taking the inverse of that sum.
  • the coefficient of Wqm is calculated to offset the contribution that query word is expected to make to the total search score, based solely on expected conditional probabilities with all other query words.
  • the coefficients assigned are the actual inverse sum values calculated as above.
  • the coefficients are related to this value, e.g., are calculated as the square or cube root of these inverse sum values.
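The coefficient calculation described in the preceding bullets can be sketched as follows (query words and probability values are invented; a `root` parameter of 2 or 3 corresponds to the square- or cube-root variants just mentioned):

```python
def query_word_coefficients(query_words, cond_prob, root=1.0):
    """cond_prob[(wm, wn)] = P(wm|wn). The coefficient of each query word
    is the inverse of the sum of P(word|other) over the other query words,
    optionally attenuated by taking a root (root=2 -> square root)."""
    coeffs = {}
    for w in query_words:
        s = sum(cond_prob.get((w, other), 0.0)
                for other in query_words if other != w)
        coeffs[w] = (1.0 / s) ** (1.0 / root) if s > 0 else 1.0
    return coeffs

# Invented probabilities: "data" co-occurs heavily, "wireless" rarely.
cp = {("data", "memory"): 0.5, ("data", "wireless"): 0.3,
      ("memory", "data"): 0.4, ("memory", "wireless"): 0.1,
      ("wireless", "data"): 0.05, ("wireless", "memory"): 0.05}
coeffs = query_word_coefficients(["data", "memory", "wireless"], cp)
# "wireless" gets the largest coefficient: 1/(0.05 + 0.05) = 10
print(round(coeffs["wireless"], 6))  # 10.0
```

The self-aggregating word "data" ends up with the smallest coefficient, matching the behavior the Example section reports.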
  • a search vector is then constructed, at 74.
  • Each term of the vector is one of the query words, and the coefficient for that term is the associated coefficient. That is, the search vector may be represented as C1Wq1, C2Wq2, ... CnWqn.
  • To retrieve one or more records of interest, the program initializes the term (query word) index in the vector to 1, at 78, and finds all texts in the collection of texts containing that query word Wq1. This is done by accessing the word index of records 52 to identify all texts containing word Wq1, as shown at box 76. At each step, all TIDs containing the word are assigned a text-score value equal to the word coefficient (which adjusts the text-score value for the expected probability of finding that word in a text). After this search step is completed for all query words, through the logic of 80, 82, the corrected word scores for each TID are summed to yield a total text-value score for each text, and the TIDs are ranked by total score, at 84. The top-ranked TIDs are the one or more TIDs of interest.
  • the search method is detailed below with reference to Fig. 8.
  • the program uses non-generic words contained in the collection of texts to generate a word-records table 52.
  • This table is essentially a dictionary of non-generic words, where each word has associated with it, each TID containing that word, and optionally, for each TID, an associated DID.
  • In forming the word-records file, and with reference to Fig. 6, the program creates an empty ordered list 52.
  • the program accesses the collection of texts, e.g., a collection of individual paragraphs from a library of documents, to retrieve the text and associated identifier(s) (e.g., TID and DID) for that text.
  • the text is processed to (i) remove generic words, and (ii) simplify verb-root words to their common root, yielding a list of query words which will be used in the search.
  • the program may optionally add synonyms, such as obtained from a separate synonym dictionary, to one or more of the query words.
  • conditional-probability matrix 54 is an N x N matrix of all N non-generic words in a set of texts, where the value of each matrix entry for a pair of words Wm, Wn is the conditional probability of finding word Wm, given word Wn, in the set of texts used in generating the probability values.
  • P(Wm|Wn) can be expressed as P(Wm*Wn)/P(Wn), where P(Wm*Wn) is the probability of finding both words in a text in the set and P(Wn) is the probability of finding Wn alone in a text in the set.
  • the probability of finding both words in the set is the number of texts containing both words, divided by the total number of texts in the set.
  • the probability P(Wn) of finding Wn in the set is the number of texts containing that word, divided by the total number of texts in the set. Therefore, canceling the common factor in both numerator and denominator (the total number of texts in the set), the value for P(Wm|Wn) can be calculated as the number of texts containing both Wm and Wn, divided by the number of texts containing Wn.
  • An algorithm for calculating these values for all pairs of Wm, Wn is given in Fig. 7.
  • Wi and Wj indicate the word pair for the conditional probability P(Wi|Wj).
  • the program selects word Wi, at 112, then retrieves all TIDs for that word, at 114, by accessing word index 52, and these TIDs are stored at 135.
  • the first word Wj is selected at 118.
  • the program advances, through the logic of 120, 122, to the next Wj, and accesses word index 52 to retrieve all TIDs for that word, at 124.
  • the retrieved TIDs for Wj are now used to determine the number of texts containing word Wj, at 126, simply by counting the number of TIDs retrieved for that word.
  • the TIDs retrieved for Wi are compared with those retrieved for Wj, to find the total number of TIDs containing both words, i.e., the number of TIDs common to both words.
  • P(Wi|Wj) is calculated as the total number of TIDs containing both words, divided by the total number of TIDs containing Wj, at box 130. This value is then added to the empty or partially filled matrix 54.
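The counting procedure described for Fig. 7 amounts to set intersections over the word index; a sketch follows (the toy index is invented; integer TIDs stand in for text identifiers):

```python
def conditional_prob_matrix(word_index):
    """word_index: word -> set of TIDs containing it.
    Returns {(wi, wj): P(wi|wj)} = |TIDs(wi) & TIDs(wj)| / |TIDs(wj)|,
    with diagonal entries set to zero, as the patent describes."""
    matrix = {}
    for wi, tids_i in word_index.items():
        for wj, tids_j in word_index.items():
            if wi == wj:
                matrix[(wi, wj)] = 0.0
            elif tids_j:
                matrix[(wi, wj)] = len(tids_i & tids_j) / len(tids_j)
    return matrix

index = {"stent": {1, 2, 3, 4}, "vascular": {3, 4}, "memory": {5}}
m = conditional_prob_matrix(index)
print(m[("vascular", "stent")])  # 2/4 = 0.5
print(m[("stent", "vascular")])  # 2/2 = 1.0: the matrix is non-symmetrical
```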
  • the text searching module in the system operates to find texts having the greatest term overlap with the search vector terms, where the value of each vector term is weighted by the associated term coefficient.
  • An empty ordered list of TIDs, shown at 142 in the figure, will store the accumulating match-score values for each TID associated with the vector terms.
  • the program initializes the vector term (query word) index to 1, in box 144, and retrieves the first query word from query-word list 140. The TIDs associated with this word are then retrieved, at 146, from the word index 52. With the TID count set at 1 (box 150), the program gets the first retrieved TID, and asks, at 152: Is this TID already present in list 142?
  • the TID and the term coefficient are added to list 142, as indicated at 143, creating the first coefficient of one or more coefficients which may accumulate for that TID.
  • the program also orders the TIDs in list 142 numerically, to facilitate searching for TIDs in the list.
  • if the TID is already present in the list, the coefficient is added to the summed coefficients for that TID, as indicated at 154. This process is repeated, through the logic of 156, 158, until all of the TIDs for a given word have been considered and added to list 142.
  • Each word in the search vector is processed in this way, through the logic of 160, 162, until each of the vector terms (query words) has been considered.
  • List 142 now consists of an ordered list of TIDs, each with an accumulated match score representing the sum of coefficients of terms associated with that TID.
  • These TIDs are then ranked at 164, according to a standard ordering algorithm, to yield an output of the top N match scores, e.g., the 10 or 20 highest-ranked texts, identified by TID.
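The accumulate-and-rank procedure of Fig. 8 reduces to summing, for each TID, the coefficients of the query words it contains; a sketch (search-vector coefficients and index contents are invented):

```python
from collections import defaultdict

def rank_texts(search_vector, word_index, top_n=10):
    """search_vector: word -> coefficient; word_index: word -> set of TIDs.
    Returns the top_n (TID, score) pairs, ranked by summed coefficients."""
    scores = defaultdict(float)
    for word, coeff in search_vector.items():
        for tid in word_index.get(word, ()):
            scores[tid] += coeff          # accumulate, as in list 142
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

vector = {"stent": 2.0, "vascular": 5.0, "memory": 1.0}
index = {"stent": {1, 2}, "vascular": {2, 3}, "memory": {4}}
print(rank_texts(vector, index, top_n=2))  # [(2, 7.0), (3, 5.0)]
```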
  • the program calculates, for each term (query word), a coefficient that offsets the tendency of certain groups of query words to "self-aggregate" in the retrieved documents, due to high internal conditional probabilities among those words. This results in the highest-ranking texts having a broader representation of all of the query words, because the score calculated for each text will be skewed more toward those query words that have low conditional probabilities with respect to the other query words.
  • Word query coefficient calculations. A set of all patent abstracts from the U.S. patent database, 1975-2002, from the following Surgery-related classes (U.S. patent classification) was used as the set of texts in this first example: classes 128, 351, 378, 433, 600, 601, 602, 604, 606, 623. A word index and conditional-probability matrix were generated as above.
  • a word query in the surgical class contained the words: implantable, stent, insertable, sleeve, region, method, vascular, constriction, and expandable.
  • the conditional probabilities of that word with respect to the other eight query words were determined from the probability matrix. These values are given in Table I below, along with the calculated coefficient for each word.
  • the conditional probabilities of that word with respect to the other nine query words were determined from the probability matrix. These values are given in Table 2 below, along with the calculated coefficient for each word.
  • more generic query words such as "communicate," "memory," "controller," and "data," all of which would be expected to self-aggregate, have relatively low-valued coefficients, while more distinctive words, such as "wireless," "architecture," and "virtual," which would be expected to be less self-aggregating, have substantially higher coefficients.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer-assisted method, code, and system for use in retrieving one or more selected texts from a collection of texts, are disclosed. The method employs a word-affinity matrix for use in constructing a search vector composed of a plurality of vector terms, each term containing a query word and a coefficient for that query word related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn, where P(Wm|Wn) is the conditional probability of finding word Wm in a text containing word Wn, within a collection of texts.

Description

METHOD, SYSTEM AND CODE FOR RETRIEVING TEXTS
Field of the Invention
The present invention relates to a method, system, and code for retrieving one or more texts from a collection of texts.
Background of the Invention
There are a number of information-retrieval methods available for accessing digitally processed texts. In one general approach, a user first classifies a text of interest according to a field or class that the text is likely to be located in. For example, in the legal field, one might initially specify a class of appellate cases relating to a specific area of the law or in a specific jurisdiction. In a technical or patent search, one might confine the search to a particular area of technology or patent class or subclass. This initial classification narrows the search to the areas of interest or most likely text matches.
Whether or not the user first specifies a search subset, actual texts are found by matching words or word pairs in a search query with words or word pairs present in the text of interest. In one word-matching algorithm, known familiarly as a Boolean search, the user inputs a small number of key words which may be joined by one or more Boolean operators. The search may be either inclusive or ranked. The inclusive approach will find only those texts that meet all of the constraints of the search query. For example, if the search query is "(a or b) and (c or d)," the search will find only those documents containing at least one of a or b and at least one of c or d. This type of search tends to be either overly limited, in which case relatively few texts may be found and many texts of potential interest may be missed, or overly inclusive, in which case many texts of only marginal interest may be retrieved. In practice, the user may have to carry out a number of overly inclusive searches, then combine two or more of these initial queries in an effort to find a search-query "intersection" that is neither overly inclusive nor overly limited.
In the approach based on document ranking, a search query is converted to a search vector in which each term, e.g., word, may be assigned an independent weighting factor, such as inverse document frequency, that provides a rough estimate of the word frequency or word importance in a collection of documents. The search algorithm then finds and ranks documents having the highest vector-term score. The search-vector approach, though more flexible than an inclusive-type Boolean search, is nonetheless limited by the challenge of assigning weights to the query words in a meaningful way, and by the tendency of the retrieved texts to cluster around certain groups of query words.
Summary of the Invention
In one aspect, the present invention includes a computer-assisted method for retrieving one or more selected texts from a collection of texts. The method includes the steps of: (a) for each of a plurality of input query words Q, determining the conditional probabilities of finding that word, given each of the other query words, in a set of texts, (b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words, (c) for each query word, searching a word index of the collection of texts to locate texts containing that query word, (d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and (e) identifying those one or more texts having the highest summed weighting factors.
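Steps (a)-(e) above can be sketched end to end as follows (an illustrative Python sketch assuming simple in-memory data structures; the patent does not prescribe an implementation, and the probability values and index contents below are invented):

```python
def retrieve(query_words, cond_prob, word_index, top_n=5):
    """Sketch of steps (a)-(e): compute per-word coefficients from
    conditional probabilities, then score and rank texts."""
    # (a)-(b): coefficient = inverse of the summed conditional
    # probabilities of this word, given each other query word
    coeffs = {}
    for w in query_words:
        s = sum(cond_prob.get((w, o), 0.0) for o in query_words if o != w)
        coeffs[w] = 1.0 / s if s > 0 else 1.0
    # (c)-(d): locate texts containing each word, weight by coefficient
    scores = {}
    for w, c in coeffs.items():
        for tid in word_index.get(w, ()):
            scores[tid] = scores.get(tid, 0.0) + c
    # (e): texts with the highest summed weighting factors first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

cp = {("stent", "vascular"): 0.5, ("vascular", "stent"): 0.25}
index = {"stent": {1, 2}, "vascular": {2}}
print(retrieve(["stent", "vascular"], cp, index))  # [2, 1]
```

Text 2 wins because it contains both query words, and the rarer pairing ("vascular", with the lower conditional probability) contributes the larger coefficient.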
The collection of texts may be substantially overlapping with the set of texts, or may be merely representative of the set. The collection of texts may include full-length documents, paragraphs from a library of documents, or statements associated with bibliographic citations in a library of citation-rich documents.
Step (a) in the method may include accessing a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, to determine for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn. Step (b) in the method may include, for each query word Wqm, calculating a coefficient related the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
In another aspect, the method includes a computer-readable code for use with an electronic computer for retrieving one or more selected texts from a collection of texts. The code is operable on the computer to carry out the above method steps (a)-(e).
Also disclosed is a system for use in retrieving one or more selected texts from a collection of texts, the system comprising: (1) a computer, (2) accessible by the computer, a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, (3) operatively connected to the computer, a user input device, and (4) computer-readable code that operates on the computer to: (a) access the word-affinity matrix to determine, for each of a plurality of input query words Q, the conditional probabilities of finding that word, given each of the other query words, in a set of texts, (b) use the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words, (c) for each query word, search a word index of the collection of texts to locate texts containing that query word, (d) assign to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and (e) identify those one or more texts having the highest summed weighting factors.
In still another aspect, the invention includes a search vector representing a multi-word search query. The vector has a plurality of vector terms, where each term contains a query word Wqm and a coefficient for that query word that is related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn, where P(Wm|Wn) is the conditional probability of finding word Wm in a text containing word Wn, within a collection of texts. These and other objects and features of the invention will become more fully apparent when the following detailed description of the invention is read in conjunction with the accompanying drawings. Brief Description of the Drawings
Fig. 1 shows hardware and software components of the system of the invention; Fig. 2 shows the relationship between texts and database tables generated from the texts;
Figs. 3A-3C show representative table entries for a document-ID table (3A), a text-ID table (3B) and a word index of text records (3C);
Fig. 4 shows a portion of a conditional-probability matrix used in the method of the invention;
Fig. 5 gives an overview of the operation of the system of the invention in flow diagram form;
Fig. 6 is a flow diagram of steps for generating a word index of text records in accordance with an embodiment of the invention; Fig. 7 is a flow diagram of steps for generating a conditional probability matrix of paired words in a set of texts; and
Fig. 8 is a flow diagram of steps for retrieving texts based on a word query, in accordance with the invention.
Detailed Description of the Invention A. Definitions
A "text" refers to a digitally stored text, typically a natural-language text. Typical texts include: scientific or technical articles or abstracts; legal documents, such as case-law decisions, opinions, and briefs; patent applications and patents; and/or abstracts or claims thereof. One exemplary type of text in the invention includes paragraphs contained in multi-paragraph documents and statements or phrases extracted from citation-rich documents.
A "collection of texts" refers to a library of texts, typically containing a large number of texts, e.g., hundreds to millions of texts, containing one or more texts of interest for retrieval purposes.
A "set of texts" refers to a library of texts that may be substantially the same as the collection of texts, i.e., substantially overlapping, or may be substantially non-overlapping with the collection of texts, but is preferably made up of similar types of documents, or of similar subject matter or breadth of subject matter. The set of texts is used in determining conditional probabilities, which in turn are used in calculating search-vector coefficients, in accordance with an aspect of the invention. A set of texts is typically a subset of a collection of texts. In one embodiment, the collection of texts is a library of texts that is being added to over time, and the set of texts is a subset of these that is fixed as of a given date. In another embodiment, the collection may include texts from several fields, e.g., technical fields, with the one or more sets being calculated from texts within a given field, as illustrated in the Example below.
A "search query" refers to a list of words, or a multiple-word sentence or sentence fragment, that is descriptive of the content of texts to be retrieved.
A "search vector" refers to a list of query words, each having a separately calculated coefficient. A "verb-root word" is a word or phrase that has a verb root. Thus, the word "light" or "lights" (the noun), "light" (the adjective), "lightly" (the adverb) and various forms of "light" (the verb), such as light, lighted, lighting, lit, lights, to light, has been lighted, etc., are all verb-root words with the same verb root form "light," where the verb root form selected is typically the present-tense singular (infinitive) form of the verb.
"Generic words" refers to words in a natural-language passage that are not descriptive of, or only non-specifically descriptive of, the subject matter of the passage. Examples include prepositions, conjunctions, pronouns, as well as certain nouns, verbs, adverbs, and adjectives that occur frequently in passages from many different fields. "Non-generic words" are those words in a passage remaining after generic words are removed.
A "text identifier" or "TID" identifies a particular digitally encoded or processed text, e.g., a document in a database of texts, e.g., by a text number, i.e., a computer-readable alphanumeric code. A "database" refers to a database of tables containing information about texts and/or other text-related information. A database typically includes two or more tables, each containing locators by which information in one table can be used to access information in another table or tables.
The probabilities of finding the word Wm and the word Wn in a set of texts are expressed as P(Wm) and P(Wn), respectively. The probability of finding Wm in a text that also contains Wn is the conditional probability of finding Wm, given Wn, and is expressed as P(Wm|Wn). The conditional probability P(Wm|Wn) is calculated as P(Wm*Wn)/P(Wn), where P(Wm*Wn) is the probability of finding both words in a text in the set, and P(Wn) is the probability of finding Wn alone in a text in the set.
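The definition above can be illustrated numerically. The following is a minimal sketch, not part of the original disclosure; the toy texts, words, and function name are invented for the example, with each text modeled as a set of words.

```python
# Toy set of texts, each represented as the set of words it contains.
texts = [{"stent", "vascular"}, {"stent", "method"},
         {"vascular", "method"}, {"stent", "vascular", "method"}]

def cond_prob(wm, wn, texts):
    """P(Wm|Wn) = P(Wm*Wn)/P(Wn), computed from text counts: the number of
    texts containing both words divided by the number containing Wn."""
    n_both = sum(1 for t in texts if wm in t and wn in t)
    n_wn = sum(1 for t in texts if wn in t)
    return n_both / n_wn if n_wn else 0.0

# "stent" appears in 2 of the 3 texts that contain "vascular".
print(cond_prob("stent", "vascular", texts))  # 0.666...
```

Note that the shared factor (the total number of texts) cancels, so only the two counts are needed.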
B. System components
Fig. 1 shows the basic components of a system 40 for use in retrieving one or more stored texts. A computer or processor 42 in the system may be a standalone computer or a central computer or server that communicates with a user's personal computer. The computer has an input device 44, such as a keyboard, modem, and/or disc reader, by which the user can enter query words or a query statement, as will be described. A display or monitor 46 displays a user interface by which the user communicates with the system in retrieving one or more texts. Computer 42 in the system is typically one of many user terminal computers, each of which communicates with a central server or processor 41 on which the main program activity in the system takes place.
A database in the system, typically run on processor 41, includes a document-ID table 48, a text-ID table 50, and a word index of text records 52, as will be described below with respect to Figs. 3A-3C, respectively. The word index of text records is used in generating a conditional-probability matrix 54, as described below with respect to Figs. 4 and 7. The database also includes a database tool that operates on the server to access and act on information contained in the database tables, in accordance with the program steps described below. One exemplary database tool is the MySQL database tool, which can be accessed at www.mysql.com.
It will be appreciated that the assignment of various stored documents, databases, database tools and search modules, to be detailed below, to a user computer or a central server or central processing station is made on the basis of computer storage capacity and speed of operations, but may be modified without altering the basic functions and operations to be described.
Fig. 2 shows, in flow diagram form, the relationship between text-based documents, paragraph texts contained in the documents, and various database tables constructed from the texts. The collection of documents (box 60) refers to a library or collection of digitally stored documents, such as scientific or technical articles, or legal documents such as case-law decisions, opinions, briefs, patent applications, and patents. Each document is assigned a document identifier or document-ID or DID. In the embodiment illustrated, the documents are processed to yield the paragraphs making up the documents. These paragraphs form a collection of texts 62, each with an assigned text identifier (TID), such that some known group of TIDs is associated with each document identifier (DID). It will be understood that the texts themselves may be whole documents, paragraphs making up the documents, or statements or sentences within paragraphs.
The DIDs and associated TIDs are assembled into document-ID table 48. As seen in Fig. 3A, which shows representative table entries, the key locator for this table is a DID, such as DID1, and the information stored with each DID includes the associated TIDs, such as TIDa, TIDb, ... TIDn. Also as indicated in Fig. 2, each TID and its associated paragraph text and DID are assembled into text-ID table 50. Representative table entries for this database table are shown in Fig. 3B. The key locator for this database table is a TID, such as TID1, and the information stored with each TID is the associated paragraph text, e.g., text1, and the document ID, e.g., DID1, from which the paragraph was extracted. It can be appreciated that tables 48, 50 allow a system user to reconstruct selected portions of any document. For example, to generate a string of paragraphs containing a selected paragraph, the user can first access table 50 to identify the DID associated with the selected paragraph TID. Then, going to table 48, the user can readily identify a string of paragraphs in that document, identified by TIDs, that contains the selected paragraph at a known string position.
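The two-table lookup just described can be sketched as follows. This is an illustrative reconstruction only; the dict-based tables, identifiers, and paragraph strings are invented, and a real system would hold these in database tables (e.g., MySQL) rather than in memory.

```python
# Table 48 (Fig. 3A): DID -> ordered list of TIDs in that document.
doc_table = {"DID1": ["TIDa", "TIDb", "TIDc"]}

# Table 50 (Fig. 3B): TID -> (paragraph text, DID it came from).
text_table = {"TIDa": ("first paragraph ...", "DID1"),
              "TIDb": ("second paragraph ...", "DID1"),
              "TIDc": ("third paragraph ...", "DID1")}

def surrounding_paragraphs(tid, window=1):
    """Reconstruct a string of paragraphs around a selected paragraph:
    table 50 gives the DID, table 48 gives the TID string, and the
    selected TID's position fixes the window of neighbors."""
    _, did = text_table[tid]
    tids = doc_table[did]
    i = tids.index(tid)
    return [text_table[t][0] for t in tids[max(0, i - window):i + window + 1]]

print(surrounding_paragraphs("TIDb"))  # all three paragraphs, in order
```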
With reference again to Fig. 2, the non-generic words in the paragraph texts are extracted at 62 and the associated TIDs are processed, as will be described with reference to Fig. 6, to generate a word index of text records 52. This table contains a list of all of the non-generic words (the key locators) contained in the paragraph records and, for each word, a list of all TIDs containing that word (the text record for each word), as shown in Fig. 3C. Thus, in the table entries shown in this figure, word2 is contained in records TIDm, TIDn, ... TIDy.
The word index just described, which is generated from the collection of texts to be searched, may be used in generating a conditional-probability matrix 54, according to the method described below with reference to Fig. 7. Alternatively, where the values for the conditional-probability matrix are generated from a set of texts other than the collection of texts, the set is first used to generate an index of word records, as just described. In one general embodiment, the collection of texts to be searched is a library of texts that expands over time, as new texts appropriate for that collection become available, whereas the set of texts used in generating conditional-probability values is fixed in time, typically representing some subset of the collection of documents. Accordingly, the conditional probabilities, once calculated, do not have to be recalculated as new texts become available, or need be recalculated only occasionally.
The conditional-probability matrix, a portion of which is shown at 54 in Fig. 4, is an N x N matrix of N row words, such as words W1, W2, W3, where the value of each matrix entry for a pair of words Wm, Wn is the conditional probability of finding word Wm, given word Wn, in the set of texts used in generating the probability values. The conditional probability P(Wm|Wn) is defined as P(Wm*Wn)/P(Wn), that is, the probability of finding both words Wm and Wn in the same text in the set divided by the probability of finding Wn alone. As will be described further below, this can be calculated as the number of texts in the set containing both Wm and Wn, divided by the total text occurrence of Wn in the same set of texts. Typically, the diagonal values P(Wn|Wn) are set to zero. The matrix is non-symmetrical, meaning that P(Wm|Wn) is typically not equal to P(Wn|Wm). C. Overview of system operation
Fig. 5 is an overview of the method of the invention for retrieving a text contained in a collection of texts. Initially, the user enters a search query (box 60) consisting of a plurality of query words Q. The query may be a simple list of query words, or a phrase or sentence, or even a full paragraph. Although not shown here, the program processes the query to remove generic words, and may further process the query to convert verb-root words into their common roots and standardize verb tense and plural endings, as will be described below. The net result is to generate a group of non-generic query words, indicated as Wqm, where m = 1 to N, the total number of query words in the search. With the counter for m set to 1, in box 64, the program accesses conditional-probability matrix 54 (box 62) to retrieve the matrix row for Wqm, e.g., Wq1. For example, with respect to Fig. 4, if Wq1 is W3 in Fig. 4, the program will retrieve the row for W3, containing the conditional probabilities P(W3|W1), P(W3|W2), P(W3|W3), ... P(W3|WN). The retrieved matrix row is then scanned to find the conditional-probability values P(Wqm|Wqn), for all query words m≠n, where Wqm is the query word of interest, e.g., W3, and Wqn are all other query words. This operation is indicated at 66.
In accordance with the method, it is desired to calculate query-word coefficients that tend to equalize the probabilities of finding any one query word, given all other query words. In one embodiment, indicated at 68, this is done by first calculating, for each query word Wqm, the sum of all conditional-probability values P(Wqm|Wqn), m≠n. This value represents the total probability of finding Wqm, given each of the other query words Wqn. By setting the coefficient of Wqm equal to the inverse of this sum, or to a value related to the inverse of the sum, the tendency of retrieved texts to cluster around a group of query words with high internal conditional probabilities will be offset by a correspondingly low coefficient value assigned to each of the "clustering" query words. Viewed another way, the coefficient assigned to each query word is calculated to offset the contribution that query word is expected to make to the total search score, based solely on expected conditional probabilities with all other query words. These properties can be appreciated from the coefficients assigned to the two different word queries illustrated in the Example below.
This process is repeated, through the logic of 70, 72, until all query words have been considered and assigned a coefficient. It will be appreciated that certain operations in this loop, such as fetching a matrix row for each query word (box 62) may be carried out simultaneously for all query words, to make the program operation more efficient. In a preferred embodiment, the coefficients assigned are the actual inverse sum values calculated as above. In another embodiment, the coefficients are related to this value, e.g., are calculated as the square or cube root of these inverse sum values.
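The coefficient computation described above can be sketched in a few lines. This is an illustrative reconstruction, not the patented implementation; the affinity values, word list, and function name are invented for the example, and the affinity mapping stands in for the matrix rows fetched at box 62.

```python
def query_word_coefficients(query_words, affinity):
    """For each query word Wqm, sum P(Wqm|Wqn) over all other query words
    Wqn and take the inverse of the sum as the coefficient.
    affinity[(wm, wn)] -> P(Wm|Wn); missing pairs default to 0."""
    coeffs = {}
    for wm in query_words:
        total = sum(affinity.get((wm, wn), 0.0)
                    for wn in query_words if wn != wm)
        coeffs[wm] = 1.0 / total if total else 0.0
    return coeffs

# Invented conditional probabilities for three query words.
affinity = {("stent", "vascular"): 0.4, ("vascular", "stent"): 0.3,
            ("stent", "method"): 0.1, ("method", "stent"): 0.2,
            ("vascular", "method"): 0.05, ("method", "vascular"): 0.1}
print(query_word_coefficients(["stent", "vascular", "method"], affinity))
```

Here "stent" gets coefficient 1/(0.4+0.1) = 2.0; a word with higher conditional probabilities against the other query words would receive a correspondingly lower coefficient.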
Once all of the query-word coefficients have been calculated, a search vector is constructed, at 74. Each term of the vector is one of the query words, and the coefficient for that term is the associated coefficient. That is, the search vector may be represented as C1Wq1 + C2Wq2 + C3Wq3 + ... + CkWqk, for the k query words, where Ci is the coefficient calculated for each query word Wqi. In terms of program operation, this vector may be a simple list of all query words and their associated coefficients. This vector represents one aspect of the invention.
To retrieve one or more records of interest, the program initializes the term (query word) in the vector to 1, at 78, and finds all texts in the collection of texts containing that query word Wq1. This is done by accessing the word index of records 52 to identify all texts containing word Wq1, as shown at box 76. At each step, all TIDs containing the word are assigned a text-score value equal to the word coefficient (which adjusts the text-score value for the expected probability of finding that word in a text). After this search step is completed for all query words, through the logic of 80, 82, the corrected word scores for each TID are summed to yield a total text-value score for each text, and the TIDs are ranked by total score, at 84. The top-ranked TIDs are the one or more TIDs of interest. The search method is detailed below with reference to Fig. 8.
D. Generating a word-records table and conditional probability matrix
As noted above, the program uses the non-generic words contained in the collection of texts to generate a word-records table 52. This table is essentially a dictionary of non-generic words, where each word has associated with it each TID containing that word and, optionally, for each TID, an associated DID.
In forming the word-records file, and with reference to Fig. 6, the program creates an empty ordered list 52. With the text ID t initialized to 1, at 86, the program accesses the collection of texts, e.g., a collection of individual paragraphs from a library of documents, to retrieve the text and associated identifier(s) (e.g., TID and DID) for that text. The text is processed to (i) remove generic words, and (ii) simplify verb-root words to their common root, yielding a list of non-generic words to be indexed. The program may optionally add synonyms, such as obtained from a separate synonym dictionary, to one or more of the words. Details of text-processing steps for use in removing generic words and identifying verb-root words can be found, for example, in co-owned U.S. patent application 20040054520 A1, for "Text-Searching Code, System, and Method," published March 18, 2004, which is incorporated herein in its entirety.
For each text t, the word number is initialized at w=1, at 90, and the program retrieves that word from the word list generated above for the text (box 92). For that word w, the program asks, at 94: is word w already in the word-records table? If it is, the text identifier (associated TID) for text t is added to word-records table 52 for that word w, at 96. If not, a new word entry is created in table 52, at 98, along with the associated TID. This process is repeated, through the logic of 100, 102, until all of the non-generic words in text t have been added to the table. Once a text has been processed, the program advances, through the logic of 104, 106, until all texts t in the collection of texts have been processed and added to the word-records table, completing the processing steps at 108.
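The Fig. 6 loop amounts to building an inverted index. The following condensed sketch is illustrative only: the stop-word list, sample texts, and function name are invented, and the verb-root simplification and synonym steps are omitted.

```python
# Placeholder stop-word list standing in for the generic-word removal step.
GENERIC = {"a", "the", "of", "and", "is"}

def build_word_index(texts_by_tid):
    """Map each non-generic word to the list of TIDs containing it
    (word-records table 52, Fig. 3C)."""
    index = {}
    for tid, text in texts_by_tid.items():
        for word in set(text.lower().split()) - GENERIC:
            # setdefault plays the role of boxes 94/96/98: append the TID
            # if the word exists, else create a new word entry.
            index.setdefault(word, []).append(tid)
    return index

index = build_word_index({"TID1": "the stent is expandable",
                          "TID2": "a vascular stent"})
print(sorted(index["stent"]))  # ['TID1', 'TID2']
```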
As noted with respect to Fig. 4, conditional-probability matrix 54 is an N x N matrix of all N non-generic words in a set of texts, where the value of each matrix entry for a pair of words Wm, Wn is the conditional probability of finding word Wm, given word Wn, in the set of texts used in generating the probability values. Each conditional-probability value in the matrix, P(Wm|Wn), can be expressed as P(Wm*Wn)/P(Wn), where P(Wm*Wn) is the probability of finding both words in a text in the set and P(Wn) is the probability of finding Wn alone in a text in the set. The probability of finding both words in the set, P(Wm*Wn), is the number of texts containing both words, divided by the total number of texts in the set. The probability P(Wn) of finding Wn in the set is the number of texts containing that word, divided by the total number of texts in the set. Therefore, canceling the common factor in both numerator and denominator (the total number of texts in the set), the value for P(Wm|Wn) can be calculated simply as the number of texts containing both Wm and Wn, divided by the number of texts containing Wn. An algorithm for calculating these values for all pairs Wm, Wn is given in Fig. 7. In this scheme, Wi and Wj indicate the word pair for the conditional probability P(Wi|Wj). With i initialized at i=1, at 110, the program selects word Wi, at 112, then retrieves all TIDs for that word, at 114, by accessing word index 52; these TIDs are stored at 135. Next, j is initialized at j=1, and word Wj is selected at 118. Except in the case where Wi=Wj, that is, a conditional-probability value on the diagonal of the matrix (where P(Wj|Wj) is one), the program advances, through the logic of 120, 122, to the next Wj, and accesses word index 52 to retrieve all TIDs for that word, at 124. The retrieved TIDs for Wj are now used to determine the number of texts containing word Wj, at 126, simply by counting the number of TIDs retrieved for that word. In box 128, the TIDs retrieved for Wj are compared with those retrieved for Wi, to find the total number of TIDs containing both words, i.e., the number of TIDs common to both words. From these two values, the conditional probability P(Wi|Wj) is calculated as the total number of TIDs containing both words divided by the total number of TIDs containing Wj, at box 130. This value is then added to the empty or partially filled matrix 54.
For each Wi, this process is repeated for all Wj, through the logic of 132, 122, until the conditional probabilities P(Wi|Wn), n=1 to N, have been determined, completing the values for the Wi row of the matrix, where the diagonal value may be set to zero. Each new word Wi is similarly processed, through the logic of 134, 136, until all rows of the matrix have been completed, indicated at 138. E. Text searching and ranking
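The Fig. 7 computation can be sketched directly from a word index. This is a minimal illustrative reconstruction (names and data are invented): each entry P(Wi|Wj) is the number of TIDs common to both words divided by the number of TIDs containing Wj, with diagonal entries set to zero.

```python
def build_cond_prob_matrix(index):
    """index: word -> list of TIDs. Returns {(wi, wj): P(Wi|Wj)}."""
    tid_sets = {w: set(tids) for w, tids in index.items()}
    matrix = {}
    for wi, si in tid_sets.items():
        for wj, sj in tid_sets.items():
            # Diagonal set to zero; off-diagonal = |both| / |texts with Wj|.
            matrix[(wi, wj)] = 0.0 if wi == wj else len(si & sj) / len(sj)
    return matrix

m = build_cond_prob_matrix({"stent": ["TID1", "TID2"], "vascular": ["TID2"]})
print(m[("stent", "vascular")], m[("vascular", "stent")])  # 1.0 0.5
```

The example output also illustrates the non-symmetry noted above: P(stent|vascular) = 1.0 while P(vascular|stent) = 0.5.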
The text-searching module in the system, illustrated in Fig. 8, operates to find texts having the greatest term overlap with the search-vector terms, where the value of each vector term is weighted by the associated term coefficient. An empty ordered list of TIDs, shown at 142 in the figure, will store the accumulating match-score values for each TID associated with the vector terms. The program initializes the vector term (query word w) to 1, in box 144, and retrieves the first query word from query-word list 140. The TIDs associated with this word are then retrieved, at 146, from the word index 52. With the TID count set at 1 (box 150), the program gets the first retrieved TID, and asks, at 152: Is this TID already present in list 142? If it is not, the TID and the term coefficient are added to list 142, as indicated at 143, creating the first of one or more coefficients which may accumulate for that TID. Although not shown here, the program also orders the TIDs in list 142 numerically, to facilitate searching for TIDs in the list.
If the TID is already present in the list, the coefficient is added to the summed coefficients for that TID, as indicated at 154. This process is repeated, through the logic of 156, 158, until all of the TIDs for a given word have been considered and added to list 142. Each word in the search vector is processed in this way, through the logic of 160, 162, until each of the vector terms (query words) has been considered. List 142 now consists of an ordered list of TIDs, each with an accumulated match score representing the sum of coefficients of terms associated with that TID. These TIDs are then ranked at 164, according to a standard ordering algorithm, to yield an output of the top N match scores, e.g., the 10 or 20 highest-ranked texts, identified by TID.
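The Fig. 8 scoring and ranking steps can be sketched as follows; this is an illustrative reconstruction with invented identifiers and coefficients, using a dict in place of ordered list 142.

```python
def rank_texts(search_vector, index, top_n=10):
    """search_vector: list of (query word, coefficient) terms.
    index: word -> list of TIDs. Accumulate each TID's score as the sum
    of the coefficients of the query words it contains, then rank."""
    scores = {}
    for word, coeff in search_vector:
        for tid in index.get(word, []):
            scores[tid] = scores.get(tid, 0.0) + coeff
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

index = {"stent": ["TID1", "TID2"], "vascular": ["TID2"], "method": ["TID3"]}
vector = [("stent", 2.1), ("vascular", 3.6), ("method", 0.5)]
print(rank_texts(vector, index))  # TID2 ranks first (it matches two terms)
```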
From the foregoing, it can be appreciated how various objects and features of the invention have been met. In constructing a search vector, the program calculates a coefficient for each term (query word) that offsets the tendency of certain groups of query words to "self-aggregate" in the retrieved documents, due to high internal conditional probabilities among those words. This results in the highest-ranking texts having a broader representation of all of the query words, because the score calculated for each text will be skewed more toward those query words that have low conditional probabilities with respect to the other query words.
Example
Word-query coefficient calculations. A set of all patent abstracts from the U.S. patent database, 1975-2002, from the following surgery-related classes (U.S. patent classification) was used as the set of texts in this first example: classes 128, 351, 378, 433, 600, 601, 602, 604, 606, 623. A word index and conditional-probability matrix were generated as above. A word query in the surgical class contained the words: implantable, stent, insertable, sleeve, region, method, vascular, constriction, and expandable. For each of the nine query words, the conditional probabilities of that word with respect to the other eight query words were determined from the probability matrix. These values are given in Table 1 below, along with the calculated coefficient for each word. As seen, more generic query words such as "insert," "method," and "expand," all of which would be expected to self-aggregate, have relatively low-valued coefficients, while more distinctive words, such as "stent," "vascular," and "constrict," which would be expected to be less self-aggregating with respect to the eight other query words, have substantially higher coefficients.
Table 1 (each row lists, for one query word, the eight conditional probabilities of that word given each of the other query words, followed by the calculated coefficient and the word root)
0.225305 0.10628 0.101001 0.086122 0.114427 0.148171 0.087912 0.147713 0.983351 implant
0.044566 0.02024 0.030937 0.028707 0.024465 0.102144 0.054945 0.170375 2.099166 stent
0.163117 0.157048 0.252351 0.130371 0.150131 0.17024 0.197802 0.25752 0.676325 insert
0.036553 0.056604 0.059505 0.03273 0.020347 0.027743 0.074725 0.05789 2.731514 sleeve
0.051701 0.087125 0.050994 0.054292 0.068114 0.081967 0.107692 0.074166 1.735956 region
0.259276 0.280244 0.221642 0.127389 0.257085 0.328499 0.224176 0.228883 0.518889 method
0.025796 0.0899 0.019311 0.013345 0.02377 0.02524 0.030769 0.049856 3.597289 vascular
0.004391 0.013873 0.006437 0.010312 0.00896 0.004941 0.008827 0.017511 13.2885 constrict
0.078705 0.458935 0.089401 0.085229 0.065826 0.053822 0.152585 0.186813 0.853741 expand
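The coefficients in Table 1 can be checked against the stated rule: each word's coefficient is the inverse of the sum of its eight conditional probabilities. Below, the "implant" and "stent" rows are copied from the table; small differences in the last digits are rounding in the tabulated values.

```python
# Conditional-probability entries for the "implant" and "stent" rows of Table 1.
implant = [0.225305, 0.10628, 0.101001, 0.086122,
           0.114427, 0.148171, 0.087912, 0.147713]
stent = [0.044566, 0.02024, 0.030937, 0.028707,
         0.024465, 0.102144, 0.054945, 0.170375]

# Coefficient = inverse of the row sum.
print(round(1 / sum(implant), 6))  # ≈ 0.983351 (Table 1 lists 0.983351)
print(round(1 / sum(stent), 6))    # ≈ 2.09917 (Table 1 lists 2.099166)
```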
In a second example, a set of all patent abstracts from the U.S. patent database, 1975-2002, from the computer-related classes (U.S. patent classification) was used as the set of texts: classes 345, 360, 365, 369, 382, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 716, 717, 725. A word index and conditional-probability matrix were generated as above. A word query in the computer class included the words: wireless, communication, architecture, virtual, channel, memory, controller, synchronous, data, and bus. For each of the ten query words, the conditional probabilities of that word with respect to the other nine query words were determined from the probability matrix. These values are given in Table 2 below, along with the calculated coefficient for each word. As seen, more generic query words such as "communicate," "memory," "controller," and "data," all of which would be expected to self-aggregate, have relatively low-valued coefficients, while more distinctive words, such as "wireless," "architecture," and "virtual," which would be expected to be less self-aggregating, have substantially higher coefficients.
Table 2 (each row lists, for one query word, the nine conditional probabilities of that word given each of the other query words, followed by the calculated coefficient and the word root)
0.043767 0.007974 0.006977 0.012751 0.002094 0.006024 0.004944 0.006727 0.002374 10.68006 wireless
0.576563 0.132908 0.130698 0.169289 0.049099 0.079333 0.053672 0.093817 0.168942 0.687606 communicate
0.023438 0.029652 0.045581 0.021704 0.027824 0.015847 0.024011 0.02061 0.054797 3.795577 architecture
0.023438 0.033329 0.0521 0.022246 0.022239 0.017091 0.008475 0.019684 0.020178 4.570789 virtual
0.073437 0.074013 0.042531 0.03814 0.033508 0.03893 0.056497 0.037563 0.039169 2.305274 channel
0.098438 0.175187 0.444976 0.311163 0.273467 0.306169 0.492938 0.338269 0.510188 0.338892 memory
0.2875 0.287392 0.25731 0.242791 0.322572 0.31085 0.382062 0.272218 0.365776 0.366506 controller
0.010938 0.009014 0.018075 0.005581 0.021704 0.023203 0.017713 0.017179 0.029674 6.532475 synchronous
0.465625 0.492943 0.48538 0.405581 0.451438 0.498138 0.394833 0.537429 0.623937 0.229605 data
0.01875 0.101293 0.147262 0.047442 0.053717 0.085732 0.06054 0.105932 0.071198 1.445367 bus
While the invention has been described with respect to particular embodiments and applications, it will be appreciated that various changes and modifications may be made without departing from the spirit of the invention.

Claims

IT IS CLAIMED:
1. A computer-assisted method for retrieving one or more selected texts from a collection of texts, comprising: (a) for each of a plurality of input query words Q, determining the conditional probabilities of finding that word, given each of the other query words, in a set of texts,
(b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words,
(c) for each query word, searching a word index of said collection of texts to locate texts containing that query word,
(d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and
(e) identifying those one or more texts having the highest summed weighting factors.
2. The method of claim 1, wherein said collection of texts is substantially overlapping with said set of texts.
3. The method of claim 1, wherein step (a) includes accessing a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, to determine for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn.
4. The method of claim 3, wherein step (b) includes, for each query word Wqm, calculating a coefficient related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
5. The method of claim 1, wherein the collection of texts includes full-length documents.
6. The method of claim 1, wherein the collection of texts includes paragraphs from a library of documents.
7. The method of claim 1, wherein said collection of texts includes statements associated with bibliographic citations in a library of citation-rich documents.
8. Computer-readable code for use with an electronic computer for retrieving one or more selected texts from a collection of texts, said code being operable on said computer to carry out the steps comprising:
(a) for each of a plurality of input query words Q, determining the conditional probabilities of finding that word, given each of the other query words, in a set of texts,
(b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words,
(c) for each query word, searching a word index of said collection of texts to locate texts containing that query word,
(d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and
(e) identifying those one or more texts having the highest summed weighting factors.
9. The code of claim 8, which is operable, in carrying out step (a), to access a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, and to determine for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn, and in carrying out step (b), to calculate, for each query word Wqm, a coefficient related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
10. A system for use in retrieving one or more selected texts from a collection of texts, said system comprising:
(1) a computer,
(2) accessible by said computer, a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn,
(3) operatively connected to said computer, a user input device,
(4) computer readable code that operates on said computer to:
(a) access said word-affinity matrix, to determine, for each of a plurality of input query words Q, the conditional probabilities of finding that word, given each of the other query words, in a set of texts,
(b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words,
(c) for each query word, searching a word index of said collection of texts to locate texts containing that query word,
(d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and
(e) identifying those one or more texts having the highest summed weighting factors.
11. The system of claim 10, wherein said code operates, in carrying out step (a), to determine for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn, and in carrying out step (b), to calculate, for each query word Wqm, a coefficient related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
12. A search vector representing a multi-word search query, comprising a plurality of vector terms, each term containing a query word and a coefficient for that query word Wqm related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn, where P(Wm|Wn) is the conditional probability of finding word Wm in a text containing word Wn, within a collection of texts.
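The claimed method (steps (a) through (e)) can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: `cond_prob`, a dictionary standing in for the word-affinity matrix P(Wm|Wn); `word_index`, a mapping from words to the IDs of texts containing them; and a fallback coefficient of 1.0 for query words with no recorded co-occurrence data are all choices made for the sketch.

```python
from collections import defaultdict

def rank_texts(query_words, cond_prob, word_index):
    """Rank texts by summed query-word weighting factors.

    cond_prob[(wm, wn)] is an assumed stand-in for the word-affinity
    matrix entry P(wm|wn): the conditional probability of finding word
    wm in a text that contains word wn.
    word_index maps each word to the set of IDs of texts containing it.
    """
    # Steps (a)-(b): each query word's coefficient is the inverse of
    # the sum of its conditional probabilities given the other query
    # words; words with no co-occurrence data default to 1.0.
    coeff = {}
    for wq in query_words:
        total = sum(cond_prob.get((wq, wn), 0.0)
                    for wn in query_words if wn != wq)
        coeff[wq] = 1.0 / total if total > 0 else 1.0

    # Steps (c)-(d): locate texts containing each query word and
    # accumulate that word's coefficient as a weighting factor.
    scores = defaultdict(float)
    for wq in query_words:
        for text_id in word_index.get(wq, ()):
            scores[text_id] += coeff[wq]

    # Step (e): return text IDs ordered by highest summed weight.
    return sorted(scores, key=scores.get, reverse=True)
```

Note the effect of the inverse relationship: a query word that co-occurs strongly with the other query words receives a small coefficient, while a distinctive, weakly co-occurring word is up-weighted, so texts matching the more discriminating terms rank higher.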
PCT/US2006/045397 2005-11-22 2006-11-22 Method, system and code for retrieving texts WO2007062215A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US73927205P 2005-11-22 2005-11-22
US60/739,272 2005-11-22

Publications (2)

Publication Number Publication Date
WO2007062215A2 true WO2007062215A2 (en) 2007-05-31
WO2007062215A3 WO2007062215A3 (en) 2007-12-13

Family

ID=38067955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/045397 WO2007062215A2 (en) 2005-11-22 2006-11-22 Method, system and code for retrieving texts

Country Status (1)

Country Link
WO (1) WO2007062215A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050120011A1 (en) * 2003-11-26 2005-06-02 Word Data Corp. Code, method, and system for manipulating texts

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615723A (en) * 2015-02-06 2015-05-13 百度在线网络技术(北京)有限公司 Determining method and device of search term weight value
CN104615723B (en) * 2015-02-06 2018-08-07 百度在线网络技术(北京)有限公司 The determination method and apparatus of query word weighted value

Also Published As

Publication number Publication date
WO2007062215A3 (en) 2007-12-13

Similar Documents

Publication Publication Date Title
EP0965089B1 (en) Information retrieval utilizing semantic representation of text
US6947920B2 (en) Method and system for response time optimization of data query rankings and retrieval
US7814099B2 (en) Method for ranking and sorting electronic documents in a search result list based on relevance
CN103136352B (en) Text retrieval system based on double-deck semantic analysis
Ide et al. Essie: a concept-based search engine for structured biomedical text
USRE36727E (en) Method of indexing and retrieval of electronically-stored documents
US7483892B1 (en) Method and system for optimally searching a document database using a representative semantic space
US8346795B2 (en) System and method for guiding entity-based searching
Chen et al. Multilingual information retrieval using machine translation, relevance feedback and decompounding
Sarkar Automatic single document text summarization using key concepts in documents
JP4426041B2 (en) Information retrieval method by category factor
JP2008117351A (en) Search system
Li et al. Complex query recognition based on dynamic learning mechanism
JP4888677B2 (en) Document search system
Ramani et al. An Explorative Study on Extractive Text Summarization through k-means, LSA, and TextRank
WO2007062215A2 (en) Method, system and code for retrieving texts
Pinto et al. Joining automatic query expansion based on thesaurus and word sense disambiguation using WordNet
Gupta et al. A review on important aspects of information retrieval
Lin et al. Biological question answering with syntactic and semantic feature matching and an improved mean reciprocal ranking measurement
Sharma et al. Improved stemming approach used for text processing in information retrieval system
Strzalkowski Document representation in natural language text retrieval
Azad et al. Query expansion for improving web search
Yoshioka et al. On a combination of probabilistic and Boolean IR models for WWW document retrieval
Liddy Document retrieval, automatic
Hussey et al. A comparison of automated keyphrase extraction techniques and of automatic evaluation vs. human evaluation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06838392

Country of ref document: EP

Kind code of ref document: A2