WO2007062215A2 - Method, system and code for retrieving texts - Google Patents

Method, system and code for retrieving texts Download PDF

Info

Publication number
WO2007062215A2
Authority
WO
WIPO (PCT)
Prior art keywords
word
query
texts
words
collection
Prior art date
Application number
PCT/US2006/045397
Other languages
French (fr)
Other versions
WO2007062215A3 (en)
Inventor
Peter J. Dehlinger
Original Assignee
Word Data Corp.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Word Data Corp. filed Critical Word Data Corp.
Publication of WO2007062215A2 publication Critical patent/WO2007062215A2/en
Publication of WO2007062215A3 publication Critical patent/WO2007062215A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing

Definitions

  • the present invention relates to a method, system, and code for retrieving one or more texts from a collection of texts.
  • a user first classifies a text of interest according to a field or class that the text is likely to be located in. For example, in the legal field, one might initially specify a class of appellate cases relating to a specific area of the law or in a specific jurisdiction. In a technical or patent search, one might confine the search to a particular area of technology or patent class or subclass. This initial classification narrows the search to the areas of interest or most likely text matches.
  • whether or not the user first specifies a search subset, actual texts are found by matching words or word pairs in a search query with words or word pairs present in the text of interest.
  • in a Boolean search, the user inputs a small number of key words which may be joined by one or more Boolean operators.
  • the search may be either inclusive or ranked. The inclusive approach will find only those texts that meet all of the constraints of the search query. For example, if the search query is "(a or b) and (c or d)," the search will find only those documents containing at least one of a or b and at least one of c or d.
  • This type of search tends to be either overly limited, in which case relatively few texts may be found and many texts of potential interest may be missed, or overly inclusive, in which case many texts of only marginal interest may be retrieved.
  • the user may have to carry out a number of overly inclusive searches, then combine two or more of these initial queries in an effort to find a search-query "intersection" that is neither overly inclusive nor overly limited.
  • a search query is converted to a search vector in which each term, e.g., word, may be assigned an independent weighting factor, such as inverse document frequency, that provides a rough estimate of the word frequency or word importance in a collection of documents.
  • the search algorithm finds and ranks documents having the highest vector-term score.
  • the search-vector approach, though more flexible than an inclusive-type Boolean search, is nonetheless limited by the challenge of assigning weights to the query words in a meaningful way, and by the tendency of the retrieved texts to cluster around certain groups of query words.
  • the present invention includes a computer-assisted method for retrieving one or more selected texts from a collection of texts.
  • the method includes the steps of: (a) for each of a plurality of input query words Q, determining the conditional probabilities of finding that word, given each of the other query words, in a set of texts, (b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words, (c) for each query word, searching a word index of the collection of texts to locate texts containing that query word, (d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and (e) identifying those one or more texts having the highest summed weighting factors.
  • the collection of texts may be substantially overlapping with the set of texts, or may be merely representative of the set.
  • the collection of texts may include full-length documents, paragraphs from a library of documents, or statements associated with bibliographic citations in a library of citation-rich documents.
  • Step (a) in the method may include accessing a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, to determine, for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn.
  • Step (b) in the method may include, for each query word Wqm, calculating a coefficient related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
  • the method includes a computer-readable code for use with an electronic computer for retrieving one or more selected texts from a collection of texts.
  • the code is operable on the computer to carry out the above method steps (a)-(e).
  • a system for use in retrieving one or more selected texts from a collection of texts comprises: (1) a computer, (2) accessible by the computer, a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, (3) operatively connected to the computer, a user input device, and (4) computer-readable code that operates on the computer to: (a) access the word-affinity matrix to determine, for each of a plurality of input query words Q, the conditional probabilities of finding that word, given each of the other query words, in a set of texts, (b) use the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words, (c) for each query word, search a word index of the collection of texts to locate texts containing that query word, (d) assign to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and (e) identify those one or more texts having the highest summed weighting factors.
  • the invention includes a search vector representing a multi-word search query.
  • the vector has a plurality of vector terms, where each term contains a query word Wqm and a coefficient for that query word that is related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
  • Fig. 1 shows hardware and software components of the system of the invention
  • Fig. 2 shows the relationship between texts and database tables generated from the texts
  • Figs. 3A-3C show representative table entries for a document-ID table (3A), a text-ID table (3B) and a word index of text records (3C);
  • Fig. 4 shows a portion of a conditional-probability matrix used in the method of the invention
  • Fig. 5 gives an overview of the operation of the system of the invention in flow diagram form
  • Fig. 6 is a flow diagram of steps for generating a word index of text records in accordance with an embodiment of the invention
  • Fig. 7 is a flow diagram of steps for generating a conditional probability matrix of paired words in a set of texts.
  • Fig. 8 is a flow diagram of steps for retrieving texts based on a word query, in accordance with the invention.
  • a "text” refers to a digitally stored text, typically a natural-language text.
  • Typical texts include: scientific or technical articles or abstracts, and legal documents, such as case-law decisions, opinions, briefs, patent applications, and patents, and/or abstracts or claims thereof.
  • One exemplary type of text in the invention includes paragraphs contained in multi-paragraph documents and statements or phrases extracted from citation-rich documents.
  • a "collection of texts” refers to a library of texts, typically containing a large number of texts, e.g., hundreds to millions of texts, containing one or more texts of interest for retrieval purposes.
  • a "set of texts" refers to a library of texts that may be substantially the same as the collection of texts, i.e., substantially overlapping, or the set of texts may be substantially non-overlapping with the collection of texts, but is preferably made up of similar types of documents, or similar subject matter or breadth of subject matter.
  • the set of texts is used in determining conditional probabilities, which in turn are used in calculating search-vector coefficients, in accordance with an aspect of the invention.
  • a set of texts is typically a subset of a collection of texts.
  • the collection of texts is a library of texts that are being added to over time, and the set of texts, a subset of these that are fixed as of a given date.
  • the collection may include texts from several fields, e.g., technical fields, with the one or more sets being calculated from texts within a given field, as illustrated in the Example below.
  • a "search query" refers to a list of words, or a multiple-word sentence or sentence fragment, that is descriptive of the content of texts to be retrieved.
  • a “search vector” refers to a list of query words, each having a separately calculated coefficient.
  • a "verb-root word” is a word or phrase that has a verb root.
  • the word “light” or “lights” (the noun), "light” (the adjective), “lightly” (the adverb) and various forms of "light” (the verb), such as light, lighted, lighting, lit, lights, to light, has been lighted, etc., are all verb-root words with the same verb root form "light,” where the verb root form selected is typically the present-tense singular (infinitive) form of the verb.
  • "Generic words" refers to words in a natural-language passage that are not descriptive of, or only non-specifically descriptive of, the subject matter of the passage. Examples include prepositions, conjunctions, and pronouns, as well as certain nouns, verbs, adverbs, and adjectives that occur frequently in passages from many different fields. "Non-generic words" are those words in a passage remaining after generic words are removed.
  • a "text identifier" or "TID" identifies a particular digitally encoded or processed text, e.g., a document in a database of texts, e.g., by a text number, i.e., a computer-readable alphanumeric code.
  • a “database” refers to a database of tables containing information about texts and/or other text-related information.
  • a database typically includes two or more tables, each containing locators by which information in one table can be used to access information in another table or tables.
  • the probabilities of finding the word Wm and the word Wn in a set of texts are expressed as P(Wm) and P(Wn), respectively.
  • the probability of finding Wm in a text that also contains Wn is the conditional probability of finding Wm, given Wn, and is expressed as P(Wm|Wn).
  • P(Wm|Wn) is calculated as P(Wm*Wn)/P(Wn), where P(Wm*Wn) is the probability of finding both words in a text in the set, and P(Wn) is the probability of finding Wn alone in a text in the set.
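A numeric illustration of this definition may help (the counts below are invented for illustration, not taken from the patent's example):

```python
# Toy illustration of P(Wm|Wn) = P(Wm*Wn)/P(Wn), with invented counts:
# a set of 200 texts, of which 40 contain Wn and 10 contain both Wm and Wn.
total_texts = 200
texts_with_wn = 40
texts_with_both = 10

p_wn = texts_with_wn / total_texts        # P(Wn)    = 0.2
p_both = texts_with_both / total_texts    # P(Wm*Wn) = 0.05
p_m_given_n = p_both / p_wn               # P(Wm|Wn) = 0.25

# The total-text count cancels, so the same value follows directly
# from the raw counts: 10 / 40 = 0.25.
print(round(p_m_given_n, 6))  # 0.25
```

Note that the cancellation of the total-text count is what lets the later counting algorithm work entirely from TID lists.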
  • Fig. 1 shows the basic components of a system 40 for use in retrieving one or more stored texts.
  • a computer or processor 42 in the system may be a standalone computer or a central computer or server that communicates with a user's personal computer.
  • the computer has an input device 44, such as a keyboard, modem, and/or disc reader, by which the user can enter query words or a query statement, as will be described.
  • a display or monitor 46 displays a user interface by which the user communicates with the system in retrieving one or more texts.
  • Computer 42 in the system is typically one of many user terminal computers, each of which communicates with a central server or processor 41 on which the main program activity in the system takes place.
  • a database in the system, typically running on processor 41, includes a document-ID table 48, a text-ID table 50, and a word index of text records 52, as will be described below with respect to Figs. 3A-3C, respectively.
  • the word-index of text records is used in generating a conditional-probability matrix 54, as described below with respect to Figs. 4 and 7.
  • the database also includes a database tool that operates on the server to access and act on information contained in the database tables, in accordance with the program steps described below.
  • One exemplary database tool is MySQL database tool, which can be accessed at www.mysql.com.
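A schema along these lines could be declared in any relational database tool such as the one mentioned; here is a sketch using Python's built-in sqlite3 module (all table and column names are invented, since the patent does not specify a schema):

```python
import sqlite3

# Hypothetical schema for the document-ID table (48), text-ID table (50),
# and word index (52); table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doc_id_table  (did TEXT, tid TEXT);           -- table 48
CREATE TABLE text_id_table (tid TEXT PRIMARY KEY,
                            paragraph TEXT, did TEXT);     -- table 50
CREATE TABLE word_index    (word TEXT, tid TEXT);          -- table 52
CREATE INDEX idx_word ON word_index (word);
""")
conn.execute("INSERT INTO word_index VALUES ('stent', 'TID1')")
rows = conn.execute(
    "SELECT tid FROM word_index WHERE word = 'stent'").fetchall()
print(rows)  # [('TID1',)]
```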
  • Fig. 2 shows in flow diagram form, the relationship between text-based documents, paragraph texts contained in the documents, and various database tables constructed from the texts.
  • the collection of documents refers to a library or collection of digitally stored documents, such as scientific or technical articles, and legal documents, such as case-law decisions, opinions, briefs, patent applications, and patents. Each document is assigned a document identifier or document-ID or DID. In the embodiment illustrated, the documents are processed to yield the paragraphs making up the documents.
  • These paragraphs form a collection of texts 62, each with an assigned text identifier (TID), such that some known group of TIDs is associated with each document identifier (DID). It will be understood that the texts themselves may be whole documents, paragraphs making up the documents, or statements or sentences within paragraphs.
  • the DIDs and associated TIDs are assembled into document-ID table 48.
  • the key locator for this table is a DID, such as DID1.
  • the information stored with each DID includes the associated TIDs, such as TIDa, TIDb, ... TIDn.
  • each TID and its associated paragraph text and DID are assembled into text-ID table 50. Representative table entries for this database table are shown in Fig. 3B.
  • the key locator for this database table is a TID, such as TID1.
  • the information stored with each TID is the associated paragraph text, e.g., text1, and the document ID, e.g., DID1.
  • tables 48, 50 allow a system user to reconstruct selected portions of any document. For example, to generate a string of paragraphs containing a selected paragraph, the user can first access table 50 to identify the DID associated with the selected paragraph TID. Then, going to table 48, the user can readily identify a string of paragraphs in that document, identified by TIDs, that contain the selected paragraph at a known string position.
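The two-table lookup just described can be sketched as follows (a minimal in-memory stand-in; the TIDs, DIDs, and text are invented):

```python
# Hypothetical in-memory stand-ins for the document-ID table (48)
# and the text-ID table (50). Keys and values are invented.
doc_id_table = {               # DID -> ordered list of TIDs in that document
    "DID1": ["TID_a", "TID_b", "TID_c", "TID_d"],
}
text_id_table = {              # TID -> (paragraph text, parent DID)
    "TID_b": ("An implantable stent ...", "DID1"),
}

def surrounding_paragraphs(tid, window=1):
    """Return the TIDs of the paragraphs around `tid` in its own document."""
    _, did = text_id_table[tid]            # table 50: TID -> DID
    tids = doc_id_table[did]               # table 48: DID -> TID string
    i = tids.index(tid)
    return tids[max(0, i - window): i + window + 1]

print(surrounding_paragraphs("TID_b"))  # ['TID_a', 'TID_b', 'TID_c']
```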
  • the non-generic words in the paragraph texts are extracted at 62 and the associated TIDs are processed, as will be described with reference to Fig. 6, to generate a word index of text records 52.
  • This table contains a list of all of the non-generic words (the key locators) contained in the paragraph records, and for each word, a list of all TIDs containing that word (the text record for each word), as shown in Fig. 3C.
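A word index of this kind is, in effect, an inverted index; a minimal sketch follows (the generic-word list and sample texts are invented for illustration):

```python
# Invented stop-word list; the patent leaves the generic-word set open.
GENERIC = {"a", "an", "the", "of", "and", "is", "in", "for"}

def build_word_index(texts):
    """texts: dict TID -> text. Returns word -> sorted list of TIDs."""
    index = {}
    for tid, text in texts.items():
        words = {w for w in text.lower().split() if w not in GENERIC}
        for w in words:
            index.setdefault(w, set()).add(tid)
    return {w: sorted(tids) for w, tids in index.items()}

texts = {
    "TID1": "an implantable vascular stent",
    "TID2": "the stent is expandable",
}
index = build_word_index(texts)
print(index["stent"])        # ['TID1', 'TID2']
print(index["expandable"])   # ['TID2']
```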
  • word2 is contained in records TIDm, TIDn
  • the word index just described, which is generated from the collection of texts to be searched, may be used in generating a conditional-probability matrix 54, according to the method described below with reference to Fig. 7.
  • the values for the conditional probability matrix are generated from a set of texts other than the collection of texts
  • the set is first used to generate an index of word records, as just described.
  • the collection of texts to be searched is a library of texts that expands over time, as new texts appropriate for that collection become available, whereas the set of texts used in generating conditional-probability values is fixed in time and typically represents some subset of the collection of documents. Accordingly, the conditional probabilities, once calculated, do not have to be recalculated as new texts become available, or can be recalculated only occasionally.
  • the conditional-probability matrix, a portion of which is shown at 54 in Fig. 4, is an N x N matrix of N row words, such as words W1, W2, W3, where the value of each matrix entry for a pair of words Wm, Wn is the conditional probability of finding word Wm, given word Wn, in the set of texts used in generating the probability values.
  • P(Wm|Wn) is defined as P(Wm*Wn)/P(Wn), that is, the probability of finding both words Wm and Wn in the same text in the set, divided by the probability of finding Wn alone. As will be described further below, this can be calculated as the number of texts in the set containing both Wm and Wn, divided by the total text occurrence of Wn in the same set of texts.
  • the diagonal entries P(Wn|Wn) are set to zero.
  • the matrix is non-symmetrical, meaning that P(Wm|Wn) = P(Wm*Wn)/P(Wn) is typically not equal to P(Wn|Wm) = P(Wm*Wn)/P(Wm).
  • Fig. 5 is an overview of the method of the invention for retrieving a text contained in a collection of texts.
  • the user enters a search query (box 60) consisting of a plurality of query words Q.
  • the query may be a simple list of query words, or a phrase or sentence, or even a full paragraph.
  • the program processes the query to remove generic words, and may further process the query to convert verb-root words into their common roots, and to standardize verb tense and plural endings, as will be described below.
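Query preprocessing of this sort might look like the sketch below (the generic-word list and verb-root map are illustrative assumptions; the patent leaves their exact contents open):

```python
# Invented generic-word list and verb-root map, for illustration only.
GENERIC = {"a", "an", "the", "of", "and", "or", "to", "is", "for"}
VERB_ROOTS = {"lighted": "light", "lighting": "light",
              "lit": "light", "lights": "light"}

def normalize_query(query):
    """Remove generic words and map verb forms to their common root."""
    words = []
    for w in query.lower().split():
        if w in GENERIC:
            continue
        words.append(VERB_ROOTS.get(w, w))
    return words

print(normalize_query("the lamp is lighted for reading"))
# ['lamp', 'light', 'reading']
```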
  • for each query word Wqm, e.g., Wq1, the program accesses conditional-probability matrix 54 (box 62) to retrieve the matrix row for that word.
  • the retrieved matrix row is then scanned to find the conditional-probability values P(Wqm|Wqn) for each of the other query words Wqn.
  • the program then calculates query-word coefficients that tend to equalize the probabilities of finding any one query word, given all other query words. In one embodiment, indicated at 68, this is done by first calculating, for each query word Wqm, the sum of all conditional-probability values P(Wqm|Wqn), for all other query words Wqn, and then taking the inverse of that sum.
  • the coefficient of Wqm is calculated to offset the contribution that query word is expected to make to the total search score, based solely on expected conditional probabilities with all other query words.
  • the coefficients assigned are the actual inverse sum values calculated as above.
  • the coefficients are related to this value, e.g., are calculated as the square or cube root of these inverse sum values.
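The coefficient calculation described in the preceding bullets can be sketched as follows (query words and probability values are invented; a `root` parameter of 2 or 3 corresponds to the square- or cube-root variants just mentioned):

```python
def query_word_coefficients(query_words, cond_prob, root=1.0):
    """cond_prob[(wm, wn)] = P(wm|wn). The coefficient of each query word
    is the inverse of the sum of P(word|other) over the other query words,
    optionally attenuated by taking a root (root=2 -> square root)."""
    coeffs = {}
    for w in query_words:
        s = sum(cond_prob.get((w, other), 0.0)
                for other in query_words if other != w)
        coeffs[w] = (1.0 / s) ** (1.0 / root) if s > 0 else 1.0
    return coeffs

# Invented probabilities: "data" co-occurs heavily, "wireless" rarely.
cp = {("data", "memory"): 0.5, ("data", "wireless"): 0.3,
      ("memory", "data"): 0.4, ("memory", "wireless"): 0.1,
      ("wireless", "data"): 0.05, ("wireless", "memory"): 0.05}
coeffs = query_word_coefficients(["data", "memory", "wireless"], cp)
# "wireless" gets the largest coefficient: 1/(0.05 + 0.05) = 10
print(round(coeffs["wireless"], 6))  # 10.0
```

The self-aggregating word "data" ends up with the smallest coefficient, matching the behavior the Example section reports.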
  • a search vector is then constructed, at 74.
  • Each term of the vector is one of the query words, and the coefficient for that term is the associated coefficient. That is, the search vector may be represented as C1Wq1, C2Wq2, ... CnWqn.
  • To retrieve one or more records of interest, the program initializes the term (query word) index in the vector to 1, at 78, and finds all texts in the collection of texts containing that query word Wq1. This is done by accessing the word index of records 52 to identify all texts containing word Wq1, as shown at box 76. At each step, all TIDs containing the word are assigned a text-score value equal to the word coefficient (which adjusts the text-score value for the expected probability of finding that word in a text). After this search step is completed for all query words, through the logic of 80, 82, the corrected word scores for each TID are summed to yield a total text-value score for each text, and the TIDs are ranked by total score, at 84. The top-ranked TIDs are the one or more TIDs of interest.
  • the search method is detailed below with reference to Fig. 8.
  • the program uses non-generic words contained in the collection of texts to generate a word-records table 52.
  • This table is essentially a dictionary of non-generic words, where each word has associated with it, each TID containing that word, and optionally, for each TID, an associated DID.
  • In forming the word-records file, and with reference to Fig. 6, the program creates an empty ordered list 52.
  • the program accesses the collection of texts, e.g., a collection of individual paragraphs from a library of documents, to retrieve the text and associated identifier(s) (e.g., TID and DID) for that text.
  • the text is processed to (i) remove generic words, and (ii) simplify verb-root words to their common root, yielding a list of query words which will be used in the search.
  • the program may optionally add synonyms, such as obtained from a separate synonym dictionary, to one or more of the query words.
  • conditional-probability matrix 54 is an N x N matrix of all N non-generic words in a set of texts, where the value of each matrix entry for a pair of words Wm, Wn is the conditional probability of finding word Wm, given word Wn, in the set of texts used in generating the probability values.
  • P(Wm|Wn) can be expressed as P(Wm*Wn)/P(Wn), where P(Wm*Wn) is the probability of finding both words in a text in the set and P(Wn) is the probability of finding Wn alone in a text in the set.
  • the probability of finding both words in the set is the number of texts containing both words, divided by the total number of texts in the set.
  • the probability P(Wn) of finding Wn in the set is the number of texts containing that word, divided by the total number of texts in the set. Therefore, canceling the common factor in both numerator and denominator (the total number of texts in the set), the value for P(Wm|Wn) can be calculated as the number of texts containing both Wm and Wn, divided by the number of texts containing Wn.
  • An algorithm for calculating these values for all pairs of Wm, Wn is given in Fig. 7.
  • Wi and Wj indicate the word pair for the conditional probability P(Wi|Wj).
  • the program selects word Wi, at 112, then retrieves all TIDs for that word, at 114, by accessing word index 52, and these TIDs are stored at 135.
  • the first word Wj is selected at 118.
  • the program advances, through the logic of 120, 122, to the next Wj, and accesses word index 52 to retrieve all TIDs for that word, at 124.
  • the retrieved TIDs for Wj are now used to determine the number of texts containing word Wj, at 126, simply by counting the number of TIDs retrieved for that word.
  • the TIDs retrieved for Wi are compared with those retrieved for Wj, to find the total number of TIDs containing both words, i.e., the number of TIDs common to both words.
  • P(Wi|Wj) is calculated as the total number of TIDs containing both words, divided by the total number of TIDs containing Wj, at box 130. This value is then added to the empty or partially filled matrix 54.
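The counting procedure described for Fig. 7 amounts to set intersections over the word index; a sketch follows (the toy index is invented; integer TIDs stand in for text identifiers):

```python
def conditional_prob_matrix(word_index):
    """word_index: word -> set of TIDs containing it.
    Returns {(wi, wj): P(wi|wj)} = |TIDs(wi) & TIDs(wj)| / |TIDs(wj)|,
    with diagonal entries set to zero, as the patent describes."""
    matrix = {}
    for wi, tids_i in word_index.items():
        for wj, tids_j in word_index.items():
            if wi == wj:
                matrix[(wi, wj)] = 0.0
            elif tids_j:
                matrix[(wi, wj)] = len(tids_i & tids_j) / len(tids_j)
    return matrix

index = {"stent": {1, 2, 3, 4}, "vascular": {3, 4}, "memory": {5}}
m = conditional_prob_matrix(index)
print(m[("vascular", "stent")])  # 2/4 = 0.5
print(m[("stent", "vascular")])  # 2/2 = 1.0: the matrix is non-symmetrical
```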
  • the text searching module in the system operates to find texts having the greatest term overlap with the search vector terms, where the value of each vector term is weighted by the associated term coefficient.
  • An empty ordered list of TIDs, shown at 142 in the figure, will store the accumulating match-score values for each TID associated with the vector terms.
  • the program initializes the vector term (query word) index to 1, in box 144, and retrieves the first query word from query-word list 140. The TIDs associated with this word are then retrieved, at 146, from the word index 52. With the TID count set at 1 (box 150), the program gets the first retrieved TID, and asks, at 152: Is this TID already present in list 142?
  • the TID and the term coefficient are added to list 142, as indicated at 143, creating the first coefficient of one or more coefficients which may accumulate for that TID.
  • the program also orders the TIDs in list 142 numerically, to facilitate searching for TIDs in the list.
  • if the TID is already present in the list, the coefficient is added to the summed coefficients for that TID, as indicated at 154. This process is repeated, through the logic of 156, 158, until all of the TIDs for a given word have been considered and added to list 142.
  • Each word in the search vector is processed in this way, through the logic of 160, 162, until each of the vector terms (query words) has been considered.
  • List 142 now consists of an ordered list of TIDs, each with an accumulated match score representing the sum of coefficients of terms associated with that TID.
  • These TIDs are then ranked at 164, according to a standard ordering algorithm, to yield an output of the top N match scores, e.g., the 10 or 20 highest-ranked texts, identified by TID.
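The accumulate-and-rank procedure of Fig. 8 reduces to summing, for each TID, the coefficients of the query words it contains; a sketch (search-vector coefficients and index contents are invented):

```python
from collections import defaultdict

def rank_texts(search_vector, word_index, top_n=10):
    """search_vector: word -> coefficient; word_index: word -> set of TIDs.
    Returns the top_n (TID, score) pairs, ranked by summed coefficients."""
    scores = defaultdict(float)
    for word, coeff in search_vector.items():
        for tid in word_index.get(word, ()):
            scores[tid] += coeff          # accumulate, as in list 142
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

vector = {"stent": 2.0, "vascular": 5.0, "memory": 1.0}
index = {"stent": {1, 2}, "vascular": {2, 3}, "memory": {4}}
print(rank_texts(vector, index, top_n=2))  # [(2, 7.0), (3, 5.0)]
```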
  • the program calculates, for each term (query word), a coefficient that offsets the tendency of certain groups of query words to "self-aggregate" in the retrieved documents, due to high internal conditional probabilities among those words. This results in the highest-ranking texts having a broader representation of all of the query words, because the score calculated for each text will be skewed more toward those query words that have low conditional probabilities with respect to the other query words.
  • Word query coefficient calculations. A set of all patent abstracts from the U.S. patent database, 1975-2002, from the following Surgery-related classes (U.S. patent classification) was used as the set of texts in this first example: classes 128, 351, 378, 433, 600, 601, 602, 604, 606, 623. A word index and conditional-probability matrix were generated as above.
  • a word query in the surgical class contained the words: implantable, stent, insertable, sleeve, region, method, vascular, constriction, and expandable.
  • the conditional probabilities of that word with respect to the other eight query words were determined from the probability matrix. These values are given in Table I below, along with the calculated coefficient for each word.
  • the conditional probabilities of that word with respect to the other nine query words were determined from the probability matrix. These values are given in Table 2 below, along with the calculated coefficient for each word.
  • more generic query words such as "communicate," "memory," "controller," and "data," all of which would be expected to self-aggregate, have relatively low-valued coefficients, while more distinctive words, such as "wireless," "architecture," and "virtual," which would be expected to be less self-aggregating, have substantially higher coefficients.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer-assisted method, code, and system for use in retrieving one or more selected texts from a collection of texts, are disclosed. The method employs a word-affinity matrix for use in constructing a search vector composed of a plurality of vector terms, each term containing a query word and a coefficient for that query word related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn, where P(Wm|Wn) is the conditional probability of finding word Wm in a text containing word Wn, within a collection of texts.

Description

METHOD, SYSTEM AND CODE FOR RETRIEVING TEXTS
Field of the Invention
The present invention relates to a method, system, and code for retrieving one or more texts from a collection of texts.
Background of the Invention
There are a number of information-retrieval methods available for accessing digitally processed texts. In one general approach, a user first classifies a text of interest according to a field or class that the text is likely to be located in. For example, in the legal field, one might initially specify a class of appellate cases relating to a specific area of the law or in a specific jurisdiction. In a technical or patent search, one might confine the search to a particular area of technology or patent class or subclass. This initial classification narrows the search to the areas of interest or most likely text matches.
Whether or not the user first specifies a search subset, actual texts are found by matching words or word pairs in a search query with words or word pairs present in the text of interest. In one word-matching algorithm, known familiarly as a Boolean search, the user inputs a small number of key words which may be joined by one or more Boolean operators. The search may be either inclusive or ranked. The inclusive approach will find only those texts that meet all of the constraints of the search query. For example, if the search query is "(a or b) and (c or d)," the search will find only those documents containing at least one of a or b and at least one of c or d. This type of search tends to be either overly limited, in which case relatively few texts may be found and many texts of potential interest may be missed, or overly inclusive, in which case many texts of only marginal interest may be retrieved. In practice, the user may have to carry out a number of overly inclusive searches, then combine two or more of these initial queries in an effort to find a search-query "intersection" that is neither overly inclusive nor overly limited.
In the approach based on document ranking, a search query is converted to a search vector in which each term, e.g., word, may be assigned an independent weighting factor, such as inverse document frequency, that provides a rough estimate of the word frequency or word importance in a collection of documents. The search algorithm then finds and ranks documents having the highest vector-term score. The search-vector approach, though more flexible than an inclusive-type Boolean search, is nonetheless limited by the challenge of assigning weights to the query words in a meaningful way, and by the tendency of the retrieved texts to cluster around certain groups of query words.
Summary of the Invention
In one aspect, the present invention includes a computer-assisted method for retrieving one or more selected texts from a collection of texts. The method includes the steps of: (a) for each of a plurality of input query words Q, determining the conditional probabilities of finding that word, given each of the other query words, in a set of texts, (b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words, (c) for each query word, searching a word index of the collection of texts to locate texts containing that query word, (d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and (e) identifying those one or more texts having the highest summed weighting factors.
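Steps (a)-(e) above can be sketched end to end as follows (an illustrative Python sketch assuming simple in-memory data structures; the patent does not prescribe an implementation, and the probability values and index contents below are invented):

```python
def retrieve(query_words, cond_prob, word_index, top_n=5):
    """Sketch of steps (a)-(e): compute per-word coefficients from
    conditional probabilities, then score and rank texts."""
    # (a)-(b): coefficient = inverse of the summed conditional
    # probabilities of this word, given each other query word
    coeffs = {}
    for w in query_words:
        s = sum(cond_prob.get((w, o), 0.0) for o in query_words if o != w)
        coeffs[w] = 1.0 / s if s > 0 else 1.0
    # (c)-(d): locate texts containing each word, weight by coefficient
    scores = {}
    for w, c in coeffs.items():
        for tid in word_index.get(w, ()):
            scores[tid] = scores.get(tid, 0.0) + c
    # (e): texts with the highest summed weighting factors first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

cp = {("stent", "vascular"): 0.5, ("vascular", "stent"): 0.25}
index = {"stent": {1, 2}, "vascular": {2}}
print(retrieve(["stent", "vascular"], cp, index))  # [2, 1]
```

Text 2 wins because it contains both query words, and the rarer pairing ("vascular", with the lower conditional probability) contributes the larger coefficient.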
The collection of texts may be substantially overlapping with the set of texts, or may be merely representative of the set. The collection of texts may include full-length documents, paragraphs from a library of documents, or statements associated with bibliographic citations in a library of citation-rich documents.
Step (a) in the method may include accessing a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, to determine for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn. Step (b) in the method may include, for each query word Wqm, calculating a coefficient related the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
In another aspect, the method includes a computer-readable code for use with an electronic computer for retrieving one or more selected texts from a collection of texts. The code is operable on the computer to carry out the above method steps (a)-(e).
Also disclosed is a system for use in retrieving one or more selected texts from a collection of texts, the system comprising: (1) a computer, (2) accessible by the computer, a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, (3) operatively connected to the computer, a user input device, and (4) computer-readable code that operates on the computer to: (a) access the word-affinity matrix to determine, for each of a plurality of input query words Q, the conditional probabilities of finding that word, given each of the other query words, in a set of texts, (b) use the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words, (c) for each query word, search a word index of the collection of texts to locate texts containing that query word, (d) assign to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and (e) identify those one or more texts having the highest summed weighting factors.
In still another aspect, the invention includes a search vector representing a multi-word search query. The vector has a plurality of vector terms, where each term contains a query word Wqm and a coefficient for that query word that is related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn, where P(Wm|Wn) is the conditional probability of finding word Wm in a text containing word Wn, within a collection of texts. These and other objects and features of the invention will become more fully apparent when the following detailed description of the invention is read in conjunction with the accompanying drawings. Brief Description of the Drawings
Fig. 1 shows hardware and software components of the system of the invention; Fig. 2 shows the relationship between texts and database tables generated from the texts;
Figs. 3A-3C show representative table entries for a document-ID table (3A), a text-ID table (3B) and a word index of text records (3C);
Fig. 4 shows a portion of a conditional-probability matrix used in the method of the invention;
Fig. 5 gives an overview of the operation of the system of the invention in flow diagram form;
Fig. 6 is a flow diagram of steps for generating a word index of text records in accordance with an embodiment of the invention; Fig. 7 is a flow diagram of steps for generating a conditional probability matrix of paired words in a set of texts; and
Fig. 8 is a flow diagram of steps for retrieving texts based on a word query, in accordance with the invention.
Detailed Description of the Invention A. Definitions
A "text" refers to a digitally stored text, typically a natural-language text. Typical texts include: scientific or technical articles or abstracts; legal documents, such as case-law decisions, opinions, and briefs; patent applications and patents; and/or abstracts or claims thereof. One exemplary type of text in the invention includes paragraphs contained in multi-paragraph documents and statements or phrases extracted from citation-rich documents.
A "collection of texts" refers to a library of texts, typically containing a large number of texts, e.g., hundreds to millions of texts, containing one or more texts of interest for retrieval purposes.
A "set of texts" refers to a library of texts that may be substantially the same as the collection of texts, i.e., substantially overlapping, or may be substantially non-overlapping with the collection of texts, but is preferably made up of similar types of documents, or of similar subject matter or breadth of subject matter. The set of texts is used in determining conditional probabilities, which in turn are used in calculating search-vector coefficients, in accordance with an aspect of the invention. A set of texts is typically a subset of a collection of texts. In one embodiment, the collection of texts is a library of texts that is being added to over time, and the set of texts is a subset of these that is fixed as of a given date. In another embodiment, the collection may include texts from several fields, e.g., technical fields, with the one or more sets being calculated from texts within a given field, as illustrated in the Example below.
A "search query" refers to a list of words, or a multiple-word sentence or sentence fragment, that is descriptive of the content of texts to be retrieved.
A "search vector" refers to a list of query words, each having a separately calculated coefficient. A "verb-root word" is a word or phrase that has a verb root. Thus, the word "light" or "lights" (the noun), "light" (the adjective), "lightly" (the adverb) and various forms of "light" (the verb), such as light, lighted, lighting, lit, lights, to light, has been lighted, etc., are all verb-root words with the same verb root form "light," where the verb root form selected is typically the present-tense singular (infinitive) form of the verb.
"Generic words" refers to words in a natural-language passage that are not descriptive of, or only non-specifically descriptive of, the subject matter of the passage. Examples include prepositions, conjunctions, pronouns, as well as certain nouns, verbs, adverbs, and adjectives that occur frequently in passages from many different fields. "Non-generic words" are those words in a passage remaining after generic words are removed.
A "text identifier" or "TID" identifies a particular digitally encoded or processed text, e.g., a document in a database of texts, e.g., by a text number, i.e., a computer-readable alphanumeric code. A "database" refers to a database of tables containing information about texts and/or other text-related information. A database typically includes two or more tables, each containing locators by which information in one table can be used to access information in another table or tables.
The probabilities of finding the word Wm and the word Wn in a set of texts are expressed as P(Wm) and P(Wn), respectively. The probability of finding Wm in a text that also contains Wn is the conditional probability of finding Wm, given Wn, and is expressed as P(Wm|Wn). The conditional probability P(Wm|Wn) is calculated as P(Wm*Wn)/P(Wn), where P(Wm*Wn) is the probability of finding both words in a text in the set, and P(Wn) is the probability of finding Wn alone in a text in the set.
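The definition above can be illustrated numerically. The following is a minimal sketch, not part of the original disclosure; the toy texts, words, and function name are invented for the example, with each text modeled as a set of words.

```python
# Toy set of texts, each represented as the set of words it contains.
texts = [{"stent", "vascular"}, {"stent", "method"},
         {"vascular", "method"}, {"stent", "vascular", "method"}]

def cond_prob(wm, wn, texts):
    """P(Wm|Wn) = P(Wm*Wn)/P(Wn), computed from text counts: the number of
    texts containing both words divided by the number containing Wn."""
    n_both = sum(1 for t in texts if wm in t and wn in t)
    n_wn = sum(1 for t in texts if wn in t)
    return n_both / n_wn if n_wn else 0.0

# "stent" appears in 2 of the 3 texts that contain "vascular".
print(cond_prob("stent", "vascular", texts))  # 0.666...
```

Note that the shared factor (the total number of texts) cancels, so only the two counts are needed.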
B. System components
Fig. 1 shows the basic components of a system 40 for use in retrieving one or more stored texts. A computer or processor 42 in the system may be a standalone computer or a central computer or server that communicates with a user's personal computer. The computer has an input device 44, such as a keyboard, modem, and/or disc reader, by which the user can enter query words or a query statement, as will be described. A display or monitor 46 displays a user interface by which the user communicates with the system in retrieving one or more texts. Computer 42 in the system is typically one of many user terminal computers, each of which communicates with a central server or processor 41 on which the main program activity in the system takes place.
A database in the system, typically run on processor 41, includes a document-ID table 48, a text-ID table 50, and a word index of text records 52, as will be described below with respect to Figs. 3A-3C, respectively. The word index of text records is used in generating a conditional-probability matrix 54, as described below with respect to Figs. 4 and 7. The database also includes a database tool that operates on the server to access and act on information contained in the database tables, in accordance with the program steps described below. One exemplary database tool is the MySQL database tool, which can be accessed at www.mysql.com.
It will be appreciated that the assignment of various stored documents, databases, database tools and search modules, to be detailed below, to a user computer or a central server or central processing station is made on the basis of computer storage capacity and speed of operations, but may be modified without altering the basic functions and operations to be described.
Fig. 2 shows, in flow diagram form, the relationship between text-based documents, paragraph texts contained in the documents, and various database tables constructed from the texts. The collection of documents (box 60) refers to a library or collection of digitally stored documents, such as scientific or technical articles, or legal documents such as case-law decisions, opinions, briefs, patent applications, and patents. Each document is assigned a document identifier or document-ID or DID. In the embodiment illustrated, the documents are processed to yield the paragraphs making up the documents. These paragraphs form a collection of texts 62, each with an assigned text identifier (TID), such that some known group of TIDs is associated with each document identifier (DID). It will be understood that the texts themselves may be whole documents, paragraphs making up the documents, or statements or sentences within paragraphs.
The DIDs and associated TIDs are assembled into document-ID table 48. As seen in Fig. 3A, which shows representative table entries, the key locator for this table is a DID, such as DID1, and the information stored with each DID includes the associated TIDs, such as TIDa, TIDb, ... TIDn. Also as indicated in Fig. 2, each TID and its associated paragraph text and DID are assembled into text-ID table 50. Representative table entries for this database table are shown in Fig. 3B. The key locator for this database table is a TID, such as TID1, and the information stored with each TID is the associated paragraph text, e.g., text1, and the document ID, e.g., DID1, from which the paragraph was extracted. It can be appreciated that tables 48, 50 allow a system user to reconstruct selected portions of any document. For example, to generate a string of paragraphs containing a selected paragraph, the user can first access table 50 to identify the DID associated with the selected paragraph TID. Then, going to table 48, the user can readily identify a string of paragraphs in that document, identified by TIDs, that contains the selected paragraph at a known string position.
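The two-table lookup just described can be sketched as follows. This is an illustrative reconstruction only; the dict-based tables, identifiers, and paragraph strings are invented, and a real system would hold these in database tables (e.g., MySQL) rather than in memory.

```python
# Table 48 (Fig. 3A): DID -> ordered list of TIDs in that document.
doc_table = {"DID1": ["TIDa", "TIDb", "TIDc"]}

# Table 50 (Fig. 3B): TID -> (paragraph text, DID it came from).
text_table = {"TIDa": ("first paragraph ...", "DID1"),
              "TIDb": ("second paragraph ...", "DID1"),
              "TIDc": ("third paragraph ...", "DID1")}

def surrounding_paragraphs(tid, window=1):
    """Reconstruct a string of paragraphs around a selected paragraph:
    table 50 gives the DID, table 48 gives the TID string, and the
    selected TID's position fixes the window of neighbors."""
    _, did = text_table[tid]
    tids = doc_table[did]
    i = tids.index(tid)
    return [text_table[t][0] for t in tids[max(0, i - window):i + window + 1]]

print(surrounding_paragraphs("TIDb"))  # all three paragraphs, in order
```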
With reference again to Fig. 2, the non-generic words in the paragraph texts are extracted at 62 and the associated TIDs are processed, as will be described with reference to Fig. 6, to generate a word index of text records 52. This table contains a list of all of the non-generic words (the key locators) contained in the paragraph records and, for each word, a list of all TIDs containing that word (the text record for each word), as shown in Fig. 3C. Thus, in the table entries shown in this figure, word2 is contained in records TIDm, TIDn, ... TIDy.
The word index just described, which is generated from the collection of texts to be searched, may be used in generating a conditional-probability matrix 54, according to the method described below with reference to Fig. 7. Alternatively, where the values for the conditional-probability matrix are generated from a set of texts other than the collection of texts, the set is first used to generate an index of word records, as just described. In one general embodiment, the collection of texts to be searched is a library of texts that expands over time, as new texts appropriate for that collection become available, whereas the set of texts used in generating conditional-probability values is fixed in time, typically representing some subset of the collection of documents. Accordingly, the conditional probabilities, once calculated, do not have to be recalculated as new texts become available, or need be recalculated only occasionally.
The conditional-probability matrix, a portion of which is shown at 54 in Fig. 4, is an N x N matrix of N row words, such as words W1, W2, W3, where the value of each matrix entry for a pair of words Wm, Wn is the conditional probability of finding word Wm, given word Wn, in the set of texts used in generating the probability values. The conditional probability P(Wm|Wn) is defined as P(Wm*Wn)/P(Wn), that is, the probability of finding both words Wm and Wn in the same text in the set divided by the probability of finding Wn alone. As will be described further below, this can be calculated as the number of texts in the set containing both Wm and Wn, divided by the total text occurrence of Wn in the same set of texts. Typically, the diagonal values P(Wn|Wn) are set to zero. The matrix is non-symmetrical, meaning that P(Wm|Wn) is typically not equal to P(Wn|Wm). C. Overview of system operation
Fig. 5 is an overview of the method of the invention for retrieving a text contained in a collection of texts. Initially, the user enters a search query (box 60) consisting of a plurality of query words Q. The query may be a simple list of query words, or a phrase or sentence, or even a full paragraph. Although not shown here, the program processes the query to remove generic words, and may further process the query to convert verb-root words into their common roots and standardize verb tense and plural endings, as will be described below. The net result is to generate a group of non-generic query words, indicated as Wqm, where m = 1 to N, the total number of query words in the search. With the counter for m set to 1, in box 64, the program accesses conditional-probability matrix 54 (box 62) to retrieve the matrix row for Wqm, e.g., Wq1. For example, with respect to Fig. 4, if Wq1 is W3 in Fig. 4, the program will retrieve the row for W3, containing the conditional probabilities P(W3|W1), P(W3|W2), P(W3|W3), ... P(W3|WN). The retrieved matrix row is then scanned to find the conditional-probability values P(Wqm|Wqn), for all query words m≠n, where Wqm is the query word of interest, e.g., W3, and Wqn are all other query words. This operation is indicated at 66.
In accordance with the method, it is desired to calculate query-word coefficients that tend to equalize the probabilities of finding any one query word, given all other query words. In one embodiment, indicated at 68, this is done by first calculating, for each query word Wqm, the sum of all conditional-probability values P(Wqm|Wqn), m≠n. This value represents the total probability of finding Wqm, given each of the other query words Wqn. By setting the coefficient of Wqm equal to the inverse of this sum, or to a value related to the inverse of the sum, the tendency of retrieved texts to cluster around a group of query words with high internal conditional probabilities will be offset by a correspondingly low coefficient value assigned to each of the "clustering" query words. Viewed another way, the coefficient assigned to each query word is calculated to offset the contribution that query word is expected to make to the total search score, based solely on expected conditional probabilities with all other query words. These properties can be appreciated from the coefficients assigned to the two different word queries illustrated in the Example below.
This process is repeated, through the logic of 70, 72, until all query words have been considered and assigned a coefficient. It will be appreciated that certain operations in this loop, such as fetching a matrix row for each query word (box 62) may be carried out simultaneously for all query words, to make the program operation more efficient. In a preferred embodiment, the coefficients assigned are the actual inverse sum values calculated as above. In another embodiment, the coefficients are related to this value, e.g., are calculated as the square or cube root of these inverse sum values.
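The coefficient computation described above can be sketched in a few lines. This is an illustrative reconstruction, not the patented implementation; the affinity values, word list, and function name are invented for the example, and the affinity mapping stands in for the matrix rows fetched at box 62.

```python
def query_word_coefficients(query_words, affinity):
    """For each query word Wqm, sum P(Wqm|Wqn) over all other query words
    Wqn and take the inverse of the sum as the coefficient.
    affinity[(wm, wn)] -> P(Wm|Wn); missing pairs default to 0."""
    coeffs = {}
    for wm in query_words:
        total = sum(affinity.get((wm, wn), 0.0)
                    for wn in query_words if wn != wm)
        coeffs[wm] = 1.0 / total if total else 0.0
    return coeffs

# Invented conditional probabilities for three query words.
affinity = {("stent", "vascular"): 0.4, ("vascular", "stent"): 0.3,
            ("stent", "method"): 0.1, ("method", "stent"): 0.2,
            ("vascular", "method"): 0.05, ("method", "vascular"): 0.1}
print(query_word_coefficients(["stent", "vascular", "method"], affinity))
```

Here "stent" gets coefficient 1/(0.4+0.1) = 2.0; a word with higher conditional probabilities against the other query words would receive a correspondingly lower coefficient.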
Once all of the query-word coefficients have been calculated, a search vector is constructed, at 74. Each term of the vector is one of the query words, and the coefficient for that term is the associated coefficient. That is, the search vector may be represented as C1Wq1 + C2Wq2 + C3Wq3 + ... + CkWqk, for the k query words, where Ci is the coefficient calculated for each query word Wqi. In terms of program operation, this vector may be a simple list of all query words and their associated coefficients. This vector represents one aspect of the invention.
To retrieve one or more records of interest, the program initializes the term (query word) in the vector to 1, at 78, and finds all texts in the collection of texts containing that query word Wq1. This is done by accessing the word index of records 52 to identify all texts containing word Wq1, as shown at box 76. At each step, all TIDs containing the word are assigned a text-score value equal to the word coefficient (which adjusts the text-score value for the expected probability of finding that word in a text). After this search step is completed for all query words, through the logic of 80, 82, the corrected word scores for each TID are summed to yield a total text-value score for each text, and the TIDs are ranked by total score, at 84. The top-ranked TIDs are the one or more TIDs of interest. The search method is detailed below with reference to Fig. 8.
D. Generating a word-records table and conditional probability matrix
As noted above, the program uses the non-generic words contained in the collection of texts to generate a word-records table 52. This table is essentially a dictionary of non-generic words, where each word has associated with it each TID containing that word and, optionally, for each TID, an associated DID.
In forming the word-records file, and with reference to Fig. 6, the program creates an empty ordered list 52. With the text ID t initialized to 1, at 86, the program accesses the collection of texts, e.g., a collection of individual paragraphs from a library of documents, to retrieve the text and associated identifier(s) (e.g., TID and DID) for that text. The text is processed to (i) remove generic words, and (ii) simplify verb-root words to their common root, yielding a list of non-generic words to be indexed. The program may optionally add synonyms, such as obtained from a separate synonym dictionary, to one or more of the words. Details of text-processing steps for use in removing generic words and identifying verb-root words can be found, for example, in co-owned U.S. patent application 20040054520 A1, for "Text-Searching Code, System, and Method," published March 18, 2004, which is incorporated herein in its entirety.
For each text t, the word number is initialized at w=1, at 90, and the program retrieves that word from the word list generated above for the text (box 92). For that word w, the program asks, at 94: is word w already in the word-records table? If it is, the text identifier (associated TID) for text t is added to word-records table 52 for that word w, at 96. If not, a new word entry is created in table 52, at 98, along with the associated TID. This process is repeated, through the logic of 100, 102, until all of the non-generic words in text t have been added to the table. Once a text has been processed, the program advances, through the logic of 104, 106, until all texts t in the collection of texts have been processed and added to the word-records table, completing the processing steps at 108.
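The Fig. 6 loop amounts to building an inverted index. The following condensed sketch is illustrative only: the stop-word list, sample texts, and function name are invented, and the verb-root simplification and synonym steps are omitted.

```python
# Placeholder stop-word list standing in for the generic-word removal step.
GENERIC = {"a", "the", "of", "and", "is"}

def build_word_index(texts_by_tid):
    """Map each non-generic word to the list of TIDs containing it
    (word-records table 52, Fig. 3C)."""
    index = {}
    for tid, text in texts_by_tid.items():
        for word in set(text.lower().split()) - GENERIC:
            # setdefault plays the role of boxes 94/96/98: append the TID
            # if the word exists, else create a new word entry.
            index.setdefault(word, []).append(tid)
    return index

index = build_word_index({"TID1": "the stent is expandable",
                          "TID2": "a vascular stent"})
print(sorted(index["stent"]))  # ['TID1', 'TID2']
```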
As noted with respect to Fig. 4, conditional-probability matrix 54 is an N x N matrix of all N non-generic words in a set of texts, where the value of each matrix entry for a pair of words Wm, Wn is the conditional probability of finding word Wm, given word Wn, in the set of texts used in generating the probability values. Each conditional-probability value in the matrix, P(Wm|Wn), can be expressed as P(Wm*Wn)/P(Wn), where P(Wm*Wn) is the probability of finding both words in a text in the set and P(Wn) is the probability of finding Wn alone in a text in the set. The probability of finding both words in the set, P(Wm*Wn), is the number of texts containing both words, divided by the total number of texts in the set. The probability P(Wn) of finding Wn in the set is the number of texts containing that word, divided by the total number of texts in the set. Therefore, canceling the common factor in both numerator and denominator (the total number of texts in the set), the value for P(Wm|Wn) can be calculated simply as the number of texts containing both Wm and Wn, divided by the number of texts containing Wn. An algorithm for calculating these values for all pairs Wm, Wn is given in Fig. 7. In this scheme, Wi and Wj indicate the word pair for the conditional probability P(Wi|Wj). With i initialized at i=1, at 110, the program selects word Wi, at 112, then retrieves all TIDs for that word, at 114, by accessing word index 52; these TIDs are stored at 135. Next, j is initialized at j=1, and word Wj is selected at 118. Except in the case where Wi=Wj, that is, a conditional-probability value on the diagonal of the matrix (where P(Wj|Wj) is one), the program advances, through the logic of 120, 122, to the next Wj, and accesses word index 52 to retrieve all TIDs for that word, at 124. The retrieved TIDs for Wj are now used to determine the number of texts containing word Wj, at 126, simply by counting the number of TIDs retrieved for that word. In box 128, the TIDs retrieved for Wj are compared with those retrieved for Wi, to find the total number of TIDs containing both words, i.e., the number of TIDs common to both words. From these two values, the conditional probability P(Wi|Wj) is calculated as the total number of TIDs containing both words divided by the total number of TIDs containing Wj, at box 130. This value is then added to the empty or partially filled matrix 54.
For each Wi, this process is repeated for all Wj, through the logic of 132, 122, until the conditional probabilities P(Wi|Wn), n=1 to N, have been determined, completing the values for the Wi row of the matrix, where the diagonal value may be set to zero. Each new word Wi is similarly processed, through the logic of 134, 136, until all rows of the matrix have been completed, indicated at 138. E. Text searching and ranking
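The Fig. 7 computation can be sketched directly from a word index. This is a minimal illustrative reconstruction (names and data are invented): each entry P(Wi|Wj) is the number of TIDs common to both words divided by the number of TIDs containing Wj, with diagonal entries set to zero.

```python
def build_cond_prob_matrix(index):
    """index: word -> list of TIDs. Returns {(wi, wj): P(Wi|Wj)}."""
    tid_sets = {w: set(tids) for w, tids in index.items()}
    matrix = {}
    for wi, si in tid_sets.items():
        for wj, sj in tid_sets.items():
            # Diagonal set to zero; off-diagonal = |both| / |texts with Wj|.
            matrix[(wi, wj)] = 0.0 if wi == wj else len(si & sj) / len(sj)
    return matrix

m = build_cond_prob_matrix({"stent": ["TID1", "TID2"], "vascular": ["TID2"]})
print(m[("stent", "vascular")], m[("vascular", "stent")])  # 1.0 0.5
```

The example output also illustrates the non-symmetry noted above: P(stent|vascular) = 1.0 while P(vascular|stent) = 0.5.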
The text-searching module in the system, illustrated in Fig. 8, operates to find texts having the greatest term overlap with the search-vector terms, where the value of each vector term is weighted by the associated term coefficient. An empty ordered list of TIDs, shown at 142 in the figure, will store the accumulating match-score values for each TID associated with the vector terms. The program initializes the vector term (query word w) to 1, in box 144, and retrieves the first query word from query-word list 140. The TIDs associated with this word are then retrieved, at 146, from the word index 52. With the TID count set at 1 (box 150), the program gets the first retrieved TID, and asks, at 152: Is this TID already present in list 142? If it is not, the TID and the term coefficient are added to list 142, as indicated at 143, creating the first of one or more coefficients which may accumulate for that TID. Although not shown here, the program also orders the TIDs in list 142 numerically, to facilitate searching for TIDs in the list.
If the TID is already present in the list, the coefficient is added to the summed coefficients for that TID, as indicated at 154. This process is repeated, through the logic of 156, 158, until all of the TIDs for a given word have been considered and added to list 142. Each word in the search vector is processed in this way, through the logic of 160, 162, until each of the vector terms (query words) has been considered. List 142 now consists of an ordered list of TIDs, each with an accumulated match score representing the sum of coefficients of terms associated with that TID. These TIDs are then ranked at 164, according to a standard ordering algorithm, to yield an output of the top N match scores, e.g., the 10 or 20 highest-ranked texts, identified by TID.
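The Fig. 8 scoring and ranking steps can be sketched as follows; this is an illustrative reconstruction with invented identifiers and coefficients, using a dict in place of ordered list 142.

```python
def rank_texts(search_vector, index, top_n=10):
    """search_vector: list of (query word, coefficient) terms.
    index: word -> list of TIDs. Accumulate each TID's score as the sum
    of the coefficients of the query words it contains, then rank."""
    scores = {}
    for word, coeff in search_vector:
        for tid in index.get(word, []):
            scores[tid] = scores.get(tid, 0.0) + coeff
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

index = {"stent": ["TID1", "TID2"], "vascular": ["TID2"], "method": ["TID3"]}
vector = [("stent", 2.1), ("vascular", 3.6), ("method", 0.5)]
print(rank_texts(vector, index))  # TID2 ranks first (it matches two terms)
```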
From the foregoing, it can be appreciated how various objects and features of the invention have been met. In constructing a search vector, the program calculates a coefficient for each term (query word) that offsets the tendency of certain groups of query words to "self-aggregate" in the retrieved documents, due to high internal conditional probabilities among those words. This results in the highest-ranking texts having a broader representation of all of the query words, because the score calculated for each text will be skewed more toward those query words that have low conditional probabilities with respect to the other query words.
Example
Word-query coefficient calculations. A set of all patent abstracts from the U.S. patent database, 1975-2002, from the following surgery-related classes (U.S. patent classification) was used as the set of texts in this first example: classes 128, 351, 378, 433, 600, 601, 602, 604, 606, 623. A word index and conditional-probability matrix were generated as above. A word query in the surgical class contained the words: implantable, stent, insertable, sleeve, region, method, vascular, constriction, and expandable. For each of the nine query words, the conditional probabilities of that word with respect to the other eight query words were determined from the probability matrix. These values are given in Table 1 below, along with the calculated coefficient for each word. As seen, more generic query words such as "insert," "method," and "expand," all of which would be expected to self-aggregate, have relatively low-valued coefficients, while more distinctive words, such as "stent," "vascular," and "constrict," which would be expected to be less self-aggregating with respect to the eight other query words, have substantially higher coefficients.
Table 1 (each row lists, for one query word, the eight conditional probabilities of that word given each of the other query words, followed by the calculated coefficient and the word root)
0.225305 0.10628 0.101001 0.086122 0.114427 0.148171 0.087912 0.147713 0.983351 implant
0.044566 0.02024 0.030937 0.028707 0.024465 0.102144 0.054945 0.170375 2.099166 stent
0.163117 0.157048 0.252351 0.130371 0.150131 0.17024 0.197802 0.25752 0.676325 insert
0.036553 0.056604 0.059505 0.03273 0.020347 0.027743 0.074725 0.05789 2.731514 sleeve
0.051701 0.087125 0.050994 0.054292 0.068114 0.081967 0.107692 0.074166 1.735956 region
0.259276 0.280244 0.221642 0.127389 0.257085 0.328499 0.224176 0.228883 0.518889 method
0.025796 0.0899 0.019311 0.013345 0.02377 0.02524 0.030769 0.049856 3.597289 vascular
0.004391 0.013873 0.006437 0.010312 0.00896 0.004941 0.008827 0.017511 13.2885 constrict
0.078705 0.458935 0.089401 0.085229 0.065826 0.053822 0.152585 0.186813 0.853741 expand
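The coefficients in Table 1 can be checked against the stated rule: each word's coefficient is the inverse of the sum of its eight conditional probabilities. Below, the "implant" and "stent" rows are copied from the table; small differences in the last digits are rounding in the tabulated values.

```python
# Conditional-probability entries for the "implant" and "stent" rows of Table 1.
implant = [0.225305, 0.10628, 0.101001, 0.086122,
           0.114427, 0.148171, 0.087912, 0.147713]
stent = [0.044566, 0.02024, 0.030937, 0.028707,
         0.024465, 0.102144, 0.054945, 0.170375]

# Coefficient = inverse of the row sum.
print(round(1 / sum(implant), 6))  # ≈ 0.983351 (Table 1 lists 0.983351)
print(round(1 / sum(stent), 6))    # ≈ 2.09917 (Table 1 lists 2.099166)
```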
In a second example, a set of all patent abstracts from the U.S. patent database, 1975-2002, from the computer-related classes (U.S. patent classification) was used as the set of texts: classes 345, 360, 365, 369, 382, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 716, 717, 725. A word index and conditional-probability matrix were generated as above. A word query in the computer class included the words: wireless, communication, architecture, virtual, channel, memory, controller, synchronous, data, and bus. For each of the ten query words, the conditional probabilities of that word with respect to the other nine query words were determined from the probability matrix. These values are given in Table 2 below, along with the calculated coefficient for each word. As seen, more generic query words such as "communicate," "memory," "controller," and "data," all of which would be expected to self-aggregate, have relatively low-valued coefficients, while more distinctive words, such as "wireless," "architecture," and "virtual," which would be expected to be less self-aggregating, have substantially higher coefficients.
Table 2 (each row lists, for one query word, the nine conditional probabilities of that word given each of the other query words, followed by the calculated coefficient and the word root)
0.043767 0.007974 0.006977 0.012751 0.002094 0.006024 0.004944 0.006727 0.002374 10.68006 wireless
0.576563 0.132908 0.130698 0.169289 0.049099 0.079333 0.053672 0.093817 0.168942 0.687606 communicate
0.023438 0.029652 0.045581 0.021704 0.027824 0.015847 0.024011 0.02061 0.054797 3.795577 architecture
0.023438 0.033329 0.0521 0.022246 0.022239 0.017091 0.008475 0.019684 0.020178 4.570789 virtual
0.073437 0.074013 0.042531 0.03814 0.033508 0.03893 0.056497 0.037563 0.039169 2.305274 channel
0.098438 0.175187 0.444976 0.311163 0.273467 0.306169 0.492938 0.338269 0.510188 0.338892 memory
0.2875 0.287392 0.25731 0.242791 0.322572 0.31085 0.382062 0.272218 0.365776 0.366506 controller
0.010938 0.009014 0.018075 0.005581 0.021704 0.023203 0.017713 0.017179 0.029674 6.532475 synchronous
0.465625 0.492943 0.48538 0.405581 0.451438 0.498138 0.394833 0.537429 0.623937 0.229605 data
0.01875 0.101293 0.147262 0.047442 0.053717 0.085732 0.06054 0.105932 0.071198 1.445367 bus
While the invention has been described with respect to particular embodiments and applications, it will be appreciated that various changes and modifications may be made without departing from the spirit of the invention.

Claims

IT IS CLAIMED:
1. A computer-assisted method for retrieving one or more selected texts from a collection of texts, comprising: (a) for each of a plurality of input query words Q, determining the conditional probabilities of finding that word, given each of the other query words, in a set of texts,
(b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words,
(c) for each query word, searching a word index of said collection of texts to locate texts containing that query word,
(d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and
(e) identifying those one or more texts having the highest summed weighting factors.
2. The method of claim 1, wherein said collection of texts is substantially overlapping with said set of texts.
3. The method of claim 1, wherein step (a) includes accessing a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, to determine for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn.
4. The method of claim 3, wherein step (b) includes, for each query word Wqm, calculating a coefficient related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
5. The method of claim 1, wherein the collection of texts includes full-length documents.
6. The method of claim 1, wherein the collection of texts includes paragraphs from a library of documents.
7. The method of claim 1, wherein said collection of texts includes statements associated with bibliographic citations in a library of citation-rich documents.
8. Computer-readable code for use with an electronic computer for retrieving one or more selected texts from a collection of texts, said code being operable on said computer to carry out the steps comprising:
(a) for each of a plurality of input query words Q, determining the conditional probabilities of finding that word, given each of the other query words, in a set of texts,
(b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words,
(c) for each query word, searching a word index of said collection of texts to locate texts containing that query word,
(d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and
(e) identifying those one or more texts having the highest summed weighting factors.
9. The code of claim 8, which is operable, in carrying out step (a), to access a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn, and to determine for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn, and in carrying out step (b), to calculate, for each query word Wqm, a coefficient related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
10. A system for use in retrieving one or more selected texts from a collection of texts, said system comprising:
(1) a computer,
(2) accessible by said computer, a word-affinity matrix that contains, for each of a plurality of words Wm, and each of a plurality of word pairs Wm, Wn, the conditional probability P(Wm|Wn) of finding word Wm in a text containing word Wn,
(3) operatively connected to said computer, a user input device,
(4) computer readable code that operates on said computer to:
(a) access said word-affinity matrix, to determine, for each of a plurality of input query words Q, the conditional probabilities of finding that word, given each of the other query words, in a set of texts,
(b) using the conditional-probability values determined in (a) to calculate, for each query word, a query-word coefficient related to the inverse of the sum of the conditional probabilities of finding that query word, given each of the other query words,
(c) for each query word, searching a word index of said collection of texts to locate texts containing that query word,
(d) assigning to each of the texts located in step (c), and for each query word, a weighting factor related to the coefficient calculated in step (b) for that query word, and
(e) identifying those one or more texts having the highest summed weighting factors.
11. The system of claim 10, wherein said code operates, in carrying out step (a), to determine for each query word Wqm, the conditional probabilities P(Wqm|Wqn) for all other query words Wqn, and in carrying out step (b), to calculate, for each query word Wqm, a coefficient related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn.
12. A search vector representing a multi-word search query, comprising a plurality of vector terms, each term containing a query word and a coefficient for that query word Wqm related to the inverse of the sum of all P(Wqm|Wqn), for all other query words Wqn, where P(Wm|Wn) is the conditional probability of finding word Wm in a text containing word Wn, within a collection of texts.
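The claimed method (steps (a) through (e)) can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: `cond_prob`, a dictionary standing in for the word-affinity matrix P(Wm|Wn); `word_index`, a mapping from words to the IDs of texts containing them; and a fallback coefficient of 1.0 for query words with no recorded co-occurrence data are all choices made for the sketch.

```python
from collections import defaultdict

def rank_texts(query_words, cond_prob, word_index):
    """Rank texts by summed query-word weighting factors.

    cond_prob[(wm, wn)] is an assumed stand-in for the word-affinity
    matrix entry P(wm|wn): the conditional probability of finding word
    wm in a text that contains word wn.
    word_index maps each word to the set of IDs of texts containing it.
    """
    # Steps (a)-(b): each query word's coefficient is the inverse of
    # the sum of its conditional probabilities given the other query
    # words; words with no co-occurrence data default to 1.0.
    coeff = {}
    for wq in query_words:
        total = sum(cond_prob.get((wq, wn), 0.0)
                    for wn in query_words if wn != wq)
        coeff[wq] = 1.0 / total if total > 0 else 1.0

    # Steps (c)-(d): locate texts containing each query word and
    # accumulate that word's coefficient as a weighting factor.
    scores = defaultdict(float)
    for wq in query_words:
        for text_id in word_index.get(wq, ()):
            scores[text_id] += coeff[wq]

    # Step (e): return text IDs ordered by highest summed weight.
    return sorted(scores, key=scores.get, reverse=True)
```

Note the effect of the inverse relationship: a query word that co-occurs strongly with the other query words receives a small coefficient, while a distinctive, weakly co-occurring word is up-weighted, so texts matching the more discriminating terms rank higher.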
PCT/US2006/045397 2005-11-22 2006-11-22 Method, system and code for retrieving texts WO2007062215A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US73927205P 2005-11-22 2005-11-22
US60/739,272 2005-11-22

Publications (2)

Publication Number Publication Date
WO2007062215A2 true WO2007062215A2 (en) 2007-05-31
WO2007062215A3 WO2007062215A3 (en) 2007-12-13

Family

ID=38067955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/045397 WO2007062215A2 (en) 2005-11-22 2006-11-22 Method, system and code for retrieving texts

Country Status (1)

Country Link
WO (1) WO2007062215A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050120011A1 (en) * 2003-11-26 2005-06-02 Word Data Corp. Code, method, and system for manipulating texts

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615723A (en) * 2015-02-06 2015-05-13 百度在线网络技术(北京)有限公司 Determining method and device of search term weight value
CN104615723B (en) * 2015-02-06 2018-08-07 百度在线网络技术(北京)有限公司 The determination method and apparatus of query word weighted value

Also Published As

Publication number Publication date
WO2007062215A3 (en) 2007-12-13

Similar Documents

Publication Publication Date Title
EP0965089B1 (en) Information retrieval utilizing semantic representation of text
US6947920B2 (en) Method and system for response time optimization of data query rankings and retrieval
US7814099B2 (en) Method for ranking and sorting electronic documents in a search result list based on relevance
CN103136352B (en) Text retrieval system based on double-deck semantic analysis
Ide et al. Essie: a concept-based search engine for structured biomedical text
USRE36727E (en) Method of indexing and retrieval of electronically-stored documents
US7483892B1 (en) Method and system for optimally searching a document database using a representative semantic space
US8346795B2 (en) System and method for guiding entity-based searching
Chen et al. Multilingual information retrieval using machine translation, relevance feedback and decompounding
Sarkar Automatic single document text summarization using key concepts in documents
JP4426041B2 (en) Information retrieval method by category factor
JP2008117351A (en) Search system
Li et al. Complex query recognition based on dynamic learning mechanism
JP4888677B2 (en) Document search system
Ramani et al. An Explorative Study on Extractive Text Summarization through k-means, LSA, and TextRank
WO2007062215A2 (en) Method, system and code for retrieving texts
Pinto et al. Joining automatic query expansion based on thesaurus and word sense disambiguation using WordNet
Gupta et al. A review on important aspects of information retrieval
Lin et al. Biological question answering with syntactic and semantic feature matching and an improved mean reciprocal ranking measurement
Sharma et al. Improved stemming approach used for text processing in information retrieval system
Strzalkowski Document representation in natural language text retrieval
Azad et al. Query expansion for improving web search
Yoshioka et al. On a combination of probabilistic and Boolean IR models for WWW document retrieval
Liddy Document retrieval, automatic
Hussey et al. A comparison of automated keyphrase extraction techniques and of automatic evaluation vs. human evaluation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06838392

Country of ref document: EP

Kind code of ref document: A2