EP1072001A1 - Similarity searching by combination of different data-types - Google Patents
- Publication number
- EP1072001A1 (application number EP00903814A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- elements
- document
- data type
- data types
- searching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5854—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
Definitions
- The present invention relates to a method and means for searching to find similar documents in response to a query.
- The invention is particularly relevant to the use of one document as a query for a search to obtain similar documents.
- Similarity searching in databases of electronically stored documents is an important area of practical application. Such searching is well known for text. Typically, the input for such searching is a text string; the engine then matches database entries against the text string and returns those entries that meet an acceptable similarity threshold. Similar searching is available for images: an example is the IBM Corporation QBIC (Query by Image Content) package, described at and available from http://wwwqbic.almaden.ibm.com/.
- This structural information is then used to allow user searching and text indexing in chosen functional elements of the document.
- This mechanism is particularly useful for making the problem of text searching in complex documents more tractable - it is not, however, effective to allow searching for documents which are as a whole similar to a query document.
- The invention provides a method of searching a database to find documents similar to a query document, comprising: decomposing the query document into elements of different data types; for one or more of the elements in a first data type, conducting a first data type similarity search to return match results from the database for the one or more elements in the first data type; for one or more of the elements in a second data type, conducting a second data type similarity search to return match results from the database for the one or more elements in the second data type; and combining the match results from the first data type similarity search and the second data type similarity search to provide query document match results.
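- The claimed steps can be pictured with a minimal Python sketch. It assumes each element is a hypothetical `(data_type, content)` pair and each search engine is a callable returning `{doc_id: score}` with scores already in [0, 1]; the names `search_similar`, `toy_text_search` and `toy_image_search`, and the plain averaging used to combine results, are illustrative assumptions rather than anything specified in the source.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# An element is (data_type, content); the tuple shape and the searcher
# signature are assumptions made for this sketch.
Element = Tuple[str, str]
Searcher = Callable[[str], Dict[str, float]]  # content -> {doc_id: score in [0, 1]}


def search_similar(elements: List[Element],
                   searchers: Dict[str, Searcher]) -> List[Tuple[str, float]]:
    """Run one similarity search per element and merge the match results."""
    totals: Dict[str, float] = defaultdict(float)
    searched = 0
    for data_type, content in elements:
        searcher = searchers.get(data_type)
        if searcher is None:
            continue  # no search engine registered for this data type
        searched += 1
        for doc_id, score in searcher(content).items():
            totals[doc_id] += score
    if searched == 0:
        return []
    # Average over the searched elements and list the best match first.
    merged = [(doc_id, total / searched) for doc_id, total in totals.items()]
    return sorted(merged, key=lambda item: item[1], reverse=True)


# Toy stand-ins for a text engine and an image engine (hypothetical data).
def toy_text_search(content: str) -> Dict[str, float]:
    return {"doc_a": 0.9, "doc_b": 0.4}


def toy_image_search(content: str) -> Dict[str, float]:
    return {"doc_a": 0.5, "doc_c": 0.8}


query = [("text", "some OCR text"), ("picture", "pixels.png")]
results = search_similar(query, {"text": toy_text_search,
                                 "picture": toy_image_search})
```

- A document that matches on only one data type is penalised here because its missing score counts as zero; other combination rules are equally possible.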
- Results from each query document match may be combined to allow progressive refinement of queries using any of the data types either singly or in further combination.
- The invention provides a method of searching a database to find documents similar to a query document, comprising: decomposing the query document into elements of different data types; determining a layout element in a layout data type from the spatial arrangement of the elements in the document; and, for the layout element, conducting a layout similarity search to return match results from the database for the layout element.
- Figure 1 shows a typical document page containing different data types;
- Figure 2 shows steps in a method according to an embodiment of a first aspect of the invention for conducting a similarity search for the document shown in Figure 1;
- Figure 3 shows the representation of the document shown in Figure 1 as a layout of data types, and indicates a search step usable in a further embodiment of the method of the invention;
- Figure 4 shows steps in a method according to an embodiment of the second aspect of the invention for conducting a similarity search for layout information.
- A typical document contains a plurality of data types.
- The most basic data types are text and images.
- Document 1 shown in Figure 1 contains a text block 12 - this text block is data in a first data type.
- Document 1 also contains two different kinds of image.
- One kind, image block 13, is a photographic image, typically consisting of an array of pixels in which each pixel has a colour value.
- The other kind, line art block 11, is also an image but a "drawn" one, readily representable as a combination of geometric or formulaic elements and, as such, typically readily scalable.
- Photographic images and line art images (hereafter "pictures" and "graphics") respond differently to different image processing and analysis techniques, and are most effectively treated as different data types.
- The document 1 is selected in step 21.
- This could be achieved through any appropriate application capable of supporting the file type or file types of the document.
- For a physical document this could be achieved by scanning the document using a scanner.
- In step 22 the document is decomposed into separate elements: in the case of document 1, these elements are graphic block 11, text block 12, and picture block 13.
- For text block 12 it is desirable for optical character recognition to be carried out at this point so that the text block element resulting from decomposition consists of ASCII text.
- Decomposition of the document is achieved by an analysis and recognition process through which the different parts of the document are recognised as being text, pictures or graphics. Decomposition of a document into separate data types in this way is known, using for example techniques identified in "Block Segmentation and Text Extraction in Mixed Text/Image Documents" by FM Wahl, KY Wong and RG Casey, Computer Graphics and Image Processing, Vol. 20 (1982) (a further example is provided in US Patent No. 6,002,798).
- Software adapted for use with proprietary scanners to decompose the elements of a scanned page into separate data types (in order to optimise the scanning process for each data type) is provided by Hewlett-Packard Company as "HP PrecisionScan".
- The output of HP PrecisionScan is a set of elements each in a single data type, each of which can be selected for further processing.
- The result of decomposition is a set of elements, each element having a single data type. For a particular data type, such as text, either all text is determined to be part of a single element, or physically distinct areas of text are considered as separate elements, depending on how the decomposition is carried out.
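- The two granularities mentioned above (one element per physically distinct text area, or a single element holding all text) can be illustrated with a short Python sketch. The `Element` record, its bounding-box convention and the `merge_text` helper are hypothetical names introduced for illustration; they do not describe the output format of HP PrecisionScan or of any other decomposition tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    data_type: str                   # "text", "picture" or "graphic"
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom), assumed page units
    content: str                     # OCR text, or a reference to image data


def merge_text(elements: List[Element]) -> List[Element]:
    """Collapse all text elements into one, keeping other elements unchanged."""
    texts = [e for e in elements if e.data_type == "text"]
    others = [e for e in elements if e.data_type != "text"]
    if not texts:
        return others
    # Assumed reading order: top to bottom, then left to right.
    texts.sort(key=lambda e: (e.bbox[1], e.bbox[0]))
    merged = Element(
        data_type="text",
        bbox=(min(e.bbox[0] for e in texts), min(e.bbox[1] for e in texts),
              max(e.bbox[2] for e in texts), max(e.bbox[3] for e in texts)),
        content="\n".join(e.content for e in texts),
    )
    return others + [merged]


page = [
    Element("text", (0, 300, 200, 400), "second paragraph"),
    Element("picture", (0, 150, 200, 290), "photo-1"),
    Element("text", (0, 0, 200, 140), "first paragraph"),
]
merged_page = merge_text(page)
```

- Keeping text areas separate preserves their positions for later layout analysis, while merging gives the text search a single, larger query.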
- In some versions, all the elements of the document are used in similarity searching; in other versions, one or more of the elements are selected for use in similarity searching (or the user is even allowed an opportunity to select part of an element for such further processing).
- Separate elements are then used in similarity searching 23, 24 against a database, for example a database representing content available on the World Wide Web.
- Inxight Summarizer is a software component technology that summarises a document by extracting key sentences from the document. This is the preconditioning step 23. These summaries can then be matched against each other in the matching step 24. Inxight Summarizer generates indicative summaries that contain key sentence elements from a document. The essence of the text is extracted by stemming and text normalisation technology to obtain a concise and canonical synopsis of the text. "Stemming" is the replacement of a word by its root and part-of-speech (e.g. "I had wanted" -> "to want/first person/pluperfect"), whereas "normalisation" involves replacement of one of several forms with a "concept" (e.g.
- The matching step 24 can then be carried out on the stemmed and normalised results of the preconditioning step 23 with confidence that text content which is genuinely similar will be matched without adverse influence from unwanted syntax considerations.
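- As a rough illustration of preconditioning followed by matching, the Python sketch below uses a crude suffix stripper and a tiny hand-written concept table in place of real stemming and normalisation technology, and a Jaccard overlap as the matching score. None of this reflects how Inxight Summarizer works internally; `CONCEPTS`, `crude_stem`, `precondition` and `match` are hypothetical names chosen for the example.

```python
import re
from typing import Dict, Set

# Hypothetical concept table: several surface forms map to one concept.
CONCEPTS: Dict[str, str] = {"automobile": "car", "cars": "car", "auto": "car"}
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def precondition(text: str) -> Set[str]:
    """Step 23: lower-case, tokenise, map to concepts, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(CONCEPTS.get(token, token)) for token in tokens}


def match(query: Set[str], candidate: Set[str]) -> float:
    """Step 24: Jaccard overlap of the two preconditioned term sets."""
    if not query and not candidate:
        return 0.0
    return len(query & candidate) / len(query | candidate)


a = precondition("Searching automobile")
b = precondition("searched cars")
c = precondition("painting houses")
```

- Here `a` and `b` reduce to the same term set even though they share no surface word, which is the effect the preconditioning step is meant to achieve.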
- An example of an image searching tool is the IBM QBIC package, as indicated above.
- QBIC is further described at http://wwwqbic.almaden.ibm.com/.
- This package is adapted to precondition the images by analysing for a number of different criteria, such as colour percentages, colour layout, and textures occurring in the images. These criteria are then used in combination in a matching step 24.
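- One of the criteria named above, colour percentages, can be sketched in Python as a coarse colour histogram compared by histogram intersection. This is a generic technique shown under stated assumptions (pixels as RGB tuples, four quantisation levels per channel); it is not the QBIC implementation, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255
Histogram = Dict[Tuple[int, int, int], float]


def colour_percentages(pixels: List[Pixel], levels: int = 4) -> Histogram:
    """Quantise each channel to `levels` bins; return the pixel share per bin."""
    total = len(pixels)
    if total == 0:
        return {}
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(h1: Histogram, h2: Histogram) -> float:
    """Similarity in [0, 1]: the colour share the two images have in common."""
    return sum(min(share, h2.get(bin_, 0.0)) for bin_, share in h1.items())


red, blue = (250, 10, 10), (10, 10, 250)
query_image = [red] * 75 + [blue] * 25
candidate = [red] * 50 + [blue] * 50
similarity = histogram_intersection(colour_percentages(query_image),
                                    colour_percentages(candidate))
```

- Colour layout and texture would need further features of their own; a matching step could then combine the per-criterion similarities.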
- searching a 'new' image for known objects from robot vision (a robot searching for parts in a bin), through to traffic monitoring systems
- A serial approach could be used effectively: for example, first using a "straight edge" histogram to enable differentiation between natural and artificial scenes; then using an "edge length" histogram (a shortage of long edges probably indicates a natural scene); testing for a large area of blue tone at the top of the image (indicating an outdoor scene); and testing for significant elements of flesh tones, indicating that the image contains representations of people, which can be followed by a face matching analysis to find the same faces.
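- The serial approach can be sketched in Python as a cascade of tests over precomputed image features. The `ImageFeatures` fields, the thresholds and the label strings are all illustrative assumptions; the source names the tests but gives no numeric values or feature definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageFeatures:
    # All fields are assumed to be precomputed fractions in [0, 1].
    straight_edge_share: float  # share of edge pixels lying on straight edges
    long_edge_share: float      # share of edges above some chosen length
    top_blue_share: float       # share of blue-toned pixels in the top band
    flesh_tone_share: float     # share of flesh-toned pixels


def classify_serially(f: ImageFeatures) -> List[str]:
    """Apply the tests one after another and collect the labels they suggest."""
    labels = []
    # Thresholds are illustrative guesses, not values from the source.
    if f.straight_edge_share > 0.5 and f.long_edge_share > 0.3:
        labels.append("artificial scene")
    else:
        labels.append("natural scene")  # few straight or long edges
    if f.top_blue_share > 0.4:
        labels.append("outdoor")
    if f.flesh_tone_share > 0.1:
        labels.append("people")  # a face matching step could follow here
    return labels


beach = ImageFeatures(0.1, 0.05, 0.7, 0.2)
office = ImageFeatures(0.8, 0.6, 0.0, 0.0)
beach_labels = classify_serially(beach)
office_labels = classify_serially(office)
```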
- The result of the similarity searching is a set or series of matching scores for documents in the database, such a set existing for each element searched.
- Each of these search scores needs to be normalised 25 for combination 26 to achieve a combined search result 27.
- The normalisation step 25 ensures that a correct balance is given to the results of the different searching steps 24. This can be done by weighting each element of the document equally, by weighting each element according to its perceived importance in the document, or by weighting according to a user assessment of the relative importance of the different elements of the document.
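- A minimal Python sketch of normalisation 25 and combination 26 follows. It assumes min-max scaling as the normalisation and a weighted mean as the combination, which is only one of several reasonable choices; the function names, the weights and the sample scores are hypothetical.

```python
from typing import Dict


def normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale one engine's scores to [0, 1] (one possible normalisation)."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def combine(per_element: Dict[str, Dict[str, float]],
            weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted mean of normalised scores; a missing document counts as 0."""
    total_weight = sum(weights[name] for name in per_element)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    combined: Dict[str, float] = {}
    for name, scores in per_element.items():
        for doc, s in normalise(scores).items():
            share = weights[name] * s / total_weight
            combined[doc] = combined.get(doc, 0.0) + share
    return combined


raw = {
    "text": {"doc_a": 12.0, "doc_b": 4.0},    # e.g. an unbounded relevance score
    "picture": {"doc_a": 0.2, "doc_b": 0.6},  # e.g. a similarity in [0, 1]
}
equal = combine(raw, {"text": 1.0, "picture": 1.0})
text_heavy = combine(raw, {"text": 3.0, "picture": 1.0})
```

- With equal weights the two documents tie; weighting the text element more heavily moves `doc_a` ahead, which mirrors the choice between equal, importance-based and user-assigned weighting.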
- A preferred solution may involve a mixture of automatic and manual weighting.
- A particularly effective approach is to use synopsis generation techniques on the textual part to produce a set of textual search criteria and also to present a set of possible criteria based on the non-textual parts. These criteria are then presented to the user for verification.
- Such a user-based approach is easy to use (and it is also easy for a user to tell when it is ineffective). For example, a user may be asked whether he or she wants to search for things that match the textual synopses or, for the image and drawing parts, whether he or she wants "this person", "scenes like this", "pictures containing this object"... or "pages that look like this one".
- The combined result 27 is as for conventional similarity searching: a series of matching scores (generally expressed as percentages) listing documents in the database from best towards worst matches.
- A further output available from page decomposition is a data type plan 31 representing the document as a line art block, a text block, and an image block, arranged vertically in sequence - decomposition into layouts is discussed in US Patent No. 6,002,798.
- This data type plan can itself be used as a layout data type. This allows yet another element - the layout data type element - to be used in searching 32 of a database (provided that layout information is available in or derivable from the database entries).
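- One simple way to picture a layout data type is as the top-to-bottom sequence of block data types, compared with a generic sequence similarity measure. The Python sketch below uses `difflib.SequenceMatcher` from the standard library for that comparison; the `Block` record and the choice of measure are assumptions made for illustration, not the method of the source or of US Patent No. 6,002,798.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (data_type, top coordinate): a hypothetical minimal layout record.
Block = Tuple[str, int]


def data_type_plan(blocks: List[Block]) -> List[str]:
    """Order blocks from top to bottom and keep only their data types."""
    return [data_type for data_type, _top in sorted(blocks, key=lambda b: b[1])]


def layout_similarity(plan_a: List[str], plan_b: List[str]) -> float:
    """Ratio in [0, 1] from difflib; 1.0 means identical sequences."""
    return SequenceMatcher(None, plan_a, plan_b).ratio()


# Document 1: line art block, then text block, then image block.
document_1 = data_type_plan([("text", 120), ("graphic", 0), ("picture", 300)])
same_layout = ["graphic", "text", "picture"]
text_only = ["text", "text", "text"]
score_same = layout_similarity(document_1, same_layout)
score_text_only = layout_similarity(document_1, text_only)
```

- A one-dimensional sequence ignores block sizes and side-by-side columns; a fuller layout element would need to carry the spatial arrangement as well.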
- Similarity searching is conducted using the layout data type alone.
- The steps to be followed are essentially as in conventional similarity searching - this is shown in Figure 4, with elements common to the first aspect of the invention given the same reference numbers as in Figure 2.
- Layout similarity searching is more powerful if a number of different data types are used for text and for overall document type. Using a rule-based approach, different text blocks and whole documents, especially in the case of formal workflow documents, can be assigned particular functions with relatively high confidence.
- The difficulty of this problem depends on the nature and type of documents that are to be considered for matching. If the "universe" of documents is well defined, then there are tools available that can do an accurate job of classifying and labelling within that universe (e.g. OfficeMaid from DFKI). What is required in this case is classification according to a set of conventions laid down for the various classes of documents available for consideration. Conventions are here essentially rules that need not be closely followed: consequently an appropriate approach to this problem is rule based (most conveniently using fuzzy rules). Training of a neural network would also be an effective approach to adopt. The skilled person will appreciate how conventional fuzzy rule or neural network approaches could be adapted for use in a solution to this problem.
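- To make the fuzzy-rule idea concrete, the Python sketch below assigns a degree of membership to two hypothetical block functions, "title" and "body", from three assumed features of a text block. The features, breakpoints and rules are invented for illustration; the source does not specify any rule set, and a real system such as OfficeMaid may work quite differently.

```python
from typing import Dict


def ramp(x: float, low: float, high: float) -> float:
    """Fuzzy membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def classify_block(top: float, font_size: float, line_count: int) -> Dict[str, float]:
    """Return a degree of membership in [0, 1] for each hypothetical function.

    `top` is the block's vertical position as a fraction of page height
    (0 = top of page). Rules and breakpoints are illustrative only.
    """
    near_top = 1.0 - ramp(top, 0.1, 0.3)
    large_font = ramp(font_size, 12.0, 18.0)
    short = 1.0 - ramp(line_count, 2, 6)
    return {
        # Fuzzy AND is taken as min, a common convention.
        "title": min(near_top, large_font, short),
        "body": min(1.0 - large_font, 1.0 - short),
    }


heading = classify_block(top=0.05, font_size=20.0, line_count=1)
paragraph = classify_block(top=0.5, font_size=10.0, line_count=12)
borderline = classify_block(top=0.2, font_size=15.0, line_count=4)
```

- Because memberships are graded rather than binary, a block that only partly follows a convention still receives a usable score, which suits conventions that "need not be closely followed".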
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9903451 | 1999-02-16 | ||
GBGB9903451.4A GB9903451D0 (en) | 1999-02-16 | 1999-02-16 | Similarity searching for documents |
PCT/GB2000/000489 WO2000049526A1 (en) | 1999-02-16 | 2000-02-15 | Similarity searching by combination of different data-types |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1072001A1 (de) | 2001-01-31 |
Family
ID=10847827
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP00903814A Withdrawn EP1072001A1 (de) | 1999-02-16 | 2000-02-15 | Similarity searching by combination of different data-types |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1072001A1 (de) |
JP (1) | JP2002537604A (de) |
GB (1) | GB9903451D0 (de) |
WO (1) | WO2000049526A1 (de) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002108936A (ja) * | 2000-10-03 | 2002-04-12 | Canon Inc | Information retrieval apparatus, control method therefor, and computer-readable storage medium |
US6721728B2 (en) * | 2001-03-02 | 2004-04-13 | The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration | System, method and apparatus for discovering phrases in a database |
DE10215852B4 (de) * | 2002-04-10 | 2006-11-30 | Software Engineering Gmbh | Method for comparing two source files containing database queries, and comparison device |
JP2004348706A (ja) * | 2003-04-30 | 2004-12-09 | Canon Inc | Information processing apparatus, information processing method, storage medium, and program |
US7953720B1 (en) | 2005-03-31 | 2011-05-31 | Google Inc. | Selecting the best answer to a fact query from among a set of potential answers |
US7587387B2 (en) | 2005-03-31 | 2009-09-08 | Google Inc. | User interface for facts query engine with snippets from information sources that include query terms and answer terms |
JP4533273B2 (ja) | 2005-08-09 | 2010-09-01 | Canon Inc | Image processing apparatus, image processing method, and program |
US20070185870A1 (en) | 2006-01-27 | 2007-08-09 | Hogue Andrew W | Data object visualization using graphs |
US7925676B2 (en) | 2006-01-27 | 2011-04-12 | Google Inc. | Data object visualization using maps |
US8954426B2 (en) | 2006-02-17 | 2015-02-10 | Google Inc. | Query language |
US8055674B2 (en) | 2006-02-17 | 2011-11-08 | Google Inc. | Annotation framework |
US7620721B2 (en) | 2006-02-28 | 2009-11-17 | Microsoft Corporation | Pre-existing content replication |
US8347202B1 (en) | 2007-03-14 | 2013-01-01 | Google Inc. | Determining geographic locations for place names in a fact repository |
US20080267504A1 (en) * | 2007-04-24 | 2008-10-30 | Nokia Corporation | Method, device and computer program product for integrating code-based and optical character recognition technologies into a mobile visual search |
US8670597B2 (en) | 2009-08-07 | 2014-03-11 | Google Inc. | Facial recognition with social network aiding |
US9135277B2 (en) | 2009-08-07 | 2015-09-15 | Google Inc. | Architecture for responding to a visual query |
US9087059B2 (en) | 2009-08-07 | 2015-07-21 | Google Inc. | User interface for presenting search results for multiple regions of a visual query |
US9405772B2 (en) | 2009-12-02 | 2016-08-02 | Google Inc. | Actionable search results for street view visual queries |
US8977639B2 (en) | 2009-12-02 | 2015-03-10 | Google Inc. | Actionable search results for visual queries |
US8805079B2 (en) | 2009-12-02 | 2014-08-12 | Google Inc. | Identifying matching canonical documents in response to a visual query and in accordance with geographic information |
US9183224B2 (en) | 2009-12-02 | 2015-11-10 | Google Inc. | Identifying matching canonical documents in response to a visual query |
US8811742B2 (en) | 2009-12-02 | 2014-08-19 | Google Inc. | Identifying matching canonical documents consistent with visual query structural information |
US9852156B2 (en) | 2009-12-03 | 2017-12-26 | Google Inc. | Hybrid use of location sensor data and visual query to return local listings for visual query |
US8935246B2 (en) | 2012-08-08 | 2015-01-13 | Google Inc. | Identifying textual terms in response to a visual query |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5933823A (en) * | 1996-03-01 | 1999-08-03 | Ricoh Company Limited | Image database browsing and query using texture analysis |
-
1999
- 1999-02-16 GB GBGB9903451.4A patent/GB9903451D0/en not_active Ceased
-
2000
- 2000-02-15 EP EP00903814A patent/EP1072001A1/de not_active Withdrawn
- 2000-02-15 WO PCT/GB2000/000489 patent/WO2000049526A1/en not_active Application Discontinuation
- 2000-02-15 JP JP2000600197A patent/JP2002537604A/ja active Pending
Non-Patent Citations (1)
Title |
---|
See references of WO0049526A1 * |
Also Published As
Publication number | Publication date |
---|---|
WO2000049526A1 (en) | 2000-08-24 |
GB9903451D0 (en) | 1999-04-07 |
JP2002537604A (ja) | 2002-11-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1072001A1 (de) | Similarity searching by combination of different data-types | |
Doermann et al. | The detection of duplicates in document image databases | |
US6029167A (en) | Method and apparatus for retrieving text using document signatures | |
Lesk | Practical digital libraries: Books, bytes, and bucks | |
US7809695B2 (en) | Information retrieval systems with duplicate document detection and presentation functions | |
US7801893B2 (en) | Similarity detection and clustering of images | |
US5802515A (en) | Randomized query generation and document relevance ranking for robust information retrieval from a database | |
EP1585073B1 (de) | Method for duplicate detection and suppression | |
US5465353A (en) | Image matching and retrieval by multi-access redundant hashing | |
US8566305B2 (en) | Method and apparatus to define the scope of a search for information from a tabular data source | |
US5926565A (en) | Computer method for processing records with images and multiple fonts | |
US7930647B2 (en) | System and method for selecting pictures for presentation with text content | |
US20050021545A1 (en) | Very-large-scale automatic categorizer for Web content | |
US20030235345A1 (en) | Imaged document optical correlation and conversion system | |
GB2439843A (en) | Relevance ranked faceted metadata search method | |
US20080005081A1 (en) | Method and apparatus for searching and resource discovery in a distributed enterprise system | |
US20080147642A1 (en) | System for discovering data artifacts in an on-line data object | |
US20080147578A1 (en) | System for prioritizing search results retrieved in response to a computerized search query | |
JP2010044777A (ja) | Database query system and method | |
Shin et al. | Document Image Retrieval Based on Layout Structural Similarity. | |
CN113297457A (zh) | High-accuracy intelligent information resource push system and push method | |
Benitez et al. | Perceptual knowledge construction from annotated image collections | |
Aslandogan et al. | Evaluating strategies and systems for content based indexing of person images on the Web | |
JP2000231560A (ja) | 文書自動分類方式 | |
Hu et al. | Identifying Story and Preview Images in News Web Pages. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20000901 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: HEWLETT-PACKARD COMPANY, A DELAWARE CORPORATION |
|
RBV | Designated contracting states (corrected) |
Designated state(s): DE FR GB |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20060909 |