EP1072001A1 - Similarity searching by combination of different data types

Info

Publication number: EP1072001A1
Authority: EP; European Patent Office
Prior art keywords: elements; document; data type; data types; searching
Prior art date: 1999-02-16
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Withdrawn

Application number

EP00903814A

Other languages

English (en)

French (fr)

Inventor

William Sharpe

Roland John Burns

Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)

HP Inc

Original Assignee

Hewlett Packard Co

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

1999-02-16

Filing date

2000-02-15

Publication date

2001-01-31

2000-02-15 Application filed by Hewlett Packard Co

2001-01-31 Publication of EP1072001A1

Status: Withdrawn

Classifications

- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5854—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution

Definitions

The present invention relates to a method and means for searching to find similar documents in response to a query.
The invention is particularly relevant to the use of one document as a query for a search to obtain similar documents.
Similarity searching in databases of electronically stored documents is an important area of practical application. Such searching is well known for text. Typically, the input for such searching would be a text string, and the engine would then search the database matching entries against the text string and return entries with an acceptable similarity threshold. Similar searching is available for images - an example is the IBM Corporation QBIC (Query by Image Content) package, described at and available from http://wwwqbic.almaden.ibm.com/.
This structural information is then used to allow user searching and text indexing in chosen functional elements of the document.
This mechanism is particularly useful for making the problem of text searching in complex documents more tractable - it is not, however, effective to allow searching for documents which are as a whole similar to a query document.
In a first aspect, the invention provides a method of searching a database to find documents similar to a query document, comprising: decomposing the query document into elements of different data types; for one or more of the elements in a first data type, conducting a first data type similarity search to return match results from the database for the one or more elements in the first data type; for one or more of the elements in a second data type, conducting a second data type similarity search to return match results from the database for the one or more elements in the second data type; combining the match results from the first data type similarity search and the second data type similarity search to provide query document match results.
Results from each query document match may be combined to allow progressive refinement of queries using any of the data types either singly or in further combination.
In a second aspect, the invention provides a method of searching a database to find documents similar to a query document, comprising: decomposing the query document into elements of different data types; determining a layout element in a layout data type from the spatial arrangement of the elements in the document; for the layout element, conducting a layout similarity search to return match results from the database for the layout element.
Figure 1 shows a typical document page containing different data types;
Figure 2 shows steps in a method according to an embodiment of a first aspect of the invention for conducting a similarity search for the document shown in Figure 1;
Figure 3 shows the representation of the document shown in Figure 1 as a layout of data types, and indicates a search step usable in a further embodiment of the method of the invention;
Figure 4 shows steps in a method according to an embodiment of the second aspect of the invention for conducting a similarity search for layout information.
A typical document contains a plurality of data types.
The most basic data types are text and images.
Document 1 shown in Figure 1 contains a text block 12 - this text block is data in a first data type.
Document 1 also contains two different kinds of image.
One kind, image block 13, is a photographic image, typically consisting of an array of pixels in which each pixel has a colour value.
The other kind, line art block 11, is also an image but a "drawn" one, readily representable as a combination of geometric or formulaic elements - and as such, typically readily scalable.
Photographic images and line art images (hereafter "pictures" and "graphics") respond differently to different image processing and analysis techniques, and are most effectively treated as different data types.
The document 1 is selected in step 21.
For an electronic document, this could be achieved through any appropriate application capable of supporting the file type or file types of the document.
For a physical document this could be achieved by scanning the document using a scanner.
In step 22, the document is decomposed into separate elements: in the case of document 1, these elements are graphic block 11, text block 12, and picture block 13.
For text block 12, it is desirable for optical character recognition to be carried out at this point so that the text block element resulting from decomposition consists of ASCII text.
Decomposition of the document is achieved by an analysis and recognition process through which the different parts of the document are recognised as being text, pictures or graphics. Decomposition of a document into separate data types in this way is known, using for example techniques identified in "Block Segmentation and Text Extraction in Mixed Text/Image Documents" by FM Wahl, KY Wong and RG Casey, Computer Graphics and Image Processing, Vol. 20 (1982) (a further example is provided in US Patent No. 6,002,798).
Software adapted for use with proprietary scanners to decompose the elements of a scanned page into separate data types (in order to optimise the scanning process for each data type) is provided by Hewlett-Packard Company as "HP PrecisionScan".
The output of HP PrecisionScan is a set of elements each in a single data type, each of which can be selected for further processing.
The result of decomposition is a set of elements, each element having a single data type. For a particular data type, such as text, either all text is determined to be part of a single element, or else physically distinct areas of text are considered as separate elements, depending on how the decomposition is carried out.
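The outcome of decomposition can be pictured as a simple data structure. The following is a minimal Python sketch only, not the representation used by HP PrecisionScan or prescribed by the method; the names `DataType`, `Element` and `group_by_type`, and the placeholder contents, are hypothetical and introduced purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    TEXT = "text"
    PICTURE = "picture"   # photographic image
    GRAPHIC = "graphic"   # line art


@dataclass(frozen=True)
class Element:
    ref: int              # reference numeral from Figure 1 (illustrative)
    data_type: DataType   # each element has exactly one data type
    content: object       # e.g. ASCII text after OCR, or image data


def group_by_type(elements):
    """Group decomposed elements by their single data type."""
    groups = defaultdict(list)
    for element in elements:
        groups[element.data_type].append(element)
    return dict(groups)


# Document 1 of Figure 1: graphic block 11, text block 12, picture block 13.
document_1 = [
    Element(11, DataType.GRAPHIC, "line art placeholder"),
    Element(12, DataType.TEXT, "ASCII text placeholder"),
    Element(13, DataType.PICTURE, "pixel array placeholder"),
]
groups = group_by_type(document_1)
```

Grouping by data type in this way would let each group be handed to the similarity search appropriate to that type.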
In some versions of the method, all the elements of the document are used in similarity searching: in other versions one or more of the elements are selected for use in similarity searching (or the user is even allowed an opportunity to select part of an element for such further processing).
Separate elements are then used in similarity searching 23, 24 against a database, for example a database representing content available on the World Wide Web.
Inxight Summarizer is a software component technology that summarises a document by extracting key sentences from the document. This is the preconditioning step 23. These summaries can then be matched against each other in the matching step 24. Inxight Summarizer generates indicative summaries that contain key sentence elements from a document. The essence of the text is extracted by stemming and text normalisation technology to obtain a concise and canonical synopsis of the text. "Stemming" is the replacement of a word by its root and part-of-speech (e.g. "I had wanted" -> "to want/first person/pluperfect"), whereas "normalisation" involves replacement of one of several forms with a "concept" (e.g.
The matching step 24 can then be carried out on the stemmed and normalised results of the preconditioning step 23 with confidence that text content which is genuinely similar will be matched without adverse influence from unwanted syntax considerations.
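The division of labour between preconditioning step 23 and matching step 24 can be illustrated with a deliberately crude Python sketch. It is not Inxight Summarizer and does not use its API: the suffix list, the `CONCEPTS` table, the function names and the choice of Jaccard overlap as the matching measure are all assumptions made for illustration, and the stemmer here ignores part of speech.

```python
import re

# Hypothetical normalisation table: several surface forms map to one "concept".
CONCEPTS = {"automobile": "car", "autos": "car", "cars": "car"}
# Hypothetical suffix list for a very crude stemmer.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip one known suffix, keeping at least three characters of root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def precondition(text):
    """Step 23 (sketch): lower-case, tokenise, normalise, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {stem(CONCEPTS.get(token, token)) for token in tokens}


def match(query_text, candidate_text):
    """Step 24 (sketch): Jaccard overlap of the preconditioned term sets."""
    a, b = precondition(query_text), precondition(candidate_text)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because both texts pass through the same preconditioning, `match("cars", "automobile")` scores a full match even though the two strings share no word form.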
An example of an image searching tool is the IBM QBIC package, as indicated above.
QBIC is further described at http://wwwqbic.almaden.ibm.com/.
This package is adapted to precondition the images by analysing for a number of different criteria, such as colour percentages, colour layout, and textures occurring in the images. These criteria are then used in combination in a matching step 24.
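One of the criteria named above, colour percentages, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not QBIC's implementation or interface: the coarse RGB bucketing, the histogram-intersection measure and the function names are all choices made here for illustration.

```python
from collections import Counter


def colour_percentages(pixels, levels=4):
    """Fraction of pixels in each coarse RGB bucket (pixels are 0-255 triples)."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


def histogram_intersection(p, q):
    """Similarity in [0, 1]: colour mass shared by two distributions."""
    buckets = set(p) | set(q)
    return sum(min(p.get(b, 0.0), q.get(b, 0.0)) for b in buckets)


# Illustrative data: an all-red image against a half-red, half-blue image.
red = [(255, 0, 0)] * 4
mixed = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2
score = histogram_intersection(colour_percentages(red), colour_percentages(mixed))
```

A score of this kind for each criterion (colour percentages, colour layout, texture) could then be used in combination in the matching step.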
Applications that involve searching a 'new' image for known objects range from robot vision (a robot searching for parts in a bin) through to traffic monitoring systems.
A serial approach could be used effectively: for example, first using a "straight edge" histogram to enable differentiation between natural and artificial scenes; then using an "edge length" histogram (a shortage of long edges probably indicates a natural scene); testing for a large area of blue tone at the top of the image (indicating an outdoor scene); and testing for significant elements of "flesh tones", indicating that the image contains representations of people - which can be followed by a face matching analysis to find the same faces.
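The serial sequence of image tests described above can be sketched as a simple cascade. In this Python sketch the feature names, the threshold values and the labels are hypothetical; the features are assumed to have been computed beforehand by some image analysis step that is not shown.

```python
# Hypothetical thresholds; real values would need tuning on actual images.
DEFAULT_THRESHOLDS = {
    "straight_edge_fraction": 0.5,
    "long_edge_fraction": 0.2,
    "blue_top_fraction": 0.4,
    "flesh_tone_fraction": 0.1,
}


def serial_scene_tests(features, thresholds=DEFAULT_THRESHOLDS):
    """Apply the tests one after another and collect the resulting labels."""
    labels = []
    # 1. Straight-edge histogram: few straight edges suggests a natural scene.
    if features["straight_edge_fraction"] < thresholds["straight_edge_fraction"]:
        labels.append("natural")
    # 2. Edge-length histogram: a shortage of long edges also suggests natural.
    elif features["long_edge_fraction"] < thresholds["long_edge_fraction"]:
        labels.append("natural")
    else:
        labels.append("artificial")
    # 3. Large area of blue tone at the top of the image: outdoor scene.
    if features["blue_top_fraction"] >= thresholds["blue_top_fraction"]:
        labels.append("outdoor")
    # 4. Significant flesh tones: people present, so face matching could follow.
    if features["flesh_tone_fraction"] >= thresholds["flesh_tone_fraction"]:
        labels.append("people")
    return labels


beach = {"straight_edge_fraction": 0.1, "long_edge_fraction": 0.05,
         "blue_top_fraction": 0.6, "flesh_tone_fraction": 0.3}
office = {"straight_edge_fraction": 0.8, "long_edge_fraction": 0.5,
          "blue_top_fraction": 0.0, "flesh_tone_fraction": 0.0}
```

With these illustrative values, `beach` is labelled natural, outdoor and containing people, while `office` is labelled artificial only.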
The result of the similarity searching is a set or series of matching scores for documents in the database, such a set existing for each element searched.
Each of these search scores needs to be normalised 25 for combination 26 to achieve a combined search result 27.
The normalisation step 25 is to ensure that a correct balance is given to the results of the different searching steps 24. This can be done by weighting each element of the document equally, by weighting each element according to its perceived importance in the document, or by weighting according to a user assessment of the relative importance of the different elements of the document.
A preferred solution may involve a mixture of automatic and manual weighting.
A particularly effective approach is to use synopsis generation techniques on the textual part to produce a set of textual search criteria and also to present a set of possible criteria based on the non-textual parts. These criteria are then presented to the user for verification.
Such a user-based approach is easy to use (and it is also easy for a user to tell when it is ineffective). For example, a user may be asked whether he/she wants to search for things that match the textual synopses or, for the image and drawing parts, whether he/she wants "this person", "scenes like this", "pictures containing this object" ... or "pages that look like this one".
The combined result 27 is as for conventional similarity searching: a series of matching scores (generally expressed as percentages) listing documents in the database from best towards worst matches.
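The normalisation 25 and combination 26 steps can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the method as specified: min-max rescaling is one possible normalisation chosen here for illustration, and the function names, document identifiers, raw scores and weights are hypothetical.

```python
def normalise(scores):
    """Step 25 (sketch): rescale one element's raw scores to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def combine(results, weights):
    """Step 26 (sketch): weighted average of normalised scores per document."""
    total_weight = sum(weights[name] for name in results)
    combined = {}
    for name, scores in results.items():
        share = weights[name] / total_weight
        for doc, score in normalise(scores).items():
            combined[doc] = combined.get(doc, 0.0) + share * score
    # Combined result 27: documents listed from best towards worst match.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Raw scores on different scales for two elements of the query document.
results = {
    "text": {"A": 10, "B": 30, "C": 25},
    "picture": {"A": 90, "B": 10, "C": 50},
}
weights = {"text": 3, "picture": 1}   # text judged more important here
ranked = combine(results, weights)
percentages = [(doc, round(100 * score)) for doc, score in ranked]
```

Setting all weights equal corresponds to weighting each element of the document equally; weights supplied by a user would correspond to the manual weighting option.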
A further output available from page decomposition is a data type plan 31 representing the document as a line art block, a text block, and an image block, arranged vertically in sequence - decomposition into layouts is discussed in US Patent No. 6,002,798.
This data type plan can itself be used as a layout data type. This allows yet another element - the layout data type element - to be used in searching 32 of a database (provided that layout information is available in or derivable from the database entries).
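For a plan in which blocks are simply arranged vertically in sequence, a layout comparison can be sketched by treating each plan as an ordered list of data types. The Python sketch below uses the standard library's `difflib.SequenceMatcher` as the similarity measure; that choice, the function name and the string labels are assumptions made for illustration, and a real layout search would likely also consider block sizes and horizontal position.

```python
from difflib import SequenceMatcher


def layout_similarity(plan_a, plan_b):
    """Ratio in [0, 1] of how closely two top-to-bottom data type plans agree."""
    return SequenceMatcher(None, plan_a, plan_b).ratio()


# Data type plan 31 for document 1: line art, then text, then image.
query_plan = ["graphic", "text", "picture"]

same_layout = layout_similarity(query_plan, ["graphic", "text", "picture"])
missing_graphic = layout_similarity(query_plan, ["text", "picture"])
```

A database entry with an identical plan scores 1.0, and one lacking the leading line art block scores lower, so the scores can be ranked like those of any other data type.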
In the second aspect of the invention, similarity searching is conducted using the layout data type alone.
The steps to be followed are essentially as in conventional similarity searching - this is shown in Figure 4, with elements common to the first aspect of the invention given the same reference numbers as in Figure 2.
Layout similarity searching is more powerful if a number of different data types are used for text and for overall document type. Using a rule-based approach, different text blocks and whole documents, especially in the case of formal workflow documents, can be assigned particular functions with relatively high confidence.
The difficulty of this problem depends on the nature and type of documents that are to be considered for matching. If the "universe" of documents is well defined, then there are tools available that can do an accurate job of classifying and labelling within that universe (e.g. OfficeMaid from DFKI). What is required in this case is classification according to a set of conventions laid down for the various classes of documents available for consideration. Conventions are here essentially rules that need not be closely followed: consequently an appropriate approach to this problem is rule based (most conveniently using fuzzy rules). Training of a neural network would also be an effective approach to adopt. The skilled person will appreciate how conventional fuzzy rule or neural network approaches could be adapted for use in a solution to this problem.
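A fuzzy rule-based assignment of functions to text blocks might look like the following Python sketch. Everything specific in it is an assumption made for illustration: the two features (vertical position and relative font size), the membership breakpoints, the labels "title", "footer" and "body", and the use of the minimum as the fuzzy AND.

```python
def ramp(x, low, high):
    """Fuzzy membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def classify_block(y_position, font_scale):
    """Return (label, degree) for a text block.

    y_position: 0.0 is the top of the page, 1.0 the bottom.
    font_scale: font size relative to ordinary body text.
    """
    near_top = 1.0 - ramp(y_position, 0.1, 0.3)
    near_bottom = ramp(y_position, 0.8, 0.95)
    large = ramp(font_scale, 1.2, 2.0)
    small = 1.0 - ramp(font_scale, 0.7, 1.0)
    degrees = {
        "title": min(near_top, large),      # rule: near top AND large font
        "footer": min(near_bottom, small),  # rule: near bottom AND small font
    }
    # Whatever is not claimed by a specific rule is treated as body text.
    degrees["body"] = 1.0 - max(degrees.values())
    label = max(degrees, key=degrees.get)
    return label, degrees[label]
```

Because the rules return degrees rather than hard decisions, a block that only loosely follows a convention still receives a label, together with an indication of how confident that assignment is.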

Landscapes

Engineering & Computer Science (AREA)
Theoretical Computer Science (AREA)
Library & Information Science (AREA)
Data Mining & Analysis (AREA)
Databases & Information Systems (AREA)
Physics & Mathematics (AREA)
General Engineering & Computer Science (AREA)
General Physics & Mathematics (AREA)
Computational Linguistics (AREA)
Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Image Analysis (AREA)

EP00903814A 1999-02-16 2000-02-15 Similarity searching by combination of different data types Withdrawn EP1072001A1 (de)

Applications Claiming Priority (3)

Application Number	Priority Date	Filing Date	Title
GB9903451		1999-02-16
GBGB9903451.4A GB9903451D0 (en)	1999-02-16	1999-02-16	Similarity searching for documents
PCT/GB2000/000489 WO2000049526A1 (en)	1999-02-16	2000-02-15	Similarity searching by combination of different data-types

Publications (1)

Publication Number	Publication Date
EP1072001A1 (de)	2001-01-31

Family

ID=10847827

Family Applications (1)

Application Number	Priority Date	Filing Date	Title
EP00903814A Withdrawn EP1072001A1 (de)	1999-02-16	2000-02-15	Similarity searching by combination of different data types

Country Status (4)

Country	Link
EP (1)	EP1072001A1 (de)
JP (1)	JP2002537604A (de)
GB (1)	GB9903451D0 (de)
WO (1)	WO2000049526A1 (de)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
JP2002108936A (ja) *	2000-10-03	2002-04-12	Canon Inc	Information retrieval apparatus, control method therefor, and computer-readable storage medium
US6721728B2 (en) *	2001-03-02	2004-04-13	The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration	System, method and apparatus for discovering phrases in a database
DE10215852B4 (de) *	2002-04-10	2006-11-30	Software Engineering Gmbh	Method for comparing two source files containing database queries, and comparison device
JP2004348706A (ja) *	2003-04-30	2004-12-09	Canon Inc	Information processing apparatus, information processing method, storage medium, and program
US7953720B1 (en)	2005-03-31	2011-05-31	Google Inc.	Selecting the best answer to a fact query from among a set of potential answers
US7587387B2 (en)	2005-03-31	2009-09-08	Google Inc.	User interface for facts query engine with snippets from information sources that include query terms and answer terms
JP4533273B2 (ja)	2005-08-09	2010-09-01	キヤノン株式会社	Image processing apparatus, image processing method, and program
US20070185870A1 (en)	2006-01-27	2007-08-09	Hogue Andrew W	Data object visualization using graphs
US7925676B2 (en)	2006-01-27	2011-04-12	Google Inc.	Data object visualization using maps
US8954426B2 (en)	2006-02-17	2015-02-10	Google Inc.	Query language
US8055674B2 (en)	2006-02-17	2011-11-08	Google Inc.	Annotation framework
US7620721B2 (en)	2006-02-28	2009-11-17	Microsoft Corporation	Pre-existing content replication
US8347202B1 (en)	2007-03-14	2013-01-01	Google Inc.	Determining geographic locations for place names in a fact repository
US20080267504A1 (en) *	2007-04-24	2008-10-30	Nokia Corporation	Method, device and computer program product for integrating code-based and optical character recognition technologies into a mobile visual search
US8670597B2 (en)	2009-08-07	2014-03-11	Google Inc.	Facial recognition with social network aiding
US9135277B2 (en)	2009-08-07	2015-09-15	Google Inc.	Architecture for responding to a visual query
US9087059B2 (en)	2009-08-07	2015-07-21	Google Inc.	User interface for presenting search results for multiple regions of a visual query
US9405772B2 (en)	2009-12-02	2016-08-02	Google Inc.	Actionable search results for street view visual queries
US8977639B2 (en)	2009-12-02	2015-03-10	Google Inc.	Actionable search results for visual queries
US8805079B2 (en)	2009-12-02	2014-08-12	Google Inc.	Identifying matching canonical documents in response to a visual query and in accordance with geographic information
US9183224B2 (en)	2009-12-02	2015-11-10	Google Inc.	Identifying matching canonical documents in response to a visual query
US8811742B2 (en)	2009-12-02	2014-08-19	Google Inc.	Identifying matching canonical documents consistent with visual query structural information
US9852156B2 (en)	2009-12-03	2017-12-26	Google Inc.	Hybrid use of location sensor data and visual query to return local listings for visual query
US8935246B2 (en)	2012-08-08	2015-01-13	Google Inc.	Identifying textual terms in response to a visual query

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
US5933823A (en) *	1996-03-01	1999-08-03	Ricoh Company Limited	Image database browsing and query using texture analysis

1999
- 1999-02-16 GB GBGB9903451.4A patent/GB9903451D0/en not_active Ceased
2000
- 2000-02-15 EP EP00903814A patent/EP1072001A1/de not_active Withdrawn
- 2000-02-15 WO PCT/GB2000/000489 patent/WO2000049526A1/en not_active Application Discontinuation
- 2000-02-15 JP JP2000600197A patent/JP2002537604A/ja active Pending

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0049526A1 *

Also Published As

Publication number	Publication date
WO2000049526A1 (en)	2000-08-24
GB9903451D0 (en)	1999-04-07
JP2002537604A (ja)	2002-11-05

Legal Events

Date	Code	Title	Description
2000-12-15	PUAI	Public reference made under article 153(3) epc to a published international application that has entered the european phase	Free format text: ORIGINAL CODE: 0009012
2001-01-31	17P	Request for examination filed	Effective date: 20000901
2001-01-31	AK	Designated contracting states	Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE
2001-05-02	RAP1	Party data changed (applicant data changed or rights of an application transferred)	Owner name: HEWLETT-PACKARD COMPANY, A DELAWARE CORPORATION
2004-05-12	RBV	Designated contracting states (corrected)	Designated state(s): DE FR GB
2007-02-02	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN
2007-03-07	18D	Application deemed to be withdrawn	Effective date: 20060909

Publication	Publication Date	Title
EP1072001A1 (de)	2001-01-31	Similarity searching by combination of different data types
Doermann et al.	1998	The detection of duplicates in document image databases
US6029167A (en)	2000-02-22	Method and apparatus for retrieving text using document signatures
Lesk	1997	Practical digital libraries: Books, bytes, and bucks
US7809695B2 (en)	2010-10-05	Information retrieval systems with duplicate document detection and presentation functions
US7801893B2 (en)	2010-09-21	Similarity detection and clustering of images
US5802515A (en)	1998-09-01	Randomized query generation and document relevance ranking for robust information retrieval from a database
EP1585073B1 (de)	2009-05-27	Method for determining and suppressing duplicates
US5465353A (en)	1995-11-07	Image matching and retrieval by multi-access redundant hashing
US8566305B2 (en)	2013-10-22	Method and apparatus to define the scope of a search for information from a tabular data source
US5926565A (en)	1999-07-20	Computer method for processing records with images and multiple fonts
US7930647B2 (en)	2011-04-19	System and method for selecting pictures for presentation with text content
US20050021545A1 (en)	2005-01-27	Very-large-scale automatic categorizer for Web content
US20030235345A1 (en)	2003-12-25	Imaged document optical correlation and conversion system
GB2439843A (en)	2008-01-09	Relevance ranked faceted metadata search method
US20080005081A1 (en)	2008-01-03	Method and apparatus for searching and resource discovery in a distributed enterprise system
US20080147642A1 (en)	2008-06-19	System for discovering data artifacts in an on-line data object
US20080147578A1 (en)	2008-06-19	System for prioritizing search results retrieved in response to a computerized search query
JP2010044777A (ja)	2010-02-25	Database query system and method
Shin et al.	2006	Document Image Retrieval Based on Layout Structural Similarity.
CN113297457A (zh)	2021-08-24	High-precision intelligent information resource push system and push method
Benitez et al.	2002	Perceptual knowledge construction from annotated image collections
Aslandogan et al.	2000	Evaluating strategies and systems for content based indexing of person images on the Web
JP2000231560A (ja)	2000-08-22	Automatic document classification system
Hu et al.	2003	Identifying Story and Preview Images in News Web Pages.