WO2021152550A1 - Systems and methods for processing images - Google Patents

Systems and methods for processing images

Info

Publication number
WO2021152550A1
Authority
WO
WIPO (PCT)
Prior art keywords
landmarks
document
image
pixel coordinates
resolution
Prior art date
Application number
PCT/IB2021/050749
Other languages
French (fr)
Inventor
Patrick STEEVES
Ying Zhang
Original Assignee
Element Ai Inc.
Priority date
Filing date
Publication date
Priority claimed from US16/778,324 external-priority patent/US11514702B2/en
Priority claimed from CA3070701A external-priority patent/CA3070701C/en
Application filed by Element Ai Inc. filed Critical Element Ai Inc.
Publication of WO2021152550A1 publication Critical patent/WO2021152550A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10008Still image; Photographic image from scanner, fax or copier
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30176Document

Definitions

  • the present technology relates to machine-vision systems and methods for processing images, such as digital representations of documents.
  • the present technology relates to systems and methods for identifying landmarks of images and/or matching documents with corresponding templates based on identified landmarks of an image.
  • the present technology is directed to systems and methods that facilitate, in accordance with at least one broad aspect, improved identification of image landmarks.
  • the present technology is directed to systems and methods that match documents with corresponding templates based on identified landmarks.
  • a method of identifying landmarks of a document from a digital representation of the document comprising: accessing the digital representation of the document, the digital representation being associated with a first resolution; operating a Machine Learning Algorithm (MLA), the MLA having been trained: based on a set of training digital representations of documents associated with labels, the labels identifying landmarks of the documents represented by the training digital representations; to learn a first function allowing detection of landmarks of documents represented by digital representations; to learn a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function; the operating the MLA comprising: down-sampling the digital representation of the document, the down-sampled digital representation of the document being associated with a second resolution, the second resolution being lower than the first resolution; detecting landmarks from the down-sampled digital representation of the document; generating fractional pixel coordinates for the detected landmarks in accordance with the second resolution, the fractional pixel coordinates allowing reconstruction of pixel coordinates in accordance with the first resolution.
  • MLA Machine Learning Algorithm
  • a method of identifying a template document to be associated with a document comprising: accessing the digital representation of the document; accessing a set of digital representations of template documents, each one of the digital representations of template documents comprising known landmarks; applying an image alignment routine to the document and the template documents; calculating a covariance of pixel values of the document aligned and superimposed to the at least one of the template documents; and determining, based on the covariance of the pixel values, whether the document is to be associated with the at least one of the template documents.
  • a method of aligning a first image with a second image comprising: accessing the first image; accessing the second image comprising known landmarks; determining pixel coordinates of landmarks of the first image; determining a transformation based on the determined pixel coordinates of the landmarks of the first image and known landmarks of the second image, the transformation allowing mapping of the first image onto the second image; calculating a covariance of pixel values of the first image aligned and superimposed to the second image; and determining, based on the covariance of the pixel values, whether the first image is to be associated with the second image.
  • various implementations of the present technology provide a non- transitory computer-readable medium storing program instructions for executing one or more methods described herein, the program instructions being executable by a processor of a computer-based system.
  • various implementations of the present technology provide a computer-based system, such as, for example, but without being limitative, an electronic device comprising at least one processor and a memory storing program instructions for executing one or more methods described herein, the program instructions being executable by the at least one processor of the electronic device.
  • a computer system may refer, but is not limited to, an “electronic device”, a “computing device”, an “operation system”, a “system”, a “computer-based system”, a “computer system”, a “network system”, a “network device”, a “controller unit”, a “monitoring device”, a “control device”, a “server”, and/or any combination thereof appropriate to the relevant task at hand.
  • computer-readable medium and “memory” are intended to include media of any nature and kind whatsoever, non-limiting examples of which include RAM, ROM, disks (e.g., CD-ROMs, DVDs, floppy disks, hard disk drives, etc.), USB keys, flash memory cards, solid-state drives, and tape drives. Still in the context of the present specification, “a” computer-readable medium and “the” computer-readable medium should not be construed as being the same computer-readable medium. To the contrary, and whenever appropriate, “a” computer-readable medium and “the” computer-readable medium may also be construed as a first computer-readable medium and a second computer-readable medium.
  • FIG. 1 is a block diagram of an example computing environment in accordance with at least one embodiment of the present technology
  • FIG. 2 is a block diagram illustrating a system comprising a landmark detection module and a document matching module in accordance with at least one embodiment of the present technology
  • FIG. 3 is a diagram illustrating a neural network in accordance with at least one embodiment of the present technology
  • FIG. 4 and 5 illustrate examples of document matching in accordance with at least one embodiment of the present technology
  • FIG. 6 is a diagram providing an overview of a method of conducting document matching based on identified landmarks in accordance with at least one embodiment of the present technology
  • FIG. 7 is a flow diagram illustrating steps of a computer-implemented method of identifying landmarks of a document from a digital representation of the document in accordance with at least one embodiment of the present technology
  • FIG. 8 is a flow diagram illustrating steps of identifying a template document to be associated with a document in accordance with at least one embodiment of the present technology
  • FIG. 9 is a flow diagram illustrating steps of aligning a first image with a second image in accordance with at least one embodiment of the present technology.

[25] Unless otherwise explicitly specified herein, the drawings (“Figures”) are not to scale.

DETAILED DESCRIPTION
  • processor may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software.
  • the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared.
  • the processor may be a general purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a digital signal processor (DSP).
  • CPU central processing unit
  • DSP digital signal processor
  • processor should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.
  • ASIC application specific integrated circuit
  • FPGA field programmable gate array
  • ROM read-only memory
  • RAM random access memory
  • non-volatile storage
  • modules may be represented herein as any combination of flowchart elements or other elements indicating performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown. Moreover, it should be understood that one or more modules may include for example, but without being limitative, computer program logic, computer program instructions, software, stack, firmware, hardware circuitry, or a combination thereof which provides the required capabilities.
  • FIG. 1 illustrates a computing environment in accordance with an embodiment of the present technology, shown generally as 100.
  • the computing environment 100 may be implemented by any of a conventional personal computer, a computer dedicated to managing network resources, a network device and/or an electronic device (such as, but not limited to, a mobile device, a tablet device, a server, a controller unit, a control device, etc.), and/or any combination thereof appropriate to the relevant task at hand.
  • the computing environment 100 comprises various hardware components including one or more single or multi-core processors collectively represented by processor 110, a solid-state drive 120, a random access memory 130, and an input/output interface 150.
  • the computing environment 100 may be a computer specifically designed to detect landmarks and/or match documents. In some alternative embodiments, the computing environment 100 may be a generic computer system.

[34] In some embodiments, the computing environment 100 may also be a subsystem of one of the above-listed systems. In some other embodiments, the computing environment 100 may be an “off-the-shelf” generic computer system. In some embodiments, the computing environment 100 may also be distributed amongst multiple systems. The computing environment 100 may also be specifically dedicated to the implementation of the present technology. As a person skilled in the art of the present technology may appreciate, multiple variations as to how the computing environment 100 is implemented may be envisioned without departing from the scope of the present technology.
  • processor 110 is generally representative of a processing capability.
  • one or more specialized processing cores may be provided.
  • one or more Graphic Processing Units (GPUs), Tensor Processing Units (TPUs), and/or other so-called accelerated processors (or processing accelerators) may be provided in addition to or in place of one or more CPUs.
  • System memory will typically include random access memory 130, but is more generally intended to encompass any type of non-transitory system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), or a combination thereof.
  • Solid-state drive 120 is shown as an example of a mass storage device, but more generally such mass storage may comprise any type of non-transitory storage device configured to store data, programs, and other information, and to make the data, programs, and other information accessible via a system bus 160.
  • mass storage may comprise one or more of a solid state drive, hard disk drive, a magnetic disk drive, and/or an optical disk drive.
  • Communication between the various components of the computing environment 100 may be enabled by a system bus 160 comprising one or more internal and/or external buses (e.g., a PCI bus, universal serial bus, IEEE 1394 “Firewire” bus, SCSI bus, Serial-ATA bus, ARINC bus, etc.), to which the various hardware components are electronically coupled.
  • the input/output interface 150 may enable networking capabilities such as wired or wireless access.
  • the input/output interface 150 may comprise a networking interface such as, but not limited to, a network port, a network socket, a network interface controller and the like. Multiple examples of how the networking interface may be implemented will become apparent to the person skilled in the art of the present technology.
  • the networking interface may implement specific physical layer and data link layer standards such as Ethernet, Fibre Channel, Wi-Fi, Token Ring or Serial communication protocols.
  • the specific physical layer and the data link layer may provide a base for a full network protocol stack, allowing communication among small groups of computers on the same local area network (LAN) and large-scale network communications through routable protocols, such as Internet Protocol (IP).
  • IP Internet Protocol
  • the solid-state drive 120 stores program instructions suitable for being loaded into the random access memory 130 and executed by the processor 110 for executing acts of one or more methods described herein, relating to detecting landmarks and/or matching documents.
  • the program instructions may be part of a library or an application.
  • FIG. 2 is a block diagram illustrating a system 200 comprising a landmark detection module, a document matching module 220 and a content extraction module 230.
  • the system 200 may receive one or more images 202 for further processing, for example, but without being limitative, further processing involving image registration and/or document matching.
  • the one or more images 202 may be accessed from a computer-readable memory storing digital representations of images.
  • the digital representations of the images may be stored in a computer-readable format, for example, but without being limitative, under the file formats JPEG, PNG, TIFF and/or GIF.
  • the digital representations may be compressed or uncompressed.
  • the digital representations may be in raster or vector formats. This aspect is non-limitative and multiple variations will become apparent to the person skilled in the art of the present technology.
  • the digital representations may have been generated by a camera, a scanner or any electronic device configured to generate a digital representation of an image.
  • the image comprises landmarks which may be broadly defined as image features which may be relied upon to define a coordinate system associated with the content of the image. Such coordinate system may be used for multiple machine-vision tasks, such as, but not limited to, image registration.
  • the image 202 may comprise a digital representation of a document.
  • the document may be a sheet of paper.
  • the sheet of paper may be a form which may be filled with additional information.
  • the additional information may have been handwritten on the form or typed in (e.g., a form having been filled electronically and then printed).
  • the empty form i.e., prior to any additional information being added to the form
  • the template form may define pre-defined content/fields such as boxes, lines, sections, questions, graphical information, etc.
  • the template form may be organised so as to collect information associated with one or more tasks, such as administrative tasks.
  • an administrative task may be to collect information in the context of insurance company gathering information from clients or potential clients.
  • the template form may have been downloaded from a website by a user, printed, filled out and scanned thereby generating a digital representation of the document (in this example, a filled out form).
  • landmarks associated with the document may comprise corners whose exact positions in the digital representation vary depending on how the document was positioned during the scanning process. This typically results in positions of corners of a same document varying from one scanned version of the document to another. Such a situation may be referred to as a misaligned document or misaligned image.
  • landmarks associated with the document may also comprise edges whose exact positions in the digital representation may also vary depending on how the document was positioned during the scanning process.
  • a first technical problem to be addressed involves accurately identifying positions of landmarks (typically four corners/edges, but variations may encompass fewer than four or more than four corners/edges) of a document from a digital representation of the document (in this example, the scanned document).
  • Embodiments of the present technology provide improved performance for accurately identifying positions of landmarks from an image. Such embodiments may be implemented through a landmark detection module such as the landmark detection module 210.
  • the landmark detection module 210 allows automatic and accurate identification of landmarks contained in an image, for example, but without being limitative, corners of a scanned document.
  • the landmark detection module 210 implements one or more machine learning algorithms (MLAs).
  • the one or more MLAs rely on deep learning approaches.
  • the one or more MLAs comprise a neural network, such as a convolutional neural network (CNN) 300 comprising multiple layers 310, 320 and 330.
  • the CNN may rely on training datasets, such as the training dataset 204, to be trained so as to detect landmarks (e.g., corners or edges) of an image inputted to the system 200 (e.g., the image 202).
  • the training dataset 204 comprises a set of training digital representations of documents associated with labels.
  • the labels may comprise coordinates of landmarks associated with a given document (e.g., coordinates of corners).
  • the training dataset 204 may also comprise a pixel-wise landmark heat map which may, for example, be transformed from the coordinates of landmarks.
  • the training dataset 204 may be relied upon to generate one or more machine learning (ML) models which may be stored in the ML model database 212 and called upon depending on certain use cases. For example, a first ML model may be associated with the detection of corners of forms while a second ML model may be associated with the detection of landmarks from panoramic pictures.
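The passage above mentions that pixel-wise landmark heat maps may be transformed from coordinate labels, without specifying the transform. One common way to do this, shown here purely as a hedged illustration (the Gaussian rendering and the `sigma` parameter are assumptions, not taken from the patent), is to place a Gaussian bump at each labelled landmark:

```python
import numpy as np

def landmark_heatmap(coords, shape, sigma=2.0):
    """Render labelled landmark (y, x) coordinates as a pixel-wise heat map,
    one Gaussian bump per landmark, usable as a training target."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.zeros(shape)
    for (y, x) in coords:
        # Take the max so overlapping bumps stay in [0, 1].
        hm = np.maximum(hm, np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2)))
    return hm

# Two labelled landmarks rendered on a 32x40 grid.
hm = landmark_heatmap([(4.0, 5.0), (20.0, 30.0)], (32, 40))
```

Each heat map peaks at 1.0 exactly on the labelled coordinates and decays smoothly around them, which gives the classification function a dense supervision signal instead of isolated positive pixels.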
  • ML machine learning
  • the ML model may implement a pre-trained CNN.
  • One example of such a CNN is ResNet-18, which is a CNN trained on more than a million images from the ImageNet database.
  • the pre-trained CNN may be subjected to complementary training, the complementary training being directed to detection of specific landmarks, such as corners of a document.
  • multiple groups of layers 310-330 may implement different learning functions, such as, for example, a first function allowing detection of landmarks of documents represented by digital representations and a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function.
  • the first function is configured to down-sample the digital representation of the document (i.e., lower the resolution) and, for each pixel of the digital representation of the document, classify the pixel as a landmark (e.g., a corner, an edge) or not a landmark (e.g., not a corner, not an edge).
  • the first function may be referred to as a classification task predicting whether a sub-portion of the digital representation of the document comprises a landmark.
  • the second function may be referred to as a regression task generating fractional pixel coordinates from the sub-portion of the digital representation identified as comprising a landmark.
  • the first function may be described as classifying pixels as landmarks while the second function may be described as predicting offset to improve precision.
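The split between the two functions can be sketched in numpy. This is a minimal, hedged illustration of the decoding step only, assuming the classification task emits a per-cell probability map and the regression task emits per-cell sub-pixel offsets; the array shapes and names are hypothetical, not taken from the patent:

```python
import numpy as np

def decode_landmark(heatmap, offsets):
    """Decode one landmark from a low-resolution classification heat map
    and a per-cell offset (regression) map.

    heatmap : (H, W) landmark probabilities (first function's output).
    offsets : (H, W, 2) fractional (dy, dx) offsets in [0, 1) within each
              cell (second function's output).
    Returns fractional pixel coordinates (y, x) at the low resolution.
    """
    # Classification task: pick the cell most likely to contain the landmark.
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Regression task: refine the integer cell index with the predicted offset.
    dy, dx = offsets[iy, ix]
    return iy + dy, ix + dx

# Toy predictions: the landmark sits in cell (2, 3) with a sub-cell offset.
hm = np.zeros((8, 8)); hm[2, 3] = 0.9
off = np.zeros((8, 8, 2)); off[2, 3] = (0.25, 0.75)
y, x = decode_landmark(hm, off)  # -> (2.25, 3.75)
```

The classification head thus localizes the landmark to a coarse cell, and the regression head restores the precision lost by down-sampling.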
  • the CNN 300 is trained based on a set of training digital representations of documents associated with labels, the labels identifying landmarks (e.g., corners) of the documents represented by the training digital representations.
  • the training of the CNN 300 allows the learning of the first function and the second function.
  • the CNN 300 may be operated by the landmark detection module 210. During operation, the CNN 300 will receive as an input an image for which landmarks (e.g., corners) need to be detected.
  • the fractional pixel coordinates comprise floating values with sufficient accuracy and may be relied upon for reconstructing precise pixel coordinates in the original resolution of the image.
  • precise pixel coordinates may be defined as the position of the exact pixel at which the landmark is positioned.
  • the precise pixel coordinates may be defined as the position of the landmark with about 2-5 pixels of precision.
  • the precise pixel coordinates may be defined as the position of the landmark with about 5-10 pixels of precision.
  • the first function and the second function are executed in parallel so that the fractional pixel coordinates may be calculated while the first function identifies landmarks on the lower resolution image.
  • the landmark detection module 210 may reconstruct the pixel coordinates of the landmarks in the original image based on the fractional pixel coordinates.
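The reconstruction step amounts to rescaling the fractional coordinates by the down-sampling factor. A minimal sketch, assuming a uniform scale factor between the two resolutions (the function name and rounding choice are assumptions for illustration):

```python
import numpy as np

def to_original_resolution(frac_coords, scale):
    """Map fractional (y, x) coordinates expressed at the down-sampled
    (second) resolution back to pixel coordinates at the original (first)
    resolution. `scale` = first resolution / second resolution."""
    coords = np.asarray(frac_coords, dtype=float) * scale
    return np.rint(coords).astype(int)  # snap to the nearest whole pixel

# A landmark decoded at (2.25, 3.75) in a 16x down-sampled image
# falls on pixel (36, 60) of the original image.
pixels = to_original_resolution([(2.25, 3.75)], 16)
```

Because the offsets are fractional, the rescaled coordinates can land on any original-resolution pixel, not just multiples of the down-sampling factor.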
  • the original resolution of an image may be 10,000 × 10,000 pixels.
  • the landmark detection module 210 may output pixel coordinates of landmarks associated with an image.
  • the landmark detection module 210 may provide the determined pixel coordinates to the document matching module 220.
  • the landmark detection module 210 may generate a set of coordinates associated with corners of the documents (e.g., a first set of coordinates of a top left corner, a second set of coordinates of a top right corner, a third set of coordinates of a bottom left corner and a fourth set of coordinates of a bottom right corner).
  • the CNN 300 outputs four images, each one of the four images representing a distinct identified corner, and four sets of associated coordinates.
  • the document matching module 220 relies on the set of coordinates associated with corners of the document to align the document with reference documents such as document templates.
  • the system 200 may be operated in the context of identifying which template form amongst a plurality of template forms corresponds to the document of the digital representation.
  • the template forms (also referred to as reference documents) are stored in the template database 222.
  • FIG. 4 illustrates an example of a digital representation 410 of a form and reference template forms 420-440.
  • the digital representation 410 comprises boxes filled out with information.
  • For each one of the reference template forms 420-440, corners have been identified or associated upon creation of the template database 222. Corners of the document represented in the digital representation 410 have been determined by the landmark detection module 210. Once the corners of the document and the corners of the reference template forms 420-440 are known, the document matching module 220 may undertake to align the document with respect to the reference template forms 420-440.
  • the alignment is operated by an image alignment routine which comprises determining a transformation based on the pixel coordinates of the corners of the document and known corners of the reference template documents.
  • the transformation allows mapping the document onto the reference template documents as illustrated at FIG. 5 (see 510 representing document and template forms 420-440 aligned).
  • the transformation comprises an affine transformation and/or a homographic transformation (e.g., implemented by computing a homography matrix). Implementation details of the transformation, such as the affine transformation and/or homographic transformation, will become apparent to the person skilled in the art of the present technology. Other transformations may also be used without departing from the scope of the present technology.
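Since the patent leaves the homography computation to the skilled reader, here is one standard way to do it, sketched with the direct linear transform (DLT) in plain numpy. The corner values are hypothetical; in practice a library routine (e.g., OpenCV's homography estimation) would typically be used instead:

```python
import numpy as np

def homography_from_corners(src, dst):
    """Estimate the 3x3 homography mapping src corners to dst corners
    via the direct linear transform. src, dst: (4, 2) arrays of (x, y)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on h.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The homography is the null vector of A (smallest singular value).
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply a homography to (N, 2) points (homogeneous divide included)."""
    pts = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Detected corners of a skewed scan, mapped onto known template corners.
doc = np.array([[12.0, 9.0], [585.0, 20.0], [598.0, 830.0], [5.0, 815.0]])
tpl = np.array([[0.0, 0.0], [600.0, 0.0], [600.0, 800.0], [0.0, 800.0]])
H = homography_from_corners(doc, tpl)
```

With four corner correspondences the homography is exactly determined, so applying `H` to the detected document corners recovers the template corners, which is what allows the document to be superimposed on each candidate template.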
  • the document matching module 220 proceeds to calculating a covariance of pixel values of the document aligned and superimposed to reference template forms 420-440 as exemplified by the graphic representation 520.
  • operating the document matching module 220 entails comparing the pixel values of the document and a given reference template form, the document being aligned with the given reference template form. Two one-dimensional vectors of length (width * height) are generated, a first one-dimensional vector being associated with the aligned document and a second one-dimensional vector being associated with the given reference template form with which the document is aligned.
  • pixels should move from dark to light in a similar way in the aligned document compared to the given reference template form. If the document corresponds to the given reference template form, then the first one-dimensional vector substantially matches the second one-dimensional vector. If the document does not correspond to the given reference template form, then the first one-dimensional vector does not match the second one-dimensional vector. In some embodiments, a calculation of a correlation between the first one-dimensional vector and the second one-dimensional vector allows determining whether a substantial match exists or not.
  • the covariance of pixel values allows identification of peaks reflective of a match between a portion of the document and a corresponding reference template form. For example, a box located at a same position in both the document and a corresponding reference template form will result in a peak of the covariance value for the pixels representing the box. As a result, a high covariance will indicate that the reference template form is more likely to correspond to the document. As a result, in some embodiments, the document matching module 220 may rely on the highest covariance to determine which reference template form amongst a plurality of reference template forms is likely to correspond to the document.
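A minimal sketch of this matching step, assuming the document has already been aligned to each template and using Pearson correlation of the flattened pixel vectors as the score (the synthetic forms and function names below are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def template_score(doc, template):
    """Correlation between the flattened pixel values of an aligned
    document and a template; higher means a likelier match."""
    a = doc.ravel().astype(float)
    b = template.ravel().astype(float)
    return np.corrcoef(a, b)[0, 1]

def best_template(doc, templates):
    """Return the index of the highest-scoring template and all scores."""
    scores = [template_score(doc, t) for t in templates]
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
form_a = rng.integers(0, 256, (64, 48))  # two synthetic blank templates
form_b = rng.integers(0, 256, (64, 48))
# A filled-out copy of form A: same layout, plus handwriting "noise".
filled = np.clip(form_a + rng.normal(0, 20, form_a.shape), 0, 255)
idx, scores = best_template(filled, [form_a, form_b])  # idx == 0
```

The filled document shares its layout (boxes, lines) with form A, so the two pixel vectors co-vary strongly; against the unrelated form B the correlation stays near zero, which is why the highest score identifies the matching template.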
  • the document matching module 220 may output an indication of the reference template form corresponding to the document, in the example of FIG. 5, the template form 420.
  • the content extraction module 230 may proceed to an extraction of the content of the document.
  • the extracted content may be associated with the corresponding known fields of the reference template form. For example, content located within a field “first name” may be extracted as a corresponding content for the field “first name”.
  • Turning to FIG. 6, a diagram providing an overview of a method 600 of conducting document matching based on identified landmarks is illustrated.
  • the method 600 may be executed by the system 200, in particular by the landmark detection module 210 and the document matching module 220, which takes as an input the image 202 and outputs a corresponding form template 420.
  • FIG. 7 shows a flowchart illustrating a computer-implemented method 700 implementing embodiments of the present technology.
  • the computer-implemented method of FIG. 7 may comprise a computer- implemented method executable by a processor of a computing environment, such as the computing environment 100 of FIG. 1, the method comprising a series of steps to be carried out by the computing environment.
  • Certain aspects of FIG. 7 may have been previously described with reference to FIGS. 2-6. The reader is directed to that disclosure for additional details.
  • the method 700 starts at step 702 by accessing the digital representation of the document, the digital representation being associated with a first resolution.
  • the method 700 then proceeds to step 704 by operating a Machine Learning Algorithm (MLA).
  • MLA Machine Learning Algorithm
  • the MLA having been previously trained to learn a first function allowing detection of landmarks of documents represented by digital representations and to learn a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function.
  • the MLA is trained based on a set of training digital representations of documents associated with labels, the labels identifying landmarks of the documents represented by the training digital representations.
  • operating the MLA comprises steps 706-710.
  • Step 706 comprises down-sampling the digital representation of the document, the down-sampled digital representation of the document being associated with a second resolution, the second resolution being lower than the first resolution.
  • Step 708 comprises detecting landmarks from the down-sampled digital representation of the document.
  • Step 710 comprises generating fractional pixel coordinates for the detected landmarks in accordance with the second resolution, the fractional pixel coordinates allowing reconstructing pixel coordinates in accordance with the first resolution.
  • the method 700 comprises determining the pixel coordinates of the landmarks by upscaling the fractional pixel coordinates from the second resolution to the first resolution which in turn may be outputted.
  • the MLA comprises a Convolutional Neural Network (CNN) comprising multiple layers, the multiple layers comprising a first layer implementing the learning of the first function and a second layer implementing the learning of the second function.
  • the labels identifying landmarks comprise coordinates.
  • the fractional pixel coordinates comprise floating values.
  • the first function implements a classification task, the classification task predicting whether a sub-portion of the digital representation of the document comprises a landmark.
  • the second function implements a regression task, the regression task generating fractional pixel coordinates from the sub-portion of the digital representation identified as comprising a landmark.
  • the landmarks comprise corners or edges.
  • FIG. 8 shows a flowchart illustrating a computer-implemented method 800 implementing embodiments of the present technology.
  • the computer-implemented method of FIG. 8 may comprise a computer-implemented method executable by a processor of a computing environment, such as the computing environment 100 of FIG. 1, the method comprising a series of steps to be carried out by the computing environment.
  • Certain aspects of FIG. 8 may have been previously described with reference to FIG. 2-6. The reader is directed to that disclosure for additional details.
  • the method 800 starts at step 802 by accessing the digital representation of the document. Then, at step 804, the method proceeds to accessing a set of digital representations of template documents, each one of the digital representations of template documents comprising known landmarks. At a step 806, the method proceeds to applying an image alignment routine to the document and the template documents. At a step 808, the method proceeds to calculating a covariance of pixel values of the document aligned and superimposed to the at least one of the template documents. Then, at a step 810, the method proceeds to determining, based on the covariance of the pixel values, whether the document is to be associated with the at least one of the template documents.
  • the image alignment routine comprises the steps of (i) determining pixel coordinates of landmarks of the document; and (ii) determining a transformation based on the determined pixel coordinates of the landmarks of the document and known landmarks of at least one of the template documents, the transformation allowing mapping the document onto the at least one of the template documents.
  • the transformation comprises one of an affine transformation and a homographic transformation.
  • the digital representation being associated with a first resolution and determining the pixel coordinates of landmarks of the document comprises executing the method 700.
  • the document is a form comprising filled content and the template documents comprise template forms, each one of the template forms comprising a plurality of fields.
  • the method 800 further comprises associating the filled content of the form with corresponding fields of the at least one of the template forms.
  • the landmarks comprise corners or edges.
  • FIG. 9 shows a flowchart illustrating a computer-implemented method 900 implementing embodiments of the present technology.
  • the computer-implemented method of FIG. 9 may comprise a computer-implemented method executable by a processor of a computing environment, such as the computing environment 100 of FIG. 1, the method comprising a series of steps to be carried out by the computing environment.
  • Certain aspects of FIG. 9 may have been previously described with reference to FIG. 2-6. The reader is directed to that disclosure for additional details.
  • the method 900 starts at step 902 by accessing the first image and the second image comprising known landmarks. Then, at step 904, the method 900 proceeds to determining pixel coordinates of landmarks of the first image. At a step 906, the method 900 proceeds to determining a transformation based on the determined pixel coordinates of the landmarks of the first image and known landmarks of the second image, the transformation allowing mapping of the first image onto the second image. At a step 908, the method 900 proceeds to calculating a covariance of pixel values of the first image aligned and superimposed to the second image.
  • the method 900 proceeds to determining, based on the covariance of the pixel values, whether the first image is to be associated with the second image.
  • the first image is associated with a first resolution and determining the pixel coordinates of landmarks of the first image comprises executing the method 700.
  • the landmarks comprise corners or edges.
  • the wording “and/or” is intended to represent an inclusive-or; for example, “X and/or Y” is intended to mean X or Y or both. As a further example, “X, Y, and/or Z” is intended to mean X or Y or Z or any combination thereof.
  • the foregoing description is intended to be exemplary rather than limiting. Modifications and improvements to the above-described implementations of the present technology may be apparent to those skilled in the art.


Abstract

Systems and methods for identifying landmarks of a document from a digital representation of the document. The method comprises accessing the digital representation of the document, which is associated with a first resolution, and operating a Machine Learning Algorithm (MLA), the MLA having been trained based on a set of training digital representations of documents associated with labels. Operating the MLA comprises down-sampling the digital representation of the document to a second, lower resolution, detecting landmarks, and generating fractional pixel coordinates for the detected landmarks. The method further determines the pixel coordinates of the landmarks by upscaling the fractional pixel coordinates from the second resolution to the first resolution and outputs the pixel coordinates of the landmarks.

Description

SYSTEMS AND METHODS FOR PROCESSING IMAGES
CROSS-REFERENCE
[01] The present application claims priority to U.S. Patent Application No. 16/778,324, entitled “SYSTEMS AND METHODS FOR PROCESSING IMAGES,” filed on January 31, 2020, the entirety of which is incorporated herein by reference, and to CA Patent Application
No. 3,070,701, entitled “SYSTEMS AND METHODS FOR PROCESSING IMAGES,” filed on January 31, 2020, the entirety of which is incorporated herein by reference.
FIELD
[02] The present technology relates to machine-vision systems and methods for processing images, such as digital representations of documents. In particular, the present technology relates to systems and methods for identifying landmarks of images and/or matching documents with corresponding templates based on identified landmarks of an image.
BACKGROUND
[03] Developments in machine-vision techniques have enabled automation of document processing. One such machine-vision technique is referred to as image registration and allows transformation of different images into one coordinate system which may, in turn, be relied upon to compare and/or integrate data from different images, for example, but without being limitative, in the context of matching documents with corresponding templates.
[04] Current image registration methods typically involve computing transformations of images based on landmark detection and matching. Known image registration methods may present certain limitations, in particular, though not exclusively, when the image is a digital representation of a paper document comprising defects, as is often the case with scanned documents. Such defects may comprise misalignment of the document during the scanning process, dirt present on the paper document and/or the scanner, handwritten annotations, etc. In such contexts, known image registration methods may not provide a sufficient level of accuracy, resulting in inaccurate alignment or failed alignment of images. This inaccurate or failed alignment may prove to be limiting in the context of matching documents with corresponding templates.
[05] Improvements are therefore desirable.
SUMMARY
[06] The present technology is directed to systems and methods that facilitate, in accordance with at least one broad aspect, improved identification of image landmarks. In accordance with at least another broad aspect, the present technology is directed to systems and methods that match documents with corresponding templates based on identified landmarks.
[07] In one broad aspect, there is provided a method of identifying landmarks of a document from a digital representation of the document, the method comprising: accessing the digital representation of the document, the digital representation being associated with a first resolution; operating a Machine Learning Algorithm (MLA), the MLA having been trained: based on a set of training digital representations of documents associated with labels, the labels identifying landmarks of the documents represented by the training digital representations; to learn a first function allowing detection of landmarks of documents represented by digital representations; to learn a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function; the operating the MLA comprising: down-sampling the digital representation of the document, the down-sampled digital representation of the document being associated with a second resolution, the second resolution being lower than the first resolution; detecting landmarks from the down-sampled digital representation of the document; generating fractional pixel coordinates for the detected landmarks in accordance with the second resolution, the fractional pixel coordinates allowing reconstructing pixel coordinates in accordance with the first resolution; determining the pixel coordinates of the landmarks by upscaling the fractional pixel coordinates from the second resolution to the first resolution; and outputting the pixel coordinates of the landmarks.
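As an illustration, the overall flow of this method (down-sample, detect at the lower resolution, then upscale using fractional pixel coordinates) can be sketched in a few lines. The max-pooling down-sampler and the brightest-cell detector below are simplifying assumptions standing in for the trained MLA; only the sequence of steps follows the method.

```python
import numpy as np

def downsample(image, factor):
    # Illustrative stand-in for the down-sampling performed inside the MLA:
    # max-pool the image into (factor x factor) cells, lowering the resolution.
    h, w = image.shape
    return (image[:h - h % factor, :w - w % factor]
            .reshape(h // factor, factor, w // factor, factor)
            .max(axis=(1, 3)))

def detect_landmark(low_res):
    # Toy detector: the brightest low-resolution cell.  The disclosed MLA
    # instead uses a trained CNN to decide which cells contain a landmark.
    return np.unravel_index(np.argmax(low_res), low_res.shape)

def identify_landmark(image, factor, fractional_offset):
    # Sketch of the method: down-sample to the second (lower) resolution,
    # detect, then upscale back to the first resolution by combining the
    # coarse cell index with the fractional pixel coordinates.
    cell = detect_landmark(downsample(image, factor))
    return tuple(int(round((c + f) * factor))
                 for c, f in zip(cell, fractional_offset))

# Synthetic 1000 x 1000 image with one bright landmark at pixel (302, 314).
image = np.zeros((1000, 1000))
image[302, 314] = 1.0
coords = identify_landmark(image, factor=10, fractional_offset=(0.2, 0.4))
```

On this synthetic input the landmark falls in low-resolution cell (30, 31), and combining that cell with the fractional offsets recovers (302, 314) at the original resolution.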
[08] In another broad aspect, there is provided a method of identifying a template document to be associated with a document, the method comprising: accessing the digital representation of the document; accessing a set of digital representations of template documents, each one of the digital representations of template documents comprising known landmarks; applying an image alignment routine to the document and the template documents; calculating a covariance of pixel values of the document aligned and superimposed to the at least one of the template documents; and determining, based on the covariance of the pixel values, whether the document is to be associated with the at least one of the template documents.
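A minimal sketch of the covariance-based decision in the preceding paragraph, assuming grayscale images that have already been aligned and superimposed. Normalizing the covariance into a correlation, and the 0.8 threshold, are illustrative choices not taken from the disclosure.

```python
import numpy as np

def matches_template(aligned_doc, template, threshold=0.8):
    # Covariance of the pixel values of the aligned, superimposed pair.
    # It is normalized here (Pearson correlation) so a fixed threshold can be
    # applied regardless of image contrast; 0.8 is an assumed value.
    a = aligned_doc.ravel().astype(float)
    b = template.ravel().astype(float)
    covariance = np.cov(a, b)[0, 1]
    correlation = covariance / (a.std(ddof=1) * b.std(ddof=1))
    return bool(correlation >= threshold)

rng = np.random.default_rng(0)
template = rng.random((50, 50))
same_doc = template + 0.05 * rng.random((50, 50))   # template plus scan noise
other_doc = rng.random((50, 50))                    # an unrelated document
```

A noisy scan of the template correlates strongly with it and is associated with that template, while an unrelated document is not.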
[09] In yet another broad aspect, there is provided a method of aligning a first image with a second image, the method comprising: accessing the first image; accessing the second image comprising known landmarks; determining pixel coordinates of landmarks of the first image; determining a transformation based on the determined pixel coordinates of the landmarks of the first image and known landmarks of the second image, the transformation allowing mapping of the first image onto the second image; calculating a covariance of pixel values of the first image aligned and superimposed to the second image; and determining, based on the covariance of the pixel values, whether the first image is to be associated with the second image.
[10] In other aspects, various implementations of the present technology provide a non-transitory computer-readable medium storing program instructions for executing one or more methods described herein, the program instructions being executable by a processor of a computer-based system.
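The transformation step of the alignment method recited in paragraph [09] may be sketched as follows, using a least-squares affine fit between detected landmarks of the first image and known landmarks of the second image. The point values and the choice of an affine (rather than homographic) model are assumptions for illustration.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    # Least-squares fit of a 2x3 affine transformation mapping src_pts onto
    # dst_pts.  An affine fit needs at least three point pairs; the disclosure
    # also contemplates homographic transformations, not shown here.
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    design = np.hstack([src, np.ones((src.shape[0], 1))])  # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return params.T  # 2x3 matrix M, so that dst ~ M @ [x, y, 1]

def apply_affine(matrix, pts):
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return (matrix @ homog.T).T

# Known landmarks (e.g., corners) of the second image ...
known = np.array([(0.0, 0.0), (500.0, 0.0), (0.0, 700.0), (500.0, 700.0)])
# ... and the same corners as detected in a misaligned first image, simulated
# here by applying a small rotation/scale/shift.
true_misalignment = np.array([[1.01, 0.02, 12.0], [-0.015, 0.99, 8.0]])
detected = apply_affine(true_misalignment, known)
# Recover the transformation that maps the first image onto the second.
M = estimate_affine(detected, known)
mapped = apply_affine(M, detected)
```

The recovered transformation maps the detected landmarks back onto the known landmarks, after which the covariance comparison of the superimposed images can proceed.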
[11] In other aspects, various implementations of the present technology provide a computer-based system, such as, for example, but without being limitative, an electronic device comprising at least one processor and a memory storing program instructions for executing one or more methods described herein, the program instructions being executable by the at least one processor of the electronic device.
[12] In the context of the present specification, unless expressly provided otherwise, a computer system may refer, but is not limited to, an “electronic device”, a “computing device”, an “operation system”, a “system”, a “computer-based system”, a “computer system”, a “network system”, a “network device”, a “controller unit”, a “monitoring device”, a “control device”, a “server”, and/or any combination thereof appropriate to the relevant task at hand.
[13] In the context of the present specification, unless expressly provided otherwise, the expressions “computer-readable medium” and “memory” are intended to include media of any nature and kind whatsoever, non-limiting examples of which include RAM, ROM, disks (e.g., CD-ROMs, DVDs, floppy disks, hard disk drives, etc.), USB keys, flash memory cards, solid-state drives, and tape drives. Still in the context of the present specification, “a” computer-readable medium and “the” computer-readable medium should not be construed as being the same computer-readable medium. To the contrary, and whenever appropriate, “a” computer-readable medium and “the” computer-readable medium may also be construed as a first computer-readable medium and a second computer-readable medium.
[14] In the context of the present specification, unless expressly provided otherwise, the words “first”, “second”, “third”, etc. have been used as adjectives only for the purpose of allowing for distinction between the nouns that they modify from one another, and not for the purpose of describing any particular relationship between those nouns.
[15] Additional and/or alternative features, aspects and advantages of implementations of the present technology will become apparent from the following description, the accompanying drawings, and the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[16] For a better understanding of the present technology, as well as other aspects and further features thereof, reference is made to the following description which is to be used in conjunction with the accompanying drawings, where:
[17] FIG. 1 is a block diagram of an example computing environment in accordance with at least one embodiment of the present technology;
[18] FIG. 2 is a block diagram illustrating a system comprising a landmark detection module and a document matching module in accordance with at least one embodiment of the present technology;
[19] FIG. 3 is a diagram illustrating a neural network in accordance with at least one embodiment of the present technology;
[20] FIG. 4 and 5 illustrate examples of document matching in accordance with at least one embodiment of the present technology;
[21] FIG. 6 is a diagram providing an overview of a method of conducting document matching based on identified landmarks in accordance with at least one embodiment of the present technology;
[22] FIG. 7 is a flow diagram illustrating steps of a computer-implemented method of identifying landmarks of a document from a digital representation of the document in accordance with at least one embodiment of the present technology;
[23] FIG. 8 is a flow diagram illustrating steps of identifying a template document to be associated with a document in accordance with at least one embodiment of the present technology; and
[24] FIG. 9 is a flow diagram illustrating steps of aligning a first image with a second image in accordance with at least one embodiment of the present technology.
[25] Unless otherwise explicitly specified herein, the drawings (“Figures”) are not to scale.
DETAILED DESCRIPTION
[26] The examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the present technology and not to limit its scope to such specifically recited examples and conditions. It will be appreciated that those skilled in the art may devise various arrangements which, although not explicitly described or shown herein, nonetheless embody the principles of the present technology and are included within its spirit and scope.
[27] Furthermore, as an aid to understanding, the following description may describe relatively simplified implementations of the present technology. As persons skilled in the art would understand, various implementations of the present technology may be of greater complexity.
[28] In some cases, what are believed to be helpful examples of modifications to the present technology may also be set forth. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and a person skilled in the art may make other modifications while nonetheless remaining within the scope of the present technology. Further, where no examples of modifications have been set forth, it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology.
[29] Moreover, all statements herein reciting principles, aspects, and implementations of the present technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof, whether they are currently known or developed in the future. Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the present technology. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like represent various processes which may be substantially represented in computer-readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
[30] The functions of the various elements shown in the figures, including any functional block labeled as a "processor", may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In some embodiments of the present technology, the processor may be a general purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a digital signal processor (DSP). Moreover, explicit use of the term a "processor" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.
[31] Software modules, or simply modules which are implied to be software, may be represented herein as any combination of flowchart elements or other elements indicating performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown. Moreover, it should be understood that one or more modules may include for example, but without being limitative, computer program logic, computer program instructions, software, stack, firmware, hardware circuitry, or a combination thereof which provides the required capabilities.
[32] With these fundamentals in place, we will now consider some non-limiting examples to illustrate various implementations of aspects of the present technology.
[33] FIG. 1 illustrates a computing environment in accordance with an embodiment of the present technology, shown generally as 100. In some embodiments, the computing environment 100 may be implemented by any of a conventional personal computer, a computer dedicated to managing network resources, a network device and/or an electronic device (such as, but not limited to, a mobile device, a tablet device, a server, a controller unit, a control device, etc.), and/or any combination thereof appropriate to the relevant task at hand. In some embodiments, the computing environment 100 comprises various hardware components including one or more single or multi-core processors collectively represented by processor 110, a solid-state drive 120, a random access memory 130, and an input/output interface 150. The computing environment 100 may be a computer specifically designed to detect landmarks and/or match documents. In some alternative embodiments, the computing environment 100 may be a generic computer system.
[34] In some embodiments, the computing environment 100 may also be a subsystem of one of the above-listed systems. In some other embodiments, the computing environment 100 may be an “off-the-shelf” generic computer system. In some embodiments, the computing environment 100 may also be distributed amongst multiple systems. The computing environment 100 may also be specifically dedicated to the implementation of the present technology. As a person skilled in the art of the present technology may appreciate, multiple variations as to how the computing environment 100 is implemented may be envisioned without departing from the scope of the present technology.
[35] Those skilled in the art will appreciate that processor 110 is generally representative of a processing capability. In some embodiments, in place of one or more conventional Central Processing Units (CPUs), one or more specialized processing cores may be provided. For example, one or more Graphic Processing Units (GPUs), Tensor Processing Units (TPUs), and/or other so-called accelerated processors (or processing accelerators) may be provided in addition to or in place of one or more CPUs.
[36] System memory will typically include random access memory 130, but is more generally intended to encompass any type of non-transitory system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), or a combination thereof. Solid-state drive 120 is shown as an example of a mass storage device, but more generally such mass storage may comprise any type of non-transitory storage device configured to store data, programs, and other information, and to make the data, programs, and other information accessible via a system bus 160. For example, mass storage may comprise one or more of a solid state drive, hard disk drive, a magnetic disk drive, and/or an optical disk drive.
[37] Communication between the various components of the computing environment 100 may be enabled by a system bus 160 comprising one or more internal and/or external buses (e.g., a PCI bus, universal serial bus, IEEE 1394 “Firewire” bus, SCSI bus, Serial-ATA bus, ARINC bus, etc.), to which the various hardware components are electronically coupled.
[38] The input/output interface 150 may allow enabling networking capabilities such as wire or wireless access. As an example, the input/output interface 150 may comprise a networking interface such as, but not limited to, a network port, a network socket, a network interface controller and the like. Multiple examples of how the networking interface may be implemented will become apparent to the person skilled in the art of the present technology. For example, but without being limitative, the networking interface may implement specific physical layer and data link layer standard such as Ethernet, Fibre Channel, Wi-Fi, Token Ring or Serial communication protocols. The specific physical layer and the data link layer may provide a base for a full network protocol stack, allowing communication among small groups of computers on the same local area network (LAN) and large-scale network communications through routable protocols, such as Internet Protocol (IP).
[39] According to some implementations of the present technology, the solid-state drive 120 stores program instructions suitable for being loaded into the random access memory 130 and executed by the processor 110 for executing acts of one or more methods described herein, relating to detecting landmarks and/or matching documents. For example, at least some of the program instructions may be part of a library or an application.
[40] FIG. 2 is a block diagram illustrating a system 200 comprising a landmark detection module 210, a document matching module 220 and a content extraction module 230. In accordance with some embodiments, the system 200 may receive one or more images 202 for further processing, for example, but without being limitative, further processing involving image registration and/or document matching.
[41] The one or more images 202 may be accessed from a computer-readable memory storing digital representations of images. The digital representations of the images may be stored in a computer-readable format, for example, but without being limitative, under the file formats jpeg, png, tiff and/or gif. The digital representations may be compressed or uncompressed. The digital representations may be in raster formats or vectorial formats. This aspect is non-limitative and multiple variations will become apparent to the person skilled in the art of the present technology. The digital representations may have been generated by a camera, a scanner or any electronic device configured to generate a digital representation of an image. In some embodiments, the image comprises landmarks, which may be broadly defined as image features which may be relied upon to define a coordinate system associated with the content of the image. Such a coordinate system may be used for multiple machine-vision tasks, such as, but not limited to, image registration.
[42] In accordance with some embodiments of the present technology, the image 202 may comprise a digital representation of a document. The document may be a sheet of paper. In some embodiments, the sheet of paper may be a form which may be filled with additional information. The additional information may have been handwritten on the form or typed in (e.g., a form having been filled electronically and then printed). In some embodiments, the empty form (i.e., prior to any additional information being added to the form) may define a template form. The template form may comprise pre-defined content/fields such as boxes, lines, sections, questions, graphical information, etc. The template form may be organised so as to collect information associated with one or more tasks, such as administrative tasks. As an example, an administrative task may be to collect information in the context of an insurance company gathering information from clients or potential clients. In some embodiments, the template form may have been downloaded from a website by a user, printed, filled out and scanned, thereby generating a digital representation of the document (in this example, a filled-out form). In such an example, landmarks associated with the document may comprise corners whose exact positions in the digital representation vary depending on how the document was positioned during the scanning process. This typically results in positions of corners of a same document varying from one scanned version of the document to another. Such a situation may be referred to as a misaligned document or misaligned image. In other embodiments, landmarks associated with the document may also comprise edges whose exact positions in the digital representation may also vary depending on how the document was positioned during the scanning process.
[43] In accordance with some aspects of the present technology, a first technical problem to be addressed involves accurately identifying positions of landmarks (typically four corners/edges, though variations may encompass fewer than four or more than four corners/edges) of a document from a digital representation of the document (in this example, the scanned document). Embodiments of the present technology provide improved performance for accurately identifying positions of landmarks from an image. Such embodiments may be implemented through a landmark detection module such as the landmark detection module 210.
[44] Broadly speaking, the landmark detection module 210 allows automatic and accurate identification of landmarks contained in an image, for example, but without being limitative, corners of a scanned document.
[45] Referring simultaneously to FIG. 2 and 3, the landmark detection module 210, in some embodiments, implements one or more machine learning algorithms (MLAs). In some embodiments, the one or more MLAs rely on deep learning approaches. In the example illustrated at FIG. 3, the one or more MLAs comprise a neural network, such as a convolutional neural network (CNN) 300 comprising multiple layers 310, 320 and 330. The CNN may rely on training datasets, such as the training dataset 204, to be trained so as to detect landmarks (e.g., corners or edges) of an image inputted to the system 200 (e.g., the image 202). In some embodiments, the training dataset 204 comprises a set of training digital representations of documents associated with labels. The labels may comprise coordinates of landmarks associated with a given document (e.g., coordinates of corners). In some embodiments, the training dataset 204 may also comprise a pixel-wise landmarks heat map which may, for example, be transformed from the coordinates of landmarks. In some embodiments, the training dataset 204 may be relied upon to generate one or more machine learning (ML) models which may be stored in the ML model database 212 and called upon depending on certain use cases. For example, a first ML model may be associated with the detection of corners of forms while a second ML model may be associated with the detection of landmarks from panoramic pictures.
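The transformation of coordinate labels into a pixel-wise landmark heat map, mentioned above, can be sketched as follows. The Gaussian-bump formulation and its spread are assumed details; the disclosure does not specify how the heat map is derived from the coordinates.

```python
import numpy as np

def corners_to_heatmap(corner_coords, shape, sigma=2.0):
    # Transform coordinate labels into a pixel-wise landmark heat map:
    # one Gaussian bump per corner.  The Gaussian form and the spread sigma
    # are illustrative choices, not specified by the disclosure.
    heatmap = np.zeros(shape, dtype=float)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    for y, x in corner_coords:
        heatmap += np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
    return np.clip(heatmap, 0.0, 1.0)

# Four labeled corners of a 96 x 96 training image.
label = corners_to_heatmap([(5, 5), (5, 90), (90, 5), (90, 90)], shape=(96, 96))
```

The resulting map peaks at the labeled corner positions and decays to zero elsewhere, giving the CNN a dense training target instead of four bare coordinates.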
[46] As previously mentioned, the ML model may implement a pre-trained CNN. One example of such a CNN is ResNet-18, which is a CNN trained on more than a million images from the ImageNet database. In accordance with embodiments of the present technology, the pre-trained CNN may be subjected to complementary training, the complementary training being directed to detection of specific landmarks, such as corners of a document.
[47] Referring back to the CNN 300 of FIG. 3, multiple groups of layers 310-330 may implement different learning functions, such as, for example, a first function allowing detection of landmarks of documents represented by digital representations and a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function. In some embodiments, the first function is configured to down-sample the digital representation of the document (i.e., lower the resolution) and, for each pixel of the digital representation of the document, classify the pixel as a landmark (e.g., a corner, an edge) or not a landmark (e.g., not a corner, not an edge). In some embodiments, the first function may be referred to as a classification task predicting whether a sub-portion of the digital representation of the document comprises a landmark. In some embodiments, the second function may be referred to as a regression task generating fractional pixel coordinates from the sub-portion of the digital representation identified as comprising a landmark. In some embodiments, the first function may be described as classifying pixels as landmarks while the second function may be described as predicting an offset to improve precision.
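How the outputs of the two functions might be combined can be sketched as follows; the (S, S) score map and (S, S, 2) offset map layouts are assumed conventions for illustration, not taken from the disclosure.

```python
import numpy as np

def read_heads(class_scores, offsets, factor):
    # Combine the two learned functions: the classification output flags which
    # low-resolution cell contains a landmark, and the regression output
    # supplies the fractional pixel coordinates within that cell.
    cell = np.unravel_index(np.argmax(class_scores), class_scores.shape)
    frac_y, frac_x = offsets[cell]
    # Upscale back to first-resolution pixel coordinates.
    return (int(round((cell[0] + frac_y) * factor)),
            int(round((cell[1] + frac_x) * factor)))

scores = np.zeros((100, 100))
scores[30, 31] = 1.0                  # classification: landmark in cell (30, 31)
offsets = np.zeros((100, 100, 2))
offsets[30, 31] = (0.2, 0.4)          # regression: fractional offsets in the cell
coords = read_heads(scores, offsets, factor=10)
```

With a down-sample factor of 10, cell (30, 31) and offsets (0.2, 0.4) yield the full-resolution coordinates (302, 314), matching the worked example discussed in the following paragraphs.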
[48] In accordance with some aspects of the present technology, the CNN 300 is trained based on a set of training digital representations of documents associated with labels, the labels identifying landmarks (e.g., corners) of the documents represented by the training digital representations. The training of the CNN 300 allows the learning of the first function and the second function. Once training of the CNN 300 is deemed sufficient (e.g., based on a loss function assessing accuracy of classifications/predictions made by the CNN 300), the CNN 300 may be operated by the landmark detection module 210. During operation, the CNN 300 will receive as an input an image for which landmarks (e.g., corners) need to be detected. While processing of the image progresses through the CNN 300, the CNN 300 will down-sample the image (i.e., a resolution of the image will decrease as it is processed through the CNN). However, while the CNN 300 operates detection of landmarks from the down-sampled image, the CNN also operates the calculation of fractional pixel coordinates for the detected landmarks. In some embodiments, the fractional pixel coordinates are calculated using the original landmark labels and the down-sample factor used by the first function. For example, if a landmark is on pixel x = 302, y = 314, and the down-sample factor is 10, then the fractional pixel values are (2/10, 4/10); these fractional labels are then used as labels to train the second function. In other words, given a coordinate x and a down-sample factor w, the fractional pixel coordinate is equal to x/w - floor(x/w).
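The formula above can be checked directly against the worked example in the same paragraph:

```python
import math

def fractional_pixel_coordinate(x, w):
    # Given a coordinate x and a down-sample factor w, the fractional pixel
    # coordinate is x/w - floor(x/w), per the formula stated above.
    return x / w - math.floor(x / w)

# A landmark at pixel x = 302, y = 314 with a down-sample factor of 10
# gives the fractional labels (2/10, 4/10) used to train the second function.
fx = fractional_pixel_coordinate(302, 10)
fy = fractional_pixel_coordinate(314, 10)
```

Here fx evaluates to 0.2 and fy to 0.4, reproducing the (2/10, 4/10) labels from the example.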
[49] In some embodiments, the fractional pixel coordinates comprise floating values with sufficient accuracy and may be relied upon for reconstructing precise pixel coordinates in the original resolution of the image. In some embodiments, precise pixel coordinates may be defined as the position of the exact pixel at which the landmark is positioned. In some other embodiments, the precise pixel coordinates may be defined as the position of the landmark with about 2-5 pixels of precision. In yet some other embodiments, the precise pixel coordinates may be defined as the position of the landmark with about 5-10 pixels of precision. In some embodiments, the first function and the second function are executed in parallel so that the fractional pixel coordinates may be calculated while the first function identifies landmarks on the lower-resolution image. As a result, once the landmarks are identified at the lower resolution and associated fractional pixel coordinates are calculated, the landmark detection module 210 may reconstruct the pixel coordinates of the landmarks in the original image based on the fractional pixel coordinates. As an example, the original resolution of an image may be 10 000 × 10 000. The first function identifies a landmark (e.g., a corner) at pixel x = 10, y = 5 of the 100 × 100 down-sampled image while the second function calculates the fractional pixel coordinates of the identified landmark as x = 0.6 and y = 0.3. Based on the output of the first function and the second function, the landmark detection module 210 may upscale the pixel coordinates so as to determine that the pixel coordinates of the landmark in the original resolution are x = 1 060 and y = 530.
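Conversely, the upscaling step can be sketched as reconstructing the full-resolution coordinate from the low-resolution cell index, the predicted fractional offset, and the down-sample factor. This is a minimal illustration under the formula x/w - floor(x/w) given earlier; the function name is an assumption.

```python
def reconstruct_coord(cell: int, frac: float, factor: int) -> int:
    # Full-resolution pixel coordinate: (cell index + fractional offset)
    # scaled back up by the down-sample factor
    return round((cell + frac) * factor)
```

For instance, a landmark detected in low-resolution cell x = 10 with fractional offset 0.6 under a down-sample factor of 100 maps back to full-resolution coordinate 1060, and the round trip with the label formula is exact: a landmark at x = 302 with factor 10 yields cell 30 and fraction 0.2, which reconstructs to 302.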
[50] As detailed above, the landmark detection module 210 may output pixel coordinates of landmarks associated with an image. In the context of the system 200, the landmark detection module 210 may provide the determined pixel coordinates to the document matching module 220. In the specific context of identifying corners of a document from its digital representation, the landmark detection module 210 may generate a set of coordinates associated with corners of the document (e.g., a first set of coordinates of a top left corner, a second set of coordinates of a top right corner, a third set of coordinates of a bottom left corner and a fourth set of coordinates of a bottom right corner). In some embodiments, the CNN 300 outputs four images, each one of the four images representing a distinct identified corner, and four sets of associated coordinates.
[51] In some embodiments, the document matching module 220 relies on the set of coordinates associated with corners of the document to align the document with reference documents, such as document templates. As an example, the system 200 may be operated in the context of identifying which template form amongst a plurality of template forms corresponds to the document of the digital representation. In some embodiments, the template forms (also referred to as reference documents) are stored in the template database 222.
[52] FIG. 4 illustrates an example of a digital representation 410 of a form as well as reference template forms 420-440. The digital representation 410 comprises boxes filled out with information. For each one of the reference template forms 420-440, corners have been identified or associated upon creation of the template database 222. Corners of the document represented in the digital representation 410 have been determined by the landmark detection module 210. Once the corners of the document and the corners of the reference template forms 420-440 are known, the document matching module 220 may undertake to align the document with respect to the reference template forms 420-440. In some embodiments, the alignment is operated by an image alignment routine which comprises determining a transformation based on the pixel coordinates of the corners of the document and known corners of the reference template documents. In some embodiments, the transformation allows mapping the document onto the reference template documents as illustrated at FIG. 5 (see 510 representing the document and the template forms 420-440 aligned). In some embodiments, the transformation comprises an affine transformation and/or a homographic transformation (e.g., implemented by computing a homography matrix). Implementation details of transformations such as the affine transformation and/or the homographic transformation will become apparent to the person skilled in the art of the present technology. Other transformations may also be used without departing from the scope of the present technology.
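As a hedged illustration of the transformation step, an affine mapping between the document's detected corners and a template's known corners can be estimated by least squares. This NumPy sketch is one possible implementation, not the one disclosed above; function names and the least-squares formulation are assumptions.

```python
import numpy as np

def fit_affine(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    # Least-squares 2 x 3 affine matrix mapping src points (N x 2)
    # onto dst points (N x 2), e.g., document corners onto template corners
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])  # homogeneous coordinates [x, y, 1]
    M_t, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return M_t.T

def apply_affine(M: np.ndarray, pts: np.ndarray) -> np.ndarray:
    # Map N x 2 points through the 2 x 3 affine matrix
    X = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return X @ M.T
```

A homographic (projective) transformation would be estimated analogously from the four corner correspondences by solving for a 3 x 3 homography matrix instead.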
[53] Once the document and the reference template forms 420-440 are aligned, the document matching module 220 proceeds to calculating a covariance of pixel values of the document aligned and superimposed to the reference template forms 420-440, as exemplified by the graphic representation 520. In some embodiments, operating the document matching module 220 entails comparing the pixel values of the document and a given reference template form, the document being aligned with the given reference template form. Two one-dimensional vectors of length (width * height) are generated, a first one-dimensional vector being associated with the aligned document and a second one-dimensional vector being associated with the given reference template form with which the document is aligned. If the document and the given reference template form are well aligned, pixels should move from dark to light in a similar way in the aligned document compared to the given reference template form. If the document corresponds to the given reference template form, then the first one-dimensional vector substantially matches the second one-dimensional vector. If the document does not correspond to the given reference template form, then the first one-dimensional vector does not match the second one-dimensional vector. In some embodiments, a calculation of a correlation between the first one-dimensional vector and the second one-dimensional vector allows determining whether a substantial match exists.
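The vector comparison described above can be sketched with NumPy as a Pearson correlation between the two flattened images. This is an illustrative stand-in for the correlation calculation, not the disclosed implementation; the function name is an assumption.

```python
import numpy as np

def alignment_correlation(doc: np.ndarray, template: np.ndarray) -> float:
    # Flatten both images into one-dimensional vectors of length width * height
    a = doc.astype(float).ravel()
    b = template.astype(float).ravel()
    # Center, then compute Pearson correlation: close to +1 for a good match
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```

Identical aligned images correlate at 1.0, while an image and its photometric inverse correlate at -1.0, so a high score indicates the pixels "move from dark to light" together as described above.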
[54] In some embodiments, the covariance of pixel values allows identification of peaks reflective of a match between a portion of the document and a corresponding reference template form. For example, a box located at a same position in both the document and a corresponding reference template form will result in a peak of the covariance value for the pixels representing the box. As a result, a high covariance will indicate that the reference template form is more likely to correspond to the document. In some embodiments, the document matching module 220 may therefore rely on the highest covariance to determine which reference template form amongst a plurality of reference template forms is likely to correspond to the document. Once that determination is completed, the document matching module 220 may output an indication of the reference template form corresponding to the document (in the example of FIG. 5, the template form 420).

[55] Referring back to FIG. 2, once the document matching module 220 has identified which reference template form corresponds to the document, the content extraction module 230 may proceed to an extraction of the content of the document. As the system 200 knows which reference template form corresponds to the document, the extracted content may be associated with the corresponding known fields of the reference template form. For example, content located within a field “first name” may be extracted as a corresponding content for the field “first name”.
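Selecting the template with the highest covariance, as described above, can then be sketched as follows. The helper names are assumptions for illustration, and the inputs are assumed to be the document already aligned with each candidate template.

```python
import numpy as np

def pixel_covariance(doc: np.ndarray, template: np.ndarray) -> float:
    # Covariance of the two flattened pixel-value vectors
    a = doc.astype(float).ravel()
    b = template.astype(float).ravel()
    return float(np.mean((a - a.mean()) * (b - b.mean())))

def best_template(doc: np.ndarray, templates: list) -> int:
    # Index of the aligned reference template whose pixel values
    # co-vary most strongly with the document
    return int(np.argmax([pixel_covariance(doc, t) for t in templates]))
```

A template whose boxes coincide with the document's produces a large positive covariance peak, so the argmax picks it out among the candidates.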
[56] Turning now to FIG. 6, a diagram providing an overview of a method 600 of conducting document matching based on identified landmarks is illustrated. The method 600 may be executed by the system 200, in particular by the landmark detection module 210 and the document matching module 220, which take as an input the image 212 and output a corresponding form template 420.
[57] Referring now to FIG. 7, some non-limiting example instances of systems and computer-implemented methods for identifying landmarks of a document from a digital representation of the document are detailed. More specifically, FIG. 7 shows a flowchart illustrating a computer-implemented method 700 implementing embodiments of the present technology. The computer-implemented method of FIG. 7 may comprise a computer-implemented method executable by a processor of a computing environment, such as the computing environment 100 of FIG. 1, the method comprising a series of steps to be carried out by the computing environment.

[58] Certain aspects of FIG. 7 may have been previously described with reference to FIG.
2-6. The reader is directed to that disclosure for additional details.
[59] The method 700 starts at step 702 by accessing the digital representation of the document, the digital representation being associated with a first resolution. The method 700 then proceeds to step 704 by operating a Machine Learning Algorithm (MLA), the MLA having been previously trained to learn a first function allowing detection of landmarks of documents represented by digital representations and to learn a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function. In some embodiments, the MLA is trained based on a set of training digital representations of documents associated with labels, the labels identifying landmarks of the documents represented by the training digital representations. In some embodiments, operating the MLA comprises steps 706-710.
[60] Step 706 comprises down-sampling the digital representation of the document, the down-sampled digital representation of the document being associated with a second resolution, the second resolution being lower than the first resolution. Step 708 comprises detecting landmarks from the down-sampled digital representation of the document. Step 710 comprises generating fractional pixel coordinates for the detected landmarks in accordance with the second resolution, the fractional pixel coordinates allowing reconstructing pixel coordinates in accordance with the first resolution.
[61] At further step 712, the method 700 comprises determining the pixel coordinates of the landmarks by upscaling the fractional pixel coordinates from the second resolution to the first resolution which in turn may be outputted.
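Steps 706-712 can be summarized in a short end-to-end sketch, where `classify` and `regress` stand in for the trained first and second functions. Both names, the callback interface, and the naive striding down-sampler are assumptions for illustration only.

```python
import numpy as np

def detect_landmarks(image, factor, classify, regress):
    # Step 706: down-sample the first-resolution image (naive striding here)
    small = image[::factor, ::factor]
    coords = []
    h, w = small.shape[:2]
    for i in range(h):
        for j in range(w):
            # Step 708: does this low-resolution cell contain a landmark?
            if classify(small, i, j):
                # Step 710: fractional offsets within the cell
                fy, fx = regress(small, i, j)
                # Step 712: upscale back to the first resolution
                coords.append(((i + fy) * factor, (j + fx) * factor))
    return coords
```

With stub functions flagging cell (1, 2) and predicting offsets (0.3, 0.6) under a factor of 10, the sketch outputs a single landmark near (13, 26) in the first resolution.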
[62] In some embodiments, the MLA comprises a Convolutional Neural Network (CNN) comprising multiple layers, the multiple layers comprising a first layer implementing the learning of the first function and a second layer implementing the learning of the second function. In some embodiments, the labels identifying landmarks comprise coordinates. In some embodiments, the fractional pixel coordinates comprise floating values. In some embodiments, the first function implements a classification task, the classification task predicting whether a sub-portion of the digital representation of the document comprises a landmark. In some embodiments, the second function implements a regression task, the regression task generating fractional pixel coordinates from the sub-portion of the digital representation identified as comprising a landmark. In some embodiments, the landmarks comprise corners or edges.
[63] Referring now to FIG. 8, some non-limiting example instances of systems and computer-implemented methods for identifying a template document to be associated with a document are detailed. More specifically, FIG. 8 shows a flowchart illustrating a computer-implemented method 800 implementing embodiments of the present technology. The computer-implemented method of FIG. 8 may comprise a computer-implemented method executable by a processor of a computing environment, such as the computing environment 100 of FIG. 1, the method comprising a series of steps to be carried out by the computing environment.
[64] Certain aspects of FIG. 8 may have been previously described with references to FIG. 2-6. The reader is directed to that disclosure for additional details.
[65] The method 800 starts at step 802 by accessing the digital representation of the document. Then, at step 804, the method proceeds to accessing a set of digital representations of template documents, each one of the digital representations of template documents comprising known landmarks. At a step 806, the method proceeds to applying an image alignment routine to the document and the template documents. At a step 808, the method proceeds to calculating a covariance of pixel values of the document aligned and superimposed to the at least one of the template documents. Then, at a step 810, the method proceeds to determining, based on the covariance of the pixel values, whether the document is to be associated with the at least one of the template documents.
[66] In some embodiments, the image alignment routine comprises the steps of (i) determining pixel coordinates of landmarks of the document; and (ii) determining a transformation based on the determined pixel coordinates of the landmarks of the document and known landmarks of at least one of the template documents, the transformation allowing mapping the document onto the at least one of the template documents. In some embodiments, the transformation comprises one of an affine transformation and a homographic transformation. In some embodiments, the digital representation is associated with a first resolution and determining the pixel coordinates of landmarks of the document comprises executing the method 700. In some embodiments, the document is a form comprising filled content and the template documents comprise template forms, each one of the template forms comprising a plurality of fields. In some embodiments, the method 800 further comprises associating the filled content of the form with corresponding fields of the at least one of the template forms. In some embodiments, the landmarks comprise corners or edges.
[67] Referring now to FIG. 9, some non-limiting example instances of systems and computer-implemented methods for aligning a first image with a second image are detailed. More specifically, FIG. 9 shows a flowchart illustrating a computer-implemented method 900 implementing embodiments of the present technology. The computer-implemented method of FIG. 9 may comprise a computer-implemented method executable by a processor of a computing environment, such as the computing environment 100 of FIG. 1, the method comprising a series of steps to be carried out by the computing environment.
[68] Certain aspects of FIG. 9 may have been previously described with references to FIG. 2-6. The reader is directed to that disclosure for additional details.
[69] The method 900 starts at step 902 by accessing the first image and the second image comprising known landmarks. Then, at step 904, the method 900 proceeds to determining pixel coordinates of landmarks of the first image. At a step 906, the method 900 proceeds to determining a transformation based on the determined pixel coordinates of the landmarks of the first image and known landmarks of the second image, the transformation allowing mapping of the first image onto the second image. At a step 908, the method 900 proceeds to calculating a covariance of pixel values of the first image aligned and superimposed to the second image. Then, at a step 910, the method 900 proceeds to determining, based on the covariance of the pixel values, whether the first image is to be associated with the second image. In some embodiments, the first image is associated with a first resolution and determining the pixel coordinates of landmarks of the first image comprises executing the method 700. In some embodiments, the landmarks comprise corners or edges.
[70] While some of the above-described implementations may have been described and shown with reference to particular acts performed in a particular order, it will be understood that these acts may be combined, sub-divided, or re-ordered without departing from the teachings of the present technology. At least some of the acts may be executed in parallel or in series. Accordingly, the order and grouping of the acts are not a limitation of the present technology.
[71] It should be expressly understood that not all technical effects mentioned herein need be enjoyed in each and every embodiment of the present technology.
[72] As used herein, the wording “and/or” is intended to represent an inclusive-or; for example, “X and/or Y” is intended to mean X or Y or both. As a further example, “X, Y, and/or Z” is intended to mean X or Y or Z or any combination thereof. [73] The foregoing description is intended to be exemplary rather than limiting. Modifications and improvements to the above-described implementations of the present technology may be apparent to those skilled in the art.

Claims

What is claimed is:
1. A computer-implemented method of identifying landmarks of a document from a digital representation of the document, the method comprising: accessing the digital representation of the document, the digital representation being associated with a first resolution; operating a Machine Learning Algorithm (MLA), the MLA having been trained: based on a set of training digital representations of documents associated with labels, the labels identifying landmarks of the documents represented by the training digital representations; to learn a first function allowing detection of landmarks of documents represented by digital representations; to learn a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function; the operating the MLA comprising: down-sampling the digital representation of the document, the down-sampled digital representation of the document being associated with a second resolution, the second resolution being lower than the first resolution; detecting landmarks from the down-sampled digital representation of the document; generating fractional pixel coordinates for the detected landmarks in accordance with the second resolution, the fractional pixel coordinates allowing reconstructing pixel coordinates in accordance with the first resolution; determining the pixel coordinates of the landmarks by upscaling the fractional pixel coordinates from the second resolution to the first resolution; and outputting the pixel coordinates of the landmarks.
2. The method of claim 1, wherein the MLA comprises a Convolutional Neural Network (CNN) comprising multiple layers, the multiple layers comprising a first layer implementing the learning of the first function and a second layer implementing the learning of the second function.
3. The method of claim 1, wherein the labels identifying landmarks comprise coordinates.
4. The method of claim 1, wherein the fractional pixel coordinates comprise floating values.
5. The method of claim 1, wherein the first function implements a classification task, the classification task predicting whether a sub-portion of the digital representation of the document comprises a landmark.
6. The method of claim 5, wherein the second function implements a regression task, the regression task generating fractional pixel coordinates from the sub-portion of the digital representation identified as comprising a landmark.
7. The method of claim 1, wherein the landmarks comprise one of corners or edges.
8. A computer-implemented method of identifying a template document to be associated with a document, the method comprising: accessing the digital representation of the document; accessing a set of digital representations of template documents, each one of the digital representations of template documents comprising known landmarks; applying an image alignment routine to the document and the template documents; calculating a covariance of pixel values of the document aligned and superimposed to the at least one of the template documents; and determining, based on the covariance of the pixel values, whether the document is to be associated with the at least one of the template documents.
9. The method of claim 8, wherein the image alignment routine comprises the steps of: determining pixel coordinates of landmarks of the document; and determining a transformation based on the determined pixel coordinates of the landmarks of the document and known landmarks of at least one of the template documents, the transformation allowing mapping the document onto the at least one of the template documents.
10. The method of claim 9, wherein the transformation comprises one of an affine transformation and a homographic transformation.
11. The method of claim 9, wherein the digital representation is associated with a first resolution and wherein determining the pixel coordinates of landmarks of the document comprises: operating a Machine Learning Algorithm (MLA), the MLA having been trained: based on a set of training digital representations of documents associated with labels, the labels identifying landmarks of the documents represented by the training digital representations; to learn a first function allowing detection of landmarks of documents represented by digital representations; to learn a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function; the operating the MLA comprising: down-sampling the digital representation of the document, the down-sampled digital representation of the document being associated with a second resolution, the second resolution being lower than the first resolution; detecting landmarks from the down-sampled digital representation of the document; generating fractional pixel coordinates for the detected landmarks in accordance with the second resolution, the fractional pixel coordinates allowing reconstructing pixel coordinates in accordance with the first resolution; determining the pixel coordinates of the landmarks by upscaling the fractional pixel coordinates from the second resolution to the first resolution; and outputting the pixel coordinates of the landmarks.
12. The method of claim 8, wherein the document is a form comprising filled content and the template documents comprise template forms, each one of the template forms comprising a plurality of fields.
13. The method of claim 12, further comprising associating the filled content of the form with corresponding fields of the at least one of the template forms.
14. The method of claim 8, wherein the landmarks comprise one of corners or edges.
15. A computer-implemented method of aligning a first image with a second image, the method comprising: accessing the first image; accessing the second image comprising known landmarks; determining pixel coordinates of landmarks of the first image; determining a transformation based on the determined pixel coordinates of the landmarks of the first image and known landmarks of the second image, the transformation allowing mapping of the first image onto the second image; calculating a covariance of pixel values of the first image aligned and superimposed to the second image; and determining, based on the covariance of the pixel values, whether the first image is to be associated with the second image.
16. The method of claim 15, wherein the first image is associated with a first resolution and wherein determining the pixel coordinates of landmarks of the first image comprises: operating a Machine Learning Algorithm (MLA), the MLA having been trained: based on a set of training images associated with labels, the labels identifying landmarks of the training images; to learn a first function allowing detection of landmarks of images; to learn a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function; the operating the MLA comprising: down-sampling the first image, the down-sampled first image being associated with a second resolution, the second resolution being lower than the first resolution; detecting landmarks from the down-sampled first image; generating fractional pixel coordinates for the detected landmarks in accordance with the second resolution, the fractional pixel coordinates allowing reconstructing pixel coordinates in accordance with the first resolution; determining the pixel coordinates of the landmarks by upscaling the fractional pixel coordinates from the second resolution to the first resolution; and outputting the pixel coordinates of the landmarks.
17. A system for identifying landmarks of a document from a digital representation of the document, the system comprising: at least one processor, and memory storing a plurality of executable instructions which, when executed by the at least one processor, cause the system to: access the digital representation of the document, the digital representation being associated with a first resolution; operate a Machine Learning Algorithm (MLA), the MLA having been trained: based on a set of training digital representations of documents associated with labels, the labels identifying landmarks of the documents represented by the training digital representations; to learn a first function allowing detection of landmarks of documents represented by digital representations; to learn a second function allowing generation of fractional pixel coordinates for the landmarks detected by the first function; the operating the MLA comprising: down-sampling the digital representation of the document, the down-sampled digital representation of the document being associated with a second resolution, the second resolution being lower than the first resolution; detecting landmarks from the down-sampled digital representation of the document; generating fractional pixel coordinates for the detected landmarks in accordance with the second resolution, the fractional pixel coordinates allowing reconstructing pixel coordinates in accordance with the first resolution; determine the pixel coordinates of the landmarks by upscaling the fractional pixel coordinates from the second resolution to the first resolution; and output the pixel coordinates of the landmarks.
18. A system for identifying a template document to be associated with a document, the system comprising: at least one processor, and memory storing a plurality of executable instructions which, when executed by the at least one processor, cause the system to: access the digital representation of the document; access a set of digital representations of template documents, each one of the digital representations of template documents comprising known landmarks; apply an image alignment routine to the document and the template documents; calculate a covariance of pixel values of the document aligned and superimposed to the at least one of the template documents; and determine, based on the covariance of the pixel values, whether the document is to be associated with the at least one of the template documents.
19. A system for aligning a first image with a second image, the system comprising: at least one processor, and memory storing a plurality of executable instructions which, when executed by the at least one processor, cause the system to: access the first image; access the second image comprising known landmarks; determine pixel coordinates of landmarks of the first image; determine a transformation based on the determined pixel coordinates of the landmarks of the first image and known landmarks of the second image, the transformation allowing mapping of the first image onto the second image; calculate a covariance of pixel values of the first image aligned and superimposed to the second image; and determine, based on the covariance of the pixel values, whether the first image is to be associated with the second image.
PCT/IB2021/050749 2020-01-31 2021-01-30 Systems and methods for processing images WO2021152550A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US16/778,324 2020-01-31
US16/778,324 US11514702B2 (en) 2020-01-31 2020-01-31 Systems and methods for processing images
CA3,070,701 2020-01-31
CA3070701A CA3070701C (en) 2020-01-31 2020-01-31 Systems and methods for processing images

Publications (1)

Publication Number Publication Date
WO2021152550A1 true WO2021152550A1 (en) 2021-08-05

Family

ID=77078371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2021/050749 WO2021152550A1 (en) 2020-01-31 2021-01-30 Systems and methods for processing images

Country Status (1)

Country Link
WO (1) WO2021152550A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070168382A1 (en) * 2006-01-03 2007-07-19 Michael Tillberg Document analysis system for integration of paper records into a searchable electronic database
US20110052062A1 (en) * 2009-08-25 2011-03-03 Patrick Chiu System and method for identifying pictures in documents
US20140029857A1 (en) * 2011-04-05 2014-01-30 Hewlett-Packard Development Company, L.P. Document Registration
US20180024974A1 (en) * 2016-07-22 2018-01-25 Dropbox, Inc. Enhancing documents portrayed in digital images
US20190087942A1 (en) * 2013-03-13 2019-03-21 Kofax, Inc. Content-Based Object Detection, 3D Reconstruction, and Data Extraction from Digital Images



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21747041

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21747041

Country of ref document: EP

Kind code of ref document: A1