CN112288040B - Method and system for performing image classification for object recognition - Google Patents

Info

Publication number
CN112288040B
Authority
CN
China
Prior art keywords
image
bitmap
image portion
textured
bitmaps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011360781.7A
Other languages
Chinese (zh)
Other versions
CN112288040A (en)
Inventor
余锦泽
何塞·赫罗尼莫·莫雷拉·罗德里格斯
A·阿布勒拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mujin Technology
Original Assignee
Mujin Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/991,510 external-priority patent/US11538238B2/en
Application filed by Mujin Technology filed Critical Mujin Technology
Priority claimed from CN202011170640.9A external-priority patent/CN113111900A/en
Publication of CN112288040A publication Critical patent/CN112288040A/en
Application granted granted Critical
Publication of CN112288040B publication Critical patent/CN112288040B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method and system for performing image classification for object recognition. Systems and methods for classifying at least a portion of an image as either textured or non-textured are presented. The system receives an image generated by an image capture device, where the image represents one or more objects in a field of view of the image capture device. The system generates one or more bitmaps based on at least one image portion of the image. The one or more bitmaps describe whether one or more visual features for feature detection are present in the at least one image portion, or whether there is a change in intensity across the at least one image portion. The system determines whether to classify the at least one image portion as textured or non-textured based on the one or more bitmaps.

Description

Method and system for performing image classification for object recognition
The present application is a divisional application of Chinese invention patent application No. 202011170640.9, entitled "Method and system for performing image classification for object recognition," filed on October 28, 2020.
Cross reference to related applications
This application claims the benefit of U.S. Provisional Application No. 62/959,182, entitled "A Robotic System with Object Detection," filed on January 10, 2020, which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to computing systems and methods for image classification. In particular, embodiments herein relate to classifying an image or a portion thereof as textured or non-textured.
Background
As automation becomes more prevalent, images representing objects, such as boxes or other packages in a warehouse, factory, or retail space, can be used to automatically extract information about those objects. These images may facilitate tasks such as automated package tracking, inventory management, or robot interaction with the objects.
Disclosure of Invention
In an embodiment, a computing system is provided that includes a non-transitory computer-readable medium and processing circuitry. The processing circuitry is configured to perform a method that includes: receiving, by the computing system, an image, wherein the computing system is configured to communicate with an image capture device, and wherein the image is generated by the image capture device and represents one or more objects in a field of view of the image capture device; and generating, by the computing system, one or more bitmaps based on at least one image portion of the image, wherein the one or more bitmaps and the at least one image portion are associated with a first object of the one or more objects, and wherein the one or more bitmaps describe whether one or more visual features for feature detection are present in the at least one image portion or whether there is a change in intensity across the at least one image portion. Additionally, the method includes determining, by the computing system, based on the one or more bitmaps, whether to classify the at least one image portion as textured or non-textured, and performing motion planning for a robot's interaction with the one or more objects based on whether the at least one image portion is classified as textured or non-textured. In an embodiment, the method may be performed by executing instructions stored on the non-transitory computer-readable medium.
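For illustration only, the following minimal sketch shows one way such a classification step could be organized in Python with OpenCV. It stands in for the disclosed pipeline by using only a descriptor-count cue, and the function name, keypoint detector, and threshold are assumptions rather than the claimed implementation; a downstream motion planner could, for example, treat matches against non-textured portions with lower confidence.

```python
# A minimal sketch, assuming OpenCV/NumPy; not the claimed implementation.
# It approximates the classification step using only a descriptor-count cue.
import cv2
import numpy as np

def classify_image_portion(portion_bgr: np.ndarray, min_keypoints: int = 20) -> str:
    """Classify an image portion as 'textured' or 'non-textured'."""
    gray = cv2.cvtColor(portion_bgr, cv2.COLOR_BGR2GRAY)
    keypoints = cv2.ORB_create().detect(gray, None)  # candidate visual features
    return "textured" if len(keypoints) >= min_keypoints else "non-textured"
```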
Drawings
Fig. 1A-1F illustrate a system for classifying an image or image portion as textured or non-textured according to embodiments herein.
Fig. 2A-2C provide block diagrams illustrating a computing system for classifying an image or image portion as textured or non-textured according to embodiments herein.
Fig. 3 provides a flow diagram illustrating a method for classifying an image or image portion as textured or non-textured according to embodiments herein.
Fig. 4A-4D illustrate example environments in which the method of fig. 3 is performed according to embodiments herein.
Fig. 5A-5E illustrate various bitmaps being generated based on image portions according to embodiments herein.
FIG. 6 illustrates a fused bitmap and a texture bitmap being generated according to embodiments herein.
Fig. 7 illustrates a fused bitmap being generated from a color image according to embodiments herein.
Fig. 8A-8C illustrate additional image portions classified as textured or non-textured according to embodiments herein.
Detailed Description
The present disclosure relates to systems and methods for classifying whether at least one image portion is textured or non-textured. In some cases, the classification may be part of an object registration process that is used to determine characteristics of a set of one or more objects, such as boxes or other packages arriving at a warehouse or retail space. These characteristics may be determined, for example, to facilitate automated processing of or other interaction with the set of objects, or interaction with other objects having substantially the same design as the set of objects. In an embodiment, a portion of an image (also referred to as an image portion), which may be generated by a camera or other image capture device, may represent one of the one or more objects and may provide an indication of: whether any visual detail is present on the surface of the object, whether at least a certain amount of visual detail is present on the surface of the object, and/or whether there is at least a certain amount of variation in the visual detail. In some cases, the image portion may be used to generate a template for object recognition. Such a case may involve classifying whether the template being formed from the image or image portion is a textured template or a non-textured template. The template may describe, for example, the appearance of the object (also referred to as the object appearance) and/or the size of the object (also referred to as the object size). In embodiments, the template may be used, for example, to identify any other objects that have a matching object appearance or that more generally match the template. Such a match may indicate that the two objects share the same object design, and more particularly may indicate that they have the same or substantially the same other characteristics, such as object size. In some cases, if a particular object has an appearance that matches an existing template, such a match may facilitate robotic interaction. For example, the match may indicate that the object has an object size (e.g., object dimensions or surface area) described by the template. The object size may be used to plan how the robot can pick up or otherwise interact with the object.
In embodiments, classifying whether at least one image portion is textured or non-textured may involve generating one or more bitmaps (also referred to as one or more masks) based on the image portion. In some cases, some or all of the one or more bitmaps may act as heat maps that indicate the probability or strength of a particular attribute across various locations of the image portion. In some cases, some or all of the one or more bitmaps may be used to describe whether an image portion has one or more visual features for object recognition. If the image portion has one or more such visual features, the one or more bitmaps may describe the location of the one or more features in the image portion. As an example, the one or more bitmaps may include a descriptor bitmap and/or an edge bitmap. A descriptor bitmap may describe whether an image portion has a descriptor, or the location of one or more descriptors in the image portion (the term "or" may refer to "and/or" in this disclosure). An edge bitmap may describe whether an edge is detected in the image portion, or the position of one or more edges in the image portion.
In embodiments, some or all of the one or more bitmaps may be used to describe whether there is a change in intensity across the image portion. Such a change (which may also be referred to as a spatial variation) may indicate, for example, whether there is variation among the pixel values of the image portion. In some cases, the spatial variation may be described by a standard deviation bitmap, which may describe local standard deviations between pixel values of the image portion.
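One possible way to obtain such a standard deviation bitmap is sketched below using box filters; the use of OpenCV/NumPy and the 5x5 window size are assumptions made for illustration, not part of the disclosure.

```python
# Illustrative local standard deviation bitmap, assuming OpenCV/NumPy.
# The 5x5 window size is an arbitrary example value.
import cv2
import numpy as np

def std_bitmap(gray: np.ndarray, win: int = 5) -> np.ndarray:
    """Per-pixel standard deviation of intensities in a win x win neighborhood."""
    g = gray.astype(np.float32)
    mean = cv2.boxFilter(g, ddepth=-1, ksize=(win, win))
    mean_of_sq = cv2.boxFilter(g * g, ddepth=-1, ksize=(win, win))
    variance = np.maximum(mean_of_sq - mean * mean, 0.0)  # clamp tiny negatives
    return np.sqrt(variance)  # larger values where intensity varies spatially
```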
In embodiments, classifying whether at least one image portion is textured or non-textured may involve information from a single bitmap, or information from a fused bitmap that combines multiple bitmaps. For example, the fused bitmap may be based on combining a descriptor bitmap, an edge bitmap, and/or a standard deviation bitmap. In some cases, the fused bitmap may be used to generate a texture bitmap, which may identify, for example, whether the image portion has one or more textured regions and whether the image portion has one or more non-textured regions. In some cases, the texture bitmap may be used to describe the total area or total size occupied by the one or more textured regions or the one or more non-textured regions.
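As a hedged illustration of how per-pixel bitmaps might be fused and then thresholded into a texture bitmap, the sketch below assumes the bitmaps are NumPy arrays scaled to [0, 1]; the weights and threshold are example values, not values from the disclosure.

```python
# Illustrative fusion of bitmaps into a texture bitmap; weights and the
# threshold are arbitrary example values, not values from the disclosure.
import numpy as np

def fuse_bitmaps(descriptor_bm, edge_bm, std_bm, weights=(0.4, 0.3, 0.3)):
    """Weighted per-pixel combination of bitmaps scaled to [0, 1]."""
    w_d, w_e, w_s = weights
    return w_d * descriptor_bm + w_e * edge_bm + w_s * std_bm

def texture_bitmap(fused_bm, threshold=0.3):
    """1 where a pixel belongs to a textured region, 0 where it is non-textured."""
    return (fused_bm > threshold).astype(np.uint8)

def textured_area_fraction(texture_bm):
    """Total area of the textured region(s) as a fraction of the image portion."""
    return float(np.count_nonzero(texture_bm)) / texture_bm.size
```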
In an embodiment, the fused bitmap may be generated in a manner that compensates for the effects of lighting conditions, such as excessive light reflecting from a bright object surface and causing glare to appear in the image portion, or light being blocked by the object surface and causing a shadow to appear in the image portion. The influence of the lighting conditions can be described by, for example, a highlight bitmap and/or a shadow bitmap. In some implementations, the fused bitmap can also be generated based on the highlight bitmap and/or the shadow bitmap.
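A simple way such highlight and shadow bitmaps could be formed is sketched below, assuming 8-bit grayscale intensities; the thresholds are illustrative assumptions.

```python
# Illustrative highlight and shadow bitmaps for an 8-bit grayscale portion.
# The intensity thresholds (240 and 30) are arbitrary example values.
import numpy as np

def highlight_bitmap(gray: np.ndarray, thresh: int = 240) -> np.ndarray:
    """1 where glare (near-saturated intensity) is likely present."""
    return (gray >= thresh).astype(np.uint8)

def shadow_bitmap(gray: np.ndarray, thresh: int = 30) -> np.ndarray:
    """1 where a shadow (very low intensity) is likely present."""
    return (gray <= thresh).astype(np.uint8)

# A fused bitmap could then down-weight these regions, for example:
# fused = fused * (1 - highlight_bitmap(gray)) * (1 - shadow_bitmap(gray))
```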
In embodiments, whether at least one image portion is textured or non-textured may be classified based on information provided by a descriptor bitmap, an edge bitmap, a standard deviation bitmap, a highlight bitmap, a shadow bitmap, a fused bitmap, and/or a texture bitmap. For example, the classification may be performed based on how many descriptors (if any) are detected in the image portion, a total area occupied by textured regions (if any) in the image portion, a total area occupied by non-textured regions (if any) in the image portion, and/or a standard deviation associated with the image portion or with the fused bitmap.
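The sketch below shows one possible decision rule over those quantities; every threshold is an assumption chosen for illustration, not a value from the disclosure.

```python
# Illustrative classification rule over bitmap-derived quantities.
# All thresholds are arbitrary example values, not values from the disclosure.
def classify_from_bitmaps(num_descriptors: int,
                          textured_area_frac: float,
                          non_textured_area_frac: float,
                          fused_std: float) -> str:
    """Return 'textured' or 'non-textured' for an image portion."""
    if num_descriptors >= 20 and textured_area_frac >= 0.2:
        return "textured"
    if non_textured_area_frac >= 0.9 or fused_std < 5.0:
        return "non-textured"
    # Borderline cases: fall back to the intensity-variation cue alone.
    return "textured" if fused_std >= 10.0 else "non-textured"
```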
In an embodiment, the classification of whether a template (or, more generally, an image portion) is textured or non-textured may affect how object recognition is performed based on the template. Object recognition based on such classification is discussed in more detail in U.S. Patent Application No. ________, entitled "METHOD AND SYSTEM FOR OBJECT RECOGNITION BASED ON IMAGE CLASSIFICATION," filed on even date herewith (Atty. Dkt. No. MJ0054-US/0077-0012US1), the entire contents of which are incorporated herein by reference. In some cases, the classification may affect a confidence associated with the result of the object recognition. For example, if the object recognition is based on a textured template, the result of the object recognition may be assigned a relatively high confidence, and if the object recognition is based on a non-textured template, the result of the object recognition may be assigned a relatively low confidence. In some cases, the confidence associated with the result of object recognition may affect whether object recognition is to be performed again (e.g., using another object recognition technique), and/or how the robot's interaction with a particular object is to be planned. For example, if the object recognition for the object is based on a non-textured template, the robot's interaction with the object may be controlled to proceed more cautiously or more slowly. In some cases, if the object recognition process determines that a particular image portion does not match any existing template, an object registration process may be performed to generate and store a new template based on the image portion.
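As a purely illustrative sketch of how such a classification might feed downstream decisions, the confidence values and speed scaling below are assumptions, not values from the disclosure.

```python
# Illustrative downstream use of the classification; all values are assumptions.
def recognition_confidence(template_is_textured: bool) -> float:
    """Assign higher confidence when the match was against a textured template."""
    return 0.9 if template_is_textured else 0.5

def robot_speed_scale(confidence: float) -> float:
    """Proceed more cautiously when the match came from a non-textured template."""
    return 1.0 if confidence >= 0.8 else 0.5
```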
FIG. 1A illustrates a system 100 for classifying an image or a portion thereof. The system 100 may include a computing system 101 and an image capture device 141 (also referred to as an image sensing device). The image capture device 141 (e.g., a camera) may be configured to capture or otherwise generate an image representative of the environment in the field of view of the image capture device 141. In some cases, the environment may be, for example, a warehouse or a factory. In such a case, the image may represent one or more objects in the warehouse or factory, such as one or more boxes that are to receive robot interaction. The computing system 101 may receive the image directly or indirectly from the image capture device 141 and process the image to, for example, perform object recognition. As discussed in more detail below, this processing may involve classifying whether the image or a portion thereof is textured or non-textured. In some cases, the computing system 101 and the image capture device 141 may be located at the same site, such as a warehouse or a factory. In some cases, the computing system 101 and the image capture device 141 may be remote from each other. For example, the computing system 101 may be located in a data center that provides a cloud computing platform.
In embodiments, the computing system 101 may receive images from the image capture device 141 via a data storage device (which may also be referred to as a storage device) or via a network. For example, fig. 1B depicts a system 100A, which may be an embodiment of the system 100 of fig. 1A, the system 100A including the computing system 101, the image capture device 141, and further including a data storage device 198 (or any other type of non-transitory computer-readable medium). The data storage device 198 may be part of the image capture device 141 or may be separate from the image capture device 141. In this embodiment, the computing system 101 may be configured to access the image by retrieving (or more generally, receiving) the image from the data storage device 198.
In FIG. 1B, storage 198 may include any type of non-transitory computer-readable medium(s), which may also be referred to as non-transitory computer-readable storage devices. Such non-transitory computer readable media or storage devices may be configured to store and provide access to data. Examples of a non-transitory computer-readable medium or storage device may include, but are not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination thereof, such as, for example, a computer floppy disk, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read only memory (EPROM or flash memory), a solid state drive, a Static Random Access Memory (SRAM), a portable compact disc read only memory (CD-ROM), a Digital Versatile Disc (DVD), and/or a memory stick.
Fig. 1C depicts a system 100B, which may be an embodiment of the system 100/100A of fig. 1A and 1B, the system 100B including a network 199. More specifically, the computing system 101 may receive images generated by the image capture device 141 via the network 199. The network 199 may provide a separate network connection or a series of network connections to allow the computing system 101 to receive image data consistent with embodiments herein. In embodiments, the network 199 may be connected via a wired or wireless link. The wired link may include a Digital Subscriber Line (DSL), coaxial cable, or fiber optic line. The wireless link may include Bluetooth®, Bluetooth Low Energy (BLE), ANT/ANT+, ZigBee, Z-Wave, Thread, Wi-Fi®, Worldwide Interoperability for Microwave Access (WiMAX®), mobile WiMAX®, WiMAX®-Advanced, NFC, SigFox, LoRa, Random Phase Multiple Access (RPMA), Weightless-N/P/W, an infrared channel, or a satellite band. The wireless link may also include any cellular network standard for communicating between mobile devices, including standards compliant with 2G, 3G, 4G, or 5G. The wireless standard may use various channel access methods, such as FDMA, TDMA, CDMA, or SDMA. Network communications may be conducted via any suitable protocol, including, for example, http, tcp/ip, udp, ethernet, ATM, and the like.
In embodiments, network 199 may be any type of network. The geographic extent of the network may vary widely, and network 199 may be a Body Area Network (BAN), a Personal Area Network (PAN), a Local Area Network (LAN) (e.g., an intranet), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), or the internet. The topology of network 199 may be of any form, and may include, for example, any of the following: point-to-point, bus, star, ring, mesh, or tree. Network 199 may be any such network topology as known to one of ordinary skill in the art capable of supporting the operations described herein. The network 199 may utilize different technologies and protocol layers or protocol stacks including, for example, an ethernet protocol, internet protocol suite (TCP/IP), ATM (asynchronous transfer mode) technology, SONET (synchronous optical network) protocol, or SDH (synchronous digital hierarchy) protocol. Network 199 may be a type of broadcast network, telecommunications network, data communications network, or computer network.
In embodiments, the computing system 101 and the image capture device 141 may communicate via a direct connection rather than a network connection. For example, in such embodiments, computing system 101 may be configured to receive images from image capture device 141 via a dedicated communication interface, such as an RS-232 interface, a Universal Serial Bus (USB) interface, and/or via a local computer bus, such as a Peripheral Component Interconnect (PCI) bus.
In embodiments, the computing system 101 may be configured to communicate with a spatial structure sensing device. For example, fig. 1D shows a system 100C (which may be an embodiment of system 100/100 a/100B) that includes computing system 101, image capture device 141, and also includes spatial structure sensing device 142. The spatial structure sensing device 142 may be configured to sense the 3D structure of objects in its field of view. For example, the spatial structure sensing device 142 may be a depth sensing camera (e.g., a time-of-flight (TOF) camera or a structured light camera) configured to generate spatial structure information, such as a point cloud, that describes how the structure of the object is arranged in 3D space. More specifically, the spatial structure information may include depth information, such as a set of depth values, which describe the depth of various locations on the surface of the object. The depth may be relative to the spatial structure sensing device 142 or some other frame of reference.
In an embodiment, the images generated by the image capture device 141 may be used to facilitate control of the robot. For example, fig. 1E shows a robot operating system 100D (which is an embodiment of system 100) that includes a computing system 101, an image capture device 141, and a robot 161. The image capture device 141 may be configured to generate an image representing an object, for example, in a warehouse or other environment, and the robot 161 may be controlled to interact with the object based on the image. For example, the computing system 101 may be configured to receive an image and perform object recognition based on the image. Object recognition may involve determining, for example, the size or shape of an object. In this example, the interaction of the robot 161 with the object may be controlled based on the determined size or shape of the object.
In embodiments, the computing system 101 may form or may be part of a robot control system (also referred to as a robot controller) configured to control movement or other operations of the robot 161. For example, in such embodiments, the computing system 101 may be configured to execute a motion plan for the robot 161 based on the images generated by the image capture device 141 and generate one or more movement commands (e.g., motor commands) based on the motion plan. In such an example, the computing system 101 may output the one or more movement commands to the robot 161 to control its movement.
In an embodiment, the computing system 101 may be separate from the robot control system and may be configured to communicate information to the robot control system in order to allow the robot control system to control the robot. For example, fig. 1F depicts a robotic manipulation system 100E (which is an embodiment of the system 100 of fig. 1A) that includes the computing system 101 and a robot control system 162 that is separate from the computing system 101. In this example, the computing system 101 and the image capture device 141 may form a vision system 150, the vision system 150 being configured to provide information to the robot control system 162 about the environment of the robot 161, and more particularly, about objects in that environment. The computing system 101 may serve as a vision controller configured to process images generated by the image capture device 141 to determine information about the environment of the robot 161. The computing system 101 may be configured to communicate the determined information to the robot control system 162, and the robot control system 162 may be configured to execute a motion plan for the robot 161 based on the information received from the computing system 101.
As described above, the image capture device 141 of fig. 1A-1F may be configured to generate image data that captures or forms an image representing one or more objects in the environment of the image capture device 141. More specifically, the image capture device 141 may have a device field of view and may be configured to generate images representing one or more objects in the device field of view. As used herein, image data refers to any type of data (also referred to as information) that describes the appearance of the one or more physical objects (also referred to as one or more objects). In an embodiment, the image capture device 141 may be or may include a camera, such as a camera configured to generate a two-dimensional (2D) image. The 2D image may be, for example, a grayscale image or a color image.
As also described above, the image generated by the image capture device 141 may be processed by the computing system 101. In embodiments, the computing system 101 may include or be configured as a server (e.g., having one or more server blades, processors, etc.), a personal computer (e.g., a desktop computer, a laptop computer, etc.), a smartphone, a tablet computing device, and/or any other computing system. In embodiments, any or all of the functionality of computing system 101 may be performed as part of a cloud computing platform. Computing system 101 may be a single computing device (e.g., a desktop computer or server) or may include multiple computing devices.
Fig. 2A provides a block diagram illustrating an embodiment of computing system 101. The computing system 101 includes at least one processing circuit 110 and non-transitory computer-readable medium(s) 120. In an embodiment, the processing circuitry 110 includes one or more processors, one or more processing cores, a programmable logic controller ("PLC"), an application specific integrated circuit ("ASIC"), a programmable gate array ("PGA"), a field programmable gate array ("FPGA"), any combination thereof, or any other processing circuitry.
In an embodiment, the non-transitory computer-readable medium 120 may be a storage device, such as an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof, such as, for example, a computer disk, a hard disk, a Solid State Drive (SSD), a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, any combination thereof, or any other storage device. In some cases, the non-transitory computer-readable medium 120 may include multiple storage devices. In some cases, the non-transitory computer readable medium 120 is configured to store image data received from the image capture device 141. In certain instances, the non-transitory computer-readable medium 120 also stores computer-readable program instructions that, when executed by the processing circuit 110, cause the processing circuit 110 to perform one or more methods described herein, such as the method described with respect to fig. 3.
Fig. 2B depicts a computing system 101A that is an embodiment of the computing system 101 and includes a communication interface 130. The communication interface 130 may be configured to receive images, or more generally image data, for example from the image capture device 141, such as via the storage device 198 of FIG. 1B, via the network 199 of FIG. 1C, or via a more direct connection. In an embodiment, the communication interface 130 may be configured to communicate with the robot 161 of fig. 1E or the robot control system 162 of fig. 1F. The communication interface 130 may include, for example, communication circuitry configured to perform communication via a wired or wireless protocol. By way of example, the communication circuit may include an RS-232 port controller, a USB controller, an Ethernet controller, a Bluetooth® controller, a PCI bus controller, any other communication circuit, or a combination thereof.
In an embodiment, the processing circuit 110 may be programmed by one or more computer-readable program instructions stored on the non-transitory computer-readable medium 120. For example, fig. 2C illustrates a computing system 101B, which may be an embodiment of the computing system 101, in which the processing circuit 110 is programmed by, or is configured to execute, an image access module 202, an image classification module 204, an object recognition module 206, an object registration module 207, and a motion planning module 208. It is to be understood that the functions of the various modules discussed herein are representative and not limiting.
In an embodiment, the image access module 202 may be a software protocol running on the computing system 101B and may be configured to access (e.g., receive, retrieve, store) images or more generally image data. For example, image access module 202 may be configured to access image data stored in non-transitory computer-readable medium 120 or 198 or via network 199 and/or communication interface 130 of fig. 2B. In some cases, the image access module 202 may be configured to receive image data directly or indirectly from the image capture device 141. The image data may be used to represent one or more objects in the field of view of the image capture device 141. In an embodiment, the image classification module 204 may be configured to classify an image or image portion as textured or non-textured, wherein the image may be represented by image data accessed by the image access module 202, as discussed in more detail below.
In an embodiment, the object recognition module 206 may be configured to perform object recognition based on the appearance of an object. As described above, object recognition may be based on one or more templates, such as the template 210 in fig. 2C. These templates may be stored on the computing system 101B, as shown in fig. 2C, or in another location, such as in a database hosted by another device or group of devices. In some cases, each of the templates may include or may be based on a respective image portion that the image access module 202 has received and that the image classification module 204 has classified as textured or non-textured. The object recognition module 206 may, for example, use a template to perform object recognition on an object appearing in another image portion. If the object recognition module 206 determines that the image portion does not match any existing template in a template storage space (e.g., the non-transitory computer-readable medium 120 or the database described above), or if there are no templates in the template storage space, then in some cases the object registration module 207 may be configured to generate and store a new template based on the image portion. In an embodiment, the motion planning module 208 may be configured to perform motion planning to control the robot's interaction with an object, e.g., based on the classification performed by the image classification module 204 and/or based on a result of the object recognition module 206, as discussed in more detail below.
In various embodiments, the terms "software protocol," "software instructions," "computer readable instructions," and "computer readable program instructions" are used to describe software instructions or computer code that are configured to perform various tasks and operations. As used herein, the term "module" broadly refers to a collection of software instructions or code configured to cause the processing circuit 110 to perform one or more functional tasks. For convenience, the various modules, managers, computer instructions and software protocols will be described as performing the various operations or tasks when in fact the modules, computer instructions and software protocols program the hardware processor to perform the operations and tasks. While described in various places as "software," it should be understood that the functions performed by the "modules," "software protocols," and "computer instructions" can be implemented more generally as firmware, software, hardware, or any combination thereof. Furthermore, embodiments herein are described in terms of method steps, functional steps, and other types of events. In an embodiment, these actions occur in accordance with computer instructions or software protocols executed by the processing circuitry 110 of the computing system 101.
FIG. 3 is a flow diagram illustrating an example method 300 for classifying an image or image portion as textured or non-textured. The image may represent, for example, one or more objects in a warehouse, retail space, or other location. For example, FIG. 4A depicts an environment in which the method 300 may be performed. More specifically, fig. 4A depicts a system 400, the system 400 including the computing system 101, a robot 461 (which may be an embodiment of the robot 161), and an image capture device 441 (which may be an embodiment of the image capture device 141) having a device field of view 443. The image capture device 441 may be configured to generate images representing the appearance of a scene in the device field of view 443. For example, when objects 401, 402, 403, 404 are located in the device field of view 443, the image capture device 441 may be configured to generate an image representing the objects 401-404, or more specifically the appearance of the objects 401-404. In one example, the objects 401-404 may be a stack of boxes or other packages to be destacked by the robot 461. The appearance of the objects 401-404 may include visual indicia (if any) printed or otherwise disposed on one or more surfaces of the objects 401-404. The visual indicia may form or include, for example, text, a logo or other visual design or pattern, or a picture on the surface of one or more of the objects 401-404. For example, the objects 401, 404 may be boxes, each having a picture 401A/404A printed on a respective top surface of the box 401/404. If the box 401/404 is used to store items, the picture 401A/404A or other visual indicia may, for example, identify a brand name or company associated with the items and/or identify the items themselves or other contents of the box. In some cases, the appearance of the objects 401-404 may include an outline, if any, of a physical item attached to a surface of one or more of the objects 401-404. For example, the object 403 may have a strip of tape 403A on its top surface. In some cases, there may be sufficient contrast between the strip of tape 403A and the surrounding area of the object 403 to allow the edges of the tape 403A to appear in an image of the object 403.
In some cases, some or all of the objects (e.g., 401 and 404) in the field of view (e.g., 443) of the image capture device may have matching appearances or substantially matching appearances. More specifically, these objects may each comprise the same or substantially the same visual indicia, such as the same picture. For example, the picture 401A printed on the top surface of the object 401 may be the same or substantially the same as the picture 404A printed on the top surface of the object 404. In some cases, objects (e.g., 401 and 404) may have matching appearances because they are all instances of a common object design. For example, the object design may be a box design for producing a box for storing a particular item or type of item. Such a box design may relate to a specific size and/or a specific visual design or other visual indicia. Thus, objects having the same object design may have matching appearances and/or matching sizes (e.g., matching dimensions).
In an embodiment, the method 300 of fig. 3 may be performed by the computing system 101 of fig. 2A-2C, and more particularly by the processing circuit 110. The method 300 may be performed, for example, when an image representing one or more objects (e.g., the object 401 and 404) is stored in a non-transitory computer-readable medium (e.g., 120 of fig. 2A-2C), or when the image is generated by an image capture device (e.g., 441 of fig. 4A). In an embodiment, the non-transitory computer-readable medium (e.g., 120) may also store a plurality of instructions (e.g., computer program instructions) that, when executed by the processing circuit 110, cause the processing circuit 110 to perform the method 300.
In an embodiment, the method 300 of fig. 3 may begin at or otherwise include step 302, in which the processing circuit 110 of the computing system 101 receives an image generated by an image capture device (e.g., 141/441) to represent one or more objects (e.g., 401-404) in a device field of view (e.g., 443) of the image capture device (e.g., 141/441). For example, FIG. 4B shows an image 420 representing the objects 401-404 of FIG. 4A. The image 420 may be generated by the image capture device 441, which in this example may be located directly above the objects 401-404. Thus, the image 420 may represent the appearance of the respective top surfaces of the objects 401-404, or more specifically, the appearance of the non-occluded portion(s) of those top surfaces. In other words, the image 420 in this example may capture a top view of the top surfaces of the objects 401-404. In embodiments, the received image may be used to create one or more templates for performing object recognition, as discussed in more detail below.
In some cases, the image received in step 302 may represent a stack of multiple objects, such as multiple boxes. For example, as shown in FIG. 4B, the entirety of the received image 420 may represent a plurality of objects, namely the objects 401-404. In this example, each of the objects 401-404 may be represented by a particular portion of the image 420 (also referred to as an image portion). For example, as shown in FIG. 4C, the object 401 may be represented by an image portion 421 of the image 420. The image portion 421 may be, for example, a rectangular region (e.g., a square region) or other region of the image 420. In such an example, the method 300 may involve extracting the image portion (e.g., 421) associated with a particular object (e.g., 401) from the received image (e.g., 420). The particular object (which may also be referred to as a target object) may be a single object (e.g., 401), such as a single box, identified by the computing system 101. The identified object may be a target for performing object recognition or object registration and/or a target for robot interaction (e.g., being unloaded from a pallet).
In an embodiment, extracting the image portion 421 from the image 420 may be based on identifying locations (also referred to as image locations) within the image 420 at which edges of the object 401 appear, and extracting the region of the image 420 bounded by those image locations. In some cases, if one or more of the objects 401-404 are also in the field of view of a spatial structure sensing device (e.g., 142 of fig. 1D), the computing system 101 may be configured to receive spatial structure information generated by the spatial structure sensing device (e.g., 142) and to extract the image portion 421 with the aid of the spatial structure information. For example, the spatial structure information may include depth information, and the computing system 101 may be configured to determine locations of edges of the object 401 (also referred to as edge locations) based on the depth information, such as by detecting locations at which there is a sharp change in depth. In this example, the computing system 101 may be configured to map the edge locations sensed via the spatial structure sensing device (e.g., 142) to image locations within the image 420 and to extract the region bounded by those image locations, where the extracted region may be the image portion (e.g., 421).
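For illustration, the sketch below assumes a depth map already registered pixel-for-pixel with the 2D image and simply crops the bounding box of detected depth discontinuities; a real system would segment edge locations per object, and the gradient threshold is an arbitrary assumption.

```python
# Illustrative extraction of an image portion from depth discontinuities,
# assuming a depth map registered to the 2D image. Simplified: it crops the
# bounding box of all sharp depth changes; the threshold is an example value.
import cv2
import numpy as np

def extract_image_portion(image: np.ndarray, depth: np.ndarray,
                          jump_thresh: float = 0.05) -> np.ndarray:
    depth32 = depth.astype(np.float32)
    dz_dx = cv2.Sobel(depth32, cv2.CV_32F, 1, 0, ksize=3)
    dz_dy = cv2.Sobel(depth32, cv2.CV_32F, 0, 1, ksize=3)
    edge_mask = (np.abs(dz_dx) + np.abs(dz_dy)) > jump_thresh  # candidate edge locations
    ys, xs = np.nonzero(edge_mask)
    if ys.size == 0:                       # no sharp depth change found
        return image
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    return image[y0:y1 + 1, x0:x1 + 1]     # region bounded by the edge locations
```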
In embodiments, the image portion 421 may in some cases be used to generate a template for performing object recognition, and the template may be classified as textured or non-textured, as discussed below with respect to step 308. The template may represent a particular object design, or more specifically, an object appearance and/or object structure associated with the object design. The object structure may describe an object size, such as an object length, an object width, an object height, and/or any other object dimension or combination thereof. Object recognition may involve, for example, comparing the appearance of another object to the template, or more specifically, to the object appearance described by the template. For example, object recognition may involve comparing the respective appearances of the objects 402-404 to determine which object, if any, has an appearance that matches the template created from the image portion 421. In some cases, the appearance of each of the objects 402-404 may be represented by a corresponding image portion of the image 420 of FIGS. 4B and 4C. As an example, the computing system 101 may determine (e.g., via the object recognition module 206 of fig. 2C) that the image portion representing the object 404 matches the template created from the image portion 421, which represents the object 401. Such a match may indicate, for example, that the object 404 has the same object design as the object 401, and more specifically, the object design represented by the template. More specifically, the match may indicate that the object 404 has the same object size (e.g., object dimensions) as the object 401, namely the object size associated with the object design represented by the template.
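A hedged sketch of one way such an appearance comparison could be carried out with normalized cross-correlation is shown below; in practice the comparison could instead rely on the descriptors discussed later, and the score threshold is an assumption.

```python
# Illustrative appearance comparison against a template, assuming OpenCV.
# The score threshold is an arbitrary example value.
import cv2
import numpy as np

def matches_template(candidate_gray: np.ndarray, template_gray: np.ndarray,
                     score_thresh: float = 0.8) -> bool:
    """True if the candidate image portion's appearance matches the template."""
    if candidate_gray.shape != template_gray.shape:
        candidate_gray = cv2.resize(
            candidate_gray, (template_gray.shape[1], template_gray.shape[0]))
    score = cv2.matchTemplate(candidate_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    return float(score.max()) >= score_thresh
```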
As described above, the image 420 may in some cases represent multiple objects. In other cases, the image received in step 302 may represent only one object (e.g., only one box). For example, before the image is received by the computing system 101, it may have been processed (e.g., cropped) by the image capture device (e.g., 141/441) or another device to represent only a certain object (e.g., the object 401) and to remove any portions representing other objects (if any) in the field of view (e.g., 443) of the image capture device (e.g., 141/441). In such an example, the image received in step 302 may represent only that particular object (e.g., the object 401).
In an embodiment, step 302 may be performed by image access module 202 of fig. 2C. In an embodiment, the image (e.g., 420 of fig. 4B) may have been stored on a non-transitory computer-readable medium (e.g., 120 of fig. 2C), and receiving the image in step 302 may involve retrieving (or, more generally, receiving) the image (e.g., 420) from the non-transitory computer-readable medium (e.g., 120) or from any other device. In some cases, the image (e.g., 420) may have been received by computing system 101 from an image capture device (e.g., 141/441), such as via communication interface 130 of fig. 2B, and may have been stored in a non-transitory computer-readable medium (e.g., 120) that may provide a temporary buffer or long-term storage for the image (e.g., 420). For example, the image (e.g., 420) may be received from an image capture device (e.g., 141/441 of fig. 4A) and stored in a non-transitory computer-readable medium (e.g., 120). The image (e.g., 420) may then be received by the processing circuit 110 of the computing system 101 from a non-transitory computer-readable medium in step 302.
In some cases, the image (e.g., 420) may be stored in a non-transitory computer-readable medium (e.g., 120) and may have been previously generated by the processing circuit 110 itself based on information received from the image capture device (e.g., 141/441). For example, the processing circuit 110 may be configured to generate an image (e.g., 420) based on raw camera data received from an image capture device (e.g., 141/441), and may be configured to store the generated image in a non-transitory computer-readable medium (e.g., 120). The image may then be received by the processing circuit 110 (e.g., by retrieving the image from the non-transitory computer-readable medium 120) in step 302.
In an embodiment, the image (e.g., 420) received in step 302 may be or include a two-dimensional (2D) array of pixels, which may have respective pixel values (also referred to as pixel intensity values) associated with the intensity of a signal sensed by the image capture device 441, such as the intensity of light reflected from respective surfaces (e.g., top surfaces) of the objects 401-404. In some cases, the image (e.g., 420) may be a grayscale image. In such a case, the image (e.g., 420) may include a single 2D array of pixels, where each pixel may have an integer value or floating-point value, e.g., from 0 to 255 or in some other range. In some cases, the image (e.g., 420) may be a color image. In such a case, the image (e.g., 420) may include multiple 2D pixel arrays, where each 2D pixel array may indicate the intensity of a respective color component (also referred to as a respective color channel). For example, such a color image may include a first 2D pixel array representing a red channel and indicating the intensity of a red component of the image (e.g., 420), a second 2D pixel array representing a green channel and indicating the intensity of a green component of the image (e.g., 420), and a third 2D pixel array representing a blue channel and indicating the intensity of a blue component of the image (e.g., 420).
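A small illustration of those two pixel-array layouts in NumPy is given below, assuming 8-bit values; note that OpenCV, if used to load the image, orders color channels as BGR.

```python
# Illustrative pixel-array layouts for a grayscale and a color image,
# assuming 8-bit values; array sizes are arbitrary example values.
import numpy as np

gray = np.zeros((480, 640), dtype=np.uint8)       # single 2D array, values 0-255
color = np.zeros((480, 640, 3), dtype=np.uint8)   # one 2D array per color channel
blue, green, red = color[..., 0], color[..., 1], color[..., 2]  # OpenCV's BGR order
```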
In embodiments, the computing system 101 may be configured to perform a smoothing operation on the image (e.g., 420). If a smoothing operation is performed, it may be performed as part of step 302 or after step 302, for example to remove artifacts or noise (e.g., illumination noise) from the image (e.g., 420). The artifacts may be due to, for example, irregularities on the surface of an object (e.g., wrinkles), effects of lighting conditions (e.g., shadows), or some other factor. In some cases, the smoothing operation may involve applying a structure-preserving filter, such as a Gaussian filter, over the image (e.g., 420).
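A short sketch of such a smoothing step, assuming OpenCV, is shown below; the kernel size and sigma values are illustrative, and a bilateral filter is included as one edge-preserving alternative.

```python
# Illustrative smoothing to suppress noise, assuming OpenCV.
# Kernel size and sigma values are arbitrary examples.
import cv2

def smooth(image):
    return cv2.GaussianBlur(image, ksize=(5, 5), sigmaX=1.0)

def smooth_edge_preserving(image):
    # Alternative that better preserves edges/structure than a Gaussian filter.
    return cv2.bilateralFilter(image, d=5, sigmaColor=50, sigmaSpace=50)
```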
In an embodiment, the method 300 of fig. 3 further comprises a step 306, in which the processing circuit 110 of the computing system 101 generates one or more bitmaps (also referred to as one or more masks) based on at least one image portion of the image, such as the image portion 421 of the image 420 of fig. 4C and 4D. The image portion (e.g., 421) may be a portion of the image (e.g., 420) that represents a particular object (e.g., 401) in the field of view (e.g., 443) of the image capture device (e.g., 441), such as an image portion that represents the target object described above. Thus, the one or more bitmaps in step 306 may be specifically associated with the target object. If the image (e.g., 420) received in step 302 represents multiple objects (e.g., 401-404), then in some cases step 306 may be based only, or primarily, on the image portion (e.g., 421) representing the target object (e.g., 401). In other words, for such a case, the at least one image portion on which the one or more bitmaps are based may be limited primarily to the image portion representing the target object. In another case, if the image received in step 302 represents only the target object, step 306 may be based on the entirety of the image. In other words, for such a case, the at least one image portion on which the one or more bitmaps are based may comprise the whole or almost the whole of the image. In such an example, the image portion associated with the target object may occupy the entirety or substantially the entirety of the image, such that the one or more bitmaps may be generated based directly on the entirety or substantially the entirety of the image. In some cases, step 306 may be performed by the image classification module 204 of fig. 2C.
In an embodiment, the one or more bitmaps may describe whether one or more visual features for feature detection are present in at least one image portion (e.g., 421) representing an object (e.g., 401). The one or more visual features may represent visual details that may be used to compare the appearance of the object to the appearance of a second object (e.g., 404). Some or all of the visual details (if present in the image portion) may capture or otherwise represent visual indicia (if any) printed or otherwise appearing on the object (e.g., 401). If the image portion (e.g., 421) is used to create a template, the one or more visual features (if any) may represent visual details described by the template and may be used to facilitate a comparison of the template with the appearance of a second object (e.g., 404). In such an example, performing object recognition may involve comparing the appearance of the second object (e.g., 404) to visual details described by the template.
In an embodiment, visual details or visual features, if any, in an image portion (e.g., 421) may contribute to a visual texture of the image portion (e.g., 421), or more particularly, to a visual texture of an appearance of a surface of an object (e.g., 401) represented by the image portion (e.g., 421). Visual texture may refer to spatial variation in intensity across an image portion (e.g., 421), or more specifically, to pixels of an image portion (e.g., 421) having variations between their pixel intensity values. For example, the visual detail or one or more visual features (if present) may comprise a line, corner, or pattern, which may be represented by a region of pixels having non-uniform pixel intensity values. In some cases, a sharp change between pixel intensity values may correspond to a high level of visual texture, while a uniform pixel intensity value may correspond to a lack of visual texture. The presence of the visual texture may facilitate a more robust comparison between respective appearances of the objects, or more particularly, between a template generated from the appearance of a first object (e.g., 401) and the appearance of a second object (e.g., 404).
In embodiments, some or all of the one or more bitmaps may each indicate whether an image portion (e.g., 421) has one or more visual features for feature detection, or whether the image portion lacks visual features for feature detection. If an image portion (e.g., 421) has or represents one or more visual features for feature detection, each bitmap of the one or more bitmaps may indicate a quantity or amount of visual features present in the image portion (e.g., 421) and/or indicate locations within the image portion (e.g., 421) at which the one or more visual features are located.
In embodiments, some or all of the one or more bitmaps may each represent a particular type of visual feature. For example, the types of visual features may include a descriptor as a first visual feature type and an edge as a second visual feature type. If multiple bitmaps are generated, they may include a first bitmap associated with identifying the presence of descriptors (if any) in at least one image portion of the image, and a second bitmap associated with identifying the presence of edges (if any) in the at least one image portion.
More specifically, in embodiments, the one or more bitmaps generated in step 306 may include a descriptor bitmap (also referred to as a descriptor mask) for describing whether one or more descriptors are present in at least one image portion (e.g., 421) of the image (e.g., 420) received in step 302. As discussed in more detail below, the descriptor bitmap may indicate which region(s) of an image portion (e.g., 421) do not have a descriptor, and which region(s) of the image portion (e.g., 421), if any, have a descriptor. In some cases, the descriptor bitmap may serve as a heat map indicating the probability of a descriptor being present at various locations of the image portion. A descriptor (also referred to as a feature descriptor) may be a type of visual feature that represents a particular visual detail appearing in an image portion (e.g., 421), such as a corner or a pattern in the image portion. In some cases, the visual detail may have a sufficient level of uniqueness to be visually distinguishable from other visual details or other types of visual details in the received image (e.g., 420). In some cases, a descriptor may serve as a fingerprint for a piece of visual detail by encoding the pixels representing that visual detail as a scalar value or a vector.
As described above, the descriptor bitmap may indicate which location(s) or region(s), if any, within an image portion (e.g., 421) have visual detail that forms a descriptor. For example, fig. 5A depicts an example of a descriptor bitmap 513 generated based on the image portion 421. In this example, the descriptor bitmap 513 may be a 2D array of pixels and may indicate that descriptors are respectively located at pixel coordinates [a1 b1]^T, [a2 b2]^T, ..., [an bn]^T, and/or at descriptor identification regions 514_1, 514_2, ..., 514_n around the pixel coordinates [a1 b1]^T, [a2 b2]^T, ..., [an bn]^T. The descriptor identification regions 514_1, 514_2, ..., 514_n may be circular regions, or may have some other shape (e.g., square). In some cases, if a pixel value of zero indicates the absence of a descriptor, then all pixels within the descriptor identification regions 514_1, 514_2, ..., 514_n of the descriptor bitmap 513 may have non-zero value(s). The pixel coordinates [a1 b1]^T, [a2 b2]^T, ..., [an bn]^T (also referred to as pixel locations) of the descriptor bitmap 513 may correspond to the same pixel coordinates [a1 b1]^T, [a2 b2]^T, ..., [an bn]^T of the image portion 421. Thus, the descriptor bitmap 513 may indicate that the pixel coordinates [a1 b1]^T, [a2 b2]^T, ..., [an bn]^T of the image portion 421 have visual detail forming respective descriptors, and that these descriptors are generally located in or around regions of the image portion 421 that occupy the same locations as the regions 514_1, 514_2, ..., 514_n.
In embodiments, the computing system 101 may be configured to generate the descriptor bitmap by searching for one or more locations (e.g., [a1 b1]^T through [an bn]^T) or one or more regions (e.g., 514_1 through 514_n) at which a descriptor, if any, is present in the image portion 421. In this embodiment, the image portion 421 may have sufficient visual detail, or a sufficient change in visual detail, at those locations or regions to form one or more corresponding descriptors at such locations or regions. As an example, the computing system 101 may be configured to search for the one or more locations by searching for one or more keypoints (also referred to as descriptor keypoints) in at least the image portion 421. Each of the one or more keypoints (if found) may be a location or region at which a descriptor is present. The one or more locations (e.g., [a1 b1]^T through [an bn]^T) or one or more regions (e.g., 514_1 through 514_n) may be equal to or based on the one or more keypoints. The search may be performed using a feature detection technique such as the Harris corner detection algorithm, the scale-invariant feature transform (SIFT) algorithm, the speeded up robust features (SURF) algorithm, the features from accelerated segment test (FAST) detection algorithm, and/or the oriented FAST and rotated binary robust independent elementary features (ORB) algorithm. As an example, the computing system 101 may search for keypoints in the image portion 421 using the SIFT algorithm, where each keypoint may be a circular region having a keypoint center coordinate and a radius represented by a scale parameter value σ (also referred to as a keypoint scale). In this example, the coordinates [a1 b1]^T, [a2 b2]^T, ..., [an bn]^T of the descriptor bitmap 513 in FIG. 5A may be equal to the keypoint center coordinates, while the descriptor identification regions 514_1 through 514_n may correspond to the circular regions identified by the keypoints. More specifically, each of the descriptor identification regions (e.g., the region 514_1) may be centered at the keypoint center coordinate (e.g., [a1 b1]^T) of a corresponding keypoint and may have a size (e.g., radius) that is equal to or based on the scale parameter value for that keypoint.
In embodiments, pixels of the descriptor bitmap (e.g., 513) that are within one or more descriptor identification regions (e.g., 514_1 to 514_n), if any such regions are found, may have non-zero pixel value(s), while some or all other pixels of the bitmap may have a pixel value of zero (or some other defined value). In this example, if all pixels of a particular descriptor bitmap have a pixel value of zero, the descriptor bitmap may indicate that no descriptor is found in the corresponding image portion. Alternatively, if some pixels of the descriptor bitmap have non-zero value(s), the descriptor bitmap (e.g., 513) may indicate the number or amount of descriptors in the corresponding image portion (e.g., 421). For example, the number of descriptors or descriptor identification regions in the descriptor bitmap 513 of fig. 5A may indicate the number of descriptors (e.g., n descriptors) in the image portion 421. In this example, the descriptor identification regions 514_1 to 514_n may indicate the descriptors or the amount of descriptor information in image portion 421. In some cases, if there is a descriptor identification region (e.g., 514_1) in the descriptor bitmap, the size of the descriptor identification region may indicate the size of the corresponding descriptor. For example, the radius of descriptor identification region 514_1 may indicate the size of the corresponding descriptor located at pixel coordinate [a_1 b_1]^T within image portion 421. In this example, a larger radius may correspond to a descriptor occupying a larger area.
In an embodiment, the respective centers of the descriptor identification regions (if any) in the descriptor bitmap (e.g., 513) may have a defined non-zero value. For example, the pixels at pixel coordinates [a_1 b_1]^T to [a_n b_n]^T in the descriptor bitmap 513 of fig. 5A may have a defined maximum pixel value. The defined maximum pixel value may be the maximum value allowed for the pixels of the descriptor bitmap 513 (or more generally, the pixels of any bitmap). For example, if each pixel of the bitmap 513 is an integer value represented by 8 bits, the defined maximum pixel value may be 255. In another example, if each pixel is a floating point value representing a probability value between 0 and 1 (the probability that a descriptor is present at that pixel), the defined maximum pixel value may be 1. In embodiments, the pixel values at the other pixel coordinates in a descriptor identification region may be less than the defined maximum pixel value and/or may be based on how far those pixel coordinates are from the respective center coordinates of the descriptor identification region. For example, a pixel at pixel coordinate [x y]^T of descriptor identification region 514_1 may have a value equal to or based on the defined maximum pixel value multiplied by a scaling factor less than 1, wherein the scaling factor may be a function (e.g., a Gaussian function) of the distance between the pixel coordinate [x y]^T and the center coordinate [a_1 b_1]^T of descriptor identification region 514_1.
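As a further illustration of the Gaussian fall-off described above, the following hedged sketch rasterizes a descriptor bitmap from (center, radius) pairs, placing the defined maximum pixel value at each center and attenuating surrounding pixels with a Gaussian of the distance to the center. Mapping the keypoint radius to the Gaussian sigma, and taking the maximum where regions overlap, are assumptions made only for this example.

```python
import numpy as np

def render_descriptor_bitmap(shape, keypoints, max_val=255.0):
    """Rasterize descriptor identification regions into a 2D bitmap of the given shape."""
    bitmap = np.zeros(shape, dtype=np.float32)
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    for (cx, cy), radius in keypoints:
        dist2 = (xx - cx) ** 2 + (yy - cy) ** 2
        sigma = max(radius, 1.0)
        # Maximum pixel value at the center, Gaussian fall-off with distance from it.
        region = max_val * np.exp(-dist2 / (2.0 * sigma ** 2))
        region[dist2 > radius ** 2] = 0.0        # keep values only inside the circular region
        bitmap = np.maximum(bitmap, region)      # overlapping regions keep the larger value
    return bitmap
```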
In an embodiment, the one or more bitmaps generated in step 306 may include an edge bitmap (also referred to as an edge mask) that describes whether one or more edges are present in at least one image portion (e.g., 421) of the image (e.g., 420) received in step 302. More specifically, the edge bitmap may be used to identify one or more regions of the at least one image portion (e.g., 421) that include one or more respective edges detected from the at least one image portion (e.g., 421), or to indicate that no edges are detected in the at least one image portion. In some cases, the edge bitmap may serve as a heat map indicating the strength or probability of an edge being present at various locations of the at least one image portion. As an example, fig. 5B shows edges 423_1 to 423_n in image portion 421 and shows an edge bitmap 523 with regions 525_1 to 525_n that identify and correspond to the edges 423_1 to 423_n of image portion 421. More specifically, if edges 423_1 to 423_n occupy certain edge locations (e.g., pixel coordinate [g_m h_m]^T) in image portion 421 in fig. 5B, then regions 525_1 to 525_n (also referred to as edge identification regions) may surround these locations (e.g., surround pixel coordinate [g_m h_m]^T in the edge bitmap 523). For example, edge identification regions 525_1 to 525_n may form bands around these edge locations, wherein the bands may have a defined band thickness or width.
In an embodiment, all pixels within edge identification regions 525_1 to 525_n (if any) may have non-zero pixel value(s), and some or all other pixels of the edge bitmap 523 may have pixel values of zero. If all pixels of a particular edge bitmap have a pixel value of zero, the edge bitmap may indicate that no edge was detected in the corresponding image portion. If some pixels of the edge bitmap have non-zero pixel value(s), these pixels may indicate one or more locations or regions in image portion 421 where one or more edges are located. In an embodiment, the edge bitmap (e.g., 523) may indicate the number or prevalence of edges in the image portion 421. For example, the number of edge identification regions (e.g., 525_1 to 525_n) in an edge bitmap may indicate the number of edges in the corresponding image portion (e.g., 421), and the total size of the edge identification regions (e.g., 525_1 to 525_n) may indicate the prevalence of edges in the image portion (e.g., 421).
In an embodiment, a pixel at an edge location (e.g., [g_m h_m]^T) in the edge bitmap (e.g., 523) may be set to a defined pixel value, such as the defined maximum pixel value discussed above. In such an embodiment, other pixels of an edge identification region (e.g., 525_1) surrounding the edge location (e.g., surrounding [g_m h_m]^T) may have values less than the defined maximum pixel value. For example, the pixels in the edge identification region (e.g., 525_1) may have pixel values based on their distance from the edge location. As an example, a pixel [x y]^T of edge identification region 525_1 of fig. 5B may have a pixel value equal to the defined maximum pixel value multiplied by a scaling factor, wherein the scaling factor is less than 1. In some cases, the scaling factor may be a function (e.g., a Gaussian function) of the distance between the pixel [x y]^T and the nearest edge location (e.g., [g_m h_m]^T).
In embodiments, the computing system 101 may be configured to search for edge locations by using edge detection techniques, such as the Sobel edge detection algorithm, the Prewitt edge detection algorithm, the Laplacian edge detection algorithm, the Canny edge detection algorithm, or any other edge detection technique. In an embodiment, the edge detection algorithm may identify 2D edges, such as straight lines or curved lines. The detection may be based on, for example, identifying pixel coordinates where there is a sharp change in pixel value.
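The following sketch shows one plausible way, using OpenCV's Canny detector followed by a dilation step, to produce an edge bitmap whose edge identification regions form bands of a defined width around the detected edge locations. The Canny thresholds (50, 150) and the band width are illustrative assumptions rather than values taken from this disclosure.

```python
import cv2
import numpy as np

def make_edge_bitmap(image_portion: np.ndarray, band_width: int = 5) -> np.ndarray:
    """Build an edge bitmap whose edge identification regions are bands around detected edges."""
    edges = cv2.Canny(image_portion, 50, 150)             # edge pixels -> 255, others -> 0
    kernel = np.ones((band_width, band_width), np.uint8)
    edge_bitmap = cv2.dilate(edges, kernel)               # widen each edge into a band
    return edge_bitmap
```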
In embodiments, the one or more bitmaps generated in step 306 may include a standard deviation bitmap (also referred to as a standard deviation mask). The standard deviation bitmap may be used to describe whether the intensity varies across at least one image portion (e.g., 421), or more specifically, how the intensity varies across the at least one image portion. For example, the standard deviation bitmap may form a 2D pixel array, where each pixel of the standard deviation bitmap may indicate a standard deviation between pixel values of a corresponding region of pixels in the image portion (e.g., 421). Because the standard deviation is region-specific, it may be referred to as a local standard deviation. As an example, fig. 5C shows a standard deviation bitmap 533 generated from the image portion 421. In this example, the pixel value at a particular pixel coordinate of the standard deviation bitmap 533 (e.g., [u_1 v_1]^T or [u_2 v_2]^T) may be equal to or based on the local standard deviation (or other measure of variance) between pixel values of a region (e.g., 432_1 or 432_2) of image portion 421 surrounding the same pixel coordinate (e.g., [u_1 v_1]^T or [u_2 v_2]^T). The pixel region (e.g., 432_1 or 432_2) used to determine the local standard deviation may be, for example, a rectangular area of a defined size, such as a 3 pixel by 3 pixel square area. In some implementations, each pixel of the standard deviation bitmap may have a normalized standard deviation value, which may be equal to the standard deviation between pixel values of a corresponding region divided by the size of the corresponding region. For example, the pixel value at [u_1 v_1]^T in the standard deviation bitmap 533 may be equal to the standard deviation between pixel values of area 432_1 of image portion 421 divided by the size of area 432_1 (e.g., 9 square pixels).
In an embodiment, if a particular pixel of the standard deviation bitmap (e.g., 533) has a pixel value of zero or substantially zero, that pixel may indicate that the local standard deviation of the corresponding region of the image portion (e.g., 421) is zero. In such an embodiment, the pixel values within the corresponding region of the image portion (e.g., 421) may be unchanged or substantially unchanged. For example, the pixel at [u_2 v_2]^T in the standard deviation bitmap 533 may have a zero value, which may indicate that the corresponding area 432_2 of image portion 421 surrounding the same pixel coordinate [u_2 v_2]^T has pixels with substantially uniform pixel values. In an embodiment, if all pixels of the standard deviation bitmap have a pixel value of zero, the standard deviation bitmap may indicate that there is no intensity variation across the image portion on which the standard deviation bitmap is based. In another embodiment, if pixels of the standard deviation bitmap have non-zero values (e.g., at the pixel coordinate [u_1 v_1]^T of the bitmap 533), such pixels may indicate that there is a variation in intensity across at least a corresponding area (e.g., 432_1) of the image portion (e.g., 421). In some cases, higher pixel values in the standard deviation bitmap (e.g., 533) may indicate a higher local standard deviation, which may indicate a higher level of variation between pixel values in the image portion.
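A local standard deviation bitmap of the kind described above can be sketched as follows, using the identity Var(x) = E[x^2] − (E[x])^2 over a 3 pixel by 3 pixel neighborhood; the optional normalization divides by the neighborhood size, following the example of 9 square pixels above. The function name and default arguments are assumptions for illustration.

```python
import cv2
import numpy as np

def make_std_dev_bitmap(image_portion: np.ndarray, ksize: int = 3, normalize: bool = True) -> np.ndarray:
    """Per-pixel local standard deviation over a ksize x ksize neighborhood."""
    img = image_portion.astype(np.float32)
    mean = cv2.blur(img, (ksize, ksize))               # local mean E[x]
    mean_sq = cv2.blur(img * img, (ksize, ksize))      # local mean of squares E[x^2]
    var = np.clip(mean_sq - mean * mean, 0, None)      # Var = E[x^2] - (E[x])^2, clipped for stability
    std = np.sqrt(var)
    return std / (ksize * ksize) if normalize else std
```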
In an embodiment, step 306 may include generating a plurality of bitmaps, such as a first bitmap that is a descriptor bitmap (e.g., 513) and a second bitmap that is an edge bitmap (e.g., 523). In some cases, the plurality of bitmaps may include at least three bitmaps, such as a descriptor bitmap, an edge bitmap, and a standard deviation bitmap. This embodiment may allow information from multiple bitmaps to be combined to produce more complete information about how many visual features, if any, are present in the image portion. In some cases, the plurality of bitmaps may describe a plurality of feature types. For example, a first bitmap may indicate whether one or more features (such as descriptors) of a first feature type are present in at least one image portion (e.g., 421), and a second bitmap may indicate whether one or more features (such as edges) of a second feature type are present in the at least one image portion (e.g., 421).
In embodiments, the computing system 101 may be configured to generate one or more bitmaps indicating the effect of lighting conditions on the received image (e.g., 420) or image portion thereof (e.g., 421). In some cases, the lighting conditions may cause excessive light or other signals to be reflected from a surface area of the object (e.g., the top surface of object 401), which may cause glare in the resulting image portion (e.g., 421) representing the object. For example, light may be reflected from an area having a shiny material (e.g., smooth tape). In some cases, the lighting conditions may cause too little light to be reflected from the surface area of the object, which may cause shadows in the resulting image portion. For example, light may be prevented from reaching the surface area of the object completely. The one or more bitmaps in this example may be referred to as one or more lighting effect bitmaps and may be considered additional bitmaps to the plurality of bitmaps described above. In an embodiment, glare or shading in an area of an image or image portion may cause any visual detail in the area to lose contrast or appear too blurred, which may make the visual detail less reliable for use in object recognition.
In embodiments, the one or more lighting effect bitmaps (also referred to as one or more lighting effect masks) may include a highlight bitmap (also referred to as a highlight mask) and/or a shadow bitmap (also referred to as a shadow mask). The highlight bitmap may indicate one or more regions (if any) of the corresponding image portion (e.g., 421) that exhibit glare or other effects of too much light being reflected from a particular portion of the surface of the object. Glare may saturate regions of an image or image portion, which may cause visual details (if any) representing that portion of the surface of the object to lose contrast or blend in with the glare. Fig. 5D depicts an example highlight bitmap 543 generated based on image portion 421. The highlight bitmap 543 may include regions 547_1 and 547_2 having pixel value(s) (such as non-zero pixel values) indicating glare. More specifically, regions 547_1 and 547_2 (which may be referred to as highlight identification regions) may indicate that glare is present at corresponding regions 427_1 and 427_2 of image portion 421. Regions 427_1 and 427_2 of image portion 421 (which may also be referred to as highlight regions) may occupy the same positions as the highlight identification regions 547_1 and 547_2 of highlight bitmap 543. In some cases, pixels in a highlight bitmap (e.g., 543) that indicate the presence of glare in a corresponding image portion (e.g., 421), such as the pixels of regions 547_1 and 547_2, may have defined pixel value(s), such as the defined maximum pixel value discussed above. In other cases, pixels in the highlight identification regions of the highlight bitmap (e.g., 543) may have the same pixel values as corresponding pixels in the highlight regions of the image portion (e.g., 421). In an embodiment, pixels that are not in at least one highlight identification region (e.g., 547_1 and 547_2) may have a pixel value of zero.
In an embodiment, the computing system 101 may generate the highlight bitmap by detecting glare or other over-lighting effects (overlit effects) in the image portion. Such detection may be based on, for example, detecting pixel values of image portion 421 that exceed a defined brightness threshold, such as the pixel values of regions 427_1 and 427_2. As an example of a brightness threshold, if the pixel values are 8-bit integers in the range from 0 to 255, the defined brightness threshold may be, for example, 230 or 240. If the pixel value at a particular pixel coordinate in the image portion 421 exceeds the defined brightness threshold, the computing system 101 may set the pixel value at the same pixel coordinate in the highlight bitmap 543 to a value associated with identifying glare (e.g., 255).
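A minimal sketch of the brightness-threshold detection just described is given below: pixels of the image portion that exceed a defined brightness threshold are marked in the highlight bitmap with a value associated with identifying glare (255 here). The threshold of 230 is one of the example values mentioned above, not a requirement.

```python
import numpy as np

def make_highlight_bitmap(image_portion: np.ndarray, brightness_threshold: int = 230) -> np.ndarray:
    """Mark pixels brighter than the defined brightness threshold as glare."""
    highlight = np.zeros_like(image_portion, dtype=np.uint8)
    highlight[image_portion > brightness_threshold] = 255   # value associated with identifying glare
    return highlight
```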
In an embodiment, the shadow bitmap may indicate regions (if any) of the image portion (e.g., 421) that represent the effect of light being completely blocked from reaching a portion of the surface of the object. Such an effect of insufficient light may result in a shadow being cast on that portion of the surface of the object. In some cases, the shadow may cause any visual detail at that region of the image portion (e.g., 421) to appear blurred or not appear at all. For example, fig. 5E shows shadow region 428_1 in image portion 421. Computing system 101 may detect shadow region 428_1 as a region of image portion 421 having pixel values that are smaller than the pixel values of the surrounding area by at least a defined difference threshold. In some cases, shadow region 428_1 may be detected as a region in which the pixel values are less than a defined darkness threshold. For example, if the pixel values are in the range of 0 to 255, the defined darkness threshold may be a pixel value of 10 or 20.
Fig. 5E also depicts a shadow bitmap 553 generated based on the image portion 421. More specifically, the shadow bitmap 553 may include a shadow identification region 558_1 corresponding to shadow region 428_1. More specifically, shadow identification region 558_1 may occupy the same positions in shadow bitmap 553 as those occupied by shadow region 428_1 in the image portion 421. In some cases, each of the pixels in a shadow identification region (e.g., 558_1) may have a non-zero value, and all pixels of the shadow bitmap 553 that are not in a shadow identification region may have a pixel value of zero. In some cases, pixels in the shadow identification regions (if any) of the shadow bitmap (e.g., 553) may have a defined pixel value, such as the defined maximum pixel value. In some cases, pixels of the shadow identification region (e.g., 558_1) may have the same pixel values as the corresponding pixels in the shadow region (e.g., 428_1).
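Similarly, the darkness-threshold variant of shadow detection described above could be sketched as follows, where pixels darker than a defined darkness threshold are marked in the shadow bitmap. The choice of 20 as the threshold and 255 as the marking value are illustrative assumptions.

```python
import numpy as np

def make_shadow_bitmap(image_portion: np.ndarray, darkness_threshold: int = 20) -> np.ndarray:
    """Mark pixels darker than the defined darkness threshold as being in shadow."""
    shadow = np.zeros_like(image_portion, dtype=np.uint8)
    shadow[image_portion < darkness_threshold] = 255   # defined maximum pixel value marks shadow
    return shadow
```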
Referring back to fig. 3, method 300 further includes step 308, in which processing circuitry 110 of computing system 101 may determine (e.g., via image classification module 204) whether to classify at least one image portion (e.g., 421) as textured or non-textured based on the one or more bitmaps. Such classification may refer to whether an image or image portion has a sufficient amount of visual texture, if any, or whether the image or image portion is substantially blank or uniform in appearance. As described above, in some cases, at least one image portion may be used as a template for performing object recognition. In such a case, step 308 may involve determining whether to classify the template as a textured template or a non-textured template. In an embodiment, step 308 may be performed by image classification module 204 of fig. 2C.
In an embodiment, step 308 may involve classifying the image portion as textured if at least one of the one or more criteria is satisfied. In some cases, the at least one criterion may be based on a single bitmap, such as a descriptor bitmap (e.g., 513) or a standard deviation bitmap (e.g., 533). For example, the determination of whether to classify the at least one image portion as textured or non-textured may be based on whether the total number of descriptors indicated by the descriptor bitmap (e.g., 513) exceeds a defined descriptor number threshold, or whether the maximum, minimum or average of the local standard deviation values in the standard deviation bitmap 533 exceeds a defined standard deviation threshold. As described above, the descriptor bitmap (e.g., 513) may identify one or more regions of the at least one image portion (e.g., 421) that include one or more respective descriptors, or may indicate that no descriptors are detected at the at least one image portion (e.g., 421).
In embodiments, the at least one criterion for causing the image portion to be classified as textured may be based on a plurality of bitmaps, such as a combination of a descriptor bitmap (e.g., 513) and an edge bitmap (e.g., 523), a combination of a descriptor bitmap (e.g., 513) and a standard deviation bitmap (e.g., 533), a combination of an edge bitmap and a standard deviation bitmap, or all three bitmaps. For example, determining whether to classify at least one image portion as textured or non-textured at step 308 may include generating a fused bitmap (also referred to as a fused mask) that combines the plurality of bitmaps, wherein the classification is based on the fused bitmap. In some cases, the multiple bitmaps may describe multiple respective types of features. Using multiple types of bitmaps to classify the corresponding image portion may provide the benefit of utilizing information about the presence or absence of multiple types of features, which may provide a more complete assessment of how many features, if any, are present in an image or image portion. For example, an image portion may have a particular visual detail (e.g., a pink region adjacent to a white region) that may not be identified as a feature by a first bitmap, but may be identified as a feature by a second bitmap.
In embodiments, generating the fused bitmap may involve generating a sum of the plurality of bitmaps, or more specifically, a weighted sum of the plurality of bitmaps. For example, the fused bitmap may be equal to or based on M1*W1 + M2*W2, or M1*W1 + M2*W2 + M3*W3, wherein M1 may refer to a first bitmap (e.g., a descriptor bitmap), M2 may refer to a second bitmap (e.g., an edge bitmap), and M3 may be a third bitmap (e.g., a standard deviation bitmap), and wherein W1, W2, and W3 may be respective weights associated with bitmaps M1, M2, and M3. In this example, the bitmaps M1, M2, and M3 may be referred to as feature or change bitmaps because they represent the presence of features in an image portion (or represent the absence of features), or represent an intensity change across the image portion (or represent the absence of change). In embodiments, a sum or other combination of the feature or change bitmaps may be referred to as a combined feature or change bitmap. Generating a weighted sum of the feature or change bitmaps may involve, for example, adding the bitmaps pixel by pixel. For example, the pixel value at pixel coordinate [x y]^T of the fused bitmap may be equal to the sum of: W1 multiplied by the pixel value at [x y]^T of the first bitmap M1; W2 multiplied by the pixel value at [x y]^T of the second bitmap M2; and W3 multiplied by the pixel value at [x y]^T of the third bitmap M3. In an embodiment, the weights W1, W2, W3 may be predefined. In an embodiment, the weights W1, W2, and W3 may be determined by the computing system 101 via a machine learning algorithm, as discussed in more detail below.
In embodiments, generating the fused bitmap may also be based on one or more lighting effect bitmaps, such as a highlight bitmap (e.g., 543) and a shadow bitmap (e.g., 553). For example, the computing system 101 may determine pixel values, also referred to as bitmap pixel values, that describe the visual texture level across at least one image portion (e.g., 421) of the image. The bitmap pixel values may be based on the combined feature or change bitmap discussed above, e.g., the pixel values of M1*W1 + M2*W2 or M1*W1 + M2*W2 + M3*W3. In this example, the computing system 101 may reduce or otherwise adjust a subset of the determined bitmap pixel values of the combined feature or change bitmap, where the adjustment may be based on the highlight bitmap (e.g., 543) and/or the shadow bitmap (e.g., 553). For example, the highlight bitmap or the shadow bitmap may identify one or more regions of the at least one image portion (e.g., 421) as showing glare or as being in shadow. The computing system 101 may make an adjustment that reduces bitmap pixel values in the same one or more regions of the combined feature or change bitmap. This reduction may reduce the impact of pixel values in the one or more regions on classifying the image portion as textured or non-textured, because those bitmap pixel values may be affected by lighting effects that reduce the reliability or quality of visual information from these regions. In embodiments, the reduction may be based on multiplying the combined feature or change bitmap by the highlight bitmap and/or the shadow bitmap.
As an example of the above discussion, FIG. 6 shows a fused bitmap 631 being generated based on combining feature or change bitmaps and lighting effect bitmaps. More specifically, FIG. 6 depicts computing system 101 generating a fused bitmap equal to (M1*W1 + M2*W2 + M3*W3) * (M4*W4 + M5*W5), where M4 is the highlight bitmap, M5 is the shadow bitmap, and W4 and W5 are the weights associated with bitmaps M4 and M5, respectively. In this example, M1*W1 + M2*W2 + M3*W3 may form a combined feature or change bitmap 621, which may be multiplied by a combined lighting effect bitmap 623 equal to M4*W4 + M5*W5.
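The fusion shown in FIG. 6 can be sketched as an element-wise computation over NumPy arrays, as below. The weight values are arbitrary assumptions, and the sketch further assumes that the combined lighting effect bitmap M4*W4 + M5*W5 is encoded so that it is lower in glare or shadow regions and higher elsewhere, so that the multiplication reduces bitmap pixel values where lighting effects make the visual information less reliable; the description above leaves the exact encoding open.

```python
import numpy as np

def make_fused_bitmap(M1, M2, M3, M4, M5,
                      W1=0.4, W2=0.3, W3=0.3, W4=0.5, W5=0.5):
    # Combined feature or change bitmap 621: pixel-by-pixel weighted sum.
    feature_or_change = W1 * M1 + W2 * M2 + W3 * M3
    # Combined lighting effect bitmap 623 (assumed low in glare/shadow regions).
    lighting_effect = W4 * M4 + W5 * M5
    # Element-wise product, i.e., (M1*W1 + M2*W2 + M3*W3) * (M4*W4 + M5*W5).
    return feature_or_change * lighting_effect
```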
As described above, the weights W1-W5 may be determined via machine learning techniques in an example. For example, machine learning techniques may involve using training data to determine optimal values for the weights W1 through W5. In some cases, the training data may include training images or training image portions, which may be images or image portions having a predetermined classification as to whether they are textured or non-textured. In such a case, the computing system 101 may be configured to determine the optimal values of the weights W1 to W5 that minimize the classification error of the training images. For example, the computing system 101 may be configured to adjust the weights W1-W5 towards their optimal values using a gradient descent process.
In an embodiment, the computing system 101 may be configured to determine the values of the weights W1-W5 based on predefined information about objects that may be within the field of view (e.g., 443) of the image capture device. For example, if the computing system 101 receives an indication (e.g., from a warehouse manager) that the image capture device (e.g., 441) has captured or will capture an object that is likely to have many visual markers that will appear as edges, the weight W2 may be assigned a relatively high value in order to emphasize the edge bitmap M2. If the computing system 101 receives an indication that an object is likely to have visual markers forming descriptors, the weight W1 may be assigned a relatively high value in order to emphasize the descriptor bitmap M1. In some cases, the computing system 101 may be configured to determine the values of the weights W1-W5 based on downstream analysis, such as determining which bitmaps have more information (e.g., more non-zero values). In such an example, a bitmap (e.g., M1) with more information may be assigned a relatively higher weight. In some cases, the computing system 101 may be configured to assign values to the weights based on a defined preference regarding which feature detection type is to be used or emphasized. For example, if the defined preference indicates that edge-based detection is to be emphasized, the computing system may assign a relatively higher value to W2. If the defined preference indicates that descriptor-based detection is to be emphasized, the computing system may assign a relatively higher value to W1.
In an embodiment, if the image received at step 302 (e.g., 420) is a color image having multiple color components, generating the fused bitmap (e.g., 631) may involve generating respective intermediate fused bitmaps corresponding to the color components and then combining the intermediate fused bitmaps. More specifically, fig. 7 shows a color image having a red component, a green component, and a blue component. In such an example, the computing system 101 may be configured to generate at least a first set of bitmaps (M1_red through M5_red) corresponding to a first color component (e.g., red) and a second set of bitmaps (M1_green through M5_green) corresponding to a second color component (e.g., green). In the example of fig. 7, computing system 101 may also generate a third set of bitmaps (M1_blue through M5_blue) corresponding to a third color component (e.g., blue). In this embodiment, respective intermediate fused bitmaps (such as fused_red, fused_green, and fused_blue) may be generated from each of the three sets of bitmaps. The three intermediate fused bitmaps may be combined into a single fused bitmap, such as bitmap 631 of FIG. 6.
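One hedged way to realize the per-color-component processing of fig. 7 is sketched below: the color image is split into its components, a caller-supplied routine (here the hypothetical make_fused_bitmap_for_channel) produces an intermediate fused bitmap per component, and the intermediates are combined. Averaging the three intermediate fused bitmaps is an assumption; the description only requires that they be combined into a single fused bitmap.

```python
import cv2
import numpy as np

def make_fused_bitmap_color(color_image: np.ndarray, make_fused_bitmap_for_channel) -> np.ndarray:
    """Generate per-component intermediate fused bitmaps and combine them."""
    blue, green, red = cv2.split(color_image)            # assumes OpenCV's BGR channel order
    fused_red = make_fused_bitmap_for_channel(red)        # intermediate fused bitmap fused_red
    fused_green = make_fused_bitmap_for_channel(green)    # intermediate fused bitmap fused_green
    fused_blue = make_fused_bitmap_for_channel(blue)      # intermediate fused bitmap fused_blue
    return (fused_red + fused_green + fused_blue) / 3.0   # one way to combine the intermediates
```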
As described above, the classification in step 308 may be based on a standard deviation bitmap (e.g., 533), which may represent intensity variations across at least one image portion of the image. In an embodiment, the at least one criterion for causing the image portion to be classified as textured may be based on intensity variations across the fused bitmap (e.g., 631). The variation across the fused bitmap may be quantified, for example, by the standard deviation value of the localized regions in the fused bitmap. For example, if a maximum, minimum, or average of such local standard deviation values is equal to or greater than a defined standard deviation threshold, the computing system 101 may classify the at least one image portion as textured.
In an embodiment, step 308 may involve generating a texture bitmap based on the fused bitmap. In such embodiments, at least one criterion for causing the image portion to be classified as textured may be based on the texture bitmap. FIG. 6 depicts the fused bitmap 631 being converted to a texture bitmap 641. In embodiments, the texture bitmap may be used to identify which one or more regions of the corresponding image portion (e.g., 421) have a sufficient level of visual texture, or to indicate regions of the image portion (e.g., 421) that do not have a sufficient level of visual texture. More specifically, the texture bitmap may have texture identifying regions and/or non-texture identifying regions. A texture identifying region, such as region 643 of texture bitmap 641, may have pixel value(s) to indicate that a corresponding region of the image portion, which may be referred to as a textured region, has at least a defined texture level. The non-texture identifying region, such as region 645 in texture bitmap 641, may have pixel value(s) to indicate that the corresponding region of the image portion (which may be referred to as a non-textured region) does not have a defined level of texture. The texture regions in the image portion (e.g., 421) may occupy the same positions (e.g., the same coordinates) as those occupied by the texture identification region 643 in the texture bitmap 641. Similarly, non-textured areas in the image portion may occupy the same positions as those occupied by the non-textured identification area 645 in the texture bitmap 641. Thus, the texture bitmap 641 can be used to identify how many (if any) of the image portions have a sufficient level of visual texture and how many (if any) of the image portions lack a sufficient level of visual texture.
In embodiments, the computing system 101 may be configured to generate a texture bitmap (e.g., 641) by comparing pixels of the fused bitmap (e.g., 631) to a defined texture level threshold, such as a defined pixel value threshold. In such an example, the computing system 101 may determine, for each pixel coordinate of the fused bitmap (e.g., 631), whether the pixel value of the fused bitmap (e.g., 631) at that pixel coordinate equals or exceeds a defined pixel value threshold. If the pixel value of the fused bitmap at that pixel coordinate equals or exceeds a defined pixel value threshold, then computing system 101 may assign, for example, a non-zero value to the same pixel coordinate in the texture bitmap (e.g., 641). As an example, the pixel coordinate assigned a non-zero value may be one pixel coordinate in texture identification region 643. Although the above discussion refers to assigning non-zero values, any value associated with a texture indicating a sufficient level may be assigned. If the pixel value of the fused bitmap (e.g., 631) at that pixel coordinate is less than the defined pixel value threshold, then the computing system 101 may assign, for example, a zero value to the same pixel coordinate in the texture bitmap. As an example, the pixel coordinate assigned a zero value may be one pixel coordinate in the non-texture identifying region 645. Although the above discussion refers to assigning zero values, any value associated with a texture indicating an insufficient level may be assigned.
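The threshold comparison just described could be sketched as a one-line NumPy operation that yields a binary texture bitmap, with 1 marking texture identification regions and 0 marking non-texture identification regions; the defined pixel value threshold of 50 is an arbitrary illustrative value.

```python
import numpy as np

def make_texture_bitmap(fused_bitmap: np.ndarray, texture_threshold: float = 50.0) -> np.ndarray:
    """Pixels at or above the defined pixel value threshold become 1; all others become 0."""
    return (fused_bitmap >= texture_threshold).astype(np.uint8)
```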
In an embodiment, the texture bitmap may be a binary mask, where all pixels in the texture bitmap may have only one of two pixel values, such as 0 or 1. For example, all pixels in the texture identification area 643 of the texture bitmap 641 may have a pixel value of 1, while all pixels in the no-texture identification area 645 may have a value of 0. In this example, a pixel in the texture bitmap having a pixel value of 1 may indicate that the corresponding area of the image portion (e.g., 421) is a textured area, while a pixel in the texture bitmap 641 having a pixel value of 0 may indicate that the corresponding area of the image portion (e.g., 421) is a non-textured area.
In an embodiment, the at least one criterion for classifying an image portion (e.g., 421) as textured may be based on the size (e.g., total area) of one or more texture identification regions (if any) in a texture bitmap (e.g., 641) or the size of one or more non-texture identification regions (if any) in the texture bitmap (e.g., 641). The criterion may also be based on the size of one or more textured areas (if any) of the image portion (e.g., 421), or based on the size of one or more non-textured areas (if any) of the image portion. The size of the texture identification region(s), if any, may be equal to or substantially equal to the size of the textured area(s), if any, and the size of the non-texture identification region(s), if any, may be equal to or substantially equal to the size of the non-textured area(s), if any.
As an example of the above criteria, computing system 101 may determine a total textured area indicated by the texture bitmap and may classify the image portion (e.g., 421) as textured or non-textured based on the total textured area. The total textured area may indicate the total area of all texture identification regions (e.g., 643) in a texture bitmap (e.g., 641) or all corresponding textured regions in an image portion (e.g., 421). If the texture bitmap (e.g., 641) does not have a texture identifying region, or if the image portion (e.g., 421) does not have a textured region, the total textured area may be zero. In some cases, computing system 101 may classify an image portion (e.g., 421) as textured if the total textured area is equal to or greater than a defined area threshold, and may classify an image portion (e.g., 421) as non-textured if the total textured area is less than a defined area threshold.
In an embodiment, the at least one criterion that causes the image portion to be classified as textured or non-textured may be based on a percentage P_texture, which may be the percentage of the image portion (e.g., 421) occupied by one or more textured areas (if any), or the percentage of the texture bitmap (e.g., 641) occupied by one or more texture identification regions (e.g., 643) (if any). If the image portion has no textured areas, or if the corresponding texture bitmap has no texture identification regions, the percentage P_texture may be zero. In an embodiment, the at least one criterion may be based on a percentage P_non-texture, which may be the percentage of the image portion (e.g., 421) occupied by one or more non-textured areas (if any), or the percentage of the texture bitmap (e.g., 641) occupied by one or more non-texture identification regions (e.g., 645) (if any).
In an embodiment, the at least one criterion that causes the image portion to be classified as textured or non-textured may be based on a comparison between the percentage P_texture (which may be a first percentage in this example) and the percentage P_non-texture (which may be a second percentage in this example). For example, such an embodiment may involve classifying at least one image portion (e.g., 421) as textured if the ratio P_texture/P_non-texture exceeds a defined textured-versus-non-textured comparison threshold T_1 (e.g., 5).
In an embodiment, at least one criterion that causes an image portion (e.g., 421) to be classified as textured or non-textured may be based on a ratio of the percentage P_texture to the total number Num_image of pixels in the image portion (e.g., 421) or the image (e.g., 420) received in step 302, and/or on a ratio of P_non-texture to Num_image. For example, if the ratio P_texture/Num_image is greater than a defined texture-to-image size comparison threshold T_2 (e.g., 0.9), and/or if the ratio P_non-texture/Num_image is less than a defined non-texture-to-image size comparison threshold T_3 (e.g., 0.1), then computing system 101 may classify at least the image portion (e.g., 421) as textured.
In embodiments, the computing system 101 may incorporate some or all of the above-described criteria involved in classifying image portions as textured or non-textured. In some cases, computing system 101 may be configured to perform step 308 by classifying an image portion (e.g., 421) as textured if any of the above criteria are met, and classifying it as non-textured if none of the above criteria are met.
For example, as part of evaluating a first criterion, the computing system 101 may determine whether the number of descriptors in the descriptor bitmap (e.g., 513) is greater than a defined descriptor number threshold. If this first criterion is satisfied, computing system 101 may classify the image portion (e.g., 421) as textured. If this first criterion is not satisfied, computing system 101 may evaluate a second criterion by determining whether P_texture/P_non-texture > T_1. If the second criterion is satisfied, computing system 101 may classify the image portion (e.g., 421) as textured. If the second criterion is not satisfied, computing system 101 may evaluate a third criterion by determining whether P_texture/Num_image > T_2 and/or P_non-texture/Num_image < T_3. If the third criterion is satisfied, computing system 101 may classify the image portion (e.g., 421) as textured. If the third criterion is not satisfied, the computing system 101 may evaluate a fourth criterion by determining whether a maximum, minimum, or mean of the standard deviation values indicated by the standard deviation bitmap (e.g., 533) or by the fused bitmap (e.g., 631) is greater than a defined standard deviation threshold. If the fourth criterion is satisfied, the computing system may classify the image portion (e.g., 421) as textured. If none of the above criteria are satisfied, the computing system 101 may classify the image portion (e.g., 421) as non-textured.
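The cascaded evaluation just described might be sketched as follows. All threshold values, and the use of pixel counts from the texture bitmap in place of percentages (which leaves the ratios unchanged), are assumptions for illustration only.

```python
import numpy as np

def classify_textured(num_descriptors, texture_bitmap, std_dev_bitmap,
                      descriptor_count_threshold=10, T1=5.0, T2=0.9, T3=0.1,
                      std_dev_threshold=10.0) -> str:
    """Cascade of criteria; returns 'textured' if any criterion is met, else 'non-textured'."""
    num_pixels = texture_bitmap.size
    p_texture = np.count_nonzero(texture_bitmap)        # pixels in texture identification regions
    p_non_texture = num_pixels - p_texture
    if num_descriptors > descriptor_count_threshold:                      # first criterion
        return "textured"
    if p_non_texture > 0 and p_texture / p_non_texture > T1:              # second criterion
        return "textured"
    if p_texture / num_pixels > T2 or p_non_texture / num_pixels < T3:    # third criterion
        return "textured"
    if np.max(std_dev_bitmap) > std_dev_threshold:                        # fourth criterion
        return "textured"
    return "non-textured"
```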
In an embodiment, steps 306 and 308 may be repeated for one or more other image portions of the image received in step 302. For example, the received image (e.g., 420) may represent a plurality of objects, such as objects 401-404 in FIG. 4A. In some cases, more than one template may be generated based on the plurality of objects. As an example, as described above, a first template may be generated based on the image portion 421 describing the appearance of the object 401. In this embodiment, a second template may be generated based on a second image portion 422, and a third template may be generated based on a third image portion 423, as illustrated in fig. 8A to 8C. Image portion 422 may represent object 402, and image portion 423 may represent object 403. In this example, computing system 101 may extract image portions 422 and 423 from image 420 and perform steps 306 and 308 on these image portions 422, 423 to generate the second and third templates, respectively, based on these image portions. In one example, image portion 422 may be classified as non-textured, such that the second template is a non-textured template. In some implementations, image portion 423 may also be classified as non-textured, such that the third template is also a non-textured template. Although image portion 423 may show one or more edges of a strip of tape, the feature or change bitmap(s) and the fused bitmap generated from only the one or more edges may not be sufficient to produce a textured classification in this example.
Returning to fig. 3, the method 300 may include a step 310 in which the processing circuitry 110 of the computing system 101 may perform a motion plan for the interaction of the robot with one or more objects (e.g., 401-404 of fig. 4A) based on whether at least one image portion (e.g., 421) is classified as textured or non-textured. In an embodiment, step 310 may be performed by the image classification module 204 and/or the motion planning module 208 of fig. 2C.
In embodiments, step 310 may involve performing object recognition on one or more of the objects in the device field of view (e.g., 443) of the image capture device (e.g., 441), such as the one or more objects 401-404 represented by the image 420. For example, as described above, the image portion 421 representing the object 401 may be used as a template or to generate a template, and the object recognition may involve determining whether the remaining objects 402-404 in the device field of view 443 match the template. As an example, the computing system 101 may be configured to determine whether a portion of the image 420 representing the object 402, 403, or 404 matches the template, wherein the template is generated based on the appearance of the object 401. In some cases, object recognition may be based on whether the template is classified as a textured or non-textured template. For example, the classification of a template may affect where and/or for how long the template is stored. Performing object recognition based on either a non-textured template or a textured template is discussed in more detail in U.S. patent application No. ________ entitled METHOD AND COMPUTING SYSTEM FOR OBJECT RECOGNITION OR OBJECT REGISTRATION BASED ON IMAGE CLASSIFICATION, filed on even date herewith (Atty. Dkt. MJ0054-US/0077-0012US1), the entire contents of which are incorporated herein by reference. As described above, object recognition may yield information about, for example, the size of the object, which may be used to plan the robot's interaction with the object (e.g., 404). In an embodiment, step 310 may be omitted. For example, such an embodiment may include a method having steps 302, 306, and 308, and stopping after step 308 is completed.
In an embodiment, the computing system 101 may be configured to determine a confidence in the object recognition, where the determination may be based on whether the template is textured or non-textured. For example, if the appearance of an object (e.g., 403) only matches a template without texture, such a match may be assigned a relatively low confidence. If the appearance of an object (e.g., 404) matches a textured template, such a match may be assigned a relatively high confidence. In some cases, the computing system 101 may be configured to perform additional object recognition operations, such as additional object recognition operations based on another technique or based on additional information, in an attempt to improve the robustness of the object recognition. In some cases, the computing system 101 may execute the motion plan based on the confidence. For example, if the confidence is relatively low, the computing system 101 may be configured to limit the speed of the robot (e.g., 461) when the robot attempts to pick up or otherwise interact with an object, so that the robot interaction proceeds with more caution.
Additional discussion of various embodiments
Embodiment 1 relates to an image classification method. The method may be performed, for example, by a computing system executing instructions on a non-transitory computer-readable medium. The method in this embodiment includes receiving, by a computing system, an image, wherein the computing system is configured to communicate with an image capture device, wherein the image is generated by the image capture device and the image is to represent one or more objects in a field of view of the image capture device. The method also includes generating, by the computing system, one or more bitmaps based on at least one image portion of the image, wherein the one or more bitmaps and the at least one image portion are associated with a first object of the one or more objects, and wherein the one or more bitmaps describe whether one or more visual features for feature detection are present in the at least one image portion or whether there is an intensity variation across the at least one image portion. Additionally, the method includes determining, by the computing system, based on the one or more bitmaps, whether to classify the at least one image portion as textured or non-textured, and executing, based on whether the at least one image portion is classified as textured or non-textured, a motion plan for interaction of the robot with the one or more objects.
Example 2 includes the method of example 1. In this embodiment, the one or more bitmaps include a descriptor bitmap for indicating whether one or more descriptors are present in the at least one image portion or for identifying one or more regions of the at least one image portion that include one or more respective descriptors detected from the at least one image portion. The determination of whether to classify the at least one image portion as textured or non-textured is based on whether a total number of descriptors identified by the descriptor bitmap exceeds a defined descriptor number threshold.
Example 3 includes the method of example 1 or 2. In this embodiment, the one or more bitmaps include a plurality of bitmaps having a first bitmap and a second bitmap. The first bitmap is generated based on the at least one image portion and describes whether one or more visual features of a first feature type are present in the at least one image portion. Further, in this embodiment, the second bitmap is generated based on the at least one image portion and describes whether one or more visual features of a second feature type are present in the at least one image portion, and wherein the determination of whether to classify the at least one image portion as textured or non-textured comprises generating a fused bitmap combining the plurality of bitmaps, and wherein the at least one image portion is classified as textured or non-textured based on the fused bitmap.
Example 4 includes the method of example 3. In this embodiment, the first bitmap is a descriptor bitmap for identifying one or more regions of the at least one image portion that include one or more respective descriptors detected from the at least one image portion, or for indicating that no descriptors are detected in the at least one image portion, and wherein the second bitmap is an edge bitmap for identifying one or more regions of the at least one image portion that include one or more respective edges detected from the at least one image portion, or for indicating that no edges are detected in the at least one image portion.
Example 5 includes the method of example 4. In this embodiment, the plurality of bitmaps includes a third bitmap, the third bitmap being a standard deviation bitmap for indicating, for each pixel of the at least one image portion, a standard deviation between pixel intensity values around the pixel.
Embodiment 6 includes the method of any one of embodiments 3-5. In this embodiment, the determination of whether to classify the at least one image portion as textured or non-textured includes converting, by the computing system, the fused bitmap to a texture bitmap. Furthermore, in this embodiment, the texture bitmap is used for identifying one or more textured regions of the at least one image portion, or for indicating that the at least one image portion does not have a textured region, wherein the texture bitmap is also used for identifying one or more non-textured regions of the at least one image portion, or for indicating that the at least one image portion does not have a non-textured region, wherein the one or more textured regions are one or more regions of the at least one image portion having at least a defined texture level, and the one or more non-textured regions are one or more regions of the at least one image portion having less than the defined texture level; and the determination of whether to classify the at least one image portion as textured or non-textured is based on the texture bitmap.
Example 7 includes the method of example 6. In this embodiment, the determination of whether to classify the at least one image portion as textured or non-textured is based on at least one of: a total textured area indicated by the texture bitmap, wherein the total textured area is a total area of the one or more textured regions, or zero if the texture bitmap indicates that the at least one image portion does not have textured regions.
Embodiment 8 includes the method of any one of embodiments 3-7. In this embodiment, the determination of whether to classify the at least one image portion as textured or non-textured is based on whether there is a change in pixel intensity values across the fused bitmap, or on a change in pixel intensity values across the fused bitmap.
Embodiment 9 includes the method of any one of embodiments 2-8. In this embodiment, the determination of whether to classify the at least one image portion as textured or non-textured comprises at least one of: a) classifying the at least one image portion as textured if the number of descriptors identified by the descriptor bitmap is greater than a defined descriptor number threshold, b) classifying the at least one image portion as textured if a ratio between a first percentage and a second percentage exceeds a defined texture-to-no texture comparison threshold, wherein the first percentage is a percentage of the at least one image portion that is occupied by the one or more textured areas, or if the at least one image portion does not have textured areas, the first percentage is zero and the second percentage is a percentage of the at least one image portion that is occupied by the one or more non-textured areas, c) if the ratio between the first percentage and the size of the at least one image portion is greater than a defined texture-to-image size comparison threshold, or classifying the at least one image portion as textured if a ratio between the second percentage and the size of the at least one image portion is less than a defined non-texture-image size comparison threshold, or d) classifying the at least one image portion as textured if a maximum or minimum of a standard deviation of local regions of corresponding pixels of the fused bitmap is greater than a defined standard deviation threshold.
Embodiment 10 includes the method of any one of embodiments 1-9. In this embodiment, the method further comprises generating an additional bitmap describing the effect of the lighting conditions generating the image on the at least one image portion.
Example 11 includes the method of example 10. In this embodiment, the additional bitmap comprises at least one of: a highlight bitmap identifying one or more regions of the at least one image portion that exceed a defined brightness threshold due to the lighting conditions, or a shadow bitmap identifying one or more regions of the at least one image portion that are in shadow.
Embodiment 12 includes the method of any one of embodiments 3-11. In this embodiment, generating the fused bitmap includes: determining bitmap pixel values describing a texture level across the at least one image portion based at least on the first bitmap and the second bitmap; and reducing a subset of the determined bitmap pixel values based on the highlight bitmap or the shadow bitmap, wherein the subset of reduced bitmap pixel values corresponds to one or more regions of the at least one image portion that are identified by the highlight bitmap as exceeding the defined brightness threshold or identified by the shadow bitmap as being in shadow.
Embodiment 13 includes the method of any one of embodiments 3-12. In this embodiment, generating the fused bitmap is based on a weighted sum of at least the first bitmap and the second bitmap and on a weighted sum of the highlight bitmap and the shadow bitmap.
Embodiment 14 includes the method of any one of embodiments 3-13. In this embodiment, the image received by the computing system is a color image comprising a plurality of color components, wherein the first bitmap and the second bitmap belong to a first set of bitmaps associated with a first color component of the plurality of color components, and wherein the method includes generating a second set of bitmaps associated with a second color component of the plurality of color components, and wherein the fused bitmap is generated based on at least the first set of bitmaps and the second set of bitmaps.
Example 15 includes the method of example 14. In this embodiment, the method further comprises: generating a first intermediate fused bitmap that combines the first set of bitmaps, wherein the first intermediate fused bitmap is associated with the first color component; generating a second intermediate fused bitmap combining the second set of bitmaps, wherein the second intermediate fused bitmap is associated with the second color component, and wherein the fused bitmap is generated by combining at least the first intermediate fused bitmap and the second intermediate fused bitmap.
Embodiment 16 includes the method of any one of embodiments 1-15. In this embodiment, the method further comprises: applying a smoothing operation to the image to produce an updated image before the one or more bitmaps are generated, wherein the at least one image portion from which the one or more bitmaps are generated is extracted from the updated image.
It will be apparent to one of ordinary skill in the relevant art that other suitable modifications and adaptations to the methods and applications described herein may be made without departing from the scope of any of the embodiments. The embodiments described above are illustrative examples and should not be construed as limiting the invention to these particular embodiments. It should be understood that the various embodiments disclosed herein may be combined in different combinations than those specifically presented in the description and drawings. It will also be understood that, according to an example, certain acts or events of any process or method described herein can be performed in a different order, may be added, merged, or omitted altogether (e.g., all described acts or events may not be necessary for performing the method or process). Additionally, although certain features of the embodiments herein are described as being performed by a single component, module, or unit for clarity, it should be understood that the features and functions described herein can be performed by any combination of components, units, or modules. Accordingly, various changes and modifications may be effected therein by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (15)

1. An image classification method, comprising:
receiving, by a computing system, an image, wherein the computing system is configured to communicate with an image capture device, wherein the image is generated by the image capture device and the image is to represent one or more objects in a field of view of the image capture device;
generating, by the computing system, a plurality of bitmaps based on at least one image portion of the image, wherein the plurality of bitmaps and the at least one image portion are associated with a first object of the one or more objects, and wherein the plurality of bitmaps include: (i) a descriptor bitmap for identifying one or more regions of the at least one image portion that include one or more respective descriptors detected from the at least one image portion, or for indicating that no descriptors are detected in the at least one image portion, and (ii) an edge bitmap for identifying one or more regions of the at least one image portion that include one or more respective edges detected from the at least one image portion, or for indicating that no edges are detected in the at least one image portion;
generating a fused bitmap, the fused bitmap based on a weighted sum of the plurality of bitmaps;
determining, by the computing system, based on the fused bitmap, whether to classify the at least one image portion as textured or non-textured; and
performing a motion plan for the robot's interaction with the one or more objects based on whether the at least one image portion is classified as textured or non-textured.
2. The method of claim 1, wherein the plurality of bitmaps further include a standard deviation bitmap for indicating, for each pixel of the at least one image portion, a standard deviation between pixel intensity values surrounding the pixel.
3. The method of claim 1, wherein the image received by the computing system is a color image comprising a plurality of color components,
wherein the plurality of bitmaps form a first set of bitmaps associated with a first color component of the plurality of color components, and wherein the method includes generating a second set of bitmaps associated with a second color component of the plurality of color components, and
wherein the fused bitmap is generated based at least on the first set of bitmaps and the second set of bitmaps.
4. The method of claim 3, further comprising:
generating a first intermediate fused bitmap that combines the first set of bitmaps, wherein the first intermediate fused bitmap is associated with the first color component;
generating a second intermediate fused bitmap combining the second set of bitmaps, wherein the second intermediate fused bitmap is associated with the second color component, and
wherein the fused bitmap is generated by combining at least the first intermediate fused bitmap and the second intermediate fused bitmap.
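One way to read claims 3 and 4 is: build a bitmap set per color component, fuse each set into an intermediate fused bitmap, then combine the intermediates. The sketch below assumes an edge/Laplacian-based per-channel fusion and a pixel-wise maximum as the combination rule; both are illustrative choices rather than requirements of the claims.

    import cv2
    import numpy as np

    def per_channel_fused(channel):
        """Intermediate fused bitmap for one color component:
        a weighted sum of an edge bitmap and a local-variation bitmap."""
        edges = (cv2.Canny(channel, 50, 150) > 0).astype(np.float32)
        variation = np.abs(cv2.Laplacian(channel.astype(np.float32), cv2.CV_32F))
        variation /= (variation.max() + 1e-6)
        return 0.5 * edges + 0.5 * variation

    def fused_bitmap_from_color(image_bgr):
        """One intermediate fused bitmap per color component, combined pixel-wise."""
        intermediates = [per_channel_fused(c) for c in cv2.split(image_bgr)]
        return np.maximum.reduce(intermediates)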
5. The method of claim 1, further comprising applying a smoothing operation to the image to produce an updated image before the plurality of bitmaps are generated, wherein the at least one image portion from which the plurality of bitmaps are generated is extracted from the updated image.
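The smoothing operation of claim 5 could be, for example, a Gaussian blur applied before any bitmap is generated; the kernel size and sigma below are assumed values.

    import cv2

    def smooth_before_bitmaps(image, ksize=5, sigma=1.0):
        """Smooth the received image to suppress noise before extracting the
        image portion and generating the bitmaps (cf. claim 5)."""
        return cv2.GaussianBlur(image, (ksize, ksize), sigma)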
6. An image classification method, comprising:
receiving, by a computing system, an image, wherein the computing system is configured to communicate with an image capture device, wherein the image is generated by the image capture device and is for representing one or more objects in a field of view of the image capture device;
generating, by the computing system, a plurality of bitmaps based on at least one image portion of the image, wherein the plurality of bitmaps and the at least one image portion are associated with a first object of the one or more objects, and wherein the plurality of bitmaps include: (i) a first bitmap describing whether one or more visual features of a first feature type are present in the at least one image portion, and (ii) a second bitmap describing whether one or more visual features of a second feature type are present in the at least one image portion;
generating, by the computing system, a fused bitmap that combines the plurality of bitmaps;
converting, by the computing system, the fused bitmap to a texture bitmap;
determining, by the computing system, based on the texture bitmap, whether to classify the at least one image portion as textured or non-textured; and
performing motion planning for robot interaction with the one or more objects based on whether the at least one image portion is classified as textured or non-textured;
wherein the texture bitmap is for identifying one or more textured regions of the at least one image portion, or for indicating that the at least one image portion does not have a textured region,
wherein the texture bitmap is further for identifying one or more non-textured regions of the at least one image portion, or for indicating that the at least one image portion does not have a non-textured region,
wherein the one or more textured regions are one or more regions of the at least one image portion having at least a defined level of texture, and the one or more non-textured regions are one or more regions of the at least one image portion having less than the defined level of texture.
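Converting a fused bitmap into the texture bitmap described in claim 6 can be as simple as thresholding against a defined texture level and cleaning up isolated pixels. The threshold value and the morphological cleanup below are assumptions for illustration.

    import cv2
    import numpy as np

    def texture_bitmap_from_fused(fused_bitmap, texture_level=0.1):
        """Label each pixel of the fused bitmap as part of a textured region
        (>= texture_level) or a non-textured region (< texture_level)."""
        textured = (fused_bitmap >= texture_level).astype(np.uint8)
        # Morphological opening so isolated pixels do not count as textured regions.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
        textured = cv2.morphologyEx(textured, cv2.MORPH_OPEN, kernel)
        non_textured = 1 - textured
        return textured, non_textured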
7. The method of claim 6, wherein the determination of whether to classify the at least one image portion as textured or non-textured is based on at least one of: a total textured area indicated by the texture bitmap, wherein the total textured area is a total area of the one or more textured regions, or zero if the texture bitmap indicates that the at least one image portion does not have textured regions.
8. The method of claim 6, wherein the determination of whether to classify the at least one image portion as textured or non-textured is based on whether there is a change in pixel intensity values across the fused bitmap, or on an amount of change in pixel intensity values across the fused bitmap.
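As a rough illustration of claim 8, the variation of pixel values across the fused bitmap could be summarized by its standard deviation and compared against a threshold; the threshold value here is an assumption.

    import numpy as np

    def has_texture_variation(fused_bitmap, min_std=0.02):
        """Classify as textured when pixel values vary enough across the fused bitmap."""
        return float(np.std(fused_bitmap)) >= min_std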
9. The method of claim 6, wherein the determination of whether to classify the at least one image portion as textured or non-textured comprises at least one of:
a) classifying the at least one image portion as textured if the number of descriptors identified by the first bitmap is greater than a defined descriptor number threshold, wherein the first bitmap is a descriptor bitmap for identifying one or more regions of the at least one image portion that include one or more respective descriptors detected from the at least one image portion or for indicating that no descriptors are detected in the at least one image portion,
b) classifying the at least one image portion as textured if a ratio between a first percentage and a second percentage exceeds a defined textured-to-non-textured comparison threshold, wherein the first percentage is a percentage of the at least one image portion that is occupied by the one or more textured regions, or zero if the at least one image portion does not have a textured region, and the second percentage is a percentage of the at least one image portion that is occupied by the one or more non-textured regions,
c) classifying the at least one image portion as textured if a ratio between the first percentage and a size of the at least one image portion is greater than a defined texture-image size comparison threshold, or if a ratio between the second percentage and the size of the at least one image portion is less than a defined non-texture-image size comparison threshold, or
d) classifying the at least one image portion as textured if a maximum or minimum of a standard deviation over a local region around a corresponding pixel of the fused bitmap is greater than a defined standard deviation threshold.
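The alternative criteria of claim 9 can be chained as a simple decision function. The sketch below covers criteria (a), (b), and (d); all threshold values are illustrative assumptions, and criterion (c) is omitted for brevity.

    def classify_by_criteria(num_descriptors, textured_pct, non_textured_pct, local_std_max,
                             desc_threshold=10, ratio_threshold=1.0, std_threshold=0.05):
        """Return "textured" if any of the claim-9 style criteria fires."""
        if num_descriptors > desc_threshold:                                  # criterion (a)
            return "textured"
        if non_textured_pct > 0 and (textured_pct / non_textured_pct) > ratio_threshold:  # (b)
            return "textured"
        if local_std_max > std_threshold:                                     # (d)
            return "textured"
        return "non-textured"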
10. An image classification method, comprising:
receiving, by a computing system, an image, wherein the computing system is configured to communicate with an image capture device, wherein the image is generated by the image capture device and is for representing one or more objects in a field of view of the image capture device;
generating, by the computing system, a plurality of bitmaps based on at least one image portion of the image, wherein the plurality of bitmaps and the at least one image portion are associated with a first object of the one or more objects, and wherein the plurality of bitmaps include: (i) a first bitmap describing whether one or more visual features of a first feature type are present in the at least one image portion, and (ii) a second bitmap describing whether one or more visual features of a second feature type are present in the at least one image portion;
generating, by the computing system, a lighting effect bitmap describing an effect, on the at least one image portion, of lighting conditions under which the image was generated;
generating, by the computing system, a combined feature or change bitmap that combines the plurality of bitmaps;
generating, by the computing system, a fused bitmap by adjusting pixel values of the combined feature or change bitmap based on the lighting effect bitmap;
determining, by the computing system, based on the fused bitmap, whether to classify the at least one image portion as textured or non-textured; and
performing motion planning for robot interaction with the one or more objects based on whether the at least one image portion is classified as textured or non-textured.
11. The method of claim 10, wherein the lighting effect bitmap comprises at least one of:
a highlight bitmap identifying one or more regions in the at least one image portion that exceed a defined brightness threshold due to the lighting conditions, or
a shadow bitmap identifying one or more regions in shadow in the at least one image portion.
12. The method of claim 11, wherein bitmap pixel values of the combined feature or change bitmap describe a texture level across the at least one image portion; and
wherein generating the fused bitmap comprises: reducing a subset of the bitmap pixel values based on the highlight bitmap or the shadow bitmap, wherein the reduced subset of bitmap pixel values corresponds to one or more regions of the at least one image portion identified by the highlight bitmap as exceeding the defined brightness threshold or identified by the shadow bitmap as being in shadow.
13. The method of claim 11, wherein the lighting effect bitmap is a combined lighting effect bitmap combining the highlight bitmap and the shadow bitmap, and wherein generating the fused bitmap comprises multiplying bitmap pixel values of the combined feature or change bitmap by bitmap pixel values of the combined lighting effect bitmap.
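Claims 10 through 13 describe discounting image regions whose appearance is dominated by lighting (glare or shadow) before the texture decision. A minimal sketch, assuming simple intensity thresholds for the highlight and shadow bitmaps and a pixel-wise product for the adjustment:

    import cv2
    import numpy as np

    def lighting_adjusted_fused_bitmap(image_portion, feature_bitmap,
                                       bright_thresh=240, dark_thresh=30):
        """Build highlight and shadow bitmaps, combine them into a lighting-effect
        bitmap, and multiply the combined feature/change bitmap by it so that
        over-bright and shadowed regions contribute less to the texture decision."""
        gray = cv2.cvtColor(image_portion, cv2.COLOR_BGR2GRAY)

        highlight = (gray >= bright_thresh).astype(np.float32)  # over-exposed / glare regions
        shadow = (gray <= dark_thresh).astype(np.float32)       # regions in shadow

        # Combined lighting-effect bitmap: 1.0 where lighting looks reliable,
        # 0.0 where a highlight or shadow was detected.
        lighting_effect = 1.0 - np.clip(highlight + shadow, 0.0, 1.0)

        # Fused bitmap as the pixel-wise product (cf. the multiplication in claim 13).
        return feature_bitmap.astype(np.float32) * lighting_effect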
14. A computing system for image classification, comprising:
a non-transitory computer readable medium;
at least one processing circuit configured to, when the non-transitory computer-readable medium stores an image that is generated by an image capture device and that represents one or more objects in a field of view of the image capture device, perform the following:
receiving the image;
generating a plurality of bitmaps based on at least one image portion of the image, wherein the plurality of bitmaps and the at least one image portion are associated with a first object of the one or more objects, and wherein the plurality of bitmaps include: (i) a descriptor bitmap for identifying one or more regions of the at least one image portion that include one or more respective descriptors detected from the at least one image portion, or for indicating that no descriptors are detected in the at least one image portion, and (ii) an edge bitmap for identifying one or more regions of the at least one image portion that include one or more respective edges detected from the at least one image portion, or for indicating that no edges are detected in the at least one image portion;
generating a fused bitmap, the fused bitmap being based on a weighted sum of the plurality of bitmaps;
determining whether to classify the at least one image portion as textured or non-textured based on the fused bitmap; and
performing motion planning for robot interaction with the one or more objects based on whether the at least one image portion is classified as textured or non-textured.
15. A non-transitory computer-readable medium having stored thereon instructions that, when executed by at least one processing circuit of a computing system, cause the at least one processing circuit to:
receiving an image, wherein the computing system is configured to communicate with an image capture device, wherein the image is generated by the image capture device and is for representing one or more objects in a field of view of the image capture device;
generating a plurality of bitmaps based on at least one image portion of the image, wherein the plurality of bitmaps and the at least one image portion are associated with a first object of the one or more objects, and wherein the plurality of bitmaps include: (i) a descriptor bitmap for identifying one or more regions of the at least one image portion that include one or more respective descriptors detected from the at least one image portion, or for indicating that no descriptors are detected in the at least one image portion, and (ii) an edge bitmap for identifying one or more regions of the at least one image portion that include one or more respective edges detected from the at least one image portion, or for indicating that no edges are detected in the at least one image portion;
generating a fused bitmap, the fused bitmap being based on a weighted sum of the plurality of bitmaps;
determining whether to classify the at least one image portion as textured or non-textured based on the fused bitmap; and
performing motion planning for robot interaction with the one or more objects based on whether the at least one image portion is classified as textured or non-textured.
CN202011360781.7A 2020-01-10 2020-10-28 Method and system for performing image classification for object recognition Active CN112288040B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US202062959182P 2020-01-10 2020-01-10
US62/959,182 2020-01-10
US16/991,510 US11538238B2 (en) 2020-01-10 2020-08-12 Method and system for performing image classification for object recognition
US16/991,510 2020-08-12
CN202011170640.9A CN113111900A (en) 2020-01-10 2020-10-28 Method and system for performing image classification for object recognition

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202011170640.9A Division CN113111900A (en) 2020-01-10 2020-10-28 Method and system for performing image classification for object recognition

Publications (2)

Publication Number Publication Date
CN112288040A (en) 2021-01-29
CN112288040B (en) 2021-07-23

Family

ID=74425054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011360781.7A Active CN112288040B (en) 2020-01-10 2020-10-28 Method and system for performing image classification for object recognition

Country Status (1)

Country Link
CN (1) CN112288040B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102393906A (en) * 2011-07-12 2012-03-28 四川大学 Computer vision inspection technology-based method for rapid detection and extraction of two-dimensional code
CN102694963A (en) * 2012-04-27 2012-09-26 南京航空航天大学 Method for acquiring shadeless target image
CN104966049A (en) * 2015-06-01 2015-10-07 江苏大为科技股份有限公司 Lorry detection method based on images
CN105243354A (en) * 2015-09-08 2016-01-13 长安大学 Vehicle detection method based on target feature points
CN105787429A (en) * 2015-01-08 2016-07-20 通用汽车环球科技运作有限责任公司 Method and apparatus for inspecting an object employing machine vision
CN107637064A (en) * 2015-06-08 2018-01-26 深圳市大疆创新科技有限公司 Method and apparatus for image procossing
CN107689037A (en) * 2017-09-01 2018-02-13 中国空气动力研究与发展中心低速空气动力研究所 Flexible article moving target detecting method based on graph and image processing
CN108021921A (en) * 2017-11-23 2018-05-11 塔普翊海(上海)智能科技有限公司 Image characteristic point extraction system and its application
CN108090521A (en) * 2018-01-12 2018-05-29 广州视声智能科技有限公司 A kind of image interfusion method and arbiter of production confrontation network model
CN109063774A (en) * 2018-08-03 2018-12-21 百度在线网络技术(北京)有限公司 Picture charge pattern effect evaluation method, device, equipment and readable storage medium storing program for executing
JP6555211B2 (en) * 2016-08-15 2019-08-07 Jfeスチール株式会社 Edge extraction method for 2D images
CN110569861A (en) * 2019-09-01 2019-12-13 中国电子科技集团公司第二十研究所 Image matching positioning method based on point feature and contour feature fusion

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679195B (en) * 2013-12-02 2016-08-17 北京工商大学 Texture image classification method based on local edge pattern and system
US10444362B2 (en) * 2014-01-14 2019-10-15 Raytheon Company LADAR data upsampling
CN109214420A (en) * 2018-07-27 2019-01-15 北京工商大学 The high texture image classification method and system of view-based access control model conspicuousness detection

Also Published As

Publication number Publication date
CN112288040A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
TWI774659B (en) Image text recognition method and device
US11538238B2 (en) Method and system for performing image classification for object recognition
US20010026633A1 (en) Method for detecting a face in a digital image
CN109753953B (en) Method and device for positioning text in image, electronic equipment and storage medium
CN110717489A (en) Method and device for identifying character area of OSD (on screen display) and storage medium
CA3114255C (en) Automatically detecting and isolating objects in images
CN109614913A (en) A kind of oblique parking stall recognition methods, device and storage medium
CN110009615B (en) Image corner detection method and detection device
EP2813973A1 (en) Method and system for processing video image
US20230381971A1 (en) Method and computing system for object registration based on image classification
CN113269801A (en) Method and computing system for processing candidate edges
CN111695373A (en) Zebra crossing positioning method, system, medium and device
CN112288040B (en) Method and system for performing image classification for object recognition
JP4967045B2 (en) Background discriminating apparatus, method and program
CN112288038B (en) Object recognition or object registration method based on image classification and computing system
Salunkhe et al. Recognition of multilingual text from signage boards
CN113111900A (en) Method and system for performing image classification for object recognition
CN112734783B (en) Method and computing system for processing candidate edges
CN110738268A (en) intelligent stereoscopic warehouse goods automatic identification method based on SIFT and DDIS
EP2509028B1 (en) Method and system for optically detecting and localizing a two-dimensional, 2D, marker in 2D scene data, and marker therefor
Milecki et al. The application of a vision system to detect trajectory points for soldering robot programming
CN112825141B (en) Method and device for recognizing text, recognition equipment and storage medium
Tariq et al. Domain Specific Content Based Image Retrieval (CBIR) for Feminine Textile Designs
JP2023021160A (en) Method and computing systems for performing object detection
CN115063578A (en) Method and device for detecting and positioning target object in chip image and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant