AU2008264231B2

AU2008264231B2 - Video object foreground mask encoding

Info

Publication number: AU2008264231B2
Application number: AU2008264231A
Authority: AU
Inventors: David Grant Mcleish; Lachlan James Patrick
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2008-11-24
Filing date: 2008-12-30
Publication date: 2010-08-26
Anticipated expiration: 2028-12-30
Also published as: AU2008264229B2; AU2009238382A1; AU2008264229A1; AU2008264230A1; AU2008264231A1

Description

S&F Ref: 885503 AUSTRALIA PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT Name and Address Canon Kabushiki Kaisha, of 30-2, Shimomaruko 3 of Applicant: chome, Ohta-ku, Tokyo, 146, Japan Actual Inventor(s): David Grant McLeish Lachlan James Patrick Address for Service: Spruson & Ferguson St Martins Tower Level 35 31 Market Street Sydney NSW 2000 (CCN 3710000177) Invention Title: Video object foreground mask encoding Associated Provisional Application Details: [33] Country: [31] Appl'n No(s): [32] Application Date: AU 2008249180 24 Nov 2008 The following statement is a full description of this invention, including the best method of performing it known to me/us: 5845c(19097621) VIDEO OBJECT FOREGROUND MASK ENCODING RELATED APPLICATION This application claims priority from Australian Patent Application No. 2008249180 entitled "Multiple JPEG coding of same image", filed on 24 November, 2008 in the name of Canon Kabushiki Kaisha, the entire contents of which are incorporated herein by s reference. TECHNICAL FIELD The present disclosure relates to video object detection and, in particular to encoding detected objects for transmission over a network. BACKGROUND 10 Digital video transmission via communications networks has become common within video surveillance systems. However, the data bandwidth required for transmitting digital video content can be much greater than for capturing and transmitting still images or sounds alone, since video transmission involves a continuous flow of image infonnation. The flow of information can also pose a problem to content analysis systems which 15 employ object detection, since such systems must commonly analyse contemporaneously many streams of input to provide real-time results. For example, consider systems deployed in airports and shopping malls. Such systems typically have many cameras to provide surveillance coverage of areas of interest, including entrances and exits. Systems with many cameras produce many streams of image information to analyse. 20 The scalability of such systems can thus become a problem, both in terms of the network bandwidth required and the computational power needed to analyse effectively the captured scene image data. One approach to addressing both these problems is to shift some of the content analysis work onto processors within the cameras themselves. This approach allows each 25 camera to perform object detection on the scene the camera is capturing. The object data deduced by the camera can be transmitted via the same network as the video image data, allowing subsequent analysis by a downstream system to benefit from the results computed by the camera and avoid doing some of the object detection itself, thus lowering the computational expense of the downstream system by distributing some of the work. 1908242_1.DOC 885503 spe i -2 Transmitting object data deduced by the cameras increases the bandwidth required. In systems where the video data is compressed, the object data can make up a significant portion of the total bandwidth used by the system. A number of approaches have been explored to reduce the amount of bandwidth 5 required to transmit object data over the network. One approach is to use a video coding system to describe detected objects, MPEG-4, for example, allows the transmission of separate moving objects that are composite on top of a non-moving background by the decoder. One drawback of such a system is that, since robust video object detection is more 10 computationally expensive than most video compression methods, this can slow down the video coding process. Another drawback is that this method does not cope well with video where object presence is not closely connected to the appearance of the scene. An example of such video is when a moving object in a scene changes appearance, or when the appearance of the background changes in a way that is not considered to correspond to an is object. Further, this method requires the system to transmit object data at the same frame rate as video, which is not ideal, or even suitable, in all circumstances. A different approach is to transmit metadata for each object separately to the video data, either in a separate data stream, or embedded in the video stream in a way that does not affect the appearance or transmission of the video. This approach avoids the issues 20 described above, but means that object metadata contributes significantly to the bandwidth requirements of the system. This is of particular concern considering that those scenes with the highest bandwidth requirements for objects, such as scenes containing several foreground objects with complex metadata, are also the most important scenes to transmit efficiently in a surveillance system. 25 In some systems, this problem is avoided by sending only a terse description of objects, thus reducing the required bandwidth. In particular, such systems might not send a full description of the outline of each object, instead sending only information about the position and size. Clearly, this reduces the range of functionality that can be provided by the recipient of the object data to the user. 30 Thus, a need exists to provide a system capable of detecting objects in a video stream and efficiently transmitting object metadata over a network, preferably with a fixed maximum bandwidth requirement, while still including important information about each object. Such infonnation may include, for example, detailed object outlines. 1908242_LDOC 885503_speci -3 SUMMARY It is an object of the present invention to overcome substantially, or at least ameliorate, one or more disadvantages of existing arrangements. According to a first aspect of the present disclosure, there is provided a method of s transmitting object data over a communications channel, wherein the object data is derived from at least one detected object in a video frame, the method comprising the steps of: (a) encoding a foreground map derived from the video frame, wherein the foreground map is segmented into at least one element, each element being associated with an object identifier relating to the video frame; 10 (b) for at least one of the objects detected in the video frame, encoding metadata comprising at least a position of a representative element in the foreground map, wherein the object identifier associated with the representative element corresponds to the object detected in the video frame, and further wherein adj cent elements that are associated with a same object identifier define a blob; and is (c) transmitting at least the encoded foreground map and the encoded metadata, as object data, over the communications channel. According to a second aspect of the present disclosure, there is provided a method of receiving object data over a communications channel, comprising the steps of: (a) decoding a foreground map from a received bitmap; 20 (b) for at least one object, decoding metadata relating to each blob associated with the object, the metadata comprising at least a representative element in the foreground map, the representative element being within a boundary of a blob associated with the object; (c) performing a floodfill on the foreground map, starting from the 25 representative element, to determine a blob boundary of each blob associated with the object; and (d) determining a boundary of the object from each blob boundary of each blob associated with the object. According to a third aspect of the present disclosure, there is provided camera so adapted to transmit object data over a communications channel, wherein the object data is derived from at least one detected object in a video frame, the camera comprising; an image capture device for capturing an image as a video frame; 1908242 1.DOC 885503speci -4 an object detection module for processing the video frame to produce object detection results including a foreground map and metadata associated with each object detected in the video frame, wherein the foreground map is segmented into at least one element, each 5 element being associated with an object identifier relating to the video frame, and wherein the metadata includes, for each object detected in the video frame, at least a position of a representative element in the foreground map, wherein the object identifier associated with the representative element corresponds to the object detected in the video frame, and 10 father wherein adjacent elements of the foreground map that are associated with a same object identifier define a blob; a storage device for storing a computer program; and a processor for executing the program, the program comprising: code for encoding the foreground map; is code for processing at least one of the objects detected in the video frame, to encode the metadata associated with the object; and code for transmitting at least the encoded foreground map and the encoded metadata, as object data, over the communications channel. According to another aspect of the present disclosure, there is provided an apparatus 20 for implementing any one of the aforementioned methods. According to another aspect of the present disclosure, there is provided a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the aforementioned methods. Other aspects of the invention are also disclosed. 1908242_1.DOC 885503_speci -5 BRIEF DESCRIPTION OF THE DRAWINGS One or more embodiments of the invention will now be described with reference to the following drawings, in which: Figs I a and lb form a schematic block diagram of a general purpose computer s system upon which arrangements described can be practised; Fig. 2 is a schematic block diagram of an architecture according to an embodiment of the present disclosure; Figs 3a to 3c are diagrams illustrating a single video frame captured by the camera, corresponding image data, the output of the object detection module, and the 10 encoded object data; Fig. 4 is a flow diagram illustrating one embodiment of the process of adding a representative coordinate to the object metadata; Fig. 5 is a flow diagram illustrating a method of performing a 4-way floodfill; Fig. 6 is a flow diagram illustrating in more detail a method of visiting an element in is a floodfill, as used in the method of Fig. 5; Fig. 7 is a diagram illustrating an encoding that includes the bounding box of each blob; Figs Sa to 8c are diagrams illustrating alternate encodings of a foreground bitmap; Figs 9a and 9b are diagrams illustrating a further embodiment of the present 20 disclosure in which the object detection module can produce adjacent objects; and Figs 10a and lOb are diagrams illustrating a further embodiment of the present disclosure in which the object detection module can produce non-contiguous objects. 1908242_.DOC 88503.speci -6 DETAILED DESCRIPTION Where reference is made in any one or more of the accompanying drawings to steps and/or features that have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary s intention appears. A video is a sequence of images orframes. Thus, each frame is an image in an image sequence, Each frame of the video has an x axis and ay axis. A scene is the information contained in a frame and may include, for example, foreground objects, background objects, or a combination thereof. A scene model is stored information relating to a 10 background. A scene model generally relates to background information derived from an image sequence, A video may be encoded and compressed. Such encoding and compression may be performed intra-frame, such as motion-JPEG (M-JPEG), or inter frame, such as specified in the H.264 standard. An image is made up of visual elements. The visual elements may be, for example, is pixels, or 8x8 DCT (Discrete Cosine Transfonn) blocks as used in JPEG images in a motion-JPEG stream, or wavelet transforms. In the context of this patent specification, the following terms are defined. A blob is a contiguous region of a video frame which is detected as foreground. An object is a portion of a video frame that is detected to be a representation of a physical foreground 20 object in the scene, corresponding to one or more blobs and with associated metadata. A foreground map is a region segmented into one or more elements, wherein the foreground map maps onto a video frame, where each element of the foreground map is associated with an object identifier that indicates whether that element corresponds to background or identifies an object. In one embodiment, the foreground map is a two-dimensional array 25 implemented as a grid of elements. A bitmap is an array of bits (binary digits), in which each bit comprises a single element of the array. Disclosed herein is a method of transmitting video object data, derived from an object detection module, over a communications channel or network, such that a receiver of the transmitted video object data can reproduce an outline of each object detected by the so object detection module, In one embodiment of the present disclosure, an object detection module processes a video frame and provides a description of the foreground of the video frame as a 1903242 .DOC 88503jpeci -7 foreground map segmented into a plurality of elements. Each element of the foreground map is associated with an object identifier. The object identifier indicates whether the element relates to background of the video frame or a portion of foreground of the video frame relating to an object detected in the video frame by the object detection module. A s group of adjacent elements in the foreground map that are associated with a same object identifier relate to a single blob. In one implementation, the foreground map is a two-dimensional (2D) array of elements. In one implementation, each element of the foreground map corresponds to a Discrete Cosine Transform (DCT) block of the video frame. 10 As indicated above, a blob is a contiguous region of a video frame which is detected as foreground. In one implementation, it is assumed that each blob is a contiguous connected region of a video frame, and separate blobs do not touch. Accordingly, representations of blobs in a foreground map are separable by floodfidl, An object detected in a video frame can correspond to one or more blobs detected in the video frame. In a is simple case, an object corresponds to a single detected blob of foreground. In a more complex case, an object corresponds to multiple detected blobs of foreground. The object detection module also provides a metadata array derived from a video frame. The metadata array contains metadata for at least one object detected in the video frame. The metadata for each object can include, for example, mean mode age, mean 20 hitcount, a track ID (if tracking is implemented), or any combination thereof. The metadata may also include, for convenience, bounding coordinates of an object and the and the size of the object in blocks. In an alternative arrangement, the metadata array includes metadata for each blob corresponding to a single detected object. According to one embodiment of the present disclosure, there is provided a method 25 of transmitting object data over a communications channel, wherein the object data is derived from at least one detected object in a video frame. The method encodes a foreground map derived from the video frame, wherein the foreground map is segmented into at least one element, each element being associated with an object identifier relating to the video frame. The method then, for at least one of the objects detected in the video 30 frame, encodes metadata comprising at least a position of a representative element in the foreground map, wherein the object identifier associated with the representative element corresponds to the object detected in the video frame, and further wherein adjacent 1908242_1.DOC 885503_speci elements that are associated with a same object identifier define a blob. The method then transmits at least the encoded foreground map and the encoded metadata, as object data, over the communications channel. According to another embodiment of the present disclosure, there is provided a 5 method of receiving object data over a communications channel. The method decodes a foreground map from a received bitmap and then, for at least one object, decodes metadata relating to each blob associated with the object The metadata includes at least a representative element in the foreground map, wherein the representative element is within a boundary of a blob associated with the object. The method then performs a floodfill on io the foreground map, starting from the representative element, to determine a blob boundary of each blob associated with the object. The method then determines a boundary of the object from the blob boundary of each blob associated with the object. According to a further embodiment, there is provided a camera adapted to transmit object data over a communications channel, wherein said object data is derived from at 15 least one detected object in a video frame. The camera includes an image capture device for capturing an image as a video frame, and an object detection module for processing said video frame to produce object detection results including a foreground map and metadata associated with each object detected in said video frame. The foreground map is segmented into at least one element, wherein each element is associated with an object 20 identifier relating to the video frame. The metadata includes, for each object detected in said video frame, at least a position of a representative element in the foreground map, wherein the object identifier associated with said representative element corresponds to said object detected in said video frame. Adjacent elements of said foreground map that are associated with a same object identifier define a blob. The camera also includes a 25 storage device for storing a computer program and a processor for executing the program. The program includes: code for encoding the foreground map; code for processing at least one of said objects detected in said video frame, to encode said metadata associated with said object; and code for transmitting at least the encoded foreground map and the encoded metadata, as object data, over said communications channel. 30 A method in accordance with one embodiment of the present disclosure converts the foreground map to a binary bitmap, where each bit of the binary bitmap indicates whether that bit is foreground or background. For each blob, the method adds a position of a 1903242 1.DOC 885503_speci -9 representative element to the metadata associated with that blob. The representative element represents an element inside the blob. In the example in which the foreground map is a two-dimensional array, the position of a representative element is provided by an (x, y) Cartesian coordinate pair, formed of a first ordinate and a second ordinate, that is 5 added to the metadata for the blob in question. The method transmits the bitmap and the metadata array, as object data, to a client, In one embodiment, the method transmits the object data in conjunction with the corresponding video frame. The client may be, for example, an external computing device. The client receives the transmitted bitmap and metadata array and, to determine the 10 outline of a blob, performs a floodfill on the bitmap. The client starts the floodfill at the representative element associated with the blob, as retrieved from the transmitted metadata array. In one embodiment, the representative element of each blob is chosen from a topmost row of the respective blob. In the embodiment in which the transmitted metadata array is contains bounding coordinates for each blob, only the representative x coordinate needs to be added to the metadata for each blob. For example, if a bounding box is represented by bounding coordinates (x1, yl)-(x2, y2), the representative coordinate is (x, y]), and x is added to the metadata for the relevant blob. In one embodiment, the method attempts to compress the bitmap in a lossless 20 manner. The method may utilise, for example, quad-tree encoding, or rn-length encoding. The method may optionally transmit a value indicating the compression type used, so that, if the compression would make the encoded size larger, the method reverts to transmitting an uncompressed bitmap, Optionally, instead of a bitmap representing whether each corresponding block of a 25 video frame is foreground or background, one embodiment transmits a bitmap representing whether each horizontal or vertical edge between blocks of the video frame is on the boundary of a blob. This embodiment is useful if blobs produced by the video object detection module are not guaranteed to be separable by floodfill. That is, such an embodiment is useful when processing a video image in which different blobs can touch. ao In one implementation, video frames are transmitted as JPEG images, and object data relating to a video frame is embedded in a corresponding transmitted JPEG image frame as an application-specific header segment. 1908242_1.DoC 885503PMeci -10 Figs 1 a and lb collectively form a schematic block diagram of a general purpose computer system 100, upon which the various arrangements described can be practised. In one implementation, the general purpose computer system 100 is coupled to a camera to form a video camera on which the various arrangements described are practised. In another 5 implementation, one instance of the general purpose computer system 100 is an external computing device that receives data from a camera and encodes a foreground map and metadata for transmission as object data over a communications channel. As seen in Fig. la, the computer system 100 is formed by a computer module 101, input devices such as a keyboard 102, a mouse pointer device 103, a scanner 126, a io camera 127, and a microphone 180, and output devices including a printer 115, a display device 114 and loudspeakers 117. An external Modulator-Demodulator (Modem) transceiver device 116 may be used by the computer module 101 for communicating to and from a communications network 120 via a connection 121. The network 120 may be a wide-area network (WAN), such as the Internet or a private WAN. Where the connection u 121 is a telephone line, the modem 116 may be a traditional "dial-up" modem. Alternatively, where the connection 121 is a high capacity (e.g., cable) connection, the modem 116 may be a broadband modem. A wireless modem may also be used for wireless connection to the network 120. The computer module 101 typically includes at least one processor unit 105, and a 20 memory unit 106 for example formed from semiconductor random access memory (RAM) and semiconductor read only memory (ROM), The module 101 also includes an number of input/output (1/0) interfaces including an audio-video interface 107 that couples to the video display 114, loudspeakers 117 and microphone 180, an 1/0 interface 113 fbr the keyboard 102, mouse 103, scanner 126, camera 127 and optionally a joystick (not 25 illustrated), and an interface 108 for the external modem 116 and printer 115. In some implementations, the modem 116 may be incorporated within the computer module 101, for example within the interface 108. The computer module 101 also has a local network interface 111 which, via a connection 123, permits coupling of the computer system 100 to a local computer network 122, known as a Local Area Network (LAN). As also illustrated, 30 the local network 122 may also couple to the wide network 120 via a connection 124, which would typically include a so-called "firewall" device or device of similar 1908242I .DOC 885503_speci - II functionality, The interface 111 may be formed by an EthernetTm circuit card, a BluetoothM wireless arrangement or an IEEE 802.11 wireless arrangement. The interfaces 108 and 113 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus s (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 109 are provided and typically include a hard disk drive (HDD) 110. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 112 is typically provided to act as a non-volatile source of data. Portable memory devices, such optical disks (e.g., CD-ROM, DVD), USB-RAM, and 10 floppy disks for example may then be used as appropriate sources of data to the system 100. The components 105 to 113 of the computer module 101 typically communicate via an interconnected bus 104 and in a manner which results in a conventional mode of operation of the computer system 100 known to those in the relevant art. Examples of is computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations, Apple MacJ" or alike computer systems evolved therefrom. The method of transmitting object data over a communications channel may be implemented using the computer system 100 wherein the processes of Figs 2 to 10 may be 20 implemented as one or more software application programs 133 executable within the computer system 100, In particular, the steps of the method of transmitting object data over a communications channel are effected by instructions 131 in the software 133 that are carried out within the computer system 100. The software instructions 131 may be formed as one or more code modules, each for performing one or more particular tasks. The 25 software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the encoding methods and a second part and the corresponding code modules manage a user interface between the first part and the user. The software 133 is generally loaded into the computer system 100 from a computer readable medium, and is then typically stored in the HDD 110, as illustrated in Fig, I a, or 30 the memory 106, after which the software 133 can be executed by the computer system 100. In some instances, the application programs 133 may be supplied to the user encoded on one or more CD-ROM 125 and read via the corresponding drive 112 prior to 1908242_1.DOC 885503_peci -12 storage in the memory 110 or 106. Alternatively the software 133 may be read by the computer system 100 from the networks 120 or 122 or loaded into the computer system 100 from other computer readable media. Computer readable storage media refers to any storage medium that participates in providing instructions and/or data to the s computer system 100 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 101. Examples of computer readable transmission media that may also io participate in the provision of software, application programs, instructions and/or data to the computer module 101 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like. The second part of the application programs 133 and the corresponding code modules is mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 114. Through manipulation of typically the keyboard 102 and the mouse 103, a user of the computer system 100 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with 2o the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 117 and user voice commands input via the microphone 180. Fig. lb is a detailed schematic block diagram of the processor 105 and a "memory" 134,. The memory 134 represents a logical aggregation of all the memory 25 devices (including the HDD 110 and semiconductor memory 106) that can be accessed by the computer module 101 in Fig. La. When the computer module 101 is initially powered up, a power-on self-test (POST) program 150 executes. The POST program 150 is typically stored in a ROM 149 of the semiconductor memory 106. A program permanently stored in a hardware device such as so the ROM 149 is sometimes referred to as firmware. The POST program 150 examines hardware within the computer module 101 to ensure proper functioning, and typically checks the processor 105, the memory (109, 106), and a basic input-output systems 1908242LDOC 885503_speci 13 software (10S) module 151, also typically stored in the ROM 149, for correct operation. Once the POST program 150 has run successfully, the BIOS 151 activates the hard disk drive 110. Activation of the hard disk drive 110 causes a bootstrap loader program 152 that is resident on the hard disk drive 110 to execute via the processor 105. This loads an 5 operating system 153 into the RAM memory 106 upon which the operating system 153 commences operation. The operating system 153 is a system level application, executable by the processor 105, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface. 10 The operating system 153 manages the memory (109, 106) in order to ensure that each process or application running on the computer module 101 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 100 must be used properly so that each process can run effectively. Accordingly, the aggregated memory 134 15 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 100 and how such is used. The processor 105 includes a number of functional modules including a control unit 139, an arithmetic logic unit (ALU) 140, and a local or internal memory 148, 20 sometimes called a cache memory. The cache memory 148 typically includes a number of storage registers 144 - 146 in a register section. One or more internal buses 141 functionally interconnect these functional modules. The processor 105 typically also has one or more interfaces 142 for communicating with external devices via the system bus 104, using a connection 118. 25 The application program 133 includes a sequence of instructions 131 that may include conditional branch and loop instructions. The program 133 may also include data 132 which is used in execution of the program 133. The instructions 131 and the data 132 are stored in memory locations 128-130 and 135-137 respectively. Depending upon the relative size of the instructions 131 and the memory locations 128-130, a 30 particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 130. Alternately, an instruction may be 1908242_.DOC 885503_speel -14 segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 128-129. In general, the processor 105 is given a set of instructions which are executed therein. The processor 105 then waits for a subsequent input, to which it reacts to by executing s another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 102, 103, data received from an external source across one of the networks 120, 122, data retrieved from one of the storage devices 106, 109 or data retrieved from a storage medium 125 inserted into the corresponding reader 112. The execution of a set of the instructions may in some 10 cases result in output of data. Execution may also involve storing data or variables to the memory 134. The disclosed classification arrangements use input variables 154, that are stored in the memory 134 in corresponding memory locations 155-158. The classification arrangements produce output variables 161, that are stored in the memory 134 in 15 corresponding memory locations 162-165, Intermediate variables maybe stored in memory locations 159, 160, 166 and 167. The register section 144-146, the arithmetic logic unit (ALU) 140, and the control unit 139 of the processor 105 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the 20 instruction set making up the program 133. Each fetch, decode, and execute cycle comprises: (a) a fetch operation, which fetches or reads an instruction 131 from a. memory location 128; (b) a decode operation in which the control unit 139 determines which instruction 25 has been fetched; and (c) an execute operation in which the control unit 139 and/or the ALU 140 execute the instruction, Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 139 stores 3o or writes a value to a memory location 132. Each step or sub-process in the processes of Figs 5 to 8 and 16 to 18 is associated with one or more segments of the program 133, and is performed by the register 1908242_1LDoC 885503_speci - 15 section 144-147, the ALU 140, and the control unit 139 in the processor 105 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 133. The method of transmitting object data over a communications channel may 5 alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of encoding a foreground map, encoding metadata, and transmitting the encoded foreground map and the encoded metadata. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories. to Fig. 2 is a schematic block diagram of a system 200 in accordance with an embodiment of the present disclosure. In the system 200 illustrated in Fig. 2, components are housed within a network video camera 201. In an alternate embodiment, some or all of the computation may be performed on a general computing device. The network video camera includes an image capture device 202 coupled to an object is detection module 204. The object detection module 204 is coupled to an object encoding module 206. The image capture device 202 captures an image of a scene as a video frame, by utilising a lens system, and passes the image 203 to the object detection module 204. The object detection module 204 processes the image 203 to detect objects and thus produces object detection results 205. One method for determining an object boundary for 20 objects detected in the object detection results 205 is disclosed in United States Patent Publication No. 2008/0152236 (Vendrig et al.). Other methods for determining a visual object boundary may equally be utilised, such as pixel based object detection and user definition. The object encoding module 204 passes the object detection results 205 to an object 2s encoding module 206. The object encoding module 206 encodes object data derived from the object detection results 205 for transfer over a communications channel 207 to a receiver 208. The communications channel 207 may be a wired or wireless transmission path. The communications channel may be a standalone communications link or part of a communications network. The receiver 208 may be a client in the form of an external so computing device, for example. Depending on a particular application of the system 200, the receiver 208 may be configured to perform one or more operations on the transmitted object data. Such operations may include, for example, displaying the object data on a 1908242_1.DOC 885503speci - 16 display device, saving the object data to a storage device, performing further computation on the object data, re-transmitting the object data across the network, or any combination thereof. In one embodiment, the encoded object data, in the form of an encoded foreground s map and encoded metadata, and corresponding video data are transmitted over the communications channel 207 in a single data stream. In another embodiment, the encoded object data and video data are transmitted as separate data streams. In such an implementation, the encoded object data and video data may contain synchronisation infonnation to allow the receiver to match received object data to a corresponding video to frame. The receiver may be configured to display, store, or otherwise process the object data in conjunction with the corresponding video frame. Figs 3a to 3c illustrate in more detail the processing involved in creating and encoding object data relating to a video frame. Fig. 3a shows a single video frame 300 of a scene in which a first person 301, a second person 302, and a suitcase 303 are visible. The 1 video frame 300 is provided as input to an object detection module, not shown. The object detection module processes the video frame 300 to produce object detection results 205. In one embodiment, the object detection results 205 comprise a foreground map relating to blobs detected in the video frame 300, and metadata associated with objects detected in the video frame 300. 20 Fig. 3b shows one form of the object detection results 205 output from the object processing module 204 as a result of processing the video frame 300. The output object detection results 205 contain a foreground map 304, where each element of the foreground map 304 corresponds to an area of the video frame 300. In one embodiment, each element of the foreground map 304 corresponds to a region of 8x8 pixels in the original image, zs corresponding to a DCT block in an encoded JPEG image of the video frame 300. Such an embodiment relates to a system in which the object detection module 204 examines DCT coefficients from a JPEG-encoded video frame, In another embodiment, each element of the map 304 corresponds to one pixel, or a rectangular region of pixels, in the video frame 300. 30 Each element of the map 304 contains an object identifier relating to an object detected in the video frame 300. In this example, an object identifier "1" 305 identifies those blocks of the foreground map 304 comprising a blob 306, wherein the blob 306 190822t1DOC H85503_speci -17 relates to an object in the video frame 300, namely the first person 301 in the video frame 300. An object identifier "2" 307 identifies those blocks of the foreground map 304 comprising a blob 308, wherein the blob 308 relates to an object in the video frame 300, namely the second person 302 in the video frame 300. Similarly, an object identifier "3" s 309 identifies those blocks of the foreground map 304 comprising a blob 310, wherein the blob 310 relates to an object in the video frame 300, namely the suitcase 303 in the video frame. A reserved blob identifier, in this example "0" 318, indicates that the corresponding area of the frame 300 is background; that is, the corresponding area of the frame 300 does not correspond to a foreground blob. 10 The object detection results 205 output from the object processing module 204 also contain metadata corresponding to each detected object from the video frame 300. In this example, the metadata consists of a metadata array 311. Each metadata element of the metadata array 311 corresponds to a detected object from the video frame 300, and contains metadata relating to that object. Here, the metadata includes a tracking identifier 312, is being an identifier that is persistent across multiple video frames and which is used to identify the same physical object in the scene in earlier or later frames. The metadata in the example of Fig 3b also includes an age measure 313, which is a measure of the amount of time that this object has been visible in the scene. The metadata may also include, for example, object classification information, a 20 measure of the degree to which the physical object is stationary, a measure of confidence in the object detection results, a list of rules or conditions satisfied by the object, or any combination thereof. Additionally, the metadata may include information that can be derived from the foreground map 304, but which is provided for convenience and to avoid extra computation to re-calculate such information. This includes information such as the 2z coordinates of a minimum bounding box of a detected object the number of grid elements covered by a detected object, and the coordinates of a centroid of a detected object. In one implementation, in which an object corresponds to a single blob, each metadata element of the metadata array corresponds to a single blob. In another implementation, in which multiple blobs may be associated with a single object detected in 30 a video frame, each metadata element of the metadata array corresponds to the number of blobs associated with each particular object. 1908242_IDOC 885503speci - 18 The foreground map 304 and metadata array 311 comprise the information, or object data, to be transmitted to the receiver 208. However, in its raw form, the information relating to the foreground map 304 and the metadata array 311 requires significant bandwidth for transmission over a communication channel, such as a communications s network. In particular, the foreground map 304 requires elements of sufficient size to encode the maximum possible number of objects that can be detected in a video frame. For example, an object detection module capable of detecting 256 objects in one frame requires a grid with 8 bits per element. Compression can be utilised to reduce the size of the information to be transmitted. However, lossless compression schemes typically are not to able to guarantee a reduction in size in a worst case. Accordingly, in accordance with an embodiment of the present disclosure, the object encoding module 206 receives the foreground map 304 and metadata array 311 as input, and produces an encoded foreground map, in the form of a flattened bitmap 314, and modified, annotated metadata array 315 as encoded object data, as shown in Fig. 3c. The is flattened bitmap 314 is a grid of the same dimensions as the foreground map 304 produced by the object detection module 204. However, the flattened bitmap 314 has a single bit for each element. Each bit is set to 0 to signify that the corresponding area of the video frame 300 forms part of the background, or I to signify that the corresponding area of the video frame 300 forms part of the foreground. 20 To preserve the relationship between objects and their corresponding metadata, a representative element 316 is chosen for each blob corresponding to the objects in the scene depicted in the video frame 300. The representative element of a blob is any element of the flattened bitmap 314 that is a foreground element and part of that blob. For ease of computation, one embodiment selects the leftImost element on a top row of each blob to be 25 the representative element for that blob. The x andy coordinates 317 of the representative element 316 corresponding to each blob are added to the metadata associated with that blob, producing the modified, annotated metadata array 315. The flattened bitmap 314 and annotated metadataan-ay 315 are encoded and transmitted as encoded object data over the communications channel 207 to the receiver 30 208. In one embodiment, this encoded object data is transmitted in conjunction with an encoded form of the corresponding video frame 300. In particular, if the video frame 300 is encoded as a JPEG image, the flattened bitmap 314 and annotated metadata array 315 1902242I.DOC 885503speci - 19 may be embedded in the JPEG image as an application-specific header segment. In another embodiment, the flattened bitmap 314 and annotated metadata array 315 are transmitted separately to the encoded video data. The receiver 208 can reproduce, from the received encoded object data, a blob s corresponding to any object 306, 308, 310 by performing a floodfill on the flattened bitmap 314 starting from the representative element 317 corresponding to that blob. By performing such a floodfll for each blob, the receiver can reproduce the foreground map 304. Fig. 4 is a flow diagram that illustrates in more detail one embodiment of a method 10 400 for determining the coordinates 317 of the representative elements 316. The method 400 begins at a Start step 401, which performs initialisation, and passes control to an object detection step 402, in which object detection results, in the form of the foreground map 304 and object metadata array 311, are received from the object detection module 204. Control then passes to a first decision step 403, which checks whether any of blobs in the received is foreground map 304 remain to be processed, If at least one blob remains to be processed, Yes, control passes to a next step 404, which selects a next blob to be processed. Control then passes to a loop initialisation step 405, which initialises two variables, x andy, with the coordinates in the foreground map 304 of the left and top bounds, respectively, of the selected blob. These coordinates may 20 be calculated as part of this process. Alternatively, if the bounding box of each blob is included in the object metadata array 311, then the coordinates may be retrieved from that metadat. Control then passes to a second decision step 406, which examines an object identifier (ID) at the element of the foreground map 304 corresponding to the position (xy) 25 according to the current value of those variables, and compares the object identifier to an identifier of the current blob as selected in step 404. If the object identifier at that position does not match the identifier of the current blob, No, control passes to a loop counter increment step 407, which increments the variable x by one; and from there control loops back from step 407 to the second decision step 406. However, if instead the object 30 identifier at the position (xy) does match the identifier of the current blob, Yes, control passes from step 406 to a metadata update step 408, which adds the representative element 1908242_1.DOC 885503speci - 20 (x, y) to the metadata corresponding to the current blob; and from there control passes from step 408 back to the first decision step 403. Returning to the first decision step 403, if there are no blobs remaining to be processed, No, then control passes to an End step 409 and the method 400 terminates, 5 Fig. 5 is a flow diagram that illustrates one embodiment of a floodfill process 500 performed by the receiver 208 to reconstruct a blob corresponding to an object 306 using the flattened bitmap 314 and coordinates 317 of the representative element 316 associated with the blob. The method 500 begins at a Start step 501, which performs initialisation, and passes 1o control to a first initialisation step 502, which adds the representative element 316 associated with the blob to an empty queue containing points to be processed. Control then passes to a second initialisation step 503, which adds the representative element to a blob set containing points that are part of the blob 306 corresponding to the object. Control then passes from step 503 to a decision step 504, which checks whether there is are any elements in the queue. If at least one element is in the queue, Yes, control passes to a removal step 505, which removes an element from the queue. Control then passes to, in turn, visiting steps 506, 507, 508 and 509, which visit the element immediately to the left, to the right, above and below the point removed from the queue in step 505. This describes the implementation of a 4-way floodfill; a similar process for an 8-way floodfill would also 20 visit the element to the upper-left, upper-right, lower-left and lower-right, and would also be in-keeping with the spirit of the present disclosure. The process of visiting an element is illustrated in further detail in Fig, 6. It will be appreciated that step 506, 507, 508, and 509 can be performed in any order and two or more of the visiting steps may be performed in parallel, 25 After these visiting steps 506 - 509, control returns to the decision step 504. If at decision step 504 there are now no elements in the queue, No, then control passes to an End step 510 and the method 500 terminates. Fig. 6 is a flow diagram that illustrates a process 600 of visiting an element during a floodfill process, as in visiting steps 506, 507, 508 and 509 of Fig. 5. The method 600 3o begins at a Start step 601 and control passes to a decision step 602, which checks whether the element being analysed is within the bounds of the bitmap. If the element is not within the bounds of the bitmap - for example, if the element removed from the queue in step 505 1908242t.1DOC 885503,speci -21 was at the leftmost edge of the bitmap, and the element being visited is to the left of that element as in step 506 - then control passes to an End step 607 and the method 600 terminates. However, if at step 602 the element is within the bounds of the bitmap, then control s passes from step 602 to a decision step 603, which checks whether the element in question corresponds a foreground element - that is, whether the corresponding element of the flattened mask 314 is a "1". If the element does not correspond to a foreground element, No, - that is, if the element corresponds to a background element - then control passes to the End step 607 and the method 600 terminates. 10 If at step 603 the element does correspond to a foreground element, Yes, then control continues to a decision step 604, which checks whether the element is already part of the blob set as initialised in the second initialisation step 503 of Fig. 5. If the element is already part of the blob set, Yes, then control passes to the End step 607 and the method 600 terminates. is However, if at step 604 the element is not yet in the blob set, then control passes to an update step 605, which adds the element to the blob set, and then to step 606, which adds the element to the queue, Control then passes from step 606 to the End step 607 and the method 600 terminates. Depending on the particular application of the receiver 208, the receiver 208 may 20 perform the floodfill operation 500 for each blob in the scene, or the receiver 208 may perform this processing only on particular blob. Blobs for which the detailed object outline 306 maybe relevant to the receiver include blobs in which a user has indicated interest, or blobs to be tested for intersection with a region of interest within the scene, It may be computationally expensive for the receiver to perform the floodfill operation for every blob 25 in every frame of video. Fig. 7 is a schematic representation that illustrates a preferred way to mitigate this computational expense in some circumstances. Fig. 7 shows the flattened bitmap 314 and annotated metadata 702. For many applications, it is sufficient for the receiver to use the minimal bounding box 701 of a blob, rather than using an entire detailed blob so corresponding to the object. For example, to draw the attention of a user to an object in a scene, it may be sufficient for a user interface of the receiver to draw a rectangle around the object, rather than drawing the outline of the object based on a corresponding blob. In 19082421 DOC 88550_spCci -22 other circumstances, such as testing for intersection with a region of interest within the scene, the detailed blob data 306 is still required. The annotated object metadata 702 preferably includes coordinates 703 representing bounds 701 of a blob - that is, the minimum and maximum horizontal and vertical s coordinates of elements that comprise the blob. Since the method 400 of determining the representative element 316 of each blob finds a point on the topmost row of each blob, the minimum vertical coordinate of the bounds 703 of a blob matches that of the representative element. Thus, where the bounds 703 are included in the annotated object metadata 702, it is only necessary to encode the horizontal coordinate 704 of the representative element 10 316. Figs 8a, 8b, and 8c are schematic representations that illustrate alternate methods of encoding the flattened bitmap 314 for transmission over the network 207. The simplest encoding, as illustrated in Fig. 8a, is to send raw bits 801 comprising the flattened mask 314, wherein each bit 802 is either "0" to represent background, or "1" to represent is foreground. This method has the advantage of requiring minimal computation for both the object encoding module 206 and the receiver 208. Also, the method illustrated in Fig. Ba requires a constant amount of memory for storage, which can simplify memory management and access. The main disadvantage of this method is that it often contains a significant amount of 20 redundant information. Therefore, two alternate encodings are proposed for losslessly encoding the flattened mask 314. The first alternative, illustrated in Fig. 8b, is a quad-tree encoding. A quad-tree encoding approach represents a bitmap (flattened mask) 803 as either a square region containing a single value of "0" or "!", or divides the bitmap into four smaller regions 804. Each of the smaller regions 804, in turn, is represented as either a as square region containing a single value or four smaller regions 805, and so on. Methods of efficiently encoding a quad-tree are known in the art. A second alternative, illustrated in Fig. 8c, is run-length encoding, Run-length encoding represents a bitmap (flattened mask) 806 as a series of runs 807. Each run covers a linear series - for example, horizontally across the bitmap - of elements with a constant 30 value of "O" or "1". Each run consists of a number of elements in the run 808 and a value of the elements in the run 809. Methods of efficiently encoding a run-length representation 1908242_1DOC 885503_speci -23 are known in the art. Since each run has the opposite value to the previous run, the encoding method may not need to transmit explicitly the value of each run. Both quad-tree encoding and run-length encoding may reduce the amount of network bandwidth required to transmit the flattened bitmap 314 in scenes containing significant s contiguous areas of either background or foreground. However, both encodings can potentially increase the encoded size of the bitmap in complex scenes consisting of several small or irregularly-shaped objects. Therefore, according to one embodiment of the present disclosure, if the size of the bitmap as encoded using either the quad-tree 803 or run-length 806 approach is larger than the uncompressed bitmap 801, then the uncompressed bitmap to is transmitted instead. The system as described makes two assumptions about the output of the object detection module 204, which will now be addressed. The first assumption is that blobs 306, 308, 310 in the foreground map 304 corresponding to different objects are separable by floodfill; that is, there is always a region is of background elements between any two foreground blobs. Figs 9a and 9b are schematic representations that illustrate an output 901 of an object detection module 204 for which this assumption does not hold; that is, the object detection module 204 is capable of producing touching blobs corresponding to different objects. Fig. 9b illustrates a flattened foreground bitmap 902, being a variation of the 20 flattened foreground bitmap 314, to handle such output 901 that may contain touching blobs. Instead of representing the content of each element, each bit in the flattened foreground bitmap 902 represents, for each vertical 903 or horizontal 904 boundary between two adjacent elements, whether the element corresponds to a boundary of at least one blob. Thus, a bit is a "1" if the elements either belong to two different foreground 2s blobs 905, or one element belongs to a foreground blob and the other element belongs to the background 906. A bit is a "0" if both of the elements belong to the same blob, or both elements belong to the background. This approach allows the receiver 208 to reconstruct the blob corresponding to an object by performing a floodfill bounded by the encoded boundary elements. 30 The second assumption is that each object corresponds to a single contiguous blob of foreground elements. Figs 10a and 10b illustrate an output 1001 of an object detection 1908242_I.DOC 885503_speci -24 module for which this assumption does not hold; that is, the object detection module is capable of producing disjoint objects consisting of multiple blobs. Fig. 10b illustrates a variation 1002 on the annotated object metadata array 315 to handle such output 1001. A representative element 1003 is chosen for each blob in the s output 1001, so that an object corresponding to more than one blob has more than one representative element. The annotated object metadata 1002 contains a list of coordinates 1004 of the representative elements 1003 of the blobs corresponding to each object. This approach allows the receiver 208 to reconstruct the collection of blobs corresponding to an object by performing a floodfill at each of the representative coordinates 1004 in the 10 metadata associated with that object. INDUSTRIAL APPICABILITY The arrangements described are applicable to the computer and data processing industries and particularly for the imaging and security industries. The foregoing describes only some embodiments of the present invention, and is modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive. In the context of this specification, the word "comprising"means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of'. Variations of the word "comprising", such as "comprise" and "comprises", have 20 correspondingly varied meanings. 1908242_1.Doc 885503_speci

Claims

1. A method of transmitting object data over a communications channel, wherein said object data is derived from at least one detected object in a video frame, said method comprising the steps of: 5 (a) encoding a foreground map derived from said video frame, wherein said foreground map is segmented into at least one element, each element being associated with an object identifier relating to said video frame; (b) for at least one of said objects detected in said video frame, encoding metadata comprising at least a position of a representative element in the foreground map, 10 wherein the object identifier associated with said representative element corresponds to said object detected in said video frame, and further wherein adjacent elements that are associated with a same object identifier define a blob; and (c) transmitting at least the encoded foreground map and the encoded metadata, as object data, over said communications channel. 15

2. The method according to claim 1, wherein said object identifier relates to one of: (i) background in said video frame; or (ii) one of said objects detected in said video frame. 20

3. The method according to either one of claims 1 and 2, wherein said foreground map is segmented such that elements associated with a same object identifier define a blob, wherein blobs in said foreground map are separable by floodfill.

4. The method according to any one of claims 1 to 3, wherein the encoded foreground 25 map is encoded as a bitmap, wherein: each bit in the bitmap corresponds to an element of the foreground map; and each bit in the bitmap indicates whether the corresponding element of the foreground map is part of a foreground or a background. 30

5. The method according to claim 4, wherein the bitmap is encoded as a quad-tree. 1908242_IDOC Spe -26

6. The method according to claim 4, wherein the bitmap is encoded using a rn-length encoding.

7. The method according to claim 1, wherein the metadata in step (b) further comprises s positions in said foreground map of bounds of each blob associated with the object.

8. The method according to claim 7, wherein said positions are defined by Cartesian coordinates, the method comprising the further steps of: choosing, from coordinates of the position of the representative element, a first io ordinate to be equal to one coordinate of the bounds of each blob; and encoding the position of the representative element as a second ordinate of the coordinates of the position of the representative element.

9. The method according to claim 1, wherein the metadata in step (b) further comprises is a measure of the age of the object,

10. The method according to claim 1, wherein the metadata in step (b) fiuther comprises a measure of a degree to which the object is stationary. 20

11. The method according to claim 1, comprising the further step of: for at least one of said objects detected in said video frame, determining a result based on whether said object matches a rule, wherein the metadata in step (b) additionally comprises the result of the determination. 25

12, The method according to claim 1, wherein the metadata in step (b) further comprises a tracking identifier.

13. The method according to claim 1, wherein the metadata in step (b) further comprises object classification data. 30 1908242_1.DOC 885503 speci - 27

14. A method of receiving object data over a communications channel, comprising the steps of: (a) decoding a foreground map from a received bitmap; (b) for at least one object, decoding metadata relating to each blob associated with 5 the object, the metadata comprising at least a representative element in the foreground map, the representative element being within a boundary of a blob associated with the object; (c) performing a floodfill on the foreground map, starting from the representative element, to determine a blob boundary of each blob associated with said object; and 10 (d) determining a boundary of said object from each said blob boundary of each blob associated with said object.

15. A camera adapted to transmit object data over a communications channel, wherein said object data is derived from at least one detected object in a video frame, said camera 15 comprising: an image capture device for capturing an image as a video frame; an object detection module for processing said video frame to produce object detection results including a foreground map and metadata associated with each object detected in said video frame, 20 wherein said foreground map is segmented into at least one element, each element being associated with an object identifier relating to said video frame, and wherein said metadata includes, for each object detected in said video frame, at least a position of a representative element in the foreground map, wherein the object identifier associated with said representative element corresponds to said object 25 detected in said video frame, and further wherein adjacent elements of said foreground map that are associated with a same object identifier define a blob; a storage device for storing a computer program; and a processor for executing the program, said program comprising: 30 code for encoding the foreground map; code for processing at least one of said objects detected in said video frame, to encode said metadata associated with said object; and 1908242_ .DOC S85503jpeci -28 code for transmitting at least the encoded foreground map and the encoded metadata, as object data, over said communications channel.

16. A method of transmitting object data over a communications channel, wherein s said object data is derived from at least one detected object in a video frame, said method being substantially as described herein with reference to the accompanying drawings.

17. A method of receiving object data over a communications channel, said method to being substantially as described herein with reference to the accompanying drawings.

18. A camera adapted to transmit object data over a communications channel, wherein said object data is derived from at least one detected object in a video frame, said camera being substantially as described herein with reference to the 15 accompanying drawings. DATED this Thirtieth Day of December, 2008 Canon Kabushiki Kaisha Patent Attomeys for the Applicant 20 SPRUSON & FERGUSON 1908242_1.DOC 885503_spcci