US9299159B2 - Systems and methods for tracking objects
Systems and methods for tracking objects
- Publication number: US9299159B2 (application US14/071,899)
- Authority
- US
- United States
- Prior art keywords
- contour
- frame
- degree
- current frame
- estimated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/2033
Definitions
- the present disclosure generally relates to video processing, and more particularly, to a system and method for tracking objects utilizing a contour weighting map.
- the process of tracking objects is commonly performed for editing purposes. For example, a user may wish to augment a video with special effects where one or more graphics are superimposed onto an object. In this regard, precise tracking of the object is important.
- challenges may arise when tracking objects, particularly as the object moves from frame to frame. This may cause, for example, the object to vary in shape and size. Additional challenges may arise when the object includes regions or elements that easily blend in with the background. This may be due to the thickness of the elements, the color make-up of the elements, and/or other attributes of the elements.
- one embodiment is a method for tracking an object in a plurality of frames, comprising obtaining a reference contour of an object in a reference frame and estimating, for a current frame after the reference frame, a contour of the object.
- the method further comprises comparing the reference contour with the estimated contour and determining at least one local region of the reference contour in the reference frame based on a difference between the reference contour and the estimated contour. Based on the difference, at least one corresponding region of the current frame is determined.
- the method further comprises computing a degree of similarity between the at least one corresponding region in the current frame and the at least one local region in the reference frame, adjusting the estimated contour in the current frame according to the degree of similarity, and designating the current frame as a new reference frame and a frame after the new reference frame as a new current frame.
- Another embodiment is a system for tracking an object in a plurality of frames, comprising a processing device.
- the system further comprises an object selector executable in the processing device for obtaining a reference contour of an object in a reference frame and a contour estimator executable in the processing device for estimating, for a current frame after the reference frame, a contour of the object.
- the system further comprises a local region analyzer executable in the processing device for: comparing the reference contour with the estimated contour, determining at least one local region of the reference contour in the reference frame based on a difference between the reference contour and the estimated contour, determining at least one corresponding region of the current frame based on the difference, and computing a degree of similarity between the at least one corresponding region in the current frame and the at least one local region in the reference frame.
- the contour estimator adjusts the estimated contour in the current frame according to the degree of similarity and designates the current frame as a new reference frame and a frame after the new reference frame as a new current frame.
- Another embodiment is a non-transitory computer-readable medium embodying a program executable in a computing device, comprising code that generates a user interface and obtains a reference contour of an object in a reference frame, code that estimates, for a current frame after the reference frame, a contour of the object, code that compares the reference contour with the estimated contour and code that determines at least one local region of the reference contour in the reference frame based on a difference between the reference contour and the estimated contour.
- the program further comprises code that determines at least one corresponding region of the current frame based on the difference, code that computes a degree of similarity between the at least one corresponding region in the current frame and the at least one local region in the reference frame, code that adjusts the estimated contour in the current frame according to the degree of similarity, and code that designates the current frame as a new reference frame and a frame after the new reference frame as a new current frame.
- FIG. 1 is a block diagram of a video editing system for facilitating object tracking in accordance with various embodiments of the present disclosure.
- FIG. 2 is a detailed view of the video editing system of FIG. 1 in accordance with various embodiments of the present disclosure.
- FIG. 3 is a top-level flowchart illustrating examples of functionality implemented as portions of the video editing system of FIG. 1 for facilitating object tracking according to various embodiments of the present disclosure.
- FIG. 4 depicts an example digital image to be processed by the video editing system of FIG. 1 in accordance with various embodiments of the present disclosure.
- FIG. 5 illustrates thin regions of an object to be tracked by the video editing system of FIG. 1 in accordance with various embodiments of the present disclosure.
- FIG. 6 illustrates the identification of local regions by the video editing system of FIG. 1 in accordance with various embodiments of the present disclosure.
- FIG. 7A illustrates selection of an object by a user using a selection tool in a first frame.
- FIGS. 7B-7E illustrate the object in succeeding frames.
- FIG. 7F illustrates modification of the object based on the estimated contour.
- FIG. 8 illustrates the refinement of an estimated contour performed by the video editing system of FIG. 1 in accordance with various embodiments of the present disclosure.
- FIG. 9A illustrates an initial video frame or reference frame with an object that the user wishes to track.
- FIG. 9B illustrates a next frame in the video sequence.
- FIG. 9C illustrates estimation of the direction of movement and the magnitude of movement in accordance with various embodiments of the present disclosure.
- FIG. 9D illustrates a resulting object contour after the shape of the object contour is modified in accordance with various embodiments of the present disclosure.
- FIG. 9E illustrates an example where the estimated contour is missing a portion of the object.
- FIG. 9F illustrates the result of a refined estimated contour in accordance with various embodiments of the present disclosure.
- FIG. 10A illustrates an initial video frame with an object that the user wishes to track.
- FIG. 10B illustrates a next frame in the video sequence.
- FIG. 10C illustrates an example of an estimated contour that erroneously includes an additional region.
- FIG. 10D illustrates identification of the additional region in accordance with various embodiments of the present disclosure.
- FIG. 10E illustrates the result of a refined estimated contour in accordance with various embodiments of the present disclosure.
- FIG. 11A illustrates an initial video frame and the object contour input by the user.
- FIG. 11B illustrates the next video frame, where local regions are used for refinement of the estimated contour in accordance with various embodiments of the present disclosure.
- FIGS. 11C and 11D illustrate an example of how the contour can change substantially due to partial occlusion of the tracked object by an individual's hand in the frame.
- FIG. 12A illustrates how the content close to a local region is shown as a pixel array for a video frame in accordance with various embodiments of the present disclosure.
- FIG. 12B illustrates the frame content for another video frame.
- FIGS. 12C and 12D illustrate an example where the local regions cannot be located precisely due to a small shift or deformation between the video frames or an error in the contour estimation.
- FIGS. 12E and 12F illustrate how a measurement technique is utilized to evaluate local regions that are slightly misaligned while still accurately identifying local regions with a low degree of similarity in accordance with various embodiments of the present disclosure.
- the process of tracking one or more objects within a video stream may be challenging, particularly when the object moves from frame to frame as the object may vary in shape and size when moving from one position/location to another. Additional challenges may arise when the object includes regions or elements that tend to blend in with the background.
- an object tracking system should accurately estimate the contour of the object as the object moves.
- the object tracking process may occasionally yield erroneous results. For example, in some cases, one or more portions of the object being tracked will not be completely surrounded by the estimated contour, which corresponds to an estimation of where and how the object is positioned. As temporal dependency exists in the object tracking process, an erroneous tracking result will, in many cases, lead to a series of erroneous results, thereby affecting the video editing process that follows.
- the user can reduce the number of erroneous results by manually refining the estimated contour on a frame-by-frame basis as needed and then allowing the tracking system to resume object tracking based on the refinements made by the user.
- the object tracking algorithm may continually yield erroneous results for the portions of the object that are difficult to track. This results in the user having to constantly refine the tracking results in order to produce an accurate, estimated contour of the object. This, of course, can be a time consuming process.
- Various embodiments are disclosed for improving the tracking of objects within an input stream of frames, particularly for objects that include elements or regions that may be difficult to track by conventional systems due to color, shape, contour, and other attributes.
- the position and contour of the object are estimated on a frame-by-frame basis. The user selects a frame in the video and manually specifies the contour of an object in the frame. As described in more detail below, for the video frames that follow, the object tracking system iteratively performs a series of operations that include refining estimated contours based on the contour in a previous frame.
- an object contour in the current video frame is received from the user and designated as a reference contour.
- An object tracking algorithm is then utilized to estimate the object contour in the next video frame, and a tracking result is generated whereby an estimated contour is derived.
- the object tracking system compares the generated tracking result with the recorded reference contour, and a “local region” corresponding to a region containing the difference in contour between the two is derived. Based on the content of the local region in the current video frame and the content of the local region in the next video frame, the object tracking system computes the similarity of the corresponding local regions between the two video frames, and refines the tracking result (i.e., the estimated contour) of the next frame according to the degree of similarity. The iterative tracking process continues until all the frames are processed or until the user stops the tracking process.
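- For illustration, the iterative process described above can be sketched as the following loop. This is a non-limiting sketch, not the patented implementation; estimate_contour, find_local_regions, region_similarity, and refine are hypothetical placeholder functions standing in for the components described below, and the similarity threshold is an assumed parameter.

```python
# Sketch of the iterative tracking loop; the four helper functions are
# hypothetical placeholders, not actual APIs from this disclosure.
def track(frames, user_contour, similarity_threshold=0.8):
    reference_frame, reference_contour = frames[0], user_contour
    results = [reference_contour]
    for current_frame in frames[1:]:
        # Estimate the object contour in the current frame.
        estimated = estimate_contour(reference_frame, current_frame, reference_contour)
        # Local regions capture where the estimate differs from the reference.
        for region in find_local_regions(reference_contour, estimated):
            similarity = region_similarity(reference_frame, current_frame, region)
            if similarity >= similarity_threshold:
                # The region is nearly static across the two frames, so the
                # estimate is refined to account for it.
                estimated = refine(estimated, region)
        results.append(estimated)
        # The refined result becomes the reference for the next iteration.
        reference_frame, reference_contour = current_frame, estimated
    return results
```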
- FIG. 1 is a block diagram of a video editing system 102 in which embodiments of the object tracking techniques disclosed herein may be implemented.
- the video editing system 102 may be embodied, for example, as a desktop computer, computer workstation, laptop, a smartphone 109 , a tablet, or other computing platform that includes a display 104 and may include such input devices as a keyboard 106 and a mouse 108 .
- where the video editing system 102 is embodied as a smartphone 109 or tablet, the user may interface with the video editing system 102 via a touchscreen interface (not shown).
- the video editing system 102 may be embodied as a video gaming console 171 , which includes a video game controller 172 for receiving user preferences.
- the video gaming console 171 may be connected to a television (not shown) or other display 104 .
- the video editing system 102 is configured to retrieve, via the media interface 112 , digital media content 115 stored on a storage medium 120 such as, by way of example and without limitation, a compact disc (CD) or a universal serial bus (USB) flash drive, wherein the digital media content 115 may then be stored locally on a hard drive of the video editing system 102 .
- the digital media content 115 may be encoded in any of a number of formats including, but not limited to, Motion Picture Experts Group (MPEG)-1, MPEG-2, MPEG-4, H.264, Third Generation Partnership Project (3GPP), 3GPP-2, Standard-Definition Video (SD-Video), High-Definition Video (HD-Video), Digital Versatile Disc (DVD) multimedia, Video Compact Disc (VCD) multimedia, High-Definition Digital Versatile Disc (HD-DVD) multimedia, Digital Television Video/High-definition Digital Television (DTV/HDTV) multimedia, Audio Video Interleave (AVI), Digital Video (DV), QuickTime (QT) file, Windows Media Video (WMV), Advanced System Format (ASF), Real Media (RM), Flash Media (FLV), an MPEG Audio Layer III (MP3), an MPEG Audio Layer II (MP2), Waveform Audio Format (WAV), Windows Media Audio (WMA), or any number of other digital formats.
- the media interface 112 in the video editing system 102 may also be configured to retrieve digital media content 115 directly from a digital camera 107 where a cable 111 or some other interface may be used for coupling the digital camera 107 to the video editing system 102 .
- the video editing system 102 may support any one of a number of common computer interfaces, such as, but not limited to IEEE-1394 High Performance Serial Bus (Firewire), USB, a serial connection, and a parallel connection.
- the digital camera 107 may also be coupled to the video editing system 102 over a wireless connection or other communication path.
- the video editing system 102 may be coupled to a network 118 such as, for example, the Internet, intranets, extranets, wide area networks (WANs), local area networks (LANs), wired networks, wireless networks, or other suitable networks, etc., or any combination of two or more such networks.
- the video editing system 102 may receive digital media content 115 from another computing system 103 .
- the video editing system 102 may access one or more video sharing websites 134 hosted on a server 137 via the network 118 to retrieve digital media content 115 .
- the object selector 114 in the video editing system 102 is configured to obtain an object contour selection from the user of the video editing system 102 , where the user is viewing and/or editing the media content 115 obtained by the media interface 112 .
- the object selection is used as a reference contour from which a local region is derived for purposes of refining subsequent contour estimations, as described in more detail below.
- the contour estimator 116 is configured to estimate a contour on a frame-by-frame basis for the object being tracked.
- the local region analyzer 119 determines a local region based on a difference between the reference contour and the estimated contour.
- a “local region” generally refers to one or more areas or regions within a given frame corresponding to a portion or element of an object that is lost or erroneously added during the tracking process.
- FIGS. 4-6 depict an object 404 (i.e., a penguin) that a user wishes to track.
- the object 404 includes various elements (e.g., the flippers) which vary in size, shape, color, etc.
- the object 404 includes various elements or regions that blend in with the background, thereby resulting in “thin” regions 502 a , 502 b due to the thin portions of the elements that are in contrast with the background of the image in the frame 402 .
- the local regions 602 a , 602 b identified by the local region analyzer 119 comprise the portion of the object that is lost (i.e., the flippers) during the tracking process.
- these local regions 602 a , 602 b are analyzed across frames to further refine or correct the contour estimation derived by the contour estimator 116 .
- the local regions 602 a , 602 b are added to an estimated contour in order to more accurately track the object 404 .
- FIG. 2 is a schematic diagram of the video editing system 102 shown in FIG. 1 .
- the video editing system 102 may be embodied in any one of a wide variety of wired and/or wireless computing devices, such as a desktop computer, portable computer, dedicated server computer, multiprocessor computing device, smartphone 109 ( FIG. 1 ), tablet computing device, and so forth.
- the video editing system 102 comprises memory 214 , a processing device 202 , a number of input/output interfaces 204 , a network interface 206 , a display 104 , a peripheral interface 211 , and mass storage 226 , wherein each of these devices is connected across a local data bus 210 .
- the processing device 202 may include any custom made or commercially available processor, a central processing unit (CPU) or an auxiliary processor among several processors associated with the video editing system 102 , a semiconductor based microprocessor (in the form of a microchip), a macroprocessor, one or more application specific integrated circuits (ASICs), a plurality of suitably configured digital logic gates, and other well known electrical configurations comprising discrete elements both individually and in various combinations to coordinate the overall operation of the computing system.
- the memory 214 can include any one of a combination of volatile memory elements (e.g., random-access memory (RAM, such as DRAM, and SRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.).
- the memory 214 typically comprises a native operating system 217 , one or more native applications, emulation systems, or emulated applications for any of a variety of operating systems and/or emulated hardware platforms, emulated operating systems, etc.
- the applications may include application specific software which may comprise some or all the components (media interface 112 , object selector 114 , contour estimator 116 , local region analyzer 119 ) of the video editing system 102 depicted in FIG. 1 .
- the components are stored in memory 214 and executed by the processing device 202 .
- the memory 214 can, and typically will, comprise other components which have been omitted for purposes of brevity.
- Input/output interfaces 204 provide any number of interfaces for the input and output of data.
- where the video editing system 102 comprises a personal computer, these components may interface with one or more user input devices via the I/O interfaces 204 , where the user input devices may comprise a keyboard 106 ( FIG. 1 ) or a mouse 108 ( FIG. 1 ).
- the display 104 may comprise a computer monitor, a plasma screen for a PC, a liquid crystal display (LCD), a touchscreen display, or other display device 104 .
- a non-transitory computer-readable medium stores programs for use by or in connection with an instruction execution system, apparatus, or device. More specific examples of a computer-readable medium may include by way of example and without limitation: a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory), and a portable compact disc read-only memory (CDROM) (optical).
- network interface 206 comprises various components used to transmit and/or receive data over a network environment.
- the network interface 206 may include a device that can communicate with both inputs and outputs, for instance, a modulator/demodulator (e.g., a modem), a wireless (e.g., radio frequency (RF)) transceiver, a telephonic interface, a bridge, a router, a network card, etc.
- the video editing system 102 may communicate with one or more computing devices via the network interface 206 over the network 118 ( FIG. 1 ).
- the video editing system 102 may further comprise mass storage 226 .
- the peripheral interface 211 supports various interfaces including, but not limited to, IEEE-1394 High Performance Serial Bus (Firewire), USB, a serial connection, and a parallel connection.
- FIG. 3 is a flowchart 300 in accordance with one embodiment for facilitating object tracking performed by the video editing system 102 of FIG. 1 . It is understood that the flowchart 300 of FIG. 3 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the video editing system 102 ( FIG. 1 ). As an alternative, the flowchart of FIG. 3 may be viewed as depicting an example of steps of a method implemented in the video editing system 102 according to one or more embodiments.
- although FIG. 3 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 3 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
- the object selector 114 ( FIG. 1 ) in the video editing system 102 obtains user input specifying a contour of an object in a current frame.
- the frame serves as a current frame for the iterative tracking process. It may comprise the first frame in a sequence of video frames or any frame that the user selects as a starting point for tracking an object.
- the user may specify the contour through any number of selection or control means such as a paint brush tool on a user interface displayed to the user.
- the user utilizes the region selection tool to specify or define the contour of the object to be tracked in a video stream.
- the tracking results may then be utilized for video editing. For example, the user may elect to adjust the color and/or brightness of the object or augment frames with the object with content from another video stream.
- the contour estimator 116 ( FIG. 1 ) records the current frame as the reference frame, and records the contour in the reference frame as the reference contour.
- in block 320 , a frame after the reference frame is selected as the current frame, and a contour of the object in the current frame is estimated.
- the frame following the reference frame is not limited to the frame immediately following the reference frame and may comprise any frame following the reference frame (e.g., the fifth frame following the reference frame).
- the iterative tracking process involves processing the video sequence by one or more frames during each iteration.
- for the first iteration, the reference contour comprises the contour defined by the user; for each subsequent iteration, the reference contour comprises the refined contour of the previously processed frame, and so on.
- the local region analyzer 119 compares the reference contour with the estimated contour. That is, the contour of the reference frame is compared to the contour of the current frame such that contours spanning two successive frames are compared. Note, however, that the various embodiments disclosed are not limited to the comparison of successive frames as the video editing system 102 may be configured to compare frames spaced farther apart.
- the local region analyzer 119 determines a local region based on a difference between the reference contour and the estimated contour.
- the object contours for a reference frame (n) and the current frame (n+1) are compared. Certain regions or elements 502 a , 502 b (i.e., the flippers) may be lost during tracking, and the missing regions are designated as local regions 602 a , 602 b for purposes of refining the estimated object contour at frame (n+1).
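- As a rough illustration, if both contours are represented as boolean masks, the missing regions can be derived as the set difference between the reference mask and the estimated mask, with each connected component treated as one local region. This is a simplified sketch under that mask representation, not the exact method of the disclosure:

```python
import numpy as np
from scipy import ndimage

def local_regions(reference_mask: np.ndarray, estimated_mask: np.ndarray):
    """Return one boolean mask per connected region that lies inside the
    reference contour but is missing from the estimated contour."""
    missing = reference_mask & ~estimated_mask      # lost during tracking
    labels, count = ndimage.label(missing)          # connected components
    return [labels == i for i in range(1, count + 1)]

# Toy example: a two-pixel "flipper" missing from the estimate.
ref = np.zeros((6, 6), dtype=bool)
ref[1:5, 1:5] = True
est = ref.copy()
est[2:4, 4] = False
print([m.sum() for m in local_regions(ref, est)])   # -> [2]
```

The reverse difference (estimated_mask & ~reference_mask) identifies regions that were erroneously added rather than lost.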
- the local region analyzer 119 computes a degree of similarity between a local region 602 a , 602 b ( FIG. 6 ) in the reference frame (n) and a local region 602 a , 602 b in the current frame (n+1).
- the degree of similarity between local regions 602 a , 602 b in two frames may be calculated based on a sum of absolute difference (SAD) metric between the pixels in the corresponding local regions 602 a , 602 b .
- a low value of the sum of absolute difference indicates a large degree of similarity between the local regions 602 a , 602 b , and that the local regions 602 a , 602 b are almost static across the two frames. Based on this, an inference can be made that the object 404 itself has not moved significantly across the two frames.
- the sum of absolute difference (SAD) metric used to compute the degree of similarity is described in connection with FIGS. 12A-F .
- the frame content for another video frame is shown in FIG. 12B , where another local region 1204 is shown.
- Determination of the SAD metric comprises computing the absolute difference of pixel values for every pair of pixels and then accumulating the absolute differences as a measurement between the two regions.
- a smaller SAD value indicates a higher similarity between two regions, while a larger SAD value indicates that the two regions are different.
- in FIG. 12A and FIG. 12B , only the top-right pixels in the frames are different, and every pair of pixels inside the local regions has the same value. This leads to a zero SAD value, which denotes very high similarity between the local regions.
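- A direct per-pixel SAD over two equally sized regions can be sketched as follows; the toy arrays mirror the FIGS. 12A/12B scenario, where matching regions yield a zero SAD, and also show how a one-pixel shift inflates the plain metric:

```python
import numpy as np

def sad(region_a: np.ndarray, region_b: np.ndarray) -> int:
    """Sum of absolute differences between two equally sized pixel arrays."""
    return int(np.abs(region_a.astype(int) - region_b.astype(int)).sum())

a = np.array([[10, 10, 10],
              [10, 50, 10],
              [10, 10, 10]])
print(sad(a, a.copy()))               # 0  -> identical regions, high similarity
print(sad(a, np.roll(a, 1, axis=1)))  # 80 -> a one-pixel shift inflates plain SAD
```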
- the local regions cannot be located precisely due, for example, to a small shift or deformation between the video frames or an error in the contour estimation.
- An example of such a scenario is shown in FIG. 12C and FIG. 12D , where the local region is essentially the same as in the previous example, but where local region 1208 is slightly shifted and somewhat deformed in shape. Due to this small misalignment, the SAD value computed based on the pixel pairs becomes significantly large, thereby erroneously indicating a small degree of similarity between the local regions.
- the SAD metric is computed based on pixel pairs. For example, a pixel A in local region 1206 is matched to a corresponding pixel in the other frame. For purposes of this disclosure, the corresponding pixel in the other frame is referred to as an anchor pixel.
- the original SAD metric matches the pixel A to the anchor pixel A′, which leads to a large value of absolute difference.
- the revised SAD metric performs a local search in a small range around the anchor pixel A′ and identifies a pixel with the smallest absolute difference.
- the small range in which the local search is performed may comprise, for example, a pixel block (e.g., 3 ⁇ 3 or 5 ⁇ 5 pixel block where the anchor pixel is located at the center).
- a local search reveals that pixel B′ has the same value as anchor pixel A′ and is therefore selected for purposes of computing the absolute difference.
- a local search is performed for a plurality of pixel pairs to match a pixel in one frame to another pixel in the other frame.
- a reasonable range for the local search should be small enough to identify local regions with obviously different content while still tolerating a misalignment of the local regions by one or two pixels.
- multiple searches are performed for the regions 1206 and 1208 to compute their SAD value. Each search yields a pixel pair from one region to the other region.
- Each local search may also select a pixel with a different position relative to the anchor pixel used for the search.
- the selected pixel B′ is one pixel to the left of the anchor pixel A′, but the pixel selected in another search may be in a different position relative to its anchor pixel. This allows pixel matching between two regions where slight deformation occurs, which is typical during video tracking.
- anchor(p_i) is the anchor pixel in the video frame containing R_2.
- the anchor pixel corresponds to the pixel p_i and can be determined from the locations of the two regions in the video frames.
- S(anchor(p_i)) represents a set of pixels forming the search region around anchor(p_i), and the search is performed for each pixel q_j in the search region.
- each pixel contains a fixed number of channels, and there is a value for each channel.
- the metric corresponds to computing the absolute difference between the values of the two pixels for each channel and then accumulating the absolute differences over all channels.
- another metric may be used to represent the discrimination of pixel values, such as computing the squared differences and then accumulating the squared values.
- the pixel q_j that contributes to the summation in SAD(R_1, R_2) is the pixel which results in the minimal absolute difference within the search region.
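- Under the definitions above, the revised metric can be sketched as follows, assuming single-channel frames and a square (2·radius+1)-pixel search window centered on each anchor pixel; for each pixel p_i, the minimal absolute difference found within the window contributes to the sum. The function name and signature are illustrative, not from the disclosure:

```python
import numpy as np

def revised_sad(frame1, frame2, pixels, anchor, radius=1):
    """Revised SAD: for each (y, x) in `pixels` (coordinates in frame1),
    search a window around its anchor pixel in frame2 and accumulate the
    minimal absolute difference found. `anchor` maps a (y, x) position in
    frame1 to the corresponding (y, x) position in frame2."""
    h, w = frame2.shape
    total = 0
    for (y, x) in pixels:
        ay, ax = anchor((y, x))
        window = frame2[max(0, ay - radius):min(h, ay + radius + 1),
                        max(0, ax - radius):min(w, ax + radius + 1)]
        total += int(np.abs(window.astype(int) - int(frame1[y, x])).min())
    return total
```

With radius=0 the window collapses to the anchor pixel and the metric reduces to the original SAD; a radius of one or two pixels tolerates the small misalignments discussed above.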
- an estimated contour with the local region(s) omitted will likely be an erroneous estimate as the estimated contour differs substantially from the previously estimated contour.
- the contour estimator 116 adjusts or further refines the estimated contour.
- the contour estimator 116 may be configured to incorporate the missing local region(s) into the erroneous estimated contour as part of the refinement process.
- FIG. 8 illustrates estimated object contours across two frames (i.e., frame (n) and frame (n+1)).
- the local regions 602 a , 602 b comprise the difference between the contours in the two frames.
- the large degree of similarity between the local regions 602 a , 602 b may be determined based on a sum of absolute difference between pixels in the corresponding local regions 602 a , 602 b .
- a comparison of pixel characteristics (e.g., pixel color) may be performed on a pixel-by-pixel basis between the local regions 602 a , 602 b in each frame (frame (n) and frame (n+1)).
- the contour estimator 116 may be configured to incorporate the missing local regions 602 a , 602 b into the erroneous estimated contour of frame (n+1) as part of the refinement process, as shown in FIG. 8 .
- FIGS. 7 and 9-11 illustrate various aspects of the object tracking technique in accordance with various embodiments of the present disclosure.
- FIGS. 7A-F illustrate an example application of performing object tracking.
- the user selects or defines the contour of the object (i.e., the dog) using a selection tool, such as a brush tool, as represented by the cursor tool shown.
- the contour drawn around the object is represented by the outline surrounding the object.
- the object tracking algorithm estimates the contour of the object on a frame-by-frame basis as the object moves and as the shape of the object changes. The object tracking results across the series of frames can then be utilized for editing purposes. As illustrated in FIG. 7F , based on the estimated contour, the object may be modified (e.g., color change) without modifying any of the other regions in the frame. In this regard, accurate object tracking is needed to facilitate video editing operations.
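- As a simple illustration of such an edit, a color change can be applied only to pixels inside the tracked object mask, leaving the rest of the frame untouched. The sketch below assumes an RGB frame as a uint8 numpy array and a boolean object mask; the tint amount is arbitrary:

```python
import numpy as np

def tint_object(frame_rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Boost the red channel of the tracked object only; pixels outside
    the mask are left unmodified."""
    edited = frame_rgb.copy()
    red = edited[..., 0].astype(np.int16)
    red[mask] = np.clip(red[mask] + 60, 0, 255)
    edited[..., 0] = red.astype(np.uint8)
    return edited
```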
- the object being tracked moves or the shape of the object changes over time.
- the amount of movement tends to be fairly small within a short amount of time.
- Successive frames in a video are typically spaced apart by approximately 1/30th of a second.
- the rate of change is relatively small on a frame-by-frame basis.
- FIGS. 9A-F further illustrate the refinement operation of an estimated contour in accordance with various embodiments, where the difference between video frames is analyzed.
- FIG. 9A depicts an initial video frame or reference frame (frame (n)) with an object 902 that the user wishes to track.
- the bold line around the object 902 to be tracked represents an object contour 904 specified by the user using, for example, a paint brush tool or other selection tool via a user interface displayed to the user.
- FIG. 9B depicts the next frame (e.g., frame (n+1)) in the video sequence.
- the direction of movement and the magnitude of movement are estimated, as illustrated in FIG. 9C , where the arrows represent the direction and magnitude of movement by the object.
- the shape of the object contour 904 is warped or modified where the resulting object contour 906 is shown in FIG. 9D .
- motion estimation may be performed on all the pixels in the entire frame and not just on those pixels within the object contour 904 .
- the frame may be divided into blocks where motion estimation is then performed on each block.
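- One common way to carry out this block-based motion estimation is exhaustive block matching, sketched below for grayscale frames; each block in frame (n) is compared against nearby displacements in frame (n+1), and the offset with the lowest SAD cost becomes its motion vector. This is generic block matching under assumed block and search sizes, not necessarily the estimator used in the disclosure:

```python
import numpy as np

def block_motion(frame_a, frame_b, block=8, search=4):
    """Return an array of (dy, dx) motion vectors, one per block."""
    h, w = frame_a.shape
    vectors = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            ref = frame_a[y:y + block, x:x + block].astype(int)
            best_cost, best_v = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        cand = frame_b[yy:yy + block, xx:xx + block].astype(int)
                        cost = np.abs(ref - cand).sum()   # SAD cost
                        if best_cost is None or cost < best_cost:
                            best_cost, best_v = cost, (dy, dx)
            vectors[by, bx] = best_v
    return vectors
```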
- the object tracking algorithm loses track of one or more portions/regions of the object 902 .
- the estimated contour 907 is missing the tail and the feet of the tiger (the object 902 being tracked).
- the modified contour 906 in FIG. 9D , rather than the initial contour 904 in FIG. 9A specified by the user, is used as the reference contour in the comparison for purposes of identifying the one or more local regions, as the modified contour 906 provides a better estimate of the object shape in the next frame by incorporating the difference between the reference frame depicted in FIG. 9A and the current frame depicted in FIG. 9E .
- the estimated movements can be used to shift the corresponding local regions in the two frames in order to more accurately track the missing regions of the object (e.g., the tail and feet of the tiger).
- a refined estimated contour 910 including the local regions 908 a , 908 b , 908 c , 908 d , 908 e is derived to provide a more accurate estimated object contour.
- supplementing an erroneous contour estimation with the local region(s) comprises performing a union operation or determination on the estimated contour and the local region to merge the two into a larger region.
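- With contours represented as boolean masks, this union (and its counterpart for removing an erroneously included region, discussed below) reduces to elementwise mask operations. A minimal sketch under that representation:

```python
import numpy as np

def refine_contour(estimated_mask: np.ndarray,
                   local_region: np.ndarray,
                   region_is_missing: bool) -> np.ndarray:
    """Merge a missing local region back into the estimate (union), or
    strip an erroneously included one (set difference)."""
    if region_is_missing:
        return estimated_mask | local_region    # add the lost region back
    return estimated_mask & ~local_region       # remove the erroneous region
```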
- FIG. 10A depicts an initial video frame (frame (n)) with an object 1002 that the user wishes to track.
- the bold line around the object 1002 to be tracked represents an object contour 1004 specified by the user using, for example, a paint brush tool or other selection tool via a user interface displayed to the user.
- FIG. 10B depicts the next frame (e.g., frame (n+1)) in the video sequence. Again, for every region of the object, the direction of movement and the magnitude of movement are estimated. Based on motion estimation, the shape of the object contour 1004 is warped or modified where the resulting object contour 1006 is shown in FIG. 10B . Note that for some embodiments, motion estimation may be performed on all the pixels in the entire frame and not just on those pixels within the object contour 1004 . For such embodiments, the frame may be divided into blocks where motion estimation is then performed on each block.
- a region 1007 is the tracking result for the frame, and a part of region 1008 is erroneously included in the result even though it was not within the estimated contour 1006 .
- the refinement method identifies this additional region as a local region 1010 ( FIG. 10D ) and removes the erroneous region from the estimated contour to generate a refined estimated contour, as shown in FIG. 10E .
- information from motion estimation may be utilized to improve the accuracy in removing erroneous regions.
- certain restrictions may be implemented during the object tracking process disclosed in order to further enhance the accuracy of generating an estimated contour.
- a major assumption is that the previous tracking result contains an accurate estimation of the contour. Based on this assumption, the estimated contour may be further refined on a frame-by-frame basis.
- the contour of the object may change substantially, thereby resulting in erroneous adjustments made based on an erroneous contour.
- comparison of attributes other than the local regions may also be used, where such attributes include, for example, the color of the object and the color of the background. If the color of the region is close to the background color, then refining the estimated contour using this region may lead to an erroneous refinement due to the color of the local region matching the color of the background. As such, by utilizing other comparisons, the refinement process may be improved.
- FIGS. 11A-D illustrate how the object contour may change substantially over time.
- the initial video frame and the object contour 1102 input by the user are shown in FIG. 11A .
- FIG. 11B depicts the next video frame, where the two local regions 1106 a , 1106 b are used for refinement of the estimated contour 1104 , as described herein.
- a comparison of the local regions of the two frames ( FIGS. 11A and 11B ) reveals that the local regions have an intermediate similarity in the video frames.
- the reference contour is exactly the original contour 1102 specified by the user in FIG. 11A , and this implies the highest degree of agreement between the original contour and the reference contour. Accordingly, looser restrictions may be applied during the refinement process.
- the contour can change substantially due, for example, to partial occlusion of the tracked object by an individual's hand in the frame.
- a comparison of the local regions between the frames in FIGS. 11C and 11D reveals a difference represented by the region 1108 shown in FIG. 11D .
- a comparison of the reference contour 1110 in FIG. 11C with the original contour 1102 specified by the user in FIG. 11A reveals that the contour 1110 has changed substantially over time during the tracking process.
- stricter restrictions may be applied to the threshold of the similarity in order to avoid erroneously refining the estimated contour using regions that are not part of the tracked object (e.g., the individual's hand).
- the similarity is not high enough to pass the stricter restrictions, so it will not be used to refine the contour.
- the original contour shape 1102 specified by the user is compared to the reference contour 1110 by calculating a degree of similarity between the original contour shape 1102 and the reference contour 1110 to determine whether the two are substantially similar. If the reference contour 1110 is substantially similar to the original contour 1102 specified by the user, then looser restrictions are applied, otherwise stricter restrictions are applied.
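- One way to realize the looser and stricter restrictions is to make the similarity threshold a function of how far the reference contour has drifted from the user's original contour, as in the sketch below. The specific threshold values are assumptions for illustration, not taken from the disclosure:

```python
def refinement_threshold(contour_drift: float,
                         loose: float = 0.6,
                         strict: float = 0.9) -> float:
    """contour_drift in [0, 1]: 0 means the reference contour still matches
    the user's original contour; 1 means it has changed entirely. Greater
    drift demands a stricter similarity before a region is used to refine."""
    return loose + (strict - loose) * contour_drift

# Reference contour unchanged   -> threshold 0.6 (looser restriction).
# Reference contour far drifted -> threshold approaches 0.9 (stricter).
```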
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
Description
$$\mathrm{SAD}(R_1,R_2)=\sum_{p_i\in P_1}\ \min_{q_j\in S(\mathrm{anchor}(p_i))} D\bigl(v(p_i),v(q_j)\bigr)$$

where R_1, R_2 are the two regions and P_1 is a set of pixels, which can be all pixels or a subset of the pixels in R_1. For each pixel p_i in P_1, anchor(p_i) is the anchor pixel in the video frame containing R_2. The anchor pixel corresponds to the pixel p_i and can be determined from the locations of the two regions in the video frames. S(anchor(p_i)) represents a set of pixels forming the search region around anchor(p_i), and the search is performed for each pixel q_j in the search region. The values of pixels p_i and q_j are represented as v(p_i) = {v_1(p_i), ..., v_n(p_i)} and v(q_j) = {v_1(q_j), ..., v_n(q_j)}, and D(v(p_i), v(q_j)) is a metric for computing the difference of the values, for example:

$$D(v(p_i),v(q_j))=\sum_{k=1}^{n}\lVert v_k(p_i)-v_k(q_j)\rVert,$$

$$D(v(p_i),v(q_j))=\sum_{k=1}^{n}\bigl(v_k(p_i)-v_k(q_j)\bigr)^2,\quad\text{or}$$

$$D(v(p_i),v(q_j))=\sqrt{\sum_{k=1}^{n}\bigl(v_k(p_i)-v_k(q_j)\bigr)^2},$$

where ‖x‖ is the absolute value of x. The first metric corresponds to computing the absolute difference between the values of the two pixels for each channel and then accumulating the absolute differences over all channels. In some cases, another metric may be used to represent the discrimination of pixel values, such as computing the squared differences and then accumulating the squared values. The pixel q_j that contributes to the summation in SAD(R_1, R_2) is the pixel which results in the minimal absolute difference within the search region. By leveraging this revised SAD technique, the SAD value computed from slightly misaligned local regions such as those of FIGS. 12C and 12D remains small, correctly indicating a high degree of similarity.
Claims (33)
$$\mathrm{SAD}(R_1,R_2)=\sum_{p_i\in P_1}\ \min_{q_j\in S(\mathrm{anchor}(p_i))} D\bigl(v(p_i),v(q_j)\bigr),$$

where the difference metric D is recited as one of:

$$D(v(p_i),v(q_j))=\sum_{k=1}^{n}\lVert v_k(p_i)-v_k(q_j)\rVert,$$

$$D(v(p_i),v(q_j))=\sum_{k=1}^{n}\bigl(v_k(p_i)-v_k(q_j)\bigr)^2,\quad\text{or}$$

$$D(v(p_i),v(q_j))=\sqrt{\sum_{k=1}^{n}\bigl(v_k(p_i)-v_k(q_j)\bigr)^2}.$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/071,899 US9299159B2 (en) | 2012-11-09 | 2013-11-05 | Systems and methods for tracking objects |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201261724389P | 2012-11-09 | 2012-11-09 | |
US14/071,899 US9299159B2 (en) | 2012-11-09 | 2013-11-05 | Systems and methods for tracking objects |
Publications (2)
Publication Number | Publication Date |
---|---|
US20140133701A1 US20140133701A1 (en) | 2014-05-15 |
US9299159B2 true US9299159B2 (en) | 2016-03-29 |
Family
ID=50681722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/071,899 Active 2034-05-30 US9299159B2 (en) | 2012-11-09 | 2013-11-05 | Systems and methods for tracking objects |
Country Status (1)
Country | Link |
---|---|
US (1) | US9299159B2 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9389767B2 (en) | 2013-08-30 | 2016-07-12 | Cyberlink Corp. | Systems and methods for object tracking based on user refinement input |
KR101865766B1 (en) * | 2016-10-11 | 2018-06-11 | 주식회사 피엘케이 테크놀로지 | Moving objects collision warning apparatus and method for large vehicle |
KR101955506B1 (en) * | 2016-12-12 | 2019-03-11 | 주식회사 피엘케이 테크놀로지 | Side safety assistant device and method for large vehicle by using two opposite cameras |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5940538A (en) | 1995-08-04 | 1999-08-17 | Spiegel; Ehud | Apparatus and methods for object border tracking |
US7142600B1 (en) | 2003-01-11 | 2006-11-28 | Neomagic Corp. | Occlusion/disocclusion detection using K-means clustering near object boundary with comparison of average motion of clusters to object and background motions |
US7164718B2 (en) | 2000-09-07 | 2007-01-16 | France Telecom | Method for segmenting a video image into elementary objects |
US20090324012A1 (en) * | 2008-05-23 | 2009-12-31 | Siemens Corporate Research, Inc. | System and method for contour tracking in cardiac phase contrast flow mr images |
US20100158378A1 (en) * | 2008-12-23 | 2010-06-24 | National Chiao Tung University | Method for image processing |
US20110291925A1 (en) * | 2009-02-02 | 2011-12-01 | Eyesight Mobile Technologies Ltd. | System and method for object recognition and tracking in a video stream |
- 2013-11-05: US application US14/071,899 filed; granted as US9299159B2 (status: active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5940538A (en) | 1995-08-04 | 1999-08-17 | Spiegel; Ehud | Apparatus and methods for object border tracking |
US7164718B2 (en) | 2000-09-07 | 2007-01-16 | France Telecom | Method for segmenting a video image into elementary objects |
US7142600B1 (en) | 2003-01-11 | 2006-11-28 | Neomagic Corp. | Occlusion/disocclusion detection using K-means clustering near object boundary with comparison of average motion of clusters to object and background motions |
US20090324012A1 (en) * | 2008-05-23 | 2009-12-31 | Siemens Corporate Research, Inc. | System and method for contour tracking in cardiac phase contrast flow mr images |
US20100158378A1 (en) * | 2008-12-23 | 2010-06-24 | National Chiao Tung University | Method for image processing |
US20110291925A1 (en) * | 2009-02-02 | 2011-12-01 | Eyesight Mobile Technologies Ltd. | System and method for object recognition and tracking in a video stream |
Non-Patent Citations (3)
Title |
---|
Chiueh et al. "Zodiac: A history-based interactive video authoring system" Multimedia Systems 8: 201-211 (2000). |
Daras et al. "MPEG-4 Authoring Tool Using Moving Object Segmentation and Tracking in Video Shots" EURASIP Journal on Applied Signal Processing 2003:9, 861-877, Nov. 22, 2002. |
Singh et al. "Annotation Supported Contour Based Object Tracking with Frame Based Error Analysis" 2011 3rd International Conference on Machine Learning and Computing (ICMLC 2011). |
Also Published As
Publication number | Publication date |
---|---|
US20140133701A1 (en) | 2014-05-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8971575B2 (en) | Systems and methods for tracking objects | |
US20230077355A1 (en) | Tracker assisted image capture | |
US10404917B2 (en) | One-pass video stabilization | |
US9240056B2 (en) | Video retargeting | |
US9389767B2 (en) | Systems and methods for object tracking based on user refinement input | |
EP2180695B1 (en) | Apparatus and method for improving frame rate using motion trajectory | |
US9014474B2 (en) | Systems and methods for multi-resolution inpainting | |
US20180082428A1 (en) | Use of motion information in video data to track fast moving objects | |
CN110691259B (en) | Video playing method, system, device, electronic equipment and storage medium | |
US9336583B2 (en) | Systems and methods for image editing | |
US9672866B2 (en) | Automated looping video creation | |
US8879894B2 (en) | Pixel analysis and frame alignment for background frames | |
US9836180B2 (en) | Systems and methods for performing content aware video editing | |
US20100066914A1 (en) | Frame interpolation device and method, and storage medium | |
US8494216B2 (en) | Image processing device and image processing method and program | |
US9299159B2 (en) | Systems and methods for tracking objects | |
US20210407105A1 (en) | Motion estimation method, chip, electronic device, and storage medium | |
US20070248243A1 (en) | Device and method of detecting gradual shot transition in moving picture | |
US20190347510A1 (en) | Systems and Methods for Performing Facial Alignment for Facial Feature Detection | |
US8463037B2 (en) | Detection of low contrast for image processing | |
US10685213B2 (en) | Systems and methods for tracking facial features | |
WO2023179342A1 (en) | Relocalization method and related device | |
US20240062484A1 (en) | Systems and methods for rendering an augmented reality object with adaptive zoom feature | |
US20230293045A1 (en) | Systems and methods for contactless estimation of wrist size |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: CYBERLINK CORP., TAIWAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MA, CHIH-CHAO;REEL/FRAME:031549/0865 Effective date: 20131105 |
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |