WO2024049513A1 - Methods and apparatus for forecasting collisions using egocentric video data - Google Patents

Methods and apparatus for forecasting collisions using egocentric video data

Info

Publication number
WO2024049513A1
Authority
WO
WIPO (PCT)
Prior art keywords
coordinate system
response
image data
moving object
processor
Application number
PCT/US2023/020232
Other languages
French (fr)
Inventor
Wentao BAO
Lele Chen
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Application filed by Innopeak Technology, Inc.
Publication of WO2024049513A1

Classifications

    • G06V 20/20 - Scenes; Scene-specific elements in augmented reality scenes
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/225 - Image preprocessing by selection of a specific region containing or referencing a pattern, based on a marking or identifier characterising the area
    • G06V 10/62 - Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V 20/64 - Three-dimensional objects
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06T 2207/10028 - Range image; Depth image; 3D point clouds
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/30241 - Trajectory

Definitions

  • processing unit 110 may also predict whether landmark 130 will interact with other objects (e.g. flower vase 122) within scene 108, step 228.
  • processing unit 110 can give a probability whether landmark 130 and flower vase 122 will have the same 3D position within the global coordinate system, i.e. what is the probability or confidence level that left hand 120 will grab flower vase 122 in the future?
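  • As an illustration of step 228, a grab / collision probability could be computed from the predicted trajectory and the stationary object's global position, assuming the prediction error is modeled as an isotropic Gaussian whose standard deviation grows with the forecast horizon. The probability model and the numeric defaults in the sketch below are assumptions, not taken from the disclosure.

```python
import math
import numpy as np

def grab_probability(pred_positions: np.ndarray, object_position: np.ndarray,
                     sigma0: float = 0.02, growth: float = 0.01,
                     radius: float = 0.05) -> float:
    """Rough probability that the predicted landmark track comes within
    `radius` metres of a stationary object.

    pred_positions: (K, 3) predicted global positions for future frames.
    Prediction uncertainty is modeled as an isotropic Gaussian whose standard
    deviation grows linearly with the forecast horizon (an assumption).
    """
    best = 0.0
    for step, p in enumerate(pred_positions):
        sigma = sigma0 + growth * step
        distance = float(np.linalg.norm(p - object_position))
        # One-dimensional Gaussian tail on the centre-to-centre distance.
        z = (distance - radius) / sigma
        best = max(best, 0.5 * math.erfc(z / math.sqrt(2.0)))
    return best
```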
  • processing system 110 may provide feedback to user 100, step 230. In some embodiments, this may be done by providing visual feedback to the user. For example, for a heads-up or VR display, a red overlay may be provided over the entire video image, over the stationary object or the like. In some embodiments, audio feedback may be given such as a higher pitch tone as a user’s hand moves closer to a stationary object, or the like. In still other embodiments, haptic feedback may be provided having a vibration intensity increase with higher probability determination.
  • FIG. 3 illustrates a functional block diagram of various embodiments of the present invention. More specifically, it is contemplated that computers (e.g. servers, laptops, streaming servers, etc.), processing units, VR headsets, etc. may be implemented with a subset or superset of the below illustrated components.
  • a computing device 300 may include some, but not necessarily all of the following components: an applications processor / microprocessor 302, memory 304, a display 306, an image acquisition device 310, audio input / output devices 312, and the like.
  • Data and communications from and to computing device 300 can be provided via a wired interface 314 (e.g. Ethernet, dock, plug, controller interface to peripheral devices); miscellaneous rf receivers, e.g. a GPS / Wi-Fi / Bluetooth / UWB interface 316; an NFC interface (e.g. antenna or coil) and driver 318; RF interfaces and drivers 320; and the like.
  • computing device 300 may be a computing device (e.g. Apple iPad, Microsoft Surface, Samsung Galaxy Note, an Android tablet); a smart phone (e.g. Apple iPhone, Google Pixel, Samsung Galaxy S); a computer (e.g. netbook, laptop, convertible); a media player (e.g. Apple iPod); a reading device (e.g. Amazon Kindle); a fitness tracker (e.g. Fitbit, Apple Watch, Garmin, or the like); a headset or glasses (e.g. Meta Quest, HTC Vive, Sony PlayStation VR, Magic Leap, Microsoft HoloLens); a wearable device (e.g. Motiv smart ring, smart headphones); an implanted device; or the like.
  • computing device 300 may include one or more processors 302.
  • processors 302 may also be termed application processors, and may include a processor core, a video/graphics core, and other cores.
  • processors 302 may include processors from Apple (A14 Bionic, A15 Bionic), NVIDIA (Tegra), Intel (Core), Qualcomm (Snapdragon), Samsung (Exynos), ARM (Cortex), MIPS technology, a microcontroller, and the like.
  • processing accelerators may also be included, e.g. an AI accelerator, a Google Tensor Processing Unit, a GPU, or the like. It is contemplated that other existing and / or later-developed processors / microcontrollers may be used in various embodiments of the present invention.
  • memory 304 may include different types of memory (including memory controllers), such as flash memory (e.g. NOR, NAND), SRAM, DDR SDRAM, or the like.
  • Memory 304 may be fixed within computing device 300 and may also include removable memory (e.g. SD, SDHC, MMC, MINI SD, MICRO SD, SIM).
  • memory 304 may store computer-executable software code (e.g. firmware, application programs), security applications, application data, operating system data, databases, or the like.
  • a secure device including secure memory and / or a secure processor is provided. It is contemplated that other existing and / or later-developed memory and memory technology may be used in various embodiments of the present invention.
  • display 306 may be based upon a variety of later-developed or current display technology, including LED or OLED displays and / or status lights; touch screen technology (e.g. resistive displays, capacitive displays, optical sensor displays, electromagnetic resonance, or the like); and the like. Additionally, display 306 may include single touch or multiple-touch sensing capability. Any later-developed or conventional output display technology may be used for embodiments of the output display, such as LED IPS, OLED, Plasma, electronic ink (e.g. electrophoretic, electrowetting, interferometric modulating), or the like. In various embodiments, the resolution of such displays and the resolution of such touch sensors may be set based upon engineering or non-engineering factors (e.g. sales, marketing).
  • display 306 may be integrated into computing device 300 or may be separate. In some embodiments, display 306 may be in virtually any size or resolution, such as a 3K resolution display, a microdisplay, one or more individual status or communication lights, e.g. LEDs, or the like.
  • acquisition device 310 may include one or more sensors, drivers, lenses and the like.
  • the sensors may be visible light, infrared, and / or UV sensitive sensors, ultrasonic sensors, or the like, that are based upon any later-developed or conventional sensor technology, such as CMOS, CCD, or the like.
  • image recognition algorithms, image processing algorithms, or other software programs for operation upon processor 302 may pair with enabled hardware to process the acquired data and provide functionality such as facial recognition, or the like.
  • acquisition device 310 may provide user input data in the form of a selfie, biometric data, or the like.
  • audio input / output 312 may include a microphone(s) / speakers.
  • voice processing and / or recognition software may be provided to applications processor 302 to enable the user to operate computing device 300 by stating voice commands.
  • audio input 312 may provide user input data in the form of a spoken word or phrase, or the like, as described above.
  • audio input / output 312 may be integrated into computing device 300 or may be separate.
  • wired interface 314 may be used to provide data or instruction transfers between computing device 300 and an external source, such as a computer, a remote server, a POS server, a local security server, a storage network, another computing device 300, an IMU, video camera, or the like.
  • Embodiments may include any later-developed or conventional physical interface / protocol, such as: USB, micro USB, mini USB, USB-C, Firewire, Apple Lightning connector, Ethernet, POTS, custom interface or dock, or the like.
  • wired interface 314 may also provide electrical power, or the like to power source 324, or the like.
  • interface 314 may utilize close physical contact of device 300 to a dock for transfer of data, magnetic power, heat energy, light energy, laser energy or the like. Additionally, software that enables communications over such networks is typically provided.
  • a wireless interface 316 may also be provided to support wireless data transfers between computing device 300 and external sources, such as computers, storage networks, headphones, microphones, cameras, IMUs, or the like.
  • wireless protocols may include Wi-Fi (e.g. IEEE 802.11 a/b/g/n, WiMAX), Bluetooth, Bluetooth Low Energy (BLE), IR, near field communication (NFC), ZigBee, Ultra-Wide Band (UWB), mesh communications, and the like.
  • GNSS (e.g. GPS) receiving capability may be provided by wireless interface 316 or by a wireless interface that is distinct from the Wi-Fi circuitry, the Bluetooth circuitry, and the like.
  • GPS receiving hardware may provide user input data in the form of current GPS coordinates, or the like, as described above.
  • RF interfaces 320 may support any future-developed or conventional radio frequency communications protocol, such as CDMA-based protocols (e.g. WCDMA), GSM-based protocols, HSUPA-based protocols, 4G, 5G, or the like.
  • various functionality is provided upon a single IC package, for example the Marvell PXA330 processor, and the like.
  • data transmissions between a smart device and the services may occur via Wi-Fi, a mesh network, 4G, 5G, or the like.
  • although the functional blocks in Fig. 3 are shown as being separate, it should be understood that the various functionality may be regrouped into different physical devices.
  • some processors 302 may include Bluetooth functionality. Additionally, some functionality need not be included in some blocks, for example, GPS functionality need not be provided in a provider server.
  • any number of future-developed, current, or custom operating systems may be supported, such as iPhone OS (e.g. iOS), Google Android, Linux, Windows, MacOS, or the like.
  • the operating system may be a multi-threaded, multi-tasking operating system. Accordingly, inputs and / or outputs from and to display 306 and inputs and / or outputs to physical sensors 322 may be processed in parallel processing threads. In other embodiments, such events or outputs may be processed serially, or the like. Inputs and outputs from other functional blocks may also be processed in parallel or serially, in other embodiments of the present invention, such as acquisition device 310 and physical sensors 322.
  • physical sensors 322 may include accelerometers, gyros, magnetometers, pressure sensors, temperature sensors, imaging sensors (e.g. blood oxygen, heartbeat, blood vessel, iris data, etc.), thermometer, otoacoustic emission (OAE) testing hardware, and the like.
  • the data from such sensors may be used to capture data associated with device 300, and a user of device 300.
  • Such data may include physical motion data, pressure data, orientation data, or the like.
  • Data captured by sensors 322 may be processed by software running upon processor 302 to determine characteristics of the user, e.g. gait, gesture performance data, or the like and used for user authentication purposes.
  • sensors 322 may also provide physical outputs, e.g. vibrations, pressures, and the like.
  • a power supply 324 may be implemented with a battery (e.g. LiPo), ultracapacitor, or the like, that provides operating electrical power to device 300.
  • any number of power generation techniques may be utilized to supplement or even replace power supply 324, such as solar power, liquid metal power generation, thermoelectric engines, rf harvesting (e.g. NFC) or the like.
  • Fig. 3 is representative of components possible for a processing device. It will be readily apparent to one of ordinary skill in the art that many other hardware and software configurations are suitable for use with the present invention. Embodiments of the present invention may include at least some but need not include all of the functional blocks illustrated in Fig. 3.
  • in some embodiments, a smart phone may serve as processing unit 110 and video camera 102. In other embodiments, processing unit 110 may include some of the functional blocks in Fig. 3, but it need not include an accelerometer or other physical sensor 322, an rf interface 320, an internal power source 324, or the like.
  • other objects, e.g. globe 124, may be used as a stationary object for purposes of the visual odometry / coordinate system transformations described above, and other objects, e.g. flower vase 122, may actually move within scene 108.
  • flower vase 122 may be sliding to the left.
  • the motion of flower vase 122 can also be determined in the same way as landmark 130: 3D determination within the local coordinate system, then transformed to 3D within the global coordinate system. The motion of landmark 130 and flower vase 122 can then be estimated within the 3D global coordinate system to predict whether they will intersect, or not.
  • processing device 110 may have to track the movements of multiple landmarks within video image data.
  • other types of visual odometry-type functionality may include determination of sub-portions of objects that are stationary. For example, sharp corners on a stationary object may be used, long horizontal or vertical portions on a stationary object may be used, and the like.
  • AI processing systems may be used to facilitate determination of various of the parameters above, such as determination of movements of one or more landmarks, the positions of stationary objects (or portions thereof) within a local coordinate system, determining the probability of a future movement path of a landmark, and the like.
  • an architecture based upon a Transformer paradigm may be used, as sketched below. It should be understood that other types of architectures, including a recurrent neural network (RNN), or the like, may also be used.
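  • A minimal PyTorch sketch of such a Transformer-based trajectory forecaster is shown below: an encoder consumes the observed global-coordinate positions and a decoder greedily emits future positions. Layer sizes, the greedy decode, and the omission of positional encodings and of the uncertainty-aware state-space machinery are simplifying assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TrajectoryTransformer(nn.Module):
    """Minimal encoder-decoder Transformer for 3D trajectory forecasting."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)      # 3D position -> token
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, 3)       # token -> 3D position

    def forward(self, observed: torch.Tensor, future_len: int) -> torch.Tensor:
        """observed: (batch, T_obs, 3) past positions in the global frame.
        Returns (batch, future_len, 3) greedily decoded future positions."""
        memory = self.transformer.encoder(self.embed(observed))
        decoded = observed[:, -1:, :]           # seed with the last observation
        for _ in range(future_len):
            out = self.transformer.decoder(self.embed(decoded), memory)
            next_pos = self.head(out[:, -1:, :])
            decoded = torch.cat([decoded, next_pos], dim=1)
        return decoded[:, 1:, :]

# Example: forecast 15 future frames from 30 observed frames.
model = TrajectoryTransformer()
predicted = model(torch.randn(1, 30, 3), future_len=15)   # shape (1, 15, 3)
```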

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A method for a system includes receiving image and depth data of a moving object and a stationary object from a camera associated with a user, determining first 3D positions of the moving object within a local coordinate system in response to the image data, wherein the local coordinate system is relative to the camera, determining first movements for the moving object within the local coordinate system in response to the first 3D positions, determining second 3D positions of the stationary object within the local coordinate system in response to the image data, determining transforms from the local to a global coordinate system in response to the second 3D positions, determining second movements for the moving object within the global coordinate system in response to the first movements and the transforms, and determining predicted movements for the moving object within the global coordinate system in response to the second movements.

Description

METHODS AND APPARATUS FOR FORECASTING COLLISIONS USING
EGOCENTRIC VIDEO DATA
CROSS-REFERENCE TO RELATED CASES
[0001] The present application is a non-provisional of and claims priority to U.S. App. No. 63/403,146, filed September 1, 2022, which is incorporated by reference for all purposes.
BACKGROUND
[0002] The present invention relates to object movement prediction. More specifically, the invention relates to predicting how a visible object will move and interact with another object when viewed from a first-person perspective, e.g. egocentric video data.
[0003] Some methods have been proposed to predict the trajectory of an object relative to a user’s point of view. Typically, such methods focus upon estimating where, within the field of view of the user, an object will move.
[0004] The inventors believe such methods have multiple drawbacks. One drawback is that they do not attempt to relate the estimated position to the real world. In other words, such predictions may only be relative to where the viewer is looking, which has little bearing on the real world. Accordingly, movement predictions cannot normally be used to determine whether there will be interaction with other real-world objects. Another drawback is that because the frame rate of the user’s point-of-view video is sometimes low, if the object moves quickly or the user turns their head quickly, the object may not appear clearly in each frame of the video. Accordingly, computed positions of the object in the video images may be inaccurate due to motion blur (e.g. high movement) of the object in the video frames. Still other drawbacks include that some solutions require data that may not be readily available in many cases. For example, some solutions require acceleration and rotation data from an inertial measurement unit (IMU) or the like, which may not be available from the user’s point of view camera.
[0005] In light of the above, what is desired are methods and apparatus that address the above problem with reduced drawbacks.
SUMMARY
[0006] The present invention relates to object movement prediction. More specifically, the invention relates to predicting how a visible object will move and interact with another object when viewed from a first-person perspective, e.g. egocentric video data.
[0007] Embodiments of the present invention describe novel methods for predicting movements of a hand or other object in 3D space using video images typically taken from a first-person perspective, e.g. egocentric video. Some examples implement an uncertainty-aware state space Transformer that includes different techniques to improve movement prediction accuracy. Additional examples include ways to label and mark videos to facilitate the prediction model.
[0008] Some embodiments include three primary processing phases: a feature encoding phase, a state transition phase, and a trajectory prediction phase. In various embodiments, the feature encoding phase includes determining three-dimensional positions, e.g. two dimensions are positions on a video image, and a third dimension is depth information, of an object (e.g. the user’s hand) as it appears on the screen. This can be done for each frame of a given video, for the first dozen frames of a given video, or the like. The state transition phase includes determining the three-dimensional positions of the object relative to a global coordinate system. In some examples, the global reference coordinate system is anchored to the first video image in a given video. The movement path of the object relative to the global coordinate system for subsequent video frames is then determined based upon the three-dimensional positions of the object. Next, based upon the determined movement in the global coordinate system, a prediction of the path of the object can be determined. Collisions may then be predicted based upon movement of the object and positions of stationary objects within the global coordinate system.
[0009] Some embodiments described herein may be performed using a Transformer encoder and Transformer decoder paradigm within a processing unit. Specifically, a Transformer encoder may be used to determine temporal dependencies (e.g. movement) of objects from incoming video data, and a Transformer decoder may be used to predict movements of objects.
[0010] According to one aspect, a method for a computing system is disclosed. One technique may include receiving, from a memory, a plurality of image data of a scene including a moving object and a stationary object, wherein the plurality of image data is derived from a video camera associated with a user, wherein the plurality of image data comprise depth information with respect to the video camera, and determining, in a processor coupled to the video camera, a first plurality of three-dimensional positions of the moving object within a local coordinate system, in response to the plurality of image data, wherein the local coordinate system is relative to the video camera. A process may include determining, in the processor, a first movement path of the moving object within the local coordinate system in response to the first plurality of three-dimensional positions within the local coordinate system, determining, in the processor, a second plurality of three-dimensional positions within the local coordinate system of the stationary object in response to the plurality of image data, and determining, in the processor, a plurality of transforms from the local coordinate system to a global coordinate system in response to the second plurality of three-dimensional positions within the local coordinate system. A method may include determining, in the processor, a second movement path of the moving object within the global coordinate system in response to the first movement path of the moving object and to the plurality of transforms, and determining, in the processor, a third movement path for the moving object within the global coordinate system in response to the second movement path of the moving object.
[0011] According to one aspect, a computing system is described. One system may include a memory configured to store a plurality of image data of a scene including a moving object and a stationary object captured by a video camera associated with a user, wherein the plurality of image data comprise depth information. A device may include a processor coupled to the memory, wherein the processor is configured to determine a first plurality of three-dimensional positions within a local coordinate system of the moving object in response to the plurality of image data, wherein the local coordinate system is relative to the video camera, wherein the processor is configured to determine a first movement path of the moving object within the local coordinate system in response to the first plurality of three-dimensional positions within the local coordinate system, wherein the processor is configured to determine a second plurality of three-dimensional positions within the local coordinate system of the stationary object in response to the plurality of image data, wherein the processor is configured to determine a plurality of transforms from the local coordinate system to a global coordinate system in response to the second plurality of three-dimensional positions within the local coordinate system and to a position of the stationary object within the global coordinate system, wherein the processor is configured to determine a second movement path of the moving object within the global coordinate system in response to the first movement path of the moving object and to the plurality of transforms from the local coordinate system to the global coordinate system, and wherein the processor is configured to determine a third movement path for the moving object within the global coordinate system in response to the second movement path of the moving object.
[0012] According to yet another aspect, a method for a computing system is disclosed. One technique may include receiving, from a memory, a plurality of image data of a scene including a moving object and a stationary object from a video camera associated with a user, wherein the plurality of image data comprise depth information, wherein the plurality of image data includes a first frame of image data, a second frame of image data, and a third frame of image data, wherein the third frame is temporally between the first frame and the second frame, determining, in a processor coupled to the memory, a first three-dimensional position within a local coordinate system of the moving object and a first pose of the moving object in response to the first frame of image data, and determining, in the processor, a second three-dimensional position within the local coordinate system of the moving object and a second pose of the moving object in response to the second frame of image data. A process may include determining, in a processor, a fourth three-dimensional position within the local coordinate system of the stationary object in response to the first frame of image data and in response to first orientation data of the video camera associated with the first frame of image data, and determining, in the processor, a fifth three-dimensional position within the local coordinate system of the stationary object in response to the second frame of image data and in response to second orientation data of the video camera associated with the second frame of image data. A method may include determining, in the processor, a first transform from the local coordinate system to a global coordinate system, in response to the fourth three-dimensional position within the local coordinate system of the stationary object and determining, in the processor, a second transform from the local coordinate system to the global coordinate system, in response to the fifth three-dimensional position within the local coordinate system of the stationary object. A process may include determining, in the processor, a sixth three-dimensional position within the global coordinate system in response to the first three-dimensional position within the local coordinate system and in response to the first transform, determining, in the processor, a seventh three-dimensional position within the global coordinate system in response to the second three-dimensional position within the local coordinate system and in response to the second transform, and determining, in the processor, an eighth three-dimensional position within the global coordinate system in response to the sixth three-dimensional position within the global coordinate system and in response to the seventh three-dimensional position within the global coordinate system.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] In order to more fully understand the present invention, reference is made to the accompanying drawings. Understanding that these drawings are not to be considered limitations in the scope of the invention, the presently described embodiments and the presently understood best mode of the invention are described with additional detail through use of the accompanying drawings in which:
[0014] Fig. 1 illustrates a functional block diagram of some embodiments of the present invention;
[0015] Figs. 2A-C illustrate a process diagram according to various embodiments of the present invention; and
[0016] Fig. 3 illustrates a system diagram according to various embodiments of the present invention.
DETAILED DESCRIPTION
[0017] Fig. 1 illustrates an overview diagram according to some embodiments of the present invention. More specifically, Fig. 1 illustrates a user 100 wearing a video camera 102. In some embodiments, video camera 102 may be a stand-alone video camera, and in other embodiments, video camera 102 may be integrated into another device 104, e.g. a virtual reality headset, a heads-up display, or the like. In various examples, video camera 102 may be worn by user 100 on their head, torso, arm, or the like.
[0018] In various embodiments, video camera 102 may include a visible light sensor 103 for capturing video images, and a depth sensor 106, e.g. an infrared sensor, or the like, for capturing depth information associated with the video images. In some embodiments, the two-dimensional resolution of visible light sensor 103 may be the same as or different from the two-dimensional resolution of depth sensor 106. In some examples, a sensor such as a Microsoft Azure Kinect DK camera, or the like, may be used.
[0019] In some examples, visible light sensor 103 may capture two-dimensional video images of a scene 108, while depth sensor 106 may capture depth information associated with the two-dimensional video images. As illustrated, video camera 102 may be coupled to a processing device 110, e.g. a host computer, a laptop computer, or the like, that stores the video image data 112, 114, 116 etc. in a memory. Typically, video image data 112, 114, 116 etc. are associated with a unique time stamp, a two-dimensional image, and depth information associated with the two-dimensional image. In the example in Fig. 1, image data 112 is associated with time T0, image data 114 is associated with T1 (T0+t), image data 116 is associated with T2 (T0+2t), and the like. In this example, at T0, video image data 112, user 100 is moving their hand 120 towards a flower vase 122, with a globe 124 in the background. At T1, video image data 114, user 100 has moved their head to the right, thus flower vase 122 appears further apart 126 from globe 124. Additionally at T1, user 100 has moved their left hand 120 closer to flower vase 122. At T2, video image data 116, user 100 has moved their head further to the right, thus flower vase 122 appears even further apart 128 from globe 124. Additionally at T2, user 100 has moved their left hand 120 even closer to flower vase 122.
[0020] As will be discussed below, processing device 110 may be programmed to perform various techniques to process movements captured by video image data 112, to predict future movements, to predict collisions, or the like.
[0021] Fig. 2 illustrates a flow diagram according to various embodiments. For sake of convenience, various of the steps below may reference the system configuration illustrated in Fig. 1, above. It should be understood that other system configurations may be used in other embodiments.
[0022] In some embodiments, video image data (e.g. 112, 114, 116, etc.) is provided to processing system 110, step 200. Video image data is typically associated with a point of view (POV) camera 102 of a user. In some cases, video camera 102 may be mounted on a chest, on a forehead, or on an arm. In other cases, video camera 102 may be a camera integrated on a virtual reality headset 104, a mixed reality headset, or the like. In various embodiments, video image data may be stored in a memory of processing unit 110.
[0023] In some embodiments, depth information 118 (z-axis data) may be captured by depth sensor 106 as part of video image data 112. The depth information, e.g. 118, typically gives, for each pixel in the depth sensor image, the distance between depth sensor 106 and objects and surfaces in scene 108. As mentioned above, the 2D image resolution of visible light sensor 103 may be the same as or different from (e.g. greater than) the 2D image resolution of depth sensor 106.
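As a non-limiting illustration of working with such depth data, the sketch below samples a depth value for a color-image pixel when the depth image has a different resolution. It assumes the two images are already registered and differ only in resolution; the function name, array layout, and the bilinear sampling choice are illustrative assumptions rather than part of the disclosure.

```python
import numpy as np

def depth_at_pixel(depth_image: np.ndarray, x: float, y: float,
                   rgb_width: int, rgb_height: int) -> float:
    """Sample the depth (e.g. in metres) corresponding to an (x, y) pixel of
    the color image when the depth image has a different 2D resolution.
    Assumes the two images are registered and differ only in resolution."""
    d_h, d_w = depth_image.shape
    # Rescale color-image coordinates into depth-image coordinates.
    u = x * (d_w - 1) / (rgb_width - 1)
    v = y * (d_h - 1) / (rgb_height - 1)
    # Bilinear interpolation between the four neighbouring depth pixels.
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, d_w - 1), min(v0 + 1, d_h - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * depth_image[v0, u0] + du * depth_image[v0, u1]
    bottom = (1 - du) * depth_image[v1, u0] + du * depth_image[v1, u1]
    return float((1 - dv) * top + dv * bottom)
```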
[0024] In some embodiments, video image data may be stored in a memory of processing unit 110, step 202. In other embodiments, video image data may be prerecorded and then provided to processing unit 110. For example, video image data may be recorded at a different location, and then provided to processing unit 110 for further processing, as described herein.
[0025] Initially, if the video image data contains a very large number of frames, processing system 110 may automatically or manually segment the video image data into shorter segments with fewer frames, step 204. In some examples, the shorter segments may be 1.5 second clips (e.g. 45 frames @ 30 frames per second), two second clips (e.g. 30 frames @ 15 frames per second), or the like. For the sake of convenience, a reduced number of frames in a segment may be used to facilitate the processes described herein. In other embodiments, processing may be performed on a greater number of frames or a different duration to facilitate more accurate landmark movement determination, as described herein.
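A minimal sketch of this segmentation step is shown below, assuming the frames are held in a Python list and using the 45-frame clip length mentioned above as a default; the helper name and the decision to drop a short trailing remainder are illustrative assumptions.

```python
def segment_frames(frames, frames_per_segment=45):
    """Split a long recording into fixed-length clips (e.g. 45 frames is
    roughly 1.5 s at 30 frames per second); a short trailing remainder is dropped."""
    return [frames[i:i + frames_per_segment]
            for i in range(0, len(frames) - frames_per_segment + 1,
                           frames_per_segment)]

# e.g. segment_frames(list_of_900_frames) -> 20 segments of 45 frames each
```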
[0026] Next, in some embodiments, a landmark 130 may be identified by processing system 110 for each frame of video image data within a segment, step 206. In some examples, the landmark may be a position of a hand 120 of user 100, an object user 100 is holding, an object user 100 is throwing or moving, or the like. As will be described below, landmark 130 is typically a point on an object whose movement processing unit 110 tracks within the segment. Further, in some embodiments, processing unit 110 will predict movement of landmark 130 after the segment.
[0027] In some cases where a hand, e.g. left hand 120, is to be tracked, the orientation of the hand may also be determined. For example, in video image data 112, hand 120 may be palm up; in video image data 114, hand 120 may be palm pointed to the right; in video image data 116, hand 120 may be palm down; etc. The orientation of the tracked object may be relevant in some embodiments.
[0028] In some cases, tracking of the landmark may be lost in some images in a segment due to quick movement of the landmark (e.g. motion blur), quick movement of video camera 102, and the like. In such cases, an initial position (and orientation) for the landmark may be determined in the first frame of the segment, e.g. T0, and the position (and orientation) for the landmark may be determined in the last frame of the segment, e.g. Tn. In these cases, if the positions and orientations for the landmark are missing or undeterminable for some frames (e.g. between T0 and Tn), interpolation techniques may be used by processing unit 110 to determine the estimated positions (and orientations) of the landmark in between, step 208. In some cases, a least-squares fit may be used to facilitate the interpolation, and in other cases, flow-warped bi-directional trajectories can be used.
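One way the interpolation of step 208 might be realized is sketched below, assuming lost detections are marked as NaN and a low-order least-squares polynomial is fit per coordinate axis; the polynomial degree and data layout are illustrative assumptions, not the specific technique of the disclosure.

```python
import numpy as np

def fill_missing_positions(times, positions, degree=2):
    """Estimate landmark positions for frames where tracking was lost.

    times:     (N,) frame indices or timestamps.
    positions: (N, D) landmark coordinates (D = 2 or 3), NaN where lost.
    Fits a least-squares polynomial per axis to the detected frames and
    evaluates it at the missing frames.
    """
    times = np.asarray(times, dtype=float)
    positions = np.asarray(positions, dtype=float)
    filled = positions.copy()
    valid = ~np.isnan(positions).any(axis=1)   # frames with a detection
    for axis in range(positions.shape[1]):
        coeffs = np.polyfit(times[valid], positions[valid, axis], deg=degree)
        filled[~valid, axis] = np.polyval(coeffs, times[~valid])
    return filled
```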
[0029] As a result of the above process, the 2D (x-y) position of the landmark in the two-dimensional video images is determined by processing unit 110 for each frame in the segment, step 210.
[0030] In some embodiments, based upon the 2D position of the landmark in the video images, a distance to the landmark can be determined by processing unit 110, step 212. More specifically, the depth sensor image taken from depth sensor 106 may be referenced by processing unit 110 to determine the distance corresponding to the x-y position in the video image. As discussed above, in some cases, the 2D resolution of the video images may be different from the 2D resolution of the depth image; accordingly, a linear interpolation may be performed to determine the depth for the 2D position of landmark 130 in the video images. As a result of the above process, a three-dimensional (x-y position, z depth) position of landmark 130 can be determined by processing unit 110 for each frame in a segment.
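One way the depth lookup of step 212 could be carried out is sketched below, with a bilinear sample of the depth map at the rescaled landmark position; the function name and the assumption that the depth map is spatially aligned with the video frame are illustrative only.

```python
# Illustrative sketch of step 212: sample the depth map at the 2D landmark position.
import numpy as np

def landmark_depth(depth_map, xy, video_size, depth_size):
    """depth_map: (H, W) array of distances; xy: (x, y) in video pixels; sizes are (W, H)."""
    u = xy[0] * depth_size[0] / video_size[0]            # landmark in depth-map pixels
    v = xy[1] * depth_size[1] / video_size[1]
    u0 = min(int(np.floor(u)), depth_map.shape[1] - 1)
    v0 = min(int(np.floor(v)), depth_map.shape[0] - 1)
    u1 = min(u0 + 1, depth_map.shape[1] - 1)
    v1 = min(v0 + 1, depth_map.shape[0] - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * depth_map[v0, u0] + du * depth_map[v0, u1]
    bottom = (1 - du) * depth_map[v1, u0] + du * depth_map[v1, u1]
    return (1 - dv) * top + dv * bottom                  # interpolated z for the landmark
```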
[0031] In various embodiments, a 3D movement of landmark 130 in the local coordinate system can be determined by processing unit 110 for the video segment, step 214. In some cases, this 3D position for landmark 130 is relative to the 3D coordinate system of video sensor 102 (a local coordinate system) - the x-y coordinates are relative to the video image of video sensor 102 at that moment in time. In other words, even though the user does not actually move left hand 120 and the objects are stationary, the 3D positions for landmark 130 and objects within scene 108 change as user 100 moves their head.
[0032] In some embodiments, the process includes translating the movement of landmark 130 from the 3D coordinate system of video sensor 102 to movement of landmark 130 in a real-world 3D coordinate system in which objects in scene 108 are stationary (a global coordinate system). In various examples, this determination includes processing unit 110 comparing adjacent frames of video image data, e.g. first video image data 112 to second video image data 114, etc., step 216. More specifically, processing unit 110 uses the positions of stationary objects, e.g. flower vase 122, within a scene, e.g. 108, to determine movement of video camera 102 between adjacent frames. Using visual odometry techniques, processing unit 110 then determines a transformation from the local coordinate system (relative to video sensor 102) to the global coordinate system (e.g. the real world), step 218. In some embodiments, the global coordinate system is determined relative to the first frame in the sequence; in other embodiments, the global frame may be determined relative to any other frame in the sequence. In various embodiments, the process may be repeated comparing the first video image data 112 to the next successive video image data, e.g. video image data 116, and the like, step 220.
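A bare-bones version of the transform estimation in step 218 is sketched below using a Kabsch (orthogonal Procrustes) alignment of matched 3D points on a stationary object. In practice a full visual odometry pipeline with feature matching and outlier rejection would be used; the point correspondences here are assumed to be given.

```python
# Illustrative sketch of step 218: estimate the rigid transform (R, t) that maps
# stationary-object points in one frame's local coordinates onto another's.
import numpy as np

def rigid_transform(points_a, points_b):
    """points_a, points_b: (N, 3) matched 3D points; returns R, t with R @ a + t ~= b."""
    ca, cb = points_a.mean(axis=0), points_b.mean(axis=0)
    H = (points_a - ca).T @ (points_b - cb)      # cross-covariance of the centered sets
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cb - R @ ca
    return R, t
```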
[0033] As a result of the above process, for each frame of data, a series of coordinate transformations between the local coordinate system (relative to camera 102) and the global coordinate system is determined, step 222. In some embodiments, this may include determining transforms between the first frame and each next frame in the sequence, determining transforms between adjacent pairs of frames, or combinations thereof. Based upon various combinations of transforms, it is expected that the local coordinate systems for each frame in the sequence can be transformed to a single global coordinate system.

[0034] In various embodiments, 3D positions / motion of landmark 130 within the global coordinate system are then determined by processing unit 110, step 224. More specifically, the 3D movement or positions of landmark 130 for each frame that were determined in step 214 within the local coordinate system are then transformed by processing unit 110 using the series of coordinate transformations. The 3D movement or positions of landmark 130 in the global coordinate system may thus be determined for each frame within the selected segment.

[0035] Subsequently, the 3D motion or position of landmark 130 within the global coordinate system is estimated by processing unit 110 after the selected segment, step 226. More specifically, using the 3D positions / motion of landmark 130 within the global coordinate system for all or a last few frames of the segment, processing unit 110 can estimate or predict where landmark 130 will be several or more frames into the future. For example, if the user is moving their hand from left to right during the segment, it may be predicted that the hand will continue moving to the right in the next frames. Of course, it is expected the prediction may have greater uncertainty for frames further after the selected segment.
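The sketch below illustrates steps 222-226: mapping the per-frame landmark positions into the global coordinate system with the per-frame transforms, then extrapolating a few frames past the segment with a simple constant-velocity (linear least-squares) fit. The linear model is an assumption for illustration; the disclosure also contemplates learned predictors, e.g. a Transformer-based architecture.

```python
# Illustrative sketch of steps 222-226: express the landmark in global coordinates
# and extrapolate its trajectory a few frames past the end of the segment.
import numpy as np

def to_global(landmark_local, transforms):
    """landmark_local: (N, 3); transforms: list of (R, t) per frame, local -> global."""
    return np.array([R @ p + t for p, (R, t) in zip(landmark_local, transforms)])

def predict_future(landmark_global, n_future):
    """Fit x(t), y(t), z(t) linearly over the segment and extrapolate n_future frames."""
    n = len(landmark_global)
    t = np.arange(n)
    coeffs = [np.polyfit(t, landmark_global[:, k], deg=1) for k in range(3)]
    t_future = np.arange(n, n + n_future)
    return np.stack([np.polyval(c, t_future) for c in coeffs], axis=1)
```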
[0036] In some embodiments, processing unit 110 may also predict whether landmark 130 will interact with other objects within scene 108, step 228. In step 218, above, when performing the visual odometry techniques, it is expected that objects, e.g. flower vase 122, will be determined to be relatively stationary within the global coordinate system. Accordingly, in this example, based upon the predicted movement for landmark 130 determined in step 226, processing unit 110 can give a probability of whether landmark 130 and flower vase 122 will have the same 3D position within the global coordinate system, i.e. a probability or confidence level that left hand 120 will grab flower vase 122 in the future.
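One simple way such a probability could be produced from the predicted trajectory is sketched below; the Gaussian falloff and its scale are assumptions made for illustration, and a learned collision-forecasting model could play the same role.

```python
# Illustrative sketch of step 228: collision confidence against a stationary object.
import numpy as np

def collision_probability(predicted_path, object_position, scale=0.05):
    """predicted_path: (M, 3) future landmark positions; object_position: (3,); scale in meters."""
    dists = np.linalg.norm(predicted_path - object_position, axis=1)
    d_min = dists.min()                                      # closest predicted approach
    return float(np.exp(-(d_min ** 2) / (2 * scale ** 2)))   # 1.0 at contact, falls off with distance
```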
[0037] In various embodiments, if landmark 130 is predicted to collide with another object, processing system 110 may provide feedback to user 100, step 230. In some embodiments, this may be done by providing visual feedback to the user. For example, for a heads-up or VR display, a red overlay may be provided over the entire video image, over the stationary object, or the like. In some embodiments, audio feedback may be given, such as a higher-pitched tone as a user's hand moves closer to a stationary object, or the like. In still other embodiments, haptic feedback may be provided, with vibration intensity increasing as the determined collision probability increases.
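As a rough illustration of step 230, the sketch below maps a collision probability onto the three feedback channels mentioned above; the thresholds and ranges are illustrative assumptions, not values taken from the disclosure.

```python
# Illustrative sketch of step 230: map collision probability to user feedback.
def feedback(probability):
    overlay_alpha = min(1.0, probability)                # opacity of a red overlay on a HUD / VR view
    tone_hz = 440.0 + 440.0 * probability                # audio pitch rises as collision becomes likely
    haptic = probability if probability > 0.5 else 0.0   # vibrate only above a confidence threshold
    return overlay_alpha, tone_hz, haptic
```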
[0038] In some cases, the process described above may be repeated for successive segments of video images / frames.

[0039] Fig. 3 illustrates a functional block diagram of various embodiments of the present invention. More specifically, it is contemplated that computers (e.g. servers, laptops, streaming servers, etc.), processing units, VR headsets, etc. may be implemented with a subset or superset of the below illustrated components.
[0040] In Fig. 3, a computing device 300 may include some, but not necessarily all, of the following components: an applications processor / microprocessor 302, memory 304, a display 306, an image acquisition device 310, audio input / output devices 312, and the like. Data and communications from and to computing device 300 can be provided via a wired interface 314 (e.g. Ethernet, dock, plug, controller interface to peripheral devices); miscellaneous rf receivers, e.g. a GPS / Wi-Fi / Bluetooth / UWB interface 316; an NFC interface (e.g. antenna or coil) and driver 318; RF interfaces and drivers 320; and the like. Also included in some embodiments are physical sensors 322 (e.g. (MEMS-based) accelerometers, gyros, magnetometers, pressure sensors, temperature sensors, bioimaging sensors, etc.).
[0041] In various embodiments, computing device 300 may be a computing device (e.g. Apple iPad, Microsoft Surface, Samsung Galaxy Note, an Android tablet); a smart phone (e.g. Apple iPhone, Google Pixel, Samsung Galaxy S); a computer (e.g. netbook, laptop, convertible); a media player (e.g. Apple iPod); a reading device (e.g. Amazon Kindle); a fitness tracker (e.g. Fitbit, Apple Watch, Garmin, or the like); a headset or glasses (e.g. Meta Quest, HTC Vive, Sony PlayStation VR, Magic Leap, Microsoft HoloLens); a wearable device (e.g. Motiv smart ring, smart headphones); an implanted device (e.g. smart medical device); a point of service (POS) device; a server; or the like. Typically, computing device 300 may include one or more processors 302. Such processors 302 may also be termed application processors, and may include a processor core, a video/graphics core, and other cores. Processors 302 may include processors from Apple (A14 Bionic, A15 Bionic), NVidia (Tegra), Intel (Core), Qualcomm (Snapdragon), Samsung (Exynos), ARM (Cortex), MIPS Technologies, a microcontroller, and the like. In some embodiments, processing accelerators may also be included, e.g. an AI accelerator, a Google Tensor Processing Unit, a GPU, or the like. It is contemplated that other existing and / or later-developed processors / microcontrollers may be used in various embodiments of the present invention.
[0042] In various embodiments, memory 304 may include different types of memory (including memory controllers), such as flash memory (e.g. NOR, NAND), SRAM, DDR SDRAM, or the like. Memory 304 may be fixed within computing device 300 and may also include removable memory (e.g. SD, SDHC, MMC, MINI SD, MICRO SD, SIM). The above are examples of computer-readable tangible media that may be used to store embodiments of the present invention, such as computer-executable software code (e.g. firmware, application programs), security applications, application data, operating system data, databases, or the like. Additionally, in some embodiments, a secure device including secure memory and / or a secure processor is provided. It is contemplated that other existing and / or later-developed memory and memory technology may be used in various embodiments of the present invention.
[0043] In various embodiments, display 306 may be based upon a variety of later-developed or current display technology, including LED or OLED displays and / or status lights; touch screen technology (e.g. resistive displays, capacitive displays, optical sensor displays, electromagnetic resonance, or the like); and the like. Additionally, display 306 may include single-touch or multiple-touch sensing capability. Any later-developed or conventional output display technology may be used for embodiments of the output display, such as LED IPS, OLED, plasma, electronic ink (e.g. electrophoretic, electrowetting, interferometric modulating), or the like. In various embodiments, the resolution of such displays and the resolution of such touch sensors may be set based upon engineering or non-engineering factors (e.g. sales, marketing). In some embodiments, display 306 may be integrated into computing device 300 or may be separate. In some embodiments, display 306 may be of virtually any size or resolution, such as a 3K-resolution display, a microdisplay, one or more individual status or communication lights, e.g. LEDs, or the like.
[0044] In some embodiments of the present invention, acquisition device 310 may include one or more sensors, drivers, lenses, and the like. The sensors may be visible light, infrared, and / or UV sensitive sensors, ultrasonic sensors, or the like, that are based upon any later-developed or conventional sensor technology, such as CMOS, CCD, or the like. In some embodiments of the present invention, image recognition algorithms, image processing algorithms, or other software programs may be provided for operation upon processor 302 to process the acquired data. For example, such software may pair with enabled hardware to provide functionality such as: facial recognition (e.g. Face ID, head tracking, camera parameter control, or the like); fingerprint capture / analysis; blood vessel capture / analysis; iris scanning capture / analysis; otoacoustic emission (OAE) profiling and matching; and the like. In additional embodiments of the present invention, acquisition device 310 may provide user input data in the form of a selfie, biometric data, or the like.
[0045] In various embodiments, audio input / output 312 may include one or more microphones and / or speakers. In various embodiments, voice processing and / or recognition software may be provided to applications processor 302 to enable the user to operate computing device 300 by stating voice commands. In various embodiments of the present invention, audio input 312 may provide user input data in the form of a spoken word or phrase, or the like, as described above. In some embodiments, audio input / output 312 may be integrated into computing device 300 or may be separate.
[0046] In various embodiments, wired interface 314 may be used to provide data or instruction transfers between computing device 300 and an external source, such as a computer, a remote server, a POS server, a local security server, a storage network, another computing device 300, an IMU, a video camera, or the like. Embodiments may include any later-developed or conventional physical interface / protocol, such as: USB, micro USB, mini USB, USB-C, Firewire, Apple Lightning connector, Ethernet, POTS, custom interface or dock, or the like. In some embodiments, wired interface 314 may also provide electrical power, or the like, to power source 324. In other embodiments, interface 314 may utilize close physical contact of device 300 to a dock for transfer of data, magnetic power, heat energy, light energy, laser energy, or the like. Additionally, software that enables communications over such networks is typically provided.
[0047] In various embodiments, a wireless interface 316 may also be provided to provide wireless data transfers between computing device 300 and external sources, such as computers, storage networks, headphones, microphones, cameras, IMUs, or the like. As illustrated in Fig. 3, wireless protocols may include Wi-Fi (e.g. IEEE 802.11 a/b/g/n, WiMAX), Bluetooth, Bluetooth Low Energy (BLE), IR, near field communication (NFC), ZigBee, Ultra-Wide Band (UWB), mesh communications, and the like.
[0048] GNSS (e.g. GPS) receiving capability may also be included in various embodiments of the present invention. As illustrated in Fig. 3, GPS functionality is included as part of wireless interface 316 merely for sake of convenience, although in implementation, such functionality may be performed by circuitry that is distinct from the Wi-Fi circuitry, the Bluetooth circuitry, and the like. In various embodiments of the present invention, GPS receiving hardware may provide user input data in the form of current GPS coordinates, or the like, as described above.
[0049] Additional wireless communications may be provided via RF interfaces in various embodiments. In various embodiments, RF interfaces 320 may support any future-developed or conventional radio frequency communications protocol, such as CDMA-based protocols (e.g. WCDMA), GSM-based protocols, HSUPA-based protocols, 4G, 5G, or the like. In some embodiments, various functionality is provided upon a single IC package, for example the Marvell PXA330 processor, and the like. As described above, data transmissions between a smart device and the services may occur via Wi-Fi, a mesh network, 4G, 5G, or the like.

[0050] Although the functional blocks in Fig. 3 are shown as being separate, it should be understood that the various functionality may be regrouped into different physical devices. For example, some processors 302 may include Bluetooth functionality. Additionally, some functionality need not be included in some blocks; for example, GPS functionality need not be provided in a provider server.
[0051] In various embodiments, any number of future-developed, current, or custom operating systems may be supported, such as iPhone OS (e.g. iOS), Google Android, Linux, Windows, MacOS, or the like. In various embodiments of the present invention, the operating system may be a multi-threaded, multi-tasking operating system. Accordingly, inputs and / or outputs from and to display 306 and inputs and / or outputs from and to physical sensors 322 may be processed in parallel processing threads. In other embodiments, such events or outputs may be processed serially, or the like. Inputs and outputs from other functional blocks, such as acquisition device 310 and physical sensors 322, may also be processed in parallel or serially in other embodiments of the present invention.
[0052] In some embodiments of the present invention, physical sensors 322 (e.g. MEMS-based) may include accelerometers, gyros, magnetometers, pressure sensors, temperature sensors, imaging sensors (e.g. blood oxygen, heartbeat, blood vessel, iris data, etc.), thermometers, otoacoustic emission (OAE) testing hardware, and the like. The data from such sensors may be used to capture data associated with device 300 and a user of device 300. Such data may include physical motion data, pressure data, orientation data, or the like. Data captured by sensors 322 may be processed by software running upon processor 302 to determine characteristics of the user, e.g. gait, gesture performance data, or the like, and used for user authentication purposes. In some embodiments, sensors 322 may also provide physical outputs, e.g. vibrations, pressures, and the like.
[0053] In some embodiments, a power supply 324 may be implemented with a battery (e.g. LiPo), ultracapacitor, or the like, that provides operating electrical power to device 300. In various embodiments, any number of power generation techniques may be utilized to supplement or even replace power supply 324, such as solar power, liquid metal power generation, thermoelectric engines, rf harvesting (e.g. NFC) or the like.
[0054] Fig. 3 is representative of components possible for a processing device. It will be readily apparent to one of ordinary skill in the art that many other hardware and software configurations are suitable for use with the present invention. Embodiments of the present invention may include at least some but need not include all of the functional blocks illustrated in Fig. 3. For example, a smart phone (e.g. processing unit / video camera) may include some, but not all of the illustrated functionality. As additional examples, processing unit 110 may include some of the functional blocks in Fig. 3, but it need not include an accelerometer or other physical sensor 322, an rf interface 320, an internal power source 324, or the like.
[0055] In light of the above, other variations and adaptations can be envisioned by one of ordinary skill in the art. For example, other objects, e.g. globe 124, may be used as a stationary object for purposes of the visual odometry / coordinate system transformations described above, and other objects, e.g. flower vase 122, may actually move within scene 108. For example, flower vase 122 may be sliding to the left. In such examples, the motion of flower vase 122 can also be determined in the same way as landmark 130: 3D determination within the local coordinate system, then transformation to 3D within the global coordinate system. The motion of landmark 130 and flower vase 122 can then be estimated within the 3D global coordinate system to predict whether they will intersect or not, e.g. whether left hand 120 will catch flower vase 122. In such embodiments, processing device 110 may have to track the movements of multiple landmarks within video image data. In other embodiments, other types of visual odometry-type functionality may include determination of sub-portions of objects that are stationary. For example, sharp corners on a stationary object may be used, long horizontal or vertical portions on a stationary object may be used, and the like. In still other embodiments, AI processing systems may be used to facilitate determination of various of the parameters above, such as determination of movements of one or more landmarks, the positions of stationary objects (or portions thereof) within a local coordinate system, determining the probability of a future movement path of a landmark, and the like. As mentioned above, in one embodiment, an architecture based upon a Transformer paradigm may be used. It should be understood that other types of architectures, including a recurrent neural network (RNN), or the like, may also be used.
[0056] The block diagrams of the architecture and flow charts are grouped for ease of understanding. However, it should be understood that combinations of blocks, additions of new blocks, re-arrangement of blocks, and the like are contemplated in alternative embodiments of the present invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.

Claims

CLAIMS

We claim:
1. A method for a computing system comprising: receiving, from a memory, a plurality of image data of a scene including a moving object and a stationary object, wherein the plurality of image data is derived from a video camera associated with a user, wherein the plurality of image data comprise depth information with respect to the video camera; determining, in a processor coupled to the video camera, a first plurality of three-dimensional positions of the moving object within a local coordinate system, in response to the plurality of image data, wherein the local coordinate system is relative to the video camera; determining, in the processor, a first movement path of the moving object within the local coordinate system in response to the first plurality of three-dimensional positions within the local coordinate system; determining, in the processor, a second plurality of three-dimensional positions within the local coordinate system of the stationary object in response to the plurality of image data; determining, in the processor, a plurality of transforms from the local coordinate system to a global coordinate system in response to the second plurality of three-dimensional positions within the local coordinate system; determining, in the processor, a second movement path of the moving object within the global coordinate system in response to the first movement path of the moving object and to the plurality of transforms; and determining, in the processor, a third movement path for the moving object within the global coordinate system in response to the second movement path of the moving object.
2. The method of claim 1 wherein the plurality of image data of the scene is associated with a first time period; wherein the third movement path for the moving object is associated with a second time period; and wherein the second time period is after the first time period.
3. The method of claim 1 wherein the video camera is disposed upon the user at a position selected from a group consisting of: a head, a chest, a torso, a hand, an arm.
4. The method of claim 1 wherein the video camera comprises an optical sensor for providing the plurality of image data and a depth sensor for providing the depth information.
5. The method of claim 1 wherein the plurality of image data is associated with a plurality of frames of video data from the video camera; and wherein the global coordinate system is relative to the video camera for a frame from the plurality of frames of video data.
6. The method of claim 1 wherein the moving object is selected from a group consisting of: a hand of the user, an object held by the user, an object manipulated by the user, an object thrown by the user, an object separate from the user.
7. The method of claim 1 further comprising determining, in the processor, whether the moving object will collide with a position within the global coordinate system in response to the third movement path for the moving object within the global coordinate system.
8. A computing system comprising: a memory configured to store a plurality of image data of a scene including a moving object and a stationary object captured by a video camera associated with a user, wherein the plurality of image data comprise depth information; a processor coupled to the memory, wherein the processor is configured to determine a first plurality of three-dimensional positions within a local coordinate system of the moving object in response to the plurality of image data, wherein the local coordinate system is relative to the video camera; wherein the processor is configured to determine a first movement path of the moving object within the local coordinate system in response to the first plurality of three-dimensional positions within the local coordinate system; wherein the processor is configured to determine a second plurality of three-dimensional positions within the local coordinate system of the stationary object in response to the plurality of image data; wherein the processor is configured to determine a plurality of transforms from the local coordinate system to a global coordinate system in response to the second plurality of three-dimensional positions within the local coordinate system; wherein the processor is configured to determine a second movement path of the moving object within the global coordinate system in response to the first movement path of the moving object and to the plurality of transforms from the local coordinate system to the global coordinate system; and wherein the processor is configured to determine a third movement path for the moving object within the global coordinate system in response to the second movement path of the moving object.
9. The system of claim 8 wherein the plurality of image data of the scene is associated with a first time period; wherein the third movement path for the moving object is associated with a second time period; and wherein the second time period is after the first time period.
10. The system of claim 8 further comprising the video camera disposed upon the user at a position selected from a group consisting of a head, a chest, a torso, a hand, and an arm.
11. The system of claim 10 wherein the video camera comprises: a first sensor configured to determine a plurality of images of the scene; and a second sensor configured to determine the depth information.
12. The system of claim 8 wherein the plurality of image data is associated with a plurality of frames of video data from the video camera; and wherein the global coordinate system is relative to the video camera for a frame from the plurality of frames of video data.
13. The system of claim 8 wherein the moving object is selected from a group consisting of a hand of the user, an object held by the user, an object manipulated by the user, an object thrown by the user, an object separate from the user.
14. The system of claim 8 wherein the processor is configured to determine whether the moving object will collide with another position within the global coordinate system in response to the third movement path for the moving object within the global coordinate system.
15. A method for a computing system comprising: receiving, from a memory, a plurality of image data of a scene including a moving object and a stationary object from a video camera associated with a user, wherein the plurality of image data comprise depth information, wherein the plurality of image data includes a first frame of image data, a second frame of image data and a third frame of image data, wherein the third frame is temporally between the first frame and the second frame; determining, in a processor coupled to the memory, a first three-dimensional position within a local coordinate system of the moving object and a first pose of the moving object in response to the first frame of image data; determining, in the processor, a second three-dimensional position within the local coordinate system of the moving object and a second pose of the moving object in response to the second frame of image data; determining, in a processor, a fourth three-dimensional position within the local coordinate system of the stationary object in response to the first frame of image data and in response to first orientation data of the video camera associated with the first frame of image data; determining, in the processor, a fifth three-dimensional position within the local coordinate system of the stationary object in response to the second frame of image data and in response to second orientation data of the video camera associated with the second frame of image data; determining, in the processor, a first transform from the local coordinate system to a global coordinate system, in response to the fourth three-dimensional position within the local coordinate system of the stationary object; determining, in the processor, a second transform from the local coordinate system to the global coordinate system, in response to the fifth three-dimensional position within the local coordinate system of the stationary object; determining, in the processor, a sixth three-dimensional position within the global coordinate system in response to the first three-dimensional position within the local coordinate system and in response to the first transform; determining, in the processor, a seventh three-dimensional position within the global coordinate system in response to the second three-dimensional position within the local coordinate system and in response to the second transform; and determining, in the processor, an eighth three-dimensional position within the global coordinate system in response to the sixth three-dimensional position within the global coordinate system and in response to the seventh three-dimensional position within the global coordinate system.
16. The method of claim 15 further comprising determining, in the processor, a third pose of the moving object associated with the eighth three-dimensional position within the global coordinate system in response to the first pose and the second pose.
17. The method of claim 15 further comprising: recording, with the video camera, the plurality of image data; and wherein the video camera is disposed upon the user at a position selected from a group consisting of: a head, a chest, a torso, a hand, an arm.
18. The method of claim 15 wherein the plurality of image data is associated with a plurality of frames of video data; and wherein the global coordinate system is relative to the video camera for a frame from the plurality of frames of video data.
19. The method of claim 15 wherein the moving object is selected from a group consisting of: a hand of the user, an object held by the user, an object manipulated by the user, an object thrown by the user, an object separate from the user.
20. The method of claim 15 further comprising determining, in the processor, whether the moving object collides with a position within the global coordinate system in response to the eighth three-dimensional position within the global coordinate system.
PCT/US2023/020232 2022-09-01 2023-04-27 Methods and apparatus for forecasting collisions using egocentric video data WO2024049513A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263403146P 2022-09-01 2022-09-01
US63/403,146 2022-09-01

Publications (1)

Publication Number Publication Date
WO2024049513A1 true WO2024049513A1 (en) 2024-03-07

Family

ID=90098483

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/020232 WO2024049513A1 (en) 2022-09-01 2023-04-27 Methods and apparatus for forecasting collisions using egocentric video data

Country Status (1)

Country Link
WO (1) WO2024049513A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210103340A1 (en) * 2014-06-14 2021-04-08 Magic Leap, Inc. Methods and systems for creating virtual and augmented reality
US20200126239A1 (en) * 2018-01-22 2020-04-23 SZ DJI Technology Co., Ltd. Methods and system for multi-target tracking

Similar Documents

Publication Publication Date Title
US10282913B2 (en) Markerless augmented reality (AR) system
US10535160B2 (en) Markerless augmented reality (AR) system
US10728616B2 (en) User interest-based enhancement of media quality
US10825197B2 (en) Three dimensional position estimation mechanism
CN114185427A (en) System and method for concurrent ranging and mapping
CN107346172B (en) Action sensing method and device
Hwang et al. Monoeye: Multimodal human motion capture system using a single ultra-wide fisheye camera
KR20180039013A (en) Feature data management for environment mapping on electronic devices
KR20170036747A (en) Method for tracking keypoints in a scene
US20230300464A1 (en) Direct scale level selection for multilevel feature tracking under motion blur
EP4342170A1 (en) Selective image pyramid computation for motion blur mitigation
US11765457B2 (en) Dynamic adjustment of exposure and iso to limit motion blur
US20220375110A1 (en) Augmented reality guided depth estimation
EP4342169A1 (en) Dynamic adjustment of exposure and iso related application
US12008155B2 (en) Reducing startup time of augmented reality experience
WO2024049513A1 (en) Methods and apparatus for forecasting collisions using egocentric video data
US20220377238A1 (en) Direct scale level selection for multilevel feature tracking under motion blur
KR20160012909A (en) Electronic device for displyaing image and method for controlling thereof
US20220375128A1 (en) Intrinsic parameters estimation in visual tracking systems
Jain et al. [POSTER] AirGestAR: Leveraging Deep Learning for Complex Hand Gestural Interaction with Frugal AR Devices
CN112541418A (en) Method, apparatus, device, medium, and program product for image processing
JP7415912B2 (en) Apparatus, system, method and program
US11756274B1 (en) Low-power architecture for augmented reality device
EP4184446A1 (en) Method and system for improving target detection performance through dynamic learning
US20240203069A1 (en) Method and system for tracking object for augmented reality

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23861022

Country of ref document: EP

Kind code of ref document: A1