WO2022216295A1 - Method and apparatus for operating an image signal processor - Google Patents

Method and apparatus for operating an image signal processor

Info

Publication number
WO2022216295A1
WO2022216295A1 PCT/US2021/026723
Authority
WO
WIPO (PCT)
Prior art keywords
motion
frame
exposure duration
exposure
time
Prior art date
Application number
PCT/US2021/026723
Other languages
French (fr)
Inventor
Feng Guo
Hsilin Huang
Original Assignee
Zeku, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeku, Inc. filed Critical Zeku, Inc.
Priority to PCT/US2021/026723 priority Critical patent/WO2022216295A1/en
Publication of WO2022216295A1 publication Critical patent/WO2022216295A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Definitions

  • Embodiments of the present disclosure relate to apparatuses and methods for operating an image signal processor (ISP).
  • An image/video capturing device, such as a camera or camera array, can be used to capture an image/video or a picture of a scene.
  • Cameras or camera arrays have been included in many handheld devices, especially since the advent of social media that allows users to upload pictures and videos of themselves, friends, family, pets, or landscapes on the internet with ease and in real-time.
  • Examples of camera components that operate together to capture an image/video include lens(es), image sensor(s), ISP(s), and/or encoders, just to name a few components thereof.
  • the lens for example, may receive and focus light onto one or more image sensors that are configured to detect photons. When photons impinge on the image sensor, an image signal corresponding to the scene is generated and sent to the ISP.
  • the ISP performs various operations associated with the image signal to generate one or more processed images of the scene that can then be output to a user, stored in memory, or output to the cloud.
  • an apparatus for image signal processing.
  • the apparatus may include an interface unit configured to obtain multiple stacked frames from an image capturing device.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration.
  • the apparatus may also include a rectification unit configured to rectify each of the first frame, the second frame, and the third frame based at least in part on motion sensor data.
  • the apparatus may also include a feature matching unit configured to perform feature matching of the rectified first frame, the rectified second frame, and the rectified third frame.
  • the apparatus may also include a first motion estimation unit configured to estimate a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame.
  • the apparatus may also include a second motion estimation unit configured to estimate a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion.
  • the apparatus may also include a motion computation unit configured to compute a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
  • an apparatus for image signal processing.
  • the apparatus may include an interface unit configured to obtain multiple stacked frames using the image capturing device.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, and a second frame captured at a second time using a second exposure duration.
  • the apparatus may also include an identification unit configured to identify an object in the first frame and the second frame.
  • the apparatus may also include a feature matching unit configured to perform feature matching of the object in the first frame and the second frame.
  • the apparatus may also include a first motion estimation unit configured to estimate a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching.
  • an apparatus (e.g., an ISP) for image signal processing.
  • the apparatus may include an interface unit configured to obtain multiple stacked frames.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration.
  • the apparatus may further include a global motion unit configured to estimate a global motion associated with the image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration.
  • the apparatus may further include a local motion unit configured to estimate a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration.
  • the apparatus may further include a rectification unit configured to rectify the multiple stacked frames based at least in part on the global motion and the local motion.
  • the apparatus may further include a high dynamic range (HDR) generator unit configured to generate the HDR image based at least in part on the rectified multiple stacked frames.
  • a method of image signal processing may be performed by, e.g., global motion unit 204.
  • the method may include obtaining multiple stacked frames using the image capturing device.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration.
  • the method may further include rectifying each of the first frame, the second frame, and the third frame based at least in part on motion sensor data.
  • the method may further include performing feature matching of the rectified first frame, the rectified second frame, and the rectified third frame.
  • the method may further include estimating a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame.
  • the method may further include estimating a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion.
  • the method may further include computing a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
  • a method of image signal processing may be performed by, e.g., local motion unit 206.
  • the method may include obtaining multiple stacked frames using an image capturing device.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration, and a third frame captured at a third time using a third exposure duration that may be shorter than the first exposure duration and the second exposure duration.
  • the method may further include identifying an object in the first frame and the second frame.
  • the method may further include performing feature matching of the object in the first frame, the second frame, and the third frame.
  • the method may further include estimating a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching.
  • the method may further include determining whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number.
  • the method may further include estimating a second motion of the object during the third exposure duration based at least in part on the feature matching when the number of matched data points meets the threshold number, or estimating the second motion of the object during the third exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number.
  • the method may further include computing a final motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the first motion and the second motion.
  • a method of image signal processing may be performed by, e.g., ISP 120.
  • the method may include obtaining multiple stacked frames.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration.
  • the method may further include estimating a global motion associated with the image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration.
  • the method may further include estimating a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration.
  • the method may further include rectifying the multiple stacked frames based at least in part on the global motion and the local motion.
  • the method may further include generating the HDR image based at least in part on the rectified multiple stacked frames.
  • FIG. 1 illustrates a block diagram of an exemplary apparatus that includes an ISP, according to some embodiments of the present disclosure.
  • FIG. 2 illustrates a detailed view of the exemplary ISP depicted in the apparatus of FIG. 1, according to some embodiments of the present disclosure.
  • FIG. 3A illustrates a block diagram of an exemplary global motion unit of the ISP depicted in FIGs. 1 and 2, according to some embodiments of the present disclosure.
  • FIG. 3B illustrates a block diagram of an exemplary local motion unit of the ISP depicted in FIGs. 1 and 2, according to some embodiments of the present disclosure.
  • FIG. 4A illustrates a first conceptual depiction of an exemplary data flow between the global motion unit of FIG. 3A and the local motion unit of FIG. 3B, according to some embodiments of the present disclosure.
  • FIG. 4B illustrates a second conceptual depiction of an exemplary data flow of a high dynamic range (HDR) unit of the ISP of FIGs. 1 and 2, according to some embodiments of the present disclosure.
  • FIG. 5 illustrates a flowchart of an exemplary method of the global motion unit of FIG. 3A, according to some embodiments of the present disclosure.
  • FIG. 6 illustrates a flowchart of an exemplary method of the local motion unit of FIG. 3B, according to some embodiments of the present disclosure.
  • FIG. 7 illustrates a flowchart of an exemplary method of the ISP of FIGs. 1 and 2, according to some embodiments of the present disclosure.
  • the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense.
  • terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context.
  • the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
  • the term “camera” is used herein to refer to an image capture device or other data acquisition device.
  • a data acquisition device can be any device or system for acquiring, recording, measuring, estimating, determining and/or computing data representative of a scene, including but not limited to two-dimensional image data, three-dimensional image data, and/or light field data.
  • Such a data acquisition device may include optics, sensors, and image processing electronics for acquiring data representative of a scene, using techniques that are well known in the art.
  • One skilled in the art will recognize that many types of data acquisition devices can be used in connection with the present disclosure, and that the present disclosure is not limited to cameras.
  • the use of the term “camera” herein is intended to be illustrative and exemplary, but should not be considered to limit the scope of the present disclosure. Specifically, any use of such term herein should be considered to refer to any suitable data acquisition device.
  • the term “frame” as used herein may be defined as a data entity that contains a sensor image.
  • the sensor image may be either a raw image or a compressed representation of the raw image.
  • the terms “frame” and “image” may be used interchangeably in the description below.
  • multiple stacked frames may be defined as a set of frames captured with exposure intervals of the same or varying lengths. The exposure intervals may be continuous or intermittent.
  • An example of continuous exposure intervals includes a first frame captured with an exposure interval from t_n to t_n+1, a second frame captured with an exposure interval from t_n+1 to t_n+3, and a third frame captured with an exposure interval from t_n+3 to t_n+6.
  • the term “exposure interval” as used herein may be defined as the length of time that one or more of the lens and/or the shutter remain open, and/or the length of time the image sensor receives light from a scene being captured in a single frame.
  • the term “long-exposure” as used herein may be defined as an exposure interval longer than a threshold duration, where the threshold duration is longer than a “short-exposure” duration. Long-exposure frames may be captured to increase the brightness of regions that are undesirably dark in the short-exposure frames when generating an HDR image.
  • HDR imaging is a technique used in imaging and photography to capture a greater dynamic range between the lightest and darkest areas of an image as compared to standard digital imaging techniques.
  • HDR images may represent more accurately the range of intensity levels found in real scenes, from direct sunlight to faint starlight, and the images are often captured by exposing the same subject matter with a plurality of different exposure intervals or levels.
  • Non-HDR cameras take pictures at one exposure interval with a limited contrast range. This results in the loss of detail in bright or dark areas of the picture, depending on whether the camera had a low or high exposure setting. HDR compensates for this loss of detail by taking multiple pictures at different exposure levels and intelligently stitching them together to produce a picture that is representative in both dark and bright areas.
  • HDR methods provide a higher dynamic range from the imaging process.
  • HDR is also commonly used to refer to the display of images derived from HDR imaging in a way that exaggerates contrast for artistic effects.
  • the two main sources of HDR images are computer renderings and merging of multiple stacked frames. These multiple stacked frames may include, e.g., multiple low-dynamic-range (LDR) images or standard-dynamic-range (SDR) images. Tone mapping methods, which reduce overall contrast to facilitate the display of HDR images on devices with lower dynamic range, can be applied to produce images with preserved or exaggerated local contrast for artistic effect.
  • HDR images are generally achieved by capturing multiple standard photographs (e.g., multiple stacked frames), often using two or three different exposure intervals, and then merging them into an HDR image.
  • Forming an HDR image of a scene may involve capturing images of the scene using two or more exposure times (e.g., short exposures, long exposures, and possibly other exposure times as well), and then combining these images so that the resulting output image displays details of both the bright and dark regions of the scene.
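  • As a rough, non-authoritative illustration of the combining step described above, the following Python/NumPy sketch merges aligned, linear frames captured with different exposure intervals into a single radiance estimate. The function name, the simple mid-tone weighting, and the example exposure times are illustrative assumptions and are not taken from this disclosure.

        import numpy as np

        def merge_exposures(frames, exposure_times):
            # frames: list of aligned, linear frames scaled to [0, 1]; exposure_times in seconds.
            acc = np.zeros_like(frames[0], dtype=np.float64)
            weights = np.zeros_like(frames[0], dtype=np.float64)
            for frame, t in zip(frames, exposure_times):
                # Trust mid-tone pixels most; down-weight pixels near saturation or the noise floor.
                w = np.clip(1.0 - 2.0 * np.abs(frame - 0.5), 1e-3, 1.0)
                acc += w * frame / t       # dividing by exposure time estimates scene radiance
                weights += w
            return acc / weights           # per-pixel radiance with extended dynamic range

        # Example usage: hdr = merge_exposures([short, long1, long2], [0.005, 0.04, 0.04])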
  • combining short-exposure images and long-exposure images may be difficult due to blur that is caused by local motion and/or global motion.
  • Local motion includes motion of elements or details of the content recorded in the pictures, while global motion includes motion resulting from movement of the camera or the user, for example, the person holding the device.
  • the motion may include translational motion, rotational motion, and scaling motion. Further, motion can cause these combining procedures to fail when the short-exposure image and long-exposure image cannot be properly aligned.
  • Known techniques, such as vision-based motion estimation, may be used to estimate local motion and/or global motion and align the frames to generate an HDR image.
  • However, an undesirable amount of time and computational complexity is required by these known techniques.
  • motion estimation using these known approaches may lead to computational inaccuracies due to the noise caused by inadequate lighting.
  • Certain image capturing devices, such as smartphones, may include an inertial motion unit (IMU) to measure the rotation angle and speed of the camera.
  • the IMU bias can be large, which may introduce further inaccuracies in motion estimation.
  • the present disclosure provides a solution that uses image-based feature mapping to estimate both global motion and local motion.
  • the IMU data may be used to rectify the multiple stacked frames prior to performing feature matching. Rectifying the multiple stacked frames using the IMU data may reduce the range of the feature field search used in image feature mapping. This reduction in the range may reduce the computational time of the image feature mapping operation. Moreover, rectifying multiple stacked frames using IMU data may exclude some falsely detected features, which may increase the accuracy of the feature matching computation. The mapped features may then be used to estimate the global translation and rotation of the images with a high degree of accuracy.
  • image feature mapping may also be used to estimate local motion using the present techniques.
  • object identification for two long-exposure frames with continuous exposure intervals may be performed using image segmentation.
  • Feature mapping may then be performed to estimate the local motion of an object in the continuous exposure intervals of the long-exposure interval frames.
  • the local motion of the object in short-exposure interval frames may be estimated by interpolation or machine learning.
  • the final motion of the object in all frames may then be computed using the local motion estimated for the long-exposure interval frames and the short-exposure interval frames.
  • the image feature mapping used in local motion estimation increases the accuracy of the computation, while at the same time reducing the computational time and complexity, as compared to known approaches. Additional details describing the present approach for global motion estimation and local motion estimation are set forth below in connection with FIGs. 1-7.
  • FIG. 1 illustrates an exemplary block diagram of an apparatus 100 having an image signal processor (ISP), according to some embodiments.
  • Apparatus 100 may include an application processor (AP) 130, an ISP 120, a memory 118, and input-output devices 108.
  • ISP 120 may include an HDR generator unit 102, a motion estimation unit 104, and a camera 106.
  • Input-output device 108 may include user input devices 110, IMU 112 (e.g., translational motion sensor, rotational motion sensor, accelerometer, gyroscope, etc.), display and audio devices 114, and wireless communication devices 116.
  • apparatus 100 may be an image capturing device, such as a smartphone or digital camera.
  • AP 130 may be the main application processor of apparatus 100, and may host the operating system (OS) of apparatus 100 and all the applications.
  • AP 130 may be any kind of general-purpose processor, such as a microprocessor, a microcontroller, a digital signal processor, or a central processing unit, and other needed integrated circuits such as glue logic.
  • the term “processor” may refer to a device having one or more processing units or elements, e.g., a central processing unit (CPU) with multiple processing cores.
  • AP 130 may be used to control the operations of apparatus 100 by executing instructions stored in memory 118, which can be in the same chip as AP 130 or in a separate chip from AP 130.
  • AP 130 may also generate control signals and transmit them to various parts of apparatus 100 to control and monitor the operations of these parts.
  • AP 130 can run the OS of apparatus 100, control the communications between the user and apparatus 100, and control the operations of various applications.
  • AP 130 may be coupled to a communications circuitry and execute software to control the wireless communications functionality of apparatus 100.
  • AP 130 may be coupled to ISP 120 and input-output devices 108 to control the processing and display of sensor data, e.g., image data, one or more frames, HDR images, LDR images, etc.
  • ISP 120 may include software and/or hardware operatively coupled to AP 130 and input-output devices 108. In some embodiments, components, e.g., circuitry, of ISP 120 may be integrated on a single chip. In some embodiments, ISP 120 includes an image processing hardware coupled to (e.g., placed between) AP 130 and HDR generator unit 102/motion estimator unit 104/camera 106. ISP 120 may include a suitable circuitry that, when controlled by AP 130, performs functions not supported by AP 130, e.g., processing raw image data by rectifying frames with respect to IMU data, estimating global motion and local motion associated with multiple stacked frames, performing feature matching, object identification, aligning frames, HDR generation etc.
  • ISP 120 may include a field- programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), a microprocessor, a microcontroller, a digital signal processor, and other needed integrated circuits for its purposes.
  • FIG. 2 illustrates a detailed block diagram of the ISP 120 of FIG. 1, according to some embodiments of the disclosure.
  • FIG. 3A illustrates a block diagram of an exemplary global motion unit 204 that may be included in ISP 120, according to some embodiments of the disclosure.
  • FIG. 3B illustrates a block diagram of an exemplary local motion unit 206 that may be included in ISP 120, according to some embodiments of the disclosure.
  • FIG. 4A illustrates a first conceptual depiction of an exemplary data flow 400 between the global motion unit of FIG. 3A and the local motion unit of FIG. 3B, according to some embodiments of the present disclosure.
  • FIG. 4B illustrates a second conceptual depiction of an exemplary data flow 450 of an exemplary HDR generator unit 102 that may be included in ISP 120, according to some embodiments of the present disclosure.
  • FIGs. 2, 3A, 3B, 4A, and 4B will be described together.
  • AP 130 may transmit control signals and other data to ISP 120 via an internal bus to control the operation of ISP 120.
  • Certain other components, such as a power management integrated circuit (PMIC) 216 and a memory 214, are operatively coupled to motion estimator unit 104 for the operation thereof.
  • PMIC 216 may be a dedicated power management circuitry that provides the power/voltages necessary for the operation of motion estimator unit 104.
  • Memory 214 may be a dedicated low-power double data rate (LPDDR) device that provides the caching/storage functions necessary for the operation of ISP 120.
  • memory 214 includes a dynamic random-access memory (DRAM).
  • ISP 120 may include HDR generator unit 102, motion estimator unit 104, and camera 106, operatively coupled to one another.
  • Each of HDR generator unit 102, motion estimator unit 104, and camera 106 may include suitable software and hardware for the functions of ISP 120.
  • Motion estimator unit 104 may be operatively coupled to camera 106 and HDR generator unit 102 of ISP 120.
  • motion estimator unit 104 may be operatively coupled to IMU 112 of input-output devices 108 depicted in FIG. 1
  • Camera 106 may be configured to operate in HDR mode or non-HDR mode.
  • While operating in HDR mode, camera 106 may be configured to capture a burst of consecutive digital images (also referred to herein as “multiple stacked frames”) of a scene automatically when activated, for example, when a user presses an activation button or when AP 130 sends an instruction to operate in HDR mode.
  • a user can activate camera 106 multiple times while pointing at a specific scene to acquire a burst of digital images even if the imaging device is operating in non-HDR mode, e.g., camera 106 captures one picture responsive to each command.
  • camera 106 may include multiple cameras configured to capture a burst of frames from different angles, positions, or perspectives.
  • multiple stacked frames of the same scene may be captured using camera 106 and one or more external cameras.
  • the external cameras may be part of apparatus 100 but located external to ISP 120 or external to apparatus 100.
  • ISP 120 may store the multiple stacked frames in memory 214 and/or send them for storage in memory 118 of apparatus 100.
  • the frames captured by external camera(s) may be sent to the AP 130, ISP 120, memory 118, or memory 214 and stored with the frames captured using camera 106 for processing together.
  • the frames in the burst may differ from each other due to global motion or local motion.
  • the global motion may include motion that results from the movement of apparatus 100 between frames.
  • the global motion may occur when apparatus 100 or a user holding apparatus 100 moves while camera 106 captures a burst of frames.
  • Local motion may occur when an object (e.g., a person, animal, vehicle, waves, landscape, etc.) captured during the burst changes position.
  • Motion estimator unit 104 may include global motion unit 204, local motion unit 206, and rectification unit 208.
  • Global motion unit 204 may be configured to estimate global motion for multiple stacked frames, e.g., using the operations described below in connection with FIG. 3A.
  • Local motion unit 206 may be configured to estimate local motion from multiple stacked frames using, e.g., the operations described below in connection with FIG. 3B.
  • rectification unit 208 may be configured to adjust and/or rectify the multiple stacked frames before sending them to HDR generator unit 102 for HDR image generation.
  • global motion unit 204 may include interface unit 302, sensor rectification unit 304, feature mapping unit 306, long-exposure motion unit 308, short-exposure motion unit 310, and global motion computation unit 312.
  • Interface unit 302 may be configured to receive (at 401) multiple stacked frames
  • Multiple stacked frames 201 may include a mix of long- exposure frames and short-exposure frames captured of the same scene.
  • Interface unit 302 may send multiple stacked frames 201 and IMU data 209 to sensor rectification unit 304, which rectifies each of the frames in the stack using the IMU data 209.
  • multiple stacked frames 201 may include a first long-exposure frame captured during a first exposure interval, a short-exposure frame captured during a second exposure interval, and a second long-exposure frame captured during a third exposure interval.
  • Interface unit 302 may receive (at 403) IMU data 209 from IMU 112.
  • IMU data 209 may include translational and/or rotational motion data associated with apparatus 100, which may be obtained by the accelerometer and/or gyroscope of IMU 112.
  • IMU data 209 may include first motion sensor data obtained by IMU 112 during the first exposure interval, second motion sensor data obtained by IMU 112 during the second exposure interval, and third motion sensor data obtained by IMU 112 during the third exposure interval.
  • sensor rectification unit 304 may perform an initial rectification (at 405) of the raw image data of the first long-exposure frame, the short-exposure frame, and the second long-exposure frame to reduce and/or remove image distortion due to global motion that occurred during those time periods.
  • IMU data 209 is only a rough source for image rectification, which is why this step is referred to as an “initial rectification.”
  • sensor rectification unit 304 may send the multiple stacked frames 201 and the initial rectification information to feature mapping unit 306. Feature mapping may subsequently be performed to estimate global motion with a high degree of accuracy.
  • initial rectification of the frames using IMU data 209 may reduce the range of the feature field search operation performed by feature matching unit 306, which may reduce the time used to complete the feature matching operation.
  • Without the initial rectification, the feature search may need to be performed over the whole image size between two frames due to the unknown motion range.
  • With the initial rectification, the two frames are roughly aligned, and the search range used in feature mapping may be limited to a much smaller size, e.g., one-tenth of the frame size.
  • the initial rectification using IMU data 209 may exclude some falsely detected features to improve the image feature matching accuracy.
  • When the feature matching search is performed over a large frame, there is an increased chance that a feature in one frame may be incorrectly matched to an area in another frame that does not include the same feature.
  • With the initial rectification, the search range is much smaller, and features that would be falsely matched with a larger search range may be excluded.
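  • A minimal sketch of this initial, IMU-based rectification is shown below, assuming a pure camera rotation, a known 3x3 intrinsic matrix K, and gyroscope samples covering the frame’s exposure interval. The small-angle integration and the OpenCV warp are illustrative stand-ins for whatever rectification the ISP actually implements.

        import numpy as np
        import cv2

        def rectify_with_gyro(frame, gyro_rates, dt, K):
            # gyro_rates: (N, 3) angular velocities in rad/s, sampled every dt seconds.
            rot_vec = gyro_rates.sum(axis=0) * dt        # integrate to a rotation vector (small-angle)
            R, _ = cv2.Rodrigues(rot_vec)                # rotation matrix for the accumulated motion
            H = K @ R.T @ np.linalg.inv(K)               # pure-rotation homography that undoes it
            h, w = frame.shape[:2]
            return cv2.warpPerspective(frame, H, (w, h))

        # After this rough alignment, feature correspondences only need to be searched within a
        # small window (e.g., roughly one-tenth of the frame size) around each candidate feature.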
  • Feature mapping unit 306 may perform feature matching (at 407) of the rectified first long-exposure frame, the rectified short-exposure frame, and the rectified second long-exposure frame.
  • feature mapping unit 306 may include a graph neural network, such as SuperGlue, that matches two sets of local features by jointly finding correspondences between the frames and rejecting non-matchable points. Matched features may be estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network.
  • Feature mapping unit 306 may use a flexible context aggregation mechanism based on attention, which enables reasoning about the underlying three-dimensional (3D) scene and feature assignments jointly.
  • Feature mapping unit 306 may learn geometric transformations and regularities of the 3D world through end-to-end training from image pairs.
  • the output of the neural network may include features that are mapped between each of, e.g., the first long-exposure frame, the short-exposure frame, and the second long-exposure frame. These mapped features may be sent to global motion computation unit 312
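  • The disclosure describes a learned matcher (e.g., SuperGlue). As a simpler, classical stand-in that illustrates the same step, the sketch below matches ORB keypoints between two rectified frames with cross-checked brute-force Hamming matching; the detector choice and match count are illustrative assumptions, not the method described here.

        import cv2

        def match_features(frame_a, frame_b, max_matches=200):
            # Detect and describe keypoints in each rectified frame.
            orb = cv2.ORB_create(nfeatures=2000)
            kp_a, des_a = orb.detectAndCompute(frame_a, None)
            kp_b, des_b = orb.detectAndCompute(frame_b, None)
            # Cross-checked brute-force matching on binary descriptors.
            matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
            matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)[:max_matches]
            pts_a = [kp_a[m.queryIdx].pt for m in matches]
            pts_b = [kp_b[m.trainIdx].pt for m in matches]
            return pts_a, pts_b                          # corresponding (x, y) points in the two frames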
  • the short-exposure frame may have a lower signal-to-noise ratio (SNR) compared with the long-exposure frames. Consequently, it may be challenging to obtain accurate image matching information for short-exposure frames using feature mapping, especially when the short-exposure frame(s) are dark. In this situation, the image feature matching of the short-exposure frame may fail.
  • long-exposure motion unit 308 may estimate (at 409a) a first global motion m_g based on the feature mapping of the first long-exposure frame and the second long-exposure frame, where the first long-exposure frame is captured before the short-exposure frame, and the second long-exposure frame is captured after the short-exposure frame.
  • short-exposure motion unit 310 may estimate (at 409b) a second global motion for the short-exposure frame by interpolating the first global motion m_g.
  • short-exposure motion unit 310 may estimate the second global motion M_g using Equation (1):
  • M_g = w_a · m_g · t_s / (t_L1 + t_L2) + w_b · g (1), where t_L1 is the duration of the exposure interval associated with the first long-exposure frame, t_L2 is the duration of the exposure interval associated with the second long-exposure frame, t_s is the duration of the exposure interval for the short-exposure frame, g is the gyroscope information from IMU data 209, and w_a and w_b are both weights.
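  • A one-line sketch of the interpolation in Equation (1), as reconstructed above, is shown below; the weight values are placeholders, and the exact scaling term should be checked against the original filing.

        def interpolate_short_exposure_global_motion(m_g, t_l1, t_l2, t_s, g, w_a=0.7, w_b=0.3):
            # Scale the long-exposure motion by the relative exposure duration and blend in gyro data.
            return w_a * m_g * t_s / (t_l1 + t_l2) + w_b * g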
  • global motion computation unit 312 may combine the first global motion m_g and the second global motion M_g with IMU data 209 to compute (at 409c) the final global motion of apparatus 100 for the multiple stacked frames 201 with a high degree of accuracy.
  • the first global motion m_g and the second global motion M_g are filtered with IMU data 209, e.g., using a Kalman filter, to increase the accuracy of the final global motion estimation.
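  • The disclosure mentions a Kalman filter for this fusion step. The sketch below shows a simpler single-step, inverse-variance blend of a vision-based and an IMU-based motion estimate per axis, which captures the same idea under Gaussian assumptions; the variance values are placeholders.

        def fuse_motion(vision_motion, imu_motion, vision_var=0.05, imu_var=0.20):
            # Inverse-variance weighting: the noisier source contributes less to the final estimate.
            k = imu_var / (vision_var + imu_var)
            return k * vision_motion + (1.0 - k) * imu_motion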
  • local motion unit 206 may include interface unit 322, object identification unit 324, feature matching unit 326, long-exposure motion unit 328, short-exposure motion unit 330, and/or local motion computation unit 332.
  • Interface unit 322 may be configured to receive (at 401) multiple stacked frames 201, e.g., from one or more of camera 106, memory 214, and/or memory 118. Multiple stacked frames 201 may be the same as those received by interface unit 302 of global motion unit 204.
  • interface unit 322 may also be configured to receive IMU data 209 from IMU 112.
  • Interface unit 322 may send multiple stacked frames 201 to object identification unit 324, which is configured to identify (at 411) one or more objects in each of the frames.
  • object identification unit 324 may be configured to identify people, anatomy, facial expressions, animals, toys, vehicles, foliage, water, clothing, etc.
  • Object identification unit 324 may perform image segmentation (at 411) to identify object(s) in multiple stacked frames 201.
  • object identification unit 324 may perform image segmentation by partitioning each frame into multiple segments, which may be sets of pixels, also known as image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.
  • Image segmentation may include locating objects and boundaries (lines, curves, etc.) in each of the frames. Each pixel or segment is assigned a label such that pixels with the same label in each of the different frames share certain characteristics. Then, the image segmentation information may be sent to feature mapping unit 326.
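  • As a very simple stand-in for the segmentation step described above (a production ISP would more likely use a learned segmentation network), the sketch below labels connected regions in a grayscale frame so the same object can be located in each frame of the stack.

        import cv2

        def segment_objects(gray_frame):
            # Otsu thresholding separates foreground from background, then connected components
            # assign one label per contiguous region (image object).
            _, mask = cv2.threshold(gray_frame, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            num_labels, labels = cv2.connectedComponents(mask)
            return num_labels, labels                    # per-pixel object labels for this frame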
  • Feature mapping unit 326 may perform (at 413) feature mapping using a graph neural network that is the same as or similar to the one described above in connection with feature mapping unit 306 of global motion unit 204, except that here intensive features near the boundary of an object (as indicated in the image segmentation information) may be used to match features in the frames. Then, the feature mapping information may be sent to long-exposure motion unit 328 and/or short-exposure motion unit 330 to estimate the local motion of the object.
  • Long-exposure motion unit 328 may use two long-exposure frames captured with continuous exposure intervals to estimate (at 415) a first local motion m_l.
  • long-exposure motion unit 328 may estimate the first local motion m_l by estimating the distance and/or direction of the object’s motion based on the differences in the feature mapping information between the two long-exposure frames. Similar to global motion, it may be challenging to use feature mapping to estimate local motion for short-exposure frames when they are dark. This is because the feature mapping of short-exposure frames may be inaccurate, as the boundaries of an object may be difficult for feature mapping unit 326 to identify.
  • When a number of matched data points of the feature matching meets a threshold number (e.g., when the short-exposure frame is not too dark), short-exposure motion unit 330 may estimate (at 417a) the second local motion M_l between a short-exposure frame and a long-exposure frame using feature mapping between these two frames. Otherwise, when the threshold number is not met (e.g., when the short-exposure frame is too dark), short-exposure motion unit 330 may estimate (at 417b) the second local motion M_l using machine learning or interpolation of the first local motion m_l as shown below in Equation (2):
  • M_l = m_l · t_s / (t_L1 + t_L2) + n (2), where t_L1 is the duration of the exposure interval associated with the first long-exposure frame, t_L2 is the duration of the exposure interval associated with the second long-exposure frame, t_s is the duration of the exposure interval for the short-exposure frame, and n is noise with Gaussian distribution N(0, σ).
  • interpolation may occur using frames that are rectified/shifted based on the final global motion estimation.
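  • The sketch below illustrates the fallback logic described above: use the feature matches directly when enough are available, otherwise interpolate the long-exposure local motion per the reconstructed Equation (2). The threshold, noise level, and point format (NumPy (x, y) arrays) are illustrative assumptions.

        import numpy as np

        def short_exposure_local_motion(matched_pairs, m_l, t_l1, t_l2, t_s, threshold=30, sigma=0.5):
            # matched_pairs: list of (point_in_long_frame, point_in_short_frame) NumPy arrays.
            if len(matched_pairs) >= threshold:
                # Enough reliable matches: average the per-feature displacements.
                return np.mean([b - a for a, b in matched_pairs], axis=0)
            # Too few matches (e.g., the short-exposure frame is too dark): interpolate m_l by the
            # exposure-duration ratio and add Gaussian noise n ~ N(0, sigma).
            n = np.random.normal(0.0, sigma, size=np.shape(m_l))
            return m_l * t_s / (t_l1 + t_l2) + n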
  • local motion computation unit 332 may compute (at 419) a final local motion for multiple stacked frames 201 by combining the first local motion m_l and the second local motion M_l of the object.
  • rectification unit 208 may receive information associated with the final global motion 203 and the final local motion 205 from global motion unit 204 and local motion unit 206, respectively. Rectification unit 208 may receive multiple stacked frames 201 from camera 106 either directly or indirectly via the motion unit(s) of AP 130. Then, using final global motion 203 and final local motion 205, rectification unit 208 may (at 421) rectify multiple stacked frames 207. Due to the high degree of accuracy with which final global motion 203 and final local motion 205 are estimated, multiple stacked frames 207 may be more accurately aligned, as compared to known approaches. Referring to FIG. 4B, HDR generator unit 102 may generate (at 423) an HDR image using the multiple stacked frames 201 rectified using final global motion 203 estimated by global motion unit 204 and final local motion 205 estimated by local motion unit 206.
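  • A minimal sketch of the rectification step is shown below, assuming for simplicity that the final motion for a frame reduces to a single (dx, dy) translation; the disclosure also contemplates rotation and scaling, which would require a full homography rather than the affine shift used here.

        import numpy as np
        import cv2

        def rectify_frame(frame, motion_xy):
            # Shift the frame back by the estimated final motion so all frames line up for merging.
            dx, dy = motion_xy
            M = np.float32([[1, 0, -dx], [0, 1, -dy]])
            h, w = frame.shape[:2]
            return cv2.warpAffine(frame, M, (w, h))

        # The rectified frames can then be merged (e.g., by the HDR generator’s networks, or by a
        # simple exposure-weighted merge such as merge_exposures sketched earlier) into the HDR image.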
  • HDR generator unit 102 may include a neural network that includes one or more of, e.g., an HDR frame merging network, denoising network, deep quantization network, and/or a super resolution (SR) network, as shown in FIG. 4B.
  • Those networks may perform estimation for small motions, especially for short exposure or dark areas.
  • the ISP 120 of the present disclosure may perform motion estimation, HDR frame merging, and denoising end-to-end, thereby saving time and reducing computational complexity, as compared to known ISPs that estimate motion from a single source such as the IMU.
  • ISP 120 may be used for multiple view frames alignment in which multiple stacked frames 201 are captured using different cameras with different views.
  • a standard focal length camera can use the same framework to generate a super-resolution image by merging multiple frames after performing motion estimation and rectification as described above in connection with FIGs. 2, 3A, 3B, 4A, and 4B.
  • local motion unit 206 may use machine learning to perform local motion estimation between the frames of the standard focal length camera and a telephoto lens camera, especially in low light scenes.
  • ISP 120 may also be used in an augmented reality (AR) device, such as an AR headset that includes a camera.
  • camera poses may be estimated using geometric computer vision tasks such as Simultaneous Localization and Mapping (SLAM) and/or Structure-from-Motion (SfM).
  • SLAM for instance, may be used to estimate or generate a map that tracks the movement of features within the AR environment.
  • the pipeline operations of the SLAM may include feature matching between different frames and the integration of IMU data 209 to improve motion tracking performance.
  • ISP 120 and the exemplary operations described above in connection with FIGs. 2, 3A, 3B, 4A, and 4B may be used to handle low light environments, thereby improving feature tracking for AR implementations.
  • FIG. 5 illustrates a flowchart of an exemplary method 500 of the global motion estimation, according to some embodiments of the present disclosure.
  • Exemplary method 500 may be performed by an image capturing device, e.g., such as apparatus 100, ISP 120, motion estimator unit 104, and/or global motion unit 204.
  • Method 500 may include steps 502-512 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5.
  • image capturing device may obtain multiple stacked frames.
  • the multiple stacked frames may include a first frame (e.g., the first long-exposure frame described above in connection with FIG. 3A) captured at a first time using a first exposure duration, a second frame (e.g., the short-exposure frame) captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame (e.g., the second long-exposure frame) captured at a third time using a third exposure duration equal to the first exposure duration.
  • interface unit 302 of global motion unit 204 may be configured to receive (at 401) multiple stacked frames 201, e.g., from one or more of camera 106, memory 214, and/or memory 118 and to receive IMU data 209 from IMU 112.
  • Multiple stacked frames 201 may include a mix of long-exposure frames and short-exposure frames captured of the same scene.
  • image capturing device may rectify each of the first frame, the second frame, and the third frame based at least in part on motion sensor data. For example, referring to FIGs. 3A and 4A, using portions of IMU data 209 that roughly correspond to the respective exposure intervals, sensor rectification unit 304 may perform an initial rectification (at 405) of the raw image data of the first long-exposure frame, the short-exposure frame, and the second long-exposure frame to reduce and/or remove image distortion due to global motion that occurred during those time periods.
  • image capturing device may perform feature matching of the rectified first frame, the rectified second frame, and the rectified third frame.
  • feature matching unit 306 may perform feature matching (at 407) of the rectified first long-exposure frame, the rectified short-exposure frame, and the rectified second long-exposure frame.
  • feature mapping unit 306 may include a graph neural network, such as SuperGlue, that matches two sets of local features by jointly finding correspondences between the frames and rejecting non-matchable points. Matched features may be estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network.
  • Feature mapping unit 306 may use a flexible context aggregation mechanism based on attention, which enables reasoning about the underlying 3D scene and feature assignments jointly. Feature mapping unit 306 may perform geometric transformations and regularities of the 3D world through end-to-end training from image pairs.
  • the output of the neural network may include features that are mapped between each of, e.g., the first long-exposure frame, the short-exposure frame, and the second long-exposure frame. These mapped features may be sent to global motion computation unit 312.
  • image capturing device may estimate a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame.
  • long-exposure motion unit 308 may estimate (at 409a) a first global motion m_g based on the feature mapping of the first long-exposure frame and the second long-exposure frame, where the first long-exposure frame is captured before the short-exposure frame, and the second long-exposure frame is captured after the short-exposure frame.
  • image capturing device may estimate a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion.
  • short-exposure motion unit 310 may estimate (at 409b) a second global motion for the short-exposure frame by interpolating the first global motion m_g.
  • short-exposure motion unit 310 may estimate the second global motion M_g using Equation (1) discussed above.
  • image capturing device may compute a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
  • global motion computation unit 312 may combine the first global motion m_g and the second global motion M_g with IMU data 209 to compute (at 409c) the final global motion of apparatus 100 for the multiple stacked frames 201 with a high degree of accuracy.
  • FIG. 6 illustrates a flowchart of an exemplary method 600 of the local motion estimation, according to some embodiments of the present disclosure.
  • Exemplary method 600 may be performed by an image capturing device, e.g., such as apparatus 100, ISP 120, motion estimator unit 104, and/or local motion unit 206.
  • Method 600 may include steps 602-614 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 6.
  • the image capturing device may obtain multiple stacked frames.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure interval, a second frame captured at a second time using a second exposure interval continuous with the first exposure interval, and a third frame captured at a third time using a third exposure interval less than the first or second exposure intervals.
  • interface unit 322 may be configured to receive (at 401) multiple stacked frames 201, e.g., from one or more of camera 106, memory 214, and/or memory 118. Multiple stacked frames 201 being the same as those received by interface unit 302 of global motion unit 204.
  • image capturing device may identify an object in the first and second frames.
  • object identification unit 324 may perform image segmentation (at 411) to identify object(s) in multiple stacked frames 201.
  • object identification unit 324 may perform image segmentation by partitioning each frame into multiple segments, which may be sets of pixels, also known as image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.
  • Image segmentation may include locating objects and boundaries (lines, curves, etc.) in each frame. Each pixel or segment is assigned a label such that pixels with the same label in each of the different frames share certain characteristics. Then, the image segmentation information may be sent to feature mapping unit 326.
  • image capturing device may perform feature matching of the object in the first frame and the second frame.
  • feature mapping unit 326 may perform (at 413) feature mapping using a graph neural network that is the same as or similar to the one described above in connection with feature mapping unit 306 of global motion unit 204, except that here intensive features near the boundary of an object (as indicated in the image segmentation information) may be used to match features in the frames. Then, the feature mapping information may be sent to long-exposure motion unit 328 and/or short-exposure motion unit 330 to estimate the local motion of the object.
  • image capturing device may estimate a first motion of the object during the first exposure interval and the second exposure interval based at least in part on the feature matching.
  • long-exposure motion unit 328 may use two long-exposure frames captured with continuous exposure intervals to estimate (at 415) a first local motion m_l.
  • long-exposure motion unit 328 may estimate the first local motion m_l by estimating the distance and/or direction of the object’s motion based on the differences in the feature mapping information between the two long-exposure frames.
  • image capturing device may determine whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number. When the threshold number is met, the operation moves to 612. Otherwise, when the threshold number is not met, the operation moves to 614.
  • the image capturing device may estimate a second motion of the object during the third exposure interval based at least in part on the feature matching. For example, referring to FIGs. 3B and 4A, when the number of matched features meets a threshold number (e.g., when the short-exposure frame is not too dark), short-exposure motion unit 330 may estimate (at 417a) the second local motion M_l between a short-exposure frame and a long-exposure frame using feature mapping between these two frames.
  • the image capturing device may estimate the second motion of the object during the third exposure interval using interpolation or machine learning. For example, referring to FIGs. 3B and 4A, when the threshold number is not met (e.g., when the short-exposure frame is too dark), short-exposure motion unit 330 may estimate (at 417b) the second local motion M_l using machine learning or interpolation of the first local motion m_l as shown above in Equation (2).
  • the image capturing device may compute a final motion of the object during the first exposure interval, the second exposure interval, and the third exposure interval based at least in part on the first motion and the second motion.
  • local motion computation unit 332 may compute (at 419) a final local motion for multiple stacked frames 201 by combining the first local motion m_l and the second local motion M_l of the object.
  • FIG. 7 illustrates a flowchart of an exemplary method 700 of the ISP of FIGs. 1 and 2, according to some embodiments of the present disclosure.
  • Exemplary method 700 may be performed by an image capturing device, e.g., such as apparatus 100 and/or ISP 120.
  • Method 700 may include steps 702-710 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 7.
  • the image capturing device may obtain multiple stacked frames.
  • the multiple stacked frames may include a first frame (e.g., first long-exposure frame described above in connection with FIG. 3A) captured at a first time using a first exposure interval, a second frame (e.g., short-exposure frame described above in connection with FIG. 3A) captured at a second time using a second exposure interval that is shorter than the first exposure interval, and a third frame (e.g., second long-exposure frame described above in connection with FIG. 3A) captured at a third time using a third exposure interval longer than the second exposure interval.
  • the image capturing device may estimate a global motion associated with the image capturing device during the first exposure interval, the second exposure interval, and the third exposure interval. For example, the image capturing device may perform one or more of the operations described above in connection with one or more of FIGs. 2, 3A, 4A, or 5 to estimate final global motion (at 409c).
  • the image capturing device may estimate a local motion associated with an object in the multiple stacked frames during the first exposure interval, the second exposure interval, and the third exposure interval. For example, the image capturing device may perform one or more of the operations described above in connection with one or more of FIGs. 2, 3B, 4A, or 6 to estimate final local motion (at 419).
  • the image capturing device may rectify the multiple stacked frames based at least in part on the global motion and the local motion. For example, referring to FIGs. 2 and 4A, using final global motion 203 and final local motion 205, rectification unit 208 may (at 421) rectify multiple stacked frames 207. Due to the high degree of accuracy with which final global motion 203 and final local motion 205 are estimated, multiple stacked frames 207 may be more accurately aligned, as compared to known approaches.
  • the image capturing device may generate the HDR image based at least in part on the rectified multiple stacked frames.
  • HDR generator unit 102 may generate (at 423) an HDR image using the multiple stacked frames 201 rectified using final global motion 203 estimated by global motion unit 204 and final local motion 205 estimated by local motion unit 206.
  • Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as apparatus 100 in FIG. 1.
  • computer-readable media can include random-access memory (RAM), read-only memory (ROM), EEPROM, compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD) or other magnetic storage devices, flash drive, or solid state drive (SSD).
  • Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • an apparatus for image signal processing.
  • the apparatus may include an interface unit configured to obtain multiple stacked frames from an image capturing device.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration.
  • the apparatus may also include a rectification unit configured to rectify each of the first frame, the second frame, and the third frame based at least in part on motion sensor data.
  • the apparatus may also include a feature matching unit configured to perform feature matching of the rectified first frame, the rectified second frame, and the rectified third frame.
  • the apparatus may also include a first motion estimation unit configured to estimate a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame.
  • the apparatus may also include a second motion estimation unit configured to estimate a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion.
  • the apparatus may also include a motion computation unit configured to compute a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
  • the motion sensor data may comprise first motion sensor data for the first time, second motion sensor data for the second time, and third motion sensor data for the third time.
  • the interface unit may be further configured to obtain the first motion sensor data for the first time, the second motion sensor data for the second time, and the third motion sensor data for the third time.
  • the rectification unit may be configured to rectify each of the first frame, the second frame, and the third frame by rectifying the first frame based at least in part on the first motion sensor data, rectifying the second frame based at least in part on the second motion sensor data, and rectifying the third frame based at least in part on the third motion sensor data.
  • the motion sensor data may be obtained using one or more of a gyroscope or an accelerometer of the image capturing device.
  • the first motion of the image capturing device may be associated with a time period between a start of the first exposure duration and an end of the third exposure duration.
  • the second motion of the image capturing device may be associated with the second exposure duration.
  • an apparatus for image signal processing.
  • the apparatus may include an interface unit configured to obtain multiple stacked frames using the image capturing device.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, and a second frame captured at a second time using a second exposure duration.
  • the apparatus may also include an identification unit configured to identify an object in the first frame and the second frame.
  • the apparatus may also include a feature matching unit configured to perform feature matching of the object in the first frame and the second frame.
  • the apparatus may also include a first motion estimation unit configured to estimate a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching.
  • the multiple stacked frames may further comprise a third frame captured at a third time using a third exposure duration that may be shorter than the first exposure duration and the second exposure duration.
  • the feature matching of the object may be performed for the first frame, the second frame, and the third frame.
  • the apparatus further includes a second motion estimation unit configured to determine whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number.
  • the second motion estimation unit may be further configured to estimate a second motion of the object during the third exposure duration based at least in part on the feature matching when the number of matched data points meets the threshold number, or estimate the second motion of the object during the third exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number.
  • the apparatus may further include a motion computation unit configured to compute a final motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the first motion and the second motion.
  • the object may be identified based at least in part on image segmentation.
  • an apparatus (e.g., ISP)
  • the apparatus may include an interface unit configured to obtain multiple stacked frames.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration.
  • the apparatus may further include a global motion unit configured to estimate a global motion associated with the image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration.
  • the apparatus may further include a local motion unit configured to estimate a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration.
  • the apparatus may further include a rectification unit configured to rectify the multiple stacked frames based at least in part on the global motion and the local motion.
  • the apparatus may further include an HDR generator unit configured to generate the HDR image based at least in part on the rectified multiple stacked frames.
  • the global motion unit may be configured to estimate the global motion of the image capturing device by rectifying each of the first frame, the second frame, and the third frame based at least in part on motion sensor data. In some other embodiments, the global motion unit may be further configured to estimate the global motion of the image capturing device by performing feature matching of the rectified first frame, the rectified second frame, and the rectified third frame. In some other embodiments, the global motion unit may be further configured to estimate the global motion of the image capturing device by estimating a long-exposure motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame.
  • the global motion unit may be further configured to estimate the global motion of the image capturing device by estimating a short-exposure motion of the image capturing device associated with the rectified second frame by interpolating the long-exposure motion. In some other embodiments, the global motion unit may be further configured to estimate the global motion of the image capturing device by computing the global motion of the image capturing device based at least in part on the long-exposure motion, the short-exposure motion, and the motion sensor data.
  • the motion sensor data may comprise first motion sensor data for the first time, second motion sensor data for the second time, and third motion sensor data for the third time.
  • the interface unit may be further configured to obtain the first motion sensor data for the first time, the second motion sensor data for the second time, and the third motion sensor data for the third time.
  • the rectification unit may be configured to rectify each of the first frame, the second frame, and the third frame by rectifying the first frame based at least in part on the first motion sensor data, rectifying the second frame based at least in part on the second motion sensor data, and rectifying the third frame based at least in part on the third motion sensor data.
  • the local motion unit may be configured to estimate the local motion associated with the object in the multiple stacked frames by identifying the object in the first frame, the second frame, and the third frame. In some embodiments, the local motion unit may be configured to estimate the local motion associated with the object in the multiple stacked frames by performing feature matching of the object in the first frame, the second frame, and the third frame. In some embodiments, the local motion unit may be configured to estimate the local motion associated with the object in the multiple stacked frames by estimating a long-exposure motion of the object during the first exposure duration and the third exposure duration based at least in part on the feature matching.
  • the local motion unit may be further configured to determine whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number.
  • the local motion unit may be further configured to estimate a short-exposure motion of the object during the second exposure duration based at least in part on the long-exposure motion when the number of matched data points meets the threshold number. In some other embodiments, the local motion unit may be further configured to estimate the short-exposure motion of the object during the second exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number.
  • the local motion unit may be further configured to compute the local motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the long-exposure motion and the short-exposure motion.
  • a method of image signal processing may be performed by, e.g., global motion unit 204.
  • the method may include obtaining multiple stacked frames using the image capturing device.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration.
  • the method may further include rectifying each of the first frame, the second frame, and the third frame based at least in part on motion sensor data.
  • the method may further include performing feature matching of the rectified first frame, the rectified second frame, and the rectified third frame.
  • the method may further include estimating a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame.
  • the method may further include estimating a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion.
  • the method may further include computing a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
  • a method of image signal processing may be performed by, e.g., local motion unit 206.
  • the method may include obtaining multiple stacked frames using an image capturing device.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration, and a third frame captured at a third time using a third exposure duration that may be shorter than the first exposure duration and the second exposure duration.
  • the method may further include identifying an object in the first frame and the second frame.
  • the method may further include performing feature matching of the object in the first frame, the second frame, and the third frame.
  • the method may further include estimating a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching.
  • the method may further include determining whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number.
  • the method may further include estimating a second motion of the object during the third exposure duration based at least in part on the feature matching when the number of matched data points meets the threshold number, or estimating the second motion of the object during the third exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number.
  • the method may further include computing a final motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the first motion and the second motion.
  • a method of image signal processing may be performed by, e.g., ISP 120.
  • the method may include obtaining multiple stacked frames.
  • the multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration.
  • the method may further include estimating a global motion associated with the image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration.
  • the method may further include estimating a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration.
  • the method may further include rectifying the multiple stacked frames based at least in part on the global motion and the local motion.
  • the method may further include generating the HDR image based at least in part on the rectified multiple stacked frames (a minimal end-to-end sketch follows this list).
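As an illustration of the method summarized in the bullets above, the sketch below strings the described steps together in Python. It is only a schematic outline under assumed interfaces: the callables estimate_global_motion, estimate_local_motion, rectify, and merge_hdr are hypothetical stand-ins for the global motion unit, local motion unit, rectification unit, and HDR generator unit, and are not APIs defined in this disclosure.

```python
def generate_hdr(frames, exposures, imu_data,
                 estimate_global_motion, estimate_local_motion,
                 rectify, merge_hdr):
    """Schematic end-to-end flow for HDR generation from stacked frames.

    frames: list of stacked frames (e.g., long, short, long exposure)
    exposures: matching exposure durations
    imu_data: per-frame motion sensor samples
    The four callables stand in for the units described above.
    """
    # Estimate camera (global) motion across the exposure intervals.
    global_motion = estimate_global_motion(frames, exposures, imu_data)
    # Estimate object (local) motion within the stacked frames.
    local_motion = estimate_local_motion(frames, exposures)
    # Rectify each frame using both motion estimates before merging.
    rectified = [rectify(f, global_motion, local_motion) for f in frames]
    # Merge the aligned frames into a single HDR image.
    return merge_hdr(rectified, exposures)
```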

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Studio Devices (AREA)

Abstract

An image signal processor (ISP) is disclosed that uses image-based feature mapping to estimate both global motion and local motion for multiple stacked frames. For global motion, rectifying the multiple stacked frames using inertial measurement unit (IMU) data may reduce the range of the feature field search used in image feature mapping. The mapped features may then be used to estimate the global translation and rotation of the images. For local motion, object identification may be performed prior to feature mapping. Object identification may be performed using image segmentation. Feature mapping may then be performed to estimate the local motion of an object in the long-exposure interval frames. The local motion of the object in short-exposure frames may be estimated by interpolation or machine learning. The final motion of the object in all frames may be computed using the local motion estimated for the long-exposure and short-exposure frames.

Description

METHOD AND APPARATUS FOR OPERATING AN IMAGE SIGNAL
PROCESSOR
BACKGROUND
[0001] Embodiments of the present disclosure relate to apparatuses and methods for operating an image signal processor (ISP).
[0002] An image/video capturing device, such as a camera or camera array, can be used to capture an image/video or a picture of a scene. Cameras or camera arrays have been included in many handheld devices, especially since the advent of social media that allows users to upload pictures and videos of themselves, friends, family, pets, or landscapes on the internet with ease and in real-time. Examples of camera components that operate together to capture an image/video include lens(es), image sensor(s), ISP(s), and/or encoders, just to name a few components thereof. The lens, for example, may receive and focus light onto one or more image sensors that are configured to detect photons. When photons impinge on the image sensor, an image signal corresponding to the scene is generated and sent to the ISP. The ISP performs various operations associated with the image signal to generate one or more processed images of the scene that can then be output to a user, stored in memory, or output to the cloud.
SUMMARY
[0003] Embodiments of apparatus and method for operating an ISP are disclosed herein.
[0004] According to one aspect of the present disclosure, an apparatus (e.g., global motion unit 204) for image signal processing is disclosed. The apparatus may include an interface unit configured to obtain multiple stacked frames from an image capturing device. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration. The apparatus may also include a rectification unit configured to rectify each of the first frame, the second frame, and the third frame based at least in part on motion sensor data. The apparatus may also include a feature matching unit configured to perform feature matching of the rectified first frame, the rectified second frame, and the rectified third frame. The apparatus may also include a first motion estimation unit configured to estimate a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame. The apparatus may also include a second motion estimation unit configured to estimate a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion. The apparatus may also include a motion computation unit configured to compute a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
[0005] According to another aspect of the present disclosure, an apparatus (e.g., local motion unit 206) for image signal processing is disclosed. The apparatus may include an interface unit configured to obtain multiple stacked frames using the image capturing device. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, and a second frame captured at a second time using a second exposure duration. The apparatus may also include an identification unit configured to identify an object in the first frame and the second frame. The apparatus may also include a feature matching unit configured to perform feature matching of the object in the first frame and the second frame. The apparatus may also include a first motion estimation unit configured to estimate a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching.
[0006] According to another aspect of the present disclosure, an apparatus (e.g., ISP
120) for image signal processing is disclosed. The apparatus may include an interface unit configured to obtain multiple stacked frames. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration. The apparatus may further include a global motion unit configured to estimate a global motion associated with the image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration. The apparatus may further include a local motion unit configured to estimate a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration. The apparatus may further include a rectification unit configured to rectify the multiple stacked frames based at least in part on the global motion and the local motion. The apparatus may further include a high dynamic range (HDR) generator unit configured to generate the HDR image based at least in part on the rectified multiple stacked frames.
[0007] According to another aspect of the present disclosure, a method of image signal processing is disclosed. The method may be performed by, e.g., global motion unit 204. The method may include obtaining multiple stacked frames using the image capturing device. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration. The method may further include rectifying each of the first frame, the second frame, and the third frame based at least in part on motion sensor data. The method may further include performing feature matching of the rectified first frame, the rectified second frame, and the rectified third frame. The method may further include estimating a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame. The method may further include estimating a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion. The method may further include computing a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
[0008] According to another aspect of the present disclosure, a method of image signal processing is disclosed. The method may be performed by, e.g., local motion unit 206. The method may include obtaining multiple stacked frames using an image capturing device. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration, and a third frame captured at a third time using a third exposure duration that may be shorter than the first exposure duration and the second exposure duration. The method may further include identifying an object in the first frame and the second frame. The method may further include performing feature matching of the object in the first frame, the second frame, and the third frame. The method may further include estimating a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching. The method may further include determining whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number. The method may further include estimating a second motion of the object during the third exposure duration based at least in part on the feature matching when the number of matched data points meets the threshold number, or estimating the second motion of the object during the third exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number. The method may further include computing a final motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the first motion and the second motion.
[0009] According to another aspect of the present disclosure, a method of image signal processing is disclosed. The method may be performed by, e.g., ISP 120. The method may include obtaining multiple stacked frames. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration. The method may further include estimating a global motion associated with the image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration. The method may further include estimating a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration. The method may further include rectifying the multiple stacked frames based at least in part on the global motion and the local motion. The method may further include generating the HDR image based at least in part on the rectified multiple stacked frames.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.
[0011] FIG. 1 illustrates a block diagram of an exemplary apparatus that includes an
ISP, according to some embodiments of the present disclosure.
[0012] FIG. 2 illustrates a detailed view of the exemplary ISP depicted in the apparatus of FIG. 1, according to some embodiments of the present disclosure.
[0013] FIG. 3A illustrates a block diagram of an exemplary global motion unit of the ISP depicted in FIGs. 1 and 2, according to some embodiments of the present disclosure.
[0014] FIG. 3B illustrates a block diagram of an exemplary local motion unit of the ISP depicted in FIGs. 1 and 2, according to some embodiments of the present disclosure.
[0015] FIG. 4A illustrates a first conceptual depiction of an exemplary data flow between the global motion unit of FIG. 3A and the local motion unit of FIG. 3B, according to some embodiments of the present disclosure.
[0016] FIG. 4B illustrates a second conceptual depiction of an exemplary data flow of a high dynamic range (HDR) unit of the ISP of FIGs. 1 and 2, according to some embodiments of the present disclosure.
[0017] FIG. 5 illustrates a flowchart of an exemplary method of the global motion unit of FIG. 3A, according to some embodiments of the present disclosure.
[0018] FIG. 6 illustrates a flowchart of an exemplary method of the local motion unit of FIG. 3B, according to some embodiments of the present disclosure.
[0019] FIG. 7 illustrates a flowchart of an exemplary method of the ISP of FIGs. 1 and
2, according to some embodiments of the present disclosure.
[0020] Embodiments of the present disclosure will be described with reference to the accompanying drawings.
DETAILED DESCRIPTION
[0021] Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications.
[0022] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
[0023] In general, terminology may be understood at least in part from usage in context.
For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
[0024] Various aspects of method and apparatus will now be described. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, units, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.
[0025] For ease of nomenclature, the term “camera” is used herein to refer to an image capture device or other data acquisition device. Such a data acquisition device can be any device or system for acquiring, recording, measuring, estimating, determining and/or computing data representative of a scene, including but not limited to two-dimensional image data, three-dimensional image data, and/or light field data. Such a data acquisition device may include optics, sensors, and image processing electronics for acquiring data representative of a scene, using techniques that are well known in the art. One skilled in the art will recognize that many types of data acquisition devices can be used in connection with the present disclosure, and that the present disclosure is not limited to cameras. Thus, the use of the term “camera” herein is intended to be illustrative and exemplary, but should not be considered to limit the scope of the present disclosure. Specifically, any use of such term herein should be considered to refer to any suitable data acquisition device.
[0026] In addition, the term “frame,” as used herein, may be defined as a data entity
(stored, for example, in a file) containing a description of a state corresponding to a single captured sensor exposure in a camera. This state includes the sensor image, and other relevant camera parameters, specified as metadata. The sensor image may be either a raw image or a compressed representation of the raw image. The terms “frame” and “image” may be used interchangeably in the description below. The term “multiple stacked frames,” as used herein, may be defined as a set of frames captured with exposure intervals of the same or varying lengths. The exposure intervals may be continuous or intermittent. An example of continuous exposure intervals includes a first frame captured with an exposure interval from t_n to t_n+1, a second frame captured with an exposure interval from t_n+1 to t_n+3, and a third frame captured with an exposure interval from t_n+3 to t_n+6. Moreover, “exposure interval,” as used herein, may be defined as the length of time that one or more of the lens and/or the shutter remain open, and/or the length of time the image sensor receives light from a scene being captured in a single frame. The term “long-exposure” as used herein may be defined as an exposure interval longer than a threshold duration, where the threshold duration is longer than a “short-exposure” duration. Long-exposure frames may be captured to increase the brightness of regions that are undesirably dark in the short-exposures when generating an HDR image.
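The frame definition above describes a small data entity: the sensor image plus capture metadata such as its exposure interval. A minimal sketch in Python follows; the class and field names are illustrative assumptions, not structures defined in this disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Frame:
    """Sketch of the frame data entity: sensor image plus metadata."""
    image: np.ndarray        # raw image or a compressed representation of it
    exposure_start: float    # start of the exposure interval, in seconds
    exposure_end: float      # end of the exposure interval, in seconds

    @property
    def exposure_duration(self) -> float:
        # Length of time the sensor received light for this frame.
        return self.exposure_end - self.exposure_start
```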
[0027] Multiple stacked frames may be used to generate an HDR image. HDR imaging is a technique used in imaging and photography to capture a greater dynamic range between the lightest and darkest areas of an image as compared to standard digital imaging techniques. HDR images may represent more accurately the range of intensity levels found in real scenes, from direct sunlight to faint starlight, and the images are often captured by exposing the same subject matter with a plurality of different exposure intervals or levels.
[0028] Non-HDR cameras take pictures at one exposure interval with a limited contrast range. This results in the loss of detail in bright or dark areas of the picture, depending on whether the camera had a low or high exposure setting. HDR compensates for this loss of detail by taking multiple pictures at different exposure levels and intelligently stitching them together to produce a picture that is representative in both dark and bright areas.
[0029] HDR methods provide a higher dynamic range from the imaging process. Non-
HDR cameras take pictures at one exposure level with a limited contrast range. This results in the loss of detail in bright or dark areas of the picture, depending on whether the camera had a low or high exposure setting. HDR compensates for this loss of detail by taking multiple pictures at different exposure levels and intelligently stitching them together to produce a picture that is representative in both dark and bright areas.
[0030] HDR is also commonly used to refer to the display of images derived from HDR imaging in a way that exaggerates contrast for artistic effects. The two main sources of HDR images are computer renderings and merging of multiple stacked frames. These multiple stacked frames may include, e.g., multiple low-dynamic-range (LDR) images or standard-dynamic-range (SDR) images. Tone mapping methods, which reduce overall contrast to facilitate the display of HDR images on devices with lower dynamic range, can be applied to produce images with preserved or exaggerated local contrast for artistic effect. HDR images are generally achieved by capturing multiple standard photographs (e.g., multiple stacked frames), often using two or three different exposure intervals, and then merging them into an HDR image.
[0031] Forming an HDR image of a scene may involve capturing images of the scene using two or more exposure times (e.g., short exposures, long exposures, and possibly other exposure times as well), and then combining these images so that the resulting output image displays details of both the bright and dark regions of the scene. However, combining short-exposure images and long-exposure images may be difficult due to blur that is caused by local motion and/or global motion. Local motion includes motion of elements or details of the content recorded in the pictures, and global motion includes motion resulting from movement of the camera or the user, for example, the person holding the device. Optionally, the motion may include translational motion, rotational motion, and scaling motion. Further, motion can cause these combining procedures to fail when the short-exposure image and long-exposure image cannot be properly aligned.
[0032] Known techniques, such as vision-based motion estimation, may be used to estimate local motion and/or global motion and align the frames to generate an HDR image. However, to achieve a high degree of accuracy using these known approaches, an undesirable amount of time and computational complexity is required. Moreover, in low light conditions, motion estimation using these known approaches may lead to computational inaccuracies due to the noise caused by inadequate lighting. Certain image capturing devices, such as smartphones, may include an inertial measurement unit (IMU) to measure the rotation angle and speed of the camera. However, the IMU bias can be large, which may introduce further inaccuracies in motion estimation.
[0033] Thus, there is an unmet need for an approach to estimate local motion and global motion that achieves a high degree of accuracy, while at the same time uses a reduced amount of time and computational complexity to perform the computations, as compared to conventional techniques.
[0034] Due to the challenges imposed by the known techniques for estimating local motion and global motion with a high degree of accuracy, the present disclosure provides a solution that uses image-based feature mapping to estimate both global motion and local motion. For global motion, the IMU data may be used to rectify the multiple stacked frames prior to performing feature matching. Rectifying the multiple stacked frames using the IMU data may reduce the range of the feature field search used in image feature mapping. This reduction in the range may reduce the computational time of the image feature mapping operation. Moreover, rectifying multiple stacked frames using IMU data may exclude some falsely detected features, which may increase the accuracy of the feature matching computation. The mapped features may then be used to estimate the global translation and rotation of the images with a high degree of accuracy.
[0035] Moreover, image feature mapping may also be used to estimate local motion using the present techniques. Here, object identification for two long-exposure frames with continuous exposure intervals may be performed using image segmentation. Feature mapping may then be performed to estimate the local motion of an object in the continuous exposure intervals of the long-exposure interval frames. Then, depending on whether the number of matched features meets a threshold number, the local motion of the object in short-exposure interval frames may be estimated by interpolation or machine learning. The final motion of the object in all frames may then be computed using the local motion estimated for the long-exposure interval frames and the short-exposure interval frames. Again, the image feature mapping used in local motion estimation increases the accuracy of the computation, while at the same time reducing the computational time and complexity, as compared to known approaches. Additional details describing the present approach for global motion estimation and local motion estimation are set forth below in connection with FIGs. 1-7.
[0036] FIG. 1 illustrates an exemplary block diagram of an apparatus 100 having an image signal processor (ISP), according to some embodiments. Apparatus 100 may include an application processor (AP) 130, an ISP 120, a memory 118, and input-output devices 108. ISP 120 may include an HDR generator unit 102, a motion estimation unit 104, and a camera 106. Input-output device 108 may include user input devices 110, IMU 112 (e.g., translational motion sensor, rotational motion sensor, accelerometer, gyroscope, etc.), display and audio devices 114, and wireless communication devices 116. In some embodiments, apparatus 100 may be an image capturing device, such as a smartphone or digital camera.
[0037] AP 130 may be the main application processor of apparatus 100, and may host the operating system (OS) of apparatus 100 and all the applications. AP 130 may be any kind of general-purpose processor such as a microprocessor, a microcontroller, a digital signal processor, or a central processing unit, and other needed integrated circuits such as glue logic. The term “processor” may refer to a device having one or more processing units or elements, e.g., a central processing unit (CPU) with multiple processing cores. AP 130 may be used to control the operations of apparatus 100 by executing instructions stored in memory 118, which can be in the same chip as AP 130 or in a separate chip from AP 130. AP 130 may also generate control signals and transmit them to various parts of apparatus 100 to control and monitor the operations of these parts. In some embodiments, AP 130 can run the OS of apparatus 100, control the communications between the user and apparatus 100, and control the operations of various applications. For example, AP 130 may be coupled to a communications circuitry and execute software to control the wireless communications functionality of apparatus 100. In another example, AP 130 may be coupled to ISP 120 and input-output devices 108 to control the processing and display of sensor data, e.g., image data, one or more frames, HDR images, LDR images, etc.
[0038] ISP 120 may include software and/or hardware operatively coupled to AP 130 and input-output devices 108. In some embodiments, components, e.g., circuitry, of ISP 120 may be integrated on a single chip. In some embodiments, ISP 120 includes an image processing hardware coupled to (e.g., placed between) AP 130 and HDR generator unit 102/motion estimator unit 104/camera 106. ISP 120 may include a suitable circuitry that, when controlled by AP 130, performs functions not supported by AP 130, e.g., processing raw image data by rectifying frames with respect to IMU data, estimating global motion and local motion associated with multiple stacked frames, performing feature matching, object identification, aligning frames, HDR generation etc. In various embodiments, ISP 120 may include a field- programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), a microprocessor, a microcontroller, a digital signal processor, and other needed integrated circuits for its purposes.
[0039] FIG. 2 illustrates a detailed block diagram of the ISP 120 of FIG. 1, according to some embodiments of the disclosure. FIG. 3A illustrates a block diagram of an exemplary global motion unit 204 that may be included in ISP 120, according to some embodiments of the disclosure. FIG. 3B illustrates a block diagram of an exemplary local motion unit 206 that may be included in ISP 120, according to some embodiments of the disclosure. FIG. 4A illustrates a first conceptual depiction of an exemplary data flow 400 between the global motion unit of FIG. 3A and the local motion unit of FIG. 3B, according to some embodiments of the present disclosure. FIG. 4B illustrates a second conceptual depiction of an exemplary data flow 450 of an exemplary HDR generator unit 102 that may be included in ISP 120, according to some embodiments of the present disclosure. FIGs. 2, 3A, 3B, 4A, and 4B will be described together.
[0040] Referring to FIG. 2, AP 130 may transmit control signals and other data to ISP
120 via an internal bus to control the operation of ISP 120. Certain other components, such as a power management integrated circuit (PMIC) 216 and a memory 214, are operatively coupled to motion estimation unit 104 for the operation thereof. PMIC 216 may be a dedicated power management circuitry that provides the power/voltages necessary for the operation of motion estimator unit 104. Memory 214 may be a dedicated low-power double data rate (LPDDR) device that provides the caching/storage functions necessary for the operation of ISP 120. In some embodiments, memory 214 includes a dynamic random-access memory (DRAM).
[0041] As mentioned above, ISP 120 may include HDR generator unit 102, motion estimator unit 104, and camera 106, operatively coupled to one another. Each of HDR generator unit 102, motion estimator unit 104, and camera 106 may include suitable software and hardware for the functions of ISP 120. Motion estimator unit 104 may be operatively coupled to camera 106 and HDR generator unit 102 of ISP 120. Moreover, motion estimator unit 104 may be operatively coupled to IMU 112 of input-output devices 108 depicted in FIG. 1.
[0042] Camera 106 may be configured to operate in HDR mode or non-HDR mode.
While operating in HDR mode, camera 106 may be configured to capture a burst of consecutive digital images (also referred to herein as “multiple stacked frames”) of a scene automatically when activated, for example, when a user presses an activation button or when AP 130 sends an instruction to operate in HDR mode. Alternatively, a user can activate camera 106 multiple times while pointing at a specific scene to acquire a burst of digital images even if the imaging device is operating in non-HDR mode, e.g., camera 106 captures one picture responsive to each command. Moreover, camera 106 may include multiple cameras configured to capture a burst of frames from different angles, positions, or perspectives. Still further, multiple stacked frames of the same scene may be captured using camera 106 and one or more external cameras. The external cameras may be part of apparatus 100 but located external to ISP 120 or external to apparatus 100. ISP 120 may store the multiple stacked frames in memory 214 and/or send them for storage in memory 118 of apparatus 100. The frames captured by external camera(s) may be sent to the AP 130, ISP 120, memory 118, or memory 214 and stored with the frames captured using camera 106 for processing together. In some embodiments, the frames in the burst may differ from each other due to global motion or local motion. As mentioned previously, the global motion may include motion that results from the movement of apparatus 100 between frames. For example, the global motion may occur when apparatus 100 or a user holding apparatus 100 moves while camera 106 captures a burst of frames. Local motion, as mentioned above, may occur when an object (e.g., a person, animal, vehicle, waves, landscape, etc.) captured during the burst changes position.
[0043] To analyze and align multiple stacked frames, ISP 120 and/or AP 130 may activate motion estimator unit 104. Motion estimator unit 104 may include global motion unit 204, local motion unit 206, and rectification unit 208. Global motion unit 204 may be configured to estimate global motion for multiple stacked frames, e.g., using the operations described below in connection with FIG. 3A. Local motion unit 206 may be configured to estimate local motion from multiple stacked frames using, e.g., the operations described below in connection with FIG. 3B. Using the estimated global and local motion, rectification unit 208 may be configured to adjust and/or rectify the multiple stacked frames before sending them to HDR generator unit 102 for HDR image generation.
[0044] As seen in FIG. 3A, global motion unit 204 may include interface unit 302, sensor rectification unit 304, feature mapping unit 306, long-exposure motion unit 308, short-exposure motion unit 310, and global motion computation unit 312.
[0045] Interface unit 302 may be configured to receive (at 401) multiple stacked frames
201, e.g., from one or more of camera 106, memory 214, and/or memory 118, and to receive IMU data 209 from IMU 112. Multiple stacked frames 201 may include a mix of long-exposure frames and short-exposure frames captured of the same scene. Interface unit 302 may send multiple stacked frames 201 and IMU data 209 to sensor rectification unit 304, which rectifies each of the frames in the stack using the IMU data 209. For example, multiple stacked frames 201 may include a first long-exposure frame captured during a first exposure interval, a short-exposure frame captured during a second exposure interval, and a second long-exposure frame captured during a third exposure interval. Interface unit 302 may receive (at 403) IMU data 209 from IMU 112. IMU data 209 may include translational and/or rotational motion data associated with apparatus 100, which may be obtained by the accelerometer and/or gyroscope of IMU 112. For example, IMU data 209 may include first motion sensor data obtained by IMU 112 during the first exposure interval, second motion sensor data obtained by IMU 112 during the second exposure interval, and third motion sensor data obtained by IMU 112 during the third exposure interval.
[0046] Using portions of IMU data 209 that roughly correspond to the respective exposure intervals, sensor rectification unit 304 may perform an initial rectification (at 405) of the raw image data of the first long-exposure frame, the short-exposure frame, and the second long-exposure frame to reduce and/or remove image distortion due to global motion that occurred during those time periods. However, because the motion sensors of IMU 112 and frame capture may not be synchronized, and the sensors of IMU 112 provide limited accuracy with respect to global motion, IMU data 209 is only one rough source for image rectification, which is why it is referred to as an “initial rectification.” After the initial rectification, sensor rectification unit 304 may send the multiple stacked frames 201 and the initial rectification information to feature mapping unit 306. Feature mapping may subsequently be performed to estimate global motion with a high degree of accuracy. However, initial rectification of the frames using IMU data 209 may reduce the range of the feature field search operation performed by feature matching unit 306, which may reduce the time used to complete the feature matching operation. In general feature matching, the search may be performed over the whole image size between two frames due to the unknown motion range. With initial rectification using IMU data 209, the two frames are roughly aligned, and the search range used in feature mapping may be limited to a much smaller size, e.g., one-tenth of the frame size.
[0047] Additionally and/or alternatively, the initial rectification using IMU data 209 may exclude some falsely detected features to improve the image feature matching accuracy. When the feature matching search covers a large frame area, there is an increased chance that a feature in one frame may be incorrectly matched to an area in another frame that does not include the same feature. As mentioned above, by reducing the search range using initial rectification with IMU data 209, the search range is much smaller, and features that would be falsely matched with a larger search range may be excluded.
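To make the search-range reduction concrete, the sketch below matches features only within a small window around each keypoint, on the assumption that IMU-based initial rectification has already roughly aligned the two frames. The function and parameter names are illustrative, the descriptors are assumed to be L2-comparable vectors, and the window size (e.g., about one-tenth of the frame size) follows the example above.

```python
import numpy as np

def match_within_window(kps_a, desc_a, kps_b, desc_b, window):
    """Brute-force matching restricted to a small spatial window.

    kps_a, kps_b: lists of (x, y) keypoint locations in each rectified frame
    desc_a, desc_b: corresponding descriptor vectors (numpy arrays)
    window: half-size of the search window in pixels
    """
    matches = []
    for i, (pa, da) in enumerate(zip(kps_a, desc_a)):
        best_j, best_dist = -1, np.inf
        for j, (pb, db) in enumerate(zip(kps_b, desc_b)):
            # Skip candidates outside the reduced search range.
            if abs(pa[0] - pb[0]) > window or abs(pa[1] - pb[1]) > window:
                continue
            dist = float(np.linalg.norm(da - db))
            if dist < best_dist:
                best_j, best_dist = j, dist
        if best_j >= 0:
            matches.append((i, best_j))
    return matches
```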
[0048] Feature mapping unit 306 may perform feature matching (at 407) of the rectified first long-exposure frame, the rectified short-exposure frame, and the rectified second long-exposure frame. In some embodiments, feature mapping unit 306 may include a graph neural network, such as SuperGlue, that matches two sets of local features by jointly finding correspondences between the frames and rejecting non-matchable points. Matched features may be estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. Feature mapping unit 306 may use a flexible context aggregation mechanism based on attention, which enables reasoning about the underlying three-dimensional (3D) scene and feature assignments jointly. Feature mapping unit 306 may learn priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. The output of the neural network may include features that are mapped between each of, e.g., the first long-exposure frame, the short-exposure frame, and the second long-exposure frame. These mapped features may be sent to global motion computation unit 312.
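The disclosure describes a learned graph-matching network for this step; as a rough, non-equivalent stand-in, the sketch below uses classical ORB features with brute-force matching from OpenCV to produce correspondences between two rectified frames. This is purely illustrative and is not the learned matcher described above; the function name and parameters are assumptions.

```python
import cv2

def match_features(img_a, img_b, max_matches=200):
    """Classical ORB + brute-force matching between two grayscale frames."""
    orb = cv2.ORB_create(nfeatures=1000)
    kps_a, desc_a = orb.detectAndCompute(img_a, None)
    kps_b, desc_b = orb.detectAndCompute(img_b, None)
    if desc_a is None or desc_b is None:
        return kps_a, kps_b, []  # one frame had no detectable features
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(desc_a, desc_b), key=lambda m: m.distance)
    return kps_a, kps_b, matches[:max_matches]
```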
[0049] In a low light environment, the short exposure frame may have a lower signal-to-noise ratio (SNR) compared with the long-exposure frames. Consequently, it may be challenging to obtain accurate image matching information for short-exposure frames using feature mapping, especially when the short-exposure frame(s) are dark. In this situation, the image feature matching of the short-exposure frame may fail. Here, long-exposure motion unit 308 may estimate (at 409a) a first global motion m_g based on the feature mapping of the first long-exposure frame and the second long-exposure frame, where the first long-exposure frame is captured before the short-exposure frame, and the second long-exposure frame is captured after the short-exposure frame. Then, short-exposure motion unit 310 may estimate (at 409b) a second global motion for the short-exposure frame by interpolating the first global motion m_g. For example, short-exposure motion unit 310 may estimate the second global motion M_g using Equation (1):
M_g = w_a * m_g * (t_s / (t_L1 + t_L2)) + w_b * g  (1), where t_L1 is the duration of the exposure interval associated with the first long-exposure frame, t_L2 is the duration of the exposure interval associated with the second long-exposure frame, t_s is the duration of the exposure interval for the short-exposure frame, g is the gyroscope information from IMU data 209, and w_a and w_b are both weights.
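A minimal sketch of Equation (1) in Python follows. Note that the exact form of the duration ratio and the weight values are reconstructed from the garbled original and the surrounding definitions, so they should be read as assumptions rather than the exact disclosed formula.

```python
def short_exposure_global_motion(m_g, gyro, t_l1, t_l2, t_s, w_a=0.5, w_b=0.5):
    """Interpolate the short-exposure global motion M_g (Equation (1)).

    m_g: global motion estimated between the two long-exposure frames
    gyro: gyroscope information for the short-exposure interval (IMU data)
    t_l1, t_l2: durations of the two long-exposure intervals
    t_s: duration of the short-exposure interval
    w_a, w_b: weights (values assumed for illustration)
    """
    return w_a * m_g * (t_s / (t_l1 + t_l2)) + w_b * gyro
```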
[0050] Finally, global motion computation unit 312 may combine the first global motion m_g and the second global motion M_g with IMU data 209 to compute (at 409c) the final global motion of apparatus 100 for the multiple stacked frames 201 with a high degree of accuracy. The first global motion m_g and the second global motion M_g are filtered with IMU data 209, e.g., using a Kalman filter, to increase the accuracy of the final global motion estimation.
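The filtering step above is only named as a Kalman filter; the following is a minimal scalar Kalman update fusing an image-based motion estimate with an IMU measurement of the same quantity, offered as an illustration rather than the disclosed filter. The variance values are assumptions.

```python
def kalman_fuse(motion_estimate, est_var, imu_measurement, imu_var):
    """One scalar Kalman update: fuse an image-based motion estimate
    with an IMU measurement of the same motion component."""
    gain = est_var / (est_var + imu_var)          # Kalman gain
    fused = motion_estimate + gain * (imu_measurement - motion_estimate)
    fused_var = (1.0 - gain) * est_var            # reduced uncertainty
    return fused, fused_var
```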
[0051] Turning to FIG. 3B, local motion unit 206 may include interface unit 322, object identification unit 324, feature matching unit 326, long-exposure motion unit 328, short-exposure motion unit 330, and/or local motion computation unit 332. Interface unit 322 may be configured to receive (at 401) multiple stacked frames 201, e.g., from one or more of camera 106, memory 214, and/or memory 118. Multiple stacked frames 201 may be the same as those received by interface unit 302 of global motion unit 204. In some embodiments, interface unit 322 may also be configured to receive IMU data 209 from IMU 112. Interface unit 322 may send multiple stacked frames 201 to object identification unit 324, which is configured to identify (at 411) one or more objects in each of the frames. By way of example and not limitation, object identification unit 324 may be configured to identify people, anatomy, facial expressions, animals, toys, vehicles, foliage, water, clothing, etc. Object identification unit 324 may perform image segmentation (at 411) to identify object(s) in multiple stacked frames 201. For example, object identification unit 324 may perform image segmentation by partitioning each frame into multiple segments, which may be sets of pixels, also known as image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation may include locating objects and boundaries (lines, curves, etc.) in each of the frames. Each pixel or segment is assigned a label such that pixels with the same label in each of the different frames share certain characteristics. Then, the image segmentation information may be sent to feature mapping unit 326.
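The disclosure leaves the segmentation method open; as one rough illustration, the sketch below thresholds a grayscale frame and labels connected regions with OpenCV so that pixels belonging to the same object share a label. The function name and the choice of Otsu thresholding plus connected components are assumptions for illustration only.

```python
import cv2

def segment_objects(frame_gray):
    """Label connected regions of an 8-bit grayscale frame.

    Returns (num_labels, labels), where labels[y, x] is the segment
    identifier assigned to each pixel.
    """
    # Separate foreground from background with an automatic threshold.
    _, mask = cv2.threshold(frame_gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Assign one label per connected foreground region.
    num_labels, labels = cv2.connectedComponents(mask)
    return num_labels, labels
```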
[0052] Feature mapping unit 326 may perform (at 413) feature mapping using a graph neural network the same as or similar to the one described above in connection with feature mapping unit 306 of global motion unit 204, except that here intensive features near the boundary of an object (as indicated in the image segmentation information) may be used to match features in the frames. Then, the feature mapping information may be sent to long-exposure motion unit 328 and/or short-exposure motion unit 330 to estimate the local motion of the object.
[0053] Long-exposure motion unit 328 may use two long-exposure frames captured with continuous exposure intervals to estimate (at 415) a first local motion m_l. For example, long-exposure motion unit 328 may estimate the first local motion m_l by estimating the distance and/or direction of the object’s motion based on the differences in the feature mapping information between the two long-exposure frames. Similar to global motion, it may be challenging to use feature mapping to estimate local motion for short-exposure frames when they are dark. This is because the feature mapping of short-exposure frames may be inaccurate, as the boundaries of an object may be difficult for feature mapping unit 326 to identify. When the number of matched features meets a threshold number (e.g., as determined by short-exposure motion unit 330 or feature mapping unit 326), short-exposure motion unit 330 may estimate (at 417a) the second local motion M_l between a short-exposure frame and a long-exposure frame using feature mapping between these two frames. Otherwise, when the threshold number is not met (e.g., when the short-exposure frame is too dark), short-exposure motion unit 330 may estimate (at 417b) the second local motion M_l using machine learning or interpolation of the first local motion m_l as shown below in Equation (2):
Mi = mi × ts / (tL1 + tL2) + n    (2),

where tL1 is the exposure duration of the first long-exposure frame, tL2 is the exposure duration of the second long-exposure frame, ts is the exposure duration of the short-exposure frame, and n is noise with Gaussian distribution N(0, σ). To increase the accuracy of local motion estimation (at 417b), interpolation may occur using frames that are rectified/shifted based on the final global motion estimation.
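A minimal numerical sketch of this interpolation follows, assuming the exposure-time-ratio reading of Equation (2) given above; the function name, example exposure times, and noise handling are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

def interpolate_local_motion(m_long, t_l1, t_l2, t_s, sigma=0.0, rng=None):
    """Estimate the short-exposure local motion Mi from the long-exposure
    motion mi by scaling with the exposure-time ratio and adding Gaussian
    noise n ~ N(0, sigma), per the reading of Equation (2) used here."""
    rng = rng if rng is not None else np.random.default_rng()
    m_long = np.asarray(m_long, dtype=float)
    noise = rng.normal(0.0, sigma, size=m_long.shape) if sigma > 0 else 0.0
    return m_long * (t_s / (t_l1 + t_l2)) + noise

# Example: 6 px horizontal and 2 px vertical motion across two 30 ms long
# exposures, interpolated down to a 5 ms short exposure.
m_short = interpolate_local_motion([6.0, 2.0], 0.030, 0.030, 0.005, sigma=0.1)
```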
[0054] Finally, local motion computation unit 332 may compute (at 419) a final local motion for multiple stacked frames 201 by combining the first local motion mi and second local motion Mi of the object.
[0055] Referring again to FIG. 2, rectification unit 208 may receive information associated with the final global motion 203 and the final local motion 205 from global motion unit 204 and local motion unit 206, respectively. Rectification unit 208 may receive multiple stacked frames 201 from camera 106 either directly or indirectly via the motion unit(s) of AP 130. Then, using final global motion 203 and final local motion 205, rectification unit 208 may (at 421) rectify multiple stacked frames 207. Due to the high degree of accuracy with which final global motion 203 and final local motion 205 are estimated, multiple stacked frames 207 may be more accurately aligned, as compared to known approaches. Referring to FIG. 4B, HDR generator unit 102 may generate (at 423) an HDR image using the multiple stacked frames 201 rectified using final global motion 203 estimated by global motion unit 204 and final local motion 205 estimated by local motion unit 206.
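The sketch below illustrates one simplified way such a rectification could be realized, treating both the final global motion and the final local motion as pure pixel translations; the disclosure does not limit rectification to translations, and all names here are hypothetical.

```python
import numpy as np
import cv2

def rectify_frame(frame, global_shift, local_shift, object_mask):
    """Undo a translational global shift for the whole frame, then undo an
    additional translational local shift for pixels inside the object mask.
    `frame` is assumed to be uint8 or float32."""
    h, w = frame.shape[:2]
    gx, gy = global_shift
    M_global = np.float32([[1, 0, -gx], [0, 1, -gy]])
    globally_rectified = cv2.warpAffine(frame, M_global, (w, h))

    lx, ly = local_shift
    M_local = np.float32([[1, 0, -lx], [0, 1, -ly]])
    locally_rectified = cv2.warpAffine(globally_rectified, M_local, (w, h))

    mask = object_mask.astype(bool)
    if globally_rectified.ndim == 3:
        mask = mask[..., None]                  # broadcast over color channels
    return np.where(mask, locally_rectified, globally_rectified)
```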
[0056] Compared with conventional approaches, the techniques for global and local motion estimation disclosed herein reduce the uncertainty in motion estimates, especially for large motion and in dark environments. Still, the estimation has pixel-level accuracy. To meet the standards of subpixel-level accuracy and to handle some estimation errors, the rectified images can be used as input for the following functional networks, such as HDR generator unit 102. As illustrated in FIG. 4B, HDR generator unit 102 may include a neural network that includes one or more of, e.g., an HDR frame merging network, a denoising network, a deep quantization network, and/or a super-resolution (SR) network. Those networks may perform estimation for small motions, especially for short-exposure or dark areas. In this way, the ISP 120 of the present disclosure may perform motion estimation, HDR frame merging, and denoising end-to-end, thereby saving time and computational complexity, as compared to known ISPs that estimate motion from a single source such as the IMU.
[0057] As mentioned above, ISP 120 may be used for multi-view frame alignment, in which multiple stacked frames 201 are captured using different cameras with different views. In such an application, a standard focal-length camera can use the same framework to generate a super-resolution image by merging multiple frames after performing motion estimation and rectification as described above in connection with FIGs. 2, 3A, 3B, 4A, and 4B. When the generated frame is combined with the frame from the telephoto lens camera, local motion unit 206 may use machine learning to perform local motion estimation between the frames of the standard focal-length camera and the telephoto lens camera, especially in low-light scenes.
[0058] ISP 120 may also be used in an augmented reality (AR) device, such as an AR headset that includes a camera. In AR, camera poses may be estimated using geometric computer vision tasks such as Simultaneous Localization and Mapping (SLAM) and/or Structure-from-Motion (SfM). SLAM, for instance, may be used to estimate or generate a map that tracks the movement of features within the AR environment. The pipeline operations of the SLAM may include feature matching between different frames and the integration of IMU data 209 to improve motion tracking performance. Here, ISP 120 and the exemplary operations described above in connection with FIGs. 2, 3A, 3B, 4A, and 4B may be used to handle the low-light environment, thereby improving feature tracking for AR implementations. [0059] FIG. 5 illustrates a flowchart of an exemplary method 500 of global motion estimation, according to some embodiments of the present disclosure. Exemplary method 500 may be performed by an image capturing device, e.g., such as apparatus 100, ISP 120, motion estimator unit 104, and/or global motion unit 204. Method 500 may include steps 502-512 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 5. [0060] At 502, image capturing device may obtain multiple stacked frames. The multiple stacked frames may include a first frame (e.g., first long-exposure frame described above in connection with FIG. 3A) captured at a first time using a first exposure duration, a second frame (e.g., short-exposure frame described above in connection with FIG. 3A) captured at a second time using a second exposure duration that is shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration that is longer than the second exposure duration and/or equal to the first exposure duration. For example, referring to FIGs. 3A and 4A, interface unit 302 of global motion unit 204 may be configured to receive (at 401) multiple stacked frames 201, e.g., from one or more of camera 106, memory 214, and/or memory 118 and to receive IMU data 209 from IMU 112. Multiple stacked frames 201 may include a mix of long-exposure frames and short-exposure frames captured of the same scene.
[0061] At 504, image capturing device may rectify each of the first frame, the second frame, and the third frame based at least in part on motion sensor data. For example, referring to FIGs. 3A and 4A, using portions of IMU data 209 that roughly correspond to the respective exposure intervals, sensor rectification unit 304 may perform an initial rectification (at 405) of the raw image data of the first long-exposure frame, the short-exposure frame, and the second long-exposure frame to reduce and/or remove image distortion due to global motion that occurred during those time periods.
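As an illustration of how IMU samples covering an exposure interval might be turned into a per-frame motion estimate for this initial rectification, the following sketch integrates gyroscope rates over the interval; it is a simplified assumption, not the specific computation performed by sensor rectification unit 304.

```python
import numpy as np

def integrate_gyro(gyro_rates, timestamps, t_start, t_end):
    """Integrate gyroscope angular rates (rad/s, shape (N, 3)) over one
    exposure interval to approximate the rotation accumulated during it."""
    sel = (timestamps >= t_start) & (timestamps <= t_end)
    t, w = timestamps[sel], gyro_rates[sel]
    if t.size < 2:
        return np.zeros(3)
    return np.trapz(w, t, axis=0)               # small-angle approximation

# A rotation of `theta` radians about an axis perpendicular to the optical axis
# shifts image content by roughly `focal_length_px * theta` pixels, which can
# seed the per-frame rectification before feature matching.
```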
[0062] At 506, image capturing device may perform feature matching of the rectified first frame, the rectified second frame, and the rectified third frame. For example, referring to FIGs. 3A and 4A, feature matching unit 306 may perform feature matching (at 407) of the rectified first long-exposure frame, the rectified short-exposure frame, and the rectified second long-exposure frame. In some embodiments, feature mapping unit 306 may include a graph neural network, such as SuperGlue, that matches two sets of local features by jointly finding correspondences between the frames and rejecting non-matchable points. Matched features may be estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. Feature mapping unit 306 may use a flexible context aggregation mechanism based on attention, which enables reasoning about the underlying 3D scene and feature assignments jointly. Feature mapping unit 306 may learn geometric transformations and regularities of the 3D world through end-to-end training from image pairs. The output of the neural network may include features that are mapped between each of, e.g., the first long-exposure frame, the short-exposure frame, and the second long-exposure frame. These mapped features may be sent to global motion computation unit 312.
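A learned matcher such as SuperGlue is not reproduced here; the sketch below substitutes a classical ORB detector with cross-checked brute-force matching purely to illustrate the inputs and outputs of the matching step, and it assumes grayscale uint8 frames.

```python
import cv2

def match_features(frame_a, frame_b, max_matches=200):
    """Detect ORB keypoints in two grayscale frames and return the strongest
    cross-checked matches between them (a classical stand-in for a learned
    graph-neural-network matcher)."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return [], kp_a, kp_b                   # one frame too dark or featureless
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    return matches[:max_matches], kp_a, kp_b
```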
[0063] At 508, image capturing device may estimate a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame. For example, referring to FIGs. 3A and 4A, long-exposure motion unit 308 may estimate (at 409a) a first global motion mg based on the feature mapping of the first long-exposure frame and the second long-exposure frame, where the first long-exposure frame is captured before the short-exposure frame, and the second long-exposure frame is captured after the short-exposure frame.
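One way the first global motion could be derived from such matched features is sketched below, fitting a RANSAC similarity transform and taking its translation component; this is an illustrative choice, not the specific computation of long-exposure motion unit 308.

```python
import numpy as np
import cv2

def estimate_global_motion(matches, kp_a, kp_b):
    """Fit a RANSAC similarity transform to matched keypoints and return its
    translation component as the dominant inter-frame shift in pixels."""
    if len(matches) < 3:
        return np.zeros(2)
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    transform, inliers = cv2.estimateAffinePartial2D(pts_a, pts_b,
                                                     method=cv2.RANSAC)
    if transform is None:
        return np.zeros(2)
    return transform[:, 2]                      # (dx, dy)
```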
[0064] At 510, image capturing device may estimate a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion. For example, referring to FIGs. 3A and 4A, short-exposure motion unit 310 may estimate (at 409b) a second global motion for the short-exposure frame by interpolating the first global motion mg. For example, short-exposure motion unit 310 may estimate the second global motion Mg using Equation (1) discussed above.
[0065] At 512, image capturing device may compute a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data. For example, referring to FIGs. 3A and 4A, global motion computation unit 312 may combine the first global motion mg and second global motion Mg with IMU data 209 to compute (at 409c) the final global motion of apparatus 100 for the multiple stacked frames 201 with a high degree of accuracy.
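The disclosure does not specify how the feature-based motions and the IMU data are combined; the sketch below uses a fixed-weight blend purely as a placeholder for that fusion step (a Kalman-filter-style fusion would be a common alternative), and the weight value is an assumption.

```python
import numpy as np

def fuse_motion_estimates(m_feature, m_imu, feature_weight=0.7):
    """Blend a feature-based motion estimate with an IMU-derived estimate
    using a fixed weight; illustrative placeholder for the fusion step."""
    m_feature = np.asarray(m_feature, dtype=float)
    m_imu = np.asarray(m_imu, dtype=float)
    return feature_weight * m_feature + (1.0 - feature_weight) * m_imu

final_global_motion = fuse_motion_estimates([2.1, -0.4], [1.8, -0.6])
```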
[0066] FIG. 6 illustrates a flowchart of an exemplary method 600 of local motion estimation, according to some embodiments of the present disclosure. Exemplary method 600 may be performed by an image capturing device, e.g., such as apparatus 100, ISP 120, motion estimator unit 104, and/or local motion unit 206. Method 600 may include steps 602-616 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 6. [0067] At 602, the image capturing device may obtain multiple stacked frames. The multiple stacked frames may include a first frame captured at a first time using a first exposure interval, a second frame captured at a second time using a second exposure interval continuous with the first exposure interval, and a third frame captured at a third time using a third exposure interval less than the first or second exposure intervals. For example, referring to FIGs. 3B and 4A, interface unit 322 may be configured to receive (at 401) multiple stacked frames 201, e.g., from one or more of camera 106, memory 214, and/or memory 118. Multiple stacked frames 201 are the same as those received by interface unit 302 of global motion unit 204.
[0068] At 604, image capturing device may identify an object in the first and second frames. For example, referring to FIGs. 3B and 4A, object identification unit 324 may perform image segmentation (at 411) to identify object(s) in multiple stacked frames 201. For example, object identification unit 324 may perform image segmentation by partitioning each frame into multiple segments, which may be sets of pixels, also known as image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation may include locating objects and boundaries (lines, curves, etc.) in each frame. Each pixel or segment is assigned a label such that pixels with the same label in each of the different frames share certain characteristics. Then, the image segmentation information may be sent to feature mapping unit 326.
[0069] At 606, image capturing device may perform feature matching of the object in the first frame and the second frame. For example, referring to FIGs. 3B and 4A, feature mapping unit 326 may perform (at 413) feature mapping using a graph neural network that is the same as or similar to the one described above in connection with the feature mapping unit 306 of global motion unit 204, except here intensive features near the boundary of an object (as indicated in the image segmentation information) may be used to match features in the frames. Then, the feature mapping information may be sent to long-exposure motion unit 328 and/or short-exposure motion unit 330 to estimate the local motion of the object.
[0070] At 608, image capturing device may estimate a first motion of the object during the first exposure interval and the second exposure interval based at least in part on the feature matching. For example, referring to FIGs. 3B and 4A, long-exposure motion unit 328 may use two long-exposure frames captured with continuous exposure intervals to estimate (at 415) a first local motion mi. For example, long-exposure motion unit 328 may estimate the first local motion mi by estimating the distance and/or direction of the object’s motion based on the differences in the feature mapping information between the two long-exposure frames.
[0071] At 610, image capturing device may determine whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number. When the threshold number is met, the operation moves to 612. Otherwise, when the threshold number is not met, the operation moves to 614.
[0072] At 612, the image capturing device may estimate a second motion of the object during the third exposure interval based at least in part on the feature matching. For example, referring to FIGs. 3B and 4A, when the number of matched features meets a threshold number (e.g., when the short-exposure frame is not too dark), short-exposure motion unit 330 may estimate (at 417a) the second local motion Mi between a short-exposure frame and a long- exposure frame using feature mapping between these two frames.
[0073] At 614, the image capturing device may estimate the second motion of the object during the third exposure interval using interpolation or machine learning. For example, referring to FIGs. 3B and 4A, when the threshold number is not met (e.g., when the short-exposure frame is too dark), short-exposure motion unit 330 may estimate (at 417b) the second local motion Mi using machine learning or interpolation of the first local motion mi as shown above in Equation (2).
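Steps 610-614 can be summarized in a small decision routine such as the sketch below; the match threshold, the function name, and the exposure-ratio interpolation fallback are illustrative assumptions rather than values taken from the disclosure.

```python
import numpy as np

def estimate_short_exposure_local_motion(displacements, m_long, t_l1, t_l2, t_s,
                                         min_matches=20):
    """displacements: (N, 2) per-feature shifts between the short-exposure
    frame and a long-exposure frame. Falls back to exposure-ratio
    interpolation of the long-exposure motion when too few features were
    matched (e.g., the short-exposure frame is too dark)."""
    displacements = np.asarray(displacements, dtype=float).reshape(-1, 2)
    if displacements.shape[0] >= min_matches:
        return displacements.mean(axis=0)                        # step 612
    return np.asarray(m_long, dtype=float) * (t_s / (t_l1 + t_l2))  # step 614
```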
[0074] At 616, the image capturing device may compute a final motion of the object during the first exposure interval, the second exposure interval, and the third exposure interval based at least in part on the first motion and the second motion. For example, referring to FIGs. 3B and 4A, local motion computation unit 332 may compute (at 419) a final local motion for multiple stacked frames 201 by combining the first local motion mi and second local motion Mi of the object.
[0075] FIG. 7 illustrates a flowchart of an exemplary method 700 of generating an HDR image using global and local motion estimation, according to some embodiments of the present disclosure. Exemplary method 700 may be performed by an image capturing device, e.g., such as apparatus 100 and/or ISP 120. Method 700 may include steps 702-710 as described below. It is to be appreciated that some of the steps may be optional, and some of the steps may be performed simultaneously, or in a different order than shown in FIG. 7.
[0076] At 702, the image capturing device may obtain multiple stacked frames. The multiple stacked frames may include a first frame (e.g., first long-exposure frame described above in connection with FIG. 3A) captured at a first time using a first exposure interval, a second frame (e.g., short-exposure frame described above in connection with FIG. 3A) captured at a second time using a second exposure interval that is shorter than the first exposure interval, and a third frame (e.g., second long-exposure frame described above in connection with FIG. 3A) captured at a third time using a third exposure interval longer than the second exposure interval.
[0077] At 704, the image capturing device may estimate a global motion associated with the image capturing device during the first exposure interval, the second exposure interval, and the third exposure interval. For example, the image capturing device may perform one or more of the operations described above in connection with one or more of FIGs. 2, 3A, 4A, or 5 to estimate final global motion (at 409c).
[0078] At 706, the image capturing device may estimate a local motion associated with an object in the multiple stacked frames during the first exposure interval, the second exposure interval, and the third exposure interval. For example, the image capturing device may perform one or more of the operations described above in connection with one or more of FIGs. 2, 3B, 4A, or 6 to estimate final local motion (at 419).
[0079] At 708, the image capturing device may rectify the multiple stacked frames based at least in part on the global motion and the local motion. For example, referring to FIGs. 2 and 4A, using final global motion 203 and final local motion 205, rectification unit 208 may (at 421) rectify multiple stacked frames 207. Due to the high degree of accuracy with which final global motion 203 and final local motion 205 are estimated, multiple stacked frames 207 may be more accurately aligned, as compared to known approaches.
[0080] At 710, the image capturing device may generate the HDR image based at least in part on the rectified multiple stacked frames. For example, referring to FIGs. 2 and 4B, HDR generator unit 102 may generate (at 423) an HDR image using the multiple stacked frames 201 rectified using final global motion 203 estimated by global motion unit 204 and final local motion 205 estimated by local motion unit 206.
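The disclosure generates the HDR image with a neural HDR frame merging network; purely as an illustration of this final step, the sketch below merges already-rectified frames with a classical exposure-weighted average, and all names and weight choices are assumptions.

```python
import numpy as np

def merge_hdr(frames, exposure_times, eps=1e-6):
    """Merge aligned uint8 frames into a linear HDR estimate: each frame is
    normalized by its exposure time and weighted by a hat function that
    trusts mid-tone pixels more than under- or over-exposed ones."""
    acc = np.zeros(np.asarray(frames[0]).shape, dtype=np.float64)
    weight_sum = np.zeros_like(acc)
    for frame, t in zip(frames, exposure_times):
        f = np.asarray(frame, dtype=np.float64) / 255.0
        w = 1.0 - 2.0 * np.abs(f - 0.5)          # 0 at extremes, 1 at mid-gray
        acc += w * f / t                         # per-frame radiance estimate
        weight_sum += w
    return acc / (weight_sum + eps)
```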
[0081] In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as instructions or code on a non-transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computing device, such as apparatus 100 in FIG. 1. By way of example, and not limitation, such computer-readable media can include random-access memory (RAM), read-only memory (ROM), EEPROM, compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD) or other magnetic disk storage or magnetic storage devices, flash drive, solid-state drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), and floppy disk, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
[0082] According to one aspect of the present disclosure, an apparatus (e.g., global motion unit 204) for image signal processing is disclosed. The apparatus may include an interface unit configured to obtain multiple stacked frames from an image capturing device. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration. The apparatus may also include a rectification unit configured to rectify each of the first frame, the second frame, and the third frame based at least in part on motion sensor data. The apparatus may also include a feature matching unit configured to perform feature matching of the rectified first frame, the rectified second frame, and the rectified third frame. The apparatus may also include a first motion estimation unit configured to estimate a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame. The apparatus may also include a second motion estimation unit configured to estimate a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion. The apparatus may also include a motion computation unit configured to compute a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
[0083] In some embodiments, the motion sensor data may comprise first motion sensor data for the first time, second motion sensor data for the second time, and third motion sensor data for the third time. In some other embodiments, the interface unit may be further configured to obtain the first motion sensor data for the first time, the second motion sensor data for the second time, and the third motion sensor data for the third time.
[0084] In some other embodiments, the rectification unit may be configured to rectify each of the first frame, the second frame, and the third frame by rectifying the first frame based at least in part on the first motion sensor data, rectifying the second frame based at least in part on the second motion sensor data, and rectifying the third frame based at least in part on the third motion sensor data.
[0085] In some other embodiments, the motion sensor data may be obtained using one or more of a gyroscope or an accelerometer of the image capturing device.
[0086] In some other embodiments, the first motion of the image capturing device may be associated with a time period between a start of the first exposure duration and an end of the third exposure duration. In some other embodiments, the second motion of the image capturing device may be associated with the second exposure duration.
[0087] According to another aspect of the present disclosure, an apparatus (e.g., local motion unit 206) for image signal processing is disclosed. The apparatus may include an interface unit configured to obtain multiple stacked frames using an image capturing device. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, and a second frame captured at a second time using a second exposure duration. The apparatus may also include an identification unit configured to identify an object in the first frame and the second frame. The apparatus may also include a feature matching unit configured to perform feature matching of the object in the first frame and the second frame. The apparatus may also include a first motion estimation unit configured to estimate a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching.
[0088] In some embodiments, the multiple stacked frames may further comprise a third frame captured at a third time using a third exposure duration that may be shorter than the first exposure duration and the second exposure duration.
[0089] In some other embodiments, the feature matching of the object may be performed for the first frame, the second frame, and the third frame.
[0090] In some other embodiments, the apparatus further includes a second motion estimation unit configured to determine whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number. [0091] In some embodiments, the second motion estimation unit may be further configured to estimate a second motion of the object during the third exposure duration based at least in part on the feature matching when the number of matched data points meets the threshold number, or estimate the second motion of the object during the third exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number.
[0092] In some other embodiments, the apparatus may further include a motion computation unit configured to compute a final motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the first motion and the second motion.
[0093] In some other embodiments, the object may be identified based at least in part on image segmentation.
[0094] According to another aspect of the present disclosure, an apparatus (e.g., ISP 120) for image signal processing is disclosed. The apparatus may include an interface unit configured to obtain multiple stacked frames. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration. The apparatus may further include a global motion unit configured to estimate a global motion associated with the image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration. The apparatus may further include a local motion unit configured to estimate a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration. The apparatus may further include a rectification unit configured to rectify the multiple stacked frames based at least in part on the global motion and the local motion. The apparatus may further include an HDR generator unit configured to generate the HDR image based at least in part on the rectified multiple stacked frames.
[0095] In some embodiments, the global motion unit may be configured to estimate the global motion of the image capturing device by rectifying each of the first frame, the second frame, and the third frame based at least in part on motion sensor data. In some other embodiments, the global motion unit may be further configured to estimate the global motion of the image capturing device by performing feature matching of the rectified first frame, the rectified second frame, and the rectified third frame. In some other embodiments, the global motion unit may be further configured to estimate the global motion of the image capturing device by estimating a long-exposure motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame. In some other embodiments, the global motion unit may be further configured to estimate the global motion of the image capturing device by estimating a short-exposure motion of the image capturing device associated with the rectified second frame by interpolating the long-exposure motion. In some other embodiments, the global motion unit may be further configured to estimate the global motion of the image capturing device by computing the global motion of the image capturing device based at least in part on the long-exposure motion, the short-exposure motion, and the motion sensor data.
[0096] In some embodiments, the motion sensor data may comprise first motion sensor data for the first time, second motion sensor data for the second time, and third motion sensor data for the third time. In some other embodiments, the interface unit may be further configured to obtain the first motion sensor data for the first time, the second motion sensor data for the second time, and the third motion sensor data for the third time.
[0097] In some embodiments, the rectification unit may be configured to rectify each of the first frame, the second frame, and the third frame by rectifying the first frame based at least in part on the first motion sensor data, rectifying the second frame based at least in part on the second motion sensor data, and rectifying the third frame based at least in part on the third motion sensor data.
[0098] In some embodiments, the local motion unit may be configured to estimate the local motion associated with the object in the multiple stacked frames by identifying the object in the first frame, the second frame, and the third frame. In some embodiments, the local motion unit may be configured to estimate the local motion associated with the object in the multiple stacked frames by performing feature matching of the object in the first frame, the second frame, and the third frame. In some embodiments, the local motion unit may be configured to estimate the local motion associated with the object in the multiple stacked frames by estimating a long-exposure motion of the object during the first exposure duration and the third exposure duration based at least in part on the feature matching.
[0099] In some embodiments, the local motion unit may be further configured to determine whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number.
[0100] In some embodiments, the local motion unit may be further configured to estimate a short-exposure motion of the object during the second exposure duration based at least in part on the long-exposure motion when the number of matched data points meets the threshold number. In some other embodiments, the local motion unit may be further configured to estimate the short-exposure motion of the object during the second exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number.
[0101] In some embodiments, the local motion unit may be further configured to compute the local motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the long-exposure motion and the short-exposure motion.
[0102] According to another aspect of the present disclosure, a method of image signal processing is disclosed. The method may be performed by, e.g., global motion unit 204. The method may include obtaining multiple stacked frames using the image capturing device. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration. The method may further include rectifying each of the first frame, the second frame, and the third frame based at least in part on motion sensor data. The method may further include performing feature matching of the rectified first frame, the rectified second frame, and the rectified third frame. The method may further include estimating a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame. The method may further include estimating a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion. The method may further include computing a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
[0103] According to another aspect of the present disclosure, a method of image signal processing is disclosed. The method may be performed by, e.g., local motion unit 206. The method may include obtaining multiple stacked frames using an image capturing device. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration, and a third frame captured at a third time using a third exposure duration that may be shorter than the first exposure duration and the second exposure duration. The method may further include identifying an object in the first frame and the second frame. The method may further include performing feature matching of the object in the first frame, the second frame, and the third frame. The method may further include estimating a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching. The method may further include determining whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number. The method may further include estimating a second motion of the object during the third exposure duration based at least in part on the feature matching when the number of matched data points meets the threshold number, or estimating the second motion of the object during the third exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number. The method may further include computing a final motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the first motion and the second motion.
[0104] According to another aspect of the present disclosure, a method of image signal processing is disclosed. The method may be performed by, e.g., ISP 120. The method may include obtaining multiple stacked frames. The multiple stacked frames may include a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that may be shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration. The method may further include estimating a global motion associated with the image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration. The method may further include estimating a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration. The method may further include rectifying the multiple stacked frames based at least in part on the global motion and the local motion. The method may further include generating the HDR image based at least in part on the rectified multiple stacked frames.
[0105] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
[0106] The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.
[0107] Various functional blocks, modules, and steps are disclosed above. The particular arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be re-ordered or combined in different ways than in the examples provided above. Likewise, some embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.
[0108] The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims

WHAT IS CLAIMED IS
1. An apparatus for image signal processing, comprising: an interface unit configured to obtain multiple stacked frames from an image capturing device, wherein the multiple stacked frames comprise: a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that is shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration, a rectification unit configured to rectify each of the first frame, the second frame, and the third frame based at least in part on motion sensor data; a feature matching unit configured to perform feature matching of the rectified first frame, the rectified second frame, and the rectified third frame; a first motion estimation unit configured to estimate a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame; a second motion estimation unit configured to estimate a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion; and a motion computation unit configured to compute a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
2. The apparatus of claim 1, wherein: the motion sensor data comprises first motion sensor data for the first time, second motion sensor data for the second time, and third motion sensor data for the third time, and the interface unit is further configured to obtain the first motion sensor data for the first time, the second motion sensor data for the second time, and the third motion sensor data for the third time.
3. The apparatus of claim 2, wherein the rectification unit is configured to rectify each of the first frame, the second frame, and the third frame by: rectifying the first frame based at least in part on the first motion sensor data; rectifying the second frame based at least in part on the second motion sensor data; and rectifying the third frame based at least in part on the third motion sensor data.
4. The apparatus of claim 1, wherein the motion sensor data is obtained using one or more of a gyroscope or an accelerometer of the image capturing device.
5. The apparatus of claim 1, wherein: the first motion of the image capturing device is associated with a time period between a start of the first exposure duration and an end of the third exposure duration; and the second motion of the image capturing device is associated with the second exposure duration.
6. An apparatus for image signal processing, comprising: an interface unit configured to obtain multiple stacked frames using an image capturing device, wherein the multiple stacked frames comprise: a first frame captured at a first time using a first exposure duration, and a second frame captured at a second time using a second exposure duration, an identification unit configured to identify an object in the first frame and the second frame; a feature matching unit configured to perform feature matching of the object in the first frame and the second frame; and a first motion estimation unit configured to estimate a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching.
7. The apparatus of claim 6, wherein the multiple stacked frames further comprise a third frame captured at a third time using a third exposure duration that is shorter than the first exposure duration and the second exposure duration.
8. The apparatus of claim 7, wherein the feature matching of the object is performed for the first frame, the second frame, and the third frame.
9. The apparatus of claim 8, further comprising a second motion estimation unit configured to determine whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number.
10. The apparatus of claim 9, wherein the second motion estimation unit is further configured to: estimate a second motion of the object during the third exposure duration based at least in part on the feature matching when the number of matched data points meets the threshold number, or estimate the second motion of the object during the third exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number.
11. The apparatus of claim 10, further comprising: a motion computation unit configured to compute a final motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the first motion and the second motion.
12. The apparatus of claim 6, wherein the object is identified based at least in part on image segmentation.
13. An apparatus for image signal processing, comprising: an interface unit configured to obtain multiple stacked frames comprising: a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that is shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration, a global motion unit configured to estimate a global motion associated with an image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration; a local motion unit configured to estimate a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration; a rectification unit configured to rectify the multiple stacked frames based at least in part on the global motion and the local motion; and a high dynamic range (HDR) generator unit configured to generate an HDR image based at least in part on the rectified multiple stacked frames.
14. The apparatus of claim 13, wherein the global motion unit is configured to estimate the global motion of the image capturing device by: rectifying each of the first frame, the second frame, and the third frame based at least in part on motion sensor data; performing feature matching of the rectified first frame, the rectified second frame, and the rectified third frame; estimating a long-exposure motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame; estimating a short-exposure motion of the image capturing device associated with the rectified second frame by interpolating the long-exposure motion; and computing the global motion of the image capturing device based at least in part on the long-exposure motion, the short-exposure motion, and the motion sensor data.
15. The apparatus of claim 14, wherein: the motion sensor data comprises first motion sensor data for the first time, second motion sensor data for the second time, and third motion sensor data for the third time, and the interface unit is further configured to obtain the first motion sensor data for the first time, the second motion sensor data for the second time, and the third motion sensor data for the third time.
16. The apparatus of claim 15, wherein the rectification unit is configured to rectify each of the first frame, the second frame, and the third frame by: rectifying the first frame based at least in part on the first motion sensor data; rectifying the second frame based at least in part on the second motion sensor data; and rectifying the third frame based at least in part on the third motion sensor data.
17. The apparatus of claim 13, wherein the local motion unit is configured to estimate the local motion associated with the object in the multiple stacked frames by: identifying the object in the first frame, the second frame, and the third frame; performing feature matching of the object in the first frame, the second frame, and the third frame; and estimating a long-exposure motion of the object during the first exposure duration and the third exposure duration based at least in part on the feature matching.
18. The apparatus of claim 17, wherein the local motion unit is further configured to: determine whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number.
19. The apparatus of claim 18, wherein the local motion unit is further configured to: estimate a short-exposure motion of the object during the second exposure duration based at least in part on the long-exposure motion when the number of matched data points meets the threshold number, or estimate the short-exposure motion of the object during the second exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number.
20. The apparatus of claim 19, wherein the local motion unit is further configured to compute the local motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the long-exposure motion and the short-exposure motion.
21. A method of image signal processing, comprising: obtaining multiple stacked frames using an image capturing device, wherein the multiple stacked frames comprise: a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that is shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration, rectifying each of the first frame, the second frame, and the third frame based at least in part on motion sensor data; performing feature matching of the rectified first frame, the rectified second frame, and the rectified third frame; estimating a first motion of the image capturing device based at least in part on the feature matching between the rectified first frame and the rectified third frame; estimating a second motion of the image capturing device associated with the rectified second frame by interpolating the first motion; and computing a final motion of the image capturing device based at least in part on the first motion, the second motion, and the motion sensor data.
22. A method of image signal processing, comprising: obtaining multiple stacked frames using an image capturing device, wherein the multiple stacked frames comprise: a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration, and a third frame captured at a third time using a third exposure duration that is shorter than the first exposure duration and the second exposure duration, identifying an object in the first frame and the second frame; performing feature matching of the object in the first frame, the second frame, and the third frame; estimating a first motion of the object during the first exposure duration and the second exposure duration based at least in part on the feature matching; determining whether a number of matched data points of the feature matching of the first frame, the second frame, and the third frame meets a threshold number; estimating a second motion of the object during the third exposure duration based at least in part on the feature matching when the number of matched data points meets the threshold number, or estimating the second motion of the object during the third exposure duration using interpolation or machine learning when the number of matched data points fails to meet the threshold number; and computing a final motion of the object during the first exposure duration, the second exposure duration, and the third exposure duration based at least in part on the first motion and the second motion.
23. A method of image signal processing, comprising: obtaining multiple stacked frames comprising: a first frame captured at a first time using a first exposure duration, a second frame captured at a second time using a second exposure duration that is shorter than the first exposure duration, and a third frame captured at a third time using a third exposure duration equal to the first exposure duration, estimating a global motion associated with an image capturing device during the first exposure duration, the second exposure duration, and the third exposure duration; estimating a local motion associated with an object in the multiple stacked frames during the first exposure duration, the second exposure duration, and the third exposure duration; rectifying the multiple stacked frames based at least in part on the global motion and the local motion; and generating an HDR image based at least in part on the rectified multiple stacked frames.