WO2017127816A1 - Omnidirectional video encoding and streaming - Google Patents
- Publication number
- WO2017127816A1 (PCT/US2017/014588)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- segments
- joining
- spherical
- encoding
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/597—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/124—Quantisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/70—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/698—Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/90—Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/08—Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division
- H04N7/0806—Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division the signals being two or more video signals
Definitions
- the current disclosure relates to encoding and streaming of video and in particular to encoding and streaming omnidirectional video.
- Omnidirectional video provides a 360° view of an environment.
- Omnidirectional video allows a viewer to view any desired portion of the 360° environment.
- Encoding omnidirectional video may use existing encoding techniques used for 2-dimensional (2D) video, by projecting the omnidirectional video from a sphere to one or more rectangles.
- FIG. 1 depicts projecting the omnidirectional video from a sphere 100 onto one or more rectangles 102, 104a, 104b, 104c, 104d, 104e, 104f using equirectangular projection and cubic projection. In both cases of equirectangular projection and cubic projection, the resulting 2D projections have wasted pixels.
- the area of the omnidirectional video is that of the sphere 100.
- if the omnidirectional video's sphere has a radius of r, the omnidirectional video covers an area of 4πr².
- in equirectangular projection, the sphere's area is projected onto a rectangle having an area of 2π²r², which is 157% of the area of the sphere.
- in cubic projection, the sphere's area is projected onto six squares having a combined area of 6πr², which is 150% of the area of the sphere. Accordingly, both projection techniques result in a relatively large amount of unnecessary information being encoded.
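The wasted-area figures above can be checked with simple arithmetic; the following sketch works on a unit sphere since the ratios are radius-independent:

```python
import math

r = 1.0                                        # unit sphere; ratios are radius-independent
sphere_area = 4 * math.pi * r ** 2             # 4*pi*r^2

# Equirectangular projection: a 2*pi*r wide by pi*r tall rectangle
erp_area = (2 * math.pi * r) * (math.pi * r)   # = 2*pi^2*r^2
erp_ratio = erp_area / sphere_area             # pi/2, about 157%

# Cubic projection with a combined face area of 6*pi*r^2, as stated above
cmp_area = 6 * math.pi * r ** 2
cmp_ratio = cmp_area / sphere_area             # exactly 150%
```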
- the present disclosure provides a new encoding method that uses a nearly equal-area projection.
- the encoding may also use ROI-targeted encoding to provide the encoded omnidirectional videos.
- the present disclosure provides adaptive streaming techniques for omnidirectional videos.
- the present disclosure further provides video capture devices and stitching techniques for capturing panoramic and omnidirectional video.
- a method of encoding omnidirectional video comprising: receiving spherical omnidirectional video data; segmenting a frame of the spherical omnidirectional video data into a north pole segment formed by mapping a top spherical dome portion of the frame of the spherical omnidirectional video data onto a circle, a south pole segment formed by mapping a bottom spherical dome portion of the frame of the spherical omnidirectional video data onto a circle and at least one joining segment formed by mapping a spherical joining portion of the frame of the spherical omnidirectional video data joining the top spherical dome portion and the bottom spherical dome portion onto at least one rectangle; stacking the north pole segment, the south pole segment and the at least one joining segment together to form a 2-dimensional (2D) frame corresponding to the frame of the spherical omnidirectional video data; and encoding the 2D frame.
- the method further comprises segmenting a plurality of frames of the spherical omnidirectional video data into a plurality of north pole segments, south pole segments and joining segments; stacking the plurality of north pole segments, south pole segments and joining segments into a plurality of 2D frames; and encoding the plurality of 2D frames.
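As an illustration of the segmenting step, the following sketch splits one equirectangular source frame into pole and joining regions by pixel rows. The 45° cut latitude and the frame size are assumptions for this sketch, and the two cap regions would still need to be warped into circles:

```python
def segment_frame(width, height, cut_lat_deg=45):
    """Split one equirectangular source frame (width x height pixels)
    into north pole, joining and south pole regions, returned as pixel
    row ranges. cut_lat_deg is the latitude where the sphere is cut;
    the cap regions would subsequently be warped into circles."""
    cap_rows = int(height * (90 - cut_lat_deg) / 180)
    return {
        "north": (0, cap_rows),
        "joining": (cap_rows, height - cap_rows),
        "south": (height - cap_rows, height),
    }
```

For a hypothetical 4096x2048 frame cut at 45°, each cap covers 512 rows and the joining band the middle 1024 rows.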
- the at least one joining segment comprises a plurality of joining segments, each mapping a portion of the spherical joining portion onto a respective rectangle.
- the circular poles are placed within squares.
- each of the north pole segment, the south pole segment and the at least one joining segment comprise overlapping pixel data.
- the at least one joining segment comprises between 2 and 4 segments.
- the method further comprises tracking one or more regions of interest (ROI) before encoding the 2D frames.
- encoding the 2D frame comprises: encoding one or more view points into a first stream; and for each view point encoding additional streams comprising an intracoded frame stream, a predictive frame stream, and a bi-predictive frame stream.
- the method further comprises streaming at least one of the encoded view points.
- a system for encoding omnidirectional video comprising: a processor for executing instructions; and a memory for storing instructions, which when executed by the processor, configure the system to provide a method comprising: receiving spherical omnidirectional video data; segmenting a frame of the spherical omnidirectional video data into a north pole segment formed by mapping a top spherical dome portion of the frame of the spherical omnidirectional video data onto a circle, a south pole segment formed by mapping a bottom spherical dome portion of the frame of the spherical omnidirectional video data onto a circle and at least one joining segment formed by mapping a spherical joining portion of the frame of the spherical omnidirectional video data joining the top spherical dome portion and the bottom spherical dome portion onto at least one rectangle; stacking the north pole segment, the south pole segment and the at least one joining segment together to form a 2-dimensional (2D) frame; and encoding the 2D frame.
- the instructions further configure the system to: segment a plurality of frames of the spherical omnidirectional video data into a plurality of north pole segments, south pole segments and joining segments; stack the plurality of north pole segments, south pole segments and joining segments into a plurality of 2D frames; and encode the plurality of 2D frames.
- the at least one joining segment comprises a plurality of joining segments, each mapping a portion of the spherical joining portion onto a respective rectangle.
- each of the north pole segment, the south pole segment and the at least one joining segment comprise overlapping pixel data.
- the at least one joining segment comprises between 2 and 4 segments.
- the instructions further configure the system to track one or more regions of interest (ROI) before encoding the 2D frames.
- encoding the 2D frame comprises: encoding one or more view points into a first stream; and for each view point encoding additional streams comprising an intracoded frame stream, a predictive frame stream, and a bi-predictive frame stream.
- the instructions further configure the system to stream at least one of the encoded view points.
- a device for use in capturing panoramic video comprising: a frame for holding a mobile device; a first fisheye lens mounted on the frame and arranged to be located over a front facing camera of the mobile device when the mobile device is held by the frame; and a second fisheye lens mounted on the frame and arranged to be located over a rear facing camera of the mobile device when the mobile device is held by the frame.
- a method of stitching multiple videos captured from one or more mobile devices comprising: generating a stitching template for each camera capturing the videos; synchronizing frames of the captured video using timestamps of the frames; remapping the multiple videos onto a sphere using the stitching template; and blending the remapped images to provide a panoramic video.
- FIG. 1 depicts equirectangular projection and cubic projection of a sphere
- FIGs. 2A and 2B depict segmented sphere projection (SSP) of a sphere
- FIG. 3 depicts stacking of segments from a segmented sphere projection
- FIG. 4 is a graph of a ratio of segmented area to the spherical area based on the number of segments for both circular pole and square pole segments;
- FIG. 5 is a graph of a ratio of segmented area to the spherical area based on the number of segments and different amounts of segment overlap for circular pole segments;
- FIG. 6 is a graph of a ratio of segmented area to the spherical area based on the number of segments and different amounts of segment overlap for square pole segments;
- FIG. 7 depicts segmented sphere projection using a single equatorial tile segment and square poles;
- FIG. 8 depicts segmented sphere projection using multiple equatorial tile segments and square poles
- FIG. 9 depicts segmented sphere projection using equally sized equatorial tile segments and square poles
- FIG. 10 depicts stacking of overlapping tile segments
- FIG. 11 depicts further stacking of overlapping tile segments
- FIG. 12 depicts further stacking of overlapping tile segments
- FIG. 13 depicts stacking of overlapping tile segments for stereoscopic omnidirectional video
- FIG. 14 depicts region of interest (ROI) encoding
- FIG. 15 depicts further ROI encoding
- FIG. 16 depicts an ROI heat map
- FIG. 17 depicts ROI temporal encoding
- FIG. 18 depicts view point encoding of omnidirectional video
- FIG. 19 depicts view point encoding of a view of omnidirectional video for adaptive view point streaming
- FIG. 20 depicts adaptive view point streaming of omnidirectional video
- FIG. 21 depicts a system for encoding and streaming omnidirectional video
- FIGs. 22A, 22B and 22C depict devices for capturing panoramic and/or omnidirectional video
- FIGs. 23A, 23B, and 23C depict stitching video together; and
- FIGs. 24A and 24B depict brightness mapping.
DETAILED DESCRIPTION
- Omnidirectional video can be encoded using regular encoding techniques by first projecting the video from a sphere to 2-dimensional (2D) tiles.
- Segmented sphere projection projects the spherical video from a top dome or cap portion of the sphere, a bottom dome or cap portion of the sphere and a middle equatorial portion of the sphere joining the top and bottom cap portions.
- the top and bottom cap segments may be mapped to circular tiles or to a circular portion of a square tile.
- the equatorial portion of the sphere may be mapped to one or more rectangular tiles.
- the tiles may then be stacked together into a single frame for subsequent encoding.
- the total area of tiles resulting from SSP may be smaller than the total area resulting from either equirectangular projection or cubic projection.
- the tile area for SSP is close to that of the area of the sphere of the omnidirectional video.
- the encoding efficiency of omnidirectional video may be further improved by encoding particular region of interest (ROI) portions of the omnidirectional video with a higher bitrate while encoding non-ROI portions of the omnidirectional video using a lower bitrate.
- FIGs. 2A and 2B depict segmented sphere projection (SSP) of a sphere.
- the sphere is segmented and mapped to tiles using an improved projection based on Sinusoidal projection.
- the sphere 200 is cut along its latitude into several segments including a north pole segment 202a, a south pole segment 202b (referred to collectively as pole segments 202) and one or more equatorial joining segments 204a-f (referred to collectively as joining segments 204) between the two poles.
- the segments may then be mapped to tiles, and in particular the pole segments may be mapped to circular tiles 206a, 206b and the joining segments 204 may be mapped to rectangular tiles 208a-208f.
- the number of joining segments can vary.
- the sphere 200 may be cut into two pole segments 210a, 210b and 3 equatorial joining segments 212a, 212b, 212c.
- Each of the two pole segments 210a, 210b are mapped to respective circles contained by squares 214a, 214b and the joining segments 212a-c are mapped to respective rectangles 216a-c.
- the individual tiles may overlap with each other a certain amount in order to maintain video quality during further processing. Once segmented into tiles, the individual tiles may be stacked together to form a frame that may be encoded using various encoding techniques.
- FIG. 3 depicts stacking of segments from a segmented sphere projection.
- individual joining segment tiles 304a-c may be stacked together with the square pole tiles 302a, 302b and arranged in a rectangular frame 300.
- the rectangular frame 300 may then be encoded using, for example, an H.264 encoder, although other encoding techniques may be used.
- FIG. 4 is a graph of a ratio of segmented area to the spherical area based on the number of segments for both circular pole and square pole segments. As depicted in the graph of FIG. 4, as the number of equatorial joining segments increases, the total area of the segmented tiles approaches the area of the sphere. As can be seen from FIG. 4, the segmented tile area is greater when using square poles compared to circular poles. As the number of segmented tiles increases, the tile-pole segmentation latitude, that is, the latitude where the sphere is cut to form the segments, is pushed toward the poles.
- FIG. 5 is a graph of a ratio of segmented area to the spherical area based on the number of segments and different amounts of segment overlap for circular pole segments.
- FIG. 6 is a graph of a ratio of segmented area to the spherical area based on the number of segments and different amounts of segment overlap for square pole segments.
- Table 1 shows segmentation latitudes for varying amounts of segment overlap, varying number of joining segments, and the use of circular or square pole tiles.
- the ratio of tile area to spherical area when segmenting each hemisphere of the sphere into 1 segment and 1 pole may be described by: min over θ of ( θ + (π/2 − θ)²/2 ), where θ is the tile-pole segmentation latitude.
- where there are multiple joining segments, θ1 is the tile-tile segmentation latitude and θ2 is the tile-pole segmentation latitude.
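A numeric sketch of finding the best segmentation latitude, under the assumption that a pole cap maps to a circle whose radius equals its arc length from the pole and the joining band maps equirectangularly (plausible readings of the projection described here, not a quote of the patent's own formula):

```python
import math

def tile_to_sphere_ratio(theta):
    """Ratio of tile area to hemisphere area for one circular pole tile
    plus one joining band, cut at latitude theta (radians), on a unit
    sphere: the cap of angular radius (pi/2 - theta) becomes a circle of
    that arc-length radius, the band [0, theta] a 2*pi-wide rectangle."""
    cap = math.pi * (math.pi / 2 - theta) ** 2
    band = 2 * math.pi * theta
    return (cap + band) / (2 * math.pi)        # hemisphere area = 2*pi

# Sample latitudes between the equator and the pole and pick the minimum
thetas = [i * (math.pi / 2) / 10000 for i in range(10001)]
best_theta = min(thetas, key=tile_to_sphere_ratio)
# Analytically the minimum falls at theta = pi/2 - 1 (about 32.7 degrees)
```

At theta = pi/2 (no pole tiles) the ratio recovers the equirectangular 157%, which is a useful sanity check on the assumptions.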
- Table 2 provides examples of coding levels and the corresponding HEVC-supported resolution, equivalent equirectangular resolution, and equivalent resolution displayed to a single eye (90° FOV) when using different tiles.
- FIG. 7 depicts segmented sphere projection using a single equatorial tile segment and square poles.
- a sphere 700 may be segmented into north pole 702 and south pole 704 joined by a joining segment 706.
- the poles 702, 704 are mapped to circles within squares 708, 710 and the joining segment 706 is mapped to a rectangle 712.
- the pole tiles 708, 710 and the equator tile 712 can be vertically stacked to form a frame for encoding.
- FIG. 8 depicts segmented sphere projection using multiple equatorial tile segments and square poles. As depicted, a sphere 800 may be segmented into two pole segments and a plurality of equatorial joining segments, which are mapped to square pole tiles and multiple rectangular tiles.
- the layout of the tiles may be vertically arranged when forming a frame as shown in FIGs. 7 and 8.
- the formulas for the SSP are shown below.
- Equation (3) indicates how to map a point on the cap (θ′, φ′) into a point in the circle (x′, y′). Note that the sign differs between the north and south poles.
- Equation (4) indicates how to map the equator to the middle rectangle. It uses the same equation as Equirectangular Projection (ERP) to convert the equator area into the rectangle.
- Equations (5) and (6) indicate how to map the rest of the segments to rectangles. It also uses the same equation as Equirectangular Projection (ERP) to map to rectangles.
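The cap-to-circle and equator mappings can be sketched as follows. Equations (3) to (6) themselves are not reproduced in this text, so the exact formulas below are assumptions consistent with the description: an arc-length radius for the caps, with a sign flip between poles, and plain ERP for the equator:

```python
import math

def cap_to_circle(theta_p, phi, north=True):
    """Map a point on a pole cap (angular distance theta_p from the pole,
    longitude phi) to circle coordinates (x', y') on a unit sphere. The
    circle radius equals the arc length from the pole, and the y' sign
    flips between the north and south poles."""
    rho = theta_p
    x = rho * math.cos(phi)
    y = rho * math.sin(phi) * (1.0 if north else -1.0)
    return x, y

def equator_to_rect(lat, lon):
    """Equirectangular (ERP) mapping of the joining band on a unit
    sphere: longitude maps linearly to x, latitude to y."""
    return lon, lat
```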
- FIG. 9 depicts segmented sphere projection using equally sized equatorial tile segments and square poles.
- the projection depicted in FIG. 9 may be similar to that depicted in FIG. 7; however, rather than mapping the single joining segment to a single rectangle, the projection depicted in FIG. 9 breaks the single rectangle into 4 squares. That is, the sphere 900 is segmented into two poles 902, 904 and a joining segment 906 and mapped to circles on squares 908, 910 and squares 912a-d.
- the Segmented Sphere Projection (SSP) of FIG. 9 segments the sphere into 3 tiles: north pole, equator and south pole. The boundaries of the 3 segments are at 45°N and 45°S.
- the north and south pole are mapped into 2 circles, and the projection of the equatorial segment is the same as ERP.
- the diameter of the circle is equal to the height of the equatorial segment since both the pole segments and the equatorial segment have a 90° latitude span.
- the segment tiles may be packed together to form a frame for encoding.
- the packing process attempts to put each region of the SSP segments into one 2D image with the least wasted area.
- in the first type, depicted in FIG. 10, the two circles on squares 1002 are placed vertically on top of the rectangles 1004. The circles are centered horizontally on the center of the equator rectangle, and all the other rectangles are centered vertically on the center of the equator rectangle.
- in the second type, depicted in FIG. 11, two circles 1102 are placed horizontally on top of the rectangles 1104. The circles are also centered horizontally on the center of the equator rectangle, and all the other rectangles are also centered vertically on the center of the equator rectangle.
- in the third type, depicted in FIG. 12, two circles 1202 are placed on the left side and the right side of the equator rectangle 1204.
- the highest point of the circle is as high as the top edge of the rectangle of the equator. All the other rectangles are placed so that the bottom edges of all rectangles are at the same height.
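The three packing layouts differ only in where the bounding boxes land, so the resulting frame sizes can be sketched directly. The layout names and the rule for each are assumptions drawn from the descriptions above, not terms from the specification:

```python
def frame_size(layout, circle_d, rect_w, rect_h):
    """Bounding box (width, height) in pixels of the packed 2D frame for
    the three packing types described above.
    'stacked'      - FIG. 10: the two circles one above the other, above
                     the equator rectangle
    'side_by_side' - FIG. 11: the two circles in a row above the equator
                     rectangle
    'flanking'     - FIG. 12: the circles to the left and right of the
                     equator rectangle"""
    if layout == "stacked":
        return max(circle_d, rect_w), 2 * circle_d + rect_h
    if layout == "side_by_side":
        return max(2 * circle_d, rect_w), circle_d + rect_h
    if layout == "flanking":
        return 2 * circle_d + rect_w, max(circle_d, rect_h)
    raise ValueError(layout)
```

With hypothetical 512-pixel circles and a 2048x512 equator rectangle, the flanking layout yields the shortest but widest frame.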
- FIG. 13 depicts stacking of overlapping tile segments for stereoscopic omnidirectional video.
- stereoscopic video there are two views.
- the segmented tiles of each of the views 1302, 1304 are packed side by side.
- FIG. 13 shows a layout of a 1-tile, 1-pole SSP that supports stereoscopic video.
- an SSP Video Information box is defined, containing fields including an unsigned int(8) geometry_type.
- is_stereoscopic indicates whether stereoscopic media rendering is used or not.
- the value of this field is equal to 1 to indicate that the video in the referenced track is divided into two parts to provide different texture data for left eye and right eye separately according to the composition type specified by stereoscopic_type.
- geometry_type indicates the type of geometry for rendering of omnidirectional media. It may be GEOMETRY_ERP indicating that an
- GEOMETRY_CMP indicating a cube map projection
- GEOMETRY_SSP indicating a segmented sphere projection
- stereoscopic_type indicates the type of composition for the stereoscopic video in the referenced track.
- ssp_theta_num indicates how many θ values are used. The total number of SSP segments, including the north pole and south pole, will then be 2 × ssp_theta_num + 1; the default value is 1.
- ssp_theta_id indicates the identifier of the theta.
- ssp_theta contains θ values in degrees, ranging from 0 to 180. The default value is 45.
- ssp_overlap_width indicates the width, in pixels, of the overlap.
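A hypothetical serialization of the box, using the field names from the text above; the byte layout, the field widths beyond geometry_type, and the GEOMETRY_SSP enum value are all assumptions for illustration:

```python
import struct

GEOMETRY_SSP = 2  # hypothetical enum value; the text defines only the name

def pack_ssp_info(is_stereoscopic, stereoscopic_type, thetas, overlap_width):
    """Serialize an illustrative SSP Video Information box. Field names
    follow the text above; the byte layout and widths are guesses."""
    buf = struct.pack("<BBBB", GEOMETRY_SSP,
                      1 if is_stereoscopic else 0,
                      stereoscopic_type,
                      len(thetas))                      # ssp_theta_num
    for theta_id, theta_deg in enumerate(thetas):
        buf += struct.pack("<BH", theta_id, theta_deg)  # ssp_theta_id, ssp_theta
    buf += struct.pack("<H", overlap_width)             # ssp_overlap_width
    return buf
```

For example, pack_ssp_info(True, 0, [45], 16) yields a 9-byte box for the default single-theta configuration.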
- FIG. 14 depicts region of interest (ROI) encoding.
- an ROI target encoding process 1400 uses ROI information 1406, which may comprise a mask 1408 specifying the ROI portion of the raw video 1402 being encoded.
- the raw video 1402 is depicted as a video frame 1404 having a person and a tree, with the person being the ROI.
- the raw video 1402 and the ROI information 1406 may be used to lower the encoding quality of the non-ROI areas of the raw video by the encoder 1410.
- the reduced quality of the non-ROI areas allows optimized bitrate allocation, in order to achieve the highest quality encoding of ROI areas at a constant bitrate.
- the encoder 1410 provides an ROI optimized video 1412.
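ROI-targeted bitrate allocation is commonly realized as a per-block QP map. A minimal sketch follows; the block size, the two QP values, and the mask format are assumptions, and a real encoder would consume such a map through its rate-control API:

```python
def roi_qp_map(mask, base_qp=30, roi_qp=22, block=16):
    """Build a per-block QP map from a binary ROI mask (2D list of 0/1):
    blocks that overlap the ROI get the lower (higher-quality) QP, all
    other blocks keep the base QP."""
    h, w = len(mask), len(mask[0])
    qp_map = []
    for by in range(0, h, block):
        row = []
        for bx in range(0, w, block):
            overlaps = any(mask[y][x]
                           for y in range(by, min(by + block, h))
                           for x in range(bx, min(bx + block, w)))
            row.append(roi_qp if overlaps else base_qp)
        qp_map.append(row)
    return qp_map
```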
- FIG. 15 depicts further ROI encoding.
- the process 1500 is similar to that described above with reference to FIG. 14; however, the process tracks ROIs across the raw video.
- the raw video 1402 is provided to an ROI analysis and tracking functionality 1506.
- for ROI tracking, the user may point out objects in the first frame, or any subsequent frames, on which the ROIs are based.
- the tracking scheme uses an image segmentation algorithm to estimate an ROI corresponding to the selected objects.
- the image segmentation algorithm is tuned specifically for omnidirectional videos such that it automatically adjusts the area allocation to achieve better efficiency when the resulting ROI is applied to the projected frames.
- an optic flow tracking algorithm is used to generate the ROIs for successive frames based on prior frames.
- the number of feature points, the fineness of the optic flow vector field and other parameters are chosen by the algorithm to maximize its efficiency for the projection scheme. Users can pause the optic flow tracking algorithm at any point, and manually define the ROI for a specific frame with the same image segmentation algorithm.
- the optic flow tracking algorithm will use the newest manually specified mask as its reference once it is resumed.
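The flow-based propagation step can be sketched as shifting the ROI by the mean flow measured inside it. Computing the dense flow field itself is left to an optical-flow library, and the bounding-box ROI representation here is an assumption:

```python
def propagate_roi(roi_box, flow_vectors):
    """Shift an ROI bounding box (x, y, w, h) into the next frame by the
    mean optic-flow vector measured inside it. flow_vectors is a list of
    (dx, dy) samples; a dense optical-flow algorithm would supply them."""
    if not flow_vectors:
        return roi_box                  # no motion information: keep the box
    dx = sum(v[0] for v in flow_vectors) / len(flow_vectors)
    dy = sum(v[1] for v in flow_vectors) / len(flow_vectors)
    x, y, w, h = roi_box
    return (x + dx, y + dy, w, h)
```

When the user pauses tracking and redraws the ROI, the manually specified box would simply replace roi_box before propagation resumes.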
- FIG. 16 depicts an ROI heat map.
- the heat map 1600 depicts the most common locations of ROIs, indicated by brightness, with the most common area depicted in white 1602.
- the heat map 1600 provides information on the most common locations within the pole tiles 1604, 1606 and the equatorial tiles 1608.
- the ROI expansion size margin is relatively low and the ROI border is sharp.
- the segmentation iteration is high and the number of feature points is small.
- the fineness of the optic flow field is low. In the low frequency regions there is a smaller ROI region with a sharp transition tuned for static video.
- the ROI expansion size margin is relatively high and the ROI border is smooth.
- the segmentation iteration is low and the number of feature points is large.
- the fineness of the optic flow field is high. In the high frequency regions there is a larger ROI region with a smooth transition that is motion-sensitive.
- the first is adjusting QP utilization. Regular video encoders treat every block (coding unit, CU) in the video stream equally. However, for omnidirectional video with ROI information, it is possible to tune the quantization parameters to give ROI areas higher quality.
- the second is resolution utilization. As described above, an omnidirectional video will be cut and reshaped, and some of the tiles may not contain any ROI area. There is therefore no need to keep the same resolution ratio for those tiles, and tiles that do not contain an ROI area can be downscaled to a lower resolution and encoded with tuned QP parameters in order to save bitrate.
- the ROI areas' resolution 1704, 1708 will be enhanced.
- the extra pixel information will be stored in even-frames 1706 while the original frames become all odd-frames 1702.
- abrupt changes of resolution may be uncomfortable for the viewer, and as such, the resolution may be adjusted slowly while there is limited motion in the video.
- FIG. 18 depicts view point encoding of omnidirectional video.
- the omnidirectional video 1800 is encoded to provide a plurality of different view points 1802, 1804, 1806.
- Each view point stream is encoded into different time blocks 1808.
- the streaming of different view points can be switched between each other at the different clip starting time blocks.
- the encoded time blocks form a 2D caching scheme that allows different time blocks for different view points to be cached.
- the view points are encoded to include additional streams of I-frames, P-frames and B-frames that allow a smart assembler to quickly recover the decoded stream when switching between the view points.
- FIG. 19 depicts view point encoding of a view of omnidirectional video for adaptive view point streaming.
- an original view point stream 1902 is further encoded into additional streams 1904, 1906, 1908 for the different time clips.
- the additional streams 1904, 1906, 1908 encode different frames into I-frames, P-frames and B-frames.
- the original stream and the additional streams 1910 are transmitted to allow quick view point switching at any time point.
- As shown in FIG. 19, after encoding one whole video stream, several additional streams are encoded. All frames after the I-frame in a GOP in the original stream 1902 are encoded as I-frames, forming Stream 0 1904.
- This view point encoding can shorten the average waiting time during temporal random access. Combined with the spatial division achieved by encoding the different view points described above, it is possible to achieve high spatial and temporal random access during omnidirectional video streaming.
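The switching logic can be sketched as follows: on a view point switch, the all-I-frame side stream supplies an immediate entry point, after which the original stream's predictive frames decode normally. The GOP length and the stream labels are assumptions for this sketch:

```python
GOP = 8  # frames per group of pictures in the original stream (assumed)

def switch_plan(switch_frame):
    """Decide which stream supplies each remaining frame of the current
    GOP after a view point switch at switch_frame: the all-I-frame side
    stream provides an immediate entry point, and the original stream's
    predictive frames can then be decoded from it as usual."""
    gop_end = (switch_frame // GOP) * GOP + GOP
    return [(f, "i_stream" if f == switch_frame else "original")
            for f in range(switch_frame, gop_end)]
```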
- FIG. 20 depicts adaptive view point streaming of omnidirectional video.
- each view point is encoded with additional streams of data, which provides improved adaptive streaming for omnidirectional video.
- previous adaptive streaming techniques encoded a video into a plurality of different quality streams 2002.
- the different quality streams allowed the streaming of a video to adapt to network conditions.
- the adaptive streaming for omnidirectional video allows the adaptive streaming of multiple view points.
- the additional streams allow the quick switching between view points.
- FIG. 21 depicts a system for encoding and streaming omnidirectional video.
- the system is depicted as a server 2100 for processing omnidirectional video that may be provided to the server 2100 from a video system 2102 that captures 360 ° video.
- the server 2100 comprises a processing unit 2104 for executing instructions.
- An input/output (I/O) 2106 interface allows additional components to be operatively coupled to the processing unit 2104.
- the server 2100 further comprises a non-volatile (NV) memory 2108 for storing data and instructions and a memory unit 2110, such as RAM, for storing instructions for execution by the processing unit 2104.
- the instructions stored in the memory unit 2110, when executed by the processing unit 2104, configure the server 2100 to provide an omnidirectional video encoding functionality 2112 in accordance with the functionality described above.
- the encoding functionality 2112 comprises functionality for segmenting and mapping 2114 spherical omnidirectional video data to a number of pole and equatorial joining segments.
- Tile stacking functionality 2116 arranges the segments into a single frame for subsequent encoding.
- the functionality further comprises ROI tracking functionality 2118 that tracks ROIs across frames of the omnidirectional video. The stacked images and ROI information are then used by encoding functionality 2120 to encode the video data.
- FIGs. 22A, 22B and 22C depict devices for capturing panoramic and/or omnidirectional video.
- the devices depicted in FIGs. 22A, 22B, 22C each comprise fisheye lenses that are mounted over the cameras of the device.
- the fisheye devices may be used with a single mobile phone or with additional mobile phones.
- FIG. 22A depicts single phone 2200a that has a front facing camera and a rear facing camera.
- a panoramic video capture device 2202a fits over the phone 2200a and places a first fisheye lens 2204a over a front facing camera and a second fisheye lens 2206a over a back facing camera.
- FIG. 22B depicts a similar panoramic video capture device 2202b; however, rather than placing fisheye lenses over the front and back cameras of a single device, the device 2202b is designed for holding two mobile devices 2200b-1, 2200b-2 back-to-back and places the fisheye lenses 2204b, 2206b over the front facing cameras of the mobile devices.
- FIG. 22C depicts a further device 2202c that is designed for holding three mobile devices 2200c-1, 2200c-2, 2200c-3.
- the device 2202c holds the three mobile devices and arranges fisheye lenses 2204c, 2206c, 2208c over the front facing cameras of the devices.
- the devices 2202a, 2202b, 2202c allow panoramic video to be captured using common mobile devices.
- Each of the fisheye lenses may provide a 180° field of view.
- two or more fisheye video streams are captured simultaneously.
- one of the capture devices acts as the master capture device and may make connections with the other devices to receive the video streams, stitch the videos together and output the panoramic video.
- the two video streams captured by the front and back cameras of the device can be stitched together by the mobile device, which then streams out the resulting video.
- stitching can be done in a player, which is suitable for low-power capture devices. In that case, all capture devices stream video directly to the player.
- the devices depicted in FIGs. 22A, 22B, 22C may be used to stream panoramic video in a video chat system.
- the video streaming process, both between capture devices and from capture device to player, may use Real-time Transport Protocol to transfer real-time video and use Session
- Timestamps for each frame may be added to the stream for synchronization.
- the stitching process may be performed as follows:
- Static photos are captured for every camera and key points are extracted using algorithms like SIFT. After matching the key points from the different cameras, each camera's parameters and rotation can be generated. More details about stitching are described below.
- each frame from each camera can be remapped onto a sphere.
- A linear or multi-band blending algorithm is used to blend the remapped frames from different cameras to produce a 360 degree panorama frame, which is usually projected into a rectangular image as described above.
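By way of illustration, the remapping step above can be sketched in Python under an assumed equidistant fisheye lens model (r = f·θ); the function name and parameters are illustrative and not part of the disclosure:

```python
import math

def fisheye_to_sphere(x, y, cx, cy, f):
    """Map pixel (x, y) of an equidistant fisheye image (r = f * theta) to
    spherical coordinates (theta, phi) about the lens axis. (cx, cy) is the
    center of the fisheye circle and f the focal length in pixels; both are
    assumed, approximate camera parameters. Returns None outside a
    180-degree field of view."""
    dx, dy = x - cx, y - cy
    theta = math.hypot(dx, dy) / f      # angle from the optical axis
    if theta > math.pi / 2:             # beyond the 180-degree FOV
        return None
    phi = math.atan2(dy, dx)            # azimuth around the axis
    return theta, phi

# The center pixel lies on the optical axis:
print(fisheye_to_sphere(512, 512, 512, 512, 326))  # → (0.0, 0.0)
```

Inverting this mapping per pixel of the output sphere is what the stitching template precomputes, so that per-frame remapping is a table lookup.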
- Generating the stitching template is illustrated in FIGs. 23A, 23B, 23C.
- Using key points directly extracted from fisheye images to perform matching may produce bad results as depicted in FIG. 23A.
- the mismatch between the fisheye videos 2302a, 2302b may result because distortion effects of the fisheye lens make objects far from the image center hard to recognize by algorithms like SIFT, and because most parts of the images do not overlap.
- Generating the stitching template uses predefined approximate camera parameters to remap the fisheye images to flattened images 2304a, 2304b before extracting key points, as depicted in FIG. 23B. Based on the predefined camera parameters, key points at certain areas 2306a, 2306b can be ignored safely. After a correct match is found in the remapped images 2306a, 2306b, those key points are then un-mapped into the original fisheye images 2308a, 2308b to compute the final camera parameters and rotation to provide a proper matching between the captured videos.
- Another important step in generating the template is calculating a brightness map as depicted in FIGs. 24A and 24B.
- As a fisheye lens is used, the brightness of each pixel varies greatly near the border as depicted in FIG. 24A.
- the brightness map, which provides brightness values for each pixel in an image, can be calculated as depicted in FIG. 24B and used later to correct image brightness before blending.
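As an illustrative sketch of such a brightness map, the classic cos⁴ vignetting model can stand in for the empirically measured falloff (the disclosure computes the map from captured images; the model and constants here are assumptions):

```python
import math

def brightness_map(w, h, cx, cy, f):
    """Per-pixel brightness falloff of a fisheye lens approximated by the
    cos^4 vignetting model (a stand-in: the disclosure derives the map from
    captured images). Values lie in (0, 1], with 1.0 at the image center."""
    gain = [[0.0] * w for _ in range(h)]
    for yy in range(h):
        for xx in range(w):
            theta = math.hypot(xx - cx, yy - cy) / f
            gain[yy][xx] = max(math.cos(min(theta, math.pi / 2)), 1e-3) ** 4
    return gain

def correct(pixel, gain):
    """Divide out the falloff so border pixels match the center before blending."""
    return min(255, int(round(pixel / gain)))

g = brightness_map(8, 8, 4, 4, 5)
print(g[4][4], round(g[0][0], 3))  # center is brightest; corners are attenuated
```

Applying `correct` to each pixel with its map value equalizes brightness across the overlap region before the blend.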
- the stitching process may also involve audio reconstruction. Audio streams captured by several devices at different positions can be combined.
- the player (usually smart devices or headsets) receives the stitched panorama video (or multiple original video streams and then does the stitching) and displays it.
- a user can look at different angles using rotation sensors in the player such as gyroscopes.
Abstract
Systems, devices and methods for capturing, encoding and streaming 360° video. Devices allow a fisheye lens to be placed over mobile device cameras, allowing two or more cameras to capture a full 360° video. Omnidirectional video may be segmented into a plurality of poles and one or more equatorial tiles. The segmented tiles may be stacked into a frame for encoding. Multiple view points may be encoded in order to provide adaptive view point streaming.
Description
OMNIDIRECTIONAL VIDEO ENCODING AND STREAMING
TECHNICAL FIELD
[1] The current disclosure relates to encoding and streaming of video and in particular to encoding and streaming omnidirectional video.
BACKGROUND
[2] Omnidirectional video provides a 360° view of an environment.
Omnidirectional video allows a viewer to view any desired portion of the 360° environment. Encoding omnidirectional video may use existing encoding techniques used for 2-dimensional (2D) video, by projecting the omnidirectional video from a sphere to one or more rectangles. FIG. 1 depicts projecting the omnidirectional video from a sphere 100 onto one or more rectangles 102, 104a, 104b, 104c, 104d, 104e, 104f using equirectangular projection and cubic projection. In both cases of equirectangular projection and cubic projection, the resulting 2D projections have wasted pixels. As depicted in FIG. 1 the area of the omnidirectional video is that of a sphere 100. If the omnidirectional video's sphere has a radius of r, the omnidirectional video covers an area of 4πr². However, in equirectangular projection, the sphere's area is projected onto a rectangle having an area of 2π²r², which is 157% the area of the sphere. Similarly, in cubic projection, the sphere's area is projected to six squares having a combined area of 6πr², which is 150% the area of the sphere. Accordingly, both projection techniques result in a relatively large amount of unnecessary information being encoded.
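The stated area ratios follow directly from the formulas above and can be verified:

```python
import math

r = 1.0
sphere = 4 * math.pi * r ** 2         # area of the spherical video
equirect = 2 * math.pi ** 2 * r ** 2  # equirectangular: 2*pi*r wide, pi*r tall
cubic = 6 * math.pi * r ** 2          # six faces as sized in the disclosure

print(round(equirect / sphere, 3))    # 1.571, i.e. about 157% of the sphere
print(round(cubic / sphere, 2))       # 1.5, i.e. 150% of the sphere
```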
[3] In 2D videos, photographers tend to use the whole frame to capture all regions of interest (ROI). However, in omnidirectional videos, a relatively high percentage of pixels are used to render the environment of the scene. Regular encoding methods treat these non-ROI regions the same as ROI regions. Accordingly, the non-ROI areas may utilize bit rate unnecessarily, resulting in a lower available bit rate for encoding ROI areas. The encoded video may be streamed to a viewer for interactive viewing of the 360° video.
[4] An additional, alternative and/or improved encoding technique for encoding omnidirectional video as well as an improved streaming technique for streaming the encoded omnidirectional video is desirable.
SUMMARY
[5] The present disclosure provides a new encoding method that uses a nearly equal-area projection. The encoding may also use ROI-targeted encoding to provide the encoded omnidirectional videos. Further, the present disclosure provides adaptive streaming techniques for omnidirectional videos. The present disclosure further provides video capture devices and stitching techniques for capturing panoramic and omnidirectional video.
[6] In accordance with the present disclosure, there is provided a method of encoding omnidirectional video comprising: receiving spherical omnidirectional video data; segmenting a frame of the spherical omnidirectional video data into a north pole segment formed by mapping a top spherical dome portion of the frame of the spherical omnidirectional video data onto a circle, a south pole segment formed by mapping a bottom spherical dome portion of the frame of the spherical omnidirectional video data onto a circle and at least one joining segment formed by mapping a spherical joining portion of the frame of the spherical omnidirectional video data joining the top spherical dome portion and the bottom spherical dome portion onto at least one rectangle; stacking the north pole segment, the south pole segment and the at least one joining segment together to form a 2-dimensional (2D) frame corresponding to the frame of the spherical
omnidirectional video data; and encoding the 2D frame. [7] In a further embodiment, the method further comprises segmenting a plurality of frames of the spherical omnidirectional video data into a plurality of north pole segments, south pole segments and joining segments; stacking the plurality of north pole segments, south pole segments and joining segments into a plurality of 2D frames; and encoding the plurality of 2D frames.
[8] In a further embodiment of the method, the at least one joining segment comprises a plurality of joining segments, each mapping a portion of the spherical joining portion onto a respective rectangle.
[9] In a further embodiment of the method, the circular poles are placed within squares.
[10] In a further embodiment of the method, each of the north pole segment, the south pole segment and the at least one joining segment comprise overlapping pixel data.
[11] In a further embodiment of the method, there is up to 5% overlap between the segments.
[12] In a further embodiment of the method, the at least one joining segment comprises between 2 and 4 segments.
[13] In a further embodiment, the method further comprises tracking one or more regions of interest (ROI) before encoding the 2D frames. [14] In a further embodiment of the method, encoding the 2D frame comprises: encoding one or more view points into a first stream; and for each view point encoding additional streams comprising an intracoded frame stream, a predictive frame stream, and a bi-predictive frame stream.
[15] In a further embodiment, the method further comprises streaming at least one of the encoded view points.
[16] In accordance with the present disclosure there is further provided a system for encoding omnidirectional video comprising: a processor for executing instructions; and a memory for storing instructions, which when executed by the processor, configure the system to provide a method comprising: receiving spherical omnidirectional video data; segmenting a frame of the spherical omnidirectional video data into a north pole segment formed by mapping a top spherical dome portion of the frame of the spherical omnidirectional video data onto a circle, a south pole segment formed by
mapping a bottom spherical dome portion of the frame of the spherical omnidirectional video data onto a circle and at least one joining segment formed by mapping a spherical joining portion of the frame of the spherical omnidirectional video data joining the top spherical dome portion and the bottom spherical dome portion onto at least one rectangle; stacking the north pole segment, the south pole segment and the at least one joining segment together to form a 2-dimensional (2D) frame corresponding to the frame of the spherical omnidirectional video data; and encoding the 2D frame.
[17] In a further embodiment of the system, the instructions further configure the system to: segment a plurality of frames of the spherical omnidirectional video data into a plurality of north pole segments, south pole segments and joining segments; stack the plurality of north pole segments, south pole segments and joining segments into a plurality of 2D frames; and encode the plurality of 2D frames. [18] In a further embodiment of the system, the at least one joining segment comprises a plurality of joining segments, each mapping a portion of the spherical joining portion onto a respective rectangle.
[19] In a further embodiment of the system, the circular poles are placed within squares. [20] In a further embodiment of the system, each of the north pole segment, the south pole segment and the at least one joining segment comprise overlapping pixel data.
[21] In a further embodiment of the system, there is up to 5% overlap between the segments. [22] In a further embodiment of the system, the at least one joining segment comprises between 2 and 4 segments.
[23] In a further embodiment of the system, the instructions further configure the system to track one or more regions of interest (ROI) before encoding the 2D frames.
[24] In a further embodiment of the system, encoding the 2D frame comprises: encoding one or more view points into a first stream; and for each view point encoding additional streams comprising an intracoded frame stream, a predictive frame stream, and a bi-predictive frame stream. [25] In a further embodiment of the system, the instructions further configure the system to stream at least one of the encoded view points.
[26] In accordance with the present disclosure there is further provided a device for use in capturing panoramic video comprising: a frame for holding a mobile device; a first fisheye lens mounted on the frame and arranged to be located over a front facing camera of the mobile device when the mobile device is held by the frame; and a second fisheye lens mounted on the frame and arranged to be located over a rear facing camera of the mobile device when the mobile device is held by the frame.
[27] In accordance with the present disclosure there is further provided a method of stitching multiple videos captured from one or more mobile devices comprising: generating a stitching template for each camera capturing the videos; synchronizing frames of the captured video using timestamps of the frames; remapping the multiple videos onto a sphere using the stitching template; and blending the remapped images to provide a panoramic video.
BRIEF DESCRIPTION OF THE DRAWINGS
[28] Features, aspects and advantages of the present disclosure will become better understood with regard to the following description and accompanying drawings in which:
[29] FIG. 1 depicts equirectangular projection and cubic projection of a sphere;
[30] FIGs. 2A and 2B depict segmented sphere projection (SSP) of a sphere;
[31] FIG. 3 depicts stacking of segments from a segmented sphere projection;
[32] FIG. 4 is a graph of a ratio of segmented area to the spherical area based on the number of segments for both circular pole and square pole segments;
[33] FIG. 5 is a graph of a ratio of segmented area to the spherical area based on the number of segments and different amounts of segment overlap for circular pole segments;
[34] FIG. 6 is a graph of a ratio of segmented area to the spherical area based on the number of segments and different amounts of segment overlap for square pole segments; [35] FIG. 7 depicts segmented sphere projection using a single equatorial tile segment and square poles;
[36] FIG. 8 depicts segmented sphere projection using multiple equatorial tile segments and square poles;
[37] FIG. 9 depicts segmented sphere projection using equally sized equatorial tile segments and square poles;
[38] FIG. 10 depicts stacking of overlapping tile segments;
[39] FIG. 11 depicts further stacking of overlapping tile segments;
[40] FIG. 12 depicts further stacking of overlapping tile segments;
[41] FIG. 13 depicts stacking of overlapping tile segments for stereoscopic omnidirectional video;
[42] FIG. 14 depicts region of interest (ROI) encoding;
[43] FIG. 15 depicts further ROI encoding;
[44] FIG. 16 depicts an ROI heat map;
[45] FIG. 17 depicts ROI temporal encoding;
[46] FIG. 18 depicts view point encoding of omnidirectional video;
[47] FIG. 19 depicts view point encoding of a view of omnidirectional video for adaptive view point streaming;
[48] FIG. 20 depicts adaptive view point streaming of omnidirectional video; [49] FIG. 21 depicts a system for encoding and streaming omnidirectional video;
[50] FIGs. 22A, 22B and 22C depict devices for capturing panoramic and/or omnidirectional video;
[51] FIGs. 23A, 23B, and 23C depict stitching video together; and [52] FIGs. 24A and 24B depict brightness mapping.
DETAILED DESCRIPTION
[53] Omnidirectional video can be encoded using regular encoding techniques by first projecting the video from a sphere to 2-dimensional (2D) tiles. Segmented sphere projection (SSP) projects the spherical video from a top dome or cap portion of the sphere, a bottom dome or cap portion of the sphere and a middle equatorial portion of the sphere joining the top and bottom cap portions. The top and bottom cap segments may be mapped to circular tiles or to a circular portion of a square tile. The equatorial portion of the sphere may be mapped to one or more rectangular tiles. The tiles may then be stacked together into a single frame for subsequent encoding. The total area of tiles resulting from SSP may be smaller than the total area resulting from either equirectangular projection or cubic projection. The tile area for SSP is close to that of the area of the sphere of the omnidirectional video. [54] In addition to segmenting the sphere into tiles having a lower total area than other projection techniques such as equirectangular projection or cubic projection, the encoding efficiency of omnidirectional video may be further improved by encoding particular region of interest (ROI) portions of the
omnidirectional video with a higher bitrate while encoding non-ROI portions of the omnidirectional video using a lower bitrate.
[55] FIGs. 2A and 2B depict segmented sphere projection (SSP) of a sphere. As depicted in FIG. 2A, the sphere is segmented and mapped to tiles using an improved projection based on Sinusoidal projection. As depicted, the sphere 200 is cut along its latitude into several segments including a north pole segment 202a, a south pole segment 202b (referred to collectively as pole segments 202) and one or more equatorial joining segments 204a-f (referred to collectively as joining segments 204) between the two poles. The segments may then be mapped to tiles, and in particular the pole segments may be mapped to circular tiles 206a, 206b and the joining segments 204 may be mapped to rectangular tiles 208a-208f. As depicted in FIG. 2B, the number of joining segments can vary. The sphere 200 may be cut into two pole segments 210a, 210b and 3 equatorial joining segments 212a, 212b, 212c. Each of the two pole segments 210a, 210b is mapped to a respective circle contained by squares 214a, 214b and the joining segments 212a-c are mapped to respective rectangles 216a-c. The individual tiles may overlap with each other a certain amount in order to maintain video quality during further processing. Once segmented into tiles, the individual tiles may be stacked together to form a frame that may be encoded using various encoding techniques.
[56] FIG. 3 depicts stacking of segments from a segmented sphere projection. As depicted, individual joining segment tiles 304a-c may be stacked together with the square pole tiles 302a, 302b and arranged in a rectangular frame 300. The rectangular frame 300 may then be encoded using, for example, an H.264 encoder, although other encoding techniques may be used.
[57] FIG. 4 is a graph of a ratio of segmented area to the spherical area based on the number of segments for both circular pole and square pole segments. As depicted in the graph of FIG. 4, as the number of equatorial joining segments increases, the total area of the segmented tiles approaches the area of the sphere. As can be seen from FIG. 4, the segmented tile area
is greater when using square poles when compared to circular poles. As the number of segmented tiles increases, the tile-pole segmentation latitude, that is the latitude where the sphere is cut to form the segments, will be pushed toward the poles. [58] FIG. 5 is a graph of a ratio of segmented area to the spherical area based on the number of segments and different amounts of segment overlap for circular pole segments. FIG. 6 is a graph of a ratio of segmented area to the spherical area based on the number of segments and different amounts of segment overlap for square pole segments. As can be seen from the graphs of FIGs. 5 and 6, if there is overlap between the segmented tiles, then as the number of segments increases, so does the total area. In contrast, when there is no overlap between the segmented tiles, the total area decreases as the number of segments increases. Table 1 below shows segmentation latitudes for varying amounts of segment overlap, varying numbers of joining segments, and the use of circular or square pole tiles.
Table 1 showing segmentation latitudes for different segmentations and overlap
[59] As described above, the number of segment tiles may vary. The segmentation of each hemisphere of the sphere into 1 segment and 1 pole may be described by:

min_{θ ∈ (0, π/2)} 4πr²((π/2 − θ)²/2 + θ)    (1)

[60] Where θ is the tile-pole segmentation latitude. The minimum total area is about 107.1% of the sphere's area of 4πr² when θ = 32.70°.
[61] The segmentation of each hemisphere of the sphere into 2 segments and 1 pole may be described by:

min_{θ1 < θ2, θ1, θ2 ∈ (0, π/2)} 4πr²((π/2 − θ2)²/2 + θ1 + (θ2 − θ1)cos θ1)    (2)

[62] Where θ1 is the tile-tile segmentation latitude and θ2 is the tile-pole segmentation latitude. The minimum total area is about 105.4% when θ1 = 25.34° and θ2 = 38.22°.
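The stated optima can be checked numerically. The sketch below brute-force minimizes the total tile area relative to the sphere's area for the 1-segment and 2-segment cases (the objective functions are reconstructed from the stated results, not quoted from the disclosure):

```python
import math

def one_segment_ratio(t):
    # Total tile area / sphere area: 1 joining segment + 1 pole per hemisphere,
    # circular poles; t is the tile-pole segmentation latitude in radians.
    return (math.pi / 2 - t) ** 2 / 2 + t

def two_segment_ratio(t1, t2):
    # 2 joining segments + 1 pole per hemisphere, circular poles.
    return (math.pi / 2 - t2) ** 2 / 2 + t1 + (t2 - t1) * math.cos(t1)

# Brute-force search over candidate latitudes (radians).
grid = [i * math.pi / 2 / 2000 for i in range(1, 2000)]
t_best = min(grid, key=one_segment_ratio)
t1_best, t2_best = min(((a, b) for a in grid[::10] for b in grid[::10] if a < b),
                       key=lambda p: two_segment_ratio(*p))

print(round(math.degrees(t_best), 1))                 # ≈ 32.7 degrees
print(round(one_segment_ratio(t_best), 3))            # ≈ 1.071, i.e. 107.1%
print(round(two_segment_ratio(t1_best, t2_best), 3))  # ≈ 1.054, i.e. 105.4%
```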
[63] As described above, the poles may be mapped to circles; however, when encoding the resulting tiles, the circles are placed in squares. Placing the circles of the poles in squares will increase the total area of the segments to about 117.8% for 1 tile and 1 pole per hemisphere when θ = 45°. When each hemisphere is segmented into 2 tiles and 1 pole, the total area is approximately 113.4% when θ1 = 35.07° and θ2 = 53.16°. [64] Different hardware decoders have different decoding abilities. Taking HEVC as an example, Table 2 provides examples of coding levels and the corresponding HEVC-supported resolution, equivalent equirectangular resolution and equivalent resolution displayed on a single eye (FOV 90°) when using different tiles.
Table 2 showing resolutions for different segmentations and HEVC encoding levels
[65] FIG. 7 depicts segmented sphere projection using a single equatorial tile segment and square poles. As depicted, a sphere 700 may be segmented into a north pole 702 and a south pole 704 joined by a joining segment 706. The poles 702, 704 are mapped to circles within squares 708, 710 and the joining segment 706 is mapped to a rectangle 712. As depicted, the pole tiles 708, 710 and the equator tile 712 can be vertically stacked to form a frame for encoding.
[66] FIG. 8 depicts segmented sphere projection using multiple equatorial tile segments and square poles. As depicted, a sphere 800 may be
segmented into a north pole and a south pole joined by a number of joining segments. The poles are mapped to circles within squares 802, 804 and the joining segments are mapped to rectangles 806. As depicted, the pole tiles 808 and the equator tiles 810 can be vertically stacked to form a frame for encoding. [67] As is shown in both FIGs. 7 and 8, Segmented Sphere Projection
(SSP) segments the sphere into several segments: north pole, south pole and the rest. The boundaries of all segments in the north and south are
symmetric. The north and south poles are mapped into 2 circles, and the rest of the segments are projected onto one or more rectangles. [68] The layout of the tiles may be vertically arranged when forming a frame as shown in FIGs. 7 and 8. The formulas for the SSP are shown below.
Assuming there are k segmentation latitudes θ, namely α1, α2, …, αk, then there will be 2k + 1 segments, where φ ∈ (−π, π] and i = 1, 2, …, k − 1.
[69] The origin is in the upper left corner of the image. The initial side of θ' is located in the equatorial plane; θ' is positive in the north hemisphere and negative in the south hemisphere. Equation (3) indicates how to map a point on the cap (θ', φ) into a point in the circle (x', y'). It should be noticed that there are differences in sign between the north and south poles. Equation (4) indicates how to map the equator to the middle rectangle. It uses the same equation as Equirectangular Projection (ERP) to convert the equator area into the rectangle. Equations (5) and (6) indicate how to map the rest of the segments to rectangles. They also use the same equation as Equirectangular Projection (ERP) to map to rectangles.
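As a sketch of the ERP-style mapping applied to each joining segment (the function and its parameters are illustrative; the disclosure's exact equations (3)-(6) are not reproduced here):

```python
import math

def erp_map(theta, phi, width, height, theta_lo, theta_hi):
    """ERP-style mapping for a joining segment: longitude phi in (-pi, pi]
    maps linearly to x, latitude theta in [theta_lo, theta_hi] maps linearly
    to y, with y = 0 at the top of the tile. Names are illustrative."""
    x = (phi + math.pi) / (2 * math.pi) * width
    y = (theta_hi - theta) / (theta_hi - theta_lo) * height
    return x, y

# The point at latitude 0, longitude 0 lands at the center of a tile
# spanning latitudes -45 to 45 degrees:
print(erp_map(0.0, 0.0, 3600, 900, -math.pi / 4, math.pi / 4))
```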
[70] FIG. 9 depicts segmented sphere projection using equally sized equatorial tile segments and square poles. The projection depicted in FIG. 9 may be similar to that depicted in FIG. 7; however, rather than mapping the single joining segment to a single rectangle, the projection depicted in FIG. 9 breaks the single rectangle into 4 squares. That is, the sphere 900 is segmented into two poles 902, 904 and a joining segment 906 and mapped to circles on squares 908, 910 and squares 912a-d. As depicted, the
Segmented Sphere Projection (SSP) of FIG. 9 segments the sphere into 3 tiles: north pole, equator and south pole. The boundaries of the 3 segments are 45°N and 45°S. The north and south poles are mapped into 2 circles, and the projection of the equatorial segment is the same as ERP. The diameter of the circle is equal to the height of the equatorial segments since both the pole segments and the equatorial segment have a 90° latitude span.
[71 ] The equatorial segment is split into 4 squares in order to get "faces" of same size. The frame packing structure is depicted in FIG. 9. The corners of the circular poles are filled with "null" values to form the square. Points on the sphere are mapped to the respective tiles according to:
[72]
[73] The segment tiles may be packed together to form a frame for encoding. The packing process attempts to put each region of the SSP segments into one 2D image with the least wasted area.
[74] There are three packing types for SSP. The particular packing method may be selected at the encoder side in order to minimize the wasted area. For the 1st type, depicted in FIG. 10, the two circles on squares 1002 are placed vertically on top of the rectangles 1004. The circles are centered horizontally on the center of the equatorial rectangle, and all the other rectangles are centered vertically on the center of the equatorial rectangle. For the 2nd type, depicted in FIG. 11, two circles 1102 are placed horizontally on top of the rectangles 1104; the circles are likewise centered horizontally, and the other rectangles centered vertically, on the center of the equatorial rectangle. For the 3rd type, depicted in FIG. 12, two circles 1202 are put on the left side and the right side of the rectangle 1204 of the equator. The highest point of each circle is as high as the top edge of the equatorial rectangle, and all the other rectangles are placed so that their bottom edges are at the same height.
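One plausible formalization of choosing among the three packing types is to compute the wasted area of each candidate layout and keep the smallest (the exact tile offsets in the disclosure differ slightly; the dimensions below are illustrative):

```python
def best_packing(w, h, s):
    """Wasted area of three candidate SSP packings: pole tiles stacked above
    the equator rectangle (type 1), side by side above it (type 2), or on
    its left and right (type 3). A plausible formalization only.
    w, h: equator rectangle; s: side of each square pole tile."""
    content = w * h + 2 * s * s
    frames = {
        1: (max(w, s), h + 2 * s),
        2: (max(w, 2 * s), h + s),
        3: (w + 2 * s, max(h, s)),
    }
    waste = {k: fw * fh - content for k, (fw, fh) in frames.items()}
    return min(waste, key=waste.get), waste

# A wide equator tile favors placing the poles at its sides (type 3):
print(best_packing(3600, 900, 900))
```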
[75] The overlap of pixels, depicted by reference numbers 1006, 1106, 1206 in FIGs. 10-12 above, for the segments will also be put in the 2D image, and the width of the overlap area can be indicated by syntax and semantics communicated with the encoded image data to decoders.
[76] FIG. 13 depicts stacking of overlapping tile segments for stereoscopic omnidirectional video. For stereoscopic video, there are two views. The segmented tiles of each of the views 1302, 1304 are packed side by side. FIG. 13 shows a layout of 1 tile 1 pole SSP that supports stereoscopic video.
[77] New syntax is provided below that allows the existing MP4 format to support the SSP format. An SSP Video Information box is defined as below. Although a specific syntax and semantics are described below, it will be appreciated that other implementations are possible.
Syntax
aligned(8) class SSPVideoInfoBox extends FullBox('ssp', version = 0, 0)
{
    bit(8) reserved = 0;
    unsigned int(1) is_stereoscopic;
    if (is_stereoscopic)
        unsigned int(8) stereoscopic_type;
    unsigned int(8) geometry_type;
    if (geometry_type == GEOMETRY_SSP) {
        unsigned int(8) ssp_theta_num;
        for (ssp_theta_id = 0; ssp_theta_id < ssp_theta_num; ssp_theta_id++) {
            unsigned int(8) ssp_theta[ssp_theta_id];
            unsigned int(8) ssp_overlap_pixel[ssp_theta_id];
        }
    }
}
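A minimal serializer for the box defined above might look as follows (the numeric value of GEOMETRY_SSP, the byte packing of the 1-bit is_stereoscopic field and the padding of the three-character box type to a four-character code are assumptions, since the disclosure does not assign them):

```python
import struct

def ssp_video_info_box(thetas, overlaps, is_stereo=False, stereo_type=0,
                       geometry_type=2):
    """Serialize an 'ssp' FullBox (version 0, flags 0) per the syntax above.
    Assumptions: GEOMETRY_SSP is taken as 2, is_stereoscopic occupies a full
    byte, and the box type is padded to four characters."""
    payload = struct.pack(">B", 0)                 # bit(8) reserved = 0
    payload += struct.pack(">B", 1 if is_stereo else 0)
    if is_stereo:
        payload += struct.pack(">B", stereo_type)  # stereoscopic_type
    payload += struct.pack(">B", geometry_type)
    if geometry_type == 2:                         # GEOMETRY_SSP
        payload += struct.pack(">B", len(thetas))  # ssp_theta_num
        for theta, overlap in zip(thetas, overlaps):
            payload += struct.pack(">BB", theta, overlap)
    version_flags = struct.pack(">I", 0)           # FullBox version/flags
    size = 8 + len(version_flags) + len(payload)
    return struct.pack(">I4s", size, b"ssp ") + version_flags + payload

box = ssp_video_info_box([45], [8])
print(len(box))  # → 18
```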
Semantics
Box Type: ssp
Container: Scheme Information box ('vrvd')
Mandatory: Yes
Quantity: One
[78] is_stereoscopic indicates whether stereoscopic media rendering is used or not. The value of this field is equal to 1 to indicate that the video in the referenced track is divided into two parts to provide different texture data for left eye and right eye separately according to the composition type specified by stereoscopic_type.
[79] geometry_type indicates the type of geometry for rendering of omnidirectional media. It may be GEOMETRY_ERP indicating that an
Equirectangular projection is used, GEOMETRY_CMP indicating a cube map projection is used or GEOMETRY_SSP indicating a segmented sphere projection.
[80] stereoscopic_type indicates the type of composition for the stereoscopic video in the referenced track.
[81] ssp_theta_num indicates how many θ values are used. The number of segments of the SSP, including the north pole and south pole, will then be 2*ssp_theta_num + 1 in total; the default value is 1.
[82] ssp_theta_id indicates the identifier of the theta.
[83] ssp_theta contains θ values in terms of degrees, ranging from 0-180. The default value is 45.
[84] ssp_overlap_width indicates the width, in pixels, of the overlap. [85] The above has described the segmenting and mapping of the spherical omnidirectional video into a number of segments that are packed together into a frame and encoded. It will be appreciated that while the segmenting and mapping of a single frame is described, the process will map each of the frames of the omnidirectional video. In addition to the efficient mapping provided by the segmented sphere projection, the encoding efficiency for omnidirectional video may be improved by encoding regions of interest with a higher bitrate while encoding non-ROIs with a lower bitrate.
[86] FIG. 14 depicts region of interest (ROI) encoding. As shown in FIG. 14, an ROI target encoding process 1400 uses ROI information 1406, which may comprise a mask 1408 specifying the ROI portion of the raw video 1402 being encoded. The raw video 1402 is depicted as a video frame 1404 having a person and a tree, with the person being the ROI. The raw video 1402 and the ROI information 1406 may be used by the encoder 1410 to lower the encoding quality of the non-ROI areas of the raw video. The reduced quality of the non-ROI areas allows an optimized bitrate allocation, in order to acquire the highest quality encoding of ROI areas with a constant bitrate. The encoder 1410 provides an ROI optimized video 1412. The output is depicted as having a frame 1414 with a high quality encoding of the person, while the tree is a low quality encoding. [87] FIG. 15 depicts further ROI encoding. The process 1500 is similar to that described above with reference to FIG. 14; however, the process tracks ROIs across the raw video. The raw video 1402 is provided to an ROI analysis and tracking functionality 1506.
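A sketch of deriving per-block quality control from such an ROI mask: blocks touching the ROI receive a negative QP offset (higher quality) and the rest a positive one (the delta magnitudes below are illustrative assumptions, not values from the disclosure):

```python
def qp_offsets(mask, block, roi_delta=-6, background_delta=6):
    """Per-block QP offset map from a binary ROI mask: blocks that touch the
    ROI get a negative offset (higher quality), all others a positive one.
    The delta magnitudes are illustrative; the disclosure only requires
    higher quality for ROI blocks."""
    h, w = len(mask), len(mask[0])
    rows, cols = (h + block - 1) // block, (w + block - 1) // block
    out = []
    for by in range(rows):
        row = []
        for bx in range(cols):
            touches_roi = any(mask[y][x]
                              for y in range(by * block, min((by + 1) * block, h))
                              for x in range(bx * block, min((bx + 1) * block, w)))
            row.append(roi_delta if touches_roi else background_delta)
        out.append(row)
    return out

# 4x8 mask with the ROI in the left half, 4-pixel blocks:
mask = [[1 if x < 4 else 0 for x in range(8)] for _ in range(4)]
print(qp_offsets(mask, 4))  # → [[-6, 6]]
```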
[88] For ROI tracking, the user may point out objects in the first frame, or any subsequent frames, that the ROIs are based on. The tracking scheme uses an image segmentation algorithm to estimate an ROI corresponding to the selected objects. The image segmentation algorithm is tuned specifically for omnidirectional videos such that it automatically adjusts the area allocation to achieve better efficiency when the resulting ROI is applied to
omnidirectional encoding. Users can further correct the estimation by pointing out the misclassified region and the ROI will be optimized.
[89] Once the ROI for the first frame is decided, an optic flow tracking algorithm is used to generate the ROIs for successive frames based on prior frames. The number of feature points, the fineness of the optic flow vector field and other parameters are chosen by the algorithm to maximize its efficiency for the projection scheme. Users can pause the optic flow tracking algorithm at any point, and manually define the ROI for a specific frame with the same image segmentation algorithm. The optic flow tracking algorithm will use the newest manually specified mask as its reference once it is resumed.
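As a toy stand-in for the optic flow tracking step (a real implementation would use a pyramidal Lucas-Kanade flow over the chosen feature points), an ROI bounding box can be propagated between frames by exhaustive block matching:

```python
def track_roi(prev, curr, box, search=2):
    """Propagate an ROI bounding box from frame `prev` to frame `curr` by
    exhaustive block matching (sum of absolute differences) over a small
    search window. Frames are 2D lists of intensities; `box` is
    (x0, y0, x1, y1). A toy stand-in for optic flow tracking."""
    x0, y0, x1, y1 = box

    def sad(dx, dy):
        return sum(abs(prev[y][x] - curr[y + dy][x + dx])
                   for y in range(y0, y1) for x in range(x0, x1))

    dx, dy = min(((i, j) for i in range(-search, search + 1)
                  for j in range(-search, search + 1)), key=lambda d: sad(*d))
    return x0 + dx, y0 + dy, x1 + dx, y1 + dy

# A bright 2x2 patch moves one pixel to the right between frames:
prev = [[0] * 10 for _ in range(10)]
curr = [[0] * 10 for _ in range(10)]
for y in (4, 5):
    for x in (4, 5):
        prev[y][x] = 9
        curr[y][x + 1] = 9
print(track_roi(prev, curr, (4, 4, 6, 6)))  # → (5, 4, 7, 6)
```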
[90] FIG. 16 depicts an ROI heat map. The heat map 1600 depicts the most common locations, depicted by brightness with the most common area being depicted by white 1602, of ROIs. The heat map 1600 provides information on the most common locations within the pole tiles 1604, 1606 and the equatorial tiles 1608. For the lower observed frequency of ROIs, the ROI expansion size margin is relatively low and the ROI border is sharp. The segmentation iteration is high and the number of feature points is small. The fineness of the optic flow field is low. In the low frequency regions there is a smaller ROI region with a sharp transition tuned for static video. For the higher observed frequency of ROIs, the ROI expansion size margin is relatively high and the ROI border is smooth. The segmentation iteration is low and the number of feature points is large. The fineness of the optic flow field is high. In the high frequency regions there is a larger ROI region with a smooth transition and is motion-sensitive.
[91] For these extracted and tracked ROIs, there are two general ways to control the encoding quality. The first is adjusting the QP utilization. Regular video encoders treat every block (CU) in the video stream equally. However, in omnidirectional video, and with the information on the ROI, it is possible to tune the parameter to give ROI areas higher quality. The second is resolution utilization. As described above, an omnidirectional video will be cut and reshaped. Some of the tiles may not contain any ROI area, and there is therefore no need to keep the same resolution ratio for those tiles. Hence, those tiles which do not contain an ROI area can be downscaled to a certain resolution and encoded with tuned QP parameters in order to save bitrate.
[92] For the ROI area, it is possible to simply upscale the whole tile to a higher resolution, or a temporal resolution enhancement may be used as shown in FIG. 17. With the temporal resolution enhancement, only the resolution of the ROI areas 1704, 1708 is enhanced. The extra pixel information is stored in even frames 1706 while the original frames become the odd frames 1702. Abrupt changes of resolution may be uncomfortable for the viewer, and as such, the resolution may be adjusted slowly while there is limited motion in the video.
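The temporal resolution enhancement can be sketched as an interleaving of the original frames (odd positions) with frames carrying the extra ROI pixel information (even positions); the data representation is a hypothetical simplification:

```python
def interleave_enhancement(frames, roi_extras):
    """Interleave original frames (odd positions) with ROI enhancement
    data (even positions), mirroring the scheme of FIG. 17 where extra
    ROI pixels are stored in even frames and the originals become the
    odd frames."""
    out = []
    for frame, extra in zip(frames, roi_extras):
        out.append(("odd", frame))    # original frame
        out.append(("even", extra))   # extra ROI pixel information
    return out
```

A decoder reconstructing the enhanced ROI would combine each odd/even pair; non-ROI pixels in the even frames carry no new information.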
[93] The above has described the segmenting and subsequent encoding of the entire omnidirectional video. However, it may be advantageous to encode particular view points of the omnidirectional video. Encoding the different view points as separate streams may allow for more efficient streaming, as only the stream of the particular view point being displayed to the user needs to be transmitted. If the user navigates through the omnidirectional video, different view point streams can be retrieved.
[94] FIG. 18 depicts view point encoding of omnidirectional video. As depicted, the omnidirectional video 1800 is encoded to provide a plurality of different view points 1802, 1804, 1806. Each view point stream is encoded into different time blocks 1808. The streams of different view points can be switched between at the clip starting time blocks. However, as depicted, if each clip starting block is 5 seconds long, switching between view points may take up to 5 seconds before the new view point can be properly decoded. The encoded time blocks form a 2D caching scheme that allows different time blocks of different view points to be cached.
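The 2D caching scheme can be sketched as a cache keyed by (view point, time block); the class and its 5-second default block length are illustrative assumptions:

```python
class ViewpointCache:
    """Cache indexed on two dimensions, (view point id, time-block
    index), so segments of several view points can be stored and looked
    up independently."""

    def __init__(self, block_seconds=5):
        self.block_seconds = block_seconds
        self._store = {}

    def _block(self, t_seconds):
        # Map a playback time to its time-block index.
        return int(t_seconds // self.block_seconds)

    def put(self, viewpoint, t_seconds, segment):
        self._store[(viewpoint, self._block(t_seconds))] = segment

    def get(self, viewpoint, t_seconds):
        # Returns None on a cache miss; the segment must then be fetched.
        return self._store.get((viewpoint, self._block(t_seconds)))
```

When the user switches view point, only the (new view point, current block) entry needs to be fetched if it is not already cached.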
[95] As described further below, it is possible to encode additional streams for each view point. The view points are encoded to include additional streams of I-frames, P-frames and B-frames that allow a smart assembler to quickly recover the decoded stream when switching between view points.
[96] FIG. 19 depicts view point encoding of a view of omnidirectional video for adaptive view point streaming. As depicted in FIG. 19, an original view point stream 1902 is further encoded into additional streams 1904, 1906, 1908 for the different time clips. The additional streams 1904, 1906, 1908 encode different frames into I-frames, P-frames and B-frames. The original stream and the additional streams 1910 are transmitted to allow quick view point switching at any time point. [97] As depicted in FIG. 19, after encoding one whole video stream, several additional streams are encoded. All frames after the I-frame in a GOP of the original stream 1902 are encoded as I-frames, forming Stream 0 1904. Several frames after the first I-frame in Stream 0 1904 are selected and encoded as P-frames, forming Stream 1 1906. Several frames after the first I-frame in Stream 0 1904 are selected and encoded as B-frames, forming Stream 2 1908.
[98] When streaming through a network, a smart assembler can pick up those I, P and B frames whose position is between the random access point (inclusive) and the I-frame of the next GOP of the original stream to form a standard decodable stream according to their frame dependencies. This view point encoding can shorten the average waiting time during temporal random access. Combined with the spatial division feature of encoding the different view points described above, it is possible to achieve high spatial and temporal random access ability during omnidirectional video streaming.
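The assembler's frame selection can be sketched as follows; the data layout (dicts mapping frame index to an encoded frame) is a hypothetical simplification of the streams in FIG. 19, and the full reference-dependency resolution is omitted:

```python
def assemble(access_idx, next_gop_start, stream0, stream1, stream2, original):
    """Build a decodable sequence from a random-access index inside a
    GOP: start with the Stream 0 I-frame at the access point, then for
    each following position prefer a Stream 1 P-frame, else a Stream 2
    B-frame, else the original frame, up to the next GOP's I-frame."""
    out = [stream0[access_idx]]  # an I-frame is always decodable alone
    for i in range(access_idx + 1, next_gop_start):
        out.append(stream1.get(i, stream2.get(i, original[i])))
    return out
```

Because Stream 0 holds an I-frame at every position, any frame index can serve as the random access point.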
[99] FIG. 20 depicts adaptive view point streaming of omnidirectional video. As described above, it is possible to encode each view point with additional streams of data, which provides improved adaptive streaming for omnidirectional video. As depicted in FIG. 20, previous adaptive streaming techniques encoded a video into a plurality of different quality streams 2002. The different quality streams allowed the streaming of a video to adapt to network conditions. In contrast to regular adaptive streaming, the adaptive streaming of omnidirectional view points allows the adaptive streaming of multiple view points. The additional streams allow quick switching between view points.
[100] FIG. 21 depicts a system for encoding and streaming omnidirectional video. The system is depicted as a server 2100 for processing omnidirectional video that may be provided to the server 2100 from a video system 2102 that captures 360° video. The server 2100 comprises a processing unit 2104 for executing instructions. An input/output (I/O) interface 2106 allows additional components to be operatively coupled to the processing unit 2104. The server 2100 further comprises a non-volatile (NV) memory 2108 for storing data and instructions, and a memory unit 2110, such as RAM, for storing instructions for execution by the processing unit 2104. The instructions stored in the memory unit 2110, when executed by the processing unit 2104, configure the server 2100 to provide an omnidirectional video encoding functionality 2112 in accordance with the functionality described above.
[101] The encoding functionality 2112 comprises functionality for segmenting and mapping 2114 spherical omnidirectional video data to a number of pole and equatorial joining segments. Tile stacking functionality 2116 arranges the segments into a single frame for subsequent encoding. The functionality further comprises ROI tracking functionality 2118 that tracks ROIs across frames of the omnidirectional video. The stacked images and ROI information are then used by encoding functionality 2120 to encode the video data.
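The tile stacking step can be illustrated by computing placement rectangles for the segments within the single 2D frame; the particular layout (two pole squares side by side on top, equatorial tiles stacked below) is one plausible arrangement assumed for illustration:

```python
def stack_layout(pole_side, n_equatorial, eq_w, eq_h):
    """Return (x, y, w, h) placement rectangles: two pole squares side
    by side on the top row, then n_equatorial joining tiles stacked
    beneath them."""
    rects = [
        (0, 0, pole_side, pole_side),          # north pole square
        (pole_side, 0, pole_side, pole_side),  # south pole square
    ]
    y = pole_side
    for _ in range(n_equatorial):
        rects.append((0, y, eq_w, eq_h))       # one equatorial tile
        y += eq_h
    return rects
```

The encoder then treats the stacked frame as an ordinary 2D frame; the same rectangles are used on the player side to pick the segments back apart.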
[102] The above has described the encoding and streaming of omnidirectional video. As described further below, 360° panoramic video may be captured using existing devices such as smart phones.
[103] FIGs. 22A, 22B and 22C depict devices for capturing panoramic and/or omnidirectional video. The devices depicted in FIGs. 22A, 22B and 22C each comprise fisheye lenses that are mounted over the cameras of a device. The fisheye devices may be used with a single mobile phone or with additional mobile phones. FIG. 22A depicts a single phone 2200a that has a front facing camera and a rear facing camera. A panoramic video capture device 2202a fits over the phone 2200a and places a first fisheye lens 2204a over the front facing camera and a second fisheye lens 2206a over the back facing camera. FIG. 22B depicts a similar panoramic video capture device 2202b; however, rather than placing fisheye lenses over the front and back cameras of one device, the device 2202b is designed to hold two mobile devices 2200b-1, 2200b-2 back-to-back and places the fisheye lenses 2204b, 2206b over the front facing cameras of the mobile devices. FIG. 22C depicts a further device 2202c that is designed to hold three mobile devices 2200c-1, 2200c-2, 2200c-3. The device 2202c holds the three mobile devices and arranges fisheye lenses 2204c, 2206c, 2208c over the front facing cameras of the devices. The devices 2202a, 2202b, 2202c allow panoramic video to be captured using common mobile devices. Each of the fisheye lenses may provide a 180° field of view. [104] With each of the devices described above, two or more fisheye video streams are captured simultaneously. When the video streams are captured from separate devices, one of the capture devices acts as the master capture device and may make connections with the other devices to receive their video streams, stitch the videos together and output the panoramic video. When capturing video from a single mobile device, the two video streams captured by the front and back cameras of the device can be stitched together on the mobile device, which then streams out the resulting video.
Alternatively, stitching can be done in a player, which is suitable for low-power capture devices. In that case, all capture devices stream video directly to the player.
[105] The devices depicted in FIGs. 22A, 22B and 22C may be used to stream panoramic video in a video chat system. The video streaming process, both between capture devices and from capture device to player, may use the Real-time Transport Protocol (RTP) to transfer real-time video and the Session Description Protocol (SDP) to negotiate the parameters. Additionally, a timestamp for each frame may be added to the stream for synchronization.
[106] The stitching process may be performed as follows:
1. Generate a stitching template. This is necessary every time a fisheye lens is re-mounted. Static photos are captured for every camera and key points are extracted using algorithms such as SIFT. After matching the key points from the different cameras, each camera's parameters and rotation can be generated. More details about stitching are described below.
2. Synchronize frames. Using the timestamp of each video frame, every video frame is synchronized with the frames captured by the other devices.
3. Remap. Using the generated template (each camera's parameters and rotation), each frame from the different cameras can be remapped onto a sphere.
4. Blend. A linear or multi-band blending algorithm is used to blend the remapped frames from the different cameras to produce a 360 degree panorama frame, which is usually projected into a rectangular image as described above.
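The remap step can be sketched for an ideal equidistant fisheye model; the model and its parameters are assumptions for illustration, not the calibrated camera parameters the stitching template would actually hold:

```python
import math

def fisheye_to_sphere(u, v, cx, cy, f):
    """Map a pixel (u, v) of an equidistant fisheye image with optical
    center (cx, cy) and focal length f to a unit-sphere direction."""
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)
    if r == 0.0:
        return (0.0, 0.0, 1.0)       # pixel on the optical axis
    theta = r / f                    # equidistant model: r = f * theta
    s = math.sin(theta)
    return (s * dx / r, s * dy / r, math.cos(theta))
```

A pixel a quarter-turn from the axis (r = f·π/2) lands on the sphere's equator; in the real pipeline the per-camera rotation from the template would then rotate these directions into a common sphere before blending.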
[107] Generating the stitching template is illustrated in FIGs. 23A, 23B and 23C. Using key points extracted directly from fisheye images to perform matching may produce bad results, as depicted in FIG. 23A. The mismatch between the fisheye videos 2302a, 2302b may result from the distortion of the fisheye lens, which makes objects far from the image center hard to recognize by algorithms such as SIFT, and from the fact that most parts of the images do not overlap. Generating the stitching template instead uses predefined approximate camera parameters to remap the fisheye images to flattened images 2304a, 2304b before extracting key points, as depicted in FIG. 23B. Based on the approximate positions of the two or three cameras, key points in certain areas 2306a, 2306b can be safely ignored. After a correct match is found in the remapped images 2304a, 2304b, the matched key points are un-mapped into the original fisheye images 2308a, 2308b to compute the final camera parameters and rotation, providing a proper matching between the captured videos.
[108] Another important step in generating the template is calculating a brightness map, as depicted in FIGs. 24A and 24B. Because a fisheye lens is used, the brightness of each pixel varies greatly near the border, as depicted in FIG. 24A. Using multiple overlapped images captured while rotating the device, together with position data from sensors such as gyroscopes to detect the rotation of the device, the brightness map, which provides a brightness value for each pixel in an image, can be calculated as depicted in FIG. 24B and used later to correct image brightness before blending.
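Applying the brightness map before blending can be sketched as a per-pixel gain correction; pure-Python nested lists stand in for image buffers here, and a real implementation would operate on pixel arrays:

```python
def apply_brightness_map(image, gain_map):
    """Divide each pixel by its relative brightness from the calibrated
    map, flattening fisheye vignetting before the blend step.

    image, gain_map: row-major 2D lists of the same shape; a gain of
    1.0 means the pixel needs no correction, gains below 1.0 brighten
    darkened border pixels.
    """
    return [
        [px / g if g > 0 else px for px, g in zip(row, gains)]
        for row, gains in zip(image, gain_map)
    ]
```

With matched brightness across the overlap regions, the subsequent linear or multi-band blend produces seams that are far less visible.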
[109] The stitching process may also involve audio reconstruction. Audio streams captured by several devices at different positions can be
reconstructed to provide stereo audio.
[110] The player (usually a smart device or headset) receives the stitched panorama video (or multiple original video streams, which it then stitches) and displays it. A user can look at different angles using rotation sensors in the player, such as gyroscopes.
Claims
1. A method of encoding omnidirectional video comprising: receiving spherical omnidirectional video data;
segmenting a frame of the spherical omnidirectional video data into a north pole segment formed by mapping a top spherical dome portion of the frame of the spherical omnidirectional video data onto a circle, a south pole segment formed by mapping a bottom spherical dome portion of the frame of the spherical omnidirectional video data onto a circle and at least one joining segment formed by mapping a spherical joining portion of the frame of the spherical omnidirectional video data joining the top spherical dome portion and the bottom spherical dome portion onto at least one rectangle; stacking the north pole segment, the south pole segment and the at least one joining segment together to form a 2-dimensional (2D) frame corresponding to the frame of the spherical omnidirectional video data; and
encoding the 2D frame.
2. The method of claim 1, further comprising: segmenting a plurality of frames of the spherical omnidirectional video data into a plurality of north pole segments, south pole segments and joining segments;
stacking the plurality of north pole segments, south pole segments and joining segments into a plurality of 2D frames; and
encoding the plurality of 2D frames.
3. The method of claim 1, wherein the at least one joining segment comprises a plurality of joining segments, each mapping a portion of the spherical joining portion onto a respective rectangle.
4. The method of claim 1, wherein the circular poles are placed within squares.
5. The method of claim 1, wherein each of the north pole segment, the south pole segment and the at least one joining segment comprises overlapping pixel data.
6. The method of claim 5, wherein there is up to 5% overlap between the segments.
7. The method of claim 6, wherein the at least one joining segment comprises between 2 and 4 segments.
8. The method of claim 2, further comprising tracking one or more regions of interest (ROI) before encoding the 2D frames.
9. The method of claim 1, wherein encoding the 2D frame comprises: encoding one or more view points into a first stream; and
for each view point, encoding additional streams comprising an intracoded frame stream, a predictive frame stream, and a bi-predictive frame stream.
10. The method of claim 9, further comprising streaming at least one of the encoded view points.
11. A system for encoding omnidirectional video comprising: a processor for executing instructions; and
a memory for storing instructions, which when executed by the processor, configure the system to provide a method comprising: receiving spherical omnidirectional video data;
segmenting a frame of the spherical omnidirectional video data into a north pole segment formed by mapping a top spherical dome portion of the frame of the spherical omnidirectional video data onto a circle, a south pole segment formed by mapping a bottom spherical dome portion of the frame of the spherical omnidirectional video data onto a circle and at least one joining segment formed by mapping a spherical joining portion of the frame of the spherical omnidirectional video data joining the top spherical dome portion and the bottom spherical dome portion onto at least one rectangle;
stacking the north pole segment, the south pole segment and the at least one joining segment together to form a 2-dimensional (2D) frame corresponding to the frame of the spherical omnidirectional video data; and
encoding the 2D frame.
12. The system of claim 11, wherein the instructions further configure the system to: segment a plurality of frames of the spherical omnidirectional video data into a plurality of north pole segments, south pole segments and joining segments;
stack the plurality of north pole segments, south pole segments and joining segments into a plurality of 2D frames; and
encode the plurality of 2D frames.
13. The system of claim 11, wherein the at least one joining segment comprises a plurality of joining segments, each mapping a portion of the spherical joining portion onto a respective rectangle.
14. The system of claim 11, wherein the circular poles are placed within squares.
15. The system of claim 11, wherein each of the north pole segment, the south pole segment and the at least one joining segment comprises overlapping pixel data.
16. The system of claim 15, wherein there is up to 5% overlap between the segments.
17. The system of claim 16, wherein the at least one joining segment comprises between 2 and 4 segments.
18. The system of claim 12, wherein the instructions further configure the system to track one or more regions of interest (ROI) before encoding the 2D frames.
19. The system of claim 11, wherein encoding the 2D frame comprises: encoding one or more view points into a first stream; and
for each view point, encoding additional streams comprising an intracoded frame stream, a predictive frame stream, and a bi-predictive frame stream.
20. The system of claim 19, wherein the instructions further configure the system to stream at least one of the encoded view points.
21. A device for use in capturing panoramic video comprising: a frame for holding a mobile device;
a first fisheye lens mounted on the frame and arranged to be located over a front facing camera of the mobile device when the mobile device is held by the frame; and
a second fisheye lens mounted on the frame and arranged to be
located over a rear facing camera of the mobile device when the mobile device is held by the frame.
22. A method of stitching multiple videos captured from one or more mobile devices comprising: generating a stitching template for each camera capturing the videos; synchronizing frames of the captured video using timestamps of the frames;
remapping the multiple videos onto a sphere using the stitching
template; and
blending the remapped images to provide a panoramic video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201780007828.1A CN109121466B (en) | 2016-01-22 | 2017-01-23 | Omnidirectional video coding and streaming |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662286252P | 2016-01-22 | 2016-01-22 | |
US62/286,252 | 2016-01-22 | ||
US201662286516P | 2016-01-25 | 2016-01-25 | |
US62/286,516 | 2016-01-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017127816A1 true WO2017127816A1 (en) | 2017-07-27 |
Family
ID=59362365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2017/014588 WO2017127816A1 (en) | 2016-01-22 | 2017-01-23 | Omnidirectional video encoding and streaming |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109121466B (en) |
WO (1) | WO2017127816A1 (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018002425A3 (en) * | 2016-06-30 | 2018-02-08 | Nokia Technologies Oy | An apparatus, a method and a computer program for video coding and decoding |
US20180349705A1 (en) * | 2017-06-02 | 2018-12-06 | Apple Inc. | Object Tracking in Multi-View Video |
WO2019038433A1 (en) * | 2017-08-24 | 2019-02-28 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Characteristics signaling for omnidirectional content |
WO2019120638A1 (en) * | 2017-12-22 | 2019-06-27 | Huawei Technologies Co., Ltd. | Scalable fov+ for vr 360 video delivery to remote end users |
EP3518087A1 (en) * | 2018-01-29 | 2019-07-31 | Thomson Licensing | Method and network equipment for tiling a sphere representing a spherical multimedia content |
EP3531703A1 (en) * | 2018-02-26 | 2019-08-28 | Thomson Licensing | Method and network equipment for encoding an immersive video spatially tiled with a set of tiles |
WO2019190197A1 (en) * | 2018-03-27 | 2019-10-03 | 주식회사 케이티 | Method and apparatus for video signal processing |
US10484621B2 (en) * | 2016-02-29 | 2019-11-19 | Gopro, Inc. | Systems and methods for compressing video content |
EP3618442A1 (en) * | 2018-08-27 | 2020-03-04 | Axis AB | An image capturing device, a method and computer program product for forming an encoded image |
US10666863B2 (en) | 2018-05-25 | 2020-05-26 | Microsoft Technology Licensing, Llc | Adaptive panoramic video streaming using overlapping partitioned sections |
US10754242B2 (en) | 2017-06-30 | 2020-08-25 | Apple Inc. | Adaptive resolution and projection format in multi-direction video |
US10764494B2 (en) | 2018-05-25 | 2020-09-01 | Microsoft Technology Licensing, Llc | Adaptive panoramic video streaming using composite pictures |
JPWO2020235034A1 (en) * | 2019-05-22 | 2020-11-26 | ||
US10924747B2 (en) | 2017-02-27 | 2021-02-16 | Apple Inc. | Video coding techniques for multi-view video |
US10999602B2 (en) | 2016-12-23 | 2021-05-04 | Apple Inc. | Sphere projected motion estimation/compensation and mode decision |
US11259046B2 (en) | 2017-02-15 | 2022-02-22 | Apple Inc. | Processing of equirectangular object data to compensate for distortion by spherical projections |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060034067A1 (en) * | 2004-08-11 | 2006-02-16 | Weiliang Kui | Illumination pen |
US20140003523A1 (en) * | 2012-06-30 | 2014-01-02 | Divx, Llc | Systems and methods for encoding video using higher rate video sequences |
US20140132598A1 (en) * | 2007-01-04 | 2014-05-15 | Hajime Narukawa | Method of mapping image information from one face onto another continous face of different geometry |
WO2014162324A1 (en) * | 2013-04-04 | 2014-10-09 | Virtualmind Di Davide Angelelli | Spherical omnidirectional video-shooting system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003141562A (en) * | 2001-10-29 | 2003-05-16 | Sony Corp | Image processing apparatus and method for nonplanar image, storage medium, and computer program |
US7011625B1 (en) * | 2003-06-13 | 2006-03-14 | Albert Shar | Method and system for accurate visualization and measurement of endoscopic images |
CN103247020A (en) * | 2012-02-03 | 2013-08-14 | 苏州科泽数字技术有限公司 | Fisheye image spread method based on radial characteristics |
-
2017
- 2017-01-23 WO PCT/US2017/014588 patent/WO2017127816A1/en active Application Filing
- 2017-01-23 CN CN201780007828.1A patent/CN109121466B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060034067A1 (en) * | 2004-08-11 | 2006-02-16 | Weiliang Kui | Illumination pen |
US20140132598A1 (en) * | 2007-01-04 | 2014-05-15 | Hajime Narukawa | Method of mapping image information from one face onto another continous face of different geometry |
US20140003523A1 (en) * | 2012-06-30 | 2014-01-02 | Divx, Llc | Systems and methods for encoding video using higher rate video sequences |
WO2014162324A1 (en) * | 2013-04-04 | 2014-10-09 | Virtualmind Di Davide Angelelli | Spherical omnidirectional video-shooting system |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10484621B2 (en) * | 2016-02-29 | 2019-11-19 | Gopro, Inc. | Systems and methods for compressing video content |
US10979727B2 (en) | 2016-06-30 | 2021-04-13 | Nokia Technologies Oy | Apparatus, a method and a computer program for video coding and decoding |
US20190297339A1 (en) * | 2016-06-30 | 2019-09-26 | Nokia Technologies Oy | An Apparatus, A Method and A Computer Program for Video Coding and Decoding |
WO2018002425A3 (en) * | 2016-06-30 | 2018-02-08 | Nokia Technologies Oy | An apparatus, a method and a computer program for video coding and decoding |
US11818394B2 (en) | 2016-12-23 | 2023-11-14 | Apple Inc. | Sphere projected motion estimation/compensation and mode decision |
US10999602B2 (en) | 2016-12-23 | 2021-05-04 | Apple Inc. | Sphere projected motion estimation/compensation and mode decision |
US11259046B2 (en) | 2017-02-15 | 2022-02-22 | Apple Inc. | Processing of equirectangular object data to compensate for distortion by spherical projections |
US10924747B2 (en) | 2017-02-27 | 2021-02-16 | Apple Inc. | Video coding techniques for multi-view video |
US20180349705A1 (en) * | 2017-06-02 | 2018-12-06 | Apple Inc. | Object Tracking in Multi-View Video |
US11093752B2 (en) * | 2017-06-02 | 2021-08-17 | Apple Inc. | Object tracking in multi-view video |
US10754242B2 (en) | 2017-06-30 | 2020-08-25 | Apple Inc. | Adaptive resolution and projection format in multi-direction video |
WO2019038433A1 (en) * | 2017-08-24 | 2019-02-28 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Characteristics signaling for omnidirectional content |
WO2019120638A1 (en) * | 2017-12-22 | 2019-06-27 | Huawei Technologies Co., Ltd. | Scalable fov+ for vr 360 video delivery to remote end users |
US11706274B2 (en) | 2017-12-22 | 2023-07-18 | Huawei Technologies Co., Ltd. | Scalable FOV+ for VR 360 video delivery to remote end users |
US11546397B2 (en) | 2017-12-22 | 2023-01-03 | Huawei Technologies Co., Ltd. | VR 360 video for remote end users |
CN111567052B (en) * | 2017-12-22 | 2022-01-14 | 华为技术有限公司 | Scalable FOV + for issuing VR 360 video to remote end user |
CN111567052A (en) * | 2017-12-22 | 2020-08-21 | 华为技术有限公司 | Scalable FOV + for issuing VR360 video to remote end user |
EP3518087A1 (en) * | 2018-01-29 | 2019-07-31 | Thomson Licensing | Method and network equipment for tiling a sphere representing a spherical multimedia content |
CN112088352A (en) * | 2018-01-29 | 2020-12-15 | 交互数字Ce专利控股公司 | Method and network device for chunking spheres representing spherical multimedia content |
WO2019145296A1 (en) * | 2018-01-29 | 2019-08-01 | Interdigital Ce Patent Holdings | Method and network equipment for tiling a sphere representing a spherical multimedia content |
EP3531704A1 (en) * | 2018-02-26 | 2019-08-28 | InterDigital CE Patent Holdings | Method and network equipment for encoding an immersive video spatially tiled with a set of tiles |
US11076162B2 (en) | 2018-02-26 | 2021-07-27 | Interdigital Ce Patent Holdings | Method and network equipment for encoding an immersive video spatially tiled with a set of tiles |
EP3531703A1 (en) * | 2018-02-26 | 2019-08-28 | Thomson Licensing | Method and network equipment for encoding an immersive video spatially tiled with a set of tiles |
WO2019190197A1 (en) * | 2018-03-27 | 2019-10-03 | 주식회사 케이티 | Method and apparatus for video signal processing |
US10764494B2 (en) | 2018-05-25 | 2020-09-01 | Microsoft Technology Licensing, Llc | Adaptive panoramic video streaming using composite pictures |
US10666863B2 (en) | 2018-05-25 | 2020-05-26 | Microsoft Technology Licensing, Llc | Adaptive panoramic video streaming using overlapping partitioned sections |
US10972659B2 (en) | 2018-08-27 | 2021-04-06 | Axis Ab | Image capturing device, a method and a computer program product for forming an encoded image |
EP3618442A1 (en) * | 2018-08-27 | 2020-03-04 | Axis AB | An image capturing device, a method and computer program product for forming an encoded image |
KR102172276B1 (en) | 2018-08-27 | 2020-10-30 | 엑시스 에이비 | An image capturing device, a method and a computetr program product for forming an encoded image |
TWI716960B (en) * | 2018-08-27 | 2021-01-21 | 瑞典商安訊士有限公司 | An image capturing device, a method and a computer program product for forming an encoded image |
KR20200024095A (en) * | 2018-08-27 | 2020-03-06 | 엑시스 에이비 | An image capturing device, a method and a computetr program product for forming an encoded image |
JP7259947B2 (en) | 2019-05-22 | 2023-04-18 | 日本電信電話株式会社 | Video distribution device, video distribution method and program |
JPWO2020235034A1 (en) * | 2019-05-22 | 2020-11-26 | ||
WO2020235034A1 (en) * | 2019-05-22 | 2020-11-26 | 日本電信電話株式会社 | Video distribution device, video distribution method, and program |
Also Published As
Publication number | Publication date |
---|---|
CN109121466A (en) | 2019-01-01 |
CN109121466B (en) | 2022-09-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109121466B (en) | Omnidirectional video coding and streaming | |
US11228749B2 (en) | Systems, methods and apparatus for compressing video content | |
US10652558B2 (en) | Apparatus and methods for video compression using multi-resolution scalable coding | |
US11166047B2 (en) | Apparatus and methods for video compression | |
KR102191875B1 (en) | Method for transmitting 360 video, method for receiving 360 video, 360 video transmitting device, and 360 video receiving device | |
US20190373245A1 (en) | 360 video transmission method, 360 video reception method, 360 video transmission device, and 360 video reception device | |
US20180310010A1 (en) | Method and apparatus for delivery of streamed panoramic images | |
WO2018175493A1 (en) | Adaptive perturbed cube map projection | |
CN115150617A (en) | Encoding method and decoding method | |
WO2018132317A1 (en) | Adjusting field of view of truncated square pyramid projection for 360-degree video | |
KR20190095430A (en) | 360 video processing method and apparatus therefor | |
EP3434021B1 (en) | Method, apparatus and stream of formatting an immersive video for legacy and immersive rendering devices | |
US20210312588A1 (en) | Immersive video bitstream processing | |
Hu et al. | Mobile edge assisted live streaming system for omnidirectional video | |
US12003692B2 (en) | Systems, methods and apparatus for compressing video content | |
WO2019034803A1 (en) | Method and apparatus for processing video information | |
WO2017220851A1 (en) | Image compression method and technical equipment for the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17742114 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.12.2018) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 17742114 Country of ref document: EP Kind code of ref document: A1 |