WO2023099809A1 - A method, an apparatus and a computer program product for video encoding and video decoding - Google Patents

Info

Publication number
WO2023099809A1
WO2023099809A1 PCT/FI2022/050763 FI2022050763W WO2023099809A1 WO 2023099809 A1 WO2023099809 A1 WO 2023099809A1 FI 2022050763 W FI2022050763 W FI 2022050763W WO 2023099809 A1 WO2023099809 A1 WO 2023099809A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
atlas
information
sub
static
Prior art date
Application number
PCT/FI2022/050763
Other languages
French (fr)
Inventor
Lukasz Kondrad
Lauri Aleksi ILOLA
Emre Baris Aksu
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2023099809A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards


Definitions

  • the present solution generally relates to coding of volumetric video.
  • Volumetric video data represents a three-dimensional (3D) scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality), and MR (Mixed Reality) applications.
  • Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance), and any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video).
  • Volumetric video can be generated from 3D models, also referred to as volumetric visual objects, i.e., CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more.
  • CGI Computer Generated Imagery
  • volumetric data comprises triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time.
  • an apparatus comprising means for receiving volumetric video data from a media capture system; means for creating a V3C bitstream comprising atlas information; means for encoding video components of the volumetric video data into one or more video sub-bitstreams; means for creating static atlas information and storing it into a static V3C atlas sub-bitstream; means for sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and means for sending static V3C atlas sub-bitstream through out-of-band information.
  • an apparatus for decoding comprising means for receiving encoded video sub-bitstreams over a network via a real-time delivery protocol; means for receiving the out-of-band information on static atlas information; and means for rendering the V3C content based on the received data.
  • a method for encoding comprising receiving volumetric video data from a media capture system; creating a V3C bitstream comprising atlas information; encoding video components of the volumetric video data into one or more video sub-bitstreams; creating static atlas information and storing it into a static V3C atlas sub-bitstream; sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and sending static V3C atlas sub-bitstream through out-of-band information.
  • a method for decoding comprising receiving encoded video sub-bitstreams over a network via a real-time delivery protocol; receiving the out-of-band information on static atlas information; and rendering the V3C content based on the received data.
  • an apparatus for encoding comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive volumetric video data from a media capture system; create a V3C bitstream comprising atlas information; encode video components of the volumetric video data into one or more video sub-bitstreams; create static atlas information and store it into a static V3C atlas sub-bitstream; send encoded video sub-bitstreams over a network via a real-time delivery protocol; and send static V3C atlas sub-bitstream through out-of-band information.
  • an apparatus for decoding comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive encoded video sub-bitstreams over a network via a real-time delivery protocol; receive the out-of-band information on static atlas information; and render the V3C content based on the received data.
  • a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive volumetric video data from a media capture system; create a V3C bitstream comprising atlas information; encode video components of the volumetric video data into one or more video sub-bitstreams; create static atlas information and store it into a static V3C atlas sub-bitstream; send encoded video sub-bitstreams over a network via a real-time delivery protocol; and send static V3C atlas sub-bitstream through out-of-band information.
  • a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive encoded video sub-bitstreams over a network via a real-time delivery protocol; receive the out-of-band information on static atlas information; and render the V3C content based on the received data.
  • video components are encoded as a packed video sub-bitstream or separate video sub-bitstreams.
  • the received volumetric video data comprises at least one of the depth or color texture; and camera or sensor related metadata information.
  • pre-defined fixed patch configuration is used according to which patches of the atlas information are placed.
  • the computer program product is embodied on a non-transitory computer readable medium.
  • Fig. 1 shows an example of a compression process of a volumetric video
  • Fig. 2 shows an example of a de-compression process of a volumetric video
  • Fig. 3 shows an example of a V3C bitstream originated from ISO/IEC 23090-5;
  • Fig. 4 shows an example of a capture system
  • Fig. 5 shows an example of a packed video component structure
  • Fig. 6 shows an example of an encoding procedure with generating out-of-band V3C static information
  • Fig. 7 shows an example of an encoding method with separate video component bitstreams and static atlas data
  • Fig. 8 is a flowchart illustrating a method according to an embodiment
  • Fig. 9 is a flowchart illustrating a method according to another embodiment.
  • Fig. 10 shows an example of an apparatus.

Description of Example Embodiments

  • Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such a process may be applied, for example, in MPEG Video-based Point Cloud Coding (V-PCC).
  • V-PCC MPEG Video-based Point Cloud Coding
  • the process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
  • the patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error.
  • the normal at every point can be estimated.
  • An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:
  • each point may be associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal and the plane normal).
  • the initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors.
  • the final step may comprise extracting patches by applying a connected component extraction procedure.
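The initial clustering step described above can be illustrated with a minimal sketch. The six plane normals are not listed in this text, so the axis-aligned set below (+X, +Y, +Z, -X, -Y, -Z) is an assumption, and the names `PLANE_NORMALS` and `initial_clustering` are hypothetical. The refinement and connected-component steps are not covered.

```python
# Assumed set of six oriented planes (axis-aligned normals); the source text
# mentions six planes but does not enumerate them.
PLANE_NORMALS = [
    (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (-1, 0, 0), (0, -1, 0), (0, 0, -1),
]


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def initial_clustering(point_normals):
    """For each point normal, return the index of the plane whose normal
    maximizes the dot product with it (the 'closest' plane normal)."""
    return [
        max(range(len(PLANE_NORMALS)), key=lambda k: dot(n, PLANE_NORMALS[k]))
        for n in point_normals
    ]


normals = [(0.9, 0.1, 0.0), (0.0, -1.0, 0.1), (0.2, 0.1, 0.97)]
print(initial_clustering(normals))  # [0, 4, 2]
```

Each returned index is a cluster label; a fuller implementation would then refine the labels using the nearest neighbors of each point before extracting patches.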
  • Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105.
  • the packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch.
  • T may be a user-defined parameter.
  • Parameter T may be encoded in the bitstream and sent to the decoder.
  • W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded.
  • the patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
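The placement search described above may be sketched as follows. This is a simplified illustration, not the reference encoder: patch sizes are given directly in grid cells, patches are treated as solid rectangles, and the function name `pack_patches` is hypothetical.

```python
def pack_patches(patch_sizes, width, height):
    """Place rectangular patches (w, h in grid cells) at the first free
    location found in raster-scan order.

    Returns (positions, final_height). When a patch does not fit, the grid
    height is doubled and the search is repeated; at the end the height is
    clipped to the rows that are actually used.
    """
    if width < 1 or height < 1:
        raise ValueError("grid must have at least one cell")
    used = [[False] * width for _ in range(height)]
    positions = []
    for pw, ph in patch_sizes:
        if pw > width:
            raise ValueError("patch is wider than the grid")
        placed = None
        while placed is None:
            rows = len(used)
            for v in range(rows - ph + 1):        # raster scan: row by row
                for u in range(width - pw + 1):   # then column by column
                    if all(not used[v + j][u + i]
                           for j in range(ph) for i in range(pw)):
                        placed = (u, v)
                        break
                if placed is not None:
                    break
            if placed is None:                    # no room: double the height
                used.extend([[False] * width for _ in range(rows)])
        u, v = placed
        for j in range(ph):                       # mark covered cells as used
            for i in range(pw):
                used[v + j][u + i] = True
        positions.append(placed)
    final_height = max(v + 1 for v, row in enumerate(used) if any(row)) if positions else 0
    return positions, final_height


print(pack_patches([(2, 2), (2, 2), (3, 1)], width=4, height=2))
# ([(0, 0), (2, 0), (0, 2)], 3)
```

In the example, the third patch does not fit in the initial 4x2 grid, so the height is doubled to 4 and later clipped to 3.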
  • the geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively.
  • the image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images.
  • each patch may be projected onto two images, referred to as layers.
  • Let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v).
  • The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0.
  • The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness.
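A minimal sketch of the near/far layer selection follows, assuming the projected points of one patch are available as (u, v, depth) tuples. The function name `build_layers` and the dictionary-based layer representation are illustrative assumptions.

```python
def build_layers(projected, thickness):
    """Split projected points into a near and a far layer.

    projected: iterable of (u, v, depth) tuples for one patch.
    thickness: surface thickness (the user-defined interval width).
    Returns two dicts mapping (u, v) -> depth.
    """
    points = list(projected)
    near = {}
    for u, v, d in points:
        key = (u, v)
        if key not in near or d < near[key]:
            near[key] = d                  # D0: lowest depth at this pixel
    far = dict(near)
    for u, v, d in points:
        key = (u, v)
        d0 = near[key]
        if d0 <= d <= d0 + thickness and d > far[key]:
            far[key] = d                   # highest depth within [D0, D0 + thickness]
    return near, far


pts = [(0, 0, 5), (0, 0, 7), (0, 0, 12), (1, 0, 3)]
print(build_layers(pts, thickness=4))
# ({(0, 0): 5, (1, 0): 3}, {(0, 0): 7, (1, 0): 3})
```

The point at depth 12 is outside the interval [5, 9] and therefore appears in neither layer.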
  • the generated videos may have the following characteristics:
  • the geometry video is monochromatic.
  • the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
  • the geometry images and the texture images may be provided to image padding 107.
  • the image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images.
  • the occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively.
  • the occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.
  • the padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression.
  • each block of TxT (e.g., 16x16) pixels is compressed independently. If the block is empty (i.e., unoccupied, i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e., occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e., edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
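The three padding cases can be sketched as below. Several details are assumptions made for illustration only: empty blocks copy the last column (not row) of the previous block, an empty first block is left untouched, and the averaging in edge blocks uses 4-connected neighbors inside the same block. The name `pad_image` is hypothetical.

```python
def pad_image(image, occupancy, T):
    """Return a padded copy of a 2D image (list of rows), block by block."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    prev = None                                    # origin of previous block
    for y0 in range(0, h, T):
        for x0 in range(0, w, T):
            cells = [(x, y)
                     for y in range(y0, min(y0 + T, h))
                     for x in range(x0, min(x0 + T, w))]
            known = {(x, y) for x, y in cells if occupancy[y][x]}
            if not known:
                # Empty block: copy the last column of the previous block.
                if prev is not None:
                    px0, py0 = prev
                    last_col = min(px0 + T, w) - 1
                    last_row = min(py0 + T, h) - 1
                    for x, y in cells:
                        src_y = min(py0 + (y - y0), last_row)
                        out[y][x] = out[src_y][last_col]
            elif len(known) < len(cells):
                # Edge block: iteratively average already-known neighbors.
                while len(known) < len(cells):
                    updates = {}
                    for x, y in cells:
                        if (x, y) in known:
                            continue
                        nb = [out[ny][nx]
                              for nx, ny in ((x - 1, y), (x + 1, y),
                                             (x, y - 1), (x, y + 1))
                              if (nx, ny) in known]
                        if nb:
                            updates[(x, y)] = sum(nb) / len(nb)
                    for (x, y), value in updates.items():
                        out[y][x] = value
                    known.update(updates)
            # Full block: nothing to do.
            prev = (x0, y0)
    return out


image = [[10, 20, 0, 0], [0, 0, 0, 0]]
occupancy = [[1, 1, 0, 0], [0, 0, 0, 0]]
print(pad_image(image, occupancy, T=2))
```

In this example the left block is an edge block, so its empty pixels take the values of the occupied pixels above them, and the right block is empty, so it copies the last column of the left block.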
  • the padded geometry images and padded texture images may be provided for video compression 108.
  • the generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters.
  • the video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102.
  • the smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
  • the patch may be associated with auxiliary information being encoded/decoded for each patch as metadata.
  • the auxiliary information may comprise index of the projection plane, 2D bounding box, 3D location of the patch.
  • Metadata may be encoded/decoded for every patch:
  • mapping information providing for each TxT block its associated patch index may be encoded as follows:
  • For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block.
  • the order in the list is the same as the order used to encode the 2D bounding boxes.
  • L is called the list of candidate patches.
  • the occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • One cell of the 2D grid produces a pixel during the image generation process.
  • the occupancy map compression 110 leverages the auxiliary information described in previous section, in order to detect the empty TxT blocks (i.e., blocks with patch index 0).
  • the remaining blocks may be encoded as follows:
  • The occupancy map can be encoded with a precision of B0xB0 blocks.
  • The compression process may comprise one or more of the following example operations:
  • Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block.
  • A value of 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.
  • Binary information may be encoded for each TxT block to indicate whether it is full or not.
  • Extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
  • Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
  • The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
  • The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
  • The binary value of the initial sub-block is encoded.
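The sub-block flags and their run-length coding can be illustrated with a short sketch. Only the horizontal traversal order is shown, and the output representation (initial value plus a list of run lengths) is an assumption, since the remaining encoding steps are not reproduced in this text. Both function names are hypothetical.

```python
def subblock_flags(block, B0):
    """Return one flag per B0xB0 sub-block of a TxT block, in horizontal
    traversal order: 1 if the sub-block contains at least one non-padded
    (truthy) pixel, 0 otherwise."""
    T = len(block)
    flags = []
    for y0 in range(0, T, B0):
        for x0 in range(0, T, B0):
            full = any(block[y][x]
                       for y in range(y0, min(y0 + B0, T))
                       for x in range(x0, min(x0 + B0, T)))
            flags.append(1 if full else 0)
    return flags


def run_length_encode(flags):
    """Encode a non-empty binary sequence as (initial value, run lengths)."""
    if not flags:
        raise ValueError("nothing to encode")
    runs = []
    count = 1
    for prev, cur in zip(flags, flags[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return flags[0], runs


block = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
flags = subblock_flags(block, B0=2)
print(flags)                     # [1, 0, 0, 1]
print(run_length_encode(flags))  # (1, [1, 2, 1])
```

Because the values are binary, the initial value and the run lengths are enough to recover the whole sequence.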
  • FIG. 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC).
  • a de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202.
  • The de-multiplexer 201 transmits a compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204.
  • Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information.
  • the point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
  • the reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts.
  • the implemented approach moves boundary points to the centroid of their nearest neighbors.
  • the smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202.
  • the texture reconstruction 207 outputs a reconstructed point cloud.
  • the texture values for the texture reconstruction are directly read from the texture images.
  • the point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers.
  • The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs, and let (u0, v0, u1, v1) be its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:
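The equations themselves are not reproduced in this text. The sketch below assumes the form commonly given in V-PCC descriptions, namely δ(u, v) = δ0 + g(u, v), s(u, v) = s0 - u0 + u and r(u, v) = r0 - v0 + v, where g(u, v) is the value of the geometry image at pixel (u, v). The function name and the dictionary keys are hypothetical.

```python
def reconstruct_point(u, v, geometry, patch):
    """Return (depth, tangential shift, bi-tangential shift) for pixel (u, v).

    geometry: 2D list indexed as geometry[v][u] (decoded geometry image).
    patch: dict with the patch 3D location (d0, s0, r0) and the top-left
           corner (u0, v0) of its 2D bounding box.
    Assumed equations (not taken from this text):
        depth = d0 + g(u, v);  s = s0 - u0 + u;  r = r0 - v0 + v
    """
    depth = patch["d0"] + geometry[v][u]
    s = patch["s0"] - patch["u0"] + u
    r = patch["r0"] - patch["v0"] + v
    return depth, s, r


geometry = [[0] * 6 for _ in range(10)]
geometry[9][5] = 7
patch = {"d0": 100, "s0": 10, "r0": 20, "u0": 4, "v0": 8}
print(reconstruct_point(5, 9, geometry, patch))  # (107, 11, 21)
```

Mapping the (depth, s, r) triple back to x, y, z coordinates additionally depends on the projection plane of the patch, which is outside this sketch.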
  • the texture values can be directly read from the texture images.
  • the result of the decoding process is a 3D point cloud reconstruction.
  • volumetric frame can be represented as a point cloud.
  • a point cloud is a set of unstructured points in 3D space, where each point is characterized by its position in a 3D coordinate system (e.g., Euclidean), and some corresponding attributes (e.g., color information provided as RGBA value, or normal vectors).
  • a volumetric frame can be represented as images, with or without depth, captured from multiple viewpoints in 3D space.
  • The volumetric video can be represented by one or more view frames, where a view is a projection of a volumetric scene onto a plane (the camera plane) using a real or virtual camera with known or computed extrinsics and intrinsics.
  • Each view may be represented by a number of components (e.g., geometry, color, transparency, and occupancy picture), which may be part of the geometry picture or represented separately.
  • a volumetric frame can be represented as a mesh.
  • A mesh is a collection of points, called vertices, and connectivity information between vertices, called edges. Vertices along with edges form faces. The combination of vertices, edges and faces can uniquely approximate shapes of objects.
  • a volumetric frame can provide viewers the ability to navigate a scene with six degrees of freedom, i.e., both translational and rotational movement of their viewing pose (which includes yaw, pitch, and roll).
  • The data to be coded for a volumetric frame can also be significant, as a volumetric frame can contain a large number of objects, and the positioning and movement of these objects in the scene can result in many dis-occluded regions.
  • the interaction of the light and materials in objects and surfaces in a volumetric frame can generate complex light fields that can produce texture variations for even a slight change of pose.
  • A sequence of volumetric frames is a volumetric video. Due to the large amount of information, storage and transmission of a volumetric video require compression.
  • a way to compress a volumetric frame can be to project the 3D geometry and related attributes into a collection of 2D images along with additional associated metadata.
  • The projected 2D images can then be coded using 2D video and image coding technologies, for example ISO/IEC 14496-10 (H.264/AVC) and ISO/IEC 23008-2 (H.265/HEVC).
  • The metadata can be coded with technologies specified in specifications such as ISO/IEC 23090-5.
  • the coded images and the associated metadata can be stored or transmitted to a client that can decode and render the 3D volumetric frame.
  • ISO/IEC 23090-5 specifies the syntax, semantics, and process for coding volumetric video.
  • the specified syntax is designed to be generic, so that it can be reused for a variety of applications.
  • Point clouds, immersive video with depth, and mesh representations can all use ISO/IEC 23090-5 standard with extensions that deal with the specific nature of the final representation.
  • The purpose of the specification is to define how to decode and interpret the associated data (for example atlas data in ISO/IEC 23090-5), which tells a renderer how to interpret 2D frames to reconstruct a volumetric frame.
  • V-PCC ISO/IEC 23090-5
  • MIV ISO/IEC 23090-12
  • The syntax element pdu_projection_id specifies the index of the projection plane for the patch. There can be 6 or 18 projection planes in V-PCC, and they are implicit, i.e., pre-determined.
  • In MIV, pdu_projection_id corresponds to a view ID, i.e., identifies which view the patch originated from. View IDs and their related information are explicitly provided in the MIV view parameters list and may be tailored for each content.
  • MPEG 3DG ISO SC29 WG7
  • V3C the mesh compression
  • V3C uses the ptl_profile_toolset_idc parameter.
  • V3C bitstream is a sequence of bits that forms the representation of coded volumetric frames and the associated data making one or more coded V3C sequences (CVS).
  • A CVS is a sequence of bits identified and separated by appropriate delimiters. It is required to start with a VPS, includes a V3C unit, and contains one or more V3C units with an atlas sub-bitstream or a video sub-bitstream. This is illustrated in Figure 3.
  • Video sub-bitstreams and atlas sub-bitstreams can be referred to as V3C sub-bitstreams.
  • A V3C unit header in conjunction with VPS information identifies which V3C sub-bitstream a V3C unit contains and how to interpret it. An example of this is shown herein below:
  • V3C bitstream can be stored according to Annex C of ISO/IEC 23090-5, which specifies syntax and semantics of a sample stream format to be used by applications that deliver some or all of the V3C unit stream as an ordered stream of bytes or bits within which the locations of V3C unit boundaries need to be identifiable from patterns in the data.
  • VPS V3C Parameter Set
  • vuh_v3c_parameter_set_id specifies the value of vps_v3c_parameter_set_id for the active V3C VPS.
  • the VPS provides the following information about V3C bitstream among others:
  • the number of cameras, camera extrinsic, camera intrinsic information is not fixed and may change during the V3C bitstream.
  • the camera information may be shared among all atlases within V3C bitstream.
  • Common atlas data is carried in a dedicated V3C unit type equal to V3C_CAD which contains a number of non-ACL NAL
  • the V3C includes signalling mechanisms, through profile_tier_level syntax structure in VPS to support interoperability while restricting capabilities of V3C profiles.
  • V3C also includes an initial set of tool constraint flags to indicate additional restrictions on profile.
  • The sub-profile indicator syntax element is always present, but the value of 0xFFFFFFFF indicates that no sub-profile is used, i.e., the full profile is supported.
  • The Real-time Transport Protocol (RTP) is intended for end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery.
  • RTP allows data transfer to multiple destinations through IP multicast or to a specific destination through IP unicast.
  • the majority of the RTP implementations are built on the User Datagram Protocol (UDP).
  • UDP User Datagram Protocol
  • Other transport protocols may also be utilized.
  • RTP is used together with other protocols such as H.323 and the Real Time Streaming Protocol (RTSP).
  • RTP Real-time Transport Protocol
  • RTCP RTP Control Protocol
  • RTP sessions may be initiated between client and server using a signalling protocol, such as H.323, the Session Initiation Protocol (SIP), or RTSP.
  • SIP Session Initiation Protocol
  • RTSP Real Time Streaming Protocol
  • These protocols may use the Session Description Protocol (RFC 8866) to specify the parameters for the sessions.
  • RTP is designed to carry a multitude of multimedia formats, which permits the development of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header. For a class of applications (e.g., audio, video), an RTP profile may be defined. For a media format (e.g., a specific video coding format), an associated RTP payload format may be defined. Every instantiation of RTP in a particular application may require a profile and payload format specifications.
  • The profile defines the codecs used to encode the payload data and their mapping to payload format codes in the Payload Type (PT) field of the RTP header.
  • PT Payload Type
  • RTP profile for audio and video conferences with minimal control is defined in RFC 3551.
  • the profile defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format, and a PT value using Session Description Protocol (SDP).
  • SDP Session Description Protocol
  • The latter mechanism is used for newer video codecs, for example in the RTP payload format for H.264 video defined in RFC 6184 or the RTP payload format for High Efficiency Video Coding (HEVC) defined in RFC 7798.
  • An RTP session is established for each multimedia stream. Audio and video streams may use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream.
  • The RTP specification recommends an even port number for RTP, and the use of the next odd port number for the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols.
  • RTP packets are created at the application layer and handed to the transport layer for delivery.
  • Each unit of RTP media data created by an application begins with the RTP packet header.
  • The RTP header has a minimum size of 12 bytes.
  • After the header, optional header extensions may be present. This is followed by the RTP payload, the format of which is determined by the particular class of application.
  • the fields in the header are as follows:
  • P (Padding): (1 bit) Indicates whether there are extra padding bytes at the end of the RTP packet.
  • X (Extension): (1 bit) Indicates the presence of an extension header between the header and payload data.
  • The extension header is application or profile specific.
  • PT (Payload type): (7 bits) Indicates the format of the payload and thus determines its interpretation by the application.
  • Sequence number: (16 bits) The sequence number is incremented for each RTP data packet sent and is to be used by the receiver to detect packet loss and to accommodate out-of-order delivery.
  • Timestamp: (32 bits) Used by the receiver to play back the received samples at the appropriate time and interval. When several media streams are present, the timestamps may be independent in each stream. The granularity of the timing is application specific. For example, a video stream may use a 90 kHz clock. The clock granularity is one of the details that is specified in the RTP profile for an application.
  • SSRC (Synchronization source identifier): (32 bits) Uniquely identifies the source of the stream. The synchronization sources within the same RTP session will be unique.
  • Header extension: (optional, presence indicated by the Extension field) The first 32-bit word contains a profile-specific identifier (16 bits) and a length specifier (16 bits) that indicates the length of the extension in 32-bit units, excluding the 32 bits of the extension header.
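The fixed 12-byte header can be parsed as in the following sketch. The field layout follows RFC 3550. The version, CSRC count and marker fields are part of that layout although they are not listed above. The function name `parse_rtp_header` is hypothetical, and CSRC entries and header extensions are not parsed.

```python
import struct


def parse_rtp_header(packet):
    """Parse the fixed 12-byte RTP header from the start of a packet."""
    if len(packet) < 12:
        raise ValueError("RTP header requires at least 12 bytes")
    # Network byte order: two single bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


packet = bytes([0x80, 0x63]) + struct.pack("!HII", 1, 90000, 0x1234)
header = parse_rtp_header(packet)
print(header["version"], header["payload_type"], header["timestamp"])
# 2 99 90000
```

With a 90 kHz video clock, a timestamp difference of 90000 between two packets corresponds to one second of media time.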
  • SDP is used as an example of a session specific file format.
  • SDP is a format for describing multimedia communication sessions for the purposes of announcement and invitation. Its predominant use is in support of conversational and streaming media applications. SDP does not deliver any media streams itself, but is used between endpoints for negotiation of network metrics, media types, and other associated properties. The set of properties and parameters is called a session profile.
  • SDP is extensible for the support of new media types and formats.
  • the Session Description Protocol describes a session as a group of fields in a text-based format, one field per line.
  • the form of each field is as follows:
  • <character>=<value><CR><LF>
  • where <character> is a single case-sensitive character and
  • <value> is structured text in a format that depends on the character. Values may be UTF-8 encoded. Whitespace is not allowed immediately to either side of the equal sign.
  • Session descriptions consist of three sections: session, timing, and media descriptions. Each description may contain multiple timing and media descriptions. Names are only unique within the associated syntactic construct.
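The line format and the split into session-level and media-level fields can be sketched as follows. The function name `parse_sdp` is hypothetical, the input string is illustrative only, and the sketch does not validate field order or field-specific value syntax.

```python
def parse_sdp(text):
    """Split an SDP description into (session_fields, media_descriptions).

    Each field is returned as a (character, value) tuple. Every "m" line
    opens a new media description; fields before the first "m" line belong
    to the session level.
    """
    session, media = [], []
    current = session
    for raw in text.splitlines():          # accepts CRLF and LF line endings
        if not raw:
            continue
        if len(raw) < 2 or raw[1] != "=":
            raise ValueError(f"not a <character>=<value> line: {raw!r}")
        key, value = raw[0], raw[2:]
        if key == "m":
            current = []
            media.append(current)
        current.append((key, value))
    return session, media


example = (
    "v=0\r\n"
    "s=SDP Seminar\r\n"
    "t=0 0\r\n"
    "m=audio 49170 RTP/AVP 0\r\n"
    "m=video 51372 RTP/AVP 99\r\n"
    "a=rtpmap:99 h263-1998/90000\r\n"
)
session, media = parse_sdp(example)
print(session)
print(media)
```

Attribute lines that follow an "m" line are attached to that media description, which mirrors the rule that names are only unique within their syntactic construct.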
  • m= (media name and transport address)
  • i=* (media title or information field)
  • c=* (connection information - optional if included at session level)
  • b=* (zero or more bandwidth information lines)
  • k=* (encryption key)
  • a=* (zero or more media attribute lines - overriding the Session attribute lines)
  • This session is originated by the user “jdoe” at IPv4 address 10.47.16.5. Its name is “SDP Seminar” and extended session information (“A Seminar on the session description protocol”) is included along with a link for additional information and an email address to contact the responsible party, Jane Doe.
  • This session is specified to last two hours using NTP timestamps, with a connection address (which indicates the address clients must connect to or - when a multicast address is provided, as it is here - subscribe to) specified as IPv4 224.2.17.12 with a TTL of 127. Recipients of this session description are instructed to only receive media. Two media descriptions are provided, both using RTP Audio Video Profile.
  • The first is an audio stream on port 49170 using RTP/AVP payload type 0 (defined by RFC 3551 as PCMU), and the second is a video stream on port 51372 using RTP/AVP payload type 99 (defined as “dynamic”). Finally, an attribute is included which maps RTP/AVP payload type 99 to format h263-1998 with a 90 kHz clock rate.
  • RTCP ports for the audio and video streams of 49171 and 51373 respectively are implied.
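The media lines described above can be turned into RTP and RTCP port pairs and payload type mappings with a short sketch. The three SDP lines are rebuilt from the values stated in the description. The function name `describe_media` is hypothetical, and port ranges of the form "port/count" are not handled.

```python
def describe_media(sdp_text):
    """Extract ports, payload types and rtpmap entries per media description."""
    media = []
    for line in sdp_text.splitlines():
        if line.startswith("m="):
            kind, port, proto, *formats = line[2:].split()
            media.append({
                "media": kind,
                "rtp_port": int(port),
                "rtcp_port": int(port) + 1,   # implied: next (odd) port
                "proto": proto,
                "payload_types": [int(f) for f in formats],
                "rtpmap": {},
            })
        elif line.startswith("a=rtpmap:") and media:
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            name, clock_rate = encoding.split("/")[:2]
            media[-1]["rtpmap"][int(pt)] = (name, int(clock_rate))
    return media


sdp = (
    "m=audio 49170 RTP/AVP 0\r\n"
    "m=video 51372 RTP/AVP 99\r\n"
    "a=rtpmap:99 h263-1998/90000\r\n"
)
for entry in describe_media(sdp):
    print(entry["media"], entry["rtp_port"], entry["rtcp_port"], entry["rtpmap"])
# audio 49170 49171 {}
# video 51372 51373 {99: ('h263-1998', 90000)}
```

Payload type 0 needs no rtpmap entry because it is a static assignment, while the dynamic payload type 99 is only meaningful through its rtpmap attribute.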
  • the media types are “audio/L8” and “audio/L16”.
  • Codec-specific parameters may be added in other attributes, for example, "fmtp".
  • fmtp attribute allows parameters that are specific to a particular format to be conveyed in a way that SDP does not have to understand them.
  • the format can be one of the formats specified for the media. Format-specific parameters, semicolon separated, may be any set of parameters required to be conveyed by SDP and given unchanged to the media tool that will use this format. At most one instance of this attribute is allowed for each format.
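An fmtp attribute line can be split into its format and its semicolon-separated parameters as sketched below. The function name `parse_fmtp` is hypothetical. The parameter names in the example come from RFC 7798, but the values and the payload type 98 are illustrative only.

```python
def parse_fmtp(line):
    """Parse 'a=fmtp:<format> <param>=<value>; ...' into (format, params).

    Parameters without '=' are stored with the value None. Only the first
    '=' in each parameter separates name and value, so values that contain
    '=' themselves (e.g. base64 padding) are kept intact.
    """
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an fmtp attribute line")
    fmt, _, params = line[len(prefix):].partition(" ")
    parsed = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        parsed[name.strip()] = value.strip() if sep else None
    return fmt, parsed


print(parse_fmtp("a=fmtp:98 profile-id=1; level-id=93; tx-mode=SRST"))
# ('98', {'profile-id': '1', 'level-id': '93', 'tx-mode': 'SRST'})
```

SDP itself does not interpret these parameters. They are passed unchanged to the media tool that handles the given format.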
  • RFC7798 defines the following parameters: sprop-vps, sprop-sps, sprop-pps, profile-space, profile-id, tier-flag, level-id, interop-constraints, profile-compatibility-indicator, sprop-sub-layer-id, recv-sub-layer-id, max-recv-level-id, tx-mode, max-lsr, max-lps, max-cpb, max-dpb, max-br, max-tr, max-tc, max-fps, sprop-max-don-diff, sprop-depack-buf-nalus, sprop-depack-buf-bytes, depack-buf-cap, sprop-segmentation-id, sprop-spatial-segmentation-idc, dec-parallel-cap, and include-dph.
  • Such registration includes type name (i.e., video, audio, application, etc.), subtype name (e.g., H264, H265, mp4, etc.), required parameters, as well as optional parameters.
  • the register types are maintained by Internet Assigned Numbers Authority (IANA).
  • the media types of registration can be extended by registration of new optional parameters.
  • Media types and their parameters can be mapped to a specific description protocol.
  • an RFC may provide information on how parameters of a given media type can be mapped to SDP.
  • Such mapping is not restricted to SDP and can be defined for other protocols, e.g., XML, DASH MPD.
  • V3C coding makes the compromise between the complexity and the coding efficiency.
  • the encoder must perform more calculation to find optimal patch segmentation, which will reduce the amount of data as well as allow better video compression.
  • the complexity of the encoding side needs to be minimized to ensure the lowest possible latency, which can provide acceptable end user experience.
  • V3C encoder complexity can be reduced to a minimum when input camera feeds are encoded as video components directly, without any segmentation. This results in static atlas data that corresponds to capture camera rig parameters such as locations, orientations and intrinsics. Currently there is no efficient means for signalling this static atlas data. Instead, the static atlas data requires its own stream, e.g., an RTP stream, to be established in order to signal atlas data to the client only once at the beginning of the stream. This wastes network resources and complicates client device implementations.
  • the present embodiments relate to a method and an apparatus for low complexity - low delay V3C bitstream delivery for real-time applications.
  • the method for encoding according to an embodiment comprises:
  • creating a V3C bitstream, where atlas related information is precalculated (e.g., through calibration of the capturing system);
  • V3C atlas information and/or V3C common atlas information are static for the duration of the V3C bitstream; and storing the information either
      o as optional media type parameters of a V3C component (e.g., packed video), mapped to a given media type in a description document, e.g., SDP, XML, MPD, or
      o as SDP parameters;
  • V3C video component(s) over a network, e.g., using a real-time delivery protocol such as RTP;
  • static V3C atlas information and/or V3C common atlas information is referred to as “static out-of-band V3C information” or “static atlas data” or “static atlas information” interchangeably.
  • Figure 4 illustrates a simple example of a capture system 400 where four cameras C0-C3 are present and each of the cameras C0-C3 has
  • Such a capture system 400 can provide depth and color textures together with camera or sensor information to a V3C encoder.
  • V3C encoder can minimize the complexity by accepting a pre-defined fixed patch configuration.
  • An example of how the patches can be placed within a packed video component 500 is presented in Figure 5.
  • a V3C encoder can generate a packed video component based on the provided patch configuration. Instead of generating a complete V3C bitstream, it may output only the packed video component bitstream and a corresponding media type description, which uses optional media type parameters to carry the static out-of-band V3C information needed to reconstruct the volumetric video, i.e., static V3C atlas information and/or V3C common atlas information.
  • An example of such an encoding procedure is presented in Figure 6.
  • a capture system 600, such as the one presented in Figure 4 as an example, provides depth and color textures (D0, T0 ... Dx, Tx) to a V3C encoder 610.
  • In addition to depth and color textures, static camera information can also be provided.
  • the V3C encoder 610 may obtain a fixed patch configuration 620 according to which the packed video component is generated.
  • the V3C encoder 610 outputs a media type of a packed video component with a parameter carrying static out-of-band V3C information.
  • FIG. 7 illustrates an alternative example, where a V3C encoder 710 encodes each camera or sensor output in the capture system 700 as a separate V3C video component bitstream. Instead of generating a complete V3C bitstream, it outputs video component bitstreams and corresponding media type descriptions with optional media type parameters, along with static out-of-band V3C information that allows the volumetric video to be reconstructed. Also in this example, the V3C encoder may utilize a fixed patch configuration 720.
  • Static out-of-band V3C information that allows the volumetric video to be reconstructed can be provided as a separate entity, as is illustrated in Figure 7, or it can be duplicated in each media type description as optional media type parameters, or the duplicated information can be assigned only to one V3C video component, e.g., occupancy, and provided in media type descriptions as optional media type parameters.
  • the optional media type parameters for the V3C video components can include V3C unit header, whose presence is exposed through SDP level signalling.
  • the static atlas data can be stored as parameters of a session level SDP attribute, or a new attribute may be created for the purpose of storing static atlas data.
  • v3c-parameter-set provides V3C parameter set bytes as defined in ISO/IEC 23090-5.
  • <value> contains encoded representation of the V3C parameter set bytes, e.g., base16 [RFC 4648] (hexadecimal).
  • v3c-unit-header provides V3C unit header bytes as defined in ISO/IEC 23090-5. <value> contains encoded representation of the V3C unit header bytes, e.g., base16 [RFC 4648] (hexadecimal). The v3c-unit-header would indicate the type of video component carried by a given media. E.g., if the video contains a packed component, the V3C unit header shall contain vuh_unit_type equal to V3C_PVD (value 5).
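A receiver could check the component type signalled by a base16 v3c-unit-header value along the following lines. This is only a sketch under stated assumptions: it assumes vuh_unit_type occupies the five most significant bits of the first header byte and that the header is four bytes long, neither of which is spelled out in the text above; the function name and the sample value "28000000" are hypothetical.

```python
V3C_PVD = 5  # packed video, value stated in the text


def unit_type_from_hex(value):
    """Decode a base16 V3C unit header and return its unit type."""
    header = bytes.fromhex(value)
    if not header:
        raise ValueError("empty V3C unit header")
    # Assumption: vuh_unit_type is the top 5 bits of the first byte.
    return header[0] >> 3


# Hypothetical 4-byte header whose leading 5 bits encode the value 5.
sample = bytes([V3C_PVD << 3, 0, 0, 0]).hex().upper()
print(sample, unit_type_from_hex(sample) == V3C_PVD)
```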
  • This parameter may be used to convey any atlas data NAL units of the V3C atlas sub bitstream for out-of-band transmission.
  • the value of the parameter is a comma-separated (',') list of encoded representations of the atlas NAL units as specified in Section 8.4.5.2 of ISO/IEC 23090-5.
  • the encoded representation format can for example be base64 [RFC4648].
  • a subset of NAL units from section 8.4.5.2 of ISO/IEC 23090-5 may be exposed by a more specific parameter.
  • v3c-common-atlas-data <value>
  • This parameter may be used to convey common atlas data NAL units of the V3C common atlas sub bitstream for out-of-band transmission.
  • the value of the parameter is a comma-separated list of encoded representations of the common atlas NAL units (i.e., NAL_CASPS and NAL_CAF_IDR) as specified in Section 8.4.5.2 of ISO/IEC 23090-5.
  • the encoded representation format can for example be base64 [RFC4648].
  • This parameter may be used to convey SEI NAL units of the V3C common atlas sub-bitstream for out-of-band transmission.
  • the value of the parameter is a comma-separated list of encoded representations of the SEI NAL units (i.e., NAL_PREFIX_NSEI, NAL_SUFFIX_NSEI, NAL_PREFIX_ESEI and NAL_SUFFIX_ESEI) as specified in Section 8.4.5.2 of ISO/IEC 23090-5.
  • the encoded representation can for example be base64 [RFC4648].
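The comma-separated base64 lists used by the atlas, common atlas and SEI parameters above could be produced and consumed roughly as follows. This is a sketch only: the NAL unit payloads are placeholder bytes, not real atlas data, and the function names are hypothetical. The standard base64 alphabet contains no comma, so the comma is safe as a list separator.

```python
import base64


def encode_nal_list(nal_units):
    """Join NAL units (bytes) into a comma-separated base64 parameter value."""
    return ",".join(base64.b64encode(n).decode("ascii") for n in nal_units)


def decode_nal_list(value):
    """Split a comma-separated base64 parameter value back into NAL units."""
    return [base64.b64decode(part) for part in value.split(",") if part]


# Placeholder payloads standing in for atlas NAL units.
units = [b"\x00\x01\x02", b"\xff"]
value = encode_nal_list(units)
print(value)
print(decode_nal_list(value) == units)
```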
  • the optional media type parameters can be defined for already existing media types such as H264, H265, MP4 etc.
  • Example 1 MIME media type
  • The mimeType for V3C packed video encoded with H.265, with all atlas data provided out of band, would be as follows:
  • Example 3 - MIME media type parameters mapped to SDP: the media type parameters can be mapped to a description protocol, for example SDP, using the MIME media type from Example 1
  • a packed video component can be encapsulated in mp4 and split into segments to allow DASH based delivery while all V3C related information can be provided out of band.
  • C1+C2 and C2+C3 each form a stereo pair but C1+C3 does not.
  • 00S > <SegmentList>
  • v3c-unit-header 0F;
  • v3c-parameter-set AF6F0093992;
  • a delimiter codeword may be present to indicate the separation of codewords.
  • FF is used as a separator codeword.
  • Such a codeword may be also signaled as part of the mimeType, adaptation set parameters or attributes.
  • When V3C video components are streamed as separate RTP streams but atlas data remains static, static signalling of the atlas data can be enabled without establishing a dedicated RTP session for it.
  • the method for encoding generally comprises receiving 805 volumetric video data from a capture system; creating 810 a V3C bitstream comprising atlas information; encoding 815 video components of the volumetric video data into one or more video sub-bitstreams; creating 820 static atlas information and storing it into a static V3C atlas sub-bitstream; sending 825 encoded video sub-bitstreams over a network via a real-time delivery protocol; and sending 830 static V3C atlas sub-bitstream through out-of-band information.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for receiving volumetric video data from a capture system; means for creating a V3C bitstream comprising atlas information; means for encoding video components of the volumetric video data into one or more video sub-bitstreams; means for creating static atlas information and storing it into a static V3C atlas sub-bitstream; means for sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and means for sending static V3C atlas sub-bitstream through out-of-band information.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 8 according to various embodiments.
  • the method for decoding generally comprises receiving 910 encoded video sub-bitstreams via a real-time delivery protocol; receiving 920 the out-of-band information on static atlas information; and rendering 930 the V3C content based on the received data.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for receiving encoded video sub-bitstreams via a real-time delivery protocol; means for receiving the out-of-band information on static atlas information; and means for rendering the V3C content based on the received data.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 9 according to various embodiments.
  • Figure 10 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec.
  • the electronic device may comprise an encoder or a decoder.
  • the electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device.
  • the electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer.
  • the device may be also comprised as part of a head-mounted display device.
  • the apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera 42 capable of recording or capturing images and/or video.
  • the camera 42 may be a multi-lens camera system having at least two camera sensors.
  • the camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
  • the apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.
  • a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.
  • the different functions discussed herein may be performed in a different order and/or concurrently with each other.
  • one or more of the above-described functions and embodiments may be optional or may be combined.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments relate to a method for encoding and decoding, and to apparatuses for the same. The method for encoding comprises receiving volumetric video data from a media capture system; creating a V3C bitstream comprising atlas information; encoding video components of the volumetric video data into one or more video sub-bitstreams; creating static atlas information and storing it into a static V3C atlas sub-bitstream; sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and sending static V3C atlas sub-bitstream through out-of-band information.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Technical Field
The present solution generally relates to coding of volumetric video.
Background
Volumetric video data represents a three-dimensional (3D) scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality), and MR (Mixed Reality) applications. Such data describes geometry (Shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), and any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video can be generated from 3D models, also referred to as volumetric visual objects, i.e., CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Examples of representation formats for volumetric data comprise triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time.
Summary
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.
According to a first aspect, there is provided an apparatus comprising means for receiving volumetric video data from a media capture system; means for creating a V3C bitstream comprising atlas information; means for encoding video components of the volumetric video data into one or more video sub-bitstreams; means for creating static atlas information and storing it into a static V3C atlas sub-bitstream; means for sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and means for sending static V3C atlas sub-bitstream through out-of-band information.
According to a second aspect, there is provided an apparatus for decoding, comprising means for receiving encoded video sub-bitstreams over a network via a real-time delivery protocol; means for receiving the out-of-band information on static atlas information; and means for rendering the V3C content based on the received data.
According to a third aspect, there is provided a method for encoding, comprising receiving volumetric video data from a media capture system; creating a V3C bitstream comprising atlas information; encoding video components of the volumetric video data into one or more video sub-bitstreams; creating static atlas information and storing it into a static V3C atlas sub-bitstream; sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and sending static V3C atlas sub-bitstream through out-of-band information.
According to a fourth aspect, there is provided a method for decoding comprising receiving encoded video sub-bitstreams over a network via a real-time delivery protocol; receiving the out-of-band information on static atlas information; and rendering the V3C content based on the received data.
According to a fifth aspect, there is provided an apparatus for encoding comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive volumetric video data from a media capture system; create a V3C bitstream comprising atlas information; encode video components of the volumetric video data into one or more video sub-bitstreams; create static atlas information and store it into a static V3C atlas sub-bitstream; send encoded video sub-bitstreams over a network via a real-time delivery protocol; and send static V3C atlas sub-bitstream through out-of-band information.
According to a sixth aspect, there is provided an apparatus for decoding comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive encoded video sub-bitstreams over a network via a real-time delivery protocol; receive the out-of-band information on static atlas information; and render the V3C content based on the received data.
According to a seventh aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive volumetric video data from a media capture system; create a V3C bitstream comprising atlas information; encode video components of the volumetric video data into one or more video sub-bitstreams; create static atlas information and store it into a static V3C atlas sub-bitstream; send encoded video sub-bitstreams over a network via a real-time delivery protocol; and send static V3C atlas sub-bitstream through out-of-band information.
According to an eighth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive encoded video sub-bitstreams over a network via a real-time delivery protocol; receive the out-of-band information on static atlas information; and render the V3C content based on the received data.
According to an embodiment, video components are encoded as a packed video sub-bitstream or separate video sub-bitstreams.
According to an embodiment, the received volumetric video data comprises at least one of the depth or color texture; and camera or sensor related metadata information.
According to an embodiment, a pre-defined fixed patch configuration is used according to which patches of the atlas information are placed.
According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.
Description of the Drawings
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an example of a compression process of a volumetric video;
Fig. 2 shows an example of a de-compression process of a volumetric video;
Fig. 3 shows an example of a V3C bitstream originated from ISO/IEC 23090-5;
Fig. 4 shows an example of a capture system;
Fig. 5 shows an example of a packed video component structure;
Fig. 6 shows an example of an encoding procedure with generating out-of-band V3C static information;
Fig. 7 shows an example of an encoding method with separate video component bitstreams and static atlas data;
Fig. 8 is a flowchart illustrating a method according to an embodiment;
Fig. 9 is a flowchart illustrating a method according to another embodiment; and
Fig. 10 shows an example of an apparatus.
Description of Example Embodiments
The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but not necessarily are, reference to the same embodiment and such references mean at least one of the embodiments.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.
Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such a process may be applied for example in MPEG Video-based Point Cloud Coding (V-PCC). The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:
- (1.0, 0.0, 0.0),
- (0.0, 1.0, 0.0),
- (0.0, 0.0, 1.0),
- (-1.0, 0.0, 0.0),
- (0.0, -1.0, 0.0), and
- (0.0, 0.0, -1.0)
More precisely, each point may be associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal and the plane normal).
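The initial clustering rule can be illustrated with a short sketch. It assumes point normals have already been estimated and are given as 3-tuples; the function name `closest_plane` is hypothetical, and ties are resolved in favour of the first plane in the list, which the text does not specify.

```python
# The six oriented planes, in the order listed in the text.
PLANE_NORMALS = [
    (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0),
]


def closest_plane(normal):
    """Return the index of the plane whose normal maximizes the dot product."""
    dots = [sum(a * b for a, b in zip(normal, plane)) for plane in PLANE_NORMALS]
    return max(range(len(PLANE_NORMALS)), key=dots.__getitem__)


# Example normals (illustrative values only).
for n in [(0.1, 0.9, 0.2), (-0.8, 0.1, 0.3), (0.0, 0.0, -1.0)]:
    print(n, "->", closest_plane(n))
```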
The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.
Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.
The simple packing strategy used iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
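A minimal sketch of this packing strategy is given below. It assumes patches are axis-aligned rectangles whose sizes are expressed in grid cells, and that patches are inserted in the order given; the function name `pack_patches` is hypothetical.

```python
def pack_patches(patches, width, height):
    """Place (w, h) patches on a width x height grid in raster scan order.

    Returns the (u, v) position of each patch and the clipped grid height.
    """
    if width < 1 or height < 1:
        raise ValueError("grid must have at least one cell")
    used = [[False] * width for _ in range(height)]
    positions = []
    for pw, ph in patches:
        if pw > width:
            raise ValueError("patch is wider than the grid")
        placed = None
        while placed is None:
            rows = len(used)
            # Exhaustive raster-scan search for the first free location.
            for v in range(rows - ph + 1):
                for u in range(width - pw + 1):
                    if all(not used[v + j][u + i]
                           for j in range(ph) for i in range(pw)):
                        placed = (u, v)
                        break
                if placed is not None:
                    break
            if placed is None:
                # No room at the current resolution: double H and retry.
                used.extend([False] * width for _ in range(rows))
        u, v = placed
        for j in range(ph):
            for i in range(pw):
                used[v + j][u + i] = True
        positions.append(placed)
    # Clip H so that it just fits the used grid cells.
    used_height = max((r + 1 for r, row in enumerate(used) if any(row)), default=0)
    return positions, used_height


print(pack_patches([(2, 2), (2, 2), (3, 1)], width=4, height=2))
```

In this example the third patch does not fit into the initial 4x2 grid, so the height is doubled to 4 and then clipped to 3 once all patches are placed.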
The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called a near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:
• Geometry: WxH YUV420-8bit,
• Texture: WxH YUV420-8bit,
It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
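The selection of near-layer and far-layer depths for one pixel, as described for the two-layer projection above, can be sketched as follows. The function assumes the depths of all points of the current patch that project to the same pixel are already collected in a list; the name `near_far_depths` and the sample values are illustrative.

```python
def near_far_depths(depths, delta):
    """Return (D0, D1) for one pixel.

    D0 is the lowest depth; D1 is the highest depth within [D0, D0 + delta],
    where delta is the surface thickness parameter.
    """
    d0 = min(depths)
    d1 = max(d for d in depths if d <= d0 + delta)
    return d0, d1


# Points at depth 19 lie outside the surface thickness and are ignored.
print(near_far_depths([12, 10, 13, 19], delta=4))
```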
The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.
The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g., 16x16) pixels is compressed independently. If the block is empty (i.e., unoccupied, i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e., occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e., edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
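The edge-block case of the padding strategy can be sketched as below. The sketch assumes a 4-connected neighbourhood and that pixels filled in one pass become available as neighbours in the next pass; the text does not fix either detail, and the function name `pad_edge_block` is hypothetical.

```python
def pad_edge_block(values, occupied):
    """Fill empty pixels of one block with the mean of their filled neighbours.

    values:   2D list of pixel values
    occupied: 2D list of booleans, True where the pixel is non-empty
    """
    h, w = len(values), len(values[0])
    vals = [row[:] for row in values]
    filled = [row[:] for row in occupied]
    if not any(any(row) for row in filled):
        raise ValueError("edge block must contain at least one filled pixel")
    while not all(all(row) for row in filled):
        updates = []
        for y in range(h):
            for x in range(w):
                if filled[y][x]:
                    continue
                # Assumption: 4-connected neighbours that are already filled.
                nbrs = [vals[ny][nx]
                        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= ny < h and 0 <= nx < w and filled[ny][nx]]
                if nbrs:
                    updates.append((y, x, sum(nbrs) / len(nbrs)))
        for y, x, v in updates:
            vals[y][x] = v
            filled[y][x] = True
    return vals


print(pad_edge_block([[10, 0, 20]], [[True, False, True]]))
```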
The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise index of the projection plane, 2D bounding box, 3D location of the patch.
For example, the following metadata may be encoded/decoded for every patch:
- index of the projection plane
  o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
  o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
  o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)
- 2D bounding box (u0, v0, u1, v1)
- 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:
  o Index 0: δ0 = x0, s0 = z0 and r0 = y0
  o Index 1: δ0 = y0, s0 = z0 and r0 = x0
  o Index 2: δ0 = z0, s0 = x0 and r0 = y0
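The mapping from the 3D patch location to depth, tangential shift and bitangential shift can be written compactly. The sketch below follows the three index cases listed above; the function name `patch_offsets` is hypothetical.

```python
def patch_offsets(index, x0, y0, z0):
    """Return (depth, tangential shift, bitangential shift) for a patch."""
    if index == 0:    # planes (±1, 0, 0)
        return x0, z0, y0
    if index == 1:    # planes (0, ±1, 0)
        return y0, z0, x0
    if index == 2:    # planes (0, 0, ±1)
        return z0, x0, y0
    raise ValueError("projection plane index must be 0, 1 or 2")


for idx in range(3):
    print(idx, patch_offsets(idx, 1, 2, 3))
```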
Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:
- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.
- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
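The derivation of the position J from the candidate list L can be sketched as follows. Several details are assumptions made for illustration: bounding boxes are given in block units as half-open ranges, patch indices start at 1 in bounding-box coding order, and the special index 0 is appended at the end of L. The arithmetic coding of J itself is not shown.

```python
def candidate_list(block, bboxes):
    """Build the candidate patch list L for one TxT block.

    block: (bu, bv) block coordinates.
    bboxes: list of (u0, v0, u1, v1) in block units, in coding order;
            patch i+1 owns bboxes[i] (assumption: indices start at 1).
    """
    bu, bv = block
    L = [i for i, (u0, v0, u1, v1) in enumerate(bboxes, start=1)
         if u0 <= bu < u1 and v0 <= bv < v1]
    L.append(0)  # empty space, special index 0 (assumed to be appended last)
    return L


def position_in_candidates(block, patch_index, bboxes):
    """Return J, the position of patch index I in L, which is coded instead of I."""
    return candidate_list(block, bboxes).index(patch_index)
```

Because most blocks are covered by only a few bounding boxes, J typically takes far fewer distinct values than I, which is what makes it cheaper to code.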
The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.
The occupancy map compression 110 leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e., blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 gives visually acceptable results while significantly reducing the number of bits required to encode the occupancy map.
The compression process may comprise one or more of the following example operations:
• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value of 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.
• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.
• Binary information may be encoded for each TxT block to indicate whether it is full or not.
• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
  o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
  o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
  o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
    ■ The binary value of the initial sub-block is encoded.
    ■ Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.
    ■ The number of detected runs is encoded.
    ■ The length of each run, except the last one, is also encoded.
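The run-length strategy above can be sketched as follows, assuming the sub-block values have already been arranged in the traversal order chosen by the encoder. The function names and the dictionary layout are hypothetical, and entropy coding of the resulting symbols is left out.

```python
def encode_block(sub_blocks):
    """sub_blocks: list of 0/1 values, already ordered by the chosen traversal."""
    if all(sub_blocks):
        return {"full": True}
    runs = []
    current, length = sub_blocks[0], 0
    for bit in sub_blocks:
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return {
        "full": False,
        "first": sub_blocks[0],     # binary value of the initial sub-block
        "num_runs": len(runs),      # number of detected runs
        "run_lengths": runs[:-1],   # every run length except the last one
    }


def decode_block(code, n):
    """Rebuild n sub-block values; the last run length is implied by n."""
    if code["full"]:
        return [1] * n
    lengths = code["run_lengths"] + [n - sum(code["run_lengths"])]
    out, bit = [], code["first"]
    for run in lengths:
        out += [bit] * run
        bit ^= 1  # runs alternate between 0 and 1
    return out
```

The last run length can be omitted because the decoder knows the total number of sub-blocks in a TxT block.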
Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream and, after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.
The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs, and let (u0, v0, u1, v1) be its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:
δ(u, v) = δ0 + g(u, v)
s(u, v) = s0 - u0 + u
r(u, v) = r0 - v0 + v
where g(u, v) is the luma component of the geometry image.
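A minimal sketch of this per-pixel reconstruction is given below. The dictionary keys and the function name are hypothetical, and the final step, which maps (depth, tangential, bitangential) back to (x, y, z), is an assumption obtained by inverting the per-index axis assignment given earlier for the patch 3D location.

```python
def reconstruct_point(u, v, g_uv, patch):
    """Return the (x, y, z) position of the point at occupied pixel (u, v).

    g_uv: luma value of the geometry image at (u, v).
    patch: dict with keys 'index', 'd0', 's0', 'r0', 'u0', 'v0' (hypothetical layout).
    """
    depth = patch["d0"] + g_uv          # depth:              d0 + g(u, v)
    s = patch["s0"] - patch["u0"] + u   # tangential shift:   s0 - u0 + u
    r = patch["r0"] - patch["v0"] + v   # bitangential shift: r0 - v0 + v
    index = patch["index"]
    if index == 0:   # depth along x, tangential along z, bitangential along y
        return depth, r, s
    if index == 1:   # depth along y, tangential along z, bitangential along x
        return r, depth, s
    if index == 2:   # depth along z, tangential along x, bitangential along y
        return s, r, depth
    raise ValueError("projection plane index must be 0, 1 or 2")
```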
For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.
There are alternatives to capture and represent a volumetric frame. The format used to capture and represent the volumetric frame depends on the process to be performed on it, and the target application using the volumetric frame. As a first example a volumetric frame can be represented as a point cloud. A point cloud is a set of unstructured points in 3D space, where each point is characterized by its position in a 3D coordinate system (e.g., Euclidean), and some corresponding attributes (e.g., color information provided as RGBA value, or normal vectors). As a second example, a volumetric frame can be represented as images, with or without depth, captured from multiple viewpoints in 3D space. In other words, the volumetric video can be represented by one or more view frames (where a view is a projection of a volumetric scene on to a plane (the camera plane) using a real or virtual camera with known/computed extrinsic and intrinsic). Each view may be represented by a number of components (e.g., geometry, color, transparency, and occupancy picture), which may be part of the geometry picture or represented separately. As a third example, a volumetric frame can be represented as a mesh. Mesh is a collection of points, called vertices, and connectivity information between vertices, called edges. Vertices along with edges form faces. The combination of vertices, edges and faces can uniquely approximate shapes of objects.
Depending on the capture, a volumetric frame can provide viewers the ability to navigate a scene with six degrees of freedom, i.e., both translational and rotational movement of their viewing pose (which includes yaw, pitch, and roll). The data to be coded for a volumetric frame can also be significant, as a volumetric frame can contain many numbers of objects, and the positioning and movement of these objects in the scene can result in many dis-occluded regions. Furthermore, the interaction of the light and materials in objects and surfaces in a volumetric frame can generate complex light fields that can produce texture variations for even a slight change of pose.
A sequence of volumetric frames is a volumetric video. Due to the large amount of information, storage and transmission of a volumetric video requires compression. One way to compress a volumetric frame is to project the 3D geometry and related attributes into a collection of 2D images along with additional associated metadata. The projected 2D images can then be coded using 2D video and image coding technologies, for example ISO/IEC 14496-10 (H.264/AVC) and ISO/IEC 23008-2 (H.265/HEVC). The metadata can be coded with technologies specified in specifications such as ISO/IEC 23090-5. The coded images and the associated metadata can be stored or transmitted to a client that can decode and render the 3D volumetric frame.
In the following, a short reference of ISO/IEC 23090-5 Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC) 2nd Edition is given. ISO/IEC 23090-5 specifies the syntax, semantics, and process for coding volumetric video. The specified syntax is designed to be generic, so that it can be reused for a variety of applications. Point clouds, immersive video with depth, and mesh representations can all use the ISO/IEC 23090-5 standard with extensions that deal with the specific nature of the final representation. The purpose of the specification is to define how to decode and interpret the associated data (for example atlas data in ISO/IEC 23090-5), which tells a renderer how to interpret 2D frames to reconstruct a volumetric frame.
Two applications of V3C (ISO/IEC 23090-5) have been defined, V-PCC (ISO/IEC 23090-5) and MIV (ISO/IEC 23090-12). MIV and V-PCC use a number of V3C syntax elements with slightly modified semantics. An example of how a generic syntax element can be interpreted differently by the application is pdu_projection_id.
In the case of V-PCC, the syntax element pdu_projection_id specifies the index of the projection plane for the patch. There can be 6 or 18 projection planes in V-PCC, and they are implicit, i.e., pre-determined. In the case of MIV, pdu_projection_id corresponds to a view ID, i.e., identifies which view the patch originated from. View IDs and their related information are explicitly provided in the MIV view parameters list and may be tailored for each content.
The MPEG 3DG (ISO SC29 WG7) group has started work on a third application of V3C, namely mesh compression. It is also envisaged that mesh coding will reuse V3C syntax as much as possible and may also slightly modify the semantics.
To differentiate between applications of V3C bitstream that allow a client to properly interpret the decoded data, V3C uses the ptl_profile_toolset_idc parameter.
A V3C bitstream is a sequence of bits that forms the representation of coded volumetric frames and the associated data, making up one or more coded V3C sequences (CVS). A CVS is a sequence of bits identified and separated by appropriate delimiters; it is required to start with a VPS, includes a V3C unit, and contains one or more V3C units with an atlas sub-bitstream or a video sub-bitstream. This is illustrated in Figure 3. Video sub-bitstreams and atlas sub-bitstreams can be referred to as V3C sub-bitstreams. A V3C unit header in conjunction with VPS information identifies which V3C sub-bitstream a V3C unit contains and how to interpret it. An example of this is shown herein below:
Figure imgf000015_0001
Figure imgf000016_0001
V3C bitstream can be stored according to Annex C of ISO/IEC 23090-5, which specifies syntax and semantics of a sample stream format to be used by applications that deliver some or all of the V3C unit stream as an ordered stream of bytes or bits within which the locations of V3C unit boundaries need to be identifiable from patterns in the data.
A CVS starts with a VPS (V3C Parameter Set), which allows each V3C unit to be interpreted; vuh_v3c_parameter_set_id in the V3C unit header specifies the value of vps_v3c_parameter_set_id for the active V3C VPS. The VPS provides, among others, the following information about the V3C bitstream:
• Profile, tier, and level to which the bitstream is conformant
• Number of atlases that constitute the V3C bitstream
• Number of occupancy, geometry, and attribute video sub-bitstreams
• Number of maps for each geometry and attribute video component
• Mapping information from attribute index to attribute type
Figure imgf000016_0002
Figure imgf000017_0001
In contrast to a fixed number of camera views and only one atlas in V-PCC, in MIV specification the number of cameras, camera extrinsic, camera intrinsic information is not fixed and may change during the V3C bitstream. In addition, the camera information may be shared among all atlases within V3C bitstream.
In order to support such flexibility, the ISO/IEC 23090-5 2nd edition introduces a concept of common atlas data. Common atlas data is carried in a dedicated V3C unit type equal to V3C_CAD, which contains a number of non-ACL NAL unit types, such as NAL_CASPS, which carries the common atlas sequence parameter set syntax structure, and NAL_CAF_IDR and NAL_CAF_TRAIL, which contain common atlas frames.
1 Under preparation. Stage at time of publication: ISO/IEC CD 23090-12:2020
The V3C includes signalling mechanisms, through the profile_tier_level syntax structure in the VPS, to support interoperability while restricting capabilities of V3C profiles. V3C also includes an initial set of tool constraint flags to indicate additional restrictions on a profile. Currently the sub-profile indicator syntax element is always present, but the value of 0xFFFFFFFF indicates that no sub-profile is used, i.e., the full profile is supported.
Figure imgf000018_0001
The Real-time Transport Protocol (RTP) is intended for end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP allows data transfer to multiple destinations through IP multicast or to a specific destination through IP unicast. The majority of RTP implementations are built on the User Datagram Protocol (UDP). Other transport protocols may also be utilized. RTP is used together with other protocols such as H.323 and the Real Time Streaming Protocol (RTSP).
The RTP specification describes two protocols: RTP and RTCP. RTP is used for the transfer of multimedia data, and the RTCP is used to periodically send control information and QoS parameters. RTP sessions may be initiated between client and server using a signalling protocol, such as H.323, the Session Initiation Protocol (SIP), or RTSP. These protocols may use the Session Description Protocol (RFC 8866) to specify the parameters for the sessions.
RTP is designed to carry a multitude of multimedia formats, which permits the development of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header. For a class of applications (e.g., audio, video), an RTP profile may be defined. For a media format (e.g., a specific video coding format), an associated RTP payload format may be defined. Every instantiation of RTP in a particular application may require a profile and payload format specifications.
The profile defines the codecs used to encode the payload data and their mapping to payload format codes in the Payload Type (PT) field of the RTP header.
For example, the RTP profile for audio and video conferences with minimal control is defined in RFC 3551. The profile defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format and a PT value using the Session Description Protocol (SDP). The latter mechanism is used for newer video codecs, such as the RTP payload format for H.264 Video defined in RFC 6184 or the RTP Payload Format for High Efficiency Video Coding (HEVC) defined in RFC 7798.
An RTP session is established for each multimedia stream. Audio and video streams may use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream. The RTP specification recommends an even port number for RTP and the next odd port number for the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols.
RTP packets are created at the application layer and handed to the transport layer for delivery. Each unit of RTP media data created by an application begins with the RTP packet header. The RTP header has a minimum size of 12 bytes. After the header, optional header extensions may be present. This is followed by the RTP payload, the format of which is determined by the particular class of application. The fields in the header are as follows:
• Version: (2 bits) Indicates the version of the protocol.
• P (Padding): (1 bit) Used to indicate if there are extra padding bytes at the end of the RTP packet.
• X (Extension): (1 bit) Indicates the presence of an extension header between the header and payload data. The extension header is application or profile specific.
• CC (CSRC count): (4 bits) Contains the number of CSRC identifiers that follow the SSRC
• M (Marker): (1 bit) Signalling used at the application level in a profile-specific manner. If it is set, it means that the current data has some special relevance for the application.
• PT (Payload type): (7 bits) Indicates the format of the payload and thus determines its interpretation by the application.
• Sequence number: (16 bits) The sequence number is incremented for each RTP data packet sent and is to be used by the receiver to detect packet loss and to accommodate out-of-order delivery.
• Timestamp: (32 bits) Used by the receiver to play back the received samples at appropriate time and interval. When several media streams are present, the timestamps may be independent in each stream. The granularity of the timing is application specific. For example, video stream may use a 90 kHz clock. The clock granularity is one of the details that is specified in the RTP profile for an application.
• SSRC: (32 bits) Synchronization source identifier uniquely identifies the source of the stream. The synchronization sources within the same RTP session will be unique.
• CSRC: (32 bits each) Contributing source IDs enumerate contributing sources to a stream which has been generated from multiple sources.
• Header extension: (optional, presence indicated by Extension field) The first 32-bit word contains a profile-specific identifier (16 bits) and a length specifier (16 bits) that indicates the length of the extension in 32-bit units, excluding the 32 bits of the extension header.
In this disclosure, the Session Description Protocol (SDP) is used as an example of a session specific file format. SDP is a format for describing multimedia communication sessions for the purposes of announcement and invitation. Its predominant use is in support of conversational and streaming media applications. SDP does not deliver any media streams itself, but is used between endpoints for negotiation of network metrics, media types, and other associated properties. The set of properties and parameters is called a session profile. SDP is extensible for the support of new media types and formats.
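Returning to the RTP header fields listed above, their bit layout can be illustrated with a short parser. This is a sketch under stated assumptions: the function name and the returned dictionary keys are hypothetical, and only the fixed header, the CSRC list and the length of an optional header extension are handled.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header, CSRC list and extension length."""
    if len(packet) < 12:
        raise ValueError("RTP header needs at least 12 bytes")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F                       # CSRC count (4 bits)
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("truncated CSRC list")
    header = {
        "version": b0 >> 6,              # 2 bits
        "padding": (b0 >> 5) & 1,        # 1 bit
        "extension": (b0 >> 4) & 1,      # 1 bit
        "marker": b1 >> 7,               # 1 bit
        "payload_type": b1 & 0x7F,       # 7 bits
        "sequence_number": seq,          # 16 bits
        "timestamp": timestamp,          # 32 bits
        "ssrc": ssrc,                    # 32 bits
        "csrc": list(struct.unpack("!%dI" % cc, packet[12:end])),
    }
    if header["extension"]:
        if len(packet) < end + 4:
            raise ValueError("truncated header extension")
        profile_id, ext_words = struct.unpack("!HH", packet[end:end + 4])
        header["extension_profile"] = profile_id
        end += 4 + 4 * ext_words         # length is counted in 32-bit units
    header["payload_offset"] = end       # where the RTP payload starts
    return header
```

A receiver would typically use `sequence_number` to detect loss and reordering, and `payload_type` to select the depacketizer negotiated through SDP.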
The Session Description Protocol describes a session as a group of fields in a text-based format, one field per line. The form of each field is as follows:
<character>=<value><CR><LF>
where <character> is a single case-sensitive character and <value> is structured text in a format that depends on the character. Values may be UTF-8 encoded. Whitespace is not allowed immediately to either side of the equal sign.
Session descriptions consist of three sections: session, timing, and media descriptions. Each description may contain multiple timing and media descriptions. Names are only unique within the associated syntactic construct.
Fields appear in the order shown below; optional fields are marked with an asterisk:
v= (protocol version number, currently only 0)
o= (originator and session identifier: username, id, version number, network address)
s= (session name: mandatory with at least one UTF-8-encoded character)
i=* (session title or short information)
u=* (URI of description)
e=* (zero or more email address with optional name of contacts)
p=* (zero or more phone number with optional name of contacts)
c=* (connection information; not required if included in all media)
b=* (zero or more bandwidth information lines)
One or more time descriptions ("t=" and "r=" lines; see below)
z=* (time zone adjustments)
k=* (encryption key)
a=* (zero or more session attribute lines)
Zero or more media descriptions (each one starting with an "m=" line; see below)
Time description (mandatory):
t= (time the session is active)
r=* (zero or more repeat times)
Media description (optional):
m= (media name and transport address)
i=* (media title or information field)
c=* (connection information; optional if included at session level)
b=* (zero or more bandwidth information lines)
k=* (encryption key)
a=* (zero or more media attribute lines, overriding the session attribute lines)
Below is a sample session description from RFC 4566. This session is originated by the user “jdoe” at IPv4 address 10.47.16.5. Its name is “SDP Seminar” and extended session information (“A Seminar on the session description protocol”) is included along with a link for additional information and an email address to contact the responsible party, Jane Doe. This session is specified to last two hours using NTP timestamps, with a connection address (which indicates the address clients must connect to or, when a multicast address is provided, as it is here, subscribe to) specified as IPv4 224.2.17.12 with a TTL of 127. Recipients of this session description are instructed to only receive media. Two media descriptions are provided, both using the RTP Audio Video Profile. The first is an audio stream on port 49170 using RTP/AVP payload type 0 (defined by RFC 3551 as PCMU), and the second is a video stream on port 51372 using RTP/AVP payload type 99 (defined as “dynamic”). Finally, an attribute is included which maps RTP/AVP payload type 99 to format h263-1998 with a 90 kHz clock rate. RTCP ports for the audio and video streams of 49171 and 51373 respectively are implied.
v=0
o=jdoe 2890844526 2890842807 IN IP4 10.47.16.5
s=SDP Seminar
i=A Seminar on the session description protocol
u=http://www.example.com/seminars/sdp.pdf
e=j.doe@example.com (Jane Doe)
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
a=recvonly
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 99
a=rtpmap:99 h263-1998/90000
SDP uses attributes to extend the core protocol. Attributes can appear within the Session or Media sections and are scoped accordingly as session-level or media-level. New attributes can be added to the standard through registration with IANA. A media description may contain any number of “a=” lines (attribute-fields) that are media description specific. Session-level attributes convey additional information that applies to the session as a whole rather than to individual media descriptions.
Attributes are either properties or values:
a=<attribute-name>
a=<attribute-name>:<attribute-value>
Examples of attributes defined in RFC 8866 are “rtpmap” and “fmtp”. The “rtpmap” attribute maps from an RTP payload type number (as used in an "m=" line) to an encoding name denoting the payload format to be used. It also provides information on the clock rate and encoding parameters. Up to one "a=rtpmap:" attribute can be defined for each media format specified. This can be the following:
m=audio 49230 RTP/AVP 96 97 98
a=rtpmap:96 L8/8000
a=rtpmap:97 L16/8000
a=rtpmap:98 L16/11025/2
In the example above, the media types are “audio/L8” and “audio/L16”.
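To make the structure of the "a=rtpmap:" attribute concrete, a minimal parsing sketch is given below. The function name and the returned keys are hypothetical; the field order (payload type, encoding name, clock rate, optional encoding parameters) follows the example lines above.

```python
def parse_rtpmap(line: str) -> dict:
    """Split 'a=rtpmap:<pt> <encoding>/<clock rate>[/<params>]' into fields."""
    prefix = "a=rtpmap:"
    if not line.startswith(prefix):
        raise ValueError("not an rtpmap attribute")
    payload_type, encoding = line[len(prefix):].split(" ", 1)
    parts = encoding.strip().split("/")
    return {
        "payload_type": int(payload_type),
        "encoding_name": parts[0],
        "clock_rate": int(parts[1]),
        # For audio, the optional third field is the number of channels.
        "encoding_params": parts[2] if len(parts) > 2 else None,
    }
```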
Parameters added to an "a=rtpmap:" attribute may only be those required for a session directory to make the choice of appropriate media to participate in a session. Codec-specific parameters may be added in other attributes, for example, "fmtp".
The "fmtp" attribute allows parameters that are specific to a particular format to be conveyed in a way that SDP does not have to understand them. The format can be one of the formats specified for the media. Format-specific parameters, semicolon separated, may be any set of parameters required to be conveyed by SDP and given unchanged to the media tool that will use this format. At most one instance of this attribute is allowed for each format. An example is:
a=fmtp:96 profile-level-id=42e016;max-mbps=108000;max-fs=3600
For example, RFC 7798 defines the following parameters: sprop-vps, sprop-sps, sprop-pps, profile-space, profile-id, tier-flag, level-id, interop-constraints, profile-compatibility-indicator, sprop-sub-layer-id, recv-sub-layer-id, max-recv-level-id, tx-mode, max-lsr, max-lps, max-cpb, max-dpb, max-br, max-tr, max-tc, max-fps, sprop-max-don-diff, sprop-depack-buf-nalus, sprop-depack-buf-bytes, depack-buf-cap, sprop-segmentation-id, sprop-spatial-segmentation-idc, dec-parallel-cap, and include-dph.
While defining a payload type of a given media, quite often an RFC also performs Media Type registration. Such registration includes the type name (i.e., video, audio, application, etc.), subtype name (e.g., H264, H265, mp4, etc.), required parameters, as well as optional parameters. The registered types are maintained by the Internet Assigned Numbers Authority (IANA). A media type registration can be extended by registration of new optional parameters.
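As a sketch of how the semicolon-separated, format-specific parameters of an "fmtp" attribute (such as the RFC 7798 parameters listed above) might be split into name/value pairs, consider the following. The function name is hypothetical, and values are returned as uninterpreted strings, in line with the fact that SDP itself does not need to understand them.

```python
def parse_fmtp(line: str):
    """Return (format, {name: value}) for an 'a=fmtp:<format> <params>' line."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an fmtp attribute")
    fmt, _, params = line[len(prefix):].partition(" ")
    result = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()   # values stay opaque strings
    return fmt, result
```

The parameter values in the test below are illustrative only and are not taken from any real session.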
Media types and their parameters can be mapped to a specific description protocol. For example, while defining a payload type of a given media, an RFC may provide information on how parameters of a given media type can be mapped to SDP. Such mapping is not restricted to SDP and can be defined for other protocols, e.g., XML, DASH MPD.
Like any coding technology, V3C coding makes a compromise between complexity and coding efficiency. To achieve higher coding efficiency in V3C, the encoder must perform more calculation to find an optimal patch segmentation, which reduces the amount of data as well as allows better video compression. In real-time communication, for example, the complexity of the encoding side needs to be minimized to ensure the lowest possible latency, which can provide an acceptable end user experience.
V3C encoder complexity can be reduced to a minimum when input camera feeds are encoded as video components directly, without any segmentation. This results in static atlas data that corresponds to capture camera rig parameters such as locations, orientations and intrinsics. Currently there is no efficient means for signalling this static atlas data. Instead, the static atlas data requires its own stream, e.g., an RTP stream, to be established in order to signal atlas data to the client only once at the beginning of the stream. This wastes network resources and complicates client device implementations.
The present embodiments relate to a method and an apparatus for low complexity - low delay V3C bitstream delivery for real-time applications. The method for encoding according to an embodiment comprises:
- creating a V3C bitstream, where atlas related information is precalculated (e.g., through calibration of the capturing system);
- encoding video components (occupancy, geometry, attributes) either as
  o a packed video bitstream, or
  o separate video bitstreams;
- creating V3C atlas information and/or V3C common atlas information according to the atlas related information, wherein the V3C atlas information and/or V3C common atlas information is static for the duration of the V3C bitstream; and storing the information either
  o as optional media type parameters of a V3C component (e.g., packed video), which are mapped to a given media type in a description document, e.g., SDP, XML, MPD, or
  o as SDP parameters;
- sending V3C video component(s) over a network, e.g., using a real-time delivery protocol such as RTP; and
- sending static V3C atlas information and/or V3C common atlas information through out-of-band information (e.g., SDP).
The method for decoding according to an embodiment comprises:
- receiving the out of band information on static V3C atlas information and/or V3C common atlas information, and V3C video stream(s); and
- rendering the V3C content.
In this disclosure static V3C atlas information and/or V3C common atlas information is referred to as “static out-of-band V3C information” or “static atlas data” or “static atlas information” interchangeably.
Figure 4 illustrates a simple example of a capture system 400 where four cameras C0-C3 are present and each of the cameras C0-C3 has
• A depth sensor D0 ... D3
• A color sensor T0 ... T3
• Static intrinsic information of the camera or sensor
• Static extrinsic information of the camera or sensor
Such a capture system 400 can provide depth and color textures together with camera or sensor information to a V3C encoder. The V3C encoder can minimize complexity by accepting a pre-defined fixed patch configuration. An example of how the patches can be placed within a packed video component 500 is presented in Figure 5. A V3C encoder can generate a packed video component based on the provided patch configuration. Instead of generating a complete V3C bitstream, it may output only the packed video component bitstream and a corresponding media type description that uses optional media type parameters to carry the static out-of-band V3C information needed to reconstruct the volumetric video, i.e., static V3C atlas information and/or V3C common atlas information. An example of such an encoding procedure is presented in Figure 6. A capture system 600, such as the one presented in Figure 4, provides depth and color textures (D0, T0 ... Dx, Tx) to a V3C encoder 610. In addition to depth and color textures, static camera information can also be provided. The V3C encoder 610 may obtain a fixed patch configuration 620 according to which the packed video component is generated. In addition to the packed video component, the V3C encoder 610 outputs a media type of the packed video component with a parameter carrying static out-of-band V3C information.
Figure 7 illustrates an alternative example, where a V3C encoder 710 encodes each camera or sensor output in the capture system 700 as a separate V3C video component bitstream. Instead of generating a complete V3C bitstream, it outputs video component bitstreams and corresponding media type descriptions with optional media type parameters, along with static out-of-band V3C information that allows the volumetric video to be reconstructed. Also in this example, the V3C encoder may utilize a fixed patch configuration 720.
Static out-of-band V3C information that allows the volumetric video to be reconstructed can be provided as a separate entity, as illustrated in Figure 7; it can be duplicated in each media type description as optional media type parameters; or it can be assigned to only one V3C video component, e.g., occupancy, and provided in that component's media type description as optional media type parameters.
The optional media type parameters for the V3C video components can include a V3C unit header, whose presence is exposed through SDP-level signalling. The static atlas data can be stored as parameters of a session-level SDP attribute, or a new attribute may be created for the purpose of storing static atlas data.
A “v3c-parameter-set” optional media type parameter is defined:
v3c-parameter-set=<value>
“v3c-parameter-set” provides V3C parameter set bytes as defined in ISO/IEC 23090-5. <value> contains an encoded representation of the V3C parameter set bytes, e.g., base16 [RFC 4648] (hexadecimal).
A “v3c-unit-header” optional media type parameter is defined:
v3c-unit-header=<value>
“v3c-unit-header” provides V3C unit header bytes as defined in ISO/IEC 23090-5. <value> contains an encoded representation of the V3C unit header bytes, e.g., base16 [RFC 4648] (hexadecimal). The v3c-unit-header indicates the type of video component carried by a given media. E.g., if the video contains a packed component, the V3C unit header shall contain vuh_unit_type equal to V3C_PVD (value 5).
A “v3c-atlas-data” parameter is defined as:
v3c-atlas-data=<value>
This parameter may be used to convey any atlas data NAL units of the V3C atlas sub bitstream for out-of-band transmission. The value of the parameter is a comma-separated (',') list of encoded representations of the atlas NAL units as specified in Section 8.4.5.2 of ISO/IEC 23090-5. The encoded representation format can for example be base64 [RFC4648].
A subset of NAL units from section 8.4.5.2 of ISO/IEC 23090-5 may be exposed by a more specific parameter.
A “v3c-common-atlas-data” parameter is defined: v3c-common-atlas-data=<value>
This parameter may be used to convey common atlas data NAL units of the V3C common atlas sub-bitstream for out-of-band transmission. The value of the parameter is a comma-separated list of encoded representations of the common atlas NAL units (i.e., NAL_CASPS and NAL_CAF_IDR) as specified in Section 8.4.5.2 of ISO/IEC 23090-5. The encoded representation format can for example be base64 [RFC4648].
A “v3c-sei” parameter is defined as: v3c-sei=<value>
This parameter may be used to convey SEI NAL units of the V3C common atlas sub-bitstream for out-of-band transmission. The value of the parameter is a comma-separated list of encoded representations of the SEI NAL units (i.e., NAL_PREFIX_NSEI, NAL_SUFFIX_NSEI, NAL_PREFIX_ESEI and NAL_SUFFIX_ESEI) as specified in Section 8.4.5.2 of ISO/IEC 23090-5. The encoded representation format can for example be base64 [RFC4648].
Alternative encoding schemes for all representations may be provided for the <value> such as ASCII, decimal or base64 encoded strings.
The optional media type parameters can be defined for already existing media types such as H264, H265, MP4 etc.
Example 1 - MIME media type
In one example, for V3C packed video encoded with H.265 and all atlas data provided out of band, the mimeType would be as follows:
'video/H265;v3c-parameter-set=AF6F0093992;v3c-atlas-data=ABC,5D5,68'
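A receiver has to separate the media type from its optional parameters before it can use them. The following Python is a minimal sketch of that step. The function name parse_media_type is hypothetical, and the parser is deliberately simple: it assumes that no parameter value contains a semicolon, so a quoted value with an embedded ';' would need a fuller parser.

```python
from typing import Dict, Tuple


def parse_media_type(media_type: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype;name=value;...' into (type/subtype, parameters)."""
    parts = [p.strip() for p in media_type.split(";")]
    essence = parts[0].lower()
    params: Dict[str, str] = {}
    for part in parts[1:]:
        if not part:
            continue
        name, _, value = part.partition("=")
        # Strip optional double quotes around the value.
        params[name.strip().lower()] = value.strip().strip('"')
    return essence, params


if __name__ == "__main__":
    essence, params = parse_media_type(
        "video/H265;v3c-parameter-set=AF6F0093992;v3c-atlas-data=ABC,5D5,68"
    )
    print(essence)
    print(params["v3c-atlas-data"].split(","))
```

The values in the example are kept as opaque strings here, since they are illustrative placeholders rather than decodable payloads.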
Example 2 - MIME media type
In another example, a packed video component could be encapsulated in an mp4 file to provide timing information, while all V3C related information could be provided out of band:
'video/mp4;codecs="avc1.4d002a";v3c-unit-header=0F;v3c-parameter-set=AF6F0093992;v3c-atlas-data=ABC,5D5,68'
Example 3 - MIME media type parameters mapped to SDP
In another example, the media type parameters can be mapped to a description protocol, for example SDP. The MIME media type from example 1
'video/H265;v3c-parameter-set=AF6F0093992;v3c-atlas-data=ABC,5D5,68'
can be mapped to SDP when packed video is distributed over an RTP stream as follows:
v=0
o=svcsrv 289083124 289083124 IN IP4 host.example.com
s=V3C OUT OF BAND SIGNALLING
t=0 0
c=IN IP4 192.0.2.1/127
m=video 40000 RTP/AVP 96
a=rtpmap:96 H265/90000
a=fmtp:96 v3c-unit-header=0F;v3c-parameter-set=AF6F0093992;v3c-atlas-data=ABC,5D5,68
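The mapping from media type parameters to an SDP media description can be sketched as below. This is an illustration under stated assumptions rather than a complete SDP generator: the function names fmtp_line and media_section are hypothetical, only a video media section with a single payload type is produced, and the parameter values are the placeholders used in the example above.

```python
from typing import Dict


def fmtp_line(payload_type: int, params: Dict[str, str]) -> str:
    """Serialize media type parameters into an SDP a=fmtp attribute."""
    body = ";".join(f"{name}={value}" for name, value in params.items())
    return f"a=fmtp:{payload_type} {body}"


def media_section(port: int, payload_type: int, encoding: str,
                  clock_rate: int, params: Dict[str, str]) -> str:
    """Build the m=, a=rtpmap and a=fmtp lines for one RTP video stream."""
    return "\r\n".join([
        f"m=video {port} RTP/AVP {payload_type}",
        f"a=rtpmap:{payload_type} {encoding}/{clock_rate}",
        fmtp_line(payload_type, params),
    ])


if __name__ == "__main__":
    v3c_params = {
        "v3c-unit-header": "0F",
        "v3c-parameter-set": "AF6F0093992",
        "v3c-atlas-data": "ABC,5D5,68",
    }
    print(media_section(40000, 96, "H265", 90000, v3c_params))
```

Keeping the payload type as a single argument means the a=rtpmap and a=fmtp lines cannot drift apart from the m= line.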
Example 4 - MIME media type parameters used in MPD
In another example, a packed video component can be encapsulated in mp4 and split into segments to allow DASH based delivery while all V3C related information can be provided out of band.
<?xml version="1.0"?>
<MPD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xmlns="urn:mpeg:dash:schema:mpd:2011"
     xsi:schemaLocation="urn:mpeg:dash:schema:mpd:2011 DASH-MPD.xsd"
     type="static"
     mediaPresentationDuration="PT3256S"
     minBufferTime="PT10.00S"
     profiles="urn:mpeg:dash:profile:isoff-main:2011">
  <BaseURL>http://www.example.com/</BaseURL>
  <!-- In this Period there are 3 views, coming from three lined up cameras: C1-C2-C3.
       C1+C2 and C2+C3 each form a stereo pair but C1+C3 does not.
       C2 is taken as the base view for MVC while C1 and C3 are enhancement views -->
  <Period start="PT0.00S" duration="PT2000.00S">
    <SegmentList>
      <Initialization sourceURL="seg-m-init.mp4"/>
    </SegmentList>
    <AdaptationSet mimeType="video/mp4;v3c-unit-header=0F;v3c-parameter-set=AF6F0093992;v3c-atlas-data=ABC5D568" codecs="avc1.4d002a">
      <Representation id="C2" bandwidth="128000">
        <SegmentList duration="10">
          <SegmentURL media="seg-m1-C2view-1.mp4"/>
          <SegmentURL media="seg-m1-C2view-2.mp4"/>
          <SegmentURL media="seg-m1-C2view-3.mp4"/>
        </SegmentList>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>
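On the client side, the V3C parameters carried in the mimeType attribute of an AdaptationSet could be read as in the following Python sketch. It uses the standard xml.etree.ElementTree module and a shortened, hypothetical MPD that keeps only the elements needed for the illustration. The function name v3c_params_from_mpd is an assumption, as is the convention of selecting parameters by the "v3c-" prefix.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Shortened MPD for illustration; values are placeholders.
MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period>
    <AdaptationSet
        mimeType="video/mp4;v3c-unit-header=0F;v3c-atlas-data=ABC5D568"
        codecs="avc1.4d002a"/>
  </Period>
</MPD>"""

NS = {"dash": "urn:mpeg:dash:schema:mpd:2011"}


def v3c_params_from_mpd(xml_text: str) -> List[Dict[str, str]]:
    """Return the v3c-* mimeType parameters of every AdaptationSet."""
    root = ET.fromstring(xml_text)
    result = []
    for aset in root.findall(".//dash:AdaptationSet", NS):
        fields = [f.strip() for f in aset.get("mimeType", "").split(";")]
        params = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        result.append(
            {k: v for k, v in params.items() if k.startswith("v3c-")}
        )
    return result


if __name__ == "__main__":
    print(v3c_params_from_mpd(MPD))
```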
In an embodiment, in all the example usages of v3c-unit-data or an equivalent information placeholder, a delimiter codeword may be present to indicate the separation of codewords. An example may be as follows:
v3c-unit-header=0F;v3c-parameter-set=AF6F0093992;separator-codeword=FF;v3c-atlas-data=ABCFF5D5FF68FF
Here, FF is used as a separator codeword. Such a codeword may also be signaled as part of the mimeType, adaptation set parameters or attributes.
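Splitting a concatenated value at a separator codeword can be sketched in a few lines of Python. The function name split_by_codeword is hypothetical. The sketch assumes that the codeword never occurs inside a payload; the description does not specify any escaping mechanism, so a payload that happens to contain the codeword would be split incorrectly.

```python
from typing import List


def split_by_codeword(value: str, codeword: str) -> List[str]:
    """Split a concatenated value at each occurrence of the separator."""
    if not codeword:
        raise ValueError("separator codeword must not be empty")
    # Drop empty chunks, e.g. the one after a trailing separator.
    return [chunk for chunk in value.split(codeword) if chunk]


if __name__ == "__main__":
    print(split_by_codeword("ABCFF5D5FF68FF", "FF"))
```

Applied to the example value with FF as the codeword, this yields the three entries ABC, 5D5 and 68.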
When V3C video components are streamed as separate RTP streams, but atlas data remains static, static signalling of the atlas data can be enabled without establishing a dedicated RTP session for it.
The static atlas data can consist of v3c-atlas-data, v3c-common-atlas-data and/or v3c-sei information as defined for the new optional media type parameters:
v3c-atlas-data=<value>
v3c-common-atlas-data=<value>
v3c-sei=<value>
This information may be signaled as new session level attributes:
a=v3c-atlas-data <value>
a=v3c-common-atlas-data <value>
a=v3c-sei <value>
or as parameters of a new session level attribute:
a=v3c-static-data <v3c specific static atlas parameters>
or as new parameters of a session level v3c grouping attribute:
a=group:v3c <tokens> <v3c specific static atlas parameters>
An example SDP for the last configuration is provided below.
v=0
o=svcsrv 289083124 289083124 IN IP4 host.example.com
s=V3C SIGNALING
t=0 0
c=IN IP4 192.0.2.1/127
a=group:V3C 1 2 3 v3c-parameter-set=AF6F00939921878;
a=v3c-atlas-data ABC,5D5,68
m=video 40000 RTP/AVP 96
a=rtpmap:96 H264/90000
a=fmtp:96 v3c-unit-header=10000000; // occupancy
a=mid:1
m=video 40002 RTP/AVP 97
a=rtpmap:97 H264/90000
a=fmtp:97 v3c-unit-header=18000000; // geometry
a=mid:2
m=video 40004 RTP/AVP 98
a=rtpmap:98 H264/90000
a=fmtp:98 v3c-unit-header=14180000; // attribute texture
a=mid:3
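When each V3C video component is carried in its own media section, a receiver needs to associate every a=mid value with the v3c-unit-header of that section. The Python below is a rough sketch of that association. It works on a simplified, hypothetical SDP (two media sections, no session level atlas parameters, no trailing annotations), and the function name unit_headers_by_mid is an assumption.

```python
from typing import Dict, Optional

# Simplified SDP for illustration; values are placeholders.
SDP = """v=0
o=svcsrv 289083124 289083124 IN IP4 host.example.com
s=V3C SIGNALING
t=0 0
c=IN IP4 192.0.2.1/127
a=group:V3C 1 2
m=video 40000 RTP/AVP 96
a=rtpmap:96 H264/90000
a=fmtp:96 v3c-unit-header=10000000
a=mid:1
m=video 40002 RTP/AVP 97
a=rtpmap:97 H264/90000
a=fmtp:97 v3c-unit-header=18000000
a=mid:2
"""


def unit_headers_by_mid(sdp: str) -> Dict[str, str]:
    """Map each media section's a=mid value to its v3c-unit-header."""
    headers: Dict[str, str] = {}
    current: Optional[Dict[str, str]] = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            current = {}  # a new media section starts here
        elif current is not None and line.startswith("a=mid:"):
            current["mid"] = line[len("a=mid:"):].strip()
        elif current is not None and line.startswith("a=fmtp:"):
            _, _, params = line.partition(" ")
            for item in params.split(";"):
                name, _, value = item.strip().partition("=")
                if name == "v3c-unit-header":
                    current["header"] = value
        if current and "mid" in current and "header" in current:
            headers[current["mid"]] = current["header"]
    return headers


if __name__ == "__main__":
    print(unit_headers_by_mid(SDP))
```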
The method for encoding according to an embodiment is shown in Figure 8. The method generally comprises receiving 805 volumetric video data from a capture system; creating 810 a V3C bitstream comprising atlas information; encoding 815 video components of the volumetric video data into one or more video sub-bitstreams; creating 820 static atlas information and storing it into a static V3C atlas sub-bitstream; sending 825 encoded video sub-bitstreams over a network via a real-time delivery protocol; and sending 830 the static V3C atlas sub-bitstream through out-of-band information. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for receiving volumetric video data from a capture system; means for creating a V3C bitstream comprising atlas information; means for encoding video components of the volumetric video data into one or more video sub-bitstreams; means for creating static atlas information and storing it into a static V3C atlas sub-bitstream; means for sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and means for sending the static V3C atlas sub-bitstream through out-of-band information. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 8 according to various embodiments.
The method for decoding according to an embodiment is shown in Figure 9. The method generally comprises receiving 910 encoded video sub-bitstreams via a real-time delivery protocol; receiving 920 the out-of-band information on static atlas information; and rendering 930 the V3C content based on the received data. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for receiving encoded video sub-bitstreams via a real-time delivery protocol; means for receiving the out-of-band information on static atlas information; and means for rendering the V3C content based on the received data. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 9 according to various embodiments.
An example of an apparatus is disclosed with reference to Figure 10. Figure 10 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. 
The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage. The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments. If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims

1. An apparatus, comprising: means for receiving volumetric video data from a media capture system; means for creating a V3C bitstream comprising atlas information; means for encoding video components of the volumetric video data into one or more video sub-bitstreams; means for creating static atlas information and storing it into a static V3C atlas sub-bitstream; means for sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and means for sending static V3C atlas sub-bitstream through out-of-band information.
2. The apparatus according to claim 1, wherein video components are encoded as a packed video sub-bitstream or separate video sub-bitstreams.
3. The apparatus according to claim 1 or 2, wherein the received volumetric video data comprises at least one of the depth or color texture; and camera or sensor related metadata information.
4. The apparatus according to any of the claims 1 to 3, further comprising means for using pre-defined fixed patch configuration according to which patches of the atlas information are placed.
5. An apparatus for decoding, comprising means for receiving encoded video sub-bitstreams over a network via a real-time delivery protocol; means for receiving the out-of-band information on static atlas information; and means for rendering the V3C content based on the received data.
6. A method for encoding, comprising: receiving volumetric video data from a media capture system; creating a V3C bitstream comprising atlas information; encoding video components of the volumetric video data into one or more video sub-bitstreams; creating static atlas information and storing it into a static V3C atlas sub-bitstream; sending encoded video sub-bitstreams over a network via a real-time delivery protocol; and sending static V3C atlas sub-bitstream through out-of-band information.
7. The method according to claim 6, wherein video components are encoded as a packed video sub-bitstream or separate video sub-bitstreams.
8. The method according to claim 6 or 7, wherein the received volumetric video data comprises at least one of the depth or color texture; and camera or sensor related metadata information.
9. The method according to any of the claims 6 to 8, further comprising using pre-defined fixed patch configuration according to which patches of the atlas information are placed.
10. A method for decoding, comprising receiving encoded video sub-bitstreams over a network via a real-time delivery protocol; receiving the out-of-band information on static atlas information; and rendering the V3C content based on the received data.
11. An apparatus for encoding comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive volumetric video data from a media capture system; create a V3C bitstream comprising atlas information; encode video components of the volumetric video data into one or more video sub-bitstreams; create static atlas information and store it into a static V3C atlas sub-bitstream; send encoded video sub-bitstreams over a network via a real-time delivery protocol; and send static V3C atlas sub-bitstream through out-of-band information.
12. The apparatus according to claim 11, wherein video components are encoded as a packed video sub-bitstream or separate video sub-bitstreams.
13. The apparatus according to claim 11 or 12, wherein the received volumetric video data comprises at least one of the depth or color texture; and camera or sensor related metadata information.
14. The apparatus according to any of the claims 11 to 13, further comprising computer program code configured to cause the apparatus to use pre-defined fixed patch configuration according to which patches of the atlas information are placed.
15. An apparatus for decoding comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive encoded video sub-bitstreams over a network via a real-time delivery protocol; receive the out-of-band information on static atlas information; and render the V3C content based on the received data.
16. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive volumetric video data from a media capture system; create a V3C bitstream comprising atlas information; encode video components of the volumetric video data into one or more video sub-bitstreams; create static atlas information and store it into a static V3C atlas sub-bitstream; send encoded video sub-bitstreams over a network via a real-time delivery protocol; and send static V3C atlas sub-bitstream through out-of-band information.
17. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive encoded video sub-bitstreams over a network via a real-time delivery protocol; receive the out-of-band information on static atlas information; and render the V3C content based on the received data.
PCT/FI2022/050763 2021-12-03 2022-11-18 A method, an apparatus and a computer program product for video encoding and video decoding WO2023099809A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20216244 2021-12-03

Publications (1)

Publication Number Publication Date
WO2023099809A1 true WO2023099809A1 (en) 2023-06-08

Family

ID=84362201

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2022/050763 WO2023099809A1 (en) 2021-12-03 2022-11-18 A method, an apparatus and a computer program product for video encoding and video decoding

Country Status (1)

Country Link
WO (1) WO2023099809A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210021664A1 (en) * 2019-07-16 2021-01-21 Apple Inc. Streaming of volumetric point cloud content based on session description protocols and real time protocols
WO2021110940A1 (en) * 2019-12-06 2021-06-10 Koninklijke Kpn N.V. Encoding and decoding views on volumetric image data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BART KROON (PHILIPS) ET AL: "Common atlas SPS", no. m55208, 7 October 2020 (2020-10-07), XP030292725, Retrieved from the Internet <URL:https://dms.mpeg.expert/doc_end_user/documents/132_OnLine/wg11/m55208-v1-CASPS.zip m55208_v1_CASPS.docx> [retrieved on 20201007] *
LUKASZ KONDRAD ET AL: "[MIV] Discussion on new packed video sub-bitstream type", no. m53125, 15 April 2020 (2020-04-15), XP030286060, Retrieved from the Internet <URL:http://phenix.int-evry.fr/mpeg/doc_end_user/documents/130_Alpbach/wg11/m53125-v1-m53125.zip m53125.docx> [retrieved on 20200415] *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22812690

Country of ref document: EP

Kind code of ref document: A1